A web scraper for NESA HSC past paper links, built with Scrapy on Python 3.11
This project is freely available under the MIT Licence. Please link back to this repo! :)
This scraper was built to collect the links to all past paper documents for http://hscpastpapers.com
Check `meta.json` to see when `data.json` was last updated and how many items were scraped.
Source of truth: `scripts/types.d.ts`

```ts
export type CourseItem = {
  course_name: string;
  packs: CoursePack[];
};

export type CoursePack = {
  year: string;
  link: string;
  docs: CourseDoc[];
};

export type CourseDoc = {
  doc_name: string;
  doc_link: string;
};
```
JSON Schema
Note: Each `course_item` is collapsed into one line.
```json
{
  "type": "array",
  "items": {
    "object": "course_item",
    "type": "object",
    "properties": {
      "course_name": { "type": "string" },
      "packs": {
        "type": "array",
        "items": {
          "object": "exam_pack_item",
          "type": "object",
          "properties": {
            "docs": {
              "type": "array",
              "items": {
                "object": "doc_item",
                "type": "object",
                "properties": {
                  "doc_name": { "type": "string" },
                  "doc_link": { "type": "string" }
                }
              }
            },
            "link": { "type": "string" },
            "year": { "type": "number" }
          }
        }
      }
    }
  }
}
```
- The first level is an array of `course_item` objects.
- `course_item` is an object for each HSC course. Each object contains:
  - `course_name`, a string containing the course name, and
  - `packs`, an array of `exam_pack_item` objects.
- `exam_pack_item` is an object for each year that documents are available for a course. Each object contains:
  - `docs`, an array of `doc_item` objects,
  - `link`, a string containing the link to the exam pack, and
  - `year`, a number storing the year of the exam pack.
- `doc_item` is an object for each document within each exam pack. Each object contains:
  - `doc_name`, a string containing the name of the document, and
  - `doc_link`, a string containing the link to the PDF document.
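For example, a single course item following the schema above might look like this (values are illustrative only, not real scraped data):

```json
{
  "course_name": "Mathematics Advanced",
  "packs": [
    {
      "year": 2020,
      "link": "https://example.com/mathematics-advanced/2020-exam-pack",
      "docs": [
        {
          "doc_name": "2020 HSC exam paper",
          "doc_link": "https://example.com/mathematics-advanced/2020-exam-paper.pdf"
        },
        {
          "doc_name": "2020 HSC marking guidelines",
          "doc_link": "https://example.com/mathematics-advanced/2020-marking-guidelines.pdf"
        }
      ]
    }
  ]
}
```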
To scrape and then merge the data, run:

```
pipenv run scrapy crawl nesapp
deno run --allow-read --allow-write scripts/merge.ts
```

The first command will write:

- `data_new.json`, the data you just scraped.

The second command will write:

- `data.json`, the data you just scraped merged with the old data;
- `meta.json`, metadata including the scrape time; and
- `data_list.json`, the list of course items, for improved Git diffs.
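As a rough sketch of what the merge step produces, the snippet below mirrors the outputs described above. It is illustrative only, not the actual `scripts/merge.ts`: it assumes the JSON files sit in the repo root, that courses are merged by `course_name`, and the `meta.json` field names are made up.

```ts
// Illustrative sketch only -- the real logic lives in scripts/merge.ts.
// Loose local type; the full shape is defined in scripts/types.d.ts.
type Course = { course_name: string; [key: string]: unknown };

async function readJson(path: string): Promise<Course[]> {
  try {
    return JSON.parse(await Deno.readTextFile(path));
  } catch {
    return []; // e.g. first run, when data.json does not exist yet
  }
}

const oldItems = await readJson("data.json");
const newItems = await readJson("data_new.json");

// Merge by course_name: start from the old data, overwrite any course that was re-scraped.
const merged = new Map<string, Course>();
for (const course of [...oldItems, ...newItems]) merged.set(course.course_name, course);
const items = [...merged.values()];

await Deno.writeTextFile("data.json", JSON.stringify(items));

// Field names here are assumptions, not necessarily the actual meta.json keys.
await Deno.writeTextFile(
  "meta.json",
  JSON.stringify({ last_updated: new Date().toISOString(), item_count: newItems.length }),
);

// data_list.json: one course item per line, so Git diffs stay readable.
await Deno.writeTextFile(
  "data_list.json",
  "[\n" + items.map((c) => JSON.stringify(c)).join(",\n") + "\n]",
);
```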
- Python v3.11
- Pipenv v2023.10.24+
- Deno v1.37.2+
Install instructions

- Download and install Python 3.11.
  - macOS, using Homebrew: `brew install python`
  - Windows: https://www.python.org/downloads/
- Download and install pipenv. Instructions: https://pipenv.pypa.io/en/latest/
- Download and install Deno. Instructions: https://docs.deno.com/runtime/manual
- Clone this repo or download the ZIP using the green button above.
- Open the directory of the cloned or downloaded repo.
- Install Scrapy and the other dependencies using pipenv, making sure it uses Python 3.11 (see the example below): `pipenv install`
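If pipenv does not pick up Python 3.11 on its own, you can point it at the interpreter explicitly. This is an example rather than a required step, and it assumes a Python 3.11 interpreter is already available on your PATH:

```
pipenv install --python 3.11
```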
This version of the scraper will not work on Scrapy Cloud without modifications. You need to switch the item pipeline in `settings.py` (in the `nesappscraper` folder):

- Comment out line 69 by putting a `#` at the start of the line.
- Uncomment line 68 by removing the `#` at the start of the line.
In `pipelines.py` inside the `nesappscraper` folder:

- On line 16, change `data_new.json` to the file name you want.
- On line 18, change `meta.json` to the file name you want.
- The file extension must remain `.json`.
On an M1 Max 16″ MacBook Pro with a ~50 Mbps download connection:
- Runtime: ~1 min
- RAM usage: ~75 MB
- Total bytes sent: ~440 KB
- Total bytes received: ~100 MB
On Scrapy Cloud with 1 unit, it ran for ~55 min.
To check if your data is valid:
- Total request count should be 1661+ to get all papers.
- There should be 1654+ items scraped to get all papers.
- There should be 114 courses.
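As a quick sanity check on the course count, you can count the top-level array entries in `data.json`, for example with jq (assumes jq is installed; it is not part of this repo's tooling):

```
jq 'length' data.json   # should print 114
```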
This crawler should be run frequently during the HSC exam block to get the latest papers. In 2017, papers were usually uploaded two business days after each exam, around noon.
- Scrapy: https://scrapy.org/
- All HSC papers are provided by NESA and owned by the State of New South Wales. They are protected by Crown copyright: http://educationstandards.nsw.edu.au/wps/portal/nesa/mini-footer/copyright
- This scraper does not store or make copies of the documents themselves. It only obtains the links to the official copies on the NESA website. It is intended for information purposes only.