The challenges are here.
All data comes from the Titanic disaster (does it remind you of Kaggle?)
The challenges are:
- Extract all persons from one page
- Extract all persons from multiple pages (use pagination)
- Bypass the user-agent check
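For the last challenge, one common approach is to send a browser-like User-Agent header instead of Scrapy's default one, which identifies the client as Scrapy. The fragment below is a minimal sketch, assuming the project has the usual myscraper/settings.py file (not confirmed here) and that the site only inspects this header. The browser string is an arbitrary example, not a value required by the workshop.

```python
# Hypothetical addition to myscraper/settings.py (file location is an assumption).
# USER_AGENT is a standard Scrapy setting; Scrapy sends it with every request
# unless a request sets its own User-Agent header.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0 Safari/537.36"  # example browser string, pick any realistic one
)
```

The same value can also be set for a single spider through its custom_settings attribute, if changing the whole project is not wanted.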
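This workshop was written when Scrapy worked only with Python 2.7 (newer Scrapy releases also run on Python 3).
To follow it as written, please install Python 2.7, and not Python 3.x!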
git clone https://github.com/fabienvauchelles/scraping-challenge-workshop.git
cd scraping-challenge-workshop
pip install -r requirements.txt
The scraper code is in the file myscraper/spiders/myscraper.py.
Items are defined in the file myscraper/items.py.
cd scraping-challenge-workshop
scrapy crawl myscraper -t jsonlines -o persons.json
Exported items are written to the file persons.json.
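Because the crawl command uses the jsonlines format, persons.json holds one JSON object per line rather than a single JSON array. To check the result, the file can be read line by line. This is a minimal sketch using only the standard library. The helper name load_persons is made up for illustration and is not part of the workshop code.

```python
import io
import json


def load_persons(path):
    """Return the items of a JSON Lines file as a list of dicts."""
    persons = []
    # io.open behaves the same way on Python 2.7 and Python 3.
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # ignore blank lines
                persons.append(json.loads(line))
    return persons


if __name__ == "__main__":
    # Assumes the crawl above has already produced persons.json.
    print(len(load_persons("persons.json")))
```

Calling json.load on the whole file would fail as soon as it contains more than one item, which is why each line is parsed separately.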
See the Licence.