PyVideo Scrape fetches metadata for Python conference videos from YouTube playlists and writes it to a branch of the PyVideo data repository.
Why another PyVideo scraper? It started as my first attempt to get data from YouTube playlists, and it worked for me. I then cleaned it up a bit and uploaded it to a repository.
mkdir ~/git # Example directory
cd ~/git
# Get the repos (better if you have a fork of them)
git clone "git@github.com:Daniel-at-github/pyvideo_scrape.git"
git clone "git@github.com:pyvideo/data.git" pyvideo_data
cd ~/git/pyvideo_scrape
$EDITOR events.yml # Add the conferences to scrape (see format below)
pipenv shell
pipenv update youtube-dl # Worth running from time to time to keep youtube-dl current
./pyvideo_scrape.py
events.yml format
- title: PyCon AU 2019
  dir: pycon-au-2019
  youtube_list:
    - https://www.youtube.com/playlist?list=PLs4CJRBY5F1LKqauI3V4E_xflt6Gow611
  related_urls:
    - label: Conference schedule
      url: https://2019.pycon-au.org/schedule
  language: eng
  dates:
    begin: 2019-08-02
    end: 2019-08-06
    default: 2019-08-02
  issue: 843
  tags:
  minimal_download: false
  overwrite:
    # all: true # takes precedence over add_new_files and existing_files_fields
    add_new_files: true
    existing_files_fields:
      - copyright_text
      - duration
      - thumbnail_url
      - videos
      - description
      - language
      # - recorded
      - related_urls
      # - speakers
      # - tags
      # - title
Field | Description |
---|---|
title | Title of the event |
dir | Directory name of the event |
youtube_list | List of YouTube URLs (videos and/or playlists) |
related_urls | List of URLs common to all videos in the event |
language | ISO 639-3 language code of the videos |
dates | Three ISO 8601 dates describing when the videos were recorded (YYYY-MM-DD[Thh:mm[+hh:mm]]) |
dates.begin | Start date of the event |
dates.end | End date of the event |
dates.default | Default date to use when a video does not have a date between begin and end |
issue | GitHub issue resolved by scraping these videos |
minimal_download | Download only the fields that don't need human intervention; intended for a first download that exposes the minimal data |
tags | Tags common to all videos in the event |
overwrite | Section needed to add new content to an existing event |
overwrite.all | Removes the event content and downloads the metadata of the videos currently present (takes precedence over add_new_files and existing_files_fields) |
overwrite.add_new_files | Downloads metadata for new videos (compatible with existing_files_fields) |
overwrite.existing_files_fields | Updates the selected fields of existing videos (compatible with add_new_files) |
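
The role of dates.default can be illustrated with a small Python sketch. It only shows the fallback rule described in the table; the function name pick_recorded_date is hypothetical and is not taken from pyvideo_scrape, and the sketch handles plain dates only, not the optional time part.

```python
from datetime import date


def pick_recorded_date(video_date, begin, end, default):
    """Return video_date if it lies within [begin, end], else the default.

    Hypothetical helper: the scraper's real logic may differ.
    """
    if video_date is not None and begin <= video_date <= end:
        return video_date
    return default


# Values mirror the PyCon AU 2019 example entry.
dates = {
    "begin": date.fromisoformat("2019-08-02"),
    "end": date.fromisoformat("2019-08-06"),
    "default": date.fromisoformat("2019-08-02"),
}

in_range = pick_recorded_date(date(2019, 8, 4), **dates)
out_of_range = pick_recorded_date(date(2019, 9, 15), **dates)
print(in_range, out_of_range)  # 2019-08-04 2019-08-02
```

A video dated inside the event keeps its own date, while one dated weeks later falls back to dates.default.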
The files events_minimal_download.yml and events_done.yml are maintained manually to ease future data reloads (especially the minimal_download file).
Use minimal_download: true, download, and open a pull request. If no further changes are added, no review is needed, so the data is easier to publish, and you can overwrite it later.
Reuse the "minimal_download" conference configuration and add:
  # After conference data
  minimal_download: true
  overwrite:
    all: true
The old content will be erased (only the automated work) and created again with the current content.
Use minimal_download: false
Download using:
  # After conference data
  minimal_download: false
  overwrite:
    all: true
Download using:
  # After conference data
  minimal_download: false
  overwrite:
    add_new_files: true
Suppose a conference was downloaded with minimal_download, and the fields speakers, title, recorded and tags were previously reviewed and committed.
You then have to download any new files and update the rest of the fields, using:
  # After conference data
  overwrite:
    add_new_files: true
    existing_files_fields:
      - copyright_text
      - duration
      - thumbnail_url
      - videos
      - description
      - language
      - related_urls
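
How the three overwrite options combine can be summarised in a short Python sketch. This is only an illustration of the documented precedence (all wins over add_new_files and existing_files_fields); the function resolve_overwrite and the keys of the returned plan are hypothetical and do not come from pyvideo_scrape.

```python
def resolve_overwrite(overwrite):
    """Turn an `overwrite` section into a small plan dict (hypothetical helper)."""
    overwrite = overwrite or {}
    if overwrite.get("all"):
        # `all` takes precedence: erase the event and download everything again.
        return {"wipe_event": True, "download_new": True, "update_fields": []}
    return {
        "wipe_event": False,
        # The two remaining options are compatible and can be combined.
        "download_new": bool(overwrite.get("add_new_files")),
        "update_fields": list(overwrite.get("existing_files_fields") or []),
    }


full_reload = resolve_overwrite({"all": True, "existing_files_fields": ["duration"]})
incremental = resolve_overwrite(
    {"add_new_files": True, "existing_files_fields": ["duration", "videos"]}
)
print(full_reload)
print(incremental)
```

In the first call the existing_files_fields list is ignored because all is set; in the second, new videos are added and only duration and videos are refreshed on existing files.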
The files produced by the scraper need some cleaning (you could use pyvideo_lektor for it):
- Fill in the missing speaker value (as a to-do marker the field contains "TODO", so it is easy to grep).
- Clean each field so that it contains only what the field name says. See Data Completeness in the contributing guide.
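
As an alternative to grep, the files that still carry the "TODO" marker can be listed with a few lines of Python. This is a minimal sketch under assumptions: it supposes the scraped files are JSON documents with a speakers field holding a list (or a single string), and the function name files_with_todo_speakers is hypothetical.

```python
import json
from pathlib import Path


def files_with_todo_speakers(event_dir):
    """Return JSON files under event_dir whose `speakers` field mentions TODO."""
    matches = []
    for path in sorted(Path(event_dir).rglob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that are not valid JSON
        if not isinstance(data, dict):
            continue  # skip JSON files that are not objects
        speakers = data.get("speakers") or []
        if isinstance(speakers, str):
            speakers = [speakers]
        if any("TODO" in str(name) for name in speakers):
            matches.append(path)
    return matches
```

Calling it on an event directory inside the pyvideo_data clone (for example the pycon-au-2019 directory from the configuration above) returns the paths that still need a speaker filled in.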