4. Running the pipeline
This page is intended for developers who want to contribute to the pipeline. All code is written in Python 3.
The project uses the concept of Data Packages and their derivatives: the Tabular Data Package and the Fiscal Data Package. Each regional or national pipeline boils down to:
- Extract: the data source is downloaded or scraped, then validated against the source-specific Tabular Data Package schema
- Transform: the data is cleaned and reshaped, then validated against the global Fiscal Data Package schema
- Load: the Fiscal Data Package is uploaded to OpenSpending using the client library
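The three steps above can be sketched in plain Python. This is a minimal illustration of the extract-validate-transform-validate flow, not the project's actual code: the function names, the toy schema, and the inline validation are all assumptions made for the example (the real pipelines validate against full Data Package schemas).

```python
# Illustrative sketch of the ETL flow described above.
# All names and the toy schema are hypothetical, not the project's API.

SOURCE_SCHEMA = {"fields": ["beneficiary", "amount"]}


def validate(rows, schema):
    """Check that every row has exactly the fields the schema declares."""
    expected = set(schema["fields"])
    for row in rows:
        if set(row) != expected:
            raise ValueError(f"row {row} does not match fields {expected}")
    return rows


def extract():
    # In a real pipeline this step downloads or scrapes the source data.
    return [{"beneficiary": "ACME", "amount": "100"}]


def transform(rows):
    # Clean and reshape the data, e.g. cast amounts to numbers.
    return [{**row, "amount": float(row["amount"])} for row in rows]


rows = validate(extract(), SOURCE_SCHEMA)        # Extract, then validate
rows = validate(transform(rows), SOURCE_SCHEMA)  # Transform, then validate
# Load: upload the resulting Fiscal Data Package (omitted in this sketch)
```

In the real project, each validation step is driven by the schemas in the Data Packages themselves rather than a hand-written check like this.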
All pipelines are orchestrated by the datapackage-pipelines framework. We recommend that you take a look at the instructions in its README.
Make sure that python3 is installed and clone the repository:
$ git clone https://github.com/os-data/eu-structural-funds.git
Create a virtual environment and activate it, then cd into the repository and install the dependencies:
$ virtualenv -p /usr/bin/python3.5 venv
$ source venv/bin/activate
$ cd eu-structural-funds
$ pip install -r requirements.txt
Then add the repository to your Python path:
$ export PYTHONPATH=$PYTHONPATH:`pwd`
To show which pipelines are currently available:
$ dpp
This should give you a list like:
Available Pipelines:
- ./data/AT.austria/AT11.burgenland/AT11.burgenland
- ./data/AT.austria/AT32.salzburg/AT32.salzburg
- ./data/AT.austria/AT21.kaernten/AT21.kaernten
- ./data/AT.austria/AT31.oberoesterreich/AT31.oberoesterreich
- ./data/AT.austria/national/national
- ./data/AT.austria/AT22.steiermark/AT22.steiermark
- ./data/AT.austria/AT12.niederoesterreich/AT12.niederoesterreich
- ./data/FR.france/2014-2020/2014-2020
- ./data/HR.croatia/HR.croatia
- ./data/EL.greece/2007-2013/2007-2013
- ./data/EL.greece/2014-2020/2014-2020
- ./data/DK.denmark/DK.denmark
- ./data/MT.malta/MT.malta
- ./data/EE.estonia/EE.estonia
To register a new national or regional pipeline, you first need to copy source.description.yaml into the relevant source folder and fill it in. Then you must bootstrap the pipeline, like so:
$ python3 -m common.bootstrap FR.france/2014-2020
This generates a pipeline-status.json file (where feedback messages go) and a minimal pipeline-spec.yaml file in the source folder that looks like:
2014-2020:
  pipeline:
    - parameters:
        save_datapackage: false
      run: processors.read_description
  schedule:
    crontab: '0 0 1 1 *'
You can now run the pipeline:
$ dpp ./data/FR.france/2014-2020/2014-2020
This will not do much apart from converting your source.description.yaml into a tabular data package and making sure it is valid. If it's not, take a look inside pipeline-status.json to get some feedback.
To add more processors to the pipeline, simply append them to the pipeline-spec.yaml file. You can reuse library-wide processors or project-wide processors, or write your own. Region- or country-specific processors must be saved in the data source folder. If you need inspiration, take a look at the pipeline for France 2014-2020.