
4. Running the pipeline

__loic__ edited this page Oct 14, 2016 · 1 revision

This page is intended for developers who want to contribute to the pipeline. All code is written in Python 3.

Overview

The project uses the Data Package concept and its derivatives: the Tabular Data Package and the Fiscal Data Package. Each regional or national pipeline boils down to:

  1. Extract: the data source is downloaded or scraped, then validated against the source-specific Tabular Data Package schema.
  2. Transform: the data is cleaned and reshaped, then validated against the global Fiscal Data Package schema.
  3. Load: the Fiscal Data Package is uploaded to OpenSpending using the client library.

All pipelines are orchestrated by the datapackage-pipelines framework. We recommend that you take a look at the instructions in its README.
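Concretely, the three stages above become steps in a pipeline-spec.yaml file. A rough sketch of that shape (processors.read_description is a real processor from this project; the other two step names are purely illustrative):

```yaml
2014-2020:
  pipeline:
    - run: processors.read_description   # Extract: validate against the source schema
    - run: processors.reshape_data       # Transform: clean and reshape (hypothetical name)
    - run: processors.upload_package     # Load: push to OpenSpending (hypothetical name)
```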

Installation

Make sure that python3 is installed and clone the repository:

$ git clone https://github.com/os-data/eu-structural-funds.git

cd into the repository, create a virtual environment, and install the dependencies:

$ cd eu-structural-funds
$ virtualenv -p /usr/bin/python3.5 venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Then add the repository to your Python path:

$ export PYTHONPATH=$PYTHONPATH:`pwd`

List all pipelines

To show which pipelines are currently available:

$ dpp

This should give you a list like:

Available Pipelines:
- ./data/AT.austria/AT11.burgenland/AT11.burgenland 
- ./data/AT.austria/AT32.salzburg/AT32.salzburg 
- ./data/AT.austria/AT21.kaernten/AT21.kaernten 
- ./data/AT.austria/AT31.oberoesterreich/AT31.oberoesterreich 
- ./data/AT.austria/national/national 
- ./data/AT.austria/AT22.steiermark/AT22.steiermark 
- ./data/AT.austria/AT12.niederoesterreich/AT12.niederoesterreich 
- ./data/FR.france/2014-2020/2014-2020 
- ./data/HR.croatia/HR.croatia 
- ./data/EL.greece/2007-2013/2007-2013 
- ./data/EL.greece/2014-2020/2014-2020 
- ./data/DK.denmark/DK.denmark 
- ./data/MT.malta/MT.malta 
- ./data/EE.estonia/EE.estonia 

Add a pipeline

To register a new national or regional pipeline, you first need to copy source.description.yaml into the relevant source folder and fill it in. Then you must bootstrap the pipeline, like so:

$ python3 -m common.bootstrap FR.france/2014-2020

This generates a pipeline-status.json file (where feedback messages go) and a minimal pipeline-spec.yaml file in the source folder, which looks like:

2014-2020:
  pipeline:
    - run: processors.read_description
      parameters:
        save_datapackage: false
  schedule:
    crontab: '0 0 1 1 *'

You can now run the pipeline:

$ dpp ./data/FR.france/2014-2020/2014-2020

This will not do much apart from converting your source.description.yaml into a Tabular Data Package and checking that it is valid. If it is not, take a look inside pipeline-status.json to get some feedback.
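Since pipeline-status.json is plain JSON, you can inspect it from Python as well as by eye. The snippet below pretty-prints a status payload; the keys used here ("state", "reason") are a hypothetical example, as the real structure depends on the framework version:

```python
import json

# Hypothetical pipeline-status.json payload; the real file's structure
# may differ depending on the datapackage-pipelines version.
sample = '{"state": "FAILED", "reason": "schema validation error on field \'amount\'"}'

status = json.loads(sample)
for key, value in status.items():
    print(f"{key}: {value}")
```

In practice you would replace `sample` with `open('pipeline-status.json').read()` from the source folder.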

Add processors

To add more processors to the pipeline, simply append them to the pipeline-spec.yaml file. You can reuse library-wide processors or project-wide processors, or write your own. Region- or country-specific processors must be saved in the data source folder. If you need inspiration, take a look at the pipeline for France 2014-2020.
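At its core, a custom processor is a row-level transform; in datapackage-pipelines the framework's wrapper feeds rows to your code and collects what you yield back. Leaving the framework plumbing aside, the heart of a region-specific cleaning processor might look like this sketch (the field names are illustrative, not from the project's schemas):

```python
def clean_row(row):
    """Strip whitespace from string values and normalise empty strings to None."""
    cleaned = {}
    for key, value in row.items():
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[key] = value
    return cleaned

# Example usage with a made-up row:
rows = [{"beneficiary": "  ACME S.A. ", "amount": 1000}]
print([clean_row(r) for r in rows])
```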
