Leobouloc edited this page Jul 28, 2017 · 7 revisions

What it can do

This API provides different services to describe, clean and merge tabular data.

How it works

Projects

Data is stored and transformed in "projects". There are two types of projects:

  • Normalization: Deals with operations (inference or transforms) that act on a single file. In particular you can:

    • Upload the file to work on
    • Infer missing values: probabilistically detect whether the file contains placeholders that stand for missing values (NR, NA, XXX, ---, etc.)
    • Replace missing values: replace missing value representations by empty strings
    • Infer column types: assign a type label to each column
    • Recode types: normalize column values according to the given types
    • Dedupe: Detect duplicate rows in the file. (NB: requires some labelled data that can be generated in the interface)
  • Link: Aligns a "source" (dirty) file with a "referential" (clean) file. Link projects take two Normalization projects as input (one as "source", one as "referential"):

    • Select project: Select which Normalization project to use as "source" or "referential"
    • (Upload training data)
    • Infer reference restriction: Using the training pairs provided, looks for common words in the reference matches (for example, all reference matches contain "High School")
    • Perform reference restriction: Restrict reference file to rows that contain words returned by Infer reference restriction. This is useful to make the match process faster and more precise
    • Link: Link each line in the "source" file to a line in the "referential"
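
As a rough illustration of the "Replace missing values" step listed under Normalization, the sketch below blanks out cells that match a known set of missing-value markers. It is not the API's implementation: the function name `replace_missing` is hypothetical, and the markers are supplied by hand here rather than inferred probabilistically.

```python
def replace_missing(rows, markers):
    """Return a copy of rows where any cell equal to a marker becomes ''."""
    return [
        ["" if cell.strip() in markers else cell for cell in row]
        for row in rows
    ]


# Markers taken from the examples in the list above; a real run would get
# them from the "Infer missing values" step instead of hard-coding them.
MARKERS = {"NR", "NA", "XXX", "---"}

rows = [["Lycée Voltaire", "NA"], ["---", "75011"]]
cleaned = replace_missing(rows, MARKERS)
```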

Scheduling

Most operations (infer, transform) are scheduled to a worker. To run one of these operations, you therefore have to 1) make a call to schedule the job, then 2) make regular calls to check the job status until it completes.
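
The schedule-then-poll pattern can be sketched as a small helper. This is a minimal sketch under assumptions: `wait_for_job` and its parameters are hypothetical names, `fetch_status` stands for whatever call retrieves the job status, and a not-yet-finished job is assumed to return a falsy or missing `completed` field (the source only shows the completed case).

```python
import time


def wait_for_job(fetch_status, poll_interval=1.0, max_polls=60, sleep=time.sleep):
    """Call fetch_status() until it reports completion, then return the result.

    fetch_status: zero-argument callable returning the status as a dict.
    Raises TimeoutError if the job is still pending after max_polls checks.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("completed"):
            return status.get("result")
        sleep(poll_interval)  # wait before asking the server again
    raise TimeoutError("job did not complete within the polling budget")
```

Passing the status call in as a function keeps the loop independent of the HTTP client in use.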

Examples

Perform type inference on a file

Create new project

 <> URL
 /api/new/normalize

 <> POST REQUEST (json):
 {'description': 'Test project for type inference on a file',
 'internal': False}

 <> RESPONSE:
 {'error': False, 'project_id': '243e50763b5532d7e6bb1738da91a9db'}
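
A minimal sketch of this call using only the Python standard library. The base URL `http://localhost:5000` and the helper names `new_normalize_request` and `send` are assumptions made for illustration; the path and the JSON fields come from the example above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: replace with the real server address


def new_normalize_request(description, internal=False, base_url=BASE_URL):
    """Build the POST request that creates a normalization project."""
    body = json.dumps({"description": description, "internal": internal}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/new/normalize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs a running server, so it is left commented out):
# project_id = send(new_normalize_request("Test project"))["project_id"]
```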

Upload CSV file to project

 <> URL
 /api/normalize/upload/243e50763b5532d7e6bb1738da91a9db

 <> POST REQUEST (files):
 {'file': OPEN FILE}

 <> RESPONSE:
 {'project_id': '243e50763b5532d7e6bb1738da91a9db',
 'run_info': {'encoding': 'utf-8',
              'file_name': 'source.csv',
              'file_type': 'csv',
              'ncols': 9,
              'nrows': 1414,
              'og_file_name': 'source.csv',
              'sep': ','}}
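
The upload is a multipart form post with a single `file` field. The sketch below builds such a request by hand with the standard library; `upload_request`, the default base URL, and the `text/csv` part type are assumptions, not details confirmed by the source. With the third-party `requests` library the same upload is usually written as `requests.post(url, files={'file': fh})`.

```python
import urllib.request
import uuid


def upload_request(project_id, file_name, file_bytes, base_url="http://localhost:5000"):
    """Build a multipart/form-data POST carrying one part named 'file'."""
    boundary = uuid.uuid4().hex  # random separator between form parts
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: text/csv\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/normalize/upload/{project_id}",
        data=head + file_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Usage (needs a running server and a local file, so it is left commented out):
# with open("source.csv", "rb") as fh:
#     request = upload_request(project_id, "source.csv", fh.read())
```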

Schedule type inference job

 <> URL
 /api/schedule/infer_types/24e3a6fbcbb403e6d3d3b6e473e31794/

 <> POST REQUEST (json):
 {'data_params': {'file_name': 'MINI__ref_copy.csv',
                 'module_name': 'replace_mvs'}}

 <> RESPONSE:
 {'job_id': '24e3a6fbcbb403e6d3d3b6e473e31794_infer_types',
 'job_result_api_url': '/queue/result/24e3a6fbcbb403e6d3d3b6e473e31794_infer_types'}
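
A hedged sketch of the scheduling call. The helper names, the default base URL, and the generalization of the path to `/api/schedule/<job_name>/<project_id>/` are assumptions; only the `infer_types` form appears in the example. Note that the example above uses a different project id than the earlier steps; in practice the id in the URL would presumably be the one returned when the project was created.

```python
import json
import urllib.request


def schedule_request(job_name, project_id, data_params, base_url="http://localhost:5000"):
    """Build the POST request that schedules `job_name` for a project."""
    body = json.dumps({"data_params": data_params}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/schedule/{job_name}/{project_id}/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def result_url(schedule_response, base_url="http://localhost:5000"):
    """Turn the 'job_result_api_url' field of the response into a full URL."""
    return base_url + schedule_response["job_result_api_url"]
```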

Check job status

 <> URL
 /queue/result/21fc428ae20ca4156649ee6777096bec_infer_types

 <> RESPONSE AFTER JOB COMPLETION:
{'completed': True,
 'result': {'all_types': {'Adresse': ['Pays', 'Voie', 'Code Postal', 'Commune'],
                          'Article': ['ID publication',
                                      "Contenu d'article",
                                      'Résumé'],
                          'Autres types': ['Publication',
                                           'Education Nationale',
                                           ...}
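
Once the job has completed, the `all_types` field can be flattened for lookups. The sketch below assumes that each key of `all_types` is a group name and each value is the list of type labels in that group, which is how the truncated response above reads; `type_to_group` is a hypothetical helper, and other fields of `result` are not shown in the source.

```python
def type_to_group(response):
    """Map each type label to the group it is listed under in 'all_types'."""
    if not response.get("completed"):
        raise ValueError("job has not completed yet")
    all_types = response["result"]["all_types"]
    return {label: group for group, labels in all_types.items() for label in labels}


# Sample built from the first entries of the response shown above.
sample = {
    "completed": True,
    "result": {"all_types": {"Adresse": ["Pays", "Voie"], "Article": ["Résumé"]}},
}
lookup = type_to_group(sample)
```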