Api
Leobouloc edited this page Jul 28, 2017 · 7 revisions
This API provides different services to describe, clean and merge tabular data.
Data is stored and transformed in "projects". There are two types of projects:
- Normalization: Handles operations (inference or transforms) that act on a single file. In particular, you can:
  - Upload the file to work on
  - Infer missing values: probabilistically detect whether the file contains values that represent missing data (NR, NA, XXX, ---, etc.)
  - Replace missing values: replace missing-value representations with empty strings
  - Infer column types: assign a type label to each column
  - Recode types: normalize column values according to the given types
  - Dedupe: detect duplicate rows in the file (NB: requires some labelled data, which can be generated in the interface)
  - Concatenate with initial file: join the results of the transformations with the original file
  - Download file
- Link: Aligns a "source" (dirty) file with a "referential" (clean) file. Link projects take two Normalization projects as input (one as "source", one as "referential"):
  - Select project: select which Normalization project to use as "source" or "referential"
  - (Upload training data)
  - Infer reference restriction: using the training pairs provided, look for words common to the reference matches (for example, all reference matches contain "High School")
  - Perform reference restriction: restrict the reference file to rows that contain the words returned by Infer reference restriction. This makes the matching process faster and more precise
  - Link: link each line in the "source" file to a line in the "referential"
Most operations (infer, transform) are scheduled to a worker. As a result, to run these operations, you have to 1) make a call to schedule a job, then 2) regularly make a call to check the job status.
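The schedule-then-poll pattern can be sketched in Python as follows. This is only an illustration: `run_job`, `schedule` and `fetch_result` are hypothetical names, the two callables stand in for whatever HTTP client you use, and the shape of the response for an unfinished job is assumed (only the completed response is documented on this page).

```python
import time


def run_job(schedule, fetch_result, poll_interval=1.0, max_polls=60, sleep=time.sleep):
    """Schedule a job, then poll its result URL until it reports completion.

    `schedule` and `fetch_result` are hypothetical callables supplied by the
    caller (for example thin wrappers around an HTTP client):
      - schedule() returns the scheduling response as a dict
      - fetch_result(url) returns the status response as a dict
    """
    scheduled = schedule()
    # The scheduling response tells us where to look for the result.
    result_url = scheduled["job_result_api_url"]
    for _ in range(max_polls):
        status = fetch_result(result_url)
        # Assumption: an unfinished job has a falsy or missing 'completed' key.
        if status.get("completed"):
            return status.get("result")
        sleep(poll_interval)
    raise TimeoutError(f"job {scheduled.get('job_id')} did not complete in time")
```

Passing the HTTP calls in as functions keeps the loop independent of any particular client library, and the `max_polls` limit avoids waiting forever on a job that never finishes.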
<> URL
/api/new/normalize
<> POST REQUEST (json):
{'description': 'Test project for type inference on a file',
'internal': False}
<> RESPONSE:
{'error': False, 'project_id': '243e50763b5532d7e6bb1738da91a9db'}
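As a rough sketch, the request above could be prepared with the Python standard library as shown here. The function name `build_new_project_request` and the `base_url` argument are assumptions (this page does not say where the API is hosted). Note that the sample above is written as a Python dict; once serialized, `False` becomes the JSON literal `false`.

```python
import json
import urllib.request


def build_new_project_request(base_url, description, internal=False):
    """Prepare (but do not send) the POST request that creates a Normalization project.

    `base_url` is a placeholder for wherever the API is hosted.
    """
    payload = {"description": description, "internal": internal}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/new/normalize",
        data=json.dumps(payload).encode("utf-8"),  # Python False is sent as JSON false
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The prepared request can then be sent with `urllib.request.urlopen(request)`; assuming the server answers with JSON, the `project_id` field of the decoded response identifies the new project in later calls.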
<> URL
/api/normalize/upload/243e50763b5532d7e6bb1738da91a9db
<> POST REQUEST (files):
{'file': OPEN FILE}
<> RESPONSE:
{'project_id': '243e50763b5532d7e6bb1738da91a9db',
'run_info': {'encoding': 'utf-8',
'file_name': 'source.csv',
'file_type': 'csv',
'ncols': 9,
'nrows': 1414,
'og_file_name': 'source.csv',
'sep': ','}}
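The upload is a multipart form POST with the file in a field named `file`. A minimal standard-library sketch is given here; `build_upload_request` and `base_url` are hypothetical names, and the `application/octet-stream` part type is an assumption, since this page does not say which content types the server accepts.

```python
import uuid
import urllib.request


def build_upload_request(base_url, project_id, file_name, file_bytes):
    """Prepare (but do not send) a multipart/form-data upload for a project.

    `base_url` is a placeholder for wherever the API is hosted.
    """
    boundary = uuid.uuid4().hex  # random separator between form parts
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/normalize/upload/{project_id}",
        data=head + file_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

In the response, `run_info` reports what was detected for the file (encoding, separator, number of rows and columns), which is worth checking before scheduling any job on it.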
<> URL
/api/schedule/infer_types/24e3a6fbcbb403e6d3d3b6e473e31794/
<> POST REQUEST (json):
{'data_params': {'file_name': 'MINI__ref_copy.csv',
'module_name': 'replace_mvs'}}
<> RESPONSE:
{'job_id': '24e3a6fbcbb403e6d3d3b6e473e31794_infer_types',
'job_result_api_url': '/queue/result/24e3a6fbcbb403e6d3d3b6e473e31794_infer_types'}
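The scheduling call above follows the pattern `/api/schedule/<job name>/<project id>/`, at least in this sample. The sketch below builds the path and payload and reads the polling URL back from the response; the helper names are hypothetical, and the assumption that other jobs use the same URL pattern is not confirmed on this page.

```python
def build_schedule_call(project_id, job_name, file_name, module_name):
    """Return the URL path and JSON payload used to schedule a job.

    Assumes every job follows the /api/schedule/<job_name>/<project_id>/
    pattern seen in the sample above.
    """
    path = f"/api/schedule/{job_name}/{project_id}/"
    payload = {"data_params": {"file_name": file_name, "module_name": module_name}}
    return path, payload


def result_url_from(schedule_response):
    """Extract the URL to poll from a scheduling response."""
    return schedule_response["job_result_api_url"]
```

Reading the polling URL from `job_result_api_url`, rather than rebuilding it from the project id, avoids relying on how the server names its jobs.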
<> URL
/queue/result/21fc428ae20ca4156649ee6777096bec_infer_types
<> RESPONSE AFTER JOB COMPLETION
{'completed': True,
'result': {'all_types': {'Adresse': ['Pays', 'Voie', 'Code Postal', 'Commune'],
'Article': ['ID publication',
"Contenu d'article",
'Résumé'],
'Autres types': ['Publication',
'Education Nationale',
...}
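In the sample above, `all_types` appears to map each type category to the list of type labels it contains. Under that reading, a small helper (hypothetical, not part of the API) can invert the mapping to look up the category of a given type label:

```python
def type_to_category(all_types):
    """Invert {category: [type labels]} into {type label: category}.

    If a label is listed under several categories, the last one wins.
    """
    lookup = {}
    for category, labels in all_types.items():
        for label in labels:
            lookup[label] = category
    return lookup
```

For example, with the part of the response shown above, looking up `'Voie'` would give `'Adresse'`.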