Article Information Parser (AIP) is an instrument to parse, unify, and in some cases correct article metadata. AIP creates a PostgreSQL database that makes it easy to find related work.
Developing such a database is tricky. An excerpt from our article introducing this instrument:
Current information sources do not cover the spectrum of the systems community entirely.
For example, DBLP -- which specifically focuses on computer science articles -- lacks certain venues and does not record article abstracts.
Other datasets such as Semantic Scholar and AMiner have similar and other limitations.
Moreover, these datasets overlap only partially: each contains important information the others do not offer.
Our approach is to parse each dataset and filter and unify the information provided.
This instrument combines three data sources: DBLP, Semantic Scholar, and AMiner, which we filter and store in a PostgreSQL database. DBLP is a well-known European archive that focuses on computer science and features all the top-level venues (journals and conferences). Semantic Scholar is an American project created by the Allen Institute for AI. The project aims to analyze and extract important data from scientific publications. AMiner is an Asian project that aims to provide a knowledge graph for mining academic social networks. Both AMiner and Semantic Scholar have incorporated Microsoft's Academic Graph (MAG) in their datasets nowadays.
AIP tackles several non-trivial challenges in unifying these datasets:
- Data discrepancies between sources. For example, titles in DBLP end with a dot, whereas they do not in the Semantic Scholar and AMiner corpora, causing exact matching to fail.
- Titles and abstracts may contain encoded characters, so articles that are in fact the same fail to match.
- Despite all data sources having a format specified, we encountered several instances where the format specified is not adhered to, or the data is malformed.
- Venue strings differ among these sources. Some sources use an abbreviation, some use a BibTeX string, etc. AIP maps all these occurrences to the same abbreviation.
- Complementing existing entries. For example, DBLP does not offer abstracts whilst Semantic Scholar and AMiner do.
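The title and venue challenges above can be illustrated with a minimal sketch. This is not AIP's actual code: `normalize_title`, `canonical_venue`, and the `VENUE_ALIASES` table are hypothetical names, and the exact rules AIP applies may differ.

```python
import html
import unicodedata
from typing import Optional

# Hypothetical alias table: source-specific venue strings -> one abbreviation.
# Keys are stored casefolded with single spaces.
VENUE_ALIASES = {
    "osdi": "OSDI",
    "usenix symposium on operating systems design and implementation": "OSDI",
}


def normalize_title(title: str) -> str:
    """Return a comparison key for a title (a sketch, not AIP's exact rules)."""
    text = html.unescape(title)                 # decode entities such as "&amp;"
    text = unicodedata.normalize("NFKC", text)  # unify equivalent Unicode forms
    text = " ".join(text.split())               # collapse runs of whitespace
    text = text.rstrip(".").rstrip()            # drop the trailing dot DBLP adds
    return text.casefold()                      # compare case-insensitively


def canonical_venue(raw: str) -> Optional[str]:
    """Map a raw venue string to its abbreviation, or None if unknown."""
    key = " ".join(raw.split()).casefold()
    return VENUE_ALIASES.get(key)
```

With such a key, two records can be compared with `normalize_title(a) == normalize_title(b)` instead of comparing the raw strings.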
We developed two scripts to run AIP and generate the database from the raw data sources:
- A script to renew the data on a local (single) machine
- A script to renew the data on a distributed system having a SLURM scheduler (managed by Dask)
The steps to run AIP are as follows:
- Clone this repository.
- Update the PostgreSQL settings in database_manager.py.
- Download the released datasets from the three sources and store them in a directory.
- Run either one of the two scripts mentioned earlier, or run parse_dblp.py, parse_semantic_scholar.py, and parse_aminer.py separately, passing the root of the data directory as an argument.
Have a look at the arguments each script accepts (such as file locations) for more options.
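As a rough illustration of running the parsers separately, they could be launched one after another from Python. The sketch assumes that each script takes the data root as its only positional argument and that running DBLP first is acceptable; both are assumptions, so check each script's actual arguments first. `build_commands` and `run_all` are hypothetical helpers, not part of AIP.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in the steps above; the order is an assumption.
PARSERS = ["parse_dblp.py", "parse_semantic_scholar.py", "parse_aminer.py"]


def build_commands(data_root, repo_dir="."):
    """Return one argv list per parser, each receiving the data root."""
    root = str(Path(data_root))
    return [[sys.executable, str(Path(repo_dir) / name), root] for name in PARSERS]


def run_all(data_root, repo_dir="."):
    """Run the parsers in sequence, stopping at the first one that fails."""
    for cmd in build_commands(data_root, repo_dir):
        subprocess.run(cmd, check=True)
```

Calling `run_all("/path/to/datasets")` from the repository root would then run the three parsers against the same directory.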
The database contains the following tables:
publications
Column name | Explanation |
---|---|
id | A unique id for the paper, usually the ID assigned by DBLP. |
venue | The abbreviation of the venue the article was accepted at. |
year | The publishing year. |
volume | (Optional) The volume of the journal the article was included in. |
title | The title of the article. |
doi | The DOI of the article. If there are multiple, the first one is usually used. |
abstract | The abstract of the article (if present in one of the datasets). |
n_citations | The number of times this article has been cited. |
authors
Column name | Explanation |
---|---|
id | A unique identifier per author. This is the id used by DBLP. |
name | The full name of the author. |
orcid | The ORCID of the author if known. |
author_paper_pairs: this table links authors to publications. We are aware of the use of "paper" rather than "article" (legacy).
Column name | Explanation |
---|---|
author_id | The id of an author. |
paper_id | The id of an article the author (co-)authored. |
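To show how these three tables fit together, the sketch below recreates a cut-down version of them and lists the authors of one article. It uses Python's built-in `sqlite3` module as a stand-in so that it runs without a server; AIP itself uses PostgreSQL, where the same JOIN applies. The column types, the `authors_of` helper, and all rows are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL database; types and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publications (id TEXT PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE authors (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE author_paper_pairs (author_id TEXT, paper_id TEXT);
    INSERT INTO publications VALUES ('p1', 'An Example Article', 2020);
    INSERT INTO authors VALUES ('a1', 'Ada Example'), ('a2', 'Bob Sample');
    INSERT INTO author_paper_pairs VALUES ('a1', 'p1'), ('a2', 'p1');
""")


def authors_of(conn, paper_id):
    """Return the names of all authors linked to the given publication id."""
    rows = conn.execute(
        """
        SELECT a.name
        FROM authors AS a
        JOIN author_paper_pairs AS ap ON ap.author_id = a.id
        WHERE ap.paper_id = ?
        ORDER BY a.name
        """,
        (paper_id,),
    )
    return [name for (name,) in rows]


print(authors_of(conn, "p1"))  # ['Ada Example', 'Bob Sample']
```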
cites is currently not used. In the future, this table will contain two article ids per row, recording which paper cites which.
properties
Column name | Explanation |
---|---|
last_modified | The date when the contents of the database were last altered. |
version | The version of the database content. Whenever a script modifies the database, it should increment this counter once it is done. |
db_schema_version | The version of the database schema. We use this to incrementally alter the database (adding indices, modifying/deleting/adding tables, etc.) |
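One common way to use a counter like db_schema_version is to keep an ordered list of migrations and apply only those the database has not seen yet. The sketch below shows that pattern; it is not necessarily how AIP implements it. It uses `sqlite3` as a stand-in for PostgreSQL so it runs without a server, and the `MIGRATIONS` list, the `migrate` helper, the column types, and the sample row are all hypothetical.

```python
import sqlite3

# Hypothetical migrations: entry i upgrades schema version i to i + 1.
MIGRATIONS = [
    "CREATE INDEX idx_publications_year ON publications (year)",
    "CREATE INDEX idx_publications_venue ON publications (venue)",
]


def migrate(conn):
    """Apply all pending migrations and return the resulting schema version."""
    (current,) = conn.execute(
        "SELECT db_schema_version FROM properties"
    ).fetchone()
    for version in range(current, len(MIGRATIONS)):
        conn.execute(MIGRATIONS[version])
        # Record progress after each step so a rerun skips applied migrations.
        conn.execute(
            "UPDATE properties SET db_schema_version = ?", (version + 1,)
        )
    conn.commit()
    return max(current, len(MIGRATIONS))


# Invented starting state: schema version 0, no indices yet.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publications (id TEXT, venue TEXT, year INTEGER);
    CREATE TABLE properties (
        last_modified TEXT, version INTEGER, db_schema_version INTEGER
    );
    INSERT INTO properties VALUES ('2020-01-01', 0, 0);
""")
```

Running `migrate(conn)` twice is safe in this sketch: the second call finds no pending entries and changes nothing.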
The following SQL command returns papers from 2011 onwards that contain the keywords performance, analysis, and quality in either the title or the abstract, sorted by year in descending order.
```sql
SELECT * FROM publications
WHERE year >= 2011
  AND (lower(title) LIKE '%performance%'
       OR lower(abstract) LIKE '%performance%')
  AND (lower(title) LIKE '%analysis%'
       OR lower(abstract) LIKE '%analysis%')
  AND (lower(title) LIKE '%quality%'
       OR lower(abstract) LIKE '%quality%')
ORDER BY year DESC;
```
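The same query can be generated for an arbitrary keyword list. The sketch below builds the SQL text and a matching parameter list; `build_keyword_query` is a hypothetical helper, and the `%s` placeholders assume a DB-API driver that uses that style, such as psycopg2. Keywords are passed as parameters rather than pasted into the SQL string, but LIKE wildcards inside a keyword (`%` or `_`) are not escaped here.

```python
def build_keyword_query(keywords, min_year):
    """Return (sql, params) matching every keyword in the title or abstract."""
    clauses = ["year >= %s"]
    params = [min_year]
    for word in keywords:
        clauses.append("(lower(title) LIKE %s OR lower(abstract) LIKE %s)")
        pattern = f"%{word.lower()}%"
        params.extend([pattern, pattern])  # one value per placeholder
    sql = (
        "SELECT * FROM publications WHERE "
        + " AND ".join(clauses)
        + " ORDER BY year DESC"
    )
    return sql, params


sql, params = build_keyword_query(["performance", "analysis", "quality"], 2011)
```

With an open cursor from such a driver, the result would be executed as `cursor.execute(sql, params)`.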