An instrument to combine, unify, and correct (scientific) article meta-data.

jakubtokarz/AIP

AIP

Article Information Parser is an instrument to parse, unify, and in some cases correct article meta-data. AIP creates a PostgreSQL database that allows for easily finding related work.

Developing such a database is tricky. An excerpt from our article introducing this instrument:

Current information sources do not cover the spectrum of the systems community entirely.
For example, DBLP -- which specifically focuses on computer science articles -- lacks certain venues and does not record article abstracts.
Other datasets such as Semantic Scholar and AMiner have similar and other limitations.
Moreover, these datasets overlap only partially: each contains important information the others do not offer.
Our approach is to parse each dataset and filter and unify the information provided.

This instrument combines three data sources: DBLP, Semantic Scholar, and AMiner, which we filter and store in a PostgreSQL database. DBLP is a well-known European archive that focuses on computer science and features all the top-level venues (journals and conferences). Semantic Scholar is an American project created by the Allen Institute for AI. The project aims to analyze and extract important data from scientific publications. AMiner is an Asian project that aims to provide a knowledge graph for mining academic social networks. Both AMiner and Semantic Scholar now incorporate the Microsoft Academic Graph (MAG) in their datasets.

AIP tackles several non-trivial challenges in unifying these datasets:

  1. Data discrepancies between sources. For example, titles in DBLP end with a dot, whereas they do not in the Semantic Scholar and AMiner corpora, causing exact matching to fail.
  2. Titles and abstracts may contain encoded characters, so that articles that are in fact the same fail to match.
  3. Although all data sources specify a format, we encountered several instances where that format is not adhered to, or where the data is malformed.
  4. Venue strings differ among these sources. Some sources use an abbreviation, some use a BibTeX string, etc. AIP maps all these occurrences to the same abbreviation.
  5. Complementing existing entries. For example, DBLP does not offer abstracts whilst Semantic Scholar and AMiner do.
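
The first, second, and fourth challenges above can be illustrated with a small normalization sketch. This is not AIP's actual code: the function names, the choice of `html.unescape` plus Unicode NFKC normalization, and the contents of `VENUE_ALIASES` are all assumptions made for illustration.

```python
import html
import unicodedata
from typing import Optional

# Illustrative alias table (assumption); AIP's real venue mapping is not shown here.
VENUE_ALIASES = {
    "nsdi": "NSDI",
    "usenix symposium on networked systems design and implementation": "NSDI",
}


def normalize_title(title: str) -> str:
    """Return a comparison key so the same title matches across sources."""
    text = html.unescape(title)                 # decode entities such as "&amp;"
    text = unicodedata.normalize("NFKC", text)  # unify equivalent Unicode forms
    text = " ".join(text.split())               # collapse runs of whitespace
    text = text.rstrip(". ")                    # drop the trailing dot DBLP adds
    return text.casefold()                      # case-insensitive comparison


def canonical_venue(raw: str) -> Optional[str]:
    """Map a raw venue string to one abbreviation, or None if it is unknown."""
    key = " ".join(raw.split()).casefold()
    return VENUE_ALIASES.get(key)
```

With such a key, a DBLP title ending in a dot and the same title from Semantic Scholar or AMiner compare equal, which an exact string comparison would miss.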

How to run AIP

We developed two scripts to run AIP and generate the database from the raw data sources:

  1. A script to renew the data on a local (single) machine
  2. A script to renew the data on a distributed system having a SLURM scheduler (managed by Dask)

The steps to run AIP are as follows:

  1. Clone this repository.
  2. Update the PostgreSQL settings in database_manager.py.
  3. Download the released datasets from the three sources and store them in a directory.
  4. Run either one of the two scripts mentioned earlier, or run parse_dblp.py, parse_semantic_scholar.py, or parse_aminer.py separately, passing the root of the data directory as an argument.

Have a look at the arguments each script accepts (such as file locations) for more options.
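
As a rough sketch of step 4, the three parsers could be driven from a small Python helper. It assumes that each script takes the data root as a single positional argument and that the order shown is acceptable; neither is confirmed here, so check each script's accepted arguments first. `build_commands` is a hypothetical helper, not part of AIP.

```python
import sys
from pathlib import Path

# Script names as listed in the steps above.
PARSER_SCRIPTS = ["parse_dblp.py", "parse_semantic_scholar.py", "parse_aminer.py"]


def build_commands(data_root, repo_dir="."):
    """Build one argument list per parser script.

    Assumption: the data root is passed as one positional argument.
    """
    repo = Path(repo_dir)
    return [
        [sys.executable, str(repo / script), str(data_root)]
        for script in PARSER_SCRIPTS
    ]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` to execute the parsers one after another.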

AIP database structure

The database contains the following tables:

publications

| Column name | Explanation |
| --- | --- |
| id | A unique id for the paper, usually the ID assigned by DBLP. |
| venue | The abbreviation of the venue the article was accepted at. |
| year | The publishing year. |
| volume | (Optional) The volume of the journal in which the article was included. |
| title | The title of the article. |
| doi | The DOI of the article. If there are multiple, the first one is usually used. |
| abstract | The abstract of the article (if present in one of the datasets). |
| n_citations | The number of times this article has been cited. |

authors

| Column name | Explanation |
| --- | --- |
| id | A unique identifier per author; this is the id used by DBLP. |
| name | The full name of the author. |
| orcid | The ORCID of the author, if known. |

author_paper_pairs

This table links authors to publications. The name uses "paper" rather than "article" for legacy reasons.

| Column name | Explanation |
| --- | --- |
| author_id | The id of an author. |
| paper_id | The id of an article the author (co-)authored. |
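
To show how the three tables above fit together, here is a minimal sketch that rebuilds them in an in-memory SQLite database (a stand-in for PostgreSQL so the example runs anywhere) and looks up an author's titles through author_paper_pairs. The column types, the sample rows, and the helper `titles_by_author` are assumptions; only the table and column names come from the description above.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE publications (
    id TEXT PRIMARY KEY, venue TEXT, year INTEGER, volume TEXT,
    title TEXT, doi TEXT, abstract TEXT, n_citations INTEGER
);
CREATE TABLE authors (id TEXT PRIMARY KEY, name TEXT, orcid TEXT);
CREATE TABLE author_paper_pairs (
    author_id TEXT REFERENCES authors(id),
    paper_id  TEXT REFERENCES publications(id)
);
""")

# Made-up sample rows, for illustration only.
conn.execute(
    "INSERT INTO publications (id, venue, year, title) VALUES (?, ?, ?, ?)",
    ("p1", "VENUE", 2020, "An example title"),
)
conn.execute("INSERT INTO authors (id, name) VALUES (?, ?)", ("a1", "Example Author"))
conn.execute("INSERT INTO author_paper_pairs VALUES (?, ?)", ("a1", "p1"))


def titles_by_author(connection, name):
    """Return the titles (co-)authored by `name`, newest first."""
    rows = connection.execute(
        """
        SELECT p.title
        FROM publications p
        JOIN author_paper_pairs ap ON ap.paper_id = p.id
        JOIN authors a ON a.id = ap.author_id
        WHERE a.name = ?
        ORDER BY p.year DESC
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]
```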

cites

This table is currently not used. In the future it will contain two article ids, recording which paper cites which.

properties

| Column name | Explanation |
| --- | --- |
| last_modified | The date when the contents of the database were last altered. |
| version | The version of the database content. Whenever a script modifies the database, it should increment this counter once it is done. |
| db_schema_version | The version of the database schema. We use this to incrementally alter the database (adding indices, modifying/deleting/adding tables, etc.). |
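
The version bookkeeping described above could look roughly like the sketch below. It assumes that properties holds exactly one row and that last_modified is stored as an ISO 8601 string; both are guesses. SQLite is again a stand-in for PostgreSQL, and `bump_version` is a hypothetical helper.

```python
import sqlite3
from datetime import datetime, timezone


def bump_version(connection):
    """Increment the content version and refresh last_modified.

    Assumes `properties` contains a single row.
    """
    now = datetime.now(timezone.utc).isoformat()
    connection.execute(
        "UPDATE properties SET version = version + 1, last_modified = ?",
        (now,),
    )
    connection.commit()
    return connection.execute("SELECT version FROM properties").fetchone()[0]


# Demo setup with assumed column types.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE properties "
    "(last_modified TEXT, version INTEGER, db_schema_version INTEGER)"
)
demo.execute("INSERT INTO properties VALUES (?, ?, ?)", ("", 0, 1))
```

A script would call `bump_version` once, after all its modifications have succeeded.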

Query Example

The following SQL command returns papers from 2011 onwards that contain all of the keywords performance, analysis, and quality in either the title or the abstract, sorted by year in descending order.

SELECT * FROM publications WHERE year >= 2011
AND (lower(title) LIKE '%performance%' 
	OR lower(abstract) LIKE '%performance%')
AND (lower(title) LIKE '%analysis%'
	OR lower(abstract) LIKE '%analysis%')
AND (lower(title) LIKE '%quality%'
	OR lower(abstract) LIKE '%quality%')
ORDER BY year DESC
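
The same query can be generated for an arbitrary keyword list. The sketch below builds a parameterized statement using `%s` placeholders, the style a driver such as psycopg2 expects; using psycopg2 is an assumption about how you connect, and `build_keyword_query` is a hypothetical helper, not part of AIP.

```python
def build_keyword_query(keywords, min_year):
    """Return (sql, params) matching every keyword in title or abstract."""
    clauses = ["year >= %s"]
    params = [min_year]
    for keyword in keywords:
        clauses.append("(lower(title) LIKE %s OR lower(abstract) LIKE %s)")
        pattern = f"%{keyword.lower()}%"  # substring match, as in the query above
        params.extend([pattern, pattern])
    sql = (
        "SELECT * FROM publications WHERE "
        + " AND ".join(clauses)
        + " ORDER BY year DESC"
    )
    return sql, params
```

The result can be passed to a database cursor as `cursor.execute(sql, params)`. Passing the patterns as parameters avoids quoting problems, but a `%` or `_` inside a keyword still acts as a LIKE wildcard.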
