# Generating an AIP database

Disclaimer: The datasets total several hundred gigabytes and take a significant amount of time to process and import into a PostgreSQL database.

To update the database, you must download all four data sources and parse them.

  1. Create the folders in the root of the repository where all the input files will be stored, using the following commands:

    $ export DOWNLOAD_DATE=$(date +'%d_%m_%Y')
    $ mkdir -p datasets/{dblp,s2-corpus,aminer,mag}_$DOWNLOAD_DATE
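    For example, a download on 5 March 2021 produces the folders datasets/dblp_05_03_2021, datasets/s2-corpus_05_03_2021, datasets/aminer_05_03_2021, and datasets/mag_05_03_2021.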
  2. Download the DBLP dataset (dblp.xml.gz and dblp.dtd):

    $ wget -P datasets/dblp_$DOWNLOAD_DATE https://dblp.uni-trier.de/xml/{dblp.xml.gz,dblp.dtd}
  3. Download the Open Academic Graph dataset, which contains the AMiner and MAG papers, using the following commands:

    $ wget -P datasets/aminer_$DOWNLOAD_DATE https://www.aminer.cn/download_data\?link\=oag-2-1/aminer/paper/aminer_papers_{0..5}.zip
    $ wget -P datasets/mag_$DOWNLOAD_DATE https://www.aminer.cn/download_data\?link\=oag-2-1/mag/paper/mag_papers_{0..16}.zip
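    Note that, depending on your wget version, the archives may be saved under names that include the download_data query string rather than the plain .zip names; if that happens, rename them before unzipping.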

    This will download aminer_papers_0.zip through aminer_papers_5.zip and mag_papers_0.zip through mag_papers_16.zip.

  4. Download the Semantic Scholar dataset by following the instructions to get the latest corpus and store the files in the s2-corpus_$DOWNLOAD_DATE directory.
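    As a sketch of one way this has worked, the corpus could be fetched via a manifest file; the release date and bucket URL below are placeholders from an older layout, so follow the current official instructions if they differ:

    $ export S2_RELEASE=2021-01-01
    $ wget -P datasets/s2-corpus_$DOWNLOAD_DATE https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$S2_RELEASE/manifest.txt
    $ wget -B https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$S2_RELEASE/ -i datasets/s2-corpus_$DOWNLOAD_DATE/manifest.txt -P datasets/s2-corpus_$DOWNLOAD_DATE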

  5. After downloading all the files, unzip them.
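    A minimal sketch, assuming the DBLP dump is gzipped, the OAG archives ended up as plain .zip files in their folders, and the Semantic Scholar corpus files are .gz archives:

    $ gunzip datasets/dblp_$DOWNLOAD_DATE/dblp.xml.gz
    $ for f in datasets/{aminer,mag}_$DOWNLOAD_DATE/*.zip; do unzip "$f" -d "$(dirname "$f")"; done
    $ gunzip datasets/s2-corpus_$DOWNLOAD_DATE/*.gz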

  6. After making sure all files are unzipped and stored in the same folder, change line 14 of renew_data_locally.py, located in the parser folder, so that it points to the folder where you stored the downloaded files.

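    To see what is currently on that line, you can print it:

    $ sed -n '14p' parser/renew_data_locally.py

    The variable name is defined in the script; set its value to an absolute path such as /home/<user>/aip/datasets/ (a placeholder shown here for illustration).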

  7. Run the renew_data_locally.py file.
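    Assuming a Python 3 environment with the repository's dependencies installed, this is simply:

    $ python3 parser/renew_data_locally.py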

  8. After re-parsing the whole database, make sure to add the version dates of all the downloaded sources to the database.

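    As a sketch only — the database name, table, and column names below are hypothetical, so adapt them to the actual schema:

    $ psql -d aip -c "INSERT INTO source_versions (source, version_date) VALUES ('dblp', '$DOWNLOAD_DATE');"

    Repeat for the s2-corpus, aminer, and mag sources; here the date is stored in the DD_MM_YYYY format used for the download folders.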