Ascent: Advanced Semantics for Commonsense Knowledge Extraction

Introduction

ASCENT is a pipeline for extracting and consolidating commonsense knowledge from the World Wide Web. It extracts facet-enriched assertions, for example, lawyer; represents; clients; [LOCATION] in courts or elephant; uses; its trunk; [PURPOSE] to suck up water. A web interface to the ASCENT knowledge base, covering 10,000 popular concepts, is available at https://ascent.mpi-inf.mpg.de/.
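For intuition, each assertion pairs a subject-predicate-object triple with optional semantic facets. A minimal sketch of how the elephant example above could be represented in Python (an illustration only; ASCENT's actual output schema in final.json may differ):

# Illustrative representation of a facet-enriched assertion;
# not ASCENT's actual output schema.
assertion = {
    "subject": "elephant",
    "predicate": "uses",
    "object": "its trunk",
    "facets": [{"label": "PURPOSE", "value": "to suck up water"}],
}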

Prerequisites

Setting up the environment

You need Python 3.7+ to run the pipeline.

First, create and activate a virtual environment using your favourite tool, e.g., python3-venv:

python -m venv .env
source .env/bin/activate

Then, install required packages:

pip install -r requirements.txt

Next, download the following spaCy model:

python -m spacy download en_core_web_md

Then, download the WordNet corpus for the NLTK package:

python -c 'import nltk; nltk.download("wordnet")'
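To verify the setup, you can resolve a WordNet synset identifier, the subject format used by the pipeline (see Usage below), directly from Python. A quick sanity check, not part of the pipeline itself:

from nltk.corpus import wordnet as wn

# Subjects such as lion.n.01 are WordNet synset identifiers.
synset = wn.synset("lion.n.01")
print(synset.definition())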

RoBERTa models

Download our pretrained models for triple clustering and facet type labeling from https://nextcloud.mpi-klsb.mpg.de/index.php/s/s2ELgPgC5LEGEFp and extract them to the project's root folder.

Bing API Key

Edit the file config.ini and provide your Bing API key and Bing Search custom config under the [bing_search] section. Documentation for the Bing Custom Search API: https://docs.microsoft.com/en-us/rest/api/cognitiveservices-bingsearch/bing-custom-search-api-v7-reference
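To sanity-check your credentials before running the pipeline, you can query the Bing Custom Search endpoint directly. A minimal sketch using the requests library and the default host and path from the Configurations section below; the placeholder values are yours to fill in and are not part of ASCENT:

import requests

# Placeholders: use the same values you put in config.ini.
SUBSCRIPTION_KEY = "YOUR_BING_API_KEY"
CUSTOM_CONFIG = "YOUR_CUSTOM_CONFIG_ID"

response = requests.get(
    "https://api.cognitive.microsoft.com/bingcustomsearch/v7.0/search",
    headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
    params={"q": "elephant", "customconfig": CUSTOM_CONFIG},
)
response.raise_for_status()
for page in response.json().get("webPages", {}).get("value", []):
    print(page["url"])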

Usage

To run the ASCENT pipeline, navigate to the src/ folder and execute the main.py script:

cd src/
python main.py --config ../config.ini

You will be asked to enter one or more subjects, which must be WordNet concepts. You can provide a single subject:

Enter subjects: lion.n.01

or a list of comma-separated subjects:

Enter subjects: lion.n.01,lynx.n.02,elephant.n.01

or a path to a file containing one subject per line:

Enter subjects: /path/to/your/subjects.txt
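For example, a subjects file covering the subjects above would simply contain (illustrative contents; the file name and location are up to you):

lion.n.01
lynx.n.02
elephant.n.01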

Then, enter indices of the modules you want to execute:

[0] Bing Search
[1] Crawl articles
[2] Filter irrelevant articles
[3] Extract knowledge
[4] Cluster similar triples
[5] Label facets
[6] Group similar facets

For example, to run the complete pipeline:

From module: 0
  To module: 6

Final results will be written to output/kb/<subject>/final.json. Intermediate results of every module can be found in the output folder as well.
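Since the exact schema of final.json is not described here, a minimal sketch for loading a result file and inspecting its top-level structure from Python:

import json

# Assumes the pipeline was run for the subject lion.n.01.
with open("output/kb/lion.n.01/final.json") as f:
    kb = json.load(f)

# Inspect the top level without assuming a particular schema.
print(list(kb) if isinstance(kb, dict) else f"list of {len(kb)} entries")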

Configurations

An example config file is provided as config.ini; only the Bing API-related fields are left blank. The config fields are documented below (a full example sketch follows the list):

  • [default]

    • res_dir: resource folder
    • output: output folder
    • gpu: list of comma-separated GPU indices to be used; -1 means CPU only. E.g., gpu = 0,3 means the 0th and 3rd GPUs of the machine will be used.
  • [bing_search]

    • subscription_key: Bing API subscription key (required)
    • custom_config: Bing API custom config (required)
    • num_urls: number of URLs to be fetched by the Bing API
    • host: Bing API host (default: api.cognitive.microsoft.com)
    • path: Bing API search path (default: /bingcustomsearch/v7.0/search)
    • overwrite: (true|false) whether to overwrite this module's existing results in the output folder
    • num_processes: number of parallel processes for this module
  • [article_grab]

    • num_crawlers: number of parallel crawlers; each crawler works on one subject at a time
    • processes_per_crawler: number of processes per crawler
    • overwrite: (true|false) whether to overwrite this module's existing results in the output folder
  • [filter]

    • num_processes: number of parallel processes for this module
    • overwrite: (true|false) whether to overwrite this module's existing results in the output folder
  • [extraction]

    • doc_threshold: document cosine-similarity threshold; documents scoring below this threshold are filtered out (default: 0.55)
    • num_processes: number of parallel processes for this module
    • overwrite: (true|false) whether to overwrite this module's existing results in the output folder
  • [triple_clustering]

    • model: path to the triple clustering model
    • threshold: threshold for the hierarchical agglomerative clustering (HAC) algorithm (default: 0.005)
    • batch_size: number of triple pairs to be processed in one batch (default: 1024)
    • overwrite: (true|false) whether to overwrite this module's existing results in the output folder
  • [facet_labeling]

    • model: path to the facet labeling model
    • batch_size: number of faceted triples to be processed in one batch (default: 1024)
    • overwrite: (true|false) whether to overwrite this module's existing results in the output folder
  • [facet_grouping]

    • num_processes: number of parallel processes for this module
    • overwrite: (true|false) whether to overwrite this module's existing results in the output folder
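Putting the above together, a config.ini could look like the following sketch. Only the defaults quoted above (host, path, doc_threshold, threshold, batch_size) come from this documentation; every other value is an illustrative placeholder to adapt to your machine:

[default]
; res_dir and output are placeholders; adjust to your setup
res_dir = resources
output = output
gpu = 0

[bing_search]
subscription_key = YOUR_BING_API_KEY
custom_config = YOUR_CUSTOM_CONFIG_ID
; num_urls and num_processes below are example values
num_urls = 100
host = api.cognitive.microsoft.com
path = /bingcustomsearch/v7.0/search
overwrite = false
num_processes = 4

[article_grab]
; example values
num_crawlers = 2
processes_per_crawler = 4
overwrite = false

[filter]
num_processes = 4
overwrite = false

[extraction]
doc_threshold = 0.55
num_processes = 4
overwrite = false

[triple_clustering]
; model paths are placeholders; point them at the extracted pretrained models
model = path/to/triple_clustering_model
threshold = 0.005
batch_size = 1024
overwrite = false

[facet_labeling]
model = path/to/facet_labeling_model
batch_size = 1024
overwrite = false

[facet_grouping]
num_processes = 4
overwrite = false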

Citation

If you use Ascent, please cite the following paper:

@inproceedings{ascent,
  author = {Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},
  title = {Advanced Semantics for Commonsense Knowledge Extraction},
  year = {2021},
  isbn = {9781450383127},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3442381.3449827},
  doi = {10.1145/3442381.3449827},
  booktitle = {Proceedings of the Web Conference 2021},
  pages = {2636–2647},
  numpages = {12},
  location = {Ljubljana, Slovenia},
  series = {WWW '21}
}