- Website: https://ascent.mpi-inf.mpg.de
- Download 8.9M commonsense assertions: https://ascent.mpi-inf.mpg.de/download
ASCENT is a pipeline for extracting and consolidating commonsense
knowledge from the world wide web.
ASCENT extracts facet-enriched assertions, for example,
"lawyer; represents; clients; [LOCATION] in courts" or
"elephant; uses; its trunk; [PURPOSE] to suck up water".
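The semicolon-separated form shown above can be split into its parts with a few lines of Python. This is only a sketch of the display format used in this README; the function name `parse_assertion` and the returned dictionary keys are hypothetical and do not describe the schema of the ASCENT output files.

```python
import re
from typing import Dict, List, Tuple

# Matches a facet such as "[LOCATION] in courts":
# a bracketed upper-case label followed by the facet text.
FACET_RE = re.compile(r"\[([A-Z_]+)\]\s*(.+)")


def parse_assertion(text: str) -> Dict[str, object]:
    """Split 'subject; predicate; object; [LABEL] facet' into its parts.

    Hypothetical helper for the README notation only.
    """
    parts = [p.strip() for p in text.split(";") if p.strip()]
    if len(parts) < 3:
        raise ValueError(f"expected subject; predicate; object, got: {text!r}")
    subject, predicate, obj = parts[:3]

    facets: List[Tuple[str, str]] = []
    for part in parts[3:]:
        match = FACET_RE.fullmatch(part)
        if match is None:
            raise ValueError(f"not a '[LABEL] text' facet: {part!r}")
        facets.append((match.group(1), match.group(2)))

    return {
        "subject": subject,
        "predicate": predicate,
        "object": obj,
        "facets": facets,
    }


if __name__ == "__main__":
    print(parse_assertion("lawyer; represents; clients; [LOCATION] in courts"))
```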
A web interface of the ASCENT knowledge base for 10,000 popular
concepts can be found at https://ascent.mpi-inf.mpg.de/.
You need Python 3.7+ to run the pipeline.
First, create and activate a virtual environment using your favourite tool, e.g., python3-venv:
python -m venv .env
source .env/bin/activate
Then, install required packages:
pip install -r requirements.txt
Next, download the following spaCy model:
python -m spacy download en_core_web_md
Then, download the wordnet corpus for the nltk package:
python -c 'import nltk; nltk.download("wordnet")'
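Before moving on, it can help to confirm that the interpreter and the packages installed above are visible from the active virtual environment. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes that the spaCy model is importable as the package `en_core_web_md` (which is how `spacy download` normally installs it). It does not check the downloaded wordnet corpus, which is data rather than a package.

```python
import importlib.util
import sys
from typing import Iterable, List

REQUIRED_PYTHON = (3, 7)
# Top-level package names only; this list is illustrative, not exhaustive.
REQUIRED_PACKAGES = ["spacy", "nltk", "en_core_web_md"]


def python_is_supported(version=sys.version_info) -> bool:
    """Return True if the (major, minor) version is at least 3.7."""
    return tuple(version[:2]) >= REQUIRED_PYTHON


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.7+ is required")
    missing = missing_packages(REQUIRED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All checked packages were found")
```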
Download our pretrained models for triple clustering and facet type labeling from https://nextcloud.mpi-klsb.mpg.de/index.php/s/s2ELgPgC5LEGEFp, then extract them to the project's root folder.
Edit the file config.ini and provide your Bing API key and Bing Search custom config under the section [bing_search].
Documentation for the Bing Custom Search API:
https://docs.microsoft.com/en-us/rest/api/cognitiveservices-bingsearch/bing-custom-search-api-v7-reference
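A quick way to catch an incomplete config.ini before starting a run is to check the two required Bing fields with Python's `configparser`. This is a sketch under the assumption that the fields are named `subscription_key` and `custom_config` in the `[bing_search]` section; the helper names are hypothetical and are not part of the ASCENT code base.

```python
import configparser
from typing import List

REQUIRED_BING_FIELDS = ("subscription_key", "custom_config")


def load_config(path: str) -> configparser.ConfigParser:
    """Read an INI file; interpolation is disabled so '%' in keys is safe."""
    config = configparser.ConfigParser(interpolation=None)
    with open(path, encoding="utf-8") as handle:
        config.read_file(handle)
    return config


def missing_bing_fields(config: configparser.ConfigParser) -> List[str]:
    """Return the required [bing_search] fields that are absent or empty."""
    if not config.has_section("bing_search"):
        return list(REQUIRED_BING_FIELDS)
    section = config["bing_search"]
    return [
        key
        for key in REQUIRED_BING_FIELDS
        if not section.get(key, fallback="").strip()
    ]
```

For example, `missing_bing_fields(load_config("config.ini"))` returns an empty list once both values have been filled in.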
To run the ASCENT pipeline, navigate to the src/ folder and execute the main.py script:
cd src/
python main.py --config ../config.ini
You will be asked to enter one or more subjects, which should be WordNet concepts. You can provide a single subject:
Enter subjects: lion.n.01
or a list of comma-separated subjects:
Enter subjects: lion.n.01,lynx.n.02,elephant.n.01
or the path to a file containing one subject per line:
Enter subjects: /path/to/your/subjects.txt
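The three accepted input forms can be normalised into one list of subjects as sketched below. This illustrates the behaviour described above and is not the code used by main.py; `read_subjects`, `invalid_subjects`, and the regular expression for WordNet synset names (lemma, part-of-speech letter, sense number) are assumptions made for this example.

```python
import os
import re
from typing import List

# Assumed shape of a WordNet synset name, e.g. "lion.n.01".
SYNSET_NAME_RE = re.compile(r"^\S+\.[nvasr]\.\d{2,}$")


def read_subjects(user_input: str) -> List[str]:
    """Return subjects from a file path, a comma-separated list, or one name."""
    text = user_input.strip()
    if os.path.isfile(text):
        # File input: one subject per line.
        with open(text, encoding="utf-8") as handle:
            candidates = [line.strip() for line in handle]
    else:
        # A single subject is the one-element case of a comma-separated list.
        candidates = [item.strip() for item in text.split(",")]
    return [candidate for candidate in candidates if candidate]


def invalid_subjects(subjects: List[str]) -> List[str]:
    """Return the subjects that do not look like WordNet synset names."""
    return [s for s in subjects if not SYNSET_NAME_RE.match(s)]
```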
Then, enter indices of the modules you want to execute:
[0] Bing Search
[1] Crawl articles
[2] Filter irrelevant articles
[3] Extract knowledge
[4] Cluster similar triples
[5] Label facets
[6] Group similar facets
For example, to run the complete pipeline:
From module: 0
To module: 6
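The "From module" and "To module" prompts select an inclusive range of the seven modules listed above. A minimal sketch of that selection follows; the module names are copied from the list, while `select_modules` is a hypothetical helper rather than the pipeline's own function.

```python
from typing import List

# Module names in pipeline order, indexed 0 to 6 as in the prompt.
MODULES = [
    "Bing Search",
    "Crawl articles",
    "Filter irrelevant articles",
    "Extract knowledge",
    "Cluster similar triples",
    "Label facets",
    "Group similar facets",
]


def select_modules(first: int, last: int) -> List[str]:
    """Return the modules from index `first` to index `last`, both inclusive."""
    if not 0 <= first <= last < len(MODULES):
        raise ValueError(
            f"need 0 <= from <= to <= {len(MODULES) - 1}, got {first} and {last}"
        )
    return MODULES[first:last + 1]


if __name__ == "__main__":
    # Example: rerun only the extraction and clustering steps.
    print(select_modules(3, 4))
```

Running a partial range like this presumably relies on the intermediate results of the earlier modules already being present in the output folder.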
Final results will be written to output/kb/<subject>/final.json.
Intermediate results of every module can be found in the output
folder as well.
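Once a run has finished, the final result for a subject can be loaded with the standard `json` module. The sketch below assumes that `<subject>` in the path is the subject string entered at the prompt (for example `lion.n.01`) and that the output folder is named `output`; both are assumptions, and the code makes no claim about the structure of the JSON itself.

```python
import json
from pathlib import Path
from typing import Any


def final_result_path(output_dir: str, subject: str) -> Path:
    """Build <output_dir>/kb/<subject>/final.json."""
    return Path(output_dir) / "kb" / subject / "final.json"


def load_final_result(output_dir: str, subject: str) -> Any:
    """Load the final JSON for one subject without assuming its schema."""
    path = final_result_path(output_dir, subject)
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    path = final_result_path("output", "lion.n.01")
    if path.exists():
        data = load_final_result("output", "lion.n.01")
        print(type(data).__name__, "loaded from", path)
    else:
        print("No result found at", path)
```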
The config.ini file serves as an example config file.
The only missing fields are the Bing API-related ones.
The config fields are described below:
- [default]
  - res_dir: resource folder
  - output: output folder
  - gpu: comma-separated list of GPUs to be used. -1 means the CPU will be used. E.g., gpu = 0,3 means that the 0th and 3rd GPUs of the machine will be used.
- [bing_search]
  - subscription_key: Bing API subscription key (required)
  - custom_config: Bing API custom config (required)
  - num_urls: number of URLs to be fetched by the Bing API
  - host = api.cognitive.microsoft.com
  - path = /bingcustomsearch/v7.0/search
  - overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
  - num_processes: number of processors for this module
- [article_grab]
  - num_crawlers: number of parallel crawlers; each crawler works with one subject at a time
  - processes_per_crawler: number of processors per crawler
  - overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
- [filter]
  - num_processes: number of processors for this module
  - overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
- [extraction]
  - doc_threshold: document cosine-similarity threshold. Documents scoring below this threshold are filtered out (default: 0.55)
  - num_processes: number of processors for this module
  - overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
- [triple_clustering]
  - model: path to the triple clustering model
  - threshold: threshold for the HAC algorithm (default: 0.005)
  - batch_size: number of triple pairs processed per batch (default: 1024)
  - overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
- [facet_labeling]
  - model: path to the facet labeling model
  - batch_size: number of faceted triples processed per batch (default: 1024)
  - overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
- [facet_grouping]
  - num_processes: number of processors for this module
  - overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
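The gpu field in the [default] section packs two cases into one string: -1 for CPU, or a comma-separated list of GPU indices. One way to interpret it is sketched below. `parse_gpu_field` is a hypothetical helper, and rejecting a mix of -1 with other indices is an assumption of this example, not documented pipeline behaviour.

```python
from typing import List


def parse_gpu_field(value: str) -> List[int]:
    """Return the GPU indices, or an empty list when -1 selects the CPU."""
    ids = [int(item) for item in value.split(",") if item.strip()]
    if not ids:
        raise ValueError("gpu field is empty")
    if -1 in ids:
        # Assumption: -1 must stand alone, since it means "use the CPU".
        if len(ids) > 1:
            raise ValueError("-1 (CPU) cannot be combined with GPU indices")
        return []
    if any(i < 0 for i in ids):
        raise ValueError(f"invalid GPU index in {value!r}")
    return ids


if __name__ == "__main__":
    for raw in ("0,3", "-1"):
        gpus = parse_gpu_field(raw)
        print(raw, "->", gpus if gpus else "CPU")
```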
If you use ASCENT, please cite the following paper:
@inproceedings{ascent,
author = {Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},
title = {Advanced Semantics for Commonsense Knowledge Extraction},
year = {2021},
isbn = {9781450383127},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3442381.3449827},
doi = {10.1145/3442381.3449827},
booktitle = {Proceedings of the Web Conference 2021},
pages = {2636–2647},
numpages = {12},
location = {Ljubljana, Slovenia},
series = {WWW '21}
}