This repository contains the code for lettuce's submission to the triple scoring task at WSDM Cup 2017.
We address this task by combining multiple neural network classifiers using gradient boosted regression trees (GBRT). Similar to past work, we train these classifiers on instances that have a single class and then use them to predict the classes of instances that have multiple classes, as illustrated in the sketch below.
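As a toy illustration of this training strategy (the data layout here is hypothetical; the real pipeline is described in the sections below), persons carrying exactly one label form the training set, and the trained classifiers later score the candidate labels of the remaining persons:

```python
# Toy illustration of the single-class training strategy (hypothetical data).
person_labels = {
    "Person A": ["actor"],             # single class -> usable as training data
    "Person B": ["singer"],            # single class -> usable as training data
    "Person C": ["actor", "singer"],   # multiple classes -> to be scored later
}

train = {p: ls[0] for p, ls in person_labels.items() if len(ls) == 1}
to_score = {p: ls for p, ls in person_labels.items() if len(ls) > 1}

print(train)     # {'Person A': 'actor', 'Person B': 'singer'}
print(to_score)  # {'Person C': ['actor', 'singer']}
```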
First, install the required Python packages:
% pip install -r requirements.txt
The following three databases are required to train our model:
- entity_db stores the Wikipedia redirect structure and basic statistics
- page_db contains paragraphs and links in the target Wikipedia pages
- sentence_db stores the parsed sentences contained in the wiki-sentences file
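As a rough illustration of the role of entity_db (the actual on-disk format is internal to this repository), a redirect store maps alternative titles to canonical page titles:

```python
# Hypothetical illustration of what entity_db provides: resolving Wikipedia
# redirects to canonical titles. The repository's actual storage format differs.
import shelve

with shelve.open("entity_db_example") as db:
    db["Barack Hussein Obama"] = "Barack Obama"  # redirect title -> canonical title

with shelve.open("entity_db_example") as db:
    print(db["Barack Hussein Obama"])  # Barack Obama
```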
You also need to download a Wikipedia dump file from Wikimedia Downloads; in our experiments, we used the dump generated in June 2016.
% python cli.py build_entity_db WIKIPEDIA_DUMP_FILE entity_db
% python cli.py build_page_db --category=pro WIKIPEDIA_DUMP_FILE page_db_pro.db
% python cli.py build_page_db --category=nat WIKIPEDIA_DUMP_FILE page_db_nat.db
% python cli.py build_sentence_db dataset/wiki-sentences sentence.db
We combine the outputs of multiple supervised classifiers to compute features for our scoring model.
The first classifier is trained using the bag-of-words (BoW) and bag-of-entities (BoE) representations of the target Wikipedia pages.
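The following is a minimal sketch, assuming PyTorch as a stand-in for the repository's actual network, of a bag-of-features classifier with an optional additive attention over the embedded tokens, mirroring the --dim-size and --no-attention options used below:

```python
# Minimal sketch of a BoW/BoE classifier with optional attention (PyTorch
# assumed; this is not the repository's actual model).
import torch
import torch.nn as nn

class BagClassifier(nn.Module):
    def __init__(self, vocab_size, n_classes, dim_size=300, use_attention=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim_size, padding_idx=0)
        self.use_attention = use_attention
        if use_attention:
            self.attn = nn.Linear(dim_size, 1)  # scores each token embedding
        self.out = nn.Linear(dim_size, n_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)             # (batch, seq_len, dim)
        mask = token_ids != 0                   # ignore padding positions
        if self.use_attention:
            scores = self.attn(emb).squeeze(-1)
            scores = scores.masked_fill(~mask, float("-inf"))
            weights = torch.softmax(scores, dim=-1)
            rep = (weights.unsqueeze(-1) * emb).sum(dim=1)
        else:                                   # plain average of embeddings
            m = mask.unsqueeze(-1).float()
            rep = (emb * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)
        return self.out(rep)

model = BagClassifier(vocab_size=1000, n_classes=10)
logits = model(torch.randint(1, 1000, (2, 8)))  # two "documents" of 8 token ids
```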
We train eight classifiers for each category (i.e., profession and nationality) with various training configurations. The classifiers can be built using the following commands:
Preparing required data:
% python cli.py page_classifier build_dataset --category=pro --test-size=0 page_db_pro.db page_classifier_dataset_pro_full.joblib
% python cli.py page_classifier build_dataset --category=nat --test-size=0 page_db_nat.db page_classifier_dataset_nat_full.joblib
Training classifiers for profession task:
% python cli.py train_classifier page_classifier_dataset_pro_full.joblib page_classifier_model_pro_attention_300_full --dim-size=300
% python cli.py train_classifier page_classifier_dataset_pro_full.joblib page_classifier_model_pro_300_full --dim-size=300 --no-attention
% python cli.py train_classifier page_classifier_dataset_pro_full.joblib page_classifier_model_pro_attention_300_balanced_full --dim-size=300 --balanced-weight
% python cli.py train_classifier page_classifier_dataset_pro_full.joblib page_classifier_model_pro_300_balanced_full --dim-size=300 --balanced-weight --no-attention
% python cli.py train_classifier page_classifier_dataset_pro_full.joblib page_classifier_model_pro_entity_attention_300_full --dim-size=300 --entity-only
% python cli.py train_classifier page_classifier_dataset_pro_full.joblib page_classifier_model_pro_entity_300_full --dim-size=300 --entity-only --no-attention
% python cli.py train_classifier page_classifier_dataset_pro_full.joblib page_classifier_model_pro_entity_attention_300_balanced_full --dim-size=300 --entity-only --balanced-weight
% python cli.py train_classifier page_classifier_dataset_pro_full.joblib page_classifier_model_pro_entity_300_balanced_full --dim-size=300 --entity-only --balanced-weight --no-attention
Training classifiers for nationality task:
% python cli.py train_classifier page_classifier_dataset_nat_full.joblib page_classifier_model_nat_attention_300_full --dim-size=300
% python cli.py train_classifier page_classifier_dataset_nat_full.joblib page_classifier_model_nat_300_full --dim-size=300 --no-attention
% python cli.py train_classifier page_classifier_dataset_nat_full.joblib page_classifier_model_nat_attention_300_balanced_full --dim-size=300 --balanced-weight
% python cli.py train_classifier page_classifier_dataset_nat_full.joblib page_classifier_model_nat_300_balanced_full --dim-size=300 --balanced-weight --no-attention
% python cli.py train_classifier page_classifier_dataset_nat_full.joblib page_classifier_model_nat_entity_attention_300_full --dim-size=300 --entity-only
% python cli.py train_classifier page_classifier_dataset_nat_full.joblib page_classifier_model_nat_entity_300_full --dim-size=300 --entity-only --no-attention
% python cli.py train_classifier page_classifier_dataset_nat_full.joblib page_classifier_model_nat_entity_attention_300_balanced_full --dim-size=300 --entity-only --balanced-weight
% python cli.py train_classifier page_classifier_dataset_nat_full.joblib page_classifier_model_nat_entity_300_balanced_full --dim-size=300 --entity-only --balanced-weight --no-attention
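The --balanced-weight flag presumably counteracts class imbalance by weighting classes inversely to their frequency. A common formulation is scikit-learn's "balanced" heuristic, shown here as an assumption about what the flag computes:

```python
# Scikit-learn-style "balanced" class weights: n_samples / (n_classes * count).
import numpy as np

y = np.array([0, 0, 0, 1, 2, 2])                 # toy label array
weights = len(y) / (len(np.unique(y)) * np.bincount(y))
print(dict(zip(np.unique(y), weights)))          # class -> weight
```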
The second classifier is trained using word and entity co-occurrence data of the target entities, obtained from Wikipedia. The co-occurrence data are computed from the wiki-sentences file.
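The sketch below illustrates window-based co-occurrence counting, the kind of computation build_coocc_matrix performs with --word-window=5 or 10 (the repository's implementation and output format will differ):

```python
# Minimal sketch of counting co-occurrences within a fixed token window.
from collections import Counter

def cooccurrence_counts(sentences, window):
    counts = Counter()
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(center, tokens[j])] += 1
    return counts

print(cooccurrence_counts([["obama", "served", "as", "president"]], window=2))
```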
You can train the classifiers using the following commands:
Preparing required data:
% python cli.py coocc_classifier build_coocc_matrix coocc_matrix_win5 sentence.db --word-window=5
% python cli.py coocc_classifier build_coocc_matrix coocc_matrix_win10 sentence.db --word-window=10
% python cli.py coocc_classifier build_dataset --category=pro --test-size=0 coocc_matrix_win5 coocc_classifier_dataset_win5_pro_full.joblib
% python cli.py coocc_classifier build_dataset --category=pro --test-size=0 coocc_matrix_win10 coocc_classifier_dataset_win10_pro_full.joblib
% python cli.py coocc_classifier build_dataset --category=nat --test-size=0 coocc_matrix_win5 coocc_classifier_dataset_win5_nat_full.joblib
% python cli.py coocc_classifier build_dataset --category=nat --test-size=0 coocc_matrix_win10 coocc_classifier_dataset_win10_nat_full.joblib
Training classifiers for profession task:
% python cli.py train_classifier coocc_classifier_dataset_win5_pro_full.joblib coocc_classifier_model_pro_attention_win5_300_full --dim-size=300
% python cli.py train_classifier coocc_classifier_dataset_win5_pro_full.joblib coocc_classifier_model_pro_win5_300_full --dim-size=300 --no-attention
% python cli.py train_classifier coocc_classifier_dataset_win5_pro_full.joblib coocc_classifier_model_pro_attention_win5_300_balanced_full --dim-size=300 --balanced-weight
% python cli.py train_classifier coocc_classifier_dataset_win5_pro_full.joblib coocc_classifier_model_pro_win5_300_balanced_full --dim-size=300 --balanced-weight --no-attention
% python cli.py train_classifier coocc_classifier_dataset_win10_pro_full.joblib coocc_classifier_model_pro_attention_win10_300_full --dim-size=300
% python cli.py train_classifier coocc_classifier_dataset_win10_pro_full.joblib coocc_classifier_model_pro_win10_300_full --dim-size=300 --no-attention
% python cli.py train_classifier coocc_classifier_dataset_win10_pro_full.joblib coocc_classifier_model_pro_attention_win10_300_balanced_full --dim-size=300 --balanced-weight
% python cli.py train_classifier coocc_classifier_dataset_win10_pro_full.joblib coocc_classifier_model_pro_win10_300_balanced_full --dim-size=300 --no-attention --balanced-weight
Training classifiers for nationality task:
% python cli.py train_classifier coocc_classifier_dataset_win5_nat_full.joblib coocc_classifier_model_nat_attention_win5_300_full --dim-size=300
% python cli.py train_classifier coocc_classifier_dataset_win5_nat_full.joblib coocc_classifier_model_nat_win5_300_full --dim-size=300 --no-attention
% python cli.py train_classifier coocc_classifier_dataset_win5_nat_full.joblib coocc_classifier_model_nat_attention_win5_300_balanced_full --dim-size=300 --balanced-weight
% python cli.py train_classifier coocc_classifier_dataset_win5_nat_full.joblib coocc_classifier_model_nat_win5_300_balanced_full --dim-size=300 --balanced-weight --no-attention
% python cli.py train_classifier coocc_classifier_dataset_win10_nat_full.joblib coocc_classifier_model_nat_attention_win10_300_full --dim-size=300
% python cli.py train_classifier coocc_classifier_dataset_win10_nat_full.joblib coocc_classifier_model_nat_win10_300_full --dim-size=300 --no-attention
% python cli.py train_classifier coocc_classifier_dataset_win10_nat_full.joblib coocc_classifier_model_nat_attention_win10_300_balanced_full --dim-size=300 --balanced-weight
% python cli.py train_classifier coocc_classifier_dataset_win10_nat_full.joblib coocc_classifier_model_nat_win10_300_balanced_full --dim-size=300 --no-attention --balanced-weight
To enable our software to run on the TIRA virtual machine, we cache the results of the above classifiers in a single file per category. The cache files can be generated with the following commands:
% python cli.py cache_classifier_results --category=pro page_db_pro.db classifier_results_pro.joblib
% python cli.py cache_classifier_results --category=nat page_db_nat.db classifier_results_nat.joblib
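The caching itself is a joblib round-trip; the structure shown here is hypothetical, while the real files store the classifiers' actual outputs:

```python
# Illustration of caching precomputed classifier outputs with joblib.
import joblib

cache = {"Albert Einstein": {"page_classifier_pro_attention": [0.92, 0.03, 0.05]}}
joblib.dump(cache, "classifier_results_example.joblib")
print(joblib.load("classifier_results_example.joblib"))
```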
We use gradient boosted regression trees (GBRT) to map the outputs of the classifiers described above to the final scores. We train two models: a regression model and a binary classification model. The regression model directly estimates the final scores (ranging from 0 to 7), whereas the classification model outputs 5 for true instances and 2 for false instances.
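The following sketch shows the two scoring variants on dummy data, with scikit-learn assumed as a stand-in for the repository's GBRT code; the true/false binarization threshold is an assumption made for illustration:

```python
# Sketch of the regression and binary GBRT scoring models (scikit-learn assumed).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 30)                    # classifier outputs used as features
y = rng.randint(0, 8, 200)               # gold scores in [0, 7]

reg = GradientBoostingRegressor(learning_rate=0.05, max_depth=4, n_estimators=100)
reg.fit(X, y)
reg_scores = np.clip(np.rint(reg.predict(X)), 0, 7)   # regression model: 0..7

clf = GradientBoostingClassifier(learning_rate=0.01, max_depth=2, n_estimators=100)
clf.fit(X, (y >= 4).astype(int))          # hypothetical true/false binarization
bin_scores = np.where(clf.predict(X) == 1, 5, 2)      # binary model: 5 or 2
```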
Because the training dataset is very small, we use a forward feature-selection algorithm to select a small set of the most useful features (a sketch of the greedy selection loop follows the commands below). Feature selection can be run using the following commands:
Profession:
% python cli.py scorer select_features --k-features=50 --learning-rate=0.04 --max-depth=4 -o features_pro_reg_rate0.04_depth4.json scorer_dataset_pro_reg.joblib
% python cli.py scorer select_features --k-features=50 --learning-rate=0.01 --max-depth=2 -o features_pro_bin_rate0.01_depth2.json scorer_dataset_pro_bin.joblib
Nationality:
% python cli.py scorer select_features --k-features=50 --learning-rate=0.03 --max-depth=2 -o features_nat_reg_rate0.03_depth2.json scorer_dataset_nat_reg.joblib
% python cli.py scorer select_features --k-features=50 --learning-rate=0.01 --max-depth=3 -o features_nat_bin_rate0.01_depth3.json scorer_dataset_nat_bin.joblib
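As referenced above, here is a minimal greedy forward-selection loop; it is an assumption about what select_features does, with scikit-learn used as a stand-in and a cross-validation criterion that may differ from the repository's:

```python
# Greedy forward feature selection: repeatedly add the feature that most
# improves cross-validated performance of a GBRT model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def forward_select(X, y, k_features, learning_rate=0.04, max_depth=4):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k_features:
        def cv_score(f):
            model = GradientBoostingRegressor(
                learning_rate=learning_rate, max_depth=max_depth)
            return cross_val_score(model, X[:, selected + [f]], y, cv=3).mean()
        best = max(remaining, key=cv_score)   # feature with the best added value
        selected.append(best)
        remaining.remove(best)
    return selected
```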
Then, the final GBRT model can be constructed using the following commands:
Profession:
% python cli.py scorer train_model -f features_pro_reg_rate0.04_depth4.json --learning-rate=0.05 --max-depth=4 --min-samples-split=82 --max-features=9 --subsample=1.0 --n-estimators=3000 scorer_dataset_pro_reg.joblib scorer_model_pro_reg.pickle
% python cli.py scorer train_model -f features_pro_bin_rate0.01_depth2.json --learning-rate=0.01 --max-depth=2 --min-samples-split=22 --max-features=17 --subsample=1.0 --n-estimators=1000 scorer_dataset_pro_bin.joblib scorer_model_pro_bin.pickle
Nationality:
% python cli.py scorer train_model -f features_nat_reg_rate0.03_depth2.json --learning-rate=0.045 --max-depth=2 --min-samples-split=47 --max-features=15 --subsample=0.95 --n-estimators=3000 scorer_dataset_nat_reg.joblib scorer_model_nat_reg.pickle
% python cli.py scorer train_model -f features_nat_bin_rate0.01_depth3.json --learning-rate=0.01 --max-depth=3 --min-samples-split=27 --max-features=11 --subsample=1.0 --n-estimators=3000 scorer_dataset_nat_bin.joblib scorer_model_nat_bin.pickle
Now, the final scoring models (i.e., scorer_model_pro_reg.pickle, scorer_model_pro_bin.pickle, scorer_model_nat_reg.pickle, and scorer_model_nat_bin.pickle) should appear in the current directory.
The submission file is generated using the run command:
Predicting scores using the regression model:
% python cli.py scorer run -i profession.test -i nationality.test -o OUTPUT_DIR
Predicting scores using the binary model:
% python cli.py scorer run --binary -i profession.test -i nationality.test -o OUTPUT_DIR
The final submission file should appear in OUTPUT_DIR.
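For illustration, the sketch below writes scored output in a plausible submission layout; it assumes each line of the test file is a tab-separated person/value pair and that an integer score in [0, 7] is appended, which is an assumption about the task's exact file format:

```python
# Hypothetical sketch of writing a scored submission file; score_fn stands in
# for the trained scoring model.
import os

def write_scores(in_path, out_dir, score_fn):
    out_path = os.path.join(out_dir, os.path.basename(in_path))
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            person, value = line.rstrip("\n").split("\t")
            fout.write(f"{person}\t{value}\t{score_fn(person, value)}\n")
```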