This code provides the backend for the BabyLM Challenge's evaluation pipeline.
We provide support for zero-shot evaluations on BLiMP, as well as scripts for fine-tuning HuggingFace-based models on GLUE and MSGS tasks.
We also provide a Colab demo of the evaluation pipeline as a demonstration of how to use the code.
If you have questions about or suggestions for this code, please open an issue and consider joining our Slack. We also welcome pull requests!
To install dependencies, run the following:

```bash
git clone https://github.com/babylm/evaluation-pipeline
cd evaluation-pipeline
pip install -e ".[dev]"
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
```
If your GPU is compatible with CUDA 10, replace all instances of `cu113` with `cu102`.
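For example, the PyTorch install line above would then become (assuming the matching `+cu102` wheels are available for your platform):

```bash
pip install torch==1.11.0+cu102 torchvision==0.12.0+cu102 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu102
```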
We provide versions of BLiMP, GLUE, and MSGS which have been filtered according to the vocabulary of the strict-small dataset. We filter for examples where each word has appeared in our training set at least twice. Unzip the dataset into the root directory of this repository:

```bash
unzip filter_data.zip
```
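For clarity, the filtering criterion amounts to something like the sketch below. This is purely illustrative: the released `filter_data.zip` was produced by the organizers, and the function names here are hypothetical.

```python
from collections import Counter

def build_vocab(training_sentences):
    """Count word occurrences in the whitespace-tokenized training set."""
    counts = Counter()
    for sentence in training_sentences:
        counts.update(sentence.lower().split())
    return counts

def keep_example(example_text, vocab_counts, min_count=2):
    """Keep an example only if every word appears at least min_count times."""
    return all(vocab_counts.get(word, 0) >= min_count
               for word in example_text.lower().split())
```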
To evaluate a model on zero-shot tasks like BLiMP and the held-out BLiMP supplement tasks:

```bash
python babylm_eval.py 'path/to/model_and_tokenizer' 'model_type'
```

where `model_type` is one of "encoder", "decoder", or "encoder-decoder".
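For example, to evaluate an encoder-only (RoBERTa-style) checkpoint saved locally (the path below is just a placeholder):

```bash
python babylm_eval.py 'models/roberta-base-strict-small' 'encoder'
```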
To fine-tune and evaluate a model on tasks that require fine-tuning, like the (Super)GLUE tasks or held-out MSGS tasks:
```bash
./finetune_all_tasks.sh 'path/to/model_and_tokenizer'
```

This script contains hyperparameter defaults that should work for a variety of model sizes, architectures, and tasks. You may adjust these hyperparameters as you wish, though if you do not use the defaults, we ask that you report your best hyperparameter settings in a README file.
Here are the defaults that we use:
Hyperparameter | Value |
---|---|
Initial learning rate | 5e-5 |
Batch size | 64 |
Maximum epochs | 10 |
Evaluate every (steps) | 200 |
Patience | 10 |
Random seed | 12 |
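For reference, here is a minimal sketch of how the defaults in the table would look as HuggingFace `TrainingArguments` plus an early-stopping callback. This is only an illustration of those settings, not the exact code used by `finetune_all_tasks.sh`, and the output directory is a placeholder.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="finetune_output",      # placeholder path
    learning_rate=5e-5,                # initial learning rate
    per_device_train_batch_size=64,    # batch size (shown here as per-device)
    num_train_epochs=10,               # maximum epochs
    evaluation_strategy="steps",
    eval_steps=200,                    # evaluate every 200 steps
    save_steps=200,
    load_best_model_at_end=True,       # required for early stopping
    seed=12,                           # random seed
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=10)
```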
We provide a script that collects your results into a single file:

```bash
./collect_results.py path/to/model_and_tokenizer
```

This will output a file called `all_predictions.json` in the root folder of this repository. We will ask you to upload this file to a submission portal.
We will also ask you to share a link where we can download your model and tokenizer.
If you wish to submit your results and you are not using the `collect_results.py` script, please ensure that your predictions file conforms to the submission format (an example is provided here as `sample_predictions.json`). This is a file consisting of line-separated JSON objects, where each line corresponds to a single subtask. For each line, the JSON object includes a `task` field ("blimp", "glue", "supplement", or "msgs"), a `sub_task` field (the specific task, like "cola" or "anaphor_agreement"), and a `predictions` field, which is a list of JSON objects containing example IDs and predictions for those examples. Here is an example:

```
{"task": "glue", "sub_task": "mnli", "predictions": [{"id": "mnli_0", "pred": 0}, {"id": "mnli_1", "pred": 1}, ..., {"id": "mnli_6561", "pred": 1}]}
```
This evaluation is based on Portelance, Duan, Frank, and Lupyan (2023; see citation below).
If you want to run it, run the zero-shot evaluation script with the `--run_aoa` flag:

```bash
python babylm_eval.py 'path/to/model_and_tokenizer' 'model_type' --run_aoa
```
Note that the evaluation requires access to forward-pass labels from your tokenizer. For "decoder" models, it expects the tokenizer to produce them under the key "labels", where the labels are the shifted "input_ids"; if no labels are provided, it sets "labels" equal to "input_ids" (this is done automatically for "encoder" and "encoder-decoder" models). If your labels are not equal to the input_ids, please make sure your tokenizer provides them under the key "labels".
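For instance, a minimal way to guarantee the key is present (a sketch only, assuming a standard HuggingFace tokenizer; the pipeline already does this for encoder and encoder-decoder models) is:

```python
from transformers import AutoTokenizer

# Sketch: ensure the tokenizer output carries a "labels" key.
tokenizer = AutoTokenizer.from_pretrained("path/to/model_and_tokenizer")
batch = tokenizer("the baby saw the dog", return_tensors="pt")
if "labels" not in batch:
    batch["labels"] = batch["input_ids"].clone()
```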
Once it runs, it will produce two JSON files in a folder called "aoa_prediction" in the model directory you provided. One file contains the model's estimated average surprisal for each word in child-directed utterances taken from CHILDES. The other contains the results of the evaluation. Models are evaluated using leave-one-out cross-validation. The results are mean absolute deviation (MAD) scores, in months, between the actual average age of acquisition (AoA) of these words by American English-speaking children and the AoA predicted from the model's average surprisal scores (the closer the MAD scores are to zero, the better). MAD scores are provided over all words, over nouns, over predicates, and over function words. Previous work has found that models tend to do better at predicting the AoA of predicates and function words than of nouns.
The better the fit between a model's predictions and the actual AoA of words in children (i.e., the smaller the MAD scores), the more the order in which the model learns words resembles the order in which children tend to learn words.
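To make the metric concrete, here is a rough sketch of leave-one-out MAD using a simple surprisal-only linear fit. The actual evaluation (following Portelance et al.) uses a richer regression, so treat this purely as an illustration of how MAD over LOO folds is computed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

def loo_mad(surprisal, aoa_months):
    """Leave-one-out mean absolute deviation (in months) between predicted
    and actual age of acquisition, using a surprisal-only regression."""
    X = np.asarray(surprisal, dtype=float).reshape(-1, 1)
    y = np.asarray(aoa_months, dtype=float)
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        errors.append(abs(model.predict(X[test_idx])[0] - y[test_idx][0]))
    return float(np.mean(errors))
```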
Note that, while we do not require you to run this evaluation or submit your score for our evaluation, we highly encourage you to compute this metric and discuss it in your paper!
We provide a series of baseline models that we train on our strict or strict-small dataset. These are hosted on HuggingFace.
We simply take the hyperparameters used to pre-train the original versions of these models and train them on our strict or strict-small datasets. Aside from reducing the context length and, in some cases, the batch size, the models are otherwise minimally modified.
Here are the baseline scores. The metric for each task is indicated next to its table title. For (Super)GLUE, tasks use accuracy unless otherwise marked (in parentheses) next to the subtask name. F1 denotes macro-F1, and MCC denotes the Matthews correlation coefficient. Random-chance accuracy on all BLiMP tasks is 50.
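If you want to score your own predictions with the same metrics, the standard scikit-learn implementations suffice (a sketch with toy labels; the evaluation pipeline computes these for you):

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

gold = [1, 0, 1, 1, 0]   # toy gold labels
pred = [1, 0, 0, 1, 0]   # toy predictions

print(accuracy_score(gold, pred))             # accuracy (default metric)
print(f1_score(gold, pred, average="macro"))  # macro-F1 (e.g., MRPC, QQP)
print(matthews_corrcoef(gold, pred))          # MCC (e.g., CoLA, MSGS)
```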
Strict-small Track
BLiMP (Acc.)
Model | Anaphor Agr. | Arg. Structure | Binding | Control/Raising | D-N Agr. | Ellipsis | Filler-Gap | Irregular Forms | Island Effects | NPI Licensing | Quantifiers | S-V Agr. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
OPT-125m | 63.8 | 70.6 | 67.1 | 66.5 | 78.5 | 62 | 63.8 | 67.5 | 48.6 | 46.7 | 59.6 | 56.9 |
RoBERTa-base | 81.5 | 67.1 | 67.3 | 67.9 | 90.8 | 76.4 | 63.5 | 87.4 | 39.9 | 55.9 | 70.5 | 65.4 |
T5-base | 68.9 | 63.8 | 60.4 | 60.9 | 72.2 | 34.4 | 48.2 | 77.6 | 45.6 | 47.8 | 61.2 | 65.0 |
BLiMP Supplement (Acc.)
Model | Hypernym | QA Congruence (easy) | QA Congruence (tricky) | Subj.-Aux. Inversion | Turn Taking |
---|---|---|---|---|---|
OPT-125m | 50.0 | 54.7 | 31.5 | 80.3 | 57.1 |
RoBERTa-base | 49.4 | 31.3 | 32.1 | 71.7 | 53.2 |
T5-base | 48.0 | 40.6 | 21.2 | 64.9 | 45.0 |
(Super)GLUE (Default: Acc.)
Model | CoLA (MCC) | SST-2 | MRPC (F1) | QQP (F1) | MNLI | MNLI-mm | QNLI | RTE | BoolQ | MultiRC | WSC |
---|---|---|---|---|---|---|---|---|---|---|---|
Majority label | 0.0 | 50.2 | 82.0 | 53.1 | 35.7 | 35.7 | 35.4 | 53.1 | 50.5 | 59.9 | 53.2 |
OPT-125m | 15.2 | 81.9 | 72.5 | 60.4 | 57.6 | 60.0 | 61.5 | 60.0 | 63.3 | 55.2 | 60.2 |
RoBERTa-base | 25.8 | 87.0 | 79.2 | 73.7 | 73.2 | 74.0 | 77.0 | 61.6 | 66.3 | 61.4 | 61.4 |
T5-base | 11.3 | 78.1 | 80.5 | 66.2 | 48.0 | 50.3 | 62.0 | 49.4 | 66.0 | 47.1 | 61.4 |
MSGS (MCC)
Model | CR (Control) | LC (Control) | MV (Control) | RP (Control) | SC (Control) | CR_LC | CR_RTP | MV_LC | MV_RTP | SC_LC | SC_RP |
---|---|---|---|---|---|---|---|---|---|---|---|
OPT-125m | 50.8 | 53.6 | 99.5 | 99.9 | 77.2 | 0.4 | -70.3 | -72.1 | -77.6 | 13.8 | -68.9 |
RoBERTa-base | 43.1 | 100.0 | 97.7 | 76.7 | 86.2 | -28.3 | -77.7 | -99.3 | -79.4 | 16.3 | -45.0 |
T5-base | 21.1 | 100.0 | 33.4 | 82.5 | 77.6 | -78.3 | -62.0 | -100.0 | -79.7 | -25.3 | -39.4 |
Age-of-acquisition Prediction (Mean absolute deviation in months across LOO cross-validation folds)
Model | Overall (591 words) | Nouns (322) | Predicates (167) | Function words (102) |
---|---|---|---|---|
OPT-125m | 2.03 | 1.98 | 1.81 | 2.57 |
RoBERTa-base | 2.06 | 1.99 | 1.85 | 2.65 |
T5-base | 2.04 | 1.97 | 1.82 | 2.64 |
Strict Track
BLiMP (Acc.)
Model | Anaphor Agr. | Arg. Structure | Binding | Control/Raising | D-N Agr. | Ellipsis | Filler-Gap | Irregular Forms | Island Effects | NPI Licensing | Quantifiers | S-V Agr. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
OPT-125m | 94.9 | 73.8 | 73.8 | 72.2 | 93.1 | 80.5 | 73.6 | 80.8 | 57.8 | 51.6 | 74.5 | 77.3 |
RoBERTa-base | 89.5 | 71.3 | 71 | 67.1 | 93.1 | 83.8 | 68.0 | 89.6 | 54.5 | 66.3 | 70.3 | 76.2 |
T5-base | 66.7 | 61.2 | 59.4 | 59.8 | 53.8 | 49.1 | 70.0 | 75.5 | 43.6 | 45.6 | 34.2 | 53.2 |
BLiMP Supplement (Acc.)
Model | Hypernym | QA Congruence (easy) | QA Congruence (tricky) | Subj.-Aux. Inversion | Turn Taking |
---|---|---|---|---|---|
OPT-125m | 46.3 | 76.5 | 47.9 | 85.3 | 82.9 |
RoBERTa-base | 50.8 | 34.4 | 34.5 | 45.6 | 46.8 |
T5-base | 51.1 | 45.3 | 25.5 | 69.2 | 48.9 |
(Super)GLUE (Default: Acc.)
Model | CoLA (MCC) | SST-2 | MRPC (F1) | QQP (F1) | MNLI | MNLI-mm | QNLI | RTE | BoolQ | MultiRC | WSC |
---|---|---|---|---|---|---|---|---|---|---|---|
Majority label | 0.0 | 50.2 | 82 | 53.1 | 35.7 | 35.7 | 35.4 | 53.1 | 50.5 | 59.9 | 53.2 |
OPT-125m | 36.2 | 86.6 | 82.1 | 77.8 | 70.1 | 71.9 | 80.1 | 67.7 | 66.0 | 61.1 | 59.0 |
RoBERTa-base | 45.3 | 88.6 | 80.5 | 78.5 | 68.7 | 78.0 | 82.3 | 51.5 | 59.9 | 61.3 | 61.4 |
T5-base | 37.5 | 88.0 | 85.9 | 79.7 | 71.5 | 74.0 | 83.1 | 60.6 | 69.0 | 62.4 | 60.2 |
MSGS (MCC)
Model | CR (Control) | LC (Control) | MV (Control) | RP (Control) | SC (Control) | CR_LC | CR_RTP | MV_LC | MV_RTP | SC_LC | SC_RP |
---|---|---|---|---|---|---|---|---|---|---|---|
OPT-125m | 89.1 | 42.3 | 99.9 | 99.1 | 52.1 | 35.5 | -70.3 | -76.2 | -99.5 | 34.7 | -60.5 |
RoBERTa-base | 74.7 | 100.0 | 99.9 | 100.0 | 59.2 | -89.0 | -91.2 | -99.8 | -15.3 | -57.7 | -39.2 |
T5-base | 81.0 | 100.0 | 100.0 | 99.4 | 56.2 | -1.0 | -71.2 | -97.5 | -94.0 | -32.2 | -64.9 |
Age-of-acquisition Prediction (Mean absolute deviation in months across LOO cross-validation folds)
Model | Overall (591 words) | Nouns (322) | Predicates (167) | Function words (102) |
---|---|---|---|---|
OPT-125m | 2.04 | 1.97 | 1.83 | 2.61 |
RoBERTa-base | 2.06 | 1.99 | 1.82 | 2.66 |
T5-base | 2.06 | 2.0 | 1.83 | 2.65 |
These are naïve baselines that are meant to provide a starting point for investigation. We look forward to seeing how you will improve upon these!
If you use the datasets or code from this repository, please cite the BabyLM Call for Papers:
```bibtex
@article{warstadt2023papers,
title = {Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus},
author = {Warstadt, Alex and
Choshen, Leshem and
Mueller, Aaron and
Williams, Adina and
Wilcox, Ethan and
Zhuang, Chengxu},
year = {2023},
journal = {Computing Research Repository},
volume = {arXiv:2301.11796}
}
```
Please also cite the lm-eval-harness paper:
```bibtex
@software{eval-harness,
author = {Gao, Leo and
Tow, Jonathan and
Biderman, Stella and
Black, Sid and
DiPofi, Anthony and
Foster, Charles and
Golding, Laurence and
Hsu, Jeffrey and
McDonell, Kyle and
Muennighoff, Niklas and
Phang, Jason and
Reynolds, Laria and
Tang, Eric and
Thite, Anish and
Wang, Ben and
Wang, Kevin and
Zou, Andy},
title = {A framework for few-shot language model evaluation},
month = sep,
year = 2021,
publisher = {Zenodo},
version = {v0.0.1},
doi = {10.5281/zenodo.5371628},
url = {https://doi.org/10.5281/zenodo.5371628}
}
```
Please cite the following if you choose to include the Age-of-acquisition prediction evaluation:
```bibtex
@article{portelance2023predicting,
author = {Portelance, Eva and Duan, Yuguang and Frank, Michael C. and Lupyan, Gary},
title = {Predicting age of acquisition for children’s early vocabulary in five languages using language model surprisal},
year = {To Appear},
journal = {Cognitive Science},
url = {https://github.com/evaportelance/multilingual-aoa-prediction}
}
```