PyTorch Pretrained BERT: The Big & Extending Repository of pretrained Transformers

This repository contains op-for-op PyTorch reimplementations, pre-trained models and fine-tuning examples for:

These implementations have been tested on several datasets (see the examples) and should match the performances of the associated TensorFlow implementations (e.g. ~91 F1 on SQuAD for BERT, ~88 F1 on RocStories for OpenAI GPT and ~18.3 perplexity on WikiText 103 for the Transformer-XL). You can find more details in the Examples section below.

Here are some information on these models:

BERT was released together with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. This PyTorch implementation of BERT is provided with Google's pre-trained models, examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT is also provided.

OpenAI GPT was released together with the paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. This PyTorch implementation of OpenAI GPT is an adaptation of the PyTorch implementation by HuggingFace and is provided with OpenAI's pre-trained model and a command-line interface that was used to convert the pre-trained NumPy checkpoint in PyTorch.

Google/CMU's Transformer-XL was released together with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. This PyTorch implementation of Transformer-XL is an adaptation of the original PyTorch implementation which has been slightly modified to match the performances of the TensorFlow implementation and allow to re-use the pretrained weights. A command-line interface is provided to convert TensorFlow checkpoints in PyTorch models.

OpenAI GPT-2 was released together with the paper Language Models are Unsupervised Multitask Learners by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. This PyTorch implementation of OpenAI GPT-2 is an adaptation of the OpenAI's implementation and is provided with OpenAI's pre-trained model and a command-line interface that was used to convert the TensorFlow checkpoint in PyTorch.

Content

Section	Description
Installation	How to install the package
Overview	Overview of the package
Usage	Quickstart examples
Doc	Detailed documentation
Examples	Detailed examples on how to fine-tune Bert
Notebooks	Introduction on the provided Jupyter Notebooks
TPU	Notes on TPU support and pretraining scripts
Command-line interface	Convert a TensorFlow checkpoint in a PyTorch dump

Installation

This repo was tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 0.4.1/1.0.0

With pip

PyTorch pretrained bert can be installed by pip as follows:

pip install pytorch-pretrained-bert

If you want to reproduce the original tokenization process of the OpenAI GPT paper, you will need to install ftfy (limit to version 4.4.3 if you are using Python 2) and SpaCy :

pip install spacy ftfy==4.4.3
python -m spacy download en

If you don't install ftfy and SpaCy, the OpenAI GPT tokenizer will default to tokenize using BERT's BasicTokenizer followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).

From source

Clone the repository and run:

pip install [--editable] .

Here also, if you want to reproduce the original tokenization process of the OpenAI GPT model, you will need to install ftfy (limit to version 4.4.3 if you are using Python 2) and SpaCy :

pip install spacy ftfy==4.4.3
python -m spacy download en

Again, if you don't install ftfy and SpaCy, the OpenAI GPT tokenizer will default to tokenize using BERT's BasicTokenizer followed by Byte-Pair Encoding (which should be fine for most usage).

A series of tests is included in the tests folder and can be run using pytest (install pytest if needed: pip install pytest).

You can run the tests with the command:

python -m pytest -sv tests/

Overview

This package comprises the following classes that can be imported in Python and are detailed in the Doc section of this readme:

Eight Bert PyTorch models (torch.nn.Module) with pre-trained weights (in the modeling.py file):
- BertModel - raw BERT Transformer model (fully pre-trained),
- BertForMaskedLM - BERT Transformer with the pre-trained masked language modeling head on top (fully pre-trained),
- BertForNextSentencePrediction - BERT Transformer with the pre-trained next sentence prediction classifier on top (fully pre-trained),
- BertForPreTraining - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (fully pre-trained),
- BertForSequenceClassification - BERT Transformer with a sequence classification head on top (BERT Transformer is pre-trained, the sequence classification head is only initialized and has to be trained),
- BertForMultipleChoice - BERT Transformer with a multiple choice head on top (used for task like Swag) (BERT Transformer is pre-trained, the multiple choice classification head is only initialized and has to be trained),
- BertForTokenClassification - BERT Transformer with a token classification head on top (BERT Transformer is pre-trained, the token classification head is only initialized and has to be trained),
- BertForQuestionAnswering - BERT Transformer with a token classification head on top (BERT Transformer is pre-trained, the token classification head is only initialized and has to be trained).
Three OpenAI GPT PyTorch models (torch.nn.Module) with pre-trained weights (in the modeling_openai.py file):
- OpenAIGPTModel - raw OpenAI GPT Transformer model (fully pre-trained),
- OpenAIGPTLMHeadModel - OpenAI GPT Transformer with the tied language modeling head on top (fully pre-trained),
- OpenAIGPTDoubleHeadsModel - OpenAI GPT Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT Transformer is pre-trained, the multiple choice classification head is only initialized and has to be trained),
Two Transformer-XL PyTorch models (torch.nn.Module) with pre-trained weights (in the modeling_transfo_xl.py file):
- TransfoXLModel - Transformer-XL model which outputs the last hidden state and memory cells (fully pre-trained),
- TransfoXLLMHeadModel - Transformer-XL with the tied adaptive softmax head on top for language modeling which outputs the logits/loss and memory cells (fully pre-trained),
Three OpenAI GPT-2 PyTorch models (torch.nn.Module) with pre-trained weights (in the modeling_gpt2.py file):
- GPT2Model - raw OpenAI GPT-2 Transformer model (fully pre-trained),
- GPT2LMHeadModel - OpenAI GPT-2 Transformer with the tied language modeling head on top (fully pre-trained),
- GPT2DoubleHeadsModel - OpenAI GPT-2 Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT-2 Transformer is pre-trained, the multiple choice classification head is only initialized and has to be trained),
Tokenizers for BERT (using word-piece) (in the tokenization.py file):
- BasicTokenizer - basic tokenization (punctuation splitting, lower casing, etc.),
- WordpieceTokenizer - WordPiece tokenization,
- BertTokenizer - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
Tokenizer for OpenAI GPT (using Byte-Pair-Encoding) (in the tokenization_openai.py file):
- OpenAIGPTTokenizer - perform Byte-Pair-Encoding (BPE) tokenization.
Tokenizer for Transformer-XL (word tokens ordered by frequency for adaptive softmax) (in the tokenization_transfo_xl.py file):
- OpenAIGPTTokenizer - perform word tokenization and can order words by frequency in a corpus for use in an adaptive softmax.
Tokenizer for OpenAI GPT-2 (using byte-level Byte-Pair-Encoding) (in the tokenization_gpt2.py file):
- GPT2Tokenizer - perform byte-level Byte-Pair-Encoding (BPE) tokenization.
Optimizer for BERT (in the optimization.py file):
- BertAdam - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
Optimizer for OpenAI GPT (in the optimization_openai.py file):
- OpenAIAdam - OpenAI GPT version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
Configuration classes for BERT, OpenAI GPT and Transformer-XL (in the respective modeling.py, modeling_openai.py, modeling_transfo_xl.py files):
- BertConfig - Configuration class to store the configuration of a BertModel with utilities to read and write from JSON configuration files.
- OpenAIGPTConfig - Configuration class to store the configuration of a OpenAIGPTModel with utilities to read and write from JSON configuration files.
- GPT2Config - Configuration class to store the configuration of a GPT2Model with utilities to read and write from JSON configuration files.
- TransfoXLConfig - Configuration class to store the configuration of a TransfoXLModel with utilities to read and write from JSON configuration files.

The repository further comprises:

Five examples on how to use BERT (in the examples folder):
- extract_features.py - Show how to extract hidden states from an instance of BertModel,
- run_classifier.py - Show how to fine-tune an instance of BertForSequenceClassification on GLUE's MRPC task,
- run_squad.py - Show how to fine-tune an instance of BertForQuestionAnswering on SQuAD v1.0 and SQuAD v2.0 tasks.
- run_swag.py - Show how to fine-tune an instance of BertForMultipleChoice on Swag task.
- simple_lm_finetuning.py - Show how to fine-tune an instance of BertForPretraining on a target text corpus.
One example on how to use OpenAI GPT (in the examples folder):
- run_openai_gpt.py - Show how to fine-tune an instance of OpenGPTDoubleHeadsModel on the RocStories task.
One example on how to use Transformer-XL (in the examples folder):
- run_transfo_xl.py - Show how to load and evaluate a pre-trained model of TransfoXLLMHeadModel on WikiText 103.
One example on how to use OpenAI GPT-2 in the unconditional and interactive mode (in the examples folder):
- run_gpt2.py - Show how to use OpenAI GPT-2 an instance of GPT2LMHeadModel to generate text (same as the original OpenAI GPT-2 examples).
These examples are detailed in the Examples section of this readme.
Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the notebooks folder):
- Comparing-TF-and-PT-models.ipynb - Compare the hidden states predicted by BertModel,
- Comparing-TF-and-PT-models-SQuAD.ipynb - Compare the spans predicted by BertForQuestionAnswering instances,
- Comparing-TF-and-PT-models-MLM-NSP.ipynb - Compare the predictions of the BertForPretraining instances.
These notebooks are detailed in the Notebooks section of this readme.
A command-line interface to convert TensorFlow checkpoints (BERT, Transformer-XL) or NumPy checkpoint (OpenAI) in a PyTorch save of the associated PyTorch model:

This CLI is detailed in the Command-line interface section of this readme.

Usage

BERT

Here is a quick-start example using BertTokenizer, BertModel and BertForMaskedLM class with Google AI's pre-trained Bert base uncased model. See the doc section below for all the details on these classes.

First let's prepare a tokenized input with BertTokenizer

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenized input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

Let's see how to use BertModel to get hidden states

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')

# Predict hidden states features for each layer
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)
# We have a hidden states for each of the 12 layers in model bert-base-uncased
assert len(encoded_layers) == 12

And how to use BertForMaskedLM

# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)

# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'

OpenAI GPT

Here is a quick-start example using OpenAIGPTTokenizer, OpenAIGPTModel and OpenAIGPTLMHeadModel class with OpenAI's pre-trained model. See the doc section below for all the details on these classes.

First let's prepare a tokenized input with OpenAIGPTTokenizer

import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

# Tokenized input
text = "Who was Jim Henson ? Jim Henson was a puppeteer"
tokenized_text = tokenizer.tokenize(text)

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])

Let's see how to use OpenAIGPTModel to get hidden states

# Load pre-trained model (weights)
model = OpenAIGPTModel.from_pretrained('openai-gpt')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')

# Predict hidden states features for each layer
with torch.no_grad():
    hidden_states = model(tokens_tensor)

And how to use OpenAIGPTLMHeadModel

# Load pre-trained model (weights)
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    predictions = model(tokens_tensor)

# get the predicted last token
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == '.</w>'

And how to use OpenAIGPTDoubleHeadsModel

# Load pre-trained model (weights)
model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')
model.eval()

#  Prepare tokenized input
text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
tokenized_text1 = tokenizer.tokenize(text1)
tokenized_text2 = tokenizer.tokenize(text2)
indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])

# Predict hidden states features for each layer
with torch.no_grad():
    lm_logits, multiple_choice_logits = model(tokens_tensor, mc_token_ids)

Transformer-XL

Here is a quick-start example using TransfoXLTokenizer, TransfoXLModel and TransfoXLModelLMHeadModel class with the Transformer-XL model pre-trained on WikiText-103. See the doc section below for all the details on these classes.

First let's prepare a tokenized input with TransfoXLTokenizer

import torch
from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary from wikitext 103)
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')

# Tokenized input
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"
tokenized_text_1 = tokenizer.tokenize(text_1)
tokenized_text_2 = tokenizer.tokenize(text_2)

# Convert token to vocabulary indices
indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokenized_text_1)
indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokenized_text_2)

# Convert inputs to PyTorch tensors
tokens_tensor_1 = torch.tensor([indexed_tokens_1])
tokens_tensor_2 = torch.tensor([indexed_tokens_2])

Let's see how to use TransfoXLModel to get hidden states

# Load pre-trained model (weights)
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')

with torch.no_grad():
    # Predict hidden states features for each layer
    hidden_states_1, mems_1 = model(tokens_tensor_1)
    # We can re-use the memory cells in a subsequent call to attend a longer context
    hidden_states_2, mems_2 = model(tokens_tensor_2, mems=mems_1)

And how to use TransfoXLLMHeadModel

# Load pre-trained model (weights)
model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')

with torch.no_grad():
    # Predict all tokens
    predictions_1, mems_1 = model(tokens_tensor_1)
    # We can re-use the memory cells in a subsequent call to attend a longer context
    predictions_2, mems_2 = model(tokens_tensor_2, mems=mems_1)

# get the predicted last token
predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'who'

OpenAI GPT-2

Here is a quick-start example using GPT2Tokenizer, GPT2Model and GPT2LMHeadModel class with OpenAI's pre-trained model. See the doc section below for all the details on these classes.

First let's prepare a tokenized input with GPT2Tokenizer

import torch
from pytorch_pretrained_bert import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode some inputs
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"
indexed_tokens_1 = tokenizer.encode(text_1)
indexed_tokens_2 = tokenizer.encode(text_2)

# Convert inputs to PyTorch tensors
tokens_tensor_1 = torch.tensor([indexed_tokens_1])
tokens_tensor_2 = torch.tensor([indexed_tokens_2])

Let's see how to use GPT2Model to get hidden states

# Load pre-trained model (weights)
model = GPT2Model.from_pretrained('gpt2')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')

# Predict hidden states features for each layer
with torch.no_grad():
    hidden_states_1, past = model(tokens_tensor_1)
    # past can be used to reuse precomputed hidden state in a subsequent predictions
    # (see beam-search examples in the run_gpt2.py example).
    hidden_states_2, past = model(tokens_tensor_2, past=past)

And how to use GPT2LMHeadModel

# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    predictions_1, past = model(tokens_tensor_1)
    # past can be used to reuse precomputed hidden state in a subsequent predictions
    # (see beam-search examples in the run_gpt2.py example).
    predictions_2, past = model(tokens_tensor_2, past=past)

# get the predicted last token
predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
predicted_token = tokenizer.decode([predicted_index])

And how to use GPT2DoubleHeadsModel

# Load pre-trained model (weights)
model = GPT2DoubleHeadsModel.from_pretrained('gpt2')
model.eval()

#  Prepare tokenized input
text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
tokenized_text1 = tokenizer.tokenize(text1)
tokenized_text2 = tokenizer.tokenize(text2)
indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])

# Predict hidden states features for each layer
with torch.no_grad():
    lm_logits, multiple_choice_logits, past = model(tokens_tensor, mc_token_ids)

Doc

Here is a detailed documentation of the classes in the package and how to use them:

Sub-section	Description
Loading pre-trained weights	How to load Google AI/OpenAI's pre-trained weight or a PyTorch saved instance
Serialization best-practices	How to save and reload a fine-tuned model
Configurations	API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL
Models	API of the PyTorch model classes for BERT, GPT, GPT-2 and Transformer-XL
Tokenizers	API of the tokenizers class for BERT, GPT, GPT-2 and Transformer-XL
Optimizers	API of the optimizers

Loading Google AI or OpenAI pre-trained weights or PyTorch dump

`from_pretrained()` method

To load one of Google AI's, OpenAI's pre-trained models or a PyTorch saved model (an instance of BertForPreTraining saved with torch.save()), the PyTorch model classes and the tokenizer can be instantiated using the from_pretrained() method:

model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None, from_tf=False, state_dict=None, *input, **kwargs)

where

BERT_CLASS is either a tokenizer to load the vocabulary (BertTokenizer or OpenAIGPTTokenizer classes) or one of the eight BERT or three OpenAI GPT PyTorch model classes (to load the pre-trained weights): BertModel, BertForMaskedLM, BertForNextSentencePrediction, BertForPreTraining, BertForSequenceClassification, BertForTokenClassification, BertForMultipleChoice, BertForQuestionAnswering, OpenAIGPTModel, OpenAIGPTLMHeadModel or OpenAIGPTDoubleHeadsModel, and
PRE_TRAINED_MODEL_NAME_OR_PATH is either:
- the shortcut name of a Google AI's or OpenAI's pre-trained model selected in the list:
  - bert-base-uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
  - bert-large-uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  - bert-base-cased: 12-layer, 768-hidden, 12-heads , 110M parameters
  - bert-large-cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  - bert-base-multilingual-uncased: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  - bert-base-multilingual-cased: (New, recommended) 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  - bert-base-chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
  - bert-base-german-cased: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters Performance Evaluation
  - bert-large-uncased-whole-word-masking: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
  - bert-large-cased-whole-word-masking: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
  - bert-large-uncased-whole-word-masking-finetuned-squad: The bert-large-uncased-whole-word-masking model finetuned on SQuAD (using the run_squad.py examples). Results: exact_match: 86.91579943235573, f1: 93.1532499015869
  - openai-gpt: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
  - gpt2: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
  - gpt2-medium: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
  - transfo-xl-wt103: Transformer-XL English model trained on wikitext-103, 18-layer, 1024-hidden, 16-heads, 257M parameters
- a path or url to a pretrained model archive containing:
  - bert_config.json or openai_gpt_config.json a configuration file for the model, and
  - pytorch_model.bin a PyTorch dump of a pre-trained instance of BertForPreTraining, OpenAIGPTModel, TransfoXLModel, GPT2LMHeadModel (saved with the usual torch.save())
If PRE_TRAINED_MODEL_NAME_OR_PATH is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links here) and stored in a cache folder to avoid future download (the cache folder can be found at ~/.pytorch_pretrained_bert/).
cache_dir can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example cache_dir='./pretrained_model_{}'.format(args.local_rank) (see the section on distributed training for more information).
from_tf: should we load the weights from a locally saved TensorFlow checkpoint
state_dict: an optional state dictionnary (collections.OrderedDict object) to use instead of Google pre-trained models
*inputs, **kwargs: additional input for the specific Bert class (ex: num_labels for BertForSequenceClassification)

Uncased means that the text has been lowercased before WordPiece tokenization, e.g., John Smith becomes john smith. The Uncased model also strips out any accent markers. Cased means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the Multilingual README or the original TensorFlow repository.

When using an uncased model, make sure to pass --do_lower_case to the example training scripts (or pass do_lower_case=True to FullTokenizer if you're using your own script and loading the tokenizer your-self.).

Examples:

# BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# OpenAI GPT
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTModel.from_pretrained('openai-gpt')

# Transformer-XL
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')

# OpenAI GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

Cache directory

pytorch_pretrained_bert save the pretrained weights in a cache directory which is located at (in this order of priority):

cache_dir optional arguments to the from_pretrained() method (see above),
shell environment variable PYTORCH_PRETRAINED_BERT_CACHE,
PyTorch cache home + /pytorch_pretrained_bert/ where PyTorch cache home is defined by (in this order):
- shell environment variable ENV_TORCH_HOME
- shell environment variable ENV_XDG_CACHE_HOME + /torch/)
- default: ~/.cache/torch/

Usually, if you don't set any specific environment variable, pytorch_pretrained_bert cache will be at ~/.cache/torch/pytorch_pretrained_bert/.

You can alsways safely delete pytorch_pretrained_bert cache but the pretrained model weights and vocabulary files wil have to be re-downloaded from our S3.

Serialization best-practices

This section explain how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL). There are three types of files you need to save to be able to reload a fine-tuned model:

the model it-self which should be saved following PyTorch serialization best practices,
the configuration file of the model which is saved as a JSON file, and
the vocabulary (and the merges for the BPE-based models GPT and GPT-2).

The default filenames of these files are as follow:

the model weights file: pytorch_model.bin,
the configuration file: config.json,
the vocabulary file: vocab.txt for BERT and Transformer-XL, vocab.json for GPT/GPT-2 (BPE vocabulary),
for GPT/GPT-2 (BPE vocabulary) the additional merges file: merges.txt.

If you save a model using these default filenames, you can then re-load the model and tokenizer using the from_pretrained() method.

Here is the recommended way of saving the model, configuration and vocabulary to an output_dir directory and reloading the model and tokenizer afterwards:

from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME

output_dir = "./models/"

# Step 1: Save a model, configuration and vocabulary that you have fine-tuned

# If we have a distributed model, save only the encapsulated model
# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
model_to_save = model.module if hasattr(model, 'module') else model

# If we save using the predefined names, we can load using `from_pretrained`
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)

torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(output_dir)

# Step 2: Re-load the saved model and vocabulary

# Example for a Bert model
model = BertForQuestionAnswering.from_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case)  # Add specific options if needed
# Example for a GPT model
model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)

Here is another way you can save and reload the model if you want to use specific paths for each type of files:

output_model_file = "./models/my_own_model_file.bin"
output_config_file = "./models/my_own_config_file.bin"
output_vocab_file = "./models/my_own_vocab_file.bin"

# Step 1: Save a model, configuration and vocabulary that you have fine-tuned

# If we have a distributed model, save only the encapsulated model
# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
model_to_save = model.module if hasattr(model, 'module') else model

torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(output_vocab_file)

# Step 2: Re-load the saved model and vocabulary

# We didn't save using the predefined WEIGHTS_NAME, CONFIG_NAME names, we cannot load using `from_pretrained`.
# Here is how to do it in this situation:

# Example for a Bert model
config = BertConfig.from_json_file(output_config_file)
model = BertForQuestionAnswering(config)
state_dict = torch.load(output_model_file)
model.load_state_dict(state_dict)
tokenizer = BertTokenizer(output_vocab_file, do_lower_case=args.do_lower_case)

# Example for a GPT model
config = OpenAIGPTConfig.from_json_file(output_config_file)
model = OpenAIGPTDoubleHeadsModel(config)
state_dict = torch.load(output_model_file)
model.load_state_dict(state_dict)
tokenizer = OpenAIGPTTokenizer(output_vocab_file)

Configurations

Models (BERT, GPT, GPT-2 and Transformer-XL) are defined and build from configuration classes which containes the parameters of the models (number of layers, dimensionalities...) and a few utilities to read and write from JSON configuration files. The respective configuration classes are:

BertConfig for BertModel and BERT classes instances.
OpenAIGPTConfig for OpenAIGPTModel and OpenAI GPT classes instances.
GPT2Config for GPT2Model and OpenAI GPT-2 classes instances.
TransfoXLConfig for TransfoXLModel and Transformer-XL classes instances.

These configuration classes contains a few utilities to load and save configurations:

from_dict(cls, json_object): A class method to construct a configuration from a Python dictionary of parameters. Returns an instance of the configuration class.
from_json_file(cls, json_file): A class method to construct a configuration from a json file of parameters. Returns an instance of the configuration class.
to_dict(): Serializes an instance to a Python dictionary. Returns a dictionary.
to_json_string(): Serializes an instance to a JSON string. Returns a string.
to_json_file(json_file_path): Save an instance to a json file.

Models

1. `BertModel`

BertModel is the basic BERT Transformer model with a layer of summed token, position and sequence embeddings followed by a series of identical self-attention blocks (12 for BERT-base, 24 for BERT-large).

Instantiation: The model can be instantiated with the following arguments:

config: a BertConfig class instance with the configuration to build a new model.
output_attentions: If True, also output attentions weights computed by the model at each layer. Default: False
keep_multihead_output: If True, saves output of the multi-head attention module with its gradient. This can be used to compute head importance metrics. Default: False

The inputs and output are identical to the TensorFlow model inputs and outputs.

We detail them here. This model takes as inputs: modeling.py

input_ids: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts extract_features.py, run_classifier.py and run_squad.py), and
token_type_ids: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token types indices selected in [0, 1]. Type 0 corresponds to a sentence A and type 1 corresponds to a sentence B token (see BERT paper for more details).
attention_mask: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if some input sequence lengths are smaller than the max input sequence length of the current batch. It's the mask that we typically use for attention when a batch has varying length sentences.
output_all_encoded_layers: boolean which controls the content of the encoded_layers output as described below. Default: True.
head_mask: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. It's a mask to be used to nullify some heads of the transformer. 0.0 => head is fully masked, 1.0 => head is not masked.

This model outputs a tuple composed of:

encoded_layers: controled by the value of the output_encoded_layers argument:
- output_all_encoded_layers=True: outputs a list of the encoded-hidden-states at the end of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
- output_all_encoded_layers=False: outputs only the encoded-hidden-states corresponding to the last attention block, i.e. a single torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
pooled_output: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a classifier pretrained on top of the hidden state associated to the first character of the input (CLF) to train on the Next-Sentence task (see BERT's paper).

An example on how to use this class is given in the extract_features.py script which can be used to extract the hidden states of the model for a given input.

2. `BertForPreTraining`

BertForPreTraining includes the BertModel Transformer followed by the two pre-training heads:

the masked language modeling head, and
the next sentence classification head.

Inputs comprises the inputs of the BertModel class plus two optional labels:

masked_lm_labels: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size]
next_sentence_label: next sentence classification loss: torch.LongTensor of shape [batch_size] with indices selected in [0, 1]. 0 => next sentence is the continuation, 1 => next sentence is a random sentence.

Outputs:

if masked_lm_labels and next_sentence_label are not None: Outputs the total_loss which is the sum of the masked language modeling loss and the next sentence classification loss.
if masked_lm_labels or next_sentence_label is None: Outputs a tuple comprising
- the masked language modeling logits, and
- the next sentence classification logits.

There are two examples on how to use this class is given in the lm_finetuning/ directory. The scripts in this directory can be used to fine-tune the BERT language model. This should improve model performance, if the language style is different from the original BERT training corpus (Wiki + BookCorpus).

3. `BertForMaskedLM`

BertForMaskedLM includes the BertModel Transformer followed by the (possibly) pre-trained masked language modeling head.

Inputs comprises the inputs of the BertModel class plus optional label:

masked_lm_labels: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size]

Outputs:

if masked_lm_labels is not None: Outputs the masked language modeling loss.
if masked_lm_labels is None: Outputs the masked language modeling logits.

4. `BertForNextSentencePrediction`

BertForNextSentencePrediction includes the BertModel Transformer followed by the next sentence classification head.

Inputs comprises the inputs of the BertModel class plus an optional label:

next_sentence_label: next sentence classification loss: torch.LongTensor of shape [batch_size] with indices selected in [0, 1]. 0 => next sentence is the continuation, 1 => next sentence is a random sentence.

Outputs:

if next_sentence_label is not None: Outputs the next sentence classification loss.
if next_sentence_label is None: Outputs the next sentence classification logits.

5. `BertForSequenceClassification`

BertForSequenceClassification is a fine-tuning model that includes BertModel and a sequence-level (sequence or pair of sequences) classifier on top of the BertModel.

The sequence-level classifier is a linear layer that takes as input the last hidden state of the first character in the input sequence (see Figures 3a and 3b in the BERT paper).

An example on how to use this class is given in the run_classifier.py script which can be used to fine-tune a single sequence (or pair of sequence) classifier using BERT, for example for the MRPC task.

6. `BertForMultipleChoice`

BertForMultipleChoice is a fine-tuning model that includes BertModel and a linear layer on top of the BertModel.

The linear layer outputs a single value for each choice of a multiple choice problem, then all the outputs corresponding to an instance are passed through a softmax to get the model choice.

This implementation is largely inspired by the work of OpenAI in Improving Language Understanding by Generative Pre-Training and the answer of Jacob Devlin in the following issue.

An example on how to use this class is given in the run_swag.py script which can be used to fine-tune a multiple choice classifier using BERT, for example for the Swag task.

7. `BertForTokenClassification`

BertForTokenClassification is a fine-tuning model that includes BertModel and a token-level classifier on top of the BertModel.

The token-level classifier is a linear layer that takes as input the last hidden state of the sequence.

8. `BertForQuestionAnswering`

BertForQuestionAnswering is a fine-tuning model that includes BertModel with a token-level classifiers on top of the full sequence of last hidden states.

The token-level classifier takes as input the full sequence of the last hidden state and compute several (e.g. two) scores for each tokens that can for example respectively be the score that a given token is a start_span and a end_span token (see Figures 3c and 3d in the BERT paper).

An example on how to use this class is given in the run_squad.py script which can be used to fine-tune a token classifier using BERT, for example for the SQuAD task.

9. `OpenAIGPTModel`

OpenAIGPTModel is the basic OpenAI GPT Transformer model with a layer of summed token and position embeddings followed by a series of 12 identical self-attention blocks.

OpenAI GPT use a single embedding matrix to store the word and special embeddings. Special tokens embeddings are additional tokens that are not pre-trained: [SEP], [CLS]... Special tokens need to be trained during the fine-tuning if you use them. The number of special embeddings can be controled using the set_num_special_tokens(num_special_tokens) function.

The embeddings are ordered as follow in the token embeddings matrice:

    [0,                                                         ----------------------
      ...                                                        -> word embeddings
      config.vocab_size - 1,                                     ______________________
      config.vocab_size,
      ...                                                        -> special embeddings
      config.vocab_size + config.n_special - 1]                  ______________________

where total_tokens_embeddings can be obtained as config.total_tokens_embeddings and is: total_tokens_embeddings = config.vocab_size + config.n_special You should use the associate indices to index the embeddings.

Instantiation: The model can be instantiated with the following arguments:

config: a OpenAIConfig class instance with the configuration to build a new model.
output_attentions: If True, also output attentions weights computed by the model at each layer. Default: False
keep_multihead_output: If True, saves output of the multi-head attention module with its gradient. This can be used to compute head importance metrics. Default: False

The inputs and output are identical to the TensorFlow model inputs and outputs.

We detail them here. This model takes as inputs: modeling_openai.py

input_ids: a torch.LongTensor of shape [batch_size, sequence_length] (or more generally [d_1, ..., d_n, sequence_length] were d_1 ... d_n are arbitrary dimensions) with the word BPE token indices selected in the range [0, total_tokens_embeddings[
position_ids: an optional torch.LongTensor with the same shape as input_ids with the position indices (selected in the range [0, config.n_positions - 1[.
token_type_ids: an optional torch.LongTensor with the same shape as input_ids You can use it to add a third type of embedding to each input token in the sequence (the previous two being the word and position embeddings). The input, position and token_type embeddings are summed inside the Transformer before the first self-attention block.
head_mask: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. It's a mask to be used to nullify some heads of the transformer. 0.0 => head is fully masked, 1.0 => head is not masked.

This model outputs:

hidden_states: a list of all the encoded-hidden-states in the model (length of the list: number of layers + 1 for the output of the embeddings) as torch.FloatTensor of size [batch_size, sequence_length, hidden_size] (or more generally [d_1, ..., d_n, hidden_size] were d_1 ... d_n are the dimension of input_ids)

10. `OpenAIGPTLMHeadModel`

OpenAIGPTLMHeadModel includes the OpenAIGPTModel Transformer followed by a language modeling head with weights tied to the input embeddings (no additional parameters).

Inputs are the same as the inputs of the OpenAIGPTModel class plus optional labels:

lm_labels: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].

Outputs:

if lm_labels is not None: Outputs the language modeling loss.
else: Outputs lm_logits: the language modeling logits as a torch.FloatTensor of size [batch_size, sequence_length, total_tokens_embeddings] (or more generally [d_1, ..., d_n, total_tokens_embeddings] were d_1 ... d_n are the dimension of input_ids)

11. `OpenAIGPTDoubleHeadsModel`

OpenAIGPTDoubleHeadsModel includes the OpenAIGPTModel Transformer followed by two heads:

a language modeling head with weights tied to the input embeddings (no additional parameters) and:
a multiple choice classifier (linear layer that take as input a hidden state in a sequence to compute a score, see details in paper).

Inputs are the same as the inputs of the OpenAIGPTModel class plus a classification mask and two optional labels:

multiple_choice_token_ids: a torch.LongTensor of shape [batch_size, num_choices] with the index of the token whose hidden state should be used as input for the multiple choice classifier (usually the [CLS] token for each choice).
lm_labels: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].
multiple_choice_labels: optional multiple choice labels: torch.LongTensor of shape [batch_size] with indices selected in [0, ..., num_choices].

Outputs:

if lm_labels and multiple_choice_labels are not None: Outputs a tuple of losses with the language modeling loss and the multiple choice loss.
else Outputs a tuple with:
- lm_logits: the language modeling logits as a torch.FloatTensor of size [batch_size, num_choices, sequence_length, total_tokens_embeddings]
- multiple_choice_logits: the multiple choice logits as a torch.FloatTensor of size [batch_size, num_choices]

12. `TransfoXLModel`

The Transformer-XL model is described in "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context".

Transformer XL use a relative positioning with sinusiodal patterns and adaptive softmax inputs which means that:

you don't need to specify positioning embeddings indices
the tokens in the vocabulary have to be sorted to decreasing frequency.

This model takes as inputs: modeling_transfo_xl.py

input_ids: a torch.LongTensor of shape [batch_size, sequence_length] with the token indices selected in the range [0, self.config.n_token[
mems: an optional memory of hidden states from previous forward passes as a list (num layers) of hidden states at the entry of each layer. Each hidden states has shape [self.config.mem_len, bsz, self.config.d_model]. Note that the first two dimensions are transposed in mems with regards to input_ids.

This model outputs a tuple of (last_hidden_state, new_mems)

last_hidden_state: the encoded-hidden-states at the top of the model as a torch.FloatTensor of size [batch_size, sequence_length, self.config.d_model]
new_mems: list (num layers) of updated mem states at the entry of each layer each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]. Note that the first two dimensions are transposed in mems with regards to input_ids.

Extracting a list of the hidden states at each layer of the Transformer-XL from `last_hidden_state` and `new_mems`:

The new_mems contain all the hidden states PLUS the output of the embeddings (new_mems[0]). new_mems[-1] is the output of the hidden state of the layer below the last layer and last_hidden_state is the output of the last layer (i.E. the input of the softmax when we have a language modeling head on top).

There are two differences between the shapes of new_mems and last_hidden_state: new_mems have transposed first dimensions and are longer (of size self.config.mem_len). Here is how to extract the full list of hidden states from the model output:

hidden_states, mems = model(tokens_tensor)
seq_length = hidden_states.size(1)
lower_hidden_states = list(t[-seq_length:, ...].transpose(0, 1) for t in mems)
all_hidden_states = lower_hidden_states + [hidden_states]

13. `TransfoXLLMHeadModel`

TransfoXLLMHeadModel includes the TransfoXLModel Transformer followed by an (adaptive) softmax head with weights tied to the input embeddings.

Inputs are the same as the inputs of the TransfoXLModel class plus optional labels:

target: an optional torch.LongTensor of shape [batch_size, sequence_length] with the target token indices selected in the range [0, self.config.n_token[

Outputs a tuple of (last_hidden_state, new_mems)

softmax_output: output of the (adaptive) softmax:
- if target is None: log probabilities of tokens, shape [batch_size, sequence_length, n_tokens]
- else: Negative log likelihood of target tokens with shape [batch_size, sequence_length]
new_mems: list (num layers) of updated mem states at the entry of each layer each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]. Note that the first two dimensions are transposed in mems with regards to input_ids.

14. `GPT2Model`

GPT2Model is the OpenAI GPT-2 Transformer model with a layer of summed token and position embeddings followed by a series of 12 identical self-attention blocks.

Instantiation: The model can be instantiated with the following arguments:

config: a GPT2Config class instance with the configuration to build a new model.
output_attentions: If True, also output attentions weights computed by the model at each layer. Default: False
keep_multihead_output: If True, saves output of the multi-head attention module with its gradient. This can be used to compute head importance metrics. Default: False

The inputs and output are identical to the TensorFlow model inputs and outputs.

We detail them here. This model takes as inputs: modeling_gpt2.py

input_ids: a torch.LongTensor of shape [batch_size, sequence_length] (or more generally [d_1, ..., d_n, sequence_length] were d_1 ... d_n are arbitrary dimensions) with the word BPE token indices selected in the range [0, vocab_size[
position_ids: an optional torch.LongTensor with the same shape as input_ids with the position indices (selected in the range [0, config.n_positions - 1[.
token_type_ids: an optional torch.LongTensor with the same shape as input_ids You can use it to add a third type of embedding to each input token in the sequence (the previous two being the word and position embeddings). The input, position and token_type embeddings are summed inside the Transformer before the first self-attention block.
past: an optional list of torch.LongTensor that contains pre-computed hidden-states (key and values in the attention blocks) to speed up sequential decoding (this is the presents output of the model, cf. below).
head_mask: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. It's a mask to be used to nullify some heads of the transformer. 0.0 => head is fully masked, 1.0 => head is not masked.

This model outputs:

hidden_states: a list of all the encoded-hidden-states in the model (length of the list: number of layers + 1 for the output of the embeddings) as torch.FloatTensor of size [batch_size, sequence_length, hidden_size] (or more generally [d_1, ..., d_n, hidden_size] were d_1 ... d_n are the dimension of input_ids)
presents: a list of pre-computed hidden-states (key and values in each attention blocks) as a torch.FloatTensors. They can be reused to speed up sequential decoding (see the run_gpt2.py example).

15. `GPT2LMHeadModel`

GPT2LMHeadModel includes the GPT2Model Transformer followed by a language modeling head with weights tied to the input embeddings (no additional parameters).

Inputs are the same as the inputs of the GPT2Model class plus optional labels:

lm_labels: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].

Outputs:

if lm_labels is not None: Outputs the language modeling loss.
else: a tuple of
- lm_logits: the language modeling logits as a torch.FloatTensor of size [batch_size, sequence_length, total_tokens_embeddings] (or more generally [d_1, ..., d_n, total_tokens_embeddings] were d_1 ... d_n are the dimension of input_ids)
- presents: a list of pre-computed hidden-states (key and values in each attention blocks) as a torch.FloatTensors. They can be reused to speed up sequential decoding (see the run_gpt2.py example).

16. `GPT2DoubleHeadsModel`

GPT2DoubleHeadsModel includes the GPT2Model Transformer followed by two heads:

a language modeling head with weights tied to the input embeddings (no additional parameters) and:
a multiple choice classifier (linear layer that take as input a hidden state in a sequence to compute a score, see details in paper).

Inputs are the same as the inputs of the GPT2Model class plus a classification mask and two optional labels:

multiple_choice_token_ids: a torch.LongTensor of shape [batch_size, num_choices] with the index of the token whose hidden state should be used as input for the multiple choice classifier (usually the [CLS] token for each choice).
lm_labels: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].
multiple_choice_labels: optional multiple choice labels: torch.LongTensor of shape [batch_size] with indices selected in [0, ..., num_choices].

Outputs:

if lm_labels and multiple_choice_labels are not None: Outputs a tuple of losses with the language modeling loss and the multiple choice loss.
else Outputs a tuple with:
- lm_logits: the language modeling logits as a torch.FloatTensor of size [batch_size, num_choices, sequence_length, total_tokens_embeddings]
- multiple_choice_logits: the multiple choice logits as a torch.FloatTensor of size [batch_size, num_choices]
- presents: a list of pre-computed hidden-states (key and values in each attention blocks) as a torch.FloatTensors. They can be reused to speed up sequential decoding (see the run_gpt2.py example).

Tokenizers

`BertTokenizer`

BertTokenizer perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.

This class has five arguments:

vocab_file: path to a vocabulary file.
do_lower_case: convert text to lower-case while tokenizing. Default = True.
max_len: max length to filter the input of the Transformer. Default to pre-trained value for the model if None. Default = None
do_basic_tokenize: Do basic tokenization before wordpice tokenization. Set to false if text is pre-tokenized. Default = True.
never_split: a list of tokens that should not be splitted during tokenization. Default = ["[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"]

and three methods:

tokenize(text): convert a str in a list of str tokens by (1) performing basic tokenization and (2) WordPiece tokenization.
convert_tokens_to_ids(tokens): convert a list of str tokens in a list of int indices in the vocabulary.
convert_ids_to_tokens(tokens): convert a list of int indices in a list of str tokens in the vocabulary.
save_vocabulary(directory_path): save the vocabulary file to directory_path. Return the path to the saved vocabulary file: vocab_file_path. The vocabulary can be reloaded with BertTokenizer.from_pretrained('vocab_file_path') or BertTokenizer.from_pretrained('directory_path').

Please refer to the doc strings and code in tokenization.py for the details of the BasicTokenizer and WordpieceTokenizer classes. In general it is recommended to use BertTokenizer unless you know what you are doing.

`OpenAIGPTTokenizer`

OpenAIGPTTokenizer perform Byte-Pair-Encoding (BPE) tokenization.

This class has four arguments:

vocab_file: path to a vocabulary file.
merges_file: path to a file containing the BPE merges.
max_len: max length to filter the input of the Transformer. Default to pre-trained value for the model if None. Default = None
special_tokens: a list of tokens to add to the vocabulary for fine-tuning. If SpaCy is not installed and BERT's BasicTokenizer is used as the pre-BPE tokenizer, these tokens are not split. Default= None

and five methods:

tokenize(text): convert a str in a list of str tokens by performing BPE tokenization.
convert_tokens_to_ids(tokens): convert a list of str tokens in a list of int indices in the vocabulary.
convert_ids_to_tokens(tokens): convert a list of int indices in a list of str tokens in the vocabulary.
set_special_tokens(self, special_tokens): update the list of special tokens (see above arguments)
encode(text): convert a str in a list of int tokens by performing BPE encoding.
decode(ids, skip_special_tokens=False, clean_up_tokenization_spaces=False): decode a list of int indices in a string and do some post-processing if needed: (i) remove special tokens from the output and (ii) clean up tokenization spaces.
save_vocabulary(directory_path): save the vocabulary, merge and special tokens files to directory_path. Return the path to the three files: vocab_file_path, merge_file_path, special_tokens_file_path. The vocabulary can be reloaded with OpenAIGPTTokenizer.from_pretrained('directory_path').

Please refer to the doc strings and code in tokenization_openai.py for the details of the OpenAIGPTTokenizer.

`TransfoXLTokenizer`

TransfoXLTokenizer perform word tokenization. This tokenizer can be used for adaptive softmax and has utilities for counting tokens in a corpus to create a vocabulary ordered by toekn frequency (for adaptive softmax). See the adaptive softmax paper (Efficient softmax approximation for GPUs) for more details.

The API is similar to the API of BertTokenizer (see above).

Please refer to the doc strings and code in tokenization_transfo_xl.py for the details of these additional methods in TransfoXLTokenizer.

`GPT2Tokenizer`

GPT2Tokenizer perform byte-level Byte-Pair-Encoding (BPE) tokenization.

This class has three arguments:

vocab_file: path to a vocabulary file.
merges_file: path to a file containing the BPE merges.
errors: How to handle unicode decoding errors. Default = replace

and two methods:

tokenize(text): convert a str in a list of str tokens by performing byte-level BPE.
convert_tokens_to_ids(tokens): convert a list of str tokens in a list of int indices in the vocabulary.
convert_ids_to_tokens(tokens): convert a list of int indices in a list of str tokens in the vocabulary.
set_special_tokens(self, special_tokens): update the list of special tokens (see above arguments)
encode(text): convert a str in a list of int tokens by performing byte-level BPE.
decode(tokens): convert back a list of int tokens in a str.
save_vocabulary(directory_path): save the vocabulary, merge and special tokens files to directory_path. Return the path to the three files: vocab_file_path, merge_file_path, special_tokens_file_path. The vocabulary can be reloaded with OpenAIGPTTokenizer.from_pretrained('directory_path').

Please refer to tokenization_gpt2.py for more details on the GPT2Tokenizer.

Optimizers

`BertAdam`

BertAdam is a torch.optimizer adapted to be closer to the optimizer used in the TensorFlow implementation of Bert. The differences with PyTorch Adam optimizer are the following:

BertAdam implements weight decay fix,
BertAdam doesn't compensate for bias as in the regular Adam optimizer.

The optimizer accepts the following arguments:

lr : learning rate
warmup : portion of t_total for the warmup, -1 means no warmup. Default : -1
t_total : total number of training steps for the learning rate schedule, -1 means constant learning rate. Default : -1
schedule : schedule to use for the warmup (see above). Can be 'warmup_linear', 'warmup_constant', 'warmup_cosine', 'none', None or a _LRSchedule object (see below). If None or 'none', learning rate is always kept constant. Default : 'warmup_linear'
betas : Adams betas. Default : 0.9, 0.999
e : Adams epsilon. Default : 1e-6
weight_decay: Weight decay. Default : 0.01
max_grad_norm : Maximum norm for the gradients (-1 means no clipping). Default : 1.0

`OpenAIAdam`

OpenAIAdam is similar to BertAdam. The differences with BertAdam is that OpenAIAdam compensate for bias as in the regular Adam optimizer.

OpenAIAdam accepts the same arguments as BertAdam.

Learning Rate Schedules

The .optimization module also provides additional schedules in the form of schedule objects that inherit from _LRSchedule. All _LRSchedule subclasses accept warmup and t_total arguments at construction. When an _LRSchedule object is passed into BertAdam or OpenAIAdam, the warmup and t_total arguments on the optimizer are ignored and the ones in the _LRSchedule object are used. An overview of the implemented schedules:

ConstantLR: always returns learning rate 1.
WarmupConstantSchedule: Linearly increases learning rate from 0 to 1 over warmup fraction of training steps. Keeps learning rate equal to 1. after warmup.
WarmupLinearSchedule: Linearly increases learning rate from 0 to 1 over warmup fraction of training steps. Linearly decreases learning rate from 1. to 0. over remaining 1 - warmup steps.
WarmupCosineSchedule: Linearly increases learning rate from 0 to 1 over warmup fraction of training steps. Decreases learning rate from 1. to 0. over remaining 1 - warmup steps following a cosine curve. If cycles (default=0.5) is different from default, learning rate follows cosine function after warmup.
WarmupCosineWithHardRestartsSchedule: Linearly increases learning rate from 0 to 1 over warmup fraction of training steps. If cycles (default=1.) is different from default, learning rate follows cycles times a cosine decaying learning rate (with hard restarts).
WarmupCosineWithWarmupRestartsSchedule: All training progress is divided in cycles (default=1.) parts of equal length. Every part follows a schedule with the first warmup fraction of the training steps linearly increasing from 0. to 1., followed by a learning rate decreasing from 1. to 0. following a cosine curve. Note that the total number of all warmup steps over all cycles together is equal to warmup * cycles

Examples

Sub-section	Description
Training large models: introduction, tools and examples	How to use gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training to train Bert models
Fine-tuning with BERT: running the examples	Running the examples in `./examples`: `extract_classif.py`, `run_classifier.py`, `run_squad.py` and `lm_finetuning/simple_lm_finetuning.py`
Fine-tuning with OpenAI GPT, Transformer-XL and GPT-2	Running the examples in `./examples`: `run_openai_gpt.py`, `run_transfo_xl.py` and `run_gpt2.py`
Fine-tuning BERT-large on GPUs	How to fine tune `BERT large`

Training large models: introduction, tools and examples

BERT-base and BERT-large are respectively 110M and 340M parameters models and it can be difficult to fine-tune them on a single GPU with the recommended batch size for good performance (in most case a batch size of 32).

To help with fine-tuning these models, we have included several techniques that you can activate in the fine-tuning scripts run_classifier.py and run_squad.py: gradient-accumulation, multi-gpu training, distributed training and 16-bits training . For more details on how to use these techniques you can read the tips on training large batches in PyTorch that I published earlier this month.

Here is how to use these techniques in our scripts:

Gradient Accumulation: Gradient accumulation can be used by supplying a integer greater than 1 to the --gradient_accumulation_steps argument. The batch at each step will be divided by this integer and gradient will be accumulated over gradient_accumulation_steps steps.
Multi-GPU: Multi-GPU is automatically activated when several GPUs are detected and the batches are splitted over the GPUs.
Distributed training: Distributed training can be activated by supplying an integer greater or equal to 0 to the --local_rank argument (see below).
16-bits training: 16-bits training, also called mixed-precision training, can reduce the memory requirement of your model on the GPU by using half-precision training, basically allowing to double the batch size. If you have a recent GPU (starting from NVIDIA Volta architecture) you should see no decrease in speed. A good introduction to Mixed precision training can be found here and a full documentation is here. In our scripts, this option can be activated by setting the --fp16 flag and you can play with loss scaling using the --loss_scale flag (see the previously linked documentation for details on loss scaling). The loss scale can be zero in which case the scale is dynamically adjusted or a positive power of two in which case the scaling is static.

To use 16-bits training and distributed training, you need to install NVIDIA's apex extension as detailed here. You will find more information regarding the internals of apex and how to use apex in the doc and the associated repository. The results of the tests performed on pytorch-BERT by the NVIDIA team (and my trials at reproducing them) can be consulted in the relevant PR of the present repository.

Note: To use Distributed Training, you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see the above mentioned blog post for more details):

python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=$THIS_MACHINE_INDEX --master_addr="192.168.1.1" --master_port=1234 run_classifier.py (--arg1 --arg2 --arg3 and all other arguments of the run_classifier script)

Where $THIS_MACHINE_INDEX is an sequential index assigned to each of your machine (0, 1, 2...) and the machine with rank 0 has an IP address 192.168.1.1 and an open port 1234.

Fine-tuning with BERT: running the examples

We showcase several fine-tuning examples based on (and extended from) the original implementation:

a sequence-level classifier on nine different GLUE tasks,
a token-level classifier on the question answering dataset SQuAD, and
a sequence-level multiple-choice classifier on the SWAG classification corpus.
a BERT language model on another target corpus

GLUE results on dev set

We get the following results on the dev set of GLUE benchmark with an uncased BERT base model. All experiments were run on a P100 GPU with a batch size of 32.

Task	Metric	Result
CoLA	Matthew's corr.	57.29
SST-2	accuracy	93.00
MRPC	F1/accuracy	88.85/83.82
STS-B	Pearson/Spearman corr.	89.70/89.37
QQP	accuracy/F1	90.72/87.41
MNLI	matched acc./mismatched acc.	83.95/84.39
QNLI	accuracy	89.04
RTE	accuracy	61.01
WNLI	accuracy	53.52

Some of these results are significantly different from the ones reported on the test set of GLUE benchmark on the website. For QQP and WNLI, please refer to FAQ #12 on the webite.

Before running anyone of these GLUE tasks you should download the GLUE data by running this script and unpack it to some directory $GLUE_DIR.

export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python run_classifier.py \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/

where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.

The dev set results will be present within the text file 'eval_results.txt' in the specified output_dir. In case of MNLI, since there are two separate dev sets, matched and mismatched, there will be a separate output folder called '/tmp/MNLI-MM/' in addition to '/tmp/MNLI/'.

The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being said, there shouldn't be any issues in running half-precision training with the remaining GLUE tasks as well, since the data processor for each task inherits from the base class DataProcessor.

MRPC

This example code fine-tunes BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed.

Before running this example you should download the GLUE data by running this script and unpack it to some directory $GLUE_DIR.

export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/MRPC/ \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/

Our test ran on a few seeds with the original implementation hyper-parameters gave evaluation results between 84% and 88%.

Fast run with apex and 16 bit precision: fine-tuning on MRPC in 27 seconds! First install apex as indicated here. Then run

export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/MRPC/ \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/ \
  --fp16

Distributed training Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking model to reach a F1 > 92 on MRPC:

python -m torch.distributed.launch --nproc_per_node 8 run_classifier.py   --bert_model bert-large-uncased-whole-word-masking    --task_name MRPC --do_train   --do_eval   --do_lower_case   --data_dir $GLUE_DIR/MRPC/   --max_seq_length 128   --train_batch_size 8   --learning_rate 2e-5   --num_train_epochs 3.0  --output_dir /tmp/mrpc_output/

Training with these hyper-parameters gave us the following results:

  acc = 0.8823529411764706
  acc_and_f1 = 0.901702786377709
  eval_loss = 0.3418912578906332
  f1 = 0.9210526315789473
  global_step = 174
  loss = 0.07231863956341798

Here is an example on MNLI:

python -m torch.distributed.launch --nproc_per_node 8 run_classifier.py   --bert_model bert-large-uncased-whole-word-masking    --task_name mnli --do_train   --do_eval   --do_lower_case   --data_dir /datadrive/bert_data/glue_data//MNLI/   --max_seq_length 128   --train_batch_size 8   --learning_rate 2e-5   --num_train_epochs 3.0   --output_dir ../models/wwm-uncased-finetuned-mnli/ --overwrite_output_dir

***** Eval results *****
  acc = 0.8679706601466992
  eval_loss = 0.4911287787382479
  global_step = 18408
  loss = 0.04755385363816904

***** Eval results *****
  acc = 0.8747965825874695
  eval_loss = 0.45516540421714036
  global_step = 18408
  loss = 0.04755385363816904

This is the example of the bert-large-uncased-whole-word-masking-finetuned-mnli model

SQuAD

This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) on a single tesla V100 16GB.

The data for SQuAD can be downloaded with the following links and should be saved in a $SQUAD_DIR directory.

export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --bert_model bert-base-uncased \
  --do_train \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/

Training with the previous hyper-parameters gave us the following results:

python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json /tmp/debug_squad/predictions.json
{"f1": 88.52381567990474, "exact_match": 81.22043519394512}

distributed training

Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD:

python -m torch.distributed.launch --nproc_per_node=8 \
 run_squad.py \
 --bert_model bert-large-uncased-whole-word-masking  \
 --do_train \
 --do_predict \
 --do_lower_case \
 --train_file $SQUAD_DIR/train-v1.1.json \
 --predict_file $SQUAD_DIR/dev-v1.1.json \
 --learning_rate 3e-5 \
 --num_train_epochs 2 \
 --max_seq_length 384 \
 --doc_stride 128 \
 --output_dir ../models/wwm_uncased_finetuned_squad/ \
 --train_batch_size 24 \
 --gradient_accumulation_steps 12

Training with these hyper-parameters gave us the following results:

python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncased_finetuned_squad/predictions.json
{"exact_match": 86.91579943235573, "f1": 93.1532499015869}

This is the model provided as bert-large-uncased-whole-word-masking-finetuned-squad.

And here is the model provided as bert-large-cased-whole-word-masking-finetuned-squad:

python -m torch.distributed.launch --nproc_per_node=8  run_squad.py  --bert_model bert-large-cased-whole-word-masking   --do_train  --do_predict  --do_lower_case  --train_file $SQUAD_DIR/train-v1.1.json  --predict_file $SQUAD_DIR/dev-v1.1.json  --learning_rate 3e-5  --num_train_epochs 2  --max_seq_length 384  --doc_stride 128  --output_dir ../models/wwm_cased_finetuned_squad/  --train_batch_size 24  --gradient_accumulation_steps 12

Training with these hyper-parameters gave us the following results:

python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncased_finetuned_squad/predictions.json
{"exact_match": 84.18164616840113, "f1": 91.58645594850135}

SWAG

The data for SWAG can be downloaded by cloning the following repository

export SWAG_DIR=/path/to/SWAG

python run_swag.py \
  --bert_model bert-base-uncased \
  --do_train \
  --do_lower_case \
  --do_eval \
  --data_dir $SWAG_DIR/data \
  --train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --max_seq_length 80 \
  --output_dir /tmp/swag_output/ \
  --gradient_accumulation_steps 4

Training with the previous hyper-parameters on a single GPU gave us the following results:

eval_accuracy = 0.8062081375587323
eval_loss = 0.5966546792367169
global_step = 13788
loss = 0.06423990014260186

LM Fine-tuning

The data should be a text file in the same format as sample_text.txt (one sentence per line, docs separated by empty line). You can download an exemplary training corpus generated from wikipedia articles and splitted into ~500k sentences with spaCy. Training one epoch on this corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with train_batch_size=200 and max_seq_length=128:

Thank to the work of @Rocketknight1 and @tholor there are now several scripts that can be used to fine-tune BERT using the pretraining objective (combination of masked-language modeling and next sentence prediction loss). These scripts are detailed in the README of the examples/lm_finetuning/ folder.

OpenAI GPT, Transformer-XL and GPT-2: running the examples

We provide three examples of scripts for OpenAI GPT, Transformer-XL and OpenAI GPT-2 based on (and extended from) the respective original implementations:

fine-tuning OpenAI GPT on the ROCStories dataset
evaluating Transformer-XL on Wikitext 103
unconditional and conditional generation from a pre-trained OpenAI GPT-2 model

Fine-tuning OpenAI GPT on the RocStories dataset

This example code fine-tunes OpenAI GPT on the RocStories dataset.

Before running this example you should download the RocStories dataset and unpack it to some directory $ROC_STORIES_DIR.

export ROC_STORIES_DIR=/path/to/RocStories

python run_openai_gpt.py \
  --model_name openai-gpt \
  --do_train \
  --do_eval \
  --train_dataset $ROC_STORIES_DIR/cloze_test_val__spring2016\ -\ cloze_test_ALL_val.csv \
  --eval_dataset $ROC_STORIES_DIR/cloze_test_test__spring2016\ -\ cloze_test_ALL_test.csv \
  --output_dir ../log \
  --train_batch_size 16 \

This command runs in about 10 min on a single K-80 an gives an evaluation accuracy of about 87.7% (the authors report a median accuracy with the TensorFlow code of 85.8% and the OpenAI GPT paper reports a best single run accuracy of 86.5%).

Evaluating the pre-trained Transformer-XL on the WikiText 103 dataset

This example code evaluate the pre-trained Transformer-XL on the WikiText 103 dataset. This command will download a pre-processed version of the WikiText 103 dataset in which the vocabulary has been computed.

python run_transfo_xl.py --work_dir ../log

This command runs in about 1 min on a V100 and gives an evaluation perplexity of 18.22 on WikiText-103 (the authors report a perplexity of about 18.3 on this dataset with the TensorFlow code).

Unconditional and conditional generation from OpenAI's GPT-2 model

This example code is identical to the original unconditional and conditional generation codes.

Conditional generation:

python run_gpt2.py

Unconditional generation:

python run_gpt2.py --unconditional

The same option as in the original scripts are provided, please refere to the code of the example and the original repository of OpenAI.

Fine-tuning BERT-large on GPUs

The options we list above allow to fine-tune BERT-large rather easily on GPU(s) instead of the TPU used by the original implementation.

For example, fine-tuning BERT-large on SQuAD can be done on a server with 4 k-80 (these are pretty old now) in 18 hours. Our results are similar to the TensorFlow implementation results (actually slightly higher):

{"exact_match": 84.56953642384106, "f1": 91.04028647786927}

To get these results we used a combination of:

multi-GPU training (automatically activated on a multi-GPU server),
2 steps of gradient accumulation and
perform the optimization step on CPU to store Adam's averages in RAM.

Here is the full list of hyper-parameters for this run:

export SQUAD_DIR=/path/to/SQUAD

python ./run_squad.py \
  --bert_model bert-large-uncased \
  --do_train \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/ \
  --train_batch_size 24 \
  --gradient_accumulation_steps 2

If you have a recent GPU (starting from NVIDIA Volta series), you should try 16-bit fine-tuning (FP16).

Here is an example of hyper-parameters for a FP16 run we tried:

export SQUAD_DIR=/path/to/SQUAD

python ./run_squad.py \
  --bert_model bert-large-uncased \
  --do_train \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/ \
  --train_batch_size 24 \
  --fp16 \
  --loss_scale 128

The results were similar to the above FP32 results (actually slightly higher):

{"exact_match": 84.65468306527909, "f1": 91.238669287002}

Here is an example with the recent bert-large-uncased-whole-word-masking:

python -m torch.distributed.launch --nproc_per_node=8 \
  run_squad.py \
  --bert_model bert-large-uncased-whole-word-masking \
  --do_train \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/ \
  --train_batch_size 24 \
  --gradient_accumulation_steps 2

BERTology

There is a growing field of study concerned with investigating the inner working of large-scale transformers like BERT (that some call "BERTology"). Some good examples of this field are:

BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950
Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341

In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):

accessing all the hidden-states of BERT/GPT/GPT-2,
accessing all the attention weights for each head of BERT/GPT/GPT-2,
retrieving heads output values and gradients to be able to compute head importance score and prune head as explained in https://arxiv.org/abs/1905.10650.

To help you understand and use these features, we have added a specific example script: bertology.py while extract information and prune a model pre-trained on MRPC.

Notebooks

We include three Jupyter Notebooks that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.

The first NoteBook (Comparing-TF-and-PT-models.ipynb) extracts the hidden states of a full sequence on each layers of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden state of the models.
The second NoteBook (Comparing-TF-and-PT-models-SQuAD.ipynb) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of the BertForQuestionAnswering and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.
The third NoteBook (Comparing-TF-and-PT-models-MLM-NSP.ipynb) compares the predictions computed by the TensorFlow and the PyTorch models for masked token language modeling using the pre-trained masked language modeling model.

Please follow the instructions given in the notebooks to run and modify them.

Command-line interface

A command-line interface is provided to convert a TensorFlow checkpoint in a PyTorch dump of the BertForPreTraining class (for BERT) or NumPy checkpoint in a PyTorch dump of the OpenAIGPTModel class (for OpenAI GPT).

BERT

You can convert any TensorFlow checkpoint for BERT (in particular the pre-trained models released by Google) in a PyTorch save file by using the convert_tf_checkpoint_to_pytorch.py script.

This CLI takes as input a TensorFlow checkpoint (three files starting with bert_model.ckpt) and the associated configuration file (bert_config.json), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using torch.load() (see examples in extract_features.py, run_classifier.py and run_squad.py).

You only need to run this conversion script once to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with bert_model.ckpt) but be sure to keep the configuration file (bert_config.json) and the vocabulary file (vocab.txt) as these are needed for the PyTorch model too.

To run this specific conversion script you will need to have TensorFlow and PyTorch installed (pip install tensorflow). The rest of the repository only requires PyTorch.

Here is an example of the conversion process for a pre-trained BERT-Base Uncased model:

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12

pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch \
  $BERT_BASE_DIR/bert_model.ckpt \
  $BERT_BASE_DIR/bert_config.json \
  $BERT_BASE_DIR/pytorch_model.bin

You can download Google's pre-trained models for the conversion here.

OpenAI GPT

Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint save as the same format than OpenAI pretrained model (see here)

export OPENAI_GPT_CHECKPOINT_FOLDER_PATH=/path/to/openai/pretrained/numpy/weights

pytorch_pretrained_bert convert_openai_checkpoint \
  $OPENAI_GPT_CHECKPOINT_FOLDER_PATH \
  $PYTORCH_DUMP_OUTPUT \
  [OPENAI_GPT_CONFIG]

Transformer-XL

Here is an example of the conversion process for a pre-trained Transformer-XL model (see here)

export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint

pytorch_pretrained_bert convert_transfo_xl_checkpoint \
  $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \
  $PYTORCH_DUMP_OUTPUT \
  [TRANSFO_XL_CONFIG]

GPT-2

Here is an example of the conversion process for a pre-trained OpenAI's GPT-2 model.

export GPT2_DIR=/path/to/gpt2/checkpoint

pytorch_pretrained_bert convert_gpt2_checkpoint \
  $GPT2_DIR/model.ckpt \
  $PYTORCH_DUMP_OUTPUT \
  [GPT2_CONFIG]

TPU

TPU support and pretraining scripts

TPU are not supported by the current stable release of PyTorch (0.4.1). However, the next version of PyTorch (v1.0) should support training on TPU and is expected to be released soon (see the recent official announcement).

We will add TPU support when this next release is published.

The original TensorFlow code further comprises two scripts for pre-training BERT: create_pretraining_data.py and run_pretraining.py.

Since, pre-training BERT is a particularly expensive operation that basically requires one or several TPUs to be completed in a reasonable amout of time (see details here) we have decided to wait for the inclusion of TPU support in PyTorch to convert these pre-training scripts.

Name		Name	Last commit message	Last commit date
Latest commit History 959 Commits
.circleci		.circleci
.github		.github
.idea		.idea
docker		docker
docs/imgs		docs/imgs
examples		examples
hubconfs		hubconfs
notebooks		notebooks
pytorch_pretrained_bert		pytorch_pretrained_bert
samples		samples
tests		tests
.coveragerc		.coveragerc
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
hubconf.py		hubconf.py
requirements.txt		requirements.txt
setup.py		setup.py

License

ethanjperez/pytorch-pretrained-BERT

Folders and files

Latest commit

History

Repository files navigation

PyTorch Pretrained BERT: The Big & Extending Repository of pretrained Transformers

Content

Installation

With pip

From source

Overview

Usage

BERT

OpenAI GPT

Transformer-XL

OpenAI GPT-2

Doc

Loading Google AI or OpenAI pre-trained weights or PyTorch dump

from_pretrained() method

Cache directory

Serialization best-practices

Configurations

Models

1. BertModel

2. BertForPreTraining

3. BertForMaskedLM

4. BertForNextSentencePrediction

5. BertForSequenceClassification

6. BertForMultipleChoice

7. BertForTokenClassification

8. BertForQuestionAnswering

9. OpenAIGPTModel

10. OpenAIGPTLMHeadModel

11. OpenAIGPTDoubleHeadsModel

12. TransfoXLModel

Extracting a list of the hidden states at each layer of the Transformer-XL from last_hidden_state and new_mems:

13. TransfoXLLMHeadModel

14. GPT2Model

15. GPT2LMHeadModel

16. GPT2DoubleHeadsModel

Tokenizers

BertTokenizer

OpenAIGPTTokenizer

TransfoXLTokenizer

GPT2Tokenizer

Optimizers

BertAdam

OpenAIAdam

Learning Rate Schedules

Examples

Training large models: introduction, tools and examples

Fine-tuning with BERT: running the examples

GLUE results on dev set

MRPC

SQuAD

SWAG

LM Fine-tuning

OpenAI GPT, Transformer-XL and GPT-2: running the examples

Fine-tuning OpenAI GPT on the RocStories dataset

Evaluating the pre-trained Transformer-XL on the WikiText 103 dataset

Unconditional and conditional generation from OpenAI's GPT-2 model

Fine-tuning BERT-large on GPUs

BERTology

Notebooks

Command-line interface

BERT

OpenAI GPT

Transformer-XL

GPT-2

TPU

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

`from_pretrained()` method

1. `BertModel`

2. `BertForPreTraining`

3. `BertForMaskedLM`

4. `BertForNextSentencePrediction`

5. `BertForSequenceClassification`

6. `BertForMultipleChoice`

7. `BertForTokenClassification`

8. `BertForQuestionAnswering`

9. `OpenAIGPTModel`

10. `OpenAIGPTLMHeadModel`

11. `OpenAIGPTDoubleHeadsModel`

12. `TransfoXLModel`

Extracting a list of the hidden states at each layer of the Transformer-XL from `last_hidden_state` and `new_mems`:

13. `TransfoXLLMHeadModel`

14. `GPT2Model`

15. `GPT2LMHeadModel`

16. `GPT2DoubleHeadsModel`

`BertTokenizer`

`OpenAIGPTTokenizer`

`TransfoXLTokenizer`

`GPT2Tokenizer`

`BertAdam`

`OpenAIAdam`

Packages