Note: This repository and its contents support the coursework of the INM706 module at City, University of London.
We use sequence-to-sequence (Seq2Seq) and Transformer models for machine translation on an English-to-French dataset.
We use the 'Europarl v7' dataset for training/validation, which was the standard for training models in WMT 2015. The dataset is extracted from proceedings of the European Parliament, with translations in many languages. The dataset can be found here. Our test datasets can be found here.
We have also uploaded the datasets we used ourselves, which may be easier to download. These can be downloaded here. The exact datasets used are in "fr-en.zip".
- Python 3.9+
- PyTorch (with CUDA for GPU usage)
- All other requirements listed in requirements.txt
The training process involves optimising the Seq2Seq and Transformer models to generate accurate translations of English sentences. You can customise various hyperparameters, such as batch size, learning rate, optimiser, and loss function, in the `config.yaml` file. Please refer to `config.yaml` for a full list of hyperparameters.
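As a rough illustration of how these hyperparameters are consumed, here is a minimal sketch of loading `config.yaml`; the key names below are assumptions for illustration, not necessarily the exact keys in our file:

```python
# Minimal sketch of reading hyperparameters from config.yaml;
# the key names here are assumptions, not the repo's exact keys.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

batch_size = cfg["batch_size"]        # e.g. 64
learning_rate = cfg["learning_rate"]  # e.g. 1e-4
optimiser_name = cfg["optimiser"]     # e.g. "adam"
```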
Our checkpoints can be downloaded here. Each folder contains a 'readme.txt' file listing the hyperparameters used to train that model. You will only need these if you edit the code later; otherwise, the hyperparameters are already filled in appropriately for you. Simply replace the dummy checkpoints in this repository with those found here.
- `train.py`: Initiates the training loop for the respective model, using the hyperparameters in the `config.yaml` file.
- `config.yaml`: File for editing hyperparameters. Additional hyperparameters are included for choosing project and logger names for 'wandb'.
- `models.py`: Contains the definitions of the Seq2Seq/Transformer models.
- `dataset.py`: Creates our datasets in the relevant PyTorch format.
- `utils.py`: Utility functions for calculating BLEU, NIST, etc. (see the sketch after this list).
- `logger.py`: Logger class for logging training metrics to 'wandb'.
- `checkpoints/`: Directory where model checkpoints are saved during training; these can also be used for inference.
- `inference.ipynb`: Jupyter Notebook used for inference. Allows a sentence or a test dataset to be input and the results to be examined.
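For reference, corpus-level BLEU and NIST can be computed with NLTK roughly as follows; this is a hedged sketch of the kind of helper `utils.py` provides, not its exact code:

```python
# Hedged sketch of corpus-level BLEU/NIST scoring with NLTK;
# utils.py may implement these metrics differently.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from nltk.translate.nist_score import corpus_nist

def score_corpus(references, hypotheses):
    """references: one list of reference token lists per sentence;
    hypotheses: one hypothesis token list per sentence."""
    smooth = SmoothingFunction().method1  # avoids zero scores on short outputs
    bleu = corpus_bleu(references, hypotheses, smoothing_function=smooth)
    nist = corpus_nist(references, hypotheses, n=5)
    return bleu, nist

refs = [[["the", "cat", "is", "on", "the", "mat"]]]
hyps = [["the", "cat", "is", "on", "the", "mat"]]
print(score_corpus(refs, hyps))
```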
- We recommend first downloading/cloning the whole repository, though if you only wish to work with the baseline model you do not need the `/seq2seq-final` folder, and vice versa for our final model (the same applies to the transformer folders). We also recommend sticking to the folder structure found in this repository; otherwise, you will need to make a few edits indicating where the datasets and checkpoints can be found.
- Secondly, ensure that all the libraries listed in the `requirements.txt` file are installed in your environment (you may wish to create a new environment from scratch to avoid dependency conflicts). You can use your favourite method to install them, for example by running `pip install -r requirements.txt` in your terminal. You must also install PyTorch, ideally the CUDA version if you wish to work with a GPU; follow the instructions on the PyTorch website for the installation (see the sanity check after this list).
- Now, you will need to download a training set and a validation set (these can also come from a single dataset, as in our work). We used 'Europarl-v7' for both training and validation, with an 80%/20% split (see the split sketch after this list). We have already created a dataset folder in the repository, and you can download the datasets here; this folder also includes our test datasets, which are only used in the respective `inference.ipynb`. Download the dataset(s), delete our dummy dataset folder, and extract the downloaded contents (which should be in a folder with the same name as our dummy folder) into the location where the dummy folder was. In short, as in this repository, the dataset folders must sit outside the actual code folders.
- After completing all of the previous steps, you can safely run the `train.py` file from your chosen folder. You may edit the `config.yaml` file with your chosen hyperparameters for training, including adding any checkpoint paths (such as those that we provide).
- Optional: Should you wish to use our checkpoints, download them here for your choice of model and edit the `config.yaml` file with the checkpoint path for the relevant model. These checkpoints should be placed in the `/checkpoints` folder of the relevant model folder.
- IMPORTANT NOTE: The code assumes that you open your workspace in one of the `/seq2seq-baseline`, `/seq2seq-final`, `/transformer-baseline`, or `/transformer-final` model folders in your integrated development environment (IDE), rather than opening the whole folder containing all of the datasets, etc. You can adjust this to your preferences, but we recommend following the folder structure set out in this repository for ease.
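As referenced in the setup step above, here is a quick optional check that your PyTorch installation can see the GPU; this is generic PyTorch, not repo code:

```python
# Optional sanity check that PyTorch and CUDA are set up correctly.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```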
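And, as referenced in the dataset step, a hedged sketch of the 80%/20% train/validation split; the `PairDataset` class and file paths here are illustrative assumptions, since `dataset.py` handles this in the repo:

```python
# Illustrative 80/20 split of a paired en-fr dataset with PyTorch;
# the PairDataset class and the file paths are assumptions for this sketch.
import torch
from torch.utils.data import Dataset, random_split

class PairDataset(Dataset):
    """Minimal paired-sentence dataset: one English and one French file,
    aligned line by line (as in the Europarl-v7 release)."""
    def __init__(self, en_path, fr_path):
        with open(en_path, encoding="utf-8") as f_en, \
             open(fr_path, encoding="utf-8") as f_fr:
            self.pairs = list(zip(f_en.read().splitlines(),
                                  f_fr.read().splitlines()))

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]

full_ds = PairDataset("fr-en/europarl-v7.fr-en.en", "fr-en/europarl-v7.fr-en.fr")
n_train = int(0.8 * len(full_ds))
train_ds, val_ds = random_split(
    full_ds, [n_train, len(full_ds) - n_train],
    generator=torch.Generator().manual_seed(42),  # reproducible split
)
```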
We provide a Jupyter Notebook (`inference.ipynb`) for inference. You must have our checkpoints downloaded and added to the `checkpoints/` folder, though you can edit the checkpoint paths inside the code as you wish. Alternatively, you can use your own checkpoints.
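The notebook handles this itself, but for orientation, loading a checkpoint and greedily decoding look roughly like the sketch below; the checkpoint path, the state-dict key, the model's call signature, and the special-token ids are all assumptions, not the repo's exact interface:

```python
# Hedged sketch of inference from a saved checkpoint; the path, the
# state-dict key, the model's (src, tgt) call signature, and the
# BOS/EOS ids are assumptions.
import torch

ckpt = torch.load("checkpoints/model.pt", map_location="cpu")
model.load_state_dict(ckpt["model_state_dict"])  # 'model' built as in models.py
model.eval()

@torch.no_grad()
def greedy_translate(src_ids, bos_id=1, eos_id=2, max_len=50):
    out = [bos_id]
    for _ in range(max_len):
        logits = model(src_ids, torch.tensor([out]))  # (batch, tgt_len, vocab) assumed
        next_id = int(logits[0, -1].argmax())
        out.append(next_id)
        if next_id == eos_id:
            break
    return out[1:]
```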
If you have not used 'wandb' previously, you will be prompted to enter your API key in the terminal. You will need a (free) 'wandb' account if you do not already have one; further instructions can be found here.
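For reference, the first-time 'wandb' setup amounts to the following; the project and run names here are placeholders, set via `config.yaml` in our code:

```python
# First-time wandb setup; project/run names below are placeholders.
import wandb

wandb.login()  # prompts for your API key on first use
run = wandb.init(project="inm706-translation", name="seq2seq-baseline")
run.log({"train/loss": 0.0})  # logger.py presumably wraps calls like this
run.finish()
```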