This tutorial will walk you through pretraining TinyLlama.
> [!TIP]
> To get started with zero setup, clone the TinyLlama studio on Lightning AI.
TinyLlama is architecturally the same as Meta AI's Llama 2, but has only 1.1B parameters and is instead trained for multiple epochs on a mix of the SlimPajama and Starcoder datasets.
Here is a quick fact sheet:
| Name                          | Description                                                                             |
|-------------------------------|-----------------------------------------------------------------------------------------|
| Parameters                    | 1.1B                                                                                    |
| Model Size                    | Layers: 22, Heads: 32, Query Groups: 4, Embedding Size: 2048, Intermediate Size: 5632   |
| Sequence Length               | 2048                                                                                    |
| Learning Rate                 | 4e-4                                                                                    |
| Learning Rate Schedule        | Cosine with 2000 warmup steps                                                           |
| Training Data                 | SlimPajama (893 GB), Starcoder (290 GB)                                                 |
| Combined Dataset Size         | Around 950B tokens                                                                      |
| Total Tokens During Training  | 3 trillion (3 epochs)                                                                   |
| Time to complete training     | ~4 weeks with 64 A100 GPUs                                                              |
| Model FLOPs Utilization (MFU) | 52%                                                                                     |

(This table was sourced from the authors' README.)
You can download the data using `git lfs`:

```bash
# Make sure you have git-lfs installed (https://git-lfs.com):
sudo apt install git-lfs

git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B data/slimpajama-raw
git clone https://huggingface.co/datasets/bigcode/starcoderdata data/starcoderdata-raw
```
Around 1.2 TB of disk space is required to store both datasets.
In order to start pretraining with LitGPT on this data, you need to read, tokenize, and write the data in binary chunks. This will leverage the litdata optimization pipeline and streaming dataset.
First, install additional dependencies for preprocessing:
```bash
pip install '.[all]'
```
You will need to have the tokenizer config available:
```bash
litgpt download meta-llama/Llama-2-7b-hf \
  --access_token your_hf_token \
  --tokenizer_only true
```
Then, run the preprocessing script for each dataset and split. You will require 1.1 TB of disk space for Starcoder and 2.5 TB of space for the SlimPajama dataset.
Starcoder:
```bash
python litgpt/data/prepare_starcoder.py \
  --input_dir data/starcoderdata-raw \
  --output_dir data/starcoder \
  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf
```
SlimPajama:
```bash
python litgpt/data/prepare_slimpajama.py \
  --input_dir data/slimpajama-raw/validation \
  --output_dir data/slimpajama/val \
  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf

python litgpt/data/prepare_slimpajama.py \
  --input_dir data/slimpajama-raw/test \
  --output_dir data/slimpajama/test \
  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf

python litgpt/data/prepare_slimpajama.py \
  --input_dir data/slimpajama-raw/train \
  --output_dir data/slimpajama/train \
  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf
```
If you want to run on a small slice of the datasets first, pass the flag `--fast_dev_run=true` to the commands above.
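For instance, to do a quick sanity check on a small slice of the Starcoder data (the `data/starcoder-dev` output directory below is just an illustrative choice):

```bash
# Quick sanity check: preprocess only a small slice of the Starcoder data
python litgpt/data/prepare_starcoder.py \
  --input_dir data/starcoderdata-raw \
  --output_dir data/starcoder-dev \
  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf \
  --fast_dev_run=true
```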
In the above we are assuming that you will be using the same tokenizer as used in Llama 2/TinyLlama, but any trained SentencePiece tokenizer with a 32000 vocabulary size will do here.
Running the pretraining script with its default settings requires at least 8 A100 GPUs.
```bash
litgpt pretrain --config config_hub/pretrain/tinyllama.yaml
```
> [!TIP]
> Use the `litgpt pretrain --data.help TinyLlama` command to list additional dataset options.
The script will save checkpoints periodically to the folder `out/`.
By default, the `pretrain` script will pretrain the model with FSDP in `bfloat16` mixed precision and gradient accumulation.
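If you want to experiment with a different precision, you can usually override it from the command line. This is a sketch, assuming your LitGPT version exposes the `precision` field of the config as a CLI override:

```bash
# Sketch: override the default precision (flag availability may vary by LitGPT version)
litgpt pretrain --config config_hub/pretrain/tinyllama.yaml \
  --precision bf16-true
```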
Note that `pretrain` is not actually a model-specific training script, so feel free to try other configurations or change the model type and size by passing a different string to the model name argument, for example:
```bash
litgpt pretrain Gemma-2b
```
The currently supported model names can be listed by executing `litgpt pretrain` without any additional arguments.
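For example:

```bash
# Running the command with no arguments prints the list of supported model names
litgpt pretrain
```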
Keep in mind that training with a single machine will take weeks. To speed up the process, you'll need access to a cluster. Once you're in a cluster, you can follow these instructions to launch the script across machines.
The script exposes several hyperparameters you can tweak through the command line. For instance, `--train.micro_batch_size` should be adjusted so the process will use the available GPU memory. For more tips to avoid out-of-memory issues, please also see the more detailed Dealing with out-of-memory (OOM) errors guide.
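For example, a run that lowers the micro-batch size to fit on smaller GPUs might look like this (the value is illustrative; tune it for your hardware):

```bash
# Illustrative override: a smaller micro-batch size reduces peak GPU memory usage
litgpt pretrain --config config_hub/pretrain/tinyllama.yaml \
  --train.micro_batch_size 2
```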
Last, logging is kept minimal in the script, but for long-running experiments we recommend switching to a proper experiment tracker. As an example, we included WandB (set `--logger_name=wandb`) to show how you can integrate any experiment tracking framework.
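For instance (assuming the `wandb` package is installed and you are logged in to Weights & Biases):

```bash
# Log the run to Weights & Biases instead of the default minimal logging
litgpt pretrain --config config_hub/pretrain/tinyllama.yaml \
  --logger_name=wandb
```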
For reference, here are the loss curves for our reproduction.
The checkpoints saved during pretraining contain all the information to resume if needed. Simply rerun the script with the `--resume` argument added:
```bash
litgpt pretrain tiny-llama \
  --config config_hub/pretrain/tinyllama.yaml \
  --resume out/pretrain/tiny-llama/step-00060500
```
> [!IMPORTANT]
> Each checkpoint is a directory. Point to the directory, not the `lit_model.pth` file inside of it.
After training is completed, you can convert the checkpoint to a format that can be loaded for evaluation, inference, finetuning, etc.:
```bash
litgpt convert_pretrained_checkpoint out/pretrain/tiny-llama/step-00060500 \
  --output_dir checkpoints/tiny-llama/final
```
After conversion, the output folder will contain these files:
```text
checkpoints/tiny-llama/final
├── model_config.yaml
├── lit_model.pth
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
```
You can then use this checkpoint folder to run evaluation, inference, or finetuning, or to process the checkpoint further.
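For example, you could sample from the converted checkpoint directly. This is a sketch, assuming a LitGPT version where `litgpt generate` accepts the checkpoint directory as a positional argument (the prompt is arbitrary):

```bash
# Sample from the converted base checkpoint (expect raw completions, not chat-style answers)
litgpt generate checkpoints/tiny-llama/final --prompt "In a faraway land"
```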
The following Lightning Studio templates provide LitGPT pretraining projects in reproducible environments with multi-GPU and multi-node support:
- Pretrain LLMs - TinyLlama 1.1B
- Continued Pretraining with TinyLlama 1.1B