If you got this error while running a script:
OutOfMemoryError: CUDA out of memory. Tried to allocate 2.22 GiB. GPU 0 has a total capacity of 79.15 GiB of which 228.38 MiB is free. Including non-PyTorch memory, this process
has 78.93 GiB memory in use. Of the allocated memory 76.28 GiB is allocated by PyTorch, and 2.14 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory
is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
it means that your GPU did not have enough memory for the model and script configuration.
Here are a few things you can try:
Adjust the --train.micro_batch_size argument in the fine-tuning and pretraining scripts. This variable determines the number of samples loaded per iteration. A smaller value loads fewer samples simultaneously. The minimum value is 1.
Experiment with different micro batch sizes to find a balance between memory consumption and computational efficiency. Smaller micro batch sizes consume less memory but may slow training down. Conversely, larger micro batch sizes require more memory but can speed training up.
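To make the trade-off concrete, here is a minimal sketch of how the micro batch size typically interacts with gradient accumulation. It assumes the effective batch size per device is held fixed and that the micro batch size divides it evenly. The function name and the numbers are hypothetical and are not taken from the scripts.

```python
def accumulation_steps(global_batch_size: int, micro_batch_size: int) -> int:
    """Forward/backward passes needed per optimizer step (illustrative helper)."""
    if micro_batch_size < 1:
        raise ValueError("micro_batch_size must be at least 1")
    if global_batch_size % micro_batch_size != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size")
    return global_batch_size // micro_batch_size


if __name__ == "__main__":
    # Assumed effective batch size of 16 samples per device.
    for micro in (8, 4, 2, 1):
        steps = accumulation_steps(global_batch_size=16, micro_batch_size=micro)
        print(f"micro_batch_size={micro}: {steps} passes per optimizer step")
```

Under these assumptions, halving the micro batch size roughly halves the activation memory held at once, at the cost of twice as many passes per optimizer step.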
The context length (block_size in the code) plays a significant role in running models with attention.
- The pretraining scripts are configured to use the full context length of the model to train.
- The finetuning scripts are configured to use the longest sample length of the training data to avoid allocating unnecessary memory (--train.max_seq_length argument). An error is raised if that length exceeds the model's context length, or if you try to run a batch that is longer than it.
However, your hardware may not support such large context lengths. Here's what you can do:
- For the pretraining scripts, you can simply reduce the Config(block_size=...) value.
- For the finetuning scripts, you can trim the length of the samples in your dataset. All the finetuning scripts expose a --data.max_seq_length=... argument. This is also useful when sample lengths are highly unbalanced, since a single very long sample increases the memory usage for all the shorter samples. For example, the median length of the samples in Alpaca is 110 tokens. Truncating the Alpaca dataset to 256 max tokens reduces the memory requirements of a Falcon 7B model from 23.52 GB to 15.73 GB. For more information about dataset truncation, please see the Truncating datasets section in the prepare_dataset.md tutorial.
Keep in mind that reducing the context length will affect the modelling performance on text sequences longer than the limit.
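The effect of one very long sample can be illustrated with a small sketch. It assumes that every sample in a batch is padded to the length of the longest one; the helper names and the token lengths are hypothetical and do not come from the scripts.

```python
def truncate(samples: list[list[int]], max_seq_length: int) -> list[list[int]]:
    """Cut every tokenized sample down to at most max_seq_length tokens."""
    return [sample[:max_seq_length] for sample in samples]


def padded_batch_tokens(samples: list[list[int]]) -> int:
    """Tokens held in memory when the batch is padded to its longest sample."""
    return len(samples) * max(len(sample) for sample in samples)


if __name__ == "__main__":
    # Three short samples and one outlier, represented as dummy token ids.
    batch = [[0] * n for n in (110, 90, 120, 1000)]
    print(padded_batch_tokens(batch))                  # 4000
    print(padded_batch_tokens(truncate(batch, 256)))   # 1024
```

Only the outlier loses tokens here, yet the padded batch shrinks to about a quarter of its original size.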
Our scripts expose the --precision argument, which directly impacts the memory usage.
Using true lower precision (16-true, bf16-true) halves the memory usage compared to 32-true. However, the model might start producing NaNs due to the limited range of representable values.
Mixed precision training (16-mixed, bf16-mixed) provides better stability but offers limited memory reduction.
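As a rough back-of-the-envelope sketch, the memory needed just to store the weights can be estimated from the bytes per parameter. The helper below is hypothetical and counts weights only (no gradients, optimizer states, or activations). It also assumes that the mixed settings keep the stored weights in 32-bit, which is one common explanation for their limited savings.

```python
# Assumed bytes used to store one weight under each --precision setting.
BYTES_PER_PARAM = {
    "32-true": 4,
    "16-true": 2,
    "bf16-true": 2,
    "16-mixed": 4,    # assumption: weights stay in 32-bit
    "bf16-mixed": 4,  # assumption: weights stay in 32-bit
}


def weight_memory_gb(num_params: int, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(7_000_000_000, precision)
        print(f"{precision}: {gb:.1f} GB")
```

For a 7B-parameter model this gives about 28 GB of weights in 32-true and about 14 GB in 16-true or bf16-true.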
For exceptionally large models, the aforementioned techniques might still not suffice. If you have multiple GPUs available, you can trade speed for lower memory usage per GPU by changing the --devices 1 argument in the scripts to a larger value. Doing so enables a parallelism technique (FSDP) that shards the memory across the different GPUs.
The default configuration already uses activation checkpointing, but you can also enable CPU offloading by changing the cpu_offload=False argument in the scripts to True.
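The benefit of sharding can be sketched with simple arithmetic. The helper below is hypothetical and idealized: it assumes that parameters, gradients, and optimizer states are split evenly across devices, and it ignores activations, communication buffers, and anything that stays replicated.

```python
def per_device_state_gb(total_state_gb: float, devices: int) -> float:
    """Idealized share of the sharded model state held by each GPU."""
    if devices < 1:
        raise ValueError("devices must be at least 1")
    return total_state_gb / devices


if __name__ == "__main__":
    # Assumed 112 GB of total model state (weights + gradients + optimizer).
    for devices in (1, 2, 4, 8):
        share = per_device_state_gb(112.0, devices)
        print(f"--devices {devices}: about {share:.1f} GB of state per GPU")
```

Real savings are smaller than this ideal, and the extra communication between GPUs is where the speed cost comes from.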
Our scripts use the AdamW optimizer. It maintains 2 states for each trainable parameter of the model, meaning that the optimizer memory is double that of an optimizer like SGD with momentum, which maintains only one.
You can try replacing it with your optimizer of choice that is lighter in memory requirements. Keep in mind that different optimizers have distinct optimization behaviors, so it's essential to assess their impact on the training process and model performance. An example would be the recently published Sophia or Lion optimizers.
This suggestion is particularly relevant for pretraining. In the fine-tuning scripts, the trainable parameters represent only a small subset of the total, so the optimizer states take up little memory to begin with.
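A rough estimate of the optimizer state memory makes the comparison concrete. The sketch below is hypothetical: the state counts reflect typical implementations (two buffers for AdamW, one for SGD with momentum or Lion, none for plain SGD), the states are assumed to be stored in 32-bit, and the parameter counts are made-up examples.

```python
# Assumed number of per-parameter state tensors kept by each optimizer.
STATES_PER_PARAM = {"adamw": 2, "sgd_momentum": 1, "lion": 1, "sgd": 0}


def optimizer_state_gb(trainable_params: int, optimizer: str,
                       bytes_per_state: int = 4) -> float:
    """Approximate optimizer state memory in GB (1 GB = 1e9 bytes)."""
    return trainable_params * STATES_PER_PARAM[optimizer] * bytes_per_state / 1e9


if __name__ == "__main__":
    # Pretraining: all 7B parameters are trainable (illustrative figure).
    print(optimizer_state_gb(7_000_000_000, "adamw"))  # 56.0
    print(optimizer_state_gb(7_000_000_000, "lion"))   # 28.0
    # Fine-tuning a small subset, e.g. 4M trainable parameters (illustrative).
    print(optimizer_state_gb(4_000_000, "adamw"))      # 0.032
```

With only a few million trainable parameters, the choice of optimizer barely changes the total memory footprint.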