first draft infra post and update about page
AhmedTremo committed Jul 28, 2024
1 parent d74757f commit 56ab2fc
Showing 4 changed files with 39 additions and 22 deletions.
59 changes: 38 additions & 21 deletions _posts/2024-07-27-What-Infra-does-it-take-to-train-llama405b?.md
---
title: What Infrastructure does it take to train a 405B Llama3-like model?
date: 2024-07-28
categories: [LLM, Infrastructure, GPU, Distributed Training]
tags: [LLM, infrastructure, GPU, distributed training]
author: tremo
---

## Intro

Setting up the infrastructure to train the latest frontier models is no easy feat; only a handful of companies (Meta, Microsoft, Google) operate at the required scale. ML training has escalated from runs of up to 512 GPUs to the 16k H100 GPUs used to train Meta's latest Llama3-405B model. That jump poses a huge infrastructure challenge and has demanded significant innovation to keep such a sheer number of GPUs working in tandem, since LLM distributed training jobs need every worker scheduled and communicating in lockstep (known as gang scheduling).

Understanding the infrastructure used to train the latest LLMs is essential for ML scientists who want to fully utilize it, especially as infrastructure costs rise: AI labs are already racing to build 100k H100 clusters at an estimated cost of $4 billion each (source: [100k H100 Clusters: Power, Network Topology, Ethernet vs InfiniBand, Reliability, Failures, Checkpointing (semianalysis.com)](https://www.semianalysis.com/p/100000-h100-clusters-power-network)). With that in mind, here's an overview of the components required to build the infrastructure behind the latest LLMs.

![Meta's 24k Cluster](/assets/img/posts/2024-07-27-What-Infra-does-it-take-to-train-llama405b/Infra%20Networking%20cluster.jpg)
__Meta’s 24k cluster design with a 7:1 oversubscription ratio__

## Network Topology

1. The first and most important step is designing how data flows across this huge number of GPUs. As mentioned above, distributed training relies on synchronous collective operations such as all-reduce, all-gather, and broadcast to combine and share gradients. As model sizes grow (reportedly 1.7 trillion parameters for GPT-4), additional parallelism techniques become necessary (tensor, context, pipeline, and data parallelism), which in turn require even more communication.
2. In the ideal scenario, any GPU can talk to any other GPU at full bandwidth (400 GB/s, the latest NVLink connection speed), a property known as full bisection bandwidth. Providing that for a cluster of 100k GPUs, however, would require an enormous number of switches and transceivers to route the traffic, making the cost prohibitive. Architects therefore trade off by oversubscribing the top aggregation layer (a 7:1 ratio in Meta's 24k cluster design shown above) to bring down the cluster cost.
3. GPUs within the same rack retain full bisection bandwidth with one another, so making communication patterns network-aware is essential to utilize the hardware efficiently and avoid stragglers that slow down the whole cluster. Meta, for example, forked Nvidia's NCCL library to tailor communication patterns to their cluster design; a simplified sketch of the idea follows this list.
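
To make the idea of network-aware communication concrete, here is a minimal sketch (PyTorch, assuming 8 GPUs per node and a `torchrun` launch) of a two-level "hierarchical" all-reduce: gradients are first reduced inside each node over NVLink, and only one leader rank per node then talks across the oversubscribed aggregation layer. This illustrates the concept only; it is not Meta's NCCL fork, and NCCL's built-in algorithms are already topology-aware to a degree.

```python
# Hypothetical two-level (hierarchical) all-reduce using torch.distributed.
# Assumes a torchrun launch (RANK/WORLD_SIZE/LOCAL_RANK set) and 8 GPUs per node.
import os
import torch
import torch.distributed as dist

GPUS_PER_NODE = 8  # assumption: adjust for your cluster

def init():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def build_groups():
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    num_nodes = world_size // GPUS_PER_NODE

    # Intra-node groups: ranks that share full-bandwidth NVLink/NVSwitch.
    intra_groups = [
        dist.new_group(list(range(n * GPUS_PER_NODE, (n + 1) * GPUS_PER_NODE)))
        for n in range(num_nodes)
    ]
    # One "leader" per node crosses the slower, oversubscribed network layer.
    leader_group = dist.new_group([n * GPUS_PER_NODE for n in range(num_nodes)])
    return intra_groups[rank // GPUS_PER_NODE], leader_group

def hierarchical_all_reduce(tensor, intra_group, leader_group):
    rank = dist.get_rank()
    leader = (rank // GPUS_PER_NODE) * GPUS_PER_NODE  # global rank of node leader

    # 1) Reduce inside the node over NVLink (cheap, full bisection bandwidth).
    dist.reduce(tensor, dst=leader, group=intra_group)
    # 2) All-reduce only between node leaders across the cluster network.
    if rank == leader:
        dist.all_reduce(tensor, group=leader_group)
    # 3) Broadcast the combined result back to every rank on the node.
    dist.broadcast(tensor, src=leader, group=intra_group)

if __name__ == "__main__":
    init()
    intra, leaders = build_groups()
    grads = torch.ones(1024, device="cuda")  # stand-in for a gradient bucket
    hierarchical_all_reduce(grads, intra, leaders)
    dist.destroy_process_group()
```

With a launch like `torchrun --nnodes=4 --nproc-per-node=8 hier_allreduce.py` (hypothetical script name), only the 4 node leaders generate traffic on the oversubscribed layer instead of all 32 ranks.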

## Storage

1. Training LLMs is memory-bound. While compute has evolved rapidly across GPU generations (A100 → H100), on-chip memory has stayed almost flat (80 GB of HBM on both). That memory has to hold model weights, activations, and optimizer state (Adam, the most popular optimizer, keeps two extra values per parameter, roughly tripling the model state). At the rumored size of GPT-4 (1.7 trillion parameters), a total of X terabytes would be required; a back-of-the-envelope estimate follows this list.
2. HBM sits on the GPU package itself, so its capacity is fixed by the chip you buy and cannot simply be expanded.
3. Additional storage is needed for checkpointing (saving model weights frequently) to recover from failures or to pick the best-performing version of the model (for example, if it starts overfitting with more training); a minimal checkpointing sketch is also shown below.
   - Traditional way: offload the state to CPU memory and then to persistent storage (adds delay but is easier to do).
   - Recent way: copy the current model state into spare GPUs' HBM; fast but costly.
4. Dataset storage: Llama3's 15 trillion training tokens also have to be stored (X storage), and fast read throughput is needed so GPUs are not left idle waiting on data.
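
To put rough numbers on the memory pressure, here is a small back-of-the-envelope script. It assumes the commonly cited mixed-precision Adam accounting of about 16 bytes of state per parameter (bf16 weights and gradients plus fp32 master weights and the two Adam moments) and ignores activations entirely, so treat the output as a lower bound rather than a measured figure.

```python
# Back-of-the-envelope memory estimate for mixed-precision training with Adam.
# Rough accounting per parameter: 2 B bf16 weights + 2 B bf16 gradients
# + 4 B fp32 master weights + 8 B fp32 Adam moments (m and v) = ~16 bytes.
# Activations, temporary buffers, and fragmentation are ignored.

def training_memory_tb(num_params: float, bytes_per_param: int = 16) -> float:
    return num_params * bytes_per_param / 1e12  # terabytes

def gpus_needed(num_params: float, hbm_gb: int = 80) -> float:
    return training_memory_tb(num_params) * 1e12 / (hbm_gb * 1e9)

for name, params in [("Llama3-405B", 405e9), ("rumored GPT-4", 1.7e12)]:
    print(f"{name}: ~{training_memory_tb(params):.1f} TB of training state -> "
          f"at least ~{gpus_needed(params):.0f} x 80 GB GPUs just to hold it")
```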
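
And here is a minimal, hypothetical sketch of the "traditional" checkpointing path from the list above: snapshot the state into CPU memory (a fast device-to-host copy), then let a background thread do the slow write to persistent storage so the GPUs get back to training quickly. Production systems use sharded, distributed checkpoint formats and far more careful error handling; the `async_checkpoint` helper below is illustrative only.

```python
# Sketch of asynchronous checkpointing: snapshot to CPU RAM, then persist in the
# background so training is only paused for the device-to-host copy.
import threading
import torch

def _to_cpu(obj):
    # Recursively copy tensors off the GPU so training can keep mutating the originals.
    if torch.is_tensor(obj):
        return obj.detach().to("cpu", copy=True)
    if isinstance(obj, dict):
        return {k: _to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_to_cpu(v) for v in obj)
    return obj

def async_checkpoint(model, optimizer, step, path):
    # Fast, blocking part: snapshot GPU state into host memory.
    snapshot = {
        "step": step,
        "model": _to_cpu(model.state_dict()),
        "optimizer": _to_cpu(optimizer.state_dict()),
    }
    # Slow part (serialization + disk/remote I/O) runs off the training loop.
    writer = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    writer.start()
    return writer  # join() before the next checkpoint or before exiting

# Hypothetical usage inside a training loop:
# handle = async_checkpoint(model, optimizer, step, f"/mnt/ckpt/step_{step}.pt")
# ...keep training while the write completes...
# handle.join()
```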

## Compute

Nvidia is the clear leader in compute (by both market share and company value). H100s are in mass production and were used to train Llama3-405B, and most AI labs are building on them because of their leading performance and mature, AI-friendly software stack (CUDA & NCCL). AMD is trying to win share with its Instinct MI200/MI300 accelerators, and cloud providers have started building their own chips; Google's TPUs, used to train the Gemini family of models, are the most notable, but the market does not look likely to shift significantly in the foreseeable future.

## Fault Tolerance & Health Checks

1. GPUs fail, and they fail a lot. According to Imbue, 10-20% of GPUs can turn out to be faulty, and GPUs are far from the only failure source in a large AI cluster: network links flap, host machines die, power supplies fluctuate, and even temperature swings affect cluster throughput (source: Meta). Designing the infrastructure to be fault-tolerant is essential to maximize hardware utilization. Imbue recommends provisioning 10-20% more GPUs than the job strictly needs, so failed GPUs can quickly be swapped for spares instead of a single bad GPU or cable halting the entire synchronous training run. Below is a categorized list of the interruptions Meta's team faced during their 54-day training run.
2. Health checks: scripts that test every single component are essential for automatically detecting faulty hardware (GPUs, InfiniBand links, host machines, etc.). Automatic detection should be followed by automatically excluding the faulty hardware and substituting a spare, with a ticket sent to data center technicians or the vendor; only once the fix is confirmed should the hardware be re-added to the training cluster. A minimal sketch of such a check is shown after the figure below.
3. Building a "golden set" of known-good machines and network paths helps narrow down failure sources, and running stress tests that drive hardware to full utilization separates the great machines from the merely good ones.

![Meta Interruptions](/assets/img/posts/2024-07-27-What-Infra-does-it-take-to-train-llama405b/Meta%20interruptions.png)
__Meta’s interruptions during 54-day training run__
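
As an illustration of what a basic health-check script can look like (a hypothetical sketch, not Meta's or Imbue's actual tooling), the snippet below queries `nvidia-smi` for uncorrectable ECC errors and temperature and flags GPUs that should be drained from the job. Real fleet checks go much further: NCCL all-reduce bandwidth benchmarks, InfiniBand link tests, disk and host checks, and so on. The query fields and thresholds here are assumptions that vary by driver version and site policy.

```python
# Minimal GPU health-check sketch: query nvidia-smi and flag suspicious devices.
# Field names can vary across driver versions; thresholds are arbitrary examples.
import subprocess

QUERY = "index,ecc.errors.uncorrected.volatile.total,temperature.gpu"

def check_gpus(max_temp_c: int = 85) -> list[str]:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    problems = []
    for line in out.strip().splitlines():
        idx, ecc, temp = [field.strip() for field in line.split(",")]
        if ecc not in ("0", "[N/A]"):
            problems.append(f"GPU {idx}: {ecc} uncorrectable ECC errors")
        if temp.isdigit() and int(temp) > max_temp_c:
            problems.append(f"GPU {idx}: temperature {temp} C above {max_temp_c} C")
    return problems

if __name__ == "__main__":
    issues = check_gpus()
    if issues:
        # In a real cluster this would open a ticket and cordon the node instead.
        print("UNHEALTHY:", *issues, sep="\n  ")
    else:
        print("all GPUs look healthy")
```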

## Conclusion
Understanding the infrastructure needed to train frontier models might seem like a niche skill that only a few engineers inside AI labs need. That is true only until an ML scientist hits a hardware error, which is inevitable given how often today's hardware fails; knowing the underlying infrastructure makes those situations much easier to navigate. Moreover, building training software that plays to the strengths of your cluster and works around its weaknesses saves a lot of money, and it might just be the differentiator between you and your competitors in the space.
2 changes: 1 addition & 1 deletion _tabs/about.md
icon: fas fa-info-circle
order: 4
---

I'm Ahmed Tremo, an Applied Scientist at [Microsoft](https://www.microsoft.com). I previously worked as a machine learning engineer at [Dell Technologies](https://www.dell.com/) and as a data engineer at [Lumin Systems](http://excelsystems-eg.com/).

I graduated from the [German University in Cairo](https://www.guc.edu.eg/) with a bachelor's degree in Computer Engineering, earning an Excellent with Honors grade.

