Setting up the infrastructure for training the latest frontier models is no easy feat; only a few companies have the scale to do it (Meta, Microsoft, Google, ...). ML training has escalated from requiring up to 512 GPUs to needing 16K H100s to train Meta's latest Llama 3 405B model. That sheer number of GPUs working in tandem poses a huge infrastructure challenge and has demanded significant innovation, since LLM distributed training jobs require synchronous communication and gang scheduling.

Understanding the underlying infrastructure used to train the latest LLMs is essential for ML scientists to maximize the MFU (Model FLOPs Utilization), especially as infrastructure costs rise. For example, AI labs are currently racing to build the first 100K H100 cluster that would cost an estimated [$4 billion](https://www.semianalysis.com/p/100000-h100-clusters-power-network). With that in mind, here’s an overview of the components required for building the infrastructure for the latest & greatest LLMs.

![Meta's 24k Cluster](/assets/img/posts/2024-07-27-What-Infra-does-it-take-to-train-llama405b/Infra%20Networking%20cluster.jpg)
__Meta’s 24k cluster design with a 7:1 oversubscription ratio__
## Networking

The first and most important step is designing how gradients flow across this huge number of GPUs. As mentioned above, distributed training relies on synchronous collectives such as all-reduce, all-gather, and broadcast to combine and share gradients. As model sizes increase (reportedly [1.8 trillion](https://www.semianalysis.com/p/100000-h100-clusters-power-network) parameters for GPT-4), multiple parallelism techniques (tensor, context, pipeline, and data), together known as 4D parallelism, become necessary, and they require even more communication.

In the ideal scenario, a GPU can communicate with any other GPU at full bandwidth (400 Gbps) using the latest InfiniBand connection speed. However, achieving this for clusters of 100K GPUs would require a vast number of switches and transceivers to handle the communication traffic, making it cost prohibitive. To mitigate this, network architects make a trade-off: they oversubscribe the aggregation (top) layer, as shown in the figure of Meta's 24K cluster design with a 7:1 ratio, to reduce the overall cost.

GPUs within the same rack have full bisection bandwidth with one another. Making communication patterns network-aware is therefore essential to utilize the hardware efficiently and to avoid stragglers (slower-performing nodes in a distributed system) that could slow down the entire cluster. For example, Meta forked Nvidia's NCCL library to optimize the communication patterns for their cluster design.
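To make the synchronous-communication point concrete, here is a minimal sketch (PyTorch with the NCCL backend assumed; the script name and `torchrun` launch are hypothetical) of data-parallel gradient averaging with an all-reduce. Real frameworks bucket gradients and overlap communication with the backward pass, but the collective at the core is the same:

```python
import os
import torch
import torch.distributed as dist

# Launch with e.g.: torchrun --nproc_per_node=8 allreduce_sketch.py
dist.init_process_group(backend="nccl")       # NCCL routes traffic over NVLink / InfiniBand
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(32, 4096, device="cuda")
model(x).square().mean().backward()           # each rank computes local gradients

# Synchronous step: every rank must participate before any rank can continue.
world_size = dist.get_world_size()
for p in model.parameters():
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= world_size                      # average gradients across ranks

dist.destroy_process_group()
```

At cluster scale, which ranks end up in the same all-reduce ring or tree is exactly where the network-aware tuning mentioned above pays off.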

## Storage

Training LLMs is memory-bound. While compute capability has grown rapidly across GPU generations (A100 → H100), memory capacity per GPU has grown much more slowly: A100s top out at 80 GB of HBM2e, while H100s offer 80 GB (or slightly more in some variants) of faster HBM3. More memory is essential for storing model weights, activations, and optimizer states (with Adam being the most popular optimizer, storing 3x the parameters: one copy for the parameters themselves, one for the first moment (mean of gradients), and one for the second moment (variance of gradients)). With the rumored size of GPT-4 (1.8 trillion parameters), a total of 10.8 terabytes of memory would be required for training.
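As a back-of-the-envelope check on that 10.8 TB figure (assuming 2 bytes per value, e.g. bf16, and ignoring activations, gradients, and framework overhead):

```python
PARAMS = 1.8e12          # rumored GPT-4 parameter count
BYTES_PER_VALUE = 2      # assume bf16 for weights and both Adam moments
STATES_PER_PARAM = 3     # weights + first moment + second moment

total_bytes = PARAMS * STATES_PER_PARAM * BYTES_PER_VALUE
print(f"{total_bytes / 1e12:.1f} TB of training state")              # -> 10.8 TB

H100_HBM_BYTES = 80e9    # 80 GB per GPU
print(f"~{total_bytes / H100_HBM_BYTES:.0f} H100s just to hold it")  # -> ~135 GPUs
```

In practice, gradients, higher-precision optimizer copies, and activations push the real footprint considerably higher, which is part of why the model state is sharded across thousands of GPUs.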

Additionally, storage is required for checkpointing (saving model weights frequently), both to recover in case of failure and to pick the best-performing version of the model (if it starts overfitting with more training). Two common approaches:

- **Traditional way**: offload to CPU memory and then to persistent storage (adds delay but is simpler to do; a minimal sketch follows this list).
- **Recent way**: use spare GPUs' HBM and simply RDMA-copy the current model state for checkpointing; fast but costly.
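Here is a minimal sketch of the traditional approach (PyTorch assumed; the helper and path are hypothetical). The training loop blocks only for the device-to-host copy, while the slower write to persistent storage happens in a background thread; optimizer state would be handled the same way:

```python
import threading
import torch

def async_checkpoint(model: torch.nn.Module, step: int, path: str) -> None:
    # 1) Blocking but relatively quick: copy weights from GPU HBM to CPU memory.
    cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}

    # 2) Slow part, off the critical path: persist to storage in the background
    #    so the GPUs can resume training immediately.
    def _write() -> None:
        torch.save({"step": step, "model": cpu_state}, path)

    threading.Thread(target=_write, daemon=True).start()
```

The recent approach replaces both steps with a GPU-to-GPU copy into a spare node's HBM, trading extra hardware for much shorter stalls.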

The training dataset itself must also be stored: the 15.6 trillion tokens used for Llama 3 required building out 240 PB of storage for training, and fast read speeds are needed to avoid wasting GPU cycles.

To ensure efficient training, the following aspects of storage must be considered:

1. **High I/O Throughput**: Ensuring fast data read and write speeds to avoid GPU downtime (see the sketch after this list).
2. **Scalability and Fault Tolerance**: The storage system must scale with the growing dataset size and ensure data redundancy to protect against hardware failures.
3. **Integration with Compute**: Seamless integration with the compute infrastructure to allow high-speed data transfer between storage and GPUs.
4. **Intermediate and Result Storage**: Handling storage for intermediate results, logs, and multiple versions of model checkpoints efficiently.
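As a small illustration of the throughput point, here is a sketch (PyTorch assumed; the dataset class is a hypothetical stand-in for reading tokenized shards) of keeping GPUs fed by overlapping data loading with compute via worker processes, pinned memory, and prefetching:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class TokenShardDataset(Dataset):
    """Hypothetical stand-in for a dataset that reads tokenized shards from storage."""
    def __len__(self) -> int:
        return 10_000
    def __getitem__(self, idx: int) -> torch.Tensor:
        return torch.randint(0, 128_000, (4096,))   # one "sequence" of token ids

loader = DataLoader(
    TokenShardDataset(),
    batch_size=8,
    num_workers=8,            # parallel readers hide storage latency
    pin_memory=True,          # page-locked buffers speed up host-to-device copies
    prefetch_factor=4,        # each worker keeps several batches in flight
    persistent_workers=True,  # don't respawn workers every epoch
)

for tokens in loader:
    tokens = tokens.cuda(non_blocking=True)  # overlaps the copy with compute
    # ... forward/backward step ...
    break
```

At Llama scale this sits on top of a distributed file system and sharded, pre-tokenized data, but the principle of hiding storage latency behind compute is the same.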

## Compute

Nvidia is the leader in compute, with an estimated market share of 80% to 90%, and recently became the most valuable company in the world. H100s are currently in mass production and were used to train Llama 3 405B, with most AI labs competing to build clusters around them due to their leading performance and AI-friendly software stack (CUDA & cuDNN). AMD is trying to gain a share of this market with its MI250, and cloud providers are starting to build their own chips. Google's TPUs, used to train the Gemini family of models, are notable, but the market does not look likely to change significantly in the foreseeable future.

| Feature | Nvidia H100 | AMD MI250 | Google TPU v4 |
|---------------------|-------------|------------|---------------|
## Reliability

Ensuring fault tolerance and performing regular health checks are crucial for maintaining training runs at this scale:
1. **Spare GPU Capacity**:
- [Imbue's recommendation](https://imbue.com/research/70b-infrastructure/) is to maintain 10-20% more GPUs than required, which allows quick replacement of failed GPUs and ensures the training run is not halted by a single failure.

2. **Health Checks**:
- Implement automated scripts to detect faulty hardware (GPUs, InfiniBand links, host machines, etc.).
- Measure power consumption, temperature, and fan speed of each GPU to detect potential failures early.

3. **Network Reliability**:
- Networks can fail due to link flapping, host machine failures, or power supply issues; use redundant paths and automatic failover mechanisms to ensure continuous operation.

4. **Golden Sets of Machines**:
- Maintain a set of machines that have been verified as reliable by running stress tests that maximize hardware utilization, which separates great machines from mediocre ones.

5. **Checkpointing**:
- Use spare GPUs' HBM to RDMA-copy (Remote Direct Memory Access) the current model state, a faster but more costly alternative to offloading through CPU memory. This method leverages the GPUs' high-bandwidth memory and fast RDMA transfers, reducing the time required for checkpointing.

6. **Proactive Checks & Automated Recovery**:
- Regularly measure and log power consumption, temperature, and fan speed (a minimal monitoring sketch follows).
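A minimal sketch of such a per-GPU check (the query fields are standard `nvidia-smi` properties; the alert threshold is a hypothetical example, and a real system would also export metrics to a time-series store and inspect InfiniBand counters):

```python
import csv
import subprocess

FIELDS = "index,power.draw,temperature.gpu,fan.speed"
TEMP_LIMIT_C = 85.0   # hypothetical alert threshold

out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for row in csv.reader(out.strip().splitlines()):
    index, power_w, temp_c, fan = [field.strip() for field in row]
    # fan.speed reports "[N/A]" on passively cooled data-center (SXM) boards.
    if temp_c != "[N/A]" and float(temp_c) > TEMP_LIMIT_C:
        print(f"GPU {index}: {temp_c} C at {power_w} W -- flag node for inspection")
```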