Skip to content

Packages and instructions for training and inference of LLMs on NVIDIA's new GH200 machines

License

Notifications You must be signed in to change notification settings

abacusai/gh200-llm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GH200s

The new NVIDIA GH200 chips hold a lot of promise because of their high CPU->GPU memory bandwidth (7x the H100 DGX setup). Node to node communication along is supported by InfiniBand HDR 200Gbps non-blocking and ConnectX-7 NIC ratio for GPUDirect RDMA. NVIDIA has published some impressive raw numbers.

We ran some initial real world tests comparing a new batch of GH200 96GB GPU memory, 480GB node memory nodes that we got from Lambda Labs. We did the initial tests on CUDA 12.2 and are waiting to repeat this with CUDA 12.3 which has some additional optimizations for CPU->GPU memory paging.

To run tests on the GH200, we had to compile various libraries like Flash Attention, xformers, VLLM on CUDA 12.2 and ARM64 (the Grace Hopper CPU has ARM silicon). You can find the Docker image published at: ghcr.io/abacusai/gh200-llm/llm-train-serve:latest

The main use cases that we were interested in testing were inference on large context requests on large models and full finetuning of smaller 13B and 7B models.

Large Context 70B Inference Performance

We compared 4 H100 (on a 8xH100 DGX) and 4 GH200 MGX nodes serving Abacus Giraffe (a finetuned Llama2 70B 32k context LLM. We used Flash Attention 2 and VLLM for serving these models.

Input Tokens Output Tokens Concurrency 4xH100 Time (secs) 4xGH200 Time (secs) Improvement
1 31,242 100 1 18.229 9.699 87.94%
2 3,349 100 8 13.836 8.271 67.28%

Numbers are promising, we get almost a 2x improvement in latency at larger contexts.

Small Model Finetuning Performance

We compared full finetuning on a 13B base model (not LoRA) between 2 H100s and 1 GH200 with Deepspeed gradient offloads to CPU.

Hardware Micro Batch Size Batch Size Secs/iteration
2xH100 64 128 14
1xGH200 128 128 10

A 56% improved with 1/2 the number of nodes makes this fairly appealing especially given that keeping the training to a single node reduces a lot of operational complexity.

About

Packages and instructions for training and inference of LLMs on NVIDIA's new GH200 machines

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages