
Commit

updating post header
AhmedTremo committed Aug 5, 2024
1 parent 8eb7d5d commit 50a6686
Showing 1 changed file with 1 addition and 1 deletion.
_posts/2024-08-05-How-to-Efficiently-serve-an-llm (2 changes: 1 addition & 1 deletion)
@@ -6,7 +6,7 @@ tags: [LLM, inference, optimization, serving]
author: tremo
---

-# How to Efficiently Serve an LLM
+## How to Efficiently Serve an LLM

LLMs, or **Large** Language Models, are so named because their size ranges from tens to hundreds of billions of parameters. Their utility is clear: LLMs are setting new benchmarks on various evaluations and now often match or exceed human performance on multiple tasks ([GPT-4 Technical Report (arxiv.org)](https://arxiv.org/html/2303.08774v4)). Consequently, many companies are eager to deploy them in production. However, the unprecedented size of LLMs creates significant challenges in serving them: slow token generation (tokens/second), memory limits for loading the model parameters and the KV cache (explained later), compute limits, and more. In this article, we will cover several recent ideas to help set up a robust LLM serving system.
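To make the memory challenge concrete, here is a rough back-of-envelope sketch. The model shape, batch size, and sequence length below are illustrative assumptions (a Llama-2-70B-like configuration in fp16), not figures from the post:

```python
# Rough serving-memory estimate. The shape below (70B params, 80 layers,
# 8 KV heads, head_dim 128, fp16) is an assumed Llama-2-70B-like
# configuration used only for illustration.

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory just to hold the weights (fp16/bf16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

def kv_cache_memory_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                       seq_len: int, batch_size: int,
                       bytes_per_value: int = 2) -> float:
    """KV cache: one K and one V vector per layer, per KV head, per token, per sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size / 1e9

print(f"weights : {weight_memory_gb(70e9):.0f} GB")                    # ~140 GB
print(f"KV cache: {kv_cache_memory_gb(80, 8, 128, 4096, 16):.0f} GB")  # ~21 GB for 16 sequences of 4k tokens
```

Under these assumptions the weights alone already exceed a single 80 GB accelerator, before any KV cache or activation memory, which is why the serving techniques discussed in the article matter.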


0 comments on commit 50a6686
