
Commit

updating post header
AhmedTremo committed Aug 5, 2024
1 parent 8eb7d5d commit 50a6686
Showing 1 changed file with 1 addition and 1 deletion.
_posts/2024-08-05-How-to-Efficiently-serve-an-llm (2 changes: 1 addition & 1 deletion)
@@ -6,7 +6,7 @@ tags: [LLM, inference, optimization, serving]
author: tremo
---

-# How to Efficiently Serve an LLM
+## How to Efficiently Serve an LLM

LLMs, or **Large** Language Models, are so named because their size ranges from tens to hundreds of billions of parameters. Their utility is clear: LLMs are setting new benchmarks on various evaluations and now often match or exceed human performance on multiple tasks ([GPT-4 Technical Report (arxiv.org)](https://arxiv.org/html/2303.08774v4)). Consequently, many companies are eager to deploy them in production. However, the unprecedented size of LLMs creates significant challenges in serving them: slow token generation (tokens/second), memory limits for loading the model parameters and the KV cache (explained later), compute limits, and more. In this article, we will cover several recent ideas to help set up a robust LLM serving system.
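To make the memory challenge concrete, here is a rough back-of-envelope sketch. The model shape, batch size, and sequence length below are illustrative assumptions (a Llama-2-70B-like configuration in fp16), not figures from the post:

```python
# Rough serving-memory estimate. The shape below (70B params, 80 layers,
# 8 KV heads, head_dim 128, fp16) is an assumed Llama-2-70B-like
# configuration used only for illustration.

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory just to hold the weights (fp16/bf16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

def kv_cache_memory_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                       seq_len: int, batch_size: int,
                       bytes_per_value: int = 2) -> float:
    """KV cache: one K and one V vector per layer, per KV head, per token, per sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size / 1e9

print(f"weights : {weight_memory_gb(70e9):.0f} GB")                    # ~140 GB
print(f"KV cache: {kv_cache_memory_gb(80, 8, 128, 4096, 16):.0f} GB")  # ~21 GB for 16 sequences of 4k tokens
```

Under these assumptions the weights alone already exceed a single 80 GB accelerator, before any KV cache or activation memory, which is why the serving techniques discussed in the article matter.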


0 comments on commit 50a6686
