Commit
third draft llm serving post
AhmedTremo committed Aug 5, 2024
1 parent 239bf7c commit b9d0e17
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions _posts/2024-08-05-How-to-Efficiently-serve-an-llm.md
@@ -25,11 +25,13 @@ LLMs, or **Large** Language Models, are named so because they can range from tens
3. After scheduling, the LLM Inference process is divided into two phases:
- **Prefill phase**: The LLM processes all input tokens in parallel and stores their attention key/value activations, known as the “KV cache”. This step makes full use of the GPU's parallel processing capabilities, which is why input tokens are generally much cheaper than output tokens (as seen in the GPT-4o pricing chart). This phase produces the first output token and is typically compute-bound.

- ![gpt-4o pricing](/assets/img/posts/2024-08-05-How-to-Efficiently-serve-an-llm/gpt-4o%20pricing.png)__GPT-4o Pricing__
- ![gpt-4o pricing](/assets/img/posts/2024-08-05-How-to-Efficiently-serve-an-llm/gpt-4o%20pricing.png)
__GPT-4o Pricing__

- **Decode phase**: The LLM then generates output tokens autoregressively, one at a time. This phase is slower (it is typically memory-bound rather than compute-bound) and is where optimizations are **necessary**. At each step, the new token's key/value activations are appended to the KV cache of the previous tokens and used to generate the next token (see the sketch after this list).

- ![KV Cache](/assets/img/posts/2024-08-05-How-to-Efficiently-serve-an-llm/KV%20Caching%20explanation%20&%20reuse.png)__KV Cache Explanation & Reuse__
- ![KV Cache](/assets/img/posts/2024-08-05-How-to-Efficiently-serve-an-llm/KV%20Caching%20explanation%20&%20reuse.png)
__KV Cache Explanation & Reuse__
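
To make the two phases concrete, here is a minimal sketch of prefill plus KV-cache decode using the Hugging Face `transformers` API. It is only an illustration of the mechanics described above, not a serving implementation; the model name (`gpt2`), the prompt, and greedy decoding are placeholder assumptions.

```python
# Minimal prefill/decode sketch with an explicit KV cache.
# Assumptions: transformers + PyTorch are installed; "gpt2" is a placeholder
# model and greedy (argmax) decoding is used for simplicity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Serving an LLM efficiently means"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one parallel forward pass over all prompt tokens.
    # It builds the KV cache and yields the logits for the first output token.
    out = model(input_ids, use_cache=True)
    kv_cache = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    # Decode: one token per step. Only the newest token is fed to the model;
    # its keys/values are appended to the cache, so earlier tokens are never
    # recomputed.
    for _ in range(32):
        out = model(next_token, past_key_values=kv_cache, use_cache=True)
        kv_cache = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(prompt + tokenizer.decode(torch.cat(generated, dim=1)[0]))
```

The loop makes the cost asymmetry visible: prefill is one large, parallel, matrix-multiply-heavy pass, while each decode step does little compute but must read the entire cache, which is why the decode phase is the main target of the optimizations below.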

## Optimizations

