diff --git a/notebooks/en/tgi_messages_api_demo.ipynb b/notebooks/en/tgi_messages_api_demo.ipynb
index c99a6bdf..cba7be74 100644
--- a/notebooks/en/tgi_messages_api_demo.ipynb
+++ b/notebooks/en/tgi_messages_api_demo.ipynb
@@ -8,7 +8,7 @@
"\n",
"_Authored by: [Andrew Reed](https://huggingface.co/andrewrreed)_\n",
"\n",
- "This notebook demonstrates how you can easily transition from OpenAI models for Open LLMs without needing to refactor any existing code.\n",
+ "This notebook demonstrates how you can easily transition from OpenAI models to Open LLMs without needing to refactor any existing code.\n",
"\n",
"[Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) now offers a Messages API, making it directly compatible with the OpenAI Chat Completion API. This means that any existing scripts that use OpenAI models (via the OpenAI client library or third-party tools like LangChain or LlamaIndex) can be directly swapped out to use any open LLM running on a TGI endpoint!\n",
"\n",
@@ -20,9 +20,20 @@
"\n",
"In this notebook, we'll show you how to:\n",
"\n",
- "- [Create Inference Endpoint to Deploy a Model with TGI](#create-an-inference-endpoint)\n",
- "- [Query the Inference Endpoint with OpenAI Client Libraries](#using-inference-endpoints-with-openai-client-libraries)\n",
- "- [Integrate the Endpoint with LangChain and LlamaIndex Workflows](#integrate-with-langchain-and-llamaindex)\n"
+ "1. [Create Inference Endpoint to Deploy a Model with TGI](#section_1)\n",
+ "2. [Query the Inference Endpoint with OpenAI Client Libraries](#section_2)\n",
+ "3. [Integrate the Endpoint with LangChain and LlamaIndex Workflows](#section_3)\n",
+ "\n",
+ "**Let's dive in!**\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Setup\n",
+ "\n",
+ "First we need to install dependencies and set an HF API key.\n"
]
},
{
@@ -51,7 +62,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Create an Inference Endpoint\n",
+ "\n",
+ "\n",
+ "## 1. Create an Inference Endpoint\n",
"\n",
"To get started, let's deploy [Nous-Hermes-2-Mixtral-8x7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO), a fine-tuned Mixtral model, to Inference Endpoints using TGI.\n",
"\n",
@@ -116,14 +129,16 @@
"\n",
"Great, we now have a working endpoint!\n",
"\n",
- "> Note: When deploying with `huggingface_hub`, your endpoint will scale-to-zero after 15 minutes of idle time by default to optimize cost during periods of inactivity. Check out [the Hub Python Library documentation](https://huggingface.co/docs/huggingface_hub/guides/inference_endpoints) to see all the functionality available for managing your endpoint lifecycle.\n"
+ "_Note: When deploying with `huggingface_hub`, your endpoint will scale-to-zero after 15 minutes of idle time by default to optimize cost during periods of inactivity. Check out [the Hub Python Library documentation](https://huggingface.co/docs/huggingface_hub/guides/inference_endpoints) to see all the functionality available for managing your endpoint lifecycle._\n"
]
},
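+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For programmatic control over the endpoint lifecycle, `huggingface_hub` exposes helper methods on the endpoint object. A minimal sketch, assuming a placeholder endpoint name:\n",
+ "\n",
+ "```python\n",
+ "from huggingface_hub import get_inference_endpoint\n",
+ "\n",
+ "# Fetch a handle to an existing endpoint by name (placeholder name)\n",
+ "endpoint = get_inference_endpoint(\"nous-hermes-2-mixtral-demo\")\n",
+ "\n",
+ "endpoint.pause()   # pause the endpoint (not billed while paused)\n",
+ "endpoint.resume()  # bring the endpoint back up\n",
+ "endpoint.wait()    # block until the endpoint is ready to serve requests\n",
+ "```\n"
+ ]
+ },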
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Using Inference Endpoints with OpenAI client libraries\n",
+ "\n",
+ "\n",
+ "## 2. Query the Inference Endpoint with OpenAI Client Libraries\n",
"\n",
"As mentioned above, since our model is hosted with TGI it now supports a Messages API meaning we can query it directly using the familiar OpenAI client libraries.\n"
]
@@ -197,7 +212,7 @@
"source": [
"Behind the scenes, TGI’s Messages API automatically converts the list of messages into the model’s required instruction format using its [chat template](https://huggingface.co/docs/transformers/chat_templating).\n",
"\n",
- "> Note: Certain OpenAI features, like function calling, are not compatible with TGI. Currently, the Messages API supports the following chat completion parameters: `stream`, `max_new_tokens`, `frequency_penalty`, `logprobs`, `seed`, `temperature`, and `top_p`.\n"
+ "_Note: Certain OpenAI features, like function calling, are not compatible with TGI. Currently, the Messages API supports the following chat completion parameters: `stream`, `max_new_tokens`, `frequency_penalty`, `logprobs`, `seed`, `temperature`, and `top_p`._\n"
]
},
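+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you're curious what this conversion produces, you can apply the model's chat template locally with `transformers`. A minimal sketch (the example messages are illustrative):\n",
+ "\n",
+ "```python\n",
+ "from transformers import AutoTokenizer\n",
+ "\n",
+ "tokenizer = AutoTokenizer.from_pretrained(\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\")\n",
+ "\n",
+ "messages = [\n",
+ "    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
+ "    {\"role\": \"user\", \"content\": \"Why is open-source software important?\"},\n",
+ "]\n",
+ "\n",
+ "# Render the message list into the model's expected prompt format (ChatML for this model)\n",
+ "prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
+ "print(prompt)\n",
+ "```\n"
+ ]
+ },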
{
@@ -239,7 +254,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Integrate with LangChain and LlamaIndex\n",
+ "\n",
+ "\n",
+ "## 3. Integrate with LangChain and LlamaIndex\n",
"\n",
"Now, let’s see how to use this newly created endpoint with popular RAG frameworks like LangChain and LlamaIndex.\n"
]
@@ -285,6 +302,7 @@
"metadata": {},
"source": [
"We’re able to directly leverage the same `ChatOpenAI` class that we would have used with the OpenAI models. This allows all previous code to work with our endpoint by changing just one line of code.\n",
+ "\n",
"Let’s now use our Mixtral model in a simple RAG pipeline to answer a question over the contents of a HF blog post.\n"
]
},
@@ -363,7 +381,9 @@
"source": [
"### How to use with LlamaIndex\n",
"\n",
- "Similarly, you can also use a TGI endpoint in [LlamaIndex](https://www.llamaindex.ai/). We’ll use the `OpenAILike` class, and instantiate it by configuring some additional arguments (i.e. `is_local`, `is_function_calling_model`, `is_chat_model`, `context_window`). Note that the context window argument should match the value previously set for `MAX_TOTAL_TOKENS` of your endpoint.\n"
+ "Similarly, you can also use a TGI endpoint in [LlamaIndex](https://www.llamaindex.ai/). We’ll use the `OpenAILike` class, and instantiate it by configuring some additional arguments (i.e. `is_local`, `is_function_calling_model`, `is_chat_model`, `context_window`).\n",
+ "\n",
+ "_Note: that the context window argument should match the value previously set for `MAX_TOTAL_TOKENS` of your endpoint._\n"
]
},
{