diff --git a/notebooks/en/tgi_messages_api_demo.ipynb b/notebooks/en/tgi_messages_api_demo.ipynb
index c99a6bdf..cba7be74 100644
--- a/notebooks/en/tgi_messages_api_demo.ipynb
+++ b/notebooks/en/tgi_messages_api_demo.ipynb
@@ -8,7 +8,7 @@
     "\n",
     "_Authored by: [Andrew Reed](https://huggingface.co/andrewrreed)_\n",
     "\n",
-    "This notebook demonstrates how you can easily transition from OpenAI models for Open LLMs without needing to refactor any existing code.\n",
+    "This notebook demonstrates how you can easily transition from OpenAI models to Open LLMs without needing to refactor any existing code.\n",
     "\n",
     "[Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) now offers a Messages API, making it directly compatible with the OpenAI Chat Completion API. This means that any existing scripts that use OpenAI models (via the OpenAI client library or third-party tools like LangChain or LlamaIndex) can be directly swapped out to use any open LLM running on a TGI endpoint!\n",
     "\n",
@@ -20,9 +20,20 @@
     "\n",
     "In this notebook, we'll show you how to:\n",
     "\n",
-    "- [Create Inference Endpoint to Deploy a Model with TGI](#create-an-inference-endpoint)\n",
-    "- [Query the Inference Endpoint with OpenAI Client Libraries](#using-inference-endpoints-with-openai-client-libraries)\n",
-    "- [Integrate the Endpoint with LangChain and LlamaIndex Workflows](#integrate-with-langchain-and-llamaindex)\n"
+    "1. [Create Inference Endpoint to Deploy a Model with TGI](#section_1)\n",
+    "2. [Query the Inference Endpoint with OpenAI Client Libraries](#section_2)\n",
+    "3. [Integrate the Endpoint with LangChain and LlamaIndex Workflows](#section_3)\n",
+    "\n",
+    "**Let's dive in!**\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "First we need to install dependencies and set an HF API key.\n"
    ]
   },
   {
@@ -51,7 +62,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Create an Inference Endpoint\n",
+    "<a id=\"section_1\"></a>\n",
+    "\n",
+    "## 1. Create an Inference Endpoint\n",
     "\n",
     "To get started, let's deploy [Nous-Hermes-2-Mixtral-8x7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO), a fine-tuned Mixtral model, to Inference Endpoints using TGI.\n",
     "\n",
@@ -116,14 +129,16 @@
     "\n",
     "Great, we now have a working endpoint!\n",
     "\n",
-    "> Note: When deploying with `huggingface_hub`, your endpoint will scale-to-zero after 15 minutes of idle time by default to optimize cost during periods of inactivity. Check out [the Hub Python Library documentation](https://huggingface.co/docs/huggingface_hub/guides/inference_endpoints) to see all the functionality available for managing your endpoint lifecycle.\n"
+    "_Note: When deploying with `huggingface_hub`, your endpoint will scale-to-zero after 15 minutes of idle time by default to optimize cost during periods of inactivity. Check out [the Hub Python Library documentation](https://huggingface.co/docs/huggingface_hub/guides/inference_endpoints) to see all the functionality available for managing your endpoint lifecycle._\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Using Inference Endpoints with OpenAI client libraries\n",
+    "<a id=\"section_2\"></a>\n",
+    "\n",
+    "## 2. Query the Inference Endpoint with OpenAI Client Libraries\n",
     "\n",
     "As mentioned above, since our model is hosted with TGI it now supports a Messages API meaning we can query it directly using the familiar OpenAI client libraries.\n"
    ]
@@ -197,7 +212,7 @@
    "source": [
     "Behind the scenes, TGI’s Messages API automatically converts the list of messages into the model’s required instruction format using its [chat template](https://huggingface.co/docs/transformers/chat_templating).\n",
     "\n",
-    "> Note: Certain OpenAI features, like function calling, are not compatible with TGI. Currently, the Messages API supports the following chat completion parameters: `stream`, `max_new_tokens`, `frequency_penalty`, `logprobs`, `seed`, `temperature`, and `top_p`.\n"
+    "_Note: Certain OpenAI features, like function calling, are not compatible with TGI. Currently, the Messages API supports the following chat completion parameters: `stream`, `max_new_tokens`, `frequency_penalty`, `logprobs`, `seed`, `temperature`, and `top_p`._\n"
    ]
   },
   {
@@ -239,7 +254,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Integrate with LangChain and LlamaIndex\n",
+    "<a id=\"section_3\"></a>\n",
+    "\n",
+    "## 3. Integrate with LangChain and LlamaIndex\n",
     "\n",
     "Now, let’s see how to use this newly created endpoint with popular RAG frameworks like LangChain and LlamaIndex.\n"
    ]
@@ -285,6 +302,7 @@
    "metadata": {},
    "source": [
     "We’re able to directly leverage the same `ChatOpenAI` class that we would have used with the OpenAI models. This allows all previous code to work with our endpoint by changing just one line of code.\n",
+    "\n",
     "Let’s now use our Mixtral model in a simple RAG pipeline to answer a question over the contents of a HF blog post.\n"
    ]
   },
@@ -363,7 +381,9 @@
    "source": [
     "### How to use with LlamaIndex\n",
     "\n",
-    "Similarly, you can also use a TGI endpoint in [LlamaIndex](https://www.llamaindex.ai/). We’ll use the `OpenAILike` class, and instantiate it by configuring some additional arguments (i.e. `is_local`, `is_function_calling_model`, `is_chat_model`, `context_window`). Note that the context window argument should match the value previously set for `MAX_TOTAL_TOKENS` of your endpoint.\n"
+    "Similarly, you can also use a TGI endpoint in [LlamaIndex](https://www.llamaindex.ai/). We’ll use the `OpenAILike` class, and instantiate it by configuring some additional arguments (i.e. `is_local`, `is_function_calling_model`, `is_chat_model`, `context_window`).\n",
+    "\n",
+    "_Note: The context window argument should match the value previously set for `MAX_TOTAL_TOKENS` of your endpoint._\n"
    ]
   },
   {