diff --git a/notebooks/en/rag_zephyr_langchain.ipynb b/notebooks/en/rag_zephyr_langchain.ipynb
index 992d5820..55738b98 100644
--- a/notebooks/en/rag_zephyr_langchain.ipynb
+++ b/notebooks/en/rag_zephyr_langchain.ipynb
@@ -140,11 +140,7 @@
    "source": [
     "The content of individual GitHub issues may be longer than what an embedding model can take as input. If we want to embed all of the available content, we need to chunk the documents into appropriately sized pieces.\n",
     "\n",
-    "The most common and straightforward approach to chunking is to define a fixed size of chunks and whether there should be any overlap between them. Keeping some overlap between chunks allows us to preserve some semantic context between the chunks.\n",
-    "\n",
-    "Other approaches are typically more involved and take into account the documents' structure and context. For example, one may want to split a document based on sentences or paragraphs, or create chunks based on the\n",
-    "\n",
-    "The fixed-size chunking, however, works well for most common cases, so that is what we'll do here."
+    "The most common and straightforward approach to chunking is to define a fixed size of chunks and whether there should be any overlap between them. Keeping some overlap between chunks allows us to preserve some semantic context between the chunks. The recommended splitter for generic text is the [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter), and that's what we'll use here."
    ]
   },
   {
@@ -155,9 +151,9 @@
    },
    "outputs": [],
    "source": [
-    "from langchain.text_splitter import CharacterTextSplitter\n",
+    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
     "\n",
-    "splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=30)\n",
+    "splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)\n",
     "\n",
     "chunked_docs = splitter.split_documents(docs)"
    ]
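
A minimal sketch (not part of the patch) of what the swapped-in splitter does, assuming the langchain APIs imported in the notebook; the sample text is invented for illustration:

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.schema import Document

    # A made-up stand-in for a long GitHub issue body.
    doc = Document(page_content="Steps to reproduce the crash: ...\n\n" * 40)

    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)
    chunks = splitter.split_documents([doc])

    # The splitter tries separators in order ("\n\n", "\n", " ", ""), so
    # chunks break on paragraph boundaries where possible; each chunk is
    # at most 512 characters, with up to 30 characters of overlap between
    # consecutive chunks.
    print(len(chunks), max(len(c.page_content) for c in chunks))

This also motivates the swap: CharacterTextSplitter splits on a single separator ("\n\n" by default), so a paragraph longer than chunk_size stays in one oversized chunk, while RecursiveCharacterTextSplitter keeps falling back to finer separators until every chunk fits within the limit.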