use RecursiveCharacterTextSplitter to better split docs #41

Merged
10 changes: 3 additions & 7 deletions notebooks/en/rag_zephyr_langchain.ipynb
@@ -140,11 +140,7 @@
"source": [
"The content of individual GitHub issues may be longer than what an embedding model can take as input. If we want to embed all of the available content, we need to chunk the documents into appropriately sized pieces.\n",
"\n",
"The most common and straightforward approach to chunking is to define a fixed size of chunks and whether there should be any overlap between them. Keeping some overlap between chunks allows us to preserve some semantic context between the chunks.\n",
"\n",
"Other approaches are typically more involved and take into account the documents' structure and context. For example, one may want to split a document based on sentences or paragraphs, or create chunks based on the\n",
"\n",
"The fixed-size chunking, however, works well for most common cases, so that is what we'll do here."
"The most common and straightforward approach to chunking is to define a fixed size of chunks and whether there should be any overlap between them. Keeping some overlap between chunks allows us to preserve some semantic context between the chunks. The recommended splitter for generic text is the [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter), and that's what we'll use here. "
]
},
{
@@ -155,9 +151,9 @@
},
"outputs": [],
"source": [
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
"splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=30)\n",
"splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)\n",
"\n",
"chunked_docs = splitter.split_documents(docs)"
]
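For reference, a minimal sketch of how the new splitter is used, mirroring the parameters in the diff above. The sample Document and its text are invented for illustration and are not part of the PR; only the splitter class and its arguments come from the change itself.

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.docstore.document import Document

    # Same parameters as the notebook cell in this PR: 512-character chunks with a
    # 30-character overlap so neighbouring chunks keep some shared context.
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)

    # Hypothetical stand-in for the GitHub issues loaded earlier in the notebook.
    docs = [Document(page_content="First paragraph of an issue.\n\n" + "Further discussion. " * 60)]

    chunked_docs = splitter.split_documents(docs)

    # The recursive splitter tries "\n\n", then "\n", then " ", then "" as separators,
    # so chunks tend to break at paragraph or word boundaries rather than mid-word,
    # which is why it is preferred over the plain CharacterTextSplitter for generic text.
    print(len(chunked_docs), [len(d.page_content) for d in chunked_docs])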