diff --git a/_ingest-pipelines/processors/text-chunking.md b/_ingest-pipelines/processors/text-chunking.md
index e9ff55b210..d11c380bde 100644
--- a/_ingest-pipelines/processors/text-chunking.md
+++ b/_ingest-pipelines/processors/text-chunking.md
@@ -157,119 +157,11 @@ The response confirms that, in addition to the `passage_text` field, the process
 }
 ```
 
-Once you have created an ingest pipeline, you need to create an index for ingestion and ingest documents into the index. To learn more, see [Step 2: Create an index for ingestion]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-2-create-an-index-for-ingestion) and [Step 3: Ingest documents into the index]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-3-ingest-documents-into-the-index) of the [neural sparse search documentation]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/).
-
-## Chaining text chunking and embedding processors
-
-You can use a `text_chunking` processor as a preprocessing step for a `text_embedding` or `sparse_encoding` processor in order to obtain embeddings for each chunked passage.
-
-**Prerequisites**
-
-Follow the steps outlined in the [pretrained model documentation]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/) to register an embedding model.
-
-**Step 1: Create a pipeline**
-
-The following example request creates an ingest pipeline that converts the text in the `passage_text` field into chunked passages, which will be stored in the `passage_chunk` field. The text in the `passage_chunk` field is then converted into text embeddings, and the embeddings are stored in the `passage_embedding` field:
-
-```json
-PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
-{
-  "description": "A text chunking and embedding ingest pipeline",
-  "processors": [
-    {
-      "text_chunking": {
-        "algorithm": {
-          "fixed_token_length": {
-            "token_limit": 10,
-            "overlap_rate": 0.2,
-            "tokenizer": "standard"
-          }
-        },
-        "field_map": {
-          "passage_text": "passage_chunk"
-        }
-      }
-    },
-    {
-      "text_embedding": {
-        "model_id": "LMLPWY4BROvhdbtgETaI",
-        "field_map": {
-          "passage_chunk": "passage_chunk_embedding"
-        }
-      }
-    }
-  ]
-}
-```
-{% include copy-curl.html %}
-
-**Step 2 (Optional): Test the pipeline**
-
-It is recommended that you test your pipeline before ingesting documents.
-{: .tip}
-
-To test the pipeline, run the following query:
-
-```json
-POST _ingest/pipeline/text-chunking-embedding-ingest-pipeline/_simulate
-{
-  "docs": [
-    {
-      "_index": "testindex",
-      "_id": "1",
-      "_source":{
-         "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
-      }
-    }
-  ]
-}
-```
-{% include copy-curl.html %}
-
-#### Response
-
-The response confirms that, in addition to the `passage_text` and `passage_chunk` fields, the processor has generated text embeddings for each of the three passages in the `passage_chunk_embedding` field. The embedding vectors are stored in the `knn` field for each chunk:
-
-```json
-{
-  "docs": [
-    {
-      "doc": {
-        "_index": "testindex",
-        "_id": "1",
-        "_source": {
-          "passage_chunk_embedding": [
-            {
-              "knn": [...]
-            },
-            {
-              "knn": [...]
-            },
-            {
-              "knn": [...]
-            }
-          ],
-          "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.",
-          "passage_chunk": [
-            "This is an example document to be chunked. The document ",
The document ", - "The document contains a single paragraph, two sentences and 24 ", - "and 24 tokens by standard tokenizer in OpenSearch." - ] - }, - "_ingest": { - "timestamp": "2024-03-20T03:04:49.144054Z" - } - } - } - ] -} -``` - -Once you have created an ingest pipeline, you need to create an index for ingestion and ingest documents into the index. To learn more, see [Step 2: Create an index for ingestion]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-2-create-an-index-for-ingestion) and [Step 3: Ingest documents into the index]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-3-ingest-documents-into-the-index) of the [neural sparse search documentation]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/). +Once you have created an ingest pipeline, you need to create an index for document ingestion. To learn more, see [Text chunking]({{site.url}}{{site.baseurl}}/search-plugins/text-chunking/). ## Cascaded text chunking processors -You can chain multiple chunking processors together. For example, to split documents into paragraphs, apply the `delimiter` algorithm and specify the parameter as `\n\n`. To prevent a paragraph from exceeding the token limit, append another chunking processor that uses the `fixed_token_length` algorithm. You can configure the ingest pipeline for this example as follows: +You can chain multiple text chunking processors together. For example, to split documents into paragraphs, apply the `delimiter` algorithm and specify the parameter as `\n\n`. To prevent a paragraph from exceeding the token limit, append another text chunking processor that uses the `fixed_token_length` algorithm. You can configure the ingest pipeline for this example as follows: ```json PUT _ingest/pipeline/text-chunking-cascade-ingest-pipeline @@ -309,7 +201,7 @@ PUT _ingest/pipeline/text-chunking-cascade-ingest-pipeline ## Next steps +- For a complete example, see [Text chunking]({{site.url}}{{site.baseurl}}/search-plugins/text-chunking/). - To learn more about semantic search, see [Semantic search]({{site.url}}{{site.baseurl}}/search-plugins/semantic-search/). - To learn more about sparse search, see [Neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/). - To learn more about using models in OpenSearch, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model). -- For a comprehensive example, see [Neural search tutorial]({{site.url}}{{site.baseurl}}/search-plugins/neural-search-tutorial/). diff --git a/_search-plugins/neural-sparse-search.md b/_search-plugins/neural-sparse-search.md index 88d30e4391..3932825bbd 100644 --- a/_search-plugins/neural-sparse-search.md +++ b/_search-plugins/neural-sparse-search.md @@ -2,7 +2,7 @@ layout: default title: Neural sparse search nav_order: 50 -has_children: false +has_children: true redirect_from: - /search-plugins/sparse-search/ --- @@ -55,7 +55,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse ``` {% include copy-curl.html %} -To split long text into passages, use the `text_chunking` ingest processor before the `sparse_encoding` processor. For more information, see [Chaining text chunking and embedding processors]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/#chaining-text-chunking-and-embedding-processors). +To split long text into passages, use the `text_chunking` ingest processor before the `sparse_encoding` processor. 
+
 
 ## Step 2: Create an index for ingestion
 
@@ -364,4 +365,8 @@ The response contains both documents:
     ]
   }
 }
-```
\ No newline at end of file
+```
+
+## Next steps
+
+- To learn more about splitting long text into passages for neural search, see [Text chunking]({{site.url}}{{site.baseurl}}/search-plugins/text-chunking/).
\ No newline at end of file
diff --git a/_search-plugins/semantic-search.md b/_search-plugins/semantic-search.md
index 32bd18cd6c..7c3fbb738f 100644
--- a/_search-plugins/semantic-search.md
+++ b/_search-plugins/semantic-search.md
@@ -48,7 +48,7 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline
 ```
 {% include copy-curl.html %}
 
-To split long text into passages, use the `text_chunking` ingest processor before the `text_embedding` processor. For more information, see [Chaining text chunking and embedding processors]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/#chaining-text-chunking-and-embedding-processors).
+To split long text into passages, use the `text_chunking` ingest processor before the `text_embedding` processor. For more information, see [Text chunking]({{site.url}}{{site.baseurl}}/search-plugins/text-chunking/).
 
 ## Step 2: Create an index for ingestion
 
diff --git a/_search-plugins/text-chunking.md b/_search-plugins/text-chunking.md
new file mode 100644
index 0000000000..b66cfeda61
--- /dev/null
+++ b/_search-plugins/text-chunking.md
@@ -0,0 +1,116 @@
+---
+layout: default
+title: Text chunking
+nav_order: 65
+---
+
+# Text chunking
+Introduced 2.13
+{: .label .label-purple }
+
+To split long text into passages, you can use a `text_chunking` processor as a preprocessing step for a `text_embedding` or `sparse_encoding` processor in order to obtain embeddings for each chunked passage. For more information about the processor parameters, see [Text chunking processor]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/). Before you start, follow the steps outlined in the [pretrained model documentation]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/) to register an embedding model. The following example preprocesses text by splitting it into passages and then produces embeddings using the `text_embedding` processor.
+
+## Step 1: Create a pipeline
+
+The following example request creates an ingest pipeline that converts the text in the `passage_text` field into chunked passages, which will be stored in the `passage_chunk` field. The text in the `passage_chunk` field is then converted into text embeddings, and the embeddings are stored in the `passage_chunk_embedding` field:
+
+```json
+PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
+{
+  "description": "A text chunking and embedding ingest pipeline",
+  "processors": [
+    {
+      "text_chunking": {
+        "algorithm": {
+          "fixed_token_length": {
+            "token_limit": 10,
+            "overlap_rate": 0.2,
+            "tokenizer": "standard"
+          }
+        },
+        "field_map": {
+          "passage_text": "passage_chunk"
+        }
+      }
+    },
+    {
+      "text_embedding": {
+        "model_id": "LMLPWY4BROvhdbtgETaI",
+        "field_map": {
+          "passage_chunk": "passage_chunk_embedding"
+        }
+      }
+    }
+  ]
+}
+```
+{% include copy-curl.html %}
+
+## Step 2: Create an index for ingestion
+
+To use the ingest pipeline, you need to create a k-NN index. The `passage_chunk_embedding` field must be of the `nested` type. The `knn.dimension` field must contain the number of dimensions for your model:
+
+```json
+PUT testindex
+{
+  "settings": {
+    "index": {
+      "knn": true
+    }
+  },
+  "mappings": {
+    "properties": {
+      "text": {
+        "type": "text"
+      },
+      "passage_chunk_embedding": {
+        "type": "nested",
+        "properties": {
+          "knn": {
+            "type": "knn_vector",
+            "dimension": 768
+          }
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Step 3: Ingest documents into the index
+
+To ingest a document into the index created in the previous step, send the following request:
+
+```json
+POST testindex/_doc?pipeline=text-chunking-embedding-ingest-pipeline
+{
+  "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
+}
+```
+{% include copy-curl.html %}
+
+## Step 4: Search the index using neural search
+
+You can use a `nested` query to perform vector search on your index. We recommend setting `score_mode` to `max`, where the document score is set to the highest score out of all passage embeddings:
+
+```json
+GET testindex/_search
+{
+  "query": {
+    "nested": {
+      "score_mode": "max",
+      "path": "passage_chunk_embedding",
+      "query": {
+        "neural": {
+          "passage_chunk_embedding.knn": {
+            "query_text": "document",
+            "model_id": "-tHZeI4BdQKclr136Wl7"
+          }
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
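+
+Because each returned document contains the embeddings for all of its chunks, responses can be large. Optionally, you can exclude the `passage_chunk_embedding` field from the returned `_source` using source filtering. The following example reuses the preceding query and adds an `excludes` list:
+
+```json
+GET testindex/_search
+{
+  "_source": {
+    "excludes": ["passage_chunk_embedding"]
+  },
+  "query": {
+    "nested": {
+      "score_mode": "max",
+      "path": "passage_chunk_embedding",
+      "query": {
+        "neural": {
+          "passage_chunk_embedding.knn": {
+            "query_text": "document",
+            "model_id": "-tHZeI4BdQKclr136Wl7"
+          }
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}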