
Add example to text chunking processor documentation (#6794)
* add search document example for text chunking and embedding pipeline

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune document

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* Add the text chunking page

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* correct example

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* Update _search-plugins/text-chunking.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Yuye Zhu <yuyezhu@amazon.com>

* Update _search-plugins/text-chunking.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Yuye Zhu <yuyezhu@amazon.com>

* resolve review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* Move cascading section to processor file

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

---------

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: Yuye Zhu <yuyezhu@amazon.com>
Co-authored-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
4 people authored Mar 29, 2024
1 parent 5d9edcb commit d676a79
Showing 4 changed files with 128 additions and 115 deletions.
114 changes: 3 additions & 111 deletions _ingest-pipelines/processors/text-chunking.md
@@ -157,119 +157,11 @@ The response confirms that, in addition to the `passage_text` field, the process
}
```

Once you have created an ingest pipeline, you need to create an index for ingestion and ingest documents into the index. To learn more, see [Step 2: Create an index for ingestion]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-2-create-an-index-for-ingestion) and [Step 3: Ingest documents into the index]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-3-ingest-documents-into-the-index) of the [neural sparse search documentation]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/).

## Chaining text chunking and embedding processors

You can use a `text_chunking` processor as a preprocessing step for a `text_embedding` or `sparse_encoding` processor to obtain embeddings for each chunked passage.

**Prerequisites**

Follow the steps outlined in the [pretrained model documentation]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/) to register an embedding model.

**Step 1: Create a pipeline**

The following example request creates an ingest pipeline that converts the text in the `passage_text` field into chunked passages, which will be stored in the `passage_chunk` field. The text in the `passage_chunk` field is then converted into text embeddings, and the embeddings are stored in the `passage_chunk_embedding` field:

```json
PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
{
"description": "A text chunking and embedding ingest pipeline",
"processors": [
{
"text_chunking": {
"algorithm": {
"fixed_token_length": {
"token_limit": 10,
"overlap_rate": 0.2,
"tokenizer": "standard"
}
},
"field_map": {
"passage_text": "passage_chunk"
}
}
},
{
"text_embedding": {
"model_id": "LMLPWY4BROvhdbtgETaI",
"field_map": {
"passage_chunk": "passage_chunk_embedding"
}
}
}
]
}
```
{% include copy-curl.html %}

**Step 2 (Optional): Test the pipeline**

It is recommended that you test your pipeline before ingesting documents.
{: .tip}

To test the pipeline, run the following query:

```json
POST _ingest/pipeline/text-chunking-embedding-ingest-pipeline/_simulate
{
"docs": [
{
"_index": "testindex",
"_id": "1",
"_source":{
"passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
}
}
]
}
```
{% include copy-curl.html %}

#### Response

The response confirms that, in addition to the `passage_text` and `passage_chunk` fields, the processor has generated text embeddings for each of the three passages in the `passage_chunk_embedding` field. The embedding vectors are stored in the `knn` field for each chunk:

```json
{
"docs": [
{
"doc": {
"_index": "testindex",
"_id": "1",
"_source": {
"passage_chunk_embedding": [
{
"knn": [...]
},
{
"knn": [...]
},
{
"knn": [...]
}
],
"passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.",
"passage_chunk": [
"This is an example document to be chunked. The document ",
"The document contains a single paragraph, two sentences and 24 ",
"and 24 tokens by standard tokenizer in OpenSearch."
]
},
"_ingest": {
"timestamp": "2024-03-20T03:04:49.144054Z"
}
}
}
]
}
```
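
To inspect the output of each processor separately (for example, to view the chunked passages before they are embedded), you can add the `verbose` parameter to the simulate request. The following request is the same test document as above with verbose output enabled:

```json
POST _ingest/pipeline/text-chunking-embedding-ingest-pipeline/_simulate?verbose=true
{
  "docs": [
    {
      "_index": "testindex",
      "_id": "1",
      "_source": {
        "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}
```
{% include copy-curl.html %}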

-Once you have created an ingest pipeline, you need to create an index for ingestion and ingest documents into the index. To learn more, see [Step 2: Create an index for ingestion]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-2-create-an-index-for-ingestion) and [Step 3: Ingest documents into the index]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-3-ingest-documents-into-the-index) of the [neural sparse search documentation]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/).
+Once you have created an ingest pipeline, you need to create an index for document ingestion. To learn more, see [Text chunking]({{site.url}}{{site.baseurl}}/search-plugins/text-chunking/).

## Cascaded text chunking processors

-You can chain multiple chunking processors together. For example, to split documents into paragraphs, apply the `delimiter` algorithm and specify the parameter as `\n\n`. To prevent a paragraph from exceeding the token limit, append another chunking processor that uses the `fixed_token_length` algorithm. You can configure the ingest pipeline for this example as follows:
+You can chain multiple text chunking processors together. For example, to split documents into paragraphs, apply the `delimiter` algorithm and specify the parameter as `\n\n`. To prevent a paragraph from exceeding the token limit, append another text chunking processor that uses the `fixed_token_length` algorithm. You can configure the ingest pipeline for this example as follows:

```json
PUT _ingest/pipeline/text-chunking-cascade-ingest-pipeline
{
  "description": "A text chunking ingest pipeline with cascaded algorithms",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "delimiter": {
            "delimiter": "\n\n"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk1"
        }
      }
    },
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 500,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_chunk1": "passage_chunk2"
        }
      }
    }
  ]
}
```
{% include copy-curl.html %}
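
You can test the cascaded pipeline with the simulate API in the same way as the previous examples. The following request is a sketch using an illustrative two-paragraph input:

```json
POST _ingest/pipeline/text-chunking-cascade-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_index": "testindex",
      "_id": "1",
      "_source": {
        "passage_text": "This is the first paragraph.\n\nThis is the second paragraph. It contains several more sentences, so the fixed token length algorithm may split it into multiple passages."
      }
    }
  ]
}
```
{% include copy-curl.html %}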

## Next steps

- For a complete example, see [Text chunking]({{site.url}}{{site.baseurl}}/search-plugins/text-chunking/).
- To learn more about semantic search, see [Semantic search]({{site.url}}{{site.baseurl}}/search-plugins/semantic-search/).
- To learn more about sparse search, see [Neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/).
- To learn more about using models in OpenSearch, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model).
- For a comprehensive example, see [Neural search tutorial]({{site.url}}{{site.baseurl}}/search-plugins/neural-search-tutorial/).
11 changes: 8 additions & 3 deletions _search-plugins/neural-sparse-search.md
@@ -2,7 +2,7 @@
layout: default
title: Neural sparse search
nav_order: 50
-has_children: false
+has_children: true
redirect_from:
- /search-plugins/sparse-search/
---
@@ -55,7 +55,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse
```
{% include copy-curl.html %}

-To split long text into passages, use the `text_chunking` ingest processor before the `sparse_encoding` processor. For more information, see [Chaining text chunking and embedding processors]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/#chaining-text-chunking-and-embedding-processors).
+To split long text into passages, use the `text_chunking` ingest processor before the `sparse_encoding` processor. For more information, see [Text chunking]({{site.url}}{{site.baseurl}}/search-plugins/text-chunking/).

## Step 2: Create an index for ingestion

@@ -364,4 +365,8 @@ The response contains both documents:
]
}
}
```

## Next steps

- To learn more about splitting long text into passages for neural search, see [Text chunking]({{site.url}}{{site.baseurl}}/search-plugins/text-chunking/).
2 changes: 1 addition & 1 deletion _search-plugins/semantic-search.md
@@ -48,7 +48,7 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline
```
{% include copy-curl.html %}

-To split long text into passages, use the `text_chunking` ingest processor before the `text_embedding` processor. For more information, see [Chaining text chunking and embedding processors]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/#chaining-text-chunking-and-embedding-processors).
+To split long text into passages, use the `text_chunking` ingest processor before the `text_embedding` processor. For more information, see [Text chunking]({{site.url}}{{site.baseurl}}/search-plugins/text-chunking/).

## Step 2: Create an index for ingestion

116 changes: 116 additions & 0 deletions _search-plugins/text-chunking.md
@@ -0,0 +1,116 @@
---
layout: default
title: Text chunking
nav_order: 65
---

# Text chunking
Introduced 2.13
{: .label .label-purple }

To split long text into passages, you can use a `text_chunking` processor as a preprocessing step for a `text_embedding` or `sparse_encoding` processor to obtain embeddings for each chunked passage. For more information about the processor parameters, see [Text chunking processor]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/). Before you start, follow the steps outlined in the [pretrained model documentation]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/) to register an embedding model. The following example preprocesses text by splitting it into passages and then produces embeddings using the `text_embedding` processor.
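
For example, you can register and deploy a pretrained embedding model with a request like the following. The model name and version shown here are illustrative; see the pretrained model documentation for the currently supported models:

```json
POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.2",
  "model_format": "TORCH_SCRIPT"
}
```
{% include copy-curl.html %}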

## Step 1: Create a pipeline

The following example request creates an ingest pipeline that converts the text in the `passage_text` field into chunked passages, which will be stored in the `passage_chunk` field. The text in the `passage_chunk` field is then converted into text embeddings, and the embeddings are stored in the `passage_chunk_embedding` field:

```json
PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
{
"description": "A text chunking and embedding ingest pipeline",
"processors": [
{
"text_chunking": {
"algorithm": {
"fixed_token_length": {
"token_limit": 10,
"overlap_rate": 0.2,
"tokenizer": "standard"
}
},
"field_map": {
"passage_text": "passage_chunk"
}
}
},
{
"text_embedding": {
"model_id": "LMLPWY4BROvhdbtgETaI",
"field_map": {
"passage_chunk": "passage_chunk_embedding"
}
}
}
]
}
```
{% include copy-curl.html %}
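
If you're using a sparse model instead of a dense embedding model, you can replace the `text_embedding` processor with a `sparse_encoding` processor. The following is a minimal sketch; the model ID placeholder and the `passage_chunk_sparse` output field name are illustrative. The remainder of this example uses the dense `text_embedding` pipeline:

```json
PUT _ingest/pipeline/text-chunking-sparse-encoding-ingest-pipeline
{
  "description": "A text chunking and sparse encoding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "sparse_encoding": {
        "model_id": "<sparse encoding model ID>",
        "field_map": {
          "passage_chunk": "passage_chunk_sparse"
        }
      }
    }
  ]
}
```
{% include copy-curl.html %}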

## Step 2: Create an index for ingestion

To use the ingest pipeline, you need to create a k-NN index. The `passage_chunk_embedding` field must be of the `nested` type. The `knn.dimension` field must contain the number of dimensions for your model:

```json
PUT testindex
{
"settings": {
"index": {
"knn": true
}
},
"mappings": {
"properties": {
"text": {
"type": "text"
},
"passage_chunk_embedding": {
"type": "nested",
"properties": {
"knn": {
"type": "knn_vector",
"dimension": 768
}
}
}
}
}
}
```
{% include copy-curl.html %}
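
If you're unsure of the dimension for your model, you can retrieve the model details; the `model_config.embedding_dimension` field in the response contains the number of dimensions. The model ID below is the one used in Step 1:

```json
GET /_plugins/_ml/models/LMLPWY4BROvhdbtgETaI
```
{% include copy-curl.html %}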

## Step 3: Ingest documents into the index

To ingest a document into the index created in the previous step, send the following request:

```json
POST testindex/_doc?pipeline=text-chunking-embedding-ingest-pipeline
{
"passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
}
```
{% include copy-curl.html %}
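
To verify that chunks and embeddings were generated, you can retrieve the ingested document. Because the document ID is auto-generated, search the index rather than using a GET by ID:

```json
GET testindex/_search
{
  "query": {
    "match_all": {}
  }
}
```
{% include copy-curl.html %}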

## Step 4: Search the index using neural search

You can use a `nested` query to perform a vector search on your index. We recommend setting `score_mode` to `max` so that the document score is the highest score among all passage embeddings:

```json
GET testindex/_search
{
"query": {
"nested": {
"score_mode": "max",
"path": "passage_chunk_embedding",
"query": {
"neural": {
"passage_chunk_embedding.knn": {
"query_text": "document",
"model_id": "-tHZeI4BdQKclr136Wl7"
}
}
}
}
}
}
```
{% include copy-curl.html %}
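
To see which chunk matched best, you can add an empty `inner_hits` clause to the `nested` query. This is a sketch that assumes `inner_hits` is supported for your query type and OpenSearch version; it returns the matching nested objects alongside each hit:

```json
GET testindex/_search
{
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "passage_chunk_embedding",
      "query": {
        "neural": {
          "passage_chunk_embedding.knn": {
            "query_text": "document",
            "model_id": "-tHZeI4BdQKclr136Wl7"
          }
        }
      },
      "inner_hits": {}
    }
  }
}
```
{% include copy-curl.html %}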
