[DOC] Add html strip processor documentation #5984

Merged
merged 16 commits into from
Jun 6, 2024
Changes from 5 commits
165 changes: 165 additions & 0 deletions _ingest-pipelines/processors/html-strip.md
@@ -0,0 +1,165 @@
---
layout: default
title: HTML strip
parent: Ingest processors
nav_order: 140
---

# HTML strip processor

The `html_strip` processor removes HTML tags from string fields in incoming documents. It is useful when indexing data from web pages or other sources that may contain HTML markup. Removing the tags leaves readable text content without markup, which is easier to search. HTML tags are replaced with newline characters (`\n`).

Review comment (Collaborator): Can we be more precise than "clean"? What do we actually mean by this?

Reply (Collaborator Author): Deleted that sentence. It's not necessary info. "Clean" means readable text content without HTML tags.

The following is the syntax for the `html_strip` processor:

```json
{
  "html_strip": {
    "field": "description",
    "target_field": "cleaned_description"
  }
}
```
{% include copy-curl.html %}

## Configuration parameters

The following table lists the required and optional parameters for the `html_strip` processor.

Parameter | Required/Optional | Description |
|-----------|-----------|-----------|
`field` | Required | The string field from which to remove HTML tags.
`target_field` | Optional | The field in which to store the text with HTML tags removed. If not specified, `field` is updated in place.
`ignore_missing` | Optional | Default is `false`. If `true`, the processor exits quietly without modifying the document when `field` does not exist.
`description` | Optional | Description of the processor's purpose or configuration.
`if` | Optional | Conditionally execute the processor.
`ignore_failure` | Optional | Ignore failures for the processor. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
`on_failure` | Optional | Handle failures for the processor. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
`tag` | Optional | Identifier for the processor. Useful for debugging and metrics.
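
As an illustration of how the optional parameters in the preceding table combine, the following sketch strips tags from `description` in place (no `target_field`) and skips documents that lack the field. The `tag` and `description` values are arbitrary placeholders chosen for this example:

```json
{
  "html_strip": {
    "field": "description",
    "ignore_missing": true,
    "tag": "strip-description-html",
    "description": "Strips HTML tags from the description field in place"
  }
}
```

Because `target_field` is omitted, the original markup is not preserved in the document. Specify `target_field` if you need to keep both versions.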

## Using the processor

Follow these steps to use the processor in a pipeline.

### Step 1: Create a pipeline

The following query creates a pipeline named `strip-html-pipeline` that uses the `html_strip` processor to remove HTML tags from the `description` field and store the processed value in a new field named `cleaned_description`:

```json
PUT _ingest/pipeline/strip-html-pipeline
{
  "description": "A pipeline to strip HTML from description field",
  "processors": [
    {
      "html_strip": {
        "field": "description",
        "target_field": "cleaned_description"
      }
    }
  ]
}
```
{% include copy-curl.html %}

### Step 2 (Optional): Test the pipeline

It is recommended that you test your pipeline before you ingest documents.
{: .tip}

To test the pipeline, run the following query:

```json
POST _ingest/pipeline/strip-html-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "description": "This is a <b>test</b> description with <i>some</i> HTML tags."
      }
    }
  ]
}
```
{% include copy-curl.html %}

#### Response

The following example response confirms that the pipeline is working as expected:

```json
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "description": "This is a <b>test</b> description with <i>some</i> HTML tags.",
          "cleaned_description": "This is a test description with some HTML tags."
        },
        "_ingest": {
          "timestamp": "2024-05-22T21:46:11.227974965Z"
        }
      }
    }
  ]
}
```
{% include copy-curl.html %}
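
The simulate API also accepts a pipeline definition in the request body, which can be convenient for trying out a processor configuration before saving it. The following is a minimal sketch under that assumption. The sample document is invented for illustration and uses block-level tags (`<p>`) so that you can inspect how the processor handles them in your own cluster:

```json
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "html_strip": {
          "field": "description",
          "target_field": "cleaned_description"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "description": "<p>First paragraph.</p><p>Second paragraph.</p>"
      }
    }
  ]
}
```
{% include copy-curl.html %}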

### Step 3: Ingest a document

The following query ingests a document into an index named `products`:

```json
PUT products/_doc/1?pipeline=strip-html-pipeline
{
  "name": "Product 1",
  "description": "This is a <b>test</b> product with <i>some</i> HTML tags."
}
```
{% include copy-curl.html %}

#### Response

The response confirms that the document was indexed into the `products` index. During ingestion, the pipeline removed the HTML tags from the `description` value and stored the resulting text in the `cleaned_description` field:

Review comment (Collaborator): Same comment re: "clean"

```json
{
  "_index": "products",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
```
{% include copy-curl.html %}
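
Rather than passing the `pipeline` query parameter on every indexing request, you can attach the pipeline to the index by using the `index.default_pipeline` setting. The following sketch assumes that the `products` index already exists (for example, after the preceding request) and that every document written to it should pass through `strip-html-pipeline`:

```json
PUT products/_settings
{
  "index.default_pipeline": "strip-html-pipeline"
}
```
{% include copy-curl.html %}

With this setting in place, a request such as `PUT products/_doc/2` would run the pipeline without the `?pipeline=` parameter.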

### Step 4 (Optional): Retrieve the document

To retrieve the document, run the following query:

```json
GET products/_doc/1
```
{% include copy-curl.html %}

#### Response

The response includes both the original `description` field and the `cleaned_description` field with HTML tags removed.

```json
{
  "_index": "products",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "cleaned_description": "This is a test product with some HTML tags.",
    "name": "Product 1",
    "description": "This is a <b>test</b> product with <i>some</i> HTML tags."
  }
}
```
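
To check that the stripped text is searchable, you can query the `cleaned_description` field directly. The following is a minimal sketch that assumes the default dynamic mapping for the `products` index. The query terms are taken from the sample document above:

```json
GET products/_search
{
  "query": {
    "match": {
      "cleaned_description": "test product"
    }
  }
}
```
{% include copy-curl.html %}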