[DOC] Add html strip processor documentation #5984

Merged on Jun 6, 2024 (16 commits)

Conversation

vagimeli
Collaborator

Description

Closes content gap

Issues Resolved

#4647

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
@vagimeli added the Needs SME (waiting on input from subject matter expert) and Content gap labels on Dec 22, 2023
@vagimeli self-assigned this on Dec 22, 2023
@vagimeli
Collaborator Author

@hdhalter We need a dev to take the first pass at drafting this content. I can support them in refining the content and getting it through the documentation process. The ingest processor template is provided with this PR, so it should help the dev get started. Thanks!

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
@vagimeli marked this pull request as draft on February 15, 2024, 21:15
@vagimeli added the backport 2.14 label on May 22, 2024
@vagimeli marked this pull request as ready for review on May 22, 2024, 19:57
@vagimeli
Collaborator Author

@gaobinlong This PR is ready for technical review. Thank you!

@vagimeli added the 3 - Tech review label and removed the Needs SME label on May 29, 2024
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
@vagimeli added the 5 - Editorial review label and removed the 3 - Tech review label on Jun 5, 2024
@natebower (Collaborator) left a comment

@vagimeli Please see my comments and changes and let me know if you have any questions. Thanks!

_ingest-pipelines/processors/html-strip.md (resolved)
_ingest-pipelines/processors/html-strip.md (outdated, resolved)

# HTML strip processor

The `html_strip` processor removes HTML tags from string fields in incoming documents. The processor is useful when indexing data from web pages or other sources that may contain HTML markup. By removing the HTML tags, you can ensure that the indexed content is clean and easily searchable. HTML tags are replaced with newline characters (`\n`).
Collaborator

Can we be more precise than "clean"? What do we actually mean by this?

Collaborator Author

Deleted that sentence. It's not necessary info. "Clean" means readable text content without HTML tags.
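
To make the behavior discussed in this thread concrete, here is a minimal Python sketch of what the draft describes: each HTML tag in a string is replaced with a newline character (`\n`). It is an illustration only. The regular expression and the function name `strip_html_tags` are assumptions made for this example, not the processor's actual implementation, and the simplified pattern does not handle edge cases such as `>` inside attribute values.

```python
import re

# Simplified tag pattern (an assumption for illustration, not the real processor logic).
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html_tags(text: str) -> str:
    """Replace every HTML tag in `text` with a newline character."""
    return TAG_PATTERN.sub("\n", text)


if __name__ == "__main__":
    sample = "<p>Fast <b>wireless</b> charger</p>"
    # repr() makes the inserted newline characters visible.
    print(repr(strip_html_tags(sample)))
```

Text that contains no tags passes through unchanged, which is one way to read "readable text content without HTML tags."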

_ingest-pipelines/processors/html-strip.md (outdated, resolved)
_ingest-pipelines/processors/html-strip.md (outdated, resolved)

#### Response

The response shows that the request has indexed the document into the index `products` and will index all documents with the `description` field containing HTML tags, while storing the cleaned version in the `cleaned_description` field.
Collaborator

Same comment re: "clean"
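
For readers without the file diff at hand, the following Python sketch shows what a pipeline definition matching the quoted sentence might look like. The field names `description` and `cleaned_description` come from the snippet above. The parameter names `field` and `target_field`, the helper `build_html_strip_pipeline`, and the description string are assumptions for illustration, because the actual request body is not shown in this thread.

```python
import json


def build_html_strip_pipeline(field: str, target_field: str) -> dict:
    """Build a pipeline body with a single html_strip processor.

    The `field` and `target_field` parameter names are assumed here;
    check the merged documentation for the authoritative list.
    """
    return {
        "description": "Strip HTML tags from a field (hypothetical example)",
        "processors": [
            {"html_strip": {"field": field, "target_field": target_field}}
        ],
    }


if __name__ == "__main__":
    body = build_html_strip_pipeline("description", "cleaned_description")
    # The JSON printed here is what would be sent as the pipeline definition.
    print(json.dumps(body, indent=2))
```

Writing to a separate target field keeps the original `description` value intact alongside the stripped text.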

_ingest-pipelines/processors/html-strip.md (outdated, resolved)
vagimeli and others added 4 commits June 6, 2024 08:45
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
@natebower (Collaborator) left a comment

LGTM

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
@vagimeli removed the 5 - Editorial review label on Jun 6, 2024
@vagimeli merged commit 6a119e1 into main on Jun 6, 2024
6 checks passed
@vagimeli deleted the html-strip branch on June 6, 2024, 15:34
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 6, 2024
* Add html strip processor documentation

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Add html strip processor documentation

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Add examples

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Copy edits

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Update _ingest-pipelines/processors/html-strip.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Update _ingest-pipelines/processors/html-strip.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Update _ingest-pipelines/processors/html-strip.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Update _ingest-pipelines/processors/html-strip.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Update _ingest-pipelines/processors/html-strip.md

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Update _ingest-pipelines/processors/html-strip.md

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Update _ingest-pipelines/processors/html-strip.md

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Update _ingest-pipelines/processors/html-strip.md

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

---------

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
(cherry picked from commit 6a119e1)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels
backport 2.14, Content gap
3 participants