Add documentation for node analyzer components
This commit introduces documentation for a new extension of the Nodes API
that exposes the names of the analysis components available on cluster node(s).

This commit also contains additional changes:

- It makes a strict distinction between the terms "token" and "term".
- It replaces the term "Normalize" in the analysis part because it has a
  special meaning in this context.
- It introduces a dedicated page for Normalization (which is a
  specific type of analysis).

This commit is part of PR OpenSearch/#10296

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
lukas-vlcek committed Feb 1, 2024
1 parent d41ccb8 commit d31ea04
Showing 5 changed files with 441 additions and 7 deletions.
331 changes: 326 additions & 5 deletions _analyzers/index.md
@@ -15,16 +15,24 @@ redirect_from:

# Text analysis

When you are searching documents using a full-text search, you want to receive all relevant results. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking". To facilitate full-text search, OpenSearch uses text analysis.

The objective of text analysis is to split the unstructured free-text content of the source document into a sequence of terms, which are then stored in an inverted index. Subsequently, when a similar text analysis is applied to a user's query, the resulting sequence of terms facilitates the matching of relevant source documents.

From a technical point of view, the text analysis process consists of several steps, some of which are optional (see the example following this list):

1. Before the free-text content can be split into individual words, it may be beneficial to process it at the character level. The primary aim of this optional step is to help the tokenizer (the subsequent stage in the analysis process) generate better tokens. This can include removing markup tags (such as HTML) or handling specific character patterns (like replacing the &#x1F642; emoji with the text `:slightly_smiling_face:`).

2. The next step is to split the free-text content into individual words---_tokens_. This is the job of a _tokenizer_. For example, after tokenization, the sentence `Actions speak louder than words` is split into the tokens `Actions`, `speak`, `louder`, `than`, and `words`.

3. The last step is to process individual tokens by applying a series of token filters. The aim is to convert each token into a predictable form that is directly stored in the index, for example, by converting it to lowercase or performing stemming (reducing a word to its root). For example, the token `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
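
To see how these stages fit together, you can experiment with the `_analyze` API. The following is a minimal sketch that chains a character filter, a tokenizer, and a token filter in a single ad hoc request; the component choices here are illustrative, not a recommendation:

```json
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p><b>Actions</b> speak louder than <em>words</em></p>"
}
```
{% include copy-curl.html %}

The response lists the tokens produced by this pipeline along with their positions and offsets.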

Although the terms ***token*** and ***term*** may sound similar and are occasionally used interchangeably, it is helpful to understand the difference between the two. In the context of Apache Lucene®, each plays a distinct role. A ***token*** is created by a tokenizer during text analysis and often undergoes a number of additional modifications as it passes through the chain of token filters. Each token is associated with metadata that can be further used during the text analysis process. A ***term*** is a data value that is directly stored in the inverted index and is associated with far less metadata. During search, matching operates at the term level.
{: .note}

## Analyzers

In OpenSearch, the abstraction that encompasses text analysis is referred to as an _analyzer_. Each analyzer contains the following sequentially applied components:

1. **Character filters**: First, a character filter receives the original text as a stream of characters and adds, removes, or modifies characters in the text. For example, a character filter can strip HTML characters from a string so that the text `<p><b>Actions</b> speak louder than <em>words</em></p>` becomes `\nActions speak louder than words\n`. The output of a character filter is a stream of characters.

@@ -35,6 +43,8 @@ In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contains
An analyzer must contain exactly one tokenizer and may contain zero or more character filters and zero or more token filters.
{: .note}

There is also a special type of analyzer called a ***normalizer***. A normalizer is similar to an analyzer except that it does not contain a tokenizer and can only include specific types of character filters and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole. This means that replacing a token with a synonym or stemming is not supported. See [Normalizers]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) for further details.
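
As a brief illustration, the following is a minimal sketch of an index that defines a custom normalizer and applies it to a `keyword` field. The index, normalizer, and field names are hypothetical, and the selected filters are just examples of the character-level operations a normalizer may contain:

```json
PUT /my-normalized-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_code": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}
```
{% include copy-curl.html %}

With this mapping, values of `product_code` are lowercased and ASCII-folded before being stored as single terms, but they are never split into multiple tokens.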

## Built-in analyzers

The following table lists the built-in analyzers that OpenSearch provides. The last column of the table contains the result of applying the analyzer to the string `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`.
@@ -54,6 +64,317 @@ Analyzer | Analysis performed | Analyzer output

If needed, you can combine tokenizers, token filters, and character filters to create a custom analyzer.
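
As a minimal sketch, the following request creates an index with a custom analyzer built from a character filter, a tokenizer, and two token filters. The index and analyzer names are hypothetical, and the component selection is illustrative only:

```json
PUT /my-custom-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

You can then verify the analyzer's behavior by calling `GET /my-custom-index/_analyze` with `"analyzer": "my_custom_analyzer"` and a sample `text` value in the request body.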

With the introduction of OpenSearch `v2.12.0`, you can retrieve a comprehensive list of all available text analysis components using the [Nodes Info]({{site.url}}{{site.baseurl}}/api-reference/nodes-apis/nodes-info/) API. This can be helpful when building custom analyzers, especially when you need to recall a component's name or identify the analysis plugin that provides it.

Introduced 2.12.0
{: .label .label-purple }

```json
GET /_nodes/analysis_components?pretty=true&filter_path=nodes.*.analysis_components
```
{% include copy-curl.html %}

The following is an example response from a node that includes the `analysis-common` module (a module that is present by default):

<details open markdown="block">
<summary>
Response
</summary>
{: .text-delta}

```json
{
"nodes" : {
"cZidmv5kQbWQN8M8dz9f5g" : {
"analysis_components" : {
"analyzers" : [
"arabic",
"armenian",
"basque",
"bengali",
"brazilian",
"bulgarian",
"catalan",
"chinese",
"cjk",
"czech",
"danish",
"default",
"dutch",
"english",
"estonian",
"fingerprint",
"finnish",
"french",
"galician",
"german",
"greek",
"hindi",
"hungarian",
"indonesian",
"irish",
"italian",
"keyword",
"latvian",
"lithuanian",
"norwegian",
"pattern",
"persian",
"portuguese",
"romanian",
"russian",
"simple",
"snowball",
"sorani",
"spanish",
"standard",
"stop",
"swedish",
"thai",
"turkish",
"whitespace"
],
"tokenizers" : [
"PathHierarchy",
"char_group",
"classic",
"edgeNGram",
"edge_ngram",
"keyword",
"letter",
"lowercase",
"nGram",
"ngram",
"path_hierarchy",
"pattern",
"simple_pattern",
"simple_pattern_split",
"standard",
"thai",
"uax_url_email",
"whitespace"
],
"tokenFilters" : [
"apostrophe",
"arabic_normalization",
"arabic_stem",
"asciifolding",
"bengali_normalization",
"brazilian_stem",
"cjk_bigram",
"cjk_width",
"classic",
"common_grams",
"concatenate_graph",
"condition",
"czech_stem",
"decimal_digit",
"delimited_payload",
"delimited_term_freq",
"dictionary_decompounder",
"dutch_stem",
"edgeNGram",
"edge_ngram",
"elision",
"fingerprint",
"flatten_graph",
"french_stem",
"german_normalization",
"german_stem",
"hindi_normalization",
"hunspell",
"hyphenation_decompounder",
"indic_normalization",
"keep",
"keep_types",
"keyword_marker",
"kstem",
"length",
"limit",
"lowercase",
"min_hash",
"multiplexer",
"nGram",
"ngram",
"pattern_capture",
"pattern_replace",
"persian_normalization",
"porter_stem",
"predicate_token_filter",
"remove_duplicates",
"reverse",
"russian_stem",
"scandinavian_folding",
"scandinavian_normalization",
"serbian_normalization",
"shingle",
"snowball",
"sorani_normalization",
"standard",
"stemmer",
"stemmer_override",
"stop",
"synonym",
"synonym_graph",
"trim",
"truncate",
"unique",
"uppercase",
"word_delimiter",
"word_delimiter_graph"
],
"charFilters" : [
"html_strip",
"mapping",
"pattern_replace"
],
"normalizers" : [
"lowercase"
],
"plugins" : [
{
"name" : "analysis-common",
"classname" : "org.opensearch.analysis.common.CommonAnalysisModulePlugin",
"analyzers" : [
"arabic",
"armenian",
"basque",
"bengali",
"brazilian",
"bulgarian",
"catalan",
"chinese",
"cjk",
"czech",
"danish",
"dutch",
"english",
"estonian",
"fingerprint",
"finnish",
"french",
"galician",
"german",
"greek",
"hindi",
"hungarian",
"indonesian",
"irish",
"italian",
"latvian",
"lithuanian",
"norwegian",
"pattern",
"persian",
"portuguese",
"romanian",
"russian",
"snowball",
"sorani",
"spanish",
"swedish",
"thai",
"turkish"
],
"tokenizers" : [
"PathHierarchy",
"char_group",
"classic",
"edgeNGram",
"edge_ngram",
"keyword",
"letter",
"lowercase",
"nGram",
"ngram",
"path_hierarchy",
"pattern",
"simple_pattern",
"simple_pattern_split",
"thai",
"uax_url_email",
"whitespace"
],
"tokenFilters" : [
"apostrophe",
"arabic_normalization",
"arabic_stem",
"asciifolding",
"bengali_normalization",
"brazilian_stem",
"cjk_bigram",
"cjk_width",
"classic",
"common_grams",
"concatenate_graph",
"condition",
"czech_stem",
"decimal_digit",
"delimited_payload",
"delimited_term_freq",
"dictionary_decompounder",
"dutch_stem",
"edgeNGram",
"edge_ngram",
"elision",
"fingerprint",
"flatten_graph",
"french_stem",
"german_normalization",
"german_stem",
"hindi_normalization",
"hyphenation_decompounder",
"indic_normalization",
"keep",
"keep_types",
"keyword_marker",
"kstem",
"length",
"limit",
"lowercase",
"min_hash",
"multiplexer",
"nGram",
"ngram",
"pattern_capture",
"pattern_replace",
"persian_normalization",
"porter_stem",
"predicate_token_filter",
"remove_duplicates",
"reverse",
"russian_stem",
"scandinavian_folding",
"scandinavian_normalization",
"serbian_normalization",
"snowball",
"sorani_normalization",
"stemmer",
"stemmer_override",
"synonym",
"synonym_graph",
"trim",
"truncate",
"unique",
"uppercase",
"word_delimiter",
"word_delimiter_graph"
],
"charFilters" : [
"html_strip",
"mapping",
"pattern_replace"
],
"hunspellDictionaries" : [ ]
}
]
}
}
}
}
```
</details>

## Text analysis at indexing time and query time

OpenSearch performs text analysis on text fields when you index a document and when you send a search request. Depending on the time of text analysis, the analyzers used for it are classified as follows:
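
An analyzer applied at indexing time is commonly referred to as an index analyzer, and one applied at query time as a search analyzer. For example, a field can use one analyzer at indexing time and a different one at query time by setting the `analyzer` and `search_analyzer` mapping parameters on a `text` field. The following is a minimal sketch; the index name, field name, and analyzer choices are hypothetical:

```json
PUT /my-index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "simple"
      }
    }
  }
}
```
{% include copy-curl.html %}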