From d31ea04acfcc1e8f260244fcda7d80b6b58fa45e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Luk=C3=A1=C5=A1=20Vl=C4=8Dek?=
Date: Wed, 24 Jan 2024 16:44:44 +0100
Subject: [PATCH] Add documentation for node analyzer components
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This commit introduces documentation for a new extension of the Nodes API that exposes the names of analysis components available on cluster node(s).

This commit also contains additional changes:
- It makes a strict distinction between the terms "token" and "term".
- It replaces the term "Normalize" in the analysis part because it has a special meaning in this context.
- It introduces a dedicated page for normalization (which is a specific type of analysis).

This commit is part of PR OpenSearch/#10296

Signed-off-by: Lukáš Vlček
---
 _analyzers/index.md                           | 331 +++++++++++++++++-
 _analyzers/normalizers.md                     | 111 ++++++
 _api-reference/nodes-apis/nodes-info.md       |   2 +
 _field-types/supported-field-types/keyword.md |   2 +-
 _query-dsl/term-vs-full-text.md               |   2 +-
 5 files changed, 441 insertions(+), 7 deletions(-)
 create mode 100644 _analyzers/normalizers.md

diff --git a/_analyzers/index.md b/_analyzers/index.md
index 332a98210f4..ffbc7911c8c 100644
--- a/_analyzers/index.md
+++ b/_analyzers/index.md
@@ -15,16 +15,24 @@ redirect_from:
 
 # Text analysis
 
-When you are searching documents using a full-text search, you want to receive all relevant results and not only exact matches. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking." To facilitate full-text search, OpenSearch uses text analysis.
+When you are searching documents using a full-text search, you want to receive all relevant results. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking". To facilitate full-text search, OpenSearch uses text analysis.
 
-Text analysis consists of the following steps:
+The objective of text analysis is to split the unstructured free-text content of the source document into a sequence of terms, which are then stored in an inverted index. Subsequently, when a similar text analysis is applied to a user's query, the resulting sequence of terms facilitates the matching of relevant source documents.
 
-1. _Tokenize_ text into terms: For example, after tokenization, the phrase `Actions speak louder than words` is split into tokens `Actions`, `speak`, `louder`, `than`, and `words`.
-1. _Normalize_ the terms by converting them into a standard format, for example, converting them to lowercase or performing stemming (reducing the word to its root): For example, after normalization, `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
+From a technical point of view, the text analysis process consists of several steps, some of which are optional:
+
+1. Before the free-text content can be split into individual words, it may be beneficial to process it at the character level. The primary aim of this optional step is to help the tokenizer (the subsequent stage in the analysis process) generate better tokens. This can include removing markup tags (such as HTML) or handling specific character patterns (such as replacing the 🙂 emoji with the text `:slightly_smiling_face:`).
+
+2. The next step is to split the free-text content into individual words---_tokens_. This is the job of a _tokenizer_.
For example, after tokenization, the sentence `Actions speak louder than words` is split into tokens `Actions`, `speak`, `louder`, `than`, and `words`.
+
+3. The last step is to process the individual tokens by applying a series of token filters. The aim is to convert each token into a predictable form that is directly stored in the index, for example, by converting it to lowercase or performing stemming (reducing the word to its root). After this step, the token `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
+
+Although the terms ***token*** and ***term*** may sound similar and are occasionally used interchangeably, it is helpful to understand the difference between the two. In the context of Apache Lucene®, each holds a distinct role. A ***token*** is created by a tokenizer during text analysis and often undergoes a number of additional modifications as it passes through the chain of token filters. Each token is associated with metadata that can be further used during the text analysis process. A ***term*** is a data value that is directly stored in the inverted index and is associated with much less metadata. During search, matching operates at the term level.
+{: .note}
 
 ## Analyzers
 
-In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contains the following sequentially applied components:
+In OpenSearch, the abstraction that encompasses text analysis is referred to as an _analyzer_. Each analyzer contains the following sequentially applied components:
 
 1. **Character filters**: First, a character filter receives the original text as a stream of characters and adds, removes, or modifies characters in the text. For example, a character filter can strip HTML characters from a string so that the text `<p>Actions speak louder than words</p>` becomes `\nActions speak louder than words\n`. The output of a character filter is a stream of characters.
@@ -35,6 +43,8 @@ In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contai
 An analyzer must contain exactly one tokenizer and may contain zero or more character filters and zero or more token filters.
 {: .note}
+There is also a special type of analyzer called a ***normalizer***. A normalizer is similar to an analyzer except that it does not contain a tokenizer and can include only specific types of character filters and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole. This means that replacing a token with a synonym or stemming is not supported. See [Normalizers]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) for further details.
+
 ## Built-in analyzers
 
 The following table lists the built-in analyzers that OpenSearch provides. The last column of the table contains the result of applying the analyzer to the string `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`.
@@ -54,6 +64,317 @@ Analyzer | Analysis performed | Analyzer output
 
 If needed, you can combine tokenizers, token filters, and character filters to create a custom analyzer.
 
+With the introduction of OpenSearch `v2.12.0`, you can retrieve a comprehensive list of all available text analysis components using [Nodes Info]({{site.url}}{{site.baseurl}}/api-reference/nodes-apis/nodes-info/). This can be helpful when building custom analyzers, especially when you need to recall a component's name or identify the analysis plugin that provides it.
+
+Introduced 2.12.0
+{: .label .label-purple }
+
+```json
+GET /_nodes/analysis_components?pretty=true&filter_path=nodes.*.analysis_components
+```
+{% include copy-curl.html %}
+
+The following is an example response from a node that includes the `analysis-common` module (a module that is present by default):
+
+ + Response + + {: .text-delta} + +```json +{ + "nodes" : { + "cZidmv5kQbWQN8M8dz9f5g" : { + "analysis_components" : { + "analyzers" : [ + "arabic", + "armenian", + "basque", + "bengali", + "brazilian", + "bulgarian", + "catalan", + "chinese", + "cjk", + "czech", + "danish", + "default", + "dutch", + "english", + "estonian", + "fingerprint", + "finnish", + "french", + "galician", + "german", + "greek", + "hindi", + "hungarian", + "indonesian", + "irish", + "italian", + "keyword", + "latvian", + "lithuanian", + "norwegian", + "pattern", + "persian", + "portuguese", + "romanian", + "russian", + "simple", + "snowball", + "sorani", + "spanish", + "standard", + "stop", + "swedish", + "thai", + "turkish", + "whitespace" + ], + "tokenizers" : [ + "PathHierarchy", + "char_group", + "classic", + "edgeNGram", + "edge_ngram", + "keyword", + "letter", + "lowercase", + "nGram", + "ngram", + "path_hierarchy", + "pattern", + "simple_pattern", + "simple_pattern_split", + "standard", + "thai", + "uax_url_email", + "whitespace" + ], + "tokenFilters" : [ + "apostrophe", + "arabic_normalization", + "arabic_stem", + "asciifolding", + "bengali_normalization", + "brazilian_stem", + "cjk_bigram", + "cjk_width", + "classic", + "common_grams", + "concatenate_graph", + "condition", + "czech_stem", + "decimal_digit", + "delimited_payload", + "delimited_term_freq", + "dictionary_decompounder", + "dutch_stem", + "edgeNGram", + "edge_ngram", + "elision", + "fingerprint", + "flatten_graph", + "french_stem", + "german_normalization", + "german_stem", + "hindi_normalization", + "hunspell", + "hyphenation_decompounder", + "indic_normalization", + "keep", + "keep_types", + "keyword_marker", + "kstem", + "length", + "limit", + "lowercase", + "min_hash", + "multiplexer", + "nGram", + "ngram", + "pattern_capture", + "pattern_replace", + "persian_normalization", + "porter_stem", + "predicate_token_filter", + "remove_duplicates", + "reverse", + "russian_stem", + "scandinavian_folding", + "scandinavian_normalization", + "serbian_normalization", + "shingle", + "snowball", + "sorani_normalization", + "standard", + "stemmer", + "stemmer_override", + "stop", + "synonym", + "synonym_graph", + "trim", + "truncate", + "unique", + "uppercase", + "word_delimiter", + "word_delimiter_graph" + ], + "charFilters" : [ + "html_strip", + "mapping", + "pattern_replace" + ], + "normalizers" : [ + "lowercase" + ], + "plugins" : [ + { + "name" : "analysis-common", + "classname" : "org.opensearch.analysis.common.CommonAnalysisModulePlugin", + "analyzers" : [ + "arabic", + "armenian", + "basque", + "bengali", + "brazilian", + "bulgarian", + "catalan", + "chinese", + "cjk", + "czech", + "danish", + "dutch", + "english", + "estonian", + "fingerprint", + "finnish", + "french", + "galician", + "german", + "greek", + "hindi", + "hungarian", + "indonesian", + "irish", + "italian", + "latvian", + "lithuanian", + "norwegian", + "pattern", + "persian", + "portuguese", + "romanian", + "russian", + "snowball", + "sorani", + "spanish", + "swedish", + "thai", + "turkish" + ], + "tokenizers" : [ + "PathHierarchy", + "char_group", + "classic", + "edgeNGram", + "edge_ngram", + "keyword", + "letter", + "lowercase", + "nGram", + "ngram", + "path_hierarchy", + "pattern", + "simple_pattern", + "simple_pattern_split", + "thai", + "uax_url_email", + "whitespace" + ], + "tokenFilters" : [ + "apostrophe", + "arabic_normalization", + "arabic_stem", + "asciifolding", + "bengali_normalization", + "brazilian_stem", + "cjk_bigram", + "cjk_width", + "classic", + "common_grams", + 
"concatenate_graph", + "condition", + "czech_stem", + "decimal_digit", + "delimited_payload", + "delimited_term_freq", + "dictionary_decompounder", + "dutch_stem", + "edgeNGram", + "edge_ngram", + "elision", + "fingerprint", + "flatten_graph", + "french_stem", + "german_normalization", + "german_stem", + "hindi_normalization", + "hyphenation_decompounder", + "indic_normalization", + "keep", + "keep_types", + "keyword_marker", + "kstem", + "length", + "limit", + "lowercase", + "min_hash", + "multiplexer", + "nGram", + "ngram", + "pattern_capture", + "pattern_replace", + "persian_normalization", + "porter_stem", + "predicate_token_filter", + "remove_duplicates", + "reverse", + "russian_stem", + "scandinavian_folding", + "scandinavian_normalization", + "serbian_normalization", + "snowball", + "sorani_normalization", + "stemmer", + "stemmer_override", + "synonym", + "synonym_graph", + "trim", + "truncate", + "unique", + "uppercase", + "word_delimiter", + "word_delimiter_graph" + ], + "charFilters" : [ + "html_strip", + "mapping", + "pattern_replace" + ], + "hunspellDictionaries" : [ ] + } + ] + } + } + } +} +``` +
+
 ## Text analysis at indexing time and query time
 
 OpenSearch performs text analysis on text fields when you index a document and when you send a search request. Depending on the time of text analysis, the analyzers used for it are classified as follows:
diff --git a/_analyzers/normalizers.md b/_analyzers/normalizers.md
new file mode 100644
index 00000000000..4195040602e
--- /dev/null
+++ b/_analyzers/normalizers.md
@@ -0,0 +1,111 @@
+---
+layout: default
+title: Normalizers
+nav_order: 100
+---
+
+# Normalizers
+
+A _normalizer_ has a similar function to an analyzer, but it outputs only a single token. It does not contain a tokenizer and can include only specific types of character filters and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole. This means that replacing a token with a synonym or stemming is not supported.
+
+A normalizer is useful in keyword search (that is, in term-based queries) because it enables running token filters and character filters on the given input. For instance, it makes it possible to match an incoming query `Naïve` with the index term `naive`.
+
+Consider the following example.
+
+Create a new index with a custom normalizer:
+```json
+PUT /sample-index
+{
+  "settings": {
+    "analysis": {
+      "normalizer": {
+        "normalized_keyword": {
+          "type": "custom",
+          "char_filter": [],
+          "filter": [ "asciifolding", "lowercase" ]
+        }
+      }
+    }
+  },
+  "mappings": {
+    "properties": {
+      "approach": {
+        "type": "keyword",
+        "normalizer": "normalized_keyword"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Index a document:
+```json
+POST /sample-index/_doc/
+{
+  "approach": "naive"
+}
+```
+{% include copy-curl.html %}
+
+The following query matches the document. This is expected:
+```json
+GET /sample-index/_search
+{
+  "query": {
+    "term": {
+      "approach": "naive"
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+But the following query matches the document as well:
+```json
+GET /sample-index/_search
+{
+  "query": {
+    "term": {
+      "approach": "Naïve"
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+To understand why, examine the effect of the normalizer:
+```json
+GET /sample-index/_analyze
+{
+  "normalizer" : "normalized_keyword",
+  "text" : "Naïve"
+}
+```
+{% include copy-curl.html %}
+
+Internally, a normalizer accepts only filters that are instances of either `NormalizingTokenFilterFactory` or `NormalizingCharFilterFactory`. The following is a list of compatible filters found in modules and plugins that are part of the core OpenSearch repository:
+
+### Module `analysis-common`
+
+This module does not require installation; it is available by default.
+
+Character filters: `pattern_replace`, `mapping`
+
+Token filters: `arabic_normalization`, `asciifolding`, `bengali_normalization`, `cjk_width`, `decimal_digit`, `elision`, `german_normalization`, `hindi_normalization`, `indic_normalization`, `lowercase`, `persian_normalization`, `scandinavian_folding`, `scandinavian_normalization`, `serbian_normalization`, `sorani_normalization`, `trim`, `uppercase`
+
+### Plugin `analysis-icu`
+
+Character filters: `icu_normalizer`
+
+Token filters: `icu_normalizer`, `icu_folding`, `icu_transform`
+
+### Plugin `analysis-kuromoji`
+
+Character filters: `normalize_kanji`, `normalize_kana`
+
+### Plugin `analysis-nori`
+
+Character filters: `normalize_kanji`, `normalize_kana`
+
+This list of filters covers only analysis components found in [additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins) that are part of the core OpenSearch repository.
+{: .note}
\ No newline at end of file
diff --git a/_api-reference/nodes-apis/nodes-info.md b/_api-reference/nodes-apis/nodes-info.md
index d7c810410ef..9d2a21469dc 100644
--- a/_api-reference/nodes-apis/nodes-info.md
+++ b/_api-reference/nodes-apis/nodes-info.md
@@ -69,6 +69,7 @@ plugins | Information about installed plugins and modules.
 ingest | Information about ingest pipelines and available ingest processors.
 aggregations | Information about available [aggregations]({{site.url}}{{site.baseurl}}/opensearch/aggregations).
 indices | Static index settings configured at the node level.
+analysis_components | Information about available [text analysis]({{site.url}}{{site.baseurl}}/analyzers/) components.
 
 ## Query parameters
 
@@ -162,6 +163,7 @@ plugins | Information about the installed plugins, including name, version, Open
 modules | Information about the modules, including name, version, OpenSearch version, Java version, description, class name, custom folder name, a list of extended plugins, and `has_native_controller`, which specifies whether the plugin has a native controller process. Modules are different from plugins because modules are loaded into OpenSearch automatically, while plugins have to be installed manually.
 ingest | Information about ingest pipelines and processors.
 aggregations | Information about the available aggregation types.
+analysis_components | Information about available [text analysis]({{site.url}}{{site.baseurl}}/analyzers/) components.
 
 ## Required permissions
diff --git a/_field-types/supported-field-types/keyword.md b/_field-types/supported-field-types/keyword.md
index 628d720b025..0f913e16396 100644
--- a/_field-types/supported-field-types/keyword.md
+++ b/_field-types/supported-field-types/keyword.md
@@ -49,7 +49,7 @@ Parameter | Description
 `index` | A Boolean value that specifies whether the field should be searchable. Default is `true`.
 `index_options` | Information to be stored in the index that will be considered when calculating relevance scores. Can be set to `freqs` for term frequency. Default is `docs`.
 `meta` | Accepts metadata for this field.
-`normalizer` | Specifies how to preprocess this field before indexing (for example, make it lowercase). Default is `null` (no preprocessing).
+[`normalizer`]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) | Specifies how to preprocess this field before indexing (for example, make it lowercase). Default is `null` (no preprocessing).
 `norms` | A Boolean value that specifies whether the field length should be used when calculating relevance scores. Default is `false`.
 [`null_value`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/index#null-value) | A value to be used in place of `null`. Must be of the same type as the field. If this parameter is not specified, the field is treated as missing when its value is `null`. Default is `null`.
 `similarity` | The ranking algorithm for calculating relevance scores. Default is `BM25`.
diff --git a/_query-dsl/term-vs-full-text.md b/_query-dsl/term-vs-full-text.md
index 0bae2fb4a49..cbf5368f7a2 100644
--- a/_query-dsl/term-vs-full-text.md
+++ b/_query-dsl/term-vs-full-text.md
@@ -8,7 +8,7 @@ redirect_from:
 
 # Term-level and full-text queries compared
 
-You can use both term-level and full-text queries to search text, but while term-level queries are usually used to search structured data, full-text queries are used for full-text search. The main difference between term-level and full-text queries is that term-level queries search documents for an exact specified term, while full-text queries analyze the query string. The following table summarizes the differences between term-level and full-text queries.
+You can use both term-level and full-text queries to search text, but while term-level queries are usually used to search structured data, full-text queries are used for full-text search. The main difference between term-level and full-text queries is that term-level queries search documents for an exact specified term, while full-text queries [analyze]({{site.url}}{{site.baseurl}}/analyzers/) the query string. The following table summarizes the differences between term-level and full-text queries.
 
 | | Term-level queries | Full-text queries
 :--- | :--- | :---