[DOC] Add new doc for grok processor #5019

Merged Nov 6, 2023
`_ingest-pipelines/processors/grok.md` (211 additions)
---
layout: default
title: Grok
parent: Ingest processors
grand_parent: Ingest pipelines
nav_order: 140
---

# Grok


The `grok` processor parses and extracts structured data from unstructured text. It is useful in log analytics and data processing pipelines, where data often arrives in a raw, unformatted state. The `grok` processor uses a combination of pattern matching and regular expressions to identify and extract information from the input text. The processor supports a range of predefined patterns for common data types, such as timestamps, IP addresses, and usernames, and can perform transformations on extracted data, such as converting a timestamp into a proper date field.


The following is the syntax for the `grok` processor:

```json
{
  "grok": {
    "field": "your_message",
    "patterns": ["your_patterns"]
  }
}
```
{% include copy-curl.html %}

## Grok patterns


The `grok` processor is based on the [`java-grok`](https://mvnrepository.com/artifact/io.krakens/java-grok) library and supports all compatible patterns. The `java-grok` library is built using the [`java.util.regex`](https://docs.oracle.com/javase/8/docs/api/java/util/regex/package-summary.html) regular expression library.

You can add custom patterns to your pipelines using the `pattern_definitions` parameter. The [Grok Debugger](https://grokdebugger.com/) can help you test and debug grok patterns before using them in your [ingest pipelines]({{site.url}}{{site.baseurl}}/api-reference/ingest-apis/index/).
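Conceptually, a grok expression is a set of named placeholders that expand into regular-expression capture groups. The following minimal Python sketch illustrates that expansion (it is not the `java-grok` implementation, and the pattern names and regexes shown are simplified stand-ins):

```python
import re

# Simplified stand-ins for a grok pattern dictionary
GROK_PATTERNS = {
    "WORD": r"\w+",
    "NUMBER": r"\d+",
}

def expand(expr):
    """Rewrite each %{NAME:field} placeholder as a named capture group."""
    return re.sub(
        r"%\{(\w+):(\w+)\}",
        lambda m: f"(?P<{m.group(2)}>{GROK_PATTERNS[m.group(1)]})",
        expr,
    )

regex = expand("%{WORD:method} %{NUMBER:status}")
print(re.search(regex, "GET 200").groupdict())  # {'method': 'GET', 'status': '200'}
```

Each named group in the resulting regex becomes a field in the processed document, which is how grok turns a raw log line into structured data.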


## Configuration parameters


To configure the `grok` processor, you have various options that allow you to define patterns, match specific keys, and control the processor's behavior. The following table lists the required and optional parameters for the `grok` processor.

Parameter | Required | Description |
|-----------|-----------|-----------|
`field` | Required | The name of the field containing the text to be parsed. |
`patterns` | Required | A list of grok expressions used to match and extract named captures. The first expression in the list that matches is returned. |
`pattern_definitions` | Optional | A dictionary that maps pattern names to pattern strings, defining custom patterns for the current processor. If a custom pattern uses an existing name, it overrides the pre-existing definition. |
`trace_match` | Optional | When the parameter is set to `true`, the processor adds a field named `_grok_match_index` to the processed document. This field contains the index of the pattern within the `patterns` array that successfully matched the document. This information can be useful for debugging and understanding which pattern was applied to the document. Default is `false`. |
`description` | Optional | A brief description of the processor. |
`if` | Optional | A condition for running this processor. |
`ignore_failure` | Optional | If set to `true`, failures are ignored. Default is `false`. |
`ignore_missing` | Optional | If set to `true`, the processor does not modify the document if the field does not exist or is `null`. Default is `false`. |
`on_failure` | Optional | A list of processors to run if the processor fails. |
`tag` | Optional | An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type. |
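To illustrate how several of the optional parameters fit together, the following hypothetical configuration sketch (not taken from the original examples) tags the processor for debugging, tolerates a missing field, and sets a marker field through an `on_failure` processor if no pattern matches:

```json
{
  "grok": {
    "field": "message",
    "patterns": ["%{IPORHOST:clientip}"],
    "ignore_missing": true,
    "tag": "parse-client-ip",
    "on_failure": [
      {
        "set": {
          "field": "grok_failed",
          "value": true
        }
      }
    ]
  }
}
```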

## Using the processor

Follow these steps to use the processor in a pipeline.

**Step 1: Create a pipeline.**

The following query creates a pipeline named `log_line` that extracts fields from the `message` field of the document using the specified pattern. In this case, it extracts the `clientip`, `timestamp`, and `response_status` fields:

```json
PUT _ingest/pipeline/log_line
{
  "description": "Extract fields from a log line",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IPORHOST:clientip} %{HTTPDATE:timestamp} %{NUMBER:response_status:int}"]
      }
    }
  ]
}
```
{% include copy-curl.html %}

**Step 2 (Optional): Test the pipeline.**

{::nomarkdown}<img src="{{site.url}}{{site.baseurl}}/images/icons/alert-icon.png" class="inline-icon" alt="alert icon"/>{:/} **NOTE**<br>It is recommended that you test your pipeline before you ingest documents.
{: .note}

To test the pipeline, run the following query:

```json
POST _ingest/pipeline/log_line/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "127.0.0.1 198.126.12 10/Oct/2000:13:55:36 -0700 200"
      }
    }
  ]
}
```
{% include copy-curl.html %}

#### Response

The following response confirms that the pipeline is working as expected:

```json
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "message": "127.0.0.1 198.126.12 10/Oct/2000:13:55:36 -0700 200",
          "response_status": 200,
          "clientip": "198.126.12",
          "timestamp": "10/Oct/2000:13:55:36 -0700"
        },
        "_ingest": {
          "timestamp": "2023-09-13T21:41:52.064540505Z"
        }
      }
    }
  ]
}
```

**Step 3: Ingest a document.**

The following query ingests a document into an index named `testindex1`:

```json
PUT testindex1/_doc/1?pipeline=log_line
{
  "message": "127.0.0.1 198.126.12 10/Oct/2000:13:55:36 -0700 200"
}
```
{% include copy-curl.html %}

**Step 4 (Optional): Retrieve the document.**

To retrieve the document, run the following query:

```json
GET testindex1/_doc/1
```
{% include copy-curl.html %}
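The retrieved document should contain the parsed fields in `_source`. The metadata values in the following sketch, such as `_version` and `_seq_no`, are illustrative and will vary by cluster:

```json
{
  "_index": "testindex1",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "message": "127.0.0.1 198.126.12 10/Oct/2000:13:55:36 -0700 200",
    "response_status": 200,
    "clientip": "198.126.12",
    "timestamp": "10/Oct/2000:13:55:36 -0700"
  }
}
```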

## Tailoring grok with custom patterns

Custom grok patterns can be used in a pipeline to extract structured data from log messages that do not match the built-in grok patterns. This is useful for parsing log messages from custom applications or log messages that have been modified in some way. Custom patterns follow a straightforward structure: each pattern has a unique name and a corresponding regular expression that defines its matching behavior. These custom patterns can be incorporated into the `grok` processor using the `pattern_definitions` parameter. This parameter accepts a dictionary in which the keys are the pattern names and the values are the corresponding regular expressions.

The following is an example of how to include a custom pattern in your configuration. In this example, `MY_CUSTOM_PATTERN` is defined and subsequently used in the `patterns` list, which tells grok to look for this pattern in the log message. The pattern is a regular expression that matches any sequence of alphanumeric characters and captures the matched characters into the `my_field` field.

```json
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{MY_CUSTOM_PATTERN:my_field}"
        ],
        "pattern_definitions": {
          "MY_CUSTOM_PATTERN": "([a-zA-Z0-9]+)"
        }
      }
    }
  ]
}
```
{% include copy-curl.html %}
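Because `pattern_definitions` values are plain regular expressions, you can sanity-check a custom pattern outside the pipeline. The following minimal sketch uses Python's `re` module rather than the `java.util.regex` engine grok actually uses; the two agree for a simple pattern like this one:

```python
import re

# The regular expression from the MY_CUSTOM_PATTERN definition above
MY_CUSTOM_PATTERN = r"([a-zA-Z0-9]+)"

# The first run of alphanumeric characters is captured into the group,
# which grok would store in my_field
match = re.search(MY_CUSTOM_PATTERN, "error42: disk full")
print(match.group(1))  # error42
```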

## Tracing which patterns matched

To trace which patterns matched and populated the fields, you can use the `trace_match` parameter. The following is an example of how to include this parameter in your configuration:

```json
PUT _ingest/pipeline/log_line
{
  "description": "Extract fields from a log line",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IPORHOST:clientip} %{HTTPDATE:timestamp} %{NUMBER:response_status:int}"],
        "trace_match": true
      }
    }
  ]
}
```
{% include copy-curl.html %}

#### Response

The following response shows the output of the same pipeline used in step 1, but with `trace_match` set to `true`:

```json
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "message": "127.0.0.1 198.126.12 10/Oct/2000:13:55:36 -0700 200",
          "response_status": 200,
          "clientip": "198.126.12",
          "timestamp": "10/Oct/2000:13:55:36 -0700"
        },
        "_ingest": {
          "_grok_match_index": "0",
          "timestamp": "2023-10-23T19:18:37.14624097Z"
        }
      }
    }
  ]
}
```