-
Notifications
You must be signed in to change notification settings - Fork 480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DOC] Add new doc for grok processor #5019
Changes from 15 commits
fae8001
632dc96
c9cbd62
f028ff0
f23b3c5
56cfb78
3103ac7
e0bcd0e
8059c24
8d84939
589eaf4
898ff52
431b45d
2b620fb
bf81892
ce08abe
0ddb2db
d7b7af8
10acca3
9318025
efc2346
d591f6d
1791c59
f9c2ee0
3eace4a
549ee7d
601677d
08f1d71
615f3d3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,211 @@ | ||
--- | ||
layout: default | ||
title: Grok | ||
parent: Ingest processors | ||
grand_parent: Ingest pipelines | ||
nav_order: 140 | ||
--- | ||
|
||
# Grok | ||
Check failure on line 9 in _ingest-pipelines/processors/grok.md GitHub Actions / vale[vale] _ingest-pipelines/processors/grok.md#L9
Raw output
|
||
|
||
The `grok` processor is used to parse and extract structured data from unstructured data. It is useful in log analytics and data processing pipelines where data is often in a raw and unformatted state. The `grok` processor uses a combination of pattern matching and regular expressions to identify and extract information from the input text. The processor supports a range of predefined patterns for common data types such as timestamps, IP addresses, and usernames. The `grok` processor can perform transformations on extracted data, such as converting a timestamp to a proper date field. | ||
Check failure on line 11 in _ingest-pipelines/processors/grok.md GitHub Actions / vale[vale] _ingest-pipelines/processors/grok.md#L11
Raw output
|
||
|
||
The following is the syntax for the `grok` processor: | ||
|
||
```json | ||
{ | ||
"grok": { | ||
"field": "your_message", | ||
"patterns": ["your_patterns"] | ||
} | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
## Grok patterns | ||
Check failure on line 25 in _ingest-pipelines/processors/grok.md GitHub Actions / vale[vale] _ingest-pipelines/processors/grok.md#L25
Raw output
|
||
|
||
The Grok processor is based on the [`java-grok`](https://mvnrepository.com/artifact/io.krakens/java-grok) library and supports all compatible patterns. The `java-grok` library is built using the [`java.util.regex`](https://docs.oracle.com/javase/8/docs/api/java/util/regex/package-summary.html) regular expression library. | ||
|
||
You can add custom patterns to your pipelines using the `patterns_definitions` parameter. When debugging custom patterns, the [Grok Debugger](https://grokdebugger.com/) can help you test and debug grok patterns before using them in your [ingest pipelines]({{site.url}}{{site.baseurl}}/api-reference/ingest-apis/index/). | ||
|
||
## Configuration parameters | ||
Check failure on line 31 in _ingest-pipelines/processors/grok.md GitHub Actions / vale[vale] _ingest-pipelines/processors/grok.md#L31
Raw output
|
||
|
||
To configure the `grok` processor, you have various options that allow you to define patterns, match specific keys, and control the processor's behavior. The following table lists the required and optional parameters for the `grok` processor. | ||
|
||
Parameter | Required | Description | | ||
|-----------|-----------|-----------| | ||
`field` | Required | The name of the field to which the data should be parsed. | | ||
vagimeli marked this conversation as resolved.
Show resolved
Hide resolved
|
||
`patterns` | Required | A list of grok expressions used to match and extract named captures. The first expression in the list that matches is returned. | | ||
vagimeli marked this conversation as resolved.
Show resolved
Hide resolved
|
||
`pattern_definitions` | Optional | A dictionary of pattern names and pattern tuples (a pair of a pattern name, which is a string that identifies the pattern, and a pattern, which is the string that specifies the pattern itself) is used to define custom patterns for the current processor. If a pattern matches an existing name, it overrides the pre-existing definition. | | ||
vagimeli marked this conversation as resolved.
Show resolved
Hide resolved
|
||
`trace_match` | Optional | When the parameter is set to `true`, the processor adds a field named `_grok_match_index` to the processed document. This field contains the index of the pattern within the `patterns` array that successfully matched the document. This information can be useful for debugging and understanding which pattern was applied to the document. Default is `false`. | | ||
`description` | Optional | A brief description of the processor. | | ||
`if` | Optional | A condition for running this processor. | | ||
`ignore_failure` | Optional | If set to `true`, failures are ignored. Default is `false`. | | ||
`ignore_missing` | Optional | If set to `true`, the processor does not modify the document if the field does not exist or is `null`. Default is `false`. | | ||
`on_failure` | Optional | A list of processors to run if the processor fails. | | ||
`tag` | Optional | An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type. | | ||
|
||
## Using the processor | ||
|
||
Follow these steps to use the processor in a pipeline. | ||
|
||
**Step 1: Create a pipeline.** | ||
|
||
The following query creates a pipeline, named `log_line`. It extracts fields from the `message` field of the document using the specified pattern. In this case, it extracts the `clientip`, `timestamp`, and `response_status` fields: | ||
|
||
```json | ||
PUT _ingest/pipeline/log_line | ||
{ | ||
"description": "Extract fields from a log line", | ||
"processors": [ | ||
{ | ||
"grok": { | ||
"field": "message", | ||
"patterns": ["%{IPORHOST:clientip} %{HTTPDATE:timestamp} %{NUMBER:response_status:int}"] | ||
} | ||
} | ||
] | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
**Step 2 (Optional): Test the pipeline.** | ||
|
||
{::nomarkdown}<img src="{{site.url}}{{site.baseurl}}/images/icons/alert-icon.png" class="inline-icon" alt="alert icon"/>{:/} **NOTE**<br>It is recommended that you test your pipeline before you ingest documents. | ||
{: .note} | ||
|
||
To test the pipeline, run the following query: | ||
|
||
```json | ||
POST _ingest/pipeline/log_line/_simulate | ||
{ | ||
"docs": [ | ||
{ | ||
"_source": { | ||
"message": "127.0.0.1 198.126.12 10/Oct/2000:13:55:36 -0700 200" | ||
} | ||
} | ||
] | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
#### Response | ||
|
||
The following response confirms that the pipeline is working as expected: | ||
|
||
```json | ||
{ | ||
"docs": [ | ||
{ | ||
"doc": { | ||
"_index": "_index", | ||
"_id": "_id", | ||
"_source": { | ||
"message": "127.0.0.1 198.126.12 10/Oct/2000:13:55:36 -0700 200", | ||
"response_status": 200, | ||
"clientip": "198.126.12", | ||
"timestamp": "10/Oct/2000:13:55:36 -0700" | ||
}, | ||
"_ingest": { | ||
"timestamp": "2023-09-13T21:41:52.064540505Z" | ||
} | ||
} | ||
} | ||
] | ||
} | ||
``` | ||
|
||
**Step 3: Ingest a document.** | ||
|
||
The following query ingests a document into an index named `testindex1`: | ||
|
||
```json | ||
PUT testindex1/_doc/1?pipeline=log_line | ||
{ | ||
"message": "127.0.0.1 198.126.12 10/Oct/2000:13:55:36 -0700 200" | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
**Step 4 (Optional): Retrieve the document.** | ||
|
||
To retrieve the document, run the following query: | ||
|
||
```json | ||
GET testindex1/_doc/1 | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
## Tailoring grok with custom patterns | ||
|
||
Custom grok patterns can be used in a pipeline to extract structured data from log messages that do not match the built-in grok patterns. This can be useful for parsing log messages from custom applications or for parsing log messages that have been modified in some way. Custom patterns adhere to a straightforward structure: each pattern has a unique name and the corresponding regular expression that defines its matching behavior.These custom patterns can be incorporated into the `grok` processor using the `pattern_definitons` parameter. This parameter accepts a dictionary where the keys represent the pattern names and the values represent the corresponding regular expressions. | ||
|
||
The following is an example of how to include a custom pattern in your configuration. In this example, `MY_CUSTOM_PATTERN` is defined and subsequently used in the `patterns` list, which tells grok to look for this pattern in the log message. The pattern is a regular expression that matches any sequence of alphanumeric characters and captures the matched characters into the `my_field`. | ||
|
||
```json | ||
{ | ||
"processors":[ | ||
{ | ||
"grok":{ | ||
"field":"message", | ||
"patterns":[ | ||
"%{MY_CUSTOM_PATTERN:my_field}" | ||
], | ||
"pattern_definitions":{ | ||
"MY_CUSTOM_PATTERN":"([a-zA-Z0-9]+)" | ||
} | ||
} | ||
} | ||
] | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
## Tracing which patterns matched | ||
vagimeli marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
To trace which patterns matched and populated the fields, you can use the `trace_match` parameter. The following is an example of how to include this parameter in your configuration: | ||
|
||
```json | ||
PUT _ingest/pipeline/log_line | ||
{ | ||
"description": "Extract fields from a log line", | ||
"processors": [ | ||
{ | ||
"grok": { | ||
"field": "message", | ||
"patterns": ["%{IPORHOST:clientip} %{HTTPDATE:timestamp} %{NUMBER:response_status:int}"], | ||
"trace_match": true | ||
} | ||
} | ||
] | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
#### Response | ||
|
||
The following response shows the output of the same pipeline used in step 1, but with `trace_match` set to true: | ||
|
||
```json | ||
{ | ||
"docs": [ | ||
{ | ||
"doc": { | ||
"_index": "_index", | ||
"_id": "_id", | ||
"_source": { | ||
"message": "127.0.0.1 198.126.12 10/Oct/2000:13:55:36 -0700 200", | ||
"response_status": 200, | ||
"clientip": "198.126.12", | ||
"timestamp": "10/Oct/2000:13:55:36 -0700" | ||
}, | ||
"_ingest": { | ||
"_grok_match_index": "0", | ||
"timestamp": "2023-10-23T19:18:37.14624097Z" | ||
} | ||
} | ||
} | ||
] | ||
} | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the Grok Constructor available with OpenSearch? Or is this an Elastic feature only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which Grok Constructor are you referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted