From 7a264ce1d54a99ed7e42bb37875d1d1c51b7c637 Mon Sep 17 00:00:00 2001
From: Taylor Gray
Date: Tue, 3 Oct 2023 16:15:22 -0500
Subject: [PATCH 01/12] Documentation for the Data Prepper OpenSearch source

Signed-off-by: Taylor Gray
---
 .../configuration/sinks/opensearch.md   |   2 +-
 .../configuration/sources/opensearch.md | 321 ++++++++++++++++++
 2 files changed, 322 insertions(+), 1 deletion(-)
 create mode 100644 _data-prepper/pipelines/configuration/sources/opensearch.md

diff --git a/_data-prepper/pipelines/configuration/sinks/opensearch.md b/_data-prepper/pipelines/configuration/sinks/opensearch.md
index 8da02bd41b..8a8afccceb 100644
--- a/_data-prepper/pipelines/configuration/sinks/opensearch.md
+++ b/_data-prepper/pipelines/configuration/sinks/opensearch.md
@@ -262,7 +262,7 @@ Next, create a collection with the following settings:
       ],
       "Description":"Pipeline role access"
     }
-  ]
+  ]
   ```

 ***Important***: Make sure to replace the ARN in the `Principal` element with the ARN of the pipeline role that you created in the preceding step.

diff --git a/_data-prepper/pipelines/configuration/sources/opensearch.md b/_data-prepper/pipelines/configuration/sources/opensearch.md
new file mode 100644
index 0000000000..ae7515c5f5
--- /dev/null
+++ b/_data-prepper/pipelines/configuration/sources/opensearch.md
@@ -0,0 +1,321 @@
+---
+layout: default
+title: opensearch source
+parent: Sources
+grand_parent: Pipelines
+nav_order: 30
+---
+
+# opensearch source
+
+You can use the `opensearch` source plugin to read indices from an OpenSearch cluster, a legacy Elasticsearch cluster, an Amazon OpenSearch Service domain, or an Amazon OpenSearch Serverless collection.
+
+The plugin supports OpenSearch 2.x, and Elasticsearch 7.x.
+
+## Usage
+
+#### Minimum required config with username and password
+
+```yaml
+opensearch-source-pipeline:
+  source:
+    opensearch:
+      hosts: [ "https://localhost:9200" ]
+      username: "username"
+      password: "password"
+      ...
+```
+
+#### Full config example
+
+```yaml
+opensearch-source-pipeline:
+  source:
+    opensearch:
+      hosts: [ "https://localhost:9200" ]
+      username: "username"
+      password: "password"
+      indices:
+        include:
+          - index_name_regex: "test-index-.*"
+        exclude:
+          - index_name_regex: "\..*"
+      scheduling:
+        interval: "PT1H"
+        index_read_count: 2
+        start_time: "2023-06-02T22:01:30.00Z"
+      search_options:
+        search_context_type: "none"
+        batch_size: 1000
+      connection:
+        insecure: false
+        cert: "/path/to/cert.crt"
+      ...
+```
+
+#### Amazon OpenSearch Service
+
+The OpenSearch source can also be configured for an Amazon OpenSearch service domain by passing an `sts_role_arn` with access to the domain.
+
+```yaml
+opensearch-source-pipeline:
+  source:
+    opensearch:
+      hosts: [ "https://search-my-domain-soopywaovobopgs8ywurr3utsu.us-east-1.es.amazonaws.com" ]
+      aws:
+        region: "us-east-1"
+        sts_role_arn: "arn:aws:iam::123456789012:role/my-domain-role"
+      ...
+```
+
+#### Using Metadata
+
+When the OpenSearch source constructs Data Prepper Events from documents in the cluster, the document index is stored in the EventMetadata with an `opensearch-index` key, and the document_id is stored in the EventMetadata with a `opensearch-document_id` key. This allows conditional routing based on the index or document_id, among other things. For example, one could send to an OpenSearch sink and use the same index and document_id from the source cluster in the destination cluster.
+A full config example for this use case is below
+
+
+```yaml
+opensearch-migration-pipeline:
+  source:
+    opensearch:
+      hosts: [ "https://source-cluster:9200" ]
+      username: "username"
+      password: "password"
+  sink:
+    - opensearch:
+        hosts: [ "https://sink-cluster:9200" ]
+        username: "username"
+        password: "password"
+        document_id: "${getMetadata(\"opensearch-document_id\")}"
+        index: "${getMetadata(\"opensearch-index\")}"
+```
+
+### Configuration options
+
+
+The following table describes options you can configure for the `opensearch` source.
+
+Option | Required | Type | Description
+:--- | :--- |:--------| :---
+`hosts` | Yes | List | List of OpenSearch hosts to read from (for example, `["https://localhost:9200", "https://remote-cluster:9200"]`).
+`username` | No | String | Username for HTTP basic authentication.
+`password` | No | String | Password for HTTP basic authentication.
+`disable_authentication` | No | Boolean | Whether authentication is disabled. Defaults to false.
+`aws` | No | Object | The AWS configuration. For more information, see [aws](#aws).
+`acknowledgments` | No | Boolean | If `true`, enables the `opensearch` source to receive [end-to-end acknowledgments]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/#end-to-end-acknowledgments) when events are received by OpenSearch sinks. Default is `false`.
+`connection` | No | Object | The connection configuration, see [Connection](#connection).
+`indices` | No | Object | The indices configuration for filtering which indices are processed. Defaults to all indices, including system indices. See [Indices](#indices)
+`scheduling` | No | Object | The scheduling configuration. See [Scheduling](#scheduling).
+`search_options` | No | Object | The search options configuration. See [Search options](#search_options)
+
+#### Scheduling
+
+The scheduling configuration allows the user to configure reprocessing of each included index up to `index_read_count` number of times every `interval` amount of time.
+For example, setting `index_read_count` to 3 with an `interval` of 1 hour will result in all indices being processed 3 times, an hour apart. By default,
+indices will only be processed once.
+
+Option | Required | Type | Description
+:--- | :--- |:----------------| :---
+`index_read_count` | No | Integer | The number of times each index will be processed. Default to 1.
+`interval` | No | String | The interval to reprocess indices. Supports ISO_8601 notation Strings ("PT20.345S", "PT15M", etc.) as well as simple notation Strings for seconds ("60s") and milliseconds ("1500ms"). Defaults to 8 hours.
+`start_time` | No | String | The instant of time when processing should begin. The source will not start processing until this instant is reached. The String must be in ISO-8601 format, such as `2007-12-03T10:15:30.00Z`. Defaults to starting processing immediately.
+
+
+#### Indices
+###
+The below options allow filtering which indices are processed from the source cluster via regex patterns. An index will only
+be processed if it matches one of the `index_name_regex` patterns under `include` and does not match any of the
+patterns under `exclude`.
+
+Option | Required | Type | Description
+:--- | :--- |:-----------------| :---
+`include` | No | Array of Objects | A List of [Index Configuration](#index_configuration) that specifies which indices will be processed.
+`exclude` | No | Array of Objects | A List of [Index Configuration](#index_configuration) that specifies which indices will not be processed. For example, one can specify an `index_name_regex` pattern of `\..*` to exclude system indices.
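The include/exclude matching rule described above can be sketched in a few lines. The following is an illustrative sketch only, not Data Prepper code; the function name and signature are hypothetical and exist purely to demonstrate the rule that an index is processed only when it matches an `include` pattern and matches no `exclude` pattern.

```python
import re

def should_process(index_name, include=None, exclude=None):
    # Hypothetical sketch of the rule above, not part of Data Prepper:
    # if `include` patterns exist, the index must match at least one of them,
    # and it must match none of the `exclude` patterns.
    if include and not any(re.fullmatch(p, index_name) for p in include):
        return False
    return not any(re.fullmatch(p, index_name) for p in (exclude or []))

# Process test indices but skip system indices such as ".kibana".
print(should_process("test-index-1", include=[r"test-index-.*"], exclude=[r"\..*"]))  # True
print(should_process(".kibana", exclude=[r"\..*"]))  # False
```

Note that with no `include` list at all, every index not excluded is processed, which matches the default of reading all indices.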
+
+###### IndexConfiguration
+###
+
+Option | Required | Type | Description
+:--- |:----|:-----------------| :---
+`index_name_regex` | Yes | Regex String | The regex pattern to match indices against
+
+#### Search options
+###
+
+Option | Required | Type | Description
+:--- |:---------|:--------| :---
+`batch_size` | No | Integer | The number of documents to read at a time while paginating from OpenSearch. Defaults to `1000`
+`search_context_type` | No | Enum | An override for the type of search/pagination to use on indices. Can be one of `point_in_time` (uses [Point in Time](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#point-in-time-with-search_after)), `scroll` (uses [scroll](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#scroll-search)), or `none` (uses [search_after](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#the-search_after-parameter)). See [Default Search Behavior](#default_search_behavior) for default behavior.
+
+###### Default search behavior
+###
+
+By default, the `opensearch` source will do a lookup of the cluster version and distribution to determine
+which `search_context_type` to use. For versions and distributions that support [Point in Time](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#point-in-time-with-search_after), `point_in_time` will be used.
+If `point_in_time` is not supported by the cluster, then [Scroll](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#scroll-search) will be used. For Amazon OpenSearch Serverless collections, [search_after](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#the-search_after-parameter)
+will be used as neither `point_in_time` or `scroll` are supported by collections.
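The fallback order described above (point in time, then scroll, then `search_after`) can be summarized as a small decision function. This is an illustrative sketch under stated assumptions, not the plugin's actual implementation; the function and parameter names are hypothetical.

```python
def pick_search_context(supports_pit, supports_scroll, serverless=False):
    # Hypothetical sketch of the default selection, not Data Prepper code:
    # serverless collections support neither point-in-time nor scroll,
    # so only search_after ("none") remains; otherwise prefer
    # point-in-time and fall back to scroll.
    if serverless:
        return "none"
    if supports_pit:
        return "point_in_time"
    if supports_scroll:
        return "scroll"
    return "none"

print(pick_search_context(supports_pit=True, supports_scroll=True))   # point_in_time
print(pick_search_context(supports_pit=False, supports_scroll=True))  # scroll
print(pick_search_context(False, False, serverless=True))             # none
```

Setting `search_context_type` explicitly simply bypasses this lookup.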
+
+#### Connection
+
+Option | Required | Type | Description
+:--- | :--- |:--------| :---
+`cert` | No | String | Path to the security certificate (for example, `"config/root-ca.pem"`) if the cluster uses the OpenSearch Security plugin.
+`insecure` | No | Boolean | Whether or not to verify SSL certificates. If set to true, certificate authority (CA) certificate verification is disabled and insecure HTTP requests are sent instead. Default value is `false`.
+
+
+#### AWS
+
+Use the following options when setting up authentication for `aws` services.
+
+Option | Required | Type | Description
+:--- | :--- |:--------| :---
+`region` | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).
+`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon OpenSearch Service and Amazon OpenSearch Serverless. Default is `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
+`serverless` | No | Boolean | Should be set to true when processing from an Amazon OpenSearch Serverless collection. Defaults to false.
+
+
+## OpenSearch cluster security
+
+In order to pull data from an OpenSearch cluster using the `opensearch` source plugin, you must specify your username and password within the pipeline configuration. The following example `pipelines.yaml` file demonstrates how to specify admin security credentials:
+
+```yaml
+source:
+  opensearch:
+    username: "admin"
+    password: "admin"
+    ...
+```
+
+## Amazon OpenSearch Service domain security
+
+The `opensearch` source plugin can pull data from an [Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html) domain, which uses IAM for security. The plugin uses the default credential chain. Run `aws configure` using the [AWS Command Line Interface (AWS CLI)](https://aws.amazon.com/cli/) to set your credentials.
+
+Make sure the credentials that you configure have the required IAM permissions. The following domain access policy demonstrates the minimum required permissions:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Principal": {
+        "AWS": "arn:aws:iam:::user/data-prepper-user"
+      },
+      "Action": "es:ESHttpGet",
+      "Resource": [
+        "arn:aws:es:us-east-1::domain//",
+        "arn:aws:es:us-east-1::domain//_cat/indices",
+        "arn:aws:es:us-east-1::domain//_search",
+        "arn:aws:es:us-east-1::domain//_search/scroll",
+        "arn:aws:es:us-east-1::domain//*/_search"
+      ]
+    },
+    {
+      "Effect": "Allow",
+      "Principal": {
+        "AWS": "arn:aws:iam:::user/data-prepper-user"
+      },
+      "Action": "es:ESHttpPost",
+      "Resource": [
+        "arn:aws:es:us-east-1::domain//*/_search/point_in_time",
+        "arn:aws:es:us-east-1::domain//*/_search/scroll"
+      ]
+    },
+    {
+      "Effect": "Allow",
+      "Principal": {
+        "AWS": "arn:aws:iam:::user/data-prepper-user"
+      },
+      "Action": "es:ESHttpDelete",
+      "Resource": [
+        "arn:aws:es:us-east-1::domain//_search/point_in_time",
+        "arn:aws:es:us-east-1::domain//_search/scroll"
+      ]
+    }
+  ]
+}
+```
+
+For instructions on how to configure the domain access policy, see [Resource-based policies](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) in the Amazon OpenSearch Service documentation.
+
+## OpenSearch Serverless collection security
+
+The `opensearch` source plugin can receive data from an [Amazon OpenSearch Serverless](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html) collection.
+
+OpenSearch Serverless collection sources have the following limitations:
+
+- You can't read from a collection that uses virtual private cloud (VPC) access. The collection must be accessible from public networks.
+
+### Creating a pipeline role
+
+First, create an IAM role that the pipeline will assume in order to read from the collection. The role must have the following minimum permissions:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Action": [
+        "aoss:APIAccessAll"
+      ],
+      "Resource": "arn:aws:aoss:*::collection/*"
+    }
+  ]
+}
+```
+
+## Creating a collection
+
+Next, create a collection with the following settings:
+
+- Public [network access](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-network.html) to both the OpenSearch endpoint and OpenSearch Dashboards.
+- The following [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html), which grants the required permissions to the pipeline role:
+
+  ```json
+  [
+    {
+      "Rules":[
+        {
+          "Resource":[
+            "index/collection-name/*"
+          ],
+          "Permission":[
+            "aoss:ReadDocument",
+            "aoss:DescribeIndex"
+          ],
+          "ResourceType":"index"
+        }
+      ],
+      "Principal":[
+        "arn:aws:iam:::role/PipelineRole"
+      ],
+      "Description":"Pipeline role access"
+    }
+  ]
+  ```
+
+  ***Important***: Make sure to replace the ARN in the `Principal` element with the ARN of the pipeline role that you created in the preceding step.
+
+  For instructions on how to create collections, see [Creating collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-manage.html#serverless-create) in the Amazon OpenSearch Service documentation.
+
+### Creating a pipeline
+
+Within your `pipelines.yaml` file, specify the OpenSearch Serverless collection endpoint as the `hosts` option. In addition, you must set the `serverless` option to `true`. Specify the pipeline role in the `sts_role_arn` option:
+
+```yaml
+opensearch-source-pipeline:
+  source:
+    opensearch:
+      hosts: [ "https://" ]
+      aws:
+        serverless: true
+        sts_role_arn: "arn:aws:iam:::role/PipelineRole"
+        region: "us-east-1"
+  processor:
+    - date:
+        from_time_received: true
+        destination: "@timestamp"
+  sink:
+    - stdout:
+```

From 82241b3d71c24f9dee9e54b8b2554870788cb9e3 Mon Sep 17 00:00:00 2001
From: Taylor Gray
Date: Tue, 10 Oct 2023 10:10:29 -0500
Subject: [PATCH 02/12] Update
 _data-prepper/pipelines/configuration/sources/opensearch.md

Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Signed-off-by: Taylor Gray
---
 _data-prepper/pipelines/configuration/sources/opensearch.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_data-prepper/pipelines/configuration/sources/opensearch.md b/_data-prepper/pipelines/configuration/sources/opensearch.md
index ae7515c5f5..1efe5119ce 100644
--- a/_data-prepper/pipelines/configuration/sources/opensearch.md
+++ b/_data-prepper/pipelines/configuration/sources/opensearch.md
@@ -1,6 +1,6 @@
 ---
 layout: default
-title: opensearch source
+title: opensearch
 parent: Sources
 grand_parent: Pipelines
 nav_order: 30

From 3de849a2f114acd76dbab6014d8d009fcc2ec1a5 Mon Sep 17 00:00:00 2001
From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Date: Wed, 11 Oct 2023 10:35:08 -0500
Subject: [PATCH 03/12] Update opensearch.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
---
 .../configuration/sources/opensearch.md | 132 +++++++++---------
 1 file changed, 65 insertions(+), 67 deletions(-)

diff --git a/_data-prepper/pipelines/configuration/sources/opensearch.md b/_data-prepper/pipelines/configuration/sources/opensearch.md
index 1efe5119ce..453a2f5ee8 100644
--- a/_data-prepper/pipelines/configuration/sources/opensearch.md
+++ b/_data-prepper/pipelines/configuration/sources/opensearch.md
@@ -6,15 +6,15 @@ grand_parent: Pipelines
 nav_order: 30
 ---
 
-#
opensearch source
+# opensearch

The `opensearch` source plugin to read indexes from an OpenSearch cluster, a legacy Elasticsearch cluster, an Amazon OpenSearch Service domain, or an Amazon OpenSearch Serverless collection.

The plugin supports OpenSearch 2.x and Elasticsearch 7.x.

## Usage

To use the `opensearch` source with the minimum required settings, add the following configuration to your `pipeline.yaml` file:

```yaml
opensearch-source-pipeline:
  source:
    opensearch:
      hosts: [ "https://localhost:9200" ]
      username: "username"
      password: "password"
      ...
```

To use the `opensearch` source with all configuration settings, including `indices`, `scheduling`, `search_options`, and `connection`, add the following example to your `pipeline.yaml` file:

```yaml
opensearch-source-pipeline:
  source:
    opensearch:
      hosts: [ "https://localhost:9200" ]
      username: "username"
      password: "password"
      indices:
        include:
          - index_name_regex: "test-index-.*"
        exclude:
          - index_name_regex: "\..*"
      scheduling:
        interval: "PT1H"
        index_read_count: 2
        start_time: "2023-06-02T22:01:30.00Z"
      search_options:
        search_context_type: "none"
        batch_size: 1000
      connection:
        insecure: false
        cert: "/path/to/cert.crt"
      ...
```

## Amazon OpenSearch Service

The `opensearch` source can be configured for an Amazon OpenSearch Service domain by passing an `sts_role_arn` with access to the domain, as shown in the following example:

```yaml
opensearch-source-pipeline:
  source:
    opensearch:
      hosts: [ "https://search-my-domain-soopywaovobopgs8ywurr3utsu.us-east-1.es.amazonaws.com" ]
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/my-domain-role"
      ...
```

## Using Metadata

-When the OpenSearch source constructs Data Prepper Events from documents in the cluster, the document index is stored in the EventMetadata with an `opensearch-index` key, and the document_id is stored in the EventMetadata with a `opensearch-document_id` key. This allows conditional routing based on the index or document_id, among other things. For example, one could send to an OpenSearch sink and use the same index and document_id from the source cluster in the destination cluster. A full config example for this use case is below
+When the `opensearch` source constructs Data Prepper events from documents in the cluster, the document index is stored in the `EventMetadata` with an `opensearch-index` key, and the document ID is stored in the `EventMetadata` with the `opensearch-document_id` as the key. This allows for conditional routing based on the index or `document_id`. The following example sends events to an `opensearch` sink and uses the same index and `document_id` from the source cluster in the destination cluster:


```yaml
opensearch-migration-pipeline:
  source:
    opensearch:
      hosts: [ "https://source-cluster:9200" ]
      username: "username"
      password: "password"
  sink:
    - opensearch:
        hosts: [ "https://sink-cluster:9200" ]
        username: "username"
        password: "password"
        document_id: "${getMetadata(\"opensearch-document_id\")}"
        index: "${getMetadata(\"opensearch-index\")}"
```

## Configuration options


The following table describes options you can configure for the `opensearch` source.

Option | Required | Type | Description
:--- | :--- |:--------| :---
`hosts` | Yes | List | A list of OpenSearch hosts to read from, for example, `["https://localhost:9200", "https://remote-cluster:9200"]`.
`username` | No | String | The username for HTTP basic authentication.
`password` | No | String | The password for HTTP basic authentication.
`disable_authentication` | No | Boolean | Whether authentication is disabled. Defaults to `false`.
`aws` | No | Object | The AWS configuration. For more information, see [aws](#aws).
-`acknowledgments` | No | Boolean | If `true`, enables the `opensearch` source to receive [end-to-end acknowledgments]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/#end-to-end-acknowledgments) when events are received by OpenSearch sinks. Default is `false`.
+`acknowledgments` | No | Boolean | When `true`, enables the `opensearch` source to receive [end-to-end acknowledgments]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/#end-to-end-acknowledgments) when events are received by OpenSearch sinks. Default is `false`.
`connection` | No | Object | The connection configuration. For more information, see [Connection](#connection).
`indices` | No | Object | The indices configuration for filtering which indexes are processed. Defaults to all indices, including system indices. For more information, see [Indices](#indices).
`scheduling` | No | Object | The scheduling configuration. For more information, see [Scheduling](#scheduling).
`search_options` | No | Object | The search options configuration. For more information, see [Search options](#search_options).

### Scheduling

The `scheduling` configuration allows the user to configure how indexes are reprocessed in the source based on the `index_read_count` and the reprocessing `interval`.
For example, setting `index_read_count` to `3` with an `interval` of `1h` will result in all indices being processed three times, one hour apart. By default, indexes will only be processed once.
+The `scheduling` configuration allows the user to configure how indexes are reprocessed in the source based on the the `index_read_count` and recount time `interval`. +For example, setting `index_read_count` to `3` with an `interval` of `1h` will result in all indices being processed three times, one hour apart. By default, indexes will only be processed once. Option | Required | Type | Description :--- | :--- |:----------------| :--- -`index_read_count` | No | Integer | The number of times each index will be processed. Default to 1. -`interval` | No | String | The interval to reprocess indices. Supports ISO_8601 notation Strings ("PT20.345S", "PT15M", etc.) as well as simple notation Strings for seconds ("60s") and milliseconds ("1500ms"). Defaults to 8 hours. +`index_read_count` | No | Integer | The number of times each index will be processed. Default is `1`. +`interval` | No | String | The interval to reprocess indexes. Supports ISO_8601 notation Strings ("PT20.345S", "PT15M", etc.) as well as simple notation Strings for seconds ("60s") and milliseconds ("1500ms"). Defaults to `8h`. `start_time` | No | String | The instant of time when processing should begin. The source will not start processing until this instant is reached. The String must be in ISO-8601 format, such as `2007-12-03T10:15:30.00Z`. Defaults to starting processing immediately. -#### Indices -### -The below options allow filtering which indices are processed from the source cluster via regex patterns. An index will only -be processed if it matches one of the `index_name_regex` patterns under `include` and does not match any of the -patterns under `exclude`. +### indices -Option | Required | Type | Description -:--- | :--- |:-----------------| :--- -`include` | No | Array of Objects | A List of [Index Configuration](#index_configuration) that specifies which indices will be processed. 
`exclude` | No | Array of Objects | A list of index configuration patterns that specifies which indexes will not be processed. For example, you can specify an `index_name_regex` pattern of `\..*` to exclude system indices.

Use the following setting under the `include` and `exclude` options to indicate the regex pattern for the index.

Option | Required | Type | Description
:--- |:----|:-----------------| :---
`index_name_regex` | Yes | Regex String | The regex pattern to match indexes against.

### search_options

Use the following settings under the `search_options` configuration.

Option | Required | Type | Description
:--- |:---------|:--------| :---
`batch_size` | No | Integer | The number of documents to read at a time while paginating from OpenSearch. Default is `1000`.
`search_context_type` | No | Enum | An override for the type of search/pagination to use on indexes. Can be [point_in_time]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#point-in-time-with-search_after), [scroll]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#scroll-search), or `none`. The `none` option will use the [search_after]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#the-search_after-parameter) parameter. For more information, see [Default search behavior](#default_search_behavior).

### Default search behavior

By default, the `opensearch` source will look up the cluster version and distribution to determine
which `search_context_type` to use. For versions and distributions that support [Point in Time](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#point-in-time-with-search_after), `point_in_time` will be used.
If `point_in_time` is not supported by the cluster, then [scroll](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#scroll-search) will be used.
For Amazon OpenSearch Serverless collections, [search_after](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#the-search_after-parameter)
will be used because neither `point_in_time` nor `scroll` is supported by collections.

### Connection

Use the following settings under the `connection` configuration.

Option | Required | Type | Description
:--- | :--- |:--------| :---
`cert` | No | String | The path to the security certificate, for example, `"config/root-ca.pem"`, when the cluster uses the OpenSearch Security plugin.
`insecure` | No | Boolean | Whether or not to verify SSL certificates. If set to `true`, certificate authority (CA) certificate verification is disabled and insecure HTTP requests are sent. Default is `false`.

### AWS

Use the following options when setting up authentication for `aws` services.

Option | Required | Type | Description
:--- | :--- |:--------| :---
`region` | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).
`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon OpenSearch Service and Amazon OpenSearch Serverless. Default is `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
`serverless` | No | Boolean | Should be set to true when processing from an Amazon OpenSearch Serverless collection. Defaults to false.

## OpenSearch cluster security

In order to pull data from an OpenSearch cluster using the `opensearch` source plugin, you must specify your username and password within the pipeline configuration.
The following example `pipelines.yaml` file demonstrates how to specify the default admin security credentials:

```yaml
source:
  opensearch:
    username: "admin"
    password: "admin"
    ...
```

### Amazon OpenSearch Service domain security

The `opensearch` source plugin can pull data from an [Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html) domain, which uses IAM for security. The plugin uses the default Amazon OpenSearch Service credential chain. Run `aws configure` using the [AWS Command Line Interface (AWS CLI)](https://aws.amazon.com/cli/) to set your credentials.

Make sure the credentials that you configure have the required IAM permissions. The following domain access policy shows the minimum required permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam:::user/data-prepper-user"
      },
      "Action": "es:ESHttpGet",
      "Resource": [
        "arn:aws:es:us-east-1::domain//",
        "arn:aws:es:us-east-1::domain//_cat/indices",
        "arn:aws:es:us-east-1::domain//_search",
        "arn:aws:es:us-east-1::domain//_search/scroll",
        "arn:aws:es:us-east-1::domain//*/_search"
      ]
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam:::user/data-prepper-user"
      },
      "Action": "es:ESHttpPost",
      "Resource": [
        "arn:aws:es:us-east-1::domain//*/_search/point_in_time",
        "arn:aws:es:us-east-1::domain//*/_search/scroll"
      ]
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam:::user/data-prepper-user"
      },
      "Action": "es:ESHttpDelete",
      "Resource": [
        "arn:aws:es:us-east-1::domain//_search/point_in_time",
        "arn:aws:es:us-east-1::domain//_search/scroll"
      ]
    }
  ]
}
```

For instructions on how to configure the domain access policy, see [Resource-based policies](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) in the Amazon OpenSearch Service documentation.
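To see why the policy above needs three separate statements, it helps to map the source's read operations to the IAM actions they require. The following is an illustrative sketch only (the dictionary, function, and operation names are hypothetical and not part of Data Prepper or the AWS SDK):

```python
# Hypothetical mapping of the source's read operations to the IAM
# actions granted in the domain access policy above.
REQUIRED_ACTIONS = {
    "list_indices": "es:ESHttpGet",             # GET /_cat/indices
    "search": "es:ESHttpGet",                   # GET /<index>/_search
    "open_pit_or_scroll": "es:ESHttpPost",      # POST /_search/point_in_time, /_search/scroll
    "close_pit_or_scroll": "es:ESHttpDelete",   # DELETE /_search/point_in_time, /_search/scroll
}

def actions_needed(operations):
    """Return the distinct IAM actions a set of operations requires."""
    return sorted({REQUIRED_ACTIONS[op] for op in operations})

print(actions_needed(["list_indices", "search", "open_pit_or_scroll"]))
# ['es:ESHttpGet', 'es:ESHttpPost']
```

In other words, listing and searching use `es:ESHttpGet`, opening a point-in-time or scroll context uses `es:ESHttpPost`, and cleaning those contexts up uses `es:ESHttpDelete`, so all three actions belong in the policy.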
-## OpenSearch Serverless collection security
+### OpenSearch Serverless collection security

The `opensearch` source plugin can receive data from an [Amazon OpenSearch Serverless](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html) collection.

You cannot read from a collection that uses virtual private cloud (VPC) access. The collection must be accessible from public networks.
{: .warning}

#### Creating a pipeline role

To use OpenSearch Serverless collection security, create an IAM role that the pipeline will assume in order to read from the collection. The role must have the following minimum permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "aoss:APIAccessAll"
      ],
      "Resource": "arn:aws:aoss:*::collection/*"
    }
  ]
}
```

#### Creating a collection

Next, create a collection with the following settings:

- Public [network access](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-network.html) to both the OpenSearch endpoint and OpenSearch Dashboards.
-- The following [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html), which grants the required permissions to the pipeline role: +- The following [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html), which grants the required permissions to the pipeline role, as shown in the following configuration: ```json [ @@ -295,13 +292,14 @@ Next, create a collection with the following settings: ] ``` - ***Important***: Make sure to replace the ARN in the `Principal` element with the ARN of the pipeline role that you created in the preceding step. +Make sure to replace the ARN in the `Principal` element with the ARN of the pipeline role that you created in the preceding step. +{: .tip} - For instructions on how to create collections, see [Creating collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-manage.html#serverless-create) in the Amazon OpenSearch Service documentation. + For instructions on how to create collections, see [Creating collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-manage.html#serverless-create) in the Amazon OpenSearch Service documentation. -### Creating a pipeline +#### Creating a pipeline -Within your `pipelines.yaml` file, specify the OpenSearch Serverless collection endpoint as the `hosts` option. In addition, you must set the `serverless` option to `true`. Specify the pipeline role in the `sts_role_arn` option: +Within your `pipelines.yaml` file, specify the OpenSearch Serverless collection endpoint as the `hosts` option. In addition, you must set the `serverless` option to `true`. 
Specify the pipeline role in the `sts_role_arn` option, as shown in the following example: ```yaml opensearch-source-pipeline: From e4f9c58180affa71e83f699d6f2bac06355e5eaa Mon Sep 17 00:00:00 2001 From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Date: Wed, 11 Oct 2023 12:16:35 -0500 Subject: [PATCH 04/12] Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --- .../configuration/sources/opensearch.md | 41 ++++++++++--------- 1 file changed, 21 insertions(+), 20 deletions(-) diff --git a/_data-prepper/pipelines/configuration/sources/opensearch.md b/_data-prepper/pipelines/configuration/sources/opensearch.md index 453a2f5ee8..7bf0e376c4 100644 --- a/_data-prepper/pipelines/configuration/sources/opensearch.md +++ b/_data-prepper/pipelines/configuration/sources/opensearch.md @@ -8,7 +8,7 @@ nav_order: 30 # opensearch -The `opensearch` source plugin to read indexes from an OpenSearch cluster, a legacy Elasticsearch cluster, an Amazon OpenSearch Service domain, or an Amazon OpenSearch Serverless collection. +The `opensearch` source plugin is used to read indexes from an OpenSearch cluster, a legacy Elasticsearch cluster, an Amazon OpenSearch Service domain, or an Amazon OpenSearch Serverless collection. The plugin supports OpenSearch 2.x and Elasticsearch 7.x. @@ -26,7 +26,7 @@ opensearch-source-pipeline: ... 
```

-To use the `opensearch` source with all configuration settings, including `indices`, `scheduling`, `search_options`, and `connection`, add the following exampe to your `pipeline.yaml` file:
+To use the `opensearch` source with all configuration settings, including `indices`, `scheduling`, `search_options`, and `connection`, add the following example to your `pipeline.yaml` file:

```yaml
opensearch-source-pipeline:
@@ -55,7 +55,7 @@ opensearch-source-pipeline:

 ## Amazon OpenSearch Service

-The `opensearch` source can be configured for an Amazon OpenSearch service domain by passing an `sts_role_arn` with access to the domain, as shown in the following example:
+The `opensearch` source can be configured for an Amazon OpenSearch Service domain by passing an `sts_role_arn` with access to the domain, as shown in the following example:

```yaml
opensearch-source-pipeline:
@@ -68,7 +68,7 @@ opensearch-source-pipeline:
 ...
```

-## Using Metadata
+## Using metadata

 When the `opensearch` source constructs Data Prepper events from documents in the cluster, the document index is stored in the EventMetadata with an `opensearch-index` key, and the document_id is stored in the `EventMetadata` with the `opensearch-document_id` as the key. This allows for conditional routing based on the index or `document_id`. The following example sends events to an `opensearch` sink and uses the same index and `document_id` from the source cluster as in the destination cluster:

```yaml
@@ -103,9 +103,9 @@ Option | Required | Type | Description
`aws` | No | Object | The AWS configuration. For more information, see [aws](#aws).
`acknowledgments` | No | Boolean | When `true`, enables the `opensearch` source to receive [end-to-end acknowledgments]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/#end-to-end-acknowledgments) when events are received by OpenSearch sinks. Default is `false`.
`connection` | No | Object | The connection configuration.
For more information, see [Connection](#connection). -`indices` | No | Object | The indices configuration for filtering which indexes are processed. Defaults to all indices, including system indices. For more information, see [Indices](#indices). +`indices` | No | Object | The configuration for filtering which indexes are processed. Defaults to all indexes, including system indexes. For more information, see [Indices](#indices). `scheduling` | No | Object | The scheduling configuration. For more information, see [Scheduling](#scheduling). -`search_options` | No | Object | The search options configuration. For more information, see [Search options](#search_options). +`search_options` | No | Object | A list of search options performed by the source. For more information, see [Search options](#search_options). ### Scheduling @@ -115,20 +115,21 @@ For example, setting `index_read_count` to `3` with an `interval` of `1h` will r Option | Required | Type | Description :--- | :--- |:----------------| :--- `index_read_count` | No | Integer | The number of times each index will be processed. Default is `1`. -`interval` | No | String | The interval to reprocess indexes. Supports ISO_8601 notation Strings ("PT20.345S", "PT15M", etc.) as well as simple notation Strings for seconds ("60s") and milliseconds ("1500ms"). Defaults to `8h`. +`interval` | No | String | The interval in which indexes are reprocessed. Supports ISO_8601 notation strings, such as "PT20.345S" or "PT15M", as well as simple notation strings for seconds ("60s") and milliseconds ("1500ms"). Defaults to `8h`. `start_time` | No | String | The instant of time when processing should begin. The source will not start processing until this instant is reached. The String must be in ISO-8601 format, such as `2007-12-03T10:15:30.00Z`. Defaults to starting processing immediately. ### indices -The following options help the `opensearch` source which indexes are processed from the source cluster using regex patterns. 
An index will only be processed if it matches one of the `index_name_regex` patterns under the `include` setting and does not match any of the -patterns under `exclude` setting. +The following options help the `opensearch` source determine which indexes are processed from the source cluster using regex patterns. An index will only be processed if it matches one of the `index_name_regex` patterns under the `include` setting and does not match any of the +patterns under the `exclude` setting. Option | Required | Type | Description :--- | :--- |:-----------------| :--- `include` | No | Array of Objects | A list of index configuration patterns that specifies which indexes will be processed. `exclude` | No | Array of Objects | A list of index configuration patterns that specifies which indexes will not be processed. For example, you can specify an `index_name_regex` pattern of `\..*` to exclude system indices. + Use the following setting under the `include` and `exclude` options to indicate the regex pattern for the index. Option | Required | Type | Description :--- |:----|:-----------------| :--- @@ -140,15 +141,15 @@ Use the following settings under the `search_options` configuration. Option | Required | Type | Description :--- |:---------|:--------| :--- -`batch_size` | No | Integer | The number of documents to read at a time while paginating from OpenSearch. Default is `1000` -`search_context_type` | No | Enum | An override for the type of search/pagination to use on indices. Can be [point_in_time]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#point-in-time-with-search_after)), [scroll]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#scroll-search), or `none`. The `none` option will use the [search_after]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#the-search_after-parameter) parameter. For more information, see [Default Search Behavior](#default_search_behavior) for default behavior. 
+`batch_size` | No | Integer | The number of documents to read while paginating from OpenSearch. Default is `1000`. +`search_context_type` | No | Enum | An override for the type of search/pagination to use on indexes. Can be [point_in_time]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#point-in-time-with-search_after)), [scroll]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#scroll-search), or `none`. The `none` option will use the [search_after]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#the-search_after-parameter) parameter. For more information, see [Default Search Behavior](#default_search_behavior). ### Default search behavior -By default, the `opensearch` source will lookip the cluster version and distribution to determine +By default, the `opensearch` source will look up the cluster version and distribution to determine which `search_context_type` to use. For versions and distributions that support [Point in Time](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#point-in-time-with-search_after), `point_in_time` will be used. If `point_in_time` is not supported by the cluster, then [scroll](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#scroll-search) will be used. For Amazon OpenSearch Serverless collections, [search_after](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#the-search_after-parameter) -will be used because neither `point_in_time` or `scroll` are supported by collections. +will be used because neither `point_in_time` nor `scroll` are supported by collections. ### Connection @@ -156,7 +157,7 @@ Use the following settings under the `connection` configuration. Option | Required | Type | Description :--- | :--- |:--------| :--- -`cert` | No | String | The path to the security certificate, for example `"config/root-ca.pem"`, when the cluster uses the OpenSearch Security plugin. 
+`cert` | No | String | The path to the security certificate, for example, `"config/root-ca.pem"`, when the cluster uses the OpenSearch Security plugin. `insecure` | No | Boolean | Whether or not to verify SSL certificates. If set to `true`, the certificate authority (CA) certificate verification is disabled and insecure HTTP requests are sent. Default is `false`. @@ -168,12 +169,12 @@ Option | Required | Type | Description :--- | :--- |:--------| :--- `region` | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html). `sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon OpenSearch Service and Amazon OpenSearch Serverless. Default is `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). -`serverless` | No | Boolean | Should be set to true when processing from an Amazon OpenSearch Serverless collection. Defaults to false. +`serverless` | No | Boolean | Should be set to `true` when processing from an Amazon OpenSearch Serverless collection. Defaults to `false`. ## OpenSearch cluster security -In order to pull data from an OpenSearch cluster using the `opensearch` source plugin, you must specify your username and password within the pipeline configuration. The following example `pipelines.yaml` file demonstrates how to specify the default admin security credentials: +In order to pull data from an OpenSearch cluster using the `opensearch` source plugin, you must specify your username and password within the pipeline configuration. 
The following example `pipeline.yaml` file demonstrates how to specify the default admin security credentials: ```yaml source: @@ -185,7 +186,7 @@ source: ### Amazon OpenSearch Service domain security -The `opensearch` source plugin can pull data from an [Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html) domain, which uses IAM for security. The plugin uses the default credential Amazon OpenSearch Service credential chain. Run `aws configure` using the [AWS Command Line Interface (AWS CLI)](https://aws.amazon.com/cli/) to set your credentials. +The `opensearch` source plugin can pull data from an [Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html) domain, which uses AWS Identity and Access Management (IAM) for security. The plugin uses the default Amazon OpenSearch Service credential chain. Run `aws configure` using the [AWS Command Line Interface (AWS CLI)](https://aws.amazon.com/cli/) to set your credentials. Make sure the credentials that you configure have the required IAM permissions. The following domain access policy shows the minimum required permissions: @@ -240,7 +241,7 @@ For instructions on how to configure the domain access policy, see [Resource-bas The `opensearch` source plugin can receive data from an [Amazon OpenSearch Serverless](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html) collection. -You can not read from a collection that uses virtual private cloud (VPC) access. The collection must be accessible from public networks. +You cannot read from a collection that uses virtual private cloud (VPC) access. The collection must be accessible from public networks. 
{: .warning} #### Creating a pipeline role @@ -295,11 +296,11 @@ Next, create a collection with the following settings: Make sure to replace the ARN in the `Principal` element with the ARN of the pipeline role that you created in the preceding step. {: .tip} - For instructions on how to create collections, see [Creating collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-manage.html#serverless-create) in the Amazon OpenSearch Service documentation. +For instructions on how to create collections, see [Creating collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-manage.html#serverless-create) in the Amazon OpenSearch Service documentation. #### Creating a pipeline -Within your `pipelines.yaml` file, specify the OpenSearch Serverless collection endpoint as the `hosts` option. In addition, you must set the `serverless` option to `true`. Specify the pipeline role in the `sts_role_arn` option, as shown in the following example: +Within your `pipeline.yaml` file, specify the OpenSearch Serverless collection endpoint as the `hosts` option. In addition, you must set the `serverless` option to `true`. 
Specify the pipeline role in the `sts_role_arn` option, as shown in the following example: ```yaml opensearch-source-pipeline: From b56fcc0f9aa26d3bdc2e8311b9ed5d0779647045 Mon Sep 17 00:00:00 2001 From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Date: Wed, 11 Oct 2023 12:37:07 -0500 Subject: [PATCH 05/12] Update _data-prepper/pipelines/configuration/sources/opensearch.md Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --- _data-prepper/pipelines/configuration/sources/opensearch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sources/opensearch.md b/_data-prepper/pipelines/configuration/sources/opensearch.md index 7bf0e376c4..fc81922d20 100644 --- a/_data-prepper/pipelines/configuration/sources/opensearch.md +++ b/_data-prepper/pipelines/configuration/sources/opensearch.md @@ -293,7 +293,7 @@ Next, create a collection with the following settings: ] ``` -Make sure to replace the ARN in the `Principal` element with the ARN of the pipeline role that you created in the preceding step. +Make sure to replace the Amazon Resource Name (ARN) in the `Principal` element with the ARN of the pipeline role that you created in the preceding step. {: .tip} For instructions on how to create collections, see [Creating collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-manage.html#serverless-create) in the Amazon OpenSearch Service documentation. 
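Pulling the serverless steps above together, the following is a minimal sketch of the resulting source configuration. It assumes the `serverless` flag sits under the `aws` block, as the `aws` options table groups it; the collection endpoint, Region, and role ARN are placeholders that must be replaced with your own values:

```yaml
opensearch-source-pipeline:
  source:
    opensearch:
      # Placeholder OpenSearch Serverless collection endpoint
      hosts: [ "https://1234567890abcdef.us-east-1.aoss.amazonaws.com" ]
      aws:
        region: "us-east-1"
        # Placeholder ARN of the pipeline role created in the preceding step
        sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
        # Required when reading from an OpenSearch Serverless collection
        serverless: true
```

Because collections support neither `point_in_time` nor `scroll`, this source will fall back to `search_after` pagination, as described in the default search behavior section.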
From 7490514fc575bd4a79d319a467795085b8cf9df5 Mon Sep 17 00:00:00 2001 From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Date: Wed, 11 Oct 2023 12:37:14 -0500 Subject: [PATCH 06/12] Update _data-prepper/pipelines/configuration/sources/opensearch.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --- _data-prepper/pipelines/configuration/sources/opensearch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sources/opensearch.md b/_data-prepper/pipelines/configuration/sources/opensearch.md index fc81922d20..e2407aef2b 100644 --- a/_data-prepper/pipelines/configuration/sources/opensearch.md +++ b/_data-prepper/pipelines/configuration/sources/opensearch.md @@ -133,7 +133,7 @@ Option | Required | Type | Description Use the following setting under the `include` and `exclude` options to indicate the regex pattern for the index. Option | Required | Type | Description :--- |:----|:-----------------| :--- -`index_name_regex` | Yes | Regex String | The regex pattern to match indexes against. +`index_name_regex` | Yes | Regex string | The regex pattern to match indexes against. 
### search_options

From 8eb75a9077801f3d7fcad03414fc25b9b61b369 Mon Sep 17 00:00:00 2001
From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Date: Wed, 11 Oct 2023 12:40:35 -0500
Subject: [PATCH 07/12] Apply suggestions from code review

Co-authored-by: Nathan Bower
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
---
 .../pipelines/configuration/sources/opensearch.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/_data-prepper/pipelines/configuration/sources/opensearch.md b/_data-prepper/pipelines/configuration/sources/opensearch.md
index e2407aef2b..1291f9ea6b 100644
--- a/_data-prepper/pipelines/configuration/sources/opensearch.md
+++ b/_data-prepper/pipelines/configuration/sources/opensearch.md
@@ -70,7 +70,7 @@ opensearch-source-pipeline:

 ## Using metadata

-When the `opensearch` source constructs Data Prepper events from documents in the cluster, the document index is stored in the EventMetadata with an `opensearch-index` key, and the document_id is stored in the `EventMetadata` with the `opensearch-document_id` as the key. This allows for conditional routing based on the index or `document_id`.
The following example pipeline configuration sends events to an `opensearch` sink and uses the same index and `document_id` from the source cluster as in the destination cluster: ```yaml @@ -110,13 +110,15 @@ Option | Required | Type | Description ### Scheduling The `scheduling` configuration allows the user to configure how indexes are reprocessed in the source based on the the `index_read_count` and recount time `interval`. -For example, setting `index_read_count` to `3` with an `interval` of `1h` will result in all indices being processed three times, one hour apart. By default, indexes will only be processed once. +For example, setting `index_read_count` to `3` with an `interval` of `1h` will result in all indexes being processed 3 times, 1 hour apart. By default, indexes will only be processed once. + +Use the following options under the `scheduling` configuration. Option | Required | Type | Description :--- | :--- |:----------------| :--- `index_read_count` | No | Integer | The number of times each index will be processed. Default is `1`. `interval` | No | String | The interval in which indexes are reprocessed. Supports ISO_8601 notation strings, such as "PT20.345S" or "PT15M", as well as simple notation strings for seconds ("60s") and milliseconds ("1500ms"). Defaults to `8h`. -`start_time` | No | String | The instant of time when processing should begin. The source will not start processing until this instant is reached. The String must be in ISO-8601 format, such as `2007-12-03T10:15:30.00Z`. Defaults to starting processing immediately. +`start_time` | No | String | The time when processing should begin. The source will not start processing until this instant is reached. The String must be in ISO-8601 format, such as `2007-12-03T10:15:30.00Z`. The default option starts processing immediately. ### indices @@ -126,8 +128,8 @@ patterns under the `exclude` setting. 
Option | Required | Type | Description
:--- | :--- |:-----------------| :---
-`include` | No | Array of Objects | A list of index configuration patterns that specifies which indexes will be processed.
-`exclude` | No | Array of Objects | A list of index configuration patterns that specifies which indexes will not be processed. For example, you can specify an `index_name_regex` pattern of `\..*` to exclude system indices.
+`include` | No | Array of objects | A list of index configuration patterns that specifies which indexes will be processed.
+`exclude` | No | Array of objects | A list of index configuration patterns that specifies which indexes will not be processed. For example, you can specify an `index_name_regex` pattern of `\..*` to exclude system indexes.

 Use the following setting under the `include` and `exclude` options to indicate the regex pattern for the index.

From 79eaf066582b7db4765b61a17596a12a701fc8c4 Mon Sep 17 00:00:00 2001
From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Date: Wed, 11 Oct 2023 12:42:13 -0500
Subject: [PATCH 08/12] Update opensearch.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
---
 _data-prepper/pipelines/configuration/sources/opensearch.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/_data-prepper/pipelines/configuration/sources/opensearch.md b/_data-prepper/pipelines/configuration/sources/opensearch.md
index 1291f9ea6b..6864c85613 100644
--- a/_data-prepper/pipelines/configuration/sources/opensearch.md
+++ b/_data-prepper/pipelines/configuration/sources/opensearch.md
@@ -110,6 +110,7 @@ Option | Required | Type | Description
 ### Scheduling

 The `scheduling` configuration allows the user to configure how indexes are reprocessed in the source based on the `index_read_count` and recount time `interval`.
+
 For example, setting `index_read_count` to `3` with an `interval` of `1h` will result in all indexes being processed 3 times, 1 hour apart.
By default, indexes will only be processed once. Use the following options under the `scheduling` configuration. @@ -133,6 +134,7 @@ Option | Required | Type | Description Use the following setting under the `include` and `exclude` options to indicate the regex pattern for the index. + Option | Required | Type | Description :--- |:----|:-----------------| :--- `index_name_regex` | Yes | Regex string | The regex pattern to match indexes against. @@ -150,8 +152,7 @@ Option | Required | Type | Description By default, the `opensearch` source will look up the cluster version and distribution to determine which `search_context_type` to use. For versions and distributions that support [Point in Time](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#point-in-time-with-search_after), `point_in_time` will be used. -If `point_in_time` is not supported by the cluster, then [scroll](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#scroll-search) will be used. For Amazon OpenSearch Serverless collections, [search_after](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#the-search_after-parameter) -will be used because neither `point_in_time` nor `scroll` are supported by collections. +If `point_in_time` is not supported by the cluster, then [scroll](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#scroll-search) will be used. For Amazon OpenSearch Serverless collections, [search_after](https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#the-search_after-parameter) will be used because neither `point_in_time` nor `scroll` are supported by collections. 
### Connection From a74afa67e0372d561315843f8a648d0b1a33c384 Mon Sep 17 00:00:00 2001 From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Date: Wed, 11 Oct 2023 12:54:12 -0500 Subject: [PATCH 09/12] Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --- _data-prepper/pipelines/configuration/sources/opensearch.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_data-prepper/pipelines/configuration/sources/opensearch.md b/_data-prepper/pipelines/configuration/sources/opensearch.md index 6864c85613..bafbf56d56 100644 --- a/_data-prepper/pipelines/configuration/sources/opensearch.md +++ b/_data-prepper/pipelines/configuration/sources/opensearch.md @@ -118,8 +118,8 @@ Use the following options under the `scheduling` configuration. Option | Required | Type | Description :--- | :--- |:----------------| :--- `index_read_count` | No | Integer | The number of times each index will be processed. Default is `1`. -`interval` | No | String | The interval in which indexes are reprocessed. Supports ISO_8601 notation strings, such as "PT20.345S" or "PT15M", as well as simple notation strings for seconds ("60s") and milliseconds ("1500ms"). Defaults to `8h`. -`start_time` | No | String | The time when processing should begin. The source will not start processing until this instant is reached. The String must be in ISO-8601 format, such as `2007-12-03T10:15:30.00Z`. The default option starts processing immediately. +`interval` | No | String | The interval that determines the amount of time between reprocessing. Supports ISO 8601 notation strings, such as "PT20.345S" or "PT15M", as well as simple notation strings for seconds ("60s") and milliseconds ("1500ms"). Defaults to `8h`. +`start_time` | No | String | The time when processing should begin. The source will not start processing until this instant is reached. The string must be in ISO 8601 format, such as `2007-12-03T10:15:30.00Z`. 
The default option starts processing immediately. ### indices From 19e5a4e1d6993988f167025b2149d7bd07bbd4de Mon Sep 17 00:00:00 2001 From: Taylor Gray Date: Wed, 11 Oct 2023 13:21:52 -0500 Subject: [PATCH 10/12] Update _data-prepper/pipelines/configuration/sources/opensearch.md Co-authored-by: Nathan Bower Signed-off-by: Taylor Gray --- _data-prepper/pipelines/configuration/sources/opensearch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sources/opensearch.md b/_data-prepper/pipelines/configuration/sources/opensearch.md index bafbf56d56..5b2aaaf0e4 100644 --- a/_data-prepper/pipelines/configuration/sources/opensearch.md +++ b/_data-prepper/pipelines/configuration/sources/opensearch.md @@ -119,7 +119,7 @@ Option | Required | Type | Description :--- | :--- |:----------------| :--- `index_read_count` | No | Integer | The number of times each index will be processed. Default is `1`. `interval` | No | String | The interval that determines the amount of time between reprocessing. Supports ISO 8601 notation strings, such as "PT20.345S" or "PT15M", as well as simple notation strings for seconds ("60s") and milliseconds ("1500ms"). Defaults to `8h`. -`start_time` | No | String | The time when processing should begin. The source will not start processing until this instant is reached. The string must be in ISO 8601 format, such as `2007-12-03T10:15:30.00Z`. The default option starts processing immediately. +`start_time` | No | String | The time when processing should begin. The source will not start processing until this time. The string must be in ISO 8601 format, such as `2007-12-03T10:15:30.00Z`. The default option starts processing immediately. 
### indices From abef256a3c44c97782983677554dae4d652b636d Mon Sep 17 00:00:00 2001 From: Taylor Gray Date: Wed, 11 Oct 2023 13:22:58 -0500 Subject: [PATCH 11/12] Update _data-prepper/pipelines/configuration/sources/opensearch.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: Taylor Gray --- _data-prepper/pipelines/configuration/sources/opensearch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sources/opensearch.md b/_data-prepper/pipelines/configuration/sources/opensearch.md index 5b2aaaf0e4..e3a09a45ff 100644 --- a/_data-prepper/pipelines/configuration/sources/opensearch.md +++ b/_data-prepper/pipelines/configuration/sources/opensearch.md @@ -111,7 +111,7 @@ Option | Required | Type | Description The `scheduling` configuration allows the user to configure how indexes are reprocessed in the source based on the the `index_read_count` and recount time `interval`. -For example, setting `index_read_count` to `3` with an `interval` of `1h` will result in all indexes being processed 3 times, 1 hour apart. By default, indexes will only be processed once. +For example, setting `index_read_count` to `3` with an `interval` of `1h` will result in all indexes being reprocessed 3 times, 1 hour apart. By default, indexes will only be processed once. Use the following options under the `scheduling` configuration. 
From 6174965fae7fc2caf0ba69af0a2c622fc9f4d050 Mon Sep 17 00:00:00 2001
From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Date: Wed, 11 Oct 2023 13:44:14 -0500
Subject: [PATCH 12/12] Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
---
 _data-prepper/pipelines/configuration/sources/opensearch.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_data-prepper/pipelines/configuration/sources/opensearch.md b/_data-prepper/pipelines/configuration/sources/opensearch.md
index e3a09a45ff..faa5b0b68b 100644
--- a/_data-prepper/pipelines/configuration/sources/opensearch.md
+++ b/_data-prepper/pipelines/configuration/sources/opensearch.md
@@ -146,7 +146,7 @@ Use the following settings under the `search_options` configuration.
 Option | Required | Type | Description
 :--- |:---------|:--------| :---
 `batch_size` | No | Integer | The number of documents to read while paginating from OpenSearch. Default is `1000`.
-`search_context_type` | No | Enum | An override for the type of search/pagination to use on indexes. Can be [point_in_time]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#point-in-time-with-search_after)), [scroll]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#scroll-search), or `none`. The `none` option will use the [search_after]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#the-search_after-parameter) parameter. For more information, see [Default Search Behavior](#default_search_behavior).
+`search_context_type` | No | Enum | An override for the type of search/pagination to use on indexes. Can be [point_in_time]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#point-in-time-with-search_after), [scroll]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#scroll-search), or `none`.
The `none` option will use the [search_after]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/paginate/#the-search_after-parameter) parameter. For more information, see [Default search behavior](#default-search-behavior).

### Default search behavior