Commit

Behavioral changes for Data Prepper S3 sink (#4897)
* Updates the Data Prepper documentation for S3 sinks based on recent behavior changes.

Signed-off-by: David Venable <dlv@amazon.com>

* Updates from the PR feedback.

Signed-off-by: David Venable <dlv@amazon.com>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

---------

Signed-off-by: David Venable <dlv@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
dlvenable and Naarcha-AWS authored Aug 29, 2023
1 parent dc21de0 commit 64d59b9
Showing 2 changed files with 19 additions and 9 deletions.
24 changes: 17 additions & 7 deletions _data-prepper/pipelines/configuration/sinks/s3.md
@@ -98,10 +98,21 @@ The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) d

Because Avro requires a schema, you can either define the schema yourself or let Data Prepper generate one automatically.
In general, you should define your own schema because it will most accurately reflect your needs.

We recommend that you make your Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions).
Without the null union, each field must be present or the data will fail to write to the sink.
If you can be certain that each event has a given field, you can make it non-nullable.
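
As a rough illustration of the null union recommendation, the following Python sketch declares a hypothetical `LogEvent` record with one required field and one nullable field, then checks which events could be written. The record name, the field names, and the `can_write` helper are assumptions made for this example. The helper only mimics the rule described above and is not Data Prepper's validation logic.

```python
import json

# Hypothetical schema: "message" is required, "status" uses a null union.
SCHEMA = {
    "type": "record",
    "name": "LogEvent",
    "fields": [
        {"name": "message", "type": "string"},
        {"name": "status", "type": ["null", "int"], "default": None},
    ],
}


def is_nullable(field):
    """Return True when the field type is a union that includes "null"."""
    field_type = field["type"]
    return isinstance(field_type, list) and "null" in field_type


def can_write(event, schema=SCHEMA):
    """Return True when every non-nullable field is present in the event."""
    return all(
        is_nullable(field) or event.get(field["name"]) is not None
        for field in schema["fields"]
    )


# Avro schemas are written as JSON; this is the serialized form of SCHEMA.
SCHEMA_JSON = json.dumps(SCHEMA, indent=2)
```

In this sketch, an event that lacks `status` is still writable, while an event that lacks `message` is not.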

When you provide your own Avro schema, that schema defines the final structure of your data.
Therefore, any extra values in incoming events that are not mapped in the Avro schema will not be included in the final destination.
To avoid confusion between a custom Avro schema and the `include_keys` or `exclude_keys` sink configurations, Data Prepper does not allow the use of `include_keys` or `exclude_keys` with a custom schema.
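
The effect of a custom schema on extra values can be pictured with a small Python sketch. The field names and the `project` helper are hypothetical, and filling absent keys with `None` assumes those fields are declared as nullable. This is only a model of the behavior described above, not the codec's actual code.

```python
# Hypothetical field names declared by a custom schema.
SCHEMA_FIELDS = ["message", "status"]


def project(event, field_names=SCHEMA_FIELDS):
    """Keep only keys the schema declares; absent keys become None."""
    return {name: event.get(name) for name in field_names}
```

For example, projecting an event that also carries a `debug` key leaves only `message` and `status` in the result.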

In cases where your data is uniform, you may be able to automatically generate a schema.
Automatically generated schemas are based on the first event received by the codec.
The schema contains only the keys from this event.
Therefore, every key must be present in every event for the automatically generated schema to work.
Automatically generated schemas make all fields nullable.
Use the sink's `include_keys` and `exclude_keys` configurations to control what data is included in the auto-generated schema.
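
A minimal Python sketch of the idea follows: it derives a record schema from the first event, makes every field nullable, and applies optional include and exclude lists. The function names, the `AutoEvent` record name, and the value-to-type mapping are assumptions for illustration. Data Prepper's real type inference may differ.

```python
def infer_avro_type(value):
    """Map a Python value to a simple Avro primitive type name (assumed mapping)."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def auto_schema(first_event, include_keys=None, exclude_keys=None):
    """Build a record schema from the first event, with every field nullable."""
    excluded = set(exclude_keys or [])
    fields = []
    for key, value in first_event.items():
        if include_keys is not None and key not in include_keys:
            continue
        if key in excluded:
            continue
        fields.append(
            {"name": key, "type": ["null", infer_avro_type(value)], "default": None}
        )
    return {"type": "record", "name": "AutoEvent", "fields": fields}
```

Because only the first event is inspected, a key that first appears in a later event has no field in the resulting schema.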


Option | Required | Type | Description
@@ -131,14 +142,13 @@ Option | Required | Type | Description
### parquet codec

The `parquet` codec writes events into a Parquet file.
When using the Parquet codec, set the `buffer_type` to `in_memory`. This replaces the earlier requirement to set `buffer_type` to `multipart`.

The Parquet codec writes data using an Avro schema.
Because the Parquet codec requires an Avro schema, you can either define the schema yourself or let Data Prepper generate one automatically.
We generally recommend that you define your own schema so that it best meets your needs.

For details on the Avro schema and recommendations, see the [Avro codec](#avro-codec) documentation.


Option | Required | Type | Description
4 changes: 2 additions & 2 deletions _data-prepper/pipelines/configuration/sinks/sinks.md
@@ -18,5 +18,5 @@ Option | Required | Type | Description
:--- | :--- |:------------| :---
routes | No | String list | A list of routes for which this sink applies. If not provided, this sink receives all events. See [conditional routing]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#conditional-routing) for more information.
tags_target_key | No | String | When specified, includes event tags in the output of the provided key.
include_keys | No | String list | When specified, provides the keys in this list in the data sent to the sink. Some codecs and sinks do not allow use of this field.
exclude_keys | No | String list | When specified, excludes the keys given from the data sent to the sink. Some codecs and sinks do not allow use of this field.
