From b3cf7a46f2b78a677611eccbed87a6a75dd7fef9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=96zg=C3=BCr=20Akkurt?=
Date: Thu, 3 Aug 2023 22:46:29 +0000
Subject: [PATCH] feat: add deduplication doc (#655)

Added small document that explains how deduplication works
---
 content/smartmodules/deduplication.md | 74 +++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)
 create mode 100644 content/smartmodules/deduplication.md

diff --git a/content/smartmodules/deduplication.md b/content/smartmodules/deduplication.md
new file mode 100644
index 00000000..85c2b00c
--- /dev/null
+++ b/content/smartmodules/deduplication.md
@@ -0,0 +1,74 @@
+---
+title: Deduplication
+---
+
+The `Deduplication` feature in Fluvio deduplicates records based on their keys.
+
+Deduplication can be enabled on a topic.
+It drops duplicate records within a window defined by the configured `bounds`.
+
+The currently supported `bounds` are `age` and `count`; they are explained in the [bounds section](#bounds).
+
+## Behavior
+
+The deduplication function is deterministic and survives restarts. During a restart, the deduplication algorithm scans the data stream and rebuilds its in-memory state.
+
+## Example topic config
+
+Example configuration on a topic:
+
+%copy%
+```yaml
+# topic.yaml
+version: 0.1.0
+meta:
+  name: topic-with-dedup
+deduplication:
+  bounds:
+    count: 5 # remember at least the last 5 records
+    age: 5s # remember records for at least 5 seconds
+  filter:
+    transform:
+      uses: infinyon-labs/dedup-filter@0.0.2
+```
+A topic can be created using this config file like so:
+
+%copy first-line%
+```bash
+$ fluvio topic create -c topic.yaml
+```
+
+After creating the topic, it can be tested like so:
+
+%copy first-line%
+```bash
+$ fluvio produce topic-with-dedup --key-separator :
+1:2
+1:2
+2:5
+```
+
+%copy first-line%
+```bash
+$ fluvio consume -B topic-with-dedup
+2
+5
+```
+
+## Bounds
+
+| Parameter | Default | Type | Optional | Description |
+|:----------|:--------|:-----|:---------|:------------|
+| count | - | Integer | false | Minimum number of records the filter will remember. Records that arrived more than `count` records ago are not guaranteed to be remembered. |
+| age | - | Integer | true | Minimum amount of time the filter will remember a record for. It can be specified using this format: `15days 2min 2s`, or `2min 5s`, or `15ms` |
+
+## Implementation
+
+The specific algorithm used for deduplication is defined by a smartmodule. Right now the only smartmodule for this is the `dedup-filter` smartmodule.
+
+The `dedup-filter` splits the data into smaller chunks and keeps these chunks in memory. Each chunk has an age, so the filter knows whether it can
+drop a chunk based on its age.
+
+When the total number of records in memory exceeds the configured `bounds.count`, the filter checks whether the oldest chunk can be deleted under the configured `bounds.age` and deletes it if so. This allows fast deletion of old data with a minimal amount of bookkeeping.
+
+The chunking of data introduces a small memory overhead but lets the filter avoid spikes in execution time and memory usage.
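The chunked eviction described in the Implementation section can be illustrated with a minimal Python sketch. It is not the `dedup-filter` source: the class name `ChunkedDedup`, the `chunk_size` parameter, the use of plain numeric timestamps, and the rule that eviction must still leave at least `count` records in memory are all assumptions made for illustration.

```python
from collections import deque


class ChunkedDedup:
    """Hypothetical sketch of a chunked dedup window (not the dedup-filter code)."""

    def __init__(self, count, age_seconds, chunk_size=2):
        self.count = count            # mirrors bounds.count: minimum records to remember
        self.age = age_seconds        # mirrors bounds.age: minimum seconds to remember
        self.chunk_size = chunk_size  # assumed tuning knob, not a Fluvio setting
        self.chunks = deque()         # each chunk: {"keys": set, "newest": timestamp}
        self.total = 0                # number of keys held across all chunks

    def _seen(self, key):
        return any(key in chunk["keys"] for chunk in self.chunks)

    def _evict(self, now):
        # Drop whole chunks from the old end while both bounds stay satisfied.
        while self.chunks:
            oldest = self.chunks[0]
            keeps_count = self.total - len(oldest["keys"]) >= self.count
            old_enough = now - oldest["newest"] >= self.age
            if not (keeps_count and old_enough):
                break
            self.total -= len(oldest["keys"])
            self.chunks.popleft()

    def accept(self, key, now):
        """Return True if the record passes, False if it is a duplicate."""
        self._evict(now)
        if self._seen(key):
            return False
        # Start a new chunk when there is none yet or the newest one is full.
        if not self.chunks or len(self.chunks[-1]["keys"]) >= self.chunk_size:
            self.chunks.append({"keys": set(), "newest": now})
        newest = self.chunks[-1]
        newest["keys"].add(key)
        newest["newest"] = now
        self.total += 1
        return True
```

Feeding the keys from the produce example (`1`, `1`, `2`) through `accept` with `count=5` and `age_seconds=5` lets the first and third records pass and rejects the second, which matches the consume output shown in the document. Because eviction only ever removes a whole chunk from the old end, each call does a bounded amount of cleanup work, at the cost of keeping some records slightly longer than the bounds strictly require.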