Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - feat: add deduplication doc #655

Closed
wants to merge 5 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
74 changes: 74 additions & 0 deletions content/smartmodules/deduplication.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
---
title: Deduplication
---

The `Deduplication` feature in Fluvio is the way to deduplicate records based on their keys.
ajhunyady marked this conversation as resolved.
Show resolved Hide resolved

Deduplication can be used on a topic.
It will drop the duplicate records in a window which is defined by the configured `bounds`.

The current supported `bounds` are `age` and `count`, they are explained in the [bounds section](#bounds).

## Behavior

The deduplication function is deterministic and survives restarts. During restart the deduplication algorithm scans the data stream and rebuilds the memory object.

## Example topic config

Example configuration on topic:

%copy%
```yaml
# topic.yaml
version: 0.1.0
meta:
name: topic-with-dedup
deduplication:
bounds:
count: 5 # remember at least 5 last records
age: 5s # remember records for at least 5 seconds
filter:
transform:
uses: infinyon-labs/dedup-filter@0.0.2
```
A topic can be created using this config file like so:

%copy first-line%
```bash
$ fluvio topic create -c topic.yaml
```

After creating the topic, it can be tested like so:

%copy first-line%
```bash
$ fluvio produce topic-with-dedup --key-separator :
1:2
1:2
2:5
```

%copy first-line%
```bash
$ fluvio consume -B topic-with-dedup
2
5
```

## Bounds

| Parameter | default | type | optional | description |
|:-------------|:--------| :--- | :--- |:------------------------------------------------------|
| count | - | Integer | false | Minimum number of records the filter will remember. It doesn't guarantee to remember records that came `count` records before now. |
| age | - | Integer | true | Minimum amount of time this filter will remember a record for. It can be specified using this format: `15days 2min 2s`, or `2min 5s`, or `15ms` | |
ajhunyady marked this conversation as resolved.
Show resolved Hide resolved

## Implementation

The specific algorithm used for deduplication is defined by a smartmodule. Right now the only smartmodule for this is the `dedup-filter` smartmodule.

The `dedup-filter` splits the data into smaller chunks and keeps these chunks in memory. Each chunk has an age so it knows if it can
drop a chunk based on it's age.

When count of total records in memory exceeds the configured `bounds.count`, it will check if it can delete the oldest record chunk using the configured `bounds.age` and delete it if it can. This allows for fast deletion of old data with minimum amount of bookkeeping.

The chunking of data introduces a small memory usage overhead but allows the filter to have no spikes in execution time and memory usage.