refactored Readme
blublinsky committed Sep 8, 2024
1 parent 1c99fbf commit 14a425a
Showing 2 changed files with 15 additions and 2 deletions.
2 changes: 1 addition & 1 deletion transforms/universal/fdedup_multi_step/README.md
@@ -24,7 +24,7 @@ This version of fuzzy dedup code is split into 3 transforms:
by doc ID cache and are snapshotted at the end of this step. Additionally this step produces
minhash and bucket hash snapshot, which contain only non-duplicate documents. These two
snapshots can be used for implementation of the incremental fuzzy dedup (see below)
-* Fuzzy dedup fillter is responsible for re reading of the original data and filtering out
+* Fuzzy dedup filter is responsible for re-reading the original data and filtering out
all duplicate documents based on the doc ID cache which is created based on the snapshot
produced in the previous step.
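The filter step described above boils down to a set-membership check against the doc ID cache. A minimal sketch of that logic, with illustrative names only (the actual transform works through the doc ID cache built from the snapshot, not a plain Python set):

```python
# Illustrative sketch of the fuzzy dedup filter step: re-read the
# original documents and keep only those whose IDs are not flagged as
# duplicates. Names and data shapes here are hypothetical, not the
# transform's actual API.

def filter_duplicates(documents, duplicate_ids):
    """Return only documents whose 'doc_id' is not a known duplicate."""
    return [doc for doc in documents if doc["doc_id"] not in duplicate_ids]

docs = [
    {"doc_id": 1, "text": "unique text"},
    {"doc_id": 2, "text": "duplicated text"},
    {"doc_id": 3, "text": "another unique text"},
]
duplicates = {2}  # IDs flagged during the preceding bucket-processing step
print(filter_duplicates(docs, duplicates))
```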

15 changes: 14 additions & 1 deletion transforms/universal/fdedup_multi_step/kfp_ray/README.md
@@ -3,7 +3,20 @@

## Summary
This project allows execution of the [multi step fuzzy dedup](../ray) as a
-[KubeFlow Pipeline](https://www.kubeflow.org/docs/components/pipelines/overview/)
+[KubeFlow Pipeline](https://www.kubeflow.org/docs/components/pipelines/overview/). As
defined [here](../README.md), this version of fuzzy dedup is a combination of 3 transforms:
* Fuzzy dedup preprocessor
* Fuzzy dedup bucket processor
* Fuzzy dedup filter

As a result, we provide three workflows here, one for each step. Each step has its own
resource requirements and can be executed independently. Note, however, that the supporting
actor parameters (the number of doc, minhash, and bucket actors) and the dedup parameters
(threshold and number of permutations) cannot change between steps. The number of actors is
computed by the preprocessor workflow based on the number of documents and must be passed to
the subsequent ones.

For production purposes it also makes sense to create a "super workflow" combining all 3 steps.
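The key data-flow constraint of such a super workflow (actor counts computed once by the preprocessor and reused unchanged by the later steps) can be sketched in plain Python. All names and the sizing rule below are illustrative assumptions, not the actual KFP workflow code:

```python
# Illustrative sketch (not the real KFP workflow): the preprocessor
# derives the number of doc, minhash, and bucket actors from the
# document count, and the bucket-processor and filter steps must reuse
# exactly the same actor and dedup parameters.

def compute_actor_counts(num_documents, docs_per_actor=100_000):
    """Hypothetical sizing rule: one actor per docs_per_actor documents."""
    n = max(1, num_documents // docs_per_actor)
    return {"doc_actors": n, "minhash_actors": n, "bucket_actors": n}

def run_super_workflow(num_documents, threshold=0.8, num_permutations=64):
    # Step 1: preprocessor computes actor counts (and builds the caches).
    actors = compute_actor_counts(num_documents)
    # Steps 2 and 3: the same actor and dedup parameters are passed on.
    for step in ("bucket_processor", "filter"):
        print(f"{step}: actors={actors}, threshold={threshold}, "
              f"permutations={num_permutations}")
    return actors

run_super_workflow(1_000_000)
```

In the real pipelines the equivalent of `compute_actor_counts` runs inside the preprocessor workflow, and its output parameters are what you would wire into the two downstream workflows.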

The detailed pipeline is presented in the [Simplest Transform pipeline tutorial](../../../../kfp/doc/simple_transform_pipeline.md).
