diff --git a/transforms/universal/fdedup_multi_step/README.md b/transforms/universal/fdedup_multi_step/README.md
index a8ff5fca0..0f5ca2a51 100644
--- a/transforms/universal/fdedup_multi_step/README.md
+++ b/transforms/universal/fdedup_multi_step/README.md
@@ -24,7 +24,7 @@ This version of fuzzy dedup code is split into 3 transforms:
 by doc ID cache and are snapshotted at the end of this step. Additionally this step produces minhash and bucket hash
 snapshot, which contain only non-duplicate documents. These two snapshots can be used for implementation of the
 incremental fuzzy dedup (see below)
-* Fuzzy dedup fillter is responsible for re reading of the original data and filtering out
+* Fuzzy dedup filter is responsible for re reading of the original data and filtering out
 all duplicate documents based on the doc ID cache which is created based on the snapshot produced in the
 previous step.
 
diff --git a/transforms/universal/fdedup_multi_step/kfp_ray/README.md b/transforms/universal/fdedup_multi_step/kfp_ray/README.md
index aff816b78..947d09e3f 100644
--- a/transforms/universal/fdedup_multi_step/kfp_ray/README.md
+++ b/transforms/universal/fdedup_multi_step/kfp_ray/README.md
@@ -3,7 +3,20 @@
 ## Summary
 
 This project allows execution of the [multi step fuzzy dedup](../ray) as a
-[KubeFlow Pipeline](https://www.kubeflow.org/docs/components/pipelines/overview/)
+[KubeFlow Pipeline](https://www.kubeflow.org/docs/components/pipelines/overview/). As
+defined [here](../README.md), this version of fuzzy dedup is a combination of 3 transforms:
+* Fuzzy dedup preprocessor
+* Fuzzy dedup bucket processor
+* Fuzzy dedup filter
+
+As a result, we provide here three workflows - one for each step. Each step has its own
+resource requirements and can be executed independently. Note although, that supporting
+actors parameters - number of doc, minhash and bucket actors and dedup parameters - threshold
+and number of permutations can not change between steps. The number of actors is computed by
+the preprocessor workflow based on the amount of documents and should be entered in the
+subsequent ones.
+
+For production purposes it also makes sense to create a "super workflow" combining all 3 steps
 
 The detail pipeline is presented in the [Simplest Transform pipeline tutorial](../../../../kfp/doc/simple_transform_pipeline.md)