refactored Readme
blublinsky committed Sep 8, 2024
1 parent 1c99fbf commit 14a425a
Showing 2 changed files with 15 additions and 2 deletions.
2 changes: 1 addition & 1 deletion transforms/universal/fdedup_multi_step/README.md
@@ -24,7 +24,7 @@ This version of fuzzy dedup code is split into 3 transforms:
by doc ID cache and are snapshotted at the end of this step. Additionally this step produces
minhash and bucket hash snapshot, which contain only non-duplicate documents. These two
snapshots can be used for implementation of the incremental fuzzy dedup (see below)
-* Fuzzy dedup fillter is responsible for re reading of the original data and filtering out
+* Fuzzy dedup filter is responsible for re-reading the original data and filtering out
all duplicate documents based on the doc ID cache which is created based on the snapshot
produced in the previous step.
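The filter step described above boils down to a set-membership check against the doc ID cache. A minimal sketch of that logic, with illustrative names only (the actual transform works through the doc ID cache built from the snapshot, not a plain Python set):

```python
# Illustrative sketch of the fuzzy dedup filter step: re-read the
# original documents and keep only those whose IDs are not flagged as
# duplicates. Names and data shapes here are hypothetical, not the
# transform's actual API.

def filter_duplicates(documents, duplicate_ids):
    """Return only documents whose 'doc_id' is not a known duplicate."""
    return [doc for doc in documents if doc["doc_id"] not in duplicate_ids]

docs = [
    {"doc_id": 1, "text": "unique text"},
    {"doc_id": 2, "text": "duplicated text"},
    {"doc_id": 3, "text": "another unique text"},
]
duplicates = {2}  # IDs flagged during the preceding bucket-processing step
print(filter_duplicates(docs, duplicates))
```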

15 changes: 14 additions & 1 deletion transforms/universal/fdedup_multi_step/kfp_ray/README.md
@@ -3,7 +3,20 @@

## Summary
This project allows execution of the [multi step fuzzy dedup](../ray) as a
-[KubeFlow Pipeline](https://www.kubeflow.org/docs/components/pipelines/overview/)
+[KubeFlow Pipeline](https://www.kubeflow.org/docs/components/pipelines/overview/). As
defined [here](../README.md), this version of fuzzy dedup is a combination of 3 transforms:
* Fuzzy dedup preprocessor
* Fuzzy dedup bucket processor
* Fuzzy dedup filter

As a result, we provide three workflows here, one for each step. Each step has its own
resource requirements and can be executed independently. Note, however, that the supporting
actor parameters (the number of doc, minhash, and bucket actors) and the dedup parameters
(threshold and number of permutations) cannot change between steps. The number of actors is
computed by the preprocessor workflow based on the number of documents and must be passed to
the subsequent ones.

For production purposes it also makes sense to create a "super workflow" combining all 3 steps.
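The key data-flow constraint of such a super workflow (actor counts computed once by the preprocessor and reused unchanged by the later steps) can be sketched in plain Python. All names and the sizing rule below are illustrative assumptions, not the actual KFP workflow code:

```python
# Illustrative sketch (not the real KFP workflow): the preprocessor
# derives the number of doc, minhash, and bucket actors from the
# document count, and the bucket-processor and filter steps must reuse
# exactly the same actor and dedup parameters.

def compute_actor_counts(num_documents, docs_per_actor=100_000):
    """Hypothetical sizing rule: one actor per docs_per_actor documents."""
    n = max(1, num_documents // docs_per_actor)
    return {"doc_actors": n, "minhash_actors": n, "bucket_actors": n}

def run_super_workflow(num_documents, threshold=0.8, num_permutations=64):
    # Step 1: preprocessor computes actor counts (and builds the caches).
    actors = compute_actor_counts(num_documents)
    # Steps 2 and 3: the same actor and dedup parameters are passed on.
    for step in ("bucket_processor", "filter"):
        print(f"{step}: actors={actors}, threshold={threshold}, "
              f"permutations={num_permutations}")
    return actors

run_super_workflow(1_000_000)
```

In the real pipelines the equivalent of `compute_actor_counts` runs inside the preprocessor workflow, and its output parameters are what you would wire into the two downstream workflows.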

The detailed pipeline is presented in the [Simplest Transform pipeline tutorial](../../../../kfp/doc/simple_transform_pipeline.md).
