
multi_step version of fuzzy dedup #549

Open · wants to merge 35 commits into base: dev
Conversation

blublinsky (Collaborator)

Why are these changes needed?

This significantly simplifies the fuzzy dedup implementation and creates a Python version of it.

Related issue number (if any).

@blublinsky blublinsky marked this pull request as draft August 27, 2024 13:45
@blublinsky blublinsky requested a review from cmadam August 27, 2024 14:06
@blublinsky blublinsky changed the title from "initial commit of Python version of fuzzy dedup" to "multi_step version of fuzzy dedup" Aug 28, 2024
cmadam (Collaborator) previously requested changes Aug 29, 2024

@cmadam cmadam left a comment


I tried to run the fdedup_preprocessor_local_python.py file, and I do not think the pre-processing step works correctly in the current implementation. While this step generated the word shingles and the minhashes correctly, it failed to produce a correct list of buckets for each document.

First, there is an inconsistency in the variables used in the preprocessor transform. In the fdedup_preprocessor_transform_base.py file, line 102, the _generate_buckets function uses the self.num_bands variable to generate the appropriate number of buckets (line 112). The self.num_bands variable is not specified in the configuration, and is instantiated to a default value of 1. Instead, the num_buckets variable should be used, as calculated in line 102 of the fdedup_preprocessor_transform.py file.

Second, assuming that this inconsistency is fixed, I fail to see how the program subsequently keeps track of the band in which each document bucket was calculated, as the band corresponding to a bucket is not saved in the buckets variable.
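One way to address both points (a sketch only; `minhashes`, `num_bands`, and `band_length` are assumed names, not necessarily the PR's actual variables) is to derive the buckets directly from the configured band parameters and to store the band index alongside each bucket hash, so that identical band segments from different bands never collide:

```python
import hashlib

def generate_buckets(minhashes, num_bands, band_length):
    """Split a minhash signature into num_bands bands of band_length
    values each, and hash every band together with its band index.
    Keeping the band index with the bucket id is what lets later
    steps tell apart buckets produced by different bands."""
    buckets = []
    for band in range(num_bands):
        segment = minhashes[band * band_length:(band + 1) * band_length]
        # prefix the band index so equal segments in different bands
        # still get distinct bucket ids
        digest = hashlib.sha1(
            (str(band) + ":" + ",".join(map(str, segment))).encode()
        ).hexdigest()
        buckets.append((band, digest))
    return buckets
```

With, say, 4 bands of length 2 over an 8-value signature, this yields 4 `(band, bucket_id)` pairs, and two documents land in the same bucket only if they agree on a band's segment in the same band.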

@cmadam (Collaborator)

cmadam commented Sep 5, 2024

I am confused. I have tried to run fdedup_bucket_processor_local.py. It seems that the default input data for this step does not contain any buckets with more than one document. I tried following the code; for buckets with more than one document, the code is supposed to call the _get_minhashes_docs() function, which raises a NotImplementedError. It is therefore not clear to me how to run this step locally.

@blublinsky (Collaborator, Author)

I am confused. I have tried to run fdedup_bucket_processor_local.py. It seems that the default input data for this step does not contain any buckets with more than one document. I tried following the code; for buckets with more than one document, the code is supposed to call the _get_minhashes_docs() function, which raises a NotImplementedError. It is therefore not clear to me how to run this step locally.

It's not implemented in the superclass; the subclass implements it.
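The pattern being described is the standard template-method hook: the base class declares the retrieval method and raises `NotImplementedError`, while a runtime-specific subclass supplies the actual lookup. A generic sketch (the class names and the in-memory store are illustrative assumptions, not the PR's exact code):

```python
class BucketProcessorBase:
    """Base class: knows how to process buckets, but delegates
    minhash retrieval to a runtime-specific subclass."""

    def _get_minhashes_docs(self, doc_ids):
        # hook to be implemented by the runtime-specific subclass
        raise NotImplementedError


class LocalBucketProcessor(BucketProcessorBase):
    """Local runtime: minhashes are kept in an in-memory dict."""

    def __init__(self, minhash_store):
        self.minhash_store = minhash_store

    def _get_minhashes_docs(self, doc_ids):
        # look up the minhash signature for each requested document
        return {doc: self.minhash_store[doc] for doc in doc_ids}
```

Running the base class directly therefore fails by design; the local, Ray, or Spark runtime is expected to instantiate its own subclass.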

@blublinsky blublinsky marked this pull request as ready for review September 7, 2024 13:07
@@ -0,0 +1,51 @@
REPOROOT=${CURDIR}/../../../../
Member


I assume this file is the original Makefile. Therefore we need to rename it back, or remove it and create a new Makefile.

@roytman
Member

roytman commented Sep 8, 2024

Can you add this transform to the kfp CI/CD test blacklist?

KFP_BLACK_LIST: "doc_chunk-ray,pdf2parquet-ray,pii_redactor"

Later we can check how to test its pipelines.

@blublinsky
Collaborator Author

Can you add this transform to the kfp CI/CD test blacklist?

KFP_BLACK_LIST: "doc_chunk-ray,pdf2parquet-ray,pii_redactor"

Later we can check how to test its pipelines.

It's not being tested now, since there is no Makefile. Once we have the Makefile corrected, yes.

Member

Did you add it intentionally? In the past, you did not want to add it.

Collaborator Author

It's weird, it's not in my local copy.

Collaborator Author

Wait, it's removed.

transforms/universal/fdedup_multi_step/Makefile (outdated; resolved)
transforms/universal/fdedup_multi_step/kfp_ray/Makefile (outdated; resolved)
Member

Do we really need such a nested hierarchy?
[transforms/universal/fdedup_multi_step/python/src/fdedup/transforms/base/](https://github.com/IBM/data-prep-kit/pull/549/files#diff-922685cc5e257d7d0b885e4201a7f7e3e0abd3a584de496af6cba3667c12b363)
We are already under the transforms/../fdedup_multi_step dir, so why do we have to define `python/src/fdedup/transforms` again?

Collaborator Author

It's for PyPI.

transforms/universal/fdedup_multi_step/ray/pyproject.toml (outdated; resolved)
@roytman
Member

roytman commented Sep 22, 2024

I'm OK with the PR, but would prefer that somebody else reviews it.

Collaborator

@cmadam cmadam left a comment

I have two immediate concerns with the current implementation of multi-step fuzzy dedup:

  • Correctness of the initialization function. The code is using the function:
FdedupSupport.fuzzy_optimal_param(
    threshold=0.8,
    num_perm=64,
    false_positive_weight=0.5,
    false_negative_weight=0.5,
)

to determine the optimal number of bands and the length of a band. Called with the parameters above, the function returns num_buckets = 5 and length_bucket = 11. Now, plugging these numbers into the formula that gives the probability that two documents with a similarity of 0.8 will be marked as duplicates:

>>> 1 - (1 - 0.8**11)**5
0.3617804572180058

I would have expected this probability to be somewhere near 0.9, not 0.36.
The same situation occurs with 128 permutations:

FdedupSupport.fuzzy_optimal_param(
    threshold=0.8,
    num_perm=128,
    false_positive_weight=0.5,
    false_negative_weight=0.5,
)

Called with the parameters above, the function returns num_buckets = 9 and length_bucket = 13. Plugging these numbers into the probability function, we get:

>>> 1 - (1 - 0.8**13)**9
0.39884387824016365

This means there is roughly a 40% probability that two documents with a Jaccard similarity of 0.8 will be marked as duplicates. Again, the result should have been closer to 90%. For further reference on the calculation of the probability function, please refer to page 20 of https://arxiv.org/pdf/2406.17557.

  • The second concern is about the scalability of the code. As the band hashes are not saved under different bands, and duplicates are searched across all the bands for all the documents, I do not think this code will scale to a larger number of bands. I am not convinced that the code will scale to 14 bands (the number of bands used in https://arxiv.org/pdf/2406.17557) even for moderate-size datasets (one Common Crawl snapshot).
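The arithmetic in the first concern can be checked directly. With b bands of r minhashes each, the probability that two documents of Jaccard similarity s share at least one bucket is 1 - (1 - s^r)^b. The sketch below (pure Python) reproduces the reviewer's numbers; the exhaustive (b, r) search mirrors the weighted false-positive/false-negative minimization used by libraries such as datasketch, and is an assumption about what `fuzzy_optimal_param` is intended to compute, not the PR's actual code:

```python
def match_probability(s, b, r):
    """Probability that two documents with Jaccard similarity s share
    at least one LSH bucket, given b bands of r minhash rows each."""
    return 1.0 - (1.0 - s ** r) ** b

# Reproduce the numbers from the review:
# match_probability(0.8, b=5, r=11)  -> 0.3617...
# match_probability(0.8, b=9, r=13)  -> 0.3988...

def optimal_param(threshold, num_perm, fp_weight=0.5, fn_weight=0.5, steps=1000):
    """Exhaustive search over (b, r) with b * r <= num_perm, minimizing
    the weighted false-positive area (matches below the threshold) plus
    the false-negative area (misses above the threshold)."""
    def integrate(f, lo, hi):
        # simple midpoint-rule numeric integration
        dx = (hi - lo) / steps
        return sum(f(lo + dx * (i + 0.5)) for i in range(steps)) * dx

    best, best_err = None, float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            fp = integrate(lambda s: match_probability(s, b, r), 0.0, threshold)
            fn = integrate(lambda s: 1.0 - match_probability(s, b, r), threshold, 1.0)
            err = fp * fp_weight + fn * fn_weight
            if err < best_err:
                best, best_err = (b, r), err
    return best
```

A correct parameter search should yield a steep S-curve around the threshold, i.e. a high match probability at s = 0.8, which is exactly what the reviewer's 0.36 and 0.40 figures fail to show.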
