multi_step version of fuzzy dedup #549
base: dev
Conversation
I tried to run the fdedup_preprocessor_local_python.py file, and I do not think the pre-processing step works correctly in the current implementation. While this step generated the word shingles and the minhashes correctly, it failed to produce a correct list of buckets for each document.
First, there is an inconsistency in the variables used in the preprocessor transform. In the fdedup_preprocessor_transform_base.py file, line 102, the _generate_buckets function uses the self.num_bands variable to generate the appropriate number of buckets (line 112). The self.num_bands variable is not specified in the configuration, and is instantiated to a default value of 1. Instead, the num_buckets variable should be used, as calculated in line 102 of the fdedup_preprocessor_transform.py file.
Second, assuming that this inconsistency is fixed, I fail to see how the program subsequently keeps track of the band in which each document bucket was calculated, since the band corresponding to a bucket is not saved in the buckets variable.
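For illustration only, here is a minimal sketch of banded bucket generation that carries the band index along with each bucket hash; the names generate_buckets, minhashes, num_bands, and band_length are assumptions for this example, not the transform's actual API:

```python
from hashlib import sha256

def generate_buckets(minhashes: list[int], num_bands: int, band_length: int):
    """Split a minhash signature into num_bands bands of band_length values
    each and return (band_index, bucket_hash) pairs, so that documents are
    only compared when they collide within the same band."""
    buckets = []
    for band in range(num_bands):
        segment = minhashes[band * band_length : (band + 1) * band_length]
        # Hash the band's segment; keying the digest on the band index keeps
        # identical segments from different bands in separate buckets.
        digest = sha256(repr((band, segment)).encode("utf-8")).hexdigest()
        buckets.append((band, digest))
    return buckets
```

Downstream, candidate pairs would then be grouped per (band, bucket) key, which is what the standard 1 - (1 - s^r)^b analysis assumes.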
(force-pushed from d8af28f to 3f6a8e9)
I am confused; I have tried to run the …
It's not implemented in the superclass; the subclass implements it.
(force-pushed from d73b1f3 to 1c99fbf)
```
@@ -0,0 +1,51 @@
REPOROOT=${CURDIR}/../../../../
```
I assume this file is the original Makefile. Therefore we need to rename it back, or remove it and create a new Makefile.
Can you add this transform to the kfp CI/CD test blacklist (data-prep-kit/.github/workflows/test.yml, line 16 in 9fd27e0)?
Later we can check how to test its pipelines.
It's not being tested now, since there is no Makefile. Once we have the Makefile corrected, yes.
(force-pushed from ee6c979 to 44de317)
(force-pushed from 1a126d0 to 10665b4)
(force-pushed from acde879 to 22a3c14)
(force-pushed from eb1fe16 to 32c2ad3)
Did you add it intentionally? In the past, you did not want to add it.
It's weird, it's not in my local copy.
Wait, it's removed.
transforms/universal/ededup/kfp_ray/src/ededup_compute_execution_params.py (resolved)
Do we really need such a nested hierarchy?
[transforms/universal/fdedup_multi_step/python/src/fdedup/transforms/base/](https://github.com/IBM/data-prep-kit/pull/549/files#diff-922685cc5e257d7d0b885e4201a7f7e3e0abd3a584de496af6cba3667c12b363)
We are already under the transforms/../fdedup_multi_step dir; why do we have to define `python/src/fdedup/transforms` again?
It's for PyPI.
I'm OK with the PR, but would prefer that somebody else reviews it.
I have 2 immediate concerns with the current implementation of multi-step fuzzy dedup:
- Correctness of the initialization function. The code is using the function:
```python
FdedupSupport.fuzzy_optimal_param(
    threshold=0.8,
    num_perm=64,
    false_positive_weight=0.5,
    false_negative_weight=0.5,
)
```
to determine the optimal number of bands and the length of a band. Called with the parameters above, the function returns num_buckets = 5 and length_bucket = 11. Now, plugging these numbers into the formula that gives the probability that two documents with a similarity of 0.8 will be marked as duplicates:
```python
>>> 1 - (1 - 0.8**11)**5
0.3617804572180058
```
I would have expected this probability to be somewhere near 0.9, not 0.36.
The same situation occurs if I try with 128 permutations:
```python
FdedupSupport.fuzzy_optimal_param(
    threshold=0.8,
    num_perm=128,
    false_positive_weight=0.5,
    false_negative_weight=0.5,
)
```
Called with these parameters, the function returns num_buckets = 9 and length_bucket = 13. Plugging these numbers into the probability function, we get:
```python
>>> 1 - (1 - 0.8**13)**9
0.39884387824016365
```
This means there is roughly a 40% probability that two documents with a Jaccard similarity of 0.8 will be marked as duplicates. Again, the result should have been closer to 90%. For further reference on the calculation of the probability function, see page 20 of https://arxiv.org/pdf/2406.17557 (and the sketch after this list).
- My second concern is about the scalability of the code. As the band hashes are not saved under different bands, and duplicates are searched across all the bands for all the documents, I do not think this code will scale to a larger number of bands. I am not convinced that the code will scale to 14 bands (the number of bands used in https://arxiv.org/pdf/2406.17557), even for moderate-size datasets (one Common Crawl snapshot).
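Since both concerns revolve around the formula 1 - (1 - s^r)^b, here is a small sketch that reproduces the numbers above; the 14-bands-of-8-hashes configuration is my reading of the FineWeb paper cited above (the band length of 8 comes from that paper, not from this PR):

```python
def match_probability(similarity: float, num_bands: int, band_length: int) -> float:
    """Probability that two documents with the given Jaccard similarity
    collide in at least one LSH band: 1 - (1 - s**r)**b."""
    return 1 - (1 - similarity**band_length) ** num_bands

# Parameters returned by fuzzy_optimal_param in this PR:
print(match_probability(0.8, num_bands=5, band_length=11))   # ~0.36
print(match_probability(0.8, num_bands=9, band_length=13))   # ~0.40

# The FineWeb configuration (14 bands of 8 hashes, 112 permutations)
# gives the expected high recall at 0.8 similarity:
print(match_probability(0.8, num_bands=14, band_length=8))   # ~0.92
```

If this sketch is right, the parameters returned by fuzzy_optimal_param sit far below the roughly 0.9 recall one would expect at the 0.8 threshold, which supports the first concern above.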
Why are these changes needed?
This significantly simplifies the fuzzy dedup implementation and creates a Python version of it.
Related issue number (if any).