
multi_step version of fuzzy dedup #549

Open · wants to merge 35 commits into base: dev
Conversation

blublinsky (Collaborator)

Why are these changes needed?

This significantly simplifies the fuzzy dedup implementation and creates a Python version of it.

Related issue number (if any).

@blublinsky blublinsky marked this pull request as draft August 27, 2024 13:45
@blublinsky blublinsky requested a review from cmadam August 27, 2024 14:06
@blublinsky blublinsky changed the title from "initial commit of Python version of fuzzy dedup" to "multi_step version of fuzzy dedup" Aug 28, 2024
cmadam (Collaborator) previously requested changes Aug 29, 2024

@cmadam cmadam left a comment


I tried to run the fdedup_preprocessor_local_python.py file, and I do not think the pre-processing step works correctly in the current implementation. While this step generated the word shingles and the minhashes correctly, it failed to produce a correct list of buckets for each document.

First, there is an inconsistency in the variables used in the preprocessor transform. In the fdedup_preprocessor_transform_base.py file, line 102, the _generate_buckets function uses the self.num_bands variable to generate the appropriate number of buckets (line 112). The self.num_bands variable is not specified in the configuration, and is instantiated to a default value of 1. Instead, the num_buckets variable should be used, as calculated in line 102 of the fdedup_preprocessor_transform.py file.

Second, assuming that this inconsistency is fixed, I fail to see how the program subsequently keeps track of the band in which each document bucket was calculated, as the band corresponding to a bucket is not saved in the buckets variable.
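One way to address both points (a sketch only; `minhashes`, `num_bands`, and `band_length` are assumed names, not necessarily the PR's actual variables) is to derive the buckets directly from the configured band parameters and to store the band index alongside each bucket hash, so that identical band segments from different bands never collide:

```python
import hashlib

def generate_buckets(minhashes, num_bands, band_length):
    """Split a minhash signature into num_bands bands of band_length
    values each, and hash every band together with its band index.
    Keeping the band index with the bucket id is what lets later
    steps tell apart buckets produced by different bands."""
    buckets = []
    for band in range(num_bands):
        segment = minhashes[band * band_length:(band + 1) * band_length]
        # prefix the band index so equal segments in different bands
        # still get distinct bucket ids
        digest = hashlib.sha1(
            (str(band) + ":" + ",".join(map(str, segment))).encode()
        ).hexdigest()
        buckets.append((band, digest))
    return buckets
```

With, say, 4 bands of length 2 over an 8-value signature, this yields 4 `(band, bucket_id)` pairs, and two documents land in the same bucket only if they agree on a band's segment in the same band.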

@cmadam (Collaborator)

cmadam commented Sep 5, 2024

I am confused. I have tried to run fdedup_bucket_processor_local.py. It seems that the default input data for this step does not contain any buckets with more than one document. I tried following the code; for buckets with more than one document, the code is supposed to call the _get_minhashes_docs() function, which raises a NotImplementedError. It is therefore not clear to me how to run this step locally.

@blublinsky (Collaborator, Author)

I am confused. I have tried to run fdedup_bucket_processor_local.py. It seems that the default input data for this step does not contain any buckets with more than one document. I tried following the code; for buckets with more than one document, the code is supposed to call the _get_minhashes_docs() function, which raises a NotImplementedError. It is therefore not clear to me how to run this step locally.

It's not implemented in the superclass; the subclass implements it.
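The pattern being described is the standard template-method hook: the base class declares the retrieval method and raises `NotImplementedError`, while a runtime-specific subclass supplies the actual lookup. A generic sketch (the class names and the in-memory store are illustrative assumptions, not the PR's exact code):

```python
class BucketProcessorBase:
    """Base class: knows how to process buckets, but delegates
    minhash retrieval to a runtime-specific subclass."""

    def _get_minhashes_docs(self, doc_ids):
        # hook to be implemented by the runtime-specific subclass
        raise NotImplementedError


class LocalBucketProcessor(BucketProcessorBase):
    """Local runtime: minhashes are kept in an in-memory dict."""

    def __init__(self, minhash_store):
        self.minhash_store = minhash_store

    def _get_minhashes_docs(self, doc_ids):
        # look up the minhash signature for each requested document
        return {doc: self.minhash_store[doc] for doc in doc_ids}
```

Running the base class directly therefore fails by design; the local, Ray, or Spark runtime is expected to instantiate its own subclass.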

@blublinsky blublinsky marked this pull request as ready for review September 7, 2024 13:07
@@ -0,0 +1,51 @@
REPOROOT=${CURDIR}/../../../../
Member


I assume this file is the original Makefile. Therefore we need to rename it back, or remove it and create a new Makefile.

@roytman
Member

roytman commented Sep 8, 2024

Can you add this transform to the kfp CI/CD test blacklist?

KFP_BLACK_LIST: "doc_chunk-ray,pdf2parquet-ray,pii_redactor"

Later we can check how to test its pipelines.

@blublinsky
Collaborator Author

Can you add this transform to the kfp CI/CD test blacklist?

KFP_BLACK_LIST: "doc_chunk-ray,pdf2parquet-ray,pii_redactor"

Later we can check how to test its pipelines.

It's not being tested now, since there is no Makefile. Once we have the Makefile corrected, yes.

Member

Did you add it intentionally? In the past, you did not want to add it.

Collaborator Author

It's weird, it's not in my local copy.

Collaborator Author

Wait, it's removed.

transforms/universal/fdedup_multi_step/Makefile (outdated; resolved)
transforms/universal/fdedup_multi_step/kfp_ray/Makefile (outdated; resolved)
Member

Do we really need such a nested hierarchy?
[transforms/universal/fdedup_multi_step/python/src/fdedup/transforms/base/](https://github.com/IBM/data-prep-kit/pull/549/files#diff-922685cc5e257d7d0b885e4201a7f7e3e0abd3a584de496af6cba3667c12b363)
We are already under the transforms/../fdedup_multi_step dir, so why do we have to define `python/src/fdedup/transforms` again?

Collaborator Author

It's for PyPI.

transforms/universal/fdedup_multi_step/ray/pyproject.toml (outdated; resolved)
@roytman
Member

roytman commented Sep 22, 2024

I'm OK with the PR, but would prefer that somebody else reviews it.

Collaborator

@cmadam cmadam left a comment

I have two immediate concerns with the current implementation of multi-step fuzzy dedup:

  • Correctness of the initialization function. The code is using the function:
FdedupSupport.fuzzy_optimal_param(
    threshold=0.8,
    num_perm=64,
    false_positive_weight=0.5,
    false_negative_weight=0.5,
)

to determine the optimal number of bands and the length of a band. Called with the parameters above, the function returns num_buckets = 5 and length_bucket = 11. Now, plugging these numbers into the formula that gives the probability that two documents with a similarity of 0.8 will be marked as duplicates:

>>> 1 - (1 - 0.8**11)**5
0.3617804572180058

I would have expected this probability to be somewhere near 0.9, not 0.36.
The same situation occurs with 128 permutations:

FdedupSupport.fuzzy_optimal_param(
    threshold=0.8,
    num_perm=128,
    false_positive_weight=0.5,
    false_negative_weight=0.5,
)

Called with the parameters above, the function returns num_buckets = 9 and length_bucket = 13. Plugging these numbers into the probability function, we get:

>>> 1 - (1 - 0.8**13)**9
0.39884387824016365

This means there is roughly a 40% probability that two documents with a Jaccard similarity of 0.8 will be marked as duplicates. Again, the result should have been closer to 90%. For further reference on the calculation of the probability function, please refer to page 20 of https://arxiv.org/pdf/2406.17557.

  • The second concern is about the scalability of the code. As the band hashes are not saved under different bands, and duplicates are searched across all the bands for all the documents, I do not think this code will scale to a larger number of bands. I am not convinced that the code will scale to 14 bands (the number of bands used in https://arxiv.org/pdf/2406.17557) even for moderate-size datasets (one Common Crawl snapshot).
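The arithmetic in the first concern can be checked directly. With b bands of r minhashes each, the probability that two documents of Jaccard similarity s share at least one bucket is 1 - (1 - s^r)^b. The sketch below (pure Python) reproduces the reviewer's numbers; the exhaustive (b, r) search mirrors the weighted false-positive/false-negative minimization used by libraries such as datasketch, and is an assumption about what `fuzzy_optimal_param` is intended to compute, not the PR's actual code:

```python
def match_probability(s, b, r):
    """Probability that two documents with Jaccard similarity s share
    at least one LSH bucket, given b bands of r minhash rows each."""
    return 1.0 - (1.0 - s ** r) ** b

# Reproduce the numbers from the review:
# match_probability(0.8, b=5, r=11)  -> 0.3617...
# match_probability(0.8, b=9, r=13)  -> 0.3988...

def optimal_param(threshold, num_perm, fp_weight=0.5, fn_weight=0.5, steps=1000):
    """Exhaustive search over (b, r) with b * r <= num_perm, minimizing
    the weighted false-positive area (matches below the threshold) plus
    the false-negative area (misses above the threshold)."""
    def integrate(f, lo, hi):
        # simple midpoint-rule numeric integration
        dx = (hi - lo) / steps
        return sum(f(lo + dx * (i + 0.5)) for i in range(steps)) * dx

    best, best_err = None, float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            fp = integrate(lambda s: match_probability(s, b, r), 0.0, threshold)
            fn = integrate(lambda s: 1.0 - match_probability(s, b, r), threshold, 1.0)
            err = fp * fp_weight + fn * fn_weight
            if err < best_err:
                best, best_err = (b, r), err
    return best
```

A correct parameter search should yield a steep S-curve around the threshold, i.e. a high match probability at s = 0.8, which is exactly what the reviewer's 0.36 and 0.40 figures fail to show.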
