[REVIEW] Speedup Connected Components #302

Open
wants to merge 12 commits into
base: main
from


VibhuJawa
Collaborator

@VibhuJawa VibhuJawa commented Oct 15, 2024

This pull request includes several changes to the nemo_curator/modules/fuzzy_dedup.py file, focusing on removing the convert_str_ids functionality, optimizing performance, and improving logging.

The most important changes are:

Removal of convert_str_ids functionality:

  • Removed the convert_str_ids parameter and its associated logic from the __init__ method and other methods in nemo_curator/modules/fuzzy_dedup.py. [1] [2] [3] [4] [5] [6]

This is possible because cuDF now supports long strings, so we no longer need to convert string ids to integer ids.

Performance optimizations:

  • Decreased the block size for reading parquet files in _write_dedup_parsed_id [1] so that drop_duplicates, which has a large memory overhead (16x or more), can scale without running out of memory (OOM). This lets us run connected components (CC) at larger scales without requiring more hardware.

  • Increased the chunk size in the _write_encoded_jaccard_pair methods to improve merge performance: larger base chunks mean larger transfers, which get better throughput over TCP. [2]

  • Updated the _run_connected_components method to initialize Comms with p2p=False
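The two sizing trade-offs above can be sketched with back-of-the-envelope arithmetic. This is only an illustration, not code from the PR: the helper names, the 32 GiB memory budget, the 1 ms per-transfer latency, and the 1 GiB/s bandwidth are all hypothetical. Only the 16x overhead figure comes from the description.

```python
# Rough sizing helpers. All concrete numbers below are hypothetical,
# except the 16x drop_duplicates overhead quoted in the PR description.

GIB = 1024 ** 3
MIB = 1024 ** 2


def max_block_bytes(free_memory_bytes: int, overhead_factor: float = 16.0) -> int:
    """Largest input block whose peak usage (block * overhead) still fits."""
    if overhead_factor <= 0:
        raise ValueError("overhead_factor must be positive")
    return int(free_memory_bytes // overhead_factor)


def transfer_seconds(total_bytes: int, chunk_bytes: int,
                     latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move total_bytes in chunks, paying a fixed latency per chunk."""
    num_chunks = -(-total_bytes // chunk_bytes)  # ceiling division
    return num_chunks * latency_s + total_bytes / bandwidth_bytes_per_s


# Smaller read blocks leave headroom for the drop_duplicates overhead.
block = max_block_bytes(free_memory_bytes=32 * GIB)

# Larger chunks pay the per-transfer latency fewer times for the same data.
small = transfer_seconds(10 * GIB, 16 * MIB, latency_s=1e-3, bandwidth_bytes_per_s=GIB)
large = transfer_seconds(10 * GIB, 256 * MIB, latency_s=1e-3, bandwidth_bytes_per_s=GIB)
```

Under this simple model the two changes pull in opposite directions on purpose: the read path is limited by peak memory, while the merge path is limited by per-transfer cost.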

Merge Improvements:

  • This PR optimizes the merge process by using an index-based approach instead of the previous batched method, while maintaining the broadcast merge.
  • The new method reduces shuffles to 2*num_batches - 1 through indexing.
  • The only additional operation is setting the index on the ddf_id column.
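As a rough illustration of the index-based idea, the sketch below builds the id lookup once and reuses it for every batch of pairs. It is a plain-Python toy, not the Dask code in this PR: build_id_index, encode_pairs, and shuffles_after_indexing are hypothetical names, and a dict stands in for the indexed ddf_id table. The shuffle count is the formula quoted above.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def build_id_index(ids: Iterable[str]) -> Dict[str, int]:
    """Index the id table once: string id -> integer uid."""
    return {doc_id: uid for uid, doc_id in enumerate(ids)}


def batched(pairs: List[Pair], batch_size: int) -> Iterator[List[Pair]]:
    """Yield consecutive slices of at most batch_size pairs."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]


def encode_pairs(pairs: List[Pair], id_index: Dict[str, int],
                 batch_size: int) -> List[Tuple[int, int]]:
    """Encode both sides of every pair by looking them up in the shared index."""
    encoded: List[Tuple[int, int]] = []
    for batch in batched(pairs, batch_size):
        # Each batch only performs lookups; the index itself is never rebuilt.
        encoded.extend((id_index[x], id_index[y]) for x, y in batch)
    return encoded


def shuffles_after_indexing(num_batches: int) -> int:
    """Shuffle count quoted in the PR description: 2 * num_batches - 1."""
    return 2 * num_batches - 1


ids = ["doc-a", "doc-b", "doc-c"]
pairs = [("doc-a", "doc-b"), ("doc-b", "doc-c"), ("doc-a", "doc-c")]
encoded = encode_pairs(pairs, build_id_index(ids), batch_size=2)
```

The one-time cost of building the index corresponds to setting the index on ddf_id; every later batch benefits from it.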

Main: 22m 10s
PR: 444.85 s
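
The two timings use different units, so the relative speedup is easier to read after converting both to seconds. A small calculation using only the numbers reported above:

```python
# Convert "22m 10s" to seconds and compare against the PR timing.
main_seconds = 22 * 60 + 10   # 1330 s
pr_seconds = 444.85

speedup = main_seconds / pr_seconds
print(f"{speedup:.2f}x faster")  # prints: 2.99x faster
```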

image

Dask Profiles:
cc_profiles.zip

Logging improvements:

  • Added start time logging in the cc_workflow method and end-to-end time logging for the workflow. [1] [2]
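
A minimal sketch of this kind of timing, assuming a context manager wrapped around the workflow body. The log_elapsed helper and its message format are hypothetical and are not the logging code added in this PR.

```python
import time
from contextlib import contextmanager


@contextmanager
def log_elapsed(label, log=print):
    """Log a start timestamp, then the end-to-end duration on exit."""
    wall_start = time.time()     # wall-clock time for the start message
    start = time.perf_counter()  # monotonic clock for the duration
    log(f"{label} started at {time.ctime(wall_start)}")
    try:
        yield
    finally:
        log(f"{label} took {time.perf_counter() - start:.2f} s end to end")


# Collect messages in a list here; a real workflow would pass a logger method.
messages = []
with log_elapsed("cc_workflow", log=messages.append):
    time.sleep(0.01)  # stand-in for the actual workflow
```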

Verify Equal Results:

import dask_cudf

# Load one connected-components output and sort it for comparison
ddf_1 = dask_cudf.read_parquet("/raid/vjawa/rpv2_debug_cache_pull_302/connected_components.parquet")
ddf_1 = ddf_1.repartition(npartitions=4).sort_values(by=['id', 'group'])
len(ddf_1)

376321911

ddf_2 = dask_cudf.read_parquet("/raid/vjawa/rpv2_debug_cache/connected_components.parquet")
ddf_2 = ddf_2.repartition(npartitions=4).sort_values(by=['id', 'group'])

len(ddf_2)

376321911

Check that both outputs contain the same ids:

merged_df = ddf_1[['id']].merge(ddf_2[['id']], on='id', how='inner')
len(merged_df)

376321911
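
The checks above compare row counts and ids. If the group labels themselves are not guaranteed to match between runs (an assumption; the PR does not say either way), the groupings can still be compared as partitions of the ids. A small plain-Python sketch of that idea, with hypothetical helper names and toy data:

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Hashable, Set


def partition(id_to_group: Dict[Hashable, Hashable]) -> Set[FrozenSet[Hashable]]:
    """Collapse an id -> group mapping into a set of id sets, ignoring labels."""
    members = defaultdict(set)
    for doc_id, group in id_to_group.items():
        members[group].add(doc_id)
    return {frozenset(ids) for ids in members.values()}


def same_components(a: Dict[Hashable, Hashable], b: Dict[Hashable, Hashable]) -> bool:
    """True when both mappings put exactly the same ids together."""
    return partition(a) == partition(b)


run_main = {"a": 0, "b": 0, "c": 1}
run_pr = {"a": 7, "b": 7, "c": 3}    # same grouping, different labels
run_bad = {"a": 7, "b": 3, "c": 3}   # "b" moved to another component

matches = same_components(run_main, run_pr)
differs = same_components(run_main, run_bad)
```

At the scale reported here this comparison would need to be expressed as a distributed join rather than an in-memory dict, but the equivalence being tested is the same.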

CC: @ayushdg

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@VibhuJawa VibhuJawa marked this pull request as ready for review October 15, 2024 07:44
@VibhuJawa VibhuJawa force-pushed the vjawa/speedup_cc_in_fuzzy_dedup branch from f424de2 to ea65bad Compare October 16, 2024 02:35
@VibhuJawa VibhuJawa changed the base branch from main to r0.3.0 October 16, 2024 02:37
@VibhuJawa VibhuJawa changed the base branch from r0.3.0 to main October 16, 2024 02:37
VibhuJawa and others added 5 commits October 15, 2024 19:46
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@login-eos02.eos.clusters.nvidia.com>
@VibhuJawa VibhuJawa force-pushed the vjawa/speedup_cc_in_fuzzy_dedup branch from ea65bad to 5caa34a Compare October 16, 2024 02:46
@VibhuJawa VibhuJawa changed the title [WIP] Speedup Connected Components [REVIEW] Speedup Connected Components Oct 22, 2024
@VibhuJawa VibhuJawa force-pushed the vjawa/speedup_cc_in_fuzzy_dedup branch from df62a1f to 8396237 Compare October 22, 2024 22:42
Collaborator

@praateekmahajan praateekmahajan left a comment

LGTM, would wait for @ayushdg to also TAL!
Great speedup 🎊

@VibhuJawa
Collaborator Author

@ayushdg, please take a look. Let's land this soon. Have addressed all the changes.

2 participants