Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MG similarity issues #4741

Draft
wants to merge 2 commits into
base: branch-24.12
Choose a base branch
from

Conversation

ChuckHastings
Copy link
Collaborator

This PR adds C++ tests for the all-pairs variation of similarity algorithms. Previously the all-pairs variation was only tested in SG mode.

This also addresses an issue where the all-pairs implementation would crash when there was a load imbalance across the GPUs and one of the GPUs ran out of work before the others.

@@ -368,187 +368,196 @@ all_pairs_similarity(raft::handle_t const& handle,
sum_two_hop_degrees,
MAX_PAIRS_PER_BATCH);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the lines above,

    top_v1.reserve(*topk, handle.get_stream());
    top_v2.reserve(*topk, handle.get_stream());
    top_score.reserve(*topk, handle.get_stream());

Shouldn't reserve here be resize?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    raft::update_host(&sum_two_hop_degrees,
                      two_hop_degree_offsets.data() + two_hop_degree_offsets.size() - 1,
                      1,
                      handle.get_stream());

We are missing handle.sync_stream() after this to ensure that sum_two_hop_degrees is ready to use in the following compute_offset_aligned_element_chunks.

raft::device_span<vertex_t const> batch_seeds{tmp_vertices.data(), size_t{0}};

if (((batch_number + 1) < batch_offsets.size()) &&
(batch_offsets[batch_number + 1] > batch_offsets[batch_number])) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(batch_number + 1) < batch_offsets.size() should always be true here, right? batch_number < num_batches and batch_offsets.size() is num_batches + 1.

if (top_score.size() == *topk) {
raft::update_host(
&similarity_threshold, top_score.data() + *topk - 1, 1, handle.get_stream());
if (top_score.size() == *topk) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Print top_score.size(). It is 10 in rank0, 0 in rank 1. So, only rank0 participates in the host_scalar_bcast. This is causing the hang you see.

Comment on lines +493 to +500
thrust::copy(
handle.get_thrust_policy(), v1.begin(), v1.begin() + top_v1.size(), top_v1.begin());
thrust::copy(
handle.get_thrust_policy(), v2.begin(), v2.begin() + top_v1.size(), top_v2.begin());
thrust::copy(handle.get_thrust_policy(),
score.begin(),
score.begin() + top_v1.size(),
top_score.begin());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure top_v1 and top_v2 are properly re-sized here (not just reserved).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants