-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix MG similarity issues #4741
base: branch-24.12
Are you sure you want to change the base?
Fix MG similarity issues #4741
Conversation
@@ -368,187 +368,196 @@ all_pairs_similarity(raft::handle_t const& handle, | |||
sum_two_hop_degrees, | |||
MAX_PAIRS_PER_BATCH); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the lines above,
top_v1.reserve(*topk, handle.get_stream());
top_v2.reserve(*topk, handle.get_stream());
top_score.reserve(*topk, handle.get_stream());
Shouldn't reserve
here be resize
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raft::update_host(&sum_two_hop_degrees,
two_hop_degree_offsets.data() + two_hop_degree_offsets.size() - 1,
1,
handle.get_stream());
We are missing handle.sync_stream()
after this to ensure that sum_two_hop_degrees
is ready to use in the following compute_offset_aligned_element_chunks
.
raft::device_span<vertex_t const> batch_seeds{tmp_vertices.data(), size_t{0}}; | ||
|
||
if (((batch_number + 1) < batch_offsets.size()) && | ||
(batch_offsets[batch_number + 1] > batch_offsets[batch_number])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(batch_number + 1) < batch_offsets.size()
should always be true here, right? batch_number < num_batches
and batch_offsets.size()
is num_batches + 1
.
if (top_score.size() == *topk) { | ||
raft::update_host( | ||
&similarity_threshold, top_score.data() + *topk - 1, 1, handle.get_stream()); | ||
if (top_score.size() == *topk) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Print top_score.size()
. It is 10 in rank0, 0 in rank 1. So, only rank0 participates in the host_scalar_bcast. This is causing the hang you see.
thrust::copy( | ||
handle.get_thrust_policy(), v1.begin(), v1.begin() + top_v1.size(), top_v1.begin()); | ||
thrust::copy( | ||
handle.get_thrust_policy(), v2.begin(), v2.begin() + top_v1.size(), top_v2.begin()); | ||
thrust::copy(handle.get_thrust_policy(), | ||
score.begin(), | ||
score.begin() + top_v1.size(), | ||
top_score.begin()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure top_v1 and top_v2 are properly re-sized here (not just reserved).
This PR adds C++ tests for the all-pairs variation of similarity algorithms. Previously the all-pairs variation was only tested in SG mode.
This also addresses an issue where the all-pairs implementation would crash when there was a load imbalance across the GPUs and one of the GPUs ran out of work before the others.