Fix MG similarity issues #4741

ChuckHastings · 2024-10-31T20:23:10Z

This PR adds MG (multi-GPU) C++ tests for the all-pairs variant of the similarity algorithms. Previously, the all-pairs variant was only tested in SG (single-GPU) mode.

This also addresses an issue where the all-pairs implementation would crash when there was a load imbalance across the GPUs and one of the GPUs ran out of work before the others.

seunghwak · 2024-11-08T01:24:22Z

cpp/src/link_prediction/similarity_impl.cuh

@@ -368,187 +368,196 @@ all_pairs_similarity(raft::handle_t const& handle,
      sum_two_hop_degrees,
      MAX_PAIRS_PER_BATCH);



In the lines above,

top_v1.reserve(*topk, handle.get_stream());
top_v2.reserve(*topk, handle.get_stream());
top_score.reserve(*topk, handle.get_stream());

Shouldn't reserve here be resize?

raft::update_host(&sum_two_hop_degrees, two_hop_degree_offsets.data() + two_hop_degree_offsets.size() - 1, 1, handle.get_stream());

We are missing handle.sync_stream() after this to ensure that sum_two_hop_degrees is ready to use in the following compute_offset_aligned_element_chunks.
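To illustrate why the sync matters, here is a minimal host-only sketch. It assumes a toy model in which queued work runs only when the stream is synchronized. ToyStream, toy_update_host and read_sum are hypothetical names invented for this example, not RAFT or CUDA APIs. On a real stream the unsynchronized read is a race that may or may not observe the copied value; the toy model makes the stale read deterministic so the ordering problem is visible.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for a CUDA stream: work is queued and only runs at sync().
// Hypothetical type for illustration; this is not RAFT or CUDA code.
struct ToyStream {
  std::vector<std::function<void()>> pending;

  void enqueue(std::function<void()> work) { pending.push_back(std::move(work)); }

  void sync()
  {
    for (auto& work : pending) { work(); }
    pending.clear();
  }
};

// Models an asynchronous copy of one element to a host variable.
void toy_update_host(int* host_dst, int const* device_src, ToyStream& stream)
{
  stream.enqueue([host_dst, device_src] { *host_dst = *device_src; });
}

// Enqueues the copy, optionally syncs, then reads the host variable.
int read_sum(bool sync_before_read)
{
  int device_value        = 42;  // pretend: last element of two_hop_degree_offsets
  int sum_two_hop_degrees = 0;   // host destination, initially stale
  ToyStream stream;

  toy_update_host(&sum_two_hop_degrees, &device_value, stream);
  if (sync_before_read) { stream.sync(); }

  return sum_two_hop_degrees;  // stale unless the stream was synchronized
}
```

In this model read_sum(false) returns the stale initial value and read_sum(true) returns the copied one, which is the ordering that handle.sync_stream() would enforce before compute_offset_aligned_element_chunks consumes sum_two_hop_degrees.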

seunghwak · 2024-11-08T01:27:10Z

cpp/src/link_prediction/similarity_impl.cuh

+      raft::device_span<vertex_t const> batch_seeds{tmp_vertices.data(), size_t{0}};
+
+      if (((batch_number + 1) < batch_offsets.size()) &&
+          (batch_offsets[batch_number + 1] > batch_offsets[batch_number])) {


(batch_number + 1) < batch_offsets.size() should always be true here, right? batch_number < num_batches and batch_offsets.size() is num_batches + 1.
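A small sketch of the invariant behind this question, assuming batch_offsets is a non-empty offsets array with num_batches + 1 entries, as described in the comment. The helper names and the example offsets are hypothetical. Under that assumption only the second half of the condition (the batch being non-empty) can ever be false.

```cpp
#include <cstddef>
#include <vector>

// Number of seeds in batch `batch_number`, given offsets of length num_batches + 1.
// Assumes batch_number < batch_offsets.size() - 1.
std::size_t batch_size(std::vector<std::size_t> const& batch_offsets, std::size_t batch_number)
{
  return batch_offsets[batch_number + 1] - batch_offsets[batch_number];
}

// Checks that (batch_number + 1) < batch_offsets.size() holds for every
// batch_number in [0, num_batches). Assumes batch_offsets is non-empty.
bool bounds_check_always_true(std::vector<std::size_t> const& batch_offsets)
{
  std::size_t const num_batches = batch_offsets.size() - 1;
  for (std::size_t batch_number = 0; batch_number < num_batches; ++batch_number) {
    if (!((batch_number + 1) < batch_offsets.size())) { return false; }
  }
  return true;
}
```

With offsets {0, 4, 4, 9} there are three batches, the bounds check holds for all of them, and the middle batch is empty, which is the case the second comparison guards against.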

seunghwak · 2024-11-08T01:32:29Z

cpp/src/link_prediction/similarity_impl.cuh

-        if (top_score.size() == *topk) {
-          raft::update_host(
-            &similarity_threshold, top_score.data() + *topk - 1, 1, handle.get_stream());
+      if (top_score.size() == *topk) {


If you print top_score.size(), it is 10 on rank 0 and 0 on rank 1, so only rank 0 participates in the host_scalar_bcast. This is causing the hang you see.
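The failure mode can be modeled without any communication library. The sketch below assumes a collective completes only when every rank calls it, and treats each vector entry as top_score.size() on one rank. The function names are hypothetical, and the uniform guard is only one possible way to restore agreement, not necessarily the change made in this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// local_sizes[r] models top_score.size() on rank r.
// With a per-rank guard (size == topk), only some ranks may enter the
// collective; it hangs when at least one rank enters and at least one does not.
bool hangs_with_local_guard(std::vector<std::size_t> const& local_sizes, std::size_t topk)
{
  auto const participants =
    static_cast<std::size_t>(std::count(local_sizes.begin(), local_sizes.end(), topk));
  return participants > 0 && participants < local_sizes.size();
}

// One possible alternative: every rank evaluates the same guard, derived from
// a value all ranks agree on. Here that value is "does any rank hold topk
// results"; in a real MG run it would itself come from a collective that all
// ranks call unconditionally.
bool hangs_with_uniform_guard(std::vector<std::size_t> const& local_sizes, std::size_t topk)
{
  bool const any_full = std::count(local_sizes.begin(), local_sizes.end(), topk) > 0;
  std::size_t const participants = any_full ? local_sizes.size() : 0;
  return participants > 0 && participants < local_sizes.size();
}
```

For the sizes reported above (10 on rank 0, 0 on rank 1, topk of 10), the local guard yields one participant out of two, which is the hang; a guard that is identical on all ranks yields either all ranks or none.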

seunghwak · 2024-11-08T01:33:08Z

cpp/src/link_prediction/similarity_impl.cuh

+      thrust::copy(
+        handle.get_thrust_policy(), v1.begin(), v1.begin() + top_v1.size(), top_v1.begin());
+      thrust::copy(
+        handle.get_thrust_policy(), v2.begin(), v2.begin() + top_v1.size(), top_v2.begin());
+      thrust::copy(handle.get_thrust_policy(),
+                   score.begin(),
+                   score.begin() + top_v1.size(),
+                   top_score.begin());


Make sure top_v1 and top_v2 are properly resized here (not just reserved).
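The difference shows up directly in the copy bounds, since the copies above take their length from top_v1.size(). The sketch below uses std::vector as a host-side stand-in for the device buffers (an assumption made so the example runs without a GPU); kept_after_copy is a hypothetical helper. As with std::vector, reserving capacity does not change size().

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Copies the leading scores into a fresh buffer and returns how many elements
// the buffer reports afterwards. The copy length comes from the destination's
// size(), mirroring the pattern in the diff.
std::size_t kept_after_copy(std::vector<float> const& score, std::size_t topk, bool use_resize)
{
  std::size_t const n = std::min(topk, score.size());
  std::vector<float> top_score;

  if (use_resize) {
    top_score.resize(n);   // size() becomes n
  } else {
    top_score.reserve(n);  // capacity grows, size() stays 0
  }

  // With reserve only, this range is empty and nothing is copied.
  std::copy(score.begin(), score.begin() + top_score.size(), top_score.begin());
  return top_score.size();
}
```

With three scores and topk of 2, the reserve path keeps zero elements and the resize path keeps two, so a buffer that was only reserved silently produces an empty result.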

Refactor similarity tests to have MG test all-pairs logic

969ec7d

github-actions bot added the cuGraph label Oct 31, 2024

add fix for batch size anomalies

1e5b5ce

seunghwak reviewed Nov 8, 2024
