Fix numWorkItemsArePending bug #1102

mikaylathompson · 2024-10-23T22:05:58Z

Description

The numWorkItemsArePending function took a maximum-results parameter. When that parameter was -1 (intended to mean no maximum), the size was simply left out of the API call, so the search fell back to the default maximum of 10 results.

There was also a note in the function about switching it to use _count, and that was an easy way to fix the bug. The max results param was no longer relevant, so I removed it.
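
To make the failure mode concrete, here is a minimal Java sketch of how a -1 sentinel can silently become a limit of 10. The class and method names are hypothetical, and the match_all query is a placeholder rather than the project's actual query. The only outside fact it relies on is that _search returns at most 10 hits when no size is given.

```java
/** Hypothetical illustration of the bug; these names are not taken from the project. */
final class SearchBodySketch {
    // Default number of hits _search returns when the request carries no "size".
    static final int DEFAULT_SEARCH_SIZE = 10;

    // Buggy shape: -1 is meant as "no maximum", but the only effect is that
    // "size" is omitted from the request body.
    static String buggyBody(int maxResults) {
        StringBuilder sb = new StringBuilder("{\"query\": {\"match_all\": {}}");
        if (maxResults >= 0) {
            sb.append(", \"size\": ").append(maxResults);
        }
        return sb.append("}").toString();
    }

    // How many hits come back for a body built by buggyBody, given the server default.
    static int returnedHits(int maxResults, int matchingDocs) {
        int size = maxResults >= 0 ? maxResults : DEFAULT_SEARCH_SIZE;
        return Math.min(size, matchingDocs);
    }
}
```

With 3790 matching documents, returnedHits(-1, 3790) is 10, so any caller that counts the returned hits sees at most 10 pending items.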

Issues Resolved

n/a

Testing

The bug was causing a different test to fail; with this fix, that test passes.

Check List

- New functionality includes testing
- All tests pass, including unit test, integration test and doctest
- New functionality has been documented
- Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov · 2024-10-23T22:06:51Z

Codecov Report

Attention: Patch coverage is 66.66667% with 3 lines in your changes missing coverage. Please review.

Project coverage is 80.66%. Comparing base (db0075e) to head (7dc4f28).
Report is 6 commits behind head on main.

Files with missing lines	Patch %	Lines
...ad/workcoordination/OpenSearchWorkCoordinator.java	62.50%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##               main    #1102   +/-   ##
=========================================
  Coverage     80.66%   80.66%           
- Complexity     2893     2906   +13     
=========================================
  Files           383      384    +1     
  Lines         14361    14360    -1     
  Branches        989      989           
=========================================
  Hits          11584    11584           
+ Misses         2184     2183    -1     
  Partials        593      593

Flag	Coverage Δ
gradle-test	`78.75% <66.66%> (+<0.01%)`	⬆️
python-test	`90.33% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.



gregschohn · 2024-10-24T02:49:47Z

...main/java/org/opensearch/migrations/bulkload/workcoordination/OpenSearchWorkCoordinator.java

    }

    @Override
    public boolean workItemsArePending(Supplier<IWorkCoordinationContexts.IPendingWorkItemsContext> contextSupplier)
        throws IOException, InterruptedException {
-        return numWorkItemsArePending(1, contextSupplier) >= 1;
+        return numWorkItemsArePendingInternal(contextSupplier) >= 1;


This is a very different call now. My original intention was that in this case, I really only needed to know whether the set was empty or not. Counting up thousands of documents isn't necessary. If -1 wasn't working to return the whole list, we should probably just have two separate functions: isEmpty() and count().
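
A rough sketch of that split, using an in-memory Iterable as a stand-in for the work-item index. The names are hypothetical and this is not the coordinator's API. It only illustrates that an emptiness check can stop at the first element, while a count has to visit every one.

```java
import java.util.Iterator;

/** Hypothetical helpers; an Iterable stands in for the index of pending work items. */
final class PendingWorkQueries {
    private PendingWorkQueries() {}

    // Emptiness check: looks at no more than the first element.
    static boolean isEmpty(Iterable<String> pendingItemIds) {
        Iterator<String> it = pendingItemIds.iterator();
        return !it.hasNext();
    }

    // Full count: walks every element.
    static long count(Iterable<String> pendingItemIds) {
        long n = 0;
        for (String ignored : pendingItemIds) {
            n++;
        }
        return n;
    }
}
```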

mikaylathompson · 2024-10-24

Okay, I ran some tests with a couple thousand items in the index. (It turns out that adding 3790 items in quick succession basically locks up a testcontainer cluster completely: even with 30-second sleeps between attempts, I haven't been able to add a single document to it in more than 10 minutes. Our worst-case situations for blocking indices are really bad.)

The tests are (1) _count, (2) _search with size=1, and (3) _search with terminate_after=1. I ran each test 5 times, but I think we should compare only the first run of each, because the later runs look like they're served from cache.

_count: 0.110 total
_search with size=1: 0.058 total
_search with terminate_after=1: 0.080 total

Conveniently, the _search with size=1 response includes a block like "hits":{"total":{"value":3790,"relation":"eq"}}, so it is actually computing the total number of hits (either exactly or as an approximation). I do wonder how it does that while still being about twice as fast as _count, but we might as well take advantage of it and replace the whole query here with this. Amusingly, this is what I originally did, but then I saw the note about _count and reworked it.
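
As a hedged illustration of reading that hits.total block, the Java sketch below pulls the value and relation out of a _search response body. The class name is hypothetical, and the regular expression is a simplification so the example needs no JSON library; real code would more likely use a JSON parser. One caveat worth stating: a relation of "gte" means the value is only a lower bound, which can happen when total-hit tracking is capped (10,000 by default unless track_total_hits is set on the request).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not the project's code: reads hits.total from a _search response. */
final class TotalHits {
    // Matches "total":{"value":N,"relation":"eq|gte"}. The "total":1 entry inside
    // "_shards" is skipped because it is not followed by an object.
    private static final Pattern TOTAL = Pattern.compile(
        "\"total\"\\s*:\\s*\\{\\s*\"value\"\\s*:\\s*(\\d+)\\s*,\\s*\"relation\"\\s*:\\s*\"(eq|gte)\"");

    final long value;    // reported number of matching documents
    final boolean exact; // true for relation "eq", false for "gte" (lower bound only)

    private TotalHits(long value, boolean exact) {
        this.value = value;
        this.exact = exact;
    }

    static TotalHits parse(String responseBody) {
        Matcher m = TOTAL.matcher(responseBody);
        if (!m.find()) {
            throw new IllegalArgumentException("no hits.total object found in response");
        }
        return new TotalHits(Long.parseLong(m.group(1)), "eq".equals(m.group(2)));
    }

    // Answers "are any items pending?"; a lower bound above zero is still enough.
    boolean anyPending() {
        return value > 0;
    }
}
```

For the size=1 response quoted in this comment, TotalHits.parse(body).value is 3790 with exact set to true, even though only one hit is returned in the hits array.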

❯ for i in {1..5}; do time curl http://localhost:58803/.migrations_working_state/_count -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}}'; done
{"count":3790,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}
curl http://localhost:58803/.migrations_working_state/_count -H -d 0.00s user 0.01s system 6% cpu 0.110 total
{"count":3790,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}
curl http://localhost:58803/.migrations_working_state/_count -H -d 0.00s user 0.01s system 10% cpu 0.068 total
{"count":3790,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}
curl http://localhost:58803/.migrations_working_state/_count -H -d 0.00s user 0.01s system 24% cpu 0.032 total
{"count":3790,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}
curl http://localhost:58803/.migrations_working_state/_count -H -d 0.00s user 0.00s system 22% cpu 0.027 total
{"count":3790,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}
curl http://localhost:58803/.migrations_working_state/_count -H -d 0.00s user 0.00s system 25% cpu 0.023 total

❯ for i in {1..5}; do time curl http://localhost:58803/.migrations_working_state/_search -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}, "size": 1}'; done
{"took":17,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":3790,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}
curl http://localhost:58803/.migrations_working_state/_search -H -d 0.00s user 0.00s system 12% cpu 0.058 total
{"took":3,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":3790,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}
curl http://localhost:58803/.migrations_working_state/_search -H -d 0.00s user 0.00s system 23% cpu 0.029 total
{"took":13,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":3790,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}
curl http://localhost:58803/.migrations_working_state/_search -H -d 0.00s user 0.00s system 16% cpu 0.042 total
{"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":3790,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}
curl http://localhost:58803/.migrations_working_state/_search -H -d 0.00s user 0.00s system 25% cpu 0.026 total
{"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":3790,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}
curl http://localhost:58803/.migrations_working_state/_search -H -d 0.00s user 0.00s system 26% cpu 0.025 total

❯ for i in {1..5}; do time curl http://localhost:58803/.migrations_working_state/_search -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}, "terminate_after": 1}'; done
{"took":21,"timed_out":false,"terminated_early":true,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}
curl http://localhost:58803/.migrations_working_state/_search -H -d 0.00s user 0.01s system 11% cpu 0.080 total
{"took":3,"timed_out":false,"terminated_early":true,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}
curl http://localhost:58803/.migrations_working_state/_search -H -d 0.00s user 0.00s system 29% cpu 0.024 total
{"took":3,"timed_out":false,"terminated_early":true,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}
curl http://localhost:58803/.migrations_working_state/_search -H -d 0.00s user 0.00s system 30% cpu 0.024 total
{"took":4,"timed_out":false,"terminated_early":true,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}
curl http://localhost:58803/.migrations_working_state/_search -H -d 0.00s user 0.00s system 23% cpu 0.026 total
{"took":2,"timed_out":false,"terminated_early":true,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}
curl http://localhost:58803/.migrations_working_state/_search -H -d 0.00s user 0.01s system 31% cpu 0.025 total


gregschohn · 2024-10-24T19:26:30Z

...main/java/org/opensearch/migrations/bulkload/workcoordination/OpenSearchWorkCoordinator.java

        }
    }

    @Override
    public int numWorkItemsArePending(Supplier<IWorkCoordinationContexts.IPendingWorkItemsContext> contextSupplier)
        throws IOException, InterruptedException {
-        return numWorkItemsArePending(-1, contextSupplier);
+        return numWorkItemsArePendingInternal(contextSupplier);


How do you get back the right value for this if you're running the query with size=1?


gregschohn · 2024-10-25T03:36:27Z

...main/java/org/opensearch/migrations/bulkload/workcoordination/OpenSearchWorkCoordinator.java

-            // TODO: Switch this to use _count
-            log.warn("Switch this to use _count");


For future reference: the author of this PR found that _count was slower than the search.

mikaylathompson requested review from AndreKurait, chelma, gregschohn, lewijacn, peternied and sumobrian as code owners October 23, 2024 22:05

mikaylathompson added 2 commits October 23, 2024 16:16

Fix bug in numWorkItemsArePending

31ae47d

Signed-off-by: Mikayla Thompson <thomika@amazon.com>

Just switch to _count

e48005b

Signed-off-by: Mikayla Thompson <thomika@amazon.com>

mikaylathompson force-pushed the fix-numWorkItemsArePending-bug branch from 682a2d0 to e48005b Compare October 23, 2024 22:16

gregschohn reviewed Oct 24, 2024


_count -> _search

ab9b653

Signed-off-by: Mikayla Thompson <thomika@amazon.com>

gregschohn reviewed Oct 24, 2024


mikaylathompson added 3 commits October 24, 2024 16:02

Rename for clarity (pending->notYetCompleted)

1a926d1

Signed-off-by: Mikayla Thompson <thomika@amazon.com>

Add extra check around the uncertain case if hits==0

f573157

Signed-off-by: Mikayla Thompson <thomika@amazon.com>

Add comment

7dc4f28

Signed-off-by: Mikayla Thompson <thomika@amazon.com>

gregschohn approved these changes Oct 25, 2024


mikaylathompson merged commit 3c47e51 into opensearch-project:main Oct 25, 2024
13 of 14 checks passed

mikaylathompson deleted the fix-numWorkItemsArePending-bug branch October 25, 2024 03:40
