Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix numWorkItemsArePending bug #1102

Conversation

mikaylathompson
Copy link
Collaborator

Description

The numWorkItemsArePending function had a parameter for maximum results, but if that field was set to -1 (intended to be no maximum), it just wasn't set in the API call, which actually meant that it defaulted to a max of 10.

There was also a note in the function about switching it to use _count, and that was an easy way to fix the bug. The max results param was no longer relevant, so I removed it.

Issues Resolved

n/a

Testing

This being broken was breaking a different test, so fixing it has made that test pass.

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link

codecov bot commented Oct 23, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 3 lines in your changes missing coverage. Please review.

Project coverage is 80.66%. Comparing base (db0075e) to head (7dc4f28).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
...ad/workcoordination/OpenSearchWorkCoordinator.java 62.50% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               main    #1102   +/-   ##
=========================================
  Coverage     80.66%   80.66%           
- Complexity     2893     2906   +13     
=========================================
  Files           383      384    +1     
  Lines         14361    14360    -1     
  Branches        989      989           
=========================================
  Hits          11584    11584           
+ Misses         2184     2183    -1     
  Partials        593      593           
Flag Coverage Δ
gradle-test 78.75% <66.66%> (+<0.01%) ⬆️
python-test 90.33% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Mikayla Thompson <thomika@amazon.com>
Signed-off-by: Mikayla Thompson <thomika@amazon.com>
}

@Override
public boolean workItemsArePending(Supplier<IWorkCoordinationContexts.IPendingWorkItemsContext> contextSupplier)
throws IOException, InterruptedException {
return numWorkItemsArePending(1, contextSupplier) >= 1;
return numWorkItemsArePendingInternal(contextSupplier) >= 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very different call now. My original intention was that in this case, I really only needed to know if the set was empty or not. Counting up thousands of documents won't be necessary. If -1 wasn't working to return the whole list, we should probably just have two separate functions - isEmpty() and count().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I ran some tests with a couple thousand items in the index (And it turns out that adding 3790 items pretty quickly basically completely locks up a testcontainer cluster -- even with 30 second sleeps between attempts, I haven't been able to add a single document to it in >10 minutes -- our worst case situations for blocking indices are really bad).

The tests here are 1/ _count, 2/ _search with size=1, 3/ _search with terminate_after=1. I ran each test 5 times, but I think we should just compare on the first because it looks like they're just cached after that.

_count: 0.110 total
_search with size=1: 0.058 total
_search with terminate_after=1: 0.080 total

Conveniently _search with size=1 includes a block like "hits":{"total":{"value":3790,"relation":"eq"}}, so it is actually computing (either exact or an approximation) the total number of hits. I do wonder how it's doing that and still twice as fast as _count, but we might as well take advantage of that and replace the whole query here with this. Amusingly, this is actually what I originally did, but then i saw the note about _count and reworked it.

❯ for i in {1..5}; do time curl http://localhost:58803/.migrations_working_state/_count -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}}'; done
{"count":3790,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}curl http://localhost:58803/.migrations_working_state/_count -H  -d   0.00s user 0.01s system 6% cpu 0.110 total
{"count":3790,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}curl http://localhost:58803/.migrations_working_state/_count -H  -d   0.00s user 0.01s system 10% cpu 0.068 total
{"count":3790,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}curl http://localhost:58803/.migrations_working_state/_count -H  -d   0.00s user 0.01s system 24% cpu 0.032 total
{"count":3790,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}curl http://localhost:58803/.migrations_working_state/_count -H  -d   0.00s user 0.00s system 22% cpu 0.027 total
{"count":3790,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}curl http://localhost:58803/.migrations_working_state/_count -H  -d   0.00s user 0.00s system 25% cpu 0.023 total
❯
❯
❯ for i in {1..5}; do time curl http://localhost:58803/.migrations_working_state/_search -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}, "size": 1}'; done
{"took":17,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":3790,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}curl http://localhost:58803/.migrations_working_state/_search -H  -d   0.00s user 0.00s system 12% cpu 0.058 total
{"took":3,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":3790,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}curl http://localhost:58803/.migrations_working_state/_search -H  -d   0.00s user 0.00s system 23% cpu 0.029 total
{"took":13,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":3790,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}curl http://localhost:58803/.migrations_working_state/_search -H  -d   0.00s user 0.00s system 16% cpu 0.042 total
{"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":3790,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}curl http://localhost:58803/.migrations_working_state/_search -H  -d   0.00s user 0.00s system 25% cpu 0.026 total
{"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":3790,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}curl http://localhost:58803/.migrations_working_state/_search -H  -d   0.00s user 0.00s system 26% cpu 0.025 total
❯
❯
❯ for i in {1..5}; do time curl http://localhost:58803/.migrations_working_state/_search -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}, "terminate_after": 1}'; done
{"took":21,"timed_out":false,"terminated_early":true,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}curl http://localhost:58803/.migrations_working_state/_search -H  -d   0.00s user 0.01s system 11% cpu 0.080 total
{"took":3,"timed_out":false,"terminated_early":true,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}curl http://localhost:58803/.migrations_working_state/_search -H  -d   0.00s user 0.00s system 29% cpu 0.024 total
{"took":3,"timed_out":false,"terminated_early":true,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}curl http://localhost:58803/.migrations_working_state/_search -H  -d   0.00s user 0.00s system 30% cpu 0.024 total
{"took":4,"timed_out":false,"terminated_early":true,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}curl http://localhost:58803/.migrations_working_state/_search -H  -d   0.00s user 0.00s system 23% cpu 0.026 total
{"took":2,"timed_out":false,"terminated_early":true,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".migrations_working_state","_type":"_doc","_id":"R492","_score":1.0,"_source":{"numAttempts":0,"scriptVersion":"poc","creatorId":"docCreatorWorker","expiration":0}}]}}curl http://localhost:58803/.migrations_working_state/_search -H  -d   0.00s user 0.01s system 31% cpu 0.025 total

Signed-off-by: Mikayla Thompson <thomika@amazon.com>
}
}

@Override
public int numWorkItemsArePending(Supplier<IWorkCoordinationContexts.IPendingWorkItemsContext> contextSupplier)
throws IOException, InterruptedException {
return numWorkItemsArePending(-1, contextSupplier);
return numWorkItemsArePendingInternal(contextSupplier);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do you get back the right value for this if you're running the query w/ size=1?

Signed-off-by: Mikayla Thompson <thomika@amazon.com>
Signed-off-by: Mikayla Thompson <thomika@amazon.com>
Signed-off-by: Mikayla Thompson <thomika@amazon.com>
Comment on lines -406 to -407
// TODO: Switch this to use _count
log.warn("Switch this to use _count");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference: The author of this P found that _count was slower than the search

@mikaylathompson mikaylathompson merged commit 3c47e51 into opensearch-project:main Oct 25, 2024
13 of 14 checks passed
@mikaylathompson mikaylathompson deleted the fix-numWorkItemsArePending-bug branch October 25, 2024 03:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants