
[BUG] Search requests with size=0 & terminate_after sometimes return incorrect hit count #10435

Closed
jed326 opened this issue Oct 5, 2023 · 15 comments · Fixed by #12179
Labels: bug, Search

@jed326
Collaborator

jed326 commented Oct 5, 2023

Describe the bug
When both size=0 and terminate_after parameters are provided to a search request, an incorrect total hit count is sometimes returned. This was discovered during the investigation of a flaky test; see #9946 (comment)

To Reproduce
Set size=0 in the search request in the test below.

./gradlew ':server:internalClusterTest' --tests "org.opensearch.search.simple.SimpleSearchIT" -Dtests.method="testSimpleTerminateAfterCount {p0={"search.concurrent_segment_search.enabled":"true"}}" -Dtests.seed=C37FD3C5A66D02BC -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=he-IL -Dtests.timezone=Asia/Qostanay -Druntime.java=17

Expected behavior
The above test should always pass.

@msfroh
Collaborator

msfroh commented Oct 11, 2023

If terminate_after is specified, won't collection terminate after some amount of time? In that case, it won't collect all the hits, so the total hit count may be off?

@msfroh msfroh removed the untriaged label Oct 11, 2023
@jed326
Collaborator Author

jed326 commented Oct 11, 2023

@msfroh these tests are not flaky whenever size != 0, and specifically the tests assert that hit_count == terminate_after count. size=0 is a special case where the TopDocsCollector is not created, since we won't need to return any top docs; however, I'm not sure yet what specifically is causing the flakiness in that code path.

Ref:

for (int i = 1; i < max; i++) {
searchResponse = client().prepareSearch("test")
.setQuery(QueryBuilders.rangeQuery("field").gte(1).lte(max))
.setTerminateAfter(i)
.setSize(size)
.setTrackTotalHits(true)
.get();
assertHitCount(searchResponse, i);
assertTrue(searchResponse.isTerminatedEarly());
assertEquals(Math.min(i, size), searchResponse.getHits().getHits().length);
}

@austintlee
Contributor

I have narrowed down the cause of this "issue" to a change introduced in this PR - #4043. (cc @reta)

Specifically, this change:

// See please https://github.com/apache/lucene/pull/964
collector.setWeight(weight);
leafCollector = collector.getLeafCollector(ctx);

That PR brought in a Lucene feature (https://issues.apache.org/jira/browse/LUCENE-10620) that allows you to take advantage of Weights that can speed up Collectors' work during search.

Now, what is really interesting about this change is that when it is combined with these three (I think ALL three) search conditions: 1) size=0, 2) terminate_after is set, and 3) track_total_hits set to true, it hits the following code path, where weight.count() returns the total count for a Lucene segment (it's an optimization).

https://github.com/apache/lucene/blob/bbf197fdc2822e9f1ba1767f83ae23037336ed27/lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollector.java#L47-L52

public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    int leafCount = weight == null ? -1 : weight.count(context);
    if (leafCount != -1) {
      totalHits += leafCount;
      throw new CollectionTerminatedException();
    }
    // ... otherwise falls through to per-document counting
}
And, of course, that's exactly the setup for this test. But it is not a bug. I would say it's just incompatible with what you might expect when you set terminate_after; it makes use of the intended optimization without actually violating the "contract" of the terminate_after flag (since it does terminate early, and terminatedEarly is set to true in the response as well).

As for the non-deterministic manifestation of this issue, two things seem to contribute. One is that the test uses indexRandom to randomly distribute documents, which (by design) leads to a non-deterministic layout of Lucene segments each time. The other is that the Weight objects are stored in an LRU cache. Also, the test makes a range query, and this too seems to contribute to how the Weights are used, depending on the layout and sequence of document IDs within a segment.

@reta
Collaborator

reta commented Oct 23, 2023

And, of course, that's exactly the setup for this test. But it is not a bug. I would say it's just incompatible with what you might expect when you set terminate_after, but I think it makes use of the intended optimization without actually violating the "contract" of the terminate_after flag (since it does terminate early and terminatedEarly is set to true in the response as well).

Thanks a lot for digging into this one. The issue seems to manifest only in the concurrent search scenario, which probably makes sense since we may go over segments concurrently (depending on the random distribution).

UPD: it does not in fact manifest only during the concurrent search scenario.

Tests with failures:
 - org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterCountSize0 {p0={"search.concurrent_segment_search.enabled":"true"}}
 - org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterCountSize0 {p0={"search.concurrent_segment_search.enabled":"false"}}

@austintlee
Contributor

austintlee commented Oct 23, 2023

It has nothing to do with concurrent search, and I don't know if concurrent search makes it more likely. It has more to do with what's in a segment and how Weight#count is being used.

(Interestingly, it was your concurrent search blog that drew me to this. I thought I'd help with a small contribution.)

@reta
Collaborator

reta commented Oct 23, 2023

It has nothing to do with concurrent search

Right, I corrected myself in #10435 (comment). Sadly we didn't have tests for this specific case before, so the issue stayed unnoticed.

@jed326
Collaborator Author

jed326 commented Oct 23, 2023

@austintlee Thanks for looking into this!

But it is not a bug. I would say it's just incompatible with what you might expect when you set terminate_after, but I think it makes use of the intended optimization without actually violating the "contract" of the terminate_after flag (since it does terminate early and terminatedEarly is set to true in the response as well).

I think there's some more discussion needed on what the correct "contract" for terminate_after actually looks like. The docs define it as "The maximum number of documents OpenSearch should process before terminating the request", which to me sounds like: if I pass a specific terminate_after value in a search request, I would expect the total hit count to have an upper bound of the terminate_after value if terminated_early is true, which is not necessarily the case in the edge case you've identified here.

This is something that we've struggled with for concurrent search as well -- it's not straightforward to have concurrent search terminate early at the exact terminate_after count either. See #8371

Curious to hear your thoughts on this too @sohami

@msfroh
Collaborator

msfroh commented Oct 24, 2023

From the docs it's defined as "The maximum number of documents OpenSearch should process before terminating the request", which to me sounds like if I pass a specific terminate_after value in a search request, I would expect the total hit count to have an upper bound of the terminate_after value if terminated_early is true, which is not necessarily true in the edge case you've identified here.

But it's not processing that many docs -- it's using the count returned from the Weight. If the Weight returned a count, it did so in O(1) time. See https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/search/Weight.html#count(org.apache.lucene.index.LeafReaderContext):

The default implementation returns -1 for every query. This indicates that the count could not be computed in O(1) time.

So, if the Weight returned a value other than -1, it did so in O(1) time, without processing documents. The contract is still upheld.
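The Weight#count contract described above can be modeled with a minimal plain-Java sketch. To be clear, this is not the actual Lucene code; the interface and class names here are hypothetical stand-ins chosen to illustrate the "return the count in O(1) or return -1" convention:

```java
// Hypothetical model of Lucene's Weight#count contract, for illustration only.
interface SegmentWeight {
    // Returns the exact match count for a segment if it can be computed in
    // O(1) time (e.g. a match-all query over a segment with no deletions);
    // otherwise -1, meaning the caller must iterate documents.
    int count(int segmentDocCount);
}

public class WeightCountSketch {
    // A weight that can answer the count cheaply: O(1), no docs processed.
    static final SegmentWeight MATCH_ALL = docs -> docs;
    // The default behavior: the count is unknown without iterating.
    static final SegmentWeight UNKNOWN = docs -> -1;

    // Mimics the shortcut in TotalHitCountCollector#getLeafCollector:
    // take the O(1) answer when available, otherwise count doc-by-doc.
    static int totalHits(SegmentWeight w, int segmentDocCount) {
        int leafCount = w.count(segmentDocCount);
        if (leafCount != -1) {
            return leafCount; // counted without visiting a single document
        }
        int hits = 0;
        for (int doc = 0; doc < segmentDocCount; doc++) {
            hits++; // per-document collection path
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(totalHits(MATCH_ALL, 21)); // shortcut path
        System.out.println(totalHits(UNKNOWN, 21));   // iteration path
    }
}
```

Both paths report the same count here; the difference is purely in cost, which is why taking the shortcut does not "process" any documents in the sense the terminate_after docs describe.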

@sohami
Collaborator

sohami commented Oct 26, 2023

Sorry for delayed response.

@msfroh Even though the weight.count optimization is an O(1) operation, it is counting more hits than what the user expected and not really holding to the terminate_after contract. I think hit_count should reflect the number of processed documents as well. Is there any other case where the hit_count returned is greater than the number of processed documents? This can be even more confusing when aggregators are in the picture in the query.

Let's assume there is a single leaf with 21 matches. So the collector chain for this test query is of the form: EarlyTerminatingCollector(maxHitCount=2, say) ---> TotalHitCountCollector(). If the optimization for weight.count() kicks in, then TotalHitCountCollector will record that value while creating the leafCollector itself and will not participate in collection anymore, leaving EarlyTerminatingCollector to process the collection of this leaf's docs. EarlyTerminatingCollector will terminate the collection after the 3rd match, but the total hits returned will be 21 in this case.

Now assume there is an Aggregator in the mix, like a histo agg. What will happen is the collector tree will be of the form: EarlyTerminatingCollector(maxHitCount=2, say) ---> TotalHitCountCollector() ---> BucketCollector(histo). Again, TotalHitCountCollector will say the hit count is 21 and will get removed from the leaf collector chain, leaving behind EarlyTerminatingCollector ---> BucketCollector. Then after collecting 2 docs the query will be terminated. The user will see only 2 docs collected in the aggs output, but the hit count will be 21.
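The collector-chain behavior described above can be sketched as a toy simulation. This is plain Java with illustrative names, not the real OpenSearch/Lucene collectors: when the weight-count shortcut is available, the hit counter records the whole segment's count up front and drops out of the chain, while early termination still caps how many docs the downstream collectors actually see:

```java
// Toy model of the collector chain described above; class and method names
// are illustrative, not the actual OpenSearch/Lucene classes.
public class TerminateAfterSketch {
    // Searches one segment and returns { reported total hits, docs that the
    // downstream collectors (e.g. an aggregation) actually collected }.
    static int[] search(int segmentMatches, int terminateAfter, boolean weightCountAvailable) {
        int totalHits = 0;
        int docsCollected = 0;
        if (weightCountAvailable) {
            // TotalHitCountCollector takes the O(1) weight.count() shortcut,
            // records the full segment count, and leaves the chain.
            totalHits = segmentMatches;
        }
        // EarlyTerminatingCollector still drives per-doc collection for the
        // remaining collectors, stopping once terminate_after is reached.
        for (int doc = 0; doc < segmentMatches; doc++) {
            if (docsCollected >= terminateAfter) {
                break; // early termination
            }
            docsCollected++;
            if (!weightCountAvailable) {
                totalHits++; // counted the slow way, one doc at a time
            }
        }
        return new int[] { totalHits, docsCollected };
    }

    public static void main(String[] args) {
        int[] withShortcut = search(21, 2, true);
        int[] withoutShortcut = search(21, 2, false);
        // With the shortcut: total hits reports 21, yet only 2 docs reached
        // the downstream collectors.
        System.out.println(withShortcut[0] + " hits / " + withShortcut[1] + " collected");
        // Without it: both numbers are capped by terminate_after.
        System.out.println(withoutShortcut[0] + " hits / " + withoutShortcut[1] + " collected");
    }
}
```

This reproduces the divergence discussed in the thread: the 21 vs. 2 split only appears when the O(1) count is available, which matches the observation that the flakiness depends on segment layout.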

I guess the question is: in the terminate_after + track_total_hits=true scenario, what is the point of returning this extra hit count above the terminate_after value? To the user it can be confusing, because somehow OS is able to find a total count of docs matching the query which is greater than terminate_after, while the aggs or any collector in the chain has only collected at most terminate_after documents. I tried with a sample test and here is the response:

{
  "took": 148625,
  "timed_out": false,
  "terminated_early": true,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": { "value": 21, "relation": "eq" }, "max_score": null, "hits": [] },
  "aggregations": {
    "histo": {
      "buckets": [
        { "key": 15.0, "doc_count": 1 },
        { "key": 16.0, "doc_count": 0 },
        { "key": 17.0, "doc_count": 0 },
        { "key": 18.0, "doc_count": 0 },
        { "key": 19.0, "doc_count": 0 },
        { "key": 20.0, "doc_count": 0 },
        { "key": 21.0, "doc_count": 1 }
      ]
    }
  }
}

@austintlee
Contributor

@sohami Did you run that query with size=0? Also, keep in mind that this specific instance of the problem doesn't always occur, so I am not sure what conclusion we can draw from the response you posted above...

I think we should document this behavior as best we can. I don't think anyone is proposing we make further code changes to change this behavior.

@msfroh
Collaborator

msfroh commented Oct 26, 2023

IMO, we should just clearly document that the purpose of terminate_after is to avoid iterating through too many documents -- it's essentially a way of saying "I'm willing to accept approximately-correct results in order to limit the cost of my query", not "I would like incorrect results and will be offended if you give me the correct answer efficiently".

@sohami
Collaborator

sohami commented Oct 26, 2023

@sohami Did you run that query with size = 0? Also, keep in mind that the specific instance of this problem doesn't always occur so I am not sure what conclusion we can draw from the response you posted above...

@austintlee Yes, the query was with size=0. I just updated the same query in the test to include the aggregation. It will not occur if the optimization to fetch the hit count via weight.count doesn't kick in, which depends upon the weights created for the query.

I think we should document this behavior as best we can. I don't think anyone is proposing we make further code changes to change this behavior.

That's what I am discussing here: whether we need to handle this case separately instead of just keeping the current behavior with proper documentation. Seems like the current alignment is to document it properly.

IMO, we should just clearly document that the purpose of terminate_after is to avoid iterating through too many documents

I guess then in the documentation we should be very clear that hits.count in the response is not controlled by terminate_after. The latter controls the number of matched documents collected/processed, whereas the former is the count of matching docs, and both pieces of information can be fetched independently for each shard?

@austintlee
Contributor

Later controls collecting/processing number of the matched document whereas former is count of matching docs and both the information can be fetched independently for each shard ?

"Independently" would be a mischaracterization, as they are still correlated.

How about:

The latter controls the number of matched documents collected/processed, whereas the former is the count of matching docs. The actual values of hits.count and terminate_after can differ when Weight#count, a sub-linear-time counting optimization, contributes a higher count than is needed to satisfy the counting requirement set by terminate_after. Put another way, OpenSearch can yield higher output (hits.count) while maintaining the same input cost (terminate_after) thanks to the aforementioned optimization.

@jed326
Collaborator Author

jed326 commented Nov 8, 2023

Apologies for the super delayed response here. I'm also in agreement that the best path forward is to document this difference. In addition to the reasons previously discussed, this "issue" has been present since OpenSearch 2.2.0 and there don't seem to have been any complaints, so adding an explanation to the docs seems sufficient.

I'm happy to close out this issue with PR #10836 from @austintlee -- I will create a docs issue to follow up on the documentation updates.
