Query grouping framework for Top N queries and group by query similarity #66

Merged
merged 29 commits into from
Sep 4, 2024

Conversation

Collaborator

@deshsidd deshsidd commented Aug 2, 2024

Description

For Top N queries by latency, some (or most) of the Top N entries can be duplicates of the same query. For example, if the same dashboard query is triggered continuously and happens to be the most expensive query in terms of latency, the Top N queries by latency will likely be flooded by that one query. To handle such scenarios and to get a more detailed view of the Top N query patterns, we have implemented grouping of Top N queries by similarity. As a follow-up, we can also use this framework to group Top N queries by frequency, user_id, etc.

Major changes:

  1. Query Grouping Service that groups queries based on a group_id and uses a min and max priority queue approach, as discussed in the RFC.
  2. Created the Measurement class as an abstraction over a number that stores the measurement for a specific MetricType. A Measurement can support a DimensionType (Average, Sum). For grouping by similarity we use the average latency, average CPU and average memory to maintain the ordering.
  3. Added a GroupingType enum that describes how we group the Top N queries (similarity, user_id).
  4. The grouping setting applies to ALL metric types; it cannot be set for only a subset of MetricType, as discussed in the RFC.
  5. Each TopQueriesService has its own instance of QueryGroupingService. There is one TopQueriesService for each MetricType.
  6. In QueryInsightsService we add ALL the records to the queryRecordsQueue for the TopQueriesService to consume if the search.query.metric feature is enabled or grouping is enabled. Note that in this case we skip the optimization that drops records smaller than the Nth record.
public boolean addRecord(final SearchQueryRecord record) {
    boolean shouldAdd = isSearchQueryMetricsFeatureEnabled() || isGroupingEnabled();
    if (!shouldAdd) {
        for (Map.Entry<MetricType, TopQueriesService> entry : topQueriesServices.entrySet()) {
            if (!enableCollect.get(entry.getKey())) {
                continue;
            }
            List<SearchQueryRecord> currentSnapshot = entry.getValue().getTopQueriesCurrentSnapshot();
            // skip add to top N queries store if the incoming record is smaller than the Nth record
            if (currentSnapshot.size() < entry.getValue().getTopNSize()
                || SearchQueryRecord.compare(record, currentSnapshot.get(0), entry.getKey()) > 0) {
                shouldAdd = true;
                break;
            }
        }
    }
    if (shouldAdd) {
        return queryRecordsQueue.offer(record);
    }
    return false;
}
  7. Added exhaustive unit tests for QueryGroupingService.
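
As a rough illustration of the Measurement idea in item 2, the sketch below keeps a running average for one metric. The class name `AveragingMeasurement` and its methods are hypothetical and are not the plugin's actual `Measurement` API; storing a sum and a count is an assumption that only mirrors the `number` and `count` fields visible in the API responses further down.

```java
// Hypothetical sketch: not the plugin's Measurement class.
final class AveragingMeasurement {
    private double sum;  // running total of all observed values
    private int count;   // number of observations folded in so far

    AveragingMeasurement(double firstValue) {
        this.sum = firstValue;
        this.count = 1;
    }

    // Fold another observation of the same metric into this measurement.
    void add(double value) {
        sum += value;
        count++;
    }

    // The average is what would be used to order groups against each other.
    double average() {
        return sum / count;
    }

    int count() {
        return count;
    }

    public static void main(String[] args) {
        AveragingMeasurement latency = new AveragingMeasurement(774);
        latency.add(11);
        latency.add(9);
        System.out.println(latency.average() + " average over " + latency.count() + " queries");
    }
}
```

Keeping the sum and count separately means a new observation can be folded in with constant work and no loss of precision from re-averaging.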

Issues Resolved

addresses #13357

Configure Grouping

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search.insights.top_queries.group_by": "similarity"
  }
}
'
{"acknowledged":true,"persistent":{"search":{"insights":{"top_queries":{"group_by":"similarity"}}}},"transient":{}}

Configure Grouping Error Response

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search.insights.top_queries.group_by": "similarit"
  }
}'

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"illegal value can't update [search.insights.top_queries.group_by] from [similarity] to [similarit]"}],"type":"illegal_argument_exception","reason":"illegal value can't update [search.insights.top_queries.group_by] from [similarity] to [similarit]","caused_by":{"type":"illegal_argument_exception","reason":"Invalid grouping type [similarit], type should be one of [SIMILARITY, USER_ID, NONE]"}},"status":400}
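
The validation behind this error can be pictured roughly as follows. This is a minimal sketch, not the plugin's code: the enum name `GroupingTypeSketch` and its `parse` method are hypothetical, and only the three values and the message text are taken from the response above.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical sketch; only the values and the message wording come from the error response.
enum GroupingTypeSketch {
    SIMILARITY,
    USER_ID,
    NONE;

    // Parse a setting value case-insensitively, rejecting anything that is not a known type.
    static GroupingTypeSketch parse(String value) {
        try {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                "Invalid grouping type [" + value + "], type should be one of " + Arrays.toString(values())
            );
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("similarity"));
        try {
            parse("similarit");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```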

Get Top N Queries by latency with grouping enabled, group_by SIMILARITY

curl -XGET "http://localhost:9200/_insights/top_queries"
{
  "top_queries": [
    {
      "timestamp": 1722630496342,
      "query_hashcode": 29791,
      "search_type": "query_then_fetch",
      "task_resource_usages": [
        {
          "action": "indices:data/read/search[phase/query]",
          "taskId": 135,
          "parentTaskId": 134,
          "nodeId": "zp2vxuVsRwawzBK2u7f7FA",
          "taskResourceUsage": {
            "cpu_time_in_nanos": 625000,
            "memory_in_bytes": 41512
          }
        },
        {
          "action": "indices:data/read/search",
          "taskId": 134,
          "parentTaskId": -1,
          "nodeId": "zp2vxuVsRwawzBK2u7f7FA",
          "taskResourceUsage": {
            "cpu_time_in_nanos": 84000,
            "memory_in_bytes": 3264
          }
        }
      ],
      "source": {},
      "indices": ["my_test_index"],
      "total_shards": 1,
      "labels": {},
      "phase_latency_map": {
        "expand": 0,
        "query": 774,
        "fetch": 0
      },
      "node_id": "zp2vxuVsRwawzBK2u7f7FA",
      "measurements": {
        "latency": {
          "metricType": "latency",
          "number": 774,
          "count": 1,
          "dimensionType": "AVERAGE"
        }
      }
    },
    {
      "timestamp": 1722630528201,
      "query_hashcode": 709023605,
      "search_type": "query_then_fetch",
      "task_resource_usages": [
        {
          "action": "indices:data/read/search[phase/query]",
          "taskId": 163,
          "parentTaskId": 162,
          "nodeId": "zp2vxuVsRwawzBK2u7f7FA",
          "taskResourceUsage": {
            "cpu_time_in_nanos": 9412000,
            "memory_in_bytes": 618968
          }
        },
        {
          "action": "indices:data/read/search",
          "taskId": 162,
          "parentTaskId": -1,
          "nodeId": "zp2vxuVsRwawzBK2u7f7FA",
          "taskResourceUsage": {
            "cpu_time_in_nanos": 158000,
            "memory_in_bytes": 3720
          }
        }
      ],
      "source": {
        "sort": [
          {
            "age": {
              "order": "asc"
            }
          }
        ]
      },
      "indices": ["my_test_index"],
      "total_shards": 1,
      "labels": {},
      "phase_latency_map": {
        "expand": 0,
        "query": 10,
        "fetch": 0
      },
      "node_id": "zp2vxuVsRwawzBK2u7f7FA",
      "measurements": {
        "latency": {
          "metricType": "latency",
          "number": 11,
          "count": 1,
          "dimensionType": "AVERAGE"
        }
      }
    },
    {
      "timestamp": 1722630499772,
      "query_hashcode": -1204891025,
      "search_type": "query_then_fetch",
      "task_resource_usages": [
        {
          "action": "indices:data/read/search[phase/query]",
          "taskId": 137,
          "parentTaskId": 136,
          "nodeId": "zp2vxuVsRwawzBK2u7f7FA",
          "taskResourceUsage": {
            "cpu_time_in_nanos": 7603000,
            "memory_in_bytes": 477600
          }
        },
        {
          "action": "indices:data/read/search",
          "taskId": 136,
          "parentTaskId": -1,
          "nodeId": "zp2vxuVsRwawzBK2u7f7FA",
          "taskResourceUsage": {
            "cpu_time_in_nanos": 127000,
            "memory_in_bytes": 3232
          }
        }
      ],
      "source": {
        "query": {
          "match": {
            "occupation": {
              "query": "Software Engineer",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1.0
            }
          }
        }
      },
      "indices": ["my_test_index"],
      "total_shards": 1,
      "labels": {},
      "phase_latency_map": {
        "expand": 0,
        "query": 8,
        "fetch": 0
      },
      "node_id": "zp2vxuVsRwawzBK2u7f7FA",
      "measurements": {
        "latency": {
          "metricType": "latency",
          "number": 9,
          "count": 1,
          "dimensionType": "AVERAGE"
        }
      }
    }
  ]
}

Get Top N queries with group_by NONE

{
  "top_queries": [
    {
      "timestamp": 1722632764895,
      "query_hashcode": -1204891025,
      "search_type": "query_then_fetch",
      "task_resource_usages": [
        {
          "action": "indices:data/read/search[phase/query]",
          "taskId": 953,
          "parentTaskId": 952,
          "nodeId": "zp2vxuVsRwawzBK2u7f7FA",
          "taskResourceUsage": {
            "cpu_time_in_nanos": 1640000,
            "memory_in_bytes": 120760
          }
        },
        {
          "action": "indices:data/read/search",
          "taskId": 952,
          "parentTaskId": -1,
          "nodeId": "zp2vxuVsRwawzBK2u7f7FA",
          "taskResourceUsage": {
            "cpu_time_in_nanos": 150000,
            "memory_in_bytes": 3232
          }
        }
      ],
      "source": {
        "query": {
          "match": {
            "occupation": {
              "query": "Software Engineer",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1.0
            }
          }
        }
      },
      "indices": ["my_test_index"],
      "total_shards": 1,
      "labels": {},
      "phase_latency_map": {
        "expand": 0,
        "query": 2,
        "fetch": 0
      },
      "node_id": "zp2vxuVsRwawzBK2u7f7FA",
      "measurements": {
        "latency": {
          "metricType": "latency",
          "number": 3,
          "count": 1,
          "dimensionType": "NONE"
        }
      }
    },
    {
      "timestamp": 1722632770456,
      "query_hashcode": 605146258,
      "search_type": "query_then_fetch",
      "task_resource_usages": [
        {
          "action": "indices:data/read/search[phase/query]",
          "taskId": 959,
          "parentTaskId": 958,
          "nodeId": "zp2vxuVsRwawzBK2u7f7FA",
          "taskResourceUsage": {
            "cpu_time_in_nanos": 852000,
            "memory_in_bytes": 49328
          }
        },
        {
          "action": "indices:data/read/search",
          "taskId": 958,
          "parentTaskId": -1,
          "nodeId": "zp2vxuVsRwawzBK2u7f7FA",
          "taskResourceUsage": {
            "cpu_time_in_nanos": 120000,
            "memory_in_bytes": 3240
          }
        }
      ],
      "source": {
        "query": {
          "range": {
            "age": {
              "from": 30,
              "to": null,
              "include_lower": false,
              "include_upper": true,
              "boost": 1.0
            }
          }
        }
      },
      "indices": ["my_test_index"],
      "total_shards": 1,
      "labels": {},
      "phase_latency_map": {
        "expand": 0,
        "query": 1,
        "fetch": 0
      },
      "node_id": "zp2vxuVsRwawzBK2u7f7FA",
      "measurements": {
        "latency": {
          "metricType": "latency",
          "number": 2,
          "count": 1,
          "dimensionType": "NONE"
        }
      }
    },
    {
      "timestamp": 1722632769697,
      "query_hashcode": 605146258,
      "search_type": "query_then_fetch",
      "task_resource_usages": [
        {
          "action": "indices:data/read/search[phase/query]",
          "taskId": 957,
          "parentTaskId": 956,
          "nodeId": "zp2vxuVsRwawzBK2u7f7FA",
          "taskResourceUsage": {
            "cpu_time_in_nanos": 980000,
            "memory_in_bytes": 49328
          }
        },
        {
          "action": "indices:data/read/search",
          "taskId": 956,
          "parentTaskId": -1,
          "nodeId": "zp2vxuVsRwawzBK2u7f7FA",
          "taskResourceUsage": {
            "cpu_time_in_nanos": 130000,
            "memory_in_bytes": 3240
          }
        }
      ],
      "source": {
        "query": {
          "range": {
            "age": {
              "from": 30,
              "to": null,
              "include_lower": false,
              "include_upper": true,
              "boost": 1.0
            }
          }
        }
      },
      "indices": ["my_test_index"],
      "total_shards": 1,
      "labels": {},
      "phase_latency_map": {
        "expand": 0,
        "query": 1,
        "fetch": 0
      },
      "node_id": "zp2vxuVsRwawzBK2u7f7FA",
      "measurements": {
        "latency": {
          "metricType": "latency",
          "number": 2,
          "count": 1,
          "dimensionType": "NONE"
        }
      }
    }
  ]
}

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


@deshsidd deshsidd changed the title Query grouping framework and group by query similarity Query grouping framework for Top N queries and group by query similarity Aug 2, 2024
Member

@ansjcy ansjcy left a comment

The overall class design looks good to me.

But I have concerns on the correctness of the logic in QueryGroupingService to implement algorithms proposed in opensearch-project/OpenSearch#13357 (comment). Please see the individual comments for details.

Also, a lot of the heap operations are O(n) and O(total possible number of groups), which is not acceptable. Please refer to the comment: opensearch-project/OpenSearch#13357 (comment) to resolve it.

Member

ansjcy commented Aug 5, 2024

Please also add integration tests for this feature - We are already lacking integration test coverage for many features in Query Insights.

Collaborator Author

deshsidd commented Aug 5, 2024

We can run some benchmarks to view the performance here and keep this feature as experimental/beta and also limit the number of groups. If needed we can use an indexed priority queue from here as followup changes. Let me know your thoughts.

Member

ansjcy commented Aug 6, 2024

If needed we can use an indexed priority queue as followup changes.

I don't think this is a good idea. The whole algorithm mentioned in opensearch-project/OpenSearch#13357 (comment) is based on using an indexed PQ to store the groups. Otherwise, since we store all the query groups, an update or deletion can take O(total possible number of groups) in the worst case, which is not acceptable.

Collaborator Author

deshsidd commented Aug 6, 2024

Completed the refactoring requested in the comments. Highlights include:

  1. Decouple Measurement and MetricType
  2. Ensure NONE aggregation type performs no aggregations.
  3. Refactor unit tests to re-use code whenever applicable
  4. Added one missing edge case in QueryGroupingService algorithm

The only major open question is the Java priority queue versus an indexed priority queue.

  • An indexed priority queue has O(log n) updates, while with the Java PQ an update takes O(n) since we have to remove (O(n)) and then re-add (O(log n)). Note that n is the number of groups in a Top N window.
  • The only viable indexed priority queue I found was from here. I tried including this in Gradle and got errors due to the following issue. There seem to be other ways to add this library, but I am not sure we want to pursue these unconventional routes. Furthermore, I am not sure about the stability and community support for this library.
  • We can also consider limiting the number of groups per window, or implementing our own indexed PQ, to improve the performance.
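
To make the cost difference concrete, here is a small sketch of what an update looks like with `java.util.PriorityQueue`. The `Group` type and `update` helper are hypothetical stand-ins for an aggregated query record and are not taken from the PR; the point is only that `remove(Object)` has to search the heap linearly before the O(log n) re-insert.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative only: Group is a hypothetical stand-in for an aggregated query record.
final class PriorityQueueUpdateSketch {
    static final class Group {
        final String id;
        double averageLatency;

        Group(String id, double averageLatency) {
            this.id = id;
            this.averageLatency = averageLatency;
        }
    }

    // Change a group's key in a java.util.PriorityQueue.
    static void update(PriorityQueue<Group> heap, Group group, double newAverage) {
        heap.remove(group);                 // linear scan to find the element: O(n)
        group.averageLatency = newAverage;  // mutate the key only while the element is outside the heap
        heap.add(group);                    // re-insert at the right position: O(log n)
    }

    public static void main(String[] args) {
        PriorityQueue<Group> minHeap = new PriorityQueue<>(
            Comparator.comparingDouble((Group g) -> g.averageLatency)
        );
        Group a = new Group("a", 5);
        Group b = new Group("b", 9);
        minHeap.add(a);
        minHeap.add(b);
        update(minHeap, a, 20);
        System.out.println("smallest group is now " + minHeap.peek().id);
    }
}
```

An indexed priority queue avoids the scan by remembering each element's position in the heap array, which is why its updates stay at O(log n).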

Let's discuss more and figure out a path forward!

Collaborator

@jainankitk jainankitk left a comment

Overall, the logic for query grouping seems complex, and I am wondering if there is a way to simplify some of it by making some reasonable assumptions.

@deshsidd deshsidd force-pushed the sid/query-shape branch 2 times, most recently from f36d2a9 to 886ca2f Compare August 28, 2024 20:26
@deshsidd
Collaborator Author

Ran some benchmarks to figure out a reasonable number for the cardinality of the groups and here are the results:

  1. logging heavy: http_logs
    Here are the number of groups logged at the end of a window cycle over the course of approx 3 days:
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 5, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 7, 3, 7, 6, 3, 0, 0, 0, 0, 0, 0, 0, 0, 7, 6, 3, 0, 0, 0, 0, 0, 0, 0, 0, 7, 6, 3, 0]
    Maximum: 8

  2. search heavy: nyc_taxis
    Here are the number of groups logged at the end of a window cycle over the course of approx 3 days:
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 2, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0]
    Maximum: 8

  3. custom workload simulating real world traffic:
    [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 4, 0, 0, 0, 0, 0, 0]
    Maximum: 7

Note that the window size was set to 1 hour.

IMHO it might be reasonable to have a setting max_groups that sets the maximum number of groups and limits the PQ to that number. If we exceed this number we can drop the new group and add debug logs. The max_groups value should be validated so that it cannot be set beyond 10,000.
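
A minimal sketch of the proposed bound check is shown below. The class and method names are hypothetical, the lower bound of 0 is an assumption, and the drop condition is simplified to a single counter; it is not the check that ends up in the grouper.

```java
// Hypothetical sketch of the proposed max_groups validation; names are illustrative.
final class MaxGroupsValidator {
    static final int MAX_GROUPS_UPPER_BOUND = 10_000; // cap proposed in the comment above

    // Reject values outside [0, 10000]; the lower bound of 0 is an assumption.
    static int validate(int maxGroups) {
        if (maxGroups < 0 || maxGroups > MAX_GROUPS_UPPER_BOUND) {
            throw new IllegalArgumentException(
                "max_groups must be between 0 and " + MAX_GROUPS_UPPER_BOUND + ", got " + maxGroups
            );
        }
        return maxGroups;
    }

    // A previously unseen group is dropped once the tracked group count has reached the limit.
    static boolean shouldDropNewGroup(int currentGroups, int maxGroups) {
        return currentGroups >= maxGroups;
    }

    public static void main(String[] args) {
        int limit = validate(100);
        // 8 was the largest per-window group count seen in the benchmarks above.
        System.out.println("drop new group at 8 tracked groups? " + shouldDropNewGroup(8, limit));
    }
}
```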

@deshsidd
Collaborator Author

I personally think we should not even add a record if the feature is disabled. Whenever it is switched on, we start calculating from that point. Otherwise it looks like more of a leak.

If only query metrics is enabled we always add the records.
If only top N is enabled we perform an optimization and skip adding the records if they do not make it to the Top N.
If grouping is enabled we need to add all the records since we cannot perform the optimization above.

Not sure what you are referring to here?

@deshsidd
Collaborator Author

As discussed added interfaces and implementation for the following:
interface -> implementation

QueryGrouper -> MinMaxHeapQueryGrouper
TopQueriesStore -> PriorityQueueTopQueriesStore

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>
Member

@ansjcy ansjcy left a comment

The security based integ tests are failing for the change: #85
Let's double check if we are missing anything for this change on the permission side before merging.

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>
Collaborator Author

deshsidd commented Sep 4, 2024

The security based integ tests are failing for the change: #85
Let's double check if we are missing anything for this change on the permission side before merging.

Thanks for checking! The integration test PR build is failing due to grouping settings not found. This PR needs to be merged for the builds to pass there. The security ITs are run in the checks for this PR. Also ran the security ITs locally and they are passing.

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>
@deshsidd deshsidd requested a review from ansjcy September 4, 2024 01:15
Comment on lines 235 to 239
if (maxHeapQueryStore.size() > 0) {
    addToMaxPQPromoteToMinPQ(aggregateSearchQueryRecord, groupId);
} else {
    addToMinPQOverflowToMaxPQ(aggregateSearchQueryRecord, groupId);
}
Member

@ansjcy ansjcy Sep 4, 2024

Why do we need this if/else here? Can't we simply always add to the max/min queue and then swap the tops?
Also, the "else" logic looks ineffective to me. If execution enters this else branch, we have already removed a record from the min queue, and the max queue is also empty. So when we call addToMinPQOverflowToMaxPQ we add the previous (but updated) record back to the min queue, and nothing overflows to the max queue.

}

private boolean checkMaxGroupsLimitReached(String groupId) {
if (maxGroups <= maxHeapQueryStore.size() && minHeapTopQueriesStore.size() >= topNSize) {
Member

We should emit a metric for this as well.

public static final GroupingType DEFAULT_GROUPING_TYPE = GroupingType.NONE;
public static final int DEFAULT_GROUPS_EXCLUDING_TOPN_LIMIT = 100;

public static final int MAX_GROUPS_EXCLUDING_TOPN_LIMIT = 10000;
Member

How much memory would 10000 records consume based on the benchmark results?

Collaborator Author

@deshsidd deshsidd Sep 4, 2024

Did not capture this and we did not reach anywhere close to the 10000 limit in the benchmarks.

If a search query record is around 1 KB, then 10000 groups means we will consume 10 MB of memory at most for this feature, which should be fine. But we still need to watch out for the memory consumption of queries with a large source.

This analysis seems reasonable, but we would need to keep watching the memory consumption here.
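
The back-of-the-envelope estimate above can be written out as a small calculation. The 1 KB per record figure is the assumption from the comment, not a measured value, and the class below is purely illustrative.

```java
// Illustrative arithmetic only; the per-record size is an assumption, not a measurement.
final class GroupMemoryEstimate {
    // Upper-bound estimate: every group holds one aggregated record of the assumed size.
    static long estimateBytes(int maxGroups, long bytesPerRecord) {
        return (long) maxGroups * bytesPerRecord;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(10_000, 1_024);
        System.out.println(bytes + " bytes, roughly " + (bytes / 1_000_000) + " MB");
    }
}
```

Because the estimate scales linearly with record size, a workload whose query source is ten times larger would need roughly ten times the memory at the same group limit.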

Member

ansjcy commented Sep 4, 2024

IMHO it might be reasonable to have a setting max_groups that sets the maximum number of groups and limits the PQ to that number. If we exceed this number we can drop the new group and add debug logs. The max_groups value should be validated so that it cannot be set beyond 10,000.

The benchmark numbers look good. I think it's a good idea to confirm a reasonable upper bound for the number of groups as well, so that we won't consume too much memory.

* @return return the search query record that represents the group
*/
@Override
public SearchQueryRecord add(SearchQueryRecord searchQueryRecord) {
Member

@ansjcy ansjcy Sep 4, 2024

This grouper can be simplified with something like:

public SearchQueryRecord add(SearchQueryRecord searchQueryRecord) {
    if (!groupIdToAggSearchQueryRecord.containsKey(groupId)) {
        boolean maxGroupsLimitReached = checkMaxGroupsLimitReached(groupId);
        if (maxGroupsLimitReached) {
            return null;
        }
        aggregateSearchQueryRecord = searchQueryRecord;
        aggregateSearchQueryRecord.setGroupingId(groupId);
        aggregateSearchQueryRecord.setMeasurementAggregation(metricType, aggregationType);
        addToMinPQ(aggregateSearchQueryRecord, groupId);
    } else {
        aggregateSearchQueryRecord = groupIdToAggSearchQueryRecord.get(groupId).v1();
        boolean isPresentInMinPQ = groupIdToAggSearchQueryRecord.get(groupId).v2();
        if (isPresentInMinPQ) {
            minHeapTopQueriesStore.remove(aggregateSearchQueryRecord);
        } else {
            maxHeapQueryStore.remove(aggregateSearchQueryRecord);
        }
        addAndPromote(searchQueryRecord, aggregateSearchQueryRecord, groupId);
        
    }
    return aggregateSearchQueryRecord;
}
private void addToMinPQ(SearchQueryRecord searchQueryRecord, String groupId) {
    minHeapTopQueriesStore.add(searchQueryRecord);
    groupIdToAggSearchQueryRecord.put(groupId, new Tuple<>(searchQueryRecord, true));
    overflow();
}
private void addAndPromote(SearchQueryRecord searchQueryRecord, SearchQueryRecord aggregateSearchQueryRecord, String groupId) {
    Number measurementToAdd = searchQueryRecord.getMeasurement(metricType);
    aggregateSearchQueryRecord.addMeasurement(metricType, measurementToAdd);
    addToMinPQ(aggregateSearchQueryRecord, groupId);
    if (maxHeapQueryStore.isEmpty()) {
        return;
    } 
    if (SearchQueryRecord.compare(maxHeapQueryStore.peek(), minHeapTopQueriesStore.peek(), metricType) > 0) {
        SearchQueryRecord recordMovedFromMaxToMin = maxHeapQueryStore.poll();
        addToMinPQ(recordMovedFromMaxToMin, recordMovedFromMaxToMin.getGroupingId());
    }
}
private void overflow() {
    if (minHeapTopQueriesStore.size() > topNSize) {
        SearchQueryRecord recordMovedFromMinToMax = minHeapTopQueriesStore.poll();
        maxHeapQueryStore.add(recordMovedFromMinToMax);
        groupIdToAggSearchQueryRecord.put(recordMovedFromMinToMax.getGroupingId(), new Tuple<>(recordMovedFromMinToMax, false));
    }
}
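
For readers who want to run the idea end to end, here is a self-contained sketch of the two-heap approach from the suggestion above. It uses simplified, hypothetical types (`Group`, `TwoHeapGrouperSketch`) instead of `SearchQueryRecord`, orders groups by average latency, and does not enforce a max-groups limit, so treat it as an illustration under those assumptions rather than the merged implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Simplified, hypothetical stand-in for the grouper; no max-groups limit is enforced.
final class TwoHeapGrouperSketch {
    static final class Group {
        final String id;
        double sum;
        int count;
        boolean inMinHeap;

        Group(String id) { this.id = id; }

        double average() { return sum / count; }
    }

    private static final Comparator<Group> BY_AVERAGE = Comparator.comparingDouble(Group::average);

    private final int topN;
    private final Map<String, Group> groups = new HashMap<>();
    // Min heap holds the current Top N groups; its head is the weakest of them.
    private final PriorityQueue<Group> minHeap = new PriorityQueue<>(BY_AVERAGE);
    // Max heap holds every other group; its head is the best candidate for promotion.
    private final PriorityQueue<Group> maxHeap = new PriorityQueue<>(BY_AVERAGE.reversed());

    TwoHeapGrouperSketch(int topN) {
        if (topN < 1) throw new IllegalArgumentException("topN must be at least 1");
        this.topN = topN;
    }

    void add(String groupId, double latency) {
        Group group = groups.get(groupId);
        if (group == null) {
            group = new Group(groupId);
            groups.put(groupId, group);
        } else if (group.inMinHeap) {
            minHeap.remove(group); // O(n) with java.util.PriorityQueue
        } else {
            maxHeap.remove(group);
        }
        // Update the aggregate only while the group is outside both heaps.
        group.sum += latency;
        group.count++;
        group.inMinHeap = true;
        minHeap.add(group);
        // Overflow: keep at most topN groups in the min heap.
        if (minHeap.size() > topN) demote(minHeap.poll());
        // Promote: if the best remaining group now beats the weakest Top N group, swap them.
        if (!maxHeap.isEmpty() && BY_AVERAGE.compare(maxHeap.peek(), minHeap.peek()) > 0) {
            Group promoted = maxHeap.poll();
            promoted.inMinHeap = true;
            minHeap.add(promoted);
            demote(minHeap.poll());
        }
    }

    private void demote(Group group) {
        group.inMinHeap = false;
        maxHeap.add(group);
    }

    // Ids of the current Top N groups, highest average first.
    List<String> topGroupIds() {
        List<Group> sorted = new ArrayList<>(minHeap);
        sorted.sort(BY_AVERAGE.reversed());
        List<String> ids = new ArrayList<>();
        for (Group g : sorted) ids.add(g.id);
        return ids;
    }

    public static void main(String[] args) {
        TwoHeapGrouperSketch grouper = new TwoHeapGrouperSketch(2);
        grouper.add("a", 10);
        grouper.add("b", 20);
        grouper.add("c", 30);
        grouper.add("a", 100); // raises a's average and moves it back into the Top N
        System.out.println(grouper.topGroupIds());
    }
}
```

A single swap after each update is enough here because only one group's average changes per call, so at most one group can cross the boundary between the two heaps.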

Member

ansjcy commented Sep 4, 2024

Overall it looks good and I'm fine approving it. But I still have some concerns and we need follow-ups to resolve them.

  • Since we are not considering an indexed PQ in this PR, removing an element from the max PQ becomes an O(number of groups) operation. Remember that this cost is paid per search request, so it could potentially be very bad. The possible number of groups in the real world therefore matters a lot. We should emit a metric to track this very important information so we can decide whether to increase or decrease the max number of groups limit.
  • If a search query record is around 1 KB, then 10000 groups means we will consume 10 MB of memory at most for this feature, which should be fine. But we still need to watch out for the memory consumption of queries with a large source.
  • The logic in the grouper is too complicated and refactoring is needed to simplify it.

measurements = new HashMap<>();
in.readMap(MetricType::readFromStream, StreamInput::readGenericValue)
.forEach(((metricType, o) -> measurements.put(metricType, metricType.parseValue(o))));
if (in.getVersion().onOrAfter(Version.V_2_17_0)) {
Member

I'm wondering why this is needed. SearchQueryRecord is only used internally and we are not providing any clients that could cause a version mismatch.

Collaborator Author

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>
Collaborator Author

deshsidd commented Sep 4, 2024

Thanks @ansjcy.

  1. Will add metrics to track the number of groups discarded as a followup
  2. Refactored the logic for grouper

@ansjcy ansjcy merged commit 65e4489 into opensearch-project:main Sep 4, 2024
16 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Sep 4, 2024
…ity (#66)

* Query grouping framework and group by query similarity

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Spotless apply

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Build fix

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Properly configure settings update consumer

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Address review comments

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Refactor unit tests

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Decouple Measurement and MetricType

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Aggregate type NONE will ensure no aggregations computed

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Perform renaming

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Integrate query shape library with grouping

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Spotless

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Create and consume string hashcode interface

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Health checks in code

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Fix tests and spotless apply

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Minor fixes

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Max groups setting and unit tests

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Address review comments

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Address review comments

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Create query grouper interface and top query store interface

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Address review comments

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Removed unused interface

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Rebase main and spotless

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Renaming variable

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Remove TopQueriesStore interface

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Drain top queries service on group change

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Rename max groups setting and allow minimum 0

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Make write/read from io backword compatible

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Minor fix

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Refactor query grouper

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

---------

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>
(cherry picked from commit 65e4489)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
deshsidd pushed a commit that referenced this pull request Sep 4, 2024
…ity (#66) (#86)

(cherry picked from commit 65e4489)

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Sep 5, 2024
…ity (#66)

(cherry picked from commit 65e4489)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
ansjcy pushed a commit that referenced this pull request Sep 5, 2024
…ity (#66) (#104)

(cherry picked from commit 65e4489)

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>