Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query shape for agg & sort #44

Merged
merged 1 commit into from
Jul 31, 2024
Merged

Conversation

dzane17
Copy link
Collaborator

@dzane17 dzane17 commented Jul 23, 2024

Description

Add ability to generate query shape for aggregation and sort portions of search body.

Testing

Large Search

curl -XGET 'http://localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d '{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "order_date": {
              "gte": "2023-01-01",
              "lte": "2023-12-31"
            }
          }
        },
        {
          "term": {
            "status": "completed"
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "category": ["electronics", "clothing"]
          }
        }
      ]
    }
  },
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "order_date",
        "calendar_interval": "month",
        "format": "yyyy-MM",
        "order": {
          "_key": "desc"
        }
      },
      "aggs": {
        "total_sales": {
          "sum": {
            "field": "total_amount"
          }
        },
        "avg_order_quantity": {
          "avg": {
            "field": "order_quantity"
          }
        },
        "max_order_amount": {
          "max": {
            "field": "total_amount"
          }
        }
      }
    },
    "avg_monthly_sales": {
      "avg_bucket": {
        "buckets_path": "sales_per_month>total_sales"
      }
    },
    "avg_max_order_amount": {
      "avg_bucket": {
        "buckets_path": "sales_per_month>max_order_amount"
      }
    }
  },
  "sort": [
    { "sales_per_month>key": { "order": "desc" } },
    { "total_sales": { "order": "desc" } },
    { "avg_order_quantity": { "order": "asc" } }
  ]
}'
[2024-07-30T15:43:31,782][INFO ][stdout                   ] [integTest-0] bool []
[2024-07-30T15:43:31,783][INFO ][stdout                   ] [integTest-0]   must:
[2024-07-30T15:43:31,784][INFO ][stdout                   ] [integTest-0]     range [order_date]
[2024-07-30T15:43:31,784][INFO ][stdout                   ] [integTest-0]     term [status]
[2024-07-30T15:43:31,784][INFO ][stdout                   ] [integTest-0]   filter:
[2024-07-30T15:43:31,784][INFO ][stdout                   ] [integTest-0]     terms [category]
[2024-07-30T15:43:31,785][INFO ][stdout                   ] [integTest-0] aggregation:
[2024-07-30T15:43:31,785][INFO ][stdout                   ] [integTest-0]   date_histogram [order_date]
[2024-07-30T15:43:31,785][INFO ][stdout                   ] [integTest-0]     aggregation:
[2024-07-30T15:43:31,785][INFO ][stdout                   ] [integTest-0]       avg [order_quantity]
[2024-07-30T15:43:31,786][INFO ][stdout                   ] [integTest-0]       max [total_amount]
[2024-07-30T15:43:31,786][INFO ][stdout                   ] [integTest-0]       sum [total_amount]
[2024-07-30T15:43:31,786][INFO ][stdout                   ] [integTest-0]   pipeline aggregation:
[2024-07-30T15:43:31,786][INFO ][stdout                   ] [integTest-0]     avg_bucket
[2024-07-30T15:43:31,786][INFO ][stdout                   ] [integTest-0]     avg_bucket
[2024-07-30T15:43:31,786][INFO ][stdout                   ] [integTest-0] sort:
[2024-07-30T15:43:31,786][INFO ][stdout                   ] [integTest-0]   asc [avg_order_quantity]
[2024-07-30T15:43:31,787][INFO ][stdout                   ] [integTest-0]   desc [sales_per_month>key]
[2024-07-30T15:43:31,787][INFO ][stdout                   ] [integTest-0]   desc [total_sales]

Aggregation

curl -XGET 'http://localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d '{
  "aggs": {
    "group_by_category_4": {
      "terms": {
        "field": "category.keyword4"
      }
    },
    "group_by_category_3": {
      "terms": {
        "field": "category.keyword3"
      }
    },
    "group_by_category_2": {
      "terms": {
        "field": "category.keyword2"
      }
    },
    "group_by_category_1": {
      "terms": {
        "field": "category.keyword1"
      }
    }
  },
  "size": 10
}'
[2024-07-30T15:44:11,820][INFO ][stdout                   ] [integTest-0] aggregation:
[2024-07-30T15:44:11,822][INFO ][stdout                   ] [integTest-0]   terms [category.keyword1]
[2024-07-30T15:44:11,822][INFO ][stdout                   ] [integTest-0]   terms [category.keyword2]
[2024-07-30T15:44:11,822][INFO ][stdout                   ] [integTest-0]   terms [category.keyword3]
[2024-07-30T15:44:11,822][INFO ][stdout                   ] [integTest-0]   terms [category.keyword4]

Nested Aggregation

curl -XGET 'http://localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d '{
  "aggs": {
    "sales_by_category": {
      "terms": {
        "field": "category.keyword",
        "size": 10
      },
      "aggs": {
        "total_sales": {
          "sum": {
            "field": "total_amount"
          }
        },
        "top_products": {
          "nested": {
            "path": "items"
          },
          "aggs": {
            "product_names": {
              "terms": {
                "field": "items.product_name.keyword",
                "size": 5
              },
              "aggs": {
                "avg_price": {
                  "avg": {
                    "field": "items.price"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 10
}'
[2024-07-30T15:44:16,824][INFO ][stdout                   ] [integTest-0] aggregation:
[2024-07-30T15:44:16,825][INFO ][stdout                   ] [integTest-0]   terms [category.keyword]
[2024-07-30T15:44:16,825][INFO ][stdout                   ] [integTest-0]     aggregation:
[2024-07-30T15:44:16,825][INFO ][stdout                   ] [integTest-0]       nested []
[2024-07-30T15:44:16,825][INFO ][stdout                   ] [integTest-0]         aggregation:
[2024-07-30T15:44:16,825][INFO ][stdout                   ] [integTest-0]           terms [items.product_name.keyword]
[2024-07-30T15:44:16,825][INFO ][stdout                   ] [integTest-0]             aggregation:
[2024-07-30T15:44:16,826][INFO ][stdout                   ] [integTest-0]               avg [items.price]
[2024-07-30T15:44:16,826][INFO ][stdout                   ] [integTest-0]       sum [total_amount]

Pipeline Aggregation

curl -XGET 'http://localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d '{
  "size": 0,
  "aggs": {
    "monthly_sales": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "1M"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "cumulative_sum_price": {
          "cumulative_sum": {
            "buckets_path": "avg_price"
          }
        },
        "moving_avg_price": {
          "moving_avg": {
            "buckets_path": "cumulative_sum_price",
            "window": 3,
            "model": "simple"
          }
        }
      }
    }
  }
}'
[2024-07-30T15:44:26,831][INFO ][stdout                   ] [integTest-0] aggregation:
[2024-07-30T15:44:26,831][INFO ][stdout                   ] [integTest-0]   date_histogram [timestamp]
[2024-07-30T15:44:26,832][INFO ][stdout                   ] [integTest-0]     aggregation:
[2024-07-30T15:44:26,832][INFO ][stdout                   ] [integTest-0]       avg [price]
[2024-07-30T15:44:26,832][INFO ][stdout                   ] [integTest-0]       pipeline aggregation:
[2024-07-30T15:44:26,832][INFO ][stdout                   ] [integTest-0]         cumulative_sum
[2024-07-30T15:44:26,832][INFO ][stdout                   ] [integTest-0]         moving_avg

Sort

curl -XGET 'http://localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d '{
  "query": {
    "match_all": {}
  },
  "sort": [
    { "field1": "asc" },
    { "field2": "desc" },
    { "field3": "desc" },
    { "field4": "asc" }
  ]
}'
[2024-07-30T15:44:31,838][INFO ][stdout                   ] [integTest-0] match_all []
[2024-07-30T15:44:31,838][INFO ][stdout                   ] [integTest-0] sort:
[2024-07-30T15:44:31,839][INFO ][stdout                   ] [integTest-0]   asc [field1]
[2024-07-30T15:44:31,839][INFO ][stdout                   ] [integTest-0]   asc [field4]
[2024-07-30T15:44:31,839][INFO ][stdout                   ] [integTest-0]   desc [field2]
[2024-07-30T15:44:31,839][INFO ][stdout                   ] [integTest-0]   desc [field3]

@dzane17 dzane17 marked this pull request as ready for review July 23, 2024 22:35
@deshsidd
Copy link
Collaborator

Thanks! @dzane17 do you have some examples where we have the entire query shape including Query, Aggregations, Sort and Pipeline aggregations all in one?

Something like this:

bool
  filter:
    terms [field1:text]
  must:
    range [field2:date]
    term [field3:keyword]
  must_not:
    exists [field4:keyword]
  should:
    match [field5:text]
sort:
  asc [field6:number]
  desc [field2:date]
aggregation:
  terms [field3:keyword]
    aggregation:
      avg [field6:number]
      cardinality [field3:keyword]
      date_histogram [field2:date]
      max [field6:number]
      min [field6:number]
      percentile_ranks [field6:number]
      sum [field6:number]
pipeline_aggregation:
  avg_bucket
  max_bucket

Also, will the field related information for query operations will be as a followup?

@deshsidd
Copy link
Collaborator

Was hoping we could have a QueryShapeService/QueryShapeBuilder or something similar that hides all the abstractions of processing the query shape. It only takes in an input search query source and logs the query shape. It can also include the string representation and the hash of a query shape. We can configure the builder to include_field_details = True/False and so on.
Other services such as TopN and Categorization can now use this builder/service to get the query shape, log the query shape, get the unique hash for a query shape and so on. Let me know your thoughts @dzane17

@ansjcy
Copy link
Member

ansjcy commented Jul 24, 2024

Was hoping we could have a QueryShapeService/QueryShapeBuilder or something similar that hides all the abstractions of processing the query shape

+1 on this, we should create a lib to extract shape from a source obj so that we can use it in other places as well.

@dzane17 dzane17 force-pushed the query-shape branch 2 times, most recently from cc8f17e to 2f995db Compare July 24, 2024 21:48
@dzane17
Copy link
Collaborator Author

dzane17 commented Jul 24, 2024

@deshsidd @ansjcy I have added a large search request example to the description.

QueryShapeService now contains logic to generate shape given a source. We can call it when and where ever needed. Currently all methods are static. I could not envision why we'd want to create instances of QueryShapeService unless we are calling QueryShapeService.buildShape() repeatedly for the same search. Let me know what you think. Also added unit tests for QueryShapeService.

@jainankitk Regarding query field data, there is not an easy way to extract field names from a QueryBuilder object. This file QueryBuilders.java has a (complete?) list of classes which extend QueryBuilder. Some child classes have a fieldName field, some don't.

The options I came up with are

  1. Type checking - this is tedious, won't support new query types in the future, can break if existing query types change
  2. Method overriding - update QueryBuilder parent class to return some blank value for fieldName() unless overridden in the child class
  3. Ignore query field names completely

This also gets more complicated if we want to expand beyond field names... For example, we have width param which is unique to range type.

range [field2:value2, width:100]

@deshsidd
Copy link
Collaborator

Thanks @dzane17.
For the large query shape why is desc [sales_per_month>key] in the shape?

Looks good overall, but need to figure out the best way to get the field details for the query type. Based on the above proposed approaches I like approach 2 Method overriding - update QueryBuilder parent class to return some blank value for fieldName() unless overridden in the child class @jainankitk let us know your thoughts here.

For properties such as width, we would have to do custom handling for the specific query type. Eg: if we encounter a range we can use the source to get the width.

@jainankitk
Copy link
Collaborator

@jainankitk Regarding query field data, there is not an easy way to extract field names from a QueryBuilder object. This file QueryBuilders.java has a (complete?) list of classes which extend QueryBuilder. Some child classes have a fieldName field, some don't.

The options I came up with are

Type checking - this is tedious, won't support new query types in the future, can break if existing query types change
Method overriding - update QueryBuilder parent class to return some blank value for fieldName() unless overridden in the child class
Ignore query field names completely

This also gets more complicated if we want to expand beyond field names... For example, we have width param which is unique to range type.

@dzane17 - Thank you for explaining the issue here, and some of the possible options. While option 2 looks decent, I am wondering about a fourth option to have a map containing QueryBuilder class type and list of things you want to collect for that specific class. For example:

<RangeQueryBuilder, List.of(getFieldName, getWidth)>
<TermBuilder, List.of(getFieldName)>

If a specific class is not present in the map, we will collect default data without any field names. Thoughts?

@jainankitk
Copy link
Collaborator

@jainankitk Regarding query field data, there is not an easy way to extract field names from a QueryBuilder object. This file QueryBuilders.java has a (complete?) list of classes which extend QueryBuilder. Some child classes have a fieldName field, some don't.

The options I came up with are

1. Type checking - this is tedious, won't support new query types in the future, can break if existing query types change

2. Method overriding - update QueryBuilder parent class to return some blank value for fieldName() unless overridden in the child class

3. Ignore query field names completely

This also gets more complicated if we want to expand beyond field names... For example, we have width param which is unique to range type.

range [field2:value2, width:100]

@dzane17 - Thank you for explaining the issue here, and some of the possible options. While option 2 looks decent, I am wondering about a fourth option to have a map containing QueryBuilder class type and list of things you want to collect for that specific class. For example:

<RangeQueryBuilder, List.of(getFieldName, getWidth)>
<TermBuilder, List.of(getFieldName)>

If a specific class is not present in the map, we will collect default data without any field names. Thoughts?

@dzane17
Copy link
Collaborator Author

dzane17 commented Jul 25, 2024

@deshsidd desc [sales_per_month>key] is one of the sort criteria

"sort": [
    { "sales_per_month>key": { "order": "desc" } },
    { "total_sales": { "order": "desc" } },
    { "avg_order_quantity": { "order": "asc" } }
  ]

@dzane17
Copy link
Collaborator Author

dzane17 commented Jul 25, 2024

@jainankitk Regarding query field data, there is not an easy way to extract field names from a QueryBuilder object. This file QueryBuilders.java has a (complete?) list of classes which extend QueryBuilder. Some child classes have a fieldName field, some don't.
The options I came up with are

1. Type checking - this is tedious, won't support new query types in the future, can break if existing query types change

2. Method overriding - update QueryBuilder parent class to return some blank value for fieldName() unless overridden in the child class

3. Ignore query field names completely

This also gets more complicated if we want to expand beyond field names... For example, we have width param which is unique to range type.

range [field2:value2, width:100]

@dzane17 - Thank you for explaining the issue here, and some of the possible options. While option 2 looks decent, I am wondering about a fourth option to have a map containing QueryBuilder class type and list of things you want to collect for that specific class. For example:

<RangeQueryBuilder, List.of(getFieldName, getWidth)> <TermBuilder, List.of(getFieldName)>

If a specific class is not present in the map, we will collect default data without any field names. Thoughts?

@jainankitk This 4th option will also work and solves additional field data beyond field names. The main disadvantage is that we need to maintain the Map.

If we intend to collect field data unique to each QueryBuilder, option 4 seems good. If we only need fieldNames, then option 2 seems more natural.

@jainankitk
Copy link
Collaborator

@jainankitk This 4th option will also work and solves additional field data beyond field names. The main disadvantage is that we need to maintain the Map.

If we intend to collect field data unique to each QueryBuilder, option 4 seems good. If we only need fieldNames, then option 2 seems more natural.

We can build the map gradually over time. We should write code in a manner that not having entry in the map still allows us to collect the basic information. As and when more stuff is added to the map, shape gets decorated with more stuff

@dzane17
Copy link
Collaborator Author

dzane17 commented Jul 25, 2024

@jainankitk @deshsidd I have added query field data using option 4 Map<Class<?>, List<Function<Object, String>>>. I have added all QueryBuilder classes which have a fieldName() method to the map. The List data structure is set up to accommodate additional field data in the future. Example in PR description is updated with new shape output.

@deshsidd
Copy link
Collaborator

@jainankitk Let us know your thoughts

@dzane17 dzane17 force-pushed the query-shape branch 2 times, most recently from 8ff59f9 to a40fad5 Compare July 29, 2024 22:25
@dzane17
Copy link
Collaborator Author

dzane17 commented Jul 30, 2024

@jainankitk @deshsidd Currently we have no field data for pipeline aggregations. Should we use bucketsPaths?

Copy link
Collaborator

@jainankitk jainankitk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial changes LGTM! Looking forward to the anonymization and using the hash of query shape, and exposing that through topN to begin with. cc: @ansjcy @deshsidd

Signed-off-by: David Zane <davizane@amazon.com>
@deshsidd
Copy link
Collaborator

@jainankitk @deshsidd Currently we have no field data for pipeline aggregations. Should we use bucketsPaths?

Not sure if this will be helpful. Can leave it without any additional data for now?

@deshsidd
Copy link
Collaborator

Initial changes LGTM! Looking forward to the anonymization and using the hash of query shape, and exposing that through topN to begin with. cc: @ansjcy @deshsidd

Yup, next would be to expose a function using hash of query shape and using it in the grouping top n by similarity feature.

@deshsidd
Copy link
Collaborator

Thanks @dzane17 this is looking good overall! Lets iterate and keep improving this.

@dzane17 dzane17 merged commit b55d760 into opensearch-project:main Jul 31, 2024
15 of 16 checks passed
@dzane17 dzane17 deleted the query-shape branch July 31, 2024 17:42
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 28, 2024
Signed-off-by: David Zane <davizane@amazon.com>
(cherry picked from commit b55d760)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
dzane17 pushed a commit that referenced this pull request Aug 28, 2024
(cherry picked from commit b55d760)

Signed-off-by: David Zane <davizane@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants