SQL like GROUP BY AND HAVING
Implementation of HAVING-like behavior
You may use one of the pipeline aggregations
, namely bucket selector aggregation. The query would look like this:
POST my_index/tdrs/_search
{
"aggregations": {
"reseller_sale_sum": {
"aggregations": {
"sales": {
"sum": {
"field": "requestAmountValue"
}
},
"max_sales": {
"bucket_selector": {
"buckets_path": {
"var1": "sales"
},
"script": "params.var1 > 10000"
}
}
},
"terms": {
"field": "senderResellerId",
"order": {
"sales": "desc"
},
"size": 5
}
}
},
"size": 0
}
After putting the following documents in the index:
"hits": [
{
"_index": "my_index",
"_type": "tdrs",
"_id": "AV9Yh5F-dSw48Z0DWDys",
"_score": 1,
"_source": {
"requestAmountValue": 7000,
"senderResellerId": "ID_1"
}
},
{
"_index": "my_index",
"_type": "tdrs",
"_id": "AV9Yh684dSw48Z0DWDyt",
"_score": 1,
"_source": {
"requestAmountValue": 5000,
"senderResellerId": "ID_1"
}
},
{
"_index": "my_index",
"_type": "tdrs",
"_id": "AV9Yh8TBdSw48Z0DWDyu",
"_score": 1,
"_source": {
"requestAmountValue": 1000,
"senderResellerId": "ID_2"
}
}
]
The result of the query is:
"aggregations": {
"reseller_sale_sum": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "ID_1",
"doc_count": 2,
"sales": {
"value": 12000
}
}
]
}
}
I.e. only those senderResellerId
whose cumulative sales are >10000
.
Counting the buckets
To implement an equivalent of SELECT COUNT(*) FROM (... HAVING)
one may use a combination of bucket script aggregation with sum bucket aggregation. Though there seems to be no direct way to count how many buckets did bucket_selector
actually select, we may define a bucket_script
that produces 0
or 1
depending on a condition, and sum_bucket
that produces its sum
:
POST my_index/tdrs/_search
{
"aggregations": {
"reseller_sale_sum": {
"aggregations": {
"sales": {
"sum": {
"field": "requestAmountValue"
}
},
"max_sales": {
"bucket_script": {
"buckets_path": {
"var1": "sales"
},
"script": "if (params.var1 > 10000) { 1 } else { 0 }"
}
}
},
"terms": {
"field": "senderResellerId",
"order": {
"sales": "desc"
}
}
},
"max_sales_stats": {
"sum_bucket": {
"buckets_path": "reseller_sale_sum>max_sales"
}
}
},
"size": 0
}
The output will be:
"aggregations": {
"reseller_sale_sum": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
...
]
},
"max_sales_stats": {
"value": 1
}
}
The desired bucket count is located in max_sales_stats.value
.
Important considerations
I have to point out 2 things:
- The feature is experimental (as of ES 5.6 it is still experimental, though it was added in 2.0.0-beta1.)
- pipeline aggregations are applied on the result of previous aggregations:
Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree.
This means that bucket_selector
aggregation will be applied after and on the result of terms
aggregation on senderResellerId
. For example, if there are more senderResellerId
than size
of terms
aggregation defines, you will not get all the ids in the collection with sum(sales) > 10000
, but only those that appear in the output of terms
aggregation. Consider using sorting and/or set sufficient size
parameter.
This also applies for the second case, COUNT() (... HAVING)
, which will only count those buckets that are actually present in the output of aggregation.
In case this query is too heavy or the number of buckets too big, consider denormalizing your data or store this sum directly in the document, so you can use plain range
query to achieve your goal.
Related videos on Youtube
damjad
A software engineer curious about the science of data and the relationship between different mathematical aspects of the world.
Updated on July 09, 2022Comments
-
damjad almost 2 years
I want to get the counts of groups which satisfy a certain condition. In SQL terms, I want to do the following in Elasticsearch.
SELECT COUNT(*) FROM ( SELECT senderResellerId, SUM(requestAmountValue) AS t_amount FROM transactions GROUP BY senderResellerId HAVING t_amount > 10000 ) AS dum;
So far, I could group by senderResellerId by term aggregation. But when I apply filters, it does not work as expected.
Elastic Request
{ "aggregations": { "reseller_sale_sum": { "aggs": { "sales": { "aggregations": { "reseller_sale": { "sum": { "field": "requestAmountValue" } } }, "filter": { "range": { "reseller_sale": { "gte": 10000 } } } } }, "terms": { "field": "senderResellerId", "order": { "sales>reseller_sale": "desc" }, "size": 5 } } }, "ext": {}, "query": { "match_all": {} }, "size": 0 }
Actual Response
{ "took" : 21, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 150824, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "reseller_sale_sum" : { "doc_count_error_upper_bound" : -1, "sum_other_doc_count" : 149609, "buckets" : [ { "key" : "RES0000000004", "doc_count" : 8, "sales" : { "doc_count" : 0, "reseller_sale" : { "value" : 0.0 } } }, { "key" : "RES0000000005", "doc_count" : 39, "sales" : { "doc_count" : 0, "reseller_sale" : { "value" : 0.0 } } }, { "key" : "RES0000000006", "doc_count" : 57, "sales" : { "doc_count" : 0, "reseller_sale" : { "value" : 0.0 } } }, { "key" : "RES0000000007", "doc_count" : 134, "sales" : { "doc_count" : 0, "reseller_sale" : { "value" : 0.0 } } } } } ] } } }
As you can see from above response, it is returning resellers but the reseller_sale aggregation is zero in results.
More details are here.
-
Nikolay Vasiliev over 6 yearsMay you please provide the ES mapping you are using and a couple of example documents?
-
damjad over 6 years@NikolayVasiliev Mapping is updated here. discuss.elastic.co/t/sql-like-group-by-and-having/104705
-
-
damjad over 6 yearsThanks a lot. :) Can you tell me how to calculate the count of the buckets, if I mention INT_MAX as my size?
-
damjad over 6 yearsI just want the count of buckets, not the content of buckets.
-
Nikolay Vasiliev over 6 years@chuckskull Right, I missed that point in the original answer. Please check again!
-
A kram over 4 years@chuckskull how can I mention INT_MAX as my size?
-
damjad over 4 years@Akram Manually by writing the value instead of INT_MAX. AFAIK.
-
A kram over 4 years@chuckskull thanks, I would like the get only the count of the buckets and not the contents?
-
Nikolay Vasiliev over 4 years@Akram You can use cardinality aggregation to do that.
-
A kram over 4 years@Nicolay thanks but how can I modify your original answer to Implement cardinality aggregation and then display only the count of the buckets
-
Nikolay Vasiliev over 4 years@Akram now I think I understand what you need. If you only care about the
COUNT(*)
part, and wish to omit the buckets it was counted upon, you can use filter_path to specify which part of the response to show. Please tell me if that works for you, I will add it to the answer then. -
A kram over 4 years@Nicolay I tried the filter_path aggregations.**.value it's just performing the filter on the response but this is still generating all the buckets on the server ...
-
Nikolay Vasiliev over 4 years@Akram if you need to filter on the generated buckets (the
HAVING
part), ES has to generate them, I don't think there's a way around this. Maybe you can ask another question with your setup exactly? It would be easier to answer having a description and a couple of examples.