How to calculate median in AWS Redshift?
Solution 1
As of 2014-10-17, Redshift supports the MEDIAN window function:
# select min(median) from (select median(num) over () from temp);
min
-----
4.0
Solution 2
Try the NTILE function.
You would divide your data into 2 ranked groups, ordered descending, and pick the minimum value from the first group. In datasets with an odd number of values, the first ntile has one more value than the second, so that minimum is the exact median. With an even number of values it is the upper of the two middle values rather than their average, which is a close approximation for large datasets.
create table temp (num smallint);
insert into temp values (1),(5),(10),(2),(4);
select num, ntile(2) over(order by num desc) from temp ;
num | ntile
-----+-------
10 | 1
5 | 1
4 | 1
2 | 2
1 | 2
select min(num) as median from (select num, ntile(2) over(order by num desc) from temp) where ntile = 1;
median
--------
4
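To check how the NTILE approach behaves with odd and even row counts, here is a minimal sketch that runs the same query shape locally. It assumes Python's bundled SQLite (3.25 or later, for window functions) as a stand-in for Redshift. The helper name `ntile_median` and the column alias `bucket` are made up for illustration.

```python
import sqlite3


def ntile_median(values):
    """Return min(num) of the first NTILE(2) bucket, ordered descending.

    Hypothetical helper: it mirrors the Redshift query above, but runs it
    against an in-memory SQLite database (an assumption, not Redshift itself).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("create table temp (num integer)")
    conn.executemany("insert into temp values (?)", [(v,) for v in values])
    row = conn.execute(
        "select min(num) from ("
        "  select num, ntile(2) over (order by num desc) as bucket from temp"
        ") as t where bucket = 1"
    ).fetchone()
    conn.close()
    return row[0]


# Odd count: the first bucket holds 10, 5, 4, so the minimum is the exact median.
print(ntile_median([1, 5, 10, 2, 4]))
# Even count: the first bucket holds 4, 3, so the result is the upper middle value.
print(ntile_median([1, 2, 3, 4]))
```

For the even-count input the true median is 2.5, while this approach returns 3. That gap is the approximation mentioned above.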
Solution 3
I had difficulty with this also, but got some help from Amazon. Since the 2014-06-30 version of Redshift, you can do this with the PERCENTILE_CONT or PERCENTILE_DISC window functions.
They're slightly weird to use, as they will tack the median (or whatever percentile you choose) onto every row. You put that in a subquery and then take the MIN (or whatever) of the median column.
# select count(num), min(median) as median
from
(select num, percentile_cont (0.5) within group (order by num) over () as median from temp);
count | median
-------+--------
5 | 4.0
(The reason it's complicated is that window functions can also do their own mini-group-by and ordering to give you the median of many groups all at once, and other tricks.)
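To make the "mini-group-by" behaviour concrete, here is a rough Python emulation of a window median computed per partition. It is a sketch of the semantics only, not Redshift code. The function name `median_over_partition` and the sample group labels are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median


def median_over_partition(rows):
    """Emulate a median window function partitioned by group.

    rows is a list of (group, num) tuples. Like a window function, the
    result keeps every input row and appends that row's group median.
    """
    by_group = defaultdict(list)
    for grp, num in rows:
        by_group[grp].append(num)
    medians = {grp: float(median(nums)) for grp, nums in by_group.items()}
    return [(grp, num, medians[grp]) for grp, num in rows]


rows = [("a", 1), ("a", 5), ("a", 10), ("b", 2), ("b", 4)]
windowed = median_over_partition(rows)

# Every row carries its group's median, so collapse to one row per group,
# which is the role MIN or DISTINCT plays in the SQL version.
per_group = sorted({(grp, med) for grp, _, med in windowed})
print(per_group)
```

The extra collapse step at the end is why the SQL answers wrap the window function in a subquery.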
In the case of an even number of values, CONT(inuous) will interpolate between the two middle values, whereas DISC(rete) will pick one of them.
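The difference between the two is easier to see with numbers. The following Python sketch emulates the usual SQL definitions: linear interpolation for the continuous version, and the first value whose cumulative distribution reaches the requested fraction for the discrete version. The two function names reuse the SQL names for readability, but they are plain Python stand-ins, not Redshift calls.

```python
import math


def percentile_cont(values, p):
    """Continuous percentile: interpolate between the neighbouring rows."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)  # zero-based fractional row position
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def percentile_disc(values, p):
    """Discrete percentile: first value whose cumulative share is >= p."""
    ordered = sorted(values)
    k = max(math.ceil(p * len(ordered)), 1)  # one-based rank of that value
    return ordered[k - 1]


even = [1, 2, 4, 5]
print(percentile_cont(even, 0.5))  # interpolates between 2 and 4
print(percentile_disc(even, 0.5))  # picks an actual value from the data

odd = [1, 5, 10, 2, 4]
print(percentile_cont(odd, 0.5), percentile_disc(odd, 0.5))
```

With an odd number of values both versions land on the same middle row. They only disagree when the requested position falls between two rows.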
Comments
-
tayl0rs over 1 year
Most databases have a built in function for calculating the median but I don't see anything for median in Amazon Redshift.
You could calculate the median using a combination of the nth_value() and count() analytic functions but that seems janky. I would be very surprised if an analytics db didn't have a built in method for computing median so I'm assuming I'm missing something.
http://docs.aws.amazon.com/redshift/latest/dg/r_Examples_of_NTH_WF.html
http://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html
-
tayl0rs about 10 years
Marking this as the accepted answer since it seems like it should work in theory, but I haven't actually tested it. Good idea!
-
Keith over 8 years
Better answer linked
-
Cristian Scutaru almost 8 years
select distinct median(field) over () from table
-
Davos over 4 years
It's bizarre that Redshift implemented this as a window function when postgres has percentile_cont as an ordered-set aggregate. It forces you to do either that distinct or that min aggregation over the top of it.