How to calculate median in AWS Redshift?


Solution 1

As of 2014-10-17, Redshift supports the MEDIAN window function:

# select min(median) from (select median(num) over () from temp);
 min 
-----
 4.0

Solution 2

Try the NTILE function.

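You would divide your data into 2 ranked groups, ordered descending, and pick the minimum value from the first group. In datasets with an odd number of values, the first ntile has 1 more value than the second, so its minimum is exactly the median. With an even number of values, this returns the upper of the two middle values rather than their average, so it is only an approximation, but one that should work very well for large datasets.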

create table temp (num smallint);
insert into temp values (1),(5),(10),(2),(4);

select num, ntile(2) over(order by num desc) from temp ;
 num | ntile 
-----+-------
  10 |     1
   5 |     1
   4 |     1
   2 |     2
   1 |     2

select min(num) as median from (select num, ntile(2) over(order by num desc) from temp) where ntile = 1;
 median 
--------
      4
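
To see where the approximation comes from, here is a minimal Python sketch of what the NTILE(2) query computes. It is an illustration of the logic, not Redshift code: the helper name ntile_median is hypothetical, and it assumes NTILE gives the extra row to the first bucket, as described above.

```python
import math
import statistics

def ntile_median(values):
    """Mimic: min(num) where ntile(2) over (order by num desc) = 1."""
    ordered = sorted(values, reverse=True)            # order by num desc
    first_bucket_size = math.ceil(len(ordered) / 2)   # first ntile gets the extra row
    first_bucket = ordered[:first_bucket_size]
    return min(first_bucket)

odd = [1, 5, 10, 2, 4]   # same values as the temp table above
even = [1, 5, 2, 4]      # hypothetical even-sized dataset

# Compare the NTILE result with the true median for each dataset.
print(ntile_median(odd), statistics.median(odd))
print(ntile_median(even), statistics.median(even))
```

With the odd-sized data both numbers agree; with the even-sized data the NTILE approach returns the upper middle value instead of the average of the two middle values.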

Solution 3

I had difficulty with this also, but got some help from Amazon. Since the 2014-06-30 version of Redshift, you can do this with the PERCENTILE_CONT or PERCENTILE_DISC window functions.

They're slightly weird to use, as they will tack the median (or whatever percentile you choose) onto every row. You put that in a subquery and then take the MIN (or whatever) of the median column.

# select count(num), min(median) as median from (select num, percentile_cont (0.5) within group (order by num) over () as median from temp);
 count | median 
-------+--------
     5 |    4.0

(The reason it's complicated is that window functions can also do their own mini-group-by and ordering to give you the median of many groups all at once, and other tricks.)
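
As a rough Python sketch of that behaviour (not Redshift code): the window step attaches each group's median to every row of that group, and a second step collapses the repeated values. The (grp, num) rows are hypothetical sample data, and the grouping stands in for a PARTITION BY inside the OVER clause.

```python
import statistics
from collections import defaultdict

# Hypothetical rows: (grp, num)
rows = [("a", 1), ("a", 5), ("a", 10), ("a", 2), ("a", 4), ("b", 3), ("b", 6)]

# Collect the values of each group (the "partition").
groups = defaultdict(list)
for grp, num in rows:
    groups[grp].append(num)

# Window step: every input row survives and carries its group's median.
windowed = [(grp, num, statistics.median(groups[grp])) for grp, num in rows]

# Collapse step: the median repeats on every row of a group, so MIN
# (or DISTINCT) per group reduces it to a single value.
collapsed = {}
for grp, _num, med in windowed:
    collapsed[grp] = min(med, collapsed.get(grp, med))

print(len(windowed), collapsed)
```

The row count is unchanged after the window step, which is why the outer MIN or DISTINCT is needed to get one median per group.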

In the case of an even number of values, CONT(inuous) will interpolate between the two middle values, whereas DISC(rete) will pick one of them.
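
The difference can be sketched in Python. This follows the usual SQL definitions of the two functions (linear interpolation for CONT; the first value whose cumulative distribution reaches the percentile for DISC) and assumes Redshift matches them. The function names simply mirror the SQL ones and the sample values are hypothetical.

```python
import math

def percentile_cont(values, p):
    """Interpolate linearly between the two rows surrounding the percentile."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)          # fractional 0-based row position
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return float(ordered[lo] + frac * (ordered[hi] - ordered[lo]))

def percentile_disc(values, p):
    """Return an actual value: the first one whose cumulative share is >= p."""
    ordered = sorted(values)
    idx = max(math.ceil(p * len(ordered)) - 1, 0)
    return ordered[idx]

even = [1, 2, 4, 5]
print(percentile_cont(even, 0.5))   # interpolates between 2 and 4
print(percentile_disc(even, 0.5))   # picks one existing value
```

Under these definitions, DISC returns the lower of the two middle values for an even count, while CONT returns their midpoint; for an odd count both return the middle value.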

Author: tayl0rs

indie game developer / unity3d / sql expert

Updated on September 15, 2022

Comments

  • tayl0rs over 1 year

    Most databases have a built-in function for calculating the median, but I don't see anything for median in Amazon Redshift.

    You could calculate the median using a combination of the nth_value() and count() analytic functions, but that seems janky. I would be very surprised if an analytics db didn't have a built-in method for computing median, so I'm assuming I'm missing something.

    http://docs.aws.amazon.com/redshift/latest/dg/r_Examples_of_NTH_WF.html
    http://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html

  • tayl0rs about 10 years
    Marking this as the accepted answer since it seems like it should work in theory, but I haven't actually tested it. Good idea!
  • Keith over 8 years
    Better answer linked
  • Cristian Scutaru almost 8 years
    select distinct median(field) over () from table
  • Davos over 4 years
    It's bizarre that Redshift implemented this as a window function when postgres has percentile_cont as an ordered-set aggregate. It forces you to do either that distinct or that min aggregation over the top of it.