Calculating the Median with Mysql

27,447

Solution 1

val is your time column, x and y are two references to the data table (you can write data AS x, data AS y).

EDIT: To avoid computing your sums twice, you can store the intermediate results.

CREATE TEMPORARY TABLE average_user_total_time 
      (SELECT SUM(time) AS time_taken 
            FROM scores 
            WHERE created_at >= '2010-10-10' 
                    and created_at <= '2010-11-11' 
            GROUP BY user_id);

Then you can compute median over these values which are in a named table.

EDIT: Temporary table won't work here. You could try using a regular table with "MEMORY" table type. Or just have your subquery that computes the values for the median twice in your query. Apart from this, I don't see another solution. This doesn't mean there isn't a better way, maybe somebody else will come with an idea.

Solution 2

I propose a faster way.

Get the row count:

SELECT CEIL(COUNT(*)/2) FROM data;

Then take the middle value in a sorted subquery:

SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue) x;

I tested this with a 5x10e6 dataset of random numbers and it will find the median in under 10 seconds.

This will find an arbitrary percentile by replacing the COUNT(*)/2 with COUNT(*)*n where n is the percentile (.5 for median, .75 for 75th percentile, etc).

Solution 3

Simplest ways me and my friend have found out... ENJOY!!

SELECT count(*) INTO @c from station;
select ROUND((@c+1)/2) into @final; 
SELECT round(lat_n,4) from station a where @final-1=(select count(lat_n) from station b where b.lat_n > a.lat_n);

Solution 4

First try to understand what the median is: it is the middle value in the sorted list of values.

Once you understand that, the approach is two steps:

  1. sort the values in either order
  2. pick the middle value (if not an odd number of values, pick the average of the two middle values)

Example:

Median of 0 1 3 7 9 10: 5 (because (7+3)/2=5)
Median of 0 1 3 7 9 10 11: 7 (because 7 is the middle value)

So, to sort dates you need a numerical value; you can get their time stamp (as seconds elapsed from epoch) and use the definition of median.

Solution 5

If you have a table R with a column named A, and you want the median of A, you can do as follows:

SELECT A FROM R R1
WHERE ( SELECT COUNT(A) FROM R R2 WHERE R2.A < R1.A ) = ( SELECT COUNT(A) FROM R R3 WHERE R3.A > R1.A )

Note: This will only work if there are no duplicated values in A. Also, null values are not allowed.

Share:
27,447
Tim
Author by

Tim

Updated on July 09, 2022

Comments

  • Tim
    Tim almost 2 years

    I'm having trouble with calculating the median of a list of values, not the average.

    I found this article Simple way to calculate median with MySQL

    It has a reference to the following query which I don't understand properly.

    SELECT x.val from data x, data y
    GROUP BY x.val
    HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)+1)/2
    

    If I have a time column and I want to calculate the median value, what do the x and y columns refer to?

  • Tim
    Tim over 13 years
    Thanks for that @Krab! Don't suppose you could help me with the following. SELECT AVG(time_taken) FROM ( SELECT SUM(time) AS time_taken FROM scores WHERE created_at >= '2010-10-10' and created_at <= '2010-11-11' GROUP BY user_id) AS average_user_total_time" ) to calculate the average of the totals of users scores but not sure how to apply the median formula to this query. Sorry for the re-post, timed out.
  • Tim
    Tim over 13 years
    When I try that, I get "cannot reopen table x". Here is my total sql. CREATE TEMPORARY TABLE average_user_total_time (SELECT SUM(time) AS time_taken FROM scores WHERE created_at >= '2010-10-10' and created_at <= '2010-11-11' GROUP BY user_id); SELECT x.time_taken from average_user_total_time as x, average_user_total_time as y GROUP BY x.time_taken HAVING SUM(SIGN(1-SIGN(y.time_taken-x.time_taken))) = (COUNT(*)+1)/2
  • zanlok
    zanlok over 13 years
    disagree on your first example: median is always an actual member of the set
  • Escualo
    Escualo over 13 years
    @zanlok: any "well accepted" software package will compute the median as I presented it (average value if even number of values) Matlab averages, R averages. What you are talking about is the "medoid", where the value is always a member of the data set.
  • Ella Ryan
    Ella Ryan over 10 years
    Good solution but if there is an odd number of items you should prob get the mean of the two middle points SELECT avg(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue, @numvalues) x; where @numvalues is (@middlevalue mod 2) +1
  • Amitrajit Bose
    Amitrajit Bose almost 5 years
    @nicholas-de-bin How does this work when there are an even number of rows in the column? Because, the traditional logic is for even numbers we need to return the average of the two numbers in the middle. This is not handled in the above query. Please correct me if wrong.