Simple way to calculate median with MySQL

321,790

Solution 1

In MariaDB / MySQL:

SELECT AVG(dd.val) as median_val
FROM (
SELECT d.val, @rownum:=@rownum+1 as `row_number`, @total_rows:=@rownum
  FROM data d, (SELECT @rownum:=0) r
  WHERE d.val is NOT NULL
  -- put some where clause here
  ORDER BY d.val
) as dd
WHERE dd.row_number IN ( FLOOR((@total_rows+1)/2), FLOOR((@total_rows+2)/2) );

Steve Cohen points out, that after the first pass, @rownum will contain the total number of rows. This can be used to determine the median, so no second pass or join is needed.

Also AVG(dd.val) and dd.row_number IN(...) is used to correctly produce a median when there are an even number of records. Reasoning:

SELECT FLOOR((3+1)/2),FLOOR((3+2)/2); -- when total_rows is 3, avg rows 2 and 2
SELECT FLOOR((4+1)/2),FLOOR((4+2)/2); -- when total_rows is 4, avg rows 2 and 3

Finally, MariaDB 10.3.3+ contains a MEDIAN function

Solution 2

I just found another answer online in the comments:

For medians in almost any SQL:

SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)+1)/2

Make sure your columns are well indexed and the index is used for filtering and sorting. Verify with the explain plans.

select count(*) from table --find the number of rows

Calculate the "median" row number. Maybe use: median_row = floor(count / 2).

Then pick it out of the list:

select val from table order by val asc limit median_row,1

This should return you one row with just the value you want.

Solution 3

I found the accepted solution didn't work on my MySQL install, returning an empty set, but this query worked for me in all situations that I tested it on:

SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val)))/COUNT(*) > .5
LIMIT 1

Solution 4

Unfortunately, neither TheJacobTaylor's nor velcrow's answers return accurate results for current versions of MySQL.

Velcro's answer from above is close, but it does not calculate correctly for result sets with an even number of rows. Medians are defined as either 1) the middle number on odd numbered sets, or 2) the average of the two middle numbers on even number sets.

So, here's velcro's solution patched to handle both odd and even number sets:

SELECT AVG(middle_values) AS 'median' FROM (
  SELECT t1.median_column AS 'middle_values' FROM
    (
      SELECT @row:=@row+1 as `row`, x.median_column
      FROM median_table AS x, (SELECT @row:=0) AS r
      WHERE 1
      -- put some where clause here
      ORDER BY x.median_column
    ) AS t1,
    (
      SELECT COUNT(*) as 'count'
      FROM median_table x
      WHERE 1
      -- put same where clause here
    ) AS t2
    -- the following condition will return 1 record for odd number sets, or 2 records for even number sets.
    WHERE t1.row >= t2.count/2 and t1.row <= ((t2.count/2) +1)) AS t3;

To use this, follow these 3 easy steps:

  1. Replace "median_table" (2 occurrences) in the above code with the name of your table
  2. Replace "median_column" (3 occurrences) with the column name you'd like to find a median for
  3. If you have a WHERE condition, replace "WHERE 1" (2 occurrences) with your where condition

Solution 5

I propose a faster way.

Get the row count:

SELECT CEIL(COUNT(*)/2) FROM data;

Then take the middle value in a sorted subquery:

SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue) x;

I tested this with a 5x10e6 dataset of random numbers and it will find the median in under 10 seconds.

Share:
321,790

Related videos on Youtube

davr
Author by

davr

Hello World

Updated on March 15, 2022

Comments

  • davr
    davr over 2 years

    What's the simplest (and hopefully not too slow) way to calculate the median with MySQL? I've used AVG(x) for finding the mean, but I'm having a hard time finding a simple way of calculating the median. For now, I'm returning all the rows to PHP, doing a sort, and then picking the middle row, but surely there must be some simple way of doing it in a single MySQL query.

    Example data:

    id | val
    --------
     1    4
     2    7
     3    2
     4    2
     5    9
     6    8
     7    3
    

    Sorting on val gives 2 2 3 4 7 8 9, so the median should be 4, versus SELECT AVG(val) which == 5.

  • davr
    davr almost 15 years
    This looks the most useful, but I don't want to install unstable alpha software that may cause mysql to crash onto my production server :(
  • Alex Martelli
    Alex Martelli almost 15 years
    So study their sources for the function of interest, fix them or modify them as needed, and install "your own" stable and non-alpha version once you've made it -- how's that any worse than similarly tweaking less-proven code suggestions you get on SO?-)
  • Alex Martelli
    Alex Martelli almost 15 years
    Nope, no ROW_NUMBER OVER, no PARTITION BY, none of that; this is MySql, not a real DB engine like PostgreSQL, IBM DB2, MS SQL Server, and so forth;-).
  • Bryan
    Bryan almost 13 years
    Why not: SELECT val FROM data ORDER BY val limit @middlevalue, 1
  • mblackwell8
    mblackwell8 over 12 years
    IMHO, this one is clearly the best for situations where you need the median from a complicated subset(s) (I needed to calculate separate medians of a large number of data subsets)
  • Rob
    Rob about 12 years
    absolutely correct, works perfectly and very speedy on my indexed tables
  • Rob
    Rob about 12 years
    this seems to be the fastest solution on mysql out of all the answers here, 200ms with just short of a million records in the table
  • Ilia Hadzhiev
    Ilia Hadzhiev about 12 years
    @rob can you help edit please? Or should I just bow down to the velcrow solution? (not actually sure how to defer to another solution) Thanks, Jacob
  • Trip
    Trip over 11 years
    How do you pull the variable output of your first code block into your second code block?
  • Trip
    Trip over 11 years
    As in, where does @middlevalue come from?
  • Frank Conijn - Support Ukraine
    Frank Conijn - Support Ukraine about 11 years
    I am a front-end designer with only a basic knowledge of MySQL, and am having a problem with the syntax. After 'FROM' I've only seen come one variable, the name of the table. Does this formula select data from two tables, and if so, how would the formula be if just the median of one data column of one table is required?
  • Rob Van Dam
    Rob Van Dam about 11 years
    This will fail on any substantial amount of data because GROUP_CONCAT is limited to 1023 characters, even when used inside another function like this.
  • giordano
    giordano almost 11 years
    Be aware: For even number of values it takes the higher of the two middle values. For odds number of values it takes the next higher value after the median.
  • giordano
    giordano almost 11 years
    Thanks for this! The user should be aware that missing values (NULL) are considered as values. to avoid this problem add 'x IS NOT NULL where condition.
  • saulob
    saulob over 10 years
    any way to make it to show group values? like: place / median for that place... like select place, median_value from table... any way? thanks
  • Shane N
    Shane N over 10 years
    @Bryan - I agree with you, that makes much more sense to me. Did you ever find a reason not to do it that way?
  • Brian
    Brian about 10 years
    @FrankConijn: It selects from one table twice. The table's name is data and it is being used with two names, x and y.
  • Mikl
    Mikl almost 10 years
    Works fine for me. 5.6.14 MySQL Community Server. Table with 11M records (about 20Gb on disk), has two not primary indexes (model_id, price). In table (after filtration) we have 500K records to calculate median for. In result we have 30K records (model_id, median_price). Query duration is 1.5-2 seconds. Speed is Fast for me.
  • Przemyslaw Remin
    Przemyslaw Remin about 9 years
    @giordano In which line of the code x IS NOT NULL should be added?
  • giordano
    giordano about 9 years
    @PrzemyslawRemin Sorry, I was not clear in my statement and I realized now that the SP does already consider the case of missing values. The SP should be called in this way: CALL median("table","x","x IS NOT NULL").
  • Rick James
    Rick James over 8 years
    Note that it does a "cross join", which is very slow for large tables.
  • Xenonite
    Xenonite over 8 years
    just saying i stalled my mysqld with this exact query on a table with 33k rows...
  • Ahmed-Anas
    Ahmed-Anas over 7 years
    @rowNum will have the 'total count' at the end of the execution. So you can use that if you want to avoid having to do a 'count all' again (which was my case because my query wasn't so simple)
  • Nomiluks
    Nomiluks about 7 years
    It is not returning the correct median for even number of values, For example , the median of {98,102,102,98} is 100 but your code gives 102. It worked fine for odd numbers.
  • Nomiluks
    Nomiluks about 7 years
    Don't you think the median of ` {22,26}` should be 24?
  • Shanemeister
    Shanemeister about 7 years
    The logic of having one statement: ( floor((total_rows+1)/2), floor((total_rows+2)/2) ) calculate the rows needed for the median is awesome! Not sure how you thought of that, but it is brilliant. The part I don't follow is the (SELECT @rownum:=0) r -- what purpose does this serve?
  • codepk
    codepk almost 7 years
    This does not work as a variable cannot be used in limit clause.
  • Slava Murygin
    Slava Murygin over 6 years
    It is the best solution. However, for large data sets it will slow down because it re-counts for every item in each set. To make it faster put "COUNT(*)" to separate sub-query.
  • chiliNUT
    chiliNUT about 6 years
    does not work for unsigned columns, BIGINT UNSIGNED value is out of range in '(`db`.`y`.`val` - `db`.`x`.`val`)'
  • chiliNUT
    chiliNUT about 6 years
    change the first WHERE 1 to WHERE d.val IS NOT NULL so that it excludes NULL rows to keep this method aligned with the native AVG
  • Dustin Sun
    Dustin Sun almost 6 years
    @Trip middlevalue == SELECT CEIL(COUNT(*)/2) FROM data;
  • Y. Chang
    Y. Chang almost 6 years
    I believe this only works with a table that has the number of entries is odd. For even number of entries, this may have a problem.
  • kuttumiah
    kuttumiah almost 6 years
    This query returns wrong answer for even number of rows.
  • kuttumiah
    kuttumiah almost 6 years
    This answer returns nothing for even number of rows.
  • Kem Mason
    Kem Mason over 5 years
    This answer doesn't work at all for some data sets, e.g., the trivial data set with values 0.1, 0.1, 0.1, 2 -- it will work if all the values are distinct, but only works if the values
  • Kem Mason
    Kem Mason over 5 years
    I just tried this myself, and for what it's worth, installing it was super fast / easy, and it worked as advertised, including grouping, e.g. "select name, median(x) FROM t1 group by name" -- github source here: github.com/infusion/udf_infusion
  • Gurpreet.S
    Gurpreet.S about 5 years
    Many thanks for your query, but can you please explain your query of Having part?
  • Daniel Buckmaster
    Daniel Buckmaster almost 5 years
    My value came from a two-table join, so I had to add another subquery in order to make sure the row ordering was correct after the join! The structure was sort of select avg(value) from (select value, row_number from (select a - b as value from a_table join b_table order by value))
  • davzaman
    davzaman over 4 years
    I know this is very old but for some reason this produces very different results than just moving set @rn:=-1 to the outer select instead of instantiating at 0 inside the inner select. For some reason I could not get the results to match
  • Rick James
    Rick James over 4 years
    And, what do you do for the Median of string values?
  • xliiv
    xliiv almost 4 years
    Could you tell what is the t2 table?
  • GuyStalks
    GuyStalks almost 4 years
    MySQL has window functions now, so this basically works. The only change you need is that you have to take the average of your results in the end.
  • Ma'ruf
    Ma'ruf over 3 years
    @Y.Chang you are right. This returns nothing for even number of rows
  • Abhishek Sengupta
    Abhishek Sengupta over 3 years
    is the taking into account the odd number scenerio ?
  • Renan Benedicto Pereira
    Renan Benedicto Pereira about 3 years
    Maybe you can solve it using GROUP_CONCAT (pay attention to the function limitations on group_concat_max_len system variable). SUBSTRING_INDEX(SUBSTRING_INDEX(GROUP_CONCAT(value order by value), ',', CEIL(COUNT(0)/2)), ',', -1)
  • Vinamra Bali
    Vinamra Bali about 3 years
    QUERY :: I changed the last WHERE clause to WHERE rownumber IN (FLOOR((rownumber + 1)/2), FLOOR((rownumber + 2)/2)); as it was mentioned that rownumber will have total number of rows after first run. This gives me a different result. Can anyone explain why and also why we introduced two variables instead of one.
  • rodrigo-silveira
    rodrigo-silveira over 2 years
    @xliiv t2 is the alias for the second table being selected (named table in the example). The issue with this method is that it does a cross join, but also, doesn't work on my trivial example: ERROR 1690 (22003): BIGINT UNSIGNED value is out of range in ("t.score" - "t2.score")
  • bkr879
    bkr879 about 2 years
    Now, I want to be able to use the above logic as in median(column) - is it possible to embed the above as a function / procedure?