Simple way to calculate median with MySQL
Solution 1
In MariaDB / MySQL:
SELECT AVG(dd.val) as median_val
FROM (
SELECT d.val, @rownum:=@rownum+1 as `row_number`, @total_rows:=@rownum
FROM data d, (SELECT @rownum:=0) r
WHERE d.val is NOT NULL
-- put some where clause here
ORDER BY d.val
) as dd
WHERE dd.row_number IN ( FLOOR((@total_rows+1)/2), FLOOR((@total_rows+2)/2) );
Steve Cohen points out, that after the first pass, @rownum will contain the total number of rows. This can be used to determine the median, so no second pass or join is needed.
Also AVG(dd.val)
and dd.row_number IN(...)
is used to correctly produce a median when there are an even number of records. Reasoning:
SELECT FLOOR((3+1)/2),FLOOR((3+2)/2); -- when total_rows is 3, avg rows 2 and 2
SELECT FLOOR((4+1)/2),FLOOR((4+2)/2); -- when total_rows is 4, avg rows 2 and 3
Finally, MariaDB 10.3.3+ contains a MEDIAN function
Solution 2
I just found another answer online in the comments:
For medians in almost any SQL:
SELECT x.val from data x, data y GROUP BY x.val HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)+1)/2
Make sure your columns are well indexed and the index is used for filtering and sorting. Verify with the explain plans.
select count(*) from table --find the number of rows
Calculate the "median" row number. Maybe use: median_row = floor(count / 2)
.
Then pick it out of the list:
select val from table order by val asc limit median_row,1
This should return you one row with just the value you want.
Solution 3
I found the accepted solution didn't work on my MySQL install, returning an empty set, but this query worked for me in all situations that I tested it on:
SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val)))/COUNT(*) > .5
LIMIT 1
Solution 4
Unfortunately, neither TheJacobTaylor's nor velcrow's answers return accurate results for current versions of MySQL.
Velcro's answer from above is close, but it does not calculate correctly for result sets with an even number of rows. Medians are defined as either 1) the middle number on odd numbered sets, or 2) the average of the two middle numbers on even number sets.
So, here's velcro's solution patched to handle both odd and even number sets:
SELECT AVG(middle_values) AS 'median' FROM (
SELECT t1.median_column AS 'middle_values' FROM
(
SELECT @row:=@row+1 as `row`, x.median_column
FROM median_table AS x, (SELECT @row:=0) AS r
WHERE 1
-- put some where clause here
ORDER BY x.median_column
) AS t1,
(
SELECT COUNT(*) as 'count'
FROM median_table x
WHERE 1
-- put same where clause here
) AS t2
-- the following condition will return 1 record for odd number sets, or 2 records for even number sets.
WHERE t1.row >= t2.count/2 and t1.row <= ((t2.count/2) +1)) AS t3;
To use this, follow these 3 easy steps:
- Replace "median_table" (2 occurrences) in the above code with the name of your table
- Replace "median_column" (3 occurrences) with the column name you'd like to find a median for
- If you have a WHERE condition, replace "WHERE 1" (2 occurrences) with your where condition
Solution 5
I propose a faster way.
Get the row count:
SELECT CEIL(COUNT(*)/2) FROM data;
Then take the middle value in a sorted subquery:
SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue) x;
I tested this with a 5x10e6 dataset of random numbers and it will find the median in under 10 seconds.
Related videos on Youtube
Comments
-
davr over 2 years
What's the simplest (and hopefully not too slow) way to calculate the median with MySQL? I've used
AVG(x)
for finding the mean, but I'm having a hard time finding a simple way of calculating the median. For now, I'm returning all the rows to PHP, doing a sort, and then picking the middle row, but surely there must be some simple way of doing it in a single MySQL query.Example data:
id | val -------- 1 4 2 7 3 2 4 2 5 9 6 8 7 3
Sorting on
val
gives2 2 3 4 7 8 9
, so the median should be4
, versusSELECT AVG(val)
which ==5
.-
berturion almost 6 yearsMariaDB since version 10.3 has one, see mariadb.com/kb/en/library/median
-
-
davr almost 15 yearsThis looks the most useful, but I don't want to install unstable alpha software that may cause mysql to crash onto my production server :(
-
Alex Martelli almost 15 yearsSo study their sources for the function of interest, fix them or modify them as needed, and install "your own" stable and non-alpha version once you've made it -- how's that any worse than similarly tweaking less-proven code suggestions you get on SO?-)
-
Alex Martelli almost 15 yearsNope, no
ROW_NUMBER OVER
, no PARTITION BY, none of that; this is MySql, not a real DB engine like PostgreSQL, IBM DB2, MS SQL Server, and so forth;-). -
Bryan almost 13 yearsWhy not: SELECT val FROM data ORDER BY val limit @middlevalue, 1
-
mblackwell8 over 12 yearsIMHO, this one is clearly the best for situations where you need the median from a complicated subset(s) (I needed to calculate separate medians of a large number of data subsets)
-
Rob about 12 yearsabsolutely correct, works perfectly and very speedy on my indexed tables
-
Rob about 12 yearsthis seems to be the fastest solution on mysql out of all the answers here, 200ms with just short of a million records in the table
-
Ilia Hadzhiev about 12 years@rob can you help edit please? Or should I just bow down to the velcrow solution? (not actually sure how to defer to another solution) Thanks, Jacob
-
Trip over 11 yearsHow do you pull the variable output of your first code block into your second code block?
-
Trip over 11 yearsAs in, where does @middlevalue come from?
-
Frank Conijn - Support Ukraine about 11 yearsI am a front-end designer with only a basic knowledge of MySQL, and am having a problem with the syntax. After 'FROM' I've only seen come one variable, the name of the table. Does this formula select data from two tables, and if so, how would the formula be if just the median of one data column of one table is required?
-
Rob Van Dam about 11 yearsThis will fail on any substantial amount of data because
GROUP_CONCAT
is limited to 1023 characters, even when used inside another function like this. -
giordano almost 11 yearsBe aware: For even number of values it takes the higher of the two middle values. For odds number of values it takes the next higher value after the median.
-
giordano almost 11 yearsThanks for this! The user should be aware that missing values (NULL) are considered as values. to avoid this problem add 'x IS NOT NULL where condition.
-
saulob over 10 yearsany way to make it to show group values? like: place / median for that place... like select place, median_value from table... any way? thanks
-
Shane N over 10 years@Bryan - I agree with you, that makes much more sense to me. Did you ever find a reason not to do it that way?
-
Brian about 10 years@FrankConijn: It selects from one table twice. The table's name is
data
and it is being used with two names,x
andy
. -
Mikl almost 10 yearsWorks fine for me. 5.6.14 MySQL Community Server. Table with 11M records (about 20Gb on disk), has two not primary indexes (model_id, price). In table (after filtration) we have 500K records to calculate median for. In result we have 30K records (model_id, median_price). Query duration is 1.5-2 seconds. Speed is Fast for me.
-
Przemyslaw Remin about 9 years@giordano In which line of the code
x IS NOT NULL
should be added? -
giordano about 9 years@PrzemyslawRemin Sorry, I was not clear in my statement and I realized now that the SP does already consider the case of missing values. The SP should be called in this way:
CALL median("table","x","x IS NOT NULL")
. -
Rick James over 8 yearsNote that it does a "cross join", which is very slow for large tables.
-
Xenonite over 8 yearsjust saying i stalled my mysqld with this exact query on a table with 33k rows...
-
Ahmed-Anas over 7 years@rowNum will have the 'total count' at the end of the execution. So you can use that if you want to avoid having to do a 'count all' again (which was my case because my query wasn't so simple)
-
Nomiluks about 7 yearsIt is not returning the correct median for even number of values, For example , the median of
{98,102,102,98}
is100
but your code gives102
. It worked fine for odd numbers. -
Nomiluks about 7 yearsDon't you think the median of ` {22,26}` should be 24?
-
Shanemeister about 7 yearsThe logic of having one statement: ( floor((total_rows+1)/2), floor((total_rows+2)/2) ) calculate the rows needed for the median is awesome! Not sure how you thought of that, but it is brilliant. The part I don't follow is the (SELECT @rownum:=0) r -- what purpose does this serve?
-
codepk almost 7 yearsThis does not work as a variable cannot be used in limit clause.
-
Slava Murygin over 6 yearsIt is the best solution. However, for large data sets it will slow down because it re-counts for every item in each set. To make it faster put "COUNT(*)" to separate sub-query.
-
chiliNUT about 6 yearsdoes not work for unsigned columns,
BIGINT UNSIGNED value is out of range in '(`db`.`y`.`val` - `db`.`x`.`val`)'
-
chiliNUT about 6 yearschange the first
WHERE 1
toWHERE d.val IS NOT NULL
so that it excludesNULL
rows to keep this method aligned with the nativeAVG
-
Dustin Sun almost 6 years@Trip middlevalue == SELECT CEIL(COUNT(*)/2) FROM data;
-
Y. Chang almost 6 yearsI believe this only works with a table that has the number of entries is odd. For even number of entries, this may have a problem.
-
kuttumiah almost 6 yearsThis query returns wrong answer for even number of rows.
-
kuttumiah almost 6 yearsThis answer returns nothing for even number of rows.
-
Kem Mason over 5 yearsThis answer doesn't work at all for some data sets, e.g., the trivial data set with values 0.1, 0.1, 0.1, 2 -- it will work if all the values are distinct, but only works if the values
-
Kem Mason over 5 yearsI just tried this myself, and for what it's worth, installing it was super fast / easy, and it worked as advertised, including grouping, e.g. "select name, median(x) FROM t1 group by name" -- github source here: github.com/infusion/udf_infusion
-
Gurpreet.S about 5 yearsMany thanks for your query, but can you please explain your query of Having part?
-
Daniel Buckmaster almost 5 yearsMy value came from a two-table join, so I had to add another subquery in order to make sure the row ordering was correct after the join! The structure was sort of
select avg(value) from (select value, row_number from (select a - b as value from a_table join b_table order by value))
-
davzaman over 4 yearsI know this is very old but for some reason this produces very different results than just moving set @rn:=-1 to the outer select instead of instantiating at 0 inside the inner select. For some reason I could not get the results to match
-
Rick James over 4 yearsAnd, what do you do for the Median of string values?
-
xliiv almost 4 yearsCould you tell what is the t2 table?
-
GuyStalks almost 4 yearsMySQL has window functions now, so this basically works. The only change you need is that you have to take the average of your results in the end.
-
Ma'ruf over 3 years@Y.Chang you are right. This returns nothing for even number of rows
-
Abhishek Sengupta over 3 yearsis the taking into account the odd number scenerio ?
-
Renan Benedicto Pereira about 3 yearsMaybe you can solve it using GROUP_CONCAT (pay attention to the function limitations on
group_concat_max_len
system variable).SUBSTRING_INDEX(SUBSTRING_INDEX(GROUP_CONCAT(value order by value), ',', CEIL(COUNT(0)/2)), ',', -1)
-
Vinamra Bali about 3 yearsQUERY :: I changed the last WHERE clause to
WHERE rownumber IN (FLOOR((rownumber + 1)/2), FLOOR((rownumber + 2)/2));
as it was mentioned that rownumber will have total number of rows after first run. This gives me a different result. Can anyone explain why and also why we introduced two variables instead of one. -
rodrigo-silveira over 2 years@xliiv
t2
is the alias for the second table being selected (namedtable
in the example). The issue with this method is that it does a cross join, but also, doesn't work on my trivial example:ERROR 1690 (22003): BIGINT UNSIGNED value is out of range in ("t.score" - "t2.score")
-
bkr879 about 2 yearsNow, I want to be able to use the above logic as in
median(column)
- is it possible to embed the above as a function / procedure?