Getting data for histogram plot
Solution 1
This is a post about a super quick-and-dirty way to create a histogram in MySQL for numeric values.
There are multiple other ways to create histograms that are better and more flexible, using CASE statements and other types of complex logic. This method wins me over time and time again since it's just so easy to modify for each use case, and so short and concise. This is how you do it:
SELECT ROUND(numeric_value, -2) AS bucket, COUNT(*) AS COUNT, RPAD('', LN(COUNT(*)), '*') AS bar FROM my_table GROUP BY bucket;
Just change numeric_value to whatever your column is, change the rounding increment, and that's it. I've made the bars to be in logarithmic scale, so that they don't grow too much when you have large values.
numeric_value should be offset in the ROUNDing operation, based on the rounding increment, in order to ensure the first bucket contains as many elements as the following buckets.
e.g. with ROUND(numeric_value,-1), numeric_value in range [0,4] (5 elements) will be placed in first bucket, while [5,14] (10 elements) in second, [15,24] in third, unless numeric_value is offset appropriately via ROUND(numeric_value - 5, -1).
This is an example of such query on some random data that looks pretty sweet. Good enough for a quick evaluation of the data.
+--------+----------+-----------------+ | bucket | count | bar | +--------+----------+-----------------+ | -500 | 1 | | | -400 | 2 | * | | -300 | 2 | * | | -200 | 9 | ** | | -100 | 52 | **** | | 0 | 5310766 | *************** | | 100 | 20779 | ********** | | 200 | 1865 | ******** | | 300 | 527 | ****** | | 400 | 170 | ***** | | 500 | 79 | **** | | 600 | 63 | **** | | 700 | 35 | **** | | 800 | 14 | *** | | 900 | 15 | *** | | 1000 | 6 | ** | | 1100 | 7 | ** | | 1200 | 8 | ** | | 1300 | 5 | ** | | 1400 | 2 | * | | 1500 | 4 | * | +--------+----------+-----------------+
Some notes: Ranges that have no match will not appear in the count - you will not have a zero in the count column. Also, I'm using the ROUND function here. You can just as easily replace it with TRUNCATE if you feel it makes more sense to you.
I found it here http://blog.shlomoid.com/2011/08/how-to-quickly-create-histogram-in.html
Solution 2
Mike DelGaudio's answer is the way I do it, but with a slight change:
select floor(mycol/10)*10 as bin_floor, count(*)
from mytable
group by 1
order by 1
The advantage? You can make the bins as large or as small as you want. Bins of size 100? floor(mycol/100)*100
. Bins of size 5? floor(mycol/5)*5
.
Bernardo.
Solution 3
SELECT b.*,count(*) as total FROM bins b
left outer join table1 a on a.value between b.min_value and b.max_value
group by b.min_value
The table bins contains columns min_value and max_value which define the bins. note that the operator "join... on x BETWEEN y and z" is inclusive.
table1 is the name of the data table
Solution 4
Ofri Raviv's answer is very close but incorrect. The count(*)
will be 1
even if there are zero results in a histogram interval. The query needs to be modified to use a conditional sum
:
SELECT b.*, SUM(a.value IS NOT NULL) AS total FROM bins b
LEFT JOIN a ON a.value BETWEEN b.min_value AND b.max_value
GROUP BY b.min_value;
Solution 5
select "30-34" as TotalRange,count(total) as Count from table_name
where total between 30 and 34
union (
select "35-39" as TotalRange,count(total) as Count from table_name
where total between 35 and 39)
union (
select "40-44" as TotalRange,count(total) as Count from table_name
where total between 40 and 44)
union (
select "45-49" as TotalRange,count(total) as Count from table_name
where total between 45 and 49)
etc ....
As long as there are not too many intervals, this is a pretty good solution.
Comments
-
Legend almost 2 years
Is there a way to specify bin sizes in MySQL? Right now, I am trying the following SQL query:
select total, count(total) from faults GROUP BY total;
The data that is being generated is good enough but there are just too many rows. What I need is a way to group the data into predefined bins. I can do this from a scripting language, but is there a way to do it directly in SQL?
Example:
+-------+--------------+ | total | count(total) | +-------+--------------+ | 30 | 1 | | 31 | 2 | | 33 | 1 | | 34 | 3 | | 35 | 2 | | 36 | 6 | | 37 | 3 | | 38 | 2 | | 41 | 1 | | 42 | 5 | | 43 | 1 | | 44 | 7 | | 45 | 4 | | 46 | 3 | | 47 | 2 | | 49 | 3 | | 50 | 2 | | 51 | 3 | | 52 | 4 | | 53 | 2 | | 54 | 1 | | 55 | 3 | | 56 | 4 | | 57 | 4 | | 58 | 2 | | 59 | 2 | | 60 | 4 | | 61 | 1 | | 63 | 2 | | 64 | 5 | | 65 | 2 | | 66 | 3 | | 67 | 5 | | 68 | 5 | ------------------------
What I am looking for:
+------------+---------------+ | total | count(total) | +------------+---------------+ | 30 - 40 | 23 | | 40 - 50 | 15 | | 50 - 60 | 51 | | 60 - 70 | 45 | ------------------------------
I guess this cannot be achieved in a straight forward manner but a reference to any related stored procedure would be fine as well.
-
Ofri Raviv over 14 yearsWhy is the syntax coloring for SQL so bad? How can I improve this? Maybe I should post it on meta ;)
-
Cesar over 14 yearsIn this case is necessary a template table to define min an max. Only with SQL is not possible.
-
Legend over 14 yearsSQL Guru! Exactly what I wanted. I guess care should be taken when creating the bins table. Otherwise everything works perfectly. :) Thank You. I just finished writing a python script but this is just what I needed...
-
Ofri Raviv over 14 years@Legend: Actually, I'm quite a n00b when it comes to SQL. but this was a cool and useful question so I liked the exercise...
-
Legend over 14 yearsThanks... In addition, I will also post a direct python script that utilizes your sql query to generate the data for say, gnuplot
-
Ibrahim Quraish over 11 years+1 This is the only solution here that allows bins to be of different size
-
B M about 10 yearsas carillonator said your group by & order by better should be bin_floor or 1 - Ill upvote if you correct it, this is the best answer for me
-
Bernardo Siu about 10 yearsFair enough, @b-m. Changed as suggested by carillonator.
-
Christopher Schultz over 9 yearsIt's important to see @David West's answer (which should have been a comment, here) about how the COUNT(*) produces 1 when when it should produce zero. That may not be a big problem for you, but it can skew statistical data and make you look a little silly if someone notices :)
-
NiRR almost 8 yearsgreat - no need for additional tables
-
alex9311 almost 8 yearsand if you want a nicer column name you can do
concat(floor(mycol/5)*5," to ",floor(mycol/5)*5+5)
-
Jaro over 6 yearsAs of MySQL 8.0.3, you now have the ability to create histogram statistics in order to provide more statistics to the optimizer - see mysqlserverteam.com/histogram-statistics-in-mysql
-
enharmonic over 4 yearsYou don't even need the "bar" part of the query; the numbers themselves already form a logarithmic bar chart / histogram.
-
Ryan McCoy over 4 years+1 This is the most flexible solution imo, and seems to best fit the use case of wanting to bin from within SQL. any case where bin ranges need to be programatically derived, its likely better to do that outside of SQL. again imo
-
meridius almost 4 yearsThis is actually better than simple
round(mycol, -2)
from the accepted answer since it lets user define any non-decimal "range". I'd just useround
instead offloor
since it properly rounds the numbers. -
Rúnar Berg almost 4 yearsThe
WITH something AS
is very useful if you have to compute the value that goes into the bins. -
luky about 3 yearswhat is group by 1?
-
Bernardo Siu about 3 years@luky it means that you group by the first field in the select statement (in this case, bin_floor). As far as I know, only mysql implements this syntax.