Getting data for histogram plot

54,525

Solution 1

This is a post about a super quick-and-dirty way to create a histogram in MySQL for numeric values.

There are multiple other ways to create histograms that are better and more flexible, using CASE statements and other types of complex logic. This method wins me over time and time again since it's just so easy to modify for each use case, and so short and concise. This is how you do it:

SELECT ROUND(numeric_value, -2)    AS bucket,
       COUNT(*)                    AS COUNT,
       RPAD('', LN(COUNT(*)), '*') AS bar
FROM   my_table
GROUP  BY bucket;

Just change numeric_value to whatever your column is, change the rounding increment, and that's it. I've made the bars to be in logarithmic scale, so that they don't grow too much when you have large values.

numeric_value should be offset in the ROUNDing operation, based on the rounding increment, in order to ensure the first bucket contains as many elements as the following buckets.

e.g. with ROUND(numeric_value,-1), numeric_value in range [0,4] (5 elements) will be placed in first bucket, while [5,14] (10 elements) in second, [15,24] in third, unless numeric_value is offset appropriately via ROUND(numeric_value - 5, -1).

This is an example of such query on some random data that looks pretty sweet. Good enough for a quick evaluation of the data.

+--------+----------+-----------------+
| bucket | count    | bar             |
+--------+----------+-----------------+
|   -500 |        1 |                 |
|   -400 |        2 | *               |
|   -300 |        2 | *               |
|   -200 |        9 | **              |
|   -100 |       52 | ****            |
|      0 |  5310766 | *************** |
|    100 |    20779 | **********      |
|    200 |     1865 | ********        |
|    300 |      527 | ******          |
|    400 |      170 | *****           |
|    500 |       79 | ****            |
|    600 |       63 | ****            |
|    700 |       35 | ****            |
|    800 |       14 | ***             |
|    900 |       15 | ***             |
|   1000 |        6 | **              |
|   1100 |        7 | **              |
|   1200 |        8 | **              |
|   1300 |        5 | **              |
|   1400 |        2 | *               |
|   1500 |        4 | *               |
+--------+----------+-----------------+

Some notes: Ranges that have no match will not appear in the count - you will not have a zero in the count column. Also, I'm using the ROUND function here. You can just as easily replace it with TRUNCATE if you feel it makes more sense to you.

I found it here http://blog.shlomoid.com/2011/08/how-to-quickly-create-histogram-in.html

Solution 2

Mike DelGaudio's answer is the way I do it, but with a slight change:

select floor(mycol/10)*10 as bin_floor, count(*)
from mytable
group by 1
order by 1

The advantage? You can make the bins as large or as small as you want. Bins of size 100? floor(mycol/100)*100. Bins of size 5? floor(mycol/5)*5.

Bernardo.

Solution 3

SELECT b.*,count(*) as total FROM bins b 
left outer join table1 a on a.value between b.min_value and b.max_value 
group by b.min_value

The table bins contains columns min_value and max_value which define the bins. note that the operator "join... on x BETWEEN y and z" is inclusive.

table1 is the name of the data table

Solution 4

Ofri Raviv's answer is very close but incorrect. The count(*) will be 1 even if there are zero results in a histogram interval. The query needs to be modified to use a conditional sum:

SELECT b.*, SUM(a.value IS NOT NULL) AS total FROM bins b
  LEFT JOIN a ON a.value BETWEEN b.min_value AND b.max_value
GROUP BY b.min_value;

Solution 5

select "30-34" as TotalRange,count(total) as Count from table_name
   where total between 30 and 34
union (
select "35-39" as TotalRange,count(total) as Count from table_name 
   where total between 35 and 39)
union (
select "40-44" as TotalRange,count(total) as Count from table_name
   where total between 40 and 44)
union (
select "45-49" as TotalRange,count(total) as Count from table_name
   where total between 45 and 49)
etc ....

As long as there are not too many intervals, this is a pretty good solution.

Share:
54,525
Legend
Author by

Legend

Just a simple guy :)

Updated on July 08, 2022

Comments

  • Legend
    Legend almost 2 years

    Is there a way to specify bin sizes in MySQL? Right now, I am trying the following SQL query:

    select total, count(total) from faults GROUP BY total;
    

    The data that is being generated is good enough but there are just too many rows. What I need is a way to group the data into predefined bins. I can do this from a scripting language, but is there a way to do it directly in SQL?

    Example:

    +-------+--------------+
    | total | count(total) |
    +-------+--------------+
    |    30 |            1 | 
    |    31 |            2 | 
    |    33 |            1 | 
    |    34 |            3 | 
    |    35 |            2 | 
    |    36 |            6 | 
    |    37 |            3 | 
    |    38 |            2 | 
    |    41 |            1 | 
    |    42 |            5 | 
    |    43 |            1 | 
    |    44 |            7 | 
    |    45 |            4 | 
    |    46 |            3 | 
    |    47 |            2 | 
    |    49 |            3 | 
    |    50 |            2 | 
    |    51 |            3 | 
    |    52 |            4 | 
    |    53 |            2 | 
    |    54 |            1 | 
    |    55 |            3 | 
    |    56 |            4 | 
    |    57 |            4 | 
    |    58 |            2 | 
    |    59 |            2 | 
    |    60 |            4 | 
    |    61 |            1 | 
    |    63 |            2 | 
    |    64 |            5 | 
    |    65 |            2 | 
    |    66 |            3 | 
    |    67 |            5 | 
    |    68 |            5 | 
    ------------------------
    

    What I am looking for:

    +------------+---------------+
    | total      | count(total)  |
    +------------+---------------+
    |    30 - 40 |            23 | 
    |    40 - 50 |            15 | 
    |    50 - 60 |            51 | 
    |    60 - 70 |            45 | 
    ------------------------------
    

    I guess this cannot be achieved in a straight forward manner but a reference to any related stored procedure would be fine as well.

  • Ofri Raviv
    Ofri Raviv over 14 years
    Why is the syntax coloring for SQL so bad? How can I improve this? Maybe I should post it on meta ;)
  • Cesar
    Cesar over 14 years
    In this case is necessary a template table to define min an max. Only with SQL is not possible.
  • Legend
    Legend over 14 years
    SQL Guru! Exactly what I wanted. I guess care should be taken when creating the bins table. Otherwise everything works perfectly. :) Thank You. I just finished writing a python script but this is just what I needed...
  • Ofri Raviv
    Ofri Raviv over 14 years
    @Legend: Actually, I'm quite a n00b when it comes to SQL. but this was a cool and useful question so I liked the exercise...
  • Legend
    Legend over 14 years
    Thanks... In addition, I will also post a direct python script that utilizes your sql query to generate the data for say, gnuplot
  • Ibrahim Quraish
    Ibrahim Quraish over 11 years
    +1 This is the only solution here that allows bins to be of different size
  • B M
    B M about 10 years
    as carillonator said your group by & order by better should be bin_floor or 1 - Ill upvote if you correct it, this is the best answer for me
  • Bernardo Siu
    Bernardo Siu about 10 years
    Fair enough, @b-m. Changed as suggested by carillonator.
  • Christopher Schultz
    Christopher Schultz over 9 years
    It's important to see @David West's answer (which should have been a comment, here) about how the COUNT(*) produces 1 when when it should produce zero. That may not be a big problem for you, but it can skew statistical data and make you look a little silly if someone notices :)
  • NiRR
    NiRR almost 8 years
    great - no need for additional tables
  • alex9311
    alex9311 almost 8 years
    and if you want a nicer column name you can do concat(floor(mycol/5)*5," to ",floor(mycol/5)*5+5)
  • Jaro
    Jaro over 6 years
    As of MySQL 8.0.3, you now have the ability to create histogram statistics in order to provide more statistics to the optimizer - see mysqlserverteam.com/histogram-statistics-in-mysql
  • enharmonic
    enharmonic over 4 years
    You don't even need the "bar" part of the query; the numbers themselves already form a logarithmic bar chart / histogram.
  • Ryan McCoy
    Ryan McCoy over 4 years
    +1 This is the most flexible solution imo, and seems to best fit the use case of wanting to bin from within SQL. any case where bin ranges need to be programatically derived, its likely better to do that outside of SQL. again imo
  • meridius
    meridius almost 4 years
    This is actually better than simple round(mycol, -2) from the accepted answer since it lets user define any non-decimal "range". I'd just use round instead of floor since it properly rounds the numbers.
  • Rúnar Berg
    Rúnar Berg almost 4 years
    The WITH something AS is very useful if you have to compute the value that goes into the bins.
  • luky
    luky about 3 years
    what is group by 1?
  • Bernardo Siu
    Bernardo Siu about 3 years
    @luky it means that you group by the first field in the select statement (in this case, bin_floor). As far as I know, only mysql implements this syntax.