Pandas bar plot with binned range

67,007

Solution 1

You can make use of pd.cut to partition the values into bins corresponding to each interval and then take each interval's total counts using pd.value_counts. Plot a bar graph later, additionally replace the X-axis tick labels with the category name to which that particular tick belongs.

out = pd.cut(s, bins=[0, 0.35, 0.7, 1], include_lowest=True)
ax = out.value_counts(sort=False).plot.bar(rot=0, color="b", figsize=(6,4))
ax.set_xticklabels([c[1:-1].replace(","," to") for c in out.cat.categories])
plt.show()

enter image description here


If you want the Y-axis to be displayed as relative percentages, normalize the frequency counts and multiply that result with 100.

out = pd.cut(s, bins=[0, 0.35, 0.7, 1], include_lowest=True)
out_norm = out.value_counts(sort=False, normalize=True).mul(100)
ax = out_norm.plot.bar(rot=0, color="b", figsize=(6,4))
ax.set_xticklabels([c[1:-1].replace(","," to") for c in out.cat.categories])
plt.ylabel("pct")
plt.show()

enter image description here

Solution 2

You may consider using matplotlib to plot the histogram. Unlike pandas' hist function, matplotlib.pyplot.hist accepts an array as input for the bins.

import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt
import pandas as pd

x = np.random.rand(120)
df = pd.DataFrame({"x":x})

bins= [0,0.35,0.7,1]
plt.hist(df.values, bins=bins, edgecolor="k")
plt.xticks(bins)

plt.show()

enter image description here

Solution 3

You can use pd.cut

bins = [0,0.35,0.7,1]
df = df.groupby(pd.cut(df['val'], bins=bins)).val.count()
df.plot(kind='bar')

enter image description here

Share:
67,007

Related videos on Youtube

Arnold Klein
Author by

Arnold Klein

Data science enthusiast and machine learner.

Updated on March 25, 2020

Comments

  • Arnold Klein
    Arnold Klein about 4 years

    Is there a way to create a bar plot from continuous data binned into predefined intervals? For example,

    In[1]: df
    Out[1]: 
    0      0.729630
    1      0.699620
    2      0.710526
    3      0.000000
    4      0.831325
    5      0.945312
    6      0.665428
    7      0.871845
    8      0.848148
    9      0.262500
    10     0.694030
    11     0.503759
    12     0.985437
    13     0.576271
    14     0.819742
    15     0.957627
    16     0.814394
    17     0.944649
    18     0.911111
    19     0.113333
    20     0.585821
    21     0.930131
    22     0.347222
    23     0.000000
    24     0.987805
    25     0.950570
    26     0.341317
    27     0.192771
    28     0.320988
    29     0.513834
    
    231    0.342541
    232    0.866279
    233    0.900000
    234    0.615385
    235    0.880597
    236    0.620690
    237    0.984375
    238    0.171429
    239    0.792683
    240    0.344828
    241    0.288889
    242    0.961686
    243    0.094402
    244    0.960526
    245    1.000000
    246    0.166667
    247    0.373494
    248    0.000000
    249    0.839416
    250    0.862745
    251    0.589873
    252    0.983871
    253    0.751938
    254    0.000000
    255    0.594937
    256    0.259615
    257    0.459916
    258    0.935065
    259    0.969231
    260    0.755814
    

    and instead of a simple histogram:

    df.hist()
    

    usual histogram of df

    I need to create a bar plot, where each bar will count a number of instances within a predefined range. For example, the following plot should have three bars with the number of points which fall into: [0 0.35], [0.35 0.7] [0.7 1.0]

    EDIT

    Many thanks for your answers. Another question, how to order bins? For example, I get the following result:

    In[349]: out.value_counts()
    Out[349]:  
    [0, 0.001]      104
    (0.001, 0.1]     61
    (0.1, 0.2]       32
    (0.2, 0.3]       20
    (0.3, 0.4]       18
    (0.7, 0.8]        6
    (0.4, 0.5]        6
    (0.5, 0.6]        5
    (0.6, 0.7]        4
    (0.9, 1]          3
    (0.8, 0.9]        2
    (1, 1.001]        0
    

    as you can see, the last three bins are not ordered. How to sort the data frame based on 'categories' or my bins?

    EDIT 2

    Just found how to solve it, simply with 'reindex()':

    In[355]: out.value_counts().reindex(out.cat.categories)
    Out[355]: 
    [0, 0.001]      104
    (0.001, 0.1]     61
    (0.1, 0.2]       32
    (0.2, 0.3]       20
    (0.3, 0.4]       18
    (0.4, 0.5]        6
    (0.5, 0.6]        5
    (0.6, 0.7]        4
    (0.7, 0.8]        6
    (0.8, 0.9]        2
    (0.9, 1]          3
    (1, 1.001]        0
    
  • Arnold Klein
    Arnold Klein about 7 years
    and if I need to normalise the plot? (vertical axis should be percentage and not the frequency. Similar to 'normed=True' in .hist()
  • famargar
    famargar over 6 years
    I get TypeError: 'pandas._libs.interval.Interval' object is not subscriptable when trying to change the tick labels
  • Nickil Maveli
    Nickil Maveli over 6 years
    @famargar: If you're on versions >= 0.20.0, the old way of slicing would not work as it has now become an IntervalIndex which has it's own dtype and doesn't support indexing. Hence, the preferred way of doing the same would be to make use of the attributes (left/right) to define the interval of X-axis tick labels.
  • Flavio
    Flavio almost 5 years
    This method here it's less verbose and should be placed as the main answer.
  • ipramusinto
    ipramusinto over 3 years
    Thanks for the answer. A bit change to above code -> str(c)[1:-1].replace(...)
  • Maurício Collaça
    Maurício Collaça over 2 years
    Current pandas 1.3.3 and probably older pandas.hist() does accept a bin sequence: (...)If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin. In this case, bins is returned unmodified.
  • Simon
    Simon over 2 years
    Possibly a noob question: isn't the default mode in pandas using mathplotlib?