Pandas bar plot with binned range

python pandas histogram bar-chart

67,007

Solution 1

You can use pd.cut to partition the values into bins, one per interval, and then count the values in each interval with value_counts. Plot the counts as a bar graph, then replace the X-axis tick labels with the name of the category each tick belongs to.

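import matplotlib.pyplot as plt
import pandas as pd
# s is the Series of values to bin
out = pd.cut(s, bins=[0, 0.35, 0.7, 1], include_lowest=True)
ax = out.value_counts(sort=False).plot.bar(rot=0, color="b", figsize=(6,4))
# str(c) is needed on pandas >= 0.20, where each category is an Interval
ax.set_xticklabels([str(c)[1:-1].replace(",", " to") for c in out.cat.categories])
plt.show()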
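As noted in the comments, on pandas 0.20.0 and later each category is an Interval object, so the labels can also be built from its left and right attributes rather than by slicing a string. The following is a minimal sketch under that assumption; the sample values in `s` are hypothetical stand-ins for the question's data, and the `labels` name is only illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample data standing in for the question's series `s`.
s = pd.Series([0.0, 0.11, 0.26, 0.34, 0.50, 0.58, 0.69, 0.73, 0.95, 1.0])

out = pd.cut(s, bins=[0, 0.35, 0.7, 1], include_lowest=True)
counts = out.value_counts(sort=False)  # one count per bin, in bin order

# Each category is an Interval; build the label from its left/right edges
# instead of slicing its string representation.
labels = [f"{iv.left:g} to {iv.right:g}" for iv in out.cat.categories]

ax = counts.plot.bar(rot=0, color="b", figsize=(6, 4))
ax.set_xticklabels(labels)
# Call plt.show() to display the figure in an interactive session.
```

Note that with include_lowest=True pandas shifts the left edge of the first bin slightly below 0, so the first label will show that adjusted edge.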

If you want the Y-axis to show relative percentages, normalize the frequency counts and multiply the result by 100.

out = pd.cut(s, bins=[0, 0.35, 0.7, 1], include_lowest=True)
out_norm = out.value_counts(sort=False, normalize=True).mul(100)
ax = out_norm.plot.bar(rot=0, color="b", figsize=(6,4))
ax.set_xticklabels([str(c)[1:-1].replace(",", " to") for c in out.cat.categories])
plt.ylabel("pct")
plt.show()

Solution 2

You may consider using matplotlib to plot the histogram. matplotlib.pyplot.hist accepts a sequence of bin edges for the bins argument, so the predefined intervals can be passed directly.

import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt
import pandas as pd

x = np.random.rand(120)
df = pd.DataFrame({"x":x})

bins= [0,0.35,0.7,1]
plt.hist(df.values, bins=bins, edgecolor="k")
plt.xticks(bins)

plt.show()
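As one of the comments points out, pandas' own hist also accepts a sequence of bin edges. The sketch below assumes random data similar to the example above; the use of np.histogram to read off the per-bin counts is an addition for illustration and not part of the original answer.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical random data in [0, 1), similar to the example above.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(120)})

bins = [0, 0.35, 0.7, 1]

# Series.hist forwards `bins` to matplotlib, so a list of edges works.
ax = df["x"].hist(bins=bins, edgecolor="k", grid=False)
ax.set_xticks(bins)

# np.histogram returns the same per-bin counts without drawing anything.
counts, edges = np.histogram(df["x"], bins=bins)
# Call plt.show() to display the figure in an interactive session.
```

The heights of the three drawn bars should match `counts`, which is a quick way to check that the bins were applied as intended.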

Solution 3

You can use pd.cut together with groupby:

bins = [0,0.35,0.7,1]
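# observed=False keeps empty bins and avoids a FutureWarning on recent pandas
counts = df.groupby(pd.cut(df['val'], bins=bins), observed=False).val.count()
counts.plot(kind='bar')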

Arnold Klein

Data science enthusiast and machine learner.

Updated on March 25, 2020

Comments

Arnold Klein about 4 years

Is there a way to create a bar plot from continuous data binned into predefined intervals? For example,

In[1]: df
Out[1]: 
0      0.729630
1      0.699620
2      0.710526
3      0.000000
4      0.831325
5      0.945312
6      0.665428
7      0.871845
8      0.848148
9      0.262500
10     0.694030
11     0.503759
12     0.985437
13     0.576271
14     0.819742
15     0.957627
16     0.814394
17     0.944649
18     0.911111
19     0.113333
20     0.585821
21     0.930131
22     0.347222
23     0.000000
24     0.987805
25     0.950570
26     0.341317
27     0.192771
28     0.320988
29     0.513834

231    0.342541
232    0.866279
233    0.900000
234    0.615385
235    0.880597
236    0.620690
237    0.984375
238    0.171429
239    0.792683
240    0.344828
241    0.288889
242    0.961686
243    0.094402
244    0.960526
245    1.000000
246    0.166667
247    0.373494
248    0.000000
249    0.839416
250    0.862745
251    0.589873
252    0.983871
253    0.751938
254    0.000000
255    0.594937
256    0.259615
257    0.459916
258    0.935065
259    0.969231
260    0.755814

and instead of a simple histogram:

df.hist()

I need to create a bar plot in which each bar counts the number of instances within a predefined range. For example, the plot should have three bars showing the number of points that fall into [0, 0.35], [0.35, 0.7] and [0.7, 1.0].

EDIT

Many thanks for your answers. Another question: how can I order the bins? For example, I get the following result:

In[349]: out.value_counts()
Out[349]:  
[0, 0.001]      104
(0.001, 0.1]     61
(0.1, 0.2]       32
(0.2, 0.3]       20
(0.3, 0.4]       18
(0.7, 0.8]        6
(0.4, 0.5]        6
(0.5, 0.6]        5
(0.6, 0.7]        4
(0.9, 1]          3
(0.8, 0.9]        2
(1, 1.001]        0

As you can see, the bins after (0.3, 0.4] are not in order. How can I sort the result by 'categories', that is, by my bins?

EDIT 2

Just found how to solve it, simply with 'reindex()':

In[355]: out.value_counts().reindex(out.cat.categories)
Out[355]: 
[0, 0.001]      104
(0.001, 0.1]     61
(0.1, 0.2]       32
(0.2, 0.3]       20
(0.3, 0.4]       18
(0.4, 0.5]        6
(0.5, 0.6]        5
(0.6, 0.7]        4
(0.7, 0.8]        6
(0.8, 0.9]        2
(0.9, 1]          3
(1, 1.001]        0
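Besides reindex(), two other ways to get the counts in bin order are to skip the frequency sort or to sort by the index afterwards. The following is a minimal sketch with hypothetical data; the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical data: one value in the first bin, two in the second, three in the third.
s = pd.Series([0.05, 0.95, 0.95, 0.95, 0.5, 0.5])
out = pd.cut(s, bins=[0, 0.35, 0.7, 1], include_lowest=True)

by_freq = out.value_counts()                # default: sorted by count, descending
by_bin = out.value_counts(sort=False)       # keeps the category (bin) order
by_bin_sorted = out.value_counts().sort_index()  # sorts back into bin order
```

Both `by_bin` and `by_bin_sorted` follow the order of the bins, so either can be plotted directly as a bar chart.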

Arnold Klein about 7 years

And what if I need to normalise the plot? The vertical axis should show percentages rather than frequencies, similar to 'normed=True' in .hist().

famargar over 6 years

I get TypeError: 'pandas._libs.interval.Interval' object is not subscriptable when trying to change the tick labels.

Nickil Maveli over 6 years

@famargar: If you're on versions >= 0.20.0, the old way of slicing no longer works, because the categories are now an IntervalIndex, which has its own dtype and doesn't support indexing. The preferred way is to use each interval's left/right attributes to define the X-axis tick labels.

Flavio almost 5 years

This method is less verbose and should be the main answer.

ipramusinto over 3 years

Thanks for the answer. A small change to the above code: str(c)[1:-1].replace(...)

Maurício Collaça over 2 years

In current pandas (1.3.3, and probably older versions too), pandas' hist() does accept a bin sequence: (...) If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin. In this case, bins is returned unmodified.

Simon over 2 years

Possibly a noob question: doesn't pandas use matplotlib by default?