Plotting a histogram from pre-counted data in Matplotlib

python matplotlib histogram

28,220

Solution 1

I used pyplot.hist's weights option to weight each key by its value, producing the histogram that I wanted:

# On Python 3, wrap the dict views in list() so they are passed as sequences
pylab.hist(list(counted_data.keys()), weights=list(counted_data.values()), bins=range(50))

This allows me to rely on hist to re-bin my data.
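As a quick sanity check, here is a minimal sketch comparing the weighted call against the expanded raw data from the question. The bin edges (`edges`) are an arbitrary choice for illustration, and the Agg backend is only selected so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to view the plot interactively
import matplotlib.pyplot as plt
import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}
raw_data = [1, 2, 2, 3, 4, 5, 5, 5, 5, 6, 10]

# Illustrative bin edges (an assumption, not part of the original answer)
edges = [0, 2, 4, 6, 8, 10, 12]

fig, (ax_raw, ax_weighted) = plt.subplots(1, 2)

# Histogram of the expanded raw data
raw_counts, _, _ = ax_raw.hist(raw_data, bins=edges)

# Histogram of the pre-counted data: each key is weighted by its count
weighted_counts, _, _ = ax_weighted.hist(
    list(counted_data.keys()),
    weights=list(counted_data.values()),
    bins=edges,
)

# Both approaches should produce the same bin heights
same = np.allclose(raw_counts, weighted_counts)
```

Because hist does the binning, changing `edges` re-bins the pre-counted data without ever expanding it.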

Solution 2

You can use the weights keyword argument of np.histogram (which plt.hist calls underneath):

val, weight = zip(*counted_data.items())
plt.hist(val, weights=weight)

Assuming you only have integers as the keys, you can also use bar directly:

# Built-in min/max iterate over the dict's keys (np.min on a dict view fails on Python 3)
min_bin = min(counted_data)
max_bin = max(counted_data)

bins = np.arange(min_bin, max_bin + 1)
vals = np.zeros(max_bin - min_bin + 1)

for k, v in counted_data.items():
    vals[k - min_bin] = v

plt.bar(bins, vals, ...)

where ... is whatever arguments you want to pass to bar (doc).

If you want to re-bin your data, see "Histogram with separate list denoting frequency".
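For the "roll-up" case, a minimal sketch under the assumption that you want wider bins than one per key: np.histogram with weights sums the per-item counts into each bin, and the result can go straight to bar. The edges in `edges` are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to view the plot interactively
import matplotlib.pyplot as plt
import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}

# Illustrative bin edges (an assumption); each bin is [left, right),
# except the last one, which also includes its right edge
edges = np.array([0, 5, 10, 15])

# Sum the per-item counts that fall into each bin
binned_counts, _ = np.histogram(
    list(counted_data.keys()),
    bins=edges,
    weights=list(counted_data.values()),
)

# Draw one bar per bin, left-aligned on the bin edge and as wide as the bin
bars = plt.bar(edges[:-1], binned_counts, width=np.diff(edges), align="edge")
```

Using `align="edge"` with `width=np.diff(edges)` keeps the bars lined up with the bin boundaries even when the bins are not all the same width.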

Solution 3

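You can also use seaborn to plot the histogram. Note that distplot has been deprecated since seaborn 0.11; in newer versions, histplot accepts a weights argument directly.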

import seaborn as sns

sns.distplot(
    list(counted_data.keys()),
    hist_kws={"weights": list(counted_data.values())},
)

Solution 4

The "bins" array holds the bin edges, so it must have exactly one more element than "counts". Here's a way to fully reconstruct the histogram:

import numpy as np
import matplotlib.pyplot as plt
bins = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).astype(float)
counts = np.array([5, 3, 4, 5, 6, 1, 3, 7]).astype(float)
centroids = (bins[1:] + bins[:-1]) / 2
counts_, bins_, _ = plt.hist(centroids, bins=len(counts),
                             weights=counts, range=(min(bins), max(bins)))
plt.show()
assert np.allclose(bins_, bins)
assert np.allclose(counts_, counts)
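The call above relies on `range` and an integer bin count, which assumes equal-width bins. A sketch of a variant that should also handle unequal widths, following the idiom described in the Matplotlib hist documentation: pass the left edge of each bin as the data, the full edge array as `bins`, and the counts as `weights`. The edge and count values below are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to view the plot interactively
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical pre-binned data with unequal bin widths
bins = np.array([0.0, 1.0, 2.0, 4.0, 8.0])   # 5 edges
counts = np.array([5.0, 3.0, 4.0, 6.0])      # 4 counts, one per bin

# Treat each bin as a single point at its left edge, weighted by its count
counts_, bins_, _ = plt.hist(bins[:-1], bins=bins, weights=counts)

# The plotted histogram reproduces the original edges and counts
roundtrip_ok = np.allclose(bins_, bins) and np.allclose(counts_, counts)
```

On Matplotlib 3.4 or later, `plt.stairs(counts, bins)` is another option for drawing counts against edges without going through hist.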


Author by

Josh Rosen

I'm an Apache Spark committer and PMC member.

Updated on July 09, 2022

Comments

Josh Rosen almost 2 years
I'd like to use Matplotlib to plot a histogram over data that's been pre-counted. For example, say I have the raw data
```
data = [1, 2, 2, 3, 4, 5, 5, 5, 5, 6, 10]
```
Given this data, I can use
```
pylab.hist(data, bins=[...])
```
to plot a histogram.

In my case, the data has been pre-counted and is represented as a dictionary:
```
counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}
```
Ideally, I'd like to pass this pre-counted data to a histogram function that lets me control the bin widths, plot range, etc., as if I had passed it the raw data. As a workaround, I'm expanding my counts into the raw data:
```
from itertools import chain, repeat

# items() replaces the Python 2-only iteritems()
data = list(chain.from_iterable(repeat(value, count)
            for (value, count) in counted_data.items()))
```
This is inefficient when counted_data contains counts for millions of data points.

Is there an easier way to use Matplotlib to produce a histogram from my pre-counted data?

Alternatively, if it's easiest to just bar-plot data that's been pre-binned, is there a convenience method to "roll-up" my per-item counts into binned counts?
Josh Rosen over 10 years

Thanks for the pointer to the weights option; I had overlooked it, but it solves my problem perfectly (see my answer).
tacaswell over 10 years

I hadn't made that connection (got blinded by directly using bar). Edited to reflect your comment.
tacaswell over 10 years

and your way of getting the data out makes more sense than mine. It's fine with me if you accept your own answer.
Ash Berlin-Taylor over 6 years

This was the clue I needed. In my case I have a list of counts, and bin ranges: plt.hist(bins, bins=len(bins), weights=counts) was the invocation I needed
icemtel over 3 years

Word of warning: I have noticed that this gives an incorrect result if the bins have different sizes and density=True is used. Probably not a bug, rather a mathematical difference between pdf and cdf.
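One possible reading of this warning, sketched with np.histogram (which hist uses for the binning): with density=True each bar height is the bin's count divided by the total count and by the bin width, so with unequal widths the heights are no longer proportional to the counts, although the total area is still 1. The edges below are an arbitrary illustration.

```python
import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}
keys = list(counted_data.keys())
weights = list(counted_data.values())

# Illustrative, deliberately unequal bin edges (an assumption)
edges = np.array([0, 2, 6, 12])
widths = np.diff(edges)

# Plain weighted counts per bin
counts, _ = np.histogram(keys, bins=edges, weights=weights)

# Same binning, normalized to a probability density
density, _ = np.histogram(keys, bins=edges, weights=weights, density=True)

# Density height = count / (total count * bin width)
expected = counts / (counts.sum() * widths)
matches = np.allclose(density, expected)

# Bar areas (height * width) sum to 1, even though heights alone do not
total_area = float((density * widths).sum())
```

If you want bar heights proportional to the counts, leave density at its default and normalize the weights yourself.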