Plotting a histogram from pre-counted data in Matplotlib

Solution 1

I used pyplot.hist's weights option to weight each key by its value, producing the histogram that I wanted:

pylab.hist(list(counted_data.keys()), weights=list(counted_data.values()), bins=range(50))

This allows me to rely on hist to re-bin my data.
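As a quick sanity check, the sketch below compares a weighted histogram over the distinct keys with a plain histogram over the expanded raw data. It uses np.histogram rather than hist so that nothing is drawn, and the bin edges are an assumption chosen only for illustration.

```python
import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}

# Expand the counts back into raw observations, only to get a reference result.
raw = [value for value, count in counted_data.items() for _ in range(count)]

edges = np.arange(0, 13, 3)  # illustrative bin edges: 0, 3, 6, 9, 12

# Weighted histogram over the distinct keys.
weighted, _ = np.histogram(list(counted_data.keys()),
                           bins=edges,
                           weights=list(counted_data.values()))

# Plain histogram over the expanded raw data.
reference, _ = np.histogram(raw, bins=edges)

print(weighted, reference)
```

Both arrays should hold the same per-bin totals, so the expansion step can be skipped.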

Solution 2

You can use the weights keyword argument to np.histogram (which plt.hist calls underneath):

val, weight = zip(*[(k, v) for k,v in counted_data.items()])
plt.hist(val, weights=weight)

Assuming you only have integers as the keys, you can also use bar directly:

min_bin = min(counted_data)
max_bin = max(counted_data)

bins = np.arange(min_bin, max_bin + 1)
vals = np.zeros(max_bin - min_bin + 1)

for k,v in counted_data.items():
    vals[k - min_bin] = v

plt.bar(bins, vals, ...)

where ... stands for whatever arguments you want to pass to bar (doc).

If you want to re-bin your data, see "Histogram with separate list denoting frequency".
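One possible way to roll per-key counts up into wider bins and still plot with bar is to let np.histogram sum the weights per bin. This is a sketch, not necessarily what the linked question does; the bin edges are an illustrative assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}

edges = np.array([0, 2, 4, 6, 8, 10, 12])  # illustrative bin edges

# Sum the per-key counts that fall into each bin.
binned, _ = np.histogram(list(counted_data.keys()),
                         bins=edges,
                         weights=list(counted_data.values()))

# Draw one bar per bin, starting at the bin's lower edge and spanning its width.
plt.bar(edges[:-1], binned, width=np.diff(edges), align="edge")
```

Call plt.show() or plt.savefig afterwards to see the figure. Passing width=np.diff(edges) keeps the bars correct even if the bins are not all the same width.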

Solution 3

You can also use seaborn to plot the histogram (note that distplot is deprecated as of seaborn 0.11):

import seaborn as sns

sns.distplot(
    list(counted_data.keys()),
    hist_kws={"weights": list(counted_data.values())},
)

Solution 4

The "bins" array should be one element longer than the "counts" array, because it holds the bin edges. Here's a way to fully reconstruct the histogram:

import numpy as np
import matplotlib.pyplot as plt
bins = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).astype(float)
counts = np.array([5, 3, 4, 5, 6, 1, 3, 7]).astype(float)
centroids = (bins[1:] + bins[:-1]) / 2
counts_, bins_, _ = plt.hist(centroids, bins=len(counts),
                             weights=counts, range=(min(bins), max(bins)))
plt.show()
assert np.allclose(bins_, bins)
assert np.allclose(counts_, counts)
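The example above relies on equal-width bins, since bins=len(counts) together with range produces evenly spaced edges. If the edges are unevenly spaced, a variation that should still work is to pass the edges themselves as bins. The sketch below checks this with np.histogram, which accepts the same bins and weights arguments; the edge and count values are made up for illustration.

```python
import numpy as np

edges = np.array([0.0, 1.0, 2.5, 5.0, 10.0])  # illustrative, unequal widths
counts = np.array([5.0, 3.0, 4.0, 2.0])       # one count per bin

# One representative point per bin: its midpoint.
centers = (edges[1:] + edges[:-1]) / 2

# Passing the edges directly keeps each midpoint inside its own bin.
rebuilt, rebuilt_edges = np.histogram(centers, bins=edges, weights=counts)

print(rebuilt, rebuilt_edges)
```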
Author by

Josh Rosen

I'm an Apache Spark committer and PMC member.

Updated on July 09, 2022

Comments

  • Josh Rosen
    Josh Rosen almost 2 years

    I'd like to use Matplotlib to plot a histogram over data that's been pre-counted. For example, say I have the raw data

    data = [1, 2, 2, 3, 4, 5, 5, 5, 5, 6, 10]
    

    Given this data, I can use

    pylab.hist(data, bins=[...])
    

    to plot a histogram.

    In my case, the data has been pre-counted and is represented as a dictionary:

    counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}
    

    Ideally, I'd like to pass this pre-counted data to a histogram function that lets me control the bin widths, plot range, etc., as if I had passed it the raw data. As a workaround, I'm expanding my counts into the raw data:

    from itertools import chain, repeat
    data = list(chain.from_iterable(repeat(value, count)
                for (value, count) in counted_data.items()))
    

    This is inefficient when counted_data contains counts for millions of data points.

    Is there an easier way to use Matplotlib to produce a histogram from my pre-counted data?

    Alternatively, if it's easiest to just bar-plot data that's been pre-binned, is there a convenience method to "roll-up" my per-item counts into binned counts?

  • Josh Rosen
    Josh Rosen over 10 years
    Thanks for the pointer to the weights option; I had overlooked it, but it solves my problem perfectly (see my answer).
  • tacaswell
    tacaswell over 10 years
    I hadn't made that connection (got blinded by directly using bar). Edited to reflect your comment.
  • tacaswell
    tacaswell over 10 years
    and your way of getting the data out makes more sense than mine. It's fine with me if you accept your own answer.
  • Ash Berlin-Taylor
    Ash Berlin-Taylor over 6 years
    This was the clue I needed. In my case I have a list of counts, and bin ranges: plt.hist(bins, bins=len(bins), weights=counts) was the invocation I needed
  • icemtel
    icemtel over 3 years
    Word of warning: I have noticed that this gives an incorrect result if the bins have different sizes and density=True is used. It is probably not a bug, but rather a mathematical difference between pdf and cdf.
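
One plausible reading of the last comment: with density=True, each bin's height is its count divided by the total count times the bin width, so bins holding equal counts get unequal heights when their widths differ. The sketch below illustrates that normalisation with np.histogram; the edges and counts are assumptions chosen for illustration and are not taken from the comment.

```python
import numpy as np

edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0])  # illustrative, unequal widths
counts = np.array([4.0, 4.0, 4.0, 4.0])      # same count in every bin
centers = (edges[1:] + edges[:-1]) / 2

# Density-normalised weighted histogram.
density, _ = np.histogram(centers, bins=edges, weights=counts, density=True)

# The same values computed by hand: count / (total count * bin width).
by_hand = counts / (counts.sum() * np.diff(edges))

print(density)  # heights differ even though the counts are equal
```

The heights integrate to 1 over the full range, which is what a density is expected to do, but they no longer look like the raw counts.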