Stacked histogram from already summarized counts using ggplot2
Solution 1
Very quickly, you can do what the OP wants using the stat="identity"
option and the plyr
package to manually calculate the histogram, like so:
library(plyr)
X$mid <- floor(X$C/20)*20+10
X_plot <- ddply(X, .(mid), summarize, total=length(C), split=sum(C1)/sum(C)*length(C))
ggplot(data=X_plot) + geom_histogram(aes(x=mid, y=total), fill="blue", stat="identity") + geom_histogram(aes(x=mid, y=split), fill="deeppink", stat="identity")
We basically just make a 'mids' column for how to locate the columns and then make two plots: one with the count for the total (C) and one with the columns adjusted to the count of one of the columns (C1). You should be able to customize from here.
Update 1: I realized I made a small error in calculating the mids. Fixed now. Also, I don't know why I used a 'ddply' statement to calculate the mids. That was silly. The new code is clearer and more concise.
Update 2: I returned to view a comment and noticed something slightly horrifying: I was using sums as the histogram frequencies. I have cleaned up the code a little and also added suggestions from the comments concerning the coloring syntax.
Solution 2
Here's a hack using ggplot_build
. The idea is to first get your old/original plot:
p <- ggplot(data = X, aes(x=C)) + geom_histogram()
stored in p
. Then, use ggplot_build(p)$data[[1]]
to extract the data, specifically, the columns xmin
and xmax
(to get the same breaks/binwidths of histogram) and count
column (to normalize the percentage by count
. Here's the code:
# get old plot
p <- ggplot(data = X, aes(x=C)) + geom_histogram()
# get data of old plot: cols = count, xmin and xmax
d <- ggplot_build(p)$data[[1]][c("count", "xmin", "xmax")]
# add a id colum for ddply
d$id <- seq(nrow(d))
How to generate data now? What I understand from your post is this. Take for example the first bar in your plot. It has a count of 2 and it extends from xmin = 147
to xmax = 156.8
. When we check X
for these values:
X[X$C >= 147 & X$C <= 156.8, ] # count = 2 as shown below
# C1 C2 C
# 19 91 63 154
# 75 86 70 156
Here, I compute (91+86)/(154+156)*(count=2) = 1.141935
and (63+70)/(154+156) * (count=2) = 0.8580645
as the two normalised values for each bar we'll generate.
require(plyr)
dd <- ddply(d, .(id), function(x) {
t <- X[X$C >= x$xmin & X$C <= x$xmax, ]
if(nrow(t) == 0) return(c(0,0))
p <- colSums(t)[1:2]/colSums(t)[3] * x$count
})
# then, it just normal plotting
require(reshape2)
dd <- melt(dd, id.var="id")
ggplot(data = dd, aes(x=id, y=value)) +
geom_bar(aes(fill=variable), stat="identity", group=1)
And this is the original plot:
And this is what I get:
Edit: If you also want to get the breaks proper, then, you can get the corresponding x
coordinates from the old plot and use it here instead of id
:
p <- ggplot(data = X, aes(x=C)) + geom_histogram()
d <- ggplot_build(p)$data[[1]][c("count", "x", "xmin", "xmax")]
d$id <- seq(nrow(d))
require(plyr)
dd <- ddply(d, .(id), function(x) {
t <- X[X$C >= x$xmin & X$C <= x$xmax, ]
if(nrow(t) == 0) return(c(x$x,0,0))
p <- c(x=x$x, colSums(t)[1:2]/colSums(t)[3] * x$count)
})
require(reshape2)
dd.m <- melt(dd, id.var="V1", measure.var=c("V2", "V3"))
ggplot(data = dd.m, aes(x=V1, y=value)) +
geom_bar(aes(fill=variable), stat="identity", group=1)
Solution 3
How about:
library("reshape2")
mm <- melt(X[,1:2])
ggplot(mm,aes(x=value,fill=variable))+geom_histogram(position="stack")
Paul J Hurtado
Updated on June 07, 2022Comments
-
Paul J Hurtado almost 2 years
I would like some help coloring a ggplot2 histogram generated from already-summarized count data.
The data are something like counts of # males and # females living in a number of different areas. It's easy enough to plot the histogram for the total counts (i.e. males + females):
set.seed(1) N=100; X=data.frame(C1=rnbinom(N,15,0.1), C2=rnbinom(N,15,0.1),C=rep(0,N)); X$C=X$C1+X$C2; ggplot(X,aes(x=C)) + geom_histogram()
However, I'd like to color each bar according to the relative contribution from C1 and C2, so that I get the same histogram (i.e. overall bar heights) as in the above example, plus I see the proportion of type "C1" and "C2" individuals as in a stacked bar chart.
Suggestions for a clean way to do this with ggplot2, using data like "X" in the example?
-
Paul J Hurtado about 11 yearsI don't think that works, unfortunately. Overall distribution is different. I'd like to keep counts of, e.g., 100 individuals in the 100 bin, but color the overall breakdown of M and F in that bin.
-
Dinre about 11 years@PaulJHurtado I think you misunderstand Ben's code. The total counts will be exactly the same for each bin, since they will be stacked. The 'melt' function just condenses the data and then the histogram option
position="stack"
puts the variables on top of each other. The total height will be the same. I'll add some detail to Ben's answer to hopefully make it clearer. -
Paul J Hurtado about 11 yearsThanks for the effort @Dinre. Be sure to run the code example I posted and compare. Ben's example gives a different overall distribution.
-
Dinre about 11 yearsAh... found it. It's a matter of scaling and not a matter of values being different. In the original post, you are spreading out the data by using the total, which is fine, but it's inaccurate once you split into groups. Splitting the data into groups, Ben's approach is the more accurate one, because it shows you the distribution of both groups individually and then stacks. Is there some reason you are trying to avoid this?
-
Dinre about 11 years@PaulJHurtado If you really want to preserve the original stack, speak up and I will write up a different function for you. We'll have to flip over to calculating the stacks ourselves and using
stat="identity"
in order to do something like that. -
Paul J Hurtado about 11 yearsI get that Ben's approach is 99% of the time the more appropriate graphic, and is much more in line with ways of doing a formal analysis on such data, however in this particular case I'm primarily interested in plotting total distribution colored as described. If it's easy enough to code up, and you have time to kill, I won't hold you back! ;-)
-
Ben Bolker about 11 yearsthis is good except that your legend is wacky. Start with
geom_histogram(aes(x=mid, y=total), fill="blue")
(i.e. put thefill
specification outside the mapping); then you will need to figure out how to add the guide (legend) manually. -
Dinre about 11 years@BenBolker Yeah, it's just a quick solution to get the data displaying correctly. Now, the OP just needs to customize from here.
-
russellpierce over 10 yearsWhat is your solution doing that
require(reshape2);ggplot(melt(X,id.vars="C"),aes(x=C,fill=variable)) + geom_histogram()
does not do? -
5th over 5 yearsSince few people use
plyr
andreshape2
these days, I created a version of @Arun's answer withtidyr
andlapply
in this answer