Formatting histogram x-axis when working with dates using R

19,092

Solution 1

Since you effectively challenged us to provide a ggplot solution, here it is:

dates <- seq(as.Date("2011-10-01"), length.out=60, by="+1 day")

set.seed(1)
dat <- data.frame(
  suburb <- rep(LETTERS[24:26], times=c(100, 200, 300)),
  Date_of_Onset <- c(
    sample(dates-30, 100, replace=TRUE),
    sample(dates,    200, replace=TRUE),
    sample(dates+30, 300, replace=TRUE)
  )
)

library(scales)
library(ggplot2)
ggplot(dat, aes(x=Date_of_Onset, fill=suburb)) + 
  stat_bin(binwidth=1, position="identity") + 
  scale_x_date(breaks=date_breaks(width="1 month"))

Note the use of position="identity" to force each bar to originate on the axis, otherwise you get a stacked chart by default.

enter image description here

Solution 2

There are 2 available solutions; 1 using hist() and the other using ggplot():

library(date)
hist(dat.geo$Date_of_Onset[(dat.geo$suburb=="x")], "weeks", 
 format = "%d %b %y", freq=T, col=rgb(0,0,0,1), axes=F, main="")
hist(dat.geo$Date_of_Onset[(dat.geo$suburb=="y")], "weeks", 
 format = "%d %b %y", freq=T, main="", col=rgb(1,1,1,.6), add=T, axes=F)
axis.Date(1, at=seq(as.Date("2011-10-10"), as.Date("2012-03-19"), by="2 weeks"),
 format="%d %b %y")
axis.Date(1, at=seq(as.Date("2011-10-10"), as.Date("2012-03-19"), by="weeks"), 
 labels=F, tcl= -0.5)

This epidemic curve is as below:

enter image description here

A solution using ggplot, suggested by Andrie above, is as follows:

library(scales)
library(ggplot2)
ggplot(dat.geo,aes(x=Date_of_Onset, group=suburb, fill=suburb))+
 stat_bin(colour="black", binwidth=1, alpha=0.5,
 position="identity") + theme_bw()+
 xlab("Date of onset of symptoms")+
 ylab("Number of cases")+
 scale_x_date(breaks=date_breaks("1 month"), labels=date_format("%b %y"))

which gives an epidemic curve as below:

enter image description here

Share:
19,092
Jonny
Author by

Jonny

Updated on June 29, 2022

Comments

  • Jonny
    Jonny about 2 years

    I am in the process of creating an epidemic curve (histogram of number of cases of a disease per day) using R, and am struggling a little with formatting the x-axis.

    I am aware that ggplot gives very nice graphs and easily manipulatable axes ( Understanding dates and plotting a histogram with ggplot2 in R ), but in this case I prefer to use the hist() command, because I am describing 2 different patterns at the same time, as below (I don't think you can do something similar in ggplot):

    enter image description here

    The problem here is that the x-axis does not begin at the first case, has too many tick marks, and I'd like to be able to have a systematic date marker, eg. every 7 days, or every 1st of the month.

    The data are stored in a database (dat.geo) as one row per suspected case, with info on date of onset and suburb (whether black or white in histogram), as below:

    > head(dat.geo)
      number age sex       suburb Date_of_Onset
    1      1  12   F            x    2011-10-11
    2      2  28   M            x    2011-10-10
    3      3  15   F            x    2011-10-12
    4      4  12   M            y    2011-10-25
    5      5  10   F            x    2011-10-15
    6      6   9   M            y    2011-10-20
    

    Here is my code:

    pdf(file='1.epi.curve.pdf')
    hist(dat.geo$Date_of_Onset[(dat.geo$suburb=="x")], "days", 
     format = "%d %b %y", freq=T, col=rgb(0,0,0,1), axes=T, main="", add=T)
    hist(dat.geo$Date_of_Onset[(dat.geo$suburb=="y")], "days", 
     format = "%d %b %y", freq=T, main="", col=rgb(1,1,1,.6), add=T, axes=F)
    dev.off()
    

    I have tried suppressing the axis and adding a manipulated one later using this code

    axis(1, labels=T)
    axis(2)
    

    but this is what I get (and I have no idea how to manipulate that):

    enter image description here

    Your help is greatly appreciated!

    thanks

  • Jonny
    Jonny almost 12 years
    Thanks for your efforts - that does look good, but I prefer the original solution for this particular problem, because you can see the two curves in their entirety. I'll remember this technique for future use, though
  • Andrie
    Andrie almost 12 years
    @jpolonsky Can you please explain what you mean by being able to see the two curves in their entirety? Do you propose to stack the bars? Or to dodge the bars? Or something else? Any of these options are possible in ggplot too.
  • Jonny
    Jonny almost 12 years
    My apologies, I have just seen that by adding alpha=0.5 to the stat_bin command, you can get transparencies, which was really what I was after. Thanks for this excellent solution!
  • Andrie
    Andrie almost 12 years
    @jpolonsky Glad to be of help!
  • Jonny
    Jonny almost 12 years
    just one more thing - your code works fine up until the last line; + scale_x_date(breaks=date_breaks(width="1 month")) when I get an error message: Error in ggplot(dat.geo, aes(x = Date_of_Onset, fill = suburb)) + stat_bin(binwidth = 1, : non-numeric argument to binary operator In addition: Warning message: Incompatible methods ("+.ggplot", "Ops.date") for "+" If I delete everything from the + onwards, it works fine, just the x axis is not ideal. Do you know what might be the cause? Do you get this message as well? Thanks
  • Andrie
    Andrie almost 12 years
    @jpolonsky I didn't get that message when I posted the code. What is your ggplot2 version?
  • Jonny
    Jonny almost 12 years
    I'm using version 0.9.1, which seems to be a current version
  • Andrie
    Andrie almost 12 years
    @jpolonsky Me too. I just ran my code again in a clean session and didn't get any errors.
  • Ben Bolker
    Ben Bolker almost 12 years
    @jpolonsky, you might be leaving line breaks in the wrong places: make sure that each line ends with +, or surround the whole ggplot call in parentheses ...
  • Jonny
    Jonny almost 12 years
    Thanks for the suggestions - I have located the problem; there seems to be an interaction between the 'dates' and the 'scales' packages. With both loaded, scales does not work, but without 'dates' loaded, it works just fine.