R Setting Y Axis to Count Distinct in ggplot2

I think you're missing a step where you summarize the data to get the quantities to plot on the y-axis. Here's an example with some toy data similar to how you describe yours:

# Make toy data with three levels of employment type
set.seed(1)
df <- data.frame(Entity.ID = rep(LETTERS[1:10], 3), Degree.Year = rep(seq(1990, 1992), each=10),
    Degree.Type = sample(c("grad", "undergrad"), 30, replace=TRUE),
    Employment.Data.Type = sample(as.character(1:3), 30, replace=TRUE))

# Here's the part you're missing, where you summarize for plotting 
library(dplyr)
dfsum <- df %>%
    group_by(Degree.Year, Employment.Data.Type) %>%
    tally()

# Now plot that, using the counts as your y values
library(ggplot2)
ggplot(dfsum, aes(x = Degree.Year, y = n, fill = Employment.Data.Type)) +
    geom_bar(stat="identity") + labs(fill="Employment")

The result could use some fine-tuning, but I think it's what you mean. Here, the bars are of equal height because each year in the toy data includes an equal number of IDs; if the count of IDs varied, so would the total bar height.
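
Note that tally() counts rows, which matches the number of distinct IDs only when each ID appears at most once per year and employment type, as in the toy data. If an ID can repeat within a group, something like the following sketch may be closer to a true distinct count. It assumes the same column names as the toy data; the data frame df_dup and its values are made up for illustration, with one deliberate duplicate.

```r
library(dplyr)

# Hypothetical data: ID "A" appears twice in 1990 with employment type "1"
df_dup <- data.frame(
    Entity.ID = c("A", "A", "B", "C", "A", "B"),
    Degree.Year = c(1990, 1990, 1990, 1990, 1991, 1991),
    Employment.Data.Type = c("1", "1", "1", "2", "1", "2")
)

# Count distinct IDs (not rows) per year and employment type
dfsum_distinct <- df_dup %>%
    group_by(Degree.Year, Employment.Data.Type) %>%
    summarise(n = n_distinct(Entity.ID), .groups = "drop")
```

The resulting n column can be mapped to y in the same way as the tally() output.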

If you don't want to add objects to your workspace, just do the summarizing in the call to ggplot():

ggplot(tally(group_by(df, Degree.Year, Employment.Data.Type)),
    aes(x = Degree.Year, y = n, fill = Employment.Data.Type)) +
    geom_bar(stat="identity") + labs(fill="Employment")
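
Since the question asks for a stacked area plot rather than bars, the same kind of summary could feed geom_area(). This is only a sketch and not part of the answer as posted: it rebuilds the toy data, assumes the tidyr package is installed, and uses complete() to add zero counts for any year/type combination that happens to be absent, because stacked areas need a value for every group at every x position. The names dfarea and p_area are made up here.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Same toy data as above
set.seed(1)
df <- data.frame(Entity.ID = rep(LETTERS[1:10], 3),
    Degree.Year = rep(seq(1990, 1992), each = 10),
    Degree.Type = sample(c("grad", "undergrad"), 30, replace = TRUE),
    Employment.Data.Type = sample(as.character(1:3), 30, replace = TRUE))

# Count rows per year and type, then fill absent combinations with n = 0
dfarea <- df %>%
    count(Degree.Year, Employment.Data.Type) %>%
    complete(Degree.Year, Employment.Data.Type, fill = list(n = 0))

# geom_area() stacks the fill groups by default
p_area <- ggplot(dfarea, aes(x = Degree.Year, y = n, fill = Employment.Data.Type)) +
    geom_area() +
    scale_x_continuous(breaks = 1990:1992) +
    labs(fill = "Employment")
p_area
```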
Author by

KWalker

Updated on August 04, 2022

Comments

  • KWalker over 1 year

    I have a data frame that contains 4 variables: an ID number (chr), a degree type (factor w/ 2 levels of Grad and Undergrad), a degree year (chr with year), and Employment Record Type (factor w/ 6 levels).

    I would like to display this data as a count of the unique ID numbers by year as a stacked area plot of the 6 Employment Record Types. So, count of # of ID numbers on the y-axis, degree year on the x-axis, the value of x being number of IDs for that year, and the fill will handle the Record Type. I am using ggplot2 in RStudio.

    I used the following code, but the y axis does not count distinct IDs:

    ggplot(AlumJobStatusCopy, aes(x=Degree.Year, y=Entity.ID,
           fill=Employment.Data.Type)) + geom_freqpoly() +
           scale_fill_brewer(palette="Blues",
           breaks=rev(levels(AlumJobStatusCopy$Employment.Data.Type)))
    

    I also tried changing y = Entity.ID to y = ..count.., but that did not work either. I have searched for solutions; the problem seems to be in how I am writing the aes() code.

    I also tried the following code based on examples of similar plots:

    ggplot(AlumJobStatusCopy, aes(interval)) + 
          geom_area(aes(x=Degree.Year, y = Entity.ID, 
                        fill = Employment.Data.Type)) +
          scale_fill_brewer(palette="Blues",
              breaks=rev(levels(AlumJobStatusCopy$Employment.Data.Type)))
    

    This does not even seem to work. I've read the documentation and am at my wit's end.

    EDIT:

    After figuring out the answer to the problem, I realized that I was not actually using the correct values for my Year variable. A count tells me nothing as I am trying to display the rise in a lack of records and the decline in current records.

    My Dataset:
    Year, int, 1960-2015
    Current Record, num: % of total records that are current
    No Record, num: % of total records that are not current

    Ergo, each Year value has two corresponding percent values. I am now using 2 lines instead of an area plot since the y-axis has distinct values instead of a count function, but I would still like the area under the curves filled. I tried using melt to convert the data from wide to long, but was still unable to fill both lines. Filling is just for aesthetic purposes, as I would like to use a gradient for each, with one fill being slightly lighter than the other.

    Here is my current code:

    ggplot(Alum, aes(Year)) + 
        geom_line(aes(y = Percent.Records, colour = "Percent.Records")) +
        geom_line(aes(y = Percent.No.Records, colour = "Percent.No.Records")) + 
        scale_y_continuous(labels = percent) + ylab('Percent of Total Records') + 
        ggtitle("Active, Living Alumni Employment Record") +
        scale_x_continuous(breaks=seq(1960, 2014, by=5))
    

    I cannot post an image yet.
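
For the follow-up in the edit, one possible approach is sketched below; it is not from the original thread. It assumes a data frame shaped like the one described (Year, Percent.Records, Percent.No.Records), filled here with made-up values stored as proportions. It reshapes the data to long format with tidyr's pivot_longer(), used here in place of melt, and draws semi-transparent, unstacked areas under both lines. The names Alum_toy, Alum_long, Record.Status, and Percent are hypothetical.

```r
library(tidyr)
library(ggplot2)

# Made-up stand-in for the Alum data; values are proportions between 0 and 1
Alum_toy <- data.frame(Year = 1960:2015)
Alum_toy$Percent.Records <- seq(0.8, 0.2, length.out = nrow(Alum_toy))
Alum_toy$Percent.No.Records <- 1 - Alum_toy$Percent.Records

# Wide to long: one row per year and record status
Alum_long <- pivot_longer(Alum_toy,
    cols = c(Percent.Records, Percent.No.Records),
    names_to = "Record.Status", values_to = "Percent")

# position = "identity" overlays the two areas instead of stacking them
p_fill <- ggplot(Alum_long, aes(x = Year, y = Percent)) +
    geom_area(aes(fill = Record.Status), position = "identity", alpha = 0.4) +
    geom_line(aes(colour = Record.Status)) +
    scale_y_continuous(labels = scales::percent) +
    scale_x_continuous(breaks = seq(1960, 2015, by = 5)) +
    ylab("Percent of Total Records") +
    ggtitle("Active, Living Alumni Employment Record")
p_fill
```

Two shades of one hue could then be assigned with scale_fill_manual(), which would approximate the "slightly lighter" fill described above.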