Summing values in R based on column value with dplyr

r dataframe dplyr

12,945

Solution 1

Check this solution:

df %>%
  gather(key, val, Value1:Value3) %>%
  group_by(Subject) %>%
  mutate(
    Sum = sum(val[c(1:(UniqueNumber[1]))]),
    Mean = mean(val[c(1:(UniqueNumber[1]))]),
  ) %>%
  spread(key, val)

Output:

 Subject UniqueNumber   Sum  Mean Value1 Value2 Value3
  <chr>          <int> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1 001                3     2 0.667      1      0      1
2 002                2     1 0.5        0      1      1
3 003                1     1 1          1      1      1

Solution 2

Not a tidyverse fan/expert, but I would try this using long format. Then, just filter by row index per group and then run any functions you want on a single column (much easier this way).

library(tidyr)
library(dplyr)

Data %>% 
  gather(variable, value, -Subject, -UniqueNumber) %>% # long format
  group_by(Subject) %>% # group by Subject in order to get row counts
  filter(row_number() <= UniqueNumber) %>% # filter by row index
  summarise(Mean = mean(value), Total = sum(value)) %>% # do the calculations
  ungroup() 

## A tibble: 3 x 3
#  Subject  Mean Total
#     <int> <dbl> <int>
# 1       1 0.667     2
# 2       2 0.5       1
# 3       3 1         1

A very similar way to achieve this could be filtering by the integers in the column names. The filter step comes before the group_by so it could potentially increase performance (or not?) but it is less robust as I'm assuming that the cols of interest are called "Value#"

Data %>% 
  gather(variable, value, -Subject, -UniqueNumber) %>% #long format
  filter(as.numeric(gsub("Value", "", variable, fixed = TRUE)) <= UniqueNumber) %>% #filter
  group_by(Subject) %>% # group by Subject
  summarise(Mean = mean(value), Total = sum(value)) %>% # do the calculations
  ungroup()

## A tibble: 3 x 3
#  Subject  Mean Total
#     <int> <dbl> <int>
# 1       1 0.667     2
# 2       2 0.5       1
# 3       3 1         1

Just for fun, adding a data.table solution

library(data.table)

data.table(Data) %>% 
  melt(id = c("Subject", "UniqueNumber")) %>%
  .[as.numeric(gsub("Value", "", variable, fixed = TRUE)) <= UniqueNumber,
    .(Mean = round(mean(value), 3), Total = sum(value)),
    by = Subject]

#    Subject  Mean Total
# 1:       1 0.667     2
# 2:       2 0.500     1
# 3:       3 1.000     1

Solution 3

Here is another method that uses tidyr::nest to collect the Values columns into a list so that we can iterate through the table with map2. In each row, we select the correct values from the Values list-col and take the sum or mean respectively.

library(tidyverse)
tbl <- read_table2(
"Subject    Value1    Value2    Value3      UniqueNumber
001        1         0         1           3
002        0         1         1           2
003        1         1         1           1"
)
tbl %>%
  filter(UniqueNumber > 0) %>%
  nest(starts_with("Value"), .key = "Values") %>%
  mutate(
    sum = map2_dbl(UniqueNumber, Values, ~ sum(.y[1:.x], na.rm = TRUE)),
    mean = map2_dbl(UniqueNumber, Values, ~ mean(as.numeric(.y[1:.x], na.rm = TRUE))),
  )
#> # A tibble: 3 x 5
#>   Subject UniqueNumber Values             sum  mean
#>   <chr>          <dbl> <list>           <dbl> <dbl>
#> 1 001                3 <tibble [1 × 3]>     2 0.667
#> 2 002                2 <tibble [1 × 3]>     1 0.5  
#> 3 003                1 <tibble [1 × 3]>     1 1

^{Created on 2019-02-14 by the reprex package (v0.2.1)}

Solution 4

OP might be interested only for dplyr solution but for comparison purposes and for future readers a base R option using mapply

cols <- grep("^Value", names(df))

cbind(df, t(mapply(function(x, y) {
      if (y > 0) {
        vals = as.numeric(df[x, cols[1:y]])
        c(Sum = sum(vals, na.rm = TRUE), Mean = mean(vals, na.rm = TRUE))
       }
       else 
        c(0, 0)
},1:nrow(df), df$UniqueNumber)))

#  Subject Value1 Value2 Value3 UniqueNumber Sum  Mean
#1       1      1      0      1            3   2 0.667
#2       2      0      1      1            2   1 0.500
#3       3      1      1      1            1   1 1.000

Here we subset each row based on its respective UniqueNumber and then calculate it's sum and mean if the UniqueNumber value is greater than 0 or else return only 0.

Solution 5

A solution that uses purrr::map_df(which is from the same author as dplyr).

library(dplyr)
library(purrr)
l_dat <- split(dat, dat$Subject) # first we need to split in a list

map_df(l_dat, function(x) {
  n_cols <- x$UniqueNumber # finds the number of columns
  x <- as.numeric(x[2:(n_cols+1)]) # subsets x and converts to numeric
  mean(x, na.rm=T) # mean to be returned
})
# output:
# # A tibble: 1 x 3
#     `1`   `2`   `3`
#   <dbl> <dbl> <dbl>
# 1 0.667   0.5     1

Another option (output format closer to a dplyr solution):

map_df(l_dat, function(x) {
  n_cols <- x$UniqueNumber
  id <- x$Subject
  x <- as.numeric(x[2:(n_cols+1)])
  tibble(id=id, mean_values=mean(x, na.rm=T))
})
# # A tibble: 3 x 2
# id mean_values
# <int>       <dbl>
# 1     1       0.667
# 2     2       0.5  
# 3     3       1

Just as an example I added a sum() then divided by length(x)-1:

map_df(l_dat, function(x) {
  n_cols <- x$UniqueNumber
  id <- x$Subject
  x <- as.numeric(x[2:(n_cols+1)])
  tibble(id=id, 
                mean_values=sum(x, na.rm=T)/(length(x)-1)) # change here
})
# # A tibble: 3 x 2
# id mean_values
# <int>       <dbl>
# 1     1          1.
# 2     2          1.
# 3     3        Inf  #beware of this case where you end up dividing by 0

Data:

tt <- "Subject    Value1    Value2    Value3      UniqueNumber
001        1         0         1           3
002        0         1         1           2
003        1         1         1           1"

dat <- read.table(text=tt, header=T)

View more solutions

12,945

Author by

statsguyz

Updated on June 29, 2022

Comments

statsguyz almost 2 years
I have a data set that has the following information:
```
Subject    Value1    Value2    Value3      UniqueNumber
001        1         0         1           3
002        0         1         1           2
003        1         1         1           1
```
If the value of UniqueNumber > 0, I would like to sum the values with dplyr for each subject from rows 1 through UniqueNumber and calculate the mean. So for Subject 001, sum = 2 and mean = .67.
```
total = 0;
average = 0;
for(i in 1:length(Data$Subject)){
   for(j in 1:ncols(Data)){
   if(Data$UniqueNumber[i] > 0){
    total[i] = sum(Data[i,1:j])
    average[i] = mean(Data[i,1:j])
   }
}
```
Edit: I am only looking to sum through the number of columns listed in the 'UniqueNumber' column. So this is looping through every row and stopping at column listed in 'UniqueNumber'. Example: Row 2 with Subject 002 should sum up the values in columns 'Value1' and 'Value2', while Row 3 with Subject 003 should only sum the value in column 'Value1'.
statsguyz about 5 years

Receive the following error when I run your code: Error in 2:(n_cols + 1) : NA/NaN argument
RLave about 5 years

I don't have this error, have you tried it on my example data? If your "UniqueNumber" column is named differently you need to change this part x$UniqueNumber accordingly.
statsguyz about 5 years

Thanks. I my data had a missing column so the code crashed. Went back and fixed the issue, it works!
statsguyz about 5 years

Edit: Looks like few subjects didn't have UniqueValues. Need to check this. Everything is working well!
statsguyz about 5 years

Is there a way to modify this to deal with missing values? Also, is it possible to calculate the mean with a denominator taking in to account the missing values?
David Arenburg about 5 years

What do you mean by missing values? NAs in the Value column? Just add na.rm = TRUE to the functions, e.g. summarise(Mean = mean(value, na.rm = TRUE), Total = sum(value, na.rm = TRUE)). Not sure I understand your second question. Can you please show an example with desired output?
statsguyz about 5 years

Oh ok, that's what I thought. And if I wanted to modify the "Mean" column to modify the denominator of the mean function by +1 or -1, is that possible?
David Arenburg about 5 years

I'm not sure I understand what you mean but you could do summarise(Total = sum(value, na.rm = TRUE), Mean = Total / n())
statsguyz about 5 years

Can you modify the denominator of the 'mean' function so that it divides by 1 less? I need to include the first value (i.e. Value1), but it's a starting point. So I would like to divide by one less in each instance (while still removing the NAs).
David Arenburg about 5 years

How exactly does this gives the correct results? That gives me wrong results when I insert random NAs into the data. For example, try inserting NA into Value1 into the first row.
RLave about 5 years

@statsguyz yes you can, you can do whatever you want inside the function, just change mean() with whatever you like, I'll update with an example.