R data.table: How to sum variables by group based on a condition?
Solution 1
Remember the logic of data.table: dt[i, j, by], that is, take dt, subset rows using i, then calculate j grouped by by.
dt[times != 0 & !is.na(times), sum(Frequency), by = Category]
   Category V1
1:   Second  2
2:    Third 34
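Filtering in i drops any group that has no qualifying rows, which is why First is missing from the output above. If every Category should be reported, one option is to move the condition into j. The following is a minimal sketch, assuming the question's sample data with a seventh times value of 0 (as in the printed table) and a hypothetical result column named total:

```r
library(data.table)

# Sample data from the question; the 7th value of times is assumed to be 0,
# matching the printed table.
dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  times     = c(0, 0, 0, 3, 3, 1, 0)
)

# Condition in i: rows are filtered before grouping, so "First" disappears.
filtered <- dt[times != 0 & !is.na(times), .(total = sum(Frequency)), by = Category]

# Condition in j: every group is kept; a group with no matching rows sums to 0.
all_groups <- dt[, .(total = sum(Frequency[times != 0 & !is.na(times)])), by = Category]
```

Which form is preferable depends on whether empty groups should show up as 0 or be omitted.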
Solution 2
You can use rowsum() for this.
rowsum
Give Column Sums of a Matrix or Data Frame, Based on a Grouping Variable
Compute column sums across rows of a numeric matrix-like object for each level of a grouping variable. rowsum is generic, with a method for data frames and a default method for vectors and matrices.
Keywords: manip
Usage
rowsum(x, group, reorder = TRUE, …)
S3 method for data.frame
rowsum(x, group, reorder = TRUE, na.rm = FALSE, …)
S3 method for default
rowsum(x, group, reorder = TRUE, na.rm = FALSE, …)
Arguments
x
a matrix, data frame or vector of numeric data. Missing values are allowed. A numeric vector will be treated as a column vector.
group
a vector or factor giving the grouping, with one element per row of x. Missing values will be treated as another group and a warning will be given.
reorder
if TRUE, then the result will be in order of sort(unique(group)), if FALSE, it will be in the order that groups were encountered.
na.rm
logical (TRUE or FALSE). Should NA (including NaN) values be discarded?
…
other arguments to be passed to or from methods
Details
The default is to reorder the rows to agree with tapply as in the example below. Reordering should not add noticeably to the time except when there are very many distinct values of group and x has few columns.
The original function was written by Terry Therneau, but this is a new implementation using hashing that is much faster for large matrices.
To sum over all the rows of a matrix (ie, a single group) use colSums, which should be even faster.
For integer arguments, over/underflow in forming the sum results in NA.
Value
A matrix or data frame containing the sums. There will be one row per unique value of group.
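The documentation above does not show how rowsum() applies to the question. One possible reading is to filter the rows first and then pass the remaining values and groups to rowsum(). This is a sketch in base R only, assuming the question's sample data with a seventh times value of 0; the names keep and sums are illustrative:

```r
# Sample data from the question as a plain data.frame.
df <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  times     = c(0, 0, 0, 3, 3, 1, 0),
  stringsAsFactors = FALSE
)

# Logical mask for rows where times is neither NA nor zero.
keep <- !is.na(df$times) & df$times != 0

# rowsum() on a vector returns a one-column matrix with one row per group.
sums <- rowsum(df$Frequency[keep], group = df$Category[keep])
```

The group labels end up as row names of the result, so a single total can be read with, for example, sums["Third", 1].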
Solution 3
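You can use bracket subsetting to select only the rows with non-zero and non-NA values for times, and then run your grouping operation. Note that times > 0 matches "non-zero" only when times is never negative, and that which() is what drops the NA rows here.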
dt[which(dt$times > 0)][, sum(Frequency), by = Category]
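To illustrate how this kind of filter treats missing and negative values, here is a small sketch on hypothetical data (dt2 and its values are made up for the example, not taken from the question). It relies on data.table treating NA in a logical i as FALSE:

```r
library(data.table)

# Hypothetical data with one missing and one negative value in times.
dt2 <- data.table(
  Category  = c("A", "A", "B", "B"),
  Frequency = c(1, 2, 3, 4),
  times     = c(2, NA, 0, -1)
)

# NA in a logical i is treated as FALSE, so which() is not required.
positive <- dt2[times > 0, sum(Frequency), by = Category]

# times != 0 also keeps the negative value; the NA row is still dropped.
non_zero <- dt2[times != 0, sum(Frequency), by = Category]
```

With data like this the two filters give different results, so the choice between > 0 and != 0 matters whenever times can be negative.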
ShanZhengYang
Updated on June 05, 2022
Comments
-
ShanZhengYang almost 2 years
Let's say I have the following R data.table (though I'm happy to work with base R, data.frame as well):
library(data.table)
dt = data.table(Category = c("First", "First", "First", "Second", "Third", "Third", "Second"),
                Frequency = c(10, 15, 5, 2, 14, 20, 3),
                times = c(0, 0, 0, 3, 3, 1, 0))
> dt
   Category Frequency times
1:    First        10     0
2:    First        15     0
3:    First         5     0
4:   Second         2     3
5:    Third        14     3
6:    Third        20     1
7:   Second         3     0
If I wished to sum the Frequencies by Category, I would use the following:
dt[, sum(Frequency), by = Category]
However, let's say I wanted to sum Frequency by Category if and only if times is non-zero and not equal to NA? How would one make this sum conditional on the values of a separate column?
EDIT: apologies for the obvious question. A quick addition: what about if the elements of a certain column are strings?
e.g.
> dt
   Category Frequency times
1:    First       ten     0
2:    First       ten     0
3:    First      five     0
4:   Second      five     3
5:    Third      five     3
6:    Third      five     1
7:   Second       ten     0
sum() will not calculate the frequencies of ten versus five.
-
Roland over 6 years: Well, put your condition into data.table's i argument, obviously.
-
Frank over 6 years: Test your code, which throws a warning. Also, show desired output -- it's not clear if you want groups for which no rows meet the condition or not.
-
sindri_baldur over 6 years: I don't think it is correct to add another question after the fact... rather open a new question for that... but you need to convert the Frequency column into numeric before you do any math on it. Here is an example: stackoverflow.com/questions/18332463/…
-
ShanZhengYang over 6 years: Right. This makes sense now. Question: what if the column Frequency has elements of type character?
-
ShanZhengYang over 6 years: Question: what if the column Frequency has elements of type character?
-
Andrew Brēza over 6 years: @ShanZhengYang how do you want your results to look?
-
ShanZhengYang over 6 years: I guess I'm describing expanding this data.table, such that there is a column for each categorical variable and the counts associated with it.
-
sindri_baldur over 6 years: Well then you can't use sum() unless you convert first into numeric.
-
sindri_baldur over 6 years: You can reduce the first part of the chain to simply dt[times > 0]
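For the follow-up about a character Frequency column, two readings seem possible: counting how often each label occurs per Category, or converting the labels to numbers and summing as before. The sketch below shows both, assuming the edited sample data from the question; the lookup vector and the column name total are hypothetical and only for illustration:

```r
library(data.table)

# Sample data from the edited question: Frequency holds words, not numbers.
dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c("ten", "ten", "five", "five", "five", "five", "ten"),
  times     = c(0, 0, 0, 3, 3, 1, 0)
)

# Option 1: count occurrences of each label per Category (.N is the row count).
counts <- dt[times != 0 & !is.na(times), .N, by = .(Category, Frequency)]

# Option 2: map labels to numbers with a hypothetical lookup, then sum.
lookup <- c(five = 5, ten = 10)
sums <- dt[times != 0 & !is.na(times), .(total = sum(lookup[Frequency])), by = Category]
```

Option 1 needs no conversion at all, while Option 2 only works if every label has an entry in the lookup; an unmapped label would produce NA in the sum.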