Understand the `Reduce` function

Solution 1

Reduce takes a binary function and a list of data items and successively applies the function to the list elements, feeding each result back in as the first argument together with the next element. For example:

Reduce(intersect,list(a,b,c))

is the same as

intersect(intersect(a,b),c)

However, I don't think that construct will help you here as it will only return those elements that are common to all vectors.
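
To make the folding concrete, here is a minimal sketch of the same left-to-right reduction written as an explicit loop. The helper `my_reduce` and the sample vectors `x1`, `x2`, `x3` are made up for illustration; the real `Reduce` also supports `init`, `right` and `accumulate`, which this sketch ignores.

```r
# Hypothetical helper, not part of base R: a stripped-down left fold.
my_reduce <- function(f, x) {
  acc <- x[[1]]                 # start with the first element
  for (item in x[-1]) {         # then fold in the remaining elements one by one
    acc <- f(acc, item)
  }
  acc
}

# Made-up sample data
x1 <- c(1, 2, 3, 4)
x2 <- c(2, 3, 4, 5)
x3 <- c(3, 4, 5, 6)

res_loop   <- my_reduce(intersect, list(x1, x2, x3))
res_reduce <- Reduce(intersect, list(x1, x2, x3))
res_nested <- intersect(intersect(x1, x2), x3)
print(res_reduce)
```

All three results hold only the values common to every vector, which is why this construct answers "present in all" rather than "present in at least two".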

To count the number of vectors that a gene appears in you could do the following:

vlist <- list(v1,v2,v3,v4,v5)
addmargins(
  table(gene = unlist(vlist),
        vec  = rep(paste0("v", 1:5), times = sapply(vlist, length))),
  2,
  list(Count = function(x) sum(x[x > 0])))
       vec
gene    v1 v2 v3 v4 v5 Count
  geneA  1  1  0  1  0     3
  geneB  1  0  0  0  1     2
  geneC  0  1  0  0  1     2
  geneD  0  0  1  0  0     1
  geneE  0  0  1  1  0     2
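
The role of `rep(..., times = ...)` in that call may be easier to see in isolation. The sketch below assumes each vector holds just the two genes shown in the table above (the vectors in the question are elided, so this is an assumption) and uses `rowSums` as a simpler, swapped-in way to get the same per-gene count.

```r
# Assumed inputs: two genes per vector, consistent with the table above
v1 <- c("geneA", "geneB")
v2 <- c("geneA", "geneC")
v3 <- c("geneD", "geneE")
v4 <- c("geneA", "geneE")
v5 <- c("geneB", "geneC")
vlist <- list(v1, v2, v3, v4, v5)

# One label per gene: "v1" repeated length(v1) times, "v2" repeated length(v2) times, ...
vec <- rep(paste0("v", 1:5), times = sapply(vlist, length))
print(vec)

# Cross-tabulate gene by source vector, then count the vectors each gene occurs in
tab <- table(gene = unlist(vlist), vec = vec)
counts <- rowSums(tab > 0)
print(counts)
```

Because `unlist(vlist)` and `vec` have the same length and order, each gene is paired with the label of the vector it came from.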

Solution 2

A nice way to see what Reduce() is doing is to run it with accumulate=TRUE. It then returns a vector or list whose n-th element is the result after processing the first n elements of x. Here are a couple of examples:

Reduce(`*`, x=list(5,4,3,2), accumulate=TRUE)
# [1]   5  20  60 120

i2 <- seq(0,100,by=2)
i3 <- seq(0,100,by=3)
i5 <- seq(0,100,by=5)
Reduce(intersect, x=list(i2,i3,i5), accumulate=TRUE)
# [[1]]
#  [1]   0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36
# [20]  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74
# [39]  76  78  80  82  84  86  88  90  92  94  96  98 100
# 
# [[2]]
#  [1]  0  6 12 18 24 30 36 42 48 54 60 66 72 78 84 90 96
# 
# [[3]]
# [1]  0 30 60 90
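
The same trick helps with other binary functions, as long as the function returns the running value rather than a logical. As a small sketch with sample numbers, folding with `min` yields a running minimum, which matches `cummin()`. A comparison operator such as `<` would instead return `TRUE`/`FALSE`, and that logical is what gets carried into the next step.

```r
x <- c(3, 4, 7, 2, 6, 8, 9)   # sample data for illustration

# Each step computes min(previous result, next element)
running_min <- Reduce(min, x, accumulate = TRUE)
print(running_min)

# Base R's cummin() computes the same running minimum directly
print(cummin(x))
```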

Solution 3

Assuming the input values given at the end of this answer, the expression

Reduce(intersect,list(a,b,c,d,e))
## character(0)

gives the genes that are present in all vectors, not the genes that are present in at least two vectors. The call is equivalent to:

intersect(intersect(intersect(intersect(a, b), c), d), e)
## character(0)
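
To see where the result becomes empty, one can add `accumulate = TRUE` to the same call. The sketch below repeats the inputs from the note at the end of this answer, stored in a list called `L` so that it runs on its own:

```r
# Same inputs as in the note at the end of this answer
L <- list(c("geneA", "geneB"),
          c("geneA", "geneC"),
          c("geneD", "geneE"),
          c("geneA", "geneE"),
          c("geneB", "geneC"))

# One list element per step: the intersection of the first n vectors
steps <- Reduce(intersect, L, accumulate = TRUE)
print(steps)

# Number of genes still shared after each step
print(lengths(steps))
```

The running intersection is already empty once the third vector is folded in, so every later step stays empty.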

If we want the genes that are in at least two vectors:

L <- list(a, b, c, d, e)
u <- unlist(lapply(L, unique)) # or:  Reduce(c, lapply(L, unique))

tab <- table(u)
names(tab[tab > 1])
## [1] "geneA" "geneB" "geneC" "geneE"

or

sort(unique(u[duplicated(u)]))
## [1] "geneA" "geneB" "geneC" "geneE"

Note: The inputs used were:

a <- c("geneA","geneB")
b <- c("geneA","geneC")
c <- c("geneD","geneE")
d <- c("geneA","geneE")
e <- c("geneB","geneC")

Author: Johnathan

Updated on August 30, 2020

Comments

  • Johnathan over 3 years

    I have a question about the Reduce function in R. I have read its documentation, but I am still a bit confused. I have 5 vectors of gene names. For example:

    v1 <- c("geneA","geneB",""...)
    v2 <- c("geneA","geneC",""...)
    v3 <- c("geneD","geneE",""...)
    v4 <- c("geneA","geneE",""...)
    v5 <- c("geneB","geneC",""...)
    

    And I would like to find out which genes are present in at least two vectors. Some people have suggested:

    Reduce(intersect,list(a,b,c,d,e))
    

    I would greatly appreciate it if someone could explain to me how this statement works, because I have seen Reduce used in other scenarios.

  • Johnathan about 9 years
    Thank you very much for your input. I have never used the table and addmargins functions before. If you don't mind, I'd like to ask you about them.
  • Johnathan about 9 years
    table: so gene is the object that can be used as a factor (i.e. categorical data), and vec is the names of the dimensions (i.e. "v1", "v2"), right? I am confused about what times means. It returns a vector of lengths. As for addmargins, it is a function that extends a table to add the marginal totals (i.e. total counts of the cases over the categories of interest), right? "2" means add a column that will hold the row marginal totals, right? Finally, the last argument is a list that contains the function. Thank you for your time and help!
  • James about 9 years
    @Johnathan Yes, you are right. times is an argument to rep which determines how many times each element gets repeated - this is to ensure that the genes are mapped to the correct variable.
  • Johnathan about 9 years
    Thank you for your input. I have read R's documentation: rep replicates the values in x (e.g. v1...). I am still confused by length. Apparently, times is a vector giving the number of times to repeat each element if of length length(x). For example, if v2 has length 2, shouldn't it mean that each element is repeated twice? I don't understand how this ensures that the genes are mapped to the correct variable. Sorry for the confusion. I am sure that it is something obvious. Thank you!
  • user3507767 almost 8 years
    Can it be used with greater-than comparisons? I tried it and I got the first number in the sequence followed by either 11111 or 00000. My expected values for something like Reduce('<',c(3,4,7,2,6,8,9), accumulate=T) would have been 3 3 2 2 2 2. Is this achievable with Reduce? A bit of explaining: I assumed Reduce takes the elements of a vector 2 at a time starting from the left. Since my function is "less than", it would compare the first 2, returning the smaller number, then compare that to the 3rd, returning the smaller...
  • user3507767 almost 8 years
    Just to clarify, I know I can get my desired results with cummin(); I am just trying to understand Reduce() here.
  • Vlady Veselinov about 6 years
    It's very weird how the argument types are flipped. It would feel much better if it was consistent with the convention "apply(list, function)".