Using apply to find max in a data frame with missing values and strings

One of the points of using a data frame is that everything in a column must have the same class. If you want to treat your data as numeric, run as.numeric() on each column (going through as.character() first if the column is a factor), and strings like "SON" will be converted to NA with a warning.
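As a minimal sketch of that column-by-column conversion, using the data from the question (the name num_df is only illustrative, and suppressWarnings() is there just to hide the coercion warning triggered by "SON"):

```r
# Same data as in the question
df <- read.table(header = TRUE, text = "
  ID N1 N2 N3 N4
   1 2 3 4 5
  11 NA -12 14 55
  21 12 SON 34 14")

num_df <- df
# as.character() first, so a factor column converts by label rather than by level code;
# suppressWarnings() hides the "NAs introduced by coercion" warning caused by "SON"
num_df[] <- suppressWarnings(
  lapply(df, function(x) as.numeric(as.character(x)))
)

num_df$N2  # "SON" has become NA; 3 and -12 are kept as numbers
```

Assigning into num_df[] keeps the result a data frame, so every column stays a column but is now numeric.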

Data frames are also focused on column-wise operations. If you want to go row-wise, a matrix probably makes more sense:

mat = sapply(df, function(x) as.numeric(as.character(x)))
# as.numeric(as.character()) is necessary when starting with a factor
mat
#      ID N1  N2 N3 N4
# [1,]  1  2   3  4  5
# [2,] 11 NA -12 14 55
# [3,] 21 12  NA 34 14

apply(mat, 1, max, na.rm = TRUE)
# [1]  5 55 34
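For context on why the original apply(df, 1, max, na.rm = TRUE) returned "SON": apply() converts a data frame with as.matrix() before doing anything else, and a matrix can only hold one type, so a single non-numeric column makes everything character. A small sketch of that, with char_mat as an illustrative name:

```r
df <- read.table(header = TRUE, text = "
  ID N1 N2 N3 N4
   1 2 3 4 5
  11 NA -12 14 55
  21 12 SON 34 14")

# This is the conversion apply() performs on a data frame
char_mat <- as.matrix(df)

is.character(char_mat)            # TRUE: the numbers are now strings
max(char_mat[3, ], na.rm = TRUE)  # "SON": strings are compared alphabetically
```

So max() was comparing strings, not numbers, and letters sort after digits.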

Why does R reject na.rm= TRUE when I use checkd but not when I use max in apply?

After the first three arguments, (X, MARGIN, FUN), apply just passes arguments on through to the function you pass to FUN. If you look at the help for ?max, you'll see that it is defined to take an argument called na.rm. Your definition for checkd has no such argument. If you want to add an na.rm argument to your function, you could do it like this:

checkd <- function(x, na.rm = TRUE) if(is.integer(x)) max(x, na.rm = na.rm)
# or even this
checkd <- function(x, ...) if(is.integer(x)) max(x, ...)

Note that this function probably doesn't do what you want - it checks whether the vector you give it - a whole row in your example - is of integer type, and only then returns the max. Since a vector can only have one type, if you have any non-integer in there, is.integer(x) will be false and the max won't be calculated.
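A rough sketch of that behaviour, which also seems to explain why the Result column disappeared in the question: an if without an else returns NULL when the condition is false, and assigning NULL to a data frame column removes it. The small matrix m and data frame d below are made up for illustration:

```r
checkd <- function(x) if (is.integer(x)) max(x)

is.integer(c(1L, 5L))    # TRUE: the whole vector is integer
is.integer(c(1L, 2.5))   # FALSE: one double makes the whole vector double

# A row containing a string is a character vector, so checkd() returns NULL
is.null(checkd(c("21", "SON")))

# Toy character matrix standing in for as.matrix(df)
m <- matrix(c("1", "21", "5", "SON"), nrow = 2)
res <- apply(m, 1, checkd)
is.null(res)             # every row gave NULL, so there is nothing to collect

# Assigning NULL to a column drops that column
d <- data.frame(a = 1:2, Result = c(5, 9))
d$Result <- res
names(d)
```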

I also deleted your == TRUE, which doesn't do anything.
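To illustrate the argument forwarding described above, here is a small sketch with two made-up functions, no_dots and with_dots, on a toy integer matrix. tryCatch() is only there so the error message can be inspected instead of stopping the script:

```r
# Toy integer matrix; rows are (1, 2, 5) and (11, NA, 55)
mat <- matrix(c(1L, 11L, 2L, NA, 5L, 55L), nrow = 2)

no_dots   <- function(x) max(x)            # has no na.rm argument and no ...
with_dots <- function(x, ...) max(x, ...)  # forwards extra arguments to max()

# apply() hands na.rm = TRUE to no_dots(), which cannot accept it
err <- tryCatch(
  apply(mat, 1, no_dots, na.rm = TRUE),
  error = function(e) conditionMessage(e)
)
err  # the "unused argument" message from the question

# With ..., na.rm = TRUE reaches max() and the NA is ignored
row_max <- apply(mat, 1, with_dots, na.rm = TRUE)
row_max
```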

Author by

DomB

Updated on July 16, 2022

Comments

  • DomB
    DomB almost 2 years

    I have the following data set:

    df<-data.frame(read.table(header = TRUE, text = "
         ID N1 N2 N3 N4
          1 2 3 4 5
         11 NA -12 14 55
         21 12 SON 34 14"))
    

    I want to find out what is the max entry in each row. This would be, for example, 5 in the first row. Obviously, the situation is more complicated because of missing values ('NA') and a string ('SON').

    I first tried the following command:

    df$Result<-apply(df,1, max, na.rm= TRUE)
    

    The results are [5,55, SON]! Not what I wanted. I therefore then tried:

    checkd<- function(x) if(is.integer(x)== TRUE)max(x)
    df$Result<-apply(df,1, checkd)
    

    Funnily, it removed the last column df$Result. Does anyone know what I did wrong? Also, what would be the solution to my problem?

    Also, if I try the following code:

    checkd<- function(x) if(is.integer(x)== TRUE)max(x)
    df$Result<-apply(df,1, checkd, na.rm= TRUE)
    

    it gives me Error in FUN(newX[, i], ...) : unused argument (na.rm = TRUE)! Why is that? My function checkd does not otherwise seem to cause R any problems. Why does R reject na.rm= TRUE when I use checkd but not when I use max in apply?

    Thanks,

    Dom

  • DomB
    DomB over 8 years
    Thanks! That is really useful! Just a quick follow-up question. I played around with as.numeric and noticed that it turns the third column (3, -12, SON) into the numerical values 2, 1, 3. Why does it translate 3 into 2 and -12 into 1, rather than keeping 3 and -12? If I wanted the original values, how would I get them? Anyway, thanks so much for your explanations. Super useful!
  • Gregor Thomas
    Gregor Thomas over 8 years
    You probably have a factor to start with. See edits. And see here for more details.
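A short sketch of the factor behaviour behind that follow-up, assuming the column was read in as a factor (the default for strings in read.table before R 4.0). The vector f is just a stand-in for the N2 column:

```r
# Stand-in for the N2 column when it has been read as a factor
f <- factor(c("3", "-12", "SON"))

levels(f)      # levels are sorted as strings: "-12", "3", "SON"
as.numeric(f)  # 2 1 3: the position of each value among the levels

# Going through as.character() converts the labels instead of the codes;
# "SON" cannot be parsed, so it becomes NA (with a coercion warning)
as.numeric(as.character(f))
```

So as.numeric() on a factor returns the internal level codes, which is why as.numeric(as.character(x)) is used in the answer.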