Replace missing values with the mean or mode in R


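Your loop has a few obvious bugs: an unclosed `[` in the numeric branch, an unclosed `(` in the `else if` condition, `df_test$var` (which looks for a column literally named "var") instead of `df_test[,var]`, and a call to `Mode()` whose result is never assigned. It also never handles factor columns. Once those are fixed, it works as intended: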

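# Returns the most frequent value of x, or ">1 mode" if several values tie.
# na.rm is accepted but unused: table() already drops NAs by default.
Mode <- function (x, na.rm) {
    xtab <- table(x)
    xmode <- names(which(xtab == max(xtab)))
    if (length(xmode) > 1) xmode <- ">1 mode"
    return(xmode)
}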

# example data frame:
age <- c(5, 8, 10, 12, NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
df_test <- data.frame(age=age, a=a, b=b)
df_test$b <- as.character(df_test$b)

print(df_test)

#   age    a      b
# 1   5   aa banana
# 2   8   bb  apple
# 3  10 <NA>   pear
# 4  12   cc  grape
# 5  NA   cc   <NA>

for (var in 1:ncol(df_test)) {
    if (class(df_test[,var])=="numeric") {
        # numeric column: replace NAs with the column mean
        df_test[is.na(df_test[,var]),var] <- mean(df_test[,var], na.rm = TRUE)
    } else if (class(df_test[,var]) %in% c("character", "factor")) {
        # character or factor column: replace NAs with the mode
        df_test[is.na(df_test[,var]),var] <- Mode(df_test[,var], na.rm = TRUE)
    }
}

print(df_test)

#     age  a       b
# 1  5.00 aa  banana
# 2  8.00 bb   apple
# 3 10.00 cc    pear
# 4 12.00 cc   grape
# 5  8.75 cc >1 mode

I recommend that you use an editor with syntax highlighting and bracket matching, which would make it easier to find these sorts of syntax errors.
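
The loop above compares `class()` against a single string, so it skips integer columns, and for the ordered factors mentioned in the question `class()` returns two strings, which `if` rejects in R 4.2.0 and later. Below is a minimal sketch of a more defensive variant. It assumes that ties should resolve to the first mode in `table()` order instead of the ">1 mode" placeholder; `first_mode` and `impute_column` are hypothetical helper names, not part of the code above.

```r
# Hypothetical helper: most frequent non-missing value.
# On ties, returns the first maximum in table() order
# (sorted values for character vectors, level order for factors).
first_mode <- function(x) {
    counts <- table(x)                 # table() drops NA by default
    names(counts)[which.max(counts)]   # which.max() returns the first maximum
}

# Hypothetical helper: fill NAs with the mean (numeric columns)
# or the mode (character and factor columns, including ordered factors).
impute_column <- function(x) {
    if (is.numeric(x)) {
        x[is.na(x)] <- mean(x, na.rm = TRUE)
    } else if (is.character(x) || is.factor(x)) {
        x[is.na(x)] <- first_mode(x)
    }
    x
}

df_test <- data.frame(
    age = c(5, 8, 10, 12, NA),
    a   = factor(c("aa", "bb", NA, "cc", "cc")),
    b   = c("banana", "apple", "pear", "grape", NA),
    stringsAsFactors = FALSE
)

# Assigning into df_test[] keeps the data.frame structure.
df_test[] <- lapply(df_test, impute_column)
print(df_test)
```

Using `is.numeric()`, `is.character()` and `is.factor()` avoids depending on the exact class string. Whether a tie should pick one value, as here, or be flagged, as `Mode()` does, depends on the analysis.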

Author by

user971102

Updated on August 21, 2022

Comments

  • user971102, almost 2 years

    I have a large database made up of mixed data types (numeric, character, factor, ordinal factor) with missing values, and I am trying to create a for loop to substitute the missing values using either the mean of the respective column if numerical or the mode if character/factor.

    This is what I have so far:

    #fake array:
    age<- c(5,8,10,12,NA)
    a <- factor(c("aa", "bb", NA, "cc", "cc"))
    b <- c("banana", "apple", "pear", "grape", NA)
    df_test <- data.frame(age=age, a=a, b=b)
    df_test$b <- as.character(df_test$b)
    
    for (var in 1:ncol(df_test)) {
        if (class(df_test[,var])=="numeric") {
            df_test[is.na(df_test[,var]) <- mean(df_test[,var], na.rm = TRUE)
    } else if (class(df_test[,var]=="character") {
            Mode(df_test$var[is.na(df_test$var)], na.rm = TRUE)
    } 
    }
    

    Where 'Mode' is the function:

    Mode <- function (x, na.rm) {
        xtab <- table(x)
        xmode <- names(which(xtab == max(xtab)))
        if (length(xmode) > 1)
            xmode <- ">1 mode"
        return(xmode)
    }
    

    It seems as if it is just ignoring the statements, though, without giving any error… I have also tried to work out the first part with indexes:

    ## create an index of missing values
    index <- which(is.na(df_test)[,1], arr.ind = TRUE)
    ## calculate the row means and "duplicate" them to assign to appropriate cells
    df_test[index] <- colMeans(df_test, na.rm = TRUE) [index["column",]]
    

    But I get this error: "Error in colMeans(df_test, na.rm = TRUE) : 'x' must be numeric"

    Does anybody have any idea how to solve this?

    Thank you very much for all the great help! -f
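
On the `colMeans()` error quoted in the question: `colMeans()` needs every column to be numeric, so it stops when given the factor and character columns of `df_test`. A possible sketch of the index-based approach restricted to the numeric columns follows; `num_cols` and `num_mat` are illustrative names, and the numeric block is converted to a matrix here so that the two-column index from `which(..., arr.ind = TRUE)` can be used for assignment.

```r
df_test <- data.frame(
    age = c(5, 8, 10, 12, NA),
    a   = factor(c("aa", "bb", NA, "cc", "cc")),
    b   = c("banana", "apple", "pear", "grape", NA),
    stringsAsFactors = FALSE
)

# Flag the numeric columns and work on them as a numeric matrix.
num_cols <- vapply(df_test, is.numeric, logical(1))
num_mat  <- as.matrix(df_test[num_cols])

# Row/column positions of the missing cells within the numeric block.
index <- which(is.na(num_mat), arr.ind = TRUE)

# Look up each missing cell's column mean (second index column = column position).
num_mat[index] <- colMeans(num_mat, na.rm = TRUE)[index[, 2]]

# Write the filled numeric columns back; other columns are left untouched.
df_test[num_cols] <- as.data.frame(num_mat)
print(df_test)
```

The factor and character columns still contain their `NA` values after this step and need a separate mode-based fill.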