K-Means clustering in R error

Solution 1

This error message has several possible causes, most commonly non-finite values (NA, NaN, Inf) or date columns in the data. Let's go through them.

But first, let's check that kmeans works with the mtcars dataset, since I will be using it:

kmeans(mtcars, 3)
K-means clustering with 3 clusters of sizes 9, 7, 16
--- lengthy output omitted

Likely problem 1: non-finite values (NA/NaN/Inf)

df <- mtcars
df[1,1] <- NA
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

df[1,1] <- Inf
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

df[1,1] <- NaN
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

You can check for these values using the following (note that is.na is also TRUE for NaN, which is why it reports 2):

df[1:3,1] <- c(NA, Inf, NaN) # one NA, one Inf, one NaN
sum(sapply(df, is.na))
[1] 2
sum(sapply(df, is.infinite))
[1] 1
sum(sapply(df, is.nan))
[1] 1
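
If the totals above are non-zero, a per-column breakdown shows which variables need attention. This is only a minimal sketch; the name problem_counts is made up for illustration and is not part of the original answer:

```r
df <- mtcars
df[1:3, 1] <- c(NA, Inf, NaN)  # same test data as above

# Count problematic values per column.
# is.na() is TRUE for both NA and NaN, so the "na" column includes NaN.
problem_counts <- data.frame(
  na       = sapply(df, function(col) sum(is.na(col))),
  nan      = sapply(df, function(col) sum(is.nan(col))),
  infinite = sapply(df, function(col) sum(is.infinite(col)))
)

# Show only the columns that contain at least one problematic value.
# Here only the mpg row is returned: na = 2, nan = 1, infinite = 1.
problem_counts[rowSums(problem_counts) > 0, ]
```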

To get rid of these, we can remove the corresponding observations. But note that complete.cases does not remove Inf:

complete_df <- df[complete.cases(df),]
sum(sapply(complete_df, is.infinite))
[1] 1

Instead, keep only the rows in which every value is finite, e.g.

df[apply(sapply(df, is.finite), 1, all),]

You can also reassign these values or impute them, but this is a whole different procedure.
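
As a rough illustration of the imputation route, the sketch below replaces every non-finite entry with the median of the finite values in the same column. Median imputation is just one simple choice and may not suit your data; the helper impute_median is hypothetical and does not come from any package:

```r
df <- mtcars
df[1:3, 1] <- c(NA, Inf, NaN)

# Hypothetical helper: replace NA/NaN/Inf with the median of the
# finite values in the same column.
impute_median <- function(col) {
  bad <- !is.finite(col)
  col[bad] <- median(col[!bad])
  col
}

imputed_df <- df
imputed_df[] <- lapply(df, impute_median)  # keeps row and column names

set.seed(1)  # kmeans picks random starting centres
fit <- kmeans(imputed_df, 3)
table(fit$cluster)
```

This keeps all observations, which matters when dropping incomplete rows would discard most of the data.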

Likely problem 2: dates. A Date column triggers the same error:

library(lubridate)
df <- mtcars
df$date <- seq.Date(from=ymd("1990-01-01"), length.out = nrow(df), by=1)
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(df, 3) : NAs introduced by coercion
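
A minimal sketch of why the date column causes this, assuming kmeans converts its input with as.matrix and then coerces it to double (which is what the coercion warning suggests). Base R's as.Date is used here instead of lubridate so the snippet has no dependencies:

```r
df <- mtcars
df$date <- seq(as.Date("1990-01-01"), by = "day", length.out = nrow(df))

# A data frame with a Date column becomes a character matrix.
m <- as.matrix(df)
typeof(m)  # "character"

# Coercing to double turns strings such as "1990-01-01" into NA.
suppressWarnings(storage.mode(m) <- "double")
sum(is.na(m[, "date"]))  # 32, i.e. every date is now NA
```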

You can get around this problem by excluding the dates or by converting the dates to something else, e.g.

df$newdate <- seq_along(df$date)
df$date <- NULL
kmeans(df, 3)
K-means clustering with 3 clusters of sizes 9, 7, 16
--- lengthy output omitted

Or you can coerce the dates to numeric yourself before you pass them to kmeans:

df <- mtcars
df$date <- seq.Date(from=ymd("1990-01-01"), length.out = nrow(df), by=1)
df$date <- as.numeric(df$date)
kmeans(df, 3)
K-means clustering with 3 clusters of sizes 9, 16, 7
--- lengthy output omitted

Solution 2

Check the data type of every variable you are clustering on. Most likely the error arises because one of the columns is non-numeric. Also convert date columns to numeric (or drop them) before you cluster.
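
One possible way to carry out this check is sketched below. The date and grp columns are added purely for illustration, and dropping the non-numeric columns is only one option; you may prefer to recode them instead:

```r
df <- mtcars
df$date <- seq(as.Date("1990-01-01"), by = "day", length.out = nrow(df))
df$grp <- factor(rep(c("a", "b"), length.out = nrow(df)))  # hypothetical factor column

# Inspect the class of every column; kmeans() needs numeric input.
sapply(df, function(col) class(col)[1])

# Convert dates to numeric (days since 1970-01-01), then keep
# only the numeric columns.
df$date <- as.numeric(df$date)
numeric_df <- df[sapply(df, is.numeric)]

set.seed(1)  # kmeans picks random starting centres
fit <- kmeans(numeric_df, 3)
table(fit$cluster)
```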

Author: zsad512

Business Analytics Masters Student with specialization in Data Science.

Updated on June 04, 2022

Comments

  • zsad512, almost 2 years

    I have a dataset that I have created in R. It is structured as follows:

    > head(btc_data)
               Date btc_close eth_close vix_close gold_close DEXCHUS change
    1647 2010-07-18      0.09        NA        NA         NA      NA      0
    1648 2010-07-19      0.08        NA     25.97    115.730      NA     -1
    1649 2010-07-20      0.07        NA     23.93    116.650      NA     -1
    1650 2010-07-21      0.08        NA     25.64    115.850      NA      1
    1651 2010-07-22      0.05        NA     24.63    116.863      NA     -1
    1652 2010-07-23      0.06        NA     23.47    116.090      NA      1
    

    I am trying to cluster the observations using k-means. However, I get the following error message:

    > km <- kmeans(trainingDS, 3)
    Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
    In addition: Warning message:
    In storage.mode(x) <- "double" : NAs introduced by coercion 
    

    What does this mean? Am I preprocessing the data incorrectly? What can I do to fix it? I can't drop the NAs because, out of 4500 initial observations, if I run complete.cases I am left with only 100 observations.

    Essentially, I am hoping that 3 clusters will form based on the change column, which has values of -1, 0, and 1. I then wish to analyze the components of each cluster to find the strongest predictors of change. What other algorithms would be most useful for doing this?

    I also tried to remove all the NA values using the following code, but I still get the same error message:

    > complete_cases <- btc_data[complete.cases(btc_data), ]
    > km <- kmeans(complete_cases, 3, nstart = 20)
    Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
    In addition: Warning message:
    In storage.mode(x) <- "double" : NAs introduced by coercion
    
    > sum(!sapply(btc_data, is.finite)) 
    [1] 8008
    > sum(sapply(btc_data, is.nan))
    [1] 0
    > 
    > sum(!sapply(complete_cases, is.finite)) 
    [1] 0
    > sum(sapply(complete_cases, is.nan))
    [1] 0
    

    Here is the format of the data:

    > sapply(btc_data, class)
          Date  btc_close  eth_close  vix_close gold_close    DEXCHUS     change 
        "Date"  "numeric"  "numeric"  "numeric"  "numeric"  "numeric"   "factor"