K-Means clustering in R error
Solution 1
There are several possible reasons for this error message, most commonly invalid values (NA, NaN, Inf) or date columns in the data. Let's go through them:
But first, let's check that it works with the mtcars dataset, since I will be using it:
kmeans(mtcars, 3)
K-means clustering with 3 clusters of sizes 9, 7, 16
--- lengthy output omitted
Likely problem 1: invalid data types: NA/NaN/Inf
df <- mtcars
df[1,1] <- NA
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
df[1,1] <- Inf
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
df[1,1] <- NaN
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
You can check for these values using the following:
df[1:3,1] <- c(NA, Inf, NaN) # one NA, one Inf, one NaN
sum(sapply(df, is.na))
[1] 2
sum(sapply(df, is.infinite))
[1] 1
sum(sapply(df, is.nan))
[1] 1
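If the totals are non-zero, a per-column breakdown shows where the problematic values sit. The following is a minimal sketch, assuming all columns are numeric; the name `bad_counts` is only illustrative.

```r
# Rebuild the example data: one NA, one Inf, one NaN in the first column
df <- mtcars
df[1:3, 1] <- c(NA, Inf, NaN)

# For each column, count the three kinds of problematic values.
# Note that is.na() is TRUE for both NA and NaN.
bad_counts <- sapply(df, function(col) {
  c(na_or_nan = sum(is.na(col)),
    nan       = sum(is.nan(col)),
    infinite  = sum(is.infinite(col)))
})

# Show only the columns that contain at least one problematic value
bad_counts[, colSums(bad_counts) > 0, drop = FALSE]
```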
To get rid of these, we can remove the corresponding observations. But note that complete.cases does not remove Inf:
complete_df <- df[complete.cases(df),]
sum(sapply(complete_df, is.infinite))
[1] 1
Instead, use e.g.
df[apply(sapply(df, is.finite), 1, all),]
You can also reassign these values or impute them, but this is a whole different procedure.
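As one possible illustration of the imputation route, here is a rough sketch that replaces every non-finite entry with the median of the finite values in the same column. The helper `impute_median` is hypothetical (it is not part of base R), and median imputation is just one assumption among many possible strategies; it also assumes every column is numeric and has at least one finite value.

```r
df <- mtcars
df[1:3, 1] <- c(NA, Inf, NaN)

# Hypothetical helper: replace NA/NaN/Inf/-Inf in a numeric vector
# with the median of its finite entries.
impute_median <- function(col) {
  ok <- is.finite(col)
  col[!ok] <- median(col[ok])
  col
}

# Apply the helper column by column, keeping the data frame structure
imputed_df <- df
imputed_df[] <- lapply(df, impute_median)

# Check that no problematic values are left, then cluster
stopifnot(all(sapply(imputed_df, is.finite)))
set.seed(1)  # kmeans picks random starting centers
km <- kmeans(imputed_df, 3)
```

Whether imputing is appropriate depends on why the values are missing, so treat this as a starting point only.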
Likely problem 2: dates. See the following:
library(lubridate)
df <- mtcars
df$date <- seq.Date(from=ymd("1990-01-01"), length.out = nrow(df), by=1)
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(df, 3) : NAs introduced by coercion
You can get around this problem by excluding the dates or by converting the dates to something else, e.g.
df$newdate <- seq_along(df$date)
df$date <- NULL
kmeans(df, 3)
K-means clustering with 3 clusters of sizes 9, 7, 16
---- lengthy output omitted
Or you can coerce the dates to numeric yourself before you pass the data to kmeans:
df <- mtcars
df$date <- seq.Date(from=ymd("1990-01-01"), length.out = nrow(df), by=1)
df$date <- as.numeric(df$date)
kmeans(df, 3)
K-means clustering with 3 clusters of sizes 9, 16, 7
--- lengthy output omitted
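If the data has several date columns, the same conversion can be applied to all of them at once. The following is a small sketch using only base R (no lubridate); the variable `is_date` is just an illustrative name. `as.numeric()` on a Date gives the number of days since 1970-01-01.

```r
df <- mtcars
df$date <- seq(as.Date("1990-01-01"), by = "day", length.out = nrow(df))

# Find every column of class Date
is_date <- sapply(df, inherits, what = "Date")

# Replace each Date column with its numeric value (days since 1970-01-01)
df[is_date] <- lapply(df[is_date], as.numeric)

# All columns are numeric now, so kmeans no longer has to coerce anything
stopifnot(all(sapply(df, is.numeric)))
set.seed(1)  # kmeans picks random starting centers
km <- kmeans(df, 3)
```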
Solution 2
Check the data type of every variable you are clustering on. The error most likely occurs because at least one column is non-numeric. Also convert date columns to a numeric representation (or drop them) before you cluster.
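A quick way to do this check, sketched under the assumption that dropping the non-numeric columns is acceptable for your analysis: list the class of each column, then keep only the numeric ones. The `date` and `label` columns below are made up for the example.

```r
df <- mtcars
df$date  <- seq(as.Date("1990-01-01"), by = "day", length.out = nrow(df))
df$label <- factor(rep(c("a", "b"), length.out = nrow(df)))

# Inspect the (first) class of every column
sapply(df, function(col) class(col)[1])

# Keep only numeric columns; is.numeric() is FALSE for Date and factor
numeric_df <- df[sapply(df, is.numeric)]

set.seed(1)  # kmeans picks random starting centers
km <- kmeans(numeric_df, 3)
```

As the "NAs introduced by coercion" warning suggests, columns that are not numeric appear to be turned into NA when kmeans converts the data, which then triggers the error.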
zsad512
Updated on June 04, 2022
I have a dataset that I have created in R. It is structured as follows:
> head(btc_data)
           Date btc_close eth_close vix_close gold_close DEXCHUS change
1647 2010-07-18      0.09        NA        NA         NA      NA      0
1648 2010-07-19      0.08        NA     25.97    115.730      NA     -1
1649 2010-07-20      0.07        NA     23.93    116.650      NA     -1
1650 2010-07-21      0.08        NA     25.64    115.850      NA      1
1651 2010-07-22      0.05        NA     24.63    116.863      NA     -1
1652 2010-07-23      0.06        NA     23.47    116.090      NA      1
I am trying to cluster the observations using k-means. However, I get the following error message:
> km <- kmeans(trainingDS, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
What does this mean? Am I preprocessing the data incorrectly? What can I do to fix it? I can't drop the NAs because, out of 4500 initial observations, running complete.cases leaves me with only 100 observations.
Essentially, I am hoping that 3 clusters will form based on the change column, which has values of -1, 0, 1. I then wish to analyze the components of each cluster to find the strongest predictors for change. What other algorithms would be most useful for doing this?
I also tried to remove all the NA values using the following code, but I still get the same error message:
> complete_cases <- btc_data[complete.cases(btc_data), ]
> km <- kmeans(complete_cases, 3, nstart = 20)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
> sum(!sapply(btc_data, is.finite))
[1] 8008
> sum(sapply(btc_data, is.nan))
[1] 0
>
> sum(!sapply(complete_cases, is.finite))
[1] 0
> sum(sapply(complete_cases, is.nan))
[1] 0
Here is the format of the data:
> sapply(btc_data, class)
      Date  btc_close  eth_close  vix_close gold_close    DEXCHUS     change
    "Date"  "numeric"  "numeric"  "numeric"  "numeric"  "numeric"   "factor"