NbClust package error

r cluster-analysis

15,125

Solution 1

I am fairly sure I have found the cause of this error message, and it is essentially data related. I looked at the source code of the NbClust package and found that the error originates near the beginning of the function:

```
NbClust <- function(data, diss="NULL", distance = "euclidean", min.nc=2, max.nc=15, method = "ward", index = "all", alphaBeale = 0.1)
{
  x <- 0
  min_nc <- min.nc
  max_nc <- max.nc
  jeu1 <- as.matrix(data)
  numberObsBefore <- dim(jeu1)[1]
  jeu <- na.omit(jeu1) # returns the object with incomplete cases removed
  nn <- numberObsAfter <- dim(jeu)[1]
  pp <- dim(jeu)[2]
  TT <- t(jeu) %*% jeu
  sizeEigenTT <- length(eigen(TT)$value)
  eigenValues <- eigen(TT/(nn-1))$value
  for (i in 1:sizeEigenTT)
  {
    if (eigenValues[i] < 0) {
      print(paste("There are only", numberObsAfter, "nonmissing observations out of a possible", numberObsBefore, "observations."))
      stop("The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.")
    }
  }
  # (excerpt ends here; the rest of the function is not shown)
```

So, in my case, the matrix produces negative eigenvalues. I double-checked this, and it does: for roughly the first 100 principal submatrices the eigenvalues stay positive, then they start to turn negative. This is a mathematical issue with my matrix: it is not positive definite, which matters for quite a lot of reasons. A really good explanation of causes and possible solutions is given at http://www2.gsu.edu/~mkteer/npdmatri.html. I am now analyzing my data to find out what causes this. So the code is fine: if you get this error message, you probably have to go back to your data.
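The eigenvalue check can be reproduced outside NbClust. The following is a minimal sketch, assuming simulated data with the same shape as in the question (100 rows, 130 columns); `X`, `rank_X` and the 1e-8 threshold are illustrative choices, not part of the package. Because the rank of `t(X) %*% X` cannot exceed the number of rows, at least 30 of its 130 eigenvalues are zero in exact arithmetic, and floating-point round-off can push some of them slightly below zero, which is enough to trigger the `stop()` call.

```r
# Simulated stand-in for the real data: 100 observations, 130 variables
set.seed(1)
n <- 100
p <- 130
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Same cross-product matrix that NbClust builds (p x p)
TT <- t(X) %*% X
ev <- eigen(TT / (n - 1), symmetric = TRUE, only.values = TRUE)$values

rank_X <- qr(X)$rank                           # cannot exceed min(n, p)
n_near_zero <- sum(abs(ev) < 1e-8 * max(ev))   # numerically zero eigenvalues
n_negative <- sum(ev < 0)                      # these would trip the check

cat("rank of X:", rank_X, "\n")
cat("numerically zero eigenvalues:", n_near_zero, "\n")
cat("eigenvalues below zero:", n_negative, "\n")
```

Replacing `X` with `as.matrix(na.omit(mydata))` should show whether your own data are in the same situation.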

I would caution against transposing your data, because NbClust then multiplies the transpose of your transposed data (i.e. the original data) by your transposed data. Original times transposed is NOT the same as transposed times original!
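A small sketch of that difference, using a made-up 4 x 6 matrix (`X`, `A` and `B` are illustrative names): the two products do not even have the same dimensions. Note also that, as far as I can tell from the source, NbClust clusters the rows of whatever it is given, so passing `t(X)` would group the original variables rather than the original observations.

```r
set.seed(1)
X <- matrix(rnorm(4 * 6), nrow = 4, ncol = 6)   # 4 observations, 6 variables

A <- t(X) %*% X    # formed when NbClust is given X:    6 x 6
B <- X %*% t(X)    # formed when NbClust is given t(X): 4 x 4

dim(A)
dim(B)
```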

Solution 2

I don't know what goes wrong inside the function, but you can apply the different indices one at a time in a loop. (To use this code on your own data, replace "base_multi_sinna" with the name of your data set.)

```
library(NbClust)

# Indices to try, one NbClust() call per index
lista.methods = c("kl", "ch", "hartigan","mcclain", "gamma", "gplus",
                  "tau", "dunn", "sdindex", "sdbw", "cindex", "silhouette",
                  "ball","ptbiserial", "gap","frey")
# The first entry, "metodo", only labels the column that stores the index name
lista.distance = c("metodo","euclidean", "maximum", "manhattan", "canberra")

# One row per index; column 1 holds the index name, the rest one distance each
tabla = as.data.frame(matrix(ncol = length(lista.distance), nrow = length(lista.methods)))
names(tabla) = lista.distance

for (j in 2:length(lista.distance)) {
  for (i in 1:length(lista.methods)) {
    nb = NbClust(base_multi_sinna, distance = lista.distance[j],
                 min.nc = 2, max.nc = 10,
                 method = "complete", index = lista.methods[i])
    tabla[i, j] = nb$Best.nc[1]     # suggested number of clusters
    tabla[i, 1] = lista.methods[i]
  }
}

tabla
```

Solution 3

I had the same problem when working with a matrix that has more columns than rows. This also affects other R functions, such as princomp when you are trying to do a PCA (in that case, you should use prcomp).

My workaround in this case is simply to use the transposed matrix:

```
NbClust(t(mydata), distance="euclidean", min.nc=2, max.nc=99, method="ward",
        index="duda")
```
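The remark about princomp and prcomp can be checked with base R alone. This is a minimal sketch on made-up data with more columns than rows; `X`, `princomp_result` and `prcomp_result` are illustrative names. The `tryCatch()` call captures the error message from `princomp()` instead of stopping the script.

```r
set.seed(1)
X <- matrix(rnorm(10 * 15), nrow = 10, ncol = 15)  # 10 observations, 15 variables

# princomp() refuses input with fewer observations than variables;
# keep the error message as a string rather than aborting
princomp_result <- tryCatch(princomp(X), error = function(e) conditionMessage(e))

# prcomp() is based on a singular value decomposition and accepts this shape
prcomp_result <- prcomp(X)

print(princomp_result)
summary(prcomp_result)
```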

Author by

Geraldine

Updated on June 04, 2022

Comments

Geraldine almost 2 years
I am trying to run the package NbClust on my data (100 rows x 130 columns) to determine the number of clusters I should choose, but I keep getting this error if I try to apply it to the full data set:
```
> nc <- NbClust(mydata, distance="euclidean", min.nc=2, max.nc=99, method="ward",
index="duda")     
[1] "There are only 100 nonmissing observations out of a possible 100 observations."
Error in NbClust(mydata, distance = "euclidean", min.nc = 2, max.nc = 99,  : 
The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.
```
When I apply the method to a 100x80 matrix, it does produce the required output (100x100 also gave me an error message, but a different one). However, I obviously want to apply this method to the whole dataset. FYI: creating the distance matrix and clustering with Ward's method were both no problem. Both the distance matrix and the dendrogram were produced…