How to calculate BIC for k-means clustering in R


Solution 1

For anyone else landing here, there's a method proposed by Sherry Towers at http://sherrytowers.com/2013/10/24/k-means-clustering/, which uses output from stats::kmeans. I quote:

The AIC can be calculated with the following function:

kmeansAIC = function(fit){
  m = ncol(fit$centers)     # number of dimensions (variables)
  n = length(fit$cluster)   # number of data points
  k = nrow(fit$centers)     # number of clusters
  D = fit$tot.withinss      # total within-cluster sum of squares
  return(D + 2*m*k)
}

From the help for stats::AIC, you can also see that the BIC can be calculated in a similar way to the AIC. An easy way to get the BIC is to replace the return() in the above function with this:

return(data.frame(AIC = D + 2*m*k,
                  BIC = D + log(n)*m*k))

So you would use this as follows:

fit <- kmeans(x = data, centers = 6)
kmeansAIC(fit)
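
To compare several candidate values of centers, you could wrap this in a loop. The following is only a sketch: it assumes the BIC-returning variant of kmeansAIC() shown above and a numeric matrix called data.

ks <- 2:10
ic <- do.call(rbind, lapply(ks, function(k) {
  fit <- kmeans(x = data, centers = k, nstart = 25)
  kmeansAIC(fit)
}))
ic$k <- ks
ic   # pick the k with the smallest AIC or BIC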

Solution 2

To compute BIC, just add 0.5*k*d*log(n) (where k is the number of means, d is the length of a vector in your dataset, and n is the number of data points) to the standard k-means error function.

The standard k-means penalty is \sum_n (m_k(n)-x_n)^2, where m_k(n) is the mean associated with the nth data point. This penalty is, up to constants, the negative log-likelihood of the data under a spherical Gaussian model, so using it in a BIC is perfectly valid.

BIC just adds to the k-means error an additional penalty term proportional to k.
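
For example, a minimal sketch of that recipe (assuming Data is the numeric data matrix from the question):

fit <- kmeans(Data, centers = 6, nstart = 100)
n <- nrow(Data)                  # number of data points
d <- ncol(Data)                  # length of each data vector
k <- nrow(fit$centers)           # number of means (clusters)
sse <- sum((Data - fit$centers[fit$cluster, ])^2)   # standard k-means error
bic <- sse + 0.5*k*d*log(n)      # error plus the BIC penalty term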

Solution 3

Just to add to what user1149913 said (I don't have enough reputation to comment), since you're using the kmeans function in R, \sum_n (m_k(n)-x_n)^2 is already calculated for you as KClData$tot.withinss.
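
In other words, Solution 2's BIC can be read straight off the fitted object. A small sketch, using KClData from the question:

n <- length(KClData$cluster)     # number of data points
d <- ncol(KClData$centers)       # length of each data vector
k <- nrow(KClData$centers)       # number of clusters
KClData$tot.withinss + 0.5*k*d*log(n)   # BIC as defined in Solution 2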

Solution 4

Rather than reimplementing AIC or BIC, we can define a log-likelihood function for kmeans objects; this will then get used by the BIC function in the stats package.

logLik.kmeans <- function(object, ...) structure(
  -object$tot.withinss/2,                             # log-likelihood, up to an additive constant
  df = nrow(object$centers)*ncol(object$centers),     # free parameters: the fitted center coordinates
  nobs = length(object$cluster)                       # number of observations
)

Then to use it, call BIC as normal. For example:

example(kmeans, local=FALSE)
BIC(cl)
# [1] 26.22842084
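
Since stats::AIC is computed from the same logLik() method, the corresponding AIC should also come for free:

AIC(cl)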

This method will be provided in the next release of the stackoverflow package.


Comments

  • UnivStudent
    UnivStudent almost 2 years

    I've been using k-means to cluster my data in R, but I'd like to be able to assess the fit vs. model complexity of my clustering using the Bayesian Information Criterion (BIC) and AIC. Currently, the code I've been using in R is:

    KClData <- kmeans(Data, centers = 2, nstart = 100)

    But I'd like to be able to extract the BIC and log-likelihood. Any help would be greatly appreciated!

  • Jason
    Jason almost 9 years
    I don't think the k-means penalty \sum_n (m_k(n) - x_n)^2 (or its negative) is the log-likelihood. The log-likelihood should have three more terms: -n*log(K), -0.5*n*d*log(2*pi) and -n*d*log(\sigma), where \sigma is the common standard deviation of all the Gaussians. Also, the "k" in the BIC formula is not the number of clusters; it is the number of free parameters in the Gaussian-mixture model, so it should be k = K-1 + 1 + K*d = K*(d+1). Here I am using lower-case k for the BIC parameter count and capital K for the number of means/clusters.
  • Jason
    Jason almost 9 years
    K-1 is the number of free weights for the Gaussians: since the weights add up to 1, the degrees of freedom is K-1. The 1 is for the single common standard deviation shared by all Gaussians, and K*d is the number of parameters for the coordinates of the cluster centers. (A sketch of this fuller calculation is given after these comments.)
  • pengchy
    pengchy over 8 years
    Waiting for the new version of stackoverflow, current version 0.1.2 hasn't implemented this function.
  • pengchy
    pengchy over 8 years
    I have used your method, but the BIC of the kmeans results is always monotonically decreasing as the number of clusters increases. Please see this post: stats.stackexchange.com/questions/55147/…
  • Andy Clifton
    Andy Clifton over 8 years
    Taking a wild guess, I'd say there's an error somewhere.
  • pengchy
    pengchy over 8 years
    Thank you Andy Clifton. I have tested the BIC with higher values of K; when K reaches 155 the BIC reaches its smallest value. Originally, I had only tested K up to a maximum of 50.
  • Andy Clifton
    Andy Clifton over 8 years
    @pengchy: you may be interested in this answer about finding optimum values of k: stackoverflow.com/a/18047359/2514568
  • pengchy
    pengchy over 8 years
    Hi Andy Clifton, thank you for the information. It is very helpful.
  • Neal Fultz
    Neal Fultz over 8 years
    @pengchy the GitHub version has it; I'll submit to CRAN when I get the chance.
  • user5054
    user5054 almost 8 years
    Are there any weights in k-means? I think not. So why do we need K-1?
  • EngrStudent
    EngrStudent over 6 years
    Your sign is inverted. You need a minus sign.
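
As a rough follow-up to Jason's comments above, here is a sketch of that fuller BIC (the function name kmeansBIC_full is mine, not from any of the answers). It assumes hard assignments, uniform mixture weights, and spherical Gaussians with one shared standard deviation estimated from the fit:

kmeansBIC_full = function(fit){
  n = length(fit$cluster)           # number of data points
  d = ncol(fit$centers)             # dimension of each point
  K = nrow(fit$centers)             # number of clusters
  sigma2 = fit$tot.withinss/(n*d)   # ML estimate of the common variance
  # full log-likelihood: Gaussian terms plus the -n*log(K) weight term
  loglik = -0.5*n*d*log(2*pi) - 0.5*n*d*log(sigma2) -
    fit$tot.withinss/(2*sigma2) - n*log(K)
  # K*(d+1) free parameters: K-1 weights, 1 sigma, K*d center coordinates
  return(-2*loglik + K*(d+1)*log(n))
}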