How to calculate BIC for kmeans clustering in R
Solution 1
For anyone else landing here, there's a method proposed by Sherry Towers at http://sherrytowers.com/2013/10/24/kmeansclustering/, which uses output from stats::kmeans
. I quote:
The AIC can be calculated with the following function:
kmeansAIC = function(fit){ m = ncol(fit$centers) n = length(fit$cluster) k = nrow(fit$centers) D = fit$tot.withinss return(D + 2*m*k) }
From the help for stats::AIC
, you can also see that the BIC can be calculated in a similar way to the AIC. An easy way to get the BIC is to replace the return()
in the above function, with this:
return(data.frame(AIC = D + 2*m*k,
BIC = D + log(n)*m*k))
So you would use this as follows:
fit < kmeans(x = data,centers = 6)
kmeansAIC(fit)
Solution 2
To compute BIC, just add .5*k*d*log(n)
(where k
is the number of means, d
is the length of a vector in your dataset, and n
is the number of data points) to the standard kmeans error function.
The standard kmeans penalty is \sum_n (m_k(n)x_n)^2
, where m_k(n)
is the mean associated with the nth data point. This penalty can be interpreted as a log probability, so BIC is perfectly valid.
BIC just adds an additional penalty term to the kmeans error proportional to k
.
Solution 3
Just to add to what user1149913 said (I don't have enough reputation to comment), since you're using the kmeans function in R, \sum_n (m_k(n)x_n)^2
is already calculated for you as KClData$tot.withinss
.
Solution 4
Rather than reimplementing AIC
or BIC
, we can define a loglikelihood function for kmeans
objects; this will then get used by the BIC
function in the stats
package.
logLik.kmeans < function(object) structure(
object$tot.withinss/2,
df = nrow(object$centers)*ncol(object$centers),
nobs = length(object$cluster)
)
Then to use it, call BIC
as normal. For example:
example(kmeans, local=FALSE)
BIC(cl)
# [1] 26.22842084
This method will be provided in the next release of the stackoverflow
package.
UnivStudent
Updated on July 05, 2022Comments

UnivStudent over 1 year
I've been using kmeans to cluster my data in R but I'd like to be able to assess the fit vs. model complexity of my clustering using Baysiean Information Criterion (BIC) and AIC. Currently the code I've been using in R is:
KClData < kmeans(Data, centers=2, nstart= 100)
But I'd like to be able to extract the BIC and Log Likelihood. Any help would be greatly appreciated!

Jason over 8 yearsI don't think the kmeans penalty
\sum_n (m_k(n)  x_n)^2
(or the negative of that) is the loglikelihood. The loglikelihood should have 3 more terms:n*log(K)
,0.5*n*d*log(2*pi)
andn*d*log(\sigma)
, where \sigma is the common std for all Gaussians. Also, the "k" in the BIC formula is not the number of clusters, it is the number of free parameters in the mixture Gaussian model, so k should be:k = K1 + 1 + K*d = K*(d+1)
. Here I am using lower k for the BIC parameter term, and capital K as the number of means/clusters. 
Jason over 8 years
K1
is K1 number of weights for the Gaussians, since the weights add up to 1, the dof is K1.1
for the one common std for all Gaussians. AndK*d
is the number of parameters for the coordinates of the cluster centers. 
pengchy almost 8 yearsWaiting for the new version of stackoverflow, current version 0.1.2 hasn't implemented this function.

pengchy almost 8 yearsI have used your methods, but the BIC value of the kmeans results always monotone decreasing with the cluster number increasing. Please look the posts: stats.stackexchange.com/questions/55147/…

Andy Clifton almost 8 yearsTaking a wild guess, I'd say there's an error somewhere.

pengchy almost 8 yearsThank you Any Clifton, I have tested the BIC with higher K number, when K reach 155 the BIC come to the smallest value. Originally, I have only tested the K with maximum 50.

Andy Clifton almost 8 years@pengchy: you may be interested in this answer about finding optimum values of k: stackoverflow.com/a/18047359/2514568

pengchy almost 8 yearsHi Andy Clifton, thank you for the information. It is very helpful.

Neal Fultz almost 8 years@pengchy the github version has it, I'll submit to cran when I get the chance.

user5054 over 7 yearsAre there any weights in kmeans? I think not.. So why do we need K1?

EngrStudent almost 6 yearsyou are sign inverted. You need a minus sign.

Chris about 2 years