How Could One Implement the K-Means++ Algorithm?


Solution 1

Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding

In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
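
If it helps, here is a quick numpy check of that arithmetic (a small sketch, using the observations and centers from the example above):

import numpy as np

X = np.array([0, 1, 2, 3, 4])
centers = [0, 4]                 # c1 = 0, c2 = 4

# D(x)^2: squared distance from each point to its nearest chosen center
D2 = np.array([min((c - x)**2 for c in centers) for x in X])
print(D2 / D2.sum())             # [0, 1/6, 4/6, 1/6, 0]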

I've coded the initialization procedure in Python; I don't know if this helps you.

import numpy as np

def initialize(X, K):
    C = [X[0]]
    for k in range(1, K):
        # squared distance from each point to its nearest already-chosen center
        D2 = np.array([min([np.inner(c-x, c-x) for c in C]) for x in X])
        # choose the next center with probability proportional to D(x)^2
        probs = D2/D2.sum()
        cumprobs = probs.cumsum()
        r = np.random.rand()
        # take the first point whose cumulative probability exceeds r
        for j, p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C
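
For example, on the one-dimensional observations from above (a quick usage sketch; the second draw is random, but 4 is the most likely outcome):

X = np.array([0., 1., 2., 3., 4.])
centers = initialize(X, K=2)
print(centers)   # e.g. [0.0, 4.0]; the second center is drawn with P(x) proportional to D(x)^2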

EDIT with clarification: The output of cumsum gives us boundaries that partition the interval [0,1]. Each partition has length equal to the probability of the corresponding point being chosen as a center. Since r is chosen uniformly from [0,1], it falls into exactly one of these intervals, and the break makes the loop stop at that interval. So the for loop simply checks which partition r falls into.

Example:

probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
    # this event has probability 0.1
    i = 0
elif r < cumprobs[1]:
    # this event has probability 0.2
    i = 1
elif r < cumprobs[2]:
    # this event has probability 0.3
    i = 2
elif r < cumprobs[3]:
    # this event has probability 0.4
    i = 3
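
For what it's worth, numpy can replace this lookup loop with a single call; a minimal sketch (searchsorted with side='right' matches the strict r < p test above):

import numpy as np

probs = np.array([0.1, 0.2, 0.3, 0.4])
cumprobs = probs.cumsum()                  # [0.1, 0.3, 0.6, 1.0]
r = np.random.rand()
# index of the first cumulative boundary strictly greater than r
i = int(np.searchsorted(cumprobs, r, side='right'))
# or, more directly: i = np.random.choice(len(probs), p=probs)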

Solution 2

One Liner.

Say we need to select two cluster centers. Instead of selecting them all randomly (as we do in plain k-means), we select the first one randomly, then find the points that are farthest from the first center (these points most probably do not belong to the first cluster, since they are far from it) and assign the second cluster center near those far points.
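
A minimal sketch of that intuition, assuming the points are rows of a 2-D numpy array (note that k-means++ proper does not always take the single farthest point; it samples the next center with probability proportional to the squared distance, which is what this hypothetical helper does):

import numpy as np

def pick_next_center(X, centers):
    # illustrative helper, not from the answers above
    # squared distance from each point to its nearest already-chosen center
    d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :])**2).sum(axis=2), axis=1)
    # far points get a proportionally higher probability of being picked
    return X[np.random.choice(len(X), p=d2 / d2.sum())]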

Solution 3

I have prepared a full source implementation of k-means++, based on the book "Programming Collective Intelligence" by Toby Segaran and on the k-means++ initialization provided here.

Indeed, there are two distance functions here. For the initial centroids, a standard one based on numpy.inner is used; then, for updating the centroids, the Pearson one is used. Maybe the Pearson one could also be used for the initial centroids. They say it is better.

from __future__ import division

def readfile(filename):
  lines=[line for line in open(filename)]
  rownames=[]
  data=[]
  for line in lines:
    p=line.strip().split(' ') # single space as separator
    # First column in each row is the row name
    rownames.append(p[0])
    # The data for this row is the remainder of the row
    data.append([float(x) for x in p[1:]])
  return rownames,data

from math import sqrt

def pearson(v1,v2):
  # Simple sums
  sum1=sum(v1)
  sum2=sum(v2)

  # Sums of the squares
  sum1Sq=sum([pow(v,2) for v in v1])
  sum2Sq=sum([pow(v,2) for v in v2])    

  # Sum of the products
  pSum=sum([v1[i]*v2[i] for i in range(len(v1))])

  # Calculate r (Pearson score)
  num=pSum-(sum1*sum2/len(v1))
  den=sqrt((sum1Sq-pow(sum1,2)/len(v1))*(sum2Sq-pow(sum2,2)/len(v1)))
  if den==0: return 0

  return 1.0-num/den
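
As a quick sanity check of this distance (values computed by hand): perfectly positively correlated vectors give 0, and perfectly negatively correlated vectors give 2.

print(pearson([1, 2, 3], [2, 4, 6]))   # 0.0 -- perfect positive correlation
print(pearson([1, 2, 3], [3, 2, 1]))   # 2.0 -- perfect negative correlation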

import numpy
from numpy.random import rand

def initialize(X, K):
    C = [X[0]]
    for _ in range(1, K):
        # squared distance from each point to its nearest already-chosen center
        D2 = numpy.array([min([numpy.inner(numpy.array(c)-numpy.array(x),numpy.array(c)-numpy.array(x)) for c in C]) for x in X])
        probs = D2/D2.sum()
        cumprobs = probs.cumsum()
        r = rand()
        # take the first point whose cumulative probability exceeds r
        i = -1
        for j,p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C

def kcluster(rows, distance=pearson, k=4):
  # seed the centroids with k-means++ instead of picking them at random
  clusters = initialize(rows, k)

  lastmatches = None
  for t in range(100):
    bestmatches = [[] for _ in range(k)]

    # assign each row to its closest centroid
    for j in range(len(rows)):
      bestmatch = 0
      for i in range(k):
        if distance(clusters[i], rows[j]) < distance(clusters[bestmatch], rows[j]):
          bestmatch = i
      bestmatches[bestmatch].append(j)

    # if the assignments did not change since the last iteration, we are done
    if bestmatches == lastmatches:
      break
    lastmatches = bestmatches

    # move each centroid to the average of the rows assigned to it
    for i in range(k):
      avgs = [0.0]*len(rows[0])
      if len(bestmatches[i]) > 0:
        for rowid in bestmatches[i]:
          for m in range(len(rows[rowid])):
            avgs[m]+=rows[rowid][m]
        for j in range(len(avgs)):
          avgs[j]/=len(bestmatches[i])
        clusters[i]=avgs

  return bestmatches

rows,data=readfile('/home/toncho/Desktop/data.txt')

kclust = kcluster(data,k=4)

print "Result:"
for c in kclust:
    out = ""
    for r in c:
        out+=rows[r] +' '
    print "["+out[:-1]+"]"

print 'done'

data.txt:

p1 1 5 6
p2 9 4 3
p3 2 3 1
p4 4 5 6
p5 7 8 9
p6 4 5 4
p7 2 5 6
p8 3 4 5
p9 6 7 8


Comments

  • Anton Andreev
    Anton Andreev over 3 years

    I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked (that is, the initialization), since the rest is the same as in the original K-Means algorithm.

    1. Is the probability function used based on distance or on a Gaussian?
    2. At the same time, the point most distant from the other centroids is picked as a new centroid.

    I would appreciate a step-by-step explanation and an example. The one on Wikipedia is not clear enough. Very well commented source code would also help. If you are using six arrays, please tell us which one is for what.

  • Anton Andreev
    Anton Andreev about 13 years
    So for every point in X we generate a probability. Then cumsum is supposed to put weight on these probabilities. I think the idea is to put more weight on the points with higher probability, so that when we do the "random centroid select" we choose among them. But how do we know to which points we should put more weight? We haven't sorted the probs array, and the cumsum function gives the ones at the end of the array a bigger weight (by the definition of cumsum).
  • Anton Andreev
    Anton Andreev about 13 years
    I mean that cumsum accumulates to the right (producing an array where x1 < x2), which might not be what we want: putting more weight on the points with higher probability. We might have points with higher probability in the middle (which would get less weight than the ones at the end).
  • Anton Andreev
    Anton Andreev about 13 years
    To execute the code I used 'numpy' instead of 'scipy'. Also, to get the division right, I used 'from __future__ import division'; otherwise probs is all [0,0,...].
  • Anton Andreev
    Anton Andreev about 13 years
    Code is available here: a-algorithms for CPython and IronPython.
  • Kobe-Wan Kenobi
    Kobe-Wan Kenobi about 9 years
    What happens if you have a large number of points? Can you please check my question stats.stackexchange.com/questions/144190/… ? Thanks
  • Yoshua Joo Bin
    Yoshua Joo Bin almost 8 years
    What if the probabilities fall on the same value? I mean, if c1 = [20,19,80] and by coincidence c2 also falls on the same value as c1, should the code be fixed by adding a line like 'if X[i] not in C'?