Predicting Values with k-Means Clustering Algorithm


Solution 1

To assign a new data point to one of a set of clusters created by k-means, you just find the centroid nearest to that point.

In other words, you repeat the same step you used in the iterative assignment of each point in your original data set to one of the k clusters. The only difference is that the centroids used for this computation are the final set, i.e., the centroid values at the last iteration.

Here's one implementation in Python (w/ NumPy):

>>> import numpy as NP
>>> # just made up values--based on your spec (2D data + 2 clusters)
>>> centroids = NP.array([[54, 85],
...                       [99, 78]])
>>> centroids
      array([[54, 85],
             [99, 78]])

>>> # a made-up new data point within the problem domain:
>>> new_data = NP.array([67, 78])

>>> # to assign a new data point to a cluster ID,
>>> # find its closest centroid:
>>> diff = centroids - new_data  # NumPy broadcasting (new_data is 1-D)
>>> diff
      array([[-13,   7],
             [ 32,   0]])

>>> dist = NP.sqrt(NP.sum(diff**2, axis=-1))  # Euclidean distance
>>> dist
      array([ 14.76,  32.  ])

>>> closest_centroid = centroids[NP.argmin(dist),]
>>> closest_centroid
       array([54, 85])
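
If you have several new points to assign at once, the same nearest-centroid idea can be applied to all of them in one go. This is only a minimal sketch: `assign_clusters` is a made-up helper name, and it assumes the points and centroids are given as 2D arrays with one row per point.

```python
import numpy as np

def assign_clusters(points, centroids):
    """Return the index of the nearest centroid for each row of `points`.

    Hypothetical helper for illustration; not part of the original answer.
    """
    points = np.atleast_2d(np.asarray(points, dtype=float))
    centroids = np.asarray(centroids, dtype=float)
    # Broadcasting gives an array of shape (n_points, n_centroids, n_features)
    diff = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # Euclidean distance from every point to every centroid
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # For each point, pick the column (centroid) with the smallest distance
    return dist.argmin(axis=1)

centroids = np.array([[54, 85], [99, 78]])
new_points = np.array([[67, 78], [95, 80]])
labels = assign_clusters(new_points, centroids)
print(labels.tolist())  # -> [0, 1]
```

The returned values are cluster IDs (row indices into `centroids`), so `centroids[labels]` gives the closest centroid for each new point.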

Solution 2

I know I might be late, but this is my general solution to your problem:

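import numpy as np

def predict(data, centroids):
    centroids, data = np.array(centroids), np.array(data)
    distances = []
    for unit in data:
        for center in centroids:
            # squared Euclidean distance (the square root is not needed for argmin)
            distances.append(np.sum((unit - center) ** 2))
    # one row per data point, one column per centroid
    distances = np.reshape(distances, (len(data), len(centroids)))
    closest_centroid = [int(np.argmin(dist)) for dist in distances]
    return closest_centroid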

Solution 3

If you are considering assigning a value based on the average value within the nearest cluster, you are talking about some form of "soft decoder", which estimates not only the correct value of the coordinate but also your level of confidence in that estimate. The alternative would be a "hard decoder", where only the values 0 and 1 are legal (because only they occur in the training data set), and the new coordinate would get the median of the values within the nearest cluster. My guess is that you should always assign only a known-valid class value (0 or 1) to each coordinate, and that averaging class values is not a valid approach.
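
To make the two options concrete, here is a rough sketch, assuming the training points have already been given cluster IDs by k-means and that every cluster contains at least one training point. The names `cluster_class_stats`, `cluster_ids` and `classes` are made up for illustration. The mean of the 0/1 values in a cluster plays the role of the "soft" estimate, and a majority vote (which for 0/1 values matches the median except on ties) gives the "hard" one.

```python
import numpy as np

def cluster_class_stats(cluster_ids, classes, k):
    """Per-cluster soft estimate (fraction of 1s) and hard label (majority vote).

    Hypothetical helper; assumes every cluster 0..k-1 has at least one member.
    """
    cluster_ids = np.asarray(cluster_ids)
    classes = np.asarray(classes)
    # Fraction of class-1 points in each cluster
    soft = np.array([classes[cluster_ids == c].mean() for c in range(k)])
    # Majority vote; ties (exactly 0.5) are resolved to class 1 in this sketch
    hard = (soft >= 0.5).astype(int)
    return soft, hard

# Made-up training assignments: 3 points in cluster 0, 4 points in cluster 1
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
classes = [0, 0, 1, 1, 1, 1, 0]
soft, hard = cluster_class_stats(cluster_ids, classes, k=2)
print(hard.tolist())  # -> [0, 1]
```

An unknown point whose nearest centroid is cluster `c` would then get the class `hard[c]`, with `soft[c]` available as a rough indication of how mixed that cluster is.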

Author: DizzyDoo

Updated on July 24, 2022

Comments

  • DizzyDoo almost 2 years

    I'm messing around with machine learning, and I've written a K Means algorithm implementation in Python. It takes two-dimensional data and organises it into clusters. Each data point also has a class value of either 0 or 1.

    What confuses me about the algorithm is how I can then use it to predict values for another set of two-dimensional data whose class value is unknown rather than 0 or 1. For each cluster, should I average the points within it to either a 0 or a 1, so that an unknown point closest to that cluster takes on the averaged value? Or is there a smarter method?

    Cheers!