K-Means algorithm


Solution 1

I haven't studied the code myself, but there's a multithreaded K-means implementation given in this JavaWorld article that looks pretty instructive.

Solution 2

Classification, clustering, and grouping are well-developed areas of IR. There is a very good open-source Java library called WEKA, which includes several clustering algorithms. Although there is a learning curve, it might be useful when you encounter harder problems.

Solution 3

OpenCV is one of the most horribly written libraries I've ever had to use. On the other hand, Matlab does it very neatly.

If you have to code it yourself, the algorithm is incredibly simple for how efficient it is.

  1. Pick number of clusters (k)
  2. Make k points (they're going to be the centroids)
  3. Randomize the locations of all these points
  4. Calculate Euclidean distance from each point to all centroids
  5. Assign 'membership' of each point to the nearest centroid
  6. Establish the new centroids by averaging the locations of all points belonging to a given cluster
  7. Go to step 4 and repeat until convergence is achieved or the changes become negligible.
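
The steps above could be sketched in Java roughly as follows. This is only a minimal illustration, not a tuned implementation: the class name `KMeansSketch`, the method `cluster`, and its parameters (`maxIterations`, `seed`) are made up for this example, and starting the centroids at k randomly chosen data points is just one assumed way of doing the "randomize" step. Convergence here means that no point changed cluster during a pass.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper class for illustration; not taken from any library.
class KMeansSketch {

    // Squared Euclidean distance; the square root is not needed to find the nearest centroid.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the cluster index (0..k-1) assigned to each input point.
    static int[] cluster(double[][] points, int k, int maxIterations, long seed) {
        int n = points.length;
        if (n == 0 || k < 1 || k > n) {
            throw new IllegalArgumentException("need at least one point and 1 <= k <= n");
        }
        int dim = points[0].length;
        Random random = new Random(seed);

        // Steps 2-3: start the centroids at k distinct, randomly chosen data points
        // (an assumption of this sketch; other initializations exist).
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            int j = c + random.nextInt(n - c);
            int tmp = order[c];
            order[c] = order[j];
            order[j] = tmp;
            centroids[c] = points[order[c]].clone();
        }

        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Steps 4-5: assign every point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double dist = squaredDistance(points[i], centroids[c]);
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            // Step 7: stop once no membership changed.
            if (!changed) break;

            // Step 6: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += points[i][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its old centroid
                for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {1, 0}, {10, 0}, {11, 0} };
        System.out.println(Arrays.toString(cluster(data, 2, 100, 42L)));
    }
}
```

For the small data set in `main`, the first two points end up sharing one label and the last two the other; which group gets label 0 depends on the seed. On less cleanly separated data the result can depend on the starting centroids, so running it with several seeds and comparing the results is a reasonable precaution.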

Solution 4

There's a very nice Python implementation of K-means clustering in "Programming Collective Intelligence". I highly recommend it.

I realize that you'll have to translate to Java, but it doesn't look to be too difficult.

Solution 5

Really, KMeans is a very easy algorithm. Is there any good reason not to hand-code it yourself? I did it in Qt and then ported the code to plain old STL without too many problems.

I am starting to be a fan of Joel's idea: no external dependencies. So please feel free to tell me what's good about depending on a large piece of software you don't control, especially when others on this question have already mentioned that it's not a good piece of software.

Talk is cheap; real men show their code to the world: http://github.com/elcuco/data_mining_demo

I should clean up the code a little to make it more generic, and the current version is not ported to STL, but it's a start!

Author: dedalo

Updated on June 17, 2020

Comments

  • dedalo
    dedalo almost 4 years

    I'm trying to program a k-means algorithm in Java. I have calculated a number of arrays, each of them containing a number of coefficients. I need to use a k-means algorithm in order to group all this data. Do you know any implementation of this algorithm?

    Thanks

  • duffymo
    duffymo almost 15 years
    +1 - nice find. "PCI" is still recommended, because it's got a lot of great stuff besides K-means.
  • ldog
    ldog almost 15 years
    Using OpenCV for KMeans might be overkill, but I don't see how OpenCV is "horribly" written. It may not be as easy to use as Matlab (Matlab is proprietary, slow, and meant to be an easy way to test out algorithms using the large number of algorithms already available to you in Matlab), but it is for sure way faster than Matlab, simply by virtue of being coded in C.
  • dedalo
    dedalo almost 15 years
    Hi elcuco, I have coded it myself, but wanted to cross-check the initialization part. I wanted to see how other implementations had assigned the initial clusters. I also think it's not a good idea to use code you don't have control over. I'll keep digging, thank you all!
  • hythlodayr
    hythlodayr almost 15 years
    Thanks. I've been looking for a practical companion to my (old) machine learning textbook for some time now.
  • mmr
    mmr about 13 years
    Matlab is slow if you don't code in it properly. If you've written a 'for' loop in matlab, you're doing it wrong.
  • Chthonic Project
    Chthonic Project over 10 years
    I wouldn't use this package for clustering. The k-means implementation in Java-ML is ridiculously slow.
  • jutky
    jutky about 10 years
    Currently, OpenCV officially supports Java. So, there is no need for hand-written JNI.