How can I cluster documents using k-means (FLANN with Python)?

Solution 1

You need to represent your document as an array of numbers (aka, a vector). There are many ways to do this, depending on how sophisticated you want to be, but the simplest way is just to represent it as a vector of word counts.

So here's what you do:

  1. Count up the number of times each word appears in the document.

  2. Choose a set of "feature" words that will be included in your vector. This should exclude extremely common words (aka "stopwords") like "the", "a", etc.

  3. Make a vector for each document based on the counts of the feature words.

Here's an example.

If your "documents" are single sentences, and they look like (one doc per line):

there is a dog who chased a cat
someone ate pizza for lunch
the dog and a cat walk down the street toward another dog

If your set of feature words is [dog, cat, street, pizza, lunch], then you can convert each document into a vector:

[1, 1, 0, 0, 0]  // dog 1 time, cat 1 time
[0, 0, 0, 1, 1]  // pizza 1 time, lunch 1 time
[2, 1, 1, 0, 0]  // dog 2 times, cat 1 time, street 1 time
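
Here is a minimal sketch of those three steps in plain Python, using the example sentences and feature words above (the naive whitespace tokenizer is just for illustration):

docs = [
    "there is a dog who chased a cat",
    "someone ate pizza for lunch",
    "the dog and a cat walk down the street toward another dog",
]
feature_words = ["dog", "cat", "street", "pizza", "lunch"]

def to_vector(doc, features):
    words = doc.lower().split()                 # step 1: tokenize (naively, on spaces)
    return [words.count(f) for f in features]   # step 3: count each feature word

vectors = [to_vector(d, feature_words) for d in docs]
print(vectors)  # [[1, 1, 0, 0, 0], [0, 0, 0, 1, 1], [2, 1, 1, 0, 0]]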

You can use these vectors in your k-means algorithm, and it will hopefully group the first and third sentences together because they are similar, and put the second sentence in a cluster of its own since it is very different.
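
As a sketch of that last step, here is how the vectors could be handed to a k-means implementation. scikit-learn's KMeans is assumed here purely for illustration; FLANN's Python bindings likewise expect numeric arrays as input:

import numpy as np
from sklearn.cluster import KMeans

# The count vectors from above, one row per document.
X = np.array([[1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1],
              [2, 1, 1, 0, 0]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0, 1, 0]: docs 1 and 3 together, doc 2 alone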

Solution 2

There is one big problem here:

K-means is designed for Euclidean distance.

The key problem is the mean function. The mean reduces variance for Euclidean distance, but it might not do so for a different distance function. So in the worst case, k-means will no longer converge, but run in an infinite loop (although most implementations support stopping after a maximum number of iterations).

Furthermore, the mean is not very sensible for sparse data, and text vectors tend to be very sparse. Roughly speaking, the problem is that the mean of a large number of documents will no longer look like a real document; it thereby becomes dissimilar to every real document, and more similar to other mean vectors. So the results degenerate to some extent.

For text vectors, you will probably want to use a different distance function, such as cosine distance (one minus cosine similarity).

And of course you first need to compute numeric vectors, for example by using relative term frequencies and normalizing them via TF-IDF.
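
A minimal sketch of that preprocessing, assuming scikit-learn (its TfidfVectorizer produces L2-normalized TF-IDF vectors, and cosine_similarity compares them):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "there is a dog who chased a cat",
    "someone ate pizza for lunch",
    "the dog and a cat walk down the street toward another dog",
]

# Build L2-normalized TF-IDF vectors (sparse matrix, one row per document).
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Pairwise cosine similarity: docs 1 and 3 should score highest together.
print(cosine_similarity(tfidf))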

There is a variation of the k-means idea known as k-medoids. It can work with arbitrary distance functions, and it avoids the whole "mean" thing by using the real document that is most central to the cluster (the "medoid"). But the known algorithms for this are much slower than k-means.
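
For illustration, here is a minimal NumPy sketch of the "alternate" (Voronoi-style) k-medoids iteration on a precomputed cosine-distance matrix; real PAM-style implementations do a more careful swap search:

import numpy as np

def k_medoids(dist, k, iters=100, seed=0):
    # Voronoi-style k-medoids on a precomputed distance matrix.
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    labels = np.argmin(dist[:, medoids], axis=1)
    for _ in range(iters):
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size:
                # Pick the member with the smallest total distance to the rest.
                costs = dist[np.ix_(members, members)].sum(axis=0)
                new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
        labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

# Cosine-distance matrix for the count vectors from Solution 1.
X = np.array([[1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1],
              [2, 1, 1, 0, 0]], dtype=float)
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
dist = 1.0 - Xn @ Xn.T

medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)  # docs 1 and 3 should share a medoid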

Author: Phyo Arkar Lwin

Updated on June 05, 2022

Comments

  • Phyo Arkar Lwin
    Phyo Arkar Lwin almost 2 years

    I want to cluster documents based on similarity.

    I have tried ssdeep (similarity hashing), which is very fast, but I was told that k-means is faster, that FLANN is the fastest of the implementations, and that it is more accurate, so I am trying FLANN with its Python bindings, but I can't find any example of how to use it on text (it only supports arrays of numbers).

    I am very, very new to this field (k-means, natural language processing). What I need is speed and accuracy.

    My questions are:

    1. Can we do document similarity grouping / clustering using k-means? (FLANN does not seem to allow any text input.)
    2. Is FLANN the right choice? If not, please suggest a high-performance library that supports text/document clustering and has a Python wrapper/API.
    3. Is k-means the right algorithm?
  • Phyo Arkar Lwin
    Phyo Arkar Lwin over 11 years
    Very interesting. I read somewhere a few days ago that scikit-learn has such a feature to vectorize any text file or string. I am wondering whether the data structure it gives out will be suitable for FLANN?
  • jpsfer
    jpsfer over 11 years
    I would just add that you could use a stemming algorithm to ensure you treat small variations of the same word as the same keyword. This will decrease the number of variables and should make the overall process more accurate. See this link for more info.
  • Phyo Arkar Lwin
    Phyo Arkar Lwin over 11 years
    Yes, that would be nice. I can use NLTK to produce / tokenize words.
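
    A tiny sketch of that (assuming NLTK is installed; its PorterStemmer needs no extra data downloads):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    # Variants of the same word collapse to a common stem.
    print([stemmer.stem(w) for w in "dogs chased walking walks".split()])
    # ['dog', 'chase', 'walk', 'walk']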
  • Phyo Arkar Lwin
    Phyo Arkar Lwin over 11 years
    Thank you so much for pointing that out. Any implementations of k-medoids that you recommend?
  • Carpetfizz
    Carpetfizz over 7 years
    If I don't have any feature words and my bag of words is just "any word that's not a stopword", will each vector be the length of all possible words, with each index representing the occurrences of the word assigned to that index?
  • CKM
    CKM about 7 years
    @Carpetfizz. That's true.
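
    For illustration, a minimal sketch of that full-vocabulary case (assuming scikit-learn's CountVectorizer):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["there is a dog who chased a cat",
            "someone ate pizza for lunch"]
    cv = CountVectorizer(stop_words="english")
    X = cv.fit_transform(docs)
    print(cv.get_feature_names_out())  # the vocabulary: one entry per column
    print(X.toarray())                 # one row per doc, counts per word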