Better text documents clustering than tf/idf and cosine similarity?

Solution 1

In my experience, cosine similarity on latent semantic analysis (LSA/LSI) vectors works a lot better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data. In particular, it tends to take care of the sparsity problem that you're encountering, where the documents just don't contain enough common terms.

Topic models such as LDA might work even better.
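A minimal sketch of the LSA-then-cosine idea with scikit-learn (the corpus, number of components, and library choice here are my illustration, not part of the answer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The website Stackoverflow is a nice place.",
    "I visit Stackoverflow regularly.",
    "My stack of boxes is about to overflow.",
]

# Raw tf-idf matrix (sparse, one row per document)
tfidf = TfidfVectorizer().fit_transform(docs)

# Project into a low-rank "latent semantic" space; TruncatedSVD on a
# tf-idf matrix is the standard LSA construction
svd = TruncatedSVD(n_components=2, random_state=0)
lsa = svd.fit_transform(tfidf)

# Cosine similarity is then computed on the dense LSA vectors instead
# of the sparse tf-idf rows
sims = cosine_similarity(lsa)
```

The same `sims` matrix can then be fed to any similarity-based clustering algorithm.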

Solution 2

As mentioned in other comments and answers, LDA can give you good tweet-to-topic weights.

If these weights alone don't cluster the tweets well enough for your needs, you could run a clustering algorithm over the topic distributions themselves.

While it is training-set dependent, LDA could easily bundle tweets containing stackoverflow, stack-overflow and stack overflow into the same topic. However, "my stack of boxes is about to overflow" might instead land in another topic about boxes.

Another example: a tweet with the word Apple could belong to a number of different topics (the company, the fruit, New York and others). LDA looks at the other words in the tweet to determine which topics apply:

  1. "Steve Jobs was the CEO at Apple" is clearly about the company
  2. "I'm eating the most delicious apple" is clearly about the fruit
  3. "I'm going to the big apple when I travel to the USA" is most likely about visiting New York

Solution 3

Long answer:

TF-IDF is currently one of the best-known ranking methods in search. What you need is some preprocessing from Natural Language Processing (NLP). There are plenty of resources that can help you for English (for example the 'nltk' library in Python).

You must apply the same NLP analysis both to your queries (questions) and to your documents before indexing.

The point is: while TF-IDF (or TF-IDF^2 as in Lucene) is good, you should use it on resources annotated with meta-linguistic information. That can be hard, and requires extensive knowledge about your core search engine, grammatical (syntactic) analysis and the domain of the documents.

Short answer: the better technique is to use TF-IDF with light grammatical NLP annotations, applying the same rewriting to both the query and the indexed documents.
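The key point, the same annotation must be applied at indexing time and at query time, can be sketched like this. The toy `normalize` function here is my stand-in for real NLP annotation (lemmatization, POS tagging, which a library like nltk would provide); the corpus and query are made up:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def normalize(text):
    # Toy preprocessing: lowercase, then strip a couple of common suffixes
    # so that "visiting"/"visit" and "websites"/"website" match.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [re.sub(r"(ing|s)$", "", t) for t in tokens]

docs = [
    "The website Stackoverflow is a nice place.",
    "I visit Stackoverflow regularly.",
]
query = ["visiting nice websites"]

# The same analyzer is used for indexing the documents...
vec = TfidfVectorizer(analyzer=normalize)
doc_matrix = vec.fit_transform(docs)

# ...and for rewriting the query, so normalized forms line up
query_matrix = vec.transform(query)

scores = cosine_similarity(query_matrix, doc_matrix)[0]
```

Without the shared normalization step, "visiting" and "visit" would be distinct terms and the query would match neither document well.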


Author: Jack Twain

Updated on September 15, 2022

Comments

  • Jack Twain
    Jack Twain about 1 year

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried clustering the stream with an online clustering algorithm using tf/idf and cosine similarity, but I found the results are quite bad.

The main disadvantage of tf/idf is that it clusters documents that are keyword-similar, so it's only good at identifying near-identical documents. For example, consider the following sentences:

    1- The website Stackoverflow is a nice place. 2- Stackoverflow is a website.

The previous two sentences will likely be clustered together with a reasonable threshold value, since they share a lot of keywords. But now consider the following two sentences:

    1- The website Stackoverflow is a nice place. 2- I visit Stackoverflow regularly.

Now the tf/idf clustering algorithm will fail miserably, because the sentences share only one keyword even though they both talk about the same topic.
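The failure mode described above can be demonstrated directly. This snippet is my illustration (assuming scikit-learn's default vectorizer settings), not part of the question:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The website Stackoverflow is a nice place.",  # 0
    "Stackoverflow is a website.",                 # 1: heavy keyword overlap with 0
    "I visit Stackoverflow regularly.",            # 2: same topic as 0, one shared keyword
]

tfidf = TfidfVectorizer().fit_transform(sentences)
sims = cosine_similarity(tfidf)

overlap_pair = sims[0, 1]  # near-identical wording: high cosine similarity
topical_pair = sims[0, 2]  # same topic, little overlap: much lower similarity
```

Any single threshold that merges the first pair will tend to miss the second, which is exactly the complaint here.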

My question: are there better techniques for clustering documents?

    • Steve
      Steve
One thing to watch with LSI / LDA / NMF etc. is topic drift. Training and evaluating a model on a known dataset can yield deceptively good results even if your pipeline isn't done correctly. If you then apply the model to a totally unseen dataset, you may see a significant drop in performance, because it was fitted to the original training data. And because Twitter text is so short, the representation will need a bit of fiddling, as there may not be enough text to train a model properly.
    • Has QUIT--Anony-Mousse
      Has QUIT--Anony-Mousse
@ThomasJungblut well, TF-IDF is supposed to be a weighting scheme that already puts more weight on relevant keywords. I figure the problem is that tweets are such tiny text fragments that you can't expect similarity to work very well on them beyond "near identity". Most tweets aren't even complete sentences, so NLP will likely fail as well.
  • Jack Twain
    Jack Twain over 10 years
Are topic models clustering techniques, or feature representations?
  • Fred Foo
    Fred Foo over 10 years
    @guckogucko: feature representations.