Implementing Bag-of-Words Naive-Bayes classifier in NLTK


Solution 1

scikit-learn has an implementation of multinomial naive Bayes, which is the right variant of naive Bayes in this situation. A support vector machine (SVM) would probably work better, though.

As Ken pointed out in the comments, NLTK has a nice wrapper for scikit-learn classifiers. Modified from the docs, here's a somewhat complicated example that applies TF-IDF weighting, selects the 1000 best features according to a chi-squared statistic, and then feeds those into a multinomial naive Bayes classifier. (I bet this is somewhat clumsy, as I'm not super familiar with either NLTK or scikit-learn.)

import numpy as np
from nltk.probability import FreqDist
from nltk.classify import SklearnClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Pipeline: TF-IDF weighting, chi-squared feature selection, multinomial NB
pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

from nltk.corpus import movie_reviews

# Represent each review as a bag of words (word -> count)
pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')]
neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')]
add_label = lambda lst, lab: [(x, lab) for x in lst]

# Train on the first 100 reviews of each class
classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))

# Classify the held-out reviews and tally the confusion matrix
l_pos = np.array(classif.classify_many(pos[100:]))
l_neg = np.array(classif.classify_many(neg[100:]))
print("Confusion matrix:\n%d\t%d\n%d\t%d" % (
          (l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
          (l_neg == 'pos').sum(), (l_neg == 'neg').sum()))

This printed for me:

Confusion matrix:
524     376
202     698

Not perfect, but decent, considering it's not a super easy problem and it's only trained on 100 examples per class.

Solution 2

The features in the NLTK Naive Bayes classifier are "nominal", not numeric. This means they can take a finite number of discrete values (labels), but they can't be treated as frequencies.

So with the Bayes classifier, you cannot directly use word frequency as a feature. You could do something like use the 50 most frequent words from each text as your feature set, but that's quite a different thing.
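
To make that concrete, here is a minimal sketch of the "most frequent words" idea (my own illustration, not part of the original answer; top_word_features is a hypothetical helper) using NLTK's built-in NaiveBayesClassifier with boolean presence features:

from nltk import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def top_word_features(fileid, n=50):
    # Flag each of the document's n most frequent words as a nominal feature
    freqs = FreqDist(w.lower() for w in movie_reviews.words(fileid))
    return {word: True for word, _ in freqs.most_common(n)}

train_set = ([(top_word_features(f), 'pos') for f in movie_reviews.fileids('pos')[:100]] +
             [(top_word_features(f), 'neg') for f in movie_reviews.fileids('neg')[:100]])
classifier = NaiveBayesClassifier.train(train_set)

Note that this still only tells the classifier *which* words made the top 50, not how often they occurred.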

But maybe there are other classifiers in NLTK that do depend on frequency. I wouldn't know, but have you looked? I'd say it's worth checking out.

Solution 3

  • Split the string you are classifying into a list of words.
  • For each word in the list, ask: is this word in my feature list?
  • If it is, add its log probability as normal; if not, ignore it.

If your sentence contains the same word multiple times, it will just add its log probability multiple times. If the word appears multiple times in the same class, your training data should reflect that in the word count.
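
Here is a rough sketch of that manual approach (my own illustration of the recipe above, with add-one smoothing as an assumed detail):

import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    # labeled_docs: list of (list_of_words, label) pairs
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of documents
    for words, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(words)
    return word_counts, label_counts

def classify_nb(words, word_counts, label_counts):
    vocab = {w for counts in word_counts.values() for w in counts}
    best_label, best_score = None, float('-inf')
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))  # log prior
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                # add-one smoothing; a repeated word adds its log prob once per occurrence
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

wc, lc = train_nb([(['great', 'great', 'film'], 'pos'),
                   (['boring', 'bad', 'film'], 'neg')])
print(classify_nb(['great', 'film'], wc, lc))  # -> 'pos'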

For added accuracy, count all bi-grams, tri-grams, etc. as separate features.
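
For example, a small helper along these lines (a sketch of mine; ngram_features is a hypothetical name) could build those count features with nltk.ngrams:

from collections import Counter
from nltk import ngrams

def ngram_features(words, max_n=3):
    # Count unigrams, bigrams, and trigrams as separate features
    feats = Counter()
    for n in range(1, max_n + 1):
        feats.update(ngrams(words, n))
    return feats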

It helps to manually write your own classifiers so that you understand exactly what is happening and what you need to do to improve accuracy. If you use a pre-packaged solution and it doesn't work well enough, there is not much you can do about it.


Comments

  • Ben G, almost 2 years ago:

    I basically have the same question as this guy. The example in the NLTK book for the Naive Bayes classifier considers only whether a word occurs in a document as a feature; it doesn't consider the frequency of the words ("bag-of-words") as the feature to look at.

    One of the answers seems to suggest this can't be done with the built-in NLTK classifiers. Is that the case? How can I do frequency/bag-of-words NB classification with NLTK?
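
    To illustrate the distinction the question draws (my own example, not from the question itself):

    from nltk import FreqDist

    words = ['great', 'great', 'plot', 'bad']

    # Presence features, as in the NLTK book example
    presence = {w: True for w in set(words)}   # {'great': True, 'plot': True, 'bad': True}

    # Bag-of-words (frequency) features, which the question asks for
    frequency = dict(FreqDist(words))          # {'great': 2, 'plot': 1, 'bad': 1}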