Python Maxent Classifier

10,961

Solution 1

There's probably a fix for the numpy overflow issue, but since this is just a movie review classifier for learning NLTK and text classification (and you probably don't want training to take a long time anyway), here is a simple workaround: restrict the words used in the feature sets.

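You can find the 300 most common words across all reviews like this (raise the number if you want):

all_words = nltk.FreqDist(word for word in movie_reviews.words())
# most_common() returns (word, count) pairs sorted by frequency;
# slicing keys() is not sorted and fails on Python 3.
top_words = set(word for word, count in all_words.most_common(300))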

Then all you have to do is cross-reference top_words in your feature extractor for reviews. Also, just as a suggestion, it's more efficient to use a dictionary comprehension than to convert a list of tuples to a dict. So this might look like:

def word_feats(words):
    return {word:True for word in words if word in top_words}
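
As a rough sketch of the same idea that runs without downloading the corpus, the snippet below uses `collections.Counter` from the standard library in place of `nltk.FreqDist`. The toy `docs` list and the `build_top_words` helper are made up for illustration; only `word_feats` mirrors the function above (with `top_words` passed in explicitly).

```python
from collections import Counter

def build_top_words(docs, n):
    """Return the set of the n most frequent words across all documents."""
    counts = Counter(word for doc in docs for word in doc)
    return {word for word, _ in counts.most_common(n)}

def word_feats(words, top_words):
    """Bag-of-words features restricted to the allowed vocabulary."""
    return {word: True for word in words if word in top_words}

# Toy stand-in for movie_reviews.words(); not real corpus data.
docs = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "awful"],
    ["the", "plot", "was", "thin"],
]

top_words = build_top_words(docs, 3)
feats = word_feats(docs[0], top_words)
print(sorted(top_words))
print(feats)
```

Rare words such as "great" are dropped here, which is the trade-off of this workaround: fewer features to train on, at the cost of losing some informative but infrequent words.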

Solution 2

I changed and updated the code a bit.

import nltk
from nltk.classify import MaxentClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Bag-of-words features: every word in the review maps to True.
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Use the first three quarters of each class for training.
# Integer division keeps the cutoffs usable as slice indices on Python 3.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
#classifier = nltk.MaxentClassifier.train(trainfeats)

# Pick the training algorithm explicitly and cap the number of iterations.
algorithm = MaxentClassifier.ALGORITHMS[0]
classifier = MaxentClassifier.train(trainfeats, algorithm, max_iter=3)

classifier.show_most_informative_features(10)

# Restricted feature extractor from Solution 1. To use it, define it
# before negfeats and posfeats are built.
all_words = nltk.FreqDist(word for word in movie_reviews.words())
top_words = set(word for word, count in all_words.most_common(300))

def word_feats(words):
    return {word: True for word in words if word in top_words}
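
The code above sets aside a quarter of each class but never uses it. A minimal sketch of the split and an accuracy check follows; `split_by_fraction` and `accuracy` are hypothetical helpers written in plain Python, and `toy_classify` is a stand-in for the trained model's `classify` method. With NLTK installed, `nltk.classify.util.accuracy(classifier, testfeats)` should do the same job as the `accuracy` helper here.

```python
def split_by_fraction(items, numerator=3, denominator=4):
    """Split a list into (train, test) using integer division for the cutoff."""
    cutoff = len(items) * numerator // denominator
    return items[:cutoff], items[cutoff:]

def accuracy(classify, gold):
    """Fraction of (features, label) pairs where classify(features) == label."""
    correct = sum(1 for feats, label in gold if classify(feats) == label)
    return correct / len(gold)

# Toy labelled feature sets standing in for negfeats / posfeats.
negfeats = [({"bad": True}, "neg")] * 4
posfeats = [({"good": True}, "pos")] * 4

neg_train, neg_test = split_by_fraction(negfeats)
pos_train, pos_test = split_by_fraction(posfeats)
trainfeats = neg_train + pos_train
testfeats = neg_test + pos_test

# Stand-in for classifier.classify: predicts 'pos' when "good" is present.
def toy_classify(feats):
    return "pos" if "good" in feats else "neg"

print(len(trainfeats), len(testfeats))
print(accuracy(toy_classify, testfeats))
```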
Author by

cjds


Updated on June 17, 2022

Comments

  • cjds
    cjds almost 2 years

    I've been using the maxent classifier in Python and it's failing, and I don't understand why.

    I'm using the movie reviews corpus. (total noob)

    import nltk.classify.util
    from nltk.classify import MaxentClassifier
    from nltk.corpus import movie_reviews
    
    def word_feats(words):
     return dict([(word, True) for word in words])
    
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')
    
    negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
    
    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4
    
    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    classifier = MaxentClassifier.train(trainfeats)
    

    This is the error (I know I'm doing this wrong; please link to how Maxent works):

    Warning (from warnings module):
      File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1334
        sum1 = numpy.sum(exp_nf_delta * A, axis=0)
    RuntimeWarning: invalid value encountered in multiply

    Warning (from warnings module):
      File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1335
        sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
    RuntimeWarning: invalid value encountered in multiply

    Warning (from warnings module):
      File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1341
        deltas -= (ffreq_empirical - sum1) / -sum2
    RuntimeWarning: invalid value encountered in divide
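
For what it's worth, warnings of this kind usually mean a `nan` has crept into the arithmetic. One common route, and this is only a guess about what happens inside `maxent.py` here, is an exponential overflowing to infinity and then being multiplied by zero. The plain-Python sketch below shows that behaviour with ordinary floats; numpy follows the same IEEE 754 rules but reports them as `RuntimeWarning`s. The variable names are illustrative and not taken from NLTK.

```python
import math

big = math.inf          # what an overflowing exponential ends up as in numpy
product = big * 0.0     # inf * 0 is undefined, so the result is nan
total = sum([1.0, 2.0, product])  # nan propagates through any later sum

print(product, total)
print(math.isnan(product), math.isnan(total))
```

Once a `nan` appears in the weight updates, every later iteration inherits it, which is why keeping the feature set small or capping the iterations can sidestep the problem.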