n-grams with Naive Bayes classifier

python nltk n-gram

16,539

A bigram feature vector follows exactly the same principles as a unigram feature vector. So, just like in the tutorial you mentioned, you will have to check whether a bigram feature is present in any of the documents you use.

As for the bigram features and how to extract them, I have written the code below. You can adapt it to change the variable "tweets" in the tutorial.

import nltk
text = "Hi, I want to get the bigram list of this string"
for item in nltk.bigrams(text.split()):
    print(' '.join(item))

Instead of printing them, you can append them to the "tweets" list and you are good to go. I hope this helps; let me know if you still have problems.

Please note that in applications like sentiment analysis, some researchers tokenize the words and remove punctuation while others do not. From experience, if you don't remove punctuation, Naive Bayes works almost the same, whereas an SVM tends to lose accuracy. You may need to experiment and decide what works better on your dataset.

Edit 1:

There is a book named "Natural Language Processing with Python" that I can recommend. It contains examples of bigrams as well as some exercises. However, I think you can solve this case without it. The idea behind selecting bigrams as features is that we want to know the probability that word A appears in our corpus followed by word B. So, for example, in the sentence

"I drive a truck"

the word unigram features would be each of those 4 words while the word bigram features would be:

["I drive", "drive a", "a truck"]
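One way to read "the probability that word A is followed by word B" is as a relative frequency over the corpus. The sketch below is only an illustration of that idea: the two-sentence corpus and the function name follow_probability are invented here, and it pairs words with zip rather than nltk.bigrams so that it runs with the standard library alone.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = ["I drive a truck", "I drive a car"]

bigram_counts = Counter()  # how often each (first, second) pair occurs
first_counts = Counter()   # how often each word starts a bigram

for sentence in corpus:
    tokens = sentence.lower().split()
    bigram_counts.update(zip(tokens, tokens[1:]))
    first_counts.update(tokens[:-1])


def follow_probability(first, second):
    """Relative frequency of `second` directly after `first` in the corpus."""
    if first_counts[first] == 0:
        return 0.0
    return bigram_counts[(first, second)] / float(first_counts[first])


print(follow_probability("a", "truck"))
print(follow_probability("i", "drive"))
```

In this toy corpus "a" is followed by "truck" in one of its two occurrences and "i" is always followed by "drive". The sentence "I drive a truck" on its own still contributes exactly the three bigrams listed above.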

Now you want to use those 3 as your features. The function below puts all bigrams of a string in a list named bigramFeatureVector.

def bigramReturner(tweetString):
    tweetString = tweetString.lower()
    tweetString = removePunctuation(tweetString)
    bigramFeatureVector = []
    for item in nltk.bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector

Note that you have to write your own removePunctuation function. What you get as output of the above function is the bigram feature vector. You will treat it exactly the same way the unigram feature vectors are treated in the tutorial you mentioned.
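For reference, here is a minimal sketch of what such a removePunctuation helper could look like. It is an assumption about what "punctuation" should mean (the ASCII characters in string.punctuation), not the version used in the answer, and you may want different rules for tweets (for example keeping "#" or "@").

```python
import string


def removePunctuation(text):
    """Return text with every ASCII punctuation character dropped.

    Assumption: "punctuation" means the characters in string.punctuation.
    """
    return "".join(ch for ch in text if ch not in string.punctuation)


print(removePunctuation("Hi, I want to get the bigram list!"))
# prints: Hi I want to get the bigram list
```

Because it only filters characters, the helper leaves casing and whitespace untouched, so it can be called before or after lower().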


Author by

Aikin

Updated on September 15, 2022

Comments

Aikin over 1 year

I'm new to Python and need help! I was practicing with Python NLTK text classification. Here is the code example I am practicing on: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/

I've tried this:

from nltk import bigrams
from nltk.probability import ELEProbDist, FreqDist
from nltk import NaiveBayesClassifier
from collections import defaultdict

train_samples = {}

with file ('positive.txt', 'rt') as f:
   for line in f.readlines():
       train_samples[line]='pos'

with file ('negative.txt', 'rt') as d:
   for line in d.readlines():
       train_samples[line]='neg'

f=open("test.txt", "r")
test_samples=f.readlines()

def bigramReturner(text):
    tweetString = text.lower()
    bigramFeatureVector = {}
    for item in bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector

def get_labeled_features(samples):
    word_freqs = {}
    for text, label in train_samples.items():
        tokens = text.split()
        for token in tokens:
            if token not in word_freqs:
                word_freqs[token] = {'pos': 0, 'neg': 0}
            word_freqs[token][label] += 1
    return word_freqs


def get_label_probdist(labeled_features):
    label_fd = FreqDist()
    for item,counts in labeled_features.items():
        for label in ['neg','pos']:
            if counts[label] > 0:
                label_fd.inc(label)
    label_probdist = ELEProbDist(label_fd)
    return label_probdist


def get_feature_probdist(labeled_features):
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    num_samples = len(train_samples) / 2
    for token, counts in labeled_features.items():
        for label in ['neg','pos']:
            feature_freqdist[label, token].inc(True, count=counts[label])
            feature_freqdist[label, token].inc(None, num_samples - counts[label])
            feature_values[token].add(None)
            feature_values[token].add(True)
    for item in feature_freqdist.items():
        print item[0],item[1]
    feature_probdist = {}
    for ((label, fname), freqdist) in feature_freqdist.items():
        probdist = ELEProbDist(freqdist, bins=len(feature_values[fname]))
        feature_probdist[label,fname] = probdist
    return feature_probdist



labeled_features = get_labeled_features(train_samples)

label_probdist = get_label_probdist(labeled_features)

feature_probdist = get_feature_probdist(labeled_features)

classifier = NaiveBayesClassifier(label_probdist, feature_probdist)

for sample in test_samples:
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))

But I am getting this error. Why?

Traceback (most recent call last):
  File "C:\python\naive_test.py", line 76, in <module>
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
  File "C:\python\naive_test.py", line 23, in bigramReturner
    bigramFeatureVector.append(' '.join(item))
AttributeError: 'dict' object has no attribute 'append'

Aikin over 11 years

Ahh, I don't understand how to use bigrams in Python. Are there any tutorials?
Aikin over 11 years

I've edited my question. Could you help me with the error I'm getting? I used your code.
user823743 over 11 years

In my code you can see that I defined a list like this: (bigramFeatureVector = []). However, for some reason you changed that to (bigramFeatureVector = {}), which is a dictionary. The "append" method doesn't work on dictionaries!
Aikin over 11 years

Oops, that was my mistake while typing, but I am still getting an error:

Traceback (most recent call last):
  File "C:\python\naive_test.py", line 74, in <module>
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
  File "C:\python\lib\site-packages\nltk\classify\naivebayes.py", line 88, in classify
    return self.prob_classify(featureset).max()
  File "C:\python\lib\site-packages\nltk\classify\naivebayes.py", line 94, in prob_classify
    featureset = featureset.copy()
AttributeError: 'list' object has no attribute 'copy'
Aikin over 11 years

I am a newbie in programming, so could you tell me whether my code is right? I would really appreciate any help. Maybe my understanding is not right.
Poik about 11 years

@Aikin Use featureset[:] instead of featureset.copy().
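
A note on the last traceback: NLTK classifiers take a featureset as a dictionary that maps feature names to values, which is why prob_classify calls .copy() on it, while bigramReturner returns a list. An alternative to editing the library is to convert the list before classifying. The sketch below is one way to do that; the helper name to_featureset is hypothetical, and it uses True as the value because the question's get_feature_probdist records present features as True.

```python
def to_featureset(bigram_list):
    """Turn a list of bigram strings into a {feature_name: True} dict."""
    return dict((bigram, True) for bigram in bigram_list)


bigram_list = ["i drive", "drive a", "a truck"]  # e.g. output of bigramReturner
featureset = to_featureset(bigram_list)

# A dict supports .copy(), the call that failed on the list in the traceback.
snapshot = featureset.copy()
print(sorted(featureset.items()))
```

The call in the question would then read classifier.classify(to_featureset(bigramReturner(sample))). Note also that get_labeled_features in the question counts single tokens from text.split(), so the classifier is trained on unigram names; for bigram features to match anything, training would presumably need to count bigrams as well.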