n-grams with Naive Bayes classifier
A bigram feature vector follows exactly the same principles as a unigram feature vector. So, just as in the tutorial you mentioned, you will have to check whether each bigram feature is present in any of the documents you use.
As for how to extract the bigram features, I have written the code below. You can adapt it to build the variable "tweets" from the tutorial.
import nltk
text = "Hi, I want to get the bigram list of this string"
for item in nltk.bigrams(text.split()):
    print(' '.join(item))
Instead of printing them, you can append them to the "tweets" list and you are good to go. I hope this helps; let me know if you still have problems.
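As a rough illustration of that last step, here is a minimal sketch. It assumes, as in the linked tutorial, that "tweets" is a list of (features, sentiment) pairs. The two sample sentences and the helper name bigrams_of are made up for the example, and zip(tokens, tokens[1:]) is used as a standard-library stand-in that yields the same consecutive pairs as nltk.bigrams on a list. It is written for Python 3.

```python
def bigrams_of(text):
    """Return the word bigrams of `text` as space-joined strings."""
    tokens = text.lower().split()
    # Pair each token with its successor: (t0, t1), (t1, t2), ...
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

# Hypothetical labeled examples, standing in for the tutorial's data.
labeled_tweets = [
    ("I love this car", "positive"),
    ("I do not like this car", "negative"),
]

tweets = []
for text, sentiment in labeled_tweets:
    # Store bigrams where the unigram version would store single words.
    tweets.append((bigrams_of(text), sentiment))

print(tweets[0])
```

Everything downstream that iterates over the features in "tweets" should then see bigram strings instead of single words.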
Please note that in applications like sentiment analysis, some researchers tokenize the words and remove the punctuation, and others don't. From experience I know that if you don't remove punctuation, Naive Bayes works almost the same, but an SVM would have a lower accuracy rate. You might need to experiment with this and decide what works better on your dataset.
Edit 1:
There is a book named "Natural Language Processing with Python" which I can recommend to you. It contains examples of bigrams as well as some exercises. However, I think you can solve this case even without it. The idea behind selecting bigrams as features is that we want to know the probability that word A appears in our corpus followed by word B. So, for example, in the sentence
"I drive a truck"
the word unigram features would be each of those 4 words while the word bigram features would be:
["I drive", "drive a", "a truck"]
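To make the "word A followed by word B" idea concrete, here is a small sketch under stated assumptions. The three-sentence corpus is invented, the function names are hypothetical, and the probability is a plain relative-frequency estimate (count of "A B" divided by count of A), which is only one simple way to estimate it. It is written for Python 3 and uses only the standard library.

```python
from collections import Counter

def unigrams(text):
    return text.split()

def bigrams(text):
    tokens = text.split()
    # Consecutive word pairs, joined with a space.
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

sentence = "I drive a truck"
print(unigrams(sentence))
print(bigrams(sentence))

# Toy corpus, made up for illustration only.
corpus = ["I drive a truck", "I drive a car", "I ride a bike"]
unigram_counts = Counter(tok for line in corpus for tok in unigrams(line))
bigram_counts = Counter(bg for line in corpus for bg in bigrams(line))

def follow_probability(first, second):
    """Relative frequency with which `second` directly follows `first`."""
    if unigram_counts[first] == 0:
        return 0.0
    return bigram_counts[first + ' ' + second] / unigram_counts[first]

print(follow_probability("I", "drive"))
```

In this toy corpus "I" occurs three times and is followed by "drive" twice, so the estimate is 2/3.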
Now you want to use those 3 as your features. The function below puts all bigrams of a string in a list named bigramFeatureVector.
def bigramReturner(tweetString):
    tweetString = tweetString.lower()
    tweetString = removePunctuation(tweetString)
    bigramFeatureVector = []
    for item in nltk.bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector
Note that you have to write your own removePunctuation function. What you get as output of the above function is the bigram feature vector. You will treat it exactly the same way the unigram feature vectors are treated in the tutorial you mentioned.
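One possible removePunctuation is sketched below. It is an assumption rather than the only reasonable choice: it deletes every ASCII punctuation character in string.punctuation, so "don't" becomes "dont", and it uses Python 3's str.maketrans. The bigramReturner here uses zip as a standard-library equivalent of nltk.bigrams, so the snippet runs without NLTK.

```python
import string

def removePunctuation(text):
    # Delete every character listed in string.punctuation (ASCII only).
    return text.translate(str.maketrans('', '', string.punctuation))

def bigramReturner(tweetString):
    tokens = removePunctuation(tweetString.lower()).split()
    # Same consecutive pairs nltk.bigrams would yield for this token list.
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

print(bigramReturner("Hi, I want to get the bigram list!"))
```

If punctuation such as emoticons carries sentiment in your data, a gentler cleaning rule may work better; that is worth testing on your own dataset.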
Aikin
Updated on September 15, 2022
Comments
-
Aikin over 1 year
I'm new to Python and need help! I was practicing with Python NLTK text classification. Here is the code example I am practicing on: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
I've tried this:
from nltk import bigrams
from nltk.probability import ELEProbDist, FreqDist
from nltk import NaiveBayesClassifier
from collections import defaultdict

train_samples = {}

with file ('positive.txt', 'rt') as f:
    for line in f.readlines():
        train_samples[line]='pos'

with file ('negative.txt', 'rt') as d:
    for line in d.readlines():
        train_samples[line]='neg'

f=open("test.txt", "r")
test_samples=f.readlines()

def bigramReturner(text):
    tweetString = text.lower()
    bigramFeatureVector = {}
    for item in bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector

def get_labeled_features(samples):
    word_freqs = {}
    for text, label in train_samples.items():
        tokens = text.split()
        for token in tokens:
            if token not in word_freqs:
                word_freqs[token] = {'pos': 0, 'neg': 0}
            word_freqs[token][label] += 1
    return word_freqs

def get_label_probdist(labeled_features):
    label_fd = FreqDist()
    for item,counts in labeled_features.items():
        for label in ['neg','pos']:
            if counts[label] > 0:
                label_fd.inc(label)
    label_probdist = ELEProbDist(label_fd)
    return label_probdist

def get_feature_probdist(labeled_features):
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    num_samples = len(train_samples) / 2
    for token, counts in labeled_features.items():
        for label in ['neg','pos']:
            feature_freqdist[label, token].inc(True, count=counts[label])
            feature_freqdist[label, token].inc(None, num_samples - counts[label])
            feature_values[token].add(None)
            feature_values[token].add(True)
    for item in feature_freqdist.items():
        print item[0],item[1]
    feature_probdist = {}
    for ((label, fname), freqdist) in feature_freqdist.items():
        probdist = ELEProbDist(freqdist, bins=len(feature_values[fname]))
        feature_probdist[label,fname] = probdist
    return feature_probdist

labeled_features = get_labeled_features(train_samples)
label_probdist = get_label_probdist(labeled_features)
feature_probdist = get_feature_probdist(labeled_features)

classifier = NaiveBayesClassifier(label_probdist, feature_probdist)

for sample in test_samples:
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
but I am getting this error. Why?
Traceback (most recent call last):
  File "C:\python\naive_test.py", line 76, in <module>
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
  File "C:\python\naive_test.py", line 23, in bigramReturner
    bigramFeatureVector.append(' '.join(item))
AttributeError: 'dict' object has no attribute 'append'
-
Aikin over 11 years
Ahhh, I don't understand how to use bigrams in Python... are there any tutorials?
-
Aikin over 11 years
I've edited my question. Could you help me with the error I'm getting? I used your code.
-
user823743 over 11 years
In my code you can see that I have defined a list like this: (bigramFeatureVector = []). However, for some reason you have changed that to (bigramFeatureVector = {}), which is a dictionary. The "append" method doesn't work on dictionaries!
-
Aikin over 11 years
Oops, that was my mistake while typing... but I am still getting an error:
Traceback (most recent call last):
  File "C:\python\naive_test.py", line 74, in <module>
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
  File "C:\python\lib\site-packages\nltk\classify\naivebayes.py", line 88, in classify
    return self.prob_classify(featureset).max()
  File "C:\python\lib\site-packages\nltk\classify\naivebayes.py", line 94, in prob_classify
    featureset = featureset.copy()
AttributeError: 'list' object has no attribute 'copy'
-
Aikin over 11 years
I am a newbie in programming, so could you tell me whether my code is right? I would really appreciate any help. Maybe my understanding is not right.
-
Poik about 11 years
@Aikin Use featureset[:] instead of featureset.copy().
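A note on the second traceback in the comments: NLTK classifiers take a featureset, which is a dict mapping feature names to values, and that is why the library calls featureset.copy() and fails when handed a list. A minimal sketch of turning the bigram list into such a dict follows. The function name bigram_featureset and the choice of using the bigram string itself as the feature name with the value True are assumptions; the names must match whatever the classifier was trained on. In the question's code, get_labeled_features counts single tokens, so the training side would presumably need to count bigrams as well.

```python
def bigram_featureset(text):
    """Map each bigram of `text` to True, in the dict form NLTK expects."""
    tokens = text.lower().split()
    # zip pairs each token with its successor, like nltk.bigrams on a list.
    return {' '.join(pair): True for pair in zip(tokens, tokens[1:])}

features = bigram_featureset("I drive a truck")
print(features)
```

With a classifier trained on matching feature names, the call would then look like classifier.classify(bigram_featureset(sample)).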