Remove stopwords and tokenize for BigramCollocationFinder (NLTK)
I am presuming that sentiment_test.txt is just plain text, not a specific format. Your code is filtering lines, not words. You should tokenize first, then filter out the stopwords:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stopset = set(stopwords.words('english'))

with open('sentiment_test.txt', 'r') as text_file:
    text = text_file.read()

# Tokenize the raw text first, then drop the stopwords token by token
tokens = word_tokenize(text)
tokens = [w for w in tokens if w not in stopset]
print(tokens)
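From there you can hand the filtered tokens straight to the collocation finder you were building. A minimal sketch (the token list here is a made-up stand-in for your filtered output, so no corpus download is needed):

```python
import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

# Hypothetical filtered tokens standing in for the output of the snippet above
tokens = ['great', 'movie', 'great', 'movie', 'terrible',
          'plot', 'great', 'acting']

# Build the finder from a flat token list and rank bigrams by PMI
finder = BigramCollocationFinder.from_words(tokens)
top = finder.nbest(bigram_measures.pmi, 3)
print(top)
```

`from_words` expects an iterable of individual tokens, which is exactly what the tokenize-then-filter step produces; passing it lines or a stringified list is what triggers the error in the question.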
Hope this helps.
Author: jxn (updated on June 04, 2022)

Comments

jxn, almost 2 years ago:
I keep getting this error when I try to run this script:

    return _compile(pattern, flags).sub(repl, string, count)
    TypeError: expected string or buffer

I am essentially reading from a text file, filtering out the stopwords, and tokenizing them using NLTK.
import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stopset = set(stopwords.words('english'))
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

text_file = open('sentiment_test.txt', 'r')
lines = text_file.readlines()
filtered_words = [w for w in lines if not w in stopwords.words('english')]
print filtered_words
tokens = word_tokenize(str(filtered_words))
print tokens
finder = BigramCollocationFinder.from_words(tokens)
Any help would be much appreciated.