Remove stopwords and tokenize for BigramCollocationFinder in NLTK

I am presuming that sentiment_test.txt is just plain text, not a specific format. The problem is that you are filtering whole lines rather than individual words: readlines() gives you a list of lines, and comparing each line against the stopword list does nothing useful. You should tokenize first and then filter out the stopwords.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stopset = set(stopwords.words('english'))

with open('sentiment_test.txt', 'r') as text_file:
    text = text_file.read()
    # tokenize the raw text first, then filter out the stopwords
    tokens = word_tokenize(text)
    # compare lower-cased tokens, since the NLTK stopword list is all lower case
    tokens = [w for w in tokens if w.lower() not in stopset]
    print tokens
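
From there you can hand the filtered tokens straight to the collocation finder. A minimal sketch, reusing the tokens list from above (the frequency threshold of 2 and the PMI measure are just illustrative choices):

import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

finder = BigramCollocationFinder.from_words(tokens)
# ignore bigrams that occur fewer than 2 times
finder.apply_freq_filter(2)
# print the 10 highest-scoring bigrams by pointwise mutual information
print finder.nbest(bigram_measures.pmi, 10)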

Hope this helps.

Comments

  • jxn, almost 2 years ago

    I keep getting this error

    sub
        return _compile(pattern, flags).sub(repl, string, count)
    TypeError: expected string or buffer
    

    when I try to run this script. I am not sure what is wrong. I am essentially reading from a text file, filtering out the stopwords, and tokenizing the text using NLTK.

    import nltk
    from nltk.collocations import *
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    
    stopset = set(stopwords.words('english'))
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()
    
    
    text_file=open('sentiment_test.txt', 'r')
    lines=text_file.readlines()
    filtered_words = [w for w in lines if not w in stopwords.words('english')]
    print filtered_words
    tokens=word_tokenize(str(filtered_words))
    print tokens
    finder = BigramCollocationFinder.from_words(tokens)
    

    Any help would be much appreciated.