NLTK Stopword List

A few things of note.

  • If you are going to be checking membership over and over, I would use a set instead of a list. A set lookup takes constant time on average, while a list has to be scanned item by item.

  • stopwords.words('english') returns a list of lowercase stop words. It is quite likely that your source has capital letters in it and is not matching for that reason.

  • You aren't reading the file properly. You are iterating over the file object, which yields whole lines, not over a list of the words split by spaces.
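To see the last two points concretely, here is a small sketch. It uses `io.StringIO` as a stand-in for the opened file and a tiny hand-written `stops` set in place of the NLTK list; both are assumptions made so the snippet runs without any downloads.

```python
import io

# Hypothetical stand-ins for the real file and the NLTK stop word list.
stops = {"a", "the", "is"}
fake_file = io.StringIO("The cat is on a mat\n")

# Iterating over a file object yields whole lines, newline included.
first_item = next(iter(fake_file))
print(repr(first_item))            # the item the original code compared
print(first_item in stops)         # a whole line never equals a stop word

# Splitting gives words, but the capital letter still blocks the match.
words = first_item.split()
print(words[0] in stops)           # "The" is not in the lowercase set
print(words[0].lower() in stops)   # lowercasing first makes it match
```

This is why "a" and "the" survive in the question's output: the comparison never sees them as individual lowercase words.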

Putting it all together:

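from nltk.corpus import stopwords

# Build the set once so every membership test is a fast lookup.
stops = set(stopwords.words('english'))

# Iterating over the file yields lines, so split each line into words.
with open("xxx.y.txt", "r") as word_file:
    for line in word_file:
        for w in line.split():
            # The NLTK list is lowercase, so lowercase the word before comparing.
            if w.lower() not in stops:
                print(w)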
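
The question builds a `filtered_words` list rather than printing each word. A minimal sketch of the same logic that returns a list might look like this; `remove_stopwords`, `sample_lines`, and `sample_stops` are hypothetical names and data used only for illustration.

```python
def remove_stopwords(lines, stops):
    """Return the words from an iterable of lines that are not in `stops`."""
    return [w for line in lines for w in line.split() if w.lower() not in stops]

# Hypothetical sample data standing in for the file and the NLTK list.
sample_lines = ["The cat sat on a mat\n", "A dog barked at the cat\n"]
sample_stops = {"a", "the", "on", "at"}

filtered_words = remove_stopwords(sample_lines, sample_stops)
print(filtered_words)
```

With real data you would pass the open file object as `lines` and `set(stopwords.words('english'))` as `stops`.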
Author by

saph_top

Updated on January 09, 2020

Comments

  • saph_top over 4 years

    I have the code below and I am trying to apply a stop word list to a list of words. However, the results still show words such as "a" and "the", which I thought would have been removed by this process. Any ideas about what has gone wrong would be great.

    import nltk
    from nltk.corpus import stopwords
    
    word_list = open("xxx.y.txt", "r")
    filtered_words = [w for w in word_list if not w in stopwords.words('english')]
    print filtered_words
    
  • Hooked about 10 years
    Note that you still aren't filtering out punctuation. You'll want to remove characters such as ';"{}[]/?.,! as well.
  • saph_top about 10 years
    brilliant that worked, must have been reading over the file incorrectly, thanks.
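
Hooked's point about punctuation can be sketched as follows. This is one possible approach, not necessarily what Hooked had in mind: it strips leading and trailing punctuation with `str.strip` and `string.punctuation` from the standard library. The names `strip_punctuation`, `sample_stops`, and `sample_line` are hypothetical, and the small stop set stands in for the NLTK list.

```python
import string

def strip_punctuation(word):
    """Remove leading and trailing punctuation, keeping inner characters such as apostrophes."""
    return word.strip(string.punctuation)

# Hypothetical sample data standing in for the file and the NLTK list.
sample_stops = {"a", "the", "is"}
sample_line = 'The "cat" is here; a dog, too!'

cleaned = [strip_punctuation(w) for w in sample_line.split()]
# Skip tokens that were pure punctuation (now empty) as well as stop words.
filtered_words = [w for w in cleaned if w and w.lower() not in sample_stops]
print(filtered_words)
```

Stripping only the ends of each token leaves contractions like "don't" intact, which matters because the NLTK English list contains contractions of its own.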