How to remove stop words using nltk or python

python nltk stop-words

227,837

Solution 1

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
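Several comments on this answer suggest building a set once and lowercasing words before the comparison. A minimal sketch of that idea follows; the helper name `remove_stop_words` and the tiny `sample_stop_words` list are made up for illustration so the snippet runs without the NLTK data. With NLTK available you would pass `stopwords.words('english')` instead.

```python
def remove_stop_words(words, stop_words):
    """Return the words not found in stop_words, keeping order and duplicates."""
    # Build the set once; membership tests on a set are O(1) on average.
    stop_set = {s.lower() for s in stop_words}
    # Compare in lower case because the NLTK English list is lower case.
    return [w for w in words if w.lower() not in stop_set]


# Hypothetical stand-in for stopwords.words('english').
sample_stop_words = ["the", "is", "a", "of"]
word_list = ["The", "cat", "is", "a", "friend", "of", "the", "dog"]

filtered_words = remove_stop_words(word_list, sample_stop_words)
print(filtered_words)  # ['cat', 'friend', 'dog']
```

The original casing of the kept words is preserved; only the comparison is case-insensitive.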

Solution 2

To exclude all types of stop words, including the NLTK stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))         # About 900 stopwords
nltk_words = list(stopwords.words('english'))   # About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if w not in stop_words]
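Because the two lists overlap, extending one list with the other stores many words twice. A rough sketch of merging them as a set union instead is below. The two short lists are invented stand-ins so the snippet runs without either package; in practice they would come from `get_stop_words('en')` and `stopwords.words('english')`.

```python
# Hypothetical stand-ins for the two real stop word lists.
package_words = ["a", "the", "is", "however"]
nltk_words = ["a", "the", "is", "ourselves"]

# A set union drops the overlap, so each word is stored only once.
combined_stop_words = set(package_words) | set(nltk_words)

word_list = ["however", "the", "cat", "washes", "ourselves"]
output = [w for w in word_list if w not in combined_stop_words]

print(len(combined_stop_words))  # 5
print(output)  # ['cat', 'washes']
```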

Solution 3

You could also do a set diff, for example:

import nltk
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
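One thing to keep in mind with a set difference is that it discards word order and repeated words. The sketch below contrasts it with an order-preserving filter. It uses `re.split` from the standard library as a stand-in tokenizer and a two-word `stop_set` invented for the example, so it is not the NLTK call above, only an illustration of the trade-off.

```python
import re
from collections import Counter

sentence = "the cat and the dog and the bird saw the cat"
stop_set = {"the", "and"}  # hypothetical stand-in for the NLTK list

# Split on whitespace; drop empty strings from leading or trailing gaps.
tokens = [t for t in re.split(r"\s+", sentence) if t]

# Set difference: unique words only, in no particular order.
set_diff = set(tokens) - stop_set

# List filter: keeps order and duplicates, so frequency counts still work.
kept = [t for t in tokens if t not in stop_set]
counts = Counter(kept)

print(kept)           # ['cat', 'dog', 'bird', 'saw', 'cat']
print(counts["cat"])  # 2
```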

Solution 4

I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:

from nltk.corpus import stopwords

stop_words = stopwords.words('english')  # load the list once, not on every iteration
filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over the original word_list
    if word in stop_words:
        filtered_word_list.remove(word)  # remove word from the copy if it is a stopword
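The copy matters here: removing items from the same list you are looping over makes the loop skip elements. A small sketch of the difference, using a made-up two-word `stop_set` so it runs without NLTK:

```python
stop_set = {"the", "a"}  # hypothetical stand-in for the NLTK list
word_list = ["the", "a", "cat"]

# Buggy: iterating over and mutating the same list.
buggy = word_list[:]
for word in buggy:
    if word in stop_set:
        buggy.remove(word)  # shifts later items left, so the next one is skipped

# Safe: iterate over the original, remove from the copy.
safe = word_list[:]
for word in word_list:
    if word in stop_set:
        safe.remove(word)

print(buggy)  # ['a', 'cat']
print(safe)   # ['cat']
```

The safe version is still slower than a list comprehension on long lists, since each `remove` call scans the list.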

Solution 5

There's a very simple, lightweight Python package, stop-words, just for this purpose.

First install the package using: pip install stop-words

Then you can remove your words in one line using list comprehension:

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

This package is very lightweight to download (unlike nltk), works with both Python 2 and Python 3, and has stop words for many other languages, such as:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian


Author by

Alex

Updated on February 25, 2021

Comments

Alex over 3 years
So I have a dataset that I would like to remove stop words from using
```
stopwords.words('english')
```
I'm struggling with how to use this in my code to simply take out these words. I already have a list of the words from this dataset; the part I'm struggling with is comparing against this list and removing the stop words. Any help is appreciated.
- tumultous_rooster about 10 years
  
  Where did you get the stopwords from? Is this from NLTK?
- danodonovan about 9 years
  
  @MattO'Brien from nltk.corpus import stopwords for future googlers
- sffc almost 9 years
  
  It is also necessary to run nltk.download("stopwords") in order to make the stopword dictionary available.
- alvas almost 8 years
  
  See also stackoverflow.com/questions/19130512/stopword-removal-with-nltk
- anegru almost 5 years
  
  Note that a word like "not" is also considered a stopword in nltk. If you are doing something like sentiment analysis or spam filtering, a negation may change the entire meaning of a sentence, and if you remove it during preprocessing you might not get accurate results.
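One way to act on the point about negations is to subtract the words you want to keep from the stop set before filtering. This is only a sketch: `stop_set` is a small invented stand-in for `stopwords.words('english')`, and the `negations` set is an assumption you would tune for your own task.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
stop_set = {"this", "is", "not", "a", "no"}

# Words to keep even though they appear in the stop list (assumed set).
negations = {"not", "no", "nor", "never"}

sentiment_stop_set = stop_set - negations

tokens = "this is not a good film".split()
kept = [t for t in tokens if t not in sentiment_stop_set]

print(kept)  # ['not', 'good', 'film']
```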
Alex about 13 years

Thanks for both answers. They both work, although it seems I have a flaw in my code that prevents the stop list from working correctly. Should this be a new question? I'm not sure how things work around here just yet!
isakkarlsson over 10 years

To improve performance, consider stops = set(stopwords.words("english")) instead.
drevicko almost 8 years

this will be a whole lot slower than Daren Thomas's list comprehension...
David Dehghan over 7 years

Note: this converts the sentence to a SET which removes all the duplicate words and therefore you will not be able to use frequency counting on the result
Admin over 6 years

>>> import nltk
>>> nltk.download()
Alex almost 6 years

stopwords.words('english') are lower case. So make sure to use only lower case words in the list e.g. [w.lower() for w in word_list]
Robert over 4 years

If word_list is large, this code is very slow. It is better to convert the stopwords list to a set before using it: .. in set(stopwords.words('english')).
Led over 4 years

It's better to use stopwords.words("english") than to specify every word you need to remove.
Ujjwal over 4 years

converting to a set might remove viable information from the sentence by scraping multiple occurrences of an important word.
David Beauchemin over 4 years

Don't use this approach for French: l' and similar forms will not be captured.
rubencart about 4 years

I'm getting len(get_stop_words('en')) == 174 vs len(stopwords.words('english')) == 179
Роман Коптев almost 3 years

Iteration through a list is not efficient.