How to remove stop words using NLTK or Python

227,837

Solution 1

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
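Several comments on this answer suggest building a set once and lowercasing the words before the lookup. A minimal sketch of that idea follows. The helper name remove_stop_words and the short sample_stops list are made up for illustration; in practice you would pass stopwords.words('english') after running nltk.download('stopwords').

```python
def remove_stop_words(words, stop_words):
    """Return the words not found in stop_words, keeping order and duplicates."""
    stops = set(stop_words)  # build the set once; membership tests are fast
    # Lowercase each word for the comparison only; the original casing is kept.
    return [w for w in words if w.lower() not in stops]


# Stand-in stop list (an assumption for this sketch, not the NLTK list).
sample_stops = ["the", "is", "a", "on"]
word_list = ["The", "cat", "is", "on", "the", "mat"]

filtered_words = remove_stop_words(word_list, sample_stops)
print(filtered_words)
```

Calling stopwords.words('english') inside the comprehension re-reads the list for every word, which is why hoisting it into a set tends to help on large inputs.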

Solution 2

To exclude all types of stop words, including the nltk stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = set(get_stop_words('en'))         # stop-words package list (size varies by version)
stop_words.update(stopwords.words('english'))  # add the nltk list (size varies by version)

output = [w for w in word_list if w not in stop_words]
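How much the two lists actually differ depends on the installed versions, so it may be worth checking before merging them. The sketch below uses two short stand-in lists (assumptions, not the real contents of either package) to show a set union and the words unique to each side; swap in get_stop_words('en') and stopwords.words('english') to compare the real lists.

```python
# Stand-in lists for illustration only.
package_words = ["a", "the", "is", "whereas"]
nltk_words = ["a", "the", "is", "ain"]

combined = set(package_words) | set(nltk_words)         # union drops duplicates
only_in_package = set(package_words) - set(nltk_words)  # words nltk lacks
only_in_nltk = set(nltk_words) - set(package_words)     # words the package lacks

print(len(combined), sorted(only_in_package), sorted(only_in_nltk))
```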

Solution 3

You could also do a set diff, for example:

import nltk
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
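Keep in mind that a set difference discards duplicates and word order, which matters if you want frequency counts afterwards. A small sketch of the difference, using a made-up token list and a stand-in stop set rather than the nltk data:

```python
from collections import Counter

stops = {"the", "is", "on"}  # stand-in stop set for illustration
tokens = ["the", "cat", "is", "on", "the", "mat", "cat"]

unique_only = set(tokens) - stops             # duplicates and order are lost
kept = [t for t in tokens if t not in stops]  # duplicates and order survive
counts = Counter(kept)                        # frequencies are still available

print(unique_only, kept, counts)
```

If you only need the vocabulary, the set difference is fine; for counting, filter the list instead.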

Solution 4

I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))  # build the lookup set once
filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over the original list
    if word in stops:
        filtered_word_list.remove(word)  # remove the first remaining occurrence of the stop word

Solution 5

There's a very simple, lightweight Python package, stop-words, made just for this.

First install the package using: pip install stop-words

Then you can remove your words in one line using list comprehension:

from stop_words import get_stop_words

stop_words = set(get_stop_words('english'))  # load the list once, not once per word
filtered_words = [word for word in dataset if word not in stop_words]

This package is very lightweight to download (unlike nltk), works with both Python 2 and Python 3, and has stop words for many languages, including:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian
Author by

Alex

Updated on February 25, 2021

Comments

  • Alex
    Alex over 3 years

    So I have a dataset that I would like to remove stop words from using

    stopwords.words('english')
    

    I'm struggling with how to use this in my code to simply take out these words. I already have a list of the words from this dataset; the part I'm struggling with is comparing against this list and removing the stop words. Any help is appreciated.

    • tumultous_rooster
      tumultous_rooster about 10 years
      Where did you get the stopwords from? Is this from NLTK?
    • danodonovan
      danodonovan about 9 years
      @MattO'Brien from nltk.corpus import stopwords for future googlers
    • sffc
      sffc almost 9 years
      It is also necessary to run nltk.download("stopwords") in order to make the stopword dictionary available.
    • anegru
      anegru almost 5 years
      Note that a word like "not" is also considered a stopword in nltk. If you are doing something like sentiment analysis or spam filtering, a negation may change the entire meaning of the sentence, and if you remove it during preprocessing you might not get accurate results.
  • Alex
    Alex about 13 years
    Thanks for both answers. They both work, although it seems I have a flaw in my code that prevents the stop list from working correctly. Should this be a new question post? I'm not sure how things work around here just yet!
  • isakkarlsson
    isakkarlsson over 10 years
    To improve performance, consider stops = set(stopwords.words("english")) instead.
  • drevicko
    drevicko almost 8 years
    This will be a whole lot slower than Daren Thomas's list comprehension...
  • David Dehghan
    David Dehghan over 7 years
    Note: this converts the sentence to a set, which removes all duplicate words, so you will not be able to do frequency counting on the result.
  • Admin
    Admin over 6 years
    >>> import nltk
    >>> nltk.download()
  • Alex
    Alex almost 6 years
    The words in stopwords.words('english') are lower case, so make sure the words in your list are lower case too, e.g. [w.lower() for w in word_list]
  • Robert
    Robert over 4 years
    If word_list is large, this code is very slow. It is better to convert the stopwords list to a set before using it: ... in set(stopwords.words('english')).
  • Led
    Led over 4 years
    It's better to use stopwords.words("english") than to specify every word you need to remove.
  • Ujjwal
    Ujjwal over 4 years
    Converting to a set might remove useful information from the sentence by dropping repeated occurrences of an important word.
  • David Beauchemin
    David Beauchemin over 4 years
    Don't use this approach for French, or else elided forms such as l' will not be captured.
  • rubencart
    rubencart about 4 years
    I'm getting len(get_stop_words('en')) == 174 vs len(stopwords.words('english')) == 179
  • Роман Коптев
    Роман Коптев almost 3 years
    Iteration through a list is not efficient.
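
One of the comments above points out that negations such as "not" count as stop words, which can hurt tasks like sentiment analysis. A minimal sketch of one way around that is to subtract a keep-list from the stop set before filtering. Both the stand-in stop set and the negations keep-list below are assumptions for illustration; with nltk you would start from set(stopwords.words('english')).

```python
# Stand-in stop set for illustration; not the real nltk list.
stops = {"the", "is", "not", "no", "a"}
# Hypothetical keep-list of words that should survive filtering.
negations = {"not", "no", "nor", "never"}

stops_keeping_negations = stops - negations  # negations are no longer stop words

words = ["the", "movie", "is", "not", "good"]
filtered = [w for w in words if w not in stops_keeping_negations]
print(filtered)
```

Which words belong on the keep-list depends on the task, so treat the one above as a starting point only.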