How to remove stop words using NLTK or Python
Solution 1
from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
Solution 2
To exclude all types of stop words, including NLTK's stop words, you could do something like this:
from stop_words import get_stop_words
from nltk.corpus import stopwords
stop_words = list(get_stop_words('en'))  # stop_words English list (about 174 words)
nltk_words = list(stopwords.words('english'))  # nltk English list (about 179 words)
stop_words.extend(nltk_words)
output = [w for w in word_list if w not in stop_words]
Solution 3
You could also do a set diff, for example:
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
Solution 4
I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:
from nltk.corpus import stopwords

filtered_word_list = word_list[:]  # make a copy of the word_list
for word in word_list:  # iterate over word_list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove word from filtered_word_list if it is a stopword
Solution 5
There's a very simple, lightweight Python package, stop-words, made just for this purpose.
First, install the package:
pip install stop-words
Then you can remove your words in one line using list comprehension:
from stop_words import get_stop_words
filtered_words = [word for word in dataset if word not in get_stop_words('english')]
This package is very lightweight to download (unlike nltk), works for both Python 2 and Python 3, and it has stop words for many other languages, such as:
Arabic
Bulgarian
Catalan
Czech
Danish
Dutch
English
Finnish
French
German
Hungarian
Indonesian
Italian
Norwegian
Polish
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish
Ukrainian
Alex
Updated on February 25, 2021

Comments
-
Alex over 3 years: So I have a dataset from which I would like to remove stop words using
stopwords.words('english')
I'm struggling with how to use this within my code to simply take out these words. I already have a list of the words from this dataset; the part I'm struggling with is comparing to this list and removing the stop words. Any help is appreciated.
-
tumultous_rooster about 10 years: Where did you get the stopwords from? Is this from NLTK?
-
danodonovan about 9 years: @MattO'Brien
from nltk.corpus import stopwords
for future Googlers.
-
sffc almost 9 years: It is also necessary to run
nltk.download("stopwords")
in order to make the stopword dictionary available.
-
anegru almost 5 years: Note that a word like "not" is also considered a stopword in nltk. If you are doing something like sentiment analysis or spam filtering, a negation may change the entire meaning of the sentence, and if you remove it in the preprocessing phase you might not get accurate results.
-
Alex about 13 years: Thanks for both answers; they both work, although it seems I have a flaw in my code preventing the stop list from working correctly. Should this be a new question post? Not sure how things work around here just yet!
-
isakkarlsson over 10 years: To improve performance, consider
stops = set(stopwords.words("english"))
instead.
-
drevicko almost 8 years: This will be a whole lot slower than Daren Thomas's list comprehension...
-
David Dehghan over 7 years: Note that this converts the sentence to a set, which removes all duplicate words, so you will not be able to do frequency counting on the result.
-
Admin over 6 years:
>>> import nltk
>>> nltk.download()
-
Alex almost 6 years: The words in
stopwords.words('english')
are lower case, so make sure to use only lower-case words in your list, e.g. [w.lower() for w in word_list].
-
Robert over 4 years: If
word_list
is large, this code is very slow. It is better to convert the stopwords to a set before using them: ... in set(stopwords.words('english')).
-
Led over 4 years: It's better to use stopwords.words("english") than to specify every word you need to remove yourself.
-
Ujjwal over 4 years: Converting to a set might remove viable information from the sentence by collapsing multiple occurrences of an important word.
-
David Beauchemin over 4 years: Don't use this approach for French; contractions like l' will not be captured.
-
rubencart about 4 years: I'm getting
len(get_stop_words('en')) == 174
vs.
len(stopwords.words('english')) == 179
-
Роман Коптев almost 3 years: Iterating through a list is not efficient.