Add/remove custom stop words with spacy


Solution 1

You can edit them before processing your text like this (see this post):

>>> import spacy
>>> nlp = spacy.load("en")
>>> nlp.vocab["the"].is_stop = False
>>> nlp.vocab["definitelynotastopword"].is_stop = True
>>> sentence = nlp("the word is definitelynotastopword")
>>> sentence[0].is_stop
False
>>> sentence[3].is_stop
True

Note: This seems to work in spaCy versions up to v1.8. For newer versions, see the other answers.

Solution 2

With spaCy 2.0.11, you can update its stop word set in any of the following ways:

To add a single stopword:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")

To add several stopwords at once:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1", "my_new_stopword2"}

To remove a single stopword:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")

To remove several stopwords at once:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words -= {"whatever", "whenever"}
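The |= and -= operators used above are ordinary Python set operations that modify the set in place. A small sketch on a plain set (no spaCy needed; the variable names and words are only illustrative) shows the behaviour, including the difference between remove and discard:

```python
# A stand-in for the stop word set; the contents are illustrative only.
stop_words = {"the", "whatever", "whenever"}
alias = stop_words  # second reference to the same set object

# In-place union: adds several words at once.
stop_words |= {"my_new_stopword1", "my_new_stopword2"}

# In-place difference: removes several words at once.
stop_words -= {"whatever", "whenever"}

# remove() raises KeyError for a missing word; discard() does not.
stop_words.discard("not_present")
try:
    stop_words.remove("not_present")
    raised = False
except KeyError:
    raised = True

print(sorted(stop_words))
```

Because the operators mutate the existing set, every reference to it (here, alias) sees the change.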

Note: To see the current set of stopwords, use:

print(nlp.Defaults.stop_words)

Update: As noted in the comments, this change only affects the current execution. To persist it with the model, you can use the methods nlp.to_disk("/path") and nlp.from_disk("/path") (described further at https://spacy.io/usage/saving-loading).
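Several comments report that changing nlp.Defaults.stop_words alone can leave token.is_stop unchanged. A minimal sketch that updates both the defaults set and the flag on the vocabulary entry, assuming a spaCy version that provides spacy.blank("en"); the helper name set_stop_words is hypothetical and not part of spaCy:

```python
import spacy


def set_stop_words(nlp, add=(), remove=()):
    """Hypothetical helper: keep the defaults set and the lexeme flags in sync."""
    for word in add:
        nlp.Defaults.stop_words.add(word)
        nlp.vocab[word].is_stop = True
    for word in remove:
        # discard() avoids a KeyError if the word was never a stop word
        nlp.Defaults.stop_words.discard(word)
        nlp.vocab[word].is_stop = False


# Assumption: a tokenizer-only English pipeline is enough to demonstrate this.
nlp = spacy.blank("en")
set_stop_words(nlp, add={"btw"}, remove={"the"})

doc = nlp("btw the word changed")
flags = [(token.text, token.is_stop) for token in doc]
print(flags)
```

The flag is stored per exact string, so other spellings such as "The" or "BTW" are separate vocabulary entries and may need the same treatment.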

Solution 3

For version 2.0 I used this:

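import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en")

print(STOP_WORDS)  # set of spaCy's default stop words

STOP_WORDS.add("your_additional_stop_word_here")

# Flag every stop word on its vocabulary entry so that token.is_stop reflects it
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True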

This loads all stop words into a set.

You can add your own stop words to STOP_WORDS, or use your own list in the first place.
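If you go with your own list, one option is to skip is_stop entirely and filter tokens against that list. A rough sketch, assuming spacy.blank("en") is available; the word list and sentence are made up for illustration:

```python
import spacy

# Assumption: a tokenizer-only pipeline is sufficient for filtering by text.
nlp = spacy.blank("en")

# Your own stop list, independent of spaCy's defaults (illustrative contents).
custom_stop_words = {"btw", "this", "is", "the"}

doc = nlp("btw this is the best example")

# Compare on the lowercased token text so capitalised forms are caught too.
kept = [token.text for token in doc if token.lower_ not in custom_stop_words]
print(kept)
```

This leaves spaCy's own stop word set and vocabulary untouched, which can be convenient when different parts of a program need different lists.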

Solution 4

For 2.0 use the following:

import spacy
nlp = spacy.load("en")
# Re-flag each default stop word on its vocabulary entry
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True

Solution 5

This collects the stop words too :)

import spacy.lang.en.stop_words
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
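Since STOP_WORDS is a regular Python set, it can be inspected with the usual set tools. A short sketch, assuming spaCy 2.0 or later is installed (the variable names are only illustrative):

```python
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS

total = len(spacy_stopwords)          # how many stop words are defined
sample = sorted(spacy_stopwords)[:5]  # a few entries in alphabetical order
has_and = "and" in spacy_stopwords    # membership test for a single word

print(total, sample, has_and)
```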

Author: E.K.

Updated on July 08, 2022

Comments

  • E.K.
    E.K. almost 2 years

    What is the best way to add/remove stop words with spaCy? I am using the token.is_stop attribute and would like to make some custom changes to the set. I looked at the documentation but could not find anything regarding stop words. Thanks!

    • Xeoncross
      Xeoncross over 6 years
      The complete list: from spacy.en.word_sets import STOP_WORDS
  • E.K.
    E.K. over 7 years
    Ah nice. Thank you!
  • E.K.
    E.K. over 6 years
    This solution does not seem to be working anymore with version 1.9.0? I am getting TypeError: an integer is required
  • user1025852
    user1025852 over 6 years
    did that with version 2.0 and got "ImportError: No module named en.stop_words"...suggestions?
  • Eb Abadi
    Eb Abadi over 6 years
    @E.K. the error occurs because the vocab input word should be unicode (use u"the" instead of "the")
  • lucid_dreamer
    lucid_dreamer almost 6 years
    You are showing how to fix a broken model as per this bug/workaround. Whilst it is easy to adapt this to the OP's needs, you could have expanded on why you are writing the code this way: it is currently required because of the bug, but it's an otherwise redundant step, as lex.is_stop should already be True in the bug-free future.
  • Romain
    Romain over 5 years
    @AustinT It is syntactic sugar for the union of two sets: a |= b updates a in place and leaves it with the same contents as a = a.union(b). Similarly, the -= operator performs a set difference. The curly-bracket syntax creates a set directly, a = {1,2,3} being equivalent to a = set([1,2,3]).
  • fny
    fny over 4 years
    This doesn't actually affect the model.
  • fny
    fny over 4 years
    I mean that it actually doesn't seem to affect the current execution either. (Maybe I'm running something out of order.) The other method seems foolproof.
  • Toby
    Toby almost 4 years
    I concur with @fny. While this adds the stop words to nlp.Defaults.stop_words, if you check that word with token.is_stop, you still get False.