How to filter tokens from a spaCy document


Solution 1

You have probably found a solution by now, but since none is posted here, I thought it would be useful to add one.

You can remove tokens by converting the doc to a NumPy array, deleting the unwanted rows from the array, and then loading the result into a new doc.

Code:

import spacy
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc
import numpy

def remove_tokens_on_match(doc):
    # Collect the positions of the tokens to drop
    indexes = []
    for index, token in enumerate(doc):
        if token.pos_ in ('PUNCT', 'NUM', 'SYM'):
            indexes.append(index)
    # Export the annotations (one row per token) and delete the matching rows
    np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
    np_array = numpy.delete(np_array, indexes, axis=0)
    # Build a new Doc from the remaining words and load the annotations back
    doc2 = Doc(doc.vocab, words=[t.text for i, t in enumerate(doc) if i not in indexes])
    doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)
    return doc2

# Load the English model (the 'en' shortcut was removed in spaCy v3; use the full package name)
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This document is only an example. \
I would like to create a custom pipeline that will remove specific tokens from \
the final document.')
print(remove_tokens_on_match(doc))

You can also look at a similar question that I answered here.
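For comparison, here is a minimal sketch of the same idea without NumPy. It is not the code from this answer: `filter_doc` and `DROP_POS` are hypothetical names, the input is hand-annotated with `spacy.blank("en")` so it runs without a trained model, and only the coarse POS tag is copied. Other annotations would have to be copied explicitly as well, and dependency heads cannot simply be carried over when tokens disappear.

```python
import spacy
from spacy.tokens import Doc

DROP_POS = {"PUNCT", "NUM", "SYM"}  # example rule, same as in the question

def filter_doc(doc, keep):
    """Return a new Doc containing only the tokens for which keep(token) is true."""
    kept, spaces = [], []
    for tok in doc:
        if keep(tok):
            kept.append(tok)
            spaces.append(bool(tok.whitespace_))
        elif spaces and tok.whitespace_:
            # The dropped token carried the whitespace; give it to the previous kept token
            spaces[-1] = True
    new_doc = Doc(doc.vocab, words=[t.text for t in kept], spaces=spaces)
    for old, new in zip(kept, new_doc):
        new.pos = old.pos  # copy the integer ID of the coarse-grained POS tag
    return new_doc

# Assumption: a blank pipeline with manually assigned POS tags stands in for a real model
nlp = spacy.blank("en")
doc = nlp("Hello, world!")
for tok, pos in zip(doc, ["INTJ", "PUNCT", "NOUN", "PUNCT"]):
    tok.pos_ = pos

filtered = filter_doc(doc, lambda tok: tok.pos_ not in DROP_POS)
print([(tok.text, tok.pos_) for tok in filtered])
print(filtered.text)
```

Passing `spaces` to the `Doc` constructor keeps the text readable: without it, every token is assumed to be followed by a space.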

Solution 2

Depending on what you want to do, there are several approaches.

1. Get the original Document

Tokens in spaCy hold a reference to the document they belong to, so you can do this:

original_doc = final_tokens[0].doc

This way you can still get PoS, parse data etc. from the original sentence.
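As a small illustration, the sketch below checks that the reference points back to the very same `Doc` object and that each kept token still knows its original position through `tok.i`. It assumes a blank English pipeline and uses `is_punct` as a stand-in filter rule so that no trained model is needed. Note that `final_tokens[0]` fails with an `IndexError` if the filter removed every token.

```python
import spacy

# Assumption: tokenizer-only pipeline; is_punct replaces the POS-based rule
nlp = spacy.blank("en")
doc = nlp("Hello, world!")

final_tokens = [tok for tok in doc if not tok.is_punct]

# Every token points back to the Doc it came from
original_doc = final_tokens[0].doc
print(original_doc is doc)

# tok.i is the token's index in the original, unfiltered document
positions = [(tok.text, tok.i) for tok in final_tokens]
print(positions)
```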

2. Construct a new document without the removed tokens

You can concatenate the text of each remaining token together with its trailing whitespace and create a new document from the result. See the token docs for information on text_with_ws.

doc = nlp(''.join(map(lambda x: x.text_with_ws, final_tokens)))

This is probably not going to give you what you want though - PoS tags will not necessarily be the same, and the resulting sentence may not make sense.
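One more thing to watch for, shown in the sketch below: a word that was directly followed by a removed punctuation token has no trailing whitespace of its own, so joining on `text_with_ws` can glue it to the next word. The sketch assumes a blank English pipeline and `is_punct` as the filter rule. Joining the plain texts with a single space is one possible workaround, at the cost of losing the original spacing.

```python
import spacy

# Assumption: tokenizer-only pipeline, so no model download is required
nlp = spacy.blank("en")
doc = nlp("This is an example. It works.")
final_tokens = [tok for tok in doc if not tok.is_punct]

# "example" was followed by "." and carries no whitespace itself
glued = "".join(tok.text_with_ws for tok in final_tokens)
print(glued)

# Alternative: ignore the original whitespace and insert single spaces
spaced = " ".join(tok.text for tok in final_tokens)
print(spaced)

# Re-running the pipeline on the new text yields a fresh Doc
new_doc = nlp(spaced)
print([tok.text for tok in new_doc])
```

With the first join, "example" and "It" end up as one string, so the re-tokenized document would contain a token that never existed in the original.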

If neither of those was what you had in mind, let me know and maybe I can help.

Author: Kon Pal

Updated on June 13, 2022

Comments

  • Kon Pal almost 2 years

    I would like to parse a document using spaCy and apply a token filter so that the final spaCy document does not include the filtered tokens. I know that I can take the filtered sequence of tokens, but I am interested in having the actual Doc structure.

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

    text = u"This document is only an example. " \
        "I would like to create a custom pipeline that will remove specific tokens from the final document."
    
    doc = nlp(text)
    
    def keep_token(tok):
        # This is only an example rule
        return tok.pos_ not in {'PUNCT', 'NUM', 'SYM'}
    
    final_tokens = list(filter(keep_token, doc))
    
    # How to get a spacy.Doc from final_tokens?
    

    I tried to reconstruct a new spaCy Doc from the list of tokens, but the API documentation does not make it clear how to do that.