Removing punctuation using spaCy; AttributeError
Solution 1
From what I can see, your main problem here is actually quite simple: n.lemma_ returns a string, not a Token object, so it doesn't have an is_punct attribute. I think what you were looking for here is n.is_punct (whether the token is punctuation).
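To make the difference concrete, here is a minimal sketch. It assumes spaCy is installed and uses a blank English pipeline (`spacy.blank("en")`) so that no model download is needed; a blank pipeline has no lemmatizer, so the demo filters on `n.text`, and the lemma-based version for a trained pipeline is shown as a comment. Note also that to drop both punctuation and pronouns the two conditions need to be combined with `and`, not `or` or `|`.

```python
import spacy

# Assumption: a blank pipeline is enough to demonstrate token flags.
nlp = spacy.blank("en")
doc = nlp("Hello, world!")

first = doc[0]
# lemma_ is a plain Python string, so it has no token attributes.
lemma_is_str = isinstance(first.lemma_, str)
lemma_has_flag = hasattr(first.lemma_, "is_punct")

# The flag lives on the Token itself.
kept = [n.text for n in doc if not n.is_punct]

# With a trained pipeline, the comprehension from the question would become:
# [n.lemma_ for n in doc if not n.is_punct and n.lemma_ != "-PRON-"]
```

Here `kept` holds the token texts with the comma and exclamation mark filtered out.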
If you want to do this more elegantly, check out spaCy's new custom processing pipeline components (requires v2.0+). This lets you wrap your logic in a function which is run automatically when you call nlp() on your text. You could even take this one step further and add a custom attribute to your Doc, for example doc._.my_stripped_doc or doc._.pd_columns or something. The advantage here is that you can keep using spaCy's performant, built-in data structures like the Doc (and its views Token and Span) as the "single source of truth" of your application. This way, no information is lost and you'll always keep a reference to the original document, which is also very useful for debugging.
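As a rough illustration of that idea, the sketch below registers a custom component that stores the non-punctuation tokens on a custom Doc attribute. It assumes the spaCy v3 registration style (`@Language.component` plus `nlp.add_pipe("name")`); in v2 the function itself was passed to `nlp.add_pipe`. The names `strip_punct` and `stripped_tokens` are made up for this example, and a blank English pipeline stands in for a trained one.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical attribute name; force=True allows re-running this cell.
Doc.set_extension("stripped_tokens", default=None, force=True)

@Language.component("strip_punct")  # hypothetical component name
def strip_punct(doc):
    # Store references to the kept tokens; the Doc itself is unchanged.
    doc._.stripped_tokens = [t for t in doc if not t.is_punct and not t.is_space]
    return doc

nlp = spacy.blank("en")  # assumption: swap in a trained pipeline as needed
nlp.add_pipe("strip_punct")

doc = nlp("Hello, world!")
stripped_texts = [t.text for t in doc._.stripped_tokens]
```

Because the component only adds an attribute, `doc.text` still returns the original string, which is the "single source of truth" property described above.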
Solution 2
Since you are already using spaCy, you can use the following pattern to remove punctuation:
# "column_name" is the column you want to remove punctuation from
df["new_column"] = df["column_name"].apply(lambda text:
" ".join(token.lemma_ for token in nlp(text)
if not token.is_punct))
df["puncfree"] = df.review.apply(lambda text:
" ".join(token.lemma_ for token in nlp(text)
if not token.is_punct))
For convenience and clarity, the second snippet is the code I actually used to remove punctuation; "review" is the name of the column I wanted to remove punctuation from.
Solution 3
Building off @khawaja-fahad-shafi's answer, I created the following pattern, where data is a pandas DataFrame and text is a field of strings. I originally posted this as a comment, not an answer, but the formatting was off. Hope it's helpful for someone.
import pandas
import spacy
nlp = spacy.load("en_core_web_md")
# Lemmatize each text, keeping only tokens that pass every filter.
(data["text"]
    .apply(lambda text: " ".join(
        token.lemma_ for token in nlp(text)
        if not token.is_punct
        and not token.is_currency
        and not token.is_digit
        and not token.is_oov
        and not token.is_space
        and not token.is_stop
        and not token.like_num
        and token.pos_ != "PROPN")))
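One possible way to make this pattern easier to maintain is to move the filters into a named predicate and to store the result in a new column. The sketch below is only an illustration under assumptions: it uses a blank English pipeline so it runs without a model download, which means the `is_oov` and `pos_` checks are left out (a blank pipeline has no vectors or tagger) and `token.text` is joined instead of `token.lemma_`. The helper names `keep_token` and `clean`, the column `clean_text`, and the sample rows are hypothetical. It also uses `nlp.pipe`, as in the question, to process the texts as a batch.

```python
import pandas as pd
import spacy

nlp = spacy.blank("en")  # assumption: replace with a trained pipeline for lemmas

def keep_token(token):
    """Return True for tokens that pass every lexical filter."""
    return not (token.is_punct or token.is_space or token.is_stop
                or token.is_currency or token.is_digit or token.like_num)

def clean(doc):
    """Join the kept tokens of a Doc into one string."""
    return " ".join(token.text for token in doc if keep_token(token))

# Made-up sample data for illustration only.
data = pd.DataFrame({"text": ["the cats cost $20!", "we bought 2 dogs."]})
data["clean_text"] = [clean(doc) for doc in nlp.pipe(data["text"].tolist())]
```

With a trained pipeline, `token.text` can be switched back to `token.lemma_` and the `is_oov` and `pos_` conditions can be added to `keep_token`.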
Comments
-
LMGagne about 2 years
Currently I'm using the following code to lemmatize and calculate TF-IDF values for some text data using spaCy:
lemma = []
for doc in nlp.pipe(df['col'].astype('unicode').values, batch_size=9844, n_threads=3):
    if doc.is_parsed:
        lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct | n.lemma_ != "-PRON-"])
    else:
        lemma.append(None)
df['lemma_col'] = lemma
vect = sklearn.feature_extraction.text.TfidfVectorizer()
lemmas = df['lemma_col'].apply(lambda x: ' '.join(x))
vect = sklearn.feature_extraction.text.TfidfVectorizer()
features = vect.fit_transform(lemmas)
feature_names = vect.get_feature_names()
dense = features.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)
df = pd.DataFrame(denselist, columns=feature_names)
lemmas = pd.concat([lemmas, df])
df= pd.concat([df, lemmas])
I need to strip out proper nouns, punctuation, and stop words but am having some trouble doing that within my current code. I've read some documentation and other resources, but am now running into an error:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-21-e924639f7822> in <module>()
      7     if doc.is_parsed:
      8         tokens.append([n.text for n in doc])
----> 9         lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct or n.lemma_ != "-PRON-"])
     10         pos.append([n.pos_ for n in doc])
     11     else:
<ipython-input-21-e924639f7822> in <listcomp>(.0)
      7     if doc.is_parsed:
      8         tokens.append([n.text for n in doc])
----> 9         lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct or n.lemma_ != "-PRON-"])
     10         pos.append([n.pos_ for n in doc])
     11     else:
AttributeError: 'str' object has no attribute 'is_punct'
Is there an easier way to strip this stuff out of the text, without having to drastically change my approach?
Full code available here.
-
Soumendra over 5 years
Is there any way I can remove punctuation from the middle of a spaCy Token? For example, given the token "hello-world", how can I convert it into "hello world"? I have referred to spacy.io/api/token#attributes