Python (NLTK) - more efficient way to extract noun phrases?
Solution 1
Take a look at "Why is my NLTK function slow when processing the DataFrame?". There's no need to iterate through all rows multiple times if you don't need the intermediate steps.
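The point about repeated passes can be sketched without NLTK. The stage functions below are hypothetical stand-ins (a whitespace tokenizer and a toy tagger), not NLTK calls; the sketch only illustrates that composing the stages into one function lets each row be visited once, with no corpus-wide intermediate lists.

```python
def tokenize(text):
    # Stand-in for nltk.word_tokenize: split on whitespace only.
    return text.split()

def tag(tokens):
    # Stand-in for nltk.pos_tag: capitalized tokens get "NNP", the rest "X".
    return [(tok, "NNP" if tok[:1].isupper() else "X") for tok in tokens]

def extract(tagged):
    # Stand-in for chunking: keep the tokens tagged "NNP".
    return [tok for tok, pos in tagged if pos == "NNP"]

def process(text):
    # All stages composed: one call per row.
    return extract(tag(tokenize(text)))

texts = ["a foo bar sentence with New York city",
         "another thingy with Bruce Wayne"]

# Multi-pass style: one full intermediate list per stage.
tokens = [tokenize(t) for t in texts]
tagged = [tag(t) for t in tokens]
multi_pass = [extract(t) for t in tagged]

# Single-pass style: each row goes through every stage once.
single_pass = [process(t) for t in texts]
```

Both styles produce the same result; the single-pass form is the one that plugs directly into df['text'].apply(...).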
With ne_chunk:
[code]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for subtree in chunked:
        if isinstance(subtree, Tree):
            # Collect adjacent chunk subtrees into one phrase.
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            # A non-chunk token ends the current run of chunks.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue
    # Flush a chunk that runs to the very end of the text.
    if current_chunk and " ".join(current_chunk) not in continuous_chunk:
        continuous_chunk.append(" ".join(current_chunk))
    return continuous_chunk

df = pd.DataFrame({'text': ['This is a foo, bar sentence with New York city.',
                            'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks(sent))
[out]:
0 [New York]
1 [Washington, Bruce Wayne]
Name: text, dtype: object
To use the custom RegexpParser:
[code]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

# Defining a grammar & Parser (raw string, so \w is not treated as a string escape)
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for subtree in chunked:
        if isinstance(subtree, Tree):
            # Collect adjacent chunk subtrees into one phrase.
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            # A non-chunk token ends the current run of chunks.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue
    # Flush a chunk that runs to the very end of the text.
    if current_chunk and " ".join(current_chunk) not in continuous_chunk:
        continuous_chunk.append(" ".join(current_chunk))
    return continuous_chunk

df = pd.DataFrame({'text': ['This is a foo, bar sentence with New York city.',
                            'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))
[out]:
0 [bar sentence, New York city]
1 [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object
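When the chunker is a RegexpParser, the noun phrases can also be read straight off the parse tree by filtering subtrees on their label. The sketch below assumes hand-tagged tokens (the tags are illustrative, standing in for pos_tag(word_tokenize(text))) so that it runs without downloading any tagger models; noun_phrases is a hypothetical helper name. Unlike get_continuous_chunks, it does not merge adjacent chunks or drop duplicates.

```python
from nltk import RegexpParser

# Same grammar as above, written as a raw string.
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)

def noun_phrases(tagged_tokens, parser=chunker):
    """Return the text of every NP subtree found in one tagged sentence."""
    tree = parser.parse(tagged_tokens)
    return [
        " ".join(token for token, pos in subtree.leaves())
        # Tree.label is a method, so it has to be called inside the filter.
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
    ]

# Hand-tagged tokens (assumed tags, in place of a real tagger's output).
tagged = [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("foo", "NN"), (",", ","),
          ("bar", "NN"), ("sentence", "NN"), ("with", "IN"),
          ("New", "NNP"), ("York", "NNP"), ("city", "NN"), (".", ".")]
phrases = noun_phrases(tagged)
```

On a DataFrame this would be applied the same way as before, with the tokenizing and tagging done inside the function passed to apply.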
Solution 2
I suggest referring to this prior thread: "Extracting all Nouns from a text file using nltk".
The answers there suggest TextBlob as the easiest way to do this, though not necessarily the most efficient in terms of processing, and the discussion addresses your question.
from textblob import TextBlob
txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
blob = TextBlob(txt)
print(blob.noun_phrases)
Silent-J
Updated on June 13, 2022
Comments
-
Silent-J almost 2 years
I've got a machine learning task involving a large amount of text data. I want to identify and extract noun phrases in the training text so I can use them for feature construction later in the pipeline. I've extracted the type of noun phrases I wanted, but I'm fairly new to NLTK, so I approached the problem by breaking each step down into a list comprehension, as you can see below.
But my real question is: am I reinventing the wheel here? Is there a faster way to do this that I'm not seeing?
import nltk
import pandas as pd

# Raw string, so the backslashes in the path are not read as escapes
myData = pd.read_excel(r"\User\train_.xlsx")
texts = myData['message']

# Defining a grammar & Parser
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)

tokens = [nltk.word_tokenize(i) for i in texts]
tag_list = [nltk.pos_tag(w) for w in tokens]
phrases = [chunkr.parse(sublist) for sublist in tag_list]
# Tree.label is a method, so it must be called in the filter
leaves = [[subtree.leaves() for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')] for tree in phrases]
Flatten the list of lists of lists of tuples that we've ended up with into just a list of lists of tuples:
leaves = [tupls for sublists in leaves for tupls in sublists]
Join the extracted terms into one bigram:
nounphrases = [unigram[0][0] + ' ' + unigram[1][0] for unigram in leaves]
-
Silent-J about 6 years
Fantastic answer! The links are super helpful as well. Thanks @alvas! Question: why do you write 'prev = None' when defining 'get_continuous_chunks'?
-
alvas about 6 years
Oh, that was a mistake; it's not necessary. I think I was using prev to check the history, but actually only current_chunk is needed for that. Thanks for catching that!
-
Daniel Vilas-Boas about 4 years
Hey @alvas, where did you come up with that regex (NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}")? Is this a standard pattern for noun phrase detection?
-
Silent-J about 3 years
Thanks for contributing to the discussion on this question! TextBlob definitely has advantages over the sometimes-bulky NLTK. However, your solution doesn't allow for customized parsing, which could ultimately be a stronger point in NLTK's favor.