WordNet lemmatization and POS tagging in Python


Solution 1

First of all, you can use nltk.pos_tag() directly without training it. The function will load a pretrained tagger from a file. You can see the file name with nltk.tag._POS_TAGGER:

>>> nltk.tag._POS_TAGGER
'taggers/maxent_treebank_pos_tagger/english.pickle'

As it was trained with the Treebank corpus, it also uses the Treebank tag set.

The following function maps the Treebank tags to WordNet part-of-speech names:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

You can then use the return value with the lemmatizer:

from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('going', wordnet.VERB)
'go'

Check the return value before passing it to the lemmatizer, because an empty string would raise a KeyError.
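As an aside, the if/elif chain above can be collapsed into a dictionary lookup on the tag's first letter. This is a sketch, not part of the original answer; it uses the plain strings 'a', 'v', 'n', 'r', which are the values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV, so no NLTK import is needed for the mapping itself:

```python
# Sketch: map a Treebank tag to the matching WordNet POS string by its
# first letter. Returns None (rather than '') for tags WordNet does not
# cover, so callers can simply skip the pos argument in that case.

def treebank_to_wordnet(treebank_tag):
    prefix_map = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
    return prefix_map.get(treebank_tag[:1])

print(treebank_to_wordnet('VBD'))  # 'v'
print(treebank_to_wordnet('PRP'))  # None -- no KeyError from an empty string
```

The return value then plugs straight into lemmatize(word, pos=...) when it is not None.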

Solution 2

Steps to convert : Document->Sentences->Tokens->POS->Lemmas

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# example text
text = 'What can I say about this place. The staff of these restaurants is nice and the eggplant is not bad'

class Splitter(object):
    """
    split the document into sentences and tokenize each sentence
    """
    def __init__(self):
        self.splitter = nltk.data.load('tokenizers/punkt/english.pickle')
        self.tokenizer = nltk.tokenize.TreebankWordTokenizer()

    def split(self,text):
        """
        out : ['What', 'can', 'I', 'say', 'about', 'this', 'place', '.']
        """
        # split into single sentence
        sentences = self.splitter.tokenize(text)
        # tokenization in each sentences
        tokens = [self.tokenizer.tokenize(sent) for sent in sentences]
        return tokens


class LemmatizationWithPOSTagger(object):
    def __init__(self):
        pass
    def get_wordnet_pos(self,treebank_tag):
        """
        return the WordNet POS tag (a, n, r, v) that the Treebank tag maps to, for WordNet lemmatization
        """
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            # As default pos in lemmatization is Noun
            return wordnet.NOUN

    def pos_tag(self,tokens):
        # find the POS tag for each token: [('What', 'WP'), ('can', 'MD'), ('I', 'PRP'), ...
        pos_tokens = [nltk.pos_tag(token) for token in tokens]

        # lemmatization using the POS tag:
        # convert into a feature set of (original word, lemmatized word, [POS tag]),
        # e.g. [('What', 'What', ['WP']), ('can', 'can', ['MD']), ...]
        pos_tokens = [[(word, lemmatizer.lemmatize(word, self.get_wordnet_pos(pos_tag)), [pos_tag])
                       for (word, pos_tag) in pos] for pos in pos_tokens]
        return pos_tokens

lemmatizer = WordNetLemmatizer()
splitter = Splitter()
lemmatization_using_pos_tagger = LemmatizationWithPOSTagger()

#step 1 split document into sentence followed by tokenization
tokens = splitter.split(text)

#step 2 lemmatization using pos tagger 
lemma_pos_token = lemmatization_using_pos_tagger.pos_tag(tokens)
print(lemma_pos_token)
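To illustrate the shape of the result without loading any NLTK models, here is a toy run. stub_lemmatize is a hypothetical stand-in added for this sketch; the real pipeline calls WordNetLemmatizer().lemmatize(word, pos) instead:

```python
# Toy illustration of the (word, lemma, [tag]) triples the pipeline produces.

def wordnet_pos(treebank_tag, default='n'):
    # Same first-letter mapping as get_wordnet_pos, defaulting to noun.
    return {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}.get(treebank_tag[:1], default)

def stub_lemmatize(word, pos):
    # Hypothetical stand-in for WordNetLemmatizer().lemmatize(word, pos).
    return {('is', 'v'): 'be', ('restaurants', 'n'): 'restaurant'}.get((word, pos), word)

tagged_sentence = [('The', 'DT'), ('staff', 'NN'), ('is', 'VBZ'), ('nice', 'JJ')]
triples = [(w, stub_lemmatize(w, wordnet_pos(t)), [t]) for w, t in tagged_sentence]
print(triples)
# [('The', 'The', ['DT']), ('staff', 'staff', ['NN']), ('is', 'be', ['VBZ']), ('nice', 'nice', ['JJ'])]
```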

Solution 3

As shown in the source code of nltk.corpus.reader.wordnet (http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html):

#{ Part-of-speech constants
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
#}
POS_LIST = [NOUN, VERB, ADJ, ADV]
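Since these constants are plain one-character strings, validating a tag before calling the lemmatizer is just a membership test. A sketch mirroring the constants above, without importing NLTK:

```python
# The WordNet POS constants are plain strings, restated here locally.
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]

def safe_pos(tag, default=NOUN):
    # Fall back to NOUN (the lemmatizer's own default) for anything
    # outside POS_LIST.
    return tag if tag in POS_LIST else default

print(safe_pos('v'))  # 'v'
print(safe_pos('x'))  # 'n'
```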

Solution 4

You can create a map using the Python defaultdict and take advantage of the fact that the lemmatizer's default tag is a noun.

from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict

tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

text = "Another way of achieving this task"
tokens = word_tokenize(text)
lmtzr = WordNetLemmatizer()

for token, tag in pos_tag(tokens):
    lemma = lmtzr.lemmatize(token, tag_map[tag[0]])
    print(token, "=>", lemma)
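For clarity, here is the defaultdict fallback in isolation, as a sketch using the plain tag strings ('n', 'a', 'v', 'r' are the values of wn.NOUN, wn.ADJ, wn.VERB, wn.ADV), so no NLTK import is needed:

```python
from collections import defaultdict

# Any first letter without an explicit entry falls through to 'n',
# matching the lemmatizer's noun default.
tag_map = defaultdict(lambda: 'n')
tag_map['J'] = 'a'
tag_map['V'] = 'v'
tag_map['R'] = 'r'

print(tag_map['V'])  # 'v' (verb tags such as VBD, VBG)
print(tag_map['P'])  # 'n' (e.g. PRP has no WordNet tag, so noun)
```

This is why Solution 4 never hits the KeyError from Solution 1: there is no empty-string case, only a noun fallback.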

Solution 5

@Suzana_K's solution works, but some cases result in a KeyError, as @ClockSlave mentioned.

Convert Treebank tags to WordNet tags:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None # for easy if-statement 

Now, we pass a pos argument to the lemmatize function only if we have a WordNet tag:

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("I'm loving it.")  # any list of word tokens
tagged = nltk.pos_tag(tokens)
for word, tag in tagged:
    wntag = get_wordnet_pos(tag)
    if wntag is None:  # do not supply a tag in case of None
        lemma = lemmatizer.lemmatize(word)
    else:
        lemma = lemmatizer.lemmatize(word, pos=wntag)
Author: user1946217

Updated on July 05, 2022

Comments

  • user1946217 almost 2 years

    I wanted to use the WordNet lemmatizer in Python, and I have learnt that the default POS tag is NOUN and that it does not output the correct lemma for a verb unless the POS tag is explicitly specified as VERB.

    My question is: what is the best approach in order to perform the above lemmatization accurately?

    I did the POS tagging using nltk.pos_tag, and I am lost in integrating the Treebank POS tags with WordNet-compatible POS tags. Please help.

    from nltk.stem.wordnet import WordNetLemmatizer
    lmtzr = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)
    

    I get the output tags as NN, JJ, VB, RB. How do I change these to WordNet-compatible tags?

    Also, do I have to train nltk.pos_tag() with a tagged corpus, or can I use it directly on my data?

  • alvas about 11 years
    remember also satellite adjectives =) ADJ_SAT = 's' wordnet.princeton.edu/wordnet/man/wngloss.7WN.html
  • mPrinC over 7 years
    Or more generally: from nltk.corpus import wordnet; print wordnet._FILEMAP;
  • Clock Slave about 7 years
    the pos tag for 'it' in the "I'm loving it." string is 'PRP'. The function returns an empty string which the lemmatizer doesn't accept and throws a KeyError. What can be done in that case?
  • Ksofiac almost 7 years
    Does anyone know how efficient this is when processing entire documents?
  • zwep over 6 years
    I would rather use something like... treebank_tag[0].lower() as input for the lemmatizer pos-tag. In most cases this covers the conversion, except with the ADJ. But this can then be a simple if statement
  • Suzana about 6 years
    @ClockSlave: Don't put empty strings into the lemmatizer.
  • pragMATHiC almost 6 years
    to make this answer self-contained, remember import wn: from nltk.corpus import wordnet as wn
  • Simon Hessner almost 6 years
    Why is ADJ_SAT not represented in POS_LIST? What are examples of ADJ_SAT adjectives?
  • Simon Hessner almost 6 years
    @alvas Which treebank tags should be mapped to the ADJ_SAT WordNet tag?
  • pg2455 almost 6 years
    ADJ_SAT falls under Adjective cluster. You can read more about how adjective clusters are arranged here: wordnet.princeton.edu/documentation/wngloss7wn
  • Shuchita Banthia over 5 years
    @pragMATHiC, included it. Thanks.
  • ryh about 4 years
    'RP' starts with 'R', but it is a particle. Is mapping 'particle' to 'ADV' reasonable?