Python NLTK pos_tag not returning the correct part-of-speech tag

Solution 1

In short:

NLTK is not perfect. In fact, no model is perfect.

Note:

As of NLTK version 3.1, the default pos_tag function no longer uses the old MaxEnt English pickle.

It now uses the perceptron tagger from @Honnibal's implementation; see nltk.tag.pos_tag:

>>> import inspect
>>> from nltk import pos_tag
>>> print(inspect.getsource(pos_tag))
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)

It is better, but still not perfect (note that 'brown' is tagged NN rather than JJ):

>>> from nltk import pos_tag
>>> pos_tag("The quick brown fox jumps over the lazy dog".split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

If you just want a TL;DR solution, see https://github.com/alvations/nltk_cli


In long:

Try another tagger (see https://github.com/nltk/nltk/tree/develop/nltk/tag), e.g.:

  • HunPos
  • Stanford POS
  • Senna

Using the default MaxEnt POS tagger from NLTK (before version 3.1), i.e. nltk.pos_tag:

>>> from nltk import word_tokenize, pos_tag
>>> text = "The quick brown fox jumps over the lazy dog"
>>> pos_tag(word_tokenize(text))
[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

Using Stanford POS tagger:

$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-2015-04-20.zip
$ unzip stanford-postagger-2015-04-20.zip
$ mv stanford-postagger-2015-04-20 stanford-postagger
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.stanford import POSTagger
>>> _path_to_model = home + '/stanford-postagger/models/english-bidirectional-distsim.tagger'
>>> _path_to_jar = home + '/stanford-postagger/stanford-postagger.jar'
>>> st = POSTagger(path_to_model=_path_to_model, path_to_jar=_path_to_jar)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[(u'The', u'DT'), (u'quick', u'JJ'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'jumps', u'VBZ'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]

Using HunPOS (NOTE: the default encoding is ISO-8859-1, not UTF-8):

$ cd ~
$ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
$ tar zxvf hunpos-1.0-linux.tgz
$ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
$ gzip -d en_wsj.model.gz 
$ mv en_wsj.model hunpos-1.0-linux/
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.hunpos import HunposTagger
>>> _path_to_bin = home + '/hunpos-1.0-linux/hunpos-tag'
>>> _path_to_model = home + '/hunpos-1.0-linux/en_wsj.model'
>>> ht = HunposTagger(path_to_model=_path_to_model, path_to_bin=_path_to_bin)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> ht.tag(text.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
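
Because of that ISO-8859-1 default, tokens containing characters outside that charset are worth catching before they reach HunPos. A minimal Python 3 sketch of such a pre-check follows; unencodable_tokens is a hypothetical helper written for illustration, not part of NLTK, and the sample sentence is made up:

```python
def unencodable_tokens(tokens, encoding="ISO-8859-1"):
    """Return the tokens that cannot be represented in `encoding`."""
    bad = []
    for token in tokens:
        try:
            token.encode(encoding)
        except UnicodeEncodeError:
            # At least one character has no mapping in the target charset.
            bad.append(token)
    return bad

# Made-up sample: "naïve" fits in ISO-8859-1, "€" and "Łódź" do not.
tokens = "The naïve fox costs 5 € in Łódź".split()
print(unencodable_tokens(tokens))
```

If the list is non-empty, you can either normalise those tokens first or run the tagger with a different encoding. HunposTagger's constructor should accept an encoding argument for the latter, though check that against your NLTK version.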

Using Senna (make sure you have the latest version of NLTK; some changes were made to the API):

$ cd ~
$ wget http://ronan.collobert.com/senna/senna-v3.0.tgz
$ tar zxvf senna-v3.0.tgz
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.senna import SennaTagger
>>> st = SennaTagger(home+'/senna')
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[('The', u'DT'), ('quick', u'JJ'), ('brown', u'JJ'), ('fox', u'NN'), ('jumps', u'VBZ'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'JJ'), ('dog', u'NN')]
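
To see at a glance where the four taggers differ, the outputs above can be lined up token by token. This is only a sketch: the lists are copied from the sessions above (u'' prefixes dropped), disagreements is a hypothetical helper name, and it assumes every tagger received the same token sequence.

```python
# Tagger outputs copied from the sessions above.
outputs = {
    "maxent": [('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')],
    "stanford": [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')],
    "hunpos": [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')],
    "senna": [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')],
}

def disagreements(outputs):
    """Return (word, {tagger: tag}) for each token the taggers tag differently.

    Assumes all taggers were given the same tokens in the same order.
    """
    names = list(outputs)
    result = []
    # Each row holds one (word, tag) pair per tagger for the same position.
    for row in zip(*(outputs[name] for name in names)):
        word = row[0][0]
        tags = {name: tag for name, (_, tag) in zip(names, row)}
        if len(set(tags.values())) > 1:
            result.append((word, tags))
    return result

for word, tags in disagreements(outputs):
    print(word, tags)
```

For this sentence the taggers only disagree on quick, brown, jumps and lazy: the old MaxEnt model tags the three adjectives as nouns, and both MaxEnt and HunPos tag jumps as NNS.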

Or try building a better POS tagger:


Complaints about pos_tag accuracy on Stack Overflow include:

Issues about NLTK HunPos include:

Issues with NLTK and Stanford POS tagger include:

Solution 2

Solutions such as changing to the Stanford or Senna or HunPOS tagger will definitely yield results, but here is a much simpler way to experiment with different taggers that are also included within NLTK.

The default POS tagger in NLTK right now is the averaged perceptron tagger. Here's a function that will opt to use the Maxent Treebank Tagger instead:

import nltk

def treebankTag(text):
    words = nltk.word_tokenize(text)
    # Load the pre-trained Maxent Treebank tagger from the NLTK data directory
    treebankTagger = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')
    return treebankTagger.tag(words)

I have found that the pre-trained averaged perceptron tagger in NLTK is biased toward treating some adjectives as nouns, as in your example. The treebank tagger has tagged more adjectives correctly for me.
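
One way to put a number on that kind of bias is to count which tags get confused with which. The sketch below is illustrative only: confusions is a hypothetical helper, the perceptron output is the one quoted in Solution 1, and using the Stanford output from Solution 1 as the reference tagging is an assumption, not a gold standard.

```python
from collections import Counter

def confusions(predicted, reference):
    """Count (reference_tag, predicted_tag) pairs where the two taggings differ."""
    counts = Counter()
    for (word, pred), (ref_word, ref) in zip(predicted, reference):
        # Both sequences must come from the same tokenisation.
        assert word == ref_word, "token sequences must line up"
        if pred != ref:
            counts[(ref, pred)] += 1
    return counts

# Perceptron tagger output, as quoted in Solution 1.
perceptron = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
# Reference tagging: the Stanford output from Solution 1 (an assumption).
reference = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

print(confusions(perceptron, reference))
```

On a single sentence this only shows one JJ tagged as NN; run over a larger hand-checked sample, the same counter would show whether adjective-to-noun confusions really dominate for a given tagger.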

Author by

faceoff

Updated on June 23, 2022

Comments

  • faceoff
    faceoff almost 2 years

    Having this:

    text = word_tokenize("The quick brown fox jumps over the lazy dog")
    

    And running:

    nltk.pos_tag(text)
    

    I get:

    [('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]
    

    This is incorrect. The tags for quick brown lazy in the sentence should be:

    ('quick', 'JJ'), ('brown', 'JJ') , ('lazy', 'JJ')
    

    Testing this through their online tool gives the same result; quick, brown and lazy should be adjectives, not nouns.

  • alexis
    alexis almost 9 years
    Yeah yeah, no model is perfect, but this example is pretty disappointing. Considering all the technology that went into this "recommended" tagger, it's not unreasonable to expect more.
  • alexis
    alexis almost 9 years
    Nice demo of the alternatives, though.
  • alvas
    alvas almost 9 years
    It has been 3 years since the model was updated; possibly we should raise this with the nltk-dev Google group: github.com/arne-cl/nltk-maxent-pos-tagger. And the model was created 7 years ago =( github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py#L84
  • Houman
    Houman almost 9 years
    By the look of it, Stanford and Senna are superior taggers, aren't they?
  • alvas
    alvas almost 9 years
    Yes, the Stanford and Senna taggers are more complicated, and both groups put a lot of effort into building their tools.
  • tech4242
    tech4242 over 6 years
    @alvas Thank you for the amazing answer! It's still (sadly) pretty relevant in 2017 as I have been working with NLTK in the past few months
  • alvas
    alvas over 6 years
    @tech4242, exactly. Given a larger annotated corpus, it might be possible to achieve better tagger accuracy.
  • Shan
    Shan over 6 years
    Hmm... in one sentence it correctly tagged "change" as a verb, whereas in another sentence it incorrectly tagged "change" as a noun! Bizarre.
  • luky
    luky over 2 years
    Interesting, but with this model, "jumps" in "The quick brown fox jumps over the lazy dog." is tagged as a noun, not a verb.