How to use pos_tag in NLTK?

Solution 1

Firstly, use human-readable variable names; it helps =)

Next, the input to pos_tag is a list of strings (one tokenized sentence). So it's:

>>> from nltk import pos_tag
>>> sentences = [ ['hello', 'world'], ['good', 'morning'] ]
>>> [pos_tag(sent) for sent in sentences]
[[('hello', 'NN'), ('world', 'NN')], [('good', 'JJ'), ('morning', 'NN')]]
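
If you have many tokenized sentences, NLTK also provides pos_tag_sents, which tags the whole list in one call and avoids reloading the tagger for every sentence (a small addition to this answer; the output is the same as mapping pos_tag over the list):

>>> from nltk import pos_tag_sents
>>> pos_tag_sents(sentences)
[[('hello', 'NN'), ('world', 'NN')], [('good', 'JJ'), ('morning', 'NN')]]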

Also, if you have the input as raw strings, you can use word_tokenize before pos_tag:

>>> from nltk import pos_tag, word_tokenize
>>> a_sentence = 'hello world'
>>> word_tokenize(a_sentence)
['hello', 'world']
>>> pos_tag(word_tokenize(a_sentence))
[('hello', 'NN'), ('world', 'NN')]

>>> two_sentences = ['hello world', 'good morning']
>>> [word_tokenize(sent) for sent in two_sentences]
[['hello', 'world'], ['good', 'morning']]
>>> [pos_tag(word_tokenize(sent)) for sent in two_sentences]
[[('hello', 'NN'), ('world', 'NN')], [('good', 'JJ'), ('morning', 'NN')]]

And if you have the sentences in a paragraph, you can use sent_tokenize to split the paragraph into sentences first:

>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = "Hello world. Good morning."
>>> sent_tokenize(text)
['Hello world.', 'Good morning.']
>>> [word_tokenize(sent) for sent in sent_tokenize(text)]
[['Hello', 'world', '.'], ['Good', 'morning', '.']]
>>> [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]
[[('Hello', 'NNP'), ('world', 'NN'), ('.', '.')], [('Good', 'JJ'), ('morning', 'NN'), ('.', '.')]]
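
A small addition (not from the original answer): pos_tag also takes a tagset argument, and tagset='universal' maps the default Penn Treebank tags onto the coarser universal tags, which are often easier to filter on. This needs the universal_tagset resource, installable with nltk.download('universal_tagset'):

>>> [pos_tag(word_tokenize(sent), tagset='universal') for sent in sent_tokenize(text)]
[[('Hello', 'NOUN'), ('world', 'NOUN'), ('.', '.')], [('Good', 'ADJ'), ('morning', 'NOUN'), ('.', '.')]]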

See also: How to do POS tagging using the NLTK POS tagger in Python?

Solution 2

A common helper function to tokenize a raw string and tag it with POS tags:

import nltk

def get_pos(text):
    tokens = nltk.word_tokenize(text)   # split the raw string into word tokens
    pos_string = nltk.pos_tag(tokens)   # tag each token with its part of speech
    return pos_string

get_pos('hello world')
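
Note that word_tokenize and pos_tag both need their models installed; if one is missing, NLTK raises a LookupError. The resource names below are the NLTK 3.x ones (newer releases may use slightly different names):

import nltk

nltk.download('punkt')                        # model used by word_tokenize / sent_tokenize
nltk.download('averaged_perceptron_tagger')   # default model used by pos_tag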

Hope this helps!

Solution 3

If you have the input as raw strings, you can use word_tokenize before pos_tag:

import nltk

is_noun = lambda pos: pos[:2] == 'NN'

lines = 'You can never plan the future by the past'

lines = lines.lower()
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]

print(nouns) # ['future', 'past']
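
The same prefix check works for other word classes; as a variation of my own (not the original author's), 'VB' covers all the verb tags:

is_verb = lambda pos: pos[:2] == 'VB'
verbs = [word for (word, pos) in nltk.pos_tag(tokenized) if is_verb(pos)]

print(verbs) # expected: ['plan'] -- 'can' is tagged MD (modal), so it is not matched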

Comments

  • SSBakh, almost 2 years ago:

    So I was trying to tag a bunch of words in a list (POS tagging to be exact) like so:

    pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
    

    where lw is a list of words (it's really long, or I would have posted it, but it looks like [['hello'], ['world']], i.e. a list of lists, each containing one word), but when I try to run it I get:

    Traceback (most recent call last):
      File "<pyshell#183>", line 1, in <module>
        pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
      File "<pyshell#183>", line 1, in <listcomp>
        pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
      File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\__init__.py", line 134, in pos_tag
        return _pos_tag(tokens, tagset, tagger)
      File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\__init__.py", line 102, in _pos_tag
        tagged_tokens = tagger.tag(tokens)
      File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in tag
        context = self.START + [self.normalize(w) for w in tokens] + self.END
      File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in <listcomp>
        context = self.START + [self.normalize(w) for w in tokens] + self.END
      File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 240, in normalize
        elif word[0].isdigit():
    IndexError: string index out of range
    

    Can someone tell me why I get this error and how to fix it? Many thanks.

  • SSBakh, over 6 years ago:

    Thanks for the answer, and it works; it's just that I was mostly wondering why this was happening. But I appreciate your answer nonetheless.
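
As for the "why" that the last comment leaves open (my reading of the traceback above, not something stated in the thread): the crash is in normalize at word[0].isdigit(), and indexing word[0] can only raise an IndexError when word is an empty string, so lw almost certainly contains an empty token somewhere. A minimal sketch of the problem and a filtering fix, using hypothetical data:

import nltk

nltk.pos_tag(['hello'])   # fine: [('hello', 'NN')]
# nltk.pos_tag([''])      # IndexError: string index out of range, as in the traceback

lw = [['hello'], [''], ['world']]   # hypothetical: one inner list holds an empty string
pos = [nltk.pos_tag(i, tagset='universal') for i in lw if i and i[0]]
print(pos)   # [[('hello', 'NOUN')], [('world', 'NOUN')]]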