Multilingual NLTK for POS Tagging and Lemmatizer

11,117

Solution 1

If you are looking for another multilingual POS tagger, you might want to try RDRPOSTagger: a robust, easy-to-use and language-independent toolkit for POS and morphological tagging. See experimental results including performance speed and tagging accuracy on 13 languages in this paper. RDRPOSTagger now supports pre-trained POS and morphological tagging models for Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese. RDRPOSTagger also supports the pre-trained Universal POS tagging models for 40 languages.

In Python, you can utilize the pre-trained models for tagging a raw unlabeled text corpus as:

python RDRPOSTagger.py tag PATH-TO-PRETRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS

Example: python RDRPOSTagger.py tag ../Models/POS/German.RDR ../Models/POS/German.DICT ../data/GermanRawTest

If you would like to program with RDRPOSTagger, please follow code lines 92-98 in RDRPOSTagger.py module in pSCRDRTagger package. Here is an example:

r = RDRPOSTagger()
r.constructSCRDRtreeFromRDRfile("../Models/POS/German.RDR") #Load POS tagging model for German
DICT = readDictionary("../Models/POS/German.DICT") #Load a German lexicon 
r.tagRawSentence(DICT, "Die Reaktion des deutschen Außenministers zeige , daß dieser die außerordentlich wichtige Rolle Irans in der islamischen Welt erkenne .")

r = RDRPOSTagger()
r.constructSCRDRtreeFromRDRfile("../Models/POS/French.RDR") # Load POS tagging model for French
DICT = readDictionary("../Models/POS/French.DICT") # Load a French lexicon
r.tagRawSentence(DICT, "Cette annonce a fait l' effet d' une véritable bombe . ")

Solution 2

There is no option that you can pass to NLTK's POS-tagging and lemmatizing functions that will make them process other languages.

One solution would be to get a training corpus for each language and to train your own POS-taggers with NLTK, then figure out a lemmatizing solution, maybe dictonary-based, for each language.

That might be overkill though, as there is already a single stop solution for both tasks in Italian, French, Spanish and German (and many other languages): TreeTagger. It is not as state-of-the-art as the POS-taggers and lemmatizers in English, but it still does a good job.

What you want is to install TreeTagger on your system and be able to call it from Python. Here is a GitHub repo by miotto that lets you do just that.

The following snippet shows you how to test that you set up everything correctly. As you can see, I am able to POS-tag and lemmatize in one function call, and I can do it just as easily in English and in French.

>>> import os
>>> os.environ['TREETAGGER'] = "/opt/treetagger/cmd" # Or wherever you installed TreeTagger
>>> from treetagger import TreeTagger
>>> tt_en = TreeTagger(encoding='utf-8', language='english')
>>> tt_en.tag('Does this thing even work?')
[[u'Does', u'VBZ', u'do'], [u'this', u'DT', u'this'], [u'thing', u'NN', u'thing'], [u'even', u'RB', u'even'], [u'work', u'VB', u'work'], [u'?', u'SENT', u'?']]
>>> tt_fr = TreeTagger(encoding='utf-8', language='french')
>>> tt_fr.tag(u'Mon Dieu, faites que ça marche!')
[[u'Mon', u'DET:POS', u'mon'], [u'Dieu', u'NOM', u'Dieu'], [u',', u'PUN', u','], [u'faites', u'VER:pres', u'faire'], [u'que', u'KON', u'que'], [u'\xe7a', u'PRO:DEM', u'cela'], [u'marche', u'NOM', u'marche'], [u'!', u'SENT', u'!']]

Since this question gets asked a lot (and since the installation process is not super straight-forward, IMO), I will write a blog post on the matter and update this answer with a link to it as soon as it is done.

EDIT: Here is the above-mentioned blog post.

Share:
11,117

Related videos on Youtube

Alessio Schiavelli
Author by

Alessio Schiavelli

Updated on June 04, 2022

Comments

  • Alessio Schiavelli
    Alessio Schiavelli almost 2 years

    Recently I approached to the NLP and I tried to use NLTK and TextBlob for analyzing texts. I would like to develop an app that analyzes reviews made by travelers and so I have to manage a lot of texts written in different languages. I need to do two main operations: POS Tagging and lemmatization. I have seen that in NLTK there is a possibility to choice the the right language for sentences tokenization like this:

    tokenizer = nltk.data.load('tokenizers/punkt/PY3/italian.pickle')
    

    I haven't found the the right way to set the language for POS Tagging and Lemmatizer in different languages yet. How can I set the correct corpora/dictionary for non-english texts such as Italian, French, Spanish or German? I also see that there is a possibility to import the "TreeBank" or "WordNet" modules, but I don't understand how I can use them. Otherwise, where can I find the respective corporas?

    Can you give me some suggestion or reference? Please take care that I'm not an expert of NLTK.

    Many Thanks.

  • aceminer
    aceminer about 8 years
    I was trying this tagger for thai but it doesnt seem to work. It gives me a whole long string as NCNM. Does it have to take in a string of tokens instead?
  • NQD
    NQD about 8 years
    Yes, you have to perform Thai word segmentation before using the tagger.