Setting NLTK with Stanford NLP (both StanfordNERTagger and StanfordPOSTagger) for Spanish


Solution 1

Try:

# StanfordPOSTagger
from nltk.tag.stanford import StanfordPOSTagger
stanford_dir = '/home/me/stanford/stanford-postagger-full-2015-04-20/'
modelfile = stanford_dir + 'models/english-bidirectional-distsim.tagger'
jarfile = stanford_dir + 'stanford-postagger.jar'

st = StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile)


# NERTagger
from nltk.tag.stanford import StanfordNERTagger
stanford_dir = '/home/me/stanford/stanford-ner-2015-04-20/'
jarfile = stanford_dir + 'stanford-ner.jar'
modelfile = stanford_dir + 'classifiers/english.all.3class.distsim.crf.ser.gz'

st = StanfordNERTagger(model_filename=modelfile, path_to_jar=jarfile)

For detailed information on the NLTK API for the Stanford tools, take a look at: https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software#stanford-tagger-ner-tokenizer-and-parser

Note: The NLTK APIs are for the individual Stanford tools. If you're using Stanford CoreNLP, it's best to follow @dimazest's instructions at http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html
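Most failures with these wrappers come down to wrong paths, so it can help to validate the model and jar files before constructing the tagger. This is a sketch of my own (the helper name and error handling are not part of NLTK):

```python
import os

def build_tagger_paths(stanford_dir, model_rel, jar_rel):
    """Join and validate the model/jar paths for a Stanford tagger.

    Fails early with a clear FileNotFoundError instead of letting the
    Java subprocess die later with a cryptic OSError.
    """
    model = os.path.join(stanford_dir, model_rel)
    jar = os.path.join(stanford_dir, jar_rel)
    for path in (model, jar):
        if not os.path.isfile(path):
            raise FileNotFoundError(f"Missing Stanford file: {path}")
    return model, jar

# Illustrative usage (paths as in the snippet above):
# model, jar = build_tagger_paths(
#     '/home/me/stanford/stanford-postagger-full-2015-04-20',
#     'models/english-bidirectional-distsim.tagger',
#     'stanford-postagger.jar')
# st = StanfordPOSTagger(model_filename=model, path_to_jar=jar)
```
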


EDITED

As for Spanish NER tagging, I strongly suggest that you use Stanford CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml) instead of the Stanford NER package (http://nlp.stanford.edu/software/CRF-NER.shtml), and follow @dimazest's solution for reading the JSON output.

Alternatively, if you must use the NER package, you can try following the instructions from https://github.com/alvations/nltk_cli (disclaimer: this repo is not officially affiliated with NLTK). Run the following on the Unix command line:

cd $HOME
wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2015-01-08-models.jar
unzip stanford-spanish-corenlp-2015-01-08-models.jar -d stanford-spanish
cp stanford-spanish/edu/stanford/nlp/models/ner/* /home/me/stanford/stanford-ner-2015-04-20/ner/classifiers/
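A jar is just a zip archive, so the same extraction can be sketched in the Python standard library if you prefer to keep everything in one script. This mirrors the shell commands above; the function name and the destination layout are my own:

```python
import os
import shutil
import zipfile

def extract_ner_models(models_jar, dest_classifiers_dir,
                       prefix='edu/stanford/nlp/models/ner/'):
    """Copy the NER classifier files out of a Stanford models jar
    (an ordinary zip) into the NER tool's classifiers directory."""
    os.makedirs(dest_classifiers_dir, exist_ok=True)
    with zipfile.ZipFile(models_jar) as jar:
        for name in jar.namelist():
            # keep only files under the NER models path, skip directories
            if name.startswith(prefix) and not name.endswith('/'):
                target = os.path.join(dest_classifiers_dir,
                                      os.path.basename(name))
                with jar.open(name) as src, open(target, 'wb') as dst:
                    shutil.copyfileobj(src, dst)
```
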

Then in python:

# NERTagger
stanford_dir = '/home/me/stanford/stanford-ner-2015-04-20/'
jarfile = stanford_dir + 'stanford-ner.jar'
modelfile = stanford_dir + 'classifiers/spanish.ancora.distsim.s512.crf.ser.gz'

st = StanfordNERTagger(model_filename=modelfile, path_to_jar=jarfile)

Solution 2

The error lies in the arguments passed to the StanfordNERTagger constructor.

The first argument should be the model file, i.e. the classifier you are using. You can find that file inside the Stanford zip file. For example:

    st = StanfordNERTagger('/home/me/stanford/stanford-postagger-full-2015-04-20/classifier/tagger.ser.gz', '/home/me/stanford/stanford-spanish-corenlp-2015-01-08-models.jar')

Solution 3

POS Tagger

In order to use the StanfordPOSTagger for Spanish with Python, you have to install the Stanford tagger distribution that includes a model for Spanish.

In this example I download the tagger into the /content folder:

cd /content
wget https://nlp.stanford.edu/software/stanford-tagger-4.1.0.zip
unzip stanford-tagger-4.1.0.zip

After unzipping, I have a folder stanford-postagger-full-2020-08-06 in /content, so I can use the tagger with:

from nltk.tag.stanford import StanfordPOSTagger

stanford_dir = '/content/stanford-postagger-full-2020-08-06'
modelfile = f'{stanford_dir}/models/spanish-ud.tagger'
jarfile =   f'{stanford_dir}/stanford-postagger.jar'

st = StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile)

To check that everything works fine, we can do:

>st.tag(["Juan","Medina","es","un","ingeniero"])

>[('Juan', 'PROPN'),
 ('Medina', 'PROPN'),
 ('es', 'AUX'),
 ('un', 'DET'),
 ('ingeniero', 'NOUN')]
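The wrapper returns plain (token, tag) tuples, so downstream filtering is ordinary Python. For instance, assuming Universal Dependencies-style tags like those in the sample output above (the helper and its defaults are my own, not part of NLTK):

```python
def keep_nominals(tagged, tags=('NOUN', 'PROPN')):
    """Return the tokens whose POS tag is in `tags`."""
    return [token for token, tag in tagged if tag in tags]

# Sample output from the snippet above:
sample = [('Juan', 'PROPN'), ('Medina', 'PROPN'), ('es', 'AUX'),
          ('un', 'DET'), ('ingeniero', 'NOUN')]
# keep_nominals(sample) -> ['Juan', 'Medina', 'ingeniero']
```
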

NER Tagger

In this case it is necessary to download the NER core and the Spanish models separately.

cd /content
#download NER core
wget https://nlp.stanford.edu/software/stanford-ner-4.0.0.zip
unzip stanford-ner-4.0.0.zip
#download spanish models
wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-02-27-models.jar
unzip stanford-spanish-corenlp-2018-02-27-models.jar -d stanford-spanish
#copy only the necessary files
cp stanford-spanish/edu/stanford/nlp/models/ner/* stanford-ner-4.0.0/classifiers/
rm -rf stanford-spanish stanford-ner-4.0.0.zip stanford-spanish-corenlp-2018-02-27-models.jar
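Both wrappers shell out to Java, so a quick sanity check before tagging can save a cryptic OSError later. This diagnostic helper is my own sketch, not part of NLTK:

```python
import os
import shutil

def check_stanford_setup(jarfile, modelfile):
    """Return a list of problems that would make the NLTK wrapper fail."""
    problems = []
    if shutil.which('java') is None:          # the wrapper invokes java
        problems.append('java not found on PATH')
    if not os.path.isfile(jarfile):
        problems.append(f'missing jar: {jarfile}')
    if not os.path.isfile(modelfile):
        problems.append(f'missing model: {modelfile}')
    return problems

# An empty list means the basic prerequisites look fine:
# check_stanford_setup(jarfile, modelfile)
```
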

To use it in Python:

from nltk.tag.stanford import StanfordNERTagger
stanford_dir = '/content/stanford-ner-4.0.0'
jarfile = f'{stanford_dir}/stanford-ner.jar'
modelfile = f'{stanford_dir}/classifiers/spanish.ancora.distsim.s512.crf.ser.gz'

st = StanfordNERTagger(model_filename=modelfile, path_to_jar=jarfile)

To check that everything works fine, we can do:

>st.tag(["Juan","Medina","es","un","ingeniero"])

>[('Juan', 'PERS'),
 ('Medina', 'PERS'),
 ('es', 'O'),
 ('un', 'O'),
 ('ingeniero', 'O')]
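The NER wrapper labels each token independently (no B-/I- prefixes), so adjacent tokens with the same non-'O' label can be merged into entity spans with a few lines of plain Python. The helper below is my own sketch:

```python
from itertools import groupby

def merge_entities(tagged):
    """Collapse runs of identically-labelled tokens into (text, label)
    spans, dropping tokens labelled 'O'."""
    spans = []
    for label, group in groupby(tagged, key=lambda pair: pair[1]):
        if label != 'O':
            spans.append((' '.join(tok for tok, _ in group), label))
    return spans

# Sample output from the snippet above:
sample = [('Juan', 'PERS'), ('Medina', 'PERS'), ('es', 'O'),
          ('un', 'O'), ('ingeniero', 'O')]
# merge_entities(sample) -> [('Juan Medina', 'PERS')]
```
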
Author: nanounanue
Updated on June 12, 2022

Comments

  • nanounanue
    nanounanue almost 2 years

The NLTK documentation is rather poor on this integration. The steps I followed were:

    Then in an IPython console:

    In [11]: import nltk

    In [12]: nltk.__version__
    Out[12]: '3.1'
    
    In [13]: from nltk.tag import StanfordNERTagger
    

    Then

    st = StanfordNERTagger('/home/me/stanford/stanford-postagger-full-2015-04-20.zip', '/home/me/stanford/stanford-spanish-corenlp-2015-01-08-models.jar')
    

    But when I tried to run it:

    st.tag('Adolfo se la pasa corriendo'.split())
    Error: Could not find or load main class edu.stanford.nlp.ie.crf.CRFClassifier
    
    ---------------------------------------------------------------------------
    OSError                                   Traceback (most recent call last)
    <ipython-input-14-0c1a96b480a6> in <module>()
    ----> 1 st.tag('Adolfo se la pasa corriendo'.split())
    
    /home/nanounanue/.pyenv/versions/3.4.3/lib/python3.4/site-packages/nltk/tag/stanford.py in tag(self, tokens)
         64     def tag(self, tokens):
         65         # This function should return list of tuple rather than list of list
    ---> 66         return sum(self.tag_sents([tokens]), [])
         67 
         68     def tag_sents(self, sentences):
    
    /home/nanounanue/.pyenv/versions/3.4.3/lib/python3.4/site-packages/nltk/tag/stanford.py in tag_sents(self, sentences)
         87         # Run the tagger and get the output
         88         stanpos_output, _stderr = java(cmd, classpath=self._stanford_jar,
    ---> 89                                                        stdout=PIPE, stderr=PIPE)
         90         stanpos_output = stanpos_output.decode(encoding)
         91 
    
    /home/nanounanue/.pyenv/versions/3.4.3/lib/python3.4/site-packages/nltk/__init__.py in java(cmd, classpath, stdin, stdout, stderr, blocking)
        132     if p.returncode != 0:
        133         print(_decode_stdoutdata(stderr))
    --> 134         raise OSError('Java command failed : ' + str(cmd))
        135 
        136     return (stdout, stderr)
    
    OSError: Java command failed : ['/usr/bin/java', '-mx1000m', '-cp', '/home/nanounanue/Descargas/stanford-spanish-corenlp-2015-01-08-models.jar', 'edu.stanford.nlp.ie.crf.CRFClassifier', '-loadClassifier', '/home/nanounanue/Descargas/stanford-postagger-full-2015-04-20.zip', '-textFile', '/tmp/tmp6y169div', '-outputFormat', 'slashTags', '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer', '-tokenizerOptions', '"tokenizeNLs=false"', '-encoding', 'utf8']
    

    The same error occurs with the StanfordPOSTagger.

    NOTE: I need this to be the Spanish version. NOTE: I am running this on Python 3.4.3.