Setting up NLTK with the Stanford NLP tools (both StanfordNERTagger and StanfordPOSTagger) for Spanish
Solution 1
Try:
# StanfordPOSTagger
from nltk.tag.stanford import StanfordPOSTagger
stanford_dir = '/home/me/stanford/stanford-postagger-full-2015-04-20/'
modelfile = stanford_dir + 'models/english-bidirectional-distsim.tagger'
jarfile = stanford_dir + 'stanford-postagger.jar'
st = StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile)
# NERTagger
from nltk.tag.stanford import StanfordNERTagger
stanford_dir = '/home/me/stanford/stanford-ner-2015-04-20/'
jarfile = stanford_dir + 'stanford-ner.jar'
modelfile = stanford_dir + 'classifiers/english.all.3class.distsim.crf.ser.gz'
st = StanfordNERTagger(model_filename=modelfile, path_to_jar=jarfile)
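Building the paths by string concatenation, as above, silently breaks if a trailing slash is missing. A small sketch of the same setup using os.path.join instead (the directories are the same hypothetical install locations used above):

```python
import os

# Hypothetical install directories, matching the paths used above
pos_dir = '/home/me/stanford/stanford-postagger-full-2015-04-20'
ner_dir = '/home/me/stanford/stanford-ner-2015-04-20'

# os.path.join inserts separators correctly regardless of trailing slashes
pos_model = os.path.join(pos_dir, 'models', 'english-bidirectional-distsim.tagger')
pos_jar = os.path.join(pos_dir, 'stanford-postagger.jar')
ner_model = os.path.join(ner_dir, 'classifiers', 'english.all.3class.distsim.crf.ser.gz')
ner_jar = os.path.join(ner_dir, 'stanford-ner.jar')

print(pos_model)
print(ner_jar)
```

The resulting strings can then be passed to StanfordPOSTagger/StanfordNERTagger exactly as in the snippets above.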
For detailed information on NLTK API with Stanford tools, take a look at: https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software#stanford-tagger-ner-tokenizer-and-parser
Note: The NLTK APIs are for the individual Stanford tools; if you're using Stanford CoreNLP, it's best to follow @dimazest's instructions at http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html
EDITED
As for Spanish NER tagging, I strongly suggest that you use Stanford CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml) instead of the Stanford NER package (http://nlp.stanford.edu/software/CRF-NER.shtml), and follow @dimazest's solution for JSON file reading.
Alternatively, if you must use the NER package, you can try following the instructions from https://github.com/alvations/nltk_cli (Disclaimer: this repo is not officially affiliated with NLTK). Do the following on the Unix command line:
cd $HOME
wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2015-01-08-models.jar
unzip stanford-spanish-corenlp-2015-01-08-models.jar -d stanford-spanish
cp stanford-spanish/edu/stanford/nlp/models/ner/* /home/me/stanford/stanford-ner-2015-04-20/classifiers/
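Since a .jar file is just a zip archive, the unzip/cp steps above can also be done from Python with the standard zipfile module. A sketch of a helper (the function name extract_members is illustrative, not part of any library):

```python
import os
import zipfile

def extract_members(jar_path, prefix, dest_dir):
    """Extract every file under `prefix` inside the jar (a zip archive)
    into `dest_dir`, flattening the directory structure."""
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if name.startswith(prefix) and not name.endswith('/'):
                target = os.path.join(dest_dir, os.path.basename(name))
                with jar.open(name) as src, open(target, 'wb') as dst:
                    dst.write(src.read())
                extracted.append(target)
    return extracted

# Hypothetical usage, mirroring the shell commands above:
# extract_members('stanford-spanish-corenlp-2015-01-08-models.jar',
#                 'edu/stanford/nlp/models/ner/',
#                 '/home/me/stanford/stanford-ner-2015-04-20/classifiers/')
```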
Then, in Python:
# NERTagger
from nltk.tag.stanford import StanfordNERTagger
stanford_dir = '/home/me/stanford/stanford-ner-2015-04-20/'
jarfile = stanford_dir + 'stanford-ner.jar'
modelfile = stanford_dir + 'classifiers/spanish.ancora.distsim.s512.crf.ser.gz'
st = StanfordNERTagger(model_filename=modelfile, path_to_jar=jarfile)
Solution 2
The error lies in the arguments passed to the StanfordNERTagger function.
The first argument should be the model file, i.e. the classifier you are using. You can find that file inside the Stanford zip file. For example:
st = StanfordNERTagger('/home/me/stanford/stanford-postagger-full-2015-04-20/classifier/tagger.ser.gz', '/home/me/stanford/stanford-spanish-corenlp-2015-01-08-models.jar')
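A common failure mode here is passing a path that does not exist, or passing an archive (.zip/.jar) where a classifier file is expected; the Java subprocess then dies with an opaque "Java command failed" OSError. A small sanity-check sketch to fail fast before constructing the tagger (the helper name is illustrative):

```python
import os

def check_stanford_paths(model_path, jar_path):
    """Fail fast with a readable error instead of an opaque
    'Java command failed' OSError later on."""
    for path in (model_path, jar_path):
        if not os.path.isfile(path):
            raise FileNotFoundError('Stanford file not found: %s' % path)
    if model_path.endswith('.zip') or model_path.endswith('.jar'):
        raise ValueError('model_filename must point to a classifier '
                         '(e.g. *.crf.ser.gz), not an archive: %s' % model_path)
```

Call it with the model and jar paths right before creating StanfordNERTagger.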
Solution 3
POS Tagger
In order to use the StanfordPOSTagger for Spanish with Python, you have to install the Stanford tagger that includes a model for Spanish. In this example I download the tagger into the /content folder:
cd /content
wget https://nlp.stanford.edu/software/stanford-tagger-4.1.0.zip
unzip stanford-tagger-4.1.0.zip
After unzipping, I have the folder stanford-postagger-full-2020-08-06 in /content, so I can use the tagger with:
from nltk.tag.stanford import StanfordPOSTagger
stanford_dir = '/content/stanford-postagger-full-2020-08-06'
modelfile = f'{stanford_dir}/models/spanish-ud.tagger'
jarfile = f'{stanford_dir}/stanford-postagger.jar'
st = StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile)
To check that everything works fine, we can do:
>>> st.tag(["Juan", "Medina", "es", "un", "ingeniero"])
[('Juan', 'PROPN'),
 ('Medina', 'PROPN'),
 ('es', 'AUX'),
 ('un', 'DET'),
 ('ingeniero', 'NOUN')]
NER Tagger
In this case it is necessary to download the NER core and the Spanish models separately.
cd /content
#download NER core
wget https://nlp.stanford.edu/software/stanford-ner-4.0.0.zip
unzip stanford-ner-4.0.0.zip
#download spanish models
wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-02-27-models.jar
unzip stanford-spanish-corenlp-2018-02-27-models.jar -d stanford-spanish
#copy only the necessary files
cp stanford-spanish/edu/stanford/nlp/models/ner/* stanford-ner-4.0.0/classifiers/
rm -rf stanford-spanish stanford-ner-4.0.0.zip stanford-spanish-corenlp-2018-02-27-models.jar
To use it on python:
from nltk.tag.stanford import StanfordNERTagger
stanford_dir = '/content/stanford-ner-4.0.0'
jarfile = f'{stanford_dir}/stanford-ner.jar'
modelfile = f'{stanford_dir}/classifiers/spanish.ancora.distsim.s512.crf.ser.gz'
st = StanfordNERTagger(model_filename=modelfile, path_to_jar=jarfile)
To check that everything works fine, we can do:
>>> st.tag(["Juan", "Medina", "es", "un", "ingeniero"])
[('Juan', 'PERS'),
 ('Medina', 'PERS'),
 ('es', 'O'),
 ('un', 'O'),
 ('ingeniero', 'O')]
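The NER tagger returns flat (token, tag) pairs, so consecutive PERS tokens belong to the same entity. A small grouping sketch (the function name is illustrative) for turning the output above into whole entities:

```python
from itertools import groupby

def group_entities(tagged):
    """Merge runs of consecutive tokens that share a non-'O' tag
    into single (entity, tag) pairs."""
    entities = []
    for tag, chunk in groupby(tagged, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(tok for tok, _ in chunk), tag))
    return entities

# The output shown above for the Spanish NER model:
tagged = [('Juan', 'PERS'), ('Medina', 'PERS'), ('es', 'O'),
          ('un', 'O'), ('ingeniero', 'O')]
print(group_entities(tagged))  # [('Juan Medina', 'PERS')]
```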
nanounanue
Updated on June 12, 2022

Comments

nanounanue, almost 2 years ago:
The NLTK documentation is rather poor on this integration. The steps I followed were:

Download http://nlp.stanford.edu/software/stanford-postagger-full-2015-04-20.zip to /home/me/stanford
Download http://nlp.stanford.edu/software/stanford-spanish-corenlp-2015-01-08-models.jar to /home/me/stanford

Then in an ipython console:

In [11]: import nltk
In [12]: nltk.__version__
Out[12]: '3.1'
In [13]: from nltk.tag import StanfordNERTagger

Then:

st = StanfordNERTagger('/home/me/stanford/stanford-postagger-full-2015-04-20.zip', '/home/me/stanford/stanford-spanish-corenlp-2015-01-08-models.jar')

But when I tried to run it:

st.tag('Adolfo se la pasa corriendo'.split())
Error: no se ha encontrado o cargado la clase principal edu.stanford.nlp.ie.crf.CRFClassifier
(English: the main class edu.stanford.nlp.ie.crf.CRFClassifier could not be found or loaded)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-14-0c1a96b480a6> in <module>()
----> 1 st.tag('Adolfo se la pasa corriendo'.split())

/home/nanounanue/.pyenv/versions/3.4.3/lib/python3.4/site-packages/nltk/tag/stanford.py in tag(self, tokens)
     64     def tag(self, tokens):
     65         # This function should return list of tuple rather than list of list
---> 66         return sum(self.tag_sents([tokens]), [])
     67
     68     def tag_sents(self, sentences):

/home/nanounanue/.pyenv/versions/3.4.3/lib/python3.4/site-packages/nltk/tag/stanford.py in tag_sents(self, sentences)
     87         # Run the tagger and get the output
     88         stanpos_output, _stderr = java(cmd, classpath=self._stanford_jar,
---> 89                                        stdout=PIPE, stderr=PIPE)
     90         stanpos_output = stanpos_output.decode(encoding)
     91

/home/nanounanue/.pyenv/versions/3.4.3/lib/python3.4/site-packages/nltk/__init__.py in java(cmd, classpath, stdin, stdout, stderr, blocking)
    132     if p.returncode != 0:
    133         print(_decode_stdoutdata(stderr))
--> 134         raise OSError('Java command failed : ' + str(cmd))
    135
    136     return (stdout, stderr)

OSError: Java command failed : ['/usr/bin/java', '-mx1000m', '-cp', '/home/nanounanue/Descargas/stanford-spanish-corenlp-2015-01-08-models.jar', 'edu.stanford.nlp.ie.crf.CRFClassifier', '-loadClassifier', '/home/nanounanue/Descargas/stanford-postagger-full-2015-04-20.zip', '-textFile', '/tmp/tmp6y169div', '-outputFormat', 'slashTags', '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer', '-tokenizerOptions', '"tokenizeNLs=false"', '-encoding', 'utf8']
The same occurs with the StanfordPOSTagger.
NOTE: I need the Spanish version.
NOTE: I am running this on Python 3.4.3.