Training n-gram NER with Stanford NLP

nlp stanford-nlp opennlp named-entity-recognition named-entity-extraction

16,112

Solution 1

It has been a long wait for an answer here. I was not able to figure out how to get this done with Stanford CoreNLP, but I did get it working with the LingPipe NLP libraries instead. I am posting the answer here because I think someone else could benefit from it.

Please check the LingPipe licensing before diving into an implementation, whether you are a developer, a researcher, or anyone else.

LingPipe provides several NER methods:

1) Dictionary Based NER

2) Statistical NER (HMM Based)

3) Rule Based NER etc.

I have used both the dictionary-based and the statistical approaches.

The first is a direct lookup method; the second is training based.

An example for the dictionary based NER can be found here
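To illustrate what "direct lookup" means, here is a minimal plain-Java sketch. It does not use LingPipe's own dictionary chunker; the class name, the `findEntities` helper, and the whitespace tokenization are all assumptions made for illustration. It tries the longest candidate phrase first so that multi-word names are matched as a unit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DictionaryLookupSketch {

    /** Returns "phrase >> TYPE" for each dictionary phrase found, preferring the longest match. */
    static List<String> findEntities(String text, Map<String, String> dictionary, int maxTokens) {
        String[] tokens = text.trim().split("\\s+"); // naive whitespace tokenization (assumption)
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matchedLength = 0;
            // Try the longest candidate phrase first so that multi-word names win.
            for (int len = Math.min(maxTokens, tokens.length - i); len >= 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                String type = dictionary.get(candidate.toLowerCase());
                if (type != null) {
                    found.add(candidate + " >> " + type);
                    matchedLength = len;
                    break;
                }
            }
            // Skip past the match, or move on by one token if nothing matched.
            i += Math.max(matchedLength, 1);
        }
        return found;
    }

    public static void main(String[] args) {
        // Dictionary keys are stored in lower case; the entries are hypothetical.
        Map<String, String> dictionary = new LinkedHashMap<>();
        dictionary.put("titanic", "MOVIE");
        dictionary.put("i know what you did last summer", "MOVIE");
        String text = "have you seen I know what you did last summer or Titanic";
        for (String hit : findEntities(text, dictionary, 8)) {
            System.out.println(hit);
        }
    }
}
```

A lookup like this only finds phrases it has been given, which is why the statistical approach is needed for unseen names.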

The statistical approach requires a training file. I used a file in the following format:

<root>
<s> data line with the <ENAMEX TYPE="myentity">entity1</ENAMEX>  to be trained</s>
...
<s> with the <ENAMEX TYPE="myentity">entity2</ENAMEX>  annotated </s>
</root>
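If the sentences and entities already sit in some other form, a file in this layout can be generated rather than typed by hand. The following is a rough sketch in plain Java, assuming the file is read as XML (so `&`, `<` and `>` in the text are escaped); the class and method names are hypothetical, and only the first occurrence of the entity in each sentence is tagged.

```java
import java.util.ArrayList;
import java.util.List;

public class TrainingFileSketch {

    /** Escapes the characters that would otherwise break the XML-style markup. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps one sentence in <s> tags and tags the first occurrence of the entity. */
    static String annotate(String sentence, String entity, String type) {
        int start = sentence.indexOf(entity);
        if (start < 0) {
            return "<s>" + escape(sentence) + "</s>"; // entity not present: plain sentence
        }
        int end = start + entity.length();
        return "<s>" + escape(sentence.substring(0, start))
                + "<ENAMEX TYPE=\"" + type + "\">" + escape(entity) + "</ENAMEX>"
                + escape(sentence.substring(end)) + "</s>";
    }

    /** Joins annotated sentences into one document with a single root element. */
    static String buildCorpus(List<String> annotatedSentences) {
        return "<root>\n" + String.join("\n", annotatedSentences) + "\n</root>\n";
    }

    public static void main(String[] args) {
        // Example sentences and the MOVIE type are made up for illustration.
        List<String> sentences = new ArrayList<>();
        sentences.add(annotate("Have you seen I know what you did last summer?",
                "I know what you did last summer", "MOVIE"));
        sentences.add(annotate("We watched Titanic again.", "Titanic", "MOVIE"));
        System.out.print(buildCorpus(sentences));
    }
}
```

The multi-word title ends up inside a single ENAMEX element, which is how an n-gram entity is expressed in this format.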

I then used the following code to train the entities.

import java.io.File;
import java.io.IOException;

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.corpus.parsers.Muc6ChunkParser;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

@SuppressWarnings("deprecation")
public class TrainEntities {

    static final int MAX_N_GRAM = 50;  // longest character n-gram used by the language models
    static final int NUM_CHARS = 300;  // assumed number of distinct characters
    static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior

    public static void main(String[] args) throws IOException {
        File corpusFile = new File("inputfile.txt"); // my annotated file
        File modelFile = new File("outputmodelfile.model");

        System.out.println("Setting up Chunker Estimator");
        TokenizerFactory factory
            = IndoEuropeanTokenizerFactory.INSTANCE;
        HmmCharLmEstimator hmmEstimator
            = new HmmCharLmEstimator(MAX_N_GRAM,NUM_CHARS,LM_INTERPOLATION);
        CharLmHmmChunker chunkerEstimator
            = new CharLmHmmChunker(factory,hmmEstimator);

        System.out.println("Setting up Data Parser");
        Muc6ChunkParser parser = new Muc6ChunkParser();
        parser.setHandler(chunkerEstimator);

        System.out.println("Training with Data from File=" + corpusFile);
        parser.parse(corpusFile);

        System.out.println("Compiling and Writing Model to File=" + modelFile);
        AbstractExternalizable.compileTo(chunkerEstimator,modelFile);
    }

}

To test the NER, I used the following class:

import java.io.File;
import java.util.Set;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;

public class Recognition {
    public static void main(String[] args) throws Exception {
        File modelFile = new File("outputmodelfile.model");
        Chunker chunker = (Chunker) AbstractExternalizable
                .readObject(modelFile);
        String testString = "my test string";
        Chunking chunking = chunker.chunk(testString);
        Set<Chunk> test = chunking.chunkSet();
        // Print each recognized chunk with its entity type
        for (Chunk c : test) {
            System.out.println(testString + " : "
                    + testString.substring(c.start(), c.end()) + " >> "
                    + c.type());
        }
    }
}

Code Courtesy : Google :)

Solution 2

The answer is basically given in your quoted example, where "Emma Woodhouse" is a single name. The default models we supply use IO encoding, and assume that adjacent tokens of the same class are part of the same entity. In practice this is almost always true, and it keeps the models simpler. However, if you don't want to do that, you can train NER models with other label encodings, such as the commonly used IOB encoding, where you would instead label things:

Emma    B-PERSON
Woodhouse    I-PERSON

Then, adjacent tokens of the same category but not the same entity can be represented.
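As a rough illustration of the labelling scheme only (not of how to configure the Stanford classifier), the sketch below turns token spans into IOB labels in plain Java. The class name, the `toIob` helper, and the sample sentence are assumptions; the first token of each entity gets `B-` and the rest get `I-`, so two PERSON names standing next to each other remain separate entities.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IobEncodingSketch {

    /**
     * Produces one "token<TAB>label" line per token.
     * spans[i] is {startToken, endTokenExclusive}; types[i] is the class of that span.
     */
    static List<String> toIob(String[] tokens, int[][] spans, String[] types) {
        String[] labels = new String[tokens.length];
        Arrays.fill(labels, "O"); // everything outside an entity is O
        for (int s = 0; s < spans.length; s++) {
            for (int t = spans[s][0]; t < spans[s][1]; t++) {
                // B- marks the first token of an entity, I- marks its continuation.
                labels[t] = (t == spans[s][0] ? "B-" : "I-") + types[s];
            }
        }
        List<String> lines = new ArrayList<>();
        for (int t = 0; t < tokens.length; t++) {
            lines.add(tokens[t] + "\t" + labels[t]);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Two adjacent names of the same class (made-up example).
        String[] tokens = {"Emma", "Woodhouse", "Harriet", "Smith", "talked"};
        int[][] spans = {{0, 2}, {2, 4}};
        String[] types = {"PERSON", "PERSON"};
        for (String line : toIob(tokens, spans, types)) {
            System.out.println(line);
        }
    }
}
```

With plain IO labels all four name tokens would carry the same PERSON tag and be read as one entity; the `B-` on "Harriet" is what marks the boundary.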

Solution 3

I faced the same challenge of tagging n-gram phrases for the automotive domain. I was looking for an efficient keyword mapping that could be used to create training files at a later stage. I ended up using regexNER in the NLP pipeline, providing a mapping file with the regular expressions (the n-gram component terms) and their corresponding labels. Note that no machine-learned NER is involved in this case. Hope this information helps someone!
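A mapping file of that kind can be generated from a phrase list. The sketch below is plain Java and rests on an assumption about the layout: one entry per line, with space-separated per-token patterns, then a tab, then the label. The class and helper names are hypothetical, and the whitespace split is only a stand-in for the pipeline's real tokenizer, which may divide some phrases differently (for example around apostrophes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RegexNerMappingSketch {

    /** Builds one mapping line: space-separated token patterns, a tab, then the label. */
    static String mappingLine(String phrase, String label) {
        List<String> tokenPatterns = new ArrayList<>();
        for (String token : phrase.trim().split("\\s+")) {
            // Plain alphanumeric tokens are already valid patterns; quote anything else
            // so that characters such as '.' or '(' are matched literally.
            tokenPatterns.add(token.matches("[A-Za-z0-9]+") ? token : Pattern.quote(token));
        }
        return String.join(" ", tokenPatterns) + "\t" + label;
    }

    public static void main(String[] args) {
        // Phrases and the MOVIE label are illustrative only.
        System.out.println(mappingLine("I know what you did last summer", "MOVIE"));
        System.out.println(mappingLine("Titanic", "MOVIE"));
    }
}
```

The output can be saved as the mapping file and later reused to bootstrap annotated training data.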

Arun A K

http://ak-arun.github.io/

Updated on October 04, 2020

Comments

Arun A K over 3 years
Recently I have been trying to train n-gram entities with Stanford CoreNLP. I followed this tutorial: http://nlp.stanford.edu/software/crf-faq.shtml#b

With this, I am able to specify only unigram tokens and the class each belongs to. Can anyone guide me on how to extend it to n-grams? I am trying to extract known entities, such as movie names, from a chat data set.

Please let me know if I have misinterpreted the Stanford tutorial and the same approach can in fact be used for n-gram training.

What I am stuck with is the following property:
```
#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1
```
Here the first column is the word (unigram) and the second column is the entity, for example:
```
CHAPTER O
I   O
Emma    PERS
Woodhouse   PERS
```
Training known single-word entities (say, movie names such as Hulk or Titanic) as movies is easy with this approach. But what is the best approach when I need to train multi-word names such as "I know what you did last summer" or "Baby's day out"?
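Following the IO convention described in Solution 2, a multi-word name can be written in this two-column format with one token per line, every token of the name carrying the same label. The plain-Java sketch below shows the idea; the class name, the `toColumns` helper, and the MOVIE label are assumptions, and the whitespace split is only a stand-in for the real tokenizer, which may split some words differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnFormatSketch {

    /** Labels every token of each occurrence of the phrase; all other tokens get O. */
    static List<String> toColumns(String sentence, String phrase, String label) {
        String[] tokens = sentence.trim().split("\\s+");
        String[] phraseTokens = phrase.trim().split("\\s+");
        String[] labels = new String[tokens.length];
        Arrays.fill(labels, "O");
        for (int i = 0; i + phraseTokens.length <= tokens.length; i++) {
            String[] window = Arrays.copyOfRange(tokens, i, i + phraseTokens.length);
            if (Arrays.equals(window, phraseTokens)) {
                // Same label on every token of the multi-word name.
                Arrays.fill(labels, i, i + phraseTokens.length, label);
                i += phraseTokens.length - 1; // continue after the matched phrase
            }
        }
        List<String> lines = new ArrayList<>();
        for (int t = 0; t < tokens.length; t++) {
            lines.add(tokens[t] + "\t" + labels[t]);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : toColumns("We watched Baby's day out", "Baby's day out", "MOVIE")) {
            System.out.println(line);
        }
    }
}
```

Because adjacent tokens of the same class are read as one entity, the three MOVIE lines are treated as a single movie name.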
- Khalid Usman over 7 years
  
  Dear @Arun, did you succeed in training NER for n-grams? I want to train education entities, such as Master in Science : EDUCATION, PhD in Electronics : EDUCATION. Can you guide me? Thanks
- Arun A K over 7 years
  
  @KhalidUsman, thanks for reaching out. I used LingPipe, as described in my answer, to achieve this. It worked very well with a decent volume of training data. Any model will only be as good as the data set you give it to learn from.
Arun A K about 11 years

tech.groups.yahoo.com/group/LingPipe/message/68 provides more information on the corpus preparation.
Arun A K almost 11 years

Thanks @Chris, Let me try creating a new model with this encoding format.
Neil McGuigan over 10 years

@ChristopherManning how do I enable IOB encoding in NER? Thx
Christopher Manning over 10 years

I provide a discussion of options for IOB encoding in my answer to this question: stackoverflow.com/questions/21469082/…
lulu about 10 years

I also tried the same code. Can you please mention how you prepared the training set? I added this as a text file and tried to add my own entity, but it's not working. Please help me; I don't know if I have misunderstood the training set format.
lulu about 10 years

The <ENAMEX TYPE="ORGANIZATION">USAir</ENAMEX> flight attendant in the rear of the plane making a short flight to <ENAMEX TYPE="LOCATION">Charlotte</ENAMEX>, <ENAMEX TYPE="LOCATION">N.C.</ENAMEX>, kept peeking around the corner of a seat in Row 21, making 9-month-old <ENAMEX TYPE="PERSON">Danasia Brown</ENAMEX> laugh.
Arun A K about 10 years

The training set I used is in the same format that I described above. You would need quite a lot of data for the model to 'learn', probably some news articles or wiki pages in well-formed sentences.
Arun A K about 10 years

Please check out the entire discussions at groups.yahoo.com/neo/groups/LingPipe/conversations/topics/68
lulu about 10 years

Thank you very much, Arun. I got it. One more doubt: currently this program identifies only one user-defined entity. Can I make it identify all entities in a text?
Arun A K about 10 years

Yes you can... please go ahead and add as many entities as you want in the same input file.
lulu about 10 years

I have added DAY as my entity, with many examples, but if I give <ENAMEX>Tuesday</ENAMEX> as input, it incorrectly shows LOCATION instead of DAY. If the same word, e.g. DELHI, appears more than once in a document, does it need to be tagged as LOCATION each time? I added many examples to the training set, but when I give input that was already in the training set, it sometimes gives two outputs, DAY and LOCATION. I don't know what went wrong.
lulu about 10 years

Is it mandatory that each news item we add is put inside its own <s>...</s> tags, or is one common <s> tag enough? I am not getting the correct output for some entities.
lulu about 10 years

let us continue this discussion in chat
chopss almost 10 years

@ArunAK can you please show a small snippet of your training set? My program is not identifying entities, and I think it may be because of a fault in the training set.
Arun A K almost 10 years

@chopu: what format have you used? Can you validate against chat.stackoverflow.com/rooms/51072/… All you need is a file with a start and end tag such as <root> and </root>, with each sentence between <s> and </s>. Whatever entity you want to 'teach' should go between the ENAMEX tags.
Arun A K almost 10 years

Try to download some tagged data sets, because hand-prepared ones would be too meager for it to learn from. Basically it is expected to learn from context, or from features, where features could be adjacent words, upper/lower casing, punctuation etc. So real-world data would be a better choice.
chopss almost 10 years

<root><s> The burglar used weapons like <ENAMEX TYPE="WEAPON">riffles</ENAMEX></s>.<s> Policemen are seen working in a jewellery store that was attacked using <ENAMEX TYPE="WEAPON">pistols</ENAMEX> .</s></root>
chopss almost 10 years

I want to identify weapons in an input. The above is a small snippet of my training set. The problem is that it identifies only some weapons, and if more than one weapon is present it does not identify them all.
Arun A K almost 10 years

@chopu - there is no guarantee with such a small data size. The LingPipe Yahoo forum had a discussion on training data set size.
Khalid Usman over 7 years

@ArunAK It is my first ever project in this field. Would you be willing to guide me over Skype or email? Email: [email protected]. Your guidance will be appreciated. Thanks
Khalid Usman over 7 years

@ArunAK I used your code above and I get the following issue: "Muc6ChunkParser cannot be resolved to a type"
Khalid Usman over 7 years

@ArunAK How did you give the input in the text file? It is working fine now on the genetag example, but not on my custom input in a text file:
Master of Science in Biomedical Sciences EDUCATION
Major in Research EDUCATION
Bachelor of Science (B.S.) EDUCATION
Biomedical Sciences EDUCATION
PhD EDUCATION
Master EDUCATION
Graduated Nursing EDUCATION
Post-graduate degree EDUCATION
Bachelor EDUCATION
Bachelor's degree - RN EDUCATION
Master of Science (MSc) EDUCATION