Lemmatization java

38,087

Solution 1

The Stanford CoreNLP Java library contains a lemmatizer that is a little resource intensive but I have run it on my laptop with <512MB of RAM.

To use it:

  1. Download the jar files;
  2. Create a new project in your editor of choice/make an ant script that includes all of the jar files contained in the archive you just downloaded;
  3. Create a new Java as shown below (based upon the snippet from Stanford's site);
import java.util.Properties;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        // StanfordCoreNLP loads a lot of models, so you probably
        // only want to do this once per execution
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        // run all Annotators on this text
        this.pipeline.annotate(document);

        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for(CoreMap sentence: sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }

        return lemmas;
    }
}

Solution 2

Chris's answer regarding the Standford Lemmatizer is great! Absolutely beautiful. He even included a pointer to the jar files, so I didn't have to google for it.

But one of his lines of code had a syntax error (he somehow switched the ending closing parentheses and semicolon in the line that begins with "lemmas.add...), and he forgot to include the imports.

As far as the NoSuchMethodError error, it's usually caused by that method not being made public static, but if you look at the code itself (at http://grepcode.com/file/repo1.maven.org/maven2/com.guokr/stan-cn-nlp/0.0.2/edu/stanford/nlp/util/Generics.java?av=h) that is not the problem. I suspect that the problem is somewhere in the build path (I'm using Eclipse Kepler, so it was no problem configuring the 33 jar files that I use in my project).

Below is my minor correction of Chris's code, along with an example (my apologies to Evanescence for butchering their perfect lyrics):

import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        /*
         * This is a pipeline that takes in a string and returns various analyzed linguistic forms. 
         * The String is tokenized via a tokenizer (such as PTBTokenizerAnnotator), 
         * and then other sequence model style annotation can be used to add things like lemmas, 
         * POS tags, and named entities. These are returned as a list of CoreLabels. 
         * Other analysis components build and store parse trees, dependency graphs, etc. 
         * 
         * This class is designed to apply multiple Annotators to an Annotation. 
         * The idea is that you first build up the pipeline by adding Annotators, 
         * and then you take the objects you wish to annotate and pass them in and 
         * get in return a fully annotated object.
         * 
         *  StanfordCoreNLP loads a lot of models, so you probably
         *  only want to do this once per execution
         */
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();
        // Create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);
        // run all Annotators on this text
        this.pipeline.annotate(document);
        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for(CoreMap sentence: sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the
                // list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }
        return lemmas;
    }


    public static void main(String[] args) {
        System.out.println("Starting Stanford Lemmatizer");
        String text = "How could you be seeing into my eyes like open doors? \n"+
                "You led me down into my core where I've became so numb \n"+
                "Without a soul my spirit's sleeping somewhere cold \n"+
                "Until you find it there and led it back home \n"+
                "You woke me up inside \n"+
                "Called my name and saved me from the dark \n"+
                "You have bidden my blood and it ran \n"+
                "Before I would become undone \n"+
                "You saved me from the nothing I've almost become \n"+
                "You were bringing me to life \n"+
                "Now that I knew what I'm without \n"+
                "You can've just left me \n"+
                "You breathed into me and made me real \n"+
                "Frozen inside without your touch \n"+
                "Without your love, darling \n"+
                "Only you are the life among the dead \n"+
                "I've been living a lie, there's nothing inside \n"+
                "You were bringing me to life.";
        StanfordLemmatizer slem = new StanfordLemmatizer();
        System.out.println(slem.lemmatize(text));
    }

}

Here is my results (I was very impressed; it caught "'s" as "is" (sometimes), and did almost everything else perfectly):

Starting Stanford Lemmatizer

Adding annotator tokenize

Adding annotator ssplit

Adding annotator pos

Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.7 sec].

Adding annotator lemma

[how, could, you, be, see, into, my, eye, like, open, door, ?, you, lead, I, down, into, my, core, where, I, have, become, so, numb, without, a, soul, my, spirit, 's, sleep, somewhere, cold, until, you, find, it, there, and, lead, it, back, home, you, wake, I, up, inside, call, my, name, and, save, I, from, the, dark, you, have, bid, my, blood, and, it, run, before, I, would, become, undo, you, save, I, from, the, nothing, I, have, almost, become, you, be, bring, I, to, life, now, that, I, know, what, I, be, without, you, can, have, just, leave, I, you, breathe, into, I, and, make, I, real, frozen, inside, without, you, touch, without, you, love, ,, darling, only, you, be, the, life, among, the, dead, I, have, be, live, a, lie, ,, there, be, nothing, inside, you, be, bring, I, to, life, .]

Share:
38,087
Admin
Author by

Admin

Updated on August 08, 2022

Comments

  • Admin
    Admin over 1 year

    I am looking for a lemmatisation implementation for English in Java. I found a few already, but I need something that does not need to much memory to run (1 GB top). Thanks. I do not need a stemmer.