How do I do word Stemming or Lemmatization?

Solution 1

If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.

Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:

>>> import nltk
>>> nltk.download('wordnet')

You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'

There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.

Solution 2

I use Stanford NLP to perform lemmatization. I was stuck on a similar problem for the last few days, and Stack Overflow helped me solve it.

import java.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.CoreMap;

public class example
{
    public static void main(String[] args)
    {
        Properties props = new Properties(); 
        props.put("annotators", "tokenize, ssplit, pos, lemma"); 
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
        String text = "The cats were running."; // sample input; substitute the string you want
        Annotation document = pipeline.process(text);  

        for(CoreMap sentence: document.get(SentencesAnnotation.class))
        {    
            for(CoreLabel token: sentence.get(TokensAnnotation.class))
            {       
                String word = token.get(TextAnnotation.class);      
                String lemma = token.get(LemmaAnnotation.class); 
                System.out.println(word + " -> lemmatized version: " + lemma);
            }
        }
    }
}

It might also be a good idea to filter out stopwords to reduce the number of output lemmas if they are later fed to a classifier. Please take a look at the CoreNLP extension written by John Conwell.

Solution 3

I tried your list of terms on this Snowball demo site and the results look okay:

  • cats -> cat
  • running -> run
  • ran -> ran
  • cactus -> cactus
  • cactuses -> cactus
  • community -> communiti
  • communities -> communiti

A stemmer is supposed to reduce inflected forms of a word to some common root. It's not really a stemmer's job to make that root a 'proper' dictionary word. For that you need to look at morphological/orthographic analysers.
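To make the "common root" idea concrete, here is a minimal sketch in Python. The `toy_stem` function and its suffix rules are made up for illustration (they are not the Porter or Snowball rules), and `group_by_stem` is a hypothetical helper. The only point is that inflected forms end up under one key, even when that key is not a dictionary word.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper for illustration only; NOT Porter or Snowball."""
    word = word.lower()
    # Try the longest suffix first; keep at least three characters of the word.
    for suffix, replacement in (("ies", "i"), ("es", ""), ("s", ""), ("y", "i")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def group_by_stem(words, stem=toy_stem):
    """Group words under their stem, the way a search index might key them."""
    groups = defaultdict(list)
    for w in words:
        groups[stem(w)].append(w)
    return dict(groups)


groups = group_by_stem(["community", "communities", "cats", "cat"])
print(groups)
```

Both "community" and "communities" fall under the non-word key "communiti", which is all an index needs. A real stemmer can be passed in through the `stem` parameter instead of the toy one.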

I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.

Solution 4

The stemmer vs. lemmatizer debate goes on. It's a trade-off between precision and efficiency. You should lemmatize to get linguistically meaningful units, and stem to use minimal computing resources while still indexing a word and its variations under the same key.

See Stemmers vs Lemmatizers

Here's an example with Python NLTK:

>>> sent = "cats running ran cactus cactuses cacti community communities"
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>>
>>> port = PorterStemmer()
>>> " ".join([port.stem(i) for i in sent.split()])
'cat run ran cactu cactus cacti commun commun'
>>>
>>> wnl = WordNetLemmatizer()
>>> " ".join([wnl.lemmatize(i) for i in sent.split()])
'cat running ran cactus cactus cactus community community'

Solution 5

Martin Porter's official page contains a Porter Stemmer in PHP as well as other languages.

If you're really serious about good stemming, though, you're going to need to start with something like the Porter algorithm, refine it by adding rules to fix incorrect cases common to your dataset, and then finally add a lot of exceptions to the rules. This can be easily implemented with key/value pairs (dbm/hash/dictionaries) where the key is the word to look up and the value is the stemmed word to replace the original. A commercial search engine I worked on once ended up with some 800 exceptions to a modified Porter algorithm.
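The key/value exception idea can be sketched roughly like this in Python. Everything here is an assumption for illustration: `make_stemmer` is a hypothetical helper, `strip_plural_s` is a deliberately naive stand-in for a real (modified) Porter stemmer, and the entries in `EXCEPTIONS` are invented examples rather than a real exception list.

```python
def make_stemmer(base_stem, exceptions):
    """Return a stem function that checks `exceptions` before `base_stem`."""
    def stem(word):
        key = word.lower()
        if key in exceptions:
            # A hand-curated entry always wins over the algorithm.
            return exceptions[key]
        return base_stem(key)
    return stem


def strip_plural_s(word):
    """Naive placeholder stemmer: drop a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Hypothetical exception table: word to look up -> stem to use instead.
EXCEPTIONS = {
    "cacti": "cactus",
    "cactus": "cactus",
    "ran": "run",
    "this": "this",
}

stem = make_stemmer(strip_plural_s, EXCEPTIONS)
for w in ["cats", "cacti", "this", "Ran"]:
    print(w, "->", stem(w))
```

In practice the table would be filled in from the bad stems people spot in the index, and it could live in a dbm file or database rather than an in-memory dict.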

Author: manixrock

Updated on July 08, 2022

Comments

  • manixrock
    manixrock almost 2 years

    I've tried PorterStemmer and Snowball, but neither works on all words; both miss some very common ones.

    My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.

    See also:

  • Chris Pfohl
    Chris Pfohl over 13 years
    Oh sad...before I knew to search S.O. I implemented my own!
  • Mathieu Rodic
    Mathieu Rodic almost 13 years
    Do not forget to install the corpus before using nltk for the first time! velvetcache.org/2010/03/01/…
  • CTsiddharth
    CTsiddharth about 12 years
    Sorry for the late reply. I only got this issue solved now! :)
  • jogojapan
    jogojapan over 11 years
    Welcome to SO, and thanks for your post, +1. It would be great if you could make a few comments on this stemmer's usage, performance etc. Just a link isn't usually considered a very good answer.
  • SexyBeast
    SexyBeast almost 11 years
    Well, this one uses some non-deterministic algorithm like Porter Stemmer, for if you try it with dies, it gives you dy instead of die. Isn't there some kind of hardcoded stemmer dictionary?
  • alvas
    alvas almost 11 years
    Any idea which words WordNetLemmatizer lemmatizes wrongly?
  • Malcolm
    Malcolm almost 11 years
    An ideal solution would learn these exceptions automatically. Have you had any experience with such a system?
  • Van Gale
    Van Gale almost 11 years
    No. In our case the documents being indexed were the code & regulations for a specific area of law and there were dozens of (human) editors analyzing the indexes for any bad stems.
  • mchangun
    mchangun over 10 years
    In terms of performance (execution speed), is Lemmatization much slower than stemming?
  • Adam_G
    Adam_G over 10 years
    The line 'pipeline = new...' does not compile for me. If I change it to 'StanfordCoreNLP pipeline = new...' it compiles. Is this correct?
  • Stompchicken
    Stompchicken about 10 years
    The point is that stem("updates") == stem("update"), which it does (update -> updat)
  • user
    user about 10 years
    The software can do stem(x) == stem(y) but that's not answering the question completely
  • Fashandge
    Fashandge almost 10 years
    nltk's WordNetLemmatizer takes a POS tag as an argument. By default it is 'n' (standing for noun), so it will not work correctly for verbs. If POS tags are not available, a simple (but ad hoc) approach is to lemmatize twice, once with 'n' and once with 'v' (standing for verb), and choose the result that differs from the original word (usually the shorter one, but 'ran' and 'run' have the same length). It seems that we don't need to worry about 'adj', 'adv', 'prep', etc., since they are already in the original form in some sense.
  • Jindra Helcl
    Jindra Helcl almost 10 years
    Careful with the lingo, a stem is not a base form of a word. If you want a base form, you need a lemmatizer. A stem is the largest part of a word that does not contain prefixes or suffixes. The stem of a word update is indeed "updat". The words are created from stems by adding endings and suffixes, e.g. updat-e, or updat-ing. (en.wikipedia.org/wiki/Word_stem)
  • Jindra Helcl
    Jindra Helcl almost 10 years
    Yes, you must declare the pipeline var first. The Stanford NLP can be used from command line as well so you don't have to do any programming, you just make the properties file and feed the executables with it. Read the docs: nlp.stanford.edu/software/corenlp.shtml
  • Sisyphus
    Sisyphus over 9 years
    Without POS, if the input is "has", it outputs "ha", so there are some problems with @Fashandge's method.
  • Nick Ruiz
    Nick Ruiz almost 9 years
    As mentioned before, WordNetLemmatizer's lemmatize() can take a POS tag. So from your example: " ".join([wnl.lemmatize(i, pos=VERB) for i in sent.split()]) gives 'cat run run cactus cactuses cacti community communities'.
  • alvas
    alvas almost 9 years
    @NickRuiz, I think you meant pos=NOUN? BTW: Long time no see, hopefully we'll meet each other in conference soon =)
  • Nick Ruiz
    Nick Ruiz almost 9 years
    Actually, no (hopefully 'yes' to conferences, though). If you set pos=VERB, you only lemmatize the verbs; the nouns remain the same. I just had to write some of my own code to pivot around the actual Penn Treebank POS tags to apply the correct lemmatization to each token. Also, WordNetLemmatizer stinks at lemmatizing the output of nltk's default tokenizer, so examples like "does n't" do not lemmatize to "do not".
  • Lerner Zhang
    Lerner Zhang over 5 years
    but, but port.stem("this") produces thi and port.stem("was") wa, even when the right pos is provided for each.
  • alvas
    alvas over 5 years
    A stemmer doesn't return linguistically sound outputs. It's just there to make the text more "dense" (i.e. contain less vocab). See stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers and stackoverflow.com/questions/51943811/…
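Several comments above mention pivoting on Penn Treebank POS tags so that each token is lemmatized with the right part of speech. A rough sketch of that idea in Python follows. The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, and `stub_lemmatize` is only a stand-in lookup table so the example runs without NLTK or the WordNet corpus. The single-letter codes 'n', 'v', 'a' and 'r' are the WordNet POS values for noun, verb, adjective and adverb.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # JJ, JJR, JJS: adjectives
    if tag.startswith("V"):
        return "v"  # VB, VBD, VBG, VBN, VBP, VBZ: verbs
    if tag.startswith("RB"):
        return "r"  # RB, RBR, RBS: adverbs
    return "n"      # everything else is treated as a noun


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, penn_tag) pairs with any lemmatize(word, pos) callable."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# Stand-in for a real lemmatizer: a tiny hard-coded table, for illustration only.
_STUB = {("cats", "n"): "cat", ("ran", "v"): "run", ("better", "a"): "good"}


def stub_lemmatize(word, pos="n"):
    return _STUB.get((word, pos), word)


tagged = [("cats", "NNS"), ("ran", "VBD"), ("better", "JJR")]
lemmas = lemmatize_tagged(tagged, stub_lemmatize)
print(lemmas)
```

With NLTK installed, the `lemmatize` argument could be `WordNetLemmatizer().lemmatize` and the tagged tokens could come from `nltk.pos_tag`, which needs its own tagger model downloaded first.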