How do I do word Stemming or Lemmatization?

Solution 1

If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.

Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:

>>> import nltk
>>> nltk.download('wordnet')

You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'

There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.

Solution 2

I use Stanford CoreNLP to perform lemmatization. I was stuck on a similar problem for the last few days. Thanks to Stack Overflow for helping me solve the issue.

import java.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.CoreMap;

public class Example
{
    public static void main(String[] args)
    {
        // The lemma annotator needs tokenization, sentence splitting and POS tags first.
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
        String text = "The cats were running."; // replace with the string you want
        Annotation document = pipeline.process(text);

        // Walk every token of every sentence and print its lemma.
        for (CoreMap sentence : document.get(SentencesAnnotation.class))
        {
            for (CoreLabel token : sentence.get(TokensAnnotation.class))
            {
                String word = token.get(TextAnnotation.class);
                String lemma = token.get(LemmaAnnotation.class);
                System.out.println("lemmatized version of " + word + ": " + lemma);
            }
        }
    }
}

It might also be a good idea to filter out stopwords to reduce the number of output lemmas if they are later fed to a classifier. Please take a look at the CoreNLP extension written by John Conwell.

Solution 3

I tried your list of terms on this Snowball demo site and the results look okay:

cats -> cat
running -> run
ran -> ran
cactus -> cactus
cactuses -> cactus
community -> communiti
communities -> communiti

A stemmer is supposed to reduce inflected forms of words to some common root. It's not really a stemmer's job to make that root a 'proper' dictionary word. For that you need to look at morphological/orthographic analysers.

I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.

Solution 4

The stemmer vs. lemmatizer debate goes on. It's a matter of preferring precision over efficiency: lemmatize to get linguistically meaningful units, or stem to use minimal computing juice while still indexing a word and its variations under the same key.

See Stemmers vs Lemmatizers

Here's an example with Python's NLTK:

>>> sent = "cats running ran cactus cactuses cacti community communities"
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>>
>>> port = PorterStemmer()
>>> " ".join([port.stem(i) for i in sent.split()])
'cat run ran cactu cactus cacti commun commun'
>>>
>>> wnl = WordNetLemmatizer()
>>> " ".join([wnl.lemmatize(i) for i in sent.split()])
'cat running ran cactus cactus cactus community community'
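
To illustrate the "index a word and its variations under the same key" use case, here is a minimal sketch. `build_index` is a hypothetical helper name, not part of NLTK; it accepts any function that maps a word to its stem, so the choice of stemmer is left to the caller.

```python
from collections import defaultdict


def build_index(words, stem):
    """Group words under the key produced by `stem` (any word -> stem callable)."""
    index = defaultdict(set)
    for word in words:
        index[stem(word)].add(word)
    return dict(index)
```

With NLTK installed, `port.stem` from the example above could be passed as `stem`. The keys will then be stems such as `commun`, which is fine for lookup purposes even though they are not dictionary words.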

Solution 5

Martin Porter's official page contains a Porter Stemmer in PHP as well as other languages.

If you're really serious about good stemming, though, you're going to need to start with something like the Porter algorithm, refine it by adding rules to fix incorrect cases common to your dataset, and then finally add a lot of exceptions to the rules. This can be easily implemented with key/value pairs (dbm/hash/dictionaries), where the key is the word to look up and the value is the stemmed word that replaces the original. A commercial search engine I worked on once ended up with some 800 exceptions to a modified Porter algorithm.
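
The rules-plus-exceptions idea can be sketched roughly like this. It is only an illustration: `toy_stem` is a crude, hypothetical stand-in for a real Porter implementation, and the entries in `EXCEPTIONS` are made-up examples, not the actual list from that search engine.

```python
# Hypothetical exception table: word -> stem that overrides the rules.
EXCEPTIONS = {
    "ran": "run",
    "cacti": "cactus",
    "dies": "die",
}

# Toy suffix rules standing in for a real Porter stemmer (assumption).
SUFFIX_RULES = [("ies", "i"), ("sses", "ss"), ("es", ""), ("s", "")]


def toy_stem(word):
    """Apply the first matching suffix rule; a crude placeholder for Porter."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: len(word) - len(suffix)] + replacement
    return word


def stem(word, base_stemmer=toy_stem, exceptions=EXCEPTIONS):
    """Check the exception table first, then fall back to the rule-based stemmer."""
    word = word.lower()
    if word in exceptions:
        return exceptions[word]
    return base_stemmer(word)
```

Because the lookup happens before the rules run, fixing a bad stem only means adding one more key/value pair; the rule set itself stays untouched.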

manixrock

Updated on July 08, 2022

Comments

manixrock almost 2 years
I've tried PorterStemmer and Snowball, but neither works on all words; both miss some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.

See also:
- Stemming algorithm that produces real words
- Stemming - code examples or open source projects?
MSalters about 15 years

Shouldn't that be cacti ?
Renaud Bompuis about 15 years

Just to make a circular reference to the original question posted on Reddit: How do I programmatically do stemming? (e.g. "eating" to "eat", "cactuses" to "cactus") Posting it here because the comments include useful information.
alvas about 10 years

see stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers
Chris Pfohl over 13 years

Oh sad...before I knew to search S.O. I implemented my own!
Mathieu Rodic almost 13 years

Do not forget to install the corpus before using nltk for the first time! velvetcache.org/2010/03/01/…
CTsiddharth about 12 years

Sorry for the late reply. I only got this issue solved now! :)
jogojapan over 11 years

Welcome to SO, and thanks for your post, +1. It would be great if you could make a few comments on this stemmer's usage, performance etc. Just a link isn't usually considered a very good answer.
SexyBeast almost 11 years

Well, this one uses some heuristic, rule-based algorithm like the Porter Stemmer: if you try it with "dies", it gives you "dy" instead of "die". Isn't there some kind of hardcoded stemmer dictionary?
alvas almost 11 years

Any idea which words WordNetLemmatizer lemmatizes wrongly?
Malcolm almost 11 years

An ideal solution would learn these exceptions automatically. Have you had any experience with such a system?
Van Gale almost 11 years

No. In our case the documents being indexed were the code & regulations for a specific area of law and there were dozens of (human) editors analyzing the indexes for any bad stems.
mchangun over 10 years

In terms of performance (execution speed), is Lemmatization much slower than stemming?
Adam_G over 10 years

The line 'pipeline = new...' does not compile for me. If I change it to 'StanfordCoreNLP pipeline = new...' it compiles. Is this correct?
Stompchicken about 10 years

The point is that stem("updates") == stem("update"), which it does (update -> updat)
user about 10 years

The software can do stem(x) == stem(y) but that's not answering the question completely
Fashandge almost 10 years

nltk WordNetLemmatizer requires a pos tag as argument. By default it is 'n' (standing for noun). So it will not work correctly for verbs. If POS tags are not available, a simple (but ad-hoc) approach is to do lemmatization twice, one for 'n', and the other for 'v' (standing for verb), and choose the result that is different from the original word (usually shorter in length, but 'ran' and 'run' have the same length). It seems that we don't need to worry about 'adj', 'adv', 'prep', etc, since they are already in the original form in some sense.
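
A rough sketch of that "try noun, then verb" fallback is below. `lemmatize_without_pos` is a hypothetical helper name; it takes any callable with a `(word, pos)` signature, and the POS order is an assumption you may want to change.

```python
def lemmatize_without_pos(word, lemmatize, pos_order=("n", "v")):
    """Try each POS in turn and return the first lemma that differs from the word.

    `lemmatize` is any callable taking (word, pos), for example the
    `lemmatize` method of an NLTK WordNetLemmatizer instance.
    """
    for pos in pos_order:
        lemma = lemmatize(word, pos)
        if lemma != word:
            return lemma
    # No POS changed the word, so treat it as already being in base form.
    return word
```

With NLTK and the WordNet corpus installed, `WordNetLemmatizer().lemmatize` could be passed as `lemmatize`. This remains an ad-hoc heuristic: whichever POS is tried first wins, so a word that has a spurious noun reading can still come out wrong.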
Jindra Helcl almost 10 years

Careful with the lingo, a stem is not a base form of a word. If you want a base form, you need a lemmatizer. A stem is the largest part of a word that does not contain prefixes or suffixes. The stem of a word update is indeed "updat". The words are created from stems by adding endings and suffixes, e.g. updat-e, or updat-ing. (en.wikipedia.org/wiki/Word_stem)
Jindra Helcl almost 10 years

Yes, you must declare the pipeline var first. The Stanford NLP can be used from command line as well so you don't have to do any programming, you just make the properties file and feed the executables with it. Read the docs: nlp.stanford.edu/software/corenlp.shtml
Sisyphus over 9 years

Without a POS tag, if the input is "has", the output is "ha", so there are some problems with @Fashandge's method.
Nick Ruiz almost 9 years

As mentioned before, WordNetLemmatizer's lemmatize() can take a POS tag. So from your example: " ".join([wnl.lemmatize(i, pos=VERB) for i in sent.split()]) gives 'cat run run cactus cactuses cacti community communities'.
alvas almost 9 years

@NickRuiz, I think you meant pos=NOUN? BTW: Long time no see, hopefully we'll meet each other in conference soon =)
Nick Ruiz almost 9 years

Actually, no (hopefully 'yes' to conferences, though). If you set pos=VERB, you only lemmatize verbs; the nouns remain the same. I just had to write some of my own code to pivot around the actual Penn Treebank POS tags to apply the correct lemmatization to each token. Also, WordNetLemmatizer stinks at lemmatizing the output of nltk's default tokenizer, so examples like "does n't" do not lemmatize to "do not".
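
For reference, pivoting from Penn Treebank tags to WordNet POS letters might look like the sketch below. This is not the commenter's code; `penn_to_wordnet` and `lemmatize_tagged` are hypothetical names, and falling back to noun for unmapped tags is an assumption.

```python
def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank POS tag to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("NN"):
        return "n"  # noun
    if tag.startswith("RB"):
        return "r"  # adverb
    # Anything else (prepositions, determiners, ...) uses the fallback (assumption).
    return default


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (token, penn_tag) pairs using a (word, pos) callable."""
    return [lemmatize(token, penn_to_wordnet(tag)) for token, tag in tagged_tokens]
```

The `(token, tag)` pairs could come from a tagger such as `nltk.pos_tag`, and `lemmatize` could be the `lemmatize` method of a WordNetLemmatizer instance.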
Lerner Zhang over 5 years

But port.stem("this") produces "thi" and port.stem("was") produces "wa", even when the right POS is provided for each.
alvas over 5 years

A stemmer doesn't return linguistically sound output. It just makes the text more "dense" (i.e. it contains a smaller vocabulary). See stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers and stackoverflow.com/questions/51943811/…