How do I do word Stemming or Lemmatization?
Solution 1
If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.
Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:
>>> import nltk
>>> nltk.download('wordnet')
You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'
There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.
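The second argument in the 'fantasized' call above is a WordNet part-of-speech letter. If your tokens carry Penn Treebank tags instead, a small mapping is needed first. The following is a minimal sketch, not part of NLTK: the helper names penn_to_wordnet and lemmatize_tagged are made up for illustration, and the lemmatizer is passed in as a callable so the snippet does not depend on the WordNet corpus being installed.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun; also the fallback for every other tag


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, penn_tag) pairs using a lemmatize(word, pos) callable."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]
```

With NLTK, lemmatize could be WordNetLemmatizer().lemmatize and the tagged tokens could come from nltk.pos_tag, which needs its own tagger data downloaded first.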
Solution 2
I use Stanford CoreNLP to perform lemmatization. I was stuck on a similar problem for the last few days. Thanks to Stack Overflow for helping me solve the issue.
import java.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.CoreMap;

public class Example
{
    public static void main(String[] args)
    {
        // The lemma annotator needs tokenization, sentence splitting and POS tagging first.
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);

        // Sample input; replace with the string you want to lemmatize.
        String text = "cats running ran";
        Annotation document = pipeline.process(text);

        for (CoreMap sentence : document.get(SentencesAnnotation.class))
        {
            for (CoreLabel token : sentence.get(TokensAnnotation.class))
            {
                String word = token.get(TextAnnotation.class);
                String lemma = token.get(LemmaAnnotation.class);
                System.out.println("lemmatized version of " + word + ": " + lemma);
            }
        }
    }
}
It might also be a good idea to filter out stopwords to reduce the number of output lemmas if they are later fed to a classifier. Take a look at the CoreNLP extension written by John Conwell.
Solution 3
I tried your list of terms on this Snowball demo site and the results look okay:
- cats -> cat
- running -> run
- ran -> ran
- cactus -> cactus
- cactuses -> cactus
- community -> communiti
- communities -> communiti
A stemmer is supposed to reduce inflected forms of words to some common root. It's not really a stemmer's job to make that root a 'proper' dictionary word. For that you need to look at morphological/orthographic analysers.
I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.
Solution 4
The stemmer vs. lemmatizer debate goes on. It's a matter of preferring precision over efficiency, or the other way around. You should lemmatize to obtain linguistically meaningful units, and stem to use minimal computing resources while still indexing a word and its variations under the same key.
Here's an example with Python's NLTK:
>>> sent = "cats running ran cactus cactuses cacti community communities"
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>>
>>> port = PorterStemmer()
>>> " ".join([port.stem(i) for i in sent.split()])
'cat run ran cactu cactus cacti commun commun'
>>>
>>> wnl = WordNetLemmatizer()
>>> " ".join([wnl.lemmatize(i) for i in sent.split()])
'cat running ran cactus cactus cactus community community'
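The indexing use case mentioned in this answer (a word and its variations stored under the same key) can be sketched roughly as follows. This is an illustration under stated assumptions: toy_stem is a hypothetical stand-in that only strips a trailing "s", and build_index and search are made-up helper names. In practice you would pass a real stemmer, for example port.stem from the session above.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in stemmer: lowercase and strip one trailing 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def build_index(docs, stem=toy_stem):
    """Map each stem to the set of document ids containing any of its variants."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query, stem=toy_stem):
    """Look the query up under its stem, so 'cats' also finds 'cat'."""
    return index.get(stem(query), set())
```

The stem only has to be consistent, not a dictionary word: lookups work as long as the query and the indexed variants reduce to the same key.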
Solution 5
Martin Porter's official page contains a Porter Stemmer in PHP as well as other languages.
If you're really serious about good stemming, though, you're going to need to start with something like the Porter algorithm, refine it by adding rules to fix incorrect cases common to your dataset, and then finally add a lot of exceptions to the rules. This can be easily implemented with key/value pairs (dbm/hash/dictionaries) where the key is the word to look up and the value is the stemmed word to replace the original. A commercial search engine I once worked on ended up with some 800 exceptions to a modified Porter algorithm.
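The key/value approach described above might look something like this in Python. It is only a sketch: naive_stem is a hypothetical placeholder for a real Porter-style stemmer, and the entries in EXCEPTIONS are invented for illustration rather than taken from any real exception list.

```python
def naive_stem(word):
    """Hypothetical placeholder for a real rule-based stemmer such as Porter's."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


# Hypothetical exception table: key is the word to look up,
# value is the stem that should replace it.
EXCEPTIONS = {
    "dies": "die",
    "has": "have",
    "cacti": "cactus",
}


def stem(word, exceptions=EXCEPTIONS, fallback=naive_stem):
    """Return the hand-curated stem if one exists, else apply the fallback stemmer."""
    word = word.lower()
    if word in exceptions:
        return exceptions[word]
    return fallback(word)
```

Checking the exception table before the rules means a bad stem can be fixed by adding one entry, without touching the stemmer itself. Any mapping with the same lookup behaviour could stand in for the dict.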
manixrock
Updated on July 08, 2022
Comments
-
manixrock almost 2 years
I've tried PorterStemmer and Snowball, but neither works on all words; both miss some very common ones.
My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.
-
MSalters about 15 years
Shouldn't that be cacti?
-
Renaud Bompuis about 15 years
Just to make a circular reference to the original question posted on Reddit: How do I programmatically do stemming? (e.g. "eating" to "eat", "cactuses" to "cactus") Posting it here because the comments include useful information.
-
Chris Pfohl over 13 years
Oh sad... before I knew to search S.O. I implemented my own!
-
Mathieu Rodic almost 13 years
Do not forget to install the corpus before using nltk for the first time! velvetcache.org/2010/03/01/…
-
CTsiddharth about 12 years
Sorry for the late reply. I only got this issue solved now! :)
-
jogojapan over 11 years
Welcome to SO, and thanks for your post, +1. It would be great if you could make a few comments on this stemmer's usage, performance etc. Just a link isn't usually considered a very good answer.
-
SexyBeast almost 11 years
Well, this one uses some non-deterministic algorithm like Porter Stemmer, for if you try it with "dies", it gives you "dy" instead of "die". Isn't there some kind of hardcoded stemmer dictionary?
-
alvas almost 11 years
Any idea which words WordNetLemmatizer lemmatizes wrongly?
-
Malcolm almost 11 years
An ideal solution would learn these exceptions automatically. Have you had any experience with such a system?
-
Van Gale almost 11 years
No. In our case the documents being indexed were the code & regulations for a specific area of law and there were dozens of (human) editors analyzing the indexes for any bad stems.
-
mchangun over 10 years
In terms of performance (execution speed), is lemmatization much slower than stemming?
-
Adam_G over 10 years
The line 'pipeline = new...' does not compile for me. If I change it to 'StanfordCoreNLP pipeline = new...' it compiles. Is this correct?
-
Stompchicken about 10 years
The point is that stem("updates") == stem("update"), which it does (update -> updat).
-
user about 10 years
The software can do stem(x) == stem(y), but that's not answering the question completely.
-
Fashandge almost 10 years
NLTK's WordNetLemmatizer takes a POS tag as an argument. By default it is 'n' (noun), so it will not work correctly for verbs. If POS tags are not available, a simple (but ad hoc) approach is to lemmatize twice, once with 'n' and once with 'v' (verb), and choose the result that differs from the original word (usually the shorter one, although 'ran' and 'run' have the same length). It seems that we don't need to worry about 'adj', 'adv', 'prep', etc., since they are already in their original form in some sense.
-
Jindra Helcl almost 10 years
Careful with the lingo: a stem is not the base form of a word. If you want a base form, you need a lemmatizer. A stem is the largest part of a word that does not contain prefixes or suffixes. The stem of the word "update" is indeed "updat". Words are created from stems by adding endings and suffixes, e.g. updat-e or updat-ing. (en.wikipedia.org/wiki/Word_stem)
-
Jindra Helcl almost 10 years
Yes, you must declare the pipeline variable first. Stanford NLP can also be used from the command line, so you don't have to do any programming: you just write the properties file and feed it to the executables. Read the docs: nlp.stanford.edu/software/corenlp.shtml
-
Sisyphus over 9 years
Without a POS tag, the input "has" produces "ha", so there are some problems with @Fashandge's method.
-
Nick Ruiz almost 9 years
As mentioned before, WordNetLemmatizer's lemmatize() can take a POS tag. So from your example: " ".join([wnl.lemmatize(i, pos=VERB) for i in sent.split()]) gives 'cat run run cactus cactuses cacti community communities'.
-
alvas almost 9 years
@NickRuiz, I think you meant pos=NOUN? BTW: Long time no see, hopefully we'll meet each other at a conference soon =)
-
Nick Ruiz almost 9 years
Actually, no (hopefully 'yes' to conferences, though). If you set pos=VERB you only do lemmatization on verbs. The nouns remain the same. I just had to write some of my own code to pivot around the actual Penn Treebank POS tags to apply the correct lemmatization to each token. Also, WordNetLemmatizer stinks at lemmatizing the output of nltk's default tokenizer. So examples like "does n't" do not lemmatize to "do not".
-
Lerner Zhang over 5 years
But port.stem("this") produces "thi" and port.stem("was") produces "wa", even when the right POS is provided for each.
-
alvas over 5 years
A stemmer doesn't return linguistically sound output. Its purpose is just to make the text more "dense" (i.e. contain a smaller vocabulary). See stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers and stackoverflow.com/questions/51943811/…