How to perform Lemmatization in R?
Solution 1
Hello you can try package koRpus
which allow to use Treetagger :
tagged.results <- treetag(c("run", "ran", "running"), treetagger="manual", format="obj",
TT.tknz=FALSE , lang="en",
TT.options=list(path="./TreeTagger", preset="en"))
[email protected]
## token tag lemma lttr wclass desc stop stem
## 1 run NN run 3 noun Noun, singular or mass NA NA
## 2 ran VVD run 3 verb Verb, past tense NA NA
## 3 running VVG run 7 verb Verb, gerund or present participle NA NA
See the lemma
column for the result you're asking for.
Solution 2
As a previous post mentioned, the function lemmatize_words() from the R package textstem can perform this and give you what I understand as your desired results:
library(textstem)
vector <- c("run", "ran", "running")
lemmatize_words(vector)
## [1] "run" "run" "run"
Solution 3
@Andy and @Arunkumar are correct when they say textstem library can be used to perform stemming and/or lemmatization. However, lemmatize_words() will only work on a vector of words. But in a corpus, we do not have vector of words; we have strings, with each string being a document's content. Hence, to perform lemmatization on a corpus, you can use function lemmatize_strings() as an argument to tm_map() of tm package.
> corpus[[1]]
[1] " earnest roughshod document serves workable primer regions recent history make
terrific th-grade learning tool samuel beckett applied iranian voting process bard
black comedy willie loved another trumpet blast may new mexican cinema -bornin "
> corpus <- tm_map(corpus, lemmatize_strings)
> corpus[[1]]
[1] "earnest roughshod document serve workable primer region recent history make
terrific th - grade learn tool samuel beckett apply iranian vote process bard black
comedy willie love another trumpet blast may new mexican cinema - bornin"
Do not forget to run the following line of code after you have done lemmatization:
> corpus <- tm_map(corpus, PlainTextDocument)
This is because in order to create a document-term matrix, you need to have 'PlainTextDocument' type object, which gets changed after you use lemmatize_strings() (to be more specific, the corpus object does not contain content and meta-data of each document anymore - it is now just a structure containing documents' content; this is not the type of object that DocumentTermMatrix() takes as an argument).
Hope this helps!
Solution 4
Maybe stemming is enough for you? Typical natural language processing tasks make do with stemmed texts. You can find several packages from CRAN Task View of NLP: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
If you really do require something more complex, then there's specialized solutsions based on mapping sentences to neural nets. As far as I know, these require massive amount of training data. There is lots of open software created and made available by Stanford NLP Group.
If you really want to dig into the topic, then you can dig through the event archives linked at the same Stanford NLP Group publications section. There's some books on the topic as well.
Solution 5
I think the answers are a bit outdated here. You should be using R package udpipe now - available at https://CRAN.R-project.org/package=udpipe - see https://github.com/bnosac/udpipe or docs at https://bnosac.github.io/udpipe/en
Notice the difference between the word meeting (NOUN) and the word meet (VERB) in the following example when doing lemmatisation and when doing stemming, and the annoying screwing up of the word 'someone' to 'someon' when doing stemming.
library(udpipe)
x <- c(doc_a = "In our last meeting, someone said that we are meeting again tomorrow",
doc_b = "It's better to be good at being the best")
anno <- udpipe(x, "english")
anno[, c("doc_id", "sentence_id", "token", "lemma", "upos")]
#> doc_id sentence_id token lemma upos
#> 1 doc_a 1 In in ADP
#> 2 doc_a 1 our we PRON
#> 3 doc_a 1 last last ADJ
#> 4 doc_a 1 meeting meeting NOUN
#> 5 doc_a 1 , , PUNCT
#> 6 doc_a 1 someone someone PRON
#> 7 doc_a 1 said say VERB
#> 8 doc_a 1 that that SCONJ
#> 9 doc_a 1 we we PRON
#> 10 doc_a 1 are be AUX
#> 11 doc_a 1 meeting meet VERB
#> 12 doc_a 1 again again ADV
#> 13 doc_a 1 tomorrow tomorrow NOUN
#> 14 doc_b 1 It it PRON
#> 15 doc_b 1 's be AUX
#> 16 doc_b 1 better better ADJ
#> 17 doc_b 1 to to PART
#> 18 doc_b 1 be be AUX
#> 19 doc_b 1 good good ADJ
#> 20 doc_b 1 at at SCONJ
#> 21 doc_b 1 being be AUX
#> 22 doc_b 1 the the DET
#> 23 doc_b 1 best best ADJ
lemmatisation <- paste.data.frame(anno, term = "lemma",
group = c("doc_id", "sentence_id"))
lemmatisation
#> doc_id sentence_id
#> 1 doc_a 1
#> 2 doc_b 1
#> lemma
#> 1 in we last meeting , someone say that we be meet again tomorrow
#> 2 it be better to be good at be the best
library(SnowballC)
tokens <- strsplit(x, split = "[[:space:][:punct:]]+")
stemming <- lapply(tokens, FUN = function(x) wordStem(x, language = "en"))
stemming
#> $doc_a
#> [1] "In" "our" "last" "meet" "someon" "said"
#> [7] "that" "we" "are" "meet" "again" "tomorrow"
#>
#> $doc_b
#> [1] "It" "s" "better" "to" "be" "good" "at" "be"
#> [9] "the" "best"
Comments
-
StrikeR over 3 years
This question is a possible duplicate of Lemmatizer in R or python (am, are, is -> be?), but I'm adding it again since the previous one was closed saying it was too broad and the only answer it has is not efficient (as it accesses an external website for this, which is too slow as I have very large corpus to find the lemmas for). So a part of this question will be similar to the above mentioned question.
According to Wikipedia, lemmatization is defined as:
Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.
A simple Google search for lemmatization in R will only point to the package
wordnet
of R. When I tried this package expecting that a character vectorc("run", "ran", "running")
input to the lemmatization function would result inc("run", "run", "run")
, I saw that this package only provides functionality similar togrepl
function through various filter names and a dictionary.An example code from
wordnet
package, which gives maximum of 5 words starting with "car", as the filter name explains itself:filter <- getTermFilter("StartsWithFilter", "car", TRUE) terms <- getIndexTerms("NOUN", 5, filter) sapply(terms, getLemma)
The above is NOT the lemmatization that I'm looking for. What I'm looking for is, using
R
I want to find true roots of the words: (For e.g. fromc("run", "ran", "running")
toc("run", "run", "run")
). -
StrikeR about 9 yearsStemming is what I'm currently using for my corpus, but what I'm really looking for is lemmatization and I want to compare how well the results are going to be improved (sceptical) when I use lemmatization in place of stemming. Thanks for the info though.
-
StrikeR about 9 yearsThanks Victor. That answer helped me. But I'm still working on it. Would like to wait for 2 more days to look for any other solutions and accept this if no better answer is given.
-
Victorp about 9 yearsNo problem, I understand, use an external software can be tricky.
-
wordsforthewise over 6 yearsThat's not really a very good code example. It's not even formatted properly. How do you apply it to a VCorpus object?