R stemming a string/document/corpus
Solution 1
The RTextTools package on CRAN allows you to do this.
library(RTextTools)
worder1<- c("I am taking","these are the samples",
"He speaks differently","This is distilled","It was placed")
df1 <- data.frame(id=1:5, words=worder1)
matrix <- create_matrix(df1, stemWords=TRUE, removeStopwords=FALSE, minWordLength=2)
colnames(matrix) # SEE THE STEMMED TERMS
This returns a DocumentTermMatrix that can be used with package tm. You can play around with the other parameters (e.g. removing stopwords, changing the minimum word length, using a stemmer for a different language) to get the results you need. When displayed with as.matrix, the example produces the following term matrix:
                        Terms
Docs                    am are differ distil he is it place sampl speak take the these this was
1 I am taking            1   0      0      0  0  0  0     0     0     0    1   0     0    0   0
2 these are the samples  0   1      0      0  0  0  0     0     1     0    0   1     1    0   0
3 He speaks differently  0   0      1      0  1  0  0     0     0     1    0   0     0    0   0
4 This is distilled      0   0      0      1  0  1  0     0     0     0    0   0     0    1   0
5 It was placed          0   0      0      0  0  0  1     1     0     0    0   0     0    0   1
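As a rough sketch of the parameter tweaking mentioned above, the call below also drops English stopwords and raises the minimum word length. It assumes RTextTools is installed and that create_matrix accepts the language, removeStopwords, minWordLength and stemWords arguments the way the version used here does; dtm_nostop is a name made up for this example.

```r
library(RTextTools)

worder1 <- c("I am taking", "these are the samples",
             "He speaks differently", "This is distilled", "It was placed")

# Same call as above, but also drop English stopwords and terms shorter
# than three characters. dtm_nostop is a hypothetical variable name.
dtm_nostop <- create_matrix(worder1,
                            language = "english",
                            removeStopwords = TRUE,
                            minWordLength = 3,
                            stemWords = TRUE)

colnames(dtm_nostop)    # stemmed terms that survived the filters
as.matrix(dtm_nostop)   # dense view, one row per document
```

Comparing colnames() before and after a change like this is a quick way to see what each parameter actually removes.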
Solution 2
This works in R as expected with tm version 0.6. You had a few minor errors that prevented the stemming from working correctly; perhaps they're from an older version of tm? Anyway, here's how to make it work:
require(RWeka)
require(tm)
The stemming package is not your Snowball but SnowballC:
require(SnowballC)
worder1<- c("I am taking","these are the samples",
"He speaks differently","This is distilled","It was placed")
df1 <- data.frame(id=1:5, words=worder1)
corp1 <- Corpus(VectorSource(df1$words))
inspect(corp1)
Change your SnowballStemmer to stemDocument in the next line like so:
corp1 <- tm_map(corp1, stemDocument)
inspect(corp1)
Words are stemmed, as expected:
<<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
I am take
[[2]]
<<PlainTextDocument (metadata: 7)>>
these are the sampl
[[3]]
<<PlainTextDocument (metadata: 7)>>
He speak differ
[[4]]
<<PlainTextDocument (metadata: 7)>>
This is distil
[[5]]
<<PlainTextDocument (metadata: 7)>>
It was place
Now do the term document matrix:
corp1 <- Corpus(VectorSource(df1$words))
In the control list, change your stemDocument to stemming:
tdm1 <- TermDocumentMatrix(corp1, control=list(stemming=TRUE))
as.matrix(tdm1)
And we get a tdm of stemmed words, as expected:
        Docs
Terms   1 2 3 4 5
are     0 1 0 0 0
differ  0 0 1 0 0
distil  0 0 0 1 0
place   0 0 0 0 1
sampl   0 1 0 0 0
speak   0 0 1 0 0
take    1 0 0 0 0
the     0 1 0 0 0
these   0 1 0 0 0
this    0 0 0 1 0
was     0 0 0 0 1
So there you go. Perhaps a more careful reading of the tm docs might have saved a bit of your time with this ;)
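Note that the tdm above has no rows for am, he, is or it. That is what I'd expect from tm's default minimum term length of three characters. Here is a minimal sketch, assuming tm and SnowballC are installed, of lowering that limit through the wordLengths control option. The names tdm_default and tdm_short are made up for illustration.

```r
library(tm)
library(SnowballC)

worder1 <- c("I am taking", "these are the samples",
             "He speaks differently", "This is distilled", "It was placed")

# VCorpus is used explicitly so the example does not depend on what
# Corpus() returns in a given tm version.
corp1 <- VCorpus(VectorSource(worder1))

# Default control: stemming on, terms shorter than 3 characters dropped.
tdm_default <- TermDocumentMatrix(corp1, control = list(stemming = TRUE))

# Same, but keep terms of 2 or more characters.
tdm_short <- TermDocumentMatrix(corp1,
                                control = list(stemming = TRUE,
                                               wordLengths = c(2, Inf)))

Terms(tdm_default)   # stemmed terms, short words missing
Terms(tdm_short)     # should now also list short terms such as "am"
```

Whether you want the short terms depends on the analysis; for frequency counts they are often removed as stopwords anyway.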
Solution 3
Yes, for stemming the words of the documents in a Corpus you need the RWeka, Snowball and tm packages.
Use the following commands:
library(tm)
# Suppose your documents are in "F:/St"; the next command reads that directory.
# "/st" is the path of your directory.
a <- Corpus(DirSource("/st"),
            readerControl=list(language="english"))
a <- tm_map(a, stemDocument, language="english")
inspect(a)
This should give you the desired result.
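If you just want to check what the stemmer does to a few strings, without building a corpus from a directory, something like the sketch below may be enough. It assumes SnowballC is installed and calls its wordStem function directly. stem_string is a helper made up for this example, not part of tm or SnowballC, and it splits on whitespace only, so punctuation is not handled.

```r
library(SnowballC)

# Hypothetical helper: stem every whitespace-separated word of each string
# and glue the stems back together. Case is left untouched.
stem_string <- function(x, language = "english") {
  tokens <- strsplit(x, "[[:space:]]+")
  vapply(tokens,
         function(w) paste(wordStem(w, language = language), collapse = " "),
         character(1))
}

worder1 <- c("I am taking", "these are the samples",
             "He speaks differently", "This is distilled", "It was placed")

stem_string(worder1)
```

The result is a plain character vector, so it can be passed on to VectorSource if a corpus is needed later.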
Comments
screechOwl almost 2 years
I'm trying to do some stemming in R but it only seems to work on individual documents. My end goal is a term document matrix that shows the frequency of each term in the document.
Here's an example:
require(RWeka)
require(tm)
require(Snowball)
worder1<- c("I am taking","these are the samples",
"He speaks differently","This is distilled","It was placed")
df1 <- data.frame(id=1:5, words=worder1)
> df1
  id                 words
1  1           I am taking
2  2 these are the samples
3  3 He speaks differently
4  4     This is distilled
5  5         It was placed
This method works for the stemming part but not the term document matrix part:
> corp1 <- Corpus(VectorSource(df1$words))
> inspect(corp1)
A corpus with 5 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID
[[1]]
I am taking
[[2]]
these are the samples
[[3]]
He speaks differently
[[4]]
This is distilled
[[5]]
It was placed
> corp1 <- tm_map(corp1, SnowballStemmer)
> inspect(corp1)
A corpus with 5 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID
[[1]]
[1] I am tak
[[2]]
[1] these are the sampl
[[3]]
[1] He speaks differ
[[4]]
[1] This is distil
[[5]]
[1] It was plac
> class(corp1)
[1] "VCorpus" "Corpus"  "list"
> tdm1 <- TermDocumentMatrix(corp1)
Error in UseMethod("Content", x) :
  no applicable method for 'Content' applied to an object of class "character"
So instead I tried creating the term document matrix first but this time the words don't get stemmed:
> corp1 <- Corpus(VectorSource(df1$words))
> tdm1 <- TermDocumentMatrix(corp1, control=list(stemDocument=TRUE))
> as.matrix(tdm1)
             Docs
Terms        1 2 3 4 5
differently  0 0 1 0 0
are          0 1 0 0 0
distilled    0 0 0 1 0
placed       0 0 0 0 1
samples      0 1 0 0 0
speaks       0 0 1 0 0
taking       1 0 0 0 0
the          0 1 0 0 0
these        0 1 0 0 0
this         0 0 0 1 0
was          0 0 0 0 1
Here the words are obviously not stemmed.
Any suggestions?