R stemming a string/document/corpus


Solution 1

The RTextTools package on CRAN allows you to do this.

library(RTextTools)
worder1<- c("I am taking","these are the samples",
"He speaks differently","This is distilled","It was placed")
df1 <- data.frame(id=1:5, words=worder1)

# Named dtm rather than matrix to avoid masking base::matrix
dtm <- create_matrix(df1, stemWords=TRUE, removeStopwords=FALSE, minWordLength=2)
colnames(dtm) # see the stemmed terms

This returns a DocumentTermMatrix that can be used with the tm package. You can adjust the other parameters (e.g. removing stopwords, changing the minimum word length, using a stemmer for a different language) to get the results you need. When converted with as.matrix() and displayed, the example produces the following term matrix:

                         Terms
Docs                      am are differ distil he is it place sampl speak take the these this was
  1 I am taking            1   0      0      0  0  0  0     0     0     0    1   0     0    0   0
  2 these are the samples  0   1      0      0  0  0  0     0     1     0    0   1     1    0   0
  3 He speaks differently  0   0      1      0  1  0  0     0     0     1    0   0     0    0   0
  4 This is distilled      0   0      0      1  0  1  0     0     0     0    0   0     0    1   0
  5 It was placed          0   0      0      0  0  0  1     1     0     0    0   0     0    0   1
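If you would rather not depend on RTextTools, roughly the same matrix can be built with tm alone. This is a minimal sketch, assuming tm 0.6 or later with SnowballC installed; the wordLengths setting is my assumption for how to mirror minWordLength=2, and the object names corp and stemmed_terms are only illustrative.

```r
library(tm)         # VCorpus, VectorSource, DocumentTermMatrix, Terms
library(SnowballC)  # stemmer backend that tm calls when stemming = TRUE

worder1 <- c("I am taking", "these are the samples",
             "He speaks differently", "This is distilled", "It was placed")

corp <- VCorpus(VectorSource(worder1))

# stemming = TRUE stems each token while the matrix is built;
# wordLengths = c(2, Inf) keeps two-letter words (assumed to mirror
# minWordLength = 2 in the RTextTools call).
dtm_tm <- DocumentTermMatrix(corp,
                             control = list(stemming = TRUE,
                                            wordLengths = c(2, Inf)))

stemmed_terms <- Terms(dtm_tm)
print(stemmed_terms)
```

Unlike the create_matrix() call, this uses only the words column, so the id values do not end up in the document names.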

Solution 2

This works in R as expected with tm version 0.6. You had a few minor errors that prevented the stemming from working correctly; perhaps they come from an older version of tm? Anyway, here's how to make it work:

require(RWeka)
require(tm)

The stemming package to load is SnowballC, not the Snowball you used:

require(SnowballC)

worder1<- c("I am taking","these are the samples",
            "He speaks differently","This is distilled","It was placed")
df1 <- data.frame(id=1:5, words=worder1)
corp1 <- Corpus(VectorSource(df1$words))
inspect(corp1)

Change your SnowballStemmer to stemDocument in the next line like so:

corp1 <- tm_map(corp1, stemDocument)
inspect(corp1)

Words are stemmed, as expected:

<<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
I am take

[[2]]
<<PlainTextDocument (metadata: 7)>>
these are the sampl

[[3]]
<<PlainTextDocument (metadata: 7)>>
He speak differ

[[4]]
<<PlainTextDocument (metadata: 7)>>
This is distil

[[5]]
<<PlainTextDocument (metadata: 7)>>
It was place
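The error in the question appears to come from tm_map() being handed a function that returns plain character vectors rather than text documents. If you ever need a custom stemming function instead of stemDocument, one way to keep the corpus intact is to wrap it in content_transformer(). The following is a sketch under that assumption (tm 0.6 or later); stem_text is a hypothetical helper, not part of tm or SnowballC.

```r
library(tm)
library(SnowballC)

# Hypothetical helper: stems every whitespace-separated word of each string.
stem_text <- function(x, language = "english") {
  vapply(strsplit(x, "\\s+"),
         function(w) paste(wordStem(w, language = language), collapse = " "),
         character(1))
}

corp <- VCorpus(VectorSource(c("I am taking", "these are the samples")))

# content_transformer() applies stem_text to each document's content and
# hands back a text document, so later tm functions still work.
corp <- tm_map(corp, content_transformer(stem_text))

# Pull the stemmed text back out of the corpus for a quick look.
stemmed <- vapply(seq_len(length(corp)),
                  function(i) content(corp[[i]]),
                  character(1))
print(stemmed)

# Building a term document matrix on the transformed corpus should not fail.
tdm_custom <- TermDocumentMatrix(corp)
```

The same helper also works on a bare string, e.g. stem_text("He speaks differently"), which covers the single-string case without building a corpus.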

Now do the term document matrix:

corp1 <- Corpus(VectorSource(df1$words))

Change your stemDocument to stemming:

tdm1 <- TermDocumentMatrix(corp1, control=list(stemming=TRUE))
as.matrix(tdm1)

And we get a tdm of stemmed words, as expected:

        Docs
Terms    1 2 3 4 5
  are    0 1 0 0 0
  differ 0 0 1 0 0
  distil 0 0 0 1 0
  place  0 0 0 0 1
  sampl  0 1 0 0 0
  speak  0 0 1 0 0
  take   1 0 0 0 0
  the    0 1 0 0 0
  these  0 1 0 0 0
  this   0 0 0 1 0
  was    0 0 0 0 1
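Since the stated goal is the frequency of each term, the stemmed matrix can be summed across documents. Here is a small sketch, assuming tm 0.6 or later; the three sentences are made-up sample data chosen so that some stems occur more than once.

```r
library(tm)
library(SnowballC)

# Illustrative documents (not from the question) with repeated stems.
docs <- c("I am taking", "He was taking samples", "these are the samples")
corp <- VCorpus(VectorSource(docs))

tdm <- TermDocumentMatrix(corp, control = list(stemming = TRUE))

# Rows are stemmed terms and columns are documents, so row sums give the
# total count of each stem over the whole corpus.
term_freq <- rowSums(as.matrix(tdm))
print(sort(term_freq, decreasing = TRUE))
```

For a large corpus as.matrix() can use a lot of memory, so treat this as a convenience for small examples.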

So there you go. Perhaps a more careful reading of the tm docs might have saved a bit of your time with this ;)

Solution 3

Yes, to stem the words of the documents in a corpus you need the RWeka, Snowball and tm packages.

Use the following commands:

> library(tm)
# Set your directory first. Suppose you have set "F:/St"; then the next command is:
> a <- Corpus(DirSource("/st"),
              readerControl=list(language="english")) # "/st" is the path of your directory
> a <- tm_map(a, stemDocument, language="english")
> inspect(a)

You should then get the result you are after.
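To try the directory approach without touching your own files, the commands above can be run against a throwaway folder. This is a sketch only: the folder name stem_demo and the two file names are made up for the example, and it assumes tm 0.6 or later with SnowballC installed.

```r
library(tm)
library(SnowballC)

# Hypothetical stand-in for your own folder of text files.
doc_dir <- file.path(tempdir(), "stem_demo")
dir.create(doc_dir, showWarnings = FALSE)
writeLines("I am taking", file.path(doc_dir, "doc1.txt"))
writeLines("these are the samples", file.path(doc_dir, "doc2.txt"))

# Read every file in the folder as one document each.
a <- VCorpus(DirSource(doc_dir), readerControl = list(language = "english"))

# Stem each document in place; the corpus structure is preserved.
a <- tm_map(a, stemDocument, language = "english")

stemmed_first <- content(a[[1]])
print(stemmed_first)
```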

Author by

screechOwl

https://financenerd.blog/blog/

Updated on June 17, 2022

Comments

  • screechOwl almost 2 years

    I'm trying to do some stemming in R but it only seems to work on individual documents. My end goal is a term document matrix that shows the frequency of each term in the document.

    Here's an example:

    require(RWeka)
    require(tm)
    require(Snowball)
    
    worder1<- c("I am taking","these are the samples",
    "He speaks differently","This is distilled","It was placed")
    df1 <- data.frame(id=1:5, words=worder1)
    
    > df1
      id                 words
    1  1           I am taking
    2  2 these are the samples
    3  3 He speaks differently
    4  4     This is distilled
    5  5         It was placed
    

    This method works for the stemming part but not the term document matrix part:

    > corp1 <- Corpus(VectorSource(df1$words))
    > inspect(corp1)
    A corpus with 5 text documents
    
    The metadata consists of 2 tag-value pairs and a data frame
    Available tags are:
      create_date creator 
    Available variables in the data frame are:
      MetaID 
    
    [[1]]
    I am taking
    
    [[2]]
    these are the samples
    
    [[3]]
    He speaks differently
    
    [[4]]
    This is distilled
    
    [[5]]
    It was placed
    
    > corp1 <- tm_map(corp1, SnowballStemmer)
    > inspect(corp1)
    A corpus with 5 text documents
    
    The metadata consists of 2 tag-value pairs and a data frame
    Available tags are:
      create_date creator 
    Available variables in the data frame are:
      MetaID 
    
    [[1]]
    [1] I am tak
    
    [[2]]
    [1] these are the sampl
    
    [[3]]
    [1] He speaks differ
    
    [[4]]
    [1] This is distil
    
    [[5]]
    [1] It was plac
    
    >  class(corp1)
    [1] "VCorpus" "Corpus"  "list"   
    > tdm1 <- TermDocumentMatrix(corp1)
    Error in UseMethod("Content", x) : 
      no applicable method for 'Content' applied to an object of class "character"
    

    So instead I tried creating the term document matrix first but this time the words don't get stemmed:

    > corp1 <- Corpus(VectorSource(df1$words))
    > tdm1 <- TermDocumentMatrix(corp1, control=list(stemDocument=TRUE))
    >  as.matrix(tdm1)
                 Docs
    Terms         1 2 3 4 5
      are         0 1 0 0 0
      differently 0 0 1 0 0
      distilled   0 0 0 1 0
      placed      0 0 0 0 1
      samples     0 1 0 0 0
      speaks      0 0 1 0 0
      taking      1 0 0 0 0
      the         0 1 0 0 0
      these       0 1 0 0 0
      this        0 0 0 1 0
      was         0 0 0 0 1
    

    Here the words are obviously not stemmed.

    Any suggestions?