Unable to convert a Corpus to Data Frame in R


Solution 1

This ought to do it:

data.frame(text = sapply(myCorpus, as.character), stringsAsFactors = FALSE)
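As a quick sanity check of that one-liner, here is a minimal sketch on a toy corpus. The sample strings and the names texts, toyCorpus and toyDf are made up for illustration, and it assumes the corpus is a VCorpus, so that each document converts cleanly with as.character.

```r
library(tm)

# Hypothetical sample texts standing in for textDf$text
texts <- c("I love my cat", "Oil prices rose again")

# Build a VCorpus so every document is a PlainTextDocument
toyCorpus <- VCorpus(VectorSource(texts))

# Pull the text of each document back out into a data frame
toyDf <- data.frame(text = sapply(toyCorpus, as.character),
                    stringsAsFactors = FALSE)

str(toyDf)
```

If a document holds several lines, as.character returns more than one string for it, so you may want to paste those together first.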

Edited with a working solution, using the crude corpus as an example.

The problem here is that you cannot apply stemCompletion as a transformation.

getTransformations()
## [1] "removeNumbers"     "removePunctuation" "removeWords"       "stemDocument"      "stripWhitespace"  

does not include stemCompletion, which takes a vector of stemmed tokens as input.
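To see what that input looks like, here is a small sketch that calls stemCompletion directly on a character vector. The stems and the dictionary words are invented for illustration; in practice the dictionary can also be a corpus, as in the solution further down.

```r
library(tm)

# Hypothetical stems, roughly what stemDocument() might leave behind
stems <- c("compani", "oil", "price")

# Hypothetical dictionary of full words to complete against
dictionary <- c("companies", "oil", "prices")

# stemCompletion() takes a character vector of stems, not a document,
# and returns the completions named by the stems they came from
completed <- stemCompletion(stems, dictionary = dictionary)
print(completed)
```

Because the function expects tokens rather than a whole document, passing it to tm_map hands it the wrong kind of object.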

So this should do it: first you extract the transformed texts and tokenise them, then complete the stems, then paste back together. Here I have illustrated the solution using the built-in crude corpus.

library(tm)  # provides the crude corpus, tm_map, scan_tokenizer and stemCompletion
data(crude)
myCorpus <- crude 
myCorpus <- tm_map(myCorpus, removeWords, stopwords('english'))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
dictCorpus <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)
# tokenize the corpus
myCorpusTokenized <- lapply(myCorpus, scan_tokenizer)
# stem complete each token vector
myTokensStemCompleted <- lapply(myCorpusTokenized, stemCompletion, dictCorpus)
# concatenate tokens by document, create data frame
myDf <- data.frame(text = sapply(myTokensStemCompleted, paste, collapse = " "), stringsAsFactors = FALSE)

Solution 2

I've redone some of your earlier code with magrittr, just cause.

library(dplyr)
library(magrittr)  # needed for use_series() and extract(); dplyr only re-exports %>%
library(tm)


dictCorpus = 
  c("I love my cat", "Cullen bae is bae", "4ever alone :(") %>%
  VectorSource %>%
  Corpus %>%
  tm_map(removeWords, stopwords('english')) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation)

myCorpus = 
  dictCorpus %>%
  tm_map(stemDocument) %>%
  tm_map(stemCompletion, dictionary=dictCorpus)

data = 
  data_frame(object = 
               myCorpus %>% 
               `class<-`("list") %>% 
               use_series(content) ) %>%
  rowwise %>%
  mutate(content = 
           object %>%
           names %>%
           extract(1) )
Author by

wrahool


Updated on July 28, 2022

Comments

  • wrahool
    wrahool almost 2 years

    I've looked at the other similar questions that have been posted here (like this), but the problem persists.

    I have a dataframe of textual data, which I need to stem. So I'm converting it into a corpus, stemming it, then completing the words from the stems, and then trying to get a dataframe of text as output.

    myCorpus <- Corpus(VectorSource(textDf$text))
    myCorpus <- tm_map(myCorpus, removeWords, stopwords('english'))
    myCorpus <- tm_map(myCorpus, content_transformer(tolower))
    myCorpus <- tm_map(myCorpus, removePunctuation)
    dictCorpus <- myCorpus
    myCorpus <- tm_map(myCorpus, stemDocument)
    myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=dictCorpus)
    

    Now I'm trying to get a dataframe back from this corpus, so I've tried the following commands:

    dataframe<-data.frame(text=unlist(sapply(myCorpus, '[', "content")), stringsAsFactors=F)

    and

    dataframe<-data.frame(text=unlist(sapply(myCorpus, `[`)), stringsAsFactors=F)

    and also

    dataframe <- 
        data.frame(id=sapply(corpus, meta, "id"),
                   text=unlist(lapply(sapply(corpus, '[', "content"),paste,collapse="\n")),
                   stringsAsFactors=FALSE)
    

    from this link

    All of them produce the following error:

    Error in UseMethod("meta", x) : 
      no applicable method for 'meta' applied to an object of class "character"
    

    Any help would be greatly appreciated.