transformation drops documents error in R

r tm

This warning appears only when you apply a custom function created with content_transformer, and only when the corpus is based on a VectorSource.

The reason is a check in the underlying code that compares the number of names of the corpus content with the length of the corpus content. When the text is read from a vector, the documents have no names, so the two numbers differ and the warning is raised. It is only a warning: no documents have been dropped.
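The check described above can be imitated in base R. This is only a sketch of the logic as explained here, not tm's actual code; `check_drops` and `to_space` are hypothetical names made up for the illustration.

```r
# Hypothetical stand-in for the check described above (not tm's own code).
check_drops <- function(content, FUN, ...) {
  n <- names(content)      # NULL when the character vector has no names
  out <- FUN(content, ...) # apply the transformation to the content
  # Compare the length of the result with the number of names.
  list(content = out, warns = length(out) != length(n))
}

to_space <- function(x, pattern) gsub(pattern, " ", x)

unnamed <- c("a / b")      # like content read through a VectorSource
named <- c(`1` = "a / b")  # like content that carries document ids

res_unnamed <- check_drops(unnamed, to_space, "/")
res_named <- check_drops(named, to_space, "/")

res_unnamed$warns  # TRUE: 1 document vs 0 names, although nothing was dropped
res_named$warns    # FALSE: 1 document vs 1 name
```

In both cases the transformed content still has one element, which is why the warning can be ignored for a corpus without names.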

See the following examples:

library(tm)

# custom transformation: replace every match of pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

text <- c("this is my text with a forward slash / and some other text")
text_corpus <- Corpus(VectorSource(text))
inspect(text_corpus)
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1

[1] this is my text with a forward slash / and some other text

# warning appears here
text_corpus <- tm_map(text_corpus, toSpace, "/")
inspect(text_corpus)
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1

[1] this is my text with a forward slash   and some other text
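What toSpace does to the text can be seen without tm at all. The sketch below runs the same anonymous function on plain strings; `replace_with_space` is a hypothetical name used only here, and the sample strings are made up. As far as I understand it, content_transformer only wraps such a function so that tm_map can apply it to the content of the documents.

```r
# The function passed to content_transformer, shown on its own in base R.
replace_with_space <- function(x, pattern) gsub(pattern, " ", x)

replace_with_space("a/b", "/")        # "a b"
replace_with_space("user@host", "@")  # "user host"
replace_with_space("a|b", "\\|")      # "a b"; | must be escaped in a regex
```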

You can see that there are no names in the text_corpus with the following command:

names(content(text_corpus))
NULL

If you do not want this warning to appear, create a data.frame with a doc_id column and a text column and use it as the source via DataframeSource.

text <- c("this is my text with a forward slash / and some other text")
doc_ids <- c(1)

df <- data.frame(doc_id = doc_ids, text = text, stringsAsFactors = FALSE)
df_corpus <- Corpus(DataframeSource(df))
inspect(df_corpus)
# no warning appears
df_corpus <- tm_map(df_corpus, toSpace, "/")
inspect(df_corpus)

names(content(df_corpus))
"1"
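With more than one document the same data.frame can be built from a character vector. A minimal sketch, assuming made-up `texts` values and sequential ids; the tm part runs only if the package is installed.

```r
# Hypothetical input: several documents in a plain character vector.
texts <- c("first doc with a / slash", "second doc with an @ sign")

# DataframeSource expects the first two columns to be doc_id and text.
df <- data.frame(
  doc_id = as.character(seq_along(texts)),
  text = texts,
  stringsAsFactors = FALSE
)

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  df_corpus <- Corpus(DataframeSource(df))
  # The doc_id values are used as the names of the corpus content.
  print(names(content(df_corpus)))
}
```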
Author by

NRR

Updated on July 25, 2022

Comments

  • NRR
    NRR almost 2 years

    Whenever I run this code, the tm_map line gives me this warning: Warning message: In tm_map.SimpleCorpus(docs, toSpace, "/") : transformation drops documents

    texts <- read.csv("./Data/fast food/Domino's/Domino's veg pizza.csv", stringsAsFactors = FALSE)
    docs <- Corpus(VectorSource(texts))
    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
    docs <- tm_map(docs, toSpace, "/")
    docs <- tm_map(docs, toSpace, "@")
    docs <- tm_map(docs, toSpace, "\\|")
    docs <- tm_map(docs, content_transformer(tolower))
    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, removeWords, stopwords("english"))
    docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
    docs <- tm_map(docs, removePunctuation)
    docs <- tm_map(docs, stripWhitespace)
    
  • Sagar
    Sagar almost 4 years
    What does this do? toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))