R tm removeWords function not removing words


Solution 1

I switched some of the code around and added a tolower step. The stopwords are all lowercase, so you need to lowercase the corpus before removing them.

paperCorp <- tm_map(paperCorp, removePunctuation)
paperCorp <- tm_map(paperCorp, removeNumbers)
# added tolower
paperCorp <- tm_map(paperCorp, tolower)
paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))
# moved stripWhitespace
paperCorp <- tm_map(paperCorp, stripWhitespace)

paperCorp <- tm_map(paperCorp, stemDocument)

Uppercase variants are no longer needed in the custom word list, since everything has been lowercased, so they can be dropped:

paperCorp <- tm_map(paperCorp, removeWords, c("also", "article", "download",
                                              "google", "figure", "fig",
                                              "groups", "however", "high",
                                              "human", "levels", "larger",
                                              "may", "number", "shown",
                                              "study", "studies", "this",
                                              "using", "two", "the",
                                              "pubmedncbi", "view", "biol",
                                              "via", "image", "doi", "one",
                                              "analysis"))

# wrap the documents as PlainTextDocument so DocumentTermMatrix accepts them
paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)

dtm <- DocumentTermMatrix(paperCorpPTD)

termFreq <- colSums(as.matrix(dtm))
head(termFreq)

tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]
head(tf)

           term  freq
fatty     fatty 29568
pparα     ppara 23232
acids     acids 22848
gene       gene 15360
dietary dietary 12864
scholar scholar 11904

tf[tf$term == "study", ]

[1] term freq
<0 rows> (or 0-length row.names)

As you can see, "study" is no longer in the term list. The rest of the specified words are gone as well.
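
To spot-check several of the removed words at once, you can query the term list directly (a small sketch against the tf data frame built above; the sample of words is arbitrary):

dropped <- c("study", "also", "article", "figure", "doi")
# should be FALSE for every word if the removal worked
any(as.character(tf$term) %in% dropped)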

Solution 2

If you get an error like I did and the solution above still doesn't work, use

paperCorp <- tm_map(paperCorp, content_transformer(tolower))

instead of

paperCorp <- tm_map(paperCorp, tolower)

This is because tolower() comes from base R and returns a plain character vector rather than a tm text document, which changes the structure of the corpus: afterwards you can only access paperCorp[[j]], not paperCorp[[j]]$content. It's just a digression, but maybe helpful to someone.
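
Here is a minimal sketch of the difference on a toy corpus (nothing below depends on the scraped data):

library(tm)

toyCorp <- Corpus(VectorSource(c("Some TEXT", "More WORDS")))

# content_transformer() wraps a base function so that tm_map keeps each
# element a proper text document instead of a bare character vector
toyCorp <- tm_map(toyCorp, content_transformer(tolower))

toyCorp[[1]]$content  # "some text" -- the document structure is intact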


Comments

  • Adam over 3 years

    I am trying to remove some words from a corpus I have built but it doesn't seem to be working. I first run through everything and create a dataframe that lists my words in order of their frequency. I use this list to identify words I am not interested in and then try to create a new list with the words removed. However, the words remain in my dataset. I am wondering what I am doing wrong and why the words aren't being removed? I have included the full code below:

    install.packages("rvest")
    install.packages("tm")
    install.packages("SnowballC")
    install.packages("stringr")
    library(stringr) 
    library(tm) 
    library(SnowballC) 
    library(rvest)
    
    # Pull in the data I have been using. 
# html() is deprecated in recent rvest releases; read_html() is its replacement
paperList <- html("http://journals.plos.org/plosone/search?q=nutrigenomics&sortOrder=RELEVANCE&filterJournals=PLoSONE&resultsPerPage=192")
    paperURLs <- paperList %>%
      html_nodes(xpath="//*[@class='search-results-title']/a") %>%
      html_attr("href")
    paperURLs <- paste("http://journals.plos.org", paperURLs, sep = "")
    paper_html <- sapply(1:length(paperURLs), function(x) html(paperURLs[x]))
    
    paperText <- sapply(1:length(paper_html), function(x) paper_html[[x]] %>%
                          html_nodes(xpath="//*[@class='article-content']") %>%
                          html_text() %>%
                          str_trim(.))
    # Create corpus
    paperCorp <- Corpus(VectorSource(paperText))
    for(j in seq(paperCorp))
    {
      paperCorp[[j]] <- gsub(":", " ", paperCorp[[j]])
      paperCorp[[j]] <- gsub("\n", " ", paperCorp[[j]])
      paperCorp[[j]] <- gsub("-", " ", paperCorp[[j]])
    }
    
    paperCorp <- tm_map(paperCorp, removePunctuation)
    paperCorp <- tm_map(paperCorp, removeNumbers)
    
    paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))
    
    paperCorp <- tm_map(paperCorp, stemDocument)
    
    paperCorp <- tm_map(paperCorp, stripWhitespace)
    paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)
    
    dtm <- DocumentTermMatrix(paperCorpPTD)
    
    termFreq <- colSums(as.matrix(dtm))
    head(termFreq)
    
    tf <- data.frame(term = names(termFreq), freq = termFreq)
    tf <- tf[order(-tf[,2]),]
    head(tf)
    
    # After having identified words I am not interested in
    # create new corpus with these words removed.
    paperCorp1 <- tm_map(paperCorp, removeWords, c("also", "article", "Article", 
                                                  "download", "google", "figure",
                                                  "fig", "groups","Google", "however",
                                                  "high", "human", "levels",
                                                  "larger", "may", "number",
                                                  "shown", "study", "studies", "this",
                                                  "using", "two", "the", "Scholar",
                                                  "pubmedncbi", "PubMedNCBI",
                                                  "view", "View", "the", "biol",
                                                  "via", "image", "doi", "one", 
                                                  "analysis"))
    
    paperCorp1 <- tm_map(paperCorp1, stripWhitespace)
    paperCorpPTD1 <- tm_map(paperCorp1, PlainTextDocument)
    dtm1 <- DocumentTermMatrix(paperCorpPTD1)
    termFreq1 <- colSums(as.matrix(dtm1))
    tf1 <- data.frame(term = names(termFreq1), freq = termFreq1)
    tf1 <- tf1[order(-tf1[,2]),]
    head(tf1, 100)
    

    If you look through tf1 you will notice that plenty of the words that were specified to be removed have not actually been removed.

    Just wondering what I am doing wrong, and how I might remove these words from my data?

    NOTE: removeWords is doing something, because the outputs from head(tf, 100) and head(tf1, 100) are not exactly the same. So removeWords seems to be removing some instances of the words I am trying to get rid of, but not all of them.
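
    One likely explanation, demonstrated in isolation below: removeWords matches whole words case-sensitively, so capitalised instances survive unless the corpus is lowercased first (which is what Solution 1 adds). A minimal sketch, using only tm:

    library(tm)

    # only the lowercase instance is removed; "Study" survives
    removeWords("Study of a study", "study")
    # [1] "Study of a "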