R tm removeWords function not removing words
Solution 1
I switched some code around and added tolower. The stopwords are all lowercase, so you need to lower-case the text before you remove them.
paperCorp <- tm_map(paperCorp, removePunctuation)
paperCorp <- tm_map(paperCorp, removeNumbers)
# added tolower
paperCorp <- tm_map(paperCorp, tolower)
paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))
# moved stripWhitespace
paperCorp <- tm_map(paperCorp, stripWhitespace)
paperCorp <- tm_map(paperCorp, stemDocument)
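To see why the lower-casing step has to come first, here is a minimal base R sketch. The helper remove_words is hypothetical: it only approximates a case-sensitive, whole-word removal and is not tm's actual implementation. The helper squish is likewise made up for this example.

```r
# Hypothetical helper: approximates case-sensitive whole-word removal.
remove_words <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Hypothetical helper: collapse runs of whitespace and trim both ends.
squish <- function(x) trimws(gsub("\\s+", " ", x))

text  <- "The study shows the Article"
words <- c("the", "study", "article")

# Without lower-casing, "The" and "Article" do not match the lower-case words.
no_lower <- squish(remove_words(text, words))

# Lower-casing first lets every listed word match.
with_lower <- squish(remove_words(tolower(text), words))

print(no_lower)
print(with_lower)
```

The first result still contains the capitalized words, while the second keeps only the word that was not on the list.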
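The upper-case entries are no longer needed, since everything is now lower case. You can remove these. Note that a word listed only in capitalized form, such as "Scholar", no longer matches anything after lower-casing, which is why scholar still appears in the output further down.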
paperCorp <- tm_map(paperCorp, removeWords, c("also", "article", "Article",
"download", "google", "figure",
"fig", "groups","Google", "however",
"high", "human", "levels",
"larger", "may", "number",
"shown", "study", "studies", "this",
"using", "two", "the", "Scholar",
"pubmedncbi", "PubMedNCBI",
"view", "View", "the", "biol",
"via", "image", "doi", "one",
"analysis"))
paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)
dtm <- DocumentTermMatrix(paperCorpPTD)
termFreq <- colSums(as.matrix(dtm))
head(termFreq)
tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]
head(tf)
           term  freq
fatty     fatty 29568
pparα     ppara 23232
acids     acids 22848
gene       gene 15360
dietary dietary 12864
scholar scholar 11904
tf[tf$term == "study"]
data frame with 0 columns and 1659 rows
As you can see, study is no longer in the corpus. The other words in the list are gone as well.
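A quick way to check several words at once is to filter the frequency table by row. The sketch below uses a small made-up data frame, tf_demo, in place of the real tf (its terms and counts are invented for illustration). Note the trailing comma inside the brackets, which makes the condition select rows rather than columns.

```r
# Toy stand-in for the real frequency table; values are illustrative only.
tf_demo <- data.frame(
  term = c("fatty", "acids", "gene", "scholar"),
  freq = c(30, 20, 10, 5),
  stringsAsFactors = FALSE
)

unwanted <- c("study", "scholar", "google")

# Rows whose term is still on the unwanted list.
leftover <- tf_demo[tf_demo$term %in% unwanted, ]
print(leftover)

# Unwanted words that do not appear in the table at all.
gone <- setdiff(unwanted, tf_demo$term)
print(gone)
```

Any row that shows up in leftover is a word the removal step missed, for instance because it was listed with different capitalization.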
Solution 2
If you get an error like I did and the solution above still doesn't work, try:
paperCorp <- tm_map(paperCorp, content_transformer(tolower))
instead of paperCorp <- tm_map(paperCorp, tolower). The reason is that tolower() is a base R function and returns a plain character vector rather than a document object, so afterwards you can't use, for example, paperCorp[[j]]$content but only paperCorp[[j]]. It's just a side note, but maybe helpful to someone.
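The difference can be illustrated without tm. In the sketch below, a plain list with a content field stands in for a tm document, and wrap_content is a hypothetical helper that mimics the idea behind content_transformer: apply a function to the content and hand the document back with its structure intact. It is an analogy under those assumptions, not tm's source code.

```r
# Stand-in for a tm document: a list with a content field and metadata.
doc <- list(content = "Fatty Acids and Gene Expression", meta = list(id = "1"))

# Hypothetical helper mimicking the idea of content_transformer().
wrap_content <- function(fun) {
  function(x, ...) {
    x$content <- fun(x$content, ...)
    x
  }
}

# Applying tolower directly to the text yields a bare character vector,
# so the surrounding structure (content field, metadata) is lost.
bare <- tolower(doc$content)
print(is.list(bare))

# The wrapped version returns the same structure with lower-cased content.
lowered <- wrap_content(tolower)(doc)
print(lowered$content)
print(names(lowered))
```

With the wrapped function, later code that reads the content field keeps working, which is the property the $content access above depends on.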
Adam
Updated on December 06, 2020
Comments
Adam over 3 years
I am trying to remove some words from a corpus I have built but it doesn't seem to be working. I first run through everything and create a dataframe that lists my words in order of their frequency. I use this list to identify words I am not interested in and then try to create a new list with the words removed. However, the words remain in my dataset. I am wondering what I am doing wrong and why the words aren't being removed? I have included the full code below:
install.packages("rvest")
install.packages("tm")
install.packages("SnowballC")
install.packages("stringr")
library(stringr)
library(tm)
library(SnowballC)
library(rvest)

# Pull in the data I have been using.
paperList <- html("http://journals.plos.org/plosone/search?q=nutrigenomics&sortOrder=RELEVANCE&filterJournals=PLoSONE&resultsPerPage=192")
paperURLs <- paperList %>%
  html_nodes(xpath="//*[@class='search-results-title']/a") %>%
  html_attr("href")
paperURLs <- paste("http://journals.plos.org", paperURLs, sep = "")
paper_html <- sapply(1:length(paperURLs), function(x) html(paperURLs[x]))
paperText <- sapply(1:length(paper_html), function(x) paper_html[[1]] %>%
  html_nodes(xpath="//*[@class='article-content']") %>%
  html_text() %>%
  str_trim(.))

# Create corpus
paperCorp <- Corpus(VectorSource(paperText))
for(j in seq(paperCorp)) {
  paperCorp[[j]] <- gsub(":", " ", paperCorp[[j]])
  paperCorp[[j]] <- gsub("\n", " ", paperCorp[[j]])
  paperCorp[[j]] <- gsub("-", " ", paperCorp[[j]])
}
paperCorp <- tm_map(paperCorp, removePunctuation)
paperCorp <- tm_map(paperCorp, removeNumbers)
paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))
paperCorp <- tm_map(paperCorp, stemDocument)
paperCorp <- tm_map(paperCorp, stripWhitespace)
paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)
dtm <- DocumentTermMatrix(paperCorpPTD)
termFreq <- colSums(as.matrix(dtm))
head(termFreq)
tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]
head(tf)

# After having identified words I am not interested in
# create new corpus with these words removed.
paperCorp1 <- tm_map(paperCorp, removeWords, c("also", "article", "Article",
                                               "download", "google", "figure",
                                               "fig", "groups","Google", "however",
                                               "high", "human", "levels",
                                               "larger", "may", "number",
                                               "shown", "study", "studies", "this",
                                               "using", "two", "the", "Scholar",
                                               "pubmedncbi", "PubMedNCBI",
                                               "view", "View", "the", "biol",
                                               "via", "image", "doi", "one",
                                               "analysis"))
paperCorp1 <- tm_map(paperCorp1, stripWhitespace)
paperCorpPTD1 <- tm_map(paperCorp1, PlainTextDocument)
dtm1 <- DocumentTermMatrix(paperCorpPTD1)
termFreq1 <- colSums(as.matrix(dtm1))
tf1 <- data.frame(term = names(termFreq1), freq = termFreq1)
tf1 <- tf1[order(-tf1[,2]),]
head(tf1, 100)
If you look through tf1 you will notice that plenty of the words that were specified to be removed have not actually been removed. Just wondering what I am doing wrong, and how I might remove these words from my data?
NOTE: removeWords is doing something, because the output from head(tf, 100) and head(tf1, 100) is not exactly the same. So removeWords seems to be removing some instances of the words I am trying to get rid of, but not all instances.