R remove stopwords from a character vector using %in%
Solution 1
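You are not accessing the list properly: strsplit returns a list, so str1 is a list of length one and %in% returns a single value instead of one value per word. You are also not using the result of %in% (a logical vector of TRUE/FALSE) to get the matching elements back. You should do something like this: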
unlist(str1)[!(unlist(str1) %in% stopWords)]
(or)
str1[[1]][!(str1[[1]] %in% stopWords)]
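The difference between testing the list and testing its contents can be seen in a minimal sketch. It assumes a small, hand-picked stopWords vector so that it runs without the tm package; the thread itself uses stopwords("en").

```r
# Hypothetical stop-word vector, used only for illustration;
# the thread takes this from tm::stopwords("en") instead.
stopWords <- c("this", "is", "the", "of", "all", "other")

# strsplit always returns a list, one element per input string
str1 <- strsplit("this string is the longest string of all the other strings.", " ")

class(str1)                  # "list"
length(str1 %in% stopWords)  # 1: the list element is tested as a whole

# Extract the character vector first, then subset it with the logical result
words <- str1[[1]]
keep <- !(words %in% stopWords)
kept_words <- words[keep]
kept_words
# [1] "string"   "longest"  "string"   "strings."
```

Punctuation stays attached to the last token ("strings.") because the split is on spaces only.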
For the whole data.frame df1, you could do something like:
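# Negated version of %in%: TRUE for elements NOT found in the table
'%nin%' <- Negate('%in%')
lapply(df1[,2], function(x) {
  # split each string on spaces, then keep the words that are not stop words
  tokens <- unlist(strsplit(x, " "))
  tokens[tokens %nin% stopWords]
})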
# [[1]]
# [1] "string" "string."
#
# [[2]]
# [1] "string" "slightly" "string."
#
# [[3]]
# [1] "string" "string."
#
# [[4]]
# [1] "string" "slightly" "shorter" "string."
#
# [[5]]
# [1] "string" "string" "strings."
Solution 2
First, you should unlist str1 (or use lapply over it), because strsplit returns a list rather than a plain vector. Here words stands for the stop-word vector:
!(unlist(str1) %in% words)
#> [1] TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
Second, a more complete solution:
string <- c("This string is a string.",
"This string is a slightly longer string.",
"This string is an even longer string.",
"This string is a slightly shorter string.",
"This string is the longest string of all the other strings.")
rm_words <- function(string, words) {
  stopifnot(is.character(string), is.character(words))
  # fixed = TRUE for speedup
  tokens <- strsplit(string, " ", fixed = TRUE)
  # compare in lower case, but keep the original case in the output
  vapply(tokens, function(x) paste(x[!tolower(x) %in% words], collapse = " "), character(1))
}
rm_words(string, tm::stopwords("en"))
#> [1] "string string." "string slightly longer string." "string even longer string."
#> [4] "string slightly shorter string." "string longest string strings."
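To write the cleaned strings back into a data frame column, the same split, filter, and paste steps can be applied directly to the column. This is only a sketch: the stop_words vector and the cleaned column name are assumptions made for illustration, chosen so the example runs without tm.

```r
# Hypothetical stop-word vector; replace with tm::stopwords("en") if tm is available.
stop_words <- c("this", "is", "a", "an", "the", "of", "all", "other")

df1 <- data.frame(
  id = 1:2,
  string1 = c("This string is a string.",
              "This string is an even longer string."),
  stringsAsFactors = FALSE
)

# One character vector of tokens per row
tokens <- strsplit(df1$string1, " ", fixed = TRUE)

# Drop stop words (compared in lower case) and glue the rest back together
df1$cleaned <- vapply(
  tokens,
  function(x) paste(x[!(tolower(x) %in% stop_words)], collapse = " "),
  character(1)
)

df1$cleaned
# [1] "string string."             "string even longer string."
```

vapply with character(1) returns exactly one string per row, so the result can be assigned straight to a column.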
Comments
-
screechOwl over 1 year
I have a data frame with strings that I'd like to remove stop words from. I'm trying to avoid using the tm package as it's a large data set and tm seems to run a bit slowly. I am using the tm stopword dictionary.
library(plyr)
library(tm)
stopWords <- stopwords("en")
class(stopWords)
df1 <- data.frame(id = seq(1,5,1), string1 = NA)
head(df1)
df1$string1[1] <- "This string is a string."
df1$string1[2] <- "This string is a slightly longer string."
df1$string1[3] <- "This string is an even longer string."
df1$string1[4] <- "This string is a slightly shorter string."
df1$string1[5] <- "This string is the longest string of all the other strings."
head(df1)
df1$string1 <- tolower(df1$string1)
str1 <- strsplit(df1$string1[5], " ")
> !(str1 %in% stopWords)
[1] TRUE
This is not the answer I'm looking for. I'm trying to get a vector or string of the words NOT in the stopWords vector. What am I doing wrong?
-
screechOwl over 10 years
I didn't realize str1 was outputting as a list, I assumed it was a vector, thank you.
-
Carl Witthoft over 10 years
Thanks for using Negate -- I'd completely forgotten about the funprog suite of goodies.
-
hadley over 10 years
Using setdiff would be even simpler, and you should probably use lapply on the results of strsplit: lapply(strsplit(df1$string, " "), setdiff, stopWords). The only disadvantage is you get unique words.
-
Artem Klevtsov almost 8 years
setdiff calls %in% (exactly match(x, y, 0L) == 0L).
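The trade-off raised in the comments above, that setdiff returns unique words, can be checked with a short sketch. The stopWords vector here is a hypothetical stand-in for the tm dictionary.

```r
# Hypothetical stop-word vector, for illustration only
stopWords <- c("this", "is", "the", "of", "all", "other")

words <- c("this", "string", "is", "the", "longest", "string",
           "of", "all", "the", "other", "strings.")

# setdiff removes stop words but also collapses duplicates
via_setdiff <- setdiff(words, stopWords)
via_setdiff
# [1] "string"   "longest"  "strings."

# Logical subsetting with %in% keeps every occurrence, in order
via_in <- words[!(words %in% stopWords)]
via_in
# [1] "string"   "longest"  "string"   "strings."
```

If word counts matter later (for example, term frequencies), the %in% form is the safer choice; if only the set of remaining words is needed, setdiff is shorter.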