Removing overly common words (occurring in more than 80% of the documents) in R


Solution 1

What if you made a removeCommonTerms function?

removeCommonTerms <- function (x, pct) 
{
    stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")), 
        is.numeric(pct), pct > 0, pct < 1)
    # Work with a TermDocumentMatrix so rows are terms and columns are documents
    m <- if (inherits(x, "DocumentTermMatrix")) 
        t(x)
    else x
    # table(m$i) gives each term's document frequency (number of non-zero docs);
    # keep only terms whose document frequency is below pct of the documents
    t <- table(m$i) < m$ncol * pct
    termIndex <- as.numeric(names(t[t]))
    # Subset on the correct dimension depending on the input class
    if (inherits(x, "DocumentTermMatrix")) 
        x[, termIndex]
    else x[termIndex, ]
}

Then, if you wanted to remove terms that are in >= 80% of the documents, you could do:

data("crude")
dtm <- DocumentTermMatrix(crude)
dtm
# <<DocumentTermMatrix (documents: 20, terms: 1266)>>
# Non-/sparse entries: 2255/23065
# Sparsity           : 91%
# Maximal term length: 17
# Weighting          : term frequency (tf)

removeCommonTerms(dtm, 0.8)
# <<DocumentTermMatrix (documents: 20, terms: 1259)>>
# Non-/sparse entries: 2129/23051
# Sparsity           : 92%
# Maximal term length: 17
# Weighting          : term frequency (tf)
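
The function also accepts a TermDocumentMatrix, subsetting rows instead of columns. A minimal sketch on the same crude data (output not shown):

tdm <- TermDocumentMatrix(crude)
tdm2 <- removeCommonTerms(tdm, 0.8)
dim(tdm2)  # rows = surviving terms, columns = the 20 documents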

Solution 2

If you are going to use DocumentTermMatrix, then an alternative approach is to use the bounds$global control option. For example:

ndocs <- length(dsc)
# ignore overly sparse terms (appearing in less than 1% of the documents)
minDocFreq <- ndocs * 0.01
# ignore overly common terms (appearing in more than 80% of the documents)
maxDocFreq <- ndocs * 0.8
dtm <- DocumentTermMatrix(dsc, control = list(bounds = list(global = c(minDocFreq, maxDocFreq))))
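
As a quick, reproducible sanity check, here is a sketch reusing the crude data from Solution 1 (the exact term counts you get will depend on the bounds):

library(tm)
data("crude")
ndocs <- length(crude)
# keep terms appearing in at least 1% and at most 80% of the documents
dtm <- DocumentTermMatrix(crude,
    control = list(bounds = list(global = c(ndocs * 0.01, ndocs * 0.8))))
dtm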


Author: Fawaz

Updated on May 31, 2022

Comments

  • Fawaz
    Fawaz almost 2 years

    I am working with the 'tm' package in R to create a corpus. I have done most of the preprocessing steps. The remaining thing is to remove overly common words (terms that occur in more than 80% of the documents). Can anybody help me with this?

    dsc <- Corpus(dd)
    dsc <- tm_map(dsc, stripWhitespace)
    dsc <- tm_map(dsc, removePunctuation)
    dsc <- tm_map(dsc, removeNumbers)
    dsc <- tm_map(dsc, removeWords, otherWords1)
    dsc <- tm_map(dsc, removeWords, otherWords2)
    dsc <- tm_map(dsc, removeWords, otherWords3)
    dsc <- tm_map(dsc, removeWords, javaKeywords)
    dsc <- tm_map(dsc, removeWords, stopwords("english"))
    dsc <- tm_map(dsc, stemDocument)
    dtm<- DocumentTermMatrix(dsc, control = list(weighting = weightTf, 
                             stopwords = FALSE))
    
    dtm <- removeSparseTerms(dtm, 0.99) 
    # ^-  Removes overly rare terms (those appearing in roughly less than 1% of the documents)
    
  • lawyeR
    lawyeR over 9 years
    this is probably an un-SO-like comment, but you are amazing!
  • hhh
    hhh over 7 years
    Any idea how this would be possible with the Quanteda package? Moved this here. (See the sketch after these comments.)
  • Saqib Ali
    Saqib Ali about 7 years
    simply brilliant!! :)
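
Regarding the quanteda question in the comments: a minimal sketch, assuming quanteda's dfm_trim() with docfreq_type = "prop" (the toy documents below are purely for illustration):

library(quanteda)
# toy documents for illustration only
txt <- c("oil prices rise", "oil demand falls", "oil markets react")
dfmat <- dfm(tokens(txt))
# drop features that appear in more than 80% of the documents ("oil" here)
dfmat <- dfm_trim(dfmat, max_docfreq = 0.8, docfreq_type = "prop")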