Bigrams instead of single words in a term-document matrix using R and RWeka

14,764

Solution 1

Inspired by Anthony's comment, I found that you can set the default number of cores the parallel package uses (set it before you call NGramTokenizer):

# Sets the default number of threads to use
options(mc.cores=1)

NGramTokenizer seems to hang inside the parallel::mclapply call, so limiting the work to a single core appears to work around the problem.
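If you would rather not change the option for the whole session, a minimal sketch of a scoped version is below. The helper name with_single_core is hypothetical (it is not part of tm, RWeka, or parallel); it only relies on base R's options() and on.exit().

```r
# Hypothetical helper (not part of tm, RWeka, or parallel): evaluates an
# expression with mc.cores temporarily set to 1, then restores the old value.
with_single_core <- function(expr) {
  old <- options(mc.cores = 1L)      # options() returns the previous setting
  on.exit(options(old), add = TRUE)  # restore it even if expr raises an error
  expr                               # the lazy argument is evaluated here
}

# Inside the helper the option is 1; afterwards the previous value is back.
cores_inside <- with_single_core(getOption("mc.cores"))

# Intended use with the tokenizer from the question (needs tm, RWeka, Java):
# txtTdmBi <- with_single_core(
#   TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
# )
```

This keeps the single-core setting confined to the one call that hangs, at the cost of an extra wrapper.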

Solution 2

There seem to be problems using RWeka together with the parallel package. I found a workaround here.

The key point is to not load the RWeka package, and instead to call it through its namespace inside a wrapper function.

So your tokenizer should look like this:

BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
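As a minimal sketch, this is how that tokenizer might plug into the term-document matrix from the question. It assumes tm and RWeka (with a working Java installation) are installed, and it uses the crude sample corpus that ships with tm.

```r
library(tm)   # RWeka is deliberately not attached; it is only reached via RWeka::
data(crude)

# Bigram tokenizer that calls RWeka through its namespace
BigramTokenizer <- function(x) {
  RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
}

# Build the term-document matrix with bigrams as terms
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

# Look at a small corner of the result
inspect(txtTdmBi[1:5, 1:5])
```

Note that there is no library(RWeka) call anywhere in the script, which is the whole point of the workaround.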
Author by

ds10

Updated on July 15, 2022

Comments

  • ds10
    ds10 almost 2 years

    I've found a way to use bigrams instead of single tokens in a term-document matrix. The solution has been posted on Stack Overflow here: findAssocs for multiple terms in R

    The idea goes something like this:

    library(tm)
    library(RWeka)
    data(crude)
    
    # Tokenizer for n-grams, passed on to the term-document matrix constructor
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
    

    However the final line gives me the error:

    Error in rep(seq_along(x), sapply(tflist, length)) : 
      invalid 'times' argument
    In addition: Warning message:
    In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
    

    If I remove the tokenizer from the last line, it creates a regular tdm, so I guess the problem is somewhere in the BigramTokenizer function. However, this is the same example that the tm FAQ gives here: http://tm.r-forge.r-project.org/faq.html#Bigrams.

  • jadianes
    jadianes almost 9 years
    I didn't experience the problem locally, only on Shinyapps.io. This solved the problem. Thanks!
  • harsha
    harsha about 7 years
    Is there any alternative to NGramTokenizer? On my computer RWeka is not working due to some R/Java version issues.
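One possible Java-free route is a tokenizer written in base R. The sketch below is only an illustration: the name BaseBigramTokenizer is hypothetical, and it splits on whitespace only, so its tokens will not match Weka's delimiter handling exactly.

```r
# Hypothetical base-R bigram tokenizer (no RWeka or Java required).
# Assumption: splitting on whitespace is good enough for the corpus at hand.
BaseBigramTokenizer <- function(x) {
  # Collapse the document to one string, then split it on runs of whitespace
  words <- unlist(strsplit(paste(as.character(x), collapse = " "), "[[:space:]]+"))
  words <- words[nzchar(words)]            # drop empty strings from leading blanks
  if (length(words) < 2) return(character(0))
  # Pair each word with its successor
  paste(head(words, -1), tail(words, -1))
}

bigrams <- BaseBigramTokenizer("the quick brown fox")
print(bigrams)
```

It can be passed to TermDocumentMatrix through control = list(tokenize = BaseBigramTokenizer) in the same way as the RWeka version.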