R's tm package for word count


Solution 1

As Tyler notes, your question is incomplete without a reproducible example. For this kind of question, you can make one using the data that comes built-in with the package:

library("tm") # version 0.6; you seem to be using an older version
data(crude)   # built-in corpus of 20 Reuters articles
# tm 0.6 requires content_transformer() for base functions like tolower
revs <- tm_map(crude, content_transformer(tolower))
revs <- tm_map(revs, removeWords, stopwords("english"))
revs <- tm_map(revs, removePunctuation)
revs <- tm_map(revs, removeNumbers)
revs <- tm_map(revs, stripWhitespace)
dtm <- DocumentTermMatrix(revs)

And here's how to get a word count per document. Each row of the dtm is one document, so summing across the columns of a row gives that document's word count:

# Word count per document
rowSums(as.matrix(dtm))
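If you want those counts alongside the file names in a tidy form, you can wrap the result in a data frame. A small sketch, assuming `dtm` is the document-term matrix built above (`word_counts` is just an illustrative name):

```r
# Collect per-document word counts into a data frame for inspection or export
wc <- rowSums(as.matrix(dtm))   # named vector: one count per document
word_counts <- data.frame(doc = names(wc), words = unname(wc))
head(word_counts)
# from here you could, e.g., write.csv(word_counts, "word_counts.csv")
```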

Solution 2

You can also do this in the quanteda package that I developed with Paul Nulty. It is easy to create your own corpus using the quanteda tools for this purpose, but it also imports tm VCorpus objects directly (as shown below).

You can get token counts per document using the summary() method for the corpus object type, or by creating a document-feature matrix using dfm() and then using rowSums() on the resulting document-feature matrix. dfm() by default applies the cleaning steps that you would need to apply separately using the tm package.

data(crude, package="tm")
mycorpus <- corpus(crude)
summary(mycorpus)
## Corpus consisting of 20 documents.
## 
## Text Types Tokens Sentences
## reut-00001.xml    56     90         8
## reut-00002.xml   224    439        21
## reut-00004.xml    39     51         4
## reut-00005.xml    49     66         6
## reut-00006.xml    59     88         3
## reut-00007.xml   229    443        25
## reut-00008.xml   232    420        23
## reut-00009.xml    96    134         9
## reut-00010.xml   165    297        22
## reut-00011.xml   179    336        20
## reut-00012.xml   179    360        23
## reut-00013.xml    67     92         3
## reut-00014.xml    68    103         7
## reut-00015.xml    71     97         4
## reut-00016.xml    72    109         4
## reut-00018.xml    90    144         9
## reut-00019.xml   117    194        13
## reut-00021.xml    47     77        12
## reut-00022.xml   142    281        12
## reut-00023.xml    30     43         8
## 
## Source:  Converted from tm VCorpus 'crude'.
## Created: Sun May 31 18:24:07 2015.
## Notes:   .
mydfm <- dfm(mycorpus)
## Creating a dfm from a corpus ...
## ... indexing 20 documents
## ... tokenizing texts, found 3,979 total tokens
## ... cleaning the tokens, 115 removed entirely
## ... summing tokens by document
## ... indexing 1,048 feature types
## ... building sparse matrix
## ... created a 20 x 1048 sparse dfm
## ... complete. Elapsed time: 0.039 seconds.
rowSums(mydfm)
## reut-00001.xml reut-00002.xml reut-00004.xml reut-00005.xml reut-00006.xml reut-00007.xml 
##             90            439             51             66             88            443 
## reut-00008.xml reut-00009.xml reut-00010.xml reut-00011.xml reut-00012.xml reut-00013.xml 
##            420            134            297            336            360             92 
## reut-00014.xml reut-00015.xml reut-00016.xml reut-00018.xml reut-00019.xml reut-00021.xml 
##            103             97            109            144            194             77 
## reut-00022.xml reut-00023.xml 
##            281             43 
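Note that quanteda's interface has evolved since this answer was written. A sketch assuming a recent quanteda version (v3+), where `dfm()` expects a tokens object rather than a raw corpus, and `ntoken()` returns per-document token counts directly:

```r
library("quanteda")
# Newer quanteda API: tokenize first, clean the tokens, then build the dfm
toks <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(tokens_tolower(toks), stopwords("english"))
mydfm <- dfm(toks)
ntoken(toks)   # per-document token counts, same as rowSums(mydfm)
```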

I'm happy to help with any quanteda-related questions.

Author by

monarque13

My work primarily involves #rstats, data visualization, machine learning, and computational research in academia.

Updated on June 13, 2022

Comments

  • monarque13
    monarque13 almost 2 years

    I have a corpus with over 5000 text files. I would like to get an individual word count for each file after pre-processing each one (converting to lower case, removing stopwords, etc.). I haven't had any luck getting the word count for the individual text files. Any help would be appreciated.

    library(tm)
    revs <- Corpus(DirSource("data/"))
    revs <- tm_map(revs, tolower)
    revs <- tm_map(revs, removeWords, stopwords("english"))
    revs <- tm_map(revs, removePunctuation)
    revs <- tm_map(revs, removeNumbers)
    revs <- tm_map(revs, stripWhitespace)
    dtm <- DocumentTermMatrix(revs)