Using R to find top ten words in a text


Solution 1

To get word frequency:

> mytext = c("This","is","a","test","for","count","of","the","words","The","words","have","been","written","very","randomly","so","that","the","test","can","be","for","checking","the","count")

> sort(table(mytext), decreasing=T)
mytext
     the    count      for     test    words        a       be     been      can checking     have       is       of randomly       so     that      The     This     very 
       3        2        2        2        2        1        1        1        1        1        1        1        1        1        1        1        1        1        1 
 written 
       1 

To ignore case:

> mytext = tolower(mytext)
> 
> sort(table(mytext), decreasing=T)
mytext
     the    count      for     test    words        a       be     been      can checking     have       is       of randomly       so     that     this     very  written 
       4        2        2        2        2        1        1        1        1        1        1        1        1        1        1        1        1        1        1 
> 

For top ten words only:

> sort(table(mytext), decreasing=T)[1:10]
mytext
     the    count      for     test    words        a       be     been      can checking 
       4        2        2        2        2        1        1        1        1        1 
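
The asker's book object is a vector with several words per element (see the comments below), so the same table/sort idea needs a split-and-flatten step first. A minimal sketch, assuming book is a character vector of lines:

words <- tolower(unlist(strsplit(book, "[[:space:]]+")))  # split on whitespace, flatten, lower-case
words <- words[words != ""]                                # drop empty strings left by leading whitespace
sort(table(words), decreasing=T)[1:10]                     # top ten words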

Solution 2

You can use regex for this, but using a text-mining package will give you a lot more flexibility. For example, to do a basic word-separation, you simply do the following:

u <- "http://www.gutenberg.org/cache/epub/1404/pg1404.txt"
library("httr")
book <- httr::content(GET(u))

w <- strsplit(book, "[[:space:]]+")[[1]]
tail(sort(table(w)), 10)
# w
# which    is  that    be     a    in   and    to    of   the 
#  1968  1995  2690  3766  3881  4184  4943  6905 11896 16726

But if you want to, for example, be able to remove common stop words or better handle capitalization (which, in the above, will mean Hello and hello are not counted together), you should dig into tm:

library("tm")
s <- URISource(u)
corpus <- VCorpus(s)

m <- DocumentTermMatrix(corpus)
findFreqTerms(m, 600) # words appearing at least 600 times
# "all"   "and"   "are"   "been"  "but"   "for"   "from"  "have"  "its" "may"  
# "not"   "that"  "the"   "their" "they"  "this"  "which" "will"  "with" "would"

c2 <- tm_map(corpus, removeWords, stopwords("english"))
m2 <- DocumentTermMatrix(c2)
findFreqTerms(m2, 400) # words appearing at least 400 times
# [1] "can" "government" "may" "must" "one" "power" "state" "the" "will"

Solution 3

Not regex, but it may be more of what you're after with less fuss. Here's a qdap approach using Thomas's data (PS: nice data approach):

u <- "http://www.gutenberg.org/cache/epub/1404/pg1404.txt"
library("httr")
book <- httr::content(GET(u))

library(qdap)
freq_terms(book, 10)

##    WORD  FREQ
## 1  the  18195
## 2  of   12015
## 3  to    7177
## 4  and   5191
## 5  in    4518
## 6  a     4051
## 7  be    3846
## 8  that  2800
## 9  it    2565
## 10 is    2218

This has the advantage that you can control:

  1. Stopwords with stopwords
  2. Minimum length words with at.least
  3. Account for ties with extend = TRUE (default)
  4. Plot method for output

Here it is again with stop words and min length set (the two often overlap, since stop words tend to be short words) and a plot:

(ft <- freq_terms(book, 10, at.least=3, stopwords=qdapDictionaries::Top25Words))
plot(ft)

##    WORD       FREQ
## 1  which      2075
## 2  would      1273
## 3  will       1257
## 4  not        1238
## 5  their      1098
## 6  states      864
## 7  may         839
## 8  government  830
## 9  been        798
## 10 state       792
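
Item 3 in the list above (ties) isn't shown in that run. A minimal sketch of what extend controls, assuming the documented default of keeping ties at the cutoff:

freq_terms(book, 10, extend = FALSE) # cut off at exactly 10 rows
freq_terms(book, 10, extend = TRUE)  # default: ties at the cutoff are included, so there may be more than 10 rows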

(Plot of the top ten terms, produced by plot(ft).)


Comments

  • MBK
    MBK almost 2 years

    I'm new to R and very new to regex. I looked for this in other discussions, but couldn't quite find the right match.

    I have a large data set of text (book). I've used the following code to delineate words within this text:

    > a <- gregexpr("[a-zA-Z0-9'\\-]+", book[1])
    
    > regmatches (book[1], a)
    [[1]]
    [1] "she" "runs"
    

    I now want to split all of the text from the whole dataset (book) into individual words so that I can determine what the top ten words are in the whole text (tokenize it). I'd then need to count the words using the table function and then sort somehow to get the top ten.

    Also, any thoughts on how to figure out the cumulative distribution, i.e. how many words would be needed to cover half (50%) of all of the words used? (A sketch of one approach follows these comments.)

    Thank you very much for your response and your patience with my basic questions.

  • Carl Witthoft
    Carl Witthoft over 9 years
    ya beat me to it. table to the rescue!
  • GSee
    GSee over 9 years
    yes, table, but you probably need a regex (or strsplit or something) and unlist too because I think book is a vector with several words per element.
  • MBK
    MBK over 9 years
    You are correct: book is a vector with several words per element. Based on these many great responses, it seems that I would do something like sort(table(book), decreasing=T); however, book still has several words per element (as you mentioned), so it needs to be broken down further. Alternatively, I thought to do sort(table(a), decreasing=T), as "a" is broken down by words, but then I got the error "Error in table(a) : all arguments must have the same length". I'm clearly missing something.
  • MBK
    MBK over 9 years
    I.e. how would I simplistically tokenize all of the individual lines?
  • GSee
    GSee over 9 years
    Given that " " is the most common "word", I think your pattern in strsplit should include a plus: "[[:space:]]+"
  • Thomas
    Thomas over 9 years
    @GSee Yes, yes it should.
  • rnso
    rnso over 9 years
    I am not sure how to do that.
  • lawyeR
    lawyeR over 9 years
    Nice. Can someone create a vector of two-word terms and obtain their frequency counts with qdap::freq_terms? For example, "law department" or "general counsel"?
  • Tyler Rinker
    Tyler Rinker over 9 years
    I think if you want counts then you're after termco (term count) and there you can do two-word terms.
  • Tyler Rinker
    Tyler Rinker over 9 years
    @lawyeR I hadn't originally included the ability to keep characters in the original freq_terms implementation but can see that it may be useful. In the dev version of qdap this feature will be enabled per your suggestion (credit in NEWS file to you). I still think you're after termco as you want a count of terms, not a list of top n terms.
  • lawyeR
    lawyeR over 9 years
    You honor me. Never had that happen; hope it helps others.
  • Thomas
    Thomas over 9 years
    +1 This is really nice.
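
The cumulative-distribution question from the first comment (how many distinct words cover half of all word occurrences) is not addressed in the answers. A minimal sketch in base R, assuming words is the flattened, lower-cased word vector built above:

freqs <- sort(table(words), decreasing=T) / length(words)  # each word's share of all occurrences
coverage <- cumsum(freqs)                                   # share covered by the top k words
which(coverage >= 0.5)[1]                                   # distinct words needed to reach 50%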