Using R to find top ten words in a text
Solution 1
To get word frequency:
> mytext = c("This","is","a","test","for","count","of","the","words","The","words","have","been","written","very","randomly","so","that","the","test","can","be","for","checking","the","count")
> sort(table(mytext), decreasing=T)
mytext
the count for test words a be been can checking have is of randomly so that The This very
3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
written
1
To ignore case:
> mytext = tolower(mytext)
>
> sort(table(mytext), decreasing=T)
mytext
the count for test words a be been can checking have is of randomly so that this very written
4 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
For top ten words only:
> sort(table(mytext), decreasing=T)[1:10]
mytext
the count for test words a be been can checking
4 2 2 2 2 1 1 1 1 1
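If your text is a vector with several words per element (as the asker's book object is, per the comments below), it has to be flattened into individual words before table() can count them. A minimal sketch, assuming book is such a character vector; the whitespace split pattern is one reasonable choice:
# Split each element on whitespace and flatten into one long word vector
words <- unlist(strsplit(tolower(book), "[[:space:]]+"))
words <- words[words != ""]   # drop empty strings the split can leave behind
# Then the same idea as above: tabulate, sort, take the ten most frequent
sort(table(words), decreasing = TRUE)[1:10]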
Solution 2
You can use regex for this, but using a text-mining package will give you a lot more flexibility. For example, to do a basic word-separation, you simply do the following:
u <- "http://www.gutenberg.org/cache/epub/1404/pg1404.txt"
library("httr")
book <- httr::content(GET(u))
w <- strsplit(book, "[[:space:]]+")[[1]]
tail(sort(table(w)), 10)
# w
# which is that be a in and to of the
# 1968 1995 2690 3766 3881 4184 4943 6905 11896 16726
But if you want to, for example, remove common stop words or handle capitalization better (in the above, Hello and hello are not counted together), you should dig into tm:
library("tm")
s <- URISource(u)
corpus <- VCorpus(s)
m <- DocumentTermMatrix(corpus)
findFreqTerms(m, 600) # words appearing at least 600 times
# "all" "and" "are" "been" "but" "for" "from" "have" "its" "may"
# "not" "that" "the" "their" "they" "this" "which" "will" "with" "would"
c2 <- tm_map(corpus, removeWords, stopwords("english"))
m2 <- DocumentTermMatrix(c2)
findFreqTerms(m2, 400) # words appearing at least 400 times
# [1] "can" "government" "may" "must" "one" "power" "state" "the" "will"
Solution 3
Not regex, but this may be more of what you're after, with less fuss. Here's a qdap approach using Thomas's data (PS: nice data approach):
u <- "http://www.gutenberg.org/cache/epub/1404/pg1404.txt"
library("httr")
book <- httr::content(GET(u))
library(qdap)
freq_terms(book, 10)
## WORD FREQ
## 1 the 18195
## 2 of 12015
## 3 to 7177
## 4 and 5191
## 5 in 4518
## 6 a 4051
## 7 be 3846
## 8 that 2800
## 9 it 2565
## 10 is 2218
This has the advantage that you can control:
- stopwords, with the stopwords argument
- minimum word length, with at.least
- how ties are handled, with extend = TRUE (the default)
and the output has a plot method.
Here it is again with stop words and a minimum word length set (these two filters often overlap, since stop words tend to be short words), along with a plot:
(ft <- freq_terms(book, 10, at.least=3, stopwords=qdapDictionaries::Top25Words))
plot(ft)
## WORD FREQ
## 1 which 2075
## 2 would 1273
## 3 will 1257
## 4 not 1238
## 5 their 1098
## 6 states 864
## 7 may 839
## 8 government 830
## 9 been 798
## 10 state 792
Comments
- MBK almost 2 years:
  I'm new to R and very new to regex. I looked for this in other discussions, but couldn't quite find the right match.
  I have a large data set of text (a book). I've used the following code to delineate words within this text:
  > a <- gregexpr("[a-zA-Z0-9'\\-]+", book[1])
  > regmatches(book[1], a)
  [[1]]
  [1] "she" "runs"
  I now want to split all of the text from the whole dataset (book) into individual words so that I can determine what the top ten words are in the whole text (i.e. tokenize it). I'd then need to count the words using the table function and then sort somehow to get the top ten.
  Also, any thoughts on how to figure out the cumulative distribution, i.e. how many words would be needed to cover half (50%) of all of the words used? (A sketch addressing this follows these comments.)
  Thank you very much for your response and your patience with my basic questions.
- Carl Witthoft over 9 years: ya beat me to it. table to the rescue!
- GSee over 9 years: yes, table, but you probably need a regex (or strsplit or something) and unlist too, because I think book is a vector with several words per element.
- MBK over 9 years: You are correct: book is a vector with several words per element. Based on these many great responses, it seems that I would do something like sort(table(book), decreasing=T); however, book still has several words per element (as you mentioned), so it needs to be broken down further. Alternatively, I thought to do sort(table(a), decreasing=T), as "a" is broken down by words, but then I got the error "Error in table(a) : all arguments must have the same length". I'm clearly missing something.
- MBK over 9 years: I.e. how would I simplistically tokenize all of the individual lines?
- GSee over 9 years: Given that " " is the most common "word", I think your pattern in strsplit should include a plus: "[[:space:]]+"
- Thomas over 9 years: @GSee Yes, yes it should.
- rnso over 9 years: I am not sure how to do that.
- lawyeR over 9 years: Nice. Can someone create a vector of two-word terms and obtain their frequency counts with qdap::freq_terms? For example, "law department" or "general counsel"?
- Tyler Rinker over 9 years: I think if you want counts then you're after termco (term count), and there you can do two-word terms.
- Tyler Rinker over 9 years: @lawyeR I hadn't originally included the ability to keep characters in the original freq_terms implementation but can see that it may be useful. In the dev version of qdap this feature will be enabled per your suggestion (credit in the NEWS file to you). I still think you're after termco, as you want a count of terms, not a list of top n terms.
- lawyeR over 9 years: You honor me. Never had that happen; hope it helps others.
- Thomas over 9 years: +1 This is really nice.
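To address the asker's follow-up about the cumulative distribution: the same sorted frequency table used in the solutions above can be accumulated to find how many distinct words account for half of all word occurrences. A minimal base-R sketch, assuming w is a tokenized word vector such as the one built in Solution 2:
counts <- sort(table(w), decreasing = TRUE)   # word frequencies, most frequent first
coverage <- cumsum(counts) / sum(counts)      # cumulative share of all word occurrences
which.max(coverage >= 0.5)                    # rank of the first word at which coverage reaches 50%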