How to find collocations in text, python
Solution 1
Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:
>>> import nltk
>>> def tokenize(sentences):
...     for sent in nltk.sent_tokenize(sentences.lower()):
...         for word in nltk.word_tokenize(sent):
...             yield word
...
>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
There are none in this small segment, but here goes:
>>> text.collocations(num=20)
Building collocations list
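For completeness, here is a minimal sketch of using BigramCollocationFinder directly on a list of tokens, which avoids going through nltk.Text. The token list is invented for illustration, and the frequency threshold of 2 and the choice of PMI as the scoring measure are assumptions rather than recommendations; a real corpus would need to be much larger.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy token list, made up for illustration only.
words = ("new york is big . i love new york . "
         "the city is big and the city is new").split()

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)  # drop bigrams seen fewer than 2 times

# Rank the remaining bigrams by pointwise mutual information; keep at most 10.
top = finder.nbest(BigramAssocMeasures.pmi, 10)
print(top)
print(len(top))
```

Because nbest returns a plain list of tuples, slicing it or taking its length works as with any other list.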
Solution 2
Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.
words = ["more", "is", "said", "than", "done", "is", "said"]
words_iter = iter(words)
next(words_iter, None)  # advance the second iterator by one word
count = {}
for bigram in zip(words, words_iter):
    count[bigram] = count.get(bigram, 0) + 1
print(sorted(((c, b) for b, c in count.items()), reverse=True))
(words_iter is introduced to avoid copying the whole list of words, as you would do in zip(words, words[1:]).)
Solution 3
from collections import Counter
words = ['more', 'is', 'said', 'than', 'done']
nextword = iter(words)
next(nextword, None)  # skip the first word so zip pairs each word with its successor
freq = Counter(zip(words, nextword))
print(freq)
Solution 4
A collocation is a sequence of tokens that is better treated as a single token when parsing, e.g. "red herring" has a meaning that can't be derived from its components. Deriving a useful set of collocations from a corpus involves ranking the n-grams by some statistic (n-gram frequency, mutual information, log-likelihood, etc.) followed by judicious manual editing.
Points that you appear to be ignoring:
(1) The corpus must be rather large. Attempting to get collocations from one sentence, as you appear to suggest, is pointless.
(2) n can be greater than 2, e.g. analysing texts written about 20th century Chinese history will throw up "significant" bigrams like "Mao Tse" and "Tse Tung", which are really fragments of the trigram "Mao Tse Tung".
What are you actually trying to achieve? What code have you written so far?
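As a rough illustration of the ranking step described in this answer, the sketch below scores bigrams by pointwise mutual information using only the standard library. The function name rank_bigrams_by_pmi, the min_count parameter and the toy token list are all hypothetical, and PMI is just one of the statistics mentioned above; the output would still need manual review.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_count=2):
    """Return [((w1, w2), pmi), ...] sorted from highest to lowest PMI."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of bigram positions
    scored = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue  # rare bigrams give unreliable PMI estimates
        p_xy = c / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_xy / (p_x * p_y))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


# Toy token list, made up for illustration only.
tokens = ("new york is big . i love new york . "
          "the city is big and the city is new").split()
ranked = rank_bigrams_by_pmi(tokens)
for bigram, score in ranked:
    print(bigram, round(score, 2))
```

The min_count filter matters because PMI tends to overrate pairs that occur only once, which is one reason a large corpus is needed.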
Solution 5
I agree with Tim McNamara on using NLTK and on the problems with Unicode. However, I like the Text class a lot. There is a hack you can use to get the collocations as a list; I discovered it by looking at the source code. Apparently, whenever you invoke the collocations method, it stores the result on the object as the private attribute _collocations:
import nltk
def tokenize(sentences):
    for sent in nltk.sent_tokenize(sentences.lower()):
        for word in nltk.word_tokenize(sent):
            yield word
text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
text.collocations(num=20)
collocations = [" ".join(el) for el in list(text._collocations)]
Enjoy!
Gusto
Updated on June 11, 2022
Comments
-
Gusto almost 2 years
How do you find collocations in text? A collocation is a sequence of words that occurs together unusually often. Python has a built-in function, bigrams, that returns word pairs.
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
What's left is to find the bigrams that occur more often, based on the frequency of the individual words. Any ideas how to put this into code?
-
Björn Pollex over 13 years
You would have to define "more often". Do you mean statistical significance?
-
Glenn Maynard over 13 years
Python has no such builtin, nor anything by that name in the standard library.
-
Spike Gronim over 13 years
Use the nltk library for this: nltk.googlecode.com/svn/trunk/doc/api/…
-
Katriel over 13 years
-
-
Gusto over 13 years
Good work, but your code is for another purpose. I just need the collocations (without any count or similar). In the end I will need to return the top 10 collocations (collocations[:10]) and the total number of them using len(collocations).
-
Sven Marnach over 13 years
You did not define well what you actually want. Maybe give some example output for some example input.
-
Gusto over 13 years
Is it able to work on Unicode text? I got an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-8: ordinal not in range(128)
-
Tim McNamara over 13 years
Unicode works fine for most operations. nltk.Text may have issues, because it's just a helper class written for teaching linguistics students, and it gets caught out sometimes. It's mainly for demonstration purposes.