nltk language model (ngram) calculate the prob of a word from context
Solution 1
Quick fix:
print lm.prob("word", ["This is a context which generates a word"])
# => 0.00493261081006
Solution 2
I know this question is old but it pops up every time I google nltk's NgramModel class. NgramModel's prob implementation is a little unintuitive. The asker is confused. As far as I can tell, the answers aren't great. Since I don't use NgramModel often, this means I get confused. No more.
The source code lives here: https://github.com/nltk/nltk/blob/master/nltk/model/ngram.py. Here is the definition of NgramModel's prob method:
def prob(self, word, context):
    """
    Evaluate the probability of this word in this context using Katz Backoff.

    :param word: the word to get the probability of
    :type word: str
    :param context: the context the word is in
    :type context: list(str)
    """
    context = tuple(context)
    if (context + (word,) in self._ngrams) or (self._n == 1):
        return self[context].prob(word)
    else:
        return self._alpha(context) * self._backoff.prob(word, context[1:])
(note: 'self[context].prob(word)' is equivalent to 'self._model[context].prob(word)')
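To make the backoff recursion concrete, here is a toy, self-contained sketch with plain count dictionaries standing in for NLTK's ProbDist machinery. The fixed alpha=0.4 is a hypothetical placeholder discount weight, not the real Katz alpha that `_alpha` computes:

```python
# Toy sketch of the backoff recursion in prob() above. The counts
# table mixes ngrams of every order; alpha=0.4 is a stand-in for
# the real Katz backoff weight, chosen only for illustration.
def backoff_prob(word, context, counts, alpha=0.4):
    context = tuple(context)
    if context + (word,) in counts or len(context) == 0:
        # relative frequency of `word` among continuations of `context`
        total = sum(c for ng, c in counts.items() if ng[:-1] == context)
        return counts.get(context + (word,), 0) / total if total else 0.0
    # unseen ngram: discount and back off to a shorter context
    return alpha * backoff_prob(word, context[1:], counts, alpha)

counts = {('the', 'rain'): 1, ('rain', 'in'): 1, ('in', 'spain'): 1,
          ('the',): 1, ('rain',): 1, ('in',): 1, ('spain',): 1}
print(backoff_prob('rain', ['the'], counts))  # seen bigram -> 1.0
print(backoff_prob('rain', ['of'], counts))   # backs off: 0.4 * 1/4 = 0.1
```

The shape mirrors the method: hit the full ngram if it was seen, otherwise multiply a discount weight by the probability under a context shortened from the left.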
Okay. Now at least we know what to look for. What does context need to be? Let's look at an excerpt from the constructor:
for sent in train:
    for ngram in ingrams(chain(self._lpad, sent, self._rpad), n):
        self._ngrams.add(ngram)
        context = tuple(ngram[:-1])
        token = ngram[-1]
        cfd[context].inc(token)

if not estimator_args and not estimator_kwargs:
    self._model = ConditionalProbDist(cfd, estimator, len(cfd))
else:
    self._model = ConditionalProbDist(cfd, estimator, *estimator_args, **estimator_kwargs)
Alright. The constructor creates a conditional probability distribution (self._model) out of a conditional frequency distribution whose conditions ("contexts") are tuples of unigrams. This tells us 'context' should not be a string or a list containing a single multi-word string. 'context' MUST be an iterable of unigrams. In fact, the requirement is a little stricter: these tuples or lists must have exactly n-1 elements. Think of it this way: you told it to be a trigram model, so you'd better give it the appropriate context for trigrams.
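In other words, the constructor walks each (padded) sentence, slices every ngram into an (n-1)-tuple context plus a final token, and counts the pairs. A minimal pure-Python sketch of that split, ignoring padding for brevity:

```python
# Minimal sketch of the (context, token) split performed in the
# constructor loop above; sentence padding is omitted for brevity.
def context_token_pairs(tokens, n):
    pairs = []
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        pairs.append((ngram[:-1], ngram[-1]))  # context is always n-1 unigrams
    return pairs

print(context_token_pairs('the rain in spain'.split(), 3))
# [(('the', 'rain'), 'in'), (('rain', 'in'), 'spain')]
```

The lookup key is a tuple of individual unigrams, never one multi-word string, which is why a single joined string as context can never match anything.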
Let's see this in action with a simpler example:
>>> import nltk
>>> obs = 'the rain in spain falls mainly in the plains'.split()
>>> lm = nltk.NgramModel(2, obs, estimator=nltk.MLEProbDist)
>>> lm.prob('rain', 'the') #wrong
0.0
>>> lm.prob('rain', ['the']) #right
0.5
>>> lm.prob('spain', 'rain in') #wrong
0.0
>>> lm.prob('spain', ['rain in']) #wrong
'''long exception'''
>>> lm.prob('spain', ['rain', 'in']) #right
1.0
(As a side note, actually trying to do anything with MLE as your estimator in NgramModel is a bad idea. Things will fall apart. I guarantee it.)
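The reason, in one line of arithmetic: MLE assigns all of a seen context's probability mass to the words actually observed in it, so the leftover mass that the backoff weight alpha is supposed to redistribute is zero:

```python
# Why MLE + Katz backoff falls apart: in a seen context, MLE puts
# all probability mass on the observed continuations, leaving zero
# mass for alpha to redistribute to unseen words.
mle_probs = {'rain': 1.0}  # MLE estimates for the context ('the',) above
leftover = 1.0 - sum(mle_probs.values())
print(leftover)  # 0.0 -> unseen words can only ever get probability 0
```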
As for the original question, I suppose my best guess at what OP wants is this:
print lm.prob("word", "generates a".split())
print lm.prob("b", "generates a".split())
...but there are so many misunderstandings going on here that I can't possibly tell what he was actually trying to do.
Solution 3
As regards your second question: this happens because "b" doesn't occur in the news category of the Brown corpus, as you can verify with:
>>> 'b' in brown.words(categories='news')
False
whereas
>>> 'word' in brown.words(categories='news')
True
I admit the error message is very cryptic, so you might want to file a bug report with the NLTK authors.
Solution 4
I would stay away from NLTK's NgramModel for the time being. There is currently a smoothing bug that causes the model to greatly overestimate likelihoods when n>1. If you do end up using NgramModel, you should definitely apply the fix mentioned in the git issue tracker here: https://github.com/nltk/nltk/issues/367
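Since NLTK 3.3 dropped NgramModel entirely, the replacement lives in the nltk.lm module. A rough sketch of the equivalent bigram model, assuming NLTK 3.4 or later is installed:

```python
# Sketch of the nltk.lm replacement for NgramModel (NLTK 3.4+),
# reusing the "the rain in spain" example from Solution 2.
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

sents = ['the rain in spain falls mainly in the plains'.split()]
train, vocab = padded_everygram_pipeline(2, sents)  # order-2 (bigram) model

lm = MLE(2)
lm.fit(train, vocab)
print(lm.score('rain', ['the']))  # 0.5, matching lm.prob('rain', ['the']) above
```

Note that context is still a list of unigrams; the new API just calls the method score instead of prob.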
Huang Yen-Chieh
Updated on April 11, 2020

Comments
-
Huang Yen-Chieh about 4 years
I am using Python and NLTK to build a language model as follows:
from nltk.corpus import brown
from nltk.probability import LidstoneProbDist, WittenBellProbDist

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), estimator)

# Thanks to miku, I fixed this problem
print lm.prob("word", ["This is a context which generates a word"])
>> 0.00493261081006

# But I got another problem like this one...
print lm.prob("b", ["This is a context which generates a word"])
But it doesn't seem to work. The result is as follows:
>>> print lm.prob("word", "This is a context which generates a word")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 79, in prob
    return self._alpha(context) * self._backoff.prob(word, context[1:])
  File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 79, in prob
    return self._alpha(context) * self._backoff.prob(word, context[1:])
  File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 82, in prob
    "context %s" % (word, ' '.join(context)))
TypeError: not all arguments converted during string formatting
Can anyone help me out? Thanks!
-
Huang Yen-Chieh almost 13 years
But I got another problem... why do print lm.prob("word", ["word"]), print lm.prob("word", ["word word word"]), and print lm.prob("word", ["this"]) all generate exactly the same probability? All are 0.00493261081006...
-
miku almost 13 years
@Austin, sorry, I am short on time, so I can't go into the details right now - maybe later.
-
Huang Yen-Chieh almost 13 years
Thanks! I agree that the error should not happen in this way, so I will file a bug report with NLTK. Thanks anyway.
-
IM94 almost 6 years
What Python library would you instead recommend using? This is especially keeping in mind that the current version of NLTK (3.3) no longer has NgramModel.