ARPA language model documentation


Solution 1

There is actually not much more to say about the format than what is said in those docs.

Beyond that, you'll probably want to prepare a text file with sample sentences and generate the language model from it. There is an online tool that can do this for you: lmtool

Solution 2

You can complement those docs with this tech report, which gives a comprehensive overview of smoothing for language modeling: http://www.ee.columbia.edu/~stanchen/papers/h015a-techreport.pdf There you will also find definitions of back-off and interpolated models.
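As a rough orientation before reading the report (my own summary in standard notation, not formulas copied from it): a back-off model only falls down to the (n-1)-gram estimate when the n-gram was unseen, while an interpolated model always mixes the orders.

```latex
% Back-off (Katz-style): use the discounted higher-order estimate when
% the n-gram was seen in training, otherwise scale the lower-order one.
P_{\text{BO}}(w_i \mid w_{i-n+1}^{i-1}) =
  \begin{cases}
    P^{*}(w_i \mid w_{i-n+1}^{i-1})
      & \text{if } c(w_{i-n+1}^{i}) > 0,\\[4pt]
    \alpha(w_{i-n+1}^{i-1}) \, P_{\text{BO}}(w_i \mid w_{i-n+2}^{i-1})
      & \text{otherwise.}
  \end{cases}

% Interpolation: always mix the higher- and lower-order estimates.
P_{\text{I}}(w_i \mid w_{i-n+1}^{i-1}) =
  \lambda \, P_{\text{ML}}(w_i \mid w_{i-n+1}^{i-1})
  + (1 - \lambda) \, P_{\text{I}}(w_i \mid w_{i-n+2}^{i-1})
```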

Solution 3

I'm probably very late to answer this, but I found the ARPA LM format well documented in this link from The HTK Book by Steve Young et al.

Each n-gram entry line in an ARPA file is a triple that stores:

the n-gram's log probability (base 10); the n-gram itself; and a back-off weight (also in log10 space), which is omitted for the highest-order n-grams.
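To make that concrete, here is a small Python sketch (my own illustration; the embedded fragment is hand-made, not taken from any real model) that splits such entry lines into their three fields:

```python
# A minimal, hand-made ARPA fragment (illustrative, not from a real model).
# Entry lines are: log10 prob <TAB> n-gram [<TAB> log10 back-off weight].
SAMPLE = (
    "\\data\\\n"
    "ngram 1=3\n"
    "ngram 2=2\n"
    "\n"
    "\\1-grams:\n"
    "-1.0000\t<s>\t-0.3010\n"
    "-0.6990\thello\t-0.3010\n"
    "-1.0000\t</s>\n"
    "\n"
    "\\2-grams:\n"
    "-0.3010\t<s> hello\n"
    "-0.3010\thello </s>\n"
    "\n"
    "\\end\\\n"
)

def parse_entry(line):
    """Split one entry line into (log10 prob, n-gram tuple, log10 back-off)."""
    fields = line.split("\t")
    logprob = float(fields[0])
    ngram = tuple(fields[1].split())
    # A missing back-off weight means log10(1) = 0 (highest-order entries).
    backoff = float(fields[2]) if len(fields) > 2 else 0.0
    return logprob, ngram, backoff

for line in SAMPLE.splitlines():
    if "\t" in line:  # only the n-gram entry lines are tab-separated
        print(parse_entry(line))
```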



Comments

  • Lukasz almost 2 years

    Where can I find documentation on the ARPA language model format?

    I am developing a simple speech recognition app with the PocketSphinx STT engine. ARPA is recommended there for performance reasons. I want to understand how much I can do to adjust my language model for my custom needs.

    All I have found are some very brief descriptions of the ARPA format.

    I am a beginner to STT and have trouble wrapping my head around this (n-grams, etc.). I am looking for more detailed docs, something like the documentation on the JSGF grammar here:

    http://www.w3.org/TR/jsgf/

  • Lukasz about 11 years
    Still, it uses some kind of n-grams, backoff, etc. What are those and where can I find more info about them?
  • Dariusz about 11 years
    @Lukasz What is an n-gram? A sequence of N words. Backoff is optional. And the probability is on a log10 scale, as far as I remember.
  • 0x5050 about 4 years
    Backoff is a way to estimate the probability of an n-gram that was unseen during training. It basically backs off to a lower-order n-gram when a higher-order n-gram is not in the LM, e.g. falls back to the 2-gram if the encountered 3-gram is not present. The back-off weight makes sure the resulting distribution is a true probability, i.e. sums to 1. A minimal sketch of this lookup follows below.
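Here is a minimal sketch of that back-off lookup (my own illustration, not PocketSphinx's actual code), assuming the ARPA entries have already been read into two dictionaries keyed by word tuples:

```python
# Minimal back-off lookup sketch (illustrative; not PocketSphinx's code).
# logprob / backoff map word tuples to log10 values read from an ARPA file.
def score(words, logprob, backoff):
    """Return log10 P(words[-1] | words[:-1]) with back-off."""
    words = tuple(words)
    if words in logprob:
        return logprob[words]
    if len(words) == 1:
        return float("-inf")  # true OOV; real models use an <unk> entry
    # Back off: weight of the history context plus the lower-order score.
    # A missing back-off weight means log10(1) = 0.
    context = words[:-1]
    return backoff.get(context, 0.0) + score(words[1:], logprob, backoff)

# Tiny hand-made example (all values are log10):
lp = {("hello",): -0.7, ("world",): -1.0, ("hello", "world"): -0.3}
bo = {("hello",): -0.2}
print(score(("hello", "world"), lp, bo))  # seen bigram: -0.3
print(score(("world", "hello"), lp, bo))  # unseen: 0.0 + lp[("hello",)] = -0.7
```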