ARPA language model documentation
Solution 1
There is actually not much more to say about the format than what is said in those docs.
Besides, you'll probably want to prepare a text file with sample sentences and generate the language model from it. There is an online tool that can do this for you: lmtool.
Solution 2
You can complement those docs with this tech report, which gives a comprehensive overview of smoothing for language modeling: http://www.ee.columbia.edu/~stanchen/papers/h015a-techreport.pdf It also includes definitions of backoff models and interpolated models.
Solution 3
I'm probably very late to answer this, but I found the ARPA LM format well documented in this link from The HTK Book by Steve Young et al.
Each n-gram line in an ARPA file is a triple that stores:
the n-gram's log probability (base 10); the n-gram itself; and an optional back-off weight (also in log base 10), which is omitted for the highest-order n-grams.
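To make the triple concrete, here is a minimal sketch (not a full ARPA parser) that reads the 2-gram section of a tiny, made-up ARPA fragment; the snippet contents and probabilities are illustrative only:

```python
# Each data line is: log10-prob <tab> n-gram [<tab> back-off weight].
# The back-off weight column may be absent (e.g. for "like tea" below).
arpa_snippet = """\
\\2-grams:
-0.2553\t<s> i\t-0.3010
-0.4161\ti like\t-0.1249
-0.7782\tlike tea
\\end\\
"""

bigrams = {}
for line in arpa_snippet.splitlines():
    if line.startswith("\\"):  # skip section markers like \2-grams: and \end\
        continue
    fields = line.split("\t")
    logprob = float(fields[0])
    ngram = tuple(fields[1].split())
    backoff = float(fields[2]) if len(fields) == 3 else 0.0
    bigrams[ngram] = (logprob, backoff)

print(bigrams[("i", "like")])    # (-0.4161, -0.1249)
print(bigrams[("like", "tea")])  # (-0.7782, 0.0)
```

Note that a missing back-off weight is conventionally treated as 0.0 in log space (i.e. a weight of 1).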
Lukasz
Updated on June 04, 2022

Comments
-
Lukasz almost 2 years
Where can I find documentation on ARPA language model format?
I am developing a simple speech recognition app with the PocketSphinx STT engine. ARPA is recommended there for performance reasons. I want to understand how much I can do to adjust my language model for my custom needs.
All I found is some very brief ARPA format descriptions:
- http://kered.org/blog/2008-08-12/arpa-language-model-file-format/
- http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html
- http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html
I am a beginner in STT and have trouble wrapping my head around this (n-grams, etc.). I am looking for more detailed docs, something like the documentation on JSGF grammar here:
-
sbharti: Have a look at this MSDN link; the ARPA and args formats are well explained there: Compile Grammar Input and Output File Format
-
Lukasz about 11 years: Still, it uses some kind of n-grams, backoff, etc. What are those, and where can I find more info about them?
-
Dariusz about 11 years: @Lukasz What is an n-gram? A sequence of N words. Backoff is optional. And the probability is in log base-10 scale, as far as I remember.
-
0x5050 about 4 years: Backoff is a way to estimate the probability of an n-gram unseen during training. It basically backs off to a lower-order n-gram if the higher-order n-gram is not in the LM, e.g. back off to the 2-gram if the encountered 3-gram is not present. The back-off weight ensures that the distribution remains a true probability, i.e. sums to 1.
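The back-off rule in that comment can be sketched as follows. This is a toy illustration with made-up log10 probabilities and back-off weights, not a real model: if the full n-gram is present, use its log probability; otherwise add the back-off weight of the shortened history and recurse on a lower-order n-gram.

```python
# Hypothetical log10 probabilities and back-off weights (invented numbers).
logprob = {("i",): -1.0, ("like",): -1.2, ("tea",): -1.5,
           ("i", "like"): -0.4}
backoff = {("i",): -0.3, ("like",): -0.2}

def log_p(word, history):
    """log10 P(word | history) with back-off to lower-order n-grams."""
    ngram = history + (word,)
    if ngram in logprob:
        return logprob[ngram]          # n-gram found: use it directly
    if not history:
        return logprob[(word,)]        # bottomed out at the unigram
    bow = backoff.get(history, 0.0)    # 0.0 (log space) if no weight stored
    return bow + log_p(word, history[1:])

print(log_p("like", ("i",)))   # -0.4: bigram "i like" found directly
print(log_p("tea", ("like",))) # -1.7: backed off, -0.2 + unigram -1.5
```

Because everything is in log base 10, the back-off weight is added to the lower-order log probability rather than multiplied.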