How to improve tesseract performance?
Tesseract performs much better when it gets trained: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
In our work parsing over 50 million PDFs, we have found the following strategy effective:
- From PNG-type files, try to identify the font being used.
- Train Tesseract with a TTF version of the font (rather than a bitmap taken from the PNG image).
- Run Tesseract with this new training data.
We are automating the second step above, but there are online tools to identify a font. I would suggest: http://www.whatfontis.com/
This Stack Overflow question may also help.
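As a rough sketch of the last step, running Tesseract with the newly trained data could be scripted as below. The language name `myfont`, the file names, and both helper functions are hypothetical and only for illustration. The sketch assumes the training step has already produced a `myfont.traineddata` file, and that the installed Tesseract version accepts the `--tessdata-dir` option alongside `-l`.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang, tessdata_dir=None):
    """Build the argument list for one Tesseract run with a chosen language.

    `lang` is the name of the trained data, e.g. "myfont" for a
    (hypothetical) myfont.traineddata file.
    """
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        # Assumption: this Tesseract build supports --tessdata-dir, which
        # names the folder that holds <lang>.traineddata.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_ocr(image, output_base, lang, tessdata_dir=None):
    """Run Tesseract and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    cmd = build_tesseract_command(image, output_base, lang, tessdata_dir)
    subprocess.run(cmd, check=True)
    # Tesseract writes its plain-text result to <output_base>.txt
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Only prints the command; call run_ocr(...) to actually invoke Tesseract.
    print(" ".join(build_tesseract_command("page.png", "page", "myfont")))
```

Keeping the command construction separate from the call makes it easy to check the exact command line before running it over a large batch of files.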
Author: katriel
Updated on September 18, 2022
Comments
- katriel almost 2 years
By all accounts, tesseract is superb. However, my results are dismal. I need to convert digital text (as opposed to text scanned from a book) that I only have as a PNG. For instance, this is the output I get:
2 3 academics 1 1711 2 3 Achlmbobelmann 211 191—2 1 3 Aoqusmono|Food 1 171 n 5 AFD.seeAgem:eFIan§asedeDével 1 (muessmm) 3 4 allluence 211 I849 81 5 Afnca 33:21 9.lZ3l.$50Z55&9l.93-4.9898100.II8r2D.IZ§£
This is from dark blue text against a white field. The original image can be found here. How can I do better?
- terdon over 10 years: How are you running it? Please show the actual command line you used.
- katriel over 10 years: I'm away from that computer at the moment, so I'm not sure, but I think I just wrote
tesseract <inputfile> <outputfile>