How to improve tesseract performance?


Tesseract performs much better when it gets trained: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

In our work parsing over 50 million PDFs, we have found that the following strategy works well:

  1. For PNG-type files, try to identify the font being used.
  2. Train Tesseract with a TTF version of that font (rather than with a bitmap taken from the PNG image).
  3. Run Tesseract with this new training data.
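Steps 2 and 3 could be scripted roughly as below. This is only a sketch, not our actual pipeline: it assumes a Tesseract build that ships the `text2image` training tool and accepts `--tessdata-dir` (both are absent from older 3.0x releases, as far as I know), and the file names, the font name and the language code `myfont` are made-up placeholders. The functions only build the command lines; the intermediate training steps that turn the rendered image into a `.traineddata` file are described on the wiki page linked above and are not shown.

```python
import shutil
import subprocess


def text2image_cmd(text_file, font_name, fonts_dir, lang="eng", exp=0):
    """Build a text2image command that renders training text in a TTF font.

    The output base follows the lang.fontname.expN naming scheme used in
    the Tesseract training documentation.
    """
    base = f"{lang}.{font_name.replace(' ', '')}.exp{exp}"
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={base}",
        f"--font={font_name}",
        f"--fonts_dir={fonts_dir}",
    ]


def tesseract_cmd(image, output_base, lang, tessdata_dir=None):
    """Build a tesseract command that uses a specific trained language."""
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        # Assumption: the custom .traineddata file lives in this directory.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_if_available(cmd):
    """Run cmd only if its executable is on PATH; return None otherwise."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True)
```

For example, `run_if_available(tesseract_cmd("page.png", "page", "myfont", "./tessdata"))` would run step 3 against a hypothetical `./tessdata/myfont.traineddata`; Tesseract appends `.txt` to the output base, so the text would end up in `page.txt`.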

We are automating #2 above. For identifying the font, there are online tools; I would suggest http://www.whatfontis.com/

This Stack Overflow question may also help.


Author by

katriel

Updated on September 18, 2022

Comments

  • katriel
    katriel almost 2 years

    By all accounts, Tesseract is superb. However, my results are dismal. I need to convert text (digitally rendered, as opposed to scanned from a book) that I only have as a PNG. For instance:

       2 3 academics 1 1711
       2 3 Achlmbobelmann 211 191—2
       1 3 Aoqusmono|Food 1 171
       n 5 AFD.seeAgem:eFIan§asedeDével 1 (muessmm)
       3 4 allluence 211 I849
       81 5 Afnca 33:21 9.lZ3l.$50Z55&9l.93-4.9898100.II8r2D.IZ§£
    

    This is from dark blue text against a white field. The original image can be found here. How can I do better?

    • terdon
      terdon over 10 years
      How are you running it? Please show the actual command line you used.
    • katriel
      katriel over 10 years
      I'm away from that computer at the moment, so I'm not sure, but I think I just ran tesseract <inputfile> <outputfile>