How to improve tesseract performance?

command-line image-processing ocr tesseract

5,201

Tesseract performs much better when it gets trained: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

In our work parsing over 50 million PDFs, we have found the following strategy effective:

1. From PNG-type files, try to identify the font being used.
2. Train Tesseract with a TTF version of the font (rather than a bitmap of the PNG image).
3. Run Tesseract with this new training.
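
As a rough sketch of how the training and run steps above might look when scripted, the Python below only builds the command lines. It assumes the `text2image` tool that ships with Tesseract's training tools and the standard `-l` language option. The language code `myfont`, the font name, and all file paths are hypothetical placeholders, not values from this answer. Note that `text2image` only renders the training image and box file from the font; the remaining training steps are described in the training guide linked above.

```python
import shlex
import shutil
import subprocess


def render_training_image_cmd(text_file, font_name, fonts_dir, lang="myfont"):
    """Build a text2image command that renders training text from a TTF font.

    `lang` is a hypothetical language code chosen for the custom training.
    """
    # Output base follows the [lang].[fontname].exp0 naming convention.
    outputbase = f"{lang}.{font_name.replace(' ', '')}.exp0"
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={outputbase}",
        f"--font={font_name}",
        f"--fonts_dir={fonts_dir}",
    ]


def ocr_cmd(image, outputbase, lang="myfont", tessdata_dir=None):
    """Build a tesseract command that uses the custom <lang>.traineddata."""
    cmd = ["tesseract", str(image), str(outputbase), "-l", lang]
    if tessdata_dir is not None:
        # Directory that contains <lang>.traineddata (assumed location).
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_if_available(cmd):
    """Run the command only if its executable is on PATH; otherwise skip."""
    if shutil.which(cmd[0]) is None:
        print("skipping, not installed:", cmd[0])
        return None
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, so the sketch has no side effects.
    print(shlex.join(render_training_image_cmd(
        "training_text.txt", "Times New Roman", "/usr/share/fonts")))
    print(shlex.join(ocr_cmd("page.png", "page")))
```

Passing either list to `run_if_available` would execute it once the tools are installed and the custom traineddata file is in place.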

We are automating #2 above, but there are online tools to identify a font. I would suggest: http://www.whatfontis.com/

This Stack Overflow question may also help.

Author by

katriel

Updated on September 18, 2022

Comments

katriel almost 2 years
By all accounts, tesseract is superb. However, my results are dismal. I need to convert text (digital, as opposed to scanned from a book) that I only have as a PNG. For instance:
```
   2 3 academics 1 1711
   2 3 Achlmbobelmann 211 191—2
   1 3 Aoqusmono|Food 1 171
   n 5 AFD.seeAgem:eFIan§asedeDével 1 (muessmm)
   3 4 allluence 211 I849
   81 5 Afnca 33:21 9.lZ3l.$50Z55&9l.93-4.9898100.II8r2D.IZ§£
```
This is from dark blue text against a white field. The original image can be found here. How can I do better?
- terdon over 10 years
  
  How are you running it? Please show the actual command line you used.
- katriel over 10 years
  
  I'm away from that computer at the moment, so I'm not sure, but I think I just wrote tesseract <inputfile> <outputfile>

Related

Image Preprocessing before OCR process

Open-CV - Not loading correctly

Tesseract OCR Library - Learning Font

How can I run tesseract with multiple languages one time?

Preprocessing image for Tesseract OCR with OpenCV

Getting the bounding box of the recognized words using python-tesseract

image processing to improve tesseract OCR accuracy

Is it possible to check orientation of an image before passing it through pytesseract ocr module

iOS Tesseract OCR Image Preperation

what's the best image input type for tesseract?