What OCR options exist beyond Tesseract?

php python ruby ocr tesseract

14,577

Solution 1

I have successfully used GOCR in the past for small image OCR. I would say accuracy was around 85%, after getting the grayscale options set properly, on fairly regular fonts. It fails miserably when the fonts get complicated and has trouble with multiline layouts.
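As a rough sketch of scripting GOCR from Ruby: the helper below only builds and runs a gocr command line. The options `-i` (input file), `-l` (grey-level threshold) and `-C` (character filter) are the ones listed by `gocr -h`; the method names, keyword arguments and the sample values are made up for illustration, and the right threshold will depend on your images.

```ruby
require 'open3'

# Build the argument list for a gocr call.
# -i, -l and -C are gocr options (see `gocr -h`); the method name and
# keyword arguments are hypothetical.
def gocr_command(image, gray_level: nil, charset: nil)
  cmd = ['gocr', '-i', image]
  cmd += ['-l', gray_level.to_s] if gray_level # grey-level threshold, 0..255
  cmd += ['-C', charset] if charset            # restrict recognized characters
  cmd
end

# Run gocr and return the recognized text with surrounding whitespace removed.
# Passing the command as an array avoids shell-quoting problems.
def gocr_read(image, **opts)
  out, status = Open3.capture2(*gocr_command(image, **opts))
  raise "gocr failed on #{image}" unless status.success?
  out.strip
end

# Example call (assumes gocr is installed and file.pnm exists):
#   gocr_read('file.pnm', gray_level: 160, charset: '0-9A-Z')
```

Restricting the character set with `-C` may help on short strings in a regular font, but I have not measured how much.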

Also have a look at Ocropus, which is maintained by Google. It's related to Tesseract, but from what I understand, its OCR engine is different. With just the default models included, it achieves near 99% accuracy on high-quality images, handles layout pretty well, and provides HTML output with formatting and line information. However, in my experience, its accuracy drops sharply when the image quality is poor. That being said, training is relatively simple, and you might want to give it a try.

Both of them are easy to call from the command line. GOCR usage is very straightforward; just type gocr -h and you should have all the information you need. Ocropus is a bit trickier; here's a usage example, in Ruby:

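require 'fileutils'

tmp  = 'directory'  # working directory for the intermediate files
file = 'file.png'   # image to recognize

FileUtils.mkdir_p(tmp)  # make sure the working directory exists

# Run the Ocropus steps in order; the last one writes the HTML result.
`ocropus book2pages #{tmp}/out #{file}`
`ocropus pages2lines #{tmp}/out`
`ocropus lines2fsts #{tmp}/out`
`ocropus buildhtml #{tmp}/out > #{tmp}/output.html`

text = File.read("#{tmp}/output.html")
FileUtils.rm_rf(tmp)  # remove the intermediate files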
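Note that `text` above holds HTML, not plain text. If you only need the recognized strings, something like the following could reduce it to one string per line. This is a minimal regex-based sketch, not a real HTML parser: `html_to_text` and `ENTITIES` are hypothetical names, only five entities are decoded, and the sample markup in the usage comment is invented rather than actual Ocropus output.

```ruby
# The handful of HTML entities this sketch decodes.
ENTITIES = {
  '&amp;' => '&', '&lt;' => '<', '&gt;' => '>',
  '&quot;' => '"', '&#39;' => "'"
}.freeze

# Reduce an HTML string to plain text, one text fragment per line.
# Every tag is treated as a line break, so inline tags also split lines.
def html_to_text(html)
  html.gsub(/<[^>]+>/, "\n")                                   # tags become line breaks
      .split("\n")
      .map { |l| l.gsub(/&(?:amp|lt|gt|quot|#39);/, ENTITIES).strip }
      .reject(&:empty?)                                        # drop blank fragments
      .join("\n")
end

# Usage with invented markup:
#   html_to_text('<div><span>AB 12</span><span>C &amp; D</span></div>')
```

For anything beyond simple markup, a proper HTML parser would be a safer choice.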

Solution 2

We use OCR XTR Lite from Vividata at my office. It uses the ScanSoft engine and is very accurate but isn't a free solution. Currently it is being scripted from bash and I process from 75,000 to 150,000 pages a day with it. Accuracy is almost perfect and it auto-rotates the images to determine the OCR orientation.


Author by

ylluminate

Updated on June 24, 2022

Comments

ylluminate almost 2 years

I've used Tesseract a bit and its results leave much to be desired. I'm currently running OCR on very small images (35x15, without a border; I have tried adding one with ImageMagick, with no OCR advantage). They contain 2 to 5 characters in a fairly consistent font, but the characters vary enough that simply using an image-size checksum or similar is not going to work.

What options exist for OCR besides sticking with Tesseract or doing a complete custom training of it? Also, it would be VERY helpful if this were compatible with Heroku-style hosting (at least where I can compile the binaries and push them over).
ylluminate about 12 years

Very interesting! Thanks a bunch. I would be particularly interested in training. I can limit the vocabulary to about 50 "words" if vocabulary training or limiting is possible so as to give it a defined set of boundaries.
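Independent of any engine-side training, a fixed vocabulary of about 50 "words" can also be enforced after recognition by snapping each raw OCR result to the closest known word. This is post-correction, not training, and is only a sketch: the method names, the `max_distance` cutoff and the sample vocabulary in the usage comment are all hypothetical.

```ruby
# Levenshtein distance between two strings (row-by-row dynamic programming).
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Return the vocabulary entry closest to the raw OCR string, or nil when
# even the best match needs more than max_distance edits.
def snap_to_vocabulary(raw, vocabulary, max_distance: 2)
  best = vocabulary.min_by { |word| edit_distance(raw, word) }
  return nil if best.nil?

  edit_distance(raw, best) <= max_distance ? best : nil
end

# Usage with a made-up vocabulary:
#   snap_to_vocabulary('A812', %w[AB12 XK7 Q9 ZZ301])
```

With strings of only 2 to 5 characters, a cutoff of 2 edits is probably too generous; 1 may be a better starting point.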
user2398029 about 12 years

I recommend you have a look at this video, which gives a solid explanation of how to train Ocropus. Training for GOCR remains a mystery to me; I am not even sure it is possible, and the docs are unhelpful.
ylluminate about 12 years

For Ocropus, did you use the older codebase that hasn't been updated for a few years, or did you check out the repo and compile the newer updates that are in the works?
user2398029 about 12 years

I used port install - not sure how old the port definitions are/were when I installed it. I don't know if it is still the case, but for a long time this was the only way to get it to compile on Mac OS X without hours of burning in dependency hell. But I'd definitely try compiling from source, if you can get it to work.
ylluminate about 12 years

I'm considering working on a Homebrew recipe, but it seems a bit involved. The new source release from just the past few days has an install script, but it needs some help on Mac OS X. http://code.google.com/p/ocropus/source/list and http://code.google.com/p/ocropus/wiki/InstallTranscript may prove useful references.
user2398029 about 12 years

I'm sure that would be welcomed by many. It's definitely a great tool, and it should be made more accessible IMO.
ylluminate about 12 years

We had some discussion about it on IRC, but it appears no one is really willing to tackle a HEAD-based formula for it. Any idea how close we are to a full release of 0.5?
user2398029 about 12 years

I honestly have no idea. Sorry.