Tesseract OCR fails to detect varying font size and letters that are not horizontally aligned


Solution 1

The problem is that the Tesseract engine was not trained to read this kind of text topology.

You can:

  • train your own model; in particular, you will need to provide images with variations in layout (the positions of the characters). You can actually reuse the same image and shuffle the positions of the characters.
  • reorganize the image into clusters of text and then use Tesseract. In particular, I would take the cents part and move it to the right of the comma; in that case you can use Tesseract out of the box. Relevant criteria would be the height of the clusters (to differentiate cents from whole units) and the position of the clusters (read from left to right); see the sketch after this list.
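
As a rough sketch of the second idea (an assumption about how it could be done, not a tested implementation: the 0.8 height ratio, the file name price.png and the OpenCV 4 return signature of findContours are all assumptions to adapt), you could find the character boxes, treat the noticeably shorter ones as the cents, and paste everything onto a single line before calling Tesseract:

import cv2
import numpy as np
import pytesseract

# Load the label and binarize it (dark text on a light background assumed).
img = cv2.imread('price.png', cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# One bounding box per character blob (OpenCV 4 return signature).
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = [cv2.boundingRect(c) for c in contours]  # (x, y, w, h)
max_h = max(h for _, _, _, h in boxes)

# Tall boxes are the whole units; short ones are the comma and the cents.
tall = sorted([b for b in boxes if b[3] >= 0.8 * max_h], key=lambda b: b[0])
small = sorted([b for b in boxes if b[3] < 0.8 * max_h], key=lambda b: b[0])

# Paste every glyph, rescaled to the same height, onto one horizontal line.
glyphs = [cv2.resize(img[y:y + h, x:x + w], (max(1, w * max_h // h), max_h))
          for x, y, w, h in tall + small]
canvas = np.full((max_h + 10, sum(g.shape[1] for g in glyphs) + 5 * (len(glyphs) + 1)), 255, np.uint8)
x_cursor = 5
for g in glyphs:
    canvas[5:5 + max_h, x_cursor:x_cursor + g.shape[1]] = g
    x_cursor += g.shape[1] + 5

# The rearranged image is a plain line of text that Tesseract handles out of the box.
print(pytesseract.image_to_string(canvas, config='--psm 7'))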

In general, computer vision algorithms (including CNNs) give you tools to obtain a higher-level representation of an image (features or descriptors), but they do not give you the logic, or an algorithm, for processing intermediate results in a particular way.

In your case that would be:

  • "if the height of those letters are smaller, it's cents",
  • "if the height, and vertical position is the same, it's about the same number, either on left of coma, or on the right of coma".

The thing is that it is difficult to reach this through training, yet it is extremely simple for a human to write it down as an algorithm. Sorry for not giving you an actual implementation; my text is the pseudocode.
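
For what it is worth, here is a direct translation of that pseudocode into Python. assemble_price is a hypothetical helper; the input is assumed to be a list of recognised characters with their horizontal positions and heights (for instance from your own contour step, or from pytesseract.image_to_boxes), and the 80% height threshold is an assumption to tune.

def assemble_price(chars):
    """chars: list of (symbol, x, height) tuples for one price label."""
    max_h = max(h for _, _, h in chars)
    # "if the height of these characters is smaller, they are cents"
    units = [c for c in chars if c[0].isdigit() and c[2] >= 0.8 * max_h]
    cents = [c for c in chars if c[0].isdigit() and c[2] < 0.8 * max_h]
    # within each group, read from left to right
    units_str = ''.join(s for s, x, h in sorted(units, key=lambda c: c[1]))
    cents_str = ''.join(s for s, x, h in sorted(cents, key=lambda c: c[1]))
    return '{}.{}'.format(units_str, cents_str)

# Example with the boxes of '1', '8' and '9' from the sample label:
print(assemble_price([('1', 10, 40), ('8', 60, 20), ('9', 80, 20)]))  # 1.89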

TrainingTesseract2

TrainingTesseract4

Joint Unsupervised Learning of Deep Representations and Image Clusters

Solution 2

The problem is that the image you are using is small. When Tesseract processes it, it treats '8', '9' and ',' as a single character and therefore predicts '3', or it may treat '8' and ',' as one character and '9' as another, and so produces the wrong output. The image below illustrates this.

Detected contours of the original (small) image

A simple solution is to increase the image size by a factor of 2 or 3 (or more, depending on the size of your original image) before passing it to Tesseract, so that each character is detected individually, as shown below. (Here I increased the size by a factor of 2.)

Detected contours of the resized (larger) image
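
The contour pictures above can be reproduced with something along these lines (an assumption about how they were drawn, not the answerer's exact code; the OpenCV 4 return signature of findContours is assumed):

import cv2

# Load the sample label and binarize it.
img = cv2.imread('dKC6k.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Draw a bounding box around every external contour, i.e. every blob
# that Tesseract may treat as a single character.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)

cv2.imwrite('contours.png', img)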

Below is a simple Python script that does this:

import pytesseract
import cv2

# Read the label and enlarge it by a factor of 2 so that the cents digits
# and the comma are no longer merged into a single character.
img = cv2.imread('dKC6k.png')
img = cv2.resize(img, None, fx=2, fy=2)

# Run Tesseract on the enlarged image.
data = pytesseract.image_to_string(img)
print(data)

Detected text:

je Beutel

89
1.

Now you can simply extract the required data from the text and format it as per your requirement.

# Collapse blank lines and split the OCR output into individual lines.
data = data.replace('\n\n', '\n')
data = data.split('\n')

# For the output above, data[2] holds the whole units ('1.') and
# data[1] holds the cents ('89'); strip the trailing separator.
dollars = data[2].strip(',').strip('.')
cents = data[1]

print('{}.{}'.format(dollars, cents))

Desired Format:

1.89


Comments

  • NONONONONO
    NONONONONO almost 2 years

    I am trying to detect the text on these price labels, which are always clearly preprocessed. Although Tesseract can easily read the text written above the price, it fails to detect the price values. I am using the Python bindings pytesseract, although it also fails when reading from the CLI commands. Most of the time it recognizes the price part as just one or two characters.

    Sample 1:

    tesseract D:\tesseract\tesseract_test_images\test.png output
    

    And the output of the sample image is this.

    je Beutel

    13

    However, if I crop and stretch the price so that the characters look separated and are the same font size, the output is just fine.

    Processed image (cropped and shrunk price):

    je Beutel

    1,89

    How do I get Tesseract OCR to work as I intended, given that I will be going over a lot of similar images? Edit: Added more price tags:
    sample2, sample3, sample4, sample5, sample6, sample7

    • dROOOze
      dROOOze about 6 years
      Try to come up with an algorithm which uses e.g. the cv2.connectedComponents and cv2.boundingRect functions to detect connected regions of dissimilar size on the same horizontal band. You can then call Tesseract after either enlarging the smaller regions, shrinking the larger regions, or isolating the dissimilar regions and making the call separately (a sketch of this idea appears after the thread).
  • skt7
    skt7 about 6 years
    The questioner has clearly mentioned that he/she is trying to detect price label text that is always clearly preprocessed in the format shown.
  • NONONONONO
    NONONONONO about 6 years
    I am updating the question with more test cases; for almost all of them this does not work, and in your answer the fact that 89 is recognized before the 1 shows something is wrong too (they should have been on the same line, 1 is not below 89, and the comma is recognized as a dot). I am really focusing on the part where there are digits on top of the comma.
  • skt7
    skt7 about 6 years
    This is how Tesseract works: it recognizes characters and prints text based on the positions at which it recognized them. You will either have to work around this or train your own model that works exactly as you need, which I think is preferable in your scenario, since you need to process images with the same formatting.
  • skt7
    skt7 about 6 years
    @NONONONONO can you upload the images to a GitHub repo and share the link so I can understand your dataset more clearly and suggest something accordingly?
  • NONONONONO
    NONONONONO about 6 years
    I really cannot, as they are something I should not be sharing, but I added a few test cases anyhow. I am not sure what you meant by "position", because as you can see, despite 89 being on the same line and to the right of the 1, it failed to be recognized as 1,89 (just as you would read it). Also, image size is evidently not the problem, as the letters above the price numbers (in all the images I have) are recognized correctly. I have moved to a completely new architecture for recognizing price digits.
  • skt7
    skt7 about 6 years
    gist.github.com/skt7/f98042c6c9c8bd81095fedadd322094e: use this code to analyze all your images, and you can then come up with a way to parse the different types of text returned by Tesseract. You need to try different values of resizeFactor, as it changes the output.
  • Anuj Teotia
    Anuj Teotia about 6 years
    I have an image with characters that are not horizontally aligned and are of different font sizes. I have tried your approach but no luck :(
  • skt7
    skt7 about 6 years
    This code specifically works for the particular case shown. Can you provide the image?
  • NONONONONO
    NONONONONO about 6 years
    I am sorry, but your approach is wrong and the answer is misleading. It is not about the image being small, since even smaller letters above the price are recognized correctly. My theory is that you stretch it along the X axis more than the Y axis (because the image is a rectangle), so the characters stacked on top of the comma are separated a little more and recognized individually; however, it is still read wrong (i.e. 89\n1.).
  • skt7
    skt7 about 6 years
    I answered that: you need to find some simple hacks to crack this, and this is what I came up with; I even said you can play with the code to make your own. I also clearly mentioned that you can train your own model, but that would need some serious work. This hack was just to give you an idea of how you can achieve different things using simple manipulations of the image.
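
For completeness, here is a minimal sketch of dROOOze's suggestion from the thread above, using cv2.connectedComponentsWithStats in place of separate connectedComponents and boundingRect calls; the 0.8 height ratio, the --psm 10 page segmentation mode and the file name dKC6k.png are assumptions to adapt.

import cv2
import pytesseract

# Binarize the label (dark text on a light background assumed).
img = cv2.imread('dKC6k.png', cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# One labelled region per connected blob; stats rows are [x, y, w, h, area],
# with row 0 being the background.
n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
boxes = [tuple(stats[i][:4]) for i in range(1, n)]
max_h = max(h for _, _, _, h in boxes)

# Isolate the regions of dissimilar (smaller) size, enlarge them to the
# height of the tall digits, and run Tesseract on each one separately.
for x, y, w, h in boxes:
    if h < 0.8 * max_h:
        region = img[y:y + h, x:x + w]
        region = cv2.resize(region, None, fx=max_h / h, fy=max_h / h)
        print(pytesseract.image_to_string(region, config='--psm 10').strip())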