Getting the bounding box of the recognized words using python-tesseract


Solution 1

Use pytesseract.image_to_data()

import pytesseract
from pytesseract import Output
import cv2
img = cv2.imread('image.jpg')

d = pytesseract.image_to_data(img, output_type=Output.DICT)
n_boxes = len(d['level'])
for i in range(n_boxes):
    (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('img', img)
cv2.waitKey(0)

Among the data returned by pytesseract.image_to_data():

  • left is the distance from the upper-left corner of the bounding box, to the left border of the image.
  • top is the distance from the upper-left corner of the bounding box, to the top border of the image.
  • width and height are the width and height of the bounding box.
  • conf is the model's confidence in the word recognized inside that bounding box. A conf of -1 means the bounding box encloses a block of text rather than a single word.

The bounding boxes returned by pytesseract.image_to_boxes() enclose individual letters, so I believe pytesseract.image_to_data() is what you're looking for.
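To keep only word-level boxes, the dictionary can be filtered on level and conf. The sketch below is a minimal example, not part of pytesseract: word_boxes is a hypothetical helper, the sample dictionary is hand-written rather than real OCR output, and it assumes that level 5 marks word rows in Tesseract's TSV output and that conf may arrive as a number or a string depending on the pytesseract version.

```python
def word_boxes(data, min_conf=0.0):
    """Return (text, x, y, w, h, conf) tuples for word rows of an image_to_data dict."""
    words = []
    for i, text in enumerate(data['text']):
        conf = float(data['conf'][i])  # conf may be int, float, or str
        # Assumption: level 5 is a word; page/block/paragraph/line rows have conf -1
        if data['level'][i] != 5 or conf < min_conf or not text.strip():
            continue
        words.append((text, data['left'][i], data['top'][i],
                      data['width'][i], data['height'][i], conf))
    return words


# Hand-written sample in the shape of Output.DICT (not real OCR output)
sample = {
    'level':  [1, 2, 5, 5],
    'left':   [0, 10, 10, 60],
    'top':    [0, 10, 10, 10],
    'width':  [200, 120, 40, 50],
    'height': [100, 30, 20, 20],
    'conf':   ['-1', '-1', '96', '41'],
    'text':   ['', '', 'Hello', 'wrld'],
}

print(word_boxes(sample, min_conf=60))
# → [('Hello', 10, 10, 40, 20, 96.0)]
```

With real data, the same call would be word_boxes(d, min_conf=60) on the dictionary d returned above; the threshold of 60 is an arbitrary illustration.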

Solution 2

The tesseract.GetBoxText() method returns the exact position of each character in an array.

Besides, there is a command-line option, tesseract test.jpg result hocr, that generates a result.html file (result.hocr with some Tesseract versions; the content is HTML either way) containing the coordinates of each recognized word. But I'm not sure whether it can be called from a Python script.
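Once the hOCR file exists, the word coordinates can be read out of it with the standard library. This is a rough sketch under assumptions: it expects each word in a span with class ocrx_word and a title attribute that starts with "bbox x0 y0 x1 y1", the HocrWords class and hocr_words helper are hypothetical names, and the sample markup is hand-written in that shape rather than actual Tesseract output.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'span' and 'ocrx_word' in (attrs.get('class') or '').split():
            m = re.search(r'bbox (\d+) (\d+) (\d+) (\d+)', attrs.get('title') or '')
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'span' and self._bbox is not None:
            self.words.append((''.join(self._text).strip(),) + self._bbox)
            self._bbox = None


def hocr_words(hocr_text):
    parser = HocrWords()
    parser.feed(hocr_text)
    return parser.words


# Hand-written sample in hOCR style (not real OCR output)
sample = (
    "<span class='ocr_line' title='bbox 36 92 200 116'>"
    "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' id='word_1_2' title='bbox 110 92 200 116; x_wconf 91'>world</span>"
    "</span>"
)

print(hocr_words(sample))
# → [('Hello', 36, 92, 96, 116), ('world', 110, 92, 200, 116)]
```

The sketch does not handle spans nested inside a word span, so output produced with per-character boxes enabled would need extra care.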

Solution 3

pytesseract can do this without writing to a file, using the image_to_boxes function:

import cv2
import pytesseract

filename = 'image.png'

# read the image and get the dimensions
img = cv2.imread(filename)
h, w, _ = img.shape # assumes color image

# run tesseract, returning the bounding boxes
boxes = pytesseract.image_to_boxes(img) # also include any config options you use

# draw the bounding boxes on the image
for b in boxes.splitlines():
    b = b.split(' ')
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)

# show annotated image and wait for keypress
cv2.imshow(filename, img)
cv2.waitKey(0)
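The h - ... terms are there because each line of the box output has the form "char left bottom right top page" with y measured from the bottom edge of the image, while OpenCV measures y from the top. The conversion can be sketched on its own, without OpenCV; parse_boxes is a hypothetical helper and the sample string is hand-written, not real OCR output.

```python
def parse_boxes(box_text, image_height):
    """Convert image_to_boxes text into (char, x1, y1, x2, y2) with a top-left origin."""
    result = []
    for line in box_text.splitlines():
        parts = line.rsplit(' ', 5)  # char, left, bottom, right, top, page
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char = parts[0]
        left, bottom, right, top = (int(p) for p in parts[1:5])
        # Flip the y axis: distance from the top = height - distance from the bottom
        result.append((char, left, image_height - top, right, image_height - bottom))
    return result


# Hand-written sample in box-file format for an image 100 pixels high
sample = "H 10 70 30 90 0\ni 35 70 40 90 0"

print(parse_boxes(sample, image_height=100))
# → [('H', 10, 10, 30, 30), ('i', 35, 10, 40, 30)]
```

Each returned tuple gives the upper-left corner (x1, y1) and lower-right corner (x2, y2), which can be passed to cv2.rectangle directly.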

Solution 4

The code below gives you the bounding box of each character.

import csv
import cv2
from pytesseract import pytesseract as pt

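# Note: the boxes keyword is only accepted by older pytesseract releases
pt.run_tesseract('bw.png', 'output', lang=None, boxes=True, config="hocr")

# Read the character coordinates from the generated box file
boxes = []
with open('output.box', 'r') as f:  # text mode, as csv.reader expects strings
    reader = csv.reader(f, delimiter=' ')
    for row in reader:
        if len(row) == 6:
            boxes.append(row)

# Draw the bounding boxes (box-file y values are measured from the bottom edge)
img = cv2.imread('bw.png')
h, w, _ = img.shape
for b in boxes:
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (255, 0, 0), 2)

cv2.imshow('output', img)
cv2.waitKey(0)  # keep the window open until a key is pressed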

Solution 5

Would comment under lennon310 but don't have enough reputation to comment...

To run his command-line call, tesseract test.jpg result hocr, from a Python script:

from subprocess import check_call

tesseractParams = ['tesseract', 'test.jpg', 'result', 'hocr']
check_call(tesseractParams)
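A slightly fuller sketch of the same idea also reads the generated file back in. The helper names hocr_command and run_hocr are hypothetical, the tesseract executable is assumed to be on the PATH, and the output is looked up under both result.hocr and result.html because the extension reportedly differs between Tesseract versions.

```python
import subprocess
from pathlib import Path


def hocr_command(image_path, output_base):
    """Build the argument list for: tesseract <image> <output_base> hocr"""
    return ['tesseract', str(image_path), str(output_base), 'hocr']


def run_hocr(image_path, output_base):
    """Run tesseract and return the hOCR markup as text."""
    subprocess.check_call(hocr_command(image_path, output_base))
    # Assumption: the file is written as <base>.hocr or <base>.html
    for suffix in ('.hocr', '.html'):
        candidate = Path(str(output_base) + suffix)
        if candidate.exists():
            return candidate.read_text(encoding='utf-8')
    raise FileNotFoundError('no hOCR output found for ' + str(output_base))


print(hocr_command('test.jpg', 'result'))
# → ['tesseract', 'test.jpg', 'result', 'hocr']
```

Calling run_hocr('test.jpg', 'result') would then return the markup as a string, ready for whatever parser you use to pull out the word coordinates.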

Author: Abtin Rasoulian

Updated on July 09, 2022

Comments

  • Abtin Rasoulian
    Abtin Rasoulian almost 2 years

    I am using python-tesseract to extract words from an image. This is a Python wrapper for Tesseract, which is an OCR engine.

    I am using the following code for getting the words:

    import tesseract
    
    api = tesseract.TessBaseAPI()
    api.Init(".","eng",tesseract.OEM_DEFAULT)
    api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz")
    api.SetPageSegMode(tesseract.PSM_AUTO)
    
    mImgFile = "test.jpg"
    mBuffer=open(mImgFile,"rb").read()
    result = tesseract.ProcessPagesBuffer(mBuffer,len(mBuffer),api)
    print "result(ProcessPagesBuffer)=",result
    

    This returns only the words and not their location/size/orientation (in other words, a bounding box containing them) in the image. Is there any way to get that as well?

  • Henry
    Henry over 7 years
    I get a result.hocr file with that command, though the file is in HTML format.
  • Stepan Yakovenko
    Stepan Yakovenko over 5 years
    Doesn't work; boxes is an unknown parameter in the latest pytesseract.
  • Parikshit Chalke
    Parikshit Chalke over 5 years
    This is actually the correct answer to this question, but people might ignore it because of the method's complexity.
  • Atinesh
    Atinesh about 5 years
    Why is the y-coordinate subtracted from the height of the image?
  • jtbr
    jtbr about 5 years
    I believe pytesseract and OpenCV have different notions of the origin of the image (top left or bottom left), or at least that's what I seemed to experience when I wrote the answer. If it works without the h there, great.
  • Eswar RDS
    Eswar RDS about 4 years
    Do you know the meaning of the other columns (level, page_num, block_num, par_num, line_num, word_num) in the output generated by image_to_data?
  • Bùi Nhật Duy
    Bùi Nhật Duy almost 4 years
    This works only for tesseract >= 3.05. I need a solution for lower versions.