Use pytesseract OCR to recognize text from an image

Solution 1

Here is my solution:

import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

im = Image.open("temp.jpg")  # the second image from the question
im = im.filter(ImageFilter.MedianFilter())  # smooth out isolated dots
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)  # double the contrast
im = im.convert('1')  # convert to 1-bit black and white
im.save('temp2.jpg')
text = pytesseract.image_to_string(Image.open('temp2.jpg'))
print(text)

Solution 2

Here's a simple approach using OpenCV and Pytesseract OCR. To perform OCR on an image, it's important to preprocess the image first. The idea is to obtain a processed image where the text to extract is black and the background is white. To do this, we convert to grayscale, apply a slight Gaussian blur, then apply Otsu's threshold to obtain a binary image. From here, we apply morphological operations to remove noise. Finally, we invert the image. We perform text extraction using the --psm 6 configuration option, which assumes a single uniform block of text. See Tesseract's page segmentation mode documentation for more options.


Here's a visualization of the image processing pipeline:

Input image

Convert to grayscale -> Gaussian blur -> Otsu's threshold
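
To make the thresholding step less of a black box, here is a minimal pure-Python sketch of what Otsu's method computes: the grey level that maximizes the between-class variance of the two pixel groups. The names otsu_threshold and binarize_inv are illustrative and are not part of OpenCV; cv2.threshold with cv2.THRESH_OTSU does the same job in optimized native code and may pick a slightly different level on real images.

```python
def otsu_threshold(pixels):
    """Return the grey level (0..255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]        # pixels at or below `level`
        if weight_bg == 0:
            continue                    # nothing in the dark class yet
        weight_fg = total - weight_bg   # pixels above `level`
        if weight_fg == 0:
            break                       # everything is in the dark class
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize_inv(pixels, threshold):
    """Mimic THRESH_BINARY_INV: values above the threshold become 0, the rest 255."""
    return [0 if p > threshold else 255 for p in pixels]


sample = [10] * 50 + [200] * 50       # dark text pixels and light background
level = otsu_threshold(sample)
binary = binarize_inv(sample, level)
```

With the inverted flag, dark text ends up white (255) on a black background, which is why the pipeline inverts the image again before OCR.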

Notice the tiny specks of noise. To remove them, we can perform morphological operations
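
As a rough illustration of why a morphological opening removes specks, here is a small pure-Python sketch on a 0/1 grid with a 3x3 window. An opening is an erosion followed by a dilation: the erosion wipes out anything smaller than the window, and the dilation grows the surviving shapes back. The helper names are made up for this sketch, and border handling is simplified by clipping the window to the grid; cv2.morphologyEx is what the answer's code actually uses.

```python
def _window(grid, row, col):
    """Yield the values in the 3x3 window around (row, col), clipped to the grid."""
    for r in range(max(row - 1, 0), min(row + 2, len(grid))):
        for c in range(max(col - 1, 0), min(col + 2, len(grid[0]))):
            yield grid[r][c]


def erode(grid):
    """A pixel stays 1 only if every pixel in its window is 1."""
    return [[min(_window(grid, r, c)) for c in range(len(grid[0]))]
            for r in range(len(grid))]


def dilate(grid):
    """A pixel becomes 1 if any pixel in its window is 1."""
    return [[max(_window(grid, r, c)) for c in range(len(grid[0]))]
            for r in range(len(grid))]


def opening(grid):
    """Erosion followed by dilation."""
    return dilate(erode(grid))


# A 7x7 image with a 3x3 "character" and one isolated speck of noise.
clean = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(7)]
         for r in range(7)]
noisy = [row[:] for row in clean]
noisy[5][5] = 1

opened = opening(noisy)
```

Here the single-pixel speck disappears while the 3x3 block survives unchanged. Strokes thinner than the window would be removed as well, so the kernel size needs to stay below the stroke width of the text.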

Finally we invert the image

Result from Pytesseract OCR

2HHH

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Grayscale, Gaussian blur, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Morph open to remove noise and invert image
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
invert = 255 - opening

# Perform text extraction
data = pytesseract.image_to_string(invert, lang='eng', config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.imshow('opening', opening)
cv2.imshow('invert', invert)
cv2.waitKey()

Solution 3

Here is a different pytesseract approach:

import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open("temp.jpg"), lang='eng',
                        config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

print(text)
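
The config string above packs three settings into one argument, so a small helper can make them easier to vary. This is only a sketch: build_tesseract_config is a hypothetical name, not a pytesseract function, and it simply assembles the same flags used in this answer.

```python
def build_tesseract_config(psm=None, oem=None, whitelist=None):
    """Assemble a Tesseract config string from optional settings.

    psm:       page segmentation mode, e.g. 6 = single uniform block of text,
               7 = single text line, 8 = single word, 10 = single character
    oem:       OCR engine mode, e.g. 3 = default engine for the installation
    whitelist: characters Tesseract is allowed to output (no spaces)
    """
    parts = []
    if psm is not None:
        parts.append(f"--psm {psm}")
    if oem is not None:
        parts.append(f"--oem {oem}")
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# Same settings as the call above.
digits_only = build_tesseract_config(psm=10, oem=3, whitelist="0123456789")

# A variant that may be worth trying for a short string such as 2HHH.
single_word = build_tesseract_config(psm=8)
```

The result can be passed as config=digits_only to pytesseract.image_to_string. Note that --psm 10 treats the image as a single character, so for text with several characters a mode such as 7 or 8 may give better results.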

Solution 4

To extract the text directly from the web, you can try the following implementation (making use of the first image):

import io
import requests
import pytesseract
from PIL import Image, ImageFilter, ImageEnhance

response = requests.get('https://i.stack.imgur.com/HWLay.gif')
img = Image.open(io.BytesIO(response.content))
img = img.convert('L')
img = img.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(img)
img = enhancer.enhance(2)
img = img.convert('1')
img.save('image.jpg')
imagetext = pytesseract.image_to_string(img)
print(imagetext)

Solution 5

Here is a small improvement that removes noise and stray lines whose colour falls within a certain range.

import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

im = Image.open(img)  # img is the path of the image 
im = im.convert("RGBA")
newimdata = []
datas = im.getdata()

for item in datas:
    if item[0] < 112 or item[1] < 112 or item[2] < 112:
        newimdata.append(item)
    else:
        newimdata.append((255, 255, 255, 255))  # opaque white; the image is RGBA
im.putdata(newimdata)

im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.jpg')
text = pytesseract.image_to_string(Image.open('temp2.jpg'), config='-c tessedit_char_whitelist=0123456789abcdefghijklmnopqrstuvwxyz --psm 6', lang='eng')
print(text)
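
The colour filter in the loop above can be isolated as a plain function, which makes the rule easier to test on a few pixels before running it on a whole image. This is a sketch under the same assumptions as the answer (RGBA pixels, a per-channel limit of 112); keep_dark_pixels and WHITE are illustrative names.

```python
WHITE = (255, 255, 255, 255)  # opaque white in RGBA


def keep_dark_pixels(pixels, limit=112):
    """Keep a pixel if any of its R, G, B channels is below `limit`; otherwise whiten it."""
    cleaned = []
    for pixel in pixels:
        red, green, blue = pixel[:3]
        if red < limit or green < limit or blue < limit:
            cleaned.append(pixel)   # dark enough in some channel: likely text
        else:
            cleaned.append(WHITE)   # light in every channel: treat as background
    return cleaned


sample = [
    (0, 0, 0, 255),        # black text pixel: kept
    (200, 200, 200, 255),  # light grey noise: whitened
    (250, 100, 250, 255),  # green channel below the limit: kept
]
result = keep_dark_pixels(sample)
```

With Pillow, the returned list can be written back with the same getdata and putdata calls used in the answer. The right limit depends on the image, so 112 is a starting point rather than a fixed value.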

Author: Smith John

Updated on July 09, 2022

Comments

  • Smith John
    Smith John almost 2 years

    I need to use Pytesseract to extract text from this picture:

    enter image description here

    and the code:

    from PIL import Image, ImageEnhance, ImageFilter
    import pytesseract
    path = 'pic.gif'
    img = Image.open(path)
    img = img.convert('RGBA')
    pix = img.load()
    for y in range(img.size[1]):
        for x in range(img.size[0]):
            if pix[x, y][0] < 102 or pix[x, y][1] < 102 or pix[x, y][2] < 102:
                pix[x, y] = (0, 0, 0, 255)
            else:
                pix[x, y] = (255, 255, 255, 255)
    img.save('temp.jpg')
    text = pytesseract.image_to_string(Image.open('temp.jpg'))
    # os.remove('temp.jpg')
    print(text)
    

    and the "temp.jpg" is

    enter image description here

    Not bad, but the printed result is ",2 WW" rather than the correct text "2HHH". How can I remove those black dots?

  • MAK
    MAK over 6 years
    Hi, when I use this code I get the following error: "UnicodeEncodeError: 'charmap' codec can't encode characters in position 11-12: character maps to <undefined>". Can you suggest a way to overcome this?
  • Moon Cheesez
    Moon Cheesez over 6 years
    @MAK You will need to install win-unicode-console on Windows.
  • David
    David about 6 years
    Something never worked with the image. Can you edit and try again?
  • nishit chittora
    nishit chittora about 6 years
    @David Can you please elaborate? What's not working?
  • David
    David about 6 years
    Hmm, I don't remember at the moment, but I'm sure it was not related to the code; it was probably related to an image uploaded here. Did you remove an upload? I don't see it anymore.
  • RAno
    RAno almost 5 years
    I tried -psm and nothing worked, but after seeing your post I tried --psm and it solved everything. Great!
  • Md. Rezaul Karim
    Md. Rezaul Karim over 2 years
    This is one of the most accurate and neatly explained answers I have seen on SO! Thanks!
  • Hariharan AR
    Hariharan AR over 2 years
    This will not work when the text in the image is not English. When I tried this with Japanese and Arabic, the result was not good.
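
On the last comment: the answers above hard-code lang='eng', so one thing worth checking for non-English text is whether a matching language was passed. The sketch below assumes the relevant Tesseract language data (for example jpn or ara) is installed, which may not be the case on the commenter's machine. join_languages and ocr_with_languages are hypothetical helper names; the only pytesseract call used is image_to_string.

```python
def join_languages(codes):
    """Join Tesseract language codes into the 'a+b' form that Tesseract accepts."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_with_languages(image_path, codes, psm=6):
    """Run OCR on an image file with one or more languages (needs pytesseract and Pillow)."""
    # Imported here so join_languages stays usable without the OCR stack installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path),
        lang=join_languages(codes),
        config=f"--psm {psm}",
    )


lang_arg = join_languages(["jpn", "ara"])
```

Calling ocr_with_languages('temp2.jpg', ['jpn']) would then use the Japanese model instead of English. Preprocessing tuned for Latin characters, such as the whitelist in Solution 5, would also need to be adjusted or dropped.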