Extract Text From Images in Python (OCR)

Introduction

Optical Character Recognition (OCR) is a technology that enables the extraction of text from images. In this article, we'll explore how to use Python along with the Tesseract OCR engine to extract text from images containing text, such as street signs, logos, and more.

Installing Tesseract

To begin, you need to install Tesseract. It's an open-source OCR engine available on GitHub. You can find it at github.com/tesseract-ocr/tesseract.

Scroll down to the "Installing Tesseract" section.
Depending on your operating system, follow the instructions to install Tesseract. For Windows, you can download an installer from the link provided for the 64-bit version.
Once installed, remember to add Tesseract's directory to your system's PATH. You can find the directory by looking at the uninstaller's file location. It should be something like C:\Program Files\Tesseract-OCR.
Additionally, create an environment variable called TESSDATA_PREFIX that points to the Tesseract data directory, which holds the trained language files.
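To confirm that the PATH and TESSDATA_PREFIX steps took effect, a small check like the one below can help. It is only a sketch: the helper name find_tesseract and the fallback directory are assumptions for illustration (the directory is the typical Windows location mentioned above), so adjust them for your machine.

```python
import os
import shutil

# Typical Windows install location mentioned above; an assumption, adjust as needed.
DEFAULT_WINDOWS_DIR = r"C:\Program Files\Tesseract-OCR"


def find_tesseract(fallback_dir=DEFAULT_WINDOWS_DIR):
    """Return the path to the tesseract executable, or None if it is not found."""
    # shutil.which searches the directories listed in the PATH variable.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Not on PATH: look in the assumed install directory instead.
    candidate = os.path.join(fallback_dir, "tesseract.exe")
    return candidate if os.path.isfile(candidate) else None


def report_setup():
    """Print what the current environment exposes to Python."""
    print("tesseract executable:", find_tesseract())
    # This prints None unless TESSDATA_PREFIX was created as described above.
    print("TESSDATA_PREFIX:", os.environ.get("TESSDATA_PREFIX"))
```

Calling report_setup() in a fresh terminal shows whether the changes are visible. If the executable exists but is not on PATH, pytesseract can also be pointed at it directly by assigning the full path to pytesseract.pytesseract.tesseract_cmd.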

Installing Python Libraries

Once Tesseract is installed, you also need the pytesseract library for Python. You can install it using pip:

pip install pytesseract

You will also need Pillow and OpenCV, which can be installed with:

pip install Pillow opencv-python

Using Tesseract Manually

Before diving into Python, it's useful to see how Tesseract operates from the command line:

Open your terminal and navigate to the folder containing an image file that you'd like to analyze.
Run Tesseract with the following command:

tesseract image.jpg stdout

This command recognizes text from image.jpg and displays it in the terminal.
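The same command can be driven from Python with the standard subprocess module, which is a handy way to confirm that the command-line tool works before bringing in pytesseract. This is a minimal sketch; the helper names build_command and run_tesseract are made up for illustration.

```python
import subprocess


def build_command(image_path, output_base="stdout"):
    """Build the argument list for a tesseract call.

    With output_base="stdout" the text is printed to the terminal; any other
    value makes tesseract write the text to <output_base>.txt instead.
    """
    return ["tesseract", image_path, output_base]


def run_tesseract(image_path):
    """Run tesseract on one image and return the recognized text.

    Requires the tesseract executable to be on PATH.
    """
    result = subprocess.run(
        build_command(image_path),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

For example, print(run_tesseract("image.jpg")) should show the same text as the command above.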

Testing Tesseract

To test Tesseract's capabilities, use various image files:

For an image of clean, straight text (such as a scanned page), Tesseract usually performs very well.
For more complicated images, such as logos or street signs, results vary and often depend on the settings used.

For better output, you can adjust Tesseract's settings, such as the page segmentation mode (PSM) and the OCR engine mode (OEM). Two commonly used PSM values are:

PSM Mode 6: Assume a single uniform block of text.
PSM Mode 11: Sparse text, find text in any arrangement.
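On the command line these settings are passed as the --psm and --oem options. The sketch below keeps a few mode descriptions in one place and builds the option string. The helper name make_config is hypothetical; modes 3 and 7 are added here as further examples, and the validation ranges (PSM 0 to 13, OEM 0 to 3) assume Tesseract 4 or newer.

```python
# Descriptions for a few page segmentation modes (6 and 11 are the ones listed above).
PSM_DESCRIPTIONS = {
    3: "Fully automatic page segmentation, no orientation detection (default)",
    6: "Assume a single uniform block of text",
    7: "Treat the image as a single text line",
    11: "Sparse text, find text in any arrangement",
}


def make_config(psm=3, oem=3):
    """Return an option string such as '--psm 6 --oem 3'."""
    # Ranges assume Tesseract 4+; older versions support fewer modes.
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return f"--psm {psm} --oem {oem}"
```

The resulting string can be appended to the command-line call or handed to pytesseract through its config argument.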

Implementing OCR in Python

Now, let's jump into Python. Here's how to implement OCR step by step:

Import the necessary libraries:

import pytesseract
from PIL import Image
import cv2

Create a configuration string to specify PSM and OEM settings:

my_config = r'--psm 6 --oem 3'

Load your images and run OCR:

text = pytesseract.image_to_string(Image.open('text.jpg'), config=my_config)
print(text)

You can also read different image files (e.g., logos and signs) and adjust the my_config settings to improve recognition.
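One way to find a good setting for a difficult image is to run the same image through several PSM values and compare the output side by side. The sketch below does that. The function names and the default list of modes are assumptions for illustration, and the ocr parameter exists only so the loop can be tried without Tesseract installed; by default it uses pytesseract.image_to_string.

```python
def compare_psm_modes(image, modes=(3, 6, 11), oem=3, ocr=None):
    """Run OCR once per PSM mode and return {mode: recognized_text}."""
    if ocr is None:
        # Imported lazily so the helper can be exercised with a stand-in function.
        import pytesseract
        ocr = pytesseract.image_to_string
    results = {}
    for psm in modes:
        config = f"--psm {psm} --oem {oem}"
        results[psm] = ocr(image, config=config).strip()
    return results


def show_comparison(path="text.jpg"):
    """Print the text recognized under each mode (needs Pillow and Tesseract)."""
    from PIL import Image

    for psm, text in compare_psm_modes(Image.open(path)).items():
        print(f"--- PSM {psm} ---")
        print(text)
```

Calling show_comparison with the path to a logo or sign image makes it easier to see which mode suits that kind of picture.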

Visualizing Recognized Characters

To visualize recognized text, you can use OpenCV to draw a rectangle around each recognized character. Here's how:

Read the image with OpenCV:

image = cv2.imread('text.jpg')

Use Tesseract to create bounding boxes:

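# OpenCV loads images as BGR; pytesseract expects RGB, so convert before recognition
boxes = pytesseract.image_to_boxes(cv2.cvtColor(image, cv2.COLOR_BGR2RGB), config=my_config)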

Draw the boxes on the image and display it:

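height = image.shape[0]  # image_to_boxes measures y from the bottom edge of the image
for b in boxes.splitlines():
    b = b.split(' ')
    # Flip the y values so they match OpenCV's top-left origin
    cv2.rectangle(image, (int(b[1]), height - int(b[2])), (int(b[3]), height - int(b[4])), (0, 255, 0), 2)
cv2.imshow('Image', image)
cv2.waitKey(0)
cv2.destroyAllWindows()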
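To box whole words rather than single characters, pytesseract also offers image_to_data, which reports a position and a confidence value for each detected word. The sketch below is one possible approach: the function names and the confidence threshold of 60 are assumptions chosen for illustration. Unlike image_to_boxes, these coordinates are measured from the top-left corner, so no vertical flip is needed.

```python
def word_boxes(data, min_conf=60.0):
    """Turn image_to_data output (dict form) into (text, x1, y1, x2, y2) tuples."""
    boxes = []
    for i, text in enumerate(data["text"]):
        # conf is -1 for non-word entries (pages, blocks, lines); depending on
        # the pytesseract version it may arrive as a string or a number.
        conf = float(data["conf"][i])
        if not text.strip() or conf < min_conf:
            continue
        x, y = data["left"][i], data["top"][i]
        w, h = data["width"][i], data["height"][i]
        boxes.append((text, x, y, x + w, y + h))
    return boxes


def draw_word_boxes(path="text.jpg", config="--psm 6 --oem 3"):
    """Return the image with a rectangle around each confident word (needs Tesseract)."""
    import cv2
    import pytesseract

    image = cv2.imread(path)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # pytesseract expects RGB
    data = pytesseract.image_to_data(
        rgb, config=config, output_type=pytesseract.Output.DICT
    )
    for _text, x1, y1, x2, y2 in word_boxes(data):
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
    return image
```

The returned image can be displayed with cv2.imshow in the same way as the character boxes. Raising or lowering min_conf trades missed words against false detections.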

Conclusion

In this article, we explored how to extract text from images using OCR with Python and Tesseract. We've covered installation, basic usage, and how to visualize recognized text. Feel free to experiment with different images and configurations to enhance your OCR results!

Keywords

Optical Character Recognition
Tesseract
Python
OpenCV
Pytesseract
Image Processing
Text Extraction
Page Segmentation Modes
OCR Engine Modes

FAQ

Q1: What is OCR?
A1: Optical Character Recognition (OCR) is a technology that allows the conversion of different types of documents—such as scanned paper documents, PDFs, or images taken by a digital camera—into editable and searchable data.

Q2: What do I need to install to use Tesseract with Python?
A2: You need to install the Tesseract OCR engine itself (using the installer or package for your operating system), and then use pip to install the pytesseract, Pillow, and OpenCV libraries.

Q3: How do I improve OCR results with Tesseract?
A3: You can improve OCR results by adjusting the page segmentation modes (PSM) and OCR engine modes (OEM) according to the type of text and image.

Q4: Can Tesseract recognize text in different languages?
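A4: Yes. Tesseract can recognize many languages, provided the trained data file for each language is present in the Tesseract data directory and the language is selected when running OCR (for example, with the -l option on the command line).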

Q5: Is Tesseract free to use?
A5: Yes, Tesseract is an open-source software project and can be used freely under the Apache License.