Extract Text From Images in Python (OCR)
Introduction
Optical Character Recognition (OCR) is a technology that enables the extraction of text from images. In this article, we'll explore how to use Python along with the Tesseract OCR engine to extract text from images containing text, such as street signs, logos, and more.
Installing Tesseract
To begin, install Tesseract, an open-source OCR engine hosted on GitHub at github.com/tesseract-ocr/tesseract.
- Scroll down to the "Installing Tesseract" section.
- Depending on your operating system, follow the instructions to install Tesseract. For Windows, you can download an installer from the link provided for the 64-bit version.
- Once installed, add Tesseract's directory to your system's PATH. You can find the directory by checking the uninstaller's file location; it should be something like C:\Program Files\Tesseract-OCR.
- Additionally, create an environment variable called TESSDATA_PREFIX pointing to the Tesseract data directory, which holds the trained data for each language.
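A quick way to confirm both settings is to check them from Python using only the standard library. The following is a minimal sketch, not an official check; the function name check_tesseract_setup is made up for this illustration.

```python
import os
import shutil


def check_tesseract_setup(env=None):
    """Return (executable_path, tessdata_dir); either is None if not found.

    check_tesseract_setup is a hypothetical helper for this article,
    not part of Tesseract itself.
    """
    env = os.environ if env is None else env
    # shutil.which searches the given PATH string for the executable.
    exe = shutil.which("tesseract", path=env.get("PATH"))
    # Only report TESSDATA_PREFIX if it points to an existing directory.
    tessdata = env.get("TESSDATA_PREFIX")
    if tessdata is not None and not os.path.isdir(tessdata):
        tessdata = None
    return exe, tessdata


exe, tessdata = check_tesseract_setup()
print("tesseract executable:", exe or "not found on PATH")
print("TESSDATA_PREFIX:", tessdata or "unset or not a directory")
```

If the first line prints "not found on PATH", open a new terminal after editing PATH, since already-open terminals usually keep the old value.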
Installing Python Libraries
Once Tesseract is installed, you also need the pytesseract library for Python. You can install it using pip:
pip install pytesseract
You will also need Pillow and OpenCV, which can be installed with:
pip install Pillow opencv-python
Using Tesseract Manually
Before diving into Python, it's useful to see how Tesseract operates from the command line:
- Open your terminal and navigate to the folder containing an image file that you'd like to analyze.
- Run Tesseract with the following command:
tesseract image.jpg stdout
This command recognizes text from image.jpg and displays it in the terminal.
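The same command can also be launched from a script, which is handy for checking the command-line behavior before switching to a Python wrapper. This is a small sketch using the standard subprocess module; the helper names tesseract_command and run_tesseract are illustrative, and it assumes the tesseract executable is on PATH.

```python
import subprocess


def tesseract_command(image_path, output_base="stdout"):
    """Build the argument list for: tesseract <image> <output_base>."""
    return ["tesseract", str(image_path), output_base]


def run_tesseract(image_path):
    """Run the Tesseract CLI on one image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


# Example usage (requires Tesseract and an image.jpg in the current folder):
# print(run_tesseract("image.jpg"))
```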
Testing Tesseract
To test Tesseract's capabilities, use various image files:
- For a straight text image, Tesseract performs excellently.
- For complicated images like logos or signs, results may vary.
For better results, you can adjust Tesseract's settings, such as the page segmentation mode (PSM) and the OCR engine mode (OEM). Two commonly useful page segmentation modes are:
- PSM Mode 6: Assume a single uniform block of text.
- PSM Mode 11: Sparse text; find as much text as possible in no particular order.
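On the command line these settings are passed with the --psm and --oem flags. The sketch below builds such a flag string and rejects values outside the ranges Tesseract accepts (PSM 0 to 13, OEM 0 to 3); make_config and PSM_NOTES are names invented for this example.

```python
# Descriptions for the two modes discussed above; other modes exist as well.
PSM_NOTES = {
    6: "Assume a single uniform block of text.",
    11: "Sparse text; find as much text as possible in no particular order.",
}


def make_config(psm=6, oem=3):
    """Return a Tesseract flag string such as '--psm 6 --oem 3'."""
    if psm not in range(14):
        raise ValueError(f"psm must be between 0 and 13, got {psm!r}")
    if oem not in range(4):
        raise ValueError(f"oem must be between 0 and 3, got {oem!r}")
    return f"--psm {psm} --oem {oem}"


print(make_config(11))  # flags for sparse text with the default engine mode
```

The resulting string can be appended to the command shown earlier, for example: tesseract image.jpg stdout --psm 11 --oem 3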
Implementing OCR in Python
Now, let's jump into Python. Here's how to implement the OCR functionality:
- Import the necessary libraries:
import pytesseract
from PIL import Image
import cv2
- Create a configuration string to specify PSM and OEM settings:
my_config = r'--psm 6 --oem 3'
- Load your images and run OCR:
text = pytesseract.image_to_string(Image.open('text.jpg'), config=my_config)
print(text)
You can also read different image files (e.g., logos and signs) and adjust the my_config settings to improve recognition.
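One simple way to find a good setting is to run the same image through several page segmentation modes and compare the output by eye. The sketch below does that; compare_psm_modes is a hypothetical helper, and the default modes (3, 6, 11) are only a starting assumption, with 3 being Tesseract's default automatic mode.

```python
def compare_psm_modes(image, psm_modes=(3, 6, 11), ocr=None):
    """Run OCR once per PSM value and return {psm: recognized_text}.

    `ocr` defaults to pytesseract.image_to_string; passing another
    callable with the same (image, config=...) shape is useful for testing.
    """
    if ocr is None:
        import pytesseract  # imported lazily so a stub can be used instead
        ocr = pytesseract.image_to_string
    results = {}
    for psm in psm_modes:
        config = f"--psm {psm} --oem 3"
        results[psm] = ocr(image, config=config).strip()
    return results


# Example usage (requires Tesseract, pytesseract, Pillow, and a text.jpg file):
# from PIL import Image
# for psm, text in compare_psm_modes(Image.open("text.jpg")).items():
#     print(f"--- PSM {psm} ---")
#     print(text)
```

Which mode wins depends on the image, so it is worth repeating the comparison for each kind of input, such as signs versus scanned pages.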
Visualizing Recognized Characters
To visualize recognized text, you can use OpenCV to draw rectangles around the characters or words. Here's how:
- Read the image with OpenCV:
image = cv2.imread('text.jpg')
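- Use Tesseract to create bounding boxes:
height = image.shape[0]  # image height in pixels, needed to flip the y-axis
boxes = pytesseract.image_to_boxes(image, config=my_config)
- Draw the boxes on the image and display it:
# Each line has the form "char left bottom right top page", measured from the
# bottom-left corner, while OpenCV measures from the top-left, so flip y.
for b in boxes.splitlines():
    b = b.split(' ')
    cv2.rectangle(image, (int(b[1]), height - int(b[2])), (int(b[3]), height - int(b[4])), (0, 255, 0), 2)
cv2.imshow('Image', image)
cv2.waitKey(0)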
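image_to_boxes works at the character level. For word-level boxes, pytesseract also provides image_to_data, which reports left, top, width, height, and a confidence value for each detected element, using a top-left origin. The sketch below filters that output down to confidently recognized words; word_boxes and the min_conf threshold of 60 are assumptions made for this example, not recommended values.

```python
def word_boxes(data, min_conf=60.0):
    """Return a list of (text, left, top, right, bottom) for recognized words.

    `data` is the dict returned by
    pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT).
    """
    boxes = []
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip empty entries (page, block, and line rows)
        if float(data["conf"][i]) < min_conf:
            continue  # skip low-confidence words
        left, top = data["left"][i], data["top"][i]
        right = left + data["width"][i]
        bottom = top + data["height"][i]
        boxes.append((text, left, top, right, bottom))
    return boxes


# Example usage (requires Tesseract, pytesseract, OpenCV, and a text.jpg file):
# import cv2, pytesseract
# image = cv2.imread("text.jpg")
# data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
# for text, left, top, right, bottom in word_boxes(data):
#     cv2.rectangle(image, (left, top), (right, bottom), (0, 255, 0), 2)
# cv2.imshow("Words", image)
# cv2.waitKey(0)
```

Because these coordinates already use a top-left origin, no y-axis flip is needed before drawing with OpenCV.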
Conclusion
In this article, we explored how to extract text from images using OCR with Python and Tesseract. We've covered installation, basic usage, and how to visualize recognized text. Feel free to experiment with different images and configurations to enhance your OCR results!
Keywords
- Optical Character Recognition
- Tesseract
- Python
- OpenCV
- Pytesseract
- Image Processing
- Text Extraction
- Page Segmentation Modes
- OCR Engine Modes
FAQ
Q1: What is OCR?
A1: Optical Character Recognition (OCR) is a technology that converts different types of documents, such as scanned paper documents, PDFs, or images taken by a digital camera, into editable and searchable data.
Q2: What do I need to install to use Tesseract with Python?
A2: You need to install the Tesseract OCR software itself, then the pytesseract, Pillow, and OpenCV libraries with pip.
Q3: How do I improve OCR results with Tesseract?
A3: You can improve OCR results by adjusting the page segmentation modes (PSM) and OCR engine modes (OEM) according to the type of text and image.
Q4: Can Tesseract recognize text in different languages?
A4: Yes, Tesseract can be configured to recognize multiple languages based on the trained data you provide.
Q5: Is Tesseract free to use?
A5: Yes, Tesseract is an open-source project and can be used freely under the Apache License 2.0.