As a developer, you might want to extract textual information from an image. Using Python, we can create a program that extracts such textual data from any given image.
Python has been one of the most popular languages developers enjoy working with. Its human-readable syntax makes it easy to learn.
In this guide, we will write a Python script that extracts the images from a PDF file, scans them for text, transcribes it, and saves it to a text file. We will use the Python-tesseract (pytesseract) library to recognize textual data in the images.
Table of contents
- Setting up tesseract OCR
- Adding the project dependencies
- Create a Python tesseract script
- Extract images
- Extract text information
- Test the app
To follow along with this article, ensure that you have Python installed and running on your computer.
Also, ensure you have some basic understanding of Python.
Setting up tesseract OCR
Optical Character Recognition (OCR) is a technology that is used to recognize text from images. It can be used to convert typed, handwritten, or printed text into machine-readable text.
To use OCR, you need to install and configure tesseract on your computer.
First, download the Tesseract OCR installer for your operating system. During installation, copy the Tesseract installation path and add it to your system environment variables.
Once the process is done, run the tesseract -v command to verify that the OCR is installed.
To test whether this environment is working, you may run OCR on any image and see if the textual data gets extracted and saved in a readable text file.
To do that, ensure you have an image with textual information. Use your command line to navigate to the image location and run the tesseract command followed by the image name and the output file name, for example tesseract image.png output (both names here are placeholders).
When the command is executed, Tesseract appends the .txt extension to the output name, so a .txt file is created and saved in the same folder.
This confirms that the tesseract library is successfully installed. We may now proceed to implement the same using a Python script.
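The same installation check can also be made from Python. The sketch below is only an illustration that assumes Tesseract was added to the system PATH as described above; the helper names find_tesseract and tesseract_version are made up for this example and are not part of any library.

```python
import shutil
import subprocess


def find_tesseract(command="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(command)


def tesseract_version(command="tesseract"):
    """Return the first line printed by `tesseract -v`, or None if Tesseract is missing."""
    path = find_tesseract(command)
    if path is None:
        return None
    result = subprocess.run([path, "-v"], capture_output=True, text=True)
    # Some Tesseract builds print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "Tesseract was not found on PATH")
```

If the script reports that Tesseract was not found, the installation path is most likely missing from the environment variables.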
Adding the project dependencies
We need to install a few dependent libraries to help us get started with the Python script.
Python-tesseract (pytesseract) is a Python wrapper for the Tesseract OCR engine that scans and transcribes textual data in images. It recognizes textual information but does not save it to a text document.
To install pytesseract, run pip install pytesseract.
PyMuPDF is a Python library that is used to access documents such as PDFs, along with the images they contain.
In this application, PyMuPDF will read a PDF document and check it for any embedded images. The images it finds are saved as image files, and pytesseract then scans those files and extracts their text.
To install PyMuPDF, run pip install PyMuPDF.
Pillow is an imaging library that adds support for opening, manipulating, and saving images in many formats.
To install Pillow, run pip install Pillow.
Opencv-python is used to read images and videos, manipulate media files with image transformations, draw shapes, and put text on those files.
We will use OpenCV to load the extracted images so that pytesseract can recognize the text in them.
To install opencv-python, run pip install opencv-python.
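Before moving on, it can help to confirm that all four packages are importable. The following is a small sketch of such a check; missing_modules is a hypothetical helper written for this guide. Note that the import names differ from the package names used when installing.

```python
import importlib.util

# Import names differ from the pip package names:
# PyMuPDF -> fitz, Pillow -> PIL, opencv-python -> cv2.
REQUIRED_MODULES = ["pytesseract", "fitz", "PIL", "cv2"]


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are installed.")
```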
Create a Python tesseract script
Create a project folder and add a new main.py file inside that folder.
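First, we need to import the library dependencies that we installed. Inside the main.py file, import pytesseract, fitz (the import name of PyMuPDF), PIL (the import name of Pillow), and cv2 (the import name of opencv-python).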
Then, allow this application to process the image files:
Once the application is given access to a PDF file, the file's content is extracted in the form of images. These images are then processed to extract the text.
In this case, we need to create a few global variables that help to create and save these images to the project path. We also specify the path to save the extracted text into a .txt file.
Go ahead and add these global variables as shown:
This will create an images directory where the images extracted from the PDF will be saved. An output_txt directory will also be created to save the scanned text information as .txt files.
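As a rough illustration of this setup step, the two directories could be created as follows. This is a sketch rather than the original script: the function make_output_dirs and its base_dir parameter are assumptions, while the directory names images and output_txt come from the description above.

```python
from pathlib import Path


def make_output_dirs(base_dir="."):
    """Create the images and output_txt directories under base_dir and return both paths."""
    base = Path(base_dir)
    images_dir = base / "images"      # extracted PDF images go here
    text_dir = base / "output_txt"    # recognized text goes here
    # exist_ok=True keeps repeated runs from failing when the folders already exist.
    images_dir.mkdir(parents=True, exist_ok=True)
    text_dir.mkdir(parents=True, exist_ok=True)
    return images_dir, text_dir
```

Calling make_output_dirs() with no arguments creates both folders in the current working directory, which matches running the script from the project folder.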
Now, let’s create the function that gives us access to the installed tesseract library and the required files. We will do this in the gInUs() function as shown:
From the code above:
- “[.] Add the tesseract.exe local path” – it helps us access the tesseract library.
- “[!] Add the PDF file local path:” – it helps us access the local PDF file we want to use.
Once we enter this path, we first need to verify that the file path is correct. If the path is incorrect, the application displays the error message "Please enter a valid PATH to a file". If the path is correct, the application extracts the images from the PDF by executing the extIm() method.
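The prompt-and-validate flow described above might look roughly like the sketch below. The prompts and the error message are taken from this guide, but the body is a reconstruction under assumptions: the ask and on_valid parameters are added here only so the function can be exercised without typing, and the script described in this guide presumably reads the input directly and calls extIm() itself.

```python
import os


def is_valid_file(path):
    """Return True when `path` points to an existing file."""
    return os.path.isfile(path)


def gInUs(ask=input, on_valid=None):
    """Prompt for both paths; continue only if the PDF path is an existing file."""
    tesseract_path = ask("[.] Add the tesseract.exe local path: ").strip()
    pdf_path = ask("[!] Add the PDF file local path: ").strip()
    if not is_valid_file(pdf_path):
        print("Please enter a valid PATH to a file")
        return None
    if on_valid is not None:
        # Hypothetical hook: this is where the image extraction step would start.
        on_valid(tesseract_path, pdf_path)
    return tesseract_path, pdf_path
```

Passing the prompt function in as a parameter is a design choice for this sketch; it makes the validation easy to check with canned answers.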
Once we have the correct PDF file path, we need to open the file and extract its text to the .txt file.
First, we need to open the PDF file and read its contents. To do that, we will use the fitz module (the import name of PyMuPDF) as shown below:
We create a path to save the images that we extract from the file:
We need to check if there are any images available in the folder. If so, list them and print the contents of each image as shown:
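One way to check the folder for existing images is sketched below. The helper list_images and the set of file extensions are assumptions made for this example; the script in this guide may filter the folder differently.

```python
import os

# Assumed set of image formats; adjust it to match the files you expect.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def list_images(folder):
    """Return the sorted paths of image files in `folder` (empty if the folder is missing)."""
    if not os.path.isdir(folder):
        return []
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )


if __name__ == "__main__":
    for path in list_images("images"):
        print(path)
```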
If no images are available in the folder, we iterate over the PDF files and extract their contents.
Let’s print the count of total images that we have extracted and display an error message if no image is found in the folder:
In the loop, we name every image that is generated from the PDF. Here, we will append the image count to the string image. For example, image2_1:
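The extraction loop described in the last few steps can be sketched with PyMuPDF as follows. This is a reconstruction, not the original code: the helper image_name, the 1-based page and image counters, the file extension in the name, and the wording of the "no images" message are all assumptions. Only the function name extIm() and the image2_1 naming pattern come from this guide.

```python
import os


def image_name(page_number, image_index, extension="png"):
    """Build a file name such as 'image2_1.png' from the page and image counters."""
    return f"image{page_number}_{image_index}.{extension}"


def extIm(pdf_path, images_dir="images"):
    """Save every image embedded in the PDF into images_dir and return the saved paths."""
    import fitz  # PyMuPDF; imported here so image_name() works without it installed

    os.makedirs(images_dir, exist_ok=True)
    saved = []
    doc = fitz.open(pdf_path)
    try:
        for page_number, page in enumerate(doc, start=1):
            for image_index, info in enumerate(page.get_images(full=True), start=1):
                xref = info[0]  # the first item of each entry is the image's xref number
                image = doc.extract_image(xref)  # dict holding the raw bytes and extension
                path = os.path.join(
                    images_dir, image_name(page_number, image_index, image["ext"])
                )
                with open(path, "wb") as handle:
                    handle.write(image["image"])
                saved.append(path)
    finally:
        doc.close()
    if saved:
        print(f"Extracted {len(saved)} image(s)")
    else:
        print("No images found in the PDF")  # assumed wording
    return saved
```

The returned list of paths can then be handed to the text recognition step.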
Here, we execute the function reImg() to render these images and extract their content. Let’s do this in the next step.
Extract text information
Let’s create a function named reImg() that uses these global variables:
At this point, we will have to access the tesseract.exe file. To do that, we use the global variable inputTeEx, where we accept the file path from the user:
Python uses the pytesseract module to invoke the Tesseract executable through the command line.
We need to loop through each extracted image and read its content to extract the textual information as shown:
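A minimal sketch of this recognition loop is shown below. It assumes the Tesseract path is passed in as an argument rather than read from the inputTeEx global, and the helpers text_path_for and tesseract_reader are hypothetical names introduced for this example. The OCR step is supplied as a function so that the file handling can be tried without Tesseract installed.

```python
import os


def text_path_for(image_path, text_dir="output_txt"):
    """Map an image path such as 'images/image2_1.png' to 'output_txt/image2_1.txt'."""
    stem = os.path.splitext(os.path.basename(image_path))[0]
    return os.path.join(text_dir, stem + ".txt")


def tesseract_reader(tesseract_cmd):
    """Return a function that runs Tesseract OCR on a single image path."""
    import cv2
    import pytesseract

    # Tell pytesseract where the Tesseract executable lives.
    pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

    def read_text(image_path):
        image = cv2.imread(image_path)
        if image is None:  # cv2.imread returns None for unreadable files
            return ""
        # OpenCV loads images in BGR order, while pytesseract assumes RGB.
        return pytesseract.image_to_string(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    return read_text


def reImg(image_paths, read_text, text_dir="output_txt"):
    """Run read_text on each image and save the result as a .txt file."""
    os.makedirs(text_dir, exist_ok=True)
    written = []
    for image_path in image_paths:
        out_path = text_path_for(image_path, text_dir)
        with open(out_path, "w", encoding="utf-8") as handle:
            handle.write(read_text(image_path))
        written.append(out_path)
    return written
```

With Tesseract installed, a call such as reImg(paths, tesseract_reader(tesseract_path)) would write one text file per image, where paths and tesseract_path stand for the image list and the path entered by the user.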
Finally, call the gInUs() function to execute the program:
Test the app
To test the app, run python main.py.
First, provide the tesseract path and hit enter:
Once you hit enter, you will be instructed to add the PDF path:
On execution, the program creates an output_txt folder to save the extracted text information in .txt files.
In this guide, we created a Python script that extracts textual information from the images by scanning, transcribing, and saving it to a text file. You can get the code used in this guide on GitHub.
I hope you found this tutorial helpful.
Peer Review Contributions by: Srishilesh P S