Performing OCR by running parallel instances of Tesseract 4.0 : Python

In this blog-post, we will build a Python program that performs Optical Character Recognition (OCR) and show how to leverage it for solving real-world business problems. Let us first understand the problem in brief. OCR refers to the technology that processes printed text in scanned images or documents and converts it into raw text that a machine can manipulate.

In this blog-post, we will be walking through the following modules:

1. Installing Tesseract OCR Engine.
2. Running Tesseract with Command line.
3. Running Tesseract with Python.
4. Running Parallel instances for Speed up.
5. Building the Pipeline for Real World Application.

You can download samples which are used in this blog-post from here.

1. Installing Tesseract OCR Engine

Tesseract is a popular open source project for OCR. You can visit the GitHub repository of Tesseract here. More recently (in 2016), the Tesseract developers implemented LSTM-based deep neural network (DNN) models (Tesseract 4.0) for OCR, which are more accurate and faster than the previous conventional models.

Installing Tesseract on Windows is easy with the precompiled binaries found here. You can download and install the beta version exe from the Mannheim University Library page. Do not forget to edit the "Path" environment variable and add the Tesseract install path to it.

2. Running Tesseract: Command Line

You can see the converted text on command line by typing the following:

tesseract image_path stdout

To write the output text to a file (Tesseract appends ".txt" to the output base name, so this writes result.txt):

tesseract image_path result

To specify the language model with -l (English, eng, is the default):

tesseract image_path result -l eng

3. Running Tesseract: Python

There are a few wrappers built on top of the Tesseract library in Python. Python-tesseract (pytesseract) is a Python wrapper for Google's Tesseract-OCR. Install the wrapper with pip:

pip install pytesseract
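With the wrapper installed, the single-image step might look like the sketch below. This is an illustration rather than the exact function from the original post: the name ocr, the out_dir default and the image_to_text hook are assumptions, and the hook exists only so the function can be tried without Tesseract installed. By default it calls pytesseract.image_to_string on an image opened with Pillow.

```python
import os


def ocr(img_path, out_dir="ocr_results", image_to_text=None):
    """Run OCR on one image, save the text to out_dir, return the output path.

    image_to_text is a hypothetical hook (path -> str) for illustration;
    when omitted, pytesseract is used.
    """
    if image_to_text is None:
        # Imported lazily so the module loads even without the OCR stack.
        import pytesseract
        from PIL import Image

        def image_to_text(path):
            with Image.open(path) as img:
                return pytesseract.image_to_string(img, lang="eng")

    os.makedirs(out_dir, exist_ok=True)
    # "test/page1.png" -> "ocr_results/page1.txt"
    base = os.path.splitext(os.path.basename(img_path))[0]
    out_file = os.path.join(out_dir, base + ".txt")
    with open(out_file, "w", encoding="utf-8") as f:
        f.write(image_to_text(img_path))
    return out_file
```

Returning the output path instead of the text keeps the return value small, which matters once the function is called from several worker processes.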

4. Running Parallel instances for Speed up

In the previous section, we defined a function that takes an input image path and converts the image into readable text. An obvious question of scale arises when we have to process a large number of images, for example 1 million. With that in mind, here are some ideas one can try.

Multi-Threading: If the system has 4 physical cores, one can run 4 parallel instances of Tesseract and thus perform OCR on 4 images in parallel.
Multi-page Feature: Tesseract's multi-page feature is much faster than converting single images one after another. To use it, write the list of image paths to a text file and feed that file to Tesseract as the input.
Using SSDs or RAM as Disk: With a large number of images, this can save a lot of I/O time, since SSDs and RAM disks have faster access and loading times.
Running in a Distributed System: Use MPI for Python (mpi4py) on a distributed system and scale out as much as you want. This differs from multi-threading in that it is not limited to the number of cores of a single system, but you may have to bear more hardware cost.
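The multi-page idea can be sketched as follows. It relies on Tesseract accepting a plain text file with one image path per line as its input and combining the results into a single output file. The helper names write_image_list and multipage_command, and the file names, are hypothetical and chosen only for illustration.

```python
def write_image_list(image_paths, list_file="images.txt"):
    """Write one image path per line; Tesseract reads this file as its input."""
    with open(list_file, "w", encoding="utf-8") as f:
        f.write("\n".join(image_paths) + "\n")
    return list_file


def multipage_command(list_file, out_base="batch_result", lang="eng"):
    """Build the command line; the combined text goes to out_base + '.txt'."""
    return ["tesseract", list_file, out_base, "-l", lang]


# To actually run it (requires tesseract on the PATH):
# import subprocess
# list_file = write_image_list(["test/a.png", "test/b.png"])
# subprocess.run(multipage_command(list_file), check=True)
```

Because all pages end up in one output file, this approach suits batches that belong together, such as the pages of one document.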

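As a rough estimate, in a multi-threading setup a single server with 15-20 cores and SSD storage could process 1 million images in a day, which leaves a budget of roughly 1.3 to 1.7 seconds per image per core (86,400 seconds × 15 to 20 cores / 1,000,000 images). A lot depends on the implementation, though. Below is a simple implementation that runs multiple instances of Tesseract in parallel across the cores of the system (4 workers here) using the concurrent.futures library.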

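import os
import glob
import time
import concurrent.futures

# Limit each Tesseract instance to one OpenMP thread;
# the parallelism comes from the worker processes instead.
os.environ['OMP_THREAD_LIMIT'] = '1'

# ocr(img_path) is the single-image function from the previous section.
# It must be defined at module level and return the output file path.

def main():
    path = "test"
    if os.path.isdir(path):
        out_dir = "ocr_results"
        os.makedirs(out_dir, exist_ok=True)

        image_list = glob.glob(os.path.join(path, "*.png"))
        with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
            # executor.map yields results in the same order as image_list
            for img_path, out_file in zip(image_list, executor.map(ocr, image_list)):
                print(os.path.basename(img_path), ',', out_file, ', processed')

if __name__ == '__main__':
    start = time.time()
    main()
    end = time.time()
    print(end - start)  # elapsed time in seconds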

We call the map function of the ProcessPoolExecutor class, which takes a function and the list of image paths as input. It distributes the image paths across the worker processes, runs the passed function on them in parallel, and returns the results in the same order as the input list.
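To see how map pairs each input with its result, here is a minimal sketch. fake_ocr is a hypothetical stand-in for the real OCR function and run_parallel is an illustrative helper; neither comes from the original script. The helper defaults to ThreadPoolExecutor only so that it runs anywhere without extra setup; for real, CPU-bound OCR work you would pass ProcessPoolExecutor instead.

```python
import concurrent.futures
import os


def fake_ocr(img_path):
    """Hypothetical stand-in for the real OCR function: returns an output name."""
    base = os.path.splitext(os.path.basename(img_path))[0]
    return base + ".txt"


def run_parallel(func, items, max_workers=4,
                 executor_cls=concurrent.futures.ThreadPoolExecutor):
    """Apply func to every item in parallel and return (item, result) pairs.

    executor.map yields results in input order, even if a later item
    finishes first, so zip() pairs each path with its own result.
    """
    with executor_cls(max_workers=max_workers) as executor:
        return list(zip(items, executor.map(func, items)))


pairs = run_parallel(fake_ocr, ["test/a.png", "test/b.png", "test/c.png"])
for img_path, out_file in pairs:
    print(img_path, "->", out_file)
```

When switching to ProcessPoolExecutor, the function passed in has to be defined at module top level, and the call should sit under an if __name__ == '__main__': guard, which matters particularly on Windows.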

5. Building the Pipeline for Real World Application.

The architecture of the ICR system consists of 4 main components. They are shown below in sequence:

1. Image Corrector:

Important Note

There are a few important things to keep in mind while building a Tesseract-based OCR application for solving a business problem.

The standard input image formats for Tesseract are ".tiff" and ".png". It is better to convert images in other formats to ".tiff" first.
Tesseract OCR works best with high-resolution images. It is recommended to convert all images to 300 DPI (use ImageMagick).
By default, Tesseract 4 uses up to 4 threads for OCR. For a single image it is better to limit it to 1 thread, for example by setting the OMP_THREAD_LIMIT environment variable to 1 as in the script above, since this reduces overhead. Throughput then comes from running multiple instances.
OCR accuracy is affected by borders and lines in the images, and background cleaning is also required for better results. If you are not getting good results with Tesseract, try improving the image quality (look for Fred's Textcleaner script).
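The format and resolution notes above can be sketched in Python as well. The post recommends ImageMagick; this sketch swaps in Pillow instead, purely as an assumption to keep everything in one language. The function names, the grayscale conversion and the fallback of 72 DPI for files without resolution metadata are all illustrative choices, not part of the original pipeline.

```python
def scaled_size(size, current_dpi, target_dpi=300):
    """Return the pixel size an image needs to be effectively target_dpi."""
    factor = target_dpi / float(current_dpi)
    width, height = size
    return (round(width * factor), round(height * factor))


def to_300dpi_tiff(src_path, dst_path, assumed_dpi=72):
    """Convert an image to a grayscale TIFF resampled to 300 DPI (needs Pillow)."""
    from PIL import Image  # imported lazily; Pillow is an assumed dependency

    with Image.open(src_path) as img:
        # Fall back to assumed_dpi when the file carries no DPI metadata.
        dpi = img.info.get("dpi", (assumed_dpi, assumed_dpi))[0] or assumed_dpi
        new_size = scaled_size(img.size, dpi)
        out = img.convert("L").resize(new_size, Image.LANCZOS)
        out.save(dst_path, format="TIFF", dpi=(300, 300))
    return dst_path
```

Resampling cannot recover detail that was never scanned, so whether upscaling helps is best checked on a few sample images before converting a whole batch.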

At the End

Hope it was a convenient read for all of you. I would encourage readers to reproduce the results demonstrated in the blog-post with python scripts. There is still a lot to explore in tesseract. Further, one can look for:

Detecting the layout of text in images using bounding boxes and their confidence scores. You can look for the other functions that give such finer details of the OCR output.
Training Tesseract on new fonts through transfer learning on the LSTM models in order to improve accuracy.
Understanding the LSTM-based Tesseract models and training them from scratch in order to perform handwritten text recognition.
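As a starting point for the first idea, pytesseract's image_to_data function returns word-level bounding boxes and confidence values. The sketch below is one possible way to use it: the helper names and the confidence threshold of 60 are assumptions made for illustration, and the filtering step is kept separate so it can be tried on a plain dictionary without running Tesseract.

```python
def confident_words(data, min_conf=60.0):
    """Keep (text, box, conf) for recognized words at or above min_conf.

    data is a dict of parallel lists with the keys used below, as returned
    by pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # -1 marks non-word layout entries
        if text.strip() and conf >= min_conf:
            box = (data["left"][i], data["top"][i],
                   data["width"][i], data["height"][i])
            words.append((text, box, conf))
    return words


def words_with_boxes(img_path, min_conf=60.0):
    """Run Tesseract on one image and return its confident words with boxes."""
    import pytesseract  # imported lazily; requires Tesseract to be installed
    from PIL import Image

    with Image.open(img_path) as img:
        data = pytesseract.image_to_data(
            img, output_type=pytesseract.Output.DICT)
    return confident_words(data, min_conf)
```

Each box is (left, top, width, height) in pixels, which is enough to draw rectangles over the image or to group words into lines and columns.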

To showcase the end-to-end application, I developed a basic Qt desktop application. The video below demonstrates the idea.

If you liked the post, follow this blog to get updates about the upcoming articles. Also, share this article so that it can reach out to the readers who can gain from this. Please feel free to discuss anything regarding solving such business problems.

Mobile Study

Tuesday, July 3, 2018