Robust Scene Text Detection and Recognition: Introduction

Identifying and recognizing text in natural scenes and images has become important for use cases such as video caption text recognition, detecting signboards from vehicle-mounted cameras, information retrieval, scene understanding, vehicle number plate recognition, and recognizing text on products.

Most of these use cases require near real-time performance. The common technique for text extraction is to use an optical character recognition (OCR) system. However, most of the free and commercially available OCR systems are trained to recognize text from documents. Recognizing text in natural scenes or captioned videos poses additional challenges, such as perspective distortion, reflections, and blurriness.

The next post in this series, Robust Scene Text Detection and Recognition: Implementation, discusses the implementation of an STDR pipeline using state-of-the-art deep learning algorithms and techniques like incremental learning and fine-tuning. The third post, Robust Scene Text Detection and Recognition: Inference Optimization, covers production-ready optimization and performance for your STDR pipeline.

Typically, the text extraction process involves the following steps:

Text fields are detected from the bigger scene by text detection algorithms.
This text is extracted and recognized using a custom OCR technique.
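
The two steps above can be sketched as a small pipeline. This is a minimal illustration, not the implementation used in this series: the `Image` and `Box` types, `run_pipeline`, and the stand-in `dummy_detect` and `dummy_recognize` functions are hypothetical names chosen for the example, and a real system would plug trained detection and recognition models into the same two slots.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical types for illustration; not taken from any specific library.
Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


@dataclass
class TextResult:
    box: Box
    text: str


def crop(image: Image, box: Box) -> Image:
    """Cut the axis-aligned region described by box out of the image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def run_pipeline(
    image: Image,
    detect: Callable[[Image], Sequence[Box]],
    recognize: Callable[[Image], str],
) -> List[TextResult]:
    """Step 1: detect text regions. Step 2: recognize the text in each crop."""
    results = []
    for box in detect(image):
        results.append(TextResult(box=box, text=recognize(crop(image, box))))
    return results


# Stand-in models (assumptions) so the sketch runs end to end.
def dummy_detect(image: Image) -> Sequence[Box]:
    return [(0, 0, 2, 1), (1, 1, 3, 2)]


def dummy_recognize(patch: Image) -> str:
    # Pretend each pixel value is a character code.
    return "".join(chr(pixel) for row in patch for pixel in row)


if __name__ == "__main__":
    image = [[72, 105, 33], [0, 79, 75]]
    for result in run_pipeline(image, dummy_detect, dummy_recognize):
        print(result.box, result.text)
```

Keeping detection and recognition behind separate callables means either stage can be swapped or optimized independently, which matters later when each model is tuned for latency.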

The recognition of irregular text in natural scene images can be challenging due to the variability in text appearance, such as curvature, orientation, and distortion. To overcome this, sophisticated deep-learning architectures and fine-grained annotations are often required.

However, these can lead to optimization and latency challenges when creating and deploying these algorithms. Despite these challenges, advancements in computer vision have made significant strides in text detection and recognition, providing a powerful tool for various industries. To further optimize inference, you can use specialized optimization tools to reduce latency and improve performance.

In this post, we describe these challenges and our approach for optimization and acceleration of inference. We emphasize that deploying a scene text detection and recognition (STDR) pipeline requires careful consideration of real-world scenarios and conditions. To meet these needs, we have used state-of-the-art deep learning algorithms and leveraged techniques like incremental learning and fine-tuning for specific use cases.

To ensure low latency, we used the following model inference optimization tools:

ONNX Runtime is a cross-platform machine-learning model accelerator that offers flexibility for integrating hardware-specific libraries. It can be used with models from PyTorch, TensorFlow and Keras, TensorFlow Lite, scikit-learn, and other frameworks.
NVIDIA TensorRT SDK is used for high-performance deep learning inference, providing a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.
NVIDIA Triton Inference Server is used for high-performance inference serving across cloud, on-premises, and edge devices.

TensorRT and Triton Inference Server are included in NVIDIA AI Enterprise, the software layer of the NVIDIA AI platform.

STDR applications

Recognizing text from images and videos is used in various industries.

Healthcare and Life Sciences: Scene text detection and recognition are used in the healthcare industry to scan and store the medical history of patients on a computer, including reports, X-rays, previous diseases, treatments, diagnostics, and hospital records. It is also required in medical device and drug manufacturing for logistics and warehouse operations.

Picture of four medicine bottles with prescription labels. — *Figure 1. Sample of medicine package and bottles* (Image: The Times)

Manufacturing Supply Chain/Logistics: Scene text detection and recognition play a crucial role in the food, drink, and cosmetics industries for quality control throughout the supply chain. It is used to track products and read product codes, batch codes, expiry dates, and serial numbers. This information can be used to ensure compliance with safety and anti-counterfeiting laws and to locate products within the supply chain at any given time. OCR is often used in conjunction with barcoding to maximize information collection accuracy.

Warehouse shelves full of boxes with package labels. — *Figure 2. Samples of warehouse packages* (Image: shelving.com)

Banking: Scene text detection and recognition is widely used in the banking industry to automate the processing of know-your-customer (KYC) documents such as birth certificates and marriage certificates.

Automotive and utilities: Self-driving cars and utility line maintenance drives often require text in scene images to be identified and extracted (for example, street names, establishment names, utility pole numbers, and transformer and generator details). Usually, the text is visible for only a fraction of a second because the vehicle is moving, which also creates motion blur. In that case, manual detection becomes impossible.

STDR challenges

The biggest challenge in detecting and extracting text from complex images taken from videos and mobile phones is that the text in such images is often irregular and overlaid on varied backgrounds like glass, plastics, rubber, and so on.

Also, even if a machine learning model is developed with decent accuracy, the expectation is that it should process images live or in near real time. Meeting both the accuracy and the performance expectations requires highly refined models that work well in the cloud as well as on edge devices. These challenges are described in detail in this post.

Creating robust models

Often, the leading reason for accuracy concerns in scene text models is the number of variations in the input data. Here are some of the data variations.

Text size-scale-blur: Text in natural scenes can appear in various sizes and scales. The distance from the camera affects the scale of the text, and the camera angle introduces perspective distortion. Lighting conditions create reflections and shadows around the text, and moving objects or camera movement add blur. All these conditions contribute to the size-scale-blur distortions in images.

Text orientation, color, and font: Text may appear horizontally, vertically, diagonally, and even circularly. This variation in text orientation can make it difficult for algorithms to correctly detect and recognize text. The color, transparency, and font style used also cause challenges when not reflected by the data used in training.

Background and overlays: Text in natural scenes can appear with various backgrounds, such as buildings, trees, vehicles, and so on, and is often overlaid on glass, metal objects, plastics, or stickers. It can also be embossed or debossed onto various kinds of materials.

Multiple languages: Real-world images contain text in multiple scripts and languages. Often, signage or restaurant menus are written in mixed languages.

Another typical challenge in ML projects is the availability of labeled data to train the model. However, for this pipeline, we used a pretrained CRAFT model for text detection, which is trained on the SynthText, IC13, and IC17 datasets.

For text recognition, we used the PARSeq model, which is trained on various datasets (MJSynth, SynthText, COCO-Text, RCTW17, Uber-Text, ArT, LSVT, MLT19, ReCTS, and TextOCR) and fine-tuned with in-house data.

Meeting the performance expectations

Deploying a scene text detection solution can also present various challenges.

Computational resources: Today, modern STDR systems use complex deep learning algorithms. These models have an abundance of parameters, making them computationally expensive to run. Consequently, it can be difficult to deploy these solutions on devices with limited computational resources, such as smartphones or Internet of Things (IoT) devices.

Latency and response time: In many scenarios, scene text detection and recognition must run in real time to be effective. Deep learning models can offer excellent accuracy, but their high number of parameters increases inference time compared to smaller models, which can result in unacceptable latency and response time. State-of-the-art algorithms are needed for accuracy, while inference time can be reduced through optimization techniques such as quantization, lowering precision, and pruning. These optimizations may, in turn, reduce the accuracy of the model.

Data privacy and security: The privacy and security of the data used for training and running the model are important when deploying the solutions in real-world scenarios. The model needs to be protected from malicious attacks and data breaches. Compliance with data privacy regulations must be ensured.
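
To make the accuracy trade-off mentioned under latency and response time concrete, here is a minimal sketch of one such technique, symmetric 8-bit quantization of a list of weights. It is an illustration in plain Python, not the method used in this pipeline or by any particular optimization tool; the function names `quantize_int8`, `dequantize`, and `max_abs_error` are assumptions made for the example.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for all-zero weights.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer representation."""
    return [q * scale for q in quantized]


def max_abs_error(original: List[float], restored: List[float]) -> float:
    """Largest per-weight difference introduced by the round trip."""
    return max(abs(a - b) for a, b in zip(original, restored))


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 0.003]
    quantized, scale = quantize_int8(weights)
    restored = dequantize(quantized, scale)
    print("quantized:", quantized)
    print("max error:", max_abs_error(weights, restored))
```

Each weight now needs 8 bits instead of 32, but small values such as 0.003 collapse to zero. That rounding error, bounded by half the scale, is the source of the accuracy loss that has to be weighed against the latency gain.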

The deployment of scene text detection solutions demands meticulous consideration of the real-world scenarios and conditions in which the solution will be employed. This process is a crucial step that necessitates thorough testing, evaluation, and fine-tuning.

Consider a package delivery company that requires a label-reading application on a conveyor belt. In this case, high accuracy is critical, as any error can cause delays and result in additional costs for the company. The speed of the conveyor belt is another essential factor to consider, as it affects the overall time required to process the packages.

Achieving high accuracy may require complex deep-learning models that can be computationally expensive and impact system latency. To optimize performance, it’s important to consider the specific requirements and constraints of the deployment scenario, such as conveyor belt speed and computational resources, and adjust the deep learning models accordingly to strike a balance between accuracy, latency, and resources.
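
As a rough worked example of that balance, the time available per package follows directly from the belt speed and the spacing between packages. The sketch below uses made-up numbers and hypothetical helper names (`latency_budget_ms`, `meets_budget`), and it assumes packages are processed one at a time with no batching or overlap between stages.

```python
from typing import Sequence


def latency_budget_ms(belt_speed_m_per_s: float, package_pitch_m: float) -> float:
    """Time available per package before the next one arrives, in milliseconds."""
    if belt_speed_m_per_s <= 0 or package_pitch_m <= 0:
        raise ValueError("belt speed and package pitch must be positive")
    return package_pitch_m / belt_speed_m_per_s * 1000.0


def meets_budget(
    stage_latencies_ms: Sequence[float],
    budget_ms: float,
    headroom: float = 0.8,
) -> bool:
    """Check that the summed stage latencies fit within a fraction of the budget.

    The headroom factor (an assumption) reserves time for image capture,
    data transfer, and occasional slow frames.
    """
    return sum(stage_latencies_ms) <= budget_ms * headroom


if __name__ == "__main__":
    # Illustrative numbers only: belt at 2 m/s, one package every 0.5 m.
    budget = latency_budget_ms(belt_speed_m_per_s=2.0, package_pitch_m=0.5)
    print("budget per package (ms):", budget)
    # Hypothetical detection and recognition latencies in milliseconds.
    print("fits:", meets_budget([120.0, 60.0], budget))
```

If the models do not fit the budget, the options are the ones discussed above: a lighter model, inference optimization, more compute, or a slower belt.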

Summary

In this post, we discussed the importance of robust scene text detection and recognition (STDR) in various industries. We highlighted the challenges faced in STDR, including creating accurate models, meeting performance expectations, and dealing with real-world scenarios and conditions.

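For more information, see the next posts in this series:

Robust Scene Text Detection and Recognition: Implementation
Robust Scene Text Detection and Recognition: Inference Optimization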