How to improve OCR accuracy

As businesses seek to streamline and automate workflows, extracting text from images has become increasingly crucial. Optical Character Recognition (OCR) technology has made great strides in recent years – even consumer smartphones can now accurately scan documents.

However, OCR accuracy still hinges on the quality of input images. This is where image pre-processing comes in. This article explores how image pre-processing techniques enhance OCR performance, making text extraction more reliable and efficient.

What is OCR, and how does it work?

Optical Character Recognition (OCR) technology converts images of text into digital text. The software first separates the text from the background and then recognizes the individual characters. The result is machine-readable text – data that can be copied, searched, and processed automatically.

Generally, an OCR solution works like this:

Image capture: The physical document is scanned with a camera, such as a smartphone’s. The result is a grayscale or color scan.
Character recognition: The OCR engine detects areas on the scanned document that contain individual characters, words, or lines. Historically, engines compared each character against a database of character images in a variety of fonts (matrix matching) or against topological features such as open areas and line intersections (feature extraction). Modern OCR instead uses neural networks, which provide superior pattern recognition. They can reliably recognize text even under challenging conditions such as low lighting, shadows, skewed text, or badly printed characters.
Output: Once the text is extracted from the input image, it can be processed. The output format depends on the use case: To create a searchable PDF file, for instance, the recognized text is added as an invisible layer over the input image. Or, if further editing of the document is required, the software can create an editable text file that mimics the formatting and layout of the input image.
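
To make the historical matrix-matching approach more concrete, here is a minimal Python sketch. It compares a small character bitmap against stored templates pixel by pixel and picks the closest one. The 3x3 templates and the names TEMPLATES, match_score, and recognize are made up for illustration; real engines worked with much larger bitmaps and many fonts.

```python
# Hypothetical 3x3 templates ("1" = ink, "0" = background).
TEMPLATES = {
    "I": ["010",
          "010",
          "010"],
    "L": ["100",
          "100",
          "111"],
    "T": ["111",
          "010",
          "010"],
}

def match_score(glyph, template):
    """Fraction of pixels that agree between two equally sized bitmaps."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for a, b in zip(glyph_row, template_row):
            total += 1
            same += (a == b)
    return same / total

def recognize(glyph):
    """Return the best-matching character and its score."""
    best = max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))
    return best, match_score(glyph, TEMPLATES[best])

# An "L" with one missing pixel, as a badly printed character might look.
noisy_l = ["100",
           "100",
           "110"]
character, score = recognize(noisy_l)
```

The sketch also hints at why this approach struggles with unfamiliar fonts or distorted input: every pixel that differs from the stored template lowers the score, whether or not it matters for the character's identity.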

Benefits of OCR

Accurate OCR improves data integrity by ensuring that text extracted from documents faithfully represents the original content. It eliminates the mistakes that arise from manual data entry, thus reducing the need for manual data correction and validation. Together with the fact that the documents are now searchable, this dramatically speeds up document management workflows from data entry to retrieval. The result is significant time and cost savings.

The workflow automation this enables further boosts productivity and efficiency.

To reach its full potential, however, OCR depends on high input image quality. Less-than-ideal images with blurry text, skewing, shadows, and low contrast pose a serious challenge to OCR accuracy.

Enhance OCR accuracy with image pre-processing

Modern OCR software is more than the OCR engine that performs the actual text recognition. Today’s solutions use image pre-processing techniques to enhance input image quality. The focus is usually on adjusting image angle and enhancing contrast between text and background.

Deskewing

Scans of paper documents are often skewed at a slight angle, so that the text lines appear tilted. Deskewing rotates the image to correct this and makes it easier for the OCR software to establish text baselines, which are crucial reference lines the OCR engine uses for character alignment and recognition.
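
One common way to estimate the skew angle is a projection-profile search. The Python sketch below illustrates the idea on synthetic data; it is not necessarily how any particular OCR product does it. It rotates the coordinates of the dark pixels by each candidate angle and keeps the angle at which the pixels pile up in the fewest rows. The function names and the 0.5 degree search step are assumptions chosen for the example.

```python
import math
from collections import Counter

def profile_sharpness(points, angle_deg):
    """Rotate dark-pixel coordinates by -angle and score the row histogram.

    Level text lines pile their pixels into few rows, so the sum of squared
    row counts is largest when the rotation cancels the skew.
    """
    a = math.radians(angle_deg)
    rows = Counter(round(y * math.cos(a) - x * math.sin(a)) for x, y in points)
    return sum(count * count for count in rows.values())

def estimate_skew(points, max_angle=10.0, step=0.5):
    """Return the candidate angle (in degrees) with the sharpest row profile."""
    steps = int(max_angle / step)
    candidates = [i * step for i in range(-steps, steps + 1)]
    return max(candidates, key=lambda angle: profile_sharpness(points, angle))

def synthetic_lines(skew_deg, baselines=(10, 30, 50), width=100):
    """Dark-pixel coordinates of straight 'text lines' tilted by skew_deg.

    Sign convention: a positive angle means y grows as x grows.
    """
    slope = math.tan(math.radians(skew_deg))
    return [(x, round(base + x * slope)) for base in baselines for x in range(width)]

skewed = synthetic_lines(3.0)
angle = estimate_skew(skewed)
```

Once the angle is known, rotating the image by the opposite angle levels the text lines. A finer step gives a more precise estimate at the cost of more candidate angles to test.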

Image filters

Image filters are often applied as a first step to improve quality. As color information is usually unnecessary for text recognition, images are typically converted to grayscale. Reducing each pixel to a single brightness value can result in cleaner backgrounds and sharper text characters.
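
As a rough illustration of grayscale conversion, the following Python sketch turns RGB pixels into single brightness values using the widely used ITU-R BT.601 luma weights. Representing the image as nested lists of (r, g, b) tuples is an assumption made to keep the example free of dependencies; image libraries offer equivalent conversions.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of 0-255 gray values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_image
    ]

sample = [
    [(255, 255, 255), (0, 0, 0)],  # white paper, black ink
    [(255, 0, 0), (0, 0, 255)],    # red stamp, blue pen
]
gray = to_grayscale(sample)
```

The weights reflect that green contributes most to perceived brightness and blue least, which is why blue ink ends up darker than a red stamp in the grayscale result.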

Some OCR engines don’t perform well on grayscale or color images and need more contrast to deliver accurate results. Binarization converts the image from grayscale to pure black-and-white pixels, creating maximum contrast between text and background.
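
A binarization step needs a threshold that separates ink from paper. The Python sketch below uses Otsu's method, one well-known way to pick that threshold automatically from the image histogram. It is an illustration under simple assumptions (a nested-list grayscale image, one global threshold), not a description of what a specific OCR engine does.

```python
def otsu_threshold(gray_pixels):
    """Pick the threshold that maximizes between-class variance (Otsu's method)."""
    histogram = [0] * 256
    for value in gray_pixels:
        histogram[value] += 1

    total = len(gray_pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level in range(256):
        dark_count += histogram[level]
        dark_sum += level * histogram[level]
        bright_count = total - dark_count
        if dark_count == 0 or bright_count == 0:
            continue  # one class is empty, so there is nothing to separate
        dark_mean = dark_sum / dark_count
        bright_mean = (total_sum - dark_sum) / bright_count
        variance = dark_count * bright_count * (dark_mean - bright_mean) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold

def binarize(gray_image):
    """Map pixels at or below the threshold to 0 (ink), the rest to 255 (paper)."""
    flat = [value for row in gray_image for value in row]
    threshold = otsu_threshold(flat)
    return [[0 if value <= threshold else 255 for value in row] for row in gray_image]

sample = [
    [200, 210, 60],
    [50, 205, 55],
    [215, 45, 220],
]
binary = binarize(sample)
```

A single global threshold works well for evenly lit scans. For images with shadows or uneven lighting, thresholds computed per image region are a common alternative.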

Apart from enhancing contrast, these filters also reduce image noise. By removing dust specks, compression artifacts, and other noise, they improve character recognition.
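
A median filter is one simple example of such a noise-reducing filter. The Python sketch below replaces every pixel with the median of its 3x3 neighborhood, which removes isolated specks while leaving wider strokes intact. The nested-list image format and the border handling are assumptions made for this example.

```python
from statistics import median_high

def median_filter(image):
    """Replace each pixel with the median of its 3x3 neighborhood.

    Border pixels use only the neighbors that exist. For windows with an even
    number of pixels, median_high resolves ties toward the brighter value.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(median_high(window))
        result.append(row)
    return result

# White page (255) with a single dark dust speck (0) in the middle.
speck = [[255] * 5 for _ in range(5)]
speck[2][2] = 0
cleaned = median_filter(speck)

# A vertical stroke three pixels wide, as part of a character might be.
stroke = [[255, 0, 0, 0, 255] for _ in range(5)]
kept = median_filter(stroke)
```

The trade-off is that very thin strokes, such as hairlines in small print, can be weakened by the same filter, so the window size has to fit the text size.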

Though these image-processing techniques can enhance OCR accuracy, they only go so far. Some images are so poor they cannot be salvaged. The solution is to prevent these poor-quality images in the first place – and this is where a feedback system for users comes into play.

Prevent low-quality scans and use only high-quality input images

Recently, we introduced the Scanbot SDK Document Quality Analyzer (DQA). This feature provides you with built-in quality control right at the source: in your document scanner app. It ensures that documents are captured with optimal quality and are suitable for subsequent OCR processing.

This is how it works:

The DQA detects all Latin characters and numbers in the input image.
Next, it examines each detected text string and rates it – based on how sharp the characters are – as “excellent”, “good”, “reasonable”, “poor”, or “very poor”.
If the average rating of the input image does not meet your set threshold, a notification asks the user to retake the scan.

You can set your own acceptance threshold, but we recommend using at least “good” for automatic OCR processing. This prevents low-quality images from entering the OCR process to begin with.
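
The gating logic described above can be sketched in a few lines of Python. Note that this is not the Scanbot SDK API: the rating labels come from the list above, but the numeric scale, the way the average is computed, and the names RATING_SCORES, average_score, and should_retake are assumptions made purely for illustration.

```python
# Rating labels as listed in the article; the numeric scale is an assumption.
RATING_SCORES = {
    "very poor": 0,
    "poor": 1,
    "reasonable": 2,
    "good": 3,
    "excellent": 4,
}

def average_score(ratings):
    """Mean numeric score of the per-string ratings."""
    return sum(RATING_SCORES[rating] for rating in ratings) / len(ratings)

def should_retake(ratings, threshold="good"):
    """True if the scan's average rating falls below the threshold."""
    if not ratings:
        return True  # assumption: a scan without detected text counts as failed
    return average_score(ratings) < RATING_SCORES[threshold]

sharp_scan = ["excellent", "good", "good"]
blurry_scan = ["good", "poor", "reasonable"]
```

With the default threshold of "good", the sharp scan would pass on to OCR, while the blurry scan would trigger a prompt to retake it. Lowering the threshold to "reasonable" would let the blurry scan through as well.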

The DQA is part of our Document Scanner SDK, but is also available as a stand-alone, self-hosted module. The SDK’s advanced OCR software turns any smartphone or tablet into a powerful document scanner. Features such as user guidance, automatic capture, and automatic cropping minimize low-quality submissions and ensure accurate results. Its powerful OCR engine also works on full-color images.

Try our free demo app and see for yourself why companies like HUK Coburg, one of the ten largest insurance groups in Germany, trust our document-scanning solution. If you would like to learn more about our Document Scanner SDK or the Document Quality Analyzer, please contact us at sdk@scanbot.io.