OCR Product

Optical Character Recognition

Do quick and efficient OCR on Digital Documents using Pixuate’s Technology

Introduction
OCR (Optical Character Recognition) is the recognition of printed or written text characters by a computer. It involves scanning the text character by character, analyzing the scanned image, and translating each character image into a character code, such as ASCII, commonly used in data processing. During OCR processing, the scanned image or bitmap is analyzed for light and dark areas in order to identify each alphabetic letter or numeric digit. When a character is recognized, it is converted into its character code. Special circuit boards and computer chips designed expressly for OCR can be used to speed up the recognition process.
OCR is used by libraries to digitize and preserve their holdings. It is also used to process checks and credit card slips and to sort mail. Billions of magazines and letters are sorted every day by OCR machines, considerably speeding up mail delivery.
Pixuate offers OCR solutions that are crafted and fine-tuned for industrial needs that require very high real-time accuracy.
Features of Pixuate OCR
  • Uses state-of-the-art Deep Neural Networks (DNNs)
  • Trained with millions of characters
  • Can identify the frequency of occurrence of each character and keep a record of its location
  • User-friendly Pixuate GUI to capture and recognize text. Users can even edit the OCR results manually.
  • Ability to work on Structured and Unstructured documents
  • Very high accuracy
  • Eliminates manual labor and human errors
How It's Done
The process of OCR involves three stages:
Pre-processing
A pre-processing stage is necessary to achieve state-of-the-art accuracy. The techniques include:
  • De-skew – If the document was not aligned properly when scanned, it may need to be tilted a few degrees clockwise or counterclockwise in order to make lines of text perfectly horizontal or vertical.
  • Despeckle – Removes positive and negative spots and smooths edges.
  • Binarisation – Converts an image from colour or greyscale to black-and-white (called a "binary image" because there are two colours). Binarisation is a simple way of separating the text (or any other desired image component) from the background, and it is necessary because most commercial recognition algorithms work only on binary images, which are simpler to process. The effectiveness of this step significantly influences the quality of the character recognition stage, so the binarisation method must be chosen carefully to suit the type of input image (scanned document, scene text image, historical degraded document, etc.).
  • Line removal – Cleans up non-glyph boxes and lines
  • Layout analysis or "zoning" – Identifies columns, paragraphs, captions, etc. as distinct blocks. Especially important in multi-column layouts and tables.
  • Line and word detection – Establishes baseline for word and character shapes, separates words if necessary.
  • Script recognition – In multilingual documents, the script may change at the word level, so the script must be identified before the appropriate OCR engine can be invoked to handle it.
  • Character isolation or "segmentation" – For per-character OCR, multiple characters that are connected due to image artifacts must be separated; single characters that are broken into multiple pieces due to artifacts must be connected.
  • Normalisation – Normalises aspect ratio and scale.
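As a rough illustration of the binarisation step listed above, the sketch below applies Otsu's global threshold to a greyscale image held as a list of rows. Otsu's method is only one common choice and is an assumption here; the source does not say which binarisation method Pixuate uses. The names `otsu_threshold` and `binarise` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the grey levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance: large when the two groups are well separated.
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarise(image):
    """Map a greyscale image (rows of 0-255 ints) to 1 = ink, 0 = background."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


# Tiny example: a dark vertical stroke on a light page.
image = [[250, 20, 245],
         [240, 30, 235]]
print(binarise(image))
```

A single global threshold tends to suit clean scans; unevenly lit or degraded pages usually call for a locally adaptive method instead.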
Segmentation of fixed-pitch fonts is accomplished relatively simply by aligning the image to a uniform grid based on where vertical grid lines will least often intersect black areas. For proportional fonts, more sophisticated techniques are needed because whitespace between letters can sometimes be greater than that between words, and vertical lines can intersect more than one character.
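The fixed-pitch case described above can be sketched as a small search: try each candidate pitch and offset, and keep the grid whose vertical lines cross the least ink. This is a minimal sketch assuming a single binarised text line stored as rows of 0/1 values; the function names and the pitch range are hypothetical.

```python
def column_ink(binary_image):
    """Count black pixels (1s) in every column of a binary text-line image."""
    return [sum(col) for col in zip(*binary_image)]


def best_grid(binary_image, min_pitch, max_pitch):
    """Return (pitch, offset) of the vertical grid that crosses the least ink.

    Cost is the average ink under the grid lines. Ties are broken in favour
    of the smaller pitch, so multiples of the true pitch are not preferred.
    """
    ink = column_ink(binary_image)
    width = len(ink)
    best = None
    for pitch in range(min_pitch, max_pitch + 1):
        for offset in range(pitch):
            lines = range(offset, width, pitch)
            if not lines:
                continue
            cost = sum(ink[x] for x in lines) / len(lines)
            candidate = (cost, pitch, offset)
            if best is None or candidate < best:
                best = candidate
    _, pitch, offset = best
    return pitch, offset


def split_cells(binary_image, pitch, offset):
    """Return (start, end) column ranges between grid lines that contain ink."""
    ink = column_ink(binary_image)
    width = len(ink)
    edges = [0] + list(range(offset, width, pitch)) + [width]
    return [(a, b) for a, b in zip(edges, edges[1:]) if any(ink[a:b])]


# Three glyphs, each 3 columns wide, separated by one blank column.
line = [[1, 1, 1, 0] * 3,
        [1, 1, 1, 0] * 3]
pitch, offset = best_grid(line, 3, 6)
print(pitch, offset, split_cells(line, pitch, offset))
```

Proportional fonts break the assumption of a single pitch, which is why they need the more sophisticated techniques mentioned above.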
Character recognition
There are two basic types of core OCR algorithm, either of which may produce a ranked list of candidate characters.
Matrix matching involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as "pattern matching", "pattern recognition", or "image correlation". It relies on the input glyph being correctly isolated from the rest of the image, and on the stored glyph being in a similar font and at the same scale. The technique works best with typewritten text and does not work well when new fonts are encountered. Early physical photocell-based OCR implemented this technique rather directly.
Feature extraction decomposes glyphs into "features" such as lines, closed loops, line direction, and line intersections. Extracting features reduces the dimensionality of the representation and makes the recognition process computationally efficient. These features are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes. General techniques of feature detection in computer vision are applicable to this type of OCR, which is commonly seen in "intelligent" handwriting recognition and indeed in most modern OCR software. Nearest-neighbour classifiers such as the k-nearest neighbours algorithm are used to compare image features with stored glyph features and choose the nearest match.
Software such as Cuneiform and Tesseract uses a two-pass approach to character recognition. The second pass, known as "adaptive recognition", uses the letter shapes recognised with high confidence on the first pass to better recognise the remaining letters. This is advantageous for unusual fonts or low-quality scans where the font is distorted (e.g. blurred or faded).
Pixuate uses state-of-the-art Deep Neural Networks for character classification, which delivers very high accuracy compared with the conventional methods.
The OCR result can be stored in the standardised ALTO format, a dedicated XML schema.
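To make the matrix-matching idea concrete, the sketch below compares an isolated glyph with stored templates pixel by pixel and returns a ranked list of candidates, nearest first. It illustrates the conventional technique only, not Pixuate's neural-network classifier; the 5×3 templates and function names are made up for the example.

```python
# Hypothetical stored glyphs: 5 rows x 3 columns, "1" = ink.
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
    "O": ["111", "101", "101", "101", "111"],
}


def hamming(glyph_a, glyph_b):
    """Number of pixel positions at which two same-sized glyphs differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(glyph_a, glyph_b)
        for pa, pb in zip(row_a, row_b)
    )


def rank_candidates(glyph, templates=TEMPLATES):
    """Return candidate characters ordered from best to worst pixel match."""
    scored = [(hamming(glyph, tpl), ch) for ch, tpl in templates.items()]
    return [ch for _, ch in sorted(scored)]


# A "T" with one stray pixel, as might come from a noisy scan.
noisy_t = ["111", "010", "010", "011", "010"]
print(rank_candidates(noisy_t))  # ['T', 'I', 'O', 'L']
```

The comparison only works when the glyph has been isolated and scaled to the template size, which is why the pre-processing steps matter so much for this family of methods.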
Post-processing
OCR accuracy can be increased if the output is constrained by a lexicon – a list of words that are allowed to occur in a document. This might be, for example, all the words in the English language, or a more technical lexicon for a specific field. This technique can be problematic if the document contains words not in the lexicon, like proper nouns. Tesseract uses its dictionary to influence the character segmentation step, for improved accuracy.
The output stream may be a plain text stream or file of characters, but more sophisticated OCR systems can preserve the original layout of the page and produce, for example, an annotated PDF that includes both the original image of the page and a searchable textual representation.
"Near-neighbor analysis" can make use of co-occurrence frequencies to correct errors, by noting that certain words are often seen together. For example, "Washington, D.C." is generally far more common in English than "Washington DOC". Knowledge of the grammar of the language being scanned can also help determine whether a word is likely to be a verb or a noun, allowing greater accuracy.
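The lexicon constraint described above can be sketched with Python's standard `difflib` module: each OCR token that is not in the lexicon is replaced by its closest lexicon entry, provided the match is close enough. The function name, the sample lexicon and the 0.75 cutoff are assumptions chosen for illustration.

```python
import difflib


def correct_with_lexicon(tokens, lexicon, cutoff=0.75):
    """Snap OCR tokens to the closest lexicon word; keep unknown words as-is."""
    lookup = {word.lower(): word for word in lexicon}
    corrected = []
    for token in tokens:
        if token.lower() in lookup:
            corrected.append(token)  # already a valid word
            continue
        matches = difflib.get_close_matches(
            token.lower(), list(lookup), n=1, cutoff=cutoff
        )
        # No sufficiently close match: leave the token untouched so that
        # proper nouns and other out-of-lexicon words are not mangled.
        corrected.append(lookup[matches[0]] if matches else token)
    return corrected


lexicon = ["invoice", "total", "amount", "due"]
ocr_tokens = ["lnvoice", "tota1", "amount", "Pixuate"]
print(correct_with_lexicon(ocr_tokens, lexicon))
```

Leaving low-similarity tokens unchanged is one way to limit the damage when a document contains words the lexicon does not cover; a stricter or looser cutoff trades missed corrections against false ones.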
Documents Supported
Structured Form Document Processing
As the name suggests, these types of documents (commonly insurance forms, tax returns, voting ballots and standardized tests) have a consistent structure, with every data field located in the same place. Consequently, structured forms are the easiest to process with Optical Character Recognition (OCR) engines, and they yield excellent accuracy rates for data capture. Although structured forms are ideal for accurate and efficient high-volume document processing, it is estimated that some 80 percent of organizations use semi-structured or unstructured forms. Pixuate® can assist organizations in the design of standardized, structured paper and web-based forms to improve the accuracy and efficiency of data capture, a business process improvement that delivers significant cost savings.
Advantages
  • Fast, automated data capture
  • Lower processing costs
  • Very high data-capture accuracy
Examples Include
  • Identity Documents
  • Passports
  • Driving Licences
Unstructured Document Processing
The placement of information in unstructured documents is inconsistent. For example, invoices will normally have a vendor name and address, a tax I.D. number, a total amount due, and an invoice number and date. However, this information can be placed anywhere on a wide variety of forms from different vendors. The Pixuate® solution incorporates intelligent, automated forms classification technologies and multiple optical character recognition (OCR) engines to meet the complex data entry challenge associated with unstructured forms. Our software platform can locate information on unstructured forms and automatically extract the required data. This saves critical time and reduces the need for manual forms processing by data-entry operators.
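As a simple illustration of locating fields whose position varies from vendor to vendor, the sketch below searches OCR output text for keyword-anchored patterns instead of fixed page coordinates. It is a generic stand-in and not a description of Pixuate's classification technology; the field names and regular expressions are hypothetical.

```python
import re

# Hypothetical patterns: each looks for a label and captures the value after it.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(
        r"date\s*[:\-]?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", re.IGNORECASE
    ),
    "total_due": re.compile(
        r"(?:total|amount)\s+due\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def extract_fields(ocr_text):
    """Return a dict of field name -> captured value (None if not found)."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        fields[name] = match.group(1) if match else None
    return fields


# Two vendors laying out the same information differently.
vendor_a = "ACME Ltd\nInvoice No: INV-1042\nDate: 12/03/2021\nTotal due: $1,250.00"
vendor_b = "Amount due 980.50 | Invoice # 77-B | Date 01-04-2021"
print(extract_fields(vendor_a))
print(extract_fields(vendor_b))
```

Hand-written patterns like these become brittle as the variety of layouts grows, which is one reason production systems typically add document classification and learned extraction on top.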
Advantages
  • Fast, automated data capture even with inconsistent documents
  • Eliminate late fees
  • Capture early payment discounts
Examples Include
  • Purchase Orders
  • Invoices
  • Bank Account Statements
Deep Neural Networks for OCR
The past few years have seen rapid development of Deep Neural Networks (DNNs) as a learning mechanism for recognition. This popularity is owed primarily to the high accuracy DNNs have achieved in locating text regions and deciphering the characters at the same time. Deep Neural Networks, of which Convolutional Neural Networks (CNNs) are the variant most commonly applied to images, are essentially multi-layered neural networks that learn and process features. Each neuron (node) in each layer is fed with information passed from the nodes connected to it. A processing mechanism (transfer function) then determines how much of the processed information is passed on to the nodes connected to the present one. The architecture of the network, that is, the way neurons and layers are connected, plays a primary role in determining the network’s ability to produce meaningful results.

An advantage of DNNs is that the architecture can be made heterogeneous. As in the human visual system, different neurons and processing layers are sensitive to different features of objects: edges are seen more sharply by one set of neurons, while others are more sensitive to color gradients. Researchers exploit this heterogeneity to construct sophisticated architectures in which neurons and layers are connected so that data propagates back and forth between them before a result is produced. Researchers and industrial practitioners have widely demonstrated the potential of DNNs in OCR. Several architectures for both text detection and character recognition have been implemented and have shown excellent accuracy in real time. A GPU implementation of a DNN has been shown to reduce processing time and improve accuracy in an architecture trained and tested against a deformed-character dataset. Detection and OCR in natural scene images and in real-time video captured by cellphones have also been shown to achieve very high accuracy.
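The way each neuron combines its inputs and passes the result through a transfer function, as described earlier in this section, can be sketched in a few lines. This is a minimal fully connected forward pass with a sigmoid transfer function, assumed here for simplicity; it is not a convolutional network and does not represent Pixuate's architecture. All names and weights are illustrative.

```python
import math


def sigmoid(x):
    """Transfer function: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases, transfer=sigmoid):
    """Each neuron sums its weighted inputs, adds a bias, then applies the transfer function."""
    return [
        transfer(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the inputs through a list of (weights, biases) layers in order."""
    activation = inputs
    for weights, biases in layers:
        activation = dense_layer(activation, weights, biases)
    return activation


# Two inputs -> hidden layer of two neurons -> one output neuron.
layers = [
    ([[0.5, -0.2], [0.1, 0.8]], [0.0, -0.1]),
    ([[1.0, -1.0]], [0.0]),
]
print(forward([1.0, 2.0], layers))
```

In a trained network the weights and biases are learned from labelled examples; here they are fixed by hand purely to show how information flows from layer to layer.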
Benefits of Pixuate OCR
  • With OCR, the recognized document looks just like the original.
  • Advanced, powerful OCR software allows you to save a lot of time and effort when creating, processing and repurposing various documents.
  • With state-of-the-art OCR, you can scan paper documents for further editing and sharing with your colleagues and partners. You can extract quotes from books and magazines and use them in your coursework and papers without retyping.
  • With a digital camera and OCR, you can capture text outdoors from banners, posters and timetables and then use the captured information for your own purposes.
  • In the same way, you can capture information from paper documents and books, for example if there is no scanner close at hand or you cannot use one. In addition, you can use OCR software to create searchable PDF archives.
  • The entire process of converting an original paper document, image or PDF takes a few seconds, and the final recognized document looks just like the original!
Applications
  • Data entry for business documents, e.g. check, invoice, bank statement and receipt
  • Automatic number plate recognition
  • Automatic extraction of key information from insurance documents
  • Extracting business card information into a contact list
  • More quickly make textual versions of printed documents, e.g. book scanning
  • Make electronic images of printed documents searchable, e.g. Google Books
  • Converting handwriting in real time to control a computer (pen computing)
  • Defeating CAPTCHA anti-bot systems, though these are specifically designed to prevent OCR. The purpose can also be to test the robustness of CAPTCHA anti-bot systems.
  • Assistive technology for blind and visually impaired users

Want to know how OCR can be used to automate reading digital documents? Speak to our team now...