Optical Character Recognition
Introduction
Optical Character Recognition (OCR) is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image. It is widely used as a form of information entry from printed paper data records, whether passport documents, invoices, bank statements, computerised receipts, business cards, mail, printouts of static-data, or any suitable documentation. It is a common method of digitising printed texts so that they can be electronically edited, searched, stored more compactly, displayed on-line, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.
OCR in the 20th Century
In the late 1920s and into the 1930s, Emanuel Goldberg developed what he called a "Statistical Machine" for searching microfilm archives using an optical code recognition system. In 1931 he was granted U.S. Patent 1,838,389 for the invention. The patent was acquired by IBM.
With the advent of smartphones and smartglasses, OCR can be used in internet-connected mobile device applications that extract text captured using the device's camera. Devices that do not have OCR functionality built into the operating system will typically use an OCR API to extract the text from the image file captured by the device. The OCR API returns the extracted text, along with information about the location of the detected text in the original image, to the device app for further processing (such as text-to-speech) or display.
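To make the shape of such an API response concrete, the following is a minimal sketch that parses a hypothetical JSON payload containing recognised text and its location in the image. The payload and its field names (`text`, `words`, `box`) are assumptions made purely for illustration; each real OCR API defines its own schema.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    left: int
    top: int
    width: int
    height: int


# Hypothetical response body; the field names are assumptions, not a real API schema.
SAMPLE_RESPONSE = """
{
  "text": "EXIT 12",
  "words": [
    {"text": "EXIT", "box": {"left": 40, "top": 18, "width": 96, "height": 30}},
    {"text": "12", "box": {"left": 150, "top": 18, "width": 44, "height": 30}}
  ]
}
"""


def parse_ocr_response(body: str) -> List[Word]:
    """Turn the hypothetical JSON payload into a list of Word records."""
    payload = json.loads(body)
    words = []
    for item in payload["words"]:
        box = item["box"]
        words.append(Word(item["text"], box["left"], box["top"],
                          box["width"], box["height"]))
    return words


def reading_order(words: List[Word]) -> str:
    """Join words top-to-bottom, then left-to-right, e.g. for text-to-speech."""
    ordered = sorted(words, key=lambda w: (w.top, w.left))
    return " ".join(w.text for w in ordered)


words = parse_ocr_response(SAMPLE_RESPONSE)
print(reading_order(words))
```

Keeping the bounding boxes alongside the text lets the app either highlight the detected words on the original image or, as here, derive a reading order before passing the text on.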
Techniques of OCR
Pre-processing
OCR software often "pre-processes" images to improve the chances of successful recognition. Techniques include:
- De-skew – If the document was not aligned properly when scanned, it may need to be tilted a few degrees clockwise or counterclockwise in order to make lines of text perfectly horizontal or vertical.
- Despeckle – Removes positive and negative spots and smooths edges.
- Binarisation – Converts an image from colour or greyscale to black-and-white (called a "binary image" because there are two colours). Binarisation is a simple way of separating the text (or any other desired image component) from the background. It is needed because most commercial recognition algorithms work only on binary images, which are simpler to process. The effectiveness of this step significantly influences the quality of the character recognition stage, so the binarisation method must be chosen carefully for the type of input image (scanned document, scene text image, historical degraded document, etc.).
- Line removal – Cleans up non-glyph boxes and lines
- Layout analysis or "zoning" – Identifies columns, paragraphs, captions, etc. as distinct blocks. Especially important in multi-column layouts and tables.
- Line and word detection – Establishes baseline for word and character shapes, separates words if necessary.
- Script recognition – In multilingual documents, the script may change at the level of the words and hence, identification of the script is necessary, before the right OCR can be invoked to handle the specific script.
- Character isolation or "segmentation" – For per-character OCR, multiple characters that are connected due to image artifacts must be separated; single characters that are broken into multiple pieces due to artifacts must be connected.
- Normalise aspect ratio and scale
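As an illustration of the binarisation step in the list above, here is a minimal sketch using Otsu's method, one common way of choosing a single global threshold. The text does not prescribe a particular method; Otsu's is used here only as an example, and the tiny `page` image is made up for the demonstration.

```python
from typing import List

Image = List[List[int]]  # greyscale pixels, 0 (black) to 255 (white)


def otsu_threshold(image: Image) -> int:
    """Return the grey level that maximises the between-class variance."""
    histogram = [0] * 256
    for row in image:
        for pixel in row:
            histogram[pixel] += 1

    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]        # pixels with value <= level
        if weight_dark == 0:
            continue
        weight_light = total - weight_dark     # pixels with value > level
        if weight_light == 0:
            break
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(image: Image, threshold: int) -> Image:
    """Map pixels at or below the threshold to 0 (ink) and the rest to 1."""
    return [[0 if pixel <= threshold else 1 for pixel in row] for row in image]


# Made-up 3x4 greyscale patch: dark ink (about 30) on a light page (about 220).
page = [
    [220, 215, 30, 225],
    [218, 35, 28, 221],
    [222, 219, 32, 217],
]
threshold = otsu_threshold(page)
print(binarise(page, threshold))
```

A single global threshold works well for clean scans like this toy patch; unevenly lit photographs or degraded historical documents usually call for a locally adaptive method instead, which is why the choice depends on the input type.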
There are two basic types of core OCR algorithm, either of which may produce a ranked list of candidate characters.
Matrix matching involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as "pattern matching", "pattern recognition", or "image correlation". This relies on the input glyph being correctly isolated from the rest of the image, and on the stored glyph being in a similar font and at the same scale. The technique works best with typewritten text and does not work well when new fonts are encountered. It is the technique that early physical photocell-based OCR implemented, rather directly.
Feature extraction decomposes glyphs into "features" such as lines, closed loops, line direction, and line intersections. Extracting features reduces the dimensionality of the representation and makes the recognition process computationally efficient. These features are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes. General techniques of feature detection in computer vision are applicable to this type of OCR, which is commonly seen in "intelligent" handwriting recognition and indeed in most modern OCR software. Nearest neighbour classifiers such as the k-nearest neighbours algorithm are used to compare image features with stored glyph features and choose the nearest match.
Software such as CuneiForm and Tesseract uses a two-pass approach to character recognition. The second pass is known as "adaptive recognition" and uses the letter shapes recognised with high confidence on the first pass to better recognise the remaining letters. This is advantageous for unusual fonts or low-quality scans where the font is distorted (e.g. blurred or faded). The OCR result can be stored in the standardised ALTO format, a dedicated XML schema.
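The matrix matching idea described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular OCR engine is implemented: the 5x3 templates and the simple "fraction of agreeing pixels" score are invented for the example.

```python
from typing import Dict, List, Tuple

Glyph = List[str]  # rows of '#' (ink) and '.' (background), all the same size

# Toy 5x3 templates; a real system would store glyphs per font and size.
TEMPLATES: Dict[str, Glyph] = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def similarity(a: Glyph, b: Glyph) -> float:
    """Fraction of pixel positions where the two glyphs agree."""
    pixels = [(pa, pb) for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b)]
    matches = sum(1 for pa, pb in pixels if pa == pb)
    return matches / len(pixels)


def rank_candidates(glyph: Glyph) -> List[Tuple[str, float]]:
    """Return (character, score) pairs, best match first."""
    scores = [(char, similarity(glyph, template))
              for char, template in TEMPLATES.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# A noisy "T": one stray ink pixel on the left of the middle row.
noisy_t = ["###", ".#.", "##.", ".#.", ".#."]
print(rank_candidates(noisy_t))
```

The ranked list puts "T" first with 14 of 15 pixels matching, followed by "I" and "L". Because the comparison is pixel-by-pixel, the sketch also shows why the input glyph must already be isolated and scaled to the template size.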
OCR accuracy can be increased if the output is constrained by a lexicon – a list of words that are allowed to occur in a document. This might be, for example, all the words in the English language, or a more technical lexicon for a specific field. This technique can be problematic if the document contains words not in the lexicon, like proper nouns. Tesseract uses its dictionary to influence the character segmentation step, for improved accuracy.
The output stream may be a plain text stream or file of characters, but more sophisticated OCR systems can preserve the original layout of the page and produce, for example, an annotated PDF that includes both the original image of the page and a searchable textual representation.
"Near-neighbor analysis" can make use of co-occurrence frequencies to correct errors, by noting that certain words are often seen together. For example, "Washington, D.C." is generally far more common in English than "Washington DOC". Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or a noun, for example, allowing greater accuracy.
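One simple way to apply a lexicon is to replace each recognised word with its closest lexicon entry, measured by edit distance. The sketch below assumes this particular strategy, a made-up four-word lexicon, and an arbitrary `max_distance` cut-off; it is not a description of how Tesseract or any other engine uses its dictionary.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def correct_word(word: str, lexicon: Iterable[str], max_distance: int = 2) -> str:
    """Replace `word` with the closest lexicon entry, if one is close enough."""
    entries = list(lexicon)
    if not entries or word in entries:
        return word
    best = min(entries, key=lambda entry: edit_distance(word, entry))
    # Words far from every entry (e.g. proper nouns) are left untouched.
    return best if edit_distance(word, best) <= max_distance else word


LEXICON = ["invoice", "total", "amount", "payment"]  # hypothetical domain lexicon
raw_output = ["inv0ice", "tota1", "Goldberg"]
print([correct_word(w, LEXICON) for w in raw_output])
```

The distance cut-off is what keeps a proper noun such as "Goldberg" from being forced onto an unrelated lexicon word, which is the failure mode the text warns about.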
Deep Neural Networks for OCR
Deep Neural Networks (DNN) have developed rapidly and gained popularity as a learning mechanism for recognition over the past decade. This popularity is owed primarily to the high accuracy DNNs have achieved in locating text regions and deciphering the characters at the same time.
Deep Neural Networks, of which Convolutional Neural Networks (CNN) are the type most commonly applied to images, are essentially multi-layered learning and feature processing neural networks. Each neuron (node) in each layer is fed with information passed from the nodes connected to it. A processing mechanism (transfer function) then determines how much of the processed information will be passed on to the nodes connected to the present one. The architecture of the network, that is, the way neurons and layers are connected, plays a primary role in determining the network’s ability to produce meaningful results.
An advantage of DNNs is that the architecture can be made heterogeneous. Similar to the human visual system, different neurons and processing layers are more sensitive to different features of objects. Edges of objects are seen more sharply by one set of neurons, while others are more sensitive to colour gradients. Researchers exploit this heterogeneity to construct sophisticated architectures in which neurons and layers are connected so that data propagates back and forth between them before a result is produced.
Many researchers and industrial practitioners have demonstrated the potential of DNNs in OCR. Several architectures for both text detection and character recognition have been implemented and have shown excellent accuracy in real time. A GPU implementation of a DNN has been shown to reduce processing time and improve accuracy in an architecture that was trained and tested against a deformed character dataset. Text detection and OCR in natural scene images and in real-time video captured by cellphones have also been shown to achieve very high accuracy in real time.
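The idea that some neurons respond to edges, and that a transfer function decides how much of a response is passed on, can be sketched with a single convolutional filter followed by a ReLU. This is only an illustration: the kernel weights are hand-picked here, whereas a trained network learns them, and ReLU is just one possible transfer function.

```python
from typing import List

Matrix = List[List[float]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image with no padding ("valid" mode).

    As in most CNN libraries, the kernel is not flipped, so strictly
    speaking this is a cross-correlation.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


def relu(feature_map: Matrix) -> Matrix:
    """Transfer function: pass positive responses on, suppress the rest."""
    return [[max(0.0, value) for value in row] for row in feature_map]


# Hand-picked kernel that responds to dark-to-bright vertical edges.
VERTICAL_EDGE = [[-1.0, 1.0],
                 [-1.0, 1.0]]

# 3x4 patch: dark on the left (0.0), bright on the right (1.0).
patch = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]

print(relu(convolve2d(patch, VERTICAL_EDGE)))
```

The output is non-zero only in the middle column, where the filter window straddles the dark-to-bright boundary; flat regions produce no response. A bright-to-dark edge would give a negative response, which the ReLU suppresses.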
Potential Benefits of OCR
- With layout-preserving OCR, the recognized document can look much like the original.
- OCR software can save considerable time and effort when creating, processing and repurposing documents.
- With state-of-the-art OCR, you can scan paper documents for further editing and sharing with colleagues and partners. You can extract quotes from books and magazines and use them in coursework and papers without retyping.
- With a digital camera and OCR, you can capture text outdoors from banners, posters and timetables and then reuse the captured information.
- In the same way, you can capture information from paper documents and books, for example if there is no scanner close at hand or you cannot use it. In addition, you can use OCR software to create searchable PDF archives.
- The entire process of converting data from an original paper document, image or PDF can take only a few seconds.
Applications of OCR
- Data entry for business documents, e.g. checks, passports, invoices, bank statements and receipts
- Automatic number plate recognition
- Automatic extraction of key information from insurance documents
- Extracting business card information into a contact list
- More quickly making textual versions of printed documents, e.g. book scanning
- Making electronic images of printed documents searchable, e.g. Google Books
- Converting handwriting in real time to control a computer (pen computing)
- Defeating CAPTCHA anti-bot systems, though these are specifically designed to prevent OCR. The purpose can also be to test the robustness of CAPTCHA anti-bot systems.
- Assistive technology for blind and visually impaired users