What is OCR?
“OCR” is an acronym for “optical character recognition.” It refers to the conversion, usually performed electronically, of images of printed, typed or handwritten text into machine-readable text data. The source can be a scanned copy or photograph of any document or other graphic image that includes text. OCR is commonly employed by governments and businesses to digitize printed texts – passports, bank statements, business cards, receipts, etc. – so the information can be electronically edited, searched, stored and otherwise used by computers.
Optical character recognition analyzes one character, or “glyph”, at a time. Related technologies are “optical word recognition” (OWR), which looks at complete words, and “intelligent character recognition” (ICR) and “intelligent word recognition” (IWR), which operate on cursive or handwritten script, one character or one word at a time, respectively.
OCR in localization
OCR is a key component of the localization process, particularly translation. Typically, printed documents such as marketing materials, instruction manuals, labels, warranties and legal statements are first scanned and then converted by OCR into text files that can be read by translation software or human translators. Because of the sheer volume of material to be translated and the many target languages required, automatic transcription is almost mandatory; OCR makes it unnecessary to hire language experts to transcribe documents or to enter source text by hand.
A brief history of OCR
In 1931, Emmanuel Goldberg presented his “Statistical Machine”, a device employing photoelectric cells and a pattern recognition algorithm to search metadata on microfilm records. His patent was later acquired by IBM.
In 1974, Ray Kurzweil worked to develop “omni-font OCR”, which can recognize text in almost any font. This technology was used to develop a reading machine for the blind that could read text aloud. It required two related inventions: the CCD flatbed scanner and text-to-speech synthesis. The finished product was unveiled in 1976. Kurzweil Computer Products released a commercial version of this OCR computer program in 1978.
An early Kurzweil customer was LexisNexis, which purchased the program in order to upload legal and news documents onto its online databases. In 1989, Kurzweil’s company was sold to Xerox, which wanted to monetize the conversion of paper documents into computer files. This new business was spun off as ScanSoft.
In 1999, OCR became available as an online service when ExperVision released WebOCR. Such services operate in a cloud-computing environment and work with smartphone applications for real-time foreign-language translation.
A research group at the Indian Institute of Science announced the development of a Print-To-Braille tool in 2014. This is an open-source interface that can be exploited by any OCR application to create Braille books from scanned text images.
Preparing documents for OCR
OCR software often performs several pre-processing operations to increase the chances of accurate results. These include the following (a minimal code sketch of the first few steps appears after the list):
- De-skewing: The original document may not have been aligned squarely when it was scanned, so the image is rotated a few degrees until the text runs truly horizontal and vertical.
- Despeckling: Dark spots or holes are removed and edges are smoothed.
- Binarization: Grayscale images are converted to “binary” black-and-white. Most commercial OCR algorithms require binary images.
- Line removal: Any lines or boxes that are not part of characters (glyphs) are removed.
- Layout analysis: Also called “zoning”, this step identifies captions, columns and paragraphs as discrete blocks. It is critical for accurate OCR of multi-column text layouts.
- Line and word detection: This process establishes base line shapes for characters and words.
- Script recognition: Within multilingual documents, all of the different scripts must be identified before the OCR software can be run.
- Character isolation: Sometimes called “segmentation”, this is critical for per-character OCR. Connected characters are separated, and any “broken” characters are reconnected.
- Normalization: For non-proportional (fixed-pitch) fonts, characters are simply aligned to a grid. Proportional fonts require more complex processing to ensure characters are spaced for error-free OCR.
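To make these steps concrete, here is a minimal sketch of de-skewing, despeckling and binarization using the OpenCV and NumPy libraries. The threshold and filter choices are illustrative assumptions, not the pipeline of any particular OCR product.

```python
import cv2
import numpy as np

def preprocess_for_ocr(path):
    """Illustrative pre-processing: binarize, despeckle and de-skew a scanned page."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Binarization: Otsu's method chooses a global black/white threshold automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Despeckling: a small median filter removes isolated dark spots and holes.
    binary = cv2.medianBlur(binary, 3)

    # De-skewing: estimate the dominant angle of the ink pixels and rotate
    # the page so that lines of text run horizontally.
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:          # OpenCV's angle convention varies by version;
        angle -= 90         # fold the result into a small correction.
    elif angle < -45:
        angle += 90
    h, w = binary.shape
    rotation = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, rotation, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
```

In practice, the later steps in the list (layout analysis, segmentation, normalization) are usually handled inside the OCR engine itself rather than in user code.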
The OCR process
Two general classes of OCR algorithms exist; the output of either may be a ranked list of possible character candidates.
“Matrix matching”, or “pattern recognition”, performs a pixel-by-pixel comparison of an image to a set of glyphs stored in a dictionary. This algorithm is most effective with typewritten text and can be unreliable when new fonts are encountered.
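As a toy illustration of the idea, the sketch below compares a binarized glyph bitmap pixel by pixel against a small set of stored templates and returns candidates ranked by similarity. The 5x5 patterns are made up, and a real engine would first normalize glyph size and position.

```python
import numpy as np

def rank_candidates(glyph, templates):
    """Rank stored template characters by the fraction of matching pixels."""
    scores = [(char, float(np.mean(glyph == pattern)))
              for char, pattern in templates.items()]
    # Best pixel-by-pixel match first.
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical 5x5 binary patterns standing in for a glyph dictionary.
templates = {
    "I": np.array([[0, 0, 1, 0, 0]] * 5),
    "L": np.array([[1, 0, 0, 0, 0]] * 4 + [[1, 1, 1, 1, 1]]),
}
unknown = np.array([[0, 0, 1, 0, 0]] * 5)
print(rank_candidates(unknown, templates))  # "I" ranks first
```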
“Feature extraction” breaks down a glyph into distinct features – loops, intersections, etc. – comparing it against an abstract topological model of different characters. This is the method most often used in modern OCR software.
OCR accuracy is significantly improved when the output is compared against a dictionary of allowable words, although this can create problems with proper nouns or other legitimate words not in the dictionary.
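A simple way to apply such a dictionary check, sketched below with Python's standard difflib module, is to snap each recognized word to the closest allowable word when it is sufficiently similar and otherwise leave it alone, which preserves proper nouns. The word list and similarity cutoff are illustrative assumptions.

```python
import difflib

ALLOWED_WORDS = ["optical", "character", "recognition", "new", "york"]  # illustrative

def dictionary_correct(tokens, cutoff=0.8):
    """Snap each OCR token to the closest dictionary word, if one is close enough."""
    corrected = []
    for token in tokens:
        match = difflib.get_close_matches(token.lower(), ALLOWED_WORDS, n=1, cutoff=cutoff)
        # Keep the original token (e.g. a proper noun) when nothing in the dictionary is close.
        corrected.append(match[0] if match else token)
    return corrected

print(dictionary_correct(["0ptical", "charaeter", "Kurzweil"]))
# -> ['optical', 'character', 'Kurzweil']
```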
Post processing
Basic OCR algorithms output a stream of unformatted characters, whereas more sophisticated applications are able to reconstruct the page layout of the source document. They may also be capable of performing “near-neighbor analysis”, an error-detection technique that tracks the frequency of word combinations and flags rare ones as possibly wrong, such as “Now York” instead of “New York”.
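A crude sketch of near-neighbor analysis: look up how often each adjacent word pair occurs in a reference corpus and flag pairs that are unusually rare. The bigram counts and threshold below are invented purely for illustration.

```python
from collections import Counter

# Hypothetical bigram counts gathered in advance from a large reference corpus.
BIGRAM_COUNTS = Counter({("new", "york"): 120_000, ("now", "york"): 2})

def flag_rare_pairs(words, threshold=10):
    """Flag adjacent word pairs that rarely co-occur, as likely OCR errors."""
    return [(left, right)
            for left, right in zip(words, words[1:])
            if BIGRAM_COUNTS[(left.lower(), right.lower())] < threshold]

print(flag_rare_pairs(["Now", "York"]))  # -> [('Now', 'York')]
print(flag_rare_pairs(["New", "York"]))  # -> []
```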
Grammar checkers are also commonly employed to improve accuracy by supplying context, determining for example whether a word is a noun, verb or adjective.
Commercial OCR software
Several excellent applications are available to perform sophisticated OCR of most common writing systems, including Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, Bengali and Tamil characters. A partial list, in chronological order of first release, includes:
- Tesseract was developed by Hewlett-Packard beginning in 1985. It was released as open source in 2005, and its development has been sponsored by Google since 2006; it is considered by industry experts to be among the most accurate open-source OCR applications available. It uses a two-pass method known as “adaptive recognition” to improve accuracy with unusual fonts, and output can be stored in ALTO format, a data standard maintained by the US Library of Congress. A minimal usage sketch appears after this list.
- LEADTOOLS, first published in 1990 by LEAD Technologies, is a family of software development toolkits (SDKs) that help programmers integrate OCR and related technologies into desktop, server or mobile applications.
- ABBYY FineReader was developed by the Russian company ABBYY and introduced in 1993. With over 40 million users worldwide, version 14 supports text recognition in 192 languages, with built-in spell checking for 48 of them. ABBYY licenses its technology to Xerox, Microsoft, Ricoh, Fujitsu, Panasonic, Samsung and others. A FineReader mobile app is also available.
- Cuneiform OpenOCR, developed by Russian software company Cognitive Technologies, was released in 1993 and adopted neural network-based technology in 1997 for greater accuracy. It has been incorporated into hardware and software products from many leading companies, including Corel, Brother, Canon, Xerox and OKI.
- OCRopus is a free OCR system developed in 2007 by the German Research Centre for Artificial Intelligence. Sponsored by Google, it was designed primarily for high-volume digitization of books, including Google Books.
- SDL Trados Studio, a suite of computer-assisted translation software tools including OCR, evolved from Translation Workbench by German company Trados. The current version, now available from SDL plc, is dictionary-based and supports more than 70 file types.
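For a sense of how such an engine is typically driven from code, the sketch below runs the open-source Tesseract engine mentioned above through the third-party pytesseract wrapper. It assumes the Tesseract binary and English language data are installed, and the image file name is a placeholder.

```python
import pytesseract
from PIL import Image

# Placeholder file name; any pre-processed page image will do.
page = Image.open("scanned_page.png")

# Plain text output; "lang" selects the trained language data (e.g. "eng", "deu").
text = pytesseract.image_to_string(page, lang="eng")
print(text)

# Tesseract can also report per-word bounding boxes and confidence scores.
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
print(list(zip(data["text"], data["conf"]))[:10])
```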