Generic Text Recognition using Long Short-Term Memory Networks

by Adnan Ul-Hasan

Thesis approved by the Department of Computer Science, University of Kaiserslautern, for the award of the Doctoral Degree Doctor of Engineering (Dr.-Ing.)
Date of PhD Defense: 11.01.2016
Dean of the Department: Prof. Dr. Klaus Schneider
Chairperson of the PhD Committee: Prof. Dr. Paul Lukowicz
Thesis Reviewers:
Prof. Dr. Andreas Dengel, DFKI Kaiserslautern
Associate Prof. Dr. Faisal Shafait, SEECS, NUST Pakistan
apl. Prof. Dr. Marcus Liwicki, University of Kaiserslautern
D 386
It always seems impossible until it is done.
Nelson Mandela
Abstract

The task of printed Optical Character Recognition (OCR) is considered a “solved” issue by many Pattern Recognition (PR) researchers. This notion, however, is only partially true and does not represent the whole picture. Although state-of-the-art OCR systems exist for many scripts, for example, Latin, Greek, Han (Chinese), and Kana (Japanese), there is still a need for exhaustive research on many other challenging modern scripts. Examples of such scripts are the cursive Nabataean family, which includes Arabic, Persian, and Urdu, and the Brahmic family of scripts, which includes Devanagari, Sanskrit, and its derivatives. These scripts present many challenging issues for OCR, for example, changes in the shape of a character within a word depending upon its location, kerning, and a huge number of ligatures. Moreover, OCR research for historical documents still requires much probing; therefore, efforts are required to develop robust OCR systems to preserve the literary heritage.

Likewise, there is a need to address the issue of OCR of multilingual documents. Plenty of multilingual documents exist in the current time of globalization, which has increased the influence of different languages on each other. There is an increase in the usage of foreign words and phrases in articles, newspapers, and books, which is generating a large body of multilingual literature every day. Another effect is seen in the products we use in our daily lives. From the packaging of imported food items to sophisticated electronics, the demand of international customers to access information about these products in their native language is ever increasing. The use of multilingual operational manuals, books, and dictionaries motivates the need for multilingual OCR systems for their digitization.

The aim of this thesis is to find answers to some of these challenges using contemporary machine learning methodologies, especially Recurrent Neural Networks (RNN).
Specifically, a recent architecture of these networks, referred to as Long Short-Term Memory (LSTM) networks, has been employed to OCR modern as well as historical documents. The excellent OCR results obtained on these documents encourage us to extend their application to the field of multilingual OCR. The LSTM networks are first evaluated on standard English datasets to benchmark their performance. They yield better recognition results than any other contemporary OCR technique without using sophisticated features or language modeling. Therefore, their application is further extended to more complex scripts, including Urdu Nastaleeq and Devanagari. For the Urdu Nastaleeq script, LSTM networks achieve
the best reported OCR results (2.55% Character Error Rate (CER)) on a publicly available dataset, while for the Devanagari script, a new freely available database has been introduced on which a CER of 9% is achieved. The LSTM-based methodology is further extended to the OCR of historical documents. In this regard, this thesis focuses on the Old German Fraktur script, the medieval Latin script of the 15th century, and the Polytonic Greek script. LSTM-based systems outperform the contemporary OCR systems on all of these scripts. For old documents, it is usually very hard to prepare a transcribed dataset for training a neural network in the supervised learning paradigm. A novel methodology has been proposed that combines segmentation-based and segmentation-free approaches to OCR scripts for which no transcribed training data is available. For German Fraktur and Polytonic Greek scripts, artificially generated data from existing text corpora yield highly promising results (CER of