NEPALI OCR USING HYBRID APPROACH OF RECOGNITION
By
NIRAJAN PANT Master of Technology in Information Technology, Kathmandu University, 2016
A Thesis Submitted to the Department of Computer Science and Engineering Kathmandu University
In partial fulfillment of the requirements for the degree of Master of Technology in Information Technology
July 2016
DECLARATION OF ORIGINALITY
Being a student, I understand that I have an ethical and moral obligation to ensure that the dissertation I have submitted to Kathmandu University is my own, original, and free of plagiarism. All sources are properly acknowledged, and exact words are quoted or paraphrased with appropriate references throughout the dissertation. Hence, I am fully satisfied that the work I am submitting to the Department of Computer Science and Engineering, Kathmandu University, is my own original research.
_______________ Nirajan Pant Candidate University Registration No: 015493-13
THESIS EVALUATION

This thesis, submitted by Nirajan Pant in partial fulfillment of the requirements for the degree of Master of Technology in Information Technology from Kathmandu University, has been read by the faculty Advisory Committee under whom the work has been done and is hereby approved.
____________________ Dr. Bal Krishna Bal (Supervisor) Assistant Professor Department of Computer Science and Engineering, Kathmandu University
_____________________ Suresh K. Regmi (External Examiner) Managing Director Professional Computer System (P) Ltd.
____________________ Dr. Manish Pokharel Head of Department Department of Computer Science and Engineering, Kathmandu University

This thesis is approved by the appointed advisory committee as having met all of the requirements of the School of Engineering at Kathmandu University.
________________________________ Prof. Dr. Bhupendra Bimal Chhetri Dean School of Engineering Kathmandu University
Date:
PERMISSION

Title: Nepali OCR Using Hybrid Approach of Recognition
Department: Computer Science and Engineering
Degree: Master of Technology in Information Technology
In presenting this thesis in partial fulfillment of the requirements for a graduate degree from Kathmandu University, I agree that the library of this University shall make it freely available for inspection. I further agree that permission for extensive copying for scholarly purposes may be granted by the supervisor who supervised my thesis work or, in his or her absence, by the Head of the Department. It is understood that any copying, publication, or other use of this thesis or parts thereof for financial gain shall not be allowed without my written permission. It is also understood that due recognition shall be given to me and to Kathmandu University in any scholarly use which may be made of any material in my thesis.
________________ Nirajan Pant Date:
ACKNOWLEDGEMENTS

I express my sincere gratitude to Dr. Bal Krishna Bal for supervising this thesis. I will always be indebted to him for his continued motivation, suggestions, and involvement, which have contributed significantly to its completion. I am thankful to Madan Puraskar Pustakalaya (MPP), Lalitpur, Nepal, which provided the Nepali text image data for this thesis work. Finally, I express my thanks to all members of the Department of Computer Science and Engineering, and to my friends and family members who helped me directly or indirectly toward the successful accomplishment of this thesis. This day would not have been possible without their continued support, motivation, and encouragement.
Nirajan Pant Master of Technology in Information Technology Kathmandu University
ABSTRACT

Nepali, an Indo-Aryan language written in the Devanagari script, is the most widely spoken language in Nepal, with more than 35 million speakers. It is also spoken in many areas of India, Bhutan, and Myanmar. The Optical Character Recognition (OCR) systems developed so far for the Nepali language have very poor recognition rates. The Devanagari script has special features, such as the 'dika' and the rules for joining vowel modifiers, which distinguish it from the Latin script, where every character in a word is written separately. One of the major reasons for the poor recognition rate is errors in character segmentation. The presence of conjunct, compound, and touching characters in scanned documents complicates the segmentation process, creating major problems in designing an effective character segmentation technique. The aim of this work is thus to reduce the scope of the segmentation task so that segmentation errors are minimized. In this work, I propose a hybrid OCR system for printed Nepali text using the Random Forest (RF) algorithm. It incorporates two different OCR techniques: the holistic approach and the character-level recognition approach. The system first tries to recognize a word as a whole, and if it is not confident about the word, character-level recognition is performed. Histogram of Oriented Gradients (HOG) descriptors are used to define the feature vector of a word or character. Recognition rates of 78.87% and 94.80% are achieved for the character-level recognition approach and the hybrid approach, respectively.

Keywords: OCR, Devanagari Script, Pre-processing, Segmentation, HOG Feature, Feature Descriptor, Classification, Random Forest (RF)
Contents

ACKNOWLEDGEMENTS .......... IV
ABSTRACT .......... V
List of Figures .......... VIII
List of Tables .......... IX
List of Abbreviations .......... X
CHAPTER I INTRODUCTION .......... 1
1.1 Optical Character Recognition .......... 1
1.1.1 General OCR Architecture .......... 2
1.1.2 Uses and Current Limitations of OCR .......... 5
1.2 Devanagari Script .......... 6
1.3 Problem Definition .......... 10
1.4 Motivation .......... 11
1.5 Research Questions .......... 12
1.6 Objectives .......... 12
1.7 Organization of Document .......... 13
CHAPTER II LITERATURE REVIEW .......... 14
2.1 Different Models of Character Segmentation in OCR Systems .......... 14
2.1.1 Dissection Techniques .......... 15
2.1.2 Recognition Driven Segmentation .......... 16
2.1.3 Holistic Technique .......... 17
2.2 Segmentation Challenges in Devanagari OCR .......... 17
2.2.1 Over Segmentation of Basic Characters .......... 18
2.2.2 Handling Vowel Modifiers and Diacritics .......... 18
2.2.3 Handling Compound Characters and Ligatures .......... 19
2.3 Related Work .......... 20
2.3.1 Segmentation .......... 20
2.3.2 Recognition .......... 24
2.4 OCR Tools Developed for Devanagari .......... 26
CHAPTER III METHODOLOGY .......... 30
3.1 Training .......... 31
3.1.1 Dataset Generation .......... 31
3.1.2 Feature Extraction .......... 33
3.2 Recognition .......... 33
3.2.1 Line and Word Segmentation .......... 34
3.2.2 Character Segmentation .......... 35
3.2.3 Classifier Tool .......... 36
3.2.4 Confidence and Threshold .......... 40
CHAPTER IV RESULTS AND DISCUSSION .......... 42
4.1 Experimental Setup .......... 42
4.2 Segmentation Results .......... 42
4.3 Recognition Results .......... 43
4.4 Computational Cost .......... 44
CHAPTER V CONCLUSION AND FUTURE WORK .......... 48
References .......... 50
APPENDIX I Snapshots .......... A
APPENDIX II Word Recognition Data Sample .......... E
List of Figures

Figure 1 General OCR Architecture .......... 2
Figure 2 Structure of Nepali Text Word .......... 8
Figure 3 Over-segmentation Example (Letters ण, श, ग) .......... 18
Figure 4 Segmentation using Projection Profile Technique .......... 18
Figure 5 Proposed Nepali OCR Model .......... 30
Figure 6 Training Dataset Generation .......... 32
Figure 7 Feature Extraction .......... 33
Figure 8 Nepali Text Words as Blobs .......... 34
Figure 9 Snapshot of Character Segmentation .......... 34
Figure 10 Learning Curve: Word Classifier 1 .......... 38
Figure 11 Learning Curve: Word Classifier 2 .......... 38
Figure 12 Learning Curve: Word Classifier 3 .......... 39
Figure 13 Learning Curve: Character Classifier .......... 40
Figure 14 Recognition Results .......... 44
List of Tables

Table 1 Vowels and Corresponding Modifiers .......... 8
Table 2 Diacritics and Special Symbols .......... 8
Table 3 Consonants and their Half-forms .......... 9
Table 4 Letter Variants .......... 9
Table 5 Formation of Compound Characters .......... 9
Table 6 Existing Text Segmentation Approaches for Devanagari OCR .......... 23
Table 7 Feature Extraction and Classifiers in Devanagari OCR .......... 25
Table 8 Word Classifier Training .......... 37
Table 9 Character Classifier Training .......... 39
Table 10 Experimental Environment .......... 42
Table 11 Character Segmentation Results .......... 43
Table 12 Recognition Results .......... 43
List of Abbreviations

ASCII – American Standard Code for Information Interchange
BAG – Block Adjacency Graph
C-DAC – Centre for Development of Advanced Computing
DOCR – Devanagari Optical Character Recognition
DSP – Digital Signal Processing
GHIC – Generalized Hausdorff Image Comparison
GSC – Gradient, Structural and Concavity
GUI – Graphical User Interface
HMM – Hidden Markov Model
HOG – Histogram of Oriented Gradients
HPP – Horizontal Projection Profile
HTK – Hidden Markov Model Toolkit
IPA – Integrity, Purposefulness and Adaptability
ISCII – Indian Script Code for Information Interchange
MPP – Madan Puraskar Pustakalaya
OCR – Optical Character Recognition
PDF – Portable Document Format
PP – Projection Profile
RF – Random Forest
SFSA – Stochastic Finite Automata
VPP – Vertical Projection Profile
CHAPTER I INTRODUCTION

This thesis is about improving the performance of Nepali OCR by properly handling the segmentation problems prevalent in the Nepali language. The assumption made is: "The performance of Nepali OCR can be improved by using a hybrid recognition approach." Based on this assumption, a Nepali-language-specific OCR model has been developed, and the model is evaluated experimentally.
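The hybrid recognition assumption stated above can be sketched as a simple decision rule: accept the holistic word result only when its confidence clears a threshold, otherwise fall back to character-level recognition. Everything in this sketch (the toy classifiers, the 0.85 threshold, all names) is illustrative; the actual system described later uses Random Forest classifiers over HOG features.

```python
def hybrid_recognize(word_image, word_clf, char_clf, segment_chars, threshold=0.85):
    """Hybrid recognition: try the whole word first, fall back to characters."""
    label, confidence = word_clf(word_image)
    if confidence >= threshold:
        return label                      # confident holistic result
    # Not confident: segment the word and recognize character by character.
    return "".join(char_clf(piece)[0] for piece in segment_chars(word_image))

# Toy stand-ins so the sketch runs: a word classifier confident only about
# one known word image, and a two-character fallback path.
def word_clf(img):
    return ("नेपाल", 0.95) if img == "img_nepal" else ("?", 0.10)

def char_clf(img):
    return {"c1": ("क", 0.90), "c2": ("म", 0.80)}[img]

def segment_chars(img):
    return ["c1", "c2"]

print(hybrid_recognize("img_nepal", word_clf, char_clf, segment_chars))  # → नेपाल (holistic path)
print(hybrid_recognize("img_other", word_clf, char_clf, segment_chars))  # → कम (character fallback)
```

The threshold trades off the two paths: a high value sends more words to the slower but finer-grained character-level recognizer.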
This chapter discusses the concepts of OCR and its general architecture, the Devanagari script for the Nepali language from the point of view of OCR, and the uses and limitations of OCR. It also provides a basic introduction to the thesis, covering the problem definition, motivation, research questions and objectives, and an overview of the terms used throughout.
1.1 Optical Character Recognition

OCR is a field of computer science concerned with converting text in images of typewritten, printed, or handwritten documents into computer-readable text. OCR enables the conversion of text in image data into textual data and facilitates editing, searching, and republishing without retyping the whole document. Any written or printed document, if it is to be replicated digitally, needs to be photocopied or scanned. Such a replicated document cannot be altered in terms of the spellings, words, font style, or font size that it contains, and typing an entire document in order to replicate it is extremely time consuming. An OCR system overcomes these issues. Documents containing character images can be scanned, and the recognition engine of the OCR system then interprets the images and turns printed or handwritten characters into machine-readable characters (e.g., ASCII or Unicode). OCR therefore allows users to quickly automate data capture from image documents, eliminates keystrokes to reduce typing costs, and still maintains a high level of accuracy in text-processing applications.
1.1.1 General OCR Architecture
When an OCR system recognizes text, the program first analyzes the structure of the document image. It divides the page into elements such as blocks of text, tables, and images. The lines are divided into words and then into characters. Once the characters have been singled out, the program compares them with a set of pattern images. The process of character recognition consists of a series of stages, with each stage passing its results on to the next in a pipeline fashion. There is no feedback loop that would permit an earlier stage to make use of knowledge gained at a later point in the process (Casey & Lecolinet, 1996). The recognition process can be divided into three major steps: pre-processing, recognition (feature extraction), and post-processing (Optical Character Recognition, 2015) (OCR Processing Steps [ABBYY Developer Portal], n.d.).

Figure 1 General OCR Architecture

Pre-processing
OCR software loads the image and performs pre-processing to increase the recognition accuracy. Most OCRs expect some pre-defined format of input image, such as the font-size range, foreground, background, image format, and color format. The pre-processing steps often performed in OCR are: i) binarization, ii) morphological operations, and iii) segmentation (Hansen, 2002). Binarization is the process of converting an image to a bi-tonal image; most OCRs work only on bi-tonal images. Morphological operations are used in pre- or post-processing (filtering, thinning, and pruning) and may be applied to degraded documents to increase the performance of OCR. Different actions performed during pre-processing are:

- De-skewing
- Binarization
- Page layout analysis
- Detection of text lines and words
- Character segmentation: for per-character OCR, multiple characters that are connected due to image artifacts must be separated, and single characters that are broken into multiple pieces must be connected. Usually, in every OCR system, recognition is performed at the character level, so segmentation is a basic and important phase of recognition; effective segmentation at the character level yields better recognition accuracy.
- Normalization
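As an illustration of the binarization and text-line detection steps listed above, the following self-contained sketch thresholds a synthetic page with Otsu's method and finds line boundaries from the horizontal projection profile. The synthetic image and all function names are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def otsu_threshold(gray):
    """Find the threshold that maximizes between-class variance (Otsu's method)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = gray.size
    cum_count = np.cumsum(hist)
    cum_sum = np.cumsum(hist * np.arange(256))
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0 = cum_count[t - 1]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_sum[t - 1] / w0
        mu1 = (cum_sum[-1] - cum_sum[t - 1]) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def segment_lines(binary):
    """Split a binarized page into text lines via the horizontal projection profile."""
    profile = binary.sum(axis=1)          # ink pixels per row
    lines, start = [], None
    for i, has_ink in enumerate(profile > 0):
        if has_ink and start is None:
            start = i
        elif not has_ink and start is not None:
            lines.append((start, i))
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines

# Synthetic "page": two dark text bands on a light background.
page = np.full((40, 60), 255, dtype=np.uint8)
page[5:12, 10:50] = 20
page[20:28, 10:50] = 20
t = otsu_threshold(page)
binary = (page < t).astype(np.uint8)      # 1 = ink, 0 = background
print(segment_lines(binary))              # → [(5, 12), (20, 28)]
```

For Devanagari, the same row profile also locates the dika as the row band with the highest ink density, which is why projection profiles recur in the segmentation literature reviewed in Chapter II.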
Character recognition
The recognition algorithm is the brain of the OCR system. After successful pre-processing of the input document image, the OCR algorithm can start recognizing characters and translating them into character codes (ASCII/Unicode). Creating a one-hundred-percent-accurate algorithm is probably impossible where a lot of noise and different font styles are present. In general, character recognition consists of the following procedures:

- Learning: the recognition algorithm relies on a set of learned characters and their properties. It compares the characters in the scanned image file to the characters in this learned set.
- Extraction and isolation of individual characters from an image
- Determination of the properties of the extracted characters
- Comparison of the properties of the learned and extracted characters
There are two basic types of core OCR algorithms: matrix matching and feature extraction (Optical Character Recognition, 2015).

Matrix matching, also known as "pattern matching" or "image correlation", involves comparing an image to a stored glyph on a pixel-by-pixel basis. This relies on the input glyph being correctly isolated from the rest of the image, and on the stored glyph being in a similar font and at the same scale. This technique works best with typewritten text and does not work well when new fonts are encountered.

Feature extraction decomposes glyphs into "features" like lines, closed loops, line direction, and line intersections. These are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes. General techniques of feature detection in computer vision are applicable to this type of OCR, which is commonly seen in most modern OCR software. Machine-learning algorithms such as neural networks and nearest-neighbor classifiers are used to compare image features with stored glyph features and choose the nearest match. Most modern omnifont OCR programs (ones that can recognize printed text in any font) work by feature detection rather than pattern matching.

Post-processing
This step can help to improve recognition quality; sometimes the OCR outputs a wrong character code, in which case dictionary support can help to make the decision. OCR accuracy can also be increased if the output is constrained by a lexicon, a list of words that are allowed to occur in a document. With dictionary support, the program ensures even more accurate analysis and recognition of documents and simplifies further verification of recognition results.
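The matrix-matching idea above can be reduced to a nearest-template search over pixels. In this minimal sketch, the 3x3 "glyphs", the template names, and the mismatch-count metric are all illustrative assumptions, not part of any real OCR engine.

```python
import numpy as np

# Hypothetical 3x3 glyph templates for two classes; a real system would store
# one or more templates per character at a normalized size.
templates = {
    "vertical_bar": np.array([[0, 1, 0],
                              [0, 1, 0],
                              [0, 1, 0]]),
    "horizontal_bar": np.array([[0, 0, 0],
                                [1, 1, 1],
                                [0, 0, 0]]),
}

def classify(glyph):
    """Nearest-template match: count mismatching pixels, lower is better."""
    return min(templates, key=lambda name: int(np.sum(templates[name] != glyph)))

noisy = np.array([[0, 1, 0],
                  [0, 1, 0],
                  [0, 1, 1]])    # a vertical bar with one noise pixel
print(classify(noisy))           # → vertical_bar
```

The sketch also makes matrix matching's weakness visible: any change of font, scale, or alignment shifts pixels and inflates the mismatch count, which is exactly why feature-based methods dominate modern OCR.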
The output stream may be a plain text stream or file of characters, but more sophisticated OCR systems can preserve the original layout of the page and produce, for example, an annotated PDF that includes both the original image of the page and a searchable textual representation.
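The lexicon-constrained post-processing described above amounts to snapping each output word to its nearest dictionary entry. A minimal sketch, assuming a hypothetical three-word lexicon and a plain Levenshtein distance (not the thesis's actual post-processor):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def lexicon_correct(word, lexicon, max_dist=2):
    """Replace an OCR output word with the closest lexicon entry, if close enough."""
    best = min(lexicon, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word

# Hypothetical tiny lexicon; a real system would load a full word list.
lexicon = ["नेपाल", "नेपाली", "काठमाडौं"]
print(lexicon_correct("नेपल", lexicon))   # a one-character OCR error is repaired
```

The `max_dist` cutoff keeps genuinely out-of-lexicon words (names, numbers) from being forced onto the nearest dictionary entry.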
The exact mechanisms that allow humans to recognize objects are yet to be fully understood, but three basic principles are well known to scientists: integrity, purposefulness, and adaptability (IPA). The most advanced optical character recognition systems focus on replicating such natural, "animal-like" recognition, and these three principles lie at their heart. The principle of integrity says that the observed object must always be considered as a "whole" consisting of many interrelated parts. The principle of purposefulness supposes that any interpretation of data must always serve some purpose. And the principle of adaptability means that the program must be capable of self-learning. These principles endow a program with maximum flexibility and intelligence, bringing it as close as possible to human recognition (What is OCR?, 2015).
1.1.2 Uses and Current Limitations of OCR
OCR is widely used to recognize and search text from electronic documents or to publish the text on a website (Singh, Bacchuwar, & Bhasin, 2012). It has enabled scanned documents to become more than just image files, turning them into fully searchable documents with text content that is recognized by computers. OCR is a vast field with a number of varied applications, such as invoice imaging, the legal industry, banking, and the health care industry. It is widely used in digital libraries for searching scanned books and magazines (e.g., Google Books), in data entry such as bill payment and passport processing, and in text-to-speech synthesis, machine translation, text mining, check entry, and automatic number plate recognition. Some of its applications are listed below:

- Institutional repositories and digital libraries
- Banking: form processing, check collection, etc.
- Healthcare: general forms, insurance forms, and prescription document processing
- Automatic number plate recognition
- Handwriting recognition
OCR has simplified the data collection and analysis process. With its continuous advancement, more and more applications powered by OCR are being developed in various fields, including finance, education, and government agencies. The advantages of OCR can be summarized as:

- Cheaper than paying someone to manually enter large amounts of text
- Much faster than manual entry of large amounts of text
- The latest software can recreate tables and the original layout
Although an OCR system has many advantages, it also has limitations. Some of them are outlined below:

- Limited documents: OCR does not perform well with documents containing both images and text, with tables, or with noise and dirt.
- Accuracy: The accuracy depends upon the quality and type of document, including the font used. Errors that occur during OCR include misreading letters, skipping over letters that are unreadable, or mixing together text from adjacent columns or image captions.
- Additional work: OCR is not error-proof; it also makes mistakes. A person has to manually compare the original image document with the recognized text and correct the errors.
- Not worthwhile for small amounts of text: OCR involves a long process of document scanning, recognition, and verification of the output text, so it may not be feasible or worthwhile for small amounts of text.
1.2 Devanagari Script

The Devanagari script is derived from the ancient Brahmi script through many modifications. Many languages, including Sanskrit, Nepali, Hindi, Marathi, Bihari, Bhojpuri, Maithili, and Newari, are written in Devanagari, and over 500 million people use it. Devanagari is a syllabic-alphabetic script with a set of basic symbols: consonants, half-consonants, vowels, vowel modifiers, digits, and special diacritic marks (Kompalli, Setlur, & Govindaraju, 2006) (Kompalli, Setlur, & Govindaraju, 2009). The script has its own specified composition rules for combining vowels, consonants, and modifiers. Modifiers are attached to the top, bottom, left, or right side of other characters. All characters of a word are stuck together by a horizontal line, called the dika, which runs along the top of the core characters (Khedekar, Ramanaprasad, Setlur, & Govindaraju, 2003). A Devanagari character may be formed by combining one or more letters; such characters are referred to as composite characters or conjuncts. For example, the half-consonant ka (क्) and the consonant ya (य) combine to produce the conjunct character kya (क्य). Consonant-modifier and conjunct-modifier characters are produced by combining consonants and conjuncts with vowel modifiers (e.g., क + ा → का, क्य + ा → क्या). This combination of letters contrasts with Latin, in which the number of characters is fixed. A horizontal header line (dika) runs across the top of the characters in a word, and the characters span three distinct zones (Figure 2): an ascender zone above the dika, the core zone just below the dika, and a descender zone below the baseline of the core zone. Symbols written above or below the core will be referred to as ascender or descender components, respectively. A composite character formed by one or more half-consonants followed by a consonant and a vowel modifier will be referred to as a conjunct character or conjunct (Kompalli, Setlur, & Govindaraju, 2006). Nepali, originally known as Khas Kurā, is an Indo-Aryan language with around 17 million speakers in Nepal, India, Bhutan, and Burma. Nepali is written in Devanagari, which developed from the Brahmi script in the 11th century AD; Nepali has been written in it since the 12th century AD¹. In Nepali, there are 13 vowels (swaravarna), 36 consonants (vyanjanvarna) (33 pure consonants and 3 composite consonants), 10 numerals, and half-letters.
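The composition rules above are mirrored directly in Unicode, where a conjunct is encoded as consonant + virama (halant) + consonant, and a vowel modifier is a separate combining code point; whether the sequence renders as a true conjunct glyph depends on the font and shaping engine. A small sketch:

```python
# Devanagari encodes conjuncts as consonant + virama (halant) + consonant;
# vowel modifiers (matras) are separate combining code points.
KA, YA = "\u0915", "\u092F"    # क, य
VIRAMA = "\u094D"              # the halant
AA_MATRA = "\u093E"            # vowel modifier aa (ा)

kya = KA + VIRAMA + YA         # half ka + ya: the conjunct kya (क्य)
kyaa = kya + AA_MATRA          # conjunct + vowel modifier (क्या)

print(kya, len(kya))           # one visible conjunct, three code points
```

This is also why segmentation is harder than in Latin text: one visible glyph cluster can correspond to several code points, and several consonants can fuse into a single connected shape under the dika.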
When vowels occur together with consonants, they are written above, below, before, or after the consonant they belong to, using special diacritical marks; vowels written in this way are known as modifiers. In addition, consonants often occur together in clusters, called conjunct consonants. Altogether, there are more than 500 different characters (K.C. & Nattee, 2007). Sentences end with a 'purnaviram' (।). The script is written and read from left to right in horizontal lines. Many languages in India use different variants of this script. The Nepali language uses a subset of the Devanagari character set for written purposes. Some characters of the Devanagari script are language specific, but the basic vowels, consonants, and modifiers are the same in all languages. For example, the 'nukta' is used in Hindi but not in Nepali; similarly, the letter 'LLA' is not used in Nepali.

Figure 2 Structure of Nepali Text Word (the word सम्विि shown spanning three zones, upper, middle/core, and lower, delimited by the head line and base line, with the dika (header line), an ascender, a descender, and a compound character labeled)

¹ http://www.omniglot.com/writing/nepali.htm
Vowels and corresponding modifiers:

Table 1 Vowels and Corresponding Modifiers

Vowel:     अ       आ   इ   ई   उ   ऊ   ऋ   ए   ऐ   ओ   औ   अं   अः
Modifier:  (none)  ा   ि   ी   ु   ू   ृ   े   ै   ो   ौ   ं    ः
Diacritics, consonant-modifiers, and special symbols: In some situations, a consonant following (or preceding) another consonant is represented by a modifier called a consonant modifier. In this case, the constituent consonants take modified shapes, such as the 'reph'.

Table 2 Diacritics and Special Symbols

Diacritics and special symbols:  ँ   ं   ः   ऽ
Forms of the consonant modifier ra (र):  र् (reph, written above the line)   ्र (rakar, written below the line)
Consonants and their Half Forms: Along with a set of vowel modifiers there is a set of pure-consonants (also called half-letters) which when combined with other consonants yield conjuncts (Pal & Chaudhuri, 2004).
Table 3 Consonants and their Half-forms

क → क्   ख → ख्   ग → ग्   घ → घ्   ङ → ङ्
च → च्   छ → छ्   ज → ज्   झ → झ्   ञ → ञ्
ट → ट्   ठ → ठ्   ड → ड्   ढ → ढ्   ण → ण्
त → त्   थ → थ्   द → द्   ध → ध्   न → न्
प → प्   फ → फ्   ब → ब्   भ → भ्   म → म्
य → य्   र → र्   ल → ल्   व → व्
श → श्   ष → ष्   स → स्   ह → ह्
Numerals:

० १ २ ३ ४ ५ ६ ७ ८ ९

Letter variants: In written or printed Nepali documents, many letter variations are found because fonts follow different writing styles. Some letters have variants that differ between the old and new writing styles. The old variants of some letters (e.g., the letters अ and ण) are not used these days, but old documents frequently contain these forms. A set of letter variants is shown in Table 4.

Table 4 Letter Variants (variant forms of the numeral five and the letters 'A', 'Jha', 'ण', 'La', 'Sha', and 'Ksha')
There are many conjuncts which are written as a single character (e.g., द्द, द्म, हृ); that is, two or more consonants can sometimes combine to form a new complex shape. Sometimes the shape of the compound character is so complex that it becomes difficult to identify the constituent characters. Despite the existence of so many compound characters, their frequency of appearance in any text page is much lower than that of basic characters.

Table 5 Formation of Compound Characters
ट + ट → ट्ट     ट + ठ → ट्ठ     द + द → द्द     द + म → द्म
श + र → श्र     त + र → त्र     द + ध → द्ध     क + ष → क्ष
In writing Nepali, many consonants come together in clusters to form typographic ligatures, which are frequently found in Nepali text. The number of ligatures employed may be language dependent; many more ligatures are conventionally used in writing Sanskrit than in written Nepali (Typographic ligature, 2016). From the 33 consonants, hundreds of ligatures can be formed in total (the number of composite character classes exceeds 5,000), most of which are infrequent. All the consonant characters, vowel characters, compound characters, and modifiers are connected by the 'dika', so the characters look as if they are hanging from a rope. This is a special feature of the Devanagari script that does not appear in the Latin script. There are also many shapes that look similar, e.g., घ and ध, म and भ, ब and व. These characteristics of the Devanagari script pose challenges for DOCR (Devanagari Optical Character Recognition). Because the Devanagari script differs from the Latin script in these respects, techniques from Latin OCR may not work well for DOCR, and finding a technique suitable for segmenting text images in the Devanagari script is itself challenging.
1.3 Problem Definition An OCR system for Nepali should recognize different types of documents, documents composed in varying fonts, and, above all, recognize them accurately. Several OCR projects have been released for Nepali as well as for Hindi and Sanskrit, but their performance has not been satisfactory. The problem lies in inadequate handling of conjuncts and compound characters, and this issue must be dealt with seriously in order to develop a reliable, high-performance OCR system for Nepali. In this research work, a hybrid recognition approach, together with compound character/conjunct (ligature) recognition, is used to improve the overall performance of Nepali OCR.
1.4 Motivation Digital documents have become a part of everyday life. Anyone can take advantage of scanning their documents to make them easy to reference, to organize files, and to protect and store documents. There is no limitation on the types of documents that can be digitized, and this increasing interest forces us to deal with any type of document someone may wish to work with, including images. Plain text has a number of advantages over scanned copies of text: a text document can be searched, edited, reformatted, and stored more compactly, which is not possible in the case of images. One cannot edit, search or reformat text that only appears visually in images; to a computer, images are nothing more than a collection of pixels. Extracting text data from images is therefore important for reading, editing and analyzing the text content they contain. Computers cannot recognize text data in images directly, so the design of computer programs, called OCR, that can recognize text in digital document images is important. OCR technology for some scripts like Roman, Chinese, Japanese, Korean and Arabic is fairly mature, and commercial OCR systems are available with accuracy higher than 98%, including OmniPage Pro from Nuance and FineReader from ABBYY for Roman and Cyrillic scripts, and Nuance for Asian languages. Despite ongoing research on non-Latin script recognition, most commercial OCR systems focus on Latin-based languages. OCR for Indian scripts, as well as for many low-density languages, is still in the research and development stage; the resulting systems are often costly and do little to advance the field (Agrawal, Ma, & Doermann, 2009). In the case of Nepali OCR, the segmentation process cannot achieve full accuracy because of the dika, touching characters, conjuncts/compound characters, modifiers, and variation in typefaces. These problems directly affect successful recognition and thus decrease performance.
Due to the presence of language-specific constructs, the Devanagari script requires
different approaches to segmentation. Thus working on a better approach to segmentation and improving performance is important.
1.5 Research Questions Studies show that developing an OCR system for the Devanagari script is more challenging than for the Latin script due to its writing arrangement. The techniques applied for Latin OCR may or may not apply to the Devanagari script. The main challenges in segmentation for Devanagari OCR are: i) handling modifiers and diacritics, and ii) handling compound characters and ligatures (connected components). Dealing with these two main challenges is necessary to achieve better accuracy. One major difficulty in improving the performance of an OCR system lies in the recognition of compound characters forming complex shapes. The research questions formulated are:
- What are the challenges of Devanagari (Nepali) OCR?
- What are the current segmentation and recognition techniques for Devanagari (Nepali) OCR?
- How can the accuracy of Devanagari (Nepali) OCR be improved using a combined approach of holistic methods and character-level dissection?
1.6 Objectives This research focuses on improving the performance of Nepali OCR. It will be helpful for understanding the segmentation approaches used for Devanagari and Bangla OCR, the underlying challenges, and the improvements required. A better approach for designing an OCR system for Nepali is the expected outcome of this research. Moreover, the improved techniques will be implemented to develop a prototype OCR system for Nepali. The objectives of this study are as follows:
- To implement a hybrid approach of recognition that uses both the holistic approach and the dissection method of recognition
- To determine and evaluate the hybrid approach for improved performance of Nepali OCR
1.7 Organization of Document This document is organized into 5 chapters. Chapter 1 gives a basic introduction to the thesis, covering the problem definition, motivation, research questions and objectives, and a basic overview of terms and terminologies. Chapter 2 discusses the different segmentation and recognition methods proposed for Devanagari optical character recognition; it also gives information about various OCR tools developed so far for Devanagari. Chapter 3 discusses the methods applied to conduct this research work and experiment, including the different components and phases of the applied method. In chapter 4, segmentation and recognition results are presented, along with the computation cost of the character-level recognition technique and the holistic approach. Finally, chapter 5 concludes the research; the contributions and possible future improvements are discussed in this chapter.
In conclusion, this chapter discussed the basic concepts of optical character recognition, a general architecture of OCR, and the Devanagari script for the Nepali language from the point of view of OCR. The motivation of the research, the research questions, and the objectives and goals of this research were also discussed. The next chapter will discuss the different segmentation methods and recognition approaches proposed in the literature, as well as various OCR tools developed so far for Devanagari.
CHAPTER II LITERATURE REVIEW Optical character recognition is a sequence of multiple processes: segmentation, feature extraction, and classification. Different models and techniques have been proposed for character segmentation. These techniques can be categorized into three major strategies: dissection techniques, recognition-driven techniques, and holistic methods. The use and selection of these techniques highly depends on the constructs of the script and language. Various feature extraction and classification techniques have been proposed by different researchers. Feature extraction algorithms may rely on the morphology of characters for better classification. Classification is one of the major steps in OCR, and the design of a good classifier is also a challenging task. Mostly, supervised learning is used for the classification of characters.
2.1 Different Models of Character Segmentation in OCR Systems Character segmentation is an operation that seeks to decompose an image of a sequence of characters into sub-images of individual symbols. The difficulty of performing accurate segmentation is determined by the nature and the quality of the material to be read. Segmentation is the initial step in a three-step procedure (Casey & Lecolinet, 1996). Given a starting point in a document image: 1) Find the next character image. 2) Extract distinguishing attributes of the character image. 3) Find the member of a given symbol set whose attributes best match those of the input, and output its identity. This sequence is repeated until no additional character images are found. A character is a pattern that resembles one of the symbols the system is designed to recognize, but to determine such a resemblance the pattern must be segmented from the document image. Casey & Lecolinet (1996) have classified the segmentation methods into
three pure strategies based on how segmentation and classification interact in the OCR process. The elemental strategies are: 1) The classical approach, in which segments are identified based on "character-like" properties. This process of cutting up the image into meaningful components is given a special name, “dissection". 2) Recognition-based segmentation, in which the system searches the image for components that match classes in its alphabet. 3) Holistic methods, in which the system seeks to recognize words as a whole, thus, avoiding the need to segment into characters.
2.1.1 Dissection Techniques
Dissection means the decomposition of an image into a sequence of sub-images using general properties of valid characters, such as height, width, separation from neighboring components, and disposition along a baseline. Dissection is an intelligent process in that an analysis of the image is carried out; however, classification into symbols is not involved at this point. The segmentation stage consists of three steps: 1) Detection of the start of a character. 2) A decision to begin testing for the end of a character. 3) Detection of the end of the character. The analysis of the projection of a line of print has been used as a basis for segmentation of non-cursive writing. When printed characters touch or overlap horizontally, the projection often contains a minimum at the proper segmentation column (Casey & Lecolinet, 1996). A peak-to-valley function has been designed to improve this method: a minimum of the projection is located and the projection value noted. A vertical projection is less satisfactory for slanted characters. Analysis of projections or bounding boxes offers an efficient way to segment non-touching characters in hand- or machine-printed data. However, more detailed processing is necessary
in order to separate joined characters reliably. The intersection of two characters can give rise to special image features. Consequently, dissection methods have been developed to detect these features and to use them to split a character string image into sub-images. Only image components failing certain dimensional tests are subjected to detailed examination.
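As a rough illustration of projection-based dissection (a simplified sketch, not any particular published algorithm), a binarized line image can be cut at columns where the vertical projection drops to zero:

```python
# Minimal dissection sketch: segment a binarized line image by cutting
# at columns whose vertical projection (ink-pixel count per column)
# falls to zero. Touching characters would need the more detailed
# processing described in the text.
def vertical_projection(image):
    """image: list of rows, each a list of 0/1 ink pixels."""
    return [sum(col) for col in zip(*image)]

def cut_points(projection):
    """Column ranges with non-zero ink = candidate character segments."""
    segments, start = [], None
    for x, ink in enumerate(projection):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            segments.append((start, x))
            start = None
    if start is not None:
        segments.append((start, len(projection)))
    return segments

# Two 'characters' separated by a blank column:
line = [[1, 1, 0, 1],
        [1, 0, 0, 1]]
assert cut_points(vertical_projection(line)) == [(0, 2), (3, 4)]
```

A peak-to-valley refinement would cut at low (rather than strictly zero) projection minima, which helps when characters touch.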
2.1.2 Recognition Driven Segmentation
This approach also segments words into individual characters, which are usually letters, but it is quite different from the dissection-based approach. Here, no feature-based dissection algorithm is employed. Rather, the image is divided systematically into many overlapping pieces without regard to content, and these are classified as part of an attempt to find a coherent segmentation/recognition result. Letter segmentation is a by-product of letter recognition, which may itself be driven by contextual analysis. The main interest of this category of methods is that they bypass the segmentation problem: no complex dissection algorithm has to be built, and recognition errors are basically due to failures in classification. The basic principle is to use a mobile window of variable width to provide sequences of tentative segmentations which are confirmed (or not) by character recognition. Multiple sequences are obtained from the input image by varying the window placement and size; each sequence is assessed as a whole based on the recognition results. In recognition-based techniques, recognition can be performed by following either a serial or a parallel optimization scheme. In the first case, recognition is done iteratively in a left-to-right scan of words, searching for a "satisfactory" recognition result. The parallel method proceeds in a more global way: it generates a lattice of all (or many) possible feature-to-letter combinations, and the final decision is found by choosing an optimal path through the lattice (Casey & Lecolinet, 1996). Recognition-based segmentation consists of the following two steps: 1) Generation of segmentation hypotheses (e.g. windowing). 2) Choice of the best hypothesis (verification step).
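The hypothesize-and-test idea can be illustrated with a toy sketch (the classifier and the vocabulary here are stand-ins, not a real recognizer): every candidate segmentation of a "word" is scored by the classifier, and the best-scoring one wins.

```python
# Toy recognition-driven scheme: try every split of a 'word', score
# each piece with a (stub) classifier, keep the segmentation whose
# pieces the classifier likes best. Real systems search a lattice;
# this brute-force version only illustrates hypothesize-and-test.
def classify(piece, vocabulary):
    """Stub classifier: confidence 1.0 for known shapes, else 0."""
    return 1.0 if piece in vocabulary else 0.0

def best_segmentation(word, vocabulary):
    best = (0.0, [word])
    n = len(word)
    # Enumerate all split positions as binary masks over the n-1 gaps.
    for mask in range(1 << (n - 1)):
        cuts = [0] + [i + 1 for i in range(n - 1) if mask >> i & 1] + [n]
        pieces = [word[a:b] for a, b in zip(cuts, cuts[1:])]
        score = sum(classify(p, vocabulary) for p in pieces) / len(pieces)
        if score > best[0]:
            best = (score, pieces)
    return best[1]

# 'ab' and 'c' are known shapes; 'abc' as a whole is not:
assert best_segmentation("abc", {"ab", "c"}) == ["ab", "c"]
```

In the serial scheme the scan would stop at the first satisfactory piece; the exhaustive enumeration above corresponds to the parallel, lattice-style search.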
2.1.3 Holistic Technique
The holistic technique is the opposite of the classical dissection approach. It recognizes a word as a whole, thus skipping the segmentation of the word into characters. This involves comparing the features of the unsegmented word image to the features or descriptions of words in a database. Since a holistic approach does not directly deal with characters or alphabets, a major drawback of this class of methods is that their use is usually limited to a predefined vocabulary; a training stage is mandatory to expand or modify the scope of possible words. This property makes this kind of method more suitable for applications where the lexicon is statically defined, like check recognition. They can also be tailored to a specific user or to the particular vocabulary concerned. Holistic methods usually follow a two-step scheme: 1) The first step performs feature extraction. 2) The second step performs global recognition by comparing the representation of the unknown word with those of the references stored in the lexicon (Chaudhuri & Pal, 1997).
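A minimal sketch of this two-step scheme, assuming hypothetical two-dimensional whole-word feature vectors (real systems use much richer features such as ascender/descender profiles):

```python
# Holistic sketch: represent each word image by a whole-word feature
# vector and return the nearest lexicon entry; no character
# segmentation is performed. The feature values here are hypothetical
# placeholders (e.g. normalized width, ink density).
def nearest_word(features, lexicon):
    """lexicon: dict mapping word -> reference feature vector."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(lexicon, key=lambda w: dist(features, lexicon[w]))

lexicon = {"नेपाल": (0.9, 0.4), "काठमाडौं": (1.6, 0.5)}
assert nearest_word((1.0, 0.45), lexicon) == "नेपाल"
```

The key limitation shows up directly: a word absent from `lexicon` can never be produced, which is why the training stage must cover the whole target vocabulary.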
2.2 Segmentation Challenges in Devanagari OCR Several works have been reported on Devanagari and other South Asian scripts. Among them, Devanagari, Bangla, and Gurmukhi share the same issues and challenges, as they follow the same character structure and writing style (e.g. composition rules, headerline, conjuncts, compound characters, position of vowel modifiers etc.). The challenges and open problems related to Devanagari OCR are outlined below. These problems are unique to Devanagari and Bangla, and hence the solutions adopted by OCR systems for other scripts cannot be directly adapted to these scripts (Bag & Harit, 2013). The segmentation challenges faced in Devanagari OCR are described below:
2.2.1 Over Segmentation of Basic Characters
Some of the characters in Devanagari, such as (ग), (ण), (श), have two basic components. Similarly, the letter Kha (ख) also has a structure with two visually separate components and looks like a combination of the letters Ra and Va (रव). In such cases the OCR system gets confused and cannot
Figure 3 Over-segmentation Example (Letter ण, श, ग)
segment a complete basic character. Poor document quality can also lead to over-segmentation of characters. Some of these problems can be handled during post-processing, and some must be considered within the OCR process (segmentation and classification).
2.2.2 Handling Vowel Modifiers and Diacritics
The Devanagari script includes several vowel modifiers. When vowel modifiers come together
Figure 4 Segmentation using Projection Profile Technique
with core consonants they take a position at the top, bottom, left or right, resulting in a new shape. Identifying and recognizing these modifiers is an important task. The main challenge is to handle the large number of characters that are formed when the vowel modifiers combine with the basic characters (Bag & Harit, 2013). Sometimes vowel modifiers appear together with other diacritics (for example, the vowel modifier I (ि) together with the chandrabindu (ँ)); in such cases they overlap and increase the complexity of segmentation.
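The zone separation that underlies modifier handling can be sketched as follows, under the simplifying assumption that the dika is the single heaviest row of the horizontal projection (real systems are more robust to skew and broken headerlines):

```python
# Sketch of zone separation: the dika (headerline) is typically the
# row with the heaviest horizontal projection; rows above it hold top
# modifiers, rows below hold core characters and descenders.
def split_zones(image):
    """image: list of rows of 0/1 pixels -> (top_strip, header_row, core)."""
    projection = [sum(row) for row in image]
    header = max(range(len(image)), key=projection.__getitem__)
    return image[:header], header, image[header + 1:]

word = [[0, 1, 0, 0],   # top modifier
        [1, 1, 1, 1],   # dika: full-width run of ink
        [0, 1, 0, 1],   # core characters
        [0, 1, 0, 1]]
top, header, core = split_zones(word)
assert header == 1 and len(top) == 1 and len(core) == 2
```

Once the strips are isolated, top modifiers and core characters can be segmented separately with vertical projections, as in the projection profile methods discussed later.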
2.2.3 Handling Compound Characters and Ligatures
In Devanagari, compound characters and ligatures are common. Conjunct or compound characters may be produced by combining half-consonants with consonants. There is a large set of compound characters and ligatures, and it is sometimes hard to identify the constituent characters by simple analysis. Thus handling a large set of compound characters and ligatures is also a challenging task. Apart from these segmentation challenges, there are other challenges too, such as typographical errors and irregular word and character spacing. Kulkarni (2013) has studied the display typefaces of the Devanagari script. He noticed that most of the existing digital display typefaces in Devanagari are inconsistent: they have imbalanced letter structures, limited or inadequate matras, and ill-designed conjuncts. They also seem outdated and are overused, and many of them copy features and styles from existing Latin typefaces. He recommends treating Devanagari type design independently and not as secondary to Latin type design. This inconsistency and these imbalanced letter structures in typefaces add complexity to the OCR system. Because of the structural complexities of Indian scripts, a character recognition module that makes use of only the image information (shape and structure) of a character is prone to give incorrect results. To improve the recognition accuracy, it is necessary to use language knowledge to correct the recognition result. There has been only limited use of post-processing in Indian OCR systems, and more efforts are needed in this direction (Bag & Harit, 2013). Almost all Indic scripts need character reordering to re-organize text from visual order to logical (Unicode) order. Since most OCR systems operate strictly from left to right, the characters are scanned, and recognized, in visual order; this output needs to be reordered in post-processing.
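As a concrete illustration of visual-to-logical reordering, consider the short-i matra (ि, U+093F): it is drawn before its consonant but stored after it in Unicode, so an OCR system that reads glyphs left to right must swap the pair in post-processing. A minimal sketch handling just this one case:

```python
# The short-i matra (U+093F) appears visually BEFORE its consonant but
# logically AFTER it in Unicode. This sketch reorders only that case;
# a full reordering pass would cover other reordrant marks as well.
I_MATRA = "\u093F"

def reorder_visual_to_logical(glyphs):
    out = []
    for g in glyphs:
        if out and out[-1] == I_MATRA:
            out[-1] = g          # consonant moves before the matra
            out.append(I_MATRA)
        else:
            out.append(g)
    return "".join(out)

# Visual glyph order: [ि, क]; logical (Unicode) order: "कि"
assert reorder_visual_to_logical([I_MATRA, "क"]) == "कि"
```

Without this pass, the recognized text would contain dangling matras that render incorrectly and break search and spell checking.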
Apart from the above-mentioned problems, which directly pertain to the OCR systems, there is a need for a major effort to address related problems like scene text recognition, restoration of degraded documents, and large scale indexing and search in multilingual document archives.
2.3 Related Work Various works have been reported in the literature on the correct segmentation of conjuncts/compound characters and shadow characters to increase the performance of Devanagari OCR. At the same time, various feature extraction methods and character recognition algorithms have been proposed. Some of these works are briefly described below.
2.3.1 Segmentation
Bansal & Sinha (Bansal & Sinha, 1998) have considered the problem of conjunct segmentation in the context of Devanagari script. The conjunct segmentation algorithm takes the image of the conjunct and the coordinates of its enclosing box; the position of the vertical bar and the pen width are also inputs to the algorithm. For extracting the second constituent character of the conjunct, the continuity of the collapsed horizontal projection is checked. Bansal & Sinha (Bansal & Sinha, 2001) have divided words into top and bottom strips, after which a vertical projection is computed to extract characters/symbols and top modifiers. The Collapsed Horizontal Projection is defined for the segmentation of conjuncts/touching characters and shadow characters. Ma & Doermann (Ma & Doermann, 2003) identified Hindi words and then segmented them into individual characters using the projection profile technique (isolating top modifiers, separating bottom modifiers, and extracting core characters). Composite characters are identified and further segmented based on the structural properties of the script and statistical information; the Collapsed Horizontal Projection technique is adopted from Bansal & Sinha (2001) for conjunct segmentation. Bansal & Sinha (Bansal & Sinha, 2002) present a two-pass algorithm for the segmentation and decomposition of Devanagari composite (touching and fused) characters/symbols into their constituent symbols. The proposed algorithm extensively uses structural properties of the script. In the first pass, words are segmented into easily separable characters/composite characters, and statistical information about the height and width of each separated box is used to hypothesize whether a character box is composite. In the second pass, the hypothesized composite characters are further segmented by checking the continuity of the collapsed horizontal projection. Agrawal,
Ma & Doermann (Agrawal, Ma, & Doermann, 2009) have generated character glyphs from font files and passed them through feature extraction routines. For each character segmented in the document image, feature extraction is performed. With the objective of grouping broken characters and segmenting conjuncts and touching characters, a technique of font-model-based intelligent character segmentation and recognition was developed; for each word, connected component analysis is performed. Kompalli et al. (Kompalli, Nayak, & Setlur, 2005) have proposed a projection profile based method for character segmentation from words. Words are separated into ascenders, core components, and descenders. Gradient features are used to classify segmented images into different classes: ascenders, descenders, and core components. Core components contain vowels, consonants, and frequently occurring conjuncts, and are pre-classified into four groups based on the presence of a vertical bar: no vertical bar (e.g. छ, ट, ह), vertical bar at the center (e.g. फ, क), at the right (e.g. व, त, म), or at multiple locations (e.g. कय, स, सत). Four neural networks are used for classification within these groups. Due to ascender and core character separation, characters may be divided into multiple segments during OCR; positional information from the segmented images is used to reconstruct the original character. For recognition of valid but infrequent conjuncts, Kompalli et al. (2005) have attempted to segment the conjunct characters into their constituent consonants and classify the segmented images. For this segmentation, the authors examine breaks and joins in the horizontal runs (HRUNS) of a candidate conjunct character and build a block adjacency graph (BAG). Adjacent blocks in the BAG are selected from left to right as segmentation hypotheses.
Both the left and right images obtained from each segmentation hypothesis are classified using conjunct/vowel classifiers, and the segmentation hypothesis with the highest confidence is accepted. Post-processing is carried out using a lexicon with 4,291 entries generated from the Devanagari data set. Kumar et al. (Kumar & Sengar, 2010) present a projection profile technique for printed Devanagari and Gurmukhi script character segmentation. Initially, the horizontal histogram of a segmented line is computed and the position of the headerline is located. This separates the word into top and bottom
strips. The vertical projection histogram of each strip is computed for the segmentation of top modifiers and characters. In this paper conjuncts/fused characters are not considered; the results are for clean documents containing no conjuncts/fused characters. A projection profile technique is also proposed in (Dongre & Mankar, 2011) for the segmentation of Devanagari text images. To normalize the image against the thickness of the characters, the input image is thinned. Then the vertical projection histogram is computed and the locations containing single white pixels are noted; these points are taken as the boundaries of individual characters. The proposed method skips the process of headerline removal. In character segmentation, words are segmented into more symbols than are actually present in the word. Kompalli et al. (Kompalli, Setlur, & Govindaraju, 2006) have extended their previous work (Kompalli, Nayak, & Setlur, 2005), and two different approaches, segmentation-driven and recognition-driven segmentation, are compared for OCR of machine printed, multi-font Devanagari text. They have proposed a recognition-driven approach that combines classifier design with segmentation using the hypothesize-and-test paradigm. Word images are examined along horizontal runs (HRUNS) to build a Block Adjacency Graph (BAG). Given the BAG of a word, histogram analysis of block widths is used to identify the longest blocks as the headline (dika) and isolate ascenders from core components. Regression over the centroids of these core connected components is used to determine a baseline for the word. The classifier is used to obtain hypotheses for word segments such as consonants, vowels, or consonant-ascenders. If the confidence of the classifier is below a threshold, the algorithm attempts to segment the conjuncts, consonant-descenders and half-consonants; thus the classifier results guide further segmentation. Kompalli et al.
(Kompalli, Setlur, & Govindaraju, 2009) have proposed a novel graph-based recognition-driven segmentation methodology for Devanagari script OCR using the hypothesize-and-test paradigm. This work further improves their previous work (Kompalli et al. 2006). A BAG is constructed from a word image, and ascenders and core components are isolated. The core components can be isolated characters that do not need further segmentation, or conjuncts and fused characters that may or may not have
descenders. Multiple hypotheses are obtained for each composite character by considering all possible combinations of the generated primitive components and their classification scores. A stochastic model for word recognition has been presented, describing the design of a Stochastic Finite State Automaton (SFSA) that outputs word recognition results based on the component hypotheses and n-gram statistics. It combines classifier scores, script composition rules, and character n-gram statistics. Post-processing tools such as word n-grams or sentence-level grammar models are applied to prune the top-n choice results. They have not considered special diacritic marks (avagraha, udatta, anudatta), special consonants, punctuation, or numerals. Symbols such as anusvara, visarga and the reph character often tend to be classified as noise.

Table 6 Existing Text Segmentation Approaches for Devanagari OCR

Authors                  | Segmentation Technique                                        | Performance
Bansal & Sinha (2001)    | Collapsed Horizontal Projection                               | 93% at character level
Kompalli et al. (2005)   | BAG analysis                                                  | 93.81% for consonants and vowels
Kompalli et al. (2006)   | Graph-based character segmentation                            | 39.58% for segmentation-driven and 44.10% for recognition-driven OCR
Kompalli et al. (2009)   | Graph-based recognition-driven segmentation                   | recognition-driven BAG segmentation accuracy 72-90%; average recognition accuracy up to 87.82%
Ma & Doermann (2003)     | Structural properties and statistical information             | 92% at character-level recognition
Agrawal et al. (2009)    | Font-model-based segmentation, connected component analysis   | —
Bansal & Sinha (1998)    | Collapsed Horizontal Projection for segmentation              | 85% recognition rate
Bansal & Sinha (2002)    | Collapsed Horizontal Projection                               | recognition rate of 85% on segmented touching characters
For the Nepali HTK OCR (Shakya, Tuladhar, Pandey, & Bal, 2009) (Bal, 2009), the projection profile technique has been adopted for character segmentation. The process includes removal of the headerline and upper modifiers, followed by a multi-factorial analysis technique to segment the basic characters. The method is able to segment isolated characters along with half and conjoined characters. For the classifier, a Hidden Markov Model (HMM) from the HTK toolkit is used. Rupakheti & Bal (Rupakheti & Bal, 2009) adopted the projection profiling technique for the Nepali Tesseract OCR. The headerline width is identified and then the vertical projection histogram of the word to be segmented is computed. Histogram analysis is then done to mark the starting and ending
boundary of each character fragment, taking the headerline as the threshold that qualifies a segment to be separated. Most researchers have adopted the projection profiling technique for character segmentation. For Devanagari, this technique includes two phases: preliminary segmentation segments words into basic characters and compound/shadow/fused characters. In general, preliminary segmentation includes detection of the headline and use of its reference to isolate ascenders, core components, and descenders. For segmentation of compound characters, Bansal & Sinha (1998, 2001, 2002) have proposed continuity checking of the Collapsed Horizontal Projection. Kompalli et al. (2005) have proposed graph analysis for compound character segmentation. Ma & Doermann (2003) have used structural properties and statistical information of the script for further segmentation of compound characters. Kompalli et al. (2006, 2009) have proposed a graph-based recognition-driven character segmentation technique to overcome the problem of compound character segmentation, which is usually difficult using projection profile techniques. Various character segmentation approaches for Devanagari OCR are summarized in Table 6.
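As an illustration of the continuity check mentioned above, the sketch below is a heavily simplified one-dimensional version (not the published Bansal & Sinha algorithm): an interior gap in the row-wise ink profile of a candidate composite box suggests two joined components.

```python
# Hedged sketch of a continuity check on a horizontal (row-wise)
# projection, simplified from the Collapsed Horizontal Projection idea:
# if the ink profile of a candidate composite character has an empty
# interior row, the box is hypothesized to hold two joined components.
def horizontal_projection(image):
    """image: list of rows of 0/1 pixels -> ink count per row."""
    return [sum(row) for row in image]

def continuity_break(projection):
    """Index of the first empty interior row, or None if continuous."""
    rows = [i for i, ink in enumerate(projection) if ink > 0]
    if not rows:
        return None
    for i in range(rows[0], rows[-1] + 1):
        if projection[i] == 0:
            return i
    return None

box = [[1, 1], [1, 0], [0, 0], [1, 1]]
assert continuity_break(horizontal_projection(box)) == 2
```

In the actual method the projection is computed over a sub-region of the conjunct (after accounting for the vertical bar and pen width), but the decision logic is the same: a break in continuity triggers a split hypothesis.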
2.3.2 Recognition
Various feature extraction algorithms and classifiers have been proposed for Devanagari optical character recognition, all focused on improved performance. The shaded portions of characters are used as features by Chaudhuri & Pal (Chaudhuri & Pal, 1997); the classifiers used were decision trees. Kompalli et al. have used GSC features and a neural network classifier (Kompalli, Nayak, & Setlur, 2005). Kompalli et al. (Kompalli, Setlur, & Govindaraju, 2006) have used GSC features and a k-nearest neighbor classifier. Ma & Doermann (Ma & Doermann, 2003) suggest the use of statistical structural features; they have used Generalized Hausdorff Image Comparison (GHIC) for the recognition of characters. The different feature extraction methods and classifiers used by various researchers in the field of Devanagari OCR are summarized in Table 7.
Table 7 Feature Extraction and Classifiers in Devanagari OCR

Author                   | Feature                                   | Classifier                                        | Performance
Pal & Chaudhari (1997)   | Shaded portions in the character          | Decision Tree and Template Matching               | 96.5%
Kompalli et al. (2005)   | GSC                                       | Neural Network                                    | 84.77%
Kompalli et al. (2006)   | GSC                                       | k-nearest neighbor                                | 95%
Bansal & Sinha (2002)    | Statistical structure                     | Statistical knowledge                             | 85%
Dhurandhar et al. (2005) | Curves, contour                           | Centroid matching, length matching, interpolation | 85%
Kompalli et al. (2009)   | SFSA                                      | Stochastic Finite State Automaton                 | 87.82%
Ma & Doermann (2003)     | Statistical structural                    | Generalized Hausdorff Image Comparison (GHIC)     | 92%
Agrawal et al. (2009)    | Moment descriptors, directional features  | GHIC                                              | 96%
Bansal et al. (2001)     | Filters                                   | Distance based classifiers                        | 93%
Bishnu & Chaudhuri (Bishnu & Chaudhuri, 1999) have proposed a recursive contour following method for segmenting handwritten Bangla words into characters. Based on certain characteristics of Bangla writing styles, different zones across the height of the word are detected; these zones provide structural information about the constituent characters of the word. Recursive contour following solves the problem of overlap between successive characters. Garain & Chaudhuri (Garain & Chaudhuri, 2002) have proposed a method for segmenting touching characters in printed Bangla script. Through a statistical study they noted that touching characters occur mostly at the middle of the middle zone, and hence certain suspected touching points are found by inspecting the pixel patterns and their relative position with respect to the predicted middle zone. The geometric shape is cut at these points and the OCR scores are noted; the best score gives the desired result. Habib (Murtoza, 2005) has proposed a projection profiling technique for Bangla character segmentation. The width of the headline varies with print style (font size), so sometimes the headline cannot be removed cleanly. Two morphological operations, thinning and skeletonization, have been tried to overcome this problem. These operations remove pixels, and the remaining pixels make up the image skeleton. Characters can then be separated using connected components, which are taken as input to the recognition step.
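The connected-component step mentioned above can be sketched with a simple 4-connectivity flood fill (an illustrative version, not Habib's implementation):

```python
# Connected-component extraction (4-connectivity, iterative flood
# fill): the kind of step used after thinning/skeletonization to pull
# apart characters that are no longer joined by the headline.
def connected_components(image):
    h, w = len(image), len(image[0])
    seen, components = set(), []
    for sy in range(h):
        for sx in range(w):
            if image[sy][sx] and (sy, sx) not in seen:
                stack, comp = [(sy, sx)], []
                seen.add((sy, sx))
                while stack:
                    y, x = stack.pop()
                    comp.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and \
                           image[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            stack.append((ny, nx))
                components.append(comp)
    return components

# Two separate ink regions -> two components:
img = [[1, 0, 1],
       [1, 0, 1]]
assert len(connected_components(img)) == 2
```

Each component's pixel set (or its bounding box) then becomes one candidate character image for the recognition step.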
The Arabic OCR framework proposed by Sabbour & Shafait (Sabbour & Shafait, 2013) takes raw Arabic script text files as input in the training phase. The training part outputs a dataset of ligatures, where each ligature is described by a feature vector. The recognition part takes as input an image specified by the user and uses the ligature dataset generated during training to convert the image into text. The benchmark also contains versions of degraded text images, which aim at measuring the robustness of a recognition system against possible image defects such as jitter, thresholding, elastic elongation, and sensitivity. The performance of the system is reported to be 91% for clean Urdu text and 86% for clean Arabic text.
2.4 OCR Tools Developed for Devanagari
The development of Devanagari (Sanskrit, Hindi, Marathi, and Nepali) OCR software has been initiated by many organizations and individuals in India and Nepal. C-DAC from India has developed an OCR system (Chitrankan) for the Hindi and Marathi languages. Madan Puraskar Pustakalaya (MPP) from Nepal has also developed OCR projects for the Nepali language (based on the Tesseract open source OCR engine and the HTK tool). Ind.Senz (founded by Dr. Oliver Hellwig) is developing OCR software for the Devanagari script (Sanskrit, Hindi, and Marathi languages). Other projects are Parichit and Sanskrit/Hindi-Tesseract OCR. These tools are described in detail below:
Chitrankan: Chitrankan is an OCR (Optical Character Recognition) system for Hindi and other Indian languages developed by C-DAC. It works with the Hindi and Marathi languages along with embedded English text. It comes with facilities such as a spell checker, saving recognized text in ISCII format, and exporting text as .RTF for editing in any word processor. Skew detection and correction up to ±15°, automatic text and picture region detection, and advanced DSP (Digital Signal Processing) algorithms to remove noise and back-page reflection are also implemented. The recognized text is not very accurate, so manual editing is required. The supported operating systems are Windows XP and older versions of Windows2.
2 http://cdac.in/index.aspx?id=mlc_gist_chitra
Parichit: This project is based on the Tesseract OCR engine (http://code.google.com/p/tesseract-ocr/). The front end is a modified version of VietOCR (http://vietocr.sourceforge.net/). The project aims to create open source OCRs for Indian and South Asian languages. It also aims to create high quality training data for building Tesseract language models for each of the Indian languages. The project reports ongoing work on a headerline segmenter (Shirorekha segmenter) and character reordering for post-processing3.
Sanskrit / Hindi - Tesseract OCR (traineddata files for Devanagari fonts for Tesseract OCR 3.02+): Tesseract OCR 3.02 provides hin.traineddata for recognizing texts in the Devanagari script. However, the training texts, images, and box files are not provided, so it is difficult to improve the accuracy by further refining the traineddata. It is noted that recognition is more accurate and faster if the training is done with the same or a similar font as used in the text to be OCRed. With the aim of making Tesseract OCR usable for recognizing documents written in various Devanagari fonts, traineddata for these fonts is maintained by http://sourceforge.net/users/shreeshrii and can be downloaded from http://sourceforge.net/projects/tesseracthindi. Currently the traineddata for the Sanskrit2003 font and another similar font is available4.
Ind.senz OCR Programs: OCR programs are available for the Hindi, Marathi, and Sanskrit languages. These are the only Devanagari OCR programs developed and available for professional use. Ind.senz describes the programs as useful for data entry companies, publishing houses, and universities – wherever large amounts of Hindi and Sanskrit text have to be digitized. The programs take text images and transform them automatically into computer-editable text in Unicode format. Ind.senz reports the achievement of high accuracy rates on typical Devanagari fonts. The OCR programs are paid software; a demo version can be downloaded from http://www.indsenz.com/int/index.php5.
3 http://code.google.com/p/parichit
4 http://sourceforge.net/projects/tesseracthindi
5 http://www.indsenz.com/int/index.php
Google Drive OCR: Google has launched Nepali OCR in Google Drive. The OCR technology is free for Google Drive users. The OCR provided performs well on single-column documents. It can retain some formatting such as bold, font size, font type, and line breaks, but lists, tables, columns, footnotes, and endnotes are likely not to be detected. Though it shows good performance, one needs to be a Google Drive user, surrender one's documents to Google, and work online.
A Step Towards the Development of Nepali OCR: HTK Toolkit Based OCR: This OCR project was developed under Phase I of the PAN Localization project (2004-2007). The project was executed by Madan Puraskar Pustakalaya, http://madanpuraskar.org/. The development of the Nepali OCR was done with guidance and direct training from the Bangladesh team. The OCR project was closed with the release of a beta version6. The source files and executable are available on http://nepalinux.org7.
Tesseract Based Nepali OCR: Under the initiatives of MPP and Kathmandu University (KU), efforts were made to develop a Tesseract based Nepali OCR under PAN Localization Project Phase II. In this project, 202 Nepali characters, including basic characters and some derived characters (characters with ukar, ekar, and aikar), were trained via Tesseract 2.04. It is available for download at http://nepalinux.org or from the website of the PAN Localization Project, www.panl10n.net8.
After the release of the HTK based beta version of the Nepali OCR, the Google Tesseract based Nepali OCR was developed in 2009. Since then, the development and enhancement of Nepali OCR has been discontinued, and these tools have not been updated for a long time. In the current scenario, new versions of operating systems and new platforms have been released, and the tools developed do not meet the requirements of the new versions of operating systems
6 Findings of PAN Localization Project, PAN Localization Project 2012; ISBN: 978-969-9690-02-2
7 http://nepalinux.org/index.php?option=com_content&task=view&id=46&Itemid=53
8 http://www.panl10n.net/madan-puraskar-pustakalaya-nepal/
like Windows 7 and Windows 8.1. It is also necessary to develop OCR tools for other platforms such as Linux and Android.
In conclusion, this chapter discussed various works and methods for the correct segmentation of conjunct/compound characters and shadow characters to increase the performance of OCR. Various feature extraction methods and character recognition algorithms were also described briefly. Most of the research focuses on improving the performance of Devanagari OCR by improving the conjunct/compound character segmentation process; the methods include projection profile techniques, the collapsed horizontal projection technique, and recognition-driven segmentation techniques. Various feature extraction methods and classifiers proposed for the successful recognition of Devanagari characters were also presented. Finally, various tools developed for Devanagari OCR, covering the Hindi, Sanskrit, Marathi, and Nepali languages, were presented. The next chapter will discuss the methods applied to conduct this research and experiment, as well as the different components and phases of the applied method.
CHAPTER III
METHODOLOGY
The research works on Devanagari Optical Character Recognition suggest that the segmentation process cannot achieve full accuracy because of noise, touching characters, compound characters, variation in typefaces, and many similar looking characters. Because of the presence of language-specific constructs in non-Latin scripts, such as the "dika" (Devanagari), modifiers (south-east Asian scripts), writing order, or irregular word spacing (Arabic and Chinese), different approaches to segmentation are required (Agrawal, Ma, & Doermann, 2009). The Devanagari script also possesses its own constructs, which differ completely from Latin.
Figure 5 Proposed Nepali OCR Model
The most practiced character dissection method for Devanagari works by removing the headerline (dika) and separating the lower and upper modifiers, which makes it easy to extract the basic characters but increases the complexity of extracting modifiers. The modifiers get broken, and it is difficult to note their position in a sequence of segmented characters and restore their original shape. To minimize the overhead of component level segmentation and the errors due to inaccurate dissection, a hybrid approach which combines the Holistic Method and the Dissection Technique is proposed here. Kompalli et al. (Kompalli, Setlur, & Govindaraju, 2009) have also proposed a novel graph-based recognition driven segmentation methodology for Devanagari script OCR using the hypothesize-and-test paradigm, which is promising and inspiring work for a hybrid approach to OCR. Bag and Harit have also highlighted the need for new approaches, because the problems are unique to Devanagari and Bangla, and hence the solutions adopted by OCR systems for other scripts cannot be directly adapted to these scripts (Bag & Harit, 2013). The proposed framework has a two-phase recognition scheme:
Phase 1: Segment the input text image into words and recognize the words using the Holistic Approach. Measure the confidence of classification. If the confidence is lower than a threshold, go to Phase 2 recognition.
Phase 2: Words that are poorly classified in Phase 1 are segmented into characters using a projection profile. Segmentation results may be characters or compound characters (conjuncts, shadow characters, consonant-consonant-vowel combinations, consonant-vowel combinations, and characters including diacritics). These characters are then classified.
A general framework of DOCR is given in Figure 5. The general framework of our approach consists of two main parts – training and recognition:
3.1 Training:
Training takes the raw Nepali text data as input and outputs a dataset of words and a dataset of ligatures (compound characters), where each item is described by a feature vector. The training phase consists of two main steps:
1. Generation of a dataset of images for the possible words and ligatures (compound characters) of the Nepali language to be used by the application.
2. Extracting features that describe each word and ligature in the dataset generated by the previous step.
3.1.1 Dataset Generation:
This step involves the use of an automated computer program to generate the necessary training dataset. A text corpus of the target language is fed to the program and the analysis of the textual data is
performed to generate the list of words, basic characters, and compound characters which will later be used for rendering images representing the corresponding text. The steps involved in dataset generation are:
3.1.1.1 Create Distinct Words List and Character List: In this project, a text corpus collected by Madan Puraskar Pustakalaya (MPP) under the Bhasha Sanchar Project9 is used. The corpus includes different types of articles from different news portals, magazines, websites, and books (about 2,500 articles). The text corpus thus collected is
Figure 6 Training Dataset Generation
fed to the Text Separator, a program written in C#. This program searches for the words and maintains a dictionary of words in the form of tuples; a dictionary of characters is generated in the same form. The number of words extracted for Nepali is over 150,000, of varying length. The number of basic and compound characters extracted is over 7,000, of varying character length.
9 This corpus has been constructed by the Nelralec / Bhasha Sanchar Project, undertaken by a consortium of the Open University, Madan Puraskar Pustakalaya (मदन पुरस्कार पुस्तकालय), Lancaster University, the University of Göteborg, ELRA (the European Language Resources Association) and Tribhuvan University.
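As a rough illustration of what the Text Separator does, the word-counting step can be sketched in Python; the original program is written in C#, and the simple Devanagari-range tokenizer below is an assumption, not the actual implementation.

```python
from collections import Counter
import re

def build_word_list(corpus_text):
    """Count distinct Devanagari words in a corpus. The regex-based
    tokenizer is a simplification; the real Text Separator (C#) may
    tokenize differently."""
    words = re.findall(r"[\u0900-\u097F]+", corpus_text)  # Devanagari block
    return Counter(words)  # word -> frequency pairs

sample = "नेपाल नेपाल भाषा"
counts = build_word_list(sample)  # नेपाल appears twice, भाषा once
```

The same pass over the corpus, applied per character or per conjunct cluster instead of per word, would yield the character dictionary.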
3.1.1.2 Image Dataset Generation: In order to generate an image dataset of words and compound characters (including basic characters), the following steps are carried out:
- Images for each extracted word and character are rendered using a rendering engine. This involves rendering the text using 15 different Devanagari Unicode fonts, including Mangal, Arial Unicode MS, Samanata, Kokila, Adobe Devanagari, and Madan.
- Degraded images are generated by applying different image filtering operations (e.g. threshold, blur, erode) to the images rendered in the previous step.
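A minimal sketch of the degradation step, assuming scikit-image for the filtering operations; the parameter values and pipeline order are illustrative, not the exact values used in this work.

```python
import numpy as np
from skimage.filters import gaussian
from skimage.morphology import binary_erosion

def degrade(image, blur_sigma=1.0, threshold=0.5):
    """Produce one degraded variant of a rendered text image
    (float values in [0, 1]); parameters are illustrative."""
    blurred = gaussian(image, sigma=blur_sigma)      # blur
    binary = blurred > threshold                     # threshold
    return binary_erosion(binary).astype(np.uint8)   # erode

img = np.ones((16, 16))   # dummy white background
img[4:12, 4:12] = 0.0     # a dark glyph-like square
out = degrade(img)
```

Varying the sigma and threshold per rendered image yields multiple degraded copies, which makes the trained classifier more robust to scanning defects.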
3.1.2 Feature Extraction:
The second main step of the training phase is to extract a feature vector representing each word and compound character included in the dataset. For this, the following steps are performed:
1. Normalize each image to a fixed width and height
2. Compute the Histogram of Oriented Gradients (HOG) descriptor
Figure 7 Feature Extraction
To extract the HOG features from the dataset, the hog routine implemented in skimage.feature has been used. The routine allows configuring the orientations, pixels per cell, and cells per block. The process of feature extraction is shown in Figure 7.
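The two steps can be sketched as follows with skimage; the target size and HOG parameters below are illustrative assumptions, not the exact settings used in the thesis.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def extract_features(word_image, size=(32, 128)):
    """Normalize the image to a fixed height and width, then
    compute its HOG descriptor (a fixed-length vector)."""
    normalized = resize(word_image, size)          # step 1: normalize
    return hog(normalized, orientations=9,         # step 2: HOG
               pixels_per_cell=(8, 8), cells_per_block=(2, 2))

img = np.random.rand(40, 100)      # stand-in for a word image
features = extract_features(img)   # fixed-length feature vector
```

Because every image is resized to the same dimensions, every descriptor has the same length, which is what the classifier requires.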
3.2 Recognition:
The recognition part takes as input an image specified by the user through the user interface. Its main task is to recognize any text that occurs in the input image. The recognized
text is presented as output to the user in an editable format. The recognition of the text in an input image is done using the following steps:
Step 1: Segment the page image into lines and words.
Step 2: Describe each unknown word image using the HOG descriptor.
Step 3: Classify each word segment using the Random Forest classifier tool.
Step 4: Calculate the confidence of classification.
Step 5: If the classification confidence is lower than the threshold:
a. Segment words into characters/ligatures
b. Perform character level classification
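The confidence-gated, two-phase flow above can be sketched as follows; the helper functions, their return values, and the threshold value are hypothetical placeholders standing in for the real word- and character-level classifiers.

```python
# Sketch of the two-phase recognition scheme. holistic_classify,
# character_classify, and CONFIDENCE_THRESHOLD are illustrative
# stand-ins, not the thesis implementation.

CONFIDENCE_THRESHOLD = 0.7  # assumed value

def holistic_classify(word_image):
    # Placeholder: would return (label, confidence) from the
    # word-level Random Forest classifier.
    return ("नेपाल", 0.55)

def character_classify(word_image):
    # Placeholder: would segment the word into characters/ligatures
    # and classify each one (Phase 2).
    return "नेपाल"

def recognize_word(word_image):
    """Phase 1: holistic word recognition; fall back to Phase 2
    (character-level recognition) when confidence is too low."""
    label, confidence = holistic_classify(word_image)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label                       # Phase 1 result accepted
    return character_classify(word_image)  # Phase 2 fallback

result = recognize_word(None)  # 0.55 < 0.7, so Phase 2 runs here
```

The design keeps the fast holistic path for common, well-rendered words and pays the cost of character segmentation only when the word classifier is unsure.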
3.2.1 Line and Word Segmentation
In this research work, instead of a projection profile, a blob detection based approach for line and word segmentation has been used. Blobs are bright-on-dark or dark-on-bright regions in an image10 11. In the Devanagari script each word is a bunch of characters, and these characters are tied to each other by the headerline (dika). This property of the Devanagari script makes it easy to use blob detection for detecting individual words in a text document. Figure 9 shows the Nepali words, with each word as a separate bright region on a black background.
Figure 9 Nepali text words as Blobs
Figure 8 Snapshot of Character Segmentation
Word segmentation using blob detection involves various steps.
Algorithm: Line and Word Segmentation
Step 1: Preprocessing: blurring, binarization (grayscale and thresholding), skeletonization, and inverting the image
Step 2: Detect blobs
Step 3: Get the average blob size and remove all small and big blobs
Step 4: Create clusters of blobs by analysis of their distribution and y-coordinate
Step 5: Each cluster bounding box represents a text line; now word segmentation can be performed
10 Blob Detection, scikit-image, http://scikit-image.org/docs/dev/auto_examples/plot_blob.html, Accessed: 12/19/2015
11 Blobs Processing, AForge.NET, http://www.aforgenet.com/framework/features/blobs_processing.html, Accessed: 12/19/2015
Step 6: Re-apply blob detection within a line to perform word segmentation (vertical and horizontal blurring may be applied for more accurate segmentation)
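The blob step can be sketched with scikit-image's connected-component tools (label and regionprops); this is a simplified illustration of the idea, omitting the preprocessing, size filtering, and line clustering steps above.

```python
import numpy as np
from skimage.measure import label, regionprops

def detect_word_blobs(binary_image):
    """Find connected bright regions (blobs) in an inverted,
    binarized page image; each blob approximates one word,
    since the headerline ties a word's characters together."""
    labeled = label(binary_image)                  # connected components
    boxes = [r.bbox for r in regionprops(labeled)]
    # Sort roughly top-to-bottom, then left-to-right.
    return sorted(boxes, key=lambda b: (b[0], b[1]))

page = np.zeros((10, 20), dtype=np.uint8)
page[2:4, 1:6] = 1     # first "word" blob
page[2:4, 9:15] = 1    # second "word" on the same line
boxes = detect_word_blobs(page)  # two bounding boxes
```

Each bbox is (min_row, min_col, max_row, max_col); grouping boxes with similar row ranges reconstructs the text lines described in Steps 4-5.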
3.2.2 Character Segmentation:
Character segmentation into basic components becomes more challenging due to the properties of the script – the use of modifiers and diacritics, and of compound characters and ligatures. By studying the structure of the Devanagari script and the use of compound characters, it was found that it is better to treat compound characters and ligatures as single characters. The method is inspired by the work of Nazly Sabbour and Faisal Shafait, the ligature based approach to implementing segmentation free Arabic and Urdu OCR (Sabbour & Shafait, 2013). On analyzing the Nepali text corpus, it was found that there are about 7,000 compound characters (basic characters, conjuncts, ligatures) used in Nepali. The projection profile (PP) algorithm is used to segment characters. The algorithm for character segmentation is given below:
Algorithm: Character Segmentation
Step 1: Input: list of blobs
Step 2: Apply horizontal projection on the word rectangle part
Hpp(word) = {r1, r2, r3, …, rn}, where r1, r2, …, rn are the scores of black pixels in the corresponding rows
Step 3: Find the headerline in the word: HLl(word, x, y), HLh(word)
HLl(word, x, y) is the location of the headerline in the word, where x is the location of the upper row that lies in the headerline and y is the location of the lower row. HLh(word) = y – x is the height of the headerline of the word
Step 4: Apply vertical projection on the word rectangle part
Vpp(word) = {v1, v2, v3, …, vn}, where v1, v2, …, vn are the scores of black pixels in the corresponding columns
Step 5: Remove the headerline
Vpp(word)hr = Vpp(word) – HLh(word) = {v1 - HLh(word), v2 - HLh(word), …, vn - HLh(word)}
Step 6: Detect cut points CP(word) = {cp1, cp2, …, cpn}; cut points are valleys, i.e. spaces between two characters
Step 7: Perform segmentation
This module takes the blobs (rectangles enclosing the words) as input. In the first step, horizontal projection is applied on the word. Hpp(word) = {r1, r2, r3, …, rn} is the result of the horizontal projection, which contains the score of black pixels in each row. Analysis of Hpp(word) is performed to detect the headerline of the word and to calculate its height HLh(word). Those rows that have a horizontal projection score equal or close to the
maximum score and are neighbors of each other are part of the headerline. The analysis is performed on the upper half of the word. The location of the headerline, HLl(word, x, y), is the position of the headerline in the word, where x is the location of the upper row that lies in the headerline and y is the location of the lower row. The height of the headerline is given by HLh(word) = y – x. Then vertical projection is applied on the blob. Vpp(word) = {v1, v2, v3, …, vn} is the list of scores of black pixels in each column of the blob, where v1, v2, …, vn are the scores of black pixels in the respective columns. The method that has been practiced so far to isolate individual characters in a word of the Devanagari script is to remove the headerline. I have also used the same method, and it works fine for isolating compound characters. The headerline is not actually removed; rather, HLh(word) is subtracted from each element of Vpp(word), giving Vpp(word)hr = {v1 - HLh(word), v2 - HLh(word), …, vn - HLh(word)}. On subtraction some elements may become less than zero; in such cases the element is set to zero, because no score can be less than zero. The next task is to find the cut points CP(word) = {cp1, cp2, …, cpn}, where cp1, cp2, …, cpn are cut points, by analyzing Vpp(word)hr. Cut points are the points in a word at which the word can be chopped to isolate the characters, and normally these are the points where the element of Vpp(word)hr is equal to zero, that is, the spaces between two characters. Finally the cut points are noted and the segmentation is performed; the result is the set of segmented characters CS(TextImage) for each word of the text image.
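The cut-point logic described above can be sketched in Python with NumPy; this is a simplified version (a single uniform headerline score and exact-zero valleys, with no near-zero tolerance) of the algorithm.

```python
import numpy as np

def segment_characters(word, header_height):
    """Segment a binary word image (1 = ink) at columns whose
    vertical-projection score, after subtracting the headerline
    height HLh(word), drops to zero."""
    vpp = word.sum(axis=0)                           # Vpp(word)
    vpp_hr = np.clip(vpp - header_height, 0, None)   # Vpp(word)hr, floored at 0
    cut = vpp_hr == 0                                # candidate cut points
    segments, start = [], None
    for col, is_cut in enumerate(cut):
        if not is_cut and start is None:
            start = col                              # character begins
        elif is_cut and start is not None:
            segments.append((start, col))            # character ends
            start = None
    if start is not None:
        segments.append((start, len(cut)))
    return segments                                  # column ranges

word = np.zeros((6, 10), dtype=np.uint8)
word[0, :] = 1          # headerline spanning the whole word
word[1:5, 1:4] = 1      # first character body
word[1:5, 6:9] = 1      # second character body
segs = segment_characters(word, header_height=1)  # two segments
```

Each (start, end) column range is then cropped from the word image and passed to the character classifier.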
3.2.3 Classifier Tool
For the development of both the word classifier and the character classifier, a Random Forest classifier tool was developed. According to scikit-learn.org, “A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting”12.
12 http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html [Accessed: 03-24-2016]
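A minimal sketch of training such a classifier with scikit-learn; the feature dimensionality, toy labels, and forest size below are illustrative only, standing in for the real HOG vectors and word classes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for HOG feature vectors and their word labels.
rng = np.random.RandomState(0)
X_train = rng.rand(40, 64)                  # 40 samples, 64 features
y_train = np.repeat(["नेपाल", "भाषा"], 20)   # two word classes

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# predict_proba gives per-class probabilities; the maximum can serve
# as the classification confidence used to trigger Phase 2.
probs = clf.predict_proba(X_train[:1])
confidence = probs.max()
```

Using the maximum class probability as a confidence score is one natural way to implement the threshold check of the two-phase scheme.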
For testing purposes a limited set of words and characters has been trained. The training of the Random Forest is performed with the following setup: Word Classifier: Three different Random Forest classifiers are trained based on the word length, i.e. the ratio of image width to height. The training data images with width