Interim Report on Nepali OCR (HTK Toolkit and .NET Based)

Shanti Shakya (1), Samir Tuladhar (1), Rajesh Pandey (2), Bal Krishna Bal (2)

(1) Department of Computer Science and Engineering, Kathmandu University, Dhulikhel, Kavre, Nepal
{[email protected], [email protected]}

(2) Madan Puraskar Pustakalaya, PatanDhoka, Lalitpur, Nepal
{[email protected], [email protected]}
Abstract

Research and development on a Nepali OCR system is quite new, dating back only to 2006, when it was initiated by Madan Puraskar Pustakalaya (MPP) under the PAN Localization Project. This document is an interim report on the Nepali Optical Character Recognition (OCR) system, giving an overview of the current state of the research and development. We also discuss the implementation aspects of the different modules involved and present the achievements made so far.

Introduction

Optical Character Recognition (OCR) is basically the mechanical or electronic translation of images of handwritten, typewritten or printed text into machine-editable text [10]. Throughout this text, however, OCR refers to the recognition of printed text. The Nepali language is written in the Devanagari script, and hence the unique characteristics of the Devanagari script as reported in [1, 2, 3, 4, 6, 9] apply to Nepali printed text as well. Much research has already been done on OCR for languages that use the Devanagari script, especially for Hindi, with reported character-level recognition accuracy as high as 93% [1]. Similar achievements have been reported for the Bangla script and language, which is similar to the Devanagari script in many respects [3, 4, 6, 9]. The current work on the Nepali OCR system is therefore focused on making use of the already available OCR methods and techniques for the Devanagari and Bangla scripts, with slight modifications wherever necessary to meet the specific needs of the Nepali language. In this document, we give an overview of the current state of the research and development, discuss the implementation aspects of the different modules involved, and present the achievements made so far.
System overview
Fig. 1. A high-level system architecture of the Nepali OCR
The Nepali OCR system essentially consists of five major modules: Preprocessing, Segmentation, Feature Extraction, Training and Recognition. Each of these modules is briefly discussed below.

Preprocessing

This module enhances the quality of the image for processing in the subsequent modules of the system. It comprises several submodules such as Noise Removal, Binarization and Skew Correction. Since the OCR application potentially needs to handle images of old and distorted documents, there is a high likelihood of noise being present in the images. We use the KFill algorithm [8] for removing noise, particularly salt-and-pepper noise, from real scanned and microfilmed document images. The binarization submodule converts the input image so that the printed text is rendered in black against a white background; for this we employ Otsu's method [11]. The skew correction submodule, responsible for correcting the orientation of slanted scanned images, uses the Hough transform method [12].
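As an illustration of the binarization step, the following is a minimal sketch of Otsu's method operating on an 8-bit grayscale image. It is not the project's code; the array-based image representation and the helper name Binarizer are our own assumptions for illustration.

using System;

static class Binarizer
{
    // Computes Otsu's global threshold from an 8-bit grayscale image and
    // binarizes it: dark pixels (text) -> 1, light pixels (background) -> 0.
    public static int[,] OtsuBinarize(byte[,] gray)
    {
        int h = gray.GetLength(0), w = gray.GetLength(1);

        // 1. Build the 256-bin intensity histogram.
        var hist = new int[256];
        foreach (byte g in gray) hist[g]++;

        // 2. Scan all thresholds, maximizing the between-class variance.
        long total = (long)h * w, sumAll = 0;
        for (int i = 0; i < 256; i++) sumAll += (long)i * hist[i];
        long wB = 0, sumB = 0;
        double bestVar = -1.0;
        int threshold = 0;
        for (int t = 0; t < 256; t++)
        {
            wB += hist[t];                              // background count
            if (wB == 0) continue;
            long wF = total - wB;                       // foreground count
            if (wF == 0) break;
            sumB += (long)t * hist[t];
            double mB = (double)sumB / wB;              // background mean
            double mF = (double)(sumAll - sumB) / wF;   // foreground mean
            double between = (double)wB * wF * (mB - mF) * (mB - mF);
            if (between > bestVar) { bestVar = between; threshold = t; }
        }

        // 3. Apply the threshold: 1 = ink (black), 0 = background (white).
        var bin = new int[h, w];
        for (int r = 0; r < h; r++)
            for (int c = 0; c < w; c++)
                bin[r, c] = gray[r, c] <= threshold ? 1 : 0;
        return bin;
    }
}

The KFill noise removal and Hough-transform skew correction steps are omitted from the sketch; only the thresholding logic is shown.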
Segmentation

The objective of the segmentation module is to extract each character from the text present in the image document. Before getting into the technical details of the segmentation process, however, it is useful to look at the nature of Nepali printed text. As reported in [2, 4] for the Devanagari and Bangla scripts, a horizontal line is drawn on top of all characters, referred to as the dika in the case of Nepali. The characters of a word are actually connected through the dika. A word in the Nepali language may thus be partitioned into three zones. The upper zone denotes the portion above the headline (dika), the middle zone covers the portion of basic and compound characters below the headline, and the lower zone may contain some consonant and vowel modifiers. The imaginary line separating the middle and lower zones may be called the base line. The visual representation of a Nepali word is shown in Fig. 2 below:
Fig. 2. Headline, baseline and three zones in Devanagari

The segmentation step is carried out in three stages:

Segmentation of the text into lines.
Segmentation of each line into words.
Segmentation of each word into characters.

In Nepali, as in other languages following Indic scripts (Devanagari and Bangla), the interline spacing is used for identifying and segmenting lines of text. Similarly, the vertical white space between words is used to segment the words in a line. For character-level segmentation, the headline or dika is removed first. After the removal of the dika, the characters more or less resemble isolated Roman characters, so they are segmented in the same way as Roman characters. However, the individual characters of Nepali, and of other languages following or close to the Devanagari script, are not that straightforwardly represented without the headline. The segmentation process becomes complicated for Nepali owing to splitting and joining errors. After removal of the headline (dika), some characters (like 'ग', 'ख') appear to be two separate characters. This error is called a splitting error.
Fig. 3. Splitting error
Similarly, some characters seem to form a single character after removal of the headline, as shown in Fig. 4. This error is called a joining error.
Fig. 4. Joining error

Given below in Fig. 5, we show a snapshot of the segmentation process for Nepali text.
Fig. 5. Segmentation of Nepali text at the line, word and character levels
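To make the line- and word-level stages concrete, the following is a minimal sketch of projection-profile segmentation on a binarized image (1 = ink, 0 = background), in line with the interline- and inter-word-spacing idea described above. It is an illustration of the general technique, not the project's implementation; the minimum gap width is a placeholder value.

using System;
using System.Collections.Generic;

static class Segmenter
{
    // Returns (start, end) row ranges of text lines: a line is a maximal run
    // of rows whose horizontal projection (ink count) is non-zero.
    public static List<(int Start, int End)> SegmentLines(int[,] bin)
    {
        int h = bin.GetLength(0), w = bin.GetLength(1);
        var lines = new List<(int, int)>();
        int start = -1;
        for (int r = 0; r < h; r++)
        {
            int ink = 0;
            for (int c = 0; c < w; c++) ink += bin[r, c];
            if (ink > 0 && start < 0) start = r;                      // line begins
            if (ink == 0 && start >= 0) { lines.Add((start, r - 1)); start = -1; }
        }
        if (start >= 0) lines.Add((start, h - 1));
        return lines;
    }

    // Splits one text line into words wherever the vertical projection stays
    // empty for at least minGap consecutive columns.
    public static List<(int Start, int End)> SegmentWords(
        int[,] bin, int top, int bottom, int minGap = 4)              // minGap: placeholder
    {
        int w = bin.GetLength(1);
        var words = new List<(int, int)>();
        int start = -1, gap = 0;
        for (int c = 0; c < w; c++)
        {
            int ink = 0;
            for (int r = top; r <= bottom; r++) ink += bin[r, c];
            if (ink > 0) { if (start < 0) start = c; gap = 0; }
            else if (start >= 0 && ++gap >= minGap)
            {
                words.Add((start, c - gap));                          // end at last ink column
                start = -1; gap = 0;
            }
        }
        if (start >= 0) words.Add((start, w - 1));
        return words;
    }
}

Character-level segmentation, with the dika removal and the splitting/joining issues discussed above, requires additional logic on top of this and is not shown.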
The joining and splitting errors are expected to be resolved by the postprocessing module, which is currently under implementation. Basically, the postprocessing module will apply some hand-crafted rules that address these errors.

Feature extraction

For feature extraction, we employ the same method as reported in [3]: each segmented character image is divided into several frames of a fixed length, say 8 pixels, and a Discrete Cosine Transform (DCT) is then computed over the pixel values of each frame. The extracted features of each character image are stored in a file of a specific format, to be used later in the training and recognition stages.
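A minimal sketch of this kind of frame-wise feature extraction is given below, under stated assumptions: each character image is cut into vertical frames of a fixed width, the ink count of each column within a frame forms a short signal, and a one-dimensional DCT-II is computed over it. The frame width, the column-count signal and the number of retained coefficients are our own illustrative choices; [3] should be consulted for the exact recipe.

using System;
using System.Collections.Generic;

static class FeatureExtractor
{
    // Cuts a binarized character image (1 = ink) into vertical frames of
    // frameWidth columns and returns one DCT feature vector per frame.
    // A trailing partial frame is dropped for simplicity.
    public static List<double[]> Extract(int[,] ch, int frameWidth = 8, int nCoeffs = 8)
    {
        int h = ch.GetLength(0), w = ch.GetLength(1);
        var features = new List<double[]>();
        for (int f0 = 0; f0 + frameWidth <= w; f0 += frameWidth)
        {
            // The frame's signal: ink count of each of its columns.
            var signal = new double[frameWidth];
            for (int c = 0; c < frameWidth; c++)
                for (int r = 0; r < h; r++) signal[c] += ch[r, f0 + c];
            features.Add(Dct(signal, nCoeffs));
        }
        return features;
    }

    // Plain DCT-II: X[k] = sum_n x[n] * cos(pi/N * (n + 0.5) * k).
    static double[] Dct(double[] x, int nCoeffs)
    {
        int n = x.Length;
        var coeffs = new double[Math.Min(nCoeffs, n)];
        for (int k = 0; k < coeffs.Length; k++)
        {
            double s = 0.0;
            for (int i = 0; i < n; i++)
                s += x[i] * Math.Cos(Math.PI / n * (i + 0.5) * k);
            coeffs[k] = s;
        }
        return coeffs;
    }
}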
Training and recognition

For simplicity, the training and recognition stages have been merged into one. For the characters in an input image to be recognized, the entire character set of the adopted font must be trained, including all the complex characters (like उँ, न्, नु, दै, मृ, etc.). A complete set of characters of the Devanagari script applicable to the Nepali language is shown in Fig. 6 below:
Fig. 6. Unicode values for Nepali characters (source: http://unicode.org/charts/PDF/U0900.pdf)

For training, we have followed the Hidden Markov Model (HMM) approach of the HTK Toolkit, as in [3]. Training with the HTK Toolkit involves the following processes [7]:
Preparation of the training data.
Selection of a prototype file.
Creation of the HMM file by HTK.
Addition of the newly created HMM model into the MMF (Master Model File).
Addition of the model into the HMM list.
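As an illustration of the prototype-file step, a minimal HTK prototype HMM definition, written in HTK's own model-definition syntax, might look as follows. The 8-dimensional USER feature stream, the three emitting states and the transition probabilities are placeholder assumptions for illustration; the values used in the actual system may differ.

~o <VecSize> 8 <USER>
~h "proto"
<BeginHMM>
  <NumStates> 5
  <State> 2
    <Mean> 8
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 8
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <State> 3
    <Mean> 8
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 8
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <State> 4
    <Mean> 8
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 8
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <TransP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.7 0.3
    0.0 0.0 0.0 0.0 0.0
<EndHMM>

HTK's HCompV and HERest tools then replace these flat-start means and variances with estimates from the training data.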
A flowchart of the training and recognition procedures is shown below in Fig. 7.
Fig. 7. A flowchart of the training and recognition procedures
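Before HTK's training tools can consume them, the per-character feature vectors must be serialized in HTK's binary parameter-file layout: a 12-byte big-endian header followed by the samples as big-endian 32-bit floats. The sketch below writes such a file with the USER parameter kind; the sample period is a placeholder (HTK requires one even though it has no physical meaning for images), and whether the project uses exactly this layout is our assumption.

using System;
using System.Collections.Generic;
using System.IO;

static class HtkWriter
{
    const short ParmKindUser = 9;  // HTK 'USER' parameter kind

    // Writes feature vectors as an HTK binary parameter file. Header fields:
    // nSamples (int32), sampPeriod (int32, 100 ns units),
    // sampSize (int16, bytes per sample), parmKind (int16).
    public static void Write(string path, List<double[]> frames, int sampPeriod = 100000)
    {
        using var fs = new FileStream(path, FileMode.Create);
        int dim = frames[0].Length;
        WriteBE(fs, BitConverter.GetBytes(frames.Count));
        WriteBE(fs, BitConverter.GetBytes(sampPeriod));          // placeholder period
        WriteBE(fs, BitConverter.GetBytes((short)(dim * 4)));
        WriteBE(fs, BitConverter.GetBytes(ParmKindUser));
        foreach (var frame in frames)
            foreach (double v in frame)
                WriteBE(fs, BitConverter.GetBytes((float)v));
    }

    // .NET serializes little-endian on common platforms; HTK expects
    // big-endian by default, so the byte order is reversed.
    static void WriteBE(Stream s, byte[] bytes)
    {
        if (BitConverter.IsLittleEndian) Array.Reverse(bytes);
        s.Write(bytes, 0, bytes.Length);
    }
}

A segmented character could then be exported with HtkWriter.Write("char0001.usr", FeatureExtractor.Extract(ch)), where both names come from the sketches above rather than from the actual system.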
Results and discussion

We provide below a few screenshots in consecutive figures to present the results achieved so far in the segmentation and recognition modules.
Fig. 8. Source image document
Fig. 9. Segmentation process applied to the preprocessed source image
Fig. 10. OCRed text of the source image

The segmentation module, as shown in Fig. 9, segments the given text with a fairly high accuracy rate. As can be seen from the figure, not just the isolated characters but also the half and conjoined characters have been segmented. This has been possible as a result of applying fuzzy multifactorial analysis [9]. This approach has been especially effective for segmenting touching characters. At the same time, however, it carries the overhead of having to address the structural ambiguity of the segmented characters during training. In addition to the multifactorial analysis method, we have also used another approach: removing the headline as well as the upper modifier. The combined result of the two approaches is found to increase segmentation accuracy considerably. In Figs. 11, 12 and 13 below, we depict the problem posed by the multifactorial analysis method.
Fig. 11. Conjoined character को
Fig. 12. Segmented portion of को
Fig. 13. Another segmented portion of को

As can be seen from Figs. 12 and 13, we now need to develop specific rules for how the system ought to be trained on such segmented characters. The segmented portion in Fig. 13 resembles a dirgha ikaar, one vowel modifier, but is in fact an okaar, another vowel modifier. We aim to resolve these ambiguity problems by developing certain training rules. As far as the segmentation and recognition results are concerned, we have achieved accuracy rates of about 90% and 80%, respectively. We have trained and tested on samples from scanned as well as microfilmed versions of old documents. Around 4,735 character samples have been trained so far, and this process is ongoing; we have found that the more training the system receives, the better its recognition accuracy. With some additional training, and once the postprocessing module currently under development is implemented, we expect the accuracy rate to increase considerably.
Conclusion

In this document, we gave a general overview of the Nepali OCR system currently under development. We discussed the different modules involved and the different algorithms applied. We also presented the results and achievements made so far, thereby shedding light on the existing limitations of the system.

Acknowledgement
The PAN L10n work has been carried out with the aid of a grant from the International Development Research Centre, Ottawa, Canada, administered through the Center for Research in Urdu Language Processing (CRULP), National University of Computing and Emerging Sciences (NUCES), Lahore, Pakistan.
References

[1] V. Bansal and R. Sinha, "A complete OCR for printed Hindi text in Devanagari script," in Proceedings of the Sixth International Conference on Document Analysis and Recognition, Seattle, WA, USA, 2001, pp. 800-804.
[2] V. Bansal and R. Sinha, "A Devanagari OCR and a brief overview of OCR research for Indian scripts," in Proceedings of STRANS01, IIT Kanpur, 2001.
[3] S. Habib, H. Md. Abdul, and M. Khan, "A high performance domain specific OCR for Bangla script," Center for Research on Bangla Language Processing, Department of Computer Science and Engineering, BRAC University, 66 Mohakhali, Dhaka, Bangladesh, Working Paper, 2007.
[4] B. Chaudhuri and U. Pal, "An OCR system to read two Indian language scripts: Bangla and Devanagari (Hindi)," in Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997, pp. 1011-1015.
[5] S. Kompalli, S. Setlur, and V. Govindaraju, "Design and comparison of segmentation driven and recognition driven Devanagari OCR," in Second International Conference on Document Image Analysis for Libraries, IEEE Computer Society, Washington, DC, USA, 2006, pp. 96-102.
[6] B. B. Chaudhuri, Digital Document Processing: Major Directions and Recent Advances (Advances in Pattern Recognition), P. S. Singh, Ed. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[7] HTK Speech Recognition Toolkit. [Online]. http://htk.eng.cam.ac.uk/docs/docs.shtml
[8] K. Chinnasarn, Y. Rangsanseri, and P. Thitimajshima, "Removing salt-and-pepper noise in text/graphics images," in The 1998 IEEE Asia-Pacific Conference on Circuits and Systems (APCCAS), Chiangmai, Thailand, 1998, pp. 459-462.
[9] U. Garain and B. Chaudhuri, "Segmentation of touching characters in printed Devanagari and Bangla scripts using fuzzy multifactorial analysis," in Proceedings of the Sixth International Conference on Document Analysis and Recognition, Seattle, USA, 2001, pp. 805-809.
[10] Wikipedia, the free encyclopedia. [Online]. http://en.wikipedia.org/wiki/Optical_character_recognition
[11] Wikipedia, the free encyclopedia. [Online]. http://en.wikipedia.org/wiki/Otsu's_method
[12] Wikipedia, the free encyclopedia. [Online]. http://en.wikipedia.org/wiki/Hough_transform