International Journal of Science and Advanced Technology (ISSN 2221-8386) http://www.ijsat.com
Volume 1 No 4 June 2011
Multilingual Optical Character Recognition System for Printed English and Telugu Base Characters

M Swamy Das, Department of Computer Science and Engineering, Chaitanya Bharathi Institute of Technology, Hyderabad
[email protected]

K Rahul, Department of Computer Science and Engineering, Chaitanya Bharathi Institute of Technology, Hyderabad
[email protected]

CRK Reddy, Department of Computer Science and Engineering, Chaitanya Bharathi Institute of Technology, Hyderabad
[email protected]

A Govardhan, Department of Computer Science and Engineering, JNTUCE, Jagityal
[email protected]
Abstract— Optical character recognition, usually abbreviated to OCR, is a system that automatically translates scanned images of handwritten, typewritten or printed text into machine-encoded text. Several commercial OCR systems are now available in the market, but most of them work for Latin-based scripts such as English, as well as for Japanese and Arabic characters. India is a multilingual country with several languages written in different scripts, and because of this diversity relatively little work on Indian languages has been reported. In a multilingual country like India it is common to have documents containing multiple scripts. The character sets of most Indian scripts are large and structurally complex compared to Latin-based scripts: there are around 52 base characters, and several compound characters are possible. For the recognition of such large and complex scripts, artificial neural networks are well suited. The objective of this work is to develop a multilingual OCR system that can recognize the basic printed characters of the English and Telugu scripts. The results show that more than 98% accuracy can be achieved; due to segmentation and preprocessing errors, some characters could not be recognized by this system.

Keywords— OCR, multilayer feed-forward networks, backpropagation, UNICODE, multiscript, multilingual document.
I. INTRODUCTION
OCR is a computer program that converts handwritten or machine-printed image documents into editable text documents. Once translated into a text document, the content can be stored in ASCII or UNICODE format. In India there are several historic documents that were machine printed or handwritten; in order to make them available on the net they need to be digitized, and once they are digitized using OCR systems they can be accessed from the web. So there is a great need for OCR system development. India is a multilingual [1] country with more than 18 regional languages derived from 12 different
scripts. One script may be used to write more than one language; for example, Hindi, Marathi, Rajasthani, Sanskrit and Nepali are written using the Devanagari script, while Assamese and Bangla are written using the Bangla script. In such a multilingual country it is common to have documents with multiple scripts: bus/railway reservation forms, question papers, language translation books and money-order forms may contain text lines in more than one script or language. For processing these documents, multilingual OCR systems are needed. Almost all existing work on OCR makes an important implicit assumption that the script type of the document to be processed is known beforehand. In an automated multilingual environment, document processing systems relying on such OCR would clearly need human intervention to select the appropriate OCR package, which is inefficient, undesirable and impractical. The ability to reliably identify the script type using the least amount of textual data is essential when dealing with document pages that contain text words of different scripts. Most Indian scripts have around 52 base (vowel and consonant) characters. Apart from these base characters there are vowel and consonant modifiers with which several compound characters can be formed. Developing an OCR system for such large and complex character sets is a difficult task, and artificial neural network based methods are well suited to it compared with conventional methods. In this work a multilayer feed-forward neural network, trained with the backpropagation algorithm, has been chosen.

II. PROCESSING STEPS OF OCR
The typical phases [2] of an OCR system are:
- Preprocessing
- Segmentation
- Feature extraction and selection
- Classification/recognition
In the preprocessing phase, the scanned input document is first converted into a gray-scale image and then into a binary image. Then noise and skew are removed from the binary image.
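As an illustration of this grayscale-to-binary step, the following is a minimal Python sketch using OpenCV's Otsu thresholding. The library choice and file name are assumptions for illustration only; the system described in this paper is implemented on the .NET framework.

    import cv2

    # Load the scanned page directly as a grayscale image (file name is illustrative).
    gray = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)

    # Otsu's method picks the global threshold that minimises the intra-class
    # variance of foreground and background pixels; THRESH_BINARY_INV makes
    # text pixels white (foreground) on a black background.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)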
Figure 1. Processing steps of a typical OCR system

The segmentation phase extracts lines, words and characters from the noise- and skew-free binary image. The extracted characters are given as input to the feature extraction/selection module. The feature extraction phase extracts features from the binary character images, and from this feature set the essential features that are useful for discriminating the individual characters are selected. The classification/recognition phase actually recognizes the characters using the selected features. Fig. 1 shows the steps of a typical OCR system. Once the characters are recognized by the classifier they can be stored as ASCII or UNICODE characters for further processing.

III. PROPOSED MULTILINGUAL OCR SYSTEM

The proposed multilingual OCR system consists of pre-processing, segmentation, feature extraction and classification modules. The complete process of the system is shown in Fig. 7 and Fig. 8.

A. Pre-processing
Any language identification method requires a conditioned image of the document, which implies that the document should be noise free and skew free. The pre-processing module includes the conversion of a color or gray-scale image into binary, noise removal, thinning, and skew detection and correction. Actual processing takes place on the binary images; a binary image separates the foreground pixels from the background. For binarization of gray-scale images, Otsu's method [4] has been used.

B. Segmentation
This module extracts lines, words and finally characters from the noise-free, de-skewed text document images. In Telugu script, the consonant and vowel modifiers may be attached or placed on top, bottom, left or right of the base character, and the text document image may contain overlapped lines and characters. To segment the image into lines and characters, we used an approach [2] based on connected components (CCs) and profiles. These algorithms are given as follows:

Line segmentation algorithm:
a. Label the connected components of the given document image.
b. Determine the top, bottom, left and right position of each CC using its bounding box.
c. Establish the vertical spatial relations to check whether two CCs are (i) fully overlapped or (ii) partially overlapped.
d. Cluster the CCs using the nearest-neighborhood method to extract lines, by constructing an undirected graph in which each node represents a connected component and each link represents the distance between components.
e. If a component overlaps another component, compute the Euclidean distance between the components. The distance between non-overlapping components is infinite, i.e. they are not reachable from each other.
f. If a connected component is reachable from, and nearest to, a component in a cluster, add the component to that cluster.
g. Finally, each cluster of connected components forms a line.
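A simplified Python sketch of this line-clustering step is given below. It labels the connected components and groups them by vertical overlap only, rather than building the full nearest-neighbour distance graph described in steps d-f; scipy is an assumed dependency, and the function name is illustrative.

    from scipy import ndimage

    def segment_lines(binary):
        """Cluster connected components into text lines by vertical overlap.

        `binary` is a 2-D numpy array with foreground pixels set to 1.
        Simplified version of steps a-g above: components whose vertical
        extents overlap are merged into the same line cluster.
        """
        labels, _ = ndimage.label(binary)                      # step a: label connected components
        boxes = ndimage.find_objects(labels)                   # step b: bounding box of each CC
        extents = [(sl[0].start, sl[0].stop) for sl in boxes]  # (top, bottom) rows per CC

        lines, spans = [], []                                  # CC clusters and their row spans
        for i, (top, bot) in enumerate(extents):
            for j, (ltop, lbot) in enumerate(spans):
                if top < lbot and bot > ltop:                  # steps c-f: overlapping CC joins the line
                    lines[j].append(i)
                    spans[j] = (min(top, ltop), max(bot, lbot))
                    break
            else:                                              # no overlapping line found: start a new one
                lines.append([i])
                spans.append((top, bot))
        return lines                                           # step g: each cluster forms a line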
Word segmentation algorithm:
for each line:
    for each column:
        for each row:
            count the number of black pixels;
        if there is a run of zeroes in the vertical profile then treat it as word spacing;
    end
end

Character segmentation algorithm:
a. Remove consonant modifiers from the word. To do this, determine the middle row using the bounding box (i.e. height/2).
b. Compute the horizontal profile and identify the bottom base line using this profile: the bottom base line is the highest-peak row in the profile below the middle row.
c. The connected components below the bottom base line are consonant modifiers; remove them from the word and add them to the consonant-modifier group.
d. Remove vowel modifiers by finding the top base line, computed as in the above step, and add the modifiers to the vowel-modifier group.
e. Using the vertical profile, separate the base characters at the clear paths between them. Then add the vowel and consonant modifiers using the nearest-neighborhood method with horizontal-relationship heuristics.
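The word split above and the base-character split in step e both reduce to scanning a vertical projection profile for empty columns. The following is a minimal Python sketch of that scan, assuming a binary numpy array with foreground pixels equal to 1; the min_gap parameter is an assumed tuning knob (a larger value yields word gaps, a smaller one gaps between base characters).

    import numpy as np

    def split_on_vertical_gaps(line_img, min_gap=3):
        """Split a binary line image at runs of empty columns.

        A column with no foreground pixels contributes a zero to the vertical
        profile; a run of at least `min_gap` zeros is treated as a gap.
        Returns a list of (first_col, last_col + 1) segments.
        """
        profile = np.asarray(line_img).sum(axis=0)   # black-pixel count per column
        segments, start, gap = [], None, 0
        for x, count in enumerate(profile):
            if count > 0:
                if start is None:
                    start = x                        # segment begins at first non-empty column
                gap = 0
            else:
                gap += 1
                if start is not None and gap >= min_gap:
                    segments.append((start, x - gap + 1))
                    start = None
        if start is not None:                        # close a segment that runs to the edge
            segments.append((start, len(profile)))
        return segments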
C. Feature Extraction and Selection
Feature extraction is a special form of dimensionality reduction. When the input data to an algorithm is too large to be processed and is redundant, it is transformed into a reduced, representative set of features (also called a feature vector); transforming the input data into this set of features is called feature extraction. There are several feature extraction methods, based on geometric features, diagonal features or raw pixels. The first step in feature extraction is to detect the extent (bounding box) of the character; after this, the character image is resized to an appropriate resolution. The feature extraction method used in this work is a pixel map, in which the symbol image is mapped to a corresponding two-dimensional binary matrix. An important issue to consider is the size of this matrix. If all the pixels of the symbol are mapped into the matrix, one would certainly capture all the distinguishing pixel features of the symbol and minimize overlap with other symbols; however, this strategy would imply maintaining and processing a very large matrix (up to 15,000 elements for a 100x150 pixel image). In this work, the character image is mapped to a matrix of size 15x10 (i.e. 150 elements).
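A minimal Python sketch of this pixel-map mapping is given below. OpenCV is an assumed library choice for the resize (the original system is implemented in .NET), and the function name is illustrative.

    import cv2
    import numpy as np

    def pixel_map_features(char_img, rows=15, cols=10):
        """Map a segmented character image to a 150-element binary feature vector.

        `char_img` is a binary (0/255) image of one character. It is resized to a
        fixed 15x10 grid and flattened, mirroring the pixel-map features described
        above. Resizing re-introduces grey values, so the result is re-thresholded.
        """
        resized = cv2.resize(char_img, (cols, rows), interpolation=cv2.INTER_AREA)
        binary = (resized > 127).astype(np.uint8)
        return binary.flatten()      # length rows * cols = 150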
D. Classification/Recognition
The classification module recognizes the characters. Several classifiers exist; as mentioned in the introduction, the artificial neural network approach is well suited to complex pattern recognition tasks such as multilingual script recognition, so an MLP (Multilayer Perceptron) feed-forward network [10] is adopted in this work. It uses a supervised training procedure, the backpropagation algorithm, and its operation consists of two phases, training and testing. During the training phase, a training sample consisting of the feature vector and its target value is given as input for each basic character. After the training phase, the trained network is tested. The training and testing procedures are as follows:

Training phase:
1) Form the network according to the specified topology parameters.
2) Initialize the weights with random values within the specified weight_bias value.
3) Load the trainer set files (both the input image and the desired output text).
4) Analyze the input image and map all detected symbols into linear arrays.
5) Read the desired output text from file and convert each character to a Unicode value, stored separately.
6) For each character:
   a. Calculate the output of the feed-forward network.
   b. Compare it with the desired output corresponding to the symbol and compute the error.
   c. Backpropagate the error across each link to adjust the weights.
7) Move to the next character and repeat step 6 until all characters are visited.
8) Compute the average error over all characters.
9) Repeat steps 6 to 8 for the specified number of epochs; if the error threshold is reached, abort the iteration.
10) Otherwise, continue iterating.

Recognition phase:
The recognition phase of the implementation is simple and straightforward. Since the program is coded in modular parts, the same routines that were used to segment, extract features and compute network parameters of input vectors in the training phase can be reused in the recognition phase as well. The basic steps can be summarized as follows:
1) Load the image file.
2) Analyze the image for character lines.
3) For each character line, detect consecutive words.
4) For each word, detect the constituent character symbols.
   a. Analyze and process each symbol image to map it into an input vector.
   b. Feed the input vector to the network and compute the output.
5) Render the Unicode output to a text box.
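To make step 6 of the training phase concrete, the following is a minimal numpy sketch of one backpropagation update for a three-layer feed-forward network with the layer sizes used in Section IV (150 inputs, 250 hidden, 16 outputs). It is an illustrative re-implementation, not the paper's .NET code; the learning rate and weight-initialization range are assumed values.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class MLP:
        """Minimal 150-250-16 feed-forward network trained with backpropagation."""

        def __init__(self, n_in=150, n_hidden=250, n_out=16, lr=0.1, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden))   # input -> hidden weights
            self.b1 = np.zeros(n_hidden)
            self.W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out))  # hidden -> output weights
            self.b2 = np.zeros(n_out)
            self.lr = lr

        def forward(self, x):
            self.h = sigmoid(x @ self.W1 + self.b1)              # step 6a: feed forward
            self.y = sigmoid(self.h @ self.W2 + self.b2)
            return self.y

        def train_step(self, x, target):
            y = self.forward(x)
            err = target - y                                     # step 6b: output error
            # step 6c: backpropagate and adjust weights (sigmoid derivative = y * (1 - y))
            delta_out = err * y * (1.0 - y)
            delta_hid = (delta_out @ self.W2.T) * self.h * (1.0 - self.h)
            self.W2 += self.lr * np.outer(self.h, delta_out)
            self.b2 += self.lr * delta_out
            self.W1 += self.lr * np.outer(x, delta_hid)
            self.b1 += self.lr * delta_hid
            return 0.5 * np.sum(err ** 2)                        # squared error, averaged in step 8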
IV. IMPLEMENTATION
The MLP network implemented in this work is a three-layer network. The input layer consists of 150 neurons, which receive pixel binary data from a 10x15 pixel matrix. The size of this matrix was decided taking into consideration the average height and width of a character image that can be mapped without introducing any significant pixel noise. The hidden layer consists of 250 neurons; this number was decided on the basis of the optimal results shown in the graph of Fig. 6(b).
Figure 4. Sample multilingual input image
Figure 2. Architectural diagram of MOCRS
The output layer is composed of 16 neurons corresponding to the 16 bits of the UNICODE encoding. The complete architectural diagram of the network is shown in Fig. 2. The system is developed using the Microsoft .NET framework; Fig. 3 shows its graphical user interface.
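As an illustration of how the 16 output neurons map to a character, the sketch below thresholds each output and interprets the bits as a 16-bit code point. The bit order (most significant bit first) is an assumption not stated in the paper, and the example activation pattern is purely illustrative.

    def outputs_to_char(outputs, threshold=0.5):
        """Interpret 16 network outputs as one 16-bit Unicode code point."""
        code = 0
        for o in outputs:             # assumed order: most significant bit first
            code = (code << 1) | (1 if o > threshold else 0)
        return chr(code)

    # Example: a bit pattern decoding to the Telugu letter 'అ' (U+0C05).
    print(outputs_to_char([0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]))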
Figure 5. Sample test image and recognized output file
Figure 3. Graphical User Interface of MOCRS
Figure 6. Graphs: (a) epochs vs. error; (b) hidden neurons vs. error
V. CONCLUSION AND FUTURE WORK
This work hopes to generate interest in multilingual optical character recognition systems that use neural networks as the back end to solve the classification problem. At present, this system can successfully identify Telugu and English characters of certain fonts in a multilingual document. Using this system, text in multilingual documents can be recognized, unlike with conventional OCR systems, which can recognize text documents of a particular language only; this helps simplify the document digitization process in multilingual countries like India. Presently, the system can recognize only Telugu and English characters present in a printed text document. As shown in Fig. 5, it could not recognize 3 characters due to segmentation errors. The system also has a few limitations in the number of fonts and font sizes it can recognize; it can be improved so that characters of various font types and sizes are also recognized. For Telugu, we were only able to recognize base characters without vowel and consonant modifiers, since recognition of complex characters requires complex segmentation algorithms and extensive training of the network; the system can be extended to recognize all the characters in the Telugu character set. The system, which can now recognize printed text documents of two languages, can be improved to recognize documents in many more Indian languages, and it can further be extended to recognize handwritten text documents.

REFERENCES
[1] M. C. Padma et al., "Identification of Telugu, Devanagari and English Scripts Using Discriminating Features", IJCSIT, Vol. 1, No. 2, November 2009.
[2] M. Swamy Das, C. R. K. Reddy, A. Govardhan, G. Saikrishna, "Segmentation of overlapping text lines, characters in printed Telugu text document images", IJEST, Vol. 2(11), pp. 6606-6610, 2010.
[3] B. Anuradha, Arun Agarwal and C. Raghavendra Rao, "An Overview of OCR Research in Indian Scripts", IJCSES, Vol. 2, No. 2, 2008.
[4] N. Otsu, "A threshold selection method from gray-level histograms", IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-9, No. 1, 1979.
[5] C. V. Lakshmi, C. Patvardhan, "An optical character recognition system for printed Telugu text", Pattern Analysis & Applications, Vol. 7, pp. 190-204, 2004.
[6] Agarwal, David Doermann, "Voronoi++: A Dynamic Page Segmentation approach based on Voronoi and Docstrum features", 10th International Conference on Document Analysis and Recognition (ICDAR), 2009.
[7] K. S. Sesh Kumar, A. M. Namboodiri, C. V. Jawahar, "Learning Segmentation of Documents with Complex Scripts", Fifth Indian Conference on Computer Vision, Graphics and Image Processing, Madurai, India, LNCS 4338, pp. 749-760, 2006.
[8] B. M. Sagar, G. Shoba, P. Ramakanth Kumar, "Character Segmentation algorithms for Kannada optical character recognition", Proceedings of the 2008 International Conference on Wavelet Analysis and Pattern Recognition, 2008.
[9] Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing, Pearson Education, 3rd edition, 2008.
[10] Shahzad Malik, "Hand-Printed Character Recognizer using Neural Network", 95.407A Project, 2000.
[11] Andrew T. Wilson, "Off-line Handwriting Recognition Using Artificial Neural Networks", University of Minnesota, Morris, 2000.
[12] http://en.wikiversity.org/wiki/Learning_and_Neural_Networks
[13] http://page.mi.fu-berlin.de/rojas/neural/chapter/K7.pdf
[14] http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
[15] http://www.learnartificialneuralnetworks.com
M Swamy Das completed his Bachelor of Engineering in Computer Science and Engineering from Osmania University, Hyderabad, India in 1992 and his M.Tech in CSE from JNTU Hyderabad, India in 2000. He is pursuing a Ph.D from JNTU Hyderabad in the area of image processing. He is presently working as Associate Professor in the Department of Computer Science and Engineering, Chaitanya Bharathi Institute of Technology, Hyderabad, India. His research interests are in the area of optical character recognition. He has published four papers in international conferences and journals. He is a member of IETE and ISTE.
Dr. CRK Reddy completed his B.E in Computer Science and Engineering from University of Hyderabad, India in 1989, his M.Tech from JNTU Hyderabad in 1998, and his Ph.D in Computer Science and Engineering in the area of program testing in software engineering from University of Hyderabad. His areas of interest are software engineering and language computing. He is presently working as Professor and Head of the Computer Science and Engineering Department, CBIT, Hyderabad. He has published about 10 papers in national and international journals and conferences. He is a member of ISTE and IETE.

Mr. K Rahul completed his B.E. in Computer Science and Engineering from Osmania University, Hyderabad in 2011. Presently he is working for Oracle Corporation, Bangalore. His areas of interest are programming with .NET and Java.
Dr. A Govardhan completed his Bachelor of Engineering in Computer Science and Engineering from Osmania University, Hyderabad, India in 1992, his M.Tech in CSE from Jawaharlal Nehru University (JNU), New Delhi in 1994, and his Ph.D in CSE from JNTU Hyderabad in the area of databases. His areas of interest are data mining and information systems. He has produced four doctorates and has published several papers in national and international journals and conferences. He has worked at various levels at JNTU Hyderabad and JNTU Ananthapur. Presently he is working as Principal of JNTU College of Engineering, Jagityal, Karimnagar district, Andhra Pradesh, India.
Figure 7. Flow chart showing the training process with Backpropagation algorithm
Figure 8. Flow chart for recognition of Multilingual Documents