A Statistical Approach to Sinhala Handwriting Recognition
S. Hewavitharana and Dr. N.D. Kodikara
Department of Computer Science, University of Colombo, Sri Lanka.
[email protected], [email protected]
ABSTRACT
Much of the research on handwriting recognition has concentrated on the Latin script, which is used by languages such as English. Little or no attention has been given to Indic language scripts such as Devanagari, Tamil and Sinhala. This paper describes a system to recognize handwritten Sinhala characters using a statistical classification approach. First, the input characters are pre-classified into one of six groups of character classes, based on some structural properties of the text lines. A statistical classifier, based on interval estimation, is used for the final recognition process. Results show 74% recognition accuracy for the first choice and 94.4% recognition accuracy for the top three choices.
1. INTRODUCTION
More and more institutions in Sri Lanka, both public and private, are introducing computers into their business activities. The use of computers is predominant in the areas of design, manufacturing and information processing. In information processing especially, computers provide a fast and accurate way to store, retrieve and share information, which is vital to today's information-driven world. One of the major obstacles to the integration of computers as information processing systems is the fact that most useful business data are still contained on paper. Particularly in public transactions in Sri Lanka, a huge amount of office paperwork is handwritten, including forms, letters, faxes and cheques. Many government institutions have massive piles of archived documents written mainly in Sinhala. The only method currently available to enter these data into a computer is to type them in manually; a painstakingly slow process indeed. In many such situations, it is highly desirable that these documents be processed by machines, for which handwriting recognition is essential. Handwriting recognition can also be used for postal address interpretation to automate mail sorting in post offices.
Handwriting recognition is the task of transforming a language representation in its spatial form of graphical marks into its symbolic representation [1]. The spatial form can take two forms: a scanned image of a handwritten document, or writing made with a special pen device on an electronic surface such as a digitizer. The two approaches are distinguished as off-line and on-line handwriting recognition, respectively. For languages based on the Latin script, the symbolic representation is typically the 8-bit ASCII representation. Most other scripts, including Sinhala, can be represented using the 16-bit UNICODE format. Handwriting recognition attempts have converged into two distinct families of classification methods: structural methods and statistical methods. Structural methods use qualitative measurements of the input as features, while statistical methods rely on quantitative measurements. Sometimes these two techniques are combined at appropriate stages to form a hybrid classifier for improved performance. Much of the research on handwriting recognition has concentrated on the English language, for which many practical systems now exist. Some efforts have been reported in the literature for Indian language scripts, including Devanagari [2], Tamil [3], [4] and Bangla [5]. There have been only a few attempts in the past to address the recognition of printed or handwritten Sinhala script. Mannapperuma [6] used a structural approach to recognize printed characters, while Rajapakse [7] used a neural network approach for handwritten character recognition. The objective of this study is to address off-line Sinhala handwriting recognition using a statistical approach. The classifier is built using the concept of interval estimation [8]. The rest of the paper is organized as follows. Section 2 looks at the Sinhala script from a pattern recognition point of view. In section 3, the proposed system is described in detail.
The results of the study are presented in section 4, followed by our conclusions and directions for future work in section 5.
Fig. 1: Three zones in a handwritten Sinhala text line. Four reference lines (upper line, upper baseline, lower baseline, lower line) delimit the upper zone, core zone and lower zone.
2. THE SINHALA SCRIPT
The Sinhala script is alphabetic in nature and words are two-dimensional compositions of symbols. The contemporary Sinhala alphabet, adopted by the UNICODE standard [9], consists of 61 symbols: 18 vowels, 2 semi-vowels and 41 consonants. The script is written from left to right and the writing system is syllabic, i.e. vowels are inherent in consonants and are not represented separately as in the Latin script. Characters in a word are written separated from each other, in a non-cursive manner. A subset of the Sinhala language known as Sudda Sinhala Hodiya, containing 25 characters, was chosen for study. Some of the consonant modifiers, which are usually written separated from the consonant, were also included in the experiment. Altogether, the selected domain contained a total of 34 classes. From a handwriting recognition point of view, it is convenient to visualize Sinhala characters in terms of three zones: a core zone, an upper zone and a lower zone, as in Fig. 1. These zones are determined by four virtual reference lines, which we call the lower line, lower baseline, upper baseline and upper line. Depending on its position with respect to the three-zone frame, each character can be categorized into one of several character groups. We call this method preliminary classification; it is explained in the next section.
3. EXPERIMENTAL SYSTEM
The experimental system consists of five major sections. The first deals with data collection, followed by preprocessing. Segmentation, preliminary classification, feature extraction and recognition are the other sections described here.
3.1 Data Collection
Data samples were collected from 5 writers who had different writing styles. Each A4-sized document contained about 20-30 words, written in lines, of the selected consonants and modifier symbols. The documents were scanned using a flat-bed scanner at a resolution of 100 dpi and stored as 8-bit grey-scale images. In preprocessing, the document images were binarized using a global thresholding technique [10] to eliminate noise and extract the handwriting.
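As an illustration of this preprocessing step, the following is a minimal sketch of global thresholding on an 8-bit grey-scale image. The fixed cut-off value is an assumption for illustration only; the paper does not specify how the global threshold of [10] is selected.

```python
def binarize(gray, threshold=128):
    """Map an 8-bit grey-scale image (list of rows of 0-255 ints) to a
    binary image: 1 = ink (dark pixel), 0 = background.
    The threshold value is illustrative, not the one used in the paper."""
    return [[1 if px < threshold else 0 for px in row] for px_row in [None] for row in gray]

# A tiny 2x3 "image" with one dark stroke column.
img = [[250, 30, 240],
       [245, 25, 235]]
binary = binarize(img)  # → [[0, 1, 0], [0, 1, 0]]
```

In practice the threshold would be derived from the image histogram (e.g. Otsu's method), but a fixed value suffices to show the mapping from grey levels to ink/background.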
3.2 Segmentation
An input image consists of a uniform text area with distinct text lines. In the segmentation process, this is broken down into constituent text lines, words, and finally into individual characters and modifier symbols. The method is based on the horizontal projection profile of the image. Zero values, or valleys, in the projection profile correspond to the horizontal gaps between text lines. Each text line is identified using two reference lines, the upper line and the lower line. They correspond to the minimum and maximum zero-value positions adjoining a text line, respectively (see Fig. 2a). In the next step, the upper baseline and lower baseline are identified for each of the segmented text lines. For this, we use a method similar to [11]. The first derivative of the horizontal projection profile is calculated for each segmented text line. The local extrema of the first derivative in the two halves of the text line are taken to be the two baselines. The lines drawn across the two peaks in Fig. 2b indicate the two baselines. To guide the writer and to simplify the process of reference line extraction, pre-formatted paper was used for data acquisition. Each document had the four reference lines printed on it. However, these lines were completely eliminated during the binarization of the image and had no effect on the segmentation. After the reference lines have been found, words and characters are extracted using the vertical projection profile of each text line. Word boundaries and character boundaries are distinguishable since the former are much wider than the latter. Modifier symbols, which appear to the left or right of a character, are also segmented at this stage.
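The line-segmentation and baseline-finding steps above can be sketched as follows. This is a simplified illustration: the exact extrema-selection details of [11] are not given in the paper, so taking the maximum derivative in the upper half and the minimum in the lower half is an assumption.

```python
def horizontal_projection(binary):
    # Ink-pixel count per image row.
    return [sum(row) for row in binary]

def text_line_bounds(profile):
    # Maximal runs of non-zero rows are text lines; the rows just inside
    # each run give the upper line and lower line of that text line.
    lines, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            lines.append((start, i - 1))
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines

def baselines(profile, top, bottom):
    # First derivative of the line's projection profile; the largest rise
    # in the upper half and the largest drop in the lower half are taken
    # as the upper and lower baselines (assumed reading of [11]).
    deriv = [profile[i + 1] - profile[i] for i in range(top, bottom)]
    mid = len(deriv) // 2
    upper = top + max(range(mid), key=lambda j: deriv[j])
    lower = top + mid + min(range(len(deriv) - mid), key=lambda j: deriv[mid + j])
    return upper, lower
```

For a profile such as [0, 2, 8, 8, 8, 2, 0], the single text line spans rows 1-5, and the baselines fall where the ink density rises and drops most sharply.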
Fig. 2a: Line segmentation using the horizontal projection profile; the upper line and lower line of each text line are marked.
Fig. 2b: Baseline identification using the local extrema of the first derivative.
Fig. 2c: Word & character segmentation using the vertical projection profile; word boundaries and character boundaries are marked.

Further segmentation is required to separate the upper or lower modifier symbols from a consonant, as in the seventh character in Fig. 2c. Once all the characters have been segmented, the minimum bounding box of each character is identified, eliminating the white space around it. The upper and lower boundary values of the minimum bounding box, along with the four reference lines, are sent to the next stage for preliminary classification.
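The word/character separation from the vertical projection profile can be sketched as below. The numeric gap-width threshold is an assumption for illustration; the paper states only that word gaps are much wider than character gaps.

```python
def gaps(v_profile):
    # Maximal runs of zero columns in the vertical projection profile.
    runs, start = [], None
    for i, v in enumerate(v_profile):
        if v == 0 and start is None:
            start = i
        elif v > 0 and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(v_profile) - 1))
    return runs

def classify_gaps(v_profile, word_gap=4):
    # Gaps at least `word_gap` columns wide are taken as word boundaries,
    # narrower gaps as character boundaries (threshold is illustrative).
    return [("word" if e - s + 1 >= word_gap else "char", s, e)
            for s, e in gaps(v_profile)]
```

For example, a profile [3, 0, 4, 0, 0, 0, 0, 0, 2] yields a one-column character gap followed by a five-column word gap.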
3.3 Preliminary Classification
The aim of the preliminary classification is to reduce the number of possible candidates for an unknown character to a subset of the total character set. For this purpose, each character in the selected domain is categorized into one of the following non-overlapping groups, depending on the relative heights of each character in the three-zone frame, as in [12]:

1. Core characters & symbols – characters which lie within the core zone
2. Ascending characters – characters which lie in the core and upper zones
3. Descending characters – characters which lie in the core and lower zones
4. Upper modifiers – symbols which lie within the upper zone
5. Lower modifiers – symbols which lie within the lower zone
Group 1 is further split into 1a and 1b, core characters and core modifiers respectively, depending on the ratio of the height to the width of the character.
Table 1 illustrates the full lists of characters categorized into the six pre-classification groups. The subsequent recognition stage concentrates only on the pre-classified group and treats the members of that group as the possible recognition candidates. Characters belonging to other groups are assumed to be invalid matches for the unknown and are not considered during recognition.
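The grouping rules can be sketched as simple bounding-box comparisons against the two baselines. The exact comparisons (no tolerance band) and the 1a/1b height-to-width threshold below are assumptions for illustration; the paper does not give numeric values. Row coordinates increase downward.

```python
def preclassify(top, bottom, upper_base, lower_base, width=None):
    """Assign a character's bounding box (top/bottom row indices) to one
    of the pre-classification groups of section 3.3."""
    if bottom <= upper_base:
        return "upper modifier"      # group 4: entirely in the upper zone
    if top >= lower_base:
        return "lower modifier"      # group 5: entirely in the lower zone
    above = top < upper_base         # ink reaches the upper zone
    below = bottom > lower_base      # ink reaches the lower zone
    if above and not below:
        return "ascender"            # group 2: core + upper zones
    if below and not above:
        return "descender"           # group 3: core + lower zones
    if above and below:
        # spans all three zones; assumed not to occur in the selected domain
        return "ascender"
    # group 1: core zone only; split 1a/1b by height-to-width ratio
    height = bottom - top
    if width is not None and height / width < 0.5:   # assumed threshold
        return "core modifier"       # group 1b
    return "core character"          # group 1a
```

With upper_base = 8 and lower_base = 22, a box spanning rows 5-15 reaches the upper zone only and is classified as an ascender; a box spanning rows 10-20 stays within the core zone.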
3.4 Feature Extraction
All the segmented character images are then scaled to a common height and width (32x32 pixels) using a bilinear interpolation technique. The slant associated with the characters is negligible due to the use of pre-formatted paper for data collection; hence, no attempt was made to perform slant normalization. A 71-dimensional feature vector is then extracted from each resized image.
Different zone sizes were used in the study, ranging from 2x2 pixels to 16x16 pixels. When the zone size was small, it captured more detailed pixel variations; however, due to the varying nature of handwriting, this produced high dissimilarity between feature vectors of the same class. Large zones failed to capture the essential parts of characters, which make them distinct from others. The best results were produced by 4x4 pixel zones; therefore, we decided to use 4x4 zones for feature extraction.
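Assuming the 32x32 normalized binary image described above, the 4x4 zoning yields an 8x8 grid of 64 zone densities, each in 0-16. A minimal sketch of the zone densities (features D1-D64 of Fig. 3) together with the four half-image densities (d1-d4):

```python
def zone_densities(binary32):
    """binary32: 32x32 binary image (32 rows of 32 ints in {0, 1}).
    Divide it into an 8x8 grid of 4x4 zones and count ink pixels per zone,
    traversing zones row by row, giving 64 values each in 0..16."""
    feats = []
    for zr in range(8):
        for zc in range(8):
            feats.append(sum(binary32[zr * 4 + r][zc * 4 + c]
                             for r in range(4) for c in range(4)))
    return feats

def half_densities(binary32):
    """d1-d4: ink-pixel counts of the top, bottom, left and right halves."""
    top = sum(sum(row) for row in binary32[:16])
    bottom = sum(sum(row) for row in binary32[16:])
    left = sum(sum(row[:16]) for row in binary32)
    right = sum(sum(row[16:]) for row in binary32)
    return [top, bottom, left, right]
```

An all-ink 32x32 image gives 16 for every zone and 512 for each half, which bounds the feature ranges.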
Fig. 3: Layout of the 71-dimensional feature vector.

The first three features correspond to the pre-classification group, the height and the width of the character image. Features d1, d2, d3 and d4 contain the pixel densities of the top, bottom, left and right halves of the image, respectively. To construct the remaining features, the character image is divided into 8 horizontal and 8 vertical strips, producing 64 equal-sized zones. The pixel density is computed for each of the 64 zones. These densities form a sequence of values ranging from 0 to 16, which correspond to features D1 to D64. The zones are traversed row by row, so that the first eight features correspond to the first row, the second eight to the second row, and so on.

3.5 Recognition
A statistical classifier based on interval estimation is used for the recognition process. For each zone of size 4x4 pixels, we calculate an interval of values within which the mean pixel density of the population lies with 95% confidence. The upper and lower confidence limits of region i are calculated as follows:

    Lower limit of conf. interval: LCL_i = x̄_i − 1.96 s_i / √n
    Upper limit of conf. interval: UCL_i = x̄_i + 1.96 s_i / √n

where x̄_i is the sample mean, s_i² is the sample variance and n is the sample size.
30 sample images from each character class were chosen as training data. For each class, LCL_i and UCL_i are calculated for each of the 71 features except the first (the pre-classification group), and stored as the classifier. Upon receipt of an unknown image, the recognition process first extracts the feature vector and then compares its values with the corresponding confidence intervals of the classifier. If a value is within the confidence interval, it is indicated by 1; otherwise it is indicated by 0. The total number of matches is counted for each candidate class. The class with the highest number of matches is taken to be the one to which the character image belongs. The best 3 matches are presented in descending order.

                    Number of Char.      %
    Correct              493           98.6
    Error                  7            1.4
    Total                500          100

Table 2: Preliminary classification results

                      Test Set        Trained Set
                      #      %        #      %
    Top 1           370    74.0     224    89.6
    Top 2            70    88.0      17    96.4
    Top 3            32    94.4       2    97.2
    Misrecognition   27     5.6       7     2.8
    Total           500   100.0     250   100.0

Table 3: Recognition results
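The training and matching procedure can be sketched as follows. This is a simplified illustration assuming per-feature 95% intervals computed with the normal approximation, following the LCL/UCL formulas of section 3.5; function and variable names are illustrative, not from the paper.

```python
import math

def confidence_interval(samples):
    """95% confidence interval for the mean of one feature:
    x̄ ± 1.96 * s / sqrt(n), with s the sample standard deviation."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    half = 1.96 * math.sqrt(var) / math.sqrt(n)
    return mean - half, mean + half

def train(class_samples):
    """class_samples: {label: list of feature vectors}.
    For each class, store one (LCL, UCL) pair per feature."""
    return {label: [confidence_interval([v[i] for v in vecs])
                    for i in range(len(vecs[0]))]
            for label, vecs in class_samples.items()}

def classify(model, features, top_k=3):
    """Count, per class, how many feature values fall inside the stored
    intervals; return the top_k classes by match count."""
    scores = {label: sum(1 for f, (lo, hi) in zip(features, ivals)
                         if lo <= f <= hi)
              for label, ivals in model.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A class whose training samples cluster around the unknown's feature values wins the match count, mirroring the 1/0 interval-membership scoring described above.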
4. EXPERIMENTAL RESULTS
We trained the system with more than 1000 characters belonging to all the classes. The test data contained a separate set of 500 characters. A portion of the training data was also used to test the system, to see how well the system represents the data it has been trained on. A total of 50 text lines were subjected to segmentation and reference line identification. In all cases, every character in each text line was correctly segmented. The reference line identification was almost 99% accurate, resulting in only about a 1% pre-classification error. Table 2 contains the results of the preliminary classification process. Of the 500 characters investigated, 98.6% were successfully classified into the correct group. Recognition results are given in Table 3. On the test set, a recognition rate of 74% was achieved for the 1st choice and 94.4% for the top 3 choices. Understandably, the training set produced a much higher recognition rate than the test set.
5. CONCLUSIONS
In this paper, we have presented a system to recognize off-line Sinhala handwriting using a classifier based on a statistical approach. In the first stage, the number of possible candidates for a character is narrowed down to a smaller subset of the alphabet. The actual recognition is then performed by the interval-estimation classifier. The results show a 94.4% recognition rate for the top three choices. The main recognition errors were due to abnormal writing and ambiguity among similarly shaped characters. Abnormal writing caused a character to be pre-classified into a wrong group, thereby resulting in a misrecognition. Most of the confusion was between characters such as 'Ta', 'Va', 'Ma' and 'Tha', 'Ka', 'Na'. This could be avoided by using a word dictionary to look up possible character compositions; the presence of contextual knowledge would help to eliminate the ambiguity. Extension of the system to cater for the full Sinhala alphabet would require splitting a composite character into basic recognizable symbols. The methods used in [13] and [14] are under investigation for this purpose. Future work could also include extracting more robust features to give the classifier better discrimination power. We strongly feel that the method of pre-classification would achieve much higher accuracy if applied to optical character recognition, since printed characters preserve the correct positioning in the three-zone frame. Moreover, the classification method used in this study would be applicable to other languages such as Tamil. Since Tamil is one of the official languages of Sri Lanka, an attempt to extend the system to cater for Tamil characters would be of high national interest.
ACKNOWLEDGEMENT We would like to thank Mrs. H. C. Fernando for her contribution to building the classifier and all the people who helped us in collecting data.
REFERENCES
[1] R. Plamondon and S. N. Srihari, "On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 63-84, Jan. 2000.
[2] V. Bansal and R. M. K. Sinha, "On how to describe shapes of Devanagari characters and use them for recognition", Proc. 5th Int. Conf. Document Analysis and Recognition, Bangalore, India, pp. 410-413, Sept. 1999.
[3] P. Chinnuswamy and S. G. Krishnamoorthy, "Recognition of handprinted Tamil characters", Pattern Recognition, vol. 12, pp. 141-152, 1980.
[4] N. Damayanthi and P. Thangavel, "Handwritten Tamil character recognition using Neural Network", Proc. Tamil Internet 2000 Conference, Singapore, July 2000.
[5] B. B. Chaudhuri and U. Pal, "A complete printed Bangla OCR system", Pattern Recognition, vol. 31, no. 5, pp. 531-549, 1997.
[6] H. Mannapperuma, A Method to Recognize Sinhala Characters, B.Sc. Dissertation, Dept. of Statistics and Computer Science, University of Colombo, October 1994.
[7] R. K. Rajapakse, A Neural Network based Character Recognition System for Sinhala Scripts, M.Sc. Thesis, Dept. of Statistics and Computer Science, University of Colombo, 1995.
[8] S. Hewavitharana, A Statistical Approach to Sinhala Handwriting Recognition, B.Sc. Dissertation, Dept. of Computer Science, University of Colombo, December 2001.
[9] The Unicode Consortium, The Unicode Standard 3.0, Harlow: Addison Wesley, 2000.
[10] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Addison Wesley, 1993.
[11] R. M. Bozinovic and S. N. Srihari, "Off-line cursive script word recognition", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, no. 1, pp. 68-83, Jan. 1989.
[12] J. B. Disanayake, Nuthana Sinhala Lekhana Vyakarana 1: Akshara Vinyasaya (in Sinhala), Lake House Investment Co. Ltd., pp. 444-461, 1990.
[13] M. Lohakan, S. Airphaiboon and M. Sangworasil, "Single-character segmentation for handprinted Thai word", Proc. 5th Int. Conf. Document Analysis and Recognition, Bangalore, India, Sept. 1999.
[14] A. Bishnu and B. B. Chaudhuri, "Segmentation of Bangla handwritten text into characters by recursive contour following", Proc. 5th Int. Conf. Document Analysis and Recognition, Bangalore, India, Sept. 1999.