2011 International Conference on Image Information Processing (ICIIP 2011)
Classification of Characters and Grading Writers in Offline Handwritten Gurmukhi Script

Munish Kumar, Computer Science Department, Panjab University Constituent College, Muktsar, Punjab, India, [email protected]
M. K. Jindal, Department of Computer Science and Applications, Panjab University Regional Centre, Muktsar, Punjab, India, [email protected]
R. K. Sharma, School of Mathematics and Computer Applications, Thapar University, Patiala, Punjab, India, [email protected]

Abstract— Grading of writers based on their handwriting is a complex task, mainly because of the various writing styles of different individuals. In this paper, we have attempted grading of writers based on offline Gurmukhi characters written by them. Grading has been accomplished based on statistical measures of the distribution of points on the bitmap image of characters. The gradation features used for classification are based on zoning, which can uniquely grade the characters. In this work, one hundred different Gurmukhi handwritten data sets have been used for grading the handwriting. We have used zoning; diagonal; directional; intersection and open end points; and Zernike moments feature extraction techniques in order to obtain the feature sets, and k-NN, HMM and Bayesian decision making classifiers for classification.

Keywords- Handwritten character recognition; feature extraction; classification; k-NN; HMM; Bayesian

I. INTRODUCTION
Handwritten Character Recognition, usually abbreviated to HCR, is the translation of handwritten text into a machine processable format. HCR is a field of research in pattern recognition and artificial intelligence. HCR can be online or offline. In online handwriting recognition, data are captured during the writing process with the help of a special pen and an electronic surface. Offline documents are scanned images of prewritten text, generally on a sheet of paper. Recognition of offline handwritten characters has been an active research area in the field of pattern recognition. Offline handwriting recognition is significantly different from online handwriting recognition because stroke information is not available [1, 2]. A handwritten character recognition system consists of the activities, namely, digitization, pre-processing, segmentation, feature extraction and classification. Offline handwritten character recognition systems are popular for cheque reading, postcode recognition and signature verification. A printed Gurmukhi script recognition system has been presented by Lehal and Singh [3]. The feature extraction stage analyzes a handwritten character image and selects a set of features that can be used for uniquely grading the character. Some researchers have used structural features for character recognition. Different feature extraction methods have been presented for the representation of characters, such as template matching, projection histograms, contour profiles, zoning, Zernike moments, gradient features and Gabor features. Pal et al. [4, 5] have presented a technique for offline Bangla handwritten compound character recognition using a modified quadratic discriminant function. They have also presented a technique for feature computation of numeral images, in which the numeral image is segmented into zones, directional features are computed in each of the individual zones, and a modified quadratic discriminant function is then used for recognition. They have also used curvature features for recognizing Oriya characters [6]. Hanmandlu et al. [7] have reported grid based features for handwritten Hindi numerals: they have divided the input image into 24 zones and computed, for each pixel position in the grid, the normalized vector distance from the bottom left corner. In the present work, a writer grading system is proposed based on three recognition methods, namely, k-NN, HMM and Bayesian decision making.

II. GURMUKHI SCRIPT
Gurmukhi script is the script used for writing the Punjabi language and is derived from the old Punjabi term "Guramukhi", which means "from the mouth of the Guru". Gurmukhi script has three vowel bearers, thirty two consonants, six additional consonants, nine vowel modifiers, three auxiliary signs and three half characters. Gurmukhi script is the 12th most widely used script in the world. Gurmukhi script is written from top to bottom and left to right, and there is no case sensitivity. The character set of Gurmukhi script is given in Figure 1. In Gurmukhi script, most of the characters have a horizontal line at the upper part, called the headline, and characters are connected with each other through this line. In Gurmukhi script, some characters are quite similar to each other, which makes their recognition a bit complex.

The Consonants
ਸਹਕਖਗਘਙਚਛਜਝਞਟਠਡ
ਢਣਤਥਦਧਨਪਫਬਭਮਯਰਲਵੜ
The Vowel Bearers
ੳਅੲ
The Additional Consonants (Multi Component Characters)
ਸ਼ ਜ਼ ਖ਼ ਫ਼ ਗ਼ ਲ਼

The Vowel Modifiers
◌ੋ ◌ੌ ◌ੇ ◌ੈ ਿ◌ ◌ੀ ◌ਾ ◌ੁ ◌ੂ

Auxiliary Signs
◌ੱ ◌ੰ ◌ਂ

The Half Characters
◌੍ਹ ◌੍ਰ ◌੍ਵ

Figure 1. Gurmukhi script character set.

III. THE PROPOSED GRADING SYSTEM
The proposed grading system consists of the phases, namely, digitization, pre-processing, feature extraction, classification and grading. The block diagram of the proposed system is given in Figure 2.

Figure 2. Block diagram of the handwriting grading system.

A. Digitization
Digitization is the process of converting a paper based handwritten document into electronic form. The electronic conversion is accomplished by scanning the document and producing an electronic representation of the original in the form of a bitmap image. Digitization produces the digital image, which is fed to the pre-processing phase.

B. Pre-processing
Pre-processing is a series of operations performed on the digital image and is the initial stage of character recognition. In this phase, the character image is normalized into a window of size 100×100. After normalization, we produce a bitmap image of the normalized image. The bitmap image is then transformed into a contour image. This pre-processing is illustrated in Figure 3 (a) and (b) for the Gurmukhi character ਕ.

C. Feature extraction
In this phase, the features of the input characters are extracted. The performance of a handwriting recognizer depends on the features being extracted; the extracted features should be able to classify each character uniquely. We have used zoning, diagonal, directional, intersection and open end points, and Zernike moments based features for gradation of the offline Gurmukhi handwriting of individuals. Here, we first transform the input character image into a contour image, as shown in Figure 3(a) and Figure 3(b).

Figure 3. (a) Digitized image of Gurmukhi character ਕ. (b) Contour image of Gurmukhi character ਕ.

1) Zone based feature extraction
In this technique, one has to first normalize the character image into a window of some size, say, 100×100. The normalized image is then divided into n equal zones, and the density of ON pixels in each zone is calculated. This method is presented in the following algorithm.
Step I: Normalize the input character image.
Step II: Divide the input image into n zones.
Step III: Calculate the density of ON pixels in each of the n zones.
Step IV: Normalize the values in the feature set to [0, 1].
A set of n features can thus be generated for each character written by different writers.
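To make the zone based computation concrete, a minimal Python sketch of Steps I–IV is given below, assuming the character is already available as a 100×100 binary (0/1) bitmap; the function name zoning_features and the use of NumPy are our own choices for illustration, not part of the paper.

```python
# A minimal sketch of the zone based (density) features described above,
# assuming a 100x100 binary bitmap split into n = 100 zones of 10x10 pixels.
# This is an illustrative reconstruction, not the authors' implementation.
import numpy as np

def zoning_features(bitmap, zone_size=10):
    # bitmap: 2-D NumPy array of 0/1 values, already normalized to 100x100
    h, w = bitmap.shape
    features = []
    for r in range(0, h, zone_size):
        for c in range(0, w, zone_size):
            zone = bitmap[r:r + zone_size, c:c + zone_size]
            features.append(float(zone.sum()) / zone.size)  # ON-pixel density, already in [0, 1]
    return np.asarray(features)  # one feature per zone, n = 100 for a 100x100 image
```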
2) Diagonal feature extraction
Diagonal features are very important for achieving higher recognition accuracy and reducing misclassification. These features are extracted from the pixels of each zone by moving along its diagonals, as shown in Figure 4.

Figure 4. Diagonal feature extraction.

The steps that have been used to extract these features are given below.
Step I: Divide the input image into n (=100) zones, each of size 10×10 pixels.
Step II: The features are extracted from the pixels of each zone by moving along its diagonals.
Step III: Each zone has 19 diagonals; the foreground pixels present along each diagonal are summed up in order to get a single sub-feature.
Step IV: These 19 sub-feature values are averaged to form a single value, which is placed in the corresponding zone as its feature.
Step V: For the zones whose diagonals do not have any foreground pixel, the feature value is taken as zero.
Using this algorithm, we will obtain n features, one corresponding to every zone.
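As an illustration of Steps I–V, a hedged Python sketch of the per-zone diagonal feature is given below; the 10×10 zone size (and hence 19 diagonals) follows the text above, while the function names and NumPy usage are our assumptions.

```python
# Sketch (our reconstruction) of the diagonal feature of a single 10x10 zone:
# the foreground pixels along each of the 19 diagonals are summed, and the
# 19 sub-features are averaged into one value (zero if the zone is empty).
import numpy as np

def diagonal_feature(zone):
    size = zone.shape[0]
    diagonal_sums = [np.trace(zone, offset=k) for k in range(-(size - 1), size)]  # 19 sums
    if sum(diagonal_sums) == 0:           # Step V: no foreground pixel on any diagonal
        return 0.0
    return float(np.mean(diagonal_sums))  # Step IV: average of the 19 sub-features

def diagonal_features(bitmap, zone_size=10):
    h, w = bitmap.shape
    return np.asarray([diagonal_feature(bitmap[r:r + zone_size, c:c + zone_size])
                       for r in range(0, h, zone_size)
                       for c in range(0, w, zone_size)])
```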
3) Directional feature extraction
The following algorithm has been used to obtain directional features for a given character.
Step I: Divide the input image into n (=100) zones, each of size 10×10 pixels.
Step II: The features, as shown in Figure 5, are extracted from the pixels of each zone.
Step III: Find the positions of the starting ON pixel (x) and the ending ON pixel (y) in each zone and calculate the directional distance between these two positions.
Step IV: For the zones with zero foreground pixels, the feature value is taken as zero.
Thus, a feature set of n elements will again be obtained for a given character.

Figure 5. Directional feature extraction (starting ON pixel and ending ON pixel in a zone).
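The exact distance formula of Step III is not legible in this copy of the paper, so the sketch below assumes the Euclidean distance between the first and last ON pixels of a zone in row-major scan order; treat it as an illustration of the idea rather than the exact feature.

```python
# Hedged sketch of the directional feature of one zone: the "starting" and
# "ending" ON pixels are taken as the first and last foreground pixels in
# row-major order, and their Euclidean distance is used (an assumption,
# since the exact formula is not legible in this copy).
import numpy as np

def directional_feature(zone):
    ys, xs = np.nonzero(zone)            # positions of ON pixels, row-major order
    if len(xs) == 0:                     # Step IV: zone with zero foreground pixels
        return 0.0
    start = np.array([ys[0], xs[0]])     # starting ON pixel
    end = np.array([ys[-1], xs[-1]])     # ending ON pixel
    return float(np.linalg.norm(end - start))
```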
4) Intersection and open end points feature extraction
We have also extracted intersection and open end points for each zone. An intersection point is a pixel that has more than one pixel in its neighborhood, and an open end point is a pixel that has only one pixel in its neighborhood.
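A small sketch of counting these points per zone on a thinned or contour image is given below, using the definitions above; the 8-neighbourhood and the function name are our assumptions.

```python
# Sketch (our reconstruction) counting intersection and open end points in a
# zone of a thinned/contour image, using the definitions above: an open end
# point has exactly one ON pixel in its 8-neighbourhood, an intersection
# point has more than one.
import numpy as np

def intersection_open_end_counts(zone):
    h, w = zone.shape
    intersections = open_ends = 0
    for r in range(h):
        for c in range(w):
            if not zone[r, c]:
                continue
            r0, r1 = max(r - 1, 0), min(r + 2, h)
            c0, c1 = max(c - 1, 0), min(c + 2, w)
            neighbours = int(zone[r0:r1, c0:c1].sum()) - 1  # exclude the pixel itself
            if neighbours == 1:
                open_ends += 1
            elif neighbours > 1:
                intersections += 1
    return intersections, open_ends
```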
5) Zernike moments feature extraction
Zernike [9] introduced a set of complex polynomials $\{V_{nm}(x, y)\}$ that forms a complete orthogonal set over the unit disk $x^2 + y^2 \le 1$. The form of the polynomials is

$$V_{nm}(x, y) = V_{nm}(\rho, \theta) = R_{nm}(\rho)\,\exp(jm\theta),$$

where n is a positive integer or zero; m is an integer subject to the constraints that $n - |m|$ is even and $|m| \le n$; $\rho$ is the length of the vector from the origin to the pixel $(x, y)$; $\theta$ is the angle between the vector $\rho$ and the x-axis in the counter clockwise direction; and $j = \sqrt{-1}$. $R_{nm}(\rho)$ is the radial polynomial defined as

$$R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} \frac{(-1)^s\,(n-s)!}{s!\,\left(\frac{n+|m|}{2}-s\right)!\,\left(\frac{n-|m|}{2}-s\right)!}\,\rho^{\,n-2s}.$$

Zernike polynomials are defined only inside the unit circle, $x^2 + y^2 \le 1$, and therefore the calculation of Zernike moments requires a linear coordinate transformation from the image space to the interior of the unit circle. To achieve translation and scale invariance, extra normalization processes are required. We have achieved this by moving the centre of the character image to the image centroid.
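The following sketch computes the rotation-invariant magnitude $|Z_{nm}|$ of a Zernike moment for a binary character image by mapping the foreground pixels into the unit disk centred at the image centroid; it is a discrete, unnormalized approximation written for illustration under these assumptions, not the authors' implementation.

```python
# Hedged sketch of Zernike moment magnitudes |Znm| for a binary image: pixels
# are mapped into the unit disk centred at the centroid (translation/scale
# handling as described above), R_nm is evaluated from the radial polynomial,
# and the image is projected onto V_nm. Discrete, unnormalized approximation.
import numpy as np
from math import factorial

def radial_polynomial(n, m, rho):
    m = abs(m)
    value = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        coeff = ((-1) ** s * factorial(n - s) /
                 (factorial(s) * factorial((n + m) // 2 - s) * factorial((n - m) // 2 - s)))
        value += coeff * rho ** (n - 2 * s)
    return value

def zernike_moment_magnitude(bitmap, n, m):
    ys, xs = np.nonzero(bitmap)
    cy, cx = ys.mean(), xs.mean()                  # image centroid
    scale = max(bitmap.shape) / 2.0
    y, x = (ys - cy) / scale, (xs - cx) / scale    # map pixels towards the unit square
    rho = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0                            # Zernike polynomials live on the unit disk
    V = radial_polynomial(n, m, rho[inside]) * np.exp(-1j * m * theta[inside])
    Z = (n + 1) / np.pi * V.sum()
    return float(abs(Z))                           # rotation-invariant magnitude
```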
D. Classification
The classification phase is the decision making phase of an HCR engine. This phase uses the features extracted in the previous stage for making the class membership decision. In this paper, we have used the k-nearest neighbor (k-NN) classifier, the Hidden Markov Model (HMM) classifier and the Bayesian classifier. In the k-nearest neighbor classifier, the Euclidean distances from the candidate vector to the stored vectors are computed. The Euclidean distance between a candidate vector and a stored feature vector is given by

$$d = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}.$$

Here, N is the total number of features in the feature set, $x_i$ is the library stored feature value and $y_i$ is the candidate feature value. HMMs are probabilistic pattern matching techniques that have the ability to absorb both the variability and the similarities between stored and input feature values. An HMM is a finite state machine that can move to a next state at each time unit; with each move, an observation vector is generated. The probabilities of the HMM are calculated using observation vectors extracted from samples of handwritten Gurmukhi characters. Recognition of an unknown character is based on the probability that the unknown character was generated by the HMM [10]. The Bayesian classifier is a statistical approach that allows one to design the optimal classifier when the complete statistical model is known.
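As an illustration of the k-NN rule with the distance d defined above, a minimal sketch is given below; the data layout (NumPy arrays of stored feature vectors and labels) and the majority vote for k > 1 are our assumptions, not details from the paper.

```python
# Minimal sketch (not the authors' code) of k-NN classification with the
# Euclidean distance d defined above.
import numpy as np
from collections import Counter

def knn_classify(candidate, stored_features, stored_labels, k=1):
    # Euclidean distance from the candidate vector to every stored vector
    distances = np.sqrt(((stored_features - candidate) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # class label by majority vote among the k nearest stored samples
    return Counter(stored_labels[i] for i in nearest).most_common(1)[0][0]
```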
IV. DATA COLLECTION
We have gathered data for this work from 100 different writers. These writers were required to write each Gurmukhi character. A sample of these characters is given in Figure 6.
Figure 6. Samples of handwritten Gurmukhi characters.

V. EXPERIMENTAL RESULTS AND DISCUSSION
In this work, we have taken samples of Gurmukhi characters from one hundred different writers (W1, W2, …, W100). In order to establish the correctness of our approach, we have also considered these characters taken from five Gurmukhi fonts: amrit, Anandpur Sahib, Granthi, LMP_TARAN and Maharaja (F1, F2, …, F5, respectively). In the process of grading, we have found the writer with the best handwriting, based on the characters taken in the data set. We have calculated the average score obtained by each writer when the five features and three classifiers are used. The results, in the form of the average score achieved by the writers, are given in Figure 7. One can infer from this figure that writer 6 is the best writer, as he obtained the highest score when compared with the other writers.
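A small sketch of the writer-level averaging is shown below; how each individual (feature, classifier) score is produced is not specified here, so the score table itself is an assumption made only for illustration.

```python
# Sketch of writer grading by averaging: given a score table indexed by
# (writer, feature, classifier) -- the table itself is assumed, since the
# per-classifier scoring is not detailed here -- each writer's grade is the
# mean over the 5 features and 3 classifiers, and writers are ranked by it.
import numpy as np

def grade_writers(scores):
    # scores: array of shape (num_writers, 5, 3)
    averages = scores.mean(axis=(1, 2))    # one average score per writer
    ranking = np.argsort(averages)[::-1]   # indices of writers, best first
    return averages, ranking
```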
Figure 7. Average grading of characters, writer wise (bar chart of the average score of each writer; vertical axis: score, horizontal axis: writers).

VI. CONCLUSION
In the present paper, a grading system for Punjabi writers based on off-line handwritten Gurmukhi characters has been proposed, using zoning, directional, diagonal, intersection and open end points, and Zernike moments features together with k-NN, HMM and Bayesian classifiers. Using this approach, one can compare the handwriting of one writer with that of other writers and associate a score with the handwriting of an individual. This approach can also be extended to other Indian scripts such as Devanagari, Bengali, Tamil, etc.

REFERENCES
[1] M. Lorigo and V. Govindaraju, "Offline Arabic handwriting recognition: a survey", IEEE Transactions on PAMI, Vol. 28(5), pp. 712-724, 2006.
[2] R. Plamondon and S. N. Srihari, "On-line and off-line handwritten character recognition: A comprehensive survey", IEEE Transactions on PAMI, Vol. 22(1), pp. 63-84, 2000.
[3] G. S. Lehal and C. Singh, "A Gurmukhi script recognition system", in the proceedings of 15th ICPR, Vol. 2, pp. 557-560, 2000.
[4] U. Pal, T. Wakabayashi and F. Kimura, "Handwritten Bangla compound character recognition using gradient feature", in the proceedings of 10th ICIT, pp. 208-213, 2007.
[5] U. Pal, T. Wakabayashi and F. Kimura, "Handwritten numeral recognition of six popular scripts", in the proceedings of ICDAR 2007, Vol. 2, pp. 749-753, 2007.
[6] U. Pal, T. Wakabayashi and F. Kimura, "A system for off-line Oriya handwritten character recognition using curvature feature", in the proceedings of 10th ICIT, pp. 227-229, 2007.
[7] M. Hanmandlu, J. Grover, V. K. Madasu and S. Vasikarla, "Input fuzzy for the recognition of handwritten Hindi numerals", in the proceedings of ITNG 2007, pp. 208-213, 2007.
[8] S. V. Rajashekararadhya and S. V. Ranjan, "Zone based feature extraction algorithm for handwritten numeral recognition of Kannada script", in the proceedings of IACC 2009, pp. 525-528, 2009.
[9] J. Tripathy, "Reconstruction of Oriya alphabets using Zernike moments", IJCA, Vol. 8(8), pp. 26-32, 2010.
[10] L. R. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition", Proceedings of the IEEE, Vol. 77, pp. 257-286, 1989.
[11] M. K. Jindal, "Degraded Text Recognition of Gurmukhi Script", PhD Thesis, Thapar University, Patiala, India, 2008.
[12] U. Pal and B. B. Chaudhuri, "Indian script character recognition: a survey", Pattern Recognition, Vol. 37, pp. 1887-1899, 2004.
[13] V. Shapiro, G. Gluhchev and V. Sgurev, "Handwritten document image segmentation and analysis", Pattern Recognition Letters, Vol. 14(1), pp. 71-78, 1993.
[14] S. Mori, K. Yamamoto and C. Y. Suen, "Historical review of OCR research and development", Proceedings of the IEEE, Vol. 80(7), pp. 1029-1058, 1992.
[15] M. K. Jindal, G. S. Lehal and R. K. Sharma, "On segmentation of touching characters and overlapping lines in degraded printed Gurmukhi script", International Journal of Image and Graphics (IJIG), Vol. 9(3), pp. 321-353, 2009.
[16] M. K. Jindal, G. S. Lehal and R. K. Sharma, "Segmentation of touching characters in upper zone in printed Gurmukhi script", in the proceedings of the 2nd Bangalore Annual Compute Conference (COMPUTE '09), Bangalore, India, January 9-10, 2009, ACM, New York, NY, pp. 1-6. DOI: http://doi.acm.org/10.1145/1517303.1517313
[17] K. Wong, R. Casey and F. Wahl, "Document Analysis System", IBM Journal of Research and Development, Vol. 26(6), pp. 647-656, 1982.
[18] V. Bansal and R. M. K. Sinha, "Segmentation of touching and fused Devanagari characters", Pattern Recognition, Vol. 35(4), pp. 875-893, 2002.