Classification of Printed Gujarati Characters using SOM based K-Nearest Neighbor Classifier
Abstract— This paper presents a method that combines a Self Organizing Map (SOM) with a k-Nearest Neighbour (k-NN) classifier to devise a classification technique, and applies it to a subset of printed Gujarati characters. Researchers have employed many different models for the classification of printed and handwritten characters in languages all over the globe; widely used classifiers include Template Matching, Artificial Neural Networks (ANN), Hidden Markov Models (HMM), and Support Vector Machines (SVM). Our attempt is to use an SOM based k-NN classifier for the classification of a subset of printed Gujarati characters. This approach does not require a prior feature identification stage; hence it is faster and more general than other approaches. A prototype system is implemented and tested on a dataset of 533 printed characters. An average accuracy of 82.36% is reported on the test dataset.
Keywords— Gujarati Character Classification, Self-Organizing Maps, K-Nearest Neighbour, Printed Character Classification.
I. INTRODUCTION
Printed character classification is the process of identifying characters in a scanned printed document (also referred to as an optical copy) and converting them into the corresponding Unicode in order to facilitate automatic searching, sorting and retrieval of documents. The need for such a system arises from the fact that, with cheap computer and internet technology available even in rural areas, an enormous amount of documents and literature is now available in scanned printed form, which demands an efficient Optical Character Recognition (OCR) system for the automatic processing of these documents. Character classification is the heart of any OCR system; hence this paper focuses on the development of an efficient character classification system for printed Gujarati characters using a novel SOM based k-NN approach that can be integrated with other modules such as pre-processing and segmentation to form a complete OCR system.
Most current OCR systems require a prior feature identification step, which makes the algorithm slow and specific to the problem at hand. The quality of recognition also depends directly on the quality of feature identification. The proposed approach does not use explicit feature identification; rather, it takes the whole image as the input pattern and, in order to avoid the curse of dimensionality, combines SOM with k-NN.
Some work can be found in the literature on character recognition for Indian scripts such as Hindi, Telugu, Kannada, Gurumukhi, Oriya, and Bangla using various classifiers such as ANN, HMM, Fuzzy, SVM and Template Matching [8][9][10][11]. However, little work can be found on the recognition of printed Gujarati characters [4][5]. To the best of our knowledge, no reference was found in the literature for the application of SOM based k-NN to the classification of printed or handwritten characters from Indian scripts. Gujarat being one of the prominent states of India, Gujarati is a popular and culturally rich language spoken in the western part of India; hence the development of an efficient OCR for the Gujarati language will play a major role in projecting Gujarati language and literature on a global scale.
The rest of the paper is organized as follows. Section II describes the basics of the Gujarati character set. Section III gives a brief review of related work. Section IV provides the methodology. Section V presents the experimental setup, results and analysis. Finally, Section VI presents the conclusion and future work.
II. BASICS OF GUJARATI CHARACTER SET
Gujarati, like most other Indian languages, is a phonetic language. Its script is derived from the Devanagari script and has a rich set of characters and modifiers that change the sound of a character. Classification of Gujarati characters is therefore more complex than classification of Roman (Latin) characters. Gujarati script is written from left to right, with each character representing a syllable. The character set of Gujarati script consists of 12 vowels, which are called Swar, and 34 consonants, which are called Vyanjan. These are shown in Figure 1 and Figure 2 respectively.
Fig. 1 Vowels of Gujarati script
Gujarati also has a set of special modifier symbols called Maatras, one corresponding to each vowel, which are attached to consonants to change their sound. The modifiers corresponding to each vowel are shown in Figure 3. The first vowel does not have a corresponding modifier; it is the basic sound of the consonants. Modifiers are placed at the top, at the bottom right, or at the bottom of the consonant. They can be attached at different positions for different consonants, and their shape can vary depending on the consonant to which they are attached.
Fig. 2 Consonants of Gujarati script
All the characters (Vyanjans and Swars) and modifiers (Maatras) together roughly provide the basic orthographic units, referred to as glyphs, which are combined in different ways to represent all the frequently used syllables [2].
Fig. 3 Special symbols in Gujarati script
III. RELATED WORK
Kohonen first proposed the Self Organizing Map and a competitive learning algorithm to map an n-dimensional vector onto a 2-dimensional grid in a way that preserves the neighbourhood property of the patterns [1]. Since then, SOM has been used in many applications where unsupervised clustering and arrangement of samples based on similarity is desired. Some efforts have also been made to apply SOM combined with another classifier for supervised classification. Goswami et al. have used SOM with CBR (Case Based Reasoning) for classification of candlestick patterns in stock data [6]. Pei-Chann et al. have used a similar approach for book sales forecasting [7]. These works demonstrate the capability of SOM and its applicability to practical problems.
Little work can be found in the literature on the classification of Gujarati characters. S. K. Shah and A. Sharma [4] have proposed a design of a complete OCR for Gujarati script that uses a Template Matching based approach for classification of characters. Fringe distance is used as the measure of similarity between template and input binary images. The overall accuracy reported was 78%. Antani and Agnihotri [5] have proposed classification of Gujarati characters using a k-NN classifier and a Minimum Hamming Distance classifier and reported an accuracy of 67%. A comprehensive survey on character classification for Indian scripts by B. B. Chaudhuri and U. Pal can be found in [11].
T. V. Ashwin and P. S. Sastry [8] have proposed an SVM based classifier for printed Kannada script. The algorithm extracts features by dividing the normalized binary image into tracks and sectors. Because the binary image is normalized, the method is independent of size and font. S. Chanda et al. [12] have proposed an SVM based scheme for Thai and English script identification and reported a very high accuracy of around 99.36%. The method identifies a number of features such as loop, water reservoir principle, component overlap, rotated “J” feature and profile feature. Although the method is very accurate, it depends heavily on these features and thus cannot be generalized to other languages. Feature identification itself is also a costly operation. B. B. Chaudhuri et al. have done pioneering work in recognition of Bangla, Hindi and Oriya scripts [9], [13].
Most of the techniques discussed so far are feature based methods and cannot be generalized or used as a universal classifier (i.e. the same algorithm used for classification of characters from a number of different languages). India being a multi-lingual country, the need for a universal OCR cannot be ignored. One of the goals of this research is to show that the proposed classification method, being independent of features, can be used for a number of other languages without major changes.
IV. METHODOLOGY
A. Self Organizing Map (SOM)
A self organizing map is a special type of neural network that has only two layers, namely input and output. The output layer is arranged as a two dimensional grid, while the input layer is a one dimensional vector. The number of nodes in the input layer is equal to the number of inputs, whereas the number of nodes in the output layer depends on the problem at hand. Each node i in the input layer is connected to every node j in the output layer by a weight Wji.
Fig. 4 Architecture of Self Organizing Map
The important characteristics of SOM are as follows.
1. It maps an N-dimensional input vector to a 2-dimensional map.
2. It preserves the neighbourhood property of input vectors (i.e. similar input vectors are placed nearby on the 2-dimensional grid).
A competitive learning algorithm is used for training the SOM. The algorithm can be summarized as follows.
Step1. Find the distance of the input vector I from the weight vector Wj for all nodes j in the output layer:
Dj = D(I, Wj)
where D can be any suitable distance function.
Step2. Select the node j with minimum Dj as the winner node.
Step3. Update the weights Wj in proportion to the learning rate l such that Wj → I (the distance is reduced).
Step4. Also update the weights of every neuron k in the neighbourhood of j, by an amount that depends on the learning rate and on the radial distance of k from j in the 2D lattice.
Step5. Execute steps 1 to 4 for every vector I in the training set (a single iteration).
Step6. Repeat until the specified number of iterations is exhausted.
For more details on SOM the reader is advised to refer to the literature [1][4].
B. k-Nearest Neighbour Classifier (k-NN)
k-NN is a distance based classification technique that works as follows.
Step1. Generate a set of reference data samples (known as the training set).
Step2. Compute the distance of the unseen data sample from every data sample in the training set.
Step3. Select the k samples from the training set that have the minimum distance from the unseen sample (known as the k nearest neighbours).
Step4. Label the unseen data sample with the class of the majority of its k nearest neighbours.
The advantages of k-NN over other methods are: 1) it is simple and effective, 2) it does not require prior training, and 3) new data samples can easily be added to the training set for future classification. However, the problems with this approach are: 1) it is not always easy to compute an accurate distance between data samples (especially when the data is categorical), and 2) the execution time varies linearly with the number of training samples [3].
C. Combining SOM with k-NN
The proposed hybrid approach uses SOM as a pre-processing stage before k-NN. All n-dimensional data samples in the training set are projected onto a 2-dimensional map using SOM. The benefits of this pre-processing stage are as follows.
1) The dimensionality of the data samples is reduced from n dimensions to 2 dimensions.
2) Neighbourhood properties of the data samples are preserved (i.e. similar data samples are projected nearby on the 2D map).
3) The distance between data samples can be computed using any of the well known distance measures (i.e. Euclidean, Manhattan, or Absolute).
4) Clusters can be formed on the map and the cluster means can be computed. The k-NN algorithm can then be modified to compute the distance of an unseen data sample from the cluster means only, instead of from all training samples.
D. Proposed Methodology for Character Classification
The proposed method is divided into two stages: 1) Training and 2) Classification. They are detailed as follows.
1) Training SOM:
Step1: Convert each NxM character image into a binary vector of size NxM.
Step2: Map all NxM binary patterns onto the SOM (using the competitive learning algorithm described in IV-A).
Step3: Find the 2D coordinate Ti(x,y) for each pattern in the training set.
Step4: Store the 2D coordinate Ti(x,y) along with the class label Ci in the database.
2) Classification:
Step1: Convert the unseen input character into an NxM binary vector.
Step2: Plot the vector on the SOM and compute its 2D coordinate N(x,y).
Step3: Compute the distance of N(x,y) from all Ti(x,y) and select the k nearest neighbours from the database.
Step4: Predict the character class using majority voting among the k neighbours, the minimum distance criterion, or a combination of both.
V. EXPERIMENTAL SETUP AND RESULT ANALYSIS
The design parameters of the SOM network are the learning radius, the learning rate and the number of iterations. As with most neural networks, there are no standard rules for selecting the parameters of a SOM; however, a few guidelines can be found in [1]. The data set used in this research consists of samples of 40 different characters (consonants and vowels) from 4 different font families normally used for printing Gujarati text, making a total of 160 training samples. Each sample is a binary image of size 30x30. Although this experiment uses only a subset of the Gujarati character set, it can easily be extended to accommodate more characters without any change in the algorithm. A prototype system is implemented using MS C#.NET 2.0 Express Edition.
Fig. 5 shows the result of the SOM after training with learning rate 0.1, learning radius 3 and 1000 iterations. The size of the SOM used in this experiment is 20x20. It can be seen that similar characters are grouped together in nearby regions as a result of the self organization of patterns.
Measuring the accuracy of a classifier is always a challenging and debatable issue, as the accuracy measure depends on many parameters, mainly the size and quality of the training and test datasets. Many different measures are suggested in the literature for different cases [14], [15], [16]. However, in most cases the simple accuracy measure (i.e. the percentage of samples classified correctly) is sufficient. Therefore, this research uses accuracy as the measure of the quality of the classifier.
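As a rough illustration of the procedure in Section IV-D, the following Python sketch trains a small SOM with the competitive learning steps and then classifies an unseen pattern by k-NN voting on the 2D map coordinates. It is not the C#.NET prototype described above: the function names, the Gaussian neighbourhood weighting, the constant learning rate and radius, the random weight initialisation, the use of the winner node's grid position as the 2D coordinate, and the toy 3x3 patterns are all assumptions made for the example. The map and iteration count are kept far smaller than the 20x20 map and 1000 iterations reported above so that the pure-Python loops finish quickly.

```python
import math
import random
from collections import Counter


def winner(weights, vec):
    """SOM steps 1-2: grid coordinate (row, col) of the node nearest to vec."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(vec, w))  # squared Euclidean
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(patterns, rows, cols, iterations, rate, radius, seed=0):
    """Competitive learning with a fixed learning rate and radius (radius > 0)."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    rng = random.Random(seed)
    dim = len(patterns[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for _ in range(iterations):
        for vec in patterns:
            wr, wc = winner(weights, vec)
            for r in range(rows):
                for c in range(cols):
                    d = math.hypot(r - wr, c - wc)  # distance on the 2D lattice
                    if d > radius:
                        continue
                    # Assumed Gaussian fall-off; the winner (d = 0) moves most.
                    h = rate * math.exp(-(d * d) / (2.0 * radius * radius))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += h * (vec[i] - w[i])
    return weights


def build_database(weights, patterns, labels):
    """Training steps 3-4: store each pattern's map coordinate with its label."""
    return [(winner(weights, p), lab) for p, lab in zip(patterns, labels)]


def classify(weights, database, vec, k=3):
    """Classification steps 2-4: project vec, then vote among the k nearest."""
    x, y = winner(weights, vec)
    nearest = sorted(database,
                     key=lambda e: math.hypot(e[0][0] - x, e[0][1] - y))[:k]
    # Ties in the vote fall to the label seen first, i.e. the nearer neighbour.
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]


# Toy data: 3x3 binary "glyphs" flattened row by row (hypothetical, not Gujarati).
train = [[1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 0, 1, 1, 1, 1]]
labels = ["top", "top", "bottom", "bottom"]
som = train_som(train, rows=5, cols=5, iterations=50, rate=0.1, radius=2)
db = build_database(som, train, labels)
print(classify(som, db, [1, 1, 0, 0, 0, 0, 0, 0, 0], k=3))
```

Because the k-NN search runs on 2D coordinates rather than on the full binary vectors, the distance computation at classification time is cheap; the cost is moved into the one-time SOM training. Breaking vote ties in favour of the nearer neighbour is one possible reading of the "combination of both" criterion in Step4.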
Because the test dataset is more diverse than the training dataset, the accuracy on the test dataset is lower than on the training dataset, as expected. Table II summarizes the accuracy statistics for all characters.
TABLE II SUMMARY OF ACCURACY FOUND FOR ALL CHARACTERS
(No. of character symbols falling in the accuracy range)
Accuracy Range    Training Dataset    Test Dataset
= 100%            37                  11
≥ 90%             0                   3
≥ 80%             0                   11
≥ 70%             2                   5
≥ 60%             0                   7
Fig. 5 Trained SOM with learning rate 0.1, learning radius 3, and number of iterations 1000
=80%.
Fig. 7 Some of the frequently used characters in the Gujarati language.
Some of the less frequently used characters, which are often misclassified because of their similar structure, are shown in Fig. 8.
Fig. 6 Scanned cut-out of “Gujarat Samachar”, one of the leading daily newspapers in Gujarat
The test dataset consists of 533 characters. The images are taken from scanned cut-outs of Gujarat’s leading newspapers, namely “Gujarat Samachar”, “Sandesh”, “Jay Hind” and “Akila” (see Fig. 6 for a cut-out of “Gujarat Samachar”). The dataset consists of images that vary in size and font. Some of the fonts differ from those used in training, and some samples are in italics. Thus care has been taken to keep enough diversity in the test dataset. Table I shows the average accuracy of the system in classifying characters from the training and test datasets.
TABLE I AVERAGE ACCURACY OF THE SYSTEM TO CLASSIFY THE CHARACTERS
                            Training Dataset    Test Dataset
Correct Classification      156 (97.5%)         439 (82.36%)
Incorrect Classification    4 (2.5%)            94 (17.63%)
Total                       160                 533
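The percentages in Table I follow directly from the counts under the simple accuracy measure (percentage of samples classified correctly). The short Python check below recomputes them; the function name is an assumption made for illustration, and only the counts come from the table.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Simple accuracy: percentage of samples classified correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Counts as reported in Table I.
train_acc = accuracy_percent(156, 160)
test_acc = accuracy_percent(439, 533)
test_err = accuracy_percent(94, 533)  # share of incorrect classifications

print(f"training accuracy: {train_acc:.2f}%")
print(f"test accuracy:     {test_acc:.2f}%")
print(f"test error:        {test_err:.2f}%")
```

The training and test accuracies reproduce the reported 97.5% and 82.36%. The test error evaluates to roughly 17.636%, so the 17.63% in the table appears to be truncated rather than rounded.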
Fig. 8 Set of similar looking characters in Gujarati language.
VI. CONCLUSIONS
A Self Organizing Map based k-NN classifier is applied to the classification of a subset of the printed Gujarati character set, and the results are found to be encouraging. As the algorithm does not use feature identification, it is generally faster than feature based algorithms and can be used for real time applications. However, there is a one-time cost involved in training the SOM. Moreover, it is found that the algorithm can be applied without any change to the classification of multiple character sets simply by including them in the training dataset. The following future enhancements can be suggested.
a) The accuracy of the classifier can be improved by further investigating frequently misclassified similar characters.
b) Include the complete character set along with modifiers and special characters rather than the subset considered here.
c) Address the issue of joint characters.
REFERENCES
[1] Teuvo Kohonen, The Self-Organizing Map, 3rd ed., Springer Publication, 2000.
[2] T. M. Cover and P. E. Hart, “Nearest Neighbor Pattern Classification,” IEEE Transactions on Information Theory, Vol. 13, pp. 21-27, 1967.
[3] P. Cunningham and S. J. Delany, “k-Nearest Neighbour Classifiers,” Technical Report, University College of Dublin, March 2007.
[4] S. K. Shah and A. Sharma, “Design and Implementation of Optical Character Recognition System to Recognize Gujarati Script using Template Matching,” IE(I) Journal-ET, Vol. 86, pp. 44-49, 2006.
[5] S. Antani and L. Agnihotri, “Gujarati Character Recognition,” in Proc. of 5th International Conference on Document Analysis and Recognition, IEEE Computer Society Press, pp. 418-421, 1999.
[6] M. M. Goswami, C. K. Bhensdadia, and A. P. Ganatra, “Candlestick Analysis based Prediction of Stock Price Fluctuation using SOM-CBR,” IEEE International Advance Computing Conference (IACC), 2009.
[7] Pei-Chann Chang and Chien-Yuan Li, “A Hybrid System combining Self Organizing Map with Case Base Reasoning in Wholesaler’s new release Book Sales Forecasting,” Expert Systems with Applications, Vol. 29, pp. 183-192, Elsevier, 2005.
[8] T. V. Ashwin and P. S. Sastry, “A Font and Size Independent OCR System for Printed Kannada Documents using Support Vector Machines,” Sadhana, Vol. 27, Part 1, pp. 35-58, 2002.
[9] B. B. Chaudhuri and U. Pal, “An OCR System to Read Two Indian Language Scripts: Bangla and Devnagari (Hindi),” 4th International Conference on Document Analysis and Recognition (ICDAR), 1997.
[10] H. Swethalakshmi, Anitha Jayaraman, V. Srinivasa Chakravarthy, and C. Chandra Sekhar, “Online Handwritten Character Recognition of Devanagari and Telugu Characters using Support Vector Machines,” 10th International Workshop on Frontiers in Handwriting Recognition, 2006.
[11] B. B. Chaudhuri and U. Pal, “Indian Script Character Recognition: A Survey,” Pattern Recognition, Vol. 37, Issue 9, pp. 1887-1899, 2004.
[12] S. Chanda, Oriol Ramos Terrades, and U. Pal, “SVM based Scheme for Thai and English Script Identification,” 9th International Conference on Document Analysis and Recognition (ICDAR07), 2007.
[13] B. B. Chaudhuri, U. Pal, and M. Mitra, “Automatic Recognition of Printed Oriya Script,” Sadhana, Vol. 22, Part I, pp. 23-34, 2002.
[14] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann Publication, 2005.
[15] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publication, 2001.
[16] Chakrabarti et al., Data Mining: Know it all, Morgan Kaufmann Publication, 2005.