COMPARISON OF SUPPORT VECTOR MACHINE AND NEURAL NETWORK IN CHARACTER LEVEL DISCRIMINANT TRAINING FOR ONLINE WORD RECOGNITION
Abdul Rahim Ahmad¹, Marzuki Khalid², Christian Viard-Gaudin³, Emilie Poisson³

¹ Universiti Tenaga Nasional, Km 7, Jalan Kajang-Puchong, 43009 Kajang, Selangor. [email protected]
² Centre for Artificial Intelligence and Robotics, Universiti Teknologi Malaysia, Jalan Semarak, 54100 Kuala Lumpur. {marzuki,rubiyah}@utmkl.utm.my
³ IRCCyN, Division ISA, UMR 6597, Equipe Image, Vidéo et Communications, Ecole polytechnique de l'université de Nantes. [email protected]
ABSTRACT

Discrete Hidden Markov Models (HMM) and hybrids of Neural Networks (NN) and HMM are popular methods for handwritten word recognition systems. In the hybrid system, the NN is used for character-level recognition while the HMM produces the word score by combining the probabilities of the hypothesized characters. Reported results consistently show better recognition for the hybrid system, owing to the better discrimination capability of the NN. A more recent recognition method based on the Support Vector Machine (SVM) has been suggested as an alternative to the NN. In speech recognition (SR), the SVM has been used successfully in the context of a hybrid SVM/HMM system, where it gives better recognition results than a hybrid NN/HMM system. This paper describes part of the joint work between CAIRO, UTM and IRCCyN, Nantes, France in developing a hybrid SVM/HMM based word recognition system. It mainly compares NN and SVM in a word recognition system based on character-level discriminant training.

Keywords: Hidden Markov Model (HMM), Neural Network, handwritten word recognition, Support Vector Machine (SVM), speech recognition (SR), character level discriminant training.
1.0 INTRODUCTION

Handwriting recognition systems operate in one of two domains: online or offline (figure 1). The two are differentiated by the nature of their input signals. In offline systems, a static representation of a digitized document is used, in applications such as cheque, form, mail or document processing. Online handwriting recognition (OHR) systems, on the other hand, rely on information acquired during the production of the handwriting; they require specific equipment that can capture the trajectory of the writing tool. Mobile communication systems such as Personal Digital Assistants (PDA), electronic pads and smartphones have online handwriting recognition interfaces integrated into them.
[Figure 1 – The two domains of handwriting recognition: online (real-time recognition based on traces of pen movement on a pad) and offline (recognition of scanned images of printed documents)]

Although systems with handwriting recognition capability are already widely available in the market, the recognition performance can still be improved, while constraining parameter storage space and increasing processing speed. Many current systems use a discrete Hidden Markov Model based recognizer or a hybrid of a Neural Network (NN) and a Hidden Markov Model (HMM). Figure 2 shows a hybrid NN/HMM OHR system. Online information captured by the input device first goes through filtering, preprocessing and normalization. After normalization, the writing is usually segmented into basic units (normally characters or parts of characters) and each segment is classified and labeled. Using an HMM search algorithm in the context of a language model, the most likely word path is then returned to the user as the intended string [1]. The segmentation process can be performed in various ways. However, the observation probability for each segment is normally obtained using a neural network (NN), while the probabilities of transitions within a resulting word path are estimated by a Hidden Markov Model (HMM).
[Figure 2 – A hybrid NN/HMM OHR system: word image, word normalization (lines of reference and resampled points), feature extraction (word feature vectors), NNs or SVMs (segments of character hypotheses and their likelihood values), and HMM (word likelihood, e.g. "un sept frs")]
In this work, support vector machines (SVM) are used in place of the NN, giving a hybrid SVM/HMM recognition system. The main objective is to further improve the recognition rate by using an SVM at the segment/character classification level. This is motivated by the successful earlier work of Ganapathiraju [4] on a hybrid SVM/HMM speech recognition (SR) system and the work of Bahlmann [8] on OHR. Ganapathiraju obtained better recognition rates than a hybrid NN/HMM SR system. A basic multiclass SVM is first used to train an OHR system on character databases. An SVM with probabilistic outputs is then used in the hybrid system, where it is integrated with the HMM module for word recognition. Some results of using SVM for character recognition are given and compared with the NN results reported by Poisson [9]. The following databases were used: IRONOFF, UNIPEN and the combined IRONOFF-UNIPEN database. For word recognition, previous works attempted discriminant training at both the character level and the word level. In this paper, both methods are described, but only character-level discriminant training with NN and SVM is compared. This paper is organized as follows: Section 2 discusses the online word recognition system in more depth. NN and SVM are compared and word recognizer training is discussed in section 3. The character recognition experiments with SVM, the databases used and the experimental results are discussed in section 4. Section 5 concludes the paper.
2.0 ONLINE WORD RECOGNITION SYSTEM

Word recognition systems can be categorized into those that take a holistic approach to recognition and those that first segment the word in order to perform recognition. In the first case, no segmentation is needed: global features are detected from the shape of the word. The second case involves segmenting the word into characters, either by the so-called classical analytical approach or by recognition-based segmentation (SegRec). The classical analytical approach incorporates intelligence into the character segmentation process, so that the word is segmented into characters based on information from the word. The SegRec method takes advantage of the fact that segmentation and recognition are interdependent: segmentation points are determined as a result of the recognition process. In SegRec, two approaches can be taken: either (a) suggest various segmentation paths to cut the word into characters and then select only those that are relevant for the word recognition, or (b) over-segment the word into sub-characters or characters and associate several sub-characters to form full characters. Approach (a) is called INSEG (input space segmentation) and approach (b) is called OUTSEG (output space segmentation). In our work, the INSEG approach is used. For a SegRec based online word recognition system, four main processes are involved: (i) word preprocessing, normalization and resampling, (ii) segmentation, (iii) feature extraction and (iv) recognition. Process (i) aims at normalizing the input to have uniform height, slant and rotation angle, besides resampling the signal to have uniform, equidistant sampling points. In (ii), the input is first segmented by slicing the word at minimum, maximum, crossing or pen-up/pen-down points. A segmentation graph is built by combining consecutive slices to produce character hypotheses (a minimal sketch is given below). This segmentation graph proposes various possibilities of segmenting a word into characters.
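The combination step can be pictured with a short sketch. This is illustrative only: it assumes the cut points have already been detected, and it assumes a fixed bound on how many consecutive slices may form one character (the paper does not state such a bound).

```python
# A minimal INSEG-style sketch: combine consecutive slices between cut
# points into character hypotheses. MAX_SLICES is an assumed bound on
# how many slices one character may span (not specified in the paper).
MAX_SLICES = 4

def build_segmentation_graph(cut_points):
    """Return character hypotheses as (start, end) cut-point pairs.

    Together the hypotheses form a graph whose complete paths from the
    first to the last cut point are candidate word segmentations.
    """
    hypotheses = []
    for i in range(len(cut_points) - 1):
        for j in range(i + 1, min(i + MAX_SLICES, len(cut_points) - 1) + 1):
            hypotheses.append((cut_points[i], cut_points[j]))
    return hypotheses

# Example: a word sliced at 5 points yields hypotheses spanning 1-4 slices.
print(build_segmentation_graph([0, 12, 25, 40, 57]))
```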
In (iii), the feature extraction process, a feature vector for each character hypothesis is determined. In this work, 7 local features were extracted for each point: (a) normalized x coordinate, (b) normalized y coordinate, (c) direction angle θ(n) of the curve with respect to the x axis (cos θ(n)), (d) direction angle θ(n) of the curve with respect to the y axis (sin θ(n)), (e) curvature with respect to the x axis (cos φ(n)), (f) curvature with respect to the y axis (sin φ(n)) and (g) the position of the stylus (up or down). See figure 3 for the direction and curvature features.
[Figure 3 – Direction and curvature features at point (Xn, Yn): x direction cos θ(n), y direction sin θ(n); x curvature cos φ(n), y curvature sin φ(n)]
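As an illustration, the seven per-point features could be computed along the following lines, assuming the stroke arrives as an (N, 3) array of (x, y, pen) rows after spatial resampling; the paper does not spell out its exact normalization and curvature formulas, so common choices are used here.

```python
import numpy as np

# A sketch of the seven local features; `pts` is an (N, 3) array of
# (x, y, pen) rows from a spatially resampled stroke. Normalization and
# curvature follow common definitions, not the paper's exact ones.
def local_features(pts):
    x, y, pen = pts[:, 0], pts[:, 1], pts[:, 2]
    scale = (y.max() - y.min()) or 1.0                        # body height
    xn, yn = (x - x.mean()) / scale, (y - y.mean()) / scale   # (a), (b)
    dx, dy = np.gradient(x), np.gradient(y)
    ds = np.hypot(dx, dy) + 1e-8
    cos_t, sin_t = dx / ds, dy / ds                # (c) cos θ(n), (d) sin θ(n)
    dtheta = np.gradient(np.unwrap(np.arctan2(dy, dx)))
    cos_p, sin_p = np.cos(dtheta), np.sin(dtheta)  # (e) cos φ(n), (f) sin φ(n)
    return np.stack([xn, yn, cos_t, sin_t, cos_p, sin_p, pen], axis=1)  # (g)
```

With each character resampled to 50 points, this would yield the 350 feature values per example used in section 4.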
These 7 features are simple to obtain and have been used by Poisson [9] in experiments with TDNN and MLP NN. In (iv), the recognition process, the SVM is used to estimate the character likelihood of each character hypothesis in the segmentation graph. Finally, the word likelihood is computed using dynamic programming, i.e. the Forward and Viterbi algorithms, within the framework of an HMM.

3.0 WORD RECOGNIZER TRAINING

In this section, neural networks and support vector machines are introduced and compared with respect to their usage in word recognizer training.

3.1 Neural Network

A simple view of a neural network (NN) is as a massively parallel computing system consisting of a large number of processors (nodes) with many interconnections. NN models have nodes (neurons) and directed, weighted edges between neuron outputs and neuron inputs. NNs can learn complex nonlinear input-output relationships. Commonly used NNs are the multilayer perceptron (MLP) and the Radial Basis Function (RBF) network. The Self-Organizing Map (SOM), or Kohonen network, is another popular NN, mainly used for data clustering and feature mapping. NN learning/training uses data examples (and teaching signals) to update and optimize the architecture and weights so as to perform good classification later. For a specific selected architecture, training involves optimization of the weight parameters. NN training uses empirical risk minimization, which stops training once the learning error is within a specified margin. This can lead to a non-optimal model, and the solution is often plagued by local minima. Furthermore, different training sessions lead to different NN weight parameters, as illustrated by the sketch below.
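As an illustration of this training behaviour, a character-level MLP could be set up as follows; this is a sketch with synthetic placeholder data, and the layer size and stopping tolerance are illustrative, not the paper's values.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Empirical-risk-minimization training of an MLP character classifier.
# X and y are random placeholders for the 350-value feature vectors and
# character labels used later in the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 350))
y = rng.integers(0, 10, size=200)            # e.g. 10 digit classes
clf = MLPClassifier(hidden_layer_sizes=(100,), solver="sgd",
                    learning_rate_init=0.01, max_iter=300, tol=1e-4,
                    random_state=0).fit(X, y)
# Training stops once the loss improvement falls below tol; a different
# random_state (different initial weights) generally converges to a
# different local minimum with different final weights.
```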
3.2 Support Vector Machine

The SVM has been used in recent years as an alternative to the NN. Unlike the NN, the SVM takes into account both the learning examples and the structural behavior of the classifier. It achieves better generalization due to structural risk minimization (SRM); the SVM formulation approximates the SRM principle by maximizing the margin of separation. The basic SVM is linear, but it can handle non-linear data by using a kernel function to first (implicitly) map the non-linear data into a linear feature space. The basic SVM is also a two-class classifier; however, with some modification, a multiclass classifier can be obtained.

For a slightly more formal description of the SVM, consider a set of $l$ linearly separable data points and their classes $\{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{\pm 1\}$. The maximum margin classifier is $f(x) = \mathrm{sgn}(w \cdot x + b)$, where $w$ and $b$ are parameters that maximize the margin with respect to the two classes. A new form of the classifier, expressed in terms of the input and output vectors, is

$$f(x) = \mathrm{sgn}\Big(\sum_{i=1}^{l} \alpha_i y_i (x_i \cdot x) + b\Big).$$

Here a parameter $\alpha_i$ for each corresponding input vector needs to be found in order to obtain the maximal margin classifier. The $\alpha$ values are mostly zero; the inputs with non-zero $\alpha$'s are called the support vectors, and they contribute strongly towards the decision function. If the data set is not linearly separable, the decision function used is

$$f(x) = \mathrm{sgn}\Big(\sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b\Big),$$

where the kernel $K$ performs the mapping of the non-linear input space into a linear space; the maximal margin classifier is then found in that linear space. A few possible kernels can be chosen: (a) linear kernel: $K(x, y) = x \cdot y$; (b) polynomial kernel: $K(x, y) = (x \cdot y + 1)^d$; (c) radial basis function kernel: $K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}$; and (d) hyperbolic tangent kernel: $K(x, y) = \tanh(a\, x \cdot y - b)$.
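To make the decision function concrete, a small numpy sketch of $f(x)$ with the RBF kernel is given below; the support vectors, $\alpha$'s and $b$ shown are placeholders for illustration, not values from a trained model.

```python
import numpy as np

def rbf(xi, x, sigma=1.0):
    # K(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2))
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma ** 2))

def decision(x, sv, alpha, y_sv, b, sigma=1.0):
    # f(x) = sgn(sum_i alpha_i y_i K(x_i, x) + b)
    s = sum(a * yi * rbf(xi, x, sigma) for a, yi, xi in zip(alpha, y_sv, sv))
    return np.sign(s + b)

# Placeholder support vectors, multipliers and bias.
sv = np.array([[0.0, 1.0], [1.0, 0.0]])
alpha, y_sv, b = [0.5, 0.5], [+1, -1], 0.0
print(decision(np.array([0.2, 0.9]), sv, alpha, y_sv, b))   # -> 1.0
```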
SVM training involves solving a convex quadratic programming (QP) problem, with equality and inequality constraints, that arises from the objective of margin maximization. The solution yields the non-zero parameters $\alpha_i$ introduced in the formulation and thereby identifies the corresponding support vectors. The $\alpha$ values obtained are constrained to be positive in the perfectly separable case, and to lie between 0 and $C$ in the case of non-linearly separable data. The value $C$ is a penalty term and needs to be chosen prior to training the SVM. Other issues in SVM training include finding the best training model through an appropriate kernel and its hyperparameters. In a multiclass SVM, many two-class SVMs are trained, and during classification a voting scheme is used to select the correct class.
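For reference, the QP mentioned above takes the following standard soft-margin dual form (not written out in the paper, but consistent with the constraints $0 \le \alpha_i \le C$ stated in the text):

```latex
\max_{\alpha} \sum_{i=1}^{l} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l}
      \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{l} \alpha_i y_i = 0 .
```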
3.3 Training for Word Recognizer

Generally, when using a hybrid NN/HMM, there are two approaches to training the word recognition system (figure 4): (a) character-level discriminant training and (b) word-level discriminant training.

3.3.1 Character-level discriminant training (CDT)

In CDT, the NN is optimized at the character level, given character-level teaching signals, and the HMM is separately optimized at the word level, given word-level teaching signals. CDT is a straightforward way to combine the NN and HMM into a word recognizer. The NN is trained as a character recognizer, and thus optimized at the character level. The NN is first trained on isolated characters to bootstrap it. The recognizer then refines its recognition performance by training the NN on characters segmented from the word database. Finally, to cater for junk characters, the most influential error-causing character hypotheses are collected as junk examples and the NN is retrained on the segmented word examples combined with the junk examples (a sketch of these stages is given below). However, although training with junk can improve word recognition accuracy, the level of error that a junk example causes to the recognition is hard to adjust. It is actually hard to train the NN to model the problem accurately, and since the NN is only optimized at the character level, it will not achieve the best possible performance at the word level.
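A compact sketch of the three CDT stages might look as follows; the data-set arguments and the "junk" label are hypothetical names for illustration, not the paper's API.

```python
import numpy as np

def train_cdt(clf, iso, seg, junk_X):
    """Three-stage CDT sketch: bootstrap, refine, then retrain with junk.

    `clf` is any character classifier with a fit method; `iso` and `seg`
    are (X, y) pairs of isolated and word-segmented characters (string
    labels assumed), and `junk_X` holds mined error-causing hypotheses.
    """
    (Xi, yi), (Xs, ys) = iso, seg
    clf.fit(Xi, yi)                                        # 1: bootstrap
    clf.fit(np.vstack([Xi, Xs]), np.hstack([yi, ys]))      # 2: refine
    yj = np.full(len(junk_X), "junk")                      # 3: junk class
    clf.fit(np.vstack([Xi, Xs, junk_X]), np.hstack([yi, ys, yj]))
    return clf
```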
[Figure 4 – Character-level vs. word-level discriminant training: in CDT, separate teaching signals train the NN and the HMM; in WDT, a single word-level teaching signal trains the NN and HMM jointly]
3.3.2 Word-level discriminant training (WDT)

In WDT, both the NN and the HMM are optimized at the word level. The recognizer is trained globally, using a single word-level objective function and a gradient descent (backpropagation) algorithm. The word-level objective function can be based on the Maximum Likelihood (ML) or the Maximum Mutual Information (MMI) criterion. Under the MMI criterion, the recognizer is trained to maximize the likelihood of the true model and, at the same time, to minimize the likelihoods of the competing models. This criterion gives the recognizer discriminant power to separate the true class from the rest of the classes. Training with the ML criterion only maximizes the true model, regardless of the other models, and therefore gives the recognizer no discriminant power. When the NN is replaced by an SVM, only CDT can be used: WDT is not possible, as the SVM formulation does not allow the SVM to be optimized from word-level teaching signals. In this work, we therefore compare NN and SVM on CDT only.

4.0 EXPERIMENTS AND RESULTS

A few comparisons between NN and SVM can be made. The first compares NN and SVM on the recognition of isolated characters, with training done on isolated character databases. The second compares NN and SVM in action within a hybrid NN/HMM or SVM/HMM system trained with CDT. For these purposes, several databases were used.
4.1 Databases

In the experiments, IRONOFF, UNIPEN and a combination of the two databases were used. IRONOFF contains both online and offline handwriting information, collected by IRCCyN in Nantes, France. It contains 4,086 isolated digits, 10,685 isolated lower case letters, 10,679 isolated upper case letters and 410 EURO signs. It also contains 31,346 isolated words from a 197-word lexicon (French: 28,657 and English: 2,689). In the experiments, only the isolated digits and letters are used. The UNIPEN online database is available from the International Unipen Foundation. It consists of various training and benchmark datasets for online handwriting, including characters and words. In the experiments, only benchmark sets 1a (16,000 isolated digits), 1b (28,000 isolated upper case characters) and 1c (61,000 isolated lower case characters) are used. A combination of the IRONOFF and UNIPEN databases, called IRONOFF-UNIPEN, is also used.

4.2 Comparison of NN and SVM

For the experiments using SVM, the example isolated characters were preprocessed and the 7 local features described in section 2 were extracted for each point of the spatially resampled online signal. For each example character, there are 350 feature values as input to the SVM. We use LIBSVM with the RBF kernel, since the RBF kernel has been shown to generally give better recognition results [7]. A grid search was done to find the best values of $C$ and gamma (representing $\frac{1}{2\sigma^2}$ in the original RBF kernel formulation) for the final SVM models; by that search, $C = 8$ and gamma $= 2^{-5}$ were chosen (a sketch of such a search is given below). The results of the experiments using SVM are compared with those of a similar experiment using NN by Poisson [9], who used a time delay NN (TDNN) and a multilayer perceptron (MLP). Table 1 below compares NN (both MLP and TDNN) and SVM in terms of recognition rate and the parameters required for each model.

Table 1: NN vs. SVM: recognition rates (%) and parameters for the IRONOFF-UNIPEN database.

Data Set   | MLP free par. | MLP rec. rate | TDNN free par. | TDNN rec. rate | SVM nSV | SVM rec. rate
Digit      | 36,110        | 97.9          | 3,790          | 98.4           | 3,014   | 98.68
Lowercase  | 37,726        | 91.3          | 8,926          | 92.7           | 15,696  | 93.76
Uppercase  | 37,726        | 93.0          | 8,926          | 94.5           | 10,035  | 95.13
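The kind of grid search described above could be reproduced, for instance, with scikit-learn's LIBSVM-backed SVC; this is an illustrative sketch with placeholder data, and the grid bounds are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for the 350-value character features.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 350)), rng.integers(0, 10, size=300)

grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [2.0 ** k for k in range(-1, 6)],
                     "gamma": [2.0 ** k for k in range(-8, -2)]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)   # the paper's search settled on C = 8, gamma = 2^-5
```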
It is observed from Table 1 that the recognition rate using SVM is higher than with NN (both MLP and TDNN). However, the parameter storage for the SVM model is significantly larger. The memory required for the SVM is the number of support vectors (nSV) multiplied by the number of feature values (in this case 350). This is significantly larger than for the NNs, which only need to store their weights; the TDNN needs less space than the MLP due to its weight-sharing scheme. In the SVM case, however, space savings can be achieved by storing only the original online signals and the pen-up/pen-down status corresponding to each support vector in a compact manner; during recognition, the model is expanded dynamically as required.
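The storage gap can be made concrete with a back-of-the-envelope calculation, here for the lowercase row of Table 1 and assuming 4-byte floats (the paper does not specify a precision):

```python
n_sv, n_features = 15696, 350      # SVM support vectors (lowercase) x features
mlp_free_params = 37726            # MLP free parameters (lowercase)
svm_bytes = n_sv * n_features * 4  # about 22 MB
mlp_bytes = mlp_free_params * 4    # about 0.15 MB
print(svm_bytes, mlp_bytes, round(svm_bytes / mlp_bytes))   # ratio ~ 146x
```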
Table 3 shows the comparison of recognition rates between NN and SVM on all three databases. SVM clearly outperforms the two NN methods in all three isolated character cases.

Table 3: MLP vs. TDNN vs. SVM recognition rates (%) for all datasets of the IRONOFF, UNIPEN and IRONOFF-UNIPEN databases.

           | IRONOFF           | UNIPEN            | IRONOFF-UNIPEN
Data Set   | MLP  TDNN  SVM    | MLP  TDNN  SVM    | MLP  TDNN  SVM
Digit      | 98.2 98.4  98.83  | 97.5 97.9  98.33  | 97.9 98.4  98.68
Lowercase  | 90.2 90.7  92.47  | 92.0 92.8  94.03  | 91.3 92.7  93.76
Uppercase  | 93.6 94.2  95.46  | 92.8 93.5  94.81  | 93.0 94.5  95.13
The results for the isolated character cases above indicate that the recognition rate of the hybrid word recognizer could be improved by using SVM instead of NN. We are therefore currently implementing word recognizers with CDT using both NN and SVM and comparing their performance.

5.0 CONCLUSION

The SVM has been shown to be a better character recognizer because of its use of SRM, implemented by maximizing the margin of separation in the decision function. The drawback is a larger model size, but this can be mitigated by better selection of hyperparameters through a finer grid search and by reduced set selection [5][6]. Work is ongoing in comparing CDT-based SVM/HMM and NN/HMM hybrid word recognizers. It is envisaged that, due to the SVM's better discrimination capability, the word recognition rate of the SVM/HMM system will be better than that of the NN/HMM hybrid system.

6.0 REFERENCES

[1] H. S. M. Beigi, "An Overview of Handwriting Recognition," Proceedings of the 1st Annual Conference on Technological Advancements in Developing Countries, Columbia University, New York, July 24-25, 1993, pp. 30-46.
[2] Y. Bengio and Y. LeCun, "Word Normalization for On-Line Handwritten Word Recognition," in Proc. 12th Int'l Conf. on Pattern Recognition, vol. 2, pp. 409-413, Jerusalem, Oct. 1994.
[3] Y. Bengio and Y. LeCun, "LeRec: An ANN/HMM Hybrid for On-line Handwriting Recognition," Neural Computation, vol. 7, no. 5, 1995.
[4] A. Ganapathiraju, "Support Vector Machines for Speech Recognition," PhD Thesis, Mississippi State University, January 2002.
[5] C. J. Burges, "Simplified support vector decision rules," in L. Saitta, editor, Proceedings of the 13th Intl. Conf. on Machine Learning, pp. 71-77, San Mateo, CA, 1996. Morgan Kaufmann.
[6] T. Downs, K. E. Gates and A. Masters, "Exact Simplification of Support Vector Solutions," Journal of Machine Learning Research 2 (2001) 293-297.
[7] S. S. Keerthi and C.-J. Lin, "Asymptotic Behaviors of Support Vector Machines with Gaussian Kernel," Neural Computation, July 1, 2003; 15(7): 1667-1689.
[8] C. Bahlmann, B. Haasdonk and H. Burkhardt, "Online Handwriting Recognition with Support Vector Machines - A Kernel Approach," in Proceedings of the 8th Int. Workshop on Frontiers in Handwriting Recognition (IWFHR), pp. 49-54, 2002.
[9] E. Poisson, C. Viard-Gaudin and P. M. Lallican, "Multi-modular architecture based on convolutional neural networks for online handwritten character recognition," ICONIP 2002, 2002.