8th International Workshop on Frontiers in Handwriting Recognition 2002.
A LEARNING ALGORITHM FOR STRUCTURED CHARACTER PATTERN REPRESENTATION USED IN ON-LINE RECOGNITION OF HANDWRITTEN JAPANESE CHARACTERS

A. KITADAI AND M. NAKAGAWA
Tokyo University of Agriculture and Technology, 2-24-16, Naka-cho, Koganei, Tokyo, Japan
E-mail: [email protected]

Abstract
This paper describes a prototype learning algorithm for structured character pattern representation, in which common subpatterns are shared among multiple character templates, for on-line recognition of handwritten Japanese characters. Although prototype learning algorithms have proved useful for unstructured sets of features, they have not been presented for structured or hierarchical pattern representation. We present cost-free parallel translation (without rotation) of subpatterns, which cancels their location distributions, and a normalization that reflects the feature distributions of raw patterns onto the subpattern prototypes, and then show that a prototype learning algorithm can be applied to structured character pattern representation with significant effect.

1. Introduction
The performance of a character recognition system is affected by the quality and the size of its prototype dictionary, so improving the dictionary is one of the most important tasks. Prototype learning algorithms (PLAs) are known methods for better approximating discrimination boundaries in the pattern space [1]-[4]. Several PLAs have been used to improve the accuracy of pattern dictionaries, and their advantages have been demonstrated [5]. Structured character pattern representation (SCPR) represents a character pattern as a composite of subpatterns and their structure, with common subpatterns shared among the character patterns whose shapes include them. SCPR is suitable for patterns with internal structure, such as Chinese characters, and provides advantages such as reducing the dictionary size and making the recognition system robust against deformation of common subpatterns [6]. We also proposed a structural learning algorithm that investigates whether a subpattern or the pattern as a whole is non-standard, registers the (sub)pattern, and extends the effect of the registration to all character categories whose shapes include it [7].

However, statistical learning for SCPR has remained an open problem. Although PLAs have proved useful for unstructured sets of features, no publication has reported the adaptation of a PLA to SCPR. Akiyama et al. reported the effect of a PLA for unstructured character patterns in the Japanese character set, but its adaptation to structured patterns was avoided [8]. The problem is how to exploit the occurrences of subpatterns within character patterns and make the information on their statistical distributions available to the other character patterns that share them. This paper presents a method to adapt a PLA to SCPR.

The paper is organized as follows. Section 2 introduces SCPR in the scope of our on-line recognition system for handwritten Japanese characters. Section 3 presents the design of the learning algorithm. Section 4 shows experimental results of the learning algorithm. Section 5 concludes this work.

2. Recognition system
2.1 Dictionary with Structured Character Pattern Representation
Kanji characters, ideographic characters of Chinese origin, are mostly composed of multiple subpatterns. For our on-line recognition system, each character pattern is registered as a composite of subpatterns together with structural information on how to combine them. Subpatterns are shared in the SCPR dictionary as shown in Fig. 1.

Figure 1. SCPR dictionary (structural information, basic subpatterns (BS) and character patterns).

All basic subpatterns (primitives that are not decomposed further) as well as character patterns are represented in a square shape of 128 x 128 resolution, and they are reduced to the bounding boxes given in the structural information by a linear mapping when they are included in bigger subpatterns or character patterns (Fig. 2). This paper calls the result of such a linear mapping "a mapped basic subpattern" (MBS), even when the mapping is the identity; hereafter we abbreviate the basic subpattern as BS. Before this research started, all the BS's had already been learned from learning patterns using a simple clustering algorithm based on LBG. Since the recognition method is sensitive to stroke order variations, multiple templates have been registered for each subpattern.

Figure 2. Linear mapping of a BS into its bounding box (size reduction) using structural information.
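To make the dictionary organization concrete, the sketch below (Python) shows one possible way to store shared BS's and compose character templates by linear mapping. It is an illustration only; the class and field names (BasicSubpattern, CharacterTemplate, BoundingBox) and the point representation are assumptions, not the authors' implementation.

from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]   # (x, y) feature point inside the 128 x 128 square
SQUARE = 128.0

@dataclass
class BoundingBox:
    x: float
    y: float
    w: float
    h: float

@dataclass
class BasicSubpattern:
    name: str
    points: List[Point]       # time-ordered feature points in the 128 x 128 square

@dataclass
class CharacterTemplate:
    label: str
    parts: List[Tuple[str, BoundingBox]]   # (BS name, where the BS is placed)

def map_bs(bs: BasicSubpattern, box: BoundingBox) -> List[Point]:
    """Linear mapping of a BS into a bounding box, yielding a mapped BS (MBS)."""
    sx, sy = box.w / SQUARE, box.h / SQUARE
    return [(box.x + x * sx, box.y + y * sy) for (x, y) in bs.points]

def compose(template: CharacterTemplate, bs_dict: Dict[str, BasicSubpattern]) -> List[Point]:
    """Compose a character pattern from shared BS's and structural information."""
    points: List[Point] = []
    for name, box in template.parts:
        points.extend(map_bs(bs_dict[name], box))
    return points

Because each BS is stored only once, any improvement of its feature points is automatically inherited by every character template that refers to it, which is the property the learning algorithm in Section 3 exploits.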
2.2 Process of the Recognition System
Our on-line recognition system for handwritten Japanese characters employs fast elastic matching with the SCPR dictionary. An input pattern is first normalized to the square size of 128 x 128 and feature points are then extracted as shown in Fig. 3. Each BS in the SCPR dictionary is a time-ordered sequence of feature points, and the same holds for MBS's and the character patterns composed from them. The recognition system uses an elastic matching of feature points whose recognition rate is comparable with Dynamic Programming (DP) matching, yet it runs 6-8 times faster than DP matching with beam search.

Figure 3. Size normalization and feature point extraction.

As a result of the elastic matching, many-to-one or one-to-many correspondences between feature points may be made. For evaluating pattern similarity, however, one-to-one correspondences are more reliable, so unessential correspondences are discarded and one-to-one correspondences are formed (Fig. 4) [9].

Figure 4. Making one-to-one correspondences by removing unreliable correspondences.

3. Designing the learning algorithm

3.1 Prototype Learning Algorithm
We employ Generalized Learning Vector Quantization (GLVQ) [4] as the basic learning strategy, since it is effective for Kanji recognition [5]. For a learning pattern Pl, GLVQ updates the genuine prototype (the closest prototype in the correct class) Pi and the rival prototype (the closest one in a different class) Pj with the learning rate α(t) as follows:
P_i' = P_i + 4\alpha(t)\, l_k (1 - l_k)\, \frac{D_j}{(D_i + D_j)^2}\, (P_l - P_i), \quad
P_j' = P_j - 4\alpha(t)\, l_k (1 - l_k)\, \frac{D_i}{(D_i + D_j)^2}\, (P_l - P_j)          (1)
l_k = l_k(\mu_k) = \frac{1}{1 + e^{-\xi \mu_k}}, \quad
\mu_k = \frac{D_i - D_j}{D_i + D_j}, \quad
D_i = \| P_l - P_i \|, \quad D_j = \| P_l - P_j \|          (2)
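As a concrete illustration of the update rule (1)-(2), the following NumPy sketch applies one GLVQ step to a pair of prototypes. It treats patterns as flat feature vectors; the function name and the fixed default for ξ are assumptions made for the example.

import numpy as np

def glvq_step(p_l, p_i, p_j, alpha_t, xi=1.0):
    """One GLVQ update following (1) and (2).

    p_l: learning pattern, p_i: genuine prototype, p_j: rival prototype,
    all as flat NumPy feature vectors; alpha_t is the learning rate at time t.
    """
    d_i = np.linalg.norm(p_l - p_i)
    d_j = np.linalg.norm(p_l - p_j)
    mu = (d_i - d_j) / (d_i + d_j)           # misclassification measure (2)
    lk = 1.0 / (1.0 + np.exp(-xi * mu))      # sigmoid of mu
    scale = 4.0 * alpha_t * lk * (1.0 - lk) / (d_i + d_j) ** 2
    p_i_new = p_i + scale * d_j * (p_l - p_i)   # pull the genuine prototype toward the input
    p_j_new = p_j - scale * d_i * (p_l - p_j)   # push the rival prototype away from the input
    return p_i_new, p_j_new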
3.2 Application of GLVQ
We can improve the patterns in our SCPR dictionary by moving their feature points according to formulae (1) and (2). Through the one-to-one correspondences between feature points, each feature point of a learning pattern is corresponded to a feature point of the genuine pattern and to one of the rival pattern (Fig. 5).
Figure 5. Feature point correspondences to the genuine pattern and the rival pattern.
Fig. 6 shows the one-to-one correspondence of a feature point pl in a learning pattern to the point pi in the genuine pattern and the point pj in the rival pattern. We consider that the prototypes are better represented by moving pi toward pl and pj away from pl. The formulae to move each feature point are given in (3) to (5).
Figure 6. Relations between the feature points pl : (xl, yl), pi : (xi, yi), pj : (xj, yj) and their improvement by the learning algorithm.

x_i' = x_i + 4\alpha(t)\, l_k (1 - l_k)\, \frac{d_j}{(d_i + d_j)^2}\, (x_l - x_i)          (3)

y_i' = y_i + 4\alpha(t)\, l_k (1 - l_k)\, \frac{d_j}{(d_i + d_j)^2}\, (y_l - y_i)          (4)

x_j' = x_j - 4\alpha(t)\, l_k (1 - l_k)\, \frac{d_i}{(d_i + d_j)^2}\, (x_l - x_j), \quad
y_j' = y_j - 4\alpha(t)\, l_k (1 - l_k)\, \frac{d_i}{(d_i + d_j)^2}\, (y_l - y_j)          (5)

where l_k = l_k(\mu_k) = \frac{1}{1 + e^{-\xi \mu_k}}, \quad \mu_k = \frac{d_i - d_j}{d_i + d_j}, \quad d_i = \| p_l - p_i \|, \quad d_j = \| p_l - p_j \|.
3.3 Learning Algorithm and Linear Mapping
In the above, however, the counterpart of the learning pattern is the MBS, although the algorithm must improve the BS, as depicted in Fig. 7. There may be arguments about this learning. Should different shapes of a subpattern be used to train the common subpattern at all? A naive appeal to human recognition would support this, but machine learning is not as flexible as humans. Is the learning useful because the number of learning patterns for each subpattern amounts to ten or a hundred times the number of learning patterns for each character category? Japanese character recognition systems have to recognize over 3,000 character categories, so the lack of a sufficient number of learning patterns for each class is serious. We defer these questions to the evaluation experiments described later.

Figure 7. Feature reflection to the BS.
3.4 Reflection Method
To evaluate the above learning, we need a method to reflect the learning onto the BS's. A simple idea is to enlarge the learning pattern to the square size of the BS by applying the inverse of the mapping from the BS to the MBS. Since the bounding box of the MBS is smaller than that of the BS except in the case of identical mapping, the inverse mapping enlarges the bounding box of the learning pattern. Usually, however, handwriting contains noise due to hand vibration and the like, and this noise has little or no correlation with the bounding box size of the learning pattern, so the inverse mapping may magnify the noise and reflect it into the subpattern (Fig. 8).

Figure 8. Learning process with inverse mapping.
Another method is to extract the displacement between the MBS and the learning pattern and reflect it at the original square size of the BS while considering both the virtual displacement factor and the noise factor (Fig. 9). The idea is to reflect the former while suppressing the magnification of the noise. In the next section we introduce the displacement normalization function G(d, S), where d is an observed displacement and S is the size of the MBS, to extract the virtual displacement. Fig. 9 shows the process: each u(v) is a feature point of the MBS mapped from a feature point v in the BS, and each l(v) is the feature point of the learning pattern corresponding to u(v). The displacement between u(v) and l(v) is measured and is reflected onto the feature point v using G(d, S).
Figure 9. Reflection to a feature point of the BS.
3.5 Displacement normalization
We propose a method to normalize the displacements between corresponded feature points by the bounding box size of the MBS. We assume a correlation between the bounding box size of the MBS and the freedom of movement of each feature point in the MBS, because each feature point can move in a larger area when the bounding box of the MBS is bigger. This degree of freedom is the source of the displacement between corresponded feature points, so by finding the correlation between the bounding box size and the degree of freedom, we should be able to normalize the displacement.

A sufficient number of learning patterns in the same class as the MBS produce, for each feature point in the MBS, a distribution of the corresponding feature points in the learning patterns. The distribution shows the degree of freedom in movement of that feature point of the MBS. The average distance from the gravity point of the distribution to the feature points belonging to it indicates the size of the distribution. By investigating these distances as a function of the bounding box size of the MBS, we should be able to obtain the correlation.

Figure 10. (a) The distribution of feature points of learning patterns corresponding to a feature point of the MBS; (b) the distance between the gravity point and the feature points.
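The following sketch illustrates this analysis under the assumption that the corresponded feature points have already been collected for each MBS feature point; the function and variable names are illustrative. For each bounding box size it computes the average distance from the gravity point of the distribution to its members, which is the quantity fitted linearly in Section 4.1.

import numpy as np
from collections import defaultdict

def distribution_size_by_box(samples):
    """samples: iterable of (box_size, matched_points), where matched_points holds the
    learning-pattern feature points corresponded to one MBS feature point.
    Returns the mean distance from the gravity point of each distribution to its
    members, grouped by bounding box size."""
    per_size = defaultdict(list)
    for box_size, matched in samples:
        matched = np.asarray(matched, dtype=float)
        centroid = matched.mean(axis=0)                          # gravity point
        spread = np.linalg.norm(matched - centroid, axis=1).mean()
        per_size[box_size].append(spread)
    return {size: float(np.mean(vals)) for size, vals in per_size.items()}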
By assuming G(d, S), GLVQ's formulae (3) to (5) are transformed into (6) to (8), where Si = (six, siy) is the bounding box size of the MBS corresponding to the genuine pattern and Sj = (sjx, sjy) is that corresponding to the rival pattern.

x_i' = x_i + 4\alpha(t)\, l_k (1 - l_k)\, \frac{G(d_j, S_j)\, G(x_l - x_i, s_{ix})}{(G(d_i, S_i) + G(d_j, S_j))^2}          (6)

y_i' = y_i + 4\alpha(t)\, l_k (1 - l_k)\, \frac{G(d_j, S_j)\, G(y_l - y_i, s_{iy})}{(G(d_i, S_i) + G(d_j, S_j))^2}          (7)

x_j' = x_j - 4\alpha(t)\, l_k (1 - l_k)\, \frac{G(d_i, S_i)\, G(x_l - x_j, s_{jx})}{(G(d_i, S_i) + G(d_j, S_j))^2}, \quad
y_j' = y_j - 4\alpha(t)\, l_k (1 - l_k)\, \frac{G(d_i, S_i)\, G(y_l - y_j, s_{jy})}{(G(d_i, S_i) + G(d_j, S_j))^2}          (8)

where l_k = l_k(\mu_k) = \frac{1}{1 + e^{-\xi \mu_k}} and \mu_k = \frac{G(d_i, S_i) - G(d_j, S_j)}{G(d_i, S_i) + G(d_j, S_j)}.
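A possible per-feature-point realization of (6)-(8) is sketched below. The paper does not specify how G(d_i, S_i) is evaluated for the scalar distances, so the sketch assumes it can be taken as the norm of the per-axis normalized displacement; the function names and this interpretation are assumptions of the example.

import numpy as np

def normalized_point_update(p_l, p_i, p_j, s_i, s_j, g_x, g_y, alpha_t, xi=1.0):
    """Sketch of one feature point update in the spirit of (6)-(8).

    p_l, p_i, p_j: (x, y) NumPy points of the learning, genuine and rival patterns;
    s_i, s_j: bounding box sizes (sx, sy) of the genuine and rival MBS's;
    g_x, g_y: per-axis displacement normalization functions (G in the paper).
    """
    def norm_disp(p_from, p_to, s):
        # displacement normalized per axis by the MBS bounding box size
        return np.array([g_x(p_to[0] - p_from[0], s[0]),
                         g_y(p_to[1] - p_from[1], s[1])])

    v_i = norm_disp(p_i, p_l, s_i)
    v_j = norm_disp(p_j, p_l, s_j)
    g_i, g_j = np.linalg.norm(v_i), np.linalg.norm(v_j)   # assumed reading of G(d_i, S_i), G(d_j, S_j)
    mu = (g_i - g_j) / (g_i + g_j)
    lk = 1.0 / (1.0 + np.exp(-xi * mu))
    coeff = 4.0 * alpha_t * lk * (1.0 - lk) / (g_i + g_j) ** 2
    p_i_new = p_i + coeff * g_j * v_i   # move the genuine point toward the learning point
    p_j_new = p_j - coeff * g_i * v_j   # move the rival point away from the learning point
    return p_i_new, p_j_new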
3.6 Parallel translation
When the displacement from a feature point in an MBS to the corresponding point in a learning pattern is measured, it includes not only the displacement within the MBS and noise, but also the displacement due to the parallel translation of the MBS within the character pattern, as shown in Fig. 11. This factor must be removed before the displacement normalization is applied. Although removing it completely is difficult, we propose a method that employs a cost-free parallel translation to match the gravity point of the learning pattern with that of the MBS (Fig. 12).

Figure 11. Displacement due to parallel translation of the MBS.

Figure 12. Removing the factor of parallel translation of the MBS.
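A minimal sketch of the cost-free parallel translation (the function name is an assumption): it only translates the learning pattern so that the two gravity points coincide, with no rotation or scaling.

import numpy as np

def align_by_gravity(learning_points, mbs_points):
    """Cost-free parallel translation: shift the learning pattern so that its gravity
    point coincides with that of the MBS before the displacements are measured."""
    learning_points = np.asarray(learning_points, dtype=float)
    mbs_points = np.asarray(mbs_points, dtype=float)
    shift = mbs_points.mean(axis=0) - learning_points.mean(axis=0)
    return learning_points + shift          # translation only; no rotation or scaling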
4. Experiments
4.1 Normalization Formulae
To obtain the formula of the normalization described in Section 3.5, we measured the correlation between the average displacement and the bounding box size, as shown in Fig. 13(b), using the database HANDS_nakayosi_t_98_09 (Nakayosi) [10]. The parallel translation of Section 3.6 had been applied before measuring the displacements. From the results, the average of dx (denoted DAx) and that of dy (denoted DAy) can be expressed in terms of the bounding box size (sx, sy):

D_{Ax}(s_x) = 0.0846\, s_x + 1.7, \qquad D_{Ay}(s_y) = 0.0539\, s_y + 3.5          (9)

Figure 13. (a) The distance (dx, dy) of each directional element for an MBS bounding box of size (sx, sy); (b) the relation between the distance (distribution size) and the bounding box size of the MBS.

The displacement between corresponded feature points l = (xl, yl) and u = (xu, yu) is normalized with the bounding box size S = (sx, sy) of the MBS as follows:

G_x(x_l - x_u, s_x) = (x_l - x_u)\, \frac{D_{Ax}(128)}{D_{Ax}(s_x)}, \qquad
G_y(y_l - y_u, s_y) = (y_l - y_u)\, \frac{D_{Ay}(128)}{D_{Ay}(s_y)}          (10)
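Given the linear fits in (9), the normalization (10) can be sketched as follows; make_normalizer and the variable names are illustrative, while the coefficients are taken from (9).

def make_normalizer(slope, intercept, full_size=128.0):
    """Build a per-axis displacement normalization G(d, s) from a linear fit
    D_A(s) = slope * s + intercept of average displacement vs. bounding box size."""
    d_full = slope * full_size + intercept
    def g(d, s):
        return d * d_full / (slope * s + intercept)
    return g

# Normalizers corresponding to (9); the coefficients are from the paper,
# the function names are illustrative.
g_x = make_normalizer(0.0846, 1.7)
g_y = make_normalizer(0.0539, 3.5)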
4.2 Averaged basic subpattern
It is usually better for the learning algorithm to begin from a good starting point. Generating averaged patterns from enough learning patterns is an effective solution for non-structured character patterns. For the SCPR, we propose the following method. Fig. 14 shows the process of finding a good starting point for a feature point v in the BS. Each u(v) is the feature point of the MBS mapped from v, and each l(v) is the feature point corresponded to u(v); each learning pattern is in the same class as the MBS.

Figure 14. Process to find a good starting point for a feature point in the BS.

Without the normalization, we can consider that v is a good starting point if the following formula is satisfied for l1(v), l2(v), ..., lN(v):
\{l_1(v) - u_1(v)\} + \{l_2(v) - u_2(v)\} + \dots + \{l_N(v) - u_N(v)\} = \sum_{n=1}^{N} \{l_n(v) - u_n(v)\} = 0          (11)
With the normalization, it should be as follows:
\sum_{n=1}^{N} G(l_n(v) - u_n(v), S_n) = 0          (12)

By moving every feature point in the BS so as to satisfy this formula, we can prepare a BS that represents the learning patterns equally. We call such a BS an Averaged Basic Subpattern (ABS). Using the Nakayosi database (1,517,867 patterns, excluding JIS level-2 Kanji) as the learning pattern set and changing the method of obtaining the ABS, we made two initial dictionaries: one without normalization (1-1) and one with the normalization of Section 4.1 (1-2). As a comparison, we also made an initial dictionary that takes the average of the inversely mapped learning patterns instead of the ABS (1-3). Table 1 shows the recognition rates for Nakayosi, the learning set; "previous" in the table is the rate before learning. It is effective to average learning patterns in some way, but the advantage of the ABS is not clear.
Table 1. Recognition rates using ABS.
Dictionary    1-1       1-2       1-3       previous
Rate          86.5 %    86.6 %    86.4 %    84.4 %
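For illustration, a sketch of how an ABS starting point could be computed from (12) is given below. The paper does not spell out the solution procedure; the sketch assumes the per-axis linear form of G from (10), under which each coordinate of v can be solved in closed form. All names are hypothetical.

import numpy as np

def abs_point(placements, g_x, g_y, square=128.0):
    """Starting point of one BS feature point so that the normalized displacements to
    the corresponded learning-pattern points sum to zero, as in (12).

    placements: list of (box, l) where box = (bx, by, bw, bh) places the MBS inside
    the character pattern and l = (lx, ly) is the corresponded learning-pattern point
    (after the cost-free parallel translation of Section 3.6);
    g_x, g_y: per-axis normalization functions from Section 4.1.
    """
    def solve_axis(items, g):
        # items: (offset, scale, size, l) with u(v) = offset + scale * v; since G is
        # linear in its first argument, sum G(l - u(v), size) = 0 gives v in closed form.
        num = sum(g(l - offset, size) for offset, _scale, size, l in items)
        den = sum(g(scale, size) for _offset, scale, size, _l in items)
        return num / den

    xs = [(bx, bw / square, bw, lx) for (bx, by, bw, bh), (lx, ly) in placements]
    ys = [(by, bh / square, bh, ly) for (bx, by, bw, bh), (lx, ly) in placements]
    return np.array([solve_axis(xs, g_x), solve_axis(ys, g_y)])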
4.3 Learning algorithms
The formulae to move the feature points of the BS with the normalization are (6)-(8) and (11). The parallel translation had been applied before execution. We performed three types of learning corresponding to the above dictionaries: improving dictionary 1-1 without normalization (2-1), improving dictionary 1-2 with the normalization (2-2), and improving dictionary 1-3 using the inversely mapped learning patterns as the learning pattern set (2-3). The learning pattern set is again Nakayosi. Table 2 shows the recognition rates for Nakayosi itself. The result shows the advantage of the learning algorithms over the previous dictionary and the superiority of 2-2 over the others.
Table 2. Recognition rates using the learning algorithms.
Dictionary    2-1       2-2       2-3       previous
Rate          88.7 %    89.1 %    88.8 %    84.4 %
4.4 Evaluation by testing patterns
True evaluation requires a testing pattern set unused for the learning. The database HANDS_kuchibue_d_97_06 contains 1,435,440 on-line handwritten Japanese characters written by 120 writers [9]. We employed 1,434,120 patterns in this database (excluding JIS level-2 Kanji patterns) as the testing set. Table 3 shows the recognition rates with the SCPR dictionaries improved in Section 4.3. We analyze the effect of learning by considering two groups of characters: one group includes characters whose dictionary patterns are made from BS's without size reduction, i.e., with identical mapping, while the other includes characters whose dictionary patterns are made from BS's with size reduction. The testing set contains 738,360 character patterns of the first group and 695,760 of the second. Table 4 shows the recognition rates for the two groups. From these results we conclude:

1. The learning algorithms are effective for the testing pattern set as well.
2. The normalization proposed here improves the learning effect.
3. The learning with the normalization is effective for the group of characters whose patterns are made from BS's with reductive mapping and not effective for the other group of characters, but its side effect is almost negligible.

Table 3. Recognition rates for the testing dataset.
Dictionary    2-1       2-2       2-3       previous
Rate          86.8 %    87.2 %    87.0 %    83.1 %

Table 4. Recognition rates for the two groups of testing sets.
Dictionary               2-1       2-2       2-3       previous
Identical mapping        86.0 %    85.8 %    85.7 %    79.6 %
Mapping with reduction   87.7 %    88.6 %    88.4 %    86.7 %
5. Conclusion
In this paper we have proposed a prototype learning algorithm for structured character pattern representation. We have shown the advantage of the algorithm when it is combined with a normalization that reflects the feature distributions of raw patterns onto the subpattern prototypes. By applying the learning method to on-line handwritten character recognition, the recognition rate for the testing dataset has been improved to 87.2% (over 4 points higher than before learning).

References
1. T. Kohonen, Improved versions of learning vector quantization, Proc. IJCNN, Vol. 1, pp. 545-550, 1990.
2. S. Geva and J. Sitte, Adaptive nearest neighbor pattern recognition, IEEE Trans. Neural Networks 2(2), pp. 318-322, 1991.
3. B.-H. Juang and S. Katagiri, Discriminative learning for minimum error classification, IEEE Trans. Signal Processing 40(12), pp. 3043-3054, 1992.
4. A. Sato and K. Yamada, A formulation of learning vector quantization using a new misclassification measure, Proc. 14th ICPR, Brisbane, pp. 322-325, 1998.
5. C.-L. Liu and M. Nakagawa, Evaluation of prototype learning algorithms for nearest-neighbor classifier in application to handwritten character recognition, Pattern Recognition 34, pp. 601-615, 2001.
6. M. Nakagawa, K. Akiyama, L. V. Tu, A. Homma and T. Higashiyama, Robust and highly customizable recognition of on-line handwritten Japanese characters, Proc. 13th ICPR, Vol. III, pp. 269-273, 1996.
7. M. Nakagawa and L. V. Tu, Structural learning of character patterns for on-line recognition of handwritten Japanese characters, Proc. SSPR 96, pp. 180-188, 1996.
8. K. Akiyama and K. Ishigaki, A method of generating high quality online character recognition dictionary based on training samples, Technical Report of IEICE Japan, PRMU99-235, pp. 31-36, 2000 (in Japanese).
9. M. Nakagawa and K. Akiyama, A linear-time elastic matching for stroke number free recognition of on-line handwritten characters, Proc. 4th IWFHR, pp. 48-56, Dec. 1994.
10. M. Nakagawa, T. Higashiyama, Y. Yamanaka, S. Sawada, L. Higashigawa and K. Akiyama, On-line handwritten character pattern database sampled in a sequence of sentences without any writing instructions, Proc. 4th ICDAR, Vol. 1, pp. 376-381, Aug. 1997.