Architecture for Classifier Combination Using Entropy Measures

K. Ianakiev and V. Govindaraju
Center of Excellence for Document Analysis and Recognition (CEDAR)
Department of Computer Science & Engineering
UB Commons, Suite 202, Amherst, NY 14228-2567
Email: govind@cedar.buffalo.edu
Phone: (716) 645 6164 x 103; Fax: (716) 645 6176

Abstract

In this paper we emphasize the need for a general theory of classifier combination. Presently, most systems combine recognizers in an ad hoc manner. Recognizers can be combined in series and/or in parallel, and empirical methods for choosing among the possibilities can become extremely time consuming, given the very large number of candidate configurations. We have developed a method of systematically arriving at the optimal architecture for combination of classifiers that can include both parallel and serial methods. Our focus in this paper, however, is on serial methods. We also derive theoretical results to lay the foundation for our experiments. We show how a greedy algorithm that strives for entropy reduction at every stage leads to results superior to ad hoc combination methods; in our experiments we have seen an advantage of about 5% in certain cases.


1 Introduction

Machine recognition of isolated handwritten words, especially cursive script, is a difficult problem. The problem is made tractable only when constrained by a lexicon, a list of words that includes the truth of the image as one of its elements. Research in handwritten word recognition (HWWR) has traditionally focused on relatively small lexicons, typically comprised of 10 - 1000 entries. Features extracted from the image are matched against every lexicon entry by an expensive matching algorithm, and a confidence value is computed for each lexicon entry [1]. The lexicon entries are ranked in decreasing order of confidence.

While this paradigm has proven to be sufficient for many applications, from check amounts to street names in mail addresses, there are other applications in which the lexicons are large (of the order of 10,000 entries or more) and it is no longer practical to compare the features extracted from the image with every lexicon entry. Some means of rapidly eliminating large parts of the lexicon as unlikely matches is called for. This process is called lexicon reduction, and it serves to rapidly trim the original lexicon down to a tractable size for a word classifier. It is well known that classifier performance declines with increasing lexicon size. This may be attributed to the presence in large lexicons of several entries that the classifier finds difficult to distinguish from the reference. By eliminating some of these entries, lexicon reduction results in improved recognition performance.

Classifiers can be combined either in series or in parallel. Figure 1 shows the possible architectures. This paper is about developing a general theory for the combination of classifiers. In particular, we want to discuss the theoretical and complexity issues pertaining to lexicon reduction and the serial combination of classifiers. Let us assume for the discussion in this paper that the word recognizer is the classifier and the lexical entries are the classes. Further, let us begin by discussing the case of two classifiers. We shall later see how the methodology

Figure 1: Two Classifier Combination Models. Parallel combination methods have the classifiers acting independently, both looking at the same original lexicon L0, with their results merged by a combination module. Serial methods are clearly dependent on each other, as the second classifier in the series depends on the results (the reduced lexicon L1) of the first classifier.

developed can be readily generalized to any number of classifiers. The method of parallel combination of classifiers submits the same lexicon to both classifiers and combines the results using a variety of methods, such as logistic regression and Borda count [3, 4]. In serial combination, on the other hand, classifiers that operate later in the engine deal with smaller lexicons; in fact, lexicon reduction is central to serial combination methods. Table 1 shows a few serial combination methods using 3 classifiers.

We note that the number of possibilities is very large. First, if there are 3 classifiers, they can be ordered in 3! = 6 ways. Further, given the original lexicon L0 of length |L0|, the reduced lexicon output by classifier C1 can be of |L0| - 2 different sizes (not counting the cases where the entire lexicon or just 1 entry is returned). For |L0| = 2768, this amounts to 6 x 2766 = 16,596 possible configurations for the architecture.
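To make the size of this search space concrete, the count above can be reproduced in a few lines of Python. This is a minimal sketch using the paper's numbers; the function name is ours.

```python
from math import factorial

def serial_configurations(n_classifiers: int, lexicon_size: int) -> int:
    """Count the serial architectures enumerated above: n! orderings of the
    classifiers times the possible sizes of the reduced lexicon produced by the
    first classifier (excluding returning the whole lexicon or a single entry)."""
    orderings = factorial(n_classifiers)
    first_stage_sizes = lexicon_size - 2
    return orderings * first_stage_sizes

print(serial_configurations(3, 2768))  # 6 * 2766 = 16596
```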

Architecture | Ax | Ay | Final accuracy
2768 → classifier 1 → 50 → classifier 2 → 10 → classifier 3 → 1 | A50 = 70.87% | A10 = 39.23% | 34.40%
2768 → classifier 2 → 500 → classifier 1 → 50 → classifier 3 → 1 | A500 = 62.77% | A50 = 60.77% | 46.60%
2768 → classifier 1 → 50 → classifier 3 → 1 | A50 = 70.87% | - | 52.40%
2768 → classifier 1 → 10 → classifier 3 → 1 | A10 = 66.43% | - | 53.83%

Table 1: There are n! different ways in which n classifiers can be arranged in series. Ax indicates the accuracy of reduction after the first stage of classification and Ay after the second stage; x and y are the sizes of the reduced lexicons.

[Table 2: Various configurations are possible when parallel and serial methods are mixed; Ax, Ay, and Az have the same meaning as above. The two example configurations shown, which start from the 2768-entry lexicon and merge reduced lexicons by a union of lexicons before later classifiers, achieve final accuracies of 46.83% (with A500 = 62.77%, A50 = 26.80%, A10 = 57.10%) and 52.40% (with A100 = 70.87%, A10 = 70.87%, A5 = 27.77%).]

The motivation for this paper stems from the difficulty a designer confronts in choosing the correct architecture. Empirical methods are typically used by researchers; however, as the number of classifiers and the size of the lexicon (L0) increase, the situation quickly gets out of hand. Our objective is to develop the theory that will provide guidelines for choosing the best method of serial combination.

Combination method | Final accuracy
Parallel: classifier 1 and classifier 3 combined by linear regression (lexicon of 1704) | 79.08%
Parallel: classifier 1 and classifier 3 combined by Borda count (lexicon of 1704) | 79.74%
Serial: 1704 → classifier 1 → 5 → classifier 3 → 1 | 83.01%
GREEDY (our method described here) | 84.31%

Table 3: The GREEDY algorithm described in this paper performs better than the traditionally used parallel combination methods and the ad hoc serial combination methods.

If one were to consider parallel combination methods as well, the possible configurations increase further. Table 2 shows how different architectures can be configured by mixing the notions of serial and parallel combination. We have shown just a few examples; it should be apparent that the number of possibilities is very large. We will describe a universal combination architecture that allows us to enumerate every possible configuration. Searching for the optimal choice of architecture is clearly an open research problem. We will develop a greedy algorithm that uses entropy measures to search for the optimal architecture. We have experimented with 3 classifiers and present our results (Table 3).

2 Universal Combinator

Let us consider the following model for possible combinations of $N$ classifiers $C_1, C_2, \ldots, C_N$ (Figure 2). Without loss of generality we can assume that the classifiers run in the order in which they are enumerated. Given an unknown pattern $x$ and a lexicon $L_1$, the run of the first classifier $C_1$ produces a ranked list $R_1$ of the words in the lexicon and their associated probabilities. A part $R_1^0$ of that ranked list is sent to the final decision maker, a part $R_1^2$ contributes to the lexicon (via a UNION and/or INTERSECTION with other parts of the lexicon) of the second classifier $C_2$, another part $R_1^3$ is used for building the lexicon of the third classifier $C_3$, and so on. Thus parts $R_1^2, R_1^3, \ldots, R_1^N$ of the ranked list $R_1$ are used to build the lexicons of the classifiers that run after the first one: $R_1^2$ for $C_2$, $R_1^3$ for $C_3$, and so on up to $R_1^N$ for $C_N$.

When the run of the first classifier is over, it is the turn of the second classifier $C_2$ to run with lexicon $L_2$, built from the part $R_1^2$ of the ranked list produced by $C_1$. The result of the run of $C_2$ is again a ranked list $R_2$ of the words in the lexicon $L_2$ and their associated probabilities. A part $R_2^0$ of that ranked list is sent to the final decision maker, a part $R_2^3$ contributes to the lexicon of the third classifier $C_3$, another part $R_2^4$ is used to build the lexicon of the fourth classifier $C_4$, and so on. When the run of the second classifier is over, the third classifier $C_3$ runs with lexicon $L_3$, built from the part $R_1^3$ of the ranked list produced by $C_1$ and the part $R_2^3$ of the ranked list produced by $C_2$. The output is a ranked list $R_3$ of the words in the lexicon $L_3$ and their associated probabilities, parts of which are used to build the lexicons of the classifiers that follow. The same procedure is repeated for all classifiers in the recognition engine. The final decision maker outputs the final decision as a ranked list or a top choice.

Our conjecture is that the universal classifier combination model with suitably chosen parameters $R_1^0, R_1^2, R_1^3, \ldots, R_1^N, R_2^0, R_2^3, \ldots, R_2^N, \ldots, R_{N-1}^0, R_{N-1}^N$ represents all possible combinations of $N$ classifiers. The model gets its power from the fact that certain

Figure 2: Universal Classifier Combination Model. $L_i$ denotes the lexicon used by classifier $i$; $R_i^j$ denotes the part of the ranked list produced by classifier $i$ that is used in forming lexicon $L_j$, with $R_i^0$ going to the final decision maker.

$R_p^q$ values can be 0. In fact, if the only non-zero values are accorded to $R_1^0, \ldots, R_{N-1}^0$, then the classifiers only contribute to the final decision combinator (Figure 2) and the architecture becomes purely parallel. Further details of the universal combinator are reserved for another paper in preparation [9].
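To make the data flow concrete, the following Python fragment is a small illustrative sketch, not the paper's implementation: the classifier interface, the `universal_combinator` function, and the `sizes` parameterization of $|R_i^j|$ are our assumptions. It shows how parts of each ranked list are routed to later classifiers' lexicons and to the final decision maker.

```python
from typing import Callable, Dict, List, Tuple

# A classifier is assumed to map (pattern, lexicon) -> ranked list of (word, probability).
Classifier = Callable[[object, List[str]], List[Tuple[str, float]]]

def universal_combinator(x: object,
                         lexicon: List[str],
                         classifiers: List[Classifier],
                         sizes: Dict[Tuple[int, int], int]) -> List[Tuple[str, float]]:
    """sizes[(i, j)] = |R_i^j|, the number of top entries of classifier i's ranked
    list forwarded to lexicon L_j (j > i), or to the decision maker when j == 0.
    A size of 0 means that link is absent, which recovers purely serial or purely
    parallel architectures as special cases."""
    n = len(classifiers)
    lexicons = {1: list(lexicon)}            # L_1 is the initial lexicon
    to_decision = []                         # parts R_i^0 sent to the decision maker
    for i, clf in enumerate(classifiers, start=1):
        ranked = clf(x, lexicons.get(i, []))                # ranked list R_i
        to_decision.extend(ranked[:sizes.get((i, 0), 0)])   # R_i^0
        for j in range(i + 1, n + 1):                       # R_i^j builds L_j (union)
            part = [w for w, _ in ranked[:sizes.get((i, j), 0)]]
            lexicons[j] = list(dict.fromkeys(lexicons.get(j, []) + part))
    # The final decision maker here simply re-sorts everything it received.
    return sorted(to_decision, key=lambda wp: wp[1], reverse=True)
```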

3 Entropy Measure

Given a lexicon $L_1$, an unknown pattern $x$, and a classifier $C_1$ which assigns a probability $p_w$ to each word $w$ in the lexicon, the initial entropy of the system is given by

$$E_1 = -\sum_{w \in L_1} p_w \ln p_w .$$
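For illustration, with a four-word lexicon and posterior probabilities $(0.7, 0.1, 0.1, 0.1)$ the initial entropy is

$$E_1 = -0.7\ln 0.7 - 3\,(0.1\ln 0.1) \approx 0.94 \ \text{nats},$$

whereas a uniform posterior over the same four words would give $\ln 4 \approx 1.39$ nats; the more peaked the classifier's output, the lower the entropy.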

Our conjecture is that the entropy monotonically decreases as the lexicon keeps getting smaller. There are two cases to be considered. We show here that the conjecture holds if the classifiers are error free, that is, at every stage the classifier preserves the true choice in the reduced lexicon. Since the last classifier in a serial engine returns just one choice, this case assumes that the final recognition choice returned by the cascade of classifiers is correct. In fact, the conjecture also holds for cases when the classifier is not necessarily error-free; we skip the proof of this second case here.

Lexicon $L_1$ is split into $L_2$ and $L_1 \setminus L_2$. $H_2$ is the entropy associated with the lexicon $L_2$ (with the corresponding a posteriori probabilities) and $G_2$ is the entropy associated with the complement $L_1 \setminus L_2$. The total entropy of the system is given by

$$E_2 = \alpha_2 H_2 + (1 - \alpha_2) G_2 ,$$

where $\alpha_2$ is a parameter. Lexicon $L_2$ is split into $L_3$ and $L_2 \setminus L_3$. $H_3$ is the entropy associated with the lexicon $L_3$ and $G_3$ is the entropy associated with the complement $L_2 \setminus L_3$. The total entropy of the system is given by

$$E_3 = \alpha_2 \bigl( \alpha_3 H_3 + (1 - \alpha_3) G_3 \bigr) + (1 - \alpha_2) G_2 ,$$

where $\alpha_2$ and $\alpha_3$ are parameters. We can prove that the entropy of the system keeps decreasing as the lexicon gets reduced (under certain conditions), provided the reduced lexicon always contains the true choice.¹ If after application of classifier $C_i$ the lexicon is reduced to $L_1 = \{c_1, \ldots, c_k\}$, then the new probability of $x$ being the word $c_i$ is $\frac{p_i}{p_1 + \cdots + p_k}$.

Let the initial entropy of the system be $E_0 = -\sum_{i=1}^{n} p_i \ln(p_i)$. Since the classifier $C_i$ is error-free, we can choose $\alpha_1 = 1$. The entropy of the system after the reduction is

$$E_1 = -\sum_{i=1}^{k} \frac{p_i}{p_1 + \cdots + p_k} \ln\!\left(\frac{p_i}{p_1 + \cdots + p_k}\right) = -\frac{1}{p_1 + \cdots + p_k} \sum_{i=1}^{k} p_i \ln(p_i) + \ln(p_1 + \cdots + p_k) .$$

Let $S = p_1 + \cdots + p_k$. Then

$$E_0 - E_1 = \left(\frac{1}{p_1 + \cdots + p_k} - 1\right) \sum_{i=1}^{k} p_i \ln(p_i) - \sum_{i=k+1}^{n} p_i \ln(p_i) - \ln(p_1 + \cdots + p_k) ,$$

which is greater than

$$\left(\frac{1}{p_1 + \cdots + p_k} - 1\right)(p_1 + \cdots + p_k) \ln\frac{p_1 + \cdots + p_k}{k} - \bigl(1 - (p_1 + \cdots + p_k)\bigr)\ln\bigl(1 - (p_1 + \cdots + p_k)\bigr) - \ln(p_1 + \cdots + p_k) .$$

Hence,

$$E_0 - E_1 \geq -S \ln S - (1 - S)\ln k - (1 - S)\ln(1 - S) .$$

Let us define $f(y)$ (Figure 3):

$$f(y) = -y \ln y - (1 - y)\ln k - (1 - y)\ln(1 - y) .$$

¹ We use the following two results:

Corollary 1:
$$(x + y)\ln\frac{x + y}{2} \;\leq\; x \ln x + y \ln y \;\leq\; (x + y)\ln(x + y) .$$

Corollary 2:
$$(x_1 + x_2 + \cdots + x_n)\ln\frac{x_1 + x_2 + \cdots + x_n}{n} \;\leq\; x_1 \ln x_1 + x_2 \ln x_2 + \cdots + x_n \ln x_n \;\leq\; (x_1 + x_2 + \cdots + x_n)\ln(x_1 + x_2 + \cdots + x_n) .$$

Figure 3: Graph of $f(y)$, annotated with the value $-\ln(k)$ at $y = 0$, the zero crossing $r_k$, the maximum at $y = k/(k+1)$, and $y = 1$.

$$f'(y) = \ln\frac{k(1 - y)}{y} .$$

Note that if $S \geq r_k$, the zero crossing of $f$ shown in Figure 3, then $E_0 \geq E_1$, i.e., the entropy is decreasing.
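As a quick numerical sanity check of this bound (an illustrative sketch, not part of the paper; the function names are ours), the following Python fragment draws a random posterior over an $n$-word lexicon, performs an error-free reduction to the top $k$ entries, and verifies that $E_0 - E_1 \geq f(S)$:

```python
import math
import random

def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

def check_reduction(n=100, k=10, seed=0):
    random.seed(seed)
    ps = sorted((random.random() for _ in range(n)), reverse=True)
    total = sum(ps)
    ps = [p / total for p in ps]                 # posterior over the full lexicon
    e0 = entropy(ps)
    s = sum(ps[:k])                              # mass retained by the top-k entries
    e1 = entropy([p / s for p in ps[:k]])        # entropy after error-free reduction
    f_s = -s * math.log(s) - (1 - s) * math.log(k) - (1 - s) * math.log(1 - s)
    print(f"E0={e0:.4f}  E1={e1:.4f}  E0-E1={e0 - e1:.4f}  f(S)={f_s:.4f}")
    assert e0 - e1 >= f_s - 1e-12

check_reduction()
```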

3.1 Probability Values for Lexical Entries

Most methods take the image of a handwritten word and a lexicon of possible words, and rank the lexicon based on the "goodness" of match between each lexicon entry and the word image. Typically, the word recognizer computes a measure of "similarity" between each lexicon entry and the word image and uses this measure to sort the lexicon in descending order of similarity. The lexicon entry with the highest similarity is the top choice of the recognizer. The top m choices are often referred to as the confusion set, as it contains the lexicon entries that are "similar" in some feature space to the lexicon entry that matches the truth.

We have developed elsewhere [8] the groundwork for the use of Bayesian methodology in the integration of recognizers with any subsequent processing, by deriving meaningful probabilistic measures from recognizers. This allows us to compute the entropy values.
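The actual mapping from recognizer scores to probabilities follows the Bayesian methodology of [8]. As a stand-in for readers who want to experiment, a simple normalization such as the softmax below can be used; this is an assumed substitute, not the method of [8], and the function name and example scores are hypothetical.

```python
import math
from typing import Dict

def scores_to_probabilities(scores: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Convert raw similarity scores over a lexicon into a posterior-like
    distribution via a softmax; higher scores receive higher probability."""
    exps = {w: math.exp(s / temperature) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

# Example: three lexicon entries with recognizer confidences
print(scores_to_probabilities({"Amherst": 3.1, "Amhurst": 2.7, "Buffalo": 0.4}))
```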

4 Greedy Algorithm

We have developed a "greedy" algorithm to dynamically construct a combinator of N classifiers. Given a test pattern x and a lexicon L, we first apply the classifier with the best recognition rate on lexicons of size |L|. We reduce the lexicon so as to minimize the entropy. This reduced lexicon is sent to the next classifier. We continue the process until we cannot reduce the entropy any further or the last classifier is exhausted. The top choice of the last classifier used is the final recognition choice.

Algorithm

input: pattern x, lexicon L

for (i = 1; i <= N; i++) {

1. Choose from the unused classifiers the one with the best accuracy performance on a lexicon of size |L|.
2. Perform the classification and create a ranked list R from the words in the lexicon L according to their confidences.
3. if (i == N) { return the first entry of R; exit }
4. for (k = 1; k <= |L|; k++) {
   - R1 is formed from the first k words of the ranked list R, with new a posteriori probabilities.
   - R2 is formed from the rest of the words of the ranked list R, with new a posteriori probabilities.
   - Choose an appropriate α (for example, the probability of the true choice being present in the lexicon R1).
   - Compute the entropy Hk.
   }
5. Find the minimal entropy among Hk, 1 <= k <= |L|.
6. if (H|L| is the minimum) { mark the classifier as used, but do not use it in the combination; keep the same lexicon L }
7. if (H1 is the minimum) { return the first entry of R; exit }
8. if (Hm is the minimum, 1 < m < |L|) { mark the classifier as used; keep in the lexicon L only the first m words of the ranked list R }

}

5 Experiments

Three word classifiers are used in our experiments: CMWR (Character Model Word Recognizer) [6] is classifier 1, WMWR (Word Model Word Recognizer) [5] is classifier 2, and HOL (a holistic recognizer) [7] is classifier 3. Each of them takes as input a binary image and an ASCII lexicon, computes a probability for each lexicon entry, and ranks the lexicon by decreasing probability.

WMWR is a fast lexicon-driven analytical classifier that operates on the chain-coded description of the street name image. Following slant normalization and smoothing, the image is "oversegmented" at likely character segmentation points. The resulting segments are grouped, and the extracted features are matched against the letters in each lexicon entry using a dynamic programming algorithm.

CMWR adopts a different approach. After preprocessing and oversegmentation, segments are grouped in various ways and OCR is performed on the groups to obtain a graph of possible character candidates. For each lexicon entry, the best path through the graph is then determined. CMWR is computationally more expensive than WMWR. The two recognizers are sufficiently orthogonal, in approach as well as in the features used, to be useful in a combination strategy. HOL does not perform any segmentation, but uses holistic information such as word length, ascenders, and descenders to classify the image.

The individual performances of CMWR and WMWR are shown in Figure 4. The Oracle represents the method of combination in which a correct result is obtained in the top choice if either of the recognizers has it correct.

Figure 4: Performance of the word classifiers CMWR, WMWR, and their Oracle combination (accuracy (%) against top n). The Oracle combination shows the best possible results that one could obtain from the combination of WMWR and CMWR.

Tables 4 and 5 show how the entropy of the system reduces with lexicon reduction. In this case the classifiers are not error free; however, the entropy still goes down. Table 6 shows that the GREEDY method finds the lowest entropy when compared to other, ad hoc methods of determining the architecture of combination.

Architecture: 1704 → classifier 1 → 10 → classifier 2 → 5 → classifier 3 → 1

Image Sample | after C1 | after C2 | final entropy
1 | 5.577219 | 3.614541 | 2.814851
2 | 5.265298 | 3.447350 | 2.646646
3 | 5.537880 | 3.594105 | 2.793283
4 | 5.047477 | 3.328180 | 2.530059
5 | 5.549012 | 3.598829 | 2.799862
6 | 5.577230 | 3.615262 | 2.814670

Architecture: 1704 → classifier 1 → 50 → classifier 2 → 10 → classifier 3 → 1

Image Sample | after C1 | after C2 | final entropy
1 | 6.294333 | 4.193750 | 3.366025
2 | 5.946801 | 4.045334 | 3.217013
3 | 6.250293 | 4.175640 | 3.347035
4 | 5.701441 | 3.938757 | 3.112000
5 | 6.263592 | 4.180160 | 3.352859
6 | 6.294185 | 4.194074 | 3.365882

Table 4: Note the reduction in entropy after each stage of lexicon reduction on 6 example image samples, for two different architectures.

Architecture: 1704 → classifier 1 → 5 → classifier 3 → 1

Image Sample | after C1 | final entropy
1 | 3.730102 | 2.761246
2 | 3.554187 | 2.584103
3 | 3.708750 | 2.738524
4 | 3.428433 | 2.461479
5 | 3.713430 | 2.745451
6 | 3.730988 | 2.761039

Architecture: 1704 → classifier 1 → 10 → classifier 3 → 1

Image Sample | after C1 | final entropy
1 | 4.064764 | 2.783761
2 | 3.897464 | 2.615539
3 | 4.044601 | 2.762236
4 | 3.778463 | 2.498956
5 | 4.049117 | 2.768770
6 | 4.065277 | 2.783551

Table 5: Note the reduction in entropy after each stage of lexicon reduction on 6 example image samples, for two different architectures.


Image Sample | GREEDY | 1704 → classifier 1 → 5 → classifier 3 → 1 | 1704 → classifier 1 → 10 → classifier 3 → 1 | 1704 → classifier 1 → 10 → classifier 2 → 5 → classifier 3 → 1
1 | 2.76 | 2.76 | 2.78 | 2.81
2 | 2.58 | 2.58 | 2.62 | 2.65
3 | 2.74 | 2.74 | 2.76 | 2.79
4 | 2.46 | 2.46 | 2.50 | 2.53
5 | 2.75 | 2.75 | 2.77 | 2.80
6 | 2.76 | 2.76 | 2.78 | 2.81
7 | 2.65 | 2.66 | 2.69 | 2.72

Table 6: Final entropy on various image samples using the GREEDY method and other, ad hoc architectures. Note that the GREEDY method usually has the lowest final entropy.

6 Conclusions

Research in handwritten word recognition has traditionally concentrated on small lexicons of 10 - 1000 words. Several real-world applications, such as the recognition of English prose, involve large lexicons of 10,000 - 50,000 words. Existing classifiers may still be used for these tasks if preceded by a lexicon reduction step. The task of lexicon reduction is to rapidly discard from the original lexicon those entries that are unlikely to match the given image. The resulting two-stage architecture is a serial combination, or cascading, of classifiers, and is an effective way of dealing with large lexicons in real-life word recognition scenarios. Moreover, by discarding entries that may potentially confuse the classifier, lexicon reduction results in improved recognition performance.

In this paper, we have presented an overview of lexicon reduction as a problem in its own right, and discussed some of the issues relating to the design and construction of different combination methods. We have presented a method of serial combination of classifiers using a GREEDY algorithm that strives for minimal entropy of the system. Its recognition accuracy is superior to that of other, ad hoc combination methods.


We have shown theoretically that if the classifiers are error free, the entropy of the system must decrease as the lexicon size keeps reducing. We have also introduced the notion of the universal combinator.

References

[1] G. Kim and V. Govindaraju. A lexicon driven approach to handwritten word recognition for real-time applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):366-379, April 1997.

[2] S. Madhvanath and V. Govindaraju. Serial classifier combination for handwritten word recognition. In Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR '95), Montreal, Canada, pp. 911-914, August 1995.

[3] T. K. Ho, J. J. Hull, and S. N. Srihari. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):66-75, 1994.

[4] L. Xu, A. Krzyzak, and C. Y. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics, 22(3):418-435, May/June 1992.

[5] G. Kim and V. Govindaraju. A lexicon driven approach to handwritten word recognition for real-time applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):366-379, 1997.

[6] J. T. Favata and S. N. Srihari. Off-line recognition of handwritten cursive words. In Proceedings of the SPIE Symposium on Electronic Imaging Science and Technology, San Jose, CA, 1992.

[7] S. Madhvanath, E. Kleinberg, and V. Govindaraju. Holistic verification for handwritten phrases. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12):1344-1356, 1999.

[8] D. Bouchaffra, V. Govindaraju, and S. Srihari. A methodology for mapping scores to probabilities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9):923-927, 1999.

[9] K. Ianakiev and V. Govindaraju. The universal combination methodology. In preparation, 2000.
