Combining Classifiers with Informational Confidence

Stefan Jaeger, Huanfeng Ma, and David Doermann
Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA
{jaeger,hfma,doermann}@umiacs.umd.edu

Summary. We propose a new statistical method for learning normalized confidence values in multiple classifier systems. Our main idea is to adjust confidence values so that their nominal values equal the information actually conveyed. In order to do so, we assume that information depends on the actual performance of each confidence value on an evaluation set. As information measure, we use Shannon’s well-known logarithmic notion of information. With the confidence values matching their informational content, the classifier combination scheme reduces to the simple sum-rule, theoretically justifying this elementary combination scheme. In experimental evaluations for script identification, and both handwritten and printed character recognition, we achieve a consistent improvement on the best single recognition rate. We cherish the hope that our information-theoretical framework helps fill the theoretical gap we still experience in classifier combination, putting the excellent practical performance of multiple classifier systems on a more solid basis.

1 Introduction

[Publication note: S. Jaeger et al.: Combining Classifiers with Informational Confidence, Studies in Computational Intelligence (SCI) 90, 163–191 (2008). © Springer-Verlag Berlin Heidelberg 2008, www.springerlink.com]

Multiple classifier systems are a relatively young field of research that is currently under intensive investigation. We have seen a major surge of publications during the last ten years, although a considerable number of publications is older. This chapter is in good company with many other papers applying multiple classifier systems to handwritten or printed character recognition [1, 2, 3, 4]. In fact, character recognition seems to be one of the most popular application targets for multiple classifier systems so far. Research on multiple classifier systems has developed into two main branches: One major research field addresses the problem of generating classifiers and compiling classifier ensembles. Typical examples are Bagging [5] and Boosting [6, 7, 8]. Several other techniques, which are outside the scope of this chapter, have been proposed in recent years and have gained some importance [9, 10]. The second major research field investigated in the recent literature deals with combination schemes for integrating the diverse outputs of classifier ensembles


into a single classification result. Our work falls into this second category, though we will address Boosting briefly at the end of this chapter. Multiple classifier systems offer an appealing alternative to the conventional monolithic classifier system. Instead of pressing all information into a single classifier, the idea of multiple classifier systems is to distribute information among more than one classifier. Distributing information over several classifiers reduces the average complexity of each classifier, making them easier to train and optimize. Another intriguing feature is that a multiple classifier system can outperform each of its individual classifiers. In fact, a multiple classifier system can achieve outstanding recognition rates even if all its constituent classifiers provide only moderate performance. Classifiers highlighting different aspects of the same classification problem are especially suitable for combination purposes since they can complement each other. For complementary classifiers, the weakness of one classifier is the strength of at least one other classifier, which compensates for false judgments made by its co-classifier. We can generate complementary classifiers by using, for example, multiple feature sets, training methods, and classification architectures. However, the combination of multiple classifiers requires integration of the individual classifier outputs into a single answer. In virtually all practical cases, classifier output is only a rough approximation of the correct a-posteriori class probabilities. These approximations usually do not meet the mathematical requirements for probabilities. While this is not a major problem for single classifier systems, it can lead to severe performance degradation in multiple classifier systems, where the individual outputs need to be combined in a mathematically sound manner.
In this chapter, we presume that each classifier returns a confidence value for its recognition result and each potential candidate in an n-best list. These confidence values denote the confidence of the classifier in its output. In order to compare confidences from different classifiers, confidence values need to be compatible with each other. Identical ranges and scales are necessary, but not sufficient, prerequisites of compatible confidence values. Each classifier in a multiple classifier system should output confidence values that are related to its actual performance. We must avoid combining classifiers that are “too optimistic” with classifiers that are “too pessimistic.” While the confidence values of the former are higher than their actual performance suggests, the latter provide confidence values that are too low and do not reflect their good performance properly [11]. In this sense, we need techniques for making confidence values of different classifiers compatible with each other. We propose an information-theoretical approach to this problem. Our method normalizes confidence values of different classifiers, acting as a repair mechanism for imprecise confidence values. The idea is to combine the information actually conveyed by each confidence value, instead of directly combining the values themselves. In order to do so, we first compute a performance estimate for each confidence value. Based on this performance estimate, we compute the informational content for each confidence value according to


the well-known logarithmic notion introduced by Shannon [12]. The newly computed informational confidence values then replace the old values, serving as a standard representation that enables “fair” comparisons. The following sections present the theoretical background of our approach, which implies a clear statement regarding the optimal classifier combination scheme: Since the natural way of combining information from different sources is to simply add the amount of information provided by each source, the sum-rule seems to be the logical combination scheme. In fact, being additive is part of the definition of information. Accordingly, we strongly advocate the use of the sum-rule for classifier combination, i.e. adding the confidence values of all classifiers, because it is optimal in the information-theoretical sense. We do not see our approach in competition with more complex combination schemes, but rather with elementary schemes such as max-rule or product-rule. In fact, our approach can be used in combination with almost any other scheme, though some schemes (e.g. product-rule) would make less sense theoretically. We therefore concentrate on four elementary combination schemes in this chapter: sum-rule, max-rule, product-rule, and majority vote. Applying the plain sum-rule does not automatically guarantee performance improvements. On the contrary, it can degrade performance. One of the main goals of this chapter is therefore to show that the sum-rule in combination with informational confidence values can provide more consistent performance improvements. We cherish the hope that our findings will contribute to overcoming the theoretical gap we have been experiencing in classifier combination for quite some time now. In addition, the proposed framework can prove useful in all kinds of sensor fusion tasks and in learning in general.
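To make the four elementary schemes concrete, here is a minimal sketch (our own illustrative code, not from the chapter; the function name `combine` and the toy scores are assumptions) that fuses the class scores of three hypothetical classifiers under each rule:

```python
from collections import Counter

def combine(scores, rule="sum"):
    """Fuse per-classifier class scores into a single class decision.

    scores: list of rows, one per classifier, each row holding that
    classifier's confidence values for every class.
    """
    n_classes = len(scores[0])
    if rule == "sum":        # add confidences class-wise
        fused = [sum(row[c] for row in scores) for c in range(n_classes)]
    elif rule == "max":      # take the largest confidence per class
        fused = [max(row[c] for row in scores) for c in range(n_classes)]
    elif rule == "product":  # multiply confidences class-wise
        fused = [1.0] * n_classes
        for row in scores:
            fused = [f * row[c] for c, f in enumerate(fused)]
    elif rule == "vote":     # majority vote over each classifier's top class
        votes = Counter(max(range(n_classes), key=row.__getitem__)
                        for row in scores)
        fused = [votes.get(c, 0) for c in range(n_classes)]
    else:
        raise ValueError(rule)
    return max(range(n_classes), key=fused.__getitem__)

scores = [[0.7, 0.2, 0.1],
          [0.4, 0.5, 0.1],
          [0.6, 0.3, 0.1]]
print(combine(scores, "sum"))   # prints 0 (class sums: 1.7, 1.0, 0.3)
```

Note that the four rules need not agree in general; on this toy input they all pick class 0, but a single over-confident classifier can easily sway the max-rule or product-rule while leaving the sum-rule decision unchanged.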
We structured the chapter as follows: Section 2 explains the idea and motivation of the proposed approach, followed by the theoretical background in Section 3. Section 4 explains how we can learn informational confidence values in practical applications and Section 5 then shows practical examples for script identification and character recognition. Finally, a summary of the main results concludes the chapter.

2 Motivation

Many combination schemes have been proposed in the literature. They range from simple voting schemes [13, 14, 15] to relatively complex combination strategies, such as the approach by Dempster/Shafer [16] or the Behavior-Knowledge Space (BKS) method introduced in [17]. The large number of proposed techniques shows the uncertainty researchers still have in this field. So far, researchers have not been able to show the general superiority of a particular combination scheme, either theoretically or empirically. Though several researchers have come up with theoretical explanations supporting one or more of the proposed schemes, a generally accepted theoretical framework for classifier combination is still missing.


Despite the high complexity of several proposed techniques, many researchers still continue to experiment with much simpler combination schemes. In fact, simple combination schemes have resulted in high recognition rates, and it is by no means clear that more complex methods are superior to simpler ones, such as sum-rule or product-rule. The sum-rule (product-rule), for instance, simply adds (multiplies) the score for every class provided by each classifier of a classifier ensemble, and assigns the class with the maximum score to the input pattern. Interesting theoretical results have been derived for these simple combination schemes. For instance, Kittler et al. showed that the sum-rule is less sensitive to noise than other rules [18]. As stated above, we also advocate the use of the sum-rule, but for different reasons. The main motivation of our chapter lies in an idea introduced by Oberlaender more than ten years ago [11]. Though the approach presented here is more elaborate, and the common ground between both works may no longer be easy to detect, it was his approach that gave the main impetus for the technique described here. Oberlaender worked on a method for improving the recognition rate of a single classifier system by adjusting the classifier’s confidence values. He measured the classification rate of a confidence value by computing the recognition rate on all test patterns classified with this particular confidence value. For each classification, he then replaced each output confidence value with its corresponding recognition rate stored in a look-up table. Figure 1 illustrates this idea for two confidence values: On the left-hand side, confidence

Fig. 1. Alignment of confidence values with recognition rates

value K_1^old performs better than its nominal value suggests. Consequently, this confidence value is mapped to the larger value K_1^new, which better reflects the true classification performance. According to Oberlaender, the classifier is “too pessimistic” when it outputs K_1^old, since the confidence value is lower than its actual recognition rate. The left-hand side of Figure 1 also shows an example of a confidence value that is too optimistic: confidence value K_2^old


has a nominal value that is higher than its corresponding recognition rate. In this case, the lower value K_2^new better reflects the actual performance. A potential problem with Oberlaender’s approach is that the order of confidence values may be reversed by this process. The right-hand side of Figure 1 shows an example where this is the case: While, in the previous example, K_1^old is smaller than K_2^old and K_1^new is smaller than K_2^new, i.e. monotonicity has not been violated by the replacement process, the example on the right-hand side of Figure 1 shows a reversed order after replacing each confidence value with its new value. The new value K_2^new is now smaller than K_1^new. Generally speaking, a classifier should not provide confidence values that violate monotonicity, i.e. larger confidence values should always entail better performance in the application domain. In practice, however, this goal is usually hard to achieve, for instance due to the limited number of training samples. One can argue that reversing the order of confidence values is in general a bad idea. After all, the classifier has been trained with typical examples from the application domain, so its confidence values, and in particular their relative order, should be meaningful. This apparent drawback motivated another approach based on an idea from Velek et al., who proposed a method in which confidence values still depend on their respective recognition rates, but their relative order is never reversed. This approach has been successfully applied to Japanese character recognition [19, 20, 21]. The key idea is to let confidence values progress according to the partial sum of correctly recognized patterns. This process, which warps the old confidence values, can be stated mathematically as follows:

    K_i^new = R_i = (1/N) * Σ_{k=0}^{i} ncorrect(K_k^old),    (1)

where K_i is the i-th confidence value, assuming discrete values, and N is the number of all patterns tested.
The term R_i denotes the relative number of patterns correctly classified with a confidence smaller than or equal to K_i^old. Equation 1 expresses this number using the helper function ncorrect(K_k^old), which returns the number of patterns correctly classified with confidence K_k^old. Since R_i is a monotonically increasing function, the new confidence values K_i^new will also increase monotonically and will thus not change the relative order of the old confidence values. The following section introduces a new method that picks up the ideas presented above, but realizes them in a general information-theoretical framework.
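As a concrete illustration, the warping in Equation (1) can be computed from an evaluation set roughly as follows (a sketch; the function name `warp_confidences` and the toy data are our own):

```python
from collections import Counter

def warp_confidences(results):
    """Equation (1): map each discrete confidence value K to the cumulative
    fraction of correctly recognized patterns with confidence <= K.

    results: (confidence, correct?) pairs from an evaluation set.
    Returns {K_old: K_new} with K_new = R_i, monotonically increasing.
    """
    n_correct = Counter(c for c, ok in results if ok)  # ncorrect(K)
    N = len(results)                                   # all patterns tested
    mapping, running = {}, 0
    for k in sorted({c for c, _ in results}):
        running += n_correct.get(k, 0)
        mapping[k] = running / N
    return mapping

# toy evaluation set: (confidence, was the pattern classified correctly?)
results = [(0.2, False), (0.2, True), (0.5, True), (0.5, True), (0.9, True)]
print(warp_confidences(results))   # {0.2: 0.2, 0.5: 0.6, 0.9: 0.8}
```

Because the mapping is built from a running sum, it can never swap two confidence values, which is exactly the property that distinguishes this warping from Oberlaender’s direct replacement.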

3 Informational Confidence Our approach is based on a simple observation: If the confidence values of a classifier convey more information than provided by the classifier itself, the additional information lacks foundation and is thus redundant. On the other


hand, if the confidence values convey less information than suggested by the classifier, essential information is lost. Consequently, confidence values that contain exactly the same amount of information as provided by their classifier are optimal, or fair, because they allow a fair comparison between different classifiers. Throughout this chapter, we will refer to these optimal values as “optimal confidence” or “informational confidence.” In practical classifier systems, confidence values are usually just approximations of their optimal values. While this does not necessarily hamper the operation of a single classifier, which usually depends only on the relative proportion of confidence values, it causes problems in multiple classifier systems, where we need the correct amount of information conveyed for combination purposes. We are going to propose an information-theoretical learning mechanism for informational confidence. We assume that the amount of information conveyed by a confidence value depends directly on its performance, i.e. its positive feedback within the application domain. The higher the performance of a confidence value, the higher its informational content and thus the informational confidence. For the time being, we do not further specify what “performance” actually means in this context. In fact, we will see that the precise mathematical definition of “performance” follows from the observations and considerations made above, as soon as we assume a linear relationship between information and confidence, and use the negative logarithm as the definition of information. We begin the formalization with notations for confidence and performance: Let K_C be a set of confidence values for a classifier C:

    K_C = {K_0, K_1, ..., K_i, ..., K_N}    (2)

Furthermore, let p(K_i) denote the performance of the i-th confidence value K_i. We assume that the set of confidence values is either finite or that we can map it to a finite set by applying an appropriate quantization. Let us agree on identifying the lowest confidence with K_0 and the highest confidence with K_N. Note that this restriction of confidence values to a finite set of discrete values is not required per se, but it will later allow us to estimate performance values on a training set. Discrete confidence values are implicitly introduced whenever we process continuous values on a digital computer. Claude Shannon’s theory of communication serves as the starting point for the idea of informational confidence [12]. Shannon’s definition of information has numerous applications in many fields and can be considered a major milestone of research in general. Readers interested in Shannon’s theory will find good introductions to information theory in [22, 23, 24]. According to Shannon, the information of a probabilistic event is the negative logarithm of its probability. Figure 2 shows his definition of information for both a probabilistic event with probability x and its complement with probability 1 − x. Shannon’s logarithmic definition of information is motivated by two main characteristics, which in combination are unique features of the negative logarithm: First, a likely event conveys less information than an unlikely one.


Fig. 2. Information I(x) = − log(x) for probability x, and I(1 − x) = − log(1 − x) for its complement

Or, in other words, a very likely event does not cause much “surprise” since it happens very often and we thus know it partly beforehand. Second, the information of two independent events occurring simultaneously equals the sum of the information independently provided by each event:

    I(x * y) = I(x) + I(y)    (3)

The only degree of freedom Shannon’s definition allows is the base of the logarithm, which determines the unit that information is measured in. Base 2 is very often used by researchers and programmers alike. However, we choose the natural logarithm ln with base e as our information unit, for reasons that will become clear below. In Figure 2, we already used the natural logarithm to show the information conveyed by a probabilistic event. The last requisite needed for defining the information of a confidence value K_i is the performance function p(K_i), which measures the performance, or reliability, of K_i under C. As mentioned above, we impose no further constraints on this function here. The reader should think of p(K_i) as a function returning values between 0 and 1 that are related to the probability that a pattern is correctly classified given the confidence value K_i. We assume that the information conveyed by a confidence value depends directly on its reliability. In particular, we assume that a highly reliable confidence value K_i (a confidence value with high performance p(K_i)) provides more information than a less reliable confidence value. Mathematically, we achieve this desired behavior by using the complement p̄(K_i), instead of p(K_i), as the argument of the logarithm when computing the information of a confidence value. The complement p̄(K_i), computed simply as 1 − p(K_i), denotes the uncertainty one has to take into account when a pattern is classified with confidence K_i.


Having made these reasonable assumptions, we now establish the following linear relationship between old and new informational confidence values:

    K_i^new = E * I(p̄(K_i^old)) + C    (4)

In (4), the variable E is a multiplying factor influencing only the scale. However, we will later see that E also represents an expectation value in the statistical sense. The term C is merely a constant specifying an offset. It will play no further role in this chapter and will thus be set to zero in the following. The expression I(p̄(K_i^old)) in (4) denotes the information conveyed by the complement p̄(K_i^old) of p(K_i^old), and implements the desired feature of providing more information for better performances of K_i. Inserting the definitions of information and performance complement into (4), and setting C to 0, leads to the following central relationship between confidence and information:

    K_i^new = E * I(1 − p(K_i^old)) = −E * ln(1 − p(K_i^old))    (5)

We require confidence values to satisfy this equation in order to qualify as informational confidence. The functional relationship defined by Equation (5) ensures that old, raw confidence values with no performance, i.e. p(K_i) = 0, will be mapped to an informational confidence value equal to zero. On the other hand, confidence values showing perfect performance, i.e. p(K_i) = 1, will be assigned infinite confidence. Generally speaking, we can regard Equation (5) as a function mapping raw confidence values to their informational counterparts, depending on their performance. Each performance p(K_i) determines a corresponding informational confidence value K_i, and vice versa, each confidence value K_i requires a specific performance p(K_i). Hence, Equation (5) defines an equilibrium in which informational confidence values are fixed points, i.e. K_i^new = K_i^old. In the following, we will therefore no longer distinguish between K_i^old and K_i^new, but simply write K_i for a fixed point of Equation (5).
Information and confidence become basically the same in the state of equilibrium. Using the mathematical definition of informational confidence in Equation (5), we are now able to also specify the performance function p(K_i), which has not been further investigated so far. The next subsection derives the still missing mathematical specification of p(K_i), which is essential for a practical computation of informational confidence.

3.1 Performance Function

The performance function is a direct consequence of the fixed point equation in (5). Resolving (5) for p(K_i), the following straightforward transformation produces its mathematical specification:


    K_i = E * I(1 − p(K_i))
    ⟺ K_i / E = − ln(1 − p(K_i))
    ⟺ e^(−K_i/E) = 1 − p(K_i)
    ⟺ p(K_i) = 1 − e^(−K_i/E)    (6)

This result shows that the performance function p(K_i) describes the distribution of exponentially distributed confidence values. We can therefore consider confidence a random variable with exponential density and parameter λ = 1/E. Let us repeat some basics of statistics in order to clarify this crucial result. The general definition of an exponential density function e_λ(x) with parameter λ > 0 is:

    e_λ(x) = λ * e^(−λx) for x ≥ 0, and e_λ(x) = 0 for x < 0    (7)

As a proper density, e_λ(x) integrates to one:

    ∫₀^∞ λ * e^(−λx) dx = 1    (8)
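A quick numerical sketch (our own code; the example value of E is an assumption) confirms that Equations (5) and (6) are inverses of each other:

```python
import math

E = 0.5   # classifier-specific expectation, an assumed example value

def informational_confidence(p, E):
    """Equation (5): the confidence conveying exactly the information
    of performance p."""
    return -E * math.log(1.0 - p)

def performance(K, E):
    """Equation (6): p(K) = 1 - exp(-K/E), an exponential distribution."""
    return 1.0 - math.exp(-K / E)

# mapping a performance to its informational confidence and back
# recovers the performance, confirming the fixed-point relationship
p = 0.9
K = informational_confidence(p, E)
assert abs(performance(K, E) - p) < 1e-12
```

Note how the boundary cases behave as the text demands: p = 0 yields K = 0, while p → 1 drives K toward infinity.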

Figure 3 shows three different exponential densities differing in their parameter λ, with λ = 100, λ = 20, and λ = 10, respectively.

Fig. 3. Exponential density for λ = 100, λ = 20, and λ = 10

We see that parameter


λ has a direct influence on the steepness of the exponential density: The higher λ, the steeper the corresponding exponential density function. The mathematical companion piece of a density function is its distribution. Based on a random variable’s density, a distribution describes the probability that the random variable assumes values lower than or equal to a given value k. For a random variable with exponential density e_λ(x), we can compute the corresponding distribution E_λ(k) as follows:

    E_λ(k) = ∫_{−∞}^{k} e_λ(x) dx
           = ∫₀^k λ * e^(−λx) dx
           = [−e^(−λx)]₀^k
           = 1 − e^(−λk)    (9)

Consequently, a random variable with exponential density is also called an “exponentially distributed random variable.” Figure 4 shows the corresponding distributions for the three different densities depicted in Figure 3, with λ = 100, λ = 20, and λ = 10.

Fig. 4. Exponential distribution for λ = 100, λ = 20, and λ = 10

The distribution converges to 1 with increasing confidence. Again, we see the influence of the parameter λ on the steepness of the distribution function: the higher λ, the steeper the corresponding exponential distribution. Moreover, there is a direct relation between λ and the expectation value E of the exponentially distributed random variable. Both are in inverse proportion to each other, i.e. E = 1/λ. Accordingly, the expectation


values belonging to the exponential densities in Figure 3, or distributions in Figure 4, are E = 1/100, E = 1/20, and E = 1/10, respectively.

3.2 Performance Theorem

When we compare the performance specification in (6) with the exponential distribution in (9), we see that the only difference lies in the exponent of the exponential function. In fact, we can make performance and exponential distribution the same by simply setting λ to 1/E. This relationship between performance and distribution now also sheds light on the parameter E. As mentioned above, the expectation value of an exponentially distributed random variable with parameter λ is 1/λ. The parameter E therefore denotes the specific expectation value for classifier C. We summarize this important result in the Performance Theorem:

Performance Theorem: A classifier C with performance p(K) provides informational confidence K = −E * ln(1 − p(K)) if, and only if, p(K) is an exponential distribution with expectation E.

Proof. The theorem follows from (5), (6), and (9).
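The relationship stated by the theorem can be checked empirically with a small simulation (our own sketch, not from the chapter): confidences sampled from an exponential density with parameter λ have expectation E = 1/λ, and their distribution function matches p(K) = 1 − e^(−K/E).

```python
import math
import random

# Draw many "confidences" from an exponential density with parameter lam.
random.seed(0)
lam = 10.0
E = 1.0 / lam
samples = [random.expovariate(lam) for _ in range(200_000)]

# The sample mean approximates the expectation E = 1/lambda.
mean = sum(samples) / len(samples)
assert abs(mean - E) < 0.01

# The empirical distribution at K matches p(K) = 1 - exp(-K/E) from (6).
K = 0.05
empirical = sum(s <= K for s in samples) / len(samples)
expected = 1.0 - math.exp(-K / E)
assert abs(empirical - expected) < 0.01
```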

3.3 Expected Information

We have already found that parameter E is an expectation value in the statistical sense. The performance theorem, however, reveals even more about E. Mapping raw confidence values with identical performance to the same informational confidence value, the performance theorem describes a warping process over the set of raw confidence values for monotonically increasing performances [20]. The average information I_avg(C) provided by a classifier C satisfying the performance theorem can be written in the form of a definite integral:

    I_avg(C) = E * ∫₀¹ − ln(1 − p(K)) dp(K)
             = E * ∫₀¹ − ln(p(K)) dp(K)
             = E * ∫₀¹ − ln(K) dK
             = E * [K − ln(K) * K]₀¹
             = E    (10)
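The final integration step can be verified numerically (our own sketch): with E = 1, the integral of −ln(K) over [0, 1] evaluates to 1, i.e. one Euler-bit.

```python
import math

# Midpoint-rule approximation of the integral of -ln(K) over [0, 1].
# The integrand is singular at 0 but integrable, and the midpoint rule
# avoids evaluating it at the endpoint.
n = 1_000_000
approx = sum(-math.log((i + 0.5) / n) for i in range(n)) / n
assert abs(approx - 1.0) < 1e-3
```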

We see that the average information provided by C is exactly E bits, or more precisely E Euler-bits. This is in accordance with the fact that E is an


expectation value. In fact, parameter E stands for both expected information and expected confidence, because the performance theorem is based on the assumption that information and confidence are the same. The mathematical results presented up to this point explain the meaning and implications of the parameters p(K) and E. They say nothing, however, about how to actually compute them. For classifiers that do not provide informational confidence, and thus violate the performance theorem, we need to know the specific values of E and p(K) in order to compute their correct informational confidence values. The next section explains how we can actually use the performance theorem to first estimate these values and then compute informational confidence values for a given classifier.

4 Learning

In most practical cases, classifiers do not provide informational confidence values. Their confidence values typically violate the fixed point equation in the performance theorem, indicating a distorted equilibrium between information and confidence. In order to combine such classifiers, we need a second training process in addition to the classifier-specific training methods, which teach classifiers the decision boundaries for each class. Technically, there are two different ways of restoring the equilibrium so that confidence values satisfy the performance theorem: We can adjust the expectation E and/or the confidence K. In both cases, we need to estimate the expectation and the performance for each confidence value on a training set, or rather an evaluation set, which should be different from the set that classifier C was originally trained with. For discrete confidence values, the fixed point equation of the performance theorem then reads:

    K̂_i = −Ê * ln(1 − p̂(K̂_i)),    (11)

where K̂_i is the estimated fixed point confidence for the expectation estimate Ê and performance estimate p̂(K̂_i). Accordingly, learning informational confidence can be considered a 3-step process: In the first step, we train classifier C with its specific training method and training set. In the second step, we estimate the expectation and performance for each confidence value on an evaluation set. Finally, we compute new informational confidence values according to (11) and store them in a look-up table. In all future classifications, confidence values provided by classifier C will then always be replaced with their informational counterparts stored in the look-up table. The following two subsections explain how we can compute estimates of expectation and performance on the evaluation set.
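The 3-step learning process can be sketched as follows (illustrative code; the function name `learn_lookup_table`, the toy data, and the guard against ln(0) are our own, and the expectation estimate `E_hat` is assumed to be given; step 1, training the classifier itself, happens elsewhere):

```python
import math

def learn_lookup_table(eval_results, E_hat):
    """Steps 2 and 3: fill a look-up table via Equation (11).

    eval_results: (raw_confidence, correct?) pairs from the evaluation set.
    Returns {K_raw: K_informational}.
    """
    N = len(eval_results)
    table, cum_correct = {}, 0
    for k in sorted({c for c, _ in eval_results}):
        cum_correct += sum(ok for c, ok in eval_results if c == k)
        p_hat = cum_correct / N                      # performance estimate
        p_hat = min(p_hat, 1.0 - 1e-12)              # guard against ln(0)
        table[k] = -E_hat * math.log(1.0 - p_hat)    # Equation (11)
    return table

results = [(0.2, False), (0.2, True), (0.5, True), (0.9, True)]
table = learn_lookup_table(results, E_hat=0.25)
assert table[0.2] < table[0.5] < table[0.9]          # order preserved
```

At recognition time, each raw confidence emitted by the classifier is simply looked up in this table and replaced with its informational counterpart before combination.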


4.1 Expectation Estimation

In the following, the expectation E will be an invariable constant for each classifier, depending solely on the classifier’s global recognition rate R measured on the evaluation set. In the practical experiments later in this chapter, we will use the expectation per Euler-bit as an estimate of E, instead of directly using R as E [25, 26]. In analogy to the computation of information for confidence values in (4) and (5), we compute an estimate Î(C) of the information I(C) conveyed by classifier C as follows:

    Î(C) = I(1 − R) = − ln(1 − R),    (12)

where R denotes the overall recognition rate of C on the evaluation set. Based on the estimate Î(C), Ê now computes as the Î(C)-th root of R, which maps the global recognition rate R to its corresponding rate for a one-bit classifier. The fixed point equation in the performance theorem then reads:

    K̂_i = − R^(1/Î(C)) * ln(1 − p̂(K̂_i))    (13)

This leaves us with the performance estimates as the only missing parameters we still need to compute the final informational confidence values.

4.2 Performance Estimation

Motivated by the performance theorem, which states that the performance function follows an exponential distribution, we propose an estimate that expresses performance as a percentage of the maximum performance possible. Accordingly, our relative performance estimate describes the different areas delimited by the confidence values under their common density function. Mathematically, it is based on accumulated partial frequencies and can be described by the following formula:

    p̂(K̂_i) = (1/N) * Σ_{k=0}^{i} ncorrect(K_k)    (14)

This is the same formula as in (1), which the reader has already encountered in Section 2. The use of accumulated partial frequencies guarantees that the newly computed estimates of the informational confidence values will not affect the order of the original confidence values. Based on this performance estimate, the mapping of old confidence values to their informational counterparts becomes a monotonic function satisfying the following relationship:

    K_i ≤ K_j ⟹ K̂_i ≤ K̂_j    (15)

Moreover, our performance estimate in (14) ensures that informational confidence values have no affect on the recognition rate of a classifier C, except

176

S. Jaeger et al.

for ties introduced by mapping two different confidence values to the same informational confidence value. Ties can happen when two neighboring confidence values show the same performance and become indistinguishable due to insufficient evaluation data. In most applications, this should be no problem, though. Typically, the effect of informational confidence values shows only when we combine C with other classifiers in a multiple classifier system. Accumulated partial frequencies act like a filter in that they do not consider single confidence values but a whole range of values. They average the estimation error over all confidence values in a confidence interval. This diminishes the negative effect of inaccurate measurements of the estimate $\hat{p}(\hat{K}_i)$ in application domains with insufficient or erroneous evaluation data.

Before inserting the relative performance estimate $\hat{p}(\hat{K}_i)$ into (13), we normalize it again to the one-bit classifier using $\hat{I}(C)$, as we already did for the expectation value. With both the expectation and the performance estimate normalized, the final version of the fixed point equation in the performance theorem reads as follows:

$$\hat{K}_i = -\sqrt[\hat{I}(C)]{R} \, \ln\Bigl(1 - \sqrt[\hat{I}(C)]{\hat{p}(\hat{K}_i)}\Bigr). \qquad (16)$$

The next section presents practical experiments using the proposed theoretical framework and performance estimate.
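Under these assumptions, the whole learning step reduces to a few lines. The following Python sketch (function and variable names are our own, not from the chapter) estimates informational confidence values from an evaluation set using (12), (14), and (16):

```python
import math

def informational_confidence(correct_counts, n_eval, recognition_rate):
    """Map each discrete confidence level to its informational confidence.

    correct_counts[i] -- number of correctly classified evaluation samples
                         whose top choice carried confidence level K_i
                         (levels sorted in increasing order of confidence)
    n_eval            -- total number of evaluation samples N
    recognition_rate  -- global recognition rate R of the classifier
    """
    # (12): information conveyed by the classifier, in Euler-bits
    info = -math.log(1.0 - recognition_rate)
    # expectation per Euler-bit: the I(C)-th root of R (one-bit normalization)
    expectation = recognition_rate ** (1.0 / info)

    new_conf, cumulative = [], 0.0
    for count in correct_counts:
        # (14): accumulated partial frequencies as performance estimate
        cumulative += count / n_eval
        p = min(cumulative, 1.0 - 1e-12)  # guard against log(0)
        # (16): fixed point equation with one-bit normalization of p as well
        new_conf.append(-expectation * math.log(1.0 - p ** (1.0 / info)))
    return new_conf

# toy evaluation set: 900 of 1000 samples correct, i.e. R = 0.9
k_hat = informational_confidence([50, 150, 300, 400], n_eval=1000,
                                 recognition_rate=0.9)
# monotonicity (15): higher raw confidence never maps to lower new confidence
assert all(a <= b for a, b in zip(k_hat, k_hat[1:]))
```

Because the cumulative performance estimate is non-decreasing and the mapping in (16) is monotonous in $\hat{p}$, the resulting values automatically satisfy (15).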

5 Experimental Results

Using informational confidence values, we will show that combined recognition rates outperform the best single recognition rates in several multiple classifier systems trained for handwritten or printed character recognition, and script identification. Let us begin with handwritten character recognition.

5.1 Japanese On-Line/Off-Line Handwritten Character Recognition

Handwriting recognition is a very promising application field for classifier combination, as it is both challenging and practically important. Handwriting, together with speech, is one of the most natural interfaces for interacting with computers. Multiple classifier systems therefore have a long tradition in handwriting recognition [1, 4]. In particular, the duality of handwriting recognition, with its two branches off-line recognition and on-line recognition, makes it suitable for multiple classifier systems. While off-line classifiers process static images of handwritten words, on-line classifiers operate on the dynamic data and expect point sequences over time as input signals. On-line systems require special hardware that is able to recover the dynamic information during the writing process. Compared to the time-independent pictorial

Combining Classifiers with Informational Confidence

177

representations used by off-line classifiers, on-line classifiers suffer from the many stroke-order and stroke-number variations inherent in human handwriting and thus in on-line data. On the other hand, dynamic information provides valuable cues that on-line classifiers can exploit to better discriminate between classes. In fact, off-line and on-line information complement each other. This relationship between off-line and on-line classifiers suggests combining both types of classifiers to overcome the problem of stroke-order and stroke-number variations [27, 28]. This is especially important in Japanese and Chinese character recognition because the average number of strokes per character, and thus the number of variations, is much higher than in the Latin alphabet [29, 30].

Classifiers

We tested a multiple classifier system comprising two classifiers for on-line handwritten Japanese characters. One of these two classifiers, however, transforms the captured on-line data into a pictorial representation by connecting neighboring on-line points using a sophisticated painting method [19, 31]. This transformation happens in a pre-processing step before feature computation and actual classification. We can therefore consider this classifier to be an off-line classifier. Both the on-line and the off-line classifier are nearest neighbor classifiers. Each classifier was trained with more than one million handwritten Japanese characters. The test and evaluation set contains 54,775 handwritten characters. From this set, we took about two thirds of the samples to estimate, and about one third to test the informational confidence values. For more information on the classifiers and data sets used, we refer readers to the references [31, 32, 33].

Table 1 lists the individual recognition rates for the off-line and on-line classifier. It shows the probabilities that the correct class label is among the n-best class labels in which the off-line or on-line classifier has the most confidence, with n ranging from 1 to 3.

Japanese   offline  online  AND    OR
1-best     89.94    81.04   75.41  95.56
2-best     94.54    85.64   82.62  97.55
3-best     95.75    87.30   84.99  98.06

Table 1. Single n-best rates for handwritten Japanese character recognition

We see that the off-line recognition rates are much higher than the corresponding on-line rates. Clearly, stroke-order and stroke-number variations are largely responsible for this performance difference. They considerably complicate the classification task for the on-line classifier. The last two columns of Table 1 show the percentage of test patterns


for which the correct class label occurs either twice (AND) or at least once (OR) in the n-best lists of both classifiers. The relatively large gap between the off-line recognition rates and the numbers in the OR-column suggests that on-line information is indeed complementary and useful for classifier combination purposes.

Combination Experiments

Table 2 shows the recognition rates for combined off-line/on-line recognition, using sum-rule, max-rule, and product-rule as combination schemes.

Japanese (89.94)   Raw Conf.  Perf. Conf.  Inf. Conf.  Norm. Inf. Conf.
Sum-rule           93.25      93.31        93.16       93.78
Max-rule           91.30      91.35        91.30       91.14
Product-rule       92.98      49.04        57.69       65.16

Table 2. Combined recognition rates for handwritten Japanese character recognition

As their names already suggest, sum-rule adds the confidence values provided by each classifier for the same class, while product-rule multiplies the confidence values and max-rule simply takes the maximum confidence without any further operation. The class with the maximum overall confidence will then be chosen as the most likely class for the given test pattern, followed by the n-best alternatives ranked according to their confidence. Note that the sum-rule is the mathematically appropriate combination scheme for integrating information from different sources [12]. Also, the sum-rule is very robust against noise, as was already mentioned in Section 2, at least theoretically [18]. The upper left cell of Table 2 again lists the best single recognition rate from Table 1, achieved by the off-line recognizer. We tested each combination scheme on four different confidence types:

1) raw confidence $K_i$ as provided directly by the respective classifier
2) performative confidence $\hat{p}(\hat{K}_i)$ as defined by the performance estimate in (14)
3) informational confidence $\hat{K}_i$ as defined in (11) with $\hat{E} = R$
4) normalized informational confidence $\hat{K}_i$ as defined in (16)

Compared to the individual rates, the combined recognition rates in Table 2 are clear improvements. The sum-rule on raw confidence values already accounts for an improvement of almost 3.5%. The best combined recognition rate achieved with normalized informational confidence is 93.78%. It outperforms the off-line classifier, which is the best individual classifier, by almost 4.0%. Sum-rule performs better than max-rule and product-rule, a fact


in accordance with the results in [18]. Mathematically, informational confidence values only make sense when used in combination with sum-rule. For the sake of completeness, however, we list the combined recognition rates for the other combination schemes as well. Generally speaking, the recognition rates in Table 2 compare favorably with the state-of-the-art, especially when we consider the difficult real-world test data used for training and testing. Also, sum-rule in combination with informational confidence exploits the candidate alternatives in the n-best lists to a fairly large extent, as indicated by the small difference between the "OR"-column and the practical recognition rates actually achieved. This should also benefit syntactical post-processing.

Figure 5 depicts the performance estimate for the off-line classifier (left-hand side) and on-line classifier (right-hand side).

Fig. 5. Performance estimate for off-line (left) and on-line (right) recognition

In both cases, the performance estimates describe a monotonously increasing function. The off-line function is steeper and reaches a higher level, though. Figure 6 shows the corresponding informational confidence values computed according to (16).

Fig. 6. Informational confidence for off-line (left) and on-line (right) recognition

On average, due to the better performance of the off-line classifier, the off-line confidence values are higher than the on-line confidence values.

5.2 Printed Character Recognition

The second application we experimented with is optical character recognition (OCR) of a transliterated Arabic-English bilingual dictionary scanned at a resolution of 300 dpi. Our character set comprises all Latin characters and a large number of dictionary-specific special characters, which lets the overall number of categories grow to 129. Figure 7 shows a sample page of the dictionary.

Fig. 7. Arabic-English dictionary

Classifiers

We used two different nearest neighbor classifiers for recognition: a classifier based on a weighted Hamming distance operating on the pictorial data and a classifier using Zernike moments. The Hamming distance of the first classifier describes the distance between the size-normalized character image and a binary template map computed for each character class from the training set. Moment descriptors, on the other hand, have been studied for image recognition and computer vision since the 1960s [34]. Teague [35] first introduced


the use of Zernike moments to overcome the shortcomings of information redundancy present in the popular geometric moments. Zernike moments are a class of orthogonal moments which are rotation invariant and can easily be constructed to an arbitrary order. They are projections of the image function onto a set of complex, orthogonal polynomials defined over the interior of the unit circle $x^2 + y^2 = 1$. It has already been shown in [36, 37] that Zernike moments can be useful for OCR. In our work, the order of the Zernike moments was chosen to be 12, which provides a 48-dimensional moment vector. Due to space constraints, we refer readers again to the references for more information on Zernike moments, e.g. [36, 37].

Table 3 lists the n-best single recognition rates and the corresponding AND and OR rates.

Latin    Hamming  Zernike  AND    OR
1-best   96.35    93.66    91.01  99.01
2-best   96.97    97.75    95.46  99.26
3-best   97.00    98.64    96.34  99.30

Table 3. Single n-best rates for printed Latin character recognition

In our experiments, the weighted Hamming distance is superior to the Zernike moments. It achieves a recognition rate of 96.35 versus 93.66 achieved by the Zernike moments. The OR-column of Table 3 is again an indication that a combination of both classifiers has the potential to improve the single best recognition rate, i.e. the recognition rate provided by the Hamming distance.

Combination Experiments

Table 4 is an overview of the combined recognition rates we computed.

Latin (96.35)   Raw Conf.  Perf. Conf.  Inf. Conf.  Norm. Inf. Conf.
Sum-rule        98.12      98.16        97.94       98.50
Max-rule        97.06      96.69        96.72       96.97
Product-rule    97.78      79.87        84.98       93.83

Table 4. Combined recognition rates for printed Latin character recognition

It is organized similarly to Table 2, with the single-best recognition rate listed in the upper left corner of the table. Each column contains the recognition rates for a particular confidence type and each row contains the rates for a specific combination scheme. The recognition rates in Tables 3 and 4 represent the average over twenty randomly selected pages with 64,570 test patterns, from which


we took every second pattern to compute the performance estimate. As we have already observed for Japanese character recognition, the sum-rule yields higher improvements than max-rule and product-rule. On raw confidence values, it achieves a considerable improvement on the best single recognition rate. Normalized informational confidence values increase the recognition rate further to 98.50, which is more than two percent higher than the best individual rate.

5.3 Script Identification

In our third application, we investigate the effect of informational confidence values on a multiple classifier system integrating different classification architectures. The application domain is bilingual script identification. A significant number of today's documents can only be accessed in printed form. A large portion of them are multilingual documents, such as patents or bilingual dictionaries. In general, automatic processing of these documents requires that the scripts be identified before they can be fed into an appropriate OCR system for automatic processing. Earlier work on script identification includes template-based approaches [38], approaches exploiting character-specific characteristics [39, 40, 41], as well as text line projections [42] and font recognition based on global texture analysis [43]. For our classifier combination experiments, we have developed a different approach that operates on the word level [44, 45]: A modified Docstrum algorithm first segments the document into words, then we use Gabor filters to compute features for each script class [46]. Our experiments focus on bilingual documents with one script being English and the second script being either Arabic, Chinese, Hindi, or Korean. Accordingly, script identification becomes a 2-class classification problem for each word. Figure 8 shows a page from a bilingual Arabic-English word dictionary, with the Arabic words successfully identified.
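For illustration, a minimal Gabor energy feature extractor along these lines can be sketched in a few lines of Python/NumPy. Kernel size, wavelengths, and orientations below are illustrative assumptions of our own, not the parameters actually used in [44, 46]:

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """Real Gabor kernel: a cosine carrier modulated by a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # rotate coordinates by theta
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength)
    return envelope * carrier

def gabor_energy_features(word_img,
                          thetas=(0, np.pi/4, np.pi/2, 3*np.pi/4),
                          wavelengths=(4, 8)):
    """Mean filter-response energy per (orientation, wavelength) channel."""
    feats = []
    for wl in wavelengths:
        for th in thetas:
            k = gabor_kernel(15, wl, th, sigma=0.5 * wl)
            # circular convolution via FFT keeps the sketch short
            resp = np.abs(np.fft.ifft2(np.fft.fft2(word_img) *
                                       np.fft.fft2(k, word_img.shape)))
            feats.append(resp.mean())
    return np.array(feats)

# one feature vector per segmented word image (here: a random 32x32 stand-in)
feats = gabor_energy_features(np.random.default_rng(0).random((32, 32)))
assert feats.shape == (8,)
```

Such a per-word feature vector would then be fed into the 2-class classifiers described next.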

Classifiers

This subsection gives a brief description of the different classifiers used in our multiple classifier system for script identification, including their recognition rates. More information on the features used and the pre-processing steps is given in [44]. Our multiple classifier system comprises four classifiers based on different classification principles: nearest neighbors, weighted Euclidean distances, support vectors, and Gaussian mixtures.

• Nearest Neighbor (KNN): First introduced by Cover and Hart [47] in 1967, the Nearest Neighbor (NN) classifier has proven to be a very effective and simple classifier. It has found applications in many research fields. Using the NN classifier, a test sample x is classified by assigning it the


Fig. 8. Arabic word segmentation

label of its nearest sample among the training samples. The distance between two feature vectors is usually measured by computing the Euclidean distance. In the experiments reported here, the classifier's confidence in a class label $\lambda_i$ assigned to a test sample $x$ computes as follows:

$$conf(x|\lambda_i) = \min_{x_t \in S_{\lambda_i}} \bigl(dis(x, x_t)\bigr),$$

where $dis(x, x_t)$ is the Euclidean distance between two vectors $x$ and $x_t$, and $S_{\lambda_i}$ is the set containing all training samples with label $\lambda_i$. In our experiments, $x$ represents a 32-dimensional feature vector.

• Weighted Euclidean Distance (WED): Based on the mean $\mu^{(i)}$ and standard deviation $\alpha^{(i)}$ of the training samples for each class $\lambda_i$, the distance between a test sample $x$ and $\lambda_i$ is computed as follows:

$$dis(x, \lambda_i) = \sum_{k=1}^{d} \left| \frac{x_k - \mu_k^{(i)}}{\alpha_k^{(i)}} \right|, \quad i = 1 \ldots M,$$

where $d$ is the feature dimension and $M$ is the number of classes. Test samples are assigned the class label with the minimum distance, and the classifier's confidence in a class label $\lambda_i$ is computed as:

$$conf(x|\lambda_i) = \frac{\sum_{j=1, j \neq i}^{M} dis(x, \lambda_j)}{\sum_{j=1}^{M} dis(x, \lambda_j)}$$
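The two distance-based confidences above can be sketched directly. In this toy illustration (variable names and data are our own), note that the raw KNN confidence is a distance, so smaller values indicate a better match; the performance estimate introduced earlier is what later re-scales such raw values into comparable informational confidences:

```python
import math

def knn_confidence(x, train):
    """conf(x|label) = Euclidean distance to the nearest training sample of that label."""
    return {lbl: min(math.dist(x, t) for t in samples)
            for lbl, samples in train.items()}

def wed_confidence(x, mean, std):
    """Weighted Euclidean distance per class, normalized to
    conf(x|i) = sum_{j != i} dis(x, j) / sum_j dis(x, j)."""
    dis = {lbl: sum(abs((xk - mk) / sk)
                    for xk, mk, sk in zip(x, mean[lbl], std[lbl]))
           for lbl in mean}
    total = sum(dis.values())
    return {lbl: (total - d) / total for lbl, d in dis.items()}

train = {"english": [(0.0, 0.0), (1.0, 0.0)], "arabic": [(4.0, 4.0)]}
d = knn_confidence((0.5, 0.1), train)
assert d["english"] < d["arabic"]      # smaller distance = closer match

mean = {"english": (0.5, 0.0), "arabic": (4.0, 4.0)}
std = {"english": (1.0, 1.0), "arabic": (1.0, 1.0)}
conf = wed_confidence((0.5, 0.1), mean, std)
assert conf["english"] > conf["arabic"]
```

For the WED confidence, the class confidences sum to $M - 1$ by construction, so for the 2-class script problem they behave like complementary probabilities.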


• Support Vector Machine (SVM): SVMs were first introduced in the late seventies, but are now receiving increased attention. The SVM classifier constructs a 'best' separating hyperplane (the maximal margin plane) in a high-dimensional feature space, which is defined by nonlinear transformations from the original feature variables. The distance between a test sample and the hyperplane reflects the confidence in the final classification: the larger the distance, the more confident the classifier is in its classification result. Hence, we can directly use this distance as confidence. A detailed description of the computation of the separating hyperplanes is given, for instance, in [48]. The experiments reported here are based on the SVM implementation SVM-light, which uses a polynomial kernel function [49].

• Gaussian Mixture Model (GMM): The Gaussian Mixture Model (GMM) classifier models the probability density function of a feature vector $x$ by a weighted combination of $M$ multi-variate Gaussian densities ($\Lambda$):

$$p(x|\Lambda) = \sum_{i=1}^{M} p_i g_i(x),$$

where the weight (mixing parameter) $p_i$ corresponds to the prior probability that feature $x$ was generated by component $i$, and satisfies $\sum_{i=1}^{M} p_i = 1$. Each component $\lambda_i$ is represented by a Gaussian model $\lambda_i = N(p_i, \mu_i, \Sigma_i)$ whose probability density can be described as:

$$g_i(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right),$$

where $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix of Gaussian mixture component $i$, respectively. Details about the estimation of these parameters can be found in [44]. In the following experiments, the likelihood value $\ln(p(\lambda_i|x))$ describes the classifier's confidence in labeling test pattern $x$ as $\lambda_i$.

Table 5 lists the individual recognition rates of each classifier for all scripts, i.e. the percentage of successful discriminations between English and the respective non-English words.

Script/Classifier  KNN    WED    GMM    SVM    AND    OR
Arabic             90.90  80.07  88.14  90.93  68.55  98.04
Chinese            92.19  84.89  90.59  93.34  76.04  98.66
Korean             94.04  84.68  90.72  92.54  76.57  98.53
Hindi              97.51  91.97  93.11  97.27  86.48  99.51

Table 5. Single recognition rates for script identification

The KNN and SVM classifiers provide the best overall performance. Classification rates are lowest for Arabic script since Arabic words are more similar to English (Latin) words than they are to words of other scripts. The OR-column indicates that the combination of all four classifiers has the potential to improve recognition.

Combination Experiments

In addition to the elementary combination schemes sum-rule, max-rule, and product-rule, we also experimented with combinations based on majority vote. Majority vote implements a simple voting among all four classifiers. The class with the maximum number of votes will then be chosen as the most likely class for a given test pattern. In case of ties, we take the label provided by the classifier with the highest recognition rate as the ultimate decision. We roughly labeled between 6,000 and 10,000 evaluation patterns for each script, and again used 50% of these sets to estimate the performance of each confidence value. Tables 6, 7, 8, and 9 show the combined recognition rates for each script and each combination scheme.

Arabic (90.93)   Raw Conf.  Perf. Conf.  Inf. Conf.  Norm. Inf. Conf.
Sum-rule         91.03      92.39        92.39       92.73
Max-rule         88.18      90.67        90.87       90.80
Product-rule     91.03      91.13        91.17       91.23
Majority vote    91.46      —            —           —

Table 6. Combined recognition rates for Arabic script identification
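The four combination schemes compared in Tables 6 through 9 can be sketched as follows. This is a toy illustration with our own function names and scores; confidences are assumed to be "higher is better" and comparable across classifiers, which is exactly what informational confidence is meant to ensure:

```python
import math

def combine(score_lists, rule="sum"):
    """Fuse per-class confidence dicts from several classifiers and pick a class."""
    classes = set().union(*score_lists)
    if rule == "sum":
        fused = {c: sum(s.get(c, 0.0) for s in score_lists) for c in classes}
    elif rule == "max":
        fused = {c: max(s.get(c, 0.0) for s in score_lists) for c in classes}
    elif rule == "product":
        fused = {c: math.prod(s.get(c, 0.0) for s in score_lists) for c in classes}
    return max(fused, key=fused.get)

def majority_vote(score_lists, ranked_by_accuracy):
    """Each classifier votes for its top class; ties go to the classifier
    with the highest individual recognition rate (first in ranked_by_accuracy)."""
    votes = [max(s, key=s.get) for s in score_lists]
    counts = {v: votes.count(v) for v in votes}
    best = max(counts.values())
    tied = [v for v, n in counts.items() if n == best]
    if len(tied) == 1:
        return tied[0]
    for i in ranked_by_accuracy:          # break ties by classifier quality
        if votes[i] in tied:
            return votes[i]

# hypothetical confidences of four classifiers for one word (2-class problem)
scores = [{"arabic": 2.1, "english": 0.4},
          {"arabic": 0.9, "english": 1.0},
          {"arabic": 1.5, "english": 0.2},
          {"arabic": 0.3, "english": 1.1}]
assert combine(scores, "sum") == "arabic"            # 4.8 vs. 2.7
assert majority_vote(scores, ranked_by_accuracy=[0, 3, 1, 2]) == "arabic"
```

The example also shows why ties matter for a 2-class problem with four voters: a 2-2 split is resolved by the best individual classifier, here classifier 0.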

Chinese (93.34)  Raw Conf.  Perf. Conf.  Inf. Conf.  Norm. Inf. Conf.
Sum-rule         94.24      94.26        94.24       94.50
Max-rule         90.83      93.24        93.24       93.30
Product-rule     93.45      93.45        93.45       93.45
Majority vote    93.79      —            —           —

Table 7. Combined recognition rates for Chinese script identification

The results show that the combined recognition rates based on normalized informational confidence consistently outperform the best single recognition rate, which is again displayed in the upper left cell of each table. We achieve the biggest improvement of 1.8% on the most difficult script: Arabic. In particular, normalized informational confidence values perform better than other confidence types. The unnormalized informational confidence values also lead to improvements in all four cases. A very important point to notice is that the

Korean (94.04)   Raw Conf.  Perf. Conf.  Inf. Conf.  Norm. Inf. Conf.
Sum-rule         93.73      93.75        94.16       94.35
Max-rule         90.99      93.30        93.49       94.06
Product-rule     92.52      92.38        92.38       92.38
Majority vote    94.06      —            —           —

Table 8. Combined recognition rates for Korean script identification

Hindi (97.51)    Raw Conf.  Perf. Conf.  Inf. Conf.  Norm. Inf. Conf.
Sum-rule         98.05      97.98        97.83       98.08
Max-rule         93.13      95.64        95.81       96.65
Product-rule     97.32      97.19        97.19       97.19
Majority vote    97.93      —            —           —

Table 9. Combined recognition rates for Hindi script identification

plain sum-rule is not able to improve the single best recognition rate for Korean, whereas normalized informational confidence leads to an improvement. This is another confirmation of the fact that sum-rule does not necessarily improve recognition rates, as has already been observed by many researchers before. It performs better than max-rule or product-rule in our experiments, though. The simple performative confidence estimate also performs well, except for Korean. Majority voting outperforms sum-rule on Arabic and Korean. On the other hand, it is outperformed by sum-rule on Chinese and Hindi. This somewhat inconsistent behavior also reflects very well the practical experiences of other authors.

AdaBoost

This subsection reports recognition experiments with Boosting, i.e. AdaBoost, in order to compare with other, more complex state-of-the-art combination schemes. AdaBoost (Adaptive Boosting) was introduced by Freund and Schapire in 1995 to expand the boosting approach originally introduced by Schapire. We focus on the AdaBoost variant called AdaBoost.M1 [7]. The AdaBoost algorithm generates a set of classifiers and combines them. It changes the weights of the training samples based on classifiers previously built (trials). The goal is to force the final classifiers to minimize the expected error over different input distributions. The final classifier is formed using a weighted voting scheme. Details of AdaBoost.M1 can be found in [7]. Table 10 is a representative example of our Boosting experiments, with the number of trials being 20. In most cases, Boosting provides slight improvements on the single recognition rates in Table 5. However, the improvements are smaller than those achieved with informational confidence, and Boosting does not outperform the single best recognition rate for Hindi in our experiments.

Classifier  Arabic (90.93)  Chinese (93.34)  Korean (94.04)  Hindi (97.51)
KNN         90.95           92.22            94.12           97.39
WED         82.24           85.17            85.18           92.13
GMM         88.03           91.02            91.14           92.85
SVM         90.48           93.49            93.03           97.38

Table 10. Recognition rates with Boosting
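As a reference point, the AdaBoost.M1 reweighting itself fits in a few lines. The sketch below uses our own toy weak learner (a 1-D decision stump) and data, not the classifiers from Table 10; it shows the reweighting with $\beta = \epsilon / (1 - \epsilon)$ and the $\ln(1/\beta)$-weighted final vote described above:

```python
import math

def train_stump(data, weights):
    """Weak learner: best threshold/polarity decision stump on 1-D inputs."""
    best = None
    for thr in sorted({x for x, _ in data}):
        for pol in (1, -1):
            err = sum(w for (x, y), w in zip(data, weights)
                      if (1 if pol * (x - thr) >= 0 else -1) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    err, thr, pol = best
    return (lambda x: 1 if pol * (x - thr) >= 0 else -1), err

def adaboost_m1(data, trials=20):
    n = len(data)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(trials):
        h, err = train_stump(data, weights)
        if err == 0 or err >= 0.5:          # M1 stops on perfect or too-weak learners
            if err == 0:
                ensemble.append((h, 10.0))  # large but finite vote for a perfect stump
            break
        beta = err / (1.0 - err)
        alpha = math.log(1.0 / beta)        # final weighted vote uses log(1/beta)
        # down-weight correctly classified samples, then renormalize
        weights = [w * (beta if h(x) == y else 1.0)
                   for (x, y), w in zip(data, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
        ensemble.append((h, alpha))

    def classify(x):
        vote = sum(a * h(x) for h, a in ensemble)
        return 1 if vote >= 0 else -1
    return classify

# toy 1-D problem: positives to the right of 0.5
data = [(x / 10.0, 1 if x >= 5 else -1) for x in range(10)]
clf = adaboost_m1(data)
assert all(clf(x) == y for x, y in data)
```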

We must also note that the nearest neighbor classifier does not lend itself to boosting because it never misclassifies on the training set. For this reason and the fact that boosting can still be applied on top of informational confidence, we do not report more results for boosting in this chapter.
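The observation about the nearest neighbor classifier is easy to verify: as long as the training samples are distinct, every training point is its own nearest neighbor at distance zero, so boosting's error-driven reweighting has nothing to work with. A minimal check, with our own toy data:

```python
import math

def nn_label(x, train):
    """1-NN label: the label of the closest training sample."""
    return min(train, key=lambda t: math.dist(x, t[0]))[1]

train = [((0.0, 0.0), "a"), ((1.0, 1.0), "b"), ((2.0, 0.0), "a")]
# training error of 1-NN is zero: each sample finds itself at distance 0
errors = sum(nn_label(x, train) != y for x, y in train)
assert errors == 0
```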

6 Summary

The classifiers of a multiple classifier system typically provide confidence values with quite different characteristics, especially classifiers based on different classification architectures. While this poses no problem for the individual classifiers themselves, it considerably complicates the integration of different classification results into a single outcome. We propose an information-theoretical framework to overcome this problem and make confidence values from different sources comparable. Our approach is based on two main postulates. First, we assume that confidence is basically information as introduced by Shannon's logarithmic definition. Second, we require that the confidence of a classifier depends on its performance function in a given application domain. These two simple assumptions lead to a fixed point equation for what we call informational confidence. Our main result, which we summarize in the Performance Theorem, states that classifiers providing informational confidence feature an exponential distribution as a performance function. The performance theorem motivates a learning method for informational confidence: We can estimate the performance of each confidence value on an evaluation set and insert the estimate directly into the fixed point equation given in the Performance Theorem. This provides us directly with the informational confidence values, which are fixed points. As performance estimate, we use a monotonously increasing function whose main purpose is to estimate the values of an exponential distribution for each confidence value. In our experiments for character recognition and script identification, combined recognition rates based on informational confidence consistently outperform the best single recognition rate. In particular, sum-rule in combination with informational confidence outperforms other elementary combination schemes, namely max-rule, product-rule, majority vote, and sum-rule itself on raw confidence values.


For monotonously increasing performance estimates, informational confidence values do not affect the recognition rates of individual classifiers. They take effect only when used in combination with informational confidence values from other classifiers. Nevertheless, a single recognizer can of course take advantage of informational confidence values in the post-processing steps following classification, such as syntactic context analysis of n-best lists, etc. While sum-rule is the most natural combination scheme for informational confidence values, integration of informational confidence is by no means a contradiction to other, more complex integration techniques. On the contrary, informational confidence can be considered a general standard representation for confidence on which other post-processing techniques can rest.

References

1. Xu, L., Krzyzak, A., Suen, C.: Methods of Combining Multiple Classifiers and Their Applications to Handwriting Recognition. IEEE Trans. on Systems, Man, and Cybernetics 22 (1992) 418–435
2. Gader, P., Mohamed, M., Keller, J.: Fusion of Handwritten Word Classifiers. Pattern Recognition Letters 17 (1996) 577–584
3. Sirlantzis, K., Hoque, S., Fairhurst, M.C.: Trainable Multiple Classifier Schemes for Handwritten Character Recognition. In: 3rd International Workshop on Multiple Classifier Systems (MCS), Cagliari, Italy, Lecture Notes in Computer Science, Springer-Verlag (2002) 169–178
4. Wang, W., Brakensiek, A., Rigoll, G.: Combination of Multiple Classifiers for Handwritten Word Recognition. In: Proc. of the 8th International Workshop on Frontiers in Handwriting Recognition (IWFHR-8), Niagara-on-the-Lake, Canada (2002) 117–122
5. Breiman, L.: Bagging Predictors. Machine Learning 2 (1996) 123–140
6. Freund, Y., Schapire, R.: A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence 14 (1999) 771–780
7. Freund, Y., Schapire, R.: Experiments with a New Boosting Algorithm. In: Proc. of 13th Int. Conf. on Machine Learning, Bari, Italy (1996) 148–156
8. Guenter, S., Bunke, H.: New Boosting Algorithms for Classification Problems with Large Number of Classes Applied to a Handwritten Word Recognition Task. In: 4th International Workshop on Multiple Classifier Systems (MCS), Guildford, UK, Lecture Notes in Computer Science, Springer-Verlag (2003) 326–335
9. Ho, T.K.: The Random Subspace Method for Constructing Decision Forests. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 20 (1998) 832–844
10. Ianakiev, K., Govindaraju, V.: Architecture for Classifier Combination Using Entropy Measures. In: 1st International Workshop on Multiple Classifier Systems (MCS), Cagliari, Italy, Lecture Notes in Computer Science, Springer-Verlag (2000) 340–350
11. Oberlaender, M.: Mustererkennungsverfahren (1995). German Patent DE 4436408 C1 (in German)


12. Shannon, C.E.: A Mathematical Theory of Communication. Bell System Tech. J. 27 (1948) 379–423
13. Ho, T., Hull, J., Srihari, S.: Decision Combination in Multiple Classifier Systems. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 16 (1994) 66–75
14. Erp, M.V., Vuurpijl, L.G., Schomaker, L.: An Overview and Comparison of Voting Methods for Pattern Recognition. In: Proc. of the 8th International Workshop on Frontiers in Handwriting Recognition (IWFHR-8), Niagara-on-the-Lake, Canada (2002) 195–200
15. Kang, H.J., Kim, J.: A Probabilistic Framework for Combining Multiple Classifiers at Abstract Level. In: Fourth International Conference on Document Analysis and Recognition (ICDAR), Ulm, Germany (1997) 870–874
16. Mandler, E., Schuermann, J.: Combining the Classification Results of Independent Classifiers Based on the Dempster/Shafer Theory of Evidence. In: E.S. Gelsema, L.K., ed.: Pattern Recognition and Artificial Intelligence (1988) 381–393
17. Huang, Y., Suen, C.: A Method of Combining Multiple Experts for Recognition of Unconstrained Handwritten Numerals. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 17 (1995) 90–94
18. Kittler, J., Hatef, M., Duin, R., Matas, J.: On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 226–239
19. Velek, O., Liu, C.L., Jaeger, S., Nakagawa, M.: An Improved Approach to Generating Realistic Kanji Character Images from On-Line Characters and its Benefit to Off-Line Recognition Performance. In: 16th International Conference on Pattern Recognition (ICPR). Volume 1, Quebec (2002) 588–591
20. Velek, O., Jaeger, S., Nakagawa, M.: A New Warping Technique for Normalizing Likelihood of Multiple Classifiers and its Effectiveness in Combined On-Line/Off-Line Japanese Character Recognition. In: 8th International Workshop on Frontiers in Handwriting Recognition (IWFHR), Niagara-on-the-Lake, Canada (2002) 177–182
21. Velek, O., Jaeger, S., Nakagawa, M.: Accumulated-Recognition-Rate Normalization for Combining Multiple On/Off-line Japanese Character Classifiers Tested on a Large Database. In: 4th International Workshop on Multiple Classifier Systems (MCS), Guildford, UK, Lecture Notes in Computer Science, Springer-Verlag (2003) 196–205
22. Pierce, J.R.: An Introduction to Information Theory: Symbols, Signals, and Noise. Dover Publications, Inc., New York (1980)
23. Sacco, W., Copes, W., Sloyer, C., Stark, R.: Information Theory: Saving Bits. Janson Publications, Inc., Dedham, MA (1988)
24. Sloane, N.J.A., Wyner, A.D.: Claude Elwood Shannon: Collected Papers. IEEE Press, Piscataway, NJ (1993)
25. Jaeger, S.: Informational Classifier Fusion. In: Proc. of the 17th Int. Conf. on Pattern Recognition, Cambridge, UK (2004) 216–219
26. Jaeger, S.: Using Informational Confidence Values for Classifier Combination: An Experiment with Combined On-Line/Off-Line Japanese Character Recognition. In: Proc. of the 9th Int. Workshop on Frontiers in Handwriting Recognition, Tokyo, Japan (2004) 87–92
27. Jaeger, S., Manke, S., Reichert, J., Waibel, A.: Online Handwriting Recognition: The Npen++ Recognizer. International Journal on Document Analysis and Recognition 3 (2001) 169–180
28. Jaeger, S.: Recovering Dynamic Information from Static, Handwritten Word Images. PhD thesis, University of Freiburg, Foelbach Verlag (1998)
29. Jaeger, S., Liu, C.L., Nakagawa, M.: The State of the Art in Japanese Online Handwriting Recognition Compared to Techniques in Western Handwriting Recognition. International Journal on Document Analysis and Recognition 6 (2003) 75–88
30. Liu, C.L., Jaeger, S., Nakagawa, M.: Online Recognition of Chinese Characters: The State-of-the-Art. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 26 (2004) 198–213
31. Jaeger, S., Nakagawa, M.: Two On-Line Japanese Character Databases in Unipen Format. In: 6th International Conference on Document Analysis and Recognition (ICDAR), Seattle (2001) 566–570
32. Nakagawa, M., Akiyama, K., Tu, L., Homma, A., Higashiyama, T.: Robust and Highly Customizable Recognition of On-Line Handwritten Japanese Characters. In: Proc. of the 13th International Conference on Pattern Recognition (ICPR), Volume III, Vienna, Austria (1996) 269–273
33. Nakagawa, M., Higashiyama, T., Yamanaka, Y., Sawada, S., Higashigawa, L., Akiyama, K.: On-Line Handwritten Character Pattern Database Sampled in a Sequence of Sentences without Any Writing Instructions. In: Fourth International Conference on Document Analysis and Recognition (ICDAR), Ulm, Germany (1997) 376–381
34. Teh, C.H., Chin, R.T.: On Image Analysis by the Methods of Moments. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 10 (1988) 496–513
35. Teague, M.: Image Analysis via the General Theory of Moments. Journal of the Optical Society of America 70 (1979) 920–930
36. Khotanzad, A., Hong, Y.H.: Rotation Invariant Image Recognition Using Features Selected via a Systematic Method. Pattern Recognition 23 (1990) 1089–1101
37. Khotanzad, A., Hong, Y.H.: Invariant Image Recognition by Zernike Moments. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 12 (1990) 489–497
38. Hochberg, J., Kelly, P., Thomas, T., Kerns, L.: Automatic Script Identification from Document Images Using Cluster-Based Templates. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 19 (1997) 176–181
39. Sibun, P., Spitz, A.L.: Language Determination: Natural Language Processing from Scanned Document Images. In: Proc. of the 4th Conference on Applied Natural Language Processing, Stuttgart (1994) 115–121
40. Spitz, A.L.: Determination of the Script and Language Content of Document Images. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 19 (1997) 235–245
41. Tan, C., Leong, T., He, S.: Language Identification in Multilingual Documents. In: Int. Symposium on Intelligent Multimedia and Distance Education (ISIMADE'99), Baden-Baden, Germany (1999) 59–64
42. Waked, B., Bergler, S., Suen, C.Y.: Skew Detection, Page Segmentation, and Script Classification of Printed Document Images. In: IEEE International Conference on Systems, Man, and Cybernetics (SMC'98), San Diego, CA (1998) 4470–4475
43. Zhu, Y., Tan, T., Wang, Y.: Font Recognition Based on Global Texture Analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 23 (2001) 1192–1200
44. Ma, H., Doermann, D.: Word Level Script Identification for Scanned Document Images. In: Proc. of Int. Conf. on Document Recognition and Retrieval (SPIE) (2004) 178–191
45. Jaeger, S., Ma, H., Doermann, D.: Identifying Script on Word-Level with Informational Confidence. In: Int. Conf. on Document Analysis and Recognition (ICDAR), Seoul, Korea (2005) 416–420
46. O'Gorman, L.: The Document Spectrum for Page Layout Analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 15 (1993) 1162–1173
47. Cover, T.M., Hart, P.E.: Nearest Neighbor Pattern Classification. IEEE Trans. on Information Theory IT-13 (1967) 21–27
48. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2 (1998) 121–167
49. Joachims, T.: Making Large-Scale SVM Learning Practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.): Advances in Kernel Methods – Support Vector Learning. MIT Press (1999) 41–56
