Confidence Evaluation for Combining Diverse Classifiers

Hongwei Hao a,*, Cheng-Lin Liu b, Hiroshi Sako b
a University of Science and Technology Beijing, Beijing, P.R. China
b Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan

Abstract
For combining classifiers at the measurement level, the diverse outputs of the classifiers should be transformed to uniform measures that represent the confidence of decision, ideally the class probability or likelihood. This paper presents our experimental results of classifier combination using confidence evaluation. We test three types of confidence: log-likelihood, exponential, and sigmoid. For re-scaling the classifier outputs, we use three scaling functions based on global normalization and Gaussian density estimation. Experimental results in handwritten digit recognition show that via confidence evaluation, superior classification performance can be obtained using simple combination rules.
1. Introduction
The combination of multiple classifiers is receiving increasing attention from researchers and practitioners of pattern recognition. Classifier combination is expected to outperform the individual classifiers, whose performance is limited by the imperfection of feature extraction and learning/classification algorithms and by the inadequacy of training data. Many reported results have shown the promise of classifier combination, in that the accuracy of combination is higher than that of the best individual classifier [1-3].

To achieve high combination performance, there are some requirements on the constituent classifiers and the combination method. An essential condition is that the individual classifiers are complementary, i.e., that they show different classification properties. Given a set of complementary classifiers, a wise combination method is also needed to achieve high performance. The variety of classifier combination methods can be divided into three groups according to the level of classifier outputs: abstract (crisp class) level, rank level, and measurement level [1]. Combination at the measurement level is advantageous in that the output measurements contain richer information about the class measures. In this case, however, an appropriate combination method is more critical, especially for combining classifiers that output diverse measurements.

For combination at the measurement level, we divide the combination method into two subtasks: confidence evaluation and a combination rule. After transforming the classifier outputs to confidence measures that represent the class probabilities or likelihood, high performance can be obtained by simple combination rules such as the sum-rule or product-rule [3]. For combining classifiers with diverse outputs, confidence evaluation is particularly important; otherwise the simple combination rules give poor combination performance. The importance of confidence evaluation has been recognized in classifier combination [4], and some work has been done along this line [5-7]. Besides, some works have focused specifically on the confidence evaluation of classifier outputs, for improving class probability estimates, pattern rejection, contextual processing, etc. [8-12]. These methods are certainly applicable to classifier combination as well.

This paper investigates the effects of some confidence evaluation methods in classifier combination. Unlike previous works that transformed the classifier outputs heuristically (say, by normalizing the raw outputs to unity, by the contrast of outputs between rank-ordered classes, or by the reciprocal of the class output), we focus on probabilistic methods that give plausible estimates of class probabilities or likelihood. We decompose a confidence evaluation method into two sub-functions: a scaling function and an activation function. We have three activation functions corresponding to three types of confidence: log-likelihood, exponential, and sigmoid. The scaling function shifts and re-scales the classifier outputs into moderate ranges. Three scaling functions are considered: one based on global normalization and two based on Gaussian density estimation of the classifier outputs.
Experiments of classifier combination in handwritten digit recognition show that combination based on confidence evaluation yields superior classification performance. The simple global normalization of classifier outputs performs fairly well, but is outperformed by the Gaussian methods. With respect to the confidence types, the exponential and sigmoid measures outperform the log-likelihood.

* The work of Hongwei Hao was done while he was with the Hitachi Central Research Laboratory.

Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003) 0-7695-1960-1/03 $17.00 © 2003 IEEE
2. Confidence Evaluation Methods
Many works in classifier combination assumed that the classifier outputs already represent estimates of class probabilities, as for neural networks and parametric statistical classifiers. For many other non-parametric classifiers, however, the outputs have diverse meanings and ranges, and so they need to be transformed to uniform confidence measures. Even for neural networks and parametric classifiers, the outputs do not give accurate probability estimates, and re-evaluation of confidences may benefit the combination performance.

Suppose we have K classifiers (experts) {E_1, ..., E_K}, each classifying an input pattern into the same set of M classes {ω_1, ..., ω_M}. The features used by the classifiers can be inter-dependent or independent. Denote the outputs of classifier E_k as d_k = (d_k1, ..., d_kM)^T. The task of confidence evaluation is to transform them into a uniform form g_i(d_k), i = 1, ..., M, which represents the class probabilities or likelihood. To generate confidence measures, the classifier outputs are first shifted and scaled by a scaling function f_i(d_k) into a moderate range. The re-scaled outputs are then transformed to confidence measures by an activation function g_i(d_k) = g_i[f_i(d_k)].

2.1. Confidence types
In illustrating the confidence evaluation methods, we drop the classifier index k, since each classifier is considered independently. As prevalently used in neural networks, the sigmoid function behaves well in squashing neuronal outputs to approximate probability measures. We take it as an activation function for confidence transformation:

    g_i(d) = 1 / (1 + exp[-f_i(d)]).    (1)

In many parametric classifiers, such as the linear discriminant function (LDF) and the quadratic discriminant function (QDF), the class measurement is the logarithm or negative logarithm of the Bayesian likelihood: d_i(x) = log[P(ω_i) p(x|ω_i)]. In this case, the class posterior probability can be calculated by soft-max as

    P(ω_i|x) = exp[d_i(x)] / Σ_{j=1}^M exp[d_j(x)].

We take the exponential before normalization to unity as a type of confidence:

    g_i(d) = exp[f_i(d)],    (2)

in which using the re-scaled output instead of the raw output may give better confidence estimates. For parametric classifiers, the exponential measure corresponds to the Bayesian likelihood; we herein generalize it to all other classifiers to estimate the Bayesian likelihood. We also use the log-likelihood as a third type of confidence. When approximating the Bayesian likelihood by the exponential, the log-likelihood is simply the linear form of the scaling function:

    g_i(d) = f_i(d).    (3)

To give class posterior probabilities that satisfy the axioms of probability, the exponential likelihood and the sigmoid measure can be normalized to unit sum:

    p(ω_i|d) = g_i(d) / Σ_{j=1}^M g_j(d).

Since the normalized posterior probability is derived from the exponential or sigmoid measure, we do not treat it as a new type of confidence.

2.2. Scaling functions
For any type of confidence, the scaling function plays a crucial role in determining the value of the confidence. An essential requirement on the scaling function is that the re-scaled classifier outputs are distributed in a moderate range around 0. It is desired that the transformed confidence measures represent the probability of the input pattern belonging to a specific class. To manage the range of classifier outputs, one simple strategy is to re-scale the output values to zero mean and unit standard deviation:

    f_i(d) = (d_i - µ_0) / σ_0,    (4)

where µ_0 and σ_0^2 are the mean and variance of the pooled classifier outputs, respectively. We refer to this scaling function as Global Normalization. For classifiers that output dissimilarity measures, the sign of the raw outputs should be reversed before global normalization.

The other two scaling functions are derived from Gaussian densities of the classifier outputs. Assuming multivariate or one-dimensional Gaussian densities for the classifier outputs, the class probabilities are shown to be
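As a minimal sketch, the three activation functions of eqs. (1)-(3), the global-normalization scaling of eq. (4), and the normalization to unit sum might be implemented as follows. The function names and the toy output vector are ours, not from the paper; in practice µ_0 and σ_0 would be estimated from the pooled outputs of a validation set rather than from a single output vector.

```python
import numpy as np

def global_norm(d, mu0, sigma0):
    """Scaling function, eq. (4): shift/scale outputs by the pooled
    mean mu0 and standard deviation sigma0 of the classifier outputs."""
    return (d - mu0) / sigma0

def confidence(f, kind):
    """Activation g(f): 'lin' (log-likelihood), 'exp', or 'sig'."""
    if kind == "lin":
        return f                          # eq. (3): linear / log-likelihood
    if kind == "exp":
        return np.exp(f)                  # eq. (2): exponential likelihood
    if kind == "sig":
        return 1.0 / (1.0 + np.exp(-f))   # eq. (1): sigmoid
    raise ValueError(kind)

def normalize_to_unity(g):
    """Optional posterior probability: divide by the sum over classes."""
    return g / g.sum()

# Toy outputs of one classifier over M = 3 classes; here the pooled
# statistics are faked from this single vector for illustration only.
d = np.array([5.0, 2.0, 1.0])
f = global_norm(d, d.mean(), d.std())
p = normalize_to_unity(confidence(f, "exp"))
```

With this sketch, `p` is a valid probability vector whose largest entry corresponds to the class with the largest raw output, since all three activations are monotone.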
calculated by soft-max or sigmoid, from which we extract the scaling functions.

For estimating the posterior probabilities of neural network outputs, Denker and LeCun assumed that the distribution of each output (before the sigmoid) conditioned on the input is normal. By integrating this distribution over the input samples, they showed that the class posterior probability is calculated by soft-max [8]. In this paper, by directly assuming that the outputs of a classifier are distributed normally in the output space, with constraints on the means and variances, we obtain a soft-max formula similar to that of Denker and LeCun. Assume that for each class the density of classifier outputs is a multivariate Gaussian with isotropic variance:

    p(d|ω_i) = (2πσ_i^2)^{-M/2} exp[- Σ_{j=1}^M (d_j - m_ij)^2 / (2σ_i^2)],

where m_ij, j = 1, ..., M, are the mean values of the outputs for class ω_i. Considering that the outputs of a strong classifier are well ordered, such that the target class generally has a high measure while the other classes have low outputs, we assume that all classes share two distinct mean values, µ_+ for the target class and µ_- for the other classes, such that for class ω_i, m_ii = µ_+ and m_ij = µ_-, j ≠ i. Further assuming equal variances σ_i^2 = σ^2 and equal prior probabilities, the log-likelihood is

    log p(d|ω_i) = - Σ_{j=1}^M (d_j - m_ij)^2 / (2σ^2)
                 = - [(d_i - µ_+)^2 + Σ_{j=1, j≠i}^M (d_j - µ_-)^2] / (2σ^2)
                 = - [Σ_{j=1}^M (d_j^2 - 2 d_j µ_-) + µ_+^2 + (M-1) µ_-^2] / (2σ^2) + 2 d_i (µ_+ - µ_-) / (2σ^2).

Omitting the terms irrespective of the class index i, the log-likelihood becomes

    log p(d|ω_i) = (µ_+ - µ_-)/σ^2 · d_i.    (5)

Based on this, the posterior probability is calculated by

    P(ω_i|d) = exp[(µ_+ - µ_-)/σ^2 · d_i] / Σ_{j=1}^M exp[(µ_+ - µ_-)/σ^2 · d_j].

The form of (5), however, is not a good scaling function, because the classifier output is not shifted. Intuitively, the output of class ω_i should be shifted such that the boundary between the samples of ω_i and those of the other classes becomes 0. We estimate this boundary as the midpoint of the sample mean µ_+ of ω_i and the mean µ_- of the competing negative samples. (On a competing negative sample from class ω_j, the output d_i is the closest runner-up to the output d_j.) Eventually, the scaling function is

    f_i(d) = (µ_+ - µ_-)/σ^2 · (d_i - (µ_+ + µ_-)/2).    (6)

The next scaling function is obtained by assuming a one-dimensional Gaussian density for the output of each class. On the output d_i, assume the densities of positive samples (from class ω_i) and negative samples (from the other classes) are Gaussians with identical variance σ^2 and means µ_+ and µ_-, respectively. Schuermann has shown that the class posterior probability then has the sigmoid form [11]:

    P(ω_i|d_i) = 1 / (1 + exp[-f_i(d)]),

with

    f_i(d) = α [d_i - (β + γ/α)],    (7)

where α = (µ_+ - µ_-)/σ^2, β = (µ_+ + µ_-)/2, and γ = ln[P(ω̄_i)/P(ω_i)], with ω̄_i denoting the complement of class ω_i. We set P(ω̄_i)/P(ω_i) = M, considering that the negative samples may include those outside the M hypothesized classes and that the M+1 classes are assumed to have equal prior probabilities.

The scaling functions of (6) and (7) are referred to as Gaussian Method 1 (Gauss-1) and Gaussian Method 2 (Gauss-2), respectively. They differ only in the shift of the classifier output.
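The two Gaussian scaling functions of eqs. (6) and (7) can be sketched as follows. The parameters mu_pos, mu_neg, and sigma2 would be estimated from a validation set (mu_pos from target-class outputs, mu_neg from competing runner-up outputs); here they are illustrative constants of our own choosing.

```python
import numpy as np

def gauss1(d, mu_pos, mu_neg, sigma2):
    # Eq. (6): slope alpha, shifted by the midpoint of the two means.
    alpha = (mu_pos - mu_neg) / sigma2
    return alpha * (d - (mu_pos + mu_neg) / 2.0)

def gauss2(d, mu_pos, mu_neg, sigma2, M):
    # Eq. (7): same slope, but the shift also includes gamma/alpha
    # with gamma = ln(M) for the assumed prior ratio of M.
    alpha = (mu_pos - mu_neg) / sigma2
    beta = (mu_pos + mu_neg) / 2.0
    gamma = np.log(M)
    return alpha * (d - (beta + gamma / alpha))

# Toy output vector and illustrative parameters (not from the paper).
d = np.array([3.0, 0.5, -0.2])
f1 = gauss1(d, mu_pos=2.5, mu_neg=0.0, sigma2=1.0)
f2 = gauss2(d, mu_pos=2.5, mu_neg=0.0, sigma2=1.0, M=3)
```

As the paper notes, the two differ only in the shift of the classifier output: here f1 - f2 equals the constant gamma = ln(M) for every class, so the soft-max over the two scalings gives identical posteriors, while the sigmoid of eq. (7) does depend on the shift.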
3. Experimental Results
To evaluate the performance of the confidence evaluation methods, we conducted experiments in
handwritten digit recognition by combining five classifiers, which use varying pre-processing procedures, feature vectors, and classifier structures. Two datasets, one collected in Hitachi and the other extracted from NIST SD-19, were used to train the classifiers. The specifications of the individual classifiers are shown in Table 1.

The feature vectors represent 4-orientation or 8-direction features, or an enhanced direction feature with structural features. In the notations, the prefix "e-" means an 8-direction feature, "grd" means the gradient feature, and the suffix "-g" means a feature extracted from gray-scale images (converted from binary ones in normalization). "mul" means the enhanced (chaincode) direction feature, and "des" means the chaincode feature extracted from deslanted images. In pre-processing, different aspect ratio mapping functions were used to control the aspect ratio of the normalized image. The classifier structures are the polynomial classifier (PC), learning vector quantization (LVQ), and the learning quadratic discriminant function (LQDF). The PC outputs class similarity measures, while LVQ and LQDF output dissimilarity measures. The pre-processing, feature extraction, and classification methods have been described in [13-15].

Table 1. Specifications of five individual classifiers
  Index  Feature  Aspect  Classify  Train
  E0     e-grd-g  ARF9    PC        Hitachi
  E1     grd-g    ARF7    PC        Hitachi
  E2     mul      ARF2    PC        Hitachi
  E3     e-grd    ARF8    LVQ       Hitachi
  E4     des      ARF8    LQDF      NIST

Table 2. Accuracies (%) on validation and test datasets
  Expert     Validate  Test-1  Test-2
  E0         99.40     99.79   89.95
  E1         99.04     99.71   88.28
  E2         99.17     99.58   87.78
  E3         99.11     99.64   87.77
  E4         98.41     98.57   82.03
  Plurality  99.44     99.84   90.81

We use a dataset of 81,544 digit samples collected in Hitachi as the validation data for estimating the scaling parameters of confidence evaluation. For testing the performance of classifier combination, we use two test datasets. Test-1 contains 9,725 samples collected in an environment similar to that of the validation dataset.
Test-2 contains 36,473 samples that were rejected or misrecognized by an old recognizer of Hitachi. The samples of Test-2 are difficult due to excessive shape distortion or image degradation. The accuracies of the individual classifiers and of combining the classifiers by plurality of votes are listed in Table 2. We can see that combination by plurality gives accuracies higher than those of the best individual classifier.

Table 3 shows the accuracies of confidence-based combination using the sum-rule (average of class confidences) [3]. The three scaling functions and the three activation functions (corresponding to the confidence types linear, exponential, and sigmoid) are combined to give nine confidence measures. Further, the exponential and sigmoid measures are normalized to unit sum to give posterior probabilities in a closed world. In Table 3, we can see that the normalization to unity evidently deteriorates the combination performance. Compared to the accuracies of plurality, the simple scaling function, global normalization, gives very good combination performance with all three confidence types. Global normalization, nevertheless, is outperformed by the scaling functions based on Gaussian densities. With the Gaussian scaling functions, the exponential and sigmoid measures give higher combination accuracies than the log-likelihood (linear). The superiority of the exponential and sigmoid measures over the linear measure indicates that transforming the classifier outputs into probabilistic measures is beneficial. The Gaussian scaling functions outperform global normalization because their parameters are estimated considering the class-specific distributions of the classifier outputs.

Table 3. Combination results (%) using sum-rule
                  Un-normalized     Normalized
  Scaling  Conf   Test-1  Test-2    Test-1  Test-2
  Global   Lin    99.86   91.75
           Exp    99.89   92.24     99.89   92.08
           Sig    99.84   91.48     99.80   91.41
  Gauss-1  Lin    99.85   90.89
           Exp    99.88   92.45     99.84   91.12
           Sig    99.88   92.19     99.84   90.82
  Gauss-2  Lin    99.85   90.89
           Exp    99.90   91.99     99.84   91.12
           Sig    99.89   91.93     99.84   90.94

To confirm the promise of confidence evaluation for various combination rules, we also used the product-rule and the nearest mean (decision template [16]) in combination. The templates of the nearest mean were generated from the validation samples.
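The sum-rule used for Table 3 reduces to a simple average over classifiers. A minimal sketch, with illustrative confidence values of our own rather than the paper's:

```python
import numpy as np

# conf[k] holds the M confidence values produced by classifier E_k
# for one input pattern; three classifiers, three classes here.
conf = np.array([
    [0.90, 0.05, 0.05],   # classifier E0
    [0.70, 0.20, 0.10],   # classifier E1
    [0.40, 0.45, 0.15],   # classifier E2
])

# Sum-rule: average the confidences per class, decide by the maximum.
mean_conf = conf.mean(axis=0)
decision = int(np.argmax(mean_conf))
```

Note that even though classifier E2 prefers class 1, the averaged confidences still select class 0, illustrating how the sum-rule smooths over individual disagreements.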
The accuracies of the product-rule and the nearest mean are shown in Table 4 and Table 5, respectively. In combination using the product-rule, the normalized and un-normalized confidences give exactly the same classification results. Further, the product-rule on the exponential confidence yields the same classification result as the sum-rule on the log-likelihood. We can see that
the combination performance of the product-rule is consistently inferior to that of the sum-rule. In combination using the nearest mean, the combination of un-normalized exponential measures gives very poor classification performance. This is because the exponential has a very wide range of values, and so the samples of one class are not compact. The normalized exponential measures, nevertheless, perform fairly well. The best performance is given by the un-normalized sigmoid measures, and is higher than that of plurality. The great difference in accuracy between the linear measure and the sigmoid measure indicates that confidence transformation is crucial for decision-template-based combination as well.

Table 4. Combination results (%) using product-rule
  Scaling  Conf  Test-1  Test-2
  Global   Exp   99.86   91.75
           Sig   99.81   91.26
  Gauss-1  Exp   99.85   90.89
           Sig   99.85   90.61
  Gauss-2  Exp   99.89   90.89
           Sig   99.85   90.70

Table 5. Combination results (%) using nearest mean
                  Un-normalized     Normalized
  Scaling  Conf   Test-1  Test-2    Test-1  Test-2
  Global   Lin    99.66   85.83
           Exp    96.78   70.36     99.74   89.29
           Sig    99.47   82.37     99.49   82.35
  Gauss-1  Lin    99.65   85.98
           Exp    41.82    8.83     99.84   91.05
           Sig    99.83   90.61     99.87   91.96
  Gauss-2  Lin    99.65   85.98
           Exp    41.82    8.83     99.84   91.05
           Sig    99.86   91.61     99.83   90.75
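The two remaining rules can be sketched as follows. The helper names and toy numbers are ours; in particular, the decision templates are passed in directly here, whereas the paper estimates them as class-wise means over validation samples. The sketch also illustrates why the product-rule on exponential confidences matches the sum-rule on log-likelihoods: the product of exp(f_k) terms ranks classes identically to the sum of the f_k.

```python
import numpy as np

def product_rule(conf):
    """conf: K x M matrix of positive confidences; decide by the
    class with the largest product over the K classifiers."""
    return int(np.argmax(np.prod(conf, axis=0)))

def nearest_mean(conf, templates):
    """Decision-template combination: classify to the class whose
    template (a K x M mean confidence matrix) is closest to conf
    in Euclidean distance."""
    dists = [np.linalg.norm(conf - t) for t in templates]
    return int(np.argmin(dists))

# Two classifiers, two classes; f holds log-likelihood confidences.
f = np.array([[2.0, 0.5],
              [1.5, 1.0]])
same = product_rule(np.exp(f)) == int(np.argmax(f.sum(axis=0)))
```

Here `same` is true by construction for any f, since exp is monotone and log turns the product into the sum.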
4. Conclusion
The experimental results of this paper show that the transformation of classifier outputs to confidence values is effective in improving the performance of measurement-level combination. Representing the confidence as a class probability or likelihood, and estimating the scaling parameters from the class-specific distributions of classifier outputs, show their superiority. In the future, we will test the efficiency of confidence evaluation with weighted combination rules and meta-classification rules.
References
[1] L. Xu, A. Krzyzak, C.Y. Suen, Methods of combining multiple classifiers and their applications to handwriting recognition, IEEE Trans. Systems, Man, and Cybernetics, 22(3): 418-435, 1992.
[2] T.K. Ho, J.J. Hull, S.N. Srihari, Decision combination in multiple classifier systems, IEEE Trans. Pattern Analysis and Machine Intelligence, 16(1): 66-75, 1994.
[3] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Trans. Pattern Analysis and Machine Intelligence, 20(3): 226-239, 1998.
[4] J. Kittler, A framework for classifier fusion: is it still needed? Advances in Pattern Recognition, F.J. Ferri, et al. (Eds.), SSPR and SPR 2000, LNCS Vol. 1876, Springer, 2000, pp. 45-56.
[5] X. Lin, et al., Adaptive confidence transform based classifier combination for Chinese character recognition, Pattern Recognition Letters, 19(10): 975-988, 1998.
[6] A.S. Atukorale, P.N. Suganthan, Combining classifiers based on confidence values, Proc. 5th ICDAR, Bangalore, India, 1999, pp. 37-40.
[7] L.P. Cordella, et al., Reliability parameters to improve combination strategies in multi-expert systems, Pattern Analysis and Applications, 2(3): 205-214, 1999.
[8] J.S. Denker, Y. LeCun, Transforming neural-net output levels to probability distributions, Advances in Neural Information Processing 3, R.P. Lippmann, J.E. Moody, D.S. Touretzky (Eds.), Morgan Kaufmann, 1991, pp. 853-859.
[9] A. Hoekstra, S.A. Tholen, R.P.W. Duin, Estimating the reliability of neural network classification, Proc. ICANN'96, Bochum, Germany, 1996, pp. 53-58.
[10] R.P.W. Duin, D.M.J. Tax, Classifier conditional posterior probabilities, Advances in Pattern Recognition: SSPR'98 & SPR'98, Amin, Dori and Freeman (Eds.), Springer, 1998, pp. 611-619.
[11] J. Schuermann, Pattern Classification: A Unified View of Statistical and Neural Approaches, Wiley-Interscience, 1996.
[12] J. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, Advances in Large Margin Classifiers, A. Smola, et al. (Eds.), MIT Press, 1999.
[13] C.-L. Liu, H. Sako, H. Fujisawa, Performance evaluation of pattern classifiers for handwritten character recognition, Int. J. Document Analysis and Recognition, 4(3): 191-204, 2002.
[14] C.-L. Liu, H. Sako, H. Fujisawa, Learning quadratic discriminant function for handwritten character recognition, Proc. 16th ICPR, Quebec, Canada, 2002, Vol. 4, pp. 44-47.
[15] C.-L. Liu, K. Nakashima, H. Sako, H. Fujisawa, Handwritten digit recognition: investigation of normalization and feature extraction techniques, submitted to Pattern Recognition, 2003.
[16] L.I. Kuncheva, J.C. Bezdek, R.P.W. Duin, Decision templates for multiple classifier fusion: an experimental comparison, Pattern Recognition, 34(2): 299-314, 2001.