Estimating the Bayes Error Rate through Classifier Combining

Kagan Tumer and Joydeep Ghosh
Department of Electrical and Computer Engineering
University of Texas, Austin, TX 78712-1084
E-mail: [email protected]
Abstract

The Bayes error provides the lowest achievable error rate for a given pattern classification problem. There are several classical approaches for estimating or finding bounds for the Bayes error. One type of approach focuses on obtaining analytical bounds, which are both difficult to calculate and dependent on distribution parameters that may not be known. Another strategy is to estimate the class densities through non-parametric methods, and use these estimates to obtain bounds on the Bayes error. This article presents a novel approach to estimating the Bayes error based on classifier combining techniques. For an artificial data set where the Bayes error is known, the combiner-based estimate outperforms the classical methods.
1. Introduction

For many real-life classification problems, perfect classification is not possible. In addition to fundamental limits to classification accuracy arising from overlapping class densities, errors creep in because of deficiencies in the classifier and the training data. Classifier-related problems such as an incorrect structural model, parameters, or learning regime may be overcome by changing or improving the classifier. On the other hand, errors caused by the data (e.g., finite training sets, mislabeled patterns) cannot necessarily be corrected during the classification stage. It is therefore important not only to design a good classifier, but also to estimate limits to achievable classification rates. Such estimates determine whether it is worthwhile to pursue alternative classification schemes. The fundamental performance limit for statistical pattern recognition problems is given by the Bayes error. This error provides a lower bound on the error rate that can be achieved by a pattern classifier [3, 5, 7]. Therefore, it is important to calculate or estimate this
error in order to evaluate the performance of a given classifier. When the pattern distributions are known, it is possible, although not always practical, to obtain the Bayes error directly [7]. However, when the pattern distributions are unknown, the Bayes error is not readily obtainable. In Section 2, we highlight the difficulties in estimating the Bayes error rate and summarize several classical methods aimed at finding bounds for this value. In Section 3 we derive a new method for estimating the Bayes error based on the combining theory introduced in [15, 16]. The estimate we develop relies on the result that combining multiple classifiers reduces the added errors due to the individual classifiers [16], and therefore leads to the isolation and estimation of the Bayes error. This method is applied to an artificial data set in Section 4. The results show that the combining-based method achieves better estimates than the classical methods.
2. Background

In real world classification problems where classes are not fully separable, it is unrealistic to expect absolute classification performance (100% correct). The object of a statistical classification problem is to reach the "best possible" performance. The obvious question that arises, of course, is how to determine the optimum classification rate. Since the Bayes decision provides the lowest error rates, the problem is equivalent to determining the Bayes error [3, 7]. Let us consider the situation where a given pattern vector x needs to be classified into one of L classes. Further, let P(c_i) denote the a priori class probability of class i, 1 ≤ i ≤ L, and p(x|c_i) denote the class likelihood, i.e., the conditional probability density of x given that it belongs to class i. The probability of the pattern x belonging to a specific class i, i.e., the a posteriori probability p(c_i|x), is given by the Bayes
rule:

    p(c_i|x) = \frac{p(x|c_i) P(c_i)}{p(x)},    (1)

where p(x) is the probability density function of x and is given by p(x) = \sum_{i=1}^{L} p(x|c_i) P(c_i). The classifier that assigns a vector x to the class with the highest posterior is called the Bayes classifier, and the error associated with this classifier is called the Bayes error, given by [7, 9]:

    E_{bay} = 1 - \sum_{i=1}^{L} \int_{C_i} P(c_i)\, p(x|c_i)\, dx,    (2)
where C_i is the region where class i has the highest posterior. Obtaining the Bayes error from Equation 2 entails evaluating the multi-dimensional integral of possibly unknown multivariate density functions over unspecified regions (C_i). Due to the difficulty of this operation, the Bayes error can be computed directly only for a very limited number of problems. For most real world problems, estimating the densities (e.g., through Parzen windows) and priors is the only available course of action. Although tedious, a numerical integration method can then be used to obtain the Bayes error. However, since errors are introduced both during the estimation of the class densities and regions, and compounded by a numerical integration scheme, the results are in general not reliable. Therefore, attention has focused on approximations and bounds for the Bayes error, which are either calculated through distribution parameters, or estimated through training data characteristics.
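To make the direct computation concrete, here is a small sketch (ours, not part of the original paper) that numerically integrates Equation 2 for a one-dimensional, two-class problem with known Gaussian densities; the priors, means and variances are illustrative assumptions chosen to mirror the first dimension of DATA1 in Section 4.

```python
# Illustrative sketch (not from the paper): Bayes error of a one-dimensional,
# two-class Gaussian problem, obtained by numerically integrating Equation 2,
# i.e. 1 minus the integral of max_i P(c_i) p(x|c_i).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Assumed priors and class-conditional densities (hypothetical example).
priors = [0.5, 0.5]
densities = [norm(loc=0.0, scale=1.0), norm(loc=2.56, scale=1.0)]

def winning_mass(x):
    """Integrand of Equation 2: max_i P(c_i) p(x|c_i) at point x."""
    return max(p * d.pdf(x) for p, d in zip(priors, densities))

correct, _ = quad(winning_mass, -np.inf, np.inf)
print(f"Bayes error: {1.0 - correct:.4f}")   # roughly 0.10 for these parameters
```

With these parameters the estimate comes out to roughly 10%, the value quoted for DATA1 in Section 4; for real problems, of course, the densities in the integrand would themselves have to be estimated, which is precisely the difficulty discussed above.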
2.1. Parametric Estimates of the Bayes Error

One of the simplest bounds for the Bayes error is provided by the Mahalanobis distance measure [3]. For a 2-class problem, let \mu_i and \Sigma_i be the mean vector and covariance matrix, respectively, for classes i = 1, 2. Furthermore, let \Sigma be the non-singular, average covariance matrix, where \Sigma = P(c_1)\Sigma_1 + P(c_2)\Sigma_2. Then, the Mahalanobis distance \Delta, given by:

    \Delta = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2),    (3)

provides the following bound on the Bayes error:

    E_{bay} \le \frac{2 P(c_1) P(c_2)}{1 + P(c_1) P(c_2) \Delta}.    (4)

The main advantage of this bound is the lack of restriction on the class distributions. Furthermore, it
is easy to calculate using only the sample mean and sample covariance matrices.^1 It therefore provides a quick way of obtaining an approximation for the Bayes error. However, it is not a particularly tight bound, and more importantly, as formulated above, it is restricted to a 2-class problem. Another bound for a 2-class problem can be obtained from the Bhattacharyya distance. For a 2-class problem, the Bhattacharyya distance is given by [3]:

    \rho = -\ln \int \sqrt{p(x|c_1)\, p(x|c_2)}\, dx.    (5)

In particular, if the class densities are Gaussian with mean vectors \mu_i and covariance matrices \Sigma_i for classes i = 1, 2, respectively, and (\Sigma_1 + \Sigma_2)/2 is the average covariance, the Bhattacharyya distance is given by [7]:

    \rho = \frac{1}{8} (\mu_2 - \mu_1)^T \left[ \frac{\Sigma_1 + \Sigma_2}{2} \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_1 + \Sigma_2}{2} \right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}.    (6)

Using the Bhattacharyya distance, the following bounds on the Bayes error can be obtained [3]:

    \frac{1}{2} \left[ 1 - \sqrt{1 - 4p\,\exp(-2\rho)} \right] \le E_{bay} \le \sqrt{p}\,\exp(-\rho),    (7)

where p = P(c_1) P(c_2). The Bhattacharyya distance generally provides a tighter error bound than the Mahalanobis distance, but has two drawbacks: it requires knowledge of the class densities, and is more difficult to compute. Even if the class distributions are known, computing Equation 5 is not generally practical. Therefore, Equation 6 has to be used even for non-Gaussian distributions to alleviate both concerns. While an estimate for the Bhattacharyya distance can be obtained by computing the first and second moments of the data and using Equation 6, this compromises the quality of the bound. A detailed discussion of the effects of using training sample estimates for computing the Bhattacharyya distance is presented in [4].
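As a concrete illustration (ours, not part of the original paper), the sketch below evaluates the Mahalanobis bound of Equation 4 and the Bhattacharyya bounds of Equations 6 and 7 for two Gaussian classes with known parameters; the example values correspond to the DATA1 configuration described in Section 4.

```python
# Illustrative sketch: Mahalanobis bound (Eq. 4) and Bhattacharyya bounds
# (Eqs. 6-7) for a two-class problem with Gaussian class densities.
import numpy as np

def mahalanobis_bound(mu1, mu2, cov1, cov2, p1, p2):
    """Upper bound on the Bayes error from the Mahalanobis distance (Eqs. 3-4)."""
    cov = p1 * cov1 + p2 * cov2                       # prior-weighted average covariance
    diff = mu1 - mu2
    delta = diff @ np.linalg.inv(cov) @ diff          # Mahalanobis distance (Eq. 3)
    return 2.0 * p1 * p2 / (1.0 + p1 * p2 * delta)    # Eq. 4

def bhattacharyya_bounds(mu1, mu2, cov1, cov2, p1, p2):
    """Lower and upper bounds on the Bayes error from Eqs. 6-7."""
    cov_avg = 0.5 * (cov1 + cov2)
    diff = mu2 - mu1
    rho = 0.125 * (diff @ np.linalg.inv(cov_avg) @ diff) \
          + 0.5 * np.log(np.linalg.det(cov_avg)
                         / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    p = p1 * p2
    upper = np.sqrt(p) * np.exp(-rho)
    lower = 0.5 * (1.0 - np.sqrt(max(0.0, 1.0 - 4.0 * p * np.exp(-2.0 * rho))))
    return lower, upper

# Example: DATA1-like parameters (mean shift of 2.56 in the first of 8
# dimensions, identity covariances, equal priors).
d = 8
mu1, mu2, I = np.zeros(d), np.r_[2.56, np.zeros(d - 1)], np.eye(d)
print(mahalanobis_bound(mu1, mu2, I, I, 0.5, 0.5))     # about 0.1895
print(bhattacharyya_bounds(mu1, mu2, I, I, 0.5, 0.5))  # about (0.0512, 0.2204)
```

These values match the 18.95% Mahalanobis bound and the 5.12-22.04% Bhattacharyya bounds reported for DATA1 in Table 2.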
A tighter upper bound than either the Mahalanobis distance or the Bhattacharyya distance based bounds is provided by the Chernoff bound:

    E_{bay} \le P(c_1)^s P(c_2)^{1-s} \int p(x|c_1)^s\, p(x|c_2)^{1-s}\, dx,    (8)

where 0 \le s \le 1. For classes with Gaussian densities, the integration in Equation 8 yields \exp(-\rho_c(s)), where the Chernoff distance, \rho_c(s), is given by [7]:

    \rho_c(s) = \frac{s(1-s)}{2} (\mu_2 - \mu_1)^T \left[ s\Sigma_1 + (1-s)\Sigma_2 \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{|s\Sigma_1 + (1-s)\Sigma_2|}{|\Sigma_1|^s\, |\Sigma_2|^{1-s}}.
^1 The accuracy of the bound will depend on the training sample estimates of the mean and covariance.
The optimum s for a given \mu_i and \Sigma_i combination can be obtained by plotting \rho_c(s) for various s values [7] (a small numerical sketch of this search is given at the end of this subsection). Note that the Bhattacharyya distance is a special case of the Chernoff distance, as it is obtained when s = 0.5. Although the Chernoff bound provides a slightly tighter bound on the error, the Bhattacharyya bound is often preferred because it is easier to compute [7]. The common limitation of the bounds discussed so far stems from their restriction to 2-class problems. An extension of these bounds to an L-class problem is presented in [9]. In this scheme, upper and lower bounds for the Bayes error of an L-class problem are obtained from the bounds on the Bayes error of L sub-problems, each involving L-1 classes. Continuing this progression eventually reduces the problem to obtaining the Bayes error for 2-class problems. Based on this technique, the upper and lower bounds for the Bayes error of an L-class problem are respectively given by [9]:

    \min_{\alpha \in \{0,1\}} \left( \frac{1}{L-2} \sum_{i=1}^{L} (1 - P(c_i))^{\alpha}\, E_{bay,i}^{L-1} + \frac{1-\alpha}{L-2} \right)

and

    \frac{L-1}{L(L-2)} \sum_{i=1}^{L} (1 - P(c_i))\, E_{bay,i}^{L-1},

where E_{bay}^{L} is the Bayes error for an L-class problem, E_{bay,i}^{L-1} is the Bayes error of the (L-1)-class sub-problem in which the ith class has been removed, and \alpha is an optimization parameter. Therefore, the Bayes error for an L-class problem can be computed starting from the \binom{L}{2} pairwise errors.
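The sketch below (ours, not from the paper) carries out the search over s mentioned at the start of this subsection for two Gaussian classes: it evaluates the Chernoff bound of Equation 8 on a grid of s values and keeps the tightest one. The grid and function names are our own choices.

```python
# Illustrative sketch: grid search for the s that gives the tightest Chernoff
# bound (Eq. 8) when both class densities are Gaussian.
import numpy as np

def chernoff_distance(s, mu1, mu2, cov1, cov2):
    """Chernoff distance rho_c(s) between two Gaussian class densities."""
    diff = mu2 - mu1
    cov_s = s * cov1 + (1.0 - s) * cov2
    quad_term = 0.5 * s * (1.0 - s) * (diff @ np.linalg.inv(cov_s) @ diff)
    log_term = 0.5 * np.log(np.linalg.det(cov_s)
                            / (np.linalg.det(cov1) ** s * np.linalg.det(cov2) ** (1.0 - s)))
    return quad_term + log_term

def tightest_chernoff_bound(mu1, mu2, cov1, cov2, p1, p2,
                            grid=np.linspace(0.01, 0.99, 99)):
    """Return (best_s, bound), minimizing the right-hand side of Eq. 8 over the grid."""
    bounds = [p1 ** s * p2 ** (1.0 - s)
              * np.exp(-chernoff_distance(s, mu1, mu2, cov1, cov2)) for s in grid]
    i = int(np.argmin(bounds))
    return grid[i], bounds[i]

# With equal priors and identical covariances the optimum falls at s = 0.5,
# i.e. the Chernoff bound reduces to the Bhattacharyya bound.
d = 8
mu1, mu2, I = np.zeros(d), np.r_[2.56, np.zeros(d - 1)], np.eye(d)
print(tightest_chernoff_bound(mu1, mu2, I, I, 0.5, 0.5))   # about (0.5, 0.2204)
```

For this symmetric example the search simply recovers the Bhattacharyya bound, consistent with the special-case relation noted above.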
2.2. Non-Parametric Estimate of the Bayes Error

The computation of the bounds for 2-class problems presented in the previous section, and their extensions to the general L-class problem, depend on knowing (or approximating) certain class distribution parameters, such as priors, class means and class covariance matrices. Although it is in general possible to estimate these values from the data, the resulting bounds are not always satisfactory. A method that provides an estimate for the Bayes error without requiring knowledge of the class distributions is based on the nearest neighbor (NN) classifier. The NN classifier assigns a test pattern to the same class as the pattern in the training set to which it is closest (defined in terms of a pre-determined distance metric). The Bayes error can be given in terms of the error of an NN classifier. Given an L-class problem
with sufficiently large training data, the following result holds [2]:

    \frac{L-1}{L} \left( 1 - \sqrt{1 - \frac{L}{L-1} E_{NN}} \right) \le E_{bay} \le E_{NN}.    (9)

This result is independent of the distance metric chosen. Equation 9 places bounds on the Bayes error provided that the sample sizes are sufficiently large. These results are particularly significant in that they are attained without any assumptions or restrictions on the underlying class distributions. However, when dealing with limited data, one must be aware that Equation 9 is based on asymptotic analysis. Corrections to this equation based on data size limitations, and its extension to k-NN classifiers, are discussed in [1, 6, 8].
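A minimal sketch (ours, not from the paper) of turning a measured nearest neighbor error rate into the bounds of Equation 9:

```python
# Illustrative sketch: Bayes error bounds obtained from a measured nearest
# neighbor error rate via the Cover-Hart result of Equation 9.
import math

def bayes_bounds_from_nn(e_nn, n_classes):
    """Return (lower, upper) bounds on the Bayes error given E_NN and L (Eq. 9)."""
    L = n_classes
    lower = (L - 1) / L * (1.0 - math.sqrt(max(0.0, 1.0 - L / (L - 1) * e_nn)))
    return lower, e_nn

# Example: a 2-class problem where the NN classifier makes 15% errors.
print(bayes_bounds_from_nn(0.15, 2))   # about (0.0817, 0.15)
```

For a 2-class problem with a 15% nearest neighbor error this gives approximately (8.17%, 15%), which is exactly the nearest neighbor entry for DATA1 in Table 2.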
3. Bayes Error Estimation through Classifier Combining

In difficult classification problems with limited training data, high dimensional patterns or patterns that involve a large amount of noise, the performance of a single classifier is often unsatisfactory and/or unreliable. In such cases, combining the outputs of multiple classifiers has been shown to improve the classification performance [10, 12, 16, 17, 18]. The derivation of the Bayes error estimate relies on the result that the outputs of certain well-trained neural networks approximate the corresponding a posteriori probabilities [13, 14]. Thus, for such networks, the output of the mth classifier on the ith class is modeled as f_i^m(x) = p(c_i|x) + \epsilon_i^m(x), where \epsilon_i^m(x) is the error in approximating the posterior probability for x, associated with the ith class. When the outputs of the classifiers provide estimates of the class posteriors, simple averaging of the corresponding outputs can be an effective strategy for combining the results of several classifiers to obtain a better decision. The effect of such a combining scheme on classification decision boundaries and their relation to error rates was theoretically analyzed in [15, 16]. More specifically, we showed that combining the outputs of different classifiers "tightens" the error-prone regions surrounding decision boundaries. The total error of a classifier (E_{tot}) can be divided into the Bayes error and the added error (E_{add}), which is the extra error due to the specific classifier (model/parameters) being used. Thus, the error of a single classifier and of the ave combiner, which averages the outputs of N classifiers, are respectively given by:

    E_{tot} = E_{bay} + E_{add},    (10)
    E_{tot}^{ave} = E_{bay} + E_{add}^{ave}.    (11)
Table 1. Artificial Data Sets (class means and variances for each of the 8 dimensions; covariances are diagonal).

                     i (dimension):   1      2      3      4      5      6      7      8
DATA1   class 1   mu_1:               0      0      0      0      0      0      0      0
                  sigma_1^2:          1      1      1      1      1      1      1      1
        class 2   mu_2:               2.56   0      0      0      0      0      0      0
                  sigma_2^2:          1      1      1      1      1      1      1      1
DATA2   class 1   mu_1:               0      0      0      0      0      0      0      0
                  sigma_1^2:          1      1      1      1      1      1      1      1
        class 2   mu_2:               3.86   3.10   0.84   0.84   1.64   1.08   0.26   0.01
                  sigma_2^2:          8.41   12.06  0.12   0.22   1.49   1.77   0.35   2.73
Table 2. Bayes Error Estimates for Artificial Data.
Feature   True Bayes   Combiner-Based       Nearest Neighbor        Mahalanobis        Bhattacharyya
Set       Error        Estimate (Eq. 13)    Bounds (Eq. 9)          Bound (Eqs. 3-4)   Bounds (Eqs. 6-7)
DATA1     10.00        9.53                 8.17 ≤ E_bay ≤ 15.00    18.95              5.12 ≤ E_bay ≤ 22.04
DATA2      1.90        2.05                 2.46 ≤ E_bay ≤ 4.80     14.13              0.23 ≤ E_bay ≤ 4.74

Note that the Bayes error is not affected by the choice of the classifier. By determining the amount of improvement that can be obtained from a combiner, the Bayes error can be isolated and evaluated. The errors in approximating class posteriors are rarely independent, and generally depend on the correlation among the individual classifiers, denoted \delta. This leads to the following relationship between E_{add}^{ave} and E_{add} (see [16] for details):

    E_{add}^{ave} = \frac{1 + \delta(N-1)}{N} E_{add}.    (12)
Solving the set of Equations 10, 11 and 12 for E_{bay} provides:

    E_{bay} = \frac{N E_{tot}^{ave} - (\delta(N-1) + 1) E_{tot}}{(N-1)(1-\delta)}.    (13)

Equation 13 provides an estimate of the Bayes error as a function of the individual classifier error, the combined classifier error, the number of classifiers combined, and the correlation among them.
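Equation 13 is straightforward to apply once the four quantities on its right-hand side have been measured; the sketch below (ours, with made-up numbers) is a direct transcription of the formula.

```python
# Illustrative sketch: combiner-based Bayes error estimate of Equation 13.
def bayes_error_estimate(e_tot, e_tot_ave, n, delta):
    """E_bay from the mean single-classifier error e_tot, the error of the
    averaging combiner e_tot_ave, the number of classifiers n, and the
    correlation delta among the classifiers' posterior-approximation errors."""
    return (n * e_tot_ave - (delta * (n - 1) + 1) * e_tot) / ((n - 1) * (1 - delta))

# Hypothetical numbers: 20 classifiers with a 12% average error whose ave
# combiner reaches 10.5%, with a measured error correlation of 0.4.
print(bayes_error_estimate(0.12, 0.105, 20, 0.4))   # about 0.094
```

Note that the denominator (N-1)(1-\delta) makes the estimate ill-conditioned as \delta approaches 1, i.e., when the classifiers' errors are almost fully correlated and combining provides little new information.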
4. Experimental Results on an Artificial Data Set

In this section, we apply the method to two artificial problems with known Bayes error rates. Both these problems are taken from Fukunaga's book [7], and are 8-dimensional, 2-class problems, where each class has a Gaussian distribution. For each problem, the class means and the diagonal elements of the covariance matrices (off-diagonal elements are zero) are given in Table 1. From these specifications, 1000 training patterns and 1000 test patterns were generated for
each problem. The Bayes error rate for both of these problems (10% for DATA1 and 1.9% for DATA2) is given in [7]. It is a well-known result that the outputs of certain properly trained feed-forward artificial neural networks approximate the class posteriors [11, 13, 14]. Therefore, these networks provide a suitable choice for the multiple classifier combining scheme discussed in Section 3. Two different types of networks were selected for this application. The first is a multi-layered perceptron (MLP), and the second is a radial basis function (RBF) network. The single-hidden-layer MLP used for DATA1 had 5 units, and the RBF network had 5 kernels, or centroids. For DATA2 the number of hidden units and the number of kernels were increased to 12. Table 2 shows the different estimates for the Bayes error. For each feature set, the Bayes error is estimated through the separate combining results for both the MLP (that is, by using the average error rate of 20 MLPs as E_{tot} and the error obtained by averaging the network outputs as E_{tot}^{ave}) and the RBF network. A third estimate is obtained from combining the MLP and RBF results. These three values are averaged to yield the results that are reported. Since these data sets are artificial and based on Gaussians with known mean and covariance matrices, the actual distribution parameters are used in the estimates of the Mahalanobis and Bhattacharyya distances, rather than sample estimates. Notice that although the Bhattacharyya bound is expected to be tighter than the Mahalanobis bound, this is not so for DATA1. The reason for this discrepancy is twofold: first, the Mahalanobis distance provides tighter bounds as the error becomes larger [3]; second, two terms contribute to the distance of Equation 6, one for the difference of the means and one for the difference of the covariances. In the case where the covariances are identical, the second term is zero, leading to a small Bhattacharyya distance, which in turn leads to a loose bound on the error. DATA1, by virtue of having a large Bayes error due exclusively to the separation of the class means, represents a case where the Bhattacharyya bound fails to improve on the Mahalanobis bound. For DATA2, the Bhattacharyya distance provides bounds that are more useful, and the upper bound in particular is very similar to the upper bound provided by the NN method. For both DATA1 and DATA2, the Bayes error rate estimate obtained through the classifier combining method introduced in this article provides values that are closest to the actual Bayes error rate, even though both experiments are biased towards the classical techniques (since the data are Gaussian, the actual \mu's and \Sigma's, rather than sample estimates, are used in Equations 3 and 6).
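For readers who wish to reproduce the setup, the artificial data are simple to regenerate; the sketch below (ours, assuming NumPy, equal priors and 500 patterns per class per set, details the paper does not spell out) draws samples according to the Table 1 specifications.

```python
# Illustrative sketch: regenerate the 8-dimensional Gaussian data sets of
# Table 1 (diagonal covariances); class split and random seed are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def make_data(mu1, var1, mu2, var2, n_per_class=500):
    """Draw n_per_class patterns per class; returns patterns and 0/1 labels."""
    x1 = rng.normal(mu1, np.sqrt(var1), size=(n_per_class, len(mu1)))
    x2 = rng.normal(mu2, np.sqrt(var2), size=(n_per_class, len(mu2)))
    x = np.vstack([x1, x2])
    y = np.r_[np.zeros(n_per_class, dtype=int), np.ones(n_per_class, dtype=int)]
    return x, y

zeros, ones = np.zeros(8), np.ones(8)
# DATA1: unit variances everywhere, mean shift of 2.56 in the first dimension.
x_train1, y_train1 = make_data(zeros, ones, np.r_[2.56, np.zeros(7)], ones)
# DATA2: means and variances as listed in Table 1.
mu2 = np.array([3.86, 3.10, 0.84, 0.84, 1.64, 1.08, 0.26, 0.01])
var2 = np.array([8.41, 12.06, 0.12, 0.22, 1.49, 1.77, 0.35, 2.73])
x_train2, y_train2 = make_data(zeros, ones, mu2, var2)
```

A test set of the same size can be drawn in the same way.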
5. Conclusion

Determining the limits to achievable classification rates is useful in difficult classification problems. Such information is helpful in determining whether it is worthwhile to try a different classifier, or a different set of parameters with the chosen classifier, with the hope of getting better classification rates. For example, knowing that certain errors (e.g., mislabeled patterns) are too deeply rooted in the original data to be eliminated with better classifiers or combiners may prevent the search for non-existent "better" classifiers. In this article, we derive an estimate for the Bayes error based on linear combining theory. Experimental results show that this error estimate compares favorably with classical estimation methods. The error estimates presented in this article also help isolate classifier/feature set pairs that fall short of their performance potentials, directing further design efforts (classifier selection/feature extraction) into areas where they are needed most.

Acknowledgements: This research was supported in part by AFOSR contract F49620-93-1-0307, NSF grant ECS 9307632, and ARO contracts DAAH 04-94-G0417 and 04-95-10494.
References

[1] L. J. Buturovic. Improving k-nearest neighbor density and error estimates. Pattern Recognition, 26(4):611-616, 1993.
[2] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21-27, 1967.
[3] P. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall, 1982.
[4] A. Djouadi, O. Snorrason, and F. D. Garber. The quality of training-sample estimates of the Bhattacharyya coefficient. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1):92-97, January 1990.
[5] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[6] K. Fukunaga. The estimation of the Bayes error by the k-nearest neighbor approach. In L. N. Kanal and A. Rosenfeld, editors, Progress in Pattern Recognition 2, pages 169-187. North-Holland, Amsterdam, 1985.
[7] K. Fukunaga. Introduction to Statistical Pattern Recognition (2nd Ed.). Academic Press, 1990.
[8] K. Fukunaga and D. Hummels. Bayes error estimation using Parzen and k-NN procedures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:634-643, 1987.
[9] F. Garber and A. Djouadi. Bounds on the Bayes classification error based on pairwise risk functions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(3):281-288, 1988.
[10] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993-1000, 1990.
[11] R. P. Lippmann. A critical overview of neural network pattern classifiers. In IEEE Workshop on Neural Networks for Signal Processing, 1991.
[12] M. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In R. J. Mammone, editor, Neural Networks for Speech and Image Processing, chapter 10. Chapman-Hall, 1993.
[13] M. Richard and R. Lippmann. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3(4):461-483, 1991.
[14] D. W. Ruck, S. K. Rogers, M. E. Kabrisky, M. E. Oxley, and B. W. Suter. The multilayer Perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks, 1(4):296-298, 1990.
[15] K. Tumer and J. Ghosh. Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition, 29(2):341-348, February 1996.
[16] K. Tumer and J. Ghosh. Theoretical foundations of linear and order statistics combiners for neural pattern classifiers. IEEE Transactions on Neural Networks, 1996. (Accepted for publication. Preprints can be obtained from URL http://www.lans.ece.utexas.edu/kagan/publications.html).
[17] D. H. Wolpert. Stacked generalization. Neural Networks, 5:241-259, 1992.
[18] L. Xu, A. Krzyzak, and C. Y. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics, 22(3):418-435, May 1992.