Bias, Variance, and Error Correcting Output Codes for Local Learners
Francesco Ricci
Istituto per la Ricerca Scientifica e Tecnologica, 38050 Povo (TN), Italy
David W. Aha
Navy Center for Applied Research in Artificial Intelligence, Naval Research Laboratory, Code 5510, Washington, DC 20375 USA
Abstract: This paper presents a bias/variance decomposition analysis of a local learning algorithm, the nearest neighbor classifier, extended with error-correcting output codes. This extended algorithm often considerably reduces the 0-1 (i.e., classification) error in comparison with the nearest neighbor classifier (Ricci & Aha, 1997). The analysis presented here reveals that this performance improvement is obtained by drastically reducing bias at the cost of increasing variance. We also show that, even in classification problems with few classes (m ≤ 5), extending the codeword length beyond the limit that assures column separation yields an error reduction. This error reduction occurs not only in the variance, which is due to the voting mechanism used for error-correcting output codes, but also in the bias.
Keywords: Case-based learning, classification, error-correcting output codes, bias and variance
Email: [email protected], [email protected]
Phone: ++39 461 314334   Fax: ++39 461 302040

NCARAI Technical Report AIC-97-025
1 Introduction
Dietterich and Bakiri [1991; 1995] introduced the use of error-correcting output codes (ECOCs) for representing class labels in supervised learning tasks. They found that ECOCs significantly increased the classification accuracy of two learning algorithms on several multiclass tasks (i.e., tasks with k > 2 classes). Subsequently, Kong and Dietterich [1995] analyzed these algorithms and reported that ECOCs reduce both bias and variance. The algorithms investigated in these studies were C4.5 [Quinlan, 1993] and error backpropagation (BP) [Rumelhart and McClelland, 1986]. These are examples of global learning algorithms, in that they generate classification predictions by examining the mappings of instances to classes in the entire training set. This distinguishes them from local learning algorithms [Bottou and Vapnik, 1992], such as the nearest neighbor classifier, which generate classification predictions by examining only the mappings of nearby instances. In the context of ECOCs, this distinction is important: global algorithms benefit from ECOCs because the classification predictions they generate for each output bit are not correlated. In contrast, local algorithms will not benefit from ECOCs because, independent of each output bit's class partition, they generate classification predictions from the same (local) information in the training set, which causes the output bit predictions to be correlated [Kong and Dietterich, 1995; Wettschereck and Dietterich, 1992]. In [Ricci and Aha, 1997], we confirmed this analysis and introduced a method to overcome this problem: use a feature selection approach, independently for each output bit, to ensure that different local information is used to generate classification predictions for different output bits. The resulting local ECOC algorithm, an extension of the nearest neighbor classifier, significantly increased classification accuracies on several multiclass tasks. Although we included a followup analysis that explains the conditions (i.e., data characteristics) under which these benefits can be expected, we did not perform a bias/variance decomposition analysis. In this paper, we perform this analysis and show that the benefits provided by ECOCs to our local learning approach differ from those that they provide to global learning algorithms. In particular, for local learning algorithms ECOCs reduce bias at the expense of variance, instead of reducing both bias and variance.

Section 2 reviews definitions of bias and variance, and Section 3 reviews error-correcting output codes. Section 4 then summarizes our bias/variance decomposition analysis for our local learning approach, on both archived and synthesized data sets, while Section 5 discusses ways in which variance can be reduced and how our approach can be applied to tasks with few features.
2 The Bias and Variance Decomposition
This section reviews the bias/variance decomposition of the error of a classifier, following the definitions given in [Breiman, 1996a] and [Friedman, 1996]. In a classification problem one assumes that there exist two random variables, $X$ and $Y$, where $X$ describes the input parameters (i.e., instances) and $Y$ is a discrete variable with a finite number of values, $Y \in \{1, \ldots, k\}$, called classes. A classification problem is completely described by the $k$ real deterministic functions $f_i(x) = P(Y = i \mid X = x)$. The goal is to produce a classifier $\hat{Y} \in \{1, \ldots, k\}$ that minimizes the misclassification error (risk) $E[r(X)]$, where
$r(x) = \sum_{i=1}^{k} f_i(x)\, 1(\hat{Y}(x) \neq i)$   (1)
and where $1(\cdot)$ is a function that takes the value 1 if its argument is true and 0 otherwise [Friedman, 1996]. The minimum misclassification rate is obtained by the Bayes optimal classifier

$Y_B(x) = \arg\max_i f_i(x)$   (2)

with misclassification rate
$Er(Y_B) = E[r_B(X)] = 1 - \int \max_i f_i(X)\, P(dX)$   (3)
Given a finite training set $T = \{(x_i, y_i) : i = 1, \ldots, m\}$, the classifier induced by a supervised learner depends on $T$; we denote it by $\hat{Y}(x|T)$. Some classifiers first form an approximation $\hat{f}_i(x|T)$ to $f_i(x)$ and then use it to form the rule $\hat{Y}(x|T) = \arg\max_i \hat{f}_i(x|T)$; other classifiers skip this step and build $\hat{Y}(x|T)$ directly. In both cases an aggregate classifier $Y_A(x)$ can be defined as [Breiman, 1996a]:

$\hat{f}_i(x) = E_T[\hat{f}_i(x|T)] = P(\hat{Y}(x|T) = i)$   (4)
$Y_A(x) = \arg\max_i \hat{f}_i(x)$   (5)
When the approximations $\hat{f}_i(x)$ or $\hat{f}_i(x|T)$ differ from $f_i(x)$, the classifier behaves differently from the Bayes optimal classifier, with a corresponding error:

$r(x|T) = 1 - \sum_i f_i(x)\,\hat{f}_i(x|T)$   (6)
$r_A(x) = 1 - f_{Y_A(x)}(x)$   (7)

Breiman defines a classifier to be unbiased at $x$ if $Y_A(x) = Y_B(x)$ (i.e., if, on average over the replications of $T$, the classifier picks the correct class more often than any other class). Let $U$ be the set of instances at which $\hat{Y}$ is unbiased and $B$ its complement. We can now define the bias and variance of a classifier under zero-one loss as follows [Breiman, 1996a]:

$Bias(\hat{Y}) = P(\hat{Y}(x) \neq Y(x),\, x \in B) - P(Y_B(x) \neq Y(x),\, x \in B)$   (8)

$Var(\hat{Y}) = P(\hat{Y}(x) \neq Y(x),\, x \in U) - P(Y_B(x) \neq Y(x),\, x \in U)$   (9)

These equations yield the bias/variance decomposition of the error:

$Er(\hat{Y}) = Er(Y_B) + Bias(\hat{Y}) + Var(\hat{Y})$   (10)

Some properties of this definition are worth noting:

- Bias and variance are non-negative. This follows immediately from the definition. Negative variance was a main limitation of a similar decomposition proposed by Kong and Dietterich [1995].
- The variance of the aggregate classifier is zero. The aggregate classifier, where it is unbiased, is equivalent to the Bayes optimal classifier. This property is shared by any deterministic classifier (i.e., one whose behavior is independent of the particular training set T).
- The bias of the Bayes classifier is zero. Again, this is a simple consequence of the definition. This property, a natural requirement of a bias/variance decomposition, is not shared by the decomposition proposed by Kohavi and Wolpert [1996].
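To make these definitions concrete, consider a small numerical illustration; the values are hypothetical and chosen only for exposition. Suppose there are two classes with $f_1(x) = 0.8$ and $f_2(x) = 0.2$ at some point $x$, and suppose that over replications of $T$ the learned classifier predicts class 1 with probability $\hat{f}_1(x) = 0.6$. Then $Y_A(x) = 1 = Y_B(x)$, so $x \in U$ and contributes nothing to the bias. By Equation 6 the expected error at $x$ is $E_T[r(x|T)] = 1 - (0.8 \cdot 0.6 + 0.2 \cdot 0.4) = 0.44$, while the Bayes error at $x$ is $1 - 0.8 = 0.2$; the excess, $0.24$, is the contribution of $x$ to the variance term of Equation 9. If instead $\hat{f}_1(x)$ were $0.3$, we would have $Y_A(x) = 2 \neq Y_B(x)$, so $x \in B$ and its excess error would count toward the bias of Equation 8.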
3 Error Correcting Output Codes
Error-correcting output codes (ECOC) is a multiclass classification technique in which each class is encoded as a string of codeletters called a codeword [Dietterich and Bakiri, 1995]. Given a test instance, each of its codeletters is predicted, and the instance is assigned the class whose codeword has the smallest Hamming distance to the predicted codeword. We now detail this technique and its integration with a nearest neighbor classifier [Ricci and Aha, 1997].

Let $k$ be the number of classes and $l$ an integer. For each $i = 1, \ldots, k$, we create a correspondence, or encoding,

$i \mapsto code_i = b_{i1} b_{i2} \cdots b_{il}$   (11)

with the following properties:

- $(b_{ij})$ is a $k \times l$ matrix with values in a finite set $B$.
- The Hamming distance between any two rows of $(b_{ij})$ is at least $h \geq 1$.
- The Hamming distance between any two columns is at least 1 and at most $l - 1$.

Table 1. Three Different Class Encodings

  Class Name   Atomic   One-Per-Class   ECOC
  Earthling       1       1000          1111111
  Martian         2       0100          0000111
  Venusian        3       0010          0011001
  Italian         4       0001          0101010

(One-Per-Class and ECOC are the two distributed encoding types.)
Table 1 shows three different types of encodings. In the first, atomic, $B = \{1, \ldots, k\}$, $l = 1$, and $h = 1$; this is the standard way to encode a class symbol (i.e., as a string of length 1). The other two encodings are examples of distributed encodings: each class is encoded as a boolean string of length $l > 1$. In one-per-class, $l = k$ and the Hamming distance between every pair of codewords is 2 ($h = 2$). An ECOC encoding is a generalization of one-per-class in which $l > k$, the strings $b_{i1} b_{i2} \cdots b_{il}$ (i.e., the rows of $(b_{ij})$) are boolean, and the minimal Hamming distance between codewords is $h \geq 3$. This gives the encoding error-correcting capabilities: if the number of erroneously predicted bits is not greater than $\lfloor (h-1)/2 \rfloor$, then the correct class is still the closest in Hamming distance. For example, $h = 4$ for the ECOC encoding shown in Table 1.

When $B = \{0, 1\}$, a column of the matrix $(b_{ij})$ partitions the classes, and hence the instances, into two disjoint sets. Thus, a boolean encoding of length $l$ yields $l$ bit functions on the training instances. Let $b_j$ denote the $j$-th bit function and $b_j(x)$ its value on instance $x$. ECOC classifiers are trained and tested as follows:

1. Learn each bit function $b_j$ on the training set, generating hypotheses $\hat{b}_j$.
2. For each instance $x$ in the test set, predict the value of each bit function on $x$ (i.e., $\hat{b}_1(x)\hat{b}_2(x) \cdots \hat{b}_l(x)$).
3. Assign to $x$ the class

   $\arg\min_{i=1,\ldots,k} \{\, H[\hat{b}_1(x)\hat{b}_2(x) \cdots \hat{b}_l(x),\; b_{i1} b_{i2} \cdots b_{il}] \,\}$,   (12)

   where $H[\cdot,\cdot]$ denotes Hamming distance, and compute the 0-1 error on $x$.
4. Average the 0-1 errors over the test set.

This general scheme has been used with several supervised learning techniques to learn the bit functions in step 1. Dietterich and Bakiri [1995] report that ECOCs often significantly increased the classification accuracies of C4.5 [Quinlan, 1993] and of networks trained by backpropagation. Ricci and Aha [1997] extended these results to the local learning algorithm IB1, an implementation of the nearest neighbor classifier. ECOCs can work only if the bit functions have different bias errors [Kong and Dietterich, 1995]. We [Ricci and Aha, 1997] showed that the bias errors made by local learners on different bit functions can be decorrelated by selecting different features for each function.
To perform feature selection, we modified the schemata racing algorithm [Maron and Moore, 1997; Ricci and Aha, 1997]. In the next section we illustrate the effect of ECOCs and feature selection on the bias and variance components of the error.
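To make steps 1-4 above concrete, the following minimal Python sketch implements an ECOC-extended 1-nearest-neighbor classifier. It is an illustration rather than the implementation used in the experiments: the code matrix, the plain Euclidean distance, and the randomly chosen per-bit feature subsets (which stand in for the racing-based feature selection) are all simplifying assumptions.

```python
import numpy as np

def train_ecoc_nn(X, y, code, feature_subsets):
    """For each bit function j, store the training instances restricted to the
    features chosen for that bit, together with the bit labels code[y, j]."""
    models = []
    for j, feats in enumerate(feature_subsets):
        bits = code[y, j]                      # 0/1 label of each training instance for bit j
        models.append((X[:, feats], bits, feats))
    return models

def predict_ecoc_nn(models, code, x):
    """Predict each bit with 1-NN, then return the class whose codeword is
    closest in Hamming distance to the predicted bit string (Equation 12)."""
    predicted = np.empty(len(models), dtype=int)
    for j, (Xj, bits, feats) in enumerate(models):
        nearest = np.argmin(np.linalg.norm(Xj - x[feats], axis=1))
        predicted[j] = bits[nearest]
    hamming = (code != predicted).sum(axis=1)  # distance to each class codeword
    return int(np.argmin(hamming))

# Hypothetical usage with the 4-class, 7-bit ECOC code of Table 1:
code = np.array([[1, 1, 1, 1, 1, 1, 1],
                 [0, 0, 0, 0, 1, 1, 1],
                 [0, 0, 1, 1, 0, 0, 1],
                 [0, 1, 0, 1, 0, 1, 0]])
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = rng.integers(0, 4, size=40)
subsets = [rng.choice(5, size=3, replace=False) for _ in range(code.shape[1])]
model = train_ecoc_nn(X, y, code, subsets)
print(predict_ecoc_nn(model, code, X[0]))
```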
4 Bias and Variance of Error-Correcting Output Codes
In this section we empirically determine the bias and variance of ECOC classifiers on both archived and synthesized data sets. According to Equations 8 and 9, bias and variance can be estimated precisely only when the Bayes optimal error rate and rule are known. However, this error rate is not known for most data sets. Kohavi and Wolpert [1996], using a frequency count estimator of the Bayes optimal error rate, and observing that each instance is unique in all the data sets they used, estimated the Bayes optimal error rate to be zero. This is unlikely, and a decomposition of the error based on this assumption is not entirely reliable. Nevertheless, we also make this assumption, for three reasons. First, assuming the Bayes optimal error rate to be zero means that we overestimate both variance and bias; this follows from Equation 10, because all the addends are non-negative. Moreover, the nearest neighbor classifier is known to have high bias [Breiman, 1996b], as the decomposition of the error on artificial data sets shows (see Table 4). Therefore, it is unlikely that the true bias on real data sets (i.e., the estimate minus a part of the Bayes error rate; see Equation 8) differs greatly from the estimated bias, which is high (see Table 3), or becomes comparable in size to the variance. Second, in the experiments performed on synthesized data sets, where the Bayes error rate is known and the bias/variance decomposition is exactly computable, we observed similar changes in bias and variance with respect to the nearest neighbor classifier. Third, we expect that the aggregate classifiers of the tested algorithms are similar (i.e., their B and U sets are similar), and therefore the changes in bias and variance across algorithms are not due to different overestimates of the bias and variance. In fact, the part of the Bayes error that should be subtracted from the estimated bias depends only on the bias set (i.e., only on the aggregate classifier; see Equation 8), and the same argument applies to variance.

In contrast, the class distributions of the synthesized data sets are known and defined by probabilities $p(x|j)$. Thus, the Bayes optimal error rate can be computed precisely using Bayes theorem: $p(j|x) = p(x|j)p(j) / \sum_i p(x|i)p(i)$.

For each data set we randomly split the data into 100 partitions, with 90% of the data used for training and 10% for testing. We used the relative frequency with which each instance $x$ in the data set was classified as $i$ to estimate $\hat{f}_i(x) = E_T[\hat{f}_i(x|T)] = P(\hat{Y}(x|T) = i)$; the bias set B and its complement U are then obtained from these estimates. A summary description of the archived data sets is shown in Table 2. More information on these data sets, which have only numeric-valued features, can be found in [Ricci and Aha, 1997].
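A minimal sketch of this estimation procedure, under the zero-Bayes-error assumption discussed above, might look as follows. The classifier callback, split counts, and label conventions are illustrative assumptions rather than a description of the software actually used.

```python
import numpy as np

def estimate_bias_variance(X, y, fit_predict, n_classes, n_splits=100, test_frac=0.1, seed=0):
    """Estimate Breiman-style bias and variance (Equations 8-10) from repeated
    random 90/10 splits, assuming the Bayes optimal error rate is zero, so that
    the Bayes rule is taken to agree with the observed label of each instance.
    y must hold integer class labels in 0..n_classes-1; fit_predict(Xtr, ytr, Xte)
    must return integer class predictions for Xte."""
    rng = np.random.default_rng(seed)
    n = len(y)
    counts = np.zeros((n, n_classes))              # how often instance i was predicted as each class
    tested = np.zeros(n)
    for _ in range(n_splits):
        test = rng.choice(n, size=int(test_frac * n), replace=False)
        train = np.setdiff1d(np.arange(n), test)
        preds = fit_predict(X[train], y[train], X[test])
        counts[test, preds] += 1
        tested[test] += 1
    seen = tested > 0
    f_hat = counts[seen] / tested[seen, None]      # estimate of f^_i(x) = P(Y^(x|T) = i)
    y_seen = y[seen]
    aggregate = f_hat.argmax(axis=1)               # aggregate classifier Y_A(x), Equation 5
    # With zero Bayes error, f_i(x) is the indicator of the observed label,
    # so the expected 0-1 error at x (Equation 6) is 1 - f^_{y(x)}(x).
    err = 1.0 - f_hat[np.arange(seen.sum()), y_seen]
    unbiased = aggregate == y_seen                 # U: points where Y_A agrees with the assumed Bayes rule
    bias = err[~unbiased].sum() / seen.sum()
    variance = err[unbiased].sum() / seen.sum()
    return bias, variance
```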
4.1 Archived Data Sets
The results of the experiments conducted on the archived data are shown in Table 3. IB1, an implementation of the nearest neighbor classifier [Aha, 1992], is used as a baseline. The other algorithms (IB1atomic, which uses the standard single-codeletter encoding; IB1opc, the one-per-class encoding; and IB1ecoc, an error-correcting encoding) differ from IB1 in that they all employ a feature selection component (i.e., a distinct feature selection task for each output bit [Ricci and Aha, 1997]) and a distinct class encoding.
Table 2. Selected Data Sets (C = Classes, F = Features; all features are numeric)

  Data Set            Size    C     F
  Glass (GL)           214    6     9
  Clouds98 (CL98)       69    4    98
  Clouds99 (CL99)      321    7    99
  Clouds204 (CL204)    500   10   204
Table 3. Bias and Variance Decomposition of the Error for Archived Data (the Bayes optimal error rate is assumed to be zero, so Error = Bias + Variance)

  Dataset         Measure     IB1     IB1atomic   IB1opc   IB1ecoc
  GLASS           Error       0.318     0.287     0.313     0.234
                  Bias        0.302     0.178     0.232     0.177
                  Variance    0.016     0.109     0.080     0.057
  CLOUDS98        Error       0.390     0.349     0.403     0.364
                  Bias        0.374     0.286     0.347     0.294
                  Variance    0.016     0.063     0.056     0.070
  CLOUDS99        Error       0.432     0.422     0.481     0.382
                  Bias        0.408     0.335     0.390     0.335
                  Variance    0.024     0.087     0.091     0.048
  CLOUDS204-100   Error       0.387     0.411     0.595     0.332
                  Bias        0.366     0.310     0.539     0.277
                  Variance    0.021     0.101     0.056     0.055
  CLOUDS204       Error       0.385     0.392     0.566     0.268
                  Bias        0.371     0.297     0.418     0.209
                  Variance    0.013     0.096     0.148     0.058
The Bayes optimal error rate is not reported here because it is assumed to be zero, as explained above. On these data sets IB1ecoc generally recorded smaller errors than IB1, IB1atomic, and IB1opc; only on CLOUDS98 did IB1ecoc yield a higher error than IB1atomic, because the class encoding space defined by only four classes is too small to generate sufficiently distinctive codewords (the exhaustive codes technique [Dietterich and Bakiri, 1995], used in this case, generates only seven bit functions). From the results shown in Table 3, we conclude that ECOCs drastically reduce the bias component of the error at the cost of increasing the variance. Bias is always reduced, from a minimum of 18% (CLOUDS99) to a maximum of 44% (CLOUDS204). Conversely, the decrease in total error obtained by IB1ecoc is moderated by an increase in variance: IB1ecoc's variance was at least twice IB1's variance on every data set. We briefly explored two methods to decrease variance: increasing the training set size and increasing the codeword length. (Increasing codeword length is an example of increasing the emphasis on "voting" [Breiman, 1996a], which has already been used within ECOC classifiers with promising results [Kong and Dietterich, 1995; Aha and Bankert, 1997].)
[Figure 1 appears here: the bias and the total error of IB1ecoc plotted against codeword length (15 to 60); the vertical axis runs from 0.15 to 0.40.]
Fig. 1. Bias and Error of IB1ecoc when Varying Codeword Length for CLOUDS204-500

While IB1's variance decreased as training set size increased (compare CLOUDS204-100 in Table 3, with 100 training instances, to CLOUDS204, with 500), this did not decrease IB1ecoc's variance. However, increasing codeword length does appear to reduce IB1ecoc's variance, as exemplified in Figure 1, which plots the bias and total error of IB1ecoc on CLOUDS204 (using only 100 instances) with different codeword lengths; only one set of codewords was used for each codeword length. This figure shows that there is a larger decrease in variance than in bias as codeword length increases. Because the choice of codewords influences the error, and thus causes significant oscillation in the curves, more convincing results might be obtained by cross-validating the choice of codewords.
4.2 Synthetic Classification Tasks
As stated earlier, the bias/variance decomposition of the error on archived data sets is difficult to estimate because the Bayes optimal error rate is usually unknown. In contrast, this rate can easily be computed for synthetic classification tasks when a precise model of the data is available. We investigated the following four synthesized tasks:
7norm7classes: This data set has eight features, seven classes, and 700 instances. Each class is drawn from a multivariate normal distribution with unit covariance matrix. Class 1 has mean (a, a, ..., a), class 2 has mean (-a, -a, ..., -a), class 3 has half positive and half negative components, and so on by binary subdivision. This generalizes a data set used by Breiman [1996a]. We set $a = 2/\sqrt{8}$; the Bayes optimal error rate is 0.261.

5ringnorm: This data set has 20 features, five classes, and 700 instances. Class i is multivariate normal with mean zero and covariance matrix $2i - 1$ times the identity, for i = 1, ..., 5. It is similar to the "ringnorm" data set described in [Breiman, 1996a]. Its Bayes optimal error rate is 0.161.

Wave: This data set is similar to one used in CART [Breiman et al., 1984]. It has 22 features, ten classes, and 300 instances. The instances belonging to class k have the 2k-th feature equal to t and the two features 2k - 1 and 2k + 1 equal to t/2; all other features are 0. Random noise with mean 0 and standard deviation 1 is added to each feature. We set t = 3; the Bayes optimal error rate is 0.047.

LED: This is the classic LED display data set used in CART [Breiman et al., 1984]. There are seven boolean features and ten classes. The Bayes optimal error rate is 0.274.
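Because the class-conditional densities of these tasks are known, Bayes optimal error rates of this kind can be reproduced numerically: compute the posterior $p(j|x)$ with Bayes theorem and average $1 - \max_j p(j|x)$ over a large sample. The sketch below shows this mechanism for a generic equal-prior Gaussian mixture; the means and covariances are placeholders, not the exact parameters of the tasks above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_error_gaussian_mixture(means, covs, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the Bayes optimal error rate of an
    equal-prior Gaussian mixture: E[1 - max_j p(j | x)]."""
    rng = np.random.default_rng(seed)
    k = len(means)
    d = len(means[0])
    # Sample from the mixture: pick a class uniformly, then draw x from its Gaussian.
    labels = rng.integers(0, k, size=n_samples)
    x = np.empty((n_samples, d))
    for j in range(k):
        idx = labels == j
        x[idx] = rng.multivariate_normal(means[j], covs[j], size=idx.sum())
    # Class-conditional densities p(x | j); equal priors cancel in the posterior.
    dens = np.column_stack([multivariate_normal(means[j], covs[j]).pdf(x) for j in range(k)])
    posterior = dens / dens.sum(axis=1, keepdims=True)
    return float(np.mean(1.0 - posterior.max(axis=1)))

# Placeholder example: three well-separated Gaussians in two dimensions.
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
covs = [np.eye(2)] * 3
print(bayes_error_gaussian_mixture(means, covs))
```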
Table 4. Bias and Variance Decomposition of the Error for Synthesized Data

  Dataset   Measure     IB1     IB1atomic   IB1opc   IB1ecoc
  7N7C-8D   Error       0.431     0.386     0.413     0.301
            Bias        0.156     0.130     0.151     0.086
            Variance    0.015     0.085     0.092     0.045
  RING5     Error       0.648     0.501     0.612     0.497
            Bias        0.484     0.218     0.391     0.237
            Variance    0.002     0.121     0.059     0.098
  WAVE      Error       0.190     0.376     0.300     0.167
            Bias        0.139     0.096     0.122     0.053
            Variance    0.004     0.233     0.131     0.067
  LED       Error       0.355     0.423     0.406     0.357
            Bias        0.008     0.012     0.027     0.009
            Variance    0.073     0.137     0.105     0.074
Table 4 shows the decomposition of the error for the same set of algorithms that were applied to the archived data sets. The behavior of IB1ecoc on the synthetic data sets is analogous to what we observed on the archived data: IB1ecoc recorded lower error rates than IB1 by reducing bias and increasing variance. On all data sets except LED, error was reduced even though feature selection was not useful here; all of the features are relevant and are not mutually dependent, which explains why IB1atomic and IB1opc recorded higher error rates than IB1. IB1ecoc did not lower the error rate on the LED task because IB1 has low bias on this task, so there is little room for improvement.
5 Discussion and Improvements
The main finding of Section 4 is that ECOC extensions of the local classifier IB1, in which feature selection is performed for each codeletter to decorrelate errors, reduce bias at the cost of an increase in variance. This tradeoff is positive when IB1 has some bias error and there are many classes (m ≥ 5). However, this technique has limitations; we now describe two of them and report additional investigations.
5.1 Reducing variance
Kong and Dietterich [1995] showed that the variance error of C4.5ecoc can be reduced by "voting" multiple repetitions of the same classifier built with bootstrapped training sets [Breiman, 1996a]. In our case (i.e., with a local classifier like IB1ecoc), there is also another option: "smoothing" the local classifier. Research in regression theory has shown that variance can be decreased by combining the influences of instances near a query [Geman et al., 1992]. We therefore replaced IB1 with a distance-weighted k-nearest neighbor classifier [Atkeson et al., 1997]. Applied to the LED display task, both this algorithm and its ECOC variant yield bias and variance equal to zero.

Variance can also be reduced by using longer codewords, as demonstrated in Figure 1. Combining the predictions of multiple bit functions to yield a class prediction is a form of voting, which has beneficial effects on variance, as mentioned above. Apparently, this technique cannot be used for data sets with a small number of classes: the maximal number of bit functions (columns) that preserve the column separation property is $2^{k-1} - 1$, where k is the number of classes [Dietterich and Bakiri, 1995], and we already noted that for data sets like CLOUDS98, which has four classes, only seven different bit functions can be created. But each bit function is learned with a randomized feature selection algorithm; therefore, even if two bit functions are defined in exactly the same way on the training set, the features selected for them can differ, which can cause differences in their bias errors. We therefore evaluated the algorithms using longer codewords whose columns are not all distinct (i.e., using each bit function three times) and applied this technique to CLOUDS98. With a codeword length of 21, the total error decreased from 0.364 to 0.361 (bias 0.294, variance 0.067), and to 0.359 (bias 0.336, variance 0.023) with 42 codeletters. Thus, this procedure greatly reduced variance without increasing overall error.
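As an illustration of the code sizes involved, the sketch below enumerates one standard realization of an exhaustive code for k classes (all non-constant boolean columns with the first bit fixed to 1, which yields the $2^{k-1} - 1$ columns mentioned above) and then repeats the columns to obtain the longer codes used in the CLOUDS98 experiment. The helper functions are hypothetical, not the authors' code.

```python
import numpy as np
from itertools import product

def exhaustive_code(k):
    """Return a k x (2**(k-1) - 1) matrix whose columns are all boolean class
    bipartitions with the first class fixed to 1, excluding the constant
    all-ones column. The columns are distinct and pairwise non-complementary,
    so the code satisfies column separation."""
    cols = [np.array((1,) + bits)
            for bits in product((0, 1), repeat=k - 1)
            if any(b == 0 for b in bits)]   # drop the constant column
    return np.column_stack(cols)

def repeat_columns(code, times):
    """Tile the columns of a code matrix, e.g. 7 -> 21 or 42 codeletters."""
    return np.tile(code, (1, times))

code4 = exhaustive_code(4)                 # 4 classes -> 7 bit functions, as for CLOUDS98
print(code4.shape)                         # (4, 7)
print(repeat_columns(code4, 3).shape)      # (4, 21)
```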
5.2 Tasks with few features
Decorrelating the classification errors of the bit functions through feature selection seems reasonable for data sets with many features and some form of redundancy (i.e., where some features in a selected set can be replaced by others without changing its predictive capability). In a data set with only two features, therefore, this method does not seem applicable. We tested this hypothesis on a synthetic data set with two features; Figure 2 depicts its class regions. 800 instances were drawn uniformly at random from [0, 1]^2. Surprisingly, IB1ecoc compares well with IB1 on this task: it has 0.041 bias and 0.007 variance, whereas IB1 has 0.045 bias and 0.009 variance. We also observed that IB1atomic behaves the same as IB1 on this task, as both features were selected, whereas IB1opc's error was significantly higher (0.043 bias and 0.041 variance) due to an increase in variance.

[Figure 2 appears here: the unit square [0, 1]^2 partitioned into class regions; class labels 1-7 appear in the plot.]

Fig. 2. A Synthetic Data Set
6 Conclusion
Extending the nearest neighbor classifier with error-correcting output codes (ECOCs) and bitwise feature selection can significantly reduce errors on some tasks. The error reduction is obtained because the reduction in the bias component outweighs the increase in variance. We also showed that, even in classification problems with few classes (m ≤ 5), extending the codeword length beyond the limit that ensures the column separation property can still reduce error rates by reducing variance. Extending a local learner with ECOCs may not be useful on data sets
where the local learner has a small bias and a high variance error. We showed that in such cases one can still benefit from ECOC encodings by replacing the local learner with one that has a smaller variance error.
7 Acknowledgements
We thank Leo Breiman for helpful suggestions. This research was supported in part by a grant from the Office of Naval Research.
References
[Aha and Bankert, 1997] D. W. Aha and R. L. Bankert. Cloud classification using error-correcting output codes. Artificial Intelligence Applications: Natural Science, Agriculture, and Environmental Science, 11:13-28, 1997.
[Aha, 1992] D. W. Aha. Tolerating noisy, irrelevant, and novel attributes in instance-based learning algorithms. International Journal of Man-Machine Studies, 36:267-287, 1992.
[Atkeson et al., 1997] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11:11-73, 1997.
[Bottou and Vapnik, 1992] L. Bottou and V. Vapnik. Local learning algorithms. Neural Computation, 4:888-900, 1992.
[Breiman et al., 1984] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.
[Breiman, 1996a] L. Breiman. Bias, variance, and arcing classifiers. Technical Report 460, University of California, Berkeley, April 1996.
[Breiman, 1996b] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[Dietterich and Bakiri, 1991] T. G. Dietterich and G. Bakiri. Error-correcting output codes: A general method for improving multiclass inductive learning programs. In Proceedings of the Ninth National Conference on Artificial Intelligence, pages 572-577, Anaheim, CA, 1991. AAAI Press.
[Dietterich and Bakiri, 1995] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.
[Friedman, 1996] J. H. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Technical report, Stanford University, August 1996.
[Geman et al., 1992] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58, 1992.
[Kohavi and Wolpert, 1996] R. Kohavi and D. H. Wolpert. Bias plus variance decomposition for zero-one loss functions. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 275-283, Bari, Italy, 1996. Morgan Kaufmann.
[Kong and Dietterich, 1995] E. B. Kong and T. G. Dietterich. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning, pages 313-321, Tahoe City, CA, 1995. Morgan Kaufmann.
[Maron and Moore, 1997] O. Maron and A. W. Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, pages 193-225, 1997.
[Quinlan, 1993] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[Ricci and Aha, 1997] F. Ricci and D. W. Aha. Extending local learners with error-correcting output codes. Technical Report AIC-97-001, Naval Research Laboratory, Navy Center for Applied Research in Artificial Intelligence, January 1997.
[Rumelhart and McClelland, 1986] D. E. Rumelhart and J. L. McClelland, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[Wettschereck and Dietterich, 1992] D. Wettschereck and T. G. Dietterich. Improving the performance of radial basis function networks by learning center locations. In J. Moody, S. Hanson, and R. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann, San Mateo, CA, 1992.