In: Proceedings of MEASUREMENT 97 May 29-31, 1997, Smolenice, Slovakia
Combining Neural Network Voting Classifiers and Error Correcting Output Codes

Friedrich Leisch and Kurt Hornik
[email protected] [email protected]
Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien
Wiedner Hauptstraße 8-10/1071, A-1040 Wien, Austria
Abstract
We show that error correcting output codes (ECOC) can further improve the effects of error dependent adaptive resampling methods such as arc-lh. In traditional one-in-n coding, the distance between two binary class labels is rather small, whereas ECOC are chosen to maximize this distance. We compare one-in-n and ECOC on a multiclass data set using standard MLPs and bagging and arcing voting committees.
1 Introduction

The most popular output coding for artificial neural network (ANN) classifiers is one-in-n coding, using one output unit per class. The target values are zero for all output units except for the one corresponding to the correct class, which has target value one. It can be shown that classifiers using one-in-n coding and minimizing the (quadratic) output error, such as multi-layer perceptrons (MLPs), are equivalent to the (optimal) Bayes classifier and that the outputs of the units represent the posterior probabilities of the classes given an input (e.g., Kanaya & Miyake, 1991). Knowledge of the exact posterior probabilities is not necessary for optimal classification, because the optimal decision is to choose the class with the largest posterior probability given an input. Only knowledge of the order of the probabilities is necessary. With one-in-n coding this order may change if only one output unit fires incorrectly, i.e., if only one bit of the output is flipped from zero to one. Dietterich & Bakiri (1995) have shown that the generalization performance of both C4.5 and MLP classifiers can be improved by using error correcting output codes, which can detect and correct wrong bits in the output.
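To make the coding scheme concrete, the following minimal sketch (in Python; not part of the original paper, with made-up outputs and an illustrative helper name `one_in_n`) shows one-in-n targets for a three-class problem and how a single incorrectly firing unit can change the argmax decision.

```python
import numpy as np

def one_in_n(label, n_classes):
    """Return the one-in-n target vector for a 0-based class label."""
    target = np.zeros(n_classes)
    target[label] = 1.0
    return target

# One-in-n targets for a three-class problem
targets = [one_in_n(k, 3) for k in range(3)]
print(targets)               # [1,0,0], [0,1,0], [0,0,1]

# The decision rule uses only the order of the outputs:
output = np.array([0.6, 0.3, 0.1])   # network clearly favours class 0
print(np.argmax(output))             # -> 0

# A single wrong unit firing (one flipped bit) changes the decision:
corrupted = output.copy()
corrupted[2] = 1.0
print(np.argmax(corrupted))          # -> 2
```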
2 Voting Classifiers

The training of ANNs is usually a stochastic and unstable process. As the weights of the network are initialized at random and training patterns are presented in random order, ANNs trained on the same data will typically be different in value and performance.
[Figure 1: Arcing classifiers. Training sets $X_N^1, \dots, X_N^k$ are resampled from $X_N$, ANN classifiers $g_1, \dots, g_k$ are trained on them, the resampling distribution is adapted, and the classifiers are combined into the voting classifier $g_k^v$.]
In addition, small changes in the training set can lead to two completely different trained networks with different performance even if the nets had the same initial weights. Roughly speaking, ANNs have a low bias because of their approximation capabilities, but a rather high variance because of this instability.

Suppose we had $k$ independent training sets $X_N^1, \dots, X_N^k$ and corresponding classifiers $g_i = g(\cdot\,|\,X_N^i)$, $i = 1, \dots, k$, trained using these sets, respectively. We can then combine these single classifiers into a joint voting classifier $g_k^v$ by assigning to each input $x$ the class the majority of the $g_i$ votes for. If the $g_i$ have low bias, then $g_k^v$ should have low bias, too.

Recently, several resample and combine techniques for improving ANN performance have been proposed. Breiman (1994, 1996) introduced a procedure called bagging ("bootstrap aggregating") for tree classifiers that can also be used for ANNs. The bagging algorithm starts with a training set $X_N$ of size $N$. Several bootstrap replicas $X_N^1, \dots, X_N^k$ are constructed and a neural network is trained on each. These networks are finally combined by majority voting. Note that bootstrap samples are independent samples from the empirical distribution of the training set, which is a consistent estimate of the true distribution.

Arcing ("adaptive resample and combine"), which is a more sophisticated version of bagging, was first introduced by Freund & Schapire (1995) and called boosting. The new training sets are not constructed by uniformly sampling from the empirical distribution of the training set $X_N$, but from a distribution over $X_N$ that includes information about previous training runs.

Leisch & Hornik (1997) introduced a simple new arcing method based on the idea that the "importance" of a pattern for the learning process can be measured by the aggregated output error of an ANN for the pattern over several training runs. Patterns that repeatedly have high output errors are somewhat harder to learn for the network and therefore their resampling probabilities are increased proportionally to the error. Let $X_N = \{x_1, \dots, x_N\}$ be the original training set of size $N$, let $p_n^i$ denote the probability that pattern $x_n$ is included in the $i$-th resampled training set $X_N^i$, and initialize with $p_n^1 = 1/N$. Our arcing algorithm, called arc-lh ("arcing by Leisch & Hornik")[1], works as follows (a code sketch follows step 3 below):

1. Construct a pattern set $X_N^i$ by sampling with replacement with probabilities $p_n^i$ from $X_N$ and train a classifier $g_i$ using set $X_N^i$.

[1] Breiman (1996) compared the original boosting algorithm by Freund & Schapire (1995) with other arcing algorithms and named it after its inventors arc-fs ("arcing by Freund & Schapire"). We used this naming scheme and called our algorithm arc-lh.
Bit   Class 1  Class 2  Class 3  Class 4  Class 5  Class 6  Class 7  Class 8  Class 9  Class 10
 1       1        0        1        0        1        0        1        0        1        0
 2       1        0        0        0        1        1        0        0        1        1
 3       0        1        0        1        1        0        1        0        0        1
 4       0        1        1        1        0        0        1        1        1        1
 5       0        1        0        0        1        1        1        1        0        0
 6       0        1        0        1        0        1        0        1        1        0
 7       1        0        0        1        1        0        0        1        1        0
 8       0        1        1        1        1        1        0        0        0        0
 9       1        0        1        0        0        1        0        1        0        1
10       0        1        1        0        0        1        1        0        1        0
11       0        1        1        0        1        0        0        1        0        1
12       1        0        0        0        0        0        1        1        0        0
13       1        0        1        1        0        0        0        0        0        0
14       0        1        0        0        0        0        0        0        1        1
15       1        0        1        1        1        1        1        1        1        1
Table 1: 15 bit error correcting BCH code for a 10 class problem (Lin & Costello, 1983; Kong & Dietterich, 1995). The minimum Hamming distance between two code vectors is 7.

2. Add the network output error of each pattern to the resampling probabilities:
   $$p_n^{i+1} = \frac{p_n^i + e_i(x_n)}{\sum_{n=1}^{N}\bigl(p_n^i + e_i(x_n)\bigr)}, \qquad e_i(x_n) = \lVert t(x_n) - g_i(x_n)\rVert,$$
   where $t(x_n)$ denotes the training target for pattern $x_n$.

3. Set $i := i + 1$ and repeat.
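The following Python fragment is a minimal sketch of the arc-lh loop above; it is not from the paper, and the `train_classifier` callable and its `.predict` interface are illustrative assumptions. With the error term omitted and the probabilities left uniform, the same loop yields an ordinary bagging committee.

```python
import numpy as np

def arc_lh(X, T, train_classifier, n_members=20, seed=0):
    """Sketch of the arc-lh resample-and-combine loop.

    X: array of shape (N, d) with input patterns.
    T: array of shape (N, c) with binary training targets
       (one-in-n vectors or ECOC codewords).
    train_classifier: callable(X_i, T_i) -> model with a .predict(X)
       method returning outputs of shape (len(X), c); assumed interface.
    """
    rng = np.random.default_rng(seed)
    N = len(X)
    p = np.full(N, 1.0 / N)                  # p_n^1 = 1/N
    committee = []
    for i in range(n_members):
        # Step 1: sample with replacement according to the current probabilities
        idx = rng.choice(N, size=N, replace=True, p=p)
        g_i = train_classifier(X[idx], T[idx])
        committee.append(g_i)
        # Step 2: add each pattern's output error and renormalize
        e_i = np.linalg.norm(T - g_i.predict(X), axis=1)   # e_i(x_n) = ||t(x_n) - g_i(x_n)||
        p = (p + e_i) / (p + e_i).sum()
        # Step 3: continue with the next committee member
    return committee
```

At prediction time the committee members are combined by majority voting, as described above.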
3 Error Correcting Output Codes

Dietterich & Bakiri (1995) have shown that ECOC can be used on a wide variety of multiclass learning problems and improve the performance of classifiers using binary encoded class labels, such as MLPs and C4.5. ECOC reduce both bias and variance of a classifier (Kong & Dietterich, 1995), whereas bagging and arcing reduce mostly the variance without increasing the bias. The Hamming distance between two one-in-n codes is always two (independent of the number of classes), whereas ECOC are chosen such that the minimum Hamming distance is as large as possible for the desired code length. E.g., the BCH code shown in Table 1 has a minimum Hamming distance of 7. Hence, the difference between training target and network output becomes a more robust measure of the misclassification probability of a pattern.
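As an illustration of the decoding step (not given explicitly in the paper, but assuming the usual nearest-codeword rule), the sketch below thresholds the 15 network outputs and assigns the class whose codeword from Table 1 is closest in Hamming distance. With a minimum distance of 7, up to three wrong output bits can be corrected; the hypothetical helper is named `ecoc_decode`.

```python
import numpy as np

# 15 bit BCH codewords from Table 1, one row per class (classes 1-10).
CODEWORDS = np.array([
    [1,1,0,0,0,0,1,0,1,0,0,1,1,0,1],   # class 1
    [0,0,1,1,1,1,0,1,0,1,1,0,0,1,0],   # class 2
    [1,0,0,1,0,0,0,1,1,1,1,0,1,0,1],   # class 3
    [0,0,1,1,0,1,1,1,0,0,0,0,1,0,1],   # class 4
    [1,1,1,0,1,0,1,1,0,0,1,0,0,0,1],   # class 5
    [0,1,0,0,1,1,0,1,1,1,0,0,0,0,1],   # class 6
    [1,0,1,1,1,0,0,0,0,1,0,1,0,0,1],   # class 7
    [0,0,0,1,1,1,1,0,1,0,1,1,0,0,1],   # class 8
    [1,1,0,1,0,1,1,0,0,1,0,0,0,1,1],   # class 9
    [0,1,1,1,0,0,0,0,1,0,1,0,0,1,1],   # class 10
])

def ecoc_decode(output, codewords=CODEWORDS):
    """Return the class (1-based) whose codeword is closest in
    Hamming distance to the thresholded network output."""
    bits = (np.asarray(output) > 0.5).astype(int)
    distances = np.sum(bits != codewords, axis=1)
    return int(np.argmin(distances)) + 1

# A clean output for class 1 and the same output with three bits flipped:
clean = CODEWORDS[0].astype(float)
noisy = clean.copy()
noisy[[2, 7, 14]] = 1 - noisy[[2, 7, 14]]
print(ecoc_decode(clean), ecoc_decode(noisy))   # both decode to class 1
```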
4 Simulations

We have tested arc-lh with ECOC on the well known letter-recognition data set from the UCI repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). The data set consists of 20000 different letter samples that have been converted into 16 primitive numerical attributes (statistical moments and edge counts). For computational reasons we restricted our simulations to the first ten letters of the alphabet ('A'-'J'), resulting in a data set of size 7468. We used 5000 samples for training and 2468 for testing the performance.
             one-in-n    ECOC
Single MLP    86.78%    87.12%
Bagging       91.05%    91.03%
arc-lh        91.04%    96.75%
Table 2: Simulation results: percent of correctly classified patterns.

We used multi-layer perceptrons with 16 inputs, 16 hidden units and 10 output units for the one-in-n encoding, and 16-16-15 MLPs for the 15 bit ECOC (Table 1). The number of hidden units was chosen to optimize the performance of the stand-alone MLP (by testing networks with 5-30 hidden units); further improvements of the bagging and arcing classifiers may be possible by choosing a different number of hidden units. The first row of Table 2 shows the performance of a standard MLP on the independent test set of size 2468. ECOC improves the performance only slightly in this example. The second row shows the result of a bagging classifier with 20 MLPs forming the voting committee. The bagging classifier performs significantly better than a standard MLP, but appears to be independent of the output coding in use. The arcing classifier can fully utilize the effects of ECOC: with one-in-n coding its performance is the same as with bagging, but with ECOC the performance increases by about 5 percentage points compared to bagging and about 10 percentage points compared to a standard MLP.
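For completeness, a hedged sketch of how a committee decision could be formed in this setting is given below (Python; not from the paper). It assumes that each member's 15 outputs are decoded individually with the nearest-codeword rule from the previous sketch and that the members are then combined by majority voting, as described in Section 2; `arc_lh`, `ecoc_decode`, and the `.predict` interface are the illustrative names introduced earlier.

```python
from collections import Counter

def committee_predict(committee, x, decode):
    """Majority vote over the ECOC-decoded predictions of all members.

    committee: list of trained models with a .predict method (see arc_lh sketch)
    x: a single input pattern as a 1-D array of 16 attributes
    decode: e.g. ecoc_decode from the previous sketch
    """
    votes = [decode(member.predict(x[None, :])[0]) for member in committee]
    return Counter(votes).most_common(1)[0][0]
```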
5 Discussion

One-in-n coding and ECOC have been compared on a multiclass benchmark problem, showing that ECOC can drastically improve the effectiveness of error dependent adaptive resampling methods such as arc-lh. ECOC has no effect on bagging in our example, which is not as surprising as it may seem. Dietterich & Bakiri (1995) mention the open problem of the relationship between the ECOC approach and ensemble methods. ECOC may be seen as a very "compact" voting committee, where a certain number of incorrect votes can be corrected. The improvement for error dependent arcing seems to result mostly from the larger distances between the target labels, which make the network output error more robust against small changes.
Acknowledgement

This piece of research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 ("Adaptive Information Systems and Modelling in Economics and Management Science").
References

Breiman, L. (1994). Bagging predictors. Tech. Rep. 421, Department of Statistics, University of California, Berkeley, California, USA.

Breiman, L. (1996). Bias, variance, and arcing classifiers. Tech. Rep. 460, Statistics Department, University of California, Berkeley, CA, USA.

Dietterich, T. G. & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263-286.

Freund, Y. & Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. Lecture Notes in Computer Science, 904.

Kanaya, F. & Miyake, S. (1991). Bayes statistical behavior and valid generalization of pattern classifying neural networks. IEEE Transactions on Neural Networks, 2(4), 471-475.

Kong, E. B. & Dietterich, T. G. (1995). Error-correcting output coding corrects bias and variance. In Machine Learning: Proceedings of the 12th International Conference, pp. 313-321. Morgan-Kaufmann.

Leisch, F. & Hornik, K. (1997). ARC-LH: A new adaptive resampling algorithm for improving ANN classifiers. In Mozer, M. C., Jordan, M. I., & Petsche, T. (eds.), Advances in Neural Information Processing Systems, vol. 9.

Lin, S. & Costello, D. J. (1983). Error Control Coding: Fundamentals and Applications. Englewood Cliffs, New Jersey, USA: Prentice-Hall.