Comparing Techniques for Multiclass Classification Using Binary SVM Predictors

Ana Carolina Lorena and André C. P. L. F. de Carvalho

Laboratório de Inteligência Computacional (LABIC), Instituto de Ciências Matemáticas e de Computação (ICMC), Universidade de São Paulo (USP), Av. do Trabalhador São-Carlense, 400 - Centro - Cx. Postal 668, São Carlos - São Paulo - Brasil
{aclorena,andre}@icmc.usp.br

Abstract. Multiclass classification using Machine Learning techniques consists of inducing a function f(x) from a training set composed of pairs (x_i, y_i), where y_i ∈ {1, 2, . . . , k}. Some learning methods are originally binary, being able to perform classifications only when k = 2. Among these one can mention Support Vector Machines. This paper presents a comparison of methods for multiclass classification using SVMs. The techniques investigated use strategies that divide the multiclass problem into binary subproblems and can be extended to other learning techniques. Results indicate that the use of Directed Acyclic Graphs is an efficient approach for generating multiclass SVM classifiers.

1 Introduction

Supervised learning consists of inducing a function f(x) from a given set of samples of the form (x_i, y_i) that accurately predicts the labels of unknown instances [10]. Applications where the labels y_i assume k values, with k > 2, are named multiclass problems. Some learning techniques, like Support Vector Machines (SVMs) [3], originally carry out binary classifications. To generalize such methods to multiclass problems, several strategies may be employed [1,4,8,14].

This paper presents a study of various approaches for multiclass classification with SVMs. Although the study is oriented toward SVMs, it can be applied to other binary classifiers, as all the strategies considered divide the problem into binary classification subproblems.

A first standard method for building k-class predictors from binary ones, named one-against-all (1AA), consists of building k classifiers, each distinguishing one class from the remaining classes [3]. The label of a new sample is usually given by the classifier that produces the highest output.

Another common extension to multiclass classification from binary predictions is known as all-against-all (AAA). In this case, given k classes, k(k − 1)/2 classifiers are constructed. Each of these classifiers distinguishes one class c_i from another class c_j, with i ≠ j.


A majority voting among the individual responses can then be employed to predict the class of a sample x [8]. The responses of the individual classifiers can also be combined by an Artificial Neural Network (ANN) [6], which weights the importance of each individual classifier in the final prediction. Another method to combine such predictors, suggested in [14], consists of building a Directed Acyclic Graph (DAG), where each node corresponds to one binary classifier. Results indicate that the use of such a structure can save computational time in the prediction phase.

On another front, Dietterich and Bakiri [4] suggested the use of error-correcting output codes (ECOC) for representing each class in the problem. Binary classifiers are trained to learn the "bits" in these codes. When a new pattern is submitted to this system, a code is obtained and compared to the error-correcting codewords through the Hamming distance. The new pattern is then assigned to the class whose error-correcting codeword presents the minimum Hamming distance to the code predicted by the individual classifiers. As SVMs are large margin classifiers [15], which aim at maximizing the distance between the patterns and the induced decision frontier, Allwein et al. [1] suggested using the margin of a pattern when computing its distance to the output codes (loss-based ECOC). This measure has the advantage of providing a notion of the reliability of the predictions made by the individual SVMs.

This paper is organized as follows: Section 2 presents the materials and methods employed in this work. It describes the datasets considered, as well as the learning techniques and multiclass strategies investigated. Section 3 presents the experiments conducted and the results achieved. Section 4 concludes this paper.

2 Materials and Methods

This section presents the materials and methods used in this work, describing the datasets, learning techniques and multiclass strategies employed.

2.1 Datasets

The datasets used in the experiments were extracted from the UCI benchmark repository [16]. Table 1 summarizes these datasets, showing the number of instances (#Instances), the number of continuous and nominal attributes (#Attributes), the number of classes (#Classes), the majority error (ME) and whether there are missing values (MV). ME represents the proportion of examples of the class with the most patterns in the dataset.

Instances with missing values were removed from the "bridges" and "pos-operative" datasets. This procedure left 70 and 87 instances in the respective datasets. For the "splice" dataset, instances with attributes different from the base pairs Adenine, Cytosine, Guanine and Thymine were eliminated; the other attribute values present reflect the uncertainty inherent to DNA sequencing processes. This left a total of 3175 instances.

Almost all datasets were pre-processed so that the data had zero mean and unit variance. The exceptions were "balance" and "splice".

Table 1. Datasets summary description

Dataset        #Instances   #Attributes (cont., nom.)   #Classes   ME      MV
Balance        625          4 (0, 4)                    3          46.1%   no
Bridges        108          11 (0, 11)                  6          32.9%   yes
Glass          214          9 (9, 0)                    6          35.5%   no
Iris           150          4 (4, 0)                    3          33.3%   no
Pos-operative  90           8 (0, 8)                    3          71.1%   yes
Splice         3190         60 (0, 60)                  3          50.0%   no
Vehicle        846          18 (18, 0)                  4          25.8%   no
Wine           178          12 (12, 0)                  3          48.0%   no
Zoo            90           17 (2, 15)                  7          41.1%   no

In "balance", many attributes became null after this procedure, so the pre-processing was not applied. In the "splice" case, a coding process suggested in the bioinformatics literature, which represents the attributes in a canonical format, was employed instead [13]. Thus, the number of attributes used in "splice" was 240. A minimal sketch of this preprocessing is given below.
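The following sketch is not taken from the paper: it illustrates standardization to zero mean and unit variance and a one-hot style coding of DNA bases that turns 60 nominal attributes into 60 × 4 = 240 numeric ones. Whether the canonical coding of [13] is exactly this one-hot scheme is an assumption here.

import numpy as np

def standardize(X):
    # Rescale each column to zero mean and unit variance
    mean, std = X.mean(axis=0), X.std(axis=0)
    std[std == 0] = 1.0              # avoid division by zero for constant columns
    return (X - mean) / std

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_sequence(seq):
    # Map a DNA string of length 60 to a 240-dimensional binary vector
    code = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        code[i, BASES[base]] = 1.0
    return code.ravel()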

2.2 Learning Techniques

The base learning technique employed in the experiments for the comparison of multiclass strategies was the Support Vector Machine (SVM) [3]. Inspired by the Statistical Learning Theory [17], this technique seeks a hyperplane w · x + b = 0 able to separate the data with maximal margin. For this task, it solves the following optimization problem:

Minimize: ||w||^2
Subject to: y_i(w · x_i + b) ≥ 1

where x_i ∈ R^m, y_i ∈ {−1, +1} and i = 1, . . . , n. This formulation assumes that all samples are at least the margin value away from the decision border, which means that the data have to be linearly separable. Since in real applications the linearity restriction often does not hold, slack variables are introduced [5]. These variables relax the restrictions imposed on the optimization problem, allowing some patterns to be within the margins. This is accomplished by the following optimization problem:

Minimize: ||w||^2 + C Σ_{i=1}^{n} ξ_i
Subject to: y_i(w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

where C is a constant that imposes a tradeoff between training error and generalization and the ξ_i are the slack variables. The decision frontier obtained is given by Equation 1:

f(x) = Σ_{x_i ∈ SV} y_i α_i x_i · x + b      (1)


where the constants α_i are called Lagrange multipliers and are determined in the optimization process. SV corresponds to the set of support vectors, patterns for which the associated Lagrange multipliers are larger than zero. These samples are those closest to the optimal hyperplane. For all other patterns the associated Lagrange multiplier is null, so they do not participate in the determination of the final hypothesis.

The classifier represented in Equation 1 is still restricted by the fact that it performs a linear separation of the data. This can be solved by mapping the data samples to a high-dimensional space, also named feature space, where they can be efficiently separated by a linear SVM. This mapping is performed with the use of Kernel functions, which give access to high-dimensional spaces without the need of knowing the mapping function explicitly, which is usually very complex. These functions compute dot products between any pair of patterns in the feature space. Thus, the only modification necessary to deal with non-linearity is to substitute every dot product between patterns by the Kernel product. A sketch of the resulting decision function is given below.

For combining the multiple binary SVMs generated in some of the experiments, Artificial Neural Networks (ANNs) of the Multilayer Perceptron type were considered. These structures are inspired by the structure and learning ability of the biological brain [6]. They are composed of one or more layers of artificial neurons, interconnected by weighted links. These weights encode the knowledge of the network. This weighting scheme may be a useful alternative to gauge the contribution of each binary SVM to the final multiclass prediction, as described in the following section.
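The decision function of Equation 1, with the dot product replaced by a Gaussian kernel, can be sketched as follows. This is only an illustrative sketch: the support vectors, labels, multipliers and bias are assumed to come from an already trained SVM (the paper uses SVMTorch II), and the parameterization exp(−||x − z||^2 / (2σ^2)) is one common kernel convention, not necessarily the exact one of that tool.

import numpy as np

def gaussian_kernel(x, z, sigma=5.0):
    # K(x, z) = exp(-||x - z||^2 / (2 * sigma^2)); sigma = 5 as in Section 3
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, sv_labels, alphas, b, sigma=5.0):
    # f(x) = sum over support vectors of y_i * alpha_i * K(x_i, x) + b  (Eq. 1)
    return sum(y * a * gaussian_kernel(sv, x, sigma)
               for sv, y, a in zip(support_vectors, sv_labels, alphas)) + b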

2.3 Multiclass Strategies

The most straightforward way to build a k-class predictor from binary classifiers is to generate k binary predictors, each responsible for distinguishing a class c_i from the remaining classes. The final prediction is given by the classifier with the highest output value [3]. This method is called one-against-all (1AA) and is illustrated in Equation 2, where i = 1, . . . , k and Φ represents the mapping function in non-linear SVMs:

f(x) = max_i (w_i · Φ(x) + b_i)      (2)
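A minimal sketch of the 1AA rule of Equation 2, assuming a list of k trained binary decision functions (for instance, built with the svm_decision sketch above), the i-th one trained to separate class i from all the others:

def predict_1aa(x, decision_functions):
    # Equation 2: return the index of the binary SVM with the highest output
    outputs = [f(x) for f in decision_functions]
    return max(range(len(outputs)), key=lambda i: outputs[i])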

Another standard methodology, called all-against-all (AAA), consists of building k(k − 1)/2 predictors, each differentiating a pair of classes c_i and c_j, with i ≠ j. For combining these classifiers, a majority voting scheme (VAAA) can be applied [8]: each AAA classifier gives one vote for its preferred class, and the final result is given by the class with the most votes. Platt et al. [14] point out some drawbacks in the previous strategies, the main one being a lack of theory in terms of generalization bounds. To overcome this, they developed a method to combine the SVMs generated in the AAA methodology based on the use of Directed Acyclic Graphs (DAGSVM). The authors provide bounds on the generalization error of this system in terms of the number of classes and the margin achieved by each SVM at the nodes.

276

A.C. Lorena and A.C.P.L.F. de Carvalho

A Directed Acyclic Graph (DAG) is a graph with oriented edges and no cycles. The DAGSVM approach places the SVMs generated in an AAA manner at the nodes of a DAG. Computing the prediction of a pattern with the DAGSVM is equivalent to operating on a list of classes. Starting from the root node, the sample is tested against the first and last classes of the problem, which usually correspond to the first and last elements of the initial list. The class with the lowest output in the node is then eliminated from the list, and the node equivalent to the new list obtained is consulted. This process proceeds until one unique class remains. Figure 1 illustrates an example with four classes. For k classes, only k − 1 SVMs are evaluated at test time, so this procedure speeds up the test phase. A sketch of this evaluation, together with the majority voting rule, is given after Figure 1.

Fig. 1. (a) DAGSVM of a problem with four classes; (b) illustration of the SVM generated for the 1 vs 4 subproblem [14]
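The two AAA combination rules discussed above can be sketched as follows. The dictionary pairwise[(i, j)] is assumed to hold the trained binary decision function for the pair (c_i, c_j), with a positive output favouring class i over class j; this sign convention is an assumption of the sketch, not something fixed by the paper.

def predict_vaaa(x, classes, pairwise):
    # Majority voting over all k(k-1)/2 pairwise SVMs
    votes = {c: 0 for c in classes}
    for (i, j), f in pairwise.items():
        votes[i if f(x) > 0 else j] += 1
    return max(votes, key=votes.get)

def predict_dag(x, classes, pairwise):
    # DAG evaluation as list elimination: only k - 1 SVMs are consulted
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        if pairwise[(first, last)](x) > 0:
            remaining.pop()        # "last" loses at this node and is eliminated
        else:
            remaining.pop(0)       # "first" loses at this node and is eliminated
    return remaining[0]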

This paper also investigates the use of ANNs for combining the AAA predictors. The ANN can be viewed as a technique to weight the predictions made by each classifier.

In an alternative multiclass strategy, Dietterich and Bakiri [4] proposed the use of a distributed output code to represent the k classes of the problem. Each class is assigned a codeword of length l. These codes are stored in a matrix M ∈ {−1, +1}^(k×l). The rows of this matrix represent the codewords of each class and the columns the desired outputs of the l binary classifiers. A new pattern x can be classified by evaluating the predictions of the l classifiers, which generates a string s of length l. This string is then compared to the rows of M, and the sample is assigned to the class whose row is closest according to some measure, like the Hamming distance [4]. Commonly, the codewords have more bits than needed to represent each class uniquely; the additional bits can be used to correct eventual classification errors. For this reason, this method is named error-correcting output coding (ECOC).
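A sketch of the Hamming-distance decoding just described, assuming M is the k × l codeword matrix with entries in {−1, +1} and predictions holds the l binary outputs (in {−1, +1}) produced for a new pattern:

import numpy as np

def ecoc_hamming_decode(predictions, M):
    # Hamming distance between the predicted string and each codeword (row of M)
    distances = (M != np.sign(predictions)).sum(axis=1)
    return int(np.argmin(distances))   # index of the closest class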


Allwein et al. [1] point out that the use of the Hamming distance ignores the loss function used in training, as well as the confidences attached to the predictions made by each classifier. The authors claim that, in the SVM case, using the margins obtained in the classification of the patterns to compute the distance measure can improve the performance achieved by ECOC, resulting in the loss-based ECOC method (LECOC). Given a problem with k classes, let M be the matrix of codewords of length l, r a label and f_i(x) the prediction made by the i-th classifier. The loss-based distance of a pattern x to a label r is given by Equation 3:

d_M(r, x) = Σ_{i=1}^{l} max{(1 − M(r, i) f_i(x)), 0}      (3)
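Equation 3 can be sketched in the same setting as the Hamming decoder above, now using the real-valued margins f_i(x) of the l SVMs instead of their signs:

import numpy as np

def lecoc_decode(margins, M):
    # d_M(r, x) = sum_i max(1 - M(r, i) * f_i(x), 0), one value per row r of M
    losses = np.maximum(1.0 - M * np.asarray(margins), 0.0).sum(axis=1)
    return int(np.argmin(losses))      # class with the smallest loss-based distance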

The next section presents the experiments conducted with each of the described strategies for multiclass classification.

3 Experiments

To obtain better estimates of the generalization performance of the multiclass methods investigated, the datasets described in Section 2.1 were first divided into training and test sets following the 10-fold cross-validation methodology [10]. According to this method, the dataset is divided into ten disjoint subsets of approximately equal size. In each train/test round, nine subsets are used for training and the remaining one is left for testing, giving a total of ten pairs of training and test sets. The error of a classifier on the whole dataset is then given by the average of the errors observed on each test partition.

For the ANNs, the training sets obtained were further subdivided into training and validation subsets, in a proportion of 75% and 25%, respectively. While the training set was used to determine the network weights, the validation set was employed to evaluate the generalization capacity of the ANN on new patterns during training. The network training was stopped when the validation error started to increase, a strategy commonly referred to as early stopping [6]. With this procedure, overfitting to the training data can be reduced. The validation set was also employed to determine the best network architecture: several networks with different architectures were generated for each problem, and the one with the lowest validation error was chosen as the final ANN classifier. The architectures tested were fully connected one-hidden-layer ANNs with 1, 5, 10, 15, 20, 25 and 30 neurons in the hidden layer. The standard back-propagation algorithm was employed for training, with a learning rate of 0.2, and the SNNS (Stuttgart Neural Network Simulator) [19] was used to generate the networks.

The software applied in the SVMs induction was the SVMTorch II tool [2]. In all experiments, a Gaussian Kernel with standard deviation equal to 5 was used and the parameter C was kept equal to 100, the default value of SVMTorch II. Although the best values of the SVM parameters may differ for each multiclass strategy, they were kept the same to allow a fair evaluation of the differences between the techniques considered. A sketch of this evaluation protocol is given below.
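The sketch below illustrates the protocol only: 10-fold cross-validation of a one-against-all SVM with a Gaussian kernel. scikit-learn is used here as a stand-in for SVMTorch II, the Iris data is only a placeholder, and gamma = 1/(2σ^2) assumes the kernel convention exp(−||x − z||^2/(2σ^2)); none of this is the paper's actual tooling.

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                    # placeholder dataset
sigma, C = 5.0, 100.0                                # parameters fixed as in the text
clf = make_pipeline(
    StandardScaler(),                                # zero mean, unit variance
    OneVsRestClassifier(SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=C)),
)
scores = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))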


The codewords used in the ECOC and LECOC strategies were obtained following a heuristic proposed in [4]. Given a problem with 3 ≤ k ≤ 7 classes, k codewords of length 2^(k−1) − 1 are constructed. The codeword of the first class is composed only of ones. The codeword of each other class c_i, with i > 1, is composed of alternating runs of 2^(k−i) zeros and ones. A sketch of this construction is given below.
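A sketch of this construction; the mapping of the 0/1 bits to {−1, +1} at the end is an assumption made here so that the codewords can be used directly with the decoders sketched in Section 2.3.

def exhaustive_codewords(k):
    # Codewords of length 2^(k-1) - 1: class 1 is all ones, class i > 1 is made
    # of alternating runs of 2^(k-i) zeros and ones, truncated to that length.
    length = 2 ** (k - 1) - 1
    rows = [[1] * length]
    for i in range(2, k + 1):
        run = 2 ** (k - i)
        pattern = [0] * run + [1] * run
        rows.append((pattern * (length // len(pattern) + 1))[:length])
    return [[2 * bit - 1 for bit in row] for row in rows]   # map {0, 1} -> {-1, +1}

# e.g. exhaustive_codewords(3) -> [[1, 1, 1], [-1, -1, 1], [-1, 1, -1]]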

In the following, Section 3.1 summarizes the results observed and Section 3.2 discusses the work conducted.

3.1 Results

Table 2 presents the accuracy (percentage of correct classifications) achieved by the multiclass strategies investigated.

Table 2. Multiclass strategies accuracies (%)

Dataset    1AA          VAAA         AAA-ANN      DAGSVM       ECOC         LECOC
Balance    96.5±3.1     97.9±2.0     99.2±0.8     98.4±1.8     90.7±4.7     96.5±3.1
Bridges    58.6±19.6    60.0±17.6    58.6±12.5    62.9±16.8    58.6±20.7    61.4±19.2
Glass      64.9±14.2    67.3±10.3    68.2±9.7     68.7±11.9    65.3±14.5    65.8±14.6
Iris       95.3±4.5     96.0±4.7     96.0±4.7     96.0±4.7     94.7±6.1     95.3±4.5
Pos op.    63.5±18.7    64.6±19.3    62.2±17.1    61.3±26.7    61.3±18.9    63.5±18.7
Splice     96.8±0.9     96.8±0.7     83.4±1.7     96.8±0.7     93.6±1.7     96.8±0.9
Vehicle    85.4±4.5     85.5±4.0     84.4±5.3     85.8±4.0     81.9±4.7     85.5±3.6
Wine       98.3±2.7     98.3±2.7     97.2±3.9     98.3±2.7     97.2±4.1     98.3±2.7
Zoo        95.6±5.7     94.4±5.9     94.4±5.9     94.4±5.9     95.6±5.7     95.6±5.7

Similarly to Table 2, Table 3 presents the mean time spent on training, in seconds. All experiments were carried out on a dual Pentium II 330 MHz machine with 128 MB of RAM. Table 4 shows the mean number of support vectors (#SVs) of the models. This value is related to the processing time required to classify a given pattern: a smaller number of SVs leads to faster predictions [9]. In the case of the AAA combination with ANNs, another aspect to be considered in terms of classification speed is the network architecture, since larger networks lead to slower predictions. Table 5 shows the number of hidden neurons of the best ANN architecture obtained for each dataset.

3.2 Discussion

According to Table 2, the accuracy rates achieved by the different multiclass techniques are not very different. Applying the corrected resampled t-test statistic described in [12] (sketched below) to the first and second best results in each dataset, no statistically significant difference can be detected at the 95% confidence level. Nevertheless, the results suggest that the most successful strategy is the DAGSVM.
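For reference, the corrected resampled t-test is commonly formulated as follows: the naive variance of the per-fold accuracy differences is inflated by a factor (1/J + n_test/n_train) to account for the overlap between training sets. The sketch below assumes this is the variant meant by [12]; consult that reference for the exact statistic.

import numpy as np
from scipy import stats

def corrected_resampled_t_test(diffs, n_train, n_test):
    # diffs: per-fold differences in accuracy between two classifiers (J = 10 here)
    diffs = np.asarray(diffs, dtype=float)
    J = len(diffs)
    corrected_var = (1.0 / J + n_test / n_train) * diffs.var(ddof=1)
    t = diffs.mean() / np.sqrt(corrected_var)
    p = 2 * stats.t.sf(abs(t), df=J - 1)   # two-sided p-value, J - 1 degrees of freedom
    return t, p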


Table 3. Training time (seconds)

Dataset   1AA          VAAA         AAA-ANN      DAGSVM       ECOC          LECOC
Balance   5.7±1.1      4.2±1.1      33.7±1.0     4.2±1.1      4.3±0.5       4.3±0.5
Bridges   11.3±1.1     43.4±1.9     50.3±10.8    43.4±1.9     64.3±3.4      64.3±3.4
Glass     10.2±0.8     38.7±2.0     42.1±3.2     38.7±2.0     72.5±2.0      72.5±2.0
Iris      1.7±1.2      1.9±1.0      11.7±7.0     1.9±1.0      1.8±1.2       1.8±1.2
Pos op.   3.6±0.8      4.4±0.7      7.3±1.2      4.4±0.7      3.7±0.7       3.7±0.7
Splice    476.7±17.1   205.9±1.2    388.2±4.2    205.9±1.2    497.4±41.5    497.4±41.5
Vehicle   21.9±0.9     17.4±1.1     96.7±1.3     17.4±1.1     45.2±1.3      45.2±1.3
Wine      4.9±0.6      4.7±0.7      13.5±19.0    4.7±0.7      5.0±0.0       5.0±0.0
Zoo       12.3±1.2     36.8±3.5     50.3±2.1     36.8±1.2     108.6±1.9     108.6±1.9

Table 4. Mean number of Support Vectors (#SVs)

Dataset   1AA           VAAA          AAA-ANN       DAGSVM       ECOC           LECOC
Balance   208.2±10.7    115.3±6.5     115.3±6.5     69.8±3.5     208.2±10.7     208.2±10.7
Bridges   129.6±4.1     175.3±3.4     175.3±3.4     59.3±2.7     879.4±29.7     879.4±29.7
Glass     275.5±7.5     252.2±6.2     252.2±6.2     100.6±5.3    2160.4±67.9    2160.4±67.9
Iris      40.8±2.7      24.5±1.4      24.5±1.4      16.7±1.3     40.7±2.8       40.7±2.8
Pos op.   106.8±7.6     61.0±6.1      61.0±6.1      54.2±4.9     107.3±7.5      107.3±7.5
Splice    5043.0±16.4   3625.2±10.3   3625.2±10.3   2577.1±9.0   5043.0±16.4    5043.0±16.4
Vehicle   669.0±11.1    469.2±5.0     469.2±5.0     232.1±9.7    1339.5±22.7    1339.5±22.7
Wine      75.2±1.5      54.3±3.4      54.3±3.4      35.3±3.2     75.2±1.5       75.2±1.5
Zoo       132.9±4.5     191.2±6.8     191.2±6.8     62.0±4.8     1608.4±65.8    1608.4±65.8

Table 5. Number of hidden neurons in the AAA-ANN architectures

Dataset          Balance   Bridges   Glass   Iris   Pos op.   Splice   Vehicle   Wine   Zoo
Hidden neurons   1         5         5       10     1         1        5         30     5

On the other hand, the ECOC method presents, in general, the lowest performance. It must be observed that the simple modification of this technique through a distance measure based on margins (LECOC) improves its results substantially. In a comparison between the methods with the best and worst accuracy in each dataset, a statistically significant difference at the 95% confidence level can be verified for the following datasets: "balance", "splice" and "vehicle". Comparing the three methods for AAA combination, no statistically significant difference in accuracy can be verified at the 95% confidence level, except on the "splice" dataset, where the ANN integration performed worst. It should be noticed that the ANN approach shows a tendency in some datasets towards lowering the standard deviation of the accuracies obtained, indicating some gain in stability. Concerning training time, in general the fastest methodology was 1AA. The ECOC and LECOC approaches, on the other hand, were generally the slowest in this phase.


The lower training time achieved by VAAA and DAGSVM in some datasets is due to the fact that these methods train each SVM on smaller subsets of the data, which speeds up training. In the AAA-ANN case, the ANN training time also has to be taken into account, which results in a larger total time than those of VAAA or DAGSVM. From Table 4 it can be observed that the DAGSVM method had the lowest number of SVs in all cases. This means that the DAG strategy speeds up the classification of new samples. VAAA figures as the method with the second lowest number of SVs; again, the smaller data subsets used to induce the binary classifiers in this case can explain this result. It should be noticed that, although AAA-ANN has the same number of SVs as VAAA, the ANN prediction stage also has to be considered.

4 Conclusion

This work evaluated several techniques for multiclass classification with SVMs, which are originally binary predictors. There are also works that generalize SVMs to the multiclass case directly [7,18]. However, the focus of this work was on the use of SVMs as binary predictors, so that the methods presented can be extended to other Machine Learning techniques. Although some differences were observed among the methods in terms of performance, in general no single technique can be considered the best suited for every application. When the main requirement is classification speed while maintaining good accuracy, the results observed indicate that an efficient alternative for SVMs is the use of the DAG approach. As future research, further experiments should be conducted to tune the parameters of the SVMs (the Gaussian Kernel standard deviation and the value of C). This could improve the results obtained in each dataset.

Acknowledgements. The authors would like to thank the Brazilian research councils FAPESP and CNPq for their financial support.

References

1. Allwein, E. L., Schapire, R. E., Singer, Y.: Reducing Multiclass to Binary: a Unifying Approach for Margin Classifiers. Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann (2000) 9–16
2. Collobert, R., Bengio, S.: SVMTorch: Support Vector Machines for Large Scale Regression Problems. Journal of Machine Learning Research, Vol. 1 (2001) 143–160
3. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press (2000)
4. Dietterich, T. G., Bakiri, G.: Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research, Vol. 2 (1995) 263–286
5. Cortes, C., Vapnik, V. N.: Support Vector Networks. Machine Learning, Vol. 20 (1995) 273–296


6. Haykin, S.: Neural Networks - A Comprehensive Foundation. Prentice-Hall, New Jersey (1999)
7. Hsu, C.-W., Lin, C.-J.: A Comparison of Methods for Multi-class Support Vector Machines. IEEE Transactions on Neural Networks, Vol. 13 (2002) 415–425
8. Kreßel, U.: Pairwise Classification and Support Vector Machines. In: Schölkopf, B., Burges, C. J. C., Smola, A. J. (eds.), Advances in Kernel Methods - Support Vector Learning, MIT Press (1999) 185–208
9. Mayoraz, E., Alpaydin, E.: Support Vector Machines for Multi-class Classification. Technical Report IDIAP-RR-98-06, Dalle Molle Institute for Perceptual Artificial Intelligence, Martigny, Switzerland (1998)
10. Mitchell, T.: Machine Learning. McGraw Hill (1997)
11. Müller, K. R. et al.: An Introduction to Kernel-based Learning Algorithms. IEEE Transactions on Neural Networks, Vol. 12, N. 2 (2001) 181–201
12. Nadeau, C., Bengio, Y.: Inference for the Generalization Error. Machine Learning, Vol. 52, N. 3 (2003) 239–281
13. Pedersen, A. G., Nielsen, H.: Neural Network Prediction of Translation Initiation Sites in Eukaryotes: Perspectives for EST and Genome Analysis. Proceedings of ISMB'97 (1997) 226–233
14. Platt, J. C., Cristianini, N., Shawe-Taylor, J.: Large Margin DAGs for Multiclass Classification. In: Solla, S. A., Leen, T. K., Müller, K.-R. (eds.), Advances in Neural Information Processing Systems, Vol. 12. MIT Press (2000) 547–553
15. Smola, A. J. et al.: Introduction to Large Margin Classifiers. In: Advances in Large Margin Classifiers, Chapter 1, MIT Press (1999) 1–28
16. University of California Irvine: UCI benchmark repository - a collection of artificial and real-world data sets. http://www.ics.uci.edu/~mlearn
17. Vapnik, V. N.: The Nature of Statistical Learning Theory. Springer-Verlag (1995)
18. Weston, J., Watkins, C.: Multi-class Support Vector Machines. Technical Report CSD-TR-98-04, Department of Computer Science, University of London (1998)
19. Zell, A. et al.: SNNS - Stuttgart Neural Network Simulator. Technical Report 6/95, Institute for Parallel and Distributed High Performance Systems (IPVR), University of Stuttgart (1995)
