On a Unified Framework for Sampling With and Without Replacement in Decision Tree Ensembles

J.M. Martínez-Otzeta, B. Sierra, E. Lazkano, and E. Jauregi

Department of Computer Science and Artificial Intelligence
University of the Basque Country
P. Manuel Lardizabal 1, 20018 Donostia-San Sebastián
Basque Country, Spain
[email protected]
http://www.sc.ehu.es/ccwrobot

Abstract. Classifier ensembles are an active area of research within the machine learning community. One of the most successful techniques is bagging, where an algorithm (typically a decision tree inducer) is applied over several different training sets, obtained by applying sampling with replacement to the original database. In this paper we define a framework in which sampling with and without replacement can be viewed as the extreme cases of a more general process, and analyze the performance of the extension of bagging to this framework.

Keywords: Ensemble Methods, Decision Trees.

1 Introduction

One of the most active areas of research in the machine learning community is the study of classifier ensembles. Combining the predictions of a set of component classifiers has been shown to yield accuracy higher than that of the most accurate component on a wide variety of supervised classification problems. To achieve this goal, various strategies for combining these classifiers are possible [Xu et al., 1992] [Lu, 1996] [Dietterich, 1997] [Bauer and Kohavi, 1999] [Sierra et al., 2001]. Good introductions to the area can be found in [Gama, 2000] and [Gunes et al., 2003]. For a comprehensive work on the issue see [Kuncheva, 2004].

The combination, mixture, or ensemble of classification models can be performed mainly by means of two approaches:

– Concurrent execution of some paradigms with a posterior combination of the individual decision each model has given for the case to classify [Wolpert, 1992]; the combination can be done by voting or by means of more complex approaches [Ho et al., 1994].
– Hybrid approaches, in which two or more different classification systems are implemented together in one classifier [Kohavi, 1996].


When implementing a model belonging to the first approach, a necessary condition is that the ensemble classifiers are diverse. One way to achieve this consists of using several base classifiers, applying them to the database, and then combining their predictions into a single one. But even with a unique base classifier it is still possible to build an ensemble, by applying it to different training sets in order to generate several different models.

One way to obtain several training sets from a given dataset is bootstrap sampling, in which sampling with replacement is performed, obtaining samples with the same cardinality as the original dataset but with a different composition (some instances from the original set may be missing, while others may appear more than once). This is the method that bagging [Breiman, 1996] uses to obtain several training databases from a unique dataset.

In this work we present a sampling method under which sampling with and without replacement appear as the two extreme cases of a more general continuous process. It is then possible to analyze the performance of bagging, or of any other algorithm that makes use of sampling with or without replacement, over the continuum that spans between the two extremes.

Typically, the base classifier in a given implementation of bagging is a decision tree, due to the fact that small changes in the training data tend to lead to proportionally big changes in the built tree. A decision tree consists of nodes and branches that partition a set of samples into a set of covering decision rules. In each node a single test or decision is made to obtain a partition, and the main task is to select the attribute that makes the best partition between the classes of the samples in the training set. In our experiments we use the well-known decision tree induction algorithm C4.5 [Quinlan, 1993].

The rest of the paper is organized as follows. Section 2 presents the proposed framework, with a brief description of the bagging algorithm. Section 3 describes the experimental setup in which the experiments were carried out. The obtained results are shown in Section 4, and Section 5 is devoted to conclusions and further work.

2 Unified Framework

In this section we define a sampling process of which sampling with replacement and sampling without replacement are the two extreme cases, with a continuous range of intermediate possibilities between them. To define the general case, let us first take a glance at sampling with and without replacement:

– Sampling with replacement: an instance is sampled according to some probability function, it is recorded, and then returned to the original database.
– Sampling without replacement: an instance is sampled according to some probability function, it is recorded, and then discarded.


Let us now define the following sampling method:

– Sampling with probability p of replacement: an instance is sampled according to some probability function, it is recorded, and then, with probability p, returned to the original database.

It is clear that this last definition is more general than the other two, and includes them. The differences among the three processes are depicted in Figure 1.

[Figure 1: three diagrams of the extract/return cycle between Dataset and Sample, for (a) sampling with replacement, (b) sampling without replacement, and (c) generalized sampling, where the extracted instance is returned with probability p.]

Fig. 1. Graphical representation of the three different sampling methods

As we have already noted, sampling with replacement is one extreme case of the above definition: p is 1, so every sampled instance is returned to the database. Sampling without replacement is the opposite case, where p is 0, so a sampled instance is discarded and never returns to the database.

Some questions arise: is it possible to apply this method to some problem where sampling with replacement is used to overcome the limitations of sampling without replacement? Will the best results be obtained when p is 0 or 1, or perhaps at some intermediate point? We have tested this generalization on one well-known algorithm: bagging, where sampling with replacement is performed.
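As a minimal sketch of the generalized sampling step (ours, not the paper's), assuming a uniform probability function over the remaining pool and a sample of the same cardinality as the database, as bagging uses:

    import random

    def generalized_sample(data, p, rng=random):
        """Draw len(data) instances; each drawn instance is returned to
        the pool with probability p. With p = 1 this is ordinary bootstrap
        sampling (with replacement); with p = 0 it is sampling without
        replacement, which for a full-size sample reproduces the original
        dataset (in a different order)."""
        pool = list(data)                  # working copy of the database
        sample = []
        for _ in range(len(data)):
            i = rng.randrange(len(pool))   # uniform choice from the pool
            sample.append(pool[i])
            if rng.random() >= p:          # with probability 1 - p ...
                pool.pop(i)                # ... the instance is discarded
        return sample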


2.1 Bagging

Bagging (bootstrap aggregating) is a method that aggregates the predictions of several classifiers by means of voting. These classifiers are built from several training sets obtained from a unique database through sampling with replacement. Leo Breiman described this technique in the early 90's, and it has been widely used to improve the results of single classifiers, especially decision trees.

Each individual model is created from an instance set with the same number of elements as the original one, but obtained through sampling with replacement. Therefore, if there are m elements in the original database, every element has a probability 1 − (1 − 1/m)^m of being selected at least once in the m samplings performed. The limit of this expression for large values of m is 1 − 1/e ≈ 0.632. Therefore, on average, only 63.2% of the original cases will be present in the new set, some of them appearing several times.

Bagging Algorithm

– Initialize parameters
  • Initialize the set of classifiers D = ∅
  • Initialize N, the number of classifiers
– For n = 1, ..., N:
  • Extract a sample Bn through sampling with replacement from the original database
  • Build a classifier Dn taking Bn as training set
  • Add the classifier obtained in the previous step to the set of classifiers: D = D ∪ {Dn}
– Return D

It is straightforward to apply the previously introduced approach to bagging. The only modification in the algorithm consists in replacing the standard sampling procedure by the generalization described above, as sketched below.
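A compact sketch of bagging with the generalized sampling plugged in, reusing the generalized_sample helper from the previous sketch; build_classifier stands for any inducer (e.g., C4.5), and both names are our assumptions rather than the paper's:

    from collections import Counter

    def bag_train(build_classifier, data, n_classifiers, p=1.0):
        """Train an ensemble; p = 1.0 recovers standard bagging."""
        ensemble = []
        for _ in range(n_classifiers):
            sample = generalized_sample(data, p)   # generalized resampling
            ensemble.append(build_classifier(sample))
        return ensemble

    def bag_predict(ensemble, x):
        """Combine the component predictions by majority vote."""
        votes = Counter(classifier(x) for classifier in ensemble)
        return votes.most_common(1)[0][0]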

3 Experimental Setup

In order to evaluate the performance of the proposed sampling procedure, we have carried out an experiment over a large number of databases from the well-known UCI repository [Newman et al., 1998]. To do so, we selected all the databases of medium size (between 100 and 1000 instances) among those converted to the MLC++ [Kohavi et al., 1997] format and located in this public repository: [http://www.sgi.com/tech/mlc/db/]. This amounts to 59 databases.

Table 1. Characteristics of the 41 databases used in this experiment

Database         #Instances  #Attributes  #Classes
Anneal           798         38           6
Audiology        226         69           24
Australian       690         14           2
Auto             205         26           7
Balance-Scale    625         4            3
Banding          238         29           2
Breast           699         10           2
Breast-cancer    286         9            2
Cars             392         8            3
Cars1            392         7            3
Cleve            303         14           2
Corral           129         6            2
Crx              690         15           2
Diabetes         768         8            2
Echocardiogram   132         7            2
German           1000        20           2
Glass            214         9            7
Glass2           163         9            2
Hayes-Roth       162         4            3
Heart            270         13           2
Hepatitis        155         19           2
Horse-colic      368         28           2
Hungarian        294         13           2
Ionosphere       351         34           2
Iris             150         4            3
Liver-disorder   345         6            2
Lymphography     148         19           4
Monk1            432         6            2
Monk2            432         6            2
Monk3            432         6            2
Pima             768         8            2
Primary-org      339         17           22
Solar            323         11           6
Sonar            208         59           2
ThreeOf9         512         9            2
Tic-tac-toe      958         9            2
Tokyo1           963         44           2
Vehicle          846         18           4
Vote             435         16           2
Wine             178         13           3
Zoo              101         16           7

From these, we selected one database from each family of problems; for example, we chose monk1 and not monk1-cross, monk1-full, or monk1-org. After this final selection we were left with the 41 databases shown in Table 1.


begin Generalized sampling testing
    Input: 41 databases from UCI repository
    For every database in the input
        For every p in the range 0..1 in steps of 0.00625
            For every fold in a 10-fold cross-validation
                Construct 10 training sets sampling according to parameter p
                Induce models from those sets
                Present the test set to every model
                Make a voting
                Return the ensemble prediction and accuracy
            end For
        end For
    end For
end Generalized sampling testing

Fig. 2. Description of the testing process of the generalized sampling algorithm

The sampling generalization described in the previous section makes use of a continuous parameter p. In the experiments carried out we have tested the performance of every value of p between 0 and 1 in steps of width 0.00625, which amounts to a total of 161 discrete values. For every value of p a 10-fold cross-validation has been carried out. Figure 2 depicts the algorithm used for the evaluation; a code sketch of it follows.
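A sketch of this evaluation loop under stated assumptions: scikit-learn's CART trees stand in for C4.5 (the paper's inducer), and the function and helper names are ours:

    import random
    from collections import Counter
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import StratifiedKFold

    def generalized_indices(m, p, rng):
        """Index version of the generalized sampling step over m instances."""
        pool, out = list(range(m)), []
        for _ in range(m):
            j = rng.randrange(len(pool))
            out.append(pool[j])
            if rng.random() >= p:          # discarded with probability 1 - p
                pool.pop(j)
        return out

    def cv_accuracy(X, y, p, n_models=10, seed=0):
        """Mean 10-fold CV accuracy of a voting ensemble of n_models trees,
        each trained on a sample drawn with replacement probability p."""
        rng = random.Random(seed)
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        accuracies = []
        for train, test in skf.split(X, y):
            models = []
            for _ in range(n_models):
                idx = train[generalized_indices(len(train), p, rng)]
                models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
            predictions = np.array([m.predict(X[test]) for m in models])
            votes = np.array([Counter(col).most_common(1)[0][0]
                              for col in predictions.T])
            accuracies.append(np.mean(votes == y[test]))
        return float(np.mean(accuracies))

    # Sweep p over 0..1 in steps of 0.00625 (161 values), as in the paper:
    # scores = {p: cv_accuracy(X, y, p)
    #           for p in np.arange(0.0, 1.0 + 1e-9, 0.00625)}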

4 Experimental Results

In this section we present the experimental results obtained from an experiment following the methodology described in the previous section. We were interested in analyzing the performance of the modification of bagging when sampling with replacement is replaced by our approach, and in finding for which values of p the better results are obtained.

To analyze the data from all the databases together, we have normalized the performances, taking as unit the case where p is zero. This case is equivalent to sampling without replacement, so every set obtained in this way from a given dataset is equivalent to the original one; it therefore corresponds to applying the base classifier (in our case C4.5) without any modification at all, and it will be our reference when comparing performances for other values of p. For example, if over dataset A we obtain an accuracy of 50% with p = 0 and of 53% with p = 0.6, the normalized values would be 1 and 1.06 respectively; in other words, the accuracy in the second case is six percent higher than in the first. This normalization permits us to analyze which values of p yield better accuracy with respect to the base classifier.

Standard bagging is the case where p takes the value 1. The databases obtained in this way are diverse, and this diversity is one of the causes of the expected improvement in performance.
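As a concrete illustration of the normalization above (with the hypothetical accuracy values from the example), it is simply a division by the p = 0 baseline:

    def normalize(accuracy_by_p):
        """Normalize accuracies so the p = 0 case (plain C4.5) equals 1."""
        baseline = accuracy_by_p[0.0]
        return {p: acc / baseline for p, acc in accuracy_by_p.items()}

    # e.g. {0.0: 0.50, 0.6: 0.53}  ->  {0.0: 1.0, 0.6: 1.06}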


But is this gain in performance uniform over the whole interval? Our results show that it is not: beyond p = 0.5 there are no noticeable gains, with the most important shifts occurring around p = 0.15 and p = 0.25. This means that the small diversity between samplings could lead to results similar to those of the big diversity that bagging produces.

After normalization, each database has an associated performance (typically between 0.9 and 1.1) for every value of p; this performance is the result of the 10-fold cross-validation, as explained in the previous section. After applying linear regression, we obtained the results shown below.

Fig. 3. Regression line for values of p between 0 and 0.25: y = 0.0763x + 1.0068

In every figure, the X axis is p, while the Y axis is the normalized performance. Figures 3, 4, 5 and 6 show the results in the intervals (0, 0.25), (0.25, 0.5), (0.5, 0.75) and (0.75, 1), respectively. In every interval the normalization has been carried out with respect to the lower limit of the interval; this has been done to make the gains within that interval clear.

Observing the slope of the regression line, we note that the biggest gains are in the first interval. In the second there are some gains, but not nearly of the same magnitude as in the first. In the two last intervals the gains are very small, if any. Let us also note that this means the cloud of points in Figure 3 is skewed towards bigger performance values than the clouds depicted in Figures 4, 5 and 6. The extreme lower values in Figure 3 are around 0.95, while in the other intervals some values appear below that limit; this means the chances of an important drop in performance are much smaller than the opposite. With respect to Figure 4, where some perceptible gains are still achieved, performances below 0.92 are extremely rare, while in Figure 5 a higher number of them appear. In Figure 6, apart from a single extreme case below 0.90, performances below 0.92 are very rare too. More detailed analysis is needed to distinguish true patterns in these data from statistical fluctuations. From these data it looks as if, with little diversity, it is possible to achieve the same results as with bagging.

Figure 7 shows the result of a sixth-degree polynomial regression; it shows that values close to the maximum occur around p = 0.15. The per-interval fits can be reproduced along the lines of the following sketch.
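A sketch of the per-interval regression analysis; the function name and the use of numpy's polyfit are our assumptions, since the paper does not specify its fitting tool:

    import numpy as np

    def interval_fit(p_values, scores, lo, hi, degree=1):
        """Least-squares polynomial fit of normalized performance against p,
        restricted to lo <= p <= hi, after re-normalizing the scores to the
        value at the lower limit of the interval."""
        mask = (p_values >= lo) & (p_values <= hi)
        x, y = p_values[mask], scores[mask]
        y = y / y[np.argmin(x)]            # unit = performance at lower limit
        return np.polyfit(x, y, degree)    # coefficients, highest degree first

    # slope, intercept = interval_fit(p, perf, 0.00, 0.25)         # cf. Fig. 3
    # coefficients = interval_fit(p, perf, 0.00, 0.25, degree=6)   # cf. Fig. 7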


Fig. 4. Regression line for values of p between 0.25 and 0.50: y = 0.0141x + 0.9951

Fig. 5. Regression line for values of p between 0.50 and 0.75: y = 0.0008x + 0.9969

Fig. 6. Regression line for values of p between 0.75 and 1: y = −0.0004x + 1.0017

Fig. 7. Sixth-degree polynomial regression for values of p between 0 and 0.25


5 Conclusions and Further Work

In this paper we have defined a generalization of sampling that includes sampling with and without replacement as extreme cases. This sampling has been applied to the bagging algorithm in order to analyze its behavior. The results suggest that ensembles with less diversity than those obtained by applying bagging could achieve similar performance.

The analysis carried out in the previous sections has been made over the accumulated data of all 41 databases, so another line of research could consist of a detailed analysis of performance over each given database. In this way it could be possible to characterize databases for which improvements in the interval (0, 0.25) are most noticeable and, on the other hand, databases for which improvements are achieved in intervals other than (0, 0.25). Let us note that the above results have been obtained putting together the 41 databases, so it is expected that some databases will behave differently from the main trend; in some cases they will go against the main behavior, and in others their results will follow the same line, but much more markedly.

As further work, a deeper analysis of the interval (0, 0.25), where the most dramatic changes occur, would be of interest. A study of the values of similarity measures when applied over the ensembles obtained with different values of p would be desirable too, along with theoretical work.

Acknowledgments

This work has been supported by the Ministerio de Ciencia y Tecnología under grant TSI2005-00390 and by the Gipuzkoako Foru Aldundia under grant OF-838/2004.

References

[Bauer and Kohavi, 1999] Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 36(1-2):105–142.
[Breiman, 1996] Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.
[Dietterich, 1997] Dietterich, T. G. (1997). Machine learning research: four current directions. AI Magazine, 18(4):97–136.
[Gama, 2000] Gama, J. (2000). Combining Classification Algorithms. PhD thesis, University of Porto.
[Gunes et al., 2003] Gunes, V., Ménard, M., and Loonis, P. (2003). Combination, cooperation and selection of classifiers: A state of the art. International Journal of Pattern Recognition, 17(8):1303–1324.
[Ho et al., 1994] Ho, T. K., Hull, J. J., and Srihari, S. N. (1994). Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16:66–75.
[Kohavi, 1996] Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.

[Kohavi et al., 1997] Kohavi, R., Sommerfield, D., and Dougherty, J. (1997). Data mining using MLC++, a machine learning library in C++. International Journal of Artificial Intelligence Tools, 6(4):537–566.
[Kuncheva, 2004] Kuncheva, L. I. (2004). Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, Hoboken, New Jersey.
[Lu, 1996] Lu, Y. (1996). Knowledge integration in a multiple classifier system. Applied Intelligence, 6(2):75–86.
[Newman et al., 1998] Newman, D., Hettich, S., Blake, C., and Merz, C. (1998). UCI repository of machine learning databases.
[Quinlan, 1993] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
[Sierra et al., 2001] Sierra, B., Serrano, N., Larrañaga, P., Plasencia, E. J., Inza, I., Jiménez, J. J., Revuelta, P., and Mora, M. L. (2001). Using Bayesian networks in the construction of a bi-level multi-classifier. Artificial Intelligence in Medicine, 22:233–248.
[Wolpert, 1992] Wolpert, D. (1992). Stacked generalization. Neural Networks, 5:241–259.
[Xu et al., 1992] Xu, L., Krzyzak, A., and Suen, C. Y. (1992). Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man, and Cybernetics, 22:418–435.
