A new algorithm to build consolidated trees: study of the error rate and steadiness

J.M. Pérez, J. Muguerza, O. Arbelaitz, and I. Gurrutxaga

Computer Science Faculty, University of the Basque Country, P.O. Box 649, 20080 Donostia, Spain

Abstract. This paper presents a new methodology for building decision trees, the Consolidated Trees Construction algorithm, that improves on the behavior of C4.5. It reduces both the error and the complexity of the induced trees, the differences in complexity being statistically significant. The advantage of this methodology with respect to other techniques such as bagging, boosting, etc. is that the final classifier is a single tree rather than a set of trees, so the explanatory capacity of the classification is not lost. The experimentation has been done with several databases from the UCI Repository and a real customer fidelization application from a company of electrical appliances.

1 Introduction

In many pattern recognition problems, the explanation of the classification that has been made becomes as important as the discriminating capacity of the classifier. Diagnosis in medicine, fraud detection in different fields, customer fidelization, resource assignment, etc. are examples of this kind of application. Decision trees [11] are among the classification techniques able to give an explanation for the classification they make, but they have a problem: they are too sensitive to the sample used in the induction process. This feature of decision trees is called unsteadiness or instability [6,4]. To address this problem, several algorithms based on decision trees, such as bagging [3,1], boosting [8], the Random Subspace Method [10] and different variants of them [4], with greater discriminating capacity and steadiness, have been developed. All the algorithms mentioned above use resampling techniques to increase the discriminating capacity of the global system, but in all of them the final classifier is a combination of different trees with different structures, so they lose the explanatory capacity. The result of our research work in this area is the development of a new algorithm for building trees, called the Consolidated Trees Construction (CTC) algorithm. Our aim has been to design an algorithm that builds a single tree which improves the performance of standard classification trees, reduces their complexity and maintains the explanatory capacity. The result of this methodology is a single agreed tree, built using different training subsamples.


The paper proceeds with the description of our new methodology for building Consolidated Trees (CT) in Section 2. In Section 3, the data sets and the experimental set-up are described. The results of our experimental study are discussed in Section 4. Finally, Section 5 summarizes the conclusions and outlines future research.

2 CTC learning algorithm

The CTC algorithm is based on resampling techniques. Each training subsample is used to build a tree but, finally, a single tree is built, reached by consensus among the trees being grown. This technique is radically different from bagging, boosting, etc.: the consensus is achieved at each step of the tree building process and only one tree is built. All the trees being built (using the base classifier, C4.5 in our case) from the different subsamples make proposals about the variable that should be used to split the current node. The decision about which variable will be used to make the split in the node of the consolidated tree is agreed among the different proposals coming from the trees; the decision is made by a voting process. Based on this decision, all the trees (each one associated to a different subsample) are forced to use the same variable to make the split. The process is repeated iteratively until some stop criterion is fulfilled.

Algorithm 1. Consolidated Trees Construction algorithm (CTC)

    Generate NS subsamples (S_i) from S with the Resampling Mode method
    LS_i := {S_i}, where 1 ≤ i ≤ NS
    repeat
        for i := 1 to NS do
            CurrentS_i := First(LS_i)
            LS_i := LS_i - CurrentS_i
            Induce the best split (X, B)_i for CurrentS_i
        end for
        Obtain the consolidated pair (X_c, B_c), based on (X, B)_i, 1 ≤ i ≤ NS
        if (X_c, B_c) ≠ Not_Split then
            for i := 1 to NS do
                Divide CurrentS_i based on (X_c, B_c) to obtain n subsamples {S_1^i, ..., S_n^i}
                LS_i := {S_1^i, ..., S_n^i} + LS_i
            end for
        end if
    until LS_i is empty

The algorithm starts by extracting from the original training set a set of subsamples (Number Samples, NS in the algorithm) to be used in the process, based on the desired resampling technique (Resampling Mode). The consolidation of nodes in the CT's construction process is made step by step.


Algorithm 1 describes the process of building a CT, and an example of the process can be found in Figure 1. The pair (X, B)_i is the split proposal for the first subsample in LS_i. X is the variable selected to split (for example "sex" in the first level of Figure 1) and B indicates the proposed branches or criteria used to create the new subsamples. When X is discrete, as many branches as X has values are proposed ("m" and "f" in the case of the "sex" variable). When X is continuous, B is a cut value (for example cut = 30 for the variable "age"). X_c is obtained by a voting process among all the proposed X, and B_c is the median of the proposed cut values when X_c is continuous and all the possible values when X_c is discrete.
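To make the consolidation step concrete, the following is a minimal sketch in Python of how the vote described above could be implemented. It is not the authors' code; the representation of the proposals (a list of (variable, branches) pairs) and the helper names are assumptions made only for illustration.

    from collections import Counter
    from statistics import median

    def consolidate_split(proposals, is_continuous):
        """Agree on a single split from the per-subsample proposals.

        proposals: list of (variable, branches) pairs, one per subsample tree;
                   branches is a cut value for continuous variables and a list
                   of category values for discrete ones. A proposal of
                   (None, None) means "do not split this node".
        is_continuous: dict mapping variable name -> bool.
        Returns the consolidated pair (Xc, Bc), or (None, None) for Not_Split.
        """
        # Xc: simple majority vote among the proposed variables
        votes = Counter(variable for variable, _ in proposals)
        x_c, _ = votes.most_common(1)[0]
        if x_c is None:                      # most trees proposed not to split
            return None, None

        if is_continuous[x_c]:
            # Bc: median of the cut values proposed for the winning variable
            cuts = [b for v, b in proposals if v == x_c]
            return x_c, median(cuts)
        # Discrete variable: one branch per possible value (C4.5 style)
        values = set()
        for v, b in proposals:
            if v == x_c:
                values.update(b)
        return x_c, sorted(values)

    # Illustrative usage with made-up proposals from three subsample trees
    proposals = [("age", 28.0), ("age", 30.0), ("sex", ["m", "f"])]
    print(consolidate_split(proposals, {"age": True, "sex": False}))
    # -> ('age', 29.0)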

Fig. 1. CTC Algorithm. Example

The subsampling technique and the number of subsamples used in the tree building process [12] are important aspects of the algorithm. There are many possible combinations for the Resampling Mode: size of the subsamples (100%, 75%, 50%, etc., with regard to the original training set size), with replacement or without replacement, stratified or not, etc. We have experimented with most of the possible options. The analysis of the preliminary results provided some important clues about the kind of Resampling Mode that should be used: the subsamples must be large enough (at least 75%) so that they do not lose too much of the information in the original training set. The fact that the subsamples are stratified, so that the a priori probability of the different categories of the class variable is maintained, is also important. That is why, in this paper, only results obtained for stratified subsamples with sizes of 75% without replacement and of 100% with replacement (bootstrap samples) are presented; a sketch of how such subsamples can be generated is given below. Once the consolidated tree has been built, its behavior is similar to the behavior of a single tree. Section 4 will show that the trees built using this methodology have better discriminating capability and are less complex.
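As an illustration only (not the authors' implementation), the two Resampling Mode settings used in this paper could be generated along the following lines; the function name and the use of NumPy are assumptions.

    import numpy as np

    def stratified_subsample(X, y, size_fraction=0.75, replace=False, rng=None):
        """Draw one stratified subsample: per class, take size_fraction * n_class
        cases, without replacement (75% mode) or with replacement (bootstrap mode,
        used here with size_fraction=1.0)."""
        rng = rng or np.random.default_rng()
        chosen = []
        for label in np.unique(y):
            idx = np.flatnonzero(y == label)
            n = int(round(size_fraction * idx.size))
            chosen.append(rng.choice(idx, size=n, replace=replace))
        chosen = np.concatenate(chosen)
        return X[chosen], y[chosen]

    # Example: NS subsamples in each of the two modes studied in the paper
    rng = np.random.default_rng(0)
    X = np.arange(40).reshape(20, 2)
    y = np.array([0] * 12 + [1] * 8)
    subsamples_75 = [stratified_subsample(X, y, 0.75, replace=False, rng=rng)
                     for _ in range(10)]
    subsamples_boot = [stratified_subsample(X, y, 1.0, replace=True, rng=rng)
                       for _ in range(10)]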

3 Experimental methodology

Ten databases of real applications have been used for the experimentation. Most of them belong to the well-known UCI Repository benchmark [2], widely used in the scientific community. Table 1 shows the characteristics of the databases used in the comparison.

Table 1. Description of experimental domains

    Domain                       N. of patterns   N. of features   N. of classes
    Breast cancer (Wisconsin)           699              10               2
    Heart disease (Cleveland)           303              13               2
    Hypothyroid                        3163              25               2
    Faithful                          24507              49               2
    Lymphography                        148              18               4
    Iris                                150               4               3
    Glass                               214               9               7
    Voting                              435              16               2
    Hepatitis                           155              19               2
    Segment                             210              19               7

The Faithful database is a real data application from our environment, centered in the electrical appliances sector, and does not belong to the UCI Repository. In this case, we try to analyze the profile of customers over time, so that a classification related to their fidelity to the brand can be made. This will allow the company to follow different strategies to increase the number of customers that are faithful to the brand, so that sales increase. In this kind of application, a system that provides information about the factors taking part in the classification (the explanation) is very important; it is almost more important to analyze and explain why a customer is or is not faithful than the categorization of the customer itself. This is the information that will help the corresponding department to make good decisions.

The CTC methodology has been compared to Quinlan's C4.5 tree building algorithm, Release 8 [11], using the default parameter settings. The methodology used for the experimentation is 10-fold stratified cross-validation [9]. The cross-validation is repeated five times using a different random reordering of the examples in the data set. This methodology has given us the option of comparing, five times, groups of 10 CTC and C4.5 trees (50 executions for each instance of the analyzed parameters). In each run we have calculated the average error and its standard deviation, and the complexity of the trees, estimated as the number of internal nodes of the tree (Comp_mean). In order to evaluate the improvement achieved in a given domain by using the CTC algorithm instead of C4.5, we calculate the relative improvement or relative difference, (Result_CTC - Result_C4.5) / Result_C4.5, for both aspects, error and complexity. For every result we have tested the statistical significance [5,7] of the differences obtained with the two algorithms using the paired t-test with a significance level of 95%.
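A minimal sketch of this evaluation protocol is given below, assuming scikit-learn's RepeatedStratifiedKFold and SciPy's paired t-test as stand-ins for the original tooling; the paper's own experiments used C4.5 Release 8 and the CTC implementation, so the classifiers below are only placeholders.

    import numpy as np
    from scipy.stats import ttest_rel
    from sklearn.datasets import load_iris
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    def relative_difference(result_ctc, result_c45):
        # (Result_CTC - Result_C4.5) / Result_C4.5, as defined in the text
        return (result_ctc - result_c45) / result_c45

    X, y = load_iris(return_X_y=True)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)

    errors_a, errors_b = [], []          # one error per fold, 50 per method
    for train, test in cv.split(X, y):
        # Placeholder classifiers; the paper compares CTC against C4.5 R8
        clf_a = DecisionTreeClassifier(random_state=0).fit(X[train], y[train])
        clf_b = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[train], y[train])
        errors_a.append(1.0 - clf_a.score(X[test], y[test]))
        errors_b.append(1.0 - clf_b.score(X[test], y[test]))

    print("relative difference:",
          relative_difference(np.mean(errors_a), np.mean(errors_b)))
    # Paired t-test on the 50 paired fold errors, 95% significance level
    t_stat, p_value = ttest_rel(errors_a, errors_b)
    print("significant at 95%:", p_value < 0.05)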

4 Results

The behavior of the meta-algorithm when changing the values of several parameters has been analyzed. The analyzed ranges of the parameters are:

• Number Samples (NS in the tables): 3, 5, 10, 20, 30, 40, 50, 75, 100, 125, 150, 200. The results show that 10 is the minimum number of samples needed to obtain satisfactory results with CTC, so the average results presented in this section do not take into account the values 3 and 5, except for Faithful, where, due to the size of the database, the study has been done from 3 to 40 subsamples.

• Resampling Mode: as mentioned in Section 2, for the results presented in this paper the following values of the parameter have been used: 75% without replacement and bootstrap sampling (100% with replacement), always stratified.

• Crit Split and Crit Variable: simple majority among the Number Samples trees for both. Both kinds of trees have been pruned using the pruning algorithm of the C4.5 R8 software, in order to compare the two systems under similar conditions. We cannot forget that developing a classification tree too much leads to a greater probability of overtraining.

• Crit Branches: when the decision made in the CT building process is to split a node, the new branches have to be selected. When the variables were continuous, the selection has been made based on the median of the proposed values; experimentation with the mean has also been done, but the results were worse. When the variables were discrete, we follow the standard C4.5 procedure.

Table 2. Error and complexity comparison among C4.5 and CTC (75%)
Resampling Mode = 75% without replacement

                 C4.5          CTC (Err_min)                         CTC (Comp_min)
             Err    Comp   Err    R.Dif   Comp    R.Dif   NS    Err    R.Dif   Comp    R.Dif   NS
  Breast-W   5,63   3,16   5,40   -4,12   3,11    -1,41   125   5,49   -2,49   3,04    -3,52    30
  Heart-C   23,96  15,11  22,85   -4,61  13,51   -10,59    20  22,92   -4,35  13,20   -12,65    30
  Hypo       0,71   5,38   0,72    0,56   4,51   -16,12    20   0,72    0,84   4,40   -18,18    30
  Faithful   1,48  39,47   1,48   -0,13  32,44   -17,79    20   1,50    0,94  28,56   -27,65    05
  Lymph     20,44   8,84  19,65   -3,88   9,24     4,52    30  19,93   -2,51   9,09     2,76    40
  Iris       5,75   3,71   4,01  -30,24   3,13   -15,57   150   4,68  -18,62   3,07   -17,37    20
  Glass     31,55  22,93  29,43   -6,69  25,07     9,30    75  29,94   -5,09  23,04     0,48    20
  Voting     3,41   4,80   3,36   -1,35   4,84     0,93    20   3,50    2,64   4,76    -0,93    10
  Hepatitis 20,29   8,53  20,31    0,10   6,80   -20,31    75  20,58    1,43   6,49   -23,96   100
  Segment   13,61  11,71  11,52  -15,31  13,82    18,03    50  13,51   -0,75  12,64     7,97    10
  Average   12,68  12,36  11,87   -6,57  11,65    -4,90        12,28   -2,79  10,83    -9,30

Table 2 shows the results related to error (Err) and complexity (Comp) for the different domains. The resampling mode of the CTC is 75% without replacement. The comparison between C4.5 and CTC presented in the table has been calculated using the best value of the parameter Number Samples (NS) for CTC, chosen either to minimize the error (left side) or the complexity (right side) of the classifier. Relative differences (R.Dif) for error and complexity between the two algorithms are also presented. The mean of the results obtained with the two algorithms over the different domains is shown in the last row. It can be observed that if the value of Number Samples that minimizes the error is chosen, on average the CTC algorithm obtains a relative improvement of 6,57% in error. In this case, the relative improvement in the complexity of the generated trees is 4,90%. Even when the value of Number Samples that minimizes the complexity is selected, the CTC has smaller errors than C4.5, the improvement being 2,79%; in this case, the complexity is reduced by 9,30%. In the largest database we have studied (Faithful), the reduction of the complexity becomes especially important: the obtained improvement is 27,65%. This reduction is statistically significant for every value of the parameter Number Samples. We have also analyzed the standard deviation of the error, and the results show that CTC is on average more stable than C4.5; besides, it is smaller for most of the analyzed databases. If we analyze the most difficult domains (e.g. error larger than 10%), we can observe that, except in Hepatitis, the CTC obtains large relative improvements. Besides, in the databases where C4.5 behaves better, the relative differences are at most 0,56%.

Table 3. Error and complexity comparison among C4.5 and CTC (bootstrap)
Resampling Mode = bootstrap (100% with replacement)

                 C4.5          CTC (Err_min)                         CTC (Comp_min)
             Err    Comp   Err    R.Dif   Comp    R.Dif   NS    Err    R.Dif   Comp    R.Dif   NS
  Breast-W   5,63   3,16   5,18   -8,10   3,82    21,13   150   5,46   -3,05   3,71    17,61    10
  Heart-C   23,96  15,11  23,92   -0,16  22,80    50,88    75  25,54    6,59  20,73    37,21    10
  Hypo       0,71   5,38   0,80   12,64   6,98    29,75    10   0,80   12,64   6,98    29,75    10
  Faithful   1,48  39,47   1,48   -0,27  29,53   -25,17    03   1,55    4,32  20,53   -47,97    30
  Lymph     20,44   8,84  20,40   -0,17  12,00    35,68    75  20,98    2,66  10,18    15,08    10
  Iris       5,75   3,71   4,94  -13,99   3,38    -8,98    75   4,94  -13,99   3,38    -8,98    75
  Glass     31,55  22,93  29,83   -5,45  34,02    48,35    30  31,19   -1,13  30,62    33,53    10
  Voting     3,41   4,80   3,27   -4,05   5,13     6,94    40   3,50    2,64   5,09     6,02    05
  Hepatitis 20,29   8,53  18,81   -7,30  10,16    19,01   150  19,69   -2,98   9,11     6,77    10
  Segment   13,61  11,71  12,38   -9,05  18,58    58,63    40  13,62    0,06  17,49    49,34    10
  Average   12,68  12,36  12,10   -3,59  14,64    23,62        12,73    0,78  12,78    13,83

Table 3 shows that, even if the improvement of the error with regard to C4.5 is maintained when bootstrap samples are used (only for the value of Number Samples that minimizes the error), this improvement is smaller, 3,59%. On the other hand, the complexity of the induced decision trees is considerably larger than in C4.5, by 23,62%. It seems clear that for our algorithm, generating the subsamples with 75% of the cases and without replacement is better than using bootstrap sampling (Figure 2). Possibly, the amount of information from the original training set appearing in each subsample is important when building a CT: the subsamples of 75% have more information than the 100% bootstrap subsamples, which contain only 63.2% distinct cases on average (see the brief derivation below). The size of the subsamples probably influences the complexity of the final tree, and that is why the complexity of CTC(bootstrap) is considerably larger than that of CTC(75%). We can find some domains where CTC(bootstrap) reduces the error more than CTC(75%), but the complexity of the induced trees increases considerably. We should analyze this behavior in the future.

Figure 2 shows the evolution of the average error and complexity of CTC(75%), left side of the figure, and CTC(bootstrap), right side, compared to C4.5 for all domains, when varying the values of Number Samples (the Faithful database has not been taken into account in the graphics). It can be observed that in CTC(75%), for every value of NS, the error of C4.5 is improved with smaller complexity. Table 4 shows average values of error and complexity over the analyzed range of values of NS when using CTC(75%).
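The 63.2% figure is the standard expected fraction of distinct cases in a bootstrap sample; a brief derivation (not part of the original paper) is:

    P(a given case appears in a bootstrap sample of size n) = 1 - (1 - 1/n)^n,

which tends to 1 - e^{-1} ≈ 0.632 as n grows.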


Fig. 2. Evolution of the average error and complexity in the whole set of studied domains when increasing Number Samples

Table 4. Average results using CTC(75%) of error and complexity for the whole range of values of NS (left) and for NS = 20 (right)

              CTC(75%), average for NS = 10,...,200      CTC(75%), NS = 20
             Err     R.Dif   Comp    R.Dif              Err     R.Dif   Comp    R.Dif
  Breast-W   5,56    -1,21   3,09    -2,11              5,52    -1,99   3,07    -2,82
  Heart-C   23,43    -2,21  13,84    -8,41             22,85    -4,61  13,51   -10,59
  Hypo       0,73     2,30   4,65   -13,60              0,72     0,56   4,51   -16,12
  Faithful   1,49     0,84  30,03   -23,92              1,48    -0,13  32,44   -17,79
  Lymph     20,03    -2,01   9,18     3,74             19,91    -2,56   9,22     4,27
  Iris       4,42   -23,03   3,13   -15,63              4,68   -18,62   3,07   -17,37
  Glass     29,96    -5,04  24,01     4,69             29,94    -5,09  23,04     0,48
  Voting     3,39    -0,56   4,80     0,00              3,36    -1,35   4,84     0,93
  Hepatitis 20,92     3,08   7,02   -17,68             21,11     4,00   7,29   -14,58
  Segment   12,42    -8,70  14,08    20,23             12,76    -6,26  13,24    13,09
  Average   12,24    -3,65  11,38    -5,27             12,23    -3,60  11,42    -6,05

It can be observed that the average relative improvement in the error is 3,65%, with a relative reduction in complexity of 5,27%. These results confirm the robustness of the algorithm with regard to the NS parameter. Even if the improvements are slighter than those obtained by tuning NS, the results for CTC(75%) are still better than those obtained with C4.5. If we wanted to tune the Number Samples parameter in order to optimize the error/complexity trade-off in the analyzed domains, the results point us to a number of samples around 20 or 30 (see Figure 2, left side). As an example, Table 4 (right side) presents the results achieved for NS = 20. It can be observed that they do not differ substantially from the results obtained by finding in each domain the best value for Number Samples. The CTC(75%) maintains its improvement over C4.5 in both cases: 3,60% for the error and 6,05% for the complexity.

If we analyze the statistical significance of the differences between the two algorithms, we will not find significant differences in the error for any of the values of Number Samples, even if the average behavior of the CTC(75%) algorithm is always better than that of C4.5. However, when analyzing the complexity, significant differences in favor of CTC are found in five of the ten databases studied (Heart-C, Hypo, Faithful, Iris, Hepatitis). For the Segment domain, significant differences in favor of C4.5 have been found, although the CTC(75%) achieves a smaller error rate. This does not happen with CTC(bootstrap), since, with bootstrap samples, the difference in complexity is statistically significant in favor of C4.5 in the majority of the analyzed domains.

5 Conclusions and further work

This paper presents a new algorithm for building decision trees, the Consolidated Trees Construction algorithm, that improves the error and the complexity of C4.5. The proposed algorithm maintains the explanatory feature of the classification, because the created classifier is a single decision tree. This makes our technique very useful in many real life problems. The method used in the subsample generation process (Resampling Mode) is an important parameter, because the results vary substantially depending on the selected option. In the experimentation carried out so far, the CTs built using subsamples with sizes of 75% of the original training set size, and without replacement, are the ones presenting the best behavior. On the other hand, the robustness of the meta-algorithm with respect to the parameter Number Samples has been proven, 20 or 30 being adequate values, taking into account the error/complexity trade-off, for all the experimented domains.

The set of studied domains has to be enlarged in the future. As further work, we are also thinking of experimenting with other possibilities for the Resampling Mode parameter (different amounts of information or variability of the subsamples). The first results point us towards percentages greater than 75% and without replacement, making a bias/variance analysis in order to study the origin of the error. Another interesting possibility is to generate new subsamples dynamically during the building process of the CT, where the probability of selecting each case is modified based on the error (similar to boosting). We are also analyzing a possibility where the meta-algorithm itself builds trees that do not need to be pruned. With this aim, we would tune the parameters Crit Split, Crit Variable and Crit Branches so that the generated trees are situated at a better point of the learning curve and the computational load of the training process is minimized.


We are also working on the design of a metric that will give us the possibility of comparing two decision trees from the structural point of view. Our aim is to make a pairwise comparison between trees, where the common subtree starting from the root is measured; a possible sketch of such a comparison is given below. This would help in the comparison of the structural stability of trees built using different algorithms.
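As a rough illustration of what such a metric could look like (the tree representation and the function below are assumptions, not the authors' definition), the common subtree starting from the root could be counted as follows:

    def common_subtree_size(a, b):
        """Count nodes of the largest common subtree of a and b starting at the root.

        Trees are represented as nested dicts:
          {"split": variable_name, "children": {branch_label: subtree, ...}}
        and leaves as {"leaf": class_label}. This representation is assumed
        for illustration only.
        """
        if "leaf" in a or "leaf" in b:
            # Leaves (or a leaf facing an internal node) only match as one node
            return 1 if a.get("leaf") == b.get("leaf") else 0
        if a["split"] != b["split"]:
            return 0                      # roots disagree: no common subtree
        size = 1                          # the matching split node itself
        for branch, child_a in a["children"].items():
            child_b = b["children"].get(branch)
            if child_b is not None:
                size += common_subtree_size(child_a, child_b)
        return size

    # Illustrative usage with two toy trees
    t1 = {"split": "sex", "children": {"m": {"leaf": "yes"}, "f": {"leaf": "no"}}}
    t2 = {"split": "sex", "children": {"m": {"leaf": "yes"}, "f": {"leaf": "yes"}}}
    print(common_subtree_size(t1, t2))    # -> 2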

Acknowledgments

This work was partly supported by the University of the Basque Country / Euskal Herriko Unibertsitatea, project 1/UPV 00139.226-T-14882/2002. We would like to thank the company Fagor Electrodomesticos, S. COOP. for allowing us to use, in this work, their data obtained through the project BETIKO (Faithful). The Lymphography domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and M. Soklic.

References

1. Bauer E., Kohavi R. (1999) An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, 36, 105-139
2. Blake C.L., Merz C.J. (1998) UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, University of California, Irvine, Dept. of Information and Computer Sciences
3. Breiman L. (1996) Bagging Predictors. Machine Learning, 24, 123-140
4. Chawla N.V., Hall L.O., Bowyer K.W., Moore Jr., Kegelmeyer W.P. (2002) Distributed Pasting of Small Votes. LNCS 2364. Multiple Classifier Systems: Proc. 3rd Inter. Workshop, 2002, Italy, 52-61
5. Dietterich T.G. (1998) Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895-1924
6. Dietterich T.G. (2000) Ensemble Methods in Machine Learning. LNCS 1857. Multiple Classifier Systems: Proc. 1st Inter. Workshop, 2000, Italy, 1-15
7. Dietterich T.G. (2000) An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning, 40, 139-157
8. Freund Y., Schapire R.E. (1996) Experiments with a New Boosting Algorithm. Proceedings of the 13th International Conference on Machine Learning, 148-156
9. Hastie T., Tibshirani R., Friedman J. (2001) The Elements of Statistical Learning. Springer-Verlag. ISBN: 0-387-95284-5
10. Ho T.K. (1998) The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832-844
11. Quinlan J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, California
12. Skurichina M., Kuncheva L.I., Duin R.P.W. (2002) Bagging and Boosting for the Nearest Mean Classifier: Effects of Sample Size on Diversity and Accuracy. LNCS 2364. Multiple Classifier Systems: Proc. 3rd Inter. Workshop, 2002, Italy, 62-71
