EXPERIMENTS WITH AN INNOVATIVE TREE PRUNING ALGORITHM

Mingyu Zhong, School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL, USA, email: [email protected]
Michael Georgiopoulos, School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL, USA, email: [email protected]
Georgios C. Anagnostopoulos, Department of Electrical and Computer Engineering, Florida Institute of Technology, Melbourne, FL, USA, email: [email protected]

ABSTRACT
The pruning phase is one of the necessary steps in decision tree induction. Existing pruning algorithms tend to have some or all of the following difficulties: 1) lack of theoretical support; 2) high computational complexity; 3) dependence on validation; 4) complicated implementation. The 2-norm pruning algorithm proposed here addresses all of the above difficulties. This paper presents an experimental comparison between the 2-norm pruning algorithm and two classical pruning algorithms, the Minimal Cost-Complexity Pruning algorithm (used in CART) and the Error-Based Pruning algorithm (used in C4.5), and confirms that the 2-norm pruning algorithm is superior in accuracy and speed.

KEY WORDS
decision tree, classification, 2-norm, pruning, CART, C4.5

1 Introduction

Decision trees are widely used in Machine Learning for their interpretability, their fast training/prediction process, and their relatively high accuracy. Usually the training (induction) of a tree consists of two phases: a growing phase, where a large tree is constructed, and a pruning phase, where the tree is trimmed to prevent over-fitting. The pruning phase remains a relatively open topic: many pruning algorithms have been proposed, but none is generally recognized as universally optimal. The 2-norm pruning algorithm was proposed recently [11], with strong theoretical support and many practical advantages, including its independence of validation, its low computational complexity, and its simplicity of implementation. This paper attempts to verify its superiority over the Minimal Cost-Complexity Pruning (CCP) [2] and Error-Based Pruning (EBP) [9] through experimentation.

The rest of this paper is organized as follows. Section 2 summarizes the existing work on tree pruning algorithms. Section 3 introduces the 2-norm pruning algorithm. Section 4 shows the experimental results of the comparison between this algorithm and the two classical pruning algorithms, CCP and EBP. Section 5 concludes this paper.

2 Existing Work on Tree Pruning

To prevent over-fitting and to overcome the difficulty of setting a proper stopping criterion in the growing phase, a pruning phase is usually applied after a full-sized tree is grown (see [2], page 37). The most famous pruning algorithms include the Minimal Cost-Complexity Pruning (CCP) in CART (see [2], page 66) and the Error-Based Pruning (EBP) in C4.5 (see [9], page 37). CCP obtains a sequence of pruned trees that optimize the cost, defined as the training error rate plus a penalty proportional to the tree size, and it applies k-fold cross-validation to select the best tree. Cross-validation, although accurate in predicting the generalization error, is computationally expensive. Recently, CCP's additive penalty has been generalized to sub-additive penalties (that is, the extra error rate is a nonlinear function of the tree size). The estimation of risk upper bounds is studied in [1, 5, 8, 10], and a non-parametric approach without cross-validation is also provided in [8]. However, it was recognized that replacing CART's additive penalty with sub-additive penalties sacrifices the efficiency of finding the candidate pruned trees [8].

EBP, as well as some other pruning algorithms such as Minimum Error Pruning (MEP) (see [7]), also provides an approach to estimate the tree's generalization error without validation, by simply using statistical information at the nodes of the tree. Although these two approaches are usually simple and fast, their models are not fully justified. In fact, both algorithms tend to under-prune a tree, as shown in [4]. Theoretical analyses of their behavior are given in [3] and [11], respectively.

Zhong et al. recently proposed the idea of k-norm estimation for tree costs, including the misclassification rate [11]. Based on the assumption that, given the training examples, the class probabilities are random variables with a Dirichlet distribution (which also leads to the widely used Laplace's/Lidstone's Law of Succession for probability estimation), the authors provided expressions for the k-norm of the costs for any natural number k without validation, and claimed that when k > 1 the k-norm takes into account not only the expected performance but also the reliability of the performance. For example, the 2-norm combines the expected value and the standard deviation, as shown in (7). The authors also proved some useful properties of the k-norm estimation, such as the fact that the k-norm cost of a tree can be computed and optimized recursively, which means that the evaluation and the pruning of a tree can be completed within one traversal of the tree. Due to these promising properties, this paper focuses on this algorithm and performs the experiments needed to test it and compare it with other existing approaches to pruning decision trees.

3 2-Norm Pruning Algorithm

Although the original work [11] proposed a general k-norm pruning algorithm, the authors suggested that the k = 2 norm estimates are more accurately calculated than higher k-norm estimates, and that they are also less computationally complex than their higher k-norm counterparts. Hence, in this paper, we discuss and test only a special case of the original algorithm: 2-norm pruning based on misclassification rates.

3.1 Procedure

The 2-norm pruning algorithm follows the same structure as EBP, as shown below.

Procedure PruneSubtree(t)
input: t = root of a subtree
output: r̂(t) = estimated error rate of the pruned subtree

    Let r̂_leaf(t) = estimated error rate of t as a leaf
    If t is a decision node
        Let r̂_i = PruneSubtree(c_i) for each child c_i
        Evaluate r̂_tree(t) (the estimated error rate of the subtree rooted at t) based on the r̂_i
        If r̂_tree(t) < r̂_leaf(t) − ε Then Return r̂_tree(t)
        Replace the subtree rooted at t with a leaf
    End If
    Return r̂_leaf(t)

For the 2-norm pruning algorithm, r̂_leaf(t) and r̂_tree(t) are evaluated by (7), (8) and (9), which are explained later. The above procedure implies a bottom-up order, meaning that the child nodes of any decision node t are considered before node t itself. Each node is accessed exactly once, and the decision to prune or not to prune a node, once made, is never changed when its ancestors are considered. Therefore, only a portion of all possible pruned trees is visited. For both EBP and 2-norm pruning, it has been proven that the selected pruned tree is nevertheless globally optimal (in the sense of EBP's estimated error rate and the 2-norm error rate, respectively) among all possible pruned trees.
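To make the control flow concrete, the following is a minimal Python sketch of this bottom-up procedure. The node representation (a children list per node) and the helper names estimate_leaf_error and estimate_subtree_error are illustrative assumptions rather than part of the original algorithm description; the estimators themselves are defined by (7)-(9) in Subsection 3.2.

    def prune_subtree(t, epsilon, estimate_leaf_error, estimate_subtree_error):
        # Returns the estimated error rate of the (possibly pruned) subtree rooted at t.
        r_leaf = estimate_leaf_error(t)              # error of t as a leaf, Eqs. (7)-(8)
        if t.children:                               # t is a decision node
            child_errors = [prune_subtree(c, epsilon,
                                          estimate_leaf_error,
                                          estimate_subtree_error)
                            for c in t.children]
            r_tree = estimate_subtree_error(t, child_errors)   # Eq. (9)
            if r_tree < r_leaf - epsilon:
                return r_tree                        # keep the subtree
            t.children = []                          # otherwise prune: t becomes a leaf
        return r_leaf

Because every node is visited exactly once and no pruning decision is revisited, the sketch amounts to a single post-order traversal of the tree, matching the complexity claim above.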

3.2 Estimation of Error Rates

Given an input X (a vector of attributes), the error rate r of a classifier can be expressed as follows:

    r = 1 − P[C_Label*(X) | X]                                                        (1)

where P[Φ] is the probability of the event Φ, C_j (j = 1, 2, ..., J) is the event that the class label is j, J is the number of classes, and Label*(X) is the class label predicted by the classifier for X. Since the true value of P[C_j | X] is unknown (otherwise the Bayes-optimal classifier could be used), it is reasonable to treat it as a random variable. In decision trees, P[C_j | X] is usually approximated by P[C_j | A_t], where A_t is the event that leaf t receives X, and the prediction is given by:

    Label*(X) = arg max_j N_jt                                                        (2)

where N_jt is the number of training examples of class j in node t. The 2-norm pruning algorithm treats P[C_j | A_t] (abbreviated as p_jt) as a random variable with a Dirichlet distribution (which is the basis of the widely used Lidstone's and Laplace's Laws of Succession). In particular, the prior joint probability density function (PDF) of the p_jt is given by (p_1t, ..., p_Jt) ~ Dir(λ, λ, ..., λ), that is,

    f(p_1t, ..., p_Jt) = α δ(1 − Σ_{j=1}^{J} p_jt) Π_{j=1}^{J} p_jt^(λ−1) I[p_jt ≥ 0]   (3)

where f() is the joint PDF, δ() is the Dirac delta function, I[] is the indicator function, and α is a constant such that the integral of f(p_1t, ..., p_Jt) is unity. On the other hand, the Bayesian theory states that

    f(P | Observations) = P[Observations | P] f(P) / ∫ P[Observations | P] f(P) dP     (4)

Here the Observations can be interpreted as the training data, namely (N_1t, ..., N_Jt). Using (3) and (4), it has been proven that (p_1t, ..., p_Jt | TrainingData) ~ Dir(λ + N_1t, λ + N_2t, ..., λ + N_Jt). This posterior distribution implies that

    E[p_jt | TrainingData] = (λ + N_jt) / (Jλ + N_t)                                   (5)

where E[Q] is the expected value of the random variable Q, and N_t = Σ_{j=1}^{J} N_jt. The above equation is the well-known Lidstone's Law of Succession; when λ = 1, it reduces to Laplace's Law of Succession. Due to (1), it is not difficult to prove that the error rate at a leaf t has the distribution (r_t, 1 − r_t) ~ Dir((J − 1)λ + N_t − max_j N_jt, λ + max_j N_jt), which means that

    E[r_t | TrainingData] = ((J − 1)λ + N_t − max_j N_jt) / (Jλ + N_t)                 (6)
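As a concrete illustration of (5) and (6), the sketch below computes the posterior mean estimates from the vector of per-class training counts at a leaf. The array-based interface is an assumption made here for illustration.

    import numpy as np

    def lidstone_mean(counts, lam=1.0):
        # Posterior mean of p_jt under Dir(lam + N_1t, ..., lam + N_Jt), Eq. (5).
        # lam = 1 corresponds to Laplace's Law of Succession.
        counts = np.asarray(counts, dtype=float)
        J = counts.size
        return (lam + counts) / (J * lam + counts.sum())

    def expected_leaf_error(counts, lam=1.0):
        # Expected error rate E[r_t | TrainingData] at a leaf, Eq. (6).
        counts = np.asarray(counts, dtype=float)
        J, N = counts.size, counts.sum()
        return ((J - 1) * lam + N - counts.max()) / (J * lam + N)

For example, with counts (8, 2) and λ = 1, expected_leaf_error returns ((2 − 1)·1 + 10 − 8) / (2·1 + 10) = 0.25.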

However, the authors of the 2-norm pruning algorithm argue that the expected value of r_t is not a good estimate, because it does not take any reliability factor into account: in practice, if two classifiers have similar average accuracy, we tend to favor the one with the smaller variance. The authors have also proven [11] that, under a loose condition, if the expected value of r_t were used as the estimate, the pruning algorithm would not prune any decision node as long as the split in the node reduces the training misclassifications by at least one. Therefore, the authors use the 2-norm estimate as a trade-off between the mean and the variance of the error rate:

    r̂(t) = ||r_t||_2 = √(E[r_t²]) = √(E[r_t]² + VAR[r_t])                              (7)

where VAR[Q] stands for the variance of the random variable Q. For a leaf t, according to the properties of the Dirichlet distribution,

    E[r_t²] = ((J − 1)λ + b_t)((J − 1)λ + b_t + 1) / ((Jλ + N_t)(Jλ + N_t + 1))         (8)

where b_t = N_t − max_j N_jt. For simplicity, we omit the condition "| TrainingData" in the above expressions and in those that follow.
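Continuing the same array-based sketch, the leaf-level 2-norm estimate of (7)-(8) can be written as:

    import numpy as np

    def leaf_second_moment(counts, lam=1.0):
        # E[r_t^2] for a leaf with class counts N_jt, Eq. (8).
        counts = np.asarray(counts, dtype=float)
        J, N = counts.size, counts.sum()
        b = N - counts.max()                         # b_t = N_t - max_j N_jt
        a = (J - 1) * lam + b
        return a * (a + 1) / ((J * lam + N) * (J * lam + N + 1))

    def leaf_two_norm(counts, lam=1.0):
        # 2-norm error estimate sqrt(E[r_t^2]) for a leaf, Eq. (7).
        return float(np.sqrt(leaf_second_moment(counts, lam)))

Since the square root is monotone, comparing second moments and comparing 2-norms rank trees the same way; the square root only matters when the threshold ε of the procedure in Subsection 3.1 is applied on the 2-norm scale.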

For a decision node t, it has been proven that

    E[r_t²] = Σ_{c ∈ Children(t)} E[r_c²] E[P[A_c | A_t]]                               (9)

where E[P[A_c | A_t]] is evaluated with Lidstone's Law of Succession:

    E[P[A_c | A_t]] = (N_c + η) / (N_t + K_t η)                                         (10)

where K_t is the number of immediate children of t and η is analogous to λ (a different symbol is used so that it can take a different value). Equation (9) indicates that the error rate of a tree can be estimated recursively. It also guarantees that the procedure listed in Subsection 3.1 yields the globally optimal pruned tree among all possible pruned trees without visiting all of them. This is one of the most important reasons to use the 2-norm (rather than other approaches such as the 1-SE rule in CART).
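The recursion in (9)-(10) can be sketched as follows, reusing leaf_second_moment from the previous sketch and again assuming an illustrative node representation with a per-node class-count vector node.counts and a list node.children:

    def subtree_second_moment(node, lam=1.0, eta=0.5):
        # E[r_t^2] of the subtree rooted at node, via the recursion in Eq. (9).
        if not node.children:                        # leaf: Eq. (8)
            return leaf_second_moment(node.counts, lam)
        n_t = float(node.counts.sum())               # N_t: training examples reaching t
        k_t = len(node.children)                     # K_t: number of immediate children
        total = 0.0
        for child in node.children:
            w = (float(child.counts.sum()) + eta) / (n_t + k_t * eta)   # Eq. (10)
            total += w * subtree_second_moment(child, lam, eta)
        return total

In the procedure of Subsection 3.1, the square root of this quantity plays the role of r̂_tree(t), while leaf_two_norm(node.counts) plays the role of r̂_leaf(t).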

4 Experiments

In this section we show the procedure and the results of a full test on the three pruning algorithms: CCP, EBP, and 2-norm.

4.1 Procedure

We compare three metrics: the accuracy of the pruned tree, the size of the pruned tree, and the elapsed time in pruning (the computational complexity of pruning).

In our experiments, each database has to be separated into a training set and a test set. To minimize the randomness due to partitioning, we repeat the experiments 20 times with a different partition each time. To ensure that the classes in the training set have approximately the same distribution as in the test set, we shuffle the examples before partitioning. It has been observed that if random partitioning is simply repeated, some examples might appear more often in the training set than in the test set, or the opposite. Therefore, we partition each database only once into 20 subsets of approximately equal size. In the i-th (i = 1, 2, ..., 20) test, we use subsets i, i+1, ..., i+m−1 for training and the others for testing, where m is predefined (e.g., if the training ratio is 5%, m = 1; if the training ratio is 50%, m = 10). Subset indices wrap around: subset 21 is subset 1, subset 22 is subset 2, and so forth. This approach guarantees that each example appears in the training set exactly m times and in the test set 20 − m times; it also allows the training ratio to vary. The procedure of the experiments is shown below:

    For each database
        Shuffle the examples
        Partition the database into 20 subsets
        For m = 1, 10
            For i = 1 to 20
                Set the training set to subsets i to i+m−1
                Set the test set to the remaining subsets
                Grow a full tree with CART and the training set
                Use CCP/EBP/2-norm for pruning the tree
                Evaluate each pruned tree with the test set
            End For
        End For
    End For

4.2 Parameter Settings

The parameters of each algorithm are set as follows:
CCP: use 10-fold cross-validation and the 1-SE rule.
EBP: use the 75% confidence level.
2-norm: η = 0.5, λ = 100L / (J²N), ε = λ / (10N_t), where L is the number of leaves in the full-sized tree and N is the number of training examples.
These settings were found to be good default values for the parameters of the 2-norm algorithm.

4.3 Databases

The statistics of the databases used in the experiments are listed in Table 1. The databases g#c## are Gaussian artificial databases; the name g2c15, for example, denotes 2 classes and 15% overlap, where the overlap is defined as the error rate of the Bayes-optimal classifier. These databases have two attributes, x and y. In each class, x and y are independent Gaussian random variables whose means vary among the classes. Scatter plots of the Gaussian artificial databases with 2 and 6 classes and 15% overlap are shown in Fig. 1.
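For readers who wish to generate similar data, the sketch below produces a two-class, two-attribute dataset whose Bayes error equals a target overlap. The unit variances, the symmetric placement of the class means along one axis, and the equal class priors are assumptions made here for illustration; the paper does not state these details.

    import numpy as np
    from scipy.stats import norm

    def make_g2_dataset(n_per_class=2500, overlap=0.15, seed=0):
        # For two unit-variance Gaussians whose means are d apart, the Bayes
        # error (with equal priors) is Phi(-d/2); invert to get d for the
        # requested overlap.
        d = -2.0 * norm.ppf(overlap)
        rng = np.random.default_rng(seed)
        x0 = rng.normal(loc=[-d / 2.0, 0.0], scale=1.0, size=(n_per_class, 2))
        x1 = rng.normal(loc=[+d / 2.0, 0.0], scale=1.0, size=(n_per_class, 2))
        X = np.vstack([x0, x1])
        y = np.repeat([0, 1], n_per_class)
        return X, y

For instance, make_g2_dataset(overlap=0.15) mimics g2c15, where no classifier can exceed roughly 85% accuracy.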

Database    Examples   Numerical    Categorical   Classes   Major
                       Attributes   Attributes              Class %
abalone     4177       7            1             3         34.6
g2c15       5000       2            0             2         50.0
g2c25       5000       2            0             2         50.0
g6c15       5004       2            0             6         16.7
g6c25       5004       2            0             6         16.7
kr-vs-kp    3196       0            36            2         52.2
letter      20000      16           0             26        4.1
optdigits   5620       64           0             10        10.2
pendigits   10992      16           0             10        10.4
satellite   6435       36           0             6         23.8
segment     2310       19           0             7         14.3
shuttle     58000      9            0             7         78.6
splice      3190       0            60            3         51.9
waveform    5000       21           0             3         33.9

Table 1. Statistics of Databases

[Figure 1. Data Points of g2c15 (upper) and g6c15 (lower). Each panel is a scatter plot of attribute y against attribute x (both axes ranging from 0 to 1), with the legend distinguishing classes C1–C6.]

The other databases are benchmark databases obtained from the UCI Repository [6]. We chose databases from the UCI Repository that have no missing values and that contain at least 2000 examples (because the minimum training ratio is as low as 5%).

4.4 Results

4.4.1 Accuracy

The accuracies of the trees pruned by each algorithm are listed in Table 2. We show the mean, the standard deviation, and the T-test result over the 20 runs. We call the difference between two accuracies significant if the difference is above 1% and the T-test finds it statistically significant at the 95% confidence level. It can be seen that when the training ratio is small (5%), the 2-norm pruning algorithm outperforms CCP 5 times, while CCP never outperforms the 2-norm algorithm. On the other hand, when the training ratio is large (50%), the 2-norm pruning algorithm has accuracy similar to CCP (neither algorithm outperforms the other on any database). There are two potential causes of the accuracy difference between CCP and the 2-norm algorithm: 1) they visit different candidate pruned trees, and 2) even when the candidates are the same, they choose different final trees using their own error rate estimates. Unfortunately, it is difficult to compare the candidate trees fairly, because both algorithms examine many candidate pruned trees but select only one (even if 99% of the candidate trees visited by one algorithm were better than those visited by the other, we could not conclude that the former algorithm will select a better tree). The 2-norm algorithm also outperforms EBP in accuracy regardless of the training ratio (5 times when the training ratio is 5% and 7 times when it is 50%). The two causes discussed above also apply here, but the major factor is EBP's tendency to under-prune, as explained in Section 2.
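The significance criterion just described can be sketched as follows; whether the T-test is paired across the 20 matched partitions is not stated in the paper, so a paired test is assumed here for illustration.

    import numpy as np
    from scipy import stats

    def significantly_better(acc_a, acc_b, min_diff=1.0, confidence=0.95):
        # True if mean(acc_a) exceeds mean(acc_b) by more than min_diff
        # percentage points and the t-test over the paired runs is significant.
        diff = float(np.mean(acc_a) - np.mean(acc_b))
        _, p_value = stats.ttest_rel(acc_a, acc_b)
        return diff > min_diff and (1.0 - p_value) >= confidence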

4.4.2 Tree Size

Although the primary goal of the 2-norm pruning algorithm is not to minimize the tree size but the error rate, we also show the size of the pruned trees in Table 2. The 2-norm pruning algorithm considers the estimated error rate only, and thus the resulting size is sometimes larger than that of CCP. We can also see that EBP always generates larger trees than the other two algorithms (2-norm and CCP).

4.4.3 Speed

Now we compare the speed of the three pruning algorithms. The mean elapsed time (in seconds) is shown in Fig. 2. For each algorithm, the elapsed time is very stable across the 20 runs, so we do not show the standard deviation. For the same reason, and because the mean times differ widely among the three algorithms, all comparisons are statistically significant. Fig. 2 indicates that the 2-norm pruning algorithm is significantly faster than the other two (at least hundreds of times faster than CCP and usually tens of times faster than EBP).

Accuracy (in %), Training Ratio = 5%
             CCP          2-norm       EBP          2-norm : CCP        2-norm : EBP
Database     Mean  Std    Mean  Std    Mean  Std    Diff  Sig    ±      Diff  Sig    ±
abalone      59.1  1.5    59.5  2.0    56.6  1.5     0.4  51.6           2.9  100.0  +
g2c15        83.4  1.1    84.7  0.9    82.5  1.1     1.3  100.0  +       2.2  100.0  +
g2c25        72.5  1.4    74.0  0.9    69.4  2.4     1.5  100.0  +       4.6  100.0  +
g6c15        79.7  1.3    79.2  1.4    78.8  1.7    -0.5  71.9           0.4  64.4
g6c25        69.4  1.1    69.4  1.4    68.1  1.2     0.0  2.5            1.3  99.8   +
kr-vs-kp     93.4  1.6    93.5  1.5    93.9  1.4     0.1  19.1          -0.4  57.0
letter       62.8  1.2    62.8  0.9    63.3  1.0    -0.1  18.6          -0.5  89.1
optdigits    70.2  1.7    71.7  1.9    72.0  2.1     1.5  98.7   +      -0.3  41.1
pendigits    83.8  1.5    83.2  1.4    84.7  1.2    -0.6  77.0          -1.5  99.9   −
satellite    77.9  1.4    79.2  1.3    78.4  1.6     1.3  99.7   +       0.8  91.9
segment      85.5  2.6    86.2  2.4    86.8  2.4     0.7  61.2          -0.6  53.8
shuttle      99.7  0.1    99.6  0.1    99.7  0.1    -0.1  97.1          -0.1  100.0
splice       82.6  3.7    83.1  2.9    80.5  3.4     0.5  35.9           2.5  98.5   +
waveform     69.0  1.7    70.7  1.2    70.0  1.5     1.7  99.9   +       0.8  91.9

Accuracy (in %), Training Ratio = 50%
             CCP          2-norm       EBP          2-norm : CCP        2-norm : EBP
Database     Mean  Std    Mean  Std    Mean  Std    Diff  Sig    ±      Diff  Sig    ±
abalone      63.0  1.0    62.6  0.6    58.3  1.4    -0.4  80.8           4.3  100.0  +
g2c15        84.8  0.7    85.4  0.4    82.6  0.7     0.7  100.0          2.8  100.0  +
g2c25        74.4  0.8    74.4  0.7    71.6  1.0     0.0  0.7            2.8  100.0  +
g6c15        82.4  0.5    82.2  0.6    81.1  0.8    -0.1  49.1           1.1  100.0  +
g6c25        72.7  0.9    73.0  0.6    70.5  0.5     0.2  69.6           2.5  100.0  +
kr-vs-kp     99.1  0.2    98.5  0.6    99.2  0.2    -0.6  99.9          -0.7  100.0
letter       84.1  0.4    83.5  0.3    84.0  0.4    -0.6  100.0         -0.6  100.0
optdigits    88.1  1.0    87.9  0.8    88.3  0.8    -0.2  50.4          -0.4  83.1
pendigits    95.1  0.3    94.7  0.4    95.1  0.3    -0.4  99.9          -0.4  100.0
satellite    85.3  0.6    85.6  0.6    85.0  0.6     0.3  92.1           0.7  99.8
segment      94.0  1.0    94.2  0.7    94.6  0.6     0.1  39.1          -0.4  95.5
shuttle      99.9  0.0    99.9  0.0    99.9  0.0     0.0  92.2           0.0  90.8
splice       94.3  0.4    94.1  0.4    92.9  0.7    -0.2  84.9           1.2  100.0  +
waveform     75.8  1.0    76.5  0.6    75.0  0.6     0.8  99.5           1.6  100.0  +

Tree Size (in leaves), Training Ratio = 5%
             CCP           2-norm        EBP           2-norm : CCP        2-norm : EBP
Database     Mean   Std    Mean   Std    Mean   Std    Diff   Sig    ±     Diff   Sig    ±
abalone      6.9    5.2    9.5    3.2    43.2   5.8    2.6    93.7   −     -33.7  100.0  +
g2c15        7.0    2.6    2.2    0.7    14.2   5.2    -4.9   100.0  +     -12.0  100.0  +
g2c25        6.3    3.6    2.7    1.9    28.5   7.4    -3.6   100.0  +     -25.8  100.0  +
g6c15        8.8    1.9    7.4    1.4    20.9   4.3    -1.4   98.8   +     -13.5  100.0  +
g6c25        8.5    2.5    8.6    1.9    33.2   5.3    0.1    11.2         -24.6  100.0  +
kr-vs-kp     7.2    2.4    6.0    1.6    9.3    2.0    -1.2   93.2         -3.4   100.0  +
letter       171.9  81.9   232.2  11.0   279.4  8.6    60.3   99.8   −     -47.2  100.0  +
optdigits    19.1   4.9    31.1   5.5    47.7   4.4    12.0   100.0  −     -16.6  100.0  +
pendigits    39.8   13.3   36.6   3.5    55.6   4.7    -3.2   69.5         -19.0  100.0  +
satellite    9.0    4.0    17.5   3.3    34.6   4.0    8.5    100.0  −     -17.1  100.0  +
segment      8.5    1.8    9.4    1.3    12.6   1.3    0.9    90.5         -3.3   100.0  +
shuttle      8.6    2.4    6.3    1.7    11.7   1.5    -2.3   99.8   +     -5.4   100.0  +
splice       8.6    4.2    7.9    1.2    16.7   1.8    -0.7   51.9         -8.8   100.0  +
waveform     7.5    4.1    13.0   2.2    31.2   1.9    5.5    100.0  −     -18.2  100.0  +

Tree Size (in leaves), Training Ratio = 50%
             CCP            2-norm         EBP            2-norm : CCP         2-norm : EBP
Database     Mean    Std    Mean    Std    Mean    Std    Diff    Sig    ±     Diff    Sig    ±
abalone      12.2    7.3    72.1    11.2   382.8   13.8   59.9    100.0  −     -310.7  100.0  +
g2c15        10.6    7.2    2.0     0.0    114.0   15.3   -8.6    100.0  +     -112.0  100.0  +
g2c25        13.0    6.4    25.0    17.7   230.2   26.6   12.1    99.3   −     -205.2  100.0  +
g6c15        16.1    3.0    22.2    6.2    145.5   13.4   6.1     100.0  −     -123.4  100.0  +
g6c25        13.0    4.2    32.2    8.6    256.0   19.6   19.2    100.0  −     -223.8  100.0  +
kr-vs-kp     27.0    3.4    21.4    4.5    27.9    2.7    -5.7    100.0  +     -6.6    100.0  +
letter       1283.8  28.4   1083.0  26.7   1294.1  25.9   -200.8  100.0  +     -211.1  100.0  +
optdigits    94.3    16.1   128.4   7.7    210.9   11.0   34.1    100.0  −     -82.5   100.0  +
pendigits    177.3   38.9   141.6   9.9    199.0   7.2    -35.7   100.0  +     -57.4   100.0  +
satellite    44.8    17.6   96.3    7.4    221.5   10.3   51.5    100.0  −     -125.2  100.0  +
segment      28.3    9.6    29.1    2.4    43.2    3.1    0.9     29.7         -14.1   100.0  +
shuttle      23.3    4.5    19.6    1.8    21.2    2.6    -3.7    99.9   +     -1.7    97.3   +
splice       15.7    3.2    27.4    2.5    52.5    4.2    11.7    100.0  −     -25.1   100.0  +
waveform     28.3    11.3   82.3    7.1    234.9   6.6    54.1    100.0  −     -152.6  100.0  +

Note:
• Mean and Std are the mean and the standard deviation, respectively, over the 20 runs on each database.
• Diff is the difference of the means between the 2-norm pruning algorithm and the algorithm it is compared to.
• Sig is the significance level (%) of the comparison given by the T-test of the 20 observations.
• ± is the comparison result. A comparison result is shown only when significant, that is, when |Diff| is at least one and Sig is at least 95. If the 2-norm algorithm is significantly better, "+" is shown; if it is significantly worse, "−" is shown.
• All numbers are rounded to one digit after the decimal.

Table 2. Comparison of Tree Accuracy (upper two tables) and Tree Size (lower two tables)

The reason is that the 2-norm pruning algorithm does not access any training examples and finishes within one traversal of the tree.

Fig. 2 also shows that when the size of the training set is increased by ten times, the elapsed time of the 2-norm pruning algorithm increases by less than ten times, while the time of the other two algorithms increases by more than ten times. The explanation is as follows. The time complexity of the 2-norm pruning algorithm is O(M), where M is the number of nodes in the full-sized tree, and usually M ≪ N, where N is the number of training examples. The 10-fold cross-validation in CCP involves 10 additional rounds of tree growing. When only numerical attributes are present, the time complexity of the growing phase is at least O(N log N); when only categorical attributes are present, it is at least O(N). This explains why CCP's scalability is almost linear for the databases kr-vs-kp and splice (which have only categorical attributes) but worse for the other databases. EBP passes the training examples along the tree to evaluate the grafting option at each decision node; that is, it evaluates three cases: a) the split test is kept, and the examples are dispatched to the subtrees; b) all examples are sent to the left child; c) all examples are sent to the right child. Because the examples are not partitioned in the last two cases, the time complexity is also higher than O(N). This comparison shows that the 2-norm pruning algorithm scales very well to large databases.

[Figure 2. Comparison of Elapsed Time. For each (database, algorithm) combination, two bars are shown: the outer, darker bar represents the mean time (in seconds) of the 20 runs for the 50% training ratio; the inner, lighter bar represents the mean time for the 5% training ratio. Since the horizontal (time) axis is logarithmic, the length difference between the two bars represents the ratio of the former time to the latter (a smaller ratio means better scalability).]

5 Conclusion

In this paper we tested a new pruning algorithm for decision tree classifiers: the 2-norm pruning algorithm. We compared this algorithm to two classical pruning algorithms: CART's CCP and C4.5's EBP. Our experimental results reinforce the superiority of the 2-norm pruning algorithm, especially in speed, since it requires only one traversal of the tree to estimate and minimize the tree's error rate. In particular, the 2-norm pruning algorithm outperformed CCP in speed by, at times, 3, 4, or 5 orders of magnitude, and outperformed EBP in speed by, at times, 1, 2, or 3 orders of magnitude. Overall, the 2-norm pruning algorithm was shown to scale well as the size of the training set increases. In a number of instances, the 2-norm algorithm was also found to be superior in accuracy to CCP and EBP. The trees produced by the 2-norm algorithm were smaller than those produced by EBP, reaffirming the well-known observation in the literature that EBP under-prunes; for larger training set sizes, however, the trees found by CCP were smaller than those found by the 2-norm algorithm. Finally, it is worth emphasizing that the 2-norm pruning algorithm rests on a sound theoretical foundation. Summarizing, it is fair to claim that the 2-norm pruning algorithm compares very favorably with two of the most well-known and frequently used pruning algorithms, CCP and EBP.

Acknowledgment

This work was supported in part by the National Science Foundation (NSF) grant CRCD 0203446 and the National Science Foundation grant DUE 05254209. Georgios C. Anagnostopoulos and Michael Georgiopoulos also acknowledge the partial support from the NSF grant CCLI 0341601.

References

[1] A. R. Barron. Complexity regularization with application to artificial neural networks. In G. Roussas, editor, Nonparametric Functional Estimation and Related Topics, pages 561–576. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1991.

[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.

[3] T. Elomaa and M. Kääriäinen. An analysis of reduced error pruning. Journal of Artificial Intelligence Research, 15:163–187, 2001.

[4] F. Esposito, D. Malerba, and G. Semeraro. A comparative analysis of methods for pruning decision trees. IEEE Trans. Pattern Anal. Machine Intell., 19(5):476–491, 1997.

[5] Y. Mansour and D. McAllester. Generalization bounds for decision trees. In Proc. 13th Annu. Conference on Comput. Learning Theory, pages 69–80. Morgan Kaufmann, San Francisco, 2000.

[6] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.

[7] T. Niblett and I. Bratko. Learning decision rules in noisy domains. In Expert Systems. Cambridge University Press, 1986.

[8] A. B. Nobel. Analysis of a complexity-based pruning scheme for classification trees. IEEE Trans. Inform. Theory, 48(8):2362–2368, 2002.

[9] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[10] C. Scott and R. Nowak. Dyadic classification trees via structural risk minimization, 2002.

[11] M. Zhong, M. Georgiopoulos, and G. C. Anagnostopoulos. Theoretical analysis of risk estimation for decision tree classifiers. Machine Learning, under review.
