Generalization in decision trees and DNF: Does size matter?

Mostefa Golea$^1$, Peter L. Bartlett$^1$, Wee Sun Lee$^2$ and Llew Mason$^1$

$^1$ Department of Systems Engineering, Research School of Information Sciences and Engineering, Australian National University, Canberra, ACT 0200, Australia

$^2$ School of Electrical Engineering, University College UNSW, Australian Defence Force Academy, Canberra, ACT 2600, Australia
Abstract

Recent theoretical results for pattern classification with thresholded real-valued functions (such as support vector machines, sigmoid networks, and boosting) give bounds on misclassification probability that do not depend on the size of the classifier, and hence can be considerably smaller than the bounds that follow from the VC theory. In this paper, we show that these techniques can be more widely applied, by representing other boolean functions as two-layer neural networks (thresholded convex combinations of boolean functions). For example, we show that with high probability any decision tree of depth no more than $d$ that is consistent with $m$ training examples has misclassification probability no more than $O\!\left(\left(\frac{1}{m}\,N_e\,\mathrm{VCdim}(U)\log^2 m\,\log d\right)^{1/2}\right)$, where $U$ is the class of node decision functions, and $N_e \le N$ can be thought of as the effective number of leaves (it becomes small as the distribution on the leaves induced by the training data gets far from uniform). This bound is qualitatively different from the VC bound and can be considerably smaller. We use the same technique to give similar results for DNF formulae.
Author to whom correspondence should be addressed
1 INTRODUCTION

Decision trees are widely used for pattern classification [2, 7]. For these problems, results from the VC theory suggest that the amount of training data should grow at least linearly with the size of the tree [4, 3]. However, empirical results suggest that this is not necessary (see [6, 10]). For example, it has been observed that the error rate is not always a monotonically increasing function of the tree size [6]. To see why the size of a tree is not always a good measure of its complexity, consider two trees, $A$ with $N_A$ leaves and $B$ with $N_B$ leaves, where $N_B < N_A$. Although $A$ is larger than $B$, if most of the classification in $A$ is carried out by very few leaves and the classification in $B$ is equally distributed over the leaves, intuition suggests that $A$ is actually much simpler than $B$, since tree $A$ can be approximated well by a small tree with few leaves. In this paper, we formalize this intuition.

We give misclassification probability bounds for decision trees in terms of a new complexity measure that depends on the distribution on the leaves induced by the training data, and that can be considerably smaller than the size of the tree. These results build on recent theoretical results that give misclassification probability bounds for thresholded real-valued functions, including support vector machines, sigmoid networks, and boosting (see [1, 8, 9]), that do not depend on the size of the classifier. We extend these results to decision trees by considering a decision tree as a thresholded convex combination of the leaf functions (the boolean functions that specify, for a given leaf, which patterns reach that leaf). We can then apply the misclassification probability bounds for such classifiers. In fact, we derive and use a refinement of the previous bounds for convex combinations of base hypotheses, in which the base hypotheses can come from several classes of different complexity, and the VC-dimension of the base hypothesis class is replaced by the average (under the convex coefficients) of the VC-dimensions of these classes. For decision trees, the bounds we obtain depend on the effective number of leaves, a data-dependent quantity that reflects how uniformly the training data covers the tree's leaves. This bound is qualitatively different from the VC bound, which depends on the total number of leaves in the tree.

In the next section, we give some definitions and describe the techniques used. We present bounds on the misclassification probability of a thresholded convex combination of boolean functions from base hypothesis classes, in terms of a misclassification margin and the average VC-dimension of the base hypotheses. In Sections 3 and 4, we use this result to give error bounds for decision trees and disjunctive normal form (DNF) formulae.
2 GENERALIZATION ERROR IN TERMS OF MARGIN AND AVERAGE COMPLEXITY

We begin with some definitions. For a class $H$ of $\{-1,1\}$-valued functions defined on the input space $X$, the convex hull $\mathrm{co}(H)$ of $H$ is the set of $[-1,1]$-valued functions of the form $\sum_i a_i h_i$, where $a_i \ge 0$, $\sum_i a_i = 1$, and $h_i \in H$. A function in $\mathrm{co}(H)$ is used for classification by composing it with the threshold function, $\mathrm{sgn}: \mathbb{R} \to \{-1,1\}$, which satisfies $\mathrm{sgn}(\alpha) = 1$ iff $\alpha \ge 0$. So $f \in \mathrm{co}(H)$ makes a mistake on the pair $(x,y) \in X \times \{-1,1\}$ iff $\mathrm{sgn}(f(x)) \ne y$. We assume that labelled examples $(x,y)$ are generated according to some probability distribution $D$ on $X \times \{-1,1\}$, and we let $P_D[E]$ denote the probability under $D$ of an event $E$. If $S$ is a finite subset of $Z = X \times \{-1,1\}$, we let $P_S[E]$ denote the empirical probability of $E$ (that is, the proportion of points in $S$ that lie in $E$). We use $E_D[\cdot]$ and $E_S[\cdot]$ to denote expectation in a similar way. For a function class $H$ of $\{-1,1\}$-valued functions defined on the input
space $X$, the growth function and VC dimension of $H$ will be denoted by $\Pi_H(m)$ and $\mathrm{VCdim}(H)$ respectively. In [8], Schapire et al give the following bound on the misclassification probability of a thresholded convex combination of functions, in terms of the proportion of training data that is labelled to the correct side of the threshold by some margin. (Notice that $P_D[\mathrm{sgn}(f(x)) \ne y] \le P_D[yf(x) \le 0]$.)

Theorem 1 ([8]) Let $D$ be a distribution on $X \times \{-1,1\}$, $H$ a hypothesis class with $\mathrm{VCdim}(H) = d < \infty$, and $\delta > 0$. With probability at least $1-\delta$ over a training set $S$ of $m$ examples chosen according to $D$, every function $f \in \mathrm{co}(H)$ and every $\theta > 0$ satisfy
$$P_D[yf(x) \le 0] \;\le\; P_S[yf(x) \le \theta] + O\!\left(\frac{1}{\sqrt{m}}\left(\frac{d\log^2(m/d)}{\theta^2} + \log(1/\delta)\right)^{1/2}\right).$$
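As an informal illustration, the quantity $P_S[yf(x) \le \theta]$ appearing in Theorem 1 is simply the fraction of training examples whose margin $yf(x)$ falls at or below $\theta$. The following minimal sketch (the function and variable names are hypothetical, not from the paper) computes it for a convex combination given as a matrix of $\pm 1$ base predictions:

```python
import numpy as np

def margin_distribution(base_preds, weights, y, theta):
    """Empirical margin error P_S[y f(x) <= theta] for f = sum_i a_i h_i.

    base_preds: array of shape (n_hypotheses, n_examples), entries in {-1, +1}
    weights:    convex coefficients a_i (non-negative; normalized below)
    y:          labels in {-1, +1}, shape (n_examples,)
    theta:      margin threshold
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # enforce a convex combination
    f = weights @ np.asarray(base_preds)       # f(x) in [-1, 1] for each example
    margins = y * f                            # y f(x)
    return float(np.mean(margins <= theta))

# Example: three base hypotheses voting on four examples.
h = np.array([[1, 1, -1, -1],
              [1, -1, -1, 1],
              [1, 1, 1, -1]])
y = np.array([1, 1, -1, -1])
print(margin_distribution(h, [0.5, 0.25, 0.25], y, theta=0.2))
```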
In Theorem 1, all of the base hypotheses in the convex combination $f$ are elements of a single class $H$ with bounded VC-dimension. The following theorem generalizes this result to the case in which these base hypotheses may be chosen from any of $k$ classes, $H_1,\ldots,H_k$, which can have different VC-dimensions. It also gives a related result that shows the error decreases to twice the error estimate at a faster rate.

Theorem 2 Let $D$ be a distribution on $X \times \{-1,1\}$, $H_1,\ldots,H_k$ hypothesis classes with $\mathrm{VCdim}(H_i) = d_i$, and $\delta > 0$. With probability at least $1-\delta$ over a training set $S$ of $m$ examples chosen according to $D$, every function $f \in \mathrm{co}\!\left(\bigcup_{i=1}^k H_i\right)$ and every $\theta > 0$ satisfy both
$$P_D[yf(x) \le 0] \;\le\; P_S[yf(x) \le \theta] + O\!\left(\frac{1}{\sqrt{m}}\left(\frac{1}{\theta^2}\left(\bar d\log m + \log k\right)\log\!\left(m\theta^2/\bar d\right) + \log(1/\delta)\right)^{1/2}\right),$$
$$P_D[yf(x) \le 0] \;\le\; 2P_S[yf(x) \le \theta] + O\!\left(\frac{1}{m}\left(\frac{1}{\theta^2}\left(\bar d\log m + \log k\right)\log\!\left(m\theta^2/\bar d\right) + \log(1/\delta)\right)\right),$$
where $\bar d = \sum_i a_i d_{j_i}$ and the $a_i$ and $j_i$ are defined by $f = \sum_i a_i h_i$ and $h_i \in H_{j_i}$ for $j_i \in \{1,\ldots,k\}$.

Proof sketch: We shall sketch only the proof of the first inequality of the theorem. The proof closely follows the proof of Theorem 1 (see [8]). We consider a number of approximating sets of the form $C_{N,l} = \left\{(1/N)\sum_{i=1}^N \hat h_i : \hat h_i \in H_{l_i}\right\}$, where $l = (l_1,\ldots,l_N) \in \{1,\ldots,k\}^N$ and $N \in \mathbb{N}$. Define $C_N = \bigcup_l C_{N,l}$.

For a given $f = \sum_i a_i h_i$ from $\mathrm{co}\!\left(\bigcup_{i=1}^k H_i\right)$, we shall choose an approximation $g \in C_N$ by choosing $\hat h_1,\ldots,\hat h_N$ independently from $\{h_1, h_2, \ldots\}$, according to the distribution defined by the coefficients $a_i$. Let $Q$ denote this distribution on $C_N$. As in [8], we can take the expectation under this random choice of $g \in C_N$ to show that, for any $\theta > 0$, $P_D[yf(x) \le 0] \le E_{g\sim Q}\left[P_D[yg(x) \le \theta/2]\right] + \exp(-N\theta^2/8)$. Now, for a given $l \in \{1,\ldots,k\}^N$, the probability that there is a $g$ in $C_{N,l}$ and a $\theta > 0$ for which
$$P_D[yg(x) \le \theta/2] > P_S[yg(x) \le \theta/2] + \epsilon_{N,l}$$
is at most $8(N+1)\prod_{i=1}^N \left(\frac{2em}{d_{l_i}}\right)^{d_{l_i}} \exp(-m\epsilon_{N,l}^2/32)$. Applying the union bound (over the values of $l$), taking expectation over $g \sim Q$, and setting
$$\epsilon_{N,l} = \left(\frac{32}{m}\ln\!\left(\frac{8(N+1)\,k^N}{\delta_N}\prod_{i=1}^N \left(\frac{2em}{d_{l_i}}\right)^{d_{l_i}}\right)\right)^{1/2}$$
shows that, with probability at least $1 - \delta_N$, every $f$ and $\theta > 0$ satisfy $P_D[yf(x) \le 0] \le E_g\left[P_S[yg(x) \le \theta/2]\right] + E_g[\epsilon_{N,l}] + \exp(-N\theta^2/8)$. As above, we can bound the probability inside the first expectation in terms of $P_S[yf(x) \le \theta]$. Also, Jensen's inequality implies that
$$E_g[\epsilon_{N,l}] \le \left(\frac{32}{m}\left(\ln(8(N+1)/\delta_N) + N\ln k + N\sum_i a_i d_{j_i} \ln(2em)\right)\right)^{1/2}.$$
Setting $\delta_N = \delta/(N(N+1))$ and $N = \frac{8}{\theta^2}\ln\frac{m\theta^2}{32\bar d}$ gives the result.
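To make the role of the average VC-dimension $\bar d = \sum_i a_i d_{j_i}$ concrete, here is a small illustrative sketch (the helper names are hypothetical, and the constant hidden by the $O(\cdot)$ in the first bound of Theorem 2 is taken as 1 purely for illustration):

```python
import math

def average_vc_dimension(weights, class_index, class_dims):
    """d_bar = sum_i a_i * d_{j_i}: the VC-dim of each base hypothesis's class,
    averaged under the convex coefficients."""
    total = sum(weights)
    return sum(a / total * class_dims[j] for a, j in zip(weights, class_index))

def theorem2_first_bound(emp_margin_error, m, d_bar, k, theta, delta):
    """Order of the first bound of Theorem 2 (hidden constant taken as 1)."""
    complexity = ((d_bar * math.log(m) + math.log(k)) *
                  math.log(m * theta**2 / d_bar) / theta**2 + math.log(1 / delta))
    return emp_margin_error + math.sqrt(complexity / m)

# Three base hypotheses: two from a class of VC-dim 2, one from a class of VC-dim 10.
weights = [0.6, 0.3, 0.1]      # convex coefficients a_i
class_index = [0, 0, 1]        # h_1, h_2 from H_1; h_3 from H_2
class_dims = [2, 10]
d_bar = average_vc_dimension(weights, class_index, class_dims)
print(d_bar)                                                  # 2.8
print(theorem2_first_bound(0.05, m=10000, d_bar=d_bar, k=2,
                           theta=0.25, delta=0.05))
```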
Theorem 2 gives misclassification probability bounds only for thresholded convex combinations of boolean functions. The key technique we use in the remainder of the paper is to find representations in this form (that is, as two-layer neural networks) of more arbitrary boolean functions. We have some freedom in choosing the convex coefficients, and this choice affects both the error estimate $P_S[yf(x) \le \theta]$ and the average VC-dimension $\bar d$. We attempt to choose the coefficients and the margin so as to optimize the resulting bound on misclassification probability. In the next two sections, we use this approach to find misclassification probability bounds for decision trees and DNF formulae.
3 DECISION TREES

A two-class decision tree $T$ is a tree whose internal decision nodes are labeled with boolean functions from some class $U$ and whose leaves are labeled with class labels from $\{-1,+1\}$. For a tree with $N$ leaves, define the leaf functions $h_i: X \to \{-1,1\}$ by $h_i(x) = 1$ iff $x$ reaches leaf $i$, for $i = 1,\ldots,N$. Note that $h_i$ is the conjunction of all tests on the path from the root to leaf $i$. For a sample $S$ and a tree $T$, let $P_i = P_S[h_i(x) = 1]$. Clearly, $P = (P_1,\ldots,P_N)$ is a probability vector. Let $\sigma_i \in \{-1,+1\}$ denote the class assigned to leaf $i$. Define the class of leaf functions for leaves up to depth $j$ as
$$H_j = \{h : h = u_1 \wedge u_2 \wedge \cdots \wedge u_r \mid r \le j,\ u_i \in U\}.$$
It is easy to show that $\mathrm{VCdim}(H_j) \le 2j\,\mathrm{VCdim}(U)\ln(2ej)$. Let $d_i$ denote the depth of leaf $i$, so $h_i \in H_{d_i}$, and let $d = \max_i d_i$. The boolean function implemented by a decision tree $T$ can be written as a thresholded convex combination of the form $T(x) = \mathrm{sgn}(f(x))$, where
$$f(x) = \sum_{i=1}^N \sigma_i w_i \left(\frac{h_i(x)+1}{2}\right) = \sum_{i=1}^N \sigma_i w_i h_i(x)/2 + \sum_{i=1}^N \sigma_i w_i/2,$$
with $w_i > 0$ and $\sum_{i=1}^N w_i = 1$. (To be precise, we need to enlarge the classes $H_j$ slightly to be closed under negation. This does not affect the results by more than a constant.) We first assume that the tree is consistent with the training sample. We will show later how the results extend to the inconsistent case. The second inequality of Theorem 2 shows that, for fixed $\delta > 0$, there is a constant $c$ such that, for any distribution $D$, with probability at least $1-\delta$ over the sample $S$ we have
$$P_D[T(x) \ne y] \;\le\; 2P_S[yf(x) \le \theta] + \frac{1}{\theta^2}\sum_{i=1}^N w_i d_i B,$$
where $B = \frac{c}{m}\,\mathrm{VCdim}(U)\log^2 m\,\log d$. Different choices of the $w_i$'s and the $\theta$ will yield different estimates of the error rate of $T$. We can assume (wlog) that $P_1 \ge \cdots \ge P_N$. A natural choice is $w_i = P_i$ and $P_{j+1} < \theta \le P_j$ for some $j \in \{1,\ldots,N\}$, which gives
$$P_D[T(x) \ne y] \;\le\; 2\sum_{i=j+1}^N P_i + \frac{\bar d B}{\theta^2}, \qquad (1)$$
where $\bar d = \sum_{i=1}^N P_i d_i$. We can optimize this expression over the choices of $j \in \{1,\ldots,N\}$ and $\theta$ to give a bound on the misclassification probability of the tree. Let $\rho(P,U) = \sum_{i=1}^N (P_i - 1/N)^2$ be the quadratic distance between the probability vector $P = (P_1,\ldots,P_N)$ and the uniform probability vector $U = (1/N, 1/N, \ldots, 1/N)$. Define $N_e = N(1 - \rho(P,U))$. The parameter $N_e$ is a measure of the effective number of leaves in the tree. The following theorem gives bounds on misclassification probability in terms of $N_e$.

Theorem 3 For a fixed $\delta > 0$, there is a constant $c$ that satisfies the following. Let $D$ be a distribution on $X \times \{-1,1\}$. Consider the class of decision trees of depth up to $d$, with decision functions in $U$. With probability at least $1-\delta$ over the training set $S$ (of size $m$), every decision tree $T$ that is consistent with $S$ has
$$P_D[T(x) \ne y] \;\le\; c\left(\frac{N_e\,\mathrm{VCdim}(U)\log^2 m\,\log d}{m}\right)^{1/2},$$
where $N_e$ is the effective number of leaves of $T$.

Proof: Supposing that $\theta \ge (\bar d/N)^{1/2}$, we optimize (1) by choice of $\theta$. If the chosen $\theta$ is actually smaller than $(\bar d/N)^{1/2}$, then we show that the optimized bound still holds by a standard VC result. If $\theta \ge (\bar d/N)^{1/2}$ then $\sum_{i=j+1}^N P_i \le \theta^2 N_e/\bar d$. So (1) implies that $P_D[T(x) \ne y] \le 2\theta^2 N_e/\bar d + \bar d B/\theta^2$. The optimal choice of $\theta$ is then $(\tfrac{1}{2}\bar d^2 B/N_e)^{1/4}$. So if $(\tfrac{1}{2}\bar d^2 B/N_e)^{1/4} \ge (\bar d/N)^{1/2}$, we have the result. Otherwise, the upper bound we need to prove satisfies $2(2N_e B)^{1/2} > 2NB$, and this result is implied by standard VC results using a simple upper bound for the growth function of the class of decision trees with $N$ leaves.
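To illustrate the definition of $N_e$ used in Theorem 3, the following sketch (an illustration with hypothetical helper names, not code from the analysis above) computes the effective number of leaves from the number of training examples reaching each leaf; a tree whose traffic is concentrated on a few leaves gets an $N_e$ far smaller than its leaf count:

```python
def effective_num_leaves(leaf_counts):
    """N_e = N * (1 - sum_i (P_i - 1/N)^2), where P_i is the fraction of
    training examples reaching leaf i."""
    n = len(leaf_counts)
    m = sum(leaf_counts)
    p = [c / m for c in leaf_counts]
    quad_dist = sum((pi - 1.0 / n) ** 2 for pi in p)
    return n * (1.0 - quad_dist)

# Uniform traffic over 8 leaves: N_e equals the leaf count.
print(effective_num_leaves([125] * 8))                     # 8.0
# Almost all examples reach one leaf: N_e is close to 1.
print(effective_num_leaves([993, 1, 1, 1, 1, 1, 1, 1]))    # about 1.1
```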
Thus the parameters that quantify the complexity of a tree are: a) the complexity of the test function class $U$, and b) the effective number of leaves $N_e$. The effective number of leaves can potentially be much smaller than the total number of leaves in the tree [5]. Since this parameter is data-dependent, the same tree can be simple for one set of $P_i$'s and complex for another set of $P_i$'s. For trees that are not consistent with the training data, the procedure to estimate the error rate is similar. By defining $Q_i = P_S[y\sigma_i = -1 \mid h_i(x) = 1]$ and $P_i' = P_i(1 - Q_i)/(1 - P_S[T(x) \ne y])$ we obtain the following result.

Theorem 4 For a fixed $\delta > 0$, there is a constant $c$ that satisfies the following. Let $D$ be a distribution on $X \times \{-1,1\}$. Consider the class of decision trees of depth up to $d$, with decision functions in $U$. With probability at least $1-\delta$ over the training set $S$ (of size $m$), every decision tree $T$ has
$$P_D[T(x) \ne y] \;\le\; P_S[T(x) \ne y] + c\left(\frac{N_e\,\mathrm{VCdim}(U)\log^2 m\,\log d}{m}\right)^{1/3},$$
where $c$ is a universal constant, and $N_e = N(1 - \rho(P', U))$ is the effective number of leaves of $T$.
4 DNF AS THRESHOLDED CONVEX COMBINATIONS

A DNF formula defined on $\{-1,1\}^n$ is a disjunction of terms, where each term is a conjunction of literals and a literal is either a variable or its negation. For a given DNF formula $g$, we use $N$ to denote the number of terms in $g$, $t_i$ to represent the $i$th term in $g$, $L_i$ to represent the set of literals in $t_i$, and $N_i$ the size of $L_i$. Each term $t_i$ can be thought of as a member of the class $H_{N_i}$, the set of monomials with $N_i$ literals. Clearly, $|H_j| = \binom{2n}{j}$. The DNF $g$ can be written as a thresholded convex combination of the form
$$g(x) = -\mathrm{sgn}(-f(x)) = -\mathrm{sgn}\!\left(-\sum_{i=1}^N w_i\left(\frac{t_i(x)+1}{2}\right)\right).$$
(Recall that $\mathrm{sgn}(\alpha) = 1$ iff $\alpha \ge 0$.) Further, each term $t_i$ can be written as a thresholded convex combination of the form
$$t_i(x) = \mathrm{sgn}(f_i(x)) = \mathrm{sgn}\!\left(\sum_{l_k \in L_i} v_{ik}\left(\frac{l_k(x)-1}{2}\right)\right).$$
Assume for simplicity that the DNF is consistent (the results extend easily to the inconsistent case). Let $\pi_+$ ($\pi_-$) denote the fraction of positive (negative) examples under distribution $D$. Let $P_{D_+}[\cdot]$ ($P_{D_-}[\cdot]$) denote probability with respect to the distribution over the positive (negative) examples, and let $P_{S_+}[\cdot]$ ($P_{S_-}[\cdot]$) be defined similarly, with respect to the sample $S$. Notice that $P_D[g(x) \ne y] = \pi_+ P_{D_+}[g(x) = -1] + \pi_- P_{D_-}[(\exists i)\ t_i(x) = 1]$, so the second inequality of Theorem 2 shows that, with probability at least $1-\delta$, for any $\theta$ and any $\theta_i$'s,
$$P_D[g(x) \ne y] \;\le\; \pi_+\left(2P_{S_+}[f(x) \le \theta] + \frac{\bar d B}{\theta^2}\right) + \pi_-\sum_{i=1}^N\left(2P_{S_-}[-f_i(x) \le \theta_i] + \frac{B}{\theta_i^2}\right),$$
where $\bar d = \sum_{i=1}^N w_i N_i$ and $B = c\left(\log n\,\log^2 m + \log(N/\delta)\right)/m$.
As in the case of decision trees, different choices of $\theta$, the $\theta_i$'s, and the weights yield different estimates of the error. For an arbitrary order of the terms, let $P_i$ be the fraction of positive examples covered by term $t_i$ but not by terms $t_{i-1},\ldots,t_1$. We order the terms such that for each $i$, with $t_{i-1},\ldots,t_1$ fixed, $P_i$ is maximized, so that $P_1 \ge \cdots \ge P_N$, and we choose $w_i = P_i$. Likewise, for a given term $t_i$ with literals $l_1,\ldots,l_{N_i}$ in an arbitrary order, let $P_k^{(i)}$ be the fraction of negative examples uncovered by literal $l_k$ but not uncovered by $l_{k-1},\ldots,l_1$. We order the literals of term $t_i$ in the same greedy way as above so that $P_1^{(i)} \ge \cdots \ge P_{N_i}^{(i)}$, and we choose $v_{ik} = P_k^{(i)}$. For $P_{j+1} < \theta \le P_j$ and $P_{j_i+1}^{(i)} < \theta_i \le P_{j_i}^{(i)}$, where $1 \le j \le N$ and $1 \le j_i \le N_i$, we get
$$P_D[g(x) \ne y] \;\le\; \pi_+\left(2\sum_{i=j+1}^N P_i + \frac{\bar d B}{\theta^2}\right) + \pi_-\sum_{i=1}^N\left(2\sum_{k=j_i+1}^{N_i} P_k^{(i)} + \frac{B}{\theta_i^2}\right).$$
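The ordering above is essentially a greedy, set-cover-style ordering of the terms by marginal coverage. A minimal sketch (an illustration only; term coverage is represented as hypothetical Python sets of positive-example indices) that computes the ordered marginal coverages $P_i$ used as the weights $w_i$:

```python
def greedy_term_order(term_coverage, num_positive):
    """Order terms so that each one maximizes the fraction of positive
    examples it covers that earlier terms did not; return (order, P_i list)."""
    remaining = list(range(len(term_coverage)))
    covered = set()
    order, marginal = [], []
    while remaining:
        best = max(remaining, key=lambda t: len(term_coverage[t] - covered))
        gain = len(term_coverage[best] - covered) / num_positive
        order.append(best)
        marginal.append(gain)           # P_i, also used as the weight w_i
        covered |= term_coverage[best]
        remaining.remove(best)
    return order, marginal

# Terms' coverage over 10 positive examples (indices 0..9).
coverage = [{0, 1, 2, 3, 4, 5}, {4, 5, 6, 7}, {8}, {0, 1, 8, 9}]
order, P = greedy_term_order(coverage, num_positive=10)
print(order)   # [0, 1, 3, 2]
print(P)       # non-increasing marginal coverage fractions
```

The same greedy procedure applies, per term, to the literals (ordered by the marginal fraction of negative examples they uncover).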
Now, let $P = (P_1,\ldots,P_N)$ and for each term $i$ let $P^{(i)} = (P_1^{(i)},\ldots,P_{N_i}^{(i)})$. Define $N_e = N(1 - \rho(P,U))$ and $N_e^{(i)} = N_i(1 - \rho(P^{(i)},U))$, where $U$ is the relevant uniform distribution in each case. The parameter $N_e$ is a measure of the effective number of terms in the DNF formula. It can be much smaller than $N$; this would be the case if few terms cover a large fraction of the positive examples. The parameter $N_e^{(i)}$ is a measure of the effective number of literals in term $t_i$. Again, it can be much smaller than the actual number of literals in $t_i$: this would be the case if few literals of the term uncover a large fraction of the negative examples. Optimizing over $\theta$ and the $\theta_i$'s as in the proof of Theorem 3 gives the following result.

Theorem 5 For a fixed $\delta > 0$, there is a constant $c$ that satisfies the following. Let $D$ be a distribution on $X \times \{-1,1\}$. Consider the class of DNF formulae with up to $N$ terms. With probability at least $1-\delta$ over the training set $S$ (of size $m$), every DNF formula $g$ that is consistent with $S$ has
$$P_D[g(x) \ne y] \;\le\; \pi_+\left(N_e \bar d B\right)^{1/2} + \pi_-\sum_{i=1}^N \left(N_e^{(i)} B\right)^{1/2},$$
where $\bar d = \max_{i=1}^N N_i$, $\pi_+ = P_D[y = 1]$, and $B = c\left(\log n\,\log^2 m + \log(N/\delta)\right)/m$.
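Continuing the illustration (again a sketch with hypothetical names, and with the hidden constant $c$ in $B$ taken as 1), the same quadratic-distance formula gives the effective number of terms from the greedy coverage vector $P$, and the pieces can be assembled into the shape of the Theorem 5 bound:

```python
import math

def effective_number(p):
    """N_e = N * (1 - sum_i (p_i - 1/N)^2) for a probability vector p."""
    n = len(p)
    return n * (1.0 - sum((pi - 1.0 / n) ** 2 for pi in p))

def theorem5_bound(pi_plus, P_terms, P_literals, n_vars, m, delta, c=1.0):
    """Shape of the Theorem 5 bound: pi_+ (N_e d_bar B)^(1/2)
    + pi_- sum_i (N_e^(i) B)^(1/2), with the constant c taken as 1."""
    N = len(P_terms)
    d_bar = max(len(p) for p in P_literals)            # max_i N_i
    B = c * (math.log(n_vars) * math.log(m) ** 2 + math.log(N / delta)) / m
    pos_part = pi_plus * math.sqrt(effective_number(P_terms) * d_bar * B)
    neg_part = (1 - pi_plus) * sum(math.sqrt(effective_number(p) * B)
                                   for p in P_literals)
    return pos_part + neg_part

# Two terms covering positives unevenly; per-term literal coverage of negatives.
P_terms = [0.9, 0.1]
P_literals = [[0.7, 0.2, 0.1], [0.5, 0.5]]
print(theorem5_bound(pi_plus=0.6, P_terms=P_terms, P_literals=P_literals,
                     n_vars=50, m=20000, delta=0.05))
```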
5 CONCLUSIONS

The results in this paper show that structural complexity measures (such as size) of decision trees and DNF formulae are not always the most appropriate in determining their generalization behaviour, and that measures of complexity that depend on the training data may give a more accurate description. Our analysis can be extended to multi-class classification problems. A similar analysis implies similar bounds on misclassification probability for decision lists, and it seems likely that these techniques will also be applicable to other pattern classification methods. The complexity parameter $N_e$ described here does not always give the best possible error bounds. For example, the effective number of leaves $N_e$ in a decision tree can be thought of as a single number that summarizes the probability distribution over the leaves induced by the training data. It seems unlikely that such a number will give optimal bounds for all distributions. In those cases, better bounds could be obtained by using numerical techniques to optimize over the choice of $\theta$ and the $w_i$'s. It would be interesting to see how the bounds we obtain and those given by numerical techniques reflect the generalization performance of classifiers used in practice.
Acknowledgements
Thanks to Yoav Freund and Rob Schapire for helpful comments.
References
[1] P. L. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Neural Information Processing Systems 9, pages 134-140. Morgan Kaufmann, San Mateo, CA, 1997.
[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.
[3] A. Ehrenfeucht and D. Haussler. Learning decision trees from random examples. Information and Computation, 82:231-246, 1989.
[4] U. M. Fayyad and K. B. Irani. What should be minimized in a decision tree? In AAAI-90, pages 749-754, 1990.
[5] R. C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11:63-91, 1993.
[6] P. M. Murphy and M. J. Pazzani. Exploring the decision forest: An empirical investigation of Occam's razor in decision tree induction. Journal of Artificial Intelligence Research, 1:257-275, 1994.
[7] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.
[8] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 322-330, 1997.
[9] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. A framework for structural risk minimisation. In Proc. 9th COLT, pages 68-76. ACM Press, New York, NY, 1996.
[10] G. I. Webb. Further experimental evidence against the utility of Occam's razor. Journal of Artificial Intelligence Research, 4:397-417, 1996.