Large Margin Trees for Induction and Transduction
Donghui Wu Dept of Mathematical Sciences Rensselaer Polytechnic Institute 110 8th St., Troy, NY 12180, USA (518)276-6899
Kristin P. Bennett Dept of Mathematical Sciences Rensselaer Polytechnic Institute 110 8th St., Troy, NY 12180, USA (518)276-6899
Nello Cristianini Dept of Engineering Mathematics University of Bristol Bristol BS8 1TR, UK
John Shawe-Taylor Dept of Computer Science Royal Holloway, Univ. of London Egham, Surrey TW20 0EX, UK
[email protected]
[email protected]
[email protected]
nello.cristianini@bristol.ac.uk
February 1, 1999

Abstract
The problem of controlling the capacity of decision trees is considered for the case where the decision nodes implement linear threshold functions. In addition to the standard early stopping and pruning procedures, we implement a strategy based on the margins of the decision boundaries at the nodes. The approach is motivated by bounds on generalization error obtained in terms of the margins of the individual classifiers. Experimental results are given which demonstrate that considerable advantage can be derived from using the margin information. The same strategy is applied to the problem of transduction, where the positions of the test points are revealed to the training algorithm. This information is used to generate an alternative training criterion motivated by transductive theory. In the transductive case, the results are not as encouraging, suggesting that little, if any, consistent advantage is culled from using the unlabelled data in the proposed fashion. This conclusion does not contradict theoretical results, but leaves open the theoretical and practical question of whether more effective use can be made of the additional information.
1 Introduction

Perceptron Decision Trees (PDT) are decision trees in which each internal node is associated with a hyperplane in general position in the input space. Given their high flexibility, a feature that they share with more standard decision trees, they tend to overfit the data if their complexity is not somehow kept under control. The standard approach to controlling their complexity is to limit their size, with early stopping or pruning procedures. In this paper we use their margin (namely, the minimum distance between the decision boundaries and the training points) to control their capacity both in an inductive and in a transductive setting. Transduction has been proposed by Vapnik [10] as an alternative form of inference, which skips the inductive step of generating a hypothesis and directly infers the labels on a given test set from the training data.

In this paper we examine methods for incorporating margin-based capacity control into PDTs for both inductive and transductive inference. We demonstrate that by incorporating margin maximization into an existing PDT algorithm, using relatively minor algorithmic changes, generalization can be greatly enhanced. For the inductive case, the theoretical motivation behind the use of the margin lies in Data-Dependent Structural Risk Minimization [6]: the scale of the cover used in VC theory to provide a bound on the generalization error depends on the margins, and hence the hierarchy of classes is chosen in response to the data. For the transductive case, where the consistent hyperplane with maximal margin on both the training and test points is selected, similar bounds can be obtained more directly by defining a hierarchy of classes of hyperplanes derived from their margins on the training and test points. The bounds obtained are qualitatively similar, so the question of what advantage can be obtained using this method remains open. The ramification of this theory is also quite intuitive: decisions with wider margins are better. The two trees in Figure 1 have the same number of decisions and the same training accuracy, but the right tree has a smaller probability of future error since it has wider margins.
Figure 1: Small Margin versus Large Margin Decision Trees

The algorithms described here are greedy procedures aimed at producing large margin trees. We propose three such algorithms for induction, FAT, MOC1 and MOC2, and one for transduction, MTT, comparing their performance with that of OC1 [4], one of the best known PDT learning systems. Practically, these algorithms were implemented either by modifying the splitting criterion within the basic OC1 algorithm (MOC1, MOC2, and MTT) or by post-processing the OC1 output (FAT). All three large-margin inductive systems and the transductive system outperform OC1 on most of the real-world data sets we have used, indicating that overfitting in PDTs can be efficiently combatted by enlarging the margin of the decision boundaries on the training data. In this case, it does not much matter how the large margin bias is enforced: the quite different approaches of MOC1, MOC2, FAT, and MTT all work. The important thing seems to be to enforce the large margin.

We report preliminary experiments with real-world datasets comparing transductive and inductive inference. Our first experiment, on average-case transductive behavior, is consistent with what was already observed in a similar experiment on transduction using support vector machines [2], namely that the use of working set information rarely hurts and sometimes helps generalization. In a second experiment we look at the difference between transduction and induction on a single task that would seemingly favor transduction; here our results are less conclusive. Both margin-based algorithms outperform OC1, but the transductive approach is not consistently better than the margin-based inductive approach.
2 Perceptron Decision Trees

PDTs are decision trees in which each internal node is associated with a hyperplane in general position in the input space. The space is thus partitioned into different (polytopic) regions which are then assigned to classes. It is useful to formally define the class of functions computed by PDTs as a special case of Generalized Decision Tree functions.

Definition 2.1 Generalized Decision Trees (GDT). Given a space X and a set of boolean functions F = {f : X → {0, 1}}, the class GDT(F) of Generalized Decision Trees over F consists of the functions which can be implemented using a binary tree where each internal node is labeled with an element of F, and each leaf is labeled with either 1 or 0.

To evaluate a particular tree T on input x ∈ X, all the boolean functions associated with the nodes are given the same argument x ∈ X, which is the argument of T(x). The values they assume determine a unique path from the root to a leaf: at each internal node the left (respectively right) edge to a child is taken if the output of the function associated with that internal node is 0 (respectively 1). This path is known as the evaluation path. The value of the function T(x) is the value associated with the leaf reached. We say that an input x reaches a node of the tree if that node is on the evaluation path for x. We will call the internal nodes of a tree nodes, and the external ones leaves.
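To make the evaluation-path procedure concrete, the following minimal Python sketch evaluates a PDT on an input; the Node class and the augmentation convention are our own illustration, not part of any particular implementation.

```python
import numpy as np

class Node:
    """Node of a perceptron decision tree.

    Internal nodes hold an augmented weight vector w (the last component
    acts as the threshold) and two children; leaves hold a 0/1 label.
    """
    def __init__(self, w=None, left=None, right=None, label=None):
        self.w, self.left, self.right, self.label = w, left, right, label

    def is_leaf(self):
        return self.label is not None


def evaluate(tree, x):
    """Follow the evaluation path of input x from the root to a leaf.

    At each internal node the right child is taken when w^T x > 0
    (function value 1), the left child otherwise (value 0).
    """
    x_aug = np.append(x, 1.0)        # augment with the constant coordinate
    node = tree
    while not node.is_leaf():
        node = node.right if node.w @ x_aug > 0 else node.left
    return node.label
```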
Figure 2: A Perceptron Decision Tree and the way it splits the input space
Definition 2.2 Given X = R^n, a Perceptron Decision Tree (PDT) is a GDT over

$$F_{PDT} = \{ f_w : f_w(x) = 1 \iff w^T x > 0,\; w \in \mathbb{R}^{n+1} \}$$
where we have assumed that the inputs have been augmented with a coordinate of constant value, hence implementing a thresholded perceptron.

PDTs are generally induced by means of a TopDown growth procedure, which starts from the root node and greedily chooses a perceptron which maximizes some cost function, usually a measure of the "impurity" of the subsamples implicitly defined by the split. This maximization is usually hard to perform, and is sometimes replaced by randomized (sub)optimization. The subsamples are then mapped to the two children nodes. The procedure is then recursively applied to the children, and the tree is grown until some stopping criterion is met. Such a tree is then used as a starting point for a "BottomUp" search, performing a pruning of the tree. This means eliminating the nodes which are redundant, or which are unable to "pay for themselves" in terms of the cost function. Generally, pruning an overfitting tree produces better classifiers than those obtained with early stopping, since it makes it possible to check whether promising directions were in fact worth exploring, and whether locally good solutions were on the contrary a dead end. So, while the standard "TopDown" algorithm is an extremely greedy procedure, the introduction of pruning makes a limited form of look-ahead possible: this allows more hidden structure to be discovered.
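The growth procedure just described can be outlined as in the following sketch (reusing the Node class from the previous sketch). Here find_split stands for whatever randomized impurity optimizer is plugged in, and the stopping rule shown is only one possible choice, so this is an illustrative outline rather than the OC1 algorithm itself.

```python
import numpy as np

def grow_tree(X, y, find_split, max_depth=10, depth=0):
    """Greedy TopDown induction of a PDT (pruning would follow separately).

    X: (m, n) array, y: (m,) array of 0/1 labels.
    find_split(X, y) is assumed to return an augmented weight vector w
    chosen by (randomized) optimization of an impurity/goodness measure,
    e.g. the twoing rule.
    """
    # Early stopping: pure node or depth limit reached.
    if len(np.unique(y)) == 1 or depth >= max_depth:
        return Node(label=int(round(y.mean())))          # majority label
    w = find_split(X, y)
    go_right = np.hstack([X, np.ones((len(X), 1))]) @ w > 0
    # Degenerate split: turn the node into a leaf instead of recursing.
    if go_right.all() or (~go_right).all():
        return Node(label=int(round(y.mean())))
    return Node(
        w=w,
        left=grow_tree(X[~go_right], y[~go_right], find_split, max_depth, depth + 1),
        right=grow_tree(X[go_right], y[go_right], find_split, max_depth, depth + 1),
    )
```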
3 Generalization Analysis for Induction using PDTs

The theory of generalization for large margin classifiers relies on data-dependent structural risk minimisation [6], in the sense that the hierarchy of the hyperplanes is determined by their margin on the training data. The same approach can be applied to decision trees with large margin decision nodes. The following theorem holds [7, 1].
Theorem 3.1 [7, 1] Suppose we are able to classify an m sample of labeled examples using a perceptron decision tree, and suppose that the tree obtained contained K decision nodes with margins γ_i at node i. Then we can bound the generalization error with probability greater than 1 − δ to be less than

$$\frac{130R^2}{m}\left(D'\log(4em)\log(4m)+\log\frac{(4m)^{K+1}\,2^K K^K}{(K+1)\,\delta}\right) \qquad (1)$$

where

$$D'=\sum_{i=1}^{K}\frac{1}{\gamma_i^2}$$

and R is a bound on the norm of the training examples.
This result gives an upper bound on the generalization error of consistent trees. It is a decreasing function of the margins and of the number of points, and an increasing function of the number of decision nodes. Empirical studies have shown that just minimizing the number of decision nodes does not produce the best tree [4]. Margins are a significant factor, so PDT algorithms should incorporate margin maximization.
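As a small illustration of how the bound behaves, the quantity D' can be computed directly from the node margins. The snippet below is our own example (not from the paper) and shows that doubling every margin reduces D', and hence the leading term of the bound, by a factor of four.

```python
import numpy as np

def d_prime(margins):
    """D' = sum_i 1 / gamma_i^2 over the K decision nodes of the tree."""
    margins = np.asarray(margins, dtype=float)
    return float(np.sum(1.0 / margins ** 2))

# The same tree structure with twice the margin at every node has a D'
# (and hence a leading bound term) four times smaller.
print(d_prime([0.5, 0.2, 0.1]))   # 129.0
print(d_prime([1.0, 0.4, 0.2]))   # 32.25
```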
4 Description of the Inductive Algorithms

In this section we examine three different approaches for incorporating margin maximization for inference within PDTs. The first two approaches incorporate margin maximization into the splitting criterion or impurity measure used to construct the tree. The last approach maximizes margins in a post-processing phase. More extensive descriptions of these algorithms can be found in [1].

We started with one of the best known PDT learning systems, OC1 of Murthy, Kasif and Salzberg, which is freely available over the Internet [4]. It is a randomized algorithm which performs a randomized hill-climbing search for learning the perceptrons, and builds the tree TopDown. Starting from the root node, the system chooses the hyperplane which minimizes a predefined "impurity" measure (e.g. information gain, the Gini index, or the twoing rule). The system is greedy because at each stage it chooses the best split available, and randomized because such a best split is not obtained by means of exhaustive search but by a randomized hill-climbing process. In our experiments the twoing rule is used as the "impurity" measure in OC1. The standard twoing rule is

$$\mathrm{TwoingValue} = \frac{|T_L|}{n}\,\frac{|T_R|}{n}\left(\sum_{i=1}^{k}\left|\frac{|L_i|}{|T_L|}-\frac{|R_i|}{|T_R|}\right|\right)^2$$

where
- n = |T_L| + |T_R|
- k - number of classes
- |T_L| - number of instances on the left of the split
- |T_R| - number of instances on the right of the split
- |L_i| - number of instances in category i on the left of the split
- |R_i| - number of instances in category i on the right of the split

This is a goodness measure rather than an impurity one, and OC1 attempts to maximize it at each split during the growth by minimizing 1/TwoingValue.
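For concreteness, the twoing value of a candidate hyperplane can be computed as in the following sketch; the function and variable names simply mirror the definitions above and are not OC1 code.

```python
import numpy as np

def twoing_value(X, y, w):
    """Standard twoing goodness of the split w^T x >= 0 (augmented inputs).

    X: (m, n) array, y: (m,) array of class labels, w: length n+1 weights.
    Larger values indicate better splits, so OC1 equivalently minimizes
    1 / TwoingValue.
    """
    X_aug = np.hstack([X, np.ones((len(X), 1))])
    left = X_aug @ w >= 0
    right = ~left
    n, n_L, n_R = len(y), left.sum(), right.sum()
    if n_L == 0 or n_R == 0:
        return 0.0                    # a degenerate split separates nothing
    diff = sum(abs((y[left] == c).sum() / n_L - (y[right] == c).sum() / n_R)
               for c in np.unique(y))
    return (n_L / n) * (n_R / n) * diff ** 2
```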
4.1 Description of MOC1 Algorithm

MOC1 (Margin OC1) is a variation of OC1 which modifies the objective function of OC1 to take the size of the margin into account. The underlying philosophy is to find a separating plane with a tradeoff between decision impurity and the size of the margin at each node. MOC1 attempts to minimize the impurity function and maximize the margin during the induction process. The MOC1 algorithm minimizes the following objective function:

$$(1-\lambda)\,(\text{OC1 Objective}) + \lambda\, C\,\frac{1}{\text{current margin}}$$

where
- OC1 Objective is the impurity measure of OC1; in this study the twoing rule is used as the impurity measure.
- current margin is the distance between the two nearest points on different sides of the current separating hyperplane.
- λ is a scalar weight, λ ∈ [0, 1].
- C = log10(number of points at the current node) determines how much the large margin is weighted in selecting the split.

Tuning λ could improve the performance. When determining the weight of the margin, we also take the number of points at the current node into consideration. The idea is that a constant margin weight for all nodes is not good: the weight should adapt to the position of the current node and to the number of training examples reaching it. Since we are not particularly interested in finding the tree with the highest possible accuracy, but rather in demonstrating that large margins can improve generalization, we did not tune λ for each dataset to achieve the highest possible accuracy, setting λ = 0.05 for all datasets. In other words, the results of MOC1 presented below are not the best results possible.
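A sketch of the resulting split score is given below; it combines an OC1-style impurity with the inverse margin exactly as in the objective above. The helper names, the margin computation, and the spelling of λ as lam are our own illustration, not MOC1 source code.

```python
import numpy as np

def moc1_objective(X, y, w, impurity, lam=0.05):
    """MOC1-style score: (1 - lam) * impurity + lam * C / current_margin.

    impurity(X, y, w): OC1-style impurity of the split, e.g. a callable
    returning 1 / twoing_value(X, y, w).
    current_margin: distance between the two closest points on opposite
    sides of the hyperplane w (augmented inputs), as described above.
    C grows with the number of points at the node.
    """
    X_aug = np.hstack([X, np.ones((len(X), 1))])
    signed_dist = X_aug @ w / np.linalg.norm(w[:-1])   # geometric distances
    pos = signed_dist[signed_dist > 0]
    neg = signed_dist[signed_dist <= 0]
    if len(pos) == 0 or len(neg) == 0:
        return impurity(X, y, w)        # one-sided split: no margin to measure
    current_margin = pos.min() - neg.max()
    C = np.log10(len(X))
    return (1 - lam) * impurity(X, y, w) + lam * C / current_margin
```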
4.2 Description of MOC2 Algorithm

MOC2 uses a modified twoing rule, which directly incorporates the idea of a large margin into the impurity measure. Unlike MOC1, MOC2 uses a soft margin: it treats points falling within the margin and points outside of the margin differently. Only the impurity measure is altered; the rest is the same as in the standard OC1 algorithm.
The modified twoing rule is

$$\mathrm{TwoingValue} = \frac{|MT_L|}{n}\,\frac{|MT_R|}{n}\left(\sum_{i=1}^{k}\left|\frac{|L_i|}{|T_L|}-\frac{|R_i|}{|T_R|}\right|\right)\left(\sum_{i=1}^{k}\left|\frac{|ML_i|}{|MT_L|}-\frac{|MR_i|}{|MT_R|}\right|\right)$$
where
- n = |T_L| + |T_R| - total number of instances at the current node
- k - number of classes (k = 2 for two-class problems)
- |T_L| - number of instances on the left of the split, i.e. w^T x + b >= 0
- |T_R| - number of instances on the right of the split, i.e. w^T x + b < 0
- |L_i| - number of instances in category i on the left of the split
- |R_i| - number of instances in category i on the right of the split
- |MT_L| - number of instances on the left of the split with w^T x + b >= 1
- |MT_R| - number of instances on the right of the split with w^T x + b <= -1
- |MR_i| - number of instances in category i with w^T x + b <= -1
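The counts entering the modified rule can be computed directly from the definitions above. The sketch below is our own illustration, assuming the soft-margin band is defined by |w^T x + b| < 1 as in those definitions, with ML_i defined symmetrically to MR_i.

```python
import numpy as np

def margin_counts(X, y, w, b, k):
    """Counts used by MOC2's modified twoing rule.

    Returns (T_L, T_R, MT_L, MT_R, L, R, ML, MR), where L[i], R[i], ML[i],
    MR[i] are per-class counts and MT_L / MT_R count points lying outside
    the soft-margin band on the left / right side of the split.
    """
    s = X @ w + b
    left, right = s >= 0, s < 0
    m_left, m_right = s >= 1, s <= -1          # outside the margin band
    L  = np.array([(left    & (y == i)).sum() for i in range(k)])
    R  = np.array([(right   & (y == i)).sum() for i in range(k)])
    ML = np.array([(m_left  & (y == i)).sum() for i in range(k)])
    MR = np.array([(m_right & (y == i)).sum() for i in range(k)])
    return left.sum(), right.sum(), m_left.sum(), m_right.sum(), L, R, ML, MR
```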