A Bottom-Up Oblique Decision Tree Induction Algorithm

Rodrigo C. Barros, Ricardo Cerri, Pablo A. Jaskowiak and André C. P. L. F. de Carvalho
Department of Computer Science, ICMC, University of São Paulo (USP), São Carlos - SP, Brazil
{rcbarros,cerri,pablo,andre}@icmc.usp.br

Abstract—Decision tree induction algorithms are widely used in knowledge discovery and data mining, especially in scenarios where model comprehensibility is desired. A variation of the traditional univariate approach is the so-called oblique decision tree, which allows multivariate tests in its non-terminal nodes. Oblique decision trees can model decision boundaries that are oblique to the attribute axes, whereas univariate trees can only perform axis-parallel splits. The majority of the oblique and univariate decision tree induction algorithms follow a top-down strategy for growing the tree, relying on an impurity-based measure for splitting nodes. In this paper, we propose a novel bottom-up algorithm for inducing oblique trees named BUTIA. It does not require an impurity measure for splitting nodes, since the data resulting from each split are known a priori. For generating the splitting hyperplanes, our algorithm implements a support vector machine solution, and a clustering algorithm is used for generating the initial leaves. We compare BUTIA to traditional univariate and oblique decision tree algorithms, C4.5, CART, OC1 and FT, as well as to a standard SVM implementation, using real gene expression benchmark data. Experimental results show the effectiveness of the proposed approach in several cases.

Keywords: oblique decision trees; bottom-up induction; clustering; SVM; hybrid intelligent systems

I. INTRODUCTION

Decision tree induction algorithms are widely used in a variety of domains for knowledge discovery and pattern recognition. The induced knowledge, in the form of hierarchical trees, can be regarded as a disjunction of conjunctions of constraints on the attribute values [1]. Each path from the root to a leaf is a conjunction of attribute tests, and the tree itself allows the choice of different paths, i.e., a disjunction of these conjunctions. Such a representation is intuitive and easy for humans to assimilate, which partially explains the large number of studies that make use of these techniques. Another reason for their popularity is their good predictive accuracy in several application domains, such as medical diagnosis and credit risk assessment [2]. A major issue in decision tree induction is which attribute(s) to choose for splitting an internal node.
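To make the disjunction-of-conjunctions view concrete, the toy example below encodes a small univariate tree as a set of root-to-leaf rules; the attribute names, thresholds and class labels are purely illustrative and are not taken from the paper.

```python
# Toy illustration only: a 3-leaf univariate tree written as a disjunction of
# conjunctions (one conjunction of attribute tests per root-to-leaf path).
# Attribute names, thresholds and labels are made up.
def classify(instance):
    # Path 1: (age <= 40) AND (income > 50000) -> "low risk"
    if instance["age"] <= 40 and instance["income"] > 50000:
        return "low risk"
    # Path 2: (age <= 40) AND (income <= 50000) -> "high risk"
    if instance["age"] <= 40 and instance["income"] <= 50000:
        return "high risk"
    # Path 3: (age > 40) -> "low risk"
    return "low risk"

print(classify({"age": 35, "income": 62000}))  # -> low risk
```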

For the case of axis-parallel decision trees (also known as univariate), the problem is to choose the attribute that best discriminates the input data. A decision rule based on such an attribute is then generated, and the input data is filtered according to the consequents of this rule. For oblique decision trees (also known as multivariate), the goal is to find a combination of attributes with good discriminatory power. Oblique decision trees are not as popular as the univariate ones, mainly because they are harder to interpret. Nevertheless, researchers argue that multivariate splits can improve the performance of the tree on several datasets, while generating smaller trees [3]–[5]. Clearly, there is a tradeoff to consider in allowing multivariate tests: simple tests may result in large trees that are hard to understand, yet multivariate tests may result in small trees with tests that are hard to understand [6]. One of the advantages of oblique decision trees is that they are able to produce polygonal (polyhedral) partitions of the attribute space, i.e., hyperplanes at an oblique orientation to the attribute axes. Univariate trees, on the other hand, can only produce hyper-rectangles parallel to the attribute axes. The test at each node of an oblique tree has the form $w_0 + \sum_{i=1}^{m} w_i x_{ji} \leq 0$, where $w_i$ is a real-valued coefficient associated with the $i$th attribute of a given instance $x_j$, and $w_0$ is the disturbance coefficient (bias) of the test.
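As a quick illustration of the difference between the two kinds of test, the snippet below evaluates a hypothetical axis-parallel test and a hypothetical oblique test of the form above on a single instance; the weights, bias and attribute values are invented for the example.

```python
import numpy as np

# Illustration only (not the paper's code): one axis-parallel test versus one
# oblique test of the form w0 + sum_i(w_i * x_ji) <= 0, with made-up numbers.
x = np.array([2.0, -1.5, 0.5])      # instance x_j with m = 3 attributes

# Axis-parallel (univariate) test: looks at a single attribute.
goes_left_univariate = x[1] <= 0.0

# Oblique (multivariate) test: linear combination of all attributes.
w = np.array([0.8, -0.3, 1.1])      # coefficients w_1 .. w_m
w0 = -1.0                           # disturbance coefficient (bias)
goes_left_oblique = w0 + np.dot(w, x) <= 0.0

print(goes_left_univariate, goes_left_oblique)  # True False
```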

For the growth of either oblique or axis-parallel decision trees, there is a clear preference in the literature for algorithms that rely on a greedy, top-down, recursive partitioning strategy, i.e., top-down induction. The most well-known decision tree induction algorithms indeed implement this strategy, e.g., CART [7], C4.5 [8] and OC1 [9]. These algorithms make use of impurity-based criteria to decide which attribute(s) will split the data into purer subsets (a pure subset is one whose instances all belong to the same class). Since these algorithms are top-down, it is not possible to know a priori which instances will fall into each subset of a partition. Thus, in top-down induction, trees are usually grown until every leaf node is pure, and a pruning method is then employed to avoid data overfitting.

Works that implement a bottom-up strategy are quite rare in the literature. The key ideas behind bottom-up induction were first presented by Landeweerd et al. [10]. The authors propose growing a decision tree from the leaves to the root, assuming that each class is represented by a leaf node and that the closest nodes (according to the Mahalanobis distance) are recursively merged into a parent node. Albeit simple, their approach has several deficiencies, e.g., it allows only a single leaf per class, which means that binary-class problems will always be modeled by a 3-node tree. This is quite problematic, since there are complex binary-class problems for which a 3-node decision tree cannot accurately model the attribute space. We believe this deficiency is one of the reasons that discouraged researchers from further investigating the bottom-up induction of decision trees. Another reason may be the extra computational effort required to compute the costly Mahalanobis distance.

In this paper, we propose alternatives to overcome the deficiencies of the typical bottom-up approach. For instance, we apply a well-known clustering algorithm to allow each class to be represented by more than one leaf node. In addition, we incorporate into our algorithm a support vector machine (SVM) [11] solution to build the hyperplane that divides the data within each non-terminal node of the oblique decision tree. We call our approach BUTIA (Bottom-Up oblique Tree Induction Algorithm), and we evaluate its performance on gene expression benchmark datasets.

This paper is organized as follows. In Section II we detail the proposed algorithm, which combines clustering and SVM for generating oblique decision trees. In Section III we present a comparison between BUTIA and the traditional top-down decision tree induction algorithms C4.5 [8], CART [7] and OC1 [9]. Additionally, we compare BUTIA to Sequential Minimal Optimization (SMO) [12] and Functional Trees [13]. Section IV presents studies related to our approach, whereas in Section V we discuss the main conclusions of this work.

II. BUTIA

We propose a new bottom-up oblique decision tree induction algorithm, named BUTIA (Bottom-Up oblique Tree Induction Algorithm). It employs two machine learning algorithms in different steps of tree growth, namely Expectation-Maximization (EM) [14] and Sequential Minimal Optimization (SMO) [12].
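Full details of the algorithm are given later in this section; as a rough orientation, the sketch below shows one way such a bottom-up procedure could be organized in Python. It is not the authors' implementation: scikit-learn's GaussianMixture stands in for EM, SVC with a linear kernel stands in for SMO, and the leaf-merging criterion (Euclidean distance between node centroids) and all helper names are assumptions made for the example.

```python
# Minimal sketch of a bottom-up oblique tree; NOT the authors' implementation.
# GaussianMixture stands in for EM, SVC(kernel="linear") for SMO, and nodes are
# merged by Euclidean distance between centroids (an assumption of this sketch).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

class Node:
    def __init__(self, X, y, children=None, svm=None):
        self.X, self.y = X, y              # instances reaching this node
        self.children = children or []     # empty for leaf nodes
        self.svm = svm                     # separating hyperplane (internal nodes)
        self.centroid = X.mean(axis=0)

def make_leaves(X, y, clusters_per_class=2):
    """Cluster each class separately so that a class may own several leaves."""
    leaves = []
    for c in np.unique(y):
        Xc = X[y == c]
        k = min(clusters_per_class, len(Xc))
        labels = GaussianMixture(n_components=k, random_state=0).fit_predict(Xc)
        for g in range(k):
            mask = labels == g
            if mask.any():
                leaves.append(Node(Xc[mask], y[y == c][mask]))
    return leaves

def merge_bottom_up(leaves):
    """Repeatedly merge the two closest nodes under a parent with an SVM split."""
    nodes = list(leaves)
    while len(nodes) > 1:
        # Find the closest pair of nodes (by centroid distance).
        _, i, j = min((np.linalg.norm(a.centroid - b.centroid), i, j)
                      for i, a in enumerate(nodes)
                      for j, b in enumerate(nodes) if i < j)
        a, b = nodes[i], nodes[j]
        # The membership of each side is known a priori, so no impurity measure
        # is needed: just fit a maximum-margin hyperplane between the two groups.
        X = np.vstack([a.X, b.X])
        side = np.array([0] * len(a.X) + [1] * len(b.X))
        svm = SVC(kernel="linear").fit(X, side)
        parent = Node(X, np.concatenate([a.y, b.y]), children=[a, b], svm=svm)
        nodes = [n for idx, n in enumerate(nodes) if idx not in (i, j)] + [parent]
    return nodes[0]  # the root of the induced tree
```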

Our motivation for building bottom-up trees is twofold:

• In a bottom-up approach we have a priori information on which group of instances belongs to a given node of the tree. This means we know the result of each node split before even generating the separating hyperplane. In fact, our algorithm uses this a priori information to generate hyperplanes that maximize the separation margin between the instances of two nodes. Hence, there is no need to rely on an impurity measure to evaluate the goodness of a split;
• The top-down induction strategy usually overgrows the decision tree until every leaf node is pure, and a pruning procedure is then responsible for simplifying the tree in order to avoid data overfitting. In bottom-up induction, a pruning step is not necessary because we do not overgrow the tree. Since we start growing the tree from the leaves to the root, by clustering the data instances our approach significantly reduces the chances of overfitting.

Given a space of instances $X = \{x_1, \ldots, x_n\}$, $x_i \in$