Multiway Decision Tree Induction using Projection and Merging (MPEG)
Yasser El-Sonbaty    Amgad Neematallah
College of Computing & Information Technology, Arab Academy for Science & Technology, 1029, Alexandria, Egypt
[email protected] [email protected]
Abstract
Decision trees are one of the most popular and commonly used classification models. Many algorithms have been designed to build decision trees over the last twenty years. These algorithms are categorized into two groups according to the type of trees they build: binary and multiway trees. In this paper, a new algorithm, MPEG, is designed to build multiway trees. MPEG uses DBMS indices and optimized queries to access the dataset it works on; hence it has modest memory requirements and no restrictions on dataset size. Projection of examples over attribute values, merging of the generated partitions using class values, applying the GINI index to select among candidate attributes and, finally, post pruning using the EBP method are the basic steps of MPEG. The trees built by MPEG combine the advantages of binary trees, being accurate and small in size, with the advantages of multiway trees, being compact and easy for humans to comprehend.
1. Introduction
Classification is an important data mining problem in which there exists a set of labeled examples and it is desired to build a model - a classifier - that classifies any new unlabeled example, from the information kept during the training phase, in an accurate and fast way. Examples introduced to the classification model builder are in the form of vectors of ordered attribute values followed by the corresponding class label, (v1, v2, ..., vk, c), where vi is the value of attribute i, k is the number of attributes and c is the example class label. Attributes - also called features, predictors, or independent variables - with discrete domains are referred to as categorical, while those with continuous domains are referred to as numerical. The class - the dependent variable - is always categorical. There are many techniques to build the classifier. Some of them are Artificial Neural Networks (ANN)
[35], Genetic Algorithms (GA) [34], Support Vector Machines (SVM) [32], Bayesian Networks [33], Regression [15], Decision Trees and Decision Rules [12]. Decision trees are one of the most important and widely used classifiers. They consist of nodes where attributes are tested, outgoing branches covering the possible non overlapping test outcomes, and leaves corresponding to the classes that appear in the dataset examples. In the literature there is a large number of decision tree induction algorithms. Most of them adopt a top down strategy that searches the space of possible decision trees. Decision trees are considered highly effective classifiers with the following characteristics:
− Compared to neural networks or Bayesian classifiers, decision trees are easily interpreted and comprehended by humans [3].
− Decision tree generation algorithms do not require additional information besides that already contained in the training data [4], i.e. they are non parametric.
− They can also be easily converted into SQL statements, which in turn can be used to access databases efficiently [6].
− Finally, decision trees display similar and sometimes better accuracy compared to other techniques [5].
Decision tree building algorithms may be divided into two categories. The first category comprises those building binary trees; CART [15], SLIQ [20] and SPRINT [23] are examples of this category. The second category includes algorithms that build multiway trees, like CHAID [21], C4.5 [17], FIRM [22] and CRUISE [24]. Most of the current techniques have the restriction that the training data should fit in memory. Several studies have been made to enable classifiers to classify large datasets. Catlett [1] studied methods for sampling the data and discretizing numeric attributes. While these methods decrease the classification time significantly, they also reduce the classification accuracy. Chan and Stolfo [2] studied the method of partitioning the original dataset, building a classifier for
each partition and finally combining the resultant classifiers to obtain the sought classifier. Their results showed low accuracy compared to building one classifier from the whole dataset. In this paper, we present a new scalable multiway decision tree building algorithm that shares the advantages of both binary and multiway decision trees. Trees generated by the proposed algorithm are accurate and small in size, which are the good features encountered in binary trees. They are also compact, with small depths, like multiway trees. The algorithm is scalable as it works over disk resident datasets and does not require the whole dataset to fit in memory, as most current algorithms do; it has neither limitations on dataset sizes nor special memory requirements. The rest of the paper is organized as follows. In section 2, we will discuss in some detail the induction of decision tree classifiers; a comparison between binary and multiway splits, the most commonly used attribute selection criteria, the different pre and post pruning methods and the different evaluation measures of decision trees will also be discussed in this section. In section 3, previous research on decision tree classifiers will be presented. In section 4, the proposed algorithm, MPEG, will be explained in detail. Experimental results and comparisons with C4.5 over different datasets will be discussed in section 5. Finally, the conclusions will be presented in section 6.
2. Induction of decision tree classifier
There are a large number of decision tree induction algorithms. Most of them share the same idea in building the decision tree classifier. First, the tree is built in a top down approach through the growth - building - phase. Second, the tree is pruned in a bottom up approach to get the right sized tree. A distinct dataset is needed for each of the training and pruning phases. In the absence of separate training/pruning datasets, resampling methods, like cross validation and the bootstrap, are most commonly used [7]. The following sections explain the building and pruning phases in more detail.
2.1. Tree building phase
The tree is built using the concept of breadth first search, by recursively partitioning the examples in each node into new sub nodes until we reach leaves that satisfy a certain stopping condition. The partitioning is performed at each node by testing candidate attribute values, seeking splitting rules that score best against a certain evaluation measure. The winning attribute of the previous test stage is greedily selected. The node examples are then distributed over the newly
generated sub nodes using the winning attribute values. Decision tree building algorithms are divided into two categories according to the type of splits produced at each node: binary and multiway tree building algorithms. The splitting criteria / rules take different forms according to the type of the tested attributes. In the case of numerical attributes, splitting criteria are of the form V ≤ a, V > a for binary trees and a ≤ V < b for multiway trees, where V is the value of the corresponding winning attribute and a and b are constants. In the case of categorical attributes, they take the form V ∈ A, where A is a non empty set of the corresponding winning categorical attribute's values. In the following sections, more details about the greedy approach, binary and multiway splits, attribute selection criteria, and stopping conditions are presented.
2.1.1. Greedy top down induction. Greedy top down construction is the most commonly used method for tree growing. At each impure node, the best evaluated attribute is selected for splitting and no extra lookahead for subsequent splits is performed. The greedy approach is simple and effective in terms of computation and of searching for solutions, especially in large search spaces. But looking only for the next best step does not always lead to the optimal solution. Besides that, several authors warn that this greedy method has a selection bias toward variables that provide more split points ([9]; [8]). Others pointed out that the greedy method is biased toward variables with more missing values when the GINI criterion is used [10]. These biases affect the interpretability and accuracy of the resulting trees. As a counterpoint to the previous defects, Murthy and Salzberg [11] state that the greedy heuristic produces trees that are consistently close to the optimal in terms of expected length, and that one level lookahead not only does not improve upon greedy induction but often produces worse results, as it yields trees with approximately the same classification accuracy and size as greedy induction, with a slightly shorter longest path. They also showed that post pruning techniques are at least as beneficial as one level lookahead.
2.1.2. Binary versus multiway splits. In binary trees, each impure node is partitioned into exactly two sub nodes, against at least two sub nodes in the case of multiway trees. Splitting is done by testing candidate attributes against a certain evaluation criterion, and then greedily selecting the best evaluated attribute. Categorical attribute values are divided into n sets according to the splitting type, where n = 2 in the case of binary trees and n ≥ 2 in the case of multiway trees. The splitting rule is of the form A ∈ Si, where Si is a set of values of attribute A and i takes values from 1 to n. If a
categorical attribute A has N distinct values, the number of possible groupings of its values - and hence the number of candidate tests that must be examined exhaustively - grows exponentially with N, which can be prohibitively expensive. A greedy algorithm, proposed initially for IND [19], is used to overcome this issue. C4.5 [17] splits a node over a categorical attribute's values into C sub nodes, where C is the number of the categorical attribute's values. CHAID [21] and FIRM [22] perform extra value grouping to merge nodes. Numerical values are handled differently in binary and multiway trees. In binary trees a binary test is usually performed. This test compares the value of attribute A against a threshold t: A ≤ t. The training cases are first sorted on the values of A in order to obtain a finite set of values {v1, v2, ..., vn}, where n is the number of values of attribute A. Any threshold between vi and vi+1 will have the same effect when dividing the training set, so n − 1 possible thresholds are checked for each numerical attribute A. By sorting these threshold values initially, the checks can be made in one sequential scan of the database. Once the threshold is determined to lie between vi and vi+1, an exact value for the threshold has to be chosen. Most algorithms choose the midpoint of the interval [vi, vi+1] as the actual threshold, that is t = (vi + vi+1) / 2. In multiway trees, numerical attributes are handled in different manners. C4.5 [17] and CHAID [21] apply binary tests over them as in binary trees. FIRM [22] discretizes the numerical attribute range into 10 intervals and treats them as categorical attributes. Berzal [13] applies an agglomerative hierarchical clustering algorithm over intervals of numerical attribute values, using the Euclidean distance between the class probability vectors of these intervals. In CRUISE [24], ANOVA F-tests are applied to select the most relevant attribute, and then the Box-Cox transformation and Linear Discriminant Analysis (LDA) are applied to select the splitting points. Binary trees, although being more accurate, smaller in size and not suffering from bias compared to multiway trees, are undoubtedly harder for human experts to interpret, with unrelated attribute values being grouped together and with multiple tests on the same attribute appearing in the same decision rule from root to leaf. Also, the subset criterion for categorical attributes can require a large increase in computation, especially for attributes with many values. On the other hand, multiway trees are compact, shorter in depth and easy to comprehend, since attributes rarely appear more than once in decision paths. But they have large sizes and suffer from bias, especially towards attributes with a large number of values and with more noise, and hence may be less accurate than binary trees.
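As an illustration of the threshold selection just described, the sketch below (our own example, not code from any of the cited systems; the attribute values are hypothetical) enumerates the n − 1 candidate thresholds of a sorted numerical attribute and takes the midpoint of each interval, as most binary tree builders do.

```python
# Sketch: enumerating candidate binary-split thresholds for a numerical attribute.
# The attribute values are illustrative only.

def candidate_thresholds(values):
    """Return the midpoints between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    # n distinct values yield n - 1 candidate thresholds
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]

if __name__ == "__main__":
    ages = [23, 31, 31, 40, 52, 67]          # toy numerical attribute
    for t in candidate_thresholds(ages):
        left = [v for v in ages if v <= t]   # split rule: A <= t
        right = [v for v in ages if v > t]   # split rule: A > t
        print(f"t = {t:5.1f}: {len(left)} examples left, {len(right)} right")
```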
2.1.3. Attribute selection criteria. To build a decision tree, it is necessary to find at each internal node a test for splitting the data into subsets. Finding the split points means finding the attribute that is the most useful in discriminating the input data. Attributes are typically ranked using attribute selection criteria - also known as goodness measures, feature evaluation criteria or impurity measures - and the best evaluated attribute is then chosen from the ranked list. Ben Bassat [14] divided feature evaluation criteria into three categories. The first category is rules derived from information theory, which expand the tree nodes that contribute the largest gain in average mutual information of the whole tree. The second category is rules derived from distance measures; the feature evaluation criteria in this class measure separability, divergence or discrimination between classes. The third is rules derived from dependence measures; they measure the statistical dependence between two random variables, one of them being the class variable and the other the predictor attribute.
Let S be the set of training examples within partition i, with each example e ∈ S belonging to one of the classes in C = {C1, C2, ..., CK}, where K is the number of classes. We can define the class vector of partition i as Pi = (n1, n2, ..., nK) ∈ N^K, where nj = |{e ∈ S : class(e) = Cj}|, and the class probability vector of partition i as PCi = (p1, p2, ..., pK) ∈ [0, 1]^K, where (p1, p2, ..., pK) = (n1/|S|, n2/|S|, ..., nK/|S|). The impurity measure of partition i is defined as a function φ : [0, 1]^K → R such that φ(PC(S)) ≥ 0, and the weighted average impurity over the sub partitions generated by attribute A is Φ(S, A) = Σ_{i=1}^{v} (|Si| / |S|) · φ(PC(Si)), where v is the total number of sub partitions. Finally, the goodness of a split on attribute A is defined as the reduction in impurity after the partitioning: ΔΦ(S, A) = φ(PC(S)) − Φ(S, A).
Information gain is an example of rules derived from information theory [16]. The entropy of the partition, E(A, S), is used as the impurity measure: φ(PC(S)) = E(A, S) = Σ_{i=1}^{K} −Pr(Ci) · log2(Pr(Ci)), where Pr(Ci) = |{e ∈ S : class(e) = Ci}| / |S|. This method was used in many algorithms for induction of decision trees, such as ID3 [16]. The gain criterion has a serious deficiency, as it has a strong bias in favor of tests with many outcomes. Quinlan proposed in [17] an enhancement of the information gain to overcome this bias; the new criterion is called the gain ratio, which is
used in C4.5 [17]. The gain ratio, GR, normalizes the gain by the split information of the attribute: GR = ΔΦ(S, A) / ( −Σ_{i=1}^{v} (|Si| / |S|) · log2(|Si| / |S|) ). The GINI index is an example of rules derived from distance measures. It was originally presented and used in CART [15]: φ(PC(S)) = Σ_{i≠j} pi · pj = 1 − Σ_i (pi)^2, where pi is the i-th element of the class probability vector PC(S) and i, j take values from 1 to the number of classes K. The GINI index prefers splits that put the largest class into one pure node and all the rest into the other node(s), while entropy favors size-balanced child nodes. Orthogonality [18] is another example of rules derived from distance measures. It is defined as ORT(τ, S) = 1 − cos θ(P1, P2), where θ(P1, P2) is the angle between the two class vectors P1 and P2 of partitions Sτ and S¬τ respectively. It is suitable only for binary splits. The X2 independence test is an example of rules derived from dependence measures. It is used to measure the degree of association between each attribute's values and the class values. The attribute with the highest association value - minimum level of significance - is the winning attribute. Details about the X2 test and how it is calculated can be found in [27].
2.1.4. Stopping conditions. One of the most important goals of good decision tree building classifiers is to obtain compact, right sized trees. Noise and randomness in the training dataset may lead to overfitting, which increases the error rate of the built tree. Generalizing the tree at the end of the growing phase is therefore very important for adjusting the model. One of the most popular ways of generalization is to stop the splitting of impure nodes upon satisfying a certain stopping criterion; this is also called pre-pruning. Some of the most commonly used stopping criteria are:
1. Restrictions on minimum node size [28]; this strategy is known to be not robust.
2. Thresholds on impurity [29].
3. Applying the X2 test [16].
A number of studies showed that post-pruning is better than pre-pruning, which may have a negative effect on accuracy when the data are sparse.
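To make the preceding definitions concrete, the sketch below (our own illustration, not code from any of the cited systems) computes entropy, the GINI index, and the resulting information gain and gain ratio for a hypothetical split, following the formulas above.

```python
import math

def entropy(counts):
    """phi(PC(S)) for the information-gain criterion: sum of -p*log2(p) over classes."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gini(counts):
    """phi(PC(S)) for the GINI criterion: 1 - sum of squared class probabilities."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_impurity(partitions, phi):
    """Phi(S, A): weighted average impurity of the sub partitions produced by a split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * phi(p) for p in partitions)

def split_info(partitions):
    """Denominator of the gain ratio: entropy of the sub partition sizes themselves."""
    total = sum(sum(p) for p in partitions)
    return -sum((sum(p) / total) * math.log2(sum(p) / total) for p in partitions)

if __name__ == "__main__":
    parent = [40, 30, 30]                   # hypothetical class counts of node S
    split = [[40, 5, 5], [0, 25, 25]]       # class counts of the two sub partitions
    gain = entropy(parent) - weighted_impurity(split, entropy)
    print("information gain:", round(gain, 4))
    print("gain ratio      :", round(gain / split_info(split), 4))
    print("GINI reduction  :", round(gini(parent) - weighted_impurity(split, gini), 4))
```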
2.2. Tree pruning
It is the most popular way of generalization. It was initially proposed by Breiman et al. in [15], who suggested the following procedure: build the complete tree and then remove subtrees that do not contribute significantly to generalization accuracy. Some of the most commonly used pruning methods are: Reduced Error Pruning (REP) [25], Pessimistic Error Pruning (PEP) [25], Minimum Error Pruning (MEP) [30], Critical Value Pruning (CVP) [31], Cost Complexity Pruning (CCP) [15] and Error Based Pruning (EBP) [17]. A detailed comparison between these methods is given in [26].
2.3. Evaluation measures of decision trees
After building a decision tree, it is important to evaluate the performance and efficiency of the generated tree. One of the most important objectives behind obtaining a decision tree is to classify new unlabeled examples accurately. Accuracy can be calculated using the formula: accuracy = 1 − (number of misclassified examples / total number of examples), computed either over the whole dataset if pruning is applied, or over only the test/prune dataset if pruning is not applied. Another important objective is to classify unlabeled examples fast, that is, to perform a small number of comparisons over the examples' attribute values. Two measures can be used to test this characteristic in the desired tree:
− Average Leaf Level, defined as (Σ_{h=1}^{H} h · number of leaves at level h) / (total number of leaves), where H is the height of the tree.
− Average Number of Comparisons per Example, defined as (Σ_{h=1}^{H} h · number of leaf examples at level h) / (total number of examples).
The size of the tree is also an important measure of classifier efficiency, as it is a good indication of the complexity of building the tree from the computation, time and storage points of view. It is calculated directly as the total number of nodes - internal nodes and leaves - of the tree.
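The two depth-based measures above are easy to compute once the level and example count of each leaf are known. The sketch below is our own illustration with made-up leaf statistics, not data from the paper.

```python
# Sketch: computing Average Leaf Level and Average Number of Comparisons per Example.
# Each leaf is described by (level, number_of_examples); the values are hypothetical.

leaves = [(2, 120), (3, 40), (3, 25), (4, 15)]

total_leaves = len(leaves)
total_examples = sum(n for _, n in leaves)

# Average Leaf Level: each leaf contributes its level once.
average_leaf_level = sum(level for level, _ in leaves) / total_leaves

# Average Number of Comparisons per Example: each example contributes its leaf's level.
average_comparisons = sum(level * n for level, n in leaves) / total_examples

print("average leaf level        :", average_leaf_level)
print("avg comparisons / example :", average_comparisons)
```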
3. Previous work in decision tree classifiers
Within the area of decision tree classification, there exist a large number of algorithms for constructing decision trees. Most of these algorithms depend on the whole dataset residing in memory, even though today's databases are in general much larger than main memory.
SLIQ [20] was one of the pioneering algorithms that addressed dealing with disk resident data. To realize that and to guarantee fast retrieval of the disk resident data, SLIQ uses a class list data structure, which is a list having an entry for each example in the dataset, and ordered attribute list structures, which are built and sorted once for each attribute. The list of the currently tested attribute is the only one that needs to be in memory. But SLIQ requires the class list to reside in memory during the whole building phase, which imposes memory restrictions that limit its flexibility. To overcome these restrictions, SPRINT [23] substitutes both the attribute lists and the class list with a new attribute list structure having entries only for the examples of the currently processed partition. This list is updated after each split to contain the examples of the next partition to be processed, and to facilitate splitting of the attribute lists without rescanning the dataset, SPRINT uses a hash table structure that keeps information about the current partitions of the whole dataset. SPRINT thus runs with a minimal amount of main memory and scales to large training datasets, but the need for costly hash joins and for large hash tables to fit in memory are drawbacks of SPRINT [3].
Regarding binary tree building algorithms, handling categorical attributes to find the best split points may be computationally expensive, especially for attributes with a large number of values. A greedy algorithm, proposed initially for IND [19], is used to overcome this issue; SLIQ [20] enhances this greedy algorithm by using a hybrid approach. Handling categorical attributes also has large costs in the case of multiway trees. To overcome this problem, multiway tree building algorithms handle categorical attributes differently: C4.5 [17] splits a node over a categorical attribute's values into C sub nodes, where C is the number of the categorical attribute's values, while CHAID [21] and FIRM [22] perform extra value grouping to merge nodes.
Multiway trees, although favored over binary trees for their compactness and ease of comprehension by humans, suffer from the following two problems:
− Handling of numerical attributes to find the suitable split points. Some algorithms like FIRM [22] discretize numerical attributes and treat them as categorical ones, which can severely harm the accuracy and the semantics of the built trees. Other algorithms like C4.5 [17] apply binary tests as in binary trees. CRUISE [24] applies the computationally expensive ANOVA F-tests, Box-Cox transformation and Linear Discriminant Analysis (LDA).
− Being biased towards splitting on attributes with many values. This bias may build trees with a large number of leaves and lead to the choice of non relevant attributes.
Addressing the previous problems, we introduce a new algorithm, MPEG, that has modest memory requirements and no restrictions on dataset sizes, deals with both numerical and categorical attributes efficiently, and avoids the generation of too many splits for many-valued attributes before evaluation by the attribute selection criterion. We compared our algorithm with C4.5 [17], which is one of the most important and commonly used algorithms for building multiway trees. In the following sections, the proposed algorithm and comparisons with C4.5 will be discussed.
4. Multiway decision tree induction using projection and merging (MPEG)
The new algorithm handles dataset examples through a database management system, making use of the power of the DBMS to address the scalability problem, using the following structures:
− Indexed tables to hold the dynamic data of the generated partitions and temporary redundant data, to avoid rescanning the whole dataset.
− Clustered indexed tables to hold the static data, such as the predictor and class attribute values of the examples.
− Optimized queries to access the data stored on disk in an efficient way.
In building the decision tree, the following six steps are performed recursively over each impure partition:
1. Projection over attribute values.
2. Applying the X2 test of significance on each attribute.
3. Generating partitions.
4. Merging partitions.
5. Selection of the winning attribute.
6. Serialization and updating of the disk resident data.
Finally, error based pruning is performed to generalize the generated tree. 10-fold cross validation is applied in our technique, such that the building / pruning phases are applied 10 times. Each time, 90% of the examples are chosen to be used in the building phase and the remaining 10% in the pruning phase. Examples are chosen randomly but preserving the same class distribution as the whole dataset. Finally, the tree having the best results is taken as the required decision tree. In the following sections, the above mentioned steps are explained in detail.
4.1. Projection over attribute values
Data related to the dataset examples and their corresponding attribute values are stored in and retrieved
from two disk resident clustered indexed tables: Examples (Example Id, Class Id) and Example-attribute-values (Example Id, Attribute Id, Attribute Value). Projection is applied over the values of each attribute using optimized aggregate database queries, where examples having the same value are grouped together.
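As an illustration of this projection step, the sketch below builds the two tables in an in-memory SQLite database and runs an aggregate query that groups examples by attribute value and class. The schema mirrors the tables above, but the table and column names, and the use of SQLite itself, are our own illustrative choices rather than the exact storage layout used by MPEG.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE examples (example_id INTEGER PRIMARY KEY, class_id TEXT);
    CREATE TABLE example_attribute_values (
        example_id INTEGER, attribute_id INTEGER, attribute_value TEXT,
        PRIMARY KEY (attribute_id, attribute_value, example_id));
""")
conn.executemany("INSERT INTO examples VALUES (?, ?)",
                 [(1, "C1"), (2, "C2"), (3, "C1"), (4, "C2"), (5, "C2")])
conn.executemany("INSERT INTO example_attribute_values VALUES (?, ?, ?)",
                 [(1, 0, "v1"), (2, 0, "v1"), (3, 0, "v2"), (4, 0, "v2"), (5, 0, "v3")])

# Projection over attribute 0: group examples by attribute value and class,
# so each attribute value ends up with its class distribution.
rows = conn.execute("""
    SELECT eav.attribute_value, e.class_id, COUNT(*) AS n
    FROM example_attribute_values AS eav
    JOIN examples AS e ON e.example_id = eav.example_id
    WHERE eav.attribute_id = 0
    GROUP BY eav.attribute_value, e.class_id
    ORDER BY eav.attribute_value
""").fetchall()

for value, class_id, n in rows:
    print(value, class_id, n)
```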
4.2. Applying the X2 test of significance on each attribute
If the dataset has a large number of attributes, it may be useful to remove some of these attributes before going into the next, more costly, steps of partition generation, partition merging and evaluation of the winning attribute. The X2 test of independence, using the commonly used significance level of 5%, is applied, so that only attributes passing this test continue into the remaining steps. The information gained from the previous projection step is used in building the needed X2 contingency table and hence in computing the X2 value. As this test in its turn has a considerable computational overhead, it is applied only if the number of candidate attributes exceeds a certain threshold. From experimental results, this threshold is recommended to be 10.
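A minimal sketch of this filtering step is shown below: it builds the attribute-value/class contingency table from the projected counts and keeps the attribute only if the X2 statistic exceeds the 5% critical value. The counts are hypothetical, and the hand-rolled computation stands in for whatever statistics routine MPEG actually uses.

```python
# Sketch: X2 test of independence between one attribute's values and the class,
# using hypothetical counts obtained from the projection step.

contingency = [
    [30,  5],   # value v1: 30 examples of class C1, 5 of class C2
    [ 4, 28],   # value v2
    [10, 12],   # value v3
]

row_totals = [sum(row) for row in contingency]
col_totals = [sum(col) for col in zip(*contingency)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(contingency):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (observed - expected) ** 2 / expected

dof = (len(contingency) - 1) * (len(contingency[0]) - 1)
# Critical values of the chi-square distribution at the 5% significance level.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815}
keep_attribute = chi2 > CRITICAL_5_PERCENT[dof]
print(f"chi2 = {chi2:.2f}, dof = {dof}, keep attribute = {keep_attribute}")
```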
4.3. Generating partitions
After projection of the dataset examples over the values of each attribute, adjacent values that share the same set of class values are grouped together to form the candidate partitions, which are stored in a memory resident array of partitions. Each entry - partition - in this array has the following fields:
Partition ID: a unique identifier for each partition.
Class set: a set of records, each containing a class value and the number of examples of that class value that appear in this partition.
Min value: the minimum attribute value in this partition, in case of numerical attributes.
Max value: the maximum attribute value in this partition, in case of numerical attributes.
Categorical attribute set of values: the set of different values of the attribute that appear in this partition, in case of categorical attributes.
The following example shows the generation process: a set of examples is projected over attribute A, then examples whose values have the same class set are grouped together to form one partition.
Example Id   Attribute value   Class label
1            v1                C1
2            v1                C2
3            v2                C1
4            v2                C2
5            v3                C2
6            v4                C2
7            v5                C2
8            v5                C3
9            v6                C2
10           v7                C2
Grouping adjacent values with the same class set yields four partitions: Partition 1 = {v1, v2} with classes {C1, C2}, Partition 2 = {v3, v4} with class {C2}, Partition 3 = {v5} with classes {C2, C3} and Partition 4 = {v6, v7} with class {C2}.
Generating partitions in this way amounts to dividing the space of classes into sets of class values, disjointly covering the space of predictor attribute values, and then distributing the examples over these sets. This method is highly efficient due to the following considerations:
− Separation of predictor attribute values according to their class distribution leads to fast separation of the class values themselves in the next levels, and hence to more compact trees.
− Unlike most of the current algorithms, which search at the same time for both the best split points for each attribute and for the winning attribute, MPEG separates these two tasks and easily obtains a candidate solution to be enhanced and tested in later steps. This helps to reduce the complexity of the partitioning process. A sketch of this grouping step is given below.
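The following sketch (our own simplification of the partition array described above, not the MPEG implementation) groups adjacent projected values into candidate partitions, reusing the ten-example table from the example.

```python
from collections import Counter

# Projected examples of attribute A: (attribute value, class label), as in the example above.
projected = [("v1", "C1"), ("v1", "C2"), ("v2", "C1"), ("v2", "C2"),
             ("v3", "C2"), ("v4", "C2"), ("v5", "C2"), ("v5", "C3"),
             ("v6", "C2"), ("v7", "C2")]

# Class distribution per attribute value (the result of the projection step).
per_value = {}
for value, cls in projected:
    per_value.setdefault(value, Counter())[cls] += 1

# Group adjacent values that share the same *set* of class values into one partition.
partitions = []
for value in sorted(per_value):
    class_set = frozenset(per_value[value])
    if partitions and partitions[-1]["class_set"] == class_set:
        partitions[-1]["values"].append(value)
        partitions[-1]["counts"].update(per_value[value])
    else:
        partitions.append({"values": [value],
                           "class_set": class_set,
                           "counts": Counter(per_value[value])})

for i, p in enumerate(partitions, 1):
    print(f"Partition {i}: values={p['values']}, "
          f"classes={sorted(p['class_set'])}, counts={dict(p['counts'])}")
```

Running this prints the four partitions listed in the example above.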
4.4. Merging partitions
As discussed previously, bias towards the selection of attributes having many splits is a serious problem facing multiway tree building algorithms, as it may yield large sized trees, increase the possibility of selecting irrelevant attributes, and therefore lead to inaccurate trees. Merging the generated partitions can help in solving this problem: the number of generated partitions will then not differ significantly between attributes, and hence will not be the main factor in the evaluation used by the attribute selection criterion. Two types of merging are performed in the proposed algorithm:
♦ Merging of partitions having the same set of class values.
♦ Merging of partitions having different sets of class values.
4.4.1. Merging of partitions having the same set of class values. In this type of merging, partitions having the same set of class values are merged together to form one partition having one disjunctive expression as its splitting rule. The following example illustrates this merging method.
[Figure: an example of merging partitions having the same set of class values. Five partitions P1-P5 are generated over a numerical attribute V; P1, P3 and P5 each contain the class set {C1, C2, C3}, with splitting rules of the form V < R1, V < R3 and V < R5, while P2 and P4 contain different class sets. P1, P3 and P5 are merged into a single partition whose splitting rule is the disjunction of their value ranges; P2 and P4 remain unchanged.]
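A minimal sketch of this first merging type is given below, using our own simplified partition representation (each partition holds its class set and its numerical value range); the partition names, class sets and ranges echo the figure above but are illustrative assumptions, not values taken from the paper.

```python
# Sketch: merging partitions that have the same set of class values.
# Partition layout and ranges are illustrative only.

partitions = [
    {"name": "P1", "classes": frozenset({"C1", "C2", "C3"}), "range": (0, 10)},
    {"name": "P2", "classes": frozenset({"C3"}),             "range": (10, 20)},
    {"name": "P3", "classes": frozenset({"C1", "C2", "C3"}), "range": (20, 30)},
    {"name": "P4", "classes": frozenset({"C4", "C5"}),       "range": (30, 40)},
    {"name": "P5", "classes": frozenset({"C1", "C2", "C3"}), "range": (40, 50)},
]

merged = {}
for p in partitions:
    key = p["classes"]
    if key in merged:
        merged[key]["names"].append(p["name"])
        merged[key]["ranges"].append(p["range"])   # splitting rule becomes a disjunction of ranges
    else:
        merged[key] = {"names": [p["name"]], "ranges": [p["range"]]}

for classes, m in merged.items():
    rule = " OR ".join(f"{lo} <= V < {hi}" for lo, hi in m["ranges"])
    print(f"{'+'.join(m['names'])}: classes={sorted(classes)}, rule: {rule}")
```

With these assumed class sets, P1, P3 and P5 collapse into one partition whose rule is a disjunction of three value ranges, while P2 and P4 are left as they are.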