J Glob Optim (2011) 51:301–311 DOI 10.1007/s10898-010-9614-9
Pareto-optimality of oblique decision trees from evolutionary algorithms Jose Maria Pangilinan · Gerrit K. Janssens
Received: 1 April 2010 / Accepted: 15 September 2010 / Published online: 5 October 2010 © Springer Science+Business Media, LLC. 2010
J. M. Pangilinan, Saint Louis University, Baguio City, Philippines. e-mail: [email protected]
G. K. Janssens (B), IMOB, Hasselt University, Diepenbeek, Belgium. e-mail: [email protected]
Abstract This paper investigates the performance of evolutionary algorithms in the optimization aspects of oblique decision tree construction and describes their performance with respect to classification accuracy, tree size, and Pareto-optimality of their solution sets. The performance of the evolutionary algorithms is analyzed and compared to the performance of exhaustive (traditional) decision tree classifiers on several benchmark datasets. The results show that the classification accuracy and tree sizes generated by the evolutionary algorithms are comparable with the results generated by traditional methods in all the sample datasets, and that in the large datasets the multiobjective evolutionary algorithms generate better Pareto-optimal sets than the exhaustive methods. The results also show that a classifier, whether exhaustive or evolutionary, that generates the most accurate trees does not necessarily generate the shortest trees or the best Pareto-optimal sets.

Keywords Pareto-optimality · Decision tree · Evolutionary algorithm · Multiobjective optimization · Classification · Data mining
1 Introduction

The data mining task of classification using decision trees (DT) has been studied for many years. Researchers in the fields of statistics, decision theory, and machine learning have reported a large amount of work on this topic. Significant improvements on decision trees are difficult to produce, but promising research directions remain open and have yet to be explored. One of these is the application of evolutionary algorithms to decision tree construction and optimization. The objective of this paper is to apply evolutionary algorithms (EA) in the
optimization aspects of oblique decision tree construction, describe the EA performance, and investigate the Pareto-optimality of their solution sets. Several issues have to be considered simultaneously in this exploration, such as computational complexity, accuracy, depth of tree, tree size, balance, and stability. However, for clarity and visualization purposes, the study uses two criteria only and reduces the multiobjective decision tree problem to a bi-objective optimization problem that minimizes the tree size and maximizes the classification accuracy. The problem is defined as follows: Let T be the set of all decision trees induced from a dataset D. Let T ∈ T be a decision tree and let l denote a leaf node in T. Let α be the classification accuracy of the tree T on the dataset D and let the size of the tree x = |{l | l ∈ T}|. The multiobjective decision tree problem is to find a set of non-dominated trees, and the optimization problem on a decision tree T can be formulated as

Minimize x
Maximize α    (1)
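For concreteness, the two objectives in (1) can be computed for any induced tree. The following sketch is not from the study itself; it uses scikit-learn's axis-parallel trees and the Iris data purely as illustrative stand-ins, with tree size taken as the number of leaves, as in the definition above.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def objectives(tree, X, y):
    # Tree size x = |{l | l in T}| (number of leaves) and accuracy alpha on D.
    return tree.get_n_leaves(), tree.score(X, y)

X, y = load_iris(return_X_y=True)
# Trees grown to different depths trace out the size/accuracy trade-off of (1).
for depth in (2, 4, 8):
    t = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    print(objectives(t, X, y))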
The primary contribution of this paper is that it investigates decision tree optimality not only in terms of classification accuracy or tree size but also in terms of the quality of the solution sets generated by each classifier. The paper is organized as follows: Sect. 2 presents a review of related literature, Sect. 3 presents the methodology and experiments, and Sect. 4 states the conclusions of the study.
2 Literature review

2.1 Splitting criteria

An early technique by Quinlan [1] that influenced a large part of the research on decision trees is useful to look at in order to understand the basics of decision tree construction. A fundamental part of any algorithm that constructs a decision tree from a dataset is the method by which it selects attributes at each node of the tree. Some attributes split the data more purely than others; that is, their values correspond more consistently with instances that have particular values of the target attribute than the values of other attributes do. Such attributes therefore have some underlying relationship with the target attribute. Essentially, split selection means finding a measure that compares attributes with each other and then deciding which variables split the data more purely higher up the tree. There have been many studies on optimizing split selection. A split can be either univariate (axis-parallel) or multivariate (oblique). An axis-parallel split divides a node with only one predictor, whereas an oblique split divides a node with a linear or non-linear combination of predictors. Trees that are induced using oblique splits are called oblique trees. Finding optimal linear splits is known to be intractable, so heuristic methods are required. Methods for finding good linear splits include linear discriminant analysis, hill-climbing search, linear programming, and perceptron training in neural networks. This section examines existing literature that addresses the problem of finding oblique splits for classification trees. The more popular approach to finding univariate splits examines all possible binary splits along each predictor variable to select the split that most reduces some measure of impurity. Such a method is used in Classification and Regression Trees (CART), Quinlan's C5.0, and Morgan's THAID. This type of exhaustive search has two major drawbacks. First, the computational complexity of finding splits for nominal variables increases exponentially with the number of categorical values, and obtaining linear combinations of ordered predictor variables becomes computationally difficult. Second, it tends to select variables that have more values.
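The exhaustive univariate search just described can be made concrete with a short sketch. The code below is illustrative only (plain NumPy, with the Gini measure standing in for "some measure of impurity"; the function names are ours, not from any cited system): it scores every threshold on every predictor and returns the purest axis-parallel split.

import numpy as np

def gini(y):
    # Gini impurity of a set of class labels.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_axis_parallel_split(X, y):
    n, d = X.shape
    best = (np.inf, None, None)              # (weighted impurity, feature, threshold)
    for j in range(d):                       # each predictor variable
        for t in np.unique(X[:, j])[:-1]:    # each candidate binary split point
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[0]:
                best = (score, j, t)
    return best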
Loh and Vanichsetakul [2] remove the bias in variable selection for ordered predictor variables by separating the selection of the variable from the selection of the split point. They calculate the F-statistic for each predictor variable, select the variable with the highest F-statistic, and apply linear discriminant analysis to find the split point. Unordered variables are transformed into ordered variables, but this transformation is not free from bias. Loh and Shih [3] further developed Quick, Unbiased, Efficient, Statistical Tree (QUEST) to address the disadvantages of their first algorithm, e.g. that a split rapidly depletes the learning sample when a node is split into as many sub-nodes as there are classes. Loh and Shih conclude that in terms of classification accuracy, variability of splits, and tree size, there is no clear winner when univariate splits are used. However, QUEST trees based on linear combination splits are usually shorter and more accurate than the same trees based on univariate splits. Preliminary work in oblique decision trees was done by Breiman et al. [4]. They introduced CART with linear combinations (CART-LC) as a method to create oblique splits. CART-LC iteratively finds locally optimal values for each of the coefficients. It generates and tests hyperplanes until the marginal benefits become smaller than a specified constant, and it uses a backward-deletion procedure to simplify the structure of the split by deleting variables that contribute little to the effectiveness of the split. Heath et al. [5] developed an algorithm (SADT) to induce oblique decision trees based on simulated annealing. The algorithm first creates an initial random hyperplane to divide the dataset and then creates hyperplanes on the two new partitions; this process of creating hyperplanes is recursive. Their experiments show that simulated annealing works well for generating oblique decision trees since it generates smaller trees without reducing classification accuracy. Murthy et al. [6] developed a randomized algorithm, Oblique Classifier 1 (OC1), which can be regarded as an extension of CART-LC. The primary contribution of OC1 to oblique decision tree research is its perturbation algorithm for finding linear splits: a randomization of hyperplanes to escape local minima. Murthy et al. note that there are data distributions on which univariate splits perform better than oblique splits; they account for such distributions in OC1 by computing the best axis-parallel split at each node before applying hyperplane perturbation. The experiments of Murthy et al. show that OC1 outperforms CART-LC on several datasets. Vadera [7] presents a new algorithm that utilizes linear discriminant analysis to identify oblique relationships between continuous attributes and then carries out an appropriate modification to ensure that the resulting tree errs on the side of safety. The algorithm is evaluated with respect to a cost-sensitive algorithm (ICET), an oblique decision tree algorithm (OC1), and a linear programming approach. A study of multi-class binary decision trees with oblique planes using nonlinear programming can be found in Street [8].
Street finds that a nonlinear programming approach to oblique separating planes (OC-SEP) appears to offer an advantage over some axis-parallel methods but is more computationally demanding than greedy algorithms, and the OC-SEP implementation is still not feasible for large-scale problems. He suggests that incorporating feature selection into the objective function can achieve enhanced classification accuracy and interpretability.
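To close this subsection, the following sketch contrasts an oblique split with the univariate search above: a single hyperplane a·x + b over all predictors partitions a node, and the same impurity idea scores it. The data, parameters, and names are illustrative assumptions, not drawn from any cited system.

import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def oblique_split_impurity(X, y, a, b):
    # Partition the node by the hyperplane a.x + b > 0 (all predictors at once).
    side = X @ a + b > 0
    nl, nr = int((~side).sum()), int(side.sum())
    if nl == 0 or nr == 0:
        return float("inf")              # degenerate hyperplane: nothing is split
    return (nl * gini(y[~side]) + nr * gini(y[side])) / (nl + nr)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # classes separated by an oblique boundary
print(oblique_split_impurity(X, y, a=np.array([1.0, 1.0]), b=0.0))  # 0.0: a pure split

No axis-parallel split can achieve impurity 0 on such data, which is exactly the case the oblique methods surveyed above are designed for.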
2.2 Performance measures

Several criteria exist to evaluate a decision tree algorithm on a given dataset against another algorithm on the same dataset. Conciseness of the decision tree that an algorithm produces may be one measure, and clarity of the rules derived from the tree is another. It is also possible to look at how fast an algorithm is, but the interest of most researchers is in an algorithm's learning ability and classification accuracy. Moret [9] summarizes work on measures such as tree size, expected testing cost, and worst-case testing cost. He shows that an algorithm that minimizes one measure does not guarantee minimization of other measures. On the other hand, Fayyad and Irani [10] argue that by concentrating optimization on one measure, improvement in the performance of other measures is possible. Kononenko and Bratko [11] point out that comparisons based on classification accuracy are unreliable, because different classifiers produce different types of estimates and accuracy values can vary with the prior probabilities of the classes. They suggest an information-based metric to evaluate a classifier as a remedy to these accuracy problems. Tree quality depends more on stopping rules than on splitting rules, and pruning is argued to be a better method than stop-splitting rules [6]; no single pruning method has been adjudged superior to the others. Obtaining the 'right'-sized trees is important for several reasons, which depend on the size of the classification problem [12]. For moderate-sized problems, the critical issues are classification accuracy, error estimation, and gaining insight into the predictive and generalization structure of the data. For large tree classifiers, the critical issue is optimizing structural properties such as height and balance [13]. Murthy [14] notes that the time complexity of induction and post-processing is exponential in tree height in the worst case. This puts a premium on designs that produce shallower trees, multi-way rather than binary splits, and selection criteria that prefer balanced splits. Several authors have designed methods to improve upon greedy algorithms by constructing near-optimal decision trees using a look-ahead algorithm [15], genetic programming [16], and simulated annealing [5,17,18]. Many authors suggest that using a collection of decision trees, instead of just one, reduces the variance in classification performance [19–21]. The idea is to build a set of trees (an ensemble) from the same training sample and then combine their results. Multiple trees have been built using randomness or using different subsets of attributes for each tree [22].

2.3 Evolutionary algorithms for decision trees

An evolutionary algorithm is a promising technique to build oblique decision trees. Cantu-Paz and Kamath [23] summarize the advantages of an EA as follows:
1. EAs can consider more than one coefficient at a time and may not be trapped in local optima as easily as simple greedy hill-climbing algorithms.
2. EAs have been shown to scale well to the dimensionality of the problem [24].
3. EAs are tolerant to noisy fitness evaluations [25].
4. EAs are stochastic algorithms; they produce different trees on the same dataset that can easily be combined into ensembles.
Cantu-Paz and Kamath [23] extended the OC1 algorithm of Murthy et al.
[6] into three new algorithms by using evolution strategies, genetic algorithms, and simulated annealing to find oblique partitions on a variety of datasets, including seven datasets from the UCI (University of California at Irvine) machine learning repository. They compared the performance of six
algorithms according to accuracy, number of nodes, and execution time. They conclude (1) that the EAs scale up better than traditional methods (OC1-AP, OC1, and CART-LC) to the dimensionality of the data and (2) that creating ensembles with the EAs results in higher accuracy than the single trees produced by existing methods. The study suggests that future work may be done on scalability using larger datasets, on the design of specialized operators, and on the integration of local hill-climbing algorithms into EAs. Sörensen and Janssens [26] developed a genetic algorithm for binary decision trees to overcome the drawbacks of the automatic interaction detection (AID) technique. The technique uses specialized genetic operators that preserve the structure of the trees to find a diverse set of DTs with high explanatory power. The algorithm's advantage compared to the AID technique is that it gives the decision maker a set of high-fitness decision trees to choose from instead of only one DT. Papagelis and Kalles [27] proposed a genetic algorithm (GAtree) to evolve binary decision trees and introduced a single-objective fitness function that balances accuracy and size of decision trees. The GAtree algorithm randomly picks a variable to create a node, and a binary decision tree is completed down to the leaves in the same manner; there is no discussion of a splitting criterion. The fittest chromosome is selected by cross-validation. Their experiments used selected datasets from the UCI repository, and the GAtree performance was compared to C4.5 and the One-R algorithm. They find that the time burden of the GAtree is substantially greater than that of greedy heuristics; on the other hand, their experiments show advantages over greedy heuristics when there are irrelevant or strongly dependent variables. Bot and Langdon [28] applied genetic programming (GP) to linear classification trees using limited error fitness (LEF), cross-validation, and Pareto scoring on continuous data attributes from four UCI repository datasets. The study measured the GP's performance, accuracy, and tree size in comparison to OC1, C5.0, and M5'. They used the GPsys program by Qureshi (1997), which is a strongly-typed, steady-state GP system. They conclude that on some datasets the GP works comparably to or better than the reported accuracy of other decision tree algorithms, but the GP performs worse on other datasets. Furthermore, Pareto scoring with fitness sharing is a promising approach to reducing tree size but creates larger trees than Pareto domination, while LEF neither improves nor worsens classification accuracy but saves much execution time. Kim [29] proposed an evolutionary multiobjective optimization (EMO) approach that searches for the best classification accuracy rate for different-sized trees using genetic programming. He introduces structural risk minimization to find a desirable number of rules for a given error bound. His EMO reduces the size of trees dramatically, and his best rule set is better than that of C4.5, but the computing time is longer. A multi-criteria approach to evaluating decision trees was presented by Bryson [30]. He provided a method to assist decision makers in selecting an efficient decision tree by ranking decision trees using a weighting model. The model evaluates a decision tree on several performance measures such as accuracy rate, tree simplicity and size, stability, and the discrimination power of its predictors.
Zhao [31] presented a multiobjective genetic programming (MOGP) system for developing Pareto-optimal decision trees. The system allows the decision maker to make tradeoffs in several ways based on his estimates of classification errors, and recommends a set of alternative solutions accordingly. As an evolutionary approach, the system visualizes the progress of the evolution of solutions so that the decision maker can stop the procedure when satisfactory solutions have been found or when the solutions on the front appear to have stabilized.
3 Methods and experiments

Doumpos et al. [32] introduced an experimental method to evaluate several classification methods, such as statistical techniques and non-parametric methods, in terms of how they handle different data characteristics. Most of the studies presented in Sect. 2 also evaluate the performance of decision tree algorithms on different datasets by measuring classification accuracy and tree size, i.e. minimizing the tree size and maximizing the classification accuracy. Whereas most studies show averages of tree size and classification accuracy to measure performance, averaging has its limitations. Averages do not show tradeoff solutions in a multiobjective optimization problem, i.e. solutions in the search space that are Pareto-optimal or non-dominated cannot be described by averages [33]. The current paper intends to describe the quality of solutions of different decision tree algorithms by measuring the quality of their Pareto-optimal sets in addition to tree size and classification accuracy. The concepts of Pareto optimality are formalized as follows [34]:

1. A solution x^1 is said to dominate a solution x^2 (denoted by x^1 ≺ x^2) if and only if (∀i ∈ M : f_i(x^1) ≤ f_i(x^2)) ∧ (∃i ∈ M : f_i(x^1) < f_i(x^2)), where M is the set of objectives.
2. A solution x^1 is said to be Pareto-optimal if and only if ¬∃ x^2 : x^2 ≺ x^1.
3. Pareto-optimal set: PS = {x^1 | ¬∃ x^2 : x^2 ≺ x^1}.
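These definitions translate directly into code. The sketch below is ours, not the authors': it assumes every objective is to be minimized, so a maximized objective such as accuracy is negated first.

def dominates(f1, f2):
    # x1 dominates x2: no worse in every objective, strictly better in at least one.
    return all(a <= b for a, b in zip(f1, f2)) and any(a < b for a, b in zip(f1, f2))

def pareto_set(solutions):
    # Keep the objective vectors that no other member dominates.
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# Example with (tree size, -accuracy): smaller is better in both coordinates.
print(pareto_set([(9, -0.81), (13, -0.75), (29, -0.84), (31, -0.80)]))
# -> [(9, -0.81), (29, -0.84)]: the two trade-off trees survive.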
There are three main classifications of performance metrics for evaluating the quality of non-dominated sets. The first classification evaluates the convergence, or proximity, of the non-dominated set to the Pareto front based on several concepts (e.g. dominance, distance, and set membership). The second classification evaluates the diversity of a non-dominated set by calculating the spread of solutions along its front. The third classification evaluates both convergence and diversity. The hypervolume [35] is a metric that measures both convergence and diversity of a non-dominated set. It calculates the volume covered by the members of the non-dominated set Q. For each solution i ∈ Q, a hypercube v_i is constructed with a reference point and the solution i as the diagonal corners of the hypercube. The reference point can be found by constructing a vector of worst objective function values. The hypervolume (HV) is calculated as

HV = volume( ∪_{i=1}^{|Q|} v_i )    (2)
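In the bi-objective case used here, the union in (2) reduces to a sum of rectangle areas swept in order of the first objective. The sketch below assumes both objectives are minimized (accuracy negated), that Q is already non-dominated, and that the reference point is worse than every member of Q; it is an illustration, not the implementation used in the experiments.

def hypervolume_2d(Q, ref):
    # Q: list of (f1, f2) pairs, both minimized; ref: worst-case reference point.
    pts = sorted(Q)                    # ascending in f1, hence descending in f2
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)   # area slab added by this point
        prev_f2 = f2
    return hv

print(hypervolume_2d([(1, 4), (2, 2), (4, 1)], ref=(5, 5)))  # 11.0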
Figure 1 shows the hypervolume of two non-dominated sets A and B in a minimization problem. The hypervolume of set A is greater than the hypervolume of set B, which means that set A is a better non-dominated set than B.

Fig. 1 Illustration of the concept of hypervolume

3.1 Multi-objective evolutionary algorithm

The proposed multiobjective evolutionary algorithm is an adaptation of OC1 [14]; the extension replaces the original hill-climbing and perturbation algorithms of OC1 with an evolutionary process. The evolutionary algorithm in this study has its own variation operators (described later) but adopts the selection procedures of two well-known multi-objective evolutionary algorithms (MOEA), namely the Strength Pareto Evolutionary Algorithm (SPEA2) [36] and the Non-dominated Sorting Genetic Algorithm (NSGA-II) [37], for purposes of comparison. These two selection procedures are chosen for the study as they
have been benchmark procedures in most of the MOEA studies presented in Sect. 2 and as shown in [38]. The MOEA in this study is a search function that finds a linear combination of the predictor variables that best splits the dataset. A chromosome, or individual, consists of real-parameter genes that represent the coefficients of a hyperplane that splits the dataset to form a node of the decision tree. The length of the chromosome is fixed and equal to the number of dimensions d of the dataset. Each gene in a chromosome represents the coefficient of a predictor variable x_i, where i = 1, 2, …, d in a d-dimensional dataset, and is a real random number in [−1, 1]. The decision tree is represented as a binary tree of chromosomes. The initial population is composed of chromosomes of linear splits. An axis-parallel chromosome is added to the initial population to ensure that the initial population can capture univariate splits in the early generations of the EA. The experiment adopts a non-uniform mutation operator, which is described as follows:

x_i^{t+1} = x_i^t + p (x_i^l − x_i^u) h(t)    (3)
h(t) = 1 − r^{(1 − t/t_max)^b}    (4)
The variable p is either −1 or 1, each with probability 0.5. The terms x_i^l and x_i^u are the lower and upper bounds that define the decision space and restrict the decision variable x_i to a value within the decision space at generation t. The parameter r is a real random number in the interval [0, 1], t is the generation counter, and t_max is the maximum number of generations. The parameter b is a user-defined input that determines the degree of non-uniformity and usually has a value greater than one. Equation 3 ensures that mutation accomplishes uniform exploration in the first generations while the search becomes local in the last generations. Equation 4 makes the mutation non-uniform: the probability of creating solutions closer to the parent increases with the number of generations. Recombination is implemented as the arithmetic average of both parent vectors. The arithmetic crossover is defined as

z_i = (x_i + y_i) / 2    (5)
where x and y are the parents, z is the offspring, i = 1, …, m, and m is the number of dimensions of the chromosome. The arithmetic crossover can generate any point in the search space as defined by the parents and possesses a feature that may constitute adaptive search: if the difference between the parent solutions is small, the difference between the offspring solutions is also small, and if the difference between the parent solutions is large, the difference between the offspring solutions is large.
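The variation operators of Eqs. (3)-(5) are simple to state in code. The sketch below is a minimal reading of the text: gene bounds of [−1, 1] as stated, b = 2 as an assumed default, and clamping to the bounds added by us since out-of-range coefficients are not discussed.

import random

LOW, UP = -1.0, 1.0                            # gene bounds x_i^l, x_i^u from the text

def nonuniform_mutation(x, t, t_max, b=2.0):
    # Eqs. (3)-(4): steps are near-uniform at t = 0 and shrink toward t_max.
    child = []
    for xi in x:
        p = random.choice((-1.0, 1.0))         # sign, each with probability 0.5
        h = 1.0 - random.random() ** ((1.0 - t / t_max) ** b)
        xi_new = xi + p * (LOW - UP) * h
        child.append(min(max(xi_new, LOW), UP))  # clamp to bounds (our assumption)
    return child

def arithmetic_crossover(x, y):
    # Eq. (5): the offspring is the component-wise average of the parents.
    return [(xi + yi) / 2.0 for xi, yi in zip(x, y)]

parent = [random.uniform(LOW, UP) for _ in range(4)]
print(arithmetic_crossover(parent, nonuniform_mutation(parent, t=10, t_max=100)))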
This arithmetic crossover keeps the mean objective values of the offspring population the same while concurrently increasing the population diversity in general. The fitness of a chromosome is evaluated by an impurity measure called the twoing rule [4]. The experiment uses the hypervolume (2) to describe the Pareto-optimality of the solution sets of each classifier.

3.2 Experiments and results

To evaluate the EA effectively for classification purposes, datasets from the University of California at Irvine (UCI) machine learning repository are excellent benchmarks, as they have proven to be representative in classification studies. Six standard classification datasets are chosen from the UCI machine learning repository. Four of them are small datasets, in which the number of instances n < 1,000, and two are large datasets (Optical-digits and Pen-digits) with n > 5,000. An additional artificial dataset, the RCB dataset, is used to test an algorithm's performance in generating linear splits. Ten fivefold cross-validation experiments are used on the small datasets and the RCB dataset. Cross-validation is a technique to estimate the accuracy of a predictive model by partitioning the sample dataset into complementary subsets, analyzing one subset (the training set), and validating the analysis on the other subset (the validation set). For the large datasets, the accuracy is evaluated with testing sets. Fifty trees are generated from each dataset for each classifier. A population size of 28, estimated following [39], is run for 100 generations. A sensitivity analysis of mutation and recombination rates in decision tree construction is also conducted using the same datasets listed above. The findings show that different mutation and recombination rates influence the performance of the MOEA differently across the datasets. However, on most of the datasets, mutation and recombination rates greater than 0.40 generated trees of either higher classification accuracy or smaller size. Hence, a mutation and a recombination rate of 0.5 were used in this study. For the OC1 classifier, twenty random hyperplanes are used at each node, the best hyperplane is selected through hill-climbing, and five perturbations are allowed for each hyperplane. The axis-parallel (AP) classifier does not require any input parameter.

Table 1 shows the hypervolume, classification accuracy, and average tree size of the non-dominated sets of the decision trees induced by the four classifiers from the small datasets; the best solutions are marked with an asterisk. The averages in accuracy do not show significant differences among the classifiers. The AP classifier has the best accuracy for the Glass and Vehicle datasets and SPEA2 for the Housing and Vowel datasets, whereas OC1 has the worst accuracy on three datasets. The tree sizes of the oblique classifiers are smaller than those of the AP classifier on all datasets. In terms of computing time, the AP classifier is the fastest, followed by OC1 and then by the EA classifiers. In terms of hypervolume, the EA classifiers have larger values than the AP and OC1 classifiers except on the Vehicle dataset. This means that the EAs generate better Pareto-optimal solutions than both AP and OC1 on the other three datasets. Table 2 shows the results for the large datasets. The averages in accuracy do not show significant differences among the four classifiers, but the tree sizes of the oblique classifiers are much smaller than those of the AP trees.
The worst solutions in terms of accuracy, size, and hypervolume are produced by the AP classifier, while the best solutions are generated by the oblique classifiers. OC1 has the best tree size, accuracy, and hypervolume on the RCB dataset, as this dataset has oblique splits only. The computing time of the AP classifier remains the fastest, but the computing time of OC1 has increased considerably, and the EA classifiers still have the worst execution time. However, the EAs generate better hypervolumes than the OC1 and AP classifiers on the other large datasets.
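For readers who wish to reproduce the flavour of this protocol, the sketch below runs ten fivefold cross-validation experiments with scikit-learn's axis-parallel trees standing in for the classifiers compared here; the dataset, like everything else in the sketch, is an illustrative assumption rather than part of the study.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
accs, sizes = [], []
for rep in range(10):                               # ten fivefold experiments
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=rep)
    for train, test in cv.split(X, y):
        t = DecisionTreeClassifier(random_state=rep).fit(X[train], y[train])
        accs.append(t.score(X[test], y[test]))      # accuracy on the held-out fold
        sizes.append(t.get_n_leaves())              # tree size as number of leaves
print(f"accuracy {np.mean(accs):.3f}, tree size {np.mean(sizes):.1f}")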
Table 1 Comparison of algorithms in terms of accuracy, tree size, and hypervolume on the small datasets (best solutions marked with *)

Dataset   Measure        AP        OC1       SPEA2     NSGA2
Glass     Accuracy       64.63*    61.57     64.24     64.53
          Tree size      13.00     9.12      8.72*     9.12
          Hypervolume    11.94     12.73     13.53*    13.45
Housing   Accuracy       80.36     74.49     80.47*    79.29
          Tree size      35.94     31.28     33.08     28.96*
          Hypervolume    13.96     13.14     16.10     16.20*
Vehicle   Accuracy       69.42*    67.33     67.84     68.71
          Tree size      44.54     31.44     29.48*    36.58
          Hypervolume    15.45     15.49*    13.34     13.30
Vowel     Accuracy       74.72     78.75     81.15*    80.64
          Tree size      60.76     28.46*    31.66     31.68
          Hypervolume    2.47      14.20     15.74     16.57*
Table 2 Comparison of classifiers in terms of accuracy, tree size, and hypervolume on the large datasets (best solutions marked with *)

Dataset          Measure        AP        OC1       SPEA2     NSGA2
RCB              Accuracy       92.85     98.37*    97.54     97.62
                 Tree size      83.72     12.78*    18.50     17.28
                 Hypervolume    2.47      13.22*    12.38     13.04
Optical-digits   Accuracy       84.78     86.51     88.69*    88.62
                 Tree size      148.14    54.44*    59.08     57.26
                 Hypervolume    1.80      8.90      12.37*    11.74
Pen-digits       Accuracy       90.91     92.88     94.14     94.20*
                 Tree size      211.10    74.98     71.72*    76.64
                 Hypervolume    1.79      7.41      10.64*    9.43
4 Conclusions This study presented an application of evolutionary algorithms in decision tree construction and an analysis of their performance in terms of classification accuracy, tree size, and the
hypervolume of their Pareto-optimal sets. The experiments show that no classifier is superior on the small datasets, whereas the oblique classifiers perform better than the AP classifier on the large datasets in terms of classification accuracy, tree size, and hypervolume. The MOEAs have the worst computing time on all datasets, but they generate larger hypervolumes in most cases, which means that the MOEAs generate better non-dominated solutions. The results also show that a classifier that generates the most accurate trees does not necessarily generate the shortest trees or the best non-dominated sets.
References

1. Quinlan, R.: Induction of decision trees. Mach. Learn. 1, 81–106 (1986)
2. Loh, W., Vanichsetakul, N.: Tree structured classification via generalized discriminant analysis. J. Am. Stat. Assoc. 83, 715–725 (1988)
3. Loh, W., Shih, Y.: Split selection methods for classification trees. Stat. Sin. 7, 815–840 (1997)
4. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth, Belmont (1984)
5. Heath, D., Kasif, S., Salzberg, S.: Induction of oblique decision trees. In: Proceedings of IJCAI, Chambery, pp. 1002–1007. Morgan Kaufmann (1993)
6. Murthy, S., Kasif, S., Salzberg, S.: A system for induction of oblique decision trees. J. Artif. Intel. Res. 2, 1–32 (1994)
7. Vadera, S.: Safer oblique trees without costs. Expert Syst. 22, 206–221 (2005)
8. Street, N.: Oblique multi-category decision trees using nonlinear programming. J. Comput. 17, 25–31 (2005)
9. Moret, B.: Decision trees and diagrams. Comput. Surv. 14(4), 593–623 (1982)
10. Fayyad, U., Irani, K.: What should be minimized in a decision tree? In: Proceedings of the National Conference on Artificial Intelligence, Boston, pp. 749–754 (1990)
11. Kononenko, I., Bratko, I.: Information based evaluation criterion for classifier's performance. Mach. Learn. 6, 67–80 (1991)
12. Gelfand, S., Ravishankar, C., Delp, E.: An iterative growing and pruning algorithm for classification tree design. IEEE Trans. Pattern Anal. Mach. Intel. 13, 163–174 (1991)
13. Wang, Q., Suen, C.: Analysis and design of a decision tree based on entropy reduction and its application to large character set recognition. IEEE Trans. Pattern Anal. Mach. Intel. 6, 406–417 (1984)
14. Murthy, S.: Automatic construction of decision trees from data: a multi-disciplinary survey. Data Min. Knowl. Discov. 2, 345–398 (1998)
15. Murthy, S., Salzberg, S.: Look-ahead and pathology in decision tree induction. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, pp. 1025–1033 (1995)
16. Koza, J.: Hierarchical genetic algorithms operating on populations of computer programs. In: Proceedings of the 11th International Joint Conference on Artificial Intelligence, pp. 768–774 (1989)
17. Lutsko, J.F., Kuijpers, B.: Simulated annealing in the construction of near-optimal decision trees. In: Cheeseman, P., Oldford, R.W. (eds.) Selecting Models from Data: AI and Statistics IV, pp. 453–462. Springer, Berlin (1994)
18. Folino, G., Pizzuti, C., Spezzano, G.: Genetic programming and simulated annealing: a hybrid method to evolve decision trees. In: Proceedings of the European Conference on Genetic Programming, Edinburgh, pp. 294–303 (2000)
19. Kwok, S., Carter, C.: Multiple decision trees. In: Proceedings of the 4th Annual Conference on Uncertainty in Artificial Intelligence, Cambridge, pp. 327–325 (1990)
20. Shlien, S.: Nonparametric classification using matched binary decision trees. Pattern Recognit. Lett. 13, 83–88 (1992)
21. Buntine, W.: Learning classification trees. Stat. Comput. 2, 63–73 (1992)
22. Heath, D., Kasif, S., Salzberg, S.: k-DT: a multi-tree learning method. In: Proceedings of the 2nd International Workshop on Multi-strategy Learning, West Virginia, pp. 138–149 (1993)
23. Cantu-Paz, E., Kamath, C.: Inducing oblique decision trees with evolutionary algorithms. IEEE Trans. Evol. Comput. 7, 54–68 (2003)
24. Harik, G., Cantu-Paz, E., Goldberg, D., Miller, B.: The gambler's ruin problem, genetic algorithms, and the sizing of populations. Evol. Comput. 7, 231–253 (1999)
25. Miller, B., Goldberg, D.: Genetic algorithms, selection schemes, and the varying effects of noise. Evol. Comput. 4, 113–131 (1996)
26. Sörensen, K., Janssens, G.K.: Data mining with genetic algorithms on binary trees. Eur. J. Oper. Res. 151, 253–264 (2003)
27. Papagelis, A., Kalles, D.: Breeding decision trees using evolutionary techniques. In: Proceedings of the 18th International Conference on Machine Learning, Williamstown, pp. 393–400 (2001)
28. Bot, M., Langdon, W.: Application of genetic programming to induction of linear classification trees. In: Proceedings of the European Conference on Genetic Programming, Edinburgh, pp. 247–258. Springer (2000)
29. Kim, D.: Structural risk minimization on decision trees using evolutionary multiobjective optimization. Lecture Notes in Computer Science, vol. 3003, pp. 338–348. Springer (2004)
30. Bryson, N.: Evaluation of decision trees: a multicriteria approach. Comput. Oper. Res. 31, 1933–1945 (2004)
31. Zhao, H.: A multiobjective genetic programming approach to developing Pareto-optimal decision trees. Decis. Support Syst. 43, 809–826 (2007)
32. Doumpos, M., Chatzi, E., Zopounidis, C.: An experimental evaluation of some classification methods. J. Glob. Optim. 36(2), 33–50 (2006)
33. Chinchuluun, A., Pardalos, P.M.: A survey of recent developments in multiobjective optimization. Ann. Oper. Res. 154(1), 29–50 (2007)
34. Pappalardo, M.: Multiobjective optimization: a brief overview. In: Chinchuluun, A., Pardalos, P.M., Migdalas, A., Pitsoulis, L. (eds.) Pareto Optimality, Game Theory and Equilibria, pp. 517–528. Springer, Berlin (2008)
35. Zitzler, E., Thiele, L.: Comparison of multi-objective evolutionary algorithms: empirical results. Evol. Comput. 8, 125–148 (2000)
36. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: improving the strength Pareto evolutionary algorithm for multiobjective optimization. In: Giannakoglou, K., Tsahalis, D., Periaux, J., Papailiou, K., Fogarty, T. (eds.) Evolutionary Methods for Design, Optimization, and Control, pp. 19–26. CIMNE, Barcelona (2002)
37. Deb, K., Agrawal, S., Pratap, A., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6, 182–197 (2002)
38. Gil, C., Márquez, A., Baños, R., Montoya, M.G., Gómez, J.: A hybrid method for solving multi-objective global optimization problems. J. Glob. Optim. 38(2), 265–281 (2007)
39. Deb, K.: Multiobjective Optimization Using Evolutionary Algorithms. Wiley, UK (2001)