Classification and Regression Tree Methods
(In Encyclopedia of Statistics in Quality and Reliability, Ruggeri, Kenett and Faltin (eds.), 315-323, Wiley, 2008)

Wei-Yin Loh
Department of Statistics, University of Wisconsin, Madison, WI 53706
[email protected]
Phone: 1 608 262 2598; Fax: 1 608 262 0032

Keywords: cross-validation, discriminant, linear model, prediction accuracy, recursive partitioning, selection bias, unbiased

Abstract: A classification or regression tree is a prediction model that can be represented as a decision tree. This article discusses the C4.5, CART, CRUISE, GUIDE, and QUEST methods in terms of their algorithms, features, properties, and performance.
1 INTRODUCTION
Classification and regression are two important problems in statistics. Each deals with the prediction of a response variable y from the values of a vector of predictor variables x. Let X denote the domain of x and Y the domain of y. If y is a continuous or discrete variable taking real values (e.g., the weight of a car or the number of accidents), the problem is called regression. If, instead, Y is a finite set of unordered values (e.g., the type of car or its country of origin), the problem is called classification. In mathematical terms, the problem is to find a function d(x) that maps each point in X to a point in Y. The construction of d(x) requires a training sample of n observations L = {(x_1, y_1), ..., (x_n, y_n)}. In computer science, the subject is known as supervised learning.

The criterion for choosing d(x) is usually the mean squared prediction error E{d(x) - E(y|x)}^2 for regression, where E(y|x) is the expected value of y at x, and the expected misclassification cost for classification. If Y contains J distinct values, the classification solution (or classifier) may be written as a partition of X into J disjoint pieces A_j = {x : d(x) = j} such that X = A_1 ∪ ... ∪ A_J. A classification tree is a special form of classifier in which each A_j is itself a union of sets, the sets being obtained by recursively partitioning the x-space. This permits the classifier to be represented as a decision tree. A regression tree is similarly a tree-structured solution in which a constant or a relatively simple regression model is fitted to the data in each partition.
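As a toy illustration (a minimal sketch using scikit-learn, which is not one of the methods discussed in this article; the data are simulated), the following Python code builds a tree-structured d(x) from a training sample and prints the recursive partition as a decision tree:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 2))               # predictor vectors x in [0,1]^2
    y = (X[:, 0] + X[:, 1] > 1).astype(int)      # class labels y in {0, 1}

    d = DecisionTreeClassifier(max_depth=3).fit(X, y)   # the prediction rule d(x)
    print(export_text(d, feature_names=["x1", "x2"]))   # its decision-tree form

Each leaf of the printed tree corresponds to a rectangular piece of the x-space, and the union of the pieces sharing a predicted class is the corresponding set A_j.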
Table 1: Variables in the car dataset (number of unique values in parentheses)

Name     Definition                 Values
Region   Manufacturer region        Asia, Europe, or U.S. (3)
Make     Make of car                Acura, Audi, etc. (38)
Type     Type of car                Car, minivan, pickup, sports car, sport utility vehicle, or wagon (6)
Drive    Drive type                 Front, rear, or four-wheel drive (3)
Rprice   Suggested retail price     U.S. dollars (410)
Dcost    Dealer cost                U.S. dollars (425)
Engnsz   Size of engine             liters (43)
Cylin    Number of cylinders        -1 for the rotary-engine Mazda RX-8 (8)
Hp       Horsepower                 hp (110)
City     City miles per gallon      miles (29)
Hwy      Highway miles per gallon   miles (32)
Weight   Weight of car              pounds (347)
Whlbase  Length of wheel base       inches (40)
Length   Length of car              inches (66)
Width    Width of car               inches (18)
To illustrate, consider a dataset on new cars for the 2004 model year from the Journal of Statistics Education Data Archive (www.amstat.org/publications/jse/jse_data_archive.html). After filling in missing values and correcting errors, we obtain a dataset with complete information on 15 variables for 428 vehicles. Table 1 gives the definitions of the variables, and Figure 1 shows two classification trees for predicting the Region variable, i.e., whether the vehicle manufacturer is from Asia, Europe, or the U.S. (there are 158 Asian, 120 European, and 150 U.S. vehicles in the dataset). Each intermediate node has a condition associated with it. If an observation satisfies the condition, it goes down the left branch; otherwise it goes down the right branch. Each leaf node is labeled with a predicted class, with cross-hatching, light gray, and dark gray denoting Asia, Europe, and the U.S., respectively. Beneath each node is a fraction giving the number of misclassified samples divided by the number of samples. It is easily seen from the trees that Dcost is an important predictor for European cars: 125 of the 159 cars with Dcost greater than $30,045 are European. Among cars with Dcost less than or equal to $30,045, those with 3.5-liter or larger engines are six times as likely to be American, while those with smaller engines are predominantly Asian.

A classification or regression tree algorithm has three major tasks: (i) how to partition the data at each step, (ii) when to stop partitioning, and (iii) how to predict the value of y for each x in a partition. There are many approaches to the first task. For ease of interpretation, a large majority of algorithms employ univariate splits of the form x_i ≤ c (if x_i is non-categorical) or x_i ∈ B (if x_i is categorical). The variable x_i and the split point c or split set B are often found by an exhaustive search that optimizes a node impurity criterion, such as entropy (for classification) or the sum of squared residuals (for regression). There are also several ways to deal with the second task, such as stopping rules and tree pruning. The third task is the simplest: the predicted y value at a leaf node is the class that minimizes the estimated misclassification cost for a classification tree and, typically, the sample mean of y at the node for a regression tree.
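To make the exhaustive search concrete, here is a minimal Python sketch (not the code of any of the algorithms discussed) that finds the best split of the form x_i ≤ c for one non-categorical predictor by minimizing the weighted entropy of the two child nodes; a regression version would minimize the sum of squared residuals instead:

    import numpy as np

    def entropy(y):
        """Entropy of the class labels in one node (0 for a pure node)."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    def best_split(x, y):
        """Best split x <= c for one numeric predictor x (numpy arrays)."""
        order = np.argsort(x)
        x, y = x[order], y[order]
        best_c, best_imp = None, np.inf
        for i in range(1, len(x)):
            if x[i] == x[i - 1]:
                continue                      # no cut point between tied values
            c = (x[i] + x[i - 1]) / 2.0       # midpoint between adjacent values
            # impurity of the split = weighted entropy of the two children
            imp = (i * entropy(y[:i]) + (len(y) - i) * entropy(y[i:])) / len(y)
            if imp < best_imp:
                best_c, best_imp = c, imp
        return best_c, best_imp

Repeating the search over every predictor and keeping the overall minimum yields the split for the node.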
[Node labels recovered from Figure 1 (splits on Dcost ≤ 30045 and Engnsz ≤ 3.5) and from the C4.5 tree of Figure 6 (splits on Width ≤ 73, Rprice ≤ 30315, Type, Engnsz ≤ 2.6, and City ≤ 15); the tree diagrams themselves are not reproduced.]
Figure 6: C4.5 (J48) tree for Drive, with black, white, and gray leaf nodes denoting front-wheel, rear-wheel, and four-wheel drive, respectively.
[Node labels recovered from Figure 7: splits on Region = Asia or Europe, Region = Asia, Engnsz ≤ 3.6, Type = minivan or sports car, Drive = 4wd or rwd, and Weight ≤ 3623. Leaf nodes 1 through 8 have mean Hp values of 228, 220, 171, 273, 317, 198, 279, and 213, with linear predictors Weight, Length, Engnsz, Cylin, Engnsz, Weight, Cylin, and Engnsz, respectively; the drawn tree is not reproduced.]
Figure 7: GUIDE piecewise simple linear model for predicting Hp. Beneath each leaf node are the sample mean of Hp and the name of the linear predictor.

The first splits send the Asian, European, and American vehicles down the left, middle, and right branches, respectively. The most important linear predictors are Cylin and Engnsz. A major advantage of this model, compared with a piecewise multiple linear model, is that the data and the fitted regression functions can be visualized graphically, as demonstrated in Figure 8. We see from Node 5 that European sports cars with large engines have among the highest Hp values. For American cars, Hp increases linearly with Engnsz, irrespective of the other variables. This GUIDE model has an R^2 value of 84%, which is higher than those of the RPART and multiple linear models discussed in Section 2.
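The leaf models can be illustrated with a short sketch (illustrative only; it mimics the piecewise simple linear idea, not GUIDE's actual selection algorithm): within each leaf, regress Hp on each candidate predictor separately and keep the single predictor giving the smallest residual sum of squares.

    import numpy as np

    def fit_leaf(X_leaf, y_leaf, names):
        """Pick the best single-predictor linear model for one leaf node.

        X_leaf: (n, p) array of numeric predictors; y_leaf: responses (Hp);
        names: the p predictor names. Returns (rss, name, slope, intercept).
        """
        best = None
        for j, name in enumerate(names):
            slope, intercept = np.polyfit(X_leaf[:, j], y_leaf, 1)  # simple OLS
            resid = y_leaf - (slope * X_leaf[:, j] + intercept)
            rss = float(resid @ resid)          # residual sum of squares
            if best is None or rss < best[0]:
                best = (rss, name, slope, intercept)
        return best

Because each leaf uses only one predictor, the fit in every node can be drawn as a line on a two-dimensional scatterplot, which is what makes displays such as Figure 8 possible.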
6 CONCLUSION
All the above algorithms can handle datasets with missing values. CART uses "surrogate splits" to pass observations with missing split values through its nodes. CRUISE follows a similar approach but uses an alternative set of surrogate splits; it is not known whether one method is better than the other. C4.5 uses weights: an observation with a missing split value is passed down every branch, with weight proportional to the number of observations with non-missing values of that variable in the branch. GUIDE and QUEST employ nodewise plug-in estimates for the missing values.

CART, CRUISE, and QUEST accept user-specified class prior probabilities and unequal misclassification costs. Class priors are useful if the training dataset is not a random sample. The three algorithms also allow splits on linear combinations of variables.
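The C4.5 weighting scheme can be sketched as follows (a hypothetical helper, not C4.5's actual code): an observation whose split value is missing is distributed over all branches in proportion to the branch counts of training observations with non-missing values of the split variable.

    def route(weight, value, branch_counts, split):
        """Send one weighted observation through a node's split.

        branch_counts: non-missing training counts per branch, e.g. {"L": 60, "R": 40}
        split: function mapping a non-missing value to a branch label
        Returns a list of (branch, weight) pairs.
        """
        if value is None:                         # split value is missing
            total = sum(branch_counts.values())
            return [(b, weight * c / total) for b, c in branch_counts.items()]
        return [(split(value), weight)]           # otherwise one branch, full weight

    # Example: a unit-weight observation with a missing Dcost is split 0.6/0.4.
    print(route(1.0, None, {"L": 60, "R": 40}, lambda v: "L" if v <= 30045 else "R"))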
[Figure 8 residue: nine scatterplot panels, each plotting Hp against the node's linear predictor, titled N1: Asian, Engnsz < 3.7, minivan or spc, rwd or 4wd; N2: Asian, Engnsz < 3.7, minivan or spc, fwd; N3: Asian, Engnsz < 3.7, car, pickup, suv or wagon; N4: Asian, Engnsz > 3.6; N5: European sportscar; N6: European, not spc, Weight < 3624; N7: European, not spc, Weight > 3623; N8: American; and Whole dataset.]
Figure 8: Data and simple linear regression models at the nodes of the GUIDE tree in Figure 7. The abbreviation spc stands for sports car. A plot of Hp versus Engnsz for the whole dataset is given in the bottom right corner for comparison.
Although such splits are hard to interpret, there is empirical evidence that they have much better prediction accuracy than trees with univariate splits [13]. C4.5 does not have these capabilities. GUIDE is also restricted to univariate splits, but empirical studies with real datasets suggest that the prediction accuracy of its piecewise multiple linear models is, on average, comparable to that of the best methods, including spline models [9].

Space restrictions prevent discussion of the many other tree algorithms here. The interested reader is referred to CHAID [15, 16] for classification, RECPAM [17] for classification and regression, M5 [18, 19] for regression, LOTUS [20] for logistic regression, and Bayesian CART [21, 22].

In terms of statistical theory, methods based on recursive partitioning are known to be risk consistent under certain regularity conditions. Specifically, the expected mean squared error of a piecewise-constant regression tree and the expected misclassification cost of a classification tree converge to the lowest possible values as the training sample size increases [1]. Further, for piecewise linear regression trees, the predicted values converge to the unknown regression function values pointwise, at least for least-squares [23], logistic, Poisson [24], and quantile [25] regression.
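In the notation of the Introduction, the risk-consistency property for a regression tree d_n built from a training sample of size n can be written as follows (a standard formulation; the precise regularity conditions are given in [1]):

    \[
      E\{d_n(\mathbf{x}) - E(y \mid \mathbf{x})\}^2 \;\longrightarrow\; 0
      \qquad \text{as } n \to \infty,
    \]

with the analogous statement for classification being convergence of the expected misclassification cost to the minimum attainable cost.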
7 ACKNOWLEDGMENTS
CART is a registered trademark of California Statistical Software. This article was prepared with partial support from grants from the U.S. Army Research Office and the National Science Foundation.
References

[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.
[2] T. M. Therneau and E. J. Atkinson. An introduction to recursive partitioning using the RPART routine. Technical Report 61, Mayo Clinic, Section of Statistics, 1997.
[3] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2006. ISBN 3-900051-07-0.
[4] J. N. Morgan and J. A. Sonquist. Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58:415-434, 1963.
[5] A. Fielding. Binary segmentation: The automatic interaction detector and related techniques for exploring data structure. In C. A. O'Muircheartaigh and C. Payne, editors, The Analysis of Survey Data, Volume I, Exploring Data Structures. Wiley, New York, 1977.
[6] P. Doyle. The use of Automatic Interaction Detector and similar search procedures. Operational Research Quarterly, 24:465-467, 1973.
[7] W.-Y. Loh and Y.-S. Shih. Split selection methods for classification trees. Statistica Sinica, 7:815-840, 1997.
[8] H. Kim and W.-Y. Loh. Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96:589-604, 2001.
[9] H. Kim, W.-Y. Loh, Y.-S. Shih, and P. Chaudhuri. A visualizable and interpretable regression model with good prediction power. IIE Transactions, Special Issue on Data Mining and Web Mining, 2007.
[10] H. Kim and W.-Y. Loh. Classification trees with bivariate linear discriminant node models. Journal of Computational and Graphical Statistics, 12:512-530, 2003.
[11] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[12] I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA, second edition, 2005. http://www.cs.waikato.ac.nz/ml/weka.
[13] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40:203-228, 2000.
[14] W.-Y. Loh. Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12:361-386, 2002.
[15] G. V. Kass. Significance testing in automatic interaction detection (A.I.D.). Applied Statistics, 24:178-189, 1975.
[16] G. V. Kass. An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29:119-127, 1980.
[17] A. Ciampi, S. A. Hogg, S. McKinney, and J. Thiffault. RECPAM: A computer program for recursive partition and amalgamation for censored survival data and other situations frequently occurring in biostatistics, I: Methods and program features. Computer Methods and Programs in Biomedicine, 26:239-256, 1988.
[18] J. R. Quinlan. Learning with continuous classes. In Proceedings of the Australian Joint Conference on Artificial Intelligence, pages 343-348, Singapore, 1992. World Scientific.
[19] Y. Wang and I. Witten. Inducing model trees for continuous classes. In Proceedings of the Poster Papers of the European Conference on Machine Learning, Prague, 1997.
[20] K.-Y. Chan and W.-Y. Loh. LOTUS: An algorithm for building accurate and comprehensible logistic regression trees. Journal of Computational and Graphical Statistics, 13:826-852, 2004.
[21] H. Chipman, E. I. George, and R. E. McCulloch. Bayesian CART model search (with discussion). Journal of the American Statistical Association, 93:935-960, 1998.
[22] D. G. Denison, B. K. Mallick, and A. F. M. Smith. A Bayesian CART algorithm. Biometrika, 85:363-377, 1998.
[23] P. Chaudhuri, M.-C. Huang, W.-Y. Loh, and R. Yao. Piecewise-polynomial regression trees. Statistica Sinica, 4:143-167, 1994.
[24] P. Chaudhuri, W.-D. Lo, W.-Y. Loh, and C.-C. Yang. Generalized regression trees. Statistica Sinica, 5:641-666, 1995.
[25] P. Chaudhuri and W.-Y. Loh. Nonparametric estimation of conditional quantiles using quantile regression trees. Bernoulli, 8:561-576, 2002.