THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING

Dan Steinberg and N. Scott Cardell

Introduction

Most data-mining projects involve classification problems—assigning “objects” to classes—whether it be sifting profitable from unprofitable, detecting fraudulent cases, identifying repeat buyers, profiling high-value customers likely to attrite, or flagging high credit-risk applications. This paper proposes a new technique for solving these types of classification problems by hybridizing two popular classification tools, decision trees and logistic regression. CART® and LOGIT, data-mining tools by Salford Systems, are used to demonstrate how to implement this new approach from both a theoretical and a practical perspective.

Strengths and Weaknesses of CART

CART (Classification and Regression Trees) is a state-of-the-art decision-tree tool that can investigate any classification task and provide a robust, accurate predictive model. The methodology is characterized by its ability to automate the modeling process, communicate via pictures, and handle complex data structures. The core features of Salford Systems’ CART include:

• automatic separation of relevant from irrelevant predictors
• automatic interaction detection
• imperviousness to outliers
• robustness to missing values (handled via surrogates)
• invariance to variable transformations (e.g., log, square root, etc.)

CART, and decision trees in general, however, are notoriously weak at capturing strong linear structure. While CART recognizes the structure, it cannot effectively represent it, producing very large trees in an attempt to represent very simple relationships. A second weakness of tree-based tools is that they sometimes produce very coarse-grained response images; that is, the tree may contain only a small number of terminal nodes. Given that the classification outcome or “score” is shared by all cases in a terminal node, a 12-node tree, for example, can only predict 12 different probabilities. For some problems this may be considered a classification success (e.g., identifying a large set of cases as “responders”); for other types of problems it can be problematic. A final weakness, again only problematic for particular problem types, is that decision trees generate discontinuous responses; thus, a small change in x could lead to a large change in y.
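To make the coarse-grained weakness concrete, the short sketch below (an illustration using scikit-learn; the synthetic data and the 12-node cap are our own assumptions, not from the paper) fits a tree limited to 12 terminal nodes to data whose true response probability varies smoothly, then counts how many distinct scores the tree can actually produce.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
# A smooth "true" probability that varies continuously with the predictors
p = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 0.8 * X[:, 1])))
y = rng.binomial(1, p)

# A tree capped at 12 terminal nodes can emit at most 12 distinct scores
tree = DecisionTreeClassifier(max_leaf_nodes=12, random_state=0).fit(X, y)
scores = tree.predict_proba(X)[:, 1]
print("distinct predicted probabilities:", np.unique(scores).size)  # <= 12
```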

Strengths and Weaknesses of Logit

Logistic regression (or Logit) is a traditional methodology that relies on classical statistical principles and, like CART, demonstrates remarkable accuracy in a broad range of contexts. In the STATLOG project, for example, variations on logistic discriminant analysis consistently ranked among the top performers, scoring highest in 5 of 21 problems. Logistic regression models effectively capture global linear structure in data and, given that non-linear structure can often be reasonably approximated with linear structure, even incorrectly specified models can perform well. The technique also provides a smooth, continuous predicted probability of class membership, so that a small change in a predictor variable yields a small change in the predicted probability. However, while Logit excels at handling linear and smooth curvilinear data structures, it often requires experts to hand-craft the models, and the results can be challenging to interpret, often understood only via simulation. In addition, while Logit permits some flexibility with transformations, polynomials, and interactions, it is incumbent upon the analyst to correctly identify the best variable representations.

Juxtaposing the strengths and weaknesses of CART and Logit, a natural question arises: given that they excel at different tasks, can we capitalize on their strengths by combining them?

CART
• Automatic analysis
• Uses surrogates for missing values
• Unaffected by outliers
• Discontinuous response: a small change in x could lead to a large change in y
• Coarse-grained response images: a finite number of predicted probabilities
• High speed

LOGIT
• Requires experts
• Deletes records or imputes missing values
• Sensitive to outliers
• Continuous, smooth response: a small change in x leads to a small change in y
• Unique predicted probability for every record
• Low speed, becoming infeasible with too many inputs

Early Attempts to Hybridize

The first attempt to capitalize on the different strengths of the two techniques involved running logistic regression models in the terminal nodes of deliberately shallow decision trees.

[Figure: a shallow CART tree with a separate LOGIT model fitted in each terminal node]

An examination of how CART works explains why these first attempts were unsuccessful. As noted above, CART excels in the detection of local data structure. Once a data set is partitioned into two subsets at the root node, each half of the tree is then analyzed separately. As the partitioning continues, the analysis is always restricted to the node in focus. The discovery of patterns becomes progressively more localized, and the “fit” at one node is never adjusted to take into account the “fit” at another. In this manner, CART reaches its goal: to split the data into homogeneous subsets. The farther down the tree, the less the variability in the target (dependent) variable. CART splits send cases with x ≤ c to the left and x > c to the right; thus, the variance in the predictor (independent) variables is also drastically reduced. For example, if x is normally distributed and the splitting cut point is at the mean of x, the variance in each of the two child nodes is reduced by about 64%. For subsequent mean splits, the variance reduction will always be greater than 50%. This reduction in predictor-variable variance will also apply to correlated predictors. By the time CART has declared a node terminal, the information remaining in the node is insufficient to support further statistical analysis: the sample size in the terminal node is drastically reduced, as is the variance of both the target and the predictor variables. Thus, in a well-developed CART tree, no parametric model should be supportable within the terminal nodes. Estimating Logits or other parametric models earlier in the tree, for example after just a few splits, has the same drawbacks as terminal-node models, although the results are less extreme. At best, this latter approach provides a mechanism for identifying switching regressions, though this is not very successful in practice.
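The 64% figure follows from the variance of a normal distribution truncated at its mean, which is σ²(1 − 2/π) ≈ 0.36σ². The quick simulation below (our own illustrative check, not part of the paper) confirms the claim.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
left_child = x[x <= 0.0]                          # cases sent left by a split at the mean
print("child-node variance :", left_child.var())  # ~0.36
print("theoretical 1 - 2/pi:", 1.0 - 2.0 / np.pi) # ~0.3634, i.e. a ~64% reduction
```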

Proposed New Hybrid Approach

The key to running a successful hybrid is to run the Logit in the root node—thereby taking advantage of Logit’s strength in detecting global structure—and to include CART terminal-node dummies in the Logit model. The new hybrid approach is implemented as follows (a code sketch follows the list):

1. Use CART to assign every case to a terminal node. With CART surrogates, assignment is possible for every case, even those with some or all values missing.
2. Create a new categorical variable equal to the terminal-node assignment. The new categorical will have as many levels as there are terminal nodes.
3. Feed the categorical variable into the Logit model (LOGIT will automatically expand an x-level categorical into x dummy variables). This first-run Logit model is then used as a baseline model (more on this below).
4. Add main-effect variables to the baseline CART-Logit model. The added variables constitute the hybrid component and can be tested as a group via a log-likelihood ratio test.

[Figure: CART terminal-node assignments converted to dummy variables and fed into a Logit run on the entire data set]

The Logit formulas for the CART and hybrid models can be represented as follows:

• CART only:

  y = β_0 + β_1 NODE_1 + β_2 NODE_2 + … + β_K NODE_K

  where NODE_i is a dummy variable for the ith CART terminal node; and

• CART-Logit hybrid:

  y = β_0 + β_1 NODE_1 + β_2 NODE_2 + … + β_R NODE_R + β_{R+1} X_1 + β_{R+2} X_2 + … + β_{R+S} X_S

    = β_0 + Σ_{i=1}^{R} β_i NODE_i + Σ_{j=1}^{S} β_{R+j} X_j

    = CART node dummies + hybrid covariates

  where X_1, …, X_S are the main-effect covariates added to the R node dummies.

The Logit model fit to the CART terminal-node dummies converts the dummies into estimated probabilities; otherwise, it is an exact representation of the CART model. Each dummy represents the rules and interaction structure discovered by CART, albeit buried in a black box. The likelihood score of this model can then be used as a baseline for further testing and model assessment. Note also that this simple hybrid model is an excellent way to incorporate sampling weights and recalibrate a CART tree.

The addition of main-effect variables to the baseline model then allows the expanded model to capture effects common across all nodes (i.e., global structure). Because all strong effects have already been detected in the initial CART run, the effects detected across the terminal nodes are likely to be weak; nevertheless, a collection of weak effects can be very significant. A good starting point for expanding the LOGIT component of the hybrid model is to:

• add variables already selected as important by CART
• add root-node competitor variables that never actually appeared as splitters or surrogates in the CART tree
• add variables known to be important from other studies

A stepwise selection procedure can be used to pare down the variable list, and the pared-down list of main-effect variables can then be tested as a group via a likelihood ratio test (a sketch of this test appears below).

In sum, by looking across nodes, Logit finds effects that CART cannot detect. Because these effects are not very strong, they are not detected by CART and not used as primary node splitters. Once the sample is split by CART, these effects become progressively more difficult to detect as the subsamples become increasingly homogeneous in the child nodes. While these effects may not be strong individually, collectively they can add enormous predictive power to the model.
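A sketch of that group test follows: twice the log-likelihood gain of the hybrid over the node-dummies-only baseline is referred to a chi-square distribution with one degree of freedom per added covariate. The function name is illustrative, and the test assumes unpenalized maximum-likelihood fits.

```python
import numpy as np
from scipy.stats import chi2

def likelihood_ratio_test(y, p_baseline, p_hybrid, n_added_covariates):
    """Test the added main-effect variables as a group."""
    y = np.asarray(y, dtype=float)

    def loglik(p):
        p = np.clip(p, 1e-12, 1 - 1e-12)   # guard against log(0)
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    lr_stat = 2.0 * (loglik(p_hybrid) - loglik(p_baseline))
    p_value = chi2.sf(lr_stat, df=n_added_covariates)
    return lr_stat, p_value

# Example with the fits from the earlier sketch:
# stat, pval = likelihood_ratio_test(
#     y_tr, baseline.predict_proba(D_tr)[:, 1],
#     hybrid.predict_proba(H_tr)[:, 1], n_added_covariates=X_tr.shape[1])
```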

Finessing the Hybrid Model

Other considerations that must be addressed in building a CART-Logit hybrid model are missing values, variable transformations, and interactions. The simplest approach for handling missing values, of course, is to ignore the problem by dropping all records with missing values on the model variables. Alternatively, CART-predicted probabilities can be assigned to cases with missing values, while hybrid-predicted probabilities are assigned to all other cases (see the sketch below). More complicated approaches include missing-value imputation and adding missing-value dummy indicators to the model, plus nesting for the non-missing cases. Given that CART trees give good results, these more complicated approaches are usually not required.
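A minimal sketch of that second option appears below (our own rendering, not the paper's code): records with any missing model variable receive the CART probability, and all other records receive the hybrid probability.

```python
import numpy as np
import pandas as pd

def blended_scores(X, cart_probs, hybrid_probs):
    """Use CART scores for records with missing model variables,
    hybrid scores for fully observed records."""
    has_missing = pd.DataFrame(X).isna().any(axis=1).to_numpy()
    return np.where(has_missing, cart_probs, hybrid_probs)
```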

Variable transformations, such as logs and square roots, will also need to be considered. Given that interactions will already be captured in the CART terminal-node dummies, the only interaction terms worth considering are those that capture node-specific effects (i.e., terminal-node interactions with selected variables) and interactions with missing-value indicators. Salford Systems’ MARS™, which automates the process of identifying optimal variable transformations and interactions, can be used effectively at this stage.

Assessing Node-Specific Logit Fit

CART segments the data into very different subsamples, so why expect a single common Logit to be valid? First, the terminal-node dummies capture all of the complex interactions and non-commonality of the hyper-segments. Second, a simple test can be carried out to check whether the Logit developed in each node results in an improvement over the CART score. To test whether the hybrid model shows a lack of fit in any node or subset of nodes, perform a simple likelihood test node by node (see the sketch below). If the CART likelihood is greater than the hybrid-model likelihood, do not apply the hybrid model to that particular node.
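The helper below (an illustrative rendering under the same assumptions as the earlier sketches) performs that node-by-node comparison and returns the nodes in which the CART score should be retained.

```python
import numpy as np

def nodes_where_hybrid_fails(y, node_ids, cart_probs, hybrid_probs):
    """Return terminal nodes in which the CART score has the higher
    log likelihood; in those nodes, keep the CART prediction."""
    y = np.asarray(y, dtype=float)

    def loglik(yy, pp):
        pp = np.clip(pp, 1e-12, 1 - 1e-12)   # guard against log(0)
        return np.sum(yy * np.log(pp) + (1 - yy) * np.log(1 - pp))

    bad_nodes = []
    for node in np.unique(node_ids):
        mask = np.asarray(node_ids) == node
        if loglik(y[mask], np.asarray(cart_probs)[mask]) > loglik(
                y[mask], np.asarray(hybrid_probs)[mask]):
            bad_nodes.append(node)
    return bad_nodes
```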

Real World Results

In our experience, models have often been improved dramatically by the hybrid CART-Logit technique. In the following direct-mail and financial-market applications, the hybrid outperformed both CART and LOGIT alone:

Direct mail applications:
• response model for a catalog
• response model for a credit card offer
• response model for an insurance product

Financial applications:
• mortgage model
• loan delinquency model
• fraud detection model

While these real-world examples are valuable case studies, they constitute a small sample and the results cannot be shared due to confidentiality. Because of this, extensive experiments on artificial data sets were conducted. This analysis permitted a more accurate assessment of the possible benefits and flaws of the hybrid methodology.

Monte Carlo Test Results

For the Monte Carlo assessment of the hybrid model, samples of various sizes (2,000 to 100,000 records) were randomly drawn (from ?). Each experiment was run 100 times by resetting the random seed. The resulting models were assessed on the basis of fit and also of performance (e.g., the profit yielded if the model guides policy). Training and hold-out samples were used to assess possible over-fitting of the data. The specific Monte Carlo experiments are summarized below, followed by a sketch of one such run:

• simple Logit: one variable
• CART tree: one variable, highly non-linear
• hybrid process
• Logit: several variables (possibly missing)
• hybrid: several variables (possibly missing)
• highly non-linear smooth function (not Logit)
• complex Logit with informative missingness
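As an illustration only, the sketch below mimics one run in the spirit of the second scenario (one variable, highly non-linear); the data-generating process, sample sizes, and use of hold-out AUC as the performance measure are our own assumptions, not the paper's.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def one_run(seed, n=20_000):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3, 3, size=(n, 1))
    p = np.where(np.abs(x[:, 0]) < 1, 0.8, 0.2)      # step-shaped true probability
    y = rng.binomial(1, p)
    half = n // 2
    X_tr, X_te, y_tr, y_te = x[:half], x[half:], y[:half], y[half:]

    tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X_tr, y_tr)
    logit = LogisticRegression().fit(X_tr, y_tr)

    # Hybrid design: node dummies from the tree plus the raw variable
    train_leaves = np.unique(tree.apply(X_tr))
    def design(X):
        dummies = (tree.apply(X)[:, None] == train_leaves[None, :]).astype(float)
        return np.hstack([dummies, X])
    hybrid = LogisticRegression(max_iter=1000).fit(design(X_tr), y_tr)

    return {
        "tree":   roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1]),
        "logit":  roc_auc_score(y_te, logit.predict_proba(X_te)[:, 1]),
        "hybrid": roc_auc_score(y_te, hybrid.predict_proba(design(X_te))[:, 1]),
    }

# results = [one_run(seed) for seed in range(100)]   # 100 re-seeded runs
```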

The Monte Carlo results indicated that in smaller samples (n = 2,000), LOGIT performed very well even when it was not the “true” model. In larger samples (n = 20,000), the hybrid model dominated on out-of-sample performance measures. Also in the larger samples, both CART and the hybrid model managed problems with missing values, whereas the Logit model collapsed. Finally, in larger samples with high frequencies of missing values, the hybrid outperformed the other models regardless of which model was “true.”

References

Breiman, L., J. Friedman, R. Olshen, and C. Stone (1994), Classification and Regression Trees, Pacific Grove: Wadsworth.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), Annals of Statistics, 19, 1-141 (March).

Michie, D., D. J. Spiegelhalter, and C. C. Taylor, eds. (1994), Machine Learning, Neural and Statistical Classification, London: Ellis Horwood Ltd.

Steinberg, D. and P. Colla (1995), CART: Tree-Structured Non-Parametric Data Analysis, San Diego, CA: Salford Systems.

CART is a registered trademark of California Statistical Software and licensed exclusively to Salford Systems. All other trademarks mentioned are the property of their respective owners.

