PRISMA: Improving Risk Estimation with Parallel Logistic Regression Trees

Bert Arnrich¹, Alexander Albert², and Jörg Walter¹

¹ Neuroinformatics Group, Faculty of Technology, Bielefeld University, Germany
² Clinic for Cardiothoracic Surgery, Heart Institute Lahr, Germany
Abstract. Logistic regression is a powerful method for estimating models with a binary response variable. With the previously suggested combination of tree-based approaches with local, piecewise valid logistic regression models in the nodes, interactions between the covariates are directly conveyed by the tree and can be interpreted more easily. We show that restricting the partitioning of the feature space to the single best attribute limits the overall estimation accuracy. Here we propose Parallel RecursIve Search at Multiple Attributes (PRISMA), demonstrate how the method significantly improves risk estimation models in heart surgery, and benchmark it successfully on three UCI data sets.
1 Introduction
The logistic regression model (LogReg) estimates the probability of a binary outcome Y depending on the linear combination of k input variables Xi in a data set D. The k + 1 coefficients β0, β1, ..., βk are usually estimated by iterative likelihood maximization. In health sciences the input parameters are, e.g., the absence/presence of certain risk factors, the medication, or the procedure type for a given patient. The coefficients βi are then easily interpretable (e.g. for a binary variable Xi the corresponding e^βi equals the odds ratio; a minimal R illustration is given after the list below) and broadly appreciated. Despite its simplicity, the accuracy of LogReg compares favorably to that of many binary classification methods [8].

In principle the Xi can be any non-linear transformations or combinations of variables in order to adapt the model to existing non-linearities and parameter interactions. Unfortunately the model then quickly loses its comprehensibility – at least for many health professionals – and therefore these extensions are not very commonly applied.

Another well appreciated model format is the decision tree (DT). It assigns each new case to a unique terminal node which comprises a group of cases. The combination of DT and LogReg models was suggested earlier and offers several advantages (see e.g. [5]):

• The tree structure can handle large parts of the overall model complexity.
• Interactions between the covariates are directly conveyed by the tree and can be interpreted more easily in qualitative terms.
• A simple form of the fitted function in each node enables the statistical properties of the method to be studied analytically.
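As an illustration of this plain LogReg building block – the same type of node model PRISMA later fits in each partition – the following minimal R sketch fits a model on a hypothetical data frame d with a binary outcome y and two binary risk factors x1 and x2 (all names are placeholders, not taken from the study data):

  # Logistic regression fitted by iterative likelihood maximization (IRLS)
  fit <- glm(y ~ x1 + x2, family = binomial, data = d)

  # For a binary covariate Xi, exp(beta_i) equals the corresponding odds ratio
  exp(coef(fit))

  # Estimated outcome probability for each case
  p_hat <- predict(fit, type = "response")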
In previous work on tree-structured regression two basic strategies can be distinguished:

Two Phase Methods: first partition the data using a tree construction method and afterwards fit a model in each node.

Adaptive Methods: recursively partition the data, taking the quality of the fitted models into account during the tree construction process.

One example of the first strategy is Quinlan's M5 regression tree [9]. In a preliminary step M5 builds a classification tree using the standard deviation of the Y values as node impurity function, and then a multivariate linear model is fitted at each node. For the second strategy a deviance-based approach was proposed: LOTUS constructs contingency tables with the Y outcome [4]. The variable Xi with the smallest significance level (tested with the χ2 statistic) is selected to split the node, and the binary split point that minimizes the sum of the deviances of the logistic regression models fitted to the two data subsets is chosen (a short sketch of this criterion follows this paragraph).

Although regression tree models have been shown to have significant advantages over simple regression or standard decision trees, it remains unclear whether the structure of the built hybrid tree is optimal with regard to the overall estimation performance. In contrast to previous work, we propose to search for a partitioning whose node models produce the highest overall estimation accuracy.
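As a concrete illustration of the deviance criterion used by LOTUS, the following R sketch scores a candidate split point T of an attribute x by the sum of the deviances of the two node models. It is our own reconstruction under the assumption of a data frame d with a binary outcome column y, not the original LOTUS code:

  # Sum of deviances of the logistic models fitted to the two subsets induced by T
  split_deviance <- function(d, x, T) {
    left  <- d[d[[x]] <  T, ]
    right <- d[d[[x]] >= T, ]
    fit_l <- glm(y ~ ., family = binomial, data = left)
    fit_r <- glm(y ~ ., family = binomial, data = right)
    deviance(fit_l) + deviance(fit_r)
  }
  # The candidate split point T with the smallest summed deviance is chosen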
2 Methods
Aiming at the best overall estimation accuracy, the key ideas of the proposed search for an optimal tree are:

1. parallel recursive search at multiple attributes,
2. fitting a stepwise logistic regression model in each partition,
3. testing whether the discriminative power of the sub-model in each leaf node is better than that of a parent model, using the area under the receiver operating characteristic (ROC) curve, and
4. selecting the tree structure and the corresponding node models with the highest overall estimation accuracy.

Independent of the algorithmic strategy used for finding optimal split points of an attribute Xi in D or a subspace of D, it is important to ensure that obviously "bad" partitions, i.e. partitions that break apart a sequence of records belonging to a single class [6], are not selected by the evaluation function. Here we use the concept of boundary points introduced in [7] (sketched below). In the following sections we first briefly introduce the evaluation functions for finding optimal split points, which we apply only to boundary points, and then explain the parallel subtree construction, model fitting, pruning, and final tree generation.
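To make the boundary-point concept concrete, the following R sketch (a simplified illustration, not the authors' implementation, and ignoring blocks of tied attribute values) sorts the cases by an attribute and returns only the cut points across which the class label changes:

  # Boundary points of attribute x with respect to class y: midpoints between
  # adjacent sorted values whose neighbouring records belong to different classes
  boundary_points <- function(x, y) {
    o  <- order(x)
    xs <- x[o]; ys <- y[o]
    i  <- which(diff(xs) > 0)                        # positions where the value changes
    keep <- vapply(i, function(j) ys[j] != ys[j + 1], logical(1))
    (xs[i[keep]] + xs[i[keep] + 1]) / 2              # candidate split points
  }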
2.1 Split Criteria: Gain Ratio
The gain ratio criterion assesses the desirability of a partition as the ratio of its information gain to its split information [10]. The information gained when a set D is partitioned into two subsets D1 and D2, induced by a boundary point T of the attribute Xi, is Gain(Xi, T; D) = Ent(D) − E(Xi, T; D), where Ent(D) denotes the class information entropy of D and E(Xi, T; D) is the weighted average of the class entropies of D1 and D2. The potential information generated by dividing D into the two subsets is given by the split information SplitInfo(Xi, T; D) = −Σj |Dj|/|D| log2(|Dj|/|D|). With this normalization the gain ratio is defined as Gain(Xi, T; D) / SplitInfo(Xi, T; D). For a given attribute Xi the boundary point T with maximal gain ratio is selected, provided the information gain is positive.
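A compact R sketch of this criterion for a binary split at a boundary point T, using the standard entropy definitions from [10] (function and variable names are ours, not from the paper):

  ent <- function(y) {                               # class information entropy Ent(D)
    p <- table(y) / length(y)
    -sum(ifelse(p > 0, p * log2(p), 0))
  }

  gain_ratio <- function(x, y, T) {
    left <- x < T
    w    <- c(mean(left), mean(!left))               # subset weights |D1|/|D|, |D2|/|D|
    e    <- w[1] * ent(y[left]) + w[2] * ent(y[!left])   # E(Xi, T; D)
    gain <- ent(y) - e                               # information gain
    split_info <- -sum(w * log2(w))                  # split information
    if (gain > 0) gain / split_info else NA          # only positive gains are considered
  }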
2.2 Split Criteria: Class Information Entropy and MDLPC
In [7] a stopping criterion based on the Minimum Description Length Principle Criterion (MDLPC) was developed for the entropy-based partitioning process. The partition induced by the boundary point T of attribute Xi in a set D with minimal class information entropy E(Xi, T; D) is accepted if its information gain exceeds an MDL-derived threshold (for details see [7]).
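The acceptance test itself is not spelled out above; the sketch below reconstructs the standard MDLP threshold from [7] (reusing ent() from the gain-ratio sketch) and should be read as an assumption about the exact form used:

  # MDLPC: accept the split at T only if the gain exceeds the MDL-derived threshold
  mdlpc_accept <- function(x, y, T) {
    left <- x < T
    N  <- length(y)
    k  <- length(unique(y))                          # classes present in D
    k1 <- length(unique(y[left]))                    # classes present in D1
    k2 <- length(unique(y[!left]))                   # classes present in D2
    gain  <- ent(y) - (mean(left) * ent(y[left]) + mean(!left) * ent(y[!left]))
    delta <- log2(3^k - 2) - (k * ent(y) - k1 * ent(y[left]) - k2 * ent(y[!left]))
    gain > log2(N - 1) / N + delta / N
  }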
2.3 Split Criteria: χ2 Statistic
For the χ2 method a 2×2 contingency table ({Xi < T, Xi ≥ T} versus Y = {0, 1}) is computed for each boundary point T of variable Xi. Using the χ2 distribution function, the significance of the association between the outcome and each boundary point is calculated. For a given attribute Xi the most significant partitioning is chosen, provided it is significant at the 5 % level.
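In R the same test can be written directly with the built-in chisq.test(); the helper below is an illustrative sketch only:

  # p-value of the association between the binary split at T and the outcome Y
  chisq_split_p <- function(x, y, T) {
    tab <- table(x < T, y)                           # 2 x 2 contingency table
    suppressWarnings(chisq.test(tab, correct = FALSE))$p.value
  }
  # Among all boundary points of an attribute, the smallest p-value wins,
  # provided it passes the 5 % significance level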
2.4 Parallel Subtree Construction using Proxy Nodes
By using optimal splits for multiple attributes at each node, one parallel subtree is constructed per splitting attribute. To ensure that every partitioning is visited only once, we introduce proxy nodes: if the required partitioning already exists in the tree, the new node becomes a proxy node and refers to the corresponding node (see Fig. 1). This mechanism saves many redundant computations.
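One simple way to realize the proxy mechanism is to give every partitioning a canonical, order-independent key over its split conditions and to keep a lookup table from keys to existing nodes. The following R sketch illustrates the idea and is not taken from the authors' implementation:

  # c("CPS=0", "NCS=0") and c("NCS=0", "CPS=0") describe the same partitioning
  partition_key <- function(conditions) paste(sort(conditions), collapse = " & ")

  nodes <- new.env()                                 # key -> node (proxy lookup table)

  get_or_create_node <- function(conditions, build_node) {
    key <- partition_key(conditions)
    if (exists(key, envir = nodes, inherits = FALSE))
      return(get(key, envir = nodes))                # proxy: refer to the existing node
    node <- build_node(conditions)                   # otherwise construct the node
    assign(key, node, envir = nodes)
    node
  }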
2.5 LogReg Model Fitting and ROC-based Pruning
For each non-proxy node a stepwise backward logistic regression model is fitted.¹ As initial predictor variables all non-constant ordinal numeric attributes Xi are used. Proxy nodes refer to the computed node models and their possible subtrees.
¹ We employ the glm function and the accelerated backward selection fastbw of the statistical software package R [11] for the regression task.
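A node model of this kind can be sketched with base R alone; here step() serves as a stand-in for the accelerated fastbw routine mentioned in the footnote, and the assumption is a data frame d whose outcome column is named y:

  # Full logistic model on all non-constant numeric predictors of this partition,
  # followed by stepwise backward elimination as the node model
  node_model <- function(d) {
    keep <- vapply(d[setdiff(names(d), "y")],
                   function(v) is.numeric(v) && length(unique(v)) > 1, logical(1))
    f    <- reformulate(names(keep)[keep], response = "y")
    full <- glm(f, family = binomial, data = d)
    step(full, direction = "backward", trace = 0)    # AIC-based backward selection
  }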
Fig. 1. Parallel subtree construction using proxy nodes: besides the split on "Critical Preoperative State" (CPS), a branch for "Non-Coronary Surgery" (NCS) is also opened. The same strategy can be seen for the children of node 3, where in addition to the CPS split an "Age Group" (AG) division is carried out. To ensure that every partitioning is visited only once, a new node whose partitioning already exists in the tree becomes a proxy node and refers to the corresponding node. For example, the partitioning in node 9 (NCS=0 and CPS=0) is the same as in node 5 (CPS=0 and NCS=0). Therefore node 9 is a proxy of node 5 and refers to it.
In the following pruning phase the estimation accuracy of each leaf node model is compared with that of all of its parent models. The model with the best estimated accuracy is assigned to the leaf node, i.e. if a parent model is superior, the leaf model is discarded and the node refers to the parent model instead (e.g. see node 7 in Fig. 2). We choose the area under the ROC curve (AUC) as the measure best suited for comparing our node models. Also in a low-risk regime the AUC, sometimes called the "c-index", is an integral measure of the entire performance of a classification or estimation system. Although model improvements result only in small changes of its value (AUC ∈ [0.5, 1]), the measure is very sensitive compared to, e.g., the standard error.
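The AUC needed for this comparison can be computed without additional packages via the rank-sum (Mann-Whitney) formulation; the pruning step below is a minimal sketch of the comparison, assuming both models can predict on the leaf's cases:

  # Area under the ROC curve from the ranks of the predicted risks
  auc <- function(y, p) {                            # y: binary outcome, p: predicted risk
    r  <- rank(p)
    n1 <- sum(y == 1); n0 <- sum(y == 0)
    (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  }

  # Keep the parent model for a leaf whenever its AUC is at least as high
  prune_leaf <- function(d, leaf_fit, parent_fit) {
    a_leaf   <- auc(d$y, predict(leaf_fit,   newdata = d, type = "response"))
    a_parent <- auc(d$y, predict(parent_fit, newdata = d, type = "response"))
    if (a_parent >= a_leaf) parent_fit else leaf_fit
  }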
2.6 Generating the Final Tree
Our goal is to find a single, complete and unique tree structure whose node models produce the best overall estimation accuracy. In a first step, all complete and unique trees have to be extracted recursively, starting at the root node, from the series of parallel trees resulting from the previous tree construction process (see the example in Fig. 2):
Fig. 2. Extracted unique trees with regression models in the nodes. In the final tree generation process all child nodes are grouped according to their attributes, and a new tree is created for each attribute if more than one group exists. In the example tree in Fig. 1 the branch at NCS=0 has two different attributes: AG and CPS. Therefore two new trees (one with child attribute AG, another with CPS in the NCS=0 branch) are generated. The models in the proxy nodes 9 and 11 refer to their corresponding models in nodes 5 and 6. An example of the parent regression model assignment in the ROC-based pruning phase can be seen at node 7 (see left tree): the estimated accuracy of the regression model F3 in node 3 was superior in the sub-partition AG