Causal Inference in Observational Data

arXiv:1611.04660v1 [cs.AI] 15 Nov 2016

Pranjul Yadav, Dept. of Computer Science, University of Minnesota, Minneapolis, MN

Michael Steinbach, Dept. of Computer Science, University of Minnesota, Minneapolis, MN

Vipin Kumar, Dept. of Computer Science, University of Minnesota, Minneapolis, USA

Bonnie Westra, School of Nursing, University of Minnesota, Minneapolis, MN

Alexander Hoff, Dept. of Computer Science, University of Minnesota, Minneapolis, USA

Connie Delaney, School of Nursing, University of Minnesota, Minneapolis, USA

Lisiane Prunelli, School of Nursing, University of Minnesota, Minneapolis, USA

Gyorgy Simon, Dept. of Health Sciences Research, Mayo Clinic, Rochester, MN

ABSTRACT Our aging population increasingly suffers from multiple chronic diseases simultaneously, necessitating the comprehensive treatment of these conditions. Finding the optimal set of drugs and interventions for a combinatorial set of diseases is a combinatorial pattern exploration problem. Association rule mining is a popular tool for such problems, but health care requires causal, rather than associative, patterns, which renders association rule mining unsuitable. One purpose of this study was to apply Surviving Sepsis Campaign (SSC) guideline recommendations to Electronic Health Records (EHR) data for patients with severe sepsis or septic shock and to determine guideline compliance as well as its impact on inpatient mortality and sepsis complications. Propensity score matching in conjunction with bootstrap simulation was used to match patients with and without exposure to the SSC recommendations. Findings showed that EHR data could be used to estimate compliance with SSC recommendations as well as the effect of compliance on outcomes. Further, we propose a novel framework based on the Rubin-Neyman causal model for extracting causal rules from observational data, correcting for a number of common biases. Specifically, given a set of interventions and a set of items that define subpopulations (e.g., diseases), we wish to find all subpopulations in which effective intervention combinations exist and, in each such subpopulation, we wish to find all intervention combinations such that dropping any intervention from the combination will reduce the efficacy of the treatment. A key aspect of our framework is the concept of closed intervention sets, which extend the concept of quantifying the effect of a single intervention to a set of concurrent interventions. Closed intervention sets also allow for a pruning strategy that is strictly more efficient than the traditional pruning strategy used by the Apriori algorithm. To implement our ideas, we introduce and compare five methods of estimating causal effect from observational data, rigorously evaluate them on synthetic data, and mathematically prove (when possible) why they work. We also evaluated our causal rule mining framework on the EHR data of a large cohort of patients from Mayo Clinic and showed that the patterns we extracted are sufficiently rich to explain the controversial findings in the medical literature regarding the effect of a class of cholesterol drugs on Type-II Diabetes Mellitus (T2DM).

Keywords Causal Inference, Confounding, Counterfactual Estimation.

1. RELATED WORK

Causation has received substantial research interest in many areas. In computer science, Pearl [5] and Rosenbaum [6] laid the foundation for causal inference, upon which several fields, including cognitive science, econometrics, epidemiology, philosophy and statistics, have built their respective methodologies [7, 8, 9]. At the center of causation is a causal model. Arguably, one of the earliest and most popular models is the Rubin-Neyman causal model [3]. Under this model, X causes Y if X occurs before Y and, without X, Y would be different. Besides the Rubin-Neyman model, there are several other causal models, including Granger causality [10] for time series, Bayesian networks [11], Structural Equation Modeling [8], causal graphical models [12], and more generally, probabilistic graphical models [13]. In our work, we use the potential outcome framework from the Rubin-Neyman model and we use causal graphical models to identify and correct for biases.

Causal graphical models are tools to visualize causal relationships among variables. Nodes of the causal graph are variables and edges are causal relationships. Most methods assume that the causal graph structure is given a priori; however, methods have been proposed for discovering the structure of the causal graph [14, 15]. In our work, the structure is partially given: we know the relationships among groups of variables, but we have to assign each variable to the correct group based on data. Knowing the correct graph structure is important, because substructures in the graph are suggestive of sources of bias. To correct for biases, we look for specific substructures. For example, causal chains can be sources of overcorrection bias and "V"-shaped structures can be indicative of confounding or endogenous selection bias [9]. Many other interesting substructures have been studied [16, 17, 18]. In our work, we consider three fundamental structures: direct causal effect, indirect causal effect and confounding. Of these, confounding is the most severe and has received the most research interest. Numerous methods exist to handle confounding, including propensity score matching (PSM) [19], marginal structural models [9] and g-estimation [8]. The latter two extend PSM to various situations, for example, to time-varying interventions [9].

Propensity score matching is used to estimate the effect of an intervention on an outcome. The propensity score is the propensity (probability) of a patient receiving the intervention given their baseline characteristics, and it is used to create a new population that is free of confounding. Many PSM techniques exist and they typically differ in how they use the propensity score to create this new population [20, 21, 22, 23].

Applications of causal modeling are not exclusive to the social and life sciences. In data mining, Lambert et al. [24] investigated the causal effect of new features on click-through rates and Chan et al. [25] used doubly robust estimation techniques to determine the efficacy of display advertisements. Even extending association rule mining to causal rule mining has been attempted before [26, 27, 28]. Li et al. [26] used the odds ratio to identify causal patterns and later extended their technique [28] to handle large data sets. Their technique, however, is not rooted in a causal model and hence offers no protection against computing systematically biased estimates. In their proposed causal decision trees [29], they used the potential outcome framework, but still have not addressed correction for various biases, including confounding.

2. SIMPLE CAUSAL RULE MINING IN IRREGULAR TIME-SERIES DATA

2.1 Introduction

According to the Centers for Disease Control and Prevention, the incidence of sepsis or septicemia doubled from 2000 through 2008, and hospitalizations increased by 70% for these diagnoses¹. In addition, severe sepsis and shock have higher mortality rates than other sepsis diagnoses, accounting for an estimated mortality between 18% and 40%. During the first 30 days of hospitalization, mortality can range from 10% to 50% depending on the patient's risk factors. Patients with severe sepsis or septic shock are sicker, have longer hospital stays, are more frequently discharged to other short-term hospital or long-term care institutions, and represent the most expensive hospital condition treated in 2011².

The use of evidence-based practice (EBP) guidelines, such as the Surviving Sepsis Campaign (SSC), could lead to an earlier diagnosis and, consequently, earlier treatment. However, these guidelines have not been widely incorporated into clinical practice. The SSC is a compilation of international recommendations for the management of severe sepsis and shock. Many of these recommendations are interventions to prevent further system deterioration during and after diagnosis. Even when the presence of sepsis or progression to sepsis is suspected early in the course of treatment, timely implementation of adequate treatment management and guideline compliance are still a challenge. Therefore, the effectiveness of the guideline in preventing clinical complications for this population is still unclear to clinicians and researchers alike.

The majority of studies have focused on early detection and prevention of sepsis, and little is known about the compliance rate with the SSC and the impact of compliance on the prevention of sepsis-related complications. Further, the measurement of adherence to individual SSC recommendations, rather than to the entire SSC, is, to our knowledge, limited. The majority of studies have used traditional randomized controlled trials with analytic techniques such as regression modeling to adjust for risk factors known from previous research. Data-driven methodologies, such as data mining techniques and machine learning, have the potential to identify new insights from electronic health records (EHRs) that can strengthen existing EBP guidelines.

The national mandate for all health professionals to implement interoperable EHRs by 2015 provides an opportunity for the reuse of potentially large amounts of EHR data to address new research questions that explore patterns of patient characteristics, evidence-based guideline interventions, and improvement in health. Furthermore, expanding the range of variables documented in EHRs to include team-based assessment and intervention data can increase our understanding of the compliance with EBP guidelines and the influence of these guidelines on patient outcomes. In the absence of such data elements, adherence to guidelines can only be inferred; it cannot be directly observed. In this section, we present a methodology for using EHR data to estimate the compliance with the SSC guideline recommendations and to estimate the effect of the individual recommendations in the guideline on the prevention of in-hospital mortality and sepsis-related complications in patients with severe sepsis and septic shock.

2.2 Methods

Data from the EHR of a health system in the Midwest were transferred to a clinical data repository (CDR) at the University of Minnesota, which is funded through a Clinical Translational Science Award. After IRB approval, de-identified data for all adult patients hospitalized between 1/1/09 and 12/31/11 with a severe sepsis or shock diagnosis were obtained for this study.

2.2.1 Data and cohort selection

The sample included 186 adult patients aged 18 years or older with an ICD-9 diagnosis code of severe sepsis or shock (995.92 and 785.5*) identified from billing data. Because the 785.5* codes corresponding to shock can capture patients without sepsis, patients without severe sepsis or septic shock and patients who did not receive antibiotics were excluded. These exclusions aimed to capture only those patients who had severe sepsis or septic shock and were treated for that clinical condition. The final sample consisted of 177 patients.

2.2.2 Variables of interest

Fifteen predictor variables (baseline characteristics) were collected. These include sociodemographics and health disparities data: age, gender, race, ethnicity, and payer (Medicaid represents low income); laboratory results: lactate and white blood cell count (WBC); vital signs: heart rate (HR), respiratory rate (RR), temperature (Temp), and mean arterial blood pressure (MAP); and diagnoses for respiratory, cardiovascular, cerebrovascular, and kidney-related comorbid conditions. ICD-9 codes for comorbid conditions were selected according to evidence in the literature. Comorbidities were aggregated from the patient's prior problem list to detect preexisting (upon admission) respiratory, cardiovascular, cerebrovascular, and kidney problems. Each category was treated as yes/no according to whether any of the ICD-9 codes in that category were present. The outcomes of interest were in-hospital mortality and development of new complications (respiratory, cardiovascular, cerebrovascular, and kidney) during the hospital encounter. New complications were determined as the presence of ICD-9 codes in the patient's billing data that did not exist at the time of admission.

2.2.3 Study design

This study aimed to analyze compliance with the SSC guideline recommendations in patients with severe sepsis or septic shock. Therefore, the baseline (TimeZero) was defined as the onset of sepsis and the patients were under observation until discharge. Unfortunately, the timestamp for the diagnoses is dated back to the time of admission; hence the onset of sepsis needs to be estimated. The onset time for sepsis was defined as the earliest time during a hospital encounter when the patient meets at least two of the following six criteria: MAP < 65, HR > 100, RR > 20, temperature < 95 or > 100.94, WBC < 4 or > 12, and lactate > 2.0. The onset time was established based on current clinical practice and the literature on sepsis⁵. At the earliest time when two or more of these conditions were met, a TimeZero flag was added and the timing of the SSC compliance commenced.
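The onset rule above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the `(timestamp, variable, value)` record layout, the function name `estimate_time_zero`, and the choice to carry the most recent value of each measurement forward are all assumptions; only the six thresholds come from the text.

```python
# Abnormality tests for the six criteria listed in the text.
CRITERIA = {
    "MAP": lambda v: v < 65,
    "HR": lambda v: v > 100,
    "RR": lambda v: v > 20,
    "Temp": lambda v: v < 95 or v > 100.94,
    "WBC": lambda v: v < 4 or v > 12,
    "Lactate": lambda v: v > 2.0,
}


def estimate_time_zero(observations, min_criteria=2):
    """Return the earliest timestamp at which at least `min_criteria`
    criteria are abnormal, or None if that never happens.

    `observations` is an iterable of (timestamp, variable, value) tuples
    (hypothetical layout). Assumption: the latest value of each variable
    is carried forward until a newer measurement replaces it.
    """
    latest = {}
    for timestamp, variable, value in sorted(observations, key=lambda o: o[0]):
        if variable not in CRITERIA:
            continue  # ignore measurements that are not part of the rule
        latest[variable] = value
        abnormal = sum(1 for name, v in latest.items() if CRITERIA[name](v))
        if abnormal >= min_criteria:
            return timestamp
    return None


# Hypothetical encounter: hours since admission, variable, value.
example = [
    (0.0, "HR", 90),
    (1.0, "MAP", 60),
    (2.0, "RR", 18),
    (3.0, "HR", 110),
    (4.0, "Lactate", 3.0),
]
time_zero = estimate_time_zero(example)
```

In this hypothetical encounter, low MAP alone at hour 1 is not sufficient; the flag is raised once a second criterion (elevated HR) becomes abnormal.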

2.2.4 Guideline compliance

SSC guideline recommendations were translated into a readily computable set of rules. Each rule has a condition related to an observation (e.g., MAP < 65 mmHg) and an intervention to administer (e.g., give vasopressors) if the patient meets the condition of the rule. The SSC guideline was transformed into 15 rules in a computational format, one for each recommendation in the guideline, and each rule was evaluated for each patient (see Figure 1). After each rule is an abbreviated name subsequently used in this paper.

Figure 1: SSC rules for measuring guideline compliance

We call the treatment of a patient compliant (exposed) for a specific recommendation if the patient meets the condition of the corresponding rule at any time after TimeZero and the required intervention was administered; the treatment is non-compliant (unexposed) if the patient meets the condition of the corresponding rule after TimeZero but the intervention was not administered (at any time after TimeZero); and the recommendation is not applicable to a treatment if the patient does not meet the condition of the corresponding rule. In estimating compliance (as a metric) with a specific recommendation, we simply count the compliant encounters among the encounters to which the recommendation is applicable. In this phase of the study, the time when a recommendation was administered was not incorporated in the analysis.

We also estimate the effect of the recommendation on the outcomes. We call a patient exposed to a recommendation if the recommendation is applicable to the patient and the corresponding intervention was administered to the patient. We call a patient unexposed to a recommendation if the recommendation is applicable but was not applied (the treatment was non-compliant). The incidence fraction in exposed patients with respect to an outcome is the fraction of patients with the outcome among the exposed patients. The incidence fraction of the unexposed patients is defined analogously. We define the effect of the recommendation on an outcome as the difference in the incidence fractions between the unexposed and exposed patients. The recommendation is beneficial (protective against an outcome) if the effect is positive, namely, the incidence fraction in the unexposed patients is higher than the incidence fraction in the exposed patients.
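The definitions above can be made concrete with a short sketch. The encounter fields (`meets_condition`, `intervention_given`, `outcome`) and the function names are hypothetical; reporting compliance as a rate in addition to a count is also an assumption, since the text only speaks of counting compliant encounters.

```python
from typing import Dict, List, Optional, Tuple


def classify(meets_condition: bool, intervention_given: bool) -> Optional[str]:
    """Label one encounter for one recommendation."""
    if not meets_condition:
        return None  # recommendation not applicable
    return "exposed" if intervention_given else "unexposed"


def incidence_fraction(outcomes: List[int]) -> Optional[float]:
    """Fraction of patients with the outcome; None for an empty group."""
    return sum(outcomes) / len(outcomes) if outcomes else None


def compliance_and_effect(
    encounters: List[Dict],
) -> Tuple[int, Optional[float], Optional[float]]:
    """Return (compliant count, compliance rate, effect).

    effect = incidence fraction (unexposed) - incidence fraction (exposed),
    so a positive value means the recommendation looks protective.
    """
    groups: Dict[str, List[int]] = {"exposed": [], "unexposed": []}
    for enc in encounters:
        label = classify(enc["meets_condition"], enc["intervention_given"])
        if label is not None:
            groups[label].append(enc["outcome"])

    n_applicable = len(groups["exposed"]) + len(groups["unexposed"])
    rate = len(groups["exposed"]) / n_applicable if n_applicable else None

    exposed_if = incidence_fraction(groups["exposed"])
    unexposed_if = incidence_fraction(groups["unexposed"])
    effect = None
    if exposed_if is not None and unexposed_if is not None:
        effect = unexposed_if - exposed_if
    return len(groups["exposed"]), rate, effect


# Hypothetical encounters (outcome: 1 = complication occurred).
encounters = (
    [{"meets_condition": True, "intervention_given": True, "outcome": o} for o in (1, 0, 0, 0)]
    + [{"meets_condition": True, "intervention_given": False, "outcome": o} for o in (1, 1)]
    + [{"meets_condition": False, "intervention_given": False, "outcome": 0}]
)
n_compliant, compliance_rate, effect = compliance_and_effect(encounters)
```

Note that this raw difference is not yet adjusted for confounding; it only restates the definitions.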

2.2.5 Data quality

Included variables were assessed for data quality regarding accuracy and completeness based on the literature and domain knowledge. Constraints were determined for plausible values, e.g., a CVP reading could not be greater than 50. Values outside of constraints were recoded as missing values. Any observation that took place before the estimated onset of sepsis (TimeZero) was considered a baseline observation. Simple mean imputation was the method of choice for imputing missing values. Imputation was necessary for lactate (7.7%), temperature (3%), and WBC (3%). There was no missing data for the other variables and for the outcomes of interest. Central venous pressure was not included as a baseline characteristic due to the high number of missing values (54%).
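The two cleaning steps described above, recoding implausible values as missing and then applying simple mean imputation, can be sketched as below. The function name and the lower bound of 0 are assumptions made for illustration; only the upper bound of 50 for CVP is taken from the text.

```python
from statistics import mean
from typing import List, Optional


def clean_and_impute(
    values: List[Optional[float]], lower: float, upper: float
) -> List[float]:
    """Recode values outside [lower, upper] as missing, then mean-impute."""
    # Step 1: values outside the plausibility constraints become missing.
    recoded = [
        v if v is not None and lower <= v <= upper else None for v in values
    ]
    observed = [v for v in recoded if v is not None]
    if not observed:
        raise ValueError("no plausible observations to impute from")
    # Step 2: simple mean imputation over the remaining observed values.
    fill = mean(observed)
    return [fill if v is None else v for v in recoded]


# Hypothetical CVP readings: 60.0 violates the constraint, None is missing.
cvp = clean_and_impute([8.0, 60.0, 12.0, None], lower=0.0, upper=50.0)
```

Mean imputation keeps the sample size intact but shrinks the variance of the imputed variable, which is one reason a variable with 54% missing values, such as CVP, is better left out.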

2.2.6 Propensity score matching

Patients who received SSC recommendations may be in worse health than patients who did not receive SSC recommendations. For example, patients whose lactate was measured may have more apparent (and possibly more advanced) sepsis than patients whose lactate was not measured. To compensate for such disparities, propensity score matching (PSM) was employed. The goal of PSM is to balance the data set in terms of the covariates between patients exposed and unexposed to the SSC guideline recommendations. This is achieved by matching exposed patients with unexposed patients on their propensity (probability) of receiving the recommendations. This ensures that at TimeZero, pairs of patients, one exposed and one unexposed, are in the same state of health and differ only in their exposure to the recommendation. PSM is a popular technique for estimating treatment effects. To compute the propensity of patients to receive treatment, a logistic regression model was used, where the dependent variable is exposure to the recommendation and the independent variables are the covariates. The linear prediction (propensity score) of this model was computed for every patient. A new (matched) population was created from pairs of exposed and unexposed patients with matching propensity scores. Two scores match if they differ by no more than a certain caliper (0.1 in our study). The effect of the recommendation was estimated by comparing the incidence fraction among the exposed and unexposed patients in the matched population.
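The matching step can be illustrated with a small sketch. It assumes the propensity scores have already been estimated (for instance by a logistic regression fitted elsewhere) and uses greedy one-to-one nearest-neighbor matching without replacement, which is only one of several possible matching schemes; the text does not state which one was used. The function names and patient identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def caliper_match(
    exposed_scores: Dict[str, float],
    unexposed_scores: Dict[str, float],
    caliper: float = 0.1,
) -> List[Tuple[str, str]]:
    """Greedy 1:1 matching: pair each exposed patient with the closest
    still-unmatched unexposed patient if their scores differ by at most
    `caliper`. Exposed patients without such a partner are dropped."""
    available = dict(unexposed_scores)
    pairs = []
    for e_id, e_score in sorted(exposed_scores.items(), key=lambda kv: kv[1]):
        if not available:
            break
        u_id = min(available, key=lambda k: abs(available[k] - e_score))
        if abs(available[u_id] - e_score) <= caliper:
            pairs.append((e_id, u_id))
            del available[u_id]  # matching without replacement
    return pairs


def matched_effect(pairs: List[Tuple[str, str]], outcome: Dict[str, int]) -> float:
    """Incidence fraction (unexposed) minus incidence fraction (exposed)
    in the matched population."""
    if not pairs:
        raise ValueError("no matched pairs")
    exposed_rate = sum(outcome[e] for e, _ in pairs) / len(pairs)
    unexposed_rate = sum(outcome[u] for _, u in pairs) / len(pairs)
    return unexposed_rate - exposed_rate


# Hypothetical scores and outcomes (1 = outcome occurred).
exposed = {"e1": 0.30, "e2": 0.80}
unexposed = {"u1": 0.35, "u2": 0.55, "u3": 0.10}
pairs = caliper_match(exposed, unexposed, caliper=0.1)
effect = matched_effect(pairs, {"e1": 0, "u1": 1})
```

Here `e2` has no unexposed patient within the caliper and is left out of the matched population, which is how caliper matching trades sample size for comparability.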

2.2.7 PSM nested inside bootstrapping simulation

In order to incorporate additional sources of variability arising from the estimation of the propensity score model and from the propensity score matched sample, 500 bootstrap samples were drawn from the original sample. In each of these bootstrap iterations, the propensity score model was re-estimated, patients were matched using the caliper matching technique described above, and the effect of the recommendation was computed with respect to all outcomes. In recent years, bootstrap simulation has been widely employed in conjunction with PSM to better handle bias and confounding variables. For each recommendation and outcome, the 500 bootstrap iterations result in 500 estimates of the effect (of the recommendation on the outcome), approximating the sampling distribution of the effect.
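The nesting described above can be sketched generically: each bootstrap iteration resamples patients with replacement and reruns the whole estimation pipeline on the resample. In this sketch the pipeline is passed in as `estimate_effect`, a hypothetical callable standing in for "fit the propensity model, match, and compute the effect"; the percentile interval is likewise an assumption about how an interval could be read off the 500 estimates.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def bootstrap_effects(
    patients: Sequence,
    estimate_effect: Callable[[List], Optional[float]],
    n_iterations: int = 500,
    seed: int = 0,
) -> List[float]:
    """Rerun `estimate_effect` on `n_iterations` resamples of `patients`.

    Iterations in which the estimator returns None (e.g. no matched
    pairs were found) are skipped.
    """
    rng = random.Random(seed)
    effects = []
    for _ in range(n_iterations):
        # Resample with replacement, same size as the original sample.
        sample = [rng.choice(patients) for _ in patients]
        effect = estimate_effect(sample)
        if effect is not None:
            effects.append(effect)
    return effects


def percentile_interval(
    effects: List[float], level: float = 0.95
) -> Tuple[float, float]:
    """Percentile interval from the bootstrap distribution of the effect."""
    if not effects:
        raise ValueError("no bootstrap estimates")
    ordered = sorted(effects)
    lo_idx = int((1 - level) / 2 * (len(ordered) - 1))
    hi_idx = int((1 + level) / 2 * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Hypothetical stand-in estimator: mean of a binary outcome in the resample.
outcomes = [0, 1] * 5
effects = bootstrap_effects(outcomes, lambda s: sum(s) / len(s))
low, high = percentile_interval(effects)
```

Because the estimator is rerun inside every iteration, the spread of `effects` reflects both sampling variability and the variability introduced by model fitting and matching.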

2.3 Results

Table 1 shows the baseline characteristics of the study population. Results are reported as total count for categorical variables, and mean with inter-quartile (25% to 75%) range for continuous variables. As shown in Table 1, the majority of patients were male, Caucasian, and had Medicaid as the payer. Before the onset of sepsis, cardiovascular comorbidities (56.4%) were common, and the mean HR (101.3), lactate (2.8), and WBC (15.8) were slightly above normal. The mean length of stay for the sample was 15 days, ranging from less than 24 hours to 6 months. TimeZero was within the first 24 hours of admission, and patients at that time were primarily (86.4%) in the emergency department.

Feature                      Count / Mean
Total Number of Patients     177
Average Age                  61
Gender (Male)                102
Race (Caucasian)             97
Ethnicity (Latino)           11
Payer (Medicaid)             102
White Blood Cell             15.8
Lactate                      2.8
Mean Blood Pressure          73.9
Temperature                  98.4
Heart Rate                   101.3
Respiratory Rate             20.6
Cardiovascular               100
Cerebrovascular              66
Respiratory                  69
Kidney                       62

Table 1: Demographic statistics of the patient population

In Figure 2, the effects of various rule-complication pairs are depicted. An effect is defined as the difference in the mean rate of progression to complications between the unexposed and exposed groups. Since we used bootstrap simulation, 500 replications were performed for each rule-complication pair, resulting in a sampling distribution for the effect. The sampling distribution for each rule-complication pair is presented as a boxplot. The boxplots represent the statistic measured, i.e., in this study, the differential impact of a recommendation on an outcome (e.g., mortality) between the exposed and unexposed populations. When this statistic is 0, the recommendation has no effect. If the statistic is greater than 0, the recommendation is protective against that specific outcome; if the statistic is below 0, the recommendation may even increase the risk of that outcome. The panes (groups of boxplots) correspond to the complications and the boxes within each pane correspond to the recommendations (rules). For example, the effect of the Ventilator rule (Recommendation 15: patients in respiratory distress should be put on a ventilator) on mortality (Death) is shown in the rightmost box (Ventilator) in the bottom-most pane (Death). Since all effects in this boxplot are above 0, namely the incidence of death in the unexposed group is higher than in the exposed group, compliance with the Ventilator rule reduces the number of deaths. Therefore, the corresponding recommendation is beneficial in protecting patients from Death (mortality). In Table 3, we present the 95% confidence intervals for various rule-outcome pairs.

Figure 3: Distribution of the propensity scores in the exposed and unexposed groups for the outcome Death when the SSC recommendation was Ventilator.

Figure 2: Boxplots of the mean difference between groups (unexposed - exposed) for each guideline recommendation and each of the outcomes of interest.

To further ensure the validity of the results, we examine the propensity score distribution in the exposed and unexposed groups. As an example, Figure 3 illustrates the propensity score distribution for a randomly selected bootstrap iteration used to measure the effect of Ventilator on Death. The horizontal axis represents the propensity score, which is the probability of receiving the intervention, and the vertical axis represents the density, namely the proportion of patients in each group with a particular propensity for being put on a ventilator. Figure 3 shows substantial overlap between the propensity scores in the exposed and unexposed groups. This overlap indicates that, for the Ventilator recommendation and the outcome Death, balancing the exposed and unexposed populations on the propensity score was successful. Other rule-complication pairs exhibit similar propensity score distributions.

3. CONCLUSION

The overall purpose of this study was to use EHR data to determine compliance with the Surviving Sepsis Campaign (SSC) guideline and to measure its impact on inpatient mortality and sepsis complications in patients with severe sepsis and septic shock. Results showed that compliance exceeded 95% for the recommendations concerning MAP and CVP, with fluid resuscitation given for low readings. Other high-compliance (greater than 80%) recommendations were insulin given for high blood glucose and evaluating respiratory distress. The recommendations with the lowest compliance (< 30%) were vasopressor or albumin for continuing low MAP or CVP readings. This may be due to a study design artifact, where the rule only considered interventions initiated after TimeZero (the estimated onset of sepsis) while the fluid resuscitation may have taken place earlier. Alternatively, the apparently poor compliance could also be explained by issues related to the coding of fluids: during data validation, we found that it was difficult to track fluids.

Our study also demonstrates that retrospective EHR data can be used to evaluate the effect of compliance with guideline recommendations on outcomes. We found a number of SSC recommendations that were significantly protective against more than one complication: Ventilator was protective against Cardiovascular and Respiratory complications as well as Death; use of Vasopressors was protective against Respiratory complications. Other recommendations, namely BCulture, Antibiotic, Vasopressor, Lactate, CVP, and RespDistress, showed results less consistent with our expectation. For instance, Vasopressor, used to treat low MAP, appears to increase cerebrovascular complications. While this finding is not statistically significant, it may be congruent with the fact that small brain vessels are very sensitive to changes in blood pressure. Low MAP can cause oxygen deprivation and, consequently, brain damage. Ventilator, Vasopressor, and BGlucose showed protective effects against Respiratory complications. The SSC guideline recommends the implementation of ventilator therapy as soon as any change in respiratory status is noticed. This intervention aims to protect the patient against further system stress, reverse hypoxia, help with perfusion across the main cardiorespiratory vessels, and decrease the release of toxins due to respiratory efforts.

Our study is a proof-of-concept study demonstrating that EHR data can be used to estimate the effect of guideline recommendations. However, for several combinations of recommendations and outcomes, the effect was not significant. We believe that the reason is that guidelines represent workflows and the effect of a workflow goes beyond the effects of the individual guideline recommendations. For example, by considering the recommendations outside the context of the workflow, we may ignore whether the intervention addressed the condition that triggered its administration. If low MAP triggered the administration of vasopressors, then without considering the workflow we do not know whether MAP returned to normal levels thereafter. Thus, we cannot equate an adverse outcome with the failure of the guideline; it may be the result of the insufficiency of the intervention. Moving forward, we are going to model the workflows behind the guidelines and apply the same principles that we developed in this work to estimate the effect of the entire workflow.

This phase of our study did not address the timing of recommendations or the time prior to TimeZero. For this analysis, guideline compliance was considered only after TimeZero (the estimated onset), since compliance with the SSC is only necessary in the presence of suspected or confirmed sepsis. There is no reason to suspect sepsis before TimeZero. However, some interventions may have started earlier, without respect to sepsis. For example, 100% of the patients in this sample had antibiotics (potentially preventive antibiotics), but only 99 (55%) patients received them after TimeZero.

The EHR does not provide the date and time for certain ICD-9 diagnoses. During a hospital stay, all new diagnoses are recorded with the admission date. We know whether a diagnosis was present on admission or not, and thus whether it is a preexisting or new condition, but we do not know precisely when the patient developed the condition during the hospitalization. For this reason, we are unable to detect whether the SSC guideline was applied before or after a complication occurred, and thus we may underestimate the beneficial effect of some of the recommendations. For example, high levels of lactate are highly related to hypoxia and pulmonary damage. If these patients were checked for lactate after pulmonary distress, we would consider the treatment compliant with the Lactate recommendation, but we would not know that the respiratory distress was already present at the time of the lactate measurement and we would incorrectly count it as a complication that the guideline failed to prevent.

4. COMPLEX CAUSAL RULE MINING IN IRREGULAR TIME-SERIES DATA

4.1 Introduction

Effective management of human health remains a major societal challenge, as evidenced by the rapid growth in the number of patients with multiple chronic conditions. Type-II Diabetes Mellitus (T2DM), one of those conditions, affects 25.6 million (11.3%) Americans of age 20 or older and is the seventh leading cause of death in the United States [1]. Effective treatment of T2DM is frequently complicated by diseases comorbid to T2DM, such as high blood pressure, high cholesterol, and abdominal obesity. Currently, these diseases are treated in isolation, which leads to wasteful duplicate treatments and suboptimal outcomes. The recent rise in the number of patients with multiple chronic conditions necessitates comprehensive treatment of these conditions to reduce medical waste and improve outcomes.

Finding the optimal treatment for patients who suffer from multiple associated diseases, each of which can have multiple available treatments, is a complex problem. We could simply use techniques based on association, but a reasonable algorithm would likely find that the use of a drug is associated with some unfavorable outcome. This does not mean that the drug is harmful; in fact, in many cases, it simply means that patients who take the drug are sicker than those who do not and thus have a higher chance of the unfavorable outcome. What we really wish to know is whether a treatment causes an unfavorable outcome, as opposed to being merely associated with it.

The difficulty in quantifying the effect of interventions on outcomes stems from subtle biases. Suppose we wish to quantify the effect of a cholesterol-lowering agent, statin, on diabetes. We could simply compare the proportion of diabetic patients in the subpopulation that takes statin and the subpopulation that does not, and estimate the effect of statin as the difference between the two proportions. This method would give the correct answer only if the statin-taking and non-statin-taking patients are identical in all respects that influence the diabetes outcome. We refer to this situation as treated and untreated patients being comparable. Unfortunately, statin-taking patients are not comparable to non-statin-taking patients, because they take statin to treat high cholesterol, which in and of itself increases the risk of diabetes. High cholesterol confounds the effect of statin. Many different sources of bias exist; confounding is just one of them. In this manuscript, we address several different sources of bias, including confounding.

Techniques to address such biases in causal effect estimation exist. However, these techniques have been designed to quantify the effect of a single intervention. In trying to apply these techniques to our problem of finding the optimal treatment for patients suffering from varying sets of diseases, we face two challenges. First, patients with multiple conditions will likely need a combination of drugs. Quantifying the effect of multiple concurrent interventions is semantically different from considering only a single intervention. The key concept in estimating the effect of an intervention is comparability: to estimate the effect of an intervention, we need two groups of patients who are identical in all relevant aspects except that one group receives the intervention and the other group does not. For a single intervention, the untreated group typically consists of the sickest patients who still do not get treated and the treated group consists of the healthiest patients who get treatment. They are reasonably in the same state of health. However, when we go from a single intervention to multiple interventions and try to estimate their joint effect, comparability no longer exists. A patient requiring multiple simultaneous interventions is so fundamentally different from a patient who does not need any intervention that they are not comparable. The other key challenge in finding optimal intervention sets for patients with combinatorial sets of diseases is the combinatorial search space. Even if we could trivially extend the methods for quantifying the effect of a single intervention to a set of concurrent interventions, we would have to systematically explore a combinatorially large search space.
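The statin example above can be made concrete with a small numeric sketch of how a confounder distorts a naive comparison. All counts below are hypothetical and chosen purely for illustration; they are not clinical findings and do not come from this study. The function names are likewise assumptions. Within each cholesterol stratum the treated patients have the lower risk, yet the pooled comparison points the other way because treated patients are concentrated in the high-risk stratum.

```python
from typing import Dict, List, Tuple

# (stratum, treated) -> (number of patients, number with the outcome).
# Hypothetical counts for illustration only.
COUNTS: Dict[Tuple[str, bool], Tuple[int, int]] = {
    ("high_cholesterol", True): (800, 160),
    ("high_cholesterol", False): (200, 50),
    ("normal_cholesterol", True): (200, 10),
    ("normal_cholesterol", False): (800, 80),
}


def risk(cells: List[Tuple[int, int]]) -> float:
    """Pooled outcome proportion over a list of (n, n_outcome) cells."""
    n = sum(c[0] for c in cells)
    k = sum(c[1] for c in cells)
    return k / n


def naive_difference(counts) -> float:
    """Risk in treated minus risk in untreated, ignoring the confounder."""
    treated = [v for (_, t), v in counts.items() if t]
    untreated = [v for (_, t), v in counts.items() if not t]
    return risk(treated) - risk(untreated)


def stratified_difference(counts) -> float:
    """Stratum-specific risk differences averaged with weights
    proportional to stratum size (direct standardization)."""
    strata = sorted({s for s, _ in counts})
    total = sum(n for n, _ in counts.values())
    diff = 0.0
    for s in strata:
        n_s = counts[(s, True)][0] + counts[(s, False)][0]
        d = risk([counts[(s, True)]]) - risk([counts[(s, False)]])
        diff += (n_s / total) * d
    return diff


naive = naive_difference(COUNTS)          # positive: treatment looks harmful
adjusted = stratified_difference(COUNTS)  # negative: harm disappears within strata
```

Stratification is used here only because it is the simplest adjustment to show in a few lines; it becomes impractical as the number of confounders grows, which is one motivation for propensity-score-based methods.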

The association rule mining framework [2] provides an efficient solution for exploring combinatorial search spaces; however, it only detects associative relationships. Our interest is in causal relationships. In this manuscript, we propose causal rule mining, a framework for transitioning from association rule mining towards causal inference in subpopulations. Specifically, given a set of interventions and a set of items to define subpopulations, we wish to find all subpopulations in which effective intervention combinations exist and, in each such subpopulation, we wish to find all intervention combinations such that dropping any intervention from the combination will reduce the efficacy of the treatment. We call these closed intervention sets, which are not to be confused with closed itemsets. As a concrete example, interventions can be drugs and subpopulations can be defined in terms of their diseases; for each subpopulation (set of diseases), our algorithm would return effective drug cocktails with an increasing number of constituent drugs. Leaving out any drug from the cocktail will reduce the efficacy of the treatment. Closed intervention sets allow us to go from estimating the effect of a single intervention to that of multiple interventions. To address the exploration of the combinatorial search space, we propose a novel frequency-based anti-monotonic pruning strategy enabled by the closed intervention set concept. The essence of the anti-monotonic property is that if a set I of interventions does not satisfy a criterion, none of its supersets will. The proposed pruning strategy based on the closed intervention set concept is strictly more efficient than the traditional pruning strategy used by the Apriori algorithm [2]. Underneath our combinatorial exploration algorithm, we utilize the Rubin-Neyman model of causation [3]. This model sets two conditions for causation: a set X of interventions causes a change in Y iff X happens before Y and Y would be different had X not occurred.
The unobservable outcome of what would happen had a treated patient not received treatment is a potential outcome and needs to be estimated. We present and compare five methods for estimating these potential outcomes and describe the biases these methods can correct. Typically, the ground truth for the effect of drugs is not known. In order to assess the quality of the estimates, we conduct a simulation study utilizing five different synthetic data sets, each of which introduces a new source of bias. We evaluate the effect of the bias on the five proposed methods, underscoring the statements with rigorous proofs when possible. We also evaluate our work on a real clinical data set from Mayo Clinic. We have data for over 52,000 patients with 13 years of follow-up time. Our outcome of interest is 5-year incident T2DM and we wish to extract patterns of interventions for patients suffering from combinations of common comorbidities of T2DM. First, we evaluate our methodology in terms of the computational cost, demonstrating the effectiveness of the pruning methodologies. Next, we evaluate the patterns qualitatively, using patterns involving statins. We show that our methodology extracted patterns that allow us to explain the controversial patterns surrounding statin [4]. Contributions. (1) We propose a novel framework for extracting causal rules from observational data, correcting for a number of common biases. (2) We introduce the concept of closed intervention sets to extend the concept of quantifying the effect of a single intervention to a set of concurrent interventions, sidestepping the patient comparability problem. Closed intervention sets also allow for a pruning strategy that is strictly more efficient than the traditional pruning strategy used by the Apriori algorithm [2]. (3) We compare five methods of estimating causal effect from observational data that are applicable to our problem and rigorously evaluate them on synthetic data, mathematically proving (when possible) why they work.

4.2

Background: Association Rule Mining

We first briefly review the fundamental concepts of association rule mining and extend these concepts to causal rule mining in the next section. Consider a set $\mathcal{I}$ of items, which are single-term predicates evaluating to ‘true’ or ‘false’. For example, {age > 55} can be an item. A k-itemset is a set of k items, evaluated as the conjunction (logical ‘and’) of its constituent items. Consider a dataset D = {d1, d2, ..., dn}, which consists of n observations. Each observation, denoted by dj, is a set of items. An itemset I = {i1, i2, ..., ik} ($I \subset \mathcal{I}$) supports an observation dj if all items in I evaluate to ‘true’ in the observation. The support of I is the fraction of the observations in D that support I. An itemset is frequent if its support exceeds a pre-defined minimum support threshold. An association rule is a logical implication of the form X ⇒ Y, where X and Y are disjoint itemsets. The support of a rule is support(XY) and the confidence of the rule is

$$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{support}(XY)}{\mathrm{support}(X)} = P(Y \mid X).$$

4.2.1

Causal Rule Mining

Given an intervention itemset X and an outcome item Y, such that X and Y are disjoint, a causal rule is an implication of the form X → Y, suggesting that X causes a change in Y. Let the itemset S define a subpopulation, consisting of all observations that support S, that is, all observations for which all items in S evaluate to ‘true’. The causal rule X → Y | S implies that the intervention X has a causal effect on Y in the subpopulation defined by S. The quantity of interest is the causal effect, which is the change in Y in the subpopulation S caused by X. We will formally define the metric used to quantify the causal effect shortly.

Rubin-Neyman Causal Model. X has a causal effect on Y if (i) X happens earlier than Y and (ii) if X had not happened, Y would be different [3]. Our study design ensures that the intervention X precedes the outcome Y, but fulfilling the second condition requires that we estimate the outcome for the same patient both under intervention and without intervention.

Potential Outcomes. Every patient in the dataset has two potential outcomes: Y0 denotes their outcome had they not had the intervention X, and Y1 denotes their outcome had they had the intervention. Typically, only one of the two potential outcomes can be observed. The observable outcome is the actual outcome (denoted by Y) and the unobservable potential outcome is called the counterfactual outcome. Using the definition of the counterfactual outcome, we can now define the metric for estimating the change in Y caused by X. Average Treatment response on the Treated (ATT) is a widely known metric in the causal literature and is computed as follows:

$$\mathrm{ATT}(X \to Y \mid S) = E[Y_1 - Y_0]_{X=1} = E[Y_1]_{X=1} - E[Y_0]_{X=1},$$

where E denotes the expectation and the X = 1 in the subscript signals that we only evaluate the expectation in the treated patients (X = 1). ATT aims to compute an average per-patient change caused by the intervention. Y0 = Y1 indicates that the intervention resulted in no change in outcome for the patient.

Biases. Besides X, numerous other variables can also exert influence over Y, leading to biases in the estimates. To correct for these biases, we have to correctly account for these other effects. The quintessential tool for this purpose is the causal graph, depicted in Figure 4. The nodes of this graph are sets of variables that play a causal role and the edges are causal effects. This is not a correlation graph (or dependence graph) because, for example, U and Z are dependent given X, yet there is no edge between them. Variables (items in $\mathcal{I}$) can exert influence on the effect of X on Y in three ways: they may only influence X, they may only influence Y, or they may influence both X and Y. Accordingly, variables can be categorized into four categories: V are variables that directly influence Y and thus have a direct effect on Y; U are variables that only influence Y through X and thus have an indirect effect on Y; Z are variables that influence both X and Y and are called confounders; and finally O are variables that do not influence either X or Y and hence can be safely ignored.
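To make the role of a confounder concrete, the following minimal sketch builds a small synthetic population in which both potential outcomes are known, so the true ATT can be computed directly and compared with the naive difference in observed outcome rates. The group sizes, the field names (`z`, `x`, `y0`, `y1`) and the helper functions are assumptions made purely for illustration; they are not taken from the study data.

```python
# Toy population with KNOWN potential outcomes (illustrative numbers only).
# z = confounder (1 = sicker), x = treated?, y0 / y1 = potential outcomes.
def group(n, z, x, y0, y1):
    """Return n identical patients."""
    return [{"z": z, "x": x, "y0": y0, "y1": y1} for _ in range(n)]

population = (
    # Sick patients (z=1): mostly treated, high baseline risk.
    group(20, 1, 1, 1, 1) + group(8, 1, 1, 1, 0) + group(12, 1, 1, 0, 0)
    + group(7, 1, 0, 1, 1) + group(3, 1, 0, 0, 0)
    # Healthy patients (z=0): mostly untreated, low baseline risk.
    + group(1, 0, 1, 1, 1) + group(1, 0, 1, 1, 0) + group(8, 0, 1, 0, 0)
    + group(8, 0, 0, 1, 1) + group(32, 0, 0, 0, 0)
)

def observed(p):
    """Only one potential outcome is observable for each patient."""
    return p["y1"] if p["x"] == 1 else p["y0"]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def true_att(pop):
    """E[Y1 - Y0] over the treated; needs the unobservable Y0."""
    return mean(p["y1"] - p["y0"] for p in pop if p["x"] == 1)

def naive_difference(pop):
    """P(Y | X=1) - P(Y | X=0), ignoring the confounder z."""
    treated = [observed(p) for p in pop if p["x"] == 1]
    untreated = [observed(p) for p in pop if p["x"] == 0]
    return mean(treated) - mean(untreated)

print("true ATT:        ", round(true_att(population), 3))
print("naive difference:", round(naive_difference(population), 3))
```

In this constructed example the treatment lowers the outcome rate among the treated, yet the naive comparison has the opposite sign, because the treated patients are sicker at baseline.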

Figure 4: Rubin-Neyman Causal Model

Most of the causal inference literature assumes that the causal graph is known and true. In other words, we know a priori which variables fall into each of the categories U, Z, V and O. In our case, only X and Y are specified and we have to infer which category each of the other variables (items) belongs to. Since this inference relies on association (dependence) rather than causation, the discovered graph may have errors, that is, variables misclassified into the wrong category. For example, because of the marginal dependence between U and Y, variables in U can easily get misclassified as Z. Such misclassifications do not necessarily lead to biases, but they can cause loss of efficiency.

Problem Formulation. Given a data set D, a set $\mathcal{S}$ of subpopulation-defining items, a set $\mathcal{X}$ of intervention items, a minimum support threshold θ and a minimum effect threshold η, we wish to find all subpopulations S ($S \subset \mathcal{S}$) and all interventions X ($X \subset \mathcal{X}$), with X and S disjoint, such that the causal rule X → Y | S is frequent and its intervention set X is closed w.r.t. our metric of causal effect, ATT. Note that the meaning of θ, the minimum support threshold, is different from that in the association rule mining literature.

Typically, rules with support less than θ are considered uninteresting; in other cases, θ is simply a computational convenience. In our case, we set θ to a minimum value such that ATT is estimable for the discovered patterns. We call a causal rule frequent iff its support exceeds the user-specified minimum threshold θ,

$$\mathrm{support}(X \to Y \mid S) = \mathrm{support}(XYS) = P(XYS) > \theta,$$

and we call an intervention set X closed w.r.t. ATT iff

$$\forall x \in X: \quad |\mathrm{ATT}(x \to Y \mid S, X \setminus x)| > \eta,$$

where η is the user-specified minimum causal effect threshold. In other words, a causal rule is closed in a subpopulation if its (absolute) effect is greater than that of any of its subrules.

Example. In a medical setting, X may be drugs and S could be comorbid diseases. Then X is a drug combination that hopefully treats the set of diseases S. This set of drugs being closed w.r.t. ATT means that dropping any drug from X will reduce the overall efficacy of the treatment; the patient is not taking unnecessary drugs. An itemset is closed if its support is strictly higher than that of all of its subitemsets. Analogously, an intervention set is closed if its absolute causal effect is strictly higher than that of all of its subitemsets.
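The two conditions of the problem formulation (a frequent rule and an intervention set that is closed w.r.t. ATT) can be sketched as two small predicates. This is only an illustration: the ATT values are supplied through a hypothetical lookup table `ATT_TABLE` with made-up drug names and numbers, whereas in practice they would come from one of the estimation methods of Section 4.3.

```python
# Hypothetical ATT estimates for one fixed subpopulation S, keyed by
# (intervention x, the other interventions X \ x given alongside it).
# Drug names and values are illustrative, not results from the paper.
ATT_TABLE = {
    ("statin", frozenset({"fibrate"})): -0.040,
    ("fibrate", frozenset({"statin"})): -0.030,
    ("statin", frozenset({"fibrate", "diuretic"})): -0.035,
    ("fibrate", frozenset({"statin", "diuretic"})): -0.028,
    ("diuretic", frozenset({"statin", "fibrate"})): 0.002,
}

def att(x, others):
    """Look up ATT(x -> Y | S, others) in the illustrative table."""
    return ATT_TABLE[(x, frozenset(others))]

def is_frequent(rule_support, theta):
    """A causal rule is frequent iff support(XYS) exceeds theta."""
    return rule_support > theta

def is_closed(interventions, att_fn, eta):
    """X is closed iff every x in X has |ATT| > eta given the rest of X."""
    X = frozenset(interventions)
    return all(abs(att_fn(x, X - {x})) > eta for x in X)

print(is_closed({"statin", "fibrate"}, att, eta=0.01))              # both drugs contribute
print(is_closed({"statin", "fibrate", "diuretic"}, att, eta=0.01))  # diuretic adds too little
```

With these illustrative numbers, the two-drug combination is closed, while adding the third drug yields a set that is not closed, because that drug's estimated effect in the combination falls below η.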

4.2.2

Frequent Causal Pattern Mining Algorithm

We can now present our algorithm for causal pattern mining. At a very high level, the algorithm comprises two nested frequent pattern enumeration [30] loops. The outer loop enumerates subpopulation-defining itemsets S using items in $\mathcal{S}$, while the inner loop enumerates intervention combinations using items in $\mathcal{X} \setminus \mathcal{S}$. More generally, $\mathcal{X}$ and $\mathcal{S}$ can overlap, but we do not consider that case in this paper. Effective algorithms to this end exist [31, 32]; we simply use Apriori [2]. Once the patterns are discovered, the ATT of the interventions is computed using one of the methods from Section 4.3 and the frequent, effective patterns are returned. On the surface, this approach appears very expensive; however, several novel, extremely effective pruning strategies are possible and we describe them below.

Potential Outcome Support Pruning. Let X be an intervention k-itemset, S be a subpopulation-defining itemset, and let X and S be disjoint. Further, let X−i be an itemset that evaluates to ‘true’ iff all items of X except the ith are ‘true’ but the ith item is ‘false’. Using association rule mining terminology, all items in X except the ith are present in the transaction.

Definition 1 (Potential Outcome Support Pruning). We only need to consider itemsets X such that

$$\min\{\mathrm{support}(S, X),\ \mathrm{support}(S, X_{-1}),\ \ldots,\ \mathrm{support}(S, X_{-k})\} > \theta.$$

In order to be able to estimate the effect of x ∈ X in the subpopulation S, we need to have observations with x ‘true’ and also with x ‘false’ in S. Lemma 1. Potential Outcome Support Pruning is antimonotonic.

Proof: Consider a causal rule X → Y | S. If the causal rule X → Y | S is infrequent, then

$$\mathrm{support}(XS) < \theta \ \lor\ \exists i,\ \mathrm{support}(X_{-i}S) < \theta.$$

If X−i S has insufficient support, then any extension of it with an intervention item x will continue to have insufficient support, thus the rule Xx → Y | S will have insufficient support. Likewise, if XS has insufficient support, then any extension of it with an intervention item x will also have insufficient support.

Pruning based on Causal Effect. Proposition 1. The effective causal rule pruning condition is anti-monotonic. Rationale: To explain the rationale, let us return to the medical example, where X is a combination of drugs forming a treatment. Assuming that the effects of drugs are additive, if a causal rule X → Y | S is ineffective because ∃xi ∈ X such that

$$|\mathrm{ATT}(x_i \to Y \mid S, X \setminus x_i)| < \eta,$$

then a new rule Xxj → Y | S will also be ineffective, because |ATT(xi → Y | S, xj, X \ xi)| will remain below η. In the presence of positive interactions among the drugs (interactions that reinforce each other's effect), this statement may not hold true. Besides the statistical reasoning, one can question why a patient should receive a drug that has no effect in a combination.
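The support condition of Definition 1 can be sketched on a handful of transactions as follows. The transaction contents, the item names and the helper names `support` and `passes_posp` are assumptions made for illustration; the sketch only checks the condition for one candidate (S, X) pair and does not implement the full Apriori-style enumeration.

```python
def support(transactions, present, absent=frozenset()):
    """Fraction of transactions that contain every item in `present`
    and none of the items in `absent`."""
    hits = sum(1 for t in transactions
               if present <= t and not (absent & t))
    return hits / len(transactions)

def passes_posp(transactions, S, X, theta):
    """Potential Outcome Support Pruning check for subpopulation S and
    intervention set X: support(S, X) and every support(S, X_-i),
    where the i-th intervention is absent, must exceed theta."""
    S, X = frozenset(S), frozenset(X)
    supports = [support(transactions, S | X)]
    for x in X:
        supports.append(support(transactions, S | (X - {x}), absent=frozenset({x})))
    return min(supports) > theta

# Illustrative transactions: each is the set of items that are 'true'.
transactions = (
    [frozenset({"HTN", "statin", "fibrate"})] * 2
    + [frozenset({"HTN", "statin"})] * 3
    + [frozenset({"HTN", "fibrate"})] * 2
    + [frozenset({"HTN"})] * 2
    + [frozenset({"statin"})] * 1
)

# The smallest of the three supports is 2/10, so the candidate survives
# theta = 0.15 but is pruned at theta = 0.25.
print(passes_posp(transactions, {"HTN"}, {"statin", "fibrate"}, theta=0.15))
print(passes_posp(transactions, {"HTN"}, {"statin", "fibrate"}, theta=0.25))
```

Checking the supports with one intervention absent is what distinguishes this condition from ordinary Apriori support pruning: it ensures that, for each intervention, both treated and untreated observations exist in the subpopulation.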

4.3

Causal Estimation Methods

ATT, our metric of interest, with respect to a single intervention x in a subpopulation S is defined as

$$\mathrm{ATT}(x \to Y \mid S) = E[Y_1 - Y_0]_{S, X=1},$$

which is the expected difference between the potential outcome under treatment Y1 and the potential outcome without treatment Y0 in patients with S who actually received treatment. Since we consider treated patients, the potential outcome Y1 can be observed, but the potential outcome Y0 cannot. Thus at least one of the two must be estimated. The methods we present below differ in which potential outcome they estimate and how they estimate it. For the discussion below, we consider the variables X, Z, U and V from the causal graph in Figure 4. X is a single intervention; U, V and Z can be sets of items. For regression models, we will denote the matrix defined by U, V and Z in the subpopulation S as U, V and Z (same letter as the variable sets).

Counterfactual Confidence (CC). This is the simplest method. We simply assume that the patients who receive the intervention (X = 1) and those who do not (X = 0) do not differ in any important respect that would influence Y. Under this assumption, Y1 in the treated is simply the actual outcome in the treated and the potential outcome Y0 is simply the actual outcome in the non-treated (X = 0). Thus

$$\mathrm{ATT} = \mathrm{conf}((X = 1) \to Y \mid S) - \mathrm{conf}((X = 0) \to Y \mid S) = P(Y \mid S, X = 1) - P(Y \mid S, X = 0).$$

In the following, to improve readability, we drop the S subscript. All evaluations take place in the subpopulation S.
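The Counterfactual Confidence estimate is just a difference of two confidences, which the following minimal sketch computes on a toy subpopulation. The record layout (`x` for the intervention, `y` for the binary outcome) and the counts are assumptions for illustration; all records are taken to belong to the subpopulation S already.

```python
def confidence(records, x_value):
    """P(Y | X = x_value): outcome rate among patients with the given
    treatment status (all records are assumed to lie in S)."""
    outcomes = [r["y"] for r in records if r["x"] == x_value]
    return sum(outcomes) / len(outcomes)

def att_cc(records):
    """Counterfactual Confidence: P(Y | X=1) - P(Y | X=0)."""
    return confidence(records, 1) - confidence(records, 0)

# Illustrative data: 10 treated patients (4 with the outcome) and
# 10 untreated patients (2 with the outcome).
records = (
    [{"x": 1, "y": 1}] * 4 + [{"x": 1, "y": 0}] * 6
    + [{"x": 0, "y": 1}] * 2 + [{"x": 0, "y": 0}] * 8
)

print(round(att_cc(records), 3))
```

The estimate is only as good as the comparability assumption stated above; nothing in the computation adjusts for differences between the treated and untreated groups.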

Direct Adjustment (DA). We cannot estimate Y0 in the treated (X = 1) as the actual outcome Y in the untreated, because the treated and untreated populations can differ significantly in variables such as Z and V that influence Y. In Direct Adjustment, we attempt to directly remove the effect of V and Z by including them in a regression model. Since a regression model relates the means of the predictors to the mean of the outcome, we can remove the effect of V and Z by making their means 0. Let R be a generalized linear regression model, predicting Y via a link function g,

$$g(Y \mid V, Z, X) = \beta_0 + \beta_V V + \beta_Z Z + \beta_X X.$$

Then the (link-transformed) potential outcome under treatment is

$$g(Y_1) = \beta_0 + \beta_V V + \beta_Z Z + \beta_X$$

and the potential outcome without treatment is

$$g(Y_0) = \beta_0 + \beta_V V + \beta_Z Z.$$

The ATT is then

$$\mathrm{ATT} = E\left[g^{-1}(Y_1 \mid V, Z, X = 1)\right]_{X=1} - E\left[g^{-1}(Y_0 \mid V, Z, X = 0)\right]_{X=1},$$

where g^{-1}(Y1 | V, Z, X = 1) is the prediction for an observation with the observed V and Z but with X set to 1. The E(·)X=1 notation signifies that the expectations of the predictions are taken only over patients who actually received the treatment. DA has several advantages over CC. First, it can adjust for Z and V as long as the model specification is correct, namely the interaction terms that may exist among Z and V are specified correctly. Second, we get correct estimates even if we ignore U, because U is conditionally independent of Y given X. Unfortunately, this is only a theoretical advantage: we have to infer from the data whether a variable is a predictor of Y, and since U is marginally dependent on Y, we will likely adjust for U even if we do not need to.

Counterfactual Model (CM). In this technique, we build an explicit model for the potential outcome without treatment, Y0, using patients with X = 0. Specifically, we build a model

$$g(Y \mid V, Z, X = 0) = \beta_0 + \beta_V V + \beta_Z Z$$

and estimate the potential outcome as g(Y0 | V, Z) = g(Y | V, Z, X = 0).
The ATT is then

$$\mathrm{ATT} = P(Y \mid X = 1) - E\left[g^{-1}(Y_0 \mid V, Z)\right]_{X=1}.$$

Similarly to Direct Adjustment, the Counterfactual Model does not depend on U. However, in the case of the Counterfactual Model, we only consider the population with X = 0. In this population, U and Y are independent, thus we will not include U variables in the model.

Propensity Score Matching (PSM). The central idea of Propensity Score Matching is to create a new population, such that patients in this new population are comparable in all relevant respects and thus the expectation of the potential outcome Y0 in the treated equals the expectation of the actual outcome in the untreated. Patients are matched based on their propensity of receiving treatment. This propensity is computed with a logistic regression model that has treatment as the dependent variable:

$$\log \frac{P(X)}{1 - P(X)} = \beta_0 + \beta_V V + \beta_Z Z.$$

Patient pairs are formed such that, in each pair, one patient received treatment and the other did not, and their propensities for treatment differ by no more than a user-defined caliper difference ρ. The matched population has an equal number of treated and untreated patients and is balanced on V and Z, thus the patients are comparable in terms of their baseline risk of Y. Hopefully, the only factor causing a difference in outcome is the treatment. For estimating ATT, the potential outcome without treatment is estimated from the actual outcomes of the patients in the matched population who did not receive treatment:

$$\mathrm{ATT} = E[Y_1 - Y_0] = P(Y \mid X = 1, M) - P(Y \mid X = 0, M),$$

where M denotes the matched population. Among the methods we consider, propensity score matching most strictly enforces the patient comparability criterion; however, it is susceptible to misspecification of the propensity regression model, which can erode the quality of the matching.

Stratified Non-Parametric (SN). In the stratified estimation, we directly compute the expectation via stratification. The assumption is that the patients in each stratum are comparable in all relevant respects and only differ in the presence or absence of the intervention. In each stratum, we can estimate the potential outcome Y0 in the treated as the actual outcome Y in the untreated:

$$\mathrm{ATT} = E[Y_1 - Y_0]_{X=1} = \sum_{l} P(l \mid X = 1)\left[P(Y_1 \mid l, X = 1) - P(Y_0 \mid l, X = 1)\right] = \sum_{l} P(l \mid X = 1)\left[P(Y \mid l, X = 1) - P(Y \mid l, X = 0)\right],$$

where l iterates over the combined levels of V and Z. If we can identify the items that fall into U, then we can ignore them; otherwise, we should include them in the stratification as well. The stratified method makes very few assumptions and should arrive at the correct estimate as long as each of the strata is sufficiently large. The key disadvantage of the stratified method lies in the stratification itself: when the number of items across which we need to stratify is too large, we may end up dividing the population into excessively many small subpopulations (strata) and become unable to estimate the causal effect in many of them, thus introducing bias into the estimate.
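The stratified estimator described above can be sketched in a few lines. The record layout (`z` as the single stratification variable, `x` for the intervention, `y` for the binary outcome), the counts and the function name `att_stratified` are assumptions for illustration. Raising an error for a stratum without untreated patients is one possible design choice, not the authors' stated handling of small strata.

```python
from collections import defaultdict

def att_stratified(records, strata_keys):
    """Stratified non-parametric ATT:
    sum over levels l of P(l | X=1) * [P(Y | l, X=1) - P(Y | l, X=0)]."""
    strata = defaultdict(lambda: {1: [], 0: []})
    for r in records:
        level = tuple(r[k] for k in strata_keys)
        strata[level][r["x"]].append(r["y"])
    n_treated = sum(len(g[1]) for g in strata.values())
    att = 0.0
    for level, g in strata.items():
        if not g[1]:
            continue  # no treated patients here, so P(l | X=1) = 0
        if not g[0]:
            # Design choice of this sketch: refuse to estimate rather than
            # silently drop the stratum (which would bias the estimate).
            raise ValueError(f"no untreated patients in stratum {level}")
        weight = len(g[1]) / n_treated
        att += weight * (sum(g[1]) / len(g[1]) - sum(g[0]) / len(g[0]))
    return att

def make(n, z, x, y):
    """Return n identical observed records."""
    return [{"z": z, "x": x, "y": y} for _ in range(n)]

# Illustrative confounded data: sick patients (z=1) are treated more often
# and have a higher outcome rate regardless of treatment.
records = (
    make(20, 1, 1, 1) + make(20, 1, 1, 0) + make(7, 1, 0, 1) + make(3, 1, 0, 0)
    + make(1, 0, 1, 1) + make(9, 0, 1, 0) + make(8, 0, 0, 1) + make(32, 0, 0, 0)
)

naive = (21 / 50) - (15 / 50)  # P(Y | X=1) - P(Y | X=0), ignoring z
print("unstratified difference:", round(naive, 3))
print("stratified ATT:         ", round(att_stratified(records, ["z"]), 3))
```

On this toy data the unstratified difference is positive while the stratified estimate is negative, which illustrates how stratifying on a confounder can change the sign of the estimated effect.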

4.4

Results

After describing our data and study design, we present three evaluations of the proposed methodology. The first evaluation demonstrates the computational efficiency of our pruning methodologies, isolating the effect of each pruning method: (i) Apriori support-based pruning, (ii) Potential Outcome Support Pruning, and (iii) Potential Outcome Support Pruning in conjunction with Effective Causal Rule Pruning. In the second evaluation, we provide a qualitative assessment, looking at patterns involving statin. We attempt to use the extracted patterns to explain the controversial findings that exist in the literature regarding the effect of statin on diabetes. Finally, in order to compare the

treatment effect estimates to a ground truth, which does not exist for real drugs, we simulate a data set using proportions we derived from the Mayo Clinic data set.

Data and Study Design. In this study we utilized a large cohort of Mayo Clinic patients with data between 1999 and 2013. We included all adult patients (69,747) with research consent. The baseline of our study was set at Jan. 1, 2005. We collected lab results, medications, vital signs and status, and medication orders during a 6-year retrospective period between 1999 and the baseline to ascertain the patients' baseline comorbidities. From this cohort, we excluded all patients with a diagnosis of diabetes before the baseline (478 patients), patients with missing fasting plasma glucose measurements (14,559 patients), patients whose lipid health could not be determined (1,023 patients) and patients with unknown hypertension status (498 patients). Our final study cohort consists of 52,139 patients who were followed until the summer of 2013. Patients were phenotyped during the retrospective period. Comorbidities of interest include Impaired Fasting Glucose (IFG), abdominal obesity, hypertension (HTN; high blood pressure) and hyperlipidemia (HLP; high cholesterol). For each comorbidity, the phenotyping algorithm classified patients into three broad levels of severity: normal, mild and severe. Normal patients show no sign of disease; mild patients are either untreated and out of control or are controlled using first-line therapy; severe patients require more aggressive therapy. IFG is categorized into normal and prediabetic, the latter indicating impaired fasting plasma glucose levels that do not yet meet the diabetes criteria. For this study, progression to T2DM within 5 years from baseline (i.e., Jan. 1, 2005) was chosen as our outcome of interest. Out of 52,139 patients, 3,627 patients progressed to T2DM, 41,028 patients did not progress to T2DM and the remaining patients (7,484) dropped out of the study.
In Table 2 we present statistics about our patient population.

4.4.1

Pruning Efficiency

In our work, we proposed two new pruning methods. First, we have the Potential Outcome Support Pruning, which aims to eliminate patterns for which the ATT is not estimable. Second, we have the Effective Causal Rule Pruning, where we eliminate patterns that do not improve treatment effectiveness relative to the subitemsets. In Figure 5 we present the number of patterns discovered using (i) the traditional Apriori support based pruning, (ii) our proposed Potential Outcome Support Pruning (POSP), and (iii) POSP in conjunction with Effective Causal Rule Pruning (ECRP). The number of patterns discovered by POSP is strictly less than the number of patterns discovered by the Apriori pruning. POSP in conjunction with ECRP is very effective.

4.4.2

Statin

In this section, we demonstrate that the proposed causal rule mining methodology can be used to discover non-trivial patterns from the above diabetes data set. In recent years, statins, a class of cholesterol-lowering agents, have been prescribed increasingly. High cholesterol (hyperlipidemia) is linked to cardiovascular mortality and the efficacy of statins in reducing cardiovascular mortality is well documented. However, as evidenced by a 2013 BMJ editorial [4] devoted to this topic, statins are surrounded by controversy. In patients with normal blood sugar levels (labeled as NormFG), statins have a detrimental effect: they increase the risk of diabetes; yet in pre-diabetic patients (PreDM), they appear to have no effect. What we demonstrate below is that this phenomenon is simply disease heterogeneity.

First, we describe how this problem maps to the causal rule mining problem. Our set of interventions ($\mathcal{X}$) consists of statin and our subpopulation-defining variables ($\mathcal{S}$) consist of the various levels of HTN, HLP and IFG. Our interest is the effect of statin (x) on T2DM (Y) in all possible subpopulations S, $S \subset \mathcal{S}$. In this setup, HTN, which is associated with both hyperlipidemia (and statin use) and T2DM, is a confounder (Z). A cholesterol drug other than statin, say fibrates, is in the U category: it is predictive of statin use (patients on monotherapy who take fibrates do not take statins), but has no effect on Y, because its effect is already incorporated into the hyperlipidemia severity variables that defined the subpopulation. Variables that only influence diabetes but not statin use (say a diabetes drug) would fall into the V category. All subpopulations have variables that fall into Z and U and some subpopulations may also have V.

The HLP variable in Table 2 uses statin as part of its definition, thus we constructed two new variables. The first one is HLP1, a variable at the borderline between HLP-Normal and HLP-Mild, consisting of untreated patients with mildly abnormal lab results (these fall into HLP-Normal) and patients who are diagnosed and receive a first-line treatment (these fall into HLP-Mild). Comparability is the central concept of estimating causal effects and these patients are comparable at baseline. Similarly, we also created another variable, HLP2, which is at the border of HLP-Mild and HLP-Severe, again consisting of patients who are comparable in relevant aspects of their health at baseline.

Figure 5: Comparison of Pruning Techniques

| | T2DM Present | T2DM Absent |
|---|---|---|
| Total Number of Patients | 3627 | 41028 |
| Average Age | 44.73 | 35.58 |
| Male (%) | 51 | 41 |
| Female (%) | 49 | 59 |
| Patient Diagnosis Status (%) | | |
| NormFG | 42 | 84 |
| PreDM | 58 | 16 |
| Normal Obesity | 29 | 59 |
| Mild Obesity | 25 | 30 |
| Severe Obesity | 46 | 11 |
| Normal Hypertension | 48 | 69 |
| Mild Hypertension | 33 | 23 |
| Severe Hypertension | 19 | 8 |
| Normal Hyperlipidemia | 12 | 29 |
| Mild Hyperlipidemia | 72 | 64 |
| Severe Hyperlipidemia | 16 | 7 |
| Patient Medication Status (%) | | |
| Statin | 26 | 11 |
| Fibrates | 3 | 1 |
| Cholesterol.Other | 2 | 1 |
| Acerab | 17 | 7 |
| Diuret | 18 | 7 |
| CCB | 8 | 4 |
| BetaBlockers | 22 | 10 |
| HTN.Others | 1 | 1 |

Table 2: Demographics statistics of patient population

| S | CC | DA | CM | PSM | SN |
|---|---|---|---|---|---|
| PreDM | 0.145 | 0.022 | 0.010 | 0.022 | 0.017 |
| NormFG | 0.060 | 0.023 | 0.034 | 0.017 | 0.029 |
| HLP1 | 0.078 | 0.019 | 0.014 | 0.010 | 0.010 |
| HLP2 | 0.021 | -0.013 | -0.010 | -0.021 | -0.015 |
| PreDM, HLP1 | 0.067 | 0.018 | 0.021 | 0.004 | 0.002 |
| PreDM, HLP2 | 0.001 | -0.038 | -0.031 | -0.048 | -0.043 |
| NormFG, HLP1 | 0.043 | 0.020 | 0.015 | 0.014 | 0.013 |
| NormFG, HLP2 | 0.017 | -0.002 | -0.002 | -0.005 | -0.004 |

Table 3: ATT due to statin in various subpopulations S as estimated by the 5 proposed methods.

Table 3 presents the ATT estimates obtained by the various methods proposed in Section 4.3 for some of the most relevant subpopulations. Negative ATT indicates a beneficial effect and positive ATT indicates a detrimental effect. Counterfactual confidence (CC) estimates statin to be detrimental in all subpopulations. While statins are known to have a detrimental effect in patients with normal glucose levels [4], it is unlikely that statins are universally detrimental, even in patients with severe hyperlipidemia, the very disease they are supposed to treat. The results of DA, CM, PSM and SN are similar, with PSM and SN having larger effect sizes in general. The picture that emerges from these results is that patients with severe hyperlipidemia appear to benefit from statin treatment even in terms of their diabetes outcomes, while statin treatment is moderately detrimental for patients with mild hyperlipidemia. Bootstrap estimation was used to compute the statistical significance of these results. For brevity, we report the results only for PSM. The estimates are significant in the following subpopulations: NormFG, PreDM+HLP2 (p-values are