DATA MINING A DIABETIC DATA WAREHOUSE TO IMPROVE OUTCOMES AN ABSTRACT SUBMITTED ON THE TWENTY-FIRST DAY OF MARCH, 2002 TO THE DEPARTMENT OF HEALTH SYSTEMS MANAGEMENT OF THE SCHOOL OF PUBLIC HEALTH & TROPICAL MEDICINE OF TULANE UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF SCIENCE IN PUBLIC HEALTH BY
JOSEPH L. BREAULT, M.D.
APPROVED: PETER J. FOS, PH.D. CO-CHAIR
COLIN R. GOODALL, PH.D.
FRED PETRY, PH.D.
DOREEN BABO, DR.P.H.
JUDY OVERALL, J.D. CO-CHAIR
Abstract

Large databases are characteristic of today's information age, and many industries use them as the rich resources they are. Data mining, or knowledge discovery in databases, has become a key strategy in many industries for improving outputs and decreasing costs, but it has only recently been applied to healthcare management. We review its current status in healthcare, propose a method for applying it to transactional healthcare databases to improve outcomes, and apply that method to a diabetic data warehouse. This retrospective secondary data analysis uses classification and regression trees to identify key relationships, from which models are formulated to improve outcomes. Decision analysis trees are used to validate the improvements in outcomes. The perspective is that of the patient for outcomes and of the institution for cost savings. The subjects are diabetic patients at an urban/suburban vertically integrated healthcare system in the Gulf South with a majority of HMO members. The outcome measure is the number of areas in which cost savings of 5% or outcome improvements of 10% can be identified. Results show two areas where these outcome improvements can occur. The proposed method should be applicable to any healthcare database, though the models developed will vary.
DATA MINING A DIABETIC DATA WAREHOUSE TO IMPROVE OUTCOMES A DISSERTATION SUBMITTED ON THE TWENTY-FIRST DAY OF MARCH, 2002 TO THE DEPARTMENT OF HEALTH SYSTEMS MANAGEMENT OF THE SCHOOL OF PUBLIC HEALTH & TROPICAL MEDICINE OF TULANE UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF SCIENCE IN PUBLIC HEALTH BY
JOSEPH L. BREAULT, M.D.
APPROVED: PETER J. FOS, PH.D. CO-CHAIR
COLIN R. GOODALL, PH.D.
FRED PETRY, PH.D.
DOREEN BABO, DR.P.H.
JUDY OVERALL, J.D. CO-CHAIR
© Copyright 2002 by Joseph L. Breault, M.D. All rights reserved.
Acknowledgement

I thank my committee members. My Chair until he left Tulane on March 1, 2002 (and Co-Chair thereafter) was Dr. Pete Fos. He provided much support and coordination through the labyrinth of prospectus and dissertation red tape, in addition to decision sciences advice. Dr. Colin Goodall provided heroic statistical and theoretical guidance, sometimes on a week-to-week basis. Dr. Fred Petry gave me helpful insights into the data mining and database issues that occasionally confused me or blocked my way. Dr. Doreen Babo gave me detailed feedback and advice about the structure and format of the dissertation document that allowed it to take its proper shape. Ms. Judy Overall, though incredibly busy as Department Chair, found time to meet with me, review my dissertation, and act as Co-Chair after March 1, 2002.

I also thank my family for their patience. My 10-year-old son, Christopher, has spent a third of his life tolerating his Dad going to graduate school instead of having more time to be with him. My wife, Christine, put in endless hours of childcare while I was busy with studies and papers.

Finally, many thanks to my colleagues at work, especially in the Information Services Department. Their help with data extraction was critical to this study.
Foreword

My first thought of applying data mining techniques to transactional healthcare data came at the 1999 Primary Care Research Methods & Statistics Conference, where Dr. James Sinacore presented "An Introduction to Classification and Regression Trees for Revealing Complex Relationships in Large Data Sets." I was amazed as I saw the possibilities of the CART software for healthcare data, and others at the Interface 2000 Conference gave me more insights into this methodology. At my own institution, an extensive diabetic registry had been developed a few years earlier. It was an ideal opportunity to put the theory to the data, and I decided to do so for my dissertation. The missing piece was a statistician who could help me, but I needed someone already expert in transactional healthcare data mining to guide me. After dozens of inquiries I realized this person would be difficult to find. At the July 2000 NCAR Conference on Statistics in Large Data Sets, Dan Carr recommended I track down Colin Goodall. I did, and the following month he agreed to help; we began weekly phone calls about the data.
Contents

Acknowledgement iv
Foreword v
1 Introduction 1
1.1 Introduction to the problem . . . 1
1.1.1 The data mining revolution . . . 2
1.1.2 The philosophy behind data mining . . . 3
1.1.3 Data mining in healthcare . . . 9
1.1.4 Diabetes mellitus . . . 11
1.1.5 Diabetes data warehouse . . . 13
1.2 Statement of the problem . . . 13
1.3 Research objectives: hypotheses to be tested . . . 15
1.4 Limitations . . . 17
1.5 Definitions of terms . . . 19
2 Literature Review 22
2.1 Introduction . . . 22
2.2 General data mining . . . 22
2.2.1 Introduction . . . 22
2.2.2 The goal: to do business better . . . 23
2.2.3 Many data mining methods . . . 24
2.2.4 Each method has a data preparation process . . . 26
2.2.5 Summary . . . 27
2.3 Data mining in the healthcare literature . . . 28
2.3.1 Introduction . . . 28
2.3.2 Outcome improvements . . . 29
2.3.3 Managed care and cost savings . . . 31
2.3.4 Surveillance . . . 32
2.3.5 Summary . . . 33
2.4 Data mining diabetic datasets . . . 33
2.4.1 Introduction . . . 33
2.4.2 Diabetic registries . . . 34
2.4.3 Case-based reasoning . . . 37
2.4.4 Pima Indians—machine learning algorithms . . . 37
2.4.5 Polish study—rough sets . . . 42
2.4.6 Australian study—data transformation . . . 43
2.4.7 Singapore study—interaction with domain experts . . . 46
2.4.8 Summary . . . 48
2.5 Classification and regression trees in healthcare data mining . . . 49
2.5.1 Introduction . . . 49
2.5.2 Cardiovascular . . . 49
2.5.3 Cancer . . . 50
2.5.4 Infectious disease . . . 51
2.5.5 Others . . . 52
2.5.6 CART applications in diabetes . . . 52
2.5.7 Summary . . . 53
3 Methods 54
3.1 Introduction . . . 54
3.2 Study site . . . 55
3.3 Population . . . 56
3.4 Tools and instruments . . . 56
3.5 Study procedure . . . 59
3.6 Treatment of the data . . . 66
3.6.1 Target variables . . . 66
3.6.2 Predictor variables . . . 70
3.6.3 CART settings . . . 72
3.6.4 Manager and clinician survey analysis . . . 73
3.6.5 Methods to decide outcomes improvement . . . 74
3.7 Presentation of the data . . . 74
3.7.1 The data mining process . . . 74
3.8 Knowledge discovery items . . . 75
3.9 Decision analysis results . . . 79
3.10 Final test on virgin data . . . 80
4 Results 83
4.1 Introduction . . . 83
4.2 Variable extraction and epidemiology . . . 84
4.2.1 Inclusion and exclusion criteria . . . 84
4.2.2 Data mining data table variables . . . 86
4.2.3 Diabetes epidemiology . . . 100
4.3 Data mining results . . . 103
4.3.1 Introduction . . . 103
4.3.2 Glycemic control . . . 106
4.3.3 Emergency department visits . . . 119
4.3.4 Hospitalizations and deaths in hospital . . . 123
4.3.5 Charges . . . 129
4.3.6 Medical quality index . . . 132
4.3.7 Summary . . . 134
4.4 Knowledge discovery . . . 138
4.4.1 Introduction . . . 138
4.4.2 Younger age predicts HbA1c >9.5 . . . 138
4.4.3 Outpatient access does not prevent ER use . . . 139
4.4.4 Renal disease predicts diabetic hospital deaths . . . 140
4.4.5 Results already known strengthen validity . . . 142
4.4.6 Summary . . . 144
4.5 Evaluation of discovered knowledge . . . 144
4.5.1 Introduction . . . 144
4.5.2 Glycemic control . . . 144
4.5.3 Emergency department visits . . . 147
4.5.4 Hospital deaths . . . 148
4.5.5 Summary . . . 150
4.6 Final combined model to improve outcomes . . . 151
4.6.1 Introduction . . . 151
4.6.2 The final combined model . . . 151
4.6.3 Results of the model on learning and test sets . . . 152
4.6.4 Final test of the model on “virgin” data . . . 159
4.6.5 Benefits of random sampling . . . 162
4.6.6 Summary . . . 165
4.7 Local management and clinicians . . . 165
5 Discussion 170
5.1 Introduction . . . 170
5.2 Validity . . . 170
5.2.1 Introduction . . . 170
5.2.2 Internal validity . . . 170
5.2.3 External validity . . . 175
5.2.4 Summary . . . 176
5.3 Conclusions about the research hypotheses . . . 176
5.4 Standard statistical approaches . . . 177
5.4.1 Introduction . . . 177
5.4.2 Logistic regression . . . 178
5.4.3 Applications to observational data sets . . . 181
5.4.4 Applications to prospective trials . . . 182
5.4.5 Summary . . . 183
5.5 Limitations . . . 184
5.5.1 Introduction . . . 184
5.5.2 Limitations summarized . . . 184
5.5.3 Decision analysis interventions . . . 185
5.5.4 Survey limitations . . . 186
5.5.5 Post-analysis insights into limitations . . . 186
5.5.6 Summary . . . 187
5.6 Future directions . . . 188
5.6.1 Introduction . . . 188
5.6.2 Relational vs flat files problem of losing information . . . 188
5.6.3 Integration with biostatistical techniques . . . 192
5.6.4 Squashing data . . . 195
5.6.5 Summary . . . 196
5.7 Final conclusions . . . 196
A Diabetic Data Warehouse Structure 199
A.1 Introduction . . . 199
A.2 Overall structure . . . 200
A.2.1 Organizational chart . . . 200
A.2.2 Diabetes registry maintenance and process . . . 200
A.2.3 Process specifications . . . 202
A.3 Administrative variables . . . 207
A.4 Clinic variables . . . 208
A.5 Hospital variables . . . 208
A.6 Laboratory variables . . . 208
A.7 Medication variables . . . 208
B CART Software 217
B.1 Features . . . 217
B.1.1 Surrogate splitters intelligently handle missing values . . . 219
B.1.2 Adjustable misclassification penalties avoid errors . . . 220
B.1.3 Alternative splitting criteria . . . 220
C Data Mining Software 222
D DATA Software 226
D.1 Features . . . 226
D.1.1 Healthcare decision making with DATA . . . 226
D.1.2 Cost-effectiveness analysis and more . . . 227
E Literature Searching Methodology 230
F Manager and Clinician Survey 233
G Institutional Review Board 234
Bibliography 237
References 237
List of Tables

1.1 Research objectives . . . 16
1.2 Research hypotheses . . . 16
2.1 Healthcare data mining applications . . . 29
2.2 Variables in the PIDD . . . 38
3.1 Conceptualization of method used in 3 phases . . . 55
3.2 Data mining data table . . . 61
3.3 Gold standard measures of diabetic quality care . . . 69
4.1 Data mining data table: 58 variables, 15,393 observations . . . 86
4.2 Percent of diabetics on medications . . . 102
4.3 HbA1c Av95 predicted vs. actual results (includes drugs) . . . 108
4.4 HbA1c Av95 predicted vs. actual results (excludes drugs) . . . 108
4.5 HbA1c Av80 predicted vs. actual results (includes drugs) . . . 110
4.6 HbA1c Av80 predicted vs. actual results (excludes drugs) . . . 110
4.7 HbA1c Av70 predicted vs. actual results (includes drugs) . . . 113
4.8 HbA1c Av70 predicted vs. actual results (excludes drugs) . . . 113
4.9 Hospital deaths predicted vs. actual results (excludes ER) . . . 127
4.10 Further breakdown of Monotherapy data . . . 147
4.11 Percentage saying the new knowledge items were new or useful . . . 167
5.1 Types of validity from Table 6.2 of Iezzoni (1997) . . . 171
5.2 Differences between CART and logistic regression . . . 194
A.1 Test codes used to indicate diabetes . . . 203
A.2 Key value results that indicate diabetes . . . 204
A.3 Administrative variables . . . 208
A.4 Clinic variables . . . 209
A.5 POS CODE . . . 210
A.6 Selected provider service descriptions . . . 210
A.7 Selected provider type descriptions with counts . . . 211
A.8 Hospital variables . . . 211
A.9 Hospital variables, Dx subtable . . . 212
A.10 Hospital variables, procedures subtable . . . 212
A.11 DISCHARGE STATUS codes . . . 213
A.12 Selected HOSPITAL SERVICE codes . . . 214
A.13 HOSP FIN CLASS descriptions . . . 214
A.14 Laboratory variables . . . 215
A.15 Laboratory variables, Dx subtable . . . 215
A.16 Medication variables . . . 216
A.17 Medication variables, categories subtable . . . 216
A.18 Medication variables, classes subtable . . . 216
C.1 Data mining categories . . . 222
C.2 Classification software types . . . 223
C.3 Decision tree software, free . . . 223
C.4 Decision tree software, commercial . . . 224
C.5 Rule based approaches, free . . . 224
C.6 Rule based approaches, commercial . . . 225
E.1 Conferences on data mining and knowledge discovery . . . 232
List of Figures

1.1 The rapid growth of diabetes in the United States . . . 12
3.1 Sample CART diagram formatted using allCLEAR-part A . . . 76
3.2 Sample CART diagram formatted using allCLEAR-part B . . . 77
3.3 Pre-intervention glycemic control outcome . . . 80
3.4 Post-intervention glycemic control outcome . . . 81
4.1 Classification tree for HbA1c Av95 including drug variables . . . 107
4.2 Classification tree for HbA1c Av95 excluding drug variables . . . 109
4.3 Classification tree for HbA1c Av80 including drug variables . . . 111
4.4 Classification tree for HbA1c Av80 excluding drug variables . . . 112
4.5 Classification tree for HbA1c Av70 excluding drug variables . . . 114
4.6 Classification tree for HbA1c Av70 including drug variables . . . 115
4.7 Regression tree for HbA1c AdjSlopeREGRESSdrug . . . 116
4.8 Classification tree for HbA1c Av without drug variables . . . 117
4.9 Classification tree for HbA1c Av with drug variables . . . 118
4.10 Classification tree for ERbin2 . . . 121
4.11 Classification tree for ERbin5 . . . 122
4.12 Regression tree for ER . . . 123
4.13 Classification tree for whether ever hospitalized . . . 125
4.14 Classification tree for whether ever hospitalized, excluding ER . . . 126
4.15 Classification tree for hospital death, excluding ER . . . 128
4.16 Regression tree for number of hospitalizations, excluding ER . . . 129
4.17 Classification tree for ChargesBin5 . . . 130
4.18 Regression tree for Charges . . . 131
4.19 Classification tree for MQIbin2 . . . 133
4.20 Classification tree for MQIbin2 without equation variables . . . 135
4.21 Classification tree for MQIbin5 without equation variables . . . 136
4.22 Regression tree for MQI without equation variables . . . 137
4.23 Pre-intervention glycemic control outcome . . . 145
4.24 Post-intervention glycemic control outcome . . . 146
4.25 Epidemiology of renal disease and glycemic control, n = 10,240 . . . 149
4.26 Post intervention hospital mortality, n = 10,240 . . . 151
4.27 Epidemiology of the learning & test set population . . . 153
4.28 Post-intervention final model—test & learning sets . . . 155
4.29 Post-intervention final model, lower bound—test & learning sets . . . 157
4.30 Post-intervention final model, upper bound—test & learning sets . . . 158
4.31 Epidemiology of the “virgin” set population . . . 160
4.32 Post-intervention final model—virgin set . . . 161
4.33 Post-intervention final model, lower bound—virgin set . . . 163
4.34 Post-intervention final model, upper bound—virgin set . . . 164
A.1 Organizational chart of the diabetic data warehouse . . . 200
B.1 CART’s main modeling window . . . 218
B.2 CART’s model of optimal tree . . . 219
B.3 CART’s detail screen . . . 221
D.1 DATA software screenshot, cost effectiveness . . . 227
D.2 DATA software screenshot, Markov . . . 228
Chapter 1
Introduction

1.1 Introduction to the problem

To provide context for the research question, literature review, and methods, this section briefly introduces the key background material. First, the data mining revolution is reviewed, along with a brief history of data mining. Second, the philosophy behind data mining is discussed, since there are traditional concerns about data dredging that many find distracting. Third, the reasons for the new focus on data mining in healthcare are discussed, along with the issues involved in successfully applying data mining technologies to healthcare. Fourth, diabetes epidemiology and the clinical aspects important to data analysis are briefly reviewed, since the data set used in this research concerns diabetes. Fifth, the environment of the particular diabetic data warehouse used in this study is described. After these brief introductions of the context, the chapter turns to the problem statement, the research objectives, the delimitations of the problem, and the definitions of terms.
1.1.1 The data mining revolution

Whether it is MIT's Technology Review magazine billing it as one of the 10 emerging technologies that will change the world (Waldrop, 2001) or Barclay's Bank claiming it as a key element in fighting fraud, data mining is hot and in the news (Rogers, 2001). Data mining has become the key to increasing profits in web marketing through the mining of e-commerce data (Edelstein, 2001) and in the bridal industry at WeddingNetwork.com (DeYoung, 2001), and one futurist claims it is the key to most future profits, which will be found in the selling of information: "Amazon.com just rescinded its 'privacy' statement, letting customers know it will be participating in the new gold rush" (Wacker, 2001).

As impressive as the data mining revolution has been, it has its naysayers. Fortune magazine's article "Working in the Data Mines" claims that data mining produces more questions, and more information needs to answer those questions, with the goal of keeping data miners fully employed—gainfully or otherwise (Schrage, 1999). Others have pointed out that real data sets have the potential for error (Hand, 1999): "Almost always, medical data are contaminated, distorted, corrupted" (Hand, 2000). Thus the discovery of unexpected structures in the data might be due to errors in the data.

The history of the term "data mining" goes back to the 1960s, when digital computers were first applied to data analysis problems. If one searched enough using the new computer technology, a well-fitted model could always be found, but it might be quite complex, might not represent any true characteristics of the data structures, and might not be useful in other data sets from similar populations. This process of obtaining an over-fitted model was called data mining, data dredging, data snooping, or data fishing, and the terms had a negative connotation. In the early 1990s computer scientists adopted the term data mining for their algorithmic
and database-oriented methods of searching for new patterns and structure in data, often quite massive data. The emphasis was no longer on inference and estimation; the methods were applied retrospectively to observational data. They did not involve experimental design questions but rather focused on algorithms and computational issues in data analysis (Smyth, 2000). The scientific and financial success of these methods across many fields quickly brought a positive connotation to the term. The term "knowledge discovery in databases" is often used interchangeably with data mining, and that usage is adopted here.
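The over-fitting that gave "data mining" its original bad name is easy to reproduce. The sketch below is illustrative only and not drawn from this study; the sample sizes and variable names are invented. It correlates 200 pure-noise "predictors" with a pure-noise target: the best of them typically reaches an |r| in the neighborhood of 0.5, which would look convincing if reported as the result of a single pre-specified test.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient, standard library only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(42)
n_obs, n_candidates, alpha = 30, 200, 0.05

# A "target" of pure noise: there is nothing real to discover.
target = [random.gauss(0, 1) for _ in range(n_obs)]

# Data dredging: try many noise predictors, keep the best-correlated one.
best = max(
    abs(pearson_r([random.gauss(0, 1) for _ in range(n_obs)], target))
    for _ in range(n_candidates)
)

# Chance of at least one nominally "significant" result among 200
# independent tests at alpha = 0.05: 1 - 0.95**200, essentially 1.
fwer = 1 - (1 - alpha) ** n_candidates

print(f"best |r| found by searching: {best:.2f}")
print(f"family-wise error rate:      {fwer:.5f}")
```

With n = 30 observations the null standard deviation of r is about 1/√29 ≈ 0.19, so a search over 200 candidates is expected to turn up a seemingly strong correlation by chance alone.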
1.1.2 The philosophy behind data mining

The term data mining may carry negative connotations for some, who presume it means data dredging. Because an understanding of this issue is an important preliminary to this study, this section reviews the philosophy behind data mining. The traditional research approach avoids trolling through data to come up with a model:

...the term data mining was used rather derogatorily, to denote a search for the best fitting model or the most significant hypothesis by trying a large number of models or hypotheses on the same body of data, not worrying about the requirement of reproducibility of the obtained results. It is this requirement, together with the required diversity of data mining tools, that cause the still low acceptance of medical data mining (Holena, Sochorova, & Zvarova, 1999).

Instead, one should pick the hypothesis one wants to study and then conduct one's study to collect data that support or do not support the hypothesis. But as an editorial in a special issue of a journal dedicated to data mining in finance put
it: "In traditional statistics, there has been a focus on testing hypotheses against the data rather than discovering new hypotheses from the data" (Srivastava & Weigend, 1997). Breiman recently wrote about the two cultures in the statistical community, the data modeling culture (98% of statisticians) and the algorithmic modeling culture (2% of statisticians): "Algorithmic models can give better predictive accuracy than data models, and provide better information about the underlying mechanism" (2001).

Epistemology and cognitive theory indicate that this anti-data mining attitude rests partly on a faulty understanding of how traditional research arrives at initial hypotheses, and partly on a lack of awareness of newer techniques to evaluate data mining models. There has also been recent progress in understanding the theoretical framework for data mining (Mannila, 2000). As Lonergan's theory of knowledge (Lonergan, 1957) outlines, our knowing process has three stages. First, experience: the intelligent subject-researcher takes in a great variety of experiences, facts, data, and images over years of training and work. Second, understanding: the researcher mulls over (using his or her brain-computer) what he or she knows (the various healthcare databases within and outside the self) to produce a Eureka-like understanding that is formulated into a hypothesis to study. Third, judgment: the researcher performs studies on the hypothesis to show that the understanding is correct or incorrect. The first two stages, experience and understanding, are in fact data mining within the human subject to develop hypotheses. This is what Archimedes was doing in the baths of Syracuse when he suddenly yelled "Eureka!" and ran out naked. When King Hiero asked him to develop a model to predict whether baser metals had been added to the gold in the votive crown, he was data mining his experiences
of all the methods that could be used and why they might or might not work. Nothing seemed satisfactory until the tension of inquiry was suddenly relieved as the principles of displacement and specific gravity unexpectedly concretized into weighing the crown in water, just as he was lying in the baths (Lonergan, 1957). Human subjects use their brain-computers and the limited inputs from the various databases they have assimilated during their experiences. As McDonald, Brossette, and Moser (1998) put it:

...Traditional statistics offer us methods to confirm or reject a hypothesis. However, traditional statistics cannot lead to discovering patterns that we do not already suspect. In other words, in traditional statistics, the search for useful relationships in the data (knowledge) is based on the expectations of those generating hypotheses. Results, therefore, are the verifications, or lack thereof, of the suspicions of the investigators. These methodologies do not offer us a way to discover hidden patterns in data, resulting in answers only to the questions that are asked.

Another assumption that contributes to an anti-data mining attitude is a limited view of statistics. Most researchers are trained in the traditional statistical analysis of supporting or rejecting a null hypothesis with various tests and p-values. If 100 correlations are tested instead of 1, then the actual significance of a finding is not the individual α = 0.05 but a much larger α that may call the significance of the finding into question. As one commentator on early healthcare data mining efforts put it:

...unconstrained search raises some important concerns from a statistical viewpoint. The issue of multiple testing is a real concern when searching through a large set of potential hypotheses in an automated fashion since there is a nonzero probability that some nonexistent
association will appear significant just by chance. The probability of incorrectly accepting such a spurious hypothesis rises as more and more hypotheses are tested (Smyth, 2000).

To avoid these problems (e.g., comparing multiple means using multiple t-tests), statisticians use multiple comparison tests, ANOVAs, or multivariate or logistic regression analysis. On the model level, if one "intelligently" picks a model to be tested based on theory, without having looked at other models, then the statistical analysis with its resultant p-values or confidence intervals is believed valid. However, if 100 models are pre-tested and the best-fitting one then used, this is problematic for a number of reasons: (1) the model did not derive from a theory; rather, a theory was created to explain the model; (2) although the resultant statistical analysis calculates exact p-values or confidence intervals, these may be meaningless for reasons akin to the multiple comparison t-test problem above; (3) the data have been misused by being used twice, once in pre-testing models and again in testing the final model selected, which then proves nothing. Nevertheless, even in traditional approaches, examining many models against the data before developing one's hypothesis is often done, even if infrequently reported.

An interesting panel at a recent data mining conference addressed "data snooping, dredging and fishing: the dark side of data mining" (Jensen, 2000). The panelists acknowledged potential pitfalls, but also reviewed solutions that correct for the statistical effects of searching large model spaces. These solutions include:

• Dividing the data into two subsamples, so the model developed from half the data is then tested on virgin data in the other subsample to get an unbiased score.
• Using cross validation when the process for identifying a best model is algorithmic.

• Using Šidák, Bonferroni, and other adjustments, although these can be restrictive in their assumptions.

• Using resampling and randomization techniques, such as a bootstrap approach.

Newer statistical techniques based on Monte Carlo and bootstrapping estimations can calculate the significance of the model findings from data mining. In White’s recent article entitled “A Reality Check for Data Snooping” (2000), he proposes methods for precisely this and concludes:

Data snooping occurs when a given set of data are used more than once for purposes of inference or model selection. When such data reuse occurs, there is always the possibility that any satisfactory results obtained may simply be due to chance rather than to any merit inherent in the method yielding the results. Our new procedure...provides simple and straightforward procedures for testing the null that the best model encountered in a specification search has no predictive superiority over a given benchmark model, permitting account to be taken of the effects of data snooping (p. 1115).

An interesting application is to the calendar effects in stock returns. The stock data appear to show a strong pattern of stock returns depending on the day of the week or other calendar timings. However, when one uses a bootstrapping approach to account for the intensive search through many models to arrive at the calendar model, the significance falls apart. “Although nominal p-values of individual calendar rules are extremely significant, once evaluated in the context of
the full universe from which such rules were drawn, calendar effects no longer remain significant” (Sullivan, Timmermann, & White, 1998).

There has been a large movement for data mining in a variety of businesses in order to save money and be more efficient stewards of the large databases that modern industries require. The literature and work in data mining are growing rapidly, almost exponentially. As one recent article (Brodley, Lane, & Stough, 1999) explained it:

One of the most important parts of a scientist’s work is the discovery of patterns in data. Yet the databases of modern science are frequently so immense that they preclude direct human analysis. Inevitably, as their methods for gathering data have become automated, scientists have begun to search for ways to automate its analysis as well. Over the past five years, investigators in a new field called knowledge discovery and data mining have had notable successes in training computers to do what was once a unique activity of the human brain.

This knowledge discovery process “transforms data into knowledge” (Cios, 2000). The philosophy of data mining is to take advantage of these large data sets by simply doing what the human subject has always done in the scientific method, but with two improvements: (1) using the full dataset that has become too large for any individual to come to know, and (2) honestly acknowledging all the models that have been examined and discarded (something human subjects are often not explicit or honest about) and factoring this into the evaluation of the significance of the final model(s).
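The data-snooping corrections reviewed above can be made concrete in a small simulation. This is an illustrative sketch only: the outcome data, the 100 candidate "rules," and the permutation counts are all invented, and the simple permutation test below stands in for the stationary bootstrap that White's reality check actually employs. The point it demonstrates is that the best of 100 rules scored against pure noise looks impressive until its score is compared with the null distribution of the *best* score, which prices in the search itself.

```python
import random

random.seed(0)
n, n_rules = 200, 100

# Pure-noise outcome: by construction, no rule genuinely predicts it.
y = [random.gauss(0, 1) for _ in range(n)]
# 100 random binary "rules" (think: candidate calendar effects).
rules = [[random.randint(0, 1) for _ in range(n)] for _ in range(n_rules)]

def score(rule, outcome):
    """Absolute difference in mean outcome when the rule fires vs. not."""
    on = [o for r, o in zip(rule, outcome) if r]
    off = [o for r, o in zip(rule, outcome) if not r]
    if not on or not off:
        return 0.0
    return abs(sum(on) / len(on) - sum(off) / len(off))

# The "discovered" result: the best-scoring rule out of 100.
best = max(score(r, y) for r in rules)

# Null distribution of the BEST score: shuffle the outcome, rescan all
# 100 rules, keep the maximum. This accounts for the specification search.
null_best = []
for _ in range(200):
    perm = y[:]
    random.shuffle(perm)
    null_best.append(max(score(r, perm) for r in rules))

p = sum(b >= best for b in null_best) / len(null_best)
print(f"best rule score {best:.3f}, search-adjusted p = {p:.2f}")
```

With noise data, the search-adjusted p-value is typically unimpressive even though the best rule's naive score looks strong, which is exactly the calendar-effects result quoted above.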
1.1.3
Data mining in healthcare
From the moment of birth to the signing of the death certificate, data are collected at almost every contact of each individual with providers of healthcare in the United States (and many other countries). These data include administrative, demographic, health status, clinical, pharmaceutical use, and financial details. Increasingly, data are abstracted from written records, or entered directly at a workstation, into an extensive health information system (Goodall, 1999). Clinical and financial health care data are massive. Uniform billing data (UB92) is collected nationwide for discharges, and is in the tens of millions of records per year. Clinical data collected at any given institution is massive. The National Academy of Sciences convened a conference on Massive Data Sets in 1995, and the presentation on healthcare noted that “massive applies in several dimensions...the data themselves are massive, both in terms of the number of observations and also in terms of the variables...there are tens of thousands of indicator variables coded for each patient” (Goodall, 1995). And then one multiplies this by the number of patients in the United States, which is virtually the same as the population, namely hundreds of millions. All healthcare institutions have a large database. Various parts may be mandated, such as billing systems to collect insurance payments. These may include diagnoses and demographics of patients.
Additional mandated parts
may show compliance with the Health Plan Employer Data and Information Set (HEDIS, see http://www.ncqa.org/Programs/HEDIS/index.htm) and other health outcomes monitoring guidelines. Each healthcare institution also has some unique databases that have been developed through local needs. Commonly these include diagnostic testing results and physician profiling information. In
some locations this may also include elements of an electronic medical record or associated insurance databases when the institution also sponsors an insurance company. All of these parts of the database may be assembled together in an organized way, sometimes called a data warehouse (Babcock, 1996), or may remain separate. Usama Fayyad, speaking about corporate databases in general when he joined Microsoft in 1996 to organize a data mining research group, said: “It was pretty sad. In many companies, the ‘data warehouses’ were actually ‘data tombs’: the data went in and were never looked at again” (Waldrop, 2001).

Healthcare institutions want to improve outcomes. The last decade has been increasingly focused on saving money, downsizing staff, and becoming competitively “lean and mean.” The goal is to stay afloat and either make a profit or provide savings for non-profits or public institutions. This is occurring in an environment demanding health care cost containment, or even decreases. At the same time, there is a push for quality outcomes and meeting standards coming from both regulators and insurance companies. Improving outcomes while also saving money has become the major challenge at many healthcare institutions. Many methods have been used to improve outcomes, or at least maintain outcomes as costs are shaved. These include benchmarking, disease management programs, pharmacy formulary cost management, provider profiling to target outliers, cost-effectiveness analyses, etc. The majority have eliminated waste, trimmed personnel, and saved money without damaging outcomes. Many institutions are still working on implementing some of these methods that have become a standard foundation for cost containment and outcomes management.
1.1.4
Diabetes mellitus

In order to understand the context of this study’s research question, literature search, and methodology, some initial discussion is warranted on the population of diabetics in general and the specific diabetes dataset being used. More details will be provided in Chapter 3. Diabetes mellitus occurs when the body’s capacity to metabolize sugar is inadequate. In type 1 diabetes (typical of young, thin people) the cause is that the pancreas does not produce enough insulin, and so insulin injections are required. In type 2 diabetes (typical of older, obese people) the cause is that metabolic demands are too high given the patient’s obesity, and the tissues have become insulin resistant. Often these adult-onset type 2 diabetics can be managed with a variety of oral medications, or even on diet alone. The vast majority of diabetic patients are type 2. This brief review is oversimplified, and some understanding of diabetes will be presumed. A complete explanation of the disease will not be presented in this paper, though various aspects will be reviewed when needed to understand the variables.

The population of diabetic patients is important in healthcare for a number of reasons. It is large: 15.7 million people in 1998, or 5.9% of the population of the United States, had diabetes, including 8.2% of those 20 and older and 18.4% of those 65 or older. And the number is increasing rapidly, as seen in Figure 1.1. The economic impact is impressive, with 1997 estimates of $44 billion in direct medical costs and $54 billion in indirect costs, for an annual economic cost of about $100 billion. If cost savings can be achieved in diabetes, this can have a significant impact on health care spending (Centers for Disease Control, 1998). The burden of suffering of diabetes, the seventh leading cause of death in the United States, is sadly even more impressive. Death certificates indicate that
diabetes contributes to 193,000 deaths annually, but this is vastly under-reported. Diabetes is the leading cause of new cases of blindness in adults aged 20–74 (as many as 24,000 become blind annually from diabetes), of end-stage kidney disease (33,000 diabetics start dialysis annually), and of leg amputations not related to injury (86,000 annually). Diabetic patients are 2 to 4 times more likely to have a heart attack or stroke than a non-diabetic patient. Close to two-thirds of diabetic patients also have hypertension. The social effects are massive, with disability among diabetics 2 to 3 times higher than among non-diabetics (Songer, 1995), congenital malformation rates up to 10% in the 18,000 deliveries annually to women with preexisting diabetes if preconception care is not provided, and deaths of newborns at rates 2 to 3 times higher than average for pregnancies among women with diabetes (Centers for Disease Control, 2001).

Figure 1.1: The rapid growth of diabetes in the United States

Diabetes has been well studied, and many of the complications can be prevented. Early detection and proper treatment of diabetes can prevent up to 90% of blindness, and at least 50% of dialysis and amputations (Centers for Disease Control, 2001). Because of the economic and clinical impact of the disease, a great deal of energy has been put into guidelines, best practices, optimization of care, and other management methods to improve outcomes. In short, all the low-hanging fruit has already been picked.
1.1.5
Diabetes data warehouse

The importance of developing a good data warehouse to prepare for data mining has been accepted in many industries. It has even been recognized in surgery (Tusch, Muller, Rohwer-Mensching, Heiringhoff, & Klempnauer, 2000). This section reviews the diabetic registries that are precursors of a diabetic data warehouse. Many countries, states, and local hospitals or clinics have a diabetes registry to identify and track patients with diabetes. The South has the highest incidence of diabetes per 1,000 (35.3) compared with other regions (24.9–26.5) (Adams, Hendershot, & Marano, 1999). The most recent report indicates that at least 365,000, or 8.4%, of Louisiana residents 20 years and older have diabetes. In Health Care State Rankings for 2000, Louisiana was second worst in the nation in health indicators, and worst in the nation in diabetes death rate (38.7 deaths per 100,000 population) (Hood, 2001). The State’s 1999 data tables show a diabetes death rate of 68.9 for the City of New Orleans (LADHH, 2000). With this high prevalence and mortality, it may be the ideal location in the country from which to choose a diabetic data warehouse to investigate. The diabetic data warehouse from the New Orleans area has complete data starting 1/1/1998, includes data on 31,696 diabetic patients, and is detailed in Appendix A.
1.2
Statement of the problem

Can the healthcare institution’s database be used to improve outcomes in ways that go beyond the traditional management methods?
There are
new software data mining tools that various industries are using to mine their
databases and come up with new ways of improving their industry-specific outcomes. Data mining has great potential for healthcare institutions, since much of their data have been underutilized in the cost containment and outcomes management methods used to date. Strategists even proclaim that “data mining and analysis is likely to prove as vital a tool to medics in this new century as the stethoscope was in the last” (Milley, 2000) and that a physician’s laptop will become the new “black bag” of medicine (Kiel, 2000). Although there have been early attempts at mining clinical databases for useful hypotheses (Blum, 1982; Walker & Blum, 1986), such work has remained exceptional. As Prather, Lobach et al. (1997) noted, “Data warehousing and mining techniques have rarely been applied to health care.” Part of the reason is how far the healthcare industry lags behind others in operations technology: “Health care organizations invest only 2 percent of their total spending on [information] technology, compared with an overall industry average of 10 percent” (Kalis, 2000).

The literature review in Chapter 2 details what has been done with data mining in healthcare, especially in the area of diabetes, and what has not yet been done. This dissertation will focus on the problem of how to apply data mining technologies to a specific healthcare database. The data mining software used is CART, described in detail in Appendix B. The healthcare database used is a diabetic data warehouse detailed in Appendix A and described below. The methods and procedures used will be described in Chapter 3. Successful application of data mining technologies to transactional healthcare data requires: (1) preparing a healthcare database for input into data mining software to avoid GIGO (garbage in, garbage out); this will require data transformations from a relational data warehouse to a data mining data
table that is usable by data mining tools; (2) selecting and skillfully applying the appropriate data mining software; (3) intelligently sifting through the software output to prioritize the new areas that will provide the most cost savings or outcomes improvement, and using these to construct a model for intervention. The problem addressed in this study is how the CART data mining technology can be applied to the diabetic data warehouse examined here, and whether the results can be used to improve outcomes in novel ways not already used in traditional management and clinical interventions.
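The first requirement, transforming relational warehouse records into a flat data mining table, can be sketched as follows. The field names, tests, and values are hypothetical; a real implementation would run against the Oracle warehouse rather than an in-memory list, but the shape of the transform (many transaction rows per patient in, one feature row per patient out) is the same.

```python
from collections import defaultdict

# Hypothetical transactional rows as they might come out of a relational
# warehouse: one row per lab result, many rows per patient.
transactions = [
    {"patient_id": 1, "test": "HbA1c", "value": 8.2},
    {"patient_id": 1, "test": "HbA1c", "value": 7.4},
    {"patient_id": 1, "test": "LDL",   "value": 130.0},
    {"patient_id": 2, "test": "HbA1c", "value": 6.1},
]

# Group results by patient, then by test.
patients = defaultdict(lambda: defaultdict(list))
for row in transactions:
    patients[row["patient_id"]][row["test"]].append(row["value"])

# Flatten to one row per patient: counts and means become columns that a
# tree-based data mining tool can consume directly.
mining_table = []
for pid, tests in sorted(patients.items()):
    rec = {"patient_id": pid}
    for test, values in tests.items():
        rec[f"{test}_n"] = len(values)
        rec[f"{test}_mean"] = round(sum(values) / len(values), 2)
    mining_table.append(rec)

print(mining_table[0])
# {'patient_id': 1, 'HbA1c_n': 2, 'HbA1c_mean': 7.8, 'LDL_n': 1, 'LDL_mean': 130.0}
```

Note that the choice of summary columns (counts, means, most recent value, and so on) is itself a modeling decision, which is part of why data preparation is described later as a creative step rather than a mechanical one.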
1.3
Research objectives: hypotheses to be tested

The research objectives are listed in Table 1.1. This study involves exploratory data analysis of the diabetic data warehouse. Given this exploratory or descriptive focus, the research objectives in Table 1.1 may be more appropriate than formal hypotheses. Nevertheless, there are three formal hypotheses that will be tested in this study, and they are listed in Table 1.2. The first hypothesis will be proved or disproved by examining any discovered knowledge against national standards to see if it meets the definition of “new knowledge” below. This is discussed in more detail in Chapter 3. The second hypothesis will be tested by questioning managers and clinicians in the institution about whether they perceive the data mining results as something new to them, and whether they think this knowledge is useful in their work. The instrument to be used for this is in Appendix F. The results will be reported separately for managers and clinicians. The majority of the responses received will determine whether the hypothesis is proved or disproved for each of the two groups.
Table 1.1: Research objectives

1. To understand the medical problem domain well enough to do efficient data mining. An overview of diabetes is in the introduction above. When needed to understand variables or modeling, additional information will be reviewed.

2. To understand the data well enough to do efficient data mining. A brief overview of the diabetic data warehouse is given in the introduction above. Appendix A has a full listing of its variables. Chapters 3 and 4 have a detailed review of many of the variables in the data.

3. To prepare the data by converting the data warehouse fields into the data mining data table fields. The transformations needed here are discussed further in Chapters 3 and 4.

4. To apply the CART data mining software to the data mining data table using appropriate target variables for outcomes. These results are in Chapter 4.

5. To evaluate the discovered knowledge. This will involve sifting through the data mining results to find novel associations that can be used by management or clinicians to improve outcomes. These results are discussed in Chapter 4.

6. To outline how to use the discovered knowledge to improve outcomes. Interventions are modeled in a decision analysis tree to estimate the impact on outcomes in Chapter 4.
Table 1.2: Research hypotheses

1. H0: The CART data mining software cannot be applied to this diabetic data warehouse to discover new knowledge.
   HA: The CART data mining software can be applied to this diabetic data warehouse to discover new knowledge.

2. H0: Managers or clinicians will not find this new knowledge useful.
   HA: Managers or clinicians will find this new knowledge useful.

3. H0: This new knowledge cannot be used to improve outcomes.
   HA: This new knowledge can be used to improve outcomes.
The third hypothesis will be proved or disproved by modeling potential changes in practice or interventions based on the discovered knowledge in decision analysis software (DATA 3.5). This is discussed in more detail in Chapter 3.
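The computation that decision analysis software performs on such a model reduces to expected-value rollback of a tree. A toy sketch follows; every probability and cost below is invented for illustration and is not drawn from the study data or from DATA 3.5.

```python
# Each strategy branches into chance outcomes: (probability, cost).
# Rollback means computing the expected cost of each strategy and
# preferring the minimum.
strategies = {
    "usual care":   [(0.20, 9000.0),    # hospitalization
                     (0.80, 1000.0)],   # routine care only
    "intervention": [(0.10, 9000.0 + 300.0),   # hospitalization + program cost
                     (0.90, 1000.0 + 300.0)],  # routine care + program cost
}

def expected_cost(branches):
    """Probability-weighted cost of one strategy's chance node."""
    assert abs(sum(p for p, _ in branches) - 1.0) < 1e-9
    return sum(p * c for p, c in branches)

for name, branches in strategies.items():
    print(f"{name}: expected cost ${expected_cost(branches):,.0f}")
```

In this invented example the intervention dominates despite its added program cost because it halves the hospitalization probability; a real analysis would of course use probabilities and costs estimated from the warehouse and the literature.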
1.4
Limitations

These data were obtained for purposes other than research, and this is a limitation. Clinicians will be aware that billing codes are not always precise and accurate, though our use of them in a comorbidity index below should be robust in this regard. Epidemiologists and clinicians will be aware that important predictors of diabetic outcomes are missing from the database, such as BMI, family history of diabetes, time since onset of diabetes, and diet and exercise habits. These variables were not electronically stored, and obtaining them would require going to the paper chart and patient interviews. This study is completely limited to the data in the diabetic data warehouse. Transformed or summary data may be inaccurate unless a careful understanding of where the data come from is applied. For example, BMI might be approximated by an ordinal labeling of patients as 0, 1, or 2 based on whether obesity billing (ICD9) codes are missing, there is an obesity code, or there is a morbid obesity code. However, obesity is not usually a billable diagnosis, so clinicians often do not list it or do so sporadically. This is one of the limitations of a database initially collected for billing, and highlights the need to understand not only the data set but how the data were collected, why they were collected, and the motivations of those doing the collecting. This will sometimes direct one to avoid some variables as unreliable (e.g., BMI via obesity codes) and to consider other variables robust (e.g., the comorbidity index below).
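The ordinal obesity approximation just described might be coded as follows. The ICD9 codes 278.00 and 278.01 are used here as the obesity and morbid obesity codes, an assumption about the coding scheme rather than a validated mapping, and, as the text cautions, the resulting variable would still be unreliable.

```python
def obesity_ordinal(icd9_codes):
    """Ordinal obesity proxy from a patient's billing codes:
    0 = no obesity code, 1 = obesity code, 2 = morbid obesity code.
    Assumes 278.01 denotes morbid obesity and other 278.0x codes obesity."""
    codes = set(icd9_codes)
    if "278.01" in codes:
        return 2
    if any(c.startswith("278.0") for c in codes):
        return 1
    return 0

print(obesity_ordinal(["250.00", "278.01"]))  # 2
print(obesity_ordinal(["250.00", "278.00"]))  # 1
print(obesity_ordinal(["250.00"]))            # 0
```

The sketch makes the limitation concrete: a patient whose clinician never billed an obesity code is indistinguishable from a patient of normal weight, which is why the text recommends avoiding this variable.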
One of the confounding or mediating variables may be the maturity of the healthcare system. In a very mature system like Aetna, virtually all known disease management and other cost savings systems may already have been utilized, so that data mining will have a smaller group of useful relationships to find than in a less mature system. This study will be using national standards rather than local ones for the definition of new knowledge (see definitions below and Chapter 3).

External validity may be an issue. The subjects are from one large health system in the Gulf South. Whether relationships found and tested in the model will hold for other groups is unclear. This model could be applied to diabetic patients elsewhere to see if the model is generally applicable. If this is promising, then through publications this may become a commonly applied model in the United States. Ideally, the universe would be all diabetic patients in the United States. However, a random, representative sample of this universe would require access to data not available to us. Hence, an incremental approach is appropriate. The evidence of success of the first study would be needed to gain access to broader data and collaborators. This study is limited to this diabetic data warehouse alone. While the external validity of the models generated may be questioned, the method itself should be applicable anywhere.

This study is limited to the CART data mining software for knowledge discovery. The CART software has proved useful in many clinical data mining settings, and produces diagrams and splitting rules that have been easily understood by clinicians, as noted in the literature search in Chapter 2. Data mining software can be expensive, and the CART software is available to this researcher from a previously funded project. Other data mining software might arrive at additional discovered knowledge that CART cannot.
This is a limitation of
this study. However, the literature review in Chapter 2 will point out that this argument may not be valid, as CART has generally performed as well as or better than other data mining software.
1.5
Definitions of terms

This section defines and operationalizes the terms in the research hypotheses stated above. “Data mining is the process of extracting previously unknown, valid, and actionable information from large databases and then using the information to make crucial business decisions” (Cabena, Hadjinian, Stadler, Verhees, & Zanasi, 1998). Data mining, used here interchangeably with the term knowledge discovery in databases (KDD), is the process in this study of using the CART software to discover new knowledge. CART software is the classification and regression tree software program standardized and sold by Salford Systems (CART® for Windows® Version 4.0 ©
Salford Systems 1990–2000). The license used has a preprocessor workspace of
1,000,000 and a tree building workspace of 2,000,000. This software originated in the work of Breiman, Friedman, Olshen, and Stone (1984). New knowledge (also referred to in this study as the knowledge discovery items) is defined as associations or relationships among the variables in the data mining data table that are new and useful to outcomes improvement. Therefore, one needs to know what has already been done so that it is excluded from the relationships one is trying to identify. Rather than doing this locally, this study chose to use a national standard for this. Clinical associations are operationalized by their presence or absence in one of the major textbooks: DeGroot & Jameson (DeGroot, Jameson, & Burger, 2001), the 4th edition of a 3-volume definitive textbook of endocrinology with pages 654–967 devoted to diabetes, or
Ellenberg & Rifkin (Porte, Sherwin, Ellenberg, & Rifkin, 1997), the 5th edition of a 1372-page book on diabetes. Joslin’s book (Joslin, Kahn, & Weir, 1994) is not used since it is out of date, and the new edition is not due till the end of June 2002, which is too late for this study. Managerial associations are operationalized by their presence or absence in the national guidelines for diabetes care management (AMA, JCAHO, & NCQA, 2001), the American Diabetes Association’s website, or a major clinic’s implementation of a diabetes disease management program (Friedman et al., 1998). This latter clinic’s implementation of a diabetes management program won a $4.4 million grant from CMS in October 2001 for further studies, and is a good representation of what is available in the best circumstances. A copy of this clinic’s diabetes management program materials had been made available to this researcher. Note that the operationalization of the term “new knowledge” is based on national standards and not on the local institution in which this study is being done. The local institution’s perception of what is new will be evident in the survey in Appendix F.

Diabetic data warehouse refers to the specific diabetes data described in Appendix A and residing in an Oracle data warehouse in the Gulf South institution that this study is based in. Managers refer to the senior and mid-level management of the institution that owns the diabetic data warehouse. Some of the senior management are also physicians, which may confound the difference between managers and clinicians. Clinicians refer to the primary care family practitioners, internists, and mid-level providers (physician assistants and nurse practitioners in those two departments) of the institution that owns the diabetic data warehouse. Outcomes will be partly determined by the results of the data mining, and the specific outcomes are not specified ahead of time. In this paragraph the range
of things that outcomes may mean is operationalized, but the specific outcomes of interest are left open pending results. Outcomes may mean the usual HEDIS measures referred to elsewhere, such as glycemic control and the frequency of testing or exams specified in national standards (AMA et al., 2001). It might also mean a medical quality index based on some combination of these. Outcomes may also mean reduced costs, hospitalizations, or emergency department visits. The term “to improve outcomes” will be operationalized to mean either an improvement of 10% in medical outcomes (glycemic control, HEDIS measures, or a medical quality index) or a 5% reduction in costs, hospitalizations, or emergency department visits. The reason for this distinction is that the latter reductions have such an immediate impact on finances that a reduction of 5% would be of great interest and very meaningful to managers. The former outcomes need a greater effect size to interest clinicians in the value of changes that might bring about these outcome improvements.
Chapter 2

Literature Review

2.1 Introduction

This review will focus on three concentric circles of literature as it goes
from general data mining, to healthcare data mining, to data mining diabetic datasets. During the discussion the following should become clearer: the purpose and goals of data mining in general, the limited applications to healthcare that have occurred, the use of CART software in healthcare data mining, what has and has not been done in data mining diabetic datasets. Within this context, how this research study will contribute to transactional healthcare data mining in diabetes will be outlined. The searching methodology used is outlined in Appendix E.
2.2 General data mining

2.2.1 Introduction

In this section the lessons learned are: that the goal of data mining is business success, that there are many methods data mining can use, that all the methods require a careful data preparation process, and that the data preparation process may be a creative one of constructing data mining variables from the raw database fields.
2.2.2
The goal: to do business better

A group from IBM (Cabena et al., 1998) quote Aristotle Onassis, “The
key to success in business is to know something that nobody else knows.” This idea is at the heart of what data mining is trying to achieve in any business. Data mining, by definition, “is the process of extracting previously unknown, valid, and actionable information from large databases and then using the information to make crucial business decisions” (p. 12). Authors in the investment field (Dhar & Stein, 1997) quote Gordon Gekko from the movie Wall Street: “I know of no commodity more valuable than information.” One of the interesting case studies is called “pattern directed data mining of point of sale data,” which describes how the A. C. Nielsen division of the Dun and Bradstreet corporation overcame the competition in the early 1990s. Using a rule-based system with the RETE algorithm doing pattern matching, they developed the SPOTLIGHT program. This produced formatted reports showing unexpected changes in market share and surpassed the ability of their competitors. It was further evolved into Opportunity Explorer, which could produce customized reports on the fly with a user-friendly Windows interface. The goal is to create useful knowledge from the data and present it in a form that gives your business an advantage.

The data mining process is aimed at discovering information without a previously formulated hypothesis, and in this sense is different from the traditional research protocol that starts with what you want to prove or disprove. Data analysis is often applied to supporting or not supporting a hypothesis that
a manager or researcher has. Data mining goes beyond this to answer questions that traditional techniques cannot, and so has the potential of far greater business value to the company. Where traditional data analysis can provide answers to closed questions (Will customers buy more electronics if they are discounted?), data mining can provide answers to open-ended questions when you do not even know what the variables are (How can one get customers to buy more electronics?).

The book Mastering Data Mining (Berry & Linoff, 2000) focuses on applications to marketing, sales, and customer support. One tip they give is that “Interactive response times (in the 3-5 second range) are a requirement for convincing business users to exploit data on a daily basis.” The goal is not just to have experts use data mining methods and then present the results to others, but for all who will benefit to have a user-friendly interface to quickly get the insights they need from the data available. Witten and Frank (2000) review the theory of automatically extracting models (decision trees, rules, and linear models) from data, validating the models, and applying the models in practice. In their view, “data mining is about solving problems by analyzing data already present in databases” (p. 3). If one wants to apply data mining technology to an area, one needs to be clear on what problem is to be solved.
2.2.3
Many data mining methods

Adriaans and Zantinge (1996) write out of their experience as directors
of Syllogic, a data and systems management company that created one of the first commercial data mining applications for KLM airlines. They highlight the diversity of data mining:
Data mining is not so much a single technique as the idea that there is more knowledge hidden in the data than shows itself on the surface. From this point of view, data mining is really an ‘anything goes’ affair. Any technique that helps to extract more out of your data is useful, so data mining techniques form quite a heterogeneous group. Although various different techniques are used for different purposes, those that are of interest in the present context are:

• Query tools
• Statistical techniques
• Visualization
• Online analytical processing (OLAP)
• Case-based learning (k-nearest neighbor)
• Decision trees
• Association rules
• Neural networks
• Genetic algorithms.

Fayyad (1996) discusses the common data mining methods: (1) decision trees and rules, (2) nonlinear regression and classification methods, (3) example-based methods such as nearest neighbors, (4) probabilistic graphical dependency models, and (5) relational learning models, also known as inductive logic programming (pp. 17-22). Edelstein (1999), in addition to the common models above, also discusses multivariate adaptive regression splines (MARS), memory-based reasoning, generalized additive models, and boosting. One survey detailed 43 data mining software
tools (Goebel & Gruenwald, 1999), and Appendix C has a brief listing of data mining software available as of January 2002. More than one method may be needed in a project: McLeish, Yao, Garg, and Stirtzinger (1991) describe their experience with discovery analysis of a very large clinical database, and their conclusion that different tools are needed in different parts of the database to maximize discovery.
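To make the decision tree family in the list above concrete, here is a minimal sketch of the split-selection step at the heart of CART-style trees: scan candidate thresholds on one numeric variable and keep the one that minimizes the weighted Gini impurity of the resulting child nodes. The toy data are invented (HbA1c-like values against a hypothetical bad-outcome flag), and real CART adds much more, including recursive partitioning, pruning, surrogate splits, and cross-validation.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def best_split(xs, ys):
    """Scan all thresholds on one numeric variable; return the threshold
    minimizing the weighted Gini impurity of the two child nodes."""
    best_t, best_g = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        g = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g

# Toy data: hypothetical HbA1c-like values vs. a bad-outcome flag.
xs = [5.1, 5.8, 6.2, 9.0, 9.5, 10.1]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))  # (6.2, 0.0): a perfect split at x <= 6.2
```

The resulting splitting rule ("x <= 6.2 goes left") is exactly the kind of clinician-readable output that makes tree methods attractive for healthcare data.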
2.2.4
Each method has a data preparation process

Weiss and Indurkhya (1998) present a practical guide to the statistical
evaluation of “big data.” They focus on data preparation, data reduction, and solution methods for data mining. They point out that data mining research has emphasized prediction methods or database organization. The preparation of the data is sometimes dismissed as a topic too mundane for extensive research. In the real world of data mining applications, the situation is reversed. More effort is expended preparing data than applying a prediction program to data. On an interesting historical note, they believe that the early philosophical roots for data mining were in Tukey’s classic text, Exploratory Data Analysis (1977). Data Preparation for Data Mining (Pyle, 1999) highlights the essential rule of computing: GIGO (garbage in, garbage out). Therefore, care must be taken to properly prepare the data before they are fed into the data mining software. Although this somewhat technical area is not flashy, it is essential to successful data mining. The author clarifies the difference between exploratory data analysis (“a statistical practice that involves using a wide variety of single-variable and
multivariable analysis techniques to search for the underlying systemic relationships between variables”) and data mining. Whereas the former focuses on discovering the basic nature of the underlying phenomena, the latter focuses tightly on the practical application of results. Others have also pointed out that data preparation takes the most time in data mining projects (Wegman, 2001). Duda, Hart, and Stork (2001) point out that the “problem” of missing data may actually be additional information to take advantage of: Sometimes the fact that an attribute is missing can be informative. For instance, in medical diagnosis, the fact that an attribute (such as blood sugar level) is missing might imply that the physician had some reason not to measure it. As such, a missing attribute could be represented as a new feature and could be used in classification (p. 411). The data preparation process is not simply a mechanical cleaning and reduction of the data. As Duda highlights, some thought must go into the meaning of the variables. In addition to the blood sugar level in his example, another variable may be constructed of whether the lab value is missing or present. The data preparation process is a creative one, where data mining variables may be created from other raw variables in the database.
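Duda's point can be made concrete with a small sketch. The patient records and field names below are hypothetical; the idea is only that a missingness indicator becomes a constructed feature alongside the raw value:

```python
# Sketch: encoding missingness as an additional feature, per Duda's point.
# Patient records and the "blood_sugar" field name are hypothetical.
patients = [
    {"id": 1, "blood_sugar": 142.0},
    {"id": 2, "blood_sugar": None},   # physician chose not to order the test
    {"id": 3, "blood_sugar": 98.0},
]

prepared = []
for p in patients:
    measured = p["blood_sugar"] is not None
    prepared.append({
        "id": p["id"],
        # 0.0 is only a placeholder; the indicator column carries the signal
        "blood_sugar": p["blood_sugar"] if measured else 0.0,
        "blood_sugar_measured": int(measured),  # new constructed feature
    })

print(prepared[1])
```

The classifier then sees two columns where the warehouse had one, and the pattern "test not ordered" becomes available for splitting.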
2.2.5 Summary

In this general data mining review, key points are that the goal of data
mining is business success, that there are many methods data mining can use, that all the methods require a careful data preparation process, and that the construction of data mining variables from the raw database fields is a process that demands both creativity and intelligence.
2.3 Data mining in the healthcare literature

2.3.1 Introduction

Surprisingly, traditional medical informatics textbooks do not cover this
area. The new edition of a classic in medical informatics (Shortliffe, Perreault, Wiederhold, & Fagan, 2000) did not even have data mining or knowledge discovery in its index! Yet a recent medical informatics article looking at 10 challenges listed “mining data for new medical knowledge” as one of them (Altman, 1997). The literature on data mining applications to medicine and healthcare can be categorized as: medical diagnostics, outcome improvements, managed care and cost savings, surveillance, and genetics. Zupan, Lavrac, and Keravnou (1999), in their editorial on data mining in medicine, note that “large collections of medical data are a valuable resource from which potentially new and useful knowledge can be discovered through data mining.” The purpose is to “gain insight into the relationships and patterns hidden in the data” (p. 1). Lavrac (1999) selects the data mining techniques that are most useful in medicine. She divides machine learning methods into three major groups: inductive learning of symbolic rules (induction of rules, decision trees, logic programs), statistical or pattern recognition methods (k-nearest neighbors or instance-based learning, discriminant analysis, and Bayesian classifiers), and artificial neural networks. McDonald et al. (1998) review the use of data mining applied to pathology information systems. They point out “the amount of undiscovered knowledge in medical databases is potentially very large.” They list the items shown in Table 2.1 as the potential applications for data mining in health care. Data mining and machine learning in healthcare have been used most extensively in genetics (molecular sequence analysis), imaging systems in radiology, and
diagnostics.

Table 2.1: Healthcare data mining applications
1. Surveillance
   a. Outcomes
   b. Epidemiology
2. Clinical practice guidelines/critical pathway evaluation
   a. Identify important decision variables
   b. Evaluate efficiency of current guidelines
   c. Predictive modeling
   d. Deriving rule-based systems
3. Imaging
   a. Classification
4. Molecular sequence analysis

This section will concentrate on three of the applications of data mining that are of special interest for healthcare managers: outcome improvements, cost savings in managed care, and surveillance.
2.3.2 Outcome improvements

Selecting and reporting what is interesting: the KEFIR application to healthcare data (Matheus, Piatetsky-Shapiro, & McNeill, 1996) describes the automatic discovery of deviations in databases, and how to determine the “interestingness” of a deviation from the point of view of cost savings or other benefits. Their software “performs an automatic drill-down through data along multiple dimensions to determine the most interesting deviations of specific quantitative measures relative to norms and previous values. It then explains key deviations through their relationships to other deviations in the data, and, where appropriate, generates simple recommendations for actions in response to deviations.” Hedberg (1995) describes how a Los Angeles hospital used the IDIS data mining software to discover subtle factors affecting success and failure in back surgery. Prather et al. (1997) reviewed their steps at Duke University in using an
extensive clinical database of obstetrical patients to identify factors that contribute to perinatal outcomes, and how data mining improved birth outcomes (Goodwin et al., 1997). Their database contained close to 72,000 patient records and 5,000 variables per patient. Despite advances in medicine, preterm births have remained at 8-12% in the United States for the past three decades. The authors believe that early results support their hope that data mining techniques may offer innovative solutions for identifying women at risk for preterm birth. Eriksen, Turley, Denton, and Manning (1997) introduce data mining techniques as an important tool for the development of nursing knowledge and knowledge structures. Using an Access database of nursing activities at a regional medical center in Houston, they applied a data mining tool (KnowledgeSEEKER from Angoss Software). They concluded, “The use of data mining software has generated a number of surprises and insights. The use of the data mining software gave a mechanism for understanding a complex data set which had not been amenable to analysis using traditional statistical analysis” (p. 388). Improving diagnosis can improve outcomes. Using machine learning and data mining techniques on medical records of 9,714 pregnant women, one study found a rule to predict emergency C-sections. There was a 60% chance of an emergency C-section if 3 factors were present: (1) there were no previous vaginal deliveries, (2) there was an abnormal 2nd trimester ultrasound, and (3) there was malpresentation at admission. This “if-then” rule culled from historical data can help guide physicians to improve outcomes (Mitchell, 1999). Another study looked at 2,730 ECGs in an emergency department, of which 517 were in patients with infarcts.
Hybrid data mining techniques were used to predict an infarct with 76.6% accuracy (Burn-Thornton & Edenbrandt, 1998), giving useful lead time to physicians before standardized cardiac enzyme tests can be returned.
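The mined C-section rule above translates directly into executable form. The sketch below is illustrative only: the function name and parameter names are invented, and the 0.60 probability is simply the figure reported in the text:

```python
# A hedged sketch of the mined "if-then" rule reported by Mitchell (1999).
# Names and the 0.60 probability encoding are illustrative only.
def emergency_csection_risk(no_prior_vaginal_delivery: bool,
                            abnormal_2nd_trimester_ultrasound: bool,
                            malpresentation_at_admission: bool) -> float:
    """Return the rule's predicted probability of an emergency C-section."""
    if (no_prior_vaginal_delivery
            and abnormal_2nd_trimester_ultrasound
            and malpresentation_at_admission):
        return 0.60  # all three risk factors present
    return float("nan")  # the rule is silent when its conditions are not met

print(emergency_csection_risk(True, True, True))  # -> 0.6
```

This is the appeal of rule induction for clinicians: the discovered knowledge is a readable conditional, not an opaque model.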
2.3.3 Managed care and cost savings

GTE’s KEFIR system zeroes in on interesting health care trends, using
them to suggest cost-saving interventions (Hedberg, 1995). Bigus (1996), in Data Mining with Neural Networks, discusses data mining in detecting fraudulent insurance claims from both patients and providers, and in identifying the most cost-effective providers. Borok (1997) writes about data mining use in managed care modeling. In this health care context, he states that data mining software should be able to:
• Determine which factors control resource usage and rank them.
• Uncover atypical utilization patterns of providers within a group.
• Drill down into specific cases directly from the resource or provider in question.
• Predict resource utilization for new cases based on your group’s current practices.
• Compute per member per month (PMPM) capitation rates and perform what-if scenarios for contract negotiation, modeling carve-outs, and fine-tuning cost and quantity limits.

Kaiser Permanente implemented an enterprise-wide data warehouse in 1994. Mining this data warehouse has discovered knowledge that saved them many millions of dollars. They found that physicians were using three drugs to treat a certain disorder. One drug was expensive and the most commonly prescribed. The other two were inexpensive. They also found side effect profiles were lowest on one of the inexpensive drugs. Treatment for the disorder was standardized, millions of dollars per year were saved on drug costs, and the patients had fewer
side effects (Hollis, 1998). The data from automated pharmaceutical dispensing cabinet systems in hospitals have been mined to reduce waste among the pill counts. The University of Utah Hospital started using these systems in early 1999 and in about 6 months had saved $64,000 in inventory alone (Tabar, 1999).
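The PMPM capitation rate in Borok's list is a simple quotient, and the what-if scenarios amount to re-running it under different assumptions. A minimal sketch, with invented cost and enrollment figures:

```python
# Sketch: per member per month (PMPM) rate, as in Borok's list of
# managed care modeling tasks. All figures below are invented.
def pmpm(total_cost: float, member_months: int) -> float:
    """Total plan cost divided by total member-months of enrollment."""
    return total_cost / member_months

# What-if scenario: 1,000 members enrolled 12 months, $2.4M total cost
rate = pmpm(2_400_000.0, 1_000 * 12)
print(f"${rate:.2f} PMPM")  # -> $200.00 PMPM
```

Contract negotiation then varies the inputs (carve-outs remove cost categories from the numerator; quantity limits cap utilization) and compares the resulting rates.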
2.3.4 Surveillance

Brossette and his group at the University of Alabama (1998) looked at
data mining in hospital infection control and public health surveillance. They concluded that their data mining surveillance system (DMSS) based on association rules was “efficient and effective in identifying new, unexpected, and interesting patterns in surveillance data” (p. 373). They describe their refinements to DMSS, which monitors emerging infections and antimicrobial resistance, and demonstrated that DMSS could identify potentially interesting and previously unknown patterns (Moser, Jones, & Brossette, 1999). The DMSS software tool recently crunched 11 months of lab data from a 600-bed hospital to identify 41 suspected outbreaks, 97% of which were confirmed as hospital acquired by subsequent expert chart review. During the same time, the hospital’s infection control team had flagged only 9 suspected outbreaks, with a 33% accuracy rate. This has attracted the CDC’s interest as a possible national surveillance method for detecting foodborne infectious disease outbreaks (Kreuze, 2001). A group in Italy has developed a system of data collection and mining to perform a similar function in monitoring nosocomial infections. Their system can identify critical situations on both a hospital unit level (contagion events) and the individual level (unexpected antibiotic resistance of a bacterium). Finding a bacterium of the same strain on another patient in the same hospital within 30
days identifies a contagion event if the bacterium is resistant to more than two antibiotics (Lamma, Manservigi, Mello, Storari, & Riguzzi, 2000). A national cancer surveillance system using data mining techniques has been outlined and proposed by a group at the University of Maryland (Forgionne, Gangopadhyay, & Adya, 2000). It would detect cancer patterns in populations, formulate models to explain the patterns, and evaluate the efficacy of treatments and interventions. Surveillance for difficult airways is usually done by rule-based airway systems (36% sensitivity), or the Mallampati test and Wilson Risk-Sum.

The number of patients exceeds 5,000 for the 3-fold cross validation method this study will use. “Sample” may be an inaccurate term to use. This data mining study uses 100% of the patients in the database. Although many patients may be excluded by our continuity selection criteria, there is no sampling per se. Instead, every patient in the universe of this study is used, limited only by our continuity requirement, which is not a sampling process but an exclusion-criteria process.
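The 3-fold cross validation referred to above can be sketched in a few lines: every patient is used, each fold serving once as the held-out test set. This is a generic illustration, not the study's actual fold assignment:

```python
# Sketch of 3-fold cross validation: deterministic fold assignment,
# every patient used, no sampling. Patient IDs below are stand-ins.
def three_fold_splits(ids):
    folds = [ids[i::3] for i in range(3)]  # round-robin assignment
    for k in range(3):
        test = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, test

patient_ids = list(range(9))  # stand-in for the de-identified ID field
for train, test in three_fold_splits(patient_ids):
    assert sorted(train + test) == patient_ids  # every patient appears once
print("3 folds, all patients used")
```

Because each patient appears in exactly one test fold, the procedure validates models on the full universe of included patients rather than on a sample.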
3.4 Tools and instruments

The data will come from the institution’s diabetic data warehouse; its
structure is detailed in Appendix A. Since the data warehouse was created by the institution, there is no data collection needed. Although the data were already collected in the diabetic data warehouse, significant creativity is required to process them into a data mining data table. The tools used to access the data are SQL statements. User-friendly front ends for SQL statements are used whenever possible, namely Oracle Discoverer version 3.1 or DBMS COPY version 7.0.5. Both of these software packages directly
access the diabetic data warehouse through a network connection at the institution that owns it. When needed, an IS person at that institution assists with more complicated SQL statements that the user-friendly front ends cannot handle. The result of this process is the transformation of the relational data warehouse into a data mining data table. The output process to obtain the data mining data table also used Microsoft Excel 2000 and Access 2000. SPlus 6 is used to do a comprehensive descriptive analysis of the variables, including density distribution graphs of continuous variables. The data mining data table is then opened in the CART software tool, where the data mining process occurs. The results of this process are the knowledge discovery items that appear to be new knowledge and potentially useful for improving outcomes. CART’s diagrammatic output is in a format that another software package, allCLEAR, uses to produce better splitting diagrams than CART can.

CART warrants further discussion since its use is the central part of this research. There are many data mining methods, as noted in the literature review above and in Appendix C. This study uses the technique of classification and regression trees to data mine the diabetic data warehouse. There are many variations on this technique; this study uses a commercial package from Salford Systems called CART®. CART automatically sifts large, complex databases, searching for and isolating significant patterns and relationships. This discovered knowledge can then be used to generate predictive models. This data mining technique is best suited to problems with the following five characteristics (Mitchell, 1997):

1. Instances are represented by attribute-value pairs. A particular attribute (e.g., sex) has a small number of disjoint possible values (male, female). Continuous attributes can be incorporated by an algorithm extension that dynamically redefines values as true or false based on a cut-point value that
produces the greatest information gain in the learning sample. Healthcare data fit this well.

2. The target function has discrete output values; e.g., HbA1c may have many values (any given number from below 6 to above 15) but can meaningfully be put into a binary value (good ≤7, not good >7). This technique can also handle more than 2 output values (e.g., good ≤7, fair 7-8, not good >8). Target variables of interest fit this well.

3. Disjunctive descriptions may be required, and this type of partitioning is typical in healthcare: good results/bad results, meets guidelines/does not meet guidelines, elderly/not elderly, etc.

4. The training data may contain errors. Healthcare databases require manual entry of most variables, and error at the input source will always be present in some small percentage.

5. The training data may contain missing attribute values. This is typical of medical records, where any given variable may be missing in some patients.

A key issue in classification and regression trees is how to split the data at any given node. The goal is to partition the data into as pure a form as possible at the next level of nodes. That is, the selected splitting criterion should minimize each node’s impurity. Commonly used criteria are entropy, variance, Gini, misclassification, twoing, and gain ratio (Duda et al., 2001). One recent paper (Shih, 1999) enumerated the various splitting criteria that classification trees can use and listed six: (1) Gini criteria, (2) twoing criteria, (3) likelihood, (4) mean posterior improvement criteria, (5) statistical tests, and (6) a weighted sum approach. CART uses a number of these criteria. A recent dissertation (Lim, 2000) claims a splitting criterion even better than any of these six: polytomous logistic regression trees. While there are many fine points to consider about which splitting rules will work best on healthcare databases, this study will work within the options of the CART software. Information about this software is in Appendix B.

Three things will be done with the knowledge discovery items obtained by sifting through the CART output. First, they will be listed on the survey instrument in Appendix F to poll the managers and primary care clinicians in the institution about whether they perceive these items to be new and/or useful. Second, they will be modeled in the decision analysis software DATA 3.5 to determine whether the knowledge discovery items can be used to improve outcomes. This modeling procedure is akin to a univariate analysis. Third, the knowledge discovery items that do show significant outcomes improvement in the decision analysis univariate step will be combined in a model and jointly applied to the decision analysis software to see how much the combination of all the discovered knowledge can improve outcomes. This modeling procedure is akin to a multivariate analysis. This final model will also be applied to a reserved (unused) part of the data to check validity.

In summary, the software tools used include Oracle Discoverer, DBMS COPY, Microsoft Excel and Access, CART, allCLEAR, SPlus, and DATA. The instrument used to poll managers and clinicians about their impression of the data mining results is in Appendix F.
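The cut-point search and impurity minimization described above can be illustrated in a few lines. This is a toy sketch of a single CART-style split using the Gini criterion, not the CART algorithm itself; the HbA1c values and target labels are invented:

```python
# Sketch of one CART-style split: choose the cut-point on a continuous
# attribute (here, average HbA1c) minimizing the weighted Gini impurity
# of the two child nodes. Data are invented; CART does far more.
def gini(labels):
    """Gini impurity of a set of binary labels."""
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(values, labels):
    """Return (cut-point, weighted child impurity) minimizing impurity."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate
        left = [lab for v, lab in pairs if v <= cut]
        right = [lab for v, lab in pairs if v > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (cut, score)
    return best

hba1c = [6.1, 6.8, 7.2, 9.9, 10.4, 11.0]   # average HbA1c per patient
hospitalized = [0, 0, 0, 1, 1, 1]          # binary target variable
cut, impurity = best_split(hba1c, hospitalized)
print(cut, impurity)
```

On this toy data the search finds the cut-point between 7.2 and 9.9 that separates the classes perfectly (child impurity 0), which is exactly the node-purification behavior the splitting criteria are designed to achieve.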
3.5 Study procedure

Although there is no primary data collection, this section will give the
step-by-step plan for transforming the raw diabetic data warehouse fields (see Appendix A) into the variables that will make up the data mining data table.
This is the pre-data mining (pre-CART) part of the study that assembles the data to data mine. A key component is data transformation from the relational structure of the diabetic data warehouse, with its multiple tables, to a form suitable for data mining. Data mining algorithms are most often based on a single table, within which there is a record for each individual, and the fields contain variable values specific to the individual. This is the data mining data table. The most portable format for the data mining data table is a flat file, with one line for each individual record. There will be a fixed number of fields. Some fields may be blank in any given line. Most often, one or more SQL statements on the data warehouse produce the flat file output that constitutes the data mining data table. The data mining software then reads the flat file. This approach is taken here. The steps include:
• Review each table of the relational database and select fields to export.
• Determine the interactions between the tables in the relational database.
• Define the layout of the data mining data table.
• Specify patient inclusion and exclusion criteria, which will determine the number of records in the data mining data table. What is the time interval? What is the minimum and maximum number of records (e.g., clinic visits, or outcome measures) each patient must have to be included? What relevant fields can be missing and still include the individual in the data mining data table?
• Data extraction, including the stripping of patient identifiers.
• Sanity checks on the data mining data table, ensuring, e.g., that the minimum and maximum of each variable makes clinical sense.
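The SQL-to-flat-file step can be sketched with Python's built-in sqlite3 as a stand-in for the Oracle warehouse. The table and field names below are illustrative, not the real warehouse schema:

```python
# Sketch: one SQL statement turns relational visit data into one row per
# patient. sqlite3 stands in for the warehouse; the schema is invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE admin (clinic_number INTEGER, dob TEXT, sex TEXT);
CREATE TABLE clinic (clinic_number INTEGER, point_of_service TEXT);
INSERT INTO admin VALUES (101, '1950-06-01', 'F'), (102, '1942-01-15', 'M');
INSERT INTO clinic VALUES (101, 'ER'), (101, 'OV'), (102, 'OV');
""")

# GROUP BY collapses many visit rows into one summary row per patient.
rows = con.execute("""
SELECT a.clinic_number, a.sex,
       SUM(CASE WHEN c.point_of_service = 'ER' THEN 1 ELSE 0 END) AS er_visits
FROM admin a LEFT JOIN clinic c ON c.clinic_number = a.clinic_number
GROUP BY a.clinic_number, a.sex
""").fetchall()

for r in rows:
    print(r)
```

Each output row is one line of the flat file: a de-identified patient with visit counts summarized, ready for the data mining software to read.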
In the data mining data table in Table 3.2, the first 3 fields comprise the administrative variables of clinic number, date of birth, and sex. Note that among the administrative variables in Appendix A there are also 2 member number fields, which are not carried over into the data mining data table since they have no value or interest to us. The clinic number is brought over to the data mining data table as a de-identified ID record number, and date of birth is transformed into age in years as of 1/1/2001. These are fairly simple transformations, but more complicated ones will soon be needed.

Table 3.2: Data mining data table

Table | Data warehouse field | Data mining data table field | Notes
Admin | Clinic number | ID | Patient identifier is removed; ID is simply a record number 1, 2, ... in the data mining data table
Admin | DOB | Age | Transform DOB to age in years as of 1/1/2001
Admin | Sex | Sex | The 10 that are missing are left NULL in this field
Clinic | Clinic number | | For record linking
Clinic | Point of service | ER | Number of ER visits
Clinic | Point of service | ERbin2 | ER visits (0, >0)
Clinic | Point of service | ERbin5 | ER visits (0, 1, 2, 3-4, ≥5)
Clinic | Point of service | OVS | Number of outpatient services in the study period
Clinic | CPT codes | OVP | Number of outpatient provider office visits in the study period
Clinic | Diagnosis codes | CMI | Number of major body systems (1-17) listed as a diagnosis code. This can act as a rough comorbidity index. See details in text.
Clinic | Diagnosis codes | CMIbin5 | CMI (1-3, 4-5, 6-7, 8-9, ≥10)
Clinic | Diagnosis codes | LipidDx | Is there a lipid disorder documented?
Clinic | Diagnosis codes | HTN | Is there a hypertensive disorder documented?
Clinic | Diagnosis codes | CV | Is there a CAD/PVD disorder documented?
Clinic | Diagnosis codes | EYE | Is there retinopathy documented?
Clinic | Diagnosis codes | RD | Is there renal disease documented?
Clinic | Charge Amount | Clinic Cost | For use in constructing a charge variable
Clinic | Transaction sign | Clinic Cost | For use in constructing a charge variable
Clinic | Provider service | FPIM | Primarily sees FP or IM?
Clinic | Provider service | Ophtho | Number of eye doc exams (0, >0)
Clinic | Provider service | Podiatry | Number of podiatry exams (0, >0)
Lab | Clinic number | | For record linking
Lab | HbA1c results | HbA1c Av | Average HbA1c
Lab | HbA1c results | HbA1c 95 | A binary variable (0, 1) is created using a cut-point of 9.5 in the average HbA1c
Lab | HbA1c | HbA1c Count | The number of HbA1c tests done
Lab | HbA1c | HbA1c Trend | Regression trendline of HbA1c values in a given patient
Lab | LDL | LDLAv | Average LDL
Lab | LDL | LDLAv | A binary variable (0, 1) is created using a cut-point of 130 in the average LDL
Lab | LDL | LDLcount | The number of LDL tests done (0, >0)
Lab | Urine tests | UrineTestDone | A binary variable (0, 1) if urine protein test was done 0 or >0 times
Lab | Urine tests | UrineTestCount | The number of urine protein tests done
Hosp | Discharge date | Hospitalized? | Yes (1), No (0)
Hosp | Death code | Hospitalization death? | Yes (1), No (0)
Meds | Clinic number | | For record linking
Meds | Category code 019G | Sulfonylurea | 0 = not used, 1 = used
Meds | Category code 019H | Metformin | 0 = not used, 1 = used
Meds | Category code 7T6T | Troglitazone | 0 = not used, 1 = used
Meds | Category code 7T6S | α-glucosidase inhibitor | 0 = not used, 1 = used
Meds | Category code 019F | Insulin | 0 = not used, 1 = used
Meds | Category codes | InsulinOral | 0 = no, 1 = yes
Meds | Category codes | Monotherapy | 0 = no, 1 = yes
Meds | Category code 0397 | LipidDrug | 0 = not used, 1 = used
Meds | Category code 016C | ACEI | 0 = not used, 1 = used
Meds | Category code 4M4M | ARB | 0 = not used, 1 = used
Meds | Category codes | ACEIandARB | 0 = neither, 1 = either
Meds | Category code 03C9 | Steroids | 0 = not used, 1 = used
Those who do not have continuity, defined as having at least 2 outpatient services (OVS), are excluded. Any one patient may have many outpatient services, but these must all be placed on the same row. Therefore, the variables must be transformed into summary data that retain the most useful information. This avoids having 100 columns for clinic visits 1 through 100, with most rows populated only in the initial columns. Since each visit has more than a dozen variables associated with it, this would actually mean having >1,000 sparsely populated columns for outpatient services unless summary transformations are used. Table 3.2 shows from which data warehouse subtable (col. 1) which fields are exported (col. 2). For each exported field or set of fields, Table 3.2 lists one or more fields in the data mining data table that are constructed from the exported field or fields (col. 3). While each of these variables needs to be explored in depth, two are discussed in more detail now: the comorbidity index and the average HbA1c. Tables 3.2 and 4.1 contain summary information about the variables this study uses. The feasibility of extracting each of these variables depends in part on the ability to run certain SQL statements. Past experience has shown that the network server cannot handle some complicated SQL statements. Until the actual study is performed and the variables extracted, Table 3.2 is open to revision.
The comorbidity variable takes the diagnostic codes in the clinic subtable and converts them into 1 comorbidity column with 1 line per patient. All the codes are divided into the 17 categories of the ICD9:

001-139 Infectious and parasitic diseases
140-239 Neoplasms
240-279 Endocrine, nutritional and metabolic diseases, and immunity disorders
280-289 Diseases of the blood and blood-forming organs
290-319 Mental disorders
320-389 Diseases of the nervous system and sense organs
390-459 Diseases of the circulatory system
460-519 Diseases of the respiratory system
520-579 Diseases of the digestive system
580-629 Diseases of the genitourinary system
630-677 Complications of pregnancy, childbirth, and the puerperium
680-709 Diseases of the skin and subcutaneous tissue
710-739 Diseases of the musculoskeletal system and connective tissue
740-759 Congenital anomalies
760-779 Certain conditions originating in the perinatal period
780-799 Symptoms, signs, and ill-defined conditions
800-999 Injury and poisoning

Patients are labeled with the number (1 through 17) of the categories into which they have been diagnosed, as a rough comorbidity index. For example, if someone had codes through their visits that fell into 5 of these groups, their comorbidity column would be a 5. Many comorbidity indexes, such as Charlson’s Index, are inpatient focused. Most of the diabetic patients in our study were never hospitalized. Thus, this study agrees with others who have found it valuable to use a physicians’ claims comorbidity index (Klabunde, Potosky, Legler, & Warren, 2000). This is a simplified variant of that idea.
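The comorbidity index computation can be sketched directly from the chapter list above. The example diagnosis codes are invented; the chapter boundaries are the ICD-9 ranges just listed:

```python
# Sketch of the comorbidity index: count how many of the 17 ICD-9
# chapters a patient's diagnosis codes fall into. Example codes invented.
CHAPTERS = [(1, 139), (140, 239), (240, 279), (280, 289), (290, 319),
            (320, 389), (390, 459), (460, 519), (520, 579), (580, 629),
            (630, 677), (680, 709), (710, 739), (740, 759), (760, 779),
            (780, 799), (800, 999)]

def comorbidity_index(icd9_codes):
    """Number of distinct ICD-9 chapters represented in a patient's codes."""
    chapters = set()
    for code in icd9_codes:
        major = int(float(code))  # '250.01' -> 250
        for i, (lo, hi) in enumerate(CHAPTERS):
            if lo <= major <= hi:
                chapters.add(i)
                break
    return len(chapters)

# Diabetes (250.x), hypertension (401.x), depression (311): 3 chapters
print(comorbidity_index(["250.01", "250.02", "401.9", "311"]))  # -> 3
```

Note that repeated codes within a chapter do not inflate the index; only the number of distinct body systems counts.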
Handling time series medical data is challenging for data mining software. One example in our study is the HbA1c value, the key measure of glycemic control that should be measured every 3 to 6 months in all diabetics. This is closely related to clinical outcomes and complication rates in diabetes. Healthcare costs increase markedly with each 1% increase in baseline HbA1c; patients with an HbA1c of 10% vs. 6% had a 36% increase in 3-year medical costs (Blonde, 2001). How should this time series variable be transformed from the relational database to a vector (column) in the data mining data table? A given diabetic patient may have many of these HbA1c results. One could pick the last one, the first, or a median value. One could take an average. Since the trend over time for this variable is important, one could choose the slope of its regression line over time. However, a linear function may be a good representation for some patients but a very bad one for others, who may, for example, be better represented by an upside-down U curve. This difficulty arises for most repeated laboratory tests. In any event, some information will be lost. This study chose to use the average HbA1c of all the results for a given patient and to exclude patients who do not have at least 2 HbA1c results in the data warehouse. As noted in the table above, this average HbA1c was repartitioned into a 2-level categorical variable based on a meaningful clinical cut-point of 9.5. Experts agree that an HbA1c >9.5 is a bad outcome, or a medical quality error, no matter what the circumstances (AMA et al., 2001). On a separate level is the data collection from managers and clinicians using the instrument in Appendix F. The knowledge discovery items that will be queried here are not yet known. Senior management and members of the primary care internal medicine and family practice departments will be asked to fill out the instrument. This will give us the views of managers and clinicians about the
knowledge discovery items that the data mining part of this study arrives at. This will be the basis for judging whether hypothesis 2 is proved or disproved on a local level.
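The per-patient HbA1c transformations described above (the average, the 9.5 cut-point binary, and the regression trend) can be sketched as follows; the sample results and field names are invented:

```python
# Sketch of the HbA1c transformations: per-patient average, the 9.5
# cut-point binary, and a least-squares trend slope. Data are invented.
def hba1c_features(results):
    """results: list of (days_since_start, HbA1c value); at least 2 required."""
    if len(results) < 2:
        return None  # patient excluded, as in the study's criteria
    xs = [t for t, _ in results]
    ys = [v for _, v in results]
    n = len(results)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in results)
             / sum((x - xbar) ** 2 for x in xs))
    return {"HbA1c_Av": ybar,
            "HbA1c_95": int(ybar > 9.5),  # 1 = bad glycemic control
            "HbA1c_Trend": slope}

# Three results over six months: high but improving glycemic control
print(hba1c_features([(0, 10.0), (90, 9.6), (180, 9.2)]))
```

The example shows what the averaging loses and the slope recovers: the patient's mean is above the 9.5 cut-point, yet the negative trend indicates improvement.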
3.6 Treatment of the data

Here the step-by-step use of the data mining data table is described. In
addition, the methods for analyzing the responses to the instrument in Appendix F, and for using decision analysis trees to decide whether hypothesis 2 is true or false, are reviewed.
3.6.1 Target variables

The binary target variables used by CART are the basis for splits at each
node, which generates 2 child nodes. The split at each node tries to achieve a purer concentration of target variable = 1 in one child node, which makes the other child node more concentrated in target variable = 0 than the parent node. For example, if the target variable is HbA1c >9.5 (1 = bad glycemic control, 0 = less bad glycemic control), then every split tries to produce child nodes that are more or less concentrated in patients with HbA1c >9.5. This study plans to use the following as target variables:

1. HbA1c, cut-point of 9.5. As mentioned above, >9.5 is clearly a bad outcome (bad glycemic control) regardless of the circumstances.

2. Number of emergency room visits. Emergency room visits are generally considered expensive and a sign that adequate care is not being provided through the patient’s primary care system. Cut-points will include (0, 1) for the number of ER visits that are (0, >0), and a 5-bin variable.
3. Hospitalization (yes, no). Hospitalizations are expensive, and can be a sign of adequate care not being provided.

4. Hospitalization death (yes, no). A hospitalization resulting in death is clearly a bad outcome. Note that the denominator will include all patients, not just those hospitalized. Also note that this does not include all deaths, since outpatient deaths are unrecorded in the data warehouse.

5. Charges. The incomplete charge data in the data warehouse include clinic charges and medication charges. These can be combined into a charge variable.

6. The last target variable to be used in this study is a summary measure of medical quality of care in diabetes. Because its explanation is so extensive, the next section is devoted to it.

Medical quality target variable

The Institute of Medicine’s recent reports have emphasized the extent and seriousness of medical errors as a cause of death in the United States. The first report, To Err Is Human (Kohn, Corrigan, & Donaldson, 2000), put the number of deaths in the United States from medical errors at 44,000 to 98,000. Even at the lower estimate, this is the 8th leading cause of death in our country. The numbers of preventable non-fatal adverse events, and of correctable deviations from gold-standard quality of care, are much higher. The cost of these errors is measured in billions of dollars, and in shorter lives and lower quality of life when quality care is not provided. The report shows that health care is a decade or more behind other high-risk industries in its attention to ensuring basic safety.
The second report, Crossing the Quality Chasm (IOM, 2001), focuses on deviations from quality care, which are far more widespread than safety problems. The authors note: “What is perhaps most disturbing is the absence of real progress towards restructuring health care systems to address both quality and cost concerns, or toward applying advances in information technology to improve administrative and clinical processes.” This target variable is intended as one example of such an application. The report recommends systemic solutions to errors once they are found, but systemic methods are not generally applied to finding errors. Data mining is an ideal method to apply to clinical databases in order to systematically find medical quality errors. The gold standards used are from the April 2001 “coordinated performance measurement for the management of adult diabetes” consensus statement from the American Medical Association, the Joint Commission on Accreditation of Healthcare Organizations, and the National Committee for Quality Assurance (2001).
As noted in their April 25, 2001 press release, “For the first time,
organizations representing the perspectives of physicians, health plans, hospitals, and other health care organizations have cooperated in the development of a common set of evidence-based measures for evaluating performance in health care” (www.ama-assn.org/ama/upload/mm/370/nr.pdf). Table 3.3 lists the outcome measures or gold standards of quality care that this study uses based on this document. The cut-points recommended in the AMA/JCAHO/NCQA report depend on the purpose of the measurement. This study uses their external accountability cut-points.
Table 3.3: Gold standard measures of diabetic quality care

  #  Aspect of care    Measure                                   Cut-points   Weight  Used?
  1  HbA1c management  Frequency of HbA1c testing                0, >0        0.75    Yes
  2  HbA1c management  Control of HbA1c level                    ≤9.5, >9.5   1.00    Yes
  3  Lipid management  Frequency of lipid testing                0, >0        0.75    Yes
  4  Lipid management  Control of lipid levels                   ≤130, >130   1.00    Yes
  5  Urine testing     Testing for microalbuminuria              0, >0        0.50    Yes
  6  Eye exams         Frequency of screening examinations
                       for diabetic retinopathy                  0, >0        0.75    Yes
  .  Foot exams        Frequency of foot examinations            ...          ...     No
  .  HTN management    Frequency of blood pressure readings      ...          ...     No
  .  HTN management    Control of blood pressure level           ...          ...     No
An HbA1c should ideally be below 7; an average above 9.5 is clearly bad. Those with a bad average HbA1c (>9.5) are 1,052/7,953 or 13.2% of all diabetic patients in the learning group. The odds of having a bad HbA1c are therefore 1,052/6,901 or 0.152. After calculating the odds of a bad HbA1c in the younger and older groups, the OR can be calculated. Going to the original data on all 15,903 patients, the odds of a bad HbA1c in the younger group (≤65.581 years old, n = 8,000) are 1,555/6,445 or 0.241. The odds of a bad HbA1c in the older group (n = 7,903) are 557/7,346 or 0.0758. Therefore the odds ratio for a bad average HbA1c (>9.5) in the younger group compared with the older group is 0.241/0.0758 or 3.2. This information could then be used in designing an intervention for the younger group. The CART output of the splitting diagram, when a tree is successfully constructed, will be presented for each target variable. In order to present a more useful and accurate tree, the CART output will be converted to a more viewable diagram using allCLEAR software. Examples are in Figures 3.1 and 3.2, which come from the same preliminary data used in the odds ratio example above.
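The odds-ratio arithmetic above can be checked in a few lines; this is an illustrative sketch (here in Python), using the counts quoted in the text:

```python
# Odds ratio for a bad HbA1c (>9.5), younger vs. older group,
# using the preliminary counts cited in the text.
def odds(bad, ok):
    """Odds of a bad HbA1c = bad cases / non-bad cases."""
    return bad / ok

odds_younger = odds(1555, 6445)   # age <= 65.581, n = 8,000
odds_older = odds(557, 7346)      # age >  65.581, n = 7,903
odds_ratio = odds_younger / odds_older

print(round(odds_younger, 3))     # 0.241
print(round(odds_older, 4))       # 0.0758
print(round(odds_ratio, 1))       # 3.2
```

The same pattern applies to any binary split CART produces: compute the odds of the bad outcome in each child node and divide.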
3.8
Knowledge discovery items

The new knowledge (knowledge discovery items) gained from each target
variable will be listed. One example, again from the preliminary data discussed on the previous pages, is that younger people (<65 years of age) are 3.2 times more likely than older people (>65) to have bad glycemic control (average HbA1c >9.5). This is surprising to most clinicians who have heard these results. It is not in either of the gold standard textbooks discussed in the definitions section on new knowledge in Chapter 1. It therefore qualifies as “new knowledge” and is listed on the first row of Appendix F as an example. The next step for the knowledge discovery items is to determine which have the highest potential to produce outcomes improvement. For example, returning
Figure 3.1: Sample CART diagram formatted using allCLEAR (part A)
Figure 3.2: Sample CART diagram formatted using allCLEAR (part B)
to the preliminary results, one might reasonably devise an intervention in the middle-aged group (under 65 years), with glycemic control classified as bad (>9.5) or OK (≤9.5). Post-intervention, there are 557 + 844 = 1,401 patients with bad glycemic control. This is an improvement of 34% in those with a bad HbA1c (2,112 to 1,401), and thus qualifies as improved outcomes under the definition of a 10% improvement in outcomes.
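The improvement figure can be verified directly from the pre- and post-intervention counts quoted in the text (a quick check, here in Python):

```python
# Pre/post-intervention glycemic control counts from the decision
# analysis example above (counts taken from the text).
pre_bad = 1555 + 557     # 2,112 patients with average HbA1c > 9.5
post_bad = 844 + 557     # 1,401 after the modeled intervention

improvement = (pre_bad - post_bad) / pre_bad
print(pre_bad, post_bad)          # 2112 1401
print(round(improvement * 100))   # 34 -- exceeds the 10% threshold
```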
Figure 3.3: Pre-intervention glycemic control outcome

After the chosen “univariate” knowledge discovery items have been modeled, a final decision analysis tree will be presented that is a “multivariate” one, incorporating all the previously singly considered ones that were significant (as defined in Chapter 1).
3.10
Final test on virgin data

This is where the study will apply the final “multivariate” decision analysis model on the 1/3 of the data that was put away and safeguarded for this purpose
Figure 3.4: Post-intervention glycemic control outcome

at the beginning. This is a test of the model on virgin data, none of which was used in constructing or evaluating any part of the model. Draper (2000) describes the 3-fold cross-validation technique, which takes traditional cross-validation one step further. The traditional approach is to use half the data to develop the model and the other half to “test” it. CART inherently does this with its learning and test sets, constructing the model from the learning set but using the test set to decide when to stop growing the tree or stop pruning it back. This is CART's method of avoiding an over-fitted tree that does well on the learning data but poorly on other data sets. But when one tests the model, one makes refinements to improve its accuracy and predictive power (this is the amount of pruning that CART does based on information from the test set). Hence the refined model is essentially an untested one. There is no data left over to test it on, and if one goes back to test it on the original data (or the second half of it) one is essentially cheating by using the data twice: once to make refinements to
the model and again to “retest” the refined model on the already used data. But this “retest” is on data used to construct the model, so it is not really testing, and tends to understate error. The 3-fold cross-validation technique gives one the opportunity to both refine the model and test it on unused data by using the following steps:

Step 1: Divide the data at random into three subsets Si.
Step 2: Fit the tentative model to S1. Expand the initial model in all feasible ways suggested by data exploration using S1 (the learning set in CART).
Step 3: Use the model fit to S1 to create predictive distributions for all data points in S2 (the test set in CART). Compare the actual outcomes with these distributions, checking for predictive calibration. Go back to step 2, changing the model as necessary to get good calibration. This is automated in CART.
Step 4: Announce the final model (fit to S1 and S2) from step 3, and report the predictive calibration of this model on the data points in S3 as an indication of how well it would perform with new data.

Note that what the study tests on the virgin data is twofold. First, CART is run on the virgin data (splitting it in half for a new learning and test set, or using cross-validation) to confirm that the knowledge discovery items remain valid. Second, the “multivariate” decision analysis model is applied using parameters from the virgin data to show the outcomes improvements.
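The S1/S2/S3 partition in step 1 can be sketched as follows; this is an illustrative Python sketch, not the study's actual tooling (which used Excel and CART):

```python
# Draper's 3-fold scheme: S1 trains, S2 tunes (CART's pruning),
# S3 stays untouched until the final test on "virgin" data.
import random

def three_fold_split(records, seed=42):
    """Randomly assign each record to subset S1, S2, or S3."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    subsets = {1: [], 2: [], 3: []}
    for rec in records:
        subsets[rng.randint(1, 3)].append(rec)
    return subsets[1], subsets[2], subsets[3]

s1, s2, s3 = three_fold_split(range(15393))
# Every record lands in exactly one subset, each roughly a third.
print(len(s1) + len(s2) + len(s3))   # 15393
```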
Chapter 4
Results

4.1
Introduction

The owner of the private diabetic data warehouse is a large integrated
healthcare system in the New Orleans area. The diabetic data warehouse, as of December 2001, included 31,696 diabetic patients. The Oracle database is set up so that the administrative group links the 4 subtables (clinic, hospital, laboratory, and medication subtables). The results will include details on the variables extracted into the data mining data table, the extraction process, the epidemiology of this diabetic population, the data mining results for various outcome or target variables, decision models to evaluate how much this discovered knowledge can improve outcomes, and whether local institution clinicians and managers found the findings new or useful.
4.2
Variable extraction and epidemiology
4.2.1
Inclusion and exclusion criteria

The sample is selected by including all adults in the diabetic data warehouse
who have at least 2 HbA1c values in the 3¾-year study period of 1/1/98 to 9/30/01. Anyone without at least 2 outpatient services during the study period is excluded. By these criteria the sample is limited to adults with at least some continuity during the study period. Thus, the results will describe the population that is seen and followed by physicians, excluding diabetic patients who rarely see a doctor in our institution. Since the exact period of insurance coverage for these patients is unavailable, it may be that a diabetic patient who is usually closely monitored has only recently come into the system and is being excluded by our criteria. Those who are not diabetic but may have slipped into the diabetic data warehouse inappropriately (the criteria for entry are in Appendix A) should be excluded. To do this, we first develop a list of clinic numbers from the clinic subtable that do not have a diabetes ICD9 code (250.xx) in any of the 4 billing codes for visits (5,142). We subtract from this anyone on a diabetic drug, which leaves 4,061. Next, we construct a list of the patients with an average HbA1c below 7.0; these are excluded as probable non-diabetics. When the HbA1c tests done in a given year for the entire diabetic data warehouse population were averaged for 1998-2001 (9 months for 2001), the annual means were 7.89, 7.62, 7.45 and 7.30. Using linear regression, this is a decrease of 0.194/year in the average HbA1c values (a 2.6% annual decrease), for a total decrease of 9.6% in average HbA1c values during this 3.75-year period. Of those with HbA1c values ≥6, annual means were 8.09, 7.93, 7.83 and 7.74. This is a decrease of 0.115/year (a 1.45% annual decrease), for a total decrease of 5.45% in average annual HbA1c values during the study period. These are population statistics, not individual ones. This same approach is used for each individual patient, since they were all required to have at least two HbA1c values.
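The population slopes quoted above follow from an ordinary least-squares fit to the four annual means; a minimal sketch (Python, written for illustration rather than the SPlus used in the study):

```python
# Least-squares slope of the annual mean HbA1c values, 1998-2001
# (2001 covers only 9 months), as quoted in the text.
def ls_slope(ys):
    """Slope of an ordinary least-squares line through (0, y0), (1, y1), ..."""
    xs = range(len(ys))
    n = len(ys)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den

print(round(ls_slope([7.89, 7.62, 7.45, 7.30]), 3))   # -0.194 per year (all patients)
print(round(ls_slope([8.09, 7.93, 7.83, 7.74]), 3))   # -0.115 per year (HbA1c >= 6)
```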
Using annual means for each individual, a linear regression trendline called HbA1cAvSlope is calculated. A limitation is that a linear fit will not be good for everyone. The mean of -0.28 (median -0.19) can be interpreted, using the mid-study-period HbA1c of 7.51, as a 3.7% annual decrease (13.9% over the study period) in HbA1c values. Note that n = 13,470 because some patients had their ≥2 required HbA1c tests all in the same year; since these were averaged for that year, only one data point remains and a slope was not calculated. The population averages in the previous paragraph do include every HbA1c. HbA1cAvSlopeBin2 is a categorical (0,1) variable depending on whether the slope was negative or not, and the emergency room visit counts are binned as (0, >0) and as (0, 1, 2, 3-4, ≥5) respectively. LDLdone is 1 if an LDL blood test was reported at least once, otherwise 0. This was present for 13,116 (85%) of the 15,393 patients. LDLAv is the average LDL of those that were done for a patient. Thirty percent met the diabetic goal. Some LDL results were non-numeric, such as “>400,” and a few were “text” or “error.” LDLcount was the number
of times an LDL blood test was done in the study period, regardless of whether a valid numerical result was reported. UrineTestDone is 1 if a urine protein test was done, 0 otherwise. These include any of the following: urine dipstick, urinalysis, urine protein-creatinine ratio, urine albumin-creatinine ratio, or 24-hour urine protein collection, which correspond to a performed test code of 5015, 5701.1, 2590.4, 2063.1, or 2063.3. UrineTestCount is the number of times a urine protein test was done in the study period. Insulin is whether a patient ever filled a prescription for insulin and had it recorded in the medication subtable, defined as a category code of 019F. There were 2,724 patients who ever used insulin. Of the 15,393 patients in the data mining data table, 7,547 never had a medication recorded. Much of this is due to medications being filled through an insurance or self-pay system that did not result in documentation in the data warehouse. Thus, these 7,547 are missing values rather than an indication that medications were never used, and so are represented as NULL values rather than zeros; this approach is followed for all the medications. If a denominator of 7,846 (the number of the 15,393 patients for whom medication information is recorded) is used, then 34.7% used insulin. This denominator will be used for the following medication variables also. A discussion of medication epidemiology in the diabetic data warehouse is in the next section. Metformin is whether a patient ever filled a prescription for a biguanide and had it recorded, defined as a category code of 019H. This includes those on Glucovance. Precose is whether a patient ever filled a prescription for an α-glucosidase inhibitor, category code 7T6S. Sulfonylurea is whether a patient ever filled a prescription for a sulfonylurea, category code 019G.
These include those who used Glucovance. Troglitazone is whether a patient ever filled a prescription for troglitazone, category code 7T6T. Monotherapy is whether a patient filled at least one diabetic medication but never a medication from another of the 5 classes listed above. Glucovance is the one oral diabetic drug that is a combination of a biguanide and a sulfonylurea. The diabetic data warehouse listed it as a biguanide only, but this was corrected in all the extracted medication tables, so the results reported here are accurate (e.g., someone on Glucovance alone is not on monotherapy and will be listed as being on both metformin and a sulfonylurea). InsulinOral is whether a patient was on both insulin and one or more oral agents for diabetes. ACEI is whether a patient ever filled a prescription for an angiotensin-converting enzyme inhibitor, category code 016C. ARB is whether a patient ever filled a prescription for an angiotensin receptor blocker, category code 4M4M. ACEIandARB is whether a patient ever filled a prescription for either an ACEI or an ARB and is a simple join of the prior 2 drug records. LipidDrug is whether a patient ever filled a prescription for a lipid-lowering drug, category code 0398. Steroids is whether a patient ever filled a prescription for oral steroids, category code 03C9. Charges are only available for the clinic and medication subtables. There was a data warehouse error in bringing DRG codes over for all hospitalizations, and due to IS priorities this will not be debugged in the near future; hence, hospital costs will not be estimated. Only about half of the patients in the data mining data table have medication information, so medication charges will not be included. The charges variable is outpatient services charges exclusively from the clinic subtable. Since charges are mostly meaningful in relation to the length of time someone received services, ChargesPerLOT is calculated to approximate
charges per month. An ordinal variable ChargesBin5 records which quintile of ChargesPerLOT a patient is in. LOT is the length of time between the first and last services recorded in any of the diabetic data warehouse subtables (clinic, hospital, lab, medications). This gives an estimate of a patient's length of time in months as an active patient. It likely underestimates this by at least a few months, since patients usually have insurance for at least a few months before a service is performed. LOT will be used in the Medical Quality Index below, as well as being a potential predictor variable in the CART analysis. The data warehouse does not include information on start and end dates of insurance coverage. The mean LOT was 33 months, the median 40. MQI is the Medical Quality Index, a calculated variable based on Table 3.3. However, Table 3.3 is designed for annual evaluation of those who have been active patients for an entire year. This study period is 3.75 years, and patients have varying times of being active patients, with LOT estimating this. Therefore Equation (4.1) is used for MQI. MQI ranged from 0.75 to 3.75. MQIbin2 is a binary variable (0,1) for (≤MQI mean, >MQI mean). MQIbin5 is (0, 1, 2, 3, 4) over cut-points of MQI that approximately divide it into quintiles.
MQI = 0.75 (1 if HbA1cCount/LOT ≥ 0.5) + 1.00 (1 if HbA1cAv ≤ 9.5)
    + 0.75 (1 if LDLcount/LOT ≥ 0.5) + 1.00 (1 if LDLAv ≤ 130)
    + 0.50 (1 if UrineTestCount/LOT ≥ 0.5) + 0.75 (Optho)        (4.1)

SET is a random selection of integers 1, 2, or 3 to divide the data set into 3 groups: S1 will be the training group that CART uses to devise its tree.
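Equation (4.1) can be sketched in code; the weights are those of Table 3.3, while the patient values below and the assumption that LOT is expressed in years are purely illustrative:

```python
# Sketch of the Medical Quality Index of Equation (4.1).
# Assumption (not stated in the text): LOT is in years here, so a
# count/LOT rate >= 0.5 means "at least one test every two years".
def mqi(hba1c_count, hba1c_av, ldl_count, ldl_av, urine_count, optho, lot):
    score = 0.0
    score += 0.75 * (hba1c_count / lot >= 0.5)   # HbA1c testing frequency
    score += 1.00 * (hba1c_av <= 9.5)            # HbA1c control
    score += 0.75 * (ldl_count / lot >= 0.5)     # lipid testing frequency
    score += 1.00 * (ldl_av <= 130)              # lipid control
    score += 0.50 * (urine_count / lot >= 0.5)   # microalbuminuria testing
    score += 0.75 * optho                        # retinopathy screening
    return score

# Hypothetical well-monitored patient meeting all six measures:
print(mqi(4, 7.2, 3, 110, 2, 1, 3.75))   # 4.75
```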
S2 will be the test set that CART uses to decide its optimal pruning strategy on the training set. S3 is data unused by the modeling process that is reserved as “virgin” data in the 3-fold cross-validation method described in the methods chapter. The random numbers were generated in MS Excel using the function = int(rand() ∗ 3 + 1). This method, rather than having the software randomly choose sets each time, is reproducible. As seen in the SPlus descriptive analysis, each Si is about a third of the data set. TLSet is a column that sets S1 = 0, S2 = 1, and S3 = NULL to allow CART to distinguish the learning and test sets. The NULL rows are removed from the file to form the set-aside data set used later.
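A direct analogue of the Excel formula, seeded for reproducibility (an illustrative sketch; the seed value is arbitrary):

```python
# Python analogue of Excel's int(rand()*3 + 1) used to build the SET
# column; a fixed seed makes the assignment reproducible.
import random

rng = random.Random(2002)   # arbitrary seed chosen for illustration
sets = [int(rng.random() * 3 + 1) for _ in range(15393)]

assert set(sets) == {1, 2, 3}   # every row lands in S1, S2, or S3
print(min(sets), max(sets))     # 1 3
```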
4.2.3
Diabetes epidemiology

This population's epidemiology is important when others compare their
population with this one in determining external validity, that is, how comparable the populations are, in order to conclude that the results of this study can validly be applied in their population. Much of the epidemiology has already been reviewed in the details of the variables above, such as age and sex. Some additional information is in the following sections.

Medication use

Without using our continuity criteria, the medication subtable has 16,036 diabetic patients (excluding the non-diabetic patients who slipped in, as defined above). Of these, there are 12,198 (76.07%) on diabetic medication, 5,626 (35.08%) on a lipid lowering agent, 8,086 (50.42%) on an ACE inhibitor, 1,503 (9.37%) on an angiotensin receptor blocker, 8,720 (54.38%) on either an ACEI or ARB, and 2,888 (18.01%) who have taken oral steroids. There were 7,847 with medication records who were not excluded by our inclusion and exclusion criteria and made
101 it into the data mining data table. Thus, about half of those in the data mining data table have medication records included. Of the 12,198 patients who have taken any diabetic drug that is recorded in the diabetic data warehouse, the number and percentage who ever used the following drugs are: insulin 4,020 (32.96%), a biguanide 6,176 (50.63%), an αglucosidase inhibitor 109 (0.89%), a sulfonylurea 7,977 (65.40%), or a troglitazone 1,575 (12.91%). Of the 12,198 patients, there are 6,469 (53.03%) who are on monotherapy and 2,294 (18.81%) who are on insulin plus one or more oral agents. These results are summarized in Table 4.2 below to contrast the results among the overlapping but different populations of these 4 groups: • DMDT: The data mining data table has 15,393 adult diabetic patients of the 31,696 in the diabetic data warehouse. The bias here is that there are so many missing medication records, hence the numerator is underestimated leading to lower percentages than in real life. • DMDTMEDS: The subgroup of the data mining data table with medication records of being on a diabetic drug (7,846). These are the percentages that are reported in the SPlus summary data above. • MEDS: Diabetics for whom records of filling any type of prescription medication (16,036) are available. This group does not have a continuity criteria. • DMEDS: Patients who have records of being on a diabetic medication (12,198). This group does not have a continuity criteria.
Type 1 and type 2

How many of the adult patients in this diabetic data warehouse are type 1 or type 2? Restricting to adults only, there are 2,773,357 OVS rows with
Table 4.2: Percent of diabetics on medications

  Medication        DMDT     DMDTMEDS  MEDS     DMEDS
  n =               15,393   7,846     16,036   12,198
  On diabetic drug  50.95%   100%      76.07%   100.00%
  Insulin           17.69%   34.72%    25.07%   32.96%
  Metformin         27.58%   54.14%    38.51%   50.63%
  Precose           0.46%    0.90%     0.68%    0.89%
  Sulfonylurea      34.07%   73.43%    49.74%   65.40%
  Troglitazone      7.45%    14.63%    9.82%    12.91%
  Monotherapy       24.42%   47.92%    40.34%   53.03%
  InsulinOral       11.12%   21.82%    14.31%   18.81%
  ACEI              30.58%   60.02%    50.42%   54.53%
  ARB               5.80%    11.39%    9.37%    9.85%
  ACEandARB         32.55%   63.88%    54.36%   58.27%
  Lipid             21.67%   42.53%    35.08%   37.30%
  Steroid           9.06%    17.78%    18.01%   17.16%
28,906 unique clinic numbers in the clinic subtable. With proper coding, type 1 diabetic patients would have either a 250.%1 or a 250.%3 code, where % is the wildcard symbol that SQL uses. Searching all 4 diagnostic codes for all office visits with the conditions (type 1 ICD9 and DOB < 1/1/81) gives 39,826 outpatient services with 5,821 unique clinic numbers of type 1 diabetics. Thus, 5,821/28,906 = 20.14% of the adults who had outpatient services were recorded as type 1 diabetics. Unfortunately, this coding is inaccurate. Reasons include provider inaccuracies, since the fifth-digit codes do not affect the level of billing; provider confusion about what the fifth-digit codes mean; and a tendency by some to use a fifth digit of 1 as a default when it is not clear exactly what it should be. More importantly, many of our colleagues automatically check type 1 when a patient is on insulin, even though many type 2 subjects are treated with insulin. It is difficult to accurately determine the distribution between type 1 and 2. The American Diabetic Association estimates that 90-95% of all diabetics are type 2 (American Diabetes Association, 2002).
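The wildcard search described above can be sketched against a toy table; this uses sqlite3 for illustration, and the table and column names are assumptions, not the warehouse's actual Oracle schema:

```python
# Sketch of the "250.%1 or 250.%3" type 1 wildcard search, using an
# in-memory SQLite table; clinic/dx1 are illustrative names only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clinic (clinic_no INTEGER, dx1 TEXT)")
conn.executemany(
    "INSERT INTO clinic VALUES (?, ?)",
    [(1, "250.01"),   # fifth digit 1 -> coded as type 1
     (2, "250.00"),   # fifth digit 0 -> type 2
     (3, "250.13"),   # fifth digit 3 -> coded as type 1
     (4, "401.9")],   # not diabetes
)
rows = conn.execute(
    "SELECT DISTINCT clinic_no FROM clinic "
    "WHERE dx1 LIKE '250.%1' OR dx1 LIKE '250.%3'"
).fetchall()
print(sorted(r[0] for r in rows))   # [1, 3]
```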
4.3
Data mining results

In this section the data mining results for various outcome variables are
reviewed. Before detailing the results, the introduction explains why some of the CART analyses are done with and without drug variables as predictors, how the CART diagrams are arranged, what they mean, and what to focus on in interpreting them.
4.3.1
Introduction

For each target variable listed in the glycemic control subheading, two
CART analyses were done—one including drugs as predictor variables, and one excluding them.
Drugs refer to the following variables: Insulin, Metformin,
Precose, Sulfonylurea, Troglitazone, Monotherapy, InsulinOral, ACEI, ARB, ACEIandARB, LipidDrug, and Steroids. The data warehouse only had medication information on about half of the people in this study. CART uses surrogate splitters when variables are missing. These surrogate splitters are determined by what would give similar splits among patients where the variables are not missing (Breiman et al., 1984, p. 142). Because so many variables are in the drug group, and they are missing from so many cases, running the CART analysis with and without them is a way of ensuring that maximum information is retrieved for the glycemic control target variables. The CART analysis diagrams later in this chapter have 3 parts each:
• The target variable is listed in the upper left corner. CART classification trees generally use a binary (0, 1) target variable. The definition of “1” is listed here. Regression trees have continuous target variables.
• A small graphic from CART maps the entire tree at a glance, which helps with perspective. This is placed in varying locations on the diagrams according to space. The CART output colors these nodes shades of green or red to indicate the level of purity or impurity of the node with respect to the target variable. Unfortunately, these do not reproduce well in black and white, but the percentages of the target variable listed in the main tree give more precise information.
• A detailed view of the tree (sometimes truncated if too long). Each node includes the following information: node number; N = the number of patients in the node; 1 = the percent of patients that have the target variable value in the left upper corner; and the splitting criterion if not a terminal node. Note that terminal nodes (TN) are in trapezoid-shaped boxes and labeled with negative numbers, so (node -1) = TN1. Splitting nodes are in diamond-shaped boxes and labeled with positive numbers, so node 1 is always the top level of the tree and is very different from node -1.
The CART classification tree output classifies each node as class 0 or 1. The TN classifications are important if one uses CART for prediction. From this viewpoint, the predicted vs. actual results of each analysis are given in tables. However, the main interest of this study is not classification, but examining novel, interesting relationships that can be used to improve outcomes. This may mean identifying a group with significant deviance from good outcomes for which a realistic intervention can be designed. Hence, this study's use of the CART analysis highlights the following principles:
Principle 1: It is most interesting to find same-level child nodes that differ greatly in the percentage of bad outcomes. This identifies groups for possible interventions. Therefore, the raw CART output of numbers in each class in each node has been converted into percentages for the CART analysis diagrams presented below.

Principle 2: The nodes from Principle 1 are usually interesting only if the numbers of patients in them are large. The goal is a population intervention to improve outcomes, and this usually requires a significant number of patients in the involved nodes. Hence, the upper-level nodes are generally of most interest. Even very extreme nodes, such as TN3 with 1 = 0%, n = 4 in Figure 4.3, are not interesting if they contain only tiny numbers.

Principle 3: It is more interesting if the splitting criterion used to get the different child nodes in Principle 1 splits the population into easily targeted groups. For example, if an age split divides the population into (≤65, >65) with the goal of targeting those ≤65, this is very interesting. Interventions in those ≤65 can be tailored to this more homogeneous group, most of whom will be working rather than retired. However, if the split is between those on or not on monotherapy, this will be less interesting. The reasons include: (a) those on monotherapy are harder to identify as a cohesive group for which to design an intervention, and (b) those not on monotherapy are quite diverse—including those on no medications as well as those on multiple medications.

Finally, note that a CART analysis always does splits of the form “monotherapy ≤ x,” with all the cases going to the left if this is true (Yes) and to the right if it is not (No). To simplify the CART diagrams below for binary (0, 1) predictor
variables, if the split output reported Monotherapy ≤0.500, it was replaced by Monotherapy = 0. A “Yes” to Monotherapy = 0 means a patient is not on monotherapy.
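The purity-seeking splitting described in this introduction (and in Chapter 3) can be illustrated with a minimal Gini-impurity computation, one standard CART splitting criterion; the node counts below are made up for illustration:

```python
# A split is preferred when the (size-weighted) child nodes are purer
# in the target variable than the parent node. Gini impurity is one
# standard CART criterion; these counts are illustrative only.
def gini(bad, total):
    """Gini impurity of a node with `bad` target=1 cases out of `total`."""
    p = bad / total
    return 2 * p * (1 - p)

parent = gini(480, 5000)
# Candidate split, e.g. age <= 65 vs. age > 65 (illustrative counts):
child_a = gini(360, 2300)
child_b = gini(120, 2700)
weighted = (2300 * child_a + 2700 * child_b) / 5000

print(parent > weighted)   # True -- the split reduces impurity
```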
4.3.2
Glycemic control

Glycemic control is strongly related to diabetic outcomes and costs. For
glycemic control the target variables are: average HbA1c values with cut-points of 9.5, 8, and 7; an HbA1c trend variable that is the regression trendline of each individual's HbA1cAv trend; and the continuous variable of average HbA1c value.

HbA1c cut-points

The first CART analysis was on HbA1cAv95, which distinguishes those who are in bad glycemic control (HbA1cAv >9.5) from those who are less bad. The predictor variables were age, sex, OVP, CMI, LDLAv, LOT, LipidDx, HTN, CV, EYE, RD, FPIM, HospCountBin2, Insulin, Metformin, Precose, Sulfonylurea, Troglitazone, Monotherapy, InsulinOral, ACEI, ARB, ACEIandARB, LipidDrug, and Steroids. The classification tree when drugs are included is in Figure 4.1. The 3 most important¹ variables for classification were: age (most important), monotherapy (91% as important), and InsulinOral (65% as important). All other variables were less than half as important as age. The predicted vs. actual results are in Table 4.3. Since only half of our population had drug information, a CART analysis was run on the same variables minus the drugs, and the truncated classification tree is in Figure 4.2. The most important variables for classification were: age (most important) and LDLAv (56% as important as age). All other variables were
¹The term important predictor variable in CART analysis refers to the most important classifier and those which are at least 50% as important as the most important one. For a discussion of the relative importance of predictor variables, see Hastie, Tibshirani, & Friedman (2001, pp. 331–332).
Figure 4.1: Classification tree for HbA1cAv95 including drug variables (1 = HbA1c > 9.5)
Table 4.3: HbA1cAv95 predicted vs. actual results (includes drugs)

              Predicted
  Actual      0       1       Total
  0           2811    1855    4666
  1           131     352     483
  Total       2942    2207    5149
  Correct 0.614   Sensitivity 0.602   Specificity 0.729

Table 4.4: HbA1cAv95 predicted vs. actual results (excludes drugs)

              Predicted
  Actual      0       1       Total
  0           3186    1480    4666
  1           168     315     483
  Total       3354    1795    5149
  Correct 0.680   Sensitivity 0.683   Specificity 0.652

less than half as important as age. In the first split of the non-drug tree, the parent node (n = 5,091, 9.5% very bad control) is split by age ≤64.55 (n = 2,306, 14.8% very bad) and >64.55 (n = 2,785, 5.0% very bad). This is a new knowledge item that is evaluated below and can be useful for targeting interventions. The predicted vs. actual results are in Table 4.4. The CART analysis for average HbA1c with a cut-point of 8.0 uses the same predictor variables as above. The CART analysis without drugs among the predictor variables shows the only important predictor to be age. The first split made on the parent node (n = 5,091, 32.4% bad control) is by age ≤63.59 (n = 2,204, 43.0% bad) and >63.59 (n = 2,887, 24.4% bad). The second-level splits are also by age, and the third level of nodes (nodes -1, 3, 7, -12) divides the population into 4 age groups in which the younger the group, the larger the percentage with HbA1c >8. See Figure 4.4 for details. When drugs are included as predictor variables, the most important predictors are monotherapy (predicting better control), followed by age
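The summary metrics in these tables follow directly from the confusion-matrix cells; a quick check (illustrative Python), using the non-drug Table 4.4 values:

```python
# Metrics from Table 4.4 (non-drug HbA1cAv95 model) recomputed from
# its cells. Note the table reports "sensitivity" and "specificity"
# with respect to class 0, which matches the published values.
tn, fp = 3186, 1480    # actual 0 row: predicted 0, predicted 1
fn, tp = 168, 315      # actual 1 row: predicted 0, predicted 1
total = tn + fp + fn + tp

correct = (tn + tp) / total
sens_class0 = tn / (tn + fp)
spec_class0 = tp / (fn + tp)

print(round(correct, 3), round(sens_class0, 3), round(spec_class0, 3))
# 0.68 0.683 0.652
```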
Figure 4.2: Classification tree for HbA1c Av95 excluding drug variables
1 = (HbA1c > 9.5) excluding drugs
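The Correct, Sensitivity, and Specificity figures reported in Tables 4.3 through 4.8 can be recomputed directly from the 2x2 counts. A minimal sketch follows (the function name is ours; note that in these tables the reported "Sensitivity" matches the actual-0 classification rate and "Specificity" the actual-1 rate, so class 0 appears to be treated as the positive class by the software used):

```python
def classification_metrics(n00, n01, n10, n11):
    """Summary rates for a 2x2 predicted-vs-actual table.

    n_ij = count of cases with actual class i and predicted class j.
    """
    total = n00 + n01 + n10 + n11
    return {
        "correct": (n00 + n11) / total,        # overall accuracy
        "class0_rate": n00 / (n00 + n01),      # fraction of actual 0s predicted 0
        "class1_rate": n11 / (n10 + n11),      # fraction of actual 1s predicted 1
    }

m = classification_metrics(2811, 1855, 131, 352)  # Table 4.3 counts
print({k: round(v, 3) for k, v in m.items()})
# {'correct': 0.614, 'class0_rate': 0.602, 'class1_rate': 0.729}
```

Running the same function on Table 4.4's counts (3186, 1480, 168, 315) reproduces its reported 0.680 / 0.683 / 0.652 as well.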
Table 4.5: HbA1c Av80 predicted vs. actual results (includes drugs)

             Predicted 0   Predicted 1   Total
Actual 0         2162          1360       3522
Actual 1          614          1013       1627
Total            2776          2373       5149
Correct 0.617   Sensitivity 0.614   Specificity 0.623

Table 4.6: HbA1c Av80 predicted vs. actual results (excludes drugs)

             Predicted 0   Predicted 1   Total
Actual 0         2128          1394       3522
Actual 1          697           930       1627
Total            2825          2324       5149
Correct 0.594   Sensitivity 0.604   Specificity 0.572

(88% as important; worse control if younger), InsulinOral (69% as important), and Insulin (57% as important). See Figure 4.3 for more analysis details. The predicted vs. actual results are in Table 4.5 with drugs included and Table 4.6 excluding drugs.

The CART analysis of average HbA1c with a cut-point of 7.0 uses the same predictor variables as above. The CART analysis without drugs in the predictor variables shows the only important predictors to be age and LOT (56% as important as age). The first split made on the parent node (n = 5,091, 61.6% bad) is by age ≤65 (n = 2,327, 68.5% bad) and >65 (n = 2,764, 55.8% bad). See Figure 4.5 for more details. When drugs are included as predictor variables, the important predictors are Insulin, Monotherapy (95% as important), and InsulinOral (50% as important); age was only 22% as important as Insulin. If one is not on monotherapy, 81% have an average HbA1c >7, whereas only 54% do on monotherapy. At nodes 3 and -7 of Figure 4.6, Insulin alone gives worse
Figure 4.3: Classification tree for HbA1c Av80 including drug variables
1 = (HbA1c > 8) including drugs
Figure 4.4: Classification tree for HbA1c Av80 excluding drug variables
1 = (HbA1c > 8) excluding drugs
Table 4.7: HbA1c Av70 predicted vs. actual results (includes drugs)

             Predicted 0   Predicted 1   Total
Actual 0         1319           741       2060
Actual 1         1097          1992       3089
Total            2416          2733       5149
Correct 0.643   Sensitivity 0.640   Specificity 0.645

Table 4.8: HbA1c Av70 predicted vs. actual results (excludes drugs)

             Predicted 0   Predicted 1   Total
Actual 0         1497           563       2060
Actual 1         1748          1341       3089
Total            3245          1904       5149
Correct 0.551   Sensitivity 0.727   Specificity 0.434

control (72%) than oral monotherapy (60%). If one is on oral monotherapy (node 4, n = 1,553), the few who are on an ARB (n = 44) have much better control (40%) than those on non-ARB oral monotherapy (60%). The predicted vs. actual results are in Table 4.7 with drugs included and Table 4.8 excluding drugs.

HbA1c Trends

The CART analysis uses the same predictor variables as above, but now with a target variable representing trends. The first of these is HbA1c AvSlopeBin2, which divides the regression trendline of the average annual HbA1c values using a cut-point of zero. The CART analysis without drugs in the predictor variables shows the only important predictor to be age. The first split made on the parent node (n = 4,421, 35.13% bad trend) is by age ≤70.28 (n = 2,630, 38.3% bad trend) and >70.28 (n = 1,791, 30.4% bad trend). This analysis does not classify into very contrasting groups, and a review of its tree does not show interesting patterns. When drugs are included as predictor variables, the most important
predictor is age, followed by LOT (52% as important). The first split is the same as without drugs, and the analysis does not classify into very contrasting groups.

The next CART analysis uses the same predictor variables as above, but now with a target variable of HbA1c AvAdjSlopeBin2, which divides the regression trendline of the average annual HbA1c values using a cut-point of the average trendline slope in the entire population. The CART analysis is an extremely complex one. For example, when drugs are included as predictor variables, there are 171 nodes and 172 terminal nodes. The most important predictor is LDLAv, followed by age (97% as important), LOT (93% as important), and OVP (61% as important). The first split made on the parent node (n = 4,421, 45.51% bad trend) is by insulin (no insulin: n = 3,888, 46.5% bad trend; insulin: n = 533, 38.5% bad trend). This is a fairly good improvement associated with being on insulin. Going down additional levels does not reveal other interesting associations.

Figure 4.5: Classification tree for HbA1c Av70 excluding drug variables
1 = (HbA1c > 7) excluding drugs
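The trend target variables above are built from per-patient regression trendline slopes. A minimal sketch of how such a slope and the zero cut-point bin (HbA1c AvSlopeBin2) could be computed, assuming one average HbA1c value per year in chronological order (the function names are ours, not the study's):

```python
def trend_slope(values):
    """Least-squares slope of a patient's annual average HbA1c values
    regressed against year index 0, 1, 2, ... (a simple trendline)."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def slope_bin2(values):
    """1 ('bad trend', i.e. rising HbA1c) if the trendline slope > 0, else 0."""
    return 1 if trend_slope(values) > 0 else 0

print(trend_slope([8.0, 8.5, 9.0]))  # 0.5 (rising half a point per year)
print(slope_bin2([9.0, 8.2, 7.9]))   # 0 (improving control)
```

The adjusted variant (HbA1c AvAdjSlopeBin2) would instead compare each patient's slope against the population-average slope rather than against zero.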
Figure 4.6: Classification tree for HbA1c Av70 including drug variables
1 = (HbA1c > 7) including drugs
Figure 4.7: Regression tree for HbA1c AdjSlopeREGRESSdrug

This next CART analysis is a regression tree with a continuous target variable of HbA1c AvSlope, the HbA1c regression trendline slope. With or without drugs included in the predictors, the only important predictor variable was LOT, and only one split (LOT ≤15.82) was made on it. The same was true of HbA1c AvAdjSlope. All 4 of these CART analyses were almost identical; see Figure 4.7 for the analysis of the adjusted slope with drugs included.

Average HbA1c

HbA1c Av is a continuous variable and so uses regression trees. Without drugs, the only important predictor was age, and the CART analysis is shown in Figure 4.8.
With all the predictors including drugs, the most important
variable was age, followed by monotherapy (93% as important), Insulin (88% as important), and InsulinOral (64% as important). The CART analysis is in Figure 4.9.
Figure 4.8: Regression tree for HbA1c Av without drug variables (HbA1cAvnodrug)
Figure 4.9: Regression tree for HbA1c Av with drug variables (HbA1cAvdrug, truncated to upper levels)
4.3.3 Emergency department visits

Emergency department visits are associated with high costs, and many believe they may be an indicator of poorly accessible primary care services.
For ER visits, the target variables are the binary variable (yes, no) for emergency department visits; an ordinal variable obtained by dividing emergency department visits into 5 bins; and the continuous number of emergency department visits. The predictor variables were age, sex, OVS, OVP, CMI, CMIbin5, HbA1Ccount, HbA1CAv, HospCount, LDLAv, LDLcount, UrineTestDone, LOT, MQI, LipidDx, HTN, CV, EYE, RD, PODIATRY, OPHTHO, FPIM, HbA1CAvAdjSlopeBin2, HospDeath, LDLdone, LDL130, Insulin, Metformin, Precose, Sulfonylurea, Troglitazone, Monotherapy, InsulinOral, ACEI, ARB, ACEIandARB, LipidDrug, and Steroids.

Not surprisingly, the major classifier for all ER target variables was hospitalization, since many ER visits result in a hospitalization. Hence, the HospCount and HospDeath predictor variables were removed from the CART analyses that follow. The CART analysis using a target variable of whether one was ever in the ER (ERbin2) is in Figure 4.10. In contrast with the target variable of glycemic control, here CMI is the most important classifier and the splitter in both nodes 1 and 2. CART chose CMIbin5 as the most important classifier, with CMI 99% as important, OVP 91% as important, and OVS 72% as important. From this it is reasonable to conclude that one's comorbidity index is a key predictor of emergency department visits, which would be expected by clinicians. In node 4, where 52.5% go to the emergency department, a larger number of visits with providers (36 or more) classifies an even higher utilization group where 68% (vs. 43%) go to the emergency department. The high utilizers in TN -8 appear to be in the emergency department because of the complexity of their problems, not
due to lack of access to providers. The same appears to be true for TN -3, though a different OVP cut-point is used in node 3. In node 5, whether one primarily sees a family practitioner (32.9% to ER) or an internist (49.4% to ER) classifies these 1,392 patients with a CMI of 8 or more and an OVP of 35 or less into different utilization groups. The combined test and learning sets with the constraints in node 5 have 2,883 patients, of whom 2,681 have non-null FPIM values. Of the 1,040 who primarily see family practitioners vs. the 1,641 who primarily see internists, there is an average CMI of 9.06 vs. 9.18 and an average OVP of 22.47 vs. 23.43. The percent of each group that went to the ER was 35.10% vs. 53.44%. Access seems similar among the groups. It would be difficult to explain the 52% increase in emergency department utilization by the 1.2% increase in the comorbidity index between these groups. Nodes 6 and 7 give expected results: renal disease and worse glycemic control are associated with a greater chance of going to the emergency department.

Using a target variable of ERbin5, the CART analysis is in Figure 4.11. The most important predictor variable is CMIbin5, followed by CMI (99% as important), OVP (79% as important), and OVS (67% as important). The CART analysis is more difficult to interpret because of the ordinal target variable, but interpretation is facilitated by the class assignment for each node. The only class 4 node, indicating 5 or more ER visits, is TN -4, determined by CMI being 9 or higher. Here again, the comorbidity index in nodes 1 and 2 is the primary determinant of how many times a person is likely to go to the ER. In node 3, more outpatient services lead to TN -3, where there are more emergency department visits. This argues against the commonly held belief that emergency department visits are driven by poor access, at least in this adult diabetic population.
Figure 4.10: Classification tree for ERbin2 (1 = ERbin2)
Figure 4.11: Classification tree for ERbin5
Figure 4.12: Regression tree for ER

Finally, the continuous target variable of the number of emergency department visits is used in a CART regression tree analysis, shown in Figure 4.12. Nodes 1, 2, and 3 are all determined by the comorbidity index, similar to the other ER target variables. In this regression tree analysis, the most important predictor variable was CMI, followed by CMIbin5 (81% as important) and OVP (70% as important).
4.3.4 Hospitalizations and deaths in hospital

Hospitalizations are very costly and are themselves a poor outcome. For hospitalizations, the target variables used are a binary (yes, no) variable for hospitalization; a binary variable for death in hospital (yes, no); and the continuous number of hospitalizations. The initial predictor variables selected are age, sex, ER, OVS, OVP,
CMIbin5, HbA1c Count, HbA1c Av, HbA1c AvAdjSlope, LDLdone, LDLAv, LDLcount, UrineTestCount, LOT, MQI, LipidDx, HTN, CV, EYE, RD, Podiatry, Ophtho, FPIM, UrineTestDone, Insulin, Metformin, Precose, Sulfonylurea, Troglitazone, Monotherapy, InsulinOral, ACEI, ARB, ACEIandARB, LipidDrug, and Steroids.

Using a target variable of hospitalization (yes, no), the CART analysis is in Figure 4.13. The most important predictor is whether a patient was ever in the emergency department (58% hospitalized vs. 10%), which is just common sense since most admissions occur through the emergency department rather than as direct admissions or transfers. Of those never in the emergency department (n = 3,339), the next split is whether a patient has cardiovascular disease. If patients do, 21% are hospitalized vs. 6%. Even though ER is an extremely impressive classifier, it has little meaning since it simply states the obvious. Therefore, the CART analysis was run again without ER in the predictor variables, as shown in Figure 4.14. The most important variable was CMIbin5, followed by OVP (92% as important), OVS (67% as important), and CV (65% as important). The first split is based on the presence of cardiovascular disease, and this impressively divides the group into those with it (46% hospitalized) and without it (15% hospitalized). Thus, a diabetic patient is 3 times more likely to be hospitalized if there is cardiovascular disease. Those without cardiovascular disease are further divided by whether they are in the bottom 2 quintiles of CMIbin5 (9% hospitalized vs. 29%). If one does have cardiovascular disease, there is less chance of being hospitalized (34% vs. 58%) if there are fewer provider visits (43.2, see TN -6) had no deaths vs. 17.57% dying, perhaps indicating that the survivors have either a stronger constitution or less severe renal disease. A similar, but less divergent, scenario is also true of those without renal disease (see nodes 2, 3, -4).
The CART analysis is a good predictor of hospital deaths as seen in Table 4.9, compared with other CART analyses in Tables 4.3, 4.4, 4.5, 4.6, 4.7, and 4.8. Finally, a continuous variable of the number of hospitalizations (HospCount) is used with regression trees. Recall that the majority of the counts are zero. The
Figure 4.15: Classification tree for hospital death, excluding ER (1 = Hospital Death)
Figure 4.16: Regression tree for number of hospitalizations, excluding ER

CART analysis is in Figure 4.16, and the most important variable is renal disease, followed by CMIbin5 (78% as important) and OVP (54% as important). One interesting observation across all the CART analyses of hospital target variables is that medications are not important variables in classifying or predicting whether someone is hospitalized, whether there is a hospital death, or how many hospitalizations there are.
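For readers unfamiliar with regression trees: at each node, CART on a continuous target (such as HospCount) chooses the predictor and cut-point that most reduce the within-node sum of squared errors. A simplified, self-contained sketch of that search for a single predictor, using hypothetical data rather than the study's (the function name is ours):

```python
def best_split(x, y):
    """Exhaustively search for the single cut on predictor x that most
    reduces the sum of squared errors (SSE) of target y -- the criterion
    a CART regression tree applies at each node. Returns (cut, cost)."""
    def sse(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    pairs = sorted(zip(x, y))
    best = (None, sse(y))  # start from the unsplit node's SSE
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal x values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        cost = sse(left) + sse(right)
        if cost < best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint cut
            best = (cut, cost)
    return best

# Hypothetical data: hospitalization counts rising with comorbidity index.
cmi = [1, 2, 2, 3, 8, 9, 10, 12]
hosp = [0, 0, 0, 0, 1, 2, 2, 3]
print(best_split(cmi, hosp))  # best cut falls at CMI 8.5
```

A full CART implementation repeats this search over every predictor at every node and then prunes the resulting tree; this sketch shows only the split criterion itself.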
4.3.5 Charges

Charges, a proxy for costs, are an important outcome in healthcare. For charges, an ordinal target variable of quintiles (see Figure 4.17 for the CART analysis) and the continuous variable (Figure 4.18) are used. As noted above in the variables section, due to data limitations, these are clinic charges only.
Figure 4.17: Classification tree for ChargesBin5 (upper levels)
Figure 4.18: Regression tree for Charges (charges per month)

The CART analysis for the quintiles of charges as a target variable has OVP as the most important predictor, followed by HospCount (72% as important), CMI (67% as important), and OVS (67% as important). The first split is an expected one: whether a patient was ever hospitalized. If not, one is in the lower quintiles; if so, one is in the highest quintiles of charges. If a patient was never hospitalized, then the number of outpatient services is the key predictor for being in the lower vs. the higher quintiles. If a patient was hospitalized, the number of hospitalizations becomes the key charge predictor. Figure 4.17 truncates the rather large tree at the third level, but there are no interesting surprises at those levels either.
The CART regression tree analysis for Charges (note that these are average monthly charges) has OVS as the most important predictor variable, followed by LOT (71% as important), CMI (51% as important), and LDLcount (50% as important). HbA1c Av came close to this (49% as important). The first CART split separates out one extraordinary outlier. Node 2 of Figure 4.18, which has average monthly charges of $361, splits by whether a patient was hospitalized 0-2 times ($283 average) or 3 or more times ($1,149 average). Note that the standard deviations are large, so these differences cannot be considered statistically significant. Nevertheless, these are expected trends. For those with HospCount ≤2 in node 3, the key splitting criterion is whether one was in this healthcare system for ≤2.8 months or longer. For the small group (n = 26) with LOT ≤2.8, the high mean charges in node 4 are expected. This is similar in node 42. A commonly recognized scenario is a patient with a recent history of inadequate medical care coming into the healthcare system because of newly acquired health insurance. This may be due to turning 65 years of age and so qualifying for Medicare or a Medicare HMO, or to a new job that has health insurance benefits. In these cases expenses are high in the initial months to establish baseline testing and to get diabetes under control.
4.3.6 Medical quality index

The Medical Quality Index (MQI) is a calculated variable using Equation 4.1, based on Table 3.3 from the April 2001 consensus statement of the AMA, JCAHO, and NCQA. It is also a combination of HEDIS measures. Three target variables are used: a binary MQIbin2 that divides the group at the MQI mean, an ordinal MQIbin5 that divides the group into not fully equal quintiles, and the continuous MQI.
1 = above average MQI
Figure 4.19: Classification tree for MQIbin2

Figure 4.19 shows the CART analysis for MQIbin2 with the predictor variables of age, sex, ER, ERbin2, ERbin5, OVS, OVP, LipidDx, HTN, CV, EYE, RD, Podiatry, Ophtho, FPIM, CMI, CMIbin5, HbA1c Count, HbA1c Av, HbA1c Av95, HbA1c Av80, HbA1c Av70, HbA1c AvSlope, HbA1c AvSlopeBin2, HbA1c AvAdjSlope, HbA1c AvAdjSlopeBin2, HospCount, HospDeath, HospCountBin2, HospCountBin5, LDLdone, LDLAv, LDL130, LDLcount, UrineTestDone, UrineTestCount, Insulin, Metformin, Precose, Sulfonylurea, Troglitazone, Monotherapy, InsulinOral, ACEI, ARB, ACEIandARB, LipidDrug, Steroids, and LOT. The most important predictor variable was Ophtho, followed by LDLcount (88% as important) and LDLdone (54% as important). While there are some interesting findings here, such as the overwhelming importance of Ophtho, much of this is determined by the formula for MQI in Equation 4.1. The analyses for MQIbin5 and MQI are similar, though they produce much larger trees.
It might be more interesting to look at the MQI target variables without predictors that occur in its formula. This gives the CART analyses in Figures 4.20, 4.21, and 4.22. With the binary target variable, the most important predictor is OVS, followed by LipidDx, CMI, OVP, and CMIbin5 (77%, 72%, 72%, and 56% as important, respectively). The first split was on LipidDx, dividing the group into those without a diagnosis of a lipid abnormality, where 28% have an above-average MQI, and those with a lipid diagnosis, where 59% have a higher-than-average MQI. As expected, more outpatient services are associated with a higher MQI.

For the MQIbin5 ordinal target variable excluding equation predictors, the CART analysis produces an extremely large tree (see Figure 4.21); details are shown only for the upper levels. The most important predictor variable is OVS, followed by OVP (65% as important) and CMI (53% as important). As with MQIbin2, having a lipid diagnosis is the first split, and this diagnosis is associated with higher MQI rankings. As expected, a higher number of outpatient services means higher scores.

Finally, the continuous MQI variable excluding equation predictors uses a CART regression tree analysis. The most important predictor is OVS, followed by LipidDx (72% as important), OVP (62% as important), and CMI (51% as important). As above, the first split is by LipidDx, and mean scores tend to be higher if there is a lipid disorder diagnosed.
4.3.7 Summary

In this section on data mining results, CART analyses were reported for a number of target variables. For glycemic control, younger age is a key predictor of bad glycemic control, and HbA1c slopes gave little insight. For emergency department visits, the comorbidity index was a key predictor of higher utilization,
Figure 4.20: Classification tree for MQIbin2 without equation variables

Figure 4.21: Classification tree for MQIbin5 without equation variables

Figure 4.22: Regression tree for MQI without equation variables
patients seen by internists had higher utilization than those seen by family practitioners even though CMI and OVP were similar, and there was no evidence that poor primary care access was associated with higher utilization. For hospitalizations, cardiovascular disease was a key predictor of whether a diabetic patient was hospitalized, and renal disease was a predictor of the number of hospitalizations. For hospital mortality, renal disease was an extremely strong predictor. For charges, there were the obvious results of hospitalizations and more visits being associated with higher charges. For the medical quality index, once the variables in the MQI formula were removed, having a diagnosis of dyslipidemia was associated with a higher MQI.
4.4 Knowledge discovery

4.4.1 Introduction

Data mining's goal is to discover new knowledge that can be useful. In the context of this study, to be useful is to improve outcomes. In this section, two discovered knowledge items that came from the data mining process are reviewed in detail. In addition, some data mining results that are not "new knowledge" are also reviewed, since rediscovering them tends to validate the methodology.
4.4.2 Younger age predicts HbA1c >9.5

Younger age (≤65 years old) predicted an average HbA1c >9.5 regardless of whether drugs were included as predictors. This is an important cut-point that HEDIS uses as an external monitor. The first level of the non-drug tree in Figure 4.2 shows that, dividing people at an age cut-point of 65 years old, 14.83% of younger people (n = 2,306) have a bad HbA1c >9.5. This is 3 times the rate of bad HbA1c values in those who are older (4.99%, n = 2,785).
If the odds of a bad HbA1c in the younger and older groups are calculated, one can then calculate an odds ratio (OR). Going to the combined test and learning data (n = 10,240) and using 65 years of age as the cut-point, the odds of a bad HbA1c in the younger group (≤65 years old, n = 4,759) is 711/(4,759−711) = 0.1756. The odds of a bad HbA1c in the older group (n = 5,481) is 253/(5,481−253) = 0.0484. Therefore the odds ratio that someone is less than 65 years old if they have a bad HbA1c (average reading >9.5) is 0.1756/0.0484 = 3.6281 (Fos & Fine, 2000). The 95% confidence interval for this odds ratio, computed with the EpiInfo2000 calculator, is (3.12, 4.23), and thus very significant since the range does not include 1.00. This is surprising information to most clinicians. It is not in either of the gold standard textbooks discussed in the definition of new knowledge in Chapter 1. This identifies a cluster of deviance that can be targeted to improve outcomes, which will be evaluated below.
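This odds ratio arithmetic can be reproduced directly from the 2x2 counts. A sketch using the Woolf (log-based) confidence interval follows; the function name is ours, and since the text's interval was computed with EpiInfo2000, its limits differ slightly from this large-sample approximation:

```python
from math import log, exp, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf 95% CI for a 2x2 table:
    a = exposed cases, b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases."""
    or_ = (a / b) / (c / d)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo, hi = exp(log(or_) - z * se), exp(log(or_) + z * se)
    return or_, lo, hi

# Younger (<=65, bad HbA1c: 711 of 4,759) vs. older (253 of 5,481):
or_, lo, hi = odds_ratio_ci(711, 4759 - 711, 253, 5481 - 253)
print(round(or_, 2), round(lo, 2), round(hi, 2))  # 3.63 3.13 4.21
```

The same function applied to the renal disease counts reported in the next knowledge item (94 of 1,087 vs. 81 of 9,153) likewise reproduces an OR near 10.6.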
4.4.3 Outpatient access does not prevent ER use

There is a common belief among clinicians and administrators that many emergency department visits are caused by poor access to outpatient providers. This is largely based on common sense and public perceptions; a search of Medline for "access" and "emergency" showed most articles on this topic were studies in children (Franco, Mitchell, & Buzon, 1997). The Medline search found no evidence base for this belief among adults with chronic diseases such as diabetes.

CART analyses on ER, ERbin2, and ERbin5 do not support this common belief. In the regression tree analysis of the continuous variable, the classifiers in nodes 1, 2, and 3 were the comorbidity index (Figure 4.12). A similar pattern was found in the binary (ERbin2) and ordinal (ERbin5) target variables using classification trees. If the common belief were true, one would expect to find a trend in the CART analyses associating higher emergency department utilization with lower OVP. Yet the CART analyses found the opposite: in Figure 4.10, nodes 3 and 4, having more provider visits was associated with a greater chance of having an emergency department visit. A similar pattern is found in Figure 4.11, node 3. One might explain this by noting that more complicated patients see their physicians more often and, because of their comorbidities, also have higher emergency department utilization.
4.4.4 Renal disease predicts diabetic hospital deaths

As noted in the CART analysis of Figure 4.15, node 1, the best classifier of hospital mortality is whether a patient has renal disease (9.8% hospital mortality vs. 0.8%). The parent node (n = 5,091, 89 deaths, 1.75%) is split into a child node without renal disease (n = 4,562, 37 deaths, 0.81%) and a child node with renal disease (n = 529, 52 deaths, 9.83%).

If the odds of hospital mortality in those with and without renal disease are calculated, an OR can be obtained. Going to the combined test and learning set data on 10,240 patients and using the presence or absence of renal disease as the cut-point, the odds of hospital mortality in the renal disease group (n = 1,087 and 94 deaths) is 94/(1,087−94) = 0.0947. The odds of hospital mortality in the non-renal disease group (n = 9,153 and 81 deaths) is 81/(9,153−81) = 0.0089. Therefore the OR that a diabetic patient had renal disease if they died in the hospital is 0.0947/0.0089 = 10.6. The 95% confidence interval for this odds ratio is (7.74, 14.53), and thus very significant. This identifies a cluster of deviance that can be targeted to improve outcomes, which will be evaluated below.

This trend of diabetics who died in the hospital being more likely to have renal disease would be expected by clinicians, although whether cardiovascular or renal disease is more important may be less clear, and the OR being this large
is not commonly known and has management implications for intervention. It is interesting that cardiovascular disease was not a splitter in any of the nodes of this CART analysis in Figure 4.15. One study showed that whether or not someone had diabetes was not a significant predictor of hospital mortality for those who had an acute myocardial infarction (Chyun, Obata, Kling, & Tocchi, 2000), while another concluded that "NIDDM cases were 1.73 (relative risk) times more likely to die of AMI than nondiabetic patients" (Gagliardino et al., 1997). One recent study noted that "people with diabetes and renal impairment had significantly higher mortality than people with diabetes alone, with a rate ratio of 7.27 [4.11, 12.86] for people with type 2 diabetes aged 40-59 years" (Roper, Bilous, Kelly, Unwin, & Connolly, 2002). However, mortality is not limited to inpatients in Roper's study, renal impairment is more narrowly defined, and this rate ratio decreases with age: RR 3.68 (3.03, 4.46) in those 60-79 years old, and RR 1.92 (1.60, 2.31) in those ≥80 years old. Thus the literature does support renal disease being more of a risk for death than cardiovascular disease (though these are connected, and both are vascular complications of diabetes). While the trend is not surprising or new, the extent of it is, especially with the more broadly defined renal disease used in this study. In this study's population, using the 10,240 patients in the test and learning sets, there are 3,193 patients aged 40-59 years old. There are 21 hospital deaths among the 256 patients with renal disease, and 5 deaths among the 2,932 without renal disease. This gives an OR of 52.40 (18.50, 159.97) that a 40-59 year old diabetic patient who dies in the hospital has renal disease. Since our definition of renal disease is broader than Roper's, one might expect our rate ratio to be less than 7.27.
Instead it is far higher at 52.40 (19.58, 140.23), significantly so since the 95% confidence intervals do not overlap.
A review of the gold standard textbooks discussed in the definition of new knowledge in Chapter 1 shows nothing beyond the more recent Roper article already cited. Thus, although the trend that renal disease increases mortality in diabetics is already known, the extremely high OR for hospital mortality using this broader definition of renal disease is relatively new knowledge based on the definitions in Chapter 1, and it can be targeted to improve outcomes below.
4.4.5 Results already known strengthen validity

The following results are generally known and accepted. While the items below are "old knowledge," their "discovery" tends to validate the CART methodology that arrived at these already well-known facts. This is not a complete list; more items can be gleaned from the CART analyses above.

High LDL is associated with poor glycemic control

Reviewing the CART analysis diagrams with a target variable of HbA1c Av cut-points, LDL average is a frequent splitter. For example, in nodes 2, 5, 8, and 17 of Figure 4.2, node 5 of Figure 4.1, and nodes 3 and 7 of Figure 4.4, if average LDL is higher, then glycemic control is worse.

ER visits predict hospitalizations

As noted in Figure 4.13, whether a patient was ever in the ER is the key predictor of whether they were hospitalized. The same is true for predicting hospital deaths and the number of hospitalizations, although those analyses are not shown in CART diagrams above. This simply reflects the fact that most admissions occur through the emergency department rather than through direct admits or transfers that bypass it.
Cardiovascular disease predicts hospitalization

As noted in the CART analysis of Figure 4.14, node 1, whether a diabetic patient has cardiovascular disease is the best classifier of whether they are hospitalized. Diabetic patients are 3 times as likely to be hospitalized (46% vs. 15%) if they have cardiovascular disease. This is not surprising and would be considered generally expected among clinicians.

More comorbidities mean more hospitalizations

As noted in the CART analysis of Figure 4.14, node 2, if the comorbidity index is in the upper 3 quintiles, there are 3 times as many hospitalizations (29% vs. 9%) as when comorbidities are in the bottom 2 quintiles. This is not surprising and would be considered generally expected among clinicians. A similar trend is also seen in node 12, in Figure 4.15 node 4, and in Figure 4.16 node 3.

More comorbidities mean more emergency department visits

As noted in all the CART analyses of ER, ERbin2, and ERbin5, a higher comorbidity index is associated with more emergency department visits. This would be considered generally expected among clinicians.

Predictors of higher charges

As noted in the CART analyses of ChargesBin5 and ChargesPerLOT, key predictors of higher charges are hospitalizations, the number of outpatient services, and having just recently entered this healthcare system. These are all expected.

More outpatient services are associated with a higher MQI score

Figures 4.20, 4.21, and 4.22 all show the trend that MQI scores are higher when there are more outpatient services. This is expected.
144
4.4.6
Summary This section reviewed discovered knowledge items that came from the data
mining process. It is new information that younger age predicts bad glycemic control in adult diabetics. Although it was known that renal disease predicts mortality, the extreme extent of the trend in hospital mortality using a broader definition of renal disease is new. In addition, some of the data mining results that are “old knowledge” were reviewed. The CART analyses’ ability to “discover” these already known and obvious facts through this methodology tends to validate the method.
4.5 Evaluation of discovered knowledge

4.5.1 Introduction

Data mining’s goal in this study is to discover new knowledge that can
be useful to improve outcomes. For each of the discovered knowledge items, this section evaluates how the new knowledge item might improve outcomes. When possible, calculations are made based on projected interventions from the literature that are applied in a decision analysis to this population to calculate an outcomes improvement percentage.
4.5.2 Glycemic control

Younger age predicts bad control

Younger age significantly predicts bad diabetic control (OR = 3.63 [2.78, 3.77] with cut-points of 65 for age and HbA1c >9.5). This can assist clinic managers in targeting more focused interventions. This information is especially important because the younger group has so many more years of life left in which to develop diabetic complications from bad glycemic control. Figure 4.23 shows there are
[Figure 4.23: Pre-intervention glycemic control outcome]

253 + 711 = 964 patients in the learning and test set population with bad glycemic control, defined as average HbA1c >9.5. Presume managers use this information to develop a targeted intervention aimed at young people (≤65) that reduces their HbA1c by the same amount as the 2-year results of the Lovelace disease management program applied to all diabetics—an average of 1.8 HbA1c points over a 2-year period (Friedman et al., 1998). To be conservative, calculations will use only a 1.0 point drop. The process used is to go to the original data on the 4,759 young people, subtract 1 HbA1c point, recalculate each one, and then reclassify it as bad (>9.5) or less bad (≤9.5). The results are shown in Figure 4.24, which modifies the younger arm of Figure 4.23. The number of post-intervention patients with bad glycemic control is now 253 + 315 = 568. This is an outcomes improvement of 41%. The cost of targeting 4,759/10,240 = 46% of the population should be only about half that of targeting the whole population. Using the target variable of HbA1c >8.0, there is not much new information gleaned. Age continues to be the most important predictor, and splits the non-drug parent node at a bit below 65 years of age. Additional age partitions are the splitters for each of the child nodes. In the drug tree, monotherapy is the first
[Figure 4.24: Post-intervention glycemic control outcome]

splitter, and 25% of those on monotherapy have HbA1c >8 while 38% of those not on monotherapy have HbA1c >8. Unfortunately, those not on monotherapy are a diverse group that includes those on diet and exercise alone as well as those on multiple diabetic medications. While this gives us some clues to further explore the data warehouse, on the surface it is not obvious how to use this given the diversity of the non-monotherapy group. Age was the second most important predictor even in the drug group.

Monotherapy

The monotherapy relationship in glycemic control includes the following:

• In Figure 4.1, a lower percentage of those on monotherapy have average HbA1c values >9.5 (11% in n = 3,254 vs. 7% in n = 1,837).
• In Figure 4.3, a lower percentage of those on monotherapy have average HbA1c values >8 (25% in n = 2,033 vs. 38% in n = 3,058).
• In Figure 4.6, a lower percentage of those on monotherapy have average HbA1c values >7 (54% in n = 3,715 vs. 81% in n = 1,376).
Table 4.10: Further breakdown of monotherapy data

                                          Not on monotherapy
HbA1c             Monotherapy             no drugs               multiple drugs
All (n = 6,425)   n = 2,561 (39.86%)      n = 1,145 (17.82%)     n = 2,719 (42.32%)
>7 (n = 3,828)    n = 1,375 (35.92%)      n = 252 (6.58%)        n = 2,201 (57.50%)
>8 (n = 1,949)    n = 589 (30.22%)        n = 66 (3.39%)         n = 1,294 (66.39%)
>9 (n = 857)      n = 229 (26.72%)        n = 23 (2.68%)         n = 605 (70.60%)

These somewhat different percentages depend on the cut-point. The difficulty in interpreting this is that the first split in the CART analysis is finding the best cut-point of the best predictor to classify most accurately who has the target variable, i.e., bad glycemic control at various cut-points. The different numbers of patients in each of the categories above, and the different target variables, mean that these are not apples-to-apples comparisons. If these monotherapy results are taken as an interesting clue and one returns to the original data, Table 4.10 is obtained; it is limited to those patients in the combined test and learning sets for whom medication data are available. The table shows the expected trends: as the HbA1c rises, monotherapy decreases, diet and lifestyle alone decreases, and multiple medications increase. There is little here to devise an intervention to improve outcomes.
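The row percentages in Table 4.10 can be recomputed directly from the raw counts. A minimal sketch (the `row_percentages` helper is hypothetical, not from the dissertation):

```python
# Recompute the Table 4.10 row percentages from the raw counts in the text.
# Column order: monotherapy, no drugs, multiple drugs.
rows = {
    "All": (2561, 1145, 2719),
    ">7":  (1375, 252, 2201),
    ">8":  (589, 66, 1294),
    ">9":  (229, 23, 605),
}

def row_percentages(counts):
    """Share of each therapy group within an HbA1c band, as percentages."""
    total = sum(counts)
    return [round(100 * c / total, 2) for c in counts]

for label, counts in rows.items():
    print(label, row_percentages(counts))
# e.g. the ">7" row yields [35.92, 6.58, 57.5], matching the published table
```

This confirms the table's internal consistency: each band's percentages are simple shares of that band's total.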
4.5.3 Emergency department visits

Though there is a commonly held belief that emergency department utilization is high partly due to poor primary care access, the literature focuses largely on children, and there is little beyond common sense to support the belief among adult diabetics. The CART analyses actually show the opposite—more provider visits were associated with a greater chance of having an emergency department visit. One explanation may be that more complicated patients see their physicians more often, and because of their comorbidities also have higher emergency department utilization. Hence it appears that diabetic patients’ chances of going to the emergency department are associated with comorbidity, not with provider access. This “new knowledge item” may be useful to managers in strategic planning. For the purpose of decreasing emergency department utilization among adult diabetics, it is probably not worth pouring resources into improved primary care access. While this item is difficult to transform into a specific outcomes improvement percentage that can be calculated, it does emphasize to managers that to reduce emergency department visits among adult diabetics it may be most important to control diabetes well and prevent its complications from developing as much as possible.
4.5.4 Hospital deaths

Renal disease, a common microvascular complication of diabetes, significantly predicts diabetic hospital mortality (OR = 10.6 [7.74, 14.53]), especially in the 40–59 age range (OR = 52.40 [19.58, 140.23]). This can be used to target focused interventions to improve outcomes. According to the American Diabetic Association:

    Epidemiological analysis of the UKPDS data showed a continuous relationship between the risks of microvascular complications and glycemia, such that for every percentage point decrease in HbA1c (e.g., 9 to 8%), there was a 35% reduction in the risk of complications. (American Diabetic Association, 2001)

HbA1c being reduced by 1 point leading to a 35% reduction in microvascular complications such as renal disease can be linked to the new information about
[Figure 4.25: Epidemiology of renal disease and glycemic control, n = 10,240]

renal disease’s high OR for hospital mortality to devise an intervention to reduce hospital mortality. Figure 4.25 outlines the epidemiology of renal disease and glycemic control in these 10,240 diabetic patients, dividing them into those with HbA1c >6.2 that can be intervened upon, and those with lower values. The actual hospital mortality rate for this population was 94/1,087 = 0.08648 if renal disease was present, and 81/9,153 = 0.00885 if it was not. There were 175 hospital deaths, 94 of these (54%)
are in those with renal disease; 81 of the total deaths (46%) are in those with renal disease and above normal HbA1c values that can form the intervention group. Presume managers use this information to organize an intervention for diabetics with above normal HbA1c values that reduces their HbA1c by the same as the 2-year results of the Lovelace disease management program applied to
all diabetics—an average of 1.8 HbA1c points (Friedman et al., 1998). Again, to be conservative, the calculations here use a 1.0 point drop in HbA1c. Based on the UKPDS study, one expects a 35% reduction of renal disease, which is a microvascular complication of diabetes, and this reduction translates into an eventual reduction of hospital mortality. This change would occur over time in the diabetic population, where a certain percentage of people change insurance annually and leave this particular health system, and others enter. The fact that renal disease should also be reduced in intensity among those who retain renal disease in the intervention group, and hence have a smaller hospital mortality rate, is ignored. Figure 4.26 shows the results of this intervention on hospital mortality. The number of expected hospital deaths should eventually drop from 175 to 150. This is a 14% improvement in hospital mortality outcomes. Although the intervention and its effects on renal disease are already known, the impressive impact on hospital mortality, because renal disease is the main classifier in Figure 4.15, is the relatively new information that provides this link.
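A minimal sketch of this decision-analysis arithmetic (not the author's code; the counts and mortality rates are those quoted above, and the 35% renal disease reduction per 1-point HbA1c drop is the UKPDS figure):

```python
# Decision-analysis arithmetic behind the renal disease intervention.
rate_rd = 94 / 1087      # hospital mortality rate with renal disease (~0.08648)
rate_no_rd = 81 / 9153   # hospital mortality rate without renal disease (~0.00885)

deaths_rd_high = 81      # deaths: renal disease and HbA1c > 6.2 (intervention arm)
deaths_rd_low = 94 - 81  # deaths: renal disease and HbA1c <= 6.2 (not intervened)
deaths_no_rd = 175 - 94  # deaths: no renal disease

n_arm = deaths_rd_high / rate_rd   # implied size of the intervention arm
moved = 0.35 * n_arm               # 35% of the arm avoids renal disease

# Those who avoid renal disease shift to the lower mortality rate.
new_arm_deaths = (n_arm - moved) * rate_rd + moved * rate_no_rd
total_deaths = new_arm_deaths + deaths_rd_low + deaths_no_rd
print(round(total_deaths))  # 150, i.e. a 25/175 = 14% mortality improvement
```

The sketch reproduces the drop from 175 to roughly 150 expected hospital deaths.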
4.5.5 Summary

In this section, the discovered knowledge was evaluated for its potential in
outcomes improvement. A targeted intervention at those ≤65 years of age to reduce HbA1c by 1 point would yield a 41% reduction in the number with average HbA1c >9.5. Emergency department utilization might be reduced by better diabetic control that prevents complications, rather than by improving outpatient provider access. A targeted intervention at those with above normal HbA1c values to reduce it by 1 point would yield an eventual 14% reduction in hospital mortality.
[Figure 4.26: Post-intervention hospital mortality, n = 10,240]
4.6 Final combined model to improve outcomes

4.6.1 Introduction

Similar to the concept of combining variables that are significant in a
univariate analysis into a multivariate model, the final combined model takes the previous interventions dealing with single outcome measures and combines them into one final model. This section specifies the model, applies it on the learning and test set data to calculate its potential to improve outcomes, and does a final test of the model on the one-third of the data that has been set aside for this purpose.
4.6.2 The final combined model

The final combined model will include the univariate interventions to improve outcomes discovered above:
1. A cost-effective targeted intervention aimed at those ≤65 that reduces by 41% the number of patients with bad glycemic control (average HbA1c >9.5) by targeting less than half the population.

2. A targeted intervention aimed at those diabetic patients without renal disease who have above normal HbA1c (>6.2) to reduce HbA1c averages by 1 point, which brought about an outcomes improvement of a 14% reduction in hospital mortality by reducing the development of renal disease.

Since these interventions overlap, the final combined model will be a targeted intervention to reduce HbA1c averages in all those with HbA1c >6.2. Two outcomes will be monitored: (1) the percentage of those with HbA1c >9.5, and (2) the projected decrease in hospital mortality by decreasing renal disease. The initial univariate interventions presumed a very conservative 1.0 point drop in HbA1c, based on the actual Lovelace Clinic results of 1.8 points. The final model will use a more realistic 1.5 points with a sensitivity evaluation of plus or minus 0.5 points. The model will first be tested on the combined test and learning sets to see the expected outcome, and then on the hold-out set of “virgin” data that has not been used in any way in either formulating the model or testing it.
4.6.3 Results of the model on learning and test sets

Figure 4.27 shows the epidemiology of the population (n = 10,240) partitioned so that interventions and outcomes can be evaluated. Note that there are 175 hospital deaths in total, 150 occurring in those with HbA1c >6.2. Note also that there are 964 patients with HbA1c >9.5. The diabetic intervention program described by Friedman (1998) is again applied to those with HbA1c >6.2 to achieve an expected 2-year reduction of
[Figure 4.27: Epidemiology of the learning & test set population]
1.5 points. For every 1 point reduction in HbA1c, there is a 35% reduction in microvascular complications like renal disease. Hence one expects a 46.38% eventual reduction in renal disease with a 1.5 point drop in HbA1c, compounding the 35% per-point reduction (the detailed calculation follows below). The numbers above and below HbA1c cut-points are calculated by going to the original HbA1c data, subtracting 1.5 points, and then recalculating the numbers of people in each category. The rates of renal disease and hospital mortality for each combination of renal disease (yes, no) and HbA1c level (≤6.2, >6.2 but ≤9.5, >9.5) are obtained from the population’s baseline epidemiology in Figure 4.27. These results are modeled in Figure 4.28. To give one calculation example, look at the post-intervention population where HbA1c >6.2 and ≤9.5. By going to the original data and reducing each HbA1c value by 1.5 points, n = 3,755. In this HbA1c category, the baseline model had 800/7,608 = 10.5152% with renal disease, so n = 395 for +RD in this category.
Multiply this by the renal disease reduction factor of 35% for each HbA1c point. This means for the first point reduction the renal disease risk is decreased by 35% to 65% of the initial. For the next half point reduction the renal risk is reduced by
(1/2) × 35% = 17.5% of the 65% initial value. This is a total renal risk reduction
of 46.38%. Hence, there are 395 × (1 − 0.4638) = 212 patients in this HbA1c postintervention category who have renal disease. To calculate how many of these 212 are expected to have a hospital death one multiplies by the baseline model rate for this category of
67/800 = 8.375% to get 18 hospital deaths. This same process is
used for each post intervention branch. Note that the number of hospital deaths is now 134 rather than 175, a hospital mortality outcomes improvement of 23%. The number of patients with bad HbA1c >9.5 is now 274 rather than 964, an outcomes improvement of 72%.
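The single-category calculation above can be sketched as follows (a hedged illustration using only the numbers quoted in the text, not the author's code):

```python
# One branch of the final-model decision analysis (6.2 < HbA1c <= 9.5 category).
n_category = 3755                  # post-intervention patients in this band
rd_share = 800 / 7608              # baseline renal disease share (10.5152%)
n_rd = round(rd_share * n_category)             # 395 would have renal disease

# Author's compounding: the first point cuts renal risk by 35%; the next half
# point cuts 17.5% of what remains.
risk_reduction = 1 - 0.65 * (1 - 0.175)         # 0.46375, i.e. 46.38%
n_rd_post = round(n_rd * (1 - risk_reduction))  # 212 still develop renal disease

death_rate = 67 / 800                           # baseline mortality in this cell
deaths = round(n_rd_post * death_rate)          # 18 expected hospital deaths
print(n_rd, n_rd_post, deaths)
```

Running the same arithmetic over every branch reproduces the totals of 134 deaths and 274 patients with bad glycemic control reported above.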
[Figure 4.28: Post-intervention final model—test & learning sets]
The next step is a sensitivity analysis by varying the actual HbA1c reduction by 0.5 points in either direction from the 1.5 point reduction expected for the intervention. The previous calculations are repeated, but using a 1.0 point reduction (associated with a 35% renal disease risk reduction); and then repeated again with a 2.0 point reduction (with a 57.75% renal risk reduction). This will give a sensitivity confidence interval for the outcomes improvements expected. The results are modeled in Figures 4.29 and 4.30. For the lower bound, the number of hospital deaths is now 143 rather than 175, a hospital mortality outcomes improvement of 18%. The number of patients with bad HbA1c >9.5 is now 419 rather than 964, an outcomes improvement of 57%. For the upper bound, the number of hospital deaths is now 123 rather than 175, a hospital mortality outcomes improvement of 30%. The number of patients with bad HbA1c >9.5 is now 163 rather than 964, an outcomes improvement of 83%. The final model applied to the 10,240 diabetic patients in the combined test and learning sets gives the following outcomes improvements:

• A reduction of hospital deaths from the baseline 175 to 134 (123, 143), which is a hospital mortality outcomes improvement of 23% (18%, 30%).
• A reduction in the number of patients with bad HbA1c values >9.5 from the baseline 964 to 274 (163, 419), which is a glycemic control outcomes improvement of 72% (57%, 83%).

The numbers in parentheses refer to the sensitivity analyses’ upper and lower bounds when the intervention expectation of a 1.5 point drop in HbA1c is varied down to a 1.0 drop and up to a 2.0 drop. Since the conclusion of these outcomes improvements is not affected by the sensitivity analysis within this reasonable range, it is said to be “insensitive” to this range of values (Petitti, 2000).
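The renal-risk-reduction rule used across the 1.0, 1.5, and 2.0 point scenarios can be written as one hypothetical helper (a sketch of the rule implied by the text: each whole HbA1c point removes 35% of the remaining risk, and a fractional point removes the corresponding fraction of 35% from what remains):

```python
import math

def renal_risk_reduction(points: float) -> float:
    """Eventual fractional reduction in renal disease for an HbA1c drop."""
    whole = math.floor(points)
    frac = points - whole
    remaining = (0.65 ** whole) * (1 - 0.35 * frac)
    return 1 - remaining

for p in (1.0, 1.5, 2.0):
    print(p, renal_risk_reduction(p))
# 1.0 -> 35%, 1.5 -> 46.38%, 2.0 -> 57.75%, matching the sensitivity bounds
```

This matches the three risk reductions quoted for the sensitivity analysis.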
[Figure 4.29: Post-intervention final model, lower bound—test & learning sets]

[Figure 4.30: Post-intervention final model, upper bound—test & learning sets]
4.6.4 Final test of the model on “virgin” data

The final step in the analysis of results is to repeat the previous section’s
calculations on the “virgin” data that was set aside and not used in any way in coming to the model’s conclusions. This 3-fold cross validation approach gives some impression of the model’s usefulness when applied to other data that was not used in its construction or testing. Figure 4.31 shows the epidemiology of the virgin population (n = 5, 153) partitioned so that interventions and outcomes can be evaluated. Note that there are 75 hospital deaths in total, 66 occurring in those with HbA1c >6.2. Note also that there are 504 patients with an average HbA1c >9.5. The diabetic intervention program described by Friedman (1998) is applied to those with HbA1c >6.2 to achieve a 2-year reduction of 1.5 points. For every 1 point reduction in HbA1c , there is a 35% reduction in microvascular complications like renal disease. Hence, a 46.38% eventual reduction in renal disease is expected. The numbers above and below HbA1c cut-points are calculated by going to the original HbA1c data, subtracting 1.5 points, and then recalculating the numbers. The rates of renal disease and hospital mortality for that combination of renal disease (yes, no) and HbA1c level (≤6.2, >6.2 but ≤9.5, >9.5) are obtained from the baseline epidemiology in Figure 4.31. These results are modeled in Figure 4.32. Calculations use the same techniques as the previous section. Note that the number of hospital deaths is now 49 rather than 75, a hospital mortality outcomes improvement of 35%. The number of patients with bad HbA1c >9.5 is now 153 rather than 504, an outcomes improvement of 70%. The next step is a sensitivity analysis by varying the actual HbA1c reduction by 0.5 points in either direction. Recalculate the analysis using a 1.0 point reduction (associated with a 35% renal disease risk reduction) and then with
[Figure 4.31: Epidemiology of the “virgin” set population]

[Figure 4.32: Post-intervention final model—virgin set]
a 2.0 point reduction (with a 57.75% renal risk reduction). This will give a sensitivity confidence interval for the outcomes improvement expected. When these are done, the adjustments are seen in Figures 4.33 and 4.34. For the lower bound, the number of hospital deaths is now 56 rather than 75, a hospital mortality outcomes improvement of 25%. The number of patients with bad HbA1c >9.5 is now 235 rather than 504, an outcomes improvement of 53%. For the upper bound, the number of hospital deaths is now 38 rather than 75, a hospital mortality outcomes improvement of 49%. The number of patients with bad HbA1c >9.5 is now 90 rather than 504, an outcomes improvement of 82%. The final model applied to the 5,153 diabetic patients in the virgin data set results in the following outcomes improvements:

• A reduction of hospital deaths from the baseline 75 to 49 (38, 56), which is a hospital mortality outcomes improvement of 35% (25%, 49%).
• A reduction in the number of patients with bad HbA1c values >9.5 from the baseline 504 to 153 (90, 235), which is a glycemic control outcomes improvement of 70% (53%, 82%).

The numbers in parentheses refer to the sensitivity analyses’ upper and lower bounds when the intervention expectation of a 1.5 point drop in HbA1c is varied down to a 1.0 drop and up to a 2.0 drop. Since the conclusion of outcomes improvement is not affected by the sensitivity analysis within this reasonable range, it is said to be “insensitive” to this range of values (Petitti, 2000).
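The virgin-set improvement percentages follow directly from the baseline and post-intervention counts; a minimal sketch (the `improvement` helper is hypothetical):

```python
# Recompute the virgin-set outcomes improvements from the counts in the text.
def improvement(baseline: int, post: int) -> float:
    """Fractional reduction from baseline to post-intervention."""
    return (baseline - post) / baseline

print(f"{improvement(75, 49):.0%}")    # hospital deaths: 35%
print(f"{improvement(504, 153):.0%}")  # patients with HbA1c > 9.5: 70%
```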
4.6.5 Benefits of random sampling

There were no quotas for categories since one does not know what the
relevant variables will be. Besides, random sampling is meant to be random! With a large data set, one expects most categories within the sample to be more
[Figure 4.33: Post-intervention final model, lower bound—virgin set]

[Figure 4.34: Post-intervention final model, upper bound—virgin set]
or less evenly divided. And in fact, one sees in the decision tree diagrams that the number of subjects with HbA1c >9.5 or renal disease is about half as large in S3 as in S1 + S2. This tends to be reassuring about the random sampling techniques, and highlights one of the benefits of random sampling: randomization is a method that allows one to account quantitatively for the potential confounding produced by unmeasured determinants of the outcomes (Rothman & Greenland, 1998, page 144). As reviewed in methodology section 3.10, the data were split into 3 sets by random sampling. The technique used was detailed at the end of section 4.2.2—the function INT(RAND()*3+1) in Excel was used to generate random numbers 1, 2, or 3 to form the variable SET. Sets 1 and 2 were used by the CART software as test and learning sets. Set 3 was put aside for use later in applying the final combined model on “virgin data.”
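The three-way split described above can be sketched in a few lines (the author used Excel's INT(RAND()*3+1); this reproduces the idea in Python, with the total patient count taken as the sum of the learning/test and virgin sets, 10,240 + 5,153):

```python
import random

random.seed(2002)  # arbitrary seed, for reproducibility only
assignment = {pid: random.randint(1, 3) for pid in range(15393)}

learning_test = [p for p, s in assignment.items() if s in (1, 2)]  # CART sets 1 & 2
virgin = [p for p, s in assignment.items() if s == 3]              # held-out set 3
print(len(learning_test), len(virgin))  # roughly a 2:1 split
```

As with the Excel approach, each patient is assigned independently, so the three sets are only approximately equal in size.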
4.6.6 Summary

This section constructed the final combined model to improve outcomes
based on the “knowledge discovery items” developed earlier in this chapter. The model, a targeted intervention to reduce HbA1c in all those with HbA1c >6.2, monitored (1) the percentage of those with HbA1c >9.5, and (2) the projected decrease in hospital mortality by decreasing renal disease. Both of these outcomes showed major improvements that were insensitive to reasonable changes in the expected impact of the intervention.
4.7 Local management and clinicians

The “new knowledge items” were presented to clinicians and management
at the institution that owns the diabetic data warehouse, in the survey reproduced
in Appendix F. It was sent by email to 45 clinicians and 27 managers on February 25 & 26, 2002. Through the survey, the local clinicians and managers gave their opinions about whether these new knowledge items were perceived locally as either new or useful. In keeping with good design principles for email surveys (Dillman, 2000, p. 367–372), each email was sent personally to the recipient and was not part of a mass mailing, the cover letter was brief so the key survey questions could be seen without having to scroll down the page, respondents were asked to place X’s inside brackets to indicate their answers, alternative methods of responding were listed, and follow-up reminder messages included a replacement questionnaire. An email reminder was sent out 2–3 days after the original email. A week after the original email, a printed copy of the survey was sent by interoffice mail to non-responders. The printed copy had a personal handwritten note asking the individual to please fill it out and return it if they had not already done so by email. This was meant to capture the people who did not use email, as well as serving as a final written reminder to all those who had not yet returned the email survey. The results below were tabulated based on returned surveys as of March 16th, 18–19 days after the initial email survey was sent out. The response rate was 68% overall, 73% among clinicians, and 59% among managers, some of whom were clinicians in the past or practice part-time now. Five of the managers and 3 of the clinicians who responded self-identified as both a clinician and a manager. Table 4.11 shows the percentage of people who said the 3 new knowledge items were either new or useful. The items are:

1. Adult diabetics with HbA1c average >9.5 were 3.2 (95% CI: 2.78, 3.77) times more likely to be ≤65 years of age.
2. Adult diabetic patients with more frequent outpatient visits did not have less chance of an ER visit.

3. Adult diabetic patients who die during hospitalization are 10.6 (95% CI: 7.74, 14.55) times more likely to have renal disease than not.

Table 4.11: Percentage saying the new knowledge items were new or useful

Item  Group        Is item new?        Is item useful?
#1    all          82% (p = 0.0012)    73% (p = 0.0214)
      clinicians   77% (p = 0.0340)    63% (NS)
      managers     91% (p = 0.0478)    91% (p = 0.0478)
      both         88% (NS)            88% (NS)
#2    all          84% (p = 0.0005)    67% (NS)
      clinicians   80% (p = 0.0179)    50% (NS)
      managers     91% (p = 0.0478)    91% (p = 0.0478)
      both         88% (NS)            100% (p = 0.0367)
#3    all          49% (NS)            86% (p = 0.0002)
      clinicians   53% (NS)            77% (p = 0.0340)
      managers     55% (NS)            100% (p = 0.0135)
      both         25% (NS)            100% (p = 0.0367)

Table 4.11 gives the results for all respondents (n = 49), and results for the following subgroups:

• clinicians: physicians, nurses, or pharmacists who self-identified as clinicians only (n = 30)
• managers: managers who self-identified as managers only (n = 11)
• both: those who self-identified as both clinicians and managers (n = 8)

The p-values in Table 4.11 are for α = 0.05. They are calculated using the difference in proportions test of Statistica ’99, found under basic statistics, analysis, and then other significance tests. The comparison proportion is equal to 0.5 (that is, a tie vote) with N1 = N2. Thus, a significant result means that the percent saying yes is significantly different from 50%, that is, a statistically
significant majority or minority of the group. A two-sided test is used, since there is no knowledge of whether the expected result is above or below 50%. Statistica ’99 computes the p-level based on the t-value for the respective comparison: |t| = (|p1 − p2| / √(p × q)) × √(N1 × N2 / (N1 + N2)), where p = (p1 × N1 + p2 × N2)/(N1 + N2) and q = 1 − p. The degrees of freedom are computed as N1 + N2 − 2. NS means the percentage is not significantly different from 50% (p > 0.05), that is, a tie vote. Summarizing the results shown in Table 4.11, the majority of those polled thought all 3 new knowledge items were useful, and these were statistically significant findings for items #1 and #3. A significant majority thought items #1 and #2 were new. These findings were generally the same in all three subgroups when significance could be reached with the smaller numbers. Years in practice (for those who were clinicians in the past or present) was used to divide the group into those above (n = 25) and below (n = 17) the average years in practice of 14.8. The only significant difference was in item #2, where a significantly larger percentage (76% vs. 47%, p = 0.0307³) of those with more experience thought this item would be useful. Sex was used to divide the group into males (n = 28) and females (n = 21), and two significant differences were found. A significantly larger percentage (95% vs. 57%, p = 0.0023) of females thought item #1 was useful. A significantly larger percentage (64% vs. 29%, p = 0.0096) of males thought item #3 was new. New knowledge item #3 was already known regarding the topic, but the extent of the high odds ratio was the new information. As one male respondent said in an unsolicited comment on the survey: “These are very interesting. I did a lot of diabetes work and all of these facts surprised me. I knew #3, but was surprised at the magnitude of the difference.” While this may be the case for most responders,
³ Same method using Statistica ’99 as detailed in the text above in calculating p-values for Table 4.11, but using a one-tailed test since two percentages are given with one seemingly larger, and the test is determining if it is significantly larger.
perhaps more females than males interpreted the question to focus on the topic rather than the magnitude. These results allow a judgment about research hypothesis 2: managers and clinicians found the new knowledge items useful.
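The Statistica formula above is straightforward to reproduce; a sketch (hypothetical helper, checked against the "all respondents, item #1 new" cell of Table 4.11, where 82% of n = 49 is compared against 0.5 with N1 = N2):

```python
import math

def two_proportion_t(p1: float, p2: float, n1: int, n2: int) -> tuple[float, int]:
    """Difference-in-proportions |t| statistic and degrees of freedom."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)   # pooled proportion
    q = 1 - p
    t = abs(p1 - p2) / math.sqrt(p * q) * math.sqrt(n1 * n2 / (n1 + n2))
    return t, n1 + n2 - 2

t, df = two_proportion_t(0.82, 0.50, 49, 49)
print(round(t, 2), df)  # t is about 3.34 with 96 df; Statistica reports p = 0.0012
```

Converting t = 3.34 with 96 degrees of freedom to a two-sided p-value recovers the 0.0012 shown in the table.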
Chapter 5

Discussion

5.1 Introduction

The previous chapter presented the results of this study. This chapter
discusses the issue of validity, whether the research hypotheses from Chapter 1 were proved or disproved, comparisons with standard statistical approaches, limitations, future directions, and final conclusions including its impact on health systems management.
5.2 Validity

5.2.1 Introduction

Any study raises issues of validity. There is the internal validity of whether
the study’s conclusions are valid for the population of the study. There is the external validity of whether the study’s conclusions are valid for other populations. This section will discuss the validity of this study of data mining a diabetic data warehouse to improve outcomes.
5.2.2 Internal validity

Iezzoni (1997) has outlined the types of validity. Table 5.1 is adapted from
her book, and will serve as the structure for the discussion on validity.
Table 5.1: Types of validity, from Table 6.2 of Iezzoni (1997)

Face validity
  Definition: A measure contains the types of variables that will allow it to do what it aims to do.
  Example: A method for adjusting for in-hospital mortality for acute myocardial infarction (AMI) includes clinical variables that on “face value” are the types of variables clinicians consider important risk factors.

Content validity
  Definition: A measure contains all relevant concepts.
  Example: A method for adjusting for in-hospital mortality from AMI includes all clinical variables that are important risk factors.

Construct validity
  Definition: A measure correlates with actual indicators of risk in the expected way.
  Example: A method for adjusting for in-hospital mortality from AMI correlates with actual measures of cardiac function.

Convergent validity
  Definition: A measure has a positive correlation with other indicators of actual risk.
  Example: When a method for adjusting for in-hospital mortality from AMI shows increasing risk, actual measures of cardiac functioning also show increasing risk.

Discriminant validity
  Definition: A measure has a stronger correlation with indicators specific to its purpose than with other indicators.
  Example: A method for adjusting for in-hospital mortality from AMI correlates more strongly with actual measures of cardiac function than with measures of ambulation.

Criterion validity
  Definition: A measure correlates with the “gold standard” measure.
  Example: A method for adjusting for in-hospital mortality from AMI correlates with a clinical scale derived from intensive, continuous cardiac monitoring.

Predictive validity
  Definition: A measure explains variations in outcomes.
  Example: A method for adjusting for in-hospital mortality from AMI predicts accurately which patients have died.

Attributional validity
  Definition: Findings using the measure permit one to make statements about the causes of what is observed.
  Example: In-hospital mortality rates, adjusted using the measure, permit one to attribute differences to effectiveness or quality of care.
Face validity

The CART methodology has especially good face validity among clinicians, which is one of its advantages and perhaps why it is so commonly used in medicine. The tree diagrams and recursive splitting are similar to the methodology physicians use in understanding medical categories. The variables included are most of the ones clinicians would think are relevant for the target variables used. The decision analysis approach of determining the impact of interventions also has strong face validity.

Content validity

There is some limitation to content validity, since not all relevant variables are electronically stored (such as BMI), thus precluding them from being part of this study. Although this reduces content validity, such exclusions are a necessary part of the methodology of studies like this that use only electronically stored data. The extent to which content validity is limited by this will vary from study to study, depending on which key variables are not present. The "new knowledge items" that result from this study are not the end product. They must be transformed, through extensive domain knowledge and health systems management techniques, into proposed interventions that can improve outcomes. These potential improved outcomes are then validated by a decision analysis approach with its tree diagrams and sensitivity analysis. This latter process has good content validity, with all of the relevant variables included.

Construct validity

The construct validity of the CART methodology itself is manifest when it results in a list of already known associations; these were listed in the previous chapter as a way of showing the validity of this data mining process (even though
these findings were not new knowledge to be used to improve outcomes). The nature of knowledge discovery, or finding new knowledge items from data mining, presumes something new is found, such as younger age being a key predictor of bad glycemic control. On the surface this conflicts with construct validity, since construct validity means a measure corresponds or correlates with what is already known. Yet other "discoveries," such as the high odds ratio for renal disease in hospital mortality, are already known, though the extent of the trend is new in this more broadly defined renal disease. Overall, so much is found that is already known as risk factors or predictors that construct validity appears moderately strong.

Convergent validity

This study has limited convergent validity. When the CART analysis shows, for example, that younger age is a strong predictor of bad glycemic control, an evaluation of this is done. The data are then analyzed using odds ratios and decision analysis to show the actual increased risk and the potential risk reduction from interventions. Other predictors of bad glycemic control are so multifactorial that it is difficult to show a correlation between younger age and those other factors that are either poorly defined or not electronically recorded (diet, exercise levels, BMI, attitude about chronic illness, etc.). For renal disease being strongly associated with hospital mortality, the story is different. Clinicians are well aware that renal disease is a bad prognostic indicator of diabetic microvascular complications. This finding would have good convergent validity. The actual "knowledge discovery items" may need to be considered separately for convergent validity in data mining studies, and the results may vary.
Discriminant validity

There is discriminant validity in this study. Glycemic control, for example, was not predicted by whether someone saw a podiatrist, but rather by more fundamental characteristics such as age and medications. However, discriminant validity is limited because the study intentionally used variables thought to be relevant to diabetes, drawn from a diabetic data warehouse. Thus, there is only a limited ability to separate out variables that are not relevant, since not many of these are part of the diabetic data warehouse.

Criterion validity

Each target variable may have its own gold standard measure, if any exists. For glycemic control, there is no gold standard for predicting it; clinicians consider it the result of dozens of interacting dynamics. These have not been set into an accepted formula for prediction, and a number of them would certainly be variables that are not electronically stored. The situation is similar for hospitalization deaths, although it is known that renal disease is a risk factor. Criterion validity may vary among target variables depending on the presence or absence of a gold standard measure with which to compare the study's results. For this study, criterion validity is limited.

Predictive validity

This study does have predictive validity. Each target variable has a list of the important predictor variables generated by the CART analysis, along with sensitivity, specificity, and predictive power (though these were not always presented for each analysis, since this study did not focus on predictive ability). The test on the "virgin" or held-back data set also strengthens this predictive
ability and measures the accuracy of the predictions. The CART methodology is strong in its predictive ability.

Attributional validity

This study does not have attributional validity. The nature of a retrospective study on observational data precludes any statements of causality.
5.2.3
External validity External validity is the extent to which the results of this study can validly be applied to
other populations. Another way to express this is: how accurate are the identified predictors when applied to other data sets that were not used to develop the model? At the first level, the ability to predict accurately on a new data set is already tested by the 3-fold cross-validation methodology. The reason for a "virgin" data set that was not used in the CART analysis or model development was precisely to test the model on new data. This was done, and the model was validated. At this level there is good external validity, though the virgin data set came from the same population. At the next level, does the diabetic population electronically captured in this study's diabetic data warehouse match the diabetic population in other New Orleans clinics closely enough to believe the study results should be valid there? We see no reason why they should not, when looking at a large, varied, insured population. The populations should be similar and the results should be similar. If one went outside the boundaries of this study's population, say to an exclusively indigent, uninsured population, then the populations are so different that one would not expect the results to be valid without rechecking them in that population's data.
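The 3-fold cross-validation logic described above can be sketched as follows. This is a minimal illustration, not the study's actual CART configuration: the rows, the majority-label "model," and the accuracy function are toy stand-ins.

```python
import random

def three_fold_cv(rows, fit, accuracy):
    """Estimate out-of-sample accuracy: each of 3 folds is held out once
    while a model is fit on the remaining two folds."""
    rows = list(rows)
    random.shuffle(rows)
    folds = [rows[i::3] for i in range(3)]
    scores = []
    for i in range(3):
        held_out = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(fit(train), held_out))
    return sum(scores) / 3

# Toy stand-ins: labeled rows, and a "model" that predicts the majority label.
random.seed(0)
rows = [(x / 100, 1 if x > 50 else 0) for x in range(100)]

def fit(train):
    ones = sum(label for _, label in train)
    return 1 if ones >= len(train) / 2 else 0

def accuracy(model, held_out):
    return sum(1 for _, label in held_out if label == model) / len(held_out)

print(round(three_fold_cv(rows, fit, accuracy), 2))
```

A fully held-back "virgin" set goes one step further than this: those rows would never enter `three_fold_cv` at all, and would be scored only once against the final model.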
On a higher level, can the results in this insured diabetic population in New Orleans be considered valid in other large, varied, insured, urban and suburban diabetic populations in the United States? It has been noted that New Orleans is one of the most obese cities in the United States, with a unique Cajun culture of high-fat eating. Nevertheless, obesity has been skyrocketing in the United States as a whole, and the problems are similar everywhere, even if not as intense. Hence, external validity is at least moderate. One could quickly check some of the predictors (such as an odds ratio for bad glycemic control at an age cut-point of 65) in a different population; if they are similar, then the proposed interventions should be transferable with similar outcomes improvements.
5.2.4
Summary In this section on validity, the internal and external validity of this study
were reviewed. This study has internal validity strengths in face validity, construct validity, and predictive validity. External validity appears to be reasonably strong if the external population is a large, varied, insured population in a major American urban/suburban area.
5.3
Conclusions about the research hypotheses The research hypotheses from Table 1.2 are now reviewed to determine whether
the null or alternative hypotheses are accepted.
1. The null hypothesis was rejected, since one or more new knowledge items were discovered, as listed in Chapter 4. Therefore, the alternative hypothesis is accepted and we conclude that the CART data mining software can be applied to this diabetic data warehouse to discover new knowledge.
2. The null hypothesis claimed that neither managers nor clinicians would find this new knowledge useful. Based on the survey results in the previous section, we reject the null hypothesis and conclude that managers and clinicians did find one or more new knowledge items useful, as listed in the previous section. Note that this refers to the local institution only.
3. The null hypothesis was rejected, since new knowledge items were shown in the decision analysis trees of Chapter 4 to improve outcomes in accord with the definitions of Chapter 1. Therefore, the alternative hypothesis is accepted: this new knowledge can be used to improve outcomes.
5.4
Standard statistical approaches
5.4.1
Introduction In this dissertation, a data mining tool (CART) is applied to a large observational diabetic data set to find some gold nuggets (new knowledge items). These are then used in decision analysis to show the potential outcomes improvement using an intervention from the literature. Some of these gold nuggets include: diabetics under age 65 are more likely than older diabetics to have an HbA1c >9.5; and diabetics with proteinuria are 10.6 times as likely to die if hospitalized as other diabetics. In this section, this methodology is contrasted with other approaches that biostatistics might use. This section merely outlines what might be possible, and leaves to future work or other dissertations the detailed analysis involved in these outlines. The data miner is aware that data mining uses powerful computational tools, such as classification trees applied to a large database, to find gold nuggets. Data mining has the capability of analyzing massive amounts of data and running many analyses to find its gold nuggets. Compared to standard biostatistical
methodologies, data mining can handle larger volumes of data, is less sensitive to variable dependencies, and can find potentially useful relationships no one thought to ask about; all of these are problematic for standard biostatistical approaches. If a data miner brings these gold nuggets to a biostatistician, it seems legitimate to ask how they can work together to learn more about these results. The goals might include answers to these questions:
• How generalizable are these results?
• What other observational data can test these hypotheses over a regional or national population?
• What experimental data could be collected to test these hypotheses over a regional or national population?
First, logistic regression on the same database is considered. Second, applications of this methodology to other observational data sets are discussed. Third, applications to the development of prospective trials are discussed.
5.4.2
Logistic regression Logistic regression (Hosmer & Lemeshow, 2000) is a standard biostatistical
method used when the dependent variable is binary. Logistic regression is also found in data mining software systems, for example SAS Institute's Enterprise Miner and Insightful's Insightful Miner. The following discussion is intended to contrast data mining generally with some of the inferential aspects of logistic regression. Hospital death (0 = no, 1 = yes) or HbA1c >9.5 (0 = no, 1 = yes) might be typical dependent variables used in a logistic regression. After data mining has arrived at these gold nuggets, one could then use one of these binary variables, say hospital death, as the dependent variable in a
logistic regression. The predictor (independent) variables in the logistic analysis could be the other variables that were used in the data mining analysis. The logistic equation would then predict the probability of hospital death based on the predictor variables. From the coefficient associated with each predictor variable, one could estimate the significance of that variable's effect on the outcome (dependent) variable. One could also test the impact of an intervention decreasing HbA1c by 1.5 units. This could be done by using a second logistic regression equation identical to the first, except that (HbA1c - 1.5) is substituted for HbA1c. All other factors remaining the same during the intervention, the difference in the dependent variable's values between the two regression equations then indicates the effect of the intervention on reducing the probability of hospital death. The approach of the previous paragraph uses the gold nuggets obtained in the data mining process to help determine which dependent variable to use in a logistic equation. This has both benefits and liabilities, which are reviewed in turn. The benefits of this melding of data mining and logistic regression are:
1. Data mining informs the researcher which dependent variables might be most useful in formulating a hypothesis that a logistic equation will be used to evaluate; and
2. The logistic equation can be used to evaluate confounding variables in a standard approach that is widely familiar.
Unfortunately, the liabilities may be substantial. First, there is the criticism that the research hypothesis the logistic equation is modeling did not come from scientific theory, but rather from checking dozens (or hundreds) of possibilities and choosing the most promising one. So when setting α = 0.05, there is a multiple comparison issue that calls any result into question, since the true p-value might
be much higher than 0.05, and thus the results may not be significant. The data are being used twice: once to derive a good hypothesis, and again to support it. This could be solved by using an adjusted p-value as described by Halbert White (White, 2000) in section 1.1.2, which monitors the hypothesis space from which the hypothesis was developed and yields an adjusted p-value. Dr. White's group at UCSD and QuantMetrics has developed "Reality Check" software to partly automate this analysis. Second, there is the issue of whether the predictor (independent) variables are mutually independent and uncorrelated. Logistic regression estimates and inferences tend to be more sensitive to dependencies among predictor variables than classification trees are.
In a multifactorial disease like
diabetes and in a large observational database obtained for non-research purposes, many of the variables are overlapping or interconnected. This might be solved by a careful selection of seemingly independent variables, followed by a principal components analysis to minimize collinearity and stabilize coefficients. The result might be reasonably independent variables that can validly be used in a logistic regression. Multicollinearity can be detected by (Gujarati, 1995, pages 335–339):
• High R² values but few significant t ratios
• High pair-wise correlations among regressors
• Partial correlations low compared to R²
• R² of auxiliary regressions being higher than that of the full regression
• Eigenvalues and condition index (used by SAS to test for multicollinearity)
• Tolerance and variance inflation factor
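The last two of these checks are connected: the variance inflation factor (VIF) of a predictor is 1/(1 - R²) from the auxiliary regression of that predictor on the others. The sketch below computes VIFs from scratch on synthetic data; the ordinary-least-squares solver and the data are illustrative, not part of the study.

```python
import random

def r_squared(y, X):
    """R^2 of OLS regression of y on X (intercept added), via normal equations."""
    n, k = len(y), len(X[0]) + 1
    A = [[1.0] + row for row in X]
    AtA = [[sum(A[i][p] * A[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    Aty = [sum(A[i][p] * y[i] for i in range(n)) for p in range(k)]
    # Solve (A^T A) beta = A^T y by Gauss-Jordan elimination with pivoting.
    M = [row[:] + [Aty[p]] for p, row in enumerate(AtA)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    beta = [M[p][k] / M[p][p] for p in range(k)]
    yhat = [sum(b * a for b, a in zip(beta, A[i])) for i in range(n)]
    ybar = sum(y) / n
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def vif(cols, j):
    """VIF of column j: regress it on the remaining predictors; a value
    above 10 is a common rule-of-thumb flag for collinearity."""
    y = cols[j]
    X = [[col[i] for q, col in enumerate(cols) if q != j] for i in range(len(y))]
    return 1 / (1 - r_squared(y, X))

# Toy data: x2 is nearly a linear function of x1, so both get high VIFs.
random.seed(1)
x1 = [random.gauss(0, 1) for _ in range(200)]
x2 = [a + random.gauss(0, 0.1) for a in x1]
x3 = [random.gauss(0, 1) for _ in range(200)]
print([round(vif([x1, x2, x3], j), 1) for j in range(3)])
```

In practice one would use packaged routines (e.g., SAS's collinearity diagnostics, as noted above) rather than hand-rolled linear algebra; the point here is only the definition of the diagnostic.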
Yet it may be that HbA1c is strongly correlated with many of the seemingly independent variables in the logistic model. Thus an intervention that changes HbA1c by 1.5 units also modifies many of the other variables. If this is the case, further exploration of time series models, pooling of cross-sectional and time series data, variable transformations such as first-difference regression models, and ridge regression may be needed (Gujarati, 1995, pages 340–344). The melding of these approaches can bring together the insight of gold nuggets from data mining and the inferential force of logistic regression. Yet one can see from the brief outline above that substantial work must be done to overcome the issues raised by combining these methods. A limiting factor for generalizability is that the standard biostatistical methods use the same database that the data mining did. One solution could be dividing a large data set into subsets so that no data are used twice. A more general solution moves us to the topic of other databases.
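The two-equation intervention comparison outlined in this section can be sketched as follows. Everything here is an assumption for illustration: the synthetic data generator, the coefficients, and the gradient-ascent fitting routine stand in for the warehouse data and a statistical package's logistic fit.

```python
import math
import random

# Synthetic stand-in for the diabetic data warehouse (assumed rates and
# coefficients, not study results): death risk rises with HbA1c and renal disease.
random.seed(0)
rows = []
for _ in range(1000):
    hba1c = random.uniform(5.0, 13.0)
    renal = 1 if random.random() < 0.1 else 0
    p_true = 1 / (1 + math.exp(-(-3.5 + 0.5 * (hba1c - 9.0) + 2.0 * renal)))
    rows.append((hba1c, renal, 1 if random.random() < p_true else 0))

def prob(w, hba1c, renal):
    """Predicted probability of hospital death under the logistic model."""
    z = w[0] + w[1] * (hba1c - 9.0) + w[2] * renal
    return 1 / (1 + math.exp(-z))

def fit(rows, lr=0.5, epochs=300):
    """Fit the coefficients by batch gradient ascent on the log-likelihood."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        g = [0.0, 0.0, 0.0]
        for hba1c, renal, death in rows:
            err = death - prob(w, hba1c, renal)
            for j, xj in enumerate((1.0, hba1c - 9.0, renal)):
                g[j] += err * xj
        w = [wi + lr * gi / len(rows) for wi, gi in zip(w, g)]
    return w

w = fit(rows)

# Two-equation comparison: score the same fitted equation with HbA1c and
# with (HbA1c - 1.5) to estimate the intervention's effect on death risk.
baseline = sum(prob(w, h, r) for h, r, _ in rows) / len(rows)
treated = sum(prob(w, h - 1.5, r) for h, r, _ in rows) / len(rows)
print(f"mean predicted death risk: {baseline:.3f} -> {treated:.3f}")
```

Note that this sketch embodies exactly the liability discussed above: if HbA1c and the other predictors are correlated, holding "all other factors the same" while shifting HbA1c is unrealistic.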
5.4.3
Applications to observational data sets What other observational data can test these hypotheses over a regional
or national population? For the dependent variable of hospital death in diabetic patients, the Medicare database is publicly available and may be of use in testing whether diabetics with proteinuria are 10.6 times as likely to die if hospitalized as other diabetics. A logistic equation using this broader database would strongly support generalizability, and might reveal other interesting associations through examination of the coefficients of the predictor variables. The national Medicare database has limitations. First, specific laboratory results such as proteinuria are not available, so substitutes such as billing codes for proteinuria or renal disease must be used. These are not exact substitutes, since billing codes do not capture all of the disease present in the population.
Second, the Medicare database is largely composed of those over 65 years of age; only narrow groups of people under 65 are included, such as the disabled. Hence, evaluation of the high risk of those under 65 having an HbA1c >9.5 compared to those over 65 cannot be adequately carried out. Third, the Medicare database is so massive that it is common practice to analyze small random samples of it. However, if the dependent variable were hospital mortality, it may be reasonable to select only the smaller number of diabetic hospital deaths along with a few controls for each death. Another realistic possibility that can overcome the Medicare limitations is to obtain the cooperation of half a dozen major clinic systems around the United States to combine elements of their proprietary databases. This is certainly possible, though it requires finding co-investigators and obtaining institutional review board approval at each of the institutions. Each co-investigator would obtain a de-identified database that meets inclusion and exclusion criteria for patients and pre-agreed-upon variables. The principal investigator would then combine the databases, making adjustments as needed for different units or definitions of normal based on testing methods at the different institutions. There would also have to be a method of dealing with missing data, since some institutions may not electronically collect certain data. From such a national observational database, standard biostatistical methods could be applied, such as logistic regression modeling. Out of this, one might expect stronger evidence for generalizability.
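The case-control selection idea just mentioned (keep every death, sample the survivors) can be sketched in a few lines. The record layout, mortality rate, and 4:1 control ratio below are hypothetical, not Medicare specifics.

```python
import random

def case_control_sample(records, is_case, controls_per_case=4, seed=0):
    """Keep every case (e.g., diabetic hospital death) plus a random sample
    of controls, shrinking a massive file to an analyzable size."""
    rng = random.Random(seed)
    cases = [r for r in records if is_case(r)]
    controls = [r for r in records if not is_case(r)]
    k = min(len(controls), controls_per_case * len(cases))
    return cases + rng.sample(controls, k)

# Toy stand-in for a massive claims extract: 1% mortality in 100,000 records.
records = [{"id": i, "died": i % 100 == 0} for i in range(100_000)]
sample = case_control_sample(records, lambda r: r["died"])
print(len(sample))  # 1,000 cases + 4,000 sampled controls = 5,000 rows
```

Because the case fraction in the sample no longer matches the population, analyses on such a sample must account for the sampling design (as case-control methods do) rather than read off raw proportions.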
5.4.4
Applications to prospective trials The experimental approach is to design a research study to collect data at
multiple sites prospectively. This has the least chance of bias when properly set up, and may be able to give insights about cause and effect.
In this approach, the data miner's gold nuggets, and the variables that appeared useful and independent in observational studies, are used to construct a research study in which data are collected prospectively. This allows collection of critical data that might be missing in observational studies of databases not collected for research, such as BMI, family history of diabetes, exercise, diet, etc. Variable selection could be informed by a thorough review of the literature, rather than by the limitations of existing transactional databases. Such a prospective, multi-centered trial is expensive and time consuming. Its funding likely depends on whether the observational studies showed enough promise of outcomes improvements to warrant it.
5.4.5
Summary The gold nuggets found by data mining methods can be used to construct
standard biostatistical models such as logistic regression or time series models. The problems in making correctly calibrated inferences, i.e., using correct p-values and confidence intervals, must be addressed carefully. One must avoid using data twice, and check whether the variables used are independent enough to make standard methods valid. Using national observational databases provides good generalizability potential. Medicare databases are possible, though they have significant limitations. Collecting observational data from multiple sites has fewer limitations, but would be a major project to undertake. Finally, prospectively collecting data at multiple sites is the ideal approach, but unfortunately is also the most time and resource consuming.
5.5
Limitations
5.5.1
Introduction In this section, the study’s limitations are examined. First, limitations
that were discussed in previous sections are summarized and reviewed. Second, limitations of the decision analysis techniques used in Chapter 4 are discussed. Third, limitations of the survey discussed in section 4.7 are reviewed. Fourth, post-analysis insights into limitations are discussed.
5.5.2
Limitations summarized As pointed out in section 1.4, there are limitations to this type of research
on this type of database. Data were obtained for purposes other than research. Billing codes are not always precise and accurate. Important predictors of diabetic outcomes are missing from the database, such as BMI, family history of diabetes, time since onset of diabetes, and diet and exercise habits. These variables were not electronically stored; obtaining them would require going to the paper chart and conducting patient interviews. This study was limited to the CART data mining software for knowledge discovery. The CART software has proved useful in many clinical data mining settings, and produces diagrams and splitting rules that are easily understood by clinicians. Other data mining software might arrive at discovered knowledge that CART cannot; this is a limitation of this study. However, the literature review in Chapter 2 pointed out that this argument might not be valid, as CART has generally performed as well as or better than other data mining software. The default CART settings used also have limitations, such as the use of a single variable for splitting criteria in section 3.6.3, which limits predictive ability while increasing interpretability.
Variable transformations have limitations, such as the linear-fit presumption for HbA1c AvSlope in section 4.2.2, which will hold for only some patients. Charge data were very limited (see section 4.3.5), which limits the usefulness of that variable. Section 4.2.1 on inclusion and exclusion criteria noted the limitation to adults with some continuity of care, which reduced the number of patients from 31,696 to 15,393. The validity discussion (section 5.2) is largely about study limitations, such as the limits to content validity, the limited convergent validity, and the external validity discussion of applications beyond New Orleans.
5.5.3
Decision analysis interventions The decision analysis in Chapter 4 has limitations and could be expanded
to overcome some of these limitations. It used a documented intervention from the Lovelace Clinic that achieved a 1.8 unit reduction in average HbA1c values. Based on this, a drop of 1.5 units, with a sensitivity analysis varying this by 0.5 (1.0, 2.0), was used in the decision analysis. This provided a rough estimate of the impact of the intervention in the population, using the individual HbA1c values and renal disease rates in the data. It may be possible to obtain more information about the intervention regarding the individual characteristics of those who had a greater or lesser reduction in HbA1c. If the authors have this information and are willing to share it, then a more individualized approach could be taken to how much reduction in HbA1c might be expected in our population. This could adjust our results somewhat. Decision trees could also be built into more layers following the CART output. This study used decision trees with 3 HbA1c levels and 2 renal disease levels to predict the number of hospital deaths or non-deaths. From the baseline epidemiology of our population, cell-specific proportions for hospital mortality
were calculated. To model the effect of an intervention that reduced HbA1c, each individual's HbA1c value was changed, and the resulting renal disease and hospital mortality prediction was based on the proportions in those cells from the original epidemiology. The decision tree analysis could be much more finely tuned by using many more variables from the CART analysis, so that the baseline epidemiology proportions come from smaller subgroups based on many more variables.
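The cell-proportion mechanics described above can be sketched as follows. The mortality proportions, HbA1c bands, and patient list here are hypothetical placeholders, not the study's actual baseline epidemiology.

```python
# Hypothetical cell-specific hospital mortality proportions (assumptions
# for illustration only): (HbA1c band, renal disease flag) -> proportion.
death_rate = {
    ("<=7.5", 0): 0.01, ("<=7.5", 1): 0.05,
    ("7.5-9.5", 0): 0.02, ("7.5-9.5", 1): 0.08,
    (">9.5", 0): 0.04, (">9.5", 1): 0.15,
}

def band(hba1c):
    """Assign an HbA1c value to one of the 3 levels used in the tree."""
    return "<=7.5" if hba1c <= 7.5 else ("7.5-9.5" if hba1c <= 9.5 else ">9.5")

def expected_deaths(patients, drop=0.0):
    """Expected hospital deaths if every HbA1c falls by `drop` units,
    re-scoring each patient against the original cell proportions."""
    return sum(death_rate[(band(h - drop), r)] for h, r in patients)

# Toy cohort of (HbA1c, renal) pairs; sensitivity analysis around a 1.5 drop.
patients = [(6.8, 0), (8.9, 0), (10.2, 1), (11.5, 0), (9.4, 1), (10.8, 1)]
for drop in (0.0, 1.0, 1.5, 2.0):
    print(drop, round(expected_deaths(patients, drop), 3))
```

Adding more layers, as suggested above, would mean extending the cell key with further CART variables so each proportion comes from a finer subgroup.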
5.5.4
Survey limitations Limitations to the survey described in section 4.7 include non-responder
bias, where the 32% who did not respond might have answered differently on average than those who did; and differential non-response, where the larger non-response from managers (41% vs. 27%) may affect results insofar as managers tend to answer differently than clinicians. If non-responding managers answered similarly to responding managers, this would tend to strengthen the overall results for all 3 items respondents were questioned on. An email survey also lacks anonymity. There is always the possibility that a person feigns knowing something (responding that it is not new) to impress the one tabulating results. If this is a significant issue, then the results for the 3 newness items respondents were questioned on are stronger in reality than reported in section 4.7.
5.5.5
Post-analysis insights into limitations After performing the analysis documented in Chapter 4, are there any
additional insights into limitations of this methodology? It is very time consuming to develop the final data mining data table. In keeping with much of the published literature, this consumed about 75 to 80% of
the time. The usefulness of the final data mining data table is very dependent on the local institution. Have key variables been electronically stored? Have the providers who mark billing codes and the data entry clerks done their jobs accurately? Has the information services group properly written the code for the algorithms that bring data into a central registry or data warehouse? Even the most meticulous and intelligent data miner cannot overcome these institutional limitations. Hence the importance of clinical and administrative correlation of the data to gain insight into these limitations. The data cleaning process was one method that identified some of these errors. Another limitation not obvious at the beginning of this study was the importance of time analysis and the difficulty of adequately incorporating it into the data mining. Glycemic control changes over time. Hospital deaths are due to changes that have occurred over time. The cross-sectional method of the data mining software used in this study did not capture the time element well. Although slopes for glycemic control over time were calculated, these did not prove insightful in the analysis. There was no automated way, using this software, to identify sequences over time that led to hospital deaths. Clinicians are skilled at identifying changes over time that lead to outcomes, and this type of reasoning would seem important in data mining. Incorporating it is challenging, and is discussed in the future directions section below.
5.5.6
Summary Limitations include the non-research, observational data used, which eliminates key variables that would have been included in a prospective research trial. The specific data mining software used, and its default settings, may limit the results obtained. Variable transformations and inadequacies are also limitations. Decision analysis models could be much more detailed, providing higher accuracy,
though at the cost of complexity. The survey of local managers and clinicians had limitations from non-responder bias and lack of anonymity. Post-analysis insights into limitations of this methodology include institutional limitations and the difficulty of adequately incorporating time and sequences into the method.
5.6
Future directions
5.6.1
Introduction This section discusses some of the exciting future directions for continued work with this methodology in healthcare data mining. A major and active area for evolution of this method is the information loss in moving from relational databases to flat files, in general and specifically the loss of time series and sequencing information. Future work with this methodology may soon encounter massive databases too large for analysis, for which data squashing techniques hold great promise. Integration of data mining methods with standard biostatistical techniques is also promising. Finally, all database research will need to meet the new federal HIPAA rules. While privacy requirements could interfere with database analyses, there may be a good resolution in the algorithmic distortion of personal data from squashing techniques.
5.6.2
Relational vs. flat files: the problem of losing information Healthcare data are often contained in relational databases. This is true for
the diabetic data warehouse used in this study. The nature of relational databases is such that all of the information cannot be squeezed into a flat file, at least not in a way that is meaningful and useful as input to most data mining packages. The data mining technology used in this study (CART) requires flat files, and this is true of data mining and data analysis programs in general. Thus there is a concern
about losing information in the data processing steps needed to feed the data into the software for analysis. Relational databases have SQL extraction tools that allow the construction of flat files in almost any format one wants. In this study we used Oracle Discoverer and DBMS/COPY to extract the desired data into the flat files used by CART. However, the relational database contains multiple flat files that are linked by a unique patient identifier but cannot easily be combined. For example, the administrative demographics file has one row per patient. The lab file may have zero to thousands of rows per patient, depending on how many laboratory tests were performed. The clinic file may have zero to dozens of rows per patient, depending on how many clinic visits the patient had. The rows in the lab file do not match up with the rows in the clinic file. The hospital file has yet a different structure of rows per patient, as does the pharmacy file. Thus one must preselect what variables will be extracted, as the data mining software is not able to roam around a relational database on its own. This preselection limits what the data mining software can do. Data mining to locate otherwise unknown connections can only occur with the flat file data presented to the software. Finding ways of dealing more effectively with this problem will be important to future applications of data mining in healthcare. Data mining software that can roam around a relational database without creating flat files is not yet commercially available. But this is an area the IT industry is actively working on, since its benefits will be broadly valuable and, if successful, extremely valuable for mining transactional healthcare databases.

Time series information

Handling time series data is challenging for data mining software. One example in this study is the HbA1c value, the key measure of glycemic control that
should be measured every 3 to 6 months in all diabetics. How should this time series variable be transformed from the relational database into a vector (column) in the flat file presented to data mining software? There may be many of these results for a given diabetic patient. We could pick the last one, the first one, or one from the middle. We could take an average. All of these methods lose some information. Since the trend over time in this variable is important, we could choose the slope of its regression line over time. But a straight line may be a good representation for some patients and a very bad one for others, who may, for example, be better represented by an inverted-U curve. We could try to include it all with many columns, one for each HbA1c value, with associated columns indicating the time. However, since each patient has a varying number of such events (from none to many dozens), this leaves columns with many missing values, giving an inadequate representation of the data. In addition, most data mining software will not know how to associate the date with the lab value. This difficulty is a problem for most repeated laboratory tests. A related problem is present for pharmacy data. If a patient not being on vs. being on a particular medicine is represented as a binary variable (0, 1), what happens when a patient is on the medicine for half the time? If this is represented by a fraction, how does one distinguish between being on it in the first half of the study period vs. the last half? Each person on a given drug may not be on the same dose, so a column for dose may also be needed. In traditional hypothesis-driven statistics, this is often solved by inclusion and exclusion criteria that remove all these less-than-ideal cases, so one includes only those who have been on a given drug at a given dose for the entire study period, in contrast to a control patient.
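The regression-slope transformation mentioned above can be sketched in a few lines; the per-patient (day, HbA1c) records below are invented for illustration. The slope collapses a variable-length series into one column, at the explicit cost of the linearity assumption discussed here.

```python
def slope_over_time(points):
    """Least-squares slope of (day, value) pairs: one per-patient summary
    of the HbA1c trend, assuming a linear trajectory."""
    n = len(points)
    if n < 2:
        return None  # slope undefined with fewer than two measurements
    mx = sum(t for t, _ in points) / n
    my = sum(v for _, v in points) / n
    sxx = sum((t - mx) ** 2 for t, _ in points)
    sxy = sum((t - mx) * (v - my) for t, v in points)
    return sxy / sxx  # HbA1c units per day

# Toy records: one worsening and one improving patient (slope in units/year).
print(round(slope_over_time([(0, 7.0), (90, 7.6), (180, 8.2)]) * 365, 2))
print(round(slope_over_time([(0, 9.5), (120, 8.8), (240, 8.1)]) * 365, 2))
```

A patient whose values rise and then fall (the inverted-U case above) would get a slope near zero from this transformation, which is exactly the kind of information loss at issue.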
In data mining, however, we want to utilize all the information. These time series issues have been investigated (Blum, 1982; Riva & Bellazzi, 1995; Sakamoto, 1996; Bellazzi et al., 1998; Huang & Yu, 1999; Bellazzi, Larizza, Magni, Montani, & Stefanelli, 2000; Gunopulos & Das, 2000; Tsien, 2000), but there is a great need to explore them further in healthcare data mining. Future work might integrate techniques or insights from econometrics, where methods for analyzing economic trends over time are standard. Repeated-measures approaches from standard biostatistics might also prove useful. Further development in the effective use of time series data in data mining is one of the most exciting future areas where information loss may be minimized.

Sequencing information

The sequence of various events may hold meaning important to a study, e.g., if a patient shows better glycemic control, manifested in good HbA1c values, mostly when the patient saw a physician within a month of the test. Perhaps this is meaningful information implying that the proper sequence of physician visits relative to HbA1c tests is an important predictor of good outcomes. All this information is located in the relational database, but we must ferret it out by conjecturing that this particular sequence may be important and then searching for it to see whether it is valid. There may be many such meaningful sequences involving interactions between hospital, clinic, pharmacy, and lab variables. Regrettably, ferreting out these meaningful sequences appears to require a hypothesis of what might be important. In the ideal data mining scenario, software could interface directly with the relational database and extract all possibly meaningful sequences for domain experts to review. However, current data mining software, such as the CART package used in this study, requires that a flat file be created. Thus, one must devise a hypothesis of what sequences might be useful in
order to include this information in a flat file that the data mining software can then analyze for importance. Unfortunately, the software cannot discover anything from hidden sequences that were not extracted from the relational database into the flat file. This issue will need to be addressed in future healthcare data mining, both through more insightful variable transformations and by taking advantage of improved data mining algorithms. Future directions for better sequencing insights (i.e., what sequences are associated with certain outcomes of interest) may come from the intense work currently occurring in bioinformatics, where new methods for finding and comparing genetic sequences are rapidly being developed (Ewens & Grant, 2001).
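One hypothesized sequence, a clinic visit within 30 days before each HbA1c test, can be extracted as a derived flat-file feature. The dates and column names below are illustrative, not the study's actual schema.

```python
import pandas as pd

labs = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "test_date": pd.to_datetime(["2000-03-15", "2000-09-01", "2000-05-10"]),
})
visits = pd.DataFrame({
    "patient_id": [1, 2],
    "visit_date": pd.to_datetime(["2000-03-01", "2000-01-05"]),
})

# merge_asof finds, for each lab test, the most recent prior clinic
# visit for the same patient (both inputs must be sorted on the keys)
merged = pd.merge_asof(
    labs.sort_values("test_date"),
    visits.sort_values("visit_date"),
    left_on="test_date", right_on="visit_date",
    by="patient_id", direction="backward",
)
# the hypothesized sequence feature, ready for a flat-file column
merged["visit_within_30d"] = (
    (merged["test_date"] - merged["visit_date"]).dt.days <= 30
)
print(merged[["patient_id", "test_date", "visit_within_30d"]])
```

The point of the sketch is the one the text makes: the analyst must name the sequence in advance; the software only evaluates features that were extracted into the flat file.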
5.6.3 Integration with biostatistical techniques

Integration of data mining methods with standard biostatistical techniques is also promising. Section 5.4 on standard statistical approaches sketched how the gold nuggets found by data miners might be used by a standard biostatistical tool such as logistic regression. This section discusses the development of hybrid models that integrate data mining tools and logistic regression. Much of this discussion is based on work by Salford Systems, the developers of CART, presented at their training session on “Data Mining with Decision Trees: Advanced CART® Techniques.” Future evolution of these approaches holds great promise. Core CART features that logistic regression does not have are:
• Automatic separation of relevant from irrelevant predictors (variable selection)
• No need for transformations such as log or square root (model specification)
• Automatic interaction detection (model specification)
• Imperviousness to outliers (can handle dirty data)
• Robustness to missing values (no list-wise deletion or missing value imputation required)
• Only moderate supervision required of the analyst
• A first-pass model often as good as a neural net developed by an expert
Logistic regression (Hosmer & Lemeshow, 2000) can provide a smooth, continuous predicted probability of class membership, where a small change in a predictor variable yields a small change in predicted probability.
Logistic regression can also effectively capture global features of data. The main effects model reflects how probability responds to predictor x over the entire range of x, with some flexibility allowed by transformations, polynomials, and interactions. CART and logistic regression excel at different tasks. CART is weak at capturing strong linear structure, while logistic regression captures and represents linear structure easily. Many non-linear structures can still be reasonably approximated by a linear structure, so even incorrectly specified logistic equations can perform well; CART recognizes such structure but cannot represent it effectively. With many variables, many of which enter a model linearly, the structure will not be obvious from CART output. Yet CART excels at the detection of local structure: as each step of the recursive partitioning continues, the CART analysis is always restricted to the node in focus and so becomes progressively more local. CART analyzes all the data only at the parent node; each successive node uses only the data that has been partitioned into it. Table 5.2 summarizes the differences between CART and logistic regression.

Table 5.2: Differences between CART and logistic regression
• CART: automatic analysis. Logistic regression: requires hand-built models.
• CART: surrogates for missing values. Logistic regression: deletes records or imputes missing values.
• CART: unaffected by outliers. Logistic regression: sensitive to outliers.
• CART: discontinuous response (a small change in x can produce a large change in y). Logistic regression: continuous, smooth response (a small change in x yields a small change in y).
• CART: coarse-grained (a tree with 17 terminal nodes can predict only 17 distinct probabilities). Logistic regression: can assign a unique predicted probability to every record.

Salford Systems found that the key to a successful hybrid between CART and logistic regression was to use logistic regression in the root node, where it has access to all the data, thereby capitalizing on logistic regression's strength in detecting global structure. The method they use is:
1. Run CART and assign every patient to a terminal node. This assignment is possible even for cases with many missing values.
2. Report the terminal node assignment as a categorical variable with as many levels as terminal nodes.
3. Feed this categorical variable, in the form of terminal node dummy variables, to a logistic regression model.
This uses CART to create a new variable (or multiple dummy variables, since logistic regression prefers binary variables). These added variables constitute the hybrid model. Used in this manner, logistic regression can augment CART. By looking across nodes, logistic regression can find effects that CART cannot detect. Because these effects are not terribly strong, they are not picked up by CART as primary node splitters; while they may not be the strongest individually, collectively they can add enormous predictive power to the model. CART assigns a single score (probability) to all cases arriving at a terminal node. In the hybrid
model, logistic regression imposes a slope on the cases in each node, allowing continuous differentiation of within-node probabilities based on the predictor variables. Since the logistic regression is common to all nodes, the slope is common across nodes. The details of constructing the hybrid model outlined above are complex but promising; more information can be obtained at www.salford-systems.com. Future work using such hybrid models to predict hospital death or glycemic control over time may prove much more accurate than CART or logistic regression alone. Similar hybrid models combining CART and neural nets have also been used. Hybrid models may have applications to other data mining methods, as well as to other forms of data in both large observational studies and prospective trials.
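The three numbered steps above can be sketched with open-source tools on simulated data. Scikit-learn's DecisionTreeClassifier stands in for CART here, which is an assumption: CART itself is Salford's commercial implementation, and the details differ.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
# outcome with a threshold effect (tree-friendly) plus a linear effect
# (logit-friendly), so neither model alone captures everything
y = ((X[:, 0] > 0.5) + 0.8 * X[:, 1]
     + rng.normal(scale=0.5, size=500)) > 0.5

# Step 1: grow a tree and assign every case to a terminal node
tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X, y)
leaf = tree.apply(X).reshape(-1, 1)

# Steps 2-3: recode node membership as dummy variables and feed them,
# alongside the raw predictors, to a logistic regression
dummies = (leaf == np.unique(leaf)).astype(float)
hybrid = LogisticRegression(max_iter=1000).fit(np.hstack([X, dummies]), y)
print("hybrid training accuracy:",
      hybrid.score(np.hstack([X, dummies]), y))
```

The node dummies carry the tree's local structure into the regression, while the raw predictors let the regression impose the common across-node slope described above.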
5.6.4 Squashing data

Data squashing algorithms can reduce a massive data set more powerfully and accurately than a random sample can (DuMouchel, Volinsky, Johnson, Cortes, & Pregibon, 1999). Squashing is a form of lossy compression that attempts to preserve statistical information (DuMouchel, 2001), much as a JPEG file gives a good rendition of the original digital photograph in a much smaller file. In this study all the available data were used, so there was no sampling. For a larger data set, such as all diabetics in a nationwide Medicare database, memory and time constraints may require limiting the amount of data used in the analysis. These newer data squashing techniques may be useful for such massive data sets.

Privacy of records

Healthcare database research needs patient identifiers to link tables in a relational database and extract its data mining data table. Yet data must be de-identified as quickly as possible to meet regulatory requirements. Beyond
regulatory limitations, there are ethical concerns about the privacy of medical records, whether they reside in a paper record or in an electronic database. This patient information is protected; research access requires Institutional Review Board approval and assurances that no patient identifiers will appear in any presentation or paper. The newly proposed HIPAA rules are not yet finalized, but may range from requiring consent from each person whose medical information is in the database being analyzed (an impossible task in this study's database of more than 30,000 patients) to requiring de-identification at an earlier stage, which might prevent proper linking to other files. A squashed data set may partly resolve privacy issues: squashed records faithfully represent only the aggregate behavior of the large data set, not the individual records themselves. Squashing naturally lends itself to data where it is important to have minimal disclosure risk (DuMouchel et al., 1999).
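As a toy illustration of the squashing idea (this is not DuMouchel's actual algorithm), a large sample can be replaced by a few weighted pseudo-points that preserve low-order moments exactly, in contrast to a small random sample, which preserves them only in expectation.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=100.0, scale=15.0, size=100_000)  # "massive" data set

# Squash: partition into 50 quantile bins and keep each bin's mean as a
# pseudo-point whose weight is the bin count. The weighted mean of the
# pseudo-points then equals the full-data mean by construction.
edges = np.quantile(x, np.linspace(0, 1, 51))
idx = np.digitize(x, edges[1:-1])                    # bin index 0..49
weights = np.bincount(idx, minlength=50)
pseudo = np.bincount(idx, weights=x, minlength=50) / weights

full_mean = x.mean()
squash_mean = np.average(pseudo, weights=weights)    # exact
sample_mean = rng.choice(x, size=50).mean()          # noisy
print(full_mean, squash_mean, sample_mean)
```

Fifty weighted rows stand in for 100,000, and, relevant to the privacy discussion above, no individual record survives in the squashed set.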
5.6.5 Summary

This section reviewed data transformation issues that minimize information loss in moving from a relational database to a flat file. These issues are central to the time series and sequencing information that are key to transactional healthcare data mining. Hybrid models using CART and logistic regression are also possible. For massive data sets, squashing via algorithms rather than random sampling may have advantages, including privacy protections.
5.7 Final conclusions

Data mining transactional healthcare data warehouses is a valid, realistic, and cost-effective method of improving outcomes that is underutilized in today's
healthcare setting. This study showed how data mining can discover new knowledge, quite a feat in an area that has been intensely studied by thousands of investigators for decades. Data mining methods are more powerful than standard statistical methods in some ways, but both are needed to find the optimal model. The benefits to managers are significant.
Using the decision analysis
approach of Chapter 4, this new knowledge may change managerial decisions so that resources go to interventions where specific outcomes are most likely to be improved. The predictive ability of these models can be used prospectively by managers for resource allocation. The physical obstacles to effective healthcare data mining are all likely to shrink in the coming years: data warehouses with enough useful information, adequate computer equipment and software, and electronic medical records are all becoming more widespread. A more difficult obstacle is the needed skill set, which spans clinical medicine, decision analysis, data mining, and health systems management. Finally, the cultural change required to shift from a hypothesis-driven approach to algorithmic models (Breiman, 2001) may be a larger obstacle than the others. These methodologies should become a routine part of a healthcare manager's analysis of data and of a clinician's management of their panel of patients. User-friendly interfaces for these tools are already available. Institutions should facilitate the training of clinicians, researchers, and managers who can use these methodologies to discover new knowledge and apply it to improve outcomes in the healthcare setting. This may mean adding faculty with these skills and making data mining software available. Although this material can be incorporated into many existing courses, a separate course on these methodologies and their application in the healthcare setting should be provided.
This is an exciting area of health systems management today. The potential for new knowledge discovery through data mining to make a significant impact on healthcare outcomes is unparalleled. As individuals and institutions achieve notable successes with these methods in the coming years, data mining will take its place in the armamentarium of health systems management and population medicine.
Appendix A

Diabetic Data Warehouse Structure

A.1 Introduction

The diabetic data warehouse incorporates all the electronic sources of data on diabetic patients in the Institution. Oracle Discoverer v3.1 is used to query the data warehouse; DBMS/COPY can also be set up to query it as an interface to data mining software such as CART. The other major databases, not used directly in this project, are a data warehouse on billing and a proprietary system (OMIS) containing all the clinical and lab data. Information from both of these, as well as the insurance drug prescription information, has been incorporated into the diabetic warehouse. The OMIS system is relatively difficult to query intensively: it is structured to present text information to clinicians who are looking up reports, it is not query friendly, and queries run at a low priority that may take weeks, since highest priority goes to clinicians seeking reports and to daily input and maintenance tasks. The diabetes data warehouse has fairly complete data starting 1/1/98 and currently (as of early January 2002) runs through 9/30/01 (these queries were started in Fall 2000 and were mostly completed and updated in December 2001 and January 2002). It is updated quarterly.
A.2 Overall structure

A.2.1 Organizational chart

Figure A.1 outlines the data warehouse structure within the Oracle database.
Figure A.1: Organizational chart of the diabetic data warehouse

The database is set up so that only the administration group links the other four groups (clinic, hospital, laboratory, and medication). Once any of these four groups is selected, the others are shaded out, meaning they cannot be selected for queries. However, if one goes up to administration and selects clinic number (even though clinic number is already selected within the group being queried), the other groups become available for querying.
A.2.2 Diabetes registry maintenance and process

Administrative data maintenance

The Administrative Data file is a file of the Institution's patients identified as diabetic. Patients are included in the Administrative Data table if one or more of the following has occurred:
1. A diagnosis code of diabetes has been captured from a clinic or hospital encounter.
2. A claim has been processed for diabetic medications or supplies.
3. A laboratory test indicates above-normal glucose or glycosylated hemoglobin.
The initial administrative data were built by extracting the diagnoses, medications, and laboratory tests that indicate diabetes. The extracts were combined into a single table maintaining only one entry per patient. The clinic numbers identified were supplemented with demographic data from the Institution's Patient Master File. Quarterly, newly collected diagnosis, medication, and laboratory data are extracted to obtain new patients for the registry.

Diabetes registry table maintenance

The Diabetes Registry consists of four tables in addition to the Administrative Data table described above:
1. Hospital Data
2. Laboratory Data
3. Medications Data
4. Clinic Data
These four tables contain utilization data obtained from their respective systems. Initially, utilization was loaded from January 1998 through the present. The tables are organized by year, month, clinic number, and member number. The Hospital Data table has two related tables, one to maintain diagnosis data (Hospital Diagnosis Data) and one to maintain procedure data (Hospital
Procedure Data). The hospital number links the two tables to the Hospital Data table. The Laboratory Data table has one related table (Lab Results Data) to maintain the individual results contained in a laboratory test. Accession number, Test Code, and Series link the table to the Laboratory Data table.
A.2.3 Process specifications

Extract patients by diagnosis

This process accesses the mainframe ADABAS file MR-PATMED-DETAIL. Extracts are based on the record type field PD-TYPE and the superdescriptor PD-STATUS-DX, which is a combination of the record status and the diagnosis code. The record types "01" and "03" denote records containing diagnoses, and the record status "6" indicates an active record. The diagnosis codes indicating diabetes are:
• Diabetes mellitus: any code beginning with "250"
• Diabetic polyneuropathy: code "357."
• Diabetic retinopathy: any code beginning with "362.0"
• Diabetic cataract: code "366.41"
The date of service field, PD-SVC-DATE, is also used to control the selection of records. For the initial registry, dates of service on or before 7/1/98 identified the patients diagnosed as diabetic as of that date. Subsequent diagnosis data extracts identified the patients diagnosed as diabetic since that date. Only one occurrence per clinic number is required.
Extract patients by lab test

This process accesses the mainframe ADABAS file LR-OMIS-DETAIL. Extracts are based on the test status field LD-TEST-STATUS and the test ordered field LD-TEST-CODE-ORD. The test status codes "6" and "7" denote resulted tests. The test codes used to indicate diabetes are in Table A.1.

Table A.1: Test codes used to indicate diabetes
Total Serum Cholesterol: 1010
Glucose, Fasting: 1017
Glucose 2hr PC: 1018
Glucose - 11am: 1019
Glucose Random: 1020, 1020T
Chem 20: 1037
Glucose - 4pm: 1071
Triglyceride: 1087
HDL Cholesterol: 1140A, 1140B
LDL Cholesterol: 1141, 1141B
LDL: 1141.2, 1141.2B
Chem 18: 1155
Total Cholesterol/HDL Ratio: 1160.1, 1160.1B
HDL/Cholesterol Ratio: 1161.3, 1161.3B
Chem 7: 1218
Chem 8: 1228
Basic Metabolic Panel: 1230
Comprehensive Metabolic Panel: 1232
Chem 21: 1237
Glucose, Random (BR Women's Hospital): 1251
Lipid Profile: 1306, 1306B
Nutrition Profile: 1309
Family Medicine Profile: 1313
Dialysis Chem II: 1317
Microalbumin: 2590
Microalbumin volume: 2590.2
Microalbumin - Minutes Collection: 2590.3
Microalbumin - Timed Urine: 2590.4
Glycosylated Hb: 3096
The collection date field LR-TEST-SPN-DATE is used to control the selection of records. For the initial registry, collection dates between 1/1/98 and 7/1/98 potentially identified the patients diagnosed as diabetic as of that date. Subsequent laboratory data extracts potentially identify patients as diabetic since the last extract's date. In addition to the above extraction criteria for laboratory data, key results must contain certain values, or multiple occurrences of a certain test over a set period must be present. Key results and values are in Table A.2.

Table A.2: Key result values that indicate diabetes
HbA1c: > 7%
Fasting Glucose: ≥ 126
Random Glucose: > 200

The repeated test is Glycosylated Hb (code 3096), with 2 occurrences in the previous 18 months. Only one occurrence per clinic number is required.

Extract patients by medications

This process accesses the insurance company's pharmacy claim and drug files MEDATA.CLAIM RX MM and MEDATA.DRUGFILE MM. Extracts are based on the drug category description field CATGY DESC; a drug category description of "DIABETES" denotes medications or supplies prescribed for diabetics. The FILL DATE field is also used to control the selection of information. The initial medications selection was for fill dates between 1/1/98 and 7/1/98; subsequent medications extracts began where the previous extract ended. Only one occurrence per patient is necessary. Since the pharmacy data do not contain the clinic number, it was necessary to link the extracted insurance member number to the Practice Profile Registration file PHYSPROF.REGISCRTATION to obtain the clinic number for insurance members who are patients at the Institution.
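The laboratory rules above (the Table A.2 thresholds plus the repeated Glycosylated Hb requirement) can be sketched as follows, with illustrative data and simplified field names; 18 months is approximated here as 548 days.

```python
import pandas as pd

labs = pd.DataFrame({
    "clinic_no": [1, 2, 2, 3, 3],
    "test_code": ["3096", "1017", "1020", "3096", "3096"],
    "result":    [6.8, 131.0, 150.0, 6.5, 6.9],
    "date": pd.to_datetime(["1998-02-01", "1998-03-01", "1998-04-01",
                            "1998-01-15", "1998-12-15"]),
})

# Table A.2 key-value rules
hba1c_hi = (labs["test_code"] == "3096") & (labs["result"] > 7)
fasting  = (labs["test_code"] == "1017") & (labs["result"] >= 126)
random_g = labs["test_code"].isin(["1020", "1020T"]) & (labs["result"] > 200)
by_value = labs.loc[hba1c_hi | fasting | random_g, "clinic_no"]

# repeated-test rule: two Glycosylated Hb results within ~18 months
ghb = labs[labs["test_code"] == "3096"]
span = ghb.groupby("clinic_no")["date"].agg(["min", "max", "count"])
repeated = span[(span["count"] >= 2)
                & (span["max"] - span["min"] <= pd.Timedelta(days=548))]

diabetic_ids = sorted(set(by_value) | set(repeated.index))
print(diabetic_ids)
```

In the toy data, patient 2 qualifies on a fasting glucose of 131 and patient 3 on two Glycosylated Hb results eleven months apart, even though neither HbA1c value exceeds 7%.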
Combine patient extracts

This process combines the three patient extracts (diagnoses, laboratory tests, and medications) into a single file containing only one occurrence per clinic number.

Complete administrative data

This process uses the output from the Combine Patient Extracts process above to add the necessary demographic data to the clinic numbers found by the extracts. The clinic number is used to access the mainframe ADABAS file PATIENT-MASTER and the Practice Profile Registration table PHYSPROF.REGISCRTATION.

Extract hospital data

This process accesses the mainframe ADABAS file MR-PATMED-DETAIL. Extracts were based on the record type field PD-TYPE and the clinic number field PD-CLINO. The record type for hospital data is "03". The clinic number came from the uploaded Administrative Data table from the above process. The hospital discharge date field PD-SVC-DATE was also used to control the selection of information. For the initial registry, discharge dates between 1/1/98 and the present were extracted; subsequent hospital data extracts begin where the previous extract ended. For each MR-PATMED-DETAIL hospital record that meets the extraction criteria, the process outputs the following:
1. A file containing one record for each hospital record selected, containing the fields listed under Hospital Data in the table documentation.
2. A file containing a record for each diagnosis code in the hospital record selected, containing the fields listed under Hospital Diagnosis Data in the table documentation.
3. A file containing a record for each procedure code in the hospital record selected, containing the fields listed under Hospital Procedure Data in the table documentation.

Extract laboratory data

This process accesses the mainframe ADABAS file LR-OMIS-DETAIL. Extracts were based on the test type field LD-TYPE, the test status field LD-TEST-STATUS, and the clinic number field PD-CLINO. The test type for standard chemistry tests is "S". The test statuses for resulted tests are "6" and "7". The clinic number comes from the uploaded Administrative Data table from the above process. The collection date field, LD-TEST-SPN-DATE, was also used to control the selection of information. For the initial registry, collection dates between 1/1/98 and the present were extracted; subsequent laboratory data extracts began where the previous extract ended. For each LR-OMIS-DETAIL laboratory record that meets the extraction criteria, the process outputs the following:
1. A file containing one record for each laboratory record selected, containing the fields listed under Laboratory Data in the table documentation.
2. A file containing a record for each result in the laboratory record selected, containing the fields listed under Lab Results Data in the table documentation.
Extract medications data

This process accesses the pharmacy claim and drug files MEDATA.CLAIM RX MM and MEDATA.DRUGFILE MM. The extract is based on the member number field MEMBER NBR; the member number comes from the Administrative Data table from the above process. The FILL DATE field is also used to control the selection of information. The initial medications selection was for fill dates between 1/1/98 and the present; subsequent medications extracts began where the previous extract ended. For each drug claim record that meets the extraction criteria, a row is added to the Medications Data table.

Extract clinic data

This process accesses the data warehouse table WAREHOUSE.CHARGES. Related tables in the data warehouse are required to provide data not maintained in the charges table. The extract is based on the clinic number field DL EXTERNAL PT ID; the clinic number comes from the Administrative Data table from the above process. The service date field is also used to control the selection of information. The initial clinic data selection was for service dates between 1/1/98 and the present; subsequent clinic data extracts began where the previous extract ended. For each charge record that meets the extraction criteria, a row is added to the Clinic Data table.
A.3 Administrative variables

The administrative variables are in Table A.3.

Table A.3: Administrative variables (available aggregates: COUNT, MIN, MAX)
CLINIC NO: Clinic number of patient
MEMBER NO: Insurance member number of patient without the "-" between the 9th and 10th positions (the registration file does not use the "-" in the member number field)
MEMBER OHP NO: The same number as above with the "-"
NAME: Name of patient
DOB: Date of birth
SEX: Sex of patient (?, F, M, U)
A.4 Clinic variables

The clinic variables are in Table A.4. When Provider Type × COUNT is run, the third column of Table A.7 is generated. Note that the RES count is only 407. Drilling down, many of these are family practice residents, but they would be only a small portion of the resident encounters; apparently, most are listed under the staff physician only, so any comparison between residents and staff would be difficult with this dataset.
A.5 Hospital variables

The hospital variables are in Table A.8 and its subtables, Tables A.9 and A.10.

A.6 Laboratory variables

The laboratory variables are in Table A.14 and its subtable, Table A.15.

A.7 Medication variables

The medication variables are described in Table A.16 and its subtables, A.17 and A.18.
Table A.4: Clinic variables (available aggregates: COUNT, MIN, MAX unless noted)
YEAR: Year of this clinic visit (1998 was selected as the starting point; prior data are not readily available)
MONTH: Month of this clinic visit
CLINIC NO: Clinic number of patient
POS CODE: Point of Service code (see Table A.5)
DOS DATE: Date of service
PERFORMING DOCTOR: Performing doctor's number (969 choices)
PROVIDER SERVICE: Section of Institution service provided in (see Table A.6)
PROVIDER TYPE: Type of provider (see Table A.7)
PROVIDER SPECIALTY: Specialty (department) of provider
GENERAL LEDGER CTR: Billing location (hundreds of codes; the supplemental table defining the locations of the codes is not reproduced here)
INSTITUTION: C = New Orleans; B = Baton Rouge
DIAGNOSIS CODE 1: 1st of the 4 diagnosis codes for billing
DIAGNOSIS CODE 2: 2nd of the 4 diagnosis codes for billing
DIAGNOSIS CODE 3: 3rd of the 4 diagnosis codes for billing
DIAGNOSIS CODE 4: 4th of the 4 diagnosis codes for billing
CHARGE CODE: Billing charge code from the CVC's
CPT CODE: CPT code (procedure code)
CPT MODIFIER: CPT modifier code
CHARGE AMOUNT: Charge amount (gross charge) (aggregates: SUM, COUNT, AVG, MIN, MAX, Detail)
TRANSACTION SIGN: Indicates whether the entry is a charge or a credit
ACCOUNT TYPE: Indicates the general payer class (27 choices)
Table A.5: POS CODE
01: IN-PATIENT
02: REG OUT PATIENT
03: OFFICE PATIENT
04: PATIENT'S HOME
05: EMERGENCY DEPARTMENT
07: NURSING HOME
08: SKILLED NSG FACILITY
10: OTHER LOCATION
11: REHABILITATION UNIT
12: OC AMBULATORY SURG
13: ANCILLARY OFFICE PT
14: DIALYSIS CENTER
15: IN-PT (OC OWNED) ANCIL
16: OUT-PT (OC OWNED) ANCI
RH: OFFICE PATIENT
Table A.6: Selected provider service codes and descriptions
0200: FAMILY PRAC-GWD
0201: PERKIN FAM PRAC
0202: PERKIN INT MED
0204: GOODWOOD INTMED
0206: GONZALES INTMED
0207: FAM PRAC - EAST
0208: INT MED - EAST
0209: PEDIATRICS-EAST
0210: FAM. PRAC B-OC
0211: FP - MID CITY
0213: INTMED MID CITY
0214: PEDIATRICS-GWD
0218: GONZALES FAMILY
0511: DIABETIC INST
0524: NEIGH CL INTMED
0525: INTERNAL MED
0526: GEN INT MED ADM
0527: INT MED III
0529: METAIRIE INT IV
0565: ACUTE MEDICINE
0580: INT MED II
0585: PROSPECTIVE MED
0589: PODIATRY 7
0590: PODIATRY 9
0675: WEIGHT CONTROL
0700: OPHTHALMOLOGY
0701: OPTOMETRY
0781: UPTOWN CLINIC
0805: INPATIENT AOMF
0887: EXECUTIVE HEALT
0890: ICU
0902: ALGIERS INT MED
0920: EMERGENCY MED
0935: COVINGTON INT M
0936: OCH CL COV
0937: OCH CL HAM
0943: GOLDEN MEADOWS
0944: SLIDELL CLINIC
0945: SLIDELL FAM PRA
0948: SLIDELL INT MED
0949: LAROSE CLINIC
0951: RACELAND CLINIC
0952: MATTHEWS CLINIC
0953: CUT OFF CLINIC
0954: MANDEVILLE CL
0955: MANDEVILLE SAT
0956: MANDEV INT MED
0959: MAN CONTACT LEN
0960: EYEGLASSES
0961: MET CONTACT LEN
0967: NOE RESID PROG
0969: LAP INT MED
0970: DIET
0971: DIET
0972: KEN CONTACT LEN
0973: LAP CONTACT LEN
0974: KENNER CLINIC
0975: LAPALCO CLINIC
0976: LAPALCO FAM PRA
0980: KENNER FAM PRAC
0982: KENNER INT MED
0984: NOE CONTACT LEN
0985: N O E CLINIC
0986: N O E FAM PRAC
0988: N O E INT MED
0990: METAIRIE CLINIC
0991: MET FAM PRAC
0994: FAM RESID PROG
0995: ALGIERS CLINIC
0996: ALGIERS FAM PRA
0998: KENNER OPHTHA
8888: HOSPITAL ADMIT
9001: INTERNAL MED
9003: OPHTHAMOLOGY
9004: OPTOMETRY
9009: FAMILY PRACTICE
Table A.7: Selected provider type descriptions with counts
MD: STAFF PHYSICIAN, 493,657
ER: EMG ROOM PHYSICIAN, 16,038
NP: NURSE PRACTIONER, 7,732
OCP: CONTRACT PROVIDER, 3
OD: OPTOMETRIST, 18,790
PA: PHYSICIAN ASSISTANT, 10,327
PCP: PRIMARY CARE PHYSICIAN, 300,113
RES: RESIDENT, 407
RN: REGISTERED NURSE, 70
SHR: SHARED PHYSICIAN, 489,014
SW: SOCIAL WORKER, 1,561
Table A.8: Hospital variables (available aggregates: COUNT, MIN, MAX)
YEAR: Year of hospitalization discharge
MONTH: Month of hospitalization discharge
CLINIC NO: Clinic number of patient
HOSPITAL NO: Hospital number of patient
POS CODE: Point of Service code (ER or FH)
DISCHARGE DATE: Discharge date
DISCHARGE STATUS: Discharge status (values observed: 1A, 1N, 1Z, 2A, 2N, 2Z, 3A, 3Z, 4N, 4Z, 7A, 7N, 7Z, 8A, 8N, 8Z, HB, HR, MA, PRS, TE, TH, TI, TP, TR, NULL; the code list is in Table A.11)
DEATH CODE: A = alive, C = coroner's case, D = death
ATTN DOCTOR NO: Attending physician's number (3997 choices)
LOS: Length of stay, the difference between the admit date and the discharge date
DRG: A standard code assigned by medical records that best describes the hospitalization
HOSP SERV: The hospital service the patient was on (see Table A.12)
HOSP FIN CLASS: Financial class (see Table A.13)
Table A.9: Hospital variables, diagnosis subtable (aggregates: COUNT, MIN, MAX)
YEAR: Year of hospitalization discharge
MONTH: Month of hospitalization discharge
HOSPITAL NO: Patient's hospital number
DIAGNOSIS CODE: Diagnosis code for hospital billing (a very long list of codes)
Table A.10: Hospital variables, procedures subtable (aggregates: COUNT, MIN, MAX)
YEAR: Year of this particular hospital procedure
MONTH: Month of this particular hospital procedure
HOSPITAL NO: Hospital number of patient
PROCEDURE CODE: Billing code for the procedure (a very long list of procedure codes)
PROCEDURE DATE: Date of the procedure
PROCEDURE DOCTOR: Billing doctor for the procedure
Table A.11: DISCHARGE STATUS codes
1A-8N: DEATH
HB: DISCHARGE TO RELATIVE'S HOME
HF: DISCHARGE TO FOSTER HOME
HR: ROUTINE DISCHARGE
MA: LEFT AGAINST MEDICAL ADVICE
MN: ALTERNATE CARE LEVEL REQUIRED, FACILITY NOT AVAILABLE
OU: OUTPATIENT ADMITTED AS AN INPATIENT TO THIS HOSPITAL
RC: REFERRED TO SPECIALTY CLINIC
RP: REFERRED TO FAMILY PHYSICIAN
RS: REFERRED TO OR DISCHARGED TO AN ORGANIZED HOME CARE SERVICE
TA: TRANSFERRED OR DISCHARGED TO ANOTHER INSTITUTION
TB: TRANSFERRED OR DISCHARGED TO A CORRECTIONAL INSTITUTION
TC: TRANSFERRED TO A CUSTODIAL CARE FACILITY
TE: TRANSFERRED OR DISCHARGED TO AN EXTENDED, SKILLED NURSING FACILITY
TF: TRANSFERRED OR DISCHARGED TO A STATE FACILITY
TH: TRANSFERRED TO A SHORT-TERM GENERAL HOSPITAL
TI: TRANSFERRED TO AN INTERMEDIATE CARE FACILITY
TM: TRANSFERRED TO A MATERNITY FACILITY
TO: TRANSFERRED TO OTHER FACILITY
TP: TRANSFERRED TO A PSYCHIATRIC FACILITY
TR: TRANSFERRED TO A REHABILITATION FACILITY
TS: TRANSFERRED TO A SPECIALTY HOSPITAL
Table A.12: Selected HOSPITAL SERVICE codes
CAR: CARDIOLOGY
CCS: CRITICAL CARE SVC
CMR: CARD MEDICARE RISK
CTS: CARDIO THORACIC
CVA: STROKE TEAM CONSULT
EMS: EMERGENCY MED SVC
END: ENDOCRINE
FAM: FAMILY PRACTICE
GNM: GENERAL INTERNAL MED
HTS: HEART TRANSPLANT
HYP: HYPERTENSION
IM1: INT MEDICINE TEAM 1
IM2: INT MEDICINE TEAM 2
IM3: INT MEDICINE TEAM 3
IM4: INT MEDICINE TEAM 4
IM5: INT MEDICINE TEAM 5
IM6: INT MEDICINE TEAM 6
IM7: INT MEDICINE TEAM 7
KTS: KIDNEY TRANSPLANT
MA: MEDICINE TEAM A
NEP: NEPHROLOGY
OPH: OPHTHALMOLOGY
POD: PODIATRY CONSULT SVC
PTS: PANCREATIC TX SURG
UCM: UNCOVERED MEDICINE
VIM: VASCULAR INT MED
Table A.13: HOSP FIN CLASS descriptions
A: M/C HMO NON-CA
B: BLUE CROSS
C: COMMERCIAL
E: MEDICARE RISK
F: HANDICAPPED CHILDREN
G: TIME PAYMENT
H: MEDICARE PSYCH
K: MEDICARE SNF
L: PRIVATE PAY INTN'L
M: MEDICARE
N: MEDICAID
P: PRIVATE PAY
Q: PENDING M'CAID
R: MEDICARE REHAB
S: PACKAGE PRICE
T: OHP CAPITATED
U: HMO/PPO CONTRACTS
V: OHP
W: WORKMAN CMP 3RD PART
X: CHAMPUS/RET.FED EMP
Y: REFERRAL LAB
Z: OUT/STATE MEDICAID
Table A.14: Laboratory variables
YEAR: Year of collection date (COUNT, MIN, MAX)
MONTH: Month of collection date (COUNT, MIN, MAX)
CLINIC NO: Clinic number of patient (COUNT, MIN, MAX)
ACCESSION TEST SERIES: A control field used to uniquely identify the record (COUNT, MIN, MAX)
COLLECTION DATE: The collection date (DATA is a typo in the structure; it should be DATE) (COUNT, MIN, MAX)
TEST STATUS: Indicates that the test was resulted; non-resulted tests were not extracted (COUNT, MIN, MAX)
TEST TYPE: Distinguishes the test into three categories: S is Chemistry, CT is Cytology, and MS is Microbiology (COUNT, MIN, MAX)
ORDERING DOCTOR: Ordering doctor's number (COUNT, MIN, MAX)
TEST CODE ORDERED: Test code that was ordered (COUNT, MIN, MAX)
TEST CODE PERFORMED: Test code that was performed (COUNT, MIN, MAX)
TEST NAME: Name of the test (2,304 names, not detailed below) (COUNT, MIN, MAX)
CPT CODE: CPT code of the test (COUNT, MIN, MAX)
CPT MODIFIER: CPT modifier used (COUNT, MIN, MAX)
Table A.15: Laboratory variables, Dx subtable
YEAR: Year of collection date (COUNT, MIN, MAX)
MONTH: Month of collection date (COUNT, MIN, MAX)
ACCESSION TEST SERIES: A control field used to uniquely identify the record (COUNT, MIN, MAX)
RESULT ABBREV: Abbreviation used for the test result (COUNT, MIN, MAX)
RESULT: The test result (COUNT, MIN, MAX)
RESULT LNH FLAG: Low-High flagged as abnormal (COUNT, MIN, MAX)
RESULT NORMAL LOW: The cut point for below normal range (COUNT, MIN, MAX)
RESULT NORMAL HIGH: The cut point for above normal range (COUNT, MIN, MAX)
Table A.16: Medication variables
YEAR: Year medication filled (COUNT, MIN, MAX)
MONTH: Month medication filled (COUNT, MIN, MAX)
CLINIC NO: Clinic number of patient (COUNT, MIN, MAX)
DOCTOR NO: Prescribing doctor number (COUNT, MIN, MAX)
NDC: A unique number nationally assigned to medications (COUNT, MIN, MAX)
CATGY CODE: From OHP's formulary (COUNT, MIN, MAX)
CLASS CODE: From OHP's formulary (COUNT, MIN, MAX)
BRAND NAME: Brand name of medicine (2,260 choices) (COUNT, MIN, MAX)
STRENGTH: Strength of medicine (COUNT, MIN, MAX)
DOSAGE: Dose form of medicine (112 choices) (COUNT, MIN, MAX)
ROUTE: Route medicine was prescribed (22 choices) (COUNT, MIN, MAX)
FILL DATE: Date medicine was filled (COUNT, MIN, MAX)
QUANTITY SUPPLY: Number of pills filled (COUNT, MIN, MAX)
DAYS SUPPLY: Number of days prescription should last (COUNT, MIN, MAX)
DOSE PER DAY: Total mg per day (SUM, COUNT, MIN, MAX, AVG, Detail)
TOTAL COST: Cost for this prescription (SUM, COUNT, MIN, MAX, AVG, Detail)
Table A.17: Medication variables, categories subtable
CATGY CODE: Codes for drug categories from OHP's formulary (list of 496 codes; COUNT, MIN, MAX)
CATGY DESC: Drug categories from OHP's formulary (list of 496 categories of drugs; COUNT, MIN, MAX)
Table A.18: Medication variables, classes subtable
CLASS CODE: Codes for drug classes (list of 63 codes of drug classes; COUNT, MIN, MAX)
CLASS DESC: Descriptions of drug classes (list of 63 classes of drugs; COUNT, MIN, MAX)
Appendix B
CART Software

Aspects of the CART software that are important to understanding this dissertation's methodology are presented below, drawn from www.salford-systems.com. A full presentation is available in the CART manual and in (Breiman et al., 1984).
B.1
Features

CART uses an intuitive, Windows-based interface, making it accessible to both technical and non-technical users. Underlying the "easy" interface, however, is a mature theoretical foundation that distinguishes CART from other methodologies and other decision trees. Salford Systems' CART is the only decision tree system based on the original CART code developed by world-renowned Stanford University and University of California at Berkeley statisticians; this code now includes enhancements co-developed by Salford Systems and CART's originators. Based on a decade of machine learning and statistical research, CART provides stable performance and reliable results. Its proven methodology is characterized by:

1. A reliable pruning strategy. CART's developers determined definitively that no stopping rule could be relied on to discover the optimal tree, so they introduced the notion of over-growing trees and then pruning back; this idea,
Figure B.1: CART's main modeling window

fundamental to CART, ensures that important structure is not overlooked by stopping too soon. Other decision tree techniques use problematic stopping rules.

2. A powerful binary split search approach. CART's binary decision trees are more sparing with data and detect more structure before too little data is left for learning. Other decision tree approaches use multi-way splits that fragment the data rapidly, making it difficult to detect rules that require broad ranges of data to discover.

3. Automatic self-validation procedures. In the search for patterns in databases it is essential to avoid the trap of "overfitting," or finding patterns that apply only to the training data. CART's embedded test disciplines ensure that the patterns found will hold up when applied to new data. Further, the testing
and selection of the optimal tree are an integral part of the CART algorithm. Testing in other decision tree techniques is conducted after the fact, and tree selection is left up to the user.
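The overgrow-then-prune strategy can be illustrated with scikit-learn, whose DecisionTreeClassifier is a CART-style tree (an open-source analogue used here for illustration, not Salford Systems' implementation; the dataset is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# 1. Over-grow: no stopping rule, split until the leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# 2. Compute the nested sequence of cost-complexity pruned subtrees.
alphas = full.cost_complexity_pruning_path(X, y).ccp_alphas

# 3. Self-validation: keep the pruned subtree that cross-validates best.
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```

The pruned tree is never larger than the fully grown one, and its size is chosen by held-out performance rather than by a stopping rule.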
Figure B.2: CART's model of optimal tree

In addition, CART accommodates many different types of real world modeling problems by providing a unique combination of automated solutions:
B.1.1
Surrogate splitters intelligently handle missing values
CART handles missing values in the database by substituting “surrogate splitters,” which are back-up rules that closely mimic the action of primary splitting rules. The surrogate splitter contains information that is typically similar to what would be found in the primary splitter. Other products’ approaches treat
all records with missing values as if the records all had the same unknown value; with that approach all such "missings" are assigned to the same bin. In CART, each record is processed using data specific to that record; this allows records with different data patterns to be handled differently, which results in a better characterization of the data.
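The surrogate idea can be sketched in a few lines: among the complete cases, find the back-up variable whose split most often agrees with the primary split, then use it to route records where the primary value is missing. The records, variables, and cut points below are hypothetical.

```python
def side(value, threshold):
    """Which side of a binary split a value falls on."""
    return "left" if value <= threshold else "right"

def best_surrogate(records, primary, threshold, candidates):
    """Pick the (feature, cut) that most often agrees with the
    primary split on records where both values are observed."""
    best, best_agree = None, -1.0
    for feat, cut in candidates:
        pairs = [r for r in records
                 if r.get(primary) is not None and r.get(feat) is not None]
        if not pairs:
            continue
        agree = sum(side(r[primary], threshold) == side(r[feat], cut)
                    for r in pairs) / len(pairs)
        if agree > best_agree:
            best, best_agree = (feat, cut), agree
    return best, best_agree

records = [
    {"hba1c": 8.2, "glucose": 190, "age": 71},
    {"hba1c": 6.4, "glucose": 110, "age": 44},
    {"hba1c": 9.9, "glucose": 240, "age": 68},
    {"hba1c": 5.8, "glucose": 100, "age": 35},
    {"hba1c": None, "glucose": 210, "age": 60},  # primary value missing
]

surrogate, agreement = best_surrogate(
    records, primary="hba1c", threshold=7.0,
    candidates=[("glucose", 150), ("age", 50)])

# Route the record whose primary value is missing using the surrogate:
route = side(records[-1][surrogate[0]], surrogate[1])
```

Real CART ranks several surrogates per node during tree growing; this sketch only picks the single best one.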
B.1.2
Adjustable misclassification penalties avoid errors
CART can accommodate situations in which some misclassifications, or cases that have been incorrectly classified, are more serious than others. CART users can specify a higher penalty for misclassifying certain data, and the software will steer the tree away from that type of error. Further, when CART cannot guarantee a correct classification, it will try to ensure that the error it does make is less costly. If credit risk is classified as low, moderate, or high, for example, it would be much more costly to classify a high risk person as low risk than as moderate risk. Traditional data mining tools cannot distinguish between these errors.
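The expected-cost logic behind adjustable penalties can be sketched as follows: instead of predicting the most probable class at a leaf, predict the class with the lowest expected misclassification cost. The cost matrix and leaf probabilities are hypothetical.

```python
RISK = ["low", "moderate", "high"]

# COST[true][predicted]: calling a high-risk case "low" is costliest.
COST = {
    "low":      {"low": 0,  "moderate": 1, "high": 2},
    "moderate": {"low": 2,  "moderate": 0, "high": 1},
    "high":     {"low": 10, "moderate": 3, "high": 0},
}

def min_cost_class(probs):
    """probs: P(true class) for one case, e.g. from a tree leaf."""
    expected = {pred: sum(probs[t] * COST[t][pred] for t in RISK)
                for pred in RISK}
    return min(expected, key=expected.get)

# A leaf where "low" is most probable but "high" is not negligible:
leaf = {"low": 0.5, "moderate": 0.2, "high": 0.3}
```

Here `min_cost_class(leaf)` returns "high" even though "low" is the most probable class, because the penalty for missing a high-risk case dominates the expected cost.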
B.1.3
Alternative splitting criteria
CART includes seven single-variable splitting criteria (Gini, symmetric Gini, twoing, ordered twoing, and class probability for classification trees, and least squares and least absolute deviation for regression trees) and one multi-variable splitting criterion, the linear combinations method. The default Gini method typically performs best, but, given specific circumstances, other methods can generate more accurate models. CART's unique "twoing" procedure, for example, is tuned for classification problems with many classes, such as modeling which of 170 products would be chosen by a given consumer. To deal more effectively with
select data patterns, CART also offers splits on linear combinations of continuous predictor variables.
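The two classification criteria discussed most here can be written directly from Breiman et al. (1984): the Gini impurity decrease and the twoing value of a candidate binary split, computed from per-class counts in the left and right children. The example split is hypothetical.

```python
def gini(counts):
    """Gini impurity i(t) = 1 - sum_j p(j|t)^2 from class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_decrease(left, right):
    """Impurity decrease of a split: i(t) - pL*i(tL) - pR*i(tR)."""
    nL, nR = sum(left), sum(right)
    n = nL + nR
    parent = [l + r for l, r in zip(left, right)]
    return gini(parent) - (nL / n) * gini(left) - (nR / n) * gini(right)

def twoing(left, right):
    """Twoing value: (pL*pR/4) * (sum_j |p(j|tL) - p(j|tR)|)^2."""
    nL, nR = sum(left), sum(right)
    n = nL + nR
    pL, pR = nL / n, nR / n
    spread = sum(abs(l / nL - r / nR) for l, r in zip(left, right))
    return (pL * pR / 4.0) * spread ** 2

# A 3-class node split into left = [30, 10, 0] and right = [0, 10, 30]:
left, right = [30, 10, 0], [0, 10, 30]
```

For this split the Gini decrease is 0.28125 and the twoing value 0.140625; twoing rewards splits that separate the class distribution into two maximally different groups, which is why it scales well to many-class problems.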
Figure B.3: CART’s detail screen
Appendix C
Data Mining Software

This appendix is a cursory review of the breadth of data mining technologies based on the www.kdnuggets.com website. Every one of the areas in Table C.1 has many data mining software packages within it. It is meant to give a sense of where the CART software used in this study fits in.

Table C.1: Data mining categories
- Suites, supporting classification, clustering, and data preparation
- Classification: building models to separate 2 or more discrete classes using multiple approaches, decision trees or rules, or other technologies
- Clustering for finding clusters or segments
- Data Transformation and Cleaning
- Visualization software
- Agents
- Association rules and market basket analysis
- Audio and Video Mining
- Bayesian and Dependency Networks
- Database and OLAP
- Deviation and Fraud Detection
- Estimation, Regression and Forecasting
- Libraries and Developer Kits for creating embedded data mining applications
- Sequential Patterns software
- Simulation of Processes
- Statistical Analysis software
- Text Analysis and Information Retrieval (IR) tools for searching and analyzing unstructured texts
- Web Mining: clickstream, log analysis, XML mining
- Web Searching: search engines
The CART software is located within the category of Classification: building models to separate 2 or more discrete classes using multiple approaches, decision trees or rules, or other technologies. The subgroups within this category are listed in Table C.2.

Table C.2: Classification software types
- Multiple approaches, typically including both decision-tree and neural-network models, as well as some way to combine and compare them
- Decision tree or rule-based approaches
- Neural networks
- Bayesian and Dependency Networks
- Other approaches, including Support Vector Machines, Rough Sets, and Genetic Algorithms
Within the decision tree or rule-based approaches, there are many software products, as listed in Tables C.3, C.4, C.5, and C.6.

Table C.3: Decision tree software, free
- C4.5, the "classic" decision-tree tool, developed by J. R. Quinlan (restricted distribution)
- EC4.5, a more efficient version of C4.5, which uses the best among three strategies at each node construction
- IND, provides Gini and C4.5 style decision trees and more; publicly available from NASA but with export restrictions
- LMDT, builds Linear Machine Decision Trees (based on Brodley and Utgoff papers)
- ODBCMINE, shareware data-mining tool that analyzes ODBC databases using C4.5 and outputs simple IF-ELSE decision rules in ASCII
- OC1, decision tree system for continuous feature values; builds decision trees with linear combinations of attributes at each internal node; these trees then partition the space of examples with both oblique and axis-parallel hyperplanes
- PC4.5, a parallel version of C4.5 built with the Persistent Linda (PLinda) system
- PLUS, Polytomous Logistic regression trees with Unbiased Splits (Fortran 90)
Table C.4: Decision tree software, commercial
- AC2, provides graphical tools for data preparation and building decision trees
- Alice d'Isoft 6.0, a streamlined version of ISoft's decision-tree-based AC2 data-mining product, designed for mainstream business users
- Business Miner, data mining product positioned for the mainstream business user
- C5.0/See5, constructs classifiers in the form of decision trees and rulesets; includes latest innovations such as boosting
- CART 4.0 decision-tree software, from winners of the KDD Cup 2000; advanced facilities for data mining, data pre-processing, and predictive modeling, including bagging and arcing
- Cognos Scenario, allows you to quickly identify and rank the factors that have a significant impact on your key business measures
- Decisionhouse, provides data extraction, management, pre-processing, and visualization, plus customer profiling, segmentation, and geographical display
- KnowledgeSEEKER, high performance interactive decision tree analytical tool
- Neusciences aXi.DecisionTree, ActiveX control for building a decision tree; handles discrete and continuous problems and can extract rules from the tree
- PolyAnalyst, includes an information gain decision tree among its 11 algorithms
- SPSS AnswerTree, easy to use package with CHAID and other decision tree algorithms; includes decision tree export in XML format
- XpertRule Miner (Attar Software), provides graphical decision trees with the ability to embed as ActiveX components
Table C.5: Rule based approaches, free
- CBA, mines association rules and builds accurate classifiers using a subset of association rules
- Claudien, a clausal discovery engine
- CN2, inductively learns a set of propositional if...then... rules from a set of training examples by performing a general-to-specific beam search through rule-space
- DBPredictor
- KINOsuite-PR, extracts rules from trained neural networks
- RIPPER, a system that learns sets of rules from data; fast, asymptotically O(n*logn*logn), where n is the number of cases; ANSI C, Unix; for research purposes only
Table C.6: Rule based approaches, commercial
- AIRA, a rule discovery, data and knowledge visualization tool; AIRA for Excel extracts rules from MS-Excel spreadsheets
- Datamite, enables rules and knowledge to be discovered in ODBC-compliant relational databases
- DataDowser, finds IF [AND] THEN association rules; uses fuzzy logic
- PolyAnalyst, builds fuzzy logic classification rules with PolyNet Predictor, SKAT, or Linear Regression
- SuperQuery, business intelligence tool; works with Microsoft Access and Excel and many other databases
- WizWhy, automatically finds all the if-then rules in the data and uses them to summarize the data, identify exceptions, and generate predictions for new cases
- XpertRule Miner (Attar Software), provides association rule discovery from any ODBC data source
Appendix D
DATA Software

Aspects of the DATA software that are important to understanding this dissertation's methodology are presented below, drawn from www.treeage.com. A full presentation is available in the DATA manual and on the TreeAge website.
D.1
Features
D.1.1
Healthcare decision making with DATA
Clinicians use DATA to make complex treatment decisions, often when it is important to get the patient's input on assigning values (utilities) to potential outcomes. When multiple treatment options are available, and outcomes are uncertain, DATA is an invaluable tool for presenting the options, specifying the risks, and quantifying the patient's attitude towards them. Public health researchers and pharmacoeconomists use DATA to develop cost-effective therapies in an era of limited resources and growing demands. With DATA, it is possible to model treatment protocols, immunization programs, diagnostic testing, and pharmaceutical R&D projects, and analyze these complex decisions on the basis of a single criterion (health outcomes) or multiple criteria (cost-effectiveness, benefit-cost).
D.1.2
Cost-effectiveness analysis and more
A model can be analyzed on the basis of expected costs, expected effectiveness, or combined cost-effectiveness. DATA automatically generates graphs and reports including marginal cost, marginal effectiveness, and marginal cost-effectiveness information. Instances of absolute or extended dominance (including parameters) are identified both graphically and in text reports. Powerful sensitivity analysis can be carried out in cost-effectiveness models.
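The marginal cost-effectiveness bookkeeping described here reduces to sorting strategies by expected cost, flagging absolutely dominated ones (more costly yet no more effective), and computing incremental cost-effectiveness ratios against the last non-dominated strategy. The strategies and numbers below are hypothetical.

```python
def icer_table(strategies):
    """strategies: list of (name, expected_cost, expected_effectiveness).
    Returns (name, icer-or-'dominated') rows in increasing-cost order."""
    ordered = sorted(strategies, key=lambda s: s[1])
    rows, base = [], ordered[0]
    for name, cost, eff in ordered[1:]:
        if eff <= base[2]:          # costs more, no more effective
            rows.append((name, "dominated"))
            continue
        icer = (cost - base[1]) / (eff - base[2])  # extra $ per extra unit
        rows.append((name, icer))
        base = (name, cost, eff)    # new comparator
    return rows

strategies = [("usual care", 1000.0, 5.0),
              ("drug A",     3000.0, 5.5),
              ("drug B",     2500.0, 4.8)]
rows = icer_table(strategies)
```

Here "drug B" is absolutely dominated by usual care, and "drug A" costs an extra $4,000 per unit of effectiveness gained. (Extended dominance, which DATA also reports, would additionally compare ICERs along the frontier.)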
Figure D.1: DATA software screenshot, cost effectiveness

To go beyond expected value information, DATA offers extensive Monte Carlo simulation capabilities. Markov processes are graphically displayed with nodes to identify all of the potential states and the potential state transitions.
DATA also includes tunnel states and the ability to define tables that specify state transition probabilities for any stage during the Markov model analysis.
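A Markov cohort model of the kind DATA displays graphically can be sketched in a few lines: a set of states, a per-cycle transition matrix, and per-cycle cost and utility rewards accumulated with discounting. The states, probabilities, costs, and utilities below are hypothetical.

```python
STATES = ["well", "sick", "dead"]
P = {  # P[from][to]; each row sums to 1, "dead" is absorbing
    "well": {"well": 0.90, "sick": 0.08, "dead": 0.02},
    "sick": {"well": 0.10, "sick": 0.70, "dead": 0.20},
    "dead": {"well": 0.00, "sick": 0.00, "dead": 1.00},
}
COST = {"well": 100.0, "sick": 2000.0, "dead": 0.0}      # per cycle
UTILITY = {"well": 1.0, "sick": 0.6, "dead": 0.0}        # QALYs per cycle

def run_cohort(cycles, start="well", discount=0.03):
    """Evolve a cohort distribution over states, accumulating
    discounted expected cost and quality-adjusted life years."""
    dist = {s: (1.0 if s == start else 0.0) for s in STATES}
    total_cost = total_qaly = 0.0
    for t in range(cycles):
        d = 1.0 / (1.0 + discount) ** t
        total_cost += d * sum(dist[s] * COST[s] for s in STATES)
        total_qaly += d * sum(dist[s] * UTILITY[s] for s in STATES)
        dist = {to: sum(dist[s] * P[s][to] for s in STATES)
                for to in STATES}
    return total_cost, total_qaly, dist

cost, qaly, final = run_cohort(cycles=20)
```

A tunnel state would simply be a chain of temporary states with forced exits; stage-dependent tables correspond to making `P` a function of `t`.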
Figure D.2: DATA software screenshot, Markov

Other features of particular interest to the medical and pharmaceutical communities include:
1. Built-in Markov analysis-related functions, such as the declining exponential approximation of life expectancy ("DEALE"), conversion of mortality rates into probability of death, discounted utilities, and many parameterized distributions.
2. Implementation of Bayes' revision to convert test/procedure sensitivity and specificity values into meaningful decision probabilities.
3. Table interface allows the entry of mortality or other Markov analysis-related tables for use in any number of trees.
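Bayes' revision (item 2 above) converts a test's sensitivity and specificity, plus a pretest probability, into the post-test probabilities a decision tree actually needs at its chance nodes. A sketch with hypothetical test characteristics:

```python
def bayes_revision(pretest, sensitivity, specificity):
    """Return (P(disease | positive test), P(no disease | negative test))."""
    p_pos = sensitivity * pretest + (1 - specificity) * (1 - pretest)
    ppv = sensitivity * pretest / p_pos                # positive predictive value
    npv = specificity * (1 - pretest) / (1 - p_pos)   # negative predictive value
    return ppv, npv

# Hypothetical screen: 90% sensitive, 95% specific, 10% pretest probability.
ppv, npv = bayes_revision(pretest=0.10, sensitivity=0.90, specificity=0.95)
```

With these numbers a positive result raises the disease probability from 10% to about 67%, while a negative result lowers it to just over 1%; it is these revised probabilities, not sensitivity and specificity themselves, that belong on the tree's branches.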
Appendix E
Literature Searching Methodology

A Medline and HealthStar search for "data mining" or "knowledge discovery" in titles or abstracts was conducted for English articles. The Research Index, CiteSeer (http://citeseer.nj.nec.com/cs), and http://gubbio.cs.berkeley.edu/mlpapers were searched for "diabetes" or "healthcare" or "medical" to capture the computer literature. Books with the title "data mining" or "knowledge discovery" were searched for at the Library of Congress, the University of California, www.Amazon.com, and www.Barnesandnoble.com. www.kdnuggets.com was searched for additional books and articles. There are several key conferences held on data mining that generate collections of papers, as detailed in Table E.1, and the table of contents of each of these was reviewed for relevant articles. The Japanese Discovery Science Conference has been held each December since 1998. A European symposium held in September, the Principles of Knowledge Discovery in Databases conference, started in 1997. The annual Pacific Asia Conference on Knowledge Discovery and Data Mining also started in 1997. The Data Warehousing and Knowledge Discovery Conference started in 1999. Perhaps the most important data mining conference is the International Conference on Data Mining and Knowledge Discovery, or KDD. It has been held
since 1995 and is sponsored by the Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) of the Association for Computing Machinery. Other solitary conferences have also collected papers that are valuable to our topic. AIMDM'99 was the Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making. ISMDA was the First International Symposium on Medical Data Analysis. The following table gives details about the conference proceedings that have been reviewed for relevant articles for this study.
Table E.1: Conferences on data mining and knowledge discovery
Abbrev. | Date | Published proceedings reference | Library call #
KDD-95 | Aug 20-21 | (Fayyad & Uthurusamy, 1995) | QA76.9 D3I55866
KDD-96 | Aug 2-4 | (Simoudis, Han, & Fayyad, 1996) | QA76.9 D3I55866
KDD-97 | Aug 14-17 | (Heckerman, 1997) | QA76.9 D3I55866
KDD-98 | Aug 27-31 | (Agrawal & Stolorz, 1998) | QA76.9 D3I55866
KDD-99 | Aug 15-18 | (Chaudhuri & Madigan, 1999) | QA76.9.D3
KDD-00 | Aug 20-23 | (Ramakrishnan & Stolfo, 2000) | QA76.9.D3
KDD-01 | Aug 26-29 | (Provost, 2001) | QA76.9.D3
DS-98 | Dec 14-16 | LNAI(a) 1532 (Arikawa & Motoda, 1998) | Q174.D57
DS-99 | Dec 6-8 | LNAI 1721 (Arikawa & Furukawa, 1999) | Q174.D57
DS-00 | Dec 4-6 | LNAI 1967 (Arikawa & Morishita, 2000) | Q174.D57
DS-01 | Nov 25-28 | LNAI 2226 (Jantke & Shinohara, 2001) | Q174.D57
PKDD-97 | Jun 24-27 | LNAI 1263 (Komorowski & Zytkow, 1997) | QA76.9.D3P16
PKDD-98 | Sep 23-26 | LNAI 1394 (Zytkow & Quafafou, 1998) | QA76.9.D3 P528
PKDD-99 | Sep 15-18 | LNAI 1704 (Zytkow & Rauch, 1999) | QA76.9.D343 P553
PKDD-00 | Sep 13-16 | LNAI 1910 (Zighed, Komorowski, & Zytkow, 2000) | QA76.9.D343 P553
PKDD-01 | Sep 3-5 | LNAI 2168 (Raedt & Siebes, 2001) | QA76.9.D343 P553
PAKDD-97 | Feb 23-24 | (Lu, Motoda, & Liu, 1997) | QA76.9.D3P17
PAKDD-98 | Apr 15-17 | LNAI 1394 (Wu, Ramamohanarao, & Korb, 1998) | QA76.9.D3P17
PAKDD-99 | Apr 26-28 | LNAI 1574 (Zhong & Zhou, 1999) | QA76.9.D3P17
PAKDD-00 | Apr 18-20 | LNAI 1805 (Terano, Liu, & Chen, 2000) | QA76.9.D3P17
PAKDD-01 | Apr 16-20 | LNAI 2035 (Cheung, Williams, & Li, 2001) | QA76.9.D3P17
DaWaK-99 | Aug 30-Sep 1 | LNCS 1676 (Mohania & Tjoa, 1999) | QA76.9.D37
DaWaK-00 | Sep 4-6 | LNAI 1874 (Kambayashi, Mohania, & Tjoa, 2000) | QA76.9.D37
DaWaK-01 | Sep 5-7 | LNAI 2114 (Kambayashi, Winiwarter, & Arikawa, 2001) | QA76.9.D37
AIMDM-99 | Jun 20-24 | LNCS 1620 (Horn, 1999) | R859.7.A78 J65
ISMDA-00 | Sep 29-30 | LNCS 1933 (Brause & Hanisch, 2000) | R853.S7 M43

(a) Lecture Notes in Artificial Intelligence number (a subseries of LNCS: Lecture Notes in Computer Science, Springer Publishers)
Appendix F
Manager and Clinician Survey (copy of email text)

Re: Results from diabetes registry analysis

Dear ..........

This brief survey asks your opinion about 3 results from diabetes registry analysis. Please click on the "reply" button, enter your response by typing an X between the appropriate brackets, and then click on the "send" button. If you prefer, print it out and interoffice mail it to me at the New Orleans East Clinic. More information is listed below the survey if you are interested. Should you have any questions, you can reach me at
[email protected] or beeper 423-4451. Thanks for your help. Sincerely, Joe Breault . 1. Analysis result: Adult diabetics with HbA1c average >9.5 were 3.2 (95%CI: 2.78, 3.77) times more likely to be 65 years of age. Do you find this is new information, i.e., not already familiar with it? [ ]Yes [ ]No Do you find this useful for clinical practice or population management? [ ]Yes [ ]No . 2. Analysis result: Adult diabetic patients with more frequent outpatient visits did not have less chance of an ER visit. Do you find this is new information, i.e., not already familiar with it? [ ]Yes [ ]No Do you find this useful for clinical practice or population management? [ ]Yes [ ]No . 3. Analysis result: Adult diabetic patients who die during hospitalization are 10.6 (95%CI: 7.74, 14.55) times more likely to have renal disease than not. Do you find this is new information, i.e., not already familiar with it? [ ]Yes [ ]No Do you find this useful for clinical practice or population management? [ ]Yes [ ]No . 4. Demographic Questions: Years in practice: [ ] if a clinician now or in the past Do you label yourself as a clinician [ ], a manager [ ], or both [ ]? ---------END OF SURVEY--------Additional information for those who are interested (optional): >30,000 diabetic patients in the diabetic registry Analysis was done on 15,393 with at least 2 HbA1c tests & 2 outpatient services Analysis done using CART (classification and regression trees) Study has IRB approval Study methodology is available on the intranet at http://xxxxx Study results are available on the intranet at http://xxxxx -----------------------------------------------Joseph L. Breault, MD, MS, MPH Associate Director & Research Director .......... Family Practice Residency Clinical Associate Professor of Family Medicine, Tulane HSC Clinical Assistant Professor of Family Medicine, LSU HSC
Appendix G
Institutional Review Board

The Institutional Review Board application to the institution that owns the diabetic data warehouse was filed in May 2001. Approval for this research was granted on June 5, 2001. The application is reproduced below in abbreviated form.

CLINICAL INVESTIGATIONS COMMITTEE

1. TITLE OF PROTOCOL: Data Mining Diabetic Databases to Improve Outcomes
2. DEPARTMENT and NAME OF PRINCIPAL INVESTIGATOR (PI): Family Practice, Joseph L. Breault.
3. STUDY SPONSOR: n/a. DOES STUDY HAVE EXTRAMURAL FUNDING? No.
4. DATES OF PROJECT PERIOD: FROM 6/2000 TO 6/2002.
5. IS THIS STUDY AN IN-HOUSE STUDY OR PART OF A REGIONAL, NATIONAL, ETC., COOPERATIVE RESEARCH PROGRAM? In-house. IF THIS IS AN IN-HOUSE STUDY, HAS IT BEEN REVIEWED BY A BIOSTATISTICIAN? YES. IF YES, PLEASE PROVIDE HIS/HER NAME: Colin Goodall, Ph.D.
6. DOES THE PRINCIPAL INVESTIGATOR, COLLABORATING INVESTIGATOR(S), OR ANY OF THEIR IMMEDIATE FAMILY MEMBERS: a) HAVE ANY MONETARY INTEREST (> THAN $5,000) IN THE DEVICE/PRODUCT EMPLOYED IN THIS STUDY? NO; b) HOLD AN OWNERSHIP SHARE (STOCKS, STOCK OPTIONS OR ROYALTIES) IN THE DEVICE/PRODUCT INVOLVED IN THIS STUDY OR IN THE COMPANY(S) WHICH OWNS THE DEVICE/PRODUCT? NO; c) STAND TO GAIN FINANCIALLY (other than as described in the study budget) - WHETHER DIRECTLY OR INDIRECTLY - AS A RESULT OF THE CONDUCT OF THIS STUDY? NO.
7. DOES THIS STUDY INVOLVE THE USE OF TISSUE (no), BLOOD (no), QUESTIONNAIRES (no), PROSPECTIVE CHART REVIEW (no), INTERVIEWS (no), RETROSPECTIVE CHART REVIEW (no), DRUGS (no), DEVICES (no), LASERS (no), RADIOACTIVE MATERIALS (no), GENE THERAPY (no), OR OTHER (Yes: Analysis of diabetic data warehouse data files.)
8. WILL WOMEN OF CHILD-BEARING POTENTIAL BE INVOLVED? No
9. WHAT IS YOUR DEADLINE FOR CIC APPROVAL? ASAP

I, as indicated by my signature below, assure that I understand my roles and responsibilities as the Principal Investigator. Only those persons legally and responsibly entitled to will be allowed to conduct procedures performed under this protocol. Any change in the protocol will be submitted to the Clinical Investigations Committee (CIC). I understand that continuing review is required in order to maintain approval and that it is my responsibility to ensure the proper reports are submitted in a timely fashion. I also understand that serious and/or unexpected events must be reported to the CIC within five (5) working days once the event has been recognized.

Investigator Signature: Joseph L. Breault, MD

PLEASE COMPLETE THE FOLLOWING QUESTIONS - "SEE ATTACHED" IS NOT AN ACCEPTABLE RESPONSE. PLEASE USE ADDITIONAL PAGES AS NECESSARY.

1. Outline the specific objectives of the study. (Include whether the major objectives are to test toxicity of treatment or the benefit of treatment.)
The study objective is to apply modern data mining techniques (knowledge discovery in databases) to the analysis of the diabetic data warehouse to improve outcomes. This may be done in a variety of ways. Techniques may include classification and regression tree analysis (e.g., CART 4.0 software), rough set analysis (e.g., ROSETTA software), etc., in addition to traditional logistic and multivariate regression techniques. Outcomes may include improvement of HEDIS and guidelines measures, improved detection of medical errors, reduced cost and hospitalizations, etc. Data mining methods hold the promise of finding interesting and novel associations that can be modeled in useful ways to guide management and clinicians to conditions that improve outcomes. The attached literature review and methodology has additional details.

2. All experiments involving drug testing should state the phase of the investigation (I, II, III, etc.)
and what experience has shown in the previous phases of the study.
A: This study does not involve any experiments with drug testing.

3. Describe all aspects of your study that will involve human subjects or tissue (specifically dosage, method of administration of drugs or other agents, testing procedures, questionnaires, interviews, normal volunteers or patients, etc.). Include radiation dose calculations, if applicable. Include safeguards employed.
A: This study does not involve any contact with patients, human subjects, or tissue. Data safeguards implemented include: Study reports (presentations, posters, publications, etc.) will report aggregated data only and the analysis methodology used; there will be no patient identifiers, and it will not be possible to identify any specific patients from the summary reports. This will be true for in-house QA reports as well as any external reports. Data warehouse files that include patient identifiers, such as clinic numbers to link different files within the data warehouse, will be accessed and analyzed within the security of the IS system. If anything in
this project is ever sent outside the firewall, it would be only data without patient identifiers, such as summary aggregate data. The primary computer used for the analysis (the PI's workplace computer) is password protected for access to the computer, defaults after a few minutes of nonuse to a screensaver that requires a password for reentry, and has a password-protected hard disk. Thus, even if the computer or its hard disk were stolen, any information or data on it would be inaccessible. Once the data warehouse has been transformed or aggregated into useable flat files for the data mining software, the patient clinic number and name, along with any other unique identifying codes, will be deleted from the flat files that will be used in this research project. Patient-level identifying codes will only be used on a temporary basis to integrate the tables within the data warehouse, after which a different unique numbering system will be used to distinguish individual records.

ATTACH A COPY OF THE CONSENT FORM TO THIS PROTOCOL: A consent form is not applicable since there is no contact with patients.
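The de-identification step described in the answer above (temporary use of clinic numbers to join tables, then substitution of a study-specific numbering) can be sketched as follows; the field names and toy records are hypothetical, not the warehouse's actual schema:

```python
import random

# Toy subtables keyed by the clinic number (the identifier to be removed).
patients = [{"clinic_no": 1001, "age": 67}, {"clinic_no": 1002, "age": 54}]
labs = [{"clinic_no": 1001, "hba1c": 8.9}, {"clinic_no": 1002, "hba1c": 6.8}]

def deidentify(patients, labs, seed=0):
    """Join subtables on clinic_no, then emit a flat file keyed by a
    randomly assigned study ID with the clinic number dropped."""
    rng = random.Random(seed)
    clinic_nos = [p["clinic_no"] for p in patients]
    # Random permutation so the new IDs carry no ordering information.
    study_id = {c: i for i, c in enumerate(
        rng.sample(clinic_nos, len(clinic_nos)), start=1)}
    lab_by_clinic = {l["clinic_no"]: l for l in labs}
    return [{"study_id": study_id[p["clinic_no"]],
             "age": p["age"],
             "hba1c": lab_by_clinic[p["clinic_no"]]["hba1c"]}
            for p in patients]

flat = deidentify(patients, labs)
```

The resulting rows still link each patient's clinical values together but contain no clinic number, matching the protocol's requirement that identifiers be used only transiently for table integration.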
References

Abston, K. C. (1999). Using the electronic medical record to predict the pharmacological management of acute myocardial infarction. Unpublished doctoral dissertation, The University of Utah.

Adams, P. F., Hendershot, G. E., & Marano, M. A. (1999). Current estimates from the National Health Interview Survey, 1996. Hyattsville, MD: U.S. Dept. of Health and Human Services, Centers for Disease Control and Prevention, and National Center for Health Statistics.

Adriaans, P., & Zantinge, D. (1996). Data mining. Reading, MA: Addison-Wesley.

Agardh, E. (1997). Views on the National Diabetes Registry. Lakartidningen, 94(22), 2068.

Agrawal, R., & Stolorz, P. (Eds.). (1998). KDD-98: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining: August 27-31, 1998, New York, NY. Menlo Park, CA: AAAI Press.

Agre, G., & Koprinska, I. (1996). Case-based refinement of knowledge based neural networks. In J. S. Albus, A. Meystel, & R. Quintero (Eds.), Intelligent systems: a semiotic perspective: proceedings of the 1996 International Multidisciplinary Conference, October 20-23, 1996, Gaithersburg, MD. Washington, D.C.: U.S.G.P.O.

Altman, R. B. (1997). Informatics in the care of patients: ten notable challenges. Western Journal of Medicine, 166(2), 118–122.

AMA, JCAHO, & NCQA. (2001). Coordinated performance measurement for the management of adult diabetes: A consensus statement from The American Medical Association, The Joint Commission on Accreditation of Healthcare Organizations, and The National Committee for Quality Assurance. http://www.ama-assn.org/ama/upload/mm/370/diabetes.pdf.

American Diabetes Association. (2002). American Diabetes Association website. www.diabetes.org, accessed 1/27/02.

American Diabetic Association. (2001). Implications of the United Kingdom Prospective Diabetes Study. American Diabetic Association: Clinical practice recommendations 2001. Diabetes Care, 24(Supplement 1), S28–S32.
Arikawa, S., & Furukawa, K. (1999). Discovery Science: Second International Conference, DS'99, Tokyo, Japan, December 6-8, 1999, proceedings. Berlin; New York: Springer.

Arikawa, S., & Morishita, S. (2000). Discovery Science: Third International Conference, DS 2000, Kyoto, Japan, December 4-6, 2000, proceedings. Berlin; New York: Springer.

Arikawa, S., & Motoda, H. (1998). Discovery Science: First International Conference, DS'98, Fukuoka, Japan, December 14-16, 1998, proceedings. Berlin; New York: Springer.

Armengol, E., Palaudaries, A., & Plaza, E. (2001). Individual prognosis of diabetes long-term risks: a CBR approach. Methods of Information in Medicine, 40(1), 46–51.

Babcock, C. (1996). Data carehouse. Computerworld, 30(45), 53–54.

Barriga, K. J., Hamman, R. F., Hoag, S., Marshall, J. A., & Shetterly, S. M. (1996). Population screening for glucose intolerant subjects using decision tree analyses. Diabetes Research and Clinical Practice, 34 Suppl, S17–29.

Bellazzi, R., Larizza, C., Magni, P., Montani, S., & Stefanelli, M. (2000). Intelligent analysis of clinical time series: an application in the diabetes mellitus domain. Artificial Intelligence in Medicine, 20(1), 37–57.

Bellazzi, R., Magni, P., Larizza, C., De Nicolao, G., Riva, A., & Stefanelli, M. (1998). Mining biomedical time series by combining structural analysis and temporal abstractions. American Medical Informatics Association Annual Symposium Proceedings, 160–164.

Berry, M. J. A., & Linoff, G. (2000). Mastering data mining: The art and science of customer relationship management. New York: Wiley Computer Publishing.

Bigus, J. P. (1996). Data mining with neural networks: Solving business problems from application development to decision support. New York: McGraw-Hill.

Bioch, J. C., van der Meer, O., & Potharst, R. (1996). Classification using Bayesian neural nets. In The 1996 IEEE International Conference on Neural Networks (pp. 1488–1493).
Washington, DC: Institute of Electrical and Electronics Engineers. Blonde, L. (2001). Epidemiology, costs, consequences, and pathophysiology of type 2 diabetes: An American epidemic. The Ochsner Journal, 3 (3), 126– 131.
Blum, R. L. (1982). Discovery, confirmation, and incorporation of causal relationships from a large time-oriented clinical database: the RX project. Computers and Biomedical Research, 15, 165–187.
Borok, L. S. (1997). Data mining: sophisticated forms of managed care modeling through artificial intelligence. Journal of Health Care Finance, 23 (3), 20–36.
Brause, R. W., & Hanisch, E. (2000). Medical data analysis: First International Symposium, ISMDA 2000, Frankfurt, Germany, September 29-30, 2000: proceedings. Berlin; New York: Springer.
Breault, J. L. (2001). Data mining diabetic databases: Are rough sets a useful addition? In A. Goodman, P. Smyth, X. Ge, & E. Wegman (Eds.), Computing Science and Statistics, 33rd Symposium on the Interface: June 13-16, 2001, Costa Mesa, CA (Vol. 33). Fairfax, VA: The Interface Foundation of North America. (in press)
Breault, J. L., Goodall, C. R., & Fos, P. J. (2002). Data mining a diabetic data warehouse. Artificial Intelligence in Medicine, in press.
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16 (3), 199–215.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.
Brodley, C., Lane, E., Lane, T., & Stough, T. M. (1999). Knowledge discovery and data mining. American Scientist, 86 (1), 54–61.
Brossette, S. E., Sprague, A. P., Hardin, J. M., Waites, K. B., Jones, W. T., & Moser, S. A. (1998). Association rules and data mining in hospital infection control and public health surveillance. Journal of the American Medical Informatics Association, 5 (4), 373–381.
Brossette, S. E., Sprague, A. P., Jones, W. T., & Moser, S. A. (2000). A data mining system for infection control surveillance. Methods of Information in Medicine, 39 (4-5), 303–310.
Brown, J., Glauber, H., & Nichols, G. (1998). Impact on a population-based registry of changing diagnostic thresholds for diabetes. Diabetes Care, 21 (8), 1374–1375.
Burn-Thornton, K. E., & Edenbrandt, L. (1998). Myocardial infarction: pinpointing the key indicators in the 12-lead ECG using data mining. Computers and Biomedical Research, 31 (4), 293–303.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., & Zanasi, A. (1998). Discovering data mining: From concept to implementation. Upper Saddle River, NJ: Prentice Hall.
Carlgren, G. (1996). It is justified to question the diabetes registry. Lakartidningen, 93 (6), 448.
Carpenter, G. A., & Markuzon, N. (1998). ARTMAP-IC and medical diagnosis: instance counting and inconsistent cases. Neural Networks, 11 (2), 323–336.
Centers for Disease Control. (1998). National diabetes fact sheet. http://www.cdc.gov/diabetes/pubs/facts98.htm.
Centers for Disease Control. (2001). Diabetes: A serious public health problem. http://www.cdc.gov/diabetes/pubs/glance.htm.
Chaudhuri, S., & Madigan, D. (Eds.). (1999). KDD-99: Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; proceedings, August 15-18, 1999, San Diego, CA. New York, NY: Association for Computing Machinery.
Cheung, D., Williams, G. J., & Li, Q. (2001). Advances in knowledge discovery and data mining: 5th Pacific-Asia Conference, PAKDD 2001, Hong Kong, China, April 16-18, 2001: proceedings. Berlin; New York: Springer.
Chyun, D., Obata, J., Kling, J., & Tocchi, C. (2000). In-hospital mortality after acute myocardial infarction in patients with diabetes mellitus. American Journal of Critical Care, 9 (3), 168–179.
Cios, K. J. (2000). Medical data mining and knowledge discovery [editorial]. IEEE Engineering in Medicine and Biology Magazine, 19 (4), 15–16.
Cooper, G. F., Aliferis, C. F., Ambrosino, R., Aronis, J., Buchanan, B. G., Caruana, R., Fine, M. J., Glymour, C., Gordon, G., Hanusa, B. H., Janosky, J. E., Meek, C., Mitchell, T., Richardson, T., & Spirtes, P. (1997). An evaluation of machine-learning methods for predicting pneumonia mortality. Artificial Intelligence in Medicine, 9 (2), 107–138.
Crichton, N. J., Hinde, J. P., & Marchini, J. (1997). Models for diagnosing chest pain: Is CART helpful? Statistics in Medicine, 16 (7), 717–727.
Dahlquist, G., & Nystrom, L. (1994). Starting a new Swedish registry of diabetes is a waste of resources! Lakartidningen, 91 (38), 3408–3409.
DeGroot, L. J., Jameson, J. L., & Burger, H. (Eds.). (2001). Endocrinology (4th ed.). Philadelphia: W.B. Saunders Co.
DeYoung, J. (2001). Tying a lucrative knot—weddingnetwork.com sits on the brink of database-sharing bliss, but the site has learned that you can't just rush into these things. PC Magazine, Feb 20, 2001, 17a.
Dhar, V., & Stein, R. (1997). Seven methods for transforming corporate data into business intelligence. Upper Saddle River, NJ: Prentice Hall.
Dillman, D. A. (2000). Mail and internet surveys: the tailored design method (2nd ed.). New York: J. Wiley & Sons.
Doering, S., Muller, E., Kopcke, W., Pietzcker, A., Gaebel, W., Linden, M., Muller, P., Muller-Spahn, F., Tegeler, J., & Schussler, G. (1998). Predictors of relapse and rehospitalization in schizophrenia and schizoaffective disorder. Schizophrenia Bulletin, 24 (1), 87–98.
Dorchy, H. (1999). Screening, prediction and prevention of type 1 diabetes: Role of the Belgian Diabetes Registry. Revue Medicale de Bruxelles, 20 (1), 15–20.
Draper, D. (2000). Bayesian hierarchical modeling. Statistics Group, University of Bath, UK. (Short Course Notes, April 9, 2000)
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley.
DuMouchel, W. (2001). Data squashing: Constructing summary data sets. In A. Goodman, P. Smyth, X. Ge, & E. Wegman (Eds.), Computing Science and Statistics, 33rd Symposium on the Interface: June 13-16, 2001, Costa Mesa, CA (Vol. 33). Fairfax, VA: Interface Foundation of North America. (in press)
DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., & Pregibon, D. (1999). Squashing flat files flatter. In S. Chaudhuri & D. Madigan (Eds.), KDD-99: proceedings, August 15-18, 1999, San Diego, CA; Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 6–15). New York, NY: Association for Computing Machinery.
Edelstein, H. A. (1999). Introduction to data mining and knowledge discovery (3rd ed.). Potomac, MD: Two Crows Corporation.
Edelstein, H. A. (2001). Pan for gold in the clickstream. Information Week, March 12, 2001, 77.
Eklund, P. W., & Hoang, A. (1998). Classifier selection and training set features: LMDT. http://citeseer.nj.nec.com/309003.html, accessed March 18, 2001.
Eriksen, L. R., Turley, J. P., Denton, D., & Manning, S. (1997). Data mining: a strategy for knowledge development and structure in nursing practice. Studies in Health and Technology Information, 46, 383–388.
Ewens, W. J., & Grant, G. R. (2001). Statistical methods in bioinformatics: An introduction. New York: Springer.
Falconer, J. A., Naughton, B. J., Dunlop, D. D., Roth, E. J., Strasser, D. C., & Sinacore, J. M. (1994). Predicting stroke inpatient rehabilitation outcome using a classification tree approach. Archives of Physical Medicine and Rehabilitation, 75 (6), 619–625.
Falkenberg, M., & Wernerson, M. (1996). A national registry on diabetes—cooperation is necessary for quality. Lakartidningen, 93 (14), 1318.
Fayyad, U. M. (1996). Advances in knowledge discovery and data mining. Menlo Park, CA: AAAI Press: MIT Press.
Fayyad, U. M., & Uthurusamy, R. (1995). KDD-95: Proceedings of the First International Conference on Knowledge Discovery and Data Mining, August 20-21, 1995, Montréal, Québec. Menlo Park, CA: AAAI Press.
Feinglass, J., Yarnold, P. R., McCarthy, W. J., & Martin, G. J. (1998). A classification tree analysis of selection for discretionary treatment. Medical Care, 36 (5), 740–747.
Flack, J. R. (1995). Seven years' experience with a computerized diabetes clinic database. Medinfo, 8 (Pt 1), 332.
Forgionne, G. A., Gangopadhyay, A., & Adya, M. (2000). Cancer surveillance using data warehousing, data mining, and decision support systems. Topics in Health Information Management, 21 (1), 21–34.
Fos, P. J., & Fine, D. J. (2000). Designing health care for populations: applied epidemiology in health care administration. San Francisco: Jossey-Bass Publishers.
Franco, S., Mitchell, C., & Buzon, R. (1997). Primary care physician access and gatekeeping: a key to reducing emergency department use. Clinical Pediatrics, 36 (2), 63–68.
Friedman, N. M., Gleeson, J. M., Kent, M. J., Foris, M., Rodriguez, D. J., & Cypress, M. (1998). Management of diabetes mellitus in the Lovelace Health Systems' EPISODES OF CARE program. Effective Clinical Practice, 1 (1), 5–11.
Gagliardino, J., Werneke, U., Olivera, E., Assad, D., Regueiro, F., Diaz, R., Pollola, J., & Paolasso, E. (1997). Characteristics, clinical course, and in-hospital mortality of non-insulin-dependent diabetic and nondiabetic patients with acute myocardial infarction in Argentina. Journal of Diabetes and Its Complications, 11 (3), 163–171.
Garbe, C., Buttner, P., Bertz, J., Burg, G., d'Hoedt, B., Drepper, H., Guggenmoos-Holzmann, I., Lechner, W., Lippold, A., Orfanos, C. E., et al. (1995). Primary cutaneous melanoma: Identification of prognostic groups and estimation of individual prognosis for 5,093 patients. Cancer, 75 (10), 2484–2491.
Goebel, M., & Gruenwald, L. (1999). A survey of data mining and knowledge discovery software tools. SIGKDD Explorations, 1 (1), 20–33.
Goodall, C. (1995). Massive data sets in healthcare. In Massive data sets. Washington, DC: Committee on Applied and Theoretical Statistics, National Academy of Sciences, National Research Council; online publication at http://bob.nap.edu/html/massdata/media/cgoodall-t.html.
Goodall, C. R. (1999). Data mining of massive datasets in healthcare. Journal of Computational and Graphical Statistics, 8 (3), 620–634.
Goodwin, L., Prather, J., Schlitz, K., Iannacchione, M. A., Hage, M., Hammond, W. E., & Grzymala-Busse, J. (1997). Data mining issues for improved birth outcomes. Biomedical Sciences Instrumentation, 34, 291–296.
Gujarati, D. N. (1995). Basic econometrics (3rd ed.). New York: McGraw-Hill, Inc.
Gunopulos, D., & Das, G. (2000). Time series similarity measures (Tutorial PM2). In R. Ramakrishnan & S. Stolfo (Eds.), KDD-2000: Tutorials, August 20-23, 2000, Boston, MA; Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 243–307). New York, NY: Association for Computing Machinery.
Han, J., & Kamber, M. (2001). Data mining: concepts and techniques. San Francisco: Morgan Kaufmann Publishers.
Hand, D. J. (1999). Statistics and data mining: Intersecting disciplines. SIGKDD Explorations, 1 (1), 16–19.
Hand, D. J. (2000). Mining medical data [editorial]. Statistical Methods in Medical Research, 9 (4), 305–307.
Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.
Hanson, R. L., Ehm, M. G., Pettitt, D. J., Prochazka, M., Thompson, D. B., Timberlake, D., Foroud, T., Kobes, S., Baier, L., Burns, D. K., Almasy, L., Blangero, J., Garvey, W. T., Bennett, P. H., & Knowler, W. C. (1998). An autosomal genomic scan for loci linked to type II diabetes mellitus and body-mass index in Pima Indians. American Journal of Human Genetics, 63 (4), 1130–1138.
Harrell, F. E. (2001). Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. New York: Springer.
Hashemi, R. R., Jelovsek, F. R., & Razzaghi, M. (1993). Developmental toxicity risk assessment: a rough sets approach. Methods of Information in Medicine, 32 (1), 47–54.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: data mining, inference, and prediction. New York: Springer.
He, H., Koesmarno, H., Van, T., & Huang, Z. (2000). Data mining in disease management—a diabetes case study. In R. Mizoguchi & J. K. Slaney (Eds.), PRICAI 2000, Topics in artificial intelligence: 6th Pacific Rim International Conference on Artificial Intelligence, Melbourne, Australia, August 28-September 1, 2000: proceedings (p. 799). Berlin; New York: Springer. (The conference proceedings published a 1-page summary; the full 10-page paper was sent to me by the author on 12/5/00.)
Heckerman, D. E. (Ed.). (1997). KDD-97: Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA, August 14-17, 1997. Menlo Park, CA: AAAI Press.
Hedberg, S. R. (1995). The data gold rush. Byte (October), 83–88.
Hess, K. R., Abbruzzese, M. C., Lenzi, R., Raber, M. N., & Abbruzzese, J. L. (1999). Classification and regression tree analysis of 1,000 consecutive patients with unknown primary carcinoma. Clinical Cancer Research, 5 (11), 3403–3410.
Hoang, A. (1997). Supervised classifier performance on the UCI database. Unpublished master's thesis, University of Adelaide.
Holena, M., Sochorova, A., & Zvarova, J. (1999). Increasing the diversity of medical data mining through distributed object technology. Studies in Health and Technology Information, 68, 442–447.
Hollis, J. (1998). Deploying an HMO's data warehouse. Health Management Technology, 19 (8), 46–48.
Hood, D. (2001). 2001 Louisiana health report card (http://www.dhh.state.la.us/OPH/statctr/4Report%20Card/2001/2001LouisianaHealthReportCard.pdf posted March 23, 2001). State of Louisiana, Department of Health and Hospitals.
Horn, W. (1999). Artificial intelligence in medicine: Proceedings of the Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making, AIMDM'99, Aalborg, Denmark, June 20-24, 1999. Berlin; New York: Springer. (LNAI1620)
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York: Wiley.
Hsu, W., Lee, M. L., Liu, B., & Ling, T. W. (2000). Exploration mining in diabetic patients databases: Findings and conclusions. In R. Ramakrishnan & S. Stolfo (Eds.), KDD-2000: proceedings, August 20-23, 2000, Boston, MA; Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 430–436). New York, NY: Association for Computing Machinery.
Huang, Y. W., & Yu, P. S. (1999). Adaptive query processing for time-series data. In S. Chaudhuri & D. Madigan (Eds.), KDD-99: proceedings, August 15-18, 1999, San Diego, CA; Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 282–286). New York, NY: Association for Computing Machinery.
Iezzoni, L. I. (1997). Risk adjustment for measuring healthcare outcomes (2nd ed.). Chicago, IL: Health Administration Press.
IOM. (2001). Crossing the quality chasm: A new health system for the 21st century. Washington, DC: Institute of Medicine, National Academy Press.
Jantke, K. P., & Shinohara, A. (Eds.). (2001). Discovery science: 4th International Conference: DS 2001: Washington, DC: November 25-28, 2001: proceedings. Springer. (LNAI2226)
Jensen, D. (2000). Data snooping, dredging and fishing: The dark side of data mining: A SIGKDD99 panel report. SIGKDD Explorations, 1 (2), 52–54.
Joslin, E. P., Kahn, C. R., & Weir, G. C. (Eds.). (1994). Joslin's diabetes mellitus (13th ed.). Philadelphia: Lea & Febiger.
Kalis, L. (2000). Going all out with health care. Red Herring (September 1).
Kambayashi, Y., Mohania, M., & Tjoa, A. M. (2000). Data warehousing and knowledge discovery: Second International Conference, DaWaK 2000, London, UK, September 4-6, 2000; proceedings. New York: Springer. (LNCS1874)
Kambayashi, Y., Winiwarter, W., & Arikawa, M. (2001). Data warehousing and knowledge discovery: Third International Conference, DaWaK 2001, Munich, Germany, September 5-7, 2001: proceedings. New York: Springer-Verlag. (LNCS2114)
Kelling, D. G., Wentworth, J. A., & Wright, J. B. (1997). Diabetes mellitus: Using a database to implement a systematic management program. North Carolina Medical Journal, 58 (5), 368–371.
Khan, A. H. (1998). Multiplier-free feedforward networks. http://citeseer.nj.nec.com/6034.html, accessed March 18, 2001.
Kiel, J. M. (2000). Data mining and modeling: Power tools for physician practices. MD Computing, 17 (3), 33–34.
King, M. A., Elder IV, J. F., Gomolka, B., Schmidt, E., Summers, M., & Toop, K. (1998). Evaluation of fourteen desktop data mining tools. In IEEE International Conference on Systems, Man, and Cybernetics. San Diego, CA: http://citeseer.nj.nec.com/293388.html.
Klabunde, C. N., Potosky, A. L., Legler, J. M., & Warren, J. L. (2000). Development of a comorbidity index using physician claims data. Journal of Clinical Epidemiology, 53 (12), 1258–1267.
Kleinbaum, D. G. (1998). Applied regression analysis and other multivariable methods (3rd ed.). Pacific Grove: Duxbury Press.
Knowler, W. C., Bennett, P. H., Hamman, R. F., & Miller, M. (1978). Diabetes incidence and prevalence in Pima Indians: A 19-fold greater incidence than in Rochester, Minnesota. American Journal of Epidemiology, 108 (6), 497–505.
Kohn, L. T., Corrigan, J., & Donaldson, M. S. (2000). To err is human: Building a safer health system. Washington, DC: National Academy Press.
Komorowski, J., & Øhrn, A. (1999). Modeling prognostic power of cardiac tests using rough sets. Artificial Intelligence in Medicine, 15 (2), 167–191.
Komorowski, J., Polkowski, L., & Skowron, A. (1998). Rough sets: A tutorial. In S. K. Pal & A. Skowron (Eds.), Rough-fuzzy hybridization: A new trend in decision-making (p. 454). New York: Springer Verlag.
Komorowski, J., & Zytkow, J. M. (1997). Principles of data mining and knowledge discovery: First European Symposium, PKDD '97, Trondheim, Norway, June 24-27, 1997; proceedings. Berlin; New York: Springer. (LNAI1263)
Kopelman, P. G., & Sanderson, A. J. (1996). Application of database systems in diabetes care. Medical Informatics, 21 (4), 259–271.
Kreuze, D. (2001). Debugging hospitals. Technology Review, 2001 (March), 32.
Kuo, W. J., Chang, R. F., Chen, D. R., & Lee, C. C. (2001). Data mining with decision trees for diagnosis of breast tumor in medical ultrasonic images. Breast Cancer Research and Treatment, 66 (1), 51–57.
LADHH. (2000). 1999 data tables. Louisiana State Center for Health Statistics, Department of Health and Hospitals, Office of Public Health: http://www.dhh.state.la.us/OPH/statctr/1Tables/1999/Parish/t26_99i.xls.
Lamma, E., Manservigi, M., Mello, P., Storari, S., & Riguzzi, F. (2000). A system for monitoring nosocomial infections. In Medical data analysis: First International Symposium, ISMDA 2000, Frankfurt, Germany, September 29-30, 2000; proceedings (pp. 282–292). Berlin; New York: Springer.
Lavrac, N. (1999). Selected techniques for data mining in medicine. Artificial Intelligence in Medicine, 16 (1), 3–23.
Lim, T.-S. (2000). Polytomous logistic regression trees. Unpublished doctoral dissertation, University of Wisconsin.
Liu, B. (1998). Integrating classification and association rule mining. In KDD-98, Knowledge Discovery and Data Mining (pp. 80–86). New York.
Lonergan, B. J. F. (1957). Insight: A study of human understanding. London; New York: Longmans Green.
Lu, H., Motoda, H., & Liu, H. (1997). KDD, techniques and applications: Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining, 23-24 Feb. 97. Singapore; River Edge, NJ: World Scientific.
Mannila, H. (2000). Theoretical frameworks for data mining. SIGKDD Explorations, 1 (2), 30–32.
Matheus, C. J., Piatetsky-Shapiro, G., & McNeill, D. (1996). Selecting and reporting what is interesting: The KEFIR application to healthcare data. In U. M. Fayyad (Ed.), Advances in knowledge discovery and data mining (pp. 495–515). Menlo Park, CA: AAAI Press: MIT Press.
McDonald, J. M., Brossette, S., & Moser, S. A. (1998). Pathology information systems: Data mining leads to knowledge discovery. Archives of Pathology and Laboratory Medicine, 122 (5), 409–411.
McLeish, M., Yao, P., Garg, M., & Stirtzinger, T. (1991). Discovery of medical diagnostic information: An overview of methods and results. In G. Piatetsky-Shapiro & W. Frawley (Eds.), Knowledge discovery in databases (pp. 477–490). Menlo Park, CA; Cambridge, MA: AAAI Press & MIT Press.
Merler, S., Furlanello, C., Chemini, C., & Nicolini, G. (1996). Classification tree methods for analysis of mesoscale distribution of Ixodes ricinus (Acari: Ixodidae) in Trentino, Italian Alps. Journal of Medical Entomology, 33 (6), 888–893.
Michel, C., & Beguin, C. (1994). Using a database to query for diabetes mellitus. Studies in Health and Technology Information, 14, 178–182.
Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (Eds.). (1994). Machine learning, neural and statistical classification. New York: Ellis Horwood.
Milley, A. (2000). Healthcare and data mining. Health Management Technology, 44–45.
Mitchell, T. M. (1997). Machine learning. New York: McGraw-Hill.
Mitchell, T. M. (1999). Machine learning and data mining. Communications of the ACM, 42 (11), 30.
Mohania, M., & Tjoa, A. M. (1999). Data warehousing and knowledge discovery: First International Conference: DaWaK'99: Florence, Italy: August 30–September 1, 1999; proceedings. Berlin; New York: Springer. (LNCS1676)
Moise, P. A., Forrest, A., Bhavnani, S. M., Birmingham, M. C., & Schentag, J. J. (2000). Area under the inhibitory curve and a pneumonia scoring system for predicting outcomes of vancomycin therapy for respiratory infections by Staphylococcus aureus. American Journal of Health-System Pharmacy, 57 Suppl 2, S4–S9.
Montani, S., & Bellazzi, R. (2000). Exploiting multi-modal reasoning for knowledge management and decision support: an evaluation study. Proceedings of the American Medical Informatics Association Symposium, 585–589.
Montani, S., Bellazzi, R., Portinale, L., d'Annunzio, G., Fiocchi, S., & Stefanelli, M. (2000). Diabetic patients management exploiting case-based reasoning techniques. Computer Methods and Programs in Biomedicine, 62 (3), 205–218.
Montani, S., Bellazzi, R., Portinale, L., Fiocchi, S., & Stefanelli, M. (1998). A case-based retrieval system for diabetic patients therapy. In R. Bellazzi & B. Zupan (Eds.), ECAI '98 workshop notes on Intelligent Data Analysis in Medicine and Pharmacology (IDAMAP 98). Brighton, UK.
Montani, S., Bellazzi, R., Portinale, L., & Stefanelli, M. (2000). A multi-modal reasoning methodology for managing IDDM patients. International Journal of Medical Informatics, 58-59, 243–256.
Moser, S. A., Jones, W. T., & Brossette, S. E. (1999). Application of data mining to intensive care unit microbiologic data. Emerging Infectious Diseases, 5 (3), 454–457.
Oates, T. (1994, March). MSDD as a tool for classification. EKSL memorandum 94-29. Department of Computer Science, University of Massachusetts at Amherst.
Øhrn, A. (1999). Discernibility and rough sets in medicine: Tools and applications. Unpublished doctoral dissertation, Norwegian University of Science and Technology.
Øhrn, A., & Rowland, T. (2000). Rough sets: A knowledge discovery technique for multifactorial medical outcomes. American Journal of Physical Medicine and Rehabilitation, 79 (1), 100–108.
Øhrn, A., Vinterbo, S., Szymanski, P., & Komorowski, J. (1997). Modelling cardiac patient set residuals using rough sets. Proceedings of the American Medical Informatics Association Symposium, 203–207.
Olsson, B., & Persson, L. (1996a). A national registry on diabetes does not improve the quality of diabetic care. Lakartidningen, 93 (4), 235.
Olsson, B., & Persson, L. (1996b). A national registry on diabetes—for what purpose? Lakartidningen, 93 (14), 1317.
Owrang, M. M. (2000). Using domain knowledge to optimize the knowledge discovery process in databases. International Journal of Intelligent Systems, 15 (1), 45–60.
Paterson, G. I. (1995). A rough sets approach to patient classification in medical records. Medinfo, 8 (Pt 2), 910.
Petitti, D. B. (2000). Meta-analysis, decision analysis, and cost-effectiveness analysis: methods for quantitative synthesis in medicine (2nd ed.). New York: Oxford University Press.
Pilote, L., Miller, D. P., Califf, R. M., Rao, J. S., Weaver, W. D., & Topol, E. J. (1996). Determinants of the use of coronary angiography and revascularization after thrombolysis for acute myocardial infarction. New England Journal of Medicine, 335 (16), 1198–1205.
Podraza, W., & Podraza, H. (1999). Childhood leukaemia relapse risk factors: a rough sets approach. Medical Informatics and the Internet in Medicine, 24 (2), 91–108.
Pogach, L. M., Hawley, G., Weinstock, R., Sawin, C., Schiebe, H., Cutler, F., Zieve, F., Bates, M., & Repke, D. (1998). Diabetes prevalence and hospital and pharmacy use in the Veterans Health Administration (1994): Use of an ambulatory care pharmacy-derived database. Diabetes Care, 21 (3), 368–373.
Porte, D., Sherwin, R. S., Ellenberg, M., & Rifkin, H. (Eds.). (1997). Ellenberg & Rifkin's diabetes mellitus (5th ed.). Stamford, CT: Appleton & Lange.
Prather, J. C., Lobach, D. F., Goodwin, L. K., Hales, J. W., Hage, M. L., & Hammond, W. E. (1997). Medical data mining: Knowledge discovery in a clinical data warehouse. Proceedings of the American Medical Informatics Association Symposium, 101–105.
Provost, F. (Ed.). (2001). KDD-2001: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery.
Pyle, D. (1999). Data preparation for data mining. San Francisco: Morgan Kaufmann Publishers.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers.
Raedt, L. d., & Siebes, A. (2001). Principles of data mining and knowledge discovery: 5th European Conference, PKDD 2001, Freiburg, Germany, September 3-5, 2001; proceedings. New York: Springer-Verlag. (LNAI2168)
Rainer, T. H., Lam, P. K., Wong, E. M., & Cocks, R. A. (1999). Derivation of a prediction rule for post-traumatic acute lung injury. Resuscitation, 42 (3), 187–196.
Ramakrishnan, R., & Stolfo, S. (2000). KDD-2000: Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: Proceedings, August 20-23, 2000, Boston, MA. New York: Association for Computing Machinery.
Ramoni, M., Riva, A., Stefanelli, M., & Patel, V. (1995). An ignorant belief network to forecast glucose concentration from clinical databases. Artificial Intelligence in Medicine, 7, 541–559.
Rehnqvist, N. (1996). Quality development with a national registry of diabetes. Lakartidningen, 93 (5), 334, 339.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge; New York: Cambridge University Press.
Riva, A., & Bellazzi, R. (1995). Intelligent analysis techniques for diabetes data time series. In G. E. Lasker & X. Liu (Eds.), Advances in intelligent data analysis (pp. 144–148). Germany: IIAS Press.
Rogers, J. (2001). Data mining fights fraud. Computer Weekly, Feb 8, 2.
Roper, N., Bilous, R., Kelly, W., Unwin, N., & Connolly, V. (2002). Cause-specific mortality in a population with diabetes: South Tees diabetes mortality study. Diabetes Care, 25 (1), 43–48.
Rothman, K. J., & Greenland, S. (1998). Modern epidemiology (2nd ed.). New York: Lippincott, Williams & Wilkins.
Sakamoto, N. (1996). Object-oriented development of a concept learning system for time-centered clinical data. Journal of Medical Systems, 20 (4), 183–196.
Sauerbrei, W., Madjar, H., & Prompeler, H. J. (1998). Differentiation of benign and malignant breast tumors by logistic regression and a classification tree using Doppler flow signals. Methods of Information in Medicine, 37 (3), 226–234.
Schrage, M. (1999). Working in the data mines: Sixteen tons of information overload. Fortune, August 2, 244.
Selker, H. P., Griffith, J. L., Patil, S., Long, W. J., & D'Agostino, R. B. (1995). A comparison of performance of mathematical predictive methods for medical diagnosis: identifying acute cardiac ischemia among emergency department patients. Journal of Investigative Medicine, 43 (5), 468–476.
Shih, Y.-S. (1999). Families of splitting criteria for classification trees. Statistics and Computing, 9 (4), 309–315.
Shortliffe, E. H., Perreault, L. E., Wiederhold, G., & Fagan, L. M. (Eds.). (2000). Medical informatics: Computer applications in health care (2nd ed.). New York: Springer.
Silverstein, C., Brin, S., Motwani, R., & Ullman, J. (1998). Scalable techniques for mining causal structures (Tech. Rep.). Department of Computer Science, Stanford University.
Simoudis, E., Han, J., & Fayyad, U. M. (1996). KDD-96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, August 2-4, 1996. Menlo Park, CA: AAAI Press.
Slowinski, K., Slowinski, R., & Stefanowski, J. (1988). Rough sets approach to analysis of data from peritoneal lavage in acute pancreatitis. Medical Informatics, 13 (3), 143–159.
Smith, J. W., Everhart, J. E., Dickenson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In R. A. Greenes (Ed.), Proceedings of the Symposium on Computer Applications and Medical Care: Washington, DC (pp. 261–265). Los Angeles, CA: IEEE Computer Society Press.
Smyth, P. (2000). Data mining: Data analysis on a grand scale? Statistical Methods in Medical Research, 9 (4), 309–327.
Songer, T. J. (1995). Disability in diabetes. In Diabetes in America, 2nd edition (pp. 259–282). Bethesda, MD: National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases. (NIH publication no. 95-1468)
Soyer, H. P., Smolle, J., Leitinger, G., Rieger, E., & Kerl, H. (1995). Diagnostic reliability of dermoscopic criteria for detecting malignant melanoma. Dermatology, 190 (1), 25–30.
Srivastava, A. N., & Weigend, A. S. (1997). Data mining in finance: Introducing the special issue of IJNS [editorial]. International Journal of Neural Systems, 8 (4), 367–371.
Stenstrom, G. (1994). Time for a Swedish registry on diabetes. Lakartidningen, 91 (32-33), 2845–2846.
Stepaniuk, J. (1998). Rough set based data mining in diabetes mellitus data table. In H. Zimmermann (Ed.), EUFIT '98, September 7-10, 1998: Intelligent techniques and soft computing (Vol. 2, pp. 980–984). Aachen, Germany: Verlag Mainz.
Stepaniuk, J. (1999). Rough set data mining of diabetes data. In Z. Ras & A. Skowron (Eds.), Foundations of intelligent systems: 11th International Symposium, ISMIS'99, Warsaw, Poland, June 8-11, 1999: proceedings (pp. 457–465). Berlin; New York: Springer.
Sullivan, R., Timmermann, A., & White, H. (1998). Dangers of data-driven inference: The case of calendar effects in stock returns (Tech. Rep.). UCSD Working Papers in Economics 98-16.
Tabar, P. (1999). Mining your pharma data. Healthcare Informatics, 22.
Tafeit, E., Moller, R., Sudi, K., & Reibnegger, G. (2000). ROC and CART analysis of subcutaneous adipose tissue topography (SAT-Top) in type 2 diabetic women and healthy females. American Journal of Human Biology, 12, 388–394.
Takae, T., Chikamune, M., Arimura, H., Shinohara, A., Inoue, H., Takeya, S., Uezono, K., & Kawasaki, T. (1999). Knowledge discovery from health data using weighted aggregation classifiers. In S. Arikawa & K. Furukawa (Eds.), Discovery Science: Second International Conference, DS'99, Tokyo, Japan, December 6-8, 1999; proceedings (pp. 359–361). Berlin; New York: Springer.
Temkin, N. R., Holubkov, R., Machamer, J. E., Winn, H. R., & Dikmen, S. S. (1995). Classification and regression trees (CART) for prediction of function at 1 year following head trauma. Journal of Neurosurgery, 82 (5), 764–771.
Terano, T., Liu, H., & Chen, A. L. P. (2000). Knowledge discovery and data mining: Current issues and new applications: 4th Pacific-Asia Conference, PAKDD 2000, Kyoto, Japan, April 18-20, 2000; proceedings. Berlin; New York: Springer.
Thiele, J., Kvasnicka, H. M., Zirbes, T. K., Flucke, U., Niederle, N., Leder, L. D., Diehl, V., & Fischer, R. (1998). Impact of clinical and morphological variables in classification and regression tree-based survival (CART) analysis of CML with special emphasis on dynamic features. European Journal of Haematology, 60 (1), 35–46.
Tigrani, V. S., & John, G. H. (1998). Data mining and statistics in medicine: An application in prostate cancer detection. In Proceedings of the Joint Statistical Meetings, Section on Physical and Engineering Sciences. American Statistical Association.
Timberlake Consultants. (2001). CART frequently asked questions. http://www.timberlake.co.uk/software/cart/cartfaq1.htm accessed 10/21/01.
Tsai, Y. S. (1998). Knowledge discovery with medical databases: A case-based reasoning approach. Unpublished doctoral dissertation, Vanderbilt University.
Tsai, Y. S., King, P. H., Higgins, M. S., Pierce, D., & Patel, N. P. (1997). An expert-guided decision tree construction strategy: An application in knowledge discovery with medical databases. Proceedings of the American Medical Informatics Association Symposium, 208–212.
Tsien, C. L. (2000). Event discovery in medical time-series data. Proceedings of the American Medical Informatics Association Symposium, 858–862.
Tsien, C. L., Fraser, H. S., Long, W. J., & Kennedy, R. L. (1998). Using classification tree and logistic regression methods to diagnose myocardial infarction. Medinfo, 9 (Pt 1), 493–497.
Tsumoto, S. (1998). Automated knowledge acquisition from clinical databases based on rough sets and attribute-oriented generalization. Proceedings of the American Medical Informatics Association Symposium, 548–552.
Tsumoto, S., & Tanaka, H. (1994). Induction of medical expert system rules based on rough sets and resampling methods. Proceedings of the Annual Symposium on Computer Applications in Medical Care, 1066–1070.
Tsumoto, S., & Tanaka, H. (1995). Induction of expert system rules based on rough sets and resampling methods. Medinfo, 8 (Pt 1), 861–865.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Turney, P. D. (1995). Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research, 2, 369–409.
Tusch, G., Muller, M., Rohwer-Mensching, K., Heiringhoff, K., & Klempnauer, J. (2000). Data warehouse and data mining in a surgical clinic. Studies in Health Technology and Informatics, 77, 784–789.
Wacker, W. (2001). Data for dollars. Entrepreneur, 29 (1), 22.
Wahba, G., Gu, C., Wang, Y., & Chappell, R. (1992). Soft classification, a.k.a. risk estimation, via penalized log likelihood and smoothing spline analysis of variance. In D. H. Wolpert (Ed.), The mathematics of generalization: The proceedings of the SFI/CNLS Workshop on Formal Approaches to Supervised Learning (pp. 331–360). Santa Fe: Addison-Wesley.
Waldrop, M. M. (2001). Usama Fayyad: Data mining. Technology Review, 104 (1), 101–102.
Walker, M. G., & Blum, R. L. (1986). Towards automated discovery from clinical databases: The RADIX project. Proceedings of the Fifth Conference on Medical Informatics, 5, 32–36.
Wegman, E. J. (2001). Statistical data mining. In A. Goodman, P. Smyth, X. Ge, & E. Wegman (Eds.), Computing Science and Statistics, 33rd Symposium on the Interface: June 13-16, 2001, Costa Mesa, CA (Vol. 33). Fairfax, VA: Interface Foundation of North America. (in press)
Weiss, S. M., & Indurkhya, N. (1998). Predictive data mining: A practical guide. San Francisco, CA: Morgan Kaufmann Publishers.
Weng, C., Coppini, D. V., & Sonksen, P. H. (1997). Linking a hospital diabetes database and the National Health Service Central Register: A way to establish accurate mortality and movement data. Diabetic Medicine, 14 (10), 877–883.
White, H. (2000). A reality check for data snooping. Econometrica, 68 (5), 1097–1126.
Widang, K. (1996). The national registry on diabetes—simple things are being complicated. Lakartidningen, 93 (8), 667.
Witten, I. H., & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco, CA: Morgan Kaufmann.
Wong, L. (2000). Datamining: Discovering information from bio-data. http://citeseer.nj.nec.com/375806.html.
Wu, X., Ramamohanarao, K., & Korb, K. B. (1998). Research and development in knowledge discovery and data mining: Second Pacific-Asia Conference, PAKDD-98, Melbourne, Australia, April 15-17, 1998; proceedings. Berlin; New York: Springer. (LNAI 1394)
Yarnold, P. R., Soltysik, R. C., & Bennett, C. L. (1997). Predicting in-hospital mortality of patients with AIDS-related Pneumocystis carinii pneumonia: An example of hierarchically optimal classification tree analysis. Statistics in Medicine, 16 (13), 1451–1463.
Zhong, N., & Zhou, L. (1999). Methodologies for knowledge discovery and data mining: Third Pacific-Asia Conference, PAKDD-99, Beijing, China, April 1999; proceedings. Berlin; New York: Springer. (LNAI 1574)
Ziarko, W. (1991). The discovery, analysis, and representation of data dependencies in databases. In G. Piatetsky-Shapiro & W. Frawley (Eds.), Knowledge discovery in databases (pp. 195–209). Menlo Park, CA: AAAI Press and MIT Press.
Zighed, D. A., Komorowski, J., & Zytkow, J. M. (2000). Principles of data mining and knowledge discovery: 4th European Conference, PKDD 2000, Lyon, France, September 13-16, 2000; proceedings. Berlin; New York: Springer. (LNAI 1910)
Zupan, B., Lavrac, N., & Keravnou, E. (1999). Data mining techniques and applications in medicine [editorial]. Artificial Intelligence in Medicine, 16 (1), 1–2.
Zytkow, J. M., & Quafafou, M. (1998). Principles of data mining and knowledge discovery: Second European Symposium, PKDD ’98, Nantes, France, September 23-26, 1998; proceedings. Berlin; New York: Springer. (LNAI 1510)
Zytkow, J. M., & Rauch, J. (Eds.). (1999). Principles of data mining and knowledge discovery: Third European Conference, PKDD’99, Prague, Czech Republic, September 15-18, 1999; proceedings. Berlin; New York: Springer. (LNAI 1704)
Biography

Joe Breault grew up in North Arlington, New Jersey, where he attended Queen of Peace School from K-12. He received his BS and MS in physics at Stevens Institute of Technology in Hoboken. He spent some years working in community development before attending Columbia University in New York City for a postgraduate premedical program. He received his MD and MPH from Tulane University, and then did his family practice residency at Montefiore Hospital in New York City within the Residency Program in Social Medicine.

He joined the Indian Health Service and spent three years in South Dakota, working on the Pine Ridge Indian Reservation and at the Sioux San Indian Hospital in Rapid City as a physician and TB Control Officer. He met his wife on the Oglala Sioux reservation soon after she moved from Los Angeles to work in a legal aid clinic. Their son was born in South Dakota.

He moved to New Orleans in 1992 to join the Ochsner Clinic Foundation, where he works as a Senior Staff Physician and as Associate Director and Research Director of the Family Practice Residency, and was recently appointed Chair of the Institutional Review Board. He has written a few dozen medical articles and some book chapters. Ladies’ Home Journal listed him as one of America’s best family doctors in May 2002.
Colophon

This document was typeset using LaTeX2e with WinEdt 5 as an interface. An electronic version with hyperlinks is available at www.meddatamine.com in PDF format. The tuthesis.cls style was downloaded from http://www.cs.tulane.edu/www/Preserve/tips/latex.html and modified for HSM thesis requirements using apacite. DATA and allCLEAR graphics were exported as wmf or emf files and converted to eps by WMF2EPS v1.31. Other graphics were converted to eps by http://magick.net4tv.com/MagickStudio/scripts/advanced.cgi.

Parts of this study have been presented at the conferences listed below. Articles based on this dissertation include (Breault, 2001; Breault, Goodall, & Fos, 2002).

Conference                                            Location            Date
Interface 2001                                        Costa Mesa, CA      6/16/01
Louisiana Academy of Family Physicians Research Day   Baton Rouge, LA     11/3/01
International Conference on Health Policy Research    Boston, MA          12/7/01
Mathematical Challenges in Scientific Data Mining     Los Angeles, CA     1/14/02
Society of Teachers of Family Medicine                San Francisco, CA   4/29/02
Joseph L. Breault, M.D. cordially invites the faculty and students of the Tulane HSM Department and the interested public to a presentation of his dissertation
DATA MINING A DIABETIC DATA WAREHOUSE TO IMPROVE OUTCOMES
on the twenty-first day of March, 2002, at 4:00 P.M. in the Tidewater Building, Room 1920