2014 IEEE International Conference on Healthcare Informatics

Detecting Adverse Drug Events Using Concept Hierarchies of Clinical Codes

Jing Zhao, Aron Henriksson, Henrik Boström

Department of Computer and Systems Sciences (DSV), Stockholm University, Stockholm, Sweden
Email: [email protected]

Abstract—Electronic health records (EHRs) provide a potentially valuable source of information for pharmacovigilance. However, adverse drug events (ADEs), which can be encoded in EHRs with specific diagnosis codes, are heavily under-reported. To provide more accurate estimates for drug safety surveillance, machine learning systems that are able to detect ADEs could be used to identify and suggest missing ADE-specific diagnosis codes. A fundamental consideration when building such systems is how to represent the EHR data to allow for accurate predictive modeling. In this study, two types of clinical code are used to represent drugs and diagnoses: the Anatomical Therapeutic Chemical Classification System (ATC) and the International Statistical Classification of Diseases and Related Health Problems (ICD). More specifically, it is investigated whether their hierarchical structure can be exploited to improve predictive performance. The use of random forests with feature sets that include only the original, low-level, codes is compared to using random forests with feature sets that contain all levels in the hierarchies. An empirical investigation using thirty datasets with different ADE targets is presented, demonstrating that the predictive performance, in terms of accuracy and area under ROC curve, can be significantly improved by exploiting codes on all levels in the hierarchies, compared to using only the low-level encoding. A further analysis is presented in which two strategies are employed for adding features level-wise according to the concept hierarchies: top-down, starting with the highest abstraction levels, and bottom-up, starting with the most specific encoding. The main finding from this subsequent analysis is that predictive performance can be kept at a high level even without employing the more specific levels in the concept hierarchies.

I. INTRODUCTION

The prevalence of adverse drug events (ADEs), i.e., injuries resulting from the use of a drug [1], constitutes a major public health issue. The (in)ability to prevent ADEs is therefore of critical importance for patient safety: worldwide, around 3.7% of hospital admissions are due to ADEs [2], and, in Sweden, ADEs have been identified as the seventh most common cause of death [3]. Prior to the release of a drug on the market, its benefits and risks are evaluated in clinical trials. However, due to limitations of clinical trials in terms of sample size and duration, as well as the uncertain impact of certain chemical compounds on the human body, not all adverse effects of the drug can be identified prior to its launch. As a result, post-marketing drug surveillance, or pharmacovigilance, is carried out throughout the life cycle of a pharmaceutical product in order to inform decisions about the sustained marketing of drugs and their use in the treatment of patients. Cerivastatin, a drug to lower cholesterol and prevent cardiovascular disease, was, for instance, withdrawn from the market in 2001 due to its fatal ADE – rhabdomyolysis – which was identified through post-marketing drug safety surveillance [4].

Post-marketing drug safety surveillance relies primarily on the analysis – typically through disproportionality methods – of individual case reports, submitted voluntarily by patients and clinicians to organizations such as the Food and Drug Administration (FDA, http://www.fda.gov) in the USA and the World Health Organization (WHO, http://www.who.int/en/). Such spontaneous reporting systems suffer, however, from several severe limitations. Not only are they affected by gross under-reporting of ADEs; they also have problems with reliability and compliance, as well as insufficient information about patients' medical history and the total number of patients taking a particular drug. Many of these limitations are not present in the data that is routinely collected in electronic health records (EHRs), which have recently been explored as an alternative resource for conducting various aspects of post-marketing drug safety surveillance. The systematic documentation of health care in EHRs provides valuable access to longitudinal observations of patients, including their basic characteristics, drug prescriptions and administrations, diagnoses, clinical measurements, laboratory tests and clinical notes. Research on using EHR data for drug safety surveillance is nascent [5]–[9].

Using EHR data for drug safety surveillance is, however, not unproblematic. Electronic health records contain various and disparate types of data, ranging from structured information – input through predefined templates – to unstructured clinical notes [10]. The use of machine learning algorithms on EHR data for obtaining predictive models that are able to detect potential ADEs has been investigated in several previous studies, see, e.g., Harpaz et al. [5], Chazard et al. [11] and Karlsson et al. [12]. It is well known from the machine learning field that the predictive performance of the generated models is not only affected by the choice of learning algorithm, but also by the way in which the data is represented. The latter aspect has not been studied extensively in the context of ADE detection using EHR data, effectively providing motivation for the present study.


One example of the many potential challenges of processing EHR data is that, even in the structured part of EHRs, drug names in the drug administration information are recorded under different commercial names, abbreviations and names with misspellings. Fortunately, most EHR systems also encode such information through standard classification systems, such as the Anatomical Therapeutic Chemical Classification System (ATC) [13] for drugs, and the International Statistical Classification of Diseases and Related Health Problems (ICD) [14] for diagnoses.

Clinical codes that encode diagnoses and drugs are critical sources of information for drug safety surveillance. There is, in fact, a limited number of diagnosis codes that specifically encode ADEs; however, similar to spontaneous reporting systems, EHRs suffer from under-reporting of ADEs [15]. One reason for this is that clinicians sometimes fail to recognize a medical event as an ADE and, consequently, may assign a diagnosis code that indicates the disease or symptom, but not specifically that it was drug-induced. An important challenge is therefore to detect missing ADE-specific diagnosis codes in order to make EHRs a more reliable source for estimation of ADE incidence. Adding missing ADE codes may, in turn, lead to improved patient safety, e.g. by providing decision support for clinicians when conducting the benefit-risk analyses that invariably take place when prescribing drugs to patients.

One means of detecting missing ADEs is through predictive modeling, or machine learning: learning from the medical history, including diagnoses and drugs, of those patients who have been assigned ADE codes, and then applying the trained model to records of patients who have not been assigned ADE codes. Recently, Karlsson et al. [12] presented a method for predicting missing ADEs in EHRs by using drugs and diagnoses, represented by their ATC and ICD codes, as they are encoded in the EHRs. Representing drugs and diagnoses in this manner, however, fails to leverage the hierarchical structure of ATC and ICD. Since there are typically tens of thousands of distinct ATC and ICD codes in an EHR database, this representation will be of a very high dimensionality, and since each health record typically only contains a small portion of the possible codes, it also leads to high sparsity.

By recognizing the hierarchical structure of ATC and ICD, the low-level clinical codes can be aggregated to more general levels, which would reduce both the dimensionality and sparsity of the data. Chazard et al. [11] propose an aggregation engine, by which diagnoses are aggregated to more general chronic diseases and drugs are aggregated to a certain level of a family of drugs. However, no evidence that supports the choice of these particular levels has been presented, e.g., demonstrating their superiority over other levels in terms of model performance. In order to leverage the concept hierarchies of clinical codes for detecting missing ADE codes, this study aims to explore different feature representations based on the ATC and ICD hierarchies and evaluate their impact on predictive performance compared to using the original, low-level, codes.

II. METHODS AND MATERIALS

This paper reports on a series of experiments on thirty EHR datasets – here, limited to encoding each observation by a set of clinical codes – to detect patients with ADEs, using random forests as the underlying learning algorithm. First, the use of two feature sets is compared: (1) using only the original, low-level, clinical codes (as assigned in the EHRs), and (2) using a combination of all levels of the concept hierarchies. The hypothesis is that using the latter feature set will lead to improved predictive performance, as the high-level concepts, at least occasionally, could help to discriminate between the classes (ADE vs. non-ADE). A subsequent investigation of the relative importance of the variables corresponding to different levels in the clinical code hierarchies on three of the datasets is presented. The outcome of this investigation is used for suggesting two strategies for performing a more refined study of the impact of different levels in the hierarchies: investigating the addition of concept levels one by one, either starting with the highest level, i.e., performing a top-down investigation, or starting with the most specific level, i.e., performing a bottom-up investigation.

A. Data Source

The data used in this study was extracted from the Stockholm EPR Corpus [16], which contains health records from the Karolinska University Hospital in Stockholm, Sweden. (This research has been approved by the Regional Ethical Review Board in Stockholm, Etikprövningsnämnden i Stockholm, permission number 2012/834-31/5.) The data encompasses around 700,000 (anonymized) patients and their documented encounters with health care over a two-year period (2009-2010). In this database, there are approximately 10,000 unique diagnoses, encoded with ICD-10-SE (ICD, version 10, Swedish Modification; all references in this paper to ICD refer to this particular version), and 1,300 drugs, encoded with ATC codes.

The experiments, described below, were carried out with thirty ADE datasets, extracted from this database. Each dataset contains patients who share a specific ADE-related ICD code, i.e., patients who have experienced a drug-induced disorder, as positive examples, and, as negative examples, equally many randomly selected patients who have not been assigned this diagnosis code. Each classification task is hence binary, where the existence of a particular ADE-related ICD code is to be determined. As ADEs are rare events, there are many fewer patients with a given ADE than without. Since an imbalanced class distribution leads to poor performance with many classifiers, which assume a rather balanced class distribution and equal misclassification costs [17], the negative examples were under-sampled to equal the number of positive examples in all datasets. The ADE-related ICD codes were selected on the basis of having been classified as indicating ADEs according to Stausberg et al. [18], where category A.1 ("a drug-related causation was noted in the ICD, e.g., G44.4: Drug-induced headache, not elsewhere classified") and category A.2 ("a drug- or other substance-related causation was noted in the ICD, e.g., I42.7: Cardiomyopathy due to drugs and other external agents") were considered in this study, as they were deemed to be the clearest indicators of ADEs. Among all of the ICD codes in A.1 and A.2, the thirty most frequent ones in the Stockholm EPR Corpus were chosen, out of which the least frequent one was assigned to only 24 patients; see Table I for a summary of the datasets. Each dataset contains diagnoses and drugs from the health records to describe each patient; these are represented as binary features: the presence (1) or absence (0) of each ICD and ATC code in the record.
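As a rough illustration of this representation (the patient identifiers, codes, and use of pandas below are illustrative assumptions, not taken from the study data), each record can be turned into a row of binary indicators over the code vocabulary:

```python
# Illustrative sketch (codes and patient IDs are made up): turn each record
# into a row of binary indicators over the ICD and ATC codes in the corpus.
import pandas as pd

patient_codes = {
    "patient_1": {"F251", "C10AA01"},
    "patient_2": {"I109", "C10AA01", "N02BE01"},
    "patient_3": {"F251"},
}

vocabulary = sorted(set().union(*patient_codes.values()))

X = pd.DataFrame(
    [[int(code in codes) for code in vocabulary] for codes in patient_codes.values()],
    index=list(patient_codes.keys()),
    columns=vocabulary,
)
print(X)  # 1 = code present in the record, 0 = absent
```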

                    

B. Exploiting Concept Hierarchies of Clinical Codes

The drugs and diagnoses describing each patient were represented in a number of ways to exploit the concept hierarchies of the two types of clinical codes used in this study: ATC and ICD. The ATC system classifies drugs into five different levels: the first level is composed of 14 anatomical groups, indicated by a single Latin letter; the second level corresponds to a therapeutic subgroup, indicated by two digits; the third and fourth levels are pharmacological and chemical subgroups, respectively, and are indicated by two Latin letters; the final, fifth, level corresponds to chemical substances, indicated by two digits [13]. An example of the ATC hierarchy for the drug Simvastatin is shown in Figure 1. In this study, the same hierarchical structure was exploited to aggregate and denote the ATC codes; the fifth level is also denoted the original level, since most drugs are encoded in EHRs by their commercial names, which correspond to this specific level.
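Because the level boundaries of an ATC code fall at fixed character positions, all five levels can be recovered by simple truncation; the helper below is a minimal sketch under that assumption and is not code from the paper:

```python
def atc_levels(code: str) -> list[str]:
    """Truncate an ATC code to each of its five hierarchy levels.

    Level boundaries: anatomical group (1 character), therapeutic subgroup
    (+2 digits), pharmacological subgroup (+1 letter), chemical subgroup
    (+1 letter), chemical substance (+2 digits).
    """
    cut_points = [1, 3, 4, 5, 7]
    return [code[:cut] for cut in cut_points if len(code) >= cut]

print(atc_levels("C10AA01"))  # ['C', 'C10', 'C10A', 'C10AA', 'C10AA01']
```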

Fig. 1. Demonstration of concept hierarchy of an ATC code.

The ICD system contains 21 groups according to organ system or etiology. These constitute the first level of the hierarchy and are indicated by a single Latin letter. In each group, similar diseases are assembled in the second level, indicated by two digits. Finally, these diseases are divided into a third level that corresponds to different types or stages of a disease, indicated by another digit [14]. Figure 2 shows an example of the ICD hierarchy for the disease Schizoaffective disorder, depressive type. It should be noted, however, that the hierarchy of ICD does not always strictly follow the structure depicted in the figure, which is mainly reflected in the second level, where the first digit does not always distinguish the subgroup of similar diseases. For instance, D50-D53 encodes Nutritional anemia, while D55-D59 encodes another subgroup: Hemolytic anemia. For the sake of simplicity when creating aggregation rules, we employed the hierarchy shown in Figure 2 for all ICD codes, where the last level is also denoted the original level. As with drugs, most diagnoses are encoded in EHRs with codes that signify specific individual diseases, and these correspond to this level.

Fig. 2. Demonstration of concept hierarchy of an ICD code.

For each dataset, various representations were created based on the concept hierarchy of the clinical codes. Initially, two basic feature sets were created: (1) using only the original ICD and ATC codes (as encoded in the EHRs), and (2) using a combination of all levels in the ICD and ATC hierarchies. For instance, the diagnosis Schizoaffective disorder, depressive type, encoded as F25.1, was represented as F251 in the original-level feature set, and as F251, F25, F2, and F in the feature set comprising all levels; the drug Simvastatin, encoded as C10AA01, was represented as C10AA01 in the original-level feature set, and as C10AA01, C10AA, C10A, C10, and C in the all-levels feature set.

Descriptive statistics of each dataset, with original-level features and all-levels features, respectively, are presented in Table I, where sample size, feature size, and density, defined as the proportion of non-zero values, are shown. The smallest dataset, with respect to number of instances, is G25.1, with 48 patients, and the largest one is T78.4, with 3,586 patients. Among all the recorded ICD and ATC codes in the Stockholm EPR Corpus, codes that were not assigned to any patient in a given dataset, i.e., for which all values are zero, have been removed from that dataset; the number of features therefore ranges from 354 to 3,246 on the original level, while the number is approximately twice as large when including all levels. In general, the density of the all-levels feature set (1.4% - 6%) is higher than that of the original-level feature set (0.3% - 4%).

TABLE I. DESCRIPTION OF DATASETS

Dataset | Sample | Original level: features | Original level: density | All levels: features | All levels: density
D64.2 | 234 | 1289 | 2.9% | 2437 | 4.8%
E27.3 | 78 | 699 | 3.1% | 1506 | 4.7%
F11.0 | 132 | 767 | 2.0% | 1582 | 3.1%
F11.2 | 280 | 1360 | 1.6% | 2557 | 2.8%
F13.0 | 264 | 1145 | 1.2% | 2224 | 2.2%
F13.2 | 86 | 701 | 2.9% | 1522 | 4.5%
F15.0 | 70 | 481 | 2.6% | 1134 | 3.9%
F15.1 | 56 | 402 | 3.9% | 952 | 5.6%
F15.2 | 206 | 895 | 1.8% | 1817 | 3.1%
F19.0 | 212 | 886 | 1.4% | 1816 | 2.3%
F19.1 | 56 | 509 | 3.7% | 1137 | 5.4%
F19.2 | 292 | 1260 | 1.5% | 2408 | 2.6%
F19.9 | 74 | 572 | 3.2% | 1293 | 4.8%
G24.0 | 76 | 572 | 2.7% | 1310 | 4.0%
G25.1 | 48 | 352 | 3.8% | 892 | 5.3%
G44.4 | 156 | 646 | 1.7% | 1437 | 2.7%
G62.0 | 82 | 658 | 3.1% | 1431 | 4.8%
I42.7 | 80 | 591 | 4.0% | 1303 | 6.0%
I95.2 | 98 | 719 | 3.0% | 1546 | 4.7%
L27.0 | 470 | 1856 | 1.2% | 3280 | 2.2%
L27.1 | 138 | 837 | 2.5% | 1749 | 3.9%
N14.1 | 52 | 411 | 3.7% | 978 | 5.3%
O35.5 | 680 | 1520 | 1.0% | 2778 | 1.9%
T59.9 | 104 | 504 | 2.1% | 1131 | 3.2%
T78.2 | 168 | 755 | 1.8% | 1587 | 2.9%
T78.3 | 674 | 1633 | 0.7% | 2978 | 1.4%
T78.4 | 3586 | 3244 | 0.3% | 5105 | 0.7%
T80.8 | 726 | 1927 | 1.4% | 3353 | 2.6%
T88.6 | 120 | 859 | 2.4% | 1758 | 3.9%
T88.7 | 1090 | 2612 | 0.7% | 4288 | 1.4%
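A minimal sketch of how the two feature sets above (original level versus all levels) could be derived from the original codes; the helper names and the four-character ICD prefix scheme follow the F25.1 example and are illustrative, not the authors' implementation:

```python
def icd_levels(code: str) -> list[str]:
    """Truncate an ICD code (written without the dot, e.g. 'F251' for F25.1)
    to the four prefix levels used here: 'F', 'F2', 'F25', 'F251'."""
    return [code[:cut] for cut in (1, 2, 3, 4) if len(code) >= cut]

def all_level_features(original_codes, expand):
    """Expand a set of original-level codes into codes on every hierarchy level."""
    expanded = set()
    for code in original_codes:
        expanded.update(expand(code))
    return expanded

original_level = {"F251"}                                     # feature set (1)
all_levels = all_level_features(original_level, icd_levels)   # feature set (2)
print(sorted(all_levels))                                     # ['F', 'F2', 'F25', 'F251']
```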

C. Experimental Setup

The impact of feature representations based on concept hierarchies for predicting patients with ADEs was investigated in a series of experiments. The random forests algorithm [19] was chosen as the underlying machine learning algorithm to generate predictive models, due to its reputation of achieving high accuracy and the possibility of obtaining estimates of variable importance. The algorithm constructs an ensemble, or forest, of decision trees, which together vote for which class label to assign to an example. Each tree in the forest is built from a bootstrap replicate of the original instances, and a subset of all features is sampled at each node when building the tree; both steps are taken in order to increase the diversity among the trees. When the number of trees in the forest increases, the probability that a majority of trees makes an error decreases, given that the trees perform better than random and that the errors are made independently. Although this condition can only be guaranteed in theory, the algorithm has often been shown in practice to result in state-of-the-art predictive performance. Moreover, random forests has been suggested as a standard classifier for high-dimensional, sparse data, e.g., microarray data [20]. This hence suggests that the method is also suitable for the type of data considered in this study, with each dataset having many more features than instances (see Table I). In this study, we consider random forests with five hundred trees. For each dataset, the presence or absence of the selected ADE-related ICD code in a record determines the assigned (binary) class label. In all experiments, models were built and evaluated using 10-fold cross validation, and the results were averaged over ten iterations.
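A minimal sketch of this evaluation protocol, assuming scikit-learn (the library, function names, and seed handling are assumptions; the paper does not specify its implementation):

```python
# Assumed: X is a binary feature matrix and y the ADE / non-ADE labels for one dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

def evaluate_dataset(X, y, seed=0):
    clf = RandomForestClassifier(n_estimators=500, random_state=seed)
    # 10-fold cross-validation, repeated ten times, averaging the results.
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=seed)
    scores = cross_validate(clf, X, y, cv=cv, scoring=("accuracy", "roc_auc"))
    return np.mean(scores["test_accuracy"]), np.mean(scores["test_roc_auc"])
```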

The considered performance metrics are accuracy and area under ROC curve (AUC). Accuracy corresponds to the percentage of correctly classified instances, while AUC depicts the performance of a model without regard to class distribution or error costs by estimating the probability that a model ranks a randomly chosen positive instance ahead of a negative one. In the first experiment, where two competing models are compared, the Wilcoxon signed-rank test was employed for statistical hypothesis testing, where the null hypothesis is that the methods perform equally well. This test ranks the differences in performance of two feature representations on each dataset, ignoring the signs, and compares the ranks of positive and negative differences. It was chosen for its robustness when comparing two classifiers [21]. In subsequent experiments, which involve comparisons of multiple methods, a Friedman test is instead employed, followed by a post-hoc test using the Bergman-Hommel procedure, as suggested in [22]. Again, the ranks are compared, but now adjusting for the fact that multiple comparisons are performed.
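The tests themselves are standard; a sketch using SciPy (an assumed library choice, with placeholder numbers rather than the study's results) looks as follows:

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Paired per-dataset scores for two feature representations (placeholder values).
auc_original = [0.71, 0.84, 0.74, 0.67, 0.77]
auc_all_levels = [0.86, 0.95, 0.84, 0.76, 0.85]

# Wilcoxon signed-rank test on the paired differences between two methods.
stat, p = wilcoxon(auc_original, auc_all_levels)
print(f"Wilcoxon p-value: {p:.4f}")

# Friedman test when more than two methods are compared (one list per method).
scores_a = [0.71, 0.84, 0.74, 0.67, 0.77]
scores_b = [0.80, 0.90, 0.80, 0.70, 0.80]
scores_c = [0.86, 0.95, 0.84, 0.76, 0.85]
stat, p = friedmanchisquare(scores_a, scores_b, scores_c)
print(f"Friedman p-value: {p:.4f}")
```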

The first experiment compared the use of features on the original level to the use of features on all levels over thirty datasets. Subsequently, three of the thirty datasets were randomly chosen to allow for analyzing the importance of variables corresponding to different levels in the concept hierarchies. The three datasets were: L27.0, E27.3 and F15.0. Variable importance can be estimated in different ways, see, e.g., [19]. In this study, Gini importance was chosen as the variable importance metric [23]. For each original ATC and ICD code, it and its corresponding aggregated codes on all levels were compared and ranked based on their Gini importance. Again, a Friedman test was conducted to test the statistical significance of the null hypothesis that all of them are equally important. This test ranks the features on different levels for all original codes in each dataset, ranging from 1 (the best) to 5 for ATC and 4 for ICD (the worst).
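A sketch of this ranking procedure, assuming the Gini importances come from a fitted random forest and that a helper returns a code on all of its hierarchy levels (names are illustrative, not the authors' code):

```python
from collections import defaultdict

import numpy as np
from scipy.stats import rankdata

def mean_rank_per_level(importances, original_codes, expand):
    """For each original code, rank the Gini importance of the code and its
    ancestor codes (rank 1 = most important), then average the ranks per level.

    importances    : dict mapping feature name -> Gini importance, e.g. built
                     from a fitted forest's feature_importances_.
    original_codes : iterable of original-level codes in the dataset.
    expand         : function returning a code on all levels, most general first.
    """
    per_level_ranks = defaultdict(list)
    for code in original_codes:
        levels = expand(code)                       # e.g. ['C', 'C10', ..., 'C10AA01']
        scores = [importances.get(c, 0.0) for c in levels]
        ranks = rankdata([-s for s in scores])      # rank 1 = highest importance
        for level, rank in enumerate(ranks, start=1):
            per_level_ranks[level].append(rank)
    return {level: float(np.mean(r)) for level, r in per_level_ranks.items()}
```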



        

































 





      

Based on the variable importance analysis, two alternative strategies for investigating the impact of the different levels in the concept hierarchies were suggested: a top-down approach, by which levels are added one by one starting with the most important level, and a bottom-up approach, by which levels are added one by one starting with the least important level (see Figure 3). These strategies are designed to provide insight into how the predictive performance is affected by successively adding more important, or less important, levels to the feature set. The alternative feature sets are used to generate models, which are then evaluated and compared on the remaining 27 datasets.

Fig. 3. Schema of top-down and bottom-up level-wise feature addition. TD refers to Top-Down, BU refers to Bottom-Up and L to Level. Examples of the ATC code ”C10AA01” and the ICD code ”F25.1” are given. Larger font size means higher variable importance.
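A sketch of how the nested feature sets of Figure 3 could be assembled (the helper and its arguments are illustrative assumptions, not the authors' implementation):

```python
def cumulative_feature_sets(columns_per_level, order="top_down"):
    """Build the five nested feature sets of Figure 3, one hierarchy level at a time.

    columns_per_level maps a level number (1 = most general) to the feature
    columns encoding that level; order is "top_down" or "bottom_up".
    """
    levels = sorted(columns_per_level)
    if order == "bottom_up":
        levels = list(reversed(levels))      # start from the most specific level
    feature_sets, selected = [], []
    for level in levels:
        selected = selected + list(columns_per_level[level])
        feature_sets.append(list(selected))  # TD: L1, L1-L2, ...; BU: L5, L5-L4, ...
    return feature_sets
```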

III. RESULTS

We first present the results from comparing the predictive performance when using the lowest-level clinical codes and when using all levels. We then present the outcome of the variable importance analysis from three of the thirty datasets. The results from employing the top-down and bottom-up approaches, which were guided by the variable importance analysis on three of the datasets, are then presented for the remaining 27 datasets.

A. Original Level versus All Levels

Accuracy and AUC for each dataset using features consisting of ATC and ICD on the original level and using a combination of all levels, respectively, are listed in Table II. The mean accuracy of the random forests models is 74.5% when using the original level and 79.8% when using all levels, while the mean AUC is 0.83 when using the original level and 0.88 when using all levels.

When comparing the results from the original level and from all levels, it can be seen that the latter clearly outperforms the former, as shown in Table II. Employing a feature set from all levels wins 28 times with respect to accuracy and 26 times with respect to AUC, out of 30 comparisons. The p-values for both comparisons, i.e., accuracy and AUC, indicate that the differences are significant, and hence we can safely reject the null hypothesis that the two feature representations result in equally strong models.

TABLE II. ACCURACY AND AUC FROM RANDOM FORESTS WITH FEATURES ON THE ORIGINAL LEVEL AND ALL LEVELS

Dataset | Accuracy (%): Original, All levels | AUC: Original, All levels
D64.2 | 96.24, 96.58 | 0.997, 0.996
E27.3 | 83.32, 84.38 | 0.928, 0.914
F11.0 | 66.96, 78.32 | 0.713, 0.863
F11.2 | 78.64, 88.00 | 0.842, 0.948
F13.0 | 68.56, 74.43 | 0.742, 0.836
F13.2 | 79.60, 77.24 | 0.880, 0.883
F15.0 | 54.86, 64.00 | 0.671, 0.761
F15.1 | 64.77, 70.07 | 0.770, 0.848
F15.2 | 87.59, 87.83 | 0.950, 0.958
F19.0 | 70.52, 80.47 | 0.757, 0.897
F19.1 | 59.57, 69.97 | 0.676, 0.803
F19.2 | 77.94, 84.86 | 0.884, 0.929
F19.9 | 81.48, 83.63 | 0.887, 0.895
G24.0 | 52.30, 57.75 | 0.588, 0.672
G25.1 | 54.40, 65.65 | 0.635, 0.742
G44.4 | 71.38, 74.65 | 0.796, 0.870
G62.0 | 72.50, 76.81 | 0.817, 0.846
I42.7 | 85.38, 86.63 | 0.951, 0.950
I95.2 | 80.93, 83.87 | 0.898, 0.915
L27.0 | 69.74, 80.38 | 0.763, 0.893
L27.1 | 67.49, 79.00 | 0.771, 0.884
N14.1 | 66.70, 70.97 | 0.813, 0.840
O35.5 | 96.12, 96.10 | 0.994, 0.993
T59.9 | 68.47, 70.98 | 0.781, 0.835
T78.2 | 75.02, 82.34 | 0.835, 0.907
T78.3 | 78.34, 85.07 | 0.880, 0.928
T78.4 | 86.79, 89.93 | 0.943, 0.959
T80.8 | 95.70, 95.93 | 0.985, 0.984
T88.6 | 73.17, 83.00 | 0.850, 0.902
T88.7 | 70.55, 76.30 | 0.773, 0.844
P-value | < 0.0001 | < 0.0001

B. Variable Importance Analysis

After exploring the variable importance of each level in the ATC and ICD hierarchies on three randomly selected datasets, the mean rank of each hierarchy level is presented in Table III and Table IV, respectively. The small p-values indicate that the differences between levels are significant for each dataset. The results for the ATC and ICD hierarchy levels are consistent: the most general level is the most important, while the most specific level is the least important.

TABLE III. MEAN RANK OF VARIABLE IMPORTANCE OF EACH ATC HIERARCHY LEVEL

Dataset | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | P-value
L27.0 | 1.073 | 2.195 | 2.982 | 3.851 | 4.496 | < 0.0001
E27.3 | 1.226 | 2.570 | 3.220 | 3.709 | 4.144 | < 0.0001
F15.0 | 1.109 | 2.457 | 3.218 | 3.684 | 4.170 | < 0.0001

TABLE IV. MEAN RANK OF VARIABLE IMPORTANCE OF EACH ICD HIERARCHY LEVEL

Dataset | Level 1 | Level 2 | Level 3 | Level 4 | P-value
L27.0 | 1.329 | 2.342 | 3.142 | 3.590 | < 0.0001
E27.3 | 1.393 | 2.534 | 3.000 | 3.203 | < 0.0001
F15.0 | 1.437 | 2.490 | 3.202 | 3.231 | < 0.0001

C. Level-Wise Feature Addition

The top-down feature addition strategy is illustrated by Figure 3, where the font size indicates variable importance: the bigger the size, the more important the level. As shown in the figure, five feature sets were generated: (1) codes from level 1 only; (2) codes from level 1 and level 2; (3) codes from level 1, level 2 and level 3; (4) codes from level 1, level 2, level 3 and level 4; and (5) codes from all levels.

Table V shows the rank of the models generated from the five feature sets with respect to accuracy and AUC on the remaining 27 datasets.

TABLE V. RANK OF ACCURACY AND AUC OF FEATURE SETS SELECTED IN THE TOP-DOWN APPROACH

Dataset | Accuracy: TD L1, TD L1-L2, TD L1-L3, TD L1-L4, TD L1-L5 | AUC: TD L1, TD L1-L2, TD L1-L3, TD L1-L4, TD L1-L5
D64.2 | 4, 5, 3, 2, 1 | 5, 4, 3, 2, 1
F11.0 | 5, 1, 3, 2, 4 | 4, 1, 2, 5, 3
F11.2 | 5, 2, 1, 3, 4 | 5, 1, 2, 3, 4
F13.0 | 5, 2, 1, 3, 4 | 5, 2, 1, 3, 4
F13.2 | 4, 1, 2, 3, 5 | 5, 3, 2, 1, 4
F15.1 | 1, 3, 4, 5, 2 | 4, 2, 3, 5, 1
F15.2 | 4, 1, 3, 2, 5 | 5, 1, 3, 2, 4
F19.0 | 5, 2, 1, 3, 4 | 5, 2, 1, 3, 4
F19.1 | 3, 2, 1, 4, 5 | 1, 2, 3, 5, 4
F19.2 | 5, 2, 3, 4, 1 | 5, 3, 2, 4, 1
F19.9 | 5, 4, 1, 2, 3 | 5, 4, 2, 1, 3
G24.0 | 5, 1, 3, 2, 4 | 5, 2, 3, 1, 4
G25.1 | 4, 1, 3, 5, 2 | 1, 2, 5, 3, 4
G44.4 | 5, 4, 1, 2, 3 | 5, 4, 3, 1, 2
G62.0 | 5, 3, 4, 2, 1 | 5, 4, 3, 1, 2
I42.7 | 5, 4, 3, 2, 1 | 5, 4, 3, 2, 1
I95.2 | 4, 1, 2, 3, 5 | 5, 3, 2, 1, 4
L27.1 | 4, 1, 3, 5, 2 | 5, 2, 3, 4, 1
N14.1 | 1, 2, 3, 5, 4 | 3, 1, 2, 4, 5
O35.5 | 5, 3, 2, 1, 4 | 5, 4, 3, 2, 1
T59.9 | 4, 2, 1, 3, 5 | 4, 2, 1, 3, 5
T78.2 | 1, 2, 3, 4, 5 | 4, 1, 2, 3, 5
T78.3 | 5, 4, 2, 1, 3 | 5, 1, 3, 4, 2
T78.4 | 5, 2, 4, 3, 1 | 5, 4, 3, 2, 1
T80.8 | 5, 3, 1, 4, 2 | 5, 3, 4, 2, 1
T88.6 | 2, 1, 4, 5, 3 | 5, 3, 2, 4, 1
T88.7 | 5, 4, 1, 3, 2 | 5, 4, 2, 1, 3
Mean | 4.111, 2.333, 2.333, 3.074, 3.148 | 4.481, 2.556, 2.519, 2.667, 2.778
P-value | < 0.001 (accuracy) | < 0.0001 (AUC)

Given the small p-values according to the Friedman test for both accuracy and AUC, the generated feature sets have significantly different impacts on the predictive performance of the random forest models. A more detailed evaluation of the pairwise relations between these feature sets, based on the post-hoc Bergman-Hommel procedure, is shown in Figure 4, where the ranking difference between each pair is plotted and boxes corresponding to significant differences (p < 0.05) are colored in green.

Fig. 4. Box plots of the pairwise rank differences between top-down selected feature sets. Green indicates that the difference is significant. A positive rank difference means that the first method performs worse than the second. Labels on the x-axis refer to the feature sets in the top-down (TD) level-wise feature addition strategy, as illustrated in Figure 3.

The bottom-up strategy to level-wise feature addition is also illustrated by Figure 3, where five feature sets were generated: (1) codes from the lowest (original) level only; (2) codes from the two lower levels; (3) codes from the three lower levels; (4) codes from the four lower levels; and (5) codes from all levels.

In Table VI, the ranks of the resulting five models with respect to accuracy and AUC are shown. Given the small p-values according to the Friedman test on both accuracy and AUC, the observed differences between the feature sets are significant. The result of a pairwise comparison of the feature sets based on the Bergman-Hommel post-hoc procedure is shown in Figure 5, where the ranking differences between each pair are plotted and boxes corresponding to significant differences (p < 0.05) are colored in green.

TABLE VI. RANK OF ACCURACY AND AUC OF FEATURE SETS SELECTED IN THE BOTTOM-UP APPROACH

Dataset | Accuracy: BU L5, BU L5-L4, BU L5-L3, BU L5-L2, BU L5-L1 | AUC: BU L5, BU L5-L4, BU L5-L3, BU L5-L2, BU L5-L1
D64.2 | 2, 3, 4, 5, 1 | 1, 2, 4, 5, 3
F11.0 | 5, 4, 3, 2, 1 | 5, 4, 3, 2, 1
F11.2 | 5, 4, 3, 2, 1 | 5, 4, 3, 2, 1
F13.0 | 5, 4, 3, 1, 2 | 5, 4, 3, 1, 2
F13.2 | 4, 3, 1, 2, 5 | 5, 3, 2, 1, 4
F15.1 | 2, 5, 4, 3, 1 | 2, 5, 4, 3, 1
F15.2 | 5, 3, 2, 1, 4 | 5, 4, 2, 1, 3
F19.0 | 5, 4, 3, 2, 1 | 5, 4, 3, 1, 2
F19.1 | 5, 4, 3, 2, 1 | 3, 5, 4, 2, 1
F19.2 | 4, 5, 3, 2, 1 | 3, 5, 4, 2, 1
F19.9 | 5, 3, 4, 2, 1 | 5, 2, 1, 3, 4
G24.0 | 5, 2, 3, 1, 4 | 5, 3, 2, 1, 4
G25.1 | 5, 3, 4, 2, 1 | 5, 3, 4, 2, 1
G44.4 | 4, 5, 3, 1, 2 | 5, 4, 3, 2, 1
G62.0 | 5, 4, 1, 3, 2 | 4, 5, 2, 3, 1
I42.7 | 2, 3, 4, 5, 1 | 1, 3.5, 3.5, 5, 2
I95.2 | 5, 3, 1, 2, 4 | 5, 3, 2, 1, 4
L27.1 | 3, 5, 4, 2, 1 | 3, 5, 4, 2, 1
N14.1 | 5, 2, 3, 4, 1 | 5, 4, 3, 1, 2
O35.5 | 4, 3, 1, 2, 5 | 1, 5, 2, 4, 3
T59.9 | 4, 5, 2, 1, 3 | 5, 4, 2, 1, 3
T78.2 | 5, 4, 3, 2, 1 | 5, 4, 3, 1, 2
T78.3 | 4, 5, 3, 2, 1 | 4, 5, 3, 2, 1
T78.4 | 5, 4, 3, 2, 1 | 5, 4, 3, 2, 1
T80.8 | 2, 4, 3, 5, 1 | 1, 2, 4, 5, 3
T88.6 | 5, 4, 3, 2, 1 | 3, 5, 4, 2, 1
T88.7 | 5, 4, 3, 2, 1 | 5, 4, 3, 2, 1
Mean | 4.260, 3.778, 2.852, 2.296, 1.815 | 3.926, 3.907, 2.981, 2.185, 2.000
P-value | < 0.0001 (accuracy) | < 0.0001 (AUC)

Fig. 5. Box plots of the pairwise rank differences between bottom-up selected feature sets. Green indicates that the difference is significant. A positive rank difference means that the first method performs worse than the second. Labels on the x-axis refer to the feature sets in the bottom-up (BU) level-wise feature addition strategy, as illustrated in Figure 3.

To summarize the effect of adding features according to the top-down and bottom-up procedures, average ranks with respect to accuracy and AUC for the different feature sets, as shown in Table V and Table VI, are compared in Figure 6. It can be clearly seen that the bottom-up procedure monotonically benefits from including more general-level codes in the feature set, while the top-down strategy levels out after including the two most general-level codes in the feature set.

Fig. 6. Comparison of bottom-up and top-down strategies (mean rank per number of levels in the feature set, for accuracy and AUC).

IV. DISCUSSION

The impact of using the concept hierarchies of ATC and ICD codes on the predictive performance of random forests for detecting ADEs in EHRs has been investigated. A series of experiments were conducted: (1) the use of original-level codes as features was compared to using codes from all levels in the hierarchies, where the results show that the latter significantly outperforms the former; (2) variable importance analysis of models using codes from all levels was undertaken, where the results show that higher levels are more important than lower levels, in the order of specificity; and (3) the impact of using different levels of the ATC and ICD concept hierarchies was further analyzed through two strategies for level-wise feature addition – top-down and bottom-up – which were based on the observed importance of each level in the previous experiment. It was found that, in the bottom-up strategy, predictive performance improves monotonically, while, in the top-down strategy, predictive performance levels out after adding the two most important levels.

By representing codes on all levels instead of only the lowest level, the predictive performance was significantly improved, despite the fact that the dimensionality was approximately doubled when including all levels. A similar phenomenon was observed for the bottom-up strategy, where the addition of more codes from higher levels led to a monotonic increase in predictive performance. An explanation for this is that it is a consequence of the special character of representing features based on their hierarchical structure, where, in fact, density increases, or sparsity decreases, when adding higher-level codes to the feature set (see Table I). This is, of course, because more patients have codes in common on higher levels. This possibly explains why the predictive performance did not improve monotonically in the top-down strategy, since the added lower-level features are less dense – and generally less important based on the analysis of variable importance – than the existing features.

Although interesting to observe, the fact that the higher levels in the concept hierarchies were, in this case, found to be more important was somewhat expected. That is because the variable importance metric that was employed in this study is biased towards dense features, which means that the scores favor the higher-level codes. The reason for this is that even when there is no correlation between a certain feature and the class label, there will, in a sufficiently large sample, almost always be some difference in the relative class frequencies for those having a value for this feature and those that do not – with less dense (more sparse) features, there is less chance of such deviations. As higher levels are invariably denser, it is not surprising that they turned out to be more important. Another explanation can be that the negative examples in this study were randomly selected from the whole database, which means that they are potentially very different from the positive examples: they may, for instance, have diseases from different organ systems. The trained model may consequently capture such general differences rather than a specific pattern that detects the ADE of interest. Depending on the use case, it may be more reasonable to select, as negative examples, patients who are as similar as possible to the positive examples, but who have not experienced the ADE of interest.

The feature set that performed best in the bottom-up strategy was the one in which codes were represented on all levels, while the best one in the top-down strategy – and best overall – was the one representing codes on the highest three levels. In this latter strategy, it was observed that using only the highest three levels outperformed using codes represented on all levels, albeit not significantly so. The question, then, is whether we gain anything from including fewer features, or if we are better off – as is generally considered to be the rule of thumb when employing random forests – including as many features as possible. One motivation to include fewer features could, of course, be computational. On the other hand, it may be safer to represent the codes on all levels, should the prediction task be framed in a more challenging fashion. In any case, it is worth highlighting that the best-performing feature set would not have been identified in the bottom-up strategy, indicating that analyzing variable importance in the manner of this study could be employed as a feature selection strategy.

Moreover, the number of levels is not consistent in ATC and ICD: ATC has five levels in total, while ICD only has four. This causes some trouble for both the top-down and bottom-up strategies. In the top-down strategy, the last step in the level-wise feature addition will only add the lowest level of the ATC codes, while, in the bottom-up strategy, the last step will only add the highest level of the ATC codes. This has less impact in the top-down strategy (see Figure 6), as the final step in the level-wise feature addition involves adding the least important level (i.e., the original level); however, in the bottom-up strategy, the last step involves adding the most important level (i.e., the highest level), which means that, in this case, the most important ATC level and ICD level are not added simultaneously.

Concerning the task – that is, to detect patients with ADEs – the predictive performance of random forests in this study can be considered relatively strong when compared to a previous study on a similar task [12], with high accuracy and AUC scores on most datasets, using either codes represented as they were originally encoded or on all levels of the concept hierarchies. Besides the robustness of random forests and the informativeness of the selected features, this high performance can partly be explained by two minor limitations of this study: (1) the prediction task is fairly easy for the algorithm, since the negative examples are randomly selected patients from the whole database, but do not really constitute a control group, in which case they should be as similar as possible to the positive examples; and (2) the negative examples are as many as the positive examples to avoid unbalanced class distributions, whereas, in a real-world setting, there would be many more negative examples than positive examples, since ADEs are rare medical events.

In this study, predictive models are built using drugs and diagnoses encoded in the medical history of patients. This is, however, only the tip of the iceberg of the wealth of information that is contained in EHRs. Future studies could also include, for instance, clinical measurements [24] and clinical notes [6], which are important predictors of ADEs. Although there is no off-the-shelf concept hierarchy for clinical measurements, this data can still be aggregated, or discretized, into different levels according to certain rules corresponding to their ontological character and based on medical knowledge. Some continuous-valued measurements, such as heart rate, can, for instance, be discretized into Bradycardia (low heart rate), Normal and Tachycardia (high heart rate). Furthermore, in some cases, the same concept can be measured repeatedly with different forms of tests: these can be aggregated, too, into a simpler representation – an example of this can be found in [11]. Using textual features typically results in high dimensionality and high sparsity; this can, however, be reduced by detecting synonyms that refer to the same medical concept [25] and by leveraging biomedical ontologies [26].

V. CONCLUSION

This study demonstrates that, when representing drugs and diagnoses by their ATC and ICD codes, it is suboptimal to use the codes on the original level, as they are typically encoded in EHRs, as features to predict ADEs, even if this might be considered the most natural option. Instead, this study shows that leveraging the concept hierarchies of clinical codes and representing them on all levels yields a significant improvement in predictive performance. Moreover, this study suggests that it is possible to analyze variable importance generated from random forests for feature selection, and it shows that the predictive performance can be kept at a high level by only including codes on the higher levels in the hierarchies.

ACKNOWLEDGMENT

This work was partly supported by the project High-Performance Data Mining for Drug Effect Detection at Stockholm University, funded by the Swedish Foundation for Strategic Research under grant IIS11-0053.

REFERENCES

[1] J. R. Nebeker, P. Barach, and M. H. Samore, "Clarifying adverse drug events: a clinician's guide to terminology, documentation, and reporting," Annals of Internal Medicine, vol. 140, no. 10, pp. 795–801, 2004.
[2] R. Howard, A. Avery, S. Slavenburg, S. Royal, G. Pipe, P. Lucassen, and M. Pirmohamed, "Which drugs cause preventable admissions to hospital? A systematic review," British Journal of Clinical Pharmacology, vol. 63, no. 2, pp. 136–147, 2007.
[3] K. Wester, A. K. Jönsson, O. Spigset, H. Druid, and S. Hägg, "Incidence of fatal adverse drug reactions: a population based study," British Journal of Clinical Pharmacology, vol. 65, no. 4, pp. 573–579, 2008.
[4] C. D. Furberg and B. Pitt, "Withdrawal of cerivastatin from the world market," Current Controlled Trials in Cardiovascular Medicine, vol. 2, no. 5, pp. 205–207, 2001.
[5] R. Harpaz, K. Haerian, H. S. Chase, and C. Friedman, "Mining electronic health records for adverse drug effects using regression based methods," in Proceedings of the 1st ACM International Health Informatics Symposium. ACM, 2010, pp. 100–107.
[6] P. Warrer, E. H. Hansen, L. Juhl-Jensen, and L. Aagaard, "Using text-mining techniques in electronic patient records to identify ADRs from medicine use," British Journal of Clinical Pharmacology, vol. 73, no. 5, pp. 674–684, 2012.
[7] M. Suling and I. Pigeot, "Signal detection and monitoring based on longitudinal healthcare data," Pharmaceutics, vol. 4, no. 4, pp. 607–640, 2012.
[8] J. A. Linder, J. S. Haas, A. Iyer, M. A. Labuzetta, M. Ibara, M. Celeste, G. Getty, and D. W. Bates, "Secondary use of electronic health record data: spontaneous triggered adverse drug event reporting," Pharmacoepidemiology and Drug Safety, vol. 19, no. 12, pp. 1211–1215, 2010.
[9] P. M. Coloma, M. J. Schuemie, G. Trifirò, R. Gini, R. Herings, J. Hippisley-Cox, G. Mazzaglia, C. Giaquinto, G. Corrao, L. Pedersen et al., "Combining electronic healthcare databases in Europe to allow for large-scale drug safety monitoring: the EU-ADR project," Pharmacoepidemiology and Drug Safety, vol. 20, no. 1, pp. 1–11, 2011.
[10] P. B. Jensen, L. J. Jensen, and S. Brunak, "Mining electronic health records: towards better research applications and clinical care," Nature Reviews Genetics, vol. 13, no. 6, pp. 395–405, 2012.
[11] E. Chazard, G. Ficheur, S. Bernonville, M. Luyckx, and R. Beuscart, "Data mining to generate adverse drug events detection rules," IEEE Transactions on Information Technology in Biomedicine, vol. 15, no. 6, pp. 823–830, 2011.
[12] I. Karlsson, J. Zhao, L. Asker, and H. Boström, "Predicting adverse drug events by analyzing electronic patient records," in Conference Proceedings of Artificial Intelligence in Medicine. Springer, 2013, pp. 125–129.
[13] WHO Collaborating Centre for Drug Statistics Methodology. (2014) Anatomical Therapeutic Chemical (ATC) classification system. Accessed: March 15, 2014. [Online]. Available: http://www.whocc.no/atc/structure_and_principles/
[14] World Health Organization. (2014) International Classification of Diseases. Accessed: March 15, 2014. [Online]. Available: http://www.who.int/classifications/icd/en/
[15] L. Hazell and S. A. Shakir, "Under-reporting of adverse drug reactions," Drug Safety, vol. 29, no. 5, pp. 385–396, 2006.
[16] H. Dalianis, M. Hassel, A. Henriksson, and M. Skeppstedt, "Stockholm EPR Corpus: A clinical database used to improve health care," in Swedish Language Technology Conference. Citeseer, 2012.
[17] Y. Sun, M. S. Kamel, Y. Wang et al., "Boosting for learning multiple classes with imbalanced class distribution," in ICDM, vol. 6, 2006, pp. 592–602.
[18] J. Stausberg and J. Hasford, "Drug-related admissions and hospital-acquired adverse drug events in Germany: a longitudinal analysis from 2003 to 2007 of ICD-10-coded routine data," BMC Health Services Research, vol. 11, no. 1, p. 134, 2011.
[19] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[20] R. Díaz-Uriarte and S. A. De Andres, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, no. 1, p. 3, 2006.
[21] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," The Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
[22] S. Garcia and F. Herrera, "An extension on "Statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons," Journal of Machine Learning Research, vol. 9, no. 12, 2008.
[23] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: Illustrations, sources and a solution," BMC Bioinformatics, vol. 8, no. 1, p. 25, 2007.
[24] A. K. Jha, G. J. Kuperman, J. M. Teich, L. Leape, B. Shea, E. Rittenberg, E. Burdick, D. L. Seger, M. Vander Vliet, and D. W. Bates, "Identifying adverse drug events: development of a computer-based monitor and comparison with chart review and stimulated voluntary report," Journal of the American Medical Informatics Association, vol. 5, no. 3, pp. 305–314, 1998.
[25] A. Henriksson, H. Moen, M. Skeppstedt, V. Daudaravičius, and M. Duneld, "Synonym extraction and abbreviation expansion with ensembles of semantic spaces," Journal of Biomedical Semantics, vol. 5, no. 1.
[26] P. LePendu, S. V. Iyer, A. Bauer-Mehren, R. Harpaz, J. M. Mortensen, T. Podchiyska, T. A. Ferris, and N. H. Shah, "Pharmacovigilance using clinical notes," Clinical Pharmacology & Therapeutics, vol. 93, no. 6, pp. 547–555, 2013.
