
Pharmacovigilance Using Clinical Notes

P LePendu1, SV Iyer1, A Bauer-Mehren1, R Harpaz1, JM Mortensen1, T Podchiyska2, TA Ferris2 and NH Shah1

1Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA; 2Stanford Center for Clinical Informatics, Stanford University, Stanford, California, USA. Correspondence: P LePendu ([email protected])

Received 26 October 2012; accepted 22 February 2013; advance online publication 10 April 2013. doi:10.1038/clpt.2013.47

With increasing adoption of electronic health records (EHRs), there is an opportunity to use the free-text portion of EHRs for pharmacovigilance. We present novel methods that annotate the unstructured clinical notes and transform them into a deidentified patient–feature matrix encoded using medical terminologies. We demonstrate the use of the resulting high-throughput data for detecting drug–adverse event associations and adverse events associated with drug–drug interactions. We show that these methods flag adverse events early (in most cases before an official alert), allow filtering of spurious signals by adjusting for potential confounding, and compile prevalence information. We argue that analyzing large volumes of free-text clinical notes enables drug safety surveillance using a yet untapped data source. Such data mining can be used for hypothesis generation and for rapid analysis of suspected adverse event risk.

Phase IV surveillance is a critical component of drug safety because not all safety issues associated with drugs are detected before market approval. Each year, drug-related events account for up to 50% of adverse events occurring in hospital stays,1 significantly increasing costs and length of stay in hospitals.2 As much as 30% of all drug reactions result from concomitant use, with an estimated 29.4% of elderly patients on six or more drugs.3 Efforts such as the Sentinel Initiative and the Observational Medical Outcomes Partnership4 envision the use of EHRs for active pharmacovigilance.5–7 Complementing the current state of the art, which is based on reports of suspected adverse drug reactions, active surveillance aims to monitor drugs in near real time and potentially shorten the time that patients are at risk. Coded discharge diagnoses and insurance claims data from EHRs have already been used for detecting safety signals.8–10 However, some experts argue that methods relying on coded data could be missing >90% of the adverse events that actually occur, in part because of the nature of billing and claims data.1 Researchers have used discharge summaries (which summarize information from a care episode, including the final diagnosis and follow-up plan) for detecting a range of adverse events11 and for demonstrating the feasibility of using the EHR for pharmacovigilance by identifying known adverse events associated with seven drugs using 25,074 notes from 2004.12 Therefore, clinical text can potentially play an important role in future pharmacovigilance,13,14 particularly if we can transform notes taken daily by doctors, nurses, and other practitioners into more accessible data-mining inputs.15–17

Two key barriers to using clinical notes are privacy and accessibility.16 Clinical notes contain identifying information, such as names, dates, and locations, that is difficult to redact automatically, so care organizations are reluctant to share clinical notes. We describe an approach that computationally processes clinical text rapidly and accurately enough to serve use cases such as drug safety surveillance. Like other terminology-based systems, it deidentifies the data as part of the process.18 We trade some individual note-level accuracy in the text processing for the “unreasonable effectiveness”24 of large data sets. Given the large volumes of clinical notes, our method produces a patient–feature matrix encoded using standardized medical terminologies. We demonstrate the use of the resulting patient–feature matrix as a substrate for signal detection algorithms for drug–adverse event associations and drug–drug interactions.

RESULTS

Our results show that it is possible to detect drug safety signals using clinical notes transformed into a feature matrix encoded using medical terminologies. We evaluate the performance of the resulting data set for pharmacovigilance using curated reference sets of single-drug adverse events as well as adverse events related to drug–drug interactions. In addition, we show that we can simultaneously estimate the prevalence of adverse events resulting from drug–drug interactions. The reference set, described in the Methods section, contains 28 positive and 165 negative associations spanning 78 drugs and 12 different events for single drug–adverse event associations. For the drug–drug interactions, the reference set contains 466 positive and 466 negative associations spanning 333 drugs across 10 events.


Feasibility of detecting drug–adverse event associations

To demonstrate the feasibility of using free text–derived features for detecting drug–adverse event associations, we reproduce the well-known association between rofecoxib and myocardial infarction. Rofecoxib was taken off the market because of an increased risk of heart attack and stroke.19,20 We compute an association between rofecoxib and myocardial infarction, keeping track of the temporal order of the diagnosis of rheumatoid arthritis, exposure to the drug, and occurrence of the adverse event, as described in the Methods section. Using data up to 2005, we obtain an odds ratio (OR) of 1.31 (95% confidence interval (CI): 1.16–1.45) for the association, which agrees with previously reported results.19,20 In a previous study, we compared using clinical notes with using codes from the International Classification of Diseases, Ninth Revision (ICD-9), and found no association (OR: 1.71; 95% CI: 0.74–3.53) using the coded data.21 This is probably due to undercoding: to be counted as exposed, a patient must have a prior arthritis indication, and only approximately one-third of the patients meet that criterion.

Profiling drug–adverse event associations over time

Figure 3 shows the cumulative ORs and exposures over time based on the unadjusted associations for the 10 drugs in our reference set that have had an alert in the past decade. Using a threshold of 1.0 on the lower bound of the CI for the association, we would flag six of nine alerts earlier than the official date (we do not have enough data for one drug, troglitazone). By comparison, the propensity-adjusted method would catch three of the alerts early. The unadjusted associations can flag signals worth investigating, and the adjusted associations may reduce false alarms.

Performance of detecting adverse drug events

Figure 1 shows the adjusted ORs and 95% CIs for the 28 true-positive associations from our single drug–adverse event reference set. As expected, the results show some variation across the adverse events.10 Figure 2a shows the overall performance for detecting associations between a single drug and its adverse event, with an area under the receiver operating characteristic curve (AUC) of 75.3% (unadjusted) and 80.4% (adjusted). A threshold of 1.0 (a commonly used cutoff) on the lower bound of the 95% CI of the adjusted ORs translates to 39% sensitivity and 97.5% specificity. Choosing a signaling threshold defined by a minimum specificity of 90% on the receiver operating characteristic curve yields a cutoff of 1.18 (unadjusted) and 0.84 (adjusted) on the lower bound of the 95% CI. Supplementary Data S1 online lists all adjusted results, and Supplementary Data S2 online lists the AUC threshold data.
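The threshold selection described above can be reproduced mechanically; the sketch below (Python, with placeholder scores and illustrative function names, not the study's analysis code) estimates the AUC by ranking and picks the smallest cutoff on the lower CI bound that keeps specificity at or above 90%.

```python
# Illustrative sketch (placeholder scores, not the study data): given the lower
# bound of the 95% CI of the OR for each reference pair and its label
# (1 = known association, 0 = negative control), estimate the AUC by ranking
# and choose the smallest signaling cutoff that keeps specificity >= 90%.
def auc_by_ranking(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cutoff_at_min_specificity(scores, labels, min_spec=0.90):
    neg = [s for s, y in zip(scores, labels) if y == 0]
    for t in sorted(set(scores)):                 # smallest candidate first
        specificity = sum(n < t for n in neg) / len(neg)
        if specificity >= min_spec:
            return t                              # flag pairs with score >= t
    return None

scores = [1.4, 0.6, 1.1, 0.9, 2.3, 0.7, 1.3, 0.5, 0.8, 1.0]   # toy values
labels = [1,   0,   1,   0,   1,   0,   1,   0,   0,   0]
print(auc_by_ranking(scores, labels), cutoff_at_min_specificity(scores, labels))
```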

Performance of detecting adverse drug–drug interactions

Figure 2b shows the performance (AUC of 81.5%) for detecting known adverse events arising from drug–drug interactions. Adjusting the associations for potential confounding improves the signal detection capability (red curves in Figure 2b).22 In the drug–drug interaction scenario, we do not constrain by drug indications because of combinatorial complexity. We obtain 52% sensitivity at 91% specificity, using 1.0 as a threshold on the lower bound of the CI for the adjusted associations.

Figure 1 panels (number of exposed patients in parentheses): Rofecoxib – mi (5,294); Celecoxib – mi (9,132); Valdecoxib – mi (1,722); Rosiglitazone – mi (1,401); Medroxyprogesterone – mi (4,828); Levonorgestrel – mi (4,208); Sibutramine – mi (304); Raloxifene – vt (1,137); Clozapine – vt (222); Anastrozole – vt (2,523); Drospirenone – vt (64); Levonorgestrel – vt (4,214); Cabergoline – cvf (52); Pergolide – cvf (114); Phentermine – cvf (341); Ticlopidine – aa (19); Allopurinol – aa (1,173); Valproic acid – aa (1,882); Phenobarbital – aa (1,285); Clopidogrel – aa (1,899); Fludarabine – pml (1,243); Rituximab – pml (3,148); Alemtuzumab – pml (318); Cerivastatin – rhabd (109); Troglitazone – arf (85); Cisapride – qt (447); Pioglitazone – ubc (1,578).

Figure 1  Adjusted odds ratios (ORs) for positive cases in the single drug–adverse event set. Results show some variability by event. The 28 positive cases include the following events: myocardial infarction (mi), rhabdomyolysis (rhabd), cardiovascular fibrosis (cvf), acute renal failure (arf), QT prolongation (qt), urinary bladder cancer (ubc), progressive multifocal leukoencephalopathy (pml), aplastic anemia (aa), and venous thrombosis (vt). Some associations are off the scale, and we indicate the OR in parentheses above the line (one exception, natalizumab – pml (232), is not shown at all because of the extreme scale: OR: 79.5; 95% CI: 30.8–270.4). We also include the number of exposed patients in parentheses for each drug–adverse event pair. Typically, a signal occurs when the lower bound of the confidence interval exceeds 1.0; however, this threshold may have different optimal settings on the basis of the event.


Figure 2 panels: (a) drug–event, AUCs of 75.3% (unadjusted) and 80.4% (adjusted); (b) drug–drug–event, AUCs of 74.8% (unadjusted) and 81.5% (adjusted). Axes: sensitivity (true positive rate) vs. 1 – specificity (false positive rate).

Figure 2  Performance of adverse drug reaction and drug–drug interaction detection. Overall performance is measured using areas under the receiver operating characteristic curve (AUCs). (a) The unadjusted (blue) vs. adjusted (red) methods yield AUCs of 75.3% and 80.4%, respectively. (b) For drug–drug interactions, the adjusted method (red) reaches an AUC of 81.5%.

Figure 3 panels (drug – event): Rofecoxib – mi; Celecoxib – mi; Valdecoxib – mi; Rosiglitazone – mi; Sibutramine – mi; Pioglitazone – ubc; Pergolide – cvf; Cisapride – qt; Cerivastatin – rhabd.

Figure 3  Cumulative (unadjusted) odds and exposure plots for 10 positive cases involving US Food and Drug Administration (FDA) intervention. Signals are flagged earlier than official alerts in six of nine cases (troglitazone excluded for lack of sufficient exposure). The solid red line is the odds ratio (OR), and the dotted red lines are the confidence intervals (CIs). The solid blue line is the exposure rate. The shaded area marks the period for which FDA intervention applies (e.g., withdrawal). The point estimate marks the earliest year and OR when the lower bound of the 95% CI is above a threshold of 1.0, i.e., when the unadjusted method would flag the drug for monitoring. As more data accumulate and exposure increases, patterns often converge toward more confident signals. cvf, cardiovascular fibrosis; mi, myocardial infarction; qt, QT prolongation; rhabd, rhabdomyolysis; ubc, urinary bladder cancer.


Estimating the prevalence of adverse events

Population-level prevalence data for adverse events are hard to come by. For single drugs, sources such as the Side Effect Resource provide information on the frequency of specific adverse events from the drug product label. No comparable resource exists for adverse events arising from drug–drug interactions.

While performing the drug–adverse event association calculations using data from a clinical data warehouse, we can in parallel estimate the prevalence of adverse events associated with drug–drug interactions. For example, we found that 42.8% (176 of 411) of patients on both levodopa and lorazepam experience parkinsonian symptoms, 19.8% (140 of 707) of patients on paclitaxel and trastuzumab experience neutropenia, and 17.8% (796 of 4,467) of patients on amiodarone and metoprolol experience bradycardia.
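These prevalence figures are simple proportions among co-exposed patients; a minimal sketch of the calculation follows (illustrative function name; counts taken from the amiodarone and metoprolol example above, and assuming, as in the 2 × 2 assignment described in the Methods, that the event is counted only when it follows exposure to both drugs).

```python
# Minimal sketch (illustrative name; counts from the amiodarone + metoprolol
# example in the text). The event is assumed to be counted only when its first
# mention follows exposure to both drugs.
def ddi_event_prevalence(n_event_after_both, n_exposed_to_both):
    return n_event_after_both / n_exposed_to_both

print(round(100 * ddi_event_prevalence(796, 4467), 1))   # 17.8 (%)
```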

DISCUSSION

We have demonstrated that adverse drug events as well as adverse events associated with drug–drug interactions can be detected using a deidentified patient–feature matrix extracted from free-text clinical documents. Blumenthal and others5 envision a scenario in which a new drug comes to market and a nationwide learning system monitors for safety signals. Our results show that deidentified clinical notes can be used to generate drug safety signals—taking a step toward such a scenario. In addition, the patient–feature matrix provides prevalence data not available from other data sources (e.g., spontaneous reports). Having such prevalence information can assist in prioritizing actionable events and reducing alert fatigue.23

Our approach to processing clinical notes is simple in comparison with advanced natural language processing (NLP) systems that may have better accuracy in identifying nuanced attributions of disease conditions. We sacrifice some individual note-level accuracy in exchange for the ability to detect population-level trends in massive data sets. Our results, based on a reference set of known drug–event pairs, show that when exposures are numerous enough, relatively simple text mining combined with standard association-strength tests can work for signal detection, reflecting the adage in the machine-learning community that “a dumb algorithm with lots of data beats a clever one with modest amounts of it.”24,25 When used in combination with other data sets, clinical notes may address cases that otherwise pass undetected. We sacrifice sensitivity for specificity because, for a new approach and a new data source (clinical notes), keeping false-discovery rates low is important, particularly in the initial stages of establishing feasibility.

We find that ontologies are an excellent source of features and allow systematic normalization and aggregation when the feature set needs reduction.15,26 For example, we can count all patients who experience cardiac arrhythmias as patients with arrhythmias because of the hierarchical relationships. Therefore, ontology hierarchies can organize a very large number of terms into a smaller feature set. Moreover, because names, dates, and locations are not present in the clinical terminologies, they are not extracted as features by dictionary-based methods.18,27

We believe that the information embedded in text is crucial for leveraging EHR data,10,13,14 particularly for rare events, for which large amounts of data are needed. Our annotation-based approach produces a feature matrix that complements other structured data such as codes from the ICD-9.

Of note, our methods are not dependent on any particular NLP tool (we contrast MGREP and UNITEX in the Methods section), and we expect results to improve with the availability of better and faster clinical NLP tools.28,29 We are currently collaborating with researchers at the Mayo Clinic to improve the speed of the clinical Text Analysis and Knowledge Extraction System,29 one of the state-of-the-art NLP tools available for clinical text. Broader availability of curated clinical NLP data sets and health outcome definitions would accelerate research and validation.

Our work has several limitations and opportunities for improvement. Not all conditions are equally identifiable from text using lexical approaches (Supplementary Data S3 online reports validation results by condition); advanced NLP tools would improve accuracy in these cases. Biases in our reference set—although it is among the largest used for such a study—affect our performance estimation. A new reference standard covering four events has recently been released by the Observational Medical Outcomes Partnership,4 and we are currently evaluating its utility. Some adverse drug events are dose dependent, and our methods currently ignore this information; the UNITEX tool, described in the Methods section, includes libraries for dosage extraction and is thus a logical next step. We do not distinguish between new users of drugs and existing or chronic users, and our methods have a limited ability to define eras (durations of medication and illness). We are currently examining the annotation data for the utility of the last mention of a concept, sentence-level co-occurrences, and temporal density of mentions to address this question. The majority of our findings are based on the Stanford Hospital and Clinics, a tertiary-care center representing a skewed population; at the same time, this population has added utility for investigating rare events. Variations in signaling thresholds can also occur as a result of the prevalence or rarity of an event,10 and more research is needed to adapt detection algorithms accordingly. The prevalence data estimated in studies such as ours are an important step in this direction.10 Finally, we note that the Observational Medical Outcomes Partnership group suggests that no single method works best uniformly, that different methods be considered for each event and data source, and that profiling performance via receiver operating characteristic curves assists in understanding the utility of a method or data source.4

To conclude, our method extracts from textual clinical notes a deidentified patient–feature matrix encoded using standardized medical terminologies. We have demonstrated the use of the resulting patient–feature matrix as a substrate for detecting single drug–adverse event associations (AUC of 80.4%) and for detecting adverse events associated with drug–drug interactions (AUC of 81.5%), illustrating that clinical notes can be a source for detecting drug safety signals at scale.15 The patient–feature matrix can also be used to learn off-label usage30 and to discern drug adverse events from indications.31 Using the textual contents of the EHR complements efforts using billing and claims data or spontaneous reports4,8,14,32,33 and opens up new opportunities for leveraging observational data.

Figure 4 panel (A), drug–event patterns (2 × 2 cell: reason): a: event encountered after taking drug; b: drug taken but no event encountered; c: event encountered without taking drug; d: no drug taken and event not encountered. Panel (B), drug–drug–event patterns: a: drugs possibly interact, followed by adverse event; b: drugs possibly interact, but adverse event not encountered; c: event encountered without taking either drug, or single drug precedes adverse event; d: no drug taken and event not encountered, or single drug taken and no event encountered. Panel (C) is the 2 × 2 contingency table keyed by exposure (+/−) and outcome (+/−).

Figure 4  Assignment of patients to 2 × 2 contingency tables. Patients are assigned to cells a, b, c, and d of a 2 × 2 contingency table (C) on the basis of the patterns shown in parts (A) and (B). In the patterns, indications are abbreviated with “I”, drugs with “D”, and outcomes or events with “E.” A patient exposed to the drug is counted in cells “a” or “b” depending on whether the outcome occurs after the drug exposure, based on temporal ordering of first mentions of the I, D, and E. Other patients (i.e., unexposed) are placed in the bottom row of the 2 × 2 contingency table in cells “c” or “d” depending on whether the outcome occurred in the observation duration after the indication. Therefore, for example, an indication followed by a drug and then an event would go into the “a” cell. An indication followed by no drug mention but having an occurrence of the event would go into cell “c.” For drug–drug interactions, we do not restrict the assignment on the basis of the indications. Therefore, patients with mentions of both drugs (in either order) before an event would go into the “a” cell.

METHODS

Data sources. Our primary data source was the Stanford Translational Research Integrated Database Environment,34 which spans 18 years of patient data from 1.8 million patients; it contains 19 million encounters, 35 million coded ICD-9 diagnoses, and >11 million unstructured clinical notes, which are a combination of pathology, radiology, and transcription reports. The gender split is ~60% female; the average age is 44 years with an SD of 25.

The reference standard. We created reference standards of known drug–adverse event associations for testing the performance of our methods in detecting drug safety signals from text. Supplementary Data S4 online lists the single drug–event reference set. For the single-drug adverse events, our reference set included 12 distinct events worth monitoring35 and 78 distinct drugs, with 28 positive cases and 165 negative cases. We started with a validation set from the European Union adverse drug event project (EU-ADR)36 and added to that set 10 drug safety signals that involved US Food and Drug Administration intervention in the past decade, manually curating these from the literature and cross-referencing them with the agency's website. We established our false-discovery rate by generating a set of negative associations: we created all combinations of drugs and events and subtracted any known associations identified by any one of the EU-ADR filtering workflows,37 the Medi-Span (Wolters Kluwer Health, Indianapolis, IN) Adverse Drug Effects Database, or the Side Effect Resource database.38 For the two-drug case, known drug–drug interactions were extracted (and manually validated) from textual monographs in DrugBank and the Medi-Span Drug Therapy Monitoring System. In this case, we simulated the negative set by associating drug pairs with a randomly chosen event, removing any cases already known to be associated on the basis of external knowledge (DrugBank, Medi-Span, Drugs.com, the Unified Medical Language System (UMLS), or the Side Effect Resource). This reference set included 10 distinct events, 333 distinct drugs, 466 positive cases, and 466 negative cases.

Testing for drug safety signals. We followed a two-step process for detecting drug safety signals: first, we computed a raw association in the form of an unadjusted OR, followed by adjustment for potential confounders. The first step is useful for flagging putative signals, and the second step is useful for reducing false alarms.

In the first step, we computed unadjusted ORs and 95% CIs by constructing a 2 × 2 contingency table26,33 from the patient–feature matrix. On the basis of the first mentions of drug, event, and indication and their temporal order, we assigned patients to specific cells of the 2 × 2 contingency table as shown in Figure 4 (see also Supplementary Data S5 online). The temporal information in the patient–feature matrix is critical for determining whether the event follows exposure.39 Patients having no mention of the indication at any time are excluded from the analysis (see Supplementary Data S6 online regarding the excluded patients). Using data following the indication, and not counting repeat mentions, the ordering of the drug and event determined into which cell of the 2 × 2 table the patient fell. Because all unexposed patients have the indication, they could be on an alternative drug or other treatment, or none at all.

In the second step, we adjusted for confounding by specific patient factors. We included age, gender, race, and comorbidity and coprescription frequency (as a surrogate for overall health status) in calculating the propensity score.9 The propensity score quantified the likelihood of a patient being exposed to a drug. Patients with known indications were matched (exposed vs. unexposed) via the propensity score. Finally, we included the propensity score as a covariate in logistic regression to compute adjusted ORs and 95% CIs using the coefficients of the regression model. We used the Matching and Survival packages in R.40 For single drug–event associations, we identified the indications of the drug using the Medi-Span Drug Indications Database and the National Drug File–Reference Terminology.

In the drug–drug interaction scenario, the key idea is to determine whether the association of the event with the combination of the two drugs outweighs any association of the event with either one of the drugs alone (or none at all). Including the indications adds a degree of combinatorial complexity, so we focused primarily on the temporal order of the two drugs and the event (Figure 4b) without restricting by the indications of the drugs.
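To make the unadjusted step concrete, the sketch below (a minimal illustration with made-up counts and hypothetical function names, not the study's code) assigns one patient to a 2 × 2 cell from the dates of first mentions and computes an OR with a Woolf-type 95% CI.

```python
from math import exp, log, sqrt

def assign_cell(indication, drug, event):
    """Assign one patient to cell 'a', 'b', 'c', or 'd' from the dates of the
    first mentions (None if never mentioned). Patients without the indication
    are excluded upstream, so `indication` is always a date here."""
    exposed = drug is not None and drug >= indication
    if exposed:
        return "a" if (event is not None and event >= drug) else "b"
    return "c" if (event is not None and event >= indication) else "d"

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted OR with a Woolf (log-scale) 95% CI from 2 x 2 counts."""
    or_ = (a * d) / (b * c)
    se = sqrt(1/a + 1/b + 1/c + 1/d)
    return or_, exp(log(or_) - z*se), exp(log(or_) + z*se)

# A drug-event pair is flagged when the lower CI bound exceeds the chosen
# threshold (1.0 in the simplest setting). The counts below are illustrative.
or_, lo, hi = odds_ratio_ci(a=120, b=880, c=400, d=4600)
print(f"OR={or_:.2f} (95% CI {lo:.2f}-{hi:.2f}), signal={lo > 1.0}")
```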

Generating the patient–feature matrix. Our annotator workflow, described previously,21,30 uses ~5.6 million strings from existing terminologies; filters the strings to keep unambiguous terms that are predominantly noun phrases representing drugs, diseases, devices, or procedures; uses the cleaned-up lexicon for term recognition in the clinical notes to tag or annotate41 the text; excludes negated terms and terms that apply to family or medical history;42 normalizes all terms using the ontology hierarchies; and finally uses the time stamps of the notes to produce a deidentified, temporally ordered patient–feature matrix. The process is summarized in Figure 5, and the individual steps are detailed below.
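As a preview of the end product of this workflow, the following minimal sketch (hypothetical field names, not the production schema) rolls per-note annotations up into first positive, present mentions per patient, ordered by note time stamp.

```python
from collections import defaultdict

# Minimal sketch (hypothetical field names, not the production schema): roll
# per-note annotations up into a deidentified patient-feature matrix of first
# positive, present mentions, temporally ordered by note time stamp.
def build_feature_matrix(annotations):
    first_mention = defaultdict(dict)            # patient_id -> concept -> date
    for ann in annotations:
        if ann["negated"] or ann["family_or_history"]:
            continue                             # keep positive, present mentions
        seen = first_mention[ann["patient_id"]]
        concept = ann["concept"]                 # already normalized, e.g. "rofecoxib"
        if concept not in seen or ann["note_date"] < seen[concept]:
            seen[concept] = ann["note_date"]
    # Each patient maps to a temporally ordered list of (date, concept) features.
    return {pid: sorted((d, c) for c, d in concepts.items())
            for pid, concepts in first_mention.items()}
```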


Figure 5  Generation of the patient–feature matrix. The workflow (1) starts by downloading ~5.6 million strings for every term in ontologies from both the Unified Medical Language System (UMLS) and BioPortal, as well as all trigger terms from NegEx and ConText; (2) uses term frequency and syntactic type information (e.g., predominant noun phrases) from MEDLINE to prune the set of strings into a clean lexicon; (3) applies the lexicon directly against the textual notes using exact string matching; (4) applies NegEx and ConText rules to filter negation and family history contexts; (5) applies UMLS Metathesaurus and BioPortal mappings and semantic type information to normalize terms into concepts that are grouped by drug, disease, device, or procedure; and (6) results finally in the patient–feature matrix. Each row of the matrix represents a single note that is linked to a single patient, and the time stamps of the notes induce a temporal ordering over the entire patient–feature matrix.

Using biomedical ontologies for text annotation. We use existing ontologies as a source of (i) a lexicon of strings that are grouped together and linked to over a million concepts via synonymy (referred to as mappings) and (ii) a hierarchy of >14 million parent–child relationships among those concepts. We use the lexicon to recognize terms in the input text using a tool called MGREP,41 which also tracks the relative position at which each term occurs (Figures 5 and 6). In addition to clinical terms, we include in our lexicon terms corresponding to contextual cues, called “triggers,” based on the ConText system.42 Cues such as “denies,” “no sign of,” and “father has a history of” are used in a postprocessing step to identify terms that are negated or that apply to family or medical history. Terms that correspond to mentions in these contexts are ignored—thus, the subsequent analysis relies on positive, present mentions of concepts. The resulting annotations for the Stanford Translational Research Integrated Database Environment data set comprise ~3.75 billion records. It takes 1 hour to generate annotations from 3 million documents using a single computer workstation and ~2 hours to postprocess the data. MGREP can be substituted with other NLP tools: one such tool we have tested is UNITEX,43 which offers advanced functionality such as regular expressions for drug doses and morpheme-based matching at the cost of an additional 10–20% processing time.

Cleaning the lexicon. Motivated by previous work on identifying and removing noninformative terms,44,45 we apply a series of suppression rules that fall into two categories: syntactic and semantic. We keep terms that are predominantly noun phrases46 based on an analysis of over 20 million MEDLINE abstracts; we remove uninformative phrases based on term frequency analysis of >50 million clinical documents from the Mayo Clinic;47 and we suppress terms having fewer than four characters by default because the majority of these tend to be ambiguous abbreviations. Finally, using frequency-based sorting, we manually identify ambiguous terms that belong to more than one semantic group (drug, disease, device, and procedure),47,48 and we suppress their least likely interpretation. For example, “clip” is more likely to be a device than a drug in clinical text, so we suppress its interpretation as “corticotropin-like intermediate lobe peptide” even though clip is listed as a synonym of that concept.
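As an illustration of the trigger-based postprocessing, the deliberately simplified sketch below discards a mention when a negation or history trigger appears within a fixed word window before it; the trigger lists and the six-word window are illustrative and far cruder than the published NegEx/ConText rules.

```python
# Highly simplified sketch in the spirit of NegEx/ConText trigger filtering:
# a term mention is discarded when a negation or family-history trigger occurs
# within a small word window before it. Trigger lists and window size are
# illustrative, not the published rule set.
NEGATION_TRIGGERS = ["no sign of", "denies", "negative for", "without"]
HISTORY_TRIGGERS = ["family history of", "father has a history of",
                    "mother has a history of"]

def keep_mention(text, term, window=6):
    text = text.lower()
    idx = text.find(term.lower())
    if idx == -1:
        return False
    preceding = " ".join(text[:idx].split()[-window:])
    return not any(t in preceding for t in NEGATION_TRIGGERS + HISTORY_TRIGGERS)

print(keep_mention("Patient denies chest pain or dyspnea.", "chest pain"))            # False
print(keep_mention("ECG consistent with myocardial infarction.", "myocardial infarction"))  # True
```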

Normalizing terms in the patient–feature matrix. Drug prescriptions are identified via the text processing and normalized into active ingredients using relationships from RxNorm (e.g., “tradename_of”). Therefore, “rofecoxib 12.5 mg oral tablet” and “Vioxx” are both normalized to the active ingredient rofecoxib. In addition, we map ingredients to the Anatomical Therapeutic Chemical Classification System, which enables four levels of aggregation; i.e., rofecoxib, celecoxib, and valdecoxib are all cyclooxygenase-2 inhibitors, which are nonsteroidal anti-inflammatory drugs, and so on.

Although drug normalization is fairly straightforward, diseases, devices, and procedures present a challenge. In what we call the two-hop method (Figure 7), we use a query-driven approach to normalize disease, device, and procedure concepts. We start with definitions from the EU-ADR project's specifications and MedDRA standardized query definitions: for example, for myocardial infarction, we would start with the ICD-9 code 410 (acute myocardial infarction) and 18 different UMLS concept unique identifiers, including C0027051 (myocardial infarction), C0340324 (silent myocardial infarction), and C0155626 (acute myocardial infarction). Starting with these “seed” concepts, we utilize mappings across ontologies and the hierarchical parent–child relationships to expand to subsumed entities. Supplementary Data S7 online lists all seed queries and their full expansions.

We first precompute the transitive closure over all parent–child hierarchies and index it such that we can retrieve all ancestors or all descendants of a given concept. Second, the mappings among synonymous terms form an equivalence class, to which we assign a unique identifier (similar to the UMLS Metathesaurus concept unique identifiers). Using these two resources, given concepts of interest as a seed query (for example, the 18 concepts for myocardial infarction), we use the mappings to find all canonical identifiers (first hop) and then use the transitive closure to include all subsumed concepts in the query. Next, we repeat the process once more with this expanded set of concepts (second hop). For myocardial infarction, the expansion process yields 470 unique strings. In principle, recursion with a least fixed-point semantics would apply; however, recursion does not work well in practice because of differing abstraction levels among ontologies, which induce cycles. We have found that two hops achieve an adequate balance between soundness and completeness for the current use case.
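A compact sketch of the two-hop expansion follows, assuming two precomputed inputs that stand in for the indexes described above (a canonical map from any concept ID to its cross-ontology equivalence-class ID, and a descendants index holding the transitive closure of parent–child links); the data structures and sample IDs are illustrative.

```python
# Sketch of the two-hop expansion (illustrative data structures, not the
# production index): `canonical` maps any concept ID to its cross-ontology
# equivalence-class ID, and `descendants` holds the precomputed transitive
# closure of parent-child links for each canonical ID.
def two_hop_expand(seed_concepts, canonical, descendants):
    def one_hop(concepts):
        expanded = set()
        for c in concepts:
            canon = canonical.get(c, c)                    # map to canonical ID
            expanded.add(canon)
            expanded.update(descendants.get(canon, ()))    # add all subsumed concepts
        return expanded
    # Exactly two iterations: unrestricted recursion tends to drift across
    # ontology abstraction levels and can cycle.
    return one_hop(one_hop(set(seed_concepts)))

# Toy usage with myocardial infarction seeds mentioned in the text.
canonical = {"ICD9:410": "C0027051"}
descendants = {"C0027051": {"C0155626", "C0340324"}}
print(two_hop_expand({"ICD9:410"}, canonical, descendants))
```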

Figure 6  Sample annotations. (a) A discharge summary is encoded internally using (b) a highly compressed, numerical representation. The strings in parentheses are keyed to the first column of numbers and are included merely for illustration purposes. (c) The annotations keep track of relative positional information and are so rich, owing to the vast lexicon, that if we reconstruct the note, very little of the useful information is lost (notice the section headers). The blank areas in the reconstruction represent terms that are not recognized, and terms highlighted in red denote ones that will not be attributed to the present patient because of contextual cues (e.g., family history and negated findings). CABG, coronary artery bypass graft; COPD, chronic obstructive pulmonary disease; CT, computed tomography.

Figure 7  Two-hop query expansion. The algorithm takes a set of concepts C (solid red) and derives all subconcepts C′ (all red) in each ontology O and then repeats the process only once more for all derived concepts C′ (solid blue) to obtain C″ (all red and blue). Because concepts are mapped across ontologies, the process traverses simultaneously all ontologies that contain C (and C′), thereby “hopping” across ontologies twice. In this illustration, C″ captures two more concepts from the adjacent ontology O2 that would have otherwise been missed with a single iteration.

Recognizing events and exposures. By combining the above procedures (seeding queries using established definitions, normalizing and aggregating terms, and using only positive, present mentions; see Supplementary Data S8 online), we are able to recognize events and exposures with enough accuracy for the drug safety use case. We determine the accuracy of the event identification using a gold-standard corpus from the 2008 i2b2 Obesity Challenge.49 This corpus has been manually annotated by two annotators for 16 conditions and was designed to evaluate the ability of NLP systems to identify a condition present for a patient given a textual note. We extended this corpus by manually annotating each of the events listed in Figures 1 and 3 (see Supplementary Data S3 online). Using the set of terms corresponding to the definition of the event of interest (see Supplementary Data S7 online) and the set of terms recognized by our annotation workflow in the i2b2 notes, we evaluate the sensitivity and specificity of identifying each of the events (see Supplementary Data S3 online). Overall, our event identification has 74% sensitivity and 96% specificity. Accuracy varies by condition: for example, myocardial infarction has 63% sensitivity and 94% specificity, whereas gallstones have 15% sensitivity and 99% specificity. Drug recognition is done in a similar manner using strings from RxNorm; an independent study at the University of Pittsburgh, which manually examined the annotations on 1,960 clinical notes,50 estimated over 84% recall and 84% precision for recognizing drugs (R. Boyce, personal communication).

Ordering the features. We use the time stamps for each note to induce a temporal ordering over the recognized concepts on a per-patient basis. We focus on first mentions of concepts and do not use exposure windows or eras. We keep positive, present mentions and ignore negated mentions and family and medical history mentions identified via trigger terms. Therefore, for every patient, the feature matrix contains a temporally ordered list of the drugs, diseases, devices, and procedures mentioned in their medical record.
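The per-condition accuracy figures above follow from a straightforward sensitivity and specificity calculation against the manually annotated gold standard; a minimal sketch (illustrative data, not the evaluation script) is shown below.

```python
# Illustrative sketch (not the evaluation script): per-condition sensitivity and
# specificity of event recognition against a manually annotated gold standard
# such as the i2b2 Obesity Challenge corpus extended with the study's events.
def sensitivity_specificity(gold, predicted):
    """gold, predicted: dicts mapping note_id -> True/False for one condition."""
    tp = sum(gold[n] and predicted.get(n, False) for n in gold)
    fn = sum(gold[n] and not predicted.get(n, False) for n in gold)
    tn = sum(not gold[n] and not predicted.get(n, False) for n in gold)
    fp = sum(not gold[n] and predicted.get(n, False) for n in gold)
    return tp / (tp + fn), tn / (tn + fp)

gold = {"note1": True, "note2": False, "note3": True, "note4": False}  # toy data
pred = {"note1": True, "note2": False, "note3": False, "note4": False}
print(sensitivity_specificity(gold, pred))   # (0.5, 1.0)
```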

SUPPLEMENTARY MATERIAL is linked to the online version of the paper at http://www.nature.com/cpt

ACKNOWLEDGMENTS
The authors acknowledge support from the National Institutes of Health grant U54-HG004028 for the National Center for Biomedical Ontology. N.H.S. also acknowledges support from NIH grant U54-LM008748. The authors thank Cédrick Fairon for assistance in evaluating UNITEX and Richard Boyce for evaluating drug accuracy.

AUTHOR CONTRIBUTIONS
P.L., A.B.-M., S.V.I., and N.H.S. wrote the manuscript. P.L., S.V.I., A.B.-M., and N.H.S. designed the research. P.L., S.V.I., A.B.-M., J.M.M., and N.H.S. performed the research. P.L., S.V.I., A.B.-M., R.H., T.P., and T.A.F. analyzed the data. P.L., S.V.I., A.B.-M., and N.H.S. contributed new reagents/analytical tools.

CONFLICT OF INTEREST
The authors declared no conflict of interest.

Study Highlights

WHAT IS THE CURRENT KNOWLEDGE ON THE TOPIC?
The current state of the art in drug safety surveillance relies either on databases of reported adverse events (such as the Adverse Event Reporting System) or on longitudinal observational data, primarily claims and billing, derived from coded EHR sources.

WHAT QUESTION DID THIS STUDY ADDRESS?
In this study, we demonstrate the feasibility of using large amounts of free-text notes as a substrate for performing pharmacovigilance after transforming clinical notes into a deidentified patient–feature matrix coded with standard medical terminologies.

WHAT THIS STUDY ADDS TO OUR KNOWLEDGE
We show that by using a large corpus, we can detect single drug–adverse event associations and adverse events associated with drug–drug interactions with high accuracy.

HOW THIS MIGHT CHANGE CLINICAL PHARMACOLOGY AND THERAPEUTICS
We argue that drug safety surveillance can be advanced by using this yet untapped data source of clinical notes, which comprise the majority of EHR data available.

© 2013 American Society for Clinical Pharmacology and Therapeutics

1. Classen, D.C. et al. ‘Global trigger tool’ shows that adverse events in hospitals may be ten times greater than previously measured. Health Aff. (Millwood) 30, 581–589 (2011).
2. Hug, B.L., Keohane, C., Seger, D.L., Yoon, C. & Bates, D.W. The costs of adverse drug events in community hospitals. Jt. Comm. J. Qual. Patient Saf. 38, 120–126 (2012).
3. Bushardt, R.L., Massey, E.B., Simpson, T.W., Ariail, J.C. & Simpson, K.N. Polypharmacy: misleading, but manageable. Clin. Interv. Aging 3, 383–389 (2008).
4. Stang, P.E. et al. Advancing the science for active surveillance: rationale and design for the Observational Medical Outcomes Partnership. Ann. Intern. Med. 153, 600–606 (2010).
5. Friedman, C.P., Wong, A.K. & Blumenthal, D. Achieving a nationwide learning health system. Sci. Transl. Med. 2 (57), 57cm29 (2010).
6. McClellan, M. Drug safety reform at the FDA–pendulum swing or systematic improvement? N. Engl. J. Med. 356, 1700–1702 (2007).
7. Avorn, J. & Schneeweiss, S. Managing drug-risk information–what to do with all those new numbers. N. Engl. J. Med. 361, 647–649 (2009).
8. Gagne, J.J. et al. Active safety monitoring of newly marketed medications in a distributed data network: application of a semi-automated monitoring system. Clin. Pharmacol. Ther. 92, 80–86 (2012).

9. Schneeweiss, S., Rassen, J.A., Glynn, R.J., Avorn, J., Mogun, H. & Brookhart, M.A. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology 20, 512–522 (2009).
10. Coloma, P.M. et al. Electronic healthcare databases for active drug safety surveillance: is there enough leverage? Pharmacoepidemiol. Drug Saf. 21, 611–621 (2012).
11. Melton, G.B. & Hripcsak, G. Automated detection of adverse events using natural language processing of discharge summaries. J. Am. Med. Inform. Assoc. 12, 448–457 (2005).
12. Wang, X., Hripcsak, G., Markatou, M. & Friedman, C. Active computerized pharmacovigilance using natural language processing, statistics, and electronic health records: a feasibility study. J. Am. Med. Inform. Assoc. 16, 328–337 (2009).
13. Hennessy, S. & Flockhart, D.A. The need for translational research on drug-drug interactions. Clin. Pharmacol. Ther. 91, 771–773 (2012).
14. Harpaz, R., DuMouchel, W., Shah, N.H., Madigan, D., Ryan, P. & Friedman, C. Novel data-mining methodologies for adverse drug event discovery and analysis. Clin. Pharmacol. Ther. 91, 1010–1021 (2012).
15. Nadkarni, P.M. Drug safety surveillance using de-identified EMR and claims data: issues and challenges. J. Am. Med. Inform. Assoc. 17, 671–674 (2010).
16. Chapman, W.W., Nadkarni, P.M., Hirschman, L., D’Avolio, L.W., Savova, G.K. & Uzuner, O. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. J. Am. Med. Inform. Assoc. 18, 540–543 (2011).
17. Radecki, R.P. & Sittig, D.F. Application of electronic health records to the Joint Commission’s 2011 National Patient Safety Goals. JAMA 306, 92–93 (2011).
18. Morrison, F.P., Li, L., Lai, A.M. & Hripcsak, G. Repurposing the clinical record: can an existing natural language processing system de-identify clinical notes? J. Am. Med. Inform. Assoc. 16, 37–39 (2009).
19. Graham, D.J. et al. Risk of acute myocardial infarction and sudden cardiac death in patients treated with cyclo-oxygenase 2 selective and non-selective non-steroidal anti-inflammatory drugs: nested case-control study. Lancet 365, 475–481 (2005).
20. Brownstein, J.S., Sordo, M., Kohane, I.S. & Mandl, K.D. The tell-tale heart: population-based surveillance reveals an association of rofecoxib and celecoxib with myocardial infarction. PLoS ONE 2, e840 (2007).
21. LePendu, P., Iyer, S.V., Fairon, C. & Shah, N.H. Annotation analysis for testing drug safety signals using unstructured clinical notes. J. Biomed. Semantics 3 (suppl. 1), S5 (2012).
22. Ryan, P.B., Madigan, D., Stang, P.E., Overhage, J.M., Racoosin, J.A. & Hartzema, A.G. Empirical assessment of methods for risk identification in healthcare data: results from the experiments of the Observational Medical Outcomes Partnership. Stat. Med. 31, 4401–4415 (2012).
23. Phansalkar, S. et al. Drug-drug interactions that should be non-interruptive in order to reduce alert fatigue in electronic health records. J. Am. Med. Inform. Assoc. (2012); e-pub ahead of print 25 September 2012.
24. Halevy, A., Norvig, P. & Pereira, F. The unreasonable effectiveness of data. IEEE Intell. Syst. 24 (2), 8–12 (2009).
25. Domingos, P. A few useful things to know about machine learning. Commun. ACM 55 (10), 78–87 (2012).
26. Bate, A. & Evans, S.J. Quantitative signal detection using spontaneous ADR reporting. Pharmacoepidemiol. Drug Saf. 18, 427–436 (2009).
27. Aronson, A.R. & Lang, F.M. An overview of MetaMap: historical perspective and recent advances. J. Am. Med. Inform. Assoc. 17, 229–236 (2010).
28. D’Avolio, L.W., Nguyen, T.M., Goryachev, S. & Fiore, L.D. Automated concept-level information extraction to reduce the need for custom software and rules development. J. Am. Med. Inform. Assoc. 18, 607–613 (2011).
29. Savova, G.K. et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J. Am. Med. Inform. Assoc. 17, 507–513 (2010).
30. LePendu, P., Liu, Y., Iyer, S., Udell, M. & Shah, N.H. Analyzing patterns of drug use in clinical notes for patient safety. AMIA Summit on Clinical Research Informatics, San Francisco, CA, 21–23 March 2012.
31. Liu, Y., LePendu, P., Iyer, S., Udell, M. & Shah, N.H. Using temporal patterns in medical records to discern adverse drug events from indications. AMIA Summit on Clinical Research Informatics, San Francisco, CA, 21–23 March 2012.
32. Schuemie, M.J. et al. Using electronic health care records for drug safety signal detection: a comparative evaluation of statistical methods. Med. Care 50, 890–897 (2012).
33. Hauben, M. & Bate, A. Decision support methods for the detection of adverse events in post-marketing data. Drug Discov. Today 14, 343–357 (2009).
34. Lowe, H.J., Ferris, T.A., Hernandez, P.M. & Weber, S.C. STRIDE–an integrated standards-based translational research informatics platform. AMIA Annu. Symp. Proc. 2009, 391–395 (2009).

35. Trifirò, G. et al.; EU-ADR group. Data mining on electronic health record databases for signal detection in pharmacovigilance: which events to monitor? Pharmacoepidemiol. Drug Saf. 18, 1176–1184 (2009).
36. OMOP. Detection of Long Term Adverse Drug Reactions in Electronic Health Data (2012).
37. Bauer-Mehren, A. et al. Automatic filtering and substantiation of drug safety signals. PLoS Comput. Biol. 8, e1002457 (2012).
38. Kuhn, M., Campillos, M., Letunic, I., Jensen, L.J. & Bork, P. A side effect resource to capture phenotypic effects of drugs. Mol. Syst. Biol. 6, 343 (2010).
39. Hanauer, D.A. & Ramakrishnan, N. Modeling temporal relationships in large scale clinical associations. J. Am. Med. Inform. Assoc. 20, 332–341 (2012).
40. Sekhon, J.S. Multivariate and propensity score matching software with automated balance optimization: the Matching package for R. J. Stat. Softw. 42 (7), 1–52 (2011).
41. Shah, N.H., Bhatia, N., Jonquet, C., Rubin, D., Chiang, A.P. & Musen, M.A. Comparison of concept recognizers for building the Open Biomedical Annotator. BMC Bioinformatics 10 (suppl. 9), S14 (2009).
42. Harkema, H., Dowling, J.N., Thornblade, T. & Chapman, W.W. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J. Biomed. Inform. 42, 839–851 (2009).
43. Paumier, S. De la reconnaissance de formes linguistiques à l’analyse syntaxique. Université de Marne-la-Vallée (2003).


44. Demner-Fushman, D., Mork, J.G., Shooshan, S.E. & Aronson, A.R. UMLS content views appropriate for NLP processing of the biomedical literature vs. clinical text. J. Biomed. Inform. 43, 587–594 (2010).
45. McCray, A.T., Bodenreider, O., Malley, J.D. & Browne, A.C. Evaluating UMLS strings for natural language processing. Proc. AMIA Symp., 448–452 (2001).
46. Xu, R., Musen, M.A. & Shah, N.H. A comprehensive analysis of five million UMLS metathesaurus terms using eighteen million MEDLINE citations. AMIA Annu. Symp. Proc. 2010, 907–911 (2010).
47. Wu, S.T. et al. UMLS term occurrences in clinical notes: a large scale corpus analysis. J. Am. Med. Inform. Assoc. 19, e149–e156 (2012).
48. Bodenreider, O. & McCray, A.T. Exploring semantic groups through visual approaches. J. Biomed. Inform. 36, 414–432 (2003).
49. Uzuner, O. Recognizing obesity and comorbidities in sparse data. J. Am. Med. Inform. Assoc. 16 (4), 561–570 (2009).
50. Marshall, M.S. et al. Emerging practices for mapping and linking life sciences data using RDF—a case series. Web Semantics: Science, Services and Agents on the World Wide Web 14, 2–13 (2012).

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/
