Mining Personalized Medicine Algorithms with Surrogate Algorithm Tags

Chih-Lin Chi1, Peter J. Kos2, Vincent A. Fusaro1, Rimma Pivovarov1, Prasad Patil1, Peter J. Tonellato1,2

1 Laboratory for Personalized Medicine, Center for Biomedical Informatics, Harvard Medical School
2 Laboratory for Personalized Medicine, School of Public Health, Univ. of WI-Milwaukee
{Chih-Lin_Chi, Peter_Kos, Vincent_Fusaro, Rimma_Pivovarov, Prasad_Patil, Peter_Tonellato}@hms.harvard.edu

ABSTRACT
This paper demonstrates a method to identify keyword strategies to facilitate the search for articles containing decision support and clinical algorithms represented in the text by complex unsearchable items such as decision tree figures, pseudo code, or mathematical formulae. We use a text mining approach to generate 'Surrogate Algorithm Tags' (SRATs), i.e., keyword combinations highly associated with articles containing the algorithms of interest. In this project, we obtain an initial SRAT set by analyzing abstracts of publications available in PubMed with known warfarin dosing algorithms, gradually refine the SRATs by iterative optimization to improve precision or recall of the search, and then apply cut-off thresholds to terminate the optimization process and obtain optimal SRATs.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search process; I.2.4 [Knowledge Representation Formalisms and Methods]: Representation languages; I.2.7 [Natural Language Processing]: Text analysis

General Terms
Algorithms, Design, Performance

Keywords
Keyword Search, Algorithms, Optimization
1. INTRODUCTION
Personalized medicine depends on algorithms that use individual and family medical history variables, lab values, private health information, and genetic data as inputs to produce predictions of clinical importance and use. These predictions include the accurate therapeutic dose of a drug (e.g., warfarin [1]); the risk of acquiring a disease (breast cancer [2]); the likelihood that a drug will treat a particular cancer type (HER2 overexpression [3]); and breast cancer subtype (Oncotype DX [4]). Typically, these algorithms are derived from a diverse collection of clinical trials conducted using different designs, cohorts, countries, and racial compositions [5], thus producing a sometimes large collection of algorithms, all formulated to produce an optimal clinical prediction for the trial conditions. The emergence of comparative effectiveness studies naturally leads to the collection and comparison of the strengths and weaknesses of these algorithms, their predictions, and the resulting clinical outcomes.

One of the most prolific examples of algorithms in personalized medicine is the prediction of a therapeutic dose of warfarin. Warfarin is the most widely used anticoagulant agent in the world, but clinical management of this drug is very difficult because of its narrow therapeutic window. If over-dosed, patients have an increased risk of bleeding; if under-dosed, patients have an increased risk of thrombosis. Dozens of warfarin dosing algorithms have been published to facilitate its clinical management and thereby reduce the risks associated with warfarin therapy. The warfarin dosing algorithm used at Llandough Hospital in the UK in the mid 1980s was the first algorithm published [6]. By 2000, most algorithms predicted dose based on the international normalized ratio (INR) (e.g., [7, 8]), and in 2003, quantitative analysis produced mathematical algorithms dependent on typical patient variables such as age, weight, and serum albumin [9]. By 2004, genetic influences had been identified, demonstrating the contribution to dosing variance of one (CYP2C9), two (VKORC1), and eventually three (CYP2C19) genes, along with a wide collection of other variables including age, sex, weight, smoking, deep vein thrombosis status, amiodarone use, previous INR, etc. (e.g., [1, 10, 11]). Recently, the International Warfarin Pharmacogenetics Consortium compared a pharmacogenetic and a non-genetic warfarin dosing algorithm [12]. However, dozens of other carefully developed and studied warfarin dosing algorithms were not included in the comparison. Furthermore, no complete catalogue of all existing warfarin dosing algorithms, or comparisons between them, has been compiled.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. IHI'10, November 11–12, 2010, Arlington, Virginia, USA. Copyright 2010 ACM 978-1-4503-0030-8/10/11...$10.00.
The first step in a systematic review of the strengths and weaknesses of the prediction algorithms is to correctly and completely identify all publications that contain algorithms. However, such a review is a tedious and complicated task for which no automated process has been developed. We conducted a manual search and review to identify publications containing warfarin dosing algorithms. The process required several individuals’ attention over a period of months, and a total of more than 80 person-hours to identify 34 relevant publications. Not unexpectedly, just months after the original review, we discovered new publications and previously undetected articles. Although no automation is likely to achieve perfect recall and precision, the judicious application of sound search strategies may aid in the initial stages of the search and should help update reviews of the literature.
An automated method that accurately detects algorithms available in the published corpora will simplify the process of collecting the algorithms and allow investigators to focus their attention on the intended comparison analysis. However, due to the complexity of the desired objects, no such method exists in the literature. We demonstrate a text-mining method to generate Surrogate Algorithm Tags (SRATs), keyword patterns that are highly associated with published algorithms, and use the SRATs to automatically identify publications containing algorithms. The SRAT method complements costly manual curation efforts. We use warfarin dosing algorithm detection to demonstrate the method.
2. BACKGROUND
Information retrieval (IR) techniques are successfully used in a generalized fashion to search articles [13], webpages [14], chemical structures [15], and images [16]. Google's and PubMed's search functionalities are successful examples of IR. In general, IR techniques identify highly informative 'targets of recognition' as surrogates or metadata for the actual search object, which may be a full-text document, image, or video. These surrogates exist in the form of keywords, entity vectors, or other elements that a) represent the actual object, b) have well-defined shortest-distance metrics, and c) can be manipulated and 'scored' by search algorithms. No such natural surrogate representation exists for 'algorithms' found in the biomedical literature. They are often published as complex and diverse objects such as tables [17], equations embedded in text [1], images [18], and figures or tables of decision rules [1]. Nor does an IR method exist to automatically search the complex collection of published algorithm objects. Consequently, the only viable approach to identify a targeted set of algorithms is to read through a judicious and often highly inaccurate selection of articles returned from a general abstract/metadata search engine (e.g., PubMed's) using keywords.

This IR problem shares a complication with other IR problems involving complex object searches. Most search strategies use abstracts of articles based on the premise that essential information in the full-text article will be closely represented in the abstract. However, in general, no algorithms appear in abstracts. Rather, algorithms, due to their complexity, appear only in the full text. Thus, surrogates must be found that strongly associate with the object in the body of the article rather than with the much more easily accessed concepts that appear in the abstract.
To address the above complications and difficulties, we propose a new approach that analyzes abstracts to identify keywords, and combinations of keywords, that are strongly associated with publications containing algorithms of interest. We refine the full collection of keywords and keyword combinations through a process of iterative optimization of the precision and recall objective functions. We call the resulting optimal collection of individual and combined keywords 'Surrogate Algorithm Tags' (SRATs).
3. METHODS
We demonstrate the method by focusing on the search for all published warfarin dosing algorithms. The method consists of 4 key steps: identify a collection of keywords with high association to warfarin algorithms; form the SRAT candidate set (SRATC) of keywords and keyword combinations; execute the candidate SRAT search, collect and review the returned publications, and compute the objective functions (precision or recall) for each SRAT candidate; and finally, optimize the SRAT over the space of all combinations of keywords to identify the minimal set that maximizes the objective functions:

precision = #(relevant ∩ retrieved) / #(retrieved)

recall = #(relevant ∩ retrieved) / #(relevant)

where relevant is the set of "gold-standard" articles and retrieved is the set of all returned articles.
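With the sets defined above, both metrics reduce to simple set arithmetic. The following minimal Python sketch illustrates this; the function and variable names are ours, not the paper's:

```python
def precision(relevant, retrieved):
    """Fraction of retrieved articles that are also relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

def recall(relevant, retrieved):
    """Fraction of relevant ("gold-standard") articles that were retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0

# Toy example: 4 gold-standard article IDs, a search returning 4 articles,
# 2 of which overlap the gold-standard set
gold = {"pmid1", "pmid2", "pmid3", "pmid4"}
hits = {"pmid1", "pmid2", "pmidX", "pmidY"}
# precision(gold, hits) == 0.5 and recall(gold, hits) == 0.5
```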
3.1 The Method of Surrogate Algorithm Tags

Input: Let S = Strain ∪ Stest, where S is the target set of articles, Strain is the training set of articles, and Stest is the testing set of articles. Define SRAT as a keyword N-gram consisting of a collection of 1 or more keyword combinations.

Output: SRATOPT, the SRAT that optimizes one or both objective functions.

Processing Steps:
Step 1: Let K be the set of n keywords with the highest frequency in Strain; Ki ∈ K, i = 1, ..., n.
Step 2: Initiate SRATC = ∅ and SRATOPT = ∅; let ST = {K1, K2, ..., Km}, m ≤ n, ST ⊆ K.
Step 3: Generate permutations to create the SRATC set: SRATC = {STj ∪ Ki | Ki ∈ K, STj ∈ ST, Ki ∉ STj}.
Step 4: For each candidate SRAT (SRATCz) in SRATC, search for PubMed articles and compute objective function values using Stest.
Step 5: Update ST and SRATOPT: ST = SRATCz* and SRATOPT = SRATOPT ∪ {SRATCz*}.
Step 6: Repeat Steps 3 to 5 until the threshold is met or the iteration fails.
Step 7: Return SRATOPT.

Figure 1: Surrogate Algorithm Tags Algorithm
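The greedy structure of Figure 1 can be sketched in a few lines of Python. This is an illustrative reimplementation under our own naming, not the authors' code: `search` stands in for a PubMed query and is supplied by the caller, and `objective` is a scoring function such as precision or recall.

```python
def optimize_srat(core_keywords, seed_tags, search, objective,
                  threshold, max_iters=10):
    """Hill-climbing sketch of the SRAT algorithm (Figure 1).

    core_keywords -- the high-frequency keyword set K
    seed_tags     -- initial ST: a list of keyword tuples
    search        -- maps a keyword tuple to a set of article IDs
                     (a hypothetical stand-in for a PubMed search)
    objective     -- scores a retrieved article set (precision or recall)
    threshold     -- cut-off value that terminates the iteration
    """
    st = [tuple(sorted(t)) for t in seed_tags]
    srat_opt = []
    for _ in range(max_iters):
        # Step 3: extend each ST_j by one keyword K_i it does not already contain
        candidates = {tuple(sorted(set(stj) | {ki}))
                      for stj in st for ki in core_keywords if ki not in stj}
        candidates -= set(st)
        if not candidates:
            break
        # Step 4: run the search and score every candidate
        scored = [(objective(search(c)), c) for c in candidates]
        best_score, best = max(scored)
        # Step 5: the winner seeds the next round and joins SRAT_OPT
        st = [best]
        srat_opt.append(best)
        # Step 6: terminate once the objective threshold is met
        if best_score >= threshold:
            break
    return srat_opt
```

Note that the actual method also uses domain knowledge to prune semantically correct but domain-ambiguous combinations in Step 3; that manual filter is omitted here for brevity.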
We define the set of "gold-standard" papers in this work as those articles that include a collection of specific warfarin algorithms that have significant evidence of use, support in the medical community, or derivation from rigorous clinical or cohort studies. The objective of the "gold-standard" set is to provide at least 10 articles whose abstract text has an association, perhaps even a syntactical relationship, with the algorithm appearing in the body of the paper. The "gold-standard" set of papers is used to form a high-frequency set of keywords (K) and to test whether a given keyword n-gram (SRAT) is predictive. In this project, the "gold-standard" paper collection was identified using publicly available search engines (i.e., PubMed).

Finding optimal SRATs from a collection of N keywords for a targeted list of articles ("gold-standard" papers) involves an NP-hard combinatorial optimization problem. Assuming there are 600,000 English words, identification of the optimal two-keyword SRAT candidates requires approximately 360 billion 2-word combinations, searches, and computations of the objective functions. To reduce the search space and computational complexity while still maintaining a robust method, we use domain knowledge to limit the candidate keyword set to words appearing most frequently in the set of "gold-standard" papers.

We summarize the SRAT method in Figure 1 and describe the steps below. The input to the method is the abstracts of a set of manually identified articles of interest S ("gold-standard" papers) that include well-documented warfarin dosing algorithms. We equally separate the "gold-standard" papers into training (Strain) and testing (Stest) subsets. The output is the optimal SRAT (SRATOPT), which is generated by the iterative optimization. For each n-gram keyword set (where n = 1, 2, 3, ...) we execute the algorithm twice, first using recall and then precision as the optimizing objective function.

Step 1: Compute the frequencies of words from abstracts in the Strain subset, and create a core keyword set K using the most frequent words. Ki represents one frequent word; n is the total number of frequent words.

Step 2: Initiate the candidate SRAT set (SRATC) and the optimal SRAT set (SRATOPT) as null. In addition, we select a subset of K as the initial ST based on domain knowledge, avoiding overly general words that are frequent in most scientific abstracts. For example, we exclude "patient" and "therapy" from the initial ST. A publication search using an overly general word may return a large number of publications and consequently yield a poor objective function value.

Step 3: Produce candidate permutations to generate the SRATC set from K and ST. The combinations of ST and K produce a large number of permutations for the SRATC set. Since we do not want to create a permutation with redundant words, a Ki cannot be selected if the STj already includes the same Ki. In this step, domain knowledge is used to screen out of the SRATC set those keyword combinations that are semantically correct but domain-ambiguous. For example, the two-word combination (warfarin, algorithm) is selected, but (algorithm, based) is removed from SRATC. In addition, domain knowledge helps to find those keyword combinations in the text that are most likely to have a syntactic relationship with the algorithm objects in the body of the article.
Step 4: Conduct an article search with each candidate SRAT element (SRATCz), and compute the resulting objective function value using the Stest subset.

Step 5: Choose the candidate SRAT element with the optimal objective function value (SRATCz*) for the next iterative optimization. Then update ST to SRATCz* as the base for producing the next SRATC set. In addition, include SRATCz* in SRATOPT.

Step 6: Determine if the cut-off threshold of the objective function is met; if not, repeat Steps 3 to 5. The threshold values were chosen to balance feasibility and validity. The value of precision or recall is influenced by the number of retrieved and relevant articles. We choose 0.9 as the threshold for recall to achieve high validity. Correspondingly, we choose 0.1 as the threshold for precision, together with a minimum number (100) of retrieved articles. In this project, 0.1 is a high-validity target for precision that may be reachable by iterative optimization, and we set 100 as the lowest number of retrieved articles to avoid an overly small number of retrieved (and relevant) articles and to ensure an appropriate calculation of precision.

Our iterative optimization is a variant of the "hill climbing" class of optimization algorithms. After each iteration, different keywords are added to the previous SRATCz* and the method is re-run if the objective function threshold is not met. Cut-off of the iteration is triggered when the value of the objective function rises above a defined threshold or the number of returned articles falls below a threshold.
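The two stopping rules can be stated compactly. The following sketch hard-codes the thresholds quoted above (0.9 for recall; 0.1 for precision with a 100-article floor); the function name is ours:

```python
def cutoff_reached(objective_value, n_retrieved, mode):
    """Cut-off rules terminating the iterative SRAT optimization."""
    if mode == "recall":
        # High-validity target: stop once recall reaches 0.9
        return objective_value >= 0.9
    if mode == "precision":
        # Stop at precision 0.1, or when too few articles (< 100) are
        # returned for a stable precision estimate
        return objective_value >= 0.1 or n_retrieved < 100
    raise ValueError("mode must be 'recall' or 'precision'")
```

For instance, an iteration that returns only 64 articles triggers the precision cut-off even if precision itself is still below 0.1.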
4. RESULTS We identified 30 “gold-standard” papers (including [1, 6, 7, 8, 9, 10, 11]), and randomly assigned 15 each to Strain and Stest. We computed frequencies of words in abstracts from Strain using the on-line text analysis tool textalyser.net. We then reduced the total keyword set (K) to the top 14 frequent words: {warfarin, dose, patients, CYP, INR, dosing, genotype, therapeutic, doses, surgery, algorithm, VKORC, based, and therapy}. We selected a subset of K as ST: ST = {warfarin, algorithm, dose, inr, genotype, CYP2C9}. We use the SRAT algorithm to find SRATs with the optimal recall (Section 4.1) and precision (Section 4.2) and use these optimal SRATs to identify newly published warfarin algorithms (Section 4.1.1) and previously unidentified algorithms (Section 4.2.1).
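The frequency ranking above was produced with textalyser.net; an equivalent computation can be sketched locally in Python. This is a stand-in for the tool actually used, and the stopword list is an illustrative placeholder:

```python
import re
from collections import Counter

# Illustrative stopword list; a real run would use a fuller set
STOPWORDS = frozenset({"the", "of", "and", "in", "a", "to", "with",
                       "for", "was", "were", "is", "on", "by"})

def top_keywords(abstracts, n=14):
    """Return the n most frequent words across a list of abstract strings."""
    counts = Counter()
    for text in abstracts:
        counts.update(w for w in re.findall(r"[a-z0-9]+", text.lower())
                      if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]
```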
4.1 Recall as the objective function
In the first iteration, the 2-word permutations in the SRATC set are {(warfarin, algorithm), (warfarin, dose), (warfarin, INR), (warfarin, genotype), (warfarin, CYP2C9)}. We then executed a PubMed search with each candidate and calculated the recall values {0.6, 0.93, 0.6, 0.2, 0.33} (Figure 2). In this case, the cut-off threshold (recall ≥ 0.9) is achieved in the first iteration and SRATOPT = {(warfarin, dose)}.
4.1.1 Search newly published algorithms
New articles appear regularly. Consequently, we extended the method to conduct an incremental search to capture recently published algorithms. We used the 2-gram keyword combination (warfarin, dose) to search for newly published articles. Most of our "gold-standard" papers were published before 2010, so we used this keyword combination to find the 20 most recent articles with full-text access (many of them still in press). We read the resulting articles and found that 10 of them contain algorithms.
4.2 Precision as the objective function
In the first iteration using precision as the objective function, the SRATC set is the same as the one used in Section 4.1. The precision of each 2-word SRATC element is {0.0398, 0.0052, 0.0037, 0.0072, 0.0095} (Figure 2), and the optimal SRATCz* is (warfarin, algorithm). Thus, we change ST to {(warfarin, algorithm)} and include (warfarin, algorithm) in SRATOPT. In the next iteration, the SRATC set is {(warfarin, algorithm, dose), (warfarin, algorithm, INR)} with precision 0.08 and 0.0526, respectively (Figure 2). The optimal SRATCz* is (warfarin, algorithm, dose), but the precision has not reached the threshold. At the end of this iteration, the SRATOPT set is {(warfarin, algorithm), (warfarin, algorithm, dose)}. The third iteration produces the SRATC set {(warfarin, algorithm, dose, INR)} with precision 0.094 (Figure 2). The cut-off threshold is triggered because the number of returned articles (64) is too small. Finally, the algorithm returns the SRATOPT set {(warfarin, algorithm), (warfarin, algorithm, dose), (warfarin, algorithm, dose, INR)}. The element (warfarin, algorithm, dose, INR), which has the highest precision, is selected to find previously unidentified articles.
4.2.1 Search previously unidentified algorithms
To identify older articles, we used (warfarin, algorithm, dose, INR) to find algorithms not included in the "gold-standard" set. Executing a PubMed search using these keywords yielded 64 articles. After excluding articles already in the "gold-standard" set and articles without full-text access, 33 articles remained for this experiment. We randomly selected 14 of the 33 articles and found 2 previously unidentified algorithm papers: one dosing algorithm and one dose-adjustment algorithm.
5. CONCLUSION
Identification of algorithms in the literature is tedious work, but the task is very important in personalized medicine and comparative effectiveness research. Although information retrieval has been successfully used to search text, images, and videos, finding algorithms in publications still largely depends on human effort. We propose the SRAT approach as a strategy to identify keyword combinations highly associated with publications containing algorithms, and thereby improve and facilitate algorithm search. N-gram keyword combinations are generated using a training set of "gold-standard" papers drawn from a diverse set of warfarin publications, each containing a warfarin algorithm. Our algorithm is applicable to those personalized medicine applications where one anticipates a robust collection and history of publications with diverse collections of algorithms. In these cases, one can formulate a "gold-standard" set of papers and apply the SRAT algorithm. Frequent keywords obtained from abstracts can help domain experts generate keyword combinations to search articles.
The SRAT algorithm may be further improved in several ways. First, the selection of SRAT candidates could be improved by trying all combinations of core keywords. The combinations of keywords are currently chosen based on domain knowledge; although this speeds up the process of identifying optimal SRATs, we may be excluding SRATs with potentially high objective function values. Second, the method may be improved by using different sets of core keywords (e.g., produced from domain knowledge instead of from word frequency) to generate different SRATs. Third, we may consider more diverse keyword strategies. We did not consider specific search fields such as title and MeSH; instead, every term was applied to all fields. Fourth, in this project, we used AND to connect all terms; other operators (OR and NOT) may provide higher precision or recall. Our next steps include an automatic approach to forming the candidate SRATs, the use of other operators to formulate SRATs, and exploration of the influence of stemming on the optimization. In addition, we will apply the SRAT method to other personalized-medicine examples (e.g., breast cancer risk algorithms). Our preliminary review has identified 39 risk prediction and 6 gene carrier prediction algorithms. As in the warfarin situation, personalized medicine in breast cancer risk is crucial and will benefit from the interrogation of new algorithms.
6. ACKNOWLEDGEMENTS The project is funded by NIH grant R01LM010130. We thank Drs. Ethan Munson and Hong Yu for valuable comments.
7. REFERENCES
[1] J. Anderson, B. Horne, S. Stevens, A. Grove, S. Barton, Z. Nicholas, S. Kahn, H. May, K. Samuelson, J. Muhlestein, J. Carlquist, and C.-G. Investigators, "Randomized trial of genotype-guided versus standard warfarin dosing in patients initiating oral anticoagulation," Circulation, vol. 116, no. 22, pp. 2563–2570, 2007.
[2] S. Wacholder, P. Hartge, R. Prentice, M. Garcia-Closas, H. Feigelson, W. Diver, M. Thun, D. Cox, S. Hankinson, P. Kraft, B. Rosner, C. Berg, L. Brinton, J. Lissowska, M. Sherman, R. Chlebowski, C. Kooperberg, R. Jackson, D. Buckman, P. Hui, R. Pfeiffer, K. Jacobs, G. Thomas, R. Hoover, M. Gail, S. Chanock, and D. Hunter, "Performance of common genetic variants in breast-cancer risk models," The New England Journal of Medicine, vol. 362, no. 11, pp. 986–993, 2010.
[3] A. Garber and S. Tunis, "Does comparative-effectiveness research threaten personalized medicine?" The New England Journal of Medicine, vol. 360, no. 19, pp. 1925–1927, 2009.
[4] M. Cronin, C. Sangli, M.-L. Liu, M. Pho, D. Dutta, A. Nguyen, J. Jeong, J. Wu, K. Langone, and D. Watson, "Analytical validation of the Oncotype DX genomic diagnostic test for recurrence prognosis and therapeutic response prediction in node-negative, estrogen receptor-positive breast cancer," Clinical Chemistry, vol. 53, no. 6, pp. 1084–1091, 2007.
[5] Y. Caraco, S. Blotnick, and M. Muszkat, "CYP2C9 genotype-guided warfarin prescribing enhances the efficacy and safety of anticoagulation: A prospective randomized controlled study," Clinical Pharmacology and Therapeutics, vol. 83, no. 3, pp. 460–470, 2008.
[6] A. Fennerty, J. Dolben, P. Thomas, G. Backhouse, D. Bentley, I. Campbell, and P. Routledge, "Flexible induction dose regimen for warfarin and prediction of maintenance dose," British Medical Journal, vol. 288, no. 28, pp. 1268–1270, 1984.
[7] M. Cooper and T. Hendra, "Prospective evaluation of a modified Fennerty regimen for anticoagulating elderly people," Age and Ageing, vol. 27, no. 5, pp. 655–656, 1998.
[8] M. Schooff, "Initiating warfarin therapy," Journal of Family Practice, vol. 48, no. 4, pp. 250–251, 1999.
[9] D. Shine, J. Patel, J. Kumar, A. Malik, J. Jaeger, M. Maida, L. Ord, and G. Burrows, "A randomized trial of initial warfarin dosing based on simple clinical criteria," Thromb Haemost, vol. 89, no. 2, pp. 297–304, 2003.
[10] B. Gage, C. Eby, P. Milligan, G. Banet, J. Duncan, and H. McLeod, "Use of pharmacogenetics and clinical factors to predict the maintenance dose of warfarin," Thromb Haemost, vol. 91, no. 1, pp. 87–94, 2004.
[11] E. Millican, P. Lenzini, P. Milligan, L. Grosso, C. Eby, E. Deych, G. Grice, J. Clohisy, R. Barrack, R. Burnett, D. Voora, S. Gatchel, A. Tiemeier, and B. Gage, "Genetic-based dosing in orthopedic patients beginning warfarin therapy," Blood, vol. 110, no. 5, pp. 1511–1515, 2007.
[12] The International Warfarin Pharmacogenetics Consortium, "Estimation of the warfarin dose with clinical and pharmacogenetic data," The New England Journal of Medicine, vol. 360, no. 8, pp. 753–764, 2009.
[13] PubMed Overview. Accessed May 2010: http://www.ncbi.nlm.nih.gov/entrez/query/static/overview.html.
[14] A. Langville and C. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, 2006.
[15] J. Rhodes, S. Boyer, J. Kreulen, Y. Chen, and P. Ordonez, "Mining patents using molecular similarity search," in Pacific Symposium on Biocomputing, Maui, Hawaii, 2007, pp. 304–315.
[16] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman, "Learning object categories from Google image search," in 10th International Conference on Computer Vision, Beijing, China, 2005, pp. 1816–1823.
[17] Y. Zhu, M. Shennan, K. Reynolds, N. Johnson, M. Herrnberger, R. V. Jr, and M. Linder, "Estimation of warfarin maintenance dose based on VKORC1 and CYP2C9 genotypes," Clinical Chemistry, vol. 53, pp. 1199–1205, 2007.
[18] T. Wilkinson and R. Sainsbury, "Evaluation of a warfarin initiation protocol for older people," Internal Medicine Journal, vol. 26, pp. 465–467, 2003.
Figure 2: Summary of precision versus recall of all keyword combinations (plotted series: warfarin algorithm; warfarin dose; warfarin inr; warfarin genotype; warfarin cyp2c9; warfarin dose algorithm; warfarin inr algorithm; warfarin dose inr algorithm). When we refine keyword combinations by optimization, the keyword combinations become more specific and the precision increases because we are filtering out more unrelated articles. However, a few related articles are also filtered out, leading to a decrease in recall.