Current HIV Research, 2015, 13, 497-502

Sequence-Based Predictive Models of Resistance to HIV-1 Integrase Inhibitors: An n-Grams Approach to Phenotype Assessment

Majid Masso*

Laboratory for Structural Bioinformatics, School of Systems Biology, George Mason University, 10900 University Blvd. MS 5B3, Manassas, VA 20110, USA

Abstract: Amino acid substitutions in HIV-1 proteins critical to the viral replication cycle have the potential to undermine successful inhibition of those targets, with some mutations leading to either reduced susceptibility to certain medications or complete drug resistance. Phenotypic tests are best suited to quantify the effects of complex mutational patterns on drug resistance; however, the relatively high cost and long turnaround time associated with phenotyping have increased the demand for in silico drug-specific models capable of accurately predicting phenotype directly from the target protein sequences. The focus of this study is on the HIV-1 integrase (IN) enzyme, which mediates integration of reverse transcribed viral DNA into the host cell genome, and the development of predictive statistical learning models of resistance to the IN inhibitors Raltegravir (RAL) and Elvitegravir (EVG). Models were trained using datasets of IN protein sequence variants, each having a known phenotype, quantified as the fold change in susceptibility to the respective inhibitor and obtained using an experimental assay. A sequence-based approach employing n-gram relative frequencies was implemented to uniquely characterize each IN variant as a feature vector of input attributes. Models for classifying IN variants as susceptible or resistant reach cross-validation balanced accuracy rates of 89% with RAL and 85% with EVG. Additionally, regression models achieve Pearson’s correlation coefficients, between experimental and predicted log-transformed phenotypic fold change values, as high as r = 0.80 with RAL and r = 0.76 with EVG. Our results suggest that as additional training data are made publicly available, the models may hold promise as supplementary tools for making treatment decisions.

Keywords: Drug resistance, genotype-phenotype correlations, HIV-1 integrase, n-grams, regression, statistical learning models, supervised classification.

*Address correspondence to this author at the Laboratory for Structural Bioinformatics, School of Systems Biology, George Mason University, 10900 University Blvd. MS 5B3, Manassas, VA 20110, USA; Tel: 703-2575756; Fax: 703-993-8401; E-mail: [email protected]

INTRODUCTION

Among HIV-1 infected patients, sequence diversity in viral proteins essential for replication arises from random and treatment-selected mutations, as well as from mutations selected for by innate and adaptive immune pressures [1]. In certain cases, such amino acid replacements can lead to a significant reduction in susceptibility to medications designed to inhibit these targets. A number of medications with non-overlapping resistance patterns are currently available for inhibiting both the HIV-1 protease and reverse transcriptase enzymes, providing patients with alternatives in the search for effective treatments [1]. In recent years, the US Food and Drug Administration (FDA) has also approved two drugs to inhibit HIV-1 integrase (IN), namely Raltegravir (RAL) and Elvitegravir (EVG) [2]. The IN enzyme, a protein whose primary sequence is 288 amino acid residues in length, catalyzes integration of the reverse transcribed viral DNA into the human host cell genome [3]. As such, IN represents an essential component within the HIV-1 replication cycle. Amino acid residue replacements in the IN protein sequence have the potential to alter the degree of

susceptibility of the IN variant target to each integrase inhibitor (INI) in distinct ways, with specific residue mutations resulting in drug cross-resistance [1, 3]. Published studies have independently reported on the results of phenotypic assays that quantify the degree to which certain in vivo (clinic/patient) and in vitro (laboratory) strains of IN variants are resistant to either RAL or EVG. These data are compiled and routinely maintained on the Stanford University HIV Drug Resistance Database [4]. Genotypic testing, which reports only the sequence of the IN variant, is by contrast significantly less expensive and has a much shorter turnaround time.

Here, we applied the publicly available phenotypic data toward the development of IN sequence-based predictive models of drug resistance. In particular, our dataset for training predictive models of resistance to RAL (resp. EVG) consisted of 212 (resp. 141) distinct IN variant proteins, each having an experimentally quantified fold-change (FC) phenotype value (i.e., degree of susceptibility to the respective INI). Every IN variant amino acid residue sequence in each INI dataset was represented as a feature vector, whose components were the relative frequencies of occurrence, within that dataset, of its constituent n-grams (i.e., subsequences of consecutive residues from each IN variant sequence generated using a sliding window of size n). Subsequent to impressive results in the area of statistical language-independent analysis of text [5], n-grams were successfully applied to analyzing proteins in terms of function prediction [6]; secondary structure prediction [7];

© 2015 Bentham Science Publishers


Current HIV Research, 2015, Vol. 13, No. 6


motif, family, and superfamily classification [8-10]; and identification of domain-domain interactions [11], among others. Here the RAL and EVG datasets of IN variants with known phenotypes, whose sequences were represented as feature vectors through an application of n-grams, were used to train predictive models of drug resistance. The models were developed by implementing two supervised classification and two regression statistical learning algorithms, and cross-validation studies were performed in order to gauge model performance. This study builds upon prior success with n-grams in developing accurate predictive models of both HIV-1 co-receptor usage [12, 13] and HIV-1 drug resistance to eight protease and eleven reverse transcriptase inhibitors [14].

MATERIALS AND METHODS

Datasets

The RAL (resp. EVG) dataset consists of 212 (resp. 141) distinct IN variant sequences (i.e., translated IN genotypes) with published phenotypes, corresponding to laboratory [3, 15-18] as well as clinical [3, 16] HIV-1 isolates, which were downloaded from the Stanford Database [4]. Susceptibility of each IN variant to either RAL or EVG was reported as a fold change (FC) value obtained using the PhenoSense assay (Monogram Biosciences, South San Francisco, CA) [19]. The FC value is defined as the ratio of the 50% inhibitory concentration (IC50) for the IN variant protein relative to that for a drug-sensitive, wild type IN control. For both INI datasets, IN variants with phenotypes FC ≤ 2.5 were categorized as drug-susceptible (S), and those for which FC > 2.5 were labeled resistant (R), to develop predictive classification models (Fig. 1).
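As a minimal sketch of the labeling rule just described (not the author's code), the fold change and the R/S category can be computed as follows. The IC50 numbers and the names `fold_change` and `label_from_fc` are hypothetical and serve only as illustration; the 2.5 cutoff is the one stated in the text.

```python
FC_CUTOFF = 2.5  # cutoff stated in the text for both RAL and EVG


def fold_change(ic50_variant: float, ic50_wild_type: float) -> float:
    """FC = IC50 of the IN variant divided by IC50 of the wild-type control."""
    return ic50_variant / ic50_wild_type


def label_from_fc(fc: float, cutoff: float = FC_CUTOFF) -> str:
    """Return 'S' (susceptible) when FC <= cutoff, otherwise 'R' (resistant)."""
    return "S" if fc <= cutoff else "R"


# Hypothetical IC50 values (arbitrary units), for illustration only.
wild_type_ic50 = 10.0
variant_ic50 = {"variant_a": 12.0, "variant_b": 25.0, "variant_c": 400.0}

labels = {
    name: label_from_fc(fold_change(ic50, wild_type_ic50))
    for name, ic50 in variant_ic50.items()
}
```

Note that a variant sitting exactly at FC = 2.5 falls in the susceptible class, because the text defines S as FC ≤ 2.5.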
The FC = 2.5 cutoff, applied to IN variants with respect to both the RAL and EVG inhibitors, was selected in an effort to simultaneously satisfy three conditions: to remain relatively consistent with biological cutoffs documented by the assay manufacturer; to minimize the number of IN variants that have FC values within a two-fold interval centered at the cutoff (i.e., the reproducibility variability based on validation studies [20]), thereby reducing the number of inadvertent misclassifications caused by a strict binary cutoff; and to ensure that a reasonable number of IN variants populate each phenotype category. All FC values were log-transformed prior to analysis.

n-Grams Representation of IN Variant Sequences

The following procedure was performed separately on each dataset of IN variant sequences. For the purposes of this discussion, we focus on the RAL dataset; the EVG dataset was identically processed. First, a sliding window of size n was used on each of the 212 IN variant sequences in the RAL dataset to identify all subsequences of n consecutive amino acid residues. Though there are 20^n different n-grams based on the use of a 20-letter protein alphabet, a majority of these clearly are not observed in the dataset. Next, the relative frequency of occurrence for each distinct type of n-gram identified in the dataset was calculated to be the total number of times that the n-gram appears in the entire dataset (i.e., its absolute frequency) divided by the total number of

Fig. (1). Distributions of RAL and EVG datasets of HIV-1 integrase (IN) variant sequences. The RAL (resp. EVG) dataset contains IN variant sequences, each with a known fold change value (FC, degree of resistance) relative to the IN inhibitor raltegravir (resp. elvitegravir). S = drug susceptible, R = drug resistant.

all n-grams generated by all the IN variant sequences in the entire dataset. Finally, these relative frequencies were used in turn as the components for feature vectors to represent the IN variant sequences in the dataset. With n = 4, for example, the sliding window yields 285 ordered n-grams for each of the 212 IN variant sequences in the RAL dataset (each IN sequence contains 288 residues). Once all n-gram relative frequencies were calculated for this dataset, each IN variant was then represented as an ordered vector of 285 relative frequencies that correspond to its ordered set of n-grams. For this study, we explored the impact of sliding window size on model prediction performance by representing the IN variant sequences based on n-grams of sizes n = 2, 3, and 4. Since the number of examples (i.e., IN variant sequences) in each dataset is significantly less than the number of features (i.e., relative frequencies of the ordered n-grams), a biomedically-inspired feature reduction was implemented. There are 14 IN sequence positions at which specific residue replacements known to have an impact on RAL susceptibility have been well documented: 66, 74, 92, 97, 121, 138, 140, 143, 147, 148, 153, 155, 157, and 263; with the exception of 74, certain mutations at these positions are similarly implicated in the development of EVG resistance [21, 22]. Therefore, each IN variant feature vector was

Predicting Resistance to HIV-1 Integrase Inhibitors

reduced in dimensionality by retaining only the relative frequencies corresponding to n-grams that included these drug-resistance positions. In the RAL dataset, with n = 4 for example, the first 62 components were removed from each IN variant feature vector, components 63-66 were retained, 67-70 were removed, 71-74 were retained, 75-88 were removed, and so on; likewise with the IN variant feature vectors comprising the EVG dataset in the case of n = 4, except that components 71-74 were additionally removed. Hence for n = 4, IN variant sequences in the RAL and EVG datasets were ultimately represented as 46- and 42-dimensional feature vectors of input attributes (i.e., the independent variables), respectively, with the drug-specific phenotype (either a categorical label for classification models, or a numerical value for regression models) of each IN variant defining the single output attribute (i.e., the dependent variable).

Statistical Learning and Performance

Predictive models were trained by implementing two supervised classification (random forest, RF [23]; support vector machine, SVM [24]) and two regression (reduced-error pruned tree, REPTree [25]; support vector regression, SVR [26]) statistical learning algorithms with the Weka software package [25, 27]. Though distinct algorithmic approaches underlie each of these methods, they all share a common goal: to learn a complex nonlinear function from the available training data (i.e., the trained model). An accurate model can subsequently be used to make a reliable drug susceptibility (i.e., output attribute or dependent variable) prediction for any new IN variant given only the relative frequency values of the components obtained for its feature vector representation using n-grams (i.e., values of the input attributes or independent variables).
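The n-gram representation and the position-based feature reduction described under "n-Grams Representation of IN Variant Sequences" can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the author's implementation: the function names are hypothetical, and assigning a frequency of 0.0 to an n-gram never seen in the training dataset is an assumption the text does not address. The position lists are those given in the text.

```python
from collections import Counter

# Resistance-associated IN positions listed in the text (1-based).
RAL_POSITIONS = [66, 74, 92, 97, 121, 138, 140, 143, 147, 148, 153, 155, 157, 263]
EVG_POSITIONS = [p for p in RAL_POSITIONS if p != 74]  # position 74 excluded for EVG


def ngrams(sequence, n):
    """Ordered n-grams produced by a sliding window of size n."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def relative_frequencies(sequences, n):
    """Dataset-wide relative frequency of each distinct n-gram."""
    counts = Counter(g for seq in sequences for g in ngrams(seq, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def retained_components(positions, n, seq_len=288):
    """1-based indices of n-gram components that cover at least one position."""
    last_start = seq_len - n + 1
    keep = set()
    for pos in positions:
        for start in range(pos - n + 1, pos + 1):
            if 1 <= start <= last_start:
                keep.add(start)
    return sorted(keep)


def feature_vector(sequence, rel_freq, n, positions):
    """Reduced feature vector for one sequence.

    Assumption: an n-gram absent from the training dataset gets frequency 0.0.
    """
    grams = ngrams(sequence, n)
    keep = retained_components(positions, n, len(sequence))
    return [rel_freq.get(grams[i - 1], 0.0) for i in keep]


ral_keep = retained_components(RAL_POSITIONS, 4)
evg_keep = retained_components(EVG_POSITIONS, 4)
print(len(ral_keep), len(evg_keep))  # 46 42
```

For n = 4, overlapping windows around neighboring positions (for example 138, 140 and 143) share components, which is why 14 positions yield 46 rather than 56 retained components for RAL.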
Relevant nondefault software parameters that were selected include 100 trees for RF; a radial basis function (RBF) kernel for SVM and SVR; and 20 bagged (i.e., bootstrap aggregated) iterations for REPTree. For each IN variant, the trained RF and SVM models predict a phenotype category, either resistant (R) or susceptible (S) with respect to a particular inhibitor, while REPTree and SVR models predict a numerical log-transformed FC value. Model performance was reported using a leave-one-out cross-validation (LOOCV) procedure, whereby each IN variant in any given dataset (RAL or EVG, n-grams of size n = 2, 3, or 4) was predicted by a distinct model trained using all the remaining variants in that dataset. With classification, drug susceptibility was designated positive (R, resistant) or negative (S, susceptible), so that TP (TN) represents the total number of correct R (S) predictions, while FN (FP) correspond to the total number of respective misclassifications. Using this notation, we computed Se(R) = sensitivity = TP / (TP + FN), Sp(R) = specificity = TN / (TN + FP), and PPV(R) = positive predictive value = TP / (TP + FP). The following quantities were also reported, given they are particularly informative when the classes are highly unbalanced: the balanced accuracy rate (BAR), given by BAR = 0.5 × [Se(R) + Sp(R)];


Matthews correlation coefficient (MCC), calculated as

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}};$$

and area (AUC) under the receiver operating characteristic (ROC) curve, which ranges from AUC ~ 0.5 (random guessing) to AUC = 1.0 (perfect classifier). With regression, we reported Pearson’s correlation coefficient (r) between actual and predicted log-transformed FC values for IN sequence variants, mean squared error (mse), and balanced accuracy (BAR) associated with converting these actual and predicted values to R/S categories using the FC = 2.5 cutoff.

RESULTS AND DISCUSSION

Classification

Tables 1 and 2 summarize the performance of both RF and SVM classifiers for predicting RAL and EVG drug resistance, respectively, based on LOOCV. Taking into account the tabulated results for all six of the evaluation metrics, models trained using n-grams datasets based on increasing values of n generally performed better. Owing to the relatively smaller sizes of the RAL and EVG datasets, these results are comparable to, albeit slightly lower than, those obtained from each of 19 distinct n-grams models that we previously developed for predicting drug resistance to eight protease and eleven reverse transcriptase inhibitors

Table 1. RAL classification performance.

| Method | Se(R) | Sp(R) | PPV(R) | BAR | MCC | AUC |
|---|---|---|---|---|---|---|
| RF 2-grams | 0.97 | 0.77 | 0.96 | 0.87 | 0.77 | 0.97 |
| RF 3-grams | 0.98 | 0.80 | 0.96 | 0.89 | 0.82 | 0.97 |
| RF 4-grams | 0.98 | 0.80 | 0.96 | 0.89 | 0.81 | 0.95 |
| SVM 2-grams | 0.96 | 0.51 | 0.91 | 0.74 | 0.55 | 0.81 |
| SVM 3-grams | 0.98 | 0.80 | 0.96 | 0.89 | 0.82 | 0.93 |
| SVM 4-grams | 0.98 | 0.80 | 0.96 | 0.89 | 0.81 | 0.93 |

Table 2. EVG classification performance.

| Method | Se(R) | Sp(R) | PPV(R) | BAR | MCC | AUC |
|---|---|---|---|---|---|---|
| RF 2-grams | 0.86 | 0.83 | 0.96 | 0.85 | 0.59 | 0.91 |
| RF 3-grams | 0.88 | 0.78 | 0.95 | 0.83 | 0.59 | 0.93 |
| RF 4-grams | 0.84 | 0.87 | 0.97 | 0.85 | 0.59 | 0.91 |
| SVM 2-grams | 0.82 | 0.83 | 0.96 | 0.82 | 0.53 | 0.85 |
| SVM 3-grams | 0.89 | 0.74 | 0.95 | 0.81 | 0.57 | 0.75 |
| SVM 4-grams | 0.85 | 0.78 | 0.95 | 0.82 | 0.53 | 0.85 |


[14]. Similarly, the performance measures in Tables 1 and 2 are in range of, but consistently lower than, those obtained with n-grams models that we previously developed for predicting HIV-1 co-receptor usage based on a much larger dataset of V3 loop sequences [12, 13].

Shown in Fig. (2A) are ROC curves that correspond specifically to the AUC values for the SVM classifiers trained using the original RAL n-grams datasets in Table 1, as well as control ROC curves obtained by initially performing a random shuffling of the R/S class labels associated with the 212 IN variant sequences in these datasets prior to implementing the LOOCV procedure. Similarly, Fig. (2B) displays ROC curves that correspond specifically to the AUC values for the RF classifiers trained using the original EVG n-grams datasets in Table 2, as well as control ROC curves obtained by initially performing a random shuffling of the R/S class labels associated with the 141 IN variant sequences in these datasets prior to LOOCV. Fig. (2) clearly depicts the degree to which the n-grams relative frequency signals are capable of discriminating between IN variant sequences that belong to their respective R/S categories, since control datasets based on initial random shuffling of the R/S class labels among the IN variants yield models that perform no better than random guessing (i.e., ROC curves near or below the diagonal, with AUC values in the vicinity of 0.5). The calculated BAR and MCC values for the random control datasets based on LOOCV further support this conclusion: RAL dataset, SVM classifiers (2-grams: BAR = 0.53, MCC = 0.08; 3-grams: BAR = 0.52, MCC = 0.05; 4-grams: BAR = 0.47, MCC = -0.09); EVG dataset, RF classifiers (2-grams: BAR = 0.57, MCC = 0.11; 3-grams: BAR = 0.52, MCC = 0.03; 4-grams: BAR = 0.44, MCC = -0.09).
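The evaluation metrics defined under "Statistical Learning and Performance" and the label-shuffling control described above can be sketched as follows. This is a sketch under loud assumptions: a 1-nearest-neighbour classifier stands in for the RF and SVM models (which were trained in Weka), the toy feature vectors are hypothetical, and the "+1" correction in `empirical_p_value` is a common convention rather than something the text specifies.

```python
import math
import random


def classification_metrics(actual, predicted, positive="R"):
    """Se(R), Sp(R), PPV(R), BAR and MCC as defined in the text."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    se = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Se": se, "Sp": sp, "PPV": ppv, "BAR": 0.5 * (se + sp), "MCC": mcc}


def loocv_predictions(features, labels):
    """Leave-one-out predictions from a 1-nearest-neighbour stand-in classifier."""
    predictions = []
    for i, x in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        nearest = min(others, key=lambda j: math.dist(x, features[j]))
        predictions.append(labels[nearest])
    return predictions


def shuffled_control_metrics(features, labels, seed):
    """LOOCV metrics after randomly shuffling the R/S labels among the variants."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return classification_metrics(shuffled, loocv_predictions(features, shuffled))


def empirical_p_value(observed, control_values):
    """Share of controls at least as good as the observed value (+1 correction)."""
    at_least = sum(v >= observed for v in control_values)
    return (at_least + 1) / (len(control_values) + 1)


# Toy one-dimensional feature vectors (hypothetical, not from the RAL/EVG datasets).
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = ["S", "S", "S", "R", "R", "R"]

observed = classification_metrics(y, loocv_predictions(X, y))
control_bars = [shuffled_control_metrics(X, y, seed)["BAR"] for seed in range(200)]
p_value = empirical_p_value(observed["BAR"], control_bars)
```

Repeating the shuffle with different seeds yields a distribution of control BAR (or MCC) values; when none of 1000 controls reaches the observed value, the corrected estimate is 1/1001, which is below 0.001.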
Similar observations were also made with respect to the RF classifiers trained with the RAL n-grams datasets, as well as the SVM classifiers trained with the EVG n-grams datasets (results not shown). For a more comprehensive approach that reveals statistical significance of results given in Table 1 (resp. Table 2), we generated 1000 control datasets via random class label shuffling among the 212 (resp. 141) IN variant sequences in the original RAL 3-gram (resp. EVG 2-gram) dataset, assessing SVM (resp. RF) LOOCV performance in each case. Presented in Fig. (3A) (resp. Fig. (3B)) are permutation distributions of the corresponding BAR (0.50 ± 0.03) and MCC (0.00 ± 0.09) data (resp. BAR (0.50 ± 0.07) and MCC (0.00 ± 0.11) data), both centered around values indicative of random guessing and significantly lower than comparable values obtained by using the actual IN variant class labels in the original RAL 3-gram (resp. EVG 2-gram) dataset (Table 1: BAR = 0.89 and MCC = 0.82; resp. Table 2: BAR = 0.85 and MCC = 0.59); hence, the p-value for predictive power is less than 0.001 in each case. We obtained comparable results by similarly using all other RAL and EVG n-grams datasets and implementing RF and SVM to assess LOOCV performance (results not shown).

Regression

Fig. (2). Receiver operating characteristic (ROC) curves obtained using the original and control RAL and EVG datasets. The original RAL (resp. EVG) dataset contains HIV-1 integrase (IN) sequences whose susceptibility phenotypes (i.e., R/S class labels) are known with respect to the IN inhibitor raltegravir (resp. elvitegravir). Three forms of the RAL (resp. EVG) dataset are explored, based on whether n-grams of size n = 2, 3, or 4 are used in representing the IN sequences. Control RAL (resp. EVG) datasets are obtained by randomly shuffling the known R/S labels among the IN sequences in the original RAL (resp. EVG) datasets. ROC curves are based on leave-one-out cross-validation (LOOCV) and implementation of either (A) the SVM algorithm with the RAL datasets or (B) the RF algorithm with the EVG datasets. The area (AUC) under the ROC curves associated with original RAL and EVG datasets range from 0.81 to 0.93 as shown in Tables 1 and 2, respectively, where AUC = 1 suggests a perfect classifier. On the other hand, the class shuffled RAL and EVG control datasets all yield ROC curves with AUC values closer to 0.5, indicative of models that perform no better than random guessing.

Unlike the RF and SVM supervised classification models, which predict IN variant sequences to be either susceptible or resistant to RAL or EVG, the trained REPTree and SVR regression models numerically predict RAL or


EVG phenotypes for IN variants. We employed the same RAL and EVG n-grams datasets of IN variants and replaced the R/S class labels corresponding to those sequences with their original log-transformed FC values. The LOOCV results appear in Table 3, reported using Pearson’s correlation coefficient (r) between actual and predicted log10(FC) values as well as mean squared error (mse). Similar to what we observed in the case of classification, these n-grams regression performance results for models developed to predict the level of RAL and EVG drug resistance are comparable to, though slightly lower than, those obtained with n-grams regression models we previously developed for predicting the levels of drug resistance to eight protease and eleven reverse transcriptase inhibitors [14].
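The two regression measures named above can be computed as in the following minimal sketch. The FC values and predictions are hypothetical placeholders, not data from the RAL or EVG datasets, and the function names are chosen only for illustration.

```python
import math


def pearson_r(actual, predicted):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical fold-change values and hypothetical LOOCV predictions.
actual_fc = [1.0, 2.0, 10.0, 50.0, 100.0]
actual_log = [math.log10(fc) for fc in actual_fc]  # log-transform, as in the text
predicted_log = [0.1, 0.2, 1.2, 1.5, 1.9]

r = pearson_r(actual_log, predicted_log)
mse = mean_squared_error(actual_log, predicted_log)
```

Both measures are computed on log10(FC) values, so an mse of 0.20 corresponds to a typical error of roughly 0.45 log units, that is, a factor of about 2.8 in FC.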

Table 3. Regression performance.

| Method | RAL r | RAL mse | EVG r | EVG mse |
|---|---|---|---|---|
| REPTree 2-grams | 0.80 | 0.18 | 0.75 | 0.21 |
| REPTree 3-grams | 0.78 | 0.20 | 0.76 | 0.20 |
| REPTree 4-grams | 0.77 | 0.21 | 0.73 | 0.22 |
| SVR 2-grams | 0.68 | 0.28 | 0.72 | 0.25 |
| SVR 3-grams | 0.77 | 0.21 | 0.73 | 0.23 |
| SVR 4-grams | 0.77 | 0.21 | 0.72 | 0.24 |

Fig. (3). Permutation distributions elucidate statistical significance. (A) Using the original RAL dataset of HIV-1 integrase (IN) sequences represented via n-grams of size n = 3, one thousand control RAL datasets were generated by repeated random shuffling of the R/S class labels from their original assignments to the IN sequences. An implementation of the SVM algorithm with leave-one-out cross-validation (LOOCV) testing was applied to each of these control datasets, Matthews correlation coefficient (MCC) and balanced accuracy rate (BAR) performance measures were calculated in each case, and distributions were generated from these values. The MCC and BAR values obtained with the original RAL dataset are greater than all those associated with the one thousand controls. (B) Similar results were obtained using the original EVG dataset of HIV-1 integrase (IN) sequences represented as n-grams of size n = 2, along with implementation of the RF algorithm and LOOCV testing applied to one thousand control EVG datasets.

Next, to directly compare SVM classification and SVR regression models, the SVR predicted log10(FC) values for IN variant sequences based on LOOCV were converted to predicted S/R classes according to which side they fell on relative to the log10(FC) = log10(2.5) = 0.398 cutoff. These, in turn, were compared to the actual IN variant class labels (i.e., those previously obtained by converting PhenoSense assay FC values to R/S labels), to calculate overall prediction balanced accuracy (BAR). The results (Table 4) are mixed, with SVM outperforming SVR with respect to all RAL datasets, but no clear winner with the EVG datasets. Lastly, sample scatter plots based on LOOCV predictions using SVR regression models for two cases in Table 3 (RAL, 4-gram; and EVG, 3-gram), selected for having the best BAR values with SVR (Table 4), are presented in Fig. (4).

Table 4. BAR comparison.

| Method | RAL SVM | RAL SVR | EVG SVM | EVG SVR |
|---|---|---|---|---|
| 2-grams | 0.74 | 0.70 | 0.82 | 0.83 |
| 3-grams | 0.89 | 0.81 | 0.81 | 0.87 |
| 4-grams | 0.89 | 0.82 | 0.82 | 0.71 |
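The conversion of regression output to R/S classes used for the BAR comparison can be sketched as follows. The log10(FC) values are hypothetical and the function names are illustrative; only the log10(2.5) cutoff and the BAR definition come from the text.

```python
import math

LOG_FC_CUTOFF = math.log10(2.5)  # about 0.398


def class_from_log_fc(log_fc, cutoff=LOG_FC_CUTOFF):
    """'S' when log10(FC) <= log10(2.5), otherwise 'R'."""
    return "S" if log_fc <= cutoff else "R"


def balanced_accuracy(actual, predicted, positive="R"):
    """BAR = 0.5 * [Se(R) + Sp(R)]."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    se = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    return 0.5 * (se + sp)


# Hypothetical actual and regression-predicted log10(FC) values.
actual_log_fc = [0.0, 0.3, 0.5, 1.7, 2.0]
predicted_log_fc = [0.1, 0.45, 0.6, 1.5, 0.2]

actual_classes = [class_from_log_fc(v) for v in actual_log_fc]
predicted_classes = [class_from_log_fc(v) for v in predicted_log_fc]
bar = balanced_accuracy(actual_classes, predicted_classes)
```

A regression model can have a good correlation yet a modest BAR when many variants lie close to the cutoff, since small numerical errors there flip the predicted class.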

CONCLUSION

In summary, n-grams were used to represent IN variant sequences as feature vectors in order to develop efficient sequence-based models that predict INI resistance. Though all the models generally performed well, prior work suggests that as additional training data (e.g., at least 400 IN variants with known phenotypes) become available, further improvements can be expected [12-14]. These increasingly accurate models may potentially be useful clinically as supplementary diagnostic tools designed to tailor effective patient treatment strategies.

CONFLICT OF INTEREST

The author confirms that this article content has no conflicts of interest.


Fig. (4). Scatter plots of predicted and actual HIV-1 integrase (IN) variant phenotypes. (A) The original RAL dataset of IN variant sequences represented via n-grams of size n = 4, each having a known fold change (FC) susceptibility value with respect to the IN inhibitor raltegravir, was used in conjunction with the SVR regression algorithm and leave-one-out cross-validation (LOOCV) testing to generate a predicted FC value for each IN sequence in the dataset. Each point in the scatter plot characterizes a distinct IN sequence from the RAL dataset by virtue of its predicted and actual log10(FC) values. (B) A similar plot based on the SVR algorithm and LOOCV testing, obtained using the original EVG dataset of IN sequences represented as n-grams of size n = 3, and each having a known phenotype (i.e., FC value) with respect to the IN inhibitor elvitegravir.


ACKNOWLEDGEMENTS

The author thanks the Stanford University Drug Resistance Database for compiling the genotype-phenotype data used in this study.

REFERENCES

[1] Cortez KJ, Maldarelli F. Clinical management of HIV drug resistance. Viruses. 2011; 3: 347-378.
[2] Gutierrez Mdel M, Mateo MG, Vidal F, Domingo P. Drug safety profile of integrase strand transfer inhibitors. Expert Opin Drug Saf. 2014; 13: 431-445.
[3] McColl DJ, Chen X. Strand transfer inhibitors of HIV-1 integrase: bringing IN a new era of antiretroviral therapy. Antiviral Res. 2010; 85: 101-118.
[4] Stanford University HIV Drug Resistance Database [accessed May 2013]. Available from: http://hivdb.stanford.edu/cgi-bin/IN_Phenotype.cgi.
[5] Damashek M. Gauging similarity with n-grams: language-independent categorization of text. Science. 1995; 267: 843-848.
[6] Dong Q, Zhou S, Deng L, Guan J. Gene ontology-based protein function prediction by using sequence composition information. Protein Pept Lett. 2010; 17: 789-795.
[7] Vries JK, Liu X, Bahar I. The relationship between n-gram patterns and protein secondary structure. Proteins. 2007; 68: 830-838.
[8] Cheng BY, Carbonell JG, Klein-Seetharaman J. Protein classification based on text document classification techniques. Proteins. 2005; 58: 955-970.
[9] Mansoori EG, Zolghadri MJ, Katebi SD. Protein superfamily classification using fuzzy rule-based classifier. IEEE Trans Nanobioscience. 2009; 8: 92-99.
[10] Wu CH, Zhao S, Chen HL, Lo CJ, McLarty J. Motif identification neural design for rapid and sensitive protein family search. Comput Appl Biosci. 1996; 12: 109-118.
[11] Zhang KX, Ouellette BF. GAIA: a gram-based interaction analysis tool--an approach for identifying interacting domains in yeast. BMC Bioinformatics. 2009; 10 Suppl 1: S60.
[12] Masso M, Vaisman II. Sequence-based prediction of HIV-1 coreceptor usage: utility of n-grams for representing gp120 V3 loops. Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology, and Biomedicine; August 1-3; ACM, New York 2011; pp. 309-314.
[13] Masso M, Vaisman II. Accurate and efficient gp120 V3 loop structure based models for the determination of HIV-1 co-receptor usage. BMC Bioinformatics. 2010; 11: 494.
[14] Masso M, Vaisman II. Sequence and structure based models of HIV-1 protease and reverse transcriptase drug resistance. BMC Genomics. 2013; 14 Suppl 4: S3.
[15] Abram ME, Hluhanich RM, Goodman DD, Andreatta KN, Margot NA, Ye L, et al. Impact of primary elvitegravir resistance-associated mutations in HIV-1 integrase on drug susceptibility and viral replication fitness. Antimicrob Agents Chemother. 2013; 57: 2654-2663.
[16] Huang W, Frantzell A, Fransen S, Petropoulos CJ. Multiple genetic pathways involving amino acid position 143 of HIV-1 integrase are preferentially associated with specific secondary amino acid substitutions and confer resistance to raltegravir and cross-resistance to elvitegravir. Antimicrob Agents Chemother. 2013; 57: 4105-4113.
[17] Jones C, Ledford R, Yu F, McColl D. Resistance profile of HIV-1 mutations in vitro selected by the HIV-1 integrase inhibitor, GS-9137 (JTK-303). 14th Conference on Retroviruses and Opportunistic Infections; February 25-28; International Antiviral Society-USA, San Francisco 2007.
[18] Mesplede T, Quashie PK, Osman N, Han Y, Singhroy DN, Lie Y, et al. Viral fitness cost prevents HIV-1 from evading dolutegravir drug pressure. Retrovirology. 2013; 10: 22.
[19] Petropoulos CJ, Parkin NT, Limoli KL, Lie YS, Wrin T, Huang W, et al. A novel phenotypic drug susceptibility assay for human immunodeficiency virus type 1. Antimicrob Agents Chemother. 2000; 44: 920-928.
[20] Fransen S, Gupta S, Danovich R, Hazuda D, Miller M, Witmer M, et al. Loss of raltegravir susceptibility by human immunodeficiency virus type 1 is conferred via multiple nonoverlapping genetic pathways. J Virol. 2009; 83: 11440-11446.
[21] Johnson VA, Calvez V, Gunthard HF, Paredes R, Pillay D, Shafer RW, et al. Update of the drug resistance mutations in HIV-1: March 2013. Top Antivir Med. 2013; 21: 6-14.
[22] Shafer RW, Schapiro JM. HIV-1 drug resistance mutations: an updated framework for the second decade of HAART. AIDS Rev. 2008; 10: 67-84.
[23] Breiman L. Random forests. Machine Learning. 2001; 45: 5-32.
[24] Platt J. Fast training of support vector machines using sequential minimal optimization. In: Schoelkopf B, Burges C, Smola A, Eds. Advances in kernel methods - support vector learning. Cambridge, Massachusetts: The MIT Press 1998; pp. 41-65.
[25] Witten IH, Frank E. Data mining: practical machine learning tools and techniques. 2nd ed. San Francisco: Morgan Kaufmann 2005.
[26] Shevade SK, Keerthi SS, Bhattacharyya C, Murthy KRK. Improvements to SMO algorithm for SVM regression. IEEE Trans on Neural Networks. 2000; 11: 1188-1194.
[27] Frank E, Hall M, Trigg L, Holmes G, Witten IH. Data mining in bioinformatics using Weka. Bioinformatics. 2004; 20: 2479-2481.

Received: June 7, 2014

Revised: June 1, 2015

Accepted: June 23, 2015
