Supplementary appendix

This appendix formed part of the original submission and has been peer reviewed. We post it as supplied by the authors.

Supplement to: Chekroud AM, Zotti RJ, Shehzad Z, et al. Cross-trial prediction of treatment outcome in depression: a machine learning approach. Lancet Psychiatry 2016; published online Jan 20. http://dx.doi.org/10.1016/S2215-0366(15)00471-X.

Supplementary Materials Outline:
- SA1 – Data Access Information
- SA2 – Statistical Modeling Procedures
- ST1 – Complete variable list
- SA3 – Clinician Pilot Survey
- SA4 – STAR*D Receiver Operating Characteristic Curve and Model Calibration
- ST2/3 – Performance of Smaller Models
- SA5 – STAR*D analyses including early drop-outs
- SA6 – Inclusion of 2-week post-baseline severity measurement improves model performance
- ST4 – Cross-trial model performance metrics
- SF1 – Modeling COMED arms independently
- ST5 – Additional predictive modeling analyses of each COMED arm independently
- References

SA1. Data Access Information.

STAR*D data were obtained through a limited access data use certificate (DUC), awarded to Adam Chekroud under the supervision of Prof. Gregory McCarthy (PI). Data used in the preparation of this article were obtained from the limited access datasets distributed from the NIH-supported "Sequenced Treatment Alternatives to Relieve Depression" (STAR*D). STAR*D focused on non-psychotic major depressive disorder in adults seen in outpatient settings. The primary purpose of this research study was to determine which treatments work best if the first treatment with medication does not produce an acceptable response. The study was supported by NIMH Contract #N01MH90003 to the University of Texas Southwestern Medical Center. The ClinicalTrials.gov identifier is NCT00021528. This manuscript reflects the views of the authors and may not reflect the opinions or views of the STAR*D Study Investigators or the NIH.

COMED data were obtained through a limited access DUC awarded to Adam Chekroud (PI). Data used in the preparation of this article were obtained from the limited access datasets distributed from the NIH-supported "Combining Medications to Enhance Depression Outcomes" (CO-MED). This is a multi-site clinical trial of persons with depression comparing the effectiveness of randomly assigned medication treatments. The study was supported by NIMH Contract #N01 MH090003-02 to the University of Texas Southwestern Medical Center. The ClinicalTrials.gov identifier is NCT00590863. This manuscript reflects the views of the authors and may not reflect the opinions or views of the CO-MED Study Investigators or the NIH.

Investigators seeking access to the data can visit http://www.nimh.nih.gov/funding/clinical-trials-for-researchers/datasets/nimh-procedures-for-requesting-data-sets.shtml. All participants provided written informed consent at enrollment, with consent and study protocols approved by institutional review boards at each participating institution.1,2


SA2. Statistical Modeling Procedures.

Predictive Modeling Technical Summary

We extracted all readily available information about patients at baseline that overlapped between the STAR*D and COMED trials. Variables were centered and scaled (stats::scale function in R): for STAR*D using the completer sample; for COMED, variables were centered and scaled within each treatment arm separately. Next, a cross-validated (10x10cv) elastic net model was used to identify a group of 25 predictive features in the entire STAR*D completer cohort, based on ranked absolute beta values. The caret package was used in R, as a wrapper for the glmnet and gbm packages. These 25 features were then used to train a GBM to predict treatment outcome. In both cases, optimal hyperparameters were selected through a grid search of the plausible parameter spaces using an ROC-maximization process. All modeling was done using a parallel backend (R package doMC), restricted to 32 cores. To improve the reproducibility of our results, a seed was set locally (set.seed(1)), and a sequence of random integers (of pre-determined length) was generated and pushed to backend worker nodes for model building during repeated cross-validation. Before each train() call, the seed was reset locally, and all models used the same pre-determined sequence of parallel seeds. The best model built on the STAR*D completer set was taken without modification and applied to each of the three COMED arms (stats::predict function in R). As before, confusion matrices and a range of performance metrics were generated for each treatment arm separately (R function caret::confusionMatrix), and p-values were obtained by comparing the model's accuracy to the null-information rate with a binomial test. All R code developed for statistical analyses is available from the corresponding author upon request.

Dataset Description

STAR*D and COMED were both designed to be as representative of the general population as possible, using broad inclusion criteria and few exclusion criteria. Full details of these criteria can be found in the protocol files included in the dataset. Although criteria were largely similar, some criteria differed (e.g. STAR*D patients had to have a HAM-D score over 14, whereas for COMED it was 16 or more). The criteria were as follows:

STAR*D Inclusion Criteria:
- Outpatients with nonpsychotic MDD.
- A score of >14 on the HAM-D17.
- Outpatients for whom antidepressant treatment is deemed appropriate by the treating clinician.
- Age range: 18–75.
- Participants with suicidal ideation are eligible, as long as outpatient treatment is deemed safe by the clinician (i.e., inpatient care is not called for clinically).
- Participants who have most GMCs are eligible. Participants whose GMCs could conceivably be physiologically causing their depressive symptoms will receive treatment as usual for their GMCs as well as protocol Level 1 treatment for their MDD. We anticipate that during Level 1 most medical GMCs will be treated, so that if depressive symptoms persist after treatment for the GMC and after CIT for the depression, participants with these conditions are eligible for randomization into Level 2.

STAR*D Exclusion Criteria:
- Participants must not have an established, well-documented history of nonresponse or clear intolerability in the current major depressive episode to one or more treatments required by the protocol, delivered at an adequate dose (e.g., >40 mg/d of citalopram for at least 6 weeks or >16 sessions of CT).
- Participants with a lifetime history of bipolar disorder (BPD I, II, and NOS), schizophrenia, schizoaffective disorder, or MDD with psychotic features.
- Participants who currently suffer from a primary diagnosis of anorexia nervosa, bulimia nervosa, or obsessive compulsive disorder (OCD).
- Participants with severe, unstable concurrent psychiatric conditions likely to require hospitalization within six months from study entry (e.g., participants with severe alcohol dependence who have a history of recent admissions aimed at detoxification).
- Participants with substance dependence disorders who require inpatient detoxification. Participants with active substance abuse or dependence disorders who enter the study will receive whatever care is routine at their clinical site (e.g., substance abuse counseling) for these conditions.
- Participants with certain concurrent psychiatric or medical conditions that are relative or absolute contraindications to the use of more than one treatment option within the protocol, so that randomization to any of the strategies or substrategies within each level (Levels 2, 3, and 4) is not possible. Participants with certain concurrent psychiatric or medical conditions that are relative or absolute contraindications to the use of one or more of the treatment options within the protocol, and with the possibility of randomization to at least one of the strategies or substrategies within each level (Levels 2, 3, and 4), may enter the study, as long as the contraindication is noted and the strategy/substrategy involving the contraindicated treatment option is dropped. Participants taking any concomitant nonpsychotropic medications (save for anxiolytics and sedative hypnotics) may enter the study as long as their clinician determines that antidepressant treatments in the protocol are appropriate and safe. When there is a known association between the concomitant medication and depression, as suggested by the AHCPR guidelines, we will encourage clinicians to substitute, whenever possible, the concomitant medication with another that is not associated with depression before study entry or, when the latter is not feasible, during Level 1.
- Participants already receiving a targeted psychotherapy aimed at their depression may not enter the study. Those who have not responded to such psychotherapy and subsequently terminated it prior to study enrollment, or those who are receiving counseling or therapy for other problems (e.g., marital counseling to address marital discord; psychodynamic treatment of character issues), may enter the study.
- Participants who are pregnant or who will be trying to become pregnant within the subsequent 6–9 months.


COMED Inclusion Criteria:
- Patients must be seeking treatment at the primary or specialty care site, and be planning to continue living in the area of that clinic for the duration of the study.
- Patients must be 18–75 years old.
- Patients must meet clinical criteria for nonpsychotic MDD, recurrent (with the current episode being at least 2 months in duration) or chronic (current episode >2 years), as defined by a clinical interview and confirmed by the MINI International Neuropsychiatric Interview (M.I.N.I.).
- Screening HRSD17 score ≥ 16.
- Treatment with antidepressant medication combinations is clinically acceptable.
- Patients must give written informed consent.
- Patients with and without current suicidal ideation may be included in the study as long as outpatient treatment is clinically appropriate.

COMED Exclusion Criteria:
- Patients who are pregnant or breastfeeding.
- Patients who plan to become pregnant over the ensuing 8 months following study entry, or who are sexually active and not using adequate contraception.
- History (lifetime) of psychotic depression, schizophrenia, bipolar (I, II, or NOS), schizoaffective, or other Axis I psychotic disorders.
- Current psychotic symptom(s).
- History (within the last 2 years) of anorexia or bulimia.
- Current primary diagnosis of obsessive compulsive disorder.
- Current substance dependence that requires inpatient detoxification or inpatient treatment.
- Patients requiring immediate hospitalization for a psychiatric disorder.
- Definite history of intolerance or allergy (lifetime) to any protocol medication.
- History of clear nonresponse to an adequate trial of an FDA-approved monotherapy in the current MDE if recurrent, or during the last 2 years if chronic (see Appendix IV for the definition of an adequate trial).
- History of clear nonresponse to an adequate trial of escitalopram or S-CIT, BUP-SR, VEN-XR, or MIRT (see Appendix IV for the definition of an adequate trial) used as a monotherapy, or to one or more of the protocol combinations in the current or any prior MDE.
- Patients currently taking any of the study medications at any dose.
- Patients having taken Prozac (fluoxetine) or an MAOI in the prior 4 weeks.
- Patients with an unstable general medical condition (GMC) that will likely require hospitalization or that is deemed terminal (life expectancy <6 months after study entry).
- Patients who are taking medications or have GMCs that contraindicate any study medications (e.g., seizure disorder).
- Patients requiring medications for their GMCs that contraindicate any study medication.
- Epilepsy or other conditions requiring an anticonvulsant.
- Lifetime history of having a seizure, including febrile or withdrawal seizures.
- Patients who are receiving or have received (lifetime) vagus nerve stimulation (VNS), electroconvulsive therapy (ECT), repetitive transcranial magnetic stimulation (rTMS), or other somatic antidepressant treatments.
- Patients currently taking, or having taken within the prior 7 days, any of the following exclusionary medications: antipsychotic medications, anticonvulsant medications (gabapentin, pregabalin, and topiramate are allowed for pain as determined by the treating clinician), mood stabilizers, or central nervous system stimulants.
- Antidepressant medication used for the treatment of depression or for other purposes such as smoking cessation or pain is excluded, since these agents may interfere with the testing of the major hypotheses under study (low-dose trazodone is allowed for insomnia, <200 mg/day).
- Uncontrolled narrow angle glaucoma.
- Patients taking thyroid medication for hypothyroidism may be included only if they have been stable on the medication for 3 months.
- Patients using agents within the prior 7 days that are potential augmenting agents (e.g., T3 in the absence of thyroid disease, SAMe, St. John's Wort, lithium, buspirone).
- Depression-focused psychotherapy, including Cognitive Therapy (CT), Interpersonal Psychotherapy of Depression (IPT), Cognitive Behavioral Analysis System of Psychotherapy (CBASP), Problem-Solving Therapy, and light therapy, will not be allowed during participation. Patients can participate if they are receiving psychotherapy that is not targeting the symptoms of depression, such as individual therapy, group therapy, or family or couples therapy.


Sample selection

Although patients were encouraged to visit the clinic every two weeks, most patients did not attend every appointment. Many patients left the study before the full 12-week duration was complete, although some of these had already achieved remission. To allow the model to extract some relationship between the inputs and treatment response, it is important to allow the treatment sufficient time to take effect. As such, the sample only included patients for whom a severity score was recorded after 12 or more weeks of treatment (N = 1,949). Analyses of the full STAR*D sample, including patients who dropped out of the trial after week 0, are described in Supplementary Appendix 5. All models were constructed and examined with repeated 10-fold cross-validation, which partitions the original set of samples into 10 disjoint subsets, uses 9 of those subsets in the training process, and then makes predictions on the remaining subset. To avoid opportune data splits, model performance metrics are averaged across each of the test folds. We used 10 repeats of 10-fold CV (i.e. 10x10 CV).
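As context for how these folds can be configured, a minimal sketch using caret is shown below. This is not the study code: the data frame `stard` and its two-level outcome factor `remission` are hypothetical placeholders.

```r
library(caret)

# Hypothetical data frame `stard`: one row per treatment completer,
# baseline predictors plus an outcome factor `remission`
# (levels "Remit", "NonRemit").
set.seed(1)
cv_ctrl <- trainControl(method          = "repeatedcv",
                        number          = 10,     # 10 folds
                        repeats         = 10,     # repeated 10 times (10x10 CV)
                        classProbs      = TRUE,   # keep test-fold class probabilities
                        savePredictions = "final",
                        summaryFunction = twoClassSummary)  # ROC, Sens, Spec
```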

Feature Selection

Machine learning seeks to identify predictive variables (i.e. "features").3 In the present context, this problem is particularly salient: although models could benefit from more variables, the model loses utility as more and more questions are asked of the patient (as implementation becomes more effortful). A classic solution to this problem is to use a stepwise feature selection procedure,4 but this approach is slow and prone to over-fitting.5,6 Some studies have employed other multivariate approaches to this problem, e.g. recursive partitioning methods.7,8 Here, we demonstrate a purely data-driven approach and select features using elastic net regularization: a blended method for feature selection and shrinkage.9,10 There are two primary outcomes: coefficients of correlated predictors are shrunk toward each other, and the model is made sparser by deselecting features from the regression. Mathematically, the method linearly combines the $\beta$ weight penalties of the lasso ($l_1$-norm) and ridge ($l_2$-norm) regressions. As such, the elastic net solves the following problem:

$$\min_{\beta_0,\,\beta} \; \frac{1}{N}\sum_{i=1}^{N} w_i\, l(y_i,\, \beta_0 + \beta^{T} x_i) \;+\; \lambda\left[(1-\alpha)\,\|\beta\|_2^2/2 \;+\; \alpha\,\|\beta\|_1\right],$$

over a grid of values of $\lambda$ covering the entire range. Here $l(y_i, \eta_i)$ is the negative log-likelihood contribution for observation $i$; e.g. for the Gaussian case it is $\frac{1}{2}(y-\eta)^2$. The elastic-net penalty is controlled by $\alpha$, where the lasso is $\alpha = 1$ and ridge is $\alpha = 0$. The tuning parameter $\lambda$ controls the overall strength of the penalty. The elastic net approach maintains model parsimony by explicitly penalizing overfitting, and yields stable and sparse models that are robust to multicollinearity among features.9,10 Each variable was centered and then scaled (resulting in a mean of 0 and standard deviation of 1) before entry into the elastic net model, to account for differences in variable types and ranges. The top 25 features were selected according to the (ranked) absolute magnitude of variable coefficients in the best model across all k-fold iterations. The choice of 25 features was made by the authors as a balanced consideration of how long a questionnaire could be while remaining feasible in clinical practice. Models with fewer items are evaluated in Supplementary Tables 2-3.
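To make the selection step concrete, a hedged sketch of tuning the elastic net with caret/glmnet and ranking absolute coefficients is given below. It reuses the hypothetical `stard` and `cv_ctrl` objects from the sketch above; the grid ranges are illustrative, not the values used in the paper.

```r
library(caret)
library(glmnet)

# Illustrative (alpha, lambda) grid; tuned by mean test-fold AUC ("ROC").
enet_grid <- expand.grid(alpha  = seq(0.1, 1.0, by = 0.1),
                         lambda = 10^seq(-4, 0, length.out = 20))

set.seed(1)
enet_fit <- train(remission ~ ., data = stard,
                  method    = "glmnet",
                  metric    = "ROC",
                  tuneGrid  = enet_grid,
                  trControl = cv_ctrl)

# Rank predictors by the absolute magnitude of their coefficients in the
# best model, and keep the 25 largest (predictors assumed numeric, so
# coefficient names match column names).
betas  <- coef(enet_fit$finalModel, s = enet_fit$bestTune$lambda)
ranked <- sort(abs(betas[-1, 1]), decreasing = TRUE)  # drop the intercept
top25  <- names(ranked)[1:25]
```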

Predictive Modeling

To select the most appropriate model parameters, we performed a cross-validated grid search over a pre-determined range of reasonable parameters. Typically, optimal tuning parameters for the model are determined through an accuracy-maximization cross-validation procedure. However, in cases where outcome frequencies are skewed, an accuracy-maximization process will bias a model towards successfully classifying the most common outcome. In this instance, this would lead to a model that is optimized for negative predictions (non-remission), rather than one that seeks to detect patients for whom citalopram treatment is beneficial. As such, optimal tuning parameters were selected through an area under the receiver-operating characteristic curve (ROC)-maximization process (comparing true positives to false alarms). Finally, the best performing model in the training set was then used to generate predictions in the independent validation data (COMED). Relevant descriptions of model discrimination, including sensitivity, specificity, and area under the curve (AUC), were determined at each stage. When developing learning algorithms, model accuracy is often compared with a binomial test against the null-information rate, a conservative estimate of the expected accuracy of "random" performance. It has recently been demonstrated empirically that this offers equivalent results to a permutation test once sample size exceeds 250,11 although the binomial test is far less computationally intense.
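A corresponding sketch of the ROC-maximizing grid search for the GBM, and of scoring a frozen model on an external arm, might look as follows. The grid values and the `comed_esc` data frame are illustrative assumptions; caret::confusionMatrix does report a binomial-test p-value against the no-information rate.

```r
library(caret)
library(gbm)

# Illustrative hyperparameter grid covering a plausible space.
gbm_grid <- expand.grid(n.trees           = seq(50, 600, by = 50),
                        interaction.depth = 1:3,
                        shrinkage         = c(0.05, 0.10, 0.15),
                        n.minobsinnode    = 10)

set.seed(1)
gbm_fit <- train(x = stard[, top25], y = stard$remission,
                 method    = "gbm",
                 metric    = "ROC",       # tune by cross-validated AUC
                 tuneGrid  = gbm_grid,
                 trControl = cv_ctrl,
                 verbose   = FALSE)

# External validation on one COMED arm (`comed_esc` is hypothetical):
# confusionMatrix reports accuracy, kappa, and p-value [Acc > NIR].
pred <- predict(gbm_fit, newdata = comed_esc[, top25])
confusionMatrix(pred, comed_esc$remission)
```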

Alternative Modeling Approaches.

The primary model class in the present study was a Gradient Boosting Machine (GBM), a recent ensemble classifier. Boosting refers to a meta-algorithm for reducing bias (i.e. improving accuracy) in machine learning, while also somewhat reducing variance. Boosting is a general approach for prediction (i.e. supervised learning) in which a set of weak learners (i.e., smaller models whose predictions may be only slightly correlated with the true outcome) is combined to create a single strong learner. In principle, boosting can be used to combine any classifier (e.g. one can create a boosted support vector machine), although it is most often used to combine tree-based models. There are many ways in which the results of weak learners can be combined when boosting. Most commonly, after each weak learner is added, the data are reweighted to increase the weight of misclassified observations and decrease the weight of correctly classified observations. In the present manuscript we used gradient boosting, as developed by Jerome Friedman.3,12,13 Here, like other boosting methods, gradient boosting combines weak learners in an iterative fashion by fitting successive models to the residuals of the combined previous weak learners. Misclassification error is defined at each step using any differentiable loss function (we used the Bernoulli loss function), and the minimization is solved via gradient descent. Friedman13 gives the generic gradient boosting method in pseudo-code.
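Since the generic pseudo-code is given in Friedman's paper, we offer only an illustrative toy version here: a minimal gradient boosting loop in R using squared-error loss and depth-1 rpart stumps (a sketch for intuition, not the Bernoulli-loss GBM used in our analyses).

```r
library(rpart)

# Toy gradient boosting: squared-error loss, depth-1 regression stumps.
# For squared error, the negative gradient is simply the residual.
gradient_boost <- function(x, y, n_trees = 100, learn_rate = 0.1) {
  f0    <- mean(y)                     # F_0: best constant prediction
  pred  <- rep(f0, length(y))
  trees <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    resid      <- y - pred             # pseudo-residuals (negative gradient)
    dat        <- data.frame(x, r = resid)
    trees[[m]] <- rpart(r ~ ., data = dat,
                        control = rpart.control(maxdepth = 1))
    # add the shrunken weak learner to the ensemble
    pred <- pred + learn_rate * predict(trees[[m]], data.frame(x))
  }
  list(f0 = f0, trees = trees, rate = learn_rate)
}
```

The `learn_rate` argument plays the role of the shrinkage parameter discussed below, and `n_trees` the number of boosting iterations.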


Gradient boosting is often used with a tree-based classifier (Classification and Regression Trees, aka "decision trees"), as trees can be fit extremely quickly and can easily capture non-linear interactions (if present). The size of each tree, and the minimum number of observations in each node, is typically pre-determined (e.g. through CV). To avoid overfitting, regularization techniques can be employed that constrain the model's ability to fit the training data. The most obvious parameter is the number of weak learners to be combined when making the overall strong learner (i.e. the number of boosting iterations): allowing more will reduce error in the training set, but may lead to poor generalization. Another option is to modify the updating rule to include a shrinkage parameter (also known as the learning rate). A smaller learning rate will require more boosting iterations and carries a greater computational demand, although it can yield improvements in the model's ability to generalize beyond the training data. In practice, both parameters are often tuned by cross-validation. "Stochastic" gradient boosting refers to a modification of Friedman's gradient boosting algorithm in which, at each iteration of the algorithm, weak learners are fit on just a sub-sample of the training data (drawn at random without replacement). Empirically, this can help prevent overfitting and also increases the computational efficiency of the procedure.

In preliminary analyses (using a subset of 60% of the STAR*D data), we considered three other methods designed to cover a good range of alternative approaches. Conclusions about the relative performance of different models were strictly based on test-fold performance in this 60% portion, and were not based on either the left-out 40% or any COMED data. A naïve Bayes classifier was selected as a common and easily implemented method. We also considered Linear Discriminant Analysis (LDA), a traditional linear method that will be familiar to many researchers and is more easily interpretable. Finally, to give an example of a more common, but non-linear, approach, we implemented a radial (Gaussian) Support Vector Machine (SVM). A brief description of these approaches is offered below. A naïve Bayes classifier implements Bayes' theorem by multiplying conditional probabilities over a set of predictors, and then selecting the most likely class (maximum a posteriori (MAP) estimation). Although the model relies on independence amongst predictors, an assumption that ostensibly does not hold in this context, naïve Bayes often performs well even when its assumptions do not hold, is straightforward to implement, and can be more interpretable. Support Vector Machines are an alternative class of models that have been used infrequently in clinical research,14 but have become the classifier of choice in modern neuroimaging research (multivoxel pattern analysis).15-17 An SVM model is a representation of observations in multidimensional space where target classes are separated as well as possible by a dividing "hyperplane". New (unseen) examples can then be mapped into the same multidimensional space and classified according to which side of the hyperplane they fall. Finally, we included LDA as a simple but effective prediction approach in which linear combinations of features are used to separate target classes. We found that the GBM achieved better performance when trained over the entire feature set, or the 25-item subset returned by the elastic net, and so it was our classifier of choice in the main manuscript. A full, explicit comparison of various classifier algorithms in predicting treatment outcome was beyond the scope of this article, although it would be an interesting further research question.

Alternative Outcomes

We focused on categorical prediction of clinical remission (QIDS ≤ 5) because it is associated with better function and a better prognosis than response without remission (2, 3), and was the primary outcome of the original clinical trials. However, this approach would consider someone who finished with a QIDS score of 6 sick, but someone with a score of 5 remitted. With this in mind, in the main manuscript (Results: Alternative Analyses) we presented an additional approach in which we predict each patient's final QIDS score directly (i.e. continuous prediction, using regression). More generally, the present approach can be used to predict any continuous or categorical outcome of interest.

R Package Citations

This manuscript relied heavily on a number of user-written R packages, credited below.

Parallel data manipulation and high performance computing:
Hadley Wickham and Romain Francois (2015). dplyr: A Grammar of Data Manipulation. R package version 0.4.2. http://CRAN.R-project.org/package=dplyr
Revolution Analytics (2014). doMC: Foreach parallel adaptor for the multicore package. R package version 1.3.3. http://CRAN.R-project.org/package=doMC

Data comparison (R wrapper for a JavaScript library to detect and visualize differences in data.frames):
Paul Fitzpatrick and Edwin de Jonge (2015). daff: Diff, Patch and Merge for Data.frames. R package version 0.1.4. http://CRAN.R-project.org/package=daff

Plotting tools and data visualization:
H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.
Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. http://CRAN.R-project.org/package=RColorBrewer

Statistical packages:
Greg Ridgeway with contributions from others (2015). gbm: Generalized Boosted Regression Models. R package version 2.1.1. http://CRAN.R-project.org/package=gbm
Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. http://www.jstatsoft.org/v33/i01/
Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77

Machine learning framework:
Max Kuhn; contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem and Luca Scrucca (2015). caret: Classification and Regression Training. R package version 6.0-52. http://CRAN.R-project.org/package=caret


ST1. Full Data Dictionary

The following information (i.e. 164 variables/features) was extracted from both the STAR*D and COMED trial databases. The majority of items are binary endorsements of a question (1 = "Yes", 0 = "No"), given a time scale over which the question applied. Time scale codes: Ever = lifetime; 2W = past 2 weeks; 6M = past 6 months; 2Y = past 2 years.

| Code | Time scale | Questionnaire item |
|---|---|---|
| Sex | | 1 = male, 2 = female |
| Age | | Rounded to nearest year |
| Hspnc | | Hispanic? |
| White | | White? |
| Black | | Black or African American? |
| Asian | | Asian? |
| Amind | | American Indian or Alaskan native? |
| School | | Years of education (12 = graduated high school) |
| Employment | | If unemployed, score 0, else score 1 |
| DSMDM | | Depressed mood most of the day, nearly every day |
| DSMDI | | Markedly diminished interest or pleasure in all, or almost all, activities most of the day, nearly every day |
| DSMCW | | Significant weight loss while not dieting, or decrease or increase in appetite nearly every day |
| DSMIH | | Insomnia or hypersomnia nearly every day |
| DSMPA | | Psychomotor agitation or retardation nearly every day |
| DSMLE | | Fatigue or loss of energy nearly every day |
| DSMFW | | Feelings of worthlessness, or excessive or inappropriate guilt nearly every day |
| DSMDT | | Diminished ability to think or concentrate, or indecisiveness nearly every day |
| DSMTD | | Recurrent thoughts of death, recurrent suicidal ideation, or suicide attempt |
| dage | | Age at onset of first MDE |
| epino | | Number of previous MDEs |
| Anorexia | | Diagnosed with Anorexia? |
| Bulimia | | Diagnosed with Bulimia? |
| OCD_PHX | Ever | Diagnosed with OCD? |
| LEX_EV | Ever | Ever taken Lexapro/ESC? |
| EFF_EV | Ever | Ever taken Effexor/VEN? |
| WEL_EV | Ever | Ever taken Wellbutrin/BUP? |
| CEL_EV | Ever | Ever taken Celexa/CIT? |
| PROZ_EV | Ever | Ever taken Prozac/Fluoxetine? |
| PAX_EV | Ever | Ever taken Paxil/Paroxetine? |
| ZOL_EV | Ever | Ever taken Zoloft/Sertraline? |
| SSOIN | | QIDS – Sleep onset insomnia |
| SMNIN | | QIDS – Mid-nocturnal insomnia |
| SEMIN | | QIDS – Early morning insomnia |
| SHYSM | | QIDS – Hypersomnia |
| SMDSD | | QIDS – Mood (sad) |
| SAPDC | | QIDS – Appetite (decreased) |
| SAPIN | | QIDS – Appetite (increased) |
| SWTDC | | QIDS – Weight (decrease) last 2 weeks |
| SWTIN | | QIDS – Weight (increase) last 2 weeks |
| SCNTR | | QIDS – Concentration/decision making |
| SVWSF | | QIDS – Outlook (self) |
| SSUIC | | QIDS – Suicidal ideation |
| SINTR | | QIDS – Involvement |
| SENGY | | QIDS – Energy/Fatigability |
| SSLOW | | QIDS – Psychomotor slowing |
| SAGIT | | QIDS – Psychomotor agitation |
| QSTOT | | QIDS total (scored manually as instructed) |
| *QSTOT.2 | | QIDS total measured 2 weeks post-baseline (*supplementary analysis only) |
| HMDSD | | HRS Depressed mood |
| HVWSF | | HRS Guilt feelings and delusions |
| HSUIC | | HRS Suicide |
| HSOIN | | HRS Initial insomnia |
| HMNIN | | HRS Middle insomnia |
| HEMIN | | HRS Delayed insomnia |
| HINTR | | HRS Work and interests |
| HSLOW | | HRS Retardation |
| HAGIT | | HRS Agitation |
| HPANX | | HRS Psychic anxiety |
| HSANX | | HRS Somatic anxiety |
| HAPPT | | HRS Appetite |
| HENGY | | HRS Somatic energy |
| HSEX | | HRS Libido |
| HHYPC | | HRS Hypochondriasis |
| HINSG | | HRS Loss of insight |
| HWL | | HRS Weight loss |
| HDTOT_R | | HRS Total Score |
| TREXP | 2Y | Have you ever experienced a traumatic event such as combat, rape, assault, sexual abuse, or any other extremely upsetting event? |
| TRWIT | 2Y | Have you ever witnessed a traumatic event such as rape, assault, someone dying in an accident, or any other extremely upsetting event? |
| TETHT | 2W | Did thoughts about a traumatic event frequently pop into your mind? |
| TEUPS | 2W | Did you frequently get upset because you were thinking about a traumatic event? |
| TEMEM | 2W | Were you frequently still bothered by memories or dreams of a traumatic event? |
| TEDIS | 2W | Did reminders of a traumatic event cause you to feel intense distress? |
| TEBLK | 2W | Did you try to block out thoughts or feelings related to a traumatic event? |
| TERMD | 2W | Did you try to avoid activities, places, or people that reminded you of a traumatic event? |
| TEFSH | 2W | Did you have "flashbacks," where it felt like you were reliving a traumatic event? |
| TESHK | 2W | Did reminders of a traumatic event make you shake, break out into a sweat, or have a racing heart? |
| TEDST | 2W | Did you feel distant and cut off from other people because of having experienced a traumatic event? |
| TENMB | 2W | Did you feel emotionally numb because of having experienced a traumatic event? |
| TEGUG | 2W | Did you give up on goals for the future because of having experienced a traumatic event? |
| TEGRD | 2W | Did you keep your guard up because of having experienced a traumatic event? |
| TEJMP | 2W | Were you jumpy and easily startled because of having experienced a traumatic event? |
| EBNGE | 2Y | Did you often go on eating binges (eating a very large amount of food very quickly over a short period of time)? |
| EBCRL | 2Y | Did you often feel you could not control how much you were eating during an eating binge? |
| EBFL | 2Y | Did you go on eating binges during which you ate so much that you felt uncomfortably full? |
| EBHGY | 2Y | Did you go on eating binges during which you ate a large amount of food even when you didn't feel hungry? |
| EBALN | 2Y | Did you eat alone during an eating binge because you were embarrassed by how much you were eating? |
| EBDSG | 2Y | Did you go on eating binges and then feel disgusted with yourself after overeating? |
| EBUPS | 2Y | Were you very upset with yourself because you were going on eating binges? |
| EBDT | 2Y | To prevent gaining weight from an eating binge did you go on strict diets, or exercise excessively? |
| EBVMT | 2Y | To prevent gaining weight from an eating binge did you force yourself to vomit, or use laxatives or water pills? |
| EBWGH | 2Y | Was your weight or the shape of your body one of the most important things that affected your opinion of yourself? |
| OBGRM | 2W | Did you worry obsessively about dirt, germs, or chemicals? |
| OBFGT | 2W | Did you worry obsessively that something bad would happen because you forgot to do something important, like locking the door, turning off the stove, pulling out the electrical cords of appliances, etc.? |
| OBVLT | 2W | Did you worry obsessively that you would act or speak violently when you really didn't want to? |
| OBSTP | 2W | Were there things you felt compelled to do over and over (for at least 1/2 hour per day) that you could not stop doing when you tried? |
| OBINT | 2W | Were there things you felt compelled to do over and over even though it interfered with getting other things done? |
| OBCLN | 2W | Did you wash and clean yourself or things around you obsessively and excessively? |
| OBRPT | 2W | Did you obsessively and excessively check or repeat things over and over again? |
| OBCNT | 2W | Did you count things obsessively and excessively? |
| ANHRT | 2W | Did you get very scared because your heart was beating fast? |
| ANBRT | 2W | Did you get very scared because you were short of breath? |
| ANSHK | 2W | Did you get very scared because you were feeling shaky or faint? |
| ANRSN | 2W | Did you get sudden attacks of very intense anxiety or fear that came on from out of the blue, for no reason at all? |
| ANCZY | 2W | Did you get sudden attacks of very intense anxiety or fear during which you thought something terrible might happen, such as you might die, go crazy, or lose control? |
| ANSYM | 2W | Did you have sudden, unexpected attacks of anxiety during which you had 3 or more of the following symptoms: heart racing or pounding, sweating, shakiness, shortness of breath, nausea, dizziness, or feeling faint? |
| ANWOR | 2W | Did you worry a lot about having unexpected anxiety attacks? |
| ANAVD | 2W | Did you have attacks of anxiety that caused you to avoid certain situations or to change your behavior or normal routine? |
| FRAVD | 6M | Did you regularly avoid any situations because you were afraid you'd have an anxiety attack in the situation? |
| FRFAR | 6M | Did any of the following make you feel fearful, anxious, or nervous because you were afraid you'd have an anxiety attack in the situation? Going outside far away from home |
| FRCWD | 6M | (Same stem as FRFAR) Being in crowded places |
| FRLNE | 6M | (Same stem as FRFAR) Standing in long lines |
| FRBRG | 6M | (Same stem as FRFAR) Being on a bridge or in a tunnel |
| FRBUS | 6M | (Same stem as FRFAR) Traveling in a bus, train, or plane |
| FRCAR | 6M | (Same stem as FRFAR) Driving or riding in a car |
| FRALO | 6M | (Same stem as FRFAR) Being home alone |
| FROPN | 6M | (Same stem as FRFAR) Being in wide open spaces (like a park) |
| FRANX | 6M | Did you almost always get very anxious as soon as you were in any of the above situations? |
| FRSIT | 6M | Did you avoid any of the above situations because they make you feel anxious or fearful? |
| EMWRY | 6M | Did you worry a lot about embarrassing yourself in front of others? |
| EMSTU | 6M | Did you worry a lot that you might do something to make people think that you were stupid or foolish? |
| EMATN | 6M | Did you feel very nervous in situations where people might pay attention to you? |
| EMSOC | 6M | Were you extremely nervous in social situations? |
| EMAVD | 6M | Did you regularly avoid any situations because you were afraid you'd do or say something to embarrass yourself? |
| EMSPK | 6M | Did you worry a lot about doing or saying something to embarrass yourself in any of the following situations? Public speaking |
| EMEAT | 6M | (Same stem as EMSPK) Eating in front of other people |
| EMUPR | 6M | (Same stem as EMSPK) Using public restrooms |
| EMWRT | 6M | (Same stem as EMSPK) Writing in front of others |
| EMSTP | 6M | (Same stem as EMSPK) Saying something stupid when you're with a group of people |
| EMQST | 6M | (Same stem as EMSPK) Asking a question when in a group of people |
| EMBMT | 6M | (Same stem as EMSPK) Business meetings |
| EMPTY | 6M | (Same stem as EMSPK) Parties or other social gatherings |
| EMANX | 6M | Did you almost always get very anxious as soon as you were in any of the above situations? |
| EMSIT | 6M | Did you avoid any of the above situations because they made you feel anxious or fearful? |
| DKMCH | 6M | Did you think that you were drinking too much? |
| DKFAM | 6M | Did anyone in your family think or say that you were drinking too much, or that you had an alcohol problem? |
| DKFRD | 6M | Did friends, a doctor or anyone else think or say that you were drinking too much? |
| DKCUT | 6M | Did you think about cutting down or limiting your drinking? |
| DKPBM | 6M | Did you think that you had an alcohol problem? |
| DKMGE | 6M | Because of your drinking did you have problems in your marriage, in your job, with your friends or family, doing household chores, or in any other important area of your life? |
| DGMCH | 6M | Did you think that you were using drugs too much? |
| DGFAM | 6M | Did anyone in your family think or say that you were using drugs too much, or that you had a drug problem? |
| DGFRD | 6M | Did friends, a doctor or anyone else think or say that you were using drugs too much? |
| DGCUT | 6M | Did you think about cutting down or limiting your drug use? |
| DGPBM | 6M | Did you think that you had a drug problem? |
| DGMGE | 6M | Because of your drug use did you have problems in your marriage, in your job, with your friends or family, doing household chores, or in any other important area of your life? |
| WYNRV | 6M | Were you a nervous person on most days of the past 6 months? |
| WYBAD | 6M | Did you worry a lot that bad things might happen to you or someone close to you? |
| WYSDT | 6M | Did you worry about things that other people said you shouldn't worry about? |
| WYDLY | 6M | Were you worried or anxious about a number of things in your daily life on most days of the past 6 months? |
| WYRST | 6M | Did you often feel restless or on edge because you were worrying? |
| WYSLP | 6M | Did you often have problems falling asleep because you were worrying about things? |
| WYTSN | 6M | Did you often feel tension in your muscles because of anxiety or stress? |
| WYCNT | 6M | Did you often have difficulty concentrating because your mind was on your worries? |
| WYSNP | 6M | Were you often snappy or irritable because you were worrying or feeling stressed out? |
| WYCRL | 6M | Was it hard for you to control or stop your worrying on most days of the past 6 months? |
| PHSTM | 6M | Have you had a lot of stomach and intestinal problems such as nausea, vomiting, excessive gas, stomach bloating, or diarrhea? |
| PHACH | 6M | Have you been bothered by aches and pains in many different parts of your body? |
| PHSCK | 6M | Do you get sick more than most people? |
| PHPR | 6M | Has your physical health been poor most of your life? |
| PHCSE | 6M | Are your doctors usually not able to find a physical cause for your physical symptoms? |
| WISER | 6M | Did you often worry that you might have a serious physical illness? |
| WISTP | 6M | Was it hard to stop worrying that you have a serious physical illness? |
| WIILL | 6M | Did your doctor say you didn't have a serious illness but it was still hard to stop thinking about it? |
| WINTR | 6M | Did you worry so much about having a serious illness that it interfered with your activities or it caused you problems? |
| WIDR | 6M | Did you visit the doctor a lot because you were worried that you had a serious physical illness? |


SA3. Clinician Pilot Survey.

In developing tools for clinical prediction, an important question is whether a tool might improve treatment decisions beyond the performance that clinicians already achieve. In the Leuchter et al. (2009) report of the BRITE-MD trial, "Clinician prediction at day 7 of the likelihood of response (51%) or remission (59%) was not significantly associated with outcome at the primary endpoint (P>0.44)".18 Here, we conducted an online survey designed to directly measure the performance of clinicians against a statistical model in predicting treatment outcome for archival (anonymous) patients in the STAR*D trial. The goal of the survey was to collect preliminary data to estimate the range of performance amongst practicing clinicians in predicting treatment outcome.

Method

Participants were recruited from an email database of medical doctors (MDs) involved with the Yale Department of Psychiatry. An invitation email was sent to: 108 currently paid MDs, 109 current psychiatry residents (Post-Graduate Years (PGYs) 1-4 and advanced residents/fellows), and 605 alumni of the department, for a total of 822 invitations. Response rates were low, as is often seen for surveys of practicing physicians;19 86 people clicked on the survey link in the email, and a total of N=23 completed the survey. Participating clinicians (mean age 41.7 years, median 35, range 29-83; 7 female, 16 male) had between 1 and 52 years of experience as practicing psychiatrists (mean = 12.6; median = 5.5). The survey was conducted online using Qualtrics. It took approximately 30 minutes to complete, and upon completion of the survey, participants were offered the option to join a prize draw for a $100 Amazon gift card. The survey was determined by the Yale University Human Investigation Committee to be exempt from full committee review under Category 2 (45 CFR 46.101(b)(2)), protocol #1409014654, and participating clinicians provided consent before beginning the survey.

Clinicians were asked to predict treatment outcome for 26 patients in the STAR*D trial. The 26 patients were selected from the STAR*D completers using the default random sampling function in R (Mersenne-Twister algorithm) within each class (remission and non-remission), to try to ensure that the sample of patients had the same balance of responders and nonresponders as the STAR*D completer sample. We chose to sample 26 patients as this was the number that a clinician was able to complete in 30 minutes during piloting. Clinicians were told the treatment regimen that all patients followed (12-14 weeks of Citalopram/Celexa, 20 mg/day for the first four weeks, then rising to 40 mg/day, side effects permitting). Patients had no documented history of non-response or intolerability to treatment in the current depressive episode. Clinicians were then presented a brief list of baseline information for each patient, and asked to indicate whether each individual patient eventually reached remission (QIDS ≤ 5) or not. The list included the top 25 pieces of information available in the STAR*D trial, as identified by an elastic net model.¹ A further 8 demographic/clinical variables were added to improve ecological validity, based on early clinical feedback. These variables were: age, gender, ethnicity, marital status, age at first MDE, number of previous MDEs, and blood relative diagnoses of MDD or alcohol abuse.

Results

Due to the preliminary nature of these data, inferential statistics were not conducted and we present descriptive statistics only. Raw data are available from the corresponding author upon request. Each clinician's predictions about treatment outcome were compared to the true outcome (known for each patient from the STAR*D trial). For each clinician, we calculated standard performance metrics (accuracy, sensitivity, specificity, PPV, and NPV), and present summary statistics of these metrics across the group of clinicians. For comparison, we also offer the performance of a GBM that was trained using the same top 25 pieces of information, and tested on these same 26 patients.

¹ Some variables used in the clinician pilot study were not available in COMED, and so were not included in the main manuscript. However, they were included for consideration in this pilot study to ensure that we presented clinicians with the best available data for predicting outcome in these cases.


Clinician (N=23) and GBM performance on the same 26 patients:

| Performance Metric | Clinicians: Mean | Clinicians: Median | Clinicians: Range | GBM |
|---|---|---|---|---|
| Accuracy | 49.3% | 46.2% | 34.6-69.2% | 69.2% |
| Sensitivity | 39.8% | 42.9% | 0-64.3% | 78.6% |
| Specificity | 60.5% | 58.3% | 16.7-100% | 58.3% |
| PPV | 58.2% | 52.9% | 33.3-100% | 68.8% |
| NPV | 44.9% | 46.2% | 22.2-62.5% | 70.0% |

19/23 clinicians had an accuracy below the null-information rate (that is, the proportion of the larger class). 3/23 clinicians had accuracies over 60%. When asked "If developed, would you use statistical models to help guide treatment selections?", 12/23 said "Yes", 7/23 said "No", and 4/23 said "Not Sure".


SA4. STAR*D Receiver Operating Characteristic Curve and Model Calibration

When developing classification algorithms, one often wants not only to predict a binary outcome (e.g. remission or non-remission), but also to obtain a probability of that outcome. This often serves as a proxy for one's confidence in a given prediction. Indeed, the fact that a classifier can make accurate binary predictions does not guarantee that it is well calibrated (although a well-calibrated classifier can still have poor accuracy). Here we provide the calibration curve for the 25-item STAR*D model. Since our 10-fold cross-validation process was repeated (to avoid opportune data splits), we averaged the extracted class probabilities (that is, the predicted probabilities in each test fold, not the training folds) for each patient across the 10 repeats. The main model is well calibrated, as the calibration line is very close to the identity line. This was not guaranteed: tree-based models do not always give perfect calibration compared to models that minimize log-loss, like logistic regression models.

In addition, it can sometimes be helpful to examine the performance of a classifier at varying classification thresholds (i.e. the ROC curve). Accordingly, we include an ROC curve for the main model.
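For readers wishing to reproduce these plots, a sketch of averaging repeated-CV test-fold probabilities and drawing the calibration and ROC curves is shown below. It assumes the hypothetical `gbm_fit` object from the SA2 sketches, with class levels named "Remit"/"NonRemit".

```r
library(caret)
library(pROC)
library(dplyr)

# Average each patient's test-fold probability of remission over the 10
# CV repeats (available because savePredictions was set in trainControl).
avg_prob <- gbm_fit$pred %>%
  group_by(rowIndex) %>%
  summarise(obs = first(obs), Remit = mean(Remit))

# Calibration curve: observed event rate within bins of predicted probability.
calib <- calibration(obs ~ Remit, data = avg_prob, class = "Remit")
plot(calib)

# ROC curve and AUC from the same averaged probabilities.
roc_obj <- roc(avg_prob$obs, avg_prob$Remit)
plot(roc_obj)
auc(roc_obj)
```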


Equivalent plots for each COMED treatment arm would be inappropriate given that ROC-based metrics are known to be unreliable for small sample sizes.20 Similarly, interpretation of calibration plots in COMED is unreliable due to the low sample sizes within each bin along the x-axis (with just 13-14 patients per bin).


ST2/3. Performance of Smaller Models.

Although a 25-item model was selected as a good balance between ease of implementation and sufficient information, we were also interested in how smaller models might perform. Supplementary Table 2 provides full descriptions of cross-validation performance and optimal hyperparameters for models using STAR*D completers with 25 items vs 15 or 10 items. In STAR*D, the performance of the 25- and 15-item models is remarkably similar, suggesting that 15 items could be used with little drop in model performance. With just 10 items, model sensitivity falls below 60%, although overall accuracy remains significantly greater than chance.

ST2. Mean (SD) performance over repeated 10-fold cross-validation in STAR*D completers.

| Performance Metric | 25-item model | 15-item model | 10-item model |
|---|---|---|---|
| Accuracy | 64.6% (3.2%) | 64.2% (3.2%) | 63.8% (3.3%) |
| AUC | 0.700 (0.036) | 0.693 (0.035) | 0.683 (0.039) |
| p-value [acc > NIR] | < 9.8 x 10^-33 | < 7.7 x 10^-31 | < 3.1 x 10^-29 |
| Sensitivity | 62.8% (5.1%) | 61.1% (4.6%) | 59.3% (5.2%) |
| Specificity | 66.2% (4.6%) | 67.1% (4.6%) | 68.1% (4.9%) |
| PPV | 64.0% (3.5%) | 63.9% (3.8%) | 64.0% (3.9%) |
| NPV | 65.3% (3.3%) | 64.5% (3.1%) | 63.8% (3.1%) |
| Kappa | 0.291 | 0.282 | 0.275 |
| NNT | 8 | 8 | 8 |
| Key parameters | Shrinkage: 0.05; Interaction depth: 1; # Trees: 370 | Shrinkage: 0.1; Interaction depth: 1; # Trees: 210 | Shrinkage: 0.15; Interaction depth: 1; # Trees: 170 |

NIR = Null Information Rate; PPV = Positive Predictive Value; NPV = Negative Predictive Value; NNT = Number Needed to Treat (= 1/Absolute Risk Reduction).

Although the models appear to perform equivalently in the STAR*D cohort, they did not perform equally well during external validation. We compared the performance of the 25-, 15-, and 10-item models trained in the STAR*D cohort when applied to the Escitalopram monotherapy arm of COMED, as illustrated in Supplementary Table 3. Only the 25-item model showed significant performance in the ESC group of COMED by conventional thresholds (p < 0.05).

ST3. Performance of STAR*D-trained models applied to the COMED Escitalopram monotherapy arm.

| Performance Metric | 25-item model | 15-item model | 10-item model |
|---|---|---|---|
| p-value [acc > NIR] | 0.043 | 0.060 (ns) | 0.40 (ns) |
| Sensitivity | 49.4 | 50.6 | 40.5 |
| Specificity | 70.8 | 68.1 | 68.1 |
| PPV | 65.0 | 63.5 | 58.2 |
| NPV | 56.0 | 55.7 | 51.0 |
| Kappa | 0.200 | 0.185 | 0.084 |
| McNemar's p | 0.021 | 0.057 | 0.006 |

The key finding here is that the model is still extracting predictive information from predictors that may not be in the top 10 or 15 predictors overall, and that this information is necessary for the model to generalize successfully. The exact reason for this is less clear. There are many reasons why a model can fail to generalize: one crucial reason that is often overlooked is when the model is being asked to generalize to a non-representative sample. For example, if we trained on children in Germany and tested on adults in Vietnam, we might not be surprised if the model failed to generalize (that is, the new examples are not representative of the training population). Given the differences in inclusion/exclusion criteria between STAR*D and COMED, some of this is certainly at play, but the fact that the 25-item model did generalize demonstrates that these differences are not insurmountable. Two more likely explanations are (a) noise, and (b) the difficulty of the problem (i.e. when two patients with similar inputs have different outcomes). We know that classification in biological psychiatry is extremely difficult for both of these reasons. Furthermore, overfitting, where models start to describe noise rather than the underlying relationship of interest, impairs generalization. It may be that the model continues to accurately capture the structure of the training data, but does so to a fault: by over-using certain variables that just happened to work in STAR*D but do not work in other populations. In the 25-item model, there may be a solution that is comparable in overall accuracy, but leverages a wider range of features with more representative cut-points, and thus generalizes better. We went to great lengths to avoid overfitting (rigorous repeated cross-validation, and examining a limited grid of model hyperparameters), but as we showed here (and indeed, as one can read in many statistical journals), these efforts do not guarantee that a model will generalize to a new population.


SA5. Alternative analyses including early drop-outs.

Here, we provide a summary of the results for analyses in which the full Level 1 cohort is included, regardless of when (or why) the patient left the trial (i.e. last observation carried forward, LOCF). Of the 4,041 patients in STAR*D, 12 have no outcome data, and 444 have a QIDS severity score only at week 0. These patients were not included in this analysis because their outcome (severity at week 0) would also be used as a predictor. After excluding 67 patients for missing baseline data, this left a total of N=3,518 in the LOCF analysis. The breakdown of final observations was as follows: week 2 (N=308), week 4 (N=358), week 6 (N=903), week 12 (N=1,142), and week 14 (N=807). Based on their final observation, 1,295 patients reached remission (36.8%) and 2,223 did not (63.2%). This means that for this analysis, the null-information rate ("chance" performance) was 63.2%. In this LOCF analysis, model accuracy was significantly greater than the null-information rate. However, the model's performance is now dominated by the ability to predict patients who do not respond, as evidenced by the shift in predictive power from sensitivity toward specificity. A model trained on the full Level 1 STAR*D cohort (regardless of treatment duration) is able to identify 86.4% of those who do not reach remission with citalopram, with an NPV of 70.4%. The top 25 predictive items in the elastic net are largely similar to those observed in the 12-week cohort: 19/25 are the same, and the two models differ by only 1 predictor amongst the top 10. "Importance" is simply the (absolute) magnitude of the beta weight in the final elastic net model. Note that the p-value for this model is much larger than in the main completer analysis (although still significant) because the class imbalance is so great (63.2% of the sample do not remit, meaning that the null-information rate in this instance is 63.2%).
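A sketch of how such an LOCF outcome can be derived from a long-format severity table is shown below; `qids_long` and its columns are hypothetical names, not the STAR*D variable names.

```r
library(dplyr)

# `qids_long`: one row per patient per visit, with columns
# patient_id, week, and qids (hypothetical layout).
locf <- qids_long %>%
  filter(week > 0) %>%                           # exclude week-0-only outcomes
  group_by(patient_id) %>%
  slice_max(week, n = 1, with_ties = FALSE) %>%  # carry the last visit forward
  ungroup() %>%
  mutate(remission = qids <= 5)                  # QIDS <= 5 defines remission
```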

Top 25 elastic net predictors (full cohort, including drop-outs):

| Predictor | Importance | Direction |
|---|---|---|
| QSTOT.0 | 0.09862 | Negative |
| employment | 0.08566 | Positive |
| SENGY.0 | 0.06751 | Negative |
| SMDSD.0 | 0.06448 | Negative |
| SAGIT.0 | 0.06171 | Negative |
| HDTOT_R | 0.05308 | Negative |
| SCHOOL | 0.04564 | Positive |
| BLACK | 0.04347 | Negative |
| HENGY | 0.04318 | Negative |
| SMNIN.0 | 0.03916 | Negative |
| HSUIC | 0.03856 | Negative |
| ZOL_EV | 0.03780 | Negative |
| FRCWD | 0.03699 | Negative |
| FRLNE | 0.03572 | Negative |
| SSOIN.0 | 0.03425 | Negative |
| EMSOC | 0.03407 | Negative |
| HEMIN | 0.03176 | Negative |
| PHACH | 0.03154 | Negative |
| DSMDM | 0.03088 | Negative |
| HINSG | 0.02928 | Positive |
| SEMIN.0 | 0.02920 | Negative |
| SSUIC.0 | 0.02793 | Negative |
| WHITE | 0.02746 | Positive |
| TRWIT | 0.02663 | Negative |
| SWTIN.0 | 0.02560 | Negative |

Full STAR*D cohort (including drop-outs) performance metrics:

| Metric | Value |
|---|---|
| Accuracy | 68.4% (1.9%) |
| AUC | 0.699 (0.024) |
| p-value [acc > NIR] | 7.1 x 10^-11 |
| Sensitivity | 37.5% (3.9%) |
| Specificity | 86.4% (2.4%) |
| PPV | 61.7% (4.5%) |
| NPV | 70.4% (1.3%) |
| Kappa | 0.260 |
| Key parameters | Shrinkage: 0.05; Interaction depth: 1; # Trees: 530 |


SA6. Inclusion of 2-week post-baseline severity measurement improves model performance.

Although we focused on the predictive utility of baseline (pre-treatment) information, others have attempted to make predictions about outcome based on measurements taken once treatment has begun.18,21,22 With this in mind, we examined how our classification and regression models would perform if we allowed them to include the total QIDS-SR score measured at two weeks post-baseline as one of the 25 variables. This analysis included STAR*D treatment completers for whom a week-2 measurement of QIDS severity was available (N = 1,680). All cross-validated performance measures in STAR*D completers improved substantially. Classification mean (and SD) performance: Accuracy = 67.9% (3.8%), ROC = 0.743 (0.041), Sensitivity = 69.1% (5.2%), Specificity = 66.8% (5.0%), PPV = 67.7% (3.9%), NPV = 68.4% (4.2%). Regression equivalent: RMSE = 4.17 (0.23), R² = 0.256 (0.060). As one would expect, this changes the balance of the top-25 features considerably, with the elastic net (and subsequent GBM) relying heavily (but still not exclusively) on the 2-week severity measure.
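A sketch of the regression variant (predicting the final QIDS score directly, with RMSE-based tuning) is shown below; `stard_wk2` and `final_qids` are hypothetical placeholders.

```r
library(caret)

# `stard_wk2`: hypothetical data frame with the 25 predictors (including
# the week-2 QIDS total) plus the numeric outcome `final_qids`.
set.seed(1)
reg_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
reg_fit  <- train(final_qids ~ ., data = stard_wk2,
                  method    = "gbm",
                  metric    = "RMSE",   # regression: minimize RMSE
                  trControl = reg_ctrl,
                  verbose   = FALSE)
reg_fit$results   # cross-validated RMSE and R-squared per tuning setting
```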

| Predictor | Importance | Direction |
|---|---|---|
| QSTOT.2 | 0.45291 | Negative |
| FRCAR | 0.08786 | Negative |
| BLACK | 0.06090 | Negative |
| FRLNE | 0.05856 | Negative |
| employment | 0.05675 | Positive |
| SAGIT.0 | 0.05134 | Negative |
| HINSG | 0.04576 | Positive |
| TRWIT | 0.04526 | Negative |
| PHACH | 0.04410 | Negative |
| TESHK | 0.03968 | Negative |
| ANAVD | 0.03862 | Negative |
| SENGY.0 | 0.03857 | Negative |
| FRCWD | 0.02856 | Negative |
| TERMD | 0.02703 | Negative |
| ZOL_EV | 0.02558 | Negative |
| OBCLN | 0.02481 | Negative |
| ANWOR | 0.02390 | Negative |
| SCHOOL | 0.02369 | Positive |
| HENGY | 0.02162 | Negative |
| TEJMP | 0.02063 | Negative |
| QSTOT.0 | 0.02041 | Negative |
| ANHRT | 0.01960 | Negative |
| HSANX | 0.01810 | Negative |
| HEMIN | 0.01169 | Negative |
| OBVLT | 0.01029 | Negative |


ST4. Cross-trial model performance metrics.

The STAR*D Citalopram model successfully generalizes to the COMED Escitalopram group, as well as to the Escitalopram + Bupropion group, but not to the Venlafaxine + Mirtazapine arm. Full model performance metrics are presented below.

| Metric | Escitalopram + Placebo | Escitalopram + Bupropion | Venlafaxine + Mirtazapine |
|---|---|---|---|
| Remitters; Nonremitters | 79; 72 | 66; 68 | 72; 68 |
| Accuracy | 0.596 | 0.597 | 0.514 |
| 95% CI | [0.513, 0.675] | [0.509, 0.681] | [0.428, 0.600] |
| p-value [acc > NIR] | 0.0431 | 0.0232 | 0.53 |
| Sensitivity | 0.494 | 0.561 | 0.389 |
| Specificity | 0.708 | 0.632 | 0.647 |
| PPV | 0.650 | 0.597 | 0.539 |
| NPV | 0.560 | 0.597 | 0.500 |
| Kappa | 0.200 | 0.193 | 0.0357 |
| McNemar's p | 0.021 | 0.68 | 0.021 |


SF1 & ST5. Additional predictive modeling analyses of each COMED arm independently.

Although it was not the focus of the present study, we applied the same general modeling pipeline to each COMED treatment arm separately to examine the predictability of each drug combination, albeit in greatly reduced sample sizes. As such, we developed three separate models (each with its own set of 25 features, coefficient weightings, and model hyperparameters). Since we had relatively few completers in each arm (around 140), we did not split each sample into training and validation sets. Instead, we simply report average cross-validated (10x10cv) performance measures over the whole arm (Supplementary Table 5); a sketch of this per-arm pipeline appears below. For each arm, we first selected 25 items using an elastic net, and then trained a GBM using the restricted 25-item feature space. As illustrated in Supplementary Figure 1, different variables emerged amongst the top 25 for each treatment arm. Due to the small number of patients in each cross-validated test fold (N/10 is approximately 14 patients), results will be more variable and the possibility of overfitting is certainly greater. In addition, all three treatment arms contained fewer patients than the total number of predictor variables. As such, these particular results are for illustrative purposes only, and should be interpreted with caution.
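As referenced above, the per-arm pipeline can be sketched as follows. `comed_arms` is a hypothetical named list of per-arm data frames, and `cv_ctrl` is the repeated-CV control object sketched in SA2; this is an illustration of the loop, not the study code.

```r
library(caret)

# Re-run elastic-net feature selection and GBM fitting within each arm.
arm_fits <- lapply(comed_arms, function(arm) {
  set.seed(1)
  enet  <- train(remission ~ ., data = arm, method = "glmnet",
                 metric = "ROC", trControl = cv_ctrl)
  b     <- coef(enet$finalModel, s = enet$bestTune$lambda)
  top25 <- names(sort(abs(b[-1, 1]), decreasing = TRUE))[1:25]
  # GBM restricted to that arm's own 25-item feature space
  train(x = arm[, top25], y = arm$remission, method = "gbm",
        metric = "ROC", trControl = cv_ctrl, verbose = FALSE)
})
```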

ST5. Within-arm GBM performance (cross-validation), mean (SD).

| Metric | Escitalopram + Placebo | Escitalopram + Bupropion | Venlafaxine + Mirtazapine |
|---|---|---|---|
| AUC | 0.709 (0.132) | 0.805 (0.121) | 0.804 (0.106) |
| Sensitivity | 0.764 (0.170) | 0.723 (0.170) | 0.729 (0.161) |
| Specificity | 0.508 (0.204) | 0.723 (0.188) | 0.720 (0.172) |
| Accuracy | 0.642 (0.122) | 0.724 (0.125) | 0.725 (0.100) |
| Kappa | 0.272 (0.250) | 0.446 (0.249) | 0.449 (0.202) |
| Key parameters | Shrinkage: 0.05; Interaction depth: 2; # Trees: 10 | Shrinkage: 0.15; Interaction depth: 1; # Trees: 570 | Shrinkage: 0.05; Interaction depth: 3; # Trees: 450 |


Supplementary Figure 1. Modeling each drug treatment group separately highlights a different optimal mapping between symptom profile and the target outcome (remission). Lists show the three items with the highest absolute beta values during elastic net feature selection. Performance measures are averaged across test folds during repeated 10-fold cross-validation. The exact source of each questionnaire item is indicated by glyphs: † = QIDS-SR, ‡ = PDSQ, ¤ = HAM-D, a = Medical Hx.

References

1. Rush AJ, Trivedi MH, Stewart JW, et al. Combining Medications to Enhance Depression Outcomes (CO-MED): Acute and long-term outcomes of a single-blind randomized study. Am J Psychiatry 2011; 168: 689–701.
2. Rush AJ, Fava M, Wisniewski SR, et al. Sequenced treatment alternatives to relieve depression (STAR*D): Rationale and design. Control Clin Trials 2004; 25: 119–42.
3. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2009; 1: 337–87.
4. Draper NR, Smith H. Applied regression analysis. In: Applied regression analysis. 1981: 709.
5. Berk RA. Regression analysis: A constructive critique. Sage, 2004.
6. Kessler RC, Warner CH, Ivany C, et al. Predicting suicides after psychiatric hospitalization in US Army soldiers: The Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS). JAMA Psychiatry 2014; 72: 49–57.
7. Jain FA, Hunter AM, Brooks JO, Leuchter AF. Predictive socioeconomic and clinical profiles of antidepressant response and remission. Depress Anxiety 2013; 30: 624–30.
8. Jakubovski E, Bloch MH. Prognostic subgroups for citalopram response in the STAR*D trial. J Clin Psychiatry 2014; 75: 738–47.
9. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Statistical Methodology) 2005; 67: 301–20.
10. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010; 33: 1–22.
11. Combrisson E, Jerbi K. Exceeding chance level by chance: The caveat of theoretical chance levels in brain signal classification and statistical assessment of decoding accuracy. J Neurosci Methods 2015; published online Jan 14. DOI:10.1016/j.jneumeth.2015.01.010.
12. Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal 2002; 38: 367–78.
13. Friedman JH. Stochastic Gradient Boosting. 1999.
14. Perlis RH. A clinical risk stratification tool for predicting treatment resistance in major depressive disorder. Biol Psychiatry 2013; 74: 7–14.
15. Ip IB, Bridge H, Parker AJ. Effects of spatial and feature attention on disparity-rendered structure-from-motion stimuli in the human visual cortex. PLoS One 2014; 9. DOI:10.1371/journal.pone.0100074.
16. Coutanche MN, Thompson-Schill SL, Schultz RT. Multi-voxel pattern analysis of fMRI data predicts clinical symptom severity. Neuroimage 2011; 57: 113–23.
17. Mourao-Miranda J, Reinders A, Rocha-Rego V, et al. Individualized prediction of illness course at the first psychotic episode: a support vector machine MRI study. Psychol Med 2012; 42: 1037–47.
18. Leuchter AF, Cook IA, Marangell LB, et al. Comparative effectiveness of biomarkers and clinical indicators for predicting outcomes of SSRI treatment in Major Depressive Disorder: Results of the BRITE-MD study. Psychiatry Res 2009; 169: 124–31.
19. Wiebe ER, Kaczorowski J, MacKay J. Why are response rates in clinician surveys declining? Can Fam Physician 2012; 58: 225–8.
20. Hanczar B, Hua J, Sima C, Weinstein J, Bittner M, Dougherty ER. Small-sample precision of ROC-related estimates. Bioinformatics 2010; 26: 822–30.
21. Leuchter AF, Cook IA, Hamilton SP, et al. Biomarkers to predict antidepressant response. Curr Psychiatry Rep 2010; 12: 553–62.
22. Cook IA, Hunter AM, Gilmer WS, et al. Quantitative electroencephalogram biomarkers for predicting likelihood and speed of achieving sustained remission in Major Depression: A report from the Biomarkers for Rapid Identification of Treatment Effectiveness in Major Depression (BRITE-MD) trial. J Clin Psychiatry 2013; 74: 51–6.
