
STATISTICS IN BIOPHARMACEUTICAL RESEARCH , VOL. , NO. , – https://doi.org/./..

Subgroup Identification in Clinical Trials by Stochastic SIDEScreen Methods

Ilya Lipkovich (QuintilesIMS, Durham, NC), Alex Dmitrienko (Mediana, Inc., Overland Park, KS), Kaushik Patra (Alexion, Lexington, MA), Bohdana Ratitch (QuintilesIMS, Montreal, Québec, Canada), and Erik Pulkstenis (MedImmune, Gaithersburg, MD)

ABSTRACT


Subgroup identification for personalized medicine has become very popular in the last decade. Efficient recursive partitioning procedures adapted from machine learning are natural approaches for performing subgroup identification based on pre-defined biomarkers, since they provide subgroups as terminal nodes of a decision tree. However, recursive partitioning is also known to be a potentially unstable procedure, with results quite sensitive to normal sampling variability in the data. One common approach to overcoming such instability, borrowed from ensemble learning, is to apply recursive partitioning to multiple datasets sampled from the observed data and then average the results over the collection of subgroups. This article proposes an alternative approach to subgroup identification in clinical trials that first evaluates the predictive strength of biomarkers based on variable importance and then applies recursive partitioning to the biomarkers with the highest variable importance scores. A deterministic version of this idea was implemented in the Adaptive SIDEScreen method, which generates a collection of patient subgroups by retaining multiple candidate splits of each parent group by different biomarkers (Lipkovich and Dmitrienko 2014a, 2014b). Here, we extend the Adaptive SIDEScreen and introduce the Stochastic SIDEScreen method. The key idea is to introduce randomness into the subgroup generation process, borrowing from bagging methods, to produce a broader collection of subgroups. Specifically, the SIDES method, where the most promising biomarkers are selected for each parent group from a set of candidate biomarkers, is applied to multiple bootstrap samples of the data. This new approach leads to a more reliable biomarker selection process, which is especially important for smaller, early phase studies where biomarker selection is typically carried out. The method is illustrated using clinical trial examples.

KEYWORDS: Bootstrap; Personalized medicine; Recursive partitioning; Resampling; Subgroup identification

1. Introduction

The main challenge of identifying predictive biomarkers and associated subgroups (often sought as "regions" in the space defined by biomarkers) lies in the fact that the targets of predictive biomarkers are individual patients' treatment differences, such as, for example, the difference in potential outcomes for a given patient that would be observed if s/he were treated with an experimental treatment "A" versus the standard of care treatment "B." Note that only one of these potential outcomes is observed for any given patient in a typical clinical trial with a parallel design, where patients are randomized to one of the available treatment arms. This is in contrast with the task of traditional supervised learning (e.g., identifying prognostic biomarkers that predict outcomes in untreated patients), where the outcome is fully observed in a training sample. To overcome this challenge, several methods for identification of predictive biomarkers and subgroups as biomarker signatures (e.g., "regions" or other types of rules defining subsets of patients in terms of their biomarker values) from randomized clinical trial data have been recently proposed and evaluated in the context of personalized medicine (see Lipkovich, Dmitrienko, and D'Agostino 2017). One broad class of subgroup identification methods, which we label global outcome modeling, approaches this task in two stages. At the first stage, the outcome is modeled using either

a single regression model that incorporates both main (prognostic) effects and treatment-by-covariate interactions (predictive effects), or different models fitted for each treatment arm; at the second stage, the model(s) fitted at the first stage is (are) used to predict hypothetical treatment differences at the individual patient level. These differences are then modeled as the outcome variable of a second-stage regression via traditional methods of predictive modeling. For example, the Virtual Twins method by Foster et al. (2011) uses random forests (Breiman 2001) at the first stage and CART (classification and regression trees, Breiman et al. 1984) at the second stage. These methods are based on machine learning approaches, while some researchers advocate more traditional parametric regression approaches. As fitting parametric models with a large number of interaction terms poses problems, methods of penalized regression and their extensions have been proposed that may mitigate some of these issues (see, e.g., Imai and Ratkovic 2013). Another class of subgroup identification procedures, called global treatment effect modeling, aims at direct estimation of predictive effects, thus obviating the need to fit prognostic effects. For example, the Interaction Trees of Su et al. (2009) fit a piecewise constant model for individual treatment differences by using a CART-type procedure with the treatment-by-split interaction as the splitting criterion, rather than the traditional "reduction in impurity due to the split" criterion. As a


result, procedures in this class may be more robust than global outcome modeling procedures, as they are less prone to the model misspecification that is inevitable when estimating prognostic models. Recently proposed solutions by Loh, He, and Man (2015) within the GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) platform and by Seibold et al. (2015) within the model-based recursive partitioning framework belong to the same category. An important subclass of global treatment effect modeling is formed by methods for identifying optimal individual treatment regimes (ITR), which originated from combining ideas from causal inference and machine learning. These methods are based on optimizing the value function associated with a treatment regime, proposed by Qian and Murphy (2011). For example, Gunter et al. (2011) proposed methods for identifying only those biomarkers contributing to a qualitative interaction with treatment (and therefore to the ITR) based on the notion of the value function. They used resampling procedures to ensure control of the familywise Type I error rate. Later, Zhang et al. (2012) and Zhao et al. (2012) showed that estimation of an optimal ITR can be cast as a classification problem, where the optimal classifier found by minimizing the outcome-weighted classification loss corresponds to an optimal treatment regime. Finally, we note a group of methods that fall within the class of local modeling. This last class of subgroup/biomarker identification methods focuses on a direct search for treatment-by-covariate interactions and on selecting subgroups with desirable characteristics, for example, subgroups with an enhanced treatment effect compared to that in the overall population. This approach obviates the need to estimate the response function over the entire covariate space and focuses instead on identifying specific regions with a large treatment effect. Some of the approaches under this heading were inspired by bump hunting (also known as the Patient Rule Induction Method [PRIM]) introduced by Friedman and Fisher (1999). They argued that it may be more efficient to search directly for "interesting" regions in the covariate space rather than estimating the outcome function over the entire covariate space and then discarding the regions that are "uninteresting." The offspring of bump hunting for the task of subgroup identification are the procedures proposed by Kehl and Ulm (2006) and later improved by Chen et al. (2015). It is also worth noting that, within each of the three classes described above, Bayesian methods have been developed as well (see Lipkovich, Dmitrienko, D'Agostino, 2016). Subgroup Identification using Differential Effect Search (SIDES) (Lipkovich et al. 2011) and its extension known as SIDEScreen (Lipkovich and Dmitrienko 2014a) are subgroup identification methods developed via recursive partitioning that also fall within the third category of local modeling. This article focuses on the SIDES methodology and introduces new enhancements compared to the previously published versions. We provide a high-level overview of the SIDES method in Section 3; the reader is referred to Lipkovich and Dmitrienko (2014b) and Lipkovich, Dmitrienko, and D'Agostino (2017) for further detail. SIDES-based procedures can be used at various stages of drug development programs. These methods can be used for selection of biomarkers for further investigation, for defining promising subgroups, or both. For example, in


early-stage trials (Phase 2), a higher emphasis is placed on the selection of potentially predictive biomarkers for further examination in late-stage trials (Phase 3) rather than on defining a specific cutoff value for each biomarker to define patient subgroups. In this context, various screening methods, for example, based on variable importance, can be used to facilitate the selection process. In late-stage development programs, which initially motivated the development of subgroup identification methods, SIDES/SIDEScreen can be used to identify promising subgroups (often for failed trials) that can be validated in subsequent trials or using independent retrospective data. The variable importance (VI) scoring concept is common in many applications of machine learning. We discuss variable importance in more detail in Section 3; at a high level, it represents a variable's ability to explain or predict a dependent variable, e.g., the treatment effect in our case. VI scoring is an integral part of ensemble learning methods such as bagging (a shorthand for "bootstrap aggregating," Breiman 1996) and random forests (Breiman 2001) that generate multiple splits by applying tree methods to different samples from the original data. The SIDES methodology also relies on VI scoring. In contrast to the ensemble machine learning methods, Adaptive SIDEScreen can be called a deterministic ensemble method in that it harvests promising subgroups by retaining the second, third, etc., best splits rather than applying the base recursive algorithm to multiple random samples from the data; but VI scoring remains a key tool for the selection of promising biomarkers based on their contribution to the harvested subgroups in this approach as well. As we argue in this article, for relatively small clinical trials in early-phase development programs, improved evaluation of variable importance scores and biomarker selection can be carried out using bagged estimators of variable importance. These estimators may help overcome an inherent instability of biomarker selection, i.e., improve robustness with respect to sampling variation so that small changes in the observed dataset do not result in substantially different outcomes of the selection process. Bagging (bootstrap aggregating) is a general approach where classification, regression, or other types of methods potentially sensitive to sampling variability are applied to multiple bootstrap samples, obtained by sampling with replacement from the original dataset, and appropriate averaging (aggregation) techniques are then used to combine the results from the bootstrap samples. A similar approach can be used with the SIDES methodology by averaging variable importance scores computed by the recursive partitioning-based SIDES method over multiple bootstrap samples from the original data. We further propose to define variable importance scores based on a more "focused" splitting criterion that accounts only for the biomarker-positive subgroup with an enhanced treatment effect and ignores contributions from the biomarker-negative subgroup that favors the control arm. Several methods that can be used for the purpose of biomarker selection are illustrated in this article, including the Adaptive SIDEScreen method and a novel method called Stochastic SIDEScreen that improves on Adaptive SIDEScreen by implementing the bagging procedure outlined above. The article is organized as follows. Section 2 presents a motivating clinical trial example. Section 3 provides an overview of

SIDES and SIDEScreen methods, Section 4 discusses available splitting criteria used in SIDES-based subgroup search methods and illustrates them using the example. Section 5 introduces the Stochastic SIDEScreen method. Section 6 applies the Adaptive SIDEScreen and Stochastic SIDEScreen to the case study introduced in Section 2. We conclude with a discussion section (Section 7).

Table 1. Summary statistics for the four candidate biomarkers in the RA trial.

Biomarker                   Type        Mean [range] (numerical) / Frequency (nominal)
Disease duration (DD)       Nominal
Rheumatoid factor (RF)      Numerical
Number of RA meds (NRAM)    Numerical
Tender joint count (TJC)    Numerical

2. Motivating Example

As a motivating example, we will use a simulated dataset mimicking outcomes and covariates from a rheumatoid arthritis (RA) study with 251 patients randomized to three different doses of an experimental treatment or placebo in a 1:1:1:1 ratio. For the purpose of this article, we combined the three dosing arms into a single arm (n1 = 188) to compare with the placebo arm (n0 = 63). The outcome variable is a binary variable based on the American College of Rheumatology definition of improvement (improvement is denoted by Y = 1 and lack of improvement by Y = 0). The overall treatment effect was non-significant (one-sided p-value = 0.404, based on the normal approximation for the difference in response rates with no continuity correction). The observed improvement rates were 0.44 and 0.43 for the treated and placebo groups, respectively. Table 1 provides a summary of four candidate biomarkers that are hypothesized to be predictive of treatment response in this trial. The biomarkers are measured prior to the initiation of treatment and will be used to illustrate the subgroup search methods described in the subsequent sections. Three of the biomarkers, namely, Rheumatoid factor (RF), Number of RA medications (NRAM), and Tender joint count (TJC), are continuous, and Disease duration (DD) is categorical with three levels: "<2 years," "2-5 years," and ">5 years." When conducting subgroup search with SIDES methods, the candidate subgroups will be formed as follows: for the continuous biomarkers, we will consider two child groups for each observed value x0 of the biomarker X as {X ≤ x0} and {X > x0}. For DD, the categories will be treated as unordered levels based on the assumption of a nonmonotone relationship between the disease duration and treatment effect. Specifically, the following three candidate splits will be considered: "<2 years" vs. "2-5 years or >5 years," ">5 years" vs. "<2 years or 2-5 years," and "<2 years or >5 years" vs. "2-5 years." The following three important considerations will be discussed in light of this clinical trial example. First, because of the small sample size (the control arm contains only 63 patients), the cutoffs defining patient subgroups for any continuous biomarker may not be accurately estimated. Hence the focus will be more on the evaluation of the predictive strength (variable importance) of the candidate biomarkers to aid subsequent


clinical development rather than on specific subgroup findings. The second consideration deals with the choice of the most appropriate splitting criterion for subgroup and biomarker selection. Finally, given the small sample size, consideration will be given to accounting for the larger variability in the measure of variable importance due to model selection uncertainty, which leads to resampling-based methods. An overview of the SIDES and SIDEScreen methods and enhancements with respect to the choice of splitting criteria and variable importance scores are given in the next three sections.
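To make the candidate-split construction above concrete, here is a small sketch in Python (our illustration, not code from the SIDES software; the function names and the toy disease-duration vector are hypothetical). It enumerates the child groups for a continuous biomarker and the three possible binary splits of a three-level nominal biomarker.

```python
import numpy as np

def candidate_splits_continuous(x):
    """Candidate splits {X <= x0} vs. {X > x0} for each observed cutoff x0 of a continuous biomarker."""
    cutoffs = np.unique(x[~np.isnan(x)])
    # The largest observed value is dropped because it would leave the upper child group empty.
    return [(x0, x <= x0) for x0 in cutoffs[:-1]]

def candidate_splits_nominal(x, levels):
    """All splits of the levels of a nominal biomarker into two mutually exclusive, exhaustive groups."""
    splits = []
    for k in range(1, 2 ** (len(levels) - 1)):  # enumerate non-empty subsets, skipping complements
        subset = [lev for i, lev in enumerate(levels) if (k >> i) & 1]
        splits.append((subset, np.isin(x, subset)))
    return splits

# Example with a three-level disease-duration variable: exactly three candidate splits are produced.
dd = np.array(["<2 years", "2-5 years", ">5 years", "<2 years"])
for subset, mask in candidate_splits_nominal(dd, ["<2 years", "2-5 years", ">5 years"]):
    print(subset, mask.astype(int))
```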

3. Overview of the SIDES and SIDEScreen Methods

The SIDES method uses recursive partitioning to perform a direct search for subgroups of patients who experience an enhanced treatment benefit. The recursive partitioning algorithm is applied to each individual candidate biomarker, and an optimal split is found by minimizing the splitting criterion. The splitting criterion is defined on a scale from 0 to 1, where smaller values are preferred; it is derived as an upper tail probability from the null distribution of a specific splitting statistic (see Section 4 for details) and is further adjusted to reduce the selection bias of a biomarker due to multiple candidate splits, using the modified Šidák procedure as explained by Lipkovich and Dmitrienko (2014b). Unlike common tree-based procedures, SIDES retains multiple promising splits into child groups formed by the second, third, etc., best biomarkers (as determined by the width parameter). The reason for retaining more than just the best splitter (i.e., having width > 1) is that this helps avoid local optima due to an overly greedy local search, which is often the case when following a "winner takes all" approach. Each split results in retaining the one of the two child groups that produces the larger treatment effect as "promising." Once the first set of promising child groups is generated, the same algorithm is recursively applied within each promising child (abandoning the non-promising one), resulting in a collection of terminal (or final) subgroups defined by a combination of multiple biomarkers (controlled by the degree of subgroup nesting, or depth, parameter). For example, a terminal subgroup with depth = 3 may look like a combination of three biomarkers such as {TJC > 13 and DD > 5 yr and RF ≤ 25}. Note that the subgroups in the final set are, in general, overlapping, hence different from the subgroups represented by the terminal nodes of a tree in CART (Breiman et al. 1984) or other tree regression methods. The SIDES method employs complexity control to reduce the size of the search space and multiplicity adjustments to account for the selection bias inherent in subgroup search.
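The harvesting step just described can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a user-supplied best_splits routine that, for a given parent group, returns up to width promising (biomarker, child group, adjusted criterion) triples, with the child group already chosen as the one showing the larger treatment effect.

```python
def harvest_subgroups(parent_mask, data, biomarkers, width, depth, best_splits, signature=()):
    """Recursively collect the terminal promising subgroups, in the spirit of the basic SIDES search.

    Each returned subgroup carries its signature: the sequence of
    (biomarker, adjusted splitting criterion) pairs that defined it.
    """
    terminal = []
    # Up to `width` promising splits of the parent group, ordered by the adjusted splitting criterion.
    for biomarker, child_mask, criterion in best_splits(parent_mask, data, biomarkers, width):
        child_signature = signature + ((biomarker, criterion),)
        if depth == 1:
            # Maximum nesting reached: the promising child group becomes a terminal subgroup.
            terminal.append({"mask": child_mask, "signature": child_signature})
        else:
            # Recurse within the promising child group; the non-promising child is abandoned.
            terminal += harvest_subgroups(child_mask, data, biomarkers, width,
                                          depth - 1, best_splits, child_signature)
    return terminal
```

With width = 3 and depth = 2, and assuming each parent admits the full number of promising splits, this yields the 3² = 9 terminal subgroups referred to later in this section; in this simplified sketch, a child group with no further promising splits simply contributes nothing at the next level.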


The SIDEScreen procedures (Fixed and Adaptive) improve on the basic SIDES procedure by introducing a concept of biomarker screening via variable importance scores. The variable importance score is an integral characteristic of a biomarker's predictive ability and is computed in the SIDES algorithm by averaging the biomarker's contributions across all subgroups in the final set. We emphasize that, unlike the definitions of VI indices used in the predictive learning literature, the measure of variable importance used in SIDES captures a biomarker's "value" as a predictor of treatment effect irrespective of its prognostic value. To fix ideas, we introduce some notation and briefly explain the computation of variable importance scores (see Lipkovich and Dmitrienko 2014a). Let X1, X2, . . . , Xp denote the candidate biomarkers, which can be continuous or categorical, in a clinical trial. A split on a continuous biomarker X forms two child groups {X ≤ x0} versus {X > x0}, where x0 is one of the realized values of X. For a categorical (nominal) biomarker X, splits are formed by dividing the m levels of the biomarker into two mutually exclusive and exhaustive groups. Missing values in categorical covariates can be treated as an additional (m+1)th level and, for continuous covariates, missing values can be omitted when evaluating the splitting criterion. The VI score, VI(X), associated with a particular biomarker X is computed as the average contribution of that covariate across all promising subgroups. The contribution of a biomarker is set to zero for all subgroups whose signature does not include it and to the negative logarithm of the splitting criterion where it is included, that is,

VI(X) = K⁻¹ Σ_{i=1}^{K} νi,

where νi = −log Di(X) if the ith subgroup signature contains the biomarker X and νi = 0 otherwise. Further, K is the number of identified terminal subgroups, and Di(X) is the adjusted splitting criterion (as defined in Section 4) evaluated for the biomarker X for the selected split in the ith subgroup. The value of the criterion is adjusted for multiple cutoff points using the modified Šidák procedure. As an illustration, Table 2 lists the variable importance scores for the four biomarkers in the RA trial example based on the basic SIDES procedure with width = 3, depth = 2, and the differential effect splitting criterion. The variable importance scores are based on K = 3² = 9 generated subgroups.

The Adaptive SIDEScreen procedure (Lipkovich and Dmitrienko 2014a) is a two-stage procedure that first "harvests" potential subgroups to evaluate the predictive strength of the candidate biomarkers (based on their VI scores) and then applies a SIDES-based recursive partitioning algorithm to the biomarkers with the highest VI scores. Specifically, at the first stage, biomarkers are selected based on the following screening rule:

VI(X) > Ê0 + k √V̂0,

where Ê0 and V̂0 are the mean and variance of the maximal (over all biomarkers) VI score under the null distribution obtained by permuting the treatment labels. These parameters are estimated from a large number (M) of such permuted samples. The multiplier k is a free parameter that is often calibrated so as to ensure a desired overall rate of false positives in selecting at least one noise biomarker in the absence of predictive biomarkers in the dataset. For example, if the desired overall false positive rate at the screening stage is 10% and the VI scores are assumed to be approximately normal, the required multiplier is k = Φ⁻¹(0.9) ≈ 1.282. Biomarkers with VI scores exceeding this threshold (if any) are selected, and at the second stage the basic SIDES method is applied to the selected biomarkers. The final adjusted p-values are computed by replicating the entire two-stage procedure on a large number of additional null sets.
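The VI computation and the first-stage screening rule above translate directly into the following sketch (again illustrative, not the authors' code; it reuses the hypothetical subgroup structure from the earlier harvesting sketch, and null_max_vi stands for the maximal VI scores obtained from M datasets with permuted treatment labels).

```python
import numpy as np

def variable_importance(subgroups, biomarkers):
    """VI(X) = (1/K) * sum_i nu_i, with nu_i = -log D_i(X) if biomarker X appears in the
    i-th subgroup signature (D_i is the adjusted splitting criterion) and nu_i = 0 otherwise."""
    K = len(subgroups)
    vi = {x: 0.0 for x in biomarkers}
    for sg in subgroups:
        for biomarker, criterion in sg["signature"]:
            vi[biomarker] += -np.log(criterion)
    return {x: (v / K if K > 0 else 0.0) for x, v in vi.items()}

def adaptive_screen(vi_observed, null_max_vi, k=1.282):
    """Adaptive SIDEScreen screening: keep X if VI(X) > E0 + k * sqrt(V0), where E0 and V0 are
    the mean and variance of the maximal (over biomarkers) VI score across permutation replicates;
    k = 1.282 corresponds to an approximate 10% rate of falsely selecting at least one biomarker."""
    e0, v0 = float(np.mean(null_max_vi)), float(np.var(null_max_vi))
    threshold = e0 + k * np.sqrt(v0)
    selected = [x for x, v in vi_observed.items() if v > threshold]
    return selected, threshold
```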

4. Criteria for Selecting Promising Biomarkers

In this section, we provide a brief overview of the three splitting criteria, Differential effect (D1), Maximal effect (D2), and Directional effect (D3), proposed for evaluating each candidate biomarker split in the SIDES algorithm (Lipkovich and Dmitrienko 2014b). Let Z1 and Z2 be the standardized test statistics associated with the treatment effect in the two child groups resulting from a candidate split (e.g., X ≤ x0 and X > x0). Assume that larger positive values of Z1 (Z2) correspond to a larger treatment effect (experimental treatment better than control) in child group 1 (group 2). Under the null hypothesis of no treatment effect in either child group, Z1 and Z2 follow the N(0,1) distribution.

4.1. Differential Effect Criterion

The "differential effect" criterion is defined as the absolute value of the difference in the z-statistics associated with the two child groups, normalized by a constant, that is, T = |Z1 − Z2|/√2. Since a larger value of T implies a greater discrimination between the two child groups, we focus on the upper-tailed p-value, i.e., p(Tobs) = 1 − FT(Tobs), where Tobs is the observed value of the differential criterion and FT(t) is the cumulative distribution function of the random variable T under the null hypothesis. It is easy to verify that

FT(t) = Pr(|Z1 − Z2|/√2 < t) = Pr(−t < (Z1 − Z2)/√2 < t) = 2Φ(t) − 1,

since (Z1 − Z2)/√2 follows the standard normal distribution under the null hypothesis, so that D1 = 1 − FT(Tobs) = 2(1 − Φ(Tobs)).

4.2. Maximal Effect Criterion

The "maximal effect" criterion focuses on the child group with the larger treatment effect and is based on Zmax = max(Z1, Z2). As with the other criteria, it is expressed on the probability scale as an upper tail probability, D2 = 1 − FT(Tobs), with FT(t) = Φ(t)² under the null hypothesis.

4.3. Directional Effect Criterion

The "directional effect" criterion combines features of the two previous criteria and involves a truncation parameter δ ≥ 0. Letting Zmax = max(Z1, Z2) and Zmin = min(Z1, Z2), it is defined as

T = |Z1 − Z2|,                  if Zmin > 0,
T = Zmax − Zmin = |Z1| + |Z2|,  if Zmax > 0 and −δ ≤ Zmin ≤ 0,
T = Zmax + δ,                   if Zmax > 0 and Zmin < −δ,
T = 0,                          if Zmax ≤ 0.

A natural choice for the truncation parameter is δ = 0 (this value will be used in this article). As another approach, δ can be set in a data-driven fashion, depending on the observed effects, so as to equalize the negative and positive treatment effects in the two child groups: δ = min(Zmax, −Zmin) if Zmin < 0. In the case of δ = 0, the directional splitting criterion simplifies to

T = |Z1 − Z2|,  if Zmin > 0,
T = Zmax,       if Zmax > 0 and Zmin ≤ 0,
T = 0,          if Zmax ≤ 0.

Note that, under the null distribution, the probabilities of the three events in this definition are equal to ¼, ½, and ¼, respectively. It can be seen that the directional splitting criterion essentially simplifies to the differential effect splitting criterion if both

Z1 and Z2 are positive and to the maximal effect splitting criterion if Zmax > 0 and Zmin < 0. If both Z1 and Z2 are nonpositive, the criterion is set to 0. It can be shown that the null distribution of the directional splitting criterion T is given by

FT(t) = {1 − 4[Φ(t/√2) − 1]² + 2(2Φ(t) − 1)} / 3.

To put this criterion on a probability scale, we compute the upper tail probability and define the criterion as D3 = 1 − FT(Tobs).

4.4. Performance of Splitting Criteria

Next, we examine the performance of the splitting criteria described above using the clinical trial example of Section 2. First, TJC is considered, which is associated with the largest variable importance score (see Table 2). The left panel of Figure 2 shows the Z-statistics for the lower and upper child groups associated with each candidate cutoff. The right panel shows the profiles of the differential and maximal effect criteria. Note that the directional criterion with δ = 0 is essentially equivalent to the maximal effect criterion, since Zmin < 0 virtually everywhere in this case; hence this criterion is not plotted in Figure 2. The criterion values (on the negative log scale) are shown on the Y-axis. Note that, when selecting biomarkers, these values are further adjusted for multiplicity using the modified Šidák procedure. However, unadjusted values are plotted without any loss of interpretation, as the adjustment roughly amounts to adding a constant penalty (on the log scale) and thus does not affect the shapes of the criterion functions. The solid and dashed vertical lines indicate the optimal subgroups corresponding to the differential and maximal effect/directional criteria, namely, {TJC ≤ 13} and {TJC ≤ 19}, respectively. It can be seen that the maximal effect/directional criteria select a larger cutoff value compared to the differential criterion, as they "ignore" the large negative treatment effect observed at lower values of TJC.
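For concreteness, the three criteria can be computed from the two child-group z-statistics as below. This is a direct transcription of the formulas in this section rather than the authors' code; the maximal effect criterion follows the definition implied above, and δ = 0 is used for the directional criterion.

```python
from scipy.stats import norm

def splitting_criteria(z1, z2):
    """Return (D1, D2, D3): differential, maximal, and directional (delta = 0) splitting
    criteria on the probability scale; smaller values indicate a more promising split."""
    zmax, zmin = max(z1, z2), min(z1, z2)

    # Differential effect: T = |Z1 - Z2| / sqrt(2); D1 = 1 - F_T(T) = 2 * (1 - Phi(T)).
    t1 = abs(z1 - z2) / 2 ** 0.5
    d1 = 2 * (1 - norm.cdf(t1))

    # Maximal effect: T = max(Z1, Z2); F_T(t) = Phi(t)^2, so D2 = 1 - Phi(T)^2.
    d2 = 1 - norm.cdf(zmax) ** 2

    # Directional effect with delta = 0:
    #   T = |Z1 - Z2| if Zmin > 0;  T = Zmax if Zmax > 0 >= Zmin;  T = 0 if Zmax <= 0.
    if zmin > 0:
        t3 = abs(z1 - z2)
    elif zmax > 0:
        t3 = zmax
    else:
        t3 = 0.0
    # Null CDF: F_T(t) = (1 - 4 * (Phi(t / sqrt(2)) - 1)^2 + 2 * (2 * Phi(t) - 1)) / 3.
    f_t3 = (1 - 4 * (norm.cdf(t3 / 2 ** 0.5) - 1) ** 2 + 2 * (2 * norm.cdf(t3) - 1)) / 3
    d3 = 1 - f_t3
    return d1, d2, d3

# Example: an enhanced effect in one child group and a negative effect in the other.
print(splitting_criteria(3.0, -1.5))
```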


Figure . Z statistics corresponding to the “upper” (broken line with filled circles) and “lower” (broken line with open circles) child groups (left panel) and the criterion functions of the differential and maximal splitting criteria on the negative log scale for TJC (right panel). The solid and dashed vertical lines indicate optimal cutoffs for the differential and maximal splitting criteria.

Table . Z statistics for the biomarker-positive (Z ) and biomarker-negative (Z ) child groups and splitting criteria (−log scale) for candidate splits for the nominal biomarker DD (disease duration). Candidate splits for DD (biomarker-positive vs. biomarker-negative group) > OR – vs.  OR  vs. 5 years.”

Table . Variable importance score based on the three splitting criteria in the RA trial and associated threshold values (in parentheses). Biomarker TJC DD NRAM

Differential criterion (threshold = .)

Maximal criterion (threshold = .)

Directional criterion (threshold = .)

. . .

. . .

. . .

Table 4 shows the variable importance scores of the top three selected biomarkers when applying the basic SIDES method to the RA dataset with width = 3 and depth = 1 for the three splitting criteria. The thresholds shown in parentheses roughly correspond to the 90th percentile of the null distribution of the maximal VI score (E0 + z0.9 × SD0, assuming normality, where z0.9 = 1.282) based on 10,000 samples from the null distribution obtained by randomly permuting treatment labels in the original dataset. For both the maximal effect and directional criteria, the VI scores are smaller than under the differential criterion for all three candidate biomarkers, as these criteria ignore the strong negative treatment effect. It is noteworthy that these criteria seem to attribute a lesser importance to DD compared to TJC, which is consistent with the observation made earlier that all splits based on DD favored placebo over the experimental treatment. This behavior is consistent with the objective of searching for patient subgroups with a positive treatment effect. The results for the three splitting criteria differ when the Adaptive SIDEScreen method is applied. Table 4 shows that, while for the differential criterion both TJC and DD appear to have a strong predictive effect and are selected for the second stage (their VI scores exceed the threshold value of 1.161), for the two other criteria only TJC is selected, as the VI score for DD falls below the criterion-specific thresholds.


5. Stochastic SIDEScreen Method

In this section, we introduce a variation of the family of SIDES-based methods, called Stochastic SIDEScreen. Like Adaptive SIDEScreen, it has two stages: "harvesting" patient subgroups using the basic SIDES method, and "screening" VI scores in order to identify, at the second stage, the final set of subgroups and associated cutoffs for the selected biomarkers. The key enhancement made in the Stochastic SIDEScreen procedure is that the VI scores are computed not from subgroups "harvested" from the original data but rather from subgroups generated by applying the basic SIDES method to multiple (say, B = 10,000) bootstrap samples from the original data. Each bootstrap sample {Y*b, R*b, X*b}, b = 1, . . . , B, of the same size (N) as the observed dataset is obtained by sampling individual patient records with replacement N times and is comprised of the outcome variable, the treatment indicator (R = 0, 1), and the row vector of biomarkers. To preserve treatment balance, sampling is stratified by the treatment arm R. Then, the VI scores VIb(X) for the biomarkers Xi, i = 1, . . . , p, are computed from each bootstrap sample by running the basic SIDES method on the bootstrap data in the same way that VI(X) is computed by running SIDES on the original data {Y, R, X}. The bootstrap distribution of the VI score is thus generated for each candidate biomarker. The resulting distribution contains useful information that can be utilized in several ways, as described below.

Obtaining smooth point estimates of the VI scores: The estimates are found by averaging the VI scores over the bootstrap samples. This enables the trial's sponsor to factor in the (often substantial) uncertainty associated with the subgroup selection process. Note that the VI scores calculated from the observed data already rely on model averaging and, in this sense, are "smoothed" measures of each biomarker's contribution (the degree of smoothness depends on the scope of the subgroup search controlled by the width and depth parameters). Stochastic SIDEScreen makes a further improvement in this direction and adds a random component to the averaging, in the same way as the bagging method introduced by Breiman (1996). It is expected that a substantial degree of noise reduction will be achieved by averaging over relatively "independent" VI scores from multiple samples. It will be shown below that the bagging estimators of the VI scores are often quite different from the original estimates based on the observed data. Our expectation is that the bagging process will amplify the VI scores for true predictive biomarkers while causing the VI scores of the noise biomarkers to shrink toward zero. The idea is that strong predictors of treatment response will consistently manifest themselves across the majority of the bootstrap samples. By contrast, non-informative biomarkers will emerge in different samples, which results in partial cancelation of their importance when averaging over the ensemble of bootstrap samples.

Obtaining interval estimates of VI scores: The VI scores computed from the original data may be fairly unstable when dealing with small sample sizes and/or biomarkers with a large number of potential splits. A bootstrap-based confidence interval of the mean VI score provides an insight into this inherent instability of VI scores. Consequently, this confidence interval can serve as a more reliable tool for biomarker screening compared

to the standard VI score computed from the original data, as in the Adaptive SIDEScreen method. This approach provides the foundation for the Stochastic SIDEScreen procedure. The fundamental idea is to construct a more robust biomarker screening rule based on the bootstrap distribution of the VI scores along with the null distribution of the scores. More specifically, a candidate biomarker X can be selected for the second stage if Lα(X) > Ê0(X), where Lα(X) is the lower limit of the 100 × (1 − α)% (say, 80%) bootstrap confidence interval of the VI score associated with this biomarker and Ê0(X) is the mean of the biomarker's VI score under the null distribution obtained by permuting the treatment labels. While the Adaptive SIDEScreen method also uses the null distribution based on permuted treatment labels to define a variable selection threshold, the main difference between the Adaptive and Stochastic SIDEScreen procedures is that the former method compares a single VI(X) score estimated from the original dataset to a threshold, while the latter compares a more robust bootstrap-based lower confidence limit for the VI score, Lα(X), to a biomarker-specific threshold for biomarker selection. Several choices of Lα(X) can be considered:

Method 1: Percentile method, Lα(X) = q_{α/2}{VIb(X), b = 1, . . . , B}, the α/2 quantile of the bootstrap distribution of the VI score.

Method 2: Normal approximation method, with Lα(X) = max(0, VI(X) − z_{1−α/2} × √V̂B(X)), where VI(X) is the observed variable importance and V̂B(X) is the bootstrap estimate of the variance of VI(X).

Method 3: Normal approximation method for the bagging estimator, Lα(X) = max(0, VIbag(X) − z_{1−α/2} × √V̂IJ(X)), where VIbag(X) = B⁻¹ Σ_b VIb(X) is the bagging estimator of the VI score and V̂IJ(X) is the variance of the bagging estimator, computed using the bias-corrected version of the Infinitesimal Jackknife estimate (Efron 2014) presented in Eq. (7) of Wager, Hastie, and Efron (2014).

To facilitate the presentation of ideas, Figure 3 shows the schematics of the Stochastic SIDEScreen method. This article focuses on the simpler percentile method (Method 1) and compares it with Method 3. We note that the normal approximation method (Method 2) is likely to be unsatisfactory because the VI(X) based on the observed data is unstable. Finally, the third method may result in a tighter confidence interval compared to the percentile method; however, it requires further exploration in future work. In summary, the proposed bootstrap-based rules for biomarker selection (Lα(X) > Ê0(X)) will be contrasted with the rule used in the Adaptive SIDEScreen procedure: select the biomarker X if VI(X) > Ê0 + k √V̂0, where VI(X) is obtained from the original dataset and k is calibrated to ensure a 100α% Type I error rate (to match the rule based on the 100 × (1 − α)% bootstrap confidence interval, it should be selected as k = Φ⁻¹(1 − α/2)), and Ê0 and V̂0 are the expectation and variance of the maximal VI score (over all biomarkers) estimated from the null distribution by randomly permuting the treatment labels of the original data.
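A compact sketch of the first-stage Stochastic SIDEScreen screening under Method 1 is shown below. It is illustrative only: run_sides_vi stands for any routine that returns the vector of VI scores for a given dataset (for example, the harvesting and VI computation sketched earlier), the stratified resampling keeps the treatment-arm sizes fixed, and for brevity the permutation null replicates are generated in the same loop as the bootstrap samples.

```python
import numpy as np

def stratified_bootstrap_indices(treatment, rng):
    """Resample patient indices with replacement, separately within each treatment arm."""
    idx = [rng.choice(np.flatnonzero(treatment == arm),
                      size=int(np.sum(treatment == arm)), replace=True)
           for arm in np.unique(treatment)]
    return np.concatenate(idx)

def stochastic_sidescreen_stage1(y, treatment, X, run_sides_vi, B=10000, alpha=0.2, seed=1):
    """Method 1 (percentile) screening: select biomarker j if the lower limit of the
    100*(1 - alpha)% bootstrap confidence interval of its VI score exceeds the mean
    of that score under the permutation null distribution."""
    rng = np.random.default_rng(seed)
    vi_boot = np.empty((B, X.shape[1]))
    vi_null = np.empty((B, X.shape[1]))
    for b in range(B):
        idx = stratified_bootstrap_indices(treatment, rng)
        vi_boot[b] = run_sides_vi(y[idx], treatment[idx], X[idx])    # bootstrap VI scores, VI_b(X)
        vi_null[b] = run_sides_vi(y, rng.permutation(treatment), X)  # VI scores under the null
    lower = np.percentile(vi_boot, 100 * alpha / 2, axis=0)          # L_alpha(X): alpha/2 quantile
    e0 = vi_null.mean(axis=0)                                        # E0_hat(X): null mean per biomarker
    bagged = vi_boot.mean(axis=0)                                    # bagged (smoothed) VI scores
    selected = np.flatnonzero(lower > e0)                            # biomarkers passing the screen
    return selected, bagged, lower, e0
```

The bagged scores returned here are the smooth point estimates discussed above; the selected biomarkers would then be passed to the basic SIDES method at the second stage.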


Figure . Schematic representation of the Stochastic SIDEScreen method.

6. Application of Adaptive and Stochastic SIDEScreen Methods

In this section, we discuss the results obtained by applying the Adaptive and Stochastic SIDEScreen procedures to the clinical trial example introduced in Section 2. To perform a sensitivity assessment, an additional 10 noise binary candidate biomarkers labeled X1, . . . , X10 were added to the dataset. These noise biomarkers were generated as Bernoulli random variables with the success probability randomly generated from a uniform distribution U(0,1). The performance of Adaptive SIDEScreen with the multiplier k = 1.282 (corresponding to a threshold approximately equivalent to the 90th percentile of the null distribution of VI scores) will be compared with that of Stochastic SIDEScreen based on the differential and directional splitting criteria. For all investigations, subgroup search is performed with width = 3 and depth = 2. Figure 4(a) depicts the variable screening process using the Adaptive SIDEScreen procedure with the differential effect splitting criterion. The circles represent the observed VI scores and the vertical dotted line is the threshold based on the 90th percentile of the null distribution of the maximal VI score, which is approximately equivalent to E0 + 1.282 × SD0. Here, the mean E0 and standard deviation SD0 are evaluated from 1,000 datasets simulated from the original data by randomly permuting the treatment labels.
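The noise covariates described above can be generated along the following lines (a sketch; the random seed is arbitrary and the sample size simply matches the RA trial, neither being values reported by the authors).

```python
import numpy as np

rng = np.random.default_rng(0)                         # arbitrary seed, for illustration only
n, n_noise = 251, 10                                   # RA trial size and number of noise biomarkers
p_success = rng.uniform(0.0, 1.0, size=n_noise)        # success probability p_j ~ U(0, 1)
noise = rng.binomial(1, p_success, size=(n, n_noise))  # X_j | p_j ~ Bernoulli(p_j) for each patient
```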


Figure . (A) Variable screening using Adaptive SIDEScreen procedure in the RA example using the differential splitting criterion. The dotted line indicates the threshold for variable importance based on the approximate th percentile of the null distribution for maximal variable importance. (B) Variable screening using Stochastic SIDEScreen in the RA trial. The horizontal dashed (red) lines extend to the mean of the null distribution for that variable. The vertical dashed line corresponds to the mean for the null distribution of the maximal variable importance across all candidate covariates.

Both TJC and DD are identified as the two candidate biomarkers that pass the screening rule, since their VI scores are greater than the threshold. Table 5 provides the details of the patient subgroups based on TJC and DD identified in the second stage of the subgroup search procedure. Since the sample sizes are not sufficiently


Table . Subgroups identified with Adaptive SIDEScreen in the RA trial using the differential splitting criterion. Subgroup Overall population TJC> TJC>; DD = “> yrs” DD = “> yrs DD = “> yrs”; TJC >

Sample size

Response rate (Treatment)

Response rate (Placebo)

P-value (-sided)

Adjusted p-value

    

. . . . .

. . . . .

. .