Inference for Non-regular Parameters in Optimal Dynamic Treatment Regimes

Bibhas Chakraborty
Department of Biostatistics, Columbia University
Joint work with Susan Murphy and Victor Strecher

ENAR Spring Meetings, New Orleans, Louisiana
March 22, 2010
Dynamic Treatment Regimes
Chronic diseases like mental illnesses, substance abuse, HIV infection and certain types of cancer are treated in multiple stages
– At each stage, treatment type and dosage are adapted to the individual patient's ongoing response, adherence, burden, side effects, preference, and past treatments

Dynamic treatment regimes
– are multi-stage, individualized treatment rules
– involve a series of treatment decisions on the same patient
– operationalize the adaptive clinical practice
Outline

1. Dynamic Treatment Regimes: A Brief Overview
2. Optimal Dynamic Treatment Regimes: Estimation and Inference
   – Estimation via Q-learning
   – The Problem of Non-regularity
   – Soft-threshold Estimator
3. Simulation Experiments
4. Analysis of Smoking Cessation Data
5. Discussion
Dynamic Treatment Regimes
Formally, a dynamic treatment regime (DTR) is a sequence of decision rules or functions, one per stage
– Each decision rule takes a patient's treatment and covariate history as input, and outputs a recommended treatment
– These decision rules should lead to a maximal (optimal) long-term mean outcome

Goal: To estimate and make inference about these decision rules from longitudinal data
Data Structure

$K$ stages on a single patient: $O_1, A_1, \ldots, O_K, A_K, O_{K+1}$
– $O_j$: observation (pre-treatment) at the $j$-th stage
– $A_j$: treatment (action) at the $j$-th stage, $A_j \in \mathcal{A}_j$
– $H_j$: history at the $j$-th stage, $H_j = \{O_1, A_1, \ldots, O_{j-1}, A_{j-1}, O_j\}$
– $Y_j$: outcome at the $j$-th stage, $Y_j = f_j(H_{j+1})$, with known $f_j$

A DTR is a sequence of decision rules: $d \equiv (d_1, \ldots, d_K)$ with $d_j(H_j) \in \mathcal{A}_j$

Consider only two stages, i.e., $K = 2$, throughout this talk
Data for Constructing the Optimal d

Sequential Multiple Assignment Randomized Trial (SMART) (Lavori and Dawson, 2004; Murphy, 2005)
– Each patient is followed through stages of treatment
– At each stage the patient is randomized to one of the possible treatment options

Examples of SMARTs (or precursors):
– Schizophrenia: CATIE (Schneider et al., 2001)
– Depression: STAR*D (Rush et al., 2003)
– ADHD: Pelham (ongoing)
– Prostate cancer: Thall et al. (2000)
– Leukemia: CALGB Protocol 8923 (Wahed and Tsiatis, 2004)
– Smoking: Project Quit (Strecher et al., 2008)
– Alcohol dependence: Oslin (complete)
In this talk, we will be dealing with SMART data only
A Two-stage Smoking Cessation Trial (Simplified)

NCI-funded Internet-based (eHealth) behavioral intervention for smoking cessation (Strecher et al., 2008)
– $O_1$: education ($\le$ high school vs. $>$ high school)
– $A_1 \in \{1, -1\}$: high vs. low tailored (success) story (additionally, each subject received a free nicotine patch)
– $O_2$: quit status; months non-smoking over months 0-6 (0, 1, 2)
– $A_2 \in \{1, -1\}$: booster-prevention vs. control
– $O_3$: months non-smoking over months 6-12 (0, 1, 2)
– $Y_1$ and $Y_2$ are the months non-smoking
Public Health Research Questions
Stage-1 question: Should the tailoring of the story (the stage-1 intervention) depend on the smoker's baseline characteristics (e.g., education)?
Stage-2 question: Is the stage-2 intervention useful for both quit and non-quit smokers?
Regression-based Methods for Constructing Optimal DTRs

Q-learning (Watkins, 1989)
– A popular method from reinforcement learning
– A generalization of least squares regression to multistage decision problems (Murphy, 2005)

Optimal Structural Nested Mean Models (SNMM) (Murphy, 2003; Robins, 2004; Moodie et al., 2007)
– Under simple conditions, Q-learning is an inefficient version of Robins' SNMM method (Chakraborty et al., 2009)
– Q-learning is simpler to visualize

The problem of non-regularity arises in both. For simplicity, we will focus on Q-learning throughout this talk.
Motivation for Q-learning

The intuition comes from dynamic programming (Bellman, 1957), which applies when the multivariate distribution of the data is known: move backward in time to take care of the delayed effects.

Define the "quality of the action", the Q-functions:
$$Q_2(h_2, a_2) = E\big[\,Y_2 \mid H_2 = h_2, A_2 = a_2\,\big]$$
$$Q_1(h_1, a_1) = E\Big[\,Y_1 + \underbrace{\max_{a_2} Q_2(H_2, a_2)}_{\text{delayed effect}} \,\Big|\, H_1 = h_1, A_1 = a_1\,\Big]$$

Optimal DTR: $d_j(h_j) = \arg\max_{a_j} Q_j(h_j, a_j)$, $j = 1, 2$
Q-learning with Linear Regression (K = 2)

In real life, we do not know the true distribution of the data. All we have is a data set consisting of $n$ patients: $\{O_{1i}, A_{1i}, O_{2i}, A_{2i}, O_{3i}\}$, $i = 1, \ldots, n$.

Assume that the treatments $A_j \in \{-1, 1\}$ are randomized with $P[A_j = 1 \mid H_j] = P[A_j = -1 \mid H_j] = 0.5$.

Model for the Q-functions:
$$Q_j(H_j, A_j; \beta_j, \psi_j) = \beta_j^T S_{j0} + (\psi_j^T S_j) A_j, \quad j = 1, 2,$$
where $S_{j0}$ and $S_j$ are two vector summaries of $H_j$.
Stage-2 regression:
$$(\hat\beta_2, \hat\psi_2) = \arg\min_{\beta_2, \psi_2} \frac{1}{n} \sum_{i=1}^{n} \big(Y_{2i} - Q_2(H_{2i}, A_{2i}; \beta_2, \psi_2)\big)^2$$

Stage-1 pseudo-outcome:
$$\hat Y_{1i} = Y_{1i} + \max_{a} Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2), \quad i = 1, \ldots, n$$

Stage-1 regression:
$$(\hat\beta_1, \hat\psi_1) = \arg\min_{\beta_1, \psi_1} \frac{1}{n} \sum_{i=1}^{n} \big(\hat Y_{1i} - Q_1(H_{1i}, A_{1i}; \beta_1, \psi_1)\big)^2$$

Estimated optimal DTR:
$$\hat d_j(h_j) = \arg\max_{a} Q_j(h_j, a; \hat\beta_j, \hat\psi_j), \quad j = 1, 2$$
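To fix ideas, here is a minimal sketch of this two-stage procedure in Python (numpy only). The generative model and the feature summaries `S20`, `S2`, `S10`, `S1` are hypothetical choices for illustration, not the trial's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Hypothetical SMART data: O1, A1, O2, A2 in {-1, 1}; Y1, Y2 outcomes
O1 = rng.choice([-1, 1], n)
A1 = rng.choice([-1, 1], n)                       # randomized, P = 0.5 each
O2 = rng.choice([-1, 1], n)
A2 = rng.choice([-1, 1], n)                       # randomized, P = 0.5 each
Y1 = np.zeros(n)
Y2 = -0.5 * A1 + 0.5 * A2 + 0.5 * A1 * A2 + rng.normal(size=n)

# Stage-2 working model: Q2 = beta2' S20 + (psi2' S2) A2
S20 = np.column_stack([np.ones(n), O1, A1, O2])   # main-effect features
S2 = np.column_stack([np.ones(n), O2, A1])        # treatment-interaction features
X2 = np.column_stack([S20, S2 * A2[:, None]])
theta2, *_ = np.linalg.lstsq(X2, Y2, rcond=None)
beta2, psi2 = theta2[:S20.shape[1]], theta2[S20.shape[1]:]

# Stage-1 pseudo-outcome: Y1 + max_a Q2 = Y1 + beta2'S20 + |psi2'S2|
Y1hat = Y1 + S20 @ beta2 + np.abs(S2 @ psi2)

# Stage-1 working model: Q1 = beta1' S10 + (psi1' S1) A1
S10 = np.column_stack([np.ones(n), O1])
S1 = np.column_stack([np.ones(n), O1])
X1 = np.column_stack([S10, S1 * A1[:, None]])
theta1, *_ = np.linalg.lstsq(X1, Y1hat, rcond=None)
beta1, psi1 = theta1[:S10.shape[1]], theta1[S10.shape[1]:]

# Estimated optimal rules: d_j(h_j) = sign(psi_j' S_j)
d2 = np.sign(S2 @ psi2)
d1 = np.sign(S1 @ psi1)
```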
Bias due to Non-regularity

$$\hat Y_{1i} = Y_{1i} + \max_{a} Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2) = Y_{1i} + \hat\beta_2^T S_{20,i} + |\hat\psi_2^T S_{2i}|$$

– Maximization is a non-smooth, non-linear operation. Even when $Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2)$ is unbiased for $Q_2(H_{2i}, a; \beta_2, \psi_2)$, we have
$$E\Big[\max_{a} Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2)\Big] \ge \max_{a} Q_2(H_{2i}, a; \beta_2, \psi_2)$$
– Thus $\hat Y_{1i}$ is a biased predictor of $Y_{1i} + \max_a Q_2(H_{2i}, a; \beta_2, \psi_2)$
– Bias in $\hat Y_{1i}$, $i = 1, \ldots, n$, can induce bias in $\hat\psi_1$
Toy Example to Illustrate Bias

To estimate $|\mu|$ based on $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, 1)$: the maximum likelihood estimator of $|\mu|$ is $|\bar X_n|$, where $\bar X_n$ is the sample mean. At $\mu = 0$, the point of non-differentiability of $|\mu|$, $|\bar X_n|$ is biased:
$$\lim_{n \to \infty} E\big[\sqrt{n}\,(|\bar X_n| - |\mu|)\big] = \begin{cases} \sqrt{2/\pi} & \text{if } \mu = 0 \\ 0 & \text{if } \mu \ne 0 \end{cases}$$

Detailed discussion of this bias: Robins (2004); Moodie and Richardson (2009)
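This bias is easy to see by simulation; a quick illustrative check (not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 20000

for mu in (0.0, 1.0):
    X = rng.normal(mu, 1.0, size=(reps, n))
    xbar = X.mean(axis=1)
    # Monte Carlo estimate of E[sqrt(n) * (|Xbar_n| - |mu|)];
    # theory: sqrt(2/pi) ~= 0.80 at mu = 0, and 0 otherwise
    print(mu, np.sqrt(n) * (np.abs(xbar) - abs(mu)).mean())
```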
Non-regularity in Inference

We want to construct CIs for $\psi_1$ or test $H_0: \psi_1 = 0$. Why?
– reduce the variety of information to be collected for future implementations of the DTR
– know when there is insufficient evidence in the data to recommend one treatment over another
– allow expert opinion to select among treatments

The problem again is that the pseudo-outcome $\hat Y_{1i}$ is non-differentiable in the estimators from the stage-2 regression.
Non-regularity in Inference

$$\hat Y_{1i} = Y_{1i} + \max_{a} Q_2(H_{2i}, a; \hat\beta_2, \hat\psi_2) = Y_{1i} + \hat\beta_2^T S_{20,i} + |\hat\psi_2^T S_{2i}|$$

Because of the lack of smoothness (non-differentiability) of $\hat Y_{1i}$:
– The asymptotic distribution of $\hat\psi_1$ is normal if $P[\psi_2^T S_2 = 0] = 0$ and non-normal if $P[\psi_2^T S_2 = 0] > 0$
– The change between the two distributions is abrupt
– The asymptotic distribution does not converge uniformly over the parameter space

Details on non-regularity in this problem: Robins (2004)
Confidence Intervals and Non-regularity

CIs can be wrongly centered due to the bias in $\hat\psi_1$. Non-regularity can cause additional problems in tail behavior:
– Whenever the true $\psi_2^T S_2 \approx 0$, Wald-type CIs (based on asymptotic normality) show poor frequentist properties (e.g., Robins, 2004; Moodie and Richardson, 2009)
– Usual bootstrap CIs also show poor frequentist properties: the bootstrap is inconsistent due to non-differentiability (e.g., Shao, 1994; Andrews, 2000)

We consider a shrinkage (empirical Bayes) approach.
Soft-threshold Estimator

$$\hat Y_{1i}^{ST} = Y_{1i} + \hat\beta_2^T S_{20,i} + |\hat\psi_2^T S_{2i}| \cdot \underbrace{\left(1 - \frac{\lambda_i}{|\hat\psi_2^T S_{2i}|^2}\right)^{\!+}}_{\text{shrinkage factor}}, \quad \lambda_i > 0$$

The corresponding estimator $\hat\psi_1^{ST}$ from the stage-1 regression is the soft-threshold estimator.
The soft-threshold estimator is a shrinkage estimator akin to:
– the non-negative garrote (Breiman, 1995)
– the adaptive lasso (Zou, 2006)
– the positive-part James-Stein estimator (Efron and Morris, 1973)

$\hat Y_{1i}^{ST}$ is still a non-smooth function, but the problematic term $|\hat\psi_2^T S_{2i}|$ is shrunk (thresholded) towards zero. One "good" choice of the tuning parameters $\lambda_i$, $i = 1, \ldots, n$, can be derived from a Bayesian formulation of the problem.
– We use a Bayesian formulation originally used for wavelet-based image processing in Figueiredo and Nowak (2001)
Choice of Tuning Parameter: An Empirical Bayes Approach

Lemma. Consider a hierarchical Bayesian model where
$$X \mid \mu \sim N(\mu, \sigma^2) \ (\text{known } \sigma^2), \qquad \mu \mid \phi^2 \sim N(0, \phi^2) \ (\text{normal prior}), \qquad p(\phi^2) \propto 1/\phi^2 \ (\text{Jeffreys hyper-prior}).$$
Then an empirical Bayes estimator of $|\mu|$ is given by
$$\widehat{|\mu|}^{EB} = X\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+} \left(2\,\Phi\!\left(\frac{X}{\sigma}\sqrt{\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}}\right) - 1\right) + \sqrt{\frac{2}{\pi}}\,\sigma\,\sqrt{\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}}\, \exp\left\{-\frac{X^2}{2\sigma^2}\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}\right\}.$$
Furthermore,
$$\widehat{|\mu|}^{EB} \approx |X|\left(1 - \frac{3\sigma^2}{X^2}\right)^{\!+}.$$
How close is the approximation? [Figures comparing the exact empirical Bayes estimator with its approximation are omitted here.]
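In place of those figures, a short numerical check is easy to run; the sketch below assumes the lemma's formula exactly as reconstructed above.

```python
import numpy as np
from scipy.stats import norm

def mu_abs_eb(x, sigma=1.0):
    """Exact empirical Bayes estimate of |mu| (formula as stated above)."""
    s = np.maximum(1.0 - 3.0 * sigma**2 / x**2, 0.0)   # (1 - 3 sigma^2 / x^2)^+
    return (x * s * (2.0 * norm.cdf(x / sigma * np.sqrt(s)) - 1.0)
            + np.sqrt(2.0 / np.pi) * sigma * np.sqrt(s)
            * np.exp(-x**2 / (2.0 * sigma**2) * s))

def mu_abs_soft(x, sigma=1.0):
    """Soft-threshold approximation: |x| (1 - 3 sigma^2 / x^2)^+."""
    return np.abs(x) * np.maximum(1.0 - 3.0 * sigma**2 / x**2, 0.0)

x = np.linspace(0.1, 8.0, 80)
print(np.max(np.abs(mu_abs_eb(x) - mu_abs_soft(x))))   # inspect the gap over the grid
```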
Choice of Tuning Parameter

For fixed $S_{2i}$, $i = 1, \ldots, n$:
– Put $X = \hat\psi_2^T S_{2i}$ and $\mu = \psi_2^T S_{2i}$
– Plug in $\hat\sigma^2 = S_{2i}^T \hat\Sigma_2 S_{2i}/n$ for $\sigma^2$
– Apply the previous lemma (with the approximation)

This leads to the choice $\lambda_i = 3\, S_{2i}^T \hat\Sigma_2 S_{2i}/n$ in the soft-threshold pseudo-outcome $\hat Y_{1i}^{ST}$:
$$\hat Y_{1i}^{ST} = Y_{1i} + \hat\beta_2^T S_{20,i} + |\hat\psi_2^T S_{2i}| \left(1 - \frac{3\, S_{2i}^T \hat\Sigma_2 S_{2i}}{n\,|\hat\psi_2^T S_{2i}|^2}\right)^{\!+}$$
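Continuing the earlier Q-learning sketch (same hypothetical `S20`, `S2`, `beta2`, `psi2`; here `Sigma2_hat / n` is taken to be the estimated covariance matrix of `psi2`), the soft-threshold pseudo-outcome might be computed as:

```python
import numpy as np

def soft_threshold_pseudo_outcome(Y1, S20, S2, beta2, psi2, Sigma2_hat, n):
    """Soft-threshold pseudo-outcome with the empirical Bayes tuning
    lambda_i = 3 * S2i' Sigma2_hat S2i / n, as in the display above."""
    effect = np.abs(S2 @ psi2)                              # |psi2' S2i|
    # sigma_i^2 = S2i' (Sigma2_hat / n) S2i for each subject i
    sig2 = np.einsum('ij,jk,ik->i', S2, Sigma2_hat, S2) / n
    # (1 - 3 sigma_i^2 / |psi2' S2i|^2)^+ ; set to 0 where the effect is 0
    shrink = np.where(effect > 0,
                      np.maximum(1.0 - 3.0 * sig2 / np.maximum(effect, 1e-12)**2, 0.0),
                      0.0)
    return Y1 + S20 @ beta2 + effect * shrink
```

Plugging these pseudo-outcomes into the stage-1 regression in place of $\hat Y_{1i}$ yields the soft-threshold estimator $\hat\psi_1^{ST}$.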
Simulation Design

– 1000 simulated data sets, each of size $n = 300$
– Analysis models:
$$Q_2 = \beta_{20} + \beta_{21} O_1 + \beta_{22} A_1 + \beta_{23} O_1 A_1 + \underbrace{(\psi_{20} + \psi_{21} O_2 + \psi_{22} A_1)}_{\psi_2^T S_2}\, A_2$$
$$Q_1 = \beta_{10} + \beta_{11} O_1 + (\psi_{10} + \psi_{11} O_1) A_1$$
– The stage-2 treatment effect $\psi_2^T S_2$ governs the results
– 1000 bootstrap replications to construct CIs
Example (Non-regular): Bias is the main issue

$O_1, A_1, A_2 \in \{-1, 1\}$, each with probability 0.5
$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(0.5(O_1 + A_1))}{1 + \exp(0.5(O_1 + A_1))}$$
$$Y_1 = 0; \quad Y_2 = -0.5 A_1 + 0.5 A_2 + 0.5 A_1 A_2 + \epsilon, \quad \epsilon \sim N(0, 1)$$
Here $P[\psi_2^T S_2 = 0] = 0.5$.

Estimation of $\psi_{10}$:

Estimator        Bias      MSE      Coverage   Length
hard-max        -0.0401   0.0075   88.4*      0.2987
soft-threshold  -0.0185   0.0058   93.4       0.2923

Coverage = coverage of the nominal 95% percentile bootstrap CI
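For concreteness, a sketch of this example's generative model (matching the display above and taking $O_2 \in \{-1, 1\}$); the hard-max and soft-threshold analyses then reuse the earlier sketches. The function name is hypothetical:

```python
import numpy as np

def generate_example1(n=300, seed=None):
    """One data set from the 'bias' example: P[psi2' S2 = 0] = 0.5."""
    rng = np.random.default_rng(seed)
    O1 = rng.choice([-1, 1], n)
    A1 = rng.choice([-1, 1], n)                      # randomized, P = 0.5 each
    p = 1.0 / (1.0 + np.exp(-0.5 * (O1 + A1)))       # logistic model for O2 = 1
    O2 = np.where(rng.random(n) < p, 1, -1)
    A2 = rng.choice([-1, 1], n)                      # randomized, P = 0.5 each
    Y1 = np.zeros(n)
    Y2 = -0.5 * A1 + 0.5 * A2 + 0.5 * A1 * A2 + rng.normal(size=n)
    return O1, A1, O2, A2, Y1, Y2
```

Note that the true coefficient of $A_2$ in $Y_2$ is $0.5 + 0.5 A_1$, which is zero exactly when $A_1 = -1$, i.e., with probability 0.5.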
Example (Non-regular): Tail behavior is the main issue

$O_1, A_1, A_2 \in \{-1, 1\}$, each with probability 0.5
$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(0.5(O_1 + A_1))}{1 + \exp(0.5(O_1 + A_1))}$$
$$Y_1 = 0; \quad Y_2 = \epsilon, \quad \epsilon \sim N(0, 1)$$
Here $P[\psi_2^T S_2 = 0] = 1$.

Estimation of $\psi_{10}$:

Estimator        Bias     MSE      Coverage   Length
hard-max         0.0003   0.0045   96.8*      0.2687
soft-threshold   0.0009   0.0036   95.3       0.2498

Coverage = coverage of the nominal 95% percentile bootstrap CI
Example (Regular): Price of shrinkage

$O_1, A_1, A_2 \in \{-1, 1\}$, each with probability 0.5
$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(0.1(O_1 + A_1))}{1 + \exp(0.1(O_1 + A_1))}$$
$$Y_1 = 0; \quad Y_2 = -0.5 A_1 + 0.25 A_2 + 0.5 O_2 A_2 + 0.5 A_1 A_2 + \epsilon, \quad \epsilon \sim N(0, 1)$$
Here $P[\psi_2^T S_2 = 0] = 0$ and $\psi_2^T S_2$ is far from 0.

Estimation of $\psi_{10}$:

Estimator        Bias     MSE      Coverage   Length
hard-max         0.0009   0.0067   95.0       0.3094
soft-threshold   0.0052   0.0074   94.8       0.3191

Coverage = coverage of the nominal 95% percentile bootstrap CI
What have we seen in the simulations?

The soft-threshold estimator addresses the issues of bias and CI coverage in non-regular settings. The original hard-max estimator would be preferred over the soft-threshold estimator when the stage-2 treatments are "too different" (so-called regular settings).
– However, we cannot depend on the occurrence of regular settings in clinical trial data
– This is due to ethical considerations, e.g., "the principle of clinical equipoise" (Freedman, 1987)
Data Analysis Results

No significant stage-2 treatment effect ($n = 479$).

Stage-1 analysis summary ($n = 1848$):

Variable                  Coefficient   95% Bootstrap CI
education                  0.01         (-0.18, 0.20)
high vs. low tailoring     0.07         (-0.01, 0.11)
tailoring:education       -0.11         (-0.24, -0.00)*

The "highly individually tailored" level of story appears more effective for subjects with less education ($\le$ high school). This finding is consistent with that of Strecher et al. (2008), a logistic regression analysis of the stage-1 quit status (binary) data.
Missing Data

Missingness rates:

Variable                        % Missing
Education                       < 1%
6-month quit status             24%
Stage-1 months non-smoking      38%
Stage-2 months non-smoking      55%

We used multiple imputation (Rubin, 1987) with a multivariate normal model and "non-informative" priors (Schafer, 1997, 2008). Within each of the 500 bootstrap samples, we used 10 imputations.
Summary and End Notes

We used shrinkage to reduce bias and to construct (empirically) valid CIs for non-regular treatment effect parameters. This approach can be used for both randomized (SMART) and observational longitudinal data.
– We focused on randomized data since we wanted to separate causal inference issues from the issue of non-regularity
– Propensity score methods to estimate the treatment allocation model, in conjunction with Q-learning, will be necessary for observational data

"Adaptive, evidence-based, individualized" treatments are of great interest in the behavioral sciences and in medicine; this research is an ongoing endeavor to address that interest.
Going Forward ...

This is ongoing joint work with Erica Moodie.

How should we deal with shared parameters across stages, i.e., when some of the $\psi$'s are the same for stages 1 and 2?
– For example, this setting can arise if the investigator believes that the optimal treatment decision at each stage is the same function of the time-varying covariate (or history) at that stage (e.g., Robins, 2004; Moodie and Richardson, 2009)
– This implicitly imposes restrictions on the parameter space of the generative model, so simultaneous estimation of the parameters is necessary
– Can we devise a modified Q-learning with soft-thresholding for this setting, and make it work?
Main References

Methodological research:
Chakraborty, Murphy, and Strecher (2009). Inference for non-regular parameters in optimal dynamic treatment regimes. To appear in Statistical Methods in Medical Research.

Details on the smoking cessation trial:
Strecher, McClure, Alexander, Chakraborty, Nair, et al. (2008). Web-based smoking cessation components and tailoring depth: Results of a randomized trial. American Journal of Preventive Medicine, 34(5): 373-381.
Back-up Slides
Q-learning and Structural Nested Mean Models

A bridge between machine learning and causal inference ...

Lemma. Consider linear models for the Q-functions, and assume that: (i) the parameters in $Q_1$ and $Q_2$ are distinct; (ii) $A_j$ has zero conditional mean given $H_j$, $j = 1, 2$; and (iii) the covariates used in the model for $Q_1$ are nested within the covariates used in the model for $Q_2$, i.e., $(S_{10}^T, S_1^T A_1) \subset S_2^T$. Then Q-learning is algebraically equivalent to an inefficient version of Robins' method of estimation in Structural Nested Mean Models.
Why move through stages? Why not run an "all-at-once" multivariate regression?
– There may be non-causal association(s) even with randomized data: Berkson's paradox!
– The measured effect of the stage-1 treatment would be lower than the true effect.
More Sophisticated Bootstrap CIs

We tried the double bootstrap (Davison and Hinkley, 1997; Nankervis, 2005) with the hard-max estimator
– It gave valid coverage rates of CIs
– It is computationally much more expensive

The m-out-of-n bootstrap (Shao, 1994) is not a panacea (see the sketch after this list)
– The choice of m in the present context is not obvious: sometimes we would want m ≪ n and at other times m ≈ n
– It gives a slower rate of convergence than $\sqrt{n}$ even in regular settings (for regular subjects in the data set)
– Literature on choosing a data-adaptive m in other areas: Hall et al. (1995), Bickel and Sakov (2008)
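A generic m-out-of-n resampling loop is straightforward; the sketch below is illustrative, with `estimator` any scalar functional of the data and the choice of m left open, as discussed above.

```python
import numpy as np

def m_out_of_n_bootstrap(data, estimator, m, B=1000, seed=None):
    """Draw B resamples of size m (with replacement) from the n rows of
    `data` and apply `estimator` to each; m = n recovers the usual bootstrap."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=m)   # m draws from the n rows
        stats[b] = estimator(data[idx])
    return stats

# Toy usage (for intuition only; m-out-of-n CIs also require rescaling the
# resampled deviations by sqrt(m/n) around the full-sample point estimate)
data = np.random.default_rng(2).normal(size=(300, 1))
draws = m_out_of_n_bootstrap(data, lambda d: d.mean(), m=100)
```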
Moodie-Richardson Hard-threshold (ZIPI) Estimator

$$\hat Y_{1i}^{HT} = Y_{1i} + \hat\beta_2^T S_{20,i} + |\hat\psi_2^T S_{2i}| \cdot 1\big\{|\hat\psi_2^T S_{2i}| > \lambda_i\big\}, \quad \lambda_i > 0$$

The corresponding estimator $\hat\psi_1^{HT}$ is the hard-threshold estimator.

The hard-threshold estimator reduces the bias of the original hard-max estimator (Moodie and Richardson, 2008). The hard-threshold pseudo-outcome has two points of discontinuity. Usually $\lambda_i$ is set equal to $z_{\alpha/2}\sqrt{S_{2i}^T \hat\Sigma_2 S_{2i}/n}$, where $\hat\Sigma_2/n$ is the estimated covariance matrix of $\hat\psi_2$ and $\alpha$ is a tuning parameter to be specified by the user:
$$\hat Y_{1i}^{HT} = Y_{1i} + \hat\beta_2^T S_{20,i} + |\hat\psi_2^T S_{2i}| \cdot 1\left\{\frac{\sqrt{n}\,|\hat\psi_2^T S_{2i}|}{\sqrt{S_{2i}^T \hat\Sigma_2 S_{2i}}} > z_{\alpha/2}\right\}$$

The hard-threshold estimator is the same as the pre-test estimator (Sen and Saleh, 1987).
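A sketch parallel to the soft-threshold function earlier (same hypothetical inputs; `alpha` is the user-specified tuning parameter):

```python
import numpy as np
from scipy.stats import norm

def hard_threshold_pseudo_outcome(Y1, S20, S2, beta2, psi2, Sigma2_hat, n,
                                  alpha=0.05):
    """Hard-threshold (ZIPI) pseudo-outcome: keep the stage-2 effect term
    |psi2' S2i| only where its standardized value exceeds z_{alpha/2}."""
    z = norm.ppf(1.0 - alpha / 2.0)
    effect = np.abs(S2 @ psi2)                              # |psi2' S2i|
    # standard error of psi2' S2i: sqrt(S2i' (Sigma2_hat / n) S2i)
    se = np.sqrt(np.einsum('ij,jk,ik->i', S2, Sigma2_hat, S2) / n)
    keep = effect / se > z      # same event as sqrt(n)|psi2'S2i| / sqrt(S2i'Sigma2 S2i) > z
    return Y1 + S20 @ beta2 + effect * keep
```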
Example (Nearly Non-regular)

$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(0.5(O_1 + A_1))}{1 + \exp(0.5(O_1 + A_1))}$$
$$Y_2 = 0.01 A_2 + \epsilon, \quad \epsilon \sim N(0, 1)$$
Here $P[\psi_2^T S_2 = 0] = 0$, but $\psi_2^T S_2 \approx 0$.

Estimation of $\psi_{10}$:

Estimator        Bias     MSE      Coverage   Length
hard-max         0.0003   0.0045   96.7*      0.2690
soft-threshold   0.0008   0.0036   95.4       0.2500

Coverage = coverage of the nominal 95% percentile bootstrap CI
Example (Non-regular)

$$P[O_2 = 1 \mid O_1, A_1] = \frac{\exp(O_1)}{1 + \exp(O_1)}$$
$$Y_2 = -0.5 A_1 + A_2 + 0.5 O_2 A_2 + 0.5 A_1 A_2 + \epsilon, \quad \epsilon \sim N(0, 1)$$
Here $P[\psi_2^T S_2 = 0] = 0.25$.

Estimation of $\psi_{10}$:

Estimator        Bias      MSE      Coverage   Length
hard-max        -0.0209   0.0074   92.7*      0.3197
soft-threshold  -0.0065   0.0069   93.8       0.3198

Coverage = coverage of the nominal 95% percentile bootstrap CI
Asymptotics (work in progress)

Let $Z, Z_1^*, Z_2^*$ be mean-zero normal random variables (with different variances). Then, for large $n$:

Original hard-max estimator:
$$\sqrt{n}\,(\hat\psi_1 - \psi_1) \Rightarrow Z_1^* + E_{S_1, A_1, S_2}\Big[1_{\{\psi_2^T S_2 = 0\}}\big(2 \cdot 1_{\{S_2^T Z > 0\}} + 1\big)(S_2^T Z)(A_1 S_1)\Big]$$

Soft-threshold estimator:
$$\sqrt{n}\,(\hat\psi_1^{ST} - \psi_1) \Rightarrow Z_2^* + E_{S_1, A_1, S_2}\Bigg[1_{\{\psi_2^T S_2 = 0\}}\big(2 \cdot 1_{\{S_2^T Z > 0\}} + 1\big)(S_2^T Z)(A_1 S_1)\underbrace{\left(1 - \frac{3\, S_2^T \Sigma_2 S_2}{|S_2^T Z|^2}\right)^{\!+}}_{\text{shrinkage factor}}\Bigg]$$

The "shrinkage factor" shrinks the non-mean-zero functional towards zero!