Estimating Average Treatment Effects with a Response-Informed Calibrated Propensity Score

arXiv:1702.01349v1 [stat.ME] 4 Feb 2017

David Cheng1, Abhishek Chakrabortty2, Ashwin N. Ananthakrishnan3, and Tianxi Cai1

1 Department of Biostatistics, Harvard T.H. Chan School of Public Health
2 Department of Statistics, University of Pennsylvania
3 Division of Gastroenterology, Massachusetts General Hospital

Abstract

Approaches based on propensity score (PS) modeling are often used to estimate causal treatment effects in observational studies. The performance of inverse probability weighting (IPW) and doubly-robust (DR) estimators deteriorates under model mis-specification or when the dimension of the covariates being adjusted for is not small. We propose a response-informed calibrated PS approach that is more robust to model mis-specification and accommodates a large number of covariates while preserving the double-robustness and local semiparametric efficiency properties under correct model specification. Our approach achieves additional robustness and efficiency gains by estimating the PS using a two-dimensional smoothing over an initial parametric PS and a parametric response score. Both scores are estimated via regularized regression to accommodate covariates with a dimension that is not small. Simulations confirm these favorable properties in finite samples. We illustrate the method by estimating the effect of statins on colorectal cancer risk in an electronic medical record study and the effect of smoking on C-reactive protein in the Framingham Offspring Study.

Keywords: Causal inference; double-robustness; electronic medical records; kernel smoothing; regularized estimation; semiparametric efficiency.

1 Introduction

While randomized clinical trials (RCTs) remain the gold standard for evaluating the effects of a treatment, large-scale observational studies (OS) are valuable sources of data that complement RCTs. Among the benefits is the accumulation of large samples in broad patient populations, which enables inferences about treatment effects in subgroups not available in RCTs. But drawing conclusions about treatment effects with OS is challenging. In the absence of randomization, naive contrasts between treatment groups may exhibit substantial bias for the treatment effect. Rosenbaum and Rubin (1983) showed that adjusting for the propensity score (PS), the conditional probability of assignment to a treatment group T ∈ {0, 1} given a p-dimensional covariate vector X, can remove such bias provided that X satisfies strong ignorability. Methods for estimating treatment effects based on the PS π1(X) = P(T = 1 | X), typically estimated from the data, have since become widespread. The inverse probability weighting (IPW) estimator (Horvitz and Thompson, 1952; Rosenbaum, 1987) is one common approach, which weights responses based on π1(X). The doubly-robust (DR) estimator (Robins et al., 1994) augments an IPW estimator by a term involving a model for the conditional mean of the response Y given T and X to improve its efficiency. It achieves the semiparametric efficiency bound when both the PS and response models are correctly specified. The DR estimator is doubly-robust in that it is consistent when either the PS or response model is correctly specified (Scharfstein et al., 1999).

A principal concern of PS methods is that π1(X) is often estimated from mis-specified parametric models, which can lead to severe bias (Drake, 1993; Kang and Schafer, 2007). Nonparametric models for π1(X) (Hirano et al., 2003; McCaffrey et al., 2004) are rarely used in practice due to the curse of dimensionality and difficulties in implementation. Since the true PS balances covariate distributions between treatments, some recent methods proposed robust estimates of the PS or other weights that can be used for causal inference by directly balancing empirical moments of the covariates (Imai and Ratkovic, 2014; Zubizarreta, 2015). Some work along these lines resembles calibration estimators in survey sampling (Hainmueller, 2011; Chan et al., 2015). The performance of these methods appears to be excellent in settings with few covariates but is unclear when p is not small relative to the sample size n. The DR estimator is more robust than methods relying only on the PS, but, despite its double-robustness, mis-specification of the response model results in efficiency loss.

A second concern of equal if not greater importance is variable selection for PS models, particularly when p is not small. Suppose that π1(X) and the conditional mean of the
counterfactual responses {Y^{(1)}, Y^{(0)}} given X satisfying strong ignorability depend only on X_{Jπ} and X_{Jµ}, respectively, where X_J denotes the sub-vector of X indexed by J. It is sufficient to adjust only for X_{Jπ} to identify the counterfactual means. Adjusting for additional covariates in Jµ improves the efficiency of PS-based estimators (Lunceford and Davidian, 2004; Brookhart et al., 2006). Asymptotically, adjusting for more covariates than those in Jπ will potentially gain efficiency, and adjusting for the entire X using the DR estimator achieves semiparametric efficiency under correct model specifications even if a subset of X does not belong to Jπ ∪ Jµ. However, in finite samples, such an exhaustive "over-modeling" approach (Perkins et al., 2000; Rubin and Thomas, 1996) may lead to instability in estimation and efficiency loss.

It is thus desirable to incorporate variable selection into the PS and/or response models, which has been recently proposed by various authors to improve estimation of causal effects. Screening covariates based on marginal associations between the treatment and response (Schneeweiss et al., 2009; Hirano and Imbens, 2001) may be misleading because marginal associations need not agree with conditional associations. Bayesian model averaging provides a principled approach to PS variable selection (Zigler and Dominici, 2014) but relies on restrictive parametric assumptions and encounters burdensome computations when p is large. Farrell (2015) proposed a penalized version of the DR estimator by imposing regularization for both the PS and response models in the large-p setting. However, the validity of the influence function expansion for the estimated treatment effect requires correct specification of both models. Belloni et al. (2013) proposed to estimate Jπ ∪ Jµ via regularization while allowing X to include a large number of non-linear transformations of the original observed covariates and then estimate the treatment effect by fitting a linear response model given the selected covariates. Wilson and Reich (2014) considered estimating Jπ ∪ Jµ via a regularized loss function for estimates from initial treatment and response models and then re-fitting the response model with selected covariates, primarily focusing on the case of continuous exposures. These two approaches rely on a response model to estimate the treatment effect and may produce biased estimators if the model is mis-specified.

The problems of model mis-specification and variable selection are especially relevant in modern OS due to the large number of variables collected and the complex, largely unknown relationships among them. Studies based on electronic medical records (EMRs) or claims data can easily have hundreds or thousands of covariates available on patients' medical histories and current health. Modern cohort studies also collect information on a large number of clinical, questionnaire, and/or genetic variables. Non-linearities and interactions
among such covariates are often expected in these contexts, and yet there is usually limited subject matter knowledge to provide adequate guidance for manual model building or feature selection.

To address these issues and overcome limitations of existing methods, we propose an IPW estimator based on a response-informed calibrated PS (RiCaPS), where we estimate the PS in two steps: (1) obtain initial estimates of the PS and a response score, defined as the linear predictor based on the covariates from a response model, as a bivariate parametric score of X, and (2) calibrate the initial PS by a nonparametric estimate of the probability of treatment assignment given the bivariate score in (1). The response score in (1) can be viewed as a working prognostic score (Hansen, 2008a). We employ regularization throughout to stabilize estimates and perform variable selection as in some of the aforementioned methods. The additional calibration in (2) distinguishes our method from existing methods with respect to both robustness and efficiency gain. We show that the RiCaPS IPW estimator also achieves double-robustness and local semiparametric efficiency. It maintains additional robustness and efficiency benefits beyond those of standard DR methods when the initial PS and response models are mis-specified and when p is not small relative to n.

The rest of this paper is organized as follows. We set up the notation and causal inference framework in Section 2.1. In Sections 2.2–2.4 we describe estimating the calibrated PS and examine the properties of an IPW estimator based on the RiCaPS. A perturbation resampling procedure is proposed in Section 2.5 for interval estimation. We present simulation results showing the robustness and efficiency of RiCaPS in Section 3 and applications to estimating treatment effects in an EMR study with few covariates and in the Framingham Offspring Study (FOS) with a large number of covariates in Section 4. We conclude with some remarks in Section 5. Regularity conditions and proofs are deferred to the Supplementary Materials.

2 Method

2.1 Data and Framework

Let Y ∈ R denote a response that can be modeled by a generalized linear model (GLM), such as a binary, ordinal, or continuous response, T ∈ {0, 1} a binary treatment, and X ∈ 𝒳 a (p + 1)-dimensional vector of bounded covariates that includes 1 as its first element. We consider p to be fixed but potentially not small relative to n. The observed data consist of n independent and identically distributed (iid) observations D = {Z_i ≡ (Y_i, T_i, X_i^T)^T : i = 1, …, n}. Let Y^{(1)} and Y^{(0)} denote the counterfactual outcomes (Rubin, 1974) had an individual received treatment or control. Based on D, we want to make inferences about the average treatment effect:

$$ \Delta = E(Y^{(1)}) - E(Y^{(0)}) = \mu_1 - \mu_0. \tag{1} $$

For identifiability, we require the following standard causal inference assumptions:

$$ Y = T Y^{(1)} + (1 - T) Y^{(0)} \tag{2} $$
$$ (Y^{(1)}, Y^{(0)}) \perp T \mid X \tag{3} $$
$$ \pi_k(x) \equiv P(T = k \mid X = x) \in (0, 1) \ \text{for } k = 0, 1 \text{ when } f(x_{[-1]}) > 0 \tag{4} $$

where f(x_{[-1]}) is the joint density for X_{[-1]}, and x_{[-1]} denotes the subvector of x excluding the first element. Under these assumptions, ∆ can be identified from the g-formula (Robins, 1986):

$$ \Delta = E\{\mu(1, X) - \mu(0, X)\} = E\left\{\frac{I(T = 1)Y}{\pi_1(X)} - \frac{I(T = 0)Y}{\pi_0(X)}\right\} $$

where µ(k, X) = E(Y | X, T = k), for k = 0, 1.
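To see concretely why these assumptions identify ∆, the following short derivation (ours, using only (2)–(4) and iterated expectations) treats the k = 1 term of the g-formula; the k = 0 term is analogous:

$$ E\left\{\frac{I(T=1)\,Y}{\pi_1(X)}\right\} = E\left[\frac{E\{I(T=1)\,Y^{(1)} \mid X\}}{\pi_1(X)}\right] = E\left[\frac{\pi_1(X)\,E\{Y^{(1)} \mid X\}}{\pi_1(X)}\right] = E(Y^{(1)}) = \mu_1, $$

where the first equality uses the consistency assumption (2) and iterated expectations, the second uses strong ignorability (3), and positivity (4) ensures the ratio is well defined.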

2.2 Parametric Working Models

To estimate ∆ from D, we consider the following two parametric working models for π1(X) and µ(T, X), either of which could be correct or mis-specified:

$$ \text{Model } \mathcal{M}_{\pi}: \quad \pi_1(X; \beta^{\pi}) = g_{\pi}(X^T\beta^{\pi}) \tag{5} $$
$$ \text{Model } \mathcal{M}_{\mu}: \quad \mu(T, X; \beta_T^{\mu}, \beta^{\mu}) = g_{\mu}(T\beta_T^{\mu} + X^T\beta^{\mu}) \tag{6} $$

where β^π, β^µ ∈ R^{p+1}, β_T^µ ∈ R are unknown regression parameters, and g_π(·) and g_µ(·) are specified smooth link functions. The initial PS estimated from fitting M_π will be calibrated by smoothing T over both estimated scores X^T β̂^π and X^T β̂^µ. The response score X^T β̂^µ will serve to inform the calibration. We estimate the parameters in M_π and M_µ using regularized estimators that minimize penalized likelihood functions:

$$ \hat\beta^{\pi} = \operatorname*{argmin}_{\beta^{\pi}} \left\{ \frac{1}{n}\sum_{i=1}^{n} -\ell_{\pi}(\beta^{\pi}; T_i, X_i) + p_n^{\pi}(\beta^{\pi}_{[-1]}) \right\}, \qquad (\hat\beta_T^{\mu}, \hat\beta^{\mu T})^T = \operatorname*{argmin}_{\beta_T^{\mu}, \beta^{\mu}} \left[ \frac{1}{n}\sum_{i=1}^{n} -\ell_{\mu}\{(\beta_T^{\mu}, \beta^{\mu T})^T; Z_i\} + p_n^{\mu}(\beta_T^{\mu}, \beta^{\mu T}_{[-1]}) \right] \tag{7} $$

where ℓ_π(β^π; Z_i) and ℓ_µ(β_T^µ, β^µ; Z_i) are the log-likelihood contributions of the i-th observation and p_n^π(·) and p_n^µ(·) are penalty functions chosen such that the oracle properties (Fan and Li, 2001) hold. Examples of such estimators include the adaptive least absolute shrinkage and selection operator (LASSO) (Zou, 2006) with initial weights estimated from ridge regression. The oracle properties enable the regularization procedures to potentially select all covariates in Jπ for the PS model and only covariates in Jµ for the response model. This preserves the ignorability of treatment assignment under assumption (3) during adjustment with the PS while selecting out irrelevant covariates that would otherwise lead to efficiency loss. Regularization also stabilizes the estimates, which further contributes to efficiency gains.
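As an illustration of step (1), the following minimal Python sketch fits the two regularized working models and returns the estimated scores X^T β̂^π and X^T β̂^µ. It is not the authors' implementation: plain L1-penalized logistic regression and the LASSO from scikit-learn stand in for the adaptive LASSO with ridge-based initial weights, the tuning constants are placeholders rather than the modified BIC criterion described later, and the function name fit_initial_scores is ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Lasso

def fit_initial_scores(X, T, Y, c_ps=1.0, alpha_resp=0.1):
    """Fit regularized working models for the PS and the response,
    returning the two estimated linear scores X'beta_pi and X'beta_mu.

    X : (n, p) covariate matrix (without the intercept column)
    T : (n,) binary treatment indicator
    Y : (n,) response
    """
    # Initial PS model: L1-penalized logistic regression of T on X.
    ps_model = LogisticRegression(penalty="l1", C=c_ps, solver="liblinear")
    ps_model.fit(X, T)
    ps_score = X @ ps_model.coef_.ravel() + ps_model.intercept_[0]

    # Working response model: penalized regression of Y on (T, X);
    # only the covariate part X'beta_mu is used as the response score.
    resp_model = Lasso(alpha=alpha_resp)
    resp_model.fit(np.column_stack([T, X]), Y)
    resp_score = X @ resp_model.coef_[1:] + resp_model.intercept_

    return ps_score, resp_score
```

In practice the penalty levels would be chosen data-adaptively, for example with the modified BIC criterion used in the numerical studies.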

2.3 Response-Informed Calibrated Propensity Score

To guard against model mis-specifications, our proposed RiCaPS estimator calibrates an initial PS estimated from M_π based on smoothing T over the estimated scores Ŝ = M_β̂ X:

$$ \hat\pi_k(x; \hat\beta) = \frac{n^{-1}\sum_{j=1}^{n} K_h\{M_{\hat\beta}(X_j - x)\}\, I(T_j = k)}{n^{-1}\sum_{j=1}^{n} K_h\{M_{\hat\beta}(X_j - x)\}} = \frac{n^{-1}\sum_{j=1}^{n} K_h(\hat S_j - \hat s)\, I(T_j = k)}{n^{-1}\sum_{j=1}^{n} K_h(\hat S_j - \hat s)} \tag{8} $$

for k = 0, 1, where M_β = (β^π, β^µ)^T for β = (β^{πT}, β^{µT})^T, ŝ = M_β̂ x, K_h(u) = h^{-2} K(u/h), and K(·) is a bivariate q-th order kernel function with q > 2. The bandwidth h is chosen such that n^{1/2} h^q + n^{-1/2} h^{-2} → 0 as n → ∞, which is possible when q > 2. More discussion of h is given in the Discussion section. Smoothing over the PS direction calibrates the initial PS estimates toward the true PS under mis-specification of M_π. Smoothing over the response direction can produce efficiency gains by leveraging information from covariates in Jµ but not in Jπ. This smoothing approach addresses the dilemma of whether to take an exhaustive variable selection strategy so as to capture all covariates in Jπ ∪ Jµ or a more conservative strategy to avoid unstable estimates when selecting covariates for the PS model. Applying regularization to the PS model by itself would discard information from efficiency covariates in Jµ, which is avoided by smoothing over the response direction. A monotone transformation can be applied to Ŝ prior to smoothing to improve finite sample performance (Wand et al., 1991). In our numerical studies, we applied a standard normal cumulative distribution function transformation after standardizing the scores to obtain approximately uniformly distributed scores and scaled a component so that a common bandwidth h can be used for both components of the score.
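The calibration in (8) is a bivariate Nadaraya–Watson smoother of T over the two estimated scores. The sketch below illustrates it under simplifying assumptions: a second-order Gaussian product kernel is used instead of the higher-order (q > 2) kernel required by the theory, the default bandwidth is a crude rule of thumb rather than the plug-in choice used in the paper, and the function name calibrate_ps is ours.

```python
import numpy as np
from scipy.stats import norm

def calibrate_ps(ps_score, resp_score, T, h=None):
    """Response-informed calibration: smooth T over the two scores.

    Returns pi1_hat, the calibrated estimate of P(T = 1 | scores)
    at each observation.
    """
    # Standardize each score and map through the normal CDF so that the
    # smoothing is done on approximately uniform scales.
    S = np.column_stack([ps_score, resp_score])
    S = (S - S.mean(axis=0)) / S.std(axis=0)
    U = norm.cdf(S)

    n = U.shape[0]
    if h is None:
        h = n ** (-1.0 / 6.0)          # rough default for 2-d smoothing

    # Pairwise score differences and a Gaussian product kernel.
    diff = (U[:, None, :] - U[None, :, :]) / h       # (n, n, 2)
    K = np.exp(-0.5 * (diff ** 2).sum(axis=2))       # (n, n) kernel weights

    # Nadaraya-Watson estimate of P(T = 1 | scores) at each point.
    pi1_hat = (K * T[None, :]).sum(axis=1) / K.sum(axis=1)
    return np.clip(pi1_hat, 1e-3, 1 - 1e-3)
```

The normal-CDF transformation mirrors the standardization described above, so that a single bandwidth is reasonable for both components of the score.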

2.4 RiCaPS IPW Estimator

With the PS estimated by π̂(x; β̂), we now consider an IPW estimator for ∆ defined by:

$$ \hat\Delta = \hat\mu_1 - \hat\mu_0, \quad\text{where}\quad \hat\mu_k = \frac{n^{-1}\sum_{i=1}^{n} \hat\omega_{ik} Y_i}{n^{-1}\sum_{i=1}^{n} \hat\omega_{ik}} \quad\text{and}\quad \hat\omega_{ik} = \frac{I(T_i = k)}{\hat\pi_k(X_i; \hat\beta)} \quad\text{for } k = 0, 1. \tag{9} $$

To present our main result on the asymptotic expansion for ∆̂, let β̄ = (β̄^{πT}, β̄^{µT})^T be the limit of β̂, S̄ = M_β̄ X, π_k(X; β̄) = P(T = k | S̄), and:

$$ \bar\Delta = \bar\mu_1 - \bar\mu_0, \quad\text{where}\quad \bar\mu_k = E(\bar\omega_{ik} Y_i) \quad\text{and}\quad \bar\omega_{ik} = \frac{I(T_i = k)}{\pi_k(X_i; \bar\beta)}, \quad\text{for } k = 0, 1. $$

Regardless of the adequacy of the working models, we establish in Appendix B of the Supplementary Materials the following result.

Theorem 1. Let Ŵ_k = n^{1/2}(µ̂_k − µ̄_k) for k = 0, 1 so that n^{1/2}(∆̂ − ∆̄) = Ŵ_1 − Ŵ_0. Then under the causal assumptions (2), (3), (4) and the regularity conditions detailed in Appendix A of the Supplementary Materials, Ŵ_k has the expansion:

$$ \hat W_k = n^{-1/2}\sum_{i=1}^{n}\Big[ (Y_i^{(k)} - \bar\mu_k) + (\bar\omega_{ik} - 1)\big\{ Y_i^{(k)} - E(Y_i^{(k)} \mid \bar S_i, T_i = k) \big\} \Big] + n^{1/2}(\hat\beta^{\mu} - \bar\beta^{\mu})^T v_k^{\mu} + n^{1/2}(\hat\beta^{\pi} - \bar\beta^{\pi})^T v_k^{\pi} + O_p(n^{1/2}h^q + n^{-1/2}h^{-2}) \tag{10} $$

where v_k^µ and v_k^π are deterministic vectors, for k = 0, 1. When the PS model M_π holds, v_0^µ = v_1^µ = 0. When both M_π and the response model M_µ hold, we also have that v_0^π = v_1^π = 0.

Thus, under M_π, estimating the response model does not contribute extra variability to the asymptotic variance of ∆̂. This form of the asymptotic expansion is analogous to that of the standard DR estimator in general and exactly identical when both M_π and M_µ hold. In addition to the root-n convergence rate, the asymptotic expansion yields two important asymptotic properties that the RiCaPS IPW estimator shares with the DR estimator: double-robustness and local semiparametric efficiency. Specifically, we have the following two corollaries.

Corollary 1. Under the assumptions required for Theorem 1 and a bandwidth h = O(n^{−α}) for α ∈ (1/(2q), 1/4), when either the PS model M_π or the response model M_µ is correctly specified, ∆̂ − ∆ = O_p(n^{−1/2}).

Corollary 2. Under the assumptions required for Theorem 1 and a bandwidth h = O(n^{−α}) for α ∈ (1/(2q), 1/4), if both models M_π and M_µ hold, then ∆̂ achieves the semiparametric efficiency bound under a model correctly specified by M_π.

Compared to the standard DR estimator, the RiCaPS IPW estimator enjoys additional robustness and efficiency benefits. By weighting with a calibrated PS, ∆̂ uses PS estimates that better approximate the true PS under mis-specification of M_π. Asymptotically, π̂_1(x; β̂) converges uniformly in x to π_1(x; β̄), which coincides with the true PS π_1(x) under M_π and otherwise is a conditional probability of treatment assignment closer to π_1(x) than the parametric limiting estimate g_π(x^T β̄^π) when M_π is incorrect. We expect this calibration to make ∆̂ more robust to bias from mis-specification of the PS model. In some cases, this calibration can substantially if not completely overcome mis-specification of the PS model. For example, π_1(x; β̄) still coincides with the true PS π_1(x) if π_1(x) follows a single-index model (SIM) and X is elliptically distributed such that E(X^T b | X^T β_0^π) is linear in X^T β_0^π for any b ∈ R^{p+1}, with β_0^π being the true coefficients in the SIM, owing to the results of Li and Duan (1989). This implies that the RiCaPS IPW can be consistent even when M_π is mis-specified up to the link and M_µ is arbitrarily mis-specified. Incidentally, if the true response model follows a SIM such that µ(T, X) = g_0^µ(β_{T,0}^µ T + X^T β_0^µ) for some smooth link function g_0^µ(·) and some β_{T,0}^µ ∈ R, β_0^µ ∈ R^{p+1}, and X is elliptically distributed, then the results of Li and Duan (1989) hold approximately for β_0^µ. As a result, when M_π is arbitrarily mis-specified and M_µ is mis-specified up to the link, ∆̂ can potentially still be approximately consistent for ∆.

The term (ω̄_ik − 1){Y_i^{(k)} − E(Y_i^{(k)} | S̄_i, T_i = k)} in (10) reveals potential benefits in terms of asymptotic efficiency. The efficient influence function for ∆ in a semiparametric model under M_π is Ψ^eff = Ψ_1^eff − Ψ_0^eff, where Ψ_k^eff = Y^{(k)} − µ_k + (ω_k − 1){Y^{(k)} − µ(k, X)}, with ω_k = I(T = k)/π_k(X), for k = 0, 1 (Tsiatis, 2007). The conditional expectation µ(k, X) = E(Y | X, T = k) can be viewed as an orthogonal projection of Y^{(k)} onto the space of real-valued measurable functions of X with finite second moments. The term (ω_k − 1){Y^{(k)} − µ(k, X)} is a projection of (ω_k − 1)Y^{(k)}, which consequently has lower variance than (ω_k − 1)Y^{(k)} itself. The DR estimator attains Ψ^eff as its influence function under both M_π and M_µ. Under M_π but mis-specified M_µ, the influence function of the DR estimator takes the form Y^{(k)} − µ_k + (ω_k − 1){Y^{(k)} − ξ(k, X)}, where ξ(k, X) is some projection onto a subspace that does not contain µ(k, X). For the RiCaPS IPW, the corresponding projection is driven by a projection of µ(k, X) onto a larger subspace of measurable functions of S̄ with finite second moments, which contributes to efficiency gain. The DR and RiCaPS

IPW estimators differ slightly in how estimating β^π contributes to the asymptotic variance, and it is difficult to compare their asymptotic efficiencies analytically. However, simulation evidence suggests that the RiCaPS IPW achieves substantial efficiency gains compared to the DR estimator. An additional important feature of the RiCaPS is its efficiency gain in finite samples when p is not small relative to n. As discussed above, in such scenarios routine maximum likelihood estimators for β^π and β^µ become unstable and produce subsequent instability in the IPW and DR estimators. Smoothing to calibrate the initial PS estimate stabilizes the estimates by adjusting them toward the true PS. The use of regularization to estimate both β^π and β^µ further stabilizes the estimates while selecting the relevant covariates, which produces additional efficiency gains.
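Given calibrated PS values, the point estimate in (9) is simply a normalized weighted contrast; a minimal sketch (the helper name is ours) is:

```python
import numpy as np

def ricaps_ipw(Y, T, pi1_hat):
    """Normalized IPW contrast (9) using the calibrated PS pi1_hat."""
    w1 = T / pi1_hat                # weights for the treated group
    w0 = (1 - T) / (1 - pi1_hat)    # weights for the control group
    mu1 = np.sum(w1 * Y) / np.sum(w1)
    mu0 = np.sum(w0 * Y) / np.sum(w0)
    return mu1 - mu0
```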

2.5 Perturbation Resampling

The asymptotic variance for n^{1/2}(∆̂ − ∆̄), which can be determined from (10), needs to be estimated to construct confidence intervals (CIs). But because the expansion involves complex unknown functionals, a direct empirical estimate is challenging. We instead propose a simple perturbation-resampling procedure similar to that proposed in Cai et al. (2010). Let G = {G_i : i = 1, …, n} be non-negative iid random variables with mean and variance equal to 1 that are independent of D. The perturbation procedure perturbs each layer of estimation in the RiCaPS IPW. First, each i-th observation in the loss functions for the regularized estimators in (7) is weighted by G_i. The weighted loss is minimized to obtain the perturbed estimators β̂^{π*} and β̂^{µ*}. The perturbed RiCaPS are calculated by:

$$ \hat\pi_k^{*}(x; \hat\beta^{*}) = \frac{n^{-1}\sum_{j=1}^{n} K_h(\hat S_j^{*} - \hat s^{*})\, I(T_j = k)\, G_j}{n^{-1}\sum_{j=1}^{n} K_h(\hat S_j^{*} - \hat s^{*})\, G_j} $$

for k = 0, 1, where β̂^* = (β̂^{π*T}, β̂^{µ*T})^T and ŝ^* = M_{β̂^*} x. The perturbed RiCaPS IPW is:

$$ \hat\Delta^{*} = \hat\mu_1^{*} - \hat\mu_0^{*}, \quad\text{where}\quad \hat\mu_k^{*} = \frac{n^{-1}\sum_{i=1}^{n} \hat\omega_{ik}^{*} Y_i G_i}{n^{-1}\sum_{i=1}^{n} \hat\omega_{ik}^{*} G_i} \quad\text{and}\quad \hat\omega_{ik}^{*} = \frac{I(T_i = k)}{\hat\pi_k^{*}(X_i; \hat\beta^{*})} \quad\text{for } k = 0, 1. $$

The asymptotic distribution of n^{1/2}(∆̂^* − ∆̂) given D coincides with that of n^{1/2}(∆̂ − ∆̄). Generating a sample of ∆̂^*'s with different G's approximates the distribution for ∆̂. The variance of ∆̂ can be estimated, for example, from the sample variance or mean absolute deviation (MAD) of the ∆̂^*'s. We may construct CIs using a normal approximation or the empirical quantiles of the ∆̂^*'s. When using a non-robust method to estimate the standard error (SE) or other functionals, the perturbed samples may need to be trimmed to protect against the effect of outliers.
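A minimal sketch of the perturbation procedure, reusing the simplified model-fitting and smoothing choices from the earlier sketches (scikit-learn penalized fits and a Gaussian product kernel); the unit-mean, unit-variance exponential weights are one valid choice for G, and the bandwidth default and function names are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Lasso
from scipy.stats import norm

def perturbed_estimate(X, T, Y, G, h):
    """One perturbed RiCaPS IPW estimate with weights G (iid, mean = var = 1)."""
    # Perturb the regularized working models through observation weights.
    ps_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
    ps_model.fit(X, T, sample_weight=G)
    resp_model = Lasso(alpha=0.1)
    resp_model.fit(np.column_stack([T, X]), Y, sample_weight=G)

    s1 = X @ ps_model.coef_.ravel() + ps_model.intercept_[0]
    s2 = X @ resp_model.coef_[1:] + resp_model.intercept_
    S = np.column_stack([s1, s2])
    U = norm.cdf((S - S.mean(0)) / S.std(0))

    # Perturbed bivariate smoother and weighted IPW means.
    diff = (U[:, None, :] - U[None, :, :]) / h
    K = np.exp(-0.5 * (diff ** 2).sum(axis=2)) * G[None, :]
    pi1 = np.clip((K * T[None, :]).sum(1) / K.sum(1), 1e-3, 1 - 1e-3)
    w1, w0 = G * T / pi1, G * (1 - T) / (1 - pi1)
    return np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)

def perturbation_se(X, T, Y, h=0.2, B=500, seed=0):
    """Approximate the SE and a 95% CI for the RiCaPS IPW via perturbation."""
    rng = np.random.default_rng(seed)
    draws = [perturbed_estimate(X, T, Y, rng.exponential(1.0, size=len(Y)), h)
             for _ in range(B)]
    return np.std(draws), np.quantile(draws, [0.025, 0.975])
```

The SE can then be taken as the standard deviation (or a trimmed or MAD-based version) of the perturbed draws, and the interval as their 2.5% and 97.5% quantiles.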

3 Simulation Studies

We performed extensive simulations to assess the finite sample properties of our proposed RiCaPS IPW estimator compared to those of existing PS-based IPW estimators, including those using the true PS (True PS), the PS estimated from logistic regression (GLM PS), the PS estimated from adaptive LASSO logistic regression (ALAS PS), the DR estimator (DR), and a DR estimator using LASSO to estimate the PS and response models (LAS DR) that is akin to Farrell (2015). For the RiCaPS estimator, we use adaptive LASSO to estimate β and a Gaussian product kernel of order q = 4 with a plug-in bandwidth at the optimal order (see Discussion) in the smoothing unless noted otherwise. For all regularized estimators, we chose tuning parameters based on a modified BIC criterion that replaces log(n) by min{n^{0.1}, log(n)} to avoid excessive shrinkage in finite samples. For working models in all methods, we let g_π(u) = 1/{1 + exp(−u)} in M_π and g_µ(u) = u in M_µ. We focused on a continuous response in the simulations, where the data were generated according to:

$$ X \sim N\{0,\ 0.8 I_p + 0.2\}, \quad T \mid X \sim \text{Ber}\{\pi_1(X)\}, \quad\text{and}\quad Y \mid T, X \sim N\{\mu(T, X),\ 10^2\}. $$

Covariates shared by both π(X) and µ(T, X) induce bias when the unadjusted association between T and Y is used to estimate the causal effect. We considered sample sizes n = 500, 1000 and 2000 and p = 10, 30, 50 and 100 but only present representative subsets of the results for conciseness. Results are summarized based on 5000 replications for each setting unless noted otherwise.

In a first set of simulations we considered the empirical bias, root mean square error (RMSE), and relative efficiency (RE) compared to the DR estimator. We varied the simulations over three model specification scenarios: (A) both M_π and M_µ are correct, (B) M_π holds but M_µ is mis-specified with the true µ(T, X) generated from a double-index model (DIM), and (C) M_π is mis-specified with the true π(X) generated from a DIM but M_µ holds. Specifically, in these three scenarios, we set:

$$ \text{(A)}\quad \mu(T, X) = \beta_T^{\mu} T + X^T\beta^{\mu}, \qquad \pi_1(X) = g_\pi(X^T\beta^{\pi}) $$
$$ \text{(B)}\quad \mu(T, X) = \beta_T^{\mu} T + X^T\beta^{\mu}_{[1]}(0.5\, X^T\beta^{\mu}_{[2]} + 0.5), \qquad \pi_1(X) = g_\pi(X^T\beta^{\pi}) $$
$$ \text{(C)}\quad \mu(T, X) = \beta_T^{\mu} T + X^T\beta^{\mu}, \qquad \pi_1(X) = g_\pi\{X^T\beta^{\pi}_{[1]}(0.5\, X^T\beta^{\pi}_{[2]} + 0.5)\} $$

where β_T^µ = 1, β^µ = (0, 1, 1.4, 1, .5_{3×1}^T, −1_{3×1}^T, −1.5, 0_{(p−10)×1}^T)^T, β^π = (0, .25, 0, .25_{7×1}^T, 0_{(p−9)×1}^T)^T, β^µ_{[1]} = (1, 0, 1, 0, .5, 0, −.5, 0, −1, 0, 0_{(p−10)×1}^T)^T, β^µ_{[2]} = (0, .7, 0, .7, 0, .7, 0, .7, 0, .7, 0_{(p−10)×1}^T)^T, β^π_{[1]} = (.6, 0, .6, 0, .6, 0, .6, 0, .6, 0, 0_{(p−10)×1}^T)^T, and β^π_{[2]} = (0, .4, 0, .4, 0, −.4, 0, −.4, 0, −.4, 0_{(p−10)×1}^T)^T. We used orthogonal indices in the DIMs so that SIMs cannot provide a close approximation.

Table 1 presents the bias and RMSE for n = 500, 2000 under the different model specification scenarios when p = 10. The bias for RiCaPS is small across all these scenarios. In larger samples when n = 2000, the small amount of bias exhibited by RiCaPS diminishes toward zero and is negligible, demonstrating its double-robustness. As expected, GLM PS and ALAS PS incur substantial biases that do not diminish in large samples when M_π is mis-specified. RiCaPS, DR, and LAS DR have similar RMSE either when both models are correct or when only the response model is correct. In large samples when both M_π and M_µ are correct, all three estimators have asymptotic variances that approach the semiparametric variance bound. When M_µ is mis-specified, RiCaPS achieves a lower RMSE than all other competing IPW estimators.

|          |           | Mπ and Mµ correct |       | Mis-specified Mµ |       | Mis-specified Mπ |       |
|----------|-----------|-------------------|-------|------------------|-------|------------------|-------|
|          | Estimator | Bias              | RMSE  | Bias             | RMSE  | Bias             | RMSE  |
| n = 500  | True PS   | 0.001             | 0.443 | 0.007            | 0.430 | 0.003            | 0.542 |
|          | GLM PS    | 0.002             | 0.365 | 0.007            | 0.437 | 0.260            | 0.497 |
|          | ALAS PS   | 0.015             | 0.365 | 0.021            | 0.422 | 0.187            | 0.453 |
|          | DR        | 0.002             | 0.343 | 0.006            | 0.435 | 0.005            | 0.384 |
|          | LAS DR    | 0.003             | 0.343 | 0.007            | 0.436 | 0.005            | 0.383 |
|          | RiCaPS    | 0.018             | 0.342 | 0.025            | 0.368 | 0.055            | 0.375 |
| n = 2000 | True PS   | -0.002            | 0.227 | -0.002           | 0.215 | 0.004            | 0.276 |
|          | GLM PS    | -0.002            | 0.177 | -0.002           | 0.214 | 0.258            | 0.325 |
|          | ALAS PS   | 0.001             | 0.182 | 0.002            | 0.211 | 0.192            | 0.280 |
|          | DR        | -0.001            | 0.168 | -0.002           | 0.212 | 0.000            | 0.181 |
|          | LAS DR    | -0.001            | 0.168 | -0.002           | 0.212 | 0.000            | 0.181 |
|          | RiCaPS    | 0.009             | 0.168 | 0.011            | 0.179 | 0.021            | 0.176 |

Table 1: Empirical bias and RMSE of IPW estimators for n = 500, 2000, p = 10 and by model specification scenarios.

Figure 1 shows the RE for n = 500, 2000 across increasing p and mis-specification scenarios. When n = 500 and both M_π and M_µ are correct, RiCaPS achieves substantial efficiency gains over DR as p increases. In this scenario GLM PS exhibits favorable efficiency relative to True PS when p is small, illustrating the efficiency paradox where estimating the PS yields an efficiency gain even when the PS is known (Lunceford and Davidian, 2004). But as p increases, this efficiency gain from estimating the PS via GLM is lost due to the instability of the estimated PS. DR also suffers a similar efficiency loss due to instability in estimating the nuisance parameters when p is not small relative to n. ALAS PS, which employs regularization to stabilize the PS estimates, shows favorable efficiency when p is large. However, ALAS PS may incur substantial bias if the PS model is incorrect and is inefficient because it may select out covariates in X_{Jµ}. LAS DR achieves some efficiency gain over DR when p is not small relative to n and models are correctly specified but does not gain as much as RiCaPS. In large samples when both M_π and M_µ are correct, RiCaPS, LAS DR, and DR are all similar due to the semiparametric efficiency bound, whereas the other estimators are not fully efficient. When M_π is mis-specified, RiCaPS, LAS DR, and DR have comparable performances in large samples, but RiCaPS gains the most when p is not small relative to n. When M_µ is mis-specified, RiCaPS achieves 41%–64% greater efficiency than LAS DR and DR in large and small samples, suggesting the calibration step in RiCaPS leads to additional efficiency gain.

We also evaluated the performance of SE and CI estimation based on perturbation when both models are correct for p = 10, 30. When estimating the SEs, we considered both the MAD and a robust estimate obtained after removing outliers more than 5 MADs away from the median. We estimated the 95% CIs using the .025 and .975 quantiles of the perturbed samples. As shown in Table 2, the averages of the estimated SEs are close to the empirical SEs for both choices of the SE estimators. The empirical coverage levels of the CIs are close to the nominal level. The performance of SE and CI estimation is expected to weaken with larger p for a fixed n, but these results suggest the performance is reasonable when p is not too large.

| p  | n    | Average | ESE   | ASE (robust) | ASE (MAD) | Coverage |
|----|------|---------|-------|--------------|-----------|----------|
| 10 | 500  | 1.026   | 0.332 | 0.321        | 0.310     | 0.959    |
| 10 | 1000 | 1.008   | 0.246 | 0.223        | 0.220     | 0.925    |
| 30 | 500  | 1.017   | 0.357 | 0.331        | 0.318     | 0.952    |
| 30 | 1000 | 1.016   | 0.242 | 0.225        | 0.222     | 0.944    |

Table 2: Average estimates, empirical SE (ESE), average of the estimated SE (ASE) based on either the robust estimate or the MAD, and coverage for 95% CIs with 1000 datasets and 1000 perturbations per dataset.
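For reference, a minimal sketch (ours) of the data-generating mechanism for scenarios (A)–(C) with the coefficient values above; intercept terms are omitted for simplicity, so the coefficient vectors act on the non-constant covariates only.

```python
import numpy as np

def simulate_scenario(n=500, p=10, scenario="A", seed=0):
    """Generate (X, T, Y) following the simulation design of Section 3."""
    rng = np.random.default_rng(seed)
    # Exchangeable covariance 0.8*I + 0.2 (all pairwise covariances 0.2).
    Sigma = 0.8 * np.eye(p) + 0.2
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

    beta_mu = np.r_[1, 1.4, 1, [0.5] * 3, [-1] * 3, -1.5, np.zeros(p - 10)]
    beta_pi = np.r_[0.25, 0, [0.25] * 7, np.zeros(p - 9)]
    bmu1 = np.r_[1, 0, 1, 0, 0.5, 0, -0.5, 0, -1, 0, np.zeros(p - 10)]
    bmu2 = np.r_[0, 0.7, 0, 0.7, 0, 0.7, 0, 0.7, 0, 0.7, np.zeros(p - 10)]
    bpi1 = np.r_[0.6, 0, 0.6, 0, 0.6, 0, 0.6, 0, 0.6, 0, np.zeros(p - 10)]
    bpi2 = np.r_[0, 0.4, 0, 0.4, 0, -0.4, 0, -0.4, 0, -0.4, np.zeros(p - 10)]

    # Propensity score: linear index (A, B) or double-index model (C).
    if scenario == "C":
        lin_pi = (X @ bpi1) * (0.5 * (X @ bpi2) + 0.5)
    else:
        lin_pi = X @ beta_pi
    T = rng.binomial(1, 1 / (1 + np.exp(-lin_pi)))

    # Response: linear model (A, C) or double-index model (B); beta_T = 1.
    if scenario == "B":
        mu = T + (X @ bmu1) * (0.5 * (X @ bmu2) + 0.5)
    else:
        mu = T + X @ beta_mu
    Y = mu + rng.normal(scale=10, size=n)
    return X, T, Y
```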


Figure 1: RE of various IPW estimators estimated relative to DR for n = 500 and 2000 when (a) both models are correct with varying p, and (b) either the response model or the PS model is mis-specified with p = 50.

4 Applications

4.1 Effect of Statins on Colorectal Cancer Risk with EMR Data

We applied the RiCaPS IPW to assess the effect of statins, a medication for lowering cholesterol levels, on the risk of colorectal cancer (CRC) in patients with inflammatory bowel disease (IBD) identified from the EMRs of a large metropolitan healthcare provider. Previous studies have suggested that use of statins has a protective effect on CRC (Liu et al., 2014). Few studies have considered this potential effect among IBD patients. The EMR cohort consisted of n = 10,817 IBD patients. The CRC status and statin use were ascertained by the presence of ICD9 diagnosis codes and electronic prescriptions, respectively. We adjusted for p = 15 baseline covariates that were potential confounders, including age, gender, race, smoking status as assessed via natural language processing (NLP), indication of elevated inflammatory markers, examination with colonoscopy, use of biologics and immunomodulators, subtypes of IBD, disease duration, and presence of primary sclerosing cholangitis (PSC).

Our goal was to estimate the average treatment effect of statin use on CRC risk. We specified g_µ(·) to be the logistic link to accommodate the binary response when implementing RiCaPS. Estimators based on competing methods were also obtained for comparison. When selecting tuning parameters for the regularized estimators, the modified BIC criterion for a binary response replaces log(n) by min{(Σ_{i=1}^n Y_i)^{0.1}, log(Σ_{i=1}^n Y_i)}. The bootstrap was used to obtain SEs and CIs for competing methods. A two-sided p-value based on a Wald test for the null that statins have no effect was calculated based on the SEs.

As shown in Table 3, all methods found a significant protective effect of statins. Without adjustment, the naive risk difference is estimated to be −0.8% with SE 0.4%. After adjusting for covariates, the various IPW estimators have larger point estimates of around −2%, suggesting a protective effect. The point estimates from RiCaPS, LAS DR, and DR are similar, but the RiCaPS estimator is estimated to be 33% more efficient than the DR and LAS DR estimators.
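The modified BIC used for tuning-parameter selection can be computed as below; this is our own illustrative helper, with the degrees of freedom taken as the number of nonzero coefficients (a common convention for LASSO-type fits) and with the binary-response variant replacing n by the event count Σ_i Y_i as described above.

```python
import numpy as np

def modified_bic(loglik, n_nonzero, n, y_sum=None):
    """Modified BIC: -2*loglik + df * min(base^0.1, log(base)).

    base is the sample size n for a continuous response, or the number
    of events (sum of Y) for a binary response, as in Section 4.1.
    """
    base = y_sum if y_sum is not None else n
    penalty = min(base ** 0.1, np.log(base))
    return -2.0 * loglik + n_nonzero * penalty
```

A grid of regularization parameters can then be scanned and the value minimizing this criterion retained.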

| Method  | IBD EMR Study |       |                  |       | FOS      |       |                |   |
|---------|---------------|-------|------------------|-------|----------|-------|----------------|---|
|         | Estimate      | SE    | 95% CI           | p     | Estimate | SE    | 95% CI         | p |
| None    | -0.008        | 0.004 | (-0.017, 0.000)  | 0.043 | 0.180    | 0.057 | (0.064, 0.295) |   |
| GLM     | -0.021        | 0.003 | (-0.028, -0.015) |       |          |       |                |   |
| ALAS    | -0.021        | 0.003 | (-0.028, -0.015) |       |          |       |                |   |
| DR      | -0.020        | 0.003 | (-0.026, -0.015) |       |          |       |                |   |
| LAS DR  | -0.020        | 0.003 | (-0.026, -0.015) |       |          |       |                |   |
| RiCaPS  | -0.024        | 0.002 | (-0.029, -0.019) |       |          |       |                |   |

Table 3: Estimated average treatment effects in the IBD EMR study and the FOS.

… > 0, we obtain the desired result by taking k = (M/ε)^{1/2}.

The next lemma simplifies the average of the gradients of the inverse RiCaPS with respect to β^π and β^µ, evaluated at β̄. Terms of this form will appear multiple times when identifying the contributions to the expansion from estimating β^π and β^µ.

Lemma 2. Let g : Rp+3 → R be some real-valued square-integrable transformation of the data Z. Under assumptions listed above and additionally that E g(Zi )|S¯i = s and E Xi g(Zi )|S¯i = s are continuous in s: n

−1

n X i=1

    ∂ I(Tj = k) g(Zi ) †T −1 T ¯ ¯ ˙ π bk (Xi ; β) g(Zi ) = E Kh (Sji ) 1 − ¯ ¯ Xji ∂β πT πk (Xi ; β) lk (Xi ; β) 1

+ Op (n− 2 h−1 + n−1 h−3 )     n X I(Tj = k) g(Zi ) ∂ ‡T −1 −1 T ¯ ¯ ˙ π bk (Xi ; β) g(Zi ) = E Kh (Sji ) 1 − n ¯ ¯ Xji ∂β µT πk (Xi ; β) lk (Xi ; β) i=1 1

+ Op (n− 2 h−1 + n−1 h−3 ) for k = 0, 1. Proof. We will show the first equality for the gradient with respect to β π . The second equality is analogous. The general strategy is to write the left hand side as a V-statistic and apply the V-statistic projection lemma. We begin by simplifying the gradient itself. ∂ ¯ −1 = π bk (Xi ; β) ∂β πT

∂ b ¯b ¯ f (Xi ; β) lk (Xi ; β) ∂β πT

=n

¯ ∂πT b ¯ − fb(Xi ; β) lk (Xi ; β) ∂β b ¯2 lk (Xi ; β)

−1

n X

b ¯ − I(Tj = k)fb(Xi ; β) ¯ lk (Xi ; β) K˙ h (S¯ji )T Xji†T 2 b ¯ lk (Xi ; β) j=1

Plugging this into the left hand side: −1

n

n X i=1

X b ¯ b ¯ g(Zi ) ∂ −2 ˙ h (S¯ji )T lk (Xi ; β) − I(Tj = k)f (Xi ; β) X †T g(Zi ) = n K ji b ¯2 ∂β πT π bk (Xi ; β¯π , β¯µ ) lk (Xi ; β) i,j

¯ − I(Tj = k)f (Xi ; β) ¯ lk (Xi ; β) Xji†T g(Zi ) K˙ h (S¯ji )T b ¯2 lk (Xi ; β) i,j n o b ¯ ¯ b ¯ ¯ l (X ; β) − l (X ; β) − I(T = k) f (X ; β) − f (X ; β) X i i k i k i j + n−2 K˙ h (S¯ji )T Xji†T g(Zi ) 2 b ¯ lk (Xi ; β) i,j X ¯ − I(Tj = k)f (Xi ; β) ¯ lk (Xi ; β) = n−2 K˙ h (S¯ji )T Xji†T g(Zi ) + Op (an ) 2 b ¯ lk (Xi ; β) i,j

= n−2

X

23

where the last step can be shown using the uniform convergence of the kernel estimators in the numerator: 2 X k=0

¯ − l(Xi ; β)| ¯ = sup |fb(Xi ; β) ¯ − f (Xi ; β))| ¯ = Op (an ), sup |b lk (Xi ; β)

Xi ∈X

Xi ∈X

and noting that the remaining double sum is Op (1). We now continue simplifying the main term to obtain a proper V-statistic: n−2

¯ − I(Tj = k)f (Xi ; β) ¯ lk (Xi ; β) Xji†T g(Zi ) K˙ h (S¯ji )T 2 b ¯ lk (Xi ; β) i,j X ¯ − I(Tj = k)f (Xi ; β) ¯ lk (Xi ; β) K˙ h (S¯ji )T = n−2 Xji†T g(Zi ) ¯2 lk (Xi ; β)

X

i,j

( + n−2

X

= n−2

X

 ¯ − I(Tj = k)f (Xi ; β) ¯ X †T g(Zi ) K˙ h (S¯ji )T lk (Xi ; β) ji

i,j

i,j

1 − ¯2 b ¯ 2 lk (Xi ; β) lk (Xi ; β) 1

¯ − I(Tj = k)f (Xi ; β) ¯ lk (Xi ; β) Xji†T g(Zi ) + Op (an ) K˙ h (S¯ji )T ¯2 lk (Xi ; β)

where the last step can be shown using a similar argument as above by using uniform convergence of the kernel estimators in the numerator and noting that the remaining double sum is Op (1). We now define some terms to facilitate the application of the V-statistic projection lemma.   ¯ − I(Tj = k)f (Xi ; β) ¯ l (X ; β) k i †T T m1,k (Zj ) = EZi K˙ h (S¯ji ) Xji g(Zi ) ¯2 lk (Xi ; β)   ¯ ¯ †T T lk (Xi ; β) − I(Tj = k)f (Xi ; β) ¯ ˙ m2,k (Zi ) = EZj Kh (Sji ) Xji g(Zi ) ¯2 lk (Xi ; β)   ¯ − I(Tj = k)f (Xi ; β) ¯ l (X ; β) k i †T T mk = E K˙ h (S¯ji ) Xji g(Zi ) ¯2 lk (Xi ; β) ¯ − I(Ti = k)f (Xi ; β) ¯ lk (Xi ; β) ε1,k = n−1 E|K˙ h (S¯ii )T Xii†T g(Zi )| 2 ¯ lk (Xi ; β) " 2 #!1/2 ¯ ¯ l (X ; β) − I(T = k)f (X ; β) k i j i ε2,k = n−1 E K˙ h (S¯ji )T Xji†T g(Zi ) 2 ¯ lk (Xi ; β) 24

)

We now evaluate each of the terms. The first term can be simplified through a standard change-of-variables argument:  o ¯lk (S¯i ) − I(Tj = k)f¯(S¯i ) n †T T E Xji g(Zi )|S¯i m1,k (Zj ) = ES¯i K˙ h (S¯ji ) ¯lk (S¯i )2   Z n o I(Tj = k) 1 T ¯ ˙ = Kh (Sj − s¯1 ) 1 − E Xji†T g(Zi )|S¯i = s¯1 d¯ s1 π ¯k (¯ s1 ) π ¯k (¯ s1 )   Z n o I(Tj = k) 1 †T −1 T ¯i = hψj + S¯j dψj ˙ =h K(ψj ) 1 − E X g(Z )| S i ji π ¯k (hψj + S¯j ) π ¯k (hψj + S¯j ) = O(h−1 ) where the last step the integrand, using that π ¯k (¯ s) is bounded  can be shown from bounding  ¯ ¯ away from 0, E Xi g(Zi )|Si = s and E g(Zi )|Si = s are continuous over a compact ˙ support, and the integrability of K(u). Similarly for the second term:     ¯j ) π ¯ ( S g(Z ) k i †T T )X |S¯j m2,k (Zi ) = ES¯j K˙ h (S¯ji ) E (1 − π ¯k (S¯i ) ji π ¯k (S¯i )   Z ¯i ) π ¯ (hψ + S g(Zi ) ¯ k i †T −1 T ˙ i ) E (1 − =h K(ψ )Xji |S¯j = hψi + S¯i f (hψi + S¯i )dψi ¯ π ¯k (Si ) π ¯k (S¯i ) = O(h−1 ) where the last step can be shown by bounding the integrand, using that π ¯k (¯ s) is bounded ¯ ˙ away from 0, f (¯ s) is bounded, X is bounded, and the integrability of K(u). For the last terms, note that: ¯ − I(Tj = k)f (Xi ; β) ¯ lk (Xi ; β) 0g(Zi )| = 0 ε1 = n−1 E|K˙ h (0)T ¯2 lk (Xi ; β) " 2 #!1/2 ¯ ¯ l (X ; β) − I(T = k)f (X ; β) k i j i ε2 = n−1 E K˙ h (S¯ji )T Xji†T g(Zi ) = O(n−1 h−3 ) 2 ¯ lk (Xi ; β) ˙ where the order of ε2 can be found by bounding terms in the expectation, using that K(u) ¯ is bounded and bounded away from 0, f (Xi ; β) ¯ is bounded, and X is is bounded, lk (x; β)

25

bounded. Finally, application of the projection lemma yields: n−2

X i,j

¯ − Tj f (Xi ; β) ¯ l(Xi ; β) K˙ h (S¯ji )T Xji†T g(Zi ) 2 ¯ l(Xi ; β) n

n

1X 1X = mk + m1,k (Zj ) − mk + m2,k (Zi ) − mk + Op (ε1 + ε2 ) n j=1 n i=1 1

= mk + Op (n− 2 h−1 ) + Op (n−1 h−3 ) for k = 0, 1, where the last line follows from application of Lemma 1. Collecting all the results from above we obtain the desired equality. The next lemma shows that the normalizing constant in the IPW is 1 up to some lower order terms, which will allow us to account for the normalization in the expansion. The approach to analyzing this constant also parallels that of the main expansion. Lemma 3. Under the assumptions listed above, the normalizing constant is: n

−1

n X

−1

ω bik = n

i=1

n X I(Ti = k) = 1 + Op (an ) b π bk (Xi ; β) i=1

for k = 0, 1. Proof. We begin by decomposing the sum: n

−1

n X i=1

 n  1 1X 1 ω bik = n ω ¯ ik + ¯ − πk (Xi ; β) ¯ I(Ti = k) n i=1 π bk (Xi ; β) i=1 ) ( n 1X 1 1 + − ¯ I(Ti = k) b n i=1 π π bk (Xi ; β) bk (Xi ; β) −1

n X

= Sb1,k + Sb2,k + Sb3,k for k = 0, 1. The first term is: Sb1,k = 1 + n−1

n X 1 I(Ti = k) ( − 1) = 1 + Op (n− 2 ). ¯ πk (Xi ; β) i=1

26

The second term can be controlled: |Sb2,k | = |n−1

n X ¯ −π ¯ πk (Xi ; β) bk (Xi ; β) ¯ k (Xi ; β) ¯ I(Ti = k)| π bk (Xi ; β)π i=1

¯ − πk (Xi ; β)|n ¯ −1 ≤ sup |b πk (Xi ; β) Xi ∈X

n X i=1

I(Ti = k) ¯ k (Xi ; β) ¯ = Op (an ) π bk (Xi ; β)π

where the last step follows from applying the uniform convergence of π b and noting the remaining double sum is Op (1). The third term can be written: ( ) n X 1 1 1 1 − + − Sb3,k = n−1 I(Ti = k) π µ π µ π µ b b ¯ b ¯ b bk (Xi ; β¯π , β¯µ ) π bk (Xi ; β , β ) π bk (Xi ; β , β ) π bk (Xi ; β , β ) π i=1 ) ( n X ∂ ∂ 1 1 (βbπ − β¯π ) + (βbµ − β¯µ ) I(Ti = k) = n−1 πT µT π µ ¯ ¯ π µ ¯ b ∂β π ∂β π bk (Xi ; β , β ) bk (Xi ; β , β ) i=1 + Op (||βbπ − β¯π ||2 + ||βbµ − β¯µ ||2 )  n  X ∂ 1 ∂ 1 −1 π π µ µ b ¯ b ¯ =n πT π ¯π , β¯µ ) (β − β ) + ∂β µT π ¯π , β¯µ ) (β − β ) I(Ti = k) ∂β b (X ; β b (X ; β k i k i i=1 + Op (||βbπ − β¯π ||2 + ||βbµ − β¯µ ||2 + ||βbπ − β¯π ||||βbµ − β¯µ ||) where the second and third equalities can be shown using that ∂b πk (Xi ; β)−1 /∂β πT and ˙ ∂b πk (Xi ; β)−1 /∂β µT are Lipshitz continuous in β, which follows from that K(u) is Lipshitz continuous in u and b lk (x; β) and fb(x; β) are Lipschitz continuous in β. To simplify the average of the gradients, we apply Lemma 2, taking g(Zi ) = I(Ti = k). n

−1

n X i=1

    I(Tj = k) I(Ti = k) †T ∂ I(Ti = k) T ¯ ˙ = E Kh (Sji ) 1 − ¯ ¯ Xji ∂β πT π bk (Xi ; β¯π , β¯µ ) πk (Xi ; β) lk (Xi ; β) 1

+ Op (n− 2 h−1 + n−1 h−3 )

27

where:     I(Tj = k) I(Ti = k) †T T ¯ ˙ E Kh (Sji ) 1 − ¯ ¯ Xji πk (Xi ; β) lk (Xi ; β)  n o 1 h = E K˙ h (S¯ji )T ¯ ¯ π ¯k (S¯i ) E(Xj†T |S¯j ) − E(Xi†T |S¯i , Ti = k) lk (Si ) n oi †T †T −π ¯k (S¯j ) E(Xj |S¯j , Tj = k) − E(Xi |S¯i , Ti = k) Z Z h n o ¯ ˙ 1 )T f (hψ1 + s¯1 ) π ¯k (¯ s1 ) E(Xj†T |S¯j = hψ1 + s¯1 ) − E(Xi†T |S¯i = s¯1 , Ti = k) = h−1 K(ψ π ¯ (¯ s) n k 1 oi −π ¯k (hψ1 + s¯1 ) E(Xj†T |S¯j = hψ1 + s¯1 , Tj = k) − E(Xi†T |S¯i = s¯1 , Ti = k) dψ1 d¯ s1 = O(h−1 ) where the last step can be shown by bounding the integrand using that f¯(¯ s), πk (¯ s), and X ˙ are bounded, π ¯k (¯ s) is bounded away from 0, and K(u) is integrable. Analogously applying Lemma 2 to the gradient with respect to β µ with g(Zi ) = I(Ti = k) we have: −1

n

n X i=1

1 ∂ I(Ti = k) = O(h−1 ) + Op (n− 2 h−1 + n−1 h−3 ) µT π µ ¯ ¯ ∂β π bk (Xi ; β , β )

Collecting all the results from above we have: −1

n

n X 1 1 I(Ti = k) = 1 + Op (n− 2 ) + Op (an ) + Op (n− 2 h−1 + n−1 h−1 + n−3/2 h−3 + n−1 ) b π bk (Xi ; β) i=1

= 1 + Op (an )

B.2 Expansion of Ŵ_k

ck into separate terms The approach for the asymptotic expansion will be to decompose W and then separately analyze each term. In particular, analysis of the second and third terms proceeds by first writing them in terms of V-statistics and then applying the Vstatistic projection lemma, as in the above lemmas. We begin by showing that using the 28

normalizing constant in the IPW effectively contributes a Op (an ) term to the expansion. ) !−1 ( n n X X 1 1 ck = n 2 (b ω bik (Yi − µ ¯k ) ω bik n−1 µk − µ ¯k ) = n 2 n−1 W i=1

i=1 1

= n− 2

n X

n   −1 X −1 2 ω bik (Yi − µ ¯k ) + {1 + Op (an )} − 1 n ω bik (Yi − µ ¯k )

i=1

i=1

fk + Op (an )W fk =W fk = Op (1) so that where we use Lemma 3 in the second to last step. We will show that W fk = Op (an ). We now write: the remainder Op (an )W fk = W f1,k + W f2,k + W f3,k W f1,k = n− 12 Pn ω ¯k ), where W i=1 ¯ (Yi − µ f2,k = n− 21 W f3,k = n− 12 and W

n X ¯ πk (Xi ; β) ( ωik (Yi − µ ¯k ) ¯ − 1)¯ π b (X ; β) k i i=1 n X ( i=1

B.3 Expansion of W̃_{2,k}



1 ¯k ) ¯ )I(Ti = k)(Yi − µ π bk (Xi ; β)

f2,k Expansion of W

f2,k = −n− 12 W

n b X ¯ − fb(Xi ; β)π ¯ k (Xi ; β) ¯ lk (Xi ; β)

ω ¯ ik (Yi − µ ¯k ) b ¯ lk (Xi ; β) b ¯ − fb(Xi ; β)π ¯ k (Xi ; β) ¯ 1 lk (Xi ; β) = −n− 2 ω ¯ ik (Yi − µ ¯k ) ¯ lk (Xi ; β) i=1 ( ) n o n X 1 1 − 21 b ¯ b ¯ ¯ lk (Xi ; β) − f (Xi ; β)πk (Xi ; β) ω ¯ ik (Yi − µ ¯k ) −n − ¯ b ¯ lk (Xi ; β) lk (Xi ; β) i=1 i=1 n X

1

= −n− 2

n b X ¯ − fb(Xi ; β)π ¯ k (Xi ; β) ¯ 1 lk (Xi ; β) 2 2a ) ω ¯ (Y − µ ¯ ) + O (n ik i k p n ¯ lk (Xi ; β) i=1

f f 2,k + Op (n 12 a2n ) =W 29

where the second to last step can be shown by using the uniform convergence of of f b ¯ − fb(Xi ; β)π ¯ k (Xi ; β) ¯ and b ¯ − lk (Xi ; β). ¯ We now decompose W f 2,k furlk (Xi ; β) lk (Xi ; β) 1

ther into centered and non-centered n 2 -scaled V-statistics: X  ω ¯ ik (Yi − µ ¯k ) f f f f 1,2,k + W f 2,2,k f 2,k = −n 12 n−2 Kh (S¯ji ) I(Tj = k) − π ¯k (S¯i ) =W W ¯lk (S¯i ) i,j where: X  ω ¯ ik (Yi − µ ¯k ) f f 1,2,k = −n 21 n−2 W Kh (S¯ji ) I(Tj = k) − π ¯k (S¯j ) ¯lk (S¯i ) i,j X  ω ¯ ik (Yi − µ ¯k ) f f 2,2,k − n 21 n−2 and W Kh (S¯ji ) π ¯k (S¯j ) − π ¯k (S¯i ) ¯lk (S¯i ) i,j

We now prepare to apply the projection lemma to each of these terms. For the centered f f 1,2,k , define the following: V-statistic W    ω ¯ (Y − µ ¯ ) ik i k m1,1,2,k (Zj ) = EZi Kh (S¯ji ) I(Tj = k) − π ¯k (S¯j ) ¯lk (S¯i )    ω ¯ ik (Yi − µ ¯k ) ¯ ¯ m2,1,2,k (Zi ) = EZj Kh (Sji ) I(Tj = k) − π ¯k (Sj ) ¯lk (S¯i )    ω ¯ ik (Yi − µ ¯k ) ¯ ¯ m1,2,k = E Kh (Sji ) I(Tj = k) − π ¯k (Sj ) ¯lk (S¯i )  ω ¯ ik (Yi − µ ¯k ) ε1,1,2,k = n−1 E|Kh (S¯ii ) I(Ti = k) − π ¯k (S¯i ) ¯lk (S¯i ) | ( 2 !)1/2   ω ¯ (Y − µ ¯ ) ik i k ε2,1,2,k = n−1 E Kh (S¯ji ) I(Tj = k) − π ¯k (S¯j ) ¯lk (S¯i ) Now we evaluate each of these terms. The first term can be obtained through a standard

30

change-of-variables argument:    E(Yi |S¯i , Ti = k) − µ ¯k ¯ ¯ ¯k (Sj ) m1,1,2,k (Zj ) = ES¯i Kh (Sji ) I(Tj = k) − π ¯lk (S¯i ) Z  E(Yi |S¯i = s¯1 , Ti = k) − µ ¯k = Kh (S¯j − s¯1 ) I(Tj = k) − π ¯k (S¯j ) d¯ s1 π ¯k (¯ s1 ) Z  = K(ψj )ξ¯k (hψj + S¯j ) I(Tj = k) − π ¯k (S¯j ) dψj   Z  hq ⊗q ∂ ¯ ¯∗ T ∂ ¯ ¯ ¯ ¯ = K(ψj ) ξk (Sj ) + hψj ξk (Sj ) + . . . + ψj ⊗ ⊗q ξk (Sj ) dψj I(Tj = k) − π ¯k (S¯j ) ∂s q! ∂s   q = ξ¯k (S¯j ) + Op (h ) I(Tj = k) − π ¯k (S¯j )  = E(Yi − µ ¯k |S¯i = S¯j , Ti = k) {¯ ωjk − 1} + Op (hq ) I(Tj = k) − πk (S¯j ) where ξ¯k (s) = E(Yi − µ ¯k |S¯i = s, Ti = k)/¯ πk (s) and S¯j∗ is an intermediate such that ∗ ||S¯j − S¯j || ≤ ||hψj ||. In the second to last step, we bound the remainder term using that E(Y |S¯ = s, T = k) and π ¯k (s) are q-times continuously differentiable for k = 0, 1 and X is compact. Due to the centering in this first V-statistic:    ω ¯ (Y − µ ¯ ) ik i k =0 m2,1,2,k (Zi ) = ES¯j Kh (S¯ji ) π ¯k (S¯j ) − π ¯k (S¯j ) ¯lk (S¯i ) m1,2,k = EZi {m2,1,2,k (Zi )} = 0 For the error terms: ε1,1,2,k ε2,1,2,k

  ω ¯ ik |Yi − µ ¯k | ¯ = n h K(0)E |I(Ti = k) − π ¯k (Si )| ¯ ¯ = O(n−1 h−2 ) lk (Si ) ( 2 !)1/2   I(T = k)(Y − µ ¯ ) i i k = n−1 E = O(n−1 h−2 ) Kh (S¯ji ) I(Tj = k) − π ¯k (S¯j ) π ¯k (S¯i )¯lk (S¯i ) −1 −2

which can be shown by bounding terms inside the expectations using that E(Y |S¯ = s, T = k) and E(Y 2 |S¯ = s, T = k) are continuous, X is compact, π ¯k (s) and ¯lk (s) are bounded and bounded away from 0, and K(u) is bounded. By application of the projection lemma,

31

the first V-statistic can now be written: ( ) n n X X 1 f −1 −1 f 1,2,k = −n 2 n m1,1,2,k (Zj ) + n m2,1,2,k (Zi ) − m1,2,k + Op (ε1,1,2,k + ε2,1,2,k ) W j=1 1

= −n− 2

i=1

n X

1 I(Tj = k) − 1) + Op (hq ) + Op ( √ 2 ) E(Y |S¯ = S¯j , T = k)( ¯ nh π ¯k (Sj ) j=1

f f 2,2,k , define the following: For the non-centered V-statistic W    ω ¯ ik (Yi − µ ¯k ) ¯ ¯ ¯ m1,2,2,k (Zj ) = EZi Kh (Sji ) π ¯k (Sj ) − π ¯k (Si ) ¯lk (S¯i )    ω ¯ ik (Yi − µ ¯k ) ¯ ¯ ¯ m2,2,2,k (Zi ) = EZj Kh (Sji ) π ¯k (Sj ) − π ¯k (Si ) ¯lk (S¯i )    ω ¯ (Y − µ ¯ ) ik i k m2,2,k = E Kh (S¯ji ) π ¯k (S¯j ) − π ¯k (S¯i ) ¯lk (S¯i )  ω ¯ ik (Yi − µ ¯k ) ε1,2,2,k = n−1 E|Kh (S¯ii ) π ¯k (S¯i ) − π ¯k (S¯i ) ¯lk (S¯i ) | ( 2 !)1/2   ω ¯ (Y − µ ¯ ) ik i k ε2,2,2,k = n−1 E Kh (S¯ji ) π ¯k (S¯j ) − π ¯k (S¯i ) ¯lk (S¯i ) We now evaluate each of these terms. For the first term we have:  ¯i , Ti = k)   E(Yi − µ ¯ | S k m1,2,2,k (Zj ) = ES¯i Kh (S¯ji ) π ¯k (S¯j ) − π ¯k (S¯i ) ¯lk (S¯i ) Z  s1 )d¯ s1 = Kh (S¯j − s¯1 ) π ¯k (S¯j ) − π ¯k (¯ s1 ) ξ¯k (¯ Z  = K(ψj ) π ¯k (S¯j ) − π ¯k (hψj + S¯j ) ξ¯k (hψj + S¯j )dψj   Z h2 T ∂ hq ⊗q ∂ T ∂ ∗ ¯ ¯ ¯ ¯ ¯ = K(ψj ) −hψj π ¯k (Sj ) − ψj ⊗2 π ¯ (Sj )ψj − . . . − ψj ⊗ ⊗q π ¯k (Sj ) ξ(hψ j + Sj )dψj ∂s 2! ∂s q! ∂s = Op (hq ) where S¯ ∗ is such that ||S¯j∗ − S¯j || ≤ ||hψj ||. The last step can be shown by bounding the remainder term by using that π ¯k (s) is q-times continuously differentiable, π ¯k (s) and 32

E(Y |S¯ = s, T = k) are continuous, and X is compact. Similarly for the second term:    ω ¯ ik (Yi − µ ¯k ) ¯ ¯ ¯ m2,2,2,k (Zi ) = ES¯j Kh (Sji ) π ¯k (Sj ) − π ¯k (Si ) ¯lk (S¯i ) Z  ω ¯ ik (Yi − µ ¯k ) ¯ = Kh (¯ s2 − S¯i ) π ¯k (¯ s2 ) − π ¯k (S¯i ) s2 )d¯ s2 ¯lk (S¯i ) f (¯ Z  ω ¯ ik (Yi − µ ¯k ) = K(ψi ) π ¯k (hψi + S¯i ) − π ¯k (S¯i ) f¯(hψi + S¯i )dψi ¯ ¯ lk (Si )   Z q h ∂ ω ¯ ik (Yi − µ ¯k ) ⊗q T ∂ ∗ = K(ψi ) hψi π ¯k (S¯i ) + . . . + ψi ⊗ ⊗q π ¯k (S¯i ) f¯(hψi + S¯i )dψi ¯ ¯ ∂s q! ∂s lk (Si ) q = Op (h ) where the last step can be shown by bounding the remainder term using that π ¯k (s) is q-times continuously differentiable, f¯(s) is continuous, and X is compact. The errors are: ε1,2,2,k = n−1 E|Kh (0)0 ( ε2,2,2,k = n−1

E



I(Ti = k)(Yi − µ ¯k ) |=0 ¯ ¯ ¯ π ¯k (Si )lk (Si )

ω ¯ ik (Yi − µ ¯k ) Kh (S¯ji ) π ¯k (S¯j ) − π ¯k (S¯i ) ¯lk (S¯i ) 

2 !)1/2

= O(n−1 h−2 )

where the order of the second error can be shown by bounding terms in the expectation using that E(Y 2 |S¯ = s, T = k) is continuous, X is compact, π ¯k (¯ s) and ¯lk (¯ s) are bounded and bounded away from 0, and K(u) is bounded. Upon application of the projection lemma, the second V-statistic can now be written: n n h X X 1 f −1 −1 f 2 W 2,2,k = −n n {m1,2,2,k (Zj ) − m2,2,k } + n {m2,2,2,k (Zi ) − m2,2,k } j=1

i=1

i + m2,2,k + Op (ε1,2,2,k + ε2,2,2,k ) 1

1

= Op (hq ) − n 2 m2,2,k + Op (n− 2 h−2 ) where the last equality follows from application of Lemma 1. To complete the analysis of f f 2,k , it remains for us to identify the order of √nm2,2,k . We evaluate m2,2,k using a similar W

33

approach.   E(Yi |S¯i , Ti = k) − µ ¯k ¯ ¯ ¯ = E Kh (Sji ) π ¯k (Sj ) − π ¯k (Si ) ¯lk (S¯i ) Z Z ¯ s1 )f¯(¯ = Kh (¯ s21 ) {¯ πk (¯ s2 ) − π ¯k (¯ s1 )} ξ(¯ s2 )d¯ s2 d¯ s1 Z Z ¯ s1 )d¯ = K(ψ1 ) {¯ πk (hψ1 + s¯1 ) − π ¯k (¯ s1 )} f¯(hψ1 + s¯1 )dψ1 ξ(¯ s1   Z Z ∂ hq ∂ = K(ψ1 ) hψ1T π ¯k (¯ s1 ) + . . . + ψ1⊗q ⊗ ⊗q π s1 )d¯ s1 ¯k (¯ s∗1 ) f¯(hψ1 + s¯1 )dψ1 ξ¯k (¯ ∂s q! ∂s = O(hq ) 

m2,2,k

where the last step can be shown by bounding the remainder term using that π ¯k (s) is q-times continuously differentiable, f¯(s) is continuous, and X is compact. The second V-statistic is now of the order: f f 2,2,k = Op (hq ) + Op (n 21 hq ) + Op (n− 21 h−2 ) W

B.4 Expansion of W̃_{3,k}

The third term can be written: ) ( n X 1 1 1 f3,k = n− 2 ¯k ) W − ¯π , β¯µ ) I(Ti = k)(Yi − µ bπ , βbµ ) π b (X ; β k i π b (X ; β k i (i=1 ) n X 1 ∂ ∂ 1 1 = n− 2 (βbπ − β¯π ) + (βbµ − β¯µ ) I(Ti = k)(Yi − µ ¯k ) πT µT π π, β µ) ¯ ¯ π µ ¯ b ∂β ∂β b (X ; β k i π b (X ; β , β ) k i i=1 n 1 o + Op n 2 ||βbπ − β¯π ||2 + ||βbµ − β¯µ ||2 n X

n

X ∂ I(Ti = k)(Yi − µ ¯k ) 1 bπ ¯π ¯k ) 1 bµ ¯µ ∂ I(Ti = k)(Yi − µ −1 2 (β − β ) + n 2 =n n πT µT π µ π µ ¯ , β¯ ) ¯ , β¯ ) n (β − β ) ∂β ∂β π b (X ; β π b (X ; β k i k i i=1 i=1 n 1 o π π 2 µ µ 2 π π + Op n 2 ||βb − β¯ || + ||βb − β¯ || + ||βb − β¯ ||||βbµ − β¯µ || −1

using that ∂b πk (Xi ; β)−1 /∂β πT and ∂b πk (Xi ; β)−1 /∂β µT are Lipshitz continuous in β, which ˙ follows from that K(u) is Lipshitz continuous in u and b lk (x; β) and fb(x; β) are Lipschitz continuous in β. 34

Applying Lemma 2 taking g(Zi ) = I(Ti = k)(Yi − µ ¯1 ) we have: n−1

n X

¯k ) ∂ I(Ti = k)(Yi − µ πT ∂β π bk (Xi ; β¯π , β¯µ ) i=1     I(T = k) I(T = k)(Y − µ ¯ ) j i i k †T − 21 −1 T h + n−1 h−3 ) = E K˙ h (S¯ji ) 1 − X + O (n p ji ¯ ¯ πk (Xi ; β) lk (Xi ; β) 1

ekπT + Op (n− 2 h−1 + n−1 h−3 ) =v n 1 X ∂ I(Ti = k)(Yi − µ ¯k ) µT n i=1 ∂β π bk (Xi ; β¯π , β¯µ )     I(T = k) I(T = k)(Y − µ ¯ ) j i i k ‡T − 21 −1 T h + n−1 h−3 ) = E K˙ h (S¯ji ) 1 − X + O (n p ji ¯ ¯ πk (Xi ; β) lk (Xi ; β) 1

ekµT + Op (n− 2 h−1 + n−1 h−3 ) =v ekπ and vkµ = limn→∞ v ekµ denote the limiting values. It for k = 0, 1. Let vkπ = limn→∞ v ekπ and v ekµ are O(1) in general, vkµ = 0T when the PS model is remains for us to verify that v correctly specified, and vkπ = vkµ = 0 when both the PS and response models are correctly specified, for k = 0, 1. We first consider the general case, in which we can first further simplify the mean. π ¯k (S¯i ) − I(Tj = k) ekπT = E K˙ h (S¯ji )T v ¯lk (S¯i ) h

· E(Yi − µ ¯k |S¯i , Ti =

k)Xj†T

n oi †T ¯ − E (Yi − µ ¯k )Xi |Si , Ti = k

!

n oi K˙ h (S¯ji )T h †T ¯ †T ¯ ¯ ¯k )Xi |Si , Ti = k E(Yi − µ ¯k |Si , Ti = k)E(Xj |Sj ) − E (Yi − µ =E f¯(S¯i ) ! n oi ¯ h †T †T − K˙ h (S¯ji ) ¯ ¯ ¯k )Xi |S¯i , Ti = k E(Yi − µ ¯k |S¯i , Ti = k)E(Xj |S¯j , Tj = k) − E (Yi − µ lk (Si ) ¯k (Sj ) Tπ

which can be expressed in the form: ) ( 4 4 o n X X T T¯ πT ¯ ¯ ¯ ¯ ¯ ¯ ¯ ¯ ¯ ˙ ˙ e v = E Kh (Sji ) ζ1,u,k (Si )ζ2,u,k (Sj ) = E Kh (Sji ) ζ1,u,k (Si )ζ2,u,k (Sj ) k

u=1

u=1

35

where: E(Yi − µ ¯k |S¯i , Ti = k) ¯ ζ2,1,k (S¯j ) = E(Xj†T |S¯j ) ζ¯1,1,k (S¯i ) = f¯(S¯i ) n o †T ¯ E (Yi − µ ¯k )Xi |Si , Ti = k ζ¯1,2,k (S¯i ) = − ζ¯2,2,k (S¯j ) = 1 f¯(S¯i ) E(Yi − µ ¯k |S¯i , Ti = k) ¯ ζ¯1,3,k (S¯i ) = − ζ2,3,k (S¯j ) = π ¯k (S¯j )E(Xj†T |S¯j , Tj = k) ¯lk (S¯i ) n o E (Yi − µ ¯k )Xi†T |S¯i , Ti = k ζ¯1,4,k (S¯i ) = ζ¯2,4,k (S¯j ) = πk (S¯j ) ¯lk (S¯i ) Each of the four terms can be simplified using a change-of-variables argument: n o Z Z 0 0 T K˙ h (¯ s21 )T ζ¯1,u,k E K˙ h (S¯ji ) ζ¯1,u,k (S¯i )ζ¯2,u,k (S¯j ) = (¯ s1 )ζ¯2,u,k (¯ s2 )d¯ s1 d¯ s2 Z Z ˙ 2 )T ζ¯0 (hψ2 + s¯1 )ζ¯0 (¯ = h−1 K(ψ s2 1,u,k 2,u,k s2 )dψ2 d¯ 0 0 T ˙ ˙ ˙ where ζ¯1,u,k (s) = ζ¯1,u,k (s)f¯(s) and ζ¯2,u,k (s) = ζ¯2,u,k (s)f¯(s). Let K(u) = (K(u) [1] , K(u)[2] ) be the partial derivatives with respect to the first and second components of u eval 0 of K(u) 0 ¯ ¯ uated at u, and let ζ1,u,k (¯ s1 )ζ2,u,k (¯ s2 ) [i,j] denote the (i, j)-th component of ζ1,u,k (¯ s1 )ζ2,u,k (¯ s2 ) evaluated at s¯1 and s¯2 , for i = 1, 2 and j = 1, . . . , p + 1. Applying integration by parts we can write the j-th element of the expectation above as: 2 Z Z X

 ˙ 2 )[i] ζ¯0 (hψ2 + s¯1 )ζ¯0 (¯ h−1 K(ψ s2 1,u,k 2,u,k s2 ) [i,j] dψ2 d¯

i=1

=−

2 Z Z X i=1

K(ψ2 )

i ∂ h ¯0 0 ζ1,u,k (hψ2 + s¯1 )ζ¯2,u,k (¯ s2 ) [i,j] dψ2 d¯ s2 = O(1) ∂ψ2i

The last equality can be shown by bounding the integrand using that E(Y |S¯ = s, T = k), E(X|S¯ = s), E(X|S¯ = s, T = k), and E(XY |S¯ = s, T = k) are continuously differentiable, X is compact, and f¯(s) and ¯lk (s) are bounded away from 0. This shows ekπ = O(1) for k = 0, 1 in general. Applying the same argument it can be shown that that v ekµ = O(1) for k = 0, 1 as well. in general v 36

ekµ as: Now consider the case where the PS model is correct. We can write v ekµ v

n oi ¯k (S¯i ) h ‡T ¯ ‡T ¯ Tπ ¯ ¯ ˙ = ES¯i ,S¯j Kh (Sji ) ¯ ¯ E(Yi − µ ¯k |Si , Ti = k)E(Xj |Sj ) − E (Yi − µ ¯k )Xi |Si , Ti = k lk (Si ) ! h n oi ¯ π ¯k (Sj ) − K˙ h (S¯ji )T ¯ ¯ E(Yi − µ ¯k |S¯i , Ti = k)E(Xj‡T |S¯j , Tj = k) − E (Yi − µ ¯k )Xi‡T |S¯i , Ti = k lk (Si )

= ES¯i ,S¯j

n oi ¯k (S¯i ) − π ¯k (S¯j ) h (k) (k) ‡T ¯ ‡T ¯ Tπ ¯ ¯ ˙ E(Yi − µ ¯k |Si )E(Xj |Sj ) − E (Yi − µ ¯k )Xi |Si , Ti = k Kh (Sji ) ¯lk (S¯i )

where the second equality follows from using that X ⊥ T |S¯ and Y (k) ⊥ T |S¯ when the PS model is correctly specified. Now evaluate this expectation: Z Z  π ¯k (¯ s1 ) − π ¯k (¯ s2 ) ¯ (k) µ ek = f (¯ s2 ) E(Yi − µ v K˙ h (¯ s21 )T ¯k |S¯i = s¯1 )E(Xj‡T |S¯j = s¯2 ) π ¯k (¯ s1 ) o n (k) ¯k )Xi‡T |S¯i = s¯1 , Ti = k d¯ s2 d¯ s1 − E (Yi − µ Z Z h ¯k (¯ s1 ) − π ¯k (hψ1 + s¯1 ) ¯ (k) ˙ 1 )T π ¯k |S¯i = s¯1 ) = h−1 K(ψ f (hψ1 + s¯1 ) E(Yi − µ π ¯k (¯ s1 ) oi n (k) s1 ¯k )Xi‡T |S¯i = s¯1 , Ti = k dψ1 d¯ · E(Xj‡T |S¯j = hψ1 + s¯1 ) − E (Yi − µ   Z Z T ∂ ¯k (¯ s1 ) + hψ1T ∂s∂⊗2 π ¯k (¯ s∗1 )ψ1 ¯ T ψ1 ∂s π T ∂ ¯ ∗∗ ˙ =− K(ψ1 ) f (¯ s1 ) + hψ1 f (¯ s1 ) π ¯k (¯ s1 ) ∂s   h ∂ (k) ‡T ¯ ‡T ¯ ∗∗∗ ¯ ¯k |Si = s¯1 ) E(Xj |Sj = s¯1 ) + hψ1 ⊗ · E(Yi − µ E(Xj |Sj = s¯1 ) ∂s oi n (k) ¯k )Xi‡T |S¯i = s¯1 , Ti = k dψ1 d¯ s1 − E (Yi − µ Z Z h T ∂ ¯k (¯ s1 ) ¯ (k) ˙ 1 )T ψ1 ∂s π =− K(ψ ¯k |S¯i = s¯1 )E(Xj‡T |S¯j = s¯1 ) f (¯ s1 ) E(Yi − µ π ¯k (¯ s1 ) oi n (k) s1 + O(h) ¯k )Xi‡T |S¯i = s¯1 , Ti = k dψ1 d¯ − E (Yi − µ ¯∗∗∗ where s¯∗1 , s¯∗∗ are intermediate values between hψ1 + s¯1 and s¯1 . The last step 1 , and s 1 ˙ can be shown by bounding the integrand, using that K(u) is bounded, π ¯k (s) is twice ¯ ¯ continuously differentiable, f (s) and E(X|S = s) are continuously differentiable, and E(Y |S¯ = s, T = k) and E(Y X|S¯ = s, T = k) are continuous, and X is compact. After 37

!

some rearrangement, the main term can be further simplified: ekµT v

Z =−

∂ π ¯ (¯ s) ∂sT k 1

= −E

π ¯k (¯ s1 )

f¯(¯ s1 )

∂ π ¯ (S¯i ) ∂sT k

π ¯k (S¯i )



Z

h ˙ 1 )T dψ1 E(Y (k) − µ ψ1 K(ψ ¯k |S¯i = s¯1 )E(Xj‡T |S¯j = s¯1 ) i n oi (k) − E (Yi − µ ¯k )Xi‡T |S¯i = s¯1 , Ti = k d¯ s1 + O(h)

!   E(Y (k) − µ ¯k |S¯i )E(X ‡T |S¯i ) − E (Y (k) − µ ¯k )X ‡T |S¯i , T = k + O(h)

= 0T + O(h) R ˙ 1 )T dψ1 = I2×2 by integration by where the second to last step can be shown using ψ1 K(ψ parts. Let the partial derivatives of π ¯k (¯ s) with respect to the first and second arguments, evaluated at s¯, be denoted by ∂ π ¯ (¯ s)/∂sT = (∂ π ¯ (¯ s)/∂s1 , ∂ π ¯ (¯ s)/∂s2 ). The last equality can be shown using that ∂ π ¯ (S¯i )/∂s2 = 0 since π ¯ (s) depends only on its first argument when the PS model is correct. This shows that vkµ = 0 for k = 0, 1 when the PS model is correct. Finally we consider the case where the response model, in addition to the PS model, is ekπT in this case as: correctly specified. We can write v π ¯k (S¯i ) − π ¯k (S¯j ) ekπT = E K˙ h (S¯ji )T v ¯lk (S¯i ) ·  = ES¯i ,S¯j

h

(k) E(Yi

n oi (k) ¯k )Xi†T |S¯i , Ti = k −µ ¯k |S¯i )E(Xj†T |S¯j ) − E (Yi − µ

!

n o ¯k (S¯i ) − π ¯k (S¯j ) (k) †T ¯ †T ¯ Tπ ¯ ¯ ˙ Kh (Sji ) ¯k |Si ) E(Xj |Sj ) − E(Xi |Si ) E(Yi − µ ¯lk (S¯i )

¯ T = k when the response model is correct, and X ⊥ T |S¯ where we used that Y (k) ⊥ X|S,

38

and Y (k) ⊥ T |S¯ when the PS model is correct. Evaluating this expectation: Z Z π ¯k (¯ s1 ) − π ¯k (¯ s2 ) (k) πT ek = E(Yi − µ ¯k |S¯i = s¯1 ) v K˙ h (¯ s21 )T π ¯k (¯ s1 ) n o · E(Xj†T |S¯j = s¯2 ) − E(Xi†T |S¯i = s¯1 ) f¯(¯ s2 )d¯ s2 d¯ s1 Z Z ¯k (¯ s1 ) − π ¯k (hψ1 + s¯1 ) (k) ˙ 1 )T π E(Yi − µ ¯k |S¯i = s¯1 ) = h−1 K(ψ π ¯k (¯ s1 ) n o · E(Xj†T |S¯j = hψ1 + s¯1 ) − E(Xi†T |S¯i = s¯1 ) f¯(¯ s2 )dψ1 d¯ s1 Z Z T ∂ ¯k (¯ s∗1 ) ∂ (k) ¯ s2 )dψ1 d¯ ˙ 1 )T ψ1 ∂s π s1 = −h K(ψ E(Yi − µ ¯k |S¯i = s¯1 )ψ1 ⊗ E(Xi†T |S¯i = s¯∗∗ 1 )f (¯ π ¯k (¯ s1 ) ∂s = O(h) ¯1 and s¯1 . The last step can be where s¯∗1 and s¯∗∗ 1 are intermediate values between hψ1 + s shown by bounding various terms in the integrand, using that π ¯k (¯ s) is bounded away from ¯ 0, π ¯k (s) is continuously differentiable, E(Y |S = s, T = k) is continuous, E(X|S¯ = s) is continuously differentiable, f¯(s) is continuous, and X is compact. This shows that vkπ = 0 for k = 0, 1 when both the response and PS models are correct. The same argument can be used to show that vkµ = 0 for k = 0, 1 when both models are correct. Collecting all the results from above, we find that: ck = n− 21 W

n X 

ω ¯ ik (Yi − µ ¯k ) − E(Y − µ ¯k |S¯i , T = k)(¯ ωik − 1)



i=1 1 1 1 1 + vkπT n 2 (βbπ − β¯π ) + vkµT n 2 (βbµ − β¯µ ) + Op (n 2 hq + n− 2 h−2 ) n h oi n X 1 (k) (k) ¯k ) + (¯ ωik − 1) Yi − E(Y |S¯i , T = k) (Yi − µ = n− 2

i=1 1 1 1 1 + n 2 (βbπ − β¯π )T vkπ + n 2 (βbµ − β¯µ )T vkµ + Op (n 2 hq + n− 2 h−2 )

where vkπ and vkµ are deterministic vectors such that vkµ = 0 when the PS model is correct and vkπ = vkµ = 0 when both the PS and response models are correct, for k = 0, 1. 

39

B.5 Proof of Corollary 1

Proof. Under the assumptions required for Theorem 1, µ̂_k − µ̄_k = O_p(n^{−1/2}) when h = O_p(n^{−α}) for α ∈ (1/(2q), 1/4). When the PS model is correct, π_k(X, β̄) = π_k(X) so that:

$$ \bar\mu_k = E(\bar\omega_k Y) = E(\omega_k Y) = \mu_k $$

for k = 0, 1. When the response model is correct, E(Y | S̄, T = k) = E(Y | X, T = k) so that:

$$ \bar\mu_k = E(\bar\omega_k Y) = E\big\{E(Y^{(k)} \mid \bar S, T = k)\big\} = E\big\{E(Y^{(k)} \mid X, T = k)\big\} = \mu_k $$

for k = 0, 1.

B.6 Proof of Corollary 2

Proof. Under the assumptions required for Theorem 1, when h = O_p(n^{−α}) for α ∈ (1/(2q), 1/4) and both the PS and response models are correct, we have that π_k(X, β̄) = π_k(X), E(Y^{(k)} | S̄, T = k) = E(Y^{(k)} | X, T = k), ω̄_ik = ω_ik, and µ̄_k = µ_k for k = 0, 1, so that:

$$ \hat W_k = n^{-1/2}\sum_{i=1}^{n}\Big[ (Y_i^{(k)} - \bar\mu_k) + (\bar\omega_{ik} - 1)\big\{ Y_i^{(k)} - E(Y^{(k)} \mid \bar S_i, T = k) \big\} \Big] + o_p(1) = n^{-1/2}\sum_{i=1}^{n} \Psi_{ik}^{\text{eff}} + o_p(1). $$

Consequently the influence function for ∆̂ can be written:

$$ n^{1/2}(\hat\Delta - \Delta) = \hat W_1 - \hat W_0 = n^{-1/2}\sum_{i=1}^{n} \Psi_i^{\text{eff}} + o_p(1) $$

where Ψ_i^eff = Ψ_{i1}^eff − Ψ_{i0}^eff is the efficient influence function for ∆ in a semiparametric model where the PS is correctly specified (Tsiatis, 2007).
