Combining Estimates of Conditional Treatment Effects


Craig A. Rolling
Lundquist College of Business
University of Oregon
Eugene, OR 97401
[email protected]

Yuhong Yang
School of Statistics
University of Minnesota
Minneapolis, MN 55455
[email protected]

Combining Estimates of CTE

Corresponding Author: Craig A. Rolling

Abstract

Estimating a treatment’s effect on an outcome conditional on covariates is a primary goal of many empirical investigations. Accurate estimation of the treatment effect given covariates can enable the optimal treatment to be applied to each unit or guide the deployment of limited treatment resources for maximum program benefit. Applications of conditional treatment effect estimation are found in direct marketing, economic policy, and personalized medicine. When estimating conditional treatment effects, the typical practice is to select a statistical model or procedure based on sample data. However, combining estimates from the candidate procedures often provides a more accurate estimate than the selection of a single procedure. This paper proposes a method of model combination that targets accurate estimation of the treatment effect conditional on covariates. We provide a risk bound for the resulting estimator under squared error loss and illustrate the method using data from a labor skills training program.

Keywords: Causal inference; Heterogeneous treatment effects; Matching; Model combination; Risk bound


1. INTRODUCTION

Estimating the causal effect of a treatment on a response is a primary goal of many empirical investigations, particularly in fields such as business, medicine, and public policy. Imbens and Wooldridge (2009) provide an extensive review of treatment effect estimation in the context of program evaluation in econometrics. Accurate estimation of treatment effects is often challenging because of confounding and because each observed unit is assigned only one of the possible treatments; the latter was famously deemed the “fundamental problem of causal inference” in Holland (1986). Perhaps because of these challenges, much causal inference research has focused on estimation of the treatment’s average effect across a population. For example, Abadie and Imbens (2011) and Zhang (in press) provide two recent methods to estimate the average effect of a treatment from an observational study.

While knowledge of the average treatment effect in a population can be useful, treatment effects are often heterogeneous within the population of interest. In the presence of treatment effect heterogeneity, when a treatment can be applied at the level of individual units (as in medicine or direct marketing), accurate estimation of the treatment’s effect on different individuals can be used to increase the effectiveness of the treatment program in maximizing the outcome of interest. For example, a retailer with a limited marketing budget would be able to optimize a seasonal catalog mailing if it knew the effect of the catalog on the purchasing behavior of each household. In the public sector, an economic development agency often needs to decide which applicants will make the best use of grant dollars. Accurate estimation of individual treatment effects is central to the success of personalized medicine. With the increasing volume of data becoming available to many organizations, estimating heterogeneous treatment effects has become more feasible.
Treatment effect heterogeneity can be identified by conditioning on baseline covariates that are observed before the treatment is applied, and there is a growing literature in this area. Methods recently proposed to estimate treatment effects conditional on covariates include Cai, Tian, Wong, and Wei (2011), Green and Kern (2012), Imai and Ratkovic (2013), and Taddy, Gardner, Chen, and Draper (2015). Each of these works takes a different approach to estimating conditional treatment effects. Therefore, for an analyst with a dataset in hand and treatment effect estimation as the goal, two questions naturally follow. First, which of these procedures will produce the best treatment effect estimates? Second, can the estimates from the procedures be combined to produce more accurate estimates?

Rolling and Yang (2014) addressed the first question by discussing model selection in the context of estimating the treatment effect conditional on covariates. They found that within a given candidate set of models, the best model for treatment effect estimation may be different from the best model for response estimation or prediction. This phenomenon was also discussed in Qian and Murphy (2011) in the context of optimizing treatment decisions. While targeted model selection tools are a step in the right direction for accurate estimation of treatment effects, post-model selection estimators from finite samples often have large variability because of model selection instability. The current paper addresses the second question by introducing a model combination method for estimation of treatment effects. This method is (to the best of our knowledge) the first combination algorithm specifically targeted at estimating the effect of a treatment conditional on covariates. It has been well-established (e.g., Yang, 2003) that for estimating the full regression function or forecasting, model combination algorithms often lead to more accurate estimates and predictions than model selection procedures when selection instability is high. The same is true for conditional treatment effect estimation, which motivates the work in this paper.
Different researchers have taken different approaches to model combination. Methods introduced from a machine learning perspective, such as bagging (Breiman, 1996), boosting (Freund and Schapire, 1996), and random forests (Breiman, 2001), were motivated initially by intuition and empirical performance, although some theoretical understanding of these


methods was later developed (e.g., Bühlmann and Yu, 2002). The method of frequentist model averaging (Hjort and Claeskens, 2003) is motivated by asymptotic arguments and is justified within a parametric local misspecification setting. Bayesian model averaging (Hoeting, Madigan, Raftery, and Volinsky, 1999) originates from a Bayesian perspective, with the model weights based on posterior probabilities of the models. Yang (2001) viewed model combination from an adaptation point of view. His combination algorithm, called adaptive regression by mixing (ARM), possesses an oracle inequality that bounds the risk of the resulting estimator in terms of the minimum risk among the candidate procedures. This approach, which has connections with information theory, was shown to perform almost as well (up to a constant) as the best procedure among the candidates, without knowing in advance which procedure is best. An important practical advantage of ARM is its flexibility; it can combine different classes of regression models and machine learning algorithms. The method we present in this paper shares the flexibility of ARM but is targeted to estimate the conditional effect of a treatment rather than the full regression function. Since models that are good for response estimation or prediction may not be good for treatment effect estimation (and vice versa), an algorithm targeted to the specific goal of treatment effect estimation is needed to ensure that models doing a good job of estimating the treatment effect receive higher weights. We call our method Treatment Effect Estimation by Mixing, abbreviated TEEM. (Incidentally, TEEM is a homophone for “team”; this is fitting because the candidate estimators work together to produce an accurate estimate.)
TEEM relies on data splitting to evaluate the candidate procedures and therefore can combine multiple types of regression procedures and estimates; any procedure that, given data, produces an estimate of the treatment effect conditional on covariates can be used as a candidate in TEEM. Furthermore, the theoretical results we present for TEEM do not assume that any of the candidate models are correct. These features give the method tremendous flexibility to be used in a wide variety of settings.


The rest of the paper is organized as follows. A mathematical formulation of the conditional treatment effect estimation problem is presented in Section 2. Section 3 introduces the TEEM algorithm and gives a bound on its risk for estimating the conditional treatment effect. The method is applied to the benchmark National Supported Work Demonstration dataset (LaLonde, 1986) in Section 4, and in Section 5, we use a simulation guided by this dataset to compare our algorithm to other model selection and combination methods. Section 6 contains some concluding remarks. A detailed proof of the TEEM risk bound is given in the Appendix.

2. FRAMEWORK

We consider a general regression framework in which the distribution of the response Y may depend on a binary treatment variable T ∈ {t, c} and one or more baseline covariates U ∈ R^p. In order to isolate the treatment difference of primary interest, we express the observations as follows:

Y_i = [f_t(U_i) + σ_t ε_i] I(T_i = t) + [f_c(U_i) + σ_c ε_i] I(T_i = c),  1 ≤ i ≤ n.  (1)

In the above, the covariates U_i are assumed to be i.i.d. from some unknown distribution P_U with support U. The ε_i are i.i.d. Gaussian noise variables independent of U with zero mean and unit variance. The error variance under treatment is allowed to differ from the variance under control; these are denoted by σ_t² and σ_c², respectively. Both variances are assumed to be constant with respect to U. The treatment variable T_i may be fixed or random. The object of interest in our work is ∆(u) := f_t(u) − f_c(u), the difference between the regression functions under treatment and control. As we discuss next, some conditions are needed to properly interpret ∆(u) as a causal effect.

We define causal effects using the potential outcomes framework of the Rubin Causal Model (Holland, 1986). That is, let Y_{i,(t)} denote the response that would have been observed had T_i = t, and let Y_{i,(c)} denote the corresponding potential outcome if T_i = c. Then the causal effect of the treatment T on unit i is the unobserved random variable Y_{i,(t)} − Y_{i,(c)}. Following Imbens and Wooldridge (2009), we define the Conditional Average Treatment Effect (CATE) as the expectation of this random variable conditional on the observed value of the covariate vector U_i:

CATE(u) := E[(Y_{i,(t)} − Y_{i,(c)}) | U_i = u].  (2)

Note that ∆(u) = E(Y_i | T_i = t, U_i = u) − E(Y_i | T_i = c, U_i = u). We call ∆ the Conditional Average Treatment Difference (CATD) to distinguish it from the CATE, because ∆ may be influenced by unobserved confounding variables and therefore may not accurately represent the conditional effect of the treatment. If the pair of potential outcomes for each i is conditionally independent of the treatment given the covariates, that is,

(Y_{i,(t)}, Y_{i,(c)}) ⊥⊥ T_i | U_i,  (3)

then ∆(u) = CATE(u), and therefore ∆(u) represents the (mean) causal effect of the treatment variable given the covariate value u. The assumption (3) is known to hold in randomized experiments, but it holds in observational studies only if all variables confounding the treatment and response are observed in the covariate vector U. In this work, we focus on estimation of ∆ because it is identifiable in most experiments and observational studies. However, it is important to keep in mind that in order for ∆(u) to represent a causal effect of the treatment variable T, we may need to require that (3) holds.

We define the L₂ norm with respect to the probability distribution of the covariates,

‖f‖₂ := [ ∫_U |f(u)|² P_U(du) ]^{1/2},  (4)

where P_U denotes the probability distribution of U_i for 1 ≤ i ≤ n. This norm will be used to measure the average discrepancy between ∆ and various estimates ∆̂ over U.

The TEEM method involves combining a finite collection of procedures proposed for estimating the function ∆. A treatment effect estimation procedure or strategy, denoted by ψ, refers to a method of estimating ∆ and σ := √(σ_t² + σ_c²) based on Z_l = (Y_i, T_i, U_i)_{i=1}^{l} at each sample size l. (Estimators of σ can be shared between the procedures if desired.) Let ψ_j, 1 ≤ j ≤ J, denote the candidate treatment effect estimation procedures, and let ∆̂_{l,j}(u) and σ̂_{l,j} denote the estimators of ∆ and σ, respectively, resulting from the application of procedure ψ_j to the data Z_l. For the TEEM method, ψ may be any sort of procedure or algorithm that, given data, produces estimates of ∆ and σ. A given candidate ψ_j could represent a parametric, nonparametric, or semiparametric statistical model, a non-statistical machine learning procedure, or subjective expert judgement. Therefore, the TEEM method of procedure combination is very general and has the ability to combine different types of estimators. This flexibility is an important advantage for solving the problem of treatment effect estimation because, as discussed in Section 1, a variety of approaches have been proposed in the literature for estimating conditional treatment effects.
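As an illustration of this framework, the following sketch simulates observations from model (1). The regression functions f_t and f_c, the noise scales, and the uniform covariate distribution are all invented for illustration; they are not part of the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented regression functions for illustration (not from the paper).
def f_t(u):
    return 2.0 + np.sin(3.0 * u)

def f_c(u):
    return 1.0 + 0.5 * u

def delta(u):
    # The estimand: the conditional average treatment difference.
    return f_t(u) - f_c(u)

def generate(n, sigma_t=1.0, sigma_c=0.8):
    u = rng.uniform(0.0, 1.0, size=n)   # U_i i.i.d. from P_U (here uniform)
    t = rng.integers(0, 2, size=n)      # 1 = treatment, 0 = control
    eps = rng.standard_normal(n)
    # Model (1): Y_i = [f_t(U_i) + sigma_t*eps_i] I(T_i = t)
    #                + [f_c(U_i) + sigma_c*eps_i] I(T_i = c)
    y = np.where(t == 1, f_t(u) + sigma_t * eps, f_c(u) + sigma_c * eps)
    return y, t, u

y, t, u = generate(5000)
```

Because treatment is randomized here, the raw difference in group means approximates E[f_t(U)] − E[f_c(U)], while ∆(u) captures how the effect varies with u.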

3. THE TEEM ALGORITHM

The TEEM algorithm for combining estimates of conditional treatment effects is based on data splitting, as in cross-validation. The candidate procedures are fit using a subset of the data (called the training set) and evaluated on the remaining subset (the evaluation set). Traditional cross-validation compares the individual responses in the evaluation set with their predicted values from the various procedures. However, for treatment effect estimation, since each observational unit is in either the treatment or the control group, individual treatment effects are not available to compare with the estimates. Our solution to this problem is to approximate individual treatment effects in the evaluation data by using pairs of nearby observations, one from each treatment group.

Suppose we have a pair of observations (i, j) such that T_i = t and T_j = c. If individuals i and j have the same baseline covariates (U_i = U_j), then within the framework of the previous section, Y_i − Y_j is an observation from N(∆(U_i), σ_t² + σ_c²), and this difference can be used to evaluate the accuracy of estimates ∆̂(U_i). If the covariates of i and j do not match, then Y_i − Y_j ∼ N(∆(U_i) + (f_c(U_i) − f_c(U_j)), σ_t² + σ_c²), and the bias of Y_i − Y_j as an estimate of ∆(U_i) is represented by f_c(U_i) − f_c(U_j). If the distance between U_i and U_j is small and the control regression function f_c is smooth, this bias will be small and the paired difference Y_i − Y_j will be a nearly unbiased estimate of ∆(U_i). The TEEM algorithm uses differences between such nearby treatment/control pairs to evaluate the candidate estimates of treatment effects and assign them appropriate weights.

In Sections 3.1 and 3.3, we present two versions of the TEEM algorithm. In the first version, each observation in the evaluation set is used in at most one treatment-control pair. In the second version, each observation in the evaluation set is paired with its nearest neighbor in the other treatment group (within the evaluation set), and observations are allowed to belong to more than one pair. The two versions of the algorithm are similar; the main distinction between them can be thought of as the difference between matching without replacement and matching with replacement. The first version, in which all treatment-control pairs are independent of each other, is used to facilitate theoretical development. The second version will tend to perform better in applications because, as argued in Abadie and Imbens (2006), matching with replacement produces higher-quality (closer) matches and therefore introduces less bias.
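A minimal sketch of this pairing idea, for a scalar covariate and with all data values invented for illustration (the function name and matching rule are our own choices):

```python
import numpy as np

def paired_differences(y, t, u):
    """For each treated unit, pair it with the nearest control unit (by
    covariate distance) and return the paired differences Y_t - Y_c.  Each
    difference approximates Delta(U_i) up to the bias f_c(U_i) - f_c(U_j)."""
    treated = np.flatnonzero(t == 1)
    control = np.flatnonzero(t == 0)
    deltas = []
    for i in treated:
        j = control[np.argmin(np.abs(u[control] - u[i]))]  # nearest control
        deltas.append(y[i] - y[j])
    return np.array(deltas)

# Tiny worked example (all values invented):
u = np.array([0.10, 0.12, 0.50, 0.52, 0.90, 0.88])
t = np.array([1, 0, 1, 0, 1, 0])
y = np.array([3.0, 1.0, 4.0, 1.5, 2.0, 1.8])
d = paired_differences(y, t, u)  # pairs (0,1), (2,3), (4,5) -> 2.0, 2.5, 0.2
```

Each entry of `d` is a noisy, nearly unbiased observation of ∆ at the treated unit's covariate value, exactly the quantity TEEM uses in place of unobservable individual effects.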

3.1 TEEM with Independent Pairs

Here we describe in detail the version of the TEEM algorithm with independent pairs, for which we derive the risk bound in Section 3.2. For our theoretical development, the support of the covariates U is assumed to be a compact subset of R^p, and the covariate distribution P_U is assumed to have a density bounded below by a constant c̲ > 0 on U almost surely. Without further loss of generality, we set U = [0, 1]^p. Note that these restrictions on U and P_U are not required for the version of the algorithm described in Section 3.3, which should be used in practice if some covariates in U are discrete or have no known bounds.

Step 0. Select a fraction ρ ∈ (0, 1) of the n observations that will be used to fit the models. Denote ⌊ρn + 0.5⌋ by n₁; n₁ is the number of observations used to fit the models. Similarly, denote the size of the evaluation set, n − n₁, by n₂. Note that asymptotically, n₁ and n₂ are both of order n.

Step 1. Randomly permute the order of the n observations; call this permutation π. Split the resulting ordered data into two parts: the training part Z_{(1)} = (Y_i, T_i, U_i)_{i=1}^{n₁} and the evaluation part Z_{(2)} = (Y_i, T_i, U_i)_{i=n₁+1}^{n}.

Step 2. Within the evaluation data Z_{(2)}, let n₂ᵗ denote the number of observations for which T_i = t and n₂ᶜ the number for which T_i = c. Let n₂* = min(n₂ᵗ, n₂ᶜ). Partition U = [0, 1]^p into hypercubes, each with side length h, such that

1/h = ⌊ ( c̲ n₂* / (2 log n₂*) )^{1/p} ⌋.  (5)

Let ñ₂ denote the number of these hypercubes containing at least one realized covariate value from each treatment group in Z_{(2)}. Within each of these ñ₂ cells, randomly select a pair of observations (i, i*) such that T_i = t and T_{i*} = c. Use the indices i from these pairs to create the ordering m = 1, ..., ñ₂, where each m represents the treatment-control pair (i, i*) with the mth-smallest value of i among the pairs created in this step. Using this index, hereafter denote the treatment and control observations (i, i*) in pair m by (m_t, m_c).

Step 3. For each resulting matched pair (m_t, m_c), create approximate treatment effects δ̃_m = Y_{m_t} − Y_{m_c}. These approximate local treatment effects will be used to evaluate the candidate procedures and assign them weights.

Step 4. Fit the J candidate models (or generally, the J candidate estimation procedures) ψ₁, ..., ψ_J to the data Z_{(1)} to obtain J estimates of the treatment effect function, denoted by ∆̂_{n₁,1}, ..., ∆̂_{n₁,J}, and J estimates of σ := √(σ_t² + σ_c²), denoted by σ̂_{n₁,1}, ..., σ̂_{n₁,J}.

Step 5. For each procedure indexed by j = 1, 2, ..., J, assign initial weights (or prior probabilities) W_{1,j} = ω_j, where the ω_j’s are positive numbers that sum to 1. Then for


2 ≤ m ≤ ñ₂, let

W_{m,j} = ω_j ∏_{l=1}^{m−1} { φ( [δ̃_l − ∆̂_{n₁,j}(U_{l_t})] / σ̂_{n₁,j} ) / σ̂_{n₁,j} } / ( ∑_{k=1}^{J} ω_k ∏_{l=1}^{m−1} { φ( [δ̃_l − ∆̂_{n₁,k}(U_{l_t})] / σ̂_{n₁,k} ) / σ̂_{n₁,k} } ),  (6)

where φ is the standard normal density function. Note that ∑_{j=1}^{J} W_{m,j} = 1 for each m = 1, ..., ñ₂.

Step 6. For m = 1, ..., ñ₂, let

∆̃_m(u) = ∑_{j=1}^{J} W_{m,j} ∆̂_{n₁,j}(u).  (7)

Step 7. For every cell m containing at least one treatment-control pair, let U_m denote the region of the covariate space representing the cell. Then let

∆̃_π(u) = ∆̃_m(U_{m_t}) if u ∈ U_m, and ∆̃_π(u) = 0 if the cell containing u has no treatment-control pair in Z_{(2)}.

The subscript π indicates the estimator’s dependence on the permutation π applied in step 1.

Step 8. Repeat steps 1-7 a total of P times for some P ≥ 1, and average the resulting ∆̃_π to obtain the TEEM estimator

∆̂(u) = (1/P) ∑_{p=1}^{P} ∆̃_{π_p}(u),  (8)

where for each iteration 1 ≤ p ≤ P, π_p denotes the permutation applied in step 1 of the iteration.

The partition size given in step 2 is not a bandwidth in the traditional sense. It takes the form given in (5) so that, asymptotically, the partition becomes finer (allowing for more precise estimation of local treatment effects) while each cell continues to contain at least one treatment-control pair with high probability. The technical role of this partition in establishing the risk bound for the TEEM algorithm can be understood from the proof provided in the Appendix.
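The sequential weight update in (6)-(7) can be sketched as follows. The log-space computation (for numerical stability), the function names, and the toy candidates in the usage note are our own illustrative choices, not part of the paper's specification.

```python
import numpy as np

def teem_weights(deltas_tilde, u_t, candidates, sigmas, prior=None):
    """Sequential TEEM weights W_{m,j} as in (6), computed in log space.
    `candidates[j]` maps a covariate value to Delta_hat_{n1,j}(u);
    `sigmas[j]` is the corresponding sigma_hat_{n1,j}."""
    J = len(candidates)
    n_pairs = len(deltas_tilde)
    prior = np.full(J, 1.0 / J) if prior is None else np.asarray(prior, dtype=float)
    logw = np.log(prior)
    weights = np.empty((n_pairs, J))
    for m in range(n_pairs):
        # W_{m+1,j} depends only on pairs 1..m, so normalize before updating.
        w = np.exp(logw - logw.max())
        weights[m] = w / w.sum()
        for j in range(J):
            z = (deltas_tilde[m] - candidates[j](u_t[m])) / sigmas[j]
            # log of phi(z)/sigma_hat_j for this pair (phi: standard normal pdf)
            logw[j] += -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi) - np.log(sigmas[j])
    return weights
```

For instance, with two hypothetical candidates ∆̂₁(u) ≡ 1 and ∆̂₂(u) ≡ 3 and paired differences clustered near 1, the weights concentrate rapidly on the first candidate, as the likelihood in (6) intends.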


3.2 Risk Bound for the TEEM Estimator

In this section we bound the risk of the estimator produced by the TEEM algorithm described in Section 3.1. Our proof uses the following assumptions on the data-generating process:

Regularity Conditions

1. Covariate distribution of treatment and control groups: Let P_U^t and P_U^c denote the covariate distributions, conditional on treatment status, for the treatment and control groups, respectively. Note that we allow treatment to be associated with covariates, as is the case in many observational studies, so P_U^t and P_U^c may differ from each other. We assume that the realizations U_i | T_i = t are i.i.d. from P_U^t and, similarly, that U_i | T_i = c are i.i.d. from P_U^c.

2. Sizes of treatment and control groups: For n large enough, there exist constants (a, b) not depending on n such that 0 < a < n_t/n < b < 1, where n_t is the number of the n observations for which T_i = t.

3. Boundedness: The regression functions f_t and f_c are uniformly bounded in absolute value by A < ∞, and the standard deviations σ_t and σ_c are each bounded above by σ̄ < ∞ and below by σ̲ > 0. We assume correspondingly that the estimators ∆̂_{l,j} and σ̂_{l,j} satisfy ‖∆̂_{l,j}‖_∞ ≤ 2A and σ̂_{l,j} ∈ [√2 σ̲, √2 σ̄] for each l ≥ 1 and j ≥ 1. We also assume the densities of the covariate distributions for each treatment group, P_U^t and P_U^c, are bounded above by c̄ < ∞ and below by c̲ > 0 on U.

4. Smoothness: The regression functions for the treatment and control groups, f_t and f_c, and the estimators ∆̂_{l,j} for l ≥ 1 and j ≥ 1 have all p first-order partial derivatives, and each of these first-order partial derivatives is bounded in absolute value by a constant L on U. We also assume the densities of the distributions P_U^t and P_U^c are continuous on U.

The theorem below bounds the risk of the TEEM estimator in terms of the minimum risk of the individual candidate procedures, the size of the evaluation set, and the dimension of the covariate vector.

Theorem 1. Under regularity conditions 1-4, the risk of ∆̂ from the TEEM algorithm described in Section 3.1 has the following bound:

E‖∆ − ∆̂‖₂² ≤ C { (log n₂ / n₂)^{1/p} + inf_j [ (1/n₂) log(1/ω_j) + E(σ − σ̂_{n₁,j})² + E‖∆ − ∆̂_{n₁,j}‖₂² ] },

where the constant C depends on a, b, c̲, c̄, σ̲, σ̄, A, p, and L (but not on n).

Proof. See the Appendix.

Remarks

1. In the setting we have assumed, with homoscedastic errors within the treatment and control groups and smoothness conditions on f_t and f_c, the variance terms σ_t² and σ_c² (and therefore σ²) can be estimated at rate n₁⁻¹ independently of the candidate models (see, e.g., Rice, 1984). Thus, the term E(σ − σ̂_{n₁,j})² could be removed from the risk bound by incorporating this independent estimation of σ into the algorithm. However, we believe that in practice, separate model-based estimators of σ are often helpful in assigning proper weights to each of the candidate procedures.

2. By choosing a fixed fraction of n to fit the estimators and using the remainder to construct the combining weights, n₁ and n₂ are both of order n. Therefore, if one of the candidate models (say j*) is a correctly specified parametric representation of the data-generating process, then E(σ − σ̂_{n₁,j*})² and E‖∆ − ∆̂_{n₁,j*}‖₂² will each converge to zero at a rate of n⁻¹. In this case, if p = 1, the risk of the combined estimator will converge to zero at rate (log n) n⁻¹, almost as fast as an oracle that knows the true model in advance.


3. The dimension of U slows the convergence of the combined estimator due to the “curse of dimensionality” in constructing the treatment-control pairs. This suggests that more efficient estimation can be achieved by reducing the dimension of the covariate vector before constructing the pairs. Ideally, the dimension reduction would not result in any loss of information about ∆. Such dimension reduction can often be accomplished using variable selection techniques or by finding a few linear combinations of the covariates that are sufficient for the regression of ∆ on U. (See Cook (1998) for an overview of sufficient dimension reduction.) Note that any dimension reduction that is sufficient for ∆ will also be sufficient for the CATE under the unconfoundedness assumption given in (3).

4. Since estimation of σ can usually be done at the parametric rate (see Remark 1), the above oracle inequality says that the combined estimator of ∆ converges at the best rate offered by the candidate procedures, up to the (log n₂ / n₂)^{1/p} term.

5. Theorem 1 can be generalized to handle heteroscedastic errors; our proof assumes homoscedastic errors only for simplicity of presentation.

3.3 TEEM with Nearest Neighbors

In this section, we describe a second version of the TEEM algorithm for use in applications. This algorithm is fundamentally similar to the one described in Section 3.1; both are based on data splitting, pairing of nearby treatment and control observations, and the use of a likelihood based on these pairings to compute the combination weights. However, in this version, which we call TEEM with nearest neighbors, treatment-control pairs are created by searching for the nearest neighbor of each observation in the other treatment group rather than by partitioning the covariate space. The algorithm is presented in detail below.


Steps 0A and 1A. These steps are the same as steps 0 and 1 of the algorithm in Section 3.1.

Step 2A. For each unit i in Z_{(2)} (regardless of its treatment status), let i* denote its nearest neighbor, in terms of Euclidean distance d(·, ·), from the other treatment group in Z_{(2)}. Specifically, i* represents an observation such that d(U_i, U_{i*}) is the smallest among d(U_i, U_k) with T_i ≠ T_k and n₁ + 1 ≤ k ≤ n. (If in practice the covariates U are discrete, there may be multiple observations in the other treatment group that are equidistant from i. Any tie-breaking method may be used to choose one i* in this case.) This method of matching will result in n₂ pairs, but the pairs will not be pairwise independent.

In some situations, it may be useful to apply a caliper to bound the matching discrepancy so that none of the matched pairs are too far apart. Althauser and Rubin (1970) and others have argued that caliper matching can remove a large percentage of the total bias induced by matching while removing only a small percentage of the matched pairs. For TEEM with nearest neighbors, if a caliper is applied, observations without a “close enough” match in the other treatment group are not used to evaluate the treatment effect estimation procedures.

Step 3A. For each of the n₂ treatment-control pairs produced by step 2A (or the ñ₂ pairs for some ñ₂ ≤ n₂ if a caliper is applied), create approximate treatment effects

δ̃_i = [2 I(T_i = t) − 1] (Y_i − Y_{i*}).  (9)

In other words, for each pair (i, i*), δ̃_i is the response of the treated unit minus the response of the control unit.

Step 4A. Same as step 4 of the algorithm in Section 3.1.

Step 5A. Use the performance of the estimates ∆̂_{n₁,j} on the evaluation data to create the weights W_{π,j}, where the subscript π represents the permutation applied in step 1A:

W_{π,j} = ∏_{i=n₁+1}^{n} { φ( [δ̃_i − ∆̂_{n₁,j}(U_i)] / σ̂_{n₁,j} ) / σ̂_{n₁,j} } / ( ∑_{k=1}^{J} ∏_{i=n₁+1}^{n} { φ( [δ̃_i − ∆̂_{n₁,k}(U_i)] / σ̂_{n₁,k} ) / σ̂_{n₁,k} } ),  (10)

where φ is the standard normal density function.
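A toy sketch of the caliper idea in step 2A, for a scalar covariate and with invented values; the function name and the tie-breaking behavior (NumPy's `argmin` returns the first minimizer) are our own choices.

```python
import numpy as np

def caliper_pairs(u, t, caliper):
    """Pair each unit with its nearest neighbor from the other treatment
    group, discarding any pair whose covariate distance exceeds `caliper`.
    Units may appear in several pairs (matching with replacement)."""
    pairs = []
    for i in range(len(u)):
        other = np.flatnonzero(t != t[i])                   # other treatment group
        j = other[np.argmin(np.abs(u[other] - u[i]))]       # nearest neighbor
        if abs(u[i] - u[j]) <= caliper:
            pairs.append((i, j))                            # (unit, its match)
    return pairs

u = np.array([0.00, 0.05, 0.50, 0.95, 1.00])
t = np.array([1, 0, 1, 0, 1])
matched = caliper_pairs(u, t, caliper=0.1)
# unit 2 (u = 0.50) has no match within 0.1, so it is dropped
```

Here units 0, 1, 3, and 4 are matched, while the isolated treated unit at u = 0.50 is excluded from the evaluation, as the text describes for observations without a "close enough" match.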


Step 6A. Repeat steps 1A-5A a total of P times for some P ≥ 1. For each j, average the weights W_{π,j} over the permutations to create averaged weights W̄_j:

W̄_j = (1/P) ∑_{p=1}^{P} W_{π_p,j},  (11)

where for each iteration 1 ≤ p ≤ P, π_p denotes the permutation applied in step 1A of the iteration. Note that ∑_{j=1}^{J} W̄_j = 1.

Step 7A. Fit each candidate procedure j to the entire sample of n observations to obtain estimates ∆̂_{n,j}(u) for any u ∈ U.

Step 8A. Create the final TEEM estimator

∆̂_A(u) = ∑_{j=1}^{J} W̄_j ∆̂_{n,j}(u).  (12)

While the risk bound in Section 3.2 is guaranteed for the estimator ∆̂ in Section 3.1, we believe the estimator ∆̂_A will typically exhibit better performance. TEEM with nearest neighbors sacrifices independence across the different treatment-control pairs so that the average distance between the observations within each pair is reduced. Furthermore, TEEM with nearest neighbors does not require the covariates to be continuous or bounded, increasing its applicability. The numerical analysis in Sections 4 and 5 demonstrates the version of TEEM with nearest neighbors producing ∆̂_A, and we generally recommend this version for applications.
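Steps 1A-8A can be condensed into the following sketch for a scalar covariate. The interface (`fit_candidates` returning pairs of an estimated ∆̂ function and a σ̂ value) and all names are hypothetical stand-ins for the paper's candidate model classes; the log-space likelihood is our numerical choice, and we assume each evaluation split contains both treatment groups.

```python
import numpy as np

def teem_nn(y, t, u, fit_candidates, rho=0.5, P=10, rng=None):
    """TEEM with nearest neighbors (steps 1A-8A), scalar covariate.
    `fit_candidates(y, t, u)` returns a list of (delta_hat_fn, sigma_hat)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(y)
    n1 = int(np.floor(rho * n + 0.5))                       # step 0A
    J = len(fit_candidates(y, t, u))
    w_bar = np.zeros(J)
    for _ in range(P):
        perm = rng.permutation(n)                           # step 1A
        ev = perm[n1:]
        cands = fit_candidates(y[perm[:n1]], t[perm[:n1]], u[perm[:n1]])
        logw = np.zeros(J)
        for i in ev:
            other = ev[t[ev] != t[i]]                       # step 2A: other group
            j_star = other[np.argmin(np.abs(u[other] - u[i]))]
            # (9): treated response minus control response
            d = (y[i] - y[j_star]) if t[i] == 1 else (y[j_star] - y[i])
            for j, (dfn, s) in enumerate(cands):
                z = (d - dfn(u[i])) / s
                logw[j] += -0.5 * z ** 2 - np.log(s)        # likelihood as in (10)
        w = np.exp(logw - logw.max())
        w_bar += w / w.sum() / P                            # averaged weights (11)
    full = fit_candidates(y, t, u)                          # step 7A: refit on all n
    estimate = lambda x: sum(wj * dfn(x) for wj, (dfn, _s) in zip(w_bar, full))
    return estimate, w_bar                                  # (12)
```

With toy candidates such as a zero-effect model and a difference-in-means model, the averaged weights concentrate on whichever candidate tracks the paired differences better.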

4. APPLICATION: LABOR TRAINING PROGRAM

4.1 The LaLonde Data

In this section we apply the TEEM method to the well-known LaLonde (1986) National Supported Work (NSW) Demonstration data set. The NSW Demonstration was a federally and privately funded program in the 1970s that provided work experience to individuals who were struggling financially. Eligible participants were randomly assigned to the treatment


or control group, and follow-up interviews were conducted with both groups to obtain information about post-intervention earnings. LaLonde (1986) analyzed the male and female participants separately, and we will focus on the study’s male participants. The male participants from this experiment were previously analyzed by Dehejia and Wahba (1999) in a study of propensity scores and by Imai and Ratkovic (2013), who used a penalized regression method to estimate heterogeneity of the treatment effect.

[Table 1 about here]

There were n = 722 male participants in the experiment; 297 were treated and 425 were in the control group. The outcome variable Y in our analysis is the change in the square root of income from 1975 (pre-treatment) to 1978 (post-treatment); square root transformations of income are used to reduce skewness. The treatment variable T equals 1 if the person was treated in the NSW demonstration and 0 otherwise. Four baseline covariates, measured before the treatment was applied, are used to identify heterogeneity of the treatment effect in some of the candidate models. These covariates are the square root of 1975 income, age, years of education, and marital status, with variable names Inc75, Age, Educ, and Married, respectively. Racial indicators for black and Hispanic individuals are also available in the LaLonde dataset, but these variables were not used because a preliminary analysis provided no evidence that race moderated the treatment effect. The six two-way interactions of the four baseline covariates were not effective moderators in a preliminary linear model, so these were not considered further. Descriptions, means, and standard deviations for each of the variables used in our analysis are given in Table 1.

4.2 Candidate Models

A feature of this analysis that seems typical of many studies of treatment effect heterogeneity is the plausibility of non-linear treatment effects with respect to some of the covariates. For example, intuitively it seems quite possible that the job training program


may be most beneficial for those in the middle of the income, education, or age distributions. Therefore, our set of candidate models includes linear models and additive models that are possibly non-linear. Preliminary analysis shows that baseline income (Inc75) is clearly related to the response, so this covariate is included in every candidate model.

The linear candidate models are those containing different subsets of the variables {T, Educ, Age, Married, T:Inc75, T:Educ, T:Age, T:Married}. Model hierarchy is enforced, meaning that if a treatment-covariate interaction is included in the model, both corresponding main effects must be included as well. This constraint applied to this set of variables allows for 62 possible linear models.

The additive candidate models are estimated with the gam function from the R mgcv package (Wood, 2006). Each term in the additive model is a smooth, possibly non-linear function of a single variable. Treatment-covariate interactions are estimated by allowing these functions to differ for each covariate depending on the value of the treatment variable. The default choice of smoothing parameter (based on generalized cross-validation) is used to fit each model. Terms involving categorical variables (T, Married, and T:Married) in the additive models are linear. Model hierarchy is enforced as in the linear model consideration set, generating 62 possible additive models and 124 candidate models in all.

Sixteen of the models have no treatment effects (i.e., ∆̂ = 0). Since these models produce different estimates of the full regression function, they are considered as separate procedures in the TEEM combination algorithm so that all selection and combination methods utilize the same set of candidates. Note that each of these models will have the same weight W̄_j in the TEEM algorithm.
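Under our reading of the hierarchy constraint, the model count can be verified by enumeration. The variable names follow the text, but the encoding of a model as a set of terms is our own; the 16 no-treatment-effect models are the 8 counted below in each of the linear and additive classes.

```python
from itertools import product

MAIN = ["T", "Educ", "Age", "Married"]          # Inc75 is always included
INTERACTIONS = ["T:Inc75", "T:Educ", "T:Age", "T:Married"]

def hierarchical_models():
    """Enumerate variable subsets obeying model hierarchy: a treatment-
    covariate interaction requires both corresponding main effects
    (Inc75's main effect is always present)."""
    models = []
    for mains in product([False, True], repeat=len(MAIN)):
        included = {v for v, keep in zip(MAIN, mains) if keep}
        for ints in product([False, True], repeat=len(INTERACTIONS)):
            terms = {v for v, keep in zip(INTERACTIONS, ints) if keep}
            ok = all(
                "T" in included and (cov == "Inc75" or cov in included)
                for term in terms
                for cov in [term.split(":")[1]]
            )
            if ok:
                models.append(included | terms)
    return models

models = hierarchical_models()
no_effect = [m for m in models if "T" not in m]  # these imply Delta_hat == 0
print(len(models), len(no_effect))               # -> 62 8
```

The count decomposes as 8 models without T (no interactions allowed) plus 2·3³ = 54 models with T, matching the 62 stated in the text.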

4.3 Model Selection and Combination Methods

In this section we describe the eight model selection and combination methods that are applied to the candidate models described in the previous section. Each selection or combination method applied to the candidate models will produce an estimate of $\Delta$, the treatment's average effect on change in income conditional on covariates.

Model Selection Methods. Four methods of model selection are used to choose one of the 124 candidate models. Two of these are the familiar criteria of AIC (Akaike, 1974) and BIC (Schwarz, 1978). These likelihood-based criteria can be used to compare additive models to linear models because the additive model can be represented as a penalized likelihood, where the penalty is a measure of the wiggliness (roughness) of the function. The third model selection method is traditional cross-validation (CV), which uses individual prediction errors of the response to select a model. To implement CV, we use half of the data to estimate the candidate models, then compute the average squared prediction error of each candidate estimate on the other half. We repeat the process 100 times to average out the variability in data splitting, and the model chosen by CV is the one with the lowest average squared error over the 100 splits.

Treatment effect cross-validation (TECV; Rolling and Yang, 2014), a form of cross-validation targeted to selecting an accurate estimate of the treatment effect from a set of candidates, is the fourth model selection method used. As with traditional CV, we randomly split the data into equal-sized training and evaluation samples 100 times. The model with the lowest average TECV statistic over the 100 splits is chosen by treatment effect cross-validation. The TECV method is targeted to estimation of $\Delta$, while the other three model selection methods are targeted to estimation of the conditional mean of $Y$.

Model Combination Methods. Four model combination methods are analyzed in this study. Each of them forms an estimate of the treatment effect $\Delta$ from a convex combination of the candidate procedures. That is, each combination method produces a $\hat{\Delta}$ function by

\[
\hat{\Delta}(u) = \sum_{j=1}^{J} w_j \hat{\Delta}_{n,j}(u), \tag{13}
\]

where $J = 124$ in this case and $\hat{\Delta}_{n,j}$ is the estimate of $\Delta$ produced by applying procedure $j$ to the entire sample of size $n$. The combination methods differ in their choices of the weights $w_j$.

One benchmark method of model combination is Bayesian Model Averaging (BMA), in which each $w_j$ represents the posterior probability of model $j$. Raftery (1995) derived the approximation

\[
w_j = \frac{\exp(-\tfrac{1}{2}\,\mathrm{BIC}_j)}{\sum_{k=1}^{J} \exp(-\tfrac{1}{2}\,\mathrm{BIC}_k)} \tag{14}
\]

to the posterior probability of model $j$ when the models have equal prior probabilities. Buckland, Burnham, and Augustin (1997) suggested combining models based on AIC by replacing $\mathrm{BIC}_j$ with $\mathrm{AIC}_j$ in the above expression for $w_j$. We refer to these two weighting schemes as BMA and cAIC, respectively.

Adaptive Regression by Mixing (ARM; Yang, 2001) was discussed briefly in Section 1. ARM is a method of model combination based on data splitting that targets estimation of the full regression function. Essentially, in ARM each model's weight $w_j$ is based on the model's ability to predict the response on outside data. We construct the ARM weights $w_j$ by assuming normal errors, using each procedure separately to estimate the error variance, and averaging the weights over 100 different 50/50 data splits. Finally, the version of TEEM described in Section 3.3 is used with no caliper and 100 different 50/50 data splits. To save computing time and compare the methods most accurately, the same 100 random data splits are used for CV, TECV, ARM, and TEEM.
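To make the weighting concrete, the following sketch (Python for illustration; the BIC values and candidate estimates are made up) computes weights of the form in (14) and forms the convex combination in (13). Subtracting the minimum criterion value before exponentiating is a standard numerical-stability step that leaves the normalized weights unchanged:

```python
import numpy as np

def ic_weights(ic):
    """Weights proportional to exp(-IC_j / 2), as in (14).
    Shifting by the minimum IC avoids underflow without
    changing the normalized weights."""
    ic = np.asarray(ic, dtype=float)
    w = np.exp(-(ic - ic.min()) / 2.0)
    return w / w.sum()

# Hypothetical BIC values for three candidate models.
w = ic_weights([100.0, 102.0, 110.0])

# Convex combination of the candidates' treatment effect
# estimates at some covariate value u, as in (13).
delta_hat = np.array([5.0, 8.0, -1.0])   # candidate estimates (made up)
combined = float(w @ delta_hat)
print(w.round(4), round(combined, 3))
```

Because the weights are convex, the combined estimate always lies between the smallest and largest candidate estimates.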

4.4 Results

[Table 2 about here]

Table 2 summarizes the results of the model selection and combination methods when applied to the LaLonde data. For each model selection method, the table lists the type of model (additive or linear) that was selected, along with the active variables in the model. For each selection and combination method $\nu$, the mean and standard deviation (SD) of the resulting estimator $\hat{\Delta}_\nu(U_i)$ over the 722 data values of $U_i$ are reported in Table 2. The standard deviation of $\hat{\Delta}_\nu(U_i)$ measures the amount of treatment effect heterogeneity indicated by the estimator resulting from the method $\nu$.

Interestingly, each of the four model selection methods chooses a different model, with each implying something different about the treatment's effect on the outcome. The additive model selected by AIC implies the treatment effect varies nonlinearly with pre-treatment income and age, as well as being different for married vs. single people. The model selected by BIC implies the NSW treatment has no effect at all on the outcome, while the model chosen by traditional CV implies a homogeneous positive treatment effect. TECV, the model selection method targeted to estimation of $\Delta$, selects a linear model implying the treatment effect differs by marital status. The coefficient of the T:Married interaction is positive in the model selected by TECV, indicating the treatment provides a greater benefit to married men.

[Figure 1 about here]

The top half of Table 2 suggests substantial model selection uncertainty in this analysis. In such situations, model combination often provides a good compromise between similar-performing models that give quite different estimates. In Figure 1, the $\hat{\Delta}_\nu(U_i)$ values from the LaLonde data resulting from AIC, CV, TECV, and TEEM are plotted against pre-treatment income. There is substantial treatment effect heterogeneity suggested by $\hat{\Delta}_{\mathrm{AIC}}$, with an overall negative association between $\hat{\Delta}_{\mathrm{AIC}}$ and pre-treatment income. For the models selected by CV and TECV, there is no interaction between the treatment and baseline income. The TEEM model combination lies somewhere in between these extremes, exhibiting some negative association between $\hat{\Delta}_{\mathrm{TEEM}}$ and pre-treatment income but with much less variability than $\hat{\Delta}_{\mathrm{AIC}}$. Although we do not know how the treatment effect truly varies with pre-treatment income, it is reasonable to believe that those who entered the program with a higher income benefited less from the program overall. At the same time, it seems unlikely that the treatment effect is as heterogeneous as $\hat{\Delta}_{\mathrm{AIC}}$ suggests.

[Figure 2 about here]

[Figure 3 about here]

The decreased variability in $\hat{\Delta}_{\mathrm{TEEM}}$ compared with $\hat{\Delta}_{\mathrm{AIC}}$ can also be seen in Figures 2 and 3. These are contour plots of $\hat{\Delta}_\nu$ from AIC and TEEM, respectively, over the range of Inc75 and Education in the LaLonde data. Age is set to 25 (the sample mean), and subfigures (a) and (b) show results for single and married males, respectively. The points on the contour plots represent values of Inc75 and Education realized in the LaLonde sample for the single and married subgroups.

The plots in Figure 2 show the large heterogeneity in $\hat{\Delta}_{\mathrm{AIC}}$ with respect to both the Inc75 and Education variables. In particular, the heterogeneity and non-linearity of the estimated treatment effect with respect to education in Figure 2 seem implausible. The plots in Figure 3 show a much more reasonable degree of heterogeneity (note the differing scales in Figures 2 and 3). The contour plots of $\hat{\Delta}_{\mathrm{TEEM}}$ in Figure 3 show a positive estimated treatment effect for most of the individuals, including all of those with no pre-treatment income. TEEM does suggest some treatment effect heterogeneity, with the program estimated to be more beneficial for those with little or no income, those with more education, and married participants.

5. "CROSS-EXAMINATION" OF LALONDE DATA

5.1 Setup

To gain further insight into the performance of model selection and combination methods on the LaLonde data, we perform a guided simulation experiment that evaluates the methods under different simulation scenarios consistent with the data. Each simulation scenario is based on a model selected by one of the model selection methods. Li, Lue, and Chen (2000) called this type of simulation a “cross-examination”.


Specifically, for each model selection method $\tau$, we generate a response vector $Y_\tau^*$ by adding i.i.d. noise to the estimated regression functions from the model selected by $\tau$ at the $n = 722$ sample values of $(T, U)$. The noise is generated from a mean-zero Gaussian distribution with variance equal to the error variance estimate from the model selected by $\tau$. For each $\tau$, all eight model selection and combination methods are then applied to $(Y_\tau^*, T, U)$ to produce an estimate of $\Delta$. Since the true $\Delta_\tau$ is known for each choice of $\tau$, we can compare the performances of the estimators from the selection and combination methods under each version of the truth put forth by the four model selection methods. For each selection or combination method $\nu$, we estimate the risk of each $\hat{\Delta}_{\tau,\nu}$ under squared error loss by averaging $[\Delta_\tau(U_i) - \hat{\Delta}_{\tau,\nu}(U_i)]^2$ over the $n = 722$ sample values of $U_i$. To average out the variability in the random errors, the results are aggregated over 100 different realizations of the error vector.

The active variables involved in generating $Y_\tau^*$ under each of the four scenarios can be found by looking at the model selected by each method in Table 2. While $\Delta_{\mathrm{AIC}}$ varies nonlinearly with the Inc75 and Educ variables, the BIC scenario has $\Delta_{\mathrm{BIC}} = 0$. The scenario under CV has $\Delta_{\mathrm{CV}} = 6.6$. Finally, $\Delta_{\mathrm{TECV}}$ differs by marital status but is otherwise constant; specifically, $\Delta_{\mathrm{TECV}}$ is 20.5 for married males and 3.8 for unmarried males.
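The simulation loop just described can be sketched as follows. This is a simplified stand-in (Python for illustration), not the paper's actual fitted models: the true effects 20.5 and 3.8 mirror the TECV scenario above, but the design, the error standard deviation, and the subgroup difference-in-means estimator are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 722                                   # sample size, as in the LaLonde data

# Hypothetical "truth" in the style of the TECV scenario.
married = rng.integers(0, 2, n)
treated = rng.integers(0, 2, n)
delta_true = np.where(married == 1, 20.5, 3.8)    # true conditional effect
mu = 5.0 + treated * delta_true                   # mean response (illustrative)
sigma = 10.0                                      # error sd (illustrative)

def estimate_delta(y, treated, married):
    """A toy candidate: treated-minus-control mean within each subgroup."""
    out = np.empty_like(y)
    for m in (0, 1):
        grp = married == m
        diff = y[grp & (treated == 1)].mean() - y[grp & (treated == 0)].mean()
        out[grp] = diff
    return out

n_reps = 100
risk = 0.0
for _ in range(n_reps):
    y_star = mu + rng.normal(0.0, sigma, n)       # simulate a response vector
    delta_hat = estimate_delta(y_star, treated, married)
    risk += np.mean((delta_true - delta_hat) ** 2) / n_reps

print(risk)                                       # estimated risk under squared error loss
```

In the actual study each candidate procedure is re-applied to every simulated data set, and the averaged squared errors populate Table 3.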

5.2 Results

[Table 3 about here]

Table 3 shows the estimated risk (average mean squared error) of $\hat{\Delta}_{\tau,\nu}$ as an estimator of $\Delta_\tau$ for each of the eight model selection and combination methods under each of the four scenarios. For the scenario with a non-linear $\Delta$ (the model selected by AIC), the model combination methods of ARM and TEEM significantly outperform all other methods, including AIC. AIC does not perform very well even on its own "home field" because even if the true additive model is selected, there is substantial variability involved in estimating the nonlinear $\Delta_{\mathrm{AIC}}$ with a nonlinear estimate. When $\Delta = 0$ (the BIC scenario), BIC usually chooses a model with no treatment effect and thus has the lowest risk. In the two scenarios based on the models chosen by CV and TECV, the model combination methods of ARM and TEEM perform the best. Overall, the performances of ARM and TEEM are similar, with TEEM performing perhaps slightly better under the models selected by AIC and TECV, the two scenarios under which the treatment effect is heterogeneous.

[Figure 4 about here]

A summary of each method's estimated risk across all four scenarios can be found in Figure 4. Each data point in the boxplot represents an average of $[\Delta_\tau(U_i) - \hat{\Delta}_{\tau,\nu}(U_i)]^2$ over the $n = 722$ realized $U_i$ values. There are 100 realizations generated from each of the four $\tau$ scenarios; therefore, 400 data points are summarized in each boxplot. Among the model selection methods, the TECV method that targets the treatment effect had the lowest median MSE for $\Delta$. Model combination methods possessed lower risk than model selection methods on average, with the methods of ARM and TEEM performing the best overall.

6. CONCLUSION

There is much theoretical and empirical evidence in the literature to support the practice of model combination rather than model selection when the goal is accurate estimation or prediction of a response. Our work indicates that model combination is also useful for the purpose of estimating a possibly heterogeneous treatment effect. The problem of treatment effect estimation differs from traditional regression in some ways and may require different candidate models to receive higher weights; within a given candidate set, the models that are best for prediction may not be best for treatment effect estimation. Therefore, there is a need for model combination methods targeted to the treatment effect. The TEEM algorithm proposed in this work is one such method.

TEEM has the flexibility to simultaneously combine traditional regression models, machine learning procedures, and any other current or future method proposed in the growing area of treatment effect estimation. We provide an oracle inequality for the TEEM estimator under squared error loss that guarantees the TEEM estimator will converge to the true $\Delta$ as long as at least one of the candidate estimators converges to $\Delta$; the convergence will be at nearly the best rate offered by the candidates if the covariate vector is one-dimensional. An analysis of the benchmark LaLonde labor training data shows that TEEM provides a sensible data-driven weighting of linear and non-linear treatment effect estimates, and a guided simulation provides evidence that in this setting, TEEM compares favorably with other selection and combination methods in providing accurate estimates of the treatment effect.

Combining treatment effect estimators is a rich topic with abundant practical applications and a number of future research directions. We present our results under the traditional squared error loss; however, in practice treatment effect estimates may inform decisions, such as assigning or prioritizing treatments to units, that suggest other loss functions. Different loss functions may lead to different formulas for assigning weights to the candidate estimators. Kernel-based methods to estimate the accuracy of the candidate treatment effect estimators may result in less variability than our proposed nearest-neighbor pairing scheme, although the "curse of dimensionality" remains a factor in theory and practice. Finally, we do not claim that our method of combining estimates of treatment effects assigns the optimal weights to the procedures. Calculation of theoretically optimal combining weights requires estimation of the covariances between the candidate estimates, a notoriously difficult task that often introduces substantial estimation error. Smith and Wallis (2009) and Claeskens, Magnus, Vasnev, and Wang (2014) discuss this issue for the problem of forecast combination. Yang (2004), also in a forecast combination context, terms the search for the optimal linear combination of procedures "combining for improvement" and distinguishes it from the more modest goal of "combining for adaptation", which targets the performance of the best candidate. Combining for improvement is shown to pay an unavoidable price in performance as a result of searching for the optimal weights. The TEEM method falls into the category of combining for adaptation, because its aim is to achieve the best performance (for estimating the treatment effect) among the candidates. In this sense, the objective of TEEM is similar to the objective of a model selection method that targets the treatment effect; however, the TEEM estimator may outperform post-model-selection estimators because of the latter's high variability. Indeed, the results in Section 5 show that when there is a large degree of model selection uncertainty, the TEEM method of model combination results in an estimator of the treatment effect that often is more accurate than an estimator from a single selected candidate. Our work demonstrates that a properly targeted method of model combination can provide large advantages over model selection in the important setting of treatment effect estimation.

REFERENCES

Abadie, A. & G. W. Imbens (2006) Large sample properties of matching estimators for average treatment effects. Econometrica 74, 235–267.

Abadie, A. & G. W. Imbens (2011) Bias-corrected matching estimators for average treatment effects. Journal of Business and Economic Statistics 29, 1–11.

Akaike, H. (1974) A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716–723.

Althauser, R. P. & D. Rubin (1970) The computerized construction of a matched sample. American Journal of Sociology 76, 325–346.

Barron, A. R. (1987) Are Bayes rules consistent in information? In T. M. Cover & B. Gopinath (eds.), Open Problems in Communication and Computation, pp. 85–91. Springer-Verlag.

Breiman, L. (1996) Bagging predictors. Machine Learning 24, 123–140.

Breiman, L. (2001) Random forests. Machine Learning 45, 5–32.

Buckland, S., K. Burnham & N. Augustin (1997) Model selection: An integral part of inference. Biometrics 53, 603–618.

Bühlmann, P. & B. Yu (2002) Analyzing bagging. Annals of Statistics 30, 927–961.

Cai, T., L. Tian, P. H. Wong & L. Wei (2011) Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics 12, 270–282.

Chvátal, V. (1979) The tail of the hypergeometric distribution. Discrete Mathematics 25, 285–287.

Claeskens, G., J. Magnus, A. L. Vasnev & W. Wang (2014) The forecast combination puzzle: A simple theoretical explanation. Tinbergen Institute Discussion Paper 14-127/III.

Cook, R. D. (1998) Regression Graphics. Wiley.

Dehejia, R. H. & S. Wahba (1999) Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association 94, 1053–1062.

Freund, Y. & R. E. Schapire (1996) Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148–156.

Green, D. P. & H. L. Kern (2012) Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees. Public Opinion Quarterly 76, 491–511.

Hjort, N. L. & G. Claeskens (2003) Frequentist model average estimators. Journal of the American Statistical Association 98, 879–899.

Hoeting, J. A., D. Madigan, A. E. Raftery & C. T. Volinsky (1999) Bayesian model averaging: A tutorial. Statistical Science 14, 382–417.

Holland, P. W. (1986) Statistics and causal inference. Journal of the American Statistical Association 81, 945–960.

Imai, K. & M. Ratkovic (2013) Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics 7, 443–470.

Imbens, G. & J. M. Wooldridge (2009) Recent developments in the econometrics of program evaluation. Journal of Economic Literature 47, 5–86.

LaLonde, R. J. (1986) Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review 76, 604–620.

Li, K.-C., H.-H. Lue & C.-H. Chen (2000) Interactive tree-structured regression via principal Hessian directions. Journal of the American Statistical Association 95, 547–560.

Qian, M. & S. A. Murphy (2011) Performance guarantees for individualized treatment rules. The Annals of Statistics 39, 1180–1210.

Raftery, A. E. (1995) Bayesian model selection in social research. Sociological Methodology 25, 111–163.

Rice, J. (1984) Bandwidth choice for nonparametric regression. The Annals of Statistics 12, 1215–1230.

Rolling, C. A. & Y. Yang (2014) Model selection for estimating treatment effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76, 749–769.

Schwarz, G. (1978) Estimating the dimension of a model. The Annals of Statistics 6, 461–464.

Smith, J. & K. Wallis (2009) A simple explanation of the forecast combination puzzle. Oxford Bulletin of Economics and Statistics 71, 331–355.

Taddy, M., M. Gardner, L. Chen & D. Draper (2015) A nonparametric Bayesian analysis of heterogeneous treatment effects in digital experimentation. ArXiv:1412.8563v3 [stat.AP].

Wood, S. N. (2006) Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC.

Yang, Y. (2001) Adaptive regression by mixing. Journal of the American Statistical Association 96, 574–588.

Yang, Y. (2003) Regression with multiple candidate models: Selecting or mixing? Statistica Sinica 13, 783–809.

Yang, Y. (2004) Combining forecasting procedures: Some theoretical results. Econometric Theory 20, 176–222.

Zhang, B. (in press) Empirical likelihood in causal inference. Econometric Reviews.

APPENDIX Proof of Theorem 1 First let P = 1, where P is the number of permutations from step 8 of the algorithm. For each pair m created in step 2 of the algorithm, denote the realized values of (Umt , Umc ) as (umt , umc ), and let δem = Ymt − Ymc . Conditional on (Umt , Umc ) = (umt , umc ), the density of δem under ∆, fc and σ can be expressed as { } em − ∆(um ) − [fc (um ) − fc (um )] 1 δ t t c p∆,fc ,σ (δem |umt , umc ) = ϕ . σ σ b and σ The estimated density of δem under ∆ ˆ and supposing fc (umt ) = fc (umc ) is { } em − ∆(u b mt ) 1 δ e p∆,ˆ ϕ . b σ (δm |umt , umc ) = σ ˆ σ ˆ Define J ∑

q1 (δe1 |u1t , u1c ) =

ωj p∆ bn

1 ,j

j=1

and for 2 ≤ m ≤ n e2 , define ∑J qm (δem |umt , umc ) =

j=1 ωj

[∏ m−1

,ˆ σn1 ,j

(δe1 |u1t , u1c ),

] el |ul , ul ) p b ( δ t c ∆n ,j

e

(δm |umt , umc ) ,ˆ σn1 ,ˆ σ 1 ,j n1 ,j . ∑J ∏m−1 (δel |ult , ulc ) b n ,j ,ˆ j=1 ωj l=1 p∆ σ n ,j 1 1

l=1

p∆ bn

1 ,j

The error density ϕ has mean 0; therefore, given π, Z(1) , (ult , ulc , ylt , ylc )m−1 l=1 , and (umt , umc ), ∑ b e qm (δem |umt , umc ) has mean j Wm,j ∆n1 ,j (umt ) = ∆m (umt ), where Wm,j represent the weights defined in step 5 of the TEEM algorithm. Let

n e2 [ ] ∏ e2 gj (δem )nm=1 = p∆ bn

1 ,j

m=1

29

,ˆ σn1 ,j

(δem |umt , umc ),

Combining Estimates of CTE

and let

J [ ] ∑ [ ] n e2 e2 e ge (δm )m=1 = ωj gj (δem )nm=1 . j=1

[ ] ∏e2 e2 Note that nm=1 qm (δem |umt , umc ) = ge (δem )nm=1 . One can view qm (δem |umt , umc ) as an estimator of the conditional density of δem given (umt , umc ). The cumulative risk, under the e2 can Kullback-Leibler divergence, of qm (δem |umt , umc ) at the design points (umt , umc )nm=1

be bounded in terms of the risks of the individual procedures using an idea from Barron (1987). Letting Eπ denote the expectation conditional on the permutation π and D(f ||g) the Kullback-Leibler divergence of g from f , we have n e2 ∑ m=1 n e2 ∑

Eπ D[p∆,fc ,σ (δem |umt , umc )||qm (δem |umt , umc )] ∫

p∆,fc ,σ (δem |umt , umc ) e dδm p∆,fc ,σ (δem |umt , umc ) log qm (δem |umt , umc ) m=1 } ∫ {∏ n e2 n e2 ∑ p∆,fc ,σ (δem |umt , umc ) e = dδm Eπ p∆,fc ,σ (δem |umt , umc ) log em |um , um ) q ( δ m t c m=1 m=1 } { ne } ∫ {∏ n e2 2 e ∑ p ( δ |u , u ) mc ∆,fc ,σ m mt =Eπ p∆,fc ,σ (δem |umt , umc ) log dδe1 · · · dδene2 e q ( δ |u , u ) m m mt mc m=1 m=1 } ∏ne2 ∫ {∏ n e2 e m=1 p∆,fc ,σ (δm |umt , umc ) e =Eπ p∆,fc ,σ (δem |umt , umc ) log ∏ dδ1 · · · dδene2 n e2 e q ( δ |u , u ) mc m=1 m m mt m=1 } ∏ne2 ∫ {∏ n e2 p∆,f ,σ (δem |umt , umc ) e ] dδ1 · · · dδene2 . =Eπ p∆,fc ,σ (δem |umt , umc ) log m=1 [ c n e2 e g e ( δ ) m=1 m m=1 =



Since ϕ is a positive-valued function and log(x) is an increasing function, we have that for any j ≥ 1, } ∏ne2 ∫ {∏ n e2 p∆,f ,σ (δem |umt , umc ) e ] Eπ p∆,fc ,σ (δem |umt , umc ) log m=1 [ c dδ1 · · · dδene2 n e2 e g e ( δ ) m=1 m m=1 } ∏ ∫ {∏ n e2 n e2 p∆,fc ,σ (δem |umt , umc ) e [ ] ≤ Eπ p∆,fc ,σ (δem |umt , umc ) log m=1 dδ1 · · · dδene2 n e2 e ωj gj (δm )m=1 m=1 { } ∏ne2 ∫ n e2 ∏ p∆,f ,σ (δem |umt , umc ) e 1 e ] = log + Eπ p∆,fc ,σ (δm |umt , umc ) log m=1 [c dδ1 · · · dδene2 . n e2 ωj e gj (δm )m=1 m=1 30

Combining Estimates of CTE

The last term in the preceding equation is the cumulative risk, under the Kullback-Leibler divergence, of p∆ bn

1 ,j

,ˆ σn1 ,j

e2 at the design points (umt , umc )nm=1 , given the permutation π.

This is because } ∏ne2 ∫ {∏ n e2 p∆,f ,σ (δem |umt , umc ) e e ] Eπ p∆,fc ,σ (δm |umt , umc ) log m=1 [c dδ1 · · · dδene2 n e2 e gj (δm )m=1 m=1   } ne ∫ {∏ n e2 2 ∑  e p ( δ |u , u ) mc ∆,fc ,σ m mt =Eπ dδe · · · dδene2 p∆,fc ,σ (δem |umt , umc ) log em |um , um )  1  p ( δ b t c m=1 m=1 ∆n ,j ,ˆ σn ,j =

n e2 ∑

∫ Eπ

1

p∆,fc ,σ (δem |umt , umc ) log

m=1

=

n e2 ∑

p∆,fc ,σ (δem |umt , umc ) dδem e p∆ ( δ |u , u ) mc b n ,j ,ˆ σn ,j m mt 1

Eπ D[p∆,fc ,σ (δem |umt , umc )||p∆ bn

1 ,j

m=1

1

1

,ˆ σn1 ,j

(δem |umt , umc )].

By definition, (δe |u , umc )] D[p∆,fc ,σ (δem |umt , umc )||p∆ b n ,j ,ˆ σn1 ,j m mt 1 { } ∫ ( 1 δem − ∆(umt ) − [fc (umt ) − fc (umc )] = ϕ σ σ ({ } )) (1/σ)ϕ δem − ∆(umt ) − [fc (umt ) − fc (umc )] /σ {[ ] } dδem . × log e b (1/ˆ σn1 ,j )ϕ δm − ∆n1 ,j (umt ) /ˆ σn1 ,j Letting z=

δem − ∆(umt ) − [fc (umt ) − fc (umc )] , σ

we perform an integral transformation to obtain D[p∆,fc ,σ (δem |umt , umc )||p∆ (δe |u , umc )] b n ,j ,ˆ σn1 ,j m mt 1 ∫ ϕ(z) { } dz. = ϕ(z) log b n ,j (umt ) + [fc (umt ) − fc (umc )]/ˆ (σ/ˆ σn1 ,j )ϕ σz + ∆(umt ) − ∆ σn1 ,j 1 The standard normal p.d.f. ϕ has the property that for each pair 0 < s0 < 1 and T > 0, there exists a constant B0 (depending on s0 and T ) such that ∫ ϕ(x) log

ϕ(x) dx ≤ B0 [(1 − s)2 + t2 ] (1/s)ϕ[(x − t)/s] 31

Combining Estimates of CTE

for all s0 ≤ s ≤ 1/s0 and −T < t < T (see Assumption A2 in Yang, 2001). Using this fact and taking s0 = σ/σ, s = σ ˆn1 ,j /σ, T = 4A/σ, and [ ] [ ]  b n ,j (umt ) + fc (umt ) − fc (umc )   ∆(umt ) − ∆ 1 t=− ,   σ it follows that D[p∆,fc ,σ (δem |umt , umc )||p∆ (δe |u , umc )] b n ,j ,ˆ σn1 ,j m mt 1  ] [ ] 2  [ { }2  ∆(u ) − ∆ b  (u ) + f (u ) − f (u ) m n ,j m c m c m t t t c 1 σ ˆn ,j   ≤ B0  1 − 1 + ,   σ σ for a constant B0 depending on A, σ, and σ. Using σ 2 ≥ 2σ 2 and the parallelogram law, we obtain that for any j ≥ 1, (δe |u , umc )] D[p∆,fc ,σ (δem |umt , umc )||p∆ b n ,j ,ˆ σn1 ,j m mt 1 { [ ]2 [ ]2 [ ]2 } B0 1 b n ,j (umt ) + fc (umt ) − fc (umc ) ≤ 2 σ−σ ˆn1 ,j + ∆(umt ) − ∆ . 1 σ 2 Thus we have shown n e2 1 ∑ Eπ D[p∆,fc ,σ (δem |umt , umc )||qm (δem |umt , umc )] n e2 m=1 ( n e2 [ ]2 B0 ∑ 1 1 ≤ 2 Eπ fc (umt ) − fc (umc ) + inf log j σ n e2 n e2 ωj m=1 { }) n e2 [ ]2 B0 1 1 ∑ 2 b n ,j (umt ) + 2 Eπ (σ − σ ˆn1 ,j ) + Eπ ∆(umt ) − ∆ . 1 σ 2 n e2

(A.1)

m=1

Let d2H (f, g) =

∫ √ √ ( f − g)2 dν denote the squared Hellinger distance between the

densities f and g with respect to the measure ν. The squared Hellinger distance is upper bounded by the K-L divergence, so n e2 1 ∑ Eπ d2H [p∆,fc ,σ (δem |umt , umc ), qm (δem |umt , umc )] n e2 m=1

32

Combining Estimates of CTE

is bounded above by (A.1). As mentioned earlier, for each m, given π, Z(1) , (ult , ulc , ylt , ylc )m−1 l=1 , and (umt , umc ), e m (umt ) with respect to δem . For this estimator, we have qm (δem |umt , umc ) has mean ∆ [∫

δem p∆,fc ,σ (δem |umt , umc )dδem −



δem qm (δem |umt , umc )dδem

]2

{∫

}2 [ ] e e e e = δm p∆,fc ,σ (δm |umt , umc ) − qm (δm |umt , umc ) dδm [√ {∫ ] √ = δem p∆,fc ,σ (δem |umt , umc ) + qm (δem |umt , umc ) × ∫ ≤

2 δem

[√

[√

p∆,fc ,σ (δem |umt , umc ) −

p∆,fc ,σ (δem |umt , umc ) +





}2 ] qm (δem |umt , umc ) dδem

qm (δem |umt , umc )

]2

dδem

∫ [√

]2 √ e e × p∆,fc ,σ (δm |umt , umc ) − qm (δm |umt , umc ) dδem [∫ ] ∫ 2 2 e e e e e ≤2 δm p∆,fc ,σ (δm |umt , umc ) + δm qm (δm |umt , umc )dδm ∫ [√

]2 √ e e × p∆,fc ,σ (δm |umt , umc ) − qm (δm |umt , umc ) dδem [ ] ∫ 2 2 e e e e =2 E(δm |umt , umc ) + δm qm (δm |umt , umc )dδm [ ] × d2H p∆,fc ,σ (δem |umt , umc ), qm (δem |umt , umc ) {[ } ∫ ]2 2 2 =2 E(δem |umt , umc ) + σ + δem qm (δem |umt , umc )dδem [ ] × d2H p∆,fc ,σ (δem |umt , umc ), qm (δem |umt , umc ) {[ } ∫ ]2 2 2 e e e =2 ∆(umt ) + fc (umt ) − fc (umc ) + σ + δm qm (δm |umt , umc )dδm [ ] × d2H p∆,fc ,σ (δem |umt , umc ), qm (δem |umt , umc ) , where the first and second inequalities follow from the Cauchy-Schwarz inequality and the parallelogram law, respectively. By the third regularity condition, [∆(umt ) + fc (umt ) − fc (umc )]2 ≤ (4A)2 . Now ∫

2 2 q (δ 2 e e e e2 e δem m m |umt , umc )dδm = Eqm (δm |umt , umc ) ≤ [Eqm (δm |umt , umc )] +σ , and qm (δm |umt , umc )

is a convex combination of J densities in the location-scale family ϕ[(x − b)/a]/a, each with 33

Combining Estimates of CTE b n ,j (umt ) with respect to δem . Therefore, mean ∆ 1



2 q (δ e e δem m m |umt , umc )dδm is bounded above

by (2A)2 + σ 2 . It follows that [∫

δem p∆,fc ,σ (δem |umt , umc )dδem −



δem qm (δem |umt , umc )dδem

]2

[ ] ≤ (40A2 + 4σ 2 )d2H p∆,fc ,σ (δem |umt , umc ), qm (δem |umt , umc ) . Together with ∫

δem p∆,fc ,σ (δem |umt , umc )dδem = E(δem |umt , umc ) = ∆(umt ) + fc (umt ) − fc (umc )

and



e m (umt ), δem qm (δem |umt , umc )dδem = ∆

we have, for each 1 ≤ m ≤ n e2 , [ ]2 e m (umt ) ∆(umt ) + fc (umt ) − fc (umc ) − ∆ [ ] ≤ (40A2 + 4σ 2 )d2H p∆,fc ,σ (δem |umt , umc ), qm (δem |umt , umc ) .

(A.2)

{ }2 e m (umt ) . The expression (A.2) also is an upper bound for ∆(umt ) − [fc (umt ) − fc (umc )] − ∆ [ ]2 e m (umt ) . Then by So by the parallelogram law, (A.2) is an upper bound for ∆(umt ) − ∆ using the earlier risk bound on the average squared Hellinger distance and combining constants, we obtain n e2 [ ]2 1 ∑ e m (umt ) Eπ ∆(umt ) − ∆ n e2 m=1 ( { n e2 ]2 [ 1 ∑ 1 1 ≤ B2 Eπ fc (umt ) − fc (umc ) + inf log j n e2 n e2 ωj m=1

n e2 [ ]2 1 ∑ b n ,j (umt ) + Eπ (σ − σ ˆn1 ,j ) + Eπ ∆(umt ) − ∆ 1 n e2 2

}) ,

(A.3)

m=1

where B2 depends on σ, σ, and A. e e π to the average risk of the individual Now we connect the global risk of the estimator ∆ e m at the design points. Let Dπ denote the event that n estimators ∆ e2 = (1/h)p ; that is, the 34

Combining Estimates of CTE

event that every cell in the partition of U contains at least one treatment-control pair from Z(2) after the permutation π. Let Um denote the cell in the partition containing the mth treatment-control pair. Conditional on Dπ , e e π ∥2 Eπ ∥∆ − ∆ 2 ]2 ∫ [ e e = Eπ ∆(u) − ∆π (u) dPU U

= Eπ

n e2 ∫ ∑ m=1 Um

[

]2 e e ∆(u) − ∆π (u) dPU .

e e e π , for any u ∈ Um , ∆ e π (u) = ∆ e m (umt ). Therefore, for u ∈ Um , By the definition of ∆ [ ]2 e e ∆(u) − ∆π (u) =

{[ ] [ ]}2 e m (umt ) ∆(u) − ∆(umt ) + ∆(umt ) − ∆ [ ]2 [ ]2 e m (umt ) . ≤ 2 ∆(u) − ∆(umt ) + 2 ∆(umt ) − ∆

Combining the previous two displays and using the fact that for any m,

∫ Um

dPU ≤ c/e n2 ,

we have e e π ∥2 Eπ ∥∆ − ∆ 2 { ne ∫ 2 ∑ ≤ 2Eπ m=1

n e2 [ [ ]2 ]2 c ∑ e m (umt ) ∆(u) − ∆(umt ) dPU + ∆(umt ) − ∆ n e2 Um

} .

(A.4)

m=1

For the first summation on the right-hand side of (A.4), by the Mean Value Theorem for integrals and the fact that every cell Um has volume 1/e n2 , we have n e2 ∫ n e2 [ ]2 [ ]2 ∑ 1 ∑ ∆(u) − ∆(umt ) dPU = f (u∗m ) ∆(u∗m ) − ∆(umt ) , n e2 Um m=1

where

u∗m

m=1

is some point in the hypercube Um and f (u∗m ) represents the design density at this

point. The smoothness conditions on ft and fc imply that ∆ satisfies a Lipschitz condition √ with Lipschitz constant pL. Thus for any m, since the distance between u∗m and umt is √ at most ph, ∆(u∗m ) − ∆(umt ) ≤ pLh. Thus we have n e2 ∫ ∑ m=1 Um

[ ]2 ∆(u) − ∆(umt ) dPU ≤ c(pLh)2 .

35

(A.5)

Combining (A.3), (A.4), and (A.5), we have established
$$E_\pi\left[\left\|\Delta - \tilde{\tilde{\Delta}}_\pi\right\|_2^2 \,\middle|\, D_\pi\right] \le 2\bar{c}(pLh)^2 + 2\bar{c}B_2\left(\frac{1}{\tilde{n}_2}\sum_{m=1}^{\tilde{n}_2} E_\pi\left[f_c(u_{mt}) - f_c(u_{mc})\right]^2 + \inf_j\left\{\frac{1}{\tilde{n}_2}\log\frac{1}{\omega_j} + E_\pi(\sigma - \hat{\sigma}_{n_1,j})^2 + \frac{1}{\tilde{n}_2}\sum_{m=1}^{\tilde{n}_2} E_\pi\left[\Delta(u_{mt}) - \hat{\Delta}_{n_1,j}(u_{mt})\right]^2\right\}\right). \tag{A.6}$$

Next we relate the global risk of each $\hat{\Delta}_{n_1,j}$ to its average risk at the design points. Again using the Mean Value Theorem for integrals and conditioning on $D_\pi$, we have for any $j \ge 1$,
$$\frac{1}{\tilde{n}_2}\sum_{m=1}^{\tilde{n}_2} E_\pi\left[\Delta(u_{mt}) - \hat{\Delta}_{n_1,j}(u_{mt})\right]^2 - E_\pi\left\|\Delta - \hat{\Delta}_{n_1,j}\right\|_2^2 \le \frac{c^*}{\tilde{n}_2}\sum_{m=1}^{\tilde{n}_2} E_\pi\left\{\left[\Delta(u_{mt}) - \hat{\Delta}_{n_1,j}(u_{mt})\right]^2 - \left[\Delta(u_m^*) - \hat{\Delta}_{n_1,j}(u_m^*)\right]^2\right\},$$
where $c^*$ is a constant bounded by $\max(1/\underline{c}, \bar{c})$ that exists by the boundedness of $P_U$. The difference of squared differences inside the summation can be bounded for each $m$ by the smoothness of $\Delta$ and $\hat{\Delta}_{n_1,j}$. Indeed, for each $m$ we have
$$\left[\Delta(u_{mt}) - \hat{\Delta}_{n_1,j}(u_{mt})\right]^2 - \left[\Delta(u_m^*) - \hat{\Delta}_{n_1,j}(u_m^*)\right]^2 = \left\{\left[\Delta(u_{mt}) - \hat{\Delta}_{n_1,j}(u_{mt})\right] + \left[\Delta(u_m^*) - \hat{\Delta}_{n_1,j}(u_m^*)\right]\right\} \times \left\{\left[\Delta(u_{mt}) - \hat{\Delta}_{n_1,j}(u_{mt})\right] - \left[\Delta(u_m^*) - \hat{\Delta}_{n_1,j}(u_m^*)\right]\right\}.$$
Since $\Delta$ and $\hat{\Delta}_{n_1,j}$ are both bounded between $-2A$ and $2A$,
$$\left|\left[\Delta(u_{mt}) - \hat{\Delta}_{n_1,j}(u_{mt})\right] + \left[\Delta(u_m^*) - \hat{\Delta}_{n_1,j}(u_m^*)\right]\right| \le 4A.$$
Meanwhile, the smoothness of $\Delta$ and $\hat{\Delta}_{n_1,j}$ ensures that both satisfy a Lipschitz condition with Lipschitz constant $\sqrt{p}L$. Thus for any $m$, since each $U_m$ has diameter $\sqrt{p}h$,
$$\left|\left[\Delta(u_{mt}) - \hat{\Delta}_{n_1,j}(u_{mt})\right] - \left[\Delta(u_m^*) - \hat{\Delta}_{n_1,j}(u_m^*)\right]\right| = \left|\left[\Delta(u_{mt}) - \Delta(u_m^*)\right] + \left[\hat{\Delta}_{n_1,j}(u_m^*) - \hat{\Delta}_{n_1,j}(u_{mt})\right]\right| \le 2pLh.$$
Therefore, conditional on $D_\pi$,
$$\frac{1}{\tilde{n}_2}\sum_{m=1}^{\tilde{n}_2} E_\pi\left[\Delta(u_{mt}) - \hat{\Delta}_{n_1,j}(u_{mt})\right]^2 \le E_\pi\left\|\Delta - \hat{\Delta}_{n_1,j}\right\|_2^2 + 8c^*ApLh. \tag{A.7}$$

Thus combining (A.7) with (A.6), we have established that
$$E_\pi\left[\left\|\Delta - \tilde{\tilde{\Delta}}_\pi\right\|_2^2 \,\middle|\, D_\pi\right] \le 8c^*ApLh + \bar{c}(pLh)^2 + B_2\left\{\frac{1}{\tilde{n}_2}\sum_{m=1}^{\tilde{n}_2} E_\pi\left[f_c(u_{mt}) - f_c(u_{mc})\right]^2 + \inf_j\left[\frac{1}{\tilde{n}_2}\log\frac{1}{\omega_j} + E_\pi(\sigma - \hat{\sigma}_{n_1,j})^2 + E_\pi\left\|\Delta - \hat{\Delta}_{n_1,j}\right\|_2^2\right]\right\}.$$
Using the Lipschitz condition for $f_c$ within each cell, in a similar fashion as before, we can show that
$$\frac{1}{\tilde{n}_2}\sum_{m=1}^{\tilde{n}_2} E_\pi\left[f_c(u_{mt}) - f_c(u_{mc})\right]^2 \le (pLh)^2.$$
Thus we have
$$E_\pi\left[\left\|\Delta - \tilde{\tilde{\Delta}}_\pi\right\|_2^2 \,\middle|\, D_\pi\right] \le 8c^*ApLh + B_3\left\{(pLh)^2 + \inf_j\left[\frac{1}{\tilde{n}_2}\log\frac{1}{\omega_j} + E_\pi(\sigma - \hat{\sigma}_{n_1,j})^2 + E_\pi\left\|\Delta - \hat{\Delta}_{n_1,j}\right\|_2^2\right]\right\}, \tag{A.8}$$
for a constant $B_3$ depending on $\underline{\sigma}$, $\bar{\sigma}$, $A$, and $\bar{c}$. Now,
$$E_\pi\left\|\Delta - \tilde{\tilde{\Delta}}_\pi\right\|_2^2 \le E_\pi\left[\left\|\Delta - \tilde{\tilde{\Delta}}_\pi\right\|_2^2 \,\middle|\, D_\pi\right] + E_\pi\left[\left\|\Delta - \tilde{\tilde{\Delta}}_\pi\right\|_2^2 \,\middle|\, D_\pi^c\right] \times P(D_\pi^c). \tag{A.9}$$
By the boundedness of $\Delta$ and $\tilde{\tilde{\Delta}}_\pi$ between $-2A$ and $2A$,
$$E_\pi\left[\left\|\Delta - \tilde{\tilde{\Delta}}_\pi\right\|_2^2 \,\middle|\, D_\pi^c\right] \le 16A^2. \tag{A.10}$$

To use (A.9), we need to bound $P(D_\pi^c)$. Denote by $D_{\pi,t}$ the event that all cells in our partition contain at least one observation from the treatment group, and let $D_{\pi,c}$ denote the corresponding event for the control group. Since $D_\pi = D_{\pi,t} \cap D_{\pi,c}$, we have $P(D_\pi^c) \le P(D_{\pi,t}^c) + P(D_{\pi,c}^c)$.

Let $U_g$ denote an arbitrary cell in the partition. By the third regularity condition, the probability that any given observation from the treatment group falls into $U_g$ is at least $\underline{c}h^p$. Since the covariate values of the $n_{t2}$ treatment observations are i.i.d., the probability that $U_g$ contains no treatment observations from $Z^{(2)}$ is at most
$$(1 - \underline{c}h^p)^{n_{t2}} = e^{n_{t2}\log(1 - \underline{c}h^p)} \le e^{-n_{t2}\underline{c}h^p},$$
where the last inequality results from the fact that $\log x \le x - 1$. Since $U_g$ is arbitrary and there are $(1/h)^p$ such cells in the partition of $\mathcal{U}$, the probability that any of them contains no treatment observations is at most
$$(1/h)^p e^{-n_{t2}\underline{c}h^p} = \exp\left[-n_{t2}\underline{c}h^p + p\log(1/h)\right].$$
By the choice of $h$ in step 2 of the TEEM algorithm, $h \ge \left[2\log(n_2^*)/(\underline{c}n_2^*)\right]^{1/p}$. Therefore,
$$-n_{t2}\underline{c}h^p + p\log(1/h) \le \frac{-2n_{t2}\log(n_2^*)}{n_2^*} + \log\left(\frac{\underline{c}n_2^*}{2\log n_2^*}\right) \le \log\left(\frac{\underline{c}}{2n_2^*\log n_2^*}\right) \le \log\left(\frac{\underline{c}}{2\tilde{n}_2\log \tilde{n}_2}\right).$$
The second inequality in the above expression results from $n_{t2} \ge n_2^*$. Thus
$$P(D_{\pi,t}^c) \le \exp\left[\log\left(\frac{\underline{c}}{2n_2^*\log n_2^*}\right)\right] = \frac{\underline{c}}{2n_2^*\log n_2^*}.$$
The same bound may be established for $P(D_{\pi,c}^c)$; therefore,
$$P(D_\pi^c) \le \frac{\underline{c}}{n_2^*\log n_2^*}. \tag{A.11}$$
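As a quick numerical sanity check of the empty-cell bound above (a sketch only; the values of the density lower bound, cell side $h$, dimension $p$, and sample size $n_{t2}$ below are illustrative assumptions, not quantities from the paper):

```python
import math

# Illustrative (assumed) values: density lower bound, cell side length h,
# covariate dimension p, and treatment-group sample size n_t2.
c_low, h, p, n_t2 = 0.5, 0.2, 2, 500

cell_prob = c_low * h ** p           # lower bound on P(an observation lands in a given cell)
exact = (1 - cell_prob) ** n_t2      # P(a given cell contains no treatment observation)
per_cell_bound = math.exp(-n_t2 * cell_prob)   # bound via log x <= x - 1

n_cells = (1 / h) ** p               # the partition has (1/h)^p cells
union_bound = n_cells * per_cell_bound         # union bound over all cells
# equivalently exp[-n_t2 * c * h^p + p * log(1/h)]
alt_form = math.exp(-n_t2 * cell_prob + p * math.log(1 / h))

assert exact <= per_cell_bound
assert math.isclose(union_bound, alt_form)
```

For these illustrative values the union bound is already well below 1, consistent with $P(D_\pi^c)$ vanishing as the sample size grows.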

Using (A.9) together with (A.8), (A.10), and (A.11), and using the fact that $h = B_4\{\log(n_2^*)/n_2^*\}^{1/p}$ for some $B_4$ depending on $\underline{c}$ and $p$, we have
$$E_\pi\left\|\Delta - \tilde{\tilde{\Delta}}_\pi\right\|_2^2 \le 8c^*ApLB_4\left(\frac{\log n_2^*}{n_2^*}\right)^{1/p} + B_3(B_4pL)^2\left(\frac{\log n_2^*}{n_2^*}\right)^{2/p} + 16A^2\underline{c}\,\frac{1}{n_2^*\log n_2^*} + B_3\inf_j\left[\frac{1}{\tilde{n}_2}\log\frac{1}{\omega_j} + E_\pi(\sigma - \hat{\sigma}_{n_1,j})^2 + E_\pi\left\|\Delta - \hat{\Delta}_{n_1,j}\right\|_2^2\right]. \tag{A.12}$$
With the exception of small $n_2^*$,
$$\frac{1}{n_2^*\log n_2^*} \le \left(\frac{\log n_2^*}{n_2^*}\right)^{2/p} \le \left(\frac{\log n_2^*}{n_2^*}\right)^{1/p},$$
so we can rewrite expression (A.12) as
$$E_\pi\left\|\Delta - \tilde{\tilde{\Delta}}_\pi\right\|_2^2 \le B_5\left\{\left(\frac{\log n_2^*}{n_2^*}\right)^{1/p} + \inf_j\left[\frac{1}{\tilde{n}_2}\log\frac{1}{\omega_j} + E_\pi(\sigma - \hat{\sigma}_{n_1,j})^2 + E_\pi\left\|\Delta - \hat{\Delta}_{n_1,j}\right\|_2^2\right]\right\},$$
for a constant $B_5$ depending on $\underline{c}$, $\bar{c}$, $\underline{\sigma}$, $\bar{\sigma}$, $A$, $p$, and $L$. Now $n_2^*$ and $\tilde{n}_2$, which heretofore we have treated as fixed, are random variables determined by the values of $(U_i, T_i)_{i=1}^n$ and the permutation $\pi$. By the law of iterated expectations, unconditional on the permutation $\pi$,
$$E\left\|\Delta - \tilde{\tilde{\Delta}}_\pi\right\|_2^2 = E\left(E_\pi\left\|\Delta - \tilde{\tilde{\Delta}}_\pi\right\|_2^2\right) \le B_5\left\{E\left[\left(\frac{\log n_2^*}{n_2^*}\right)^{1/p}\right] + \inf_j\left[E\left(\frac{1}{\tilde{n}_2}\right)\log\frac{1}{\omega_j} + E(\sigma - \hat{\sigma}_{n_1,j})^2 + E\left\|\Delta - \hat{\Delta}_{n_1,j}\right\|_2^2\right]\right\}. \tag{A.13}$$
Let $\alpha \in (0,1)$ be a fixed constant and let $H_{\alpha,\pi}$ denote the event that $n_2^* \ge \alpha n_2$. Since $(\log n_2^*/n_2^*)^{1/p} \le 1$, we have
$$E\left[\left(\frac{\log n_2^*}{n_2^*}\right)^{1/p}\right] \le E\left[\left(\frac{\log n_2^*}{n_2^*}\right)^{1/p} \,\middle|\, H_{\alpha,\pi}\right] + P(H_{\alpha,\pi}^c) \le \alpha^{-1/p}\left(\frac{\log n_2}{n_2}\right)^{1/p} + P(H_{\alpha,\pi}^c).$$
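The absorption of the three leading terms of (A.12) into a single $(\log n_2^*/n_2^*)^{1/p}$ term rests on elementary comparisons between these rates; a small numerical check (the grid of $n$ and $p$ values is an arbitrary illustration, and $p \ge 2$ is assumed here):

```python
import math

# For 0 < r < 1 with r = log(n)/n, check that
#   1/(n log n) <= r**(2/p) <= r**(1/p)
# over an illustrative grid of sample sizes n and dimensions p >= 2.
for p in (2, 3, 5, 10):
    for n in (50, 100, 1000, 10**6):
        r = math.log(n) / n           # lies in (0, 1) for n >= 2
        lhs = 1 / (n * math.log(n))
        assert lhs <= r ** (2 / p) <= r ** (1 / p)
```

The second inequality is immediate because $r < 1$ and $2/p \ge 1/p$; the first compares a $1/(n\log n)$ rate against the slower polynomial-in-$\log$ rates.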

For $P(H_{\alpha,\pi}^c)$, the exponential bound on the upper tail probability of the hypergeometric distribution established by Chvátal (1979) can be used to show that we can find $\alpha \in (0,1)$ depending on $a$ and $b$ from the second regularity condition such that
$$P(H_{\alpha,\pi}^c) \le B_6 e^{-n_2},$$
for a constant $B_6$ depending on $a$ and $b$. Thus
$$E\left[\left(\frac{\log n_2^*}{n_2^*}\right)^{1/p}\right] \le B_7\left(\frac{\log n_2}{n_2}\right)^{1/p}, \tag{A.14}$$
for $B_7$ depending on $a$ and $b$. For $E(1/\tilde{n}_2)$, conditional on $D_\pi$,
$$\frac{1}{\tilde{n}_2} = h^p = \left\{\left\lfloor\left(\frac{\underline{c}n_2^*}{2\log n_2^*}\right)^{1/p}\right\rfloor\right\}^{-p} \le B_8\left(\frac{\log n_2^*}{n_2^*}\right) \le B_7B_8\left(\frac{\log n_2}{n_2}\right), \tag{A.15}$$
for a constant $B_8$ depending on $\underline{c}$. As established earlier in this proof, $P(D_\pi^c)$ converges to zero faster than $O(1/n_2^*) = O(1/n_2)$.

Using (A.14) and (A.15) to replace the random variables in (A.13) with fixed constants, we obtain a bound for the risk of $\tilde{\tilde{\Delta}}_\pi$:
$$E\left\|\Delta - \tilde{\tilde{\Delta}}_\pi\right\|_2^2 \le B_9\left\{\left(\frac{\log n_2}{n_2}\right)^{1/p} + \inf_j\left[\left(\frac{\log n_2}{n_2}\right)\log\frac{1}{\omega_j} + E(\sigma - \hat{\sigma}_{n_1,j})^2 + E\left\|\Delta - \hat{\Delta}_{n_1,j}\right\|_2^2\right]\right\}, \tag{A.16}$$
for a constant $B_9$ depending on $a$, $b$, $\underline{c}$, $\bar{c}$, $\underline{\sigma}$, $\bar{\sigma}$, $A$, $p$, and $L$.

For $P > 1$, the estimator $\hat{\Delta}$ from step 8 of the algorithm is the average (over the set of $P$ permutations) of the $\tilde{\tilde{\Delta}}_{\pi_p}$. Therefore, by the convexity of the $L_2$ loss, an application of Jensen's inequality gives us
$$E\|\Delta - \hat{\Delta}\|_2^2 \le \frac{1}{P}\sum_{p=1}^{P} E\left\|\Delta - \tilde{\tilde{\Delta}}_{\pi_p}\right\|_2^2. \tag{A.17}$$
The permutation $\pi$ used to establish the bound in (A.16) was arbitrary; therefore, by (A.17), the bound in (A.16) also holds for $E\|\Delta - \hat{\Delta}\|_2^2$. This completes the proof of the theorem. $\square$
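The final step of the proof uses only the convexity of squared error; the following toy illustration (with synthetic numbers standing in for the $P$ permutation-specific estimates at a fixed covariate value, not the paper's actual estimators) shows Jensen's inequality in action:

```python
import random

random.seed(0)
target = 1.7                        # plays the role of Delta(u) at one covariate value
# Stand-ins for P = 5 permutation-specific estimates, as averaged in step 8.
ests = [target + random.gauss(0, 1) for _ in range(5)]
avg_est = sum(ests) / len(ests)     # the averaged (combined) estimate

sq_err_of_avg = (target - avg_est) ** 2
avg_of_sq_errs = sum((target - e) ** 2 for e in ests) / len(ests)

# Jensen's inequality for the convex map x -> x**2: the error of the
# average never exceeds the average of the errors.
assert sq_err_of_avg <= avg_of_sq_errs
```

The inequality holds for any collection of estimates, which is why the per-permutation bound (A.16) transfers directly to the averaged estimator.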

Table 1: Variables used in LaLonde NSW data analysis.

Name     Type       Description                                    Mean    SD
Y        Outcome    √(1978 income) − √(1975 income)                20.4    56.4
T        Treatment  T=1 if enrolled in training; otherwise, T=0    0.411   0.492
Inc75    Covariate  √(1975 income)                                 37.5    40.5
Educ     Covariate  Years of education                             10.3    1.7
Age      Covariate  Age in years                                   24.5    6.6
Married  Covariate  Married=1 if married; otherwise, Married=0     0.162   0.369

Table 2: $\hat{\Delta}_\nu(u)$ estimates from model selection and combination methods applied to the LaLonde NSW data.

                                                               $\hat{\Delta}_\nu(U_i)$ values^b
Method  Model Type  Active Variables^a                         Mean   SD
AIC     Additive    T*s(Inc75), T*s(Educ), T*Married, Age      5.7    12.5
BIC     Linear      Inc75                                      0      0
CV      Linear      T, Inc75                                   6.6    0
TECV    Linear      Inc75, T*Married                           6.7    6.4
cAIC                                                           5.6    5.2
BMA                                                            1.4    0.1
ARM                                                            5.3    2.9
TEEM                                                           5.6    3.8

^a The presence of interaction terms implies the presence of both main effects.

^b The mean and standard deviation of $\hat{\Delta}_\nu(U_i)$ over the n = 722 sample values of $U_i$, where $\nu$ denotes a model selection or combination method that produces an estimate $\hat{\Delta}(u)$.

Table 3: Guided simulation results: estimated risk^a (SE) of $\hat{\Delta}_{\tau,\nu}$ for 8 methods under 4 scenarios.

          Model Selection Method Determining E(Y|T,U)
Method    AIC           BIC          CV           TECV
AIC       158.6 (6.1)   41.3 (6.2)   54.6 (5.5)   74.2 (6.0)
BIC       191.6 (2.8)    1.4 (1.4)   37.0 (2.1)   72.3 (1.5)
CV        169.3 (4.4)   12.6 (3.7)   31.1 (3.4)   59.6 (3.7)
TECV      160.9 (4.0)   21.4 (4.2)   36.4 (3.6)   48.7 (4.0)
cAIC      134.6 (5.4)   28.0 (4.9)   35.8 (4.8)   51.4 (5.0)
BMA       166.4 (2.1)    1.3 (1.1)   24.1 (1.7)   57.4 (1.8)
ARM       112.1 (2.7)   11.6 (1.7)   16.6 (1.8)   32.5 (2.1)
TEEM      108.6 (2.9)   15.1 (1.9)   20.4 (2.0)   32.3 (2.3)

^a Numbers in bold represent the methods with the lowest estimated risks for each scenario and those not significantly different (using Tukey's HSD method of multiple comparisons) from the lowest-risk method.

Figure 1: $\hat{\Delta}_\nu(U_i)$ from methods AIC, CV, TECV, and TEEM plotted against the Inc75 variable for the n = 722 observations in the LaLonde NSW data.

[Figure: four scatterplot panels, "Model Selected by AIC", "Model Selected by CV", "Model Selected by TECV", and "TEEM Model Combination"; each plots $\hat{\Delta}$ values (roughly −40 to 60) against Inc75 (0 to 200).]

Figure 2: Contour plots for $\hat{\Delta}_{\mathrm{AIC}}$ for the LaLonde NSW data. The circles indicate the locations of one or more original data points.

[Figure: two contour-plot panels, (a) Single Males and (b) Married Males.]

Figure 3: Contour plots for $\hat{\Delta}_{\mathrm{TEEM}}$ for the LaLonde NSW data. The circles indicate the locations of one or more original data points.

[Figure: two contour-plot panels, (a) Single Males and (b) Married Males.]

Figure 4: Results of the LaLonde NSW cross-examination over the four scenarios combined.

[Figure: "Comparison of Methods Over All Cross-Examinations"; y-axis: average value of $(\Delta(U_i) - \hat{\Delta}_\nu(U_i))^2$ for one realization (0 to 400); x-axis: model selection or combination method used (AIC, BIC, CV, TECV, cAIC, BMA, ARM, TEEM).]