Running Head: Unbiased Causal Inference from an Observational Study

Unbiased Causal Inference from an Observational Study: Results of a Within-Study Comparison1

Steffi Pohl, Friedrich-Schiller-Universität Jena, Germany
Peter M. Steiner2, Northwestern University, USA
Jens Eisermann, Freie Universität Berlin, Germany
Renate Soellner, Freie Universität Berlin, Germany
Thomas D. Cook2, Northwestern University, USA

Footnotes

1 We thank William R. Shadish for helpful comments on a first version of this paper.

2 P.M. Steiner and T.D. Cook were supported in part by grant R305U070003 from the Institute of Education Sciences, U.S. Department of Education. P.M. Steiner was also supported by a grant from the Spencer Foundation.


Abstract

Adjustment methods such as propensity scores and ANCOVA are often used for estimating treatment effects from non-experimental data. Shadish, Clark and Steiner (2008) used a within-study comparison to test how well these adjustments work in practice. They randomly assigned students to participate in either a randomized or a nonrandomized experiment. Treatment effects were then estimated in the experiment and compared to the adjusted non-experimental estimates. Most of the selection bias in the non-experiment was reduced. The present study replicates the findings of Shadish et al. despite some differences in design and in the size and direction of initial bias. The results show that the selection of covariates matters considerably for bias reduction in non-experiments, but that the choice of analysis method matters much less.

Keywords: within-study comparison, propensity scores, ANCOVA, causal inference, observational study


Introduction

Well-implemented randomized experiments provide the best warrant for unbiased causal inference in the social and behavioral sciences. But they are not always possible. From this arises the need to develop a compendium of non-experimental alternatives that can often yield unbiased causal knowledge.

Theory provides one guide to the relative desirability of specific non-experimental alternatives. Formal proofs have been offered that regression discontinuity (RD) estimators are unbiased (Goldberger, 1972a; Rubin, 1977). But they are still less efficient than experimental estimators (Goldberger, 1972b), they assess the causal effect at a single cutoff point rather than at the average of the entire treatment group, and they are very sensitive to functional form assumptions that cannot always be easily checked in the data (Trochim, 1984). Proof also exists that instrumental variables (IV) will yield consistent estimates when their key assumption is met, namely that the instrument is related to the outcome only via the treatment under analysis (e.g., Angrist, Imbens & Rubin, 1996). Unfortunately, in many research applications it is all but impossible to know whether this central assumption has been met; and in any case, IV estimators are less efficient than experimental ones. Unbiased causal inference is also possible when the process of selection into treatment is completely known and perfectly modeled. Indeed, this is the shared rationale for both the experiment and RD. Unfortunately, in other applications it is rarely possible to know when this condition has been fully met. However, our argument is not that RD, IV and knowing the selection process are as good as the experiment or as each other. It is that they are sometimes available when an experiment is not possible, thus complementing the experiment rather than seeking to replace it.

Theory of method is not the only route to determining how well certain kinds of non-experiments fare in warranting bias-free causal conclusions. A second route is empirical. Research can be designed to pit the results of an experiment against those of a (usually statistically adjusted) non-experiment to see if the results of the two methods differ. If they do not, the inference is plausible that no appreciable bias remains in the non-experiment. LaLonde (1986) was the first to use this logic in a systematic way within a single study. He took a randomized local experiment on job training and compared its effect size to the adjusted effect size from a non-experiment that shared the same intervention group but whose comparison group was constructed from national data archives. In essence, this tested whether the randomly formed control group was similar to the non-randomly formed comparison group after statistical adjustments for selection had been made in the latter. If the two causal estimates were similar, this warranted the conclusion that the adjusted non-experiment was unbiased.

The LaLonde study had three treatment arms—the experimental treatment group, the experimental control group, and the non-equivalent comparison group. Studies of this form are also called within-study comparisons. They contrast experimental and non-experimental results that share a common treatment group but vary in whether the counterfactual is formed at random or not. There are now at least 25 studies with this three-arm within-study comparison design. Collectively, they have examined many different procedures for adjusting the non-experimental comparison group data, the main ones being multiple regression (ANCOVA), propensity scores (PS) and some form of IV analysis predicated on Heckman-type selection models.
Glazerman, Levy and Myers (2003) have summarized the earlier within-study comparisons in job training, while Cook, Shadish and Wong (2008) have done so for the more recent job-training studies and also for studies in domains other than job training.

Of necessity, within-study comparisons make several assumptions. One is that the randomized experiment has been successfully implemented and competently analyzed. Otherwise it cannot function as an unbiased causal benchmark. Another is that the non-experiment has been well designed and analyzed. Otherwise the confounded comparison will be between a well-done experiment and a poorly realized non-experiment. Further assumptions are that knowledge of the experimental results does not bias analysis of the non-experiment, and that the experiment and non-experiment estimate the same causal quantity—e.g., that there is not an intent-to-treat estimate in the experiment and a treatment-on-the-treated one in the non-experiment. A fifth assumption is that no third variable is correlated with both the outcome and the contrast between the experiment and non-experiment, an assumption that has often been violated in the past. LaLonde (1986), for example, compared experimental groups in quite local settings to comparison cases from all over the nation. It is also not clear whether outcomes were always measured at the same times and in otherwise the same ways across the experiment and observational study. Sensitivity to confounds like these has grown over time, and they are less frequent in later three-arm designs than in earlier ones (Cook et al., 2008). But the possibility of such confounding is still a problem.

Shadish, Clark & Steiner (2008) solved the problem by moving to a four-arm design. In their example, respondents (undergraduate students) were first randomly assigned to a randomized experiment or a non-randomized experiment.
They were then assigned to one of two educational treatments whose effects were to be estimated. In the randomized experiment respondents were randomly assigned to a Math or a Vocabulary training, and in the nonrandomized experiment they were free to self-select into either of these two training conditions. Respondents in the randomized and nonrandomized experiment were treated in identical ways once they were assigned to an educational treatment, attending the very same training sessions and experiencing the same pretest and posttest assessments in the same testing sessions. The first random assignment ensures that the respondents exposed to the randomized experiment and the non-randomized experiment are identical in expectation; and the identical training and testing protocols for the experimental and non-experimental respondents rule out confounds from third variables. The full design of Shadish et al. is depicted in Figure 1. The Math posttest scores of those experiencing the Math training served as treatment scores while the Math scores of those experiencing the Vocabulary training served as the comparison scores. The design logic was reversed for those experiencing the Vocabulary intervention, thus enabling two tests of the study's basic goals based on the difference between experimental and non-experimental scores, one in Math and the other in Vocabulary.

-------------------------------
Insert Figure 1 about here
-------------------------------


At pretest, data were collected on many potential determinants of selection from five domains: demographics, personality, prior academic performance, proxy pretest scores in Math and Vocabulary, and preference for Math or Vocabulary. The key null hypothesis was that the average treatment effect in the experiment would not differ from the adjusted average treatment effect in the non-experiment. Adjustments were made via PS and simple ANCOVA, and a subsidiary null hypothesis was that these two methods of data analysis would not differ. The estimated propensity scores were analyzed in three different ways—in stratification analyses, as covariates, and in weighted analyses—thus permitting a test of whether these three ways of analyzing PS data differ in the results they provide.

The results from Shadish et al. (2008) showed (1) that there was bias in the non-experiment for both the Vocabulary and Math treatments, amounting to 0.24 standard deviation units in the former and 0.26 in the latter; (2) that about 70% to 95% of this bias was reduced for the Vocabulary outcome and 60% to 85% for the Math outcome; (3) that adjustments based only on demographic variables did very poorly for both Vocabulary and Math; (4) that the type of analysis—PS or ANCOVA—was not related to the amount of bias reduction achieved; and (5) that how the PS scores were analyzed also made no consistent difference. Subsequent work by Steiner, Cook, Shadish & Clark (2009) has shown that the most important covariates for bias reduction were motivational variables measuring preference for Math over Vocabulary, followed by measures of prior performance in Math and Vocabulary.

The study by Shadish et al. (2008) was published in the Journal of the American Statistical Association as a discussion paper, with discussions invited from Hill (2008), Little, Long & Lin (2008) and Rubin (2008b).
None of them questioned the paper's fundamental conclusions. The results of Shadish et al. are important because they illustrate that the assumptions underlying the use of statistical adjustment methods can be reasonable in practice, and so unbiased causal inferences can sometimes be drawn from nonrandomized experiments. The study also illustrates that adjusting for selection is not so much a matter of the mode of data analysis as of which covariates are available and how well they capture the true selection process. And finally, the study indicates that academic disciplines can attain causal knowledge about relationships that do not lend themselves to random assignment, thus enlarging the range of causal questions that can be researched.

Like any single study, Shadish et al. (2008) has its limitations. It takes place in one setting—Memphis, Tennessee, USA. It involves only undergraduate students taking an introductory course in Psychology. The interventions are short-lasting and occur in a laboratory, whereas many researchers will want to generalize to longer-lasting treatments in field settings. The initial bias of 0.24 and 0.26 standard deviation units is quite small and is exclusively positive in direction because the Memphis students tended to self-select into training in areas where they were already more proficient. In many real-world contexts, the process of selection into treatment may be structured differently and depend on administrator selection, some mix of administrator and self-selection, or even on quite different selection mechanisms from individual to individual. It may also vary in the direction and size of any initial bias. Although Shadish et al. provide important insights into the analysis of nonrandomized experiments, their results are still of quite unknown generality.


The present study seeks to replicate Shadish et al. (2008) in a way that preserves the four-arm within-study design and laboratory analog while varying some features that might have affected the earlier study's results. Thus, the replication takes place in Berlin, Germany, rather than Memphis, USA. The Math training was identical to that of Shadish et al. (except for translation into German), but the second training was designed to improve the English of German students rather than the English vocabulary of American students. We will soon see that the standardized initial self-selection bias in the English treatment condition of the Berlin nonrandomized experiment (d = -0.38) is larger than the initial biases found in Shadish et al. (d = 0.26 and d = 0.24) and opposite in direction: the Berlin students self-selecting into English were the less proficient, whereas in Memphis it was the more proficient who self-selected. As a result, the present replication entails some novel sources of heterogeneity not found in Shadish et al. Even so, the null hypotheses are the same: (1) that the experiment and the adjusted non-experiment will not differ in their causal estimates; (2) that the PS and ANCOVA analyses will not differ; and (3) that how the estimated PS scores are analyzed—in stratification or covariance analysis—will not make a difference.

Method

Participants

The sample consists of 205 undergraduate students majoring in Psychology (N = 127) or Education (N = 75) at the Freie Universität Berlin, Germany. The students were freshmen and were invited to participate in the study during the first week of their introductory course.
Participating students received credit as well as written feedback on their competency level in English and Math. We deleted three cases because the type of assignment was mistakenly not recorded. Of the remaining sample, 84.7% of the students were female, and their age ranged from 18 to 51 with a mean of 22.96 (SD = 5.63).

Procedure

Without revealing the actual purpose of the study, it was introduced to students as an evaluation of training in English and Math, given that both skills are needed for the study of Psychology and Education in Germany. Before assignment to the experimental or the non-experimental condition, students were pre-tested in English and Math. The study design is depicted in Figure 2, and its similarity to Shadish et al. (2008) is evident.

-------------------------------
Insert Figure 2 about here
-------------------------------

We used a simple random procedure for assigning students to conditions. Each student drew a sheet of paper indicating assignment to one of three conditions: the English or Math training in the randomized experiment (N = 44 and 55, respectively) or the nonrandomized experiment (N = 103). Of the students in the nonrandomized experiment, 55 self-selected into Math and 48 into English. Except for the selection mechanism, all students in the same training condition were then treated in exactly the same way.

Each training session lasted about 15 minutes. The English training dealt with the use of adverbs, indefinite articles, reflexive pronouns, the phrase 'give', and the distinction between 'since' and 'for'. An overview of some English vocabulary was also provided.
The Math training materials were translated from Shadish et al. and dealt with five rules for transforming exponential equations. After the training, posttest measures were taken in both English and Math, thus enabling the calculation of two treatment effects—one in English, where the outcomes of those exposed to Math serve as the no-treatment controls, and one in Math, where the outcomes of those exposed to English serve as controls. The assumption here is that training in English does not improve performance in Math and vice versa.

Measures

Covariates. The questionnaire was mainly designed to capture the selection process in the nonrandomized experiment. In total, we constructed 25 covariates from 86 questionnaire items tapping the five domains that Steiner et al. (2009) created from the Shadish et al. (2008) items: demographic variables, proxy-pretests, prior academic achievement, topic preference and psychological variables.

The demographic variables are gender, age, marital status (married or single), nationality (German-speaking country or not), study major (Psychology or Education) and the educational attainment of the student's mother and father (four categories each). Except for age, all the demographic variables were dummy coded. Because of the right-skewed distribution of age, we used the logarithm of age in all analyses.

The proxy-pretests in Math and English are only proxy measures since they did not use the same items as the posttests. The Math pretest consisted of 15 translated items of the Arithmetic Aptitude Test (Educational Testing Services, 1993) used by Shadish et al. (2008). The English pretest used 30 items of the START-E test for young professionals (Liepmann, Tartler, Nettelnstroth & Smolka, 2006), measuring grammar, orthography and translation.
Both tests consisted of multiple-choice items with four or five response options. The percentage of correct answers constituted the two proxy-pretest scores.

Measures of prior academic achievement included high school grades in Math, English, Literature, first language (German or another language) and Biology, the dummy-coded high school course level in Math and English (with four levels: advanced, basic, not taken, or other), as well as a single self-assessment item each on Math and English knowledge.

The topic preference domain consisted of five items asking (1) whether students prefer Math and Statistics over English, (2) how much they like Math and Statistics, (3) how much they like English, and whether they like (4) Math and Statistics or (5) English so much that they would attend extra lessons.

Finally, the psychological items measured positive and negative affect at the moment of testing. These state measures were taken from the German version (Krohne, Egloff, Kohlmann & Tausch, 1996) of the Positive and Negative Affect Schedule (PANAS; Watson, Clark & Carey, 1988; Watson, Clark & Tellegen, 1988). Each scale consisted of 10 items on five-point Likert scales, with the average representing the psychological scores. Since the negative affect scores showed a right-skewed distribution, we used their logarithm in all analyses.

Outcomes. We adopted the 20-item Math posttest from Shadish et al. (2008) that required applying rules for transforming exponential equations. Thus, the outcome was closely aligned with the specifics of the Math training and is much more specific in its alignment than the proxy-pretest measure of general Math knowledge. The English posttest dealt with the same topics as the pretest, with 30 items assessing students' grammar, orthography and translation skills.
However, the English posttest is not closely aligned with the specifics taught in the English training. All students took both tests, completing first the test in the domain where their training took place and thereafter the test in the domain where they were not trained. This is different from Shadish et al., where the posttests were completed in the same order regardless of the treatment condition. The percentage of correctly answered items formed the posttest score in Math and English.

Missing Data. We had no missing values on the outcomes or the proxy-pretests. However, some values of other covariates were missing, especially for 'father's education' (8.9%), 'mother's education' (5.4%), high school grade in Biology (4.5%) and the course difficulty level of high school Math and English classes (3.5% and 4.0%, respectively). 154 participants (76.2%) had no missing values, 22 (10.9%) had one missing value and 26 (12.9%) had more than one. We imputed missing values using a restricted general location model as implemented in the mix package in R (Schafer, 1997, 2007; R Development Core Team, 2008). This is essentially an EM algorithm with an additional Bayesian data augmentation step that uses categorical and continuous covariates simultaneously in the imputation process in order to create categorical and continuous imputations. The imputation model did not involve the treatment indicator or the posttest scores.
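For readers who want to see the shape of this step, the sketch below uses scikit-learn's IterativeImputer as a rough Python stand-in for the general location model; the paper's actual imputation was done with the R mix package, and the variable names here are hypothetical.

```python
# A minimal sketch of the imputation step, assuming a pandas DataFrame
# `covariates` that holds the 25 covariates (dummy-coded categoricals
# plus continuous variables). IterativeImputer is only a rough analogue
# of the restricted general location model used in the paper.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def impute_covariates(covariates: pd.DataFrame) -> pd.DataFrame:
    """Impute missing covariate values via chained regressions."""
    imputer = IterativeImputer(max_iter=25, random_state=42)
    imputed = imputer.fit_transform(covariates)
    return pd.DataFrame(imputed, columns=covariates.columns)

# As in the paper, the treatment indicator and the posttest scores are
# deliberately left out of the imputation model; dummy-coded categorical
# imputations would afterwards be rounded back to {0, 1}.
```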

Data Analysis


This study investigates the performance of different methods of selection bias adjustment, both relative to each other and relative to the causal benchmark provided by the randomized experiment. We estimated average causal effects according to the Rubin Causal Model (Neyman, 1923; Rubin, 1974, 1978). In this framework each subject has two potential outcomes: Y1 under the treatment condition Z = 1 and Y0 under the control or alternative treatment condition Z = 0. The individual causal effect is then given by the difference in potential outcomes, Y1 - Y0. However, in practice we have to deal with the fundamental problem of causal inference (Rubin, 1978; Holland, 1986), namely that we never observe both potential outcomes simultaneously. Though the individual causal effects are not known, the average causal effect E(Y1) - E(Y0) can still be estimated without bias if individuals are randomly assigned to treatment conditions. This is so because randomization makes the treatment groups comparable in expectation. If treatments are not randomly assigned but actively chosen by participants or by third persons like administrators, then the unadjusted mean difference in the observed outcomes is in general a biased estimate of the average treatment effect. Obtaining an unbiased estimate requires that the treatment assignment be strongly ignorable, that is, (Y0, Y1) ⊥ Z | X, with X being a vector of all covariates that are related to both the treatment Z and the potential outcomes (Y0, Y1), and 0 < P(Z = 1 | X) < 1. In other words, an unbiased estimate of the average treatment effect results if (1) the potential outcomes are independent of treatment assignment given the vector of observed covariates X and (2) the conditional probability of being in each of the treatment groups given X is strictly between zero and one. This means that in planning a study the aim is to observe all the covariates related to both treatment selection and potential outcomes and to make sure that each unit has a chance of getting into each of the treatment groups.

Different adjustment methods exist for estimating the average causal treatment effect. We focus on two main classes of methods: (i) balancing pre-treatment group differences via propensity score modeling and (ii) direct covariance adjustment within the outcome model (ANCOVA). While PS analyses address selection bias by modeling the assignment mechanism and equate non-equivalent treatment samples before examining the outcome data, ANCOVA directly models the outcome in order to estimate a covariance-adjusted treatment effect. Hence, PS modeling capitalizes on treatment-correlated covariates whereas outcome modeling capitalizes on outcome-correlated covariates.
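To make the identification argument explicit, the strong ignorability condition just described can be written compactly as below; this display is a standard restatement in LaTeX notation, not additional material from the study.

```latex
% Strong ignorability (notation as in the text):
(Y^0, Y^1) \perp Z \mid \mathbf{X}, \qquad 0 < P(Z = 1 \mid \mathbf{X}) < 1 .
% Under these two conditions the average treatment effect is identified
% from observable quantities by averaging conditional group differences:
E(Y^1) - E(Y^0) = E_{\mathbf{X}}\big[ E(Y \mid Z = 1, \mathbf{X})
                                    - E(Y \mid Z = 0, \mathbf{X}) \big] .
```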

Adjustment using Propensity Scores. The propensity score e(X) is defined as the conditional probability of being in treatment condition Z = 1 given the vector of observed covariates X—that is, e(X) = P(Z = 1 | X) with 0 < e(X) < 1. Since Z = 1 represents the English training condition in our study and Z = 0 the Math training condition, the propensity score is the conditional probability of attending the English training given the observed covariates described above. Rosenbaum and Rubin (1983) showed that if treatment assignment is strongly ignorable given the observed covariates, it is also strongly ignorable given the propensity score. We estimated propensity scores using logistic regression and selected the propensity score model according to two main balance metrics—the standardized mean difference and the variance ratio between treatment groups (Rubin, 2001a). Rubin contends that standardized mean differences of more than half a standard deviation and variance ratios lower than 4/5 or greater than 5/4 indicate imbalance. However, one should always try to achieve as much balance as possible, i.e., standardized mean differences very close to zero and variance ratios close to one (Imai, King & Stuart, 2008; Ho, Imai, King & Stuart, 2007). We assessed balance on the logit of the PS, on each individual covariate (including the logarithm of the two right-skewed covariates) and on certain functions of each covariate, especially quadratic terms and all two-way interactions. Balance checks required comparing weighted means and variances between the two training groups—with weights derived from the PS-stratification approach (described below). Using Rubin's criteria we could not achieve balance when all the cases were considered, because the self-selected treatment groups did not completely overlap on the logit of the propensity score. So we had to discard observations with no corresponding case in the other group, using a caliper of 0.1 standard deviations. This led to deleting 16 (16%) non-overlapping cases. We then re-estimated the PS model and achieved satisfactory balance and overlap. In each PS stratum we had at least four observations from each training group. Table 1 shows the parameter estimates of the logistic model. According to the t-values, the most important covariates explaining selection into the English condition are the English pretest and the Math grades.

-------------------------------
Insert Table 1 about here
-------------------------------
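As an illustration of these diagnostics, the sketch below computes the PS-logit via logistic regression together with the two balance metrics. It is a Python reconstruction under assumed inputs—a covariate matrix X and a treatment indicator z with 1 = English—not the authors' original R code.

```python
# Illustrative sketch of the PS estimation and balance checks described
# above; variable names are hypothetical. Requires numpy and statsmodels.
import numpy as np
import statsmodels.api as sm

def estimate_ps_logit(X, z):
    """Logistic regression of treatment z (1 = English training) on the
    covariates X; returns each subject's propensity score logit."""
    fit = sm.Logit(z, sm.add_constant(X)).fit(disp=0)
    return fit.fittedvalues  # linear predictor = logit of the PS

def standardized_mean_difference(x, z):
    """Rubin's B: mean difference scaled by the pooled standard deviation."""
    x1, x0 = x[z == 1], x[z == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return (x1.mean() - x0.mean()) / pooled_sd

def variance_ratio(x, z):
    """Rubin's R: ratio of treatment- to comparison-group variances."""
    return x[z == 1].var(ddof=1) / x[z == 0].var(ddof=1)

# Rubin's (2001a) benchmarks flag imbalance when |B| > 0.5 or when R
# falls outside [4/5, 5/4]; in practice one aims for B near 0, R near 1.
```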


Table 2 presents the relevant balance statistics for the covariates and the PS-logit, with B indicating the standardized mean difference and R the variance ratio. All the mean differences are considerably less than 0.5. Indeed, 97.4% of the covariates fall below 0.25 and 79.5% even fall below 0.125. As for the variance ratios, 61.5% fall inside Rubin's benchmark of 4/5 to 5/4. However, it is striking that four of the more imbalanced ratios are for covariates closely related to preference for and knowledge of English—viz., high school English grades, liking for English, self-attributed knowledge of English and most advanced high school course level. Analytic attention needs to be paid to these apparently systematic differences in variance ratios for covariates touching on knowledge of English.

-------------------------------
Insert Table 2 about here
-------------------------------

Overall, balance is good but not perfect, and we could not come up with ways to improve it by re-specifying the PS model. Doing so always resulted in less overlap and worse balance, even after repeatedly discarding non-overlapping observations and re-specifying PS models. Differences in variance ratios were even harder to balance than mean differences. Such problems in achieving perfect balance reflect the limits of PS modeling with samples of small size. Of course, having the randomized experiment as a benchmark permits assessing whether PS-stratification and PS-ANCOVA can work with the quality of balance actually achieved in a small data set like this one. The data were balanced by the second author, who was blind to the outcome data.
Only after the decision on the PS model was made were the covariate and outcome data merged.

Following Shadish et al. (2008), we used the estimated propensity scores in two different ways: (1) propensity score stratification using five strata (Rosenbaum & Rubin, 1984); and (2) ANCOVA using propensity score logits, where the treatment variable is entered into the outcome model together with linear, quadratic and cubic terms for the propensity score logit (Rosenbaum & Rubin, 1983). We did not consider PS weighting (Lunceford & Davidian, 2004) because of its sensitivity to large weights, which may result whenever the PS is close to zero or one. We also excluded PS matching (see e.g. Dehejia & Wahba, 2002; Rubin, 1973a, 1973b; Rubin & Thomas, 1996), first, because of the small sample sizes and the relatively small pool of control cases for matching the Math group with the English group (that pool of control cases should usually be much larger than the treatment group); and second, because matching typically focuses on the treatment effect for the treated (TOT), whereas in both the randomized and the quasi-experiment we estimated the average treatment effect for treated and untreated subjects together.

We performed each of these analyses with and without the individual covariates in the outcome model. The additional inclusion of individual covariates makes the estimation of the causal effect more robust against misspecification of either the propensity score or the outcome model (Cochran & Rubin, 1973; Robins & Rotnitzky, 1995; Rubin, 1973b). However, if both models are misspecified, estimates may be biased (Kang & Schafer, 2007). As it happens, the additional covariate adjustment made little difference to the degree of bias reduction achieved, though it did reduce standard errors, as we will see in Table 3.
We combined PS techniques and covariance adjustment within the regression framework as suggested by Hirano & Imbens (2001). Turning to PS-stratification, it was implemented as weighted regression (WLS) using PS-stratum weights derived from each observation's stratum and group membership, that is, w_zq = (n_z · n_q / n) / n_zq, where n_zq is the number of subjects in treatment group z and stratum q, n_z = Σ_{q=1..Q} n_zq, n_q = Σ_{z=0,1} n_zq, and n is the total number of subjects. The WLS approach then allows the inclusion of further covariates. This differs from the classical covariate adjustment in a PS-stratification analysis, where regression adjustments are made within each stratum separately rather than across strata as here. We did not follow the classical approach because of the small sample size within each stratum. Given the medium sample size, we used five strata; these remove around 90% of the selection bias due to observed covariates (Cochran, 1968; Rosenbaum & Rubin, 1984). Regarding our second PS approach, PS-logits in ANCOVA, one analysis used just the cubic polynomial of the PS-logits (including the linear, quadratic and cubic terms) and the other also included the individual covariates. In all analyses using individual covariates, we entered all 33 predictors (categorical covariates were dummy coded) as main effects in the outcome model. Given the ongoing discussion about the appropriate method for estimating standard errors of PS-based effect estimates (e.g., Abadie & Imbens, 2008; Ho et al., 2007; Rubin & Thomas, 1996), we followed Ho et al. (2007) and computed conventional standard errors within the regression framework described above.
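A minimal sketch of this WLS implementation, assuming the five PS-logit strata have already been assigned (hypothetical names; Python rather than the authors' R):

```python
# PS-stratification as weighted least squares, using the stratum weights
# defined above: w_zq = (n_z * n_q / n) / n_zq.
import pandas as pd
import statsmodels.api as sm

def stratum_weights(z, q):
    """Weights standardizing each treatment-by-stratum cell to the pooled
    stratum distribution (z = treatment group, q = PS stratum)."""
    df = pd.DataFrame({"z": z, "q": q})
    n = len(df)
    n_z = df["z"].map(df["z"].value_counts())              # group sizes
    n_q = df["q"].map(df["q"].value_counts())              # stratum sizes
    n_zq = df.groupby(["z", "q"])["z"].transform("size")   # cell sizes
    return (n_z * n_q / n) / n_zq

def ps_stratification_effect(y, z, q, X=None):
    """Weighted regression of the outcome on treatment and, optionally,
    the covariates as main effects; the coefficient on z estimates the
    average treatment effect."""
    design = pd.DataFrame({"z": z})
    if X is not None:
        design = pd.concat([design, pd.DataFrame(X)], axis=1)
    fit = sm.WLS(y, sm.add_constant(design),
                 weights=stratum_weights(z, q)).fit()
    return fit.params["z"], fit.bse["z"]
```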


Adjustment using ANCOVA without PS. As in Shadish et al. (2008), we explored the use of ANCOVA with observed covariates but without propensity scores. As with the estimation of the PS, we again fixed the outcome model at the design stage of our analyses, i.e., before examining the outcome data. We always included all 33 covariates as main effects, without any model selection procedure (see Appendix A and Appendix B for the estimated regression coefficients). For comparison, we applied ANCOVA to exactly the same observations used to test the PS methods, i.e., the data omitting non-overlapping cases. However, ANCOVA relies on functional form assumptions and extrapolation and so does not require restricting the target population by deleting non-overlapping cases. Therefore, we also estimated the treatment effect using all observations, thus including the 16 discarded cases. Again, all covariates were included as main effects.
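In regression terms this ANCOVA reduces to a single OLS fit of the outcome on the treatment dummy plus all covariates; a minimal sketch under the same hypothetical names as above:

```python
# ANCOVA adjustment: the coefficient on the treatment dummy z is the
# covariance-adjusted estimate of the average treatment effect.
import pandas as pd
import statsmodels.api as sm

def ancova_effect(y, z, X):
    design = sm.add_constant(pd.concat([pd.Series(z, name="z"),
                                        pd.DataFrame(X)], axis=1))
    fit = sm.OLS(y, design).fit()
    return fit.params["z"], fit.bse["z"], fit.rsquared

# The PS-logit ANCOVA variant described earlier simply replaces X with
# the linear, quadratic and cubic terms of the estimated PS-logit
# (optionally alongside the individual covariates).
```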

Equating the populations being compared. To judge whether the different adjustment methods give an unbiased estimate of the average treatment effect, the adjusted mean differences in the nonrandomized experiment are compared to those in the randomized experiment. However, due to sampling error the covariates may not be perfectly balanced even in the randomized experiment, and thus the mean difference in the randomized experiment may not give a good estimate of the true average treatment effect. This is especially true when—as here—sample sizes are modest. In the randomized arm we found standardized mean differences on the covariates between the two treatment groups of up to 0.34 (for study major). However, neither hypothesis testing (1 out of 39 covariates showed a significant mean difference) nor the pattern of mean differences indicates a deviation from random covariate imbalance. Nonetheless, the random imbalance in the randomized experiment resulted in a non-negligible difference between the covariance-adjusted and unadjusted treatment effects (0.25 and 0.19 standard deviations for Math and English, respectively). So, following the suggestions of Rubin (2008b), we adjusted the experimental treatment effect using ANCOVA with all covariates included as main effects. Given the small sample sizes, adjusting for all 33 predictors is a delicate choice. However, in order to avoid critical issues related to selecting the outcome model and estimating and testing parameters on the same data set (e.g., Rao & Wu, 2001), we decided prior to analyzing the outcome data to use either no or all covariates for adjusting the treatment effects of both the experiment and the observational study.

In the nonrandomized study the target population is defined by the region of common support where cases overlap, thus creating different populations in the experiment and non-experiment. To ensure comparable populations, the randomized experimental results were limited to the 79 observations lying within the overlapping range of the nonrandomized data. This was done by estimating the PS for the randomized experiment using the estimated PS model of the nonrandomized experiment (Rubin, 2008b) and then discarding all cases of the randomized experiment scoring 0.1 standard deviations below the minimum or above the maximum of the observed non-experimental PS-logit.
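The caliper rule just described can be expressed in a few lines; again an illustrative sketch with hypothetical variable names:

```python
# Equating the populations: score the experimental cases with the PS
# model estimated on the nonrandomized arm, then keep only cases whose
# PS-logit lies within 0.1 SD of the nonexperimental range.
import numpy as np

def overlap_mask(logit_exp, logit_nonexp, caliper=0.1):
    """Boolean mask of experimental cases inside the region of common
    support defined by the nonexperimental PS-logit distribution."""
    sd = np.std(logit_nonexp, ddof=1)
    lower = np.min(logit_nonexp) - caliper * sd
    upper = np.max(logit_nonexp) + caliper * sd
    return (logit_exp >= lower) & (logit_exp <= upper)
```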

Results

Initial Bias


Table 3 presents the mean outcome differences at posttest for the overlapping cases in both the randomized and the unadjusted nonrandomized experiment. Results from the randomized experiment reveal that the Math training had a medium-sized effect of 12.6 percentage points (Cohen's d = 0.57), while the English training had none (-1.6, corresponding to d = -0.1). Initial bias is defined as the difference between the estimates of the average treatment effect in the adjusted experiment and in the unadjusted nonrandomized experiment, given that adjustment in the experiment deals with sampling error only and not with selection bias. The initial bias on the Math outcome of 13.9 - 12.6 = 1.3 (d = 0.06) is trivial and not statistically significant. Thus, in Berlin there was no observed bias in who selected Math. However, the initial bias on the English outcome amounts to -7.8 - (-1.6) = -6.1 (d = -0.38)1. This is larger than in Shadish et al. (d = 0.24 and 0.26). It also operates in the opposite direction, since students who scored low on the English pretest and who do not like English self-selected into the English training. This is the opposite direction of bias to Shadish et al. (2008), who discovered that American students chose exposure to topics at which they performed better and that they liked more. The net result in Germany, then, is a negative mean posttest difference between the unadjusted nonrandomized experiment and the randomized experiment for the training in English. The research question is: will statistical adjustments remove all of this bias? More exactly, since there was no treatment effect for English in the randomized experiment, will the adjustments for group non-equivalence make the spurious effect found in the non-experiment disappear, thus reproducing the null finding from the experiment? We should also not forget another possibility. There was no detectable bias initially on the Math outcome.


This does not negate the possibility, though, that statistical adjustments might create a bias where formerly there was none!

-------------------------------
Insert Table 3 about here
-------------------------------

Bias Reduction

Table 3 also shows the adjusted mean differences in the nonrandomized experiment, the amount of bias, and the R-squared of the outcome regression model, using only the overlapping cases. The estimated average treatment effect of the Math training does not differ across adjustment methods and is similar to the estimates from the randomized experiment and the unadjusted non-experiment. This suggests that the statistical adjustment methods used did not create bias where there was none before. There was demonstrable selection bias on the English outcome, but all the adjustment methods succeeded in reducing it from -0.38 standard deviations to 0.07 or less. Moreover, within the power limits of this study, no method did obviously better than any other. All significantly reduced the bias to close to zero. According to Schafer & Kang (2008), an outcome regression model performs well in terms of bias, efficiency and robustness when prediction is strong. This is especially the case with models of the English outcome, where a large proportion of the variance is explained (0.63 < R² < 0.68, versus 0.38 < R² < 0.47 for the Math outcome). For English, where there was initial bias, nearly all of it was removed by the covariates, whether PS or ANCOVA analysis was used and however the PS scores were analyzed.


-------------------------------
Insert Table 4 about here
-------------------------------

All cases outside the region of common support were omitted from the analyses above, limiting the target population to which the findings can be generalized. But even when these more extreme cases are included in an ANCOVA analysis without propensity scores, bias reduction is still significant. Table 4 shows that including all the cases increased the initial bias on the English outcome to d = -0.49, compared with d = -0.38 when the non-overlapping cases were excluded. Yet ANCOVA still reduced the bias to d = 0.04. For the Math intervention there was again no initial bias (d = 0.01) when the non-overlapping cases were added, and using ANCOVA did not change this finding (bias of d = 0.01). In this application, then, it was appropriate to assume linearity and to make extrapolations that apply to the whole population and not just to those cases falling within the region of common support.

Discussion

The main purpose of this study was to replicate Shadish et al. (2008) by recreating their short-term laboratory analog experiment while varying some factors that might have affected their results. We found with the English intervention, where there was initial bias as in Shadish et al., that (1) the estimated average treatment effects were similar in the randomized experiment and the adjusted nonrandomized experiment; (2) the PS and ANCOVA analyses did not obviously differ in bias reduction; and (3) how the balancing PS was used in estimating treatment effects (stratification or PS-ANCOVA) had little effect.
We can be confident, therefore, that the earlier results are not due to chance or to incidental but efficacious operational details whose subtlety other researchers cannot reproduce. The details needed to reproduce Shadish et al. are robust enough to cross oceans and languages, and to hold when the intervention entails learning a foreign language as well as when it entails learning Math and Vocabulary in one's national language. In both studies the choice of covariates made all the difference and the mode of data analysis did not.

In this Berlin study, the size and direction of bias were different from those in Memphis. Students self-selecting into the English treatment were the worst performers, whereas in Memphis it was the better performers who self-selected into both Math and Vocabulary. This was probably because, in Berlin, the study was introduced as training in academic subjects that students needed for their university study, while in Memphis the students took part in order to get credits for participating in compulsory experiments. Bias was also larger in the Berlin replication by about 50% compared to Memphis (0.38 versus 0.26 standard deviation units). Yet the bias-reducing consequences of the covariates were similar despite study differences in the degree and direction of initial bias. Of course, one cannot extrapolate to even larger initial biases than the 0.38 SD units found here. When more radically different populations are compared, our confidence in the ability of ANCOVA to compensate for selection will be lower, as will our ability to generalize PS results beyond the limited area of common support shared by radically different populations. But many group differences in social science and behavioral applications are not large. Indeed, in many circumstances researchers have it in their power to select control groups from populations that are only modestly different from a treatment population, thereby maximizing overlap.
To show that the careful selection of covariates reduces initial bias on the order of -0.38 SD units is therefore not irrelevant to much actual and possible practice in the social and behavioral sciences.

For the Math training, the experimental and unadjusted non-experimental outcomes were very similar and did not statistically differ, suggesting that no detectable bias was present. This was neither expected nor intended. Though fortuitous, it points to the fact that comparison populations, while always potentially different on some unobserved variables, may nonetheless not differ on direct measures of bias. In applications without a yoked randomized experiment it will not be possible to know this, though uncertainty is presumably lower if deliberate steps were taken in study planning to reduce initial population differences on the major predictors of outcome. Cook et al. (2008) have documented three within-study comparisons where researchers deliberately sought to contrast experimental results with those from a non-experiment with a local, intact comparison group that was not randomly selected but was instead matched on pretest measures of the outcome. In all three studies, effect sizes in the matched non-experiment were not much different from those obtained in the randomized experiment, even in the absence of statistical controls for non-comparability. Presumably the study sampling designs were alone sufficient to reduce any initial group non-equivalence, and hidden bias played no subsequent role in influencing the relevant outcomes.

The balancing properties of propensity scores rely on large samples (Rosenbaum & Rubin, 1983), and propensity score modeling is usually not advised when the sample size is as modest as here and the covariates are as numerous. A lack of overlap is likely to result, making it difficult to achieve balance on all covariates.
In this study, balance met most of the traditional criteria of adequacy but was still not perfect. Fortunately, though, the obtained balance results could be corroborated both by the experimental results and by the prior Shadish et al. (2008) findings. In applications without the benchmark of a randomized experiment and a prior study, a PS analysis with as few cases as here is not advisable. Nonetheless, the current results show that in some cases PS analyses may work even with small samples. This may be due to the low complexity of both the model for the selection mechanism and the functional form of the outcome models. Small samples also affect the reliability of experimental estimates, since it is only on average that random assignment results in comparable treatment and control samples. The law of large numbers indicates that small samples are especially prone to generate larger group differences and, in their review of within-study comparisons, Glazerman et al. (2003) pointed to the restricted reliability of the experimental causal benchmarks they used when these were based on small samples. Hence, covariance adjustment may help to equate groups in the randomized experiment (Rubin, 2008b), though only on observed group differences.

In the study just reported, the means of the covariates were well balanced, but some variance ratios lay outside the recommended benchmark. Rubin (2004) states: "… the investigator will almost certainly not be able to achieve balance for many covariates simultaneously, and higher order terms in minor covariates are clearly less important than means of important ones, and so scientific judgment must enter the process…" (p. 856). In the present study, the variance ratios of the covariates assessing knowledge of and preference for English, which may be assumed to be important covariates, were out of balance.
However, their means were very well balanced and their coefficients in the English outcome model were not significant. As it turned out, the most important covariates in the English outcome model were the English pretest and liking for Math, and these two were well balanced. Of course, in research without experimental and replication benchmarks, one cannot know independently what the most important covariates are. Then one has to rely on theory, expert judgment and common sense instead.

We have shown that it is possible to model the selection process and to get an unbiased treatment effect even with nonrandomized experiments. This finding has also been confirmed in a large-scale multi-site field study using a three-arm within-study comparison (Diaz & Handa, 2006). The authors used data from Progresa in Mexico, an intervention designed to provide income support to poor families contingent on their children obtaining health services and not dropping out of school. Eligible villages were randomly assigned to the program or not, and eligible families in the assigned villages were then compared with eligible families in the control villages. Family eligibility was determined according to scores on a measure of material hardship. This hardship measure was also available from the same time for a set of villages that were too affluent on average to be part of the random assignment. Yet there were still some eligible families within these relatively more affluent villages, though they were on average older, more affluent, better educated and had fewer children than eligible families in the eligible villages. Nonetheless, when the material hardship measure that determined access to treatment was used as a covariate, it was enough to equate the originally non-equivalent groups of eligible families in the eligible and ineligible villages.
Knowing the selection process into treatment—the material hardship score—and having quality measures of it in the non-equivalent comparison sample was sufficient to eliminate all the observed group non-equivalence and to produce an experiment and an adjusted non-experiment with similar causal estimates, even though the experiment and the unadjusted non-experiment had given different causal answers.

The present results raise the question of which adjustment method to use in applications. Modeling the outcome (ANCOVA) and modeling the selection process (PS) are different processes, and each has its own advantages and has to meet its own set of requirements and assumptions. One advantage of propensity scores is that they balance selection differences between groups prior to any analysis of the outcome (Rubin, 2001b; Rubin, 2008a). A second advantage of PS analyses is their reduced dependence on functional form assumptions in analyzing the outcome once balance on the observed covariates has been achieved, whereas ANCOVA relies on correct specification of the outcome model (Cochran & Rubin, 1973; Schafer & Kang, 2008). Furthermore, sensitivity analyses play a much more formal role for PS methods (Rosenbaum, 2002) than for ANCOVA, though sensitivity tests can also be conducted for ANCOVA results. However, a salient disadvantage is that only overlapping cases may be used for drawing causal inference with PS (Rubin, 2004), thereby restricting the target population reached. Moreover, when the selection mechanism is complex, little overlap may result. ANCOVA, on the other hand, allows inference for the whole population by relying on extrapolation. However, extrapolation rests on functional form assumptions that may not be met, thus leading to biased estimates of the treatment effect. So, given the different assumptions involved, it is to be expected that the mode of data analysis could prove consequential.
However, in this study and in Shadish et al. (2008) it did not. In their review of 12 within-study comparison projects in job training, Glazerman et al. (2003) concluded that ANCOVA and PS methods hardly differed in the bias reduction they achieved, and Cook et al. (2008) concluded the same across 12 contrasts, most in substantive areas other than job training.

The present findings and those reviewed above are bound to raise the question: are propensity scores really needed? The empirical reality is that, in the studies to date, ANCOVA has worked just as well as PS methods and has not required omitting cases, and so has permitted generalization to the whole population under study, not just an artificial subset. We cannot claim, of course, that this comparability of results will also hold with non-equivalent populations that are even more different than those studied to date. Neither can we claim that these results hold for different treatments or in contexts other than those of the within-study comparisons done so far. In other applications, the set of observed covariates may be smaller, pretests on the outcome of interest may not be available, the outcome model may have a more complex non-linear functional form, and overlap may be weaker, challenging extrapolation to non-overlapping treatment and control subjects. Though linearity seemed to be an appropriate assumption in applications like those incorporated into past within-study comparisons, the generalizability of these findings requires further studies in different fields of research.


References

Abadie, A. & Imbens, G. W. (2008). On the failure of the bootstrap for matching estimators. Econometrica, 76(6), 1537-1557.

Angrist, J. D., Imbens, G. W. & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91, 444-455.

Cochran, W. G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24, 295-313.

Cochran, W. G. & Rubin, D. B. (1973). Controlling bias in observational studies: A review. Sankhya: The Indian Journal of Statistics, 35, 417-466.

Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York: Academic Press.

Cook, T. D., Shadish, W. R. & Wong, V. C. (2008). Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management, 27, 724-750.

Dehejia, R. H. & Wahba, S. (2002). Propensity score matching methods for nonexperimental causal studies. The Review of Economics and Statistics, 84, 151-175.

Diaz, J. J. & Handa, S. (2006). An assessment of propensity score matching as a nonexperimental impact estimator: Evidence from Mexico's PROGRESA program. Journal of Human Resources, 16, 319-334.


Educational Testing Services (1993). Arithmetic Aptitude Test (RG-1). Kit of factor referenced cognitive tests. Princeton, NJ: Educational Testing Services.

Fröhlich, M. (2004). Finite sample properties of propensity score matching and weighting estimators. Review of Economics and Statistics, 86, 77-90.

Glazerman, S., Levy, D. & Myers, D. (2003). Nonexperimental versus experimental estimates of earning impacts. The Annals of the American Academy, 589, 63-93.

Goldberger, A. S. (1972a). Selection bias in evaluating treatment effects: Some formal illustrations. Institute for Research on Poverty: Discussion Papers, Paper 123-72.

Goldberger, A. S. (1972b). Selection bias in evaluating treatment effects: The case of interaction. Institute for Research on Poverty: Discussion Papers, Paper 129-72.

Hill, J. (2008). Comment on "Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random to nonrandom assignment". Journal of the American Statistical Association, 103, 1346-1350.

Hirano, K. & Imbens, G. W. (2001). Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization. Health Services and Outcomes Research Methodology, 2, 259-278.

Ho, D. E., Imai, K., King, G. & Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15, 199-236.

Unbiased Causal Inference from an Observational Study

Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81, 945-960. Imai, K., King, G. & Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society: Series A, 171, 481-502. Kang, J. D. Y. & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22, 523-539. Krohne, H. W., Egloff, B., Kohlmann, C.-W. & Tausch, A. (1996). Untersuchungen mit einer deutschen Version der "Positive and Negative Affect Schedule" (PANAS) [Investigations with the German version of the „Positive and Negative Affect Schedule“ (PANAS)]. Diagnostica, 42, 139-156. LaLonde, R. (1986). Evaluating the econometric evaluations of training with experimental data. The American Economic Review, 76, 604-620. Liepmann, D., Tartler, K., Nettelnstroth, W. & Smolka, S. (2006). START-E. Testbatterie für Berufseinsteiger. Englisch [START-E. Test battery for young professionals. English]. Göttingen: Hogrefe. Little, R. J. A. & Rubin, D. B. (1987). Statistical analysis with missing data. Wiley, New York.


Little, R., Long, Q. & Lin, X. (2008). Comment on "Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random to nonrandom assignment". Journal of the American Statistical Association, 103, 1344-1346.
Lunceford, J. K. & Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine, 23, 2937-2960.
Neyman, J. (1923). On the application of probability theory to agricultural experiments. Essay on principles. Section 9 (reprinted 1990). Statistical Science, 5, 465-480.
R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Rao, C. R. & Wu, Y. (2001). On model selection. IMS Lecture Notes-Monograph Series, 38, 1-57.
Robins, J. M. & Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90, 122-129.
Rosenbaum, P. R. (2002). Observational studies. New York: Springer.
Rosenbaum, P. R. & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.


Rubin, D. B. (1973a). Matching to remove bias in observational studies. Biometrics, 29, 159-183.
Rubin, D. B. (1973b). The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, 29, 185-203.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688-701.
Rubin, D. B. (1977). Assignment to treatment group on the basis of a covariate. Journal of Educational Statistics, 2, 1-26.
Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6, 34-58.
Rubin, D. B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74, 318-328.
Rubin, D. B. (2001a). Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2, 169-188.
Rubin, D. B. (2001b). Estimating the causal effects of smoking. Statistics in Medicine, 20, 1395-1414.
Rubin, D. B. (2004). On principles for modeling propensity scores in medical research. Pharmacoepidemiology and Drug Safety, 13, 855-857.


Rubin, D. B. (2008a). For objective causal inference, design trumps analysis. The Annals of Applied Statistics, 2(3), 808-840.
Rubin, D. B. (2008b). The design and analysis of gold standard randomized experiments. Journal of the American Statistical Association, 103, 1350-1353.
Rubin, D. B. & Thomas, N. (1996). Matching using estimated propensity scores: Relating theory to practice. Biometrics, 52, 249-264.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.
Schafer, J. L. (2007). mix: Estimation/multiple imputation for mixed categorical and continuous data. R package version 1.0-6. http://www.stat.psu.edu/~jls/misoftwa.html
Schafer, J. L. & Kang, J. D. Y. (2008). Average causal effects from non-randomized studies: A practical guide and simulated example. Psychological Methods, 13, 279-313.
Shadish, W. R., Clark, M. H. & Steiner, P. M. (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random to nonrandom assignment. Journal of the American Statistical Association, 103, 1334-1343.
Steiner, P. M., Cook, T. D., Shadish, W. R. & Clark, M. H. (2009). The importance of covariate selection in controlling for selection bias in observational studies. Manuscript submitted for publication.
Trochim, W. M. K. (1984). Research design for program evaluation: The regression-discontinuity approach. Newbury Park, CA: Sage.


Watson, D., Clark, L. A. & Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology, 54, 1063-1070.
Watson, D., Clark, L. A. & Carey, G. (1988). Positive and negative affectivity and their relation to anxiety and depressive disorders. Journal of Abnormal Psychology, 97, 346-353.


Footnotes

1 We report the bias computed from the exact (unrounded) effect estimates; it may therefore differ slightly from the value obtained by subtracting the rounded estimates shown in the tables.
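For example, in Table 3 the unadjusted quasi-experimental mean difference for mathematics is 13.9 and the covariate-adjusted experimental benchmark is 12.6, a bias of 1.3 points; divided by the standard deviation of 22, this gives the reported 0.06 standard deviations.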



Table 1: Regression coefficients in the final model estimating the propensity scores.

Covariates in the model                Coefficient   Standard error   t-value
Intercept                                 -6.28            4.57        -1.38
Sex                                       -0.87            0.70        -1.24
Age (log)                                  2.20            1.36         1.62
Education father (medium)                  0.88            0.64         1.39
Education father (other)                   1.30            1.00         1.30
English pretest                           -4.30            2.16        -1.99
Grade in English                           0.61            0.41         1.49
Grade in Math                             -0.65            0.35        -1.89
Grade in Biology                          -0.57            0.41        -1.41
English knowledge                          0.56            0.37         1.50
Extra lessons Math                         0.46            0.25         1.86
Course level in English (not taken)        1.25            1.02         1.23
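As a minimal sketch of how a propensity score model of this form can be estimated in R (the software cited in the references), assume a data frame d with a binary indicator math for choosing the mathematics training and the covariates of Table 1; all variable names here are hypothetical and the covariate set is illustrative, not a report of the study's actual code:

    # Hedged sketch: logistic regression propensity score model of the
    # same form as Table 1; 'd', 'math', and all covariate names are assumed.
    ps_model <- glm(math ~ sex + log(age) + educ_father + english_pretest +
                      grade_english + grade_math + grade_biology +
                      english_knowledge + extra_math + course_english,
                    family = binomial(link = "logit"), data = d)
    d$ps       <- fitted(ps_model)   # estimated propensity scores
    d$ps_logit <- qlogis(d$ps)       # logit of the propensity score (cf. Table 2)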


Table 2: Balance statistics for the covariates. B is the standardized mean difference in the covariates between the two training groups and R is the variance ratio.

Covariates                                 B        R
Demographic variables
  Sex                                    0.086    0.848
  Study major                           -0.078    0.945
  Nationality                            0.209    1.424
  Family status                          0.010    1.035
  Age                                   -0.046    0.542
  Age (log)                             -0.009    0.719
  Foreign language                       0.085    1.142
  Education mother (high)                0.062    0.997
  Education mother (medium)              0.102    1.087
  Education mother (low)                -0.324    0.364
  Education mother (other)               0.049    1.153
  Education father (high)                0.101    0.960
  Education father (medium)             -0.022    0.969
  Education father (low)                -0.106    0.775
  Education father (other)              -0.025    0.925
Prior academic achievement
  Grade in Literature                    0.051    1.040
  Grade in English                      -0.070    0.588
  Grade in Math                          0.027    1.398
  Grade in Biology                      -0.020    0.593
  Math knowledge                         0.162    1.189
  English knowledge                      0.137    0.256
  Course level in English (advanced)     0.166    1.075
  Course level in English (basic)       -0.058    0.990
  Course level in English (not taken)   -0.038    0.902
  Course level in English (other)       -0.183    0.463
  Course level in Math (advanced)        0.110    1.123
  Course level in Math (basic)          -0.104    1.055
  Course level in Math (not taken)      -0.034    0.901
  Course level in Math (other)           0.067    1.483
Topic preference
  Prefer Math over English               0.032    1.028
  Extra lessons Math                     0.039    1.169
  Extra lessons English                 -0.044    0.826
  Like English                           0.090    0.549
  Like Math                             -0.057    1.017
Pretest measures
  English pretest                        0.033    0.911
  Math pretest                          -0.001    0.801
Psychological variables
  Positive affect                       -0.013    0.698
  Negative affect                       -0.220    0.436
  Negative affect (log)                 -0.174    0.589
PS logit                                -0.027    0.980
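A minimal sketch, in R, of how the balance statistics B and R can be computed for a single covariate; the pooled standard deviation in the denominator is an assumption about the exact standardization used, and all names are hypothetical:

    # Hedged sketch of the balance statistics in Table 2: B standardizes the
    # group mean difference by a pooled standard deviation (an assumption);
    # R is the ratio of the group variances.
    balance_stats <- function(x, z) {   # x: covariate; z: 1 = math, 0 = English
      m1 <- mean(x[z == 1]); m0 <- mean(x[z == 0])
      v1 <- var(x[z == 1]);  v0 <- var(x[z == 0])
      c(B = (m1 - m0) / sqrt((v1 + v0) / 2), R = v1 / v0)
    }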


Table 3: Unadjusted and adjusted results of the randomized and the nonrandomized experiment, including only cases within the region of common support.

                                      Mean    Standard   Bias    Bias      R
                                      diff.    error            in SD   squared
Math (SD = 22)
  Random. experiment (covar.-adj.)    12.6      4.7       0.0    0.00     0.47
  Unadjusted quasi-experiment         13.9      4.0       1.3    0.06     0.11
  PS stratification                   14.0      4.2       1.4    0.06     0.10
    plus covariate adjustment         15.0      3.5       2.4    0.11     0.47
  PS nonlinear ANCOVA                 12.6      4.5       0.0    0.00     0.10
    plus covariate adjustment         13.7      4.0       1.1    0.05     0.40
  ANCOVA using observed covariates    14.2      3.8       1.6    0.07     0.38

English (SD = 16)
  Random. experiment (covar.-adj.)    -1.6      2.5       0.0    0.00     0.63
  Unadjusted quasi-experiment         -7.8      3.3      -6.1   -0.38     0.05
  PS stratification                   -0.5      3.7       1.2    0.07     0.00
    plus covariate adjustment         -1.1      2.2       0.5    0.03     0.68
  PS nonlinear ANCOVA                 -0.5      3.3       1.2    0.07     0.24
    plus covariate adjustment         -1.7      2.5       0.0    0.00     0.63
  ANCOVA using observed covariates    -1.3      2.4       0.4    0.02     0.64
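A minimal sketch of the propensity score stratification estimator whose results appear in Table 3, assuming quantile-based strata in the spirit of Cochran (1968); the number of strata and the weighting scheme are assumptions, not a report of the study's exact procedure:

    # Hedged sketch of PS stratification: average the within-stratum mean
    # differences, weighted by stratum size (five quantile strata assumed).
    ps_stratified_effect <- function(y, z, ps, n_strata = 5) {
      breaks <- quantile(ps, probs = seq(0, 1, length.out = n_strata + 1))
      strata <- cut(ps, breaks, include.lowest = TRUE)
      effects <- tapply(seq_along(y), strata, function(i)
        mean(y[i][z[i] == 1]) - mean(y[i][z[i] == 0]))  # within-stratum diff.
      sum(effects * table(strata) / length(y))
    }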


Table 4: Unadjusted and adjusted results of the randomized and the nonrandomized experiment, including all data.

                                      Mean    Standard   Bias    Bias      R
                                      diff.    error            in SD   squared
Math (SD = 22)
  Random. experiment (covar.-adj.)    14.2      3.7       0.0    0.00     0.53
  Unadjusted quasi-experiment         14.5      3.7       0.3    0.01     0.13
  ANCOVA using observed covariates    14.4      3.6       0.2    0.01     0.43

English (SD = 15)
  Random. experiment (covar.-adj.)    -1.5      2.0       0.0    0.00     0.70
  Unadjusted quasi-experiment         -8.8      3.1      -7.3   -0.49     0.06
  ANCOVA using observed covariates    -0.9      2.2       0.6    0.04     0.69
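A minimal sketch of the "ANCOVA using observed covariates" adjustment reported in Tables 3 and 4: the adjusted treatment effect is the coefficient of the treatment indicator in a regression of the outcome on treatment and covariates. The covariate list below is illustrative only, not the study's actual specification:

    # Hedged sketch of the ANCOVA adjustment; variable names are assumed.
    ancova_fit <- lm(outcome ~ math + sex + log(age) + english_pretest +
                       math_pretest + grade_math + grade_english, data = d)
    coef(ancova_fit)["math"]   # covariate-adjusted mean difference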


Figure Captions

Figure 1. Design of the study by Shadish et al. (2008).
Figure 2. Design of the present study.


[Figures 1 and 2 are not reproduced in this text extraction; see the captions above.]