EdPolicyWorks Working Paper: Evaluating Methods for Selecting School-Level Comparisons in Quasi-Experimental Designs: Results from a Within-Study Comparison
Kelly Hallberg1, Vivian C. Wong2, & Thomas D. Cook3

This paper compares the performance of three approaches to selecting school-level comparison units in educational evaluations that try to match treatment and comparison schools. In one approach, matching is on “focal” characteristics that are assumed to be related to both treatment assignment and outcome. In another, matching is on geographical attributes and so both sets of schools come from the same “local” area – in this case, from the same school district. In the third “hybrid” approach, both focal and local attributes are used sequentially. First, matching occurs within school districts and, for treatment schools without comparable local matches, focal matches are found with non-local schools that have otherwise similar observed characteristics. To assess the performance of these three approaches to matching, the study employs a within-study comparison design in which treatment effect estimates from a quasi-experimental research design are compared to results from a randomized experiment that shares the same treatment group. We find that focal and local matching each reduce bias by at least 75 percent relative to the simplest two-group design where posttest differences are compared without any covariates. Indeed, after covariate adjustment, all the estimated treatment effects are less than .02 SDs from the experimental benchmark and there were no statistically significant differences between the matched quasi-experimental and the experimental effects. Even so, the hybrid matching approach outperformed both the focal and local approaches, reducing even more of the initial bias and coming closer to the experimental benchmark.

1 Urban Labs, University of Chicago; 2 University of Virginia; 3 Northwestern University / Mathematica

Updated April 2016
EdPolicyWorks, University of Virginia, PO Box 400879, Charlottesville, VA 22904
EdPolicyWorks working papers are available for comment and discussion only. They have not been peer-reviewed. Do not cite or quote without author permission. Working paper retrieved from: http://curry.virginia.edu/uploads/resourceLibrary/47_School_Comparisons_in_Observational_Designs.pdf

EdPolicyWorks Working Paper Series No. 47. April 2016. Available at http://curry.virginia.edu/edpolicyworks/wp Curry School of Education | Frank Batten School of Leadership and Public Policy | University of Virginia Working Paper © 2016 Rector and Visitors of the University of Virginia. For more information please visit www.curry.virginia.edu/edpolicyworks or contact [email protected]

EVALUATING METHODS FOR SELECTING SCHOOL-LEVEL COMPARISONS IN QUASI-EXPERIMENTAL DESIGNS: RESULTS FROM A WITHIN-STUDY COMPARISON
Kelly Hallberg, Vivian C. Wong, & Thomas D. Cook

Introduction

In many program evaluations, the treatment effect of interest is at the school level. This is for both conceptual and pragmatic reasons. First, policy-makers and program administrators often exert influence through decisions that affect school-level policy. As such, they are most interested in learning “what works” for the policy levers available to them to effect change. Examples include adopting a new curriculum, implementing a whole-school reform, or changing the structure of the school day. Second, because of concerns about treatment contamination and spillover effects, programs must sometimes be evaluated at the aggregate level to account for possible interference between units that may bias treatment effect estimates. Third, the introduction of longitudinal state data systems that track information such as average student performance and school demographics provides new opportunities for researchers to use this information to evaluate school-level programs and policies. Thus, school-level treatment effects are often of interest to program evaluators.

From an internal validity perspective, the best evaluations of school-level interventions use cluster randomized controlled trials (RCTs) to assign treatments to schools (Gerber & Green, 2012). But such RCTs are often infeasible for ethical, political, or practical reasons. Moreover, RCTs often entail a tradeoff between internal and external validity because the schools that volunteer for an RCT are not randomly sampled from a clear target population. For either reason, researchers need an alternative to the RCT. Quasi-experiments (QEs) fit this bill. But the dilemma is that they substitute non-equivalent comparison units for the RCT’s randomly assigned ones.

To date, we are aware of only one paper that examines QE approaches for selecting school-level comparisons (Stuart, 2007). Stuart’s paper presents methods for matching schools as a straightforward extension of the student-level case, where covariate adjustment or propensity score matching is usually used to select comparable students. The causal inference framework and assumptions required for school-level matching are indeed the same as at the student level. However, the implementation of school-level matching poses some unique challenges in QE evaluations and invites a new approach to matching that combines elements of the approaches currently advocated in the literature on QE design and analysis.

The current literature on QE design recommends (1) conditioning on pretreatment measures of the outcome of interest (Glazerman, Levy, & Myers, 2002; Bloom, Michalopoulos, & Hill, 2005; Smith & Todd, 2005; Bifulco, 2012; Hallberg & Cook, in progress), (2) using rich, reliable, and heterogeneous sets of covariates to form matches (Cook, Shadish, & Wong, 2008; Shadish, Clark, & Steiner, 2008; Steiner, Cook, & Shadish, 2011), and (3) selecting treatment and comparison units that are geographically close (Friedlander & Robins, 1995; Bell, Orr, Blomquist, & Cain, 1995; Heckman, Ichimura, Smith & Todd, 1998; Bloom et al., 2005; Bifulco, 2012). Cook et al. (2008) summarized this as “focal, local” matching. “Focal” matches lead to treatment and comparison schools that are similar on observed covariates that are considered to be relevant to the selection process and the study outcome. However, such matches might not be balanced on all relevant unobserved variables. “Local” matching entails selecting units that are geographically close under the assumption that this achieves some level of comparability on both observed and unobserved characteristics. For example, in research on school-level interventions to raise academic achievement, the claim is that schools in the same district are more alike than schools randomly selected across districts in terms of district policies and resources, community attitudes towards schooling, and even the race/ethnicity and socio-economic characteristics that affect achievement.

One ideal would be to obtain a large number of treated and comparison schools from within the same district that are observably similar on a wide and heterogeneous array of variables related to both within-district selection processes and the study outcome. However, the modest number of schools in most districts means that this ideal is rarely achieved. Instead, researchers confront a dilemma: choose comparison schools from the same district that are not very similar to their treatment cases, or choose comparison schools that are closely matched on what are believed to be relevant characteristics but may not be geographically close. This paper offers a way out of this dilemma by testing an approach for combining local and focal matching that Stuart (2007; Stuart & Rubin, 2008) originally proposed for matching individuals, and that we extend to the matching of aggregate units. This approach, which we call “hybrid matching,” involves looking for comparable matches within the same geographic area first and, if none are found, selecting non-local matches that appear similar to treatment units on observable characteristics.

This paper has two primary goals: (1) to assess how well each matching strategy reduces bias in the QE and achieves a causal estimate close to that of an RCT benchmark; and (2) to evaluate whether the hybrid strategy outperforms local or focal matching alone (this assumes that neither focal nor local matching achieves perfect bias reduction by itself).

These objectives require a within-study comparison (WSC) design, sometimes also called a design replication study. Introduced by LaLonde (1986), the earliest WSCs used data from job training evaluations to compare results from a QE and an RCT that shared the same treatment group. The authors used extant datasets such as the Current Population Survey (CPS) or the Panel Study of Income Dynamics (PSID) to determine whether statistical adjustment procedures (e.g., regression, propensity score, or instrumental variable methods) replicated the RCT’s causal estimate. The early conclusions were that these methods failed to produce results comparable to the causal benchmark (LaLonde, 1986; Fraker & Maynard, 1987; Friedlander & Robins, 1995; summarized in Glazerman, Levy & Myers, 2002). However, these early WSCs had RCT and QE units that differed in many ways other than whether the comparison group was formed at random. In some cases, comparison units were drawn from distant locations and measured at different times on different scales from those in the RCT. Later studies took pains to draw comparisons from the same target population as the RCT, located within the same geographical area and measured at the same time on the same scale. WSC studies also moved beyond asking whether QE methods could replicate RCT results and came to focus instead on the conditions under which QE and RCT methods generate similar causal results. The strategy employed here takes advantage of recent advances in the theory and conduct of WSCs, as discussed in more detail below.

Addressing Bias in Quasi-Experimental Designs

Following Heckman, Ichimura, Smith, and Todd (1998), Bifulco (2013), and Jung and Pirog (2014), we consider three sources of bias in a QE. First, bias occurs when treatment and comparison units with different values of X are compared. For example, if X is a measure of average wealth in a school, effect estimates are biased if achievement outcomes for wealthy treatment schools are compared with those from low-income no-treatment schools. Second, even when there is overlap in X, bias can still arise when the distribution of covariates takes a different form in the treatment and comparison groups. If X is a measure of urbanicity and both the treatment and comparison samples include urban, suburban, and rural schools, the treatment sample can still consist of mostly urban schools and the comparison sample of mostly rural schools, leading to a lack of overlap and balance even on observed covariates. Third, bias may arise because of selection on unobserved covariates. Here, treatment and comparison schools might have similar values on the observed confounder X, but still differ in unknown ways related to treatment assignment and the outcome. For example, treatment and comparison schools may be similar on aggregate measures of race/ethnicity, material wealth, and other observed demographic factors, but nonetheless differ in school management and leadership styles that are unobserved.

Rosenbaum and Rubin (1983) refer to this last case as “hidden bias,” a violation of the strong ignorability assumption.

The causal inference literature suggests methods for addressing all three forms of bias. When all the confounders are known and measured reliably, covariate adjustment and matching methods will validly estimate treatment effects. In regression, treatment effects are estimated by fitting parametric models that include all causally relevant confounders and an indicator for treatment. But badly biased results will emerge when the model fits the data poorly, such as when treatment and comparison groups have very different characteristics and the fitted regression line extrapolates over a multi-dimensional space that is poorly estimated (Rubin, 1997; Kang & Schafer, 2007; St. Clair et al., in press). An alternative is to match treatment and comparison units with similar values on pretreatment covariates to improve covariate balance and avoid extrapolation. The challenge with case matching, however, is that as the number of matching variables increases, so do the dimensionality of the matching problem and the difficulty of finding suitable matches for each treated unit. Propensity score matching addresses the multi-dimensionality issue by allowing researchers to match on a single estimated propensity score. Rosenbaum and Rubin (1983) showed that propensity score matching can generate unbiased treatment effects, but only if the strong ignorability assumption is met. This means that the propensity scores are estimated from reliable measures of all covariates that simultaneously affect selection and potential outcomes, and that observed covariates are well balanced in the treatment and comparison groups and overlap on the propensity score.

Focal matching methods, whether implemented through regression, case matching, or propensity score matching, all rely on the strong ignorability assumption. When it is independently known that the strong ignorability assumption is met, causal inference is not problematic. The practical problem is that knowledge of the selection process is almost always incomplete. Covariates are selected in part by attempting to conceptualize the true but unknown selection process and/or by collecting as many covariates from different domains as possible in the hope of tapping into the true selection process. The practical problem with focal matching, then, is that hidden bias may remain in the form of imbalance on unobserved variables that are related to both selection and the study outcome.

In theory, local matching has an advantage over focal matching to the extent that all confounders are fixed at the same value within the same region, school district, or labor market. In this way, local matching can be seen as a special case of the fixed effects approaches that are common in the econometrics literature.

Of course, there is no guarantee that local matching equates treatment and comparison units on all relevant observed and unobserved variables; the more modest claim is that it does better than any other QE alternative. There are many ways to implement local matching in education contexts. In the form exploited here, local matching entails choosing comparison schools within the same district without explicitly balancing on other covariates, though local matching may also be expanded to match on some observed characteristics that might otherwise vary between the treatment and comparison groups.

Another challenge with local matching is that there may not be enough comparable units within a fixed geographic area. In the education context, a school district may contain only two schools (one treatment, one comparison) that differ on most baseline characteristics: one school is located in a wealthy area with high average pretest scores, while the other is located in a low-income neighborhood with low prior achievement. Alternatively, the researcher may know that all schools were offered the opportunity to participate in the treatment but that, for unobserved reasons that vary across schools within districts, schools selected into treatment differentially. In both cases, local comparisons are clearly not optimal.

Hybrid matching is one way of gaining the advantages of both focal and local matching. In the hybrid approach (Stuart & Rubin, 2008), geographically local matches are chosen first, but if the local matches appear too different from the treatment schools, they are discarded in favor of non-local schools selected on observable focal characteristics. Here, “similar enough” is operationalized as a caliper width for the maximum distance in estimated propensity scores between schools within the same district. A large district caliper indicates that local matches are prioritized for addressing observed and unobserved confounders, while a small district caliper indicates that focal matching on observed covariates is preferred for removing bias.

Methods

The WSC Design: Conceptual Framework

We begin by formalizing the WSC design in potential outcomes notation. Under the Rubin Causal Model, each school $i$ has different potential outcomes depending on its assignment to the treatment or control condition. Let $i = 1, \dots, N$ index schools in a study sample drawn from some population of interest. Next, let $D_i$ be a binary variable indicating whether the school received the treatment or control intervention, with $D_i = 1$ for treated schools and $D_i = 0$ for control schools. For each school, let $Y_i(1)$ and $Y_i(0)$ denote the pair of potential outcomes that the school would have experienced under the two treatment scenarios.

Here, both potential outcomes are defined for each school, but only one of the two is observable. The observed outcome can be written as a simple function of the potential outcomes and the treatment status variable. For a random school in the sample, the observed outcome is $Y_i = Y_i(1)D_i + Y_i(0)(1 - D_i)$, and the unit-specific causal effect is the difference between the two potential outcomes, $\tau_i = Y_i(1) - Y_i(0)$. However, because only one of the two potential outcomes is observed for each school, we can never form a direct estimate of the school-specific effect (Holland, 1986). We can, however, estimate average treatment effects (ATE) across an overall population of schools, defined as $\tau_{ATE} = E[Y_i(1)] - E[Y_i(0)]$. Researchers may also be interested in the average treatment effect for the sub-population of schools that are actually treated ($D_i = 1$). This parameter is the average treatment effect on the treated (ATT), $\tau_{ATT} = E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 1]$.

For a WSC design to produce interpretable results, there must be a valid causal benchmark against which to evaluate the quasi-experimental approach. Here, the benchmark is an RCT with well-implemented randomization into treatment and control conditions, no differential attrition, and no interference between units, the latter sometimes called the Stable Unit Treatment Value Assumption (SUTVA). When these assumptions are met, $E[Y_i(0)] = E[Y_i(0) \mid D_i = 0] = E[Y_i(0) \mid D_i = 1]$ and $E[Y_i(1)] = E[Y_i(1) \mid D_i = 1]$. These equalities hold because random assignment ensures that the potential outcomes in the treatment group are equivalent to those in the control group, such that

$\tau_{ATE} = \tau_{ATT} = E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]$.

In WSC designs, RCT data are used to estimate the causal estimand of interest, $\tau_{ATT}$. For clarity, we represent this estimate in the experimental condition as $\hat{\tau}_{RCT\_ATT} = \hat{E}[Y_i(1) \mid D_i = 1] - \hat{E}[Y_i(0) \mid D_i = 0]$, the sample analog of the RCT population expectation.

The second WSC design requirement is variation in treatment assignment between the RCT benchmark and the QE design under investigation. The difference between the two conditions is that QE comparison units must have self-selected into their treatment status or have been selected by a third party, while control cases in the RCT were assigned at random. In QEs, we expect the unadjusted conditional difference (sometimes called the naïve treatment effect) to be unequal to the causal estimand of interest, such that

$E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0] \neq E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 1] = \tau_{ATT}$.

However, unbiased causal treatment effects may be obtained in the QE if all the sources of bias have been addressed either by matching or by statistical adjustment. In addition, SUTVA must hold, as indeed it must in the RCT too. Given these two assumptions, we can define the average treatment effect on the treated in the QE as the difference in conditional expectations between the treatment and no-treatment groups' outcomes, that is, $\tau_{ATT} = E[Y_i(1) \mid X, D_i = 1] - E[Y_i(0) \mid X, D_i = 0]$, which is identical to $\tau_{ATT} = E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]$. In the WSC context, we estimate the causal estimand of interest using QE data, which we call $\hat{\tau}_{OS\_ATT} = \hat{E}[Y_i(1) \mid D_i = 1] - \hat{E}[Y_i(0) \mid D_i = 0]$.

In WSC designs, the research question of interest is whether the QE approach produces an unbiased causal treatment estimate for some well-defined population. Here, bias for the ATT is defined as

$B = \{E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]\}_{OS\_ATT} - \{E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]\}_{RCT\_ATT}$,

which may be simplified to $B = E[Y_i(0) \mid D_i = 0]_{RCT\_ATT} - E[Y_i(0) \mid D_i = 0]_{OS\_ATT}$. If we assume that the RCT provides a valid estimate of $\tau_{ATT}$, we can estimate bias in the WSC set-up by taking the difference in treatment effect estimates between the QE and the RCT, such that $\hat{B} = \hat{\tau}_{OS\_ATT} - \hat{\tau}_{RCT\_ATT}$. Alternatively, because the treatment group is shared by the QE and RCT effect estimates, we may eliminate it altogether and difference the conditional mean outcomes of the RCT control group and the QE comparison group: $\hat{B} = \hat{E}[Y_i(0) \mid D_i = 0]_{RCT\_ATT} - \hat{E}[Y_i(0) \mid D_i = 0]_{OS\_ATT}$.
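To make these estimands concrete, here is a minimal simulation sketch, not taken from the paper, of how the bias estimate $\hat{B} = \hat{\tau}_{OS\_ATT} - \hat{\tau}_{RCT\_ATT}$ behaves when the quasi-experimental comparison pool self-selects on an outcome-related covariate. All variable names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                                       # hypothetical schools statewide

# One observed, outcome-related covariate and the two potential outcomes.
x = rng.normal(size=n)
y0 = 0.5 * x + rng.normal(size=n)              # outcome without treatment
y1 = y0 + 0.10                                 # constant true effect of 0.10 SD

# RCT arm: a random subset of schools is randomized to treatment or control.
rct = rng.choice(n, size=600, replace=False)
d = rng.integers(0, 2, size=rct.size)
tau_rct = y1[rct][d == 1].mean() - y0[rct][d == 0].mean()

# QE arm: the same treatment schools, but the comparison group self-selects
# out of treatment, with selection depending on the covariate x.
treated = rct[d == 1]
pool = np.setdiff1d(np.arange(n), rct)
keep = rng.random(pool.size) < 1 / (1 + np.exp(-x[pool]))
comparison = pool[keep]
tau_os = y1[treated].mean() - y0[comparison].mean()

print(f"RCT benchmark estimate  : {tau_rct: .3f}")
print(f"QE (unadjusted) estimate: {tau_os: .3f}")
print(f"Estimated bias B_hat    : {tau_os - tau_rct: .3f}")
```

Because schools with higher x are over-represented in the self-selected comparison group in this setup, the unadjusted QE estimate understates the true effect; conditioning on x by matching or covariate adjustment would remove most of that gap.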

The WSC Data Source

The experimental data for this WSC come from a cluster RCT designed to examine how Indiana's benchmark assessment system affected student achievement in mathematics and English Language Arts (ELA), as measured by the annual Indiana Statewide Testing for Educational Progress-Plus (ISTEP+). The treatment involved teachers receiving regular feedback about student performance that could be disaggregated in a variety of ways to show how well the whole class and individual students were performing on highly specific learning tasks. The original RCT took place over two cohorts, with schools within each cohort randomized into treatment and control conditions. However, this WSC uses data only from the second round of random assignment. Sixty-three K-8 schools volunteered to implement the system in the 2010-11 school year. Of these, 32 were randomly assigned to the state's benchmark assessment system while 31 served as controls.

The quasi-experimental arm of the WSC was constructed from information about the experimental treatment group and ISTEP+ data from nearly all other schools in the state of Indiana that served 4th through 8th graders. We excluded from the comparison pool 441 schools that were implementing something closely resembling the state's benchmark assessment system, and the 58 schools that had participated in the first pilot round of the experimental study. In addition, while the cluster RCT gathered data from kindergarten through 8th grade, our focus is limited to the 4th through 8th grades, because ISTEP+ is administered statewide in grades 3-8 only and 3rd graders did not have pre-treatment ISTEP+ scores. The final pool of potential comparison cases comprised 958 schools throughout the state of Indiana that did not implement the benchmark assessment system during the 2010-11 school year. The outcomes for the WSC were ELA and math scores standardized by grade and year. The study takes advantage of both student- and school-level data.

How Well Do These Data Meet the Requirements for a Strong WSC?

Cook et al. (2008) proposed requirements for a maximally interpretable WSC. The study should include: (1) variation in how the comparison group is formed, at random in the RCT but systematically in the QE; (2) a well-implemented RCT that warrants its status as the causal benchmark; (3) initial blinding of the RCT and matched QE results; (4) no third-variable confounds that compromise conclusions about the comparability of RCT and QE results; (5) the same causal quantity assessed in the RCT and the QE; and (6) a clear and defensible criterion for assessing the degree of correspondence between the RCT and QE causal results. The present WSC takes pains to meet these six criteria to the highest degree possible.

First, the RCT schools were randomly assigned to treatment while the QE comparison schools self-selected into their no-treatment status, with principals making the decision to participate in the RCT or not. This selection process is plausibly related to schools' selection into treatment, which is generally the selection process of interest; in fact, qualitative interviews with the RCT principals suggest that they volunteered for the RCT to gain access to the feedback intervention. Second, the RCT had minimal differential attrition from the treatment and control groups, and there was no evidence that the SUTVA condition was violated. To examine chance imbalances on baseline characteristics, we regressed school treatment assignment status on all the student- and school-level covariates in Table 1. No differences between the treatment and control groups were observed in our joint statistical tests of significance. However, the direction of several non-significant coefficients suggests that treatment schools may have underperformed slightly prior to random assignment.
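As one illustration of the kind of chance-imbalance check just described, the sketch below regresses treatment assignment status on baseline covariates and reports a joint test of significance. The data frame and column names are hypothetical placeholders rather than the study's actual variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical school-level file: one row per school in the RCT sample.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "trt": rng.integers(0, 2, size=63),        # randomized assignment (1 = treated)
    "pretest_math": rng.normal(size=63),
    "pretest_ela": rng.normal(size=63),
    "pct_frl": rng.uniform(0, 1, size=63),     # share on free/reduced-price lunch
    "pct_sped": rng.uniform(0, 0.2, size=63),  # share in special education
})

# Regress assignment status on baseline covariates. Under successful
# randomization the covariates should have no joint explanatory power.
balance = smf.ols("trt ~ pretest_math + pretest_ela + pct_frl + pct_sped",
                  data=df).fit()
print(balance.summary())
print("Joint F-test p-value:", balance.f_pvalue)
```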

To address possible imbalances in pre-intervention covariates and to improve statistical precision (Rubin, 2008), the RCT outcome estimates are modeled with the school- and student-level covariates in Table 1. The model is:

$y_{ij} = \beta_0 + \beta_1\, trt_j + \mathbf{x}_j'\boldsymbol{\beta}_2 + \mathbf{x}_i'\boldsymbol{\beta}_3 + c_j + u_{ij}$,   [1]

where $y_{ij}$ is the ELA or mathematics ISTEP+ score for student $i$ in school $j$, $trt_j$ is an indicator of whether school $j$ was randomly assigned to implement the benchmark assessment system, $\mathbf{x}_j$ is the vector of school-level control covariates with coefficient vector $\boldsymbol{\beta}_2$, and $\mathbf{x}_i$ is the vector of student-level control covariates with coefficient vector $\boldsymbol{\beta}_3$. The terms $c_j$ and $u_{ij}$ are school- and student-level random error terms, assumed to be normal and independently and identically distributed (iid). In this model, $\beta_1$ serves as the benchmark estimate of the effect of the Indiana assessment system. Although a few schools dropped out of the study and did not implement the intervention, we were able to obtain administrative data on outcomes as well as all relevant school and student characteristics for the original RCT sample. As such, we estimated Intent to Treat (ITT) effects for the experimental benchmark. Table 2 presents the results. They indicate that the new student assessment system did not significantly increase achievement in either ELA or math and that the coefficients hover around zero.

[INSERT TABLES 1 AND 2 HERE]

Third, the quasi-experimental and RCT units were measured on the same outcome at the same time and in similar testing conditions, reducing the threat of irrelevant confounds. Fourth, although the treatment effects for the first wave of the RCT were published in prior reports (Konstantopoulos, Miller, & van der Ploeg, 2013), the analysts were unaware of the results for the modified WSC sample, with its different (but overlapping) grades and schools, and they remained blinded to the WSC experimental results until after the analysis of the QE data. Even so, this blinding is only partial: analysts were blinded to results while conducting focal and local matching, but not for the hybrid approach. Fifth, at posttest, we examined how much the focal, local, and hybrid matching strategies improved on the unadjusted QE difference and whether the hybrid strategy did better than the focal and local approaches. In all these comparisons, ITT effects for the schools the state intended to treat are examined, to ensure that the same causal quantity was always compared and to avoid confounds between the design alternatives under investigation and the causal quantity being estimated. Sixth, we applied three approaches for examining correspondence between the QE results and the benchmark, which we describe in more detail below.
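Before turning to the matching approaches, here is a minimal sketch of how a two-level random-intercept model along the lines of equation [1] can be estimated with statsmodels. The synthetic data frame and column names are hypothetical; the actual analysis used the full set of school- and student-level covariates in Table 1.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level analysis file: students nested in schools, a
# school-level treatment indicator, and a few illustrative covariates.
rng = np.random.default_rng(2)
n_schools, n_students = 60, 40
school_id = np.repeat(np.arange(n_schools), n_students)
trt = np.repeat(rng.integers(0, 2, size=n_schools), n_students)
pct_frl = np.repeat(rng.uniform(0, 1, size=n_schools), n_students)
pretest = rng.normal(size=n_schools * n_students)
school_effect = np.repeat(rng.normal(scale=0.2, size=n_schools), n_students)
y = (0.05 * trt + 0.4 * pretest - 0.3 * pct_frl + school_effect
     + rng.normal(size=n_schools * n_students))
df = pd.DataFrame({"ela_score_std": y, "trt": trt, "pct_frl": pct_frl,
                   "pretest": pretest, "school_id": school_id})

# Two-level random-intercept model in the spirit of equation [1]: fixed effects
# for treatment and covariates, plus a random intercept c_j for each school.
model = smf.mixedlm("ela_score_std ~ trt + pct_frl + pretest",
                    data=df, groups=df["school_id"])
result = model.fit()
print(result.summary())
print("ITT estimate (beta_1):", result.params["trt"])
```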

Implementing Focal, Local and Hybrid Estimates

Figure 1 provides an overview of our three approaches for constructing different QE comparison groups. It shows that the treatment group remains the same whether we are evaluating the local, focal, or hybrid approach.

Focal Matching. To implement focal matching on observable covariates, we employed propensity score matching, formulating the propensity score model as $\text{logit}\{\Pr(D_j = 1)\} = \mu + \mathbf{X}_j'\boldsymbol{\beta}$, where $D_j$ indicates the treatment status of school $j$ ($D_j = 1$ for schools selected to implement the treatment, $D_j = 0$ for non-implementing schools), a school's logit is a linear function of a vector of school characteristics $\mathbf{X}_j$, and $\boldsymbol{\beta}$ is the corresponding coefficient vector. Each school-level covariate listed in Table 1 was available for inclusion in the estimation of the propensity score, and the final propensity score model was selected to maximize balance across all of these covariates. Most of these school covariates are demographic in nature, including measures of urbanicity, the percent of students on free or reduced-price lunch, and the percent of students in special education. But we also had six years of school pretest data on the outcome. Taken together, these two kinds of covariates represent the information QE researchers are likely to have access to as state longitudinal data systems in education become larger.

Once our propensity score models were finalized, each treatment school was matched to four comparison schools with propensity scores within a caliper of .25 standard deviations of the propensity score logit (Rosenbaum & Rubin, 1985). Ideally, one would implement a form of optimal matching in which the number of matches for each treatment school varies with the number of suitable matches available; however, such an approach was not possible here because it would not have been replicable across each of the matching approaches described below. Any choice of the number of matches is admittedly somewhat arbitrary. We selected four matches because results from another WSC in Indiana suggested that four school matches outperformed one-to-one school matching in practice (Hallberg, Cook, & Figlio, 2013). Matching was implemented with replacement, so that a given comparison school may serve as a match to multiple treatment schools. Using this approach, we were able to achieve sufficient balance between the treatment and comparison cases on observable pretreatment characteristics, as we demonstrate below. Finally, we estimated treatment effects using the outcome model in equation [1], with the same student- and school-level covariates as in the calculation of the RCT benchmark, to assure that the QE and RCT were estimating the same causal effect.
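The focal matching step could be implemented roughly as follows: estimate a school-level propensity score by logistic regression, then match each treatment school, with replacement, to up to four comparison schools whose propensity score logits fall within a caliper of .25 standard deviations. This is a simplified nearest-neighbor sketch with hypothetical column names, not the authors' actual code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical school-level file: a treatment indicator plus baseline covariates.
rng = np.random.default_rng(3)
n = 1000
schools = pd.DataFrame({
    "pct_frl": rng.uniform(0, 1, size=n),
    "pretest": rng.normal(size=n),
})
p_treat = 1 / (1 + np.exp(-(0.8 * schools["pretest"] - 2.5)))
schools["treat"] = (rng.random(n) < p_treat).astype(int)

# 1. Estimate the propensity score logit from observed school covariates.
ps_model = smf.logit("treat ~ pct_frl + pretest", data=schools).fit(disp=0)
schools["logit"] = ps_model.fittedvalues        # linear predictor (logit scale)
caliper = 0.25 * schools["logit"].std()

# 2. For each treatment school, keep up to four nearest comparison schools on
#    the logit that fall inside the caliper; matching is with replacement.
comparisons = schools[schools["treat"] == 0]
matches = {}
for t_idx, t_row in schools[schools["treat"] == 1].iterrows():
    dist = (comparisons["logit"] - t_row["logit"]).abs()
    matches[t_idx] = list(dist[dist <= caliper].nsmallest(4).index)

n_matched = sum(len(v) > 0 for v in matches.values())
print(f"{n_matched} of {len(matches)} treatment schools received focal matches")
```

Matching with replacement means the same comparison school can serve in several matched sets, which is why the subsequent weighting step counts students within each matched group.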

To calculate the weights needed to produce the corresponding ITT effect for the schools the state intended to treat, we first determined the observed number of students $O_{zq}$ in each experimental treatment condition $z \in \{0, 1\}$ within each matched group of schools $q \in \{1, \dots, 32\}$. Based on the observed frequencies, weights for comparison students were then computed as $w_{0q} = O_{1q} / O_{0q}$. All experimental treatment students were given a weight of one; comparison students who were overrepresented were down-weighted while comparison students who were underrepresented were up-weighted. Model standard errors that took account of the nested data structure, from a two-level random effects model, were used to test whether the QE treatment effects were significantly different from zero.1

1 We should note that using model standard errors ignores the fact that the propensity score itself is estimated from the data (Abadie & Imbens, 2006). Ho, Imai, King, & Stuart (2007) suggest that model standard errors are sufficient in practice.

Local Matching. To implement local matching, for each treatment school we selected all other non-treatment schools within the same district, regardless of how similar they were to the treatment school on observable characteristics. While in practice researchers may select within-district matches that maximize similarity on observable characteristics, we did not do so here in order to isolate the bias reduction associated with geographic matching alone. To estimate treatment effects, we used the outcome model in equation [1], but without the school- or student-level covariates. Treatment-on-the-treated weights were again calculated, up- or down-weighting district comparison cases based on the number of treated students in that district to ensure that the RCT and QE estimated the same causal quantity.

Hybrid Matching. Hybrid matching attempts to balance the tradeoffs between local and focal matching. To implement this approach, we estimated a propensity score for each available school in the state using the best available set of school-level covariates (as described above). We then matched each experimental treatment school to a maximum of four comparison schools within the same district that had propensity scores within what we call a “local” caliper. If four within-district comparison schools meeting this criterion were not available, we sought the closest matches on the propensity score within a “focal” caliper of .25 SDs from any school in the state, until each treatment school had four matched comparisons. Treatment effects were estimated using the outcome model in equation [1] with school- and student-level covariates included, and with the appropriate weights applied.

A key issue with the hybrid approach is the selection of calipers that correctly balances local versus focal matching of schools. By changing the “local” caliper size, one implicitly changes the preference given to local, within-district matches relative to matches that are more closely balanced on observable characteristics.


In the former case, the researcher is concerned about selection on unobserved covariates that may be addressed through local matching; in the latter case, the researcher is concerned about selection on observed covariates. Thus, selecting a local caliper based on the balance of observed covariates privileges focal matching. The methodological literature does not provide guidance on the selection of the local caliper, so we addressed this concern in two ways. First, we applied the hybrid approach using the pretest as the outcome for the treatment and quasi-experimental comparison groups, and selected the local caliper width that produced the smallest mean squared error in the pretest between the experimental treatment and comparison groups. The intuition is that, prior to the intervention, there should be no difference in average pretest scores between the experimental treatment and comparison groups. This is admittedly an imperfect approach, and it likely privileges observable characteristics over unobservable ones, and thus focal over local matches. However, we hypothesize that unobservable characteristics that affect the outcome of interest also affect that outcome measured in the pretreatment year, so the criterion may provide some guidance for selecting the local caliper. Using this method, we selected a width of 1.5 SDs of the propensity score. It should be noted that the optimal local and focal caliper values could vary from application to application; as such, we suggest that researchers use an analogous caliper selection process when implementing the method. Second, we checked the sensitivity of our hybrid estimates using different district caliper widths ranging from 0.1 to 4 standard deviations. In this application, we found that results are not sensitive to caliper selection. However, it is unclear how generalizable this finding is, especially given how well both the local and focal approaches perform in this case.

Naïve Comparison. We contrasted the ELA and math results from the RCT treatment group against the corresponding data from all other no-treatment schools in Indiana, accounting for the nesting of students via a hierarchical modeling approach parallel to that described above for the other approaches. No covariates or weights were included, making this a test of the unadjusted group difference in posttest means. While a strong quasi-experimental study would never implement such a strategy, the difference between the naïve and RCT effect estimates serves as one indicator of the extent of the initial selection bias that our matching approaches seek to ameliorate.
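The hybrid selection rule, together with the pretest-based search for a "local" caliper described above, might look roughly like the following sketch. The data frame, column names, and candidate caliper grid are hypothetical, and the pretest criterion here is a simplified stand-in for the paper's mean-squared-error calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical school-level file with an estimated propensity-score logit,
# a district identifier, a pretest, and a treatment indicator.
rng = np.random.default_rng(4)
n = 400
schools = pd.DataFrame({
    "district": rng.integers(0, 60, size=n),
    "logit": rng.normal(size=n),
    "pretest": rng.normal(size=n),
})
schools["treat"] = (rng.random(n) < 0.08).astype(int)

def hybrid_matches(df, local_caliper, focal_caliper, n_matches=4):
    """Prefer within-district comparisons inside the local caliper; fill any
    remaining slots statewide inside the focal caliper (logit distance)."""
    comps = df[df["treat"] == 0]
    out = {}
    for t_idx, t in df[df["treat"] == 1].iterrows():
        dist = (comps["logit"] - t["logit"]).abs()
        local = dist[(comps["district"] == t["district"]) & (dist <= local_caliper)]
        chosen = list(local.nsmallest(n_matches).index)
        if len(chosen) < n_matches:
            focal = dist[(dist <= focal_caliper) & ~dist.index.isin(chosen)]
            chosen += list(focal.nsmallest(n_matches - len(chosen)).index)
        out[t_idx] = chosen
    return out

def pretest_gap(df, matches):
    """Squared treated/comparison gap in mean pretest: a crude stand-in for the
    paper's pretest mean-squared-error criterion for the local caliper."""
    comp_idx = [c for group in matches.values() for c in group]
    return (df.loc[list(matches), "pretest"].mean()
            - df.loc[comp_idx, "pretest"].mean()) ** 2

logit_sd = schools["logit"].std()
focal_caliper = 0.25 * logit_sd
candidates = [0.1, 0.5, 1.0, 1.5, 2.0, 4.0]     # local calipers, in SDs of the logit
best = min(candidates, key=lambda w: pretest_gap(
    schools, hybrid_matches(schools, w * logit_sd, focal_caliper)))
print("Selected local caliper (SDs of the logit):", best)
```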

Assessing the Performance of the Three Matching Approaches

We assess the correspondence of each of the three matching strategies with the RCT benchmark in two ways. First, we examine the standardized mean difference between the RCT and QE results, where the effect size difference is $(\hat{\tau}_{nx} - \hat{\tau}_{re})/s$ and $s$ is the sample standard deviation of the outcome in the RCT control group. In the absence of a policy-justified criterion of “how close is close enough?” (Wilde & Hollister, 2007), we used .15 standard deviations as the threshold for considering the RCT and QE difference acceptably close. This level was chosen because most field experiments in education are powered to detect “meaningful” effects of between .15 and .25 SDs, and smaller differences are considered tolerable. Second, we calculated bootstrapped standard errors of the difference between the RCT and QE estimates to examine whether the difference in results was statistically different from zero. Bootstrapped standard errors are used to account for the correlation stemming from the shared treatment group. Neither of these tests is perfect, of course, and exact point correspondence is not to be expected since both the RCT and QE estimates include sampling error. Nonetheless, the tests contrast each matching strategy with the RCT results.

To address the second study goal, we evaluate whether the hybrid strategy outperforms local or focal matching alone, assuming that neither focal nor local matching achieves perfect bias reduction by itself. To do this, we first describe whether the hybrid matching estimates are closer to the RCT than the focal and local estimates. Then, to examine the robustness of this pattern, we examined the proportion of times each matching approach most closely approximated the experimental benchmark across 1,000 bootstrap replicates. The relative performance of observational approach $j$ was calculated as

$\frac{1}{B}\sum_{b=1}^{B} \mathbf{1}\big[\,|\hat{\tau}_{nx_j,b} - \hat{\tau}_{rct,b}| = \min\big(|\hat{\tau}_{nx_1,b} - \hat{\tau}_{rct,b}|, \dots, |\hat{\tau}_{nx_J,b} - \hat{\tau}_{rct,b}|\big)\big]$,

where $b = 1, \dots, B$ indexes the bootstrap replicates, $j$ indexes the observational comparison that was applied (local, focal, or hybrid matching), $\hat{\tau}_{nx_j,b}$ is the estimate for observational approach $j$, and $\hat{\tau}_{rct,b}$ is the experimental estimate for bootstrap replicate $b$. The indicator function equals 1 if the absolute bias of approach $j$ is the smallest of all the matching methods examined in that bootstrap replicate, and 0 otherwise. For each observational approach, we then calculated the proportion of times that it was the least biased approach across the 1,000 bootstrap replicates. We also examined whether the differences between the focal and hybrid estimates and between the local and hybrid estimates were statistically significant, using bootstrap standard errors.
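The relative-performance calculation across bootstrap replicates reduces to the bookkeeping sketched below. The bootstrap draws here are simulated placeholders; in the actual analysis each replicate would re-estimate the benchmark and every quasi-experimental model on schools resampled with replacement, and the numbers shown are not the paper's results.

```python
import numpy as np

rng = np.random.default_rng(5)
B = 1000

# Hypothetical bootstrap draws of the RCT benchmark and of each QE estimate.
tau_rct = rng.normal(0.00, 0.03, size=B)
tau_qe = {
    "naive": rng.normal(0.20, 0.03, size=B),
    "focal": rng.normal(0.02, 0.03, size=B),
    "local": rng.normal(-0.03, 0.03, size=B),
    "hybrid": rng.normal(0.01, 0.03, size=B),
}

# Flag, in each replicate, the approach with the smallest absolute bias,
# then report the share of replicates in which each approach comes closest.
names = list(tau_qe)
abs_bias = np.column_stack([np.abs(tau_qe[k] - tau_rct) for k in names])
winners = abs_bias.argmin(axis=1)
for j, name in enumerate(names):
    share = np.mean(winners == j)
    print(f"{name:>6}: closest to the RCT in {share:.1%} of replicates")
```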


Results

The Naïve Comparison

Table 3 shows that the unmatched treatment schools reliably differ from the comparison schools on 12 of the 34 pre-treatment covariates. They also differed on the unadjusted causal estimates at posttest. And since these unadjusted causal estimates differ from their corresponding RCT estimates by .092 standard deviations (SDs) for ELA and by .150 SDs in math, the evidence of bias is overwhelming. It would clearly be premature for policy-makers to conclude from the unadjusted posttest difference in the QE that the Indiana program had improved student performance, because at least 12 initial group differences had not been accounted for.

Focal and Local Matching Alone

Both focal and local matching reduced the number of statistically significant pretreatment covariate differences to zero, compared to 12 of 34 in the naïve comparison. Focal matching resulted in better balance than local matching on 19 of the 24 school-level covariates where the means were not identical, whereas local matching resulted in better balance than focal matching for eight of the nine student-level covariates. This is probably because focal matching was done only at the school level and was based on the same covariates shown in Table 3. That focal matching achieves closely matched groups on observables is hardly surprising. Local matching, however, will only improve treatment/comparison group overlap to the extent that schools are more similar to each other within districts than across districts.

[INSERT TABLE 3 HERE]

Table 4 compares treatment effect estimates from the RCT and the quasi-experimental approaches. With focal matching, a treatment effect of .022 SDs resulted for ELA. This indicates considerable bias reduction over the naïve comparison of .217 and a total difference from the RCT of -.003 SDs. For math, focal matching produced a small negative treatment effect of -.005 SDs that was not reliably different from zero and indicated much less bias than the unadjusted estimate of .178. The difference between the QE and RCT effect sizes was -.033 SDs, again not significantly different from zero. Thus, focal matching removed more than three quarters of the initial bias, producing estimates within -.003 and -.033 SDs of the RCT benchmark, both under the .15 SD standard.

With local matching, the QE treatment effects were -.042 standard deviations for ELA and .007 for math. Neither was statistically different from zero. These values indicate bias reduction of 70% or more, given unadjusted biases of .192 and .150 respectively. The local comparisons produced QE effects that were larger than the RCT benchmark by .017 SDs for ELA and smaller than the RCT benchmark by .007 SDs in math.

So local matching also substantially reduced bias and, when statistical uncertainty is accounted for, did not significantly differ from the RCT estimate or differ from it by more than .15 SDs.

[INSERT TABLE 4 HERE]

Hybrid Matching

Given the considerable bias reduction achieved by focal and local matching alone, did hybrid matching afford even more improvement? Table 5 details the bias reduction achieved as a function of the outcome variable and the caliper. The table shows that treatment effect estimates were not sensitive to different choices of caliper width. For math, the hybrid strategy reduces the bias to .010, as opposed to the -.003 and .017 obtained with the focal and local strategies. For ELA, the hybrid strategy reduces the bias to .007, as opposed to the -.033 and -.035 values for focal and local matching respectively. Figure 2 presents these results visually using the pre-specified optimal caliper. Across the two outcomes, the hybrid approach appears to approximate the RCT benchmark better than either focal or local matching alone.

[INSERT TABLES 5 AND 6 AND FIGURE 2 HERE]

But were the differences in treatment effect estimates produced by the hybrid approach reliably different from those produced by local and focal matching? Figure 3 begins to address this question by assessing the relative performance of the focal, local, and hybrid matching models across 1,000 bootstrapped replicates. For ELA, the hybrid approach performs best (in terms of the smallest absolute bias) in over 80 percent of the bootstrapped replicates, well above the 25 percent we would expect from chance alone. In comparison, local and focal matching each perform best in fewer than 10 percent of the bootstrap replicates, and the naïve comparison never performs best. For math, the hybrid approach outperforms the other approaches approximately 75 percent of the time, local matching performs best about 12 percent of the time, and focal matching around 11 percent of the time. Thus, across both the math and ELA outcomes, the hybrid approach performed better than the other matching alternatives.

[INSERT FIGURE 3 HERE]

We also conducted direct statistical tests of the difference in treatment effect estimates between the local and hybrid approaches, and between the focal and hybrid approaches. Because of the non-independence of units between the experimental and QE conditions, we used bootstrapped standard errors to assess whether the differences in treatment effect estimates between these methods were statistically significant.

Overall, we found that none of the differences were statistically significant.

Discussion

This study provides guidance for selecting comparison units when school-level interventions are evaluated and an RCT cannot be done. It takes advantage of data from both an RCT benchmark study and a state longitudinal data system that permitted the conduct of a credible quasi-experimental design and analysis. Two main results emerged. First, the focal and local matching strategies each reduce bias by at least 75 percent relative to a naïve comparison of posttest treatment/comparison differences. Indeed, the bias reduction achieved was always to a level less than .15 SDs from the RCT estimate, a standard sometimes invoked in educational research to indicate an acceptably small difference. Moreover, no reliable differences at the .05 level emerged when the RCT benchmarks for ELA and math were contrasted with the corresponding focal and local matching estimates. By themselves, both focal and local matching reduced selection bias to a level close to that of the RCT.

However, the second finding is that the hybrid approach resulted in even closer approximations to the RCT benchmark for both the ELA and math outcomes. Moreover, bootstrapped analyses showed that the hybrid matching strategy would have done better than the alternative matching strategies in both ELA and mathematics: across 1,000 bootstrapped replicates, the hybrid approach performed best a majority of the time for both outcomes. These results provide empirical support, at the school level, for Stuart and Rubin's (2008) contention that applied researchers often face a tradeoff between local and focal matching and that this tradeoff can be reconciled by using a hybrid approach that prefers good local matches over non-local ones but recognizes the utility of good non-local matches when a good local match is not possible. In educational research, high-quality local matches are not likely to be common, since many school districts have few schools and the largest districts tend to be internally heterogeneous. As a practical matter, it is therefore very difficult to find good local comparison schools, further justifying use of the hybrid matching strategy. The approach has yet another advantage. In both the focal and local strategies, it is possible that no matches can be made for some treatment cases, in which case the resulting study sample may not represent the original target population. The hybrid strategy reduces the risk of this external validity loss because two criteria are used for matching: a local one and a focal one.

The hybrid strategy has some limitations around (1) choosing the best caliper for local matching and (2) selecting appropriate covariates for focal matching. We recommend data-driven procedures for selecting the optimal local caliper, such as exploring how different caliper widths affect pretest rather than posttest scores. For covariate selection, we recommend theoretical consideration and measurement of the most likely causes of the selection process, followed by the use of balance and overlap tests to assess the comparability of the treatment and comparison groups. Hybrid matching does not absolve researchers from explicating and measuring how schools within a district that are exposed to treatment might differ from those not so exposed, or how treatment schools in one district differ from possible comparison schools in other districts.

Careful consideration of the tradeoff between focal and local matches is also needed. This tradeoff will influence the analyst's decision about whether to adopt a local, focal, or hybrid matching strategy, as well as the caliper size if the hybrid approach is adopted. At its core, the tradeoff is about whether the analyst believes effect estimates are more likely to be biased by imbalance on observed covariates or on unobserved covariates that are shared at the local level. This is likely to vary across policy contexts. Observed covariates are likely to remove much of the bias in contexts where the theory of selection is quite clear and where rich pretreatment covariates are available. In contrast, researchers may have to rely more heavily on local matches when there is considerable doubt about the ability to control for unobserved characteristics. The importance of local labor markets is well documented in the workforce development literature, for example, and local context is likely to exert similar influence in other domains. Short of empirical replications of the kind presented here across fields, applied researchers must turn to theory to assess the importance of potential sources of bias in each particular application when making these methodological choices.

Inevitably, this study has limitations. As regards external validity, the data are drawn from one application (providing professionals with feedback about client performance) in one policy domain (education), in one state (Indiana), and a single outcome domain (achievement), albeit with two outcomes (performance in math and ELA). Future WSCs on the hybrid matching strategy need to vary these sources of homogeneity.

Finally, our original intention was to provide a replication of the present WSC using the Tennessee class size experiment (Krueger, 1999). In an earlier conference presentation, we reported findings indicating a replication of the results presented here (Wong, Hallberg, & Cook, 2013).

However, we eventually concluded that the Tennessee data were not well suited to replicating the analyses conducted here, for several reasons. First and most importantly, the original Tennessee study was a student-level intervention (students were randomly assigned to small or large classes). To suit the purposes of this paper, we had to delete some of the cases to create a synthetic dataset in which class size was treated as a school-level intervention. We ultimately concluded that this approach was too artificial to serve as a strong test of the matching approaches employed here. Second, the school-level covariates in the Tennessee dataset were limited to a few demographic characteristics of the schools. As such, that dataset did not provide a strong test of focal matching as it is currently implemented, generally with multiple years of pretreatment measures of the outcome available. Finally, the synthetic approach to creating a usable school-level dataset resulted in a small sample of study schools that could be used in the analysis. As a result, the estimates originally presented proved unstable.

This paper suggests promising approaches for finding comparison units when the intervention is at the school level. However, strong advocacy of the hybrid approach over its purely local or focal alternatives depends on replicating the current results in different areas within and beyond education contexts and settings.


References

Bell, S.H., Orr, L.L., Blomquist, J.D., & Cain, G.G. (1995). Program applicants as a comparison group in evaluating training programs. Kalamazoo, MI: W.E. Upjohn Institute for Employment Research.
Bifulco, R. (2012). Can nonexperimental estimates replicate estimates based on random assignment in evaluations of school choice? A within-study comparison. Journal of Policy Analysis and Management, 31, 3, 729-751.
Bloom, H., Michalopoulos, C., & Hill, C. (2005). Using experiments to assess nonexperimental comparison-group methods for measuring program effects. In H. Bloom (Ed.), Learning more from social experiments. New York: Russell Sage Foundation.
Cook, T. D., Shadish, W. J., & Wong, V. C. (2008). Three conditions under which observational studies produce the same results as experiments. Journal of Policy Analysis and Management, 27, 4, 724-750.
Diaz, J.J. & Handa, S. (2006). An assessment of propensity score matching as a nonexperimental impact estimator: Evidence from Mexico's PROGRESA program. The Journal of Human Resources, XLI, 2, 319-345.
Fraker, T. & Maynard, R. (1987). The adequacy of comparison group designs for evaluations of employment-related programs. Journal of Human Resources, 22, 2, 194-227.
Friedlander, D. & Robins, P.K. (1995). Evaluating program evaluations: New evidence on commonly used nonexperimental methods. American Economic Review, 85, 4, 923-937.
Gerber, A.S. & Green, D.P. (2012). Field experiments: Design, analysis, and interpretation. New York: W.W. Norton and Company.
Glazerman, S., Levy, D., & Myers, D. (2003). Nonexperimental versus experimental estimates of earnings impacts. The Annals of the American Academy, 589, 63-91.
Gleason, P.M., Resch, A.M., & Berk, J.A. (2012). Replicating experimental impact estimates using a regression discontinuity approach. Jessup, MD: National Center for Education Evaluation and Regional Assistance.
Green, D.P., Leong, T.Y., Kern, H.L., Gerber, A.S., & Larimer, C.W. (2009). Testing the accuracy of regression discontinuity analysis using an experimental benchmark. Political Analysis, 17, 4, 400-417.
Hallberg, K. & Cook, T.D. (In progress). The role of pretests in education observational studies: Evidence from empirical within study comparisons.

School Comparisons in Observational Designs Hallberg, K., Cook, T.D., & Figlio, D. (2013, September). Empirically Examining the Performance of Approaches to Multi-Level Matching to Study the Effect of School-Level Interventions. Presentation given at the Society for Research on Educational Effectiveness Annual Meeting, Washington, DC. Heckman, J.J., Ichimura, H., Smith, J.A., & Todd, P.E. (1998). Characterizing selection bias using experimental data. Econometrica, 66, 1017-1098. Holland, P.W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81, 396, 945-960. Jung, H. & Pirog, M.A. (2014). What works best and when: Accounting for multiple sources of pure selection bias in program evaluation. Journal of Policy Analysis and Management, 33(3), pp. 1-23. Kang, J., & Schafer, J. L. (2007). Demystifying double robustness: a comparison of alternative strategies for estimating population means from incomplete data. Statistical Science, 26, 523539. Krueger, A.B. (1999). Experimental estimates of education production functions. The Quarterly Journal of Economics, 114, 2, 497-532. Konstantopoulos, S., Miller, S., van der Ploeg, A. (2013, in press). The Impact of Indiana’s System of Interim Assessments on Mathematics and Reading Achievement. Education, Evaluation and Policy Analysis. LaLonde, R. (1986). Evaluating the econometric evalautions of training programs with experiemental data. Annual Economic Review , 76, 604-20. Nye, B., Hedges, L.V., & Kostantopolous, S. (2002). Do low achieving students benefit more from the small class sizes? Evidence from the Tennesse class size experiment. Educational Evaluation & Policy Analysis, 26, 237-257. Pohl, S., Steiner, P. M., Eisermann, J., Soellner, R., & Cook, T. D. (2009). Unbiased causal inference from an observational study: Results of a within-study comparison. Educational Evaluation and Policy Analysis, 31(4), 463–479. Rosenbaum, P.R. & Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 1, 41-55. Rosenbaum, P.R. & Rubin, D.B. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39:33–38. Rubin, D.B. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127, 757-763. 20 EdPolicyWorks Working Paper Series No. 47. April 2016. Available at http://curry.virginia.edu/edpolicyworks/wp Curry School of Education | Frank Batten School of Leadership and Public Policy | University of Virginia

School Comparisons in Observational Designs Rubin, D.B. (2008). Comment: The design and analysis of gold standard randomized experiments. Journal of the American Statistical Association, 103, 484, 1350-1353. Shadish, W.R., Clark, M.H., & Steiner, P.M. (2008). Can Nonrandomized Experiments Yield Accurate Answers? A Randomized Experiment Comparing Random to Nonrandom Assignment. Journal of the American Statistical Association, 103, 1334-1343. Smith, J., & Todd, P. (2005). Does matching overcome LaLonde's critique of nonexperimental estimators? Journal of Econometrics , 305-353. St. Clair, T., Cook, T. D., & Hallberg, K. (2013). Examining the validity and statistical precision of the comparative interrupted time series design by comparison with a randomized experiment. Manuscript submitted for publication. Steiner, P.M., Cook, T.D., Shadish, W.R. (2011). On the importance of reliable covariate measurement in selection bias adjustments using propensity scores. Journal of Educational and Behavioral Statistics, 36, 2, 213-236. Stuart, E.A. (2007). Estimating causal effects using school-level datasets. Educational Researcher, 36, 187-198. Stuart, E.A. (2010). Matching Methods for Causal Inference: A review and a look forward. Statistical Science 25(1): 1-21. PMCID: PMC2943670. http://www.ncbi.nlm.nih.gov/pubmed/20871802. Stuart, E.A. and Rubin, D.B. (2008). Matching with multiple control groups and adjusting for group differences. Journal of Educational and Behavioral Statistics, 33(3): 279-306. Wilde, E.T. & Hollister, R. (2007). How close is close enough? Evaluating propensity score matching using data from a class size reduction experiment. Journal of Policy Analysis and Management, 26, 3, 455-477. Wing, C. & Cook, T.D. (In Press). How can comparison groups strengthen regression discontinuity designs? Journal of Policy Analysis and Management. Wong, V.C., Hallberg, K., & Cook. T.D. (2013, March). Intact group matching in education contexts. Presentation given at the Society for Research on Educational Effectiveness Annual Meeting, Washington, DC.


Tables and Figures

Table 1. Treatment/Control Differences in the Indiana Benchmark Assessment RCT

School-level covariates                     Treatment/Control Difference        SE
Average ELA score 2005a                     -0.384                              0.235
Average ELA score 2006a                     -0.272                              0.236
Average ELA score 2007a                     -0.223                              0.235
Average ELA score 2008a                     -0.123                              0.228
Average ELA score 2009a                     -0.085                              0.233
Average ELA score 2010a                      0.05                               0.242
Average math score 2005a                    -0.329                              0.273
Average math score 2006a                    -0.22                               0.255
Average math score 2007a                    -0.117                              0.253
Average math score 2008a                    -0.059                              0.25
Average math score 2009a                    -0.086                              0.249
Average math score 2010a                    -0.008                              0.265
Average student attendance                   0.339                              0.264
Charter school                              -0.008                              0.006
Number of full time employees                1.953                              5.323
Number of students                          44.223                             88.461
Percent limited English proficient          -1.70%                              2.00%
Percent male                                 2.20%                              1.90%
Percent of free or reduced price lunch      -3.30%                              6.00%
Percent special education                    1.30%                              1.40%
Percent white                                1.20%                              9.40%
School-wide Title I eligibility             -6.80%                             13.70%
Suburban                                   -11.20%                             10.20%
Title I eligibility                         -2.90%                              8.60%
Urban                                        9.90%                             12.70%

Student-level covariates                    Treatment/Control Difference        SE
Grade                                        0.115                              0.208
Limited English proficient                  -3.10%                              2.30%
Special education status                     2.10%                              1.50%
ELA ISAT fall 2008a                         -0.119                              0.093
Math ISAT fall 2008a                        -0.104                              0.095
ELA ISAT spring 2009a                       -0.100                              0.102
Math ISAT spring 2009a                      -0.138                              0.11
ELA ISAT spring 2010a                       -0.044                              0.104
Math ISAT spring 2010a                      -0.09                               0.118

N (students)                                17,474
N (schools)                                 63
Fb                                          0.79

Notes: Results of regressing each covariate on treatment; figures reported are coefficients and their standard errors. Regressions at the student level cluster standard errors at the school level.
a Test scores are standardized using grade-by-year specific means and student-level standard deviations for the entire state.
b F-statistic for the joint hypothesis that all model coefficients equal zero.
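To make the construction of Table 1 concrete, the sketch below shows one way such balance checks can be computed: each covariate is regressed on the treatment indicator, student-level regressions cluster standard errors at the school level, and a separate regression of treatment on all covariates supplies the joint F-statistic. This is only an illustration under assumed column names (treatment, school_id), not the authors' code.

```python
# Illustrative balance checks in the spirit of Table 1 (assumed column names).
from typing import Optional

import pandas as pd
import statsmodels.api as sm


def covariate_balance(df: pd.DataFrame, covariates: list,
                      cluster: Optional[str] = None) -> pd.DataFrame:
    """Regress each covariate on treatment; return the coefficient and its SE."""
    rows = []
    for cov in covariates:
        cols = [cov, "treatment"] + ([cluster] if cluster else [])
        d = df[cols].dropna()
        X = sm.add_constant(d["treatment"])
        model = sm.OLS(d[cov], X)
        fit = (model.fit(cov_type="cluster", cov_kwds={"groups": d[cluster]})
               if cluster else model.fit())
        rows.append({"covariate": cov,
                     "difference": fit.params["treatment"],
                     "se": fit.bse["treatment"]})
    return pd.DataFrame(rows)


def joint_f_statistic(df: pd.DataFrame, covariates: list) -> float:
    """F-statistic for the joint hypothesis that no covariate predicts treatment."""
    d = df[["treatment"] + covariates].dropna()
    X = sm.add_constant(d[covariates])
    return float(sm.OLS(d["treatment"], X).fit().fvalue)
```

For the school-level rows one would leave cluster unset; for the student-level rows one would pass cluster="school_id".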


Table 2. RCT results with and without covariates

                                            ELA                                      Mathematics
                                            Without covariates   With covariates     Without covariates   With covariates
Treatment                                   0.042 (0.099)        0.025 (0.022)       -0.001 (0.121)       0.011 (0.029)
Avg. school ELA score 2010a                 --                   0.035 (0.054)       --                   -0.107 (0.069)
Avg. school math score 2010a                --                   -0.002 (0.055)      --                   0.129 (0.071)
Percent white                               --                   -0.117* (0.059)     --                   -0.156* (0.077)
Percent special education                   --                   0.205 (0.218)       --                   -0.354 (0.283)
Percent of free or reduced price lunch      --                   0.120 (0.132)       --                   -0.057 (0.173)
Percent limited English proficient          --                   -0.279 (0.163)      --                   -0.512* (0.213)
Avg. school ELA score 2009a                 --                   0.106 (0.056)       --                   0.081 (0.072)
Avg. school math score 2009a                --                   -0.085 (0.057)      --                   0.028 (0.074)
Limited English proficiency status          --                   0.116* (0.017)      --                   0.003 (0.017)
Special education status                    --                   0.168* (0.012)      --                   0.061* (0.012)
ELA ISAT score spring 2010a                 --                   0.622* (0.006)      --                   0.199* (0.006)
Math ISAT score spring 2010a                --                   0.185* (0.006)      --                   0.638* (0.006)
N students                                  21,246               21,246              21,246               21,246
N schools                                   63                   63                  63                   63

Notes: The dependent variable is the ELA or mathematics ISAT score in spring 2011, standardized using grade-by-year specific student-level standard deviations for the state. Coefficients are reported with standard errors in parentheses. * indicates statistically different at the 0.05 level.
a Test scores are standardized using grade-by-year specific means and student-level standard deviations for the entire state.
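The table notes repeatedly reference test scores standardized with grade-by-year specific means and student-level standard deviations for the entire state. A minimal sketch of that transformation, using hypothetical column names (grade, year, and the raw score column), is:

```python
# Standardize each score against the statewide grade-by-year mean and the
# statewide student-level standard deviation for that grade and year.
# Column names are illustrative assumptions.
import pandas as pd


def standardize_scores(scores: pd.DataFrame, score_col: str) -> pd.Series:
    grouped = scores.groupby(["grade", "year"])[score_col]
    return (scores[score_col] - grouped.transform("mean")) / grouped.transform("std")


# Example (hypothetical data):
# scores["ela_std"] = standardize_scores(scores, "ela_isat")
```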


Figure 1. Approaches to Identifying a Comparison Group.

[Figure: schematic mapping each analytic sample and method for selecting a comparison group onto the resulting comparison group.]
Randomized to experiment → Randomized treatment and randomized control groups
All schools in the state, no statistical controls for selection → Naïve comparison
All schools in the state, matching on pretreatment characteristics → Focal comparison
Matching within district → Local comparison
Local match selected if within caliper, otherwise focal match → Hybrid comparison
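As a concrete illustration of the hybrid rule in Figure 1 (take the local match when one lies within the caliper, otherwise fall back to a focal match), the sketch below implements one plausible version: schools are compared on a logit propensity score, the caliper is expressed in standard deviations of that score, and matching is one-to-one with replacement. The propensity model, the distance metric, and the column names (school_id, district_id, treated) are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of a hybrid local/focal matching rule (illustrative only).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def hybrid_match(schools: pd.DataFrame, covariates: list,
                 caliper_sd: float = 1.0) -> pd.DataFrame:
    """Return one matched comparison school per treated school."""
    X = schools[covariates].to_numpy()
    z = schools["treated"].to_numpy()

    # Propensity of treatment given observed school characteristics.
    pscore = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
    logit = np.log(pscore / (1 - pscore))
    schools = schools.assign(logit_pscore=logit)
    caliper = caliper_sd * logit.std()

    treated = schools[schools["treated"] == 1]
    pool = schools[schools["treated"] == 0]

    matches = []
    for _, t in treated.iterrows():
        # 1) Prefer a local match: nearest untreated school in the same district,
        #    provided it falls within the caliper.
        local = pool[pool["district_id"] == t["district_id"]]
        if len(local) > 0:
            dist = (local["logit_pscore"] - t["logit_pscore"]).abs()
            if dist.min() <= caliper:
                matches.append((t["school_id"], local.loc[dist.idxmin(), "school_id"], "local"))
                continue
        # 2) Otherwise fall back to a focal match: nearest untreated school statewide.
        dist = (pool["logit_pscore"] - t["logit_pscore"]).abs()
        matches.append((t["school_id"], pool.loc[dist.idxmin(), "school_id"], "focal"))

    return pd.DataFrame(matches, columns=["treated_school", "comparison_school", "match_type"])
```

Varying caliper_sd in this sketch corresponds loosely to the range of calipers examined in Table 5.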

Table 3. Treatment and comparison group pre-intervention descriptives

School-level covariates                     Treatment    Naïve comparison    Focal matching    Local matching (within district)
Average ELA score 2005a                     0.125        0.075               0.158             0.278
Average ELA score 2006a                     0.172        0.044               0.169             0.208
Average ELA score 2007a                     0.243        0.072               0.309             0.274
Average ELA score 2008a                     0.302        0.062               0.298             0.216
Average ELA score 2009a                     0.388        0.041*              0.347             0.235
Average ELA score 2010a                     0.464        0.041*              0.426             0.22
Average math score 2005a                    0.142        0.025               0.11              0.269
Average math score 2006a                    0.165        0.005               0.145             0.119
Average math score 2007a                    0.271        0.005               0.212             0.147
Average math score 2008a                    0.287        -0.013*             0.184             0.065
Average math score 2009a                    0.335        -0.023*             0.31              0.073
Average math score 2010a                    0.413        -0.032*             0.381             0.119
Average student attendance                  96.138       96.162              96.181            96.032
Charter school                              0.00%        3.30%               0.00%             0.00%
Number of full time employees               37.571       27.571              35.417            32.429
Number of students                          652.594      484.995*            624.328           566.182
Percent limited English proficient          4.60%        4.00%*              4.80%             5.80%
Percent male                                51.20%       51.00%              50.50%            46.90%
Percent of free or reduced price lunch      57.10%       46.00%              53.50%            54.90%
Percent special education                   16.10%       14.40%*             15.40%            13.90%
Percent white                               55.10%       80.00%              56.20%            55.00%
School-wide Title I eligibility             62.50%       46.20%*             59.40%            63.30%
Suburban                                    15.60%       22.40%              16.40%            9.40%
Title I eligibility                         87.50%       73.00%              75.80%            78.20%
Urban                                       50.00%       18.20%              49.20%            52.20%
N schools                                   32           990                 128               51

Student-level covariates                    Treatment    Naïve comparison    Focal matching    Local matching (within district)
Grade                                       5.768        5.535               5.61              5.505
Limited English proficient                  5.00%        4.20%               7.50%             6.70%
Special education status                    14.30%       12.20%              12.70%            11.20%
ELA ISAT fall 2008a                         0.04         -0.09               0.157             0.021
Math ISAT fall 2008a                        0.025        -0.095*             0.158             -0.027
ELA ISAT spring 2009a                       0.08         -0.092*             0.193             0.025
Math ISAT spring 2009a                      0.037        -0.09               0.200             0.00
ELA ISAT spring 2010a                       0.127        -0.065*             0.222             0.018
Math ISAT spring 2010a                      0.102        -0.07*              0.24              -0.011
N students                                  8,137        222,884             40,892            12,862

Notes: Figures are means. * indicates statistically different at the 0.05 level.
a Test scores are standardized using grade-by-year specific means and student-level standard deviations for the entire state.


Table 4. Experimental and quasi-experimental effect estimates in standardized score units

                     ELA                                                       Math
                     Treatment Effect (SE)   Difference from RCT Benchmark     Treatment Effect (SE)   Difference from RCT Benchmark
                                             (Bootstrapped SE)                                         (Bootstrapped SE)
RE Benchmark         0.025 (0.022)           --                                0.011 (0.029)           --
Naïve effect         0.217* (0.060)          0.192* (0.069)                    0.178* (0.066)          0.150 (0.082)
Focal matching       0.022 (0.019)           -0.003 (0.078)                    -0.005 (0.024)          -0.033 (0.090)
Local matching       0.042 (0.094)           0.017 (0.112)                     -0.007 (0.109)          -0.035 (0.135)

Notes: Figures are treatment effects from a two-level outcome model, with model standard errors in parentheses. The differences are between the RCT benchmark and the observational estimate, with bootstrapped standard errors of the difference in parentheses. * indicates statistically different at the 0.05 level.
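The "Difference from RCT Benchmark" columns in Tables 4 and 5 carry bootstrapped standard errors. The resampling scheme is not spelled out in this excerpt, so the following is only a generic school-level (cluster) bootstrap sketch; estimate_rct and estimate_qe stand in for user-supplied functions that return the experimental and quasi-experimental treatment effects for a given resampled dataset.

```python
# Generic cluster-bootstrap sketch for the SE of (QE estimate - RCT benchmark).
# Not the paper's exact procedure; column names and estimators are assumptions.
import numpy as np
import pandas as pd


def bootstrap_difference_se(df: pd.DataFrame, estimate_rct, estimate_qe,
                            n_reps: int = 1000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    schools = df["school_id"].unique()
    diffs = []
    for _ in range(n_reps):
        # Resample schools (clusters) with replacement, keeping all their students.
        sampled = rng.choice(schools, size=len(schools), replace=True)
        boot = pd.concat([df[df["school_id"] == s] for s in sampled], ignore_index=True)
        diffs.append(estimate_qe(boot) - estimate_rct(boot))
    return float(np.std(diffs, ddof=1))
```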


Table 5. Experimental and hybrid caliper effect estimates in standardized score units

                        ELA                                              Math
                        Treatment Effect (SE)   Difference               Treatment Effect (SE)   Difference
                                                (Bootstrapped SE)                                (Bootstrapped SE)
RE Benchmark            0.025 (0.022)           --                       0.011 (0.029)           --
Unadjusted              0.217* (0.060)          0.192* (0.069)           0.178* (0.066)          0.150 (0.082)
Hybrid match - 0.10     0.042 (0.073)           0.017 (0.068)            0.025 (0.082)           -0.004 (0.081)
Hybrid match - 0.50     0.077 (0.078)           0.052 (0.070)            0.053 (0.082)           0.024 (0.084)
Hybrid match - 1.00     0.002 (0.075)           -0.022 (0.073)           0.035 (0.086)           0.007 (0.087)
Hybrid match - 1.50     0.031 (0.076)           0.007 (0.073)            0.038 (0.082)           0.010 (0.090)
Hybrid match - 2.00     0.056 (0.077)           0.032 (0.071)            0.040 (0.088)           0.012 (0.088)
Hybrid match - 2.50     0.052 (0.077)           0.028 (0.073)            0.040 (0.085)           0.011 (0.090)
Hybrid match - 3.00     0.053 (0.076)           0.029 (0.073)            0.049 (0.087)           0.021 (0.088)
Hybrid match - 3.50     0.044 (0.077)           0.020 (0.073)            0.045 (0.085)           0.017 (0.090)
Hybrid match - 4.00     0.065 (0.076)           0.040 (0.073)            0.062 (0.086)           0.033 (0.090)

Notes: Figures are treatment effects from a two-level outcome model, with model standard errors in parentheses. The differences are between the RCT benchmark and the observational estimate, with bootstrapped standard errors of the difference in parentheses. * indicates statistically different at the 0.05 level.


Figure 2. Performance of naïve effect, local matching, focal matching and hybrid approach.

[Figure: treatment effects (in SD units) relative to the experimental benchmark, plotted separately for math and ELA, for the unadjusted comparison, the within-district match, the four-school match, and the hybrid approach; horizontal axis runs from -.2 to .2.]


Figure 3. Percentage of times observational approach performed best across 1000 replications for the Indiana Benchmark Assessment Study.

[Figure: bar chart showing, separately for math and ELA, the percentage of the 1,000 replications (horizontal axis from 0 to 80 percent) in which each approach (naïve effect, covariate match, within district, hybrid approach) performed best.]

