Heterogeneity in the Returns to Schooling - University of Chicago

Heterogeneity in the Returns to Schooling: Implications for Policy Evaluation

Pedro Carneiro∗
The University of Chicago

April 19, 2003

Abstract

In this paper I examine the empirical importance of accounting for heterogeneity (and selection) in the estimation of the returns to schooling and in the evaluation of education policy. I study white males and females in the National Longitudinal Survey of Youth and High School and Beyond, and white males in the Panel Study of Income Dynamics. I find that, across datasets, heterogeneity (and selection) in returns is an empirically relevant phenomenon. The return to schooling for the average student in college is systematically above the return to schooling for the average individual indifferent between going to college or not (marginal individual). It is also generally above the return for individuals induced to go to college by different tuition subsidies.

∗ Email: [email protected]. I would like to thank James Heckman for his continued support and encouragement. I also thank him for helpful discussions and detailed comments on this paper. I am grateful for comments from Alberto Abadie, Rita Almeida, Mark Duggan, Michael Greenstone, Karsten Hansen, Joao Hrotko, Tom Kane, Steve Levitt, Dimitriy Masterov, Costas Meghir, Kathleen Mullen, Salvador Navarro-Lozano, Derek Neal, Rodrigo Soares, Robert Townsend, Pedro Vicente, Edward Vytlacil and workshop participants at the University of Chicago, 2002 LACEA Meetings, Carnegie Mellon, Ohio State, Brown, World Bank, RAND, UCLA, Georgetown and University College London. Jingjing Hsee, Dimitriy Masterov, Sergio Urzua and especially Maria Isabel Larenas and Maria Victoria Rodriguez provided excellent research assistance. Financial support from Fundacao Ciencia e Tecnologia and Fundacao Calouste Gulbenkian is gratefully acknowledged.

1 Introduction

The economic return to schooling is a fundamental parameter of interest in many different areas of economics and public policy. It is one of the most frequently estimated parameters in empirical economics. Economists interested in growth are concerned about the role of schooling in productivity and growth. Economists studying inequality and poverty seek to learn how schooling increases the incomes of the poor. Therefore, the evaluation of policies that promote school attendance is a central research question. The increase in earnings due to additional schooling (what is usually called the return to schooling) is a main component of the benefits of proposed policies. The specification most often used to estimate the return to schooling is due to Mincer (1974):

$$\ln Y = \alpha + \beta S + \varepsilon \tag{1}$$

where ln Y is log earnings, S is years of schooling and β is the return to schooling1 . There are two important sources of heterogeneity to consider in this equation. The first source is ε, and it influences the potential earnings of an individual uniformly, no matter what level of schooling he chooses. In this paper selection in levels arises if ε is correlated with schooling. For example, this could happen if those who have more years of schooling also have higher earnings ability as measured by ε2 . The other source of heterogeneity is β, the individual gain in wages from additional schooling. We say there is selection in returns if β is correlated with schooling. For example, this could happen if those who choose to get more schooling are also the ones who benefit the most from schooling3 . Both types of selection are sources of econometric problems. Most of the earlier work on this topic considered a representative agent model where β is the return to schooling for the representative agent. Assume, for example, that we want to evaluate a policy that increases college enrollment by amount K. β is the return to college for an individual, so B = β ∗K is the total benefit of the policy. In the traditional approach β is either assumed to be the same for everyone (possibly conditioning on a set of individual characteristics) or assumed to be a random variable uncorrelated with S. The main econometric problem in estimating β (or the 1 β is the percentage increase in earnings due to an additional year of schooling. Heckman, Lochner and Todd (2002) show that in general this parameter does not correspond to the internal rate of return. However it can still be interpreted as an hedonic relationship that shows the price of schooling in the labor market. 2 This is the usual assumption in the “ability bias” literature. 3 This could arise, for example, in a Roy model with income maximization.


average β, in the case where β is random) is that S can be correlated with ε due to unobserved ability4 . Therefore it is assumed that there is only selection in levels but no selection in returns, and the usual way to correct for this is to use linear instrumental variables (IV). If there is heterogeneity in β in the population and individuals respond to these differences when choosing their level of schooling (if, for example, individuals decide to enroll in school based on the returns they face), two additional problems can emerge. The first is an economic problem. No single number summarizes the distribution of returns. It becomes necessary to define what is the parameter that one is interested in to answer a specific economic question. Separate policies target groups of individuals that are located in different sections of the distribution of β. To compute B we need to know where in the distribution of β are the new entrants coming from. Even though many economists focus on the average return to schooling in the population, this is only one of many parameters that can be defined, and, in general, is not the relevant answer to most policy questions5 . A more interesting question is, for example, “what is the return to the marginal entrant?”, and “is it above or below the return to the average student?”. The second problem is an econometric problem. Once we have defined the parameter of interest how can we estimate it? The usual intuitions about how instrumental variables work break down in a model where β is heterogeneous in the population and individuals respond to these differences when making their schooling decisions (see Heckman and Vytlacil, 2003, and Carneiro, Heckman and Vytlacil, 2001). In general, applications of the linear IV method (the solution to the selection problem in the common return model) will not generate estimates of policy relevant parameters. 
⁴ Measurement error will also generate a correlation between S and ε. However, measurement error is not considered in this paper.

⁵ In particular, suppose we have a policy that consists of a tuition subsidy which we want to evaluate. The relevant return in this case is the return to schooling for those induced to go to school by the subsidy, not the average return in the population.

In the recent literature there is substantial concern about heterogeneity in returns to education. Card (1999, 2001) surveys this literature with a special focus on instrumental variables estimates of β, since this has been the preferred method of estimation for most of the economists working in this area in recent years (see also Heckman, 1997, and Heckman and Vytlacil, 1998). The question he asks is the following: if β varies in the population, what parameter do we get from instrumental variables estimates of the returns to schooling? Based on the local average treatment effect (LATE) framework of Imbens and Angrist (1994), he interprets the instrumental variables estimate as a weighted average of returns for individuals induced to change their level of schooling by variation in the instrumental variable (see also Kling, 2001). For example, if the instrument being used is distance to college, IV measures a weighted average of the return for individuals who are induced to increase their schooling by reductions in commuting costs associated with having a college nearer to their place of residence. The IV approach has been advanced as less restrictive and more robust than econometric selection models, which provide an alternative way to account for selection in the estimation of the returns to schooling⁶ (see Krueger, 2000). However, as suggested by Card’s interpretation, for many policy questions it does not estimate the policy relevant parameters unless the policy affects the same people that are affected by the instrument.⁷ Therefore, the policy evaluation problem is greatly simplified if the assumption of a common β in the population is satisfied since there is only one parameter of interest, which can be estimated using IV. But, empirically, is this a valid working assumption? This is the main question in this paper. I focus on the high school - college transition and examine whether selection in the returns to college is an empirically important phenomenon. I present estimates of the returns for the average person in the population, the average person in college, and the average person at the margin between going or not going to college. I analyze different demographic groups in three different datasets: the National Longitudinal Survey of Youth of 1979 (NLSY), the High School and Beyond (HSB) and the Panel Study of Income Dynamics (PSID). Across these different samples, I find that the average person going to college has a higher return than the marginal person who is indifferent between enrolling in college or not. This suggests that heterogeneity is important and needs to be accounted for in policy analysis.
⁶ However, Vytlacil (2002) shows that the assumptions underlying the LATE parameter of Imbens and Angrist (1994) are the same as the assumptions of a nonparametric selection model, and therefore the two approaches are equivalent.

⁷ Heckman and Vytlacil (2001b, 2003) show how to design an instrumental variable that estimates the effect of a policy. Obviously, the instrument is policy dependent.

The analysis in this paper also shows that measured cognitive ability is an important determinant of returns: individuals with higher ability have higher returns. This is a well documented finding in the literature. I also estimate the distribution of ability for individuals induced to enroll in college by a policy (say, a tuition subsidy), and then I compute the implied return to schooling for the average individual affected by the policy. I can predict whether a policy attracts mostly individuals with high returns to college (high ability) or mostly individuals with low returns to college (low ability). I compare the distribution of ability and the return to schooling for the marginal person and

for the average person attending college. They are clearly different: those who are at the margin have lower ability (and lower returns) than those who choose to go to college. I find that measured ability is the main observable determinant of returns. However, there may also be unobservable determinants of the return to college on which individuals act on when deciding whether or not to enroll in college. I study the importance of these unobservables by estimating a nonparametric selection model using the method of local instrumental variables developed by Heckman and Vytlacil (2000, 2003). This allows me to estimate how the returns to college vary with “observed ability” and with “unobserved ability”, and how the distribution of observed and unobserved ability differs across groups of individuals affected by different policies8 . My analysis suggests that the unobservable is a quantitatively important determinant of returns and therefore we need to consider it when doing policy evaluation. After accounting for selection on unobservables I still find that the return for the marginal person is lower than the return for the average person in college. At the same time I find that, empirically, the average returns to college for individuals induced to go to college by a $500 tuition subsidy are very similar to the average returns for individuals attracted into college by a $2000 tuition subsidy. These are two very different policies. The first increases college enrollment by 2.5 percentage points while the latter increases it by 8 percentage points. If, as I show in this paper, heterogeneity and selection are important, we should expect the average returns to college to be different between the two groups of individuals affected by the different tuition subsidies. These results can be reconciled. 
⁸ The expression “unobserved ability” is convenient (for parallelism with “observed ability”), but potentially misleading since the unobservable variable I consider cannot really be called ability and can influence all types of costs and benefits of going to college.

Even though more people are induced to go to college by a $2000 tuition subsidy than by a $500 tuition subsidy, empirically, the distribution of observed and unobserved abilities is very similar for these two groups, and therefore the composition of returns is also very similar for the two groups. The consequence is that the relevant return that we need to evaluate these policies is essentially the same for both of them. Even though heterogeneity is important, one single parameter is sufficient for evaluating a diverse set of policies. We should stress that this is an empirical finding about the quantitative importance of accounting for heterogeneity in the evaluation of different tuition policies, not a theoretical result (it can theoretically happen, but only in a very special case). In theory, we expect that different policies affect groups of individuals with different distributions of abilities, and therefore, different distributions of returns. In the analysis of this paper these differences exist but happen to be

quantitatively small for a relatively large range of policies, although not for all policies. When I simulate a $5000 tuition subsidy the average returns to college for individuals induced to go to college by this policy are different from the returns to college for individuals induced to go to college by a $500 subsidy, so for very large subsidies this equality across policies no longer holds. Furthermore, I should also stress that even though average returns are the same across different policies, the return to the average person in the population is different from the return to the average person affected by the policies. The result that the policy relevant return is quantitatively similar across tuition policies holds for different demographic groups in the NLSY. It is also true in the PSID9 . It is possible to generate tuition policies that affect individuals with a significantly different distribution of ability than the individuals affected by the policies just described. However one can only do so with very large tuition subsidies. Even though Ordinary Least Squares and Linear Instrumental Variables are widely used methods in the estimation of the returns to schooling, neither method provides accurate estimates of policy effects, although the latter (using the usual instruments in the literature) generates a much better estimate than the former. I also show that most of the commonly used instruments for schooling are either correlated with cognitive ability or they are only weakly correlated with schooling. Ability belongs in the earnings equation and therefore these are not valid instruments. Ability needs to be included in the model if these instrumental variables are to be used. However, measures of ability are often not available in commonly used datasets such as the Current Population Survey (CPS). 
⁹ And, for the specifications of the schooling decision model estimated in this paper, it is also true for other policies that promote college enrollment by reducing the costs of going to college, either by introducing tuition subsidies, reducing commuting costs, or changing local labor market conditions. I do not do policy experiments for the HSB since a tuition variable is not available for that dataset.

I find that conditioning on proxies for ability such as family background attenuates the correlation between the instrument and ability but also substantially weakens the correlation between the instrument and schooling, making it difficult to use for other reasons.

The plan of this paper is as follows. In the next section I present a model of schooling choice and earnings. The model is typical of the modern program evaluation literature, where schooling is the program being evaluated. I define the policy relevant parameter and then I describe the main estimation problems and the methods I use to deal with them. In section 3 I present the evidence on heterogeneity and selection in the returns to schooling and show how different the marginal and the average persons are. I describe in detail the results for white males in the NLSY and summarize the results obtained from other samples.¹⁰ In section 4 I estimate the returns to schooling for individuals affected by different policies and compare them to least squares and linear instrumental variables estimates. Section 5 discusses the validity of the instrumental variables used in this paper. The last section summarizes and concludes the discussion in the paper.

2 Causes and Consequences of Selection in Returns

Heterogeneity is an intrinsic feature of economic data and its consequences for policy evaluation have been extensively discussed in the literature (see Heckman, 2001). In equation (1) both β (the return to schooling) and ε can be individual specific and vary in the population. If β is common for everybody in the population and ε is a random variable possibly correlated with schooling then the problems that can arise (ability bias) and the solutions (instrumental variables (IV)) are relatively well known11 . However, if β is different for different individuals then evaluating the benefits of a policy inducing individuals to acquire more schooling requires knowing whether those who are affected by the policy have a high return or a low return to schooling. Furthermore, since different policies may affect different people, analysts need to know where in the distribution of β each policy has its greatest impact. Suppose β varies in the population and costs (C) are common across individuals. Assume an individual decides to go to college if β − C > 0. Figure 1 plots an hypothetical distribution of returns and the cost for this example. A policy that made college attendance compulsory would have a large effect on those not currently attending college. If they did not attend college because they faced low returns to college then the policy would be drawing people from the left tail of the distribution of β (those at the left of C: β < C). In another example, suppose the policy being considered is a $500 tuition subsidy (a shift in cost from C to c0 ). Even with this subsidy many individuals with low returns are likely not to enroll in college, while those with very high returns are already enrolled anyway and would not be affected by the policy. In this case the subsidy mainly affects individuals more to the middle of the distribution of β. 
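The tuition subsidy example can be made concrete with a small simulation. This is only an illustrative sketch: the normal distribution of β, the cost of 0.10, the subsidy of 0.02 (both in log-earnings units) and the function name `simulate_subsidy` are assumptions made for illustration and do not come from the paper's data.

```python
import random
import statistics


def simulate_subsidy(cost=0.10, subsidy=0.02, n=200_000, seed=0):
    """Toy version of the 'enroll if beta - C > 0' rule with a common cost C."""
    rng = random.Random(seed)
    # Hypothetical distribution of returns: normal, mean 0.08, s.d. 0.05.
    betas = [rng.gauss(0.08, 0.05) for _ in range(n)]
    new_cost = cost - subsidy  # the subsidy shifts the cost from C to c'

    enrolled = [b for b in betas if b > cost]             # attend even without the policy
    induced = [b for b in betas if new_cost < b <= cost]  # attend only because of the subsidy
    never = [b for b in betas if b <= new_cost]           # stay out even with the subsidy

    return {
        "mean_all": statistics.fmean(betas),
        "mean_enrolled": statistics.fmean(enrolled),
        "mean_induced": statistics.fmean(induced),
        "mean_never": statistics.fmean(never),
        "share_induced": len(induced) / n,
    }


result = simulate_subsidy()
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

In this toy setup the individuals induced to enroll have returns between the new and the old cost, so their average return lies below that of students already enrolled and above that of individuals who stay out, which is the sense in which the subsidy draws from the middle of the distribution of β.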
¹⁰ These are referenced throughout the paper and will be available on http://lily.src.uchicago.edu/~klmcarn. The appendix for this paper (to which I refer continuously in the text) can also be found on this website.

¹¹ The goal is just to estimate the common β. The problem is that S may be correlated with ε. One solution is to use IV.

In this paper, I examine to what extent β varies in the population and identify what is the return for individuals attracted into school by different policies. I start by presenting a simple model of schooling and earnings, the framework for the analysis performed in this paper. Assume that there are two schooling levels: high school (S = 0) and college (S = 1).¹² The log earnings of individual i if he goes to college are

$$\ln Y_{1i} = \mu_1(X_i) + U_{1i} \tag{2}$$

If he does not go to college his earnings are

$$\ln Y_{0i} = \mu_0(X_i) + U_{0i} \tag{3}$$

where Xi are observable characteristics for individual i and (U1i , U0i ) are unobservable characteristics such that

$$E(U_{1i}) = 0, \qquad E(U_{0i}) = 0.$$

Additive separability between X and (U1 , U0 ) is not necessary for most of what I do in this paper, but it is a convenient assumption that simplifies the analysis and the empirical work (see Heckman and Vytlacil, 2000, 2003). Individuals can differ in their observables and in their unobservables. This is the potential outcomes model used in studies of unionism, migration, sectorial choice, job training and in the literature on policy evaluation. The return to schooling for individual i is

$$\begin{aligned}
\beta_i &= \ln Y_{1i} - \ln Y_{0i} \\
&= \mu_1(X_i) - \mu_0(X_i) + U_{1i} - U_{0i}
\end{aligned} \tag{4}$$

and it varies with Xi and U1i − U0i. We only observe an individual’s earnings for the sector he chooses. Then observed earnings are:

$$\begin{aligned}
\ln Y_i &= S_i \ln Y_{1i} + (1 - S_i) \ln Y_{0i} \\
&= \mu_0(X_i) + S_i \left[ \mu_1(X_i) - \mu_0(X_i) + U_{1i} - U_{0i} \right] + U_{0i} \\
&= \mu_0(X_i) + S_i \beta_i + U_{0i}
\end{aligned} \tag{5}$$

which is exactly in the same form as (1) (with the inclusion of observable variables X), where

$$\alpha_i = \mu_0(X_i), \qquad \varepsilon_i = U_{0i}$$

(and βi is a function of Xi and U1i − U0i).

¹² In this paper I ignore two important problems by considering a binary schooling variable. One is the allocation of students across many different types of college of very different quality. College quality has been shown to be important for earnings by Black and Smith (2002), although they find effects only for males. The other one is the allocation of students across different years of college: while some students complete college, many attend only one, two or three years of college. Still others may choose to go to graduate school and enroll in more than four years of college. As shown in Angrist and Imbens (1995), because individuals who enroll in college have on average more than one additional year of schooling than those who do not enroll in college, the gross estimate of the return to college will overestimate the return to one year of college. However, once we annualize the estimated return to college we cannot tell whether the bias induced by this aggregation is positive or negative (I annualize the returns to college by dividing them by the difference in the average years of schooling between those who ever enrolled in college and those who have a high school degree but never enrolled in college).

Individual i chooses to go to college (S = 1) if

$$\mu_S(X_i, Z_i) + U_{Si} > 0 \tag{6}$$

where Zi are variables that affect the choice of schooling but not the potential earnings in each sector, USi is the unobservable in the choice of schooling equation. Equation (6) should be interpreted as a reduced form representation of the choice problem, typical in applications of selection models13 . Assume that

$$\mu_S(X_i, Z_i) \text{ is a nondegenerate function of } Z \tag{7}$$

(so that Z is not independent of S) and that

$$U_1, U_0, U_S \perp\!\!\!\perp Z \mid X \tag{8}$$

and

$$U_1, U_0, U_S \perp\!\!\!\perp X \tag{9}$$

where ⊥⊥ denotes independence. Z is a vector of instrumental variables.

¹³ Carneiro, Heckman and Vytlacil (2001) show how (6) can be justified by an economic model where the agent chooses the level of schooling that maximizes his present value of earnings.

When β is not independent of S, or potential outcomes ln Y1 and ln Y0 are not independent of S, we will say that there is selection. We can consider two types of selection. Let f(ln Y) denote the probability density function (p.d.f.) of ln Y. There is selection in levels in high school potential outcomes if f(ln Y0 | S = 1) ≠ f(ln Y0 | S = 0) and there is selection in levels in college potential outcomes if f(ln Y1 | S = 1) ≠ f(ln Y1 | S = 0).¹⁴ If there is selection in levels in high school then the wage that the average college graduate would get had he not gone to college is different (it could be higher or lower) than the wage that the average high school graduate gets. If there is selection in levels in college then the wage that the average high school graduate would get if he went to college is different than the wage that the average college graduate gets. There is selection in returns if f(β | S = 1) ≠ f(β | S = 0). This means that the (potential) returns to college are different for those who chose to go to college and for those who chose not to go to college.¹⁵ It is possible to have selection in levels without having selection in returns. The early literature on the returns to schooling, surveyed by Griliches (1977), assumed that β in (1) was common across individuals (or that we could account for all the variation in β by conditioning on X). Therefore, it assumed that, in terms of (4), U1 = U0 = U (therefore ln Y1 − ln Y0 = µ1(X) − µ0(X)). The main econometric problem in the Griliches model is that S is not independent of ε (ability bias).¹⁶ It is not possible to estimate β by least squares since:

$$\operatorname{plim} \hat{\beta}_{OLS} = \beta + \frac{COV(S, \varepsilon)}{V(S)} \neq \beta$$

There is selection in levels ($\varepsilon \not\perp\!\!\!\perp S$) but not in returns ($\beta \perp\!\!\!\perp S$).¹⁷ The usual presumption is that COV(S, ε) > 0 (individuals of higher ability are more likely to go to school).

¹⁴ Before I defined selection in levels as $\varepsilon \not\perp\!\!\!\perp S$. This definition and the one I present now are compatible if one thinks of ε as measuring the residual in the base state. The base state can either be high school (in which case β measures the effect of going to college) or college (in which case β measures the effect of not going to college). The former is the most usual.

¹⁵ For example, Willis and Rosen (1979) study the returns to college using wages measured in two periods: the first is 1949 and the second is 1969. In the first period they find strong evidence that E(ln Y1 | S = 1) > E(ln Y1 | S = 0) (positive selection into college) but they cannot reject that E(ln Y0 | S = 1) = E(ln Y0 | S = 0) (no selection into high school). Since β = ln Y1 − ln Y0, in this case, E(β | S = 1) > E(β | S = 0) (positive selection in returns). However, in the second period they find strong evidence that E(ln Y0 | S = 1) < E(ln Y0 | S = 0) (positive selection into high school) but can’t reject that E(ln Y1 | S = 1) = E(ln Y1 | S = 0). This also generates E(β | S = 1) > E(β | S = 0), but the reason why this happens now is that if college graduates did not go to college they would be very poor high school graduates.

¹⁶ In terms of the model in (5), S was not independent of U0.

¹⁷ In terms of the model in (5) for the case where U1 = U0 = U:

$$\operatorname{plim} \hat{\beta}_{OLS} = \frac{COV(S, \ln Y)}{V(S)} = \beta + E(U \mid S = 1) - E(U \mid S = 0).$$

The standard solution is to use instrumental variables. Let Z be an instrument satisfying (8). Then,

$$\operatorname{plim} \hat{\beta}_{IV} = \beta + \frac{COV(Z, \varepsilon)}{COV(S, Z)} = \beta$$
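The contrast between least squares and instrumental variables in the common-β model can be checked numerically. The sketch below is a simulation under assumed functional forms: the normal shocks, the coefficient values and the helper names `cov` and `simulate_common_beta` are hypothetical and chosen only for illustration.

```python
import random
import statistics


def cov(x, y):
    """Sample covariance (dividing by n)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def simulate_common_beta(beta=0.10, n=100_000, seed=1):
    """Common return beta; unobserved ability raises both schooling and earnings."""
    rng = random.Random(seed)
    z, s, y = [], [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)   # unobserved, enters earnings through eps
        zi = rng.gauss(0, 1)        # instrument, independent of ability
        si = 1 if zi + ability + rng.gauss(0, 1) > 0 else 0
        eps = 0.3 * ability + rng.gauss(0, 0.2)
        z.append(zi)
        s.append(si)
        y.append(1.0 + beta * si + eps)
    ols = cov(s, y) / cov(s, s)     # beta + COV(S, eps) / V(S)
    iv = cov(z, y) / cov(z, s)      # beta + COV(Z, eps) / COV(Z, S)
    return ols, iv


ols_estimate, iv_estimate = simulate_common_beta()
print(f"OLS: {ols_estimate:.3f}  IV: {iv_estimate:.3f}  true beta: 0.100")
```

Because ability enters both the schooling rule and ε in this setup, COV(S, ε) > 0 and least squares overstates β, while the instrument, being independent of ε by construction, recovers β up to sampling error.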

The analysis becomes more difficult if there is also selection in returns. This type of selection will be the main focus of the rest of this paper. If β varies in the population then β can be correlated with S, in which case there is both selection in levels and in returns. The standard solutions developed to deal with ability bias (instrumental variables) need to be reinterpreted. Let β̄ be the mean of β. Then one can rewrite (1) as:

$$\begin{aligned}
\ln Y &= \alpha + \bar{\beta} S + \varepsilon + S \left( \beta - \bar{\beta} \right) \\
&= \alpha + \bar{\beta} S + \varepsilon + S \left( U_1 - U_0 \right).
\end{aligned}$$

If Z is an instrument independent of ε and β, but not independent of S, using the method of instrumental variables identifies:

$$\operatorname{plim} \hat{\beta}_{IV} = \bar{\beta} + \frac{COV\left[ Z, S \left( \beta - \bar{\beta} \right) \right]}{COV(Z, S)}.$$

We acquire an additional term relative to the previous model, $COV[Z, S(\beta - \bar{\beta})] / COV(Z, S)$, and in general this term is not equal to zero. By iterated expectations:

$$\begin{aligned}
\frac{COV\left[ Z, S \left( \beta - \bar{\beta} \right) \right]}{COV(Z, S)}
&= \frac{COV\left[ Z, S \left( U_1 - U_0 \right) \right]}{COV(Z, S)} \\
&= \frac{COV\left[ Z, U_1 - U_0 \mid S = 1 \right] \Pr(S = 1)}{COV(Z, S)} + E\left( U_1 - U_0 \mid S = 1 \right).
\end{aligned}$$
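The additional term in the IV estimand and its decomposition can be verified in a simulated sample. The following is a sketch under assumptions that are not in the paper: a linear cost function C(Z) = c0 + c1·Z, normal shocks, a decision rule based on the log return, and the names `cov` and `simulate_heterogeneous_iv` are all hypothetical.

```python
import random
import statistics


def cov(x, y):
    """Sample covariance (dividing by n)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def simulate_heterogeneous_iv(beta_bar=0.08, sd_gain=0.05, c0=0.12, c1=0.05,
                              n=200_000, seed=3):
    """Enroll if beta - C(Z) > 0, with beta = beta_bar + (U1 - U0)."""
    rng = random.Random(seed)
    z, s, y, gain = [], [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)          # cost shifter used as the instrument
        gi = rng.gauss(0, sd_gain)    # U1 - U0, independent of Z
        eps = rng.gauss(0, 0.3)       # U0, independent of Z
        si = 1 if beta_bar + gi - (c0 + c1 * zi) > 0 else 0
        z.append(zi)
        s.append(si)
        gain.append(gi)
        y.append(1.0 + beta_bar * si + eps + si * gi)

    iv = cov(z, y) / cov(z, s)
    # Additional term: COV[Z, S(U1 - U0)] / COV(Z, S).
    extra = cov(z, [si * gi for si, gi in zip(s, gain)]) / cov(z, s)

    # Decomposition: conditional covariance term plus E(U1 - U0 | S = 1).
    z1 = [zi for zi, si in zip(z, s) if si == 1]
    g1 = [gi for gi, si in zip(gain, s) if si == 1]
    share_treated = len(z1) / n
    decomposed = cov(z1, g1) * share_treated / cov(z, s) + statistics.fmean(g1)

    return {"iv": iv, "extra": extra, "decomposed": decomposed,
            "mean_gain_treated": statistics.fmean(g1)}


out = simulate_heterogeneous_iv()
for name, value in out.items():
    print(f"{name}: {value:.4f}")
```

In this setup the additional term is positive, so the IV estimate lies above the population mean return of 0.08, and the two ways of computing the term agree up to floating-point error. The sign and size of the term depend on the assumed cost function and distributions.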

Suppose we have the following model: S = 1 if Y1 − Y0 − C(Z) > 0. Then E(U1 − U0 | S = 1) ≠ 0 and COV[Z, U1 − U0 | S = 1] ≠ 0. Z is unconditionally independent of U1 − U0, but not conditional on S. Then, conditional on S = 1, high Z implies high U1 − U0. If S ⊥⊥ (U1 − U0) or if U1 = U0 then E(U1 − U0 | S = 1) = 0 and COV[Z, U1 − U0 | S = 1] = 0. Finally, notice also that

$$\frac{COV\left[ Z, S \left( \beta - \bar{\beta} \right) \right]}{COV(Z, S)} \neq \frac{COV\left[ W, S \left( \beta - \bar{\beta} \right) \right]}{COV(W, S)},$$

where Z and W are two different instruments. This means that different instruments estimate different parameters. Dependence between β and S can occur if the variables that determine the return to schooling in (4) are correlated with the variables that determine choice of schooling in (6). A simple special case where this happens is the Roy model:¹⁸

$$S = \begin{cases} 1 & \text{if } \ln Y_1 \geq \ln Y_0 \iff \beta \geq 0 \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

(individuals choose the schooling level that gives them higher earnings). In this case it is possible to define different parameters that answer different questions. Table 1 lists the main parameters in the evaluation literature. Many economists focus on the average return to schooling in the population (ATE), but this is only one of many possible parameters. TT is treatment on the treated and measures the return to college for those who choose to go to college while TUT, treatment on the untreated, measures the return to college for those who choose not to go to college. The AMTE (average marginal treatment effect) is the average return to college for individuals at the margin between enrolling or not in college. LATE is a parameter defined by an instrumental variable (see Imbens and Angrist, 1994). It measures the average return to schooling for individuals induced to go to school by a change in the instrument from Z = z to Z = z′. Different instruments produce different parameters and different estimates. The PRTE (policy relevant treatment effect), introduced in Heckman and Vytlacil (2001), is the relevant parameter for evaluating a specific policy, say a tuition subsidy. It measures the average return for individuals who would not go to college in the absence of the policy but who would go to college if the policy were in place. Different policies correspond to different parameters. Ichimura and Taber (2001) discuss identification and estimation of a very similar parameter. If there is selection in returns then all these parameters are different from each other. E(β − β̄ | S = 1) is the sorting gain (or loss, if it is negative) that arises from purposive selection into schooling.¹⁹ In the Roy framework the choice of schooling is explicitly modeled. If agents know or can partially predict β at the time they make schooling decisions, there is dependence between β and S in equation (1) and the parameters defined above will be different from each other.
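To illustrate how these parameters differ under the Roy model (10), the sketch below simulates a population with heterogeneous β and computes sample analogues of ATE, TT, TUT and the sorting gain. The normal distribution of β, the bandwidth used to approximate the margin, and the function name `roy_parameters` are assumptions made for illustration only.

```python
import random
import statistics


def roy_parameters(mean_beta=0.05, sd_beta=0.10, n=200_000, band=0.005, seed=2):
    """Sample analogues of the treatment parameters when S = 1 iff beta >= 0."""
    rng = random.Random(seed)
    betas = [rng.gauss(mean_beta, sd_beta) for _ in range(n)]

    treated = [b for b in betas if b >= 0]          # choose college (S = 1)
    untreated = [b for b in betas if b < 0]         # choose high school (S = 0)
    marginal = [b for b in betas if abs(b) < band]  # close to indifference

    ate = statistics.fmean(betas)
    tt = statistics.fmean(treated)
    tut = statistics.fmean(untreated)
    amte = statistics.fmean(marginal)  # crude approximation to the return at the margin
    return {"ATE": ate, "TT": tt, "TUT": tut, "AMTE": amte,
            "sorting_gain": tt - ate}


params = roy_parameters()
for name, value in params.items():
    print(f"{name}: {value:.4f}")
```

Under rule (10) the marginal individual has β close to zero by construction, so in this toy population the return at the margin is below both TT and ATE.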
In model (10) the sorting gain is positive: TT > ATE > TUT. However, in more general models, it may be of either sign. If β is independent of S then ATE = TT = TUT = AMTE = PRTE. This is true if µ1 and µ0 do not depend on X and U1 = U0, in which case there is no heterogeneity in returns, but it can also happen if β varies in the population but it is a random variable independent of S (if X and U1 − U0 are independent of S).²⁰ These assumptions underlie the representative agent model that is widely used for educational policy evaluation because it greatly simplifies the analysis (since we can focus on a single parameter).

¹⁸ For simplicity I drop the i subscripts from the individual variables.

¹⁹ All these parameters can be defined conditional on X. For example, ATE(X = x) = E(β | X = x) is the average return to schooling for individuals with X = x.

Theoretically, heterogeneity is an important problem. Small differences in the estimates of the return to schooling in a given year can generate large differences in estimates of lifetime earnings and of lifetime returns to schooling. However, dealing with heterogeneity is difficult. Then the fundamental question I ask in this paper is: empirically, how important is it to account for heterogeneity in the evaluation of education policy? In the next section I assess the empirical importance of heterogeneity and selection for educational policy evaluation. The central parameter of my analysis is the marginal treatment effect (MTE) which is defined as:

\[ MTE(x, u) = E(\beta \mid X = x, U_S = u) = \mu_1(x) - \mu_0(x) + E(U_1 - U_0 \mid X = x, U_S = u) \]

(see Heckman and Vytlacil, 1999, 2000, 2001b, 2003, Carneiro, Heckman and Vytlacil, 2001, and Bjorklund and Moffitt, 1987). MTE(x, u) measures the average return to schooling for individuals with X = x and US = u. X and US are variables that determine whether or not an individual goes to school (see equation (6))21. As an example, suppose X is a scalar and µS(X, Z) is increasing in X so that people are more likely to go to school the higher their X and the higher their US. If the MTE is increasing in X and US then the average return for those going to college (individuals with high X and high US) is higher than the average return for those not going to college (individuals

20 All these statements can be made conditional on X. If U1 = U0 or U1 − U0 is independent of S then

\[ ATE(X) = TT(X) = TUT(X) = AMTE(X) = PRTE(X) \]

where ATE(X) is the average return to schooling for individuals with a given level of X (and likewise for the other parameters).

21 I assumed that Z is independent of U1 and U0 conditional on X (see condition (8)). Therefore

\[ E(\beta \mid X = x, U_S = u, Z = z) = \mu_1(x) - \mu_0(x) + E(U_1 - U_0 \mid X = x, U_S = u, Z = z) = \mu_1(x) - \mu_0(x) + E(U_1 - U_0 \mid X = x, U_S = u) \]

so even though Z determines schooling it does not affect potential outcomes. Using the MTE is similar to indexing β by X and US and then seeing how it varies in the population. However it is not exactly the same unless we observe all variables that determine β, in which case we can estimate the distribution of β. If we do not have all variables determining β then we can only estimate the mean β conditional on the variables at our disposal.


with low X and low US). Dependence between β and (X, US) leads to dependence between β and S and to a model where there is selection on returns. However, if β is independent of X and US then there is no selection. In that case MTE(x, u) = \(\bar{\beta}\), where \(\bar{\beta}\) is a constant independent of X and US22. Notice that the marginal treatment effect is a function of an unobservable variable: US. There is a large literature in econometrics that develops methods to deal with selection on unobservables. In this paper I use a nonparametric selection model23 and apply the method of local instrumental variables (LIV)24 introduced by Heckman and Vytlacil (2000) and applied in Carneiro, Heckman and Vytlacil (2001). To focus on unobservables, and also for easier exposition, I keep the conditioning on X implicit. The goal is to estimate \(E(\ln Y_1 - \ln Y_0 \mid U_S = u) = \bar{\beta} + E(U_1 - U_0 \mid U_S = u)\), where \(\bar{\beta} = \mu_1 - \mu_0\). The method of local instrumental variables requires that there exists a continuous instrument Z satisfying (7) and (8). Furthermore, we also need the instrument to satisfy a monotonicity property25. The intuition of the method is best explained with an example. Suppose the model is the following:

\[ S = 1 \text{ if } -Z\gamma + U_S > 0 \]

where Z (the instrumental variable) is tuition in county of residence (γ > 0) and US is ‘ability’, which is unobserved26. Assume we start by using only two counties: A and B. In county A, Z = $100, and in county B, Z = $200. The two counties are equal in every aspect except tuition27. We can estimate β by standard IV using tuition as the instrument and data only from counties A and B:

22 There are intermediate cases. For example, if MTE(x, u), E(ln Y1|X = x, US = u) and E(ln Y0|X = x, US = u) are independent of US but not of X then there is selection on observables but there is no selection on unobservables. I can also analyze selection in levels by examining to what extent E(ln Y1|X = x, US = u) and E(ln Y0|X = x, US = u) vary with X and US. In the common coefficient model there is no selection in returns since β = \(\bar{\beta}\). In this model selection in levels is the major source of bias (even though E(ln Y1|X = x, US = u) and E(ln Y0|X = x, US = u) are parallel they may not be constant). Notice that MTE(x, u) can be independent of X and US even though E(ln Y1|X = x, US = u) and E(ln Y0|X = x, US = u) are not (selection in levels without having selection in returns).

23 Typical applications of selection models assume a parametric form for f(U1, U0, US) (for example, multivariate normal) which in turn implies a parametric form for the MTE. In contrast I estimate the MTE nonparametrically.

24 Typical applications of instrumental variables (for example, linear instrumental variables) use global variation instead of local variation in the instrument (just as a linear regression of Y on X uses global variation in X to fit a line while a nonparametric regression of Y on X uses local variation in X to fit a curve).

25 We require that some regularity conditions and a monotonicity condition are satisfied. Monotonicity is satisfied if given any two values of the instrument, z and z′, either S(z) ≥ S(z′) for all individuals or S(z′) ≥ S(z) for all individuals. There cannot be a change in the value of the instrument that simultaneously induces some individuals to go to school and other individuals not to go to school. In the appendix I formally present the assumptions required to apply the method of LIV.

26 The unobservable does not have to be “ability”. I call it ability for simplicity but US can represent any unobservable (e.g., an unobservable cost).

27 Tuition can vary across counties if there are barriers to student migration. Tuition variation could also reflect quality variation, which would invalidate the use of tuition as an instrumental variable, but we abstract from quality considerations in this example and in the rest of the paper. For an example see Carneiro and Heckman (2001).


\[ \hat{\beta}_{IV}^{100,200} = \frac{E(\ln Y \mid Z = 100) - E(\ln Y \mid Z = 200)}{E(S \mid Z = 100) - E(S \mid Z = 200)} = E[\beta \mid S(Z = 100) = 1, S(Z = 200) = 0] = E(\beta \mid 100\gamma < U_S \leq 200\gamma) \]

This is the Local Average Treatment Effect for the case where the instrument takes values Z = $100 and Z = $200. It is the average return for individuals who go to college if Z = 100 but do not go if Z = 200. Therefore these individuals are at the margin between going to college or not if Z is between 100 and 200. The fact that they are at the margin at such a low level of tuition means that they have low ability (US )28 . Suppose now we take two different counties. County C has Z = $2100 and county D has Z = $2200. Using C and D only:

\[ \hat{\beta}_{IV}^{2100,2200} = \frac{E(\ln Y \mid Z = 2100) - E(\ln Y \mid Z = 2200)}{E(S \mid Z = 2100) - E(S \mid Z = 2200)} = E[\beta \mid S(Z = 2100) = 1, S(Z = 2200) = 0] = E(\beta \mid 2100\gamma < U_S \leq 2200\gamma) \]

This is the average return for individuals not going to school if Z = 2200 but going to school if Z = 2100. They are at the margin at a high level of tuition which means that they have a high level of ability. The general formula for any pair of counties is:

\[ \hat{\beta}_{IV}^{z,z'} = \frac{E(\ln Y \mid Z = z) - E(\ln Y \mid Z = z')}{E(S \mid Z = z) - E(S \mid Z = z')} = E[\beta \mid S(Z = z) = 1, S(Z = z') = 0] = E(\beta \mid z\gamma < U_S \leq z'\gamma) \]

28 If individuals switch when tuition varies very little at such a low range, then even though they were facing a very low level of tuition before the change these individuals had still decided not to enroll in college. Therefore they are likely to have low levels of ability.
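The two-county comparisons can be reproduced in a simulation. The sketch below assumes a value for γ, a normal distribution for US and a return that rises linearly with US; all of these numbers, and the helper name `wald_estimate`, are hypothetical. It applies the ratio formula above to each pair of counties.

```python
import numpy as np

GAMMA = 0.001  # hypothetical effect of one dollar of tuition on the schooling index

def wald_estimate(z_low, z_high, n=500_000, seed=0):
    """Wald/IV estimate of the return from two counties with tuition z_low < z_high."""
    rng = np.random.default_rng(seed)
    mean_ln_y, share_college = [], []
    for z in (z_low, z_high):
        u_s = rng.normal(1.0, 1.0, n)      # unobserved 'ability' (assumed distribution)
        beta = 0.10 + 0.20 * u_s           # return rises with ability (assumed)
        u0 = rng.normal(0.0, 0.30, n)      # unobserved component of ln Y0
        s = (-z * GAMMA + u_s) > 0         # schooling rule of the example
        ln_y = 2.0 + u0 + s * beta         # observed log wage
        mean_ln_y.append(ln_y.mean())
        share_college.append(s.mean())
    return (mean_ln_y[0] - mean_ln_y[1]) / (share_college[0] - share_college[1])

late_low = wald_estimate(100, 200, seed=1)     # margin at low tuition
late_high = wald_estimate(2100, 2200, seed=2)  # margin at high tuition
print(f"LATE(100, 200)   = {late_low:.3f}")
print(f"LATE(2100, 2200) = {late_high:.3f}")
```

Under these assumptions the pair of high-tuition counties identifies a larger return than the pair of low-tuition counties, because the individuals who switch there have higher US.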


We can make z and z′ close and get

\[ E(\beta \mid U_S = z\gamma) \]

Therefore, by varying Z we can trace out how β varies with US. This is the Marginal Treatment Effect. If the MTE is flat then there is no heterogeneity. Notice that to trace out the whole support of US we need large support for the instrument. Figure 2 illustrates this idea. In this figure I assume that the MTE is a straight line with positive slope. Variation in tuition at different ranges identifies different segments of this curve. We can only identify the MTE within the range of the data. For example, if tuition only varies between 100 and 2200 we can only identify the MTE between these two values. If there are no data for tuition within a given interval (e.g., 500 to 600) then it is not possible to identify the MTE within this interval. To implement this in practice we first aggregate multiple instruments into a single (cost) index (by using a first stage regression29):

\[ S = 1 \text{ if } Z\gamma + U_S > 0 \]

By aggregating multiple instruments into a single index and then using it as the instrumental variable I can get a larger support over which to estimate the MTE (larger in the sense of having extremes that are farther away and having no holes in the middle of the support). Then we construct P (Z) = Pr (S = 1|Z) and use it as the instrument.

\[ \hat{\beta}_{IV}^{p,p'} = \frac{E(\ln Y \mid P = p) - E(\ln Y \mid P = p')}{E(S \mid P = p) - E(S \mid P = p')} = \frac{E(\ln Y \mid P = p) - E(\ln Y \mid P = p')}{p - p'} \]

In the numerator we evaluate the function E(ln Y|P) at two points (p and p′) and then take the difference. In the denominator we have the difference in the points of evaluation. Notice that this is like a derivative. In summary, the first step is to construct E(ln Y|P). This is simply a regression of ln Y on P and can be estimated parametrically or nonparametrically30. Then we take the derivative31. If the derivative is flat then selection is not

29 However this is a nonlinear first stage. In standard two stage least squares the first stage is linear.

30 For example, we can use kernel regression, local linear regression, polynomials in P, splines, etc.

31 For example, this can be done by computing the exact slope of the parametric function we are estimating, or by using discrete differences.


important (heterogeneity plays a small role in the problem). Therefore a simple test of selection on unobservables is the following: is E (ln Y |P ) linear in P ? Standard instrumental variable assumes no selection on unobservables and therefore imposes linearity in P (see figure 3 for an example32 ). In this framework, we can also allow for X (ex: test scores), in a parametric way. In this case:

β = α1 − α0 + φ (X) + U1 − U0

A simple formal justification for this estimator of the marginal treatment effect is the following. We can transform the variables in the choice model in (6) such that it becomes

\[ S = 1 \text{ if } P(Z) > V_S \tag{11} \]

where

\[ P(Z) = 1 - F_{U_S}[-\mu_S(Z)] = \Pr(S = 1 \mid Z), \qquad V_S = 1 - F_{U_S}(U_S). \tag{12} \]
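The transformation in (11) and (12) can be checked numerically. The sketch below assumes a logistic distribution for US purely because its distribution function has a closed form; the index values are hypothetical draws. It verifies that the transformed rule reproduces the original schooling decisions and that VS behaves like a uniform variable.

```python
import numpy as np

def logistic_cdf(u):
    """Distribution function of a standard logistic variable."""
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
n = 100_000
mu_s = rng.uniform(-2.0, 2.0, n)     # hypothetical values of the index mu_S(Z)
u_s = rng.logistic(0.0, 1.0, n)      # U_S, logistic by assumption

s_original = (mu_s + u_s) > 0        # choice rule written in terms of U_S
p = 1.0 - logistic_cdf(-mu_s)        # P(Z) = 1 - F(-mu_S(Z))
v_s = 1.0 - logistic_cdf(u_s)        # V_S = 1 - F(U_S)
s_transformed = p > v_s              # choice rule (11)

agreement = np.mean(s_original == s_transformed)
print(f"agreement = {agreement:.4f}")
print(f"mean(V_S) = {v_s.mean():.3f}, var(V_S) = {v_s.var():.4f}  (uniform: 0.500, 0.0833)")
```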

Notice that VS ∼ U nif [0, 1] (uniform distribution taking values between 0 and 1). Under these conditions Heckman and Vytlacil (2000, 2003) and Carneiro, Heckman and Vytlacil (2001) show that we can identify E (β|VS = vS )33 by first running a (nonparametric) regression of ln Y on P and then computing the derivative with respect to P . To see this, notice that we can write observed outcomes as:

\[ \ln Y = S \ln Y_1 + (1 - S) \ln Y_0 = \mu_0 + S\bar{\beta} + U_0 + S(U_1 - U_0). \]

32 If the regression function is linear (the derivative is flat) then linear IV is valid. If the regression function is nonlinear (the derivative is not flat) then linear IV is not valid.

33 This is the MTE but as a function of VS, the transformed unobservable, and not as a function of US as before.


Then,

\[ E(\ln Y \mid P = p) = \mu_0 + \bar{\beta}p + E(U_1 - U_0 \mid P > V_S, P = p)\,p = \mu_0 + \bar{\beta}p + \int_0^p E(U_1 - U_0 \mid V_S = v)\,dv \tag{13} \]

(since E (µ0 + U0 |P = p) = µ0 , E (S|P = p) = p, E (U1 − U0 |S = 1, P = p) = E (U1 − U0 |VS < p) and VS ∼ U nif [0, 1]). Finally,

\[ \frac{\partial E(\ln Y \mid P = p)}{\partial p} = \bar{\beta} + E(U_1 - U_0 \mid V_S = p) = E(\beta \mid V_S = p). \tag{14} \]
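The two steps behind (13) and (14) can be sketched in a simulation. The data generating process is assumed: the return is β = 0.40 − 0.40·VS, so the true MTE is linear in VS and E(ln Y|P) is exactly quadratic in P. A quadratic polynomial stands in for the regression of ln Y on P; it is only one of the possible choices and not necessarily the one used for the estimates in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
p = rng.uniform(0.05, 0.95, n)   # propensity score P(Z), drawn independently of V_S
v = rng.uniform(0.0, 1.0, n)     # V_S ~ Unif[0, 1]
s = p > v                        # choice rule (11)
beta = 0.40 - 0.40 * v           # assumed return, so that MTE(v) = 0.40 - 0.40 v
u0 = rng.normal(0.0, 0.30, n)    # unobserved component of ln Y0
ln_y = 1.0 + u0 + s * beta       # observed log wage

# Step 1: approximate E(ln Y | P) with a quadratic polynomial in P.
coefs = np.polyfit(p, ln_y, deg=2)
# Step 2: differentiate the fitted function with respect to P.
mte_hat = np.poly1d(np.polyder(coefs))

for point in (0.2, 0.5, 0.8):
    true_mte = 0.40 - 0.40 * point
    print(f"p = {point:.1f}: estimated MTE = {mte_hat(point):.3f}, true MTE = {true_mte:.3f}")
```

A flat estimated derivative would indicate that E(ln Y|P) is linear in P; here the derivative falls with p because the assumed return falls with VS.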

A continuous instrument Z generates a continuous P, which we need in order to be able to compute this derivative34. E(ln Y|P) and its derivative can be estimated using standard nonparametric methods35. One reason it is useful to study the MTE is that it is a natural way to look at heterogeneity. But there is one other important reason. Heckman and Vytlacil (1999, 2000, 2001b, 2003) and Carneiro, Heckman and Vytlacil (2002) show how we can compute different evaluation parameters by constructing different weighted averages of the MTE. Tables 2a-b illustrate how we can do it for the parameters presented in Table 1. For example, to compute the average return to schooling in the population one can weight the MTE by the distribution of (X, US) in the population. For treatment on the treated one weights the MTE by the distribution of (X, US) for the individuals that go to college. For the policy relevant parameter we weight the MTE by the distribution of (X, US) for those individuals induced to go to college by the policy. These weights are interesting objects by themselves because they tell us the distribution of observable and unobservable determinants of returns for different groups in the population. For example, by considering them together with the MTE we can have an idea of whether a given policy benefits mostly individuals with positive returns or individuals with negative returns to college (or high vs. low returns)36.

34 It is also possible that by using a set of many discrete Z variables we can approximate a continuous P. It is important to be careful with the support of P since we can only identify E(β|VS = p) for values of p that are in both the support of P for individuals with S = 1 and in the support of P for individuals with S = 0.

35 A formal and detailed description of the method of local instrumental variables is provided in Heckman and Vytlacil, 2000, 2003, and in Carneiro, Heckman and Vytlacil, 2001, and summarized in the appendix to this paper.

36 Furthermore, since X and US are also related to the determinants of potential wages, these weights together with estimates of E(ln Y1|X = x, US = u) and E(ln Y0|X = x, US = u) tell us how different policies change the average quality of people in each sector. The estimation of E(ln Y1|X = x, US = u) and E(ln Y0|X = x, US = u) is very similar to the estimation of the MTE and is explained in
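The weighting idea can be sketched numerically. Table 2 is not reproduced in this section, so the weights below are an assumption: they follow the form derived by Heckman and Vytlacil with conditioning on X implicit and the MTE written as a function of VS, where the weight is 1 for ATE, proportional to Pr(P > v) for TT and proportional to Pr(P ≤ v) for TUT. The MTE function and the sample of propensity scores are hypothetical.

```python
import numpy as np

def mte(v):
    """Hypothetical MTE as a function of V_S, decreasing in v."""
    return 0.40 - 0.40 * v

rng = np.random.default_rng(0)
p_sample = rng.uniform(0.0, 1.0, 50_000)    # hypothetical sample of propensity scores
p_sorted = np.sort(p_sample)

grid = (np.arange(1000) + 0.5) / 1000       # midpoints of a grid on [0, 1]
# Pr(P > v) on the grid, from the empirical distribution of P
surv = 1.0 - np.searchsorted(p_sorted, grid, side="right") / p_sorted.size

w_ate = np.ones_like(grid)                  # ATE weights every v equally
w_tt = surv / surv.mean()                   # TT weight, normalized to average one
w_tut = (1.0 - surv) / (1.0 - surv).mean()  # TUT weight, normalized to average one

ate = np.mean(mte(grid) * w_ate)
tt = np.mean(mte(grid) * w_tt)
tut = np.mean(mte(grid) * w_tut)
print(f"ATE = {ate:.3f}, TT = {tt:.3f}, TUT = {tut:.3f}")
```

TT puts more weight on low values of VS (individuals who attend at almost any P), which is why it exceeds ATE when the MTE falls with VS.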


To calculate the weights in the second column of this table we need to estimate the joint distribution of (X, Z, US ). However, under assumptions (8) and (9), fX,Z,US (x, z, uS ) = fX (x) fUS (uS ) fZ (z|X = x)37 . Both fX (x) and fZ (z|X = x) can be directly estimated from the data. fUS (uS ) can in principle be estimated nonparametrically from (6) (see Matzkin, 1992) although discrete choice models are commonly estimated by assuming a parametric functional form for fUS (uS ) (such as a normal, a logistic, or a uniform). In this paper the empirical results I present are for a normal38 . Then we can construct all the quantities in the third column of Table 2b. Notice that even though one does not know a priori which individuals will be affected by a given policy, it is possible to estimate the distribution of (X, US ) for these individuals by making use of the schooling model. Suppose that one of the variables in Z is tuition and the policy we are interested in is a tuition subsidy of λ given for each person that decides to attend college. Furthermore, for simplicity of exposition, assume that we can write (6) as

\[ S = 1 \text{ if } X\gamma_X - Z + U_S > 0 \]

(assume γZ = −1 for simplicity, so that higher tuition lowers the net benefit of college). Therefore an individual that chooses not to go to college without the subsidy may be induced to enroll in college once the subsidy is in place if

\[ X\gamma_X - Z + U_S < 0 \quad \text{and} \quad X\gamma_X - (Z - \lambda) + U_S > 0 \]

or if Z − λ < XγX + US < Z. This condition defines the set of (X, US) values such that individuals who have (X, US) in this set switch from one schooling level to another in response to the policy, for a given value of Z39. Once we estimate fZ,X,US(z, x, u), we

(footnote 36, continued) the appendix.

37 Under (8) and (9):

\[ f_{X,Z,U_S}(x, z, u_S) = f_{Z,U_S}(z, u_S \mid X = x)\, f_X(x) = f_Z(z \mid X = x)\, f_{U_S}(u_S \mid X = x)\, f_X(x) = f_Z(z \mid X = x)\, f_{U_S}(u_S)\, f_X(x) \]

38 The results in this paper are robust to using a logistic instead of a normal.

39 Notice that under these assumptions, if λ > 0 (positive tuition subsidy) no individual will want to switch from being enrolled in college to not being enrolled in college, and some individuals will want to switch from no college to college.


can compute the policy weight in Table 2b. The class of policies that can be evaluated with this method has the following characteristics: 1) the policy cannot affect the outcome equation, either through X, U1 and U0, or through the functions µ1(X) and µ0(X); it can only operate through the choice equation (this rules out general equilibrium effects); 2) the policy has to operate through one of the Z variables that is observed in the data; and 3) the policy cannot change Z to values outside the support of the observed data, unless we can extrapolate by parameterizing in advance the relationship between Z, S, X and US outside the support of observed data (see Heckman, 2001). The idea behind this exercise is that variation in Z in the cross section mimics variation in the policy. For example, we can estimate the average return to college for individuals that would go to college if they faced a net tuition of $1000 but not if they faced a net tuition of $2000 by using LATE:

\[ \hat{\beta}_P = \frac{E(\ln Y \mid Z = 1000) - E(\ln Y \mid Z = 2000)}{\Pr(S = 1 \mid Z = 1000) - \Pr(S = 1 \mid Z = 2000)} = E[\beta \mid S(Z = 1000) = 1, S(Z = 2000) = 0] \]

This parameter measures the effect of giving a tuition subsidy of $1000 to individuals that currently face a net tuition of $2000. Using this method we can estimate the effect of giving a subsidy of $1000 to individuals facing a tuition of $2100, $2200, $2300, etc., and therefore we can estimate the overall effect of a tuition subsidy of $1000. Using cross sectional variation in the data to mimic policy variation is the way structural models usually work. This is also the basis for the estimator proposed by Ichimura and Taber (2000, 2002), which is very closely related to the estimator of policy effects used in this paper. The policy just presented is very simple and consists of a change in Z which is uniform across all levels of Z (subtract λ from each value of Z). Although this is the set of policies considered in this paper, it is possible to study much more general policies, provided that they are subject to the conditions specified in the previous paragraph. For example, the subsidy could be proportional (in that case the tuition faced by each individual after the policy is implemented is λZ, instead of Z − λ, where λ is the subsidy rate). Or the subsidy could be targeted to individuals with a given level of X and Z. More general policies are considered in Heckman and Vytlacil (2001b, 2003), and Carneiro, Heckman and Vytlacil (2001)40.

40 For easier interpretation it is convenient that the policy satisfies a monotonicity property of the same type as the one defined before. In particular, a monotone policy cannot at the same time induce some individuals to enroll in college and other individuals to drop out


In the next section I present estimates of the MTE, of the parameter weights in Table 2, and of the parameters in Table 1. I start the analysis by assuming that there is no selection on unobservables. In other words, the MTE is a function of X but not of US . This assumption simplifies the estimation procedure, the data demands, and improves the precision of the estimates. I use the method of matching, which is described below. However this may be a wrong assumption and therefore in the second part of the next section I estimate the MTE both as a function of X and of US . I use the method of local instrumental variables. Finally, I recalculate the weights and the parameters of Tables 1 and 2.
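The uniform tuition subsidy discussed above can be sketched as a simulation. All numbers are hypothetical, and the sketch assumes that the schooling index falls one for one with tuition, so that a subsidy λ > 0 can only move individuals into college. It identifies the individuals who switch and compares their average return with TT.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 1.0, n)            # observable characteristic (hypothetical)
u_s = rng.normal(0.0, 1.0, n)          # unobservable in the choice equation
z = rng.uniform(0.0, 3.0, n)           # tuition in thousands of dollars (hypothetical)
beta = 0.15 + 0.05 * x + 0.10 * u_s    # assumed individual return to college
lam = 1.0                              # subsidy of $1000

index = 0.5 * x + u_s                  # X*gamma_X + U_S with gamma_X = 0.5 (assumed)
s_before = index - z > 0               # attends when facing tuition Z
s_after = index - (z - lam) > 0        # attends when facing tuition Z - lambda
switch_in = s_after & ~s_before        # induced into college by the subsidy
switch_out = s_before & ~s_after       # empty under a monotone policy

prte = beta[switch_in].mean()          # average return for those induced to attend
tt = beta[s_before].mean()             # average return for those already attending
print(f"share switching in = {switch_in.mean():.3f}, switching out = {switch_out.sum()}")
print(f"PRTE = {prte:.3f}, TT = {tt:.3f}")
```

In this design the return rises with the same index that drives enrollment, so the individuals drawn in by the subsidy have lower returns on average than those already in college.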

3 The Empirical Importance of Heterogeneity and Selection

3.1 Selection on Observables

We can find large differences in estimates of the average return to schooling across demographic groups and local labor markets, suggesting that heterogeneity in the returns to schooling is an important phenomenon. The analysis in this paper makes use of datasets with extensive detail on individual characteristics and starts by examining how β varies with these variables. There exist some similar studies in the literature. One of these is done by Altonji and Dunn (1996) who measure the effects of family background on the returns to schooling. They find a positive effect of good background on returns in their preferred fixed effects specification41. There is also extensive research that focuses on the effects of measured cognitive ability on the returns to schooling (see Blackburn and Neumark, 1993; Bishop, 1991; Grogger and Eide, 1995; Heckman and Vytlacil, 2001a; Murnane, Levy and Willet, 1995; Meghir and Palme, 2002). The recent literature generally finds that individuals with higher test scores have higher returns42. Since cognitive ability is also a major determinant of schooling attainment this research suggests that selection on observables is important. However, using twins, Ashenfelter and Rouse (1998, 2000) claim that the effects of family background (and ability) on returns are small. Griliches (1977) also reports that he failed

(footnote 40, continued) of college. Even though in the case of a nonmonotone policy a policy effect can still be defined at the aggregate level (what is the total change in income for society of this policy) it is much less meaningful to talk about the average return to schooling for individuals induced to change their schooling level in response to the policy if some are moving into college and others are moving out of college.

41 Although they also say that their estimate may be upward biased.

42 However Heckman and Vytlacil (2001a) show that there is very large sorting of ability groups by different schooling levels and therefore it is hard to separate the effects of schooling and ability on earnings.


to find significant interactions between schooling and his measures of ability in wage regressions43 . My reading of the recent literature is that the returns to education vary across individuals that are different on many dimensions that are observable in these different datasets. This suggests that heterogeneity is important. Furthermore, what is observable or unobservable varies across datasets and therefore this evidence also suggests that we should worry about unobserved heterogeneity in many datasets where these variables are missing44 . I start by assuming that the analyst has a sufficiently rich set of variables such that conditional on it there is no dependence between U1 , U0 and US (no selection on unobservables). This implies that

\[ \ln Y_1, \ln Y_0 \perp\!\!\!\perp S \mid X, Z. \tag{15} \]

Under assumption (15), the MTE is:

E (β|X = x, US = u) = E (β|X = x) = µ1 (x) − µ0 (x)

(since E (U1 − U0 |X = x, US = u) = E (U1 − U0 ) = 0). In what follows I first estimate the average return to schooling as a function of X, i.e., the M T E (x). Then I estimate the distribution of X for different groups in the population, and in particular, the distribution of X for those individuals that are affected by the set of policies that I simulate (the weights in Table 2). Finally I weight the MTE with these distributions to obtain the different mean parameters introduced in Table 1. Since the MTE does not depend on US it is possible to estimate the MTE by comparing the earnings of college attenders and high school graduates with the same X variables. Conditioning on X is enough to account for all the dependence between ln Y1 , ln Y0 and S:

\[ E(\ln Y_1 \mid X = x, Z = z, S = 1) = \mu_1(x), \qquad E(\ln Y_0 \mid X = x, Z = z, S = 0) = \mu_0(x) \tag{16} \]

43 However his analysis is done with data from the 1960s and 1970s while the analysis that finds positive effects of test scores on returns is based on data from the 1980s and 1990s.

44 For example, even though cognitive test scores are an important determinant of returns, most of the analysis of returns to schooling is done in datasets with no measure of test scores, such as the CPS or the Census.


(then M T E (x) = µ1 (x) − µ0 (x)). This is a version of the method of matching and consists of matching high school graduates with a specific set of (X, Z) characteristics with college graduates with exactly the same characteristics and then comparing their earnings. Assumption (15) means that the econometrician observes enough detail about each individual to enable him to capture all the correlation between potential outcomes and schooling. The agent does not act on any left out unobserved component of outcomes when choosing his level of schooling. It is only possible to identify µ1 (X) and µ0 (X) for values of X such that

\[ 0 < \Pr(S = 1 \mid X) < 1 \tag{17} \]

(if Pr(S = 1|X) = 0 then it is only possible to estimate µ0(X) for that value of X, while if Pr(S = 1|X) = 1 one can only estimate µ1(X), but not µ0(X)). I provide a detailed analysis of the results for white males in the NLSY (a large longitudinal panel that follows individuals from 1979 through 2000), and I summarize the main findings from other samples. For the NLSY, the wage variable I use is the average hourly wage between the years of 1990 and 1994 (the average of all nonmissing wages) and (actual) working experience is measured in 1992. I match individuals on family background variables (number of siblings and father’s education), cognitive test scores (AFQT), average tuition in county of residence at 17, distance to college at 14 and local labor market variables (state blue collar wages at 17 and county unemployment rate at 17). However at the date the AFQT is administered different individuals have different levels of schooling, and therefore differences in AFQT scores between two individuals can be caused either by the fact that they have different levels of ability, by the fact that they had different levels of schooling at the date they took the test, or both. Hansen, Heckman and Mullen (2003) design and implement a procedure for correcting AFQT scores for the effect of schooling at test date and therefore isolate the effect of ability on the test score45. I use 1444 individuals out of 2439 white males in the NLSY46. I exclude all high school dropouts. An individual is a high school graduate if he has a high school degree or a GED (general equivalence degree), or if he completed

45 The argument is as follows. The test is administered in 1981 for everyone, and a large part of the sample is still enrolled in school in 1981. Then conditional on the completed schooling level of individuals (many years later) it can be argued that schooling at test date is exogenous (since the date of the test is exogenous) if there is no school interruption nor grade repetition, and if age of entry into school is independent of ability. I apply this correction procedure by first running a regression of AFQT scores on schooling at test date and completed schooling and then using the coefficients on schooling at test date to correct the AFQT scores. The details of the application of this method and the coefficients of this regression are shown in the appendix.

46 The NLSY79 has an oversample of poor whites which I exclude from this analysis.


12 years of schooling47 . An individual is a college attender if he has a college degree, or has completed more than 12 years of schooling. I exclude individuals who have at least one missing value for the variables listed above. The results shown in this paper are robust to different coding of the schooling variable, different specifications of the test score correction, different sets of variables used and imputation procedures that change the sample size and the sample composition48 . Table 3 presents descriptive statistics for this sample. We observe that those who attend college have on average 3.5 more years of schooling than those who do not attend college. They also have 30% higher wages, even though they have lower experience. They have much higher cognitive ability as measured by AFQT, already corrected for the effects of schooling at test date. They come from smaller and more educated families. They live in locations with slightly lower college tuition and with more colleges present. Local labor market conditions at age 17 are not much different for these two groups of individuals. The construction of these variables is discussed in the appendix in more detail. I want to estimate the functions in (16) nonparametrically. However before I do this I first estimate a very simple parametric version of (16) where the only X is a test score and it enters linearly in the return function:

\[ Y_1 = \alpha_1 + X\beta_1 + U_1 \]
\[ Y_0 = \alpha_0 + X\beta_0 + U_0 \]
\[ \beta = \alpha_1 - \alpha_0 + X(\beta_1 - \beta_0) + U_1 - U_0 \]
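Substituting the two outcome equations into the observed outcome Y = S·Y1 + (1 − S)·Y0 gives Y = α0 + (α1 − α0)S + (β1 − β0)SX + β0X + U0 + S(U1 − U0), which is a regression with an interaction between S and X. The sketch below checks this on simulated data in which schooling depends only on X (no selection on unobservables); all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(0.0, 1.0, n)                   # standardized test score (hypothetical)
s = (0.8 * x + rng.normal(0.0, 1.0, n)) > 0   # schooling depends on X and independent noise
a1, a0, b1, b0 = 0.60, 0.40, 0.15, 0.05       # assumed alpha_1, alpha_0, beta_1, beta_0
y1 = a1 + b1 * x + rng.normal(0.0, 0.30, n)   # potential outcome with college
y0 = a0 + b0 * x + rng.normal(0.0, 0.30, n)   # potential outcome without college
y = np.where(s, y1, y0)                       # observed outcome

# Regressors: constant, S, S*X and X
design = np.column_stack([np.ones(n), s, s * x, x])
alpha_hat, beta_hat, gamma_hat, theta_hat = np.linalg.lstsq(design, y, rcond=None)[0]
print(f"coefficient on S   = {beta_hat:.3f}  (alpha_1 - alpha_0 = {a1 - a0:.2f})")
print(f"coefficient on S*X = {gamma_hat:.3f}  (beta_1 - beta_0 = {b1 - b0:.2f})")
```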

Therefore, to estimate (β 1 − β 0 ) I run the following regression:

Y = α + βS + γSX + θX + ε

and γ = β1 − β0. This is the usual specification estimated in the literature that assesses the effects of cognitive ability (and other observable characteristics) on the returns to schooling. The first column of Table 4 shows the results for white males in the NLSY. The coefficient on the SX term is statistically significant and economically important49. This suggests

47 Cameron and Heckman (1993) show that receiving a GED is not equivalent to receiving a high school degree. When I exclude GED recipients from my sample of high school graduates the results do not change significantly. These results are available in the appendix.

48 Results are available on request.

49 These estimates imply a minimum return in the population of -4% and a maximum return of 44%. The dispersion in possible returns is therefore very large.


that test scores are an important determinant of returns and therefore that heterogeneity in returns (through test scores) is important. The remaining columns of this table show that test scores are important determinants of returns in other samples as well (although the effect for white females in HSB is statistically insignificant). It is convenient to reduce the dimension of the conditioning set (X and Z) since it is difficult to be nonparametric in multiple dimensions. The common practice is to use P(X, Z) = Pr(S = 1|X, Z) as the conditioning variable instead of X and Z50. This is called matching on the propensity score and was suggested by Rosenbaum and Rubin (1983), who show that if (15) holds then \(\ln Y_1, \ln Y_0 \perp\!\!\!\perp S \mid P(X, Z)\)51. I estimate P(X, Z) using a probit where all the variables enter linearly in the index52. Therefore:

\[ \mu_S(X, Z) = X\gamma_X + Z\gamma_Z, \qquad U_S \sim N(0, \sigma_S^2) \]

The coefficients and average marginal derivatives for each variable are displayed in Table 5. Cognitive ability is a major determinant of college attendance. One standard deviation change in average test score increases the probability of attending college by more than 15%. Family background variables also play an important role. Tuition elasticities are roughly similar to the ones reported in the literature: a decrease of $1000 in tuition leads to a 4% increase in enrollment (see Kane, 1994, Dynarski, 2001). Distance to college has no effect once we condition on all other variables. Unfavorable local labor market conditions (low wages and high unemployment) induce individuals to enroll in college. All these effects are robust to different specifications, although in some specifications the distance effect becomes negative and statistically strong. Notice that I did not include years of experience in the set of variables in P, even though experience is a variable in X. However I still condition on experience but in a parametric way. I estimate E[ln Y1 | experience, P(X, Z) = p, S = 1]

50 P(X, Z) is just an index of the variables that enter in the choice equation. It allows us to reduce the dimension of the conditioning set. Furthermore this index can be interpreted as measuring the probability of going to college for each individual, based on observable characteristics. Then we can estimate how the average return to schooling varies with this probability.

51 Therefore,

\[ E(\ln Y_1 \mid P, S = 1) = E(\ln Y_1 \mid P), \qquad E(\ln Y_0 \mid P, S = 0) = E(\ln Y_0 \mid P) \]

52 When using the propensity score it is important that \(X, Z \perp\!\!\!\perp S \mid P(X, Z)\) be satisfied. Dehejia and Wahba (1998) design an iterative procedure for determining the best specification of the functional form of P(X, Z), given that X and Z are specified, such that the condition above is satisfied. The results in this paper are robust to applications of such a procedure.

25

and E [ln Y0 |experience,P (X, Z) = p, S = 0] using the local linear methods developed and applied in Heckman, Ichimura and Todd (1997). I use a local linear regression with a biweight kernel and a bandwidth of 0.253 . I include linear controls for experience and experience squared:

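\[
\begin{aligned}
E[\ln Y_1 \mid P(X, Z) = p, S = 1] &= \gamma\,\text{experience} + \theta\,\text{experience}^2 + \phi_1(p) \\
E[\ln Y_0 \mid P(X, Z) = p, S = 0] &= \gamma\,\text{experience} + \theta\,\text{experience}^2 + \phi_0(p)
\end{aligned}
\tag{18}
\]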

where φ1 and φ0 are arbitrary functions of p [54]. This is a version of the regression-adjusted matching estimator discussed in Heckman, Ichimura and Todd (1997, 1998). The two major differences between usual applications of matching and the equations presented in (18) are that: 1) I use some Z variables (instruments) in the model for P; and 2) instead of including experience and experience squared in the model for P, I include them linearly in the outcome equations. Because very large sorting in the ability variable leads to a large sorting (by schooling category) of high P into S = 1 and low P into S = 0, we can have problems with the matching procedure. Variation in Z enlarges the support of the propensity score for both groups of individuals and therefore helps with identification of the MTE. The linear-quadratic specification in experience is standard practice in the estimation of the returns to schooling. The estimated coefficients are provided in the appendix (standard errors are bootstrapped [55]). Under assumption (15) (which leads to the conditions in (16)), we can estimate the MTE (now as a function of P, not of X) as the difference between the two estimated functions in (18) [56]:

\[ \hat{E}(\ln Y_1 - \ln Y_0 \mid P = p) = \hat{E}[\ln Y_1 \mid P(X, Z) = p, S = 1] - \hat{E}[\ln Y_0 \mid P(X, Z) = p, S = 0] \]
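The estimator above can be illustrated with a minimal sketch. It is not the code used for the paper: it assumes simulated data, drops the experience controls, and uses hypothetical names (`local_linear`, `matching_mte`, `simulate`). It only shows how a local linear regression with a biweight kernel and a bandwidth of 0.2 can be evaluated separately for S = 1 and S = 0 and differenced at a given value of P.

```python
import random


def biweight(u):
    """Biweight (quartic) kernel; zero outside [-1, 1]."""
    return (15.0 / 16.0) * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def local_linear(p0, p, y, bandwidth=0.2):
    """Local linear estimate of E[y | P = p0].

    Fits a kernel-weighted least-squares line in (p - p0) and returns
    its intercept, which is the fitted value at p0.
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for pi, yi in zip(p, y):
        d = pi - p0
        w = biweight(d / bandwidth)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yi
        t1 += w * d * yi
    det = s0 * s2 - s1 * s1
    if det <= 1e-12:
        raise ValueError("not enough observations near p0 (no common support)")
    return (s2 * t0 - s1 * t1) / det


def matching_mte(p0, p, log_wage, s, bandwidth=0.2):
    """Difference between the S = 1 and S = 0 regression functions at P = p0."""
    p_col = [pi for pi, si in zip(p, s) if si == 1]
    y_col = [yi for yi, si in zip(log_wage, s) if si == 1]
    p_hs = [pi for pi, si in zip(p, s) if si == 0]
    y_hs = [yi for yi, si in zip(log_wage, s) if si == 0]
    return (local_linear(p0, p_col, y_col, bandwidth)
            - local_linear(p0, p_hs, y_hs, bandwidth))


def simulate(n=4000, seed=0):
    """Hypothetical data in which the true college gap is 0.2 + 0.4 * p."""
    rng = random.Random(seed)
    p, s, y = [], [], []
    for _ in range(n):
        pi = rng.uniform(0.05, 0.95)
        si = 1 if rng.random() < pi else 0
        mean = 2.2 + 0.5 * pi if si == 1 else 2.0 + 0.1 * pi
        p.append(pi)
        s.append(si)
        y.append(mean + rng.gauss(0.0, 0.3))
    return p, s, y
```

Calling `matching_mte(0.5, *reorder)` on real data would require the estimated propensity score, log wages and schooling indicator in that order; in the simulated case, `p, s, y = simulate()` followed by `matching_mte(0.5, p, y, s)` should land close to the true gap at p = 0.5. The function raises an error when one of the two groups has no observations near p0, which mirrors the common support requirement.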

[53] Different bandwidths (0.1 to 0.3) produce approximately the same results.

[54] I estimate the partially linear model using a double residual regression procedure using local linear regression for the requisite nonparametric regression steps, following Heckman, Ichimura, Smith and Todd (1998).

[55] The bootstrap accounts for the estimation of P: in every iteration of the bootstrap I reestimate P. Horowitz (2000) suggests a special procedure for using bootstrap methods with nonparametric regression for achieving better asymptotic refinements, which is not applied in the current version of the paper, but will be in future versions.

[56] Since the conditioning set is collapsed from (X, Z) to P(X, Z) I redefine MTE to be a function of P instead of being a function of (X, Z).

In Figure 4 I show the distribution of the estimated propensity score for the population with S = 0 and S = 1. It is only possible to estimate the MTE for values of P with positive support for individuals in S = 0 and in S = 1. Even though there is a large degree of sorting of individuals with high P into college and individuals with low P into high school, P still has positive support for both these groups between the values of 0 and 0.9. Therefore, at least in this range, it is possible to estimate the MTE. This large degree of sorting is caused by sorting on ability, which is a major determinant of P (if we take ability out of the choice model the sorting on P becomes less pronounced [57]).

Figure 5 presents the estimated MTE. It shows that the average return to schooling is increasing with P: individuals who have characteristics that make them more likely to enroll in college have the highest returns to college. Therefore, selection in returns is important. However, most of the rise occurs for high values of P [58]. The numbers reported in this figure as well as in the rest of the figures and tables of the paper are average returns to one year of college. I annualize the estimates by dividing them by 3.5, the difference in the average years of schooling between individuals in college and in high school (see Table 3).

Cognitive test scores are the main variable determining selection: individuals with higher cognitive skills are more likely to enroll in school and have higher returns to college. In results available on request I estimated the MTE when I only include AFQT scores in the model for P and I exclude all the other variables in Table 5 and then compared to the version of MTE where I include all variables in the model. The two functions are very similar. Excluding variables other than test scores from the P index has a very small effect on the estimated MTE. Cognitive ability is both the major observable determinant of schooling, and conditional on schooling, the main determinant of earnings. It accounts for nearly all the selection on observables. An analysis of the High School and Beyond (HSB) data shows similar results.

In Figures 6a and 6b I plot the distribution of P (which is basically the distribution of ability) for different groups in the population.
[57] These results are available on request.

[58] The results from HSB also show a rise of returns with the propensity score. However this rise is more uniform with P (it occurs over the whole support of P instead of being concentrated at high values of P).

These distributions correspond to the weights of Table 2, although they need to be slightly modified because I assume that the MTE is not a function of US, and is a function of P instead of X. The modified weights are presented in Table 6 and derived in the appendix. I estimate fT(t), fT(t|S = 1) and fT(t|S = 0) (where t = xγX + zγZ) using a Gaussian kernel with bandwidth equal to 0.1 and I assume that US follows a standard normal distribution (I normalize the variance of US to be 1 since this variance is not identified from the probit). Figure 4a shows that there is an oversampling of individuals with low probability of going to college (low ability) among high school graduates and an oversampling of individuals with high P among college attenders, relative to the distribution of P in the population. Since individuals with high P also have high returns, college attracts mostly the individuals with the highest returns to college. Figure 4b shows that those who are indifferent between attending college or not (the marginal individuals) are located at the center of the P distribution.

The first column of Table 7 translates Figures 5 and 6 into the different returns described in Table 1. To construct these I weight the estimated MTE in Figure 5 by the estimated weights of Figure 6 (see Table 6). TT is the highest parameter, which suggests sorting on returns: those who have the highest returns to college select into college. The evidence also suggests complementarity: individuals with high ability (high P) are the ones that benefit more from going to college. The difference between TT and TUT is about 2.5% and accumulates to more than 7.5% over 3.5 years. The average individual and the average marginal individual have returns between TT and TUT. I find similar patterns for white males and females in the NLSY and for white males in HSB. Returns are higher for those with high levels of measured ability, who are also more likely to enroll in college. Cognitive ability is the main observable determinant of returns. The estimated MTE in the PSID is flat (there are no test scores in the PSID [59]; when I estimate the MTE in the NLSY and exclude test scores we also get a flatter function).

In summary, heterogeneity is important: the return for the average person in college is different from the return for the average marginal individual. This happens for two reasons: 1) the returns to college are different for individuals with low ability and for individuals with high ability (the MTE is not a constant function of P); and 2) different parameters weight different sections of this function differently and therefore they lead to different numbers (if the MTE is flat then no matter how we weight it we always get the same number).
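The two reasons just given can be made concrete with a small numerical sketch. Everything in it is hypothetical: the grid for P, the uniform population density and the two stylized MTE curves are assumptions chosen for illustration, not estimates from Table 7. The only substantive ingredient is Bayes' rule, under which the density of P among college attenders is proportional to p f(p) and the density among high school graduates is proportional to (1 − p) f(p).

```python
def weighted_mean(values, weights):
    """Weighted average; the weights are normalized inside the function."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def treatment_parameters(mte, p_grid, p_density):
    """Average an MTE curve over P using three different weighting schemes.

    ATE uses the population density of P, TT the density among S = 1
    (proportional to p * f(p)) and TUT the density among S = 0
    (proportional to (1 - p) * f(p)).
    """
    tt_w = [p * f for p, f in zip(p_grid, p_density)]
    tut_w = [(1.0 - p) * f for p, f in zip(p_grid, p_density)]
    return {
        "ATE": weighted_mean(mte, p_density),
        "TT": weighted_mean(mte, tt_w),
        "TUT": weighted_mean(mte, tut_w),
    }


# Hypothetical inputs: P uniform on a grid and two stylized MTE curves.
P_GRID = [i / 100 for i in range(5, 96)]
UNIFORM = [1.0] * len(P_GRID)
RISING_MTE = [0.05 + 0.10 * p for p in P_GRID]
FLAT_MTE = [0.10] * len(P_GRID)

rising = treatment_parameters(RISING_MTE, P_GRID, UNIFORM)
flat = treatment_parameters(FLAT_MTE, P_GRID, UNIFORM)
```

With the rising curve the three parameters are ordered TT > ATE > TUT, because the college group puts more weight on high values of P. With the flat curve all three coincide, whatever the weights.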
Individuals in college have a different distribution of ability than individuals at the margin, and therefore the two different parameters weight the MTE differently. If different policies affect groups of individuals with different levels of cognitive ability, this needs to be taken into account in the evaluation of their benefits.

Individuals with higher cognitive ability have higher returns. They are the ones who have the highest monetary benefits from going to college. This result is strongly suggestive of the existence of important complementarities in the process of skill accumulation. It is consistent with a large body of evidence that finds positive effects of cognitive ability on returns, referenced at the beginning of this section and in Carneiro and Heckman (2003).

[59] This statement is not completely accurate. There is one sentence completion test administered in 1972 to all heads of household. However the test is very simple and it is not as strong a predictor of schooling and earnings as the tests available in HSB and NLSY (I use a math test for HSB). Furthermore, it is only available for individuals in the sample in 1972. For HSB wages are measured in 1991 when individuals are 26-28 years of age while for NLSY wages are measured in 1994 when individuals are 30-37 years of age. If an individual is already a head of household in 1972 then he will be too old in the early 1990s to belong to a sample comparable (in terms of age of the individuals and date at which we measure his outcomes) to the samples of the HSB and NLSY that I am using here.

3.2 Selection on Unobservables

Assumption (15) (no selection on unobservables) may not be valid and we may want to consider unobservable variables that determine both schooling choices and the return to schooling. Selection on unobservables is extensively discussed in analyses of the returns to schooling, and also in many other areas of social science research. It can be motivated by the analysis in the previous section that finds that ability is important, combined with the fact that ability is not measured in most of the datasets we have available. There is also a large body of work that finds that including test scores in the wage equations significantly changes the estimates of β (although earlier work found much smaller effects), which suggests that ability may be a very relevant omitted variable in analyses where test scores are not available. The result in the literature that IV estimates of β are higher than OLS estimates also suggests that this problem may be important [60] (see Card, 1999, 2001, and Carneiro and Heckman, 2002).

The empirical importance of unobservable heterogeneity has also been more directly studied in the context of many economic problems. Some examples are studies of labor supply (Heckman, 1974a,b, Hausman, 1986), the effect of unionism on wages (Lee, 1978, Farber, 1983, Duncan and Leigh, 1985, Robinson, 1989), job training (Bjorklund and Moffitt, 1987, Heckman, Ichimura, Smith and Todd, 1998), migration (Pessino, 1991, Tunali, 2000) and sectorial choice (Heckman and Sedlacek, 1985). Not all studies find that selection in returns is important. For example, Lee (1978), Farber (1983) and Duncan and Leigh (1985) cannot reject the hypothesis that there is no selection in β. For the case of schooling Willis and Rosen (1979), Heckman, Tobias and Vytlacil (2001) and Carneiro, Heckman and Vytlacil (2001) present evidence that suggests that selection in returns is important.
I build on this work and expand it by applying the modern tools of the program evaluation literature to multiple sources of data. The basic difference relative to the last section is that in this section I allow the MTE to depend on an unobservable component as well as on measured ability. I estimate the MTE using the method of local instrumental variables described in the previous section. I use as instrumental variables number of siblings, father’s education, tuition, distance and local labor market variables, all of which are common instruments in this literature [61]. I aggregate them into a single index, P(X, Z) (as in the previous section). The crucial assumption is the existence of instrumental variables satisfying (7) (the instrument is correlated with schooling) and (8) (the instrument is independent of potential outcomes conditional on X). I assume that once I include ability in the model I can estimate the marginal treatment effect by the method of local instrumental variables using P(X, Z) as the instrument [62].

[60] However, in most cases, it is not possible to reject the hypothesis that \hat{\beta}_{IV} = \hat{\beta}_{OLS}. This is also true in the data I analyze in this paper.

[61] I also estimate models where I include number of siblings and parental education in the wage equation (so that they are not instruments) and the results (available on request) remain the same.

To apply this method I first estimate P(X, Z) using a probit model. The results, in Table 2, were already presented and discussed. Then I estimate E[ln Y | P(X, Z) = p, X] using the local linear methods developed and applied in Heckman, Ichimura and Todd (1997). It is necessary to specify how X variables enter. In the previous subsection ability entered nonparametrically in the model through P. However, in this subsection the MTE is a function of both X and US. It is difficult (although not impossible) to estimate MTE as a nonparametric function of both X and US. Therefore I include corrected AFQT, experience and experience squared in the model in a linear way (parametric model for X). The coefficients on experience are restricted to be the same in equations (2) and (3) (as suggested by the matching results and assumed in most of the literature that estimates a version of (1)) while AFQT (A, in the equations below) is allowed to have a different effect in the college and in the high school outcome equations (therefore the returns to schooling will depend on ability but not on experience). The equations are:

\[ \mu_1(X) = \alpha_1 + \eta_1\,\text{experience} + \gamma_1\,\text{experience}^2 + \theta_1 A \]

and

\[ \mu_0(X) = \alpha_0 + \eta_0\,\text{experience} + \gamma_0\,\text{experience}^2 + \theta_0 A \]

and I impose that η1 = η0 = η and γ1 = γ0 = γ. In this case

\[ \mu_1(X) - \mu_0(X) = \alpha_1 - \alpha_0 + (\theta_1 - \theta_0) A \]

[62] Even after conditioning on ability, P(X, Z) is strongly correlated with schooling.

From (5),

\[ \ln Y = \alpha_0 + \eta\,\text{experience} + \gamma\,\text{experience}^2 + \theta_0 A + S\left[\alpha_1 - \alpha_0 + (\theta_1 - \theta_0) A\right] + U_0 + S(U_1 - U_0) \]

Taking expectations:

\[ E(\ln Y \mid X = x, P = p) = \alpha_0 + \eta\,\text{experience} + \gamma\,\text{experience}^2 + \theta_0 A + (\alpha_1 - \alpha_0)\,p + (\theta_1 - \theta_0)\,pA + E(U_1 - U_0 \mid S = 1, P = p)\,p \tag{19} \]

This is a regression of ln Y on a constant, experience, experience squared, A, A interacted with P, and a nonparametric function of P, φ(P) [63]. Finally, from (13) and (14),

\[ E(\beta \mid X = x, V_S = p) = \frac{\partial E(\ln Y \mid X = x, P = p)}{\partial p} = \alpha_1 - \alpha_0 + (\theta_1 - \theta_0) A + E(U_1 - U_0 \mid V_S = p) \tag{20} \]

Then I estimate (19) in two ways: 1) with a local linear regression using a biweight kernel with a bandwidth of 0.3 [64]; 2) using polynomials in P (I present results for a cubic and a quartic in P) [65]. To compute the derivative of φ(p) with respect to p I use the following approximation:

\[ \frac{\partial \phi(p)}{\partial p} \approx \frac{\phi(p + h) - \phi(p)}{h} \]

where h = 0.01 [66]. The standard errors are bootstrapped accounting for the estimation of P [67].

Notice that

\[ MTE(X = x, U_S = u) = MTE_X(X = x) + MTE_{U_S}(U_S = u) \]

where

\[ MTE_X(X = x) = \mu_1(x) - \mu_0(x), \qquad MTE_{U_S}(U_S = u) = E(U_1 - U_0 \mid U_S = u). \]

If μ1(x) − μ0(x) = α1 − α0 + (θ1 − θ0)A then MTE_X(X = x) is a straight line. Table 8 shows the coefficients in P^2 and P^3 of a regression where I parameterize E(U1 − U0 | S = 1, P = p) p to be a cubic in p. The coefficients in the second and third order terms are statistically strong, which indicates that this function is not linear in P and therefore a test for no selection on unobservables is rejected in this data. This finding is robust across specifications where I condition on different sets of X variables. I estimate that θ1 − θ0 > 0, indicating that higher ability leads to higher returns to schooling (as suggested by the matching results). This coefficient is imprecisely estimated but in different specifications it comes out consistently positive and roughly of the same magnitude [68]. Furthermore it is also positive for white females [69]. Figure 7a plots (θ1 − θ0)A. Figure 7b graphs the unobservable component of MTE:

\[ MTE_{U_S}(U_S = u) = E(U_1 - U_0 \mid U_S = u) \]

[63] ψ(A, p) = (α1 − α0)p + (θ1 − θ0)pA + E(U1 − U0 | S = 1, P = p) p is just a control function, typical in selection corrections (see Heckman, 1990; Heckman and Robb, 1985, 1986).

[64] I experimented with bandwidths of 0.2 to 0.4 and the results were basically the same.

[65] Being parametric allows me to get more precision in my estimates and still estimate this function quite flexibly. Spline results are available on request.

[66] An alternative approach would be to extract the local slopes of the local linear regression directly. The final result is similar.

[67] Even though I use exactly the same propensity score in LIV and in matching, the way I use it is different in these two methods. In matching I use it as a conditioning variable in the regression (what is usually called a control variable) while in LIV I use it as an instrumental variable. In order to be able to use the propensity score as an instrument I need to control for ability in the outcome equation since ability is not a valid instrument and the other variables included in the choice model are not valid instruments if we don’t condition on A (that is why A shows up in regression (19) and in the third column of table 5). Earlier I showed that matching on A and matching on A and a set of other variables produces roughly the same results: A is the main and basically only observable variable that is a determinant of returns. However, in matching we are nonparametric in A while in this application of LIV A enters linearly in the outcome equations. If it were possible to be nonparametric in A and still use local instrumental variables then this latter model could nest the version of matching with exclusion restrictions I used before, provided that assumption (8) is satisfied (this assumption is not required in usual applications of matching). The latter model allows returns to depend on A and on US while the former only allows returns to depend on A. Therefore, if A is the only observable determining returns and if we have available a set of instruments, it is possible to test the matching assumptions just by testing if the MTE is independent of US. More formally, suppose we have both X and Z variables satisfying (8). The matching assumption in (15) implies that
\[ E(U_1 \mid S, X, Z) = E(U_1 \mid X, Z), \qquad E(U_0 \mid S, X, Z) = E(U_0 \mid X, Z). \]
The IV assumption in (8) implies that
\[ E(U_1 \mid X, Z) = E(U_1 \mid X), \qquad E(U_0 \mid X, Z) = E(U_0 \mid X). \]
Therefore, given the IV assumption the matching (with exclusion restrictions) assumption implies that
\[ E(U_1 \mid S, X, Z) = E(U_1 \mid X), \qquad E(U_0 \mid S, X, Z) = E(U_0 \mid X) \]
for all X and Z. However this can only be the case if
\[ E(U_1 \mid X, U_S) = E(U_1 \mid X), \qquad E(U_0 \mid X, U_S) = E(U_0 \mid X). \]
Therefore, given an IV assumption, the matching assumption is equivalent to saying that, conditional on X, U1 and U0 are independent of US. If (8) holds it is possible to estimate E(U1 | X, US) and E(U0 | X, US) and test whether or not these functions are independent of US. The assumption of linearity in A is relaxed in work currently in progress.

[68] Available on request.

[69] In the HSB the coefficient on test scores is positive for white males but negative for white females.
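The polynomial version of regression (19) and the derivative in (20) can be sketched as follows. This is an illustration under stated assumptions rather than the estimation code behind Table 8: the experience terms are omitted, the data are a noise-free hypothetical example, and all function names (`fit_liv_cubic`, `mte_analytic`, `mte_forward_difference`, `example_data`) are invented for the sketch. The regression is ln Y on a constant, A, p, pA, p^2 and p^3, and the MTE is the derivative of the fitted function with respect to p.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def design_row(a, p):
    """Regressors: constant, A, p, p*A, p^2, p^3."""
    return [1.0, a, p, p * a, p ** 2, p ** 3]


def fit_liv_cubic(log_y, ability, pscore):
    """OLS of ln Y on the regressors above, via the normal equations."""
    rows = [design_row(a, p) for a, p in zip(ability, pscore)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, log_y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def fitted(coefs, a, p):
    return sum(c * x for c, x in zip(coefs, design_row(a, p)))


def mte_analytic(coefs, a, p):
    """Exact derivative of the fitted cubic with respect to p."""
    _, _, b_p, b_pa, b_p2, b_p3 = coefs
    return b_p + b_pa * a + 2.0 * b_p2 * p + 3.0 * b_p3 * p ** 2


def mte_forward_difference(coefs, a, p, h=0.01):
    """Forward-difference approximation with step h, as in the text."""
    return (fitted(coefs, a, p + h) - fitted(coefs, a, p)) / h


def example_data():
    """Hypothetical noise-free data with a U-shaped unobservable component."""
    a_obs, p_obs, y_obs = [], [], []
    for a in [-1.0, 0.0, 1.0, 2.0]:
        for p in [i / 10 for i in range(1, 10)]:
            a_obs.append(a)
            p_obs.append(p)
            y_obs.append(2.0 + 0.1 * a + 0.2 * p + 0.05 * p * a
                         - 0.3 * p ** 2 + 0.2 * p ** 3)
    return y_obs, a_obs, p_obs
```

In this hypothetical example the coefficients on p^2 and p^3 are nonzero, so the implied MTE varies with p, which is the pattern the test for selection on unobservables looks for. With real data the coefficients would be estimated with noise and their standard errors would need to account for the estimation of P.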

Returns are high for individuals who have unobservables that make them either very likely to go to school or very unlikely to go to school, leading to a U-shaped function [70]. In a Roy model where schooling choices are made on the basis of equation (10) we expect returns to be higher for individuals who are more likely to go to school. This would lead to a rising MTE. However there are costs of going to college, and if costs and returns of going to college are positively correlated then it is possible to have segments of the MTE where average returns increase and at the same time the likelihood of going to college decreases (leading to a declining MTE). Therefore, the U-shaped pattern for the MTE may result from the intersection of two different populations, both of them with monotone MTEs in terms of X and US, but one population where it is increasing in these variables and another where it is decreasing. I illustrate this in the appendix.

Figure 7b suggests that E(U1 − U0 | US = u) is not independent of u, but the standard errors (shown in the appendix) are large. Those in the middle of the distribution of US have the lowest returns [71]. If MTE_US(u) were independent of u then we would only need to worry about selection on observables. In Figure 8 I graph the surface MTE(X, US) that results from adding MTE_X(X = x) and MTE_US(US = u). Returns are the highest for those with high AFQT (A in the figures) and high US [72].

It is convenient to group all instruments in a single index and treat it as a single instrument. If there is more than one observable X it is also convenient to construct an index of observables. This procedure is justified by the index assumption already implicit in (6). Let TX = Xγ_SX and TZ = Zγ_SZ, where X is ability, Z is the vector of instruments, and γ_SX and γ_SZ are the estimated coefficients from the choice equation (the probit coefficients shown in Table 5). Redefine MTE to be a function of TX instead of a function of X [73].
[70] The function that I estimate is the transformed E(U1 − U0 | −VS = v). This is also the function that I plot. Since VS is just a transformation of US, throughout the paper I use E(U1 − U0 | US = u) to refer to E(U1 − U0 | −VS = v) for easier exposition.

[71] The same patterns are found for white females in the NLSY and white males in the HSB. Even when test scores are excluded the MTE has this U-shaped pattern. However, for the PSID the MTE is rising.

[72] One possible story for a U-shaped MTE is the following. The rising section at the end is the usual story: individuals with higher ability are more likely to go to school and they have higher returns. Individuals with very low ability are very unlikely to go to school. However, because they have such low ability and possibly have learned very little during high school, teaching them some basic skills may generate very large returns, generating the declining segment in the MTE. While the latter is suggestive of a decreasing returns story (those with high levels of human capital benefit less from learning than those with low levels of human capital) the former is suggestive of a complementarity story (those with higher ability benefit more from further learning).

[73] MTE(TX, VS) = E(ln Y1 − ln Y0 | TX = tX, VS = vS).

Then we can rewrite the weights in Table 2 as those presented in Table 9. The density of TX is estimated nonparametrically using a Gaussian kernel. I estimate fTZ(tZ | TX = tX) using local polynomials (see Hyndman, Bashtannyk and Grunwald, 1996, and Hyndman and Yao, 2002) [74]. US is a standard normal random variable by assumption. These weights correspond to the distribution of observed and unobserved “abilities” for individuals in different groups of the population. As an example, Figure 9 presents policy weights for a $1000 tuition subsidy [75]. From Table 9 it can be inferred that there is an oversampling of high unobserved “ability” individuals in college and an undersampling of these individuals in high school, relative to the overall distribution in the population (for a given level of X). Once again, the marginal person is located at middle values of the distribution of US.

Table 10a shows estimates of different returns (standard errors are bootstrapped). They are constructed by applying the weights of Tables 2a-b to the estimated MTE. I estimate that TT > ATE > TUT. Under the local linear regression specification, the average return to schooling for a randomly selected individual from the population is 16%. The difference between TT and TUT is about 3.5% per year of college. Over 3.5 years it accumulates to more than 12%. When I use polynomials in P instead of a local linear regression I get similar figures for both MTE_X(X) (increasing in test scores) and MTE_US(US) (U-shaped) [76].

The results from white females in the NLSY and white males in the HSB are similar in magnitude and in the main patterns: MTE(A) is increasing in A and MTE(US) is U-shaped. These results are summarized in Table 10b. In the PSID the MTE(US) is rising with US. Across datasets, the marginal entrant into college consistently has a lower return than the average student in college.
[74] I use the cde.est routine for S-plus in Hyndman’s website at http://www-personal.buseco.monash.edu.au/~hyndman/.

[75] The presentation of the graphs for MTE(US) and for the corresponding weights, which include US in the x-axis, is not completely accurate, although it is convenient to save on notation and simplify exposition. The reason is that the graphs are plotted as a function of 1 − VS, and therefore this variable and not US should be referenced in the text and placed in the x-axis. However, from (12) we have that U_S = F_{U_S}^{-1}(1 − V_S) and therefore the transformation that needs to be done is simple.

[76] These are presented in the appendix.

In summary, cognitive test scores account for a large part of heterogeneity in returns to schooling but they are not the whole story. It is still necessary to consider selection on unobservables. The estimates in this section suggest that heterogeneity and selection in returns are very important: the average individual in school is systematically different from the marginal individual indifferent between high school and college. This happens for two reasons: 1) the returns to schooling vary in the population and are correlated with the variables that determine the schooling decision; and 2) different groups of individuals in the population have a different distribution of these variables and, therefore, a different distribution of returns.

The precision of these estimates is low. In fact, it is not possible to reject the hypothesis that they are equal to the matching estimates of the returns to schooling. This is a manifestation of a common problem in this literature: even though IV estimates of the returns to schooling are different (usually higher) from OLS estimates, in general it is not possible to statistically reject that they are equal to each other (this is also true with the data used in this paper). Nevertheless, economists usually assume that the difference between the two sets of estimates is important even though it is imprecisely estimated. Furthermore, as mentioned before, a formal test of the hypothesis of no selection on unobservables is rejected in the data.

4 Accounting for Heterogeneity and Selection in the Evaluation of Education Policies: What is the Return to Schooling?

If, as documented so far, E(β | X = x, US = u) is not independent of x and u then ATE ≠ TT ≠ TUT ≠ AMTE. Which, if any, of these effects should be designated as the return to schooling? Suppose that we want to evaluate some policy. In particular, assume that the policy is a $1000 tuition subsidy. This policy will affect a subgroup of individuals in the population. These are the individuals that do not go to college with tuition set at the current level but that would go to college if they were given a $1000 subsidy. The distribution of (X, US) for this group of people, or the policy weight, is shown in Figure 9. The analytical expression for the weight is presented in Table 9 and derived in the appendix. The return to schooling for those induced to go to college by a tuition subsidy of $1000 is 15% (see Table 11a). The policy weight is very similar to the marginal person weight (empirically, not analytically), and therefore the policy parameter is similar to AMTE. On the other hand, both of these parameters are different from ATE.

Suppose that I also considered a subsidy of $500, and another of $2000. What parameters do we need to evaluate each of these three policies? In theory these policies should affect different groups of individuals and therefore the returns that we need to evaluate each policy should be different. However, empirically the composition of X and US for the groups of individuals affected by these three policies is very similar. In other words, the policy relevant weights look very alike across policies. This is true even though these policies have different quantitative impacts on enrollment: the $500 subsidy increases enrollment by 2.5%, the $1000 subsidy increases enrollment by 5% and the $2000 subsidy increases enrollment by 8%.

In Figure 10a I present the distribution of cognitive ability (X) for individuals induced to go to school by each of the policies. It shows that the distribution of cognitive ability for individuals affected by these three different subsidies is very similar. Furthermore, it is also similar to the distribution of ability for individuals who are at the margin between high school and college. Figure 11a also shows that the distribution of unobservable ability for groups of people affected by the different policies is also similar across policies. However, as emphasized in the previous section, these policies affect the marginal individual, not the average individual [77]. This is shown in Figures 10b and 11b. The weights for the average return to schooling in the population (β̄) and the weights for the average return for individuals affected by a $1000 tuition subsidy are very different.

Figures 10c and 11c illustrate the magnitude of the subsidies that we would need to generate significant differences in the distributions of abilities across policies. A $5000 tuition subsidy shifts the distribution of ability significantly to the left in both pictures, relative to a subsidy of $500 [78]. In order to have substantial differences between the distribution of abilities for individuals affected by the $500 policy and by the $2000 policy the coefficient on tuition in the decision model would need to be 3 to 4 times higher than it actually is.

Table 11a presents the estimates of βP for the three different tuition subsidies: $500, $1000 and $2000 (under different specifications for the MTE). As expected, βP does not correspond to any of the first three parameters in Table 10a, but is very close to the fourth, the average marginal return. The policy weights can only be equal across different policies in a very special case.
To simplify the argument, suppose there is no X or suppose that all calculations are done conditional on X, and X is kept implicit (i.e., consider weights like the ones presented in figure 11) [79]. Let γZ = −1, Zi be net tuition faced by individual i under no policy, ZAi (< Zi) be net tuition faced by individual i under policy A and ZBi (< Zi) be net tuition faced by individual i under policy B, where ZAi > ZBi for all i (the higher net tuition is, the smaller the probability of going to college).

[77] Let h_{X,U_S}(x, u) denote the weight for a given parameter. The weight is a function of both X and US. However it is possible to look only at which values of X are relevant to compute a given parameter by integrating the weight over the distribution of US:
\[ h_X(x) = \int_{U_S} h_{X,U_S}(x, u)\, f_{U_S}(u)\, du \]
Similarly,
\[ h_{U_S}(u) = \int_{X} h_{X,U_S}(x, u)\, f_X(x)\, dx \]
I compute these functions for the different policy weights and present them in figures 10 and 11. It is more convenient to compare two dimensional graphs than three dimensional ones.

[78] However since we can find very few pairs of individuals facing tuition values that are $5000 apart, evaluating this policy requires considerable extrapolation.

[79] This can be generalized to the case where X is considered in a more explicit way.

Then, dropping the i subscript, the two policy weights are:

\[ f_{U_S}\left[u \mid -Z + U_S < 0,\; -Z_A + U_S > 0\right] = \frac{\left[F_{Z_A}(u) - F_Z(u)\right] f_{U_S}(u)}{E[P(Z_A)] - E[P(Z)]} \]

\[ f_{U_S}\left[u \mid -Z + U_S < 0,\; -Z_B + U_S > 0\right] = \frac{\left[F_{Z_B}(u) - F_Z(u)\right] f_{U_S}(u)}{E[P(Z_B)] - E[P(Z)]} \]

(see Heckman and Vytlacil, 2001, 2003, and Carneiro, Heckman and Vytlacil, 2001; footnote 80). The numerator of each expression is the fraction of people switching schooling levels in response to the policy at each level of u, multiplied by the density of people at that level of u. The denominator is the aggregate fraction of people switching schooling levels in response to the policy. These two functions are equal to each other at each value of \(U_S = u\) if

\[ \frac{[F_{Z_A}(u) - F_Z(u)] \, f_{U_S}(u)}{E[P(Z_A)] - E[P(Z)]} = \frac{[F_{Z_B}(u) - F_Z(u)] \, f_{U_S}(u)}{E[P(Z_B)] - E[P(Z)]} \tag{21} \]

or

\[ [F_{Z_B}(u) - F_Z(u)] \, f_{U_S}(u) = \frac{E[P(Z_B)] - E[P(Z)]}{E[P(Z_A)] - E[P(Z)]} \, [F_{Z_A}(u) - F_Z(u)] \, f_{U_S}(u). \]

For example, if \(\frac{E[P(Z_B)] - E[P(Z)]}{E[P(Z_A)] - E[P(Z)]} = 2\), i.e., policy B induces twice as many people to go to college as policy A in the aggregate, then at each level of u policy B has to induce twice as many people to go to college as policy A. Since the number of people switching schooling levels at each value of u relative to the total number of people switching schooling levels is the same for both policies, the distribution of \(U_S\) for individuals who switch in response to the policy (the policy weight) also has to be the same for both policies, even if the total change in college enrollment is very different between them.

Footnote 80. For policy A:

\[
\begin{aligned}
f_{U_S}[u \mid -Z + U_S < 0, \, -Z_A + U_S > 0]
&= \frac{\int_{u}^{+\infty} \int_{-\infty}^{u} f(u, z, z_A) \, dz_A \, dz}{\Pr(-Z + U_S < 0, \, -Z_A + U_S > 0)} \\
&= \frac{\left[ \int_{-\infty}^{+\infty} \int_{-\infty}^{u} f(z, z_A) \, dz_A \, dz - \int_{-\infty}^{u} \int_{-\infty}^{u} f(z, z_A) \, dz_A \, dz \right] f_{U_S}(u)}{E[P(Z_A)] - E[P(Z)]} \\
&= \frac{[F_{Z_A}(u) - F_Z(u)] \, f_{U_S}(u)}{E[P(Z_A)] - E[P(Z)]}
\end{aligned}
\]

(since \(Z > Z_A\)), where \(E[P(Z_A)]\) is the fraction of people in college under policy A and \(E[P(Z)]\) is the fraction of people in college under no policy. A similar reasoning applies for policy B.

Condition (21) also implies that:

\[ f_{U_S}[u \mid -Z + U_S < 0, \, -Z_A + U_S > 0] = f_{U_S}[u \mid -Z + U_S < 0, \, -Z_B + U_S > 0] = f^*_{U_S}(u) \tag{22} \]

where

\[ f^*_{U_S}(u) = \frac{[F_{Z_B}(u) - F_{Z_A}(u)] \, f_{U_S}(u)}{E[P(Z_B)] - E[P(Z_A)]} \]

and therefore the weight can be written independently of the original distribution of Z (footnote 81). In figure 12 I use the weights computed in figure 11 to calculate different versions of \(f^*_{U_S}(u)\). In particular, I plot the weight for the $500 tuition subsidy together with versions of \(f^*_{U_S}(u)\) computed using the $500 and the $1000 policies together and the $1000 and the $2000 policies together. The curves are all similar to each other. In the data I analyze, condition (22) approximately holds for these three policies. I also include the $5000 - $2000 weight, which is clearly different from the other three weights.

Footnote 81. If (21) holds then

\[ f^*_{U_S}(u) = \frac{[F_{Z_A}(u) - F_Z(u)] \, f_{U_S}(u)}{E[P(Z_A)] - E[P(Z)]}, \qquad f^*_{U_S}(u) = \frac{[F_{Z_B}(u) - F_Z(u)] \, f_{U_S}(u)}{E[P(Z_B)] - E[P(Z)]}. \]

Therefore we can write \(f^*_{U_S}(u)\) as:

\[ f^*_{U_S}(u) = \frac{[F_{Z_B}(u) - F_{Z_A}(u)] \, f_{U_S}(u)}{E[P(Z_B)] - E[P(Z_A)]}. \]

The estimation of policy weights does not require the estimation of the MTE. It is much simpler since we just need to estimate densities. The result that policy weights are similar across different policies is robust in the sample of white females in the NLSY and white males in the PSID (see Table 11b). It is also true for blacks and hispanics (males and females) in the NLSY.

In most of the work in this area researchers do not use either matching or local instrumental variables. Instead, the usual methods are OLS and IV. In general OLS and linear IV do not estimate \(\beta^P\), the policy relevant treatment effect. It is possible to express IV and OLS estimates as weighted averages of \(E(Y_1 \mid X = x, U_S = u)\), \(E(Y_0 \mid X = x, U_S = u)\) and \(E(Y_1 - Y_0 \mid X = x, U_S = u)\). \(\beta^P\), \(\beta^{OLS}\) and \(\beta^{IV}\) correspond to three different ways of averaging the MTE and its two components (\(E(Y_1 \mid X = x, U_S = u)\) and \(E(Y_0 \mid X = x, U_S = u)\)). Therefore in general they give rise to different numbers (see also Heckman and Vytlacil, 1999, 2000, 2001, 2003, and Carneiro, Heckman and Vytlacil, 2001).

In Table 11a I present OLS and IV (using the propensity score as an instrument) estimates of \(\beta^P\) for the NLSY. The OLS and IV estimates depend on ability (A) since I allow for an interaction between ability and schooling. In particular I estimate the following equation:

ln Y = α + δS + γSA + θA + ε

where S is an indicator for enrollment in college, A is cognitive ability and SA is the interaction between these two variables. Then the estimated return to schooling is

\[ \hat{\beta}(A) = \hat{\delta} + \hat{\gamma} A \]
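As an illustration only (this is not the estimation code behind Table 11a), the interaction specification can be estimated by OLS and by two-stage least squares, with the propensity score P and its interaction with ability, PA, serving as instruments for S and SA. The sketch below runs on simulated data. The data-generating process, its coefficient values, the logistic choice error and the use of the true propensity score in place of an estimated one are all assumptions made for the example.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def tsls(X, W, y):
    """Two-stage least squares: regress X on the instruments W, then y on the fitted X."""
    X_fitted = W @ ols(W, X)
    return ols(X_fitted, y)

def simulate(n=200_000, seed=0):
    """Hypothetical data: college is chosen partly on an unobservable that also raises wages."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=n)                # measured cognitive ability
    Z = rng.normal(size=n)                # hypothetical cost shifter (e.g. tuition)
    U = rng.logistic(size=n)              # unobserved taste for college
    index = 0.5 * A - 0.8 * Z
    S = (index + U > 0).astype(float)     # college indicator
    P = 1.0 / (1.0 + np.exp(-index))      # true propensity score Pr(S = 1 | A, Z)
    eps = 0.3 * U + rng.normal(scale=0.5, size=n)   # wage error, correlated with U
    lnY = 1.0 + 0.10 * S + 0.05 * S * A + 0.20 * A + eps
    return A, S, P, lnY

def estimate_returns(A, S, P, lnY):
    """Return (OLS, IV) coefficient vectors ordered as (alpha, delta, gamma, theta)."""
    const = np.ones_like(A)
    X = np.column_stack([const, S, S * A, A])   # regressors
    W = np.column_stack([const, P, P * A, A])   # instruments: P for S, P*A for S*A
    return ols(X, lnY), tsls(X, W, lnY)

def return_at(coef, a):
    """Estimated return at ability level a: delta_hat + gamma_hat * a."""
    return coef[1] + coef[2] * a
```

Calling `estimate_returns(*simulate())` and then `return_at(coef, a)` gives the estimated return at a chosen ability level. In this assumed design the IV estimates recover the simulated coefficients, while OLS overstates the return because the wage error is positively correlated with the unobserved taste for college.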

I estimate this equation using OLS and using IV, where P is the instrument for S and PA is the instrument for SA. I evaluate this return at the average A of individuals induced to go to college by a $1000 tuition subsidy (computed as in the matching section; footnote 82). Neither OLS nor IV estimates the policy parameters, although IV is much better than OLS (footnote 83).

In summary, even though selection is important, one parameter is all we need to evaluate a broad range of tuition policies (this result is robust to using a logit or a linear probability model instead of a probit for the choice equation; footnote 84). The reason is that even though returns are different for different individuals, and different policies attract different numbers of people into college, they do not change the composition of abilities among the individuals that they target. Surprisingly, substantial differences in enrollment rates across policies do not correspond to substantial differences in average returns across policies. Theoretically this can only happen in a special case that seems to be empirically relevant for tuition subsidies between $500 and $2000, but not for much larger subsidies. Neither OLS nor conventional IV provides estimates of the policy relevant return, although IV provides a much better approximation to it than OLS. Table 11b shows results for females in the NLSY and for the PSID. Notice that, throughout this section, the PSID estimated returns are very high. One possible explanation for this phenomenon is advanced in the next section.

Footnote 82. Let T be the tuition subsidy. Then I evaluate the OLS and IV estimates at \(E[A \mid S(T = 0) = 0, S(T = \lambda) = 1]\).

Footnote 83. Instead of using an index of instruments I could have used each instrument separately and obtained more instrumental variables estimates of β. These results will be available on http://lily.src.uchicago.edu/~klmcarn. Different instruments estimate different parameters and in general none of them estimates the policy parameter, even though estimates using parental education variables as instruments come close to it. Tuition generates a low estimate of β, distance generates a very high estimate, and local wage and local unemployment rate generate erratic estimates with very large standard errors.

Footnote 84. Work in progress examines the robustness of these results for different specifications of the index \(\mu_S(X, Z)\). It also analyzes different policies.
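The statement that small subsidies share nearly the same policy weight, while a much larger subsidy does not, can be checked numerically in a stylized version of the simplified model with \(\gamma_Z = -1\). The sketch below is an illustration under assumptions that are not taken from the paper: net tuition Z and \(U_S\) are independent standard normal variables, a subsidy of size s lowers net tuition to Z - s, and subsidy sizes are expressed in units of the choice index rather than in dollars.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + _erf(np.asarray(x, dtype=float) / math.sqrt(2.0)))

def norm_pdf(x):
    """Standard normal density."""
    return np.exp(-0.5 * np.asarray(x, dtype=float) ** 2) / math.sqrt(2.0 * math.pi)

# Uniform grid for u; the integrands are negligible outside [-8, 8].
U_GRID = np.linspace(-8.0, 8.0, 4001)
DU = U_GRID[1] - U_GRID[0]

def switch_weight(s_low, s_high):
    """Density of U_S among people out of college under subsidy s_low
    and in college under the larger subsidy s_high.

    With s_low = 0 this is the policy weight for a subsidy s_high; with
    s_low > 0 it is the analogue of f* built from two policies.
    Returns (weight on U_GRID, aggregate share of switchers).
    """
    # F_{Z - s}(u) = F_Z(u + s), because the subsidy shifts net tuition down by s.
    numer = (norm_cdf(U_GRID + s_high) - norm_cdf(U_GRID + s_low)) * norm_pdf(U_GRID)
    share = numer.sum() * DU   # equals E[P(Z - s_high)] - E[P(Z - s_low)]
    return numer / share, share

def l1_distance(w1, w2):
    """Integrated absolute difference between two weights on U_GRID."""
    return np.abs(w1 - w2).sum() * DU
```

Under these assumed distributions, `l1_distance` between the weights for subsidies of 0.05 and 0.2 is small, and so is the distance to the weight built from the pair (0.05, 0.1), whereas the weight for a subsidy of 2.0 is visibly shifted toward lower values of u.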

5 Invalidity of the Instruments when Cognitive Test Scores are not available

Suppose it is not possible to measure cognitive ability, as happens for example in the CPS and in many other data sets that are usually used to estimate the returns to schooling. Then ability becomes part of U1 , U0 and US instead of being in X. Assumption (8) implies that the instruments have to be independent of cognitive ability. However, the instruments that are commonly used in the literature are correlated with AFQT. The first column of Table 8a shows the correlations between different instruments (Z) and college attendance (S), denoted by ρZ,S . With the exception of local unemployment rate, all candidate instruments are strongly correlated with schooling. The second column of this table presents the correlation between instruments and AFQT scores (A), denoted by ρZ,A . It shows that most of the candidates for instrumental variables in the literature are correlated with cognitive ability. Therefore, in datasets where cognitive ability is unobserved most of these variables are not valid instruments since they violate assumption (8). In fact, the usual finding in the literature that OLS estimates of β are lower than IV estimates of β can be caused by the use of invalid instruments and not by any economic reason. The argument is as follows. Consider, for a moment, a version of (1) where returns are homogeneous in the population. Ability is unobserved and therefore part of the error term. Then,

ln Y = α + βS + γA + ε

where A is ability and β is the common return to schooling (\(U_1 = U_0 = \varepsilon\)). In traditional formulations A is an omitted variable. To focus on the central argument, suppose that \(COV(S, \varepsilon) = 0\), \(COV(S, A) > 0\) and that \(\gamma > 0\) (individuals of high ability get more schooling and ability has a positive effect on wages). Suppose we have a candidate instrument Z with the properties that \(COV(Z, \varepsilon) = 0\) and \(COV(Z, S) \neq 0\) but \(COV(Z, A) \neq 0\), so Z is an invalid instrument. Then

\[ \operatorname{plim} \hat{\beta}_{OLS} = \beta + \gamma \frac{COV(S, A)}{V(S)} \]

\[ \operatorname{plim} \hat{\beta}_{IV} = \beta + \gamma \frac{COV(Z, A)}{COV(Z, S)} \]

so \(\operatorname{plim} \hat{\beta}_{IV} > \operatorname{plim} \hat{\beta}_{OLS}\) if

\[ \frac{COV(Z, A)}{COV(Z, S)} > \frac{COV(S, A)}{V(S)} \]

(since \(\gamma > 0\)), where \(V(S)\) is the variance of S. If \(COV(Z, S) > 0\), this condition can be rewritten as follows:

\[ \frac{COV(Z, A)}{[V(A) V(Z)]^{1/2}} > \frac{COV(S, A) \, COV(Z, S)}{V(S) \, [V(A) V(Z)]^{1/2}} \]

or

\[ \rho_{ZA} > \rho_{SA} \rho_{SZ} \]

where \(\rho_{XY}\) is the correlation between X and Y. If \(COV(Z, S) < 0\), the ordering is reversed and

\[ \rho_{ZA} < \rho_{SA} \rho_{SZ}. \]
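The two probability limits and the correlation condition can be checked on simulated data. The following is a minimal sketch under an assumed linear data-generating process. All coefficient values and variable constructions are hypothetical, and schooling is treated as continuous for simplicity. It is meant only to illustrate the algebra, not to reproduce Table 12a.

```python
import numpy as np

def cov(x, y):
    """Sample covariance between two 1-D arrays."""
    return np.cov(x, y)[0, 1]

def corr(x, y):
    """Sample correlation between two 1-D arrays."""
    return np.corrcoef(x, y)[0, 1]

def simulate_invalid_iv(n=200_000, beta=0.10, gamma=0.20, seed=0):
    """Hypothetical data in which the instrument Z is correlated with ability A."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=n)                       # ability, unobserved by the analyst
    Z = 0.5 * A + rng.normal(size=n)             # invalid instrument: COV(Z, A) != 0
    S = 0.5 * A + 0.3 * Z + rng.normal(size=n)   # schooling
    eps = rng.normal(scale=0.5, size=n)          # uncorrelated with S, Z and A
    lnY = 1.0 + beta * S + gamma * A + eps
    return A, Z, S, lnY

def compare_estimators(A, Z, S, lnY, beta=0.10, gamma=0.20):
    """OLS and IV slopes next to the limits implied by the formulas in the text."""
    var_S = np.var(S, ddof=1)
    return {
        "ols": cov(S, lnY) / var_S,
        "iv": cov(Z, lnY) / cov(Z, S),
        # Formulas evaluated at sample moments (A is available only in the simulation).
        "plim_ols": beta + gamma * cov(S, A) / var_S,
        "plim_iv": beta + gamma * cov(Z, A) / cov(Z, S),
        "rho_ZA": corr(Z, A),
        "rho_SA_rho_SZ": corr(S, A) * corr(S, Z),
    }
```

In this assumed design COV(Z, S) > 0 and \(\rho_{ZA} > \rho_{SA} \rho_{SZ}\), so the IV estimate exceeds the OLS estimate even though the true return is the same for everyone.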

The last column of Table 12a shows that these inequalities are satisfied for five out of the eight instruments examined, and therefore \(\hat{\beta}_{OLS} < \hat{\beta}_{IV}\) may be just a consequence of using invalid instruments in the absence of adequate controls for ability (footnote 85). Notice that local wage and local unemployment are not strongly correlated with AFQT. However, they are only weakly correlated with college attendance as well. In Table 12b I present the F-statistic for the test of the hypothesis that the coefficient on the instrument is zero in a regression of schooling on the instrument. Staiger and Stock (1997) suggest using an F-statistic of 10 as a threshold for separating weak and strong instruments (footnote 86). The table shows that for local wage and local unemployment these statistics are well below 10, which suggests that they are weak instruments. Therefore either the candidate instrumental variable is correlated with ability or it is weakly correlated with schooling (footnote 87).

In Table 12c I present partial correlations between instruments, schooling and ability, after controlling for family background variables (number of siblings and parental education) (footnote 88). The major change is for the case of tuition: the correlation between tuition and AFQT falls dramatically once we residualize on family background variables and becomes statistically indistinguishable from zero. Conditioning on family background, tuition may be a valid instrument. For the other instruments there is some attenuation of the correlation between instrument and ability but it is not nearly as strong. However, Table 12d shows that the F-test for a regression of schooling on the residualized instrument is low (by Staiger-Stock standards): residualizing on family background attenuates the correlation between the instruments and ability but also between the instruments and schooling. Therefore the estimates of returns from the PSID may be high because we do not have a good measure of cognitive ability in the dataset. Furthermore, if we exclude cognitive test scores from our analysis of the NLSY or the HSB the estimated returns also come out much higher than they do with test scores included (footnote 89).

Footnote 85. However, even if we have a measure of ability there may still exist left out components of the error term that are correlated with the instruments.

Footnote 86. In a recent paper Stock and Yogo (2002) propose a different test. However, they still find that the rule of thumb first proposed in Staiger and Stock (1997) works well in general.
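The two diagnostics used in this section, the first-stage F-statistic and the partial correlation after residualizing on family background, can be sketched as follows. This is a minimal illustration on simulated data. The variable construction and all coefficient values are hypothetical and are not meant to mimic Tables 12b to 12d.

```python
import numpy as np

def residualize(v, controls):
    """Residual of v after a least-squares regression on a constant and the controls."""
    X = np.column_stack([np.ones(len(v)), controls])
    coef = np.linalg.lstsq(X, v, rcond=None)[0]
    return v - X @ coef

def partial_corr(x, y, controls):
    """Correlation between x and y after both are residualized on the controls."""
    return np.corrcoef(residualize(x, controls), residualize(y, controls))[0, 1]

def first_stage_f(s, z):
    """F-statistic for H0: the coefficient on z is zero in a regression of s on a constant and z."""
    n = len(s)
    zc = z - z.mean()
    sc = s - s.mean()
    b = (zc @ sc) / (zc @ zc)            # slope of the first-stage regression
    resid = sc - b * zc
    sigma2 = (resid @ resid) / (n - 2)   # residual variance
    return b ** 2 * (zc @ zc) / sigma2   # with a single instrument, F is the squared t-statistic

def simulate_instrument(n=100_000, seed=0):
    """Hypothetical data: Z and ability A are related only through family background B."""
    rng = np.random.default_rng(seed)
    B = rng.normal(size=n)               # family background index
    Z = 0.6 * B + rng.normal(size=n)     # candidate instrument
    A = 0.7 * B + rng.normal(size=n)     # ability
    S = (0.4 * Z + 0.5 * A + rng.normal(size=n) > 0).astype(float)   # college indicator
    return B, Z, A, S
```

In this assumed design the raw correlation between Z and A is clearly positive while `partial_corr(Z, A, B)` is close to zero. The value returned by `first_stage_f` would then be compared with the Staiger and Stock (1997) threshold of 10 cited above.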

6 Conclusion

The main problem in the estimation of the returns to schooling is how to account for selection. But is selection just a theoretical possibility or is it an important empirical problem? I focus on the returns to college and, using different methods and different samples, I document the empirical importance of accounting for heterogeneity and selection in the estimation of these returns and in the evaluation of education policy. This paper makes four main points:

1) Heterogeneity and selection are important in the returns to college. The average individual in the population has a different return to college than the marginal individual.

2) Individual returns are determined by both observable and unobservable components of ability (or abilities), which differ in the population. It is important to account for both. Ignoring selection on unobservables may lead to large biases (there is a big difference between the numbers displayed in tables 7 and 10). The distributions of these abilities are very different across the populations of high school graduates, college graduates and individuals at the margin. Therefore the distributions of returns to schooling are also very different across these different groups.

Footnote 87. In the appendix I summarize an analogous analysis for white females in the NLSY, white males and females in the HSB and white males in the PSID. A similar analysis is also presented in Carneiro and Heckman (2002).

Footnote 88. I exclude the family background instruments from this table since now I want to use these variables as controls.

Footnote 89. Results available on request.

3) Since heterogeneity is important we need to clearly define which particular return we are interested in. The average return to schooling in the population is not the right parameter to evaluate a tuition policy; the right one is what we call the policy relevant parameter. Even though many economists focus on the former, the latter is what we want. We would also think that to evaluate the benefits of different tuition policies we would need to estimate different parameters, one for each policy. However, I show that, empirically, the relevant return to schooling required to evaluate a broad range of different policies is not very different across them, although this is not true for all policies. For a broad range of policies, we only need one number, the return to schooling for the marginal person, even if these policies have very different effects on college enrollment. This is a surprising result. It is an empirical result about the quantitative magnitude of the effects of different policies, not a theoretical statement about evaluating policies in other fields of social policy. Theoretically, it can only happen in a very special case, which seems to be empirically important. It cannot be generalized to other areas of economics and policy evaluation where heterogeneity and self-selection are thought to be important. It breaks down when we consider very large scale policies. Neither OLS nor linear IV (using as the instrumental variable an index of instrumental variables commonly used in the literature) estimates policy relevant parameters. However, IV performs much better than OLS.

4) Most of the instrumental variables used in this literature are either correlated with measured cognitive ability, and therefore are not valid instruments if ability is not included in the model, or they are weakly correlated with schooling.
When family background variables are included in the model to proxy for ability, the correlation between the instrument and ability becomes, in some cases, smaller, but so does the correlation between the instrument and schooling.

However, there are some important caveats to the results to consider:

1) The results of this paper rely on the validity of the instrumental variables used, after controlling for measured cognitive ability. Table 12 can be interpreted as saying that controlling for cognitive ability is important. However, it can also suggest the following question: if the instruments are correlated with measured ability why shouldn’t they be also correlated with unmeasured ability?

2) As in other studies of the returns to schooling that rely on the method of instrumental variables, the estimates I obtain are not very precise. It is usually found that \(\hat{\beta}_{IV} > \hat{\beta}_{OLS}\) but that these two parameters are not statistically significantly different from each other. Likewise, I find that matching and local instrumental variables estimates of the return to schooling are not statistically different. However, the magnitude of the numbers is consistently very different.

3) General equilibrium effects of policies are not considered and for the particular case of tuition policies they are potentially very important (see Heckman, Lochner and Taber, 1998).

4) There is no consideration of school quality and of how students are allocated across schools. Tuition policies are assumed to operate on a single dimension: college enrollment. One obvious problem is that if tuition is correlated with school quality and if school quality is a determinant of the returns to schooling then tuition may no longer be a valid instrument (see Carneiro and Heckman, 2002). Furthermore, there is a lot of heterogeneity in types of college, degrees offered and types of students, which may make my estimates of the returns to college difficult to interpret. Also related to this last point is the fact that I aggregate individuals with different amounts of college education into a single category. However, this latter problem can be handled by an extension of the model to a multioutcome setting, which is the subject of another paper (Carneiro and Heckman, 2001).

5) Finally, preliminary analysis of blacks and hispanics shows different results. The results of this paper are not generalizable to other race groups. These need to be studied separately.

I conclude with two additional notes, related to important extensions of this work currently in progress. The first one concerns the simulation of policies. The choice model estimated in this paper is very simple and therefore the way the policies can operate is very limited. That may be driving part of the result that one number is all we need to evaluate a broad range of policies.
More analysis is needed and better models of schooling attainment need to be incorporated. Furthermore, local average tuition in the county of residence at 17 (the tuition variable used in this paper) may also not be a good approximation to the tuition faced by each individual.

The second point concerns important recent developments in the study of heterogeneity in returns. The evidence in this paper (and in many others) suggests that heterogeneity is important. Although the focus of the literature (and also of this paper) has been on the estimation of different mean returns, much more can be learned from the estimation of distributions of returns. Carneiro, Hansen and Heckman (2001, 2003) estimate distributions of returns to schooling using a very different methodology from the one in this paper (which was developed to deal with means). They show that there is substantial dispersion in the returns to schooling and that many individuals make a mistake by going to college (they have negative returns). Moreover, there is substantial uncertainty about wages and returns at the time agents make their schooling decisions, which helps explain why there are so many mistakes ex post.

References

[1] Altonji, J. and T. Dunn (1996). “The Effects of Family Characteristics on the Return to Education,” Review of Economics and Statistics, 78(4):692-704.
[2] Angrist, J., K. Graddy and G. Imbens (2000). “The Interpretation of Instrumental Variables Estimators in Simultaneous Equations Models with an Application to the Demand for Fish,” The Review of Economic Studies, 67(232):499-527.
[3] Angrist, J. and G. Imbens (1995). “Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity,” Journal of the American Statistical Association, 90, 431-442.
[4] Ashenfelter, O. and A. Krueger (1994). “Estimates of the Economic Return to Schooling from a New Sample of Twins,” American Economic Review, 84(5):1157-1173.
[5] Ashenfelter, O. and C. Rouse (1998). “Income Schooling and Ability: Evidence from a New Sample of Identical Twins,” The Quarterly Journal of Economics, 113(1):253-284.
[6] Ashenfelter, O. and C. Rouse (2000). “Schooling, Intelligence, and Income in America,” in K. Arrow, S. Bowles and S. Durlauf, eds., Meritocracy and Economic Inequality. Princeton, NJ: Princeton University Press, p. 89-117.
[7] Bishop, J. (1991). “Achievement, Test Scores, and Relative Wages,” in M. Kosters, ed., Workers and their wages: Changing patterns in the United States. AEI Studies, no. 520. Washington, D.C.: AEI Press, p. 146-86.
[8] Bjorklund, A. and R. Moffitt (1987). “The Estimation of Wage Gains and Welfare Gains in Self-Selection Models,” The Review of Economics and Statistics, 69(1):42-49.
[9] Black, D. and J. Smith (2002). “How Robust is the Evidence on the Effects of College Quality? Evidence from Matching,” working paper.

[10] Blackburn, M. and D. Neumark (1993). “Omitted-Ability Bias and the Increase in the Return to Schooling,” Journal of Labor Economics, 11(3):521-544.
[11] Bureau of Labor Statistics (2001). NLS Handbook 2001. Washington D.C.: U.S. Department of Labor.
[12] Card, D. (1999). “The Causal Effect of Education on Earnings,” in O. Ashenfelter and D. Card, eds., Handbook of Labor Economics, Vol 3A. Amsterdam: Elsevier Science, North-Holland, 1801-63.
[13] Card, D. (2001). “Estimating the Return to Schooling: Progress on Some Persistent Econometric Problems,” Econometrica, 69(5):1127-60.
[14] Card, D. and A. Krueger (1992). “Does School Quality Matter? Returns to Education and the Characteristics of Public Schools in the United States,” Journal of Political Economy, 100(1):1-40.
[15] Cameron, S. and J. Heckman (1993). “The Nonequivalence of High School Equivalents,” Journal of Labor Economics, 11:1-47.
[16] Cameron, S. and J. Heckman (1999). “Can Tuition Policy Combat Rising Wage Inequality?” in M. Kosters, ed., Financing College Tuition: Government Policies and Educational Priorities. Washington: American Enterprise Institute Press.
[17] Cameron, S. and J. Heckman (2001). “The Dynamics of Educational Attainment for Black, Hispanic, and White Males,” Journal of Political Economy, 109:455-499.
[18] Carneiro, P., K. Hansen and J. Heckman (2001). “Removing the Veil of Ignorance in Assessing The Distributional Impacts of Social Policies,” NBER working paper No. 8840, and forthcoming in Swedish Economic Policy Review.
[19] Carneiro, P., K. Hansen and J. Heckman (2003). “Estimating Distributions of Treatment Effects with an Application to the Returns to Schooling and Measurement of the Effects of Uncertainty on Schooling Choice,” forthcoming in International Economic Review.
[20] Carneiro, P. and J. Heckman (2001). “Alternative Estimators of the Returns to Schooling,” working paper.

[21] Carneiro, P. and J. Heckman (2002). “The Evidence on Credit Constraints in Post-Secondary Schooling,” forthcoming in Economic Journal.
[22] Carneiro, P., J. Heckman and E. Vytlacil (2001). “Estimating the Rate of Return to Education When It Varies Among Individuals,” working paper, University of Chicago.
[23] Dehejia, R. and S. Wahba. “Propensity Score Matching Methods for Nonexperimental Causal Studies,” NBER working paper No. 6829.
[24] Digest of Education Statistics, 1970-1989. Washington: National Center for Education Statistics.
[25] Duncan, G. and D. Leigh (1985). “The Endogeneity of Union Status: An Empirical Test,” Journal of Labor Economics, 3, 385-402.
[26] Farber, H. (1983). “Worker Preferences for Union Representation,” Research in Labor Economics, suppl. 2.
[27] Griliches, Z. (1977). “Estimating the Returns to Schooling: Some Econometric Problems,” Econometrica, 45(1):1-22.
[28] Grogger, J. and E. Eide (1995). “Changes in College Skills and the Rise in the College Wage Premium,” Journal of Human Resources, 30(2):280-310.
[29] Hansen, K., J. Heckman and K. Mullen. “Educational Attainment, Ability and Test Scores,” working paper, University of Chicago.
[30] Hausman, J. (1986). “Taxes and Labor Supply,” in Handbook of Public Economics, vol. 1, edited by Alan Auerbach and Martin Feldstein, New York, North-Holland.
[31] Heckman, J. (1974a). “Life Cycle Consumption and Labor Supply: An Explanation of the Relationship between Income and Consumption over the Life Cycle,” American Economic Review, 64, 188-94.
[32] Heckman, J. (1974b). “Shadow Prices, Market Wages, and Labor Supply,” Econometrica, 42, 679-94.
[33] Heckman, J. (1997). “Instrumental Variables: A Study of Implicit Behavioral Assumptions Used in Making Program Evaluations,” Journal of Human Resources, 32, 441-62.

[34] Heckman, J. (2001). “Micro Data, Heterogeneity, and the Evaluation of Public Policy: Nobel Lecture,” Journal of Political Economy, 109(4):673-748.
[35] Heckman, J., H. Ichimura, J. Smith and P. Todd (1998). “Characterizing Selection Bias Using Experimental Data,” Econometrica, 66, 1017-98.
[36] Heckman, J., H. Ichimura and P. Todd (1997). “Matching As An Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme,” Review of Economic Studies, 64(4):605-654.
[37] Heckman, J., H. Ichimura and P. Todd (1998). “Matching As An Econometric Evaluation Estimator,” Review of Economic Studies, 65(2):261-94.
[38] Heckman, J., A. Layne-Farrar and P. Todd (1996). “Human Capital Pricing Equations with an Application to Estimating the Effect of Schooling Quality on Earnings,” The Review of Economics and Statistics, 78(6):562-610.
[39] Heckman, J., L. Lochner and C. Taber (1998). “General-Equilibrium Treatment Effects: A Study of Tuition Policy,” American Economic Review, 88(2):381-86.
[40] Heckman, J., L. Lochner and P. Todd (2002). “Fifty Years of Mincer Earnings Regressions,” working paper, University of Chicago.
[41] Heckman, J. and S. Navarro-Lozano (2002). “Using Matching, Instrumental Variables and Control Functions to Estimate Economic Choice Models,” working paper.
[42] Heckman, J. and R. Robb (1985). “Alternative Methods for Estimating the Impact of Interventions,” in J. Heckman and B. Singer, (eds.), Longitudinal Analysis of Labor Market Data, (New York: Cambridge University Press), 156-245.
[43] Heckman, J. and R. Robb (1986). “Alternative Methods for Solving the Problem of Selection Bias in Evaluating the Impact of Treatments on Outcomes,” in H. Wainer, (ed.), Drawing Inference from Self-Selected Samples, (NY: Springer-Verlag), 63-107. Republished by Lawrence Erlbaum Press, Mahwah, New Jersey, 2000.
[44] Heckman, J., J. Tobias and E. Vytlacil (2001). “Four Parameters of Interest in the Evaluation of Social Programs,” Southern Economic Journal, 68(2):210-233.

[45] Heckman, J. and E. Vytlacil (1998). “Instrumental Variables Methods for the Correlation Random Coefficient Model,” Journal of Human Resources, 33(4):974-1002.
[46] Heckman, J. and E. Vytlacil (1999). “Local Instrumental Variables and Latent Variable Models for Identifying and Bounding Treatment Effects,” Proceedings of the National Academy of Sciences, 96:4730-4734.
[47] Heckman, J. and E. Vytlacil (2000). “Local Instrumental Variables,” in C. Hsiao, K. Morimune and J. Powell, (eds.), Nonlinear Statistical Modeling: Proceedings of the Thirteenth International Symposium in Economic Theory and Econometrics: Essays in Honor of Takeshi Amemiya, (Cambridge: Cambridge University Press, 2000), 1-46.
[48] Heckman, J. and E. Vytlacil (2001a). “Identifying the Role of Cognitive Ability in Explaining the Level of and Change in the Return to Schooling,” Review of Economics and Statistics, 83(1):1-12.
[49] Heckman, J. and E. Vytlacil (2001b). “Policy Relevant Treatment Effects,” American Economic Review Papers and Proceedings, 91, 107-111.
[50] Heckman, J. and E. Vytlacil (2003). “Structural Equations, Treatment Effects and Econometric Policy Evaluation,” forthcoming in Econometrica.
[51] Horowitz, J. (2000). “The Bootstrap,” forthcoming in Handbook of Econometrics.
[52] Hill, M. (1992). The Panel Study of Income Dynamics: A User’s Guide, Newbury Park, CA: Sage Publications.
[53] Hyndman, R.J., D.M. Bashtannyk and G.K. Grunwald (1996). “Estimating and visualizing conditional densities,” J. Comput. Graph. Stat., 5, 315-336.
[54] Hyndman, R.J. and Q. Yao (2002). “Nonparametric estimation and symmetry tests for conditional density functions,” Journal of Nonparametric Statistics, 14(3), 259-278.
[55] Ichimura, H. and C. Taber (2000). “Direct Estimation of Policy Effects,” working paper.
[56] Ichimura, H. and C. Taber (2002). “Semiparametric Reduced Form Estimation of Tuition Subsidies,” working paper.

[57] Ichimura, H. and P. Todd (2002). “Implementing Nonparametric and Semiparametric Estimators,” forthcoming in J. Heckman and E. Leamer, (eds.), Handbook of Econometrics, Volume 5, (North-Holland: Amsterdam).
[58] Imbens, G. and J. Angrist (1994). “Identification and Estimation of Local Average Treatment Effects,” Econometrica, 62(2):467-475.
[59] Kane, T. (1994). “College Entry by Blacks since 1970: The Role of College Costs, Family Background, and the Returns to Education,” The Journal of Political Economy, 102(5):878-911.
[60] Kane, T., C. Rouse and D. Staiger (1999). “Estimating Returns to Schooling When Schooling is Misreported,” NBER working paper.
[61] Kling, J. (2001). “Interpreting Instrumental Variables Estimates of Returns to Schooling,” Journal of Business and Economic Statistics, 19(3):358-64.
[62] Krueger, A. (2000). “Labor Policy and Labor Research Since the 1960s,” in Economic Events, Ideas and Policies: The 1960s and After, edited by George Perry and James Tobin, (Washington DC, Brookings Institution Press).
[63] Lee, Lung-Fei (1978). “Unionism and Wage Rates: A Simultaneous Equation Model with Qualitative and Limited Dependent Variables,” International Economic Review, 19, 415-33.
[64] Matzkin, R. (1992). “Nonparametric and Distribution-Free Estimation of the Binary Threshold Crossing and the Binary Choice Models,” Econometrica, 60(2), 239-70.
[65] Meghir, C. and M. Palme (2001). “The Effect of a Social Experiment in Education,” The Institute for Fiscal Studies working paper W01/11.
[66] Mincer, J. (1974). Schooling, Earnings and Experience, New York: NBER Press.
[67] Murnane, R., J. Willett and F. Levy (1995). “The Growing Importance of Cognitive Skills in Wage Determination,” National Bureau of Economic Research, Working Paper No. 5076.
[68] National Center for Education Statistics (1995). “High School and Beyond Fourth Follow-up Methodology Report”.

[69] Pessino, C. (1991). “Sequential Migration Theory and Evidence from Peru,” Journal of Development Economics, 36, 55-87.
[70] Robinson, C. (1989). “The Joint Determination of Union Status and Union Wage Effects: Some Tests of Alternative Models,” Journal of Political Economy, 97, 639-67.
[71] Rosenbaum, P. and D. Rubin (1983). “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” Biometrika, 70, 41-55.
[72] Staiger, D. and J. Stock (1997). “Instrumental Variables Regression with Weak Instruments,” Econometrica, 65(3), 557-586.
[73] Stock, J. and M. Yogo (2002). “Testing for Weak Instruments in Linear IV Regression,” working paper.
[74] Willis, R. and S. Rosen (1979). “Education and Self-Selection,” Journal of Political Economy, 87(5):S7-S36.
[75] Vytlacil, E. (2002). “Independence, Monotonicity, and Latent Index Models: An Equivalence Result,” Econometrica, 70(1):331-41.

Table 1
Evaluation Parameters

Parameter | Definition | Interpretation
Average Treatment Effect (ATE) | \(E(\beta)\) | average return for randomly selected individual
Average Treatment on the Treated (TT) | \(E(\beta \mid S = 1)\) | average return for individuals choosing to go to college
Average Treatment on the Untreated (TUT) | \(E(\beta \mid S = 0)\) | average return for individuals choosing not to go to college
Average Marginal Treatment Effect (AMTE) | \(E[\beta \mid \mu_S(X, Z) + U_S = 0]\) | average return for individuals at the margin between high school and college
Local Average Treatment Effect (LATE) | \(E[\beta \mid S(Z) = 0, S(Z') = 1]\) | average return for individuals induced to go to college by a change in the instrument
Policy Relevant Treatment Effect (PRTE) | \(E[\beta \mid S(\text{no policy}) = 0, S(\text{policy}) = 1]\) | average return for individuals induced to go to college by a change in policy
Sorting Gain (SG) | \(E(\beta - \bar{\beta} \mid S = 1)\) |

If β is independent of S (no selection on returns), then ATE = TT = TUT = AMTE = LATE = PRTE and SG = 0.
If β is not independent of S (selection on returns), then ATE ≠ TT ≠ TUT ≠ AMTE ≠ LATE ≠ PRTE and SG ≠ 0.
Income maximizing Roy model (S = 1 if β > 0): TT > ATE > TUT and SG > 0.
Generalized Roy model (S = 1 if \(\mu_S(X, Z) > 0\)): any ordering can emerge and SG can be of either sign.

Table 2a
Evaluation Parameters as Weighted Averages of MTE

Average Treatment Effect (ATE):
\[ E(\beta) = \int_X \int_{U_S} E(\beta \mid x, u_S) \, f_{X,U_S}(x, u_S) \, du_S \, dx \]

Average Treatment on the Treated (TT):
\[ E(\beta \mid S = 1) = \int_X \int_{U_S} E(\beta \mid x, u_S) \, f_{X,U_S}(x, u_S \mid S = 1) \, du_S \, dx \]

Average Treatment on the Untreated (TUT):
\[ E(\beta \mid S = 0) = \int_X \int_{U_S} E(\beta \mid x, u_S) \, f_{X,U_S}(x, u_S \mid S = 0) \, du_S \, dx \]

Average Marginal Treatment Effect (AMTE):
\[ E[\beta \mid \mu_S(X, Z) + U_S = 0] = \int_X \int_{U_S} E(\beta \mid x, u_S) \, f_{X,U_S}[x, u_S \mid \mu_S(X, Z) + U_S = 0] \, du_S \, dx \]

Policy Relevant Treatment Effect (PRTE):
\[ E[\beta \mid S(\text{no policy}) = 0, S(\text{policy}) = 1] = \int_X \int_{U_S} E(\beta \mid x, u_S) \, f_{X,U_S}[x, u_S \mid S(\text{no policy}) = 0, S(\text{policy}) = 1] \, du_S \, dx \]

Table 3 Descriptive Statistics
White Males from NLSY79

                                          S=1 (N=731)          S=0 (N=713)
Years of Schooling                        15.5007 (1.7543)     11.9677 (0.5087)
Log Wage                                   2.6123 (0.4797)      2.2994 (0.4067)
Years of Experience                        9.4083 (2.9711)     10.3410 (2.8365)
Corrected AFQT                            64.5883 (13.8165)    47.5633 (16.5688)
Number of Siblings                         2.6484 (1.7590)      3.0477 (1.8407)
Father's Years of Schooling               13.9959 (3.1373)     11.6255 (2.7710)
Average Local Tuition at 17 (in $100)     19.2496 (8.0055)     20.9525 (8.0905)
Distance to College at 14                  2.9811 (9.7070)      4.5303 (10.7643)
Local Wage at 17 (in dollars)              6.6217 (1.1994)      6.8129 (1.2137)
Local Unemployment Rate at 17 (in %)       6.2778 (1.6625)      6.3077 (1.6967)

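Table 2b Weights for Different Evaluation Parameters

Average Treatment Effect (ATE):
\( f_{X,U_S}(x, u_S) = f_X(x)\, f_{U_S}(u_S) \)

Average Treatment on the Treated (TT):
\( f_{X,U_S}(x, u_S \mid S = 1) = \dfrac{f_X(x)\, f_{U_S}(u_S)\, [1 - F_Z(-x\gamma_X - u_S \mid X = x)]}{\Pr(S = 1)} \)

Average Treatment on the Untreated (TUT):
\( f_{X,U_S}(x, u_S \mid S = 0) = \dfrac{f_X(x)\, f_{U_S}(u_S)\, F_Z(-x\gamma_X - u_S \mid X = x)}{\Pr(S = 0)} \)

Average Marginal Treatment Effect (AMTE):
\( f_{X,U_S}[x, u_S \mid \mu_S(X, Z) + U_S = 0] = \dfrac{f_X(x)\, f_{U_S}(u_S)\, f_Z(-x\gamma_X - u_S \mid X = x)}{\int_X \int_{U_S} f_X(x)\, f_{U_S}(u_S)\, f_Z(-x\gamma_X - u_S \mid X = x)\, du_S\, dx} \)

Policy Relevant Treatment Effect (PRTE):
\( f_{X,U_S}[x, u_S \mid S(\text{no policy}) = 0, S(\text{policy}) = 1] = \dfrac{f_X(x)\, f_{U_S}(u_S)\, [F_Z(-x\gamma_X - u_S \mid X = x) - F_{Z'}(-x\gamma_X - u_S \mid X = x)]}{\int_X \int_{U_S} f_X(x)\, f_{U_S}(u_S)\, [F_Z(-x\gamma_X - u_S \mid X = x) - F_{Z'}(-x\gamma_X - u_S \mid X = x)]\, du_S\, dx} \)

Assume that \( (X, Z) \perp\!\!\!\perp U_S \). Assume that \( \mu_S(X, Z) = X\gamma_X + Z \). Assume that the policy changes the distribution of Z in the population from \( f_Z(z) \) to \( f_{Z'}(z) \) and that the policy affects neither X nor \( U_S \). I consider a monotone policy in the sense that \( Z' \geq Z \) for every individual.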
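The weighting scheme of Tables 2a and 2b can be illustrated numerically. The sketch below is only an illustration under strong simplifying assumptions that are not taken from the paper: there are no covariates X, both Z and U_S are normal, the choice rule is S = 1 if Z + U_S > 0, and the MTE function `mte` is a hypothetical linear function of u_S. It computes ATE, TT and TUT as weighted averages of the MTE using the ATE, TT and TUT weights of Table 2b.

```python
import math


def norm_pdf(x, mean=0.0, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def norm_cdf(x, mean=0.0, sd=1.0):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def mte(u_s):
    # Hypothetical MTE, linear in u_S. This is NOT an estimate from the paper.
    return 0.08 + 0.03 * u_s


def evaluation_parameters(z_mean=0.0, z_sd=1.0, lo=-8.0, hi=8.0, n=4001):
    """ATE, TT and TUT as weighted averages of MTE (no covariates).

    Assumed choice rule: S = 1 if Z + U_S > 0, with U_S ~ N(0, 1) and
    Z ~ N(z_mean, z_sd^2), so Pr(S = 0 | u_S) = F_Z(-u_S).
    """
    step = (hi - lo) / (n - 1)
    ate = tt = tut = pr_s1 = pr_s0 = 0.0
    for i in range(n):
        u = lo + i * step
        mass = norm_pdf(u) * step          # f_US(u_S) du_S
        p_s0 = norm_cdf(-u, z_mean, z_sd)  # F_Z(-u_S)
        ate += mte(u) * mass
        tt += mte(u) * mass * (1.0 - p_s0)
        tut += mte(u) * mass * p_s0
        pr_s1 += mass * (1.0 - p_s0)
        pr_s0 += mass * p_s0
    return {
        "ATE": ate,
        "TT": tt / pr_s1,
        "TUT": tut / pr_s0,
        "Pr(S=1)": pr_s1,
        "Pr(S=0)": pr_s0,
    }


params = evaluation_parameters()
```

Because the hypothetical MTE rises with u_S and individuals with high u_S are the ones who attend college under the assumed rule, the sketch produces TT > ATE > TUT, and ATE equals the Pr(S)-weighted average of TT and TUT.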

Table 4 Log Wage Regression - The Importance of Test Scores

                       NLSY (whites)                        HSB (whites)
                       Males              Females           Males              Females
Experience              0.0756 (0.0185)    0.0895 (0.0182)   0.1509 (0.0123)    0.1815 (0.0161)
Experience Squared     -0.0020 (0.0010)   -0.0019 (0.0010)  -0.0119 (0.0013)   -0.0120 (0.0018)
Test Score              0.0031 (0.0010)    0.0023 (0.0011)   0.2304 (0.1916)    0.8616 (0.2970)
School                  0.0107 (0.0895)   -0.0770 (0.0960)  -0.1618 (0.1571)    0.3608 (0.2126)
School*Test Score       0.0044 (0.0015)    0.0068 (0.0016)   0.6469 (0.2819)    0.1796 (0.4013)
Constant                1.6025 (0.0983)    1.2141 (0.0869)   9.5629 (0.1019)    8.6865 (0.1507)

Table 5 Probit Coefficients for Choice of Schooling Model
White Males from NLSY79

                                          Coefficient         E(dF/dZ)
Corrected AFQT                             0.0389 (0.0026)     0.0114 (0.0006)
Number of Siblings                        -0.0369 (0.0216)    -0.0108 (0.0063)
Father's Years of Schooling                0.1217 (0.0134)     0.0356 (0.0036)
Average Local Tuition at 17 (in $100)     -0.0138 (0.0049)    -0.0040 (0.0014)
Distance to College at 14 (*)              0.0471 (0.3710)     0.0138 (0.1086)
Local Wage at 17 (in dollars)             -0.1275 (0.0348)    -0.0373 (0.0101)
Local Unemployment Rate at 17 (in %)       0.0650 (0.0261)     0.0190 (0.0076)
(*) Distance in hundreds of miles

Table 6 Weights for Different Evaluation Parameters (Matching)

Average Treatment Effect (ATE):
\( f_P(p) = f_T[P^{-1}(p)] \left[ \dfrac{\partial P(t)}{\partial t} \Big|_{t = P^{-1}(p)} \right]^{-1} \)

Average Treatment on the Treated (TT):
\( f_P(p \mid S = 1) = f_T[P^{-1}(p) \mid S = 1] \left[ \dfrac{\partial P(t)}{\partial t} \Big|_{t = P^{-1}(p)} \right]^{-1} \)

Average Treatment on the Untreated (TUT):
\( f_P(p \mid S = 0) = f_T[P^{-1}(p) \mid S = 0] \left[ \dfrac{\partial P(t)}{\partial t} \Big|_{t = P^{-1}(p)} \right]^{-1} \)

Average Marginal Treatment Effect (AMTE):
\( f_P[p \mid \mu_S(X, Z) + U_S = 0] = f_T[P^{-1}(p) \mid X\gamma_X + Z\gamma_Z = -U_S] \left[ \dfrac{\partial P(t)}{\partial t} \Big|_{t = P^{-1}(p)} \right]^{-1} \)

Policy Relevant Treatment Effect (PRTE)¹:
\( f_P[p \mid -T + \gamma_{Z_t} \cdot 1000 < U_S < -T] = f_T[P^{-1}(p) \mid -T + \gamma_{Z_t} \cdot 1000 < U_S < -T] \left[ \dfrac{\partial P(t)}{\partial t} \Big|_{t = P^{-1}(p)} \right]^{-1} \)

where

\( T = X\gamma_X + Z\gamma_Z \)

\( \left[ \dfrac{\partial P(t)}{\partial t} \Big|_{t = P^{-1}(p)} \right]^{-1} = \left\{ f_{U_S}[F_{U_S}^{-1}(p)] \right\}^{-1} \)

\( f_T[t \mid T = -U_S] = \dfrac{f_T(t)\, f_{U_S}(-t)}{\int_T f_T(v)\, f_{U_S}(-v)\, dv} \)

\( f_T[t \mid -T + \gamma_{Z_t} \cdot 1000 < U_S < -T] = \dfrac{\int_{-t + \gamma_{Z_t} \cdot 1000}^{-t} f_T(t)\, f_{U_S}(v)\, dv}{\int_T \int_{-r + \gamma_{Z_t} \cdot 1000}^{-r} f_T(r)\, f_{U_S}(v)\, dv\, dr} \)

¹ This is the PRTE corresponding to a $1000 tuition subsidy (\( Z_t' = Z_t - 1000 \)), where \( \gamma_{Z_t} \) is the coefficient on the tuition variable in the choice model (therefore \( \gamma_{Z_t} < 0 \)).

Table 7 Estimates of the Return to Schooling - Matching

                              NLSY (whites)                        HSB (whites)
                              Males             Females            Males             Females
E(β)                          0.0759 (0.0072)   0.0856 (0.0091)    0.0521 (0.0139)   0.0955 (0.0108)
E(β | S = 1)                  0.0887 (0.0091)   0.1001 (0.0088)    0.0599 (0.0136)   0.1018 (0.0123)
E(β | S = 0)                  0.0627 (0.0083)   0.0721 (0.0118)    0.0446 (0.0150)   0.0880 (0.0117)
E[β | µS(X, Z) + US = 0]      0.0697 (0.0071)   0.0823 (0.0081)    0.0538 (0.0129)   0.0886 (0.0103)

PSID (whites), Males: 0.0558 (0.0176), 0.0504 (0.0174), 0.0623 (0.0202), 0.0539 (0.0178)

Standard Errors are Bootstrapped

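Table 8 Testing for Selection
White Males from NLSY79
GEDs Included and Average Wage 90-94

Coefficients on Phat Cubed and Phat Squared, by specification:
(1) 2.1828 (0.7027) and -3.0134 (1.1983)
(2) 1.2746 (0.8935) and -2.0236 (1.5319)
(3) -3.7504 (1.8171)
(4) -2.7836 (1.3625)
(5) -2.2753 (1.4004)
Remaining coefficients for specifications (3) to (5): 2.0170 (0.8337), 2.4825 (1.1224), 2.1607 (0.8362)

(1) Regression of Wage on Experience, Experience Squared, AFQT, Phat*AFQT, Phat, Phat Squared and Phat Cubed. (2) = 1 + AFQT Squared and Phat*AFQT Squared. (3) = 2 + AFQT Cubed and Phat*AFQT Cubed. (4) = 1 + Number of Siblings and Father's Education. (5) = 4 + Phat*Number of Siblings and Phat*Father's Education.

Table 9 Weights for Different Evaluation Parameters (Local Instrumental Variables)

ATE:
\( f_{T_X,V_S}(t_X, v_S) = f_{T_X}(t_X)\, f_{U_S}[F_{U_S}^{-1}(1 - v_S)] \left\{ f_{U_S}[F_{U_S}^{-1}(1 - v_S)] \right\}^{-1} \)

TT:
\( f_{T_X,V_S}(t_X, v_S \mid S = 1) = \dfrac{f_{T_X}(t_X)\, f_{U_S}[F_{U_S}^{-1}(1 - v_S)] \left\{ 1 - F_{T_Z}[-t_X - F_{U_S}^{-1}(1 - v_S) \mid T_X = t_X] \right\}}{\Pr(S = 1)} \left\{ f_{U_S}[F_{U_S}^{-1}(1 - v_S)] \right\}^{-1} \)

TUT:
\( f_{T_X,V_S}(t_X, v_S \mid S = 0) = \dfrac{f_{T_X}(t_X)\, f_{U_S}[F_{U_S}^{-1}(1 - v_S)]\, F_{T_Z}[-t_X - F_{U_S}^{-1}(1 - v_S) \mid T_X = t_X]}{\Pr(S = 0)} \left\{ f_{U_S}[F_{U_S}^{-1}(1 - v_S)] \right\}^{-1} \)

AMTE:
\( f_{T_X,V_S}[t_X, v_S \mid T_X + T_Z + U_S = 0] = \dfrac{f_{T_X}(t_X)\, f_{U_S}[F_{U_S}^{-1}(1 - v_S)]\, f_{T_Z}[-t_X - F_{U_S}^{-1}(1 - v_S) \mid T_X = t_X]}{\int_{T_X} \int_{U_S} f_{T_X}(t_X)\, f_{U_S}(u_S)\, f_{T_Z}(-t_X - u_S \mid T_X = t_X)\, du_S\, dt_X} \left\{ f_{U_S}[F_{U_S}^{-1}(1 - v_S)] \right\}^{-1} \)

PRTE¹:
\( f_{T_X,V_S}[t_X, v_S \mid -T_X - T_Z + \gamma_{Z_t} \cdot 1000 < U_S < -T_X - T_Z] = \dfrac{f_{T_X}(t_X)\, f_{U_S}[F_{U_S}^{-1}(1 - v_S)] \left\{ F_{T_Z}[-t_X - F_{U_S}^{-1}(1 - v_S) \mid T_X = t_X] - F_{T_Z'}[-t_X - F_{U_S}^{-1}(1 - v_S) \mid T_X = t_X] \right\}}{\int_{T_X} \int_{U_S} f_{T_X}(t_X)\, f_{U_S}(u_S)\, [F_{T_Z}(-t_X - u_S \mid T_X = t_X) - F_{T_Z'}(-t_X - u_S \mid T_X = t_X)]\, du_S\, dt_X} \left\{ f_{U_S}[F_{U_S}^{-1}(1 - v_S)] \right\}^{-1} \)

where \( T_X = X\gamma_X \) and \( T_Z = Z\gamma_Z \).

¹ This is the PRTE corresponding to a $1000 tuition subsidy (\( Z_t' = Z_t - 1000 \)), where \( \gamma_{Z_t} \) is the coefficient on the tuition variable in the choice model (therefore \( \gamma_{Z_t} < 0 \)).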

Table 10a Estimates of the Return to Schooling - Local Instrumental Variables
NLSY79 - White Males

                              Local Linear Regression   Cubic in P̂          Quartic in P̂
E(β)                          0.1625 (0.0319)           0.1566 (0.0321)     0.1657 (0.0328)
E(β | S = 1)                  0.1807 (0.0572)           0.1572 (0.0655)     0.1854 (0.0666)
E(β | S = 0)                  0.1456 (0.0415)           0.1560 (0.0458)     0.1474 (0.0450)
E[β | µS(X, Z) + US = 0]      0.1501 (0.0296)           0.1362 (0.0288)     0.1454 (0.0297)
Standard Errors are Bootstrapped

Table 10b Estimates of the Return to Schooling - Local Instrumental Variables using Local Linear Regression

                              NLSY (whites)                        HSB (whites)                         PSID (whites)
                              Males             Females            Males             Females            Males
E(β)                          0.1625 (0.0319)   0.2086 (0.0325)    0.1702 (0.0393)   0.1460 (0.0432)    0.3287 (0.0504)
E(β | S = 1)                  0.1807 (0.0572)   0.2646 (0.0566)    0.2418 (0.0701)   0.1994 (0.0828)    0.3900 (0.0727)
E(β | S = 0)                  0.1456 (0.0415)   0.1597 (0.0448)    0.1123 (0.0908)   0.0806 (0.0475)    0.2373 (0.0615)
E[β | µS(X, Z) + US = 0]      0.1501 (0.0296)   0.2108 (0.0324)    0.1802 (0.0984)   0.1294 (0.0327)    0.2787 (0.0476)
Standard Errors are Bootstrapped

Table 11a Estimates of Policy Effects by OLS, IV and Local IV
NLSY79 - White Males

OLS    0.0773 (0.0068)
IV     0.1269 (0.0227)

Local IV                                  LLR               Cubic in P̂        Quartic in P̂
E[β | S(t = 0) = 1, S(t = 500) = 0]       0.1505 (0.0289)   0.1378 (0.0279)   0.1458 (0.0286)
E[β | S(t = 0) = 1, S(t = 1000) = 0]      0.1501 (0.0286)   0.1382 (0.0275)   0.1458 (0.0282)
E[β | S(t = 0) = 1, S(t = 2000) = 0]      0.1491 (0.0283)   0.1384 (0.0271)   0.1447 (0.0277)
Standard Errors are Bootstrapped

Table 11b Estimates of Policy Effects by OLS, IV and Local IV (estimated by LLR)

                                          NLSY (whites)                        PSID (whites)
                                          Males             Females            Males
OLS                                       0.0773 (0.0068)   0.0877 (0.0074)    0.1085 (0.0157)
IV                                        0.1269 (0.0227)   0.1927 (0.0291)    0.3056 (0.0445)
E[β | S(t = 0) = 1, S(t = 500) = 0]       0.1505 (0.0289)   0.2090 (0.0324)    0.2538 (0.0556)
E[β | S(t = 0) = 1, S(t = 1000) = 0]      0.1501 (0.0286)   0.2090 (0.0322)    0.2517 (0.0548)
E[β | S(t = 0) = 1, S(t = 2000) = 0]      0.1491 (0.0283)   0.2068 (0.0321)    0.2489 (0.0534)
Standard Errors are Bootstrapped

Table 12a Validity of Instrumental Variables
White Males from NLSY

Instruments               ρz,s                ρz,a                ρs,a               ρs,a*ρz,s           Condition
Number of Siblings        -0.1159 (0.0212)    -0.1016 (0.0234)    0.4950 (0.0189)    -0.0574 (0.0111)    YES
Mother's Education         0.3207 (0.0191)     0.3366 (0.0254)    0.4950 (0.0189)     0.1588 (0.0112)    YES
Father's Education         0.3821 (0.0167)     0.3527 (0.0219)    0.4950 (0.0189)     0.1892 (0.0109)    YES
Tuition                   -0.0956 (0.0220)    -0.0420 (0.0234)    0.4950 (0.0189)    -0.0473 (0.0109)    NO
Distance to College       -0.0705 (0.0252)    -0.0795 (0.0246)    0.4950 (0.0189)    -0.0349 (0.0127)    YES
Local Wage                -0.0560 (0.0235)     0.0194 (0.0236)    0.4950 (0.0189)    -0.0277 (0.0116)    NO
Local Unemployment Rate   -0.0097 (0.0231)    -0.0015 (0.0225)    0.4950 (0.0189)    -0.0048 (0.0114)    NO
Family Income              0.2112 (0.0254)     0.1856 (0.0249)    0.4950 (0.0189)     0.1045 (0.0133)    YES

Condition: ρz,a > ρs,a*ρz,s if ρz,s > 0, or ρz,a < ρs,a*ρz,s if ρz,s < 0.

The first column presents the correlation between the instrument and college attendance. The second column presents the correlation between the instrument and measured ability (AFQT). The third column presents the correlation between college attendance and ability. The fourth column is the product of the first and third columns, and the last column compares the second column to the fourth. For each correlation in the first and second columns I use all white males in the NLSY with nonmissing values for the variables needed (instrument, college attendance and AFQT). GED recipients and high school dropouts are excluded. For the third column I use all white males in the NLSY with nonmissing observations for college attendance and AFQT. Standard errors in the fourth column are bootstrapped, and so are the standard errors in the other three columns (for consistency across the columns; the non-bootstrapped standard errors for the first three columns are nearly the same as the ones presented here). The number of replications is 100.
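The comparison in the last column of Table 12a can be written out as a short check. The sketch below is illustrative only: the function names `pearson` and `passes_validity_check` are hypothetical, the treatment of the case ρz,s = 0 (which the table does not define) is an assumption, and the inputs are the point estimates reported in the table, without their bootstrapped standard errors.

```python
import math


def pearson(x, y):
    """Sample correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def passes_validity_check(rho_zs, rho_za, rho_sa):
    """Table 12a condition: compare rho_za with rho_sa * rho_zs."""
    product = rho_sa * rho_zs
    if rho_zs > 0:
        return rho_za > product
    if rho_zs < 0:
        return rho_za < product
    return False  # assumption: the condition is undefined when rho_zs == 0


# Point estimates (rho_zs, rho_za, rho_sa) as reported in Table 12a.
TABLE_12A = {
    "Number of Siblings": (-0.1159, -0.1016, 0.4950),
    "Mother's Education": (0.3207, 0.3366, 0.4950),
    "Father's Education": (0.3821, 0.3527, 0.4950),
    "Tuition": (-0.0956, -0.0420, 0.4950),
    "Distance to College": (-0.0705, -0.0795, 0.4950),
    "Local Wage": (-0.0560, 0.0194, 0.4950),
    "Local Unemployment Rate": (-0.0097, -0.0015, 0.4950),
    "Family Income": (0.2112, 0.1856, 0.4950),
}

results = {name: passes_validity_check(*rhos) for name, rhos in TABLE_12A.items()}
```

Applied to the reported correlations, the check returns True for the rows marked YES and False for the rows marked NO.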

Table 12b Regression of College Attendance on each Instrument
White Males from NLSY

Instruments                F-Test
Number of Siblings          26.0529
Father's Education         312.7564
Mother's Education         203.5318
Tuition                     17.6449
Distance to College          8.9239
Local Wage                   5.8588
Local Unemployment Rate      0.1800
Family Income               66.6684

To construct this table I first run an OLS regression of college attendance on each of the instruments. The statistic reported is the F-test for that regression (the null is that the coefficient on the instrument in the regression is equal to zero).
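For a regression with a constant and a single instrument, the F-statistic for a zero slope equals the squared t-statistic on that slope. The sketch below computes it from scratch; the function name `ols_f_statistic` and the toy data are hypothetical and are not the NLSY sample, and a classical (homoskedastic) variance estimate is assumed.

```python
def ols_f_statistic(z, s):
    """F-statistic for H0: slope = 0 in an OLS regression of s on a constant and z."""
    n = len(z)
    mz, ms = sum(z) / n, sum(s) / n
    szz = sum((a - mz) ** 2 for a in z)
    szs = sum((a - mz) * (b - ms) for a, b in zip(z, s))
    slope = szs / szz
    intercept = ms - slope * mz
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(z, s))
    sigma2 = rss / (n - 2)            # residual variance with two estimated parameters
    se_slope = (sigma2 / szz) ** 0.5  # classical standard error of the slope
    return (slope / se_slope) ** 2


# Toy data (hypothetical): an instrument and a college attendance dummy.
instrument = [1, 2, 3, 4, 5, 6, 7, 8]
attendance = [0, 0, 1, 0, 1, 1, 0, 1]
f_stat = ols_f_statistic(instrument, attendance)
```

With one regressor the same number can be obtained from the sample correlation r as r²(n − 2)/(1 − r²), which connects the F-tests in this table to the correlations ρz,s of Table 12a.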

Table 12c Validity of Instrumental Variables
White Males from NLSY

Instruments               ρz,s                ρz,a                ρs,a               ρs,a*ρz,s           Condition
Tuition                   -0.0561 (0.0222)     0.0074 (0.0218)    0.3984 (0.0218)    -0.0224 (0.0089)    NO
Distance to College       -0.0232 (0.0250)    -0.0380 (0.0271)    0.3984 (0.0218)    -0.0092 (0.0099)    YES
Local Wage                -0.0521 (0.0230)     0.0409 (0.0222)    0.3984 (0.0218)    -0.0208 (0.0090)    NO
Local Unemployment Rate    0.0081 (0.0206)     0.0133 (0.0236)    0.3984 (0.0218)     0.0032 (0.0082)    YES
Family Income              0.0865 (0.0242)     0.0375 (0.0246)    0.3984 (0.0218)     0.0345 (0.0099)    YES

Condition: ρz,a > ρs,a*ρz,s if ρz,s > 0, or ρz,a < ρs,a*ρz,s if ρz,s < 0.

Schooling, Instruments and Test Scores are first residualized on number of siblings, father's education and mother's education using linear regression. Then the first column of the table presents the correlation between the instrument and college attendance. The second column presents the correlation between the instrument and measured ability (AFQT). The third column presents the correlation between college attendance and ability. The fourth column is the product of the first and third columns, and the last column compares the second column to the fourth. For each correlation in the first and second columns I use all white males in the NLSY with nonmissing values for the variables needed (instrument, college attendance and AFQT). GED recipients and high school dropouts are excluded. For the third column I use all white males in the NLSY with nonmissing observations for college attendance and AFQT. Standard errors in the fourth column are bootstrapped, and so are the standard errors in the other three columns (the number of replications is 100). These standard errors account for the fact that these residuals are estimated.

Table 12d Regression of Schooling on each Instrument
(Schooling and Instruments are Residualized on Family Background)
White Males from NLSY

Instruments                F-Test
Tuition                     5.6040
Distance to College         0.8888
Local Wage                  4.6837
Local Unemployment Rate     0.1154
Family Income              10.0622

Schooling, Instruments and Test Scores are first residualized on number of siblings, father's education and mother's education using linear regression. Then I run an OLS regression of residualized college attendance on each of the residualized instruments. The statistic reported is the F-test for that regression (the null is that the coefficient on the instrument in the regression is equal to zero).

Figure 1
[Plot of f(β) against β, with C and C' marked.]

Figure 2
[Plot of E[β|US] against US; labels in the plot: 100-200 and 2100-2200.]

Figure 3 Distribution of the Propensity Score for High School (S=0) and College (S=1)
[Plot of f(p) against P; series: S=0 and S=1.]

Figure 4
White Males from NLSY79
[Two panels: "E(Y|P) estimated by Linear IV and by LIV" (E(Y|P) against p) and "Marginal Treatment Effect estimated by Linear IV and by LIV" (MTE against U); series in each panel: LIV and Linear IV.]

Figure 5 MTE Estimated by Matching
White Males from NLSY79
GEDs Included and Average Wage 90-94
[Plot of MTE against Us.]

Figure 6a Parameter Weights for MTE (P = p) (matching): ATE, TT and TUT Weights
White Males from NLSY79
[Plot of h(P) against P; series: ate, tt, tut.]

The ATE weight is the distribution of P in the population, the TT weight is the distribution of P in the population of college attenders and the TUT weight is the distribution of P in the population of high school graduates who never enroll in college. These distributions are estimated nonparametrically using a gaussian kernel with bandwidth equal to 0.1. P is the estimated propensity score.

Figure 7a MTE(A) calculated using LIV
White Males from NLSY79
[Plot of MTE against A.]

To construct this figure we first run a local linear regression of log hourly wage on years of experience, years of experience squared, AFQT (A), AFQT interacted with P and a nonparametric function of P, g(p). P is the estimated propensity score. We use a biweight kernel with bandwidth equal to 0.3. The AFQT measure used is a scaled version of AFQT obtained by multiplying the original AFQT by its coefficient in the schooling equation. This figure just plots a line passing through the origin with slope equal to the coefficient on the interaction between AFQT and P in the local linear regression described above.

Figure 6b Parameter Weights for MTE (P = p) (matching): ATE (average person) and AMTE (marginal person) Weights
White Males from NLSY79
[Plot of h(P) against P; series: Average, Marginal.]

The ATE weight is the distribution of P in the population. This distribution is estimated nonparametrically using a gaussian kernel with bandwidth equal to 0.1. P is the estimated propensity score. The AMTE weight is the distribution of P for individuals indifferent between attending college or not. Its analytical expression is presented in table 6 and it uses two ingredients: the distribution of P (the ATE weight) and the distribution of US. The distribution of US is assumed to be normal with mean 0 and variance 1.
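The nonparametric density estimate mentioned in these notes can be sketched as follows. This is a minimal illustration of a gaussian kernel density estimator with bandwidth 0.1; the function name `gaussian_kde` and the sample of propensity scores are hypothetical, and no boundary correction at 0 and 1 is applied, which the paper may or may not use.

```python
import math


def gaussian_kde(sample, bandwidth=0.1):
    """Return a function p -> estimated density of the sample at p."""
    n = len(sample)
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)

    def density(p):
        return sum(math.exp(-0.5 * ((p - x) / bandwidth) ** 2) for x in sample) / norm

    return density


# Hypothetical estimated propensity scores (not NLSY data).
p_hat = [0.12, 0.25, 0.31, 0.44, 0.47, 0.52, 0.58, 0.63, 0.71, 0.86]
ate_weight = gaussian_kde(p_hat, bandwidth=0.1)

# Evaluate the estimated density on a grid of P values, as in the weight plots.
grid = [i / 100.0 for i in range(101)]
weights = [ate_weight(p) for p in grid]
```

Estimating the same density separately on the S = 1 and S = 0 subsamples would give the analogues of the TT and TUT weights described in the note to Figure 6a.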

Figure 7b MTE Estimated by Local Linear Regression
White Males from NLSY79
[Plot of MTE against Us.]

Figure 8 MTE(A,U) estimated using LIV
White Males from NLSY79
[Surface plot of MTE over A and US.]

This graph plots the estimated marginal treatment effect as a function of AFQT (A) and US. Given the assumptions in this paper the MTE is additively separable in A and US: MTE(a,u) = g(a) + h(u). To construct this figure we first run a local linear regression of log hourly wage on years of experience, years of experience squared, AFQT, AFQT interacted with P and a nonparametric function of P, g(p). P is the estimated propensity score. We use a biweight kernel with bandwidth equal to 0.3. Then we need to compute the derivative of the nonparametric component with respect to P. This is the estimate of h(u). We approximate this derivative using finite differences. We construct a grid of equally spaced values of P. The distance between each value of P is h = 0.01. Then we compute the derivative at each P by dividing g(P+h)-g(P) by h. g(a) is constructed by multiplying A by the coefficient on the interaction between P and A in the regression just described. The surface is just the sum of these two components, which are assumed to be independent.
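The finite-difference step described in this note can be sketched as below. The function names `forward_difference` and `mte_surface`, the quadratic stand-in for the fitted nonparametric component, and the slope value 0.05 are all hypothetical; in the paper the inputs come from the local linear regression.

```python
def forward_difference(g_values, h=0.01):
    """Approximate dg/dP on a grid with spacing h by [g(P + h) - g(P)] / h."""
    return [(g_values[i + 1] - g_values[i]) / h for i in range(len(g_values) - 1)]


def mte_surface(a_values, slope_a, g_values, h=0.01):
    """Additively separable surface: slope_a * a plus the derivative of g at each grid point."""
    h_u = forward_difference(g_values, h)
    return [[slope_a * a + d for d in h_u] for a in a_values]


# Grid of equally spaced values of P with spacing h = 0.01.
grid = [i * 0.01 for i in range(101)]
# Hypothetical stand-in for the fitted nonparametric component g(P).
g_hat = [p * p for p in grid]

derivative = forward_difference(g_hat, h=0.01)
surface = mte_surface([0.0, 1.0, 2.0], slope_a=0.05, g_values=g_hat, h=0.01)
```

A forward difference loses the last grid point, so the derivative is available at 100 of the 101 grid values; for the quadratic stand-in it equals 2P + h at each point, which shows the size of the approximation error for h = 0.01.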

Figure 9 Policy Weights
White Males from NLSY79
[Surface plot over A and US; vertical axis scaled by 10^-3.]

This graph shows the joint distribution of A and US for individuals induced to attend college by a $1000 tuition subsidy. A is a scaled version of AFQT constructed by multiplying AFQT by the coefficient on AFQT in the schooling equation. US is the unobservable in the schooling equation. The analytical expression of this surface is presented in table 9. TX in table 9 corresponds to A in this figure. TZ is an index of all the remaining variables in the schooling equation (excluding AFQT). It is constructed by multiplying each variable by its corresponding coefficient in the schooling equation and then adding them up. The density of TX is estimated with a gaussian kernel with bandwidth of 0.1. The density of TZ conditional on TX is estimated using local polynomials. The density of US is assumed to be normal with mean 0 and variance 1.

Figure 10a Policy Weights for MTE(A)
White Males from NLSY79
[Plot of h(A) against A; series: $500, $1000, $2000.]

This graph shows the distribution of A for individuals induced to attend college by a $500, a $1000 and a $2000 tuition subsidy. A is a scaled version of AFQT constructed by multiplying AFQT by the coefficient on AFQT in the schooling equation. The analytical expression of the policy weight is presented in table 9. TX in table 9 corresponds to A in this figure. TZ is an index of all the remaining variables in the schooling equation (excluding AFQT). It is constructed by multiplying each variable by its corresponding coefficient in the schooling equation and then adding them up. The density of TX is estimated with a gaussian kernel with bandwidth of 0.1. The density of TZ conditional on TX is estimated using local polynomials. The density of US is assumed to be normal with mean 0 and variance 1. To generate this figure I average the policy weight over the distribution of US. I plot all three policies in this figure.

Figure 10b ATE and Policy Weights for MTE(A)
White Males from NLSY79
[Plot of h(A) against A; series: $1000, ATE.]

This graph shows the distribution of A for individuals induced to attend college by a $1000 tuition subsidy and for the average person. A is a scaled version of AFQT constructed by multiplying AFQT by the coefficient on AFQT in the schooling equation. The analytical expression of the policy weight is presented in table 9. TX in table 9 corresponds to A in this figure. TZ is an index of all the remaining variables in the schooling equation (excluding AFQT). It is constructed by multiplying each variable by its corresponding coefficient in the schooling equation and then adding them up. The density of TX is estimated with a gaussian kernel with bandwidth of 0.1. The density of TZ conditional on TX is estimated using local polynomials. The density of US is assumed to be normal with mean 0 and variance 1. To generate this figure I average the policy weight over the distribution of US.

Figure 10c Policy Weights for MTE(A)
White Males from NLSY79
[Plot of h(A) against A; series: $1000, $5000, $10000.]

This graph shows the distribution of A for individuals induced to attend college by a $1000, a $5000 and a $10000 tuition subsidy. A is a scaled version of AFQT constructed by multiplying AFQT by the coefficient on AFQT in the schooling equation. The analytical expression of the policy weight is presented in table 9. TX in table 9 corresponds to A in this figure. TZ is an index of all the remaining variables in the schooling equation (excluding AFQT). It is constructed by multiplying each variable by its corresponding coefficient in the schooling equation and then adding them up. The density of TX is estimated with a gaussian kernel with bandwidth of 0.1. The density of TZ conditional on TX is estimated using local polynomials. The density of US is assumed to be normal with mean 0 and variance 1. To generate this figure I average the policy weight over the distribution of US. I plot all three policies in this figure.

Figure 11a Policy Weights for MTE(US)
White Males from NLSY79
[Plot of h(US) against US; series: $500, $1000, $2000.]

This graph shows the distribution of US for individuals induced to attend college by a $500, a $1000 and a $2000 tuition subsidy. The analytical expression of the policy weight is presented in table 9. TX in table 9 corresponds to A in this figure. TZ is an index of all the remaining variables in the schooling equation (excluding AFQT). It is constructed by multiplying each variable by its corresponding coefficient in the schooling equation and then adding them up. The density of TX is estimated with a gaussian kernel with bandwidth of 0.1. The density of TZ conditional on TX is estimated using local polynomials. The density of US is assumed to be normal with mean 0 and variance 1. To generate this figure I average both weights over the distribution of TX. I plot all three policies in this figure.

Figure 11b ATE and Policy Weights for MTE(US)
White Males from NLSY79
[Plot of h(Us) against Us; series: $1000, ATE.]

This graph shows the distribution of US for individuals induced to attend college by a $1000 tuition subsidy and for the average person. The analytical expression of the policy weight is presented in table 9. TX in table 9 corresponds to A in this figure. TZ is an index of all the remaining variables in the schooling equation (excluding AFQT). It is constructed by multiplying each variable by its corresponding coefficient in the schooling equation and then adding them up. The density of TX is estimated with a gaussian kernel with bandwidth of 0.1. The density of TZ conditional on TX is estimated using local polynomials. The density of US is assumed to be normal with mean 0 and variance 1. To generate this figure I average both weights over the distribution of US.

Figure 11c Policy Weights for MTE(Us)
White Males from NLSY79
[Plot of h(Us) against Us; series: $1000, $5000, $10000.]

This graph shows the distribution of US for individuals induced to attend college by a $1000, a $5000 and a $10000 tuition subsidy. The analytical expression of the policy weight is presented in table 9. TX in table 9 corresponds to A in this figure. TZ is an index of all the remaining variables in the schooling equation (excluding AFQT). It is constructed by multiplying each variable by its corresponding coefficient in the schooling equation and then adding them up. The density of TX is estimated with a gaussian kernel with bandwidth of 0.1. The density of TZ conditional on TX is estimated using local polynomials. The density of US is assumed to be normal with mean 0 and variance 1. To generate this figure I average both weights over the distribution of TX. I plot all three policies in this figure.

Figure 12 Equality of Policy Weights
White Males from NLSY79
[Plot of h(US) against US; vertical axis scaled by 10^-3; series: 500, 1000-500, 2000-1000, 5000-2000.]

This graph shows that the theoretical condition for equality of policy weights is satisfied for tuition subsidies between $500 and $2000. The 500 line represents the weight for the $500 subsidy, 1000-500 is the weight we obtain by using the 1000 and the 500 subsidies, 2000-1000 is the weight we obtain by using the 2000 and 1000 subsidies and 5000-2000 is the weight we obtain by using the 5000 and 2000 subsidies. The analytical expression for these curves conditional on a value of X is given by equation (24) in the paper. Instead of X I use TX, a scaled version of AFQT, and instead of Z I use TZ, an index of all the remaining variables in the schooling equation (excluding AFQT). It is constructed by multiplying each variable by its corresponding coefficient in the schooling equation and then adding them up. The density of TX is estimated with a gaussian kernel with bandwidth of 0.1. The density of TZ conditional on TX is estimated using local polynomials. The density of US is assumed to be normal with mean 0 and variance 1. To generate this figure I average both weights over the distribution of TX. I plot all three policies in this figure.