ESTIMATING AVERAGE TREATMENT EFFECTS: IV AND CONTROL FUNCTIONS
Jeff Wooldridge
Michigan State University
BGSE/IZA Course in Microeconometrics
July 2009

1. Introduction
2. Homogeneous Treatment Effect and Treatment Effects as a Function of Observables
3. Heterogeneous Treatment Effects and LATE
4. Control Function Methods for Heterogeneous Treatment Effects
5. A Different Approach to Allowing Heterogeneous Treatment Effects
6. Nonlinear Response Models
1. Introduction
∙ Consider now the case where unconfoundedness, or selection on observables, does not hold. Allow units to select into treatment based on unobservables that affect the response.
∙ Even if eligibility is randomly assigned, actual enrollment in programs may suffer from self-selection. But randomized eligibility can often be used as an instrumental variable.
∙ Identification and estimation are very simple in the constant treatment effect case. Contemporary interest is often in the heterogeneous treatment effect case.
∙ Three general approaches: (i) Study simple IV estimators to see if they estimate anything of value under fairly weak assumptions; (ii) Add more assumptions and try to estimate a parameter such as $\tau_{ate}$; (iii) Use “structural” economic models without parametric assumptions, and try to identify interesting policy or structural parameters.
2. Homogeneous Treatment Effect and Treatment Effects as a Function of Observables
∙ In the simplest case we assume $Y_1 - Y_0 = \tau$ is constant. Then
$$Y = Y_0 + W(Y_1 - Y_0) = Y_0 + \tau W \equiv \mu_0 + \tau W + V_0,$$
where $\mu_0 \equiv E[Y_0]$ and $V_0 \equiv Y_0 - \mu_0$. So, we have a simple regression model with a constant slope.
∙ Suppose we have a single instrumental variable, Z. (At this point, Z could be binary, continuous, or anything in between.) Recall the two requirements for an instrument:
$$\text{Cov}(Z, W) \neq 0$$
$$\text{Cov}(Z, V_0) = \text{Cov}(Z, Y_0) = 0$$
∙ The first requirement (relevance) is fairly easily checked. The second (exogeneity) is maintained. Note that having Z predetermined or randomly assigned does not guarantee its exogeneity: the value of Z could affect $Y_0$ through other channels.
∙ Conveniently, the nature of Y is not restricted, but the treatment effect is assumed constant. $\tau = \tau_{ate} = \tau_{att}$ is consistently estimated by the IV estimator. [Remember, IV is not usually unbiased.]
∙ Even if we restrict attention to consistency, it is not true that one should use a “slightly” endogenous instrument rather than OLS.
∙ Why? It is easy to show:
$$\text{plim}\,\hat{\tau}_{OLS} = \tau + \left(\frac{\sigma_{V_0}}{\sigma_W}\right)\text{Corr}(W, V_0)$$
$$\text{plim}\,\hat{\tau}_{IV} = \tau + \left(\frac{\sigma_{V_0}}{\sigma_W}\right)\frac{\text{Corr}(Z, V_0)}{\text{Corr}(Z, W)}$$
So, if $\text{Corr}(Z, W)$ is small – that is, Z is a “weak” instrument – then even a small correlation between Z and $V_0$ can produce a larger asymptotic bias than OLS.
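∙ A small simulation sketch (mine, not from the slides) of these plim formulas. All data-generating values are hypothetical, chosen so the instrument is weak and slightly endogenous:

```python
# Sketch: a weak, "slightly" endogenous instrument can have a larger
# asymptotic bias than OLS. All parameter values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
N, tau = 100_000, 1.0                 # large N approximates the plim
corr_zv = 0.02                        # mild instrument endogeneity
corr_zw = 0.05                        # weak first stage

Z = rng.normal(size=N)
V0 = 0.5 * rng.normal(size=N) + corr_zv * Z      # V0 mildly related to Z
W = corr_zw * Z + 0.3 * V0 + rng.normal(size=N)  # W depends on V0: endogenous
Y = tau * W + V0

tau_ols = np.cov(W, Y)[0, 1] / np.var(W, ddof=1)   # Cov(W,Y)/Var(W)
tau_iv = np.cov(Z, Y)[0, 1] / np.cov(Z, W)[0, 1]   # Cov(Z,Y)/Cov(Z,W)
print(f"true tau = {tau}, OLS = {tau_ols:.3f}, IV = {tau_iv:.3f}")
# Here the IV bias (~0.36) exceeds the OLS bias (~0.07), exactly as the
# plim formulas above predict when Corr(Z, W) is small.
```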
∙ In economics, it is very common to see IV estimates that are larger in magnitude than OLS estimates. Usually, explanations other than a bad IV are given; measurement error and heterogeneous treatment effects (later) are among them.
∙ Weak instruments lead to large asymptotic standard errors, too:
$$\text{Avar}\left[\sqrt{N}(\hat{\tau}_{IV} - \tau)\right] = \frac{\sigma^2_{V_0}}{\sigma^2_W\,\rho^2_{Z,W}}.$$
When $\rho^2_{Z,W}$ is small, the asymptotic variance can be very large. The formula for the OLS estimator omits $\rho^2_{Z,W}$.
Adding Other Covariates and Instruments
∙ Suppose we need to add covariates before our instruments are appropriately exogenous:
$$Y_g = \mu_g + (x - \psi_x)\beta_g + u_g, \quad g = 0, 1,$$
where $\mu_g = E[Y_g]$ and $\psi_x = E[x]$.
∙ We have made a functional form assumption once we assume $u_g$ has a zero mean conditional on x. We also assume that netting out x makes z exogenous. Easiest way to impose both:
$$E[u_g|x, z] = 0, \quad g = 0, 1.$$
∙ In effect, we have made the standard exclusion restrictions plus a linear functional form. Write
$$\begin{aligned} Y &= Y_0 + W(Y_1 - Y_0) \\ &= \mu_0 + (x - \psi_x)\beta_0 + u_0 + W(\mu_1 - \mu_0) + W(x - \psi_x)(\beta_1 - \beta_0) + W(u_1 - u_0) \\ &= \alpha_0 + \tau W + x\beta_0 + W(x - \psi_x)\delta + u_0 + W(u_1 - u_0), \end{aligned}$$
where $\tau = \mu_1 - \mu_0$, $\delta = \beta_1 - \beta_0$, and $\alpha_0 = \mu_0 - \psi_x\beta_0$.
∙ The last term – the interaction of the treatment and the unobserved gain from treatment, $e \equiv u_1 - u_0$ – causes difficulties.
∙ In the simplest case, there is no heterogeneity except in the intercept: $\delta = 0$, $u_1 = u_0$. Then
$$Y = \alpha_0 + \tau W + x\beta_0 + u_0,$$
and we can estimate the parameters by 2SLS using instruments $(1, x, z)$.
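∙ A minimal sketch of this 2SLS estimator on simulated data. Everything below (names, parameter values) is hypothetical; the just-identified IV formula $(Z'X)^{-1}Z'Y$ is coded directly:

```python
# Sketch: 2SLS for Y = a0 + tau*W + x*b0 + u0 with instruments (1, x, z).
import numpy as np

rng = np.random.default_rng(1)
N = 5_000
x = rng.normal(size=(N, 2))
z = rng.normal(size=N)                       # excluded instrument
c = rng.normal(size=N)                       # unobservable driving selection
W = (0.5 * z + x @ [0.2, -0.1] + c > 0).astype(float)
u0 = 0.6 * c + rng.normal(size=N)            # u0 correlated with c => W endogenous
Y = 1.0 + 2.0 * W + x @ [1.0, 0.5] + u0      # true tau = 2.0

X = np.column_stack([np.ones(N), W, x])      # regressors (1, W, x)
Zmat = np.column_stack([np.ones(N), z, x])   # instruments (1, z, x)

beta_iv = np.linalg.solve(Zmat.T @ X, Zmat.T @ Y)   # just-identified IV
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print("tau_IV  =", round(beta_iv[1], 3))     # close to 2.0
print("tau_OLS =", round(beta_ols[1], 3))    # biased upward here
```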
∙ How might we exploit the binary nature of W?
1. (Not Recommended): Take the expected value conditional on all exogenous variables:
$$E[Y|x, z] = \alpha_0 + \tau E[W|x, z] + x\beta_0 + E[u_0|x, z] = \alpha_0 + \tau P(W = 1|x, z) + x\beta_0$$
because W is binary.
∙ Two-step procedure: (i) Estimate $P(W = 1|x, z)$ by probit (or logit) and obtain the first-stage fitted probabilities, say $\hat{\Phi}_i = \Phi(\hat{\gamma}_0 + x_i\hat{\gamma}_1 + z_i\hat{\gamma}_2)$. Should be able to reject $\gamma_2 = 0$ fairly strongly. (ii) Use $\hat{\Phi}_i$ in place of $W_i$: regress $Y_i$ on $1, \hat{\Phi}_i, x_i$, $i = 1, \ldots, N$.
∙ Why not recommended? (a) Inconsistent generally if the model (probit in this case) for $P(W = 1|x, z)$ is wrong. Then $E[Y|x, z] \neq \alpha_0 + \tau\Phi(\gamma_0 + x\gamma_1 + z\gamma_2) + x\beta_0$; (b) Need to adjust standard errors for the “generated regressor”; (c) No known efficiency properties; (d) Tempting to achieve identification off of the nonlinearity in the probit in the absence of instruments.
2. (Recommended): Use $\hat{\Phi}_i$ as an IV, not a regressor. That is, estimate
$$Y_i = \alpha_0 + \tau W_i + x_i\beta_0 + u_{i0}$$
by IV using instruments $(1, \hat{\Phi}_i, x_i)$.
∙ Not the same as using $\hat{\Phi}_i$ as a regressor! The first stage when $\hat{\Phi}_i$ is used as an IV is
$$\hat{W}_i = \hat{\theta}_0 + \hat{\theta}_1\hat{\Phi}_i + x_i\hat{\theta}_2$$
and $\hat{\theta}_0 \neq 0$, $\hat{\theta}_1 \neq 1$, $\hat{\theta}_2 \neq 0$.
∙ Why recommended? (a) Misspecification of the probit model does not matter, provided W is partially correlated with $\Phi(\gamma_0 + x\gamma_1 + z\gamma_2)$! So this method, like using $(1, x_i, z_i)$ as instruments, is robust to having the model for $P(W = 1|x, z)$ wrong. (b) No need to account for generated instruments in standard errors – see Wooldridge (2002, Chapter 6). (c) Estimator is the efficient IV estimator if $\text{Var}(u_0|x, z) = \text{Var}(u_0)$ and the probit model for W is correct.
∙ As a practical matter, there might not be much difference in the point estimates between the approaches. Certainly not if the coefficients in the regression $W_i$ on $1, \hat{\Phi}_i, x_i$ are close to 0, 1, 0. But using the fitted values as a regressor has no advantages. A sketch of the recommended procedure follows.
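∙ A sketch of the recommended procedure, continuing the simulated data from the 2SLS sketch above; the probit is fit with statsmodels (an assumed dependency):

```python
# Sketch: probit first stage, then use Phi_hat as an *instrument* for W.
import numpy as np
import statsmodels.api as sm

# (i) Probit of W on (1, x, z); first-stage fitted probabilities
probit_X = np.column_stack([np.ones(N), x, z])
Phi_hat = sm.Probit(W, probit_X).fit(disp=0).predict(probit_X)

# (ii) IV estimation of Y = a0 + tau*W + x*b0 + u0,
#      instruments (1, Phi_hat, x)
X_mat = np.column_stack([np.ones(N), W, x])
Z_mat = np.column_stack([np.ones(N), Phi_hat, x])
beta = np.linalg.solve(Z_mat.T @ X_mat, Z_mat.T @ Y)
print("tau_hat =", round(beta[1], 3))
# Probit misspecification only weakens the instrument, it does not
# destroy consistency; no generated-instrument correction is needed.
```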
∙ Easy to modify to allow treatment to interact with x. After the first-stage probit, estimate
$$Y_i = \alpha_0 + \tau W_i + x_i\beta_0 + W_i(x_i - \bar{x})\delta + error_i$$
by IV, using instruments $(1, \hat{\Phi}_i, x_i, \hat{\Phi}_i(x_i - \bar{x}))$. Benefits same as without the interaction.
∙ Can safely ignore estimation error in the sample mean, $\bar{x}$.
∙ Note we have now allowed heterogeneous treatment effects, but only as a function of x. In particular,
$$\hat{\tau}(x) = \hat{\tau} + (x - \bar{x})\hat{\delta},$$
and we can insert different values for x to see how the ATE varies with observed characteristics.
3. Heterogeneous Treatment Effects and LATE
The Local Average Treatment Effect
∙ What does IV estimate, in general, if the gain from treatment, $Y_1 - Y_0$, is not constant? Imbens and Angrist (1994): now treatment is counterfactual, too. Let Z be the binary instrumental variable, which is often randomized eligibility, so that Z is zero or one. Let $W_0$ be treatment status if $Z = 0$ and let $W_1$ be treatment status if $Z = 1$. (Some eligibles may not choose to participate; some non-eligibles may find other ways to “participate.”)
∙ We only observe one of the counterfactual treatments:
$$W = (1 - Z)W_0 + ZW_1$$
Now write
$$Y = (1 - W)Y_0 + WY_1 = Y_0 + W_0(Y_1 - Y_0) + Z(W_1 - W_0)(Y_1 - Y_0)$$
Assumptions:
LATE.1: Z is independent of $(W_0, W_1, Y_0, Y_1)$
LATE.2: $P(W = 1|Z = 1) \neq P(W = 1|Z = 0)$
LATE.3: $W_1 \geq W_0$ (no defiers, or monotonicity)
∙ This last condition means that if a unit would participate in the program when not eligible, the unit would also participate if made eligible. (Vietnam draft lottery: a defier would be someone who would serve if not drafted but would not serve if drafted. LATE.3 rules this out.)
∙ Imbens and Angrist define the local average treatment effect (LATE) as
$$\tau_{LATE} = E[Y_1 - Y_0|W_1 - W_0 = 1],$$
which is the average treatment effect for those induced into treatment by assignment. (That is, $W_0 = 0$ and $W_1 = 1$.)
∙ Imbens and Angrist call the subpopulation with $W_1 = 1$, $W_0 = 0$ the compliers. These individuals comply with their assignment: if they are assigned to the control group, they do not participate, and if they are assigned to the treatment group, they do participate.
∙ One criticism of LATE is that it is a treatment effect for an unidentifiable segment of the population: we do not know, on the basis of observing $(Y_i, W_i, Z_i)$, whether unit i is a complier.
∙ Consequently, LATE may not have “external validity,” that is, it may not apply to evaluations of other programs even if the population is the same.
∙ A related point is that two different binary instruments lead to different groups of compliers in the same population. For example, if Y is log of earnings, W is college attendance, and Z is living near a college, the compliers are those who would attend college if one is nearby, but not otherwise. But if Z is whether a grant is offered, the compliers consist of those who would not attend college if they do not receive a grant, but would attend if they do.
∙ The key result of IA is that, under the three LATE assumptions, the usual IV estimator consistently estimates $\tau_{LATE}$. To see this, use the independence assumption to write
$$E[Y|Z] = E[Y_0] + E[W_0(Y_1 - Y_0)] + Z \cdot E[(W_1 - W_0)(Y_1 - Y_0)]$$
so that
$$E[Y|Z = 1] - E[Y|Z = 0] = E[(W_1 - W_0)(Y_1 - Y_0)].$$
Now
$$\begin{aligned} E[(W_1 - W_0)(Y_1 - Y_0)] &= 1 \cdot E[Y_1 - Y_0|W_1 - W_0 = 1]P(W_1 - W_0 = 1) \\ &\quad + 0 \cdot E[Y_1 - Y_0|W_1 - W_0 = 0]P(W_1 - W_0 = 0) \\ &\quad + (-1) \cdot E[Y_1 - Y_0|W_1 - W_0 = -1]P(W_1 - W_0 = -1) \\ &= E[Y_1 - Y_0|W_1 - W_0 = 1]P(W_1 - W_0 = 1) \end{aligned}$$
because $P(W_1 - W_0 = -1) = 0$ by the no defiers assumption.
∙ Because $P(W_1 - W_0 = -1) = 0$, $W_1 - W_0$ is a binary variable, and
$$P(W_1 - W_0 = 1) = E[W_1 - W_0] = E[W_1] - E[W_0].$$
∙ But, again, Z is independent of $(W_0, W_1)$, and so
$$E[W|Z] = (1 - Z)E[W_0] + ZE[W_1],$$
which implies
$$E[W_1] - E[W_0] = E[W|Z = 1] - E[W|Z = 0] = P(W = 1|Z = 1) - P(W = 1|Z = 0) \neq 0$$
by LATE.2.
∙ We have shown that
$$\tau_{LATE} = E[Y_1 - Y_0|W_1 - W_0 = 1] = \frac{E[Y|Z = 1] - E[Y|Z = 0]}{P(W = 1|Z = 1) - P(W = 1|Z = 0)},$$
and this is the probability limit of the simple IV estimator when Z and W are both binary. In fact, the IV estimate is just the Wald estimate,
$$\hat{\tau}_{LATE} = \frac{N_1^{-1}\sum_{i=1}^N Z_i Y_i - N_0^{-1}\sum_{i=1}^N (1 - Z_i)Y_i}{N_1^{-1}\sum_{i=1}^N Z_i W_i - N_0^{-1}\sum_{i=1}^N (1 - Z_i)W_i},$$
where $N_1 = \sum_{i=1}^N Z_i$ and $N_0 = N - N_1$. (Sometimes called a “grouping estimator.”) A minimal implementation follows.
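∙ A minimal sketch of the grouping estimator; y, w, z are NumPy arrays, and the simulated design (hypothetical) satisfies monotonicity by construction:

```python
# Sketch: Wald/grouping estimator of LATE for binary Z and binary W.
import numpy as np

def wald_late(y, w, z):
    """Difference in mean outcomes over difference in treatment rates."""
    dy = y[z == 1].mean() - y[z == 0].mean()
    dw = w[z == 1].mean() - w[z == 0].mean()   # first stage; must be != 0
    return dy / dw

rng = np.random.default_rng(2)
z = rng.integers(0, 2, size=10_000)
eps = rng.normal(size=10_000)
w = ((0.8 * z + eps) > 0.5).astype(float)      # latent index: no defiers
y = 1.5 * w + rng.normal(size=10_000)          # constant effect 1.5
print(round(wald_late(y, w, z), 3))            # approx. 1.5
```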
∙ This interpretation of IV has had wide influence in the treatment effect literature. It is probably relied on too much to explain why an IV estimate is larger than OLS; one sometimes forgets the possibility that the IV is not exogenous.
∙ The literature on allowing covariates, x, and on instruments z that are not binary scalars is vast. The flavor of the results carries through.
4. Control Function Methods for Heterogeneous Treatment Effects
∙ Suppose we think there are heterogeneous treatment effects but are not satisfied with estimating LATE (or possibly the necessary assumptions do not hold). The “old-fashioned” approach is to add more assumptions and apply control function methods. This leads to the so-called “switching regression” model, which is the same as a random coefficient model.
∙ Recall that if $Y_g = \mu_g + x\beta_g + u_g$ we can write
$$Y = \alpha_0 + \tau W + x\beta_0 + W(x - \psi_x)\delta + u_0 + We,$$
where $e = u_1 - u_0$. Note that the ATE $\tau$ is still the coefficient on W.
∙ We have to take seriously the underlying functional forms, $E[Y_g|x] = \mu_g + x\beta_g$, and we will further restrict the conditional distribution of $Y_g$. In effect, this setup applies to $Y_g$ continuous. (Contrast the approaches based on unconfoundedness, particularly PS weighting and matching.)
∙ Even if we add the exogeneity assumptions $E[u_0|x, z] = E[e|x, z] = 0$, it is generally not true that
$$E[We|x, z] = E[We],$$
and this is the condition for the earlier IV estimators to be consistent for $\tau$ if the interaction term $We$ is in the error term – see Wooldridge (2003, Economics Letters).
∙ As it turns out, this condition can hold for continuous treatments, in which case only $\alpha_0$ is estimated inconsistently. The IV estimator is consistent for $\tau_{ate}$.
∙ For common binary response models for W, this mean independence assumption is known not to hold.
∙ Instead, use a control function approach, that is, find $E[Y|W, x, z]$:
$$E[Y|W, x, z] = \alpha_0 + \tau W + x\beta_0 + W(x - \psi_x)\delta + E[u_0|W, x, z] + WE[e|W, x, z]$$
∙ We can estimate the expectations if we model W. Probit is convenient, and we now have to take the model seriously:
$$W = 1[\gamma_0 + x\gamma_1 + z\gamma_2 + c \geq 0]$$
$$(u_0, u_1, c)|x, z \sim \text{Multivariate Normal}$$
∙ Can relax this a bit: $E[u_0|c, x, z]$ and $E[u_1|c, x, z]$ linear in c (and not dependent on x, z), along with $D(c|x, z)$ standard normal.
∙ The expectations have the well-known “inverse Mills ratio” form:
$$E[Y|W, x, z] = \alpha_0 + \tau W + x\beta_0 + W(x - \psi_x)\delta + \rho_1 W\frac{\phi(r\gamma)}{\Phi(r\gamma)} - \rho_0(1 - W)\frac{\phi(r\gamma)}{1 - \Phi(r\gamma)},$$
where $\phi$ is the standard normal pdf, $\Phi$ is the standard normal cdf, and $r = (1, x, z)$.
∙ This is an estimating equation for $\tau$, and $\tau(x)$, too, when we want $\tau(x) = \tau + (x - \psi_x)\delta$.
∙ Two-Step Method (Heckman, but for “switching regression,” or “endogenous treatment,” not sample selection): (i) As before, run probit of $W_i$ on $1, x_i, z_i$ and obtain $\hat{\phi}_i = \phi(r_i\hat{\gamma})$ and $\hat{\Phi}_i = \Phi(r_i\hat{\gamma})$. (ii) Run the OLS regression
$$Y_i \text{ on } 1, W_i, x_i, W_i(x_i - \bar{x}), W_i\hat{\phi}_i/\hat{\Phi}_i, (1 - W_i)\hat{\phi}_i/(1 - \hat{\Phi}_i)$$
∙ There is a “generated regressor” problem: need to adjust standard errors and inference for the first-stage estimation. Can use “heckit” in Stata applied to the treated and nontreated separately. A sketch of the pooled two-step regression follows.
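∙ A sketch of the pooled two-step regression under joint normality; the simulated design below is hypothetical:

```python
# Sketch: control function ("switching regression") two-step estimator.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(3)
N = 10_000
x = rng.normal(size=(N, 2))
z = rng.normal(size=N)
c = rng.normal(size=N)
W = (0.3 + 0.7 * z + x @ [0.4, -0.2] + c > 0).astype(float)
u0 = 0.5 * c + rng.normal(size=N)
u1 = 0.8 * c + rng.normal(size=N)                # heterogeneous gains
Y = np.where(W == 1, 2.0 + x @ [1.0, 0.5] + u1,
                     0.5 + x @ [0.8, 0.3] + u0)  # true ATE = 1.5

# (i) probit of W on r = (1, x, z); pdf and cdf at the fitted index
r = np.column_stack([np.ones(N), x, z])
idx = r @ sm.Probit(W, r).fit(disp=0).params
pdf, cdf = norm.pdf(idx), norm.cdf(idx)

# (ii) OLS of Y on 1, W, x, W*(x - xbar), and the two Mills-ratio terms
xc = x - x.mean(axis=0)
mills1 = W * pdf / cdf                           # treated term
mills0 = (1 - W) * pdf / (1 - cdf)               # untreated term
X2 = np.column_stack([np.ones(N), W, x, W[:, None] * xc, mills1, mills0])
res = sm.OLS(Y, X2).fit()
print("tau_hat (ATE) =", round(res.params[1], 3))
# Standard errors still need a generated-regressor adjustment, e.g.
# bootstrap both steps jointly.
```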
∙ If, say, we estimate the coefficients $(\hat{\alpha}_0, \hat{\beta}_0')'$ and $(\hat{\alpha}_1, \hat{\beta}_1')'$ using heckit on the $W_i = 0$ and $W_i = 1$ subgroups, it is important to form
$$\hat{\tau}(x) = (\hat{\alpha}_1 - \hat{\alpha}_0) + x(\hat{\beta}_1 - \hat{\beta}_0)$$
and, as a special case,
$$\hat{\tau} = \hat{\tau}_{ate} = (\hat{\alpha}_1 - \hat{\alpha}_0) + \bar{x}(\hat{\beta}_1 - \hat{\beta}_0).$$
It makes no sense to just look at $\hat{\alpha}_1 - \hat{\alpha}_0$ unless $\hat{\beta}_1 = \hat{\beta}_0$ has been imposed or $\bar{x}$ is, coincidentally, close to zero.
∙ The ATEs are never obtained by looking at $\hat{E}[Y_i|W_i = 1, x_i, z_i] - \hat{E}[Y_i|W_i = 0, x_i, z_i]$ and averaging these differences across i. We already know how $\tau$ and $\tau(x)$ depend on the parameters. They do not depend on z!
∙ If the ATEs were based on $E[Y|W, x, z]$ there would be no identification problem and never any functional form restrictions, because we could always estimate $E[Y|W, x, z]$ nonparametrically.
∙ Again, we use the expression for $E[Y|W, x, z]$ as an estimating equation, not for computing treatment effects.
∙ Whether separate estimations are carried out or the pooled regression is used, can use the bootstrap for inference.
∙ The Stata command “treatreg” (typically using MLE, but sometimes using the CF approach) does not apply to the full switching regression case. In fact, it seems to apply only to the simple model with a homogeneous effect,
$$Y = \alpha_0 + \tau W + x\beta_0 + u_0,$$
in which case more robust IV methods are available. The CF regression in this case is
$$Y_i \text{ on } 1, W_i, x_i, W_i\hat{\phi}_i/\hat{\Phi}_i - (1 - W_i)\hat{\phi}_i/(1 - \hat{\Phi}_i)$$
∙ That is, treatreg imposes the restriction $\rho_1 = \rho_0 = \rho$.
∙ As usual, need to make sure z has at least one element that predicts treatment once x is controlled for. (Use the first-stage probit.)
∙ In the pooled estimation in the general case, can use a joint test of the two terms $W_i\hat{\phi}_i/\hat{\Phi}_i$ and $(1 - W_i)\hat{\phi}_i/(1 - \hat{\Phi}_i)$ to test the null that W is exogenous. Under the null, do not need to account for generated regressors, so use a heteroskedasticity-robust Wald test.
∙ Extension of the usual Heckman approach. Suppose we allow lots of heterogeneity in the two counterfactual equations:
$$Y_{ig} = a_{ig} + x_i b_{ig} \equiv \mu_g + x_i\beta_g + x_i d_{ig} + u_{ig},$$
where $(a_{ig}, b_{ig})$ is independent of $(x_i, z_i)$. We can write
$$Y_i = \alpha_0 + \tau W_i + x_i\beta_0 + W_i(x_i - \psi_x)\delta + u_{i0} + W_i e_i + x_i d_{i0} + W_i x_i g_i,$$
where $g_i = d_{i1} - d_{i0}$.
∙ Under sufficient normality and linear conditional expectations of the heterogeneity given $c_i$ (the error in the probit for $W_i$), we get the control function equation
$$E[Y_i|W_i, x_i, z_i] = \alpha_0 + \tau W_i + x_i\beta_0 + W_i(x_i - \psi_x)\delta + \rho_0 h(W_i, r_i\gamma) + \rho_1 W_i h(W_i, r_i\gamma) + h(W_i, r_i\gamma)x_i\pi_0 + W_i h(W_i, r_i\gamma)x_i\pi_1,$$
where
$$h(W_i, r_i\gamma) \equiv W_i\lambda(r_i\gamma) - (1 - W_i)\lambda(-r_i\gamma)$$
and $\lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)$ is the inverse Mills ratio.
∙ So the generalized residual from the probit, $h(W_i, r_i\hat{\gamma})$, gets added on its own, interacted with $W_i$, interacted with $x_i$, and interacted with $W_i x_i$. (A total of $2(K + 1)$ terms.)
∙ The null that W i is exogenous is a test that all terms have zero coefficients. Usual heteroskedasticity-robust Wald test can be used.
∙ As usual with a two-step estimation method, need to adjust standard errors if we conclude $W_i$ is endogenous. (Bootstrap.)
∙ Even more flexibility: allow $W_i$ to follow a heteroskedastic probit, that is,
$$W_i = 1[r_i\gamma + \exp(r_{oi}\xi)c_i \geq 0]$$
(where $r_{oi}$ omits a constant), in which case the generalized residual is
$$h(W_i, r_i\hat{\gamma}, r_{oi}\hat{\xi}) = W_i\lambda(\exp(-r_{oi}\hat{\xi})r_i\hat{\gamma}) - (1 - W_i)\lambda(-\exp(-r_{oi}\hat{\xi})r_i\hat{\gamma})$$
∙ A lot of scope for studying robustness of findings even within parametric models.
5. A Different Approach to Allowing Heterogeneous Treatment Effects
∙ Based on Wooldridge (2007, Advances in Econometrics). Go back to
$$Y = \alpha_0 + \tau W + x\beta_0 + W(x - \psi_x)\delta + u_0 + We$$
∙ Rather than find $E[u_0|W, x, z]$ and $E[e|W, x, z]$, now write $We = E[We|x, z] + a$, where $E[a|x, z] = 0$ by definition. Then
$$Y = \alpha_0 + \tau W + x\beta_0 + W(x - \psi_x)\delta + E[We|x, z] + u_0 + a,$$
where now the composite error, $q \equiv u_0 + a$, has zero mean conditional on $(x, z)$.
∙ It turns out $E[We|x, z]$ is remarkably simple when W follows a probit:
$$E[We|x, z] = \rho\,\phi(\gamma_0 + x\gamma_1 + z\gamma_2)$$
∙ The estimating equation is
$$Y = \alpha_0 + \tau W + x\beta_0 + W(x - \psi_x)\delta + \rho\,\phi(r\gamma) + q, \quad E[q|x, z] = 0.$$
∙ But W is still endogenous in this equation, so we need an instrument. (This is why the method differs from standard control function approach.)
∙ The added regressor, $\phi(r\hat{\gamma})$, is exogenous and acts as its own instrument. The natural instrument for W is $\Phi(r\hat{\gamma})$.
∙ Two-step procedure: (i) As before, probit of W on $(1, x, z)$. Now get $\hat{\phi}_i$ in addition to $\hat{\Phi}_i$. (ii) Estimate
$$Y_i = \alpha_0 + \tau W_i + x_i\beta_0 + W_i(x_i - \bar{x})\delta + \rho\hat{\phi}_i + error_i$$
by IV, using instruments $(1, \hat{\Phi}_i, x_i, \hat{\Phi}_i(x_i - \bar{x}), \hat{\phi}_i)$.
∙ Generated regressor problem with $\hat{\phi}_i$, so use the delta method or bootstrap. (Do not need to worry about generated instruments.) A sketch follows.
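∙ A sketch of the correction function two-step, reusing the simulated data and first-stage probit from the previous sketch:

```python
# Sketch: add rho*phi_hat as an exogenous regressor; instrument W with
# Phi_hat. Uses x, W, Y, N, pdf, cdf from the switching regression sketch.
import numpy as np

phi_hat, Phi_hat = pdf, cdf
xc = x - x.mean(axis=0)

# Regressors (1, W, x, W*(x - xbar), phi_hat); W is still endogenous
X_mat = np.column_stack([np.ones(N), W, x, W[:, None] * xc, phi_hat])
# Instruments (1, Phi_hat, x, Phi_hat*(x - xbar), phi_hat)
Z_mat = np.column_stack([np.ones(N), Phi_hat, x, Phi_hat[:, None] * xc, phi_hat])

beta = np.linalg.solve(Z_mat.T @ X_mat, Z_mat.T @ Y)   # just-identified IV
print("tau_hat =", round(beta[1], 3))
# The t statistic on phi_hat (after IV) tests whether W*(u1 - u0)
# matters; phi_hat is a generated regressor, so use the delta method
# or the bootstrap for standard errors.
```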
∙ A test of whether $W(u_1 - u_0)$ is needed is just the usual (heteroskedasticity-robust) t statistic on $\hat{\phi}_i$ after IV estimation. Not a test of endogeneity of W, because W can still be correlated with $u_0$.
∙ Drawback to this approach: does not work with binary Z without covariates: $\phi(\gamma_0 + \gamma_2 Z)$ and $\Phi(\gamma_0 + \gamma_2 Z)$ are perfectly collinear when Z is binary (each is just a linear function of Z). Generally, might be sensitive to weak instruments. Need Z to vary sufficiently.
∙ Control function approach can work with binary Z, at least in some cases. But in CF case, test of interaction not robust to nonnormality (endogeneity and functional form tied together).
∙ Can learn other things by studying the estimating equation. Assume we have no covariates, so drop x and write
$$Y = \alpha_0 + \tau W + \rho\,\phi(\gamma_0 + \gamma_2 Z) + q, \quad E[q|Z] = 0.$$
Via simulations, Angrist (1990, NBER Technical Working Paper) studied the properties of the usual IV estimator in the equation without $\rho\,\phi(\gamma_0 + \gamma_2 Z)$. In other words, the implicit error term is $\rho\,\phi(\gamma_0 + \gamma_2 Z) + q$. In general, the IV, Z (which is what Angrist used, not $\Phi(\gamma_0 + \gamma_2 Z)$), is correlated with $\phi(\gamma_0 + \gamma_2 Z)$. But not always. In fact, in Angrist’s simulations, $\phi(\gamma_0 + \gamma_2 Z) = \phi(\gamma_2 Z)$ and Z has a symmetric distribution about zero.
∙ Just as Z and $Z^2$ are uncorrelated when Z is symmetrically distributed about zero, $\text{Cov}(Z, \phi(\gamma_2 Z)) = 0$ (because $\phi$ is symmetric about zero). We already know $\text{Cov}(Z, q) = 0$, and so Z is uncorrelated with $\rho\,\phi(\gamma_2 Z) + q$.
∙ This likely explains why Angrist’s simulation results for the simple IV estimator look so promising for estimating $\tau$.
∙ This method is likely less efficient than control function methods in general. But it can be quite a bit easier with multiple treatment levels. For example, suppose we replace W with a vector $(W_1, \ldots, W_J)$. Then the control function method requires computing the expectations $E[u_0|W, x, z]$ and $E[e|W, x, z]$, which can be very difficult, whereas the “correction function” method requires only calculation of $E[W_j e|x, z]$ for each j separately.
6. Nonlinear Response Models
∙ What if Y is clearly a discrete outcome, such as a binary response? Could take the Imbens and Angrist perspective and settle for (hopefully) estimating LATE. But what if we want to estimate the ATE as a function of covariates x?
∙ Unfortunately, no general, robust procedures. Joint MLE is the only reliable method. That is, specify a model for W as a function of x, z and also a model for Y.
∙ Some poor practices linger, especially in complicated discrete response models (multinomial and conditional logit, for example). Even if Y is binary, one often sees a two-step method applied:
(1) Probit of $W_i$ on $x_i, z_i$ and get fitted values, $\hat{\Phi}_i$. (2) Probit of $Y_i$ on $\hat{\Phi}_i, x_i$.
∙ The coefficient on $\hat{\Phi}_i$ is meaningless (except maybe for the sign). How do we get the ATE from this? No current justification for these kinds of approaches.
∙ Is it somehow possible that these two-step procedures nevertheless approximate the ATE?
∙ Can think of a similar Tobit method (when Y is a corner solution response) that attempts to mimic “two-stage least squares.”
∙ What can we do that at least works under a set of assumptions?
∙ Suppose the $Y_g$ are binary, and we are willing to assume, for $g = 0, 1$,
$$Y_g = 1[\alpha_g + x\beta_g + u_g > 0]$$
$$u_g|x, z \sim \text{Normal}(0, 1)$$
∙ We know that the ATE as a function of x is
$$\tau(x) = \Phi(\alpha_1 + x\beta_1) - \Phi(\alpha_0 + x\beta_0),$$
which is easy to estimate given estimates of the parameters.
∙ If we again assume
$$W = 1[\gamma_0 + x\gamma_1 + z\gamma_2 + c \geq 0]$$
and $(u_0, u_1, c)|x, z \sim$ Multivariate Normal, then estimation using standard software is straightforward: apply the Heckman selection approach to the probit model for the control ($W_i = 0$) and treated ($W_i = 1$) groups separately.
∙ In Stata, “heckprob.” This gives $(\hat{\alpha}_0, \hat{\beta}_0)$ ($W_i = 0$; need an instrument) and $(\hat{\alpha}_1, \hat{\beta}_1)$ ($W_i = 1$), and then we have
$$\hat{\tau}(x) = \Phi(\hat{\alpha}_1 + x\hat{\beta}_1) - \Phi(\hat{\alpha}_0 + x\hat{\beta}_0),$$
which we can study at various values of x, or average across subpopulations. The estimate of $\tau_{ate} = E[\Phi(\alpha_1 + x\beta_1) - \Phi(\alpha_0 + x\beta_0)]$ is
$$\hat{\tau}_{ate} = N^{-1}\sum_{i=1}^N\left[\Phi(\hat{\alpha}_1 + x_i\hat{\beta}_1) - \Phi(\hat{\alpha}_0 + x_i\hat{\beta}_0)\right].$$
A sketch of this averaging step follows.
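∙ A sketch of this averaging step; the coefficient arrays below stand in for heckprob output on the two groups and are purely hypothetical placeholders:

```python
# Sketch: turn switching-probit estimates into tau_hat(x) and tau_hat_ate.
import numpy as np
from scipy.stats import norm

a0, b0 = -0.2, np.array([0.5, 0.3])   # (alpha0_hat, beta0_hat): W_i = 0 group
a1, b1 = 0.4, np.array([0.6, 0.1])    # (alpha1_hat, beta1_hat): W_i = 1 group

x = np.random.default_rng(4).normal(size=(10_000, 2))    # covariate sample

tau_x = norm.cdf(a1 + x @ b1) - norm.cdf(a0 + x @ b0)    # tau_hat(x_i)
print("tau_hat_ate =", round(tau_x.mean(), 3))           # average over x_i
```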
∙ Simple test for the null that W is exogenous. Can show that the score (Lagrange multiplier) test can be obtained as the following variable addition test: (1) Probit of $W_i$ on $1, x_i, z_i$ to get $\hat{\gamma}$ and the generalized residual,
$$gr_i \equiv W_i\lambda(r_i\hat{\gamma}) - (1 - W_i)\lambda(-r_i\hat{\gamma}).$$
(2) Probit of $Y_i$ on $1, x_i, W_i, gr_i$ and use the usual MLE t statistic. Under the null, no need to adjust the standard error for first-stage estimation. A sketch follows.
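∙ A sketch of the variable addition test on hypothetical simulated data:

```python
# Sketch: score test for exogeneity of W in the binary-response model.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(5)
N = 10_000
x, z, c = rng.normal(size=(N, 2)), rng.normal(size=N), rng.normal(size=N)
W = (0.5 * z + x @ [0.3, -0.2] + c > 0).astype(float)
u = 0.6 * c + rng.normal(size=N)                 # endogeneity through c
Y = (0.2 + 1.0 * W + x @ [0.5, 0.4] + u > 0).astype(float)

# (1) probit of W on (1, x, z); generalized residual
r = np.column_stack([np.ones(N), x, z])
idx = r @ sm.Probit(W, r).fit(disp=0).params
gr = W * norm.pdf(idx) / norm.cdf(idx) - (1 - W) * norm.pdf(idx) / norm.cdf(-idx)

# (2) probit of Y on (1, x, W, gr); t statistic on gr is the test
X2 = np.column_stack([np.ones(N), x, W, gr])
res = sm.Probit(Y, X2).fit(disp=0)
print("t stat on gr:", round(res.tvalues[-1], 2))   # large => reject exogeneity
```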
∙ Cannot be theoretically justified as a correction under $H_1$, but one wonders for “local” forms of endogeneity.
∙ Rather than compute $\hat{\tau}_{ate}$ as earlier, can extend Blundell and Powell (2003):
$$\hat{\tau}(x) = N^{-1}\sum_{i=1}^N\left[\Phi(x\hat{\beta} + \hat{\tau} + \hat{\xi}\,gr_i) - \Phi(x\hat{\beta} + \hat{\xi}\,gr_i)\right],$$
where $\hat{\tau}$ is the probit coefficient on $W_i$ and $\hat{\xi}$ is the coefficient on $gr_i$. Then average out across $x_i$ to approximate $\hat{\tau}_{ate}$. (Caution: I really have no idea if or when this works.)
∙ For testing the null, can do lots of different things. For example, with multiple binary treatments $W_{i1}, \ldots, W_{iG}$, can (assuming enough instruments) compute a GR for each $W_{ig}$, then do probit of $Y_i$ on $1, x_i, W_{i1}, \ldots, W_{iG}, gr_{i1}, \ldots, gr_{iG}$ and do a usual Wald test for joint significance of $gr_{i1}, \ldots, gr_{iG}$. Under the null, they should be jointly insignificant.
∙ For a test, could use the Pearson (standardized) residual in place of the generalized residual:
$$pr_i = \frac{W_i - \Phi(r_i\hat{\gamma})}{\sqrt{\Phi(r_i\hat{\gamma})[1 - \Phi(r_i\hat{\gamma})]}}$$
As with adding GR, might this approximate the average treatment effect when averaged out of the second-stage probit?
∙ Recent work [Wooldridge (2008, manuscript)]: if $W_i$ (scalar) is endogenous and follows a probit, can use the same bivariate probit (quasi-) log likelihood when $Y_i$ is a fractional response following a probit mean function, without further restricting the distribution of $Y_i$.
∙ If $Y_g$ is a corner (labor supply, charitable contributions), might specify
$$Y_g = \max(0, \alpha_g + x\beta_g + u_g), \quad g = 0, 1,$$
and use the assumptions as above, except allow $\sigma_0^2$ and $\sigma_1^2$ as variances of $u_0, u_1$. Under the multivariate normality of $(u_0, u_1, c)$ (and independence from $(x, z)$), can use MLE. The estimated average treatment effect as a function of x is
$$\hat{\tau}(x) = \Phi\!\left(\frac{\hat{\alpha}_1 + x\hat{\beta}_1}{\hat{\sigma}_1}\right)(\hat{\alpha}_1 + x\hat{\beta}_1) + \hat{\sigma}_1\phi\!\left(\frac{\hat{\alpha}_1 + x\hat{\beta}_1}{\hat{\sigma}_1}\right) - \Phi\!\left(\frac{\hat{\alpha}_0 + x\hat{\beta}_0}{\hat{\sigma}_0}\right)(\hat{\alpha}_0 + x\hat{\beta}_0) - \hat{\sigma}_0\phi\!\left(\frac{\hat{\alpha}_0 + x\hat{\beta}_0}{\hat{\sigma}_0}\right)$$
and average this across $x_i$ to estimate $\tau_{ate}$. A small helper for this formula follows.
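∙ A small helper (a sketch, not from the slides) evaluating this formula given parameter estimates from the switching Tobit MLE; the inputs shown are hypothetical:

```python
# Sketch: corner-solution (Tobit) ATE from estimated parameters.
import numpy as np
from scipy.stats import norm

def tobit_mean(index, sigma):
    """E[max(0, index + u)] for u ~ Normal(0, sigma^2)."""
    return norm.cdf(index / sigma) * index + sigma * norm.pdf(index / sigma)

def tau_ate_tobit(x, a0, b0, s0, a1, b1, s1):
    """Average tau_hat(x_i) across the sample of covariates."""
    return np.mean(tobit_mean(a1 + x @ b1, s1) - tobit_mean(a0 + x @ b0, s0))

x = np.random.default_rng(6).normal(size=(5_000, 2))   # covariate sample
print(round(tau_ate_tobit(x, 0.1, np.array([0.5, 0.2]), 1.0,
                             0.6, np.array([0.7, 0.1]), 1.2), 3))
```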
∙ For other functional forms, we can get by with fewer restrictions, although often the same parameters are imposed across regimes.
∙ Write $E[Y|W, x, z, r_1] = \exp(x\beta + \tau W + r_1)$, where $r_1$ is an unobservable. Terza’s (1998) control function method is based on
$$E[Y|W, x, z] = \exp(x\beta + \tau W)\,E[\exp(r_1)|W, x, z].$$
∙ When W follows a probit,
$$E[Y|W, x, z] = \exp(x\beta + \tau W)\,h(W, \gamma_0 + x\gamma_1 + z\gamma_2, \rho),$$
where
$$h(W, r\gamma, \rho) = \exp(\rho^2/2)\left[W\,\frac{\Phi(r\gamma + \rho)}{\Phi(r\gamma)} + (1 - W)\,\frac{1 - \Phi(r\gamma + \rho)}{1 - \Phi(r\gamma)}\right].$$
∙ Estimate using a first-stage probit, as before. Then, say, Poisson QMLE with the above mean function to estimate $\beta$, $\tau$, and $\rho$. A sketch follows.
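∙ A sketch of the two-step approach: first-stage probit, then the Poisson quasi-log-likelihood maximized numerically with the corrected mean function. The data-generating design is hypothetical:

```python
# Sketch: Terza-style two-step for an exponential mean with a probit W.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(7)
N = 10_000
x, z, c = rng.normal(size=(N, 2)), rng.normal(size=N), rng.normal(size=N)
W = (0.4 * z + x @ [0.3, -0.2] + c > 0).astype(float)
r1 = 0.5 * c                                    # unobservable tied to c
Y = rng.poisson(np.exp(0.2 + x @ [0.4, 0.3] + 0.7 * W + r1))

# Step 1: probit index
r = np.column_stack([np.ones(N), x, z])
idx = r @ sm.Probit(W, r).fit(disp=0).params

# Step 2: Poisson quasi-log-likelihood with the corrected mean
def negqll(theta):
    eta, b1, b2, tau, rho = theta
    h = np.exp(rho**2 / 2) * (W * norm.cdf(idx + rho) / norm.cdf(idx)
            + (1 - W) * norm.cdf(-idx - rho) / norm.cdf(-idx))
    mu = np.exp(eta + x @ [b1, b2] + tau * W) * h
    return -(Y * np.log(mu) - mu).sum()

theta_hat = minimize(negqll, np.zeros(5), method="BFGS").x
print("tau_hat =", round(theta_hat[3], 3),
      " rho_hat =", round(theta_hat[4], 3))     # approx. 0.7 and 0.5
```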
∙ Then the ATE as a function of x is estimated as
$$\hat{\tau}(x) = \exp(\hat{\rho}^2/2)\left[\exp(x\hat{\beta} + \hat{\tau}) - \exp(x\hat{\beta})\right]$$
∙ What is the score test of $H_0: \rho = 0$ (W is exogenous)? Can show the variable addition test is: (1) Probit of $W_i$ on $1, x_i, z_i$, as usual; (2) Poisson (say) regression of $Y_i$ on $1, x_i, W_i, lgr_i$, where $lgr_i$ is the log version of the usual generalized residual:
$$lgr_i = W_i\log\Phi(r_i\hat{\gamma}) - (1 - W_i)\log\Phi(-r_i\hat{\gamma})$$
∙ IV methods that work for any distribution of W are available [Mullahy (1997)]. If $E[Y|W, x, z, r_1] = \exp(x\beta + \tau W + r_1)$, and $r_1$ is independent of z, then
$$E[\exp(-x\beta - \tau W)Y\,|\,z] = E[\exp(r_1)|z] = 1,$$
where $E[\exp(r_1)] = 1$ is a normalization. The moment conditions are
$$E[\exp(-x\beta - \tau W)Y - 1\,|\,z] = 0.$$
A GMM sketch follows.
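∙ A GMM sketch based on these moment conditions, reusing the simulated count data from the previous sketch; with instruments $(1, x, z)$ the model is just identified:

```python
# Sketch: GMM on E[exp(-x*beta - tau*W)*Y - 1 | z] = 0 (Mullahy-style).
import numpy as np
from scipy.optimize import minimize

Zmat = np.column_stack([np.ones(N), x, z])     # instrument matrix, L = 4

def moments(theta):
    eta, b1, b2, tau = theta
    resid = np.exp(-(eta + x @ [b1, b2] + tau * W)) * Y - 1.0
    return Zmat * resid[:, None]               # N x L moment contributions

def gmm_obj(theta, Wgt):
    gbar = moments(theta).mean(axis=0)
    return gbar @ Wgt @ gbar

step1 = minimize(gmm_obj, np.zeros(4), args=(np.eye(4),), method="BFGS")
G = moments(step1.x)
Wopt = np.linalg.inv(G.T @ G / N)              # optimal weight matrix
step2 = minimize(gmm_obj, step1.x, args=(Wopt,), method="BFGS")
print("tau_hat =", round(step2.x[3], 3))       # compare with 0.7 above
# Note the normalization E[exp(r1)] = 1 is absorbed into the intercept.
```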