Nonparametric Estimation of Triangular Simultaneous Equations Models under Weak Identification

Sukjin Han*
Department of Economics, University of Texas at Austin
[email protected]

February 21, 2017
Abstract

This paper analyzes the problem of weak instruments on identification, estimation, and inference in a simple nonparametric model of a triangular system. The paper derives a necessary and sufficient rank condition for identification, based on which weak identification is established. Then, nonparametric weak instruments are defined as a sequence of reduced-form functions where the associated rank shrinks to zero. The problem of weak instruments is characterized as concurvity and shown to be similar to the ill-posed inverse problem, which motivates the introduction of a regularization scheme. The paper proposes a penalized series estimation method to alleviate the effects of weak instruments and shows that it achieves desirable asymptotic properties. The findings of this paper provide useful implications for empirical work. To illustrate them, Monte Carlo results are presented and an empirical example is given in which the effect of class size on test scores is estimated nonparametrically.

Keywords: Triangular models, nonparametric identification, weak identification, weak instruments, series estimation, inverse problem, regularization, concurvity.
JEL Classification Numbers: C13, C14, C36.

* I am very grateful to my advisors, Donald Andrews and Edward Vytlacil, and committee members, Xiaohong Chen and Yuichi Kitamura, for their inspiration, guidance and support. I am deeply indebted to Donald Andrews for his thoughtful advice throughout the project. The earlier version of this paper has benefited from discussions with Joseph Altonji, Ivan Canay, Philip Haile, Keisuke Hirano, Han Hong, Joel Horowitz, Seokbae Simon Lee, Oliver Linton, Whitney Newey, Byoung Park, Peter Phillips, Andres Santos, and Alex Torgovitsky. I gratefully acknowledge financial support from a Carl Arvid Anderson Prize from the Cowles Foundation. I also thank the seminar participants at Yale, UT Austin, Chicago Booth, Notre Dame, SUNY Albany, Duke, Sogang, SKKU, and Yonsei, as well as the participants at NASM and Cowles Summer Conference.
1 Introduction
Instrumental variables (IVs) are widely used in empirical research to identify and estimate models with endogenous explanatory variables. In linear simultaneous equations models, it is well known that standard asymptotic approximations break down when instruments are weak in the sense that the (partial) correlation between the instruments and endogenous variables is weak. The consequences of and solutions for weak instruments in linear settings have been extensively studied in the literature over the past two decades; see, e.g., Bound et al. (1995), Staiger and Stock (1997), Dufour (1997), Kleibergen (2002, 2005), Moreira (2003), Stock and Yogo (2005), and Andrews and Stock (2007), among others. Weak instruments in nonlinear parametric models have recently been studied in the literature in the context of weak identification by, e.g., Stock and Wright (2000), Kleibergen (2005), Andrews and Cheng (2012), Andrews and Mikusheva (2016b,a), Andrews and Guggenberger (2015), and Han and McCloskey (2015). One might expect that nonparametric models with endogenous explanatory variables will generally require stronger identification power than parametric models, as there is an infinite number of unknown parameters to identify, and hence, stronger instruments may be required.1 Despite the problem's importance and the growing popularity of nonparametric models, weak instruments in nonparametric settings have not received much attention.2 Furthermore, surprisingly little attention has been paid to the consequences of weak instruments in empirical research using nonparametric models; see below for references. Part of the neglect is due to the existing complications embedded in nonparametric models.
In a simple nonparametric framework, this paper analyzes the problem of weak instruments on identification, estimation, and inference, and proposes an estimation strategy to mitigate the effect. Identification results are obtained so that the concept of weak identification can subsequently be introduced via localization. The problem of weak instruments is characterized as concurvity and is shown to be similar to the ill-posed inverse problem. An estimation method is proposed through regularization, and the resulting estimators are shown to have desirable asymptotic properties even when instruments are possibly weak.
As a nonparametric framework, we consider a triangular simultaneous equations model. The specification of weak instruments is intuitive in the triangular model because it has an explicit reduced-form relationship. Additionally, a clear interpretation of the effect of weak instruments can be made through a specific structure produced by the control function approach. To make our analysis succinct, we specify additive errors in the model. This particular model is considered in Newey et al. (1999) (NPV) and Pinkse (2000) in a situation without weak instruments.

1 This conjecture is shown to be true in the setting considered in this paper; see Theorem 5.1 and Corollary 5.2.
2 Chesher (2003, 2007) mentions the issue of weak instruments in applying his key identification condition in the empirical example of Angrist and Krueger (1991). Blundell et al. (2007) determine whether weak instruments are present in the Engel curve dataset of their empirical section. They do this by applying the Stock and Yogo (2005) test developed in linear models to their reduced form, which is linearized by sieve approximation. Darolles et al. (2011) briefly discuss weak instruments that are indirectly characterized within their source condition.
Although relatively recent developments in nonparametric triangular models contribute to models with nonseparable errors (e.g., Imbens and Newey (2009), Kasy (2014)), such flexibility complicates the exposition of the main results of this paper.3 Also, having a form analogous to its popular parametric counterpart, the model with additive errors is broadly used in applied research, such as Blundell and Duncan (1998), Yatchew and No (2001), Lyssiotou et al. (2004), Dustmann and Meghir (2005), Skinner et al. (2005), Blundell et al. (2008), Del Bono and Weber (2008), Frazer (2008), Mazzocco (2012), Coe et al. (2012), Breza (2012), Henderson et al. (2013), Chay and Munshi (2014), and Koster et al. (2014).
One of the contributions of this paper is that it derives novel identification results in nonparametric triangular models that complement the existing results in the literature. With a mild support condition, we show that a particular rank condition is necessary and sufficient for the identification of the structural relationship. This rank condition is substantially weaker than what is established in NPV. The rank condition covers economically relevant situations such as outcomes resulting from corner solutions or kink points in certain economic models. More importantly, deriving such a rank condition is the key to establishing the notion of weak identification. Since the condition is minimal, a "slight violation" of it has a binding effect on identification, hence resulting in weak identification. To characterize weak identification, we consider a drifting sequence of reduced-form functions that converges to a non-identification region, namely, a space of reduced-form functions that violate the rank condition for identification. A particular rate is designated relative to the sample size, which effectively measures the strength of the instruments, so that it appears in asymptotic results for the estimator of the structural function. The concept of nonparametric weak instruments generalizes the concept of weak instruments in linear models, such as in Staiger and Stock (1997).
In the nonparametric control function framework, the problem of weak instruments becomes a nonparametric analogue of a multicollinearity problem known as concurvity (Hastie and Tibshirani (1986)). Once the endogeneity is controlled by a control function, the model can be rewritten as an additive nonparametric regression, where the endogenous variables and reduced-form errors comprise two regressors, and weak instruments result in the variation of the former regressor being mainly driven by the variation of the latter. This problem of concurvity is related to the ill-posed inverse problem inherent in other nonparametric models with endogeneity or, in general, to settings where smoothing operators are involved; see Carrasco et al. (2007) for a survey of inverse problems. Although the sources of ill-posedness in the two problems are different, there is sufficient similarity that the regularization methods used in the literature to solve the ill-posed inverse problem can be introduced to our problem. Due to the problems' distinct features, however, among the regularization methods, only penalization (i.e., Tikhonov-type regularization) alleviates the effect of weak instruments.
3 For instance, the control function employed in Imbens and Newey (2009) requires large variation in instruments, and hence discussing weak instruments (i.e., weak association between endogenous variables and instruments, or little variation in instruments) in such a context requires more care.
This paper proposes a penalized series estimator for the structural function and establishes its asymptotic properties. Our results on the rate of convergence of the estimator suggest that, without penalization, weak instruments characterized as concurvity slow down the overall convergence rate, exacerbating bias and variance "symmetrically." We then show that a faster convergence rate is achieved with penalization. In controlling the penalty bias, we introduce a high-level condition that restricts the weakness of instruments relative to the smoothness of the unknown function. This condition is related to source conditions (Engl et al. (1996)) in the literature on ill-posed inverse problems. We also derive consistency and asymptotic normality with mildly weak instruments.
The problem of concurvity in additive nonparametric models is also recognized in the literature, where different estimation methods are proposed to address the problem—e.g., the backfitting methods (Linton (1997), Nielsen and Sperlich (2005)) and the integration method (Jiang et al. (2010)); also see Sperlich et al. (1999). In particular, as work closely related to the asymptotic results of this paper, Jiang et al. (2010) establish pointwise asymptotic normality for local linear and integral estimators in an additive nonparametric model with highly correlated covariates. In the present paper, where an additive model results from a triangular model accompanied by the control function approach, the problem of concurvity is addressed in a more direct manner via penalization. In addition, although the main conclusions of this paper do not depend on the choice of nonparametric estimation method, using series estimation in our penalization procedure is also justified in the context of design density. In situations where the joint density of x and v becomes singular, such as in our case with weak instruments, it is known that series and local linear estimators are less sensitive than conventional kernel estimators; see, e.g., Hengartner and Linton (1996) and Imbens and Newey (2009) for related discussions.
Another possible nonparametric framework in which to examine the problem of weak instruments is a nonparametric IV (NPIV) model (Newey and Powell (2003), Hall and Horowitz (2005), and Blundell et al. (2007), among others). Unlike in a triangular model, the absence of an explicit reduced-form relationship forces weak instruments in this setting to be characterized as a part of the ill-posed inverse problem. Therefore, in this model, the performance of the estimator can deteriorate severely, as the problem is "doubly ill-posed."4 Further, it may also be hard to separate the effects of the two in asymptotic theory. As related recent work, Freyberger (2015) provides a framework by which the completeness condition can be tested in an NPIV model. Instead of using a drifting sequence of distributions, he indirectly defines weak instruments as a failure of a restricted version of the completeness condition. While he applies his framework to test weak instruments, our focus is on estimation and inference of the function of interest in a different nonparametric model with a more explicit definition of weak instruments.
The findings of this paper provide useful implications for empirical work. First, when estimating a nonparametric structural function, the results of IV estimation and subsequent inference can be misleading even when the instruments are strong in terms of conventional criteria for linear models.5 Second, the symmetric effect of weak instruments on bias and variance implies that the bias–variance trade-off is the same across different strengths of instruments, and hence, weak instruments cannot be alleviated by exploiting the trade-off. Third, penalization, on the other hand, can alleviate weak instruments by significantly reducing variance and sometimes bias as well. Fourth, there is a trade-off between the smoothness of the structural function (or the dimensionality of its argument) and the requirement of strong instruments. Fifth, if a triangular model along with its assumptions is considered to be reasonable, it makes the data more informative about the relationship of interest than an NPIV model does, which is an attractive feature, especially in the presence of weak instruments. Sixth, although a linear first-stage reduced form is commonly used in applied research (e.g., in NPV, Blundell and Duncan (1998), Blundell et al. (1998), Dustmann and Meghir (2005), Coe et al. (2012), and Henderson et al. (2013)), the strength of instruments can be improved by having a nonparametric reduced form so that the nonlinear relationship between the endogenous variable and instruments can be fully exploited. The last point is related to the identification results of this paper. In Section 8, we apply the findings of this paper to an empirical example, where we nonparametrically estimate the effect of class size on students' test scores.
The rest of the paper is organized as follows. Section 2 introduces the model and obtains new identification results. Section 3 discusses weak identification, and Section 4 relates the weak instrument problem to the ill-posed inverse problem and defines our penalized series estimator. Sections 5 and 6 establish the rate of convergence and consistency of the penalized series estimator and the asymptotic normality of some functionals of it. Section 7 presents the Monte Carlo simulation results. Section 8 discusses the empirical application. Finally, Section 9 concludes.

4 In Section 8, we illustrate this point in an empirical application by comparing estimates calculated from the triangular and NPIV models.
5 For instance, in Coe et al. (2012), the first-stage F-statistic value that is reported is (sometimes barely) in favor of strong instruments, but the judgment is based on the criterion for linear models. The majority of empirical works referenced above do not report first-stage results.
2 Identification
We consider a nonparametric triangular simultaneous equations model

y = g0(x, z1) + ε,    x = Π0(z) + v,    (2.1a)
E[ε|v, z] = E[ε|v] a.s.,    E[v|z] = 0 a.s.,    (2.1b)

where g0(·, ·) is an unknown structural function of interest, Π0(·) is an unknown reduced-form function, x is a dx-vector of endogenous variables, z = (z1, z2) is a (dz1 + dz2)-vector of exogenous variables, and z2 is a vector of excluded instruments. The stochastic assumptions in (2.1b) are more general than the assumption of full independence between (ε, v) and z with E[v] = 0. Following the
control function approach,

E[y|x, z] = g0(x, z1) + E[ε|Π0(z) + v, z] = g0(x, z1) + E[ε|v] = g0(x, z1) + λ0(v),    (2.2)

where λ0(v) = E[ε|v] and the second equality is from the first part of (2.1b). In effect, we capture endogeneity (E[ε|x, z] ≠ 0) by an unknown function λ0(v), which serves as a control function. Once
v is controlled for, the only variation of x comes from the exogenous variation of z. Based on equation (2.2), we establish identification, weak identification, and estimation results.
First, we obtain identification results that complement the results of NPV. For useful comparisons, we first restate the identification condition of NPV, which is written in terms of Π0(·). Given (2.2), the identification of g0(x, z1) is achieved if one can separately vary (x, z1) and v in g(x, z1) + λ(v). Since x = Π0(z) + v, a suitable condition on Π0(·) will guarantee this via the separate variation of z and v. In light of this intuition, NPV propose the following identification condition.

Proposition 2.1 (Theorem 2.3 in NPV) If g(x, z1), λ(v), and Π(z) are differentiable, the boundary of the support of (z, v) has probability zero, and

Pr[rank(∂Π0(z)/∂z2′) = dx] = 1,    (2.3)

then g0(x, z1) is identified up to an additive constant.

The identification condition can be seen as a nonparametric generalization of the rank condition. One can readily show that the order condition (dz2 ≥ dx) is incorporated in this rank condition.
Note that this condition is only a sufficient condition, which suggests that the model can possibly be identified with a relaxed rank condition. This observation motivates our identification analysis. We find a necessary and sufficient rank condition for identification by introducing a mild support condition. The identification analysis of this section is also important for our later purpose of defining the notion of weak identification.
Henceforth, in order to keep our presentation succinct, we focus on the case where the included exogenous variable z1 is dropped from model (2.1) and z = z2. With z1 included, all the results of this paper readily follow along similar lines; e.g., the identification analysis follows conditional on z1. We first state and discuss the assumptions that we impose.

Assumption ID1 The functions g0(x), λ0(v), and Π0(z) are continuously differentiable in their arguments.

This condition is also assumed in Proposition 2.1 above. Before stating a key additional assumption for identification, we first define the supports that are associated with x and z. Let X ⊂ R^{dx} and Z ⊂ R^{dz} be the marginal supports of x and z, respectively. Also, let Xz be the conditional support of x given z ∈ Z. We partition Z into two regions: one where the rank condition is satisfied, i.e., where z is relevant, and its complement.
Definition 2.1 (Relevant set) Let Z^r be the subset of Z defined by

Z^r = Z^r(Π0(·)) = { z ∈ Z : rank(∂Π0(z)/∂z′) = dx }.

Let Z⁰ = Z\Z^r be the complement of the relevant set. Let X^r be the subset of X defined by X^r = {x ∈ Xz : z ∈ Z^r}.

Given the definitions, we introduce an additional support condition.
Assumption ID2 The supports X and X^r differ only on a set of probability zero, i.e., Pr[x ∈ X\X^r] = 0.
Intuitively, when z is in the relevant set, x = Π0(z) + v varies as z varies, and therefore, the support of x corresponding to the relevant set is large. Assumption ID2 assures that the corresponding support is large enough to almost surely cover the entire support of x. ID2 is not as strong as it may appear to be. Below, we show this by providing mild sufficient conditions for ID2. If we identify g0(x) for any x ∈ X^r, then we achieve identification of g0(x) by Assumption ID2.6
Now, in order to identify g0(x) for x ∈ X^r, we need a rank condition, which will be minimal. The following is the identification result:
Theorem 2.2 In model (2.1), suppose Assumptions ID1 and ID2 hold. Then, g0(x) is identified on X up to an additive constant if and only if

Pr[rank(∂Π0(z)/∂z′) = dx] > 0.    (2.4)
This and all subsequent proofs can be found in the Appendix.
The rank condition (2.4) is necessary and sufficient. By Definition 2.1, it can alternatively be written as Pr[z ∈ Z^r] > 0. The condition is substantially weaker than (2.3) in Proposition 2.1, which is Pr[z ∈ Z^r] = 1 (with z = z2). That is, Theorem 2.2 extends the result of NPV in the sense that when Z^r = Z, ID2 is trivially satisfied with X = X^r. Theorem 2.2 shows that it is enough for the identification of g0(x) to have any fixed positive probability with which the rank condition is satisfied.7 This condition can be seen as a local rank condition as in Chesher (2003), and we achieve global identification with a local rank condition. Although this gain comes from having the additional support condition, the trade-off is appealing given the later purpose of building a weak identification notion. Even without Assumption ID2, maintaining the assumptions of Theorem 2.2, we still achieve identification of g0(x), but on the set {x ∈ X^r}.

6 The support on which an unknown function is identified is usually left implicit in the literature. To make it more explicit, g0(x) is identified if g0(x) is identified on the support of x almost surely.
7 A similar condition appears in the identification analysis of Hoderlein (2009), where endogenous semiparametric binary choice models are considered in the presence of heteroskedasticity.
Lastly, in order to identify the level of g0(x), we need to introduce some normalization, as in NPV. Either E[ε] = 0 or λ0(v̄) = λ̄ suffices to pin down g0(x). With the latter normalization, it follows that g0(x) = E[y|x, v = v̄] − λ̄, which we apply in estimation as it is convenient to implement.
The following is a set of sufficient conditions for Assumption ID2. Let Vz be the conditional support of v given z ∈ Z.

Assumption ID2′ Either (a) or (b) holds. (a) (i) x is univariate, and x and v are continuously distributed; (ii) Z is a cartesian product of connected intervals; and (iii) Vz = Vz̃ for all z, z̃ ∈ Z⁰; (b) Vz = R^{dx} for all z ∈ Z.
Lemma 2.1 Under Assumption ID1, Assumption ID2′ implies Assumption ID2.

In Assumption ID2′, the continuity of the random variables is closely related to the support condition in Proposition 2.1 that the boundary of the support of (z, v) has probability zero. For example, when z or v is discrete, their condition does not hold. Assumption ID2′(a)(i) assumes that the endogenous variable is univariate, which is the most empirically relevant case in nonparametric models. An additional condition is required with multivariate x, which is omitted in this paper. Even under ID2′(a)(i), however, the exogenous covariate z1 in g(x, z1), which is omitted in the discussion, can still be a vector. ID2′(a)(ii) and (iii) are rather mild. ID2′(a)(ii) assumes that z has a connected support, which in turn requires that the excluded instruments vary smoothly. The assumptions on the continuity of the random variables and the connectedness of Z are also useful in deriving the asymptotic theory of the series estimator considered in this paper; see Assumption B below. ID2′(a)(iii) means that the conditional support of v given z is invariant when z is in Z⁰. This support invariance condition is the key to obtaining a rank condition that is considerably weaker than that of NPV. Our support invariance condition is different from the support invariance condition introduced in Imbens and Newey (2009). Using the notation of the present paper, Imbens and Newey (2009) require that the support of v conditional on x equals the marginal support of v, which inevitably requires a large support of z. On the other hand, ID2′(a)(iii) requires that the support of v conditional on z is invariant (for z ∈ Z⁰), and therefore imposes no restriction on the support of z. Also, the conditional support does not have to equal the marginal support of v here. ID2′(a)(iii), along with the control function assumptions (2.1b), is a weaker orthogonality condition for z than the full independence condition z ⊥ v. Note that Vz = {x − Π0(z) : x ∈ Xz}. Therefore, ID2′(a)(iii) equivalently means that Xz is invariant for those z satisfying E[x|z] = const. Moreover, one can introduce a condition that is weaker than ID2′(a)(iii): Xz ⊂ X^r for those z satisfying E[x|z] = const.8 These conditions in terms of Xz can be checked from the data.

8 Therefore, some types of heteroskedasticity of v can be allowed under ID2′(a)(iii) or under this weaker condition.
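Since these conditions are stated in terms of observable quantities, one can probe them empirically. The following is a minimal sketch (not from the paper; the Nadaraya–Watson regression, the bandwidth h, and the flatness threshold eps are illustrative assumptions) of comparing the local supports of x across values of z where E[x|z] is approximately constant:

import numpy as np

def conditional_mean(z_grid, z, x, h=0.1):
    # Nadaraya-Watson estimate of E[x|z] with a Gaussian kernel
    w = np.exp(-0.5 * ((z_grid[:, None] - z[None, :]) / h) ** 2)
    return (w * x[None, :]).sum(axis=1) / w.sum(axis=1)

def support_ranges(z, x, eps=0.05, h=0.1, n_grid=50):
    # Compare the empirical supports X_z across z-values where E[x|z] is flat
    z_grid = np.linspace(z.min(), z.max(), n_grid)
    slope = np.gradient(conditional_mean(z_grid, z, x, h), z_grid)
    ranges = []
    for z0 in z_grid[np.abs(slope) < eps]:   # candidate points of Z^0
        mask = np.abs(z - z0) < h            # local sample around z0
        if mask.sum() > 10:
            ranges.append((z0, x[mask].min(), x[mask].max()))
    return ranges  # roughly equal (min, max) pairs are consistent with ID2'(a)(iii)

Roughly equal ranges across the flat region are consistent with the support invariance condition; the tuning choices here are ad hoc and only meant to illustrate the idea.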
Figure 1: Identification under Assumption ID2′(a), univariate z and no z1.

Given ID2′(b), that v has a full conditional support, ID2 is trivially satisfied, and no additional restriction is imposed on the joint support of z and v. ID2′(b) also does not require univariate x or the connectedness of Z. This assumption on Vz is satisfied with, for example, a normally distributed error term (conditional on regressors).
Figure 1 illustrates the intuition of the identification proof under ID2′(a) in a simple case where z is univariate. With ID2′(b), the analysis is even more straightforward; see the proof of Lemma 2.1 in the Appendix. In the figure, the local rank condition (2.4) ensures global identification of g0(x). The intuition is as follows. First, by ∂E[y|v, z]/∂z = (∂g0(x)/∂x) · (∂Π0(z)/∂z) and the rank condition, g0(x) is locally identified at the x corresponding to a point of z in the relevant set Z^r. As such a point of z varies within Z^r, the x corresponding to it also varies enough to cover almost the entire support of x. At the same time, for any x corresponding to an irrelevant z (i.e., z outside of Z^r), one can always find z inside of Z^r that gives the same value of such an x. The probability Pr[z ∈ Z^r] being small but bounded away from zero only affects the efficiency of estimators in the estimation stage. This issue is related to the weak identification concept discussed later.
Note that the strength of identification of g0(x) is different for different subsets of X. For instance, identification must be strong in a subset of X corresponding to a subset of Z where Π0(·) is steep. In addition, g0(x) is over-identified on a subset of X that corresponds to multiple subsets of Z where Π0(·) has a nonzero slope, since each association of x and z contributes to identification. This discussion implies that the shape of Π0(·) provides useful information on the strength of identification in different parts of the domain of g0(x). Lastly, it is worth mentioning that the separable structure of the reduced form, along with ID2′(a)(iii), allows "global extrapolation" in a manner that is analogous to that in a linear model.
The identification results of this section apply to economically relevant situations. Let x be an economic agent's optimal decision induced by an economic model and z be a set of exogenous components in the model that affect the decision x. One is interested in a nonlinear effect of the
optimal choice on a certain outcome y in the model. We present two situations where the resulting Pr[z ∈ Z^r] is strictly less than unity in this economic problem: (a) x is realized as a corner solution beyond a certain range of z. In a returns-to-schooling example, x can be the schooling decision of a potential worker, z the tuition cost or distance to school, and y the future earnings. When the tuition cost is too high or the distance to school is too far beyond a certain threshold, such an instrument may no longer affect the decision to go to school. (b) The budget set has kink points. In a labor supply curve example, x is the before-tax income, which is determined by the labor supply decision, z the worker's characteristics that shift her utility function, and y the wage. If an income tax schedule has kink points, then the x realized at such points will possibly be invariant to shifts of the utility. The identification results of this paper imply that even in these situations, the returns to schooling or the labor supply curve can be fully identified nonparametrically as long as Pr[z ∈ Z^r] > 0.
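To make case (a) concrete, consider a stylized reduced form (a hypothetical illustration, not taken from the paper) in which the instrument matters only below a threshold z̄:

\Pi_0(z) = \pi \min\{z, \bar{z}\}, \qquad \pi \neq 0,

so that \partial \Pi_0(z)/\partial z = \pi for z < \bar{z} and 0 for z > \bar{z}. Condition (2.3) of NPV fails whenever \Pr[z > \bar{z}] > 0, yet the rank condition (2.4) holds as long as \Pr[z < \bar{z}] > 0, so g_0 remains identified on the whole support of x under Assumption ID2.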
3 Weak Identification
The previous section discusses the structure of the joint distribution of x and z that contributes to the identification of g0(·). Specifically, (2.4) imposes a minimal restriction on the shape of the conditional mean function E[x|z] = Π0(z). This necessity result suggests that a "slight violation" of (2.4) will result in weak identification of g0(·). Note that this approach will not be successful with (2.3) of NPV, since violating that condition, i.e., Pr[rank(∂Π0(z)/∂z′) = dx] < 1, can still result in strong identification.
In this section, we formally construct the notion of weak identification via localization. We define nonparametric weak instruments as a drifting sequence of reduced-form functions that are localized around a function with no identification power. Such a sequence of models or drifting data-generating process (Davidson and MacKinnon (1993)) is introduced to define weak instruments relative to the sample size n. As a result, the strength of instruments is represented in terms of the rate of localization, and hence, it can eventually be reflected in the local asymptotics of the estimator of g0(·).
Let C(Z) be the class of conditional mean functions Π(·) on Z that are bounded, Lipschitz, and continuously differentiable. Define a non-identification region C0(Z) as the class of functions that satisfy the lack-of-identification condition motivated by (2.4):9 C0(Z) = {Π(·) ∈ C(Z) : Pr[rank(∂Π(z)/∂z′) < dx] = 1}. Define an identification region as C1(Z) = C(Z)\C0(Z). We consider a sequence of triangular models y = g0(x) + ε and x = Πn(z) + v with corresponding stochastic assumptions. Although g(x) is identified with Πn(·) ∈ C1(Z) for any fixed n by Theorem 2.2, g(x) is only weakly identified as Πn(·) drifts toward a function Π̄(·) in C0(Z). Namely, the noise (i.e., v) contributes more than the signal (i.e., Πn(z)) to the total variation of x ∈ {Πn(z) + v : z ∈ Z, v ∈ V} as n → ∞. In order to facilitate a meaningful asymptotic theory in which the effect of weak instruments is reflected, we further proceed by considering a specific sequence of Πn(·).

9 The lack-of-identification condition is satisfied either when the order condition fails (dz < dx) or when the z are jointly irrelevant for one or more elements of x, almost everywhere in their support.
Assumption L (Localization) For some β > 0, the true reduced-form function Πn(·) satisfies the following. For some Π̃(·) ∈ C1(Z) that does not depend on n and for z ∈ Z,

∂Πn(z)/∂z′ = n^{−β} · ∂Π̃(z)/∂z′ + op(n^{−β}).    (3.1)

Assumption L is equivalent to Πn(z) = n^{−β} · Π̃(z) + c + op(n^{−β}) for some constant vector c. This specification of a uniformly convergent sequence over Z can be justified by our identification analysis. The "local nesting" device in (3.1) is also used in Stock and Wright (2000) and Jun and Pinkse (2012), among others. In contrast to these papers, the value of β measures the strength of identification here and is not specified to be 1/2.10 Unlike with a linear reduced form, to characterize weak instruments in a more general nonparametric reduced form we need to control the complete behavior of the reduced-form function, and the derivation of local asymptotic theory is more demanding. Nevertheless, the particular sequence considered in Assumption L makes the weak instrument asymptotic theory straightforward while embracing the most interesting local alternatives against non-identification.11

10 It would be interesting to have different rates across columns or rows of ∂Πn(·)/∂z′. One can also consider different rates for different elements of the matrix. The analyses in these cases can be done analogously by slight modifications of the arguments.
11 In defining weak instruments in Assumption L, one can consider an intermediate case where ∂Πn(·)/∂z′ converges to a matrix with reduced rank rather than zero rank. Extending the analysis to this case follows analogously but is omitted from the paper for succinctness.
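For comparison with the linear literature, note how Assumption L nests the familiar local-to-zero device (a special case, not the paper's general setting): taking \tilde{\Pi}(z) = Cz for a full-rank matrix C and \beta = 1/2 gives

\Pi_n(z) = n^{-1/2} C z + c + o_p(n^{-1/2}),

which is exactly the weak-instrument sequence of Staiger and Stock (1997). Assumption L leaves \beta free, so it also covers more mildly (\beta < 1/2) and more severely (\beta > 1/2) weak instruments.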
4 Estimation
Once the endogeneity is controlled by the control function in (2.2), the problem becomes one of estimating the additive nonparametric regression function E[y|x, z] = g0(x) + λ0(v). In a weak instrument environment, however, we face a nonstandard problem called concurvity: x = Πn(z) + v → v a.s. as n → ∞ under the weak instrument specification (3.1) of Assumption L with c = 0 as normalization. With a series representation g0(x) + λ0(v) = Σ_{j=1}^∞ {β1j pj(x) + β2j pj(v)}, where the pj(·)'s are the approximating functions, it becomes a familiar problem of multicollinearity, as pj(x) → pj(v) a.s. for all j. More precisely, pj(x) − pj(v) = O_{a.s.}(n^{−β}) by the mean value expansion pj(v) = pj(x − n^{−β}Π̃(z)) = pj(x) − n^{−β}Π̃(z)∂pj(x̃)/∂x with an intermediate value x̃. Alternatively, by plugging this expression of pj(v) back into the series, we can see that the variation of the regressor shrinks as n → ∞. This feature is reminiscent of the ill-posed inverse problem that commonly occurs,
e.g., in estimating a standard NPIV model of y = g0(x) + ε with endogenous x and E[ε|z] = 0. By writing g0(x) = Σ_{j=1}^∞ β1j pj(x), it follows that E[y|z] = Σ_{j=1}^∞ β1j E[pj(x)|z]. Analogous to the weak instruments problem, the variation of the regressor E[pj(x)|z] shrinks, since E[E[pj(x)|z]²] → 0 as j → ∞ (Kress (1999, p. 235)). Blundell and Powell (2003, p. 321) also acknowledge that the ill-posed inverse problem is a functional analogue to the multicollinearity problem.
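The mechanics of concurvity can be seen in a small simulation. The following sketch is illustrative only; the choices Π̃(z) = sin(πz), a cubic polynomial basis, and β = 0.4 are assumptions for the demonstration. It shows the minimum eigenvalue of the second moment matrix of the additive basis shrinking toward zero as n grows:

import numpy as np

rng = np.random.default_rng(0)
beta = 0.4                                     # localization rate (assumed)
for n in [100, 1000, 10000]:
    z = rng.uniform(-1, 1, n)
    v = rng.normal(0, 1, n)
    x = n ** (-beta) * np.sin(np.pi * z) + v   # Pi_n(z) = n^{-beta} sin(pi z), c = 0
    # additive polynomial basis: [1, x, x^2, x^3, v, v^2, v^3]
    P = np.column_stack([np.ones(n), x, x**2, x**3, v, v**2, v**3])
    Q = P.T @ P / n
    print(n, np.linalg.eigvalsh(Q).min())      # minimum eigenvalue -> 0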
Given the connections between weak instruments, concurvity, and ill-posed inverse problems, the regularization methods used in the realm of research concerning inverse problems are suitable for use with weak instruments. There are two types of regularization methods used in the literature: the truncation method and the penalization method.12 In this paper, we introduce the penalization scheme. The nature of our problem is such that the truncation method does not work properly. Unlike in the ill-posed inverse problem, the estimators of β1j and β2j above can still be unstable even after truncating the series, since we still have pj(x) → pj(v) a.s. for j ≤ J < ∞ as n → ∞. On the other hand, penalization directly controls the behavior of the β1j's and β2j's, and hence, it successfully regularizes the weak instrument problem.
We propose a penalized series estimation procedure for h0(w) = g0(x) + λ0(v), where w = (x, v).
We choose to use series estimation rather than other nonparametric methods as it is more suitable in our particular framework. Because x → v a.s., the joint density of w becomes concentrated along a lower-dimensional manifold as n tends to infinity. As mentioned in the introduction, series estimators are less sensitive to this problem of singular density. See Assumption B below for related discussions. Furthermore, with series estimation, it is easy to impose the additivity of h0(·) and to characterize the problem of weak instruments as a multicollinearity problem.
The estimation procedure takes two steps. In the first stage, we estimate the reduced form Πn(·) using a standard series estimation method and obtain the residual v̂. In the second stage, we estimate the structural function h0(·) using a penalized series estimation method with ŵ = (x, v̂) as the regressors. The theory that follows uses orthonormal polynomials and regression splines as approximating functions.13 Let {(yi, xi, zi)}_{i=1}^n be the data with n observations, and let rL(zi) = (r1(zi), ..., rL(zi))′ be a vector of approximating functions of order L for the first stage. Define the n×L matrix R = (rL(z1), ..., rL(zn))′. Then, regressing xi on rL(zi) gives Π̂(·) = rL(·)′π̂, where π̂ = (R′R)^{−1}R′(x1, ..., xn)′, and we obtain v̂i = xi − Π̂(zi). Define a vector of approximating functions of order K for the second stage as pK(w) = (p1(w), ..., pK(w))′. To reflect the additive structure of h0(·), there are no interaction terms between the approximating functions for g0(·) and those for λ0(·) in this vector; see the Appendix for the explicit expression. Denote the n×K matrix of approximating functions as P̂ = (pK(ŵ1), ..., pK(ŵn))′, where ŵi = (xi, v̂i). Note that L = L(n) and K = K(n) grow with n.

12 In Chen and Pouzo (2012), closely related concepts are used under different terminologies: minimizing a criterion over a finite sieve space and minimizing a criterion over an infinite sieve space with a Tikhonov-type penalty.
13 For detailed descriptions of power series and regression splines, see Newey (1997).
We define a penalized series estimator:

ĥτ(w) = pK(w)′β̂τ,    (4.1)

where the "interim" estimator β̂τ optimizes a penalizing criterion function

β̂τ = argmin_{β̃ ∈ R^K} (y − P̂β̃)′(y − P̂β̃)/n + τn β̃′β̃,    (4.2)

where y = (y1, ..., yn)′ and τn ≥ 0 is the penalization parameter. To control bias, τn is assumed to converge to zero. The optimization problem (4.2) yields a closed-form solution: β̂τ = (P̂′P̂ + nτn IK)^{−1} P̂′y.
The multicollinearity feature discussed above is manifested here by the fact that the matrix P̂′P̂ is nearly singular under Assumption L, since two groups of columns of P̂ become nearly identical. In terms of the population second moment matrix Q = E[pK(wi)pK(wi)′], the challenge is that the minimum eigenvalue of Q is not bounded away from zero, which is manifested as λmax(Q^{−1}) = O(n^{2β}) (shown in Lemma A.2 in the Appendix), where λmax denotes the maximum eigenvalue. The term nτn IK mitigates such singularity, without which the performance of the estimator of h0(·) would deteriorate severely.14 The relative effects of weak instruments (n^{2β}) and penalization (τn) will determine the asymptotic performance of ĥτ(·). Given ĥτ(·), with the normalization that λ(v̄) = λ̄, we have ĝτ(x) = ĥτ(x, v̄) − λ̄.
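For concreteness, here is a minimal sketch of the two-stage procedure using the closed-form solution above. The polynomial bases, the tuning values, and the normalization point v̄ = 0 are illustrative assumptions, not the paper's recommended choices:

import numpy as np

def poly_basis(u, deg):
    return np.column_stack([u ** j for j in range(1, deg + 1)])

def penalized_series_fit(y, x, z, L=4, K=3, tau=1e-3):
    n = len(y)
    # First stage: series regression of x on r^L(z); residuals v_hat
    R = np.column_stack([np.ones(n), poly_basis(z, L)])
    v_hat = x - R @ np.linalg.lstsq(R, x, rcond=None)[0]
    # Second stage: additive basis in (x, v_hat), no interaction terms
    P = np.column_stack([np.ones(n), poly_basis(x, K), poly_basis(v_hat, K)])
    # Closed form from (4.2): beta_tau = (P'P + n*tau*I)^{-1} P'y
    beta = np.linalg.solve(P.T @ P + n * tau * np.eye(P.shape[1]), P.T @ y)
    def g_hat(x0, v_bar=0.0):
        # g(x) = h(x, v_bar) up to the normalization constant lambda_bar
        p0 = np.concatenate([[1.0], poly_basis(np.atleast_1d(x0), K).ravel(),
                             poly_basis(np.atleast_1d(v_bar), K).ravel()])
        return p0 @ beta
    return g_hat

Setting tau=0 recovers the unpenalized series estimator, whose design matrix is nearly singular under weak instruments, which is precisely where the ridge-type term n·tau·I stabilizes the solve.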
5 Consistency and Rate of Convergence
First, we state the regularity conditions and key preliminary results under which we find the rate of convergence of the penalized series estimator introduced in the previous section. Let X = (x, z).

Assumption A {(yi, xi, zi) : i = 1, 2, ...} are i.i.d., and var(x|z) and var(y|X) are bounded functions of z and X, respectively.

Assumption B (z, v) is continuously distributed with density that is bounded away from zero on Z × V, and Z × V is a cartesian product of compact, connected intervals.

Assumption B is useful to bound the eigenvalues of the "transformed" second moment matrix of approximating functions from below and above. This condition is worth discussing in the context of identification and weak identification. Let fu and fw denote the density functions of u = (z, v) and w = (x, v), respectively. An identification condition like Assumption ID2′ in Section 2 is embodied in Assumption B. To see this, note that fu being bounded away from zero means that there is no functional relationship between z and v, which in turn implies Assumption ID2′(a)(iii).15 On the other hand, an assumption written in terms of fw, like Assumption 2 in NPV (p. 574), cannot be imposed here. Observe that w = (Πn(z) + v, v) depends on the behavior of Πn(·), and hence fw is not bounded away from zero uniformly over n under Assumption L and approaches a singular density. Technically, making use of a transformation matrix (see the Appendix), an assumption is made in terms of fu, which is not affected by weak instruments, and the effect of weak instruments can be handled separately in the asymptotic proofs. Note that the assumption of the Cartesian product of supports Z × V and its compactness can be replaced by introducing a trimming function, as in NPV, that ensures bounded rectangular supports. Assumption B can be weakened to hold only for some components of the distribution of z; some components of z can be allowed to be discrete as long as they have finite supports.
Next, Assumption C is a smoothness assumption on the structural and reduced-form functions. Let W be the support of w = (x, v). Let Fx, Fv, and Fw be the distribution functions of x, v, and w, respectively.

14 In linear settings, the introduction of a regularization method is less appealing, as it creates the well-known biased estimator of ridge regression. In contrast, we do not directly interpret β̂τ in the current nonparametric setting, since it is only an interim estimator calculated to obtain ĥτ(·). More importantly, the overall bias of ĥτ(·) is unlikely to be worsened, in the sense that the additional bias introduced by penalization can be dominated by the existing series estimation bias.
15 The definition of a functional relationship can be found, e.g., in NPV (p. 568).
Assumption C g0(x) and λ0(v) are in L²(Fx) and L²(Fv), respectively, and are Lipschitz and continuously differentiable of order s on W. Πn(z) is bounded, Lipschitz, and continuously differentiable of order sπ on Z.

This assumption ensures that the series approximation error shrinks as the number of approximating functions increases. Let Σ_{j=1}^K |⟨h0, pj⟩|² ≤ ν(K) for a nonnegative function ν(·). Since h0(w) is in L²(Fw) by Assumption C, ν(K) = O(1).

Assumption D (i)(a) For splines, n^β K²(√(L/n) + L^{−sπ/dz}) → 0, K⁴(√(L/n) + L^{−sπ/dz}) → 0, n^β K³/√n → 0, L²/n → 0; (b) for power series, n^β K^{7/2}(√(L/n) + L^{−sπ/dz}) → 0, K⁸(√(L/n) + L^{−sπ/dz}) → 0, n^β K^{11/2}/√n → 0, L³/n → 0; (ii) τn → 0, n^β τn → 0; (iii) n^{2β} ν(K) → 0.

Assumption D(i) restricts the rate of growth of the numbers K and L of approximating functions. The conditions on K and L are more restrictive than the corresponding assumption for power series in NPV (Assumption 4, p. 575), where weak instruments are not considered. D(ii) restricts the rate of convergence of τn. D(iii) is reminiscent of the source condition in the literature on inverse problems; see, e.g., Darolles et al. (2011) and Chen and Pouzo (2012). The source condition relates the degree of ill-posedness to the smoothness of the unknown function of interest. As discussed earlier, the weak instrument problem is related to the ill-posed inverse problem, and n^{2β}ν(K) = O(1) controls the weakness of instruments relative to the smoothness of h0.
Now, we provide results for the rate of convergence in probability of the penalized series estimator ĥτ(w) in terms of L² and uniform distance. Let F(w) = Fw(w) for simplicity.

Theorem 5.1 Suppose Assumptions A–D and L are satisfied. Let Rn = min{n^β, τn^{−1/2}}. Then

{∫ [ĥτ(w) − h0(w)]² dF(w)}^{1/2} = Op(Rn(√(K/n) + K^{−s/dx} + τn + √(L/n) + L^{−sπ/dz})).

Also, for r = 1/2 for splines and r = 1 for power series,

sup_{w∈W} |ĥτ(w) − h0(w)| = Op(Rn K^r (√(K/n) + K^{−s/dx} + τn + √(L/n) + L^{−sπ/dz})).

Suppose there is no penalization (τn = 0). Then, with Rn = Op(n^β), Theorem 5.1 provides the rates of convergence of the unpenalized series estimator ĥ(·). For example, with ∥·∥_{L²} denoting the L² norm above,

∥ĥ − h0∥_{L²} = Op(n^β(√(K/n) + K^{−s/dx} + √(L/n) + L^{−sπ/dz})).    (5.1)

Compared to the strong instrument case of NPV (Lemma 4.1, p. 575), the rate deteriorates by the leading n^β rate, the weak instrument rate. Note that the terms √(K/n) and K^{−s/dx} correspond to the variance and bias of the second-stage estimator, respectively,16 and √(L/n) and L^{−sπ/dz} are those of the first-stage estimator. The latter rates appear here due to the fact that the residuals v̂i are generated regressors obtained from the first-stage nonparametric estimation. The way that n^β enters into the rate implies that the effect of weak instruments (hence concurvity) exacerbates not only the variance but also the bias.17 Moreover, the symmetric effect of weak instruments on bias and variance implies that the problem of weak instruments cannot be resolved by the choice of the number of terms in the series estimator. This is also related to the discussion in Section 4 that the truncation method does not work as a regularization method for weak instruments.
More importantly, in the case where penalization is in operation (τn > 0), the way that Rn enters into the convergence rates implies that penalization can reduce both bias and variance by the same mechanism, working in the opposite direction to the effect of weak instruments. Explicitly, when Rn = min{n^β, τn^{−1/2}} = τn^{−1/2}, penalization plays a role and yields the rate

∥ĥτ − h0∥_{L²} = Op(τn^{−1/2}(√(K/n) + K^{−s/dx} + τn + √(L/n) + L^{−sπ/dz})).    (5.2)

Here, the overall rate is improved, since the multiplying rate τn^{−1/2} is of smaller order than the multiplying rate n^β of the previous case. There is an additional bias term introduced by penalization, namely, the penalty bias τn. Even after multiplying by the multiplier τn^{−1/2}, τn^{1/2} → 0 by Assumption D(ii). When the effect of weak instruments prevails over the effect of penalization, the penalty (with the multiplier n^β) becomes τn n^β, which converges to zero since n^β τn → 0 in this situation.
Next, we find the optimal L² convergence rate. For a more concrete comparison between the rates n^β and τn^{−1/2}, let τn = n^{−2βτ} for some βτ > 0. For example, the larger βτ is, the faster the penalization parameter converges to zero, and hence, the smaller the effect of penalization is.

Corollary 5.2 (Consistency) Suppose the assumptions of Theorem 5.1 are satisfied. Let K = O(n^{1/(1+2s/dx)}) and L = O(n^{1/(1+2sπ/dz)}). Then ∥ĥτ − h0∥_{L²} = Op(n^{−q}) = op(1), where q = min{s/(dx + 2s), sπ/(dz + 2sπ)} − 2 min{β, βτ}.

For the penalization to play a role, we want βτ < β. But at the same time, τn cannot converge too slowly (i.e., βτ cannot be too small), lest q become too small, which would deteriorate the overall rate.18 It can be readily shown that q > 0 from Assumption D(i) and (ii). Given the choice of K and L in the corollary with splines, Assumption D(i)(a) implies that β ≤ 1/2 − 3/(1 + 2s/dx) and β ≤ sπ/(dz + 2sπ) − 2/(1 + 2s/dx).

16 The dimension of w is reduced to the dimension of x as the additive structure of h0(w) is exploited; see, e.g., Andrews and Whang (1990).
17 This is different from a linear case, where multicollinearity only results in imprecise estimates but does not introduce bias. This is also different from the ill-posed inverse problem, where the degree of ill-posedness only affects variance.
1
max (Tn )
Note that
⇤
0 1 0 max (Tn (Tn QTn ) Tn )
O(1)
0 1 max ((Tn QTn ) )
0 max (Tn Tn )
by applying Lemma A.5 again. The proof of part (b) proceeds similarly as above. Using (A.6),
32
pK ( w ˆ i )0
= O(1), it
= O(n2 )
. . = 1 .. pK1 (xi )0 .. pK2 (ˆ vi )0 =
. 1 .. pK1 (vi )0 + n 0
K
. ˜ i )@pK1 (˜ ⇧(z vi )0 .. pK2 (ˆ vi )0 and
p (w ˆ i ) Tn = 1 = 1
.. . p (vi )0 + n
. ˜ i )@p (˜ ⇧(z vi )0 .. p (ˆ v i ) 0 · Tn
.. . n (p (vi )0
. ˜ i )@p (˜ p (ˆ vi )0 ) + ⇧(z vi )0 .. p (ˆ vi )0 = p⇤K (ˆ ui )0 + mK0 ˆi0 , i +r
. ˜ . . (v )0 .. p (ˆ 0 with u 0 = 0 .. n (p (v )0 where p⇤K (ˆ ui )0 = 1 .. ⇧(z )@p v ) ˆ = (z , v , v ˆ ) and r ˆ i i i i i i i i i
p (ˆ vi )0 )
P .. . (0⇥1 )0 . Let pˆ⇤i = p⇤K (ˆ ui ) and recall mi = mK i . For a random matrix Xi , denote i Xi /n as
En Xi for simplicity. Then by (A.8),
ˆ n=Q ˆ ⇤ + En [mi pˆ⇤0 Tn0 QT p⇤i m0i ] + En [mi m0i ] i ] + En [ˆ + En [ˆ ri pˆ⇤0 p⇤i rˆi0 ] + En [ˆ ri m0i ] + En [mi rˆi0 ] + En [ˆ ri rˆi0 ] i ] + En [ˆ and thus ˆ n Tn0 QT
ˆ ⇤ = 2En kmi k kˆ Q p⇤i k + En kmi k2 + 2En kˆ ri k kˆ p⇤i k + 2En kˆ ri k kmi k + En kˆ ri k2 ⇣ ⌘1/2 ⇣ ⌘1/2 ⇣ ⌘1/2 ⇣ ⌘1/2 2 En kmi k2 En kˆ p⇤i k2 + 2 En kˆ r i k2 En kˆ p⇤i k2 ⇣ ⌘1/2 ⇣ ⌘1/2 + 2 En kˆ ri k2 En kmi k2 + En kmi k2 + En kˆ r i k2 .
Similar as (A.13), the bound on En kˆ p⇤i k2 can be derived as ⇥ ⇤ ⇥ ⇤ En pˆ⇤0 ˆ⇤i = tr(En pˆ⇤i pˆ⇤0 i p i ) tr(IK )
max (En
p⇤i pˆ⇤0 max (En [ˆ i ])
⇥
⇤ pˆ⇤i pˆ⇤0 i ) Op (K) = Op (),
(A.18)
Op (1) is by the fact that the polynomials or splines are defined on bounded ˜ 2 C1 (Z). For En kˆ sets and by Assumption L that ⇧(·) ri k2 , note that where
kˆ ri k2 = n (p (vi )0
p (ˆ vi )0 )
and therefore En kˆ ri k2 = Op (n2 ⇣1v ()2 ˆ n Tn0 QT
ˆ ⇤ Op (n Q
2 ). ⇡
2
n2 ⇣1v ()2 |vi
vˆi |2 = Op (n2 ⇣1v ()2
2 ⇡)
Combining this with (A.18) and (A.12) yields
1/2 ⇣2v () + n 1/2 ⇣1v ()
⇡
+ ⇣1v ()⇣2v ()
⇡
+n
2
⇣2v ()2 + n2 ⇣1v ()2
2 ⇡)
= op (1) by Assumption D† (i)–(iii). Also, by Lemma A.4, we have
33
0 ˆ min (Tn QTn )
min (Q
⇤)
ˆ n Tn0 QT
Q⇤ .
0 ˆ min (Tn QTn )
Combining the two results yields 1 0 ˆ min (Tn QTn )
ˆ
Therefore, we have
A.3
max (Q
1)
=
min (Q
+ op (1). Similar as before
1 1 = Op (1). ⇤ ) + o (1) (Q c + o min p p (1)
=
0 ˆ 1 0 max (Tn (Tn QTn ) Tn )
=
⇤)
Op (1)
0 max (Tn Tn )
(A.19)
= Op (n2 ). ⇤
A.3 Proofs of Rate of Convergence (Section 5)

Given the results of the lemmas above, we derive the rates of convergence. We first derive the rate of convergence of the unpenalized series estimator ĥ(·) defined as ĥ(w) = pK(w)′β̂, where β̂ = (P̂′P̂)^{−1}P̂′y. Then, we prove the main theorem with the penalized estimator ĥτ(·) defined in Section 4.

Lemma A.3 Suppose Assumptions A–D and L are satisfied. Then,

∥ĥ − h0∥_{L²} = Op(n^β(√(K/n) + K^{−s/dx} + √(L/n) + L^{−sπ/dz})).

Proof of Lemma A.3: Let β = (β1, ..., βK)′. By TR of the L² norm (first inequality),

∥ĥ − h0∥_{L²} = {∫[ĥ(w) − h0(w)]² dF(w)}^{1/2}
≤ {∫[pK(w)′(β̂ − β)]² dF(w)}^{1/2} + O(K^{−s/dx})
= {(β̂ − β)′ E[pK(w)pK(w)′] (β̂ − β)}^{1/2} + O(K^{−s/dx}) ≤ C∥β̂ − β∥ + O(K^{−s/dx})

by Assumption B†(ii) and using Lemma A.5 (last eq.). As β̂ − β = (P̂′P̂)^{−1}P̂′(y − P̂β), it follows that

∥β̂ − β∥² = (y − P̂β)′P̂(P̂′P̂)^{−1}(P̂′P̂)^{−1}P̂′(y − P̂β)
= (y − P̂β)′P̂Q̂^{−1/2}Q̂^{−1}Q̂^{−1/2}P̂′(y − P̂β)/n²
≤ Op(n^{2β})(y − P̂β)′P̂(P̂′P̂)^{−1}P̂′(y − P̂β)/n

by Lemma A.5 and Lemma A.2(b) (last ineq.).
Let h = (h(w1), ..., h(wn))′ and h̃ = (h(ŵ1), ..., h(ŵn))′. Also let ηi = yi − h0(wi) and η = (η1, ..., ηn)′. Let W = (w1, ..., wn)′; then E[yi|W] = h0(wi), which implies E[ηi|W] = 0. Also, similar to the proof of Lemma A1 in NPV (p. 594), by Assumption A, we have E[ηi²|W] being bounded and E[ηiηj|W] = 0 for i ≠ j, where the expectation is taken for y. Then, given that y − P̂β = (y − h) + (h − h̃) + (h̃ − P̂β), we have, by TR,

Op(n^β)∥Q̂^{−1/2}P̂′(y − P̂β)/n∥ ≤ Op(n^β){∥Q̂^{−1/2}P̂′η/n∥ + ∥Q̂^{−1/2}P̂′(h − h̃)/n∥ + ∥Q̂^{−1/2}P̂′(h̃ − P̂β)/n∥}.    (A.20)

For the first term of equation (A.20), consider

E[∥(PTn − P*)′η/n∥² | W] = E[∥M′η/n∥² | W] ≤ C(1/n²)Σi∥mi∥² = Op(n^{−2β}n^{−1}ζ2v(K)²) = op(1)

by (A.11), and op(1) is implied by Assumption D†(ii). Therefore, by MK,

∥(PTn − P*)′η/n∥ = op(1).    (A.21)

Also,

E[∥(P̂Tn − PTn)′η/n∥² | W] ≤ C(1/n²)Σi∥(p̂i − pi)′Tn∥² ≤ C(1/n²)λmax(Tn)²Σi∥p̂i − pi∥² ≤ (1/n)O(n^{2β})Op(ζ1v(K)²Δπ²) = Op(n^{2β}ζ1v(K)²Δπ²/n)    (A.22)

by (A.17) and

∥p̂i − pi∥² = ∥p(xi) − p(xi)∥² + ∥∂p(v̄i)(v̂i − vi)∥² ≤ Cζ1v(K)²|v̂i − vi|², with (1/n)Σi∥p̂i − pi∥² = Op(ζ1v(K)²Δπ²).    (A.23)

Therefore, ∥(P̂Tn − PTn)′η/n∥ = op(1) by Assumption D†(i) and MK. Also,

E[∥P*′η/n∥²] = E[E[∥P*′η/n∥² | W]] = E[Σi p*′i p*i E[ηi²|W]/n²] ≤ C(1/n²)Σi E[p*′i p*i] = C tr(Q*)/n = O(K/n)    (A.24)

by Assumption A (first ineq.) and equation (A.13) (last eq.). By MK, this implies

∥P*′η/n∥ = Op(√(K/n)).    (A.25)

Hence, by TR with (A.21), (A.24), and (A.25),

∥Tn′P̂′η/n∥ ≤ ∥(P̂Tn − PTn)′η/n∥ + ∥(PTn − P*)′η/n∥ + ∥P*′η/n∥ = Op(√(K/n)).

Therefore, the first term of (A.20) becomes

∥Q̂^{−1/2}P̂′η/n∥² = (η′P̂Tn/n)(Tn′Q̂Tn)^{−1}(Tn′P̂′η/n) ≤ Op(1)∥Tn′P̂′η/n∥² = Op(K/n)    (A.26)

by Lemma A.5 and (A.19).
Because I − P̂(P̂′P̂)^{−1}P̂′ is a projection matrix, hence p.s.d., the second term of (A.20) becomes

∥Q̂^{−1/2}P̂′(h − h̃)/n∥² = (h − h̃)′P̂(P̂′P̂)^{−1}P̂′(h − h̃)/n ≤ (h − h̃)′(h − h̃)/n
= Σi(h(wi) − h(ŵi))²/n = Σi(λ(vi) − λ(v̂i))²/n ≤ CΣi|vi − v̂i|²/n = Σi∥Π̂(zi) − Πn(zi)∥²/n = Op(Δπ²)    (A.27)

by Assumption C (Lipschitz continuity of λ(v)) (last ineq.). Similarly, the last term is

∥Q̂^{−1/2}P̂′(h̃ − P̂β)/n∥² = (h̃ − P̂β)′P̂(P̂′P̂)^{−1}P̂′(h̃ − P̂β)/n ≤ Σi|h(ŵi) − pK(ŵi)′β|²/n = Op(K^{−2s/dx})    (A.28)

by Assumption C†. Therefore, by combining (A.26), (A.27), and (A.28),

∥β̂ − β∥ ≤ Op(n^β)[Op(√(K/n)) + Op(Δπ) + O(K^{−s/dx})].

Consequently, since Δπ = √(L/n) + L^{−sπ/dz},

∥ĥ − h0∥_{L²} ≤ Op(n^β)[Op(√(K/n)) + O(K^{−s/dx}) + Op(√(L/n)) + O(L^{−sπ/dz})],

and we have the conclusion of the lemma. ∎
Proof of Theorem 5.1: Recall Q̂τ = (P̂′P̂ + nτnIK)/n = Q̂ + τnIK. Define P̂# = P̂ + nτnP̂(P̂′P̂)^{−1}. Note that the penalty bias emerges as P̂# ≠ P̂. Consider

∥β̂τ − β∥² = (y − P̂#β)′P̂(P̂′P̂ + nτnIK)^{−1}(P̂′P̂ + nτnIK)^{−1}P̂′(y − P̂#β)
= (y − P̂#β)′P̂Q̂τ^{−1/2}Q̂τ^{−1}Q̂τ^{−1/2}P̂′(y − P̂#β)/n²
≤ λmax(Q̂τ^{−1})∥Q̂τ^{−1/2}P̂′(y − P̂#β)/n∥².

First note

λmax(Q̂τ^{−1}) = 1/λmin(Q̂ + τnI) ≤ 1/(λmin(Q̂) + λmin(τnI)) = 1/(λmin(Q̂) + τn) ≤ min{1/λmin(Q̂), 1/τn} = min{Op(n^{2β}), τn^{−1}},    (A.29)

where the last eq. is by Lemma A.2(b). Then

∥β̂τ − β∥ ≤ Op(Rn)∥Q̂τ^{−1/2}P̂′(y − P̂#β)/n∥    (A.30)

by letting Rn = min{n^β, τn^{−1/2}}. Also, note that c′P̂′Q̂τ^{−1}P̂c ≤ c′P̂′Q̂^{−1}P̂c for any vector c, since (Q̂^{−1} − Q̂τ^{−1}) is p.s.d. Therefore

∥Q̂τ^{−1/2}P̂′(y − P̂#β)/n∥ ≤ ∥Q̂^{−1/2}P̂′(y − P̂#β)/n∥
≤ ∥Q̂^{−1/2}P̂′(y − h)/n∥ + ∥Q̂^{−1/2}P̂′(h − P̂β − nτnP̂(P̂′P̂)^{−1}β)/n∥
≤ ∥Q̂^{−1/2}P̂′(y − h)/n∥ + ∥Q̂^{−1/2}P̂′(h − P̂β)/n∥ + τn∥Q̂^{−1/2}β∥.

The third term (squared) is

τn²∥Q̂^{−1/2}β∥² = τn²β′Q̂^{−1}β ≤ τn²n^{2β}Σ_{j=1}^K βj² ≤ τn²n^{2β}ν(K) = Op(τn²),

where the inequality is by Lemma A.2 and Σ_{j=1}^K βj² ≤ ν(K), and the last equality is by the source condition (D(iii)). Consequently, by combining (A.26), (A.27), and (A.28) in Lemma A.3,

∥ĥτ − h0∥_{L²} = Op(Rn(√(K/n) + K^{−s/dx} + τn + √(L/n) + L^{−sπ/dz})).

This proves the first part of the theorem. The conclusion of the second part follows from

|ĥτ(w) − h0(w)| ≤ |pK(w)′β − h0(w)| + |pK(w)′(β̂τ − β)| ≤ O(K^{−s/dx}) + ζ0v(K)∥β̂τ − β∥. ∎

Proof of Theorem 5.3: The proof follows directly from the proofs of Theorems 4.2 and 4.3 of NPV (p. 602). As for notation, we use v instead of the u of NPV, and the other notations are identical. ∎
A.4 Technical Proofs

A.4.1 Proofs of Sufficiency of Assumptions B, C, D, and L

Proof that B and L imply B†: For simplicity, assume Pr[z ∈ Z^r(Π̃)] = 1, and we prove after replacing Q^{r*} with Q* in Assumption B†(i). Note that Π̃(·) is piecewise one-to-one. Here, we prove the case where Π̃(·) is one-to-one. The general cases, where Π̃(·) is piecewise one-to-one or where 0 < Pr[z ∈ Z^r(Π̃)] < 1, follow by conditioning on z in an appropriate subset of Z.
Consider the change of variables of u = (z, v) into ũ = (z̃, ṽ), where z̃ = Π̃(z) and ṽ = v. Then, it follows that p*K(ui) = (1, z̃i∂p(vi)′, p(vi)′)′ = pK(ũi), where pK(ũi) is one particular form of a vector of approximating functions as specified in NPV (pp. 572–573). Moreover, the joint density of ũ is

fũ(z̃, ṽ) = fu(Π̃^{−1}(z̃), ṽ) · |∂Π̃^{−1}(z̃)/∂z̃|.

Since ∂Π̃^{−1}(z̃)/∂z̃ ≠ 0 by Π̃ ∈ C1(Z) (bounded derivative), and fu is bounded away from zero and the support of u is compact by Assumption B, the support of ũ is also compact and fũ is also bounded away from zero. Then, by the proof of Theorem 4 in Newey (1997, p. 167), λmin(Q*) = λmin(E pK(ũi)pK(ũi)′) is bounded away from zero for all K.
As for Q1, which does not depend on the effect of weak instruments, the density of z being bounded away from zero implies that λmin(Q1) is bounded away from zero for all L by Newey (1997, p. 167). The maximum eigenvalues of Q*, Q, and Q1 are bounded by fixed constants by the fact that the polynomials or splines are defined on bounded sets and by Assumption L that Π̃(·) ∈ C1(Z). ∎

Proof that C implies C†: The results follow by Theorem 8 of Lorentz (1986, p. 90) for power series and by Theorem 12.8 of Schumaker (1981) for splines. ∎

Proof that D implies D†: It follows from Newey (1997, p. 157) that with power series, ζrv(K) = K^{1+2r}, and with splines, ζrv(K) = K^{1/2+r}, and similarly for the restriction on L. The same results hold for ξr(L). ∎

A.4.2 Matrix Algebra
The following are mathematical lemmas that are useful in proving Lemma A.2 and other results.

Lemma A.4 For symmetric k × k matrices A and B, let λj(A) and λj(B) denote their jth eigenvalues, such that λ1 ≥ λ2 ≥ ··· ≥ λk. Then, the following inequality holds: for 1 ≤ j ≤ k,

|λj(A) − λj(B)| ≤ |λ1(A − B)| ≤ ∥A − B∥.

Lemma A.5 If a K(n) × K(n) symmetric random sequence of matrices An satisfies λmax(An) = Op(an), then ∥BnAn∥ ≤ ∥Bn∥ Op(an) for a given sequence of matrices Bn.

Proof of Lemma A.4: First, by Weyl (1912), for symmetric k × k matrices C and D,

λ_{i+j−1}(C + D) ≤ λi(C) + λj(D),    (A.31)

where i + j − 1 ≤ k. Also, for any k × 1 vector a such that ∥a∥ = 1, (a′Da)² = tr(a′Daa′Da) = tr(DDaa′) ≤ tr(DD)tr(aa′) = tr(DD). Since λj(D) = a′Da for some a with ∥a∥ = 1, we have

|λj(D)| ≤ ∥D∥,    (A.32)

for 1 ≤ j ≤ k. Now, in (A.31) and (A.32), take j = 1, C = B, and D = A − B, and we have the desired results. ∎

Proof of Lemma A.5: Let An have the eigenvalue decomposition An = UDU^{−1}. Then,

∥BnAn∥² = tr(BnUD²U^{−1}Bn′) ≤ tr(BnUU^{−1}Bn′) · λmax(An)² = ∥Bn∥² Op(an)². ∎
Proof of Asymptotic Normality (Section 6)
Assumption G in Section 6 implies the following technical assumption. Define $\tilde Q = P'P/n$, where $P = (p^K(w_1), \ldots, p^K(w_n))'$ is $n \times K$, and define
$$
\begin{aligned}
&\Delta_{\hat Q} = \zeta_1^v(K)^2\Delta_\pi^2 + L^{1/2}\zeta_1^v(K)\Delta_\pi, \qquad \Delta_{\tilde Q} = \zeta_0^v(K)\sqrt{\log K/n}, \qquad \Delta_Q = \Delta_{\hat Q} + \Delta_{\tilde Q},\\
&\Delta_{Q_1} = \xi(L)\sqrt{\log L/n}, \qquad \Delta_\pi = \sqrt{L/n} + L^{-s_\pi/d_z}, \qquad \Delta_h = R_n\big(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\big),\\
&\Delta_H = L^{1/2}\zeta_1^v(K)\Delta_\pi + K^{1/2}\xi(L)/\sqrt{n}.
\end{aligned}
$$

Assumption G†: The following quantities converge to zero as $n \to \infty$: $\sqrt{n}\,\tau_nR_n\sqrt{K}$, $R_n^4\big(\zeta_0^v(K)^2K + \xi(L)^2L\big)/n$, $\sqrt{n}\,\zeta_0^v(K)\Delta_\pi^2$, $L^{1/2}\Delta_{Q_1}$, $R_nL^{1/2}\Delta_H$, $R_n^3K^{1/2}\Delta_Q$, $R_n\zeta_1^v(K)^2\Delta_\pi^2$, $L^{1/2}\zeta_0^v(K)\zeta_1^v(K)\Delta_h$, $\sqrt{n}\,K^{-s/d_x}$, $\sqrt{n}\,L^{-s_\pi/d_z}$, and $n^{-2\delta}\zeta_2^v(K)K$. Also, $R_n = \tau_n^{-1/2} = n^{\delta_\tau}$.
Proof that Assumption G implies Assumption G†: First, by $\sqrt{n}K^{-s/d_x} \to 0$ and $\sqrt{n}L^{-s_\pi/d_z} \to 0$, we have $\Delta_\pi = \sqrt{L/n}\big(1 + L^{-1/2}\sqrt{n}L^{-s_\pi/d_z}\big) = O(\sqrt{L/n})$ and $\Delta_h = O\big(R_n(\sqrt{K/n} + \sqrt{L/n})\big)$. Therefore,
$$
\begin{aligned}
\sqrt{n}\,\zeta_0^v(K)\Delta_\pi^2 &= \zeta_0^v(K)L/\sqrt{n},\\
L^{1/2}\Delta_{Q_1} &= L^{1/2}\xi(L)\sqrt{\log L/n},\\
R_nL^{1/2}\Delta_H &= R_n\big(\zeta_1^v(K)L^{3/2} + K^{1/2}L^{1/2}\xi(L)\big)/\sqrt{n},\\
R_n^3K^{1/2}\Delta_Q &= R_n^3\big(\zeta_1^v(K)^2K^{1/2}L/\sqrt{n} + \zeta_1^v(K)K^{1/2}L + \zeta_0^v(K)K^{1/2}\sqrt{\log K}\big)/\sqrt{n},\\
R_n\zeta_1^v(K)^2\Delta_\pi^2 &= R_n\zeta_1^v(K)^2L/n,\\
L^{1/2}\zeta_0^v(K)\zeta_1^v(K)\Delta_h &= R_n\zeta_0^v(K)\zeta_1^v(K)\big(K^{1/2}L^{1/2} + L\big)/\sqrt{n}.
\end{aligned}
$$
Then, by plugging in $\zeta_r^v(K) = K^{1/2+r}$ and $\xi(L) = L^{1/2}$ for splines, and $\zeta_r^v(K) = K^{1+2r}$ and $\xi(L) = L$ for polynomials, it is readily seen that Assumption G† follows from Assumption G. $\Box$

Proof of Theorem 6.1: The proofs of Theorem 6.1 and Corollary 6.2 are a modification of the proof of Theorem 5.1 in NPV (with their trimming function being an identity function) to the setting of weak instruments and penalization. We use the components established in the proof of the convergence rate, which are distinct from NPV; the rest of the notation is the same as that of NPV. Let "MVE" abbreviate mean value expansion. Under Assumptions A, B†(ii), and E, and given our choice of basis functions, Theorem 4.6 of Belloni et al. (2015) (which improves over Newey (1997)) yields $\|\tilde Q - Q\| = O_p(\Delta_{\tilde Q})$ and $\|\hat Q_1 - I\| = O_p(\Delta_{Q_1})$ by letting $Q_1 = I$. Also, similar to the proof of Lemma A1 of NPV, but using $\hat Q - \tilde Q = n^{-1}\sum_{i=1}^n(\hat p_i\hat p_i' - p_ip_i')$ instead of their (A.5), we have $\|\hat Q - \tilde Q\| = O_p(\Delta_{\hat Q})$. Also,
$$
\big\|\hat Q_\tau - Q\big\| \le \big\|\hat Q - \tilde Q\big\| + \big\|\tilde Q - Q\big\| + \|\tau_nI\| \le O_p(\Delta_{\hat Q}) + O_p(\Delta_{\tilde Q}) + O\big(\tau_n\sqrt{K}\big) = O_p(\Delta_{Q_\tau}).
$$
Let $X$ be a vector of variables that includes $x$ and $z$, and let $\omega(X, \pi)$ be a vector of functions of $X$ and $\pi$; trivially, $\omega(X, \pi) = (x, x - \Pi_n(z))$. Let $r_i = r^L(z_i)$. Define $Q_\tau = Q + \tau_nI$ and
$$
V_\tau = AQ_\tau^{-1}\big(\Sigma + HQ_1^{-1}\Sigma_1Q_1^{-1}H'\big)Q_\tau^{-1}A', \qquad \Sigma = E\big[p_ip_i'\operatorname{var}(y_i|X_i)\big], \qquad H = E\big[p_i\,[\partial h(w_i)/\partial w]'\,\partial\omega(X_i, \Pi_n(z_i))/\partial\pi\;r_i'\big].
$$
Note that $H$ is a channel through which the first-stage estimation error affects the variance of the estimator of $h(\cdot)$. We first prove
$$
\sqrt{n}\,V_\tau^{-1/2}\big(\hat\theta_\tau - \theta_0\big) \to_d N(0, 1).
$$
For notational simplicity, let $F = V_\tau^{-1/2}$. Let $h = (h(w_1), \ldots, h(w_n))'$, $\tilde h = (h(\hat w_1), \ldots, h(\hat w_n))'$, $\eta_i = y_i - h_0(w_i)$, $\eta = (\eta_1, \ldots, \eta_n)'$, $\Pi_n = (\Pi_n(z_1), \ldots, \Pi_n(z_n))'$, $v_i = x_i - \Pi_n(z_i)$, and $U = (v_1, \ldots, v_n)'$. Similar to NPV, once we prove below that (i) $\|FAQ_\tau^{-1}\| = O(R_n)$, (ii) $\sqrt{n}F\big(a(p^{K\prime}\tilde\beta) - a(h_0)\big) = o_p(1)$, (iii) $\sqrt{n}FA(\hat P'\hat P + n\tau_nI)^{-1}\hat P'(\tilde h - \hat P\tilde\beta) - \sqrt{n}\,\tau_nFA\hat Q_\tau^{-1}\tilde\beta = o_p(1)$, (iv) $FA\hat Q_\tau^{-1}\hat P'(\tilde h - h)/\sqrt{n} = FAQ_\tau^{-1}HR'U/\sqrt{n} + o_p(1)$, and (v) $FA\hat Q_\tau^{-1}\hat P'\eta/\sqrt{n} = FAQ_\tau^{-1}P'\eta/\sqrt{n} + o_p(1)$, then we will have
$$
\begin{aligned}
\sqrt{n}\,V_\tau^{-1/2}\big(\hat\theta_\tau - \theta_0\big)
&= \sqrt{n}F\big(a(p^{K\prime}\hat\beta_\tau) - a(p^{K\prime}\tilde\beta)\big) + \sqrt{n}F\big(a(p^{K\prime}\tilde\beta) - a(h_0)\big)\\
&= \sqrt{n}FA\hat\beta_\tau - \sqrt{n}FA\tilde\beta + o_p(1)\\
&= \sqrt{n}FA(\hat P'\hat P + n\tau_nI)^{-1}\hat P'(h + \eta) - \sqrt{n}FA(\hat P'\hat P + n\tau_nI)^{-1}\hat P'\tilde h\\
&\qquad + \sqrt{n}FA(\hat P'\hat P + n\tau_nI)^{-1}\hat P'(\tilde h - \hat P\tilde\beta) - \sqrt{n}\,\tau_nFA\hat Q_\tau^{-1}\tilde\beta + o_p(1)\\
&= FA\hat Q_\tau^{-1}\hat P'\eta/\sqrt{n} - FA\hat Q_\tau^{-1}\hat P'(\tilde h - h)/\sqrt{n} + o_p(1)\\
&= FAQ_\tau^{-1}\big(P'\eta/\sqrt{n} + HR'U/\sqrt{n}\big) + o_p(1).
\end{aligned} \tag{A.33}
$$
Then, for any vector $\alpha$ with $\|\alpha\| = 1$, let $Z_{in} = \alpha'FAQ_\tau^{-1}\big[p_i\eta_i + Hr_iu_i\big]/\sqrt{n}$. Note that $Z_{in}$ is i.i.d. across $i$ for each $n$, with $EZ_{in} = 0$ and $\operatorname{var}(Z_{in}) = 1/n$. Furthermore, $\|FAQ_\tau^{-1}\| = O(R_n)$ and $\|FAQ_\tau^{-1}H\| \le C\|FAQ_\tau^{-1}\| = O(R_n)$ by $CI - HH'$ being p.s.d., so that, for any $\varepsilon > 0$,
$$
\begin{aligned}
nE\big[\mathbf{1}\{|Z_{in}| > \varepsilon\}Z_{in}^2\big]
&= n\varepsilon^2E\big[\mathbf{1}\{|Z_{in}/\varepsilon| > 1\}(Z_{in}/\varepsilon)^2\big] \le n\varepsilon^2E\big[(Z_{in}/\varepsilon)^4\big]\\
&\le \frac{C}{n\varepsilon^2}\,\big\|FAQ_\tau^{-1}\big\|^4\Big\{E\big[\|p_i\|^2E\big[\|p_i\|^2\eta_i^4\,\big|\,X_i\big]\big] + E\big[\|r_i\|^2E\big[\|r_i\|^2u_i^4\,\big|\,z_i\big]\big]\Big\}\\
&\le C\,O(R_n^4)\big\{\zeta_0^v(K)^2E\|p_i\|^2 + \xi(L)^2E\|r_i\|^2\big\}/n \le C\,O(R_n^4)\big\{\zeta_0^v(K)^2\operatorname{tr}(Q) + \xi(L)^2\operatorname{tr}(Q_1)\big\}/n\\
&= O\big(R_n^4\big(\zeta_0^v(K)^2K + \xi(L)^2L\big)/n\big) = o(1)
\end{aligned}
$$
by G†. Then $\sqrt{n}F(\hat\theta_\tau - \theta_0) \to_d N(0,1)$ by the Lindeberg–Feller theorem and (A.33).
Now, we proceed with detailed proofs of (i)–(v). For simplicity, as before, the remainder of the proof is given for the scalar $\Pi(z)$ case. Under Assumption F and by CS, for $h_K(w) = p^K(w)'\beta_K$,
$$
|a(h_K)| = |A\beta_K| \le \|A\|\,\|\beta_K\| = \|A\|\,\big[Eh_K(w)^2\big]^{1/2},
$$
so $\|A\|$ is bounded away from zero. Also, since $\lambda_{\min}(Q_\tau^{-1})$ is bounded away from zero, $Q_\tau^{-1} \ge \lambda_{\min}(Q_\tau^{-1})I \ge CI$. Since $\sigma^2(X) = \operatorname{var}(y|X)$ is bounded away from zero by Assumption E and $\tau_n \to 0$, we have $\Sigma \ge CQ_\tau$. Hence
$$
V_\tau \ge AQ_\tau^{-1}\Sigma Q_\tau^{-1}A' \ge CAQ_\tau^{-1}Q_\tau Q_\tau^{-1}A' = CAQ_\tau^{-1}A' \ge \tilde C\|A\|^2 > 0. \tag{A.34}
$$
Therefore, $F$ is bounded. Now, by (A.34),
$$
\big\|FAQ_\tau^{-1/2}\big\|^2 = \operatorname{tr}\big(FAQ_\tau^{-1}A'F\big) \le \operatorname{tr}\big(CFV_\tau F\big) = C.
$$
Also, $\lambda_{\max}(Q_\tau^{-1}) = O(R_n^2)$, which can readily be shown analogously to (A.30). Using these results,
$$
\big\|FAQ_\tau^{-1}\big\| = \big\|FAQ_\tau^{-1/2}Q_\tau^{-1/2}\big\| \le \lambda_{\max}(Q_\tau^{-1})^{1/2}\big\|FAQ_\tau^{-1/2}\big\| = O(R_n).
$$
Then, combining with (A.30), (i) follows by
$$
\big\|FA\hat Q_\tau^{-1}\big\| \le \big\|FAQ_\tau^{-1}\big\| + \big\|FAQ_\tau^{-1}(Q_\tau - \hat Q_\tau)\hat Q_\tau^{-1}\big\| \le O(R_n) + O(R_n)O_p(R_n^2)\big\|\hat Q_\tau - Q_\tau\big\| = O(R_n) + O_p\big(R_n^3\Delta_Q\big) = O_p(R_n),
$$
where $R_n^3\Delta_Q = o_p(1)$ is by Assumption G†. Also,
$$
\big\|FA\hat Q_\tau^{-1/2}\big\|^2 \le \big\|FAQ_\tau^{-1/2}\big\|^2 + \operatorname{tr}\big(FAQ_\tau^{-1}(Q_\tau - \hat Q_\tau)\hat Q_\tau^{-1}A'F\big) \le C + \big\|FAQ_\tau^{-1}\big\|\,\big\|Q_\tau - \hat Q_\tau\big\|\,\big\|FA\hat Q_\tau^{-1}\big\|,
$$
where the second term is $O(R_n)O_p(\Delta_Q)O_p(R_n) = o_p(1)$ by G†.
To prove (ii), by C† and G†,
$$
\sqrt{n}\,\big|F\big(a(p^{K\prime}\tilde\beta) - a(h_0)\big)\big| = \sqrt{n}\,\big|F\,a\big(p^{K\prime}\tilde\beta - h_0\big)\big| \le \sqrt{n}\,|F|\sup_w\big|p^K(w)'\tilde\beta - h_0(w)\big| \le C\sqrt{n}\,K^{-s/d_x} = o_p(1).
$$
For (iii),
$$
\big\|FA\hat Q_\tau^{-1}\hat P'(\tilde h - \hat P\tilde\beta)/\sqrt{n}\big\| + \sqrt{n}\,\tau_n\big\|FA\hat Q_\tau^{-1}\tilde\beta\big\| \le \big\|FA\hat Q_\tau^{-1/2}\big\|\,\sqrt{n}\,O\big(K^{-s/d_x}\big) + \sqrt{n}\,\tau_nO_p(R_n) = O_p(1)O\big(\sqrt{n}K^{-s/d_x}\big) + O_p\big(\sqrt{n}\,\tau_nR_n\big) = o_p(1)
$$
by G†, where $\|FA\hat Q_\tau^{-1}\tilde\beta\| = O_p(R_n)$ can be shown analogously to $\|FA\hat Q_\tau^{-1}\| = O_p(R_n)$. To prove (iv), let $\gamma = (\gamma_1, \ldots, \gamma_L)'$, $d_i = d(X_i) = [\partial h(w_i)/\partial w]'\,\partial\omega(X_i, \Pi_0(z_i))/\partial\pi$, and $\bar H = \sum_i\hat p_id(X_i)r_i'/n$.
By a second-order MVE of each $h(\hat w_i)$ around $w_i$,
$$
\begin{aligned}
FA\hat Q_\tau^{-1}\hat P'(\tilde h - h)/\sqrt{n}
&= FA\hat Q_\tau^{-1}\sum_i\hat p_id_i\big(\hat\Pi(z_i) - \Pi_n(z_i)\big)/\sqrt{n} + \hat\rho\\
&= FA\hat Q_\tau^{-1}\bar H\hat Q_1^{-1}R'U/\sqrt{n} + FA\hat Q_\tau^{-1}\bar H\hat Q_1^{-1}R'(\Pi_n - R\gamma)/\sqrt{n}\\
&\qquad + FA\hat Q_\tau^{-1}\sum_i\hat p_id_i\big[r_i'\gamma - \Pi_n(z_i)\big]/\sqrt{n} + \hat\rho,
\end{aligned} \tag{A.35}
$$
but $\|\hat\rho\| \le C\sqrt{n}\,\big\|FA\hat Q_\tau^{-1/2}\big\|\,\zeta_0^v(K)\sum_i\big(\hat\Pi(z_i) - \Pi_n(z_i)\big)^2/n = O_p(1)O_p\big(\sqrt{n}\,\zeta_0^v(K)\Delta_\pi^2\big) = o_p(1)$. Also, by $d_i$ being bounded and $n\bar H\hat Q_1^{-1}\bar H'$ being equal to the matrix sum of squares from the multivariate regression of $\hat p_id_i$ on $r_i$, we have $\bar H\hat Q_1^{-1}\bar H' \le \sum_i\hat p_i\hat p_i'd_i^2/n \le C\hat Q \le C\hat Q_\tau$. Therefore, the second term in (A.35) becomes
$$
\begin{aligned}
\big\|FA\hat Q_\tau^{-1}\bar H\hat Q_1^{-1}R'(\Pi_n - R\gamma)/\sqrt{n}\big\|
&\le \big\|FA\hat Q_\tau^{-1}\bar H\hat Q_1^{-1}R'/\sqrt{n}\big\|\,\sqrt{n}\sup_z\big|\Pi_n(z) - r^L(z)'\gamma\big|\\
&\le \Big\{\operatorname{tr}\big(FA\hat Q_\tau^{-1}\bar H\hat Q_1^{-1}\hat Q_1\hat Q_1^{-1}\bar H'\hat Q_\tau^{-1}A'F'\big)\Big\}^{1/2}O\big(\sqrt{n}L^{-s_\pi/d_z}\big)\\
&\le C\Big\{\operatorname{tr}\big(FA\hat Q_\tau^{-1}\hat Q_\tau\hat Q_\tau^{-1}A'F'\big)\Big\}^{1/2}O\big(\sqrt{n}L^{-s_\pi/d_z}\big) \le C\big\|FA\hat Q_\tau^{-1/2}\big\|\,O\big(\sqrt{n}L^{-s_\pi/d_z}\big) = o_p(1)
\end{aligned}
$$
by G†. Similarly, the third term is
$$
\Big\|FA\hat Q_\tau^{-1}\sum_i\hat p_id_i\big[r_i'\gamma - \Pi_n(z_i)\big]/\sqrt{n}\Big\| \le C\big\|FA\hat Q_\tau^{-1/2}\big\|\,O\big(\sqrt{n}L^{-s_\pi/d_z}\big) = o_p(1).
$$
Next, we consider the first term $FA\hat Q_\tau^{-1}\bar H\hat Q_1^{-1}R'U/\sqrt{n}$ in (A.35). Note that $E\|R'U/\sqrt{n}\|^2 = \operatorname{tr}(\Sigma_1) \le C\operatorname{tr}(I_L) \le CL$ by $E[u^2|z]$ bounded, so by MR, $\|R'U/\sqrt{n}\| = O_p(L^{1/2})$. Also, we have $\|FA\hat Q_\tau^{-1}\bar H\hat Q_1^{-1}\| \le O_p(1)\|FA\hat Q_\tau^{-1/2}\| = O_p(1)$. Therefore,
$$
\big\|FA\hat Q_\tau^{-1}\bar H(\hat Q_1^{-1} - I)R'U/\sqrt{n}\big\| = O_p(1)O_p(\Delta_{Q_1})O_p(L^{1/2}) = o_p(1)
$$
by G†. Similarly,
$$
\big\|FA\hat Q_\tau^{-1}(\bar H - H)R'U/\sqrt{n}\big\| \le \big\|FA\hat Q_\tau^{-1}\big\|\,\big\|\bar H - H\big\|\,\big\|R'U/\sqrt{n}\big\| = O_p(R_n)O_p(\Delta_H)O_p(L^{1/2}) = o_p(1),
$$
where $\|\bar H - H\| = O_p(\Delta_H)$ is used instead of (A.12) in NPV, and the last equality is by G†. Since $HH'$ is bounded by the population matrix mean-square of the regression of $p_id_i$ on $r_i$, so that $HH' \le C$, it follows that $E\|HR'U/\sqrt{n}\|^2 = \operatorname{tr}(H\Sigma_1H') \le CK$ and therefore $\|HR'U/\sqrt{n}\| = O_p(K^{1/2})$. Then,
$$
\big\|FA(\hat Q_\tau^{-1} - Q_\tau^{-1})HR'U/\sqrt{n}\big\| \le \lambda_{\max}(\hat Q_\tau^{-1})\big\|FAQ_\tau^{-1}\big\|\,\big\|\hat Q_\tau - Q_\tau\big\|\,\big\|HR'U/\sqrt{n}\big\| = O(R_n^2)O(R_n)O_p(\Delta_Q)O_p(K^{1/2}) = o_p(1).
$$
Combining the results above and by TR, $FA\hat Q_\tau^{-1}\bar H\hat Q_1^{-1}R'U/\sqrt{n} = FAQ_\tau^{-1}HR'U/\sqrt{n} + o_p(1)$, and thus we have the result of (iv). Lastly, to prove (v), similar to (A.22),
$$
\big\|\hat Q_\tau^{-1/2}(P - \hat P)'\eta/\sqrt{n}\big\| = O_p\big(R_n\zeta_1^v(K)^2\Delta_\pi^2\big) = o_p(1)
$$
by G† (and by (A.6) of NPV), which implies
$$
\big\|FA\hat Q_\tau^{-1}(\hat P - P)'\eta/\sqrt{n}\big\| \le \big\|FA\hat Q_\tau^{-1/2}\big\|\,\big\|\hat Q_\tau^{-1/2}(P - \hat P)'\eta/\sqrt{n}\big\| = O_p(1)o_p(1) = o_p(1).
$$
Also, by $E[\eta|X] = 0$,
$$
\begin{aligned}
E\Big[\big\|FA(\hat Q_\tau^{-1} - Q_\tau^{-1})P'\eta/\sqrt{n}\big\|^2\,\Big|\,\mathcal{X}_n\Big]
&= \operatorname{tr}\Big(FA(\hat Q_\tau^{-1} - Q_\tau^{-1})\Big[\sum_ip_ip_i'\operatorname{var}(y_i|X_i)/n\Big](\hat Q_\tau^{-1} - Q_\tau^{-1})A'F\Big)\\
&\le C\operatorname{tr}\Big(FAQ_\tau^{-1}(Q_\tau - \hat Q_\tau)\hat Q_\tau^{-1}\hat Q_\tau\hat Q_\tau^{-1}(Q_\tau - \hat Q_\tau)Q_\tau^{-1}A'F\Big)\\
&\le O_p(R_n^2)\big\|FAQ_\tau^{-1}\big\|^2\big\|\hat Q_\tau - Q_\tau\big\|^2 = O_p(R_n^2)O(R_n^2)O_p\big(\Delta_Q^2\big) = o_p(1)
\end{aligned}
$$
by G†. Combining all of the previous results and by TR, we have the result of (v). Next, note that $H\Sigma_1H'$ is p.s.d. and $H\Sigma_1H' \le CI \le CQ_\tau$, and since $\operatorname{var}(y|X)$ is bounded by Assumption E, $\Sigma \le CQ \le CQ_\tau$. Therefore,
$$
V_\tau = AQ_\tau^{-1}\big(\Sigma + HQ_1^{-1}\Sigma_1Q_1^{-1}H'\big)Q_\tau^{-1}A' = AQ_\tau^{-1}\big(\Sigma + H\Sigma_1H'\big)Q_\tau^{-1}A' \le AQ_\tau^{-1}(CQ_\tau + CI)Q_\tau^{-1}A' = C\big\|AQ_\tau^{-1/2}\big\|^2 + C\big\|AQ_\tau^{-1}\big\|^2.
$$
The first-stage estimation error is not cancelled out by $Q_\tau^{-1}$, since the first stage does not suffer from multicollinearity. Since $a(p^{K\prime}A') = AA' = \|A\|^2$ and $|a(p^{K\prime}A')| \le \zeta_{\tilde r}(K)\|A\|$ by CS for some $\tilde r$, we have $\|A\| \le \zeta_{\tilde r}(K)$. Hence
$$
\big\|AQ_\tau^{-1/2}\big\|^2 = AQ_\tau^{-1}A' \le \lambda_{\max}(Q_\tau^{-1})\|A\|^2 = O\big(R_n^2\zeta_{\tilde r}(K)^2\big),
$$
and likewise $\|AQ_\tau^{-1}\|^2 = O(R_n^4\zeta_{\tilde r}(K)^2)$. Thus, $V_\tau = O(R_n^2\zeta_{\tilde r}(K)^2) + O(R_n^4\zeta_{\tilde r}(K)^2) = O(R_n^4\zeta_{\tilde r}(K)^2)$. Therefore,
$$
\hat\theta_\tau - \theta_0 = O_p\big(V_\tau^{1/2}/\sqrt{n}\big) = O_p\big(R_n^2\zeta_{\tilde r}(K)/\sqrt{n}\big) \le O_p\big(R_n^2\zeta_{\tilde r}^v(K)/\sqrt{n}\big),
$$
since $\max_{|\mu|\le\tilde r}\sup_{w\in\mathcal{W}}\|\partial^\mu p^K(w)\|^2 \le \sup_{x\in\mathcal{X}}\|\partial^{\tilde r}p(x)\|^2 + \sup_{v\in\mathcal{V}}\|\partial^{\tilde r}p(v)\|^2 \le \zeta_{\tilde r}^v(K)^2 + \zeta_{\tilde r}^v(K)^2 = 2\zeta_{\tilde r}^v(K)^2$, and thus $\zeta_{\tilde r}(K) \le C\zeta_{\tilde r}^v(K)$. This result for scalar $a(h)$ covers the case of Assumption F. Now we can prove
$$
\sqrt{n}\,\hat V_\tau^{-1/2}\big(\hat\theta_\tau - \theta_0\big) \to_d N(0,1)
$$
by showing $F\hat V_\tau F - 1 \to_p 0$. Then $V_\tau^{-1}\hat V_\tau \to_p 1$, so that
$$
\sqrt{n}\,\hat V_\tau^{-1/2}\big(\hat\theta_\tau - \theta_0\big) = \sqrt{n}\,V_\tau^{-1/2}\big(\hat\theta_\tau - \theta_0\big)\big/\big(V_\tau^{-1}\hat V_\tau\big)^{1/2} \to_d N(0,1).
$$
The rest of the proof follows analogously from the relevant part of the proof of NPV (pp. 600–601), using $Q \ne I$ because of weak instruments and $F = V_\tau^{-1/2}$.
Therefore, the following replace the corresponding parts in that proof. For any matrix $B$, we have $\|B\Sigma\| \le C\|BQ_\tau\|$ by $\Sigma \le CQ_\tau$. Therefore,
$$
\begin{aligned}
\big\|FA\big(\hat Q_\tau^{-1}\Sigma\hat Q_\tau^{-1} - Q_\tau^{-1}\Sigma Q_\tau^{-1}\big)A'F'\big\|
&\le \big\|FA\big(\hat Q_\tau^{-1} - Q_\tau^{-1}\big)\Sigma\hat Q_\tau^{-1}A'F'\big\| + \big\|FAQ_\tau^{-1}\Sigma\big(\hat Q_\tau^{-1} - Q_\tau^{-1}\big)A'F'\big\|\\
&= \big\|FA\hat Q_\tau^{-1}(Q_\tau - \hat Q_\tau)Q_\tau^{-1}\Sigma\hat Q_\tau^{-1}A'F'\big\| + \big\|FAQ_\tau^{-1}\Sigma Q_\tau^{-1}(Q_\tau - \hat Q_\tau)\hat Q_\tau^{-1}A'F'\big\|\\
&\le C\big\|FA\hat Q_\tau^{-1}\big\|\,\big\|Q_\tau - \hat Q_\tau\big\|\,\big\|FA\hat Q_\tau^{-1}\big\| + C\big\|FAQ_\tau^{-1}\big\|\,\big\|Q_\tau - \hat Q_\tau\big\|\,\big\|FA\hat Q_\tau^{-1}\big\|\\
&= O_p(R_n^2)O_p(\Delta_Q) + O_p(R_n^2)O_p(\Delta_Q) = o_p(1)
\end{aligned}
$$
by Assumption G†. Also note that in our proof $Q_\tau$ is introduced by penalization, but the treatment is the same as above. Also, recall that $\zeta_r(K) \le \zeta_r^v(K)$ and that $\Delta_h$ and $\Delta_Q$ are redefined in this paper. That is, by $\zeta_0^v(K)\Delta_h$, $R_n^2\Delta_Q + \zeta_0^v(K)^2K/n$, and $\zeta_0^v(K)L^{1/2}\zeta_1^v(K)\Delta_h$ converging to zero, we can prove the following:
$$
\big\|FA\hat Q_\tau^{-1}\big(\hat\Sigma_\tau - \tilde\Sigma\big)\hat Q_\tau^{-1}A'F'\big\| \le C\operatorname{tr}(\hat D)\max_{i\le n}\big|\hat h_\tau(w_i) - h(w_i)\big| = O_p(1)O_p\big(\zeta_0^v(K)\Delta_h\big) = o_p(1),
$$
$$
\big\|FA\hat Q_\tau^{-1}\big(\tilde\Sigma - \Sigma\big)\hat Q_\tau^{-1}A'F'\big\| \le \big\|FA\hat Q_\tau^{-1}\big\|^2\big\|\tilde\Sigma - \Sigma\big\| = O_p(R_n^2)O_p\big(\Delta_Q + \zeta_0^v(K)^2K/n\big) = o_p(1),
$$
$$
\big\|\hat H_\tau - \bar H\big\| \le C\Big(\sum_{i=1}^n\|\hat p_i\|^2\|r_i\|^2/n\Big)^{1/2}\Big(\sum_{i=1}^n\big|\hat d_{\tau,i} - d_i\big|^2/n\Big)^{1/2} = O_p\big(\zeta_0^v(K)L^{1/2}\big)O_p\big(\zeta_1^v(K)\Delta_h\big) = o_p(1),
$$
where $\hat d_{\tau,i} = [\partial\hat h_\tau(w_i)/\partial w]'\,\partial\omega(X_i, \Pi_0(z_i))/\partial\pi$. The rest of the proof thus follows. $\Box$
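As a rough plug-in counterpart of $V_\tau = AQ_\tau^{-1}(\Sigma + HQ_1^{-1}\Sigma_1Q_1^{-1}H')Q_\tau^{-1}A'$, the following sketch assembles the sandwich matrix from sample analogues, including the first-stage channel $H$; the inputs (the residuals, the derivative estimate `dh_dv`, and the weight vector `A` of the linear functional) are assumptions of this illustration rather than quantities produced by any code accompanying the paper.

```python
import numpy as np

def plugin_variance(P, R, y_res, x_res, A, tau, dh_dv):
    """Sandwich variance estimate for a linear functional A'beta_tau.

    P, R   : second- and first-stage basis matrices (n x K, n x L).
    y_res  : second-stage residuals eta_i;  x_res : first-stage residuals v_i.
    dh_dv  : estimate of d h(w_i)/d v, entering the channel H with a minus sign
             since omega(X, pi) = (x, x - Pi(z)) gives d omega/d pi = (0, -1).
    """
    n = P.shape[0]
    Q_tau = P.T @ P / n + tau * np.eye(P.shape[1])
    Q1 = R.T @ R / n
    Sigma = (P * (y_res**2)[:, None]).T @ P / n    # sample analogue of E[p p' var(y|X)]
    Sigma1 = (R * (x_res**2)[:, None]).T @ R / n   # sample analogue of Sigma_1
    H = (P * (-dh_dv)[:, None]).T @ R / n          # first-stage correction channel
    Q1_inv = np.linalg.inv(Q1)
    mid = Sigma + H @ Q1_inv @ Sigma1 @ Q1_inv @ H.T
    AQ = np.linalg.solve(Q_tau, A)                 # Q_tau^{-1} A (Q_tau symmetric)
    return AQ @ mid @ AQ / n                       # estimated var of A'beta_tau
```

Dropping the `H` term in this sketch corresponds to ignoring the first-stage estimation error, which the proof above shows is not innocuous: that error is not cancelled out by $Q_\tau^{-1}$.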
Proof of Corollary 6.2: Now, instead, suppose Assumption H holds. For the functional $c'a(g)$, this assumption is satisfied with $\nu(w)$ replaced by $c'\nu(w)$. Therefore, it suffices to prove the result for scalar $\nu(w)$. Then $A = a(p^K) = E\big[\nu(w_i)p^K(w_i)'\big]$, which is $1 \times K$. Let $\nu_K(w) = AQ^{-1}p^K(w)$, which is the (transpose of the) mean-square projection of $\nu(\cdot)$ on the approximating functions $p^K(w)$. Note that $Q$ becomes singular with weak instruments. Therefore, define $A^* = a(p^{*K}) = E\big[\nu(w_i)p^{*K}(u_i)'\big]$. Also, let $\nu_K^*(u) = A^*Q^{*-1}p^{*K}(u)$, which is the (transpose of the) mean-square projection of $\nu(\cdot)$ on the approximating functions $p^{*K}(u)$. Then $E\|\nu - \nu_K^*\|^2 \to 0$ by Assumption H. But
$$
\nu_K(w) = E\big[\nu(w_i)p^K(w_i)'\big]Q^{-1}p^K(w) = E\big[\nu(w_i)p^K(w_i)'\big]T_n(T_n'QT_n)^{-1}T_n'p^K(w) = E\big[\nu(w_i)p^{*K}(u_i)'\big](T_n'QT_n)^{-1}p^{*K}(u).
$$
Let $R^* = T_n'QT_n - Q^* = E[m_ip_i^{*\prime}] + E[p_i^*m_i'] + E[m_im_i']$. Then, we have
$$
\begin{aligned}
E\big\|\nu_K^* - \nu_K\big\|^2 &= E\big\|A^*\big[Q^{*-1} - (T_n'QT_n)^{-1}\big]p^{*K}(u_i)\big\|^2 = E\big\|A^*Q^{*-1}R^*(T_n'QT_n)^{-1}p^{*K}(u_i)\big\|^2\\
&\le C\|A^*\|^2\|R^*\|^2E\big\|p^{*K}(u_i)\big\|^2 \le \tilde C\|A^*\|^2\big(2E\big[\|m_i\|\,\|p_i^*\|\big] + E\|m_i\|^2\big)^2.
\end{aligned}
$$
But by CS and Assumptions B and H, $a(p_j^*) = E\big[\nu(w_i)p_j^*(u_i)\big] \le \big(E[\nu(w_i)^2]E[p_j^*(u_i)^2]\big)^{1/2} < \infty$, and therefore $\|A^*\|^2 = \|a(p^{*K\prime})\|^2 = O(K)$. Then, by using the previous results of (A.12) and (A.13), and by Assumption D, which implies that $n^{-\delta}\zeta_2^v(K)K$ and $n^{-2\delta}\zeta_2^v(K)^2K^{1/2}$ converge to zero, we conclude that $E\|\nu_K^* - \nu_K\|^2 \to 0$. Lastly, let $\nu_{\tau,K}(w) = AQ_\tau^{-1}p^K(w)$. Then
$$
E\big\|\nu_K - \nu_{\tau,K}\big\|^2 = E\big\|AQ_\tau^{-1}(Q_\tau - Q)Q^{-1}p^K(w_i)\big\|^2 = O\big(\tau_nR_n^2\big)E\big\|\tilde\alpha_K'p^K(w_i)\big\|^2 = o(1)
$$
by Assumption F. Given
$$
\sqrt{E\|\nu - \nu_{\tau,K}\|^2} \le \sqrt{E\|\nu - \nu_K^*\|^2} + \sqrt{E\|\nu_K^* - \nu_K\|^2} + \sqrt{E\|\nu_K - \nu_{\tau,K}\|^2},
$$
we have the desired result that $E\|\nu - \nu_{\tau,K}\|^2 \to 0$.

Let $b_{\tau,KL}(z) = E\big[d_i\nu_{\tau,K}(w_i)r^L(z_i)'\big]r^L(z)$ and $b_L(z) = E\big[d_i\nu(w_i)r^L(z_i)'\big]r^L(z)$. Then,
$$
E\big\|b_{\tau,KL}(z_i) - b_L(z_i)\big\|^2 \le E\big[d_i^2\|\nu_{\tau,K}(w_i) - \nu(w_i)\|^2\big] \le CE\big\|\nu_{\tau,K}(w_i) - \nu(w_i)\big\|^2 \to 0
$$
as $K \to \infty$. Furthermore, by Assumption E, $E\|b_L(z_i) - \rho(z_i)\|^2 \to 0$ as $L \to \infty$, where $\rho(z)$ is a matrix of projections of elements of $\nu(w)d(X)$ on $\mathcal{L}$, which is the set of limit points of $r^L(z)'\gamma_L$. Therefore (as in (A.10) of NPV), by Assumption E,
$$
V_\tau = E\big[\nu_{\tau,K}(w_i)\nu_{\tau,K}(w_i)'\sigma^2(X_i)\big] + E\big[b_{\tau,KL}(z_i)\operatorname{var}(x_i|z_i)b_{\tau,KL}(z_i)'\big] \to \bar V,
$$
where $\bar V = E\big[\nu(w_i)\nu(w_i)'\sigma^2(X_i)\big] + E\big[\rho(z_i)\operatorname{var}(x_i|z_i)\rho(z_i)'\big]$. This shows that $F$ is bounded. Then, given $\sqrt{n}F(\hat\theta_\tau - \theta_0) \to_d N(0,1)$ from the proof of Theorem 6.1, the conclusion follows from $F^{-1} \to \bar V^{1/2}$, so that $F^{-1}\sqrt{n}F(\hat\theta_\tau - \theta_0) \to_d N(0, \bar V)$. $\Box$
References

Amemiya, T., 1977. The maximum likelihood and the nonlinear three-stage least squares estimator in the general nonlinear simultaneous equation model. Econometrica, 955–968.

Andrews, D. W. K., Guggenberger, P., 2015. Identification- and singularity-robust inference for moment condition models.

Andrews, D. W. K., Cheng, X., 2012. Estimation and inference with weak, semi-strong, and strong identification. Econometrica 80 (5), 2153–2211.

Andrews, D. W. K., Stock, J. H., 2007. Inference with weak instruments. In: Advances in Econometrics: Proceedings of the Ninth World Congress of the Econometric Society.

Andrews, D. W. K., Whang, Y.-J., 1990. Additive interactive regression models: circumvention of the curse of dimensionality. Econometric Theory 6 (04), 466–479.

Andrews, I., Mikusheva, A., 2016a. Conditional inference with a functional nuisance parameter. Econometrica 84 (4), 1571–1612.

Andrews, I., Mikusheva, A., 2016b. A geometric approach to nonlinear econometric models. Econometrica 84 (3), 1249–1264.

Angrist, J. D., Krueger, A. B., 1991. Does compulsory school attendance affect schooling and earnings? The Quarterly Journal of Economics 106 (4), 979–1014.

Angrist, J. D., Lavy, V., 1999. Using Maimonides' rule to estimate the effect of class size on scholastic achievement. The Quarterly Journal of Economics 114 (2), 533–575.

Arlot, S., Celisse, A., 2010. A survey of cross-validation procedures for model selection. Statistics Surveys 4, 40–79.

Belloni, A., Chernozhukov, V., Chetverikov, D., Kato, K., 2015. Some new asymptotic theory for least squares series: Pointwise and uniform results. Journal of Econometrics 186 (2), 345–366.

Blundell, R., Browning, M., Crawford, I., 2008. Best nonparametric bounds on demand responses. Econometrica 76 (6), 1227–1262.

Blundell, R., Chen, X., Kristensen, D., 2007. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica 75 (6), 1613–1669.

Blundell, R., Duncan, A., 1998. Kernel regression in empirical microeconomics. Journal of Human Resources 33, 62–87.

Blundell, R., Duncan, A., Pendakur, K., 1998. Semiparametric estimation and consumer demand. Journal of Applied Econometrics 13 (5), 435–461.

Blundell, R., Powell, J. L., 2003. Endogeneity in nonparametric and semiparametric regression models. Econometric Society Monographs 36, 312–357.

Blundell, R. W., Powell, J. L., 2004. Endogeneity in semiparametric binary response models. The Review of Economic Studies 71 (3), 655–679.

Bound, J., Jaeger, D. A., Baker, R. M., 1995. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association 90 (430), 443–450.

Breza, E., 2012. Peer effects and loan repayment: Evidence from the Krishna default crisis. Job Market Paper, MIT.

Carrasco, M., Florens, J.-P., Renault, E., 2007. Linear inverse problems in structural econometrics estimation based on spectral decomposition and regularization. Handbook of Econometrics 6, 5633–5751.

Chay, K., Munshi, K., 2014. Black networks after emancipation: Evidence from reconstruction and the great migration. Unpublished working paper.

Chen, X., Pouzo, D., 2012. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica 80 (1), 277–321.

Chernozhukov, V., Hansen, C., 2005. An IV model of quantile treatment effects. Econometrica 73 (1), 245–261.

Chesher, A., 2003. Identification in nonseparable models. Econometrica 71 (5), 1405–1441.

Chesher, A., 2007. Instrumental values. Journal of Econometrics 139 (1), 15–34.

Coe, N. B., von Gaudecker, H.-M., Lindeboom, M., Maurer, J., 2012. The effect of retirement on cognitive functioning. Health Economics 21 (8), 913–927.

Darolles, S., Fan, Y., Florens, J.-P., Renault, E., 2011. Nonparametric instrumental regression. Econometrica 79 (5), 1541–1565.

Das, M., Newey, W. K., Vella, F., 2003. Nonparametric estimation of sample selection models. The Review of Economic Studies 70 (1), 33–58.

Davidson, R., MacKinnon, J. G., 1993. Estimation and Inference in Econometrics. OUP Catalogue.

Del Bono, E., Weber, A., 2008. Do wages compensate for anticipated working time restrictions? Evidence from seasonal employment in Austria. Journal of Labor Economics 26 (1), 181–221.

Dufour, J.-M., 1997. Some impossibility theorems in econometrics with applications to structural and dynamic models. Econometrica, 1365–1387.

Dustmann, C., Meghir, C., 2005. Wages, experience and seniority. The Review of Economic Studies 72 (1), 77–108.

Engl, H. W., Hanke, M., Neubauer, A., 1996. Regularization of Inverse Problems. Vol. 375. Springer.

Frazer, G., 2008. Used-clothing donations and apparel production in Africa. The Economic Journal 118 (532), 1764–1784.

Freyberger, J., 2015. On completeness and consistency in nonparametric instrumental variable models. Working paper.

Garg, K. M., 1998. Theory of Differentiation. Wiley.

Giorgi, G., Guerraggio, A., Thierfelder, J., 2004. Mathematics of Optimization: Smooth and Nonsmooth Case. Elsevier.

Hall, P., Horowitz, J. L., 2005. Nonparametric methods for inference in the presence of instrumental variables. The Annals of Statistics 33 (6), 2904–2929.

Han, S., McCloskey, A., 2015. Estimation and inference with a (nearly) singular Jacobian. Unpublished manuscript, University of Texas at Austin and Brown University.

Hastie, T., Tibshirani, R., 1986. Generalized additive models. Statistical Science, 297–310.

Henderson, D. J., Papageorgiou, C., Parmeter, C. F., 2013. Who benefits from financial development? New methods, new evidence. European Economic Review 63, 47–67.

Hengartner, N. W., Linton, O. B., 1996. Nonparametric regression estimation at design poles and zeros. Canadian Journal of Statistics 24 (4), 583–591.

Hoderlein, S., 2009. Endogenous semiparametric binary choice models with heteroscedasticity. Cemmap working paper.

Hong, Y., White, H., 1995. Consistent specification testing via nonparametric series regression. Econometrica, 1133–1159.

Horowitz, J. L., 2011. Applied nonparametric instrumental variables estimation. Econometrica 79 (2), 347–394.

Imbens, G. W., Newey, W. K., 2009. Identification and estimation of triangular simultaneous equations models without additivity. Econometrica 77 (5), 1481–1512.

Jiang, J., Fan, Y., Fan, J., 2010. Estimation in additive models with highly or nonhighly correlated covariates. The Annals of Statistics 38 (3), 1403–1432.

Jun, S. J., Pinkse, J., 2012. Testing under weak identification with conditional moment restrictions. Econometric Theory 28 (6), 1229.

Kasy, M., 2014. Instrumental variables with unrestricted heterogeneity and continuous treatment. The Review of Economic Studies 81 (4), 1614–1636.

Kleibergen, F., 2002. Pivotal statistics for testing structural parameters in instrumental variables regression. Econometrica 70 (5), 1781–1803.

Kleibergen, F., 2005. Testing parameters in GMM without assuming that they are identified. Econometrica 73 (4), 1103–1123.

Koster, H. R., Ommeren, J., Rietveld, P., 2014. Agglomeration economies and productivity: a structural estimation approach using commercial rents. Economica 81 (321), 63–85.

Kress, R., 1999. Linear Integral Equations. Vol. 82. Springer.

Lee, J. M., 2011. Topological spaces. In: Introduction to Topological Manifolds. Springer, pp. 19–48.

Lee, S., 2007. Endogeneity in quantile regression models: A control function approach. Journal of Econometrics 141 (2), 1131–1158.

Linton, O. B., 1997. Miscellanea: Efficient estimation of additive nonparametric regression models. Biometrika 84 (2), 469–473.

Lyssiotou, P., Pashardes, P., Stengos, T., 2004. Estimates of the black economy based on consumer demand approaches. The Economic Journal 114 (497), 622–640.

Mazzocco, M., 2012. Testing efficient risk sharing with heterogeneous risk preferences. The American Economic Review 102 (1), 428–468.

Moreira, M. J., 2003. A conditional likelihood ratio test for structural models. Econometrica 71 (4), 1027–1048.

Newey, W. K., 1990. Efficient instrumental variables estimation of nonlinear models. Econometrica, 809–837.

Newey, W. K., 1997. Convergence rates and asymptotic normality for series estimators. Journal of Econometrics 79 (1), 147–168.

Newey, W. K., Powell, J. L., 2003. Instrumental variable estimation of nonparametric models. Econometrica 71 (5), 1565–1578.

Newey, W. K., Powell, J. L., Vella, F., 1999. Nonparametric estimation of triangular simultaneous equations models. Econometrica 67 (3), 565–603.

Nielsen, J. P., Sperlich, S., 2005. Smooth backfitting in practice. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (1), 43–61.

Pinkse, J., 2000. Nonparametric two-step regression estimation when regressors and error are dependent. Canadian Journal of Statistics 28 (2), 289–300.

Skinner, J. S., Fisher, E. S., Wennberg, J., 2005. The efficiency of Medicare. In: Analyses in the Economics of Aging. University of Chicago Press, pp. 129–160.

Sperlich, S., Linton, O. B., Härdle, W., 1999. Integration and backfitting methods in additive models: finite sample properties and comparison. Test 8 (2), 419–458.

Staiger, D., Stock, J. H., 1997. Instrumental variables regression with weak instruments. Econometrica 65 (3), 557–586.

Stock, J. H., Wright, J. H., 2000. GMM with weak identification. Econometrica 68 (5), 1055–1096.

Stock, J. H., Yogo, M., 2005. Testing for weak instruments in linear IV regression. In: Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, 80–108.

Stone, C. J., 1982. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 1040–1053.

Taylor, A. E., 1965. General Theory of Functions and Integration. Dover Publications.

Weyl, H., 1912. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen 71 (4), 441–479.

Yatchew, A., No, J. A., 2001. Household gasoline demand in Canada. Econometrica 69 (6), 1697–1709.
µ² =                   4         8        16        32        64       128       256

τ = 0
  Bias²            0.0377    0.0335    0.0054    0.0008    0.0000    0.0003    0.0000
  Var             99.0147    3.8019    0.9395    0.1419    0.0711    0.0310    0.0186
  MSE             99.0524    3.8354    0.9449    0.1426    0.0711    0.0313    0.0186
  MSE_IV/MSE_LS  374.7291   15.9165    3.3232    0.5875    0.2790    0.1472    0.0901

τ = 0.001
  Bias²            0.0328    0.0131    0.0030    0.0010    0.0002    0.0000    0.0000
  Var              0.3727    0.2557    0.1497    0.0829    0.0427    0.0349    0.0174
  MSE              0.4055    0.2688    0.1527    0.0839    0.0429    0.0349    0.0174
  MSE_PIV/MSE_IV   0.0035    0.1203    0.5365    0.6888    0.8297    0.9074    0.9452

τ = 0.005
  Bias²            0.1145    0.0682    0.0305    0.0150    0.0042    0.0017    0.0010
  Var              0.4727    0.1332    0.0732    0.0894    0.0345    0.0248    0.0354
  MSE              0.5872    0.2014    0.1037    0.1045    0.0387    0.0265    0.0364
  MSE_PIV/MSE_IV   0.0024    0.0894    0.3501    0.5594    0.6462    0.7795    0.7464

τ = 0.01
  Bias²            0.1566    0.1068    0.0685    0.0346    0.0158    0.0047    0.0022
  Var              0.2117    0.1981    0.2965    0.0318    0.0265    0.0183    0.0132
  MSE              0.3684    0.3049    0.3649    0.0664    0.0423    0.0230    0.0154
  MSE_PIV/MSE_IV   0.0037    0.0795    0.3862    0.4655    0.5942    0.7345    0.8238

Table 1: Integrated squared bias, integrated variance, and integrated MSE of the penalized and unpenalized IV estimators $\hat g_\tau(\cdot)$ and $\hat g(\cdot)$.
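For reference, the integrated quantities reported in Table 1 can be computed, in a Monte Carlo design of this kind, by averaging pointwise moments of the simulated estimates over a fixed evaluation grid; the helper below is a schematic sketch in which the array of simulated fits and the true function on the grid are placeholders.

```python
import numpy as np

def integrated_mse(estimates, g0_grid):
    """estimates: (n_reps, n_grid) array of simulated g_hat on a fixed grid.
    g0_grid: true g0 on the same grid. Integration is approximated by the
    grid average (uniform weighting), as is common in simulation studies."""
    mean_fit = estimates.mean(axis=0)
    bias2 = np.mean((mean_fit - g0_grid) ** 2)   # integrated squared bias
    var = np.mean(estimates.var(axis=0))         # integrated variance
    return bias2, var, bias2 + var               # integrated MSE
```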
τ           0.005     0.01      0.015     0.02      0.05
CV value    37.5315   37.5133   37.5065   37.5035   37.5070

Table 2: Cross-validation values for the choice of τ.
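A criterion like the one in Table 2 can be approximated by simple K-fold cross-validation over a grid of τ values; the sketch below is hypothetical (the fold count, the grid, and the call to the `penalized_series_iv` sketch given earlier are all assumptions), and it rebuilds the held-out basis by refitting, a deliberate shortcut.

```python
import numpy as np

def cv_choose_tau(y, x, z, taus, n_folds=5):
    """Pick tau minimizing K-fold out-of-sample squared error (schematic)."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(0).permutation(n), n_folds)
    scores = []
    for tau in taus:
        err = 0.0
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)
            beta, _ = penalized_series_iv(y[train], x[train], z[train], tau=tau)
            # Shortcut: rebuild the held-out basis by refitting the first stage
            # on the test sample; a faithful implementation would reuse the
            # training-sample first-stage fit to form v_hat on the test sample.
            _, P_test = penalized_series_iv(y[test], x[test], z[test], tau=tau)
            err += np.sum((y[test] - P_test @ beta) ** 2)
        scores.append(err / n)
    return taus[int(np.argmin(scores))], scores
```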
(a) With a weak instrument. (b) With a strong instrument.
Figure 2: Penalized versus unpenalized estimators ($\hat g_\tau(\cdot)$ vs. $\hat g(\cdot)$), $\tau = 0.001$. The (blue) dot-dash line is the true $g_0(\cdot)$. The (black) solid line is the (simulated) mean of $\hat g(\cdot)$, with the dotted band representing the 0.025–0.975 quantile range. Note that the difference between $g_0(\cdot)$ and the mean of $\hat g(\cdot)$ is the (simulated) bias. The (red) solid line is the mean of $\hat g_\tau(\cdot)$, with the dashed band giving the 0.025–0.975 quantile range.
(a) With a weak instrument. (b) With a strong instrument.
Figure 3: Penalized versus unpenalized estimators ($\hat g_\tau(\cdot)$ vs. $\hat g(\cdot)$), $\tau = 0.005$.
Figure 4: Unpenalized IV estimates with nonparametric first-stage equations, full sample (n = 2019), 95% confidence band
Figure 5: Penalized IV estimates with the discontinuity sample (n = 650, F = 191.66).