Statistical Inference on Panel Data Models: A Kernel Ridge Regression Method

arXiv:1703.03031v1 [math.ST] 8 Mar 2017

Shunan Zhao¹, Ruiqi Liu², Zuofeng Shang*²

¹Department of Economics, Binghamton University
²Department of Mathematical Sciences, Binghamton University

Abstract: We propose statistical inferential procedures for panel data models with interactive fixed effects in a kernel ridge regression framework. Compared with traditional sieve methods, our method is automatic in the sense that it does not require the choice of basis functions or truncation parameters. Model complexity is controlled by a continuous regularization parameter, which can be selected automatically by generalized cross validation. Based on empirical process theory and tools from functional analysis, we derive joint asymptotic distributions for the estimators in the heterogeneous setting. These joint asymptotic results are then used to construct confidence intervals for the regression means and prediction intervals for future observations, both being the first provably valid intervals in the literature. Marginal asymptotic normality of the functional estimators in the homogeneous setting is also obtained. Simulation and real data analysis demonstrate the advantages of our method.

Keywords and Phrases: kernel ridge regression, panel data models with interactive fixed effects, joint asymptotic distribution, empirical processes, functional Bahadur representation

JEL Classification: C13, C14, C23

1. Introduction

Panel data models with interactive fixed effects (IFE) have many applications in econometrics and statistics, e.g., individuals' education decisions (Carneiro et al., 2003; Cunha et al., 2005), house price analysis (Holly et al., 2010), prediction of investment returns (Eberhardt et al., 2013), and the risk-return relation (Ludvigson and Ng, 2009, 2016). The IFE can capture both cross-section dependence and heterogeneity in the data, which makes these models more flexible than the classic fixed or random effects models. Cross-section dependence is usually characterized by time-varying common factors, and heterogeneity is captured by individual-specific factor loadings.

There is a growing literature in this area addressing various statistical inferential problems. Earlier studies focused on parametric settings; see Pesaran (2006), Bai and Ng (2006), Bai (2009), Moon and Weidner (2010), Moon and Weidner (2015), among others. A common crucial assumption in these papers is that the response and predictor variables are linearly related and the parameters of interest are finite-dimensional.

*Corresponding author. Assistant Professor, Department of Mathematical Sciences, Binghamton University. Address: PO Box 6000, Binghamton, New York 13902-6000, USA; Email: [email protected]; Tel: (607) 777-4263; Fax: (607) 777-2450. Research sponsored by a start-up grant from Binghamton University.

Zhao et al./Panel Data Models with KRR

2

Recently, efforts have been devoted to nonparametric/semiparametric panel data models with IFE. For instance, Freyberger (2012), Su and Jin (2012), Jin and Su (2013), Su and Zhang (2013), and Su et al. (2015), among others, proposed sieve methods for estimating or testing the infinite-dimensional regression functions; Huang (2013) and Cai et al. (2016) proposed local polynomial methods, which differ from sieves in their local nature. The success of sieve methods hinges on a good choice of basis functions; see Chen (2007) for a comprehensive introduction. The truncation parameter, i.e., the number of basis functions used for model fitting, changes in a discrete fashion, which may yield imprecise control of the model complexity, as pointed out in Ramsay and Silverman (2005). For these reasons, it is worthwhile to explore a method that relies less on the choice of basis or of discrete truncation parameters, and that possesses computational, theoretical, and conceptual advantages.

In this paper, we propose a new kernel-based nonparametric method, called kernel ridge regression (KRR), for handling panel data models. Our method relies on the assumption that the regression function belongs to a reproducing kernel Hilbert space (RKHS) generated by a kernel function called the reproducing kernel. The KRR estimator is "basis-free" in the sense that it can be explicitly expressed through the kernel rather than through basis functions. Our method is applicable to a broad class of RKHS, e.g., Euclidean space, Sobolev space, Gaussian kernel space, or spaces with more advanced structures such as semiparametric/additive structures; see Wahba (1990), Shang and Cheng (2013), and Zhao et al. (2016) for more on these spaces. Reproducing kernels corresponding to the above RKHS have (approximately) explicit expressions which can be directly used in our inferential procedures. For applications of RKHS in other fields such as statistical machine learning, see Hofmann et al. (2008).

In contrast to the method of sieves, our KRR method does not involve a discrete truncation parameter. Instead, it searches for the estimator over the entire (possibly infinite-dimensional) function space, with the smoothness of the estimator controlled by a continuous regularization parameter. In smoothing splines or KRR, regularization parameters are usually selected by generalized cross validation (GCV); see Craven and Wahba (1978), Wahba (1990), Gu (2013), Wang (2011). The selection procedure searches for a global minimizer of the GCV criterion, which provides more precise control of model complexity. As a result, the estimator may yield better performance, as observed by Shang and Cheng (2015) in functional data analysis. In this paper, we adapt the traditional GCV criterion to semiparametric panel data models in both heterogeneous and homogeneous settings. The selection algorithms are easy to use, with satisfactory performance as demonstrated in our simulation study and real data analysis (Section 5).

Besides numerical advantages, the proposed method is theoretically powerful. For instance, within our RKHS framework it is more convenient to derive the joint asymptotic distributions of the estimators of the linear and nonlinear components; see Theorems 3.3 and 3.4. Joint asymptotic distributions are considerably harder to establish in the sieve or local


polynomial framework. Our joint asymptotic results can be used to design novel statistical procedures, such as confidence intervals for the regression means and prediction intervals for future observations, and they naturally imply the marginal asymptotic results obtained by Su and Jin (2012) as a corollary.

The theory developed in this paper relies on nonstandard technical tools from empirical process theory and functional analysis. Specifically, functional Bahadur representations (FBR) in the heterogeneous and homogeneous settings, i.e., Theorems 3.2 and 4.2, are proved based on these tools and serve to extract the leading terms of the estimators. Our FBR theory is a nontrivial extension of Shang (2010) and Shang and Cheng (2013) to panel data models, and it plays a central role in our theoretical study.

The rest of this paper is structured as follows. Section 2 contains technical preliminaries, including an introduction to panel data models with IFE and an RKHS framework. Sections 3 and 4 contain the main results. Specifically, in the heterogeneous setting, Section 3 presents estimation procedures for the parameters of each individual unit and derives joint asymptotic normality for the estimators; constructions of confidence intervals and prediction intervals are also presented. In the homogeneous setting, Section 4 presents an estimation procedure for the common regression function and derives its marginal asymptotic normality. Section 5 examines the proposed methods in a simulation study and a real data analysis. Proofs of the main theorems are deferred to Section 6, and proofs of other results are collected in a supplementary document.

2. Preliminary

2.1. Panel Data Models with Interactive Fixed Effects

Let Y_it be a real-valued observation and X_it ∈ X_i ⊆ R^d a vector of observed covariates, both collected on the i-th unit at time t, for i ∈ [N] := {1, 2, …, N}, t ∈ [T] := {1, …, T}.
Suppose that the observations follow the semiparametric regression model

Y_it = g_i(X_it) + γ_{1i}′f_{1t} + γ_{2i}′f_{2t} + ε_it,  i ∈ [N], t ∈ [T],  (2.1)

where g_i is an unknown regression function, f_{1t} ∈ R^{q_1} is a vector of observed common factors (including an intercept), f_{2t} ∈ R^{q_2} is a vector of unobserved common factors, γ_{1i} and γ_{2i} are unobserved fixed vectors called factor loadings, and ε_it is unobserved random noise. The term γ_{2i}′f_{2t} is called an interactive fixed effect. In general, we allow g_i to vary across units, i.e., the panels exhibit a heterogeneous structure. The special case g_i = g for all i ∈ [N] means that the panels are homogeneous. In this paper, we consider both cases from theoretical and methodological aspects.

Since the factors in model (2.1) are not identifiable, in the sense that they cannot be consistently estimated, further constraints are needed. Similar to Pesaran (2006), suppose that the X_it are related to the factors through the following data generating equation:

X_it = Γ_{1i}′f_{1t} + Γ_{2i}′f_{2t} + v_it,  i ∈ [N], t ∈ [T],  (2.2)


where Γ_{1i} ∈ R^{q_1×d} and Γ_{2i} ∈ R^{q_2×d} are unobserved but fixed matrices, and v_it ∈ R^d is a vector of random noise. Other constraints guaranteeing identifiability were proposed by Bai (2009) based on principal component analysis. Equation (2.2) provides a convenient way to remove the unobserved factor f_{2t} from model (2.1). Specifically, averaging (2.2) as done by Pesaran (2006), we get

X̄_t = Γ̄₁′f_{1t} + Γ̄₂′f_{2t} + v̄_t,  t ∈ [T],  (2.3)

where X̄_t = N^{-1}Σ_{i=1}^N X_it, Γ̄₁ = N^{-1}Σ_{i=1}^N Γ_{1i}, Γ̄₂ = N^{-1}Σ_{i=1}^N Γ_{2i}, and v̄_t = N^{-1}Σ_{i=1}^N v_it. Throughout we assume that Γ̄₂Γ̄₂′ is invertible, which may hold true if q_2 ≤ d. It follows from (2.3) that

f_{2t} = (Γ̄₂Γ̄₂′)^{-1}Γ̄₂(X̄_t − Γ̄₁′f_{1t} − v̄_t).  (2.4)

Replacing f_{2t} in (2.1) by (2.4), we get

Y_it = g_i(X_it) + γ_{1i}′f_{1t} + γ_{2i}′(Γ̄₂Γ̄₂′)^{-1}Γ̄₂(X̄_t − Γ̄₁′f_{1t} − v̄_t) + ε_it
     = g_i(X_it) + Z_t′β_i + e_it,  i ∈ [N], t ∈ [T],  (2.5)

where Z_t = (f_{1t}′, X̄_t′)′, e_it = ε_it − γ_{2i}′(Γ̄₂Γ̄₂′)^{-1}Γ̄₂v̄_t, and

β_i = ( γ_{1i} − Γ̄₁Γ̄₂′(Γ̄₂Γ̄₂′)^{-1}γ_{2i} ;  Γ̄₂′(Γ̄₂Γ̄₂′)^{-1}γ_{2i} ).
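The algebra behind (2.3)-(2.5) can be verified numerically. The sketch below simulates (2.1)-(2.2), forms Z_t, β_i and e_it exactly as defined above, and checks that (2.5) holds. The dimensions, distributions, and the choice of g_i are illustrative, not taken from the paper.

```python
import numpy as np

# Simulate (2.1)-(2.2) and check the transformed model (2.5) exactly.
# Dimensions, distributions, and g_i are illustrative choices.
rng = np.random.default_rng(0)
N, T, d, q1, q2 = 30, 20, 3, 2, 2      # q2 <= d so Gamma2bar Gamma2bar' can be inverted

f1 = rng.normal(size=(T, q1))          # observed factors f_1t
f2 = rng.normal(size=(T, q2))          # unobserved factors f_2t
G1 = rng.normal(size=(N, q1, d))       # loadings Gamma_1i
G2 = rng.normal(size=(N, q2, d))       # loadings Gamma_2i
g1 = rng.normal(size=(N, q1))          # gamma_1i
g2 = rng.normal(size=(N, q2))          # gamma_2i
v = rng.normal(size=(N, T, d))
eps = rng.normal(size=(N, T))

# (2.2): X_it = Gamma_1i' f_1t + Gamma_2i' f_2t + v_it
X = np.einsum('tq,iqd->itd', f1, G1) + np.einsum('tq,iqd->itd', f2, G2) + v
gX = np.sin(X[:, :, 0]) + (np.arange(N)[:, None] / N) * X[:, :, 1]   # g_i(X_it)
# (2.1): Y_it = g_i(X_it) + gamma_1i' f_1t + gamma_2i' f_2t + eps_it
Y = gX + np.einsum('tq,iq->it', f1, g1) + np.einsum('tq,iq->it', f2, g2) + eps

# Cross-sectional averages and the objects appearing in (2.5)
Xbar, vbar = X.mean(axis=0), v.mean(axis=0)
G1b, G2b = G1.mean(axis=0), G2.mean(axis=0)
M = np.linalg.inv(G2b @ G2b.T)                      # (Gamma2bar Gamma2bar')^{-1}
Z = np.hstack([f1, Xbar])                           # Z_t = (f_1t', Xbar_t')'
beta = np.hstack([g1 - g2 @ M @ G2b @ G1b.T,        # first block of beta_i
                  g2 @ M @ G2b])                    # second block of beta_i
e = eps - (g2 @ M @ G2b) @ vbar.T                   # e_it = eps_it - Delta_i vbar_t

assert np.allclose(Y, gX + beta @ Z.T + e)          # (2.5) holds exactly
```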

The equations (2.1), (2.2) and (2.5) play an important role in the proofs of our main results.

2.2. Kernel Ridge Regression

Suppose g_i ∈ H_i, where H_i is a reproducing kernel Hilbert space (RKHS). Specifically, H_i is a Hilbert space of real-valued functions on X_i, endowed with an inner product ⟨·,·⟩_{H_i}, satisfying the reproducing property: for any x ∈ X_i, there exists a unique element K̄_x^{(i)} ∈ H_i such that ⟨K̄_x^{(i)}, g⟩_{H_i} = g(x) for every g ∈ H_i. The reproducing kernel is defined by K̄^{(i)}(x_1, x_2) = K̄_{x_1}^{(i)}(x_2) for any x_1, x_2 ∈ X_i. The kernel K̄^{(i)} is symmetric, i.e., K̄^{(i)}(x_1, x_2) = K̄^{(i)}(x_2, x_1), and the matrix [K̄^{(i)}(x_i, x_j)]_{i,j=1}^n is positive semi-definite for any x_1, …, x_n ∈ X_i. Berlinet and Thomas-Agnan (2004) provide a nice introduction to RKHS.

By Mercer's theorem, K̄^{(i)} admits a spectral expansion:

K̄^{(i)}(x_1, x_2) = Σ_{ν=1}^∞ ϕ_ν^{(i)}(x_1)ϕ_ν^{(i)}(x_2)/ρ_ν^{(i)},  x_1, x_2 ∈ X_i,  (2.6)

where 0 < ρ_1^{(i)} ≤ ρ_2^{(i)} ≤ ⋯ are eigenvalues and the ϕ_ν^{(i)} are eigenfunctions which form an L²(P_{X_i}) orthonormal basis, with X_i ≡ X_{i1}. This paper focuses on RKHS generated by the following kernels. For simplicity, we use a_n ≍ b_n to represent a_n = O(b_n) and b_n = O(a_n).

Finite Rank Kernel (FRK): The kernel K̄^{(i)} is said to have rank k > 0 if ρ_ν^{(i)} = ∞ for ν > k. For instance, the (k−1)-order polynomial kernel K̄^{(i)}(x_1, x_2) = (1 + x_1′x_2)^{k−1} for x_1, x_2 ∈ R^d has rank k. Clearly, an FRK of rank k corresponds to a parametric space of dimension k.


Polynomially Diverging Kernel (PDK): The kernel K̄^{(i)} is said to be polynomially diverging of order k > 0 if its eigenvalues satisfy ρ_ν^{(i)} ≍ ν^{2k} for ν ≥ 1. For instance, the k-order Sobolev space is an RKHS with a kernel polynomially diverging of order k; see Wahba (1990).

Exponentially Diverging Kernel (EDK): The kernel K̄^{(i)} is said to be exponentially diverging of order k > 0 if its eigenvalues satisfy ρ_ν^{(i)} ≍ exp(bν^k) for ν ≥ 1, for a constant b > 0. For instance, the Gaussian kernel K̄^{(i)}(x_1, x_2) = exp(−|x_1 − x_2|²) corresponds to k = 1; see Lu et al. (2016).

The results of this paper can be applied to more complicated RKHS such as additive RKHS, as described in Remark 4.2.

Let Θ_i := R^{q_1+d} × H_i. We estimate θ = (β, g) ∈ Θ_i via the following kernel ridge regression:

θ̂_i = (β̂_i, ĝ_i) = argmin_{θ∈Θ_i} ℓ_{i,M,η_i}(θ) ≡ argmin_{θ∈Θ_i} { (1/(2T)) Σ_{t=1}^T (Y_it − g(X_it) − Z_t′β)² + (η_i/2)‖g‖²_{H_i} },  (2.7)

where M = (N, T) and η_i > 0 is called a regularization parameter.

Our results rely on an RKHS structure on Θ_i. Specifically, we follow Cheng and Shang (2015) to construct two operators R_i : U_i → Θ_i and P_i : Θ_i → Θ_i, where U_i ≡ {u = (x, z) : x ∈ X_i, z ∈ R^{q_1+d}}, such that for any u = (x, z) ∈ U_i and θ = (β, g) ∈ Θ_i, the following holds:

⟨R_i u, θ̃⟩_i = g̃(x) + z′β̃,  ⟨P_i θ, θ̃⟩_i = η_i⟨g, g̃⟩_{H_i},  for any θ̃ = (β̃, g̃) ∈ Θ_i,  (2.8)

where ⟨·,·⟩_i is an inner product on Θ_i to be defined in (2.9). For any x ∈ X_i, define G_i(x) = E{Z|X_i = x} and Ω_i = E{(Z − G_i(X_i))(Z − G_i(X_i))′}, where Z = Z_1. Clearly, Ω_i is a square matrix of dimension q_1+d. Below we require the eigenvalues of Ω_i to be bounded away from zero and infinity, a standard condition to guarantee semiparametric efficiency; see, e.g., Mammen and van de Geer (1997); Cheng and Shang (2015). Besides, G_i is assumed to be L²-integrable.

Assumption A1. For i ∈ [N], G_i ∈ L²(P_{X_i}). Furthermore, c_1 ≤ λ_min(Ω_i) ≤ λ_max(Ω_i) ≤ c_2 for positive constants c_1, c_2, where λ_min(·) and λ_max(·) denote minimal and maximal eigenvalues.

For any θ_k = (β_k, g_k) ∈ Θ_i, k = 1, 2, define

⟨θ_1, θ_2⟩_i = E{(g_1(X_i) + Z′β_1)(g_2(X_i) + Z′β_2)} + η_i⟨g_1, g_2⟩_{H_i}.  (2.9)

Define ⟨g_1, g_2⟩_{∗,i} = ⟨(0, g_1), (0, g_2)⟩_i. Following Cheng and Shang (2015), Assumption A1 implies that ⟨·,·⟩_i and ⟨·,·⟩_{∗,i} are valid inner products on Θ_i and H_i, respectively. Meanwhile, (H_i, ⟨·,·⟩_{∗,i}) is an RKHS with kernel K^{(i)}(x, y) ≡ Σ_{ν=1}^∞ ϕ_ν^{(i)}(x)ϕ_ν^{(i)}(y)/(1 + η_iρ_ν^{(i)}), x, y ∈ X_i. We can further find a positive definite self-adjoint operator W_i : H_i → H_i and an element A_i ∈ H_i^{q_1+d} such that ⟨W_i g_1, g_2⟩_{∗,i} = η_i⟨g_1, g_2⟩_{H_i} and ⟨A_i, g⟩_{∗,i} = V_i(G_i, g) for any g, g_1, g_2 ∈ H_i, where V_i(g_1, g_2) = E{g_1(X_i)g_2(X_i)}. Define Σ_i = E{G_i(X_i)(G_i(X_i) − A_i(X_i))′}, a symmetric matrix of dimension q_1+d. Proposition 2.1 below, following Cheng and Shang (2015), guarantees (2.8).
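To make the FRK/EDK distinction concrete, the sketch below computes Gram-matrix eigenvalues of a rank-2 polynomial kernel and of the Gaussian kernel on an arbitrary sample. The sample design and sizes are our illustrative choices, not the paper's.

```python
import numpy as np

# Gram-matrix spectra of an FRK vs. an EDK on an arbitrary sample.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)

poly = 1 + np.outer(x, x)                          # (1 + x1*x2)^{k-1} with k = 2 (FRK)
gauss = np.exp(-(x[:, None] - x[None, :]) ** 2)    # Gaussian kernel (EDK, k = 1)

ev = np.linalg.eigvalsh(gauss)[::-1]               # descending eigenvalues

assert np.linalg.matrix_rank(poly) <= 2            # FRK: at most k = 2 nonzero eigenvalues
assert np.linalg.matrix_rank(gauss) > 5            # EDK: high numerical rank
assert ev[30] < 1e-6 * ev[0]                       # but very fast spectral decay
```

The rank bound for the polynomial kernel reflects that an FRK corresponds to a finite-dimensional (parametric) space, while the Gaussian Gram matrix has many nonzero but rapidly decaying eigenvalues.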


Proposition 2.1. For any u = (x, z) ∈ U_i and any θ = (β, g) ∈ Θ_i, (2.8) holds for R_i u = (H_u^{(i)}, T_u^{(i)}) and P_i θ = (H_g^{(∗i)}, T_g^{(∗i)}), where

H_u^{(i)} = (Ω_i + Σ_i)^{-1}(z − V_i(G_i, K_x^{(i)})),
T_u^{(i)} = K_x^{(i)} − A_i′(Ω_i + Σ_i)^{-1}(z − V_i(G_i, K_x^{(i)})),
H_g^{(∗i)} = −(Ω_i + Σ_i)^{-1} V_i(G_i, W_i g),
T_g^{(∗i)} = W_i g + A_i′(Ω_i + Σ_i)^{-1} V_i(G_i, W_i g).

A direct application of Proposition 2.1 is to calculate the Fréchet derivatives of ℓ_{i,M,η_i} exactly. Define U_it = (X_it, Z_t) for i ∈ [N], t ∈ [T]. For θ = (β, g), Δθ = (Δβ, Δg) ∈ Θ_i, we have

Dℓ_{i,M,η_i}(θ)Δθ = ⟨−(1/T)Σ_{t=1}^T (Y_it − ⟨R_i U_it, θ⟩_i)R_i U_it + P_i θ, Δθ⟩_i ≡ ⟨S_{i,M,η_i}(θ), Δθ⟩_i,
DS_{i,M,η_i}(θ)Δθ = (1/T)Σ_{t=1}^T ⟨R_i U_it, Δθ⟩_i R_i U_it + P_i Δθ,
D²S_{i,M,η_i}(θ) = 0.

3. Heterogeneous Model

In this section, we consider the heterogeneous model (2.1), in which the g_i are allowed to differ across units. We estimate each g_i through penalized estimation and develop a joint asymptotic theory for the estimators. Our joint asymptotic results lead to novel statistical procedures and also recover existing marginal asymptotic results.

3.1. Estimation Procedure

By the representer theorem (Wahba, 1990), the minimizer ĝ_i of (2.7) has the expression

ĝ_i(x) = Σ_{t=1}^T a_t K̄^{(i)}(X_it, x) = a′K̄_x^{(i)},  x ∈ X_i,  (3.1)

where a = (a_1, …, a_T)′ and K̄_x^{(i)} = (K̄^{(i)}(X_i1, x), …, K̄^{(i)}(X_iT, x))′. Based on (3.1), we have

‖g‖²_{H_i} = ⟨Σ_{t=1}^T a_t K̄^{(i)}(X_it, ·), Σ_{t=1}^T a_t K̄^{(i)}(X_it, ·)⟩_{H_i} = a′K̄^{(i)}a,  (3.2)

where K̄^{(i)} = (K̄_{X_i1}^{(i)}, …, K̄_{X_iT}^{(i)}) ∈ R^{T×T} is positive semi-definite. So (2.7) can be equivalently transformed to

(â_i, β̂_i) = argmin_{a∈R^T, β∈R^{q_1+d}} (1/(2T))(Y_i − K̄^{(i)}a − Zβ)′(Y_i − K̄^{(i)}a − Zβ) + (η_i/2)a′K̄^{(i)}a,  (3.3)


where Y_i = (Y_i1, …, Y_iT)′ and Z = (Z_1, …, Z_T)′. The solution to (3.3) has the expression

â_i = [(I_T − Z(Z′Z)^{-1}Z′)K̄^{(i)} + Tη_i I_T]^{-1}(I_T − Z(Z′Z)^{-1}Z′)Y_i,
β̂_i = (Z′Z)^{-1}Z′(Y_i − K̄^{(i)}â_i).  (3.4)

Then we estimate g_i by ĝ_i(x) = â_i′K̄_x^{(i)} for any x ∈ X_i.

Remark 3.1. In practice, we choose η_i by minimizing the following GCV function:

η̂_i = argmin_{η_i>0} GCV_i(η_i) ≡ argmin_{η_i>0} ‖(I_T − B_{η_i})Y_i‖₂² / (T[1 − tr(B_{η_i})/T]²),

where B_{η_i} is the so-called smoothing matrix, i.e., the T×T matrix satisfying Ŷ_i = B_{η_i}Y_i, where Ŷ_i = K̄^{(i)}â_i + Zβ̂_i is the fitted response vector.

3.2. Rate of Convergence

We will derive the rate of convergence for θ̂_i. Before that, let us state some technical conditions. For t ≥ j ≥ 1, define F_j^t = σ(f_{1l}, f_{2l} : j ≤ l ≤ t). Define the φ-mixing coefficients

φ(t) = sup_{t_1≥1} sup_{A∈F_{t_1+t}^∞, B∈F_1^{t_1}, P(B)>0} |P(A|B) − P(A)|,  t ≥ 0.

Assumption A2.
(a) {v_it : i ∈ [N], t ∈ [T]} are i.i.d., and {ε_it : i ∈ [N], t ∈ [T]} are i.i.d., both with zero means. Furthermore, the v_it and ε_it are independent.
(b) Both {f_{1t} : t ∈ [T]} and {f_{2t} : t ∈ [T]} are strictly stationary processes satisfying the φ-mixing condition Σ_{t=0}^∞ φ(t)^{1−4/α} < ∞, where α > 4 is a constant; (f_{1t}, f_{2t}) is distributed independently of the v_it and ε_it.
(c) E{‖f_{1t}‖₂^α} < ∞, E{‖f_{2t}‖₂^α} < ∞, E{|v_it|^α} < ∞, E{|ε_it|^α} < ∞, where ‖·‖₂ denotes the Euclidean norm.
(d) sup_{i≥1} ‖Δ_i‖₂ < ∞, where Δ_i = γ_{2i}′(Γ̄₂Γ̄₂′)^{-1}Γ̄₂.
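The closed-form solution (3.4) and the GCV rule of Remark 3.1 can be sketched numerically for a single unit i. In the sketch below the data are simulated; the Gaussian kernel with bandwidth 0.2 and the η-grid are our illustrative choices, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
T, q = 200, 3                                   # q = q1 + d parametric covariates
x = rng.uniform(0, 1, size=T)                   # scalar nonparametric covariate X_it
Z = rng.normal(size=(T, q))                     # regressors Z_t
beta0 = np.array([1.0, -0.5, 2.0])
y = np.sin(2 * np.pi * x) + Z @ beta0 + 0.1 * rng.normal(size=T)

K = np.exp(-((x[:, None] - x[None, :]) / 0.2) ** 2)   # Gaussian kernel Gram matrix
Pz = np.eye(T) - Z @ np.linalg.solve(Z.T @ Z, Z.T)    # projection off the Z-column space

def fit(eta):
    # closed-form solution (3.4)
    a = np.linalg.solve(Pz @ K + T * eta * np.eye(T), Pz @ y)
    beta = np.linalg.solve(Z.T @ Z, Z.T @ (y - K @ a))
    # smoothing matrix B_eta with yhat = B_eta y, used by the GCV criterion
    A = np.linalg.solve(Pz @ K + T * eta * np.eye(T), Pz)
    B = K @ A + Z @ np.linalg.solve(Z.T @ Z, Z.T @ (np.eye(T) - K @ A))
    return a, beta, B

def gcv(eta):
    _, _, B = fit(eta)
    resid = y - B @ y
    return (resid @ resid / T) / (1 - np.trace(B) / T) ** 2

etas = 10.0 ** np.arange(-8, 0)
eta_hat = min(etas, key=gcv)                    # grid-search GCV, as in Remark 3.1
a, beta_hat, B = fit(eta_hat)
assert np.allclose(B @ y, K @ a + Z @ beta_hat) # B_eta reproduces the fitted values
assert np.max(np.abs(beta_hat - beta0)) < 0.5   # beta recovered reasonably well
```

The grid search stands in for the continuous minimization of GCV_i(η_i); any scalar optimizer could be used instead.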

Assumption A2(a) requires that the variables ε_it, v_it are zero-mean and independent. Assumption A2(b) specifies that the factors are strictly stationary and φ-mixing, and independent of the v_it and ε_it; the independence assumption can be relaxed to mixing conditions at the cost of more tedious technical arguments. Assumption A2(c) requires that the variables have finite α-th moments. Assumption A2(d) requires that the vectors Δ_i based on the "true" factor loadings are uniformly bounded.

The following assumption says that the ϕ_ν^{(i)} are uniformly bounded and that (ϕ_ν^{(i)}, ρ_ν^{(i)}) simultaneously diagonalize V_i and ⟨·,·⟩_{H_i}, a standard assumption in the kernel ridge regression literature, e.g., Shang and Cheng (2013). This condition holds for polynomially diverging kernels, exponentially diverging kernels, and finite rank kernels on compactly supported X_i; see, e.g., Wahba (1990); Shang and Cheng (2013); Zhao et al. (2016). Besides, we need 1 ∉ H_i for identifiability.


Assumption A3. For any i ∈ [N], sup_{ν≥1} sup_{x∈X_i} |ϕ_ν^{(i)}(x)| < ∞ and

V_i(ϕ_ν^{(i)}, ϕ_µ^{(i)}) = δ_{νµ},  ⟨ϕ_ν^{(i)}, ϕ_µ^{(i)}⟩_{H_i} = ρ_ν^{(i)}δ_{νµ},  ν, µ ≥ 1,

where δ_{νµ} is Kronecker's delta. Furthermore, 1 ∉ H_i, and any g ∈ H_i satisfies g = Σ_{ν=1}^∞ g_ν ϕ_ν^{(i)}, where g_ν = V_i(g, ϕ_ν^{(i)}) is a real sequence satisfying Σ_{ν=1}^∞ ρ_ν^{(i)} g_ν² < ∞.

For any θ = (β, g) ∈ Θ_i, define ‖θ‖_{i,sup} = sup_{x∈X_i}|g(x)| + ‖β‖₂. For p, δ > 0, define G_i(p) = {θ = (β, g) ∈ Θ_i : ‖θ‖_{i,sup} ≤ 1, ‖g‖_{H_i} ≤ p} and the corresponding entropy integral

J_i(p, δ) = ∫₀^δ ψ₂^{-1}(D_i(ε, G_i(p), ‖·‖_{i,sup})) dε + δψ₂^{-1}(D_i(δ, G_i(p), ‖·‖_{i,sup})²),

where ψ₂(s) = exp(s²) − 1 and D_i(ε, G_i(p), ‖·‖_{i,sup}) is the ε-packing number of G_i(p) in the ‖·‖_{i,sup}-metric. Let θ_{i0} = (β_{i0}, g_{i0}) denote the "true" value of (β, g) in (2.5). Define

h_i = [Σ_{ν=1}^∞ (1 + η_iρ_ν^{(i)})^{-1}]^{-1},  r_{i,M} = (Th_i)^{-1/2} + η_i^{1/2} + (Nh_i)^{-1/2}.

It can be shown that h_i ≍ η_i^{1/(2k)} for a k-order PDK, and h_i ≍ (log(1/η_i))^{-1/k} for a k-order EDK. We use (N, T) → ∞ to represent both N → ∞ and T → ∞.

Theorem 3.1. Suppose that Assumptions A1-A3 are satisfied. Furthermore, as (N, T) → ∞, the following conditions hold:

η_i = o(1),  h_i = o(1),  T^{2/α−1}h_i^{-1} = o(1),  η_i + (Nh_i)^{-1} = O(h_i),

T^{−1/2+1/α} h_i^{-1/2} max{h_i^{-1/2}, T^{1/α}} J_i((η_i^{-1}h_i)^{1/2}, 1) × √(log N + log log(T J_i((η_i^{-1}h_i)^{1/2}, 1))) = o(1).  (3.5)

Then for any i ∈ [N], ‖θ̂_i − θ_{i0}‖_i = O_P(r_{i,M}), as (N, T) → ∞.

Theorem 3.1 presents a rate of convergence for θ̂_i in the ‖·‖_i-norm, which depends on h_i, N, and T. The ‖·‖_i-norm is stronger than the L²-norm commonly used in the literature; see Su and Jin (2012), Su and Zhang (2013). The optimal choice of h_i, denoted h_i^∗, depends on the type of kernel and on the relationship between N and T, i.e., N ≥ T or N < T. The optimal convergence rate, denoted r_{i,M}^∗, can be calculated accordingly; see Table 1. We observe that h_i^∗ and r_{i,M}^∗ only depend on the smaller of N and T, for both PDK and EDK. Rate conditions (3.5) are satisfied under h_i ≍ h_i^∗.

[Table 1: optimal h_i^∗ and r_{i,M}^∗ for PDK and EDK, under N ≥ T and N < T; the entries did not survive extraction.]
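The claim that h_i ≍ η_i^{1/(2k)} for a k-order PDK can be checked numerically from the definition of h_i. Truncating the eigenvalue series at 10⁶ terms is an implementation choice of this sketch.

```python
import numpy as np

# h_i = [sum_nu (1 + eta * rho_nu)^{-1}]^{-1} with rho_nu = nu^{2k} (k-order PDK)
# should scale like eta^{1/(2k)} as eta -> 0.
k = 2
nu = np.arange(1, 10 ** 6 + 1, dtype=float)
rho = nu ** (2 * k)

def h(eta):
    return 1.0 / np.sum(1.0 / (1.0 + eta * rho))

etas = [1e-4, 1e-5, 1e-6]
ratios = [h(e) / e ** (1.0 / (2 * k)) for e in etas]

# the ratio h(eta) / eta^{1/(2k)} stabilizes to a constant as eta -> 0
assert max(ratios) / min(ratios) < 1.1
```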

[Part of the text preceding this statement, including Theorem 3.2, did not survive extraction.]

Theorem 3.3. Suppose that […] > 0, and α_{x0}, β_{x0} ∈ R^{q_1+d} are nonrandom constants. Then,

( √T(β̂_i − β_{i0}^∗) − √T E{e_it H_{U_it}^{(i)}} ,
  √(Th_i)(ĝ_i(x_0) − g_{i0}^∗(x_0)) − √(Th_i) E{e_it T_{U_it}^{(i)}(x_0)} )′ →_d N(0, Ψ^∗),  as (N, T) → ∞,  (3.7)

where

Ψ^∗ = σ_ε² ( Ω_i^{-1}                         Ω_i^{-1}(α_{x0} + β_{x0})
             (α_{x0} + β_{x0})′Ω_i^{-1}        σ_{x0}² + 2β_{x0}′Ω_i^{-1}α_{x0} + β_{x0}′Ω_i^{-1}β_{x0} ),  and σ_ε² = Var(ε_it).  (3.8)

Theorem 3.3 proves joint asymptotic normality for β̂_i and ĝ_i(x_0). The estimators are, however, asymptotically biased: they are not centered at the truth β_{i0} and g_{i0}(x_0). To correct the bias, we need to assume T/N = o(1), as in Theorem 3.4 below. This condition means that the number of observations within each individual unit is strictly smaller than the number of units, so that the cross section provides enough information. We expect that the bias cannot be corrected if T ≥ N. Indeed, our theoretical analysis indicates a possibly sharp upper bound √T E{e_it H_{U_it}^{(i)}} = O_P(√(T/N)). When T ≥ N, this term results in uncorrectable bias in estimating β_i. Relevant


assumptions exist in the literature for bias correction, e.g., T/N² = o(1) considered by Pesaran (2006) in a parametric setting, and κT/N = o(1) considered by Su and Jin (2012) in sieve estimation, where κ represents the discrete truncation parameter (i.e., the number of basis functions). Compared to the latter, our condition is weaker. Another condition for bias correction is that G_i is sufficiently smooth, i.e., condition (3.9) in Theorem 3.4. Such a condition holds if the conditional distribution of the factor variable f_{1t} given X_it is smooth. As a by-product, β̂_i and ĝ_i(x_0) become asymptotically independent, which facilitates applications; e.g., one does not need to estimate the correlation between the two estimators. Define G_i = (G_{i,1}, …, G_{i,q_1+d})′.

Theorem 3.4. Suppose that the conditions in Theorem 3.3 hold, Tη_i = o(1) and T/N = o(1). Furthermore, there exists a positive nondecreasing sequence k_ν with Σ_{ν≥1} 1/k_ν < ∞ such that

Σ_ν |V_i(G_{i,k}, ϕ_ν^{(i)})|² k_ν < ∞,  for k = 1, …, q_1+d.  (3.9)

Then we have, for any x_0 ∈ X_i,

( √T(β̂_i − β_{i0}), √(Th_i)(ĝ_i(x_0) − g_{i0}(x_0)) )′ →_d N(0, Ψ),  as (N, T) → ∞,  (3.10)

where

Ψ = σ_ε² ( Ω_i^{-1}  0
           0         σ_{x0}² ).  (3.11)

An application of Theorem 3.4 is the construction of a confidence interval for the regression mean. Suppose X_{i,t+1} = x_{i0} and f_{1,t+1} = f_{10} with known x_{i0} and f_{10}, i.e., the predictor variables of each individual are observed at future time t+1. By (2.5), the conditional mean of Y_{i,t+1} is μ_{i0} ≡ E{Y_{i,t+1} | X_{i,t+1} = x_{i0}, f_{1,t+1} = f_{10}} ≈ g_{i0}(x_{i0}) + z_0′β_{i0}, where z_0 = (f_{10}′, N^{-1}Σ_{i=1}^N x_{i0}′)′. We propose the following 1−α confidence interval for μ_{i0}:

μ̂_i ± z_{1−α/2} σ_{x0}σ_ε/√(Th_i),  where μ̂_i = ĝ_i(x_{i0}) + z_0′β̂_i.  (3.12)

Here, z_{1−α/2} is the (1−α/2)-quantile of the standard normal distribution. Corollary 3.5 below guarantees the asymptotic validity of (3.12).

Corollary 3.5. Under the conditions of Theorem 3.4, we have, as (N, T) → ∞,

√(Th_i)(μ̂_i − μ_{i0}) →_d N(0, σ_{x0}²σ_ε²).  (3.13)

Another application of Theorem 3.4 is to construct a prediction interval for Y_{i,t+1}. By (2.5),

Y_{i,t+1} − μ̂_i = (μ_{i0} − μ̂_i) + ε_{i,t+1} − γ_{2i}′(Γ̄₂Γ̄₂′)^{-1}Γ̄₂ v̄_{t+1}.  (3.14)

The proof of Theorem 3.1 indicates that the last term of (3.14) is O_P(N^{-1/2}), whereas the first term is O_P((Th_i)^{-1/2}) thanks to Corollary 3.5. If Th_i = o(N), i.e., the last term of (3.14) is


asymptotically negligible, then the asymptotic distribution of (3.14) is the convolution of F and the distribution N(0, σ_{x0}²σ_ε²/(Th_i)), where F is the c.d.f. of ε_{i,t+1}. Let q_{α/2} and q_{1−α/2} be the α/2- and (1−α/2)-quantiles of this convolution; then a 1−α prediction interval for Y_{i,t+1} is

[μ̂_i + q_{α/2}, μ̂_i + q_{1−α/2}].  (3.15)

In particular, if ε_{i,t+1} ∼ N(0, σ_ε²), then (3.15) becomes

μ̂_i ± z_{1−α/2} σ_ε √(σ_{x0}²/(Th_i) + 1).

The asymptotic variance σ_{x0}² has the explicit expression

σ_{x0}² = h_i V_i(K_{x_0}^{(i)}, K_{x_0}^{(i)}) = Σ_{ν≥1} h_i|ϕ_ν^{(i)}(x_0)|²/(1 + η_iρ_ν^{(i)})².

In practice, we can estimate σ_{x0}² by replacing ϕ_ν^{(i)} and ρ_ν^{(i)} with their empirical counterparts, i.e., kernel eigenfunctions and kernel eigenvalues; see, e.g., Braun (2006). Meanwhile, we estimate σ_ε² by

σ̂_ε² = (1/T)Σ_{t=1}^T {Y_it − ĝ_i(X_it) − Z_t′β̂_i}².  (3.16)
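Given plug-in estimates, the interval (3.12) and the Gaussian-error version of (3.15) reduce to simple arithmetic. In the sketch below all numerical values (μ̂_i, σ_{x0}, σ_ε, T, h_i) are illustrative placeholders, not quantities from the paper.

```python
import math

# Plug-in construction of the 1-alpha confidence interval (3.12) and the
# Gaussian-error prediction interval. All numbers are placeholders.
alpha = 0.05
z = 1.959964                      # z_{1-alpha/2} for alpha = 0.05
mu_hat = 2.3                      # ghat_i(x_i0) + z_0' betahat_i
sigma_x0, sigma_eps = 0.8, 0.5    # estimated as described in the text
T, h_i = 200, 0.15

# confidence interval (3.12) for the regression mean mu_i0
half_ci = z * sigma_x0 * sigma_eps / math.sqrt(T * h_i)
ci = (mu_hat - half_ci, mu_hat + half_ci)

# prediction interval for Y_{i,t+1} under eps ~ N(0, sigma_eps^2)
half_pi = z * sigma_eps * math.sqrt(sigma_x0 ** 2 / (T * h_i) + 1)
pi = (mu_hat - half_pi, mu_hat + half_pi)

# the prediction interval is necessarily wider than the confidence interval
assert pi[0] < ci[0] < ci[1] < pi[1]
```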

The following result shows that (3.16) is a consistent estimator.

Proposition 3.6. Under the conditions of Theorem 3.1, if r_{i,M} = o_P(h_i^{1/2}) and r_{i,M} = o_P(T^{−1/α}), then σ̂_ε² → σ_ε² in probability, as (N, T) → ∞.

4. Homogeneous Model

In this section, we consider the homogeneous case, i.e., g_i = g for all i ∈ [N]. Assuming homogeneity, model (2.1) becomes

Y_it = g(X_it) + γ_{1i}′f_{1t} + γ_{2i}′f_{2t} + ε_it,  i ∈ [N], t ∈ [T].  (4.1)

By (2.2) and an argument similar to (2.5), we can rewrite (4.1) as

Y_it = g(X_it) + Z_t′β_i + e_it,  i ∈ [N], t ∈ [T],  (4.2)

where Z_t, β_i and e_it are given in (2.5). We will provide a procedure for estimating g and explore its asymptotic properties. Our theoretical results hold when M → ∞; here M → ∞ means either (N, T) → ∞, or N → ∞ with T fixed. In contrast, the estimation of β_i is inconsistent when T is fixed, due to insufficient data in each individual unit; the β_i are treated as nuisance parameters throughout this section.


4.1. Estimation Procedure

Suppose g belongs to an RKHS H ⊂ L²(X) with inner product ⟨·,·⟩_H and kernel K̄(·,·), where X ⊆ R^d. Our estimation is based on profile least squares.

Step (a). For any g ∈ H, we estimate β_i through

min_{β_i∈R^{q_1+d}} Σ_{t=1}^T (Y_it − g(X_it) − Z_t′β_i)² = min_{β_i∈R^{q_1+d}} (Y_i − τ_i g − Σ′β_i)′(Y_i − τ_i g − Σ′β_i),  i ∈ [N],  (4.3)

where Y_i = (Y_i1, …, Y_iT)′, τ_i g = (g(X_i1), …, g(X_iT))′ and Σ = (Z_1, …, Z_T). Recall that Σ is (q_1+d)×T. Suppose T ≥ q_1+d so that (ΣΣ′)^{-1} exists. Then (4.3) has solution β̂_i = (ΣΣ′)^{-1}Σ(Y_i − τ_i g).

Step (b). Plug the above β̂_i into (4.3). The minimum value of (4.3) equals (Y_i − τ_i g)′(I_T − Σ′(ΣΣ′)^{-1}Σ)(Y_i − τ_i g). We then estimate g by

ĝ = argmin_{g∈H} ℓ_{M,η}(g) ≡ argmin_{g∈H} { (1/(2NT)) Σ_{i=1}^N (Y_i − τ_i g)′P(Y_i − τ_i g) + (η/2)‖g‖²_H },  (4.4)

where η > 0 is a penalty parameter and P = I_T − Σ′(ΣΣ′)^{-1}Σ, with I_T the T×T identity.

Step (b) yields an explicit solution. Specifically, by the representer theorem, ĝ satisfies

ĝ(x) = Σ_{i=1}^N Σ_{t=1}^T a_it K̄(X_it, x) = a′K̄_x,  (4.5)

where a = (a_11, …, a_1T, …, a_N1, …, a_NT)′ are constant scalars, and K̄_x = (K̄(X_11, x), …, K̄(X_1T, x), …, K̄(X_N1, x), …, K̄(X_NT, x))′. Similar to (3.2), ‖g‖²_H = a′K̄a, where K̄ = (K̄_{X_11}, …, K̄_{X_1T}, …, K̄_{X_N1}, …, K̄_{X_NT}) ∈ R^{NT×NT}. Therefore, we can rewrite ℓ_{M,η}(g) as ℓ_{M,η}(g) = (Y − K̄a)′P_N(Y − K̄a)/(2NT) + ηa′K̄a/2, where Y = (Y_11, …, Y_1T, …, Y_N1, …, Y_NT)′ is an NT-vector and P_N = I_N ⊗ P is NT×NT. The minimizer â has the expression â = (P_N K̄ + NTη I_NT)^{-1} P_N Y. Then ĝ(x) = â′K̄_x for any x ∈ X.
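A minimal numerical sketch of the profile least-squares estimator (4.3)-(4.5) on a simulated homogeneous panel. The Gaussian kernel with bandwidth 0.2 and the fixed η = 10⁻⁵ are our illustrative choices; in practice η would be chosen by GCV as in Remark 4.1 below.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, q = 15, 40, 2
X = rng.uniform(0, 1, size=(N, T))              # scalar covariates X_it
Z = rng.normal(size=(T, q))                     # common regressors Z_t
beta = rng.normal(size=(N, q))                  # unit-specific nuisance beta_i
g0 = lambda x: np.cos(2 * np.pi * x)            # common regression function
Y = g0(X) + beta @ Z.T + 0.1 * rng.normal(size=(N, T))

P = np.eye(T) - Z @ np.linalg.solve(Z.T @ Z, Z.T)      # P = I_T - Sigma'(Sigma Sigma')^{-1}Sigma
xs, y = X.ravel(), Y.ravel()                           # stack to NT observations
K = np.exp(-((xs[:, None] - xs[None, :]) / 0.2) ** 2)  # NT x NT Gram matrix
PN = np.kron(np.eye(N), P)                             # P_N = I_N kron P

eta = 1e-5
# closed form for (4.4): a_hat = (P_N K + NT*eta*I)^{-1} P_N Y
a = np.linalg.solve(PN @ K + N * T * eta * np.eye(N * T), PN @ y)

def g_hat(x):
    return a @ np.exp(-((xs - x) / 0.2) ** 2)          # ghat(x) = a' Kbar_x, as in (4.5)

err = max(abs(g_hat(v) - g0(v)) for v in np.linspace(0.1, 0.9, 9))
assert err < 0.5                                       # rough recovery of g
```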

η>0

k(IN T − Bη )Y k22 , N T [1 − tr(Bη )/(N T )]2

where Bη is the N T × N T smoothing matrix defined similar to Remark 3.1. 4.2. Rate of Convergence To derive the rate of convergence for gb, let us adapt the framework of Section 3 to the homogeneous  P setting. Define V (g, ge) = N E (τi g)0 P (τi ge) F T /(N T ) for any g, ge ∈ H. Suppose that V (·, ·) i=1

and h·, ·iH are simultaneously diagonalizable.

1


Assumption A4. There exist eigenfunctions ϕ_ν ∈ H and a nondecreasing positive sequence of eigenvalues ρ_ν such that V(ϕ_ν, ϕ_µ) = δ_{νµ} and ⟨ϕ_ν, ϕ_µ⟩_H = ρ_ν δ_{νµ} for any ν, µ ≥ 1. Furthermore, 1 ∉ H and any function g ∈ H admits a generalized Fourier expansion g = Σ_{ν≥1} V(g, ϕ_ν)ϕ_ν. Both ϕ_ν and ρ_ν are F_1^T-measurable and c_ϕ ≡ sup_{ν≥1}‖ϕ_ν‖_sup = O_P(1), i.e., the ϕ_ν are stochastically uniformly bounded.

Conditions of type A4 are commonly used in the literature to derive rates of convergence for smoothing splines or kernel ridge regression; see Gu and Qiu (1993); Shang and Cheng (2013); Cheng and Shang (2015); Zhao et al. (2016). Classic ways to verify such conditions rely on variational methods; see Weinberger (1974). Nevertheless, Assumption A4 differs from the literature in that the functional V and the eigenpairs (ρ_ν, ϕ_ν) are random. Fortunately, we can still verify Assumption A4 by adapting the classic variational method to this new setting; the exact verification is deferred to Lemma S.2 in the appendix.

Define Σ_∗ = (Z_1^∗, …, Z_T^∗), a (q_1+d)×T matrix analogous to Σ, where Z_t^∗ = (f_{1t}′, (X̄_t^∗)′)′ and X̄_t^∗ = X̄_t − v̄_t for t ∈ [T]. By (2.3) and Assumption A2, Σ_∗ is independent of the variables v_it. Hence, Σ_∗ can be viewed as a "noiseless" analogue of Σ. Below we impose a moment condition on the spectral norms of various matrices.

Assumption A5. There exist constants ζ > 4 and c > 0 such that

E{‖(Σ_∗Σ_∗′/T)^{-1}‖_op^ζ} ≤ c,  E{‖(ΣΣ′/T)^{-1}‖_op^ζ} ≤ c,

and

E{‖Σ_∗Σ_∗′/T‖_op^{2ζ/(ζ−4)}} ≤ c,  E{‖Σ_{t=1}^T f_{2t}f_{2t}′/T‖_op^{2ζ/(ζ−4)}} ≤ c,

where ‖·‖_op represents the operator norm of square matrices.

For any g, g̃ ∈ H, define ⟨g, g̃⟩ = V(g, g̃) + η⟨g, g̃⟩_H. Following Cheng and Shang (2015), (H, ⟨·,·⟩) is an RKHS with reproducing kernel denoted K. For convenience, define X_i = (X_i1, …, X_iT)′ and K_{X_i} = (K_{X_i1}, …, K_{X_iT})′ for i ∈ [N]. Similar to Section 2.2, there exists a positive definite self-adjoint operator W_η : H → H such that ⟨W_η g, g̃⟩ = η⟨g, g̃⟩_H for g, g̃ ∈ H. Then the Fréchet derivatives of ℓ_{M,η}(g) have the expressions

Dℓ_{M,η}(g)Δg = ⟨−(1/(NT))Σ_{i=1}^N (Y_i − ⟨K_{X_i}, g⟩)′P K_{X_i} + W_η g, Δg⟩ ≡ ⟨S_{M,η}(g), Δg⟩,
DS_{M,η}(g)Δg = (1/(NT))Σ_{i=1}^N ⟨K_{X_i}, Δg⟩′P K_{X_i} + W_η Δg,
D²S_{M,η}(g) = 0.

For p, δ > 0, define G(p) = {g ∈ H : ‖g‖_sup ≤ 1, ‖g‖_H ≤ p} and an entropy integral

J(p, δ) = ∫₀^δ ψ₂^{-1}(D(ε, G(p), ‖·‖_sup)) dε + δψ₂^{-1}(D(δ, G(p), ‖·‖_sup)²),

Zhao et al./Panel Data Models with KRR


where recall that $D(\varepsilon, G(p), \|\cdot\|_{\sup})$ is the $\varepsilon$-packing number of $G(p)$ in the $\|\cdot\|_{\sup}$-metric. Define
\[
h = \Big(\sum_{\nu \ge 1} \frac{1}{1 + \eta \rho_\nu}\Big)^{-1}, \qquad b_{N,p} = \sqrt{\log\log(N J(p,1))}\, J(p,1), \qquad p = \big(c_\varphi \sqrt{h^{-1}\eta}\big)^{-1}.
\]

Theorem 4.1. Suppose that Assumptions A2, A4 and A5 are satisfied, and that $b_{N,p} = o_P(N^{1/2}h)$. Then, as $M \to \infty$, $\|\hat g - g_0\| = O_P(r_M)$, where $r_M = (NTh)^{-1/2} + (N h^{1/2})^{-1} + \eta^{1/2}$.

Theorem 4.1 provides a rate of convergence for $\hat g$. As in Theorem 3.1, to yield the optimal rate of convergence (denoted $r_M^\star$), the optimal choice of $h$ (denoted $h^\star$) depends on the kernels (PDK and EDK) and on the relationship between $N$ and $T$. Table 2 summarizes the values of $h^\star$ and $r_M^\star$. Interestingly, when $N \ge T$, $r_M^\star$ depends on $NT$, whereas when $N < T$, $r_M^\star$ only depends on $N$. The latter implies

that, when $N < T$, adding time points will not change the convergence rate. Moreover, it can be verified that the condition $b_{N,p} = o_P(N^{1/2}h)$ in Theorem 4.1 holds when $h \asymp h^\star$.

[Table 2: the optimal $h^\star$ and the corresponding rate $r_M^\star$ in the cases $N \ge T$ and $N < T$; the table entries are not recoverable from this copy.]

Proposition A.2. There exists a constant $C_0 > 0$ such that, as $N \to \infty$,
\[
P\bigg(\max_{i \in [N]} \sup_{\theta \in G_i(p_i)} \frac{\sqrt{T}\,\|Z_{iM}(\theta)\|_i}{\sqrt{T}\, J_i(p_i, \|\theta\|_{i,\sup}) + 1} \ge C_0 \sqrt{\log N + \log\log(T J_i(p_i,1))}\bigg) \to 0,
\]
where
\[
Z_{iM}(\theta) = \frac{1}{\sqrt{T}} \sum_{t=1}^T \big[\psi_{i,M,t}(U_{it};\theta) R_i U_{it} - E\big(\psi_{i,M,t}(U_{it};\theta) R_i U_{it}\big)\big], \qquad \theta \in \Theta_i.
\]
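As noted in the paper's introduction, the continuous regularization parameter is selected automatically by generalized cross validation. A minimal kernel-ridge sketch of GCV selection over a grid (the data, kernel, and grid below are illustrative assumptions, not the paper's design):

```python
# Kernel ridge regression with generalized cross validation (GCV) for the
# regularization level eta.  Illustrative sketch only: the data-generating
# process, Gaussian kernel, and eta grid are assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2*np.pi*x) + 0.3*rng.normal(size=n)
K = np.exp(-(x[:, None] - x[None, :])**2 / 0.05)      # Gaussian kernel Gram

def gcv(eta):
    # smoother matrix A(eta) = K (K + n*eta*I)^{-1}; GCV score
    A = K @ np.linalg.inv(K + n*eta*np.eye(n))
    resid = y - A @ y
    return (resid @ resid / n) / (1 - np.trace(A)/n)**2

etas = 10.0**np.arange(-8, 1)
eta_hat = min(etas, key=gcv)                          # GCV-selected eta
fhat = K @ np.linalg.inv(K + n*eta_hat*np.eye(n)) @ y
```

Because the penalty level enters continuously, no basis truncation parameter is needed; the single scalar `eta_hat` controls model complexity.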

Proofs of Lemmas A.1 and A.2 and Propositions A.1 and A.2 can be found in the supplementary document.

Proof of Theorem 3.1. Since $Y_{it} = g_{i0}(X_{it}) + Z_t'\beta_{i0} + e_{it}$, it follows that
\[
S_{i,M,\eta_i}^\star(\theta_{i0}) = E\{S_{i,M,\eta_i}(\theta_{i0})\} = E\Big\{-\frac{1}{T}\sum_{t=1}^T e_{it} R_i U_{it} + P_i \theta_{i0}\Big\}.
\]

Also $e_{it} = \epsilon_{it} - \gamma_{2i}'(\bar\Gamma_2 \bar\Gamma_2')^{-1}\bar\Gamma_2 \bar v_t = \epsilon_{it} - \Delta_i \bar v_t$, so we have
\begin{align*}
\|S_{i,M,\eta_i}^\star(\theta_{i0})\|_i &= \|E\{(\epsilon_{it} - \Delta_i \bar v_t) R_i U_{it}\} - P_i\theta_{i0}\|_i \\
&\le \|E\{(\epsilon_{it} - \Delta_i \bar v_t) R_i U_{it}\}\|_i + \|P_i \theta_{i0}\|_i \\
&= \sup_{\|\theta\|_i = 1} |\langle E\{(\epsilon_{it} - \Delta_i \bar v_t) R_i U_{it}\}, \theta\rangle_i| + \|P_i\theta_{i0}\|_i \\
&= \sup_{\|\theta\|_i = 1} |E\{(\epsilon_{it} - \Delta_i \bar v_t)(g(X_{it}) + Z_t'\beta)\}| + \|P_i\theta_{i0}\|_i \\
&= \sup_{\|\theta\|_i = 1} |E\{\Delta_i \bar v_t (g(X_{it}) + Z_t'\beta)\}| + \|P_i\theta_{i0}\|_i.
\end{align*}

Since
\[
|g(X_{it}) + Z_t'\beta| \le (1 + \|Z_t\|_2)\|\theta\|_{i,\sup} \le C_i (1 + \|Z_t\|_2)(1 + h_i^{-1/2})\|\theta\|_i
\]
and
\[
E\{|\Delta_i \bar v_t|\}\, h_i^{-1/2} \le E\{(\Delta_i \bar v_t)^2\}^{1/2} h_i^{-1/2} = O\big((N h_i)^{-1/2}\big),
\]
there exists a constant $C'$ such that
\[
\sup_{\|\theta\|_i = 1} |E\{\Delta_i \bar v_t (g(X_{it}) + Z_t'\beta)\}| \le \frac{C'}{(N h_i)^{1/2}}. \tag{A.3}
\]

In the meantime, we have
\[
\|P_i \theta_{i0}\|_i = \sup_{\|\theta\|_i = 1} |\langle P_i\theta_{i0}, \theta\rangle_i| = \sup_{\|\theta\|_i = 1} |\eta_i \langle g_{i0}, g\rangle_{\mathcal{H}_i}| \le \sqrt{\eta_i}\, \|g_{i0}\|_{\mathcal{H}_i}, \qquad i \in [N].
\]

Consider the operator
\[
T_{1i}(\theta) = \theta - S_{i,M,\eta_i}^\star(\theta + \theta_{i0}), \qquad \theta \in \Theta_i.
\]
By Lemma A.1, we have for any $\theta \in \Theta_i$,
\[
T_{1i}(\theta) = \theta - DS_{i,M,\eta_i}^\star(\theta_{i0})\theta - S_{i,M,\eta_i}^\star(\theta_{i0}) = -S_{i,M,\eta_i}^\star(\theta_{i0}). \tag{A.4}
\]


Since $T_{1i}$ takes a constant value, by (A.3) and (A.4), $T_{1i}$ is a contraction mapping from $B_i\big(\sqrt{\eta_i}\|g_{i0}\|_{\mathcal{H}_i} + \frac{C'}{(N h_i)^{1/2}}\big)$ to itself, where $B_i(r)$ denotes the $r$-ball in $(\Theta_i, \|\cdot\|_i)$. By the contraction mapping theorem, there exists a unique fixed point $\theta_0 \in B_i\big(\sqrt{\eta_i}\|g_{i0}\|_{\mathcal{H}_i} + \frac{C'}{(N h_i)^{1/2}}\big)$ such that $T_{1i}(\theta_0) = \theta_0$. Let $\theta_{\eta_i} = \theta_0 + \theta_{i0}$; then $S_{i,M,\eta_i}^\star(\theta_{\eta_i}) = 0$ and, obviously, $\|\theta_{\eta_i} - \theta_{i0}\|_i \le \sqrt{\eta_i}\|g_{i0}\|_{\mathcal{H}_i} + \frac{C'}{(N h_i)^{1/2}}$.
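The contraction mapping (Banach fixed point) theorem invoked here, and repeatedly below for $T_{2i}$, can be illustrated with a minimal numerical sketch; the map and tolerances are illustrative and unrelated to the operators in the proof:

```python
# Banach fixed point theorem in its simplest numerical form: iterating a
# contraction from any starting point converges to its unique fixed point.
# Illustrative only; not the operators T_1i, T_2i of the proof.
import math

def fixed_point(T, x0, tol=1e-12, max_iter=10_000):
    x = x0
    for _ in range(max_iter):
        x_new = T(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("no convergence")

# cos is a contraction on [0, 1] since |cos'(x)| = |sin(x)| <= sin(1) < 1 there
star = fixed_point(math.cos, 0.0)
print(star)   # the Dottie number, approx 0.7390851
```

Existence plus uniqueness of the fixed point is exactly what the proof uses to define $\theta_{\eta_i}$ (and later $\hat\theta_i$) as the zero of the relevant score map.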

We fix an $i \in [N]$ and let both $T$ and $N$ approach infinity. Let $E_M = \{\max_{1 \le t \le T} \|Z_t\|_2 \le \tilde C T^{1/\alpha}\}$. Proposition A.1 says that when $\tilde C$ is large, $E_M$ has probability approaching one. Write $E_{M,t} = \{\|Z_t\|_2 \le \tilde C T^{1/\alpha}\}$, so that $E_M = \cap_{t=1}^T E_{M,t}$. By Lemma A.2, on $E_{M,t}$ we have $\|R_i U_{it}\|_i \le C_i(h_i^{-1/2} + \tilde C T^{1/\alpha})$.

Consider another operator $T_{2i}(\theta) = \theta - S_{i,M,\eta_i}(\theta_{\eta_i} + \theta)$, $\theta \in \Theta_i$. For $i \in [N]$, $t \in [T]$, define
\[
\psi_{i,M,t}(U_{it};\theta) = \frac{\langle R_i U_{it}, \theta\rangle_i\, I_{E_{M,t}}}{C_i \tilde C T^{1/\alpha}(h_i^{-1/2} + \tilde C T^{1/\alpha})}, \qquad \theta \in \Theta_i.
\]
It is easy to see that on $E_M$, for any $\theta_1 = (\beta_1, g_1), \theta_2 = (\beta_2, g_2) \in \Theta_i$, by Proposition 2.1,
\begin{align*}
\|(\psi_{i,M,t}(U_{it};\theta_1) - \psi_{i,M,t}(U_{it};\theta_2)) R_i U_{it}\|_i
&= \frac{|\langle R_i U_{it}, \theta_1 - \theta_2\rangle_i| \times \|R_i U_{it}\|_i}{C_i \tilde C T^{1/\alpha}(h_i^{-1/2} + \tilde C T^{1/\alpha})}\, I_{E_{M,t}} \\
&= \frac{|(g_1 - g_2)(X_{it}) + Z_t'(\beta_1 - \beta_2)| \times \|R_i U_{it}\|_i}{C_i \tilde C T^{1/\alpha}(h_i^{-1/2} + \tilde C T^{1/\alpha})}\, I_{E_{M,t}} \\
&\le \frac{\|\theta_1 - \theta_2\|_{i,\sup}\, \tilde C T^{1/\alpha}\, C_i (h_i^{-1/2} + \tilde C T^{1/\alpha})}{C_i \tilde C T^{1/\alpha}(h_i^{-1/2} + \tilde C T^{1/\alpha})}\, I_{E_{M,t}}
\le \|\theta_1 - \theta_2\|_{i,\sup}.
\end{align*}

Notice the decomposition
\[
T_{2i}(\theta) = \theta - S_{i,M,\eta_i}(\theta + \theta_{\eta_i}) + S_{i,M,\eta_i}(\theta_{\eta_i}) - S_{i,M,\eta_i}(\theta_{\eta_i}) = \theta - DS_{i,M,\eta_i}(\theta_{\eta_i})\theta - S_{i,M,\eta_i}(\theta_{\eta_i}).
\]
We first examine $S_{i,M,\eta_i}(\theta_{\eta_i})$:
\begin{align}
S_{i,M,\eta_i}(\theta_{\eta_i}) &= S_{i,M,\eta_i}(\theta_{\eta_i}) - E(S_{i,M,\eta_i}(\theta_{\eta_i})) \nonumber\\
&= -\frac{1}{T}\sum_{t=1}^T \big[(Y_{it} - \langle R_i U_{it}, \theta_{\eta_i}\rangle_i) R_i U_{it} - E\big((Y_{it} - \langle R_i U_{it}, \theta_{\eta_i}\rangle_i) R_i U_{it}\big)\big] \nonumber\\
&= -\frac{1}{T}\sum_{t=1}^T [e_{it} R_i U_{it} - E(e_{it} R_i U_{it})] \nonumber\\
&\quad + \frac{1}{T}\sum_{t=1}^T \big[\langle R_i U_{it}, \theta_{\eta_i} - \theta_{i0}\rangle_i R_i U_{it} - E\big(\langle R_i U_{it}, \theta_{\eta_i} - \theta_{i0}\rangle_i R_i U_{it}\big)\big]. \tag{A.5}
\end{align}


Define $\xi_{it} = e_{it} R_i U_{it}$. Following Dehling (1983, eqn. (3.2)) and Bradley (2005, eqn. (1.11)),
\begin{align*}
E\Big\|\sum_{t=1}^T [e_{it} R_i U_{it} - E(e_{it} R_i U_{it})]\Big\|_i^2
&= E\Big\|\sum_{t=1}^T [\xi_{it} - E(\xi_{it})]\Big\|_i^2 \\
&= \sum_{t,t'=1}^T \big[E(\langle \xi_{it}, \xi_{it'}\rangle_i) - \langle E(\xi_{it}), E(\xi_{it'})\rangle_i\big] \\
&\le \sum_{t,t'=1}^T 15\big(\phi(|t-t'|)/2\big)^{1-4/\alpha} E\big(\|\xi_{it}\|_i^{\alpha/2}\big)^{4/\alpha}.
\end{align*}

It follows from Assumption A2 (a), (c), (d) and Lemma A.2 that
\[
E\big(\|\xi_{it}\|_i^{\alpha/2}\big)^2 = E\big(|e_{it}|^{\alpha/2}\|R_i U_{it}\|_i^{\alpha/2}\big)^2 \le E(|e_{it}|^{\alpha})\, E(\|R_i U_{it}\|_i^{\alpha}) \le c_0 h_i^{-\alpha/2},
\]
where $c_0$ is an absolute constant; the existence of such a $c_0$ is due to the facts $E(|e_{it}|^{\alpha}) < \infty$ and $E(\|Z_t\|_2^{\alpha}) < \infty$. Therefore, it follows from Assumption A2 (b) that there exists an absolute constant $c_1$ such that
\[
E\Big\|\sum_{t=1}^T [e_{it} R_i U_{it} - E(e_{it} R_i U_{it})]\Big\|_i^2 \le c_1 T h_i^{-1}.
\]

Similarly, it can be shown that
\begin{align*}
E\Big\|\sum_{t=1}^T \big[\langle R_i U_{it}, \theta_{\eta_i} - \theta_{i0}\rangle_i R_i U_{it} - E\big(\langle R_i U_{it}, \theta_{\eta_i} - \theta_{i0}\rangle_i R_i U_{it}\big)\big]\Big\|_i^2
&\le \sum_{t,t'=1}^T 15\big(\phi(|t-t'|)/2\big)^{1-4/\alpha} E\big(\|R_i U_{it}\|_i^{\alpha}\big)^{4/\alpha} \|\theta_{\eta_i} - \theta_{i0}\|_i^2 \\
&\le c_1' T h_i^{-1},
\end{align*}
where $c_1'$ is an absolute constant. The last step follows from Proposition A.2, i.e., $E(\|R_i U_{it}\|_i^{\alpha}) = O(h_i^{-\alpha/2})$, the fact $\|\theta_{\eta_i} - \theta_{i0}\|_i^2 = O\big(\eta_i + \frac{1}{N h_i}\big)$, and the condition $\eta_i + \frac{1}{N h_i} = O(h_i)$.

θ1 − θ 2 −1/2

Ci (1 + hi

)kθ1 − θ2 ki

.

Zhao et al./Panel Data Models with KRR

24

Write θ = (β, g). Hence, by Lemma A.2 (A.2), kθki,sup ≤ 1, ηi kgk2Hi

≤ kθk2i =

kθ1 − θ2 k2i −1/2 2 ) kθ

Ci2 (1 + hi

2 1 − θ2 k i

≤ Ci−2 hi .

This means that θ ∈ Gi (pi ) with pi = Ci−1 (ηi−1 hi )1/2 . Since ηi−1 hi tends to infinity as (N, T ) does, it is not of loss of generality to assume that pi ≥ 1. Define T 1 X [ψi,M,t (Uit ; θ)Ri Uit − E(ψi,M,t (Uit ; θ)Ri Uit )]. ZiM (θ) = √ T t=1

It follows from (A.5) and Proposition A.2 that, with probability approaching one,
\[
\sup_{\theta \in G_i(p_i)} \frac{\sqrt{T}\,\|Z_{iM}(\theta)\|_i}{\sqrt{T}\, J_i(p_i, \|\theta\|_{i,\sup}) + 1} \le C_0 \sqrt{\log N + \log\log(T J_i(p_i, 1))}. \tag{A.6}
\]
Since $h_i = o(1)$, assume that $h_i^{-1} \ge 1$. It follows from Lemma A.2 (A.1) that
\begin{align}
\big\|E\big[\langle R_i U_{it}, \theta\rangle_i I_{E_{M,t}^c} R_i U_{it}\big]\big\|_i
&\le E\big[|\langle R_i U_{it}, \theta\rangle_i|\, I_{E_{M,t}^c}\, \|R_i U_{it}\|_i\big] \nonumber\\
&\le E\big[(1 + \|Z_t\|_2)\, I_{E_{M,t}^c}\, C_i (h_i^{-1/2} + \|Z_t\|_2)\big] \nonumber\\
&\le C_i h_i^{-1/2} E\big[(1 + \|Z_t\|_2)^2 I_{E_{M,t}^c}\big] \nonumber\\
&\le C_i h_i^{-1/2} E\big((1 + \|Z_t\|_2)^{\alpha}\big)^{2/\alpha} P(E_{M,t}^c)^{1-2/\alpha} \nonumber\\
&\le C_i h_i^{-1/2} E\big((1 + \|Z_t\|_2)^{\alpha}\big)^{2/\alpha} \Big(\frac{1}{\tilde C^{\alpha} T} E(\|Z_t\|_2^{\alpha})\Big)^{1-2/\alpha}. \tag{A.7}
\end{align}

Consequently, with probability approaching one, it follows from (A.6) and (A.7) that, on $E_M$, for any unequal $\theta_1, \theta_2 \in \Theta_i$,
\begin{align}
&\|T_{2i}(\theta_1) - T_{2i}(\theta_2)\|_i \nonumber\\
&= \Big\|\frac{1}{T}\sum_{t=1}^T \big[\langle R_i U_{it}, \theta_1 - \theta_2\rangle_i R_i U_{it} - E\big(\langle R_i U_{it}, \theta_1 - \theta_2\rangle_i R_i U_{it}\big)\big]\Big\|_i \nonumber\\
&= \Big\|\frac{1}{T}\sum_{t=1}^T \big[\langle R_i U_{it}, \theta\rangle_i R_i U_{it} - E\big(\langle R_i U_{it}, \theta\rangle_i R_i U_{it}\big)\big]\Big\|_i \times \|\theta_1 - \theta_2\|_i\, C_i (1 + h_i^{-1/2}) \nonumber\\
&= \|\theta_1 - \theta_2\|_i\, C_i (1 + h_i^{-1/2}) \Big\|\frac{1}{T}\sum_{t=1}^T \big[\langle R_i U_{it}, \theta\rangle_i I_{E_{M,t}} R_i U_{it} - E\big(\langle R_i U_{it}, \theta\rangle_i I_{E_{M,t}} R_i U_{it}\big)\big] - E\big(\langle R_i U_{it}, \theta\rangle_i I_{E_{M,t}^c} R_i U_{it}\big)\Big\|_i \nonumber\\
&= \|\theta_1 - \theta_2\|_i\, C_i (1 + h_i^{-1/2}) \Big\|\sqrt{T}\, C_i \tilde C T^{1/\alpha - 1}(h_i^{-1/2} + \tilde C T^{1/\alpha})\, Z_{iM}(\theta) - E\big(\langle R_i U_{it}, \theta\rangle_i I_{E_{M,t}^c} R_i U_{it}\big)\Big\|_i \nonumber\\
&\le \|\theta_1 - \theta_2\|_i\, C_i (1 + h_i^{-1/2}) \bigg[C_0 C_i \tilde C T^{1/\alpha - 1}(h_i^{-1/2} + \tilde C T^{1/\alpha})\big(\sqrt{T}\, J_i(p_i, 1) + 1\big)\sqrt{\log N + \log\log(T J_i(p_i,1))} \nonumber\\
&\qquad\qquad + C_i h_i^{-1/2} E\big((1 + \|Z_t\|_2)^{\alpha}\big)^{2/\alpha}\Big(\frac{1}{\tilde C^{\alpha} T} E(\|Z_t\|_2^{\alpha})\Big)^{1-2/\alpha}\bigg] \nonumber\\
&\le c_3 \|\theta_1 - \theta_2\|_i, \tag{A.8}
\end{align}
where $c_3$ is a constant in $(0, 1/2)$; note that (A.8) also holds trivially for $\theta_1 = \theta_2$. The existence of such a $c_3$ follows from the condition $b_{N,p} = o_P(\sqrt{N} h)$. In particular, letting $\theta_2 = 0$, one gets that for any $\theta_1 \in B(2c_2(T h_i)^{-1/2})$,
\[
\|T_{2i}(\theta_1)\|_i \le \|T_{2i}(\theta_1) - T_{2i}(0)\|_i + \|T_{2i}(0)\|_i \le c_3 \|\theta_1\|_i + \|S_{i,M,\eta_i}(\theta_{\eta_i})\|_i \le 2 c_2 c_3 (T h_i)^{-1/2} + c_2 (T h_i)^{-1/2} < 2 c_2 (T h_i)^{-1/2}.
\]
This implies that, with probability approaching one, $T_{2i}$ is a contraction mapping from $B(2c_2(T h_i)^{-1/2})$ to itself. By the contraction mapping theorem, there exists a unique $\theta'' \in B(2c_2(T h_i)^{-1/2})$ such that $T_{2i}(\theta'') = \theta''$, implying that $S_{i,M,\eta_i}(\theta_{\eta_i} + \theta'') = 0$. Thus $\hat\theta_i = \theta_{\eta_i} + \theta''$ is the penalized MLE of $\ell_{i,M,\eta_i}$. This further shows that $\|\hat\theta_i - \theta_{\eta_i}\|_i \le 2 c_2 (T h_i)^{-1/2}$. Combined with $\|\theta_{\eta_i} - \theta_{i0}\|_i = O\big(\eta_i^{1/2} + (N h_i)^{-1/2}\big)$, we have $\|\hat\theta_i - \theta_{i0}\|_i = O_P\big((T h_i)^{-1/2} + \eta_i^{1/2} + (N h_i)^{-1/2}\big)$.

A.2. Proofs in Section 3.3

In this section, we prove Theorems 3.2, 3.3 and 3.4, and Corollary 3.6.

Proof of Theorem 3.2. Define
\[
S_{i,M}(\theta) \equiv -\frac{1}{T}\sum_{t=1}^T (Y_{it} - \langle R_i U_{it}, \theta\rangle_i) R_i U_{it}
\quad\text{and}\quad
S_i(\theta) \equiv E\{S_{i,M}(\theta)\} = E\Big\{-\frac{1}{T}\sum_{t=1}^T (Y_{it} - \langle R_i U_{it}, \theta\rangle_i) R_i U_{it}\Big\}.
\]
Recall $S_{i,M,\eta_i}(\theta) = S_{i,M}(\theta) + P_i\theta$ and $S_{i,M,\eta_i}^\star(\theta) = S_i(\theta) + P_i\theta$. Denote $\theta_i = \hat\theta_i - \theta_{i0}$. Since $S_{i,M,\eta_i}(\hat\theta_i) = 0$, we have $S_{i,M,\eta_i}(\theta_i + \theta_{i0}) = 0$. Therefore,
\begin{align}
&\|S_{i,M}(\theta_i + \theta_{i0}) - S_i(\theta_i + \theta_{i0}) - (S_{i,M}(\theta_{i0}) - S_i(\theta_{i0}))\|_i \nonumber\\
&= \|S_{i,M,\eta_i}(\theta_i + \theta_{i0}) - S_{i,M,\eta_i}^\star(\theta_i + \theta_{i0}) - (S_{i,M,\eta_i}(\theta_{i0}) - S_{i,M,\eta_i}^\star(\theta_{i0}))\|_i \nonumber\\
&= \|S_{i,M,\eta_i}^\star(\theta_i + \theta_{i0}) + S_{i,M,\eta_i}(\theta_{i0}) - S_{i,M,\eta_i}^\star(\theta_{i0})\|_i \nonumber\\
&= \|DS_{i,M,\eta_i}^\star(\theta_{i0})\theta_i + S_{i,M,\eta_i}(\theta_{i0})\|_i \nonumber\\
&= \|\theta_i + S_{i,M,\eta_i}(\theta_{i0})\|_i. \tag{A.9}
\end{align}

Consider the event $B_{i,M} = \{\|\theta_i\|_i \le r_{i,M} \equiv C_B((T h_i)^{-1/2} + \eta_i^{1/2} + (N h_i)^{-1/2})\}$. For $C_B$ large enough, $B_{i,M}$ has probability approaching one. Let $d_{i,M} = C_i r_{i,M} (1 + h_i^{-1/2})$, where $C_i$ is defined in Lemma A.2; we have $d_{i,M} = o(1)$. For any $\theta \in \Theta_i$, further define $\bar\theta = (\bar\beta, \bar g) = d_{i,M}^{-1}\theta/2$, where $\bar\beta = d_{i,M}^{-1}\beta/2$ and $\bar g = d_{i,M}^{-1} g/2$. Then, on the event $B_{i,M}$, we have
\[
\|\bar\theta\|_{i,\sup} \le C_i (1 + h_i^{-1/2})\|\bar\theta\|_i = C_i (1 + h_i^{-1/2})\, d_{i,M}^{-1}\|\theta\|_i/2 \le \tfrac12.
\]
Meanwhile,
\[
\|\bar g\|_{\mathcal{H}_i}^2 = \frac{d_{i,M}^{-2}}{4}\,\eta_i^{-1}\big(\eta_i \|g\|_{\mathcal{H}_i}^2\big) \le \frac{d_{i,M}^{-2}}{4}\,\eta_i^{-1}\|\theta\|_i^2 \le \frac{d_{i,M}^{-2}}{4}\,\eta_i^{-1} r_{i,M}^2 \le C_i^{-2} h_i \eta_i^{-1}.
\]
Let $p_i = C_i^{-1}(h_i \eta_i^{-1})^{1/2}$. Then $\|\bar g\|_{\mathcal{H}_i} \le p_i$, and therefore $\bar\theta \in G_i(p_i)$. Since $h_i \eta_i^{-1} \to \infty$ as $(N,T) \to \infty$, $p_i > 1$ in general. Recall $E_{M,t} = \{\|Z_t\|_2 \le \tilde C T^{1/\alpha}\}$, as defined in the proof of Theorem 3.1. Let
\[
\psi_{i,M,t}^d(U_{it};\bar\theta) = \frac{\langle R_i U_{it}, \theta\rangle_i\, I_{E_{M,t}}}{2 d_{i,M} C_i \tilde C T^{1/\alpha}(h_i^{-1/2} + \tilde C T^{1/\alpha})}, \qquad \theta \in \Theta_i.
\]

Following the proof of Theorem 3.1, on $E_M$, for any $\theta_1 = (\beta_1, g_1), \theta_2 = (\beta_2, g_2) \in \Theta_i$, we have
\[
\big\|\big(\psi_{i,M,t}^d(U_{it};\bar\theta_1) - \psi_{i,M,t}^d(U_{it};\bar\theta_2)\big) R_i U_{it}\big\|_i \le \|\bar\theta_1 - \bar\theta_2\|_{i,\sup}. \tag{A.10}
\]
Define
\[
Z_{iM}^d(\bar\theta) = \frac{1}{\sqrt{T}}\sum_{t=1}^T \big[\psi_{i,M,t}^d(U_{it};\bar\theta) R_i U_{it} - E\big(\psi_{i,M,t}^d(U_{it};\bar\theta) R_i U_{it}\big)\big].
\]
It follows from Proposition A.2 that, with probability approaching one,
\[
\sup_{\bar\theta \in G_i(p_i)} \frac{\sqrt{T}\,\|Z_{iM}^d(\bar\theta)\|_i}{\sqrt{T}\, J_i(p_i, \|\bar\theta\|_{i,\sup}) + 1} \le C_0 \sqrt{\log N + \log\log(T J_i(p_i,1))}. \tag{A.11}
\]


Therefore,
\begin{align}
&\|\theta_i + S_{i,M,\eta_i}(\theta_{i0})\|_i = \|S_{i,M}(\theta_i + \theta_{i0}) - S_i(\theta_i + \theta_{i0}) - (S_{i,M}(\theta_{i0}) - S_i(\theta_{i0}))\|_i \nonumber\\
&= \Big\|\frac{1}{T}\sum_{t=1}^T \big[\langle R_i U_{it}, \theta_i\rangle_i R_i U_{it} - E\{\langle R_i U_{it}, \theta_i\rangle_i R_i U_{it}\}\big]\Big\|_i \nonumber\\
&= \Big\|\frac{1}{T}\sum_{t=1}^T \big[\langle R_i U_{it}, \theta_i\rangle_i I_{E_{M,t}} R_i U_{it} - E\big(\langle R_i U_{it}, \theta_i\rangle_i I_{E_{M,t}} R_i U_{it}\big)\big] - E\big(\langle R_i U_{it}, \theta_i\rangle_i I_{E_{M,t}^c} R_i U_{it}\big)\Big\|_i \nonumber\\
&= \Big\|\sqrt{T}\, 2 d_{i,M} C_i \tilde C T^{1/\alpha - 1}(h_i^{-1/2} + \tilde C T^{1/\alpha})\, Z_{iM}^d(\bar\theta) - E\big(\langle R_i U_{it}, \theta_i\rangle_i I_{E_{M,t}^c} R_i U_{it}\big)\Big\|_i \nonumber\\
&\le 2 d_{i,M} C_0 C_i \tilde C T^{1/\alpha - 1}(h_i^{-1/2} + \tilde C T^{1/\alpha})\big(\sqrt{T}\, J_i(p_i, \|\bar\theta\|_{i,\sup}) + 1\big)\sqrt{\log N + \log\log(T J_i(p_i,1))} \nonumber\\
&\quad + C_i h_i^{-1/2} E\big((1 + \|Z_t\|_2)^{\alpha}\big)^{2/\alpha}\Big(\frac{1}{\tilde C^{\alpha} T} E(\|Z_t\|_2^{\alpha})\Big)^{1-2/\alpha}. \tag{A.12}
\end{align}

Proof of Theorem 3.3. Define $\hat\theta_i^h = (\hat\beta_i, h_i^{1/2}\hat g_i)$, $\theta_{i0}^{\star h} = (\beta_{i0}^{\star}, h_i^{1/2} g_{i0}^{\star})$ and $R_i^h u = (H_u^{(i)}, h_i^{1/2} T_u^{(i)})$, where $\theta_{i0}^{\star} = (\mathrm{id} - P_i)\theta_{i0}$. From Theorem 3.2, we have
\[
\|\hat\theta_i - \theta_{i0} + S_{i,M,\eta_i}(\theta_{i0})\|_i = O_P(a_M). \tag{A.13}
\]
Since
\[
S_{i,M,\eta_i}(\theta_{i0}) = -\frac{1}{T}\sum_{t=1}^T (Y_{it} - \langle R_i U_{it}, \theta_{i0}\rangle_i) R_i U_{it} + P_i\theta_{i0} = -\frac{1}{T}\sum_{t=1}^T e_{it} R_i U_{it} + P_i\theta_{i0},
\]
Theorem 3.2 can be rewritten as
\[
\Big\|\hat\theta_i - \theta_{i0}^{\star} - \frac{1}{T}\sum_{t=1}^T e_{it} R_i U_{it}\Big\|_i = O_P(a_M). \tag{A.14}
\]
This implies $\big\|\hat\beta_i - \beta_{i0}^{\star} - \frac{1}{T}\sum_{t=1}^T e_{it} H_{it}^{(i)}\big\|_2 = O_P(a_M)$. Further, define $\mathrm{Rem} = \hat\theta_i - \theta_{i0}^{\star} - \frac{1}{T}\sum_{t=1}^T e_{it} R_i U_{it}$ and $\mathrm{Rem}^h = \hat\theta_i^h - \theta_{i0}^{\star h} - \frac{1}{T}\sum_{t=1}^T e_{it} R_i^h U_{it}$. Then
\begin{align*}
\|\mathrm{Rem}^h - h_i^{1/2}\mathrm{Rem}\|_i &= \Big\|\Big((1 - h_i^{1/2})\Big(\hat\beta_i - \beta_{i0}^{\star} - \frac{1}{T}\sum_{t=1}^T e_{it} H_{U_{it}}^{(i)}\Big),\ 0\Big)\Big\|_i \\
&\le (1 - h_i^{1/2})\, O\Big(\Big\|\hat\beta_i - \beta_{i0}^{\star} - \frac{1}{T}\sum_{t=1}^T e_{it} H_{U_{it}}^{(i)}\Big\|_2\Big) = O_P(a_M).
\end{align*}
Therefore, $\|\mathrm{Rem}^h\|_i \le \|\mathrm{Rem}^h - h_i^{1/2}\mathrm{Rem}\|_i + \|h_i^{1/2}\mathrm{Rem}\|_i = O_P(a_M)$.

The idea is to employ the Cramér–Wold device. For any $z$, we will obtain the limiting distribution of $T^{1/2} z'(\hat\beta_i - \beta_{i0}^{\star}) + (T h_i)^{1/2}(\hat g_i(x_0) - g_{i0}^{\star}(x_0))$, which equals $T^{1/2}\langle R_i^h u, \hat\theta_i^h - \theta_{i0}^{\star h}\rangle_i$ by Proposition 2.1, where $u = (x_0, z)$.


Since $T^{1/2} h_i^{-1/2} a_M = o(1)$, we have
\[
\Big|T^{1/2}\Big\langle R_i^h u,\ \hat\theta_i^h - \theta_{i0}^{\star h} - \frac{1}{T}\sum_{t=1}^T e_{it} R_i^h U_{it}\Big\rangle_i\Big| \le T^{1/2}\|R_i^h u\|_i \|\mathrm{Rem}^h\|_i = O_P(T^{1/2} h_i^{-1/2} a_M) = o_P(1).
\]
Then, to find the limiting distribution of $T^{1/2}\langle R_i^h u, \hat\theta_i^h - \theta_{i0}^{\star h}\rangle_i$, we only need the limiting distribution of
\[
T^{1/2}\Big\langle R_i^h u, \frac{1}{T}\sum_{t=1}^T e_{it} R_i^h U_{it}\Big\rangle_i = T^{-1/2}\sum_{t=1}^T e_{it}\big(z' H_{U_{it}}^{(i)} + h_i^{1/2} T_{U_{it}}^{(i)}(x_0)\big),
\]
to which we will apply a CLT. Define $L(U_{it}) = z' H_{U_{it}}^{(i)} + h_i^{1/2} T_{U_{it}}^{(i)}(x_0)$. Since $\epsilon_{it}$ and $v_{it}$ are i.i.d. across $t$, we have
\begin{align*}
s_T^2 &= \mathrm{Var}\Big(\sum_{t=1}^T e_{it}\big(z' H_{U_{it}}^{(i)} + h_i^{1/2} T_{U_{it}}^{(i)}(x_0)\big)\Big) \\
&= T\cdot \mathrm{Var}(e_{it} L(U_{it})) + \sum_{t_1 \ne t_2} \mathrm{Cov}(e_{it_1} L(U_{it_1}), e_{it_2} L(U_{it_2})) \\
&= T\cdot E\{e_{it}^2 L(U_{it})^2\} - T\cdot E\{e_{it} L(U_{it})\}^2 + \sum_{t_1 \ne t_2} \mathrm{Cov}(e_{it_1} L(U_{it_1}), e_{it_2} L(U_{it_2})).
\end{align*}

For the first term,
\[
E\{e_{it}^2 L(U_{it})^2\} = E\{(\epsilon_{it} - \Delta_i\bar v_t)^2 L(U_{it})^2\} = \sigma_\epsilon^2\, E\{L(U_{it})^2\} + E\{(\Delta_i\bar v_t)^2 L(U_{it})^2\}.
\]
From the Cauchy–Schwarz and Hölder inequalities, we can show that
\[
E\{(\Delta_i\bar v_t)^2 L(U_{it})^2\} \le \|\Delta_i\|_2^2\, E\{\|\bar v_t\|_2^2 L(U_{it})^2\} \le \|\Delta_i\|_2^2\, \big(E\{\|\bar v_t\|_2^{\alpha}\}\big)^{2/\alpha}\Big(E\big\{|L(U_{it})|^{\frac{2\alpha}{\alpha-2}}\big\}\Big)^{\frac{\alpha-2}{\alpha}}.
\]
We next bound $|L(U_{it})|$:
\begin{align}
L(U_{it}) &= z' H_{U_{it}}^{(i)} + h_i^{1/2} T_{U_{it}}^{(i)}(x_0) \nonumber\\
&= z'(\Omega_i + \Sigma_i)^{-1} Z_t - z'(\Omega_i + \Sigma_i)^{-1} A_i(X_{it}) + h_i^{1/2} K_{X_{it}}^{(i)}(x_0) \nonumber\\
&\quad - h_i^{1/2} A_i'(x_0)(\Omega_i + \Sigma_i)^{-1} Z_t + h_i^{1/2} A_i'(x_0)(\Omega_i + \Sigma_i)^{-1} A_i(X_{it}). \tag{A.15}
\end{align}


By the proof of Lemma A.2, we have
\begin{align*}
|z'(\Omega_i + \Sigma_i)^{-1} Z_t| &\le c_1^{-1}|z' Z_t| \le c_{1z}\|Z_t\|_2, \\
|z'(\Omega_i + \Sigma_i)^{-1} A_i(X_{it})| &= |z'(\Omega_i + \Sigma_i)^{-1/2}(\Omega_i + \Sigma_i)^{-1/2} A_i(X_{it})| \\
&\le \sqrt{z'(\Omega_i + \Sigma_i)^{-1} z}\ \sqrt{A_i'(X_{it})(\Omega_i + \Sigma_i)^{-1} A_i(X_{it})} \le c_{2z} h_i^{-1/2}, \\
|A_i'(x_0)(\Omega_i + \Sigma_i)^{-1} Z_t| &\le c_{3z} h_i^{-1/2}\|Z_t\|_2, \\
|K_{X_{it}}^{(i)}(x_0)| &\le C_{\varphi,i}^2 h_i^{-1}, \\
|A_i'(x_0)(\Omega_i + \Sigma_i)^{-1} A_i(X_{it})| &\le c_1^{-1} C_{\varphi,i}^2 C_{G_i}^2 h_i^{-1},
\end{align*}
where $c_{1z} = c_1^{-1}\|z\|_2$, $c_{2z} = c_1^{-1} C_{\varphi,i} C_{G_i}\|z\|_2$, and $c_{3z} = c_1^{-1} C_{\varphi,i} C_{G_i}$. Combining these inequalities, we obtain
\[
|L(U_{it})| \le c_{4z}\|Z_t\|_2 + c_{5z} h_i^{-1/2},
\]
where $c_{4z} = c_{1z} + c_{3z}$ and $c_{5z} = c_{2z} + C_{\varphi,i}^2 + c_1^{-1} C_{\varphi,i}^2 C_{G_i}^2$. Therefore, there exists a constant $c_{6z}$ such that
\[
E\big\{|L(U_{it})|^{\frac{2\alpha}{\alpha-2}}\big\} = E\Big\{\big(c_{4z}\|Z_t\|_2 + c_{5z} h_i^{-1/2}\big)^{\frac{2\alpha}{\alpha-2}}\Big\} \le c_{6z}\Big(E\big\{\|Z_t\|_2^{\frac{2\alpha}{\alpha-2}}\big\} + h_i^{-\frac{\alpha}{\alpha-2}}\Big).
\]
Since $Z_t = (f_{1t}', \bar X_t')'$ and $\bar X_t = \bar\Gamma_1' f_{1t} + \bar\Gamma_2' f_{2t} + \bar v_t$, we have
\[
E\{\|Z_t\|_2^{\alpha}\} = E\big\{(\|f_{1t}\|_2^2 + \|\bar X_t\|_2^2)^{\alpha/2}\big\} \le c_7\big(E\{\|f_{1t}\|_2^{\alpha}\} + E\{\|\bar X_t\|_2^{\alpha}\}\big) \le c_8\big(E\{\|f_{1t}\|_2^{\alpha}\} + E\{\|f_{2t}\|_2^{\alpha}\} + E\{\|\bar v_t\|_2^{\alpha}\}\big),
\]
where $c_7$ and $c_8$ are constants. By Assumption A2, $E\{\|f_{1t}\|_2^{\alpha}\}$, $E\{\|f_{2t}\|_2^{\alpha}\}$, and $E\{|v_{it}|^{\alpha}\}$ are


finite. Using the Marcinkiewicz–Zygmund inequality in Shao (2003), we can show that
\begin{align*}
E\{\|\bar v_t\|_2^{\alpha}\} &= \frac{1}{N^{\alpha}}\, E\Bigg\{\bigg(\sum_{l=1}^d \Big(\sum_{i=1}^N v_{ilt}\Big)^2\bigg)^{\alpha/2}\Bigg\}
\le c_9 \frac{1}{N^{\alpha}}\sum_{l=1}^d E\Bigg\{\bigg|\sum_{i=1}^N v_{ilt}\bigg|^{\alpha}\Bigg\} \\
&\le c_{10}\, \frac{1}{N^{\alpha}}\, d\, N^{\alpha/2-1}\sum_{i=1}^N E\{|v_{ilt}|^{\alpha}\} = O_P\Big(\frac{1}{N^{\alpha/2}}\Big),
\end{align*}
where $c_9$ and $c_{10}$ are constants. As $N \to \infty$, $E\{\|Z_t\|_2^{\alpha}\} = O_P(1)$ and $E\{|L(U_{it})|^{\frac{2\alpha}{\alpha-2}}\} = O_P(h_i^{-\frac{\alpha}{\alpha-2}})$. Thus, we have
\[
E\{(\Delta_i\bar v_t)^2 L(U_{it})^2\} \le \|\Delta_i\|_2^2\big(E\{\|\bar v_t\|_2^{\alpha}\}\big)^{2/\alpha}\Big(E\big\{|L(U_{it})|^{\frac{2\alpha}{\alpha-2}}\big\}\Big)^{\frac{\alpha-2}{\alpha}} = O_P\Big(\frac{1}{N h_i}\Big). \tag{A.16}
\]
By the assumption $(N h_i)^{-1} = o_P(1)$, we have $E\{(\Delta_i\bar v_t)^2 L(U_{it})^2\} = o_P(1)$.

Next, we find the order of $E\{L(U_{it})^2\}$. As shown in Cheng and Shang (2015), as $\eta_i \to 0$ and $T \to \infty$, $E\{|L(U_{it})|^2\} \to \alpha_{x_0}^2 + 2(z + \beta_{x_0})'\Omega_i^{-1}\alpha_{x_0} + (z + \beta_{x_0})'\Omega_i^{-1}(z + \beta_{x_0})$ for any given $z$. This gives the leading term of the first term of $s_T^2$. We now turn to the second term. It is straightforward to obtain that
\[
E\{e_{it} L(U_{it})\}^2 = E\{(\epsilon_{it} - \Delta_i\bar v_t) L(U_{it})\}^2 \le \|\Delta_i\|_2^2\, E\{\|\bar v_t\|_2^2 L(U_{it})^2\},
\]
where $E\{\|\bar v_t\|_2^2\} \le E\{\|\bar v_t\|_2^{\alpha}\}^{2/\alpha} = O_P(1/N)$ and $E\{L(U_{it})^2\} = O_P(1)$. So $E\{e_{it} L(U_{it})\}^2 = O_P(1/N)$, and the second term is of smaller order than the first term.


For the last term of $s_T^2$, we can show that
\begin{align*}
\sum_{t_1 \ne t_2} \mathrm{Cov}(e_{it_1} L(U_{it_1}), e_{it_2} L(U_{it_2}))
&\le \sum_{t_1 \ne t_2} |\mathrm{Cov}(\Delta_i\bar v_{t_1} L(U_{it_1}), \Delta_i\bar v_{t_2} L(U_{it_2}))| \\
&\le 8\sum_{t_1 \ne t_2} \alpha(|t_1 - t_2|)^{1-4/\alpha}\, E\big\{|\Delta_i\bar v_{t_1} L(U_{it_1})|^{\alpha/2}\big\}^{4/\alpha} \\
&\le 8\cdot 2^{1-4/\alpha}\sum_{t_1=1}^T\sum_{t_2=1,\, t_2\ne t_1}^T \phi(|t_1 - t_2|)^{1-4/\alpha}\, E\big\{|\Delta_i\bar v_{t_1} L(U_{it_1})|^{\alpha/2}\big\}^{4/\alpha},
\end{align*}
where $\alpha(|t_1 - t_2|)$ and $\phi(|t_1 - t_2|)$ are the $\alpha$-mixing and $\phi$-mixing coefficients for $\Delta_i\bar v_t L(U_{it})$, and $2\alpha(k) \le \phi(k)$. The second inequality is from Proposition 2.5 in Fan and Yao (2003). Similar to (A.16), we can show that $E\{|\Delta_i\bar v_{t_1} L(U_{it_1})|^{\alpha}\} = O_P((N h_i)^{-\alpha/2})$, so $E\{|\Delta_i\bar v_{t_1} L(U_{it_1})|^{\alpha/2}\}^{4/\alpha} = O_P\big(\frac{1}{N h_i}\big)$. From Assumption A2, we have $\sum_{t_2=1,\, t_2\ne t_1}^{\infty} \phi(|t_1 - t_2|)^{1-4/\alpha} < \infty$. Thus,
\[
\sum_{t_1 \ne t_2} \mathrm{Cov}(e_{it_1} L(U_{it_1}), e_{it_2} L(U_{it_2})) = O_P\Big(\frac{T}{N h_i}\Big),
\]
so the third term of $s_T^2$ is also of smaller order than the first. Combining all of the above,
\[
\frac{1}{T} s_T^2 \to \sigma_s^2, \quad\text{or equivalently}\quad \frac{1}{T} s_T^2 \to (z', 1)\,\Psi^\star\,(z', 1)',
\]
where $\sigma_s^2 = \sigma_\epsilon^2\big\{\alpha_{x_0}^2 + 2(z + \beta_{x_0})'\Omega_i^{-1}\alpha_{x_0} + (z + \beta_{x_0})'\Omega_i^{-1}(z + \beta_{x_0})\big\}$.

In order to use the CLT with mixing conditions, we need to show that $E\{|e_{it} L(U_{it})|^{\alpha/2}\}$ is finite:
\[
E\{|e_{it} L(U_{it})|^{\alpha/2}\} \le E\{|e_{it}|^{\alpha}\}^{1/2} E\{|L(U_{it})|^{\alpha}\}^{1/2} \le c_{6z}\, E\{|e_{it}|^{\alpha}\}^{1/2} E\big\{\|Z_t\|_2^{\alpha} + h_i^{-\alpha/2}\big\}^{1/2}.
\]
Since $E\{\|Z_t\|_2^{\alpha}\} = O_P(1)$ and $E\{|e_{it}|^{\alpha}\} < \infty$, we get $E\{|e_{it} L(U_{it})|^{\alpha/2}\} < \infty$. From the above proof, we have $E\{e_{it} L(U_{it})\}^2 = O_P(1/N)$, which implies
\[
E\{e_{it} L(U_{it})\} = O_P(1/\sqrt{N}). \tag{A.17}
\]
Then $E\{|e_{it} L(U_{it}) - E\{e_{it} L(U_{it})\}|^{\alpha/2}\} < \infty$. Since the $\phi$-mixing condition is stronger than $\alpha$-mixing, by Theorem 2.21 of Fan and Yao (2003), we have
\[
\frac{1}{\sqrt{T}}\sum_{t=1}^T \big(e_{it} L(U_{it}) - E\{e_{it} L(U_{it})\}\big) \overset{d}{\to} N(0, \sigma_s^2).
\]
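The mixing CLT cited above (Theorem 2.21 of Fan and Yao, 2003) can be illustrated on the simplest dependent case, a 1-dependent moving average, where the limiting variance is the long-run variance rather than the marginal variance. Toy simulation, not the paper's setting:

```python
# CLT for a weakly dependent series: X_t = e_t + 0.5 e_{t-1} is 1-dependent
# (hence phi-mixing).  The normalized sample sum has limiting variance equal
# to the long-run variance (1 + 0.5)^2 = 2.25, not Var(X_t) = 1.25.
import numpy as np

rng = np.random.default_rng(3)
T, reps = 2_000, 5_000
e = rng.normal(size=(reps, T + 1))
X = e[:, 1:] + 0.5*e[:, :-1]       # 1-dependent (mixing) sequence
S = X.sum(axis=1) / np.sqrt(T)     # sqrt(T) times the sample mean
print(S.var())                     # should be close to 2.25
```

The role of the summable mixing coefficients in the proof is precisely to make the covariance corrections (here the $2\cdot 0.5$ cross term) finite, so that a long-run variance exists.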


Proof of Theorem 3.4. Notice that
\[
\theta_{i0} - \theta_{i0}^{\star} = P_i\theta_{i0} = \begin{pmatrix} -(\Omega_i + \Sigma_i)^{-1} V_i(G_i, W_i g_{i0}) \\ W_i g_{i0} + A_i'(\Omega_i + \Sigma_i)^{-1} V_i(G_i, W_i g_{i0}) \end{pmatrix}.
\]
Hence the result of the theorem holds if we can show that, for any $x \in \mathcal{X}_i$,
\begin{gather*}
\sqrt{T}\, E(e_{it} U_{it}^{(i)}) = o_P(1), \qquad \sqrt{T h_i}\, E(e_{it} H_{it}^{(i)}(x)) = o_P(1), \\
\sqrt{T}\,(\Omega_i + \Sigma_i)^{-1} V_i(G_i, W_i g_{i0}) = o_P(1), \qquad \sqrt{T h_i}\, A_i'(x)(\Omega_i + \Sigma_i)^{-1} V_i(G_i, W_i g_{i0}) = o_P(1), \\
\alpha_x = \beta_x = 0, \qquad \lim_{T\to\infty} \sqrt{T h_i}\, W_i g_{i0}(x) = 0.
\end{gather*}
First, by (A.17), the following hold:
\[
\sqrt{T}\, E(e_{it} U_{it}^{(i)}) = O_P(\sqrt{T/N}) = o_P(1), \qquad \sqrt{T h_i}\, E(e_{it} H_{it}^{(i)}(x)) = O_P(\sqrt{T/N}) = o_P(1).
\]
Similar to Shang and Cheng (2013), we have
\[
W_i \varphi_\nu^{(i)} = \frac{\eta_i \rho_\nu^{(i)}}{1 + \eta_i \rho_\nu^{(i)}}\, \varphi_\nu^{(i)}, \qquad \nu \ge 1. \tag{A.18}
\]
Now we see from A3 and (A.18) that
\[
V_i(G_{i,k}, W_i g_{i0}) = \sum_{\nu \ge 1} V_i(G_{i,k}, \varphi_\nu^{(i)})\, V_i(g_{i0}, \varphi_\nu^{(i)})\, \frac{\eta_i \rho_\nu^{(i)}}{1 + \eta_i \rho_\nu^{(i)}}.
\]

So by Cauchy's inequality,
\begin{align*}
|V_i(G_{i,k}, W_i g_{i0})|^2 &\le \sum_{\nu \ge 1} |V_i(G_{i,k}, \varphi_\nu^{(i)})|^2\, \frac{\eta_i \rho_\nu^{(i)}}{1 + \eta_i \rho_\nu^{(i)}} \sum_{\nu \ge 1} |V_i(g_{i0}, \varphi_\nu^{(i)})|^2\, \frac{\eta_i \rho_\nu^{(i)}}{1 + \eta_i \rho_\nu^{(i)}} \\
&\le \eta_i\, \mathrm{const} \sum_{\nu \ge 1} |V_i(G_{i,k}, \varphi_\nu^{(i)})|^2\, \frac{\eta_i \rho_\nu^{(i)}}{1 + \eta_i \rho_\nu^{(i)}} \\
&\le \eta_i\, \mathrm{const} \sum_{\nu \ge 1} |V_i(G_{i,k}, \varphi_\nu^{(i)})|^2 k_\nu\, \frac{\eta_i \rho_\nu^{(i)}}{(1 + \eta_i \rho_\nu^{(i)}) k_\nu} \le \eta_i\, \mathrm{const}.
\end{align*}
For $\|A_{i,k}\|_{\sup}$: by definition,
\[
A_{i,k}(x) = \langle A_{i,k}, K_x^{(i)}\rangle_{\star,i} = V_i(G_{i,k}, K_x^{(i)}) = \sum_{\nu \ge 1} \frac{V_i(G_{i,k}, \varphi_\nu^{(i)})}{1 + \eta_i \rho_\nu^{(i)}}\, \varphi_\nu^{(i)}(x).
\]


By the boundedness condition on $\varphi_\nu^{(i)}$ (Assumption A3) and Cauchy's inequality, we have
\[
|A_{i,k}(x)|^2 \le \sum_{\nu \ge 1} |V_i(G_{i,k}, \varphi_\nu^{(i)})|^2 k_\nu\, |\varphi_\nu^{(i)}(x)|^2 \sum_{\nu \ge 1} \frac{1}{k_\nu (1 + \eta_i \rho_\nu^{(i)})^2} \le \mathrm{const} \sum_{\nu \ge 1} |V_i(G_{i,k}, \varphi_\nu^{(i)})|^2 k_\nu \sum_{\nu \ge 1} \frac{1}{k_\nu} = O(1),
\]
uniformly over $x \in \mathcal{X}_i$. So
\[
\sqrt{T}\,(\Omega_i + \Sigma_i)^{-1} V_i(G_i, W_i g_{i0}) = O(\sqrt{T\eta_i}) = o(1)
\]
and
\[
\sqrt{T h_i}\, A_i'(x)(\Omega_i + \Sigma_i)^{-1} V_i(G_i, W_i g_{i0}) = O(\sqrt{T h_i \eta_i}) = o(1).
\]
By the above uniform boundedness of $A_{i,k}$, we have $\beta_x = 0$. Similarly, by (A.18),
\[
W_i A_{i,k}(x) = \sum_{\nu \ge 1} \frac{V_i(G_{i,k}, \varphi_\nu^{(i)})}{1 + \eta_i \rho_\nu^{(i)}}\, \eta_i \rho_\nu^{(i)}\, \varphi_\nu^{(i)}(x).
\]
Hence
\[
|W_i A_{i,k}(x)|^2 \le \sum_{\nu \ge 1} |V_i(G_{i,k}, \varphi_\nu^{(i)})|^2 k_\nu\, |\varphi_\nu^{(i)}(x)|^2 \sum_{\nu \ge 1} \frac{(\eta_i \rho_\nu^{(i)})^2}{k_\nu (1 + \eta_i \rho_\nu^{(i)})^2} \le \mathrm{const} \sum_{\nu \ge 1} |V_i(G_{i,k}, \varphi_\nu^{(i)})|^2 k_\nu \sum_{\nu \ge 1} \frac{1}{k_\nu} = O(1),
\]
and this rate is uniform over $x \in \mathcal{X}_i$. So $\alpha_x = 0$. By the definition of the RKHS and of $W_i$,
\begin{align*}
|W_i g_{i0}(x)| &= |\langle W_i g_{i0}, K_x^{(i)}\rangle_{\star,i}| = |\eta_i \langle g_{i0}, K_x^{(i)}\rangle_{\mathcal{H}_i}| \le \eta_i\|g_{i0}\|_{\mathcal{H}_i}\|K_x^{(i)}\|_{\mathcal{H}_i} \\
&\le \sqrt{\eta_i}\,\|g_{i0}\|_{\mathcal{H}_i}\sqrt{\langle K_x^{(i)}, K_x^{(i)}\rangle_{\star,i}} \le \sqrt{\eta_i}\,\|g_{i0}\|_{\mathcal{H}_i}\, O(h_i^{-1/2}) = O(\sqrt{\eta_i/h_i}).
\end{align*}
Thus $\sqrt{T h_i}\, W_i g_{i0}(x) = o(1)$.

Proof of Corollary 3.6. By (2.5), $Y_{it} - \hat g_i(X_{it}) - Z_t'\hat\beta_i = \epsilon_{it} - \Delta_i\bar v_t + (g_{i0}(X_{it}) - \hat g_i(X_{it})) + Z_t'(\beta_{i0} - \hat\beta_i)$. Hence
\[
\sum_{t=1}^T \{Y_{it} - \hat g_i(X_{it}) - Z_t'\hat\beta_i - \epsilon_{it}\}^2/T \le 4\sum_{t=1}^T (A_{1t} + A_{2t} + A_{3t})/T,
\]
where $A_{1t} = \|\Delta_i\|_2^2\|\bar v_t\|_2^2$, $A_{2t} = (g_{i0}(X_{it}) - \hat g_i(X_{it}))^2$, and $A_{3t} = \|Z_t\|_2^2\|\hat\beta_i - \beta_{i0}\|_2^2$. By the uniform boundedness of $\Delta_i$ and the i.i.d.-ness of $v_{it}$ in Assumption A2, we have $E(A_{1t}) \le \|\Delta_i\|_2^2\, \mathrm{tr}(\Sigma_v)/N$, where $\Sigma_v$ is the covariance matrix of $v_{it}$. So $\sum_{t=1}^T A_{1t}/T = O_P(1/N) = o_P(1)$. Also, by Lemma A.2 and Theorem 3.1,
\[
\|g_{i0} - \hat g_i\|_{\sup} = O_P\big(h_i^{-1/2}\|\theta_{i0} - \hat\theta_i\|_i\big) = O_P\big(h_i^{-1/2} r_{i,M}\big),
\]
so it follows that $\sum_{t=1}^T A_{2t}/T = O_P(h_i^{-1} r_{i,M}^2) = o_P(1)$. By Proposition A.1, Lemma A.2 and Theorem 3.1, we have $A_{3t} = O_P(T^{2/\alpha} r_{i,M}^2)$ uniformly for all $t = 1, 2, \ldots, T$, so $\sum_{t=1}^T A_{3t}/T = O_P(T^{2/\alpha} r_{i,M}^2) = o_P(1)$. Hence $\sum_{t=1}^T \{Y_{it} - \hat g_i(X_{it}) - Z_t'\hat\beta_i\}^2/T - \sum_{t=1}^T \epsilon_{it}^2/T = o_P(1)$. The result follows by applying the law of large numbers: $\sum_{t=1}^T \epsilon_{it}^2/T \to \sigma_\epsilon^2$ in probability.
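The argument above shows that once the fit is uniformly consistent, the residual mean square converges to $\sigma_\epsilon^2$. This mechanism can be checked in a stripped-down simulation; the data-generating process and the stand-in fit below are assumptions of the sketch, not the paper's estimator:

```python
# Toy analogue of Corollary 3.6: with a uniformly consistent fit, the
# averaged squared residuals converge to the noise variance sigma_eps^2.
import numpy as np

rng = np.random.default_rng(4)
T, sigma_eps = 50_000, 1.3
x = rng.uniform(0, 1, T)
eps = sigma_eps * rng.normal(size=T)
y = np.sin(2*np.pi*x) + eps                            # truth + noise
ghat = lambda u: np.sin(2*np.pi*u) + 0.01*np.cos(u)    # a fit with small sup-error
sigma2_hat = np.mean((y - ghat(x))**2)                 # residual mean square
print(sigma2_hat)                                      # close to sigma_eps**2 = 1.69
```

The $o_P(1)$ terms $A_{1t}, A_{2t}, A_{3t}$ in the proof correspond here to the (deliberately small) systematic fit error, which is negligible relative to the noise level.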

A.3. Proofs in Section 4.2

We prove Theorem 4.1. Let us first introduce some notation. Define
\begin{gather*}
F_1' = (f_{11}, f_{12}, \ldots, f_{1T}), \quad F_2' = (f_{21}, f_{22}, \ldots, f_{2T}), \quad \epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{iT})', \quad e_i = (e_{i1}, \ldots, e_{iT}), \\
\bar X = (\bar X_1, \ldots, \bar X_T), \quad \bar v = (\bar v_1, \ldots, \bar v_T), \quad \bar X^\star = \bar X - \bar v, \\
P_\star = I_T - \Sigma_\star'(\Sigma_\star\Sigma_\star')^{-1}\Sigma_\star, \quad \tilde\Sigma = \Sigma - \Sigma_\star.
\end{gather*}
We can rewrite (2.1) as $(Y_i - \langle K_{X_i}, g_0\rangle)' = \gamma_{1i}' F_1' + \gamma_{2,i}' F_2' + \epsilon_i'$, (2.2) as $\bar X^\star = \bar X - \bar v = \bar\Gamma_1' F_1' + \bar\Gamma_2' F_2'$, and (2.5) as $Y_i - \langle K_{X_i}, g_0\rangle = \Sigma'\beta_i + e_i$. Notice that $\Sigma = (F_1, \bar X')'$ and $\Sigma_\star = (F_1, \bar X^{\star\prime})'$. By definition, $\Sigma P = 0$ and $\Sigma_\star P_\star = 0$; hence $F_1' P = 0$, $\bar X P = 0$, $F_1' P_\star = 0$, $\bar X^\star P_\star = 0$, $F_2' P_\star = 0$. Define $S_{M,\eta}^\star(g) = E\{S_{M,\eta}(g)\,|\,\mathcal{F}_1^T\}$. By the proof of Lemma A.1, it can easily be shown that $DS_{M,\eta}^\star(g) = \mathrm{id}$ for any $g \in \mathcal{H}$. We also have
\[
\|K_x\|^2 \le C_\varphi^2 h^{-1}, \qquad \sup_{x \in \mathcal{X}} |g(x)| \le C_\varphi h^{-1/2}\|g\|. \tag{A.19}
\]
The proof of Theorem 4.1 also relies on the following Lemmas A.3, A.4 and A.5, whose proofs are provided in the supplementary document.

Lemma A.3. Suppose that Assumptions A2, A4 and A5 hold. Then
\begin{gather}
E(\|P - P_\star\|_{op}) = O(N^{-1/2}), \tag{A.20}\\
\max_{1 \le i \le N}\big\|E\{\gamma_{2i}' F_2'(P - P_\star) K_{X_i}\,|\,\mathcal{F}_1^T\}\big\| = O_P\bigg(\sqrt{\frac{T}{N h}} + \frac{T}{N\sqrt{h}}\bigg), \tag{A.21}\\
E(\|\tilde\Sigma\tilde\Sigma'\|_{op}) = O(T/N), \tag{A.22}
\end{gather}
where $F_2 = (f_{21}, \ldots, f_{2T})'$.


Lemma A.4. Suppose that Assumptions A2, A4 and A5 hold. Let $p = p(M) \ge 1$ be an $\mathcal{F}_1^T$-measurable sequence indexed by $M$, and let $\psi(X, g): \mathbb{R}^T\times\mathcal{H} \to \mathbb{R}^T$ be a measurable function satisfying $\psi(X, 0) \equiv 0$ and the Lipschitz condition
\[
\|\psi(X, g_1) - \psi(X, g_2)\|_2 \le L\sqrt{h/T}\,\|g_1 - g_2\|_{\sup} \quad\text{for any } g_1, g_2 \in \mathcal{H},
\]
where $L > 0$ is a constant. Then, as $M \to \infty$,
\[
\sup_{g \in G(p)} \|Z_M(g)\| = O_P\Big(1 + \sqrt{\log\log(N J(p,1))}\,\big(J(p,1) + N^{-1/2}\big)\Big),
\]
where
\[
Z_M(g) = \frac{1}{\sqrt{N}}\sum_{i=1}^N \big[\psi(X_i, g)' P K_{X_i} - E\{\psi(X_i, g)' P K_{X_i}\,|\,\mathcal{F}_1^T\}\big].
\]

Lemma A.5. Under the conditions of Theorem 4.1,
\[
\|S_{M,\eta}(g_\eta)\| = O_P\Big(\frac{1}{\sqrt{NTh}} + \frac{1}{N\sqrt{h}}\Big) + o_P(\sqrt{\eta}).
\]

Proof of Theorem 4.1. The proof consists of two parts. ? (g + g ). So Part one: Define T1 (g) = g − SM,η 0 ? ? ? T1 (g) = g − DSM,η (g0 )g − SM,η (g0 ) = −SM,η (g0 ).

So we have ? kSM,η (g0 )k = kE{−

= kE{−

= kE{−

N 1 X (Yi − hKXi g0 i)0 P KXi + Wη g0 |F1T }k NT

1 NT 1 NT

i=1 N X

e0i P KXi − Wη g0 |F1T }k

i=1 N X

0 0 (γ1i F10 + γ2,i F20 + 0i )P KXi − Wη g0 |F1T }k

i=1

N 1 X 0 0 = kE{− γ2,i F2 P KXi − Wη g0 |F1T }k, NT

= kE{−

1 NT

i=1 N X

0 γ2,i F20 (P − P? )KXi − Wη g0 |F1T }k,

i=1

where the second last equation is using independence of i and Xi , F1 , F2 . By directly calculations, kWη g0 k = sup |hWη g0 , gi| = sup |ηhg0 , giH | ≤ sup kgk=1

kgk=1



√ √ ηkg0 kH ηkgkH ≤ ηkg0 kH .

kgk=1

For the first term, by Lemma A.3, kE{−

N 1 1 X 0 0 1 γ2,i F2 (P − P? )KXi |F1T }k = OP ( √ + √ ). NT NTh N h i=1

Zhao et al./Panel Data Models with KRR

36

√ √ ? (g )k = O (1/ N T h + 1/(N h) + √η). Hence with probability apAs a consequence, kSM,η 0 P proaching one, we have kT1 (g)k ≤ C( √

1 1 √ + √ + η) ≡ r, NTh N h

¯ r) to B(0, ¯ r), so for some constant C > 0. Notice that T1 is also a contraction mapping from B(0, ¯ r) ⊂ H such that, T1 (e there exists a ge1 ∈ B(0, g1 ) = ge1 . By Taylor expansion, ? ge1 = T1 (e g1 ) = ge1 − SM,η (e g1 + g0 ), ? (e and hence it follows that SM,η g1 +g0 ) = 0. Let gη = ge1 +g0 , we have with probability approaching

one, kgη − g0 k ≤ C( √

1 1 √ ? + √ + η), SM,η (gη ) = 0. NTh N h

(A.23)

Part two: Define T2 (g) = g − SM,η (gη + g), for any g, ∆g ∈ H. Since ? ∆g = DSM,η (g)∆g

= E(DSM,η (g)∆g|F1T ) =

N 1 X E{τi ∆g 0 P KXi + Wη ∆g|F1T }, NT i=1

we have that kT2 (g1 ) − T2 (g2 )k = k(g1 − g2 ) − (SM,η (gη + g1 ) − SM,η (gη + g2 )) k ? = kDSM,η (g)(g1 − g2 ) − (SM,η (gη + g1 ) − SM,η (gη + g2 )) k N 1 X {τi (g1 − g2 )0 P KXi + Wη (g1 − g2 ) NT i=1  −E τi (g1 − g2 )0 P KXi + Wη (g1 − g2 )|F1T }k

= k

= k

N  1 X {τi (g1 − g2 )0 P KXi − E τi (g1 − g2 )0 P KXi |F1T }k NT i=1

≡ kκ(g1 − g2 )k, where κ(g) =

PN

i=1 (τi g

0P K Xi

− E{τi g 0 P KXi |F1T })/(N T ). Let Ψ(Xi , g) =



hτi g/T . It follows

that, for any g1 , g2 ∈ H, √

h kτi (g1 − g2 )k2 T v r √ u T X hu h t ≤ |g1 (Xit ) − g2 (Xit )|2 ≤ kg1 − g2 ksup . T T

kΨ(Xi , g1 ) − Ψ(Xi , g1 )k2 ≤

t=1

Let ge ≡ (g1 − g2 )/(cϕ h−1/2 kg1 − g2 k), then ke g ksup =

kg1 − g2 ksup ≤ 1, cϕ h−1/2 kg1 − g2 k

Zhao et al./Panel Data Models with KRR

and ke gη kH = So ge ∈ G(p), with p = 1/(cϕ

37

kg1 − g2 kH 1 ≤ p . −1/2 cϕ h kg1 − g2 k cϕ h−1 η

p h−1 η). By Lemma A.4,

  p kZM (g1 − g2 )k −1/2 = kZ (e g )k = O 1 + log log (N J(p, 1))(J(p, 1) + N ) = OP (bN,p ). M P cϕ h−1/2 kg1 − g2 k √ So by assumption bN,p = oP ( N h), we have 1 ZM (g1 − g2 )k Nh bN,p 1 = cϕ √ kg1 − g2 kOP (bN,p ) = OP ( √ )kg1 − g2 k = oP (1)kg1 − g2 k, Nh Nh

kκ(g1 − g2 )k = k √

(A.24)

where the terms OP , oP do not depend on g1 , g2 . Hence with probability approaching one, uniformly for any g1 , g2 , kT2 (g1 ) − T2 (g2 )k ≤

1 2 kg1

− g2 k. Also with probability approaching one,

uniformly for g, 1 kT2 (g)k ≤ kT2 (g) − T2 (0)k + kT2 (0)k ≤ kgk + kSM,η (gη )k. 2 By Lemma A.5, with probability approaching one,   1 1 √ kSM,η (gη )k ≤ C √ + √ + η ≡ R/2, NTh N h for some C > 0. Hence with probability approaching one, it follows that supkgk≤R kT2 (g)k ≤ ¯ R) to itself. By R/2 + R/2 = R. The above implies that T2 is a contraction mapping from B(0, ¯ R) such that ge2 = T2 (e contraction mapping theorem, there exists ge2 ∈ B(0, g2 ) = ge2 −SM,η (gη +e g2 ). Hence SM,η (gη + ge2 ) = 0. So gb = gη + ge2 . Therefore,  kb g − g0 k ≤ kgη − g0 k + ke g2 k = OP

 1 1 √ √ + √ + η . NTh N h

A.4. Proofs in Section 4.3 We will prove Theorems 4.2 and 4.3, and Proposition 4.5. Let us introduce some additional notation and preliminaries. Let m be the increasing sequence of integers provided in Theorems P −1/2 0 4.2 and 4.3. For any fixed x0 ∈ X , define VN T = N1T N i=1 KXi (x0 ) P KXi (x0 ), AN T = VN T . Let P PN −1/2 0 0 0 VN T m = N i=1 φm Φi P Φi φm /(N T ), AN T m = VN T m and HN T m = i=1 Φi P Φi /(N T ), where   ϕ1 (Xi1 ) ϕ2 (Xi1 ) · · · ϕm (Xi1 )    0  ϕ1 (Xi2 ) ϕ2 (Xi2 ) · · · ϕm (Xi2 )  ϕ (x ) ϕ (x ) ϕ (x ) 1 0 2 0 m 0  , φm = Φi =  , ,··· , .   1 + ηρ1 1 + ηρ2 1 + ηρm ···   ϕ1 (XiT ) ϕ2 (XiT ) · · · ϕm (XiT )

Zhao et al./Panel Data Models with KRR

38

The proof of Theorems 4.2 and 4.3 rely on the following Lemmas A.6, A.7 and A.8. Proofs of these lemmas can be found in supplement document. √ Lemma A.6. Under Assumptions A2, A4 and A5, suppose m = o( N ), then kHN T m −Im kF = oP (1), λ−1 min (HN T m ) = OP (1) and λmax (HN T m ) = OP (1). √ Lemma A.7. Under Assumptions A2, A4 and A5, suppose h−1 = oP ( N ), Dm = oP (1) , then for any x0 ∈ X , AN T = OP (1). √ √ Lemma A.8. Under Assumptions A2, A4 and A5, suppose Dm = oP ( N ), m = oP ( N ), for ant x0 ∈ X , we have √ N T AN T m (

N 1 X 0 0 d φm Φi P i ) → N (0, σ2 ). NT i=1

? (g ) = id (see Section A.3) and S Proof of Theorem 4.2. Let g = gb − g0 . By the fact DSM,η 0 M,η (g +

g0 ) = 0, we have ? kg + SM,η (g0 )k = kDSM,η (g0 )g + SM,η (g0 )k ? ? = kSM,η (g + g0 ) − SM,η (g0 ) − SM,η (g + g0 ) + SM,η (g0 )k

= k

N 1 X (τi g 0 P KXi − E(τi g 0 P KXi )) + Wη g − E(Wη g|F1T )k NT i=1

≤ kκ(g)k, where κ(g) is defined in Part two of the proof of Theorem 4.1. By (A.24) with g1 − g2 therein b

)kgk. Hence replaced by g we have kκ(g)k = OP ( √N,p Nh  kg + SM,η (g0 )k = OP

bN,p √ Nh



 kgk = OP

bN,p √ Nh



1 1 √ √ + √ + η NTh N h

 .

For fixed x0 ∈ X , AN T |g(x0 ) + SM,η (g0 )(x0 )| ≤ AN T kg + SM,η (g0 )ksup ≤ cϕ h−1/2 kg + SM,η (g0 )k r    bN,p 1 1 η √ = OP √ + + . N h h Nh NTh Since g0 ∈ H is fixed, kWη g0 k = sup hWη g0 , gei = sup ηhg0 , geiH ≤ sup ηkg0 kH ke g kH ≤ ke g k=1

ke g k=1

It follows from (A.25) that kWη g0 k ≤



ke g k=1

√ ηkg0 kH = OP ( η), and hence

|Wη g0 (x0 )| ≤ cϕ h−1/2 kWη g0 k = OP (

p η/h).



ηkg0 kH .

(A.25)

Zhao et al./Panel Data Models with KRR

39

Hence, we have  AN T |g(x0 ) + SM,η (g0 )(x0 ) − Wη g0 (x0 )| = OP

bN,p √ Nh

 √

1 1 + + N h NTh

r  η . h

√ Proof of Theorem 4.3. By Theorem 4.2, the asymptotic distribution of N T AN T (b g (x0 )−g0 (x0 )+ √ √ 1 PN Wη g0 ) is the same as N T AN T [−SM,η (g0 )(x0 )+Wη g0 (x0 )] = N T AN T [ N T i=1 (Yi −τi g0 )0 P KXi (x0 )]. By Yi = τi g0 + Σ0 βi + ei (4.2), ΣP = 0 (see Section A.3) and ei = i − v¯0 ∆0i , it yields that N 1 X AN T [ (Yi − τi g0 )0 P KXi (x0 )] NT

= AN T (

1 NT

i=1 N X

$$
A_{NT}\Big(\frac{1}{NT}\sum_{i=1}^N K_{X_i}(x_0)'P\epsilon_i\Big)-A_{NT}\Big(\frac{1}{NT}\sum_{i=1}^N K_{X_i}(x_0)'P\bar v'\Delta_i'\Big)\equiv A_{NT}\zeta-A_{NT}\xi.
$$

Let $\zeta_m=\sum_{i=1}^N\phi_m'\Phi_i'P\epsilon_i/(NT)$ and $\xi_m=\sum_{i=1}^N\phi_m'\Phi_i'P\bar v'\Delta_i'/(NT)$. To prove the result of the theorem, it is sufficient to prove the following:

$$A_{NT}-A_{NTm}=o_P(1),\qquad(A.26)$$
$$A_{NT}=O_P(1),\qquad(A.27)$$
$$\sqrt{NT}\,A_{NTm}\xi_m=o_P(1),\qquad(A.28)$$
$$\sqrt{NT}\,(\xi-\xi_m)=o_P(1),\qquad(A.29)$$
$$\sqrt{NT}\,(\zeta-\zeta_m)=o_P(1),\qquad(A.30)$$
$$\sqrt{NT}\,A_{NTm}\zeta_m\stackrel{d}{\to}N(0,1).\qquad(A.31)$$

(A.26) and (A.27) are guaranteed by Lemma A.7; (A.31) follows from Lemma A.8. For (A.28), by the expression of $A_{NTm}$ and Lemma A.6, we get

$$
|A_{NTm}\xi_m|=\Big|A_{NTm}\phi_m'\frac{1}{NT}\sum_{i=1}^N\Phi_i'P\bar v'\Delta_i'\Big|
\le\|A_{NTm}\phi_m\|_2\times\Big\|\frac{1}{NT}\sum_{i=1}^N\Phi_i'P\bar v'\Delta_i'\Big\|_2
\le\sqrt{\frac{\|\phi_m\|_2^2}{\phi_m'H_{NTm}\phi_m}}\,\Big\|\frac{1}{NT}\sum_{i=1}^N\Phi_i'P\bar v'\Delta_i'\Big\|_2
\le\lambda_{\min}(H_{NTm})^{-1/2}\Big\|\frac{1}{NT}\sum_{i=1}^N\Phi_i'P\bar v'\Delta_i'\Big\|_2.
$$

Zhao et al./Panel Data Models with KRR

Recall $\Delta_i=\gamma_{2i}'(\bar\Gamma_2\bar\Gamma_2')^{-1}\bar\Gamma_2\equiv\gamma_{2i}'\tilde M$ (see Assumption A2(d)), which leads to

$$
\Big\|\sum_{i=1}^N\Phi_i'P\bar v'\Delta_i'\Big\|_2^2=\mathrm{Tr}\Big\{\Big(\sum_{i=1}^N\Phi_i'P\bar v'\Delta_i'\Big)\Big(\sum_{i=1}^N\Delta_i\bar vP\Phi_i\Big)\Big\}.
$$

In matrix form, the above becomes

$$
\mathrm{Tr}\{(\Phi_1',\Phi_2',\ldots,\Phi_N')P_N\bar v_N'\tilde M_N'\gamma_2\gamma_2'\tilde M_N\bar v_NP_N(\Phi_1',\Phi_2',\ldots,\Phi_N')'\}
\le\lambda_{\max}(\gamma_2\gamma_2')\,\lambda_{\max}(\tilde M_N'\tilde M_N)\,\lambda_{\max}(\bar v_N\bar v_N')\,\mathrm{Tr}\Big\{\sum_{i=1}^N\Phi_i'P\Phi_i\Big\},
$$

where $P_N$ is defined in Section 4.1, $\tilde M_N=I_N\otimes\tilde M$ and $\bar v_N=I_N\otimes\bar v$. By Lemma A.3, Assumption A2 and Lemma A.6, we have

$$
\lambda_{\max}(\bar v_N\bar v_N')=\|\tilde\Sigma\tilde\Sigma'\|_{op}=O_P(T/N),\qquad
\lambda_{\max}(\tilde M_N'\tilde M_N)=\lambda_{\max}\big((\bar\Gamma_2\bar\Gamma_2')^{-1}\big)=O_P(1),
$$

and

$$
\mathrm{Tr}\Big\{\sum_{i=1}^N\Phi_i'P\Phi_i\Big\}=NT\,\mathrm{Tr}(H_{NTm})=O_P(NTm).
$$

So it can be seen that

$$
|A_{NTm}\xi_m|=O_P\Big(\sqrt{\frac{\lambda_{\max}(\gamma_2\gamma_2')\,m}{N}}\Big),
\quad\text{thus}\quad
\sqrt{NT}\,A_{NTm}\xi_m=O_P\Big(\sqrt{\frac{\lambda_{\max}(\gamma_2\gamma_2')\,mT}{N}}\Big)=o_P(1),
$$

and hence (A.28) is true. Similar to (S.20) in the proof of Lemma A.5 (see the supplement document), it can be shown that

$$
|\xi-\xi_m|=O_P\Big(\frac{D_m}{\sqrt{NTh}}+\frac{D_m}{N\sqrt h}\Big).\qquad(A.32)
$$

More explicitly, the proof of (A.32) follows by replacing $K_{X_i}$ in the expression of $T_2$ with $\Phi_i\phi_m$, and by a line-by-line check. Therefore, we have

$$
\sqrt{NT}\,|\xi-\xi_m|=O_P\Big(\frac{D_m}{\sqrt h}+\frac{D_m\sqrt T}{\sqrt{Nh}}\Big)=o_P(1),
$$

i.e., (A.29) holds. Let $\mathcal D_{1T}=\sigma(f_{1t},f_{2t},X_{it}:t\in[T],i\in[N])$. By independence of $\epsilon_i$ and $\mathcal D_{1T}$, we have $E(|\zeta-\zeta_m|^2\,|\,\mathcal D_{1T})=\sum_{i=1}^NR_i'PR_i/(N^2T^2)$, where $R_i=K_{X_i}(x_0)-\Psi_i\phi_m$. Let $R_{x_0}(\cdot)=\sum_{\nu\ge m+1}\frac{\varphi_\nu(x_0)\varphi_\nu(\cdot)}{1+\eta\rho_\nu}$; since $\mathcal F_1^T\subset\mathcal D_{1T}$, it follows that

$$
E(|\zeta-\zeta_m|^2\,|\,\mathcal F_1^T)=E\Big(\frac{1}{N^2T^2}\sum_{i=1}^NR_i'PR_i\,\Big|\,\mathcal F_1^T\Big)
=\frac{1}{NT}V(R_{x_0},R_{x_0})
=\frac{1}{NT}\sum_{\nu=m+1}^\infty\frac{\varphi_\nu^2(x_0)}{(1+\eta\rho_\nu)^2}
\le\frac{1}{NT}c_\varphi^2D_m=O_P\Big(\frac{D_m}{NT}\Big),
$$

so $\zeta-\zeta_m=O_P(\sqrt{D_m/(NT)})$, which implies that (A.30) is valid. Proof completed.
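The trace manipulations above repeatedly peel largest eigenvalues out of a trace via the elementary fact that $\mathrm{Tr}(BAB')\le\lambda_{\max}(A)\,\mathrm{Tr}(BB')$ for positive semi-definite $A$. A minimal numerical sketch of that inequality, with generic random matrices standing in for the model's objects (all sizes and names are illustrative):

```python
import numpy as np

# Numerical check of Tr(B A B') <= lambda_max(A) * Tr(B B') for psd A,
# the bound used to extract lambda_max(gamma_2 gamma_2'), lambda_max(M~'M~)
# and lambda_max(v_N v_N') from the trace in the proof above.
rng = np.random.default_rng(0)

def check_trace_bound(n=8, m=5):
    C = rng.standard_normal((n, n))
    A = C @ C.T                       # psd matrix (role of gamma_2 gamma_2')
    B = rng.standard_normal((m, n))   # arbitrary conformable matrix
    lhs = np.trace(B @ A @ B.T)
    rhs = np.linalg.eigvalsh(A)[-1] * np.trace(B @ B.T)  # largest eigenvalue
    return lhs, rhs

lhs, rhs = check_trace_bound()
assert lhs <= rhs + 1e-9
```

The inequality holds because $\mathrm{Tr}(BAB')=\mathrm{Tr}(A\,B'B)$ and $B'B$ is itself positive semi-definite.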

Proof of Proposition 4.5. By $Y_i=\tau_ig_0+\Sigma'\beta_i+e_i$, $\Sigma P=0$ and $e_i=\epsilon_i-\bar v'\Delta_i'$, we have

$$
\frac{1}{NT}\sum_{i=1}^N(Y_i-\tau_i\hat g)'P(Y_i-\tau_i\hat g)
=\frac{1}{NT}\sum_{i=1}^N(\tau_i(g_0-\hat g))'P(\tau_i(g_0-\hat g))
+\frac{2}{NT}\sum_{i=1}^N(\tau_i(g_0-\hat g))'Pe_i
+\frac{1}{NT}\sum_{i=1}^Ne_i'Pe_i
\equiv T_1+T_2+T_3.
$$

By Theorem 4.1,

$$
\|\hat g-g_0\|_{\sup}\le c_\varphi h^{-1/2}\|\hat g-g_0\|=O_P\Big(\frac{1}{\sqrt{NT}\,h}+\frac{1}{Nh}+\sqrt{\frac{\eta}{h}}\Big)=o_P(1).
$$

So $|T_1|\le\sum_{i=1}^N\sum_{t=1}^T|g_0(X_{it})-\hat g(X_{it})|^2/(NT)\le\|\hat g-g_0\|_{\sup}^2=o_P(1)$. By the definitions of $\bar v$ and $\tilde\Sigma$ and by Lemma A.3, we have $E(\|\bar v\bar v'\|_{op})=E(\|\tilde\Sigma\tilde\Sigma'\|_{op})=O(T/N)$. Hence it holds that

$$
E(\|e_i\|_2^2)=E(\epsilon_i'\epsilon_i)+E(\Delta_i'\bar v\bar v'\Delta_i)-2E(\epsilon_i'\bar v'\Delta_i)
\le T\sigma_\epsilon^2+E(\|\bar v\bar v'\|_{op})\sup_{1\le i\le N}\|\Delta_i\|_2^2=O(T).
$$

It then follows from the Cauchy inequality that

$$
|T_2|\le\frac{2}{NT}\sum_{i=1}^N\|\tau_i(g_0-\hat g)\|_2\|e_i\|_2
\le\frac{2}{NT}\sum_{i=1}^N\|e_i\|_2\sqrt T\,\|\hat g-g_0\|_{\sup}
=O_P(1)\,\|\hat g-g_0\|_{\sup}=o_P(1).
$$

Meanwhile, the following decomposition holds:

$$
T_3=\frac{1}{NT}\sum_{i=1}^N\epsilon_i'P\epsilon_i+\frac{1}{NT}\sum_{i=1}^N\Delta_i'\bar vP\bar v'\Delta_i-\frac{2}{NT}\sum_{i=1}^N\Delta_i'\bar vP\epsilon_i
$$
$$
=\frac{1}{NT}\sum_{i=1}^N\epsilon_i'P_\star\epsilon_i+\frac{1}{NT}\sum_{i=1}^N\epsilon_i'(P-P_\star)\epsilon_i+\frac{1}{NT}\sum_{i=1}^N\Delta_i'\bar vP\bar v'\Delta_i-\frac{2}{NT}\sum_{i=1}^N\Delta_i'\bar vP\epsilon_i
\equiv T_{31}+T_{32}+T_{33}-T_{34}.
$$

We handle the terms $T_{31},T_{32},T_{33},T_{34}$ respectively. By Lemma A.3, it follows that

$$
|T_{32}|\le\frac{1}{NT}\|P-P_\star\|_{op}\sum_{i=1}^N\epsilon_i'\epsilon_i=O_P(N^{-1/2})=o_P(1).
$$

In the meantime,

$$
|T_{33}|\le\frac{1}{NT}\sum_{i=1}^N\sup_{1\le i\le N}\|\Delta_i\|_2^2\,\|\bar v\bar v'\|_{op}=O_P\Big(\frac{1}{N}\Big)=o_P(1),
$$

and

$$
|T_{34}|\le\frac{2}{NT}\sum_{i=1}^N\|\epsilon_i\|_2\|\bar v'\Delta_i\|
\le2\sqrt{\frac{\sum_{i=1}^N\epsilon_i'\epsilon_i}{NT}}\sqrt{\frac{\sum_{i=1}^N\Delta_i'\bar v\bar v'\Delta_i}{NT}}
=O_P\Big(\frac{1}{\sqrt N}\Big)=o_P(1).
$$

Next we look at $T_{31}$. By direct examinations,

$$
E(T_{31}|\mathcal F_1^T)=\frac{1}{T}\mathrm{Tr}\{P_\star E(\epsilon_i\epsilon_i'|\mathcal F_1^T)\}=\sigma_\epsilon^2\frac{\mathrm{Tr}(P_\star)}{T}=\sigma_\epsilon^2\frac{T-(q_1+d)}{T},
$$

and by Chebyshev's inequality, for any $\varepsilon>0$,

$$
P\big(|T_{31}-E(T_{31}|\mathcal F_1^T)|>\varepsilon\,\big|\,\mathcal F_1^T\big)
\le\frac{1}{N\varepsilon^2}E\big\{|\epsilon_1'P_\star\epsilon_1/T-E(T_{31}|\mathcal F_1^T)|^2\,\big|\,\mathcal F_1^T\big\}
\le\frac{1}{N\varepsilon^2}E\big(|\epsilon_1'P_\star\epsilon_1/T|^2\,\big|\,\mathcal F_1^T\big)
\le\frac{1}{NT^2\varepsilon^2}E\big(|\epsilon_1'\epsilon_1|^2\,\big|\,\mathcal F_1^T\big)
\le\frac{1}{NT^2\varepsilon^2}\big(TE(\epsilon_{11}^4)+T^2\sigma_\epsilon^4\big)
=o(1).
$$

So $T_{31}\to\sigma_\epsilon^2(T-q_1-d)/T$ in probability. Since $T_{32},T_{33},T_{34}$ are all $o_P(1)$ as shown above, we have $T_3\to\sigma_\epsilon^2(T-q_1-d)/T$ in probability. Proof completed.

References

Bai, J. (2009). Panel data models with interactive fixed effects. Econometrica, 77(4):1229–1279.
Bai, J. and Ng, S. (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica, 74(4):1133–1150.
Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic, Boston, MA.
Bradley, R. C. (2005). Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144.
Braun, M. (2006). Accurate error bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 7:2303–2328.
Cai, Z., Fang, Y., and Xu, Q. (2016). Inferences for varying-coefficient panel data models with cross-sectional dependence. Working paper.
Carneiro, P., Hansen, K., and Heckman, J. (2003). Estimating distributions of treatment effects with an application to the returns to schooling and measurement of the effects of uncertainty on college choice. International Economic Review, 44(2):361–422.


Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6:5549–5632.
Cheng, G. and Shang, Z. (2015). Joint asymptotics for semi-nonparametric regression models with partially linear structure. The Annals of Statistics, 43(3):1351–1390.
Craven, P. and Wahba, G. (1978). Smoothing noisy data with spline functions. Numerische Mathematik, 31(4):377–403.
Cucker, F. and Smale, S. (2001). On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–49.
Cunha, F., Heckman, J., and Navarro, S. (2005). Separating uncertainty from heterogeneity in life cycle earnings. Oxford Economic Papers, 57(2):191–261.
Dehling, H. (1983). Limit theorems for sums of weakly dependent Banach space valued random variables. Probability Theory and Related Fields, 63(3):393–432.
Dudley, R. M., Hahn, M. G., and Kuelbs, J. (1992). Probability in Banach Spaces. Springer Science+Business Media.
Eberhardt, M., Helmers, C., and Strauss, H. (2013). Do spillovers matter when estimating private returns to R&D? Review of Economics and Statistics, 95(2):436–448.
Fan, J. and Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. Springer Science+Business Media, LLC.
Freyberger, J. (2012). Nonparametric panel data models with interactive fixed effects. Working paper.
Gu, C. (2013). Smoothing Spline ANOVA Models, vol. 297. Springer Science+Business Media, LLC.
Gu, C. and Qiu, C. (1993). Smoothing spline density estimation: Theory. The Annals of Statistics, pages 217–234.
Hashiguchi, Y. (2015). Allocation efficiency in China: an extension of the dynamic Olley-Pakes productivity decomposition. Working paper.
Hofmann, T., Schölkopf, B., and Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220.
Holly, S., Pesaran, M. H., and Yamagata, T. (2010). A spatio-temporal model of house prices in the USA. Journal of Econometrics, 158(1):160–173.
Huang, X. (2013). Nonparametric estimation in large panels with cross-sectional dependence. Econometric Reviews, 32(5-6):754–777.
Jin, S. and Su, L. (2013). A nonparametric poolability test for panel data models with cross section dependence. Econometric Reviews, 32(4):469–512.
Kosorok, M. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer, New York.
Lu, J., Cheng, G., and Liu, H. (2016). Nonparametric heterogeneity testing for massive data. arXiv preprint arXiv:1601.
Ludvigson, S. C. and Ng, S. (2009). Macro factors in bond risk premia. Review of Financial Studies, 22(12):5027–5067.
Ludvigson, S. C. and Ng, S. (2016). A factor analysis of bond risk premia. Handbook of Empirical Economics and Finance, pages 313–372.
Mammen, E. and van de Geer, S. (1997). Penalized quasi-likelihood estimation in partial linear models. The Annals of Statistics, pages 1014–1035.
Melitz, M. J. (2003). The impact of trade on intra-industry reallocations and aggregate industry productivity. Econometrica, 71(6):1695–1725.


Moon, H. R. and Weidner, M. (2010). Dynamic linear panel regression models with interactive fixed effects. Econometric Theory, pages 1–38.
Moon, H. R. and Weidner, M. (2015). Linear regression for panel with unknown number of factors as interactive fixed effects. Econometrica, 83(4):1543–1579.
O'Brien, G. (1974). The maximum term of uniformly mixing stationary processes. Probability Theory and Related Fields, 30(1):57–63.
Pesaran, M. H. (2006). Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica, 74(4):967–1012.
Pinelis, I. (1994). Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, pages 1679–1706.
Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis (2nd ed.). Springer, New York.
Shang, Z. (2010). Convergence rate and Bahadur type representation of general smoothing spline M-estimates. Electronic Journal of Statistics, 4:1411–1442.
Shang, Z. and Cheng, G. (2013). Local and global asymptotic inference in smoothing spline models. The Annals of Statistics, 41(5):2608–2638.
Shang, Z. and Cheng, G. (2015). Nonparametric inference in generalized functional linear models. The Annals of Statistics, 43(4):1742–1773.
Shao, J. (2003). Mathematical Statistics. Springer, New York.
Su, L. and Jin, S. (2012). Sieve estimation of panel data models with cross section dependence. Journal of Econometrics, 169(1):34–47.
Su, L., Jin, S., and Zhang, Y. (2015). Specification test for panel data models with interactive fixed effects. Journal of Econometrics, 186(1):222–244.
Su, L. and Zhang, Y. (2013). Nonparametric dynamic panel data models with interactive fixed effects: sieve estimation and specification testing. Working paper.
Sun, H. (2005). Mercer theorem for RKHS on noncompact sets. Journal of Complexity, 21(3):337–349.
Wahba, G. (1990). Spline Models for Observational Data. SIAM.
Wang, Y. (2011). Smoothing Splines: Methods and Applications. CRC Press.
Weinberger, H. F. (1974). Variational Methods for Eigenvalue Approximation, volume 15. SIAM.
Zhao, T., Cheng, G., and Liu, H. (2016). A partially linear framework for massive heterogeneous data. The Annals of Statistics, 44(4):1400–1437.

Supplement to "Statistical Inference on Panel Data Models: A Kernel Ridge Regression Method"

This supplement contains proofs and other relevant results that were not included in the main text and appendix. In Section S.1, we prove Lemmas A.1 and A.2 and Propositions A.1 and A.2. In Section S.2, we prove Lemmas A.3, A.4, A.5, A.6, A.7 and A.8. We also provide additional Lemmas S.1, S.2 and S.3 as well as their proofs. Lemmas S.1 and S.2 give mild conditions that guarantee the validity of Assumption A4; Lemma S.3 is useful for proving Lemma A.4.

S.1. Additional Proofs or Other Relevant Results for the Heterogeneous Model

Proof of Lemma A.1. For any $\theta,\theta_k=(\beta_k,g_k)\in\Theta_i$, $k=1,2$, it holds from (2.1) that

$$
\langle DS^\star_{i,M,\eta_i}(\theta)\theta_1,\theta_2\rangle_i
=\langle E\{DS_{i,M,\eta_i}(\theta)\theta_1\},\theta_2\rangle_i
=\frac{1}{T}\sum_{t=1}^TE\big(\langle R_iU_{it},\theta_1\rangle_i\langle R_iU_{it},\theta_2\rangle_i\big)+\langle P_i\theta_1,\theta_2\rangle_i
$$
$$
=\frac{1}{T}\sum_{t=1}^TE\big((g_1(X_{it})+Z_t'\beta_1)(g_2(X_{it})+Z_t'\beta_2)\big)+\eta_i\langle g_1,g_2\rangle_{\mathcal H_i}
=E\big((g_1(X_i)+Z'\beta_1)(g_2(X_i)+Z'\beta_2)\big)+\eta_i\langle g_1,g_2\rangle_{\mathcal H_i}
=\langle\theta_1,\theta_2\rangle_i,
$$

which implies that $DS^\star_{i,M,\eta_i}(\theta)=\mathrm{id}$, the identity operator on $\Theta_i$.

Proof of Lemma A.2. It follows by Proposition 2.1 that

$$
\|R_iU_{it}\|_i^2=K^{(i)}(X_{it},X_{it})+(Z_t-A_i(X_{it}))'(\Omega_i+\Sigma_i)^{-1}(Z_t-A_i(X_{it})).
$$

By (2.6) and $\langle A_i,g\rangle_{\star,i}=V_i(G_i,g)$ (see Section 3.2),

$$
A_i(x)=\langle A_i,K_x^{(i)}\rangle_{\star,i}=V_i(G_i,K_x^{(i)})=\sum_{\nu=1}^\infty\frac{\varphi_\nu^{(i)}(x)}{1+\eta_i\rho_\nu^{(i)}}V_i(G_i,\varphi_\nu^{(i)}).
$$

It follows by Assumption A3 that $C_{\varphi,i}\equiv\sup_{\nu\ge1}\sup_{x\in\mathcal X_i}|\varphi_\nu^{(i)}(x)|<\infty$. Then we have

$$
K^{(i)}(X_{it},X_{it})=\sum_{\nu\ge1}\frac{|\varphi_\nu^{(i)}(X_{it})|^2}{1+\eta_i\rho_\nu^{(i)}}\le C_{\varphi,i}^2h_i^{-1},
$$
$$
A_i(X_{it})'(\Omega_i+\Sigma_i)^{-1}A_i(X_{it})\le c_1^{-1}A_i(X_{it})'A_i(X_{it})
\le c_1^{-1}\sum_{\nu\ge1}\frac{|\varphi_\nu^{(i)}(X_i)|^2}{(1+\eta_i\rho_\nu^{(i)})^2}\sum_{\nu\ge1}V_i(G_i,\varphi_\nu^{(i)})V_i(G_i,\varphi_\nu^{(i)})
\le c_1^{-1}C_{\varphi,i}^2C_{G_i}^2h_i^{-1},
$$
$$
Z_t'(\Omega_i+\Sigma_i)^{-1}Z_t\le c_1^{-1}Z_t'Z_t,
$$

where $C_{G_i}^2=\sum_{\nu\ge1}V_i(G_i,\varphi_\nu^{(i)})V_i(G_i,\varphi_\nu^{(i)})$. By Assumption A1, $C_{G_i}^2$ is a finite positive constant. Then (A.1) holds for $C_i^2=\max\{C_{\varphi,i}^2,\,2c_1^{-1}C_{\varphi,i}^2C_{G_i}^2,\,2c_1^{-1}\}$.

To show (A.2), first notice that, for any $\theta=(\beta,g)\in\Theta_i$,

$$
\|\theta\|_{i,\sup}=\sup_{x\in\mathcal X_i,\|z\|_2=1}|g(x)+z'\beta|.
$$

The "$\ge$" is obvious. To show "$\le$", note that for any $x\in\mathcal X_i$, one may choose $z_x=\mathrm{sign}(g(x))\beta/\|\beta\|_2$; then $|g(x)+z_x'\beta|=|g(x)|+\|\beta\|_2$. Therefore,

$$
\sup_{x\in\mathcal X_i,\|z\|_2=1}|g(x)+z'\beta|\ge\sup_{x\in\mathcal X_i}|g(x)+z_x'\beta|=\sup_{x\in\mathcal X_i}|g(x)|+\|\beta\|_2=\|\theta\|_{i,\sup}.
$$

Following Proposition 2.1 and the proof of (A.1), for $u=(x,z)$ with $x\in\mathcal X_i$ and $\|z\|_2=1$,

$$
|g(x)+z'\beta|=|\langle R_iu,\theta\rangle_i|\le\|R_iu\|_i\|\theta\|_i\le C_i(1+h_i^{-1/2})\|\theta\|_i.
$$

This proves (A.2).

Proof of Proposition A.1. Since $f_{1t}$, $f_{2t}$, $v_{it}$ and $\epsilon_{it}$ all have finite $\alpha$th moments, it follows by (2.2) that $X_{it}$ and $Z_t$ both have finite $\alpha$th moments, i.e., $E(\|X_i\|_2^\alpha)<\infty$ and $E(\|Z\|_2^\alpha)<\infty$. Define $C_T(\xi)=\inf\{x\,|\,TP(\|Z\|_2>x)\le\xi\}$, $\xi>0$. By the Markov inequality,

$$
P\big(\|Z\|_2>[TE(\|Z\|_2^\alpha)/\xi]^{1/\alpha}\big)\le\frac{\xi}{T},
$$

therefore

$$
C_T(\xi)\le\Big(\frac{TE(\|Z\|_2^\alpha)}{\xi}\Big)^{1/\alpha}.
$$

Thanks to the $\phi$-mixing condition (see Assumption A2), it follows by O'Brien (1974, Theorem 1) that for any $\xi>0$,

$$
\liminf_{T\to\infty}P\Big(\max_{1\le t\le T}\|Z_t\|_2\le C_T(\xi)\Big)=\exp(-b\xi),
$$

where $b>0$ is a constant. For arbitrary $\varepsilon>0$, choose $\xi>0$ such that $1-\exp(-b\xi)<\varepsilon/2$. Then, as $T$ approaches infinity,

$$
P\Big(\max_{1\le t\le T}\|Z_t\|_2\le C_T(\xi)\Big)\ge\exp(-b\xi)-\varepsilon/2,
$$

leading to

$$
P\Big(\max_{1\le t\le T}\|Z_t\|_2>C_T(\xi)\Big)\le1-\exp(-b\xi)+\varepsilon/2\le\varepsilon.
$$

This proves that $\max_{1\le t\le T}\|Z_t\|_2=O_P(C_T(\xi))=O_P(T^{1/\alpha})$.
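The Markov-inequality step in the proof above, $TP(\|Z\|_2>C_T(\xi))\le\xi$ with $C_T(\xi)\le(TE\|Z\|_2^\alpha/\xi)^{1/\alpha}$, can be checked mechanically, because Markov's bound holds exactly for the empirical measure of any positive sample. A sketch with heavy-tailed Pareto draws standing in for $\|Z_t\|_2$ (the tail index, sample sizes and variable names are illustrative assumptions, not the paper's data):

```python
import numpy as np

# Empirical check of Markov's inequality: P(Z > c) <= E[Z^alpha] / c^alpha,
# hence T * P(Z > c) <= xi once c >= (T * E[Z^alpha] / xi)^(1/alpha),
# which is why the maximum over T periods scales like T^(1/alpha).
rng = np.random.default_rng(1)
alpha = 2.5
z = rng.pareto(3.0, size=100_000) + 1.0   # tail index 3 > alpha, so E[Z^alpha] < inf

moment = np.mean(z ** alpha)              # empirical alpha-th moment
for c in (2.0, 10.0, 50.0):
    # holds exactly for the empirical distribution, for every c > 0
    assert np.mean(z > c) <= moment / c ** alpha

T, xi = 10_000, 0.1
c_T = (T * moment / xi) ** (1 / alpha)    # the quantile bound C_T(xi) from the proof
assert np.mean(z > c_T) * T <= xi         # again exact by empirical Markov
```

The last assertion is not a Monte Carlo approximation: since $\mathbf 1\{z>c\}\le(z/c)^\alpha$ pointwise for positive $z$, the bound is an identity of the sample itself.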

Proof of Proposition A.2. For notational simplicity, denote $\mathcal F_j=\mathcal F_1^j$, $j\in[T]$, and let $\mathcal F_0$ be the trivial $\sigma$-algebra consisting only of the empty set and the full sample space. For any $\theta_1,\theta_2\in\Theta_i$, define $l_{it}=(\psi_{i,M,t}(U_{it};\theta_1)-\psi_{i,M,t}(U_{it};\theta_2))R_iU_{it}$. First of all, we will prove the following concentration inequality: for any $r>0$,

$$
P\Big(\Big\|\sum_{t=1}^T[l_{it}-E(l_{it})]\Big\|_i\ge r\Big)\le2\exp\Big(-\frac{r^2}{32TC_\phi^2\|\theta_1-\theta_2\|_{i,\sup}^2}\Big),\qquad(S.1)
$$

where $C_\phi\equiv\sum_{t=0}^\infty\phi(t)$. It follows by Assumption A2 that $C_\phi$ is finite. Clearly, (S.1) holds for $\theta_1=\theta_2$ since both sides equal zero. In what follows, we assume $\theta_1\ne\theta_2$.

Define $M_{iT}=\sum_{t=1}^Tl_{it}$, and $f_{iTj}=E(M_{iT}|\mathcal F_j)-E(M_{iT}|\mathcal F_{j-1})$, $j\in[T]$. It is easy to see that

$$
M_{iT}-E(M_{iT})=\sum_{j=1}^Tf_{iTj},\qquad f_{iTj}=\sum_{t=j}^T\big(E(l_{it}|\mathcal F_j)-E(l_{it}|\mathcal F_{j-1})\big).\qquad(S.2)
$$

Clearly, $f_{iTj}$ is $\mathcal F_j$-measurable. For $k\in[T]$, define $N_{iTk}=\sum_{j=1}^kf_{iTj}$ and $N_{iT0}\equiv0$; then $N_{iTk}=N_{iT,k-1}+f_{iTk}$. For $\lambda>0$, let $u_{k-1}(x)=\lambda\|N_{iT,k-1}+xf_{iTk}\|_i$, $x\in[0,1]$, and define $\varphi_{k-1}(x)=E(\cosh(u_{k-1}(x))\,|\,\mathcal F_{k-1})$, $x\in[0,1]$. It is easy to see that

$$
\varphi_{k-1}(1)=E\big(\cosh(\lambda\|N_{iTk}\|_i)\,\big|\,\mathcal F_{k-1}\big),\qquad
\varphi_{k-1}(0)=E\big(\cosh(\lambda\|N_{iT,k-1}\|_i)\,\big|\,\mathcal F_{k-1}\big).
$$

By the proof of Pinelis (1994, Theorem 3.2) and direct calculations, it can be shown that

$$
\varphi'_{k-1}(x)=E\big(\sinh(u_{k-1}(x))u'_{k-1}(x)\,\big|\,\mathcal F_{k-1}\big),
$$
$$
\varphi''_{k-1}(x)=E\big(\cosh(u_{k-1}(x))(u'_{k-1}(x))^2+\sinh(u_{k-1}(x))u''_{k-1}(x)\,\big|\,\mathcal F_{k-1}\big)
\le E\big(\cosh(u_{k-1}(x))(u'_{k-1}(x))^2+\cosh(u_{k-1}(x))u_{k-1}(x)u''_{k-1}(x)\,\big|\,\mathcal F_{k-1}\big)
$$
$$
=\frac12E\big(\cosh(u_{k-1}(x))(u_{k-1}(x)^2)''\,\big|\,\mathcal F_{k-1}\big)
=\lambda^2E\big(\cosh(u_{k-1}(x))\|f_{iTk}\|_i^2\,\big|\,\mathcal F_{k-1}\big).\qquad(S.3)
$$

Next we will show that $\|f_{iTk}\|_i^2$ is almost surely bounded, by first examining the terms $E(l_{it}|\mathcal F_k)-E(l_{it})$ for $t\ge k$. Arbitrarily choose $A\in\mathcal F_k$ and $\theta\in\Theta_i$ with $\|\theta\|_i=1$. Define $X=\langle l_{it},\theta\rangle_i$ and write $X=X^+-X^-$, where $X^+$ and $X^-$ represent the positive and negative parts of $X$, respectively. Clearly, $|X|\le\|l_{it}\|_i\|\theta\|_i=\|l_{it}\|_i$, implying that both $X^+$ and $X^-$ belong to $[0,\|l_{it}\|_i]$. Note that $X^+$ is $\mathcal F_t^\infty$-measurable. Therefore,

$$
|E(X^+|A)-E(X^+)|=\Big|\int_0^{\|l_{it}\|_i}[P(X^+>v|A)-P(X^+>v)]\,dv\Big|
\le\int_0^{\|l_{it}\|_i}|P(X^+>v|A)-P(X^+>v)|\,dv\le\|l_{it}\|_i\phi(t-k).
$$

Similarly, one can show that $|E(X^-|A)-E(X^-)|\le\|l_{it}\|_i\phi(t-k)$. Therefore, $|E(X|A)-E(X)|\le2\|l_{it}\|_i\phi(t-k)$. By the arbitrariness of $A\in\mathcal F_k$ and by taking the supremum over $\theta\in\Theta_i$ with $\|\theta\|_i=1$, one gets that $\|E(l_{it}|\mathcal F_k)-E(l_{it})\|_i\le2\|l_{it}\|_i\phi(t-k)$ for $t\ge k$. Similar arguments lead to $\|E(l_{it}|\mathcal F_{k-1})-E(l_{it})\|_i\le2\|l_{it}\|_i\phi(t-k+1)$ for $t\ge k$. Therefore, for $t\ge k$,

$$
\|E(l_{it}|\mathcal F_k)-E(l_{it}|\mathcal F_{k-1})\|_i\le2\|l_{it}\|_i(\phi(t-k)+\phi(t-k+1)).
$$

Using (S.2) and the assumption $\|l_{it}\|_i\le\|\theta_1-\theta_2\|_{i,\sup}$, it can be shown that

$$
\|f_{iTk}\|_i\le\sum_{t=k}^T\|E(l_{it}|\mathcal F_k)-E(l_{it}|\mathcal F_{k-1})\|_i
\le\sum_{t=k}^T2\|l_{it}\|_i(\phi(t-k)+\phi(t-k+1))
\le2\|\theta_1-\theta_2\|_{i,\sup}\sum_{t=k}^T(\phi(t-k)+\phi(t-k+1))
\le4C_\phi\|\theta_1-\theta_2\|_{i,\sup}.\qquad(S.4)
$$

Therefore, it follows by (S.3) that $\varphi''_{k-1}(x)\le16\lambda^2C_\phi^2\|\theta_1-\theta_2\|_{i,\sup}^2\varphi_{k-1}(x)$. Meanwhile, note that $N_{iT,k-1}$ is $\mathcal F_{k-1}$-measurable, so we have

$$
\varphi'_{k-1}(0)=\lambda\frac{\sinh(\lambda\|N_{iT,k-1}\|_i)}{\|N_{iT,k-1}\|_i}E\big(\langle N_{iT,k-1},f_{iTk}\rangle_i\,\big|\,\mathcal F_{k-1}\big)=0,\quad k\ge2,
$$

where the last equality follows from $E(f_{iTk}|\mathcal F_{k-1})=0$. Directly using (S.3), one also has that $\varphi'_0(0)=0$. So $\varphi'_{k-1}(0)=0$ for all $k\in[T]$. By Dudley et al. (1992, p. 133, Lemma 3), we have for $k\in[T]$,

$$
\varphi_{k-1}(x)\le\varphi_{k-1}(0)\exp\big(8\lambda^2C_\phi^2\|\theta_1-\theta_2\|_{i,\sup}^2x^2\big),\quad x\in[0,1].
$$

In particular,

$$
\varphi_{k-1}(1)\le\varphi_{k-1}(0)\exp\big(8\lambda^2C_\phi^2\|\theta_1-\theta_2\|_{i,\sup}^2\big).
$$

Taking expectations on both sides leads to

$$
E\big(\cosh(\lambda\|N_{iTk}\|_i)\big)\le\exp\big(8\lambda^2C_\phi^2\|\theta_1-\theta_2\|_{i,\sup}^2\big)\,E\big(\cosh(\lambda\|N_{iT,k-1}\|_i)\big).\qquad(S.5)
$$

By repeatedly using (S.5), the convention $N_{iT0}=0$, and (S.2), we have

$$
E\Big(\cosh\Big(\lambda\Big\|\sum_{t=1}^T[l_{it}-E(l_{it})]\Big\|_i\Big)\Big)=E\big(\cosh(\lambda\|N_{iTT}\|_i)\big)\le\exp\big(8T\lambda^2C_\phi^2\|\theta_1-\theta_2\|_{i,\sup}^2\big).
$$

Therefore,

$$
P\Big(\Big\|\sum_{t=1}^T[l_{it}-E(l_{it})]\Big\|_i\ge r\Big)
=P\Big(\cosh\Big(\lambda\Big\|\sum_{t=1}^T[l_{it}-E(l_{it})]\Big\|_i\Big)\ge\cosh(\lambda r)\Big)
\le\frac{1}{\cosh(\lambda r)}E\Big(\cosh\Big(\lambda\Big\|\sum_{t=1}^T[l_{it}-E(l_{it})]\Big\|_i\Big)\Big)
\le2\exp\big(-\lambda r+8T\lambda^2C_\phi^2\|\theta_1-\theta_2\|_{i,\sup}^2\big).
$$

Then (S.1) follows by choosing

$$
\lambda=\frac{r}{16TC_\phi^2\|\theta_1-\theta_2\|_{i,\sup}^2}.
$$

The rest of the proof follows by a chaining argument. Let $\psi_2(x)=\exp(x^2)-1$. It follows by (S.1) and Kosorok (2008, Theorem 8.1) that for any $\theta_1,\theta_2\in\Theta_i$,

$$
\big\|\|Z_{iM}(\theta_1)-Z_{iM}(\theta_2)\|_i\big\|_{\psi_2}\le96C_\phi\|\theta_1-\theta_2\|_{i,\sup}.
$$

It follows by Kosorok (2008, Theorem 8.4) that there exists a universal constant $C>0$, depending only on $C_\phi$, such that for any $\delta>0$,

$$
\Big\|\sup_{\theta_1,\theta_2\in\mathcal G_i(p_i),\ \|\theta_1-\theta_2\|_{i,\sup}\le\delta}\|Z_{iM}(\theta_1)-Z_{iM}(\theta_2)\|_i\Big\|_{\psi_2}
\le C\Big(\int_0^\delta\psi_2^{-1}\big(D_i(\varepsilon,\mathcal G_i(p_i),\|\cdot\|_{i,\sup})\big)\,d\varepsilon+\delta\psi_2^{-1}\big(D_i(\delta,\mathcal G_i(p_i),\|\cdot\|_{i,\sup})^2\big)\Big)=CJ_i(p_i,\delta).
$$

Therefore,

$$
\Big\|\sup_{\theta\in\mathcal G_i(p_i),\ \|\theta\|_{i,\sup}\le\delta}\|Z_{iM}(\theta)\|_i\Big\|_{\psi_2}\le CJ_i(p_i,\delta).
$$

It follows again from Kosorok (2008, Lemma 8.1) that for all $\delta>0$, $s>0$,

$$
P\Big(\sup_{\theta\in\mathcal G_i(p_i),\ \|\theta\|_{i,\sup}\le\delta}\|Z_{iM}(\theta)\|_i>s\Big)\le2\exp\Big(-\frac{s^2}{C^2J_i(p_i,\delta)^2}\Big).\qquad(S.6)
$$

It is easy to see that for any $\theta\in\mathcal G_i(p_i)$, $\|\theta\|_{i,\sup}\le1$. Let $\sqrt TJ_i(p_i,1)=\varepsilon^{-1}$ and $Q_\varepsilon=-\log\varepsilon-1$, and let $\tau=3C\sqrt{\log N+\log\log(\sqrt TJ_i(p_i,1))}$. Then it can be checked that

$$
N(Q_\varepsilon+2)\exp\Big(-\frac{\tau^2}{C^2\exp(2)}\Big)\to0,\quad N\to\infty.
$$

Since $J_i(p_i,\delta)$ is strictly increasing in $\delta$, the function $J_i(\delta)\equiv J_i(p_i,\delta)$ has an inverse, denoted $J_i^{-1}$. Then we have

$$
P\Big(\max_{i\in[N]}\sup_{\theta\in\mathcal G_i(p_i)}\frac{\sqrt T\|Z_{iM}(\theta)\|_i}{\sqrt TJ_i(\|\theta\|_{i,\sup})+1}\ge\tau\Big)
\le\sum_{i=1}^NP\Big(\sup_{\|\theta\|_{i,\sup}\le J_i^{-1}(T^{-1/2})}\|Z_{iM}(\theta)\|_i\ge T^{-1/2}\tau\Big)
$$
$$
+\sum_{l=0}^{Q_\varepsilon}\sum_{i=1}^NP\Big(\sup_{J_i^{-1}(T^{-1/2}\exp(l))\le\|\theta\|_{i,\sup}\le J_i^{-1}(T^{-1/2}\exp(l+1))}\|Z_{iM}(\theta)\|_i\ge T^{-1/2}\tau\exp(l)\Big)
$$
$$
\le\sum_{i=1}^N2\exp(-\tau^2/C^2)+\sum_{l=0}^{Q_\varepsilon}\sum_{i=1}^N2\exp\big(-\tau^2/(C^2\exp(2))\big)
\le2N(Q_\varepsilon+2)\exp\Big(-\frac{\tau^2}{C^2\exp(2)}\Big)\to0,\quad N\to\infty.\qquad(S.7)
$$

This proves the desired conclusion with $C_0=3C$.
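The cosh/Chernoff device behind (S.1) is the same argument that yields the classical Azuma-Hoeffding bound for bounded martingale increments. A Monte Carlo sketch of that special case, with i.i.d. random signs standing in for the $\phi$-mixing increments (so the classical constant $2$ in the exponent applies; all sizes are illustrative):

```python
import numpy as np

# For +-1 increments, the cosh/Chernoff argument gives the Azuma-Hoeffding
# bound P(|S_n| >= r) <= 2 exp(-r^2 / (2n)).  We check that the empirical
# tail frequency of simulated partial sums respects this bound.
rng = np.random.default_rng(3)
n, reps = 100, 20_000
S = rng.choice([-1.0, 1.0], size=(reps, n)).sum(axis=1)  # martingale endpoints

r = 2.5 * np.sqrt(n)                       # a 2.5-standard-deviation threshold
bound = 2 * np.exp(-r**2 / (2 * n))        # Azuma-Hoeffding tail bound
emp = np.mean(np.abs(S) >= r)              # observed tail frequency
assert emp <= bound
```

The observed frequency is far below the bound, which reflects the known looseness of exponential tail inequalities at moderate deviations.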

S.2. Additional Proofs or Other Relevant Results for the Homogeneous Model

The following lemma gives mild conditions that guarantee Assumption A4. Before stating the lemma, we borrow the concept of complete continuity from Weinberger (1974, page 50): a bilinear functional $A(\cdot,\cdot)$ on $\mathcal H\times\mathcal H$ is said to be completely continuous w.r.t. another bilinear functional $B(\cdot,\cdot)$ if for any $\varepsilon>0$ there exist finitely many functionals $l_1,l_2,\ldots,l_k$ on $\mathcal H$ such that $l_i(g)=0$, $i=1,2,\ldots,k$, implies $A(g,g)\le\varepsilon B(g,g)$.

Let $U$ be an open subset of $\mathcal X$ and $U^{NT}\equiv U\times U\times\cdots\times U$ ($NT$ factors). Let $C(\mathcal X)$ be the set of all continuous functions on $\mathcal X$ and $\mathcal H\subseteq C(\mathcal X)$. Let $x$ denote the $NT$-vector $(x_{11},\ldots,x_{1T},\ldots,x_{N1},\ldots,x_{NT})$.

Lemma S.1. Suppose $1\notin\mathcal H$, and $p(x|\mathcal F_1^T)>0$ for $x\in U^{NT}$, where $p(x|\mathcal F_1^T)$ is the joint conditional density of $X_{11},X_{12},\ldots,X_{NT}$ given $\mathcal F_1^T$. If $V(f,g)=0$ for all $f\in\mathcal H$, then $g=0$.

Proof of Lemma S.1. For simplicity, we assume that $f_{1t}$ and $X_{it}$ are both univariate. By assumption, $0=V(g,g)=\sum_{i=1}^NE\big((\tau_ig)'P(\tau_ig)\,\big|\,\mathcal F_1^T\big)/(NT)$. Hence it follows that

$$
0=\int(g(x_{i1}),g(x_{i2}),\ldots,g(x_{iT}))\,P\,(g(x_{i1}),g(x_{i2}),\ldots,g(x_{iT}))'\,p(x|\mathcal F_1^T)\,dx,\quad\text{for all }i\in[N].\qquad(S.1)
$$

Since the integrand in (S.1) is continuous and nonnegative, it holds that, for all $i\in[N]$ and all $x$ with $p(x|\mathcal F_1^T)>0$,

$$
(g(x_{i1}),g(x_{i2}),\ldots,g(x_{iT}))\,P\,(g(x_{i1}),g(x_{i2}),\ldots,g(x_{iT}))'=0.\qquad(S.2)
$$

By definition, $P$ is a projection matrix whose image is the orthogonal complement of the linear space spanned by $F_1$ and $\bar X$. Therefore, it yields that

$$
(g(x_{i1}),g(x_{i2}),\ldots,g(x_{iT}))=\alpha_i(f_{11},f_{12},\ldots,f_{1T})+\beta_i(\bar x_1,\bar x_2,\ldots,\bar x_T),\qquad(S.3)
$$

for some $\alpha_i,\beta_i\in\mathbb R$. Consider $x=(x_{11},x_{12},\ldots,x_{N,T-1},x_{NT})$ and $\tilde x=(x_{11},x_{12},\ldots,x_{N,T-1},\tilde x_{NT})\in U^{NT}$ with $x_{NT}\ne\tilde x_{NT}$ and $p(x|\mathcal F_1^T)>0$, $p(\tilde x|\mathcal F_1^T)>0$, i.e., the two points differ only in the last element. Applying (S.2) to the point $\tilde x$, we have

$$
(g(x_{i1}),g(x_{i2}),\ldots,g(\tilde x_{iT}))=\tilde\alpha_i(f_{11},f_{12},\ldots,f_{1T})+\tilde\beta_i(\bar x_1,\bar x_2,\ldots,\bar x_T),\qquad(S.4)
$$

for some $\tilde\alpha_i,\tilde\beta_i\in\mathbb R$. Comparing (S.3) and (S.4), and by the fact $T>q_1+d=2$, it holds that $\alpha_i=\tilde\alpha_i$ and $\beta_i=\tilde\beta_i=0$. Hence $(g(x_{i1}),g(x_{i2}),\ldots,g(x_{iT}))\in\mathrm{span}((f_{11},f_{12},\ldots,f_{1T}))$ for all $x$ with $p(x|\mathcal F_1^T)>0$ and all $i\in[N]$, and this happens if and only if $g=0$.

Lemma S.2. Suppose $\mathcal X$ is compact. If, furthermore, $V(f,g)=0$ for all $f\in\mathcal H$ implies $g=0$, then Assumption A4 is valid.

Remark S.2.1. The compactness of $\mathcal X$ can be relaxed by Mercer's theorem; see Sun (2005).

Proof of Lemma S.2. Define the bilinear functionals

$$
W(g,\tilde g)=\sum_{i=1}^NE\big((\tau_ig)'(\tau_i\tilde g)\,\big|\,\mathcal F_1^T\big)/(NT),\qquad J(g,\tilde g)=\langle g,\tilde g\rangle_{\mathcal H}.
$$

Clearly, $V(g,g)\le W(g,g)$. Let $\mu$ be a measure such that

$$
\int g\,d\mu=\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^TE(g(X_{it})|\mathcal F_1^T).
$$

Hence $\int g^2\,d\mu=W(g,g)$. By Mercer's theorem, the kernel $\bar K$ of $\mathcal H$ admits the expansion

$$
\bar K(x,y)=\sum_{i=1}^\infty\lambda_ie_i(x)e_i(y),
$$

where $\{\lambda_i\}$ is a non-increasing positive sequence converging to zero and $\{e_i\}_{i=1}^\infty$ forms an orthonormal basis of $L^2(\mu)$, so that $W(e_i,e_j)=\delta_{ij}$. Moreover, $\{\sqrt{\lambda_i}e_i\}_{i=1}^\infty$ is also an orthonormal basis of $\mathcal H$, as proved in Cucker and Smale (2001). As a consequence, any $g\in\mathcal H$ simultaneously admits the expansions

$$
g=\sum_{i=1}^\infty W(g,e_i)e_i,\qquad g=\sum_{i=1}^\infty J(g,\sqrt{\lambda_i}e_i)\sqrt{\lambda_i}e_i,
$$

with $\sum_{i=1}^\infty W^2(g,e_i)<\infty$ and $\sum_{i=1}^\infty J^2(g,\sqrt{\lambda_i}e_i)<\infty$. This implies $W(g,e_i)=\lambda_iJ(g,e_i)$. For any $\varepsilon>0$, choose an integer $k$ large enough so that $\lambda_i<\varepsilon$ for $i>k$. Define the functionals $l_i(g)=W(g,e_i)$, $i=1,2,\ldots,k$. By direct examination, if $l_i(g)=0$ for $i=1,2,\ldots,k$, then

$$
W(g,g)=\sum_{i=k+1}^\infty W^2(g,e_i)=\sum_{i=k+1}^\infty\lambda_i^2J^2(g,e_i)\le\varepsilon\sum_{i=k+1}^\infty\lambda_iJ^2(g,e_i)\le\varepsilon J(g,g).
$$

Since $V(g,g)\le W(g,g)\le\varepsilon J(g,g)$, $V$ is completely continuous w.r.t. $J$. By Weinberger (1974, Theorem 3.1, page 52), there are positive eigenvalues $\{\alpha_i\}_{i=1}^\infty$ converging to zero and eigenfunctions $\{\tilde\varphi_i\}_{i=1}^\infty\subset\mathcal H$ such that $V(\tilde\varphi_i,\tilde\varphi_j)=\alpha_i\delta_{ij}$, $J(\tilde\varphi_i,\tilde\varphi_j)=\delta_{ij}$ and

$$
g=\sum_{i=1}^\infty J(g,\tilde\varphi_i)\tilde\varphi_i,\quad\text{for all }g\in\mathcal H.
$$

The above implies $V(g,\tilde\varphi_i)=\alpha_iJ(g,\tilde\varphi_i)$. Take $\varphi_i=\tilde\varphi_i/\sqrt{\alpha_i}$ and $\rho_i=1/\alpha_i$; then $\{\varphi_i\}_{i=1}^\infty$ and $\{\rho_i\}_{i=1}^\infty$ satisfy Assumption A4.

Proof of Lemma A.3. Throughout, let $\|A\|_F=\sqrt{\mathrm{Tr}(AA')}$ be the Frobenius norm. Clearly,

$$
\tilde\Sigma=\begin{pmatrix}0&\cdots&0\\\bar v_1&\cdots&\bar v_T\end{pmatrix},\qquad
\Sigma_\star\tilde\Sigma'=\Big(0_{(q_1+d)\times q_1},\ \sum_{t=1}^TZ_t^\star\bar v_t'\Big).
$$

By direct examinations we have

$$
\Sigma\Sigma'-\Sigma_\star\Sigma_\star'=\Sigma_\star\tilde\Sigma'+\tilde\Sigma\Sigma_\star'+\tilde\Sigma\tilde\Sigma'\equiv R.\qquad(S.5)
$$

By independence of $v_{it}$ and $Z_t^\star$, it can be shown that

$$
E\Big(\Big\|\sum_{t=1}^TZ_t^\star\bar v_t'\Big\|_F^2\Big)=\mathrm{Tr}\Big(\sum_{t,l=1}^TE(\bar v_t'\bar v_l)E\big((Z_t^\star)'Z_l^\star\big)\Big)=O(T/N),\qquad
E\big(\|\tilde\Sigma\tilde\Sigma'\|_F^2\big)\le E\big(\mathrm{Tr}(\tilde\Sigma\tilde\Sigma'\tilde\Sigma\tilde\Sigma')\big)=O(T^2/N^2).\qquad(S.6)
$$

Hence,

$$
E\big(\|\Sigma_\star\tilde\Sigma'\|_F^2\big)=E\Big(\Big\|\sum_{t=1}^TZ_t^\star\bar v_t'\Big\|_F^2\Big)=O(T/N),\qquad
E(\|R\|_F^2)\le8E(\|\Sigma_\star\tilde\Sigma'\|_F^2)+2E(\|\tilde\Sigma\tilde\Sigma'\|_F^2)=O\big(T/N+(T/N)^2\big).\qquad(S.7)
$$

Since

$$
\|(\Sigma\Sigma')^{-1}-(\Sigma_\star\Sigma_\star')^{-1}\|_{op}=\|(\Sigma_\star\Sigma_\star')^{-1}R(\Sigma\Sigma')^{-1}\|_{op}\le\|(\Sigma_\star\Sigma_\star')^{-1}\|_{op}\|R\|_{op}\|(\Sigma\Sigma')^{-1}\|_{op},
$$

it follows by Assumption A5, (S.5) and the Hölder inequality that

$$
E\big(\|(\Sigma\Sigma')^{-1}-(\Sigma_\star\Sigma_\star')^{-1}\|_{op}^{1+\omega}\big)=O\big((T^3N)^{-(1+\omega)/2}+(TN)^{-(1+\omega)}\big),\qquad(S.8)
$$

where $\omega=(\zeta-4)/(\zeta+4)$. Note that $E(\tilde\Sigma'\tilde\Sigma)=(\sigma_v^2/N)I_T$ and $E\,\mathrm{Tr}(\tilde\Sigma'\tilde\Sigma)=O(T/N)$, where $\sigma_v^2=E(v_{it}'v_{it})$ is a constant. By direct examinations,

$$
P_\star-P=\Sigma'\big((\Sigma\Sigma')^{-1}-(\Sigma_\star\Sigma_\star')^{-1}\big)\Sigma+\Sigma_\star'(\Sigma_\star\Sigma_\star')^{-1}\tilde\Sigma+\tilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\Sigma_\star+\tilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\tilde\Sigma.
$$

It follows by (S.6), (S.7), (S.8) and the Hölder inequality that

$$
E(\|P-P_\star\|_{op})
\le E\big(\|\Sigma\Sigma'\|_{op}\|(\Sigma\Sigma')^{-1}-(\Sigma_\star\Sigma_\star')^{-1}\|_{op}\big)
+2E\big(\|\tilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\Sigma_\star\|_{op}\big)
+E\big(\|\tilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\tilde\Sigma\|_{op}\big)
$$
$$
\le E\big(\|(\Sigma\Sigma')^{-1}-(\Sigma_\star\Sigma_\star')^{-1}\|_{op}^{1+\omega}\big)^{1/(1+\omega)}E\big(\|\Sigma\Sigma'\|_{op}^{(1+\omega)/\omega}\big)^{\omega/(1+\omega)}
+2E\big(\|(\Sigma_\star\Sigma_\star')^{-1}\|_{op}\big)^{1/2}E\big(\mathrm{Tr}(\tilde\Sigma\tilde\Sigma')\big)^{1/2}
+E\big(\|(\Sigma_\star\Sigma_\star')^{-1}\|_{op}\big)E\big(\mathrm{Tr}(\tilde\Sigma\tilde\Sigma')\big)
$$
$$
=O\big((T^3N)^{-1/2}+(TN)^{-1}\big)\,T+O\big(T^{-1/2}\sqrt{T/N}\big)+O\big(T^{-1}(T/N)\big)=O(N^{-1/2}).\qquad(S.9)
$$

This proves (A.20). Next we show (A.21). For any $i\in[N]$,

$$
E\{\gamma_{2i}'F_2'(P-P_\star)K_{X_i}|\mathcal F_1^T\}
=E\{\gamma_{2i}'F_2'\Sigma'[(\Sigma\Sigma')^{-1}-(\Sigma_\star\Sigma_\star')^{-1}]\Sigma K_{X_i}|\mathcal F_1^T\}
+E\{\gamma_{2i}'F_2'\Sigma_\star'(\Sigma_\star\Sigma_\star')^{-1}\tilde\Sigma K_{X_i}|\mathcal F_1^T\}
$$
$$
+E\{\gamma_{2i}'F_2'\tilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\Sigma_\star K_{X_i}|\mathcal F_1^T\}
+E\{\gamma_{2i}'F_2'\tilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\tilde\Sigma K_{X_i}|\mathcal F_1^T\}.
$$

By direct calculations it can be examined that

$$
E\big(\|\Sigma'[(\Sigma\Sigma')^{-1}-(\Sigma_\star\Sigma_\star')^{-1}]\Sigma F_2\|_{op}\big)
\le E\big(\|(\Sigma\Sigma')^{-1}-(\Sigma_\star\Sigma_\star')^{-1}\|_{op}\times\|\Sigma\Sigma'\|_{op}\times\|F_2\|_{op}\big)
$$
$$
\le E\big(\|(\Sigma\Sigma')^{-1}-(\Sigma_\star\Sigma_\star')^{-1}\|_{op}^{1+\omega}\big)^{1/(1+\omega)}
E\big(\|\Sigma\Sigma'\|_{op}^{2(1+\omega)/\omega}\big)^{\omega/(2(1+\omega))}
E\big(\|F_2\|_{op}^{2(1+\omega)/\omega}\big)^{\omega/(2(1+\omega))}
=O\big((T^3N)^{-1/2}+(TN)^{-1}\big)T^{3/2}=O(N^{-1/2}+T^{1/2}/N),
$$

and

$$
E\big(\|\Sigma_\star'(\Sigma_\star\Sigma_\star')^{-1}\tilde\Sigma F_2\|_{op}\big)
\le E\big(\|(\Sigma_\star\Sigma_\star')^{-1}\|_{op}^{1/2}\,\mathrm{Tr}(F_2'\tilde\Sigma'\tilde\Sigma F_2)^{1/2}\big)
\le E\big(\|(\Sigma_\star\Sigma_\star')^{-1}\|_{op}\big)^{1/2}E\big(\mathrm{Tr}(F_2'\tilde\Sigma'\tilde\Sigma F_2)\big)^{1/2}
$$
$$
=E\big(\|(\Sigma_\star\Sigma_\star')^{-1}\|_{op}\big)^{1/2}E\big(\mathrm{Tr}(F_2'F_2)\big)^{1/2}O(N^{-1/2})
=O(N^{-1/2}).
$$

For any $g\in\mathcal H$ with $\|g\|=1$ (implying $|g(x)|\le c_\varphi h^{-1/2}$ for any $x$), we have

$$
\|E\{F_2'\Sigma_\star'(\Sigma_\star\Sigma_\star')^{-1}\tilde\Sigma\tau_ig|\mathcal F_1^T\}\|_2
\le\|F_2'\Sigma_\star'(\Sigma_\star\Sigma_\star')^{-1}\|_{op}\|E\{\tilde\Sigma\tau_ig|\mathcal F_1^T\}\|_2
=O_P\big(\|E\{\tilde\Sigma\tau_ig|\mathcal F_1^T\}\|_2\big).
$$

On the other hand, by direct examinations we have

$$
\|\tilde\Sigma\tau_ig\|_2^2=\sum_{t,l=1}^T\bar v_t'\bar v_l\,g(x_{it})g(x_{il}).
$$

Meanwhile, for any $t\ne l$, $\bar v_t'g(x_{it})$ and $\bar v_lg(x_{il})$ are independent conditional on $\mathcal F_1^T$, and

$$
E\{\bar v_lg(x_{il})|\mathcal F_1^T\}=\frac{1}{N}E\{v_{il}g(x_{il})|\mathcal F_1^T\}+\frac{1}{N}\sum_{k\ne i}E\{v_{kl}g(x_{il})|\mathcal F_1^T\}=\frac{1}{N}E\{v_{il}g(x_{il})|\mathcal F_1^T\}.
$$

The last equality holds because $v_{kl}$ and $g(x_{il})$ are conditionally independent (given $\mathcal F_1^T$) for $k\ne i$ and the former has mean zero. This leads us to

$$
E\{\|\tilde\Sigma\tau_ig\|_2^2|\mathcal F_1^T\}
=\sum_{t=1}^TE\{\bar v_t'\bar v_tg(x_{it})^2|\mathcal F_1^T\}+\sum_{t\ne l}E\{\bar v_t'\bar v_lg(x_{it})g(x_{il})|\mathcal F_1^T\}
$$
$$
=\sum_{t=1}^TE\{\bar v_t'\bar v_tg(x_{it})^2|\mathcal F_1^T\}+\sum_{t\ne l}E\{\bar v_t'g(x_{it})|\mathcal F_1^T\}E\{\bar v_lg(x_{il})|\mathcal F_1^T\}
$$
$$
=\sum_{t=1}^TE\{\bar v_t'\bar v_tg(x_{it})^2|\mathcal F_1^T\}+\frac{1}{N^2}\sum_{t\ne l}E\{v_{it}'g(x_{it})|\mathcal F_1^T\}E\{v_{il}g(x_{il})|\mathcal F_1^T\}
=O_P\Big(\frac{T}{Nh}+\frac{T^2}{N^2h}\Big).
$$

Therefore,

$$
\|E\{F_2'\Sigma_\star'(\Sigma_\star\Sigma_\star')^{-1}\tilde\Sigma\tau_ig|\mathcal F_1^T\}\|_2=O_P\Big(\sqrt{\frac{T}{Nh}}+\frac{T}{N\sqrt h}\Big),
$$

where the $O_P$ term is free of $g$. Similarly, we can show that

$$
E\big(\|\tilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\tilde\Sigma F_2\|_{op}\big)
\le E\big(\|\tilde\Sigma\tilde\Sigma'\|_{op}^2\big)^{1/4}E\big(\|(\Sigma_\star\Sigma_\star')^{-1}\|_{op}^4\big)^{1/4}E\big(\|F_2'\tilde\Sigma'\tilde\Sigma F_2\|_{op}\big)^{1/2}
=O(\sqrt{T/N})\,O(1/T)\,O(\sqrt{T/N})=O(1/N).\qquad(S.10)
$$

Combining the above, we get that

$$
\|E\{\gamma_{2i}'F_2'(P-P_\star)K_{X_i}|\mathcal F_1^T\}\|=O_P\Big(\sqrt{\frac{T}{Nh}}+\frac{T}{N\sqrt h}\Big),
$$

where the $O_P$ is free of $i\in[N]$. Proof completed.

Lemma S.3. Suppose that Assumptions A2, A4 and A5 hold, and let $\psi$ satisfy the conditions in Lemma A.4. Then

$$
\sup_{\|g\|_{\sup}\le1}\frac{1}{\sqrt N}\Big\|\sum_{i=1}^N\psi(X_i,g)'(P-P_\star)K_{X_i}\Big\|=O_P(1),
$$

and

$$
\sup_{\|g\|_{\sup}\le1}\frac{1}{\sqrt N}\Big\|\sum_{i=1}^NE\big(\psi(X_i,g)'(P-P_\star)K_{X_i}\,\big|\,\mathcal F_1^T\big)\Big\|=O_P(1).
$$

Proof of Lemma S.3. For any $g,\tilde g$ satisfying $\|g\|_{\sup}\le1$ and $\|\tilde g\|\le1$, the former implies that $\|\psi(X_i,g)\|_2\le L\sqrt{h/T}$ for each $i\in[N]$, and the latter implies that $\|\tilde g\|_{\sup}\le c_\varphi h^{-1/2}$. By (A.20) we have

$$
\frac{1}{\sqrt N}\sum_{i=1}^N\psi(X_i,g)'(P-P_\star)\tau_i\tilde g\le\frac{1}{\sqrt N}\sum_{i=1}^N\|\psi(X_i,g)\|_2\|\tau_i\tilde g\|_2\|P-P_\star\|_{op}=O_P(1),
$$

and

$$
\frac{1}{\sqrt N}\sum_{i=1}^NE\big(\psi(X_i,g)'(P-P_\star)\tau_i\tilde g\,\big|\,\mathcal F_1^T\big)\le Lc_\varphi\sqrt N\,E\big(\|P-P_\star\|_{op}\,\big|\,\mathcal F_1^T\big)=O_P(1).
$$

Proof completed.

Proof of Lemma A.4. It follows by Lemma S.3 that we only need to consider the process $Z_M^\star(g)=\frac{1}{\sqrt N}\sum_{i=1}^N[\psi(X_i,g)'P_\star K_{X_i}-E\{\psi(X_i,g)'P_\star K_{X_i}|\mathcal F_1^T\}]$ for $g\in\mathcal H$, where the summands are independent conditional on $\mathcal F_1^T$.

Let $K_i=[K(X_{it},X_{il})]_{1\le t,l\le T}$, a $T\times T$ matrix. By Assumption A4 it follows that $K_i\le c_\varphi^2h^{-1}TI_T$. For any $g_1,g_2\in\mathcal H$,

$$
\|(\psi(X_i,g_1)-\psi(X_i,g_2))'P_\star K_{X_i}\|^2=(\psi(X_i,g_1)-\psi(X_i,g_2))'P_\star K_iP_\star(\psi(X_i,g_1)-\psi(X_i,g_2))
\le(Lc_\varphi\|P_\star\|_{op}\|g_1-g_2\|_{\sup})^2=(Lc_\varphi\|g_1-g_2\|_{\sup})^2.
$$

The last equality follows from $\|P_\star\|_{op}=1$ since $P_\star$ is idempotent. Notice that $\{X_{it}:i\in[N],t\in[T]\}$ are conditionally independent given $\mathcal F_1^T$. It follows by Pinelis (1994, Theorem 3.5) that for any $r\ge0$,

$$
P\big(\|Z_M^\star(g_1)-Z_M^\star(g_2)\|\ge r\,\big|\,\mathcal F_1^T\big)\le2\exp\Big(-\frac{r^2}{8L^2c_\varphi^2\|g_1-g_2\|_{\sup}^2}\Big).
$$

It follows by Kosorok (2008, Lemma 8.1) that

$$
\big\|\|Z_M^\star(g_1)-Z_M^\star(g_2)\|\big\|_{\mathcal F_1^T,\psi_2}\le5Lc_\varphi\|g_1-g_2\|_{\sup},
$$

where $\|\cdot\|_{\mathcal F_1^T,\psi_2}$ denotes the Orlicz norm conditional on $\mathcal F_1^T$ with respect to $\psi_2(s)=\exp(s^2)-1$. This in turn leads to, by Kosorok (2008, Theorem 8.4), that for any $\delta>0$,

$$
\Big\|\sup_{g_1,g_2\in\mathcal G(p),\ \|g_1-g_2\|_{\sup}\le\delta}\|Z_M^\star(g_1)-Z_M^\star(g_2)\|\Big\|_{\mathcal F_1^T,\psi_2}
\le C\Big(\int_0^\delta\psi_2^{-1}\big(D(\varepsilon,\mathcal G(p),\|\cdot\|_{\sup})\big)\,d\varepsilon+\delta\psi_2^{-1}\big(D(\delta,\mathcal G(p),\|\cdot\|_{\sup})^2\big)\Big)=CJ(p,\delta),
$$

where $C>0$ is a constant depending on $L$ and $c_\varphi$ only. Then we have

$$
\Big\|\sup_{g\in\mathcal G(p),\ \|g\|_{\sup}\le\delta}\|Z_M^\star(g)\|\Big\|_{\mathcal F_1^T,\psi_2}\le CJ(p,\delta).
$$

It follows again from Kosorok (2008, Lemma 8.1) that for all $\delta>0$, $r>0$,

$$
P\Big(\sup_{g\in\mathcal G(p),\ \|g\|_{\sup}\le\delta}\|Z_M^\star(g)\|\ge r\,\Big|\,\mathcal F_1^T\Big)\le2\exp\Big(-\frac{r^2}{C^2J(p,\delta)^2}\Big).\qquad(S.11)
$$

Let $Q_N=\log(N^{1/2}J(p,1))-1$. It follows from the proof of (S.7) that

$$
P\Big(\sup_{g\in\mathcal G(p)}\frac{\sqrt N\|Z_M^\star(g)\|}{\sqrt NJ(p,\|g\|_{\sup})+1}\ge C\sqrt{18\log Q_N}\,\Big|\,\mathcal F_1^T\Big)
\le2(Q_N+2)\exp\Big(-\frac{18C^2\log Q_N}{C^2\exp(2)}\Big)\le\frac{2(Q_N+2)}{Q_N^2}.\qquad(S.12)
$$

Taking expectation on both sides of (S.12), we get that

$$
P\Big(\sup_{g\in\mathcal G(p)}\frac{\sqrt N\|Z_M^\star(g)\|}{\sqrt NJ(p,\|g\|_{\sup})+1}\ge C\sqrt{18\log Q_N}\Big)=o(1),\quad N\to\infty.
$$

This shows that, with probability approaching one,

$$
\sup_{g\in\mathcal G(p)}\frac{\sqrt N\|Z_M^\star(g)\|}{\sqrt NJ(p,\|g\|_{\sup})+1}\le C\sqrt{18\log Q_N}.
$$

Since $\|g\|_{\sup}\le1$ for any $g\in\mathcal G(p)$ and $J(p,\delta)$ is increasing in $\delta$, the above inequality implies that, with probability approaching one,

$$
\sup_{g\in\mathcal G(p)}\|Z_M^\star(g)\|\le C\sqrt{18\log Q_N}\,(J(p,1)+N^{-1/2}).
$$

Combining with Lemma S.3, we get that

$$
\sup_{g\in\mathcal G(p)}\|Z_M(g)\|\le\sup_{g\in\mathcal G(p)}\|Z_M(g)-Z_M^\star(g)\|+\sup_{g\in\mathcal G(p)}\|Z_M^\star(g)\|
=O_P\Big(1+\sqrt{\log\log(NJ(p,1))}\,(J(p,1)+N^{-1/2})\Big).
$$

Proof completed. ¯ −Γ ¯0 F 0 − Γ ¯ 0 F 0 ) and Proof of Lemma A.5. By (2.5), we have e0i = 0i − ∆i , v¯ = 0i − ∆i (X 1 1 2 2 (Yi − τi gη )0 P KXi = [τi (g0 − gη ) + Σ0 βi + ei ]0 P KXi = [τi (g0 − gη ) + ei ]0 P KXi ¯ 2 ∆0 ]0 P KX . = [τi (g0 − gη ) + i + F2 Γ i i

(S.13)

By the definition of gη in the proof Theorem 4.1 and (S.13), we get that ? SM,η (gη ) = SM,η (gη ) − SM,η (gη ) = T1 + T2 − T3 + Wη gη − E(Wη gη |F1T ) = T1 + T2 − T3 , (S.14)

where T1 =

N 1 X 0 [i P KXi − E(0i P KXi |F1T )], NT i=1

N 1 X T2 = [∆i F20 P KXi − E(∆i F20 P KXi |F1T )], NT i=1

T3 = κ(gη − g0 ). Recall that κ is defined in the proof of Theorem 4.1. It is worthwhile to mention that the terms Wη gη and E(Wη gη |F1T ) cancel each other in (S.14) thanks to Wη gη ∈ F1T . Next, we will bound T1 , T2 , T3 respectively.

Zhao et al./Panel Data Models with KRR

14

First of all, by (A.23) and (A.24), it yields that 1 1 √ kT3 k = oP (1)kgη − g0 k = oP ( √ + √ + η). NTh N h

(S.15)

Secondly, the independence of i and Xi , F1 , F2 tells us that T1 =

N 1 X 0 i P KXi . NT i=1

Again by the independence assumption and direct calculations, we have E(kT1 k2 |F1T ) = =

=

N 1 X E(0i P < KXi , KXi > P 0 i |F1T ) N 2T 2

1 N 2T 2 1 N 2T 2

i=1 N X i=1 N X

E(0i P Ki P 0 i |F1T ) T r(E(P Ki P 0 i 0i )|F1T )

i=1

σ2 = E{T r(P Ki P 0 )|F1T } NT2 σ2 ≤ E{T r(Ki )|F1T } NT2 1 ), = OP ( NTh where we are using the facts that Ki = [K(Xit , Xil )]1≤t,l≤T and T r(Ki ) ≤ T c2ϕ h−1 derived from (A.19). So it follows kT1 k = OP ( √

1 ) NTh

Lastly, we will handle T2 as follows. Since F20 P? = 0 (see Section A.3), it follows that T2 =

N 1 X [∆i F20 (P − P? )KXi − E(∆i F20 P KXi |F1T )]. NT i=1

By the proof and notation in Lemma A.3, it can be shown that P? − P  e +Σ e 0 (Σ? Σ0 )−1 Σ? + Σ e 0 (Σ? Σ0 )−1 Σ. e = Σ0 (ΣΣ0 )−1 − (Σ? Σ0? )−1 Σ + Σ0? (Σ? Σ0? )−1 Σ ? ?

(S.16)

Zhao et al./Panel Data Models with KRR

15

Consequently, $T_2$ has the following decomposition:
$$
\begin{aligned}
T_2 ={}& \frac{1}{NT}\sum_{i=1}^N \Big[\Delta_i F_2'\Sigma'\big((\Sigma\Sigma')^{-1} - (\Sigma_\star\Sigma_\star')^{-1}\big)\Sigma K_{X_i} - E\big(\Delta_i F_2'\Sigma'\big((\Sigma\Sigma')^{-1} - (\Sigma_\star\Sigma_\star')^{-1}\big)\Sigma K_{X_i}\big|\mathcal F_1^T\big)\Big]\\
&+ \frac{1}{NT}\sum_{i=1}^N \Big[\Delta_i F_2'\Sigma_\star'(\Sigma_\star\Sigma_\star')^{-1}\widetilde\Sigma K_{X_i} - E\big(\Delta_i F_2'\Sigma_\star'(\Sigma_\star\Sigma_\star')^{-1}\widetilde\Sigma K_{X_i}\big|\mathcal F_1^T\big)\Big]\\
&+ \frac{1}{NT}\sum_{i=1}^N \Big[\Delta_i F_2'\widetilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\Sigma_\star K_{X_i} - E\big(\Delta_i F_2'\widetilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\Sigma_\star K_{X_i}\big|\mathcal F_1^T\big)\Big]\\
&+ \frac{1}{NT}\sum_{i=1}^N \Big[\Delta_i F_2'\widetilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\widetilde\Sigma K_{X_i} - E\big(\Delta_i F_2'\widetilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\widetilde\Sigma K_{X_i}\big|\mathcal F_1^T\big)\Big]\\
\equiv{}& T_{21} + T_{22} + T_{23} + T_{24}.
\end{aligned}
$$
The rest of the proof proceeds to bound the terms $T_{2i}$, $i = 1, 2, 3, 4$. By (S.9) in the proof of Lemma A.3, we obtain the following:
$$
\begin{aligned}
E\big(\big\|F_2'\Sigma'\big((\Sigma\Sigma')^{-1} - (\Sigma_\star\Sigma_\star')^{-1}\big)\Sigma\big\|_{op}\big) &= O(N^{-1/2} + T^{1/2}/N),\\
E\big(\|F_2'\widetilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\Sigma_\star\|_{op}\big) &= O(N^{-1/2}),\\
E\big(\|F_2'\widetilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\widetilde\Sigma\|_{op}\big) &= O(1/N).
\end{aligned}
$$
Therefore, it follows that
$$
\begin{aligned}
&\Big\|E\Big(\frac{1}{NT}\sum_{i=1}^N \Delta_i F_2'\Sigma'\big((\Sigma\Sigma')^{-1} - (\Sigma_\star\Sigma_\star')^{-1}\big)\Sigma K_{X_i}\Big|\mathcal F_1^T\Big)\Big\|\\
&\quad\le \frac{1}{NT}\sum_{i=1}^N E\big(\big\|\Delta_i F_2'\Sigma'\big((\Sigma\Sigma')^{-1} - (\Sigma_\star\Sigma_\star')^{-1}\big)\Sigma K_{X_i}\big\|\,\big|\mathcal F_1^T\big)\\
&\quad\le \frac{1}{NT}\sum_{i=1}^N \|\Delta_i\|_2\, E\Big(\big\|F_2'\Sigma'\big((\Sigma\Sigma')^{-1} - (\Sigma_\star\Sigma_\star')^{-1}\big)\Sigma\big\|_{op}\sqrt{\sum_{t=1}^T \|K_{X_{it}}\|^2}\,\Big|\mathcal F_1^T\Big)\\
&\quad\le \frac{1}{NT}\sum_{i=1}^N \|\Delta_i\|_2\, E\Big(\big\|F_2'\Sigma'\big((\Sigma\Sigma')^{-1} - (\Sigma_\star\Sigma_\star')^{-1}\big)\Sigma\big\|_{op}\sqrt{T c_\varphi^2 h^{-1}}\,\Big|\mathcal F_1^T\Big)\\
&\quad\le \frac{1}{\sqrt T}\sup_{1\le i\le N}\|\Delta_i\|_2\,\sqrt{c_\varphi^2 h^{-1}}\, E\big(\big\|F_2'\Sigma'\big((\Sigma\Sigma')^{-1} - (\Sigma_\star\Sigma_\star')^{-1}\big)\Sigma\big\|_{op}\big|\mathcal F_1^T\big)\\
&\quad= O_P\Big(\frac{1}{\sqrt{NTh}} + \frac{1}{N\sqrt h}\Big).
\end{aligned}
$$


As a consequence, $\|T_{21}\| = O_P((NTh)^{-1/2} + N^{-1}h^{-1/2})$. Similarly,
$$
E\Big\{\Big\|E\Big(\frac{1}{NT}\sum_{i=1}^N \Delta_i F_2'\widetilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\Sigma_\star K_{X_i}\Big|\mathcal F_1^T\Big)\Big\|\Big\} = O_P\Big(\frac{1}{\sqrt{NTh}}\Big),
$$
$$
E\Big\{\Big\|E\Big(\frac{1}{NT}\sum_{i=1}^N \Delta_i F_2'\widetilde\Sigma'(\Sigma_\star\Sigma_\star')^{-1}\widetilde\Sigma K_{X_i}\Big|\mathcal F_1^T\Big)\Big\|\Big\} = O_P\Big(\frac{1}{N\sqrt{Th}}\Big).
$$
So it follows that $\|T_{23}\| = O_P((NTh)^{-1/2})$ and $\|T_{24}\| = O_P(N^{-1}(Th)^{-1/2})$. Finally, we will handle $T_{22}$. Let $W = F_2'\Sigma_\star'(\Sigma_\star\Sigma_\star')^{-1}$. It can easily be seen from (S.10) that $W \in \mathcal F_1^T$ and $\|W\|_{op} = O_P(1)$. To bound $T_{22}$, notice that
$$
\widetilde\Sigma K_{X_i} = \begin{pmatrix} 0_{q_1\times 1}\\[2pt] \sum_{t=1}^T \bar v_t K_{X_{it}} \end{pmatrix}
= \begin{pmatrix} 0_{q_1\times 1}\\[2pt] \sum_{t=1}^T \bar v_{t1} K_{X_{it}}\\[2pt] \sum_{t=1}^T \bar v_{t2} K_{X_{it}}\\[2pt] \vdots\\[2pt] \sum_{t=1}^T \bar v_{td} K_{X_{it}} \end{pmatrix},
$$

where $\bar v_{tl}$ is the $l$th element of the vector $\bar v_t$. By direct calculations, it follows that
$$
\begin{aligned}
\|T_{22}\| &= \Big\|\frac{1}{NT}\sum_{i=1}^N \big\{\Delta_i W\widetilde\Sigma K_{X_i} - E(\Delta_i W\widetilde\Sigma K_{X_i}|\mathcal F_1^T)\big\}\Big\|\\
&\le \frac{1}{NT}\sum_{i=1}^N \big\|\Delta_i W\widetilde\Sigma K_{X_i} - E(\Delta_i W\widetilde\Sigma K_{X_i}|\mathcal F_1^T)\big\|\\
&\le \frac{1}{NT}\sum_{i=1}^N \|\Delta_i\|_2\|W\|_{op}\sqrt{\sum_{l=1}^d \Big\|\sum_{t=1}^T \big(\bar v_{tl}K_{X_{it}} - E(\bar v_{tl}K_{X_{it}}|\mathcal F_1^T)\big)\Big\|^2}.
\end{aligned}
\tag{S.17}
$$
By (S.17), it suffices to find the rate of
$$
\frac{1}{NT}\sum_{i=1}^N \sqrt{\sum_{l=1}^d \Big\|\sum_{t=1}^T \big(\bar v_{tl}K_{X_{it}} - E(\bar v_{tl}K_{X_{it}}|\mathcal F_1^T)\big)\Big\|^2}. \tag{S.18}
$$

Because $d$ is finite and fixed, to simplify the technical arguments we assume $d = 1$ without loss of generality. Direct examination gives the following decomposition:
$$
\begin{aligned}
\Big\|\sum_{t=1}^T \big(\bar v_{tl}K_{X_{it}} - E(\bar v_{tl}K_{X_{it}}|\mathcal F_1^T)\big)\Big\|^2
&= \frac{1}{N^2}\Big\|\sum_{t=1}^T\sum_{j=1}^N \big(v_{jtl}K_{X_{it}} - E(v_{jtl}K_{X_{it}}|\mathcal F_1^T)\big)\Big\|^2\\
&\le \frac{2}{N^2}\Big\|\sum_{t=1}^T\sum_{j\ne i} \big(v_{jtl}K_{X_{it}} - E(v_{jtl}K_{X_{it}}|\mathcal F_1^T)\big)\Big\|^2 + \frac{2}{N^2}\Big\|\sum_{t=1}^T \big(v_{itl}K_{X_{it}} - E(v_{itl}K_{X_{it}}|\mathcal F_1^T)\big)\Big\|^2\\
&\equiv T_{221} + T_{222}.
\end{aligned}
$$


When $i \ne j$, $v_{jtl}$ is independent of $X_{it}$ and $\mathcal F_1^T$, so it follows that
$$
\begin{aligned}
E\Big\{\Big\|\sum_{t=1}^T\sum_{j\ne i} \big(v_{jtl}K_{X_{it}} - E(v_{jtl}K_{X_{it}}|\mathcal F_1^T)\big)\Big\|^2\Big|\mathcal F_1^T\Big\}
&= E\Big\{\Big\|\sum_{t=1}^T\sum_{j\ne i} v_{jtl}K_{X_{it}}\Big\|^2\Big|\mathcal F_1^T\Big\}\\
&= E\Big\{\sum_{t,t'=1}^T\sum_{j,j'\ne i} v_{jtl}v_{j't'l}K(X_{it}, X_{it'})\Big|\mathcal F_1^T\Big\}\\
&= \sum_{t=1}^T\sum_{j\ne i} E(v_{jtl}^2)E\big(K(X_{it}, X_{it})|\mathcal F_1^T\big)\\
&\le NT E(v_{11l}^2)c_\varphi^2 h^{-1}.
\end{aligned}
$$

As a consequence, $T_{221} = O_P(T(Nh)^{-1})$. To deal with $T_{222}$, the Cauchy inequality yields
$$
E\Big\{\Big\|\sum_{t=1}^T v_{itl}K_{X_{it}}\Big\|^2\Big|\mathcal F_1^T\Big\}
\le E\Big\{\sum_{t=1}^T v_{itl}^2 \sum_{t=1}^T \|K_{X_{it}}\|^2\Big|\mathcal F_1^T\Big\}
\le E\Big(\sum_{t=1}^T v_{itl}^2\Big) T c_\varphi^2 h^{-1}
= E(v_{11l}^2) T^2 c_\varphi^2 h^{-1},
$$
which further implies $T_{222} = O_P(T^2(N^2h)^{-1})$. By Jensen's inequality and $d = 1$, it follows that
$$
\begin{aligned}
E\big[\text{(S.18)}\,\big|\mathcal F_1^T\big] &= E\Big[\frac{1}{NT}\sum_{i=1}^N \sqrt{\Big\|\sum_{t=1}^T \big(\bar v_{tl}K_{X_{it}} - E(\bar v_{tl}K_{X_{it}}|\mathcal F_1^T)\big)\Big\|^2}\,\Big|\mathcal F_1^T\Big]\\
&\le \frac{1}{NT}\sum_{i=1}^N \sqrt{E\Big(\Big\|\sum_{t=1}^T \big(\bar v_{tl}K_{X_{it}} - E(\bar v_{tl}K_{X_{it}}|\mathcal F_1^T)\big)\Big\|^2\Big|\mathcal F_1^T\Big)}\\
&\le \sqrt{2c_\varphi^2 E(v_{11l}^2)\Big(\frac{1}{NTh} + \frac{1}{N^2h}\Big)}
= O_P\Big(\frac{1}{\sqrt{NTh}} + \frac{1}{N\sqrt h}\Big).
\end{aligned}
\tag{S.19}
$$
Combining (S.17) and (S.19), it follows that $\|T_{22}\| = O_P((NTh)^{-1/2} + (Nh^{1/2})^{-1})$. As a consequence, we have

$$
\|T_2\| = O_P\Big(\frac{1}{\sqrt{NTh}} + \frac{1}{N\sqrt h}\Big). \tag{S.20}
$$
Combining (S.15), (S.16) and (S.20), it yields that
$$
\|S_{M,\eta}(g_\eta)\| = O_P\Big(\frac{1}{\sqrt{NTh}} + \frac{1}{N\sqrt h}\Big) + o_P(\sqrt\eta).
$$
Proof completed.


Next we will prove Lemmas A.6, A.7 and A.8. For this purpose, let us introduce a set of notation. Define $V_{NT\star}$, $A_{NT\star}$, $V_{NTm\star}$, $A_{NTm\star}$, $H_{NTm\star}$ as follows:
$$
V_{NT\star} = \frac{1}{NT}\sum_{i=1}^N K_{X_i}(x_0)' P_\star K_{X_i}(x_0), \qquad A_{NT\star} = V_{NT\star}^{-1/2},
$$
$$
V_{NTm\star} = \frac{1}{NT}\sum_{i=1}^N \phi_m'\Phi_i' P_\star \Phi_i\phi_m, \qquad A_{NTm\star} = V_{NTm\star}^{-1/2}, \qquad H_{NTm\star} = \frac{1}{NT}\sum_{i=1}^N \Phi_i' P_\star \Phi_i.
$$

Proof of Lemma A.6. Define
$$
Q_{i\star} = E\Big(\frac{\Phi_i' P_\star \Phi_i}{T}\Big|\mathcal F_1^T\Big), \quad \bar Q_\star = \frac{1}{N}\sum_{i=1}^N Q_{i\star}, \quad Q_i = E\Big(\frac{\Phi_i' P \Phi_i}{T}\Big|\mathcal F_1^T\Big), \quad \bar Q = \frac{1}{N}\sum_{i=1}^N Q_i = I_m.
$$
Notice that, conditional on $\mathcal F_1^T$, the $\Phi_i$ are independent. Hence, by Chebyshev's inequality, it follows that
$$
\begin{aligned}
P(\|H_{NTm\star} - \bar Q_\star\|_F > \epsilon\,|\mathcal F_1^T)
&= P\Big(\Big\|\frac{1}{N}\sum_{i=1}^N \Big(\frac{\Phi_i' P_\star \Phi_i}{T} - Q_{i\star}\Big)\Big\|_F > \epsilon\,\Big|\mathcal F_1^T\Big)\\
&\le \frac{1}{\epsilon^2 N^2} E\Big\{\mathrm{Tr}\Big[\Big(\sum_{i=1}^N \Big(\frac{\Phi_i' P_\star \Phi_i}{T} - Q_{i\star}\Big)\Big)^2\Big]\Big|\mathcal F_1^T\Big\}\\
&= \frac{1}{\epsilon^2 N^2} \sum_{i=1}^N \mathrm{Tr}\Big\{E\Big[\Big(\frac{\Phi_i' P_\star \Phi_i}{T} - Q_{i\star}\Big)^2\Big|\mathcal F_1^T\Big]\Big\}\\
&= \frac{1}{\epsilon^2 N^2} \sum_{i=1}^N E\Big(\Big\|\frac{\Phi_i' P_\star \Phi_i}{T} - Q_{i\star}\Big\|_F^2\Big|\mathcal F_1^T\Big)\\
&\le \frac{1}{\epsilon^2 N^2} \sum_{i=1}^N E\Big(\Big\|\frac{\Phi_i' P_\star \Phi_i}{T}\Big\|_F^2\Big|\mathcal F_1^T\Big)\\
&\le \frac{1}{\epsilon^2 N^2 T^2} \sum_{i=1}^N E(\|\Phi_i\|_F^4|\mathcal F_1^T)
\le \frac{m^2(c_\varphi + 1)^4}{\epsilon^2 N}.
\end{aligned}
$$
As a consequence, it follows that
$$
P\Big(\|H_{NTm\star} - \bar Q_\star\|_F > \frac{m(c_\varphi + 1)^2}{\epsilon\sqrt N}\Big|\mathcal F_1^T\Big) \le \epsilon^2.
$$
By taking expectations on both sides, we have
$$
P\Big(\|H_{NTm\star} - \bar Q_\star\|_F > \frac{m(c_\varphi + 1)^2}{\epsilon\sqrt N}\Big) \le \epsilon^2.
$$
Since $c_\varphi = O_P(1)$, we obtain
$$
\|H_{NTm\star} - \bar Q_\star\|_F = O_P(mN^{-1/2}). \tag{S.21}
$$
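The $mN^{-1/2}$ rate in (S.21) can be illustrated numerically. The sketch below is not the paper's setting: it replaces $\Phi_i' P_\star \Phi_i / T$ by i.i.d. bounded outer products $\phi_i\phi_i'$ with $\phi_i \sim U(-1,1)^m$ (a hypothetical stand-in), and checks that the Frobenius deviation of the sample average from its expectation shrinks at the $1/\sqrt N$ rate for fixed $m$:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_frob_dev(N, m, reps=200):
    """Average Frobenius deviation of the sample mean of N i.i.d. outer
    products phi_i phi_i' (phi_i ~ Uniform(-1,1)^m) from its expectation."""
    devs = []
    for _ in range(reps):
        phi = rng.uniform(-1.0, 1.0, size=(N, m))
        outer = np.einsum('ij,ik->ijk', phi, phi)   # N matrices of size m x m
        expected = np.eye(m) / 3.0                  # E[phi phi'] for U(-1,1) entries
        devs.append(np.linalg.norm(outer.mean(axis=0) - expected))
    return float(np.mean(devs))

m = 5
d1 = mean_frob_dev(N=100, m=m)
d2 = mean_frob_dev(N=400, m=m)
print(d1 / d2)   # close to sqrt(400/100) = 2: the 1/sqrt(N) rate
```

Quadrupling $N$ roughly halves the deviation, consistent with the $O_P(mN^{-1/2})$ bound.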


By Lemma A.3, we have
$$
\begin{aligned}
E(\|H_{NTm} - H_{NTm\star}\|_F|\mathcal F_1^T)
&\le \frac{1}{NT}\sum_{i=1}^N E(\|\Phi_i'(P - P_\star)\Phi_i\|_F|\mathcal F_1^T)\\
&\le \frac{1}{NT}\sum_{i=1}^N E(\|P - P_\star\|_{op}\|\Phi_i\|_F^2|\mathcal F_1^T)\\
&\le \frac{1}{NT}\sum_{i=1}^N mTc_\varphi^2 E(\|P - P_\star\|_{op}|\mathcal F_1^T)
= O_P\Big(\frac{m}{\sqrt N}\Big).
\end{aligned}
\tag{S.22}
$$
Again by Lemma A.3 and similar calculations, it follows that
$$
\|\bar Q - \bar Q_\star\|_F = \Big\|E\Big(\frac{1}{NT}\sum_{i=1}^N \Phi_i'(P - P_\star)\Phi_i\Big|\mathcal F_1^T\Big)\Big\|_F
\le \frac{1}{NT}\sum_{i=1}^N E(\|\Phi_i'(P - P_\star)\Phi_i\|_F|\mathcal F_1^T)
= O_P\Big(\frac{m}{\sqrt N}\Big).
\tag{S.23}
$$

Combining (S.21), (S.22) and (S.23), it yields that
$$
\|H_{NTm} - I_m\|_F = \|H_{NTm} - \bar Q\|_F = O_P\Big(\frac{m}{\sqrt N}\Big) = o_P(1).
$$
To end the proof, we quantify the minimal and maximal eigenvalues of $H_{NTm}$ as follows:
$$
\lambda_{\min}(H_{NTm}) = \min_{\|u\|_2=1} u' H_{NTm} u \ge \min_{\|u\|_2=1} u' I_m u - \max_{\|u\|_2=1} |u'(H_{NTm} - I_m)u| = 1 - \|H_{NTm} - I_m\|_{op} = 1 + o_P(1),
$$
and
$$
\lambda_{\max}(H_{NTm}) = \max_{\|u\|_2=1} u' H_{NTm} u \le \max_{\|u\|_2=1} u' I_m u + \max_{\|u\|_2=1} |u'(H_{NTm} - I_m)u| = 1 + \|H_{NTm} - I_m\|_{op} = 1 + o_P(1),
$$
where we have used the trivial inequality $\|H_{NTm} - I_m\|_{op} \le \|H_{NTm} - I_m\|_F = o_P(1)$. Proof completed.


Proof of Lemma A.7. By Lemma A.6, we find a lower bound for $V_{NTm}$ and an upper bound for $A_{NTm}$ as follows:
$$
V_{NTm} = \phi_m' H_{NTm} \phi_m \ge \lambda_{\min}(H_{NTm})\|\phi_m\|_2^2 \ge \lambda_{\min}(H_{NTm})C,
$$
$$
A_{NTm} = V_{NTm}^{-1/2} \le \lambda_{\min}(H_{NTm})^{-1/2}\|\phi_m\|_2^{-1} = O_P(1)\Big(\sum_{\nu=1}^m \frac{\varphi_\nu^2(x_0)}{(1 + \eta\rho_\nu)^2}\Big)^{-1/2} = O_P(1). \tag{S.24}
$$
Define $L_i(x_0) = K_{X_i}(x_0) - \Phi_i\phi_m$. Then it follows that
$$
E(\|L_i\|_2^2|\mathcal F_1^T) \le T c_\varphi^4 \Big(\sum_{\nu=m+1}^\infty \frac{1}{1 + \eta\rho_\nu}\Big)^2 \equiv T c_\varphi^4 D_m^2.
$$
Direct calculation shows that
$$
|V_{NT} - V_{NTm}| \le \Big|\frac{2}{NT}\sum_{i=1}^N L_i' P K_{X_i}\Big| + \Big|\frac{1}{NT}\sum_{i=1}^N L_i' P L_i\Big| \equiv 2|T_1| + |T_2|. \tag{S.25}
$$
Let $R_{x_0}(\cdot) = \sum_{\nu=m+1}^\infty \frac{\varphi_\nu(x_0)\varphi_\nu(\cdot)}{1 + \eta\rho_\nu}$. Notice that $L_i = \tau_i R_{x_0}$ and $E(T_1|\mathcal F_1^T) = V(K_{x_0}, R_{x_0})$. Similar to the proof of Lemma A.6, we can show that
$$
E\big(|T_1 - V(K_{x_0}, R_{x_0})|\,\big|\mathcal F_1^T\big) = O_P\Big(\frac{D_m}{\sqrt{Nh}}\Big).
$$
Meanwhile, we have the following:
$$
V(K_{x_0}, R_{x_0}) = \sum_{\nu=m+1}^\infty \frac{\varphi_\nu^2(x_0)}{(1 + \eta\rho_\nu)^2} \le c_\varphi^2 D_m.
$$
As a consequence, it follows that $|T_1| = O_P(D_m)$. A bound for $T_2$ is given by the following inequality:
$$
E(|T_2|\,|\mathcal F_1^T) \le \frac{1}{NT}\sum_{i=1}^N E(\|L_i\|_2^2|\mathcal F_1^T) = O_P(D_m^2).
$$

So (S.25) becomes $V_{NT} - V_{NTm} = O_P(D_m) = o_P(1)$. Hence $A_{NT} = A_{NTm} + o_P(1) = O_P(1)$, where the last equality is from (S.24). Proof completed.

Proof of Lemma A.8. The proof of this lemma is based on the Lyapunov C.L.T. Let $c_i = A_{NTm} P \Phi_i \phi_m/(NT)$. We have
$$
\sqrt{NT}\, A_{NTm}\Big(\frac{1}{NT}\sum_{i=1}^N \phi_m'\Phi_i' P \varepsilon_i\Big) = \sqrt{NT}\sum_{i=1}^N c_i'\varepsilon_i.
$$
Since $c_i \in \mathcal D_1^T$ and $\varepsilon_i$ is independent of $\mathcal D_1^T$, it follows that
$$
E\Big[\Big(\sqrt{NT}\sum_{i=1}^N c_i'\varepsilon_i\Big)^2\Big|\mathcal D_1^T\Big] = NT\sigma_\varepsilon^2\sum_{i=1}^N c_i'c_i = NT\sigma_\varepsilon^2 A_{NTm}^2 \cdot \frac{1}{N^2T^2}\sum_{i=1}^N \phi_m'\Phi_i' P \Phi_i\phi_m = \sigma_\varepsilon^2.
$$


Let $c_{it}$ be the $t$th element of $c_i$. By direct examination, it follows that
$$
\sum_{i=1}^N E\Big[\big(\sqrt{NT}\, c_i'\varepsilon_i\big)^4\Big|\mathcal D_1^T\Big] = N^2T^2\sum_{i=1}^N\sum_{t=1}^T c_{it}^4 E(\varepsilon_{it}^4) + 3N^2T^2\sum_{i=1}^N\sum_{t=1}^T\sum_{t'\ne t} c_{it}^2 c_{it'}^2 E(\varepsilon_{it}^2\varepsilon_{it'}^2). \tag{S.26}
$$
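The expansion in (S.26) uses only the independence of the $\varepsilon_{it}$ across $t$ and the vanishing of odd cross moments. As a sanity check (an illustration, not part of the proof), the identity can be verified by Monte Carlo for i.i.d. standard normal errors and arbitrary fixed weights $c_t$:

```python
import numpy as np

# Monte Carlo check of the fourth-moment expansion behind (S.26):
# for independent mean-zero eps_t,
#   E[(sum_t c_t eps_t)^4] = sum_t c_t^4 E[eps_t^4]
#                            + 3 * sum_{t != t'} c_t^2 c_{t'}^2 E[eps^2]^2.
rng = np.random.default_rng(1)
T = 6
c = rng.uniform(-1.0, 1.0, size=T)          # arbitrary fixed weights

eps = rng.standard_normal((1_000_000, T))   # i.i.d. N(0,1): E[eps^2]=1, E[eps^4]=3
mc = np.mean((eps @ c) ** 4)                # Monte Carlo fourth moment

s2, s4 = np.sum(c ** 2), np.sum(c ** 4)
exact = 3.0 * s4 + 3.0 * (s2 ** 2 - s4)     # the two terms on the right-hand side
print(mc, exact)                            # the two values agree closely
```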

Next we are going to find a bound for $c_{it}$. By direct calculation, we have
$$
|c_{it}| = \Big|A_{NTm}\frac{1}{NT}\, p_{t\cdot}\Phi_i\phi_m\Big| \le \|A_{NTm}\phi_m\|_2 \Big\|\frac{1}{NT}\sum_{s=1}^T p_{ts}\Phi_{i,s\cdot}\Big\|_2 \le \lambda_{\min}(H_{NTm})^{-1/2} \frac{1}{NT}\Big\|\sum_{s=1}^T p_{ts}\Phi_{i,s\cdot}\Big\|_2,
$$
where $p_{t\cdot}$ is the $t$th row of $P$, $p_{ts}$ is the $(t,s)$th element of $P$, and $\Phi_{i,s\cdot}$ is the $s$th row of $\Phi_i$. Meanwhile, $p_{ts} = \delta_{ts} - Z_s'(\Sigma\Sigma')^{-1}Z_t$, hence
$$
\begin{aligned}
\Big\|\sum_{s=1}^T p_{ts}\Phi_{i,s\cdot}\Big\|_2
&= \Big\|\Phi_{i,t\cdot} - \frac{1}{T}\, Z_t'\Big(\frac{\Sigma\Sigma'}{T}\Big)^{-1}\sum_{s=1}^T Z_s\Phi_{i,s\cdot}\Big\|_2\\
&\le \|\Phi_{i,t\cdot}\|_2 + \|Z_t\|_2\Big\|\Big(\frac{\Sigma\Sigma'}{T}\Big)^{-1}\Big\|_{op}\sqrt{\frac{1}{T}\sum_{s=1}^T \|Z_s\|_2^2}\sqrt{\frac{1}{T}\sum_{s=1}^T \|\Phi_{i,s\cdot}\|_2^2}\\
&\le \sqrt{mc_\varphi^2} + \|Z_t\|_2\Big\|\Big(\frac{\Sigma\Sigma'}{T}\Big)^{-1}\Big\|_{op}\sqrt{\frac{1}{T}\sum_{s=1}^T \|Z_s\|_2^2}\,\sqrt{mc_\varphi^2}\\
&\le \sqrt{mc_\varphi^2}\,(1 + b\|Z_t\|_2),
\end{aligned}
$$
where $b = \|(\frac{\Sigma\Sigma'}{T})^{-1}\|_{op}\sqrt{\frac{1}{T}\sum_{s=1}^T \|Z_s\|_2^2} = O_P(1)$ by Assumption A5. So $|c_{it}| \le a(1 + b\|Z_t\|_2)$, where $a = \lambda_{\min}(H_{NTm})^{-1/2}\frac{1}{NT}\sqrt{mc_\varphi^2}$. By Lemma A.6, we have

$$
\sum_{i=1}^N\sum_{t=1}^T c_{it}^4 \le \sum_{i=1}^N\sum_{t=1}^T 8a^4(1 + b^4\|Z_t\|_2^4) = O_P\Big(\frac{m^2}{N^3T^3}\Big), \tag{S.27}
$$
and
$$
\sum_{i=1}^N\sum_{t=1}^T\sum_{t'\ne t} c_{it}^2 c_{it'}^2 \le \sum_{i=1}^N \sqrt{\sum_{t=1}^T\sum_{t'\ne t} c_{it}^4}\sqrt{\sum_{t=1}^T\sum_{t'\ne t} c_{it'}^4} \le T\sum_{i=1}^N\sum_{t=1}^T c_{it}^4 = O_P\Big(\frac{m^2}{N^3T^2}\Big). \tag{S.28}
$$
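The constant 8 in (S.27) comes from the elementary bound $(1+x)^4 \le 8(1+x^4)$ for $x \ge 0$, obtained by applying $(u+v)^2 \le 2(u^2+v^2)$ twice:

```latex
(1+x)^4 = \bigl((1+x)^2\bigr)^2 \le \bigl(2(1+x^2)\bigr)^2 = 4(1+x^2)^2 \le 4\cdot 2(1+x^4) = 8(1+x^4).
```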

Combining (S.26), (S.27) and (S.28), we have $\sum_{i=1}^N E[(\sqrt{NT}\, c_i'\varepsilon_i)^4|\mathcal D_1^T] = O_P(m^2/N)$. And by the Lyapunov C.L.T., the result follows. Proof completed.