Consistent Model Selection for Marginal Generalized Additive Model for Correlated Data

Lan Xue¹, Department of Statistics, Oregon State University, Corvallis, OR 97331

Annie Qu, Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL 61820

Jianhui Zhou, Department of Statistics, University of Virginia, Charlottesville, VA 22904

¹ Xue's research was supported by the National Science Foundation (DMS-0906739). Qu's research was supported by the National Science Foundation (DMS-0906660). Zhou's research was supported by the National Science Foundation (DMS-0906665).

Abstract

We consider the generalized additive model when responses from the same cluster are correlated. Incorporating correlation in the estimation of nonparametric components for the generalized additive model is important, since it improves estimation efficiency and increases statistical power for model selection. In our setting, there is no specified likelihood function for the generalized additive model, since the outcomes can be non-normal and discrete, which makes estimation and model selection very challenging. We propose consistent estimation and model selection procedures which incorporate the correlation structure. We establish an asymptotic property with $L_2$-norm consistency for the nonparametric components, which achieves the optimal rate of convergence. In addition, the proposed model selection strategy is able to select the correct generalized additive model consistently; that is, with probability approaching 1, the estimators for the zero function components converge to 0 almost surely. We illustrate our method using numerical studies with both continuous and binary responses, and a real data application of binary periodontal data. Supplemental materials, including technical details, are available online.

Keywords: $L_2$-norm consistency; model selection; nonparametric function; oracle property; polynomial spline; quadratic inference function; SCAD.

Running title: Correlated Generalized Additive Model.

1 Introduction

Longitudinal or correlated data occur very frequently in biomedical studies. The challenge in analyzing correlated data is that the likelihood function is difficult to specify or formulate for non-normal responses with large cluster size. To allow richer and more flexible model structures, the generalized additive model (Hastie and Tibshirani, 1990) takes the additive form of nonparametric functions. This is appealing for both model interpretation and prediction. However, when some of the predictor variables are redundant, traditional estimation methods for the generalized additive model (Stone 1985, Linton and Härdle 1996, Horowitz and Mammen 2007) are unable to produce an accurate and efficient estimator. Since even a single predictor variable is typically associated with a large number of unknown parameters in the nonparametric functions, inclusion of redundant variables can hinder accuracy and efficiency for both estimation and inference. Therefore, variable selection is a crucial step in generalized additive modeling.

In general, even when the data are not correlated, variable selection with discrete responses is quite challenging, since it involves numerical approximation and can be computationally intensive (Meier, van de Geer and Bühlmann, 2008). Variable selection for the generalized additive model is more challenging still, since the dimension of the nonparametric components can be infinite. Huang, Horowitz and Wei (2009) propose additive model selection using B-spline bases when the number of variables and additive components is larger than the sample size. Meier, van de Geer and Bühlmann (2009) develop estimation procedures for high-dimensional generalized additive models by penalizing the sparse terms. Other approaches to variable selection in additive models include Huang and Yang (2004) and Xue (2009). However, these approaches either require the specification of the likelihood function, or mainly target continuous outcomes using least squares. In general, these methods cannot handle correlated discrete outcomes in model selection and estimation, since there is no closed form of the likelihood function, and/or it is practically infeasible to perform numerical approximation for infinite-dimensional problems. In this paper, we are interested in developing a general framework for estimation and model selection for generalized additive models which can handle correlated categorical responses in addition to continuous ones.


Furthermore, the current literature on incorporating correlation for the generalized additive model is rather limited. This is mainly because nonparametric modeling can be highly computationally intensive and introduces high-dimensional nuisance parameters associated with the nonparametric forms, which makes it difficult to incorporate additional correlation structure into the model. However, ignoring correlation can lead to inefficient estimation and impair statistical power for selecting the correct model. In addition, for smoothing spline and seemingly unrelated kernel approaches (Wang, 2003), ignoring the correlation can also lead to biased estimation (Zhu, Fung and He, 2008). Specifically, Wang (2003) shows that selection of the smoothing parameter can fail, since it is rather sensitive to even a small departure from the true correlation structure; this is reflected in over-fitting of the nonparametric estimator in order to reduce the overall bias. In contrast to the parametric setting, these problems can be more serious for the nonparametric generalized additive model, since the true model might not be easily verified.

To incorporate correlation, Berhane and Tibshirani (1998) proposed estimation of the generalized additive model using a smoothing spline generalized estimating equation (GEE; Liang and Zeger, 1986) approach; however, it does not address the variable selection problem. Wang, Li and Huang (2008) proposed a nonparametric varying coefficient model for longitudinal data, but it is mainly applicable to continuous outcomes. The quadratic inference function approach (QIF; Qu, Lindsay and Li, 2000) can efficiently take the within-cluster correlation into account and is more efficient than the GEE approach when the working correlation is misspecified.

In this paper, we propose consistent estimation and variable selection based on the penalized quadratic inference function for the generalized additive model for correlated data. The proposed method is able to shrink small coefficients of the nonparametric functional components to zero. In addition, it effectively takes the within-cluster correlation into consideration and provides efficient estimators for the non-zero components. We provide an asymptotic consistency property for nonparametric function estimation, and establish the optimal rate of convergence. Huang (1998) and Xue and Yang (2006) established asymptotic theory for additive models with polynomial splines when the responses are continuous and independent; however, the asymptotic properties have not been well studied in the current literature for the correlated generalized additive model using polynomial splines. The theoretical technique for the spline approach is very different from the kernel approach (Horowitz, 2001) for the generalized additive model. In addition, since the objective function we minimize is a penalized quadratic distance function, the theoretical proof is very different from the penalized least squares approach (Xue, 2009). The techniques involved in developing the asymptotic properties are quite challenging because of the curse of dimensionality of the nonparametric functions, the nonlinear relation between the response and covariates, and the correlated nature of the measurements.

The advantage of our approach is that model selection is accomplished through SCAD penalization (Fan and Li, 2001), and it achieves the oracle property. That is, the proposed method is able to select the correct generalized additive model consistently, and the estimators of the non-zero components achieve the same rate of $L_2$ convergence as if the true model were known in advance. In addition, since the nonparametric function forms are selected group-wise for each functional component, the choice of the basis functions of the polynomial spline is not critical to the final model selection in our approach. Moreover, our numerical studies show that our approach performs extremely well in estimation efficiency and model selection when the dimension of functional components is high.

The paper is organized as follows. Section 2 describes a marginal framework for the generalized additive model for correlated data. Section 3 introduces a penalized quadratic inference function method for simultaneous estimation and variable selection of marginal generalized additive models; asymptotic theory is developed and implementation issues are discussed. Section 4 illustrates simulation studies for continuous and binary responses. Section 5 demonstrates model selection and estimation for binary periodontal data. The last section provides concluding remarks and discussion. The proofs are given in the Appendix, and more technical details are available in the online supplemental files.

2 A marginal generalized additive model for longitudinal data

Suppose there are $n$ clusters, with the $i$th ($i = 1, \ldots, n$) cluster containing $m_i$ observations. Let $Y_{ij}$ be a response variable, and $X_{ij} = (X_{ij}^{(1)}, \ldots, X_{ij}^{(d)})^T$ a $d \times 1$ vector of covariates for the $j$th ($j = 1, \ldots, m_i$) observation in the $i$th cluster. Denote
$$Y_i = (Y_{i1}, \ldots, Y_{im_i})^T, \qquad X_i = (X_{i1}^T, \ldots, X_{im_i}^T)^T.$$
Then the basic assumption of the marginal generalized additive model is that observations from different clusters are independent, and the first two moments of $Y_{ij}$ are specified as $E(Y_{ij}|X_{ij}, X_i) = E(Y_{ij}|X_{ij}) = \mu_{ij}^0$ and $\mathrm{Var}(Y_{ij}|X_{ij}, X_i) = \mathrm{Var}(Y_{ij}|X_{ij}) = \phi V(\mu_{ij}^0)$, where $\phi$ is a scale parameter, $V(\cdot)$ is a known variance function, and the marginal mean $\mu_{ij}^0$ relates to the covariates through the known link function $g(\cdot)$,
$$\eta_{ij}^0 = g(\mu_{ij}^0) = \alpha_0 + \sum_{l=1}^d \alpha_l(x_{ij}^{(l)}), \qquad \mu_{ij}^0 = g^{-1}(\eta_{ij}^0). \tag{1}$$
Here $\alpha_0$ is an unknown constant, and $\{\alpha_l(\cdot)\}_{l=1}^d$ are unknown smooth functions. Since the estimation of nonparametric functions is usually obtained on a compact support, we assume without loss of generality that the covariates $X_{ij}^{(l)}$ can be scaled into the interval $[0, 1]$, for $1 \le l \le d$. For model (1) to be identifiable, we assume without loss of generality that each $\alpha_l(\cdot)$ is centered, with $\int_0^1 \alpha_l(x)\,dx = 0$.

In model (1), the additive nonparametric functions are formulated to model the covariate effects. If each $\alpha_l(\cdot)$ is a linear function, then model (1) reduces to the generalized linear model for clustered data. Berhane and Tibshirani (1998) extended the generalized estimating equation approach and applied the smoothing spline to estimate model (1). We propose an efficient estimation procedure which approximates the nonparametric functions using polynomial splines. In addition, within-cluster correlation is incorporated to obtain efficient parameter estimation via the quadratic inference function (Qu et al., 2000). Unlike the GEE approach (Berhane and Tibshirani, 1998), our proposed method does not require estimation of the working correlation parameters, and is more efficient than the GEE approach when the working correlation is misspecified.

Another focus of this paper is to perform variable selection for the marginal generalized additive model (1). With advanced technology, we frequently encounter massive high-throughput data with high-dimensional covariates. Identification of important variables is a crucial step in analyzing high-dimensional data, since each redundant variable involves an infinite-dimensional parameter for its nonparametric component. Here a predictor variable $X_l$ is said to be redundant in model (1) if and only if $\alpha_l(X_l) = 0$ almost surely; otherwise, the predictor variable $X_l$ is said to be relevant. Suppose only an unknown subset of predictor variables in model (1) is relevant. Our goal is to consistently identify this subset of relevant variables and, at the same time, estimate the corresponding unknown function components in (1).

3 Methodology and theory

In our estimation procedure, we approximate the smooth functions $\{\alpha_l(\cdot)\}_{l=1}^d$ in (1) by polynomial splines. For each $1 \le l \le d$, let $\upsilon_l$ be a partition of the interval $[0, 1]$ with $N_n$ interior knots,
$$\upsilon_l = \{0 = \upsilon_{l,0} < \upsilon_{l,1} < \cdots < \upsilon_{l,N_n} < \upsilon_{l,N_n+1} = 1\}.$$
Using $\upsilon_l$ as knots, the polynomial splines of order $p+1$ are functions that are polynomials of degree $p$ (or less) on the intervals $[\upsilon_{l,i}, \upsilon_{l,i+1})$, $i = 0, \ldots, N_n - 1$, and $[\upsilon_{l,N_n}, \upsilon_{l,N_n+1}]$, and have $p-1$ continuous derivatives globally. We denote the space of such spline functions by $\varphi_l = \varphi^p([0,1], \upsilon_l)$, and let $\varphi_l^0 = \{s \in \varphi_l : \int_0^1 s(x)\,dx = 0\}$, which consists of the centered spline functions. The advantage of polynomial splines is that they often provide a good approximation of smooth functions with a limited number of knots.
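As an illustration of this construction, here is a minimal sketch (ours, not the paper's code) of a centered B-spline basis on $[0,1]$ using scipy; it replaces the integral constraint $\int_0^1 s(x)\,dx = 0$ with empirical centering, and the function name is our own.

```python
# A sketch of a centered B-spline basis on [0, 1]: N_n interior knots, splines
# of order p + 1 (degree p), leaving J_n = N_n + p basis functions once the
# constant direction is removed by centering. Names here are illustrative.
import numpy as np
from scipy.interpolate import BSpline

def centered_bspline_basis(x, n_interior=4, degree=3):
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    knots = np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])
    n_raw = len(knots) - degree - 1                  # N_n + p + 1 raw B-splines
    raw = BSpline(knots, np.eye(n_raw), degree)(x)   # one column per raw B-spline
    centered = raw - raw.mean(axis=0)                # empirical version of centering
    return centered[:, 1:]                           # drop one redundant column -> J_n

x = np.random.default_rng(0).uniform(size=200)
B = centered_bspline_basis(x)                        # shape (200, 7): J_n = 4 + 3
assert np.allclose(B.mean(axis=0), 0.0)
```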

3.1 An Initial Estimator

Let $\{B_{lj}(\cdot)\}_{j=1}^{J_n}$ be a set of spline bases of $\varphi_l^0$ with $J_n = N_n + p$. Then
$$\alpha_l(\cdot) \approx s_l(\cdot) = \sum_{j=1}^{J_n} \beta_{lj} B_{lj}(\cdot),$$
with a set of coefficients $\beta_l = (\beta_{l1}, \ldots, \beta_{lJ_n})^T$. As a result,
$$\eta_{ij}^0 \approx \eta_{ij}(\beta) = \beta_0 + \sum_{l=1}^d \sum_{j=1}^{J_n} \beta_{lj} B_{lj}(x_{ij}^{(l)}),$$
where $\beta = (\beta_0, \beta_1^T, \ldots, \beta_d^T)^T$. Equivalently, the mean function $\mu_{ij}^0$ in (1) can be approximated by
$$\mu_{ij}^0 \approx \mu_{ij}(\beta) = g^{-1}\Big\{\beta_0 + \sum_{l=1}^d \sum_{j=1}^{J_n} \beta_{lj} B_{lj}(x_{ij}^{(l)})\Big\}.$$
In matrix notation, we write $\mu_i(\beta) = \{\mu_{i1}(\beta), \ldots, \mu_{im_i}(\beta)\}^T$. To estimate the unknown coefficients $\beta$, we apply the quadratic inference function (Qu et al., 2000), which efficiently incorporates the within-cluster correlation. To simplify notation, we first assume equal cluster sizes with $m_i = m < \infty$; the implementation for data with unequal cluster sizes is discussed in Subsection 3.5.

Let $R$ be a common working correlation matrix. The QIF approach approximates the inverse of $R$ by a linear combination of basis matrices, i.e.,
$$R^{-1} \approx a_0 I + a_1 M_1 + \cdots + a_K M_K,$$
where $I$ is the identity matrix and the $M_i$ are known symmetric basis matrices. For example, if $R$ has a compound symmetry structure with correlation $\rho$, then $R^{-1}$ can be represented as $a_0 I + a_1 M_1$, with $M_1$ being a matrix with 0 on the diagonal and 1 off the diagonal, $a_0 = -\{(m-2)\rho + 1\}/k_1$, and $a_1 = \rho/k_1$, where $k_1 = (m-1)\rho^2 - (m-2)\rho - 1$ and $m$ is the dimension of $R$. The linear representation of $R^{-1}$ is also applicable for the AR-1 and block diagonal correlation structures. The advantage of the QIF approach is that it does not require estimation of the linear coefficients $a_i$ associated with the working correlation matrix, which are treated as nuisance parameters here.
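As a quick numerical sanity check on these coefficients (our illustration, not part of the paper), one can verify the exchangeable representation directly:

```python
# Check that the inverse of an exchangeable working correlation R is exactly
# a0*I + a1*M1, with the coefficients given in the text.
import numpy as np

m, rho = 5, 0.4
R = (1 - rho) * np.eye(m) + rho * np.ones((m, m))   # compound symmetry
M1 = np.ones((m, m)) - np.eye(m)                    # 0 diagonal, 1 off-diagonal

k1 = (m - 1) * rho**2 - (m - 2) * rho - 1
a0 = -((m - 2) * rho + 1) / k1
a1 = rho / k1

assert np.allclose(np.linalg.inv(R), a0 * np.eye(m) + a1 * M1)
print("R^{-1} = a0*I + a1*M1 verified for m =", m, "rho =", rho)
```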

For any $x = (x_1, \ldots, x_d)^T$, let $B(x) = \{1, B_{11}(x_1), \ldots, B_{1J_n}(x_1), \ldots, B_{d1}(x_d), \ldots, B_{dJ_n}(x_d)\}^T$, and denote $B_i = \{B(x_{i1}), \ldots, B(x_{im_i})\}^T$ for $i = 1, \ldots, n$. Define
$$G_n(\beta) = \frac{1}{n}\sum_{i=1}^n g_i(\beta) = \frac{1}{n}\sum_{i=1}^n
\begin{pmatrix}
B_i^T \Delta_i^T(\beta)\, A_i^{-1}\{Y_i - \mu_i(\beta)\} \\
B_i^T \Delta_i^T(\beta)\, A_i^{-1/2} M_1 A_i^{-1/2}\{Y_i - \mu_i(\beta)\} \\
\vdots \\
B_i^T \Delta_i^T(\beta)\, A_i^{-1/2} M_K A_i^{-1/2}\{Y_i - \mu_i(\beta)\}
\end{pmatrix},$$
with $\Delta_i(\beta) = \mathrm{diag}\{\dot\mu_{i1}(\beta), \ldots, \dot\mu_{im_i}(\beta)\}$, where $\dot\mu_{ij}(\beta)$ is the first derivative of $g^{-1}$ evaluated at $B^T(x_{ij})\beta$, and $A_i = \mathrm{diag}\{V(\mu_{i1}), \ldots, V(\mu_{im_i})\}$. Since there are more estimating equations than unknown parameters, the QIF approach estimates $\beta$ by setting $G_n$ as close to zero as possible, in the sense of minimizing the quadratic inference function $Q_n(\beta)$; that is,
$$\tilde\beta = \mathop{\mathrm{argmin}}_{\beta} Q_n(\beta) = \mathop{\mathrm{argmin}}_{\beta}\ n\, G_n^T(\beta)\, C_n^{-1}(\beta)\, G_n(\beta), \tag{2}$$
where
$$C_n(\beta) = \frac{1}{n}\sum_{i=1}^n g_i(\beta)\, g_i^T(\beta).$$
As a result, for any $x \in [0,1]^d$, the estimator of the unknown components in (1) is given as
$$\tilde\alpha_0 = \tilde\beta_0, \qquad \tilde\alpha_l(x^{(l)}) = \sum_{j=1}^{J_n} \tilde\beta_{lj} B_{lj}(x^{(l)}), \quad l = 1, \ldots, d,$$
and
$$\tilde\alpha(x) = \tilde\alpha_0 + \sum_{l=1}^d \tilde\alpha_l(x^{(l)}). \tag{3}$$

Using a spline basis, we convert a problem in the continuum to one that is governed by only a finite number of parameters. For any given set of spline bases, $\tilde\beta$ can be obtained using the Newton-Raphson algorithm developed in Qu et al. (2000). QIF estimation through spline basis expansion produces functional estimators which achieve the optimal nonparametric rate of convergence, as shown in Theorem 1. This estimator can be used in its own right, or as the initial value for the iterative scheme developed in the next subsection for simultaneous variable selection and functional estimation.

To study the rate of convergence of $\tilde\alpha_0$, $\tilde\alpha_l$, and $\tilde\alpha$, we first introduce some notation and present regularity conditions. We assume equal cluster sizes ($m_i = m < \infty$), and that $(Y_i, X_i)$, $i = 1, \ldots, n$, are i.i.d. copies of $(Y, X)$ with $Y = (Y_1, \ldots, Y_m)^T$ and $X = (X_1^T, \ldots, X_m^T)^T$. The asymptotic results still hold for data with unequal cluster sizes $m_i$, using the cluster-specific transformation discussed later in Subsection 3.5. For any matrix $A$, $\|A\|$ denotes the modulus of the largest singular value of $A$. To prove the theoretical arguments, we need the following assumptions:

(C1) The covariates $X = (X_1^T, \ldots, X_m^T)^T$ are compactly supported, and without loss of generality we assume that each $X_j$ has support $\chi = [0,1]^d$. The density of $X_j$, denoted by $f_j(x)$, is absolutely continuous, and there exist constants $c_1$ and $c_2$ such that $0 < c_1 \le \min_{x\in\chi} f_j(x) \le \max_{x\in\chi} f_j(x) \le c_2 < \infty$ for all $j = 1, \ldots, m$.

(C2) For each $l = 1, \ldots, d$, $\alpha_l(\cdot)$ has $r$ continuous derivatives for some $r \ge 2$.

(C3) Let $e = Y - \mu^0(X)$. Then $\Sigma = E(ee^T)$ is positive definite, and $E\|e\|^{2+\delta} < +\infty$ for some $\delta > 0$.

(C4) The knot sequences $\upsilon_l = \{0 = \upsilon_{l,0} \le \upsilon_{l,1} \le \cdots \le \upsilon_{l,N_n} \le \upsilon_{l,N_n+1} = 1\}$, $l = 1, \ldots, d$, are quasi-uniform; that is, there exists $c_3 > 0$ such that
$$\max_{l=1,\ldots,d} \frac{\max(\upsilon_{l,j+1} - \upsilon_{l,j},\ j = 0, \ldots, N_n)}{\min(\upsilon_{l,j+1} - \upsilon_{l,j},\ j = 0, \ldots, N_n)} \le c_3.$$
Furthermore, the number of interior knots $N_n \asymp n^{1/(2r+1)}$, where $\asymp$ means both sides have the same order. In particular, let $h = 1/N_n \asymp n^{-1/(2r+1)}$.

(C5) Let $\Gamma_0^{(k)} = \Delta_0^T V_0^{(k)} \Sigma V_0^{(k)} \Delta_0$, where $V_0^{(k)} = A_0^{-1/2} M_k A_0^{-1/2}$, and $\Delta_0$, $A_0$ are evaluated at $\mu^0(X)$. Then for sufficiently large $n$, the eigenvalues of $E(\Gamma_0^{(k)})$ are bounded away from 0 and infinity for any $k = 0, \ldots, K$.

(C6) There exists a positive constant $c_4$ such that $0 < c_4 \le \inf_{i,j} V(\mu_{ij}) \le \sup_{i,j} V(\mu_{ij}) < \infty$. The functions $V(\cdot)$ and $g^{-1}(\cdot)$ have bounded second derivatives.

(C7) Let $M = (M_0^T, \ldots, M_K^T)^T$. Assume the moduli of the singular values of $M$ are bounded away from 0 and infinity.

These conditions are common in the polynomial spline estimation literature. Assumptions similar to (C1)-(C4) are also considered in Huang (1998), Huang (2003) and Xue (2009). In particular, the smoothness condition in (C2) determines the rate of convergence of the spline estimators $\tilde\alpha_0$, $\tilde\alpha_l$, and $\tilde\alpha$. Conditions (C5) and (C6) are similar to assumptions (A3) and (A4) in He et al. (2005), and can be easily verified for a broad family of distributions. Condition (C7) is needed so that the weighting matrix $C_n$ in (2) remains positive definite asymptotically; it holds for the choice of basis matrices of the exchangeable and AR-1 correlation structures discussed in this subsection.

Theorem 1. Under conditions (C1)-(C7), there exists a local minimizer of (2) such that
$$|\tilde\alpha_0 - \alpha_0| = O_p(n^{-r/(2r+1)}), \qquad \max_{1\le l\le d}\frac{1}{n}\sum_{i=1}^n\frac{1}{m_i}\sum_{j=1}^{m_i}\big\{\tilde\alpha_l(x_{ij}^{(l)}) - \alpha_l(x_{ij}^{(l)})\big\}^2 = O_p(n^{-2r/(2r+1)}).$$

For $m_i = m = 1$, this reduces to the special case where the responses are i.i.d. The rate of convergence given here is the same optimal rate as that obtained for polynomial spline regression with independent data (Huang 1998, Xue 2009). The main advantage of the QIF approach is that it incorporates within-cluster correlation by optimally combining estimating equations without estimating the correlation parameters. However, when there are redundant predictor variables, it fails to produce efficient and accurate estimators due to model complexity. In the next subsection, we propose the penalized QIF method, which automatically sets small estimated functions to zero and selects a parsimonious model.
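To make the construction of $Q_n$ in (2) concrete, the following is a minimal sketch (our illustration, not the authors' code) under simplifying assumptions: identity link and unit variance function $V \equiv 1$, so that $\Delta_i = A_i = I$ and the extended score reduces to stacked terms $B_i^T M_k \{Y_i - B_i\beta\}$.

```python
# A schematic Q_n, assuming an identity link and V(mu) = 1; M_mats must include
# the identity as its first element (the k = 0 term of the extended score), and
# C_n is assumed nonsingular (n large relative to its dimension).
import numpy as np

def qif_objective(beta, B_mats, Y_vecs, M_mats):
    """B_mats[i]: (m, q) spline design for cluster i; Y_vecs[i]: (m,) responses."""
    n = len(B_mats)
    scores = []
    for Bi, Yi in zip(B_mats, Y_vecs):
        resid = Yi - Bi @ beta                       # mu_i(beta) = B_i beta here
        scores.append(np.concatenate([Bi.T @ (Mk @ resid) for Mk in M_mats]))
    g = np.vstack(scores)                            # rows are g_i(beta)
    G = g.mean(axis=0)                               # G_n(beta)
    C = g.T @ g / n                                  # C_n(beta)
    return n * G @ np.linalg.solve(C, G)             # Q_n = n G^T C^{-1} G
```

Minimizing this objective over $\beta$, for example by Newton-Raphson, yields $\tilde\beta$; in the general case, $\Delta_i(\beta)$ and $A_i$ enter the scores exactly as in the display defining $G_n$.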

3.2 Penalized QIF for Marginal Generalized Additive Models

We propose a new variable selection approach by penalizing the quadratic inference function in (2). This enables one to perform simultaneous estimation and variable selection in the generalized additive model. The proposed method is able to shrink small components of estimated functions to zero, thus performing variable selection. In addition, it produces efficient estimators of the non-zero components by taking the within-cluster correlation into consideration. The penalized quadratic inference function estimator is defined as
$$\hat\beta = \mathop{\mathrm{argmin}}_{\beta}\Big\{Q_n(\beta) + n\sum_{l=1}^d p_{\lambda_n}(\|\beta_l\|_{K_l})\Big\}, \tag{4}$$
in which $p_{\lambda_n}(\cdot)$ is a given penalty function depending on a tuning parameter $\lambda_n$, and $\|\beta_l\|_{K_l}^2 = \beta_l^T K_l \beta_l$, where $K_l = \frac{1}{n}\sum_{i=1}^n \frac{1}{m_i}\sum_{j=1}^{m_i} B_l(X_{ij}^{(l)})\, B_l^T(X_{ij}^{(l)})$ with $B_l(\cdot) = (B_{l1}(\cdot), \ldots, B_{lJ_n}(\cdot))^T$. Note that
$$\|\beta_l\|_{K_l} = \Big\{\frac{1}{n}\sum_{i=1}^n \frac{1}{m_i}\sum_{j=1}^{m_i} s_l^2(X_{ij}^{(l)})\Big\}^{1/2} = \|s_l\|_n,$$
which is the empirical norm of the spline function $s_l$. The reason we choose to penalize $\|s_l\|_n$ is that it does not depend on a particular choice of spline bases, and intuitively, shrinking $\|s_l\|_n$ to 0 entails that $s_l$ is 0 almost surely. Note that the proposed penalization strategy achieves the same effect as the group-wise model selection approach (Yuan and Lin, 2006), since each additive component, which contains many coefficients, is treated as a whole group in model selection.

There are a number of choices for the penalty function $p_{\lambda_n}(\cdot)$. For linear models, the $L_1$ penalty $p_{\lambda_n}(|\cdot|) = \lambda_n|\cdot|$ provides a LASSO-type estimator, and the $L_2$ penalty $p_{\lambda_n}(|\cdot|) = \lambda_n|\cdot|^2$ gives a ridge-type estimator. Here we choose the smoothly clipped absolute deviation (SCAD; Fan and Li, 2001) penalty, which is defined through the derivative of $p_{\lambda_n}$,
$$p'_{\lambda_n}(\theta) = \lambda_n\Big\{I(\theta \le \lambda_n) + \frac{(a\lambda_n - \theta)_+}{(a-1)\lambda_n}\, I(\theta > \lambda_n)\Big\},$$
where $a$ is a constant usually set to $a = 3.7$ (Fan and Li 2001, Xue 2009), and $\lambda_n > 0$ is a tuning parameter. After obtaining the estimator $\hat\beta$ through penalization in (4), for any given $x \in [0,1]^d$, an estimator of the unknown components in (1) is given as
$$\hat\alpha_0 = \hat\beta_0, \qquad \hat\alpha_l(x^{(l)}) = \sum_{j=1}^{J_n} \hat\beta_{lj} B_{lj}(x^{(l)}), \quad l = 1, \ldots, d.$$
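As a direct transcription of this definition (a sketch in our own notation), the SCAD derivative can be coded as follows.

```python
# The SCAD derivative p'_lambda(theta) for theta >= 0, transcribed from the
# display above, with the usual default a = 3.7.
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    theta = np.asarray(theta, dtype=float)
    return lam * ((theta <= lam).astype(float)
                  + np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam)
                  * (theta > lam))
```

Note that $p'_{\lambda_n}(\theta)$ vanishes for $\theta \ge a\lambda_n$, which is what leaves large non-zero components essentially unpenalized.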

Fan and Li (2001) established the asymptotic property of the regression parameter estimator using SCAD for the linear model. In the following, we establish the asymptotic properties of the estimators of the nonparametric components for the marginal generalized additive model. We denote $\eta^0(x_{ij}) = \alpha_0 + \sum_{l=1}^d \alpha_l(x_{ij}^{(l)}) = \alpha_0 + \sum_{l=1}^s \alpha_l(x_{ij}^{(l)}) + \sum_{l=s+1}^d \alpha_l(x_{ij}^{(l)})$, where, without loss of generality, $\alpha_l = 0$ almost surely for $l = s+1, \ldots, d$, and $s$ is the total number of nonzero function components. We first show that the penalized QIF estimators $\{\hat\alpha_l\}_{l=1}^d$ achieve the same rate of convergence as the un-penalized ones $\{\tilde\alpha_l\}_{l=1}^d$. Furthermore, we prove that the penalized estimators $\{\hat\alpha_l\}_{l=1}^d$ possess the sparsity property, that is, $\hat\alpha_l = 0$ almost surely for $l = s+1, \ldots, d$. The sparsity property ensures that the proposed model selection is consistent; that is, it selects the correct variables with probability tending to 1 as the sample size goes to infinity.

Theorem 2. Under the same assumptions as Theorem 1, if the tuning parameter $\lambda_n \to 0$, then there exists a local minimizer of (4) such that
$$|\hat\alpha_0 - \alpha_0| = O_p(n^{-r/(2r+1)}), \qquad \max_{1\le l\le d}\frac{1}{n}\sum_{i=1}^n\frac{1}{m_i}\sum_{j=1}^{m_i}\big\{\hat\alpha_l(x_{ij}^{(l)}) - \alpha_l(x_{ij}^{(l)})\big\}^2 = O_p(n^{-2r/(2r+1)}).$$

Theorem 3. Under the same assumptions as Theorem 1, if the tuning parameter $\lambda_n \to 0$ and $\lambda_n n^{r/(2r+1)} \to +\infty$, then with probability approaching 1, $\hat\alpha_l = 0$ almost surely for $l = s+1, \ldots, d$.

Theorem 3 also implies that the above generalized additive model selection possesses the consistency property. The results in Theorems 2 and 3 are similar to those for penalized least squares established in Xue (2009); however, the theoretical proof is very different from the penalized least squares approach, due to the nonlinear structure of the problem. A more thorough study of the properties of quadratic inference functions in the infinite-dimensional case is therefore necessary to overcome these difficulties.

3.3 An Algorithm Using Local Quadratic Approximation

In this subsection, we provide an algorithm to minimize the penalized quadratic inference function in (4) using the local quadratic approximation (Fan and Li, 2001). Let $\beta^0$ be an initial value close to the true minimizer of (4); for example, we could take the QIF estimator $\tilde\beta$ as the initial value. Denote $\beta^k = (\beta_0^k, \beta_1^{kT}, \ldots, \beta_d^{kT})^T$ as the value at the $k$th iteration. If $\beta_l^k$ is close to zero in the sense that $\|\beta_l^k\|_{K_l} \le \varepsilon$ for some small threshold value $\varepsilon$, then set $\beta_l^{k+1}$ to 0. In the implementation, we have used $\varepsilon = 10^{-6}$. Without loss of generality, suppose $\beta_l^{k+1} = 0$ for $l = d_k + 1, \ldots, d$, and write $\beta^{k+1} = (\beta_0^{k+1}, \beta_1^{(k+1)T}, \ldots, \beta_d^{(k+1)T})^T = (\beta_{11}^{(k+1)T}, \beta_{22}^{(k+1)T})^T$, with $\beta_{22}^{k+1}$ containing the $d - d_k$ zero components. Let $\beta = (\beta_{11}^T, \beta_{22}^T)^T$ be the same partition of any $\beta$ with the same length as $\beta^{k+1}$.

We obtain a value for the non-zero component $\beta_{11}^{k+1}$ using the following quadratic approximation. For $\|\beta_l^k\|_{K_l} > \varepsilon$, one can locally approximate the penalty function by
$$p_{\lambda_n}(\|\beta_l\|_{K_l}) \approx p_{\lambda_n}(\|\beta_l^k\|_{K_l}) + \frac{1}{2}\, p'_{\lambda_n}(\|\beta_l^k\|_{K_l})\, \|\beta_l^k\|_{K_l}^{-1}\big(\beta_l^T K_l \beta_l - \beta_l^{kT} K_l \beta_l^k\big),$$
where $p'_{\lambda_n}$ is the first-order derivative of $p_{\lambda_n}$. Therefore, the objective function in (4) can be locally approximated (up to a constant) by the quadratic function
$$Q_n(\beta^k) + \nabla Q_n(\beta^k)^T(\beta_{11} - \beta_{11}^k) + \frac{1}{2}(\beta_{11} - \beta_{11}^k)^T \nabla^2 Q_n(\beta^k)(\beta_{11} - \beta_{11}^k) + \frac{1}{2}\, n\, \beta_{11}^T \Lambda(\beta^k)\, \beta_{11},$$
where $\nabla Q_n(\beta^k) = \partial Q_n(\beta^k)/\partial\beta_{11}$, $\nabla^2 Q_n(\beta^k) = \partial^2 Q_n(\beta^k)/\partial\beta_{11}\partial\beta_{11}^T$, and
$$\Lambda(\beta^k) = \mathrm{diag}\Big\{0,\ p'_{\lambda_n}(\|\beta_1^k\|_{K_1})\,\|\beta_1^k\|_{K_1}^{-1} K_1,\ \ldots,\ p'_{\lambda_n}(\|\beta_{d_k}^k\|_{K_{d_k}})\,\|\beta_{d_k}^k\|_{K_{d_k}}^{-1} K_{d_k}\Big\}. \tag{5}$$
One can obtain $\beta_{11}^{k+1}$ by minimizing the above quadratic function. Specifically,
$$\beta_{11}^{k+1} = \beta_{11}^k - \big\{\nabla^2 Q_n(\beta^k) + n\Lambda(\beta^k)\big\}^{-1}\big\{\nabla Q_n(\beta^k) + n\Lambda(\beta^k)\,\beta_{11}^k\big\}.$$
One repeats the above iteration until convergence is reached. Here we use the convergence criterion $\{(\beta^{k+1} - \beta^k)^T(\beta^{k+1} - \beta^k)\}^{1/2} \le 10^{-6}$. From our experience, the algorithm is very stable and fast to compute. It usually reaches a reasonable convergence tolerance within a few iterations. However, the computational time increases as the number of covariates increases. In one of our numerical examples in Section 4, it takes about 25 iterations on average to converge when there are 100 covariates in the model.
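The update can be written compactly in code; in the following sketch (our illustration, not the authors' implementation), grad and hess stand for $\nabla Q_n(\beta^k)$ and $\nabla^2 Q_n(\beta^k)$ and are assumed supplied, and scad_deriv is the helper from Subsection 3.2.

```python
# One LQA iteration for the non-zero block beta_11: build Lambda(beta^k) as in
# (5) and apply the closed-form update
#   beta11 <- beta11 - (hess + n*Lam)^{-1} (grad + n*Lam @ beta11).
import numpy as np
from scipy.linalg import block_diag

def lqa_step(beta11, grad, hess, K_blocks, norms, lam, n, scad_deriv, eps=1e-6):
    blocks = [np.zeros((1, 1))]                      # intercept: unpenalized
    for Kl, nrm in zip(K_blocks, norms):             # norms[l] = ||beta_l^k||_{K_l}
        blocks.append(scad_deriv(nrm, lam) / max(nrm, eps) * Kl)
    Lam = block_diag(*blocks)
    step = np.linalg.solve(hess + n * Lam, grad + n * Lam @ beta11)
    return beta11 - step
```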

3.4 Tuning Parameter Selection

It is important to select sensible tuning parameters in the implementation, since the performance of the proposed method depends critically on this choice. The QIF method in Subsection 3.1 requires an appropriate choice of the knot sequences $\{\upsilon_l\}_{l=1}^d$ for the spline approximation. For the penalized QIF method in Subsection 3.2, in addition to $\{\upsilon_l\}_{l=1}^d$, one also needs to choose the tuning parameter $\lambda_n$ in the SCAD penalty function.

For knot selection in the QIF method, we use equally spaced knots and select only the number of interior knots $N_n$. A similar strategy can also be found in Huang, Wu and Zhou (2004), Xue and Yang (2006) and Xue (2009). For the penalized QIF, we use the same knot sequences selected in the QIF procedure and, for simplicity, select only the tuning parameter $\lambda_n$. For any given $N_n$, denote the minimizer of (2) by $\tilde\beta_{N_n}$; similarly, for any given $\lambda_n$, denote the minimizer of (4) by $\hat\beta_{\lambda_n}$. Motivated by Qu and Li (2006) and Wang, Li and Tsai (2008), we use a consistent BIC procedure to select the optimal tuning parameters. Since the quadratic inference function $Q_n$ behaves asymptotically like minus twice the log-likelihood function (Qu et al. 2000), we define the BIC in the QIF procedure as
$$\mathrm{BIC}_1(N_n) = Q_n(\tilde\beta_{N_n}) + \log(n)\,\mathrm{DF}_{N_n}/n,$$
where $\mathrm{DF}_{N_n} = 1 + dJ_n$ is the total number of unknown parameters in (2). One then selects the optimal knot number $\hat N_n = \mathop{\mathrm{argmin}}_{N_n}\mathrm{BIC}_1(N_n)$. From our experience, only a small number of knots are needed; in the numerical examples in Section 4, the selected optimal knot numbers are in the range of 2 to 5. Similarly, for the penalized QIF procedure, we define
$$\mathrm{BIC}_2(\lambda_n) = Q_n(\hat\beta_{\lambda_n}) + \log(n)\,\mathrm{DF}_{\lambda_n}/n.$$
The effective degrees of freedom is defined as
$$\mathrm{DF}_{\lambda_n} = \mathrm{trace}\Big[\big\{\nabla^2 Q_n(\hat\beta_{\lambda_n}) + n\Lambda(\hat\beta_{\lambda_n})\big\}^{-1}\nabla^2 Q_n(\hat\beta_{\lambda_n})\Big],$$
where $\nabla^2 Q_n(\cdot)$ and $\Lambda(\cdot)$ are defined as in (5) for the non-zero components in $\hat\beta_{\lambda_n}$. Consequently, we choose the optimal $\lambda_n$ such that the BIC value reaches its minimum, that is, $\hat\lambda_n = \mathop{\mathrm{argmin}}_{\lambda_n}\mathrm{BIC}_2(\lambda_n)$.
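Schematically, the $\lambda_n$ search is a simple grid minimization; in this sketch (ours), fit_pqif is a hypothetical routine returning $Q_n(\hat\beta_{\lambda_n})$ and the effective degrees of freedom $\mathrm{DF}_{\lambda_n}$ for a given $\lambda$.

```python
# Grid search for lambda by BIC_2; fit_pqif(lam) is a placeholder returning
# (Q_n value at the penalized fit, effective degrees of freedom DF_lambda).
import numpy as np

def select_lambda(lambdas, fit_pqif, n):
    best_lam, best_bic = None, np.inf
    for lam in lambdas:
        Qn, df = fit_pqif(lam)
        bic = Qn + np.log(n) * df / n                # BIC_2(lambda)
        if bic < best_bic:
            best_lam, best_bic = lam, bic
    return best_lam
```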

3.5 Unequal cluster sizes

In the case where clusters have unequal sizes, we utilize cluster-specific transformation matrices to reformulate the data. That is, for the $i$th cluster with cluster size $m_i$, the $m \times m_i$ transformation matrix $T_i$ is defined by deleting the corresponding columns (indexed by the missing observations in the cluster) of the $m \times m$ identity matrix, where $m$ is the largest size of a fully observed cluster, and is assumed to be bounded. Using the transformation matrices, we define $Y_i^* = T_i Y_i$, $\Delta_i^*(\beta) = T_i\Delta_i(\beta)$, $B_i^* = T_i B_i$, $\mu_i^*(\beta) = T_i\mu_i(\beta)$ and $A_i^* = T_i A_i$ for the $i$th cluster. The entries of $Y_i^*$, $\Delta_i^*(\beta)$, $B_i^*$, $\mu_i^*(\beta)$ and $A_i^*$ equal the corresponding observed measurements where available, and 0 where the observations are missing. We substitute $Y_i$, $\Delta_i(\beta)$, $B_i$, $\mu_i(\beta)$ and $A_i$ with $Y_i^*$, $\Delta_i^*(\beta)$, $B_i^*$, $\mu_i^*(\beta)$ and $A_i^*$, respectively, in the preceding procedures and in Lemmas A.10 and A.11 provided in the supplemental material. Following the arguments in the proofs of the theorems, it can be shown that the asymptotic results of Theorems 1, 2 and 3 still hold for correlated data with unequal cluster sizes, for both the parameter estimation and variable selection procedures.
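Concretely, $T_i$ is just a column subset of the identity matrix; a short sketch (ours):

```python
# Cluster-specific transformation T_i: delete from the m x m identity the
# columns indexed by the cluster's missing occasions; T_i Y_i then embeds the
# m_i observed values into an m-vector with zeros in the missing slots.
import numpy as np

def transform_matrix(observed_idx, m):
    return np.eye(m)[:, observed_idx]                # shape (m, m_i)

T = transform_matrix([0, 2, 3], m=5)                 # observed at occasions 1, 3, 4
y_star = T @ np.array([1.2, 0.4, -0.7])              # zeros at occasions 2 and 5
```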

4 Simulation studies

In this section, we conduct simulation studies for both continuous and binary outcomes. The total averaged integrated squared error (TAISE) is evaluated to assess estimation efficiency. Let $\hat\alpha^{(r)}$ be the estimator of a nonparametric function $\alpha$ in the $r$th ($1 \le r \le R$) replication, and let $\{x_m\}_{m=1}^{n_{\mathrm{grid}}}$ be the grid points where $\hat\alpha^{(r)}$ is evaluated. We define
$$\mathrm{AISE}(\hat\alpha) = \frac{1}{R}\sum_{r=1}^R \frac{1}{n_{\mathrm{grid}}}\sum_{m=1}^{n_{\mathrm{grid}}}\big\{\alpha(x_m) - \hat\alpha^{(r)}(x_m)\big\}^2,$$
and $\mathrm{TAISE} = \sum_{l=1}^d \mathrm{AISE}(\hat\alpha_l)$. Let $S$ and $S_0$ be the selected and true index sets of relevant variables, respectively. We say $S$ is correct if $S = S_0$; $S$ overfits if $S_0 \subset S$ and $S_0 \ne S$; and $S$ underfits if $S_0 \not\subset S$. In all of the simulation studies, the number of replications is 100.
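These criteria translate directly into code; a small sketch (ours), where the true and fitted component functions are callables on the grid:

```python
# AISE for one component: average over replications and grid points of the
# squared error; TAISE then sums the per-component AISEs.
import numpy as np

def aise(alpha_true, fitted_reps, grid):
    """fitted_reps: list of R fitted functions for this component."""
    return np.mean([np.mean((alpha_true(grid) - fit(grid)) ** 2)
                    for fit in fitted_reps])

def taise(alpha_truths, fitted_by_component, grid):
    return sum(aise(a, fits, grid)
               for a, fits in zip(alpha_truths, fitted_by_component))
```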

4.1 Example 1: Continuous Response and Moderate Dimension of Covariates

In this example, the continuous responses $\{Y_{ij}\}$ are generated from
$$Y_{ij} = \sum_{l=1}^d \alpha_l(X_{ij}^{(l)}) + \varepsilon_{ij}, \quad i = 1, \ldots, n,\ j = 1, \ldots, 5, \tag{6}$$
where $d = 10$ and the number of clusters $n = 100$, 250, or 500. The additive functions are
$$\alpha_1(x) = 2x - 1, \quad \alpha_2(x) = 8(x - 0.5)^3, \quad \alpha_3(x) = \sin(2\pi x),$$
and $\alpha_l(x) = 0$ for $l = 4, \ldots, 10$. Thus the last 7 variables in this model are null variables and do not contribute to the model. The covariates $X_{ij} = (X_{ij}^{(1)}, \ldots, X_{ij}^{(10)})^T$ are generated independently from $\mathrm{Uniform}([0,1]^{10})$. The error $\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{i5})^T$ follows a multivariate normal distribution with mean 0, a common marginal variance $\sigma^2 = 1$, and an exchangeable correlation structure with correlation $\rho = 0.7$.

We apply the penalized QIF with SCAD penalty (SCAD) and fit both linear splines ($p = 1$) and cubic splines ($p = 3$). To illustrate the effect of incorporating within-cluster correlation on estimation efficiency, we compare the estimation efficiency of using basis matrices from different

working correlation structures: exchangeable (EC), AR-1 and independent (IND). In addition, we compare the penalized QIF approach with the QIF estimations of a full model (FULL) and an oracle model (ORACLE). Here the full model consists of all ten variables, and the oracle model contains only the first three relevant variables. The oracle model is available only in simulation studies, where the true information is known.

Table 1 summarizes the estimation results of all procedures. It shows that the estimation procedures with a correct EC working correlation have the smallest TAISEs, and therefore the estimators are more efficient than their counterparts with IND working correlation, which ignore within-cluster correlation. For example, the efficiency gained by incorporating correlation could be doubled in the cubic spline approach. Estimation based on a misspecified AR-1 correlation structure leads to some efficiency loss compared to using the true EC structure, but it is still more efficient than assuming an independent structure. In general, the efficiency of estimation using cubic splines is higher than that using linear splines. Furthermore, with the same type of working correlation, ORACLE has the smallest TAISEs compared to the SCAD or FULL model approaches. This is not surprising, since ORACLE takes advantage of the true information from the data generation and contains only the three relevant variables. However, the efficiency of the SCAD is close to that of ORACLE, with much smaller TAISEs than those of the FULL model.

For one selected simulated data set with $n = 250$, Figure 1 plots the first four estimated functional components from the SCAD, FULL and ORACLE models using linear splines and exchangeable working correlation. Note that for the fourth variable, both the true and estimated component functions from SCAD are zero. This clearly indicates that the proposed method estimates the unknown functions reasonably well.

Table 2 provides variable selection results for the SCAD procedures. It shows that for the EC, AR-1 or IND working correlation, the percentage of correct model fitting approaches 1 quickly as the sample size increases. This confirms the consistency of variable selection established in Subsection 3.2. Furthermore, for the cubic spline approach, the correct model-fitting percentage using the EC correlation structure is noticeably higher than that using IND when the sample size is moderate.

Lastly, we give the computing time of the algorithm proposed in Subsection 3.3. For SCAD with $p = 1$ and EC working correlation, it takes 0.5, 1.2, and 3.8 seconds, respectively, to complete one run of the simulation for $n = 100$, 250, and 500. The computing times for cubic splines and other correlation structures are similar.
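For reference, the data-generating mechanism of (6) is easy to reproduce; a sketch (ours) for one replication:

```python
# Simulate one replication from model (6): n clusters of size 5, d = 10
# uniform covariates, three non-zero components, exchangeable N(0, 1) errors
# with correlation rho = 0.7.
import numpy as np

rng = np.random.default_rng(1)
n, m, d, rho = 250, 5, 10, 0.7
Sigma = (1 - rho) * np.eye(m) + rho * np.ones((m, m))

X = rng.uniform(size=(n, m, d))
eps = rng.multivariate_normal(np.zeros(m), Sigma, size=n)
Y = (2 * X[..., 0] - 1) + 8 * (X[..., 1] - 0.5) ** 3 \
    + np.sin(2 * np.pi * X[..., 2]) + eps            # alpha_4, ..., alpha_10 = 0
```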

4.2 Example 2: Continuous Response with High Dimension of Covariates

To assess our method in a more challenging, high-dimensional case, we consider model (6) with the dimension of functional components $d = 100$. Only the first three variables are relevant, and they take the same functional forms as in Example 1. We consider model (6) with the number of clusters $n = 200$ and cluster size 5, with errors $\{\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{i5})^T\}_{i=1}^{200}$ generated independently from a multivariate normal distribution with mean 0 and an AR-1 correlation structure with $\mathrm{corr}(\varepsilon_{ij_1}, \varepsilon_{ij_2}) = 0.7^{|j_1 - j_2|}$ for $1 \le j_1, j_2 \le 5$.

We apply the linear spline QIF to estimate the full (FULL) and oracle (ORACLE) models with independent (IND), exchangeable (EC) or AR-1 working correlation. For variable selection, we consider the penalized linear spline QIF with SCAD penalty, with basis matrices from the IND, EC or AR-1 working correlations. Table 3 reports the TAISEs for FULL, ORACLE and SCAD, and the variable selection results on correct, overfit and underfit percentages of the SCAD approach, for the three working correlations.

Table 3 clearly indicates that the improvement from incorporating within-cluster correlation is very significant. In particular, the estimation procedures with a correctly-specified AR-1 structure always give smaller TAISEs than those with a misspecified EC or IND working correlation. For variable selection, the SCAD with an AR-1 working correlation also performs noticeably better than the one with EC or independent working correlation. Furthermore, Table 3 also shows that the SCAD procedure dramatically improves the estimation accuracy in this high-dimensional case, with TAISEs from the SCAD less than one tenth of the TAISEs from the FULL model.

Finally, for computing time, it takes 19.8 seconds to compute SCAD with $p = 1$ and EC working correlation in one run of the simulation; the computing times for other correlation structures are similar. However, compared with Example 1, the computational burden increases dramatically as the number of covariates increases.

4.3 Example 3: Binary Response

To assess the performance of our method for binary outcomes, we generate a random sample of 250 clusters in each simulation run. Within each cluster, binary responses $\{Y_{ij}\}_{j=1}^{20}$ are generated from the marginal logit model
$$\mathrm{logit}\big\{P(Y_{ij} = 1\,|\,X_{ij} = x_{ij})\big\} = \sum_{l=1}^{10} \alpha_l(x_{ij}^{(l)}),$$
where $\alpha_1(x) = \{\exp(x+1) - \exp(2) + \exp(1)\}/16$, $\alpha_2(x) = \cos(2\pi x)/4$, $\alpha_3(x) = x(1-x) - 1/6$, $\alpha_4(x) = 2(x - 0.5)^3$, and the remaining 6 covariates are null variables with $\alpha_l(x) = 0$ for $l = 5, \ldots, 10$. The covariates $X_{ij} = (X_{ij}^{(1)}, \ldots, X_{ij}^{(10)})^T$ are generated independently from $\mathrm{Uniform}([0,1]^{10})$. The algorithm provided by Macke et al. (2008) is applied to generate correlated binary responses with an exchangeable correlation structure with correlation parameter 0.5.

We applied the penalized QIF with a SCAD penalty and linear splines (SCAD) for variable selection, and the linear spline QIF for estimation of the full (FULL) and oracle (ORACLE) models. To illustrate how different working correlations could affect our estimation and variable selection results, we consider minimizing (2) and (4) using AR-1 (AR-1) and independent (IND) working structures, in addition to the true exchangeable (EC) correlation structure.

Table 4 gives the TAISEs for the SCAD, ORACLE and FULL models with the three different working correlations. Similar to the previous continuous simulation study, the estimation based on the correctly specified exchangeable correlation structure has the smallest TAISEs. For the SCAD approach, the efficiency is almost doubled if the correct correlation information is incorporated. Estimation based on a misspecified AR-1 correlation structure leads to some efficiency loss compared to using the true structure, but it is still more efficient than assuming an independent structure. This could be different for the GEE if the true exchangeable correlation were misspecified as AR-1, since the GEE requires one to estimate the correlation $\rho$ under the misspecified AR-1, and the estimator of $\rho$ may not be valid; consequently, the estimator from the misspecified correlation structure could be less efficient than assuming independence.

Furthermore, similar to the previous study, the TAISEs from the SCAD approach are close to the TAISEs from ORACLE, and much smaller than those from the FULL model. The relative efficiency ratio between the SCAD and FULL model approaches is close to 2. This shows that the SCAD approach can gain estimation accuracy significantly by effectively removing the zero component variables. Table 5 gives the frequency of appearance of each of the 10 variables in the selected model. Overall, the SCAD procedures work reasonably well, and the SCAD with EC and AR-1 working correlation structures provides better variable selection results than the SCAD with IND working structure. Additionally, the SCAD approach with EC working correlation performs best in selecting the non-zero component variables ($X_1, \ldots, X_4$). Note that model selection assuming the IND working structure performs poorly for selecting variable $X_4$.

For one simulation run, Figure 2 plots the first four estimated functional components from the SCAD approach with the three different working correlations. For the fourth functional component, the SCAD with IND structure does not estimate the non-zero component correctly, since it apparently shrinks all coefficients of the functional form to zero. In general, estimation assuming the IND structure provides poorer results than estimation assuming the EC or AR-1 working correlations; for the EC and AR-1 correlation structures, the SCAD approach provides reasonable estimates of the unknown functions. Lastly, for computing time, it takes 9.7 seconds to compute SCAD with $p = 1$ and EC working correlation in one simulation run. Overall, the proposed algorithm is fast to compute.

5 Real data analysis

To illustrate the proposed method, we apply it to an observational study of periodontal disease (Stoner, 2000). The partial data set consists of 528 patients with chronic periodontal disease who had an initial periodontal exam during 1988-1992. Each patient was followed annually for five years after the initial exam. One goal of the study is to identify risk factors which influence tooth loss, in order to improve the treatment of periodontal disease. Let a binary response $Y_{ij} = 1$ if patient $i$ in the $j$th year has at least one tooth extraction, and $Y_{ij} = 0$ otherwise. The covariates of interest include age at time of initial exam (Age), number of teeth present at time of initial exam (Teeth), number of diseased sites (Sites), mean pocket depth in diseased sites (Pddis), mean pocket depth in all sites (Pdall), number of non-surgical periodontal procedures in a year (Nonsurg), number of non-periodontal dental treatments in a year (Dent), and date of initial exam in fractional years (Entry). The last variable (Entry) is clearly not related to the model prediction, but it is included here so we can verify whether the model selection procedure excludes this variable as expected.

To model the binary responses $Y_{ij}$, we consider a marginal generalized additive model with a logit link in (1). We apply the proposed penalized QIF with SCAD penalty, fitting both linear and cubic splines, to select relevant variables, in addition to the QIF generalized additive model with linear splines based on the full model with all eight covariates. In both procedures, we use an exchangeable working correlation. Procedures with other types of working correlation were also considered and lead to similar conclusions; they are therefore not reported here.

Figure 3 plots the estimated function components from the SCAD approach with linear splines (dashed) and cubic splines (dot-dash), and the QIF estimation using the full model (solid line). Figure 3 shows that for both SCAD procedures, the selected relevant covariates are Age, Sites, Pdall and Dent. The others are not selected, since their function components are shrunk to zero by the penalized QIF procedures. Note that the variable Entry is not selected, as expected. Furthermore, we also calculate 95% point-wise confidence intervals of the estimated component functions using the QIF procedure with linear splines, based on 500 bootstrapped samples. For the four variables which are not selected, the 95% bootstrap confidence intervals contain the zero line completely. This confirms our findings on variable selection based on the penalized QIF procedure.

6 Discussion

We propose efficient estimation and model selection procedures for generalized additive models when the responses are correlated. We use a SCAD penalty for model selection on the additive functional forms, and we are able to select the functional components group-wise. We incorporate correlation from the clustered data by optimally combining moment conditions which contain the correlation information. The advantage of this approach is that we do not require explicit estimation of the correlation parameters; this improves estimation efficiency significantly and also saves computational cost. We found that the method may be quite computationally intensive in high-dimensional variable selection settings, since the dimension of the parameters involved in the nonparametric forms increases significantly compared to parametric model selection settings.

We show that the nonparametric estimator is consistent with an optimal $L_2$-norm convergence rate of $n^{-r/(2r+1)}$, which is the same optimal rate of $L_2$-norm convergence as in nonparametric models for independent and continuous data. In addition, we show that the model selection approach is consistent; that is, the probability that it selects the correct model, retaining exactly the non-zero functional components, converges to 1. Another advantage of our approach is that estimation and model selection are achieved simultaneously, in contrast to other model selection procedures such as AIC or BIC.

Our simulation studies show that the proposed estimator recovers a significant amount of efficiency by performing model selection: the total averaged integrated squared error from the SCAD model selection approach is much closer to that of the true ORACLE model, and is smaller than that of the full model. In addition, we also show that estimation is more accurate when the true correlation structure is taken into consideration.

Finally, the theoretical techniques used in deriving the asymptotic properties of the proposed estimator are quite different from the existing nonparametric approaches of which we are aware. We should also mention that, because there is no closed form of the likelihood function, the numerical integration technique used to obtain a maximum likelihood estimator in settings with a small number of parameters is not applicable here, since infinitely many parameters are involved in each functional form. In addition, the conventional penalized least squares approach for continuous data is not applicable for correlated categorical response data, since it could result in biased estimation and inconsistent model selection.

Appendix: Proofs of theorems

The necessary lemmas for the following proofs are given in the supplemental materials. We first introduce some notation.

For any real-valued function $f$ on $[0,1]$, let $\|f\|_\infty = \sup_{x\in[0,1]}|f(x)|$. Let $\|\cdot\|_2$ be the usual $L_2$ norm for functions and vectors, and let $L_2([0,1])$ be the space of all square integrable functions defined on $[0,1]$. For any $\alpha_1, \alpha_2 \in L_2([0,1])$, let
$$\langle\alpha_1, \alpha_2\rangle = \int_0^1 \alpha_1(x)\,\alpha_2(x)\,dx, \quad \text{and} \quad \|\alpha\|_2^2 = \int_0^1 \alpha^2(x)\,dx. \tag{A.1}$$
Let $H_0$ be the space of square integrable constant functions on $[0,1]$, and let $L_2^0$ be the space of square integrable functions on $[0,1]$ which are orthogonal to $H_0$; that is, $L_2^0 = \{\alpha : \langle\alpha, 1\rangle = 0,\ \alpha \in L_2\}$, where $1$ is the identity function defined on $[0,1]$. Define the additive model space $\mathcal{M}$ as
$$\mathcal{M} = \Big\{\alpha(x) = \alpha_0 + \sum_{l=1}^d \alpha_l(x_l);\ \alpha_0 \in H_0,\ \alpha_l \in L_2^0\Big\}.$$
Then the regression function $\eta(x)$ in (1) of the main text is modeled as an element of $\mathcal{M}$. Define the polynomial spline approximation space $\mathcal{M}_n$ as
$$\mathcal{M}_n = \Big\{s(x) = s_0 + \sum_{l=1}^d s_l(x_l);\ s_0 \in H_0,\ s_l \in \varphi_l^{0,n}\Big\},$$
in which $\varphi_l^{0,n} = \{s_l(\cdot) : s_l \in \varphi_l,\ \langle s_l, 1\rangle = 0\}$ is the centered polynomial spline space. Note that the definitions of $\mathcal{M}$ and $\mathcal{M}_n$ do not depend on the constraints that $\alpha_l \in L_2^0$ and $s_l \in \varphi_l^{0,n}$; we impose those constraints to ensure that there is a unique additive decomposition of the functions in $\mathcal{M}$ and $\mathcal{M}_n$ almost surely. Furthermore, for any $\alpha \in \mathcal{M}$, define two norms which will be used later: $\|\alpha\|^2 = E\{\alpha^T(X)\,\alpha(X)\}$ and $\|\alpha\|_n^2 = \frac{1}{n}\sum_{i=1}^n \alpha^T(X_i)\,\alpha(X_i)$, where $\alpha(X) = \{\alpha(X_1), \ldots, \alpha(X_m)\}^T$ for $X = (X_1^T, \ldots, X_m^T)^T$.

A.1 Proof of Theorem 1.

Lemma A.7 in the supplement and the triangle inequality give that for each $l = 1, \ldots, d$,
$$\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^m \big\{\tilde\alpha_l(x_{ij}^{(l)}) - \alpha_l(x_{ij}^{(l)})\big\}^2 \le \frac{2}{n}\sum_{i=1}^n\sum_{j=1}^m \big[B_l^T(x_{ij}^{(l)})(\tilde\beta_l - \beta_l^0)\big]^2 + c\,h^{2r}.$$
Therefore, it is sufficient to show that
$$\big\|B_l^T(\tilde\beta_l - \beta_l^0)\big\|_n^2 = \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^m \big[B_l^T(x_{ij}^{(l)})(\tilde\beta_l - \beta_l^0)\big]^2 = O_p\{(nh)^{-1}\}. \tag{A.2}$$
Note that Lemmas A.10 and A.11 in the supplement entail that for any $\varepsilon > 0$, there exists a sufficiently large $C > 0$ such that, as $n \to \infty$,
$$P\Big\{\inf_{\beta:\ \|B^T(\beta - \beta^0)\|_n = C(nh)^{-1/2}} Q_n(\beta) > Q_n(\beta^0)\Big\} > 1 - \varepsilon.$$
Therefore
$$P\big\{\|B^T(\tilde\beta - \beta^0)\|_n \le C(nh)^{-1/2}\big\} > 1 - \varepsilon,$$
which entails that $\|B^T(\tilde\beta - \beta^0)\|_n = O_p\{(nh)^{-1/2}\}$. Furthermore, Lemma A.5 in the supplement entails that for each $l = 1, \ldots, d$, there exists a constant $C > 0$ such that
$$\big\|B_l^T(\tilde\beta_l - \beta_l^0)\big\|_n \le C\,\big\|B^T(\tilde\beta - \beta^0)\big\|_n = O_p\{(nh)^{-1/2}\}.$$
Therefore, (A.2) is proven.

A.2 Proof of Theorem 2.

Let $L_n(\beta) = Q_n(\beta) + n\sum_{l=1}^d p_{\lambda_n}(\|\beta_l\|_{K_l})$ be the objective function in (4). Let
$$\hat\beta^* = \mathop{\mathrm{argmin}}_{\beta = (\beta_0, \beta_1^T, \ldots, \beta_s^T, 0^T, \ldots, 0^T)^T} Q_n(\beta),$$
which is the spline QIF estimator of the first $s$ function components, knowing that the rest are zero terms. As a special case of Theorem 1, one has $\|B^T(\hat\beta^* - \beta^0)\|_n = O_p(1/\sqrt{nh})$. We want to show that for large $n$ and any $\varepsilon > 0$, there exists a constant $C$ large enough such that
$$P\Big\{\inf_{\beta:\ \|B^T(\beta - \hat\beta^*)\|_n = C(nh)^{-1/2}} L_n(\beta) > L_n(\hat\beta^*)\Big\} \ge 1 - \varepsilon. \tag{A.3}$$
This implies that $L_n(\cdot)$ has a local minimum in the ball $\{\beta : \|B^T(\beta - \hat\beta^*)\|_n \le C(nh)^{-1/2}\}$, so that $\|B^T(\hat\beta - \hat\beta^*)\|_n = O_p(1/\sqrt{nh})$. Further, the triangle inequality gives
$$\|B^T\hat\beta - \alpha\|_n \le \|B^T(\hat\beta - \hat\beta^*)\|_n + \|B^T(\hat\beta^* - \beta^0)\|_n + \|B^T\beta^0 - \alpha\|_n = O_p(1/\sqrt{nh} + h^r).$$
The theorem follows by noting that $h = n^{-1/(2r+1)}$.

To show (A.3), using $p_{\lambda_n}(0) = 0$ and $p_{\lambda_n}(\cdot) \ge 0$, one has
$$L_n(\beta) - L_n(\hat\beta^*) \ge Q_n(\beta) - Q_n(\hat\beta^*) + n\sum_{l=1}^s \big\{p_{\lambda_n}(\|\beta_l\|_{K_l}) - p_{\lambda_n}(\|\hat\beta_l^*\|_{K_l})\big\}.$$
By arguments similar to the proof of Theorem 2 in Xue (2009), if $\lambda_n \to 0$, then for any $\beta$ with $\|B^T(\beta - \hat\beta^*)\|_n = C(nh)^{-1/2}$, one has $\|\beta_l\|_{K_l} \ge a\lambda_n$ and $\|\hat\beta_l^*\|_{K_l} \ge a\lambda_n$ for each $l = 1, \ldots, s$, when $n$ is large enough. Therefore, when $n$ is large enough,
$$\sum_{l=1}^s \big\{p_{\lambda_n}(\|\beta_l\|_{K_l}) - p_{\lambda_n}(\|\hat\beta_l^*\|_{K_l})\big\} = 0,$$
by the definition of the SCAD penalty function. Furthermore,
$$Q_n(\beta) - Q_n(\hat\beta^*) = (\beta - \hat\beta^*)^T \dot Q_n(\hat\beta^*) + \frac{1}{2}(\beta - \hat\beta^*)^T \ddot Q_n(\hat\beta^*)(\beta - \hat\beta^*)\{1 + o_p(1)\}, \tag{A.4}$$
with $\dot Q_n$ and $\ddot Q_n$ being the gradient vector and Hessian matrix of $Q_n$, respectively. Following Qu et al. (2000) and Lemma A.9 in the supplement, for any $\beta$ with $\|B^T(\beta - \hat\beta^*)\|_n = C(nh)^{-1/2}$, one has
$$(\beta - \hat\beta^*)^T \dot Q_n(\hat\beta^*) = n(\beta - \hat\beta^*)^T \dot G_n^T(\hat\beta^*)\,C_n^{-1}(\hat\beta^*)\,G_n(\hat\beta^*)\{1 + o_p(1)\} \asymp C h^{-1},$$
and
$$(\beta - \hat\beta^*)^T \ddot Q_n(\hat\beta^*)(\beta - \hat\beta^*) = n(\beta - \hat\beta^*)^T \dot G_n^T(\hat\beta^*)\,C_n^{-1}(\hat\beta^*)\,\dot G_n(\hat\beta^*)(\beta - \hat\beta^*)\{1 + o_p(1)\} \asymp C^2 h^{-1},$$
where $\dot G_n$ is the first-order derivative of $G_n$. Therefore, by choosing $C$ large enough, the second term on the right-hand side of (A.4) dominates its first term, and (A.3) holds when $C$ and $n$ are large enough. This completes the proof of Theorem 2.

A.3 Proof of Theorem 3.

Let $\Theta_1 = \big\{\beta : \beta = (\beta_0, \beta_1^T, \ldots, \beta_s^T, 0^T, \ldots, 0^T)^T,\ \|B^T(\beta - \beta^0)\|_n = O_p(1/\sqrt{nh})\big\}$. For $l = s+1, \ldots, d$, define $\Theta_l = \big\{\beta : \beta = (0, 0^T, \ldots, 0^T, \beta_l^T, 0^T, \ldots, 0^T)^T,\ \|B^T\beta\|_n = O_p(1/\sqrt{nh})\big\}$.

It is sufficient to show that, uniformly for any $\beta \in \Theta_1$ and $\beta_l^* \in \Theta_l$, $L_n(\beta) \le L_n(\beta + \beta_l^*)$ with probability tending to 1 as $n \to \infty$. Note that, similarly to the proof of Theorem 2,
$$L_n(\beta + \beta_l^*) - L_n(\beta) = Q_n(\beta + \beta_l^*) - Q_n(\beta) + n\,p_{\lambda_n}(\|\beta_l^*\|_{K_l})$$
$$= \beta_l^{*T}\dot Q_n(\hat\beta) + \frac{1}{2}\beta_l^{*T}\ddot Q_n(\hat\beta)\,\beta_l^*\{1 + o_p(1)\} + n\,p_{\lambda_n}(\|\beta_l^*\|_{K_l})$$
$$= n\lambda_n\|B^T\beta_l^*\|_n\Big\{\frac{R_n}{\lambda_n} + \frac{p'_{\lambda_n}(w)}{\lambda_n}\{1 + o_p(1)\}\Big\},$$
where
$$R_n = \frac{\beta_l^{*T}\dot Q_n(\hat\beta) + \frac{1}{2}\beta_l^{*T}\ddot Q_n(\hat\beta)\,\beta_l^*}{n\,\|B^T\beta_l^*\|_n} = O_p(1/\sqrt{nh}),$$
and $w$ is a value between 0 and $\|\beta_l^*\|_{K_l}$. We complete the proof by observing that $\mathrm{plim}_{n\to\infty} R_n/\lambda_n = 0$, while $\liminf_{n\to\infty}\liminf_{w\to 0^+} p'_{\lambda_n}(w)/\lambda_n = 1$.

Acknowledgements

This work is supported in part by National Science Foundation grants. The authors thank Dr. Julie Stoner of the University of Oklahoma for providing the periodontal disease data for the data analysis. The paper is substantially improved thanks to the constructive suggestions and comments from two referees, an Associate Editor and the Editor.

Supplemental Materials

WebAppendix.pdf: The web appendix provides the necessary lemmas and their proofs, supporting the proofs of the theorems in the Appendix.


References

Berhane, K., and Tibshirani, R. J. (1998), "Generalized additive models for longitudinal data," Canadian Journal of Statistics, 26, 517-535.

de Boor, C. (2001), A Practical Guide to Splines, New York: Springer.

Fan, J., and Li, R. (2001), "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, 96, 1348-1360.

Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, London: Chapman and Hall.

He, X., Fung, W. K., and Zhu, Z. (2005), "Robust estimation in generalized partial linear models for clustered data," Journal of the American Statistical Association, 100, 1176-1184.

He, X., and Shi, P. (1996), "Bivariate tensor-product B-splines in a partly linear model," Journal of Multivariate Analysis, 58, 162-181.

Horowitz, J. L. (2001), "Nonparametric estimation and a generalized additive model with an unknown link function," Econometrica, 69, 499-513.

Horowitz, J. L., and Mammen, E. (2007), "Rate-optimal estimation for a general class of nonparametric regression models with unknown link functions," The Annals of Statistics, 35, 2589-2619.

Huang, J. Z. (1998), "Projection estimation in multiple regression with application to functional ANOVA models," The Annals of Statistics, 26, 242-272.

Huang, J. Z. (2003), "Local asymptotics for polynomial spline regression," The Annals of Statistics, 31, 1600-1635.

Huang, J. Z., Wu, C. O., and Zhou, L. (2004), "Polynomial spline estimation and inference for varying coefficient models with longitudinal data," Statistica Sinica, 14, 763-788.

Huang, J. Z., and Yang, L. (2004), "Identification of nonlinear additive autoregressive models," Journal of the Royal Statistical Society, Ser. B, 66, 463-477.

Huang, J., Horowitz, J. L., and Wei, F. (2010+), "Variable selection in nonparametric additive models," The Annals of Statistics, under revision.

Liang, K. Y., and Zeger, S. L. (1986), "Longitudinal data analysis using generalized linear models," Biometrika, 73, 13-22.

Linton, O. B., and Härdle, W. (1996), "Estimation of additive regression models with known links," Biometrika, 83, 529-540.

Macke, J. H., Berens, P., Ecker, A. S., Tolias, A. S., and Bethge, M. (2008), "Generating spike-trains with specified correlation-coefficients," Neural Computation, 21, 397-423.

Meier, L., van de Geer, S., and Bühlmann, P. (2008), "The group Lasso for logistic regression," Journal of the Royal Statistical Society, Ser. B, 70, 53-71.

Meier, L., van de Geer, S., and Bühlmann, P. (2009), "High-dimensional additive modeling," The Annals of Statistics, 37, 3779-3821.

Qu, A., Lindsay, B. G., and Li, B. (2000), "Improving generalised estimating equations using quadratic inference functions," Biometrika, 87, 823-836.

Qu, A., and Li, R. (2006), "Quadratic inference functions for varying-coefficient models with longitudinal data," Biometrics, 62, 379-391.

Stone, C. J. (1985), "Additive regression and other nonparametric models," The Annals of Statistics, 13, 689-705.

Stoner, J. A. (2000), "Analysis of clustered data: A combined estimating equations approach," Ph.D. thesis, University of Washington.

Wang, H., Li, R., and Tsai, C. (2008), "Tuning parameter selectors for the smoothly clipped absolute deviation method," Biometrika, 94, 553-568.

Wang, L., Li, H., and Huang, J. Z. (2008), "Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements," Journal of the American Statistical Association, 103, 1556-1569.

Wang, N. (2003), "Marginal nonparametric kernel regression accounting for within-subject correlation," Biometrika, 90, 43-52.

Xue, L. (2009), "Variable selection in additive models," Statistica Sinica, 19, 1281-1296.

Xue, L., and Yang, L. (2006), "Additive coefficient modeling via polynomial spline," Statistica Sinica, 16, 1423-1446.

Yuan, M., and Lin, Y. (2006), "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society, Ser. B, 68, 49-67.

Zhu, Z., Fung, W. K., and He, X. (2008), "On the asymptotics of marginal regression splines with longitudinal data," Biometrika, 95, 907-917.


Table 1: Example 1: Continuous response and moderate dimension of covariates. The TAISEs (standard errors in parentheses) of SCAD, ORACLE and FULL with exchangeable (EC), AR-1 or independent (IND) working correlation using linear or cubic splines. The number of replications is 100.

                   n    Method    EC              AR-1            IND
  Linear spline   100   SCAD      0.0385(.0017)   0.0468(.0019)   0.0496(.0018)
                        ORACLE    0.0263(.0004)   0.0311(.0005)   0.0377(.0007)
                        FULL      0.0522(.0010)   0.0649(.0012)   0.0776(.0015)
                  250   SCAD      0.0121(.0002)   0.0143(.0004)   0.0173(.0006)
                        ORACLE    0.0103(.0001)   0.0117(.0002)   0.0164(.0004)
                        FULL      0.0205(.0004)   0.0247(.0005)   0.0385(.0007)
                  500   SCAD      0.0093(.0001)   0.0095(.0001)   0.0119(.0002)
                        ORACLE    0.0085(.0001)   0.0089(.0001)   0.0117(.0002)
                        FULL      0.0134(.0001)   0.0150(.0002)   0.0218(.0003)
  Cubic spline    100   SCAD      0.0378(.0042)   0.0422(.0032)   0.0454(.0038)
                        ORACLE    0.0172(.0009)   0.0217(.0010)   0.0262(.0011)
                        FULL      0.0778(.0022)   0.0843(.0024)   0.0867(.0044)
                  250   SCAD      0.0062(.0003)   0.0083(.0004)   0.0099(.0004)
                        ORACLE    0.0052(.0003)   0.0067(.0003)   0.0099(.0004)
                        FULL      0.0220(.0007)   0.0284(.0008)   0.0362(.0008)
                  500   SCAD      0.0038(.0001)   0.0043(.0001)   0.0075(.0002)
                        ORACLE    0.0026(.0001)   0.0032(.0001)   0.0056(.0002)
                        FULL      0.0075(.0001)   0.0096(.0002)   0.0174(.0003)


Table 2: Example 1: Continuous response and moderate dimension of covariates. Variable selection results of SCAD with exchangeable, AR-1 or independent working correlation and linear or cubic splines. The columns of C, O and U give the percentage of correct-fitting, over-fitting and under-fitting from 100 replications.

                         EC                  AR-1                IND
                   n    C     O     U      C     O     U      C     O     U
  Linear spline   100   0.93  0.00  0.07   0.93  0.01  0.06   0.91  0.01  0.08
                  250   0.99  0.00  0.01   0.98  0.00  0.02   0.98  0.00  0.02
                  500   1.00  0.00  0.00   1.00  0.00  0.00   1.00  0.00  0.00
  Cubic spline    100   0.81  0.09  0.10   0.79  0.17  0.04   0.73  0.18  0.09
                  250   0.99  0.00  0.01   0.99  0.00  0.01   0.95  0.00  0.05
                  500   1.00  0.00  0.00   1.00  0.00  0.00   1.00  0.00  0.00

Table 3: Example 2: Continuous response and high dimension of covariates. The TAISEs (standard errors in parentheses), and percentages of correct-fitting (C), under-fitting (U), and over-fitting (O) from 100 replications.

           SCAD             ORACLE           FULL             C     O     U
  AR-1     0.03185(.0007)   0.0189(.0005)    0.4125(.0005)    0.92  0.02  0.06
  EC       0.03544(.0008)   0.0214(.0006)    0.5241(.0006)    0.89  0.05  0.06
  IND      0.04499(.0009)   0.0237(.0006)    0.7278(.0006)    0.85  0.04  0.11

Table 4: Example 3: Binary response. The TAISEs (standard errors in parentheses) from 100 simulations.

           SCAD            ORACLE          FULL
  EC       0.0062(.0004)   0.0068(.0003)   0.0191(.0008)
  AR-1     0.0089(.0005)   0.0083(.0004)   0.0229(.0008)
  IND      0.0123(.0006)   0.0112(.0005)   0.0276(.0010)


Table 5: Example 3: Binary response. The appearance frequency of the variables in the selected model from 100 simulations.

           X1    X2    X3   X4   X5   X6   X7   X8   X9   X10
  EC      100   100   90   98    0    0    0    2    0    0
  AR-1     96   100   79   94    0    1    0    0    0    0
  IND      89   100   75   66    0    1    0    1    0    0


[Figure 1: four panels (a)-(d) showing the estimated and true component functions; plot image omitted.]

Figure 1: Example 1: Continuous response and moderate dimension of covariates. The estimated component functions of (a) α1(x) = 2x − 1, (b) α2(x) = 8(x − 0.5)^3, (c) α3(x) = sin(2πx), and (d) α4(x) = 0 from SCAD (dot-dash), FULL (dot), and ORACLE (dash) using linear spline and exchangeable working correlation. The true functions are plotted in solid lines.


[Figure 2: four panels (a)-(d) showing the estimated and true component functions; plot image omitted.]

Figure 2: Example 3: Binary response. The estimated component functions of (a) α1(x) = (e^{x+1} − e^2 + e)/16, (b) α2(x) = cos(2πx)/4, (c) α3(x) = x(1 − x) − 1/6, and (d) α4(x) = 2(x − 0.5)^3 from linear spline SCAD with exchangeable (dot), AR (dash), and independent (dot-dash) working correlation. The true functions are plotted in solid lines.


[Figure 3: eight panels showing the estimated component functions for Age, Teeth, Sites, Pdall, Pddis, Dent, Entry and Nonsurg; plot image omitted.]

Figure 3: Periodontal disease application. The estimated component functions from the SCAD procedure with linear spline (dashed line) and cubic spline (dot-dash), and the QIF of the full model (solid line) using linear spline and the exchangeable working correlation. Both SCAD procedures give exactly zero estimates for the variables Teeth, Pddis, Entry and Nonsurg. Also plotted are 95% bootstrap confidence intervals of the component functions from the QIF procedure.


Web Appendix of "Consistent Model Selection for Marginal Generalized Additive Model for Correlated Data"

Lan Xue, Department of Statistics, Oregon State University
Annie Qu, Department of Statistics, University of Illinois at Urbana-Champaign
Jianhui Zhou, Department of Statistics, University of Virginia

This document gives the technical lemmas and their proofs. The technical lemmas are used in the proofs of Theorems 1-3 in the paper.

With $J_n = N_n + p$, let $B_l = (B_{l,1}, \ldots, B_{l,J_n})^T$ be a set of orthonormal bases of $\varphi_l^{0,n}$ with respect to the inner product defined in (A.1) of the main text, for $l = 1, \ldots, d$. Let $B = (1, B_1^T, \ldots, B_d^T)^T$, in which $1$ denotes the identity function defined on $\chi$. Then $B$ is a set of bases for $\mathcal{M}_n$. The next several lemmas present some preliminary results regarding $\mathcal{M}$ and $\mathcal{M}_n$.
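For concreteness, here is a minimal Python sketch of this kind of basis construction, assuming SciPy 1.8 or later for BSpline.design_matrix. The empirical centering stands in for centering with respect to the inner product (A.1), and the orthonormalization step is omitted, so this is an illustration rather than the paper's implementation.

```python
import numpy as np
from scipy.interpolate import BSpline

def centered_bspline_design(x, N_n=5, p=3):
    """Return an (n x J_n) centered spline design matrix, J_n = N_n + p.

    x : sample of one covariate, assumed to lie in [0, 1].
    N_n : number of interior knots; p : spline degree.
    """
    interior = np.linspace(0, 1, N_n + 2)[1:-1]               # N_n interior knots
    knots = np.r_[np.zeros(p + 1), interior, np.ones(p + 1)]  # clamped knot vector
    D = BSpline.design_matrix(x, knots, p).toarray()          # n x (N_n + p + 1)
    D = D[:, 1:]                       # drop one column: the constant function
                                       # is carried by the first entry of B
    return D - D.mean(axis=0)          # empirical centering

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(size=200))     # sorted input, as design_matrix expects
print(centered_bspline_design(x).shape)   # (200, 8): J_n = N_n + p = 8
```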

Lemma 1. Under condition (C1), the two norms $\|\cdot\|$ and $\|\cdot\|_2$ are equivalent on $\mathcal{M}$. That is, there exist constants $C \ge c > 0$ such that, for any $\alpha \in \mathcal{M}$,
$$c\,\|\alpha\|_2 \le \|\alpha\| \le C\,\|\alpha\|_2.$$

Proof: Notice that $\|\alpha\|^2 = E[\alpha^T(X)\,\alpha(X)] = \sum_{j=1}^m E[\alpha^2(X_j)]$. Condition (C1) entails that $c_1\|\alpha\|_2^2 \le E[\alpha^2(X_j)] \le c_2\|\alpha\|_2^2$. The lemma then follows by taking $c = \sqrt{mc_1}$ and $C = \sqrt{mc_2}$.

Lemma 2. Under condition (C1), there exists a constant $c > 0$ such that
$$\|\alpha\|^2 \ge c\Big(\alpha_0^2 + \sum_{l=1}^d \|\alpha_l\|^2\Big), \qquad \forall\; \alpha = \alpha_0 + \sum_{l=1}^d \alpha_l \in \mathcal{M}.$$

Proof: The result follows immediately from Lemma 1 and Lemma 1 of Stone (1985).

Lemma 3. Under conditions (C1) and (C4), there exist constants $C \ge c > 0$ such that $c\,\|\beta\|_2^2 \le \|\beta^T B\|^2 \le C\,\|\beta\|_2^2$ for any vector $\beta$ of length $1 + dJ_n$.


Proof: Let $\beta = (\beta_0, \beta_1^T, \ldots, \beta_d^T)^T$, where $\beta_0$ is a scalar and $\beta_l = (\beta_{l1}, \ldots, \beta_{lJ_n})^T$. Note that Lemma 1 entails that there exists a constant $C^*$ such that
$$\|\beta^T B\|^2 \le 2^{d+1}\Big(\beta_0^2 + \sum_{l=1}^d \|\beta_l^T B_l\|^2\Big) \le 2^{d+1}C^*\Big(\beta_0^2 + \sum_{l=1}^d \|\beta_l\|_2^2\Big) = 2^{d+1}C^*\,\|\beta\|_2^2.$$
On the other hand, Lemmas 1 and 2 entail that there exist constants $c_1^*$ and $c_2^*$ such that
$$\|\beta^T B\|^2 \ge c_1^*\Big(\beta_0^2 + \sum_{l=1}^d \|\beta_l^T B_l\|^2\Big) \ge c_1^*c_2^*\Big(\beta_0^2 + \sum_{l=1}^d \|\beta_l\|_2^2\Big) = c\,\|\beta\|_2^2.$$
This completes the proof.

Lemma 4. As $n \to \infty$, one has
$$\sup_{s_1, s_2 \in \mathcal{M}_n} \frac{\big|\langle s_1, s_2\rangle_n - \langle s_1, s_2\rangle\big|}{\|s_1\|\,\|s_2\|} = O_p\Bigg(\sqrt{\frac{\log(n)}{nh}}\Bigg).$$
In particular, there exist constants $0 < c < 1 < C$ such that, except on an event whose probability tends to zero as $n \to \infty$, $c\,\|s\| \le \|s\|_n \le C\,\|s\|$ for all $s \in \mathcal{M}_n$.

Proof: The proof is similar to that of Lemma 4 in Xue and Yang (2006).
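A small Monte Carlo sketch of the norm agreement that Lemma 4 formalizes; the uniform design and the particular function $s$ are our illustrative assumptions, and the check is pointwise in $s$ rather than uniform over $\mathcal{M}_n$ as in the lemma.

```python
import numpy as np

# ||s||_n^2 = (1/n) sum_i s(X_i)^2 should concentrate around
# ||s||^2 = E s^2(X) = 1/2 for s(x) = cos(2*pi*x) and X ~ Uniform(0, 1).
rng = np.random.default_rng(1)
s = lambda x: np.cos(2 * np.pi * x)
for n in (100, 1000, 10000):
    x = rng.uniform(size=n)
    print(n, np.mean(s(x) ** 2))   # approaches 0.5 as n grows
```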

Lemma 5. There exists a constant $C > 0$ such that, except on an event whose probability tends to zero as $n \to \infty$,
$$\|s\|_n^2 \ge C\Big(s_0^2 + \sum_{l=1}^d \|s_l\|_n^2\Big), \qquad \forall\; s = s_0 + \sum_{l=1}^d s_l \in \mathcal{M}_n.$$

Proof: The result follows from Lemmas 2 and 4.


Lemma 6. There exist constants $C \ge c > 0$ such that, except on an event whose probability tends to zero as $n \to \infty$, $c\,\|\beta\|_2^2 \le \|\beta^T B\|_n^2 \le C\,\|\beta\|_2^2$ for any vector $\beta$ of length $1 + dJ_n$.

Proof: The result follows from Lemmas 3 and 4.

Lemma 7. Let $\alpha = \alpha_0 + \sum_{l=1}^d \alpha_l \in \mathcal{M}$ with each $\alpha_l$ satisfying condition (C2), $l = 1, \ldots, d$. Then there exist an additive spline function $s^0 = s_0^0 + \sum_{l=1}^d s_l^0 = s_0^0 + \sum_{l=1}^d \beta_l^{0T}B_l \in \mathcal{M}_n$ and a constant $C > 0$ such that $\|\alpha - s^0\|_\infty \le Ch^r$.

Proof: According to de Boor (2001, p. 149), for each $l = 1, \ldots, d$ there exist a constant $c > 0$ and a spline function $s_l \in \varphi_l$ such that $\|\alpha_l - s_l\|_\infty \le ch^r$. Let $s_l^0 = s_l - \langle s_l, 1\rangle \in \varphi_l^0$. Note that $|\langle s_l, 1\rangle| \le |\langle s_l - \alpha_l, 1\rangle| + |\langle\alpha_l, 1\rangle| \le \|\alpha_l - s_l\|_\infty \le ch^r$. Thus $\|\alpha_l - s_l^0\|_\infty \le \|\alpha_l - s_l\|_\infty + |\langle s_l, 1\rangle| \le 2ch^r$. Let $s^0 = \alpha_0 + \sum_{l=1}^d s_l^0 \in \mathcal{M}_n$; then $\|\alpha - s^0\|_\infty \le \sum_{l=1}^d \|\alpha_l - s_l^0\|_\infty \le 2dch^r$. This completes the proof.
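The $h^r$ rate in Lemma 7 is easy to observe numerically. The sketch below, our illustration with an arbitrary smooth target rather than anything from the paper, fits least-squares cubic splines on successively finer knot sequences and prints the sup-norm error, which shrinks at roughly the rate $h^4$ expected for cubic splines and a smooth function.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

alpha = lambda x: np.sin(2 * np.pi * x)       # a smooth component function
x = np.linspace(0, 1, 2001)
for N in (4, 8, 16, 32):                      # interior knots; mesh h ~ 1/N
    interior = np.linspace(0, 1, N + 2)[1:-1]
    knots = np.r_[np.zeros(4), interior, np.ones(4)]   # clamped cubic knots
    s = make_lsq_spline(x, alpha(x), knots, k=3)
    print(N, np.abs(alpha(x) - s(x)).max())   # sup-norm error shrinks ~ h^4
```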
Let $\mathbb{B} = (B_1^T, \ldots, B_n^T)^T$ with $B_i = (B^T(x_{i1}), \ldots, B^T(x_{im_i}))^T$ for $i = 1, \ldots, n$. Write $M_0 = I$. Then for $k, k' = 0, \ldots, K$, define
$$G_{n,k}(\beta) = \frac{1}{n}\sum_{i=1}^n g_{ik}(\beta), \qquad C_{kk'}(\beta) = \frac{1}{n}\sum_{i=1}^n g_{ik}(\beta)\,g_{ik'}^T(\beta),$$
where
$$g_{ik}(\beta) = g_{ik}(\mu(\beta)) = B_i^T\Delta_i^T(\mu_i(\beta))\,A_i^{-1/2}(\mu_i(\beta))\,M_k\,A_i^{-1/2}(\mu_i(\beta))\,\{Y_i - \mu_i(\beta)\}.$$
Then $G_n$ and $C_n$ in (2) of the main text can be written as $G_n(\beta) = (G_{n,0}^T(\beta), \ldots, G_{n,K}^T(\beta))^T$ and $C_n(\beta) = (C_{kk'}(\beta))_{k,k'=0}^K$.

Similarly define
$$g_{ik}^0(\beta) = B_i^T\Delta_{0,i}^TA_{0,i}^{-1/2}M_kA_{0,i}^{-1/2}\big\{Y_i - \mu_i^0 + \Delta_{0,i}\zeta_{i1} + \Delta_{0,i}\zeta_{i2}(\beta)\big\},$$
where $\zeta_{i1} = B_i\beta^0 - \eta_i^0$, $\zeta_{i2}(\beta) = B_i(\beta - \beta^0)$, $\beta^0 = (1, \beta_1^{0T}, \ldots, \beta_d^{0T})^T$ as defined in Lemma 7, and $\Delta_{0,i}$ and $A_{0,i}$ are evaluated at $\mu_i = \mu_i^0$. Let
$$g_{ik1}^0 = B_i^T\Delta_{0,i}^TA_{0,i}^{-1/2}M_kA_{0,i}^{-1/2}\,(Y_i - \mu_i^0),$$
$$g_{ik2}^0 = B_i^T\Delta_{0,i}^TA_{0,i}^{-1/2}M_kA_{0,i}^{-1/2}\,\Delta_{0,i}\zeta_{i1},$$
$$g_{ik3}^0(\beta) = B_i^T\Delta_{0,i}^TA_{0,i}^{-1/2}M_kA_{0,i}^{-1/2}\,\Delta_{0,i}\zeta_{i2}(\beta),$$
and let $g_{k1}^0 = \frac{1}{n}\sum_{i=1}^n g_{ik1}^0$, with $g_{k2}^0$ and $g_{k3}^0(\beta)$ defined similarly. Furthermore, for $k, k' = 0, \ldots, K$, define
$$C_{kk'}^0(\beta) = \frac{1}{n}\sum_{i=1}^n g_{ik1}^0\,g_{ik'1}^{0T}.$$
Then define $G_{n,k}^0$, $G_n^0$, $C_n^0$ and $Q_n^0$ similarly to $G_{n,k}$, $G_n$, $C_n$ and $Q_n$, but with $g_{ik}$ replaced by $g_{ik}^0$ and $C_{kk'}(\beta)$ replaced by $C_{kk'}^0(\beta)$.
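To make these estimating-function definitions concrete, here is a minimal numerical sketch of $G_n(\beta)$, $C_n(\beta)$ and the quadratic inference function $Q_n(\beta) = n\,G_n^T(\beta)\,C_n^{-1}(\beta)\,G_n(\beta)$. The identity link (so that $\Delta_i = A_i = I$), the common cluster size, and the basis pair $M_0 = I$, $M_1$ with zero diagonal and unit off-diagonals (an exchangeable-type basis) are illustrative assumptions, not the paper's general specification.

```python
import numpy as np

def qif(beta, B_list, Y_list):
    """Q_n(beta) for an identity link; B_list[i] is the (m x p) spline design
    matrix B_i of cluster i, Y_list[i] its (m,) response vector."""
    n, p = len(B_list), beta.shape[0]
    m = B_list[0].shape[0]
    Ms = [np.eye(m), np.ones((m, m)) - np.eye(m)]     # M_0, M_1
    G = np.zeros(len(Ms) * p)                         # stacked G_{n,k}(beta)
    C = np.zeros((len(Ms) * p, len(Ms) * p))          # C_n(beta)
    for Bi, Yi in zip(B_list, Y_list):
        r = Yi - Bi @ beta                            # Y_i - mu_i(beta)
        gi = np.concatenate([Bi.T @ Mk @ r for Mk in Ms])
        G += gi / n
        C += np.outer(gi, gi) / n
    return n * G @ np.linalg.solve(C, G)              # assumes C_n nonsingular
```

Minimizing this objective in $\beta$ gives the unpenalized spline QIF estimator; the selection procedure of the main text adds a SCAD penalty on the groups of spline coefficients.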
Lemma 8. Under the conditions of Theorem 1, one has
$$\|g_{k1}^0\| = O_p(1/\sqrt{n}), \qquad \|g_{k2}^0\| = O_p(h^r).$$
Furthermore, let $\Theta_n(C) = \{\beta : \|B^T(\beta - \beta^0)\|_n = C/\sqrt{nh}\}$ for a constant $C$. Then there exist constants $0 < c_6 \le c_5$ such that for any $\beta \in \Theta_n(C)$,
$$c_6C/\sqrt{nh} \le \|g_{k3}^0(\beta)\| \le c_5C/\sqrt{nh},$$
except on an event whose probability goes to 0 as $n \to \infty$.

Proof: For any $k = 0, \ldots, K$ and any $a \in \mathbb{R}^{1+dJ_n}$ with $a^Ta = 1$, $a^Tg_{k1}^0$ has mean 0, and by conditions (C5), (C6) and Lemma 3 there exists a constant $c > 0$ such that
$$\mathrm{var}\big(a^Tg_{k1}^0\big) = \frac{1}{n}E\big(a^Tg_{ik1}^0g_{ik1}^{0T}a\big) \le \frac{c}{n}\,a^TE\big(B_i^TB_i\big)\,a \le \frac{c}{n}\,a^Ta = O\Big(\frac{1}{n}\Big).$$
Note that the above holds for any $a \in \mathbb{R}^{1+dJ_n}$ with $a^Ta = 1$. Therefore $\|g_{k1}^0\| = \sup_{a^Ta=1}|a^Tg_{k1}^0| = O_p(1/\sqrt{n})$.

Let $\Sigma_{ki}^0 = \Delta_{0,i}^TA_{0,i}^{-1/2}M_kA_{0,i}^{-1/2}\Delta_{0,i}$ for $i = 1, \ldots, n$. Then for any $a \in \mathbb{R}^{1+dJ_n}$ with $a^Ta = 1$, the Cauchy-Schwarz inequality together with Lemmas 6 and 7 entails that there exists a $c > 0$ such that
$$\big|a^Tg_{k2}^0\big| = \Big|\frac{1}{n}\sum_{i=1}^n a^TB_i^T\Delta_{0,i}^TA_{0,i}^{-1/2}M_kA_{0,i}^{-1/2}\Delta_{0,i}\zeta_{i1}\Big| \le \Big(\frac{1}{n}\sum_{i=1}^n a^TB_i^T\Sigma_{ki}^0B_i\,a\Big)^{1/2}\Big(\frac{1}{n}\sum_{i=1}^n \zeta_{i1}^T\Sigma_{ki}^0\zeta_{i1}\Big)^{1/2} \le c\,(a^Ta)^{1/2}\,\|B^T\beta^0 - \eta^0\|_\infty \le c\,h^r.$$
Therefore $\|g_{k2}^0\| = O_p(h^r)$. Similarly, there exists a $c_5 > 0$ such that, with probability approaching 1,
$$\big|a^Tg_{k3}^0(\beta)\big| = \Big|\frac{1}{n}\sum_{i=1}^n a^TB_i^T\Delta_{0,i}^TA_{0,i}^{-1/2}M_kA_{0,i}^{-1/2}\Delta_{0,i}\zeta_{i2}(\beta)\Big| \le \Big(\frac{1}{n}\sum_{i=1}^n a^TB_i^T\Sigma_{ki}^0B_i\,a\Big)^{1/2}\Big(\frac{1}{n}\sum_{i=1}^n \zeta_{i2}^T(\beta)\Sigma_{ki}^0\zeta_{i2}(\beta)\Big)^{1/2} \le c_5\,(a^Ta)^{1/2}\,\|B^T(\beta - \beta^0)\|_n = c_5C/\sqrt{nh}.$$
Therefore $\|g_{k3}^0(\beta)\| \le c_5C/\sqrt{nh}$.

On the other hand, taking $a = (\beta - \beta^0)/\|\beta - \beta^0\|_2$, conditions (C5), (C6) and Lemma 3 entail that there exist positive constants $c$ and $c_6$ such that
$$a^Tg_{k3}^0(\beta) = \frac{1}{n\,\|\beta - \beta^0\|_2}\sum_{i=1}^n (\beta - \beta^0)^TB_i^T\Delta_{0,i}^TA_{0,i}^{-1/2}M_kA_{0,i}^{-1/2}\Delta_{0,i}B_i(\beta - \beta^0) \ge \frac{c}{n\,\|\beta - \beta^0\|_2}\sum_{i=1}^n (\beta - \beta^0)^TB_i^TB_i(\beta - \beta^0) = \frac{c}{\|\beta - \beta^0\|_2}\,\|B^T(\beta - \beta^0)\|_n^2 \ge c_6\,\|B^T(\beta - \beta^0)\|_n = c_6C/\sqrt{nh}.$$
Therefore $\|g_{k3}^0(\beta)\| = \sup_{a^Ta=1}|a^Tg_{k3}^0(\beta)| \ge c_6C/\sqrt{nh}$.

Lemma 9. Under the conditions of Theorem 1, the eigenvalues of $C_n^0$ are bounded away from 0 and infinity when $n$ is large enough. Furthermore, let $\Theta_n(C) = \{\beta : \|B^T(\beta - \beta^0)\|_n = C/\sqrt{nh}\}$ for some $C$ sufficiently large. Then there exist constants $0 < c_8 \le c_7$ and $0 < c_{10} \le c_9$ such that for any $\beta \in \Theta_n(C)$,
$$c_8C/\sqrt{nh} \le \|G_n^0(\beta)\| \le c_7C/\sqrt{nh}, \qquad (B.1)$$
and
$$c_{10}C^2/h \le Q_n^0(\beta) \le c_9C^2/h, \qquad (B.2)$$
except on an event whose probability goes to 0 as $n \to \infty$.

Proof: Let $\tilde{B}_i = I_{(K+1)} \otimes B_i$, where $I_{(K+1)}$ is the identity matrix of dimension $(K+1) \times (K+1)$ and $\otimes$ denotes the Kronecker product. Furthermore, let $M = (M_0^T, \ldots, M_K^T)^T$. To show that $C_n^0$ has bounded eigenvalues, conditions (C5) and (C6) ensure that it suffices to show that there exist constants $c^*$ and $C^*$ such that, when $n$ is large enough,
$$c^* \le \frac{1}{n}\sum_{i=1}^n a^T\tilde{B}_i^TMM^T\tilde{B}_i\,a \le C^*, \qquad (B.3)$$
for any $a = (a_0^T, \ldots, a_K^T)^T$ with each $a_k \in \mathbb{R}^{1+dJ_n}$ and $a^Ta = 1$. The eigenvalues of $MM^T$ are the squares of the singular values of $M$, which are bounded away from 0 and $+\infty$ by condition (C7). Then
$$\frac{1}{n}\sum_{i=1}^n a^T\tilde{B}_i^TMM^T\tilde{B}_i\,a \asymp \frac{1}{n}\sum_{i=1}^n a^T\tilde{B}_i^T\tilde{B}_i\,a = \sum_{k=0}^K \frac{1}{n}\sum_{i=1}^n a_k^TB_i^TB_i\,a_k \asymp \sum_{k=0}^K a_k^Ta_k = a^Ta = 1,$$
by Lemma 6, where $a \asymp b$ denotes that there exist constants $c \ge c^* > 0$ such that $c^*b \le a \le cb$. Therefore (B.3) holds. Finally, (B.1) follows directly from Lemma 8, and (B.2) follows from (B.1) and the fact that $C_n^0$ has bounded eigenvalues.
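Condition (C7), the boundedness of the singular values of $M$, can be checked numerically for any concrete basis pair; a quick sketch under the same illustrative choice of $M_0$ and $M_1$ as above:

```python
import numpy as np

# Stack M_0 = I_m and an exchangeable-type M_1 (unit off-diagonals) as in
# the proof of Lemma 9, and verify that the singular values of M, and hence
# the eigenvalues of M M^T, are bounded away from 0 and infinity for a
# fixed (bounded) cluster size m.
m = 5
M0 = np.eye(m)
M1 = np.ones((m, m)) - np.eye(m)
M = np.vstack([M0, M1])                     # M = (M_0^T, M_1^T)^T
sv = np.linalg.svd(M, compute_uv=False)
print(sv.min(), sv.max())                   # both finite and strictly positive
```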
Lemma 10. Under the conditions of Theorem 1, for some $C$ sufficiently large, one has
$$\sup_{\beta \in \Theta_n(C)}\big\|G_n(\beta) - G_n^0(\beta)\big\| = o_p\big(1/\sqrt{nh}\big), \qquad (B.4)$$
$$\sup_{\beta \in \Theta_n(C)}\big|Q_n(\beta) - Q_n^0(\beta)\big| = o_p(1/h). \qquad (B.5)$$

Proof: It suffices to show that (B.4) holds for each of its components $G_{n,k}(\beta) - G_{n,k}^0(\beta)$, $k = 0, \ldots, K$. Note that the regression function can be decomposed as
$$\eta_i(\beta) = B_i\beta = \eta_i^0 + \zeta_i(\beta), \qquad \mu_i(\beta) = \mu\big[\eta_i^0 + \zeta_i(\beta)\big],$$
where $\zeta_i(\beta) = \zeta_{i1} + \zeta_{i2}(\beta)$, $\zeta_{i1} = B_i\beta^0 - \eta_i^0$ and $\zeta_{i2}(\beta) = B_i(\beta - \beta^0)$.

For any $a \in \mathbb{R}^{1+dJ_n}$ with $a^Ta = 1$, a Taylor expansion of $a^TG_{n,k}(\beta)$ gives
$$a^TG_{n,k}(\beta) = a^TG_{n,k}(\mu(\beta)) = a^TG_{n,k}\big(\mu(\eta^0 + \zeta)\big)$$
$$= \frac{1}{n}\sum_{i=1}^n a^TB_i^T\Delta_i^T\big(\mu_i(\eta_i^0 + \zeta_i)\big)A_i^{-1/2}\big(\mu_i(\eta_i^0 + \zeta_i)\big)M_kA_i^{-1/2}\big(\mu_i(\eta_i^0 + \zeta_i)\big)\big\{y_i - \mu_i(\eta_i^0 + \zeta_i)\big\}$$
$$= \frac{1}{n}\sum_{i=1}^n a^TB_i^T\Delta_{0,i}^TA_{0,i}^{-1/2}M_kA_{0,i}^{-1/2}\big\{y_i - \mu_i^0\big\} + \frac{1}{n}\sum_{i=1}^n a^TB_i^T\Delta_{0,i}^TA_{0,i}^{-1/2}M_kA_{0,i}^{-1/2}\Delta_{0,i}\zeta_i$$
$$\quad + \frac{1}{n}\sum_{i=1}^n \zeta_i^T\Delta_{0,i}^T\,\frac{\partial\,a^TB_i^T\Delta_{0,i}^TA_{0,i}^{-1/2}M_kA_{0,i}^{-1/2}}{\partial\mu_i}\,\big\{y_i - \mu_i^0\big\} + R_n^*(\mu^*),$$
where $R_n^*(\mu^*) = \frac{1}{n}\sum_{i=1}^n R_{ni}^*(\mu_i^*)$, with
$$R_{ni}^*(\mu_i^*) = \frac{1}{2}\,\zeta_i^T\Delta_i^T\,\frac{\partial^2\,a^TB_i^T\Delta_i^T(\mu_i)A_i^{-1/2}(\mu_i)M_kA_i^{-1/2}(\mu_i)\{y_i - \mu_i\}}{\partial\mu_i\,\partial\mu_i^T}\,\Delta_i\zeta_i$$
evaluated at $\mu_i^* = g^{-1}(\eta_i^0 + \tau_i\zeta_i)$ with $0 < \tau_i < 1$. Then one has
$$a^T\big\{G_{n,k}(\beta) - G_{n,k}^0(\beta)\big\} = \frac{1}{n}\sum_{i=1}^n \zeta_i^T\Delta_{0,i}^T\,\frac{\partial\,a^TB_i^T\Delta_{0,i}^TA_{0,i}^{-1/2}M_kA_{0,i}^{-1/2}}{\partial\mu_i}\,\big\{y_i - \mu_i^0\big\} + R_n^*(\mu^*) = I_k(\beta) + R_n^*(\mu^*). \qquad (B.6)$$

For $I_k(\beta)$, one can write
$$I_k(\beta) = \frac{1}{n}\sum_{i=1}^n \zeta_{i1}^T\Gamma_{0,i}\big\{y_i - \mu_i^0\big\} + \frac{1}{n}\sum_{i=1}^n \zeta_{i2}^T(\beta)\Gamma_{0,i}\big\{y_i - \mu_i^0\big\} = I_{k1} + I_{k2}(\beta),$$
where $\Gamma_{0,i} = \Delta_{0,i}^T\,\partial\big(a^TB_i^T\Delta_{0,i}^TA_{0,i}^{-1/2}M_kA_{0,i}^{-1/2}\big)/\partial\mu_i$. Conditions (C5), (C6) and Lemma 6.2 of He and Shi (1996) give that $\sup_{1\le i\le n,\,a^Ta=1}\|\Gamma_{0,i}\| = O_p(h^{-1/2})$. Furthermore, $I_k(\beta)$ has mean 0, and condition (C3) and Lemma 4 ensure that for any $\beta \in \Theta_n(C)$,
$$\mathrm{var}(I_{k1}) \le \frac{c}{nh}\,E\big[\zeta_{i1}^T(y_i - \mu_i^0)(y_i - \mu_i^0)^T\zeta_{i1}\big] \le \frac{c}{nh}\,E\big(\zeta_{i1}^T\zeta_{i1}\big) = O\Big(\frac{h^{2r}}{nh}\Big).$$
Therefore $\sup_{a^Ta=1,\,\beta\in\Theta_n(C)}|I_{k1}| = o_p(h^r)$. Similarly, $\sup_{a^Ta=1,\,\beta\in\Theta_n(C)}|I_{k2}(\beta)| = o_p\big(1/\sqrt{nh}\big)$.

Let
$$F_i^* = \Delta_i^T\,\frac{\partial^2\,a^TB_i^T\Delta_i^T(\mu_i)A_i^{-1/2}(\mu_i)M_kA_i^{-1/2}(\mu_i)\{y_i - \mu_i\}}{\partial\mu_i\,\partial\mu_i^T}\bigg|_{\mu_i = \mu_i^*}\,\Delta_i.$$
Then
$$R_n^*(\mu^*) = \frac{1}{2n}\sum_{i=1}^n \zeta_{i1}^TF_i^*\zeta_{i1} + \frac{1}{n}\sum_{i=1}^n \zeta_{i2}^T(\beta)F_i^*\zeta_{i1} + \frac{1}{2n}\sum_{i=1}^n \zeta_{i2}^T(\beta)F_i^*\zeta_{i2}(\beta) = I_{n3}^{(1)} + I_{n3}^{(2)}(\beta) + I_{n3}^{(3)}(\beta).$$
Conditions (C5), (C6) and Lemma 6.2 of He and Shi (1996) entail that $\sup_{1\le i\le n,\,a^Ta=1}\|F_i^*\| = O_p(h^{-1/2})$. Therefore
$$\sup_{a^Ta=1,\,\beta\in\Theta_n(C)}\big|I_{n3}^{(1)}\big| = O_p\big(h^{2r-1/2}\big), \qquad \sup_{a^Ta=1,\,\beta\in\Theta_n(C)}\big|I_{n3}^{(2)}(\beta)\big| = O_p\big(n^{-1/2}h^{r-1}\big), \qquad \sup_{a^Ta=1,\,\beta\in\Theta_n(C)}\big|I_{n3}^{(3)}(\beta)\big| = O_p\big(n^{-1}h^{-3/2}\big).$$
Combining all approximations into (B.6) and noting that $h = O(n^{-1/(2r+1)})$, one has
$$\sup_{\beta\in\Theta_n(C)}\big\|G_{n,k}(\beta) - G_{n,k}^0(\beta)\big\| = o_p\big(1/\sqrt{nh}\big).$$
Finally, (B.5) follows from Lemma 8 and equation (B.4).
Lemma 11. Under the conditions of Theorem 1, for any $\varepsilon > 0$ there exists a $C = C(\varepsilon)$ such that, as $n \to \infty$,
$$P\Big(\inf_{\beta\in\Theta_n(C)} Q_n^0(\beta) > Q_n^0(\beta^0)\Big) > 1 - \varepsilon.$$

Proof: Let $\Delta(\beta) = G_n^0(\beta) - G_n^0(\beta^0)$. Then
$$\frac{1}{n}\big\{Q_n^0(\beta) - Q_n^0(\beta^0)\big\} = \Delta^T(\beta)\big(C_n^0\big)^{-1}\Delta(\beta) + 2\,\Delta^T(\beta)\big(C_n^0\big)^{-1}G_n^0(\beta^0).$$
By Lemma 9, there exists a constant $c_{11} > 0$ such that for any $\beta \in \Theta_n(C)$,
$$\big|\Delta^T(\beta)\big(C_n^0\big)^{-1}G_n^0(\beta^0)\big| \le c_{11}\sum_{k=0}^K \big\|g_{k3}^0(\beta)\big\|\,\big(\big\|g_{k1}^0\big\| + \big\|g_{k2}^0\big\|\big).$$
Furthermore, Lemma 8 entails that there exists a constant $C_1 > 0$ such that $|\Delta^T(\beta)(C_n^0)^{-1}G_n^0(\beta^0)| \le CC_1h^r/\sqrt{nh}$ when $n$ is large enough. On the other hand, condition (C7) entails that there exists a $c_{12} > 0$ such that
$$\Delta^T(\beta)\big(C_n^0\big)^{-1}\Delta(\beta) \ge c_{12}\sum_{k=0}^K g_{k3}^{0T}(\beta)\,g_{k3}^0(\beta) = c_{12}\sum_{k=0}^K \big\|g_{k3}^0(\beta)\big\|_2^2,$$
and Lemma 8 gives that there exists a constant $C_2 > 0$ such that
$$\Delta^T(\beta)\big(C_n^0\big)^{-1}\Delta(\beta) \ge C^2C_2/(nh).$$
Lemma 11 then follows by noting that $h = n^{-1/(2r+1)}$, so that $h^r/\sqrt{nh} = 1/(nh)$, and choosing $C$ sufficiently large such that $C > C_1/C_2$.
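Remark. The bandwidth arithmetic behind the last step deserves spelling out. With $h = n^{-1/(2r+1)}$,
$$h^r = n^{-r/(2r+1)} = \frac{1}{\sqrt{nh}}, \qquad\text{so}\qquad \frac{h^r}{\sqrt{nh}} = \frac{1}{nh}.$$
Hence the cross term $2\,\Delta^T(\beta)(C_n^0)^{-1}G_n^0(\beta^0)$ is of order $CC_1/(nh)$ while the quadratic term $\Delta^T(\beta)(C_n^0)^{-1}\Delta(\beta)$ is at least of order $C^2C_2/(nh)$, so taking $C$ large enough makes the quadratic term dominate and yields the stated probability bound.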