Variable Selection for Marginal Longitudinal Generalized Linear Models

Eva Cantoni, Joanna Mills Flemming and Elvezio Ronchetti

No 2003.03

Cahiers du département d’économétrie Faculté des sciences économiques et sociales Université de Genève

Juin 2003

Département d’économétrie, Université de Genève, 40 Boulevard du Pont-d’Arve, CH-1211 Genève 4, http://www.unige.ch/ses/metri/

Variable Selection for Marginal Longitudinal Generalized Linear Models

Eva Cantoni, Joanna Mills Flemming and Elvezio Ronchetti
Department of Econometrics, University of Geneva, CH-1211 Geneva 4, Switzerland
[email protected], [email protected], [email protected]

June 2003

Summary. Variable selection is an essential part of any statistical analysis and yet has been somewhat neglected in the context of longitudinal data analysis. In this paper we propose a generalized version of Mallows’s Cp (GCp) suitable for use with both parametric and nonparametric models. GCp provides an estimate of a measure of adequacy of a model for prediction. We examine its performance with popular marginal longitudinal models (fitted using GEE) and contrast results with what is typically done in practice: variable selection based on Wald-type tests. An application to real data further demonstrates the merits of our approach while at the same time emphasizing some important robust features inherent to GCp.

Key words: Cp, generalized estimating equations (GEE), prediction error, robustness, variable selection.


1. INTRODUCTION

Variable (or model) selection is an essential part of any statistical analysis. Although it is often perceived as an extensive search for a single best model, it should be viewed as a technique which facilitates the identification of a few good models. After all, in many contexts it may not be appropriate to choose a single model. Moreover, one can often achieve better prediction results by aggregating a collection of good models in the spirit of bagging; cf. Breiman (1996). This implies that variable selection criteria which allow direct comparisons of models should be preferred to stepwise procedures based on significance testing.

In this paper we propose such a variable selection technique. It is an extension of Mallows’s Cp (Mallows, 1973) and it requires only the data and a model from which predicted values can be obtained. The technique can be applied to many different types of models, including those in which the classical assumptions, in particular the independence of variables and their normal distributions, do not hold. For example, binary data (e.g. a subject having a disease or not at a particular point in time) are not normally distributed. In addition, the independence of outcome variables does not hold when repeated measurements are taken on the same subject. In particular we focus on marginal longitudinal generalized linear models and develop our variable selection technique for these models.

Generalized Linear Models (McCullagh and Nelder, 1989) and Generalized Estimating Equations (Liang and Zeger, 1986) are very popular statistical methods which allow us to model a variety of data and properly address the type of situations described above. Generalized Linear Models (GLM) are a generalization of the regression model for continuous and discrete responses, and marginal models are extensions of GLM for correlated data. Generalized Estimating Equations (GEE) enable us to fit marginal models and are often used for modeling longitudinal data, which commonly arise for instance in medical studies and economics. While there have been many novel approaches to analyzing such data, little attention has been paid to the need for appropriate variable selection. In the latest edition of the Analysis of Longitudinal Data (Diggle et al., 2002), a reference book on longitudinal data analysis, variable selection techniques are somewhat neglected, with the exception of a few examples suggesting to the reader that, in the case of GEE, one can test the significance of covariates using Wald-type tests (z-statistics). Horton and Lipsitz (1999) recently compared GEE implementations of several general purpose statistical packages (SAS, Stata, SUDAAN, and S-PLUS) and concluded that GEE are well-supported by all of these software packages, with hypothesis testing being particularly well-implemented in Stata and SUDAAN.

However, it appears that variable selection within these systems (where it exists) is restricted to either likelihood ratio or Wald-type tests. In addition, on a number of occasions (Ziegler and Grömping, 1998; Ziegler et al., 2000) it has been pointed out that formal variable selection procedures would be a very helpful addition to commercial software. The above observations, in conjunction with a literature review, suggest that variable selection efforts rely predominantly on the use of Wald-type tests. This can be unreliable because, among other things, the choice of working dependence model can impact point estimates and significance levels. The use of GCp here for purposes of model selection avoids a stepwise procedure and is based on a measure of predictive error rather than on significance testing.

The paper is organized as follows. In Section 2, we develop a general criterion for prediction and its estimated version, which leads to a general and robust Cp statistic. We then derive this statistic explicitly for parametric longitudinal models. In Section 3 we present the results of a simulation study that contrasts our proposal with the two most commonly used approaches to variable selection (i.e. those based on stepwise procedures and significance testing). Results are examined both in the absence and presence of misspecification of the model. This seems particularly important when investigating tools like the GEE that are used in medical studies, where 5% of outlying observations seems to be quite common; cf. Hampel et al. (1986), p. 27. The results show the good performance of the new technique in identifying good models. An application on real data is presented in Section 4, which further demonstrates the robust and diagnostic features of our approach. Conclusions and directions for future research are provided in Section 5.

2. DERIVATION OF GCp

2.1 General Cp Procedure for Model Selection

We begin by considering the general setting in which we have only observations yi, i = 1, · · · , K, and a model, either parametric or nonparametric in form, from which we can obtain predicted values ŷi, i = 1, · · · , K. We define the rescaled weighted predictive squared error

$$ \Gamma_p = \sum_{i=1}^{K} E\left[ w_i^2\!\left(\frac{y_i - \hat{y}_i}{\sigma v_i^{1/2}}\right) \left(\frac{\hat{y}_i - E y_i}{\sigma v_i^{1/2}}\right)^{2} \right], $$

where ŷi is the fitted value for submodel P, and Eyi and V(yi) = σ²vi are the expected value and variance under the full model. The weight function wi(·)

may be chosen so as to achieve a number of different objectives including heteroscedasticity or robustness; cf. Section 3.2. If we define the weighted sum of squared residuals by

$$ WSR = \sum_{i=1}^{K} w_i^2(r_i)\, r_i^2, $$

where ri = (yi − ŷi)/(σ vi^{1/2}), and let δi = (ŷi − Eyi)/(σ vi^{1/2}), it is easy to show that

$$ \Gamma_p = E[WSR] - \sum_{i=1}^{K} E[w_i^2(r_i)\,\epsilon_i^2] + 2 \sum_{i=1}^{K} E[w_i^2(r_i)\,\epsilon_i \delta_i], $$

where εi = (yi − Eyi)/(σ vi^{1/2}). This suggests the following generalized version of Mallows’s Cp:

$$ GC_p = WSR - \sum_{i=1}^{K} E[w_i^2(r_i)\,\epsilon_i^2] + 2 \sum_{i=1}^{K} E[w_i^2(r_i)\,\epsilon_i \delta_i]. \tag{1} $$
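The correction in (1) follows from a short computation which the text describes as easy to show; for completeness, it uses nothing beyond the identity ri = εi − δi:

```latex
% Since r_i = \epsilon_i - \delta_i,
\begin{align*}
E[WSR] &= \sum_{i=1}^{K} E\left[w_i^2(r_i)\,(\epsilon_i - \delta_i)^2\right]\\
       &= \sum_{i=1}^{K} E[w_i^2(r_i)\,\epsilon_i^2]
        - 2\sum_{i=1}^{K} E[w_i^2(r_i)\,\epsilon_i\delta_i]
        + \sum_{i=1}^{K} E[w_i^2(r_i)\,\delta_i^2],
\end{align*}
% and the last sum is exactly \Gamma_p, so subtracting the first sum and
% adding back twice the cross term recovers (1).
```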

The latter two terms comprise the correction term necessary in order to make WSR unbiased for Γp. As previously mentioned, the weights wi(·) may address heteroscedasticity or robustness, or simply be identically one, in which case (1) becomes a classical yet generalized version of Mallows’s Cp. In the case of robustness, the weights in (1) differ across models, because an observation can be outlying with respect to one model and have full weight in another. We may select a weighting scheme that downweights the observations that are outlying with respect to model P, limiting their influence on Γp and, therefore, on the model selection procedure. This means that models are not penalized for failing to fit a few outlying observations; cf. Ronchetti and Staudte (1994) in the context of linear regression. A Taylor series expansion of the weights (details are provided in Appendix A) allows us to write the final form of our GCp statistic as follows:

$$ GC_p = WSR - \sum_{i=1}^{K} E[w_i^2(\epsilon_i)\,\epsilon_i^2] + 2 \sum_{i=1}^{K} E\left[\left(w_i'(\epsilon_i)\, w_i(\epsilon_i)\,\epsilon_i^2 + w_i^2(\epsilon_i)\,\epsilon_i\right)\delta_i\right] - \sum_{i=1}^{K} E\left[\left(w_i''(\epsilon_i)\, w_i(\epsilon_i)\,\epsilon_i^2 + (w_i'(\epsilon_i))^2\,\epsilon_i^2 + 4\, w_i'(\epsilon_i)\, w_i(\epsilon_i)\,\epsilon_i\right)\delta_i^2\right]. \tag{2} $$

To make the definition (2) operational, we must be able to compute the latter three terms (comprising the correction). Approximations for these terms can be derived and will depend on the specifics of the model under consideration. In Section 2.2 we compute these terms for longitudinal marginal models. Models with small values of GCp will be preferred to others. At this point, the decision of how to proceed will depend on the application at hand. For instance, we may wish to obtain predictions for a few good models with small GCp and then average them in order to draw final conclusions. The weights themselves may also prove very insightful. In the event that they are chosen to address robustness, they can routinely identify outlying observations; cf. Section 4.

2.2 Computation of GCp for a Marginal Longitudinal Model

We now consider a longitudinal data analysis setting, where Yit is the discrete or continuous outcome for subject i at time t, for i = 1, · · · , K and t = 1, · · · , ni. For each outcome Yit, we also measure a set of covariates xit. We write Yi = (Yi1, · · · , Yini)T for the ni × 1 vector of responses, and Xi = (xi1 · · · xini)T for the ni × p matrix of covariates of subject i. We suppose that Corr(Yi) = Ai^{-1} Var(Yi) Ai^{-1}, with Ai = diag(v^{1/2}(µi1), · · · , v^{1/2}(µini)), and that the subjects are independent. Purely dependent data are obtained with K = 1 (only one cluster) and purely independent data are obtained with ni = 1 for all i (GLM). We model the marginal mean E(Yit) = µit, and assume that g(µit) = xitT β for a known link function g, and that V(Yit) = σ² v(µit). For short, we will write vit instead of v(µit). An M-estimator β̂p for model P with p parameters is the solution of the estimating equations proposed by Cantoni (2003):

$$ \sum_{i=1}^{K} D_i^T \Gamma_i^T V_i^{-1} (\Psi_i - c_i) = 0, \tag{3} $$

where Di = Di(Xi, β) = ∂µi/∂β is an ni × p matrix, and Vi = Vi(µi, α) = Ai Ri(α) Ai is an ni × ni matrix. Ri(α), for an s-dimensional parameter α, is called the working correlation matrix, as opposed to the “true” correlation matrix Corr(Yi) = Ai^{-1} Var(Yi) Ai^{-1}. Moreover, Ψi = Wi (Yi − µi) and ci = E(Ψi), where Wi = Wi(Xi, yi, µi) is a diagonal ni × ni weight matrix containing weights wit for t = 1, · · · , ni. These weights may be different from those contained in the definition of our GCp statistic and can be chosen so as to address a number of different objectives, robustness being one example, in which case we refer to Cantoni and Ronchetti (2001) and Cantoni (2003) for a detailed discussion on the choice of weights. Finally, Γi = E(Ψ̃i − c̃i) with Ψ̃i = ∂Ψi/∂µi and c̃i = ∂ci/∂µi.

Note that the classical GEE equations (Liang and Zeger, 1986) are obtained with Wi equal to the identity matrix. Note also that the estimating equation in (3) is a slightly modified version of that in Preisser and Qaqish (1999), in that it includes the matrix Γi which makes it optimal in the class of all estimating equations based on (Yi − µi); see Hanfelt and Liang (1995). Under the usual regularity conditions for M-estimators (Huber, 1981), the estimator defined as the solution of (3) is asymptotically normally distributed with asymptotic variance M^{-1} Q M^{-1}, where

$$ M = \lim_{K\to\infty} \frac{1}{K} \sum_{i=1}^{K} D_i^T \Gamma_i^T V_i^{-1} \Gamma_i D_i \qquad \text{and} \qquad Q = \lim_{K\to\infty} \frac{1}{K} \sum_{i=1}^{K} D_i^T \Gamma_i^T V_i^{-1}\, Var(\Psi_i)\, V_i^{-1} \Gamma_i D_i. $$
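As an illustration of how this sandwich variance could be estimated in practice, the following minimal sketch replaces the limits and expectations above by sample averages over the K subjects. It is not taken from the paper or from any particular GEE package; the input names (per-subject arrays for Di, Γi, Vi, Ψi and ci) are illustrative and assume these quantities have already been computed from a fit of (3).

```python
import numpy as np

def sandwich_variance(D_list, Gamma_list, V_list, Psi_list, c_list):
    """Empirical sandwich estimate of the asymptotic variance M^{-1} Q M^{-1},
    with M and Q replaced by averages over subjects (illustrative sketch)."""
    K = len(D_list)
    p = D_list[0].shape[1]
    M = np.zeros((p, p))
    Q = np.zeros((p, p))
    for D, G, V, Psi, c in zip(D_list, Gamma_list, V_list, Psi_list, c_list):
        H = D.T @ G.T @ np.linalg.inv(V)   # D_i^T Gamma_i^T V_i^{-1}
        M += H @ G @ D / K
        s = H @ (Psi - c)                  # subject i's estimating-function value
        Q += np.outer(s, s) / K            # empirical Var(Psi_i) contribution
    Minv = np.linalg.inv(M)
    return Minv @ Q @ Minv / K             # approximate Var(beta-hat), order 1/K
```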

For such longitudinal marginal models, and writing ψ(εit) = w(εit) · εit, GCp from (2) becomes

$$ GC_p = WSR - \sum_{i=1}^{K} \sum_{t=1}^{n_i} E[\psi^2(\epsilon_{it})] + t_1 - t_2, \tag{4} $$

where

$$ t_1 \simeq \frac{2}{\sigma K} \sum_{i=1}^{K} Tr\left[ M^{-1} E\left(D_i^T Z_i a_i^T A_i^{-1} D_i\right) \right], \tag{5} $$

with Zi = ΓiT Vi^{-1} (Ψi − ci), ai = (ai1 . . . aini)T and ait = ψ(εit) ψ′(εit), and

$$ t_2 \simeq \frac{1}{\sigma^2 K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} Tr\left\{ E\left[ B_i A_i^{-1} D_i M^{-1} D_j^T Z_j Z_j^T D_j M^{-1} D_i^T A_i^{-1} \right] \right\}, \tag{6} $$

with Bi = diag(bi1 . . . bini) and bit = ψ(εit) ψ″(εit) − ψ²(εit)/εit² + (ψ′(εit))². Computations are provided in Appendix B. Notice that the expectations in (5) and (6) can easily be evaluated by Monte Carlo.

If the weights in (2) are chosen to be identically one, we obtain

$$ GC_p = \sum_{i=1}^{K} \sum_{t=1}^{n_i} r_{it}^2 - \sum_{i=1}^{K} n_i + 2 \sum_{i=1}^{K} \sum_{t=1}^{n_i} E[\epsilon_{it}\, \delta_{it}], \tag{7} $$

where we have used the fact that E[εit²] = 1, and where we notice that the term t2 is exactly zero in this case. Moreover, following the same reasoning leading to (5), we obtain in this case

$$ GC_p = \sum_{i=1}^{K} \sum_{t=1}^{n_i} r_{it}^2 - \sum_{i=1}^{K} n_i + \frac{2}{\sigma^2 K} \sum_{i=1}^{K} Tr\left( M^{-1} D_i^T A_i^{-2} D_i \right), \tag{8} $$

which can be computed directly without simulation.
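For concreteness, here is a minimal sketch of how (8) could be evaluated once a GEE fit of submodel P is available. It is not tied to any particular GEE implementation; all input names (yhat, v, D_list, A_list, M, sigma) are illustrative placeholders for the fitted values, the variance function values v(µit), the matrices Di and Ai, the matrix M, and σ from the full model, and the scale factor follows the reconstruction of (8) above.

```python
import numpy as np

def classical_gcp(y, yhat, v, D_list, A_list, M, sigma):
    """Classical GC_p of equation (8): sum of squared residuals with unit
    weights, minus sum_i n_i, plus the trace correction term (sketch)."""
    K = len(D_list)
    r = (y - yhat) / (sigma * np.sqrt(v))        # residuals r_it
    wsr = np.sum(r ** 2)                         # WSR with w == 1
    n_total = sum(D.shape[0] for D in D_list)    # sum_i n_i
    Minv = np.linalg.inv(M)
    trace_term = sum(
        np.trace(Minv @ D.T @ np.linalg.inv(A @ A) @ D)   # Tr(M^{-1} D_i^T A_i^{-2} D_i)
        for D, A in zip(D_list, A_list)
    )
    return wsr - n_total + 2.0 * trace_term / (sigma ** 2 * K)
```

In an all-subsets search, this quantity would be computed for every candidate submodel and the models with the smallest values retained.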

Hereafter, for clarity, we refer to (8) as our classical GCp, and to (4)–(6) as our weighted or robust GCp.

3. SIMULATION STUDY

To assess and compare the performance of our new proposal with existing methods, we have carried out two simulation studies. The first is designed to compare the behavior of the z-test and the stepwise procedure with that of our classical GCp. The second simulation study demonstrates the utility of our robust GCp in handling contaminated data.

3.1 Classical GCp

We consider a marginal longitudinal model (as described in Section 2.2) with logistic link, where the linear predictor is β0 + xitT β for i = 1, . . . , K = 30 and t = 1, . . . , ni = n = 10, with xit and β being of dimension 8. The response Yit is binary (0 or 1). For each individual i at each time point t, the set of explanatory variables xit comprises (in this order): 2 discrete variables (D1 and D2) and 6 continuous variables (C1, . . . , C6). D1 is a dummy variable (e.g. sex) coded 0 or 1 with probability 0.5 each, and D2 is a three-level variable with probabilities 0.35, 0.15 and 0.5, respectively. The 6 continuous variables have been generated independently according to standard normal distributions. The true values of the parameters are β0 = 0.5 and β = (1, 0, 0.5, 0.5, 0.5, 0, 0, 0), meaning that the true model generating the data is defined by the intercept, D1, C1, C2 and C3. The correlation between observations on the same subject is exchangeable, that is, for each i, Corr(Yit, Yit′) = α = 0.1 for t ≠ t′. The subjects are assumed independent.

We simulated 300 replications of block-correlated binary responses Y = (Y1, . . . , YK) following the algorithm described in Emrich and Piedmonte (1991). This sample represents our non-contaminated data. To obtain a slightly contaminated dataset we flipped (from 0 to 1, or the converse) 5% randomly chosen responses from each replication contained in the non-contaminated data (a sketch of this step is given below). This contamination is reflective of what occurs in practice, when in a few cases a zero might be recorded as a one or vice versa.
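A minimal sketch of the contamination step, assuming Y is a NumPy array of simulated binary responses (the function name and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2003)

def contaminate(Y, frac=0.05):
    """Flip a randomly chosen fraction (here 5%) of the binary responses
    from 0 to 1 or vice versa."""
    Y = Y.copy()
    idx = rng.choice(Y.size, size=int(round(frac * Y.size)), replace=False)
    Y.flat[idx] = 1 - Y.flat[idx]
    return Y
```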

Table 1
Proportion of good models selected by the different techniques for non-contaminated and contaminated data.

                      GCp     z-test   stepwise
Non contaminated      88%     68%      83%
Contaminated          80%     53%      69%

In addition, it is compatible with the fact that we consider observation downweighting, as opposed to cluster downweighting.

We estimated the parameters according to (3) with Wi = I (and therefore Γi = I and ci = 0), which reproduces the Zeger and Liang (1986) equations, and investigated three variable selection strategies:

• z-test: Fit the full model and retain all the variables for which the Wald test (β̂²/V̂ar(β̂) ∼ χ1²) gives a p-value lower than 0.05 (see the sketch below).

• stepwise: Backward stepwise selection based on the Wald test, with the cutoff on the p-value set at 0.1.

• GCp: Identify the model with the smallest GCp as defined by (7).

In Table 1 we report the results of the simulation study. Following the definition in Shao (1993), p. 487, a selected model is considered good if it contains the true model generating the data. On the other hand, incorrect models are those where at least one variable used to generate the data is missing. These results show that classical GCp and stepwise perform similarly on non-contaminated data, while the z-test is less successful at identifying good models. This is explained by the fact that, despite the independence of the explanatory variables, the estimated coefficients of the full model are not independent (as would be the case in linear regression, for example), and therefore the single-step procedure does not perform as well as the other techniques.

When we apply these three techniques to contaminated data (see Table 1 again), we see that classical GCp appears robust by design, whereas the success rate of the z-test and stepwise procedure drops considerably, showing that both these variable selection procedures are more affected by outlying observations than GCp.
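A sketch of the single-step z-test rule referenced in the list above (names are illustrative; beta_hat and its variance would come from the full-model GEE fit):

```python
import numpy as np
from scipy import stats

def z_test_selection(beta_hat, var_beta_hat, alpha=0.05):
    """Fit the full model once, then keep every covariate whose Wald
    statistic beta^2 / Var(beta) has a chi-square(1) p-value below alpha."""
    wald = beta_hat ** 2 / var_beta_hat
    pvalues = stats.chi2.sf(wald, df=1)
    return np.flatnonzero(pvalues < alpha)   # indices of retained covariates
```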


Table 2
Proportion of good models selected by the classical and the robust approaches on contaminated data.

            GCp     z-test   stepwise
Classical   82%     60%      73%
Robust      91%     63%      72%

3.2 Robust GCp

In the presence of contaminated data, one could argue that robust versions of the variable selection procedures should be used. To check their performance, we consider here a simulation setting similar to, but simpler than, that discussed in Section 3.1, with only 5 explanatory variables, namely D1, D2, C1, C2 and C3. All the parameters are kept the same as in Section 3.1, except for β, which takes the value (1, 0, 0.5, 0.5, 0). 100 contaminated replications are produced.

For the classical procedures we proceed as in Section 3.1. For the robust approach, the regression parameters are estimated robustly according to (3), where Wi = diag(wi1, . . . , wini), with wit = c/|εit| if |εit| > c and wit = 1 otherwise (c = 1.5). The nuisance exchangeable correlation parameter α is estimated by a generalized version of the method of moments, as defined in Cantoni (2003) (k = 2.4). In addition to the three classical variable selection approaches of the previous section, three robust techniques are compared: our robust GCp as defined by (4) with ψ = ψc(x) = x min(1, c/|x|) being the Huber function (c = 1.5); a robust z-test; and a robust stepwise procedure, the latter two based on a Wald-type test of the form β̂rob²/V̂ar(β̂rob), distributed as χ1² under the null hypothesis; see Heritier and Ronchetti (1994).

The results of Table 2 indicate that even though the classical GCp is robust by design, its performance on contaminated data can be improved by using its robust version. On the other hand, the robust versions of the z-test and stepwise procedure cannot handle contaminated data any better than their classical counterparts.
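To make the robust ingredients concrete, here is a minimal sketch of the Huber ψ function used above, together with the quantities ait and bit entering the correction terms (5) and (6). The derivation behind (2) assumes smooth weights while Huber’s ψ is only piecewise smooth, so the derivatives below are taken almost everywhere; the function names are illustrative.

```python
import numpy as np

C = 1.5  # Huber tuning constant used in the simulations

def psi(x):
    """Huber psi: x inside [-c, c], c*sign(x) outside."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= C, x, C * np.sign(x))

def psi_prime(x):
    """First derivative (a.e.): 1 inside [-c, c], 0 outside."""
    return (np.abs(np.asarray(x, dtype=float)) <= C).astype(float)

def psi_second(x):
    """Second derivative: 0 almost everywhere."""
    return np.zeros_like(np.asarray(x, dtype=float))

def a_it(eps):
    """a_it = psi(eps) * psi'(eps), as in (5)."""
    return psi(eps) * psi_prime(eps)

def b_it(eps):
    """b_it = psi psi'' - psi^2/eps^2 + (psi')^2, as in Section 2.2
    (eps assumed nonzero)."""
    return psi(eps) * psi_second(eps) - psi(eps) ** 2 / eps ** 2 + psi_prime(eps) ** 2
```

As a sanity check, with unit weights ψ(ε) = ε, so a_it = εit and b_it = 0, recovering the fact that t2 vanishes in the classical case.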

[Figure 1. Distribution of GCp for the good models: boxplots of the classical and robust GCp values (roughly between −20 and 40) for the correct model TFTTF and the good models TFTTT, TTTTF and TTTTT.]

It is interesting to take a closer look at the distribution of the classical and robust GCp statistics for good models in the contaminated setting. We identify the models with a five-letter sequence of T (=True) and F (=False), according to the inclusion of the corresponding parameters. For instance, in our simulations the model that generated the data is TFTTF. Then, among the 2⁵ (= 32) possible models there are 3 other good models, namely TFTTT, TTTTF and TTTTT. Figure 1 shows boxplots of the values of classical and robust GCp for these four models. It clearly appears that not only are the values of the robust GCp smaller in median than those of the classical counterpart, but their variability is also lower. This confirms that the robust technique is more stable than its classical counterpart and should be preferred in the presence of misspecification of the model.

4. APPLICATION ON REAL DATA

Many healthcare professionals are trained in direct laryngoscopic endotracheal intubation (LEI), which is a potentially life-saving procedure. Unfortunately, there is little information to indicate the amount of training required to assure competent performance of LEI. In general, airway training programs for healthcare personnel, such as paramedics and respiratory therapists, are not standardized and may be inadequate. This inadequacy is

a grave concern, given the serious consequences of failed airway management. We now examine data from a prospective longitudinal study on LEI directed by Dr. Orlando Hung of the Department of Anesthesia, Dalhousie University. A portion of these data was previously analyzed in Mills, Field and Dupuis (2003). The goal here is to identify features of the process of LEI which are predictive of a successful LEI. These features are initially all entered as covariates into our proposed full model, with the goal of identifying those covariates which are most predictive of successful completion. Variable selection is hence a natural choice here, as the model(s) chosen will include only those covariates significant in predicting successful completion of LEI.

A total of 438 LEI were analyzed during this longitudinal study. We let the response Yij equal 1 if trainee i performs a complete LEI in less than 30 seconds on trial j, and 0 otherwise. We judge trainees based on the following covariates: whether the head and neck were in optimal position, i.e. neck flexion and atlanto-occipital extension (NECKFLEX and EXTOA, respectively); whether they inserted the scope properly (PROPLGSP); whether they performed the lift successfully (PROPLIFT); whether there was an appropriate request for help (ASKAS); whether there was unsolicited intervention by the attending anesthesiologist (HELP); the number of complications (COMPS); and the trainee’s handedness (TRHAND) and sex (TRGEND). All covariates are binary with the exception of COMPS, which is ordinal. 19 trainees performed anywhere from 18 to 33 trials. A covariate TRIALCAT was also defined: TRIALCAT = 1 for trials 1–5, TRIALCAT = 2 for trials 6 through 10, and so forth (see the one-line sketch below).

Our classical GCp procedure selects the model containing covariates TRIALCAT, PROPLGSP, PROPLIFT, ASKAS, HELP and COMPS, whereas both the classical z-test and stepwise approach select a subset of these particular covariates. The stepwise approach excludes ASKAS from the model, while the z-test excludes both ASKAS and COMPS. Such behavior is consistent with that observed in our simulation results, where we saw the z-test and stepwise approach favoring (often incorrectly) smaller models.

The robust procedures are quite insightful. Robust GCp as defined by (4), with ψ = ψc(x) = x min(1, c/|x|) being the Huber function (c = 1.5), selects the model containing covariates TRIALCAT, NECKFLEX, EXTOA, PROPLIFT, ASKAS, HELP and COMPS. That is, it includes two extra covariates (NECKFLEX and EXTOA) when compared to that selected by the classical GCp procedure. The robust z-test and robust stepwise procedures again select a subset of these particular covariates: the stepwise approach excludes EXTOA, while the robust z-test excludes both EXTOA and COMPS.
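The TRIALCAT grouping described above amounts to the following one-liner (an illustrative sketch, not code from the study):

```python
def trialcat(trial: int) -> int:
    """Map a trial number to its five-trial block: 1-5 -> 1, 6-10 -> 2, ..."""
    return (trial - 1) // 5 + 1
```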


[Figure 2. Observation weights in (4) for the model selected by robust GCp: weights plotted against observation index (1 to 438), with the nine observations receiving weight below 0.4 labeled individually.]

The above results make it evident that each robust technique selects a larger model than its classical counterpart. In addition, of the three techniques, GCp tends to select larger models. In most cases, including an extra covariate is preferable to excluding one that may be important.

Our robust GCp is designed to automatically downweight outlying data, thereby reducing their influence on model selection. Figure 2 shows the weights associated with each observation in the GCp formula (4) for the model selected by the robust GCp procedure. These weights allow one to identify the outlying observations. There are 9 observations (corresponding to 2%, identified in Figure 2) that were heavily downweighted by the procedure (weight less than 0.4). The disagreement between the results obtained using robust and classical GCp suggests that the outlying data have a significant impact on the model chosen. For this reason the model selected by the robust GCp procedure is to be preferred in this case.
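This diagnostic use of the weights is mechanical; a sketch, assuming weights is the array plotted in Figure 2:

```python
import numpy as np

def flag_outliers(weights, cutoff=0.4):
    """Indices of observations heavily downweighted by the robust fit."""
    return np.flatnonzero(np.asarray(weights) < cutoff)
```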


5. CONCLUSIONS

Model selection is an important part of any statistical analysis. In requiring only observations and a model from which predicted values can be obtained, GCp can be applied to a wide range of statistical problems. Its design should also make it a welcome addition to areas where model selection procedures are mainly based on Wald-type tests or stepwise procedures, both of which must be used with caution. In choosing to focus on longitudinal marginal modeling, much has been learned about the performance of the z-test and stepwise procedure, and important comparisons drawn with that of GCp. GCp performs as well as the stepwise procedure and much better than the z-test in identifying good models. GCp goes on to exceed both approaches when faced with contamination, as often occurs in practice. Work in progress includes the application of GCp to nonparametric techniques (such as generalized additive models) and nonparametric extensions to longitudinal data.

Acknowledgements

The financial support of the Swiss National Science Foundation, Project Number 1214-66989, is gratefully acknowledged. The authors also thank Chris Field for some insightful comments.

Appendix

A. Derivation of the correction term in the formula for GCp

Here we derive the correction term

$$ C = -\sum_{i=1}^{K} E[w_i^2(r_i)\,\epsilon_i^2] + 2 \sum_{i=1}^{K} E[w_i^2(r_i)\,\epsilon_i \delta_i]. $$

Consider first the Taylor expansion of the weights wi(ri) around wi(εi):

$$ w_i(r_i) = w_i(\epsilon_i) + (r_i - \epsilon_i)\, w_i'(\epsilon_i) + \frac{1}{2} (r_i - \epsilon_i)^2\, w_i''(\epsilon_i) + \frac{1}{6} (r_i - \epsilon_i)^3\, w_i'''(\epsilon_i^*) $$

for some εi* lying between ri and εi. It is assumed that the weight function wi is even and three times differentiable. The above expansion leads us to an approximate expression for wi²(ri), from which we obtain equation (2).
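Explicitly (a step left implicit here), since ri − εi = −δi, squaring the expansion and keeping terms up to order δi² gives:

```latex
w_i^2(r_i) \;\approx\; w_i^2(\epsilon_i)
  \;-\; 2\,\delta_i\, w_i(\epsilon_i)\, w_i'(\epsilon_i)
  \;+\; \delta_i^2 \left[ \left(w_i'(\epsilon_i)\right)^2
        + w_i(\epsilon_i)\, w_i''(\epsilon_i) \right],
```

and substituting this into C and collecting the δi and δi² terms yields the correction displayed in (2).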

B. Derivation and computation of t1 and t2

In the setting of Section 2.2, (2) becomes

$$ GC_p = WSR - \sum_{i=1}^{K} \sum_{t=1}^{n_i} E[w_{it}^2(\epsilon_{it})\,\epsilon_{it}^2] + 2 \sum_{i=1}^{K} \sum_{t=1}^{n_i} E\left[\left(w_{it}'(\epsilon_{it})\, w_{it}(\epsilon_{it})\,\epsilon_{it}^2 + w_{it}^2(\epsilon_{it})\,\epsilon_{it}\right)\delta_{it}\right] - \sum_{i=1}^{K} \sum_{t=1}^{n_i} E\left[\left(w_{it}''(\epsilon_{it})\, w_{it}(\epsilon_{it})\,\epsilon_{it}^2 + (w_{it}'(\epsilon_{it}))^2\,\epsilon_{it}^2 + 4\, w_{it}'(\epsilon_{it})\, w_{it}(\epsilon_{it})\,\epsilon_{it}\right)\delta_{it}^2\right], \tag{9} $$

where

$$ WSR = \sum_{i=1}^{K} \sum_{t=1}^{n_i} w_{it}^2(r_{it})\, r_{it}^2, $$

rit = (yit − ŷit)/(σ vit^{1/2}), εit = (yit − Eyit)/(σ vit^{1/2}) and δit = (ŷit − Eyit)/(σ vit^{1/2}). Notice that Eyit and σ² vit are the expected value and variance under the full model. By defining ψ(εit) = w(εit) · εit, we can rewrite (9) as

$$ GC_p = WSR - \sum_{i=1}^{K} \sum_{t=1}^{n_i} E[\psi^2(\epsilon_{it})] + t_1 - t_2, $$

where

$$ t_1 = 2 \sum_{i=1}^{K} \sum_{t=1}^{n_i} E[\psi(\epsilon_{it})\, \psi'(\epsilon_{it})\, \delta_{it}] \tag{10} $$

and

$$ t_2 = \sum_{i=1}^{K} \sum_{t=1}^{n_i} E\left[\left( \psi(\epsilon_{it})\,\psi''(\epsilon_{it}) - \frac{\psi^2(\epsilon_{it})}{\epsilon_{it}^2} + (\psi'(\epsilon_{it}))^2 \right) \delta_{it}^2 \right]. \tag{11} $$

At this point we use the structure of the marginal model in order to compute the expectations in (10) and (11). Let δit = (ŷit − Eyit)/(σ vit^{1/2}), which can be expressed alternatively as δit = (g^{-1}(xitT β̂) − g^{-1}(xitT β))/(σ vit^{1/2}). A Taylor expansion of g^{-1}(xitT β̂) about g^{-1}(xitT β) results in the following approximation for δit:

$$ \delta_{it} \simeq \frac{1}{\sigma v_{it}^{1/2}} \left. \frac{\partial g^{-1}(\theta)}{\partial \theta} \right|_{\theta = x_{it}^T \beta} x_{it}^T (\hat{\beta} - \beta). $$

On the other hand, the influence function (Hampel, 1974; Hampel et al., 1986) of the estimator defined by (3) is given by

$$ IF((X, y); \hat{\beta}, F_\beta) = M^{-1} D^T \Gamma^T V^{-1} (\Psi - c) $$

for a generic observation (X, y). We therefore have the following expansion for (β̂ − β) as K → ∞:

$$ \hat{\beta} - \beta \simeq \frac{1}{K} \sum_{j=1}^{K} IF((X_j, y_j); \hat{\beta}, F_\beta) = \frac{1}{K} \sum_{j=1}^{K} M^{-1} D_j^T \Gamma_j^T V_j^{-1} (\Psi_j - c_j), $$

which implies

$$ \delta_i = \frac{1}{\sigma}\, A_i^{-1} D_i M^{-1} \frac{1}{K} \sum_{j=1}^{K} D_j^T \Gamma_j^T V_j^{-1} (\Psi_j - c_j) \tag{12} $$

for δi = (δi1 . . . δini)T. We can now express t1 of equation (10) as

$$ t_1 \simeq 2 \sum_{i=1}^{K} E[Tr(\delta_i a_i^T)]. $$

Substituting our expression for δi from equation (12) into the above and recognizing that all j ≠ i terms are 0 by independence, we obtain the following expression for t1:

$$ t_1 \simeq \frac{2}{\sigma K} \sum_{i=1}^{K} E\left[ Tr\left( A_i^{-1} D_i M^{-1} D_i^T \Gamma_i^T V_i^{-1} \left(W_i (y_i - \mu_i) - c_i\right) a_i^T \right) \right], $$

and by using the properties of expectation and trace,

$$ t_1 \simeq \frac{2}{\sigma K} \sum_{i=1}^{K} Tr\left[ M^{-1} E\left( D_i^T Z_i a_i^T A_i^{-1} D_i \right) \right], \tag{13} $$

where Zi = ΓiT Vi^{-1} (Wi (yi − µi) − ci), ai = (ai1 . . . aini)T and ait = ψ(εit) ψ′(εit). Using the same arguments we can express t2 in equation (11) as

$$ t_2 \simeq \sum_{i=1}^{K} E[Tr(B_i \delta_i \delta_i^T)], $$

where Bi = diag(bi1 . . . bini) and bit is defined in Section 2.2. Substituting our expression for δi from equation (12) into that above and removing those terms whose expectation is zero (by independence), we obtain the following expression for t2:

$$ t_2 \simeq \frac{1}{\sigma^2 K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} Tr\left\{ E\left[ B_i A_i^{-1} D_i M^{-1} D_j^T Z_j Z_j^T D_j M^{-1} D_i^T A_i^{-1} \right] \right\}. \tag{14} $$

References

Breiman, L. (1996). Bagging predictors. Machine Learning 24, 123–140.

Cantoni, E. (2003). A robust approach to longitudinal data analysis. Canadian Journal of Statistics, under revision.

Cantoni, E. and Ronchetti, E. (2001). Robust inference for generalized linear models. Journal of the American Statistical Association 96, 1022–1030.

Diggle, P. J., Heagerty, P., Liang, K.-Y. and Zeger, S. L. (2002). Analysis of Longitudinal Data. Oxford University Press, New York.

Emrich, L. J. and Piedmonte, M. R. (1991). A method for generating high-dimensional multivariate binary variates. The American Statistician 45, 302–304.

Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association 69, 383–393.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.

Hanfelt, J. J. and Liang, K.-Y. (1995). Approximate likelihood ratios for general estimating functions. Biometrika 82, 461–477.

Heritier, S. and Ronchetti, E. (1994). Robust bounded-influence tests in general parametric models. Journal of the American Statistical Association 89, 897–904.

Horton, N. J. and Lipsitz, S. R. (1999). Review of software to fit generalized estimating equation regression models. The American Statistician 53, 160–169.

Huber, P. J. (1981). Robust Statistics. Wiley, New York.

Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13–22.

Mallows, C. L. (1973). Some comments on Cp. Technometrics 15, 661–675.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall, London, second edition.

Mills, J. E., Field, C. A. and Dupuis, D. J. (2003). Marginally specified generalized linear mixed models: A robust approach. Biometrics 58, 727–734.

Preisser, J. S. and Qaqish, B. F. (1999). Robust regression for clustered data with applications to binary regression. Biometrics 55, 574–579.

Ronchetti, E. and Staudte, R. G. (1994). A robust version of Mallows’ Cp. Journal of the American Statistical Association 89, 550–559.

Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association 88, 486–494.

Zeger, S. L. and Liang, K.-Y. (1986). Longitudinal data analysis for discrete and continuous outcomes. Biometrics 42, 121–130.

Ziegler, A. and Grömping, U. (1998). The generalised estimating equations: A comparison of procedures available in commercial statistical software packages. Biometrical Journal 40, 245–260.

Ziegler, A., Kastner, C., Brunner, D. and Blettner, M. (2000). Familial associations of lipid profiles: A generalized estimating equations approach. Statistics in Medicine 19, 3345–3357.

