Influence diagnostics for generalized linear mixed models: a gradient ...

Influence diagnostics for generalized linear mixed models: a gradient-like statistic

Marco Enea, Antonella Plaia

Abstract In the literature, many influence measures proposed for Generalized Linear Mixed Models (GLMMs) require the information matrix, which can be difficult to calculate. In the present paper, a known influence measure is approximated to obtain a simpler form, for which the information matrix is no longer necessary. The proposed measure is shown to have a form similar to the recently introduced gradient statistic. Simulation studies show good performance.

Key words: GLMM, outliers, diagnostics, gradient statistic

1 Introduction

Generalized Linear Mixed Models (GLMMs) [5] are useful extensions of both linear mixed models and generalized linear models, since they account for additional components of variability due to latent random effects. For this reason, these models have received growing attention over the past decades. Unfortunately, the model estimates may depend heavily on a small part of the dataset, or even on a single observation or cluster. Therefore, identifying potentially influential outliers is an important step beyond estimation in GLMMs.

There are two major approaches for detecting influential observations. The first is the local influence approach [3], which develops diagnostic measures by using the curvature of the influence graph of an appropriate function. The second, the deletion approach [1], develops a diagnostic measure by assessing the change in a chosen quantity induced by excluding individual data points from the analysis. However, since the observed-data likelihood function in a GLMM involves intractable integrals, developing and evaluating deletion diagnostic measures on the basis of Cook's approaches is rather difficult.

In this paper, on the grounds of the measure suggested by Cook [3], we derive a diagnostic measure which does not require the information matrix, while maintaining the same large-sample behaviour. In fact, as will be shown later, the proposed measure is the analogue of the gradient statistic, introduced by Terrell [7] and further studied by Lemonte [4]. The performance of the proposed measure is assessed on a GLMM through simulation studies based on well-tested schemes.

Dipartimento di Scienze Economiche Aziendali e Statistiche, University of Palermo, Palermo, Italy


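2 Influence diagnostics

Let $y_{ij}$ be the response of unit $j$ of the $i$-th cluster, $j = 1, \ldots, n_i$, $i = 1, \ldots, N$, and let $\mathbf{x}_{ij}$ and $\mathbf{z}_{ij}$ be the covariate arrays. The GLMM is written as

$$g(\mu_{ij}) = g(E[y_{ij} \mid \mathbf{x}_{ij}, \mathbf{b}_i]) = \eta_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_{ij}'\mathbf{b}_i,$$

where $\boldsymbol{\beta}$ is a $p$-vector of fixed parameters and the $\mathbf{b}_i$ are assumed to be $N(\mathbf{0}, \mathbf{G})$. The three most used influence measures, computable for a single observation, a cluster or, more generally, a subset $M_i$ of it, are the log-likelihood distance

$$LD_{M_i} = 2\{L(\mathbf{y} \mid \hat{\boldsymbol{\zeta}}) - L(\mathbf{y} \mid \hat{\boldsymbol{\zeta}}_{(M_i)})\}, \qquad (1)$$

Cook's distance [1]

$$CD_{M_i} = (\hat{\boldsymbol{\zeta}} - \hat{\boldsymbol{\zeta}}_{(M_i)})' \{-\ddot{L}\} (\hat{\boldsymbol{\zeta}} - \hat{\boldsymbol{\zeta}}_{(M_i)}), \qquad (2)$$

and Cook's total influence measure [3]

$$C_{M_i} = 2\,|\boldsymbol{\Delta}_{M_i}' \ddot{L}^{-1} \boldsymbol{\Delta}_{M_i}|, \qquad (3)$$

where $\ddot{L}$ is the Hessian matrix of the log-likelihood with respect to the parameter vector $\boldsymbol{\zeta} = (\boldsymbol{\beta}', \boldsymbol{\delta}')'$, with $\boldsymbol{\delta}$ corresponding to the variance components. Here $\boldsymbol{\Delta}_{M_i} = \mathbf{s}_i - \mathbf{s}_{i(M_i)}$, where $\mathbf{s}_i = (\mathbf{s}_{i\beta}', \mathbf{s}_{i\delta}')'$, is the subvector given by the difference between the contribution of cluster $i$ to the score function and the score function for that cluster without the set $M_i$; details can be found in [6]. Of course, if interest is only in the influence of the $i$-th cluster, it is sufficient to consider $\boldsymbol{\Delta}_{M_i} = \mathbf{s}_i$. Both $\ddot{L}$ and $\boldsymbol{\Delta}_{M_i}$ are calculated at $\boldsymbol{\zeta} = \hat{\boldsymbol{\zeta}}$. Notice that we use the "total", as opposed to the "local", influence measure in the sense that (3) is the deletion diagnostic subcase of [3], initially proposed to construct influence curves.

Now, let $\hat{\boldsymbol{\zeta}}_{(M_i)}$ be the estimate of $\boldsymbol{\zeta}$ when subset $M_i$ is deleted. Since

$$\hat{\boldsymbol{\zeta}}_{(M_i)} \approx \hat{\boldsymbol{\zeta}} - [\ddot{L}_{(M_i)}(\hat{\boldsymbol{\zeta}})]^{-1} \boldsymbol{\Delta}_{(M_i)}(\hat{\boldsymbol{\zeta}}),$$

and by considering that $[\ddot{L}_{(M_i)}(\hat{\boldsymbol{\zeta}})]^{-1}$ can be approximated by $[\ddot{L}(\hat{\boldsymbol{\zeta}})]^{-1}$, as done by Zhu et al. [9], we have

$$\hat{\boldsymbol{\zeta}} - \hat{\boldsymbol{\zeta}}_{(M_i)} \approx \ddot{L}^{-1} \boldsymbol{\Delta}_{(M_i)}. \qquad (4)$$

Pre-multiplying both members of (4) by $\boldsymbol{\Delta}_{(M_i)}'$ gives

$$\boldsymbol{\Delta}_{(M_i)}' (\hat{\boldsymbol{\zeta}} - \hat{\boldsymbol{\zeta}}_{(M_i)}) \approx \boldsymbol{\Delta}_{(M_i)}' \ddot{L}^{-1} \boldsymbol{\Delta}_{(M_i)}. \qquad (5)$$

Notice the similarity between the first member of (5) and the gradient statistic $\boldsymbol{\Delta}_0' (\hat{\boldsymbol{\zeta}} - \boldsymbol{\zeta}_0)$. Such a statistic is asymptotically $\chi^2$ distributed, although it is not a quadratic form and might assume negative values for small sample sizes. By considering that $\sum_{i=1}^{N} \boldsymbol{\Delta}_{M_i} = \mathbf{0}$ and given that $\boldsymbol{\Delta}_{(M_i)} = \sum_{j \neq i} \boldsymbol{\Delta}_{M_j} = -\boldsymbol{\Delta}_{M_i}$, (5) becomes

$$\boldsymbol{\Delta}_{M_i}' (\hat{\boldsymbol{\zeta}}_{(M_i)} - \hat{\boldsymbol{\zeta}}) \approx \boldsymbol{\Delta}_{M_i}' \ddot{L}^{-1} \boldsymbol{\Delta}_{M_i}.$$

Finally, by substituting in (3) we have

$$C_{M_i} \approx C^a_{M_i} = 2\,|\boldsymbol{\Delta}_{M_i}' (\hat{\boldsymbol{\zeta}}_{(M_i)} - \hat{\boldsymbol{\zeta}})|, \qquad (6)$$

which is a measure of influence, because of the distance $\hat{\boldsymbol{\zeta}}_{(M_i)} - \hat{\boldsymbol{\zeta}}$, and for which the information matrix is no longer necessary. Notice that if the $M_i$-th subset is influential, $\hat{\boldsymbol{\zeta}}_{(M_i)} - \hat{\boldsymbol{\zeta}}$ will be large and the accuracy of (4) is likely to be lower. However, such accuracy is no longer needed as long as $C^a_{M_i}$ is "sufficiently large to draw our attention for further consideration" [2, p. 182].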
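To make the comparison between (3) and (6) concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a plain log-link Poisson regression without random effects, so that the score and Hessian have closed forms, and it takes the deleted estimate from a full refit without cluster i. The function names `fit_poisson` and `cluster_influence` are hypothetical.

```python
import numpy as np

def fit_poisson(X, y, n_iter=100, tol=1e-10):
    """Newton-Raphson MLE for a log-link Poisson regression (no random effects)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)            # gradient of the log-likelihood
        hessian = -(X.T * mu) @ X         # Hessian of the log-likelihood
        step = np.linalg.solve(hessian, score)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def cluster_influence(X, y, cluster):
    """Return {cluster: (C, C_a)} for cluster deletion, following (3) and (6)."""
    beta_hat = fit_poisson(X, y)
    mu = np.exp(X @ beta_hat)
    hessian = -(X.T * mu) @ X             # evaluated at the full-data estimate
    out = {}
    for c in np.unique(cluster):
        idx = cluster == c
        # Score contribution of cluster c at the full-data estimate (Delta in the text)
        delta = X[idx].T @ (y[idx] - mu[idx])
        # Deleted estimate: refit without cluster c (assumption of this sketch)
        beta_del = fit_poisson(X[~idx], y[~idx])
        C = 2.0 * abs(delta @ np.linalg.solve(hessian, delta))    # needs the Hessian
        C_a = 2.0 * abs(delta @ (beta_del - beta_hat))            # Hessian-free
        out[c] = (C, C_a)
    return out

# Illustrative data: 10 clusters of 30 units, x = j/n, cluster-level heterogeneity
rng = np.random.default_rng(1)
n_clusters, n = 10, 30
cluster = np.repeat(np.arange(n_clusters), n)
x = np.tile(np.arange(1, n + 1) / n, n_clusters)
b = rng.normal(0.0, np.sqrt(0.2), size=n_clusters)
y = rng.poisson(np.exp(x + b[cluster]))
X = np.column_stack([np.ones_like(x), x])

measures = cluster_influence(X, y, cluster)
for c, (C, C_a) in measures.items():
    print(f"cluster {c}: C = {C:.4f}, C_a = {C_a:.4f}")
```

In this simplified setting the Hessian is cheap, so the sketch only illustrates how the two quantities are formed; the practical appeal of (6) lies in GLMMs, where the Hessian of the observed-data log-likelihood is hard to obtain.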

3 Simulation studies

Following well-tested simulation schemes [8], a small-scale simulation study is performed from the following model:

$$y_{ij} \mid b_i \sim \mathrm{Poisson}(\mu_{ij}), \quad b_i \sim N(0, \sigma^2), \quad \log(\mu_{ij}) = x_{ij}\beta + b_i,$$

where $j = 1, \ldots, n$, $i = 1, \ldots, 10$, with equal sample size $n$ in each cluster $i$. The single variable $x_{ij}$ is chosen as $j/n$, while $\beta = 1$, $\sigma^2 = 0.1, 0.2, 1.0$ and $n = 30, 100, 200$; both 100 and 1000 replications are considered for each combination of $\sigma^2$ and $n$. Both measures are calculated on models estimated with an added intercept. The aim is to investigate the performance of $C^a_{M_i}$ by assessing its concordance with $C_{M_i}$ in terms of the proportion of correct identification of: (a) the cluster with the largest $C_i$; (b) the two clusters with the largest and the second largest $C_i$. Table 1 shows the results of the simulation. Observe that the proportions of correct identification are at least 83% for (a) and 62.4% for (b), which is a good result.

Table 1: Proportion (%) of correct identification of (a) the cluster with the largest $C_i$ and (b) the two clusters with the largest and the second largest $C_i$, using $C_i^a$.

Replications          n = 30, σ²           n = 100, σ²          n = 200, σ²
                      0.1   0.2   1.0      0.1   0.2   1.0      0.1   0.2   1.0
100           (a)     89.0  83.0  90.0     87.0  95.0  90.0     88.0  91.0  94.0
              (b)     64.0  67.0  73.0     66.0  74.0  71.0     73.0  75.0  80.0
1000          (a)     87.7  88.0  86.1     89.7  92.4  87.6     89.2  92.2  89.7
              (b)     62.4  68.9  68.9     67.8  73.9  72.3     64.9  75.8  75.9
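The data-generating scheme and the concordance criteria (a) and (b) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names `simulate_poisson_glmm` and `top_k_agreement` are hypothetical, and criterion (b) is read here as requiring the two top clusters to match in the same order, which the text does not state explicitly. Model fitting is left out.

```python
import numpy as np

def simulate_poisson_glmm(n, sigma2, n_clusters=10, beta=1.0, rng=None):
    """Draw one dataset from the Poisson random-intercept model of Section 3."""
    rng = np.random.default_rng(rng)
    b = rng.normal(0.0, np.sqrt(sigma2), size=n_clusters)   # b_i ~ N(0, sigma^2)
    x = np.tile(np.arange(1, n + 1) / n, n_clusters)        # x_ij = j / n
    cluster = np.repeat(np.arange(n_clusters), n)
    mu = np.exp(x * beta + b[cluster])                      # log(mu_ij) = x_ij * beta + b_i
    y = rng.poisson(mu)
    return x, y, cluster

def top_k_agreement(c_exact, c_approx, k):
    """True if both measures pick the same k most influential clusters, in the same order."""
    top_exact = np.argsort(c_exact)[::-1][:k]
    top_approx = np.argsort(c_approx)[::-1][:k]
    return bool(np.array_equal(top_exact, top_approx))

# One simulated dataset for n = 30 and sigma^2 = 0.2
x, y, cluster = simulate_poisson_glmm(n=30, sigma2=0.2, rng=0)
print(x.shape, y.shape, np.bincount(cluster))

# Toy influence values for three clusters (made up, only to show the helper)
print(top_k_agreement([3.0, 1.0, 2.0], [30.0, 10.0, 20.0], k=2))
```

Averaging `top_k_agreement(..., k=1)` and `top_k_agreement(..., k=2)` over replications would give proportions of the kind reported in Table 1, once $C_i$ and $C_i^a$ have been computed from a fitted model for each dataset.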

A second and a third simulation study were carried out. Using the same parameters as in the previous study, the second study was aimed at assessing the concordance among $LR_i$, $CD_i$, $C_i/2$ and $C_i^a/2$ on 100 "typical" datasets [8]. We generated 101 datasets, picked the one with the median log-likelihood value, and repeated the procedure 100 times. As a result, $C_i^a/2$ showed high concordance rates, above all with $LR_i$ and $C_i/2$. Figure 1 shows the results for some "median" datasets, varying $\sigma^2$ and $n$.

In the third study, we assessed the concordance between $C_i$ and $C_i^a$ in a medium-large simulation study with artificially created outliers. Cluster-level and observation-level diagnostics were carried out, showing very slight differences between $C_i$ and $C_i^a$ but, due to lack of space, these results are not reported here.

In conclusion, we have shown that the gradient statistic has an analogue in the study of influence, and that this analogue performs well. Although we have used the gradient-like measure in the context of GLMMs, it can be used, like $LR_{(M_i)}$, $CD_{(M_i)}$ and $C_{(M_i)}$, to carry out influence diagnostics on any model.

[Figure 1 (plot not recoverable from the extracted text): a grid of panels titled by sigma^2 (0.1, 0.2, 1) and n (30, 100, 200), with horizontal-axis ticks from 2 to 10; the legend distinguishes Cai, CDi, Ci and LRi.]

Fig. 1: Cluster-oriented diagnostics from the second simulation scheme.

References

[1] Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics 19, 15–18.
[2] Cook, R. D., Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall.
[3] Cook, R. D. (1986). Assessment of local influence. J. Roy. Stat. Soc. B Met. 48(2), 133–169.
[4] Lemonte, A. J. (2013). On the gradient statistic under model misspecification. Statistics and Probability Letters 83, 390–398.
[5] McCulloch, C. E., Searle, S. R. (2001). Generalized, Linear, and Mixed Models. Wiley, New York.
[6] Ouwens, M. J. N. M., Tan, F. E. S., Berger, M. P. F. (2001). Local influence to detect influential data structures for generalized linear mixed models. Biometrics 57(4), 1166–1172.
[7] Terrell, G. R. (2002). The gradient statistic. Computing Science and Statistics 34, 206–215.
[8] Xu, L., Lee, S., Poon, W. (2006). Deletion measures for generalized linear mixed models. Comput. Stat. Data An. 51, 1131–1146.
[9] Zhu, H., Lee, S., Wei, B., Zhou, J. (2001). Case-deletion measures for models with incomplete data. Biometrika 88(3), 727–737.
