Communications in Statistics - Theory and Methods
ISSN: 0361-0926 (Print) 1532-415X (Online) Journal homepage: http://www.tandfonline.com/loi/lsta20
A weighted Jackknife method for clustered data
Ruofei Du & Ji-Hyun Lee

To cite this article: Ruofei Du & Ji-Hyun Lee (2018): A weighted Jackknife method for clustered data, Communications in Statistics - Theory and Methods

To link to this article: https://doi.org/10.1080/03610926.2018.1440597
Published online: 06 Mar 2018.
A weighted Jackknife method for clustered data

Ruofei Du a,b and Ji-Hyun Lee a,b

a Biostatistics Shared Resource, University of New Mexico Comprehensive Cancer Center, Albuquerque, New Mexico, USA; b University of New Mexico School of Medicine, Albuquerque, New Mexico, USA
ABSTRACT

We propose a weighted delete-one-cluster Jackknife based framework for clustered data with few clusters and severe cluster-level heterogeneity. The proposed method estimates the mean for a condition by a weighted sum of the estimates from each of the Jackknife procedures. Influence from a heterogeneous cluster can be weighted appropriately, and the conditional mean can be estimated with higher precision. An algorithm for estimating the variance of the proposed estimator is also provided, followed by the cluster permutation test for assessing the condition effect. Our simulation studies demonstrate that the proposed framework has good operating characteristics.

ARTICLE HISTORY: Received August; Accepted February

KEYWORDS: Clustered data; few clusters; heterogeneity; weighted delete-one-cluster Jackknife
1. Introduction

For the analysis of clustered data, a continuous outcome variable may be expressed as
$$y_{ijk} = \mu_i + \gamma_{ij} + \varepsilon_{ijk}, \quad (1)$$
where $i$ is the study condition index, $j$ is the cluster index, and $k$ denotes the $k$th subject from the $j$th cluster under condition $i$. The mean outcome for the $i$th condition is denoted by $\mu_i$. To accommodate intra-cluster correlation, a random effect $\gamma_{ij}$ is included with the assumption $\gamma_{ij} \sim N(0, \sigma^2_\gamma)$; the random error associated with $y_{ijk}$ is denoted by $\varepsilon_{ijk}$ and is assumed to follow $N(0, \sigma^2_\varepsilon)$. All $\gamma_{ij}$'s and $\varepsilon_{ijk}$'s are independent of each other. Generally, the main goals of the statistical analysis include estimating the mean outcome for each study condition and testing the condition effect. The linear mixed-effects model (LMM) is widely used for such clustered data analysis. A common practice using the LMM starts by estimating the variance components (i.e., $\sigma^2_\gamma$ and $\sigma^2_\varepsilon$) with the residual maximum likelihood (REML) method (Harville 1977), and then estimates the conditional means by generalized least squares (GLS). For testing the condition effect between two conditions, a t statistic is constructed, and the Kenward-Roger (KR) method (Kenward and Roger 1997) is widely used to correct its degrees of freedom and thereby adjust the estimate of the standard error. The analysis of clustered data by LMM fitting permits intra-cluster correlation, presuming that the number of clusters is large (Eldridge et al. 2004; Eldridge, Ashby, and Kerry 2006; Hayes and Bennett 1999; Littell et al. 2006). In the literature, for the situation of a small number of clusters, a great deal of research has focused on obtaining improved estimates of the standard errors of the estimated
conditional means, because with a small number of clusters the standard errors from a general regression-based method are usually downward biased. In addition, to account for unequal error variances among the clusters, so-called cluster-robust estimates of the standard error were developed, in which the small-sample issue is adjusted by correction formulas (Cameron and Miller 2010; Mancl and DeRouen 2001; Nichols and Schaffer 2007) or by bootstrap methods (Cameron, Gelbach, and Miller 2008). Furthermore, the assumption of an asymptotic distribution for test statistics of the condition effect may not hold for a small number of clusters. Some have approached this issue by applying cluster permutation tests (Donner and Klar 1996; Gail et al. 1996; Murray et al. 2006). When an insufficient number of clusters is used, in addition to the lack of representation of the population of clusters, a heterogeneous cluster can have a substantial impact on the statistical analysis. This paper extends the assumptions of clustered data analysis to include this situation. We propose a weighted delete-one-cluster Jackknife based framework. The proposed method estimates the mean of a condition by a weighted sum of the estimates from each of the Jackknife procedures. Influence from a heterogeneous cluster can be weighted appropriately, and the mean outcome can be estimated with higher precision. An algorithm for estimating the variance of the proposed estimator is also provided, and we further take advantage of the cluster permutation test for assessing the condition effect.
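As a point of reference for the standard workflow described above, the following minimal sketch fits model (1) by REML on toy clustered data and reads off the fixed-effect (conditional mean) estimates. The use of statsmodels, the simulated data, and all parameter values are our own illustrative assumptions, not from the paper; the Kenward-Roger degrees-of-freedom correction is not applied in this sketch.

```python
# A minimal sketch (not the authors' code) of the common LMM workflow:
# REML for the variance components, then the fixed-effect estimates of
# the conditional means. All values below are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for i, mu in enumerate([0.0, 1.0]):            # two hypothetical conditional means
    for j in range(4):                         # clusters per condition
        gamma = rng.normal(0.0, 1.0)           # cluster random effect gamma_ij
        for k in range(10):                    # subjects per cluster
            rows.append({"y": mu + gamma + rng.normal(0.0, 1.0),
                         "condition": i,
                         "cluster": f"{i}-{j}"})
df = pd.DataFrame(rows)

# REML fit of model (1); the Kenward-Roger correction is omitted here.
fit = smf.mixedlm("y ~ C(condition)", df, groups=df["cluster"]).fit(reml=True)
print(fit.summary())
```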
2. Motivation

Through simulations, we investigate how both the estimates of the two conditional means and the statistical test of the condition effect are affected when the number of clusters is 4, 8, or 20 per condition and cluster heterogeneity is present. The model setting of Equation (1) is used for the simulations, except that the random cluster effect may have different variances across clusters, i.e., heterogeneous cluster variances (HCV). Specifically, HCV is simulated for one out of every four clusters in each condition ($\sigma^2_\gamma = 25$ for the HCV cluster; $\sigma^2_\gamma = 1$ for the other three clusters). The left panel of Figure 1 shows that, with 4 clusters per condition, the precision of the estimates of the conditional means is affected when HCV is present. For testing the condition effect, the rejection rate drops from 0.98 (homogeneous cluster variance) to 0.61 (HCV) at the significance level of 0.05.
Figure 1. Density of estimated conditional means from LMM fitting of simulated datasets: by cluster variance type (homogeneous and heterogeneous) with 4 clusters per condition (left panel), and by three numbers of clusters per condition (4/8/20) (right panel). The curves to the left in each panel are from the simulations with the smaller conditional mean; the curves to the right in each panel are from the simulations with the larger conditional mean.
The right panel of Figure 1 shows that the influence of HCV on estimation is reduced as the number of clusters increases: for 20 clusters per condition, the observed rejection rate is 1 at the significance level of 0.05.
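For concreteness, the sketch below generates data under the HCV setting just described. The within-cluster error variance, cluster size, and the conditional mean values are our own illustrative choices, since the section does not specify them; the function name is hypothetical.

```python
# Sketch of the heterogeneous-cluster-variance (HCV) setting: one out of
# every four clusters per condition has sigma_gamma^2 = 25, the others
# sigma_gamma^2 = 1. Error variance and cluster size are assumptions.
import numpy as np

def simulate_condition(mu, n_clusters=4, n_per_cluster=10,
                       hcv_sd=5.0, base_sd=1.0, error_sd=1.0, rng=None):
    """Return a list of per-cluster outcome arrays generated under model (1)."""
    rng = rng if rng is not None else np.random.default_rng()
    clusters = []
    for j in range(n_clusters):
        gamma_sd = hcv_sd if j % 4 == 0 else base_sd   # every fourth cluster is HCV
        gamma = rng.normal(0.0, gamma_sd)
        clusters.append(mu + gamma + rng.normal(0.0, error_sd, n_per_cluster))
    return clusters

rng = np.random.default_rng(2)
condition_a = simulate_condition(mu=0.0, rng=rng)   # hypothetical conditional means
condition_b = simulate_condition(mu=1.0, rng=rng)
```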
3. Methods

3.1. Weighted Jackknife estimator

Equation (1) is simplified below for the situation in which the clusters are all from an identical condition,
$$y_{jk} = \mu + \gamma_j + \varepsilon_{jk}, \quad j = 1, \ldots, G;\ k = 1, \ldots, n_j. \quad (2)$$
Considering heterogeneity at the cluster level, $\gamma_j$ is assumed to follow a Normal distribution with a possibly different variance for each cluster, i.e., $\gamma_j \sim N(0, \sigma^2_{\gamma_j})$. Given that the random effect of the clusters is independent of the random error of the residuals, we aggregate the outcomes of a cluster by calculating their mean. A cluster mean ($\bar{y}_j$) can be expressed as
$$\bar{y}_j = \mu + \gamma_j + \bar{\varepsilon}_j, \quad (3)$$
where $\bar{\varepsilon}_j$ denotes the error mean of cluster $j$. By leaving out one cluster at a time and calculating the least squares estimate (LSE) of $\mu$ with the remaining clusters in Equation (3), we obtain $\hat{\mu}_{-1}, \ldots, \hat{\mu}_{-G}$. The subscript $-j$ denotes that the $j$th cluster is left out of the computation. The process is the same as a traditional Jackknife resampling procedure, except that each time a cluster is left out instead of an individual subject. The proposed estimator of the conditional mean is
$$\tilde{\mu} = \sum_{j=1}^{G} w_j \hat{\mu}_{-j}, \quad (4)$$
where
$$w_j = \frac{1/\hat{V}(\hat{\mu}_{-j})}{\sum_{j=1}^{G} 1/\hat{V}(\hat{\mu}_{-j})}, \quad (5)$$
and $\hat{V}(\hat{\mu}_{-j})$ denotes the estimated variance of $\hat{\mu}_{-j}$. This is an inverse-variance style weight. We call $\tilde{\mu}$ a weighted Jackknife (WJK) estimator.

3.2. Calculation of weights for WJK estimator

Following Equation (3), a cluster mean can be further expressed as
$$\bar{y}_j = \mu + (\gamma_j + \bar{\varepsilon}_j) = \mu + \delta_j, \quad j = 1, \ldots, G. \quad (6)$$
Let $\sigma^2_{\delta_j}$ denote $\mathrm{Var}(\delta_j)$ and $\sigma^2_j$ denote $\mathrm{Var}(\varepsilon_{jk})$; then $\sigma^2_{\delta_j} = \sigma^2_{\gamma_j} + \sigma^2_j/n_j$. The aggregated data can then be fit in a general linear model (GLM) setting. By applying the method of Hinkley (1977) for a GLM setting with heterogeneous variances, an unbiased estimator of the variance of $\hat{\mu}_{-j}$ is
$$\hat{V}^{*}(\hat{\mu}_{-j}) = \frac{1}{(G-1)(G-2)} \sum_{l \neq j} (\bar{y}_l - \hat{\mu}_{-j})^2. \quad (7)$$
The details of the derivation and a proof are provided in the Appendix.
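To make the delete-one-cluster step concrete, the sketch below computes $\hat{\mu}_{-j}$ (the mean of the remaining cluster means, i.e., the LSE of $\mu$ in Equation (3)) and the estimator of Equation (7). The function name is ours, not from the paper.

```python
# Sketch of the delete-one-cluster estimates and Equation (7); the helper
# name (delete_one_cluster_estimates) is hypothetical.
import numpy as np

def delete_one_cluster_estimates(cluster_means):
    """Return (mu_hat_{-j}, V*_hat(mu_hat_{-j})) for j = 1, ..., G."""
    ybar = np.asarray(cluster_means, dtype=float)
    G = len(ybar)
    # LSE of mu from the remaining cluster means, leaving out cluster j
    mu_minus = np.array([np.delete(ybar, j).mean() for j in range(G)])
    # Equation (7): Hinkley-style unbiased variance estimate
    v_star = np.array([np.sum((np.delete(ybar, j) - mu_minus[j]) ** 2)
                       / ((G - 1) * (G - 2)) for j in range(G)])
    return mu_minus, v_star
```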
A cluster with a larger number of subjects provides a more precise estimate of its cluster mean; however, the aggregation steps above treat all clusters equally, regardless of their sizes. Using the aggregated data alone, the intra-cluster information (i.e., the estimate of the error variance within a cluster) is not utilized either. Therefore, we modify $\hat{V}^{*}(\hat{\mu}_{-j})$ for the variance estimation as
$$\hat{V}(\hat{\mu}_{-j}) = \hat{V}^{*}(\hat{\mu}_{-j}) + \frac{\sum_{l \neq j} \hat{\sigma}^2_l / n_l}{(G-1)^2}, \quad (8)$$
where $\hat{\sigma}^2_l$ is the mean residual sum of squares in the $l$th cluster,
$$\hat{\sigma}^2_l = \frac{1}{n_l - 1} \sum_{k=1}^{n_l} (y_{lk} - \bar{y}_l)^2. \quad (9)$$
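A sketch of Equations (8) and (9), continuing the helper above; again, the function name is ours and not part of the paper.

```python
# Sketch of Equations (8)-(9): add the scaled within-cluster variance term
# to V*_hat. `clusters` is a list of 1-D arrays of outcomes, one per cluster.
import numpy as np

def modified_variance(clusters, v_star):
    G = len(clusters)
    sigma2_hat = np.array([np.asarray(c).var(ddof=1) for c in clusters])  # Equation (9)
    n = np.array([len(c) for c in clusters])
    # Equation (8): V_hat(mu_hat_{-j}) = V*_hat + sum_{l != j} sigma2_hat_l/n_l / (G-1)^2
    return np.array([v_star[j] + np.sum(np.delete(sigma2_hat / n, j)) / (G - 1) ** 2
                     for j in range(G)])
```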
As observed in the simulation studies we conducted, the distribution of $\hat{V}^{*}(\hat{\mu}_{-j})$ is highly right-skewed for few clusters; i.e., the probability that $\hat{V}^{*}(\hat{\mu}_{-j})$ is less than its mean, $V(\hat{\mu}_{-j})$, is much higher than 0.5. In addition, as shown in the following variance expression,
$$V(\hat{\mu}_{-j}) = \mathrm{Var}\left(\frac{\sum_{l \neq j} \bar{y}_l}{G-1}\right) = \frac{\sum_{l \neq j} \sigma^2_{\delta_l}}{(G-1)^2} = \frac{\sum_{l \neq j} \sigma^2_{\gamma_l}}{(G-1)^2} + \frac{\sum_{l \neq j} \sigma^2_l / n_l}{(G-1)^2}, \quad (10)$$
the value of $V(\hat{\mu}_{-j})$ is usually dominated by $\sum_{l \neq j} \sigma^2_{\gamma_l} / (G-1)^2$ when the number of subjects in a cluster, $n_l$, is sufficiently large. These are the technical reasons for adding $\sum_{l \neq j} \hat{\sigma}^2_l / n_l \,/\, (G-1)^2$ to $\hat{V}^{*}(\hat{\mu}_{-j})$ when estimating $V(\hat{\mu}_{-j})$. The weight $w_j$ in Equation (5) can now be obtained.
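Putting the pieces together, the sketch below forms the inverse-variance weights of Equation (5) and the WJK estimator of Equation (4). It reuses the helper functions and the simulated data from the earlier sketches, and the function name is again hypothetical.

```python
# Sketch of the WJK estimator: inverse-variance weights (Equation (5))
# applied to the delete-one-cluster estimates (Equation (4)). Reuses
# delete_one_cluster_estimates, modified_variance, and condition_a from
# the sketches above.
import numpy as np

def wjk_estimate(clusters):
    cluster_means = np.array([np.mean(c) for c in clusters])
    mu_minus, v_star = delete_one_cluster_estimates(cluster_means)
    v_hat = modified_variance(clusters, v_star)
    w = (1.0 / v_hat) / np.sum(1.0 / v_hat)     # Equation (5)
    return float(np.sum(w * mu_minus)), w       # Equation (4)

mu_tilde, weights = wjk_estimate(condition_a)
```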
3.3. Estimation of variance of WJK estimator

The variance of the WJK estimator $\tilde{\mu}$ can be written as
$$\mathrm{Var}(\tilde{\mu}) = \sum_{j=1}^{G} w_j^2 \, \mathrm{Var}(\hat{\mu}_{-j}) + 2 \sum_{1 \le j < l \le G} w_j w_l \, \mathrm{Cov}(\hat{\mu}_{-j}, \hat{\mu}_{-l}). \quad (11)$$
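Viewed as a quadratic form, Equation (11) equals $\mathbf{w}^{\top} \Sigma \mathbf{w}$, where $\Sigma$ holds the variances on the diagonal and the covariances off the diagonal. The sketch below assumes these variance and covariance estimates are supplied; how they are obtained is the subject of the estimation algorithm referred to in the abstract and is not reproduced here.

```python
# Sketch of Equation (11) as a quadratic form. `cov_matrix` must hold
# estimates of Var(mu_hat_{-j}) on the diagonal and
# Cov(mu_hat_{-j}, mu_hat_{-l}) off the diagonal; obtaining these
# estimates is described by the paper's algorithm, not shown here.
import numpy as np

def wjk_variance(weights, cov_matrix):
    w = np.asarray(weights, dtype=float)
    S = np.asarray(cov_matrix, dtype=float)
    # sum_j w_j^2 Var(mu_hat_{-j}) + 2 sum_{j<l} w_j w_l Cov(mu_hat_{-j}, mu_hat_{-l})
    return float(w @ S @ w)
```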