Proceedings of the Annual Meeting of the American Statistical Association, August 5-9, 2001
LINEAR MIXED MODEL ROBUST REGRESSION

Megan J. Waterman, Jeffrey B. Birch, Oliver Schabenberger
Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, Virginia 24061

KEYWORDS: local likelihood estimation, semiparametric inference

Introduction

In many practical situations, the user has some knowledge of the parametric form for a particular model. However, this parametric model may be misspecified over a portion of the data. This bias may be corrected through the use of a local or nonparametric model; however, the variance of the nonparametric fits is often large. Our model robust regression method, which rests on combining the parametric and nonparametric fits in a convex combination, provides an improvement over a misspecified parametric model, while obtaining smaller variances than the nonparametric fit. This paper develops the robust regression procedure for the linear mixed model setting with clustered data.
Parametric Linear Mixed Models

A common formulation of the linear mixed model for clustered data is the Laird-Ware (Laird and Ware, 1982) model

$$Y = X\beta + Zb + \varepsilon, \tag{1}$$

where $X$ is a model matrix for the fixed effects, $\beta$ is a vector of fixed and unknown parameters, $Z$ is a model matrix for the random effects, $b$ is a vector of random effects, and $\varepsilon$ is a vector of random errors. In addition, the random effects vector $b$ is assumed to be normally distributed with mean 0 and variance-covariance matrix $B$, and the errors are also normally distributed with expectation 0 and variance-covariance matrix $R$. The vectors $b$ and $\varepsilon$ are assumed to be independent. The Laird-Ware model may also be expressed in terms of a particular cluster by including a subscript $i$, denoting the linear mixed model for the $i$th cluster ($i = 1, \ldots, s$). It can be shown that, under the assumption that the linear mixed model is correct,

$$E(Y \mid b) = X\beta + Zb, \qquad E(Y) = X\beta,$$

$$\mathrm{Var}(Y \mid b) = R, \qquad \mathrm{Var}(Y) = R + ZBZ' = V.$$

Using the Laird-Ware model and the given assumptions, one can find an estimator of $\beta$ and a predictor of $b$. The estimator $\hat{\beta}$ is the generalized least squares estimator

$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}Y, \tag{2}$$

and the predictor $\hat{b}$ can be found by maximizing the joint likelihood of $b$ and $\beta$ as

$$\hat{b} = BZ'V^{-1}(Y - X\hat{\beta}). \tag{3}$$

The expression in (3) is the best linear unbiased predictor. In many practical situations, the variance-covariance matrices $B$ and $R$, and hence $V$, are unknown and must be estimated, usually by maximum or restricted maximum likelihood.

For $s$ clusters, there are $s+1$ curves to be estimated. The linear parametric population average is estimated from the fixed effects part of the model as

$$\hat{Y}_i = X_i\hat{\beta}. \tag{4}$$

In addition, a curve may be estimated for each cluster. The cluster specific curves can be expressed as

$$\hat{Y}_i = X_i\hat{\beta} + Z_i\hat{b}_i. \tag{5}$$

The cluster specific curves differ across clusters.

An example of the parametric linear mixed model follows using a portion of the data in Haslett and Raftery (1989). Wind speeds in knots at twelve meteorological stations in Ireland were measured daily from 1961 to 1978. This example focuses on the average monthly wind speeds in knots for the year 1961. There are twelve clusters, as the cluster is the station. Since there are twelve observations per station, there are a total of 144 observations in the data set. The parametric model is chosen as a quadratic model with a random intercept. Thus, the cluster specific fit at each station is a parabola shifted up or down for the particular cluster. A trellis plot of the population average curve and the cluster specific curves for each station appears in Figure 1. Each plot in Figure 1 corresponds to a station. The population average curve, in red, is the same for every cluster. The cluster specific curves, in blue, are shifts of the population average curve, accomplished by the random intercept term in the model. The cluster specific curves, as expected, provide a better fit to the data.
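The estimator in (2), the predictor in (3), and the fits in (4) and (5) can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes $B$ and $R$ are known (in practice they would be estimated by maximum or restricted maximum likelihood), and the function name `fit_lmm_known_variances` and the simulated toy data are hypothetical.

```python
import numpy as np

def fit_lmm_known_variances(X, Z, Y, B, R):
    """GLS estimator (2) and BLUP (3), assuming B and R are known."""
    V = R + Z @ B @ Z.T                      # marginal variance V = R + ZBZ'
    Vinv_X = np.linalg.solve(V, X)           # V^{-1} X without forming V^{-1}
    Vinv_Y = np.linalg.solve(V, Y)           # V^{-1} Y
    beta_hat = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_Y)
    b_hat = B @ Z.T @ np.linalg.solve(V, Y - X @ beta_hat)
    return beta_hat, b_hat, V

# Hypothetical toy data: 3 clusters, 4 observations each,
# quadratic mean with a random intercept per cluster.
rng = np.random.default_rng(0)
s, m = 3, 4
x = np.tile(np.linspace(0.0, 1.0, m), s)
X = np.column_stack([np.ones(s * m), x, x**2])   # fixed effects design
Z = np.kron(np.eye(s), np.ones((m, 1)))          # random intercept design
B = 2.0 * np.eye(s)                              # assumed Var(b)
R = 0.5 * np.eye(s * m)                          # assumed Var(errors)
Y = (X @ np.array([8.0, 3.0, -2.0])
     + Z @ rng.normal(0.0, np.sqrt(2.0), s)
     + rng.normal(0.0, np.sqrt(0.5), s * m))

beta_hat, b_hat, V = fit_lmm_known_variances(X, Z, Y, B, R)
pa_fit = X @ beta_hat            # population average fit, as in (4)
cs_fit = pa_fit + Z @ b_hat      # cluster specific fit, as in (5)
```

With a random intercept, `cs_fit - pa_fit` is constant within each cluster, which is the "shifted parabola" behaviour described for the wind speed example.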
Figure 1. The Parametric Model. [Trellis plot of the parametric linear mixed model population average (PA) and cluster specific (CS) curves by station: ROS, RPT, SHA, VAL, DUB, KIL, MAL, MUL, BEL, BIR, CLA, CLO. Horizontal axis: months since 1/1/61. Vertical axis: average wind speed in knots.]

At virtually every station, we notice a sharp rise in wind speed during the month of February. The wind speeds then diminish in the spring and summer months, until they rise again in October and December. The proposed parametric model is unable to capture this type of trend; that is, the parametric model has been misspecified. A model is misspecified when the model proposed by the user does not equal the true model. To handle this misspecification, the user must consider a more flexible model. One such model is a nonparametric, or local, model.

Local Mixed Models

The concept of local weighting (Fan and Gijbels, 1996) can be applied to mixed models by two approaches. The first approach, the conditional local mixed model, weights the variance-covariance matrix of the observation vector conditioned on the random effects. The second approach, the marginal local mixed model, weights the variance-covariance matrix of the observation vector. These two approaches are presented below.

For the conditional local mixed model, consider the model

$$Y = X\beta_0 + Zb_0 + \varepsilon_0, \tag{6}$$

where the subscript 0 denotes estimation at the point $x_0$. The vector $\beta_0$ contains unknown fixed effects at $x_0$, $b_0$ is a vector of unknown random effects at $x_0$, and $\varepsilon_0$ is a vector of random errors at $x_0$. The vectors of random effects and errors are assumed normally distributed with means 0 and variance-covariance matrices $B$ and $K_0^{-1/2} R K_0^{-1/2}$, respectively. The matrix $K_0$ is a diagonal weight matrix with the Nadaraya-Watson (Nadaraya 1964, Watson 1964) kernel weights at $x_0$ as the diagonal elements. Of course, the weights depend on a bandwidth. The proposed bandwidth selector, PRESS**, will be discussed in detail in a later section. It can be shown that, under the assumption that the linear mixed model is correct, for fixed $K_0$,

$$E(Y \mid b_0) = X\beta_0 + Zb_0, \qquad E(Y) = X\beta_0,$$

$$\mathrm{Var}(Y \mid b_0) = K_0^{-1/2} R K_0^{-1/2} = R^*,$$

$$\mathrm{Var}(Y) = K_0^{-1/2} R K_0^{-1/2} + ZBZ' = R^* + ZBZ' = V^*.$$
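The construction of $K_0$, $R^*$, and $V^*$ at a single point $x_0$ can be sketched as follows. This is only an illustration under stated assumptions: the text does not name the kernel function, so a Gaussian kernel is assumed here, and the function names and toy inputs are hypothetical.

```python
import numpy as np

def kernel_weights(x, x0, bandwidth):
    """Nadaraya-Watson style weights at x0: kernel values normalized to sum to one.

    The Gaussian kernel is an assumption; the source does not specify the kernel.
    """
    u = (x - x0) / bandwidth
    k = np.exp(-0.5 * u**2)
    return k / k.sum()

def conditional_local_variance(x, x0, bandwidth, Z, B, R):
    """Return the weights, R* = K0^{-1/2} R K0^{-1/2}, and V* = R* + ZBZ'."""
    w = kernel_weights(x, x0, bandwidth)
    K_inv_sqrt = np.diag(1.0 / np.sqrt(w))   # weights must be strictly positive
    R_star = K_inv_sqrt @ R @ K_inv_sqrt
    V_star = R_star + Z @ B @ Z.T
    return w, R_star, V_star

# Hypothetical inputs: 2 clusters, 12 points each, random intercept.
s, m = 2, 12
x = np.tile(np.linspace(0.0, 1.0, m), s)
Z = np.kron(np.eye(s), np.ones((m, 1)))
B = 2.0 * np.eye(s)
R = 0.5 * np.eye(s * m)

w, R_star, V_star = conditional_local_variance(x, 0.0, 0.20, Z, B, R)
```

Observations far from $x_0$ receive small weights and therefore a large error variance in $R^*$, which is how they are downweighted in the local fit; the random effects part $ZBZ'$ is left unweighted in the conditional model.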
The estimator of $\beta_0$ and the predictor of $b_0$ can be found for the conditional local mixed model using the joint likelihood of $b_0$ and $\beta_0$. It can be shown that they are

$$\hat{\beta}_0 = (X'V^{*-1}X)^{-1}X'V^{*-1}Y, \tag{7}$$

$$\hat{b}_0 = BZ'V^{*-1}(Y - X\hat{\beta}_0). \tag{8}$$

Notice that the forms of (7) and (8) are identical to those of the parametric model, with the exception of the variance-covariance matrix. Using the estimates of the fixed effects and the predictions of the random effects, a population average curve and cluster specific curves are found by connecting the fits obtained at each $x_0$. The population average and cluster specific curves for the conditional local mixed model applied to the wind speed data are given in Figure 2 for a bandwidth of 0.20. The local model was a quadratic model (although another local polynomial could be used) with a random intercept. As in Figure 1, the population average curve is in red, and the cluster specific curves are in blue. Notice that both curves capture the increase in wind speed in the month of February. The cluster specific curves for the conditional local model are an improvement over the parametric cluster specific curves. The conditional local model provides a flexible fit to each cluster.

Figure 2. The Conditional Local Mixed Model. [Trellis plot of the conditional local mixed model (b = 0.20) population average (PA) and cluster specific (CS) curves by station: ROS, RPT, SHA, VAL, DUB, KIL, MAL, MUL, BEL, BIR, CLA, CLO. Horizontal axis: months since 1/1/61. Vertical axis: average wind speed in knots.]

The marginal local mixed model incorporates the weights in the variance-covariance matrix of the marginal distribution. Recall that in the parametric model, the marginal distribution of $Y$ had variance-covariance matrix $V$. The marginal local mixed model uses the weight matrix $K_0$ and $V$ such that the variance of $Y$ in the marginal local mixed model is $K_0^{-1/2} V K_0^{-1/2}$. The marginal mixed model can be expressed as
$$Y = X\beta_0 + K_0^{-1/2} Z b_0 + \varepsilon_0, \tag{9}$$

where $\varepsilon_0$ is defined in (6). As in the conditional local model, $\beta_0$ and $b_0$ are the parameters for the fixed and random effects at $x_0$. The random effect and random error vectors contain normal random variates with mean 0 and variance-covariance matrices $B$ and $K_0^{-1/2} R K_0^{-1/2}$. Assuming that the linear mixed model is correct,

$$E(Y \mid b_0) = X\beta_0 + K_0^{-1/2} Z b_0, \qquad E(Y) = X\beta_0,$$

$$\mathrm{Var}(Y \mid b_0) = K_0^{-1/2} R K_0^{-1/2} = R^*,$$

$$\mathrm{Var}(Y) = K_0^{-1/2} R K_0^{-1/2} + K_0^{-1/2} ZBZ' K_0^{-1/2} = K_0^{-1/2}(R + ZBZ')K_0^{-1/2} = V^{**}.$$

Thus, the weight matrix $K_0^{-1/2}$ occurs twice in the marginal local mixed model: as a multiplier of the $Z$ matrix and in the definition of the variance-covariance matrix of $Y$ conditioned on the random effects vector. The estimator of $\beta_0$ and the predictor of $b_0$ for the marginal local mixed model are

$$\hat{\beta}_0 = (X'V^{**-1}X)^{-1}X'V^{**-1}Y, \tag{10}$$

$$\hat{b}_0 = BZ'V^{**-1}(Y - X\hat{\beta}_0). \tag{11}$$

The forms of (10) and (11) are identical to the estimators and predictors of the parametric and conditional local mixed models, except for the variance-covariance matrices. A population average curve and cluster specific curves may be found by connecting the fits obtained at each $x_0$. The population average and cluster specific curves for the marginal local mixed model are given in Figure 3 for the wind speed data. The marginal local model was a quadratic model with a random intercept and used a bandwidth of 0.20. Both the population average and cluster specific curves modeled the wind speed intensification in February. The population average curve from the marginal model is very close to the population average curve from the conditional model. However, the marginal model provides poor cluster specific fits for many of the clusters, including the KIL, MAL, RPT, and BIR stations.

Figure 3. The Marginal Local Mixed Model. [Trellis plot of the marginal local mixed model (b = 0.20) population average (PA) and cluster specific (CS) curves by station: ROS, RPT, SHA, VAL, DUB, KIL, MAL, MUL, BEL, BIR, CLA, CLO. Horizontal axis: months since 1/1/61. Vertical axis: average wind speed in knots.]

By using the expectation and variance formulas for the conditional and marginal models, one can see that the conditional local model is appropriate for both the population average and cluster specific curves. That is, localization of the cluster specific mean also results in a marginal mean that is properly localized. This is not true for the marginal local mixed model, where only the population average curve is properly localized, but not the conditional mean. The wind speed example supports this finding, as the cluster specific fits for the marginal model are inappropriate.

The primary advantage of the local mixed models lies in their flexibility. The local models are able to provide curves that fit the trend of the data without imposing a parametric structure. Since the local models are fit pointwise, the variance of the random components is re-estimated at every point. For example, consider the parametric and conditional local mixed model cluster specific fits for the wind speed data. Both models fit a quadratic with a random intercept. For the parametric model, the cluster specific fits were parallel to each other. This was not the case with the conditional local mixed model. The cluster specific fits were more flexible; they were not simply shifted parabolas. In fact, some of the cluster specific fits for the conditional local model intersected because the random intercept adjustments vary across the points $x_0$. A comparison of the parametric and conditional local mixed model cluster specific fits is shown in Figure 4.
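The pointwise fitting described above, where the estimator and predictor are recomputed at each $x_0$ and the fits are connected into curves, can be sketched as follows. This is a rough illustration rather than the authors' procedure: it assumes a Gaussian kernel, treats $B$ and $R$ as known instead of re-estimating them at every point, rescales time to $[0, 1]$, and uses simulated data. All function and variable names are hypothetical.

```python
import numpy as np

def gaussian_nw_weights(x, x0, bandwidth):
    """Normalized kernel weights at x0 (the Gaussian kernel is an assumption)."""
    k = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    return k / k.sum()

def local_fit(x0, x, X, Z, Y, B, R, bandwidth, kind="conditional"):
    """Local estimator and predictor at x0: (7)-(8) or (10)-(11)."""
    w = gaussian_nw_weights(x, x0, bandwidth)
    K_inv_sqrt = np.diag(1.0 / np.sqrt(w))
    if kind == "conditional":      # V* = K^{-1/2} R K^{-1/2} + ZBZ'
        V_loc = K_inv_sqrt @ R @ K_inv_sqrt + Z @ B @ Z.T
    elif kind == "marginal":       # V** = K^{-1/2} (R + ZBZ') K^{-1/2}
        V_loc = K_inv_sqrt @ (R + Z @ B @ Z.T) @ K_inv_sqrt
    else:
        raise ValueError("kind must be 'conditional' or 'marginal'")
    Vinv_X = np.linalg.solve(V_loc, X)
    Vinv_Y = np.linalg.solve(V_loc, Y)
    beta0 = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_Y)
    b0 = B @ Z.T @ np.linalg.solve(V_loc, Y - X @ beta0)
    return beta0, b0

# Simulated data: 3 clusters, 12 points each, non-quadratic trend.
rng = np.random.default_rng(1)
s, m = 3, 12
x = np.tile(np.linspace(0.0, 1.0, m), s)
X = np.column_stack([np.ones(s * m), x, x**2])   # local quadratic
Z = np.kron(np.eye(s), np.ones((m, 1)))          # random intercept
B = 1.0 * np.eye(s)
R = 0.25 * np.eye(s * m)
Y = (10.0 + 3.0 * np.sin(2.0 * np.pi * x)
     + Z @ rng.normal(0.0, 1.0, s) + rng.normal(0.0, 0.5, s * m))

grid = np.linspace(0.0, 1.0, m)
pa_cond = np.empty(m)          # conditional model, population average curve
pa_marg = np.empty(m)          # marginal model, population average curve
cs_cond = np.empty((s, m))     # conditional model, cluster specific curves
for j, x0 in enumerate(grid):
    row = np.array([1.0, x0, x0**2])
    beta_c, b_c = local_fit(x0, x, X, Z, Y, B, R, 0.20, "conditional")
    beta_m, _ = local_fit(x0, x, X, Z, Y, B, R, 0.20, "marginal")
    pa_cond[j] = row @ beta_c
    pa_marg[j] = row @ beta_m
    cs_cond[:, j] = pa_cond[j] + b_c   # intercept shift re-predicted at each x0
```

Because `b_c` is recomputed at every `x0`, the cluster specific curves in `cs_cond` need not be parallel shifts of `pa_cond`, which mirrors the behaviour described for the conditional local model.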
Figure 4. A comparison of the cluster specific fits for the parametric and conditional local models. [Two profile plots: parametric cluster specific fits (random intercept) and conditional local cluster specific fits (random intercept). Horizontal axis: months since 1/1/61. Vertical axis: average wind speed in knots.]

Bandwidth Selection in Local Mixed Models

The kernel weights in the local mixed models depend on a bandwidth, which determines the smoothness of the resulting curves; the regression curve becomes less smooth as the bandwidth decreases. The proposed bandwidth selector for the local mixed models is an adaptation of the PRESS** statistic of Mays, Birch, and Starnes (2001). The PRESS** statistic is a penalized PRESS statistic that corrects for extreme bandwidths. The bandwidth $b$ is chosen to minimize the PRESS** statistic

$$\mathrm{PRESS}^{**}(b) = \frac{\sum_{i=1}^{s} (Y_i - \hat{Y}^{NP}_{i,-i})' \tilde{V}^{-1} (Y_i - \hat{Y}^{NP}_{i,-i})}{n - \mathrm{trace}(H_{NP}) + (n - d)\,\dfrac{SSE_{max} - SSE_b}{SSE_{max}}},$$

where $\hat{Y}^{NP}_{i,-i}$ is the local fit for the $i$th cluster with the $i$th cluster removed. This local fit is cluster specific for the conditional local model and population average for the marginal model. In addition, $\tilde{V}$ is an estimate of the variance-covariance matrix of $Y_i$ based on the residuals $Y_i - \hat{Y}^{NP}_{i,-i}$, $H_{NP}$ is the local smoother matrix, $n$ is the total sample size, and $d$ is the number of parameters in the nonparametric model. The error sums of squares $SSE_{max}$ and $SSE_b$ are those for the nonparametric fits using weights equal to $1/n$ and weights using a bandwidth of $b$, respectively. The denominator of the PRESS**(b) statistic is the penalty term. The term $n - \mathrm{trace}(H_{NP})$ protects against small bandwidths; the second term guards against large bandwidths.

Mixed Model Robust Regression
Parametric fits usually have small variance but are biased if the model is misspecified. Nonparametric fits, on the other hand, have lower bias than a misspecified parametric model, but the variability of the fit is often high. Thus, the goal is to develop a method for misspecified models that has lower bias than the parametric model but smaller variance than the local model. Mays, Birch, and Starnes (2001) discuss two methods of model robust regression for the traditional regression setting. The first method, termed Model Robust Regression 1 (MRR1) (Einsporn and Birch, 1993), combines the parametric and nonparametric fits in a convex combination. This concept can be extended to the mixed model. The Mixed Model Robust Regression (MMRR) model is

$$\hat{Y}^{MMRR} = (1 - \lambda)\hat{Y}^{P} + \lambda\hat{Y}^{NP}, \tag{12}$$

where $\hat{Y}^{P}$ is the parametric linear mixed model fit, and $\hat{Y}^{NP}$ is the nonparametric mixed model fit. The parameter $\lambda$ governs the degree of mixing and ranges between zero and one. A $\lambda$ equal to one means that the mixed model robust fit is equivalent to the nonparametric fit, and a $\lambda$ equal to zero implies that the mixed model robust fit is equal to the parametric fit. For a misspecified model, the MMRR predictor is robust to model misspecification because it uses only a fraction of the biased parametric fit. In addition, the variance of the MMRR fit is controlled through the use of the mixing parameter.

Notice that the MMRR fit is a population average if both the parametric and nonparametric fits in (12) are population average fits. The MMRR curves are cluster specific curves if both the parametric and nonparametric fits are cluster specific fits. Thus, the MMRR fit using the marginal local mixed model is strictly a population average, whereas the MMRR fit using the conditional local mixed model can be either population average or cluster specific curves, as needed.

Mixed model robust regression curves for the wind speed data set appear in Figure 5. The first trellis plot is MMRR using the conditional local mixed model with a bandwidth of 0.20 and a mixing parameter ($\lambda$) of 0.60. That is, the model robust fit uses 60% of the nonparametric fit and 40% of the parametric fit. The first plot contains the MMRR cluster specific fits. The second trellis plot is mixed model robust regression using the marginal local mixed model, also with $b = 0.20$ and $\lambda = 0.60$. There, the mixed model robust regression curve is the population average. In both trellis plots, the model robust fits are in blue. Naturally, the MMRR fits lie between the parametric and nonparametric fits. Both plots share key features. The model robust fits retain the shape of the nonparametric fits, so the MMRR fits preserve the flexibility that was inherent in the local models. The model robust fit, however, is smoother than the local fit.
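The convex combination in (12) is simple enough to state directly in code. The sketch below is illustrative only; the function name `mmrr_fit` and the toy fit vectors are hypothetical, and the two input fits are assumed to be of the same type (both population average or both cluster specific).

```python
import numpy as np

def mmrr_fit(y_parametric, y_nonparametric, lam):
    """Convex combination (12): (1 - lam) * parametric + lam * nonparametric."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    y_p = np.asarray(y_parametric, dtype=float)
    y_np = np.asarray(y_nonparametric, dtype=float)
    return (1.0 - lam) * y_p + lam * y_np

# Hypothetical fits at three data points.
y_p = np.array([10.0, 11.0, 12.0])     # parametric fit
y_np = np.array([12.0, 10.0, 13.0])    # nonparametric (local) fit
y_mmrr = mmrr_fit(y_p, y_np, 0.60)     # 60% nonparametric, 40% parametric
```

At every point the combined fit lies between the two input fits, and the endpoints `lam = 0` and `lam = 1` return the parametric and nonparametric fits, respectively.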
Figure 5. Mixed Model Robust Regression. [Two trellis plots by station (ROS, RPT, SHA, VAL, DUB, KIL, MAL, MUL, BEL, BIR, CLA, CLO): a plot of the parametric, CLMM (local), and MMRR cluster specific fits (λ = 0.60), and a plot of the parametric, MLMM (local), and MMRR population average fits (λ = 0.60). Horizontal axis: months since 1/1/61. Vertical axis: average wind speed in knots.]

Mixing Parameter Selection

The mixing parameter in mixed model robust regression determines the proportion of the nonparametric and parametric fits used in the creation of the model robust fits. Thus, our selector should choose a large value of $\lambda$ if the nonparametric fit differs significantly from the parametric fit, and a small value of $\lambda$ when the nonparametric and parametric fits are similar. The proposed mixing parameter for mixed model robust regression is a variation of the asymptotically optimal mixing parameter for MRR1 in Mays, Birch, and Starnes (2001) of the form

$$\lambda_{MMRR} = \frac{\sum_{i=1}^{s} (\hat{Y}^{NP}_{i,-i} - \hat{Y}^{P}_{i,-i})' \tilde{C}^{-1} (Y - \hat{Y}^{P}_{i})}{\sum_{i=1}^{s} (\hat{Y}^{NP}_{i} - \hat{Y}^{P}_{i})' \tilde{C}^{-1} (\hat{Y}^{NP}_{i} - \hat{Y}^{P}_{i})}.$$

Here, $\hat{Y}^{NP}_{i,-i}$ and $\hat{Y}^{P}_{i,-i}$ are the nonparametric and parametric fits for the $i$th cluster with the $i$th cluster removed, and $\hat{Y}^{NP}_{i}$ and $\hat{Y}^{P}_{i}$ are the nonparametric and parametric fits. The vector $Y$ is the vector of the average response at the data points, and $\tilde{C}$ is an estimate of the variance-covariance matrix of $Y - \hat{Y}^{P}_{i}$. The nonparametric and parametric fits and deleted cluster fits are cluster specific if the conditional local model is used in MMRR. Otherwise, the population average fits are used in the calculation of $\lambda_{MMRR}$.
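The mixing parameter selector described above can be sketched as a ratio of two sums of quadratic forms over clusters. This is a hedged illustration, not the authors' code: it assumes every cluster is observed at the same data points, so that the average response vector $Y$ has the same length as each cluster's fit, and it takes $\tilde{C}$ as given. The function name `mixing_parameter` and all inputs are hypothetical.

```python
import numpy as np

def mixing_parameter(y_avg, np_fits, p_fits, np_deleted, p_deleted, C):
    """Ratio form of the proposed selector.

    y_avg      : average response at the data points (assumed common to clusters)
    np_fits    : list of nonparametric fits, one array per cluster
    p_fits     : list of parametric fits, one array per cluster
    np_deleted : nonparametric fits for cluster i with cluster i removed
    p_deleted  : parametric fits for cluster i with cluster i removed
    C          : estimate of the variance-covariance matrix (taken as given)
    """
    num = 0.0
    den = 0.0
    for f_np, f_p, d_np, d_p in zip(np_fits, p_fits, np_deleted, p_deleted):
        num += (d_np - d_p) @ np.linalg.solve(C, y_avg - f_p)
        den += (f_np - f_p) @ np.linalg.solve(C, f_np - f_p)
    return num / den

# Hypothetical population average fits shared by 3 clusters at 4 data points.
f_p = np.array([10.0, 11.0, 11.0, 10.0])    # parametric fit
f_np = np.array([12.0, 10.0, 10.5, 11.0])   # nonparametric fit
y_avg = 0.5 * (f_p + f_np)                  # responses halfway between the fits
C = np.diag([1.0, 2.0, 2.0, 1.0])

lam = mixing_parameter(y_avg, [f_np] * 3, [f_p] * 3, [f_np] * 3, [f_p] * 3, C)
```

In this toy setup the deleted fits are set equal to the full fits purely for simplicity. The ratio is large when the data sit close to the nonparametric fit and small when they sit close to the parametric fit, matching the stated goal for the selector. The raw ratio is not forced into $[0, 1]$ here; whether and how to truncate it is left open.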
Conclusion

Mixed model robust regression corrects an incorrectly specified parametric mixed model by combining the parametric fit with a locally weighted mixed model fit. This combination results in a model that has less bias than the parametric model and smaller variance than the locally weighted model. Mixed model robust regression resulted in flexible population average and cluster specific curves that capture the important structure in each cluster on the average. PRESS**(b) and the mixing parameter $\lambda_{MMRR}$ are the proposed selectors; selection of the bandwidth and the mixing parameter is an ongoing study. We have presented two methods for localizing a mixed model, termed the conditional and the marginal local mixed model. The two models differ on whether the conditional or the marginal variance of the response vectors is localized.
References

Einsporn, R., and Birch, J.B. (1993), "Model Robust Regression: Using Nonparametric Regression to Improve Parametric Regression Analyses," Technical Report 93-5, Department of Statistics, Virginia Polytechnic Institute and State University.

Fan, J., and Gijbels, I. (1996), Local Polynomial Modeling and Its Applications, Monographs on Statistics and Applied Probability 66, London: Chapman and Hall.

Haslett, J., and Raftery, A.E. (1989), "Space-time Modelling with Long-memory Dependence: Assessing Ireland's Wind Power Resource (with Discussion)," Applied Statistics, 38, 1-50.

Laird, N.M., and Ware, J.H. (1982), "Random-effects Models for Longitudinal Data," Biometrics, 38, 963-974.

Mays, J.E., Birch, J.B., and Starnes, B.A. (2001), "Model Robust Regression: Combining Parametric, Nonparametric, and Semiparametric Methods," Journal of Nonparametric Statistics, 13, 2, 245-277.

Nadaraya, E. (1964), "On Estimating Regression," Theory of Probability and its Applications, 9, 141-142.

Watson, G. (1964), "Smooth Regression Analysis," Sankhya Ser. A, 26, 359-372.