On the Use of Marginal Likelihood in Model Selection

Peide Shi
Department of Probability and Statistics, Peking University, Beijing 100871, P. R. China

Chih-Ling Tsai
Graduate School of Management, University of California, Davis, California 95616-8609, U. S. A.
SUMMARY

Based on the marginal likelihood approach, we develop a model selection criterion, MIC, for regression models with the general variance structure. These include weighted regression models, regression models with ARMA errors, growth curve models, and spatial correlation models. We show that MIC is a consistent criterion. For regression models with either constant or non-constant variance, simulation studies indicate that MIC not only provides better model order choices than the asymptotically efficient criteria AIC, AICC, FPE and Cp, but is also superior to the consistent criteria BIC and FIC in both small and large sample sizes. These results also hold for regression models with AR(1) errors, except that BIC performs slightly better than MIC in the large sample size case. Finally, Monte Carlo studies show that the effectiveness of model selection criteria decreases as the degree of heterogeneity (or the first-order autocorrelation) increases.
Some key words: AIC; AICC; BIC; FIC; MIC; Marginal likelihood.
1. Introduction

Over the last three decades a number of model selection criteria have been proposed, including AIC (Akaike, 1973), AICC (Hurvich & Tsai, 1989), BIC (Schwarz, 1978), FIC (Wei, 1992), FPE (Akaike, 1970), and Cp (Mallows, 1973). All of these criteria consist of two basic components: the first is a function of the scale parameter estimator, which measures goodness-of-fit, and the second is a function of the number of unknown parameters, which penalizes overfitting. Hence, when considering two candidate models with the same number of parameters (e.g., one including explanatory variables $x_1$, $x_3$ and $x_5$, and the other including $x_2$, $x_4$ and $x_6$), the model with the smaller value for the first component will be selected. This implies that the more interesting parameter for model selection is the scale parameter, and that the regression parameters can be viewed as nuisance parameters. In applications, one may use a linear model with a general variance structure, which needs to be selected based on data. For this kind of model, we are interested in the parameters included in the variance structure, and the regression parameters are considered nuisance parameters.
A widely used procedure for inference about parameters of interest when nuisance parameters are also present is based on the profile log-likelihood function. However, this procedure may give inconsistent or inefficient estimates. In addition, the profile log-likelihood itself is not a log-likelihood. Thus we must consider the conditional log-likelihood (Cox & Hinkley, 1974), modified profile log-likelihood (Barndorff-Nielsen, 1983, 1985), and marginal log-likelihood (McCullagh & Nelder, 1989) as alternative approaches. For the sake of simplicity, in this paper we only consider the marginal log-likelihood to obtain a model selection criterion. The aim of this paper is to consider model selection for regression models with the general variance structure and to obtain the marginal information criterion, MIC. We achieve this first by obtaining an unbiased estimator of the expected Kullback-Leibler information of the marginal log-likelihood function of the fitted model. We then derive the MIC criterion in Section 2, and show that it is a consistent criterion in Section 3. In Section 4, we compare MIC with other model selection criteria via Monte Carlo studies. The results indicate that MIC not only outperforms the efficient criteria AIC, AICC, FPE and Cp, but is also superior (or comparable) to the consistent criteria BIC and FIC in both small and large sample sizes. Finally, we give some concluding remarks in
Section 5.
2. Derivation of the MIC Criterion

2.1. Marginal Log-likelihood Function

Consider the candidate family of models
$$Y = X\beta + e, \qquad (2.1)$$
where $Y$ is an $n \times 1$ vector, $X$ is an $n \times k$ matrix of known constants, $\beta$ is a $k \times 1$ vector of unknown regression coefficients, $e = (e_1, \dots, e_n)'$ has a multivariate normal distribution with mean $E(e) = 0$ and variance $E(ee') = \sigma^2 W(\theta)$, $\sigma^2$ is an unknown scalar and $\theta$ is a $q \times 1$ unknown vector. In this model, $(\theta', \sigma^2)'$ are the parameters of interest and the $\beta$ are nuisance parameters. Next, by adopting McCullagh and Nelder's (1989, Section 7.2) results, we obtain the marginal log-likelihood of $(\theta', \sigma^2)'$:
$$L(\theta, \sigma^2; Y) = -\tfrac{1}{2}\log|\sigma^2 W| - \tfrac{1}{2}\log|X'W^{-1}X/\sigma^2| - \tfrac{1}{2}\tilde{Y}'(I - \tilde{H})\tilde{Y}/\sigma^2, \qquad (2.2)$$
where $\tilde{Y} = W^{-1/2}Y$, $\tilde{X} = W^{-1/2}X$ and $\tilde{H} = \tilde{X}(\tilde{X}'\tilde{X})^{-1}\tilde{X}'$.
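To make (2.2) concrete, the following is a minimal numerical sketch (in Python) of the marginal log-likelihood. The function name and the use of Cholesky whitening are our own illustrative choices, not part of the paper; any square root of $W$ gives the same value because (2.2) involves $W$ only through $W^{-1}$ quadratic forms and determinants.

```python
import numpy as np

def marginal_loglik(W, sigma2, Y, X):
    """Marginal log-likelihood (2.2) for a candidate model.

    W      : n x n positive-definite matrix W(theta)
    sigma2 : the scalar sigma^2
    Y, X   : response vector (n,) and candidate design matrix (n, k)
    """
    n, k = X.shape
    L = np.linalg.cholesky(W)
    Yt = np.linalg.solve(L, Y)                     # plays the role of W^{-1/2} Y
    Xt = np.linalg.solve(L, X)                     # plays the role of W^{-1/2} X
    _, logdet_W = np.linalg.slogdet(W)
    _, logdet_XWX = np.linalg.slogdet(Xt.T @ Xt)   # = log|X' W^{-1} X|
    beta_hat = np.linalg.lstsq(Xt, Yt, rcond=None)[0]
    quad = np.sum((Yt - Xt @ beta_hat) ** 2)       # = Ytilde'(I - Htilde) Ytilde
    return (-0.5 * (n * np.log(sigma2) + logdet_W)
            - 0.5 * (logdet_XWX - k * np.log(sigma2))
            - 0.5 * quad / sigma2)
```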
Suppose, however, that the response was generated by the true model
$$Y = X_0\beta_0 + \varepsilon,$$
where $X_0$ is an $n \times k_0$ matrix of known constants, $\beta_0$ is a $k_0 \times 1$ vector of unknown regression coefficients, $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)'$ has a multivariate normal distribution with mean $E(\varepsilon) = 0$ and variance $E(\varepsilon\varepsilon') = \sigma_0^2 W(\theta_0)$, $\sigma_0^2$ is an unknown scalar and $\theta_0$ is an $m \times 1$ unknown vector. Then the marginal log-likelihood of $(\theta_0', \sigma_0^2)'$ is
$$L_0(\theta_0, \sigma_0^2; Y) = -\tfrac{1}{2}\log|\sigma_0^2 W_0| - \tfrac{1}{2}\log|X_0'W_0^{-1}X_0/\sigma_0^2| - \tfrac{1}{2}\tilde{Y}_0'(I - \tilde{H}_0)\tilde{Y}_0/\sigma_0^2, \qquad (2.3)$$
where $\tilde{Y}_0 = W_0^{-1/2}Y$, $\tilde{X}_0 = W_0^{-1/2}X_0$, $\tilde{H}_0 = \tilde{X}_0(\tilde{X}_0'\tilde{X}_0)^{-1}\tilde{X}_0'$ and $W_0 = W(\theta_0)$. Using equations (2.2) and (2.3), we can now obtain a model selection criterion as shown in the next subsection.
2.2. Marginal Information Criterion
A useful measure of the discrepancy between the true and candidate marginal log-likelihood functions is the Kullback-Leibler information
$$\Delta(\theta, \sigma^2) = -2E_0\{L(\theta, \sigma^2; Y)\} = \log|\sigma^2 W| + \log|X'W^{-1}X/\sigma^2| + \mathrm{tr}\{\beta_0'X_0'W^{-1/2}(I - \tilde{H})W^{-1/2}X_0\beta_0\}/\sigma^2 + \mathrm{tr}\{(I - \tilde{H})W^{-1/2}W_0W^{-1/2}\}\sigma_0^2/\sigma^2, \qquad (2.4)$$
where $E_0$ denotes the expectation under the true model. We assume that the candidate family includes the true model, an assumption that is also used in the derivation of the Akaike information criterion, AIC (Linhart & Zucchini, 1986), and the corrected Akaike information criterion, AICC (Hurvich & Tsai, 1989). Under this assumption, the columns of $X$ can be rearranged so that $X_0\beta_0 = X\beta^*$, where $\beta^* = (\beta_0', 0')'$ and $0'$ is a $1 \times (k - k_0)$ vector. Hence the third component of the right-hand side of (2.4) is 0. Furthermore, replacing $W_0^{-1/2}$ by its approximation $W^{-1/2}$ in the second trace component of equation (2.4), $\Delta(\theta, \sigma^2)$ can be approximated by
$$(n - k)\log(\sigma^2) + \log|W| + \log|X'W^{-1}X| + (n - k)\sigma_0^2/\sigma^2. \qquad (2.5)$$
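The reduction from (2.4) to (2.5) uses only the elementary determinant identities
$$\log|\sigma^2 W| = n\log\sigma^2 + \log|W|, \qquad \log|X'W^{-1}X/\sigma^2| = \log|X'W^{-1}X| - k\log\sigma^2,$$
so the two determinant terms contribute $(n-k)\log\sigma^2 + \log|W| + \log|X'W^{-1}X|$, while the approximation of $W_0$ by $W$ reduces the remaining trace to $\mathrm{tr}(I - \tilde{H}) = n - k$, giving the final term $(n-k)\sigma_0^2/\sigma^2$.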
A reasonable criterion for judging the quality of the candidate family in the light of the data is $E_0\{\Delta(\hat\theta, \hat\sigma^2)\}$, where $\hat\theta$ and $\hat\sigma^2$ are the maximum likelihood estimates of $\theta$ and $\sigma^2$ in the candidate family, respectively. Note that when the candidate models include the true model, $n\hat\sigma^2/\sigma_0^2$ is approximately distributed as $\chi^2_{n-k}$ and $E_0(\sigma_0^2/\hat\sigma^2) \approx n/(n - k - 2)$. This, together with equation (2.5), gives us
$$E_0\{\Delta(\hat\theta, \hat\sigma^2)\} \approx (n - k)E_0\{\log(\hat\sigma^2)\} + E_0\{\log|\hat{W}| + \log|X'\hat{W}^{-1}X|\} + \frac{n(n - k)}{n - k - 2},$$
where $\hat{W} = W(\hat\theta)$. This leads to the corrected Akaike information criterion obtained from the marginal log-likelihood function, which is
$$MIC = (n - k)\log(\hat\sigma^2) + \log|\hat{W}| + \log|X'\hat{W}^{-1}X| + \frac{n(n - k)}{n - k - 2}. \qquad (2.6)$$
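The expectation approximation used above is the standard moment of an inverse chi-squared variable; for the record,
$$n\hat\sigma^2/\sigma_0^2 \sim \chi^2_{n-k} \;\Longrightarrow\; E_0(\sigma_0^2/\hat\sigma^2) = n\,E\{1/\chi^2_{n-k}\} = \frac{n}{n-k-2} \quad (n-k>2),$$
so the last term of (2.5), $(n-k)E_0(\sigma_0^2/\hat\sigma^2)$, becomes $n(n-k)/(n-k-2)$ as in (2.6).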
Three analytical examples are given below to illustrate the use of MIC .
Example 2.1 (Regression model with constant variance) Consider the case where $W(\theta) = I$. Here (2.1) represents a multiple regression model with constant variance, and the resulting selection criterion is
$$MIC = (n - k)\log(\hat\sigma^2) + \log|X'X| + \frac{n(n - k)}{n - k - 2}. \qquad (2.7)$$
It is interesting to note that the $\log|X'X|$ term also appears in Wei's (1992) FIC criterion (see the expression of FIC in Section 4).
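For the constant-variance case, (2.7) can be computed directly. The sketch below is our own illustrative code, with $\hat\sigma^2$ taken as the maximum likelihood estimate RSS/$n$, and scores a family of sequentially nested candidates of the kind used in Section 4.

```python
import numpy as np

def mic_constant_variance(Y, X):
    """MIC of (2.7) for a candidate design X when W(theta) = I."""
    n, k = X.shape
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / n    # MLE of sigma^2
    _, logdet_XX = np.linalg.slogdet(X.T @ X)           # log|X'X|
    return (n - k) * np.log(sigma2_hat) + logdet_XX + n * (n - k) / (n - k - 2)

def select_order(Y, X_full):
    """Return the number of columns minimizing MIC over nested candidates."""
    scores = [mic_constant_variance(Y, X_full[:, :k])
              for k in range(1, X_full.shape[1] + 1)]
    return int(np.argmin(scores)) + 1
```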
Example 2.2 (Regression model with non-constant variance) Assume that $W(\theta) = \mathrm{diag}(W_i(\theta))$, where $W_i(\theta) = W(z_i; \theta)$, $z_i$ is a known $r \times 1$ vector and $\theta$ is an unknown $m \times 1$ vector, for $i = 1, \dots, n$. Under this assumption, (2.1) is a multiple regression model with non-constant variance, and the resulting marginal log-likelihood has the same form as equation (2.2). Hence, the selection criterion MIC is given by (2.6). For this weighted regression model, McCullagh & Tibshirani (1990) showed that its marginal log-likelihood is the first-order bias-corrected profile log-likelihood. In addition, Verbyla (1993) used the marginal likelihood to obtain a test statistic to assess the homogeneity of the errors, and recently Lyon & Tsai (1996) compared various tests for homogeneity obtained when using the log-likelihood, marginal log-likelihood, conditional log-likelihood, and modified log-likelihood functions.
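Example 2.2 leaves the form of $W_i(\theta)$ open. A log-linear variance function, matching the weights $w_i = \exp(\gamma_0 z_i)$ used in the simulations of Example 4.2, is one possible illustration; the parametrization below is our assumption, not the only admissible choice.

```python
import numpy as np

def diag_W(theta, Z):
    """Diagonal W(theta) with W_i(theta) = exp(2 * theta' z_i).

    Z is the n x r matrix whose rows are the known vectors z_i.  Under the
    model of Example 4.2, var(w_i * eps_i) = exp(2 * theta' z_i), so this
    W can be passed to the marginal log-likelihood sketch given after (2.2)
    and maximized over (theta, sigma^2) before evaluating (2.6).
    """
    return np.diag(np.exp(2.0 * (Z @ theta)))
```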
Example 2.3 (Regression model with ARMA errors) Consider the regression model given by (2.1), where the random errors are generated by an ARMA(p, q) process defined as
$$e_t - \rho_1 e_{t-1} - \cdots - \rho_p e_{t-p} = a_t - \varphi_1 a_{t-1} - \cdots - \varphi_q a_{t-q},$$
where $a_t$ is a sequence of independent normal random variables having mean zero and variance $\sigma^2$. The likelihood function for $e$ (ignoring the constant) is
$$L(e \mid \theta, \sigma^2) = |\sigma^2 W|^{-1/2}(2\pi)^{-n/2}\exp(-\tfrac{1}{2}e'W^{-1}e/\sigma^2), \qquad (2.8)$$
where $\theta = (\rho_1, \dots, \rho_p, \varphi_1, \dots, \varphi_q)'$ and $\sigma^2 W$ is the variance matrix of $e$. Equation (2.8) can be found in Cooper & Thompson (1977) and Harvey & Phillips (1979). The resulting marginal log-likelihood has the same form as (2.2) (see Wilson, 1989), and hence the selection criterion MIC is given by (2.6). Note that MIC is used to select regression parameters by assuming that the orders p and q of the ARMA process are specified. This is different from the approach taken by Tsay (1984), who proposed a method to identify the orders p and q when the dimension of the regression parameters is known.
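For the AR(1) special case used later in Example 4.3, $W(\theta)$ has the familiar closed form $W_{ij} = \rho^{|i-j|}/(1-\rho^2)$ when $\sigma^2$ is the innovation variance. The sketch below is our own code; in practice $\rho$ would be replaced by the estimate maximizing the marginal likelihood (2.2) before evaluating (2.6).

```python
import numpy as np

def ar1_W(rho, n):
    """W(theta) for stationary AR(1) errors: W_ij = rho^|i-j| / (1 - rho^2)."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - rho ** 2)

def mic_general(Y, X, W):
    """MIC of (2.6) for a candidate design X and a given W = W(theta_hat)."""
    n, k = X.shape
    L = np.linalg.cholesky(W)
    Yt, Xt = np.linalg.solve(L, Y), np.linalg.solve(L, X)
    beta_hat = np.linalg.lstsq(Xt, Yt, rcond=None)[0]
    sigma2_hat = np.sum((Yt - Xt @ beta_hat) ** 2) / n
    _, logdet_W = np.linalg.slogdet(W)
    _, logdet_XWX = np.linalg.slogdet(Xt.T @ Xt)        # = log|X' W^{-1} X|
    return ((n - k) * np.log(sigma2_hat) + logdet_W + logdet_XWX
            + n * (n - k) / (n - k - 2))
```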
Note that by using arguments similar to those in Examples 2.2 and 2.3, we can also obtain MIC for growth curve models and for regression models with spatial correlation errors (Sen & Srivastava, 1990, pp. 138 and 143). The theoretical justification for applying MIC to variable selection under the above conditions is given in the next section.
3. Asymptotic Results for MIC

Let $A = \{\alpha : \alpha \text{ is a nonempty subset of } \{1, \dots, k\}\}$ and $A_0 = \{\alpha \in A : E(Y) = X(\alpha)\beta(\alpha)\}$ be a subset of $A$, where each $\alpha$ in $A_0$ is associated with a model that includes the true model, and $X(\alpha)$ and $\beta(\alpha)$ contain the components of $X$ and $\beta$ indexed by $\alpha$, respectively. Finally, let $\hat\alpha$ denote the model selected by MIC so that its value is the smallest of those for all possible candidate models, and let $\alpha_0$ be the model in $A_0$ with the smallest dimension. We will first obtain asymptotic results for the multiple regression model with constant variance, and then extend the results to the general model setting given by (2.1). In order to state and prove our asymptotic results, we will need the following two definitions and assumptions.
Definition 1. The selection procedure is said to be consistent if $P\{\hat\alpha = \alpha_0\} \to 1$ as $n$ tends to infinity, and the selection procedure is said to be strongly consistent if
$$P\{\lim_{n\to\infty} \hat\alpha = \alpha_0\} = 1.$$
Definition 2. A random variable $\xi$ is sub-Gaussian if there exists a constant $\varphi > 0$ such that for all real $d$,
$$E\{\exp(d\xi)\} \le \exp(\varphi d^2).$$
This definition is adopted from Zheng & Loh (1995).
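As a quick check on Definition 2, a mean-zero normal variable satisfies it:
$$\xi \sim N(0, \sigma^2) \;\Rightarrow\; E\{\exp(d\xi)\} = \exp(d^2\sigma^2/2) \le \exp(\varphi d^2) \quad \text{for all real } d, \text{ with } \varphi = \sigma^2/2.$$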
Assumption 1. $\liminf_{n\to\infty} n^{-1}\inf_{\alpha \in A - A_0} \|X_0\beta_0 - H(\alpha)X_0\beta_0\|^2 > 0$, where $H(\alpha) = X(\alpha)\{X(\alpha)'X(\alpha)\}^{-1}X(\alpha)'$.

Assumption 2. $\limsup_{n\to\infty} \max_{\alpha \in A} |X(\alpha)'X(\alpha)/n| < \infty$ and $0 < \liminf_{n\to\infty} \min_{\alpha \in A} |X(\alpha)'X(\alpha)/n|$.

We can now present the asymptotic results for the selection criterion, MIC.
Theorem 3.1. If Assumptions 1 and 2 are satisfied and the $\varepsilon_i$ are assumed to be iid sub-Gaussian, then MIC is a consistent criterion.

The proof of Theorem 3.1 is given in the Appendix. Note that if the sub-Gaussian assumption is replaced by the moment condition $E(\varepsilon_1^{4h}) < \infty$ for some positive integer $h$, Theorem 3.1 remains true.
Theorem 3.2. Let
$$MIC_s(\alpha) = (n - k(\alpha))\log(\hat\sigma^2(\alpha)) + n_\delta\log|X(\alpha)'X(\alpha)| + \frac{2\{k(\alpha) + 2\}}{n - k(\alpha) - 2},$$
where $k(\alpha)$ is the dimension of the submodel $\alpha \in A$, $j_s = \max\{j : j^{1/s} < \log^2(j)\}$, and
$$n_\delta = \begin{cases} n^{1/s} & \text{when } n \le j_s, \\ n/\log(n) & \text{when } n > j_s. \end{cases}$$
Under the same assumptions given for Theorem 3.1, $MIC_s$ is a strongly consistent criterion for all finite $s > 1$.

The proof of Theorem 3.2 is not presented here since the techniques used are similar to those for the proof of Theorem 3.1; the details can be obtained from the authors on request. Note that under the assumptions of Theorem 3.1, MIC is nearly strongly consistent, since, except for a constant term $n + 2$, MIC and $MIC_s$ are essentially identical when $n \le j_s$, and the $j_s$ are very large; for example, $j_{35} = 7.430701195419606 \times 10^{183}$ and $j_{54} = 7.272891100785729 \times 10^{307}$.
Theorem 3.1 shows that MIC is a consistent criterion for multiple regression models with a constant variance error structure. In fact, this result can be extended to the general model setting given by (2.1), as we see below.
Theorem 3.3. In model (2.1), assume that both $\hat\theta - \theta_0$ and $\hat\sigma^2(\alpha; \hat\theta) - \hat\sigma^2(\alpha; \theta_0)$ tend to zero in probability as $n$ tends to infinity for all $\alpha \in A$. In addition, assume that the elements of $W(\theta)$ are continuous functions of $\theta$, $W(\theta)$ is positive definite in the neighborhood of $\theta_0$, and Assumptions 1 and 2 are satisfied when $X(\alpha)$ is replaced by $W(\theta_0)^{-1/2}X(\alpha)$. If the $\varepsilon_i$ are iid sub-Gaussian, then MIC is a consistent criterion.

The proof of Theorem 3.3 is given in the Appendix. Its results are still valid if the sub-Gaussian assumption is replaced by the moment condition $E(\varepsilon_1^{4h}) < \infty$ for some positive integer $h$.
4. Monte Carlo Results

In this section we use simulation studies to compare the performance of MIC with AIC, AICC, BIC, FIC, Cp and FPE, defined as: $AIC = n\log(\hat\sigma^2) + 2k$, $AICC = n\log(\hat\sigma^2) + 2n(k + 1)/(n - k - 2)$, $BIC = n\log(\hat\sigma^2) + k\log(n)$, $FIC = n\hat\sigma^2 + \tilde\sigma^2\log|X'X|$, $C_p = n\hat\sigma^2/\tilde\sigma^2 - n + 2k$, and $FPE = n\hat\sigma^2(n + k)/(n - k)$, where $\hat\sigma^2$ and $\tilde\sigma^2$ are the maximum likelihood estimates of $\sigma^2$ under the fitted model and the full model, respectively. We consider regression models with either constant variance, non-constant variance, or AR(1) errors. For each of five sample sizes (n = 15, 20, 40, 80, 160), 1000 realizations were generated from the true model (2.1), where $\beta_0 = (1, \dots, 1)'$ is a $k_0 \times 1$ vector, $\sigma_0 = 1$, and $x_i \sim N(0, I_{k_0})$ for $i = 1, \dots, n$. There are nine candidate variables, stored in an $n \times 9$ matrix of independent identically distributed normal random variables. The candidate models are linear, and are listed in columns in a sequential nested fashion. Hence the set of candidate models includes the true model.
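For reference, the competing criteria can be coded directly; this sketch (our own illustrative code) covers the constant-variance case of Example 4.1, with MIC in the form (2.7), $\hat\sigma^2$ the maximum likelihood estimate under the candidate design, and $\tilde\sigma^2$ that under the full nine-variable model.

```python
import numpy as np

def criteria(Y, X, sigma2_tilde):
    """All seven criteria of Section 4 for one candidate design X."""
    n, k = X.shape
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / n
    _, logdet_XX = np.linalg.slogdet(X.T @ X)
    return {
        "AIC":  n * np.log(sigma2_hat) + 2 * k,
        "AICC": n * np.log(sigma2_hat) + 2 * n * (k + 1) / (n - k - 2),
        "BIC":  n * np.log(sigma2_hat) + k * np.log(n),
        "FIC":  n * sigma2_hat + sigma2_tilde * logdet_XX,
        "Cp":   n * sigma2_hat / sigma2_tilde - n + 2 * k,
        "FPE":  n * sigma2_hat * (n + k) / (n - k),
        "MIC":  (n - k) * np.log(sigma2_hat) + logdet_XX
                + n * (n - k) / (n - k - 2),
    }
```

For Examples 4.2 and 4.3, $\log|X'X|$ in the MIC line is replaced by $\log|\hat{W}| + \log|X'\hat{W}^{-1}X|$ as in (2.6).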
Example 4.1. (Regression model with constant variance) Here we assume that $k_0 = 5$ and that the $\varepsilon_i$ are iid normal random variables with mean zero and variance 1. Table 1 presents the proportions of the true order selected by the various criteria. In the small sample sizes (n = 15 and 20), it is obvious that AICC and MIC outperform the other criteria, and FIC and Cp perform the worst. As the sample size increases, the proportions of correct model order chosen by BIC, FIC, and MIC also increase. This is not surprising since all these criteria are consistent. MIC performs the best overall for both large and small sample sizes.
Example 4.2. (Regression model with non-constant variance) Consider the weighted regression model
$$y_i = x_i'\beta_0 + w_i\varepsilon_i, \qquad i = 1, \dots, n,$$
where $w_i = \exp(\gamma_0 z_i)$, the $z_i$ are iid standard normal, and the $\gamma_0$ are 0.2, 0.5 and 0.7.
Table 2 shows that the performances of all the criteria are similar to those for Example 4.1; that is, MIC is superior to AICC in small sample sizes, is slightly better than FIC and BIC in large sample sizes, and is best overall. The criteria AIC, Cp and FPE all perform comparably poorly. It is also interesting to note that the proportion of correct model order selections for each criterion decreases as $\gamma_0$ increases. This indicates that strong heterogeneity has a negative effect on model selection.
Example 4.3. (Regression model with AR(1) errors) Consider the linear regression with AR(1) errors
$$y_i = x_i'\beta_0 + \varepsilon_i, \qquad i = 1, \dots, n,$$
where the $\varepsilon_i$ are random errors satisfying
$$\varepsilon_1 \sim N(0, 1/(1 - \rho_0^2)), \qquad \varepsilon_i = \rho_0\varepsilon_{i-1} + \nu_i, \quad \nu_i \sim N(0, 1), \quad \text{for } i = 2, 3, \dots, n,$$
and $\rho_0 = 0.1$, 0.5 and 0.9.
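The error process of Example 4.3 can be generated exactly as specified; the helper name and seed in this small sketch are ours.

```python
import numpy as np

def ar1_errors(n, rho, rng):
    """Stationary AR(1) errors: eps_1 ~ N(0, 1/(1 - rho^2)), eps_i = rho*eps_{i-1} + nu_i."""
    eps = np.empty(n)
    eps[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - rho ** 2))  # variance 1/(1 - rho^2)
    for i in range(1, n):
        eps[i] = rho * eps[i - 1] + rng.normal()              # nu_i ~ N(0, 1)
    return eps

# e.g. eps = ar1_errors(80, 0.5, np.random.default_rng(0))
```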
Table 3 only presents simulation results for positive autocorrelation, since the corresponding negative autocorrelation gives similar results. It shows that MIC outperforms the other criteria except when n = 80 and 160, where BIC performs slightly better. Not surprisingly, large positive autocorrelation reduces the effectiveness of model selection criteria, especially when $\rho_0$ is close to 1 and the sample size is small or moderate. We also obtain the same conclusion when the autocorrelation is negative.
The above three examples give us more insight into why AICC and MIC perform quite differently. For regression models with constant variance (Example 4.1), AICC and MIC are unbiased estimators of the Kullback-Leibler information computed from the log-likelihood function and the marginal log-likelihood function, respectively. Hence, both perform comparably in small samples. As the sample size increases, MIC, because it is a consistent criterion, outperforms AICC. For regression models with heteroscedastic errors (Example 4.2) or autocorrelated errors (Example 4.3), MIC is still an approximately unbiased estimator of the Kullback-Leibler information computed from the marginal log-likelihood function, but AICC is not an unbiased estimator of the Kullback-Leibler information computed from the log-likelihood function. Therefore, MIC outperforms AICC in both small and large samples.
5. Conclusions
We have applied the marginal log-likelihood approach to regression models with the general variance structure in order to obtain a good criterion, MIC. We show that MIC is a consistent criterion. Simulation studies indicate that MIC is not only superior to the efficient criteria AIC, AICC, Cp and FPE, but also performs better than (or comparably to) the consistent criteria BIC and FIC. This means that the marginal log-likelihood approach results in better variable selections than the log-likelihood approach under these conditions, and therefore it is natural to use the marginal log-likelihood approach to revise some existing model selection criteria. MIC can also be extended to other model settings, such as regression models with ARCH errors (Gourieroux, 1997) and multivariate regression models with a general variance structure (Greene, 1993, Chapters 16 & 17; Diggle, Liang & Zeger, 1994, Chapter 4; Bedrick & Tsai, 1994). Finally, in the future the conditional log-likelihood and modified profile log-likelihood approaches can be applied to obtain model selection criteria, which can then be compared with MIC.
ACKNOWLEDGMENTS The research of Peide Shi was supported by NSF of China. The research of Chih-Ling Tsai was supported by National Science Foundation grant DMS 95-10511.
REFERENCES

Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math. 22, 203-17.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, Ed. B. N. Petrov and F. Csaki, pp. 267-81. Budapest: Akademia Kiado.
Barndorff-Nielsen, O. E. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika 70, 343-65.
Barndorff-Nielsen, O. E. (1985). Properties of modified profile likelihood. In Contributions to Probability and Statistics in Honour of Gunnar Blom, Ed. J. Lanke and G. Lindgren, pp. 25-38. Lund.
Bedrick, E. J. & Tsai, C.-L. (1994). Model selection for multivariate regression in small samples. Biometrics 50, 226-31.
Chatterjee, S. & Hadi, A. S. (1988). Sensitivity Analysis in Linear Regression. New York: Wiley.
Cooper, D. M. & Thompson, R. (1977). A note on the estimation of the parameters of the autoregressive-moving average process. Biometrika 64, 625-28.
Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics. New York: Chapman and Hall.
Diggle, P. J., Liang, K. Y. & Zeger, S. L. (1994). Analysis of Longitudinal Data. New York: Oxford.
Gourieroux, C. (1997). ARCH Models and Financial Applications. New York: Springer.
Greene, W. H. (1993). Econometric Analysis. New York: Macmillan.
Harvey, A. C. & Phillips, G. D. A. (1979). Maximum likelihood estimation of regression models with autoregressive-moving average disturbances. Biometrika 66, 49-58.
Hurvich, C. M. & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika 76, 297-307.
Linhart, H. & Zucchini, W. (1986). Model Selection. New York: Wiley.
Lyon, J. & Tsai, C.-L. (1996). A comparison of tests for heteroscedasticity. The Statistician 45, 337-49.
Mallows, C. L. (1973). Some comments on Cp. Technometrics 15, 661-75.
McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models. New York: Chapman and Hall.
McCullagh, P. & Tibshirani, R. (1990). A simple method for the adjustment of profile likelihoods. J. R. Statist. Soc. B 52, 325-44.
Rao, C. R. & Kleffe, J. (1988). Estimation of Variance Components and Applications. Amsterdam: North-Holland.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-4.
Sen, A. & Srivastava, M. (1990). Regression Analysis: Theory, Methods, and Applications. New York: Springer-Verlag.
Tsay, R. S. (1984). Regression models with time series errors. J. Amer. Statist. Assoc. 79, 118-24.
Verbyla, A. P. (1993). Modelling variance heterogeneity: residual maximum likelihood and diagnostics. J. R. Statist. Soc. B 55, 493-508.
Wei, C. Z. (1992). On predictive least squares principles. Ann. Statist. 20, 1-42.
Wilson, G. T. (1989). On the use of marginal likelihood in time series model estimation. J. R. Statist. Soc. B 51, 15-27.
Zheng, X. D. & Loh, W.-Y. (1995). Consistent variable selection in linear models. J. Amer. Statist. Assoc. 90, 151-6.
Table 1. Proportions of correct model order selection by AIC, AICC, BIC, FIC, Cp, FPE, and MIC criteria in 1000 realizations for the regression model.

Criteria   n=15    n=20    n=40    n=80    n=160
AIC        0.383   0.510   0.647   0.734   0.713
AICC       0.703   0.874   0.835   0.810   0.747
BIC        0.509   0.720   0.878   0.954   0.959
FIC        0.314   0.530   0.841   0.945   0.956
Cp         0.302   0.433   0.613   0.709   0.704
FPE        0.471   0.565   0.663   0.735   0.713
MIC        0.781   0.875   0.919   0.962   0.968
Table 2. Proportions of correct model order selection by AIC, AICC, BIC, FIC, Cp, FPE, and MIC criteria in 1000 realizations for the weighted regression model.

gamma_0   Criteria   n=15    n=20    n=40    n=80    n=160
0.2       AIC        0.317   0.458   0.620   0.691   0.720
          AICC       0.633   0.805   0.790   0.783   0.760
          BIC        0.434   0.655   0.856   0.935   0.968
          FIC        0.280   0.500   0.788   0.919   0.958
          Cp         0.266   0.399   0.591   0.678   0.713
          FPE        0.408   0.515   0.629   0.693   0.720
          MIC        0.756   0.849   0.921   0.951   0.976
0.5       AIC        0.403   0.536   0.656   0.692   0.662
          AICC       0.620   0.821   0.801   0.760   0.707
          BIC        0.481   0.686   0.858   0.918   0.948
          FIC        0.369   0.589   0.824   0.898   0.942
          Cp         0.350   0.493   0.630   0.673   0.659
          FPE        0.464   0.590   0.661   0.693   0.662
          MIC        0.742   0.854   0.919   0.953   0.968
0.7       AIC        0.429   0.580   0.666   0.592   0.520
          AICC       0.601   0.824   0.793   0.663   0.565
          BIC        0.523   0.695   0.844   0.835   0.836
          FIC        0.420   0.638   0.810   0.818   0.826
          Cp         0.407   0.566   0.658   0.588   0.513
          FPE        0.498   0.621   0.673   0.596   0.520
          MIC        0.731   0.853   0.894   0.867   0.871
Table 3. Proportions of correct model order selection by AIC, AICC, BIC, FIC, Cp, FPE, and MIC criteria in 1000 realizations for the regression model with AR(1) errors.

rho_0   Criteria   n=15    n=20    n=40    n=80    n=160
0.1     AIC        0.380   0.506   0.660   0.731   0.705
        AICC       0.703   0.871   0.826   0.817   0.746
        BIC        0.505   0.709   0.880   0.954   0.959
        FIC        0.307   0.530   0.830   0.938   0.951
        Cp         0.290   0.428   0.618   0.713   0.703
        FPE        0.466   0.568   0.671   0.734   0.705
        MIC        0.774   0.873   0.922   0.966   0.964
0.5     AIC        0.364   0.512   0.667   0.764   0.721
        AICC       0.606   0.817   0.833   0.835   0.760
        BIC        0.476   0.679   0.883   0.960   0.970
        FIC        0.295   0.540   0.840   0.940   0.962
        Cp         0.285   0.442   0.625   0.748   0.713
        FPE        0.456   0.565   0.681   0.767   0.721
        MIC        0.707   0.814   0.920   0.963   0.965
0.9     AIC        0.246   0.339   0.580   0.718   0.745
        AICC       0.246   0.407   0.672   0.799   0.785
        BIC        0.276   0.377   0.651   0.897   0.970
        FIC        0.225   0.369   0.655   0.903   0.964
        Cp         0.224   0.315   0.552   0.704   0.731
        FPE        0.289   0.371   0.590   0.720   0.745
        MIC        0.445   0.476   0.689   0.863   0.919