Variable Selection by Cp Statistic in Multiple Responses Regression with Fewer Sample Size Than the Dimension

Mariko Yamamura (1), Hirokazu Yanagihara (2), and Muni S. Srivastava (3)

(1) Graduate School of Business Sciences, University of Tsukuba, 3-29-1 Otsuka, Bunkyo-ku, Tokyo 112-0012, Japan
    [email protected]
(2) Department of Mathematics, Graduate School of Science, Hiroshima University, 1-3-1 Kagamiyama, Higashi-Hiroshima, Hiroshima 739-8626, Japan
    [email protected]
(3) Department of Statistics, University of Toronto, 100 St. George Street, Toronto, Ontario M5S 3G3, Canada
    [email protected]
Abstract. In this paper, we introduce an improved statistical method for model selection and thereby contribute to updating data mining techniques. We consider the problem of selecting q explanatory variables out of k ($q \le k$) in multiple responses regression when the dimension p of the response variables is larger than the sample size n. We consider the $C_p$ statistic, which is an estimator of the sum of standardized mean square errors. The standardization uses the inverse of the variance-covariance matrix of the p response variables, and hence requires the inverse of the sample variance-covariance matrix. However, since $n < p$, such an inverse matrix does not exist. Thus, we use the Moore-Penrose inverse to define the $C_p$ statistic. This statistic is denoted by $C_p^+$. An example is given to illustrate the use of the $C_p^+$ statistic, and its performance is demonstrated by a simulation study and a real data example.

Keywords: High dimensional data, Mallows' Cp statistic, Model selection, Moore-Penrose inverse, Multivariate linear regression model.
1 Introduction
Statistical analysis is a powerful tool for understanding and explaining phenomena of interest in many fields; for instance, it is widely used for marketing in business. Data mining techniques, especially statistical analyses, depend on statistical software, and statistical software is updated when new or better statistical methods are introduced. These days, we often encounter high dimensional data, in which the dimension of the vector of mutually correlated response variables is
This research was supported by the Japan Society for the Promotion of Science, Excellent Young Researchers Overseas Visit Program, #21-2086.
larger than the sample size. Methods for analyzing such high dimensional data have only recently begun to be studied, and statistical software has not yet been updated accordingly. Therefore, in this paper, we introduce a statistical method for model selection when the data are high dimensional, and thereby contribute to updating statistical software.

Suppose that k-variate explanatory variables $x_i = (x_{i1}, \ldots, x_{ik})'$ and p-variate mutually correlated response variables $y_i = (y_{i1}, \ldots, y_{ip})'$ ($i = 1, \ldots, n$) are observed, where n is the sample size. A linear regression model is useful for predicting $y_i$ from $x_i$. Such a model is commonly called a multivariate linear regression (MLR) model. The MLR model is one of the basic models in multivariate analysis. It is introduced in many textbooks on applied multivariate statistical analysis (see, e.g., [9, Chapter 9], [14, Chapter 4]), and even now it is widely applied in chemometrics, engineering, econometrics, psychometrics and many other fields for the prediction of correlated multiple responses from a set of explanatory variables (e.g., [1], [6], [7], [8] and [15]).

The n vectors of response variables $y_1, \ldots, y_n$ and the n vectors of k explanatory variables $x_1, \ldots, x_n$ are written in matrix notation as an $n \times p$ matrix $Y = (y_1, \ldots, y_n)'$ and an $n \times k$ matrix $X = (x_1, \ldots, x_n)'$, respectively. Here, we assume that X is of full rank, i.e., $\mathrm{rank}(X) = k$. A matrix form of the MLR model is given by

$$Y \sim N_{n \times p}(X\Xi, \Sigma \otimes I_n). \quad (1)$$

Here $\Sigma \otimes I_n$ is called the Kronecker product of $\Sigma$ and $I_n$, and its (i, j)th block element is given by $\sigma_{ij} I_n$, where $\sigma_{ij}$ is the (i, j)th element of $\Sigma$.

It is desirable to have as few explanatory variables as possible for ease of interpretation, and, after all, not all k explanatory variables are needed for a good prediction. Although several methods are available, most applied researchers use the $C_p$ statistic proposed by [4]. It is based on an estimate of a standardized version of the mean square errors (MSE). Suppose we choose a subset of q explanatory variables out of the k explanatory variables ($q \le k$), i.e., we use an $n \times q$ matrix $X_q$ consisting of q columns of X for the prediction. Then the predicted value of Y is given by

$$\hat{Y}_q = X_q \hat{\Xi}_q, \quad \hat{\Xi}_q = (X_q' X_q)^{-1} X_q' Y, \quad (2)$$

and the MSE is given by

$$\mathrm{MSE} = E[\mathrm{tr}\{\Sigma^{-1}(X\Xi - \hat{Y}_q)'(X\Xi - \hat{Y}_q)\}] = pq + \mathrm{tr}\{\Sigma^{-1}\Xi' X'(I_n - H_q)X\Xi\}, \quad (3)$$

where $H_q = X_q(X_q'X_q)^{-1}X_q'$. When $n > p$, an unbiased estimator of this MSE is given by

$$\{1 - (p+1)/(n-k)\}\,\mathrm{tr}(S^{-1}V_q) - np + 2kq + p(p+1)$$

(see [13]), and is called the $C_p$ statistic. Here,

$$S = \frac{1}{n-k}V, \quad V = Y'(I_n - H)Y, \quad V_q = Y'(I_n - H_q)Y, \quad (4)$$

and $H = X(X'X)^{-1}X'$.
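To make these quantities concrete, here is a minimal sketch in Python/NumPy that computes $V$, $V_q$, $S$ and the classical $C_p$ above for the $n > p$ case. The function name and arguments are ours, chosen for illustration; this is not the authors' software.

```python
import numpy as np

def classical_cp(Y, X, cols):
    """Classical Cp for the submodel using X[:, cols]; assumes n - k > p
    so that S is invertible."""
    n, p = Y.shape
    k = X.shape[1]
    q = len(cols)
    Xq = X[:, cols]
    # Hat matrices and residual matrices V, Vq from (4)
    H = X @ np.linalg.solve(X.T @ X, X.T)
    Hq = Xq @ np.linalg.solve(Xq.T @ Xq, Xq.T)
    V = Y.T @ (np.eye(n) - H) @ Y
    Vq = Y.T @ (np.eye(n) - Hq) @ Y
    S = V / (n - k)
    # Cp = {1 - (p+1)/(n-k)} tr(S^{-1} Vq) - np + 2kq + p(p+1)
    return ((1 - (p + 1) / (n - k)) * np.trace(np.linalg.solve(S, Vq))
            - n * p + 2 * k * q + p * (p + 1))
```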
However, when p is close to n, S is not a stable estimator of $\Sigma$, and when $n < p$, the inverse of S does not even exist. In this case, we use $S^+$, the Moore-Penrose inverse of S, as has recently been done by [11]. The Moore-Penrose inverse of any matrix is unique and satisfies the following four conditions: (i) $SS^+S = S$, (ii) $S^+SS^+ = S^+$, (iii) $(SS^+)' = SS^+$, (iv) $(S^+S)' = S^+S$ (see [5, p. 26]). The objective of this paper is to obtain an asymptotically unbiased estimator of the MSE when $n < p$ and $(n, p) \to \infty$. Such an estimator will be denoted by $C_p^+$.

This paper is organized as follows: In Section 2, we propose the new $C_p^+$ for $p > n$. In Sections 3 and 4, we verify the performance of the proposed criteria through a numerical simulation and a real data study, respectively. In Section 5, we give a discussion and conclusions. Technical details are provided in the Appendix.
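As a quick sanity check, the four Penrose conditions above can be verified numerically with NumPy's pseudo-inverse; this small sketch is our own illustration and not part of the proposed method.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 20, 30, 8              # p > n, so S is singular
E = rng.standard_normal((n - k, p))
S = E.T @ E / (n - k)            # a rank-deficient sample covariance matrix
Sp = np.linalg.pinv(S)           # Moore-Penrose inverse S^+

# The four defining conditions (i)-(iv)
assert np.allclose(S @ Sp @ S, S)
assert np.allclose(Sp @ S @ Sp, Sp)
assert np.allclose((S @ Sp).T, S @ Sp)
assert np.allclose((Sp @ S).T, Sp @ S)
```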
2 $C_p^+$ Statistics
When $p < n$, a rough estimator of the MSE in (3) is given by $\mathrm{tr}(S^{-1}V_q) - np + 2pq$, where S and $V_q$ are given by (4). Hence, when $p > n$, a rough estimator of the MSE is defined by replacing $S^{-1}$ with $S^+$ as

$$C_p^+ = \mathrm{tr}(S^+ V_q) - np + 2pq. \quad (5)$$
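A minimal sketch of $C_p^+$ in Python/NumPy follows; again, the function name and arguments are our own illustrative choices.

```python
import numpy as np

def cp_plus(Y, X, cols):
    """Cp+ of (5): Cp with S^{-1} replaced by the Moore-Penrose inverse S^+."""
    n, p = Y.shape
    k = X.shape[1]
    q = len(cols)
    Xq = X[:, cols]
    H = X @ np.linalg.solve(X.T @ X, X.T)
    Hq = Xq @ np.linalg.solve(Xq.T @ Xq, Xq.T)
    V = Y.T @ (np.eye(n) - H) @ Y
    Vq = Y.T @ (np.eye(n) - Hq) @ Y
    S = V / (n - k)
    return np.trace(np.linalg.pinv(S) @ Vq) - n * p + 2 * p * q
```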
However, $C_p^+$ has a constant bias in estimating the MSE. Such a bias may become large when the dimension p is large. Hence we try to remove this bias by evaluating it from an asymptotic theory in which the dimension and the sample size approach $\infty$ simultaneously. Suppose that the following three conditions hold:

(C.1) $0 < \lim_{p \to \infty} \mathrm{tr}(\Sigma^i)/p \ (= \alpha_i) < \infty$ ($i = 1, 2$ and $4$),
(C.2) $n - k = O(p^\delta)$ for $0 < \delta \le 1/2$,
(C.3) the maximum eigenvalue of $\Sigma$ is bounded for large p.

Under these three conditions and $H_q X = X$, the bias $\Delta = \mathrm{MSE} - E[C_p^+]$ can be expanded as

$$\Delta = \{(n-k)^2\gamma/p - p\}q + np - (n-k)^2(1 + k\gamma/p) + o(1), \quad (6)$$

where $\gamma = \alpha_2/\alpha_1^2$ (the proof is given in the Appendix). Let

$$\hat{\gamma} = \frac{p(n-k)^2}{(n-k-1)(n-k+2)}\left\{\frac{\mathrm{tr}(S^2)}{(\mathrm{tr}\,S)^2} - \frac{1}{n-k}\right\}. \quad (7)$$

The estimator $\hat{\gamma}$ is a consistent estimator of $\gamma$ when conditions (C.1), (C.2) and (C.3) are satisfied (see [10]). By using (6) and (7), we propose a new estimator of the MSE as

$$C_{p,\hat{\gamma}}^+ = \mathrm{tr}(S^+ V_q) + \{(n-k)^2\hat{\gamma}/p + p\}q - (n-k)^2(1 + k\hat{\gamma}/p). \quad (8)$$
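The bias-corrected criterion (8) can be sketched as follows; this is our illustrative code under the definitions above, not the authors' implementation.

```python
import numpy as np

def gamma_hat(S, n, k, p):
    """Consistent estimator of gamma = alpha2/alpha1^2 from (7)."""
    m = n - k
    return (p * m**2 / ((m - 1) * (m + 2))) * (
        np.trace(S @ S) / np.trace(S)**2 - 1 / m)

def cp_plus_gamma(Y, X, cols):
    """Bias-corrected Cp+ of (8)."""
    n, p = Y.shape
    k = X.shape[1]
    q = len(cols)
    Xq = X[:, cols]
    H = X @ np.linalg.solve(X.T @ X, X.T)
    Hq = Xq @ np.linalg.solve(Xq.T @ Xq, Xq.T)
    V = Y.T @ (np.eye(n) - H) @ Y
    Vq = Y.T @ (np.eye(n) - Hq) @ Y
    S = V / (n - k)
    g = gamma_hat(S, n, k, p)
    m2 = (n - k)**2
    return (np.trace(np.linalg.pinv(S) @ Vq)
            + (m2 * g / p + p) * q - m2 * (1 + k * g / p))
```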
3 A Simulation Study
In the previous section, we avoided the nonexistence of the matrix used to standardize $V_q$ by using the Moore-Penrose inverse. However, we can avoid the singularity in another way if we allow model misspecification. Another choice is to tentatively ignore the correlations between the elements of $y_i$; namely, we use $S_{(d)} = \mathrm{diag}(s_1, \ldots, s_p)$ to standardize $V_q$, where $s_i$ ($i = 1, \ldots, p$) is the ith diagonal element of S. This has been done by [2] in discriminant analysis and by [12] in testing the equality of two mean vectors. Thus, we can also define an estimator of the MSE as

$$C_p^{(d)} = \mathrm{tr}(S_{(d)}^{-1}V_q) - np + 2pq. \quad (9)$$
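A sketch of $C_p^{(d)}$, which standardizes by the diagonal of S only, is given below (illustrative code, names ours):

```python
import numpy as np

def cp_diag(Y, X, cols):
    """Cp(d) of (9): uses only the diagonal of S, ignoring correlations."""
    n, p = Y.shape
    k = X.shape[1]
    q = len(cols)
    Xq = X[:, cols]
    H = X @ np.linalg.solve(X.T @ X, X.T)
    Hq = Xq @ np.linalg.solve(Xq.T @ Xq, Xq.T)
    V = Y.T @ (np.eye(n) - H) @ Y
    Vq = Y.T @ (np.eye(n) - Hq) @ Y
    s = np.diag(V) / (n - k)                 # diagonal elements of S
    return np.sum(np.diag(Vq) / s) - n * p + 2 * p * q
```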
The effect of the correlations between the elements of $y_i$ on model selection can be studied by comparing the two proposed criteria $C_p^+$ and $C_p^{(d)}$.

We evaluated the proposed $C_p$ statistics numerically by applying them to the regression model in (1) with $n = 20$, $k = 8$, and $p = 30$ and $100$. Here, we assume that $\Sigma = \mathrm{diag}(\psi_1, \ldots, \psi_p)\,\Phi\,\mathrm{diag}(\psi_1, \ldots, \psi_p)$. In this numerical study, we chose:

- X: the first column vector was $1_n$ and the others were generated from $U(-1, 1)$,
- $\Xi$: the first, second, third and fourth rows were $-\tau(1 - a_j)$, $\tau(1 + a_j)$, $-\tau(2 - a_j)$ and $\tau(1 + a_j)$ ($j = 1, \ldots, p$), respectively, and the others were 0,
- $\psi_j$: $\psi_j = 2 + a_j^{1/7}$ ($j = 1, \ldots, p$),
- $\Phi$: the (i, j)th element is $\rho^{|i-j|}$ ($i = 1, \ldots, p$; $j = 1, \ldots, p$),

where $1_n$ is an n-dimensional vector of ones and $a_j = (p - j + 1)/p$. Let $M_j$ denote the jth candidate model whose matrix of explanatory variables consists of the first j columns of X ($j = 1, \ldots, k$); thus the candidate models are nested. Moreover, we chose $\tau = 0.0$ or $8.0$, giving two types of true model: the true model is $M_1$ when $\tau = 0.0$ and $M_4$ when $\tau = 8.0$. A runnable sketch of this data-generating design is given at the end of this section.

We compared $C_p^+$, $C_{p,\hat{\gamma}}^+$ and $C_p^{(d)}$ with respect to the following two properties: (i) the selection probability of the model chosen by minimizing the criterion, and (ii) the true MSE of the predicted values of the best model chosen by minimizing the criterion, which is defined by

$$\mathrm{MSE}_B = \frac{1}{np} E[\mathrm{tr}\{\Sigma^{-1}(X\Xi - \hat{Y}_B)'(X\Xi - \hat{Y}_B)\}], \quad (10)$$

where $\hat{Y}_B$ is the predictor of Y based on the best model chosen by each $C_p$. Since the prediction error has to be measured in the same way as the goodness of fit of the model, the prediction error of the best model is defined by (10). These two properties were evaluated by Monte Carlo simulation with 1,000 iterations. Since $\Xi$ and $\Sigma$ are known in the simulation study, we can evaluate $\mathrm{MSE}_B$ by Monte Carlo simulation. Tables 1 and 2 show the obtained properties (i) and (ii), respectively. From the tables, we can see that $C_{p,\hat{\gamma}}^+$ performed well in all cases. The performance of $C_p^+$ was also good, but it deteriorated when $\tau = 8.0$, $\rho = 0.8$ and $p = 100$. Furthermore, the performance of $C_p^{(d)}$ was not too bad when $\rho$ was low.
Table 1. Selection probabilities (%) of the three $C_p$ statistics

τ = 0.0:

| Criterion | ρ = 0.2, p = 30 | ρ = 0.2, p = 100 | ρ = 0.8, p = 30 | ρ = 0.8, p = 100 |
|---|---|---|---|---|
| $C_p^+$ | 100.0 | 100.0 | 100.0 | 100.0 |
| $C_{p,\hat\gamma}^+$ | 100.0 | 100.0 | 98.90 | 100.0 |
| $C_p^{(d)}$ | 98.70 | 99.90 | 70.90 | 74.20 |

τ = 8.0:

| Criterion | ρ = 0.2, p = 30 | ρ = 0.2, p = 100 | ρ = 0.8, p = 30 | ρ = 0.8, p = 100 |
|---|---|---|---|---|
| $C_p^+$ | 99.80 | 100.0 | 99.90 | 89.00 |
| $C_{p,\hat\gamma}^+$ | 99.80 | 100.0 | 99.30 | 99.90 |
| $C_p^{(d)}$ | 98.70 | 99.80 | 73.80 | 76.30 |
Table 2. True MSEs of the three $C_p$ statistics

τ = 0.0:

| Criterion | ρ = 0.2, p = 30 | ρ = 0.2, p = 100 | ρ = 0.8, p = 30 | ρ = 0.8, p = 100 |
|---|---|---|---|---|
| $C_p^+$ | 0.050 | 0.050 | 0.050 | 0.050 |
| $C_{p,\hat\gamma}^+$ | 0.050 | 0.050 | 0.050 | 0.050 |
| $C_p^{(d)}$ | 0.051 | 0.050 | 0.095 | 0.085 |

τ = 8.0:

| Criterion | ρ = 0.2, p = 30 | ρ = 0.2, p = 100 | ρ = 0.8, p = 30 | ρ = 0.8, p = 100 |
|---|---|---|---|---|
| $C_p^+$ | 0.200 | 0.200 | 0.200 | 0.210 |
| $C_{p,\hat\gamma}^+$ | 0.200 | 0.200 | 0.200 | 0.201 |
| $C_p^{(d)}$ | 0.201 | 0.200 | 0.227 | 0.231 |
However, when $\rho$ was high, the performance of $C_p^{(d)}$ deteriorated. This result means that, when the response variables are not independent, we must take the correlations into account in order to evaluate the goodness of fit of a statistical model correctly. We have studied several other simulation settings and obtained similar results.
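Under stated assumptions (our reading of the design above, including $\psi_j = 2 + a_j^{1/7}$, and reusing the `cp_plus` function from the sketch in Section 2), one replication of the simulation can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, p, rho, tau = 20, 8, 30, 0.2, 8.0

# Design matrix X: intercept plus U(-1, 1) columns
X = np.hstack([np.ones((n, 1)), rng.uniform(-1, 1, (n, k - 1))])

# Coefficient matrix Xi: first four rows nonzero, the rest zero
a = (p - np.arange(1, p + 1) + 1) / p
Xi = np.zeros((k, p))
Xi[0], Xi[1] = -tau * (1 - a), tau * (1 + a)
Xi[2], Xi[3] = -tau * (2 - a), tau * (1 + a)

# Covariance Sigma = D Phi D with Phi_{ij} = rho^{|i-j|}
psi = 2 + a ** (1 / 7)
idx = np.arange(p)
Phi = rho ** np.abs(idx[:, None] - idx[None, :])
Sigma = np.diag(psi) @ Phi @ np.diag(psi)

# Generate Y ~ N(X Xi, Sigma x I_n) and pick the nested model minimizing Cp+
Y = X @ Xi + rng.multivariate_normal(np.zeros(p), Sigma, size=n)
best = min(range(1, k + 1), key=lambda j: cp_plus(Y, X, list(range(j))))
print("selected model M%d" % best)
```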
4 An Example Study
We show an example of model selection using the real data in [3]. These data give 21 body dimension measurements in cm: biacromial diameter, biiliac diameter (pelvic breadth), bitrochanteric diameter, chest depth, chest diameter, elbow diameter, wrist diameter, knee diameter, ankle diameter, shoulder girth, chest girth, waist girth, navel girth, hip girth, thigh girth, bicep girth, forearm girth, knee girth, calf girth, ankle girth and wrist girth. The data also give age in years, weight in kg, height in cm and gender. The observations are 507 individuals in their 20s and 30s. We applied multivariate linear regression to examine the performance of $C_p^+$, $C_{p,\hat{\gamma}}^+$ and $C_p^{(d)}$. The response variables were the 21 body dimension measurements, and the 4 explanatory variables were age, weight, height and gender (coded 1 for males and 0 for females). The best model was selected from the 16 models having different combinations of the 4 explanatory variables by $C_p^+$, $C_{p,\hat{\gamma}}^+$ or $C_p^{(d)}$.

We randomly divided the data into three samples of sizes $10\,(= n_{(1)})$, $10\,(= n_{(2)})$ and $487\,(= n_{(3)})$, and repeated this division 1,000 times. The divided samples are denoted by $(Y_{(1)}, X_{(1)})$, $(Y_{(2)}, X_{(2)})$ and $(Y_{(3)}, X_{(3)})$, respectively. To calculate the mean squared error M, we used $(Y_{(3)}, X_{(3)})$ to estimate the covariance matrix by $\hat{\Sigma}_{(3)} = Y_{(3)}'\{I_{n_{(3)}} - X_{(3)}(X_{(3)}'X_{(3)})^{-1}X_{(3)}'\}Y_{(3)}/(n_{(3)} - 5)$. We used $(Y_{(1)}, X_{(1)})$ for the model selection by $C_p^+$, $C_{p,\hat{\gamma}}^+$ and $C_p^{(d)}$, as well as for estimating the regression parameters $\Xi$.
Table 3. Results of real data (frequencies of selection in 1,000 repetitions, and MSE)

| Variables | $C_p^+$ | $C_{p,\hat\gamma}^+$ | $C_p^{(d)}$ |
|---|---|---|---|
| {} | 47 | 7 | 0 |
| {1} | 0 | 2 | 0 |
| {2} | 781 | 567 | 11 |
| {3} | 2 | 1 | 0 |
| {4} | 56 | 43 | 0 |
| {1, 2} | 10 | 50 | 15 |
| {1, 3} | 0 | 1 | 0 |
| {1, 4} | 1 | 3 | 0 |
| {2, 3} | 19 | 63 | 35 |
| {2, 4} | 82 | 237 | 233 |
| {3, 4} | 0 | 0 | 0 |
| {1, 2, 3} | 0 | 3 | 34 |
| {1, 2, 4} | 1 | 14 | 204 |
| {1, 3, 4} | 0 | 0 | 3 |
| {2, 3, 4} | 1 | 7 | 251 |
| {1, 2, 3, 4} | 0 | 2 | 214 |
| MSE | 0.753 | 0.695 | 0.948 |
The predicted value of $Y_{(2)}$ is given by $\hat{Y}_{(2)B} = X_{(2)B}(X_{(1)B}'X_{(1)B})^{-1}X_{(1)B}'Y_{(1)}$, where $X_{(1)B}$ and $X_{(2)B}$ are the matrices of the best explanatory variables chosen by $C_p^+$, $C_{p,\hat{\gamma}}^+$ and $C_p^{(d)}$. Thus, M is given by

$$M = \frac{1}{n_{(1)}p}\,\mathrm{tr}\{\hat{\Sigma}_{(3)}^{-1}(Y_{(2)} - \hat{Y}_{(2)B})'(Y_{(2)} - \hat{Y}_{(2)B})\} - 1.$$
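A sketch of this validation measure in Python/NumPy follows; the names are ours, and `best_cols` stands for the index set chosen by one of the criteria on the first subsample.

```python
import numpy as np

def validation_m(Y1, X1, Y2, X2, Sigma3_hat, best_cols):
    """M: standardized prediction error of the best model on the holdout pair."""
    n1, p = Y1.shape
    X1B, X2B = X1[:, best_cols], X2[:, best_cols]
    # Predict the holdout responses from coefficients fitted on (Y1, X1)
    Y2_hat = X2B @ np.linalg.solve(X1B.T @ X1B, X1B.T @ Y1)
    R = Y2 - Y2_hat
    return np.trace(np.linalg.solve(Sigma3_hat, R.T @ R)) / (n1 * p) - 1
```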
We may regard the sample average of M over the 1,000 repetitions as the MSE of the best model. The results of the calculations are given in Table 3. "Variables" shows the 16 models; the numbers in braces are the explanatory variables used, where "1", "2", "3" and "4" denote age, weight, height and gender, respectively. All models contain a constant term; therefore the model {} has only the constant term. The frequencies show the number of times each model was selected as the best model in the 1,000 repetitions. $C_p^+$ selected the model {2} 781 times out of 1,000 repetitions. The model {2} was also selected frequently by $C_{p,\hat{\gamma}}^+$; however, $C_{p,\hat{\gamma}}^+$ selected the model {2, 4} 237 times, more often than $C_p^+$ did. The result for $C_p^{(d)}$ was different from those of $C_p^+$ and $C_{p,\hat{\gamma}}^+$: $C_p^{(d)}$ frequently selected models with many explanatory variables, such as {1, 2, 4}, {2, 3, 4} and {1, 2, 3, 4}, in 204, 251 and 214 repetitions, respectively. $C_p^{(d)}$ does not take the correlations within $Y_{(1)}$ into account, so this result indicates the importance of considering the correlations. The MSE of $C_{p,\hat{\gamma}}^+$ was 0.695, the smallest among the three statistics. From this, we conclude that $C_{p,\hat{\gamma}}^+$ performed better than $C_p^+$ and $C_p^{(d)}$.
5 Conclusion and Discussion
In this paper, we proposed three new $C_p$ statistics for selecting variables in the multivariate linear model with $p > n$. They are defined by replacing $S^{-1}$ with $S^+$, and $C_{p,\hat{\gamma}}^+$ is further constructed by adding bias correction terms evaluated from an asymptotic theory in which $p \to \infty$ and $n \to \infty$ simultaneously.
The simulation shows that $C_p^+$ and $C_{p,\hat{\gamma}}^+$ performed better than $C_p^{(d)}$; in particular, $C_{p,\hat{\gamma}}^+$ performed well in all cases. The model selection example using real data shows the importance of considering the correlations between the response variables, and that the performance of $C_{p,\hat{\gamma}}^+$ is better than that of $C_p^+$ and $C_p^{(d)}$. Hence, we recommend the use of $C_{p,\hat{\gamma}}^+$ for selecting variables in the multivariate linear regression model with $p > n$. $C_{p,\hat{\gamma}}^+$ could help to update statistical software for high dimensional data analysis.
References

1. van Dien, S.J., Iwatani, S., Usuda, Y., Matsui, K.: Theoretical analysis of amino acid-producing Escherichia coli using a stoichiometric model and multivariate linear regression. J. Biosci. Bioeng. 102, 34–40 (2006)
2. Dudoit, S., Fridlyand, J., Speed, T.P.: Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc. 97, 77–87 (2002)
3. Grete, H., Louis, J.P., Roger, W.J., Carter, J.K.: Exploring relationships in body dimensions. J. Statist. Educ. 11 (2003)
4. Mallows, C.L.: Some comments on Cp. Technometrics 15, 661–675 (1973)
5. Rao, C.R.: Linear Statistical Inference and Its Applications (paperback ed.). John Wiley & Sons, New York (2002)
6. Sârbu, C., Onişor, C., Posa, M., Kevresan, S., Kuhajda, K.: Modeling and prediction (correction) of partition coefficients of bile acids and their derivatives by multivariate regression methods. Talanta 75, 651–657 (2008)
7. Saxén, R., Sundell, J.: 137Cs in freshwater fish in Finland since 1986 – a statistical analysis with multivariate linear regression models. J. Environ. Radioactiv. 87, 62–76 (2006)
8. Skagerberg, B., Macgregor, J.F., Kiparissides, C.: Multivariate data analysis applied to low-density polyethylene reactors. Chemometr. Intell. Lab. 14, 341–356 (1992)
9. Srivastava, M.S.: Methods of Multivariate Statistics. John Wiley & Sons, New York (2002)
10. Srivastava, M.S.: Some tests concerning the covariance matrix in high dimensional data. J. Japan Statist. Soc. 35, 251–272 (2005)
11. Srivastava, M.S.: Multivariate theory for analyzing high dimensional data. J. Japan Statist. Soc. 37, 53–86 (2007)
12. Srivastava, M.S., Du, M.: A test for the mean vector with fewer observations than the dimension. J. Multivariate Anal. 99, 386–402 (2008)
13. Srivastava, M.S., Kubokawa, T.: Selection of variables in multivariate regression models for large dimensions. CIRJE Discussion Paper CIRJE-F-709, University of Tokyo, Japan (2010)
14. Timm, N.H.: Applied Multivariate Analysis. Springer, New York (2002)
15. Yoshimoto, A., Yanagihara, H., Ninomiya, Y.: Finding factors affecting a forest stand growth through multivariate linear modeling. J. Jpn. For. Soc. 87, 504–512 (2005) (in Japanese)
Appendix

Under the assumption that $H_q X = X$, $\mathrm{MSE} = pq$ holds. Hence, the bias of $C_p^+$ for the MSE can be rewritten as

$$\Delta = (n - q)p - E[\mathrm{tr}(S^+ V_q)]. \quad (11)$$

Note that $V_q = V + Y'(H - H_q)Y$, $S^+ = (n-k)V^+$ and $\mathrm{tr}(V^+V) = n - k$. Hence, we derive

$$\mathrm{tr}(S^+ V_q) = (n-k)^2 + (n-k)\,\mathrm{tr}\{V^+ Y'(H - H_q)Y\}.$$

Since $V^+$ and $Y'(H - H_q)Y$ are mutually independent, and $E[Y'(H - H_q)Y] = (k - q)\Sigma$ holds, the expectation of $\mathrm{tr}(S^+ V_q)$ is expressed as

$$E[\mathrm{tr}(S^+ V_q)] = (n-k)^2 + (n-k)(k-q)\,E[\mathrm{tr}(V^+ \Sigma)]. \quad (12)$$

Let L be an $(n-k) \times (n-k)$ diagonal matrix $L = \mathrm{diag}(\ell_1, \ldots, \ell_{n-k})$, where $\ell_1, \ldots, \ell_{n-k}$ are the positive eigenvalues of V, and let $Q = (Q_1, Q_2)$ be a $p \times p$ orthogonal matrix such that $Q'VQ = \mathrm{diag}(L, O_{p-n+k})$, where $Q_1$ and $Q_2$ are $p \times (n-k)$ and $p \times (p-n+k)$ matrices, respectively, and $O_{p-n+k}$ is a $(p-n+k) \times (p-n+k)$ matrix of zeros. Note that

$$V^+ = Q\,\mathrm{diag}(L^{-1}, O_{p-n+k})\,Q' = Q_1 L^{-1} Q_1'. \quad (13)$$

By using (13) and applying a simple transformation, we derive

$$E[\mathrm{tr}(V^+ \Sigma)] = E[\mathrm{tr}(L^{-1} Q_1' \Sigma Q_1)] = \frac{\alpha_2}{p\alpha_1^2}\,E\!\left[\mathrm{tr}\left\{(L/(p\alpha_1))^{-1}(\alpha_1/\alpha_2)\,Q_1' \Sigma Q_1\right\}\right]. \quad (14)$$

Since V is distributed according to the Wishart distribution, from [11] we obtain $\lim_{p \to \infty} L/(p\alpha_1) = I_{n-k}$ and $\lim_{p \to \infty} (\alpha_1/\alpha_2)\,Q_1' \Sigma Q_1 = I_{n-k}$ in probability. Therefore, we have $E[\mathrm{tr}(L^{-1} Q_1' \Sigma Q_1)] = (n-k)\gamma/p + o(p^{-1+\delta})$. By combining this result with (14), (12) is expanded as

$$E[\mathrm{tr}(S^+ V_q)] = (n-k)^2 + \frac{1}{p}(n-k)^2(k-q)\gamma + o(1). \quad (15)$$

Finally, substituting (15) into (11) yields (6).
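As a rough numerical illustration of expansion (15) (entirely our own sketch, with $\Xi = 0$ so that the mean term in $E[Y'(H - H_q)Y]$ vanishes, as it does under $H_q X = X$), one can compare the Monte Carlo average of $\mathrm{tr}(S^+V_q)$ with $(n-k)^2 + (n-k)^2(k-q)\gamma/p$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, q, p, reps = 20, 8, 4, 200, 500

X = np.hstack([np.ones((n, 1)), rng.uniform(-1, 1, (n, k - 1))])
Xq = X[:, :q]
H = X @ np.linalg.solve(X.T @ X, X.T)
Hq = Xq @ np.linalg.solve(Xq.T @ Xq, Xq.T)

# A diagonal Sigma with bounded eigenvalues; gamma = alpha2 / alpha1^2
d = 1 + rng.uniform(0, 1, p)
a1, a2 = d.sum() / p, (d**2).sum() / p
gamma = a2 / a1**2

vals = []
for _ in range(reps):
    Y = rng.standard_normal((n, p)) * np.sqrt(d)   # rows of Y ~ N(0, diag(d))
    V = Y.T @ (np.eye(n) - H) @ Y
    Vq = Y.T @ (np.eye(n) - Hq) @ Y
    S = V / (n - k)
    vals.append(np.trace(np.linalg.pinv(S) @ Vq))

m = n - k
# The two values should be of similar size; agreement improves as p, n grow
print(np.mean(vals), m**2 + m**2 * (k - q) * gamma / p)
```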