Multivariate linear regression with non-normal errors: a solution based on mixture models
Gabriele Soffritti · Giuliano Galimberti
Abstract In some situations, the distribution of the error terms of a multivariate linear regression model may depart from normality. This problem has been addressed, for example, by specifying a different parametric distribution family for the error terms, such as multivariate skewed and/or heavy-tailed distributions. A new solution is proposed, which is obtained by modelling the error term distribution through a finite mixture of multi-dimensional Gaussian components. The multivariate linear regression model is studied under this assumption. Identifiability conditions are proved and maximum likelihood estimation of the model parameters is performed using the EM algorithm. The number of mixture components is chosen through model selection criteria; when this number is equal to one, the proposal results in the classical approach. The performances of the proposed approach are evaluated through Monte Carlo experiments and compared to the ones of other approaches. In conclusion, the results obtained from the analysis of a real dataset are presented.
Keywords EM algorithm · Mixture model · Model selection criterion · Multivariate regression · Non-normal error distribution

THE FINAL, DEFINITIVE VERSION OF THIS MANUSCRIPT HAS BEEN PUBLISHED IN STATISTICS AND COMPUTING, VOLUME 21, NUMBER 4 (2011), 523-536, DOI: 10.1007/s11222-010-9190-3, URL: http://www.springerlink.com/content/d76070k44j2m7734/

G. Soffritti
Department of Statistics, University of Bologna
via delle Belle Arti 41 - 40126 Bologna, Italy
E-mail: [email protected]

G. Galimberti
Department of Statistics, University of Bologna
via delle Belle Arti 41 - 40126 Bologna, Italy
E-mail: [email protected]

1 Introduction
Multivariate regression analysis is a well-known technique that is widely used in many branches of science and engineering to predict values of D responses from a set of P regressors, where D ≥ 1 and P ≥ 1. This technique is based on a statistical model in which the error terms are assumed to be independent and identically distributed random variables, whose distribution is usually considered to be multivariate normal with a zero mean vector and a positive definite covariance matrix (see, for example, [46]). However, in many disciplines scientific research based on empirical studies or theoretical reasoning provided support for the presence of skewness or heavy tails in the distribution of the error terms. Examples can be found, amongst others, in [17] and [49]. Departures from normality may be caused also by the presence of outlying values in the responses. For these reasons, several researchers proposed to perform multivariate regression analysis using a model that assumes a different parametric distribution family for the error terms. Intense research focused on the broad class of elliptic distributions, and in particular on the multivariate t-distribution (see, for example, [52,48,24,34,15]). Further studies were performed within the elliptic distribution family in order to analyse more complex situations, such as data with missing values in the response variables [33], and with monotone missing response variables [6,7]. The same problem was also approached, within a Bayesian framework, by assuming a multivariate skewed and heavy-tailed distribution for the error terms [18,19].
A drawback of the just described solutions is that the observed data may be drawn from a distribution that does not belong to the pre-specified parametric distribution family. That is, such methods rely on models
that may be incorrectly specified. A convenient framework in which to model unknown distributional shapes is represented by finite mixture modelling. It is well known that, through an appropriate choice of its components, a finite mixture model is able to model quite complex distributions, and can handle situations where a single parametric family is unable to provide a satisfactory model for local variations in the observed data [40]. Thus, a solution to the described problem can be obtained by modelling the unknown distribution of the error terms using a finite mixture of D-dimensional components. In this paper, we investigate such a solution with reference to the multivariate linear regression model. In particular, we focus our attention on the special case of multivariate normal mixtures. The same idea was previously studied in the context of multiple linear regression analysis, that is for D = 1 [5]. This paper provides a generalization for D > 1 which is able to deal with the correlation structure among the D dependent variables. Besides the advantages deriving from the flexible modelling of the unknown error distribution, the solution investigated in this paper may be useful also to capture the effect of omitting relevant nominal regressors from the model. Such an omission could introduce a source of unobserved heterogeneity in the model, whose error terms will be distributed according to a mixture of K components, where K equals the number of categories obtained from the cross classification of the omitted nominal regressors. Thus, the proposed approach should be able to detect the presence of such a source of unobserved heterogeneity in the multivariate linear regression model. The proposed solution can also be recast as a particular mixture of K multivariate linear regression models with Gaussian error terms, in which regression coefficients of the components are assumed to be equal. 
Many contributions can be found in the literature about mixtures of linear regression models (see, for example, [30,14,50,29,27]), but most of them consider the special case D = 1. [32] introduced the R package flexmix, which also allows mixtures of multivariate linear regression models with Gaussian error terms to be fitted. The most recent version of this package [26] also considers models with constant parameters in the component-specific regressions. The main difference between these latter models and the one proposed in this paper concerns the assumptions on the error correlation structure within each component: our proposal allows the component-specific error terms to be correlated, while flexmix assumes independence. In [28] and [27] it is shown that mixtures of univariate and multivariate linear regression models are identifiable only under specific conditions. In this paper we prove that, when the regression coefficients of the components are assumed to be equal, the resulting mixture of multivariate linear regression models is always identifiable.
The remainder of the paper is organized as follows: in Section 2 we describe the multivariate linear regression model whose errors are assumed to be distributed as a finite mixture of K D-dimensional Gaussian components; we compare the proposed model with other mixtures of linear regression models with Gaussian error terms, and we address identifiability (the proof is provided in Appendix A) and maximum likelihood estimation using the EM algorithm (for the proof see the Online Resource); in Section 3 we illustrate the results of Monte Carlo experiments in which datasets were simulated from the proposed model and from models with other distributions for the error terms; finally, in Section 4 we present the results obtained from the analysis of a biometrical dataset concerning athletes at the Australian Institute of Sport [12].

2 Multivariate linear regression model with a Gaussian mixture for the error terms

2.1 The model
A multivariate linear regression model is generally based on the assumption

$$\mathbf{y}_i = \boldsymbol{\beta}_0 + \mathbf{B}' \mathbf{x}_i + \boldsymbol{\epsilon}_i, \tag{1}$$
where $\mathbf{y}_i = (y_{i1}, \ldots, y_{id}, \ldots, y_{iD})'$ and $\mathbf{x}_i$ are the D-dimensional vector of the response variables and the P-dimensional vector of the fixed regressor values for the ith sample unit, respectively; $\boldsymbol{\beta}_0$ is a D-dimensional vector containing the intercepts for the D responses; $\mathbf{B}$ is a matrix of dimension P × D whose (p, d)th element, $\beta_{pd}$, is the regression coefficient of the pth regressor on the dth response; finally, $\boldsymbol{\epsilon}_i$ denotes the D-dimensional random vector of the error terms corresponding to the ith observation. The classical multivariate linear regression model also assumes that $\boldsymbol{\epsilon}_i$, i = 1, . . . , I, are independent and identically distributed random vectors, whose distribution is assumed to be multivariate Gaussian with a D-dimensional zero mean vector and a positive definite covariance matrix $\boldsymbol{\Sigma}$ of dimension D × D, that is

$$\boldsymbol{\epsilon}_i \sim MVN(\mathbf{0}, \boldsymbol{\Sigma}). \tag{2}$$

The proposed multivariate linear regression model is based on assumptions (1) and

$$\boldsymbol{\epsilon}_i \sim \sum_{k=1}^{K} \pi_k \, MVN(\boldsymbol{\nu}_k, \boldsymbol{\Sigma}_k), \tag{3}$$
where the $\pi_k$'s are positive weights that sum to 1, and the $\boldsymbol{\nu}_k$'s are D-dimensional mean vectors that satisfy the constraint $\sum_{k=1}^{K} \pi_k \boldsymbol{\nu}_k = \mathbf{0}$.

2.2 Comparisons with other mixtures of regression models with Gaussian error terms
Given equations (1) and (3), the probability density function of $\mathbf{y}_i$ is

$$\sum_{k=1}^{K} \pi_k \, \phi_D(\mathbf{y}_i; \boldsymbol{\mu}_{ik}, \boldsymbol{\Sigma}_k), \qquad \boldsymbol{\mu}_{ik} = \boldsymbol{\nu}_k + \boldsymbol{\beta}_0 + \mathbf{B}' \mathbf{x}_i, \tag{4}$$

where $\phi_D(\mathbf{y}_i; \boldsymbol{\mu}_{ik}, \boldsymbol{\Sigma}_k)$ is the density of the D-dimensional Gaussian distribution $MVN(\boldsymbol{\mu}_{ik}, \boldsymbol{\Sigma}_k)$. According to (4), the proposed model can also be seen as a mixture of K restricted multivariate linear regression models with Gaussian error terms, whose generic component takes the form

$$\mathbf{y}_i = \boldsymbol{\lambda}_k + \mathbf{B}' \mathbf{x}_i + \tilde{\boldsymbol{\epsilon}}_{ik}, \qquad \tilde{\boldsymbol{\epsilon}}_{ik} \sim MVN(\mathbf{0}, \boldsymbol{\Sigma}_k), \tag{5}$$

where $\boldsymbol{\lambda}_k = \boldsymbol{\beta}_0 + \boldsymbol{\nu}_k$, for k = 1, . . . , K. Thus, the components have different intercepts for the D responses and different covariance matrices for the error terms, but the K matrices of the regression coefficients are restricted to be equal.
In the literature on mixtures of linear regression models (see, for example, [30,14,50,29,27]) many contributions focus on the special case D = 1. In this situation the probability density function of $y_i$ is

$$\sum_{k=1}^{K} \pi_k \, \phi_1(y_i; \mu_{ik}, \sigma_k^2), \qquad \mu_{ik} = \lambda_k + \boldsymbol{\beta}_k' \mathbf{x}_i, \tag{6}$$

where $\phi_1(y_i; \mu_{ik}, \sigma_k^2)$ is the density of the uni-dimensional Gaussian distribution $N(\mu_{ik}, \sigma_k^2)$, and $\lambda_k$, $\boldsymbol{\beta}_k$ are the intercept and the vector of the P regression coefficients for the kth component, respectively.
Mixtures of linear regression models with Gaussian error terms for multivariate responses can be fitted, for example, through the R package flexmix [32]. This package considers models which assume independence among the D response variables within each component:

$$\sum_{k=1}^{K} \pi_k \prod_{d=1}^{D} \phi_1(y_{id}; \mu_{ikd}, \sigma_{kd}^2), \qquad \mu_{ikd} = \lambda_{kd} + \boldsymbol{\beta}_{kd}' \mathbf{x}_i. \tag{7}$$

Recently, new features have been introduced in this package [26]. In particular, some of the component-specific parameters can be either restricted to be equal over all components or to vary between groups of components. In the special case $\boldsymbol{\beta}_{kd} = \boldsymbol{\beta}_d$ for d = 1, . . . , D, model (7) can be seen as a multivariate linear regression model whose error terms are modelled using a mixture of Gaussian components with diagonal covariance matrices:

$$\sum_{k=1}^{K} \pi_k \prod_{d=1}^{D} \phi_1(y_{id}; \mu_{ikd}, \sigma_{kd}^2), \qquad \mu_{ikd} = \lambda_{kd} + \boldsymbol{\beta}_d' \mathbf{x}_i. \tag{8}$$

Thus, this specific model differs from the one proposed in this paper as the latter allows the component-specific error terms to be correlated.
Introducing restrictions on the dependence structure of the multivariate error terms that are stronger than the ones implied by model (4) also leads to the following model:

$$\prod_{d=1}^{D} \left[ \sum_{k=1}^{K_d} \pi_{kd} \, \phi_1(y_{id}; \mu_{ikd}, \sigma_{kd}^2) \right], \qquad \mu_{ikd} = \lambda_{kd} + \boldsymbol{\beta}_d' \mathbf{x}_i. \tag{9}$$

Model (9) is obtained by applying the univariate model described in [5] to each dependent variable and by assuming independence between them. It is possible to show that model (9) corresponds to model (1) under the following assumption for the error term:

$$\boldsymbol{\epsilon}_i \sim \prod_{d=1}^{D} \left[ \sum_{k=1}^{K_d} \pi_{kd} \, N(\nu_{kd}, \sigma_{kd}^2) \right], \tag{10}$$

where $\nu_{kd} = \lambda_{kd} - \beta_{0d}$, and $\sum_{k=1}^{K_d} \pi_{kd} \nu_{kd} = 0$.
Similarly, model (8) can be seen as model (1) under the following assumption:

$$\boldsymbol{\epsilon}_i \sim \sum_{k=1}^{K} \pi_k \left[ \prod_{d=1}^{D} N(\nu_{kd}, \sigma_{kd}^2) \right], \tag{11}$$

where $\nu_{kd} = \lambda_{kd} - \beta_{0d}$, and $\sum_{k=1}^{K} \pi_k \nu_{kd} = 0$.
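The link between the component intercepts in (5) and the zero-mean constraint on the component means in (3) can be checked numerically. The following Python sketch is only illustrative: the helper name `centre_intercepts` is hypothetical, and the numbers are the ones used later in the second Monte Carlo experiment (β0 = (2, 4)', ν1 = (2, 2)', ν2 = (−2, −2)', π1 = π2 = 0.5). It recovers β0 as the π-weighted mean of the λk and the νk as deviations from it.

```python
def centre_intercepts(weights, lambdas):
    """Split component intercepts lambda_k into beta_0 and nu_k.

    beta_0 is the pi-weighted mean of the lambda_k, so the deviations
    nu_k = lambda_k - beta_0 satisfy sum_k pi_k * nu_k = 0.
    """
    dim = len(lambdas[0])
    beta0 = [sum(w * lam[d] for w, lam in zip(weights, lambdas)) for d in range(dim)]
    nus = [[lam[d] - beta0[d] for d in range(dim)] for lam in lambdas]
    return beta0, nus


# lambda_k = beta_0 + nu_k with beta_0 = (2, 4), nu_1 = (2, 2), nu_2 = (-2, -2)
weights = [0.5, 0.5]
lambdas = [[4.0, 6.0], [0.0, 2.0]]
beta0, nus = centre_intercepts(weights, lambdas)
print(beta0)  # [2.0, 4.0]
print(nus)    # [[2.0, 2.0], [-2.0, -2.0]]
```

The same decomposition applies to estimated intercepts, which is why only the λk and B need to be estimated directly.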
2.3 Model identifiability and parameter estimation
As far as identifiability of the model defined by equations (1) and (3) is concerned, the I × P matrix $\mathbf{x}$ with rows $\mathbf{x}_i$ for i = 1, . . . , I, needs to have full column rank (otherwise even a single regression model would not be identifiable). In [28] and [27] it is proved that mixtures of univariate and multivariate linear regression models are identifiable only under specific additional conditions. Appendix A shows that, given the previously-described restriction on the regression coefficients, model (4) is always identifiable. However, it should be noted that model (4) is invariant under permutations of the labels of the K components. This is a common problem with all mixture models (see, for example, [40]).
Maximum likelihood (ML) estimation of the model parameters may be carried out through the well-known Expectation-Maximization (EM) algorithm [13]. This is a general-purpose algorithm for ML estimation in a wide variety of situations best described as incomplete-data problems. A comprehensive account of the EM algorithm, including the special case of parameter estimation for mixture models, can be found in [39]. In the following, we give the sketch of the solution obtained using this algorithm to estimate the parameters of the proposed model. A detailed proof of the result is reported in the Online Resource. Given a random sample of I observations drawn from the proposed model, the model log-likelihood is

$$l = \sum_{i=1}^{I} \log \left( \sum_{k=1}^{K} \pi_k \, \phi_D(\mathbf{y}_i; \boldsymbol{\mu}_{ik}, \boldsymbol{\Sigma}_k) \right). \tag{12}$$
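As a concrete reading of (12), the sketch below evaluates the log-likelihood for D = 2 responses in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `log_mvn2` and `log_likelihood` and the small data values are hypothetical, the 2 × 2 inverse is written in closed form, and a log-sum-exp step is added for numerical stability.

```python
import math


def log_mvn2(y, mean, cov):
    """Log-density of a bivariate Gaussian (closed-form 2 x 2 inverse)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    r0, r1 = y[0] - mean[0], y[1] - mean[1]
    quad = (d * r0 * r0 - (b + c) * r0 * r1 + a * r1 * r1) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def log_likelihood(ys, xs, weights, lambdas, B, covs):
    """Log-likelihood (12) with mu_ik = lambda_k + B' x_i and D = 2."""
    total = 0.0
    for y, x in zip(ys, xs):
        # B is P x D, so (B' x)_d = sum_p B[p][d] * x[p]
        bx = [sum(B[p][d] * x[p] for p in range(len(x))) for d in range(2)]
        terms = [
            math.log(w) + log_mvn2(y, [lam[0] + bx[0], lam[1] + bx[1]], cov)
            for w, lam, cov in zip(weights, lambdas, covs)
        ]
        top = max(terms)  # log-sum-exp guards against underflow
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


# Hypothetical toy data: two units, two regressors
ys = [[2.5, 3.0], [1.0, 5.0]]
xs = [[0.1, -0.2], [0.4, 0.3]]
B = [[2.0, -3.0], [-3.0, 2.0]]
cov = [[1.2, 0.8], [0.8, 1.0]]
one = log_likelihood(ys, xs, [1.0], [[2.0, 4.0]], B, [cov])
two = log_likelihood(ys, xs, [0.5, 0.5], [[2.0, 4.0]] * 2, B, [cov, cov])
print(abs(one - two) < 1e-12)  # two identical components collapse to K = 1
```

The final check reflects the remark in the abstract that the model with K = 1 coincides with the classical Gaussian regression model.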
The problem of maximizing l with respect to the model parameters can be recast as a maximization problem with incomplete data, and can be solved in the EM framework. Let $\mathbf{y}_1, \ldots, \mathbf{y}_I$ be the vectors with the values of the D response variables for the I sample units, and let $\mathbf{y}$ be the matrix of dimension I × D obtained by combining $\mathbf{y}_1', \ldots, \mathbf{y}_I'$ by rows. Let $z_{ik}$ be a binary variable equal to 1 when the ith observation has been generated from the kth component, and 0 otherwise, for k = 1, . . . , K. Thus, $\sum_{k=1}^{K} z_{ik} = 1$. Furthermore, let $\mathbf{z}_{i+}$ be the K-dimensional vector whose kth element is $z_{ik}$. Since the vectors $\mathbf{z}_{i+}$ are unknown, the observed data $\mathbf{y}$ can be considered incomplete, and (12) is the incomplete-data log-likelihood. If we know both the observed data and the component-label vectors $\mathbf{z}_{i+}$, we can obtain the so-called complete log-likelihood of the model. For random samples, it is appropriate to assume that the component-label vectors $\mathbf{z}_{1+}, \ldots, \mathbf{z}_{I+}$ are observed values of I independent and identically distributed random vectors whose unconditional distribution is multinomial consisting of one draw on K categories with probabilities $\pi_1, \ldots, \pi_K$. The complete log-likelihood of the model is equal to

$$l_c = \sum_{i=1}^{I} \sum_{k=1}^{K} z_{ik} \left[ \log \pi_k + \log \phi_D(\mathbf{y}_i; \boldsymbol{\mu}_{ik}, \boldsymbol{\Sigma}_k) \right].$$

Up to an additive constant, $l_c = l_{c1} + l_{c2}$, where

$$l_{c1} = \sum_{k=1}^{K} z_{.k} \log \pi_k, \tag{13}$$

$$l_{c2} = -\frac{1}{2} \sum_{k=1}^{K} z_{.k} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} \sum_{i=1}^{I} \sum_{k=1}^{K} z_{ik} (\mathbf{y}_i - \boldsymbol{\mu}_{ik})' \boldsymbol{\Sigma}_k^{-1} (\mathbf{y}_i - \boldsymbol{\mu}_{ik}), \tag{14}$$
with $z_{.k} = \sum_{i=1}^{I} z_{ik}$, and $|\mathbf{A}|$ denotes the determinant of matrix $\mathbf{A}$. Function $l_{c1}$ depends only on the parameters $\pi_k$'s and can be maximized simply by letting $\pi_k$ equal to $\hat{\pi}_k = z_{.k}/I$, k = 1, . . . , K. In order to show how $l_{c2}$ can be maximized, it is convenient to express such a quantity in the matrix notation obtained as follows. Let $\boldsymbol{\lambda}_k = \boldsymbol{\beta}_0 + \boldsymbol{\nu}_k$ for k = 1, . . . , K, and let $\boldsymbol{\Gamma}$ be the matrix of dimensions (K + P) × D formed by combining the vectors $\boldsymbol{\lambda}_1', \ldots, \boldsymbol{\lambda}_K'$ and the matrix $\mathbf{B}$ by rows. Moreover, let $\mathbf{z}_{+k} = (z_{1k}, \ldots, z_{Ik})'$, and let $\boldsymbol{\mu}_k$ be the matrix of dimension I × D obtained by combining $\boldsymbol{\mu}_{1k}', \ldots, \boldsymbol{\mu}_{Ik}'$ by rows. Note that the latter may be expressed as $\boldsymbol{\mu}_k = \mathbf{X}_k \boldsymbol{\Gamma}$, where $\mathbf{X}_k = (\mathbf{O}_k \; \mathbf{x})$, $\mathbf{O}_k$ is a matrix of dimension I × K with all the elements equal to 0 apart from those of column k which are equal to 1. As a consequence of these relations, it is possible to write

$$l_{c2} = -\frac{1}{2} \sum_{k=1}^{K} z_{.k} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} \sum_{k=1}^{K} \mathrm{tr}(\boldsymbol{\Sigma}_k^{-1} \mathbf{D}_k), \tag{15}$$
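where $\mathbf{D}_k = (\mathbf{y} - \mathbf{X}_k \boldsymbol{\Gamma})' \mathrm{diag}(\mathbf{z}_{+k}) (\mathbf{y} - \mathbf{X}_k \boldsymbol{\Gamma})$, and $\mathrm{diag}(\mathbf{z}_{+k})$ is the I × I diagonal matrix whose main diagonal equals vector $\mathbf{z}_{+k}$. Function $l_{c2}$ defined by equation (15) depends on the parameters $\boldsymbol{\Gamma}$ and $\boldsymbol{\Sigma}_k$, k = 1, . . . , K. It can be maximized by evaluating its first differential, by setting the first derivatives computed with respect to all the parameters equal to 0, and by solving the resulting equations (for the details see the Online Resource).
Provided that $\mathbf{M} = \sum_{k=1}^{K} \boldsymbol{\Sigma}_k^{-1} \otimes [\mathbf{X}_k' \mathrm{diag}(\mathbf{z}_{+k}) \mathbf{X}_k]$ is non-singular, where ⊗ is the Kronecker product operator, the solutions are

$$\mathrm{vec}(\hat{\boldsymbol{\Gamma}}) = \mathbf{M}^{-1} \mathbf{N} \, \mathrm{vec}(\mathbf{y}), \tag{16}$$

$$\hat{\boldsymbol{\Sigma}}_k = z_{.k}^{-1} \mathbf{D}_k, \qquad k = 1, \ldots, K, \tag{17}$$

where $\mathbf{N} = \sum_{k=1}^{K} \boldsymbol{\Sigma}_k^{-1} \otimes [\mathbf{X}_k' \mathrm{diag}(\mathbf{z}_{+k})]$ and $\mathrm{vec}(\mathbf{A})$ denotes the vector formed by stacking columns of the matrix $\mathbf{A}$, one underneath the other.
From $\hat{\boldsymbol{\Gamma}}$ we directly obtain $\hat{\boldsymbol{\lambda}}_1, \ldots, \hat{\boldsymbol{\lambda}}_K$ and $\hat{\mathbf{B}}$. We may also compute $\hat{\boldsymbol{\mu}}_{ik}$ as $\hat{\boldsymbol{\lambda}}_k + \hat{\mathbf{B}}' \mathbf{x}_i$ for k = 1, . . . , K and i = 1, . . . , I. Furthermore, $\hat{\boldsymbol{\beta}}_0$ can be obtained as $\sum_{k=1}^{K} \hat{\pi}_k \hat{\boldsymbol{\lambda}}_k$, and $\hat{\boldsymbol{\nu}}_k$ as $\hat{\boldsymbol{\lambda}}_k - \hat{\boldsymbol{\beta}}_0$ for k = 1, . . . , K. Thus, the estimates $\hat{\boldsymbol{\nu}}_k$ satisfy the constraint $\sum_{k=1}^{K} \hat{\pi}_k \hat{\boldsymbol{\nu}}_k = \mathbf{0}$.
As equation (16) depends on the $\boldsymbol{\Sigma}_k$'s, and equation (17) depends on $\boldsymbol{\Gamma}$, the maximization of function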
$l_{c2}$ with respect to such parameters can be obtained by iteratively updating the estimate of $\boldsymbol{\Gamma}$ given an estimate of the $\boldsymbol{\Sigma}_k$'s, and vice versa. Since the $z_{ik}$'s are missing, in the EM algorithm they are substituted with their conditional expected values. More specifically, the EM algorithm consists of iterating the following two steps until convergence:

E step: on the basis of the current estimate of the model parameters $\hat{\boldsymbol{\Gamma}}^{(r)}, \hat{\pi}_k^{(r)}, \hat{\boldsymbol{\Sigma}}_k^{(r)}, \hat{\boldsymbol{\mu}}_{ik}^{(r)}$ for k = 1, . . . , K and i = 1, . . . , I, the expected value of the complete log-likelihood given the observed data, $E(l_c | \mathbf{y})$, is computed. In practice, this consists of substituting any $z_{ik}$ with its conditional expected value $E(z_{ik} | \mathbf{y})$, which is equal to

$$p_{ik}^{(r)} = \frac{\hat{\pi}_k^{(r)} \phi_D\left(\mathbf{y}_i; \hat{\boldsymbol{\mu}}_{ik}^{(r)}, \hat{\boldsymbol{\Sigma}}_k^{(r)}\right)}{\sum_{h=1}^{K} \hat{\pi}_h^{(r)} \phi_D\left(\mathbf{y}_i; \hat{\boldsymbol{\mu}}_{ih}^{(r)}, \hat{\boldsymbol{\Sigma}}_h^{(r)}\right)}.$$

M step: $E(l_c | \mathbf{y})$ is maximized with respect to the model parameters as follows: the estimate of $\pi_k$ is updated by computing $\hat{\pi}_k^{(r+1)} = \frac{1}{I} \sum_{i=1}^{I} p_{ik}^{(r)}$ for k = 1, . . . , K; the estimates $\hat{\boldsymbol{\Gamma}}^{(r+1)}$ and $\hat{\boldsymbol{\Sigma}}_k^{(r+1)}$ for k = 1, . . . , K are obtained by iteratively updating $\hat{\boldsymbol{\Gamma}}_j^{(r+1)}$ and $\hat{\boldsymbol{\Sigma}}_{k,j}^{(r+1)}$ until convergence. More specifically, let $\hat{\boldsymbol{\Sigma}}_{k,0}^{(r+1)} = \hat{\boldsymbol{\Sigma}}_k^{(r)}$,

$$\mathbf{M}_0^{(r+1)} = \sum_{k=1}^{K} \left(\hat{\boldsymbol{\Sigma}}_{k,0}^{(r+1)}\right)^{-1} \otimes \left[\mathbf{X}_k' \, \mathrm{diag}\left(\mathbf{p}_{+k}^{(r)}\right) \mathbf{X}_k\right]$$

and

$$\mathbf{N}_0^{(r+1)} = \sum_{k=1}^{K} \left(\hat{\boldsymbol{\Sigma}}_{k,0}^{(r+1)}\right)^{-1} \otimes \left[\mathbf{X}_k' \, \mathrm{diag}\left(\mathbf{p}_{+k}^{(r)}\right)\right].$$

Then,

$$\mathrm{vec}\left(\hat{\boldsymbol{\Gamma}}_{j+1}^{(r+1)}\right) = \left(\mathbf{M}_j^{(r+1)}\right)^{-1} \mathbf{N}_j^{(r+1)} \mathrm{vec}(\mathbf{y}).$$

Finally,

$$\hat{\boldsymbol{\Sigma}}_{k,j+1}^{(r+1)} = \left(p_{.k}^{(r)}\right)^{-1} \mathbf{R}_{j+1}^{(r+1)\prime} \, \mathrm{diag}\left(\mathbf{p}_{+k}^{(r)}\right) \mathbf{R}_{j+1}^{(r+1)},$$

where $\mathbf{R}_{j+1}^{(r+1)} = \mathbf{y} - \mathbf{X}_k \hat{\boldsymbol{\Gamma}}_{j+1}^{(r+1)}$.
The iterative estimation process requires a set of starting values for the model parameters. In general, different starting strategies (and also different stopping rules) can lead to quite different estimates (see [45] for a demonstration in the context of mixtures of exponential components). Furthermore, a poor choice of the starting values may make the convergence of the EM algorithm very slow. The log-likelihood may also have different local maxima. Thus, the choice of a good starting set is crucial in the parameter estimation via the EM algorithm. For the proposed model we suggest the following solutions: For $\mathbf{B}$ simply compute the ML estimate under assumption (2). For $\boldsymbol{\lambda}_k$ use $\tilde{\boldsymbol{\beta}}_0 + \tilde{\boldsymbol{\nu}}_k$ (k = 1, . . . , K), where $\tilde{\boldsymbol{\beta}}_0$ is the estimate of $\boldsymbol{\beta}_0$ under assumption (2), and $\tilde{\boldsymbol{\nu}}_k$ is the estimate of $\boldsymbol{\nu}_k$ obtained by fitting the mixture model (3) to the residuals computed under assumption (2), using for example the R package mclust [23]. This also provides the starting estimates of the parameters $\boldsymbol{\Sigma}_k$ and $\pi_k$, k = 1, . . . , K. The results described in Section 3 seem to suggest that this strategy
for the selection of the starting values is reasonable. However, since the convergence of the EM algorithm to the global maximum cannot be guaranteed, different initializations should be considered and the solution with the largest likelihood should be chosen. For example, different starting values can be obtained by applying the proposed strategy to different random subsamples of the data or by considering some other random procedures.
As far as the choice of the unknown value of K is concerned, it is known that this is a difficult problem that can be tackled with different methods (see, for example, [40]). We suggest resorting to model-selection techniques, such as Akaike's information criterion (AIC), the AIC3, the consistent Akaike information criterion (CAIC), the Bayesian information criterion (BIC) and the ICL criterion [1,9,10,44,8].

3 Experimental results from simulated data
The proposed approach was evaluated through Monte Carlo experiments in which artificial datasets were generated from model (1) using the statistical software system R [41]. As far as the generation of the error terms is concerned, we considered three different probability models: the Gaussian model, a mixture of two Gaussian models and the skew-Gaussian model. In order to compute the estimates of the model parameters provided by the proposed approach, we implemented the EM algorithm described above in the system R. The R package mclust02 [22] was used to obtain the starting values for the parameters of each Gaussian component of the mixture model. No restrictions were imposed on the covariance matrices of such distributions throughout the whole simulation study. Parameters of the multivariate regression model with K components for the error terms were estimated, for K from 1 to Kmax (the values of Kmax used in the experiments are described in the following Subsections).
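To make the E and M steps of Section 2.3 concrete, the following Python sketch runs them in the simplest special case D = 1 and P = 1. It is not the authors' R implementation. The function name `em_univariate`, the toy data and the starting values are hypothetical. In this special case the linear system (16) reduces, by our own derivation, to a precision-weighted least-squares slope with component-specific weighted means. The inner alternation between (16) and (17) is run for a fixed number of passes rather than until a parameter-distance criterion is met.

```python
import math
import random


def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_univariate(x, y, pi, lam, beta, var, max_iter=300, tol=0.0005, inner=20):
    """EM sketch for y_i = lam_k + beta * x_i + e_ik, e_ik ~ N(0, var_k)."""
    n, K = len(y), len(pi)
    path = []  # observed-data log-likelihood evaluated at each E step
    for _outer in range(max_iter):
        # E step: posterior probabilities p_ik of component membership
        post, loglik = [], 0.0
        for xi, yi in zip(x, y):
            dens = [pi[k] * normal_pdf(yi, lam[k] + beta * xi, var[k]) for k in range(K)]
            total = sum(dens)
            loglik += math.log(total)
            post.append([d / total for d in dens])
        path.append(loglik)
        if len(path) > 1 and path[-1] - path[-2] < tol:
            break
        # M step: update the weights, then alternate (lam, beta) and var_k
        p_dot = [sum(r[k] for r in post) for k in range(K)]
        pi = [p / n for p in p_dot]
        xbar = [sum(r[k] * xi for r, xi in zip(post, x)) / p_dot[k] for k in range(K)]
        ybar = [sum(r[k] * yi for r, yi in zip(post, y)) / p_dot[k] for k in range(K)]
        sxy = [sum(r[k] * (xi - xbar[k]) * (yi - ybar[k]) for r, xi, yi in zip(post, x, y))
               for k in range(K)]
        sxx = [sum(r[k] * (xi - xbar[k]) ** 2 for r, xi in zip(post, x)) for k in range(K)]
        for _inner in range(inner):
            # analogue of (16): common slope, component-specific intercepts
            beta = sum(sxy[k] / var[k] for k in range(K)) / sum(sxx[k] / var[k] for k in range(K))
            lam = [ybar[k] - beta * xbar[k] for k in range(K)]
            # analogue of (17): weighted residual variances
            var = [sum(r[k] * (yi - lam[k] - beta * xi) ** 2 for r, xi, yi in zip(post, x, y))
                   / p_dot[k] for k in range(K)]
    return pi, lam, beta, var, path


# Hypothetical toy data: intercepts 3 and -1, common slope 3, error sd 0.5
random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(300)]
y = [(3.0 if random.random() < 0.5 else -1.0) + 3.0 * xi + random.gauss(0.0, 0.5) for xi in x]
pi, lam, beta, var, path = em_univariate(x, y, [0.5, 0.5], [2.0, 0.0], 2.5, [1.0, 1.0])
print(round(beta, 2), [round(v, 2) for v in lam])
```

The default tolerance and iteration cap mirror the values reported in this section for the outer EM loop. In practice the starting values would come from a Gaussian fit and a mixture fitted to its residuals, as suggested in Section 2.3.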
We used the following convergence criteria: the increment in the log-likelihood value between two consecutive steps lower than 0.0005 for the EM algorithm (with a maximum number of iterations equal to 300); the Euclidean distance between two consecutive model parameter estimates, divided by the total number of estimated parameters, lower than 0.0005 for the M step within the EM algorithm (with a maximum number of iterations equal to 100). In order to choose the best model among the fitted ones we computed the BIC criterion: $BIC_M = 2 \max[\log L_M] - npar_M \log(I)$, where $\max[\log L_M]$ is the maximum of the log-likelihood of a model M for the given sample of I units and $npar_M$ is the number of independent parameters to be estimated for that model. This criterion provides an approximation to the Bayes factor and enables us to trade off the fit and parsimony of a given model: the greater the BIC, the better the model [44]. It also performed well in a number of applications (see, for example, [20,21,47]). Only parameter estimates of this best model were examined in the evaluation of the performance of the proposed approach. Using the values of the parameters described in the following Subsections and each distribution of the error term described above, we generated 100 samples of three different sizes (I = 100, 200, 300) from the resulting model.

Table 1 Frequency distributions (over 100 samples) of the values of K selected using the proposed procedure in the first four experiments, for samples of size 100, 200 and 300.

Experiment   K   I = 100   I = 200   I = 300
1st          1      94        98       100
1st          2       6         2         0
2nd          2      94       100        98
2nd          3       4         0         2
2nd          4       2         0         0
3rd          1      73        17         0
3rd          2      26        79        98
3rd          3       1         4         2
4th          1      57         8         0
4th          2      40        89        87
4th          3       3         3        13

3.1 Errors generated from a Gaussian distribution
In the first Monte Carlo experiment the datasets were generated from the model defined by equations (1) and (2) with the following values of the model parameters:

$$\boldsymbol{\beta}_0 = (2, 4)', \tag{18}$$

$$\mathbf{B}' = \begin{pmatrix} 2 & -3 \\ -3 & 2 \end{pmatrix}, \tag{19}$$

$$\boldsymbol{\Sigma} = \begin{pmatrix} 1.2 & 0.8 \\ 0.8 & 1.0 \end{pmatrix}. \tag{20}$$
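A dataset of this kind can be simulated with a few lines of Python. The sketch below is only an illustration: the function name `simulate_gaussian_regression` is hypothetical, and the regressors are drawn from independent standard Gaussians, which is our assumption since the regressor design is not restated here. The error terms are obtained from the Cholesky factor of Σ in (20).

```python
import math
import random


def simulate_gaussian_regression(n, beta0, B, cov, rng):
    """Draw (x_i, y_i, e_i) from y_i = beta0 + B' x_i + e_i, e_i ~ MVN(0, cov), D = P = 2."""
    # Cholesky factor of the 2 x 2 covariance matrix: cov = L L'
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 ** 2)
    xs, ys, errors = [], [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]  # assumed regressor design
        u1, u2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        e = [l11 * u1, l21 * u1 + l22 * u2]
        # B is P x D, so (B' x)_d = sum_p B[p][d] * x[p]
        y = [beta0[d] + sum(B[p][d] * x[p] for p in range(2)) + e[d] for d in range(2)]
        xs.append(x)
        ys.append(y)
        errors.append(e)
    return xs, ys, errors


rng = random.Random(2011)
beta0 = [2.0, 4.0]
B = [[2.0, -3.0], [-3.0, 2.0]]  # symmetric here, so B and B' coincide
cov = [[1.2, 0.8], [0.8, 1.0]]
xs, ys, errors = simulate_gaussian_regression(100, beta0, B, cov, rng)
```

Replacing the single Gaussian draw for the error by a draw from one of K components, chosen with probabilities πk, gives data from the mixture model (3).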
This situation is a special case of the model defined by equations (1) and (3), where K = 1. In this experiment we set Kmax = 4. The performance of the proposal described in Section 2 was evaluated first of all on the basis of its ability to select the correct number of components (see Tab. 1, first experiment). The percentage of samples for which the value of K was correctly chosen was high for every sample size and reached 100% when I = 300. In order to study the effect on inference due to the choice of the model, we numerically compared the bias and the mean square error (MSE) of our estimator to the ones of the ML estimators of the regression coefficients calculated under assumption (2) (see Tab. 5 in the Online Resource). Since in most samples the BIC allowed the detection of the correct value of K, the two estimation strategies generally led to very similar results in terms of both the bias and MSE. However, an important difference between these strategies is that the one based on assumption (3) does not require a priori knowledge of the value of K.
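The BIC-based choice of K can be sketched as follows. The parameter count used here is our own reading of the model with unrestricted covariance matrices (K − 1 free weights, K·D intercepts, P·D regression coefficients and K·D(D + 1)/2 covariance parameters) and is not quoted from the paper. The function names and the log-likelihood values in the example are hypothetical.

```python
import math


def n_parameters(K, D, P):
    """Free parameters: K - 1 weights, K * D intercepts, P * D slopes, K covariances."""
    return (K - 1) + K * D + P * D + K * D * (D + 1) // 2


def bic(max_loglik, K, D, P, n_units):
    """BIC_M = 2 max[log L_M] - npar_M log(I); larger values are better."""
    return 2.0 * max_loglik - n_parameters(K, D, P) * math.log(n_units)


def select_k(max_logliks, D, P, n_units):
    """Return the K (counted from 1) with the largest BIC among the fitted models."""
    scores = [bic(ll, k, D, P, n_units) for k, ll in enumerate(max_logliks, start=1)]
    return scores.index(max(scores)) + 1


# Hypothetical maximised log-likelihoods for K = 1, ..., 4 with D = P = 2, I = 100
print(n_parameters(1, 2, 2))                                 # 9
print(select_k([-310.0, -305.0, -303.5, -303.0], 2, 2, 100))  # 1
```

With these made-up values the small gains in log-likelihood for K > 1 do not offset the extra six parameters per added component, so K = 1 is retained.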
3.2 Errors generated from a mixture of Gaussian distributions
The second Monte Carlo experiment was performed using the values of the parameters $\boldsymbol{\beta}_0$ and $\mathbf{B}$ defined by equations (18) and (19), but with error terms generated from model (3) with K = 2, $\pi_1 = \pi_2 = 0.5$, $\boldsymbol{\nu}_1 = (2, 2)'$, $\boldsymbol{\nu}_2 = (-2, -2)'$,

$$\boldsymbol{\Sigma}_1 = \begin{pmatrix} 1.0 & -0.6 \\ -0.6 & 1.5 \end{pmatrix}, \qquad \boldsymbol{\Sigma}_2 = \begin{pmatrix} 1.2 & 0.8 \\ 0.8 & 1.0 \end{pmatrix}.$$

In this setting, since the error terms can be drawn from two different distributions, the data are characterized by an unobserved source of heterogeneity: an unobserved dichotomous variable partitions the units into K = 2 groups and affects the D dependent variables by shifting the intercept vector of the model. For any sample generated in this experiment, we computed the MLE under both the assumption (2) and the mixture assumption (3) for K from 1 to Kmax = 4. We also estimated model (1) under assumptions (10) for K1 and K2 from 1 to Kmax = 6 and (11) for K from 1 to Kmax = 7. The percentage of successes in selecting the correct value of K using the proposed strategy is high for every sample size (see Tab. 1). The performance of our strategy was also evaluated with respect to its ability to correctly classify each error term, that is, to recover the unobserved source of heterogeneity. This was done by computing the corrected Rand index CR [43,31] between the true classification of the I error terms generated from the mixture model considered in this experiment and the classification obtained from the best mixture model estimated according to the proposed strategy. Tab. 2 reports the means and standard deviations of the CR index over the one hundred samples for the three sample sizes. The proposed approach allowed an almost perfect reconstruction of the unknown classification of the error terms for each sample size. The comparison between the results obtained using our strategy and the ones obtained under assumption (10) highlights the effect of ignoring the dependence structure of the error terms.

Table 2 Means and standard deviations of the CR index (over 100 samples) computed between the true partition of the error terms and the ones estimated using the proposed procedure (2nd experiment).

CR index             I = 100   I = 200   I = 300
Mean                  0.9741    0.9948    0.9938
Standard deviation    0.0477    0.0119    0.0146

Table 3 Frequency distributions (over 100 samples) of the values of K1 and K2 selected under assumption (10) (2nd experiment).

             K1                                  K2
Value   I = 100   I = 200   I = 300     I = 100   I = 200   I = 300
1          31        23        29          37        31        25
2          39        24        16          47        39        40
3          23        23        22          13        21        16
4           6        18        16           3         7         6
5           1        11        12           0         2         6
6           0         1         5           0         0         7

Table 4 Means and standard deviations of the CR index (over 100 samples) computed between the true partition of the error terms and the ones estimated under assumption (10) for each dependent variable (2nd experiment).

                     Y1                               Y2
CR index             I = 100   I = 200   I = 300     I = 100   I = 200   I = 300
Mean                  0.5659    0.6053    0.5575      0.4843    0.5570    0.5893
Standard deviation    0.3953    0.3489    0.3683      0.3879    0.3876    0.3520

As far as the ability of correctly classifying the error terms is concerned,
neither of the two univariate analyses produced satisfactory results. In both analyses the percentage of samples for which the number of classes was correctly detected was always less than 50% (Tab. 3), and the average CR index was always lower than 0.61 (Tab. 4). Since this index is a measure of partition correspondence based on the number of pairs of units which are assigned to the same group in both partitions, it can reach its maximum value only when the two partitions have the same number of groups. Hence, relatively low values of this index can be observed whenever the estimated number of components is different from the true one. The MSEs of the ML estimators of the regression coefficients B were also remarkably larger (up to five times) than the ones of the ML estimators computed using the proposed model (see Tab. 7, where the case I = 100 is omitted in order to make comparisons clearer). Using assumption (11) led to an overestimation of the number of components which increased with the sample size (Tab. 5). This result was not surprising: the true error term distribution is characterized by correlation within each component, while the model based on (11) assumes no correlation. Thus, a larger number of components was needed to adequately fit the data. This overestimation of K also affected the ability to correctly classify the error terms: the average CR indexes (Tab. 6) were lower than the ones reported in Tab. 2 (see the above remark about the CR index), and decreased as the sample size increased. As far as the properties of the ML estimators of the regression coefficients are concerned (Tab. 7), using assumption (11) instead of (3) led to an increase in the MSE of the estimates of B ranging between 17% and 83%. Furthermore, the comparison between the ML estimators computed under assumptions (2) and (3) shows that our approach resulted in a decrease in the MSE of the estimates of B by a factor of about four or more (almost ten in some situations) (see Tab. 7). In order to provide a more comprehensive evaluation of the performance of our strategy when the data are drawn from the proposed model, we performed some further Monte Carlo experiments by analysing data generated from model (1) under assumption (3) with K = 2. The main interest of these experiments is in the behavior of our method when the complexity of the model is increased due to the presence of a larger number of response variables and/or of regressors. Some results of these experiments are reported in the Online Resource.
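For readers who wish to reproduce the classification comparison, the corrected Rand index can be computed from pair counts. The Python sketch below uses the usual pair-counting form of the adjusted Rand index, which we assume is the version meant by [43,31]. The function name `corrected_rand_index` and the toy label vectors are hypothetical.

```python
from collections import Counter
from math import comb


def corrected_rand_index(labels_a, labels_b):
    """Corrected (adjusted) Rand index between two partitions of the same units."""
    n = len(labels_a)
    # pairs of units placed together in both partitions
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # pairs placed together within each partition separately
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = 0.5 * (pairs_a + pairs_b)
    if maximum == expected:  # both partitions trivial: treat as full agreement
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


true_labels = [0, 0, 1, 1]
print(corrected_rand_index(true_labels, [1, 1, 0, 0]))  # 1.0, labels are only permuted
print(round(corrected_rand_index(true_labels, [0, 0, 1, 2]), 4))  # 0.5714
```

The second call illustrates the remark above: a partition with a different number of groups cannot reach the maximum value even when it never merges units from different true groups.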
Table 5 Frequency distributions (over 100 samples) of the values of K selected under assumption (11) (2nd experiment).

K   I = 100   I = 200   I = 300
2      22         0         0
3      57        43        21
4      17        30        32
5       3        23        42
6       0         3         5
7       1         1         0

Table 6 Means and standard deviations of the CR index (over 100 samples) computed between the true partition of the error terms and the ones estimated under assumption (11) (2nd experiment).

CR index             I = 100   I = 200   I = 300
Mean                  0.7428    0.6168    0.5631
Standard deviation    0.1335    0.1179    0.1164

Table 7 Estimated biases and mean-square errors of four ML estimators of the parameters β0 and B when the error terms are generated from a mixture of two Gaussian components (2nd experiment).

                MLE under (2)        MLE under (3)        MLE under (10)       MLE under (11)
                I = 200   I = 300    I = 200   I = 300    I = 200   I = 300    I = 200   I = 300
β̂01   Bias      0.0107    0.0189     0.0090    0.0179     0.0082    0.0193     0.0092    0.0185
       MSE       0.0282    0.0195     0.0282    0.0195     0.0282    0.0196     0.0282    0.0195
β̂11   Bias      0.0241   −0.0024     0.0001   −0.0115     0.0175    0.0066    −0.0049   −0.0063
       MSE       0.0329    0.0204     0.0034    0.0026     0.0149    0.0113     0.0051    0.0035
β̂21   Bias      0.0160   −0.0142    −0.0038    0.0016    −0.0054   −0.0117    −0.0044   −0.0019
       MSE       0.0167    0.0176     0.0044    0.0021     0.0107    0.0096     0.0052    0.0034
β̂02   Bias      0.0231    0.0087     0.0215    0.0082     0.0222    0.0080     0.0213    0.0081
       MSE       0.0285    0.0202     0.0281    0.0201     0.0283    0.0203     0.0278    0.0201
β̂12   Bias      0.0212    0.0017     0.0034   −0.0011     0.0150    0.0011    −0.0034   −0.0025
       MSE       0.0333    0.0195     0.0042    0.0025     0.0173    0.0087     0.0077    0.0042
β̂22   Bias      0.0081   −0.0137    −0.0100   −0.0024    −0.0014    0.0027    −0.0149    0.0000
       MSE       0.0170    0.0185     0.0034    0.0021     0.0116    0.0108     0.0044    0.0030

3.3 Errors generated from a skew-Gaussian distribution
In the third and fourth Monte Carlo experiments, the datasets were generated from model (1) using the values of the parameters $\boldsymbol{\beta}_0$ and $\mathbf{B}$ defined in equations (18) and (19), but with error terms generated from a bidimensional skew-Gaussian distribution, that is

$$\boldsymbol{\epsilon}_i \sim SN(\boldsymbol{\xi}, \boldsymbol{\Omega}, \boldsymbol{\alpha}), \tag{21}$$
where ξ, Ω, α denote the location, dispersion and skewness parameters, respectively [2]. The skewness parameter was set α = (−4, 10)0 in the third experiment and α = (4, 10)0 in the fourth, while the remaining parameters were kept constant in both experiments. Specifically, Ω = Σ, where Σ is the covariance matrix defined by equation (20), and ξ = (−0.377, −0.734)0. This last value was chosen in order to guarantee that the mean vector of i was 0. For any sample generated in these two experiments, we computed not only the MLE under both the assumption (2) and the mixture assumption (3) for K from 1 to Kmax = 4, but also the ones under the assumption of skew-Gaussian distribution (21). The data generation and ML estimation based on such a probability distribution were performed in R using package sn. In both experiments, the value of K most commonly selected by the proposed procedure over the 100 samples was equal to two for sample sizes 200 and 300, and equal to one for I = 100 (see Tab. 1). Tabs. 8 and 9 summarize the bias and MSE of the three estimation procedures in the third and fourth experiments, respectively, for samples of size 200 and 300. Obviously, as in
the first experiment, the best performance is obtained using the ML estimators computed under the model used to generate the data. However, biases and meansquare errors of the parameter estimates computed under (3) are very close to the ones of the estimates obtained with the knowledge of the true distribution of the error terms and are lower than the ones computed under (2).
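The choice of a location ξ that centres the skew-Gaussian errors at zero can be illustrated with a short Python sketch. It assumes the parameterization of [2], under which the mean of SN(ξ, Ω, α) is ξ + √(2/π) ω δ, with ω the diagonal matrix of the square roots of the diagonal of Ω, Ω̄ the corresponding correlation matrix and δ = Ω̄α / √(1 + α′Ω̄α). The function name `centering_location` and the matrix `omega_example` are hypothetical; in particular, `omega_example` is not the matrix Σ of equation (20), so the output is not expected to reproduce the value of ξ used in the experiments.

```python
import math


def centering_location(omega, alpha):
    """Return xi such that SN(xi, omega, alpha) has a zero mean vector.

    Assumes the parameterization in which E(Y) = xi + sqrt(2/pi) * w * delta.
    """
    n = len(alpha)
    # Square roots of the diagonal of omega (scale of each component).
    w = [math.sqrt(omega[i][i]) for i in range(n)]
    # Correlation matrix associated with omega.
    corr = [[omega[i][j] / (w[i] * w[j]) for j in range(n)] for i in range(n)]
    corr_alpha = [sum(corr[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    quad = sum(alpha[i] * corr_alpha[i] for i in range(n))
    delta = [c / math.sqrt(1.0 + quad) for c in corr_alpha]
    return [-math.sqrt(2.0 / math.pi) * w[i] * delta[i] for i in range(n)]


# Hypothetical dispersion matrix, used only for illustration.
omega_example = [[1.0, 0.5], [0.5, 1.0]]
xi_third = centering_location(omega_example, [-4.0, 10.0])
xi_fourth = centering_location(omega_example, [4.0, 10.0])
print(xi_third, xi_fourth)
```

In the univariate case with unit scale the function reduces to ξ = −√(2/π) α / √(1 + α²), which gives a quick sanity check of the implementation.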
Table 8 Estimated biases and mean-square errors of three ML estimators of the parameters β0 and B when errors are generated from a skew-Gaussian distribution (3rd experiment).

              MLE under (2)        MLE under (3)        MLE under (21)
              I = 200   I = 300    I = 200   I = 300    I = 200   I = 300
β̂01   Bias    0.0024    0.0053     0.0019    0.0053     0.0014    0.0058
       MSE     0.0057    0.0035     0.0057    0.0035     0.0057    0.0034
β̂11   Bias   −0.0030   −0.0018     0.0017   −0.0015    −0.0003   −0.0003
       MSE     0.0069    0.0046     0.0069    0.0041     0.0063    0.0042
β̂21   Bias    0.0047    0.0037     0.0012    0.0025     0.0013    0.0028
       MSE     0.0042    0.0041     0.0046    0.0035     0.0042    0.0034
β̂02   Bias    0.0034   −0.0025     0.0029   −0.0024     0.0020   −0.0015
       MSE     0.0026    0.0015     0.0026    0.0015     0.0026    0.0013
β̂12   Bias   −0.0099   −0.0018    −0.0026   −0.0009    −0.0055    0.0004
       MSE     0.0027    0.0015     0.0023    0.0011     0.0020    0.0009
β̂22   Bias    0.0040    0.0025     0.0003   −0.0001    −0.0014   −0.0010
       MSE     0.0019    0.0021     0.0015    0.0012     0.0012    0.0009

Table 9 Estimated biases and mean-square errors of three ML estimators of the parameters β0 and B when errors are generated from a skew-Gaussian distribution (4th experiment).

              MLE under (2)        MLE under (3)        MLE under (21)
              I = 200   I = 300    I = 200   I = 300    I = 200   I = 300
β̂01   Bias    0.0030   −0.0019     0.0024   −0.0016     0.0022   −0.0001
       MSE     0.0037    0.0022     0.0038    0.0022     0.0034    0.0021
β̂11   Bias   −0.0080   −0.0020    −0.0007    0.0019    −0.0024    0.0025
       MSE     0.0040    0.0025     0.0035    0.0017     0.0025    0.0016
β̂21   Bias    0.0060    0.0046     0.0018    0.0025     0.0000    0.0026
       MSE     0.0025    0.0029     0.0021    0.0018     0.0018    0.0013
β̂02   Bias    0.0028   −0.0079     0.0024   −0.0077     0.0021   −0.0061
       MSE     0.0023    0.0011     0.0022    0.0011     0.0021    0.0001
β̂12   Bias   −0.0112   −0.0013    −0.0035    0.0001    −0.0056    0.0029
       MSE     0.0017    0.0010     0.0012    0.0008     0.0011    0.0006
β̂22   Bias    0.0029    0.0015     0.0006   −0.0024    −0.0032   −0.0015
       MSE     0.0022    0.0013     0.0012    0.0007     0.0010    0.0004

4 Experimental results from real data

The proposed approach was applied to a real dataset concerning 202 athletes at the Australian Institute of Sport. The data are described in [12] and are also available within the package sn in R. In particular, we focused on seven variables: body mass index (BMI), sum of skin folds (SSF), percentage of body fat (PBF), lean body mass (LBM), red cell count (RCC), white cell count (WCC), and plasma ferritin concentration (PFC). The first four are biometrical variables, while the last three concern blood composition. We studied the joint linear dependence of the biometrical variables on the blood composition variables. The Pearson correlation matrix between the four dependent variables (in the order BMI, SSF, PBF, LBM) is:

$$\begin{pmatrix} 1.000 & 0.321 & 0.188 & 0.714 \\ & 1.000 & 0.963 & -0.208 \\ & & 1.000 & -0.362 \\ & & & 1.000 \end{pmatrix}.$$

These data were previously analyzed, for instance, by [18] through Bayesian multivariate regression methods with skewed distributions for the error terms. The same data without the covariates were also used by [2] and [3] in the context of skewed distributions. Multivariate linear regression models based on assumption (3) were estimated for values of K from 1 to 4. For comparison purposes, model (1) was also estimated under assumptions (21), (10) for Kd = 1, . . . , 4 ∀d, and (11) for K = 1, . . . , 8. The log-likelihoods of some estimated models are reported in Tab. 10 together with the corresponding number of estimated parameters and the values of three model-selection criteria. The reported results for model (1) under assumptions
(10) and (11) are the ones corresponding to maximum AIC, BIC and ICL. According to two model-selection criteria, namely BIC and AIC, the best model is the one with a mixture of two Gaussian components with non-diagonal covariance matrices for the error terms. According to ICL, preference should be given to the skew-Gaussian model, but the value of ICL for this model is very close to the one of the proposed mixture model with two components. This latter model seems to indicate the presence of an unobserved source of heterogeneity in the linear regression model. Such a source induces a partition of the athletes into two clusters, one for each component. The estimated prior probabilities that an athlete belongs to these clusters are π̂1 = 0.385 and π̂2 = 0.615. As described in Section 2, this classification does not affect the regression coefficients B of the model describing the linear dependence of the biometrical variables on the blood composition ones; it only affects the intercepts λk = β0 + νk and the variance-covariance matrices Σk of the model. The four intercepts in the first cluster (and also the four response mean values) are higher than the ones in the second cluster (see the upper part of Tab. 11). As far as the response variances are concerned (lower part of Tab. 11), BMI, SSF and PBF have higher values in the first cluster, while LBM has a higher value in the second one. Further differences between clusters concern some correlations. For example, SSF and LBM are negatively correlated within the first cluster (−0.197), while they are almost uncorrelated within the second one (0.019). Tab. 12 shows the estimates of the model parameters β0 and B, together with their 95% confidence intervals obtained using the parametric bootstrapping residual method [16]. For some parameters such intervals contain the value 0. Thus, not all the model parameters may be considered significant. Similar results were also obtained through the bootstrapping unit method for calculating confidence intervals. From the selected model it is also possible to classify each athlete by calculating the two posterior cluster membership probabilities and by assigning the athlete to the cluster with the highest of such probabilities. Most of the athletes classified in the first cluster are female (81.8%), while 70% of the athletes classified in the second cluster are male (Tab. 13). The unobserved source of heterogeneity discovered by the selected model is statistically associated with the athletes' gender (χ² = 51.98, p-value = $5.62 \cdot 10^{-13}$). We also evaluated the proportion of the total sum of squares of each response explained by the model. This evaluation was performed using both the prior estimated cluster membership probabilities and the posterior ones (Tab. 14). Using the posterior probabilities instead of the prior ones results in an increase in the R² values, especially for SSF and PBF.
Based on the posterior probabilities, the estimated model accounts for only 34% and 40% of the deviance of LBM and BMI, respectively, while almost 80% of the deviance of SSF and PBF is explained by the model.
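The association between gender and estimated cluster membership can be checked from the counts in Tab. 13. The Python sketch below assumes that the reported statistic is the Pearson chi-square for a 2 × 2 table without continuity correction (an assumption that appears consistent with the reported value); the function name `pearson_chi2_2x2` is introduced here for illustration. For one degree of freedom the upper tail probability of the chi-square distribution equals erfc(√(x/2)), so only the standard library is needed.

```python
import math


def pearson_chi2_2x2(table):
    """Pearson chi-square statistic (no continuity correction) and p-value
    for a 2 x 2 contingency table given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Rows: female, male; columns: cluster 1, cluster 2 (counts from Tab. 13).
stat, p_value = pearson_chi2_2x2([[63, 37], [14, 88]])
print(round(stat, 2), p_value)
```

The statistic obtained in this way agrees with the reported χ² = 51.98 to two decimals, and the p-value is of the order of 10⁻¹³.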
5 Concluding remarks

In this paper we have explored the idea of using a finite mixture of Gaussian components for analysing non-normal errors in the multivariate linear regression model. As already mentioned in the introductory Section, this idea was previously studied in the context of multiple linear regression analysis [5]. According to that approach, a preliminary evaluation of the departure from normality has to be performed, based on a test for the hypothesis of normality of the error terms in a linear regression model [35]. Furthermore, whenever a departure from normality is detected, a linear regression model under the assumption that the error terms follow a finite mixture of normal distributions has to be fitted to the dataset. On the basis of a simulation study, in [5] it is suggested to use K = 2 components. As far as the model and the parameter estimation are concerned, the solutions described in this paper can be considered as a generalization of the proposal by [5] to the case D > 1 which is able to deal with the correlation structure among the D dependent variables. As the experimental results from simulated data described in Section 3 and in the Online Resource show, taking into account this correlation structure can result in estimators of the regression coefficients with appreciably lower MSEs. Section 4 also gives an example of a real dataset in which the model that ignores this dependence has a worse fit than the model proposed in this paper. Unlike in [5], in our approach linear regression models under assumption (3) have to be fitted to the dataset with different values of the number of components (e.g., 1 ≤ K ≤ Kmax, where Kmax is a value chosen by the analyst), and the choice of the best model has to be performed through a model selection criterion. Furthermore, since the classical model with normal error terms is obtained by setting K = 1 in (3), the evaluation of the departure from normality may be included in the model selection phase and is not based on a statistical test. Special attention should be paid to the choice of the model selection criterion, as this may affect the results obtained using our approach. More generally, it is worth noting that choosing the number of components of a finite mixture model is still an open problem (see, for example, [40], Chapter 6).
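The model selection step can be sketched in a few lines of Python. The sketch assumes that the criteria are expressed in the "larger is better" form BIC = 2ℓ − p ln I and AIC = 2ℓ − 2p, where ℓ is the maximized log-likelihood and p the number of free parameters; this form is an assumption made here, chosen because it reproduces the values tabulated for K = 1, 2 and 4 in Tab. 10 up to rounding. The function names are hypothetical, and the parameter count (K − 1 mixing weights, K intercept vectors, P·D regression coefficients and K unconstrained covariance matrices) follows the parameter vector θ defined in Appendix A.

```python
import math


def n_free_parameters(D, P, K):
    """Free parameters of model (1) under (3): weights, intercepts, B, covariances."""
    return (K - 1) + K * D + P * D + K * D * (D + 1) // 2


def aic(loglik, n_params):
    return 2.0 * loglik - 2.0 * n_params


def bic(loglik, n_params, n_obs):
    return 2.0 * loglik - n_params * math.log(n_obs)


# Maximized log-likelihoods under (3) for K = 1, ..., 4, copied from Tab. 10.
logliks = {1: -2422.90, 2: -2343.90, 3: -2340.09, 4: -2325.72}
D, P, I = 4, 3, 202  # responses, regressors, athletes

bic_values = {K: bic(ll, n_free_parameters(D, P, K), I) for K, ll in logliks.items()}
aic_values = {K: aic(ll, n_free_parameters(D, P, K)) for K, ll in logliks.items()}
best_k_bic = max(bic_values, key=bic_values.get)
best_k_aic = max(aic_values, key=aic_values.get)
print(best_k_bic, best_k_aic)
```

Both criteria select K = 2 on these inputs. For K = 3 the values computed from the printed log-likelihood differ from the tabulated BIC and AIC by about 1.6, which would be consistent with a log-likelihood of −2340.90; this may indicate a typographical slip in one of the printed entries.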
Table 10 Log-likelihood and values of three model-selection criteria for some linear regression models fitted to the Australian Institute of Sport dataset (the maximum value of each criterion is marked with an asterisk).

Model (1)                                          Log-likelihood   No. of parameters   BIC         AIC         ICL
under (3) with K = 1                               −2422.90         26                  −4983.82    −4897.80    −4983.82
under (3) with K = 2                               −2343.90         41                  −4905.44*   −4769.80*   −4923.22
under (3) with K = 3                               −2340.09         56                  −4979.06    −4793.80    −5006.89
under (3) with K = 4                               −2325.72         71                  −5028.32    −4793.44    −5062.33
under (10) with K1 = 2, K2 = K3 = K4 = 1           −2820.58         23                  −5763.26    −5687.17    −5763.36
under (10) with K1 = 2, K2 = 3, K3 = 2, K4 = 1     −2788.63         32                  −5747.12    −5641.25    −5879.12
under (10) with K1 = 2, K2 = 3, K3 = 4, K4 = 1     −2778.64         38                  −5758.99    −5633.28    −5849.44
under (11) with K = 7                              −2456.32         74                  −5305.47    −5060.64    −5328.15
under (11) with K = 8                              −2440.84         83                  −5322.26    −5047.67    −5345.01
under (21)                                         −2380.11         30                  −4919.46    −4820.22    −4919.46*

Table 11 Estimates of model parameters λk and Σk calculated from the real dataset (K = 2).

              BMI      SSF       PBF       LBM
λ̂′1          12.45    122.18    29.61     −3.65
λ̂′2           9.57     72.62    20.27     −5.51

Σ̂1   BMI      6.89     17.30     0.80     15.11
      SSF     17.30    710.66   100.11    −42.71
      PBF      0.80    100.11    16.43    −12.89
      LBM     15.11    −42.71   −12.89     66.34

Σ̂2   BMI      3.96      4.63    −0.21     19.19
      SSF      4.63    156.23    28.20      2.79
      PBF     −0.21     28.20     6.47     −8.86
      LBM     19.19      2.79    −8.86    140.67

Table 12 Estimates of β0 and the regression coefficients calculated from the real dataset (K = 2). 95% confidence intervals are reported in brackets.

        BMI                      SSF                        PBF                        LBM
β̂′0    10.68 (7.80, 13.73)      91.72 (69.36, 122.75)      23.87 (19.45, 29.95)       −4.79 (−16.13, 6.82)
RCC     2.30 (1.60, 2.92)        −7.74 (−13.52, −2.86)      −2.71 (−3.99, −1.77)       14.33 (11.78, 16.78)
WCC     0.06 (−0.11, 0.21)       1.974 (0.58, 3.12)         0.40 (0.13, 0.62)          −0.28 (−1.04, 0.35)
PFC     0.013 (0.008, 0.020)     −0.003 (−0.048, 0.051)     −0.006 (−0.015, 0.005)     0.053 (0.025, 0.082)

Table 13 Joint classification of the athletes according to gender and cluster membership estimated by the selected model.

            Cluster membership
Gender      1       2       Total
Female      63      37      100
Male        14      88      102
Total       77      125     202

Table 14 Proportions of the total sum of squares of the four responses explained by the selected model, evaluated using both the prior estimated cluster membership probabilities (R²prior) and the posterior ones (R²posterior).

               BMI      SSF      PBF      LBM
R²prior        0.135    0.098    0.187    0.336
R²posterior    0.405    0.714    0.793    0.342

As illustrated in Subsection 2.2, the proposed model can be recast as a particular mixture of K multivariate linear regression models with Gaussian error terms, in which regression coefficients are assumed to be equal over all components. Similar models are implemented in the R package flexmix [26]. The main difference between these models and the one proposed in this paper concerns the assumptions on the error correlation structure within each component: our proposal allows the component-specific error terms to be correlated, while flexmix assumes independence. As the experimental results obtained from simulated and real data show, ignoring this component-specific correlation structure can lead to estimators of the regression coefficients with larger MSEs and can result in a model with a worse fit to the data. The approach described in this paper assumes that all covariance matrices of the mixture components are unconstrained. This approach could be made more flexible and versatile by allowing these covariance matrices to be parameterized by the eigenvalue decomposition [4], which is commonly used in Gaussian model-based cluster analysis to obtain different clustering criteria. By means of this parameterization it is possible to control the orientation, volume and shape of the mixture distributions, which can be allowed to vary between components or constrained to be the same for all components [4,11]. Thus, a collection of parsimonious and interpretable multivariate linear regression models can be defined. In order to incorporate this solution into our approach, a modified version of the EM algorithm is needed that allows a constrained estimate of the mixture component covariance matrices. As proven in Appendix A, provided that matrix x has full column rank, the model proposed in this paper is always identifiable, but it is invariant under permutations of the component labels [40]. This proof also applies to the model proposed in [5], where the issue of identifiability of the proposed model was not addressed. It also holds for the models implemented in flexmix, when all the regression coefficients are restricted to be equal over all components (thus relaxing the conditions described in [27] that are required for the unrestricted model). Properties of the proposed ML estimators of the regression parameters, numerically evaluated through various Monte Carlo experiments for three probability models, suggest that the approach to multivariate linear regression analysis described in this paper may represent a useful and flexible strategy to handle non-normal error terms in the multivariate linear regression model. However, it is worth noting that in real data applications some model parameters may not be significant. Thus, estimates of the standard errors of each parameter estimate should also be computed. Approximated standard errors can be obtained, for example, through a resampling strategy, namely the bootstrap approach.
Other approaches to standard error approximation, based on the expected or the observed information matrices about the model parameters, require knowledge of the Hessian of the log-likelihood function [40]. We are currently in the process of evaluating the Hessian of the log-likelihood function in order to make it possible to obtain the asymptotic covariance matrix of the proposed ML estimators, and to use this result to compute the approximated standard errors. However, since the sample size must be very large before the asymptotic theory of maximum likelihood applies, in particular for mixture models [40], the resampling strategy is recommended whenever the sample size is small.
As in this paper linear restrictions on the regression coefficients are not considered, all the regressors are assumed to be relevant for all the responses. In future work we intend to modify the estimation procedure and the strategy in search of the best model in order to allow for linear restrictions on the regression coefficients. The model proposed in this paper could be used to extend some recent models for clustering data that allow for the possible presence of irrelevant variables, that is, variables that do not provide information about the clustering of the units [36,37,42]. The basic idea behind these models is that the irrelevant variables may depend on the relevant ones, and this dependence is modelled using multivariate Gaussian linear regression models. In particular, the use of the proposed model instead of the Gaussian one could allow for the possible presence of multiple cluster structures defined by different subsets of the observed variables [25]. With respect to [25], the main advantage of using the proposed model would consist of removing the assumption of independence among the subsets of observed variables.

A Appendix. Model identifiability

Given the model defined by equations (1) and (3), the joint probability density function (p.d.f.) of a random sample of I observations is

$$f(y; x, \theta) = \prod_{i=1}^{I} \left[ \sum_{k=1}^{K} \pi_k \, \phi_D(y_i; \mu_{ik}, \Sigma_k) \right], \tag{22}$$

where θ = (π1, . . . , πK−1, λ1, . . . , λK, vec(B), v(Σ1), . . . , v(ΣK))′ is the vector containing the independent parameters of the model, and v(Σk) denotes the vector formed by stacking the columns of the lower triangular portion of Σk. In order to prove identifiability, we first show that (22) may be expressed as a mixture of $J = K^I$ (D·I)-dimensional Gaussian components. Let $A_{K,I} = \{(k_1, \ldots, k_I) : k_i \in \{1, \ldots, K\},\ i = 1, \ldots, I\}$ be the set containing the arrangements of the first K positive integers amongst I with repetitions, and let $k^{(j)} = (k_1^{(j)}, \ldots, k_I^{(j)})$ be a generic element of $A_{K,I}$ (j = 1, . . . , J). Since equation (22) is a product of I factors that are a sum of K addends each, it can be written as a sum of J addends that are a product of I factors each, as follows:

$$
\begin{aligned}
f(y; x, \theta) &= \sum_{j=1}^{J} \left\{ \prod_{i=1}^{I} \pi_{k_i^{(j)}} \, \phi_D\!\left(y_i; \lambda_{k_i^{(j)}} + B' x_i, \Sigma_{k_i^{(j)}}\right) \right\} \\
&= \sum_{j=1}^{J} \left\{ \prod_{i=1}^{I} \pi_{k_i^{(j)}} \prod_{i=1}^{I} \phi_D\!\left(y_i; \lambda_{k_i^{(j)}} + B' x_i, \Sigma_{k_i^{(j)}}\right) \right\}.
\end{aligned} \tag{23}
$$

Let $\pi_j = \prod_{i=1}^{I} \pi_{k_i^{(j)}}$, j = 1, . . . , J. Clearly, $\pi_j \geq 0\ \forall j$ and $\sum_{j=1}^{J} \prod_{i=1}^{I} \pi_{k_i^{(j)}} = \prod_{i=1}^{I} \left[ \sum_{k=1}^{K} \pi_k \right] = 1$. Furthermore, given the properties of products between independent Gaussian random variables and vectors (see, for example, [38]), each product
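of I D-dimensional Gaussian p.d.f.'s in equation (23) is equal to a (D · I)-dimensional Gaussian p.d.f.:

$$\prod_{i=1}^{I} \phi_D\!\left(y_i; \lambda_{k_i^{(j)}} + B' x_i, \Sigma_{k_i^{(j)}}\right) = \phi_{D \cdot I}\left(\mathrm{vec}(y); \lambda_j + \mathrm{vec}(xB), \Sigma_j\right), \tag{24}$$

where $\lambda_j = (\lambda'_{k_1^{(j)}}, \ldots, \lambda'_{k_i^{(j)}}, \ldots, \lambda'_{k_I^{(j)}})'$, and $\Sigma_j = \mathrm{diag}(\Sigma_{k_1^{(j)}}, \ldots, \Sigma_{k_i^{(j)}}, \ldots, \Sigma_{k_I^{(j)}})$ is a block diagonal matrix. Thus, f(y; x, θ) may be re-expressed as the following mixture of J Gaussian components:

$$f(y; x, \theta) = \sum_{j=1}^{J} \pi_j \, \phi_{D \cdot I}\left(\mathrm{vec}(y); \lambda_j + \mathrm{vec}(xB), \Sigma_j\right). \tag{25}$$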
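The passage from the product of I mixtures in (22) to the single mixture with $J = K^I$ components in (25) can be verified numerically on a toy example. The Python sketch below does so for univariate responses (D = 1) and a single regressor (P = 1); all numerical values and function names are hypothetical and serve only to illustrate the expansion, not to reproduce any analysis in the paper.

```python
import itertools
import math


def normal_pdf(y, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def density_as_product(y, x, pi, lam, beta, var):
    """Product over units of the K-component mixture densities, as in (22)."""
    total = 1.0
    for yi, xi in zip(y, x):
        total *= sum(pi[k] * normal_pdf(yi, lam[k] + beta * xi, var[k])
                     for k in range(len(pi)))
    return total


def density_as_expanded_mixture(y, x, pi, lam, beta, var):
    """Sum over the K**I label arrangements, as in (23) and (25)."""
    K, I = len(pi), len(y)
    total, weights = 0.0, []
    for labels in itertools.product(range(K), repeat=I):
        weight = math.prod(pi[k] for k in labels)            # pi_j
        dens = math.prod(normal_pdf(y[i], lam[labels[i]] + beta * x[i], var[labels[i]])
                         for i in range(I))                  # product of I densities
        total += weight * dens
        weights.append(weight)
    return total, weights


# Hypothetical toy data: I = 4 units, K = 2 components.
y = [0.3, -1.2, 2.1, 0.8]
x = [0.5, -0.4, 1.5, 0.0]
pi, lam, beta, var = [0.4, 0.6], [-1.0, 1.0], 0.7, [0.5, 2.0]

product_form = density_as_product(y, x, pi, lam, beta, var)
expanded_form, weights = density_as_expanded_mixture(y, x, pi, lam, beta, var)
print(product_form, expanded_form, len(weights), sum(weights))
```

The two forms agree up to floating-point error, the number of addends equals $K^I$, and the weights $\pi_j$ sum to one, as stated in the text.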
In order to complete the proof, we now show that the family $\mathcal{F}$ of the multidimensional Gaussian p.d.f.'s in equation (24) generates identifiable finite mixtures. Let $\mathcal{F} = \{\phi_{D \cdot I}(\mathrm{vec}(y); \lambda + \mathrm{vec}(xB), \Sigma),\ \lambda \in \mathbb{R}^{D \cdot I},\ B \in M,\ \Sigma \in N,\ \mathrm{vec}(y) \in \mathbb{R}^{D \cdot I}\}$, where M denotes the set of P × D matrices, and N denotes the set of (D · I) × (D · I) positive definite matrices. A necessary and sufficient condition for the class of all finite mixtures of the family $\mathcal{F}$ to be identifiable is that $\mathcal{F}$ be a linearly independent set over the field of real numbers [51]. Similarly to the proof of Proposition 2 in [51], suppose that $\mathcal{F}$ is not identifiable. This implies, in terms of moment generating functions, that for some J ≥ 1

$$\sum_{j=1}^{J} a_j \exp\left\{ \frac{1}{2} T' \Sigma_j T + T' [\lambda_j + \mathrm{vec}(xB)] \right\} = 0 \quad \forall\, T \in \mathbb{R}^{D \cdot I}, \tag{26}$$

where $a_j \in \mathbb{R}$ for j = 1, . . . , J, and the pairs $(\lambda_j, \Sigma_j)$ are all distinct. It is easy to show that (26) implies

$$\exp\left\{ T' \mathrm{vec}(xB) \right\} \sum_{j=1}^{J} a_j \exp\left\{ \frac{1}{2} T' \Sigma_j T + T' \lambda_j \right\} = 0 \quad \forall\, T \in \mathbb{R}^{D \cdot I}. \tag{27}$$

Since $\exp\{T' \mathrm{vec}(xB)\} > 0\ \forall B$, equation (27) implies

$$\sum_{j=1}^{J} a_j \exp\left\{ \frac{1}{2} T' \Sigma_j T + T' \lambda_j \right\} = 0. \tag{28}$$

However, (28) asserts that the class of finite mixtures of generic multidimensional Gaussian p.d.f.'s is not identifiable, contrary to Proposition 2 in [51]. It is worth noting that this proof relies on the fact that matrix B in equation (5) does not depend on k and, hence, it does not depend on j in equation (23). Furthermore, since the proof holds for D ≥ 1, it also applies to the model proposed by [5]. Since the proof does not rely on any assumption about the structure of Σk, for k = 1, . . . , K, it also holds for the models implemented in flexmix, when all the regression coefficients are restricted to be equal over all components. The identifiability of the model parameters θ guaranteed by the proof, together with the constraint $\sum_{k=1}^{K} \pi_k \nu_k = 0$, also ensures the identifiability of the parameters β0 and νk for k = 1, . . . , K.
References

1. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N., Csaki, B.F. (Eds.) Second International Symposium on Information Theory, pp. 267–281. Academiai Kiado, Budapest (1973)
2. Azzalini, A., Capitanio, A.: Statistical applications of the multivariate skew normal distribution. J. Roy. Statist. Soc. Ser. B 61, 579–602 (1999)
3. Azzalini, A., Capitanio, A.: Distributions generated by perturbation of symmetry, with emphasis on a multivariate skew t-distribution. J. Roy. Statist. Soc. Ser. B 65, 367–389 (2003)
4. Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821 (1993)
5. Bartolucci, F., Scaccia, L.: The use of mixtures for dealing with non-normal regression errors. Comput. Stat. Data An. 48, 821–834 (2005)
6. Batsidis, A., Zografos, K.: Statistical inference for location and scale of elliptically contoured models with monotone missing data. J. Statist. Plann. Inference 136, 2606–2629 (2006)
7. Batsidis, A., Zografos, K.: Multivariate linear regression model with elliptically contoured distributed errors and monotone missing dependent variables. Commun. Stat. Theory 37, 349–372 (2008)
8. Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated classification likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 719–725 (2000)
9. Bozdogan, H.: Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52, 345–370 (1987)
10. Bozdogan, H.: Mixture-model cluster analysis using model selection criteria and a new informational measure of complexity. In: Bozdogan, H. (Ed.) Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modelling: an Informational Approach, pp. 69–113. Kluwer Academic Publishers, Boston (1994)
11. Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recogn. 28, 781–793 (1995)
12. Cook, R.D., Weisberg, S.: An Introduction to Regression Graphics. Wiley, New York (1994)
13. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood for incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39, 1–22 (1977)
14. DeSarbo, W.S., Cron, W.L.: A maximum likelihood methodology for clusterwise linear regression. J. Classification 5, 249–282 (1988)
15. Diaz-Garcia, J.A., Rojas, M.G., Leiva-Sanchez, V.: Influence diagnostics for elliptical multivariate linear regression models. Commun. Stat. Theory 32, 625–642 (2003)
16. Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. Chapman & Hall, London (1993)
17. Fama, E.F.: The behaviour of stock market prices. J. Bus. 38, 34–105 (1965)
18. Ferreira, J.T.A.S., Steel, M.F.J.: Bayesian multivariate regression analysis with a new class of skewed distributions. Research Report 419, Department of Statistics, University of Warwick (2003)
19. Ferreira, J.T.A.S., Steel, M.F.J.: Bayesian multivariate skewed regression modeling with an application to firm size. In: Genton, M.G. (Ed.) Skew-Elliptical Distributions and Their Applications: a Journey Beyond Normality, pp. 174–189. CRC Chapman & Hall, Boca Raton (2004)
20. Fraley, C., Raftery, A.E.: How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 41, 578–588 (1998)
21. Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis and density estimation. J. Amer. Statist. Assoc. 97, 611–631 (2002)
22. Fraley, C., Raftery, A.E.: Enhanced software for model-based clustering. J. Classification 20, 263–286 (2003)
23. Fraley, C., Raftery, A.E.: MCLUST version 3 for R: normal mixture modeling and model-based clustering. Technical Report No. 504, Department of Statistics, University of Washington (2006)
24. Galea, M., Paula, G.A., Bolfarine, H.: Local influence in elliptical linear regression models. Statistician 46, 71–79 (1997)
25. Galimberti, G., Soffritti, G.: Model-based methods to identify multiple cluster structures in a data set. Comput. Stat. Data An. 52, 520–532 (2007)
26. Grün, B., Leisch, F.: FlexMix version 2: finite mixtures with concomitant variables and varying and constant parameters. J. Statistical Software 28, URL http://www.jstatsoft.org/v26/i04/ (2008a)
27. Grün, B., Leisch, F.: Finite mixtures of generalized linear regression models. In: Shalabh, Heumann, C. (Eds.) Recent Advances in Linear Models and Related Areas, pp. 205–230. Physica Verlag, Heidelberg (2008b)
28. Hennig, C.: Identifiability of models for clusterwise linear regression. J. Classification 17, 273–296 (2000)
29. Hennig, C.: Fixed point clusters for linear regression: computation and comparison. J. Classification 19, 249–276 (2002)
30. Hosmer, D.W.jr.: Maximum likelihood estimates of the parameters of a mixture of two regression lines. Commun. Stat. Simulat. 3, 995–1006 (1974)
31. Hubert, L., Arabie, P.: Comparing partitions. J. Classification 2, 193–218 (1985)
32. Leisch, F.: FlexMix: a general framework for finite mixture models and latent class regression in R. J. Statistical Software 11, URL http://www.jstatsoft.org/v11/i08/ (2004)
33. Liu, C.: Bayesian robust multivariate linear regression with incomplete data. J. Amer. Statist. Assoc. 91, 1219–1227 (1996)
34. Liu, S.: Local influence in multivariate elliptical linear regression models. Linear Algebra Appl. 354, 159–174 (2002)
35. Looney, S.W., Gulledge, T.R.: Use of the correlation coefficient with normal probability plots. Am. Stat. 39, 75–79 (1985)
36. Maugis, C., Celeux, G., Martin-Magniette, M.-L.: Variable selection in model-based clustering: a general variable role modeling. Comput. Stat. Data An. 53, 3872–3882 (2009a)
37. Maugis, C., Celeux, G., Martin-Magniette, M.-L.: Variable selection for clustering with Gaussian mixture models. Biometrics 65, 701–709 (2009b)
38. McColl, J.H.: Multivariate Probability. Arnold, London (2004)
39. McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions. Second edition. Wiley, Chichester (2008)
40. McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, Chichester (2000)
41. R Development Core Team: R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org (2008)
42. Raftery, A.E., Dean, N.: Variable selection for model-based cluster analysis. J. Amer. Statist. Assoc. 101, 168–178 (2006)
43. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Amer. Statist. Assoc. 66, 846–850 (1971)
44. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
45. Seidel, W., Mosler, K., Alker, M.: A cautionary note on likelihood ratio tests in mixture models. Ann. I. Stat. Math. 52, 481–487 (2000)
46. Srivastava, M.S.: Methods of Multivariate Statistics. John Wiley & Sons, New York (2002)
47. Steele, R.J., Raftery, A.E.: Performance of Bayesian model selection criteria for Gaussian mixture models. Technical Report No. 559, Department of Statistics, University of Washington (2009)
48. Sutradhar, B.C., Ali, M.M.: Estimation of the parameters of a regression model with a multivariate t error variable. Commun. Stat. Theory 15, 429–450 (1986)
49. Sutton, J.: Gibrat's legacy. J. Econ. Lit. 35, 40–59 (1997)
50. Wedel, M., Steenkamp, J.-B.E.M.: A clusterwise regression method for simultaneous fuzzy market structuring and benefit segmentation. J. Marketing Res. 28, 385–396 (1991)
51. Yakowitz, S.J., Spragins, J.D.: On the identifiability of finite mixtures. Ann. Math. Stat. 39, 209–214 (1968)
52. Zellner, A.: Bayesian and non-Bayesian analysis of the regression model with multivariate student-t error terms. J. Amer. Statist. Assoc. 71, 400–405 (1976)
Multivariate linear regression with non-normal errors: a solution based on mixture models Supplementary material Gabriele Soffritti · Giuliano Galimberti
1 Maximum likelihood estimation of the model parameters
for example, [1]), this differential is given by dlc2 =−
In order to obtain the maximum likelihood estimates of Γ and Σ k for k = 1, . . . , K, we have evaluated the first differential of the function lc2 defined in equation (15). Using some results from matrix derivatives (see,
−
=−
+
K
K
k=1
k=1
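\[
\begin{aligned}
dl_{c2} ={}& -\frac{1}{2}\sum_{k=1}^{K} z_{.k}\, d(\log|\Sigma_k|)
 - \frac{1}{2}\sum_{k=1}^{K} \mathrm{tr}[(d\Sigma_k^{-1}) D_k]
 - \frac{1}{2}\sum_{k=1}^{K} \mathrm{tr}[(dD_k)\Sigma_k^{-1}] \\
={}& -\frac{1}{2}\sum_{k=1}^{K} z_{.k}\, \mathrm{tr}[\Sigma_k^{-1} d\Sigma_k]
 + \frac{1}{2}\sum_{k=1}^{K} \mathrm{tr}[\Sigma_k^{-1} (d\Sigma_k) \Sigma_k^{-1} D_k] \\
& + \sum_{k=1}^{K} \mathrm{tr}[\Sigma_k^{-1} (y - X_k\Gamma)' \mathrm{diag}(z_{+k}) X_k\, d\Gamma] \\
={}& \frac{1}{2}\sum_{k=1}^{K} \mathrm{tr}[(d\Sigma_k)' \Sigma_k^{-1} (D_k - z_{.k}\Sigma_k) \Sigma_k^{-1}]
 + \sum_{k=1}^{K} \mathrm{tr}[(d\Gamma)' X_k' \mathrm{diag}(z_{+k}) (y - X_k\Gamma) \Sigma_k^{-1}] \\
={}& \frac{1}{2}\sum_{k=1}^{K} [\mathrm{vec}(d\Sigma_k)]' (\Sigma_k^{-1} \otimes \Sigma_k^{-1})\, \mathrm{vec}(D_k - z_{.k}\Sigma_k) \\
& + \sum_{k=1}^{K} [\mathrm{vec}(d\Gamma)]' \{\Sigma_k^{-1} \otimes [X_k' \mathrm{diag}(z_{+k})]\}\, \mathrm{vec}(y) \\
& - \sum_{k=1}^{K} [\mathrm{vec}(d\Gamma)]' \{\Sigma_k^{-1} \otimes [X_k' \mathrm{diag}(z_{+k}) X_k]\}\, \mathrm{vec}(\Gamma),
\end{aligned}
\]

where $\mathrm{vec}(A)$ denotes the vector formed by stacking the columns of the matrix $A$, one underneath the other, $\mathrm{v}(B)$ denotes the vector formed by stacking the columns of the lower triangular portion of $B$, and $\otimes$ is the Kronecker product operator. Since matrix $\Sigma_k$ is symmetric, $\mathrm{vec}(d\Sigma_k) = d\,\mathrm{vec}(\Sigma_k) = G\, d\,\mathrm{v}(\Sigma_k)$, where $G$ denotes the duplication matrix that transforms $\mathrm{v}(\Sigma_k)$ into $\mathrm{vec}(\Sigma_k)$. Thus, $dl_{c2}$ may be re-expressed as

\[
\begin{aligned}
dl_{c2} ={}& \frac{1}{2}\sum_{k=1}^{K} [d\,\mathrm{v}(\Sigma_k)]' G' (\Sigma_k^{-1} \otimes \Sigma_k^{-1})\, \mathrm{vec}(D_k - z_{.k}\Sigma_k) \\
& + \sum_{k=1}^{K} [\mathrm{vec}(d\Gamma)]' \{\Sigma_k^{-1} \otimes [X_k' \mathrm{diag}(z_{+k})]\}\, \mathrm{vec}(y) \\
& - \sum_{k=1}^{K} [\mathrm{vec}(d\Gamma)]' \{\Sigma_k^{-1} \otimes [X_k' \mathrm{diag}(z_{+k}) X_k]\}\, \mathrm{vec}(\Gamma).
\end{aligned}
\]

The first partial derivatives of $l_{c2}$ are equal to

\[
\frac{\partial l_{c2}}{\partial [\mathrm{vec}(\Gamma)]'} = \sum_{k=1}^{K} \{\Sigma_k^{-1} \otimes [X_k' \mathrm{diag}(z_{+k})]\}\, \mathrm{vec}(y) - \sum_{k=1}^{K} \{\Sigma_k^{-1} \otimes [X_k' \mathrm{diag}(z_{+k}) X_k]\}\, \mathrm{vec}(\Gamma),
\]

\[
\frac{\partial l_{c2}}{\partial [\mathrm{v}(\Sigma_k)]'} = \frac{1}{2} G' (\Sigma_k^{-1} \otimes \Sigma_k^{-1})\, \mathrm{vec}(D_k - z_{.k}\Sigma_k).
\]

On equating these first derivatives to null vectors, the following equations can be obtained:

\[
\sum_{k=1}^{K} \{\Sigma_k^{-1} \otimes [X_k' \mathrm{diag}(z_{+k}) X_k]\}\, \mathrm{vec}(\Gamma) = \sum_{k=1}^{K} \{\Sigma_k^{-1} \otimes [X_k' \mathrm{diag}(z_{+k})]\}\, \mathrm{vec}(y),
\]

\[
G' (\Sigma_k^{-1} \otimes \Sigma_k^{-1})\, \mathrm{vec}(D_k - z_{.k}\Sigma_k) = 0.
\]

Provided that matrix $\sum_{k=1}^{K} \Sigma_k^{-1} \otimes [X_k' \mathrm{diag}(z_{+k}) X_k]$ is non-singular, from the first of these two equations the solution for $\mathrm{vec}(\Gamma)$ is

\[
\mathrm{vec}(\hat{\Gamma}) = \left( \sum_{k=1}^{K} \Sigma_k^{-1} \otimes [X_k' \mathrm{diag}(z_{+k}) X_k] \right)^{-1} \left( \sum_{k=1}^{K} \Sigma_k^{-1} \otimes [X_k' \mathrm{diag}(z_{+k})] \right) \mathrm{vec}(y).
\]

The second equation can be rewritten as

\[
G' (\Sigma_k^{-1} \otimes \Sigma_k^{-1})\, G\, \mathrm{v}(D_k - z_{.k}\Sigma_k) = 0,
\]

because the symmetry of $(D_k - z_{.k}\Sigma_k)$ implies that $\mathrm{vec}(D_k - z_{.k}\Sigma_k) = G\, \mathrm{v}(D_k - z_{.k}\Sigma_k)$. Pre-multiplying this equation by $G^{+} (\Sigma_k \otimes \Sigma_k) G^{+\prime}$, where $G^{+}$ denotes the Moore-Penrose inverse of $G$, and using a theorem from linear algebra ([1], p. 315), we have found that $\mathrm{v}(D_k - z_{.k}\Sigma_k) = 0$. Since $(D_k - z_{.k}\Sigma_k)$ is symmetric, this implies that $(D_k - z_{.k}\Sigma_k)$ equals a null matrix, and so the solution for $\Sigma_k$ is

\[
\hat{\Sigma}_k = z_{.k}^{-1} D_k, \qquad k = 1, \ldots, K.
\]

THE FINAL, DEFINITIVE VERSION OF THIS MANUSCRIPT HAS BEEN PUBLISHED ONLINE AS SUPPLEMENTARY MATERIAL TO STATISTICS AND COMPUTING, VOLUME 21, NUMBER 4 (2011), 523-536, DOI: 10.1007/s11222-010-9190-3, URL: http://www.springerlink.com/content/d76070k44j2m7734/
G. Soffritti (corresponding author), Department of Statistics, University of Bologna, via delle Belle Arti 41 - 40126 Bologna, Italy
G. Galimberti, Department of Statistics, University of Bologna, via delle Belle Arti 41 - 40126 Bologna, Italy

2 Further experimental results from simulated data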
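The two closed-form solutions derived above can be checked numerically. The following is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes that $D_k = (y - X_k\Gamma)' \mathrm{diag}(z_{+k}) (y - X_k\Gamma)$ and that $z_{.k}$ is the sum of the weights of component $k$ (both inferred from the differentials above rather than quoted), and the function names `update_gamma` and `update_sigma`, as well as the random toy data, are hypothetical.

```python
import numpy as np

def update_gamma(y, X_list, z, Sigma_list):
    """Solve the first stationarity equation for Gamma, with the Sigma_k held fixed."""
    n_resp = y.shape[1]
    n_coef = X_list[0].shape[1]
    lhs = np.zeros((n_coef * n_resp, n_coef * n_resp))
    rhs = np.zeros(n_coef * n_resp)
    vec_y = y.reshape(-1, order="F")            # vec(y): stack the columns of y
    for k, (X_k, Sigma_k) in enumerate(zip(X_list, Sigma_list)):
        Sigma_inv = np.linalg.inv(Sigma_k)
        XtZ = X_k.T * z[:, k]                   # X_k' diag(z_{+k})
        lhs += np.kron(Sigma_inv, XtZ @ X_k)    # Sigma_k^{-1} kron [X_k' diag(z_{+k}) X_k]
        rhs += np.kron(Sigma_inv, XtZ) @ vec_y  # {Sigma_k^{-1} kron [X_k' diag(z_{+k})]} vec(y)
    vec_gamma = np.linalg.solve(lhs, rhs)
    return vec_gamma.reshape((n_coef, n_resp), order="F")

def update_sigma(y, X_list, z, Gamma):
    """Sigma_k = D_k / z_{.k}, with D_k the weighted residual cross-product (assumed form)."""
    sigmas = []
    for k, X_k in enumerate(X_list):
        resid = y - X_k @ Gamma
        D_k = (resid.T * z[:, k]) @ resid
        sigmas.append(D_k / z[:, k].sum())
    return sigmas

# Toy data (purely illustrative, not taken from the experiments).
rng = np.random.default_rng(0)
n_obs, n_resp, n_coef, n_comp = 40, 2, 3, 2
y = rng.normal(size=(n_obs, n_resp))
X_list = [rng.normal(size=(n_obs, n_coef)) for _ in range(n_comp)]
z = rng.dirichlet(np.ones(n_comp), size=n_obs)  # weights in (0, 1), rows sum to one
Sigma_list = [np.eye(n_resp), np.array([[2.0, 0.5], [0.5, 1.0]])]

Gamma_hat = update_gamma(y, X_list, z, Sigma_list)
Sigma_hat = update_sigma(y, X_list, z, Gamma_hat)

# Matrix form of the derivative with respect to Gamma; it should vanish at Gamma_hat.
score = sum(
    (X_k.T * z[:, k]) @ (y - X_k @ Gamma_hat) @ np.linalg.inv(S_k)
    for k, (X_k, S_k) in enumerate(zip(X_list, Sigma_list))
)
print(np.abs(score).max())
```

Note that the solution for $\Gamma$ depends on the $\Sigma_k$, while the solution for each $\Sigma_k$ depends on $\Gamma$ through $D_k$, so the sketch evaluates them one after the other; how the two updates are combined within an iteration is not specified here and is left as an assumption.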
Three further Monte Carlo experiments were performed by generating 100 samples for each of three sample sizes ($I = 100, 200, 300$) from the model defined by equations (1) and (3) with $K = 2$ and $\pi_1 = \pi_2 = 0.5$. For each sample we computed the MLE under the assumptions defined in equation (2), in equation (3) for $K$ from 1 to $K_{max} = 4$, and in equation (11) for $K$ from 1 to $K_{max} = 7$.

Table 1 Parameters used to generate the data in the Monte Carlo experiments.

Fifth experiment
  β0'     2.0   4.0
  B'      2.0  −3.0  −2.5   3.5   1.5  −4.5   4.5  −1.5
         −3.0   2.0   3.5  −2.5  −4.5   1.5  −1.5   4.5

Sixth experiment
  β0'     2.0   4.0   1.0   3.0   5.0   7.0   8.0   6.0
  B       2.0  −3.0  −2.0   3.0   1.0  −4.0   4.0  −1.0
         −3.0   2.0   3.0  −2.0  −4.0   1.0  −1.0   4.0
  ν1'     2.0   2.0   2.0   2.0   2.0   2.0   2.0   2.0
  ν2'    −2.0  −2.0  −2.0  −2.0  −2.0  −2.0  −2.0  −2.0
  Σ1      1.0  −0.6  −0.8  −0.4  −0.3  −0.5  −0.1  −0.3
         −0.6   1.5   0.3   0.5   0.4   0.6   0.3   0.5
         −0.8   0.3   1.6  −0.5   0.5   0.5   0.5   0.7
         −0.4   0.5  −0.5   2.0  −0.6   0.2  −0.4  −0.6
         −0.3   0.4   0.5  −0.6   1.8   0.7   0.2   0.4
         −0.5   0.6   0.5   0.2   0.7   1.4   0.8   0.6
         −0.1   0.3   0.5  −0.4   0.2   0.8   1.9   0.8
         −0.3   0.5   0.7  −0.6   0.4   0.6   0.8   1.7
  Σ2      1.2   0.8   0.7   0.4   0.8   0.7   0.2   0.4
          0.8   1.0   0.2   0.3   0.5   0.6   0.4   0.2
          0.7   0.2   2.0   0.5   0.4   0.5   0.6   0.8
          0.4   0.3   0.5   1.6   0.9   0.3   0.8   0.6
          0.8   0.5   0.4   0.9   2.0   0.8   0.5   0.5
          0.7   0.6   0.5   0.3   0.8   1.7   0.9   0.6
          0.2   0.4   0.6   0.8   0.5   0.9   1.8   0.9
          0.4   0.2   0.8   0.6   0.5   0.6   0.9   1.5

Seventh experiment
  β0'     2.0   4.0   1.0   3.0   5.0   7.0   8.0   6.0
  B       2.0  −3.0  −2.0   3.0   1.0  −4.0   4.0  −1.0
         −3.0   2.0   3.0  −2.0  −4.0   1.0  −1.0   4.0
         −2.5   3.5   0.5  −3.5   2.0  −3.0   3.0  −2.0
          3.5  −2.5  −3.5   0.5  −3.0   2.0  −2.0   3.0
          1.5  −4.5   3.5  −0.5   5.0   1.0  −1.5   2.5
         −4.5   1.5  −0.5   3.5   1.0   5.0  −2.5  −1.5
          4.5  −1.5   1.5  −2.5   3.0  −2.0   2.0  −5.0
         −1.5   4.5  −2.5   1.5  −2.0   3.0  −5.0   2.0

In the fifth experiment, $D = 2$ responses depend linearly on $P = 8$ regressors. The parameters $\beta_0$ and $B$ used to generate the data are given in Tab. 1, while the remaining parameters $\nu_1$, $\nu_2$, $\Sigma_1$ and $\Sigma_2$ were set equal to the values already used in the second experiment (see Subsection 3.2). As far as the estimation of the model parameters under the mixture assumption (3) is concerned, it is worth noting that the large number of free parameters of the model caused some computational problems (i.e., lack of convergence of the EM algorithm within 300 iterations and/or non-existence of the ML estimators) when models with three or four components were fitted to the data, especially for samples of size $I = 100$. The percentage of successes in selecting the correct value of $K$ was noticeably lower than those obtained in the second experiment, particularly when $I = 100$ (62%, against 70% for $I = 200$ and 73% for $I = 300$; see Tab. 2, left part). A slight decrease in the ability to reconstruct the unknown classification of the error terms was also registered (see Tab. 3). Both results may be caused by the larger number of regressors in the model. Nonetheless, the value of $K$ most commonly selected by the proposed strategy was equal to two for every sample size, and the mean value of the CR index was above 0.868 for every sample size. As far as the properties of the parameter estimates are concerned, the highest accuracy of the estimates of $B$ was still obtained using the proposed approach (see Tab. 4). However, the differences in the MSE were slightly lower than those obtained in the second experiment.

In the sixth experiment, $D = 8$ responses depend linearly on $P = 2$ regressors. The parameters used to generate the data are summarized in Tab. 1. Also in this experiment, for some samples it was not possible to compute the estimates of the parameters of model (1) under assumption (3) when $K = 3$ and $K = 4$ because of the large number of model parameters, especially for samples of size $I = 100$. The value of $K$ was always correctly selected by the proposed strategy for each sample size (see Tab. 2, central part). The proposed approach also allowed an almost perfect reconstruction of the unknown classification of the error terms (see Tab. 3, central part). As far as the properties of the ML estimators of the regression parameters are concerned, our approach resulted in a decrease in the MSE of the estimates of $B$ by a factor of at least three (almost 14 in some situations) with respect to the classical approach (see Tab. 6). Using assumption (11) instead of assumption (3) led to an increase in the MSE of the estimates of $B$ ranging between 12% and 170%.

In the seventh experiment, $D = 8$ responses depend linearly on $P = 8$ regressors. The parameters $\beta_0$ and $B$ used to generate the data are given in Tab. 1; for the remaining parameters, we used the same values as in the sixth experiment. Also in this case, some computational problems arose in the parameter estimation phase under assumption (3) for $K = 3$ and $K = 4$ due to the large number of free parameters. The results obtained using our procedure were very similar to the ones described above for the sixth experiment in terms of correct choice of $K$ (see Tab. 2, right part), reconstruction of the unknown classification of the error terms (see Tab. 3) and properties of the ML estimates (these latter results are not reported).
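The bias and MSE figures discussed in this section are averages over Monte Carlo replications. The sketch below illustrates that bookkeeping only; it is not the authors' code. It assumes a model of the form $y_i = \beta_0 + B'x_i + \epsilon_i$ with two-component Gaussian mixture errors, standard normal regressors, and the classical least-squares estimator. The parameter values and the helper names `simulate_sample`, `ols_estimate` and `bias_and_mse` are hypothetical and chosen for illustration, not taken from Tab. 1.

```python
import numpy as np

def simulate_sample(n_obs, beta0, B, nu, Sigma, pi, rng):
    """Draw (X, Y) from a linear model whose errors follow a Gaussian mixture."""
    X = rng.normal(size=(n_obs, B.shape[0]))           # assumed regressor design
    labels = rng.choice(len(pi), size=n_obs, p=pi)     # mixture component of each error
    draws = np.stack([rng.multivariate_normal(nu[k], Sigma[k], size=n_obs)
                      for k in range(len(pi))])        # shape (K, n_obs, D)
    errors = draws[labels, np.arange(n_obs)]           # keep the draw of the sampled component
    return X, beta0 + X @ B + errors

def ols_estimate(X, Y):
    """Least-squares estimate of the stacked coefficients [beta0; B]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return coef

def bias_and_mse(estimates, truth):
    """Element-wise Monte Carlo bias and mean-square error over replications."""
    errors = np.asarray(estimates) - truth
    return errors.mean(axis=0), (errors ** 2).mean(axis=0)

# Illustrative setting: D = 2 responses, P = 2 regressors, K = 2 components.
rng = np.random.default_rng(1)
beta0 = np.array([2.0, 4.0])
B = np.array([[2.0, -3.0], [-3.0, 2.0]])               # P x D
nu = np.array([[2.0, 2.0], [-2.0, -2.0]])              # component means (sum to zero)
Sigma = np.array([np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])])
pi = np.array([0.5, 0.5])
truth = np.vstack([beta0, B])

estimates = np.array([
    ols_estimate(*simulate_sample(200, beta0, B, nu, Sigma, pi, rng))
    for _ in range(100)
])
bias, mse = bias_and_mse(estimates, truth)
print(np.round(bias, 4))
print(np.round(mse, 4))
```

The same `bias_and_mse` helper would apply unchanged to estimates produced by any other estimator, which is how the columns of a bias/MSE table for competing approaches could be filled in.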
References

1. Schott, J.R.: Matrix Analysis for Statistics. Second edition. John Wiley & Sons, New York (2005)
Table 2 Frequency distributions (over 100 samples) of the values of K selected using the proposed procedure in the Monte Carlo experiments, for samples of size 100, 200 and 300.

Experiment        5th                 6th       7th
K              2     3     4           2      2     3
I = 100       62    23    15         100     99     1
I = 200       70    17    13         100     99     1
I = 300       73    14    13         100    100     0
Table 3 Means (standard deviations in brackets) of the CR index (over 100 samples) computed between the true partition of the error terms and the ones estimated using our procedure in the Monte Carlo experiments.

Experiment    5th                6th                7th
I = 100       0.8689 (0.1579)    0.9964 (0.0176)    0.9890 (0.0310)
I = 200       0.9264 (0.1092)    0.9994 (0.0034)    0.9979 (0.0148)
I = 300       0.9482 (0.0841)    0.9993 (0.0029)    0.9992 (0.0032)
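Table 3 compares the true and the estimated partitions of the error terms through the CR index. Assuming that this index is the corrected (adjusted) Rand index, which is an assumption made here and not stated in this section, it could be computed from two label vectors as in the following sketch; the function name `corrected_rand_index` is hypothetical.

```python
from collections import Counter
from math import comb

def corrected_rand_index(labels_true, labels_est):
    """Corrected (adjusted) Rand index between two partitions of the same units."""
    n = len(labels_true)
    # Pairs of units placed together in both partitions (contingency-table cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_est)).values())
    # Pairs placed together in each partition separately (row and column margins).
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_est = sum(comb(c, 2) for c in Counter(labels_est).values())
    expected = together_true * together_est / comb(n, 2)
    maximum = 0.5 * (together_true + together_est)
    if maximum == expected:        # degenerate case, e.g. both partitions have one cluster
        return 1.0
    return (together_both - expected) / (maximum - expected)

# Identical partitions up to a relabelling of the components.
same = corrected_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that agree on no pair of units.
crossed = corrected_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

The index equals one for partitions that coincide up to a relabelling of the components, which is why it is suited to mixture models, where component labels are arbitrary.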
Table 4 Estimated biases and mean-square errors of three ML estimators of the parameters β0 and B when the error terms are generated from a mixture of two Gaussian components (5th experiment).

                MLE under (2)         MLE under (3)         MLE under (11)
                I = 200   I = 300     I = 200   I = 300     I = 200   I = 300
β̂01   Bias       0.0081    0.0179      0.0113    0.0188      0.0098    0.0194
       MSE        0.0319    0.0203      0.0294    0.0194      0.0295    0.0195
β̂11   Bias      −0.0019    0.0056      0.0009    0.0006     −0.0043    0.0122
       MSE        0.0238    0.0193      0.0048    0.0038      0.0060    0.0043
β̂21   Bias      −0.0029   −0.0153      0.0052    0.0071      0.0084    0.0084
       MSE        0.0339    0.0197      0.0061    0.0040      0.0075    0.0046
β̂31   Bias      −0.0153    0.0112     −0.0026    0.0030      0.0016    0.0095
       MSE        0.0211    0.0203      0.0066    0.0038      0.0077    0.0043
β̂41   Bias      −0.0061   −0.0020      0.0065   −0.0020      0.0041   −0.0008
       MSE        0.0247    0.0138      0.0045    0.0026      0.0054    0.0037
β̂51   Bias       0.0038    0.0050      0.0005    0.0016      0.0026    0.0058
       MSE        0.0271    0.0171      0.0048    0.0038      0.0074    0.0042
β̂61   Bias       0.0121    0.0059      0.0018    0.0051     −0.0004    0.0040
       MSE        0.0223    0.0150      0.0046    0.0024      0.0061    0.0027
β̂71   Bias       0.0072    0.0011     −0.0019    0.0043      0.0047    0.0044
       MSE        0.0204    0.0142      0.0045    0.0031      0.0057    0.0040
β̂81   Bias       0.0026   −0.0188      0.0166   −0.0014      0.0033   −0.0058
       MSE        0.0296    0.0157      0.0055    0.0032      0.0068    0.0040
β̂02   Bias       0.0228    0.0063      0.0228    0.0071      0.0240    0.0077
       MSE        0.0308    0.0214      0.0284    0.0206      0.0284    0.0207
β̂12   Bias      −0.0067   −0.0034     −0.0052   −0.0041     −0.0088    0.0000
       MSE        0.0247    0.0180      0.0070    0.0040      0.0084    0.0055
β̂22   Bias      −0.0055   −0.0254     −0.0004    0.0020      0.0033   −0.0015
       MSE        0.0352    0.0231      0.0076    0.0041      0.0079    0.0050
β̂32   Bias      −0.0181   −0.0030      0.0072   −0.0085      0.0025   −0.0009
       MSE        0.0233    0.0162      0.0071    0.0036      0.0080    0.0036
β̂42   Bias      −0.0029   −0.0031      0.0116    0.0031      0.0103    0.0025
       MSE        0.0239    0.0132      0.0052    0.0024      0.0065    0.0038
β̂52   Bias      −0.0012    0.0060     −0.0031    0.0077      0.0002    0.0077
       MSE        0.0306    0.0189      0.0047    0.0038      0.0066    0.0060
β̂62   Bias       0.0073    0.0027     −0.0052    0.0027     −0.0020    0.0046
       MSE        0.0235    0.0176      0.0056    0.0027      0.0064    0.0040
β̂72   Bias      −0.0052   −0.0005     −0.0009   −0.0002     −0.0046    0.0024
       MSE        0.0206    0.0155      0.0052    0.0029      0.0066    0.0044
β̂82   Bias       0.0036   −0.0181     −0.0034    0.0018      0.0000   −0.0057
       MSE        0.0284    0.0218      0.0057    0.0045      0.0076    0.0049
Table 5 Estimated biases and mean-square errors of two ML estimators of the parameters β0 and B (see equations (18) and (19)) when the error terms are generated from a Gaussian model (1st experiment).

                MLE under (2)                     MLE under (3)
                I = 100   I = 200   I = 300       I = 100   I = 200   I = 300
β̂01   Bias       0.0017    0.0046    0.0134        0.0018    0.0046    0.0134
       MSE        0.0122    0.0056    0.0042        0.0122    0.0056    0.0042
β̂11   Bias       0.0050   −0.0056   −0.0013        0.0066   −0.0043   −0.0013
       MSE        0.0137    0.0084    0.0044        0.0140    0.0085    0.0044
β̂21   Bias      −0.0101   −0.0036   −0.0050       −0.0064   −0.0035   −0.0050
       MSE        0.0092    0.0054    0.0041        0.0097    0.0054    0.0041
β̂02   Bias      −0.0001    0.0078    0.0134        0.0001    0.0078    0.0134
       MSE        0.0101    0.0037    0.0038        0.0101    0.0037    0.0038
β̂12   Bias       0.0031   −0.0148   −0.0009        0.0035   −0.0129   −0.0009
       MSE        0.0132    0.0066    0.0027        0.0135    0.0065    0.0027
β̂22   Bias      −0.0053   −0.0122   −0.0146       −0.0010   −0.0119   −0.0146
       MSE        0.0095    0.0049    0.0033        0.0098    0.0048    0.0033
Table 6 Estimated biases and mean-square errors of three ML estimators of the parameters β0 and B when the error terms are generated from a mixture of two Gaussian components (6th experiment).

                MLE under (2)         MLE under (3)         MLE under (11)
                I = 200   I = 300     I = 200   I = 300     I = 200   I = 300
β̂01   Bias      −0.0024    0.0099     −0.0015    0.0095     −0.0013    0.0096
       MSE        0.0289    0.0190      0.0286    0.0189      0.0287    0.0189
β̂11   Bias      −0.0173    0.0009     −0.0024   −0.0022     −0.0085   −0.0017
       MSE        0.0225    0.0183      0.0027    0.0014      0.0042    0.0031
β̂21   Bias      −0.0096   −0.0049      0.0068    0.0028      0.0028    0.0134
       MSE        0.0357    0.0195      0.0029    0.0014      0.0057    0.0033
β̂02   Bias       0.0119    0.0011      0.0126    0.0007      0.0126    0.0009
       MSE        0.0289    0.0212      0.0287    0.0212      0.0287    0.0212
β̂12   Bias      −0.0177   −0.0065     −0.0066   −0.0042     −0.0145   −0.0039
       MSE        0.0250    0.0163      0.0040    0.0024      0.0069    0.0048
β̂22   Bias      −0.0105   −0.0148      0.0065    0.0044      0.0028   −0.0019
       MSE        0.0375    0.0204      0.0048    0.0030      0.0067    0.0052
β̂03   Bias       0.0035   −0.0023      0.0041   −0.0023      0.0041   −0.0024
       MSE        0.0310    0.0191      0.0308    0.0190      0.0308    0.0190
β̂13   Bias      −0.0021   −0.0031      0.0123   −0.0003      0.0010    0.0016
       MSE        0.0268    0.0239      0.0059    0.0044      0.0093    0.0079
β̂23   Bias      −0.0278   −0.0282     −0.0122   −0.0150     −0.0145   −0.0142
       MSE        0.0387    0.0244      0.0068    0.0047      0.0104    0.0083
β̂04   Bias       0.0204    0.0017      0.0215    0.0012      0.0215    0.0016
       MSE        0.0314    0.0192      0.0311    0.0192      0.0312    0.0193
β̂14   Bias      −0.0251   −0.0058     −0.0026   −0.0050     −0.0156   −0.0050
       MSE        0.0268    0.0177      0.0055    0.0022      0.0089    0.0061
β̂24   Bias      −0.0248   −0.0185     −0.0041    0.0039     −0.0095   −0.0013
       MSE        0.0378    0.0260      0.0065    0.0041      0.0082    0.0070
β̂05   Bias       0.0033    0.0004      0.0040    0.0000      0.0040    0.0002
       MSE        0.0313    0.0226      0.0307    0.0226      0.0307    0.0226
β̂15   Bias      −0.0163   −0.0045     −0.0045   −0.0041     −0.0151   −0.0031
       MSE        0.0243    0.0214      0.0056    0.0048      0.0101    0.0077
β̂25   Bias      −0.0234   −0.0109     −0.0027    0.0031     −0.0086    0.0037
       MSE        0.0466    0.0193      0.0091    0.0054      0.0132    0.0081
β̂06   Bias       0.0064   −0.0029      0.0070   −0.0029      0.0070   −0.0030
       MSE        0.0301    0.0211      0.0299    0.0210      0.0299    0.0210
β̂16   Bias      −0.0102   −0.0055      0.0015   −0.0022     −0.0088   −0.0022
       MSE        0.0240    0.0230      0.0053    0.0042      0.0097    0.0057
β̂26   Bias      −0.0234   −0.0262     −0.0064   −0.0091     −0.0090   −0.0129
       MSE        0.0466    0.0203      0.0103    0.0051      0.0147    0.0066
β̂07   Bias       0.0088    0.0105      0.0095    0.0101      0.0095    0.0104
       MSE        0.0308    0.0221      0.0307    0.0218      0.0307    0.0218
β̂17   Bias      −0.0130   −0.0012     −0.0004   −0.0025     −0.0100    0.0026
       MSE        0.0275    0.0237      0.0085    0.0055      0.0128    0.0083
β̂27   Bias      −0.0225   −0.0272     −0.0067   −0.0140     −0.0077   −0.0121
       MSE        0.0431    0.0223      0.0125    0.0061      0.0140    0.0070
β̂08   Bias       0.0100    0.0029      0.0106    0.0026      0.0106    0.0029
       MSE        0.0299    0.0226      0.0297    0.0224      0.0297    0.0224
β̂18   Bias      −0.0137   −0.0038      0.0006   −0.0024     −0.0104    0.0009
       MSE        0.0296    0.0252      0.0066    0.0049      0.0095    0.0087
β̂28   Bias      −0.0305   −0.0176     −0.0149   −0.0024     −0.0166   −0.0041
       MSE        0.0411    0.0198      0.0077    0.0041      0.0110    0.0057