Computational Statistics & Data Analysis 51 (2007) 5339 – 5351 www.elsevier.com/locate/csda
Constrained monotone EM algorithms for finite mixture of multivariate Gaussians

Salvatore Ingrassia a,∗, Roberto Rocci b

a Dipartimento di Economia e Metodi Quantitativi, Facoltà di Economia, Università di Catania, Corso Italia 55, 95129 Catania, Italy
b Dipartimento SEFEMEQ, Università di Roma “Tor Vergata”, Via Columbia 2, 00133 Roma, Italy

∗ Corresponding author. Tel.: +39 095 7537732; fax: +39 095 7537510. E-mail addresses: [email protected] (S. Ingrassia), [email protected] (R. Rocci).
doi:10.1016/j.csda.2006.10.011
Available online 7 November 2006
Abstract

The likelihood function for multivariate normal mixtures may present spurious local maxima as well as singularities, and the latter may cause the failure of optimization algorithms. Theoretical results assure that imposing suitable constraints on the eigenvalues of the covariance matrices of the multivariate normal components leads to a constrained parameter space with no singularities and with a reduced number of spurious local maxima of the likelihood function. Conditions assuring that an EM algorithm implementing such constraints maintains the monotonicity property of the usual EM algorithm are provided. Different approaches are presented and their performances are evaluated and compared using numerical experiments.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Mixture models; EM algorithm; Monotonicity; Eigenvalues
1. Introduction

Let f(x; ψ) be the density of a mixture of k multivariate normal distributions

    f(x; ψ) = π_1 p(x; μ_1, Σ_1) + · · · + π_k p(x; μ_k, Σ_k),        (1)

where the π_j's are the mixing weights and p(x; μ_j, Σ_j) is the density function of a q-variate normal distribution with mean vector μ_j and covariance matrix Σ_j; furthermore, let us set ψ = {(π_j, μ_j, Σ_j), j = 1, ..., k} ∈ Ψ, where Ψ is the parameter space

    Ψ = {(π_1, ..., π_k, μ_1, ..., μ_k, Σ_1, ..., Σ_k) ∈ R^{k[1+q+(q²+q)/2]} : π_1 + · · · + π_k = 1, π_j ≥ 0, Σ_j > 0 for j = 1, ..., k}.

Let L(ψ) be the likelihood function of ψ given a sample x_1, ..., x_N ∈ R^q of N independent and identically distributed (i.i.d.) observations with density (1), and let ψ̂ be the maximum likelihood estimate of ψ. It is well known that L(ψ) is unbounded from above and may present many spurious local maxima, see e.g. McLachlan and Peel (2000). In the univariate case the likelihood grows indefinitely when the mean of one component coincides with a sample observation
and the corresponding variance tends to zero; analogously, in the multivariate case this occurs when some covariance matrix tends to be singular. Problems related to the unboundedness of the likelihood function have been studied by several authors, both for univariate and for multivariate normal mixtures. In the univariate case, Hathaway (1985) imposed relative constraints between pairs of variances, while Ciuperca et al. (2003) considered penalized maximum likelihood estimators and provided asymptotic properties of the penalized MLE; the problem of degeneracy has also been considered in a sequence of papers by Biernacki and Chrétien (2003) and Biernacki (2004a,b). In the multivariate case, Snoussi and Mohammad-Djafari (2001) approached the problem of degeneracy by penalizing the likelihood through an inverse Wishart prior on the covariance matrices. Hathaway (1985) proposed a constrained (global) maximum likelihood formulation which presents a strongly consistent global solution, no singularities and a reduced number of spurious local maxima, obtained by imposing the following constraints, which are satisfied by the true set of parameters:

    min_{1 ≤ h ≠ j ≤ k} λ_min(Σ_h Σ_j^{-1}) ≥ c > 0    with c ∈ (0, 1],        (2)
where λ_min(Σ_h Σ_j^{-1}) is the smallest eigenvalue of the matrix Σ_h Σ_j^{-1}; furthermore, Biernacki (2004b) suggested constraining the determinants of the covariance matrices to be greater than a given value. Here we move in the spirit of Hathaway (1985), extending the work of Ingrassia (2004), who formulated a sufficient condition under which constraint (2) holds. The advantage of this proposal is that the new set of constraints can be applied directly in the EM algorithm, where each covariance matrix Σ_j (j = 1, ..., k) is iteratively updated. As a matter of fact, condition (2) is satisfied whenever

    a ≤ λ_i(Σ_j) ≤ b,    i = 1, ..., q;  j = 1, ..., k,        (3)

where λ_i(Σ_j) is the ith eigenvalue of Σ_j, in non-decreasing order, and a and b are positive numbers such that a/b ≥ c; indeed, for any q × q symmetric and positive definite matrices A, B it holds that

    λ_min(AB^{-1}) ≥ λ_min(A) / λ_max(B),

which implies

    λ_min(Σ_h Σ_j^{-1}) ≥ λ_min(Σ_h) / λ_max(Σ_j) ≥ a/b ≥ c > 0,    1 ≤ h ≠ j ≤ k,        (4)
and thus (3) leads to (2). In this paper we study constraints based on relation (4) that can be implemented directly at each iteration of the EM algorithm for the updating of the covariance matrices; furthermore, we investigate conditions under which this constrained EM algorithm leads to a non-decreasing sequence of likelihood values, as in the usual unconstrained version. Our results follow from an analysis of the role of the eigenvalues and eigenvectors of the covariance matrices Σ_j (j = 1, ..., k) in the sequence of estimates produced by the EM algorithm. Basic preliminary ideas have been summarized in Ingrassia and Rocci (2006). The spectral decomposition of covariance matrices has been considered with different objectives by many authors in multivariate normal mixture approaches to clustering, see e.g. Fraley and Raftery (2002); relationships with other approaches will also be discussed throughout this paper.

The rest of the paper is organized as follows. In Section 2 we investigate the role of the eigenvalues and eigenvectors of the covariance matrices in the EM algorithm; in Section 3 we present some constrained monotone versions of the EM algorithm for multivariate normal mixtures and in Section 4 they are evaluated and compared using numerical studies; in Section 5 we present a geometrical interpretation of the proposed constraints and in Section 6 we discuss some relationships with other approaches; finally, in Section 7 we give some concluding remarks.

2. On the update of the covariance matrices in the EM algorithm

The EM algorithm generates a sequence of estimates {ψ^(m)}_m, where ψ^(0) denotes the initial guess and ψ^(m) ∈ Ψ for m ∈ N, such that the corresponding sequence {L(ψ^(m))}_m is non-decreasing. In what follows, for the sake of brevity, we will indicate with the superscript + the (m+1)th iteration and with the superscript − the previous mth iteration. On the (m+1)th iteration, the E-step computes the quantities

    u_{nj}^+ = π_j^- p(x_n; μ_j^-, Σ_j^-) / ∑_{h=1}^k π_h^- p(x_n; μ_h^-, Σ_h^-),    n = 1, ..., N;  j = 1, ..., k,        (5)
while the M-step requires the global maximization of

    Q^+(ψ) = ∑_{j,n} u_{nj}^+ ln π_j + ∑_{j,n} u_{nj}^+ ln p(x_n; μ_j, Σ_j) = ∑_{j,n} u_{nj}^+ ln π_j + ∑_{j=1}^k q_j^+(μ_j, Σ_j),        (6)

with respect to ψ over the parameter space Ψ, which gives the update ψ^+ = {(π_j^+, μ_j^+, Σ_j^+), j = 1, ..., k}, where q_j^+(μ_j, Σ_j) = ∑_n u_{nj}^+ ln p(x_n; μ_j, Σ_j). In view of the objectives of the present paper, here we focus on the updates of the covariance matrices

    Σ_j^+ ← S_j^+ = (1/u_{•j}^+) ∑_{n=1}^N u_{nj}^+ (x_n − μ_j^+)(x_n − μ_j^+)',    j = 1, ..., k,        (7)

where we set u_{•j}^+ = ∑_{n=1}^N u_{nj}^+, with u_{nj}^+ given by (5), and

    μ_j^+ = ∑_{n=1}^N u_{nj}^+ x_n / u_{•j}^+,    j = 1, ..., k,
according to the maximization of q_j^+(μ_j, Σ_j) with respect to μ_j. By the spectral decomposition theorem we have Σ_j = Γ_j Λ_j Γ_j', and thus we can regard the update of Σ_j as the composition of two substeps, performing the maximization of Q^+ with respect to: (i) the orthonormal matrix Γ_j, whose columns are the standardized eigenvectors of Σ_j, and afterwards with respect to (ii) the diagonal matrix Λ_j of the eigenvalues of Σ_j. In the ordinary EM algorithm these substeps are performed simultaneously by setting Σ_j^+ ← S_j^+. We remark that we do not propose here an ECM-like algorithm: these two substeps are investigated only from a theoretical point of view, in order to distinguish the roles of the eigenvalues and eigenvectors in the update of the covariance matrix; this provides insight for the construction of a monotone EM algorithm satisfying constraints (4).

To begin with, we observe that the maximization of Q^+(ψ) can be split into k independent maximizations of the terms q_j^+(μ_j^+, Σ_j) (j = 1, ..., k). Since

    (x_n − μ_j^+)' Σ_j^{-1} (x_n − μ_j^+) = tr((x_n − μ_j^+)' Σ_j^{-1} (x_n − μ_j^+)) = tr(Σ_j^{-1} (x_n − μ_j^+)(x_n − μ_j^+)'),

we can write

    q_j^+(μ_j^+, Σ_j) = (1/2) ∑_{n=1}^N u_{nj}^+ [−q ln(2π) − ln|Σ_j| − tr(Σ_j^{-1}(x_n − μ_j^+)(x_n − μ_j^+)')]
                      = φ_j^+ − (1/2) u_{•j}^+ ln|Σ_j| − (1/2) ∑_{n=1}^N u_{nj}^+ tr(Σ_j^{-1}(x_n − μ_j^+)(x_n − μ_j^+)')
                      = φ_j^+ − (1/2) u_{•j}^+ [ ln|Σ_j| + tr( Σ_j^{-1} (1/u_{•j}^+) ∑_{n=1}^N u_{nj}^+ (x_n − μ_j^+)(x_n − μ_j^+)' ) ]
                      = φ_j^+ − (1/2) u_{•j}^+ [ln|Σ_j| + tr(Σ_j^{-1} S_j^+)],
where S_j^+ has already been defined in (7) and

    φ_j^+ = −(q/2) ln(2π) ∑_{n=1}^N u_{nj}^+ = −(q/2) ln(2π) u_{•j}^+.

(i) Maximization with respect to Γ_j (j = 1, ..., k): we deduce that the maximization of q_j^+(μ_j^+, Σ_j) with respect to Σ_j amounts to the minimization of

    q̃_j^+(Γ_j, Λ_j) = ln|Σ_j| + tr(Σ_j^{-1} S_j^+).        (8)

To this end, we note that

    tr(Σ_j^{-1} S_j^+) ≥ tr(Λ_j^{-1} L_j^+) = ∑_{i=1}^q λ_{ij}^{-1} l_{ij}^+,    j = 1, ..., k,        (9)

see Theobald (1975), where L_j^+ = diag(l_{1j}^+, ..., l_{qj}^+) is the diagonal matrix of the eigenvalues of S_j^+, in non-decreasing order; in particular, the equality in (9) holds if and only if Σ_j and S_j^+ have the same eigenvectors, ordered consistently with respect to both λ_{1j}, ..., λ_{qj} and l_{1j}^+, ..., l_{qj}^+. It follows that the optimal Γ_j is equal to the matrix having as columns the standardized eigenvectors of S_j^+.

(ii) Maximization with respect to Λ_j (j = 1, ..., k): first, we rewrite (8) as

    q̃_j^+(Γ_j^+, Λ_j) = ln|Λ_j| + tr(Λ_j^{-1} L_j^+) = ∑_{i=1}^q (ln λ_{ij} + λ_{ij}^{-1} l_{ij}^+),

because ln|Λ_j| = ln ∏_i λ_{ij} = ∑_i ln λ_{ij}; then the calculations given in the previous step imply that the maximization of q̃_j^+(Γ_j^+, Λ_j) with respect to Λ_j is equivalent to the minimization of

    q̃_{ij}^+(λ_{ij}) = ln λ_{ij} + λ_{ij}^{-1} l_{ij}^+,    i = 1, ..., q,        (10)

with respect to λ_{1j}, ..., λ_{qj}, and this gives λ_{ij} = l_{ij}^+, i = 1, ..., q.

Finally, we can summarize the two substeps (i) and (ii) as follows:

(i) maximize Q^+(π^+, μ_1^+, ..., μ_k^+, Γ_1, ..., Γ_k, Λ_1^-, ..., Λ_k^-) with respect to Γ_j (j = 1, ..., k) to obtain Γ_j^+ ← G_j^+, where G_j^+ is the matrix with columns equal to the standardized eigenvectors of S_j^+;
(ii) maximize Q^+(π^+, μ_1^+, ..., μ_k^+, Γ_1^+, ..., Γ_k^+, Λ_1, ..., Λ_k) with respect to Λ_j (j = 1, ..., k) to obtain Λ_j^+ ← L_j^+.
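To make the E-step (5) and the covariance update of substeps (i) and (ii) concrete, the following sketch performs one unconstrained EM iteration in NumPy/SciPy. It is our own illustration rather than the authors' code; the function name em_step and the array layout are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One unconstrained EM iteration for a q-variate Gaussian mixture.

    X: (N, q) data; pi: (k,) weights; mu: (k, q) means; Sigma: (k, q, q) covariances.
    The covariance update is written through the spectral decomposition of S_j^+
    to mirror substeps (i) and (ii) of Section 2."""
    N, q = X.shape
    k = len(pi)

    # E-step (5): posterior weights u_{nj}^+
    dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                            for j in range(k)])          # (N, k)
    u = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weights, means and weighted scatter matrices S_j^+ as in (7)
    u_dot = u.sum(axis=0)                                 # u_{.j}^+
    pi_new = u_dot / N
    mu_new = (u.T @ X) / u_dot[:, None]
    Sigma_new = np.empty_like(Sigma)
    for j in range(k):
        Xc = X - mu_new[j]
        S_j = (u[:, j, None] * Xc).T @ Xc / u_dot[j]
        # substep (i): eigenvectors of S_j^+; substep (ii): eigenvalues of S_j^+
        l_j, G_j = np.linalg.eigh(S_j)                    # l_j in non-decreasing order
        Sigma_new[j] = G_j @ np.diag(l_j) @ G_j.T         # equals S_j^+ when unconstrained
    return pi_new, mu_new, Sigma_new
```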
3. Constrained monotone EM algorithms

The reformulation of the update of the covariance matrices Σ_j (j = 1, ..., k) presented above suggests some ideas for the construction of EM algorithms such that the constraints (3), or some other sufficient conditions for (2), are satisfied while the monotonicity is preserved. At each iteration, let us first compute the spectral decomposition of the estimate of the covariance matrix S_j^+, i.e. S_j^+ = G_j^+ L_j^+ G_j^+', and afterwards consider one of the following strategies.

Approach A. The simplest approach concerns the update of the whole covariance matrix:

• if a ≤ λ_i(S_j^+) ≤ b for i = 1, ..., q then set Σ_j^+ ← S_j^+, otherwise set Σ_j^+ ← Σ_j^-;

in other words, the covariance matrix Σ_j is updated if and only if conditions (3) are satisfied.
Approach B. A more refined strategy concerns the update of the diagonal matrix of eigenvalues:

• if a ≤ l_{ij}^+ ≤ b for i = 1, ..., q then set Λ_j^+ ← L_j^+, otherwise set Λ_j^+ ← Λ_j^-; afterwards set Σ_j^+ ← Γ_j^+ Λ_j^+ (Γ_j^+)';

in other words, the matrix of the eigenvalues of Σ_j is updated if and only if conditions (3) are satisfied, while the eigenvectors of Σ_j are always updated.

Approach C. Another approach consists of modifying only the eigenvalues of S_j^+ that do not satisfy constraints (3), that is, of finding an update of Σ_j which maximizes (6) under constraints (3). This leads to

    λ_{ij}^+ = a         if l_{ij}^+ < a,
    λ_{ij}^+ = l_{ij}^+   if a ≤ l_{ij}^+ ≤ b,
    λ_{ij}^+ = b         if l_{ij}^+ > b,

which can be summarized as

    λ_{ij}^+ = min(b, max(a, l_{ij}^+)).        (11)
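The three constrained updates can be sketched as follows; this is our own illustration of Approaches A, B and C (the function name, the approach switch and the array arguments are our choices, not code from the paper).

```python
import numpy as np

def constrain_covariance(S_plus, Sigma_prev, a, b, approach="C"):
    """Constrained covariance update for one component, following
    Approaches A-C of Section 3 (a sketch, not the authors' implementation).

    S_plus     : unconstrained update S_j^+ from (7)
    Sigma_prev : previous estimate Sigma_j^-
    a, b       : eigenvalue bounds of constraint (3), with a/b >= c"""
    l, G = np.linalg.eigh(S_plus)                  # eigenvalues in non-decreasing order
    inside = np.all((l >= a) & (l <= b))
    if approach == "A":                            # update Sigma_j only if (3) holds
        return S_plus if inside else Sigma_prev
    if approach == "B":                            # eigenvectors always updated
        lam = l if inside else np.linalg.eigvalsh(Sigma_prev)
        return G @ np.diag(lam) @ G.T
    # Approach C, eq. (11): clip each eigenvalue into [a, b]
    lam = np.minimum(b, np.maximum(a, l))
    return G @ np.diag(lam) @ G.T
```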
It is important to note that in all three cases the resulting EM algorithm is monotone, once the initial guess Σ_j^(0) satisfies the constraints. However, only in the third case is the maximization of the complete loglikelihood guaranteed. The above recipes obviously require some a priori information on the covariance structure of the mixture, through the bounds a and b. A weaker constraint could be imposed directly on the ratio a/b in (4), and to this end we introduce a suitable parameterization for the covariance matrices of the mixture components. Let us rewrite Σ_j = σ² Δ_j (j = 1, ..., k), where the matrices Δ_j are such that

    min_{ij} λ_i(Δ_j) = 1,        (12)

and we impose the constraints

    1 ≤ λ_i(Δ_j) ≤ 1/c,        (13)

for i = 1, ..., q and j = 1, ..., k. These new constraints (13) are weaker than (3); in fact, if (3) are satisfied and we set

    σ² = min_{ij} λ_i(Σ_j)    and    Δ_j = Σ_j / σ²,

then, noting that λ_i(Δ_j) = σ^{-2} λ_i(Σ_j), we obtain

    1 ≤ λ_i(Δ_j) ≤ b/a ≤ 1/c    with min_{ij} λ_i(Δ_j) = 1.

However, such constraints are stronger than (2). In fact, if constraints (13) are satisfied then

    λ_min(Σ_h Σ_j^{-1}) ≥ λ_min(Σ_h) / λ_max(Σ_j) = λ_min(Δ_h) / λ_max(Δ_j) ≥ 1 / (1/c) = c,    1 ≤ h ≠ j ≤ k.
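As an illustration, the reparameterization Σ_j = σ²Δ_j satisfying (12) can be obtained from a set of covariance matrices as follows (a minimal sketch, not taken from the paper; the helper name is ours).

```python
import numpy as np

def to_sigma2_delta(Sigmas):
    """Rewrite Sigma_j = sigma^2 * Delta_j so that the smallest eigenvalue over all
    the Delta_j equals 1, as required by (12)."""
    sigma2 = min(np.linalg.eigvalsh(S)[0] for S in Sigmas)   # overall smallest eigenvalue
    Deltas = [S / sigma2 for S in Sigmas]
    return sigma2, Deltas
```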
In order to implement this new set of constraints in the EM algorithm, in this case too only the eigenvalue update step needs to be reconsidered. Let Λ̃_j denote the diagonal matrix of the eigenvalues of Δ_j; then the update of Σ_j = σ² Γ_j Λ̃_j Γ_j' must maximize the complete loglikelihood, and this amounts to maximizing the function

    ∑_{j=1}^k q_j^+(μ_j^+, Σ_j) = ∑_{j=1}^k φ_j^+ − (1/2) ∑_{j=1}^k u_{•j}^+ [ln|Σ_j| + tr(Σ_j^{-1} S_j^+)]
                                = φ_•^+ − (1/2) ∑_{j=1}^k u_{•j}^+ [ln|σ² Λ̃_j| + tr(σ^{-2} Γ_j Λ̃_j^{-1} Γ_j' S_j^+)],        (14)

where φ_•^+ = ∑_{j=1}^k φ_j^+.
On the basis of the results obtained in the previous sections, it can easily be shown that (14) attains a maximum, with respect to, respectively, Γ_j, Λ̃_j and σ², when

    Γ_j = G_j^+,    λ̃_{ij} = min{1/c, max(1, l_{ij}^+ / σ²)},    σ² = (1/(Nq)) ∑_{j=1}^k u_{•j}^+ tr(Δ_j^{-1} S_j^+).

Finally, we have

    λ_i(Σ_j) = σ²        if l_{ij}^+ < σ²,
    λ_i(Σ_j) = l_{ij}^+   if σ² ≤ l_{ij}^+ ≤ σ²/c,
    λ_i(Σ_j) = σ²/c      if l_{ij}^+ > σ²/c,
and we can summarize this fourth strategy as follows.

Approach D.

• set λ̃_{ij}^+ ← min{1/c, max(1, l_{ij}^+ / σ²)},
• set (σ²)^+ ← (1/(Nq)) ∑_{j=1}^k u_{•j}^+ ∑_{i=1}^q l_{ij}^+ / λ̃_{ij}^+,

and afterwards set Σ_j^+ ← (σ²)^+ Γ_j^+ Λ̃_j^+ (Γ_j^+)'. Also in this case the monotonicity is guaranteed once the initial guess Σ_j^(0) satisfies the constraints.

Finally, two considerations are in order. First, it should be noted that the proposed algorithm does not necessarily give a solution satisfying constraint (12); in this case, a correct solution can be obtained by setting

    λ_i(Δ_j) ← λ_i(Δ_j) / min_{ij} λ_i(Δ_j),    σ² ← σ² min_{ij} λ_i(Δ_j),

and in this way a new solution is obtained that satisfies the complete set of constraints while giving the same value of the likelihood. The second consideration is that the new algorithm in the M-step does not maximize the complete
likelihood unless the last two update steps are iterated until convergence; to be precise, the proposed algorithm is of ECM type (see e.g. Meng and Rubin, 1993; McLachlan and Krishnan, 1997) rather than of EM type.
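The Approach D update of the eigenvalues and of the common scale, together with the rescaling that restores constraint (12), can be sketched as follows. This is our own reading of the steps above; the helper name and the n_cycles option (which mimics iterating the two steps, in the spirit of the ECM remark) are assumptions.

```python
import numpy as np

def approach_d_update(S_list, u_dot, sigma2, c, n_cycles=1):
    """Approach-D eigenvalue update (a sketch of Section 3, Approach D).

    S_list : list of the k matrices S_j^+ from (7)
    u_dot  : array of the k weights u_{.j}^+
    sigma2 : current value of the common scale sigma^2
    c      : lower bound on the eigenvalue ratio, 0 < c <= 1"""
    N, q = u_dot.sum(), S_list[0].shape[0]
    eig = [np.linalg.eigh(S) for S in S_list]          # (l_j, G_j), l_j ascending
    for _ in range(max(1, n_cycles)):
        # constrained eigenvalues of Delta_j, lying between 1 and 1/c
        lam_tilde = [np.minimum(1.0 / c, np.maximum(1.0, l / sigma2)) for l, _ in eig]
        # update of the common scale sigma^2
        sigma2 = sum(u_dot[j] * np.sum(eig[j][0] / lam_tilde[j])
                     for j in range(len(S_list))) / (N * q)
    # rescale so that min_ij lambda_i(Delta_j) = 1, as required by (12)
    m = min(lt.min() for lt in lam_tilde)
    lam_tilde = [lt / m for lt in lam_tilde]
    sigma2 *= m
    # recompose the constrained covariance matrices
    Sigma_new = [sigma2 * G @ np.diag(lt) @ G.T for (l, G), lt in zip(eig, lam_tilde)]
    return Sigma_new, sigma2
```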
4. Numerical results

In this section we present numerical results in order to evaluate and compare the performances of the different proposed constraints and algorithms. We applied five different constrained algorithms based on the strategies presented in the previous section and compared them with the basic unconstrained algorithm:

0. Unconstrained algorithm (U): ordinary EM where Σ_j^+ ← S_j^+.
1. Constrained algorithm C naive (CCN): constrained EM algorithm C with the eigenvalues constrained in the interval [10^{-5}, 10].
2. Constrained algorithm C refined (CCR): constrained EM algorithm C with the eigenvalues constrained in the interval [0.05, 5], which is strictly contained in the interval of CCN.
3. Constrained algorithm C with external information (CCEI): constrained EM algorithm C with the eigenvalues constrained in the interval [0.01, 1]; in this case we suppose we know exactly the interval where the eigenvalues lie.
4. Constrained algorithm D naive (CDN): constrained EM algorithm D with c = 10^{-5}.
5. Constrained algorithm D with external information (CDEI): constrained EM algorithm D with c = 0.01, i.e. the reciprocal of the true maximum ratio between two eigenvalues.

The sample data have been generated by a mixture of four-variate normal distributions with mean vectors and covariance matrices obtained as follows:

• mean vectors generated independently from a Gaussian distribution with mean 0 and standard deviation equal to 4;
• eigenvalues of the covariance matrices generated independently from a uniform distribution on the interval [0.01, 1];
• eigenvectors of the covariance matrices generated by orthonormalizing matrices generated independently from a standard normal distribution.

Moreover, two different cases were considered:

1. a two-component mixture with mixing weights π1 = [0.3, 0.7]';
2. a three-component mixture with mixing weights π2 = [0.1, 0.3, 0.6]'.

For each set of weights π1 and π2, we generated 400 samples and we aimed at analysing the performance of the proposed recipes in terms of mean and standard deviation values of:

• the loglikelihood L(ψ̂) computed at the final estimate;
• the number of iterations (# iter);
• the unweighted sum of squared differences between the true parameters ψ and the corresponding estimates ψ̂, that is

    Su = ‖π − π̂‖² + ∑_j ‖μ_j − μ̂_j‖² + ∑_j ‖Σ_j − Σ̂_j‖²;

• the weighted sum

    Sw = ‖π − π̂‖²/‖π‖² + ∑_j ‖μ_j − μ̂_j‖²/‖μ_j‖² + ∑_j ‖Σ_j − Σ̂_j‖²/‖Σ_j‖²,

where ‖·‖ denotes the Euclidean norm for both vectors and matrices.

Each run has been started from a randomly chosen set of posterior probabilities {u_nj; n = 1, ..., N; j = 1, ..., k}; the other parameters are computed starting from this set. For algorithm D the starting value of σ² was 0.01. The algorithms were stopped when the increase in the loglikelihood was less than 0.00001; of course, for each of the six algorithms we used the same starting points.
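The simulation design described above can be reproduced, for instance, with the following sketch; this is our own code, not the authors', and the QR-based orthonormalization and the fixed seed are implementation choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mixture_sample(N, weights, q=4, mean_sd=4.0, eig_low=0.01, eig_high=1.0):
    """Generate one sample from a q-variate normal mixture as in Section 4."""
    k = len(weights)
    mus = rng.normal(0.0, mean_sd, size=(k, q))             # means ~ N(0, 4^2)
    Sigmas = []
    for _ in range(k):
        lam = rng.uniform(eig_low, eig_high, size=q)         # eigenvalues ~ U(0.01, 1)
        G, _ = np.linalg.qr(rng.standard_normal((q, q)))     # random orthonormal eigenvectors
        Sigmas.append(G @ np.diag(lam) @ G.T)
    comp = rng.choice(k, size=N, p=weights)                  # component labels
    X = np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in comp])
    return X, mus, Sigmas

X, mus, Sigmas = simulate_mixture_sample(40, [0.3, 0.7])
```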
Table 1
Mean and standard deviation values of the loglikelihood, number of iterations and sum of squared errors of estimation, computed over 400 samples of size 40 under the two sets of weights π1 and π2

Weights [0.3, 0.7]^a            U        CCN       CCR       CCEI      CDN       CDEI
L(ψ̂)     mean              −33.03    −18.69    −16.49    −12.76    −33.02    −13.99
          s.d.               41.38     32.32     29.41     30.53     41.36     31.23
# iter    mean               25.35     14.11      8.93      4.75     37.37     76.25
          s.d.               19.28     13.73      8.35      1.55     50.05     41.39
Su        mean              693.31     16.93      3.30      1.01    690.47      7.47
          s.d.             1758.00     38.00     84.00      0.60   1756.10     30.20
Sw        mean              673.75     15.74      2.31      0.80    671.66      5.38
          s.d.             1920.20     40.70      4.90      1.10   1918.90     19.80

Weights [0.1, 0.3, 0.6]^b
L(ψ̂)     mean              −38.62    −26.46    −30.42    −48.36    −40.64    −25.39
          s.d.               35.89     30.64     28.64     78.89     33.01     29.72
# iter    mean               27.23     21.10     19.11     11.16    117.91     84.60
          s.d.               14.91     13.07     18.58     11.42     92.67     47.87
Su        mean             1467.00    107.90     53.20     43.10   1250.60     63.50
          s.d.             2548.80     99.50     83.40     81.20   2183.10    110.30
Sw        mean             1664.10     74.10     14.30      3.20   1468.60     32.30
          s.d.             4011.60    111.50     24.80      3.20   3821.70     65.50

^a Two samples have been excluded because U failed.
^b 169 samples have been excluded because U failed.
A first group of simulations concerned samples with N = 40; the results are summarized in Table 1 using mean values and standard deviations. In some cases the unconstrained algorithm failed due to singularities, and those samples have been excluded from the comparisons; however, the results for the constrained versions are representative of the whole data set. It is interesting to note that Approach C works well even if the constraints are not well specified (see CCR and CCN). Approach D leads to algorithms which converge quite slowly; in particular, the naive version (CDN) outperforms the unconstrained algorithm only in terms of Su and Sw, while in the CDEI version Su, Sw and the average number of iterations decrease considerably, even if they still exhibit larger values than those of CCR and CCEI.

A second group of simulations has been carried out along the same lines, but in this case we worked with data samples of size N = 200; the results are summarized in Table 2. In the case of the two-component mixture, the six algorithms have similar performances; only the CCEI algorithm seems to behave slightly better than the other ones. In the case of the three-component mixture there are still four samples where the unconstrained algorithm did not converge, and considerations similar to those of the previous group of simulations still hold in terms of Sw. The ranking of the algorithm performances changes if we consider the Su distance: in this case the CDEI algorithm seems to be the best, while the three algorithms based on Approach C do not show significant differences.

In order to evaluate whether the algorithms have a different sensitivity to local maxima, we repeated the first simulation study on the same 400 samples by using 10 random starting points for each algorithm. The results are shown in Table 3. We can see that all the algorithms suffer from the local maxima problem, because the mean values of the loglikelihood increase while the mean values of Su and Sw decrease. This is particularly true for Approach D: in fact, CDEI now has performances quite similar to CCR.

We conclude that when we know where the eigenvalues are located, the best approach is C; when such information is not available, but we know what the maximum ratio between two eigenvalues should be, it is better to consider Approach D, even if it is more expensive from a computational point of view. Finally, we remark that for each run the distances Su and Sw have been computed by taking the smallest value over all possible permutations of the components of ψ and ψ̂.
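For completeness, a possible implementation of the criteria Su and Sw with the matching over component permutations mentioned above is sketched here (our own code; array shapes and names are assumptions).

```python
import numpy as np
from itertools import permutations

def su_sw(pi, mu, Sigma, pi_hat, mu_hat, Sigma_hat):
    """Unweighted (Su) and weighted (Sw) sums of squared estimation errors,
    each minimized over all permutations of the component labels.
    pi: (k,), mu: (k, q), Sigma: (k, q, q); the hatted arrays are the estimates."""
    k = len(pi)
    su_vals, sw_vals = [], []
    for perm in permutations(range(k)):
        p = list(perm)
        su = (np.sum((pi - pi_hat[p]) ** 2)
              + sum(np.sum((mu[j] - mu_hat[p[j]]) ** 2) for j in range(k))
              + sum(np.sum((Sigma[j] - Sigma_hat[p[j]]) ** 2) for j in range(k)))
        sw = (np.sum((pi - pi_hat[p]) ** 2) / np.sum(pi ** 2)
              + sum(np.sum((mu[j] - mu_hat[p[j]]) ** 2) / np.sum(mu[j] ** 2) for j in range(k))
              + sum(np.sum((Sigma[j] - Sigma_hat[p[j]]) ** 2) / np.sum(Sigma[j] ** 2) for j in range(k)))
        su_vals.append(su)
        sw_vals.append(sw)
    return min(su_vals), min(sw_vals)
```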
Table 2
Mean and standard deviation values of the loglikelihood, number of iterations and sum of squared errors of estimation, computed over 400 samples of size 200 under the two sets of weights π1 and π2

Weights [0.3, 0.7]              U        CCN       CCR       CCEI      CDN       CDEI
L(ψ̂)     mean             −131.93   −131.93   −136.56   −132.03   −131.93   −131.93
          s.d.              143.16    143.16    137.60    143.20    143.16    143.15
# iter    mean               15.95     10.80      8.06      5.13     16.71     71.53
          s.d.               10.56      9.51      8.12      3.18     12.51     38.43
Su        mean                0.68      0.68      0.69      0.61      0.68      0.69
          s.d.                0.92      0.91      0.91      0.60      0.92      0.86
Sw        mean                0.74      0.74      0.75      0.66      0.74      0.74
          s.d.                1.57      1.57      1.56      1.34      1.57      1.52

Weights [0.1, 0.3, 0.6]^a
L(ψ̂)     mean             −194.03   −206.18   −229.58   −382.60   −193.69   −191.63
          s.d.              126.84    132.98    147.82    428.72    126.63    124.13
# iter    mean               25.74     29.52     30.52     26.88     64.76    165.91
          s.d.               26.62     44.53     40.75     53.09    175.57    111.23
Su        mean               99.82     45.06     46.89     45.17     92.66     30.14
          s.d.              700.29    106.05     96.14     85.29    685.08    115.37
Sw        mean               73.67     13.58      6.21      2.44     68.92     16.39
          s.d.              532.78     33.57     10.90      3.39    523.36     73.10

^a Four samples have been excluded because U did not converge.
Table 3
Mean and standard deviation values of the loglikelihood, number of iterations and sum of squared errors of estimation, computed over 400 samples of size 40 under the two sets of weights π1 and π2, by using 10 random starting points

Weights [0.3, 0.7]              U        CCN       CCR       CCEI      CDN       CDEI
L(ψ̂)     mean              −14.20    −12.32    −14.05    −12.52    −14.20    −12.25
          s.d.               32.35     30.57     28.61     30.57     32.35     30.35
# iter    mean               18.05     11.27      7.87      4.73     26.35     76.41
          s.d.               12.74      9.99      7.60      1.67     38.98     45.00
Su        mean               54.82      2.19      1.43      1.04     54.79      1.70
          s.d.              307.87      6.16      1.90      0.61    307.85      5.08
Sw        mean               47.43      1.63      1.25      0.88     47.13      1.52
          s.d.              356.43      4.24      2.38      1.60    356.30      5.79

Weights [0.1, 0.3, 0.6]^a
L(ψ̂)     mean              −10.67     −1.30    −16.41    −12.18     −5.07    −11.68
          s.d.               40.47     29.32     24.18     26.13     31.68     26.10
# iter    mean               22.30     14.19     12.74      7.55    190.59    107.35
          s.d.               13.23     10.02     12.01      8.91    104.71     43.49
Su        mean              961.03     33.27     11.50      6.72    175.76      9.57
          s.d.             1687.60     52.70     28.90     25.00    729.50     27.10
Sw        mean              881.18     23.85      6.09      2.08    148.11      4.54
          s.d.             1692.50     67.70     17.60      2.20    750.90     17.60

^a 19 samples have been excluded because U did not converge.
5. Geometrical interpretation of the constraints

Constraint (2) on the smallest eigenvalue of Σ_h Σ_j^{-1} (1 ≤ h ≠ j ≤ k) leads to a likelihood function with no singularities and with a smaller number of local maxima than the unconstrained likelihood function, see Hathaway (1985); however, the reformulation (3) has relationships with other aspects to be considered when modelling data by multivariate normal mixtures. Banfield and Raftery (1993) proposed a general approach to geometric cross-cluster constraints in
multivariate normal mixtures by rewriting the covariance matrix according to the spectral decomposition as

    Σ_j = Γ_j Λ_j Γ_j' = λ*_j Γ_j Λ*_j Γ_j',    j = 1, ..., k,        (15)

where Λ_j = diag(λ_{1j}, ..., λ_{qj}) is the diagonal matrix of the eigenvalues of Σ_j in non-increasing order, and Γ_j is the orthonormal matrix whose columns are the standardized eigenvectors of Σ_j; moreover, λ*_j = λ_{1j} and thus Λ*_j = diag(1, ..., λ_{qj}/λ_{1j}). The quantities λ*_j, Λ*_j and Γ_j can be treated as an independent set of parameters, and imposing constraints on such parameters amounts to imposing constraints on certain geometric features (volume, orientation and shape) of the clusters. Indeed, the eigenvalues determine the volume and the shape of the jth cluster, while the eigenvectors determine its orientation, so that the parametrization (15) allows us to set different clustering criteria, from the simplest one (spherical clusters with equal volumes) to the most complex one (unknown and different volumes, orientations and shapes for all clusters). Thus, when some but not all of the quantities λ*_j, Γ_j and Λ*_j vary among clusters we obtain parsimonious models, see Celeux and Govaert (1995) for further details.

From a geometrical point of view, the two steps (i) and (ii) of the update of each covariance matrix given in Section 2 can be interpreted as follows:

(i) set the directions of the principal axes of the ellipsoids of equal concentration for the jth component according to the eigenvectors of S_j^+;
(ii) set the variances along these directions equal to the eigenvalues of S_j^+.

Note that these substeps concern different geometrical aspects of the ellipsoids of equal concentration, namely the main directions and the shape, but only the latter must be controlled in order to prevent degenerate cases, i.e. when the length of (at least) one of the axes of the ellipsoid of equal concentration is much smaller (or larger) than the other ones; this happens when some eigenvalue has a much smaller (or larger) value than the other ones. The behaviour of constraint (3) is illustrated in Fig. 1 in the bivariate case.

Fig. 1. Unconstrained ellipsoids (dotted lines) and constrained ellipsoids (solid lines): how the constraints a on the smallest eigenvalue (left) and b on the largest eigenvalue (right) work.

As far as constraint (13) is concerned, it can be interpreted as a lower bound on the ratio between the length of an axis of the ellipsoid of constant density of the hth component and the length of an axis of the ellipsoid of constant density of the jth component (1 ≤ h ≠ j ≤ k), and its geometrical behaviour is similar to the previous one.
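As a small illustration of parametrization (15), the following sketch extracts the scale, shape and orientation factors of a covariance matrix (our own code; the variable names are ours).

```python
import numpy as np

def volume_shape_orientation(Sigma):
    """Decompose Sigma as in (15): Sigma = lambda* Gamma Lambda* Gamma',
    separating the scale factor lambda*, the shape matrix Lambda* and the
    orientation Gamma."""
    lam, Gamma = np.linalg.eigh(Sigma)
    lam, Gamma = lam[::-1], Gamma[:, ::-1]        # non-increasing order, as in (15)
    lam_star = lam[0]                             # largest eigenvalue: scale factor
    Lambda_star = np.diag(lam / lam_star)         # diag(1, lambda_2/lambda_1, ...): shape
    return lam_star, Lambda_star, Gamma
```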
6. Relationships with other approaches

We have already mentioned that several other methods have been proposed to avoid the singularities in the likelihood function, and hence the numerical failure of the optimization procedures. In this context we can distinguish three main approaches:

1. to control the condition number;
2. to control the determinant;
3. to add a penalty to the likelihood.

We start by examining the relation between our approach and the condition number for matrix inversion. Constraint (4) also protects the covariance matrices Σ_j (j = 1, ..., k) from ill conditioning. Indeed, particular configurations of the sample data may cause the estimate of the covariance matrix at step m, say Σ_j^(m), to be close to singularity, so that its determinant is close to zero. This fact may also affect the numerical computation of the corresponding inverse matrix (Σ_j^(m))^{-1}, which is subject to round-off error and depends on the accuracy with which the calculations are performed, see e.g. Burden and Faires (1985). The numerical behaviour of the inverse matrix calculation is measured by the condition number. For a given square nonsingular matrix A, the condition number for matrix inversion with respect to the matrix norm ‖·‖ is defined as
    κ(A) := ‖A^{-1}‖ ‖A‖,

and it is related to the estimation of the error made in computing the inverse of a matrix. Simple algebra shows that κ(A) ≥ 1 for any matrix norm. The matrix A is called ill conditioned or poorly conditioned (with respect to the matrix norm ‖·‖) if κ(A) is large; it is called well conditioned if κ(A) is small (near 1), and it is said to be perfectly conditioned if κ(A) = 1. We remark that the condition number κ(A) for inversion depends on the matrix norm used, but all the condition numbers are equivalent in the sense that, if κ_α(A) = ‖A^{-1}‖_α ‖A‖_α and κ_β(A) = ‖A^{-1}‖_β ‖A‖_β, then there exist two finite positive constants C_m and C_M such that C_m κ_α(A) ≤ κ_β(A) ≤ C_M κ_α(A) for any q × q matrix, see e.g. Horn and Johnson (1999). If A is positive definite, then in general it can be proved that

    κ(A) ≥ λ_max(A) / λ_min(A),

where the equality holds when the condition number is considered with respect to the spectral norm; some authors define the condition number just in this case:

    κ_2(A) := λ_max(A) / λ_min(A),

see e.g. Axelsson (1996). This means that imposing constraints (3) amounts to imposing an upper bound on the condition number of the covariance matrices Σ_j (j = 1, ..., k):

    λ_min(Σ_j) / λ_max(Σ_j) = 1 / κ_2(Σ_j) ≥ a/b ≥ c > 0,    j = 1, ..., k.
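A small numerical check of this relationship (our own sketch, not from the paper):

```python
import numpy as np

def kappa2(A):
    """Spectral condition number of a symmetric positive definite matrix:
    the ratio of its largest to its smallest eigenvalue."""
    lam = np.linalg.eigvalsh(A)
    return lam[-1] / lam[0]

# After clipping the eigenvalues into [a, b] as in (11), kappa2(Sigma_j) <= b / a <= 1 / c.
```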
The same is true also for constraints (13). Moreover, we remark that the converse does not necessarily hold, i.e., imposing an upper bound on the condition number of the covariance matrices does not necessarily correspond to imposing an upper bound on the ratio between two eigenvalues belonging to different covariance matrices.

It is interesting to note that constraints (3) imply the lower and upper bounds on the determinants of the covariance matrices proposed by Biernacki (2004b). In fact, from (3) it follows that

    a^q ≤ |Σ_j| = ∏_{i=1}^q λ_i(Σ_j) ≤ b^q.
Finally, we compare our approach with the penalized one proposed by Snoussi and Mohammad-Djafari (2001). The penalized likelihood is obtained by imposing an inverse Wishart prior on the covariance matrices of the form

    p(Σ_j; α, β, W_j) = (d / |W_j|^α) exp(−β tr(Σ_j^{-1} W_j)),

where d is a normalization constant, α and β are two strictly positive constants which contain a priori information about the power level (scale parameter), and W_j is a positive definite symmetric matrix which contains a priori information on the covariance structure. These priors imply the following additive penalty term on the loglikelihood:

    K = N ln(d) − ∑_{j=1}^k [α ln|W_j| + β tr(Σ_j^{-1} W_j)] = Φ − β ∑_{j=1}^k tr(Σ_j^{-1} W_j),

where Φ = N ln(d) − α ∑_{j=1}^k ln|W_j|. It is interesting to note that constraint (3) implies an upper bound for this penalty; in fact,

    K ≤ Φ − β ∑_{j=1}^k ∑_{i=1}^q λ_{ij}^{-1} λ_i(W_j) ≤ Φ − β b^{-1} ∑_{j=1}^k ∑_{i=1}^q λ_i(W_j) = Φ − β b^{-1} ∑_{j=1}^k tr(W_j).
7. Concluding remarks

In this paper we have given theoretical results about EM algorithms for mixtures of multivariate normal distributions which preserve the usual monotonicity property while implementing suitable constraints on the eigenvalues of the covariance matrices. These constraints lead to a likelihood function with no singularities and with a smaller number of local maxima than the unconstrained version. We have also shown that the proposed constraints protect the covariance matrices from the ill conditioning that leads to round-off errors in the computation of the density function. For both reasons we recommend always implementing some constrained version of the EM algorithm for normal mixture decomposition.

Such constraints obviously require some a priori information on the covariance structure of the mixture, and here we have proposed two different approaches. The first is based on lower and upper bounds a, b on the eigenvalues of the covariance matrices: it works quite well when suitable external information is available. The second one considers a bound on the ratio of the eigenvalues: it requires weaker a priori information than the previous one, but has given slightly worse results and is more expensive from a computational point of view; another drawback concerns the sensitivity to the choice of the initial value σ_0² of σ², which influences the convergence of the algorithm, that is, the number of iterations. When we have no a priori information, we could impose only a quite small lower bound on the eigenvalues of the covariance matrices in order to avoid numerical failures of the algorithm due to singularities. However, quite recently Biernacki and Chrétien (2003) and Biernacki (2004a) have investigated the behaviour of the EM algorithm near a degenerate solution in the univariate case and proved that there exists a domain of attraction around the singularities and that convergence to these particular solutions is extremely fast. These results could suggest a different approach and provide material for future work in this field.

Acknowledgements

The authors would like to thank the associate editor and the referee for their interesting comments and suggestions which considerably improved an earlier version of the present paper.

References

Axelsson, O., 1996. Iterative Solution Methods. Cambridge University Press, Cambridge.
Banfield, J.D., Raftery, A.E., 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821.
Biernacki, C., 2004a. Degeneracy in the maximum likelihood estimation of univariate Gaussian mixtures for grouped data and behaviour of the EM algorithm. Technical Report, Université de Franche-Comté.
Biernacki, C., 2004b. An asymptotic upper bound of the likelihood to prevent Gaussian mixtures from degenerating. Technical Report, Université de Franche-Comté.
Biernacki, C., Chrétien, S., 2003. Degeneracy in the maximum likelihood estimation of univariate Gaussian mixtures with EM. Statist. Probab. Lett. 61, 373–382.
Burden, R.L., Faires, J.D., 1985. Numerical Analysis. Prindle, Weber & Schmidt, Boston.
Celeux, G., Govaert, G., 1995. Gaussian parsimonious clustering models. Pattern Recognition 28, 781–793.
Ciuperca, G., Ridolfi, A., Idier, J., 2003. Penalized maximum likelihood estimator for normal mixtures. Scand. J. Statist. 30, 45–59.
Fraley, C., Raftery, A.E., 2002. Model-based clustering, discriminant analysis and density estimation. J. Amer. Statist. Assoc. 97, 611–631.
Hathaway, R.J., 1985. A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann. Statist. 13, 795–800.
Horn, R.A., Johnson, C.R., 1999. Matrix Analysis. Cambridge University Press, New York.
Ingrassia, S., 2004. A likelihood-based constrained algorithm for multivariate normal mixture models. Statist. Methods Appl. 13, 151–166.
Ingrassia, S., Rocci, R., 2006. Monotone constrained EM algorithms for multinormal mixture models. In: Zani, S., Cerioli, A., Riani, M., Vichi, M. (Eds.), Data Analysis, Classification and the Forward Search. Springer, Berlin, pp. 111–118.
McLachlan, G.J., Krishnan, T., 1997. The EM Algorithm and Extensions. Wiley, New York.
McLachlan, G.J., Peel, D., 2000. Finite Mixture Models. Wiley, New York.
Meng, X.L., Rubin, D.B., 1993. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80, 267–278.
Snoussi, H., Mohammad-Djafari, A., 2001. Penalized maximum likelihood for multivariate Gaussian mixture. In: Bayesian Inference and Maximum Entropy Methods (MaxEnt). American Institute of Physics, New York, pp. 36–46.
Theobald, C.M., 1975. An inequality with applications to multivariate analysis. Biometrika 62, 461–466.