A note on constrained EM algorithms for mixtures of elliptical distributions

Francesca Greselin (1) and Salvatore Ingrassia (2)

(1) Dipartimento di Metodi Quantitativi per le Scienze Economiche e Aziendali, Università di Milano-Bicocca (Italy), [email protected]
(2) Dipartimento di Economia e Metodi Quantitativi, Università di Catania (Italy), [email protected]

Abstract. The existence of a global maximizer of the likelihood on constrained parameter spaces is proved here, extending a previous result due to Hathaway (1986) to the multivariate setting and to the general case of mixtures of elliptical distributions. Then, focusing on a particular data set which motivates our methodology, a new definition of weak homoscedasticity is introduced. Subsequently, a test for detecting weak homoscedasticity in two-sample data is presented, and a special version of the constrained EM algorithm is introduced in order to deal with weak homoscedasticity in the mixture. Numerical results show that this algorithm yields a better mixture decomposition; at the same time, they show how suitable constraints can considerably improve convergence capabilities and robustness in the estimation of the mixture model.

Key words: Mixture models, robust clustering, EM algorithm, elliptical distributions, weak homoscedasticity.

1 Introduction

Although most of classical multivariate analysis has been concerned with the multivariate normal distribution, an increasing amount of attention has been given to alternative distributional models. One area of applicability of such models is the study of the robustness of multivariate techniques to departures from multivariate normality in the underlying distributions. The difficulties associated with many alternatives are both theoretical and practical. There is, however, a simple class of distributions having features similar to the multivariate normal but exhibiting either longer or shorter tails than the normal. Such a class forms an ideal basis for robustness studies, and has hence attracted increasing attention. Elliptical distributions are a broad family of location-scale densities, characterized by elliptical contours of equal density. Many properties of such distributions have been obtained by Kelker (1970).


The rest of the paper is organized as follows. In Section 2 the necessary notation is given, introducing first the multivariate elliptical density, then elliptical mixtures and their likelihood. In Section 3, after recalling some known theoretical results on constrained maximization of the mixture likelihood in the univariate setting, a result due to Hathaway (1986) is extended to the multivariate case and to elliptical distributions. Section 4 motivates and provides our main contribution, that is, the definition of weak homoscedasticity; a statistical test for the hypothesis of weak homoscedasticity in two-sample data is also provided. In Section 5 the test is first applied to the crab data set; then numerical results obtained by applying a weakly constrained EM algorithm to the same data are shown and discussed, and compared with earlier results in the literature. Section 6 gives some concluding remarks and further developments.

2 Preliminaries and notation

A q-dimensional random vector X is said to have a multivariate elliptical distribution with location parameter µ and positive definite inner product matrix Σ if its joint density is given by

    p(x; µ, Σ) = η_q |Σ|^{−1/2} g{(x − µ)′ Σ^{−1} (x − µ)}    (1)

where g is a strictly positive, continuous function on R, symmetric about 0 and monotonically decreasing on [0, ∞), and η_q is a constant depending on the dimension q of the Euclidean space (for more details see, for example, Fang and Anderson (1990)). The multinormal density is obtained in the special case η_q = (2π)^{−q/2} and g(u) = exp(−u/2). The multivariate t with ν degrees of freedom,

    p(x; µ, Σ, ν) = Γ((ν + q)/2) |Σ|^{−1/2} / [ (πν)^{q/2} Γ(ν/2) {1 + (x − µ)′ Σ^{−1} (x − µ)/ν}^{(ν+q)/2} ],    (2)

is obtained when η_q is the constant of integration and g is the negative (ν + q)/2 power function. Analogously, it is easy to show that the multivariate Cauchy, the multivariate exponential and the symmetric stable distributions are elliptical.

More specifically, this work deals with ML estimation of the vector γ of the parameters of a k-component mixture of multivariate elliptical distributions, given by

    f(x; γ) = ∑_{j=1}^{k} α_j p(x; µ_j, Σ_j)    (3)

where γ = (α_1, ..., α_k, µ_1, ..., µ_k, Σ_1, ..., Σ_k), and Γ is the parameter space

    Γ = {γ ∈ R^{k[1+q+(q²+q)/2]} : α_1 + ... + α_k = 1, α_j ≥ 0, |Σ_j| > 0, for j = 1, ..., k}.

Further, let L(γ) be the log-likelihood function of γ, given a sample X = {x_1, ..., x_n} of n i.i.d. observations with law (3). Hence

    L(γ) = ∑_{i=1}^{n} ln[ ∑_{j=1}^{k} α_j η_q |Σ_j|^{−1/2} g{(x_i − µ_j)′ Σ_j^{−1} (x_i − µ_j)} ].
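To fix ideas, the following sketch evaluates L(γ) for the multivariate t case of (1)–(3). It is only an illustrative Python implementation (assuming NumPy and SciPy are available); the function names and the toy parameter values are ours, not part of the original formulation.

import numpy as np
from scipy.special import gammaln

def log_t_density(x, mu, Sigma, nu):
    """Log of the multivariate t density (2), evaluated at each row of x."""
    q = len(mu)
    diff = x - mu                                   # (n, q)
    L = np.linalg.cholesky(Sigma)
    sol = np.linalg.solve(L, diff.T)                # (q, n)
    maha = np.sum(sol**2, axis=0)                   # (x - mu)' Sigma^{-1} (x - mu)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return (gammaln((nu + q) / 2) - gammaln(nu / 2)
            - 0.5 * (q * np.log(np.pi * nu) + logdet)
            - 0.5 * (nu + q) * np.log1p(maha / nu))

def mixture_loglik(x, alphas, mus, Sigmas, nus):
    """Log-likelihood L(gamma) of a k-component t mixture, summed over observations."""
    # log-sum-exp over components for numerical stability
    comp = np.column_stack([np.log(a) + log_t_density(x, m, S, v)
                            for a, m, S, v in zip(alphas, mus, Sigmas, nus)])
    m = comp.max(axis=1, keepdims=True)
    return np.sum(m.ravel() + np.log(np.exp(comp - m).sum(axis=1)))

# toy example with k = 2 components in q = 2 dimensions (illustrative values)
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 2))
alphas = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), 2 * np.eye(2)]
nus = [5.0, 5.0]
print(mixture_loglik(x, alphas, mus, Sigmas, nus))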

3 Likelihood maximization for mixtures of elliptical distributions on constrained parameter spaces

The EM algorithm generates a sequence of estimates {γ^(m)}_m, where γ^(0) ∈ Γ denotes the initial guess and γ^(m) ∈ Γ for m ∈ N, such that the sequence {L(γ^(m))}_{m∈N} is non-decreasing. When a fitted component has a very small value of the determinant of its covariance matrix, relatively large local maxima of the likelihood can occur. Such a component corresponds to a cluster containing few data points, either relatively close together or almost lying in a lower-dimensional subspace. The EM algorithm may converge to such a spurious maximizer, or even to a singularity of L(γ) whenever the determinant of a covariance matrix is null. In order to avoid singularities and to reduce the number of spurious maximizers, Hathaway (1986) suggested a constrained reformulation of the problem. He required, in the univariate setting, that min_{i≠j} σ_i/σ_j ≥ c > 0. Extending this criterion to the multivariate case, Hathaway proposed

    min_{1 ≤ h ≠ j ≤ k} λ(Σ_h Σ_j^{−1}) ≥ c > 0    (4)

where λ denotes a generic eigenvalue of the product Σ_h Σ_j^{−1}. Hathaway (1986) proved that if the sample {x_1, ..., x_n} contains at least (k + q) distinct points and the choice of c ∈ (0, 1] does not exclude the true value of the parameter, then (4) leads to a constrained global maximum likelihood formulation, as the assumptions for consistency required in Kiefer and Wolfowitz (1956) hold. However, Hathaway's constraint is not directly implementable in the EM algorithm. In order to overcome this issue, Ingrassia (2004) devised a condition on the eigenvalues of each covariance matrix (rather than on the eigenvalues of the product of two covariance matrices), showing that this condition implies Hathaway's constraint. Furthermore, such constraints are easily included in the EM code for the iterative update of the covariance matrices, and numerical results show the better performance of the constrained algorithm. Therefore, let Γ_c be the set of parameter values in Γ satisfying min_{i,j} λ_ij ≥ c > 0, where λ_ij is the i-th eigenvalue of the covariance matrix Σ_j of the j-th component of the elliptical mixture. The following proposition generalizes an analogous result in Hathaway (1986), here extended to the multivariate case and to elliptical distributions.
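As a minimal sketch of how such an eigenvalue bound can be enforced within the EM iterations, the covariance matrix produced by the usual M-step can be projected back onto the constrained set by clipping its spectrum. The Python function below follows this spirit only; the function name, clipping strategy and example values are ours, while the exact update rules are those given in Ingrassia (2004) and Greselin and Ingrassia (2008).

import numpy as np

def constrain_covariance(Sigma, c, c_upper=None):
    """Clip the eigenvalues of a symmetric covariance matrix to [c, c_upper],
    keeping its eigenvectors, so that min_i lambda_i >= c."""
    eigval, eigvec = np.linalg.eigh(Sigma)
    eigval = np.maximum(eigval, c)                  # lower bound lambda_i >= c
    if c_upper is not None:
        eigval = np.minimum(eigval, c_upper)        # optional upper bound lambda_i <= c'
    return (eigvec * eigval) @ eigvec.T             # reassemble Gamma Lambda Gamma'

# example: a nearly singular covariance matrix has its smallest eigenvalue lifted to c = 0.1
S = np.array([[1.0, 0.9, 0.0],
              [0.9, 0.82, 0.0],
              [0.0, 0.0, 2.0]])
print(np.linalg.eigvalsh(constrain_covariance(S, c=0.1)))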


Theorem 1. Let {x_1, ..., x_n} be a set of observations containing at least (k + q) distinct points. Then for c ∈ (0, 1] there exists a constrained global maximizer of L(γ) over Γ_c.

Proof. Maximizing L(γ) amounts to jointly maximizing |Σ_j|^{−1/2} and minimizing the argument of g, i.e. (x_i − µ_j)′ Σ_j^{−1} (x_i − µ_j), for each j = 1, ..., k. Hence, we first show that, for a given |Σ_j|^{−1/2}, µ_j has to lie in a compact subset of R^q. Let C(X) be the convex hull of X, i.e. the intersection of all convex sets containing the n points, given by

    C(X) = { ∑_{i=1}^{n} u_i x_i | ∑_{i=1}^{n} u_i = 1, u_i ≥ 0 }.    (5)

Suppose now that γ̄ ∈ Γ_c satisfies µ_j ∉ C(X). Then L(γ̄) ≤ L(γ*), where γ* ∈ Γ_c is obtained from γ̄ by changing the j-th mean component to µ_j* = hµ_j, with h ∈ (0, 1) chosen so that µ_j* ∈ C(X). Besides, if {γ^(m)} ⊂ Γ_c is a sequence satisfying

    lim_{m→∞} |Σ_j^(m)| = 0   or   lim_{m→∞} |Σ_j^(m)| = ∞,    j = 1, ..., k,

then lim_{m→∞} L(γ^(m)) = −∞. Hence, choosing b sufficiently small and d sufficiently large (with b, d ∈ R), we are sure that the constraints

    0 < b ≤ |Σ_j| ≤ d < +∞,    j = 1, ..., k,

do not discard any global maximum of L(γ). By hypothesis, the mixture has at least one non-degenerate component: q + 1 distinct observations are required for it, while the remaining data can give rise to at most k − 1 degenerate components. It follows that

    ∏_{i=1}^{n} f(x_i; γ) ≠ 0    for all γ ∈ Γ_c,

and hence L(γ) ≠ −∞ for all γ ∈ Γ_c. From the above we have that

    sup_{γ∈Γ_c} L(γ) = sup_{γ∈S} L(γ)

where S = {γ ∈ Γ_c | µ_j ∈ C(X); 0 < b ≤ |Σ_j| ≤ d < +∞, j = 1, ..., k}. By the compactness of S and the continuity of L(γ), Weierstrass' theorem guarantees that there exists a parameter γ̂ ∈ Γ_c satisfying

    L(γ̂) = sup_{γ∈Γ_c} L(γ) = sup_{γ∈S} L(γ). ⊓⊔

Indeed, having chosen a value b in the definition of the set S, we can impose 0 < b ≤ |Σ_j| by requiring that the eigenvalues of every covariance matrix have a lower bound, that is λ_i(Σ_j) ≥ c = b^{1/q}. This also amounts to imposing on them an upper bound λ_i(Σ_j) ≤ c′, as in Hennig (2004) for the univariate case, extended to the multivariate case in Greselin and Ingrassia (2008):


Theorem 2. Let L(γ) be the likelihood function of γ, defined on Γ_c, given a sample X = {x_1, ..., x_N} of N i.i.d. observations coming from a mixture (3) of elliptical distributions (1). Then there exists γ > 1, depending on the geometry of the data points X, such that for each γ = (α_1, ..., α_k, µ_1, ..., µ_k, Σ_1, ..., Σ_k) ∈ Γ_c the eigenvalues λ_ij = λ_i(Σ_j) also satisfy the constraints

    λ_ij ≤ γ c    (i = 1, ..., q; j = 1, ..., k),    (6)

where

    γ = [ g(0) / g(δ′δ c^{−1}) ]²    (7)

and δ = diam(X) = max_{l,m=1,...,N} |x_l − x_m|.
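As an illustration of (6)–(7), the sketch below computes the implied upper bound γc for the multivariate t generator g(u) = (1 + u/ν)^{−(ν+q)/2}, under the assumption that δ′δ reduces to δ² for the scalar diameter δ; the data set, the value of c and the degrees of freedom are illustrative choices of ours.

import numpy as np
from scipy.spatial.distance import pdist

def t_generator(u, nu, q):
    """Density generator of the multivariate t: g(u) = (1 + u/nu)^(-(nu+q)/2)."""
    return (1.0 + u / nu) ** (-(nu + q) / 2.0)

def eigenvalue_upper_bound(X, c, nu):
    """Upper bound gamma * c on the component eigenvalues, following (6)-(7)
    with delta = diam(X) and delta'delta taken as delta squared."""
    n, q = X.shape
    delta = pdist(X).max()                          # diam(X): largest pairwise distance
    gamma = (t_generator(0.0, nu, q) / t_generator(delta**2 / c, nu, q)) ** 2
    return gamma * c

# toy data set in q = 2 dimensions, with c = 0.5 and nu = 5 (illustrative values)
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
print(eigenvalue_upper_bound(X, c=0.5, nu=5.0))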

We can hence conclude this section with Hathaway's words: "Adding simple constraints on the component standard deviations results in an optimizationally and statistically well-posed problem; a constrained global maximizer of the likelihood function exists and this maximizer is strongly consistent". However, this result does not indicate how to tune the constant c to the given data set: the issue is whether the constrained parameter space still contains the global maximum of L(γ) or not. The following sections are devoted to some considerations on how to adapt the constraints to the data to be analyzed.

3.1 Crab data set

The above results are illustrated here using the crab data set, which consists of measurements on a sample of 100 rock crabs of the genus Leptograpsus (available at http://www.stats.ox.ac.uk/pub/PRNN/). It has been analyzed in Ripley (1996) and McLachlan and Peel (2000). Each specimen has q = 5 measurements: the width of the frontal lip (FL), the rear width (RW), the length along the midline (CL) and the maximum width (CW) of the carapace, and the body depth (BD), all in mm; the data are grouped into two classes by sex, see Figure 1. In the setting of t mixtures, this data set has been used in Peel and McLachlan (2000) and in Lin et al. (2004). Following such references, here we cluster a sample of 100 units (with n_1 = 50 males and n_2 = 50 females) ignoring the true classification. Usually, based on the results of Hawkins' (1981) simultaneous test of multivariate normality and equal covariance matrices on this data set, the group-conditional distributions are assumed to be normal with common covariance matrix. However, McLachlan and Peel (2000) noted that this assumption has a marked impact on the implied clustering of the data: imposing a homoscedastic model produces a larger misallocation rate than the one obtained when no constraint is imposed. They fitted the crab data set using both normal and t mixtures with equal covariance matrices, concluding that the two models lead to almost the same error rate (19% and 18% respectively).

Fig. 1. Scatterplot matrix of the crab data set (variables FL, RW, CL, CW, BD).

On the contrary, if no constraint on the covariance matrices is imposed, the observed misallocation rate decreases to 11%. This means that the assumption of homoscedasticity in fitting the mixture model leads to a much inferior clustering of the data. This apparent contradiction aroused our curiosity and led us to undertake a deeper evaluation of constraints. We point out that homoscedasticity means that the ellipsoids of equal concentration have: i) the same shape and ii) the same principal axes. However, a graphical inspection of the data (see Figure 1) suggests that for this data set the latter assumption may be too strong. Our belief is that the constraints can usefully be relaxed by requiring only covariance matrices with the same shape.

4 Weak homoscedasticity of covariance matrices

The above results and a statistical analysis of the crab data set suggested introducing the concept of weak homoscedasticity of covariance matrices:

Definition 3. Two covariance matrices Σ_1 and Σ_2 with eigenvalues λ_1^(1) ≤ λ_2^(1) ≤ ··· ≤ λ_q^(1) and λ_1^(2) ≤ λ_2^(2) ≤ ··· ≤ λ_q^(2) respectively, are said to be weakly homoscedastic if they have the same ordered set of eigenvalues, that is λ_h^(1) = λ_h^(2) (h = 1, ..., q).

Hence, weak homoscedasticity can be thought of as an intermediate constraint between heteroscedasticity and homoscedasticity.
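For a concrete illustration of Definition 3, two matrices that share the same ordered eigenvalues but have different eigenvectors are weakly homoscedastic without being equal; the small Python check below (our own sketch, not part of the paper) makes this explicit.

import numpy as np

def weakly_homoscedastic(S1, S2, tol=1e-8):
    """Check whether two covariance matrices share the same ordered eigenvalues."""
    l1 = np.sort(np.linalg.eigvalsh(S1))
    l2 = np.sort(np.linalg.eigvalsh(S2))
    return np.allclose(l1, l2, atol=tol)

# same eigenvalues (1, 3), different principal axes: weakly homoscedastic, not homoscedastic
S1 = np.diag([1.0, 3.0])
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S2 = R @ S1 @ R.T
print(weakly_homoscedastic(S1, S2), np.allclose(S1, S2))   # True False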


In the rest of this section, we consider the problem of testing whether two covariance matrices Σ_1 and Σ_2 are weakly homoscedastic. Let us consider the spectral decomposition of Σ_j (j = 1, 2), i.e. Σ_j = Γ_j Λ_j Γ_j′, where Λ_j is the diagonal matrix of the eigenvalues of Σ_j and Γ_j is an orthogonal matrix whose columns are the standardized eigenvectors; the symbol ′ denotes matrix transposition. Then the usual test for homoscedasticity

    H_0: Σ_1 = Σ_2    versus    H_1: Σ_1 ≠ Σ_2

can be modified into the following weaker formulation, testing only the equality between the ordered sets of eigenvalues of Σ_1 and Σ_2 respectively:

    H_0^Λ: Λ_1 = Λ_2    versus    H_1^Λ: Λ_1 ≠ Λ_2.    (8)

Let x_1^(1), ..., x_{n_1}^(1) be a sample of size n_1 drawn from a multivariate normal distribution with covariance matrix Σ_1; analogously, let x_1^(2), ..., x_{n_2}^(2) be a sample of size n_2 drawn from a multivariate normal distribution with covariance matrix Σ_2. Further, denote by X_1 the n_1 × q data matrix with rows x_1^(1), ..., x_{n_1}^(1); an analogous definition applies to the n_2 × q data matrix X_2. Finally, let S_1 and S_2 be the sample covariance matrices of X_1 and X_2, respectively. According to the principal component transformation

    x_i^(1) → y_i^(1) = G_1′ (x_i^(1) − x̄_1),    i = 1, ..., n_1,    (9)

the data y_1^(1), ..., y_{n_1}^(1) are uncorrelated, with covariance matrix equal to L_1, where L_1 is the diagonal matrix of the eigenvalues of S_1, G_1 is an orthogonal matrix whose columns are the corresponding standardized eigenvectors, and x̄_1 is the mean vector of X_1. Analogous arguments apply to y_1^(2), ..., y_{n_2}^(2). We shall denote by Y_1 and Y_2 the data matrices with rows y_1^(1), ..., y_{n_1}^(1) and y_1^(2), ..., y_{n_2}^(2) respectively:

    Y_1 = (X_1 − 1 x̄_1′) G_1    and    Y_2 = (X_2 − 1 x̄_2′) G_2

where 1 = (1, 1, ..., 1)′ is a vector of n_1 ones in the first expression (and, respectively, of n_2 ones in the second one). The test for weak homoscedasticity (8) can then be written as

    H_0^Λ: {λ_1^(1) = λ_1^(2)} ∩ {λ_2^(1) = λ_2^(2)} ∩ ··· ∩ {λ_q^(1) = λ_q^(2)}
    versus
    H_1^Λ: there exists h ∈ {1, ..., q} such that λ_h^(1) ≠ λ_h^(2).    (10)

Recalling that two uncorrelated Gaussian random variables are also independent, the test (10), under the assumption of multinormality, can be carried out through the q simpler tests

    H_0: λ_h^(1) = λ_h^(2)    versus    H_1: λ_h^(1) ≠ λ_h^(2),    h = 1, ..., q.    (11)

Since the eigenvalues of the covariance matrix Σ_1 (Σ_2) coincide with the variances along the principal axes, the h-th hypothesis in (11) can be tested by means of the well-known F-test on the equality of variances, based on the samples y_h^(1) and y_h^(2) obtained from x_h^(1) and x_h^(2) (h = 1, ..., q) by means of the principal component transformation (9).
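A sketch of this testing procedure (our own Python rendering of the steps above, assuming SciPy's F distribution; all names are illustrative): each sample is centred and rotated onto its own principal axes via (9), and the q componentwise variances are then compared with two-sided F-tests as in (11).

import numpy as np
from scipy import stats

def weak_homoscedasticity_pvalues(X1, X2):
    """Two-sided F-test p-values for the q hypotheses in (11), after the
    principal component transformation (9) applied to each sample."""
    def principal_scores(X):
        Xc = X - X.mean(axis=0)
        S = np.cov(Xc, rowvar=False)
        eigval, eigvec = np.linalg.eigh(S)          # ascending eigenvalues
        return Xc @ eigvec                          # Y = (X - 1 xbar') G
    Y1, Y2 = principal_scores(X1), principal_scores(X2)
    n1, n2 = len(Y1), len(Y2)
    v1 = Y1.var(axis=0, ddof=1)                     # variances along principal axes
    v2 = Y2.var(axis=0, ddof=1)
    F = v1 / v2
    # two-sided p-value of the variance-ratio test, component by component
    p = 2 * np.minimum(stats.f.cdf(F, n1 - 1, n2 - 1),
                       stats.f.sf(F, n1 - 1, n2 - 1))
    return np.minimum(p, 1.0)

# illustrative use with two simulated groups (not the crab data)
rng = np.random.default_rng(2)
X1 = rng.multivariate_normal(np.zeros(3), np.diag([1.0, 2.0, 3.0]), size=50)
X2 = rng.multivariate_normal(np.ones(3),  np.diag([1.0, 2.0, 3.0]), size=50)
print(weak_homoscedasticity_pvalues(X1, X2))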

5 A numerical study

Following the genesis of the notion of weak homoscedasticity, this section aims to illustrate how this methodology applies to the crab data set. To begin with, we tested the weak homoscedasticity of the two classes. We performed q = 5 tests on variance equality, obtaining the p-values

    0.1912   0.2647   0.2862   0.4505   0.2586

which show no statistical evidence against weak homoscedasticity.

The first simulations concerned a set of 100 runs of the EM algorithm based on a mixture of t-distributions without any constraint (for 100 different starting points), and for each run the misclassification error was evaluated. The EM algorithm failed in 6% of the cases, due to singularities of the likelihood function; in the other cases a large variability of the misclassification error rate was observed: it lies in the interval [11%, 47%], the quartiles are Q1 = Q2 = 11%, Q3 = 17% and the 90th percentile is 36%. Afterwards we ran the EM algorithm (using the same starting points considered in the unconstrained version) imposing the same ordered set of eigenvalues in the covariance matrices of the two distributions, and the same number of degrees of freedom. No failure of the algorithm was observed; moreover, the misclassification error rate lies in the interval [11%, 49%], the three quartiles are Q1 = Q2 = 11%, Q3 = 12% and the 90th percentile is 12%. Further, the misclassification rate of 49% was observed just once, and a misclassification error rate greater than or equal to 36% appeared only in six cases out of 100 runs.

With reference to robust classification, some more simulations were performed along the lines of McLachlan and Peel (2000): we inserted outliers in the original data set by adding various values to the second variate of the 25th data point. Table 1 reports the overall misclassification. It may be useful to read Table 1 beginning from the central row, where the initial data set without any perturbation is considered (as the perturbing constant is null). The weakly constrained t mixture outperforms the normal mixture and the unconstrained t mixture, reaching a best error rate of 11%, attained in 95% of the cases. Then, looking through the following rows of the table, the value of the constant rises progressively from 5 to 20 and the error rate for the normal mixture strongly increases, up to the maximum value of 50%. It grows much more slowly for t mixtures (reaching only 20%), while it remains almost unchanged for the weakly constrained t mixtures, steadily assuming the value of 13% (with a very high frequency in all cases, ranging from 86% to 94%). The error rates obtained are almost symmetric for the negative values of the perturbing constant. A last consideration, with reference to Table 1, concerns the degrees of freedom: they decrease as the perturbing effect becomes more relevant, reflecting how they downweight the effect of the outliers on the parameter estimates.

Table 1. Comparison of error rates when fitting normal mixtures, t mixtures and constrained t mixtures to the crab data set with outliers. For the constrained algorithm, the best error rate is given, followed in parentheses by the percentage of runs in which that result was attained.

    Constant   normal mixture   t mixture    constrained t mixture
               Error Rate       Error Rate   Error Rate        ν̂
      −15      49%              19%          13% (87%)         5.76
      −10      49%              19%          13% (86%)         6.13
       −5      21%              20%          12% (88%)         8.57
        0      19%              18%          11% (94%)        24.17
        5      21%              20%          13% (94%)        13.64
       10      50%              20%          13% (95%)         7.76
       15      47%              20%          13% (93%)         6.68
       20      49%              20%          13% (86%)         6.17

Table 2. Range, three quartiles and 90th percentile of the error rate over 100 simulations.

    Mixture Model          Error Rate range   Q1    Q2    Q3    x90
    t, no constraints      11%–47%            11%   11%   17%   36%
    t, weak constraint     11%–48%            11%   11%   12%   12%
    t, strong constraint   16%–50%            18%   50%   50%   50%
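The paper does not spell out the update used to impose a common ordered set of eigenvalues across components. One plausible way to realise the weak constraint inside the M-step, shown below only as a sketch under that assumption, is to replace each component's eigenvalues by their weighted average across components while retaining the component-specific eigenvectors; this is not necessarily the update adopted here.

import numpy as np

def impose_weak_homoscedasticity(Sigmas, weights=None):
    """Replace the ordered eigenvalues of each covariance matrix by a common
    (weighted) average, keeping the component-specific eigenvectors.
    This is one possible projection onto the weakly homoscedastic set,
    not necessarily the update used in the paper."""
    k = len(Sigmas)
    if weights is None:
        weights = np.full(k, 1.0 / k)
    decomps = [np.linalg.eigh(S) for S in Sigmas]             # ascending eigenvalues
    common = sum(w * d[0] for w, d in zip(weights, decomps))  # shared ordered spectrum
    return [(V * common) @ V.T for _, V in decomps]

# example: two 2x2 covariances are projected onto a weakly homoscedastic pair
S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
S2 = np.array([[1.5, -0.2], [-0.2, 0.8]])
T1, T2 = impose_weak_homoscedasticity([S1, S2])
print(np.linalg.eigvalsh(T1), np.linalg.eigvalsh(T2))         # identical spectra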

6 Conclusions

In this work, constrained maximum likelihood estimation for mixtures of multivariate elliptical distributions has been considered. Hathaway's result concerning the existence of a global maximizer of L(γ) over a constrained parameter space, originally stated for univariate normal mixtures, has been proved here in the extended setting of multivariate elliptical distributions.


Then, with reference to a particular data set that motivated this approach, the notion of weak homoscedasticity has been defined. Requiring only that the ellipsoids of equal concentration have the same shape, weak homoscedasticity is an intermediate constraint between heteroscedasticity and homoscedasticity. Numerical results have shown the interest of this approach in applications. Our experiments have pointed out that suitable constraints allow the EM algorithm to perform a better mixture decomposition; at the same time, they show how such constraints can considerably improve convergence capabilities and robustness in the estimation. However, as such constraints need some a priori information which may not be available in advance, some more work is needed. We are currently looking for semi-automatic procedures, based on the analysis of the EM algorithm, able to dynamically modify the constraints from a stronger to a weaker formulation, in order to improve parameter estimation.

References

1. K.T. Fang, T.W. Anderson. Statistical Inference in Elliptically Contoured and Related Distributions. Allerton Press, New York, 1990.
2. F. Greselin, S. Ingrassia. Constrained monotone EM algorithms for mixtures of multivariate t-distributions. Rapporti di Ricerca del Dipartimento di Metodi Quantitativi per le Scienze Economiche ed Aziendali, Università degli Studi di Milano-Bicocca, n. 142, 2008.
3. R.J. Hathaway. A constrained formulation of maximum-likelihood estimation for normal mixture distributions. The Annals of Statistics, 13:795–800, 1986.
4. D.M. Hawkins. A new test for multivariate normality and homoscedasticity. Technometrics, 23:105–110, 1981.
5. C. Hennig. Breakdown points for maximum likelihood estimators of location-scale mixtures. The Annals of Statistics, 32:1313–1340, 2004.
6. S. Ingrassia. A likelihood-based constrained algorithm for multivariate normal mixture models. Statistical Methods & Applications, 13:151–166, 2004.
7. S. Ingrassia, R. Rocci. Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Computational Statistics & Data Analysis, 51:5339–5351, 2007.
8. D. Kelker. Distribution theory of spherical distributions and a location-scale parameter generalization. Sankhyā Ser. A, 32:419–438, 1970.
9. J. Kiefer, J. Wolfowitz. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, 27:887–906, 1956.
10. T.I. Lin, J.C. Lee, W.J. Hsieh. Robust mixture modeling using the skew t distribution. Statistics and Computing, 17:81–92, 2007.
11. G.J. McLachlan, D. Peel. Finite Mixture Models. John Wiley & Sons, New York, 2000.
12. D. Peel, G.J. McLachlan. Robust mixture modelling using the t distribution. Statistics and Computing, 10:339–348, 2000.
13. B.D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 1996.
