ISIT 1998, Cambridge, MA, USA, August 16 - August 21
Towards EM-style Algorithms for Posterior Optimization of Normal Mixtures

Eric Sven Ristad
Mnemonic Technology, Inc. Princeton, NJ 08540 USA
[email protected]
Peter N. Yianilos
NEC Research Institute Princeton, NJ 08540 USA
[email protected]
Abstract -- Expectation maximization (EM) provides a simple and elegant approach to the problem of optimizing the parameters of a normal mixture on an unlabeled dataset. To accomplish this, EM iteratively reweights the elements of the dataset until a locally optimal normal mixture is obtained. This paper explores the intriguing question of whether such an EM-style algorithm exists for the related and apparently more difficult problem of finding a normal mixture that maximizes the posterior class probabilities of a labeled dataset.

We expose a fundamental degeneracy in the relationship between a normal mixture and the a posteriori class probability functions that it induces, and use this degeneracy to prove that reweighting a dataset can almost always give rise to a normal mixture exhibiting any desired class function behavior. This establishes that EM-style approaches are sufficiently expressive for posterior optimization problems and opens the way to the design of new algorithms for them.

I. Technical Summary

Normal mixtures have proven useful in several areas, including pattern recognition [3] and speech recognition, along with vector quantization and many others. The problem of finding a $k$-component normal mixture $M$ that maximizes the likelihood $\prod_i p(s_i \mid M)$ of an unlabeled dataset $s_1, \ldots, s_n$ may be approached using the well-known expectation maximization (EM) algorithm [2, 1]. EM iteratively reweights each sample's membership in each of the $k$ mixture components by the a posteriori probability of each component given the sample. When normal mixtures are applied to pattern classification problems, each mixture component corresponds to a pattern class. Given a class label $\omega(s_i)$ for each element $s_i$ in the dataset, the goal is to maximize the mixture's a posteriori performance $\prod_i p(\omega(s_i) \mid s_i, M)$, i.e. to predict well the correct labels, not to model the observation vectors themselves.
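To make the reweighting step concrete, the following sketch (ours, not the authors' code; it assumes numpy and scipy, and the function and variable names are our own) performs one EM iteration for the unlabeled maximum-likelihood problem: it computes the membership weights gamma[i, j] = p(component i | s_j) and then re-estimates the mixing weights, means, and covariances from them.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(samples, weights, means, covs):
    """One EM reweighting iteration for a k-component normal mixture.

    samples: (n, d) array; weights: (k,); means: (k, d); covs: (k, d, d).
    Returns updated (weights, means, covs).  A minimal sketch only:
    no covariance regularization and no convergence checks.
    """
    n, d = samples.shape
    k = len(weights)

    # gamma[i, j] = p(component i | s_j): the a posteriori membership weight.
    gamma = np.empty((k, n))
    for i in range(k):
        gamma[i] = weights[i] * multivariate_normal.pdf(samples, means[i], covs[i])
    gamma /= gamma.sum(axis=0, keepdims=True)

    # Re-estimate the mixture from the reweighted samples.
    totals = gamma.sum(axis=1)                       # total weight per component
    new_weights = totals / n
    new_means = (gamma @ samples) / totals[:, None]
    new_covs = np.empty_like(covs)
    for i in range(k):
        centered = samples - new_means[i]
        outer = np.einsum('jk,jl->jkl', centered, centered)   # (n, d, d)
        new_covs[i] = (gamma[i, :, None, None] * outer).sum(axis=0) / totals[i]
    return new_weights, new_means, new_covs
```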
We define two normal mixtures to be class-equivalent if they induce identical a posteriori class functions $p(\omega \mid x, M)$, i.e. if they perform identically as probabilistic classifiers. Theorem 1 shows that the relationship between mixtures and their class behavior is highly degenerate and, we suggest, somewhat strange and counterintuitive. As a positive result of this degeneracy, one can search the entire space of class functions without considering all possible mixtures. So to solve the a posteriori maximization problem above, it suffices to find any normal mixture that induces optimal class functions.

Theorem 1. Let $p$ be a $d$-dimensional normal mixture with $k$ components. For any $x \in \mathbb{R}^d$ and $\epsilon > 0$, there exists a $d$-dimensional $k$-component normal mixture $p'$ such that for $1 \le i \le k$:
1. $\|\mu'_i - x\| < \epsilon$
2. $\|\Sigma'_i\| < \epsilon$
3. $p'$ and $p$ are class-equivalent.
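Class equivalence as defined above can be checked numerically on sample points. The sketch below is our own illustration under the same assumptions as before (numpy/scipy, hypothetical names); it compares the induced a posteriori class functions of two mixtures and could serve as a sanity check on a construction such as the one whose existence Theorem 1 asserts.

```python
import numpy as np
from scipy.stats import multivariate_normal

def class_posteriors(x, weights, means, covs):
    """p(omega = i | x, M) for every component i of the mixture M = (weights, means, covs)."""
    joint = np.array([w * multivariate_normal.pdf(x, m, c)
                      for w, m, c in zip(weights, means, covs)])
    return joint / joint.sum()

def class_equivalent(mix_a, mix_b, dim, trials=1000, tol=1e-8, seed=None):
    """Numerically test whether two mixtures induce the same a posteriori class functions."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x = rng.normal(size=dim)                 # random test point
        pa = class_posteriors(x, *mix_a)
        pb = class_posteriors(x, *mix_b)
        if np.max(np.abs(pa - pb)) > tol:
            return False
    return True
```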
The following theorem shows that by reweighting a dataset in a way that differs slightly from that of EM, arbitrary class behavior may be induced.

Theorem 2. Given any $d$-dimensional normal mixture $M$ with $k$ components, and $s_1, \ldots, s_n \in \mathbb{R}^d$ such that $n \ge d(d+3)/2$, with some subset of size $d(d+3)/2$ satisfying a mild orthodoxy condition, there exists a $k \times n$ table of nonnegative values $\{\gamma_{i,j}\}$ and nonnegative values $m_1, \ldots, m_{k-1}$ with $\sum_{j=1}^{k-1} m_j \le 1$, such that the normal mixture $M'$ generated as described below is class-equivalent to $M$.
1. The mixing parameters of $M'$ are $m_1, \ldots, m_{k-1}, 1 - \sum_{j=1}^{k-1} m_j$.
2. Each mean $\mu'_i$ within $M'$ is given by
\[ \mu'_i = \frac{1}{\sum_{j=1}^{n} \gamma_{i,j}} \sum_{j=1}^{n} \gamma_{i,j}\, s_j . \]
3. Each covariance $\Sigma'_i$ within $M'$ is given by
\[ \Sigma'_i = \sum_{j=1}^{n} \gamma_{i,j}\, (s_j - \mu'_i)(s_j - \mu'_i)^t . \]

Our work also shows that reweighting in exactly the fashion of EM fails to be universally expressive in the sense above, that is, when instead
\[ \Sigma'_i = \frac{1}{\sum_{j=1}^{n} \gamma_{i,j}} \sum_{j=1}^{n} \gamma_{i,j}\, (s_j - \mu'_i)(s_j - \mu'_i)^t . \]
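The two update rules differ only in whether the covariance sum is normalized by the total weight $\sum_j \gamma_{i,j}$. The following sketch (again our own illustration with hypothetical names, not code from the paper) generates $M'$ from a reweighting table following Theorem 2's recipe, with a flag that switches to the exact-EM normalized covariance for comparison.

```python
import numpy as np

def mixture_from_reweighting(samples, gamma, m, em_style_cov=False):
    """Generate M' from a k x n reweighting table, following Theorem 2's recipe.

    samples: (n, d); gamma: (k, n) nonnegative; m: (k-1,) nonnegative with sum <= 1.
    With em_style_cov=True the covariances are normalized by the total weight,
    i.e. exactly the EM update, which the paper shows is not universally expressive.
    """
    totals = gamma.sum(axis=1)                       # sum_j gamma[i, j] per component
    weights = np.append(m, 1.0 - m.sum())            # mixing parameters of M'
    means = (gamma @ samples) / totals[:, None]      # weighted sample means
    covs = []
    for i in range(gamma.shape[0]):
        centered = samples - means[i]
        scatter = (gamma[i, :, None] * centered).T @ centered   # unnormalized weighted scatter
        covs.append(scatter / totals[i] if em_style_cov else scatter)
    return weights, means, np.array(covs)
```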
Acknowledgements

The authors thank Leonid Gurvits and Joe Kilian for helpful discussions.

References

[1] L. E. Baum and J. E. Eagon, An inequality with application to statistical estimation for probabilistic functions of a Markov process and to models for ecology, Bull. AMS, 73 (1967), pp. 360-363.

[2] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum-likelihood from incomplete data via the EM algorithm, J. Royal Statistical Society Ser. B (methodological), 39 (1977), pp. 1-38.

[3] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, Inc., 1973.