IEEE International Conference on Systems, Man and Cybernetics, Washington DC, 2003
Non-Parametric Expectation Maximization: A Learning Automata Approach

Wael Abd-Almageed and Christopher E. Smith
Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131
[email protected]

Aly El-Osery
Department of Electrical Engineering, New Mexico Tech, Socorro, NM 87801
[email protected]
Abstract

The well-known Expectation Maximization (EM) technique suffers from two major drawbacks. First, the number of components has to be specified a priori. Second, EM is sensitive to initialization. In this paper, we present a new stochastic technique for estimating the mixture parameters. A Parzen window is used to obtain a discrete estimate of the PDF of the given data. A stochastic learning automaton is then used to select the mixture parameters that minimize the distance between the discrete Parzen estimate and the estimate produced by Expectation Maximization. The validity of the proposed approach is verified using bivariate simulation data.
1. Introduction

The Expectation Maximization (EM) algorithm [1] is the most frequently used technique for estimating class conditional probability density functions (PDFs) in both the univariate and multivariate cases. Expectation Maximization is widely used in computer vision [2], speech processing [3] and pattern recognition [4] applications. In all of these areas, EM is used to model a set of feature vectors by a mixture model that contains a finite number of components. Mixtures of Gaussian distributions are frequently used to model the observation vectors.

The EM algorithm, however, suffers from two major drawbacks. First, the number of components has to be specified a priori in order to estimate the density function. This number has to be chosen carefully in order to strike the right balance between the accuracy of the density estimate on one hand and the complexity of the function, which affects the speed of both training and testing, on the other. Second, because EM is a maximum likelihood approach, the resulting estimates are sensitive to initialization; EM converges only to a local maximum of the likelihood.

Several attempts have been made to overcome these drawbacks. Figueiredo and Jain [5] broadly classify these methods into two categories: deterministic approaches and
stochastic approaches. Deterministic methods, such as [6], [7] and [8], select the number of components according to some model selection criterion, which usually contains an increasing function that penalizes a higher number of components. In [9], [10] and [11], stochastic approaches based on Markov Chain Monte Carlo methods are used. These approaches have the disadvantage of being computationally intensive.

This paper introduces a new stochastic approach for overcoming the drawbacks of the Expectation Maximization algorithm. The proposed approach is based on computing a distance measure between the density function estimated using EM and a discrete density estimated using kernel methods. Stochastic Learning Automata (SLA) [12] are then used to find the mixture parameters that minimize the distance.

The study of learning automata started in the early sixties [13][14]. Learning automata theory provides a framework for the design of automata that interact with a random environment and dynamically learn the action that minimizes the probability of a penalty. Since the sixties, this field has seen vast improvements and developments [15][16]. The main advantage of SLA is that they require no knowledge of the random environment in which the automaton operates, or of the function to be optimized.

This paper is organized as follows. Section 2 introduces the proposed approach for selecting the mixture parameters and the number of components. Experimental results are presented in Section 3. Section 4 concludes the paper and outlines directions for future research.
2. The Proposed Approach

In this section, the proposed approach is described. The SLA is used to find the mixture model that minimizes the Kullback-Leibler [17] distance between the discrete density function, estimated using a Parzen window [18], and the density functions estimated using EM for each candidate number of components in a user-specified range. The following sections outline the building blocks of the proposed approach.
2.1. Expectation Maximization

The main advantage of the EM algorithm is that it provides a closed-form representation of the density function. This makes it possible to compute p(x), where x is the feature vector, directly from the function, which is particularly important in the case of continuous vectors. On the other hand, EM is sensitive to the initialization of the mixture parameters, and the number of mixture components has to be specified a priori. The probability density function is modeled as

$$p_{EM}(x) = \sum_{j=1}^{k} \pi_j \, p(x|j), \qquad (1)$$

where k is the number of components, p(x|j) is the conditional probability density of component j, and the π_j are the mixing weights, with \sum_{i=1}^{k} \pi_i = 1. The conditional component density is given by

$$p(x|j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \right), \qquad (2)$$
where d is the length of the feature vector, and µ_j and Σ_j are the mean and covariance of component j. The parameters of the mixture are iteratively estimated by applying the update Equations 3, 4 and 5, as follows:

$$\mu_j^{i} = \frac{\sum_{\forall x \in X} x \, P^{i-1}(j|x)}{\Gamma_j^{i-1}}, \qquad (3)$$

$$\pi_j^{i} = \frac{\Gamma_j^{i-1}}{N}, \qquad (4)$$

$$\Sigma_j^{i} = \frac{\sum_{\forall x \in X} P^{i-1}(j|x)\,(x - \mu_j^{i})(x - \mu_j^{i})^T}{\Gamma_j^{i-1}}, \qquad (5)$$

where

$$\Gamma_j^{i-1} = \sum_{\forall x \in X} P^{i-1}(j|x), \qquad (6)$$

$$P(j|x) = \frac{\pi_j \, p(x|j)}{p(x)}, \qquad (7)$$

and i is the iteration number.
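For concreteness, a minimal NumPy/SciPy sketch of these updates is given below. It is an illustrative implementation of Equations 1-7, not the authors' code; the random initialization of the means and the small covariance regularizer are assumptions made here for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_fit(X, k, rng, n_iter=100):
    """Illustrative EM for a k-component Gaussian mixture (Eqs. 1-7). X has shape (N, d)."""
    N, d = X.shape
    pi = np.full(k, 1.0 / k)                               # mixing weights
    mu = X[rng.choice(N, size=k, replace=False)]           # means drawn from the data (assumed init)
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: responsibilities P(j|x) = pi_j p(x|j) / p(x)          (Eq. 7)
        comp = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                         for j in range(k)], axis=1)       # shape (N, k)
        resp = comp / comp.sum(axis=1, keepdims=True)
        # M-step: Gamma_j, mu_j, pi_j and Sigma_j                        (Eqs. 3-6)
        gamma = resp.sum(axis=0)                            # Eq. 6
        mu = (resp.T @ X) / gamma[:, None]                  # Eq. 3
        pi = gamma / N                                      # Eq. 4
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = ((resp[:, j, None] * diff).T @ diff) / gamma[j] \
                       + 1e-6 * np.eye(d)                   # Eq. 5 plus an assumed regularizer
    return pi, mu, sigma

def mixture_density(X, pi, mu, sigma):
    """Evaluate the mixture density p_EM(x) of Eq. 1 at the rows of X."""
    return sum(pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j]) for j in range(len(pi)))
```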
2.2. Parzen Window

Parzen [18] provides a non-parametric method for estimating the density function from a finite set of training data X = {x_i : i = 1, ..., N}. The Parzen window provides an accurate estimate of the PDF with no parameters to initialize or specify a priori. However, the method only yields a discrete estimate. In the case of a continuous feature space, the vector must be discretized before the density value is computed, or the estimation has to be performed online, which slows down the system. The Parzen density is estimated as

$$p_{Parzen}(x) = \frac{1}{N} \sum_{i=1}^{N} \Phi(|x - x_i|), \qquad (8)$$

where N is the size of the feature set and Φ is a kernel function defined by

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} x^T x \right). \qquad (9)$$
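The sketch below evaluates this estimate at the training points themselves; the function name and the choice to evaluate only at the sample points are assumptions made for illustration, matching the "discrete estimate" used later.

```python
import numpy as np

def parzen_density(X):
    """Discrete Parzen estimate of the PDF, evaluated at the sample points (Eqs. 8-9).

    X is an (N, d) array of feature vectors; the unit-variance Gaussian kernel of Eq. 9
    is applied to the Euclidean distance |x - x_i|.
    """
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise |x - x_i|
    kernel = np.exp(-0.5 * dist**2) / np.sqrt(2.0 * np.pi)          # Eq. 9 on a scalar argument
    return kernel.mean(axis=1)                                      # Eq. 8: average over N kernels
```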
2.3. The Kullback-Leibler Distance

In [17], Kullback developed a measure of the similarity between two density functions, p_1(x) and p_2(x). This measure can be regarded as a distance between the two functions, although it is not a true distance, because it is not symmetric. We use this criterion to measure the distance between the density function estimated using the Parzen window and each density function estimated using the standard EM in the user-specified range, as follows:

$$\delta(p_{Parzen}(x), p_{EM}(x)) = \int_{S} p_{Parzen}(x) \log \frac{p_{Parzen}(x)}{p_{EM}(x)} \, dx, \qquad (10)$$

where δ is the distance between the two density functions.
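Since the Parzen estimate is discrete, the integral in Equation 10 has to be approximated. The sketch below uses a simple normalized sum over the sample points; this is one possible discretization, not necessarily the authors' exact scheme.

```python
import numpy as np

def kl_distance(p_parzen, p_em, eps=1e-12):
    """Approximate the Kullback-Leibler distance of Eq. 10.

    Both arguments are density values evaluated at the same sample points (e.g. the
    outputs of parzen_density and mixture_density). The values are renormalized so
    that they behave like discrete distributions before the sum is taken.
    """
    p = p_parzen / (p_parzen.sum() + eps)
    q = p_em / (p_em.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```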
2.4. Stochastic Learning Automata
One main advantage of a learning automaton is that it needs no knowledge of the environment in which it operates, nor any analytical knowledge of the function to be optimized. A learning automaton is a sequential machine characterized by a set of internal states, input actions, state probability distributions, a reinforcement scheme, and an output function, and it is connected in a feedback loop to the environment, as shown in Figure 1. The probability distribution of the actions, P_{u_i}, is adjusted using a reinforcement scheme to achieve the desired objective. At each step, the performance of the SLA in the environment is evaluated as either a penalty (unsatisfactory performance, y = 1) or a non-penalty (satisfactory performance, y = 0). A stochastic automaton is a quintuple {Y, Q, U, F, G} where

• Y is the set of environment responses: if Y consists of only the two elements 0 and 1, the environment is said to be a P-model; when the input to the SLA takes a finite number of values in the closed interval [0, 1], the environment is said to be a Q-model; and if the inputs are arbitrary numbers in the closed interval [0, 1], the environment is known as an S-model,

• Q is a finite set of states, Q = {q_1, ..., q_s},

• U is a finite set of outputs, U = {u_1, ..., u_m},

• F is the next-state function

$$q(n+1) = F[y(n), q(n)], \qquad (11)$$

• and G is the output function

$$u(n) = G[q(n)]. \qquad (12)$$

Figure 1: Automaton operating in a random environment (the automaton output u drives the random environment, whose response y is fed back to the automaton).

In general, the function F is stochastic and the function G may be deterministic or stochastic. Because of the stochastic nature of the state transitions, stochastic automata are considered suitable for modeling learning systems. If the output of the automaton is u_j, j = 1, 2, ..., m, the random environment generates a penalty with probability τ_j or a non-penalty with probability (1 − τ_j). The reinforcement scheme used to update the probability distribution of the actions is as follows [19]. Assume that u(n) = u_i. If y(n) = 0,

$$P_{u_i}(n+1) = (1 - \alpha) P_{u_i}(n) + \alpha, \qquad (13)$$

$$P_{u_j}(n+1) = (1 - \alpha) P_{u_j}(n), \quad (j \neq i). \qquad (14)$$

If y(n) = 1,

$$P_{u_i}(n+1) = P_{u_i}(n) - v\alpha \left(1 - P_{u_i}(n)\right) \left( \frac{H}{1-H} \right), \qquad (15)$$

$$P_{u_j}(n+1) = P_{u_j}(n) + v\alpha P_{u_j}(n) \left( \frac{H}{1-H} \right), \quad (j \neq i), \qquad (16)$$

where

$$H = \min\left[P_{u_1}(n), \ldots, P_{u_m}(n)\right], \qquad (17)$$

$$0 < \alpha < 1, \qquad (18)$$

$$0 < v\alpha < 1, \qquad (19)$$

$$P_{u_1}(0) = \ldots = P_{u_m}(0) = \frac{1}{m}. \qquad (20)$$

The operation of the SLA is as follows. An action, i.e., a number of mixture components, is selected at random; if the action results in a reward, its probability is increased and the probabilities of the other actions are decreased according to Equations 13 and 14. The learning rate is determined by α. If, on the other hand, the randomly selected action results in a penalty, its probability is decreased and the probabilities of the other actions are increased according to Equations 15 and 16. The penalty or reward is assigned based on the distance δ at the current iteration: if the current distance is less than or equal to that of the previous iteration, a reward is given; otherwise the action is penalized.
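A compact sketch of this reinforcement scheme and of the overall selection loop is given below. It reuses the hypothetical helpers sketched in Sections 2.1-2.3 (em_fit, mixture_density, parzen_density, kl_distance); the candidate range of component counts and the iteration budget are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def sla_update(P, i, reward, alpha=0.05, v=0.9):
    """Reinforcement scheme of Eqs. 13-16 for chosen action index i (probabilities stay normalized)."""
    P = P.copy()
    if reward:                                        # y = 0: non-penalty
        P *= (1.0 - alpha)                            # Eq. 14 shrinks every action
        P[i] += alpha                                 # Eq. 13 boosts the rewarded action
    else:                                             # y = 1: penalty
        H = P.min()                                   # Eq. 17
        ratio = H / (1.0 - H)
        old = P[i]
        P += v * alpha * P * ratio                    # Eq. 16 for all j != i
        P[i] = old - v * alpha * (1.0 - old) * ratio  # Eq. 15 for the penalized action
    return P

def select_mixture(X, k_min=1, k_max=6, iterations=60, rng=None):
    """SLA loop: each action is a candidate number of mixture components k."""
    rng = rng or np.random.default_rng(0)
    actions = list(range(k_min, k_max + 1))
    P = np.full(len(actions), 1.0 / len(actions))     # Eq. 20: uniform initial probabilities
    p_parzen = parzen_density(X)                      # discrete reference estimate (Section 2.2)
    prev_delta = np.inf
    for _ in range(iterations):
        i = rng.choice(len(actions), p=P)             # draw an action at random
        pi, mu, sigma = em_fit(X, actions[i], rng)    # EM estimate for this k (Section 2.1)
        delta = kl_distance(p_parzen, mixture_density(X, pi, mu, sigma))   # Section 2.3
        P = sla_update(P, i, reward=(delta <= prev_delta))
        prev_delta = delta
    return actions[int(np.argmax(P))]                 # the k whose probability dominates
```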
3. Simulation Results

In this section, the results of the algorithm introduced in Section 2 are demonstrated. The algorithm has been tested with several bivariate data sets with different numbers of Gaussian components. Figures 2, 3, 4 and 5 show the four mixtures used to verify the validity of the proposed algorithm; they contain two, three, four and five Gaussian components, respectively. The size of each of the data sets is 600 feature vectors. The SLA parameters used for all simulations are α = 0.05 and v = 0.9.

Figure 2: Mixture used in the first experiment.
Figure 3: Mixture used in the second experiment.

Figure 4: Mixture used in the third experiment.

Figure 5: Mixture used in the fourth experiment.

Figures 6, 7, 8 and 9 illustrate the clustering results using mixture parameters estimated by EM and selected by the SLA. The mixtures chosen have a variety of complexities, as shown in the figures. The probability distributions of all the actions in each simulation are shown in Figures 10, 11, 12 and 13. The figures clearly show that the probability distributions of the actions converged to the correct number of Gaussian components in each simulation. In Figures 10 and 11, where the mixture is fairly simple, the probability of the correct number of components started to dominate after five iterations. As the mixture becomes more complex, in Figures 12 and 13, the probability of the correct number of components started to dominate after 10 and 20 iterations, respectively.
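As a rough illustration of this experimental setup (not the authors' data sets), the selection loop sketched in Section 2.4 can be exercised on a synthetic bivariate three-component mixture of 600 vectors; the means and covariances below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic bivariate mixture with three well-separated components, 600 samples in total
# (the paper's data sets also contain 600 feature vectors each).
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0]), np.array([-5.0, 4.0])]
covs = [np.eye(2), np.diag([1.5, 0.5]), np.array([[1.0, 0.3], [0.3, 1.0]])]
X = np.vstack([rng.multivariate_normal(m, c, size=200) for m, c in zip(means, covs)])

best_k = select_mixture(X, k_min=1, k_max=6, iterations=60, rng=rng)
print("selected number of components:", best_k)   # typically converges to 3 for this data
```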
4. Conclusions and Future Work

In this paper, a novel approach to overcome the drawbacks of the Expectation Maximization technique has been introduced. The algorithm automatically estimates the mixture parameters and the number of Gaussian components that best approximate the given feature set.
Figure 6: Clustering result for a data set generated using the PDF of Figure 2.
The proposed approach is based on minimizing the Kullback-Leibler distance between the probability density function estimated using the standard EM and the discrete density function estimated using Parzen windows. The distance is minimized using a stochastic learning automaton; the stochastic nature of the distance measure makes the SLA approach a natural choice. The experimental results have shown the effectiveness of the approach: it was applied to mixtures of various degrees of complexity, and in all simulations the correct number of components was identified.

Future work will include testing different reinforcement schemes for updating the action probability distributions, which may help improve the convergence time. Different distance measures also need to be explored.
Figure 7: Clustering result for a data set generated using the PDF of Figure 3.

Figure 8: Clustering result for a data set generated using the PDF of Figure 4.

Figure 9: Clustering result for a data set generated using the PDF of Figure 5.

Figure 10: Probability distribution for the SLA actions applied to the data set generated using the PDF of Figure 2.

Figure 11: Probability distribution for the SLA actions applied to the data set generated using the PDF of Figure 3.

Figure 12: Probability distribution for the SLA actions applied to the data set generated using the PDF of Figure 4.

Figure 13: Probability distribution for the SLA actions applied to the data set generated using the PDF of Figure 5.

References

[1] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, vol. B-39, 1977.
[2] Wael Abd-Almageed and Christopher E. Smith, "Mixture Models for Dynamic Statistical Pressure Snakes," in Proc. IEEE International Conference on Pattern Recognition, August 2002, pp. 721-724.
[3] A. V. Nefian, Luhong Liang, Xiaobo Pi, Liu Xiaoxiang, C. Mao, and K. Murphy, "A coupled HMM for audio-visual speech recognition," in International Conference on Acoustics, Speech and Signal Processing (ICASSP'02), 13-17 May 2002, Orlando, FL, USA.
[4] A. Abu-Naser, N. P. Galatsanos, M. N. Wernick, and D. Schonfeld, "Object recognition based on impulse restoration with use of the expectation-maximization algorithm," Journal of the Optical Society of America A (Optics, Image Science and Vision), vol. 15, no. 9, pp. 2327-2340, September 1998.
[5] Mario Figueiredo and Anil Jain, "Unsupervised Learning of Finite Mixture Models," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, March 2002.
[6] C. Biernacki, G. Celeux, and G. Govaert, "Assessing a Mixture Model for Clustering with the Integrated Classification Likelihood," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 7, pp. 719-725, July 2000.
[7] A. Dasgupta and A. Raftery, "Detecting Features in Spatial Point Patterns with Clutter Via Model-Based Clustering," Journal of the American Statistical Association, 1998.
[8] J. Campbell, C. Fraley, F. Murtagh, and A. Raftery, "Linear Flaw Detection in Woven Textiles Using Model-Based Clustering," Pattern Recognition Letters, vol. 18, pp. 1539-1548, 1997.
[9] S. Richardson and P. Green, "On Bayesian Analysis of Mixtures with an Unknown Number of Components," Journal of the Royal Statistical Society, 1997.
[10] R. Neal, "Bayesian Mixture Modeling," in Proc. of the 11th Int'l Workshop on Maximum Entropy and Bayesian Methods of Statistical Analysis, 1992, pp. 197-211.
[11] H. Bensmail, G. Celeux, A. Raftery, and C. Robert, "Inference in Model-Based Cluster Analysis," Statistics and Computing, vol. 7, pp. 1-10, 1997.
[12] K. S. Narendra and M. A. L. Thathachar, "Learning Automata: A Survey," IEEE Transactions on Systems, Man, and Cybernetics, vol. 14, pp. 323-334, 1974.
[13] M. L. Tsetlin, "On the Behavior of Finite Automata in Random Media," Automation and Remote Control, vol. 22, pp. 1210-1219, 1962.
[14] K. S. Fu and G. J. McMurtry, "A study of stochastic automata as a model for learning and adaptive controllers," IEEE Transactions on Automatic Control, vol. 11, pp. 379-387, 1966.
[15] M. A. L. Thathachar and P. S. Sastry, "Varieties of Learning Automata: An Overview," IEEE Transactions on Systems, Man, and Cybernetics, vol. 32, no. 6, pp. 711-722, December 2002.
[16] Learning Automata: Theory and Applications, Pergamon, New York, 1994.
[17] Solomon Kullback, Information Theory and Statistics, Dover Publications, Inc., 1968.
[18] E. Parzen, "On the Estimation of a Probability Density Function and the Mode," Annals of Mathematical Statistics, vol. 33, pp. 1065-1076, 1962.
[19] Learning Automata and Stochastic Optimization, Lecture Notes in Control and Information Sciences 225, Springer, 1997.