Sequential EM for Unsupervised Adaptive Gaussian Mixture Model Based Classifier

Bashar Awwad Shiekh Hasan and John Q. Gan
School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe Park, CO4 3SQ, UK
{bawwad,jqgan}@essex.ac.uk

Abstract. In this paper we present a sequential expectation-maximization algorithm that adapts a Gaussian mixture model in an unsupervised manner for a classification problem. The goal is to adapt the Gaussian mixture model to cope with non-stationarity in the data to be classified and hence preserve the classification accuracy. Experimental results on synthetic data show that this method is able to learn time-varying statistical features in the data by adapting a Gaussian mixture model online. In order to control the adaptation and to ensure the stability of the adapted model, we introduce an index that detects when the adaptation fails.

1 Introduction

The Gaussian mixture model (GMM) is a simple and successful clustering method that is widely used in many application domains. In the GMM setting the data are assumed to be generated from a finite number of Gaussian distributions, so the data are modeled by the probability density function

    p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) .    (1)

where K is the number of Gaussian components, π_1, ..., π_K are the mixing coefficients, and N(x|μ_k, Σ_k) is a Gaussian distribution with mean μ_k and covariance Σ_k. The mixing coefficients π_k should satisfy the conditions

    0 \le \pi_k \le 1    (2)

and

    \sum_{k=1}^{K} \pi_k = 1 .    (3)

In order to better model the data, a training method is required to estimate the model parameters. One well-known and widely used method is Expectation-Maximization (EM). EM works by alternating between two steps: the E step, which uses the current model parameters to calculate the responsibilities of the Gaussian components, and the M step, in which the responsibilities are used to re-estimate the model parameters [1]. After a number of iterations the method converges to a model that maximizes the log likelihood of the data points given the model. EM is a batch method that provides a very good estimate of the clusters when the number of clusters is carefully chosen.

In [2] the infinite GMM was introduced to sidestep the difficult problem of deciding the "right" number of mixture components. Inference in that model is done using an efficient parameter-free Markov chain that relies entirely on Gibbs sampling. This results in new components being added to the mixture as new data arrive, which can work well for adapting to unforeseen data, but it does not work within a classification configuration. In [3] and [4], incremental methods based on the Bayesian Information Criterion were developed to build a GMM with the best number of components. Those methods focus only on clustering problems and still require all the data offline. Some customized adaptive methods for GMM can be found in the literature, such as [5], where the adaptation scheme is based on a constant learning rate used to update the GMM while the number of components changes. That method works only when the data represent several clusters of one class.

In this paper we introduce a sequential version of EM to train a GMM in the case where the statistical features of the data change over time. The GMM built here is meant to be used for classification, which introduces additional constraints on how to adapt the components so as to maintain/improve the classification accuracy. We also introduce a method to detect misrepresentation of the classes.

2 Method

2.1 Gaussian Mixture Model and EM

Here we briefly introduce a formulation of Gaussian mixtures in terms of discrete latent variables, as introduced and discussed in depth in [1]. Let us introduce a K-dimensional binary random variable z having a 1-of-K representation, in which a particular element z_k is equal to 1 and all other elements are equal to 0. We define the joint distribution p(x, z) in terms of a marginal distribution p(z) and a conditional distribution p(x|z). The marginal distribution over z is specified in terms of the mixing coefficients π_k, with p(z_k = 1) = π_k and

    p(z) = \prod_{k=1}^{K} \pi_k^{z_k} .

The conditional distribution of x given a particular value of z is p(x|z_k = 1) = N(x|μ_k, Σ_k), so that

    p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k} .

The joint distribution is then given by p(z)p(x|z), and

    p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) .    (4)
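As a concrete illustration of Eq. (4), the mixture density can be evaluated directly once the parameters are known. The following is a minimal Python sketch (not from the paper); the function name and toy parameters are our own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) for a single point x."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Toy two-component mixture in 2-D
weights = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), weights, means, covs))
```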


The introduction of z helps in calculating the responsibility of each component in the mixture, γ(z_k) = p(z_k = 1|x). The EM algorithm is shown in Method 1, where γ(z_nk) is the responsibility associated with data point x_n and γ(z_k) is the responsibility associated with all the data points x.

Method 1. The standard EM for GMM

E-STEP: Evaluate responsibilities using current parameter values

    \gamma(z_k) = p(z_k = 1 \mid x) = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)} .    (5)

M-STEP: Re-estimate parameters using current responsibilities

    \mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n ,    (6)

    \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})(x_n - \mu_k^{new})(x_n - \mu_k^{new})^T ,    (7)

    \pi_k^{new} = \frac{N_k}{N} ,    (8)

where N_k = \sum_{n=1}^{N} \gamma(z_{nk}).
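A compact NumPy sketch of one EM iteration of Method 1 (Eqs. 5-8) is given below; it is illustrative only, and the function and variable names are our own, not the authors'.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, covs):
    """One EM iteration for a GMM. X: (N, D) data; pis: (K,); mus: (K, D); covs: (K, D, D)."""
    N, K = X.shape[0], len(pis)
    # E-step: responsibilities gamma[n, k]  (Eq. 5)
    gamma = np.column_stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k])
                             for k in range(K)])
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step (Eqs. 6-8)
    Nk = gamma.sum(axis=0)                             # effective number of points per component
    mus_new = (gamma.T @ X) / Nk[:, None]              # Eq. 6
    covs_new = np.empty_like(covs)
    for k in range(K):
        diff = X - mus_new[k]
        covs_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # Eq. 7
    pis_new = Nk / N                                   # Eq. 8
    return pis_new, mus_new, covs_new
```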

The initial values of the model parameters can be calculated using a simple clustering method; in this study the k-means algorithm is used.
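For instance, a k-means based initialization might look like the following sketch; the paper does not specify the exact procedure, so the helper below and its scikit-learn dependency are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_gmm_params(X, K):
    """Initialize (pis, mus, covs) for a K-component GMM from a k-means partition of X (N, D)."""
    assign = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    pis = np.array([(assign == k).mean() for k in range(K)])
    mus = np.stack([X[assign == k].mean(axis=0) for k in range(K)])
    covs = np.stack([np.cov(X[assign == k], rowvar=False) for k in range(K)])
    return pis, mus, covs
```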

2.2 Classification with Unsupervised Training

In the literature, Gaussian mixture models have been used for classification in a supervised way [6][7]: a GMM is built for each class and then a Bayesian classifier is used to classify new data (when the classes have the same prior, the likelihood is enough for classification). This approach is not suitable for unsupervised adaptation, as the labels are always required for training/adaptation. In order to overcome this problem we take a different approach. One GMM is built for all the training/offline data, and then the available labels are used to calculate p(c = class_i | z_k), the probability of the class being class_i when the data point is generated from component z_k. The probability of a data point x belonging to class class_i, p(c = class_i | x), is calculated as follows:

    p(c = class_i \mid x) = \sum_{k=1}^{K} p(c = class_i \mid z_k, x)\, p(z_k \mid x) ,    (9)


where p(z_k | x) is the responsibility of component z_k for generating the point x. Assuming p(c = class_i | z_k, x) is independent of x, then

    p(c = class_i \mid x) = \sum_{k=1}^{K} p(c = class_i \mid z_k)\, p(z_k \mid x) .    (10)
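A possible NumPy sketch of this classification rule (Eq. 10) is given below; it estimates p(c = class_i | z_k) from the labeled offline data by accumulating responsibilities per class, which is one plausible estimator rather than the authors' stated one, and all names are illustrative.

```python
import numpy as np

def estimate_class_given_component(gamma, labels, n_classes):
    """p(c = class_i | z_k) from offline responsibilities gamma (N, K) and integer labels (N,)."""
    K = gamma.shape[1]
    class_given_comp = np.zeros((n_classes, K))
    for i in range(n_classes):
        class_given_comp[i] = gamma[labels == i].sum(axis=0)
    return class_given_comp / class_given_comp.sum(axis=0, keepdims=True)

def classify(resp_x, class_given_comp):
    """Eq. 10: p(c = class_i | x) = sum_k p(class_i | z_k) p(z_k | x); resp_x is (K,)."""
    post = class_given_comp @ resp_x
    return post.argmax(), post
```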

2.3 Sequential EM (SEM)

EM as an optimization method uses a batch of the available data to tune the model parameters and hence better model the data. When there is an incoming stream of non-stationary data, a model built on the training/offline data will not be able to cope with the changes, and this causes a drop in classification accuracy because the new data are misrepresented by the old model. To tackle this problem, a sequential/online version of the EM method for GMM is introduced here. The main idea behind this method is to rewrite each parameter/hyper-parameter as a convex combination of old and new data. This is valid here because all the parameters/hyper-parameters used by EM are sufficient statistics of the distribution. Method 2 outlines the sequential EM method, where t is the current time point.

Method 2. The sequential EM for GMM

E-STEP: Evaluate the responsibility using the parameters at t-1 and x at t

    \gamma(z_k^t) = \frac{\pi_k^{t-1} \mathcal{N}(x^t \mid \mu_k^{t-1}, \Sigma_k^{t-1})}{\sum_{j=1}^{K} \pi_j^{t-1} \mathcal{N}(x^t \mid \mu_j^{t-1}, \Sigma_j^{t-1})} .    (11)

M-STEP: Adapt the model parameters

    \mu_k^t = \frac{1}{N_k^t}\left(N_k^{t-1} \mu_k^{t-1} + \gamma(z_k^t)\, x^t\right) ,    (12)

    \Sigma_k^t = \frac{1}{N_k^t}\left(N_k^{t-1} \Sigma_k^{t-1} + \gamma(z_k^t)(x^t - \mu_k^t)(x^t - \mu_k^t)^T\right) ,    (13)

    \pi_k^t = \frac{N_k^t}{t} ,    (14)

where N_k^t = N_k^{t-1} + \gamma(z_k^t).

In the E-step we calculate the responsibility γ(z_k^t) associated with a new data point x^t at time t based on the model parameters estimated at time t−1 and using the new data point x^t only. In the M-step, the mean μ_k^t is estimated from the previous estimate μ_k^{t-1} and the new data point x^t; similarly, Σ_k^t and N_k^t are calculated from the previously estimated Σ_k^{t-1}, N_k^{t-1}, and x^t.
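For concreteness, one sequential update (Eqs. 11-14) could be coded as follows; this is our sketch, not the authors' implementation, and the identifiers (including the convention that t counts all points seen so far, offline data included) are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sem_update(x_t, t, pis, mus, covs, Nk):
    """Sequential EM update for one point x_t arriving at time t.
    pis: (K,); mus: (K, D); covs: (K, D, D); Nk: (K,) accumulated responsibilities."""
    K = len(pis)
    # E-step (Eq. 11): responsibilities from the previous parameters
    gamma = np.array([pis[k] * multivariate_normal.pdf(x_t, mean=mus[k], cov=covs[k])
                      for k in range(K)])
    gamma /= gamma.sum()
    # M-step (Eqs. 12-14): convex combination of old statistics and the new point
    Nk_new = Nk + gamma
    mus_new = (Nk[:, None] * mus + gamma[:, None] * x_t) / Nk_new[:, None]
    covs_new = np.empty_like(covs)
    for k in range(K):
        diff = (x_t - mus_new[k])[:, None]
        covs_new[k] = (Nk[k] * covs[k] + gamma[k] * (diff @ diff.T)) / Nk_new[k]
    pis_new = Nk_new / t
    return pis_new, mus_new, covs_new, Nk_new
```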


γ(z_k^t) controls the adaptation process, so a cluster that is closer to the new sample adapts more than one further away. The sequential method is initialized with a model generated by the standard method trained on the offline training data. The presented SEM is a parametric adaptive method, which does not take into consideration the problem of the optimal number of clusters to represent the data. The assumption is that the model trained on the offline data has already been chosen to optimally represent the data, for example using cross-validation.

2.4 Adaptation Failure Detection

As the data shift over time, the Gaussian components adapted with SEM will try to keep track of the changes, but they are still bounded by the old data. So when a dramatic change occurs in the new data, the adaptation method will most likely fail to represent it. This affects the classification accuracy, and the system will most likely end up with a few dominant components. The other components still exist, but they do not play any actual role in representing/classifying the new data. Figure 1 shows the components' responsibilities (summed over all data in the adaptation window); the straight lines are components that represent class 1 and the boxed lines are components that represent class 2. It is clear that one component becomes dominant over time and hence badly affects the classification accuracy.

Another cause of adaptation failure is changes in the probability of a class given a Gaussian component, p(c = class_i | z_k). The model might still adapt well to the new data, but because of the overlap between the data of the two classes the Gaussian components might shift to represent one class better than the other(s).

In order to enhance the system performance, a failure detection method is necessary to know when to stop adaptation or re-train the system. We have used a simple and efficient failure detection method based on the responsibility of the components for generating the new data, p(z_k | X), X = {x_1, ..., x_N}, and the probability of the class given the component, p(c = class_i | z_k). Let cc_i = {k : p(c = class_i | z_k) = max_j p(c = class_j | z_k), j = 1, 2}, which contains the indexes of the components that represent class_i. We then define

    cl_i = \sum_{n=1}^{N} \alpha_i^n ,    (15)

where

    \alpha_i^n = \begin{cases} p(z_k \mid x_n) & \text{if } p(z_k \mid x_n) = \max_j p(z_j \mid x_n),\; j = 1, \ldots, K,\ \text{and } k \in cc_i \\ 0 & \text{otherwise.} \end{cases}    (16)

cl_i gives an index of how probable class_i is among the data X. The failure detection index is then defined as


    FDI = cl_1 / cl_2 .    (17)

When FDI > ul = 2.0 or FDI < ll = 0.5, one of the classes has dominant component(s) and the adaptive model has failed. For a stricter/looser constraint, the upper limit ul and lower limit ll can be changed accordingly. FDI deals with the first cause of adaptation failure. Detecting the second problem is much harder, as the labels are not available online. Instead of trying to detect the shifts of the components between the classes, we re-calculate the probability p(c = class_i | z_k) after several adaptation batches. This is not a very efficient method, though, and it might actually harm the classification results, especially if the classes are heavily overlapped.
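A small sketch of how FDI (Eqs. 15-17) might be computed over an adaptation window is shown below, assuming the responsibilities and p(class | component) are already available; the names are illustrative.

```python
import numpy as np

def fdi(resp, class_given_comp):
    """Failure detection index over a window.
    resp: (N, K) responsibilities p(z_k | x_n); class_given_comp: (2, K) p(class_i | z_k)."""
    owner_class = class_given_comp.argmax(axis=0)   # class each component represents (cc_i)
    best_comp = resp.argmax(axis=1)                 # most responsible component per point
    cl = np.zeros(2)
    for n, k in enumerate(best_comp):
        cl[owner_class[k]] += resp[n, k]            # Eqs. 15-16
    return cl[0] / cl[1]                            # Eq. 17

# Example usage: flag failure when FDI drifts outside [ll, ul] = [0.5, 2.0]
# failed = not (0.5 <= fdi(resp_window, class_given_comp) <= 2.0)
```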

Fig. 1. Responsibilities of the adaptive GMM components (y-axis: responsibility summed over an adaptation window; x-axis: time)

2.5 The Algorithm

The method starts by modeling the offline data with a predefined number of Gaussian distributions. Adaptation is then performed on fixed-size windows of the online data; after each adaptation window the FDI is calculated to check the adapted system. If the adaptation has failed, re-training is used to build a model for the new data. Algorithm 1 outlines the adaptive algorithm, where classifyComponents is a method that calculates p(c = class_i | z_k). It should be pointed out that step 12 of the algorithm is not mandatory; one might instead stop the adaptation (see the discussion for more on this).


Algorithm 1. Adaptive GMM
 1: model = EMGMM(offlineData)
 2: classProb = classifyComponents(model, offlineData, labels)
 3: previousModel = model
 4: newModel = model
 5: while there is new data do
 6:   classify(newDataPoint, previousModel, classProb)
 7:   newModel = SEMGMM(newModel, newDataPoint)
 8:   if size(newData) = adaptationWindow then
 9:     FDI = calculateFDI(newModel, newData)
10:     if FDI > ul || FDI < ll then
11:       newModel = EMGMM(newData)
12:       classProb = classifyComponents(model, newData, newLabels)
13:     end if
14:     previousModel = newModel
15:   end if
16: end while
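A Python sketch of the control flow in Algorithm 1 is given below. It reuses the hypothetical sem_update and fdi helpers sketched earlier, and the remaining names (responsibilities, retrain_fn, the window handling) are our own assumptions rather than the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal

UL, LL = 2.0, 0.5          # FDI limits from Section 2.4
ADAPT_WINDOW = 2000        # adaptation window size used in the experiments

def responsibilities(x, pis, mus, covs):
    """p(z_k | x) under a fixed model (Eq. 5 with the current parameters)."""
    g = np.array([pis[k] * multivariate_normal.pdf(x, mean=mus[k], cov=covs[k])
                  for k in range(len(pis))])
    return g / g.sum()

def adaptive_gmm(stream, pis, mus, covs, Nk, class_given_comp, retrain_fn=None):
    """Main loop of Algorithm 1. `stream` yields data points; `retrain_fn`, if given,
    rebuilds (pis, mus, covs, Nk, class_given_comp) from a window of data (steps 11-12)."""
    prev = (pis.copy(), mus.copy(), covs.copy())     # previousModel
    window, t = [], int(round(Nk.sum()))
    for x in stream:
        resp_x = responsibilities(x, *prev)
        yield (class_given_comp @ resp_x).argmax()   # classify with previous model (Eq. 10)
        t += 1
        pis, mus, covs, Nk = sem_update(x, t, pis, mus, covs, Nk)   # sequential EM step (Method 2)
        window.append(x)
        if len(window) == ADAPT_WINDOW:
            resp = np.vstack([responsibilities(x_n, pis, mus, covs) for x_n in window])
            if not (LL <= fdi(resp, class_given_comp) <= UL) and retrain_fn is not None:
                pis, mus, covs, Nk, class_given_comp = retrain_fn(np.array(window))
            prev = (pis.copy(), mus.copy(), covs.copy())
            window = []
```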

Experimental Results Introduction

In order to objectively test the proposed method, synthesized data were used instead of real-life data. The data were generated so as to satisfy a number of criteria:

– the data present a two-class problem
– the number of data points is balanced between the two classes
– the data points are generated from a Gaussian mixture model with an equal number of components for each class
– the data are linearly separable (between 80% and 90% in our case)
– the data are non-stationary over time
– the non-stationary change in the data follows a pre-defined path

These criteria are necessary to make sure that the generated data represent the problem we are trying to solve here, and they make it possible to evaluate the method objectively.

3.2 Synthetic Data Generation

Here we detail the data generation method that complies with the previously stated criteria. The data were generated from a Gaussian mixture model with 6 components representing 2 classes (3 components per class). The components' means were selected randomly on either side of a linear hyperplane, and the covariances were selected randomly as well. 2000 data points were sampled from the original model and used as the offline data to train a Gaussian mixture model with 6 components. The original model used for generating the data was then shifted along a curve from θ = 0 to θ = π/2 in 10 consecutive steps, as shown in Fig. 2. In each of these steps the covariances were scaled randomly and another 2000 data points were sampled from the shifted model; this ensures control over the non-stationarity in the data. As the probabilities of both classes are assumed to be the same, the simulated data are balanced and the ongoing streams use small windows of samples alternating between the two classes. The assumption of balanced online data does not affect the generality of the method, as this can be satisfied in most application areas by using a proper adaptation window that covers data from both classes, using some prior knowledge of the domain.
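As an illustration of this setup, the following sketch samples a drifting two-class mixture; the specific means, rotation scheme, covariance scaling, and seed are our own choices and not the exact ones used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rotate(mean, theta):
    """Rotate a 2-D mean about the origin by angle theta (one way to move means along a curve)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ mean

def sample_window(means, covs, labels, n=2000):
    """Draw n points from the mixture (uniform over components); returns (X, y)."""
    comp = rng.integers(0, len(means), size=n)
    X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in comp])
    return X, labels[comp]

# 6 components, 3 per class, on either side of a linear boundary
means = [np.array(m, float) for m in [(1, 3), (2, 4), (3, 5), (3, 1), (4, 2), (5, 3)]]
covs = [0.3 * np.eye(2) for _ in means]
labels = np.array([0, 0, 0, 1, 1, 1])

offline_X, offline_y = sample_window(means, covs, labels)        # offline training window
for step, theta in enumerate(np.linspace(0, np.pi / 2, 10)):     # 10 shifted online windows
    shifted = [rotate(m, theta) for m in means]
    scaled = [c * rng.uniform(0.8, 1.2) for c in covs]
    online_X, online_y = sample_window(shifted, scaled, labels)
```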

Fig. 2. Path of the change in data means (means of the original data model shifted with time)

3.3 Results

To test the adaptation method, we first provide results without the use of FDI, i.e. steps 9-13 are removed from the algorithm. The adaptation window is 2000 points, and the performance of the adapted model is tested on the data from the following window. Here we present the results from 10 data sets. Table 1 shows the results using the model built on the offline data, and Table 2 shows the results using the adaptive method. Figure 3 shows the change in average accuracy over all the data sets; in the same figure we give the p values calculated with the Wilcoxon signed-rank test, as suggested in [8]. The signed-rank test results show a significant improvement in accuracy when the adaptive method is applied. Although the focus of this work is on unsupervised adaptation, we also present the results achieved using re-training after failure detection with FDI, to show its usability, in Table 3.
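The per-window comparison between static and adaptive accuracies could, for example, be carried out with SciPy's signed-rank test; the sketch below pairs the T2 columns of Tables 1 and 2, and is only an illustration of the test, not the authors' analysis script.

```python
from scipy.stats import wilcoxon

# Static (Table 1) vs adaptive (Table 2) accuracies of the 10 data sets at test window T2
static_acc   = [0.7610, 0.8415, 0.8985, 0.9745, 0.879, 0.908, 0.657, 0.549, 0.832, 0.785]
adaptive_acc = [0.956, 0.8535, 0.9015, 0.985, 0.921, 0.915, 0.6265, 0.8385, 0.849, 0.8005]
stat, p = wilcoxon(static_acc, adaptive_acc)
print(f"Wilcoxon signed-rank p value: {p:.4f}")
```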


Table 1. Results using the static model

Dataset   T1      T2      T3      T4      T5      T6      T7      T8      T9      T10
DS1       0.9370  0.7610  0.6840  0.6980  0.6575  0.7325  0.6760  0.6770  0.6740  0.7195
DS2       0.9040  0.8415  0.8470  0.7705  0.8105  0.8375  0.8260  0.7335  0.5630  0.4670
DS3       0.9095  0.8985  0.9060  0.8920  0.8665  0.7890  0.6485  0.5865  0.5400  0.4980
DS4       0.9790  0.9745  0.9805  0.9945  0.9510  0.9625  0.9145  0.8485  0.7435  0.6565
DS5       0.912   0.879   0.8685  0.882   0.8845  0.696   0.559   0.5405  0.521   0.5125
DS6       0.925   0.908   0.8835  0.7815  0.733   0.6305  0.5955  0.5385  0.506   0.5
DS7       0.75    0.657   0.6095  0.6055  0.6565  0.647   0.6775  0.71    0.604   0.472
DS8       0.6935  0.549   0.599   0.5925  0.562   0.706   0.718   0.565   0.544   0.511
DS9       0.901   0.832   0.736   0.594   0.552   0.4435  0.4265  0.3825  0.479   0.7445
DS10      0.82    0.785   0.738   0.743   0.654   0.604   0.5825  0.5725  0.5605  0.499

Table 2. Results using the adaptive algorithm without re-training

Dataset   T1      T2      T3      T4      T5      T6      T7      T8      T9      T10
DS1       0.937   0.956   0.9475  0.9615  0.9595  0.963   0.9785  0.9655  0.9645  0.961
DS2       0.904   0.8535  0.8855  0.81    0.8285  0.8095  0.8235  0.829   0.831   0.8445
DS3       0.9095  0.9015  0.9285  0.9165  0.9235  0.9095  0.8415  0.869   0.811   0.763
DS4       0.979   0.985   0.973   0.991   1       0.9915  0.992   0.982   0.9875  0.962
DS5       0.9115  0.921   0.921   0.9285  0.926   0.887   0.8915  0.8715  0.8315  0.815
DS6       0.925   0.915   0.918   0.9065  0.8685  0.748   0.706   0.6155  0.5235  0.5135
DS7       0.75    0.6265  0.5     0.498   0.499   0.494   0.501   0.536   0.49    0.464
DS8       0.6925  0.8385  0.895   0.869   0.8765  0.8635  0.814   0.852   0.77    0.682
DS9       0.901   0.849   0.874   0.8535  0.842   0.8395  0.8     0.816   0.877   0.839
DS10      0.82    0.8005  0.767   0.8065  0.7195  0.6635  0.627   0.58    0.4955  0.483

4 Discussion and Conclusion

In this paper we have presented a sequential EM method to adapt a Gaussian mixture model in a classification configuration. In addition, we have defined an adaptation failure detection index. This method is suitable for cases where the data statistics shift over time; EMG and EEG data are examples of such a case. The data are expected to change slowly over time; sudden changes are much harder to capture with our method.

In [9], Neal and Hinton presented a justification for an online variant of EM. They showed that an online EM based on a sequential E-step can converge faster than the standard EM. Sato and Ishii [10] showed that the online EM algorithm can be considered as a stochastic approximation method for finding the maximum likelihood estimator. Although we did not use a discount factor similar to the one presented by Sato and Ishii, the online EM method for GMM presented here follows the general scheme of online EM in the literature.

Here we did not deal with the problem of the optimal number of Gaussian components to represent each class.


Fig. 3. Average accuracies over time (over the 10 data sets) using the static model (continuous line) and the adaptive model (discrete line), annotated with the Wilcoxon signed-rank p values

Table 3. Results using the adaptive algorithm with re-training

Dataset   T1      T2      T3      T4      T5      T6      T7      T8      T9      T10
DS1       0.937   0.956   0.9475  0.9615  0.9595  0.963   0.9785  0.9655  0.9645  0.961
DS2       0.904   0.8535  0.8855  0.81    0.8285  0.8095  0.8235  0.829   0.831   0.8445
DS3       0.9095  0.9015  0.9285  0.9165  0.9235  0.9095  0.8415  0.869   0.811   0.997
DS4       0.979   0.985   0.973   0.991   1       0.9915  0.992   0.982   0.9875  0.962
DS5       0.9115  0.921   0.921   0.9285  0.926   0.887   0.8915  0.8715  0.8315  0.815
DS6       0.925   0.915   0.918   0.9065  0.8685  0.748   0.925   0.8395  0.928   0.803
DS7       0.75    0.8355  0.6835  0.69    0.7285  0.8475  0.83    0.7405  0.9855  0.838
DS8       0.6925  0.8385  0.895   0.869   0.8765  0.8635  0.814   0.852   0.77    0.682
DS9       0.901   0.849   0.874   0.8535  0.842   0.8395  0.8     0.816   0.877   0.839
DS10      0.82    0.8005  0.767   0.8605  0.8275  0.8135  0.788   0.675   0.8805  0.755

Some work on incremental addition/removal of Gaussian components was presented in [3][4], but in a classification configuration this is a very difficult problem: it is hard to know online, in an unsupervised way, the probabilities p(c = class_i | z_n), where z_n is a newly added component. We therefore assumed that the number of components is static and that only the model parameters are adapted.

In the presented Algorithm 1, the probabilities p(c = class_i | z_k) are considered static and are calculated only when building the original model. These probability distributions can be updated over time between sessions based on the current model's classification. The re-training step mentioned in the algorithm can only be used if labels are available, or partially available, online. In a totally unsupervised adaptation scheme, FDI, which is a fast and reliable measure of the adaptation, indicates when the adaptation fails, and one might then stop the adaptation and use the last known stable model.

The size of the adaptation window might have a considerable effect on the performance of the adaptation method. A small window might not change the model enough, while a longer window means a larger drop in the ongoing classification accuracy until the newly adapted model is used. The selection of the window size is determined mostly by the chosen application. An important property of such a window is that it provides a balanced number of examples from the two classes; this is important to keep the adaptation method from adapting to one class over the other.

In [11], we applied the proposed method in the field of Brain-Computer Interfaces (BCI). The experimental results showed the usefulness of this approach in building adaptive BCIs.

Acknowledgment

The authors would like to thank Prof. Stephen Roberts for his useful input. This work is part of the project "Adaptive Asynchronous Brain Actuated Control" funded by UK EPSRC. Bashar's study is funded by the Aga Khan Foundation.

References

1. Bishop, C.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
2. Rasmussen, C.E.: The infinite Gaussian mixture model. In: Advances in Neural Information Processing Systems, vol. 12, pp. 554–560 (2000)
3. Cheng, S., Wang, H., Fu, H.: A model-selection-based self-splitting Gaussian mixture learning with application to speaker identification. EURASIP Journal on Applied Signal Processing 17, 2626–2639 (2004)
4. Fraley, C., Raftery, A., Wehrens, R.: Incremental model-based clustering for large datasets with small clusters. Tech. Rep. 439 (2003)
5. Shimada, A., Arita, D., Taniguchi, R.: Dynamic control of adaptive mixture-of-Gaussians background model. In: AVSS 2006: Proceedings of the IEEE International Conference on Video and Signal Based Surveillance, vol. 5 (2006)
6. Marques, J., Moreno, P.J.: A study of musical instrument classification using Gaussian mixture models and support vector machines. Tech. Rep. CRL 99/4 (1999)
7. Millan, J.R.: On the need for on-line learning in brain-computer interfaces. In: Proc. IEEE International Joint Conference on Neural Networks, vol. 4, pp. 2877–2882 (2004)
8. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)
9. Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models, pp. 355–368 (1998)
10. Sato, M., Ishii, S.: On-line EM algorithm for the normalized Gaussian network. Neural Computation 12(2), 407–432 (2000)
11. Awwad Shiekh Hasan, B., Gan, J.Q.: Unsupervised adaptive GMM for BCI. In: International IEEE EMBS Conference on Neural Engineering, Antalya, Turkey (2009)
