An on-line Classification EM algorithm based on mixture model
Allou Samé, Christophe Ambroise, Gérard Govaert Université de Technologie de Compiègne HEUDIASYC, UMR CNRS 6599 BP 20529, 60205 Compiègne Cedex, FRANCE
Abstract
Mixture model-based clustering is widely used in many applications. In real-time applications, data are received sequentially and classification parameters have to be updated quickly. An on-line clustering algorithm based on mixture models is presented in the context of a real-time flaw-diagnosis application for pressurized containers. The available data for this application are acoustic emission signals. The proposed algorithm is a stochastic gradient algorithm derived from the classification version of the EM algorithm (CEM). It provides a model-based generalization of the well-known on-line k-means algorithm, able to handle non-spherical clusters when specific Gaussian mixture models are used. Using synthetic and real data sets, the proposed algorithm is compared to the batch CEM algorithm and to the on-line EM algorithm. The three approaches generate comparable solutions in terms of the resulting partition when the clusters are relatively well separated, but the on-line algorithms become faster as the number of available observations increases.
Key words: clustering, mixture model, EM, CEM, stochastic gradient, exponential family
Preprint submitted to Elsevier Science
13 December 2005
∗ Corresponding author: Gérard Govaert, Université de Technologie de Compiègne, HEUDIASYC, UMR CNRS 6599, BP 20529, 60205 Compiègne Cedex, FRANCE. E-mail:
[email protected], Tel: 33 (0)3 44 23 44 86, Fax: 33 (0)3 44 23 44 77.
1 Introduction
In many real applications, data are received sequentially and the overall sample size can become very large. In that context, classical batch algorithms like the EM algorithm [6] are not able to compute classification parameters in real time. On-line parameter estimation using mixture models has already been addressed by many authors (e.g. Titterington [10]; Wang and Zhao [11]). More recently, Liu et al. [7] have considered, for internet traffic modelling, a recursive EM algorithm based on Poisson mixture models. Our motivation was a real-time flaw detection problem for pressurized containers using acoustic emissions. The pressurized containers used are cylindrical tanks containing fluids under pressure (see figure 1).
Fig. 1. Example of cylindrical tank
The problem consists in verifying non-destructively the absence of flaws. A computer-aided-decision method was devised. The operator reaches a decision (presence or absence of flaws) using a two-dimensional representation of the tank surface. Each acoustic emission from the tank is described by p variables related to the characteristics of the acoustic signal as well as to its location in relation to the two-dimensional representation of the tank (see figure 2). The computer-aided-decision method we developed is sequential and consists
Fig. 2. Real localisations of acoustic emissions
of two steps:
• clustering of the acoustic emission events, in order to find zones where the acoustic emissions are concentrated;
• classification of the identified clusters as either normal or flawed clusters.
The first step is performed using the classification version of the EM algorithm (CEM) [4]. At this stage the locations of the acoustic emissions are the only data taken into account. This algorithm was chosen for its speed compared to other clustering algorithms such as EM [6]. The CEM algorithm is applied assuming a mixture of Gaussian densities with diagonal covariance matrices, since experts agree that flaws generally appear along welding zones, which are either horizontally or vertically oriented. The number of clusters is adjusted using the Integrated Classification Likelihood (ICL) criterion [1]. During the second step, features describing each cluster, in terms of the acoustic emission properties, are computed. Using these features, a decision (presence or absence of flaws on the tank surface) is communicated to the operator. This decision is obtained using standard discrimination methods such as Bayesian discrimination with Gaussian densities.
The method described above is often satisfactory but becomes slow when more than 10000 acoustic emissions have to be considered. This is due to the CEM algorithm, whose execution time increases significantly with data size. Our aim was to improve this clustering step by developing a faster classification algorithm without losing the accuracy of the CEM algorithm.
For this purpose, an on-line mixture model-based algorithm is proposed. This algorithm is a stochastic gradient algorithm derived from the CEM algorithm.
Data are supposed to be independent observations $x_1, \ldots, x_n, \ldots$ sequentially received and distributed following a mixture density of $K$ components, defined on $\mathbb{R}^p$ by
$$f(x; \Phi) = \sum_{k=1}^{K} \pi_k f_k(x; \theta_k),$$
with $\Phi = (\pi_1, \ldots, \pi_K, \theta_1, \ldots, \theta_K)$, where $\pi_1, \ldots, \pi_K$ denote the proportions of the mixture and $\theta_1, \ldots, \theta_K$ the parameters of each density component. We denote by $z_1, \ldots, z_n, \ldots$ the classes associated with the observations, where $z_n \in \{1, \ldots, K\}$ corresponds to the class of $x_n$.
The paper is organized as follows. The second section describes stochastic gradient algorithms in the context of parameter estimation. The third section shows how an on-line algorithm has been derived from the EM algorithm by Titterington [10]. In the fourth section, our on-line clustering algorithm derived from the CEM algorithm is presented. An experimental study is summarized in the fifth section and the proposed algorithm is applied to a real data set in the sixth section.
2 Stochastic gradient algorithms
Since stochastic gradient algorithms have been chosen to estimate the parameters of the mixture model, this section introduces them in a parameter estimation context. Stochastic gradient algorithms are generally used for on-line parameter estimation in signal processing, automatic control and pattern recognition, owing to their algorithmic simplicity, and they have been shown to be faster than standard algorithms. Using the current parameters and new observations, stochastic gradient algorithms update the parameters recursively. They aim to maximize the expectation of a criterion [2],
$$C(\Phi) = E\left[J(x, \Phi)\right],$$
where the criterion $J(x, \Phi)$ measures the quality of the parameter $\Phi$ given the observation $x$. The stochastic gradient algorithm aiming to maximize the criterion $C$ is then written
$$\Phi^{(n+1)} = \Phi^{(n)} + \alpha_n \nabla_\Phi J(x_{n+1}, \Phi^{(n)}), \qquad (1)$$
where the learning rate $\alpha_n$ is a positive scalar or a positive definite matrix such that $\sum_n \|\alpha_n\| = \infty$ and $\sum_n \|\alpha_n\|^2 < \infty$. Contrary to general gradient algorithms, stochastic gradient algorithms use an ascent direction which only depends on the current observation $x_{n+1}$ and the parameter $\Phi^{(n)}$. Bottou [2] gives general conditions of convergence of algorithm (1). In practice, the sample sizes are very large but not infinite. In that case, the criterion $C(\Phi)$ is empirically the mean $\frac{1}{n}\sum_{i=1}^{n} J(x_i; \Phi)$, whose maximization is equivalent to the maximization of $\sum_{i=1}^{n} J(x_i; \Phi)$.
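As an illustration, here is a minimal sketch of the generic update (1), assuming a user-supplied gradient function `grad_J(x, phi)` and a scalar learning rate $\alpha_n = a/(n+1)$; the function names and the learning-rate schedule are illustrative choices, not part of the original method.

```python
import numpy as np

def stochastic_gradient(stream, phi0, grad_J, a=1.0):
    """Generic stochastic gradient ascent, cf. equation (1).

    stream  -- iterable yielding observations x_{n+1} one at a time
    phi0    -- initial parameter vector Phi^{(0)} (numpy array)
    grad_J  -- function (x, phi) -> gradient of J with respect to phi
    a       -- scale of the learning rate alpha_n = a / (n + 1)
    """
    phi = np.asarray(phi0, dtype=float)
    for n, x in enumerate(stream):
        alpha = a / (n + 1.0)               # step sizes sum to infinity, squared steps are summable
        phi = phi + alpha * grad_J(x, phi)  # ascent step using only the current observation
    return phi
```

The on-line algorithms of the following sections instantiate this scheme with the matrix learning rate $\frac{1}{n+1}[I_c(\Phi^{(n)})]^{-1}$ instead of a scalar step.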
3 On-line EM algorithm
This section shows how Titterington [10] derived a stochastic gradient algorithm from the EM algorithm. Given the observed data $x_n = (x_1, \ldots, x_n)$ and some initial parameter $\Phi^{(0)}$, the standard EM algorithm maximizes the log-likelihood
$$L(\Phi; x_n) = \log p(x_n; \Phi) = \sum_{i=1}^{n} \log f(x_i; \Phi)$$
by alternating the two following steps until convergence:
• E step (Expectation): computation of the expectation of the complete log-likelihood conditionally on the available data and the current parameter:
$$Q(\Phi, \Phi^{(q)}) = E[\log p(x_n, z_n; \Phi) \mid x_n, \Phi^{(q)}] = \sum_{i=1}^{n} E[\log p(x_i, z_i; \Phi) \mid x_i, \Phi^{(q)}] = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}^{(q)} \log[\pi_k f_k(x_i; \theta_k)],$$
where $z_n = (z_1, \ldots, z_n)$ and $t_{ik}^{(q)} = \pi_k^{(q)} f_k(x_i; \theta_k^{(q)}) / \sum_{\ell=1}^{K} \pi_\ell^{(q)} f_\ell(x_i; \theta_\ell^{(q)})$ is the posterior probability that $x_i$ arises from the $k$th component of the mixture. This step simply requires the computation of the posterior probabilities $t_{ik}^{(q)}$.
• M step (Maximization): maximization of $Q(\Phi, \Phi^{(q)})$ with respect to $\Phi$.
A partition of the data can then be obtained by assigning each observation to the component having the highest posterior probability.
To derive a stochastic algorithm from this formulation, Titterington [10] defined recursively, in the same way as for the EM algorithm, the quantity
$$Q_1(\Phi, \Phi^{(0)}) = E[\log p(x_1, z_1; \Phi) \mid x_1; \Phi^{(0)}],$$
$$Q_{n+1}(\Phi, \Phi^{(n)}) = Q_n(\Phi, \Phi^{(n-1)}) + E[\log p(x_{n+1}, z_{n+1}; \Phi) \mid x_{n+1}; \Phi^{(n)}], \qquad (2)$$
where $\Phi^{(n)}$ is the parameter maximizing $Q_n(\Phi, \Phi^{(n-1)})$. The subscript $n$ added to $Q$ specifies that, contrary to the standard EM algorithm, the quantity $Q_n$ depends on the observations $x_n = (x_1, \ldots, x_n)$ acquired up to time $n$, and the quantity $Q_{n+1}(\Phi, \Phi^{(n)})$ depends on the observations $x_{n+1} = (x_1, \ldots, x_{n+1})$ acquired up to time $n+1$. Maximizing $\frac{1}{n+1} Q_{n+1}(\,\cdot\,, \Phi^{(n)})$ using the Newton-Raphson method, and approximating the Hessian matrix by its expectation, namely the Fisher information matrix
$$I_c(\Phi^{(n)}) = -E\left[\frac{\partial^2 \log p(x, z; \Phi)}{\partial \Phi \, \partial \Phi^T}\right]\bigg|_{\Phi = \Phi^{(n)}}$$
associated with one complete observation $(x, z)$, results in the algorithm proposed by Titterington:
$$\Phi^{(n+1)} = \Phi^{(n)} + \frac{1}{n+1}\, [I_c(\Phi^{(n)})]^{-1}\, \nabla_\Phi \log f(x_{n+1}; \Phi^{(n)}). \qquad (3)$$
The Fisher information matrix $I_c(\Phi^{(n)})$ is positive definite when the complete data density belongs to the regular exponential family with natural parameter $\Phi$. In that case, Titterington's algorithm has the general form (1) of the stochastic gradient algorithms. For regular exponential family models, Wang and Zhao [11] have established, under very mild conditions, that algorithm (3) converges almost surely toward a parameter vector from the set $\{\Phi \; ; \; \nabla_\Phi E[\log f(x; \Phi)] = 0\}$, which contains local maxima, minima and saddle points.
4 An on-line clustering algorithm derived from the CEM algorithm
This section begins by recalling the Classification EM (CEM) algorithm [4] in the context of mixture models and then derives a stochastic algorithm from the CEM algorithm.
4.1 CEM algorithm
The Classification EM (CEM) algorithm is an iterative clustering algorithm which finds simultaneously the parameters and the classification. It maximizes, with respect to the component membership vector $z_n = (z_1, \ldots, z_n)$ and the parameter vector $\Phi$, the classification likelihood criterion
$$C(z_n, \Phi) = \log p(x_n, z_n; \Phi) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log[\pi_k f_k(x_i; \theta_k)], \qquad (4)$$
where $z_{ik}$ equals 1 if $z_i = k$ and 0 otherwise. The classification likelihood criterion is inspired by the criterion
$$C_1(z_n, \Phi) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log f_k(x_i; \theta_k)$$
proposed by Scott and Symons [9], where the sample $x_n = (x_1, \ldots, x_n)$ is supposed to be formed by separately taking observations of each component of the mixture. The CEM algorithm starts from an initial parameter $\Phi^{(0)}$ and alternates, at the $q$th iteration, the following steps until convergence:
• E step (Expectation): computation of the posterior probabilities $t_{ik}^{(q)}$;
• C step (Classification): assignment of each observation $x_i$ to the cluster $z_i^{(q)}$ which maximizes $t_{ik}^{(q)}$, $1 \le k \le K$;
• M step (Maximization): maximization of $C(z_n^{(q)}, \Phi)$ with respect to $\Phi$.
Thus, in the mixture model context, the CEM algorithm can be regarded as a classification version of the EM algorithm which incorporates a classification step between the E step and the M step of the EM algorithm. Celeux and Govaert [4] show that each iteration of the CEM algorithm increases the classification likelihood criterion and that convergence is reached in a finite number of iterations. When assuming Gaussian mixtures with equal proportions and covariance matrices equal to the identity matrix (spherical covariance matrices), the CEM algorithm is exactly the k-means algorithm. Thus, the CEM algorithm is a generalization of the k-means algorithm which can handle non-spherical covariance matrices and non-uniform proportions. In order to introduce the on-line CEM algorithm, the next section reformulates the CEM algorithm.
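Before moving on, here is a minimal sketch of one batch CEM iteration for a diagonal Gaussian mixture (the model used in our application); the function names `cem_iteration` and `log_gauss_diag` are illustrative and are not taken from the original paper.

```python
import numpy as np

def log_gauss_diag(X, mu, var):
    """Log-density of a diagonal Gaussian evaluated at each row of X."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)

def cem_iteration(X, pi, mu, var):
    """One E-C-M iteration of batch CEM for a diagonal Gaussian mixture.

    X   -- (n, p) data matrix
    pi  -- (K,) mixing proportions (float array, updated in place)
    mu  -- (K, p) component means (float array, updated in place)
    var -- (K, p) diagonal variances (float array, updated in place)
    """
    K = len(pi)
    # E step: complete-data log-densities log(pi_k f_k(x_i)) for every component
    logp = np.stack([np.log(pi[k]) + log_gauss_diag(X, mu[k], var[k]) for k in range(K)], axis=1)
    # C step: hard assignment of each observation to the most probable component
    z = logp.argmax(axis=1)
    # M step: re-estimate proportions, means and diagonal variances per cluster
    for k in range(K):
        Xk = X[z == k]
        if len(Xk) == 0:
            continue  # keep previous parameters for an empty cluster
        pi[k] = len(Xk) / len(X)
        mu[k] = Xk.mean(axis=0)
        var[k] = Xk.var(axis=0) + 1e-6  # small floor to avoid degenerate variances
    return z, pi, mu, var
```

Iterating this function until the partition no longer changes reproduces the behaviour described above (convergence of the criterion in a finite number of iterations).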
4.2 Another formulation of the CEM algorithm
The maximization of the classification likelihood criterion defined by equation (4) is equivalent to the maximization of the criterion
$$L_C(\Phi) = \max_{z_n}\left[\log p(x_n, z_n; \Phi)\right] = \sum_{i=1}^{n} \max_{k}\left[\log \pi_k f_k(x_i; \theta_k)\right].$$
Each iteration $q$ of the CEM algorithm thus consists in maximizing, with respect to $\Phi$, the quantity
$$R(\Phi, \Phi^{(q)}) = \log p(x_n, z_n^{(q)}; \Phi) = \sum_{i=1}^{n} \log p(x_i, z_i^{(q)}; \Phi),$$
where $z_i^{(q)}$ maximizes $p(x_i, z_i; \Phi^{(q)})$ with respect to $z_i \in \{1, \ldots, K\}$. Similarly to the Titterington approach [10], a stochastic gradient algorithm can be derived from this formulation.
4.3 On-line CEM algorithm
In this part, a stochastic gradient algorithm which incorporates an initial run of the CEM algorithm on $n_0$ observations is derived from the CEM algorithm. Let us define the quantity $R_n$ as follows:
$$R_{n_0}(\Phi, \Phi^{(n_0)}) = \sum_{i=1}^{n_0} \log p(x_i, z_i^{(n_0)}; \Phi),$$
$$R_{n_0+1}(\Phi, \Phi^{(n_0)}) = R_{n_0}(\Phi, \Phi^{(n_0)}) + \log p(x_{n_0+1}, z_{n_0+1}^{(n_0)}; \Phi), \qquad (5)$$
$$R_{n+1}(\Phi, \Phi^{(n)}) = R_n(\Phi, \Phi^{(n-1)}) + \log p(x_{n+1}, z_{n+1}^{(n)}; \Phi) \quad \forall n > n_0,$$
where the parameter vector $\Phi^{(n_0)}$ is obtained by a run of the standard CEM algorithm on the $n_0$ initial observations, $z_{n+1}^{(n)}$ maximizes $\log p(x_{n+1}, z_{n+1}; \Phi^{(n)})$ and $\Phi^{(n)}$ maximizes $R_n(\Phi, \Phi^{(n-1)})$ for $n > n_0$. The subscript $n$ added to $R$ is again used to specify that the quantity $R_n$ depends on $n$ observations. This definition supposes that the user can fix an initial number $n_0$ of observations to be processed with the standard CEM algorithm in order to start the on-line estimation with good initial parameters.
By maximizing $\frac{1}{n+1} R_{n+1}(\,\cdot\,, \Phi^{(n)})$ using the Newton-Raphson method and approximating the Hessian matrix by the Fisher information matrix $I_c(\Phi^{(n)})$ associated with one complete observation $(x, z)$, we get our new algorithm, given by the recursive formulae
$$z_{n+1}^{(n)} = \arg\max_{z} \log p(x_{n+1}, z; \Phi^{(n)}),$$
$$\Phi^{(n+1)} = \Phi^{(n)} + \frac{1}{n+1}\, [I_c(\Phi^{(n)})]^{-1}\, \nabla_\Phi \log p(x_{n+1}, z_{n+1}^{(n)}; \Phi^{(n)}), \quad n \ge n_0,$$
which is equivalent to
$$\Phi^{(n+1)} = \Phi^{(n)} + \frac{1}{n+1}\, [I_c(\Phi^{(n)})]^{-1}\, \nabla_\Phi \max_{z_{n+1}} \log p(x_{n+1}, z_{n+1}; \Phi^{(n)}). \qquad (6)$$
This last algorithm is recognizable as a stochastic gradient algorithm with the matrix learning rate $\frac{1}{n+1}[I_c(\Phi^{(n)})]^{-1}$, aiming to maximize the expected classification likelihood criterion $E[\max_z \log p(x, z; \Phi)]$. For many commonly used mixture models, like the Gaussian (used in our application), Poisson or exponential mixtures, the complete data distribution belongs to the regular exponential family. The next section focuses on algorithm (6) for regular exponential family models.
4.4 Exponential family model
This part shows how the derivation of a stochastic gradient algorithm from the CEM algorithm is simplified when the complete data have their distribution in the regular exponential family. The complete data $(x, z)$ have their distribution $p(x, z; \Phi)$ in the exponential family with natural parameter $\eta(\Phi) = (\eta_1(\Phi), \ldots, \eta_\ell(\Phi))$ and sufficient statistic $T(x, z) = (T_1(x, z), \ldots, T_\ell(x, z))$ if this distribution can be written
$$p(x, z; \eta) = \exp\left(\eta^T T(x, z) - a(\eta) + b(x, z)\right).$$
If $\ell = p$ ($p$ being the dimension of $\Phi$) and the $\eta_j(\Phi)$ and the $T_j(x, z)$ are linearly independent, the distribution of $(x, z)$ is said to belong to the regular exponential family. The re-parameterization with the expectation parameter $\Psi = E(T(x, z) \mid \eta)$ results in the following differentiation of the complete log-likelihood:
$$\frac{\partial \log p(x, z; \eta(\Psi))}{\partial \Psi} = \frac{\partial \eta}{\partial \Psi}\left(T(x, z) - \frac{\partial a}{\partial \eta}\right).$$
Using the basic relations $\frac{\partial \eta}{\partial \Psi} = I_c(\Psi)$ and $\Psi = \frac{\partial a}{\partial \eta}$ verified by the regular exponential family, we get
$$\frac{\partial \log p(x, z; \eta(\Psi))}{\partial \Psi} = I_c(\Psi)\,(T(x, z) - \Psi). \qquad (7)$$
Using this formula, the parameter $\Psi^{(n+1)}$ maximizing $R_{n+1}(\Psi, \Psi^{(n)})$, obtained by setting the derivative of $R_{n+1}(\Psi, \Psi^{(n)})$ with respect to $\Psi$ to zero, is written very simply as
$$\Psi^{(n+1)} = \frac{\sum_{i=1}^{n_0} T(x_i, z_i^{(n_0)}) + \sum_{i=n_0+1}^{n+1} T(x_i, z_i^{(i-1)})}{n+1},$$
which is equivalent to the recursive formula
$$\Psi^{(n+1)} = \Psi^{(n)} + \frac{1}{n+1}\left[T(x_{n+1}, z_{n+1}^{(n)}) - \Psi^{(n)}\right], \quad n \ge n_0. \qquad (8)$$
Moreover, by writing recursive formula (8) as
$$\Psi^{(n+1)} = \Psi^{(n)} + \frac{1}{n+1}\, I_c(\Psi^{(n)})^{-1}\, I_c(\Psi^{(n)}) \left[T(x_{n+1}, z_{n+1}^{(n)}) - \Psi^{(n)}\right]$$
and using relation (7), it can be deduced that equations (8) and (6) are equivalent. This equivalence justifies the approximation, made for general models in section 4.3, of the Hessian matrix associated with $\frac{1}{n+1} R_{n+1}$ by the Fisher information matrix. Consequently, when the complete data have their distribution in the regular exponential family with natural parameter $\eta$ and sufficient statistic $T(x, z)$, recursion (6) simplifies to recursion (8) once this re-parameterization is used.
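As a bridge to the Gaussian updates below, here is a short sketch, not spelled out explicitly in the paper, of the complete-data sufficient statistics one may take for a Gaussian mixture (with the usual loose treatment of the constraint on the proportions); applying recursion (8) to them component by component and re-expressing the result in terms of $(\pi_k, \mu_k, \Sigma_k)$ yields the update formulae that follow. Writing $z_k = 1$ if $z = k$ and 0 otherwise,
$$T(x, z) = \left(z_k,\; z_k\, x,\; z_k\, x x^T\right)_{k=1,\ldots,K},$$
$$\Psi = E[T(x, z)] = \left(\pi_k,\; \pi_k \mu_k,\; \pi_k(\Sigma_k + \mu_k \mu_k^T)\right)_{k=1,\ldots,K}.$$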
For the Gaussian mixture model, the algorithm resulting from the application of recursion (8) is written as follows:
• Initialization: compute the initial proportions $\pi_k^{(n_0)}$, mean vectors $\mu_k^{(n_0)}$, covariance matrices $\Sigma_k^{(n_0)}$ and initial numbers of observations $n_k^{(n_0)}$ of each cluster $k$ by running the standard CEM algorithm on $n_0$ observations; set $n = n_0$.
Repeat the two following steps while new observations are received:
• Step 1: assign each new observation $x_{n+1}$ to the class $k^*$ which maximizes the posterior probability
$$t_{n+1,k}^{(n)} = \frac{\pi_k^{(n)} f_k(x_{n+1}; \theta_k^{(n)})}{\sum_{\ell=1}^{K} \pi_\ell^{(n)} f_\ell(x_{n+1}; \theta_\ell^{(n)})}$$
and set $z_{n+1,k}^{(n)}$ equal to 1 if $k = k^*$ and 0 otherwise.
• Step 2: update the parameters:
$$n_k^{(n+1)} = n_k^{(n)} + z_{n+1,k}^{(n)},$$
$$\pi_k^{(n+1)} = \pi_k^{(n)} + \frac{1}{n+1}\left(z_{n+1,k}^{(n)} - \pi_k^{(n)}\right),$$
$$\mu_k^{(n+1)} = \mu_k^{(n)} + \frac{z_{n+1,k}^{(n)}}{n_k^{(n+1)}}\left(x_{n+1} - \mu_k^{(n)}\right),$$
$$\Sigma_k^{(n+1)} = \Sigma_k^{(n)} + \frac{z_{n+1,k}^{(n)}}{n_k^{(n+1)}}\left[\left(1 - \frac{z_{n+1,k}^{(n)}}{n_k^{(n+1)}}\right)\left(x_{n+1} - \mu_k^{(n)}\right)\left(x_{n+1} - \mu_k^{(n)}\right)^T - \Sigma_k^{(n)}\right].$$
Notice that the described algorithm does not require a stopping condition since each new observation $x_{n+1}$ is used only once. By considering a Gaussian mixture with identical proportions and spherical covariance matrices (equal to the identity matrix), and supposing that no observation is initially processed with CEM ($n_0 = 0$), the on-line k-means algorithm [8,3] is recovered. Given initial values $\mu_k^{(0)}$ and setting the $n_k^{(0)}$ to zero, the on-line k-means algorithm consists in estimating recursively the $K$ means $\mu_1, \ldots, \mu_K$ using the recursion
$$n_k^{(n+1)} = n_k^{(n)} + z_{n+1,k}^{(n)},$$
$$\mu_k^{(n+1)} = \mu_k^{(n)} + \frac{z_{n+1,k}^{(n)}}{n_k^{(n+1)}}\left(x_{n+1} - \mu_k^{(n)}\right), \quad n \ge 0,$$
where $z_{n+1,k}^{(n)}$ equals 1 if $k$ minimizes $\|x_{n+1} - \mu_k^{(n)}\|^2$ and 0 otherwise. Thus, the proposed algorithm is a generalization of the on-line k-means algorithm which can handle non-spherical clusters and non-uniform proportions. A sketch of the on-line update is given below.
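The following Python sketch illustrates the on-line updates above for a diagonal Gaussian mixture, the case used in our application; the class name `OnlineCEM` and the `hard` option, which replaces the hard assignment by the posterior weights and thus corresponds to the on-line EM variant used for comparison in section 5, are illustrative choices rather than details taken verbatim from the paper.

```python
import numpy as np

class OnlineCEM:
    """On-line CEM updates for a diagonal Gaussian mixture (sketch)."""

    def __init__(self, pi, mu, var, nk, n):
        # Initial values, e.g. obtained from a batch CEM run on n0 observations.
        self.pi, self.mu, self.var = pi, mu, var   # float arrays: (K,), (K, p), (K, p)
        self.nk, self.n = nk, n                    # per-cluster counts (K,) and total count

    def _log_post(self, x):
        """Unnormalized log posterior log(pi_k f_k(x)) for each component k."""
        return (np.log(self.pi)
                - 0.5 * np.sum(np.log(2 * np.pi * self.var)
                               + (x - self.mu) ** 2 / self.var, axis=1))

    def update(self, x, hard=True):
        """Process one new observation x (steps 1 and 2); hard=False gives on-line EM."""
        logp = self._log_post(x)
        if hard:
            z = np.zeros(len(self.pi))
            z[logp.argmax()] = 1.0                       # classification step (hard assignment)
        else:
            z = np.exp(logp - logp.max())
            z /= z.sum()                                 # posterior weights instead of hard labels
        self.n += 1
        self.nk = self.nk + z                            # n_k^{(n+1)} = n_k^{(n)} + z_{n+1,k}
        self.pi = self.pi + (z - self.pi) / self.n       # proportion update
        d = x - self.mu                                  # deviations from the old means, (K, p)
        w = np.divide(z, self.nk, out=np.zeros_like(z), where=self.nk > 0)[:, None]
        self.mu = self.mu + w * d                        # mean update
        self.var = self.var + w * ((1.0 - w) * d ** 2 - self.var)  # diagonal covariance update
        return int(z.argmax())
```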
4.5 Convergence analysis for the exponential family models
For general models, not necessarily from the exponential family, Bottou [2] gives general conditions of convergence. Only the convergence conditions for the regular exponential family are given in this section. The convergence theorem of Wang and Zhao [11] for the on-line EM algorithm can be adapted to the on-line CEM algorithm by considering the complete data distribution instead of the observed data distribution. According to the resulting theorem, if the following conditions are fulfilled:
(i) $\| I_c(\Psi)^{-1} \| < \infty$,
(ii) $\| \nabla_\Psi \log p(x, z; \Psi) \| < \infty$,
(iii) $\{\Psi^{(n)}\}_n$ remains in a compact set,
then the sequence of parameters given by algorithm (6) converges almost surely toward a parameter of the set $\{\Psi \; ; \; \nabla_\Psi E[\max_z \log p(x, z; \Psi)] = 0\}$, which may contain local maxima, minima or saddle points.
5 Experiments using simulated data
This section evaluates the on-line CEM algorithm in terms of precision and computing time. Although the proposed algorithm is general, the simulations are restricted to two-dimensional data sets corresponding to a Gaussian mixture with diagonal covariance matrices, owing to the assumptions made in the application which gave rise to this study: acoustic emissions are located within a plane, the aim being to detect damaged zones which are usually found horizontally and vertically along welding lines. The whole point of this study, however, is to handle large data sets in real time. In all these simulations, the results obtained with the on-line CEM algorithm are compared to the results yielded by the CEM algorithm, which is a good reference for our problem in terms of precision, and also to the results obtained with the on-line EM algorithm. The considered on-line EM algorithm is simply the on-line CEM algorithm for the Gaussian mixture where $z_{n+1,k}^{(n)}$ is replaced with the posterior probability $t_{n+1,k}^{(n)}$. Three different types of experiments using simulated data have been considered: the first analyses the effect of the initial number $n_0$ of observations classified with CEM or EM; the second and the third sets of experiments are designed to compare CEM, on-line CEM and on-line EM in terms of precision and computing time.
5.1 Protocol of the experiments
The protocol of all the simulations is as follows: n observations are generated according to a mixture of K bivariate Gaussian densities; the standard CEM algorithm is applied to the n observations; the CEM and EM algorithms are initially applied to a small number $n_0$ of observations and the on-line algorithms are applied sequentially to the rest of the observations. The values of n vary from 1000 to 20000 by steps of 1000 and the values of $n_0$ belong to the set {10, 20, 30, 40, 50, 100, 200, 300, 400, 500}. Given a data set of n or $n_0$ observations, EM and CEM are initialized as follows: the K Gaussian density centers are initialized with K centers chosen among the available observations; the covariance matrices are initialized with the covariance matrix of the available sample and the proportions are set to $\frac{1}{K}$. Both EM and CEM start with 30 different initializations and only the solution which provides the greatest likelihood is selected. Given a solution provided by CEM, on-line CEM or on-line EM, the misclassification rate with respect to the true simulated partition and the CPU times are computed. We should point out that the processor used for all the simulations is a 2.5 GHz Pentium 4.
5.2 Influence of the number n0 on the on-line CEM algorithm
The effect of the number $n_0$ of observations initially classified with CEM on the partition provided by on-line CEM is studied by considering bivariate Gaussian mixtures corresponding to two kinds of models: a model with two elliptical clusters which have the same orientation and a model with two elliptical clusters which have different orientations. For each model three overlapping zones were considered, corresponding to 5%, 12% and 20% theoretical Bayes error. Six data structures were thus obtained: mixture models A1, A2 and A3 of two elliptical clusters with the same orientation, associated respectively with 5%, 12% and 20% theoretical Bayes error, and mixture models B1, B2 and B3 of two elliptical clusters with different orientations, associated respectively with 5%, 12% and 20% theoretical Bayes error. The proportions and covariance matrices of mixtures A1, A2, A3 are $\pi_1 = \pi_2 = 1/2$, $\Sigma_1 = \Sigma_2 = \mathrm{diag}(1/4; 4)$. The Gaussian density centers are (0; 0), (1.5; 2.5) for mixture A1, (0; 0), (1; 2.5) for mixture A2 and (0; 0), (0.6; 2.5) for mixture A3. The proportions and covariance matrices of mixtures B1, B2, B3 are $\pi_1 = \pi_2 = 1/2$, $\Sigma_1 = \mathrm{diag}(1/3; 3)$, $\Sigma_2 = \mathrm{diag}(3; 1/3)$, where $\mathrm{diag}(a; b)$ is the diagonal matrix whose diagonal components vector is (a; b). The Gaussian density centers are $\mu_1 = (0; 0)$, $\mu_2 = (3.4; 0)$ for mixture B1, $\mu_1 = (0; 0)$, $\mu_2 = (2.2; 0)$ for mixture B2 and $\mu_1 = (0; 0)$, $\mu_2 = (0; 0)$ for mixture B3. For each of these data structures and each value of n, we generated 25 different samples. Figure 3 shows examples of data from mixtures A2 and B2.
Fig. 3. Example of simulation of Mixtures A2 and B2
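For reproducibility, here is a minimal sketch of how such samples can be generated, using the parameters of mixture A2 quoted above; the function name `sample_mixture` is an illustrative choice and not part of the original experimental code.

```python
import numpy as np

def sample_mixture(n, pis, mus, covs, rng=None):
    """Draw n points and their component labels from a bivariate Gaussian mixture."""
    rng = np.random.default_rng(rng)
    z = rng.choice(len(pis), size=n, p=pis)                           # component labels
    X = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in z])
    return X, z

# Mixture A2: equal proportions, common covariance diag(1/4, 4),
# centers (0, 0) and (1, 2.5) -- about 12% theoretical Bayes error.
X, z = sample_mixture(1000,
                      pis=[0.5, 0.5],
                      mus=[(0.0, 0.0), (1.0, 2.5)],
                      covs=[np.diag([0.25, 4.0])] * 2)
```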
For each sample, the strategy described in section 5.1 is applied, and the misclassification rates and CPU times are averaged over the 25 different samples. Figures 4 and 5 report, as a function of the number $n_0$ of observations initially processed with CEM, the misclassification rates obtained respectively for mixtures (A1, B1) and (A2, B2) when n = 1000. The solution yielded by the CEM algorithm is represented by a solid line, and those obtained by on-line CEM and on-line EM are represented by dotted lines. For each mixture a rapid improvement of the misclassification rate is observed for the on-line algorithms, until $n_0$ = 50. From $n_0$ = 100, the partitions provided by the on-line algorithms coincide with the partitions given by CEM. Similar behavior has been observed for values of n greater than 1000.
Fig. 4. Misclassification rate obtained with CEM, on-line CEM and on-line EM in relation to the number n0 of observations initially processed with CEM or EM for mixtures A1 (left) and B1 (right) with n = 1000 observations.
Fig. 5. Misclassification rate obtained with CEM, on-line CEM and on-line EM in relation to the number n0 of observations initially processed with CEM or EM for mixtures A2 (left) and B2 (right) with n = 1000 observations.
When the class overlap is relatively high (see figure 6), the stabilization of the two on-line algorithms is also observed from $n_0$ = 100. However, due to the well-known poor performance [4] of CEM-type algorithms when clusters are not well separated, on-line EM appears to be better than CEM and on-line CEM.
Fig. 6. Misclassification rate obtained with CEM, on-line CEM and on-line EM in relation to the number n0 of observations initially processed with CEM or EM for mixtures A3 (left) and B3 (right) with n = 1000 observations.
Thus, for the remaining experiments, the on-line algorithms were applied with $n_0$ = 200 observations initially processed with CEM or EM.
5.3 Comparison with CEM in terms of quality
Many simulations were performed to compare on-line CEM with CEM and on-line EM as regards the quality of the estimation obtained, but only the most representative situations are described in this set of experiments. The two kinds of models presented are: models C1, C2 and C3 of three elliptical clusters with the same orientation and proportions, and models D1, D2 and D3 composed of three elliptical clusters with different orientations and non-uniform proportions. The parameters of models C1, C2 and C3, corresponding to three overlapping degrees (5%, 12% and 20%), are the following: for each model, the proportions are $\pi_1 = \pi_2 = \pi_3 = 1/3$ and the covariance matrices are $\Sigma_1 = \Sigma_2 = \Sigma_3 = \mathrm{diag}(1/4; 4)$. The Gaussian density centers are (−1.6; 2.5), (0; 0), (1.6; 2.5) for mixture C1, (−1.2; 2.5), (0; 0), (1.2; 2.5) for mixture C2 and (−0.8; 2.5), (0; 0), (0.8; 2.5) for mixture C3. The parameters of models D1, D2 and D3 are the following: for each model, the proportions are $\pi_1 = 0.4$, $\pi_2 = 0.2$ and $\pi_3 = 0.4$. The covariance matrices are $\Sigma_1 = \mathrm{diag}(1/4; 4)$, $\Sigma_2 = \mathrm{diag}(4; 1/4)$, $\Sigma_3 = \mathrm{diag}(1/4; 4)$. For model D1, the Gaussian density centers are $\mu_1 = (−2; 0)$, $\mu_2 = (0.1; 3.7)$, $\mu_3 = (2.3; 0)$; for mixture D2, the Gaussian density centers are $\mu_1 = (−2; 0)$, $\mu_2 = (−0.7; 2.2)$, $\mu_3 = (0.5; 0)$; for mixture D3 the Gaussian density centers are $\mu_1 = (−2; 0)$, $\mu_2 = (−1.3; 1.1)$, $\mu_3 = (−0.7; 0)$. For each of these data structures, 25 different samples of size n were generated. Figure 7 shows examples of data from mixtures C2 and D2.
Fig. 7. Example of simulation of Mixtures C2 and D2
Figures 8 and 9 display the misclassification rate with respect to the sample size n obtained with the three algorithms, respectively for mixtures (C1, D1) and (C2, D2). A rapid stabilization of all the algorithms can be observed and the partitions of the on-line algorithms are very similar to those of CEM. For mixture D2, the misclassification percentages of the on-line algorithms are slightly greater than those of CEM. This phenomenon can be attributed to the estimation of the non-uniform proportions.
Fig. 8. Misclassification rate obtained with CEM, on-line CEM and on-line EM in relation to the sample size n for mixtures C1 (left) and D1 (right) with n0 = 200 observations.
Fig. 9. Misclassification rate obtained with CEM, on-line CEM and on-line EM in relation to the sample size n for mixtures C2 (left) and D2 (right) with n0 = 200 observations.
Not surprisingly, when the class overlap is relatively high (see figure 10), the on-line CEM algorithm exhibits poor results, particularly for mixture D3. Again, this behaviour can be attributed to the notoriously poor performance of CEM-type algorithms [4] when clusters are not well separated, which is even more pronounced when estimating non-uniform proportions.
Fig. 10. Misclassification rate obtained with CEM, on-line CEM and on-line EM in relation to the sample size n for mixtures C3 (left) and D3 (right) with n0 = 200 observations.
5.4 Comparison with CEM in terms of speed
The speed of on-line CEM has been compared with that of the CEM and on-line EM algorithms using the same simulations used for the comparison in terms of quality. Figure 11 represents the CPU times in seconds for the three algorithms with respect to the sample size n. Here only the case of 12% Bayes error (mixtures C2 and D2) is represented, since the other two cases (5% and 20%) show approximately the same behavior.
It can be observed that the CPU times given by the on-line algorithms vary very slowly with sample size, while the CPU time for the standard CEM algorithm grows considerably with the sample size. In particular, for 20000 observations, CEM is about six times slower than the two on-line algorithms for mixture D2. These experiments clearly show that our proposed on-line CEM algorithm is more efficient than the CEM algorithm in terms of speed.
Fig. 11. CPU time (in seconds) in relation to the sample size n for mixtures C2 (left) and D2 (right)
6 Results on real acoustic emission data
The main motivation of this work, as stated in the introduction, was to develop a computer-aided decision procedure to assist the detection, in real time, of damaged zones on the surface of a gas tank, through the use of acoustic emissions. The goal is the non-destructive detection of imperfections. When subjected to variations in pressure the tank surface emits noises. Each noise or acoustic emission is located and characterized by 16 variables including its spatial coordinates on the tank and other variables such as maximum amplitude, energy and duration. Experts are in agreement that spatial concentrations of acoustic emissions, identified using spatial coordinates, are of primary importance in the detection of damaged zones. Other features of acoustic emissions are useful in distinguishing between major and minor flaws once damage has been detected, but spatial concentrations are the key factor on which detection relies. Our method, therefore, consists of two steps:
• Identification of spatial concentrations (sources) of acoustic signals. This is done by clustering the acoustic emission events, which allows the detection of zones where acoustic emissions are concentrated.
• Separation of the identified clusters into different categories according to the severity of the imperfection: these categories are termed minor, active and critical.
In the opinion of specialists in the field, the method we have described produces satisfactory results using the CEM algorithm for the first step, so long as the number of acoustic emissions does not exceed 10000. When there are more than 10000 emissions the CEM clustering step becomes too slow (more than a few seconds' delay) for a real-time application. To evaluate our new strategy, we performed a comparison of CEM and on-line CEM on a real data set of 2601 acoustic emissions (see Figure 2). On-line EM has not been tested because the only available reference partition is that of CEM, which cannot be a valid reference for EM. The number of clusters is selected by the following strategy:
• CEM and on-line CEM are run on the 2601 acoustic emissions for numbers of clusters from 1 to 15;
• the number of clusters is selected by maximizing the integrated classification likelihood (ICL) criterion [1]
$$\mathrm{ICL}(K) = \log p(x_n, z_n; \Phi^*) - \frac{\nu_K}{2} \log(n),$$
where $\Phi^*$ is the parameter vector obtained with CEM or on-line CEM and $\nu_K$ is the number of free parameters of the model.
Both strategies, using CEM or on-line CEM, select the model with 9 clusters.
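As an illustration of this model selection step, here is a minimal sketch of the ICL computation; the helper name `icl` and the way $\nu_K$ is counted for a diagonal Gaussian mixture (K−1 proportions, K·p means, K·p diagonal variances) are stated assumptions rather than details spelled out in the paper.

```python
import numpy as np

def icl(complete_loglik, K, n, p):
    """ICL(K) = log p(x_n, z_n; Phi*) - (nu_K / 2) * log(n).

    complete_loglik -- classification log-likelihood at the CEM (or on-line CEM) solution
    K, n, p         -- number of clusters, sample size, data dimension
    Assumed free-parameter count for a diagonal Gaussian mixture:
    (K - 1) proportions + K*p means + K*p diagonal variances.
    """
    nu_K = (K - 1) + 2 * K * p
    return complete_loglik - 0.5 * nu_K * np.log(n)

# Model selection sketch: pick the K in 1..15 with the largest ICL value,
# given hypothetical classification log-likelihoods returned by the CEM runs.
# best_K = max(range(1, 16), key=lambda K: icl(logliks[K], K, n=2601, p=2))
```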
Fig. 12. CEM classification obtained from real acoustic emissions
A misclassification percentage of 1.57% between CEM and on-line CEM is obtained. This shows that the results obtained with on-line CEM are close to the results of the CEM algorithm. We have therefore achieved our original aim of obtaining partitions similar to the CEM partitions but in less time. Figure 13 shows the resulting partition given by on-line CEM. In the decision step following this first clustering step, the vertically more elongated cluster observed was found to be the only flaw cluster. This local region is in fact a welding region, and our result corresponds to the presence of a real flaw. The procedure described above is not yet industrialized, but is in the testing phase.
Fig. 13. On-line CEM classification obtained from real acoustic emissions
7 Conclusion
An on-line clustering algorithm was proposed to obtain rapid and effective clustering of acoustic emissions on the surface of a gas tank in order to detect flaws.
The proposed algorithm is a stochastic gradient version of the Classification EM algorithm (CEM) which incorporates an initial run of CEM on a few observations. It produces partitions close to those computed with the CEM algorithm for relatively well separated clusters. Almost no differences are observed between the CEM and on-line CEM partitions when the number $n_0$ of observations initially classified by CEM is greater than 100. When the overlap between clusters is high, the experimental study has revealed that on-line EM performs better.
When the complete data distribution belongs to the exponential family, the algorithm is written in a very simple form. The on-line k-means algorithm introduced by MacQueen [8] is recovered when a Gaussian mixture with spherical covariance matrices is considered. The proposed algorithm could also be applied to the 28 Gaussian parsimonious models proposed by Celeux and Govaert [5] to handle specific shapes of clusters.
The execution time of on-line CEM does not vary very much, while the execution time of CEM increases significantly as the number of available observations increases. This algorithm therefore represents an efficient alternative for clustering large data sets.
References
[1] Biernacki C., Celeux G. and Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7):719-725, 2000.
[2] Bottou L. Une approche théorique de l'apprentissage connexionniste; applications à la reconnaissance de la parole. Thèse de Doctorat, Université d'Orsay, 1991.
[3] Bottou L. and Bengio Y. Convergence properties of the k-means algorithm. In G. Tesauro et al. (Eds.), Advances in Neural Information Processing Systems, Volume 7, pages 585-592, MIT Press, 1995.
[4] Celeux G. and Govaert G. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics and Data Analysis, 14:315-332, 1992.
[5] Celeux G. and Govaert G. Gaussian parsimonious clustering models. Pattern Recognition, 28(5):781-793, 1995.
[6] Dempster A. P., Laird N. M. and Rubin D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[7] Liu Z., Almhana J., Choulakian V. and McGorman R. On-line EM algorithm for mixture with application to internet traffic modeling. Computational Statistics and Data Analysis (to appear).
[8] MacQueen J. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematics, Statistics and Probability, 1:281-298, 1967.
[9] Scott A. J. and Symons M. J. Clustering methods based on likelihood ratio criteria. Biometrics, 27:387-397, 1971.
[10] Titterington D. M. Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society, Series B, 46:257-267, 1984.
[11] Wang S. and Zhao Y. Almost sure convergence of Titterington's recursive estimator for mixture models. IEEE International Symposium on Information Theory (ISIT), 2002.