An on-line Classification EM algorithm based on mixture model
Allou Samé, Christophe Ambroise, Gérard Govaert Université de Technologie de Compiègne HEUDIASYC, UMR CNRS 6599 BP 20529, 60205 Compiègne Cedex, FRANCE
Abstract
Mixture model-based clustering is widely used in many applications. In real-time applications, data are received sequentially and classification parameters have to be updated quickly. An on-line clustering algorithm based on mixture models is presented in the context of a real-time flaw-diagnosis application for pressurized containers. The available data for this application are acoustic emission signals. The proposed algorithm is a stochastic gradient algorithm derived from the classification version of the EM algorithm (CEM). It provides a model-based generalization of the well-known on-line k-means algorithm, able to handle non-spherical clusters when specific Gaussian mixture models are used. Using synthetic and real data sets, the proposed algorithm is compared to the batch CEM algorithm and to the on-line EM algorithm. The three approaches generate comparable solutions in terms of the resulting partition when the clusters are relatively well separated, but the on-line algorithms become faster as the number of available observations increases.
Key words: clustering, mixture model, EM, CEM, stochastic gradient, exponential family
Preprint submitted to Elsevier Science
13 December 2005
∗ Corresponding author: Gérard Govaert, Université de Technologie de Compiègne, HEUDIASYC, UMR CNRS 6599, BP 20529, 60205 Compiègne Cedex, FRANCE. E-mail:
[email protected], Tel: 33 (0)3 44 23 44 86, Fax: 33 (0)3 44 23 44 77.
1 Introduction
In many real applications, data are received sequentially and the overall sample size can become very large. In that context, classical batch algorithms like the EM algorithm [6] are not able to compute classification parameters in real time. On-line parameter estimation using mixture models has already been addressed by many authors (e.g. Titterington [10]; Wang and Zhao [11]). More recently, Liu et al. [7] have considered, for internet traffic modelling, a recursive EM algorithm based on Poisson mixture models. Our motivation was a real-time flaw detection problem for pressurized containers using acoustic emissions. The pressurized containers used are cylindrical tanks containing fluids under pressure (see figure 1).
Fig. 1. Example of cylindrical tank
The problem consists in verifying non-destructively the absence of flaws. A computer-aided-decision method was devised. The operator reaches a decision (presence or absence of flaws) using a two-dimensional representation of the tank surface. Each acoustic emission from the tank is described by p variables related to the characteristics of the acoustic signal as well as to its location in relation to the two-dimensional representation of the tank (see figure 2). The computer-aided-decision method we developed is sequential and consists
Fig. 2. Real localisations of acoustic emissions
of two steps:
• clustering of the acoustic emission events, in order to find zones where the acoustic emissions are concentrated;
• classification of the identified clusters as either normal or flawed clusters.
The first step is performed using the classification version of the EM algorithm (CEM) [4]. At this stage the locations of the acoustic emissions are the only data taken into account. This algorithm was chosen for its speed compared to other clustering algorithms such as EM [6]. The CEM algorithm is applied assuming a mixture of Gaussian densities with diagonal covariance matrices, since experts agree that flaws generally appear along welding zones, which are either horizontally or vertically oriented. The number of clusters is adjusted using the Integrated Classification Likelihood (ICL) criterion [1]. During the second step, features describing each cluster, in terms of the acoustic emission properties, are computed. Using these features, a decision (presence or absence of flaws on the tank surface) is communicated to the operator. This decision is obtained using standard discrimination methods such as Bayesian discrimination with Gaussian densities.
The method described above is often satisfactory but becomes slow when more than 10000 acoustic emissions have to be considered. This is due to the CEM algorithm, whose execution time increases significantly with data size. Our aim was to improve this clustering step by developing a faster classification algorithm without losing the accuracy of the CEM algorithm.
For this purpose, an on-line mixture model-based algorithm is proposed. This algorithm is a stochastic gradient algorithm derived from the CEM algorithm.
Data are supposed to be independent observations $x_1, \ldots, x_n, \ldots$ sequentially received and distributed following a mixture density of $K$ components, defined on $\mathbb{R}^p$ by
$$f(x; \Phi) = \sum_{k=1}^{K} \pi_k f_k(x; \theta_k),$$
with $\Phi = (\pi_1, \ldots, \pi_K, \theta_1, \ldots, \theta_K)$, where $\pi_1, \ldots, \pi_K$ denote the proportions of the mixture and $\theta_1, \ldots, \theta_K$ the parameters of each density component. We denote by $z_1, \ldots, z_n, \ldots$ the classes associated with the observations, where $z_n \in \{1, \ldots, K\}$ corresponds to the class of $x_n$.
The paper is organized as follows. The second section describes stochastic gradient algorithms in the context of parameter estimation. The third section shows how an on-line algorithm has been derived from the EM algorithm by Titterington [10]. In the fourth section, our on-line clustering algorithm derived from the CEM algorithm is presented. An experimental study is summarized in the fifth section and the proposed algorithm is applied to a real data set in the sixth section.
2 Stochastic gradient algorithms
Since stochastic gradient algorithms have been chosen to estimate the parameters of the mixture model, this section introduces them in a parameter estimation context. Stochastic gradient algorithms are generally used for on-line parameter estimation in signal processing, automatic control and pattern recognition, owing to their algorithmic simplicity, and they have been shown to be faster than standard algorithms. Using the current parameters and new observations, stochastic gradient algorithms update the parameters recursively. They aim to maximize the expectation of a criterion [2],
$$C(\Phi) = E\left[J(x, \Phi)\right],$$
where the criterion $J(x, \Phi)$ measures the quality of the parameter $\Phi$ given the observation $x$. The stochastic gradient algorithm aiming to maximize the criterion $C$ is then written
$$\Phi^{(n+1)} = \Phi^{(n)} + \alpha_n \nabla_\Phi J(x_{n+1}, \Phi^{(n)}), \qquad (1)$$
where the learning rate $\alpha_n$ is a positive scalar or a positive definite matrix such that $\sum_n \|\alpha_n\| = \infty$ and $\sum_n \|\alpha_n\|^2 < \infty$. Contrary to general gradient algorithms, stochastic gradient algorithms use an ascent direction which only depends on the current observation $x_{n+1}$ and the parameter $\Phi^{(n)}$. Bottou [2] gives general conditions of convergence of algorithm (1). In practice, the sample sizes are very large but not infinite. In that case, the criterion $C(\Phi)$ is empirically the mean $\frac{1}{n}\sum_{i=1}^{n} J(x_i; \Phi)$, whose maximization is equivalent to the maximization of $\sum_{i=1}^{n} J(x_i; \Phi)$.
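As an illustration, here is a minimal sketch of the generic update (1), assuming a user-supplied gradient function `grad_J(x, phi)` and a scalar learning rate $\alpha_n = a/(n+1)$; the function names and the learning-rate schedule are illustrative choices, not part of the original method.

```python
import numpy as np

def stochastic_gradient(stream, phi0, grad_J, a=1.0):
    """Generic stochastic gradient ascent, cf. equation (1).

    stream  -- iterable yielding observations x_{n+1} one at a time
    phi0    -- initial parameter vector Phi^{(0)} (numpy array)
    grad_J  -- function (x, phi) -> gradient of J with respect to phi
    a       -- scale of the learning rate alpha_n = a / (n + 1)
    """
    phi = np.asarray(phi0, dtype=float)
    for n, x in enumerate(stream):
        alpha = a / (n + 1.0)               # step sizes sum to infinity, squared steps are summable
        phi = phi + alpha * grad_J(x, phi)  # ascent step using only the current observation
    return phi
```

The on-line algorithms of the following sections instantiate this scheme with the matrix learning rate $\frac{1}{n+1}[I_c(\Phi^{(n)})]^{-1}$ instead of a scalar step.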
3 On-line EM algorithm
This section shows how Titterington [10] derived a stochastic gradient algorithm from the EM algorithm. Given the observed data $x_n = (x_1, \ldots, x_n)$ and some initial parameter $\Phi^{(0)}$, the standard EM algorithm maximizes the log-likelihood
$$L(\Phi; x_n) = \log p(x_n; \Phi) = \sum_{i=1}^{n} \log f(x_i; \Phi)$$
by alternating the two following steps until convergence:
• E step (Expectation): computation of the expectation of the complete log-likelihood conditionally on the available data and the current parameter:
$$Q(\Phi, \Phi^{(q)}) = E[\log p(x_n, z_n; \Phi) \mid x_n, \Phi^{(q)}] = \sum_{i=1}^{n} E[\log p(x_i, z_i; \Phi) \mid x_i, \Phi^{(q)}] = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}^{(q)} \log[\pi_k f_k(x_i; \theta_k)],$$
where $z_n = (z_1, \ldots, z_n)$ and $t_{ik}^{(q)} = \pi_k^{(q)} f_k(x_i; \theta_k^{(q)}) / \sum_{\ell=1}^{K} \pi_\ell^{(q)} f_\ell(x_i; \theta_\ell^{(q)})$ is the posterior probability that $x_i$ arises from the $k$th component of the mixture. This step simply requires the computation of the posterior probabilities $t_{ik}^{(q)}$.
• M step (Maximization): maximization of $Q(\Phi, \Phi^{(q)})$ with respect to $\Phi$.
A partition of the data can then be obtained by assigning each observation to the component having the highest posterior probability.
To derive a stochastic algorithm from this formulation, Titterington [10] defined recursively, in the same way as for the EM algorithm, the quantity
$$Q_1(\Phi, \Phi^{(0)}) = E[\log p(x_1, z_1; \Phi) \mid x_1; \Phi^{(0)}],$$
$$Q_{n+1}(\Phi, \Phi^{(n)}) = Q_n(\Phi, \Phi^{(n-1)}) + E[\log p(x_{n+1}, z_{n+1}; \Phi) \mid x_{n+1}; \Phi^{(n)}], \qquad (2)$$
where $\Phi^{(n)}$ is the parameter maximizing $Q_n(\Phi, \Phi^{(n-1)})$. The subscript $n$ added to $Q$ specifies that, contrary to the standard EM algorithm, the quantity $Q_n$ depends on the observations $x_n = (x_1, \ldots, x_n)$ acquired up to time $n$, and the quantity $Q_{n+1}(\Phi, \Phi^{(n)})$ depends on the observations $x_{n+1} = (x_1, \ldots, x_{n+1})$ acquired up to time $n+1$. Maximizing $\frac{1}{n+1} Q_{n+1}(\,\cdot\,, \Phi^{(n)})$ using the Newton-Raphson method, and approximating the Hessian matrix by its expectation, namely the Fisher information matrix
$$I_c(\Phi^{(n)}) = -E\left[\frac{\partial^2 \log p(x, z; \Phi)}{\partial \Phi \, \partial \Phi^T}\right]\bigg|_{\Phi = \Phi^{(n)}}$$
associated with one complete observation $(x, z)$, results in the algorithm proposed by Titterington:
$$\Phi^{(n+1)} = \Phi^{(n)} + \frac{1}{n+1}\, [I_c(\Phi^{(n)})]^{-1}\, \nabla_\Phi \log f(x_{n+1}; \Phi^{(n)}). \qquad (3)$$
The Fisher information matrix $I_c(\Phi^{(n)})$ is positive definite when the complete data density belongs to the regular exponential family with natural parameter $\Phi$. In that case, Titterington's algorithm has the general form (1) of the stochastic gradient algorithms. For regular exponential family models, Wang and Zhao [11] have established, under very mild conditions, that algorithm (3) converges almost surely toward a parameter vector from the set $\{\Phi \; ; \; \nabla_\Phi E[\log f(x; \Phi)] = 0\}$, which contains local maxima, minima and saddle points.
4 An on-line clustering algorithm derived from the CEM algorithm
This section begins by recalling the Classification EM (CEM) algorithm [4] in the context of mixture models and then derives a stochastic algorithm from the CEM algorithm.
4.1 CEM algorithm
The Classification EM (CEM) algorithm is an iterative clustering algorithm which finds simultaneously the parameters and the classification. It maximizes, with respect to the component membership vector $z_n = (z_1, \ldots, z_n)$ and the parameter vector $\Phi$, the classification likelihood criterion
$$C(z_n, \Phi) = \log p(x_n, z_n; \Phi) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log[\pi_k f_k(x_i; \theta_k)], \qquad (4)$$
where $z_{ik}$ equals 1 if $z_i = k$ and 0 otherwise. The classification likelihood criterion is inspired by the criterion
$$C_1(z_n, \Phi) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log f_k(x_i; \theta_k)$$
proposed by Scott and Symons [9], where the sample $x_n = (x_1, \ldots, x_n)$ is supposed to be formed by separately taking observations of each component of the mixture. The CEM algorithm starts from an initial parameter $\Phi^{(0)}$ and alternates, at the $q$th iteration, the following steps until convergence:
• E step (Expectation): computation of the posterior probabilities $t_{ik}^{(q)}$;
• C step (Classification): assignment of each observation $x_i$ to the cluster $z_i^{(q)}$ which maximizes $t_{ik}^{(q)}$, $1 \le k \le K$;
• M step (Maximization): maximization of $C(z_n^{(q)}, \Phi)$ with respect to $\Phi$.
Thus, in the mixture model context, the CEM algorithm can be regarded as a classification version of the EM algorithm which incorporates a classification step between the E step and the M step of the EM algorithm. Celeux and Govaert [4] show that each iteration of the CEM algorithm increases the classification likelihood criterion and that convergence is reached in a finite number of iterations. When assuming Gaussian mixtures with equal proportions and covariance matrices equal to the identity matrix (spherical covariance matrices), the CEM algorithm is exactly the k-means algorithm. Thus, the CEM algorithm is a generalization of the k-means algorithm which can handle non-spherical covariance matrices and non-uniform proportions. In order to introduce the on-line CEM algorithm, the next section reformulates the CEM algorithm.
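Before moving on, here is a minimal sketch of one batch CEM iteration for a diagonal Gaussian mixture (the model used in our application); the function names `cem_iteration` and `log_gauss_diag` are illustrative and are not taken from the original paper.

```python
import numpy as np

def log_gauss_diag(X, mu, var):
    """Log-density of a diagonal Gaussian evaluated at each row of X."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)

def cem_iteration(X, pi, mu, var):
    """One E-C-M iteration of batch CEM for a diagonal Gaussian mixture.

    X   -- (n, p) data matrix
    pi  -- (K,) mixing proportions (float array, updated in place)
    mu  -- (K, p) component means (float array, updated in place)
    var -- (K, p) diagonal variances (float array, updated in place)
    """
    K = len(pi)
    # E step: complete-data log-densities log(pi_k f_k(x_i)) for every component
    logp = np.stack([np.log(pi[k]) + log_gauss_diag(X, mu[k], var[k]) for k in range(K)], axis=1)
    # C step: hard assignment of each observation to the most probable component
    z = logp.argmax(axis=1)
    # M step: re-estimate proportions, means and diagonal variances per cluster
    for k in range(K):
        Xk = X[z == k]
        if len(Xk) == 0:
            continue  # keep previous parameters for an empty cluster
        pi[k] = len(Xk) / len(X)
        mu[k] = Xk.mean(axis=0)
        var[k] = Xk.var(axis=0) + 1e-6  # small floor to avoid degenerate variances
    return z, pi, mu, var
```

Iterating this function until the partition no longer changes reproduces the behaviour described above (convergence of the criterion in a finite number of iterations).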
4.2 Another formulation of the CEM algorithm
The maximization of the classification likelihood criterion defined by equation (4) is equivalent to the maximization of the criterion
$$L_C(\Phi) = \max_{z_n}\left[\log p(x_n, z_n; \Phi)\right] = \sum_{i=1}^{n} \max_{k}\left[\log \pi_k f_k(x_i; \theta_k)\right].$$
Each iteration $q$ of the CEM algorithm thus consists in maximizing, with respect to $\Phi$, the quantity
$$R(\Phi, \Phi^{(q)}) = \log p(x_n, z_n^{(q)}; \Phi) = \sum_{i=1}^{n} \log p(x_i, z_i^{(q)}; \Phi),$$
where $z_i^{(q)}$ maximizes $p(x_i, z_i; \Phi^{(q)})$ with respect to $z_i \in \{1, \ldots, K\}$. Similarly to the Titterington approach [10], a stochastic gradient algorithm can be derived from this formulation.
4.3 On-line CEM algorithm
In this part, a stochastic gradient algorithm which incorporates an initial run of the CEM algorithm on $n_0$ observations is derived from the CEM algorithm. Let us define the quantity $R_n$ as follows:
$$R_{n_0}(\Phi, \Phi^{(n_0)}) = \sum_{i=1}^{n_0} \log p(x_i, z_i^{(n_0)}; \Phi),$$
$$R_{n_0+1}(\Phi, \Phi^{(n_0)}) = R_{n_0}(\Phi, \Phi^{(n_0)}) + \log p(x_{n_0+1}, z_{n_0+1}^{(n_0)}; \Phi), \qquad (5)$$
$$R_{n+1}(\Phi, \Phi^{(n)}) = R_n(\Phi, \Phi^{(n-1)}) + \log p(x_{n+1}, z_{n+1}^{(n)}; \Phi) \quad \forall n > n_0,$$
where the parameter vector $\Phi^{(n_0)}$ is obtained by a run of the standard CEM algorithm on the $n_0$ initial observations, $z_{n+1}^{(n)}$ maximizes $\log p(x_{n+1}, z_{n+1}; \Phi^{(n)})$ and $\Phi^{(n)}$ maximizes $R_n(\Phi, \Phi^{(n-1)})$ for $n > n_0$. The subscript $n$ added to $R$ is again used to specify that the quantity $R_n$ depends on $n$ observations. This definition supposes that the user can fix an initial number $n_0$ of observations to be processed with the standard CEM algorithm in order to start the on-line estimation with good initial parameters.
By maximizing $\frac{1}{n+1} R_{n+1}(\,\cdot\,, \Phi^{(n)})$ using the Newton-Raphson method and approximating the Hessian matrix by the Fisher information matrix $I_c(\Phi^{(n)})$ associated with one complete observation $(x, z)$, we get our new algorithm, given by the recursive formulae
$$z_{n+1}^{(n)} = \arg\max_{z} \log p(x_{n+1}, z; \Phi^{(n)}),$$
$$\Phi^{(n+1)} = \Phi^{(n)} + \frac{1}{n+1}\, [I_c(\Phi^{(n)})]^{-1}\, \nabla_\Phi \log p(x_{n+1}, z_{n+1}^{(n)}; \Phi^{(n)}), \quad n \ge n_0,$$
which is equivalent to
$$\Phi^{(n+1)} = \Phi^{(n)} + \frac{1}{n+1}\, [I_c(\Phi^{(n)})]^{-1}\, \nabla_\Phi \max_{z_{n+1}} \log p(x_{n+1}, z_{n+1}; \Phi^{(n)}). \qquad (6)$$
This last algorithm is recognizable as a stochastic gradient algorithm with the matrix learning rate $\frac{1}{n+1}[I_c(\Phi^{(n)})]^{-1}$, aiming to maximize the expected classification likelihood criterion $E[\max_z \log p(x, z; \Phi)]$. For many commonly used mixture models, like the Gaussian (used in our application), Poisson or exponential mixtures, the complete data distribution belongs to the regular exponential family. The next section focuses on algorithm (6) for regular exponential family models.
4.4 Exponential family model
This part shows how the derivation of a stochastic gradient algorithm from the CEM algorithm is simplified when the complete data have their distribution in the regular exponential family. The complete data $(x, z)$ have their distribution $p(x, z; \Phi)$ in the exponential family with natural parameter $\eta(\Phi) = (\eta_1(\Phi), \ldots, \eta_\ell(\Phi))$ and sufficient statistic $T(x, z) = (T_1(x, z), \ldots, T_\ell(x, z))$ if this distribution can be written
$$p(x, z; \eta) = \exp\left(\eta^T T(x, z) - a(\eta) + b(x, z)\right).$$
If $\ell = p$ ($p$ being the dimension of $\Phi$) and the $\eta_j(\Phi)$ and the $T_j(x, z)$ are linearly independent, the distribution of $(x, z)$ is said to belong to the regular exponential family. The re-parameterization with the expectation parameter $\Psi = E(T(x, z) \mid \eta)$ results in the following differentiation of the complete log-likelihood:
$$\frac{\partial \log p(x, z; \eta(\Psi))}{\partial \Psi} = \frac{\partial \eta}{\partial \Psi}\left(T(x, z) - \frac{\partial a}{\partial \eta}\right).$$
Using the basic relations $\frac{\partial \eta}{\partial \Psi} = I_c(\Psi)$ and $\Psi = \frac{\partial a}{\partial \eta}$ verified by the regular exponential family, we get
$$\frac{\partial \log p(x, z; \eta(\Psi))}{\partial \Psi} = I_c(\Psi)\,(T(x, z) - \Psi). \qquad (7)$$
Using this formula, the parameter $\Psi^{(n+1)}$ maximizing $R_{n+1}(\Psi, \Psi^{(n)})$, obtained by setting the derivative of $R_{n+1}(\Psi, \Psi^{(n)})$ with respect to $\Psi$ to zero, is written very simply as
$$\Psi^{(n+1)} = \frac{\sum_{i=1}^{n_0} T(x_i, z_i^{(n_0)}) + \sum_{i=n_0+1}^{n+1} T(x_i, z_i^{(i-1)})}{n+1},$$
which is equivalent to the recursive formula
$$\Psi^{(n+1)} = \Psi^{(n)} + \frac{1}{n+1}\left[T(x_{n+1}, z_{n+1}^{(n)}) - \Psi^{(n)}\right], \quad n \ge n_0. \qquad (8)$$
Moreover, by writing recursive formula (8) as
$$\Psi^{(n+1)} = \Psi^{(n)} + \frac{1}{n+1}\, I_c(\Psi^{(n)})^{-1}\, I_c(\Psi^{(n)}) \left[T(x_{n+1}, z_{n+1}^{(n)}) - \Psi^{(n)}\right]$$
and using relation (7), it can be deduced that equations (8) and (6) are equivalent. This equivalence justifies the approximation, made for general models in section 4.3, of the Hessian matrix associated with $\frac{1}{n+1} R_{n+1}$ by the Fisher information matrix. Consequently, when the complete data have their distribution in the regular exponential family with natural parameter $\eta$ and sufficient statistic $T(x, z)$, recursion (6) simplifies to recursion (8) once this re-parameterization is used.
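As a bridge to the Gaussian updates below, here is a short sketch, not spelled out explicitly in the paper, of the complete-data sufficient statistics one may take for a Gaussian mixture (with the usual loose treatment of the constraint on the proportions); applying recursion (8) to them component by component and re-expressing the result in terms of $(\pi_k, \mu_k, \Sigma_k)$ yields the update formulae that follow. Writing $z_k = 1$ if $z = k$ and 0 otherwise,
$$T(x, z) = \left(z_k,\; z_k\, x,\; z_k\, x x^T\right)_{k=1,\ldots,K},$$
$$\Psi = E[T(x, z)] = \left(\pi_k,\; \pi_k \mu_k,\; \pi_k(\Sigma_k + \mu_k \mu_k^T)\right)_{k=1,\ldots,K}.$$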
For the Gaussian mixture model, the algorithm resulting from the application of recursion (8) is written as follows:
• Initialization: compute the initial proportions $\pi_k^{(n_0)}$, mean vectors $\mu_k^{(n_0)}$, covariance matrices $\Sigma_k^{(n_0)}$ and initial numbers of observations $n_k^{(n_0)}$ of each cluster $k$ by running the standard CEM algorithm on $n_0$ observations; set $n = n_0$.
Repeat the two following steps while new observations are received:
• Step 1: assign each new observation $x_{n+1}$ to the class $k^*$ which maximizes the posterior probability
$$t_{n+1,k}^{(n)} = \frac{\pi_k^{(n)} f_k(x_{n+1}; \theta_k^{(n)})}{\sum_{\ell=1}^{K} \pi_\ell^{(n)} f_\ell(x_{n+1}; \theta_\ell^{(n)})}$$
and set $z_{n+1,k}^{(n)}$ equal to 1 if $k = k^*$ and 0 otherwise.
• Step 2: update the parameters:
$$n_k^{(n+1)} = n_k^{(n)} + z_{n+1,k}^{(n)},$$
$$\pi_k^{(n+1)} = \pi_k^{(n)} + \frac{1}{n+1}\left(z_{n+1,k}^{(n)} - \pi_k^{(n)}\right),$$
$$\mu_k^{(n+1)} = \mu_k^{(n)} + \frac{z_{n+1,k}^{(n)}}{n_k^{(n+1)}}\left(x_{n+1} - \mu_k^{(n)}\right),$$
$$\Sigma_k^{(n+1)} = \Sigma_k^{(n)} + \frac{z_{n+1,k}^{(n)}}{n_k^{(n+1)}}\left[\left(1 - \frac{z_{n+1,k}^{(n)}}{n_k^{(n+1)}}\right)\left(x_{n+1} - \mu_k^{(n)}\right)\left(x_{n+1} - \mu_k^{(n)}\right)^T - \Sigma_k^{(n)}\right].$$
Notice that the described algorithm does not require a stopping condition since each new observation $x_{n+1}$ is used only once. By considering a Gaussian mixture with identical proportions and spherical covariance matrices (equal to the identity matrix), and supposing that no observation is initially processed with CEM ($n_0 = 0$), the on-line k-means algorithm [8,3] is recovered. Given initial values $\mu_k^{(0)}$ and setting the $n_k^{(0)}$ to zero, the on-line k-means algorithm consists in estimating recursively the $K$ means $\mu_1, \ldots, \mu_K$ using the recursion
$$n_k^{(n+1)} = n_k^{(n)} + z_{n+1,k}^{(n)},$$
$$\mu_k^{(n+1)} = \mu_k^{(n)} + \frac{z_{n+1,k}^{(n)}}{n_k^{(n+1)}}\left(x_{n+1} - \mu_k^{(n)}\right), \quad n \ge 0,$$
where $z_{n+1,k}^{(n)}$ equals 1 if $k$ minimizes $\|x_{n+1} - \mu_k^{(n)}\|^2$ and 0 otherwise. Thus, the proposed algorithm is a generalization of the on-line k-means algorithm which can handle non-spherical clusters and non-uniform proportions. A sketch of the on-line update is given below.
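The following Python sketch illustrates the on-line updates above for a diagonal Gaussian mixture, the case used in our application; the class name `OnlineCEM` and the `hard` option, which replaces the hard assignment by the posterior weights and thus corresponds to the on-line EM variant used for comparison in section 5, are illustrative choices rather than details taken verbatim from the paper.

```python
import numpy as np

class OnlineCEM:
    """On-line CEM updates for a diagonal Gaussian mixture (sketch)."""

    def __init__(self, pi, mu, var, nk, n):
        # Initial values, e.g. obtained from a batch CEM run on n0 observations.
        self.pi, self.mu, self.var = pi, mu, var   # float arrays: (K,), (K, p), (K, p)
        self.nk, self.n = nk, n                    # per-cluster counts (K,) and total count

    def _log_post(self, x):
        """Unnormalized log posterior log(pi_k f_k(x)) for each component k."""
        return (np.log(self.pi)
                - 0.5 * np.sum(np.log(2 * np.pi * self.var)
                               + (x - self.mu) ** 2 / self.var, axis=1))

    def update(self, x, hard=True):
        """Process one new observation x (steps 1 and 2); hard=False gives on-line EM."""
        logp = self._log_post(x)
        if hard:
            z = np.zeros(len(self.pi))
            z[logp.argmax()] = 1.0                       # classification step (hard assignment)
        else:
            z = np.exp(logp - logp.max())
            z /= z.sum()                                 # posterior weights instead of hard labels
        self.n += 1
        self.nk = self.nk + z                            # n_k^{(n+1)} = n_k^{(n)} + z_{n+1,k}
        self.pi = self.pi + (z - self.pi) / self.n       # proportion update
        d = x - self.mu                                  # deviations from the old means, (K, p)
        w = np.divide(z, self.nk, out=np.zeros_like(z), where=self.nk > 0)[:, None]
        self.mu = self.mu + w * d                        # mean update
        self.var = self.var + w * ((1.0 - w) * d ** 2 - self.var)  # diagonal covariance update
        return int(z.argmax())
```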
4.5 Convergence analysis for the exponential family models
For general models, not necessarily from the exponential family, Bottou [2] gives general conditions of convergence. Only the convergence conditions for the regular exponential family are given in this section. The convergence theorem of Wang and Zhao [11] for the on-line EM algorithm can be adapted to the on-line CEM algorithm by considering the complete data distribution instead of the observed data distribution. According to the resulting theorem, if the following conditions are fulfilled:
(i) $\| I_c(\Psi)^{-1} \| < \infty$,
(ii) $\| \nabla_\Psi \log p(x, z; \Psi) \| < \infty$,
(iii) $\{\Psi^{(n)}\}_n$ remains in a compact set,
then the sequence of parameters given by algorithm (6) converges almost surely toward a parameter of the set $\{\Psi \; ; \; \nabla_\Psi E[\max_z \log p(x, z; \Psi)] = 0\}$, which may contain local maxima, minima or saddle points.
5 Experiments using simulated data
This section evaluates the on-line CEM algorithm in terms of precision and computing time. Although the proposed algorithm is general, the simulations are restricted to two-dimensional data sets corresponding to a Gaussian mixture with diagonal covariance matrices, owing to the assumptions made in the application which gave rise to this study: acoustic emissions are located within a plane, the aim being to detect damaged zones which are usually found horizontally and vertically along welding lines. The whole point of this study, however, is to handle large data sets in real time. In all these simulations, the results obtained with the on-line CEM algorithm are compared to the results yielded by the CEM algorithm, which is a good reference for our problem in terms of precision, and also to the results obtained with the on-line EM algorithm. The considered on-line EM algorithm is simply the on-line CEM algorithm for the Gaussian mixture where $z_{n+1,k}^{(n)}$ is replaced with the posterior probability $t_{n+1,k}^{(n)}$. Three different types of experiments using simulated data have been considered: the first analyses the effect of the initial number $n_0$ of observations classified with CEM or EM; the second and the third sets of experiments are designed to compare CEM, on-line CEM and on-line EM in terms of precision and computing time.
5.1 Protocol of the experiments
The protocol of all the simulations is as follows: n observations are generated according to a mixture of K bivariate Gaussian densities; the standard CEM algorithm is applied to the n observations; the CEM and EM algorithms are initially applied to a small number $n_0$ of observations and the on-line algorithms are applied sequentially to the rest of the observations. The values of n vary from 1000 to 20000 by steps of 1000 and the values of $n_0$ belong to the set {10, 20, 30, 40, 50, 100, 200, 300, 400, 500}. Given a data set of n or $n_0$ observations, EM and CEM are initialized as follows: the K Gaussian density centers are initialized with K centers chosen among the available observations; the covariance matrices are initialized with the covariance matrix of the available sample and the proportions are set to $\frac{1}{K}$. Both EM and CEM start with 30 different initializations and only the solution which provides the greatest likelihood is selected. Given a solution provided by CEM, on-line CEM or on-line EM, the misclassification rate with respect to the true simulated partition and the CPU times are computed. We should point out that the processor used for all the simulations is a 2.5 GHz Pentium 4.
5.2 Influence of the number n0 on the on-line CEM algorithm
The effect of the number $n_0$ of observations initially classified with CEM on the partition provided by on-line CEM is studied by considering bivariate Gaussian mixtures corresponding to two kinds of models: a model with two elliptical clusters which have the same orientation and a model with two elliptical clusters which have different orientations. For each model three overlapping zones were considered, corresponding to 5%, 12% and 20% theoretical Bayes error. Six data structures were thus obtained: mixture models A1, A2 and A3 of two elliptical clusters with the same orientation, associated respectively with 5%, 12% and 20% theoretical Bayes error, and mixture models B1, B2 and B3 of two elliptical clusters with different orientations, associated respectively with 5%, 12% and 20% theoretical Bayes error. The proportions and covariance matrices of mixtures A1, A2, A3 are $\pi_1 = \pi_2 = 1/2$, $\Sigma_1 = \Sigma_2 = \mathrm{diag}(1/4; 4)$. The Gaussian density centers are (0; 0), (1.5; 2.5) for mixture A1, (0; 0), (1; 2.5) for mixture A2 and (0; 0), (0.6; 2.5) for mixture A3. The proportions and covariance matrices of mixtures B1, B2, B3 are $\pi_1 = \pi_2 = 1/2$, $\Sigma_1 = \mathrm{diag}(1/3; 3)$, $\Sigma_2 = \mathrm{diag}(3; 1/3)$, where $\mathrm{diag}(a; b)$ is the diagonal matrix whose diagonal components vector is (a; b). The Gaussian density centers are $\mu_1 = (0; 0)$, $\mu_2 = (3.4; 0)$ for mixture B1, $\mu_1 = (0; 0)$, $\mu_2 = (2.2; 0)$ for mixture B2 and $\mu_1 = (0; 0)$, $\mu_2 = (0; 0)$ for mixture B3. For each of these data structures and each value of n, we generated 25 different samples. Figure 3 shows examples of data from mixtures A2 and B2.
Fig. 3. Example of simulation of Mixtures A2 and B2
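For reproducibility, here is a minimal sketch of how such samples can be generated, using the parameters of mixture A2 quoted above; the function name `sample_mixture` is an illustrative choice and not part of the original experimental code.

```python
import numpy as np

def sample_mixture(n, pis, mus, covs, rng=None):
    """Draw n points and their component labels from a bivariate Gaussian mixture."""
    rng = np.random.default_rng(rng)
    z = rng.choice(len(pis), size=n, p=pis)                           # component labels
    X = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in z])
    return X, z

# Mixture A2: equal proportions, common covariance diag(1/4, 4),
# centers (0, 0) and (1, 2.5) -- about 12% theoretical Bayes error.
X, z = sample_mixture(1000,
                      pis=[0.5, 0.5],
                      mus=[(0.0, 0.0), (1.0, 2.5)],
                      covs=[np.diag([0.25, 4.0])] * 2)
```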
For each sample, the strategy described in section 5.1 is applied, and the misclassification rates and CPU times are averaged over the 25 different samples. Figures 4 and 5 report, as a function of the number $n_0$ of observations initially processed with CEM, the misclassification rates obtained respectively for mixtures (A1, B1) and (A2, B2) when n = 1000. The solution yielded by the CEM algorithm is represented by a solid line, and those obtained by on-line CEM and on-line EM are represented by dotted lines. For each mixture a rapid improvement of the misclassification rate is observed for the on-line algorithms, until $n_0$ = 50. From $n_0$ = 100, the partitions provided by the on-line algorithms coincide with the partitions given by CEM. Similar behavior has been observed for values of n greater than 1000.
Fig. 4. Misclassification rate obtained with CEM, on-line CEM and on-line EM in relation to the number n0 of observations initially processed with CEM or EM for mixtures A1 (left) and B1 (right) with n = 1000 observations.
Fig. 5. Misclassification rate obtained with CEM, on-line CEM and on-line EM in relation to the number n0 of observations initially processed with CEM or EM for mixtures A2 (left) and B2 (right) with n = 1000 observations.
When the class overlap is relatively high (see figure 6), the stabilization of the two on-line algorithms is also observed from $n_0$ = 100. However, due to the well-known poor performance [4] of CEM-type algorithms when clusters are not well separated, on-line EM appears to be better than CEM and on-line CEM.
Fig. 6. Misclassification rate obtained with CEM, on-line CEM and on-line EM in relation to the number n0 of observations initially processed with CEM or EM for mixtures A3 (left) and B3 (right) with n = 1000 observations.
Thus, for the remaining experiments, the on-line algorithms were applied with $n_0$ = 200 observations initially processed with CEM or EM.
5.3 Comparison with CEM in terms of quality
Many simulations were performed to compare on-line CEM with CEM and on-line EM as regards the quality of the estimation obtained, but only the most representative situations are described in this set of experiments. The two kinds of models presented are: models C1, C2 and C3 of three elliptical clusters with the same orientation and proportions, and models D1, D2 and D3 composed of three elliptical clusters with different orientations and non-uniform proportions. The parameters of models C1, C2 and C3, corresponding to three overlapping degrees (5%, 12% and 20%), are the following: for each model, the proportions are $\pi_1 = \pi_2 = \pi_3 = 1/3$ and the covariance matrices are $\Sigma_1 = \Sigma_2 = \Sigma_3 = \mathrm{diag}(1/4; 4)$. The Gaussian density centers are (−1.6; 2.5), (0; 0), (1.6; 2.5) for mixture C1, (−1.2; 2.5), (0; 0), (1.2; 2.5) for mixture C2 and (−0.8; 2.5), (0; 0), (0.8; 2.5) for mixture C3. The parameters of models D1, D2 and D3 are the following: for each model, the proportions are $\pi_1 = 0.4$, $\pi_2 = 0.2$ and $\pi_3 = 0.4$. The covariance matrices are $\Sigma_1 = \mathrm{diag}(1/4; 4)$, $\Sigma_2 = \mathrm{diag}(4; 1/4)$, $\Sigma_3 = \mathrm{diag}(1/4; 4)$. For model D1, the Gaussian density centers are $\mu_1 = (−2; 0)$, $\mu_2 = (0.1; 3.7)$, $\mu_3 = (2.3; 0)$; for mixture D2, the Gaussian density centers are $\mu_1 = (−2; 0)$, $\mu_2 = (−0.7; 2.2)$, $\mu_3 = (0.5; 0)$; for mixture D3 the Gaussian density centers are $\mu_1 = (−2; 0)$, $\mu_2 = (−1.3; 1.1)$, $\mu_3 = (−0.7; 0)$. For each of these data structures, 25 different samples of size n were generated. Figure 7 shows examples of data from mixtures C2 and D2.
Fig. 7. Example of simulation of Mixtures C2 and D2
Figures 8 and 9 display the misclassification rate with respect to the sample size n obtained with the three algorithms, respectively for mixtures (C1, D1) and (C2, D2). A rapid stabilization of all the algorithms can be observed and the partitions of the on-line algorithms are very similar to those of CEM. For mixture D2, the misclassification percentages of the on-line algorithms are slightly greater than those of CEM. This phenomenon can be attributed to the estimation of the non-uniform proportions.
Fig. 8. Misclassification rate obtained with CEM, on-line CEM and on-line EM in relation to the sample size n for mixtures C1 (left) and D1 (right) with n0 = 200 observations.
Fig. 9. Misclassification rate obtained with CEM, on-line CEM and on-line EM in relation to the sample size n for mixtures C2 (left) and D2 (right) with n0 = 200 observations.
Not surprisingly, when the class overlap is relatively high (see figure 10), the on-line CEM algorithm exhibits poor results, particularly for mixture D3. Again, this behaviour can be attributed to the notoriously poor performance of CEM-type algorithms [4] when clusters are not well separated, which is even more pronounced when estimating non-uniform proportions.
Fig. 10. Misclassification rate obtained with CEM, on-line CEM and on-line EM in relation to the sample size n for mixtures C3 (left) and D3 (right) with n0 = 200 observations.
5.4 Comparison with CEM in terms of speed
The speed of on-line CEM has been compared with that of the CEM and on-line EM algorithms using the same simulations used for the comparison in terms of quality. Figure 11 represents the CPU times in seconds for the three algorithms with respect to the sample size n. Here only the case of 12% Bayes error (mixtures C2 and D2) is represented, since the other two cases (5% and 20%) show approximately the same behavior.
It can be observed that the CPU times given by the on-line algorithms vary very slowly with sample size, while the CPU time for the standard CEM algorithm grows considerably with the sample size. In particular, for 20000 observations, CEM is about six times slower than the two on-line algorithms for mixture D2. These experiments clearly show that our proposed on-line CEM algorithm is more efficient than the CEM algorithm in terms of speed.
Fig. 11. CPU time (in seconds) in relation to the sample size n for mixtures C2 (left) and D2 (right)
6 Results on real acoustic emission data
The main motivation of this work, as stated in the introduction, was to develop a computer-aided decision procedure to assist the detection, in real time, of damaged zones on the surface of a gas tank, through the use of acoustic emissions. The goal is the non-destructive detection of imperfections. When subjected to variations in pressure the tank surface emits noises. Each noise or acoustic emission is located and characterized by 16 variables including its spatial coordinates on the tank and other variables such as maximum amplitude, energy and duration. Experts are in agreement that spatial concentrations of acoustic emissions, identified using spatial coordinates, are of primary importance in the detection of damaged zones. Other features of acoustic emissions are useful in distinguishing between major and minor flaws once damage has been detected, but spatial concentrations are the key factor on which detection relies. Our method, therefore, consists of two steps:
• Identification of spatial concentrations (sources) of acoustic signals. This is done by clustering the acoustic emission events, which allows the detection of zones where acoustic emissions are concentrated.
• Separation of the identified clusters into different categories according to the severity of the imperfection: these categories are termed minor, active and critical.
In the opinion of specialists in the field, the method we have described produces satisfactory results using the CEM algorithm for the first step, so long as the number of acoustic emissions does not exceed 10000. When there are more than 10000 emissions the CEM clustering step becomes too slow (more than a few seconds' delay) for a real-time application. To evaluate our new strategy, we performed a comparison of CEM and on-line CEM on a real data set of 2601 acoustic emissions (see Figure 2). On-line EM has not been tested because the only available reference partition is that of CEM, which cannot be a valid reference for EM. The number of clusters is selected by the following strategy:
• CEM and on-line CEM are run on the 2601 acoustic emissions for numbers of clusters from 1 to 15;
• the number of clusters is selected by maximizing the integrated classification likelihood (ICL) criterion [1]
$$\mathrm{ICL}(K) = \log p(x_n, z_n; \Phi^*) - \frac{\nu_K}{2} \log(n),$$
where $\Phi^*$ is the parameter vector obtained with CEM or on-line CEM and $\nu_K$ is the number of free parameters of the model.
Both strategies, using CEM or on-line CEM, select the model with 9 clusters.
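As an illustration of this model selection step, here is a minimal sketch of the ICL computation; the helper name `icl` and the way $\nu_K$ is counted for a diagonal Gaussian mixture (K−1 proportions, K·p means, K·p diagonal variances) are stated assumptions rather than details spelled out in the paper.

```python
import numpy as np

def icl(complete_loglik, K, n, p):
    """ICL(K) = log p(x_n, z_n; Phi*) - (nu_K / 2) * log(n).

    complete_loglik -- classification log-likelihood at the CEM (or on-line CEM) solution
    K, n, p         -- number of clusters, sample size, data dimension
    Assumed free-parameter count for a diagonal Gaussian mixture:
    (K - 1) proportions + K*p means + K*p diagonal variances.
    """
    nu_K = (K - 1) + 2 * K * p
    return complete_loglik - 0.5 * nu_K * np.log(n)

# Model selection sketch: pick the K in 1..15 with the largest ICL value,
# given hypothetical classification log-likelihoods returned by the CEM runs.
# best_K = max(range(1, 16), key=lambda K: icl(logliks[K], K, n=2601, p=2))
```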
Fig. 12. CEM classification obtained from real acoustic emissions
A misclassification percentage of 1.57% between CEM and on-line CEM is obtained. This shows that the results obtained with on-line CEM are close to the results of the CEM algorithm. We have therefore achieved our original aim of obtaining partitions similar to the CEM partitions but in less time. Figure 13 shows the resulting partition given by on-line CEM. In the decision step following this first clustering step, the vertically more elongated cluster observed was found to be the only flaw cluster. This local region is in fact a welding region, and our result corresponds to the presence of a real flaw. The procedure described above is not yet industrialized, but is in the testing phase.
Fig. 13. On-line CEM classification obtained from real acoustic emissions
7 Conclusion
An on-line clustering algorithm was proposed to obtain rapid and effective clustering of acoustic emissions on the surface of a gas tank in order to detect flaws.
The proposed algorithm is a stochastic gradient version of the Classification EM algorithm (CEM) which incorporates an initial run of CEM on a few observations. It produces partitions close to those computed with the CEM algorithm for relatively well separated clusters. Almost no differences are observed between the CEM and on-line CEM partitions when the number $n_0$ of observations initially classified by CEM is greater than 100. When the overlap between clusters is high, the experimental study has revealed that on-line EM performs better.
When the complete data distribution belongs to the exponential family, the algorithm is written in a very simple form. The on-line k-means algorithm introduced by MacQueen [8] is recovered when a Gaussian mixture with spherical covariance matrices is considered. The proposed algorithm could also be applied to the 28 Gaussian parsimonious models proposed by Celeux and Govaert [5] to handle specific shapes of clusters.
The execution time of on-line CEM does not vary very much, while the execution time of CEM increases significantly as the number of available observations increases. This algorithm therefore represents an efficient alternative for clustering large data sets.
References
[1] Biernacki C., Celeux G. and Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7):719-725, 2000.
[2] Bottou L. Une approche théorique de l'apprentissage connexionniste; applications à la reconnaissance de la parole. Thèse de Doctorat, Université d'Orsay, 1991.
[3] Bottou L. and Bengio Y. Convergence properties of the k-means algorithm. In G. Tesauro et al. (Eds.), Advances in Neural Information Processing Systems, Volume 7, pages 585-592, MIT Press, 1995.
[4] Celeux G. and Govaert G. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics and Data Analysis, 14:315-332, 1992.
[5] Celeux G. and Govaert G. Gaussian parsimonious clustering models. Pattern Recognition, 28(5):781-793, 1995.
[6] Dempster A. P., Laird N. M. and Rubin D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[7] Liu Z., Almhana J., Choulakian V. and McGorman R. On-line EM algorithm for mixture with application to internet traffic modeling. Computational Statistics and Data Analysis (to appear).
[8] MacQueen J. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematics, Statistics and Probability, 1:281-298, 1967.
[9] Scott A. J. and Symons M. J. Clustering methods based on likelihood ratio criteria. Biometrics, 27:387-397, 1971.
[10] Titterington D. M. Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society, Series B, 46:257-267, 1984.
[11] Wang S. and Zhao Y. Almost sure convergence of Titterington's recursive estimator for mixture models. IEEE International Symposium on Information Theory (ISIT), 2002.