CLUSTERING OF SPATIAL DATA BY THE EM ALGORITHM

C. AMBROISE, M. DANG AND G. GOVAERT
Université de Technologie de Compiègne, URA CNRS 817, BP 529, F-60205 Compiègne cedex, France
Abstract.
A clustering algorithm for spatial data is presented. It seeks a fuzzy partition which is optimal according to a criterion interpretable as a penalized likelihood. We propose to penalize the energy function exhibited by Hathaway (1986) with a term taking into account spatial contiguity constraints. The structure of the EM algorithm may be used to maximize the proposed criterion. The Maximization step is then unchanged and the Expectation step becomes iterative. The efficiency of the new clustering algorithm has been tested on biological images and compared with other clustering techniques.
1. Introduction

When classical clustering techniques are used for partitioning spatial data, the resulting classes will often be geographically very mixed. To avoid this phenomenon, the spatial information of the data has to be taken into account. Very different solutions to this problem have been proposed in the literature. A natural approach consists in using the geographical coordinates of the individuals, more or less heavily weighted, as an additional pair of variates - see Berry (1966) or Jain and Farrokhnia (1991).

This paper is published in the proceedings of the geoENV96 conference, Lisbon, Portugal, November 20-22, 1996. It can be referenced as: Ambroise, C., Dang, M. and Govaert, G. (1997). Clustering of spatial data by the EM algorithm. In geoENV I - Geostatistics for Environmental Applications, A. Soares, J. Gómez-Hernández and R. Froidevaux (eds), 493-504. Quantitative Geology and Geostatistics, vol. 9. Kluwer Academic Publishers.
Another approach groups individuals which are both similar and contiguous (Legendre 1987, Openshaw 1977). This implies the definition of a neighborhood concept. Defining neighborhood relationships is equivalent to building a graph where each element is represented by a node and each neighborhood relationship by an edge. Clustering with spatial contiguity constraints can be described as the succession of two steps:
1. The definition of a neighborhood graph. This can be done with standard algorithms such as a Delaunay triangulation (Green and Sibson 1977) or a Gabriel graph (Gabriel and Sokal 1969).
2. Running a clustering algorithm while respecting the constraints. Many classical clustering algorithms may be modified to take into account the constraints which are summarized by the graph. In Lebart (1978) a classical hierarchical clustering algorithm is adapted.

These procedures produce classes made of adjacent sites. They may separate into different classes individuals which are very similar, if they are geographically far apart. Oliver and Webster (1989) propose to run clustering algorithms based on a modified dissimilarity matrix. This modified matrix is a combination of the matrix of geographical distances and the dissimilarity matrix computed from the non-geographical variables. This kind of procedure seems to work well but has no statistical justification.

Other spatially constrained clustering methods and approaches have been developed in the field of unsupervised image segmentation (Ripley 1988). The specificity of these methods is that they deal with pixels and their regular grid structure. An image may well be considered as a regular lattice of pixels. In this case, the computation of the neighborhood graph is immediate. The most common choices for the neighborhood graph are the 4-neighbor graph (horizontal and vertical adjacencies only) and the 8-neighbor graph (including diagonals). There are numerous unsupervised image segmentation algorithms. In this paper, we consider only statistically based methods. In this framework, the Bayesian approach proposes solutions which may be separated into two families (Masson and Pieczinsky 1993):
1. The local methods make assumptions about the pixel or about small groups of adjacent pixels called "context" (Masson and Pieczinsky 1993).
2. The global methods make assumptions about the whole image and generally use a Markov random field model (Geman and Geman 1984, Besag 1974).

Notice that the classes obtained with these methods differ from those obtained with hard contiguity constrained algorithms: a class does not necessarily form a single patch on the image. Pixels of the same class may be
present in different parts of the image. Thus the statistical models used in unsupervised segmentation algorithms take the contiguity constraints into account but do not impose "one region classes". We develop in this paper a new statistical method for spatial clustering which is based on the EM algorithm (Dempster et al. 1977) and takes into account the spatial constraints without requiring a partition made of "one region classes". The next section describes the principle of the proposed algorithm. Section 3 relates this method to statistical image segmentation techniques based on Markov random fields. Section 4 illustrates the method's performance on the segmentation of a biological gray-level image. The concluding section evokes aspects of this approach that are currently under investigation or seem worthy of further study.
2. Introducing Spatial Constraints in the EM Algorithm
2.1. MIXTURE ESTIMATION BY THE EM ALGORITHM
In cluster analysis based on Gaussian mixture models (Celeux and Govaert 1995), the data are $\mathbb{R}^d$-valued vectors $x_1, \ldots, x_n$ assumed to be an independent and identically distributed (i.i.d.) sample from a mixture of $K$ normal distributions:

$$f(x_i \mid \Phi) = \sum_{k=1}^{K} p_k\, f_k(x_i \mid \mu_k, \Sigma_k), \qquad (1)$$

where the $p_k$ are the mixing proportions (for $k = 1, \ldots, K$, $0 < p_k < 1$, and $\sum_k p_k = 1$), and $f_k(x \mid \mu_k, \Sigma_k)$ denotes the density of a Gaussian distribution with mean vector $\mu_k$ and variance matrix $\Sigma_k$. This model also assumes that the unobserved vector of labels, $z = (z_1, \ldots, z_n)$, is an i.i.d. sample of the multinomial distribution: for $1 \leq i \leq n$, $1 \leq k \leq K$, $P(Z_i = k) = p_k$. The EM algorithm is often used to estimate the unknown parameters of the mixture (Dempster et al. 1977). It produces a set of parameters that locally maximizes the log-likelihood of the sample, defined as
$$L(\Phi) = \sum_{i=1}^{n} \log f(x_i \mid \Phi).$$
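As a small illustration of the model, the sketch below evaluates the mixture density of equation (1) and the log-likelihood $L(\Phi)$ for a made-up two-component Gaussian mixture; the data, parameter values and function names are arbitrary and only serve the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, p, mu, sigma):
    """Mixture density f(x | Phi) of equation (1), evaluated at each row of x."""
    return sum(p[k] * multivariate_normal.pdf(x, mean=mu[k], cov=sigma[k])
               for k in range(len(p)))

def log_likelihood(x, p, mu, sigma):
    """Log-likelihood L(Phi) = sum_i log f(x_i | Phi)."""
    return np.sum(np.log(mixture_density(x, p, mu, sigma)))

# Arbitrary parameters of a two-component mixture in R^2 and a made-up sample.
p = [0.4, 0.6]
mu = [np.zeros(2), np.array([3.0, 3.0])]
sigma = [np.eye(2), 2.0 * np.eye(2)]
x = np.random.default_rng(0).normal(size=(10, 2))
print(log_likelihood(x, p, mu, sigma))
```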
The principle of the EM algorithm consists in building a sequence of estimates $\Phi^0, \Phi^1, \ldots, \Phi^m, \ldots$ over which the log-likelihood monotonically increases (for all $m$, $L(\Phi^{m+1}) \geq L(\Phi^m)$). The basic idea in EM is to choose $\Phi^{m+1}$ from $\Phi^m$ as the value that globally maximizes the expectation defined as

$$Q(\Phi \mid \Phi^m) = \sum_{z} P(z \mid x, \Phi^m) \log P(x, z \mid \Phi)$$
$$= \sum_{i=1}^{n} \sum_{k=1}^{K} \log\bigl(p_k f_k(x_i \mid \mu_k, \Sigma_k)\bigr)\, P(Z_i = k \mid x_i, \Phi^m) \quad \text{(for mixture models).} \qquad (2)$$
Thus, starting from an arbitrary value $\Phi^0$, iteration $(m+1)$ of the EM algorithm can be divided into two steps:
- E-step (Expectation): computation of the components of $Q(\Phi \mid \Phi^m)$ that do not depend on $\Phi$;
- M-step (Maximization): search for $\Phi^{m+1} = \arg\max_{\Phi} Q(\Phi \mid \Phi^m)$.

In the case of mixture models, the E-step computes

$$t_{ik}^{m+1} = P(Z_i = k \mid x_i, \Phi^m) = \frac{p_k^m f_k(x_i \mid \mu_k^m, \Sigma_k^m)}{f(x_i \mid \Phi^m)}, \qquad 1 \leq i \leq n,\; 1 \leq k \leq K. \qquad (3)$$
For a Gaussian mixture, the M-step yields, for $k = 1, \ldots, K$:

$$\mu_k^{m+1} = \frac{1}{n_k} \sum_{i=1}^{n} t_{ik}^{m+1} x_i \qquad (4)$$
$$\Sigma_k^{m+1} = \frac{1}{n_k} \sum_{i=1}^{n} t_{ik}^{m+1} (x_i - \mu_k^{m+1})(x_i - \mu_k^{m+1})^t \qquad (5)$$
$$p_k^{m+1} = \frac{n_k}{n} \qquad (6)$$

where $n_k = \sum_{i=1}^{n} t_{ik}^{m+1}$.
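A minimal sketch of one EM iteration for a Gaussian mixture, following equations (3)-(6); the array shapes and the function name are our own choices, not part of the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_iteration(x, p, mu, sigma):
    """One EM iteration: E-step (3), then M-step (4)-(6).
    x: (n, d) data; p: (K,) proportions; mu: (K, d) means; sigma: (K, d, d) covariances."""
    n = len(x)
    K = len(p)
    # E-step: posterior probabilities t_ik of equation (3).
    dens = np.column_stack([p[k] * multivariate_normal.pdf(x, mean=mu[k], cov=sigma[k])
                            for k in range(K)])
    t = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters, equations (4)-(6).
    nk = t.sum(axis=0)
    mu_new = (t.T @ x) / nk[:, None]
    sigma_new = np.empty_like(sigma)
    for k in range(K):
        diff = x - mu_new[k]
        sigma_new[k] = (t[:, k, None] * diff).T @ diff / nk[k]
    p_new = nk / n
    return p_new, mu_new, sigma_new, t
```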
2.2. THE EM ALGORITHM AS A FUZZY CLUSTERING METHOD
As Hathaway (1986) highlighted, the EM algorithm in the case of mixture models is formally equivalent to an alternate optimization of the function

$$D(c, \Phi) = \sum_{k=1}^{K} \sum_{i=1}^{n} c_{ik} \log\bigl(p_k f_k(x_i \mid \mu_k, \Sigma_k)\bigr) - \sum_{k=1}^{K} \sum_{i=1}^{n} c_{ik} \log(c_{ik}) \qquad (7)$$
where $c = (c_{ik})_{i=1,\ldots,n;\;k=1,\ldots,K}$ defines a fuzzy classification, $c_{ik}$ representing the grade of membership of $x_i$ to class $k$ ($0 \leq c_{ik} \leq 1$, $\sum_{k=1}^{K} c_{ik} = 1$, $\sum_{i=1}^{n} c_{ik} > 0$, for $1 \leq i \leq n$, $1 \leq k \leq K$). To see this, consider the following grouped coordinate ascent on function (7): starting from a given set of initial parameters $\Phi^0$, criterion $D(c, \Phi)$ is alternately optimized over the possible values of the classification matrix $c$ with fixed mixture parameters, then over the possible values of mixture
parameters with a fixed classification matrix. Thus, iteration $(m+1)$ of this alternate optimization algorithm consists of two steps:
"E"-step: the classification matrix is updated in order to maximize the criterion: $c^{m+1} = \arg\max_{c} D(c, \Phi^m)$. Let us write the Lagrangian of $D(c, \Phi^m)$, which takes into account the constraints $\sum_{k=1}^{K} c_{ik} = 1$ for all $i$:

$$\tilde{D}(c) = D(c, \Phi^m) + \sum_{i=1}^{n} \lambda_i \Bigl(\sum_{k=1}^{K} c_{ik} - 1\Bigr), \qquad (8)$$

where the $\lambda_i$ are the Lagrange coefficients corresponding to the constraints. The necessary conditions of optimality yield

$$\begin{cases} \dfrac{\partial \tilde{D}}{\partial c_{ik}} = \log\bigl(p_k^m f_k(x_i \mid \mu_k^m, \Sigma_k^m)\bigr) - 1 - \log c_{ik} + \lambda_i = 0 \\[1ex] \sum_{k=1}^{K} c_{ik} = 1, \end{cases}$$
which can also be written as

$$\begin{cases} c_{ik} = \exp\bigl\{-1 + \lambda_i + \log\bigl(p_k^m f_k(x_i \mid \mu_k^m, \Sigma_k^m)\bigr)\bigr\} \\[1ex] \sum_{k=1}^{K} \exp\bigl\{-1 + \lambda_i + \log\bigl(p_k^m f_k(x_i \mid \mu_k^m, \Sigma_k^m)\bigr)\bigr\} = 1. \end{cases}$$
The following values are obtained for the $c_{ik}^{m+1}$:

$$c_{ik}^{m+1} = \frac{p_k^m f_k(x_i \mid \mu_k^m, \Sigma_k^m)}{f(x_i \mid \Phi^m)}, \qquad (9)$$

which is clearly identical to EM's E-step for mixture models (3).
"M"-step: the parameters are re-estimated according to $\Phi^{m+1} = \arg\max_{\Phi} D(c^{m+1}, \Phi)$. However, combining (2) with (3), and (7) with (9), it can be seen that

$$D(c^{m+1}, \Phi) = Q(\Phi \mid \Phi^m) - \underbrace{\sum_{k=1}^{K} \sum_{i=1}^{n} c_{ik}^{m+1} \log(c_{ik}^{m+1})}_{\text{independent of } \Phi}$$

so that the same $\Phi^{m+1}$ is obtained as in the M-step of the EM algorithm.
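As a small numerical check of this equivalence, the following sketch computes the criterion $D$ of equation (7) at the E-step posteriors and verifies that it equals the log-likelihood $L(\Phi^m)$, the relation noted in the next paragraph; the data and parameters are made up for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 2))                      # made-up data
p = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [2.0, 2.0]])
sigma = np.array([np.eye(2), np.eye(2)])

# Terms p_k f_k(x_i) and the E-step posteriors c_ik (equations (3) and (9)).
joint = np.column_stack([p[k] * multivariate_normal.pdf(x, mean=mu[k], cov=sigma[k])
                         for k in range(2)])
c = joint / joint.sum(axis=1, keepdims=True)

D = np.sum(c * np.log(joint)) - np.sum(c * np.log(c))   # criterion (7) at (c^{m+1}, Phi^m)
L = np.sum(np.log(joint.sum(axis=1)))                    # log-likelihood L(Phi^m)
print(np.isclose(D, L))                                  # True: L(Phi^m) = D(c^{m+1}, Phi^m)
```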
Since this scheme yields the same calculations as EM applied to mixtures, the latter can be viewed as an alternate optimization method for the criterion $D(c, \Phi)$. The criteria themselves are related by $L(\Phi^m) = D(c^{m+1}, \Phi^m)$. This interpretation makes clearer the relationships between EM and other clustering techniques, which often optimize criteria very similar to $D(c, \Phi)$ (Celeux and Govaert 1995).

2.3. PENALIZATION OF HATHAWAY'S CRITERION
We propose to regularize Hathaway's criterion (7) with a term which takes into account the spatial information relative to the data. Let $V$ be the "neighborhood matrix":

$$v_{ij} \begin{cases} > 0 & \text{if } x_i \text{ and } x_j \text{ are neighbors} \\ = 0 & \text{if } x_i \text{ and } x_j \text{ are not neighbors.} \end{cases}$$

To compute the matrix $V$, one can use the graph structure described in the introduction, or a function of the spatial distance between the elements of the data set. The regularizing term we propose is the following:

$$G(c) = \frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ik}\, c_{jk}\, v_{ij}. \qquad (10)$$
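As an illustration of how $V$ and $G(c)$ might be obtained in practice, the sketch below builds a binary neighborhood matrix from a Delaunay triangulation of the site coordinates (one of the graph constructions mentioned in the introduction) and evaluates the penalty of equation (10); the function names and example data are ours, not the paper's.

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_neighborhood(coords):
    """Binary neighborhood matrix V: v_ij = 1 if sites i and j share a Delaunay edge."""
    n = len(coords)
    V = np.zeros((n, n))
    for simplex in Delaunay(coords).simplices:
        for a in simplex:
            for b in simplex:
                if a != b:
                    V[a, b] = 1.0
    return V

def spatial_penalty(c, V):
    """G(c) = 0.5 * sum_k sum_i sum_j c_ik c_jk v_ij  (equation 10)."""
    return 0.5 * np.sum(c * (V @ c))

# Made-up example: 20 random sites and a random fuzzy classification into 3 classes.
rng = np.random.default_rng(0)
coords = rng.uniform(size=(20, 2))
c = rng.dirichlet(np.ones(3), size=20)
print(spatial_penalty(c, delaunay_neighborhood(coords)))
```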
The more the classes contain adjacent elements, the greater this term is. Notice that $G(c)$ is a function of the classification matrix only. The new criterion we consider is then:

$$U(c, \Phi) = D(c, \Phi) + \beta\, G(c) \qquad (11)$$

where the coefficient $\beta \geq 0$ gives more or less weight to the spatial homogeneity term relative to $D(c, \Phi)$. A modified version of the EM algorithm can be used to maximize this criterion. We named this version the Neighborhood EM algorithm (NEM). Since the new regularizing term does not contain any parameter of the mixture, the Maximization step remains unchanged. The Estimation step (maximization of the criterion with respect to the classification matrix, with fixed mixture parameters) changes:
1. Initialization: a neighborhood matrix $V$ is computed according to the spatial relationships; arbitrary initial values are chosen for the parameters of the mixture, $\Phi^0$, as well as for the classification matrix, $c^0$ ($n$ rows, $K$ columns).
2. At each iteration, the following steps are carried out until convergence:
(a) E-step:
m cm+1 = arg max c U (c; ):
The necessary conditions of optimality take the following form:
(
@U = log (pm fk (xi jm ; m )) + 1 ? log cik + i + Pn cjk vij j =1 k k @c PikK c = 1k k=1 ik
which may be written as

$$\begin{cases} c_{ik} = \exp\bigl\{\log\bigl(p_k^m f_k(x_i \mid \mu_k^m, \Sigma_k^m)\bigr) - 1 + \lambda_i + \beta \sum_{j=1}^{n} c_{jk} v_{ij}\bigr\} \\[1ex] \sum_{k=1}^{K} \exp\bigl\{\log\bigl(p_k^m f_k(x_i \mid \mu_k^m, \Sigma_k^m)\bigr) - 1 + \lambda_i + \beta \sum_{j=1}^{n} c_{jk} v_{ij}\bigr\} = 1. \end{cases}$$
Finally we get the following equation:
$$c_{ik}^{m+1} = \frac{p_k^m f_k(x_i \mid \mu_k^m, \Sigma_k^m)\, \exp\bigl\{\beta \sum_{j=1}^{n} c_{jk}^{m+1} v_{ij}\bigr\}}{\sum_{\ell=1}^{K} p_\ell^m f_\ell(x_i \mid \mu_\ell^m, \Sigma_\ell^m)\, \exp\bigl\{\beta \sum_{j=1}^{n} c_{j\ell}^{m+1} v_{ij}\bigr\}} \qquad (12)$$
which suggests an iterative computation of the form $c = g(\tilde{c})$, where $\tilde{c}$ is the old classification matrix. From a practical point of view, a few iterations produce a reasonable new classification matrix $c$, which can be used for the next M-step.
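A minimal sketch of this fixed-point E-step, under the assumption that the quantities $\log\bigl(p_k^m f_k(x_i \mid \mu_k^m, \Sigma_k^m)\bigr)$ have already been computed from the current mixture parameters; the variable names and the number of inner iterations are our own choices.

```python
import numpy as np

def nem_e_step(log_pf, V, beta, c_init, n_inner=5):
    """Fixed-point iteration for the NEM E-step, equation (12).
    log_pf[i, k] = log(p_k f_k(x_i | mu_k, Sigma_k)) at the current parameters;
    V: neighborhood matrix; beta: spatial weight; c_init: previous classification matrix."""
    c = c_init.copy()
    for _ in range(n_inner):
        s = log_pf + beta * (V @ c)           # log of the unnormalized weights in (12)
        s -= s.max(axis=1, keepdims=True)     # stabilize the exponentials
        w = np.exp(s)
        c = w / w.sum(axis=1, keepdims=True)  # normalize so that each row sums to 1
    return c

# Made-up usage: 5 sites on a chain, 2 classes, uniform initial memberships.
V = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
log_pf = np.log(np.full((5, 2), 0.5))
c_new = nem_e_step(log_pf, V, beta=1.0, c_init=np.full((5, 2), 0.5))
```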
(b) M-step:
$$\Phi^{m+1} = \arg\max_{\Phi} U(c^{m+1}, \Phi) = \arg\max_{\Phi} D(c^{m+1}, \Phi).$$
Thus, to compute the parameters of the mixture, one can use the same formulae as in the M-step of the EM algorithm.

Empirically, the NEM algorithm converges on all tested problems, but no proof of local convergence is available yet. At each iteration the spatial information modifies the partition, to a degree governed by the coefficient $\beta$: the Estimation step smoothes the map of the labels. The preceding algorithm solves two different tasks which are highly dependent: it estimates the parameters of the mixture and finds a partition. The two tasks are performed at the same time.

2.4. HARD PARTITION
The NEM algorithm may be easily adapted to seek a hard partition ($c_{ik} \in \{0, 1\}$):
- The easiest way is to consider the fuzzy partition obtained at convergence, and to assign each individual $i$ to the most probable class according to the a posteriori probabilities, i.e. the class $k$ that maximizes $c_{ik}$.
- As in the CEM algorithm (Celeux and Govaert 1992), it is possible to add an intermediate classification step between the E-step and the M-step (both options are sketched below).
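A minimal sketch of these two hardening options, using a made-up fuzzy classification matrix; the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
c = rng.dirichlet(np.ones(3), size=10)   # made-up fuzzy classification matrix (n = 10, K = 3)

# Option 1: harden the final fuzzy partition (most probable class for each individual).
labels = np.argmax(c, axis=1)

# Option 2 (CEM-style): replace c by a 0/1 matrix in an intermediate classification step.
c_hard = np.zeros_like(c)
c_hard[np.arange(len(c)), labels] = 1.0
```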
3. A Bayesian Interpretation

Notice that it is possible to give a Bayesian interpretation of the NEM algorithm. Maximizing the criterion $U(c, \Phi)$ is equivalent to maximizing

$$\exp\{U(c, \Phi)\} = \exp\{D(c, \Phi)\}\, \exp\{\beta\, G(c)\}. \qquad (13)$$
We may write
$$\exp\{\beta\, G(c)\} \propto P(c), \qquad (14)$$

where $P(c)$ is a Gibbs distribution with energy function $-\beta\, G(c)$. On the other hand, the expression $\exp\{D(c, \Phi)\}$ may be interpreted as $P(x \mid c)$, the conditional density of the sample $x$, knowing the classification matrix $c$, with $\Phi$ the parameters characterizing the Gaussian mixture. Thus,

$$\exp\{U(c, \Phi)\} \propto P(c)\, P(x \mid c) \propto P(c \mid x),$$

where $P(c \mid x)$ is the posterior distribution of the classification matrix $c$. Maximizing $U(c, \Phi)$ is equivalent to finding the matrix $c$ that maximizes the posterior distribution $P(c \mid x)$, which is a Gibbs distribution with energy function $-U$. Thus, the NEM algorithm may be interpreted as an algorithm that searches for the maximum a posteriori (MAP) estimate of the classification matrix $c$.

This Bayesian interpretation assumes that there are two random fields: $X = \{X_s, s \in S\}$, the observed random field, and $C = \{C_s, s \in S\}$, the random field corresponding to the classification matrix ($S$ is the set of sites). The $X_s$ take their values in $\mathbb{R}^d$ and the $C_s = (C_{s1}, \ldots, C_{sK})$ in a subset of $[0, 1]^K$. Both fields $X$ and $C$ follow Gibbs distributions. This approach may be used in image segmentation and provides an alternative to existing unsupervised fuzzy segmentation algorithms (Kent and Mardia 1991, Caillol et al. 1993). In the case of a hard classification matrix, i.e. when the $C_s = (C_{s1}, \ldots, C_{sK})$ are random variables taking their values in $\{0, 1\}^K$, and with the following neighborhood matrix:
$$v_{ij} = \begin{cases} 1 & \text{if } x_i \text{ and } x_j \text{ are neighbors} \\ 0 & \text{if } x_i \text{ and } x_j \text{ are not neighbors,} \end{cases}$$

the a priori distribution of the classification matrix is a Markov random field model very frequently used in image segmentation (Strauss 1977, Besag 1986):

$$P(C = c) = \frac{1}{Z} \exp\{\beta\, G(c)\} = \frac{1}{Z} \exp\Bigl\{\beta \sum_{i<j} v_{ij}\, (c_i \cdot c_j)\Bigr\}.$$