A Maximum Entropy Approach to Pairwise Data Clustering

J. M. Buhmann & T. Hofmann
Universität Bonn, Institut für Informatik III
D-53117 Bonn, Germany

Abstract


Partitioning a set of data points which are characterized by their mutual dissimilarities instead of an explicit coordinate representation is a difficult, NP-hard combinatorial optimization problem. We formulate this optimization problem of a pairwise clustering cost function in the maximum entropy framework, using a variational principle to derive corresponding data partitionings in a d-dimensional Euclidean space. This approximation solves the embedding problem and the grouping of these data into clusters simultaneously and in a self-consistent fashion.

1 Introduction

Grouping experimental data into compact clusters arises as a data analysis problem in psychology, linguistics, genetics and other experimental sciences. The data which are supposed to be clustered are either given by an explicit coordinate representation (central clustering) or they are characterized by similarity values for pairs of data points (pairwise clustering). In this paper we study the pairwise clustering problem in the framework of maximum entropy estimation [1, 2]. The clustering process serves the dual purpose of reducing the complexity of a data set, i.e., of discarding noise and measurement errors, and of discovering and emphasizing the underlying structure of the data. The discovery of structure in data sets can be achieved by embedding the data in a d-dimensional space for visualization purposes. The search for an optimal embedding is discussed in the literature as multidimensional scaling [3], where the dissimilarities of pairs of data points are approximated by distances in a Euclidean space. Embedding the data, however, constitutes only a partial solution in most instances, since an experimentalist has to visually inspect the embedding and has to extract the structure from data projections onto low-dimensional (usually two-dimensional) planes. Although the computational problem in this two-step process is restricted to the search for an optimal embedding, the full problem includes simultaneous embedding and clustering of a data set. The computational task of embedding a data set in a Euclidean space might be aggravated by the fact that the dissimilarity values usually do not obey the requirements of a metric, i.e., these numbers might not be positive, they might not fulfil the triangle inequality, or the dissimilarity of a datum to itself might not vanish.

The question arises how both problems, the embedding problem and the clustering problem, can be solved simultaneously and in a self-consistent fashion. Obviously, the grouping process depends on the embedding of the data in a Euclidean space. On the other hand, the selected embedding should emphasize the grouping of the data points so that an interpretation of an experiment is facilitated. In this paper we propose a clustering algorithm which embeds the data set in a Euclidean space such that the pairwise clustering costs are approximated as tightly as possible, in the maximum entropy sense, by central clustering costs in this embedding space. The coordinates in the embedding space are the free parameters of this variational problem.

In mathematical terms, the data are characterized by a real-valued, symmetric proximity matrix $\mathcal{D} \in \mathbb{R}^{N \times N}$ with coefficients $D_{kl}$, being the matrix of pairwise dissimilarities between $N$ data points. We make no further assumptions about the data; in particular, we do not assume that any of the axioms fulfilled by a metric (apart from symmetry) hold. Since we avoid additional presuppositions, our approach is widely applicable to clustering and multidimensional scaling problems, regardless of the special nature of the given distances. Given the dissimilarity matrix as the result of an experiment, there are usually two problems to be solved, which may lead to an interpretation or evaluation of the data:


1. Find a coordinate representation $\mathbf{x}$ for the data points, i.e., an embedding in a Euclidean vector space, $\mathbf{x}: \{1,\dots,N\} \to \mathbb{R}^d$.

2. Cluster the data points to reveal the underlying structure. Formally, a solution to the clustering problem with a fixed number of $K$ clusters is a Boolean assignment matrix $\mathbf{M} \in \{0,1\}^{N \times K}$ with the uniqueness restriction $\sum_{\nu=1}^{K} M_{i\nu} = 1$ for every data point $i$.

These problems, though often treated separately, are closely related. Clearly, the representation is only useful if it respects the inherent structure of the data. Natural clusters should be preserved by the mapping into a Euclidean space; similar samples should be mapped to neighbouring coordinates, preserving the topological relations. This leads to a data-dependent criterion which avoids the definition of an artificial cost function, such as the squared error sum between the original proximities and the Euclidean distances in the constructed representation. On the other hand, the additional restrictions of a Euclidean embedding might improve the clustering procedure or lead to a better-suited cluster formation. Obviously, the data representation has a major influence on the clustering procedure.

2 The Maximum Entropy Principle for Central and Pairwise Clustering

Pairwise clustering of data is a combinatorial optimization problem which depends on Boolean assignments $M_{i\nu} \in \{0,1\}$ of datum $i$ to cluster $\nu$. The cost function for pairwise clustering is

$$
E_K^{pc}(\mathbf{M}) = \sum_{\nu=1}^{K} \frac{1}{2 p_\nu N} \sum_{k=1}^{N}\sum_{l=1}^{N} M_{k\nu} M_{l\nu} D_{kl} \qquad (1)
$$

with $p_\nu = \sum_{k=1}^{N} M_{k\nu}/N$. Only pairs of data points assigned to the same cluster contribute to the total cost; dissimilarities between data which belong to different clusters are not counted. To compensate for cluster asymmetries, i.e., largely varying frequencies of examples for the different clusters, the cost per cluster is normalized by the cluster probability $p_\nu$. With this definition, the minimization of (1) yields a data partitioning with minimal average dissimilarity of data points within each cluster; such a grouping heuristic maximizes the average cluster compactness. An important special case is the Euclidean pairwise clustering problem: data points are given as vectors $\mathbf{x}_i \in \mathbb{R}^d$ and the dissimilarity measure is the squared Euclidean distance, $D_{kl} = (\mathbf{x}_k - \mathbf{x}_l)^2$. The cost function for Euclidean pairwise clustering is defined as

$$
E_K^{epc}(\mathbf{M}) = \sum_{\nu=1}^{K} \frac{1}{2 p_\nu N} \sum_{k=1}^{N}\sum_{l=1}^{N} M_{k\nu} M_{l\nu} (\mathbf{x}_k - \mathbf{x}_l)^2 . \qquad (2)
$$

With the identity

$$
\sum_{k=1}^{N}\sum_{l=1}^{N} \frac{M_{k\nu} M_{l\nu}}{2 p_\nu N} (\mathbf{x}_k - \mathbf{x}_l)^2 = \sum_{k=1}^{N} M_{k\nu} (\mathbf{x}_k - \mathbf{y}_\nu)^2 , \qquad (3)
$$

Euclidean pairwise clustering is shown to be equivalent to central clustering [4, 5, 6, 9] with cost function

$$
E_K^{cc}(\mathbf{M}) = \sum_{\nu=1}^{K}\sum_{k=1}^{N} M_{k\nu} (\mathbf{x}_k - \mathbf{y}_\nu)^2 , \qquad (4)
$$

the prototypes $\mathbf{y}_\nu$ being the cluster centroids
$$
\mathbf{y}_\nu = \sum_{k=1}^{N} M_{k\nu}\,\mathbf{x}_k \bigg/ \sum_{k=1}^{N} M_{k\nu} . \qquad (5)
$$
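As a quick numerical check of the identity (3) and of the equivalence between (2) and (4)–(5), the following sketch (using NumPy; all variable names are our own illustration, not part of the paper) evaluates the pairwise cost directly from the squared distances and the central cost from the coordinates, for an arbitrary Boolean assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, d = 60, 3, 2
X = rng.normal(size=(N, d))                         # data points x_k
labels = rng.integers(0, K, size=N)
labels[:K] = np.arange(K)                           # ensure every cluster is non-empty
M = np.eye(K)[labels]                               # Boolean assignment matrix M_{k,nu}

# squared Euclidean dissimilarities D_kl = (x_k - x_l)^2
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
p = M.sum(axis=0) / N                               # cluster probabilities p_nu

# pairwise clustering cost (2): sum_nu 1/(2 p_nu N) sum_kl M_k,nu M_l,nu D_kl
E_pairwise = sum((np.outer(M[:, nu], M[:, nu]) * D).sum() / (2 * p[nu] * N)
                 for nu in range(K))

# central clustering cost (4) with the centroids (5)
Y = (M.T @ X) / M.sum(axis=0)[:, None]              # y_nu = sum_k M_k,nu x_k / sum_k M_k,nu
E_central = (M * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)).sum()

print(E_pairwise, E_central)                        # agree up to floating-point error
```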

We address the combinatorial optimization problem associated with the cost function (1) or (4) in a probabilistic framework, using the principle of maximum entropy estimation. Besides strong empirical support for such a strategy, the maximum entropy principle leads to the provably most robust density estimation [7] of the data assignments, which is of paramount importance for randomized algorithms, e.g., simulated annealing and neural network relaxation [2]. The fundamental quantity for calculations of the (thermally) averaged assignment matrix $\langle \mathbf{M} \rangle$ is known in statistical physics as the free energy [6]



$$
\mathcal{F}_K = -\frac{1}{\beta} \ln Z_K , \qquad (6)
$$



$Z_K = \sum_{\mathbf{M}} \exp\left(-\beta E_K(\mathbf{M})\right)$ being the partition function and $\beta = 1/T$ the inverse temperature. In the limit $T \to 0$, there are no thermal fluctuations and we calculate a solution to the hard clustering problem.

In the case of Euclidean pairwise clustering with the cost function (2), the interactions between pairs of data points can be reduced to interactions with the centroids, as demonstrated in (4). The corresponding partition function has the factorized form
$$
Z_K^{cc} = \sum_{\mathbf{M}} \exp\left(-\beta E_K^{cc}\right) = \prod_{k=1}^{N} \sum_{\nu=1}^{K} \exp\left(-\beta (\mathbf{x}_k - \mathbf{y}_\nu)^2\right). \qquad (7)
$$
The average assignment variables, which can be interpreted in our context as fuzzy membership variables, are derivatives of the free energy, i.e.,
$$
\langle M_{i\nu} \rangle = \frac{\partial \mathcal{F}_K^{cc}}{\partial E_{i\nu}}
= \frac{\exp\left(-\beta (\mathbf{x}_i - \mathbf{y}_\nu)^2\right)}{\sum_{\mu=1}^{K} \exp\left(-\beta (\mathbf{x}_i - \mathbf{y}_\mu)^2\right)} . \qquad (8)
$$

Since the averaged assignments depend on the expected cluster centers and probabilities, this leads to a system of $N \cdot K$ coupled transcendental equations. Note that a solution of (8) is a necessary, but not a sufficient condition for the maximum entropy assignments, since the free energy (6) is non-convex and might have many local minima, the global minimum being the maximum entropy estimate. In the following we pursue the strategy of solving (8) at a given (computational) temperature using a fixed-point method. Since we assume that only the distance matrix $\mathcal{D}$ is given and the coordinate representation of the data is not available, we cannot use this method directly to solve the grouping problem. However, Eq. (3) shows an important dependence between the two problems, which can be exploited for an approximation method known as the mean-field approximation.
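A minimal sketch of the fixed-point iteration for central clustering at a fixed computational temperature, assuming NumPy: it alternates the Gibbs assignments (8) with the centroid condition (5). Function and variable names are our own illustration, not taken from the paper.

```python
import numpy as np

def soft_central_clustering(X, K, beta, n_iter=100, seed=0):
    """Fixed-point iteration for Eqs. (8) and (5) at inverse temperature beta."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    Y = X[rng.choice(N, size=K, replace=False)].copy()    # initial prototypes y_nu
    for _ in range(n_iter):
        # squared distances (x_i - y_nu)^2 and Gibbs assignments, Eq. (8)
        dist2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        logits = -beta * dist2
        logits -= logits.max(axis=1, keepdims=True)        # numerical stabilisation
        M = np.exp(logits)
        M /= M.sum(axis=1, keepdims=True)                  # <M_i,nu>
        # centroid condition, Eq. (5), with soft assignments
        Y = (M.T @ X) / M.sum(axis=0)[:, None]
    return M, Y

# usage sketch:
#   X = np.random.default_rng(1).normal(size=(200, 2))
#   M, Y = soft_central_clustering(X, K=3, beta=2.0)
```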


3 Mean-Field Approximation for Pairwise Clustering

The mean-field approximation is a well-known method to approximate the free energy (6) of an interacting many-particle system by a system of non-interacting particles, e.g., assignment variables in the case of clustering. We generalize the cost function for central clustering in (4) and introduce potentials $\mathcal{E}_{k\nu}$ for the independent data assignments. The approximate cost function is defined as

$$
E_K^{0}(\mathbf{M}) = \sum_{\nu=1}^{K}\sum_{k=1}^{N} M_{k\nu}\, \mathcal{E}_{k\nu} . \qquad (9)
$$

$Z_K^0$ denotes the corresponding partition function of the interaction-free system. To use the mean-field approach as an approximation of the pairwise clustering problem, we split the original cost function and write $E_K = E_K^0 + V_K$, where $V_K$ represents a perturbation term due to the neglected interactions. The partition function of the original system can be rewritten in terms of the partition function for the decoupled system,

$$
Z_K = \sum_{\mathbf{M}} \exp\left(-\beta E_K\right)
    = Z_K^0 \; \frac{\sum_{\mathbf{M}} \exp\left(-\beta E_K^0\right)\exp\left(-\beta V_K\right)}{\sum_{\mathbf{M}} \exp\left(-\beta E_K^0\right)}
    = Z_K^0 \, \big\langle \exp\left(-\beta V_K\right) \big\rangle_0 , \qquad (10)
$$

where $\langle \cdot \rangle_0$ denotes the average over all configurations of the cost function without interactions. The mean-field approach is based on the approximation $\langle \exp(-\beta V_K)\rangle_0 \ge \exp(-\beta \langle V_K \rangle_0)$ (Jensen's inequality), which yields a lower bound $Z_K^{MF}$ for $Z_K$,

$$
Z_K \ge Z_K^0 \exp\left(-\beta \langle V_K \rangle_0\right) =: Z_K^{MF} . \qquad (11)
$$

Using the fact that for decoupled configurations the quadratic terms are statistically independent, $\langle M_{k\nu} M_{l\nu}\rangle_0 = \langle M_{k\nu}\rangle_0 \langle M_{l\nu}\rangle_0$, the averaged potential $\langle V_K \rangle$ amounts to
$$
\langle V_K \rangle = \sum_{\nu=1}^{K} \sum_{k=1}^{N} \langle M_{k\nu}\rangle \left( \sum_{l=1}^{N} \frac{\langle M_{l\nu}\rangle}{2 p_\nu N}\, D_{kl} - \mathcal{E}_{k\nu} \right), \qquad (12)
$$

where we omitted the subscript of the averages for conciseness. The expected assignment variables are

$$
\langle M_{i\nu} \rangle = \frac{\exp\left(-\beta\, \mathcal{E}_{i\nu}\right)}{\sum_{\mu=1}^{K} \exp\left(-\beta\, \mathcal{E}_{i\mu}\right)} . \qquad (13)
$$
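Both quantities are simple to evaluate once the potentials are fixed. A minimal sketch (NumPy; the function names are illustrative assumptions, not the authors' code) computes the expected assignments (13) from a given potential matrix and the averaged potential (12) that enters the lower bound (11).

```python
import numpy as np

def expected_assignments(E, beta):
    """Eq. (13): <M_i,nu> = exp(-beta E_i,nu) / sum_mu exp(-beta E_i,mu)."""
    logits = -beta * E
    logits -= logits.max(axis=1, keepdims=True)     # stabilise the softmax numerically
    M = np.exp(logits)
    return M / M.sum(axis=1, keepdims=True)

def averaged_potential(D, E, M):
    """Eq. (12): <V_K> = sum_nu sum_k <M_k,nu> (sum_l <M_l,nu> D_kl / (2 p_nu N) - E_k,nu)."""
    N = D.shape[0]
    p = M.sum(axis=0) / N                           # cluster probabilities p_nu
    inner = (D @ M) / (2.0 * p * N)                 # sum_l <M_l,nu> D_kl / (2 p_nu N)
    return float((M * (inner - E)).sum())
```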

4 Multidimensional Scaling and Clustering

Since the potentials $\mathcal{E}_{k\nu}$ entering $E_K^0$ have not been specified so far, they can be selected such that the quality of the mean-field approximation is optimized. We vary the free parameters $\mathcal{E}_{k\nu}$ to maximize the (logarithm of the) lower bound (11). The calculation of the partial derivatives yields



$$
\frac{\partial \langle V_K \rangle}{\partial \mathcal{E}_{i\nu}}
= -\sum_{\mu=1}^{K} \frac{\partial \langle M_{i\mu}\rangle}{\partial \mathcal{E}_{i\nu}}\, \mathcal{E}_{i\mu}
+ \sum_{\mu=1}^{K} \Bigg[ \frac{1}{2 p_\mu N} \sum_{k,l=1}^{N} \left( \frac{\partial \langle M_{k\mu}\rangle}{\partial \mathcal{E}_{i\nu}}\, \langle M_{l\mu}\rangle + \frac{\partial \langle M_{l\mu}\rangle}{\partial \mathcal{E}_{i\nu}}\, \langle M_{k\mu}\rangle \right) D_{kl}
- \frac{1}{2 p_\mu^2 N^2}\, \frac{\partial \langle M_{i\mu}\rangle}{\partial \mathcal{E}_{i\nu}} \sum_{k,l=1}^{N} \langle M_{k\mu}\rangle \langle M_{l\mu}\rangle D_{kl} \Bigg]
- \langle M_{i\nu}\rangle , \qquad (14)
$$
$$
\frac{\partial \ln Z_K^0}{\partial \mathcal{E}_{i\nu}} = -\beta\, \langle M_{i\nu}\rangle . \qquad (15)
$$
With $\partial \langle M_{k\nu}\rangle / \partial \mathcal{E}_{i\mu} = 0$ for $i \neq k$ and
$$
\frac{\partial \langle M_{i\nu}\rangle}{\partial \mathcal{E}_{i\mu}} = -\beta\, \langle M_{i\nu}\rangle \left( \delta_{\nu\mu} - \langle M_{i\mu}\rangle \right) , \qquad (16)
$$

maximizing the lower bound (11) yields
$$
\frac{\partial}{\partial \mathcal{E}_{i\mu}} \ln Z_K^{MF} \;\propto\; \sum_{\nu=1}^{K} \frac{\partial \langle M_{i\nu}\rangle}{\partial \mathcal{E}_{i\mu}} \left( \mathcal{E}_{i\nu} - \mathcal{E}^*_{i\nu} \right) = 0 . \qquad (17)
$$

The “optimal” potentials
$$
\mathcal{E}^*_{i\nu} = \frac{1}{p_\nu N} \sum_{k=1}^{N} \langle M_{k\nu}\rangle \left( D_{ik} - \frac{1}{2 p_\nu N} \sum_{l=1}^{N} \langle M_{l\nu}\rangle D_{kl} \right) \qquad (18)
$$
depend on the given distance matrix, the averaged assignment variables and the cluster probabilities. They are optimal in the sense that, if we set
$$
\mathcal{E}_{i\nu} = \mathcal{E}^*_{i\nu} + c_i , \qquad (19)
$$
the $N \cdot K$ equations
$$
\sum_{\nu=1}^{K} \frac{\partial \langle M_{i\nu}\rangle}{\partial \mathcal{E}_{i\mu}} \left( \mathcal{E}_{i\nu} - \mathcal{E}^*_{i\nu} \right) = 0 \qquad (20)
$$


are fulfilled for every $i \in \{1,\dots,N\}$ and $\mu \in \{1,\dots,K\}$. The freedom to choose an additive constant $c_i$ for every data point is explained by the fact that $\sum_{\nu=1}^{K} \partial \langle M_{i\nu}\rangle / \partial \mathcal{E}_{i\mu} = 0$, which reflects the restriction $\sum_{\nu=1}^{K} \langle M_{i\nu}\rangle = 1$. A simultaneous solution of Eq. (19) together with (8) and the centroid condition for the prototypes constitutes a necessary condition for a maximum of the lower bound (11) on the partition function $Z_K$. Therefore, the global maximum approximates the solution of the pairwise clustering problem in the optimal way by pairwise Euclidean clustering.
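A hedged sketch of how (18) and (13) can be iterated on a given dissimilarity matrix; the function names, the random initialisation and the damping factor are our own choices and not prescribed by the paper.

```python
import numpy as np

def optimal_potentials(D, M):
    """Eq. (18): E*_i,nu = 1/(p_nu N) sum_k <M_k,nu> (D_ik - sum_l <M_l,nu> D_kl / (2 p_nu N))."""
    n = M.sum(axis=0)                               # p_nu N
    intra = np.einsum('kv,lv,kl->v', M, M, D)       # sum_kl <M_k,nu><M_l,nu> D_kl
    return (D @ M) / n - intra / (2.0 * n ** 2)     # shape (N, K)

def pairwise_meanfield(D, K, beta, n_iter=200, seed=0):
    """Alternate Eq. (18) and Eq. (13) until the assignments settle."""
    rng = np.random.default_rng(seed)
    M = rng.dirichlet(np.ones(K), size=D.shape[0])  # random soft initialisation
    for _ in range(n_iter):
        E = optimal_potentials(D, M)
        logits = -beta * E
        logits -= logits.max(axis=1, keepdims=True)
        M_new = np.exp(logits)
        M_new /= M_new.sum(axis=1, keepdims=True)   # Eq. (13)
        M = 0.5 * M + 0.5 * M_new                   # mild damping for stability (our choice)
    return M
```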

So far we have not addressed the problem of how to find an embedding $\mathbf{x}$ for the data points. If we impose the additional restriction that the potentials be the squared distance between the data point and the cluster center, we can optimize the potentials $\mathcal{E}_{i\nu}$ under the constraints

$$
\mathcal{E}_{i\nu} = (\mathbf{x}_i - \mathbf{y}_\nu)^2 . \qquad (21)
$$

In general, the approximations (19) are not consistent with the restriction (21), and we have to use a suboptimal choice for $E_K^0$, which leads to a (probably) less accurate mean-field approximation, in order to guarantee consistency. But if we are allowed to choose the dimension $d$ of the embedding space large enough, we have $d + 1$ degrees of freedom ($d$ coordinates and the constant $c_i$) to solve the $K$ equations for each data point. To be precise, we insert (21) in (19), which yields for every data point the $K$ equations

$$
\mathbf{y}_\nu^T \mathbf{x}_i = -\tfrac{1}{2}\left( \mathcal{E}^*_{i\nu} - \|\mathbf{y}_\nu\|^2 + c_i - \|\mathbf{x}_i\|^2 \right), \qquad \nu \in \{1,\dots,K\}. \qquad (22)
$$
With the definitions $\hat{\mathbf{x}}_i := \left(\mathbf{x}_i^T,\; \tfrac{1}{2}(c_i - \|\mathbf{x}_i\|^2)\right)^T$, $\hat{\mathbf{y}}_\nu := \left(\mathbf{y}_\nu^T,\; 1\right)^T$ and $\hat{E}_{i\nu} := -\tfrac{1}{2}\left(\mathcal{E}^*_{i\nu} - \|\mathbf{y}_\nu\|^2\right)$, this yields a system of $K$ linear equations for $d+1$ unknowns,
$$
\hat{\mathbf{y}}_\nu^T\, \hat{\mathbf{x}}_i = \hat{E}_{i\nu} , \qquad \nu \in \{1,\dots,K\}. \qquad (23)
$$

Notice that the quadratic term $\|\mathbf{x}_i\|^2$ has been absorbed by the constant $c_i$. The solvability of (23) depends on the row rank of the matrix $\hat{\mathbf{Y}} = (\hat{\mathbf{y}}_1, \dots, \hat{\mathbf{y}}_K)^T$. If $d + 1 < K$, the cluster centers are always linearly dependent and a solution does not exist in general. For $d + 1 \ge K$, a unique solution exists if the row rank of $\hat{\mathbf{Y}}$ is $K$. In the limiting case $d = K - 1$, the set of extended vectors $\{\hat{\mathbf{y}}_1, \dots, \hat{\mathbf{y}}_K\}$ has to be linearly independent to guarantee a unique solution, i.e., there must be no $\mathbf{v} = (v_1,\dots,v_K) \neq \vec{0}$ with $\sum_{\nu} v_\nu \mathbf{y}_\nu = 0$ and $\sum_{\nu} v_\nu = 0$, and the cluster centers have to be in general position (no cluster center lies in the affine subspace spanned by the remaining centers), as assumed in the following. The preceding discussion shows that it is generally possible to fulfil the equations (19) even if the potentials are restricted to Euclidean distances (21), provided the dimension of the Euclidean embedding space is at least $d = K - 1$. Since it is desirable to keep the dimension $d$ as small as possible, we will calculate the solution in the case $d < K - 1$. Therefore, we treat the coordinates $\mathbf{x}_i$ instead of the potentials $\mathcal{E}_{i\nu}$ as the variational parameters and calculate the derivatives $\partial\left(\ln Z_K^0 - \beta\langle V_K\rangle\right)/\partial \mathbf{x}_i$. Starting with (17) and applying the chain rule, we arrive at the approximation

$$
\frac{\partial}{\partial \mathbf{x}_i}\left( \ln Z_K^0 - \beta \langle V_K \rangle \right)
\approx 2\beta \sum_{\nu=1}^{K} \langle M_{i\nu}\rangle \left( \mathcal{E}_{i\nu} - \mathcal{E}^*_{i\nu} \right) \Bigg( \mathbf{y}_\nu - \sum_{\mu=1}^{K} \langle M_{i\mu}\rangle\, \mathbf{y}_\mu \Bigg) , \qquad (24)
$$

where terms proportional to $\partial \mathbf{y}_\nu / \partial \mathbf{x}_i$ have been neglected.

Points per cluster   Standard deviation σ   MSE embedding   MSE noise
100                  0.0                    < 1.0e-20       0.0
10                   5.0                    7.41            24.60
100                  5.0                    0.83            25.09
10                   10.0                   29.55           98.40
100                  10.0                   3.35            100.37
500                  10.0                   0.74            100.48
100                  20.0                   13.44           401.48

Table 1: Coordinate reconstruction from (squared) pairwise distances corrupted with additive Gaussian noise.

The stationarity condition $\partial\left( \ln Z_K^0 - \beta \langle V_K \rangle_0 \right)/\partial \mathbf{x}_i = 0$ is solved for $\mathbf{x}_i$ with the approximation $\partial \mathbf{y}_\nu / \partial \mathbf{x}_i = 0$, which yields
$$
\mathbf{K}_i\, \mathbf{x}_i = \frac{1}{2} \sum_{\nu=1}^{K} \langle M_{i\nu}\rangle \left( \|\mathbf{y}_\nu\|^2 - \mathcal{E}^*_{i\nu} \right) \Bigg( \mathbf{y}_\nu - \sum_{\mu=1}^{K} \langle M_{i\mu}\rangle\, \mathbf{y}_\mu \Bigg) , \qquad (25)
$$
with the covariance matrix $\mathbf{K}_i = \langle \mathbf{y}\mathbf{y}^T\rangle_i - \langle\mathbf{y}\rangle_i \langle\mathbf{y}\rangle_i^T$ and the definition $\langle\mathbf{y}\rangle_i = \sum_{\nu=1}^{K} \langle M_{i\nu}\rangle\, \mathbf{y}_\nu$; $\langle\mathbf{y}\rangle_i$ can be viewed as the average prototype of data point $i$. Equation (24) is only exact under the assumption that the cluster centers $\mathbf{y}_\nu$ are independent of $\mathbf{x}_i$. On the other hand, if the $\mathbf{y}_\nu$ are chosen according to the centroid condition, then modifications of all other assignment variables caused by changes in $\mathbf{x}_i$ are neglected; these modifications are mediated via the displacement of $\mathbf{y}_\nu$. To justify this approximation we consider the variation of $\ln Z_K^0 - \beta\langle V_K\rangle_0$ with respect to the independent variation parameters $\mathbf{y}_\nu$, which yields a second set of stationarity conditions,

$$
\sum_{j=1}^{N} \left( \mathbf{x}_j - \mathbf{y}_\nu \right) \langle M_{j\nu}\rangle \left( 1 - \langle M_{j\nu}\rangle \right) \left( \mathcal{E}_{j\nu} - \mathcal{E}^*_{j\nu} \right) = 0 , \qquad (26)
$$

for all $\nu \in \{1,\dots,K\}$. The weighting factors in (26) decay exponentially fast with the inverse temperature, $\langle M_{i\nu}\rangle\left(1 - \langle M_{i\nu}\rangle\right) \sim \mathcal{O}\!\left(\exp[-\beta c]\right)$ with $c > 0$. This implies that the optimal solution for the data coordinates shows only a very weak dependence on the special choice of the prototypes in the low-temperature regime. Fixing the parameters $\mathbf{y}_\nu$ and solving (25), the solution will be very close to the optimal approximation. It is thus possible to choose the prototypes as the cluster centroids in a self-consistent way: use (25) to calculate a new embedding for the data points, and then recalculate the assignment variables and the new cluster centroids.

The derived system of transcendental equations explicitly reflects the dependencies between the clustering procedure and the Euclidean representation. Solving these equations simultaneously leads to an algorithm which interleaves the multidimensional scaling process and the clustering process, avoiding an artificial algorithmic separation into two uncorrelated processes.
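A self-contained sketch of one possible realisation of this interleaved scheme (NumPy; the initialisation, the regularisation of nearly singular covariance matrices and all names are our own assumptions, not the authors' implementation). In practice the inverse temperature would additionally be annealed from small values, as in deterministic annealing.

```python
import numpy as np

def pairwise_cluster_and_embed(D, K, d, beta, n_iter=100, seed=0):
    """Interleave the centroid condition (5), the assignments (8) with the
    constrained potentials (21), the optimal potentials (18) and the
    coordinate update (25)."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    X = rng.normal(size=(N, d))                         # embedding coordinates x_i
    M = rng.dirichlet(np.ones(K), size=N)               # soft assignments <M_i,nu>
    for _ in range(n_iter):
        Y = (M.T @ X) / M.sum(axis=0)[:, None]          # centroid condition (5)
        # assignments (8) for the constrained potentials E_i,nu = (x_i - y_nu)^2, Eq. (21)
        logits = -beta * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        logits -= logits.max(axis=1, keepdims=True)
        M = np.exp(logits)
        M /= M.sum(axis=1, keepdims=True)
        # optimal potentials (18) derived from the dissimilarity matrix
        n = M.sum(axis=0)                               # p_nu N
        intra = np.einsum('kv,lv,kl->v', M, M, D)       # sum_kl <M_k,nu><M_l,nu> D_kl
        E_star = (D @ M) / n - intra / (2.0 * n ** 2)
        # coordinate update (25): K_i x_i = 1/2 sum_nu <M_i,nu>(|y_nu|^2 - E*_i,nu)(y_nu - <y>_i)
        y_mean = M @ Y                                  # <y>_i
        cov = (np.einsum('iv,va,vb->iab', M, Y, Y)
               - np.einsum('ia,ib->iab', y_mean, y_mean))
        w = M * (np.sum(Y ** 2, axis=1)[None, :] - E_star)
        rhs = 0.5 * (w @ Y - w.sum(axis=1, keepdims=True) * y_mean)
        reg = 1e-8 * np.eye(d)                          # guard against singular K_i
        X = np.stack([np.linalg.solve(cov[i] + reg, rhs[i]) for i in range(N)])
    return M, X

# usage sketch: for a symmetric dissimilarity matrix D,
#   M, X = pairwise_cluster_and_embed(D, K=3, d=2, beta=1.0)
```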


Figure 1: Embedding of two-dimensional data into one dimension. (a) The data distribution of three Gaussians with principal components parallel to the coordinate axes (300 of 1500 data points are displayed); the standard deviations were 0.7 and 2.0, respectively. (b), (c) The correlations between the $x_i$ and $y_i$ coordinates and the coordinate of the one-dimensional embedding.

5 Simulation Results

The properties of the described algorithm for simultaneous Euclidean embedding and data clustering are demonstrated with three different experiments: (i) clustering and embedding of data which are selected from a Euclidean space; (ii) clustering and dimension reduction of data which are inhomogeneously distributed; (iii) clustering of real-world proximity data for protein sequences.

The first clustering experiment is concerned with clustering of data which are described by Euclidean distances in a two-dimensional space. The data are generated by a distribution of three Gaussians with unit variance. To test the behavior of the algorithm, the squared distances are contaminated with Gaussian noise. The reconstruction of the coordinates in the case of zero noise is perfect up to numerical errors, as shown in Table 1 (σ = 0). Note that the reconstruction is not obvious, since the original distances are not directly compared (e.g., in the form of a squared error) with the new distances after pairwise clustering. The robustness is investigated by disturbing the squared distances with additive Gaussian noise (standard deviation σ). The algorithm is capable of detecting an embedding for the data which approximates the given distances, although there does not exist a consistent embedding in any metric space due to the added noise. The structure present in the pairwise distances is exploited to suppress the noise and to select an embedding with distances closer to the original, noise-free data than the noisy distances. This robustness, which is most significant for highly corrupted data, is extremely important for data interpretation in empirical sciences, since data gathered, e.g., in psychological

experiments very often have a considerable degree of uncertainty. The ability to suppress noise increases significantly with the number of points. The calculated coordinate representation is, of course, only determined up to translations, rotations and reflections.

The capability of finding low-dimensional representations for high-dimensional data is demonstrated with a data set drawn from a mixture of three Gaussians with different orientations of the principal component axes (Fig. 1a). The linear projection onto the principal component of the data distribution which is optimal in separating the three clusters is the one onto the x-axis. Applying this projection, however, would erase all relevant information about the inner pairwise distances of the two clusters which are elongated along the y-axis. What we expect to be the best one-dimensional embedding is a local principal component analysis which “unfolds” the two-dimensional structure. This is indeed the solution of (25), as shown in Fig. 1b,c. The one-dimensional coordinates are strongly correlated with the coordinates along the principal component of each individual cluster, indicated by the approximately linear relationship between the principal component axis and the one-dimensional embedding. The deviations from the ideal linear relationship are not only due to approximation errors, since every point is individually fitted and the minor components are not completely ignored, as they would have been in a projection onto the principal component axis. The algorithm thus finds, in an unsupervised manner, a representation which preserves the ‘essential’ structure present in the grouping formation and in the topology of the data set.

Figure 2 shows the clustering result for a real-world data set of 145 protein sequences. The similarity values between pairs of sequences are determined by a sequence alignment program which took biochemical and structural information into account. The sequences belong to different protein families like hemoglobin, myoglobin and other globins; they are abbreviated with the displayed capital letters. The gray-level visualisation of the dissimilarity matrix, with dark values for similar protein sequences, shows the formation of distinct “cubes” along the main diagonal. These cubes correspond to the discovered partition after clustering. The embedding in two dimensions shows inter-cluster distances which are in good agreement with the similarity values of the data. In three and four dimensions the error between the given dissimilarities and the found distances was further reduced. Altogether, the results are consistent with the biological classification.

Figure 2: Similarity matrix of 145 protein sequences of the globin family (a); dark gray levels correspond to high similarity values. (b) Clustering with embedding in two dimensions (cluster labels: HA, HB, HBX, HE, HF, HG, GG, GP, MY).

6 Conclusion

Pairwise clustering of distance data which are described by a dissimilarity matrix and not by a set of coordinates is a hard combinatorial optimization problem which might have exponentially many local minima in the worst case. We have developed an approximate solution in the maximum entropy framework. A variational approach called the mean-field approximation is employed to calculate assignments of data to clusters which are optimal (at least in principle) for central clustering with a squared Euclidean distance. The coordinates are treated in this approximation as the variational parameters. They represent a Euclidean embedding which preserves as much as possible of the original clustering structure and which is superior to linear methods like principal component analysis or projection pursuit [8]. The algorithm might serve as a valuable tool for pattern recognition problems in psychology, molecular biology or linguistics to group and visualize distance data with a similarity characterization. Furthermore, the algorithm can be employed to organize a nonlinear projection of sparse data in high-dimensional spaces to low-dimensional subspaces in an unsupervised way. Future research has to address the question of how to select an appropriate dimension of the embedding space. Such a search could be guided by complexity measures [9] like the minimum description length.

Acknowledgement: It is a pleasure to thank M. Vingron for providing us with the protein data. This work was supported by the Ministry of Science and Research of the state of North Rhine-Westphalia.

References

[1] D. Amit, Modelling Brain Function. Cambridge: Cambridge University Press, 1989.
[2] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. New York: Addison Wesley, 1991.
[3] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[4] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, 1967.
[5] R. M. Gray, “Vector quantization,” IEEE Acoustics, Speech and Signal Processing Magazine, pp. 4–29, April 1984.
[6] K. Rose, E. Gurewitz, and G. Fox, “Statistical mechanics and phase transitions in clustering,” Physical Review Letters, vol. 65, no. 8, pp. 945–948, 1990.
[7] Y. Tikochinsky, N. Tishby, and R. D. Levine, “Alternative approach to maximum-entropy inference,” Physical Review A, vol. 30, pp. 2638–2644, 1984.
[8] P. Huber, “Projection pursuit,” Annals of Statistics, vol. 13, pp. 435–475, 1985.
[9] J. Buhmann and H. Kühnel, “Vector quantization with complexity costs,” IEEE Transactions on Information Theory, vol. 39, pp. 1133–1145, July 1993.
