A model-based distance for clustering

Magnus Rattray
Department of Computer Science, Manchester University, Manchester M13 9PL, UK.
email: [email protected]

Abstract

A Riemannian distance is defined which is appropriate for clustering multivariate data. This distance requires that the data is first fitted with a differentiable density model, allowing the definition of an appropriate Riemannian metric. A tractable approximation is developed for the case of a Gaussian mixture model, and the distance is tested on artificial data, demonstrating an ability to deal with differing length scales and linearly inseparable data clusters. Further work is required to investigate performance on larger data sets.

1 Introduction

Finding clusters in multivariate data is a difficult task in general, perhaps not least because the very definition of a cluster is ambiguous. A number of clustering algorithms have been developed in the literature which aim to minimise some sort of partitioning error measure (see, for example, [1]). Some notion of distance is required for these algorithms and typically the Euclidean distance is used. Recently, Tipping introduced a novel Riemannian metric for enhancing the performance of clustering and visualisation algorithms [2]. He demonstrated good performance on a number of small and large scale data sets, including impressive results when applied to a hand-written digit data set. His choice of metric generalises the Mahalanobis distance, which can be used to make Gaussian distributed data scale invariant. Instead of the inverse covariance metric used in the single Gaussian case, Tipping uses an additive mixture of inverse covariances from a fitted mixture model. The mixture weights are the posterior probabilities associated with each component, and the metric varies according to which mixture components are most influential.

The clustering metric introduced by Tipping has a number of nice features, such as invariance to linear transformations of the data variables and robustness to the number of mixture components used to model the data density. However, the method also has a number of limitations. A "straight" line (in the Euclidean sense) is used to measure distance in a Riemannian space. In general, distance will be highly dependent on the path chosen and the concept of a straight line should be re-interpreted. This is especially important if we wish to detect non-convex clusters, which may be difficult to detect using methods based on straight lines in Euclidean space. We will show how an approximation to a geodesic can be developed in order to address this issue. There is also an assumption that the covariances of individual components in a Gaussian mixture model hold useful information, but this may not be the case. For example, it is quite reasonable to fit a long thin cluster with a mixture of spherical Gaussian distributions. Indeed, we might wish to limit the number of parameters in our density model in exactly this way. More generally, we would like a metric which is not restricted to the Gaussian mixture model but can in principle be derived for an arbitrary smooth density model.

In this contribution a simple metric is introduced which is appropriate for clustering when a reasonable differentiable density model of the data exists. The basic principle behind the metric is quite simple: we interpret clusters as connected regions of relatively homogeneous high density data. This allows a metric to be defined in terms of a probability density estimated from the data. A tractable approximation to the metric is developed for the Gaussian mixture model case and we demonstrate the potential benefits of the metric on some artificial examples. Further work is required to determine how well the method behaves on more realistic data sets.
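To make the description of Tipping's metric above concrete, the following is a minimal illustrative sketch (not from the paper, and the function names are ours) of a posterior-weighted inverse-covariance metric of the kind described in [2]: an additive mixture of inverse component covariances weighted by the posterior responsibilities. Details may differ from Tipping's actual formulation.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x | mean, cov)."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def posterior_weighted_metric(x, weights, means, covs):
    """Additive mixture of inverse component covariances, weighted by the
    posterior responsibilities p(k|x), as described for Tipping's metric."""
    joint = np.array([w * gaussian_pdf(x, m, C) for w, m, C in zip(weights, means, covs)])
    resp = joint / joint.sum()                     # posterior responsibilities p(k|x)
    return sum(r * np.linalg.inv(C) for r, C in zip(resp, covs))
```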

Figure 1: In (a) we show the geodesic distance (solid line) and the Euclidean straight line distance (dashed line) between two points. In (b) we give an example of an approximation to the geodesic which is constructed from a combination of shorter Euclidean straight lines.

2 Defining a Riemannian distance

We will define a cluster to be a homogeneous and connected region of high density data. We aim to measure distances between nearby points with this definition in mind. A simple local distance with the required property is given by the change in log probability density generating the data,

\[
ds = \bigl|\log p(x + dx) - \log p(x)\bigr| \simeq \bigl|dx^{T}\,\nabla_x \log p(x)\bigr| = \sqrt{dx^{T} G(x)\, dx}\,, \qquad (1)
\]

where we have defined a Riemannian metric,

\[
G(x) = \nabla_x \log p(x)\,\bigl(\nabla_x \log p(x)\bigr)^{T}. \qquad (2)
\]

The Riemannian distance between points depends on the path chosen,

\[
D\bigl(x(t_0), x(t_1)\bigr) = \int_{x(t_0)}^{x(t_1)} dt\, \sqrt{\partial_t x^{T}\, G(x)\, \partial_t x} \;=\; \int_{x(t_0)}^{x(t_1)} dt\, \bigl|\partial_t \log p(x)\bigr|\,, \qquad (3)
\]

where \(t \in [t_0, t_1]\) parameterises the path from \(x(t_0)\) to \(x(t_1)\). The most natural choice of path is the shortest geodesic, which is the path along the curve minimising the above integral. The reason for this is that we wish the path to stay within any cluster if at all possible, so that points in the same cluster are close together. In general, finding such a geodesic is a difficult computational problem. Tipping uses a straight line in the Euclidean sense and finds a tractable approximation to the path integral in this case (for his choice of metric) [2].

If we can determine the straight line distance between every pair of points (or an approximation to it), then we can approximate a geodesic by combining straight lines in order to find the shortest path between two points, using other points as intermediate steps. In figure 1 we illustrate the three different ways to measure distance described above. In figure 1(a) the solid line shows the ideal shortest path between two points, which passes through regions of comparable data density. We see that the straight line, shown by the dashed curve, provides a poor approximation in this case; it would over-estimate the distance between two points which are in the same cluster. Figure 1(b) shows our suggested approximation to the geodesic, made up of a combination of short straight line segments. We use Floyd's O(n^3) algorithm to find the shortest path in the examples given here [3]. For larger data sets this algorithm may not be appropriate, in which case we might reduce the search space by limiting the possible number of intermediate points along the path.
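The construction just described can be sketched compactly. The following illustrative code (not from the paper; `log_density` and the discretisation level are assumptions) approximates the straight-line distance of equation (3) by accumulating |Δ log p| along a discretised segment, and then applies Floyd's algorithm [3] to combine segments into approximate geodesics.

```python
import numpy as np

def straight_line_distance(xa, xb, log_density, n_steps=50):
    """Approximate equation (3) along the straight segment from xa to xb
    by summing absolute changes in log density at n_steps sample points."""
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    logp = np.array([log_density(xa + t * (xb - xa)) for t in ts])
    return np.abs(np.diff(logp)).sum()

def geodesic_distances(points, log_density):
    """All-pairs approximate geodesic distances: straight-line distances between
    every pair of points, followed by Floyd's shortest-path algorithm [3]."""
    n = len(points)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = straight_line_distance(points[i], points[j], log_density)
    for k in range(n):                      # Floyd-Warshall relaxation through point k
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D
```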

3 Gaussian mixtures: a tractable approximation

Consider as a special case the Gaussian mixture model (for a discussion of the role of Gaussian mixture models in pattern recognition and neural networks see [4]),

\[
p(x) = \sum_k \alpha_k\, p(x|k) \quad\text{where}\quad p(x|k) = \frac{e^{-\frac{1}{2}(x - \mu_k)^{T} C_k^{-1} (x - \mu_k)}}{\sqrt{(2\pi)^d\, |C_k|}}\,. \qquad (4)
\]

From the definition in equation (2) we find the following metric,

\[
G(x) = \sum_{kl} p(k|x)\, p(l|x)\, C_k^{-1} (x - \mu_k)(x - \mu_l)^{T} C_l^{-1} \quad\text{where}\quad p(k|x) = \frac{\alpha_k\, p(x|k)}{\sum_{k'} \alpha_{k'}\, p(x|k')}\,. \qquad (5)
\]
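Equation (5) can be evaluated directly from a fitted mixture's weights, means and covariances. The sketch below is illustrative only (the function names are ours); note that the double sum factorises into an outer product, in agreement with equation (2).

```python
import numpy as np

def gmm_component_densities(x, means, covs):
    """Component densities p(x|k) of the Gaussian mixture in equation (4)."""
    dens = []
    for mu, C in zip(means, covs):
        d = len(mu)
        diff = x - mu
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(C))
        dens.append(np.exp(-0.5 * diff @ np.linalg.solve(C, diff)) / norm)
    return np.array(dens)

def gmm_metric(x, weights, means, covs):
    """Riemannian metric of equation (5) for a fitted Gaussian mixture model."""
    weights = np.asarray(weights, float)
    dens = gmm_component_densities(x, means, covs)
    resp = weights * dens / np.sum(weights * dens)          # posteriors p(k|x)
    # The double sum over k, l factorises: G(x) = g g^T with
    # g = sum_k p(k|x) C_k^{-1} (x - mu_k), which is (up to sign) grad log p(x).
    g = sum(r * np.linalg.solve(C, x - mu) for r, mu, C in zip(resp, means, covs))
    return np.outer(g, g)
```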

We use an approximation to the straight Euclidean path distance similar to the one introduced in [2]. We first exchange the order of the integral and square root in equation (3) and then approximate \(p(k|x) \simeq \alpha_k\, p(x|k)\). These two simplifications result in the following closed form approximation to the squared distance between two points along a straight Euclidean path,

\[
D^2(x_i, x_j) \simeq (x_i - x_j)^{T}\, G(x_i, x_j)\, (x_i - x_j)
\quad\text{where}\quad
G(x_i, x_j) = \frac{\sum_{kl} \alpha_k \alpha_l\, C_k^{-1} A_{kl}\, C_l^{-1}}{\sum_{kl} \alpha_k \alpha_l\, a_{kl}}\,. \qquad (6)
\]

Here, we have defined the following integrals (where \(x = x_i + t(x_j - x_i)\)),

\[
a_{kl} = \int_0^1 dt\; p(x|k)\, p(x|l)\,, \qquad
A_{kl} = \int_0^1 dt\; (x - \mu_k)(x - \mu_l)^{T}\, p(x|k)\, p(x|l)\,. \qquad (7)
\]

Calculating the integrals,

\[
a_{kl} = \frac{e^{-\gamma/2}}{(2\pi)^d \sqrt{|C_l|\,|C_k|}}\, f(\alpha, \beta)\,,
\qquad
A_{kl} = \frac{e^{-\gamma/2}}{(2\pi)^d \sqrt{|C_l|\,|C_k|}}
\left[ w_k w_l^{T}\, f(\alpha, \beta)
- \bigl(v w_l^{T} + w_k v^{T}\bigr)\, \frac{\partial f(\alpha, \beta)}{\partial \beta}
+ v v^{T}\, \frac{\partial^2 f(\alpha, \beta)}{\partial \beta^2} \right],
\]

with \(w_k = x_i - \mu_k\), \(v = x_j - x_i\) and

\[
\alpha = v^{T}\bigl(C_k^{-1} + C_l^{-1}\bigr)v\,, \qquad
\beta = v^{T}\bigl(C_k^{-1} w_k + C_l^{-1} w_l\bigr)\,, \qquad
\gamma = w_k^{T} C_k^{-1} w_k + w_l^{T} C_l^{-1} w_l\,,
\]
\[
f(\alpha, \beta) = \int_0^1 dt\; e^{-(\alpha t^2 + 2\beta t)/2}
= \sqrt{\frac{\pi}{2\alpha}}\; e^{\beta^2/2\alpha}
\left[ \operatorname{erf}\!\left(\frac{\alpha + \beta}{\sqrt{2\alpha}}\right)
- \operatorname{erf}\!\left(\frac{\beta}{\sqrt{2\alpha}}\right) \right]. \qquad (8)
\]
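As a sanity check on the closed form for f(α, β) (illustrative code only, not part of the paper), the error-function expression in equation (8) can be compared against direct numerical quadrature of the defining integral:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

def f_closed(alpha, beta):
    """Closed-form f(alpha, beta) from equation (8)."""
    s = np.sqrt(2.0 * alpha)
    return (np.sqrt(np.pi / (2.0 * alpha)) * np.exp(beta ** 2 / (2.0 * alpha))
            * (erf((alpha + beta) / s) - erf(beta / s)))

def f_numeric(alpha, beta):
    """Direct quadrature of int_0^1 exp(-(alpha t^2 + 2 beta t)/2) dt."""
    return quad(lambda t: np.exp(-(alpha * t ** 2 + 2.0 * beta * t) / 2.0), 0.0, 1.0)[0]

# Example check (values chosen arbitrarily): the two evaluations should agree closely.
assert np.isclose(f_closed(3.0, -1.2), f_numeric(3.0, -1.2), rtol=1e-6)
```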

To apply the approximation we first calculate the straight line distances according to equation (6) and then apply Floyd’s algorithm in order to determine the shortest path between each pair of points [3]. The resulting distance matrix can be used in a number of different clustering or visualisation algorithms.
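The following illustrative sketch (our own function names; the quadrature resolution is an assumption) computes the straight-line squared distance of equation (6) for a Gaussian mixture, evaluating the integrals a_kl and A_kl of equation (7) by simple numerical quadrature rather than the closed form of equation (8). The resulting pairwise distances can then be passed to the Floyd shortest-path step sketched in section 2.

```python
import numpy as np

def gmm_straight_line_sq_distance(xi, xj, weights, means, covs, n_steps=50):
    """Squared Riemannian distance along the straight line from xi to xj (equation 6),
    with the integrals a_kl and A_kl of equation (7) evaluated by trapezoidal quadrature."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    weights = np.asarray(weights, float)
    d = xi.size
    inv_covs = [np.linalg.inv(C) for C in covs]
    norms = [np.sqrt((2 * np.pi) ** d * np.linalg.det(C)) for C in covs]

    ts = np.linspace(0.0, 1.0, n_steps + 1)
    quad_w = np.full(ts.size, ts[1] - ts[0])
    quad_w[0] = quad_w[-1] = quad_w[0] / 2.0        # trapezoid rule end-point weights

    num = np.zeros((d, d))    # numerator of G(xi, xj): sum_kl alpha_k alpha_l C_k^-1 A_kl C_l^-1
    den = 0.0                 # denominator:            sum_kl alpha_k alpha_l a_kl
    for t, qw in zip(ts, quad_w):
        x = xi + t * (xj - xi)
        dens = np.array([np.exp(-0.5 * (x - m) @ Ci @ (x - m)) / nz
                         for m, Ci, nz in zip(means, inv_covs, norms)])
        # u_k = C_k^-1 (x - mu_k), so C_k^-1 (x - mu_k)(x - mu_l)^T C_l^-1 = outer(u_k, u_l)
        u = [Ci @ (x - m) for m, Ci in zip(means, inv_covs)]
        for k in range(len(weights)):
            for l in range(len(weights)):
                pkl = weights[k] * weights[l] * dens[k] * dens[l]
                den += qw * pkl
                num += qw * pkl * np.outer(u[k], u[l])
    v = xj - xi
    return float(v @ (num / den) @ v)
```

When speed matters, the quadrature loop would be replaced by the closed form of equation (8).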

4 Artificial examples

To demonstrate the potential of the metric we consider two artificial data sets, shown in figure 2. In the leftmost column we show the data along with probability contours from the fitted density models. In the top row example, 50 data points are generated from each of two long, thin Gaussian clusters (compare the scales on the axes). The data is then fitted using a mixture of four full covariance Gaussian distributions. This example is included to demonstrate robustness to differing length scales and poorly matched models. In the bottom row example, 75 data points are uniformly distributed around each of two interleaved semi-circles and a ten kernel mixture distribution is fitted to the data. This data is included as an example of clusters which are non-convex and linearly inseparable. Notice that the density models, although reasonable, exhibit significant modes which do not reflect the underlying generating distributions.

In the second two columns of figure 2 we compare pairwise distances for the Euclidean and Riemannian measures (in the latter case we have used the approximation developed in the previous section). The data is ordered so that the first half along each axis comes from one cluster and the second half from the other. Dark shades represent small distances, and it is clear that the Riemannian distance separates the clusters more cleanly in each case. In both cases 100% of the data is closer to a member of the opposite cluster than to the furthest member of the same cluster when using the Euclidean distance. For the Riemannian distance the result is 0% and 26% for the top and bottom row examples respectively. The top case is also well dealt with using the distance measure derived in [2], whereas for the lower case this measure results in rather poor performance (giving 100% with the above "badness" criterion). This indicates that the method presented in this paper deals more effectively with clusters which are non-convex and linearly inseparable.


Figure 2: The novel distance measure is compared to the Euclidean distance for two 2D artificial data sets. In the top example a mixture model with four Gaussian kernels is fitted to data generated from two long, thin clusters (compare the scales on each axis). This example demonstrates robustness to differing length scales. The bottom example shows a mixture of ten Gaussians fitted to two non-convex clusters. For each case the Euclidean and Riemannian distances between each data pair are shown. A specially adapted k-means algorithm outperforms standard k-means in both cases.

In order to demonstrate how this novel distance measure can improve the performance of a simple clustering algorithm we have developed a modified k-means algorithm appropriate for a Riemannian space. In practice, visualisation or hierarchical clustering methods might be preferred when the number of clusters is not easy to assess, as is often the case. Our non-Euclidean batch k-means algorithm partitions the space using the Riemannian distance measure, but to reduce computation time we only consider data points themselves as potential prototypes (so all the necessary distances can be computed a priori); a sketch of this prototype-restricted update is given below. In the final two columns of figure 2 we compare the result of standard k-means and our adapted version (with k = 2). The adapted algorithm achieves perfect separation for both data sets, in contrast to standard Euclidean k-means.
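Since the adapted algorithm is only described briefly above, the following is a hedged sketch of one possible implementation (our own interpretation, not the paper's code): batch k-means on the precomputed Riemannian distance matrix, with prototypes restricted to the data points themselves, giving a medoid-style update.

```python
import numpy as np

def adapted_kmeans(D, k, n_iter=100, seed=0):
    """Batch k-means on a precomputed (Riemannian) distance matrix D, with prototypes
    restricted to data points. Returns cluster labels and prototype indices."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    prototypes = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, prototypes], axis=1)           # assign to nearest prototype
        new_prototypes = prototypes.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size:
                # medoid-style update: the member minimising total distance to its cluster
                within = D[np.ix_(members, members)].sum(axis=1)
                new_prototypes[c] = members[np.argmin(within)]
        if np.array_equal(new_prototypes, prototypes):
            break
        prototypes = new_prototypes
    return labels, prototypes
```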

Acknowledgements: I would like to thank Mike Tipping and Jon Shapiro for useful discussions. This work was supported by the EPSRC grant GR/M48123.

References

[1] J. M. Buhmann, "Stochastic Algorithms for Exploratory Data Analysis: Data Clustering and Data Visualisation," in "Learning in Graphical Models," ed. M. Jordan (MIT Press, 1998) 405-420.
[2] M. E. Tipping, "Deriving Cluster Analytic Distance Functions from Gaussian Mixture Models," in Proc. of 9th Int. Conf. on Artificial Neural Networks (Edinburgh, 1999), 815-820.
[3] G. Brassard and P. Bratley, "Fundamentals of Algorithmics," (Prentice Hall, New Jersey, 1996).
[4] C. M. Bishop, "Neural Networks for Pattern Recognition," (Clarendon Press, Oxford, 1995).
