Comput Econ (2007) 29:199–212 DOI 10.1007/s10614-006-9078-7
Clustering by kernel density

Christian Mauceri · Diem Ho
Received: 15 November 2006 / Accepted: 12 December 2006 / Published online: 1 March 2007 © Springer Science+Business Media B.V. 2007
Abstract Kernel methods have been used for various supervised learning tasks. In this paper, we present a new clustering method based on kernel density. The method makes no assumption about the number of clusters or about their shapes. It is simple and robust, and performs as well as or better than other methods on problems known to be difficult.

Keywords Clustering · Kernel

1 Introduction

The objective of clustering is to group data points based on their characteristic attributes. Clustering can use a parametric model, as in mixture parameter estimation, of which the k-means algorithm of MacQueen (1967) is a special case. It can also use a non-parametric model, as in the scale-space clustering method of Roberts (1996). The multiplicity of methods has often been noted by many authors, and very good surveys have been written on the subject (see Jain, Murty, & Flynn, 1999 for instance). A kernel on a set X is a function X × X → R, (x, y) → k(x, y), such that k is symmetric, definite and positive.¹
C. Mauceri (B) · D. Ho
IBM Europe, 2 avenue Gambetta, Tour Descartes – La Défense 5, 92066 Courbevoie, France
e-mail: [email protected]

D. Ho
e-mail: [email protected]

C. Mauceri
École Nationale Supérieure des Télécommunications – Bretagne, Technopôle Brest-Iroise, CS 83818, 29238 Brest, France
Kernel methods, of which the Support Vector Machine (see Burges, 1998 for a good tutorial) is certainly the best-known example, were initially used for supervised learning. The principle governing these methods is to map objects into richer, high-dimensional spaces called feature spaces, where a dot product is used for classification, regression or clustering purposes. The advantage of this principle is that the dot product in the feature space is represented by a kernel in the original space, avoiding an explicit and intractable mapping into the feature space.

Three very popular clustering methods using kernels are kernel k-means, spectral clustering and support vector clustering. The first applies the k-means algorithm in the feature space in order to extract clusters which would not be separable in the original space. Spectral clustering algorithms compute the dominant eigenvectors of an affinity matrix (see Ng, Jordan, & Weiss, 2002), project the data points onto the space spanned by the selected eigenvectors and then cluster them using a simple algorithm like k-means. Support vector clustering algorithms look for the smallest sphere in the feature space containing the image of the data, up to a tolerance threshold, and then identify the connected components in the original space (see Ben-Hur, Horn, Siegelmann, & Vapnik, 2001). Kernel k-means is attractive because of its simplicity, speed and low memory requirements, but it is sensitive to initial conditions and requires a pre-set number of clusters. One great advantage of the spectral approach is its representation of the clusters in matrix form, allowing a visual analysis of the clustering in terms of density and inter-cluster connectivity. It requires, however, eigenvector computations which can be costly and somewhat difficult to interpret. Support vector clustering is very appealing because of its elegance and its strong connection with support vector supervised learning. It suffers, however, from a lack of natural visual representation.

In this paper, we propose a non-parametric clustering algorithm based on a kernel density approach. We identify the initial clusters using a simple algorithm that finds a diagonal block structure of the kernel matrix² for a given kernel density of these blocks. The outcome is then refined using a connectivity threshold to obtain the final clustering. We use a connected-component algorithm at the dense-cluster level and improve the clustering sharpness by tuning the kernel density of the diagonal block structure. This is simpler than, and in contrast with, other methods which use connected components at the element level and refine the clustering outcome with the Gaussian kernel radius. Our method can be used with any type of kernel. We present the proposed method in Sect. 2, discuss its benefits in Sect. 3, and conclude with a summary and some future directions of research in Sect. 4.
¹ For all $(x, y) \in X \times X$ we have $k(x, y) = k(y, x)$, and for all $(x_i)_{1\le i\le n} \in X^n$ and $(c_i)_{1\le i\le n} \in \mathbb{R}^n$ we have $\sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j k(x_i, x_j) \ge 0$ (see Haussler, 1999). We call X the original space.
² Given a set of elements $S = \{x_i \in X \mid 1 \le i \le n\}$ and a kernel $X \times X \to \mathbb{R}$, $(x, y) \mapsto k(x, y)$, the kernel matrix K is the $n \times n$ matrix whose entries are $K_{i,j} = k(x_i, x_j)$.
Fig. 1 The inner cluster is composed of 50 points generated from a Gaussian distribution. The two concentric rings contain, respectively, 150 and 300 points, generated from a uniform angular distribution and a radial Gaussian distribution. The horizontal and vertical axes give the coordinates of the points in centimeters
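For readers who wish to reproduce a configuration of this kind, the following is a minimal sketch of how such data could be generated. The point counts (50, 150, 300) follow the caption of Fig. 1; the radii, standard deviations and random seed are illustrative assumptions of ours, not values given in the paper.

```python
import numpy as np

def make_rings(seed=0):
    """Central Gaussian blob plus two concentric noisy rings (cf. Fig. 1)."""
    rng = np.random.default_rng(seed)

    # Inner cluster: 50 points from an isotropic Gaussian.
    inner = rng.normal(loc=0.0, scale=0.3, size=(50, 2))

    def ring(n_points, radius, radial_sigma):
        # Uniform angular distribution, Gaussian-perturbed radius.
        theta = rng.uniform(0.0, 2.0 * np.pi, n_points)
        r = rng.normal(radius, radial_sigma, n_points)
        return np.column_stack((r * np.cos(theta), r * np.sin(theta)))

    first_ring = ring(150, radius=2.0, radial_sigma=0.2)
    outer_ring = ring(300, radius=4.0, radial_sigma=0.2)

    # Points are ordered 1-50 (inner), 51-200 (first ring), 201-500 (outer ring),
    # matching the numbering used for the kernel matrix of Fig. 2.
    return np.vstack((inner, first_ring, outer_ring))

X = make_rings()
print(X.shape)  # (500, 2)
```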
2 Kernel density

Our method, based on kernel density, is closely related to spectral clustering in terms of cluster representation, but is rooted in similarity aggregation (see Marcotorchino, 1981 and Michaud, 1985). The difference is that, instead of using the intra-cluster kernel sum, we maximize the intra-cluster kernel density as the objective function. This helps us to avoid the size-dominance effect that leads to disparity in cluster sizes and thus to results that are difficult to analyze. Our approach is based on transforming the kernel matrix into a dense block-diagonal³ matrix and on connected-component identification to detect non-convex and overlapping clusters. The advantages of our method are:

• Capability of correctly identifying clusters in well-known difficult configurations (non-convex and overlapping clusters) with no prior knowledge of the number of clusters or of their shapes.
• Independence of kernel type.
• Rapidity of processing, in particular in the case of sparse data.
• Consistent treatment of heterogeneous data types (quantitative and qualitative mix).
• Intuitive interpretation of the results.

2.1 The procedure

To illustrate our method we use the synthetic data in Fig. 1, generated similarly to Ben-Hur et al. (2001). Suppose the dots in the inner cluster are numbered from 1 to 50, those of the first ring from 51 to 200 and those of the second and outer ring from 201 to 500. When the kernel $k(x, y) = e^{-\|x-y\|^2}$, where x and y are data points, is applied to the dots of Fig. 1 numbered as described, we get the kernel matrix given in Fig. 2.

³ Strictly speaking we are not transforming the kernel matrix into a block-diagonal one, as the values outside the blocks can be different from zero; these values are only much lower than those inside the blocks.
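As a concrete illustration, here is a minimal sketch of how this kernel matrix can be computed for any array of points such as the Fig. 1 data; the function name and the vectorised distance computation are ours, only the kernel $e^{-\|x-y\|^2}$ itself comes from the text.

```python
import numpy as np

def gaussian_kernel_matrix(X):
    """Kernel matrix K[i, j] = exp(-||x_i - x_j||^2) for the rows of X."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # guard against tiny negative round-off
    return np.exp(-sq_dists)

# With the points ordered inner cluster first, then the two rings (as in Fig. 2),
# plotting K as a gray-scale image shows the three dark diagonal blocks.
# Example: K = gaussian_kernel_matrix(X) for a (500, 2) array X of the Fig. 1 points.
```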
Fig. 2 Kernel matrix of the previous figure. The dots of the previous figure are numbered from 1 to 50 for those belonging to the inner cluster, from 51 to 200 for those of the first ring and from 201 to 500 for those of the second and outer ring. Values of the kernel are represented by gray levels (1 is represented by a black dot and 0 by a white one). It is striking to see the natural clusters as squares on the diagonal. The square (1,1)–(50,50) appears almost black as the dots in the inner cluster are very close to one another
In this matrix, we clearly see three main square blocks on the diagonal. The first, very dark block corresponds to the inner cluster, the second, less dark block corresponds to the first ring, and the third, lighter block corresponds to the last and outer ring. The proximity between two clusters is represented by symmetric gray rectangles delimited by the diagonal squares associated with the considered clusters. For instance, the gray rectangles⁴ (1,50)–(51,200) and (50,1)–(200,51) in Fig. 2 represent the proximity between the inner circle and the first ring.

The above example suggests that the clusters appear as dense square blocks on the diagonal of the kernel matrix when the points are arranged in a suitable way. However, the density of the kernel inside these blocks depends on the data configuration and may differ from one cluster to another. For instance, in Fig. 2 the inner cluster appears much denser than the outer ring, whereas the average distance between neighboring points inside any cluster is of the same order. Our method is based on a two-step approach:

• Gather the points in high-kernel-density clusters (Fig. 3), using the algorithm described below in Sect. 2.2.
• Build the density matrix of these dense clusters (Fig. 4) to assess their inter-cluster density, and find the connected components for a given threshold deduced from its analysis.

The density matrix is a p × p matrix where p is the number of initial clusters. It shows, on its diagonal, the density of the previously calculated dense clusters and, off the diagonal, the inter-cluster density between two different clusters.
⁴ Rectangles and squares in the matrices are described by two of their diagonally opposed vertices; these vertices are represented by pairs of integers corresponding to the line and column numbers of the matrix.
Fig. 3 The original points of Fig. 1 are gathered in clusters of 0.5 kernel density. Different colors stand for different clusters
Fig. 4 Density matrix of the clusters shown in Fig. 3. Density values are represented by gray levels (1 is represented by a black dot and 0 by a white one)
So if $d_{i,j}$ is the general term of the density matrix and k is the kernel, we have:

$$
d_{i,j} =
\begin{cases}
\dfrac{\sum_{x,y\in C_i,\; x\neq y} k(x,y)}{|C_i|^2 - |C_i|}, & \text{if } i = j,\\[2ex]
\dfrac{\sum_{x\in C_i,\; y\in C_j} k(x,y)}{|C_i|\,|C_j|}, & \text{if } i \neq j,
\end{cases}
\qquad (1)
$$
where $C_i$ and $C_j$ are clusters of the initial partition and |C| is the cardinality of the cluster C.⁵ Comparing the density matrix (Figs. 4, 5) and the kernel matrix (Fig. 6), we can see that the inner cluster, represented by the first cluster of the density matrix, corresponds to the first black square (1,1)–(50,50) on the diagonal of the kernel matrix.

⁵ In the calculation of a cluster density we do not take into account the diagonal of the kernel matrix, in order to penalize clusters reduced to one element; this is why the formula is $\frac{\sum_{x,y\in C_i,\,x\neq y} k(x,y)}{|C_i|^2-|C_i|}$ and not $\frac{\sum_{x,y\in C_i} k(x,y)}{|C_i|^2}$.
Fig. 5 The original clusters are retrieved. They correspond to the connected components of the density matrix in Fig. 4 for a connectivity threshold of 0.2
Fig. 6 The kernel matrix arranged in order to show dense square blocks of kernel density equal to or greater than 0.5 on the diagonal. The connected components for a connectivity threshold of 0.2 are the squares identified within (1,1)–(50,50), (51,51)–(200,200) and (201,201)–(500,500). The data points are grouped by neighboring clusters; as a result, we observe more dark points alongside the diagonal axis, in contrast to Fig. 2, in which the data points are distributed randomly
The proximity of the inner cluster and the first ring is shown by the shaded squares from 2 to 9 on the first line and the first column of the density matrix. They correspond to the rectangles (1,50)–(51,200) and (50,1)–(200,51) of the kernel matrix. The density matrix is thus a representation of the initial kernel matrix at a given level of density threshold.
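To make Eq. (1) concrete, here is a minimal sketch of how the density matrix could be computed from a precomputed kernel matrix and a partition given as lists of point indices; the function and variable names are ours.

```python
import numpy as np

def density_matrix(K, clusters):
    """Density matrix of Eq. (1).

    K        : (n, n) kernel matrix, K[i, j] = k(x_i, x_j).
    clusters : list of index lists, one list per cluster of the partition.
    """
    p = len(clusters)
    D = np.zeros((p, p))
    for i, Ci in enumerate(clusters):
        Ci = np.asarray(Ci)
        for j, Cj in enumerate(clusters):
            Cj = np.asarray(Cj)
            block_sum = K[np.ix_(Ci, Cj)].sum()
            if i == j:
                # Diagonal entries exclude the k(x, x) terms, hence the
                # |C|^2 - |C| denominator of Eq. (1); singletons get 0 by convention.
                n_i = len(Ci)
                D[i, j] = (block_sum - K[Ci, Ci].sum()) / (n_i * n_i - n_i) if n_i > 1 else 0.0
            else:
                D[i, j] = block_sum / (len(Ci) * len(Cj))
    return D
```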
2.2 The algorithm

For the first step of our method, the objective is to maximize the weighted intra-cluster density subject to the following constraints: the intra-cluster kernel density of each cluster is above a given threshold θ and the inter-cluster kernel density between two given clusters is below θ. Maximize

$$
\Delta(\Pi) = \sum_{C\in\Pi,\,|C|>1}\;\sum_{x\in C}\left(\frac{\sum_{y\in C,\,y\neq x} k(x,y)}{|C|-1} - \theta\right) \qquad (2)
$$
subject to:

$$
\frac{\sum_{x,y\in C,\,x\neq y} k(x,y)}{|C|^2 - |C|} \ge \theta, \qquad \forall C \in \Pi, \qquad (3)
$$

$$
\frac{\sum_{x\in C_i,\,y\in C_j} k(x,y)}{|C_i|\,|C_j|} < \theta, \qquad \forall (C_i, C_j) \in \Pi \times \Pi,\ C_i \neq C_j, \qquad (4)
$$
where Π is a partition of the set X to cluster, |C| represents the cardinality of the cluster C and θ is a density threshold which defines the homogeneity of the clusters. The constraint (3) ensures that each cluster density is greater than the threshold θ.⁶ For a non-trivial partition (the trivial partition being made of one-element clusters), Δ(Π) is strictly greater than 0. The constraint (4) ensures that the inter-cluster kernel density between two clusters is always lower than the kernel density threshold θ.

To compute these dense clusters we use Algorithms 1 and 2. Algorithm 1 allows us to find a covering clustering satisfying the feasibility condition (3) by iterating over each element, putting it in a cluster if the average value of the kernel between it and the elements of the cluster is above the density threshold θ, and creating a new cluster otherwise. Algorithm 2 allows us to maximize the objective function Δ(Π) in (2) for the given density threshold. This algorithm is based on the fact that, for a given element x belonging to a cluster $C_x$, transferring x to another cluster C changes the value of the objective function in (2) by:

$$
\begin{cases}
\theta - \dfrac{1}{|C_x|-1}\displaystyle\sum_{y\in C_x,\,y\neq x} k(x,y), & \text{if } x \text{ is transferred to a new cluster},\\[2ex]
\dfrac{1}{|C|}\displaystyle\sum_{y\in C} k(x,y) - \theta, & \text{if } C_x = \{x\},\\[2ex]
\dfrac{1}{|C|}\displaystyle\sum_{y\in C} k(x,y) - \dfrac{1}{|C_x|-1}\displaystyle\sum_{y\in C_x,\,y\neq x} k(x,y), & \text{otherwise}.
\end{cases}
\qquad (5)
$$

Algorithm 2 iteratively calculates, for each element x, the change in the value of Δ associated with each possible transfer of x to another cluster:

1. If $\frac{1}{|C_x|-1}\sum_{y\in C_x,\,y\neq x} k(x,y) < \theta$ and $\frac{1}{|C|}\sum_{y\in C} k(x,y) < \theta$ for all $C \in \Pi$, $C \neq C_x$, then x is transferred to a newly created cluster.
2. If the change in the value of Δ is positive, then x is moved into the cluster associated with the highest change.

The algorithm finishes when no more transfers are possible or when a maximum allowed number of iterations is reached.

⁶ Indeed $\sum_{x\in C}\left(\frac{\sum_{y\in C,\,y\neq x} k(x,y)}{|C|-1} - \theta\right) > 0 \Leftrightarrow \frac{\sum_{x,y\in C,\,x\neq y} k(x,y)}{|C|^2-|C|} > \theta$, and in the worst case, when the density threshold is too high, all elements are isolated in clusters reduced to one element and the value of the objective function in (2) is equal to 0.
Finally, in order to satisfy the constraint (4), we progressively join the clusters whose inter-density is higher than the density threshold θ, starting with the highest inter-density and updating the density matrix after each junction. At the end of this step, we obtain a partition of X in which each cluster is denser than θ and the inter-cluster density between two clusters is below θ. It should be noted that the θ threshold is an adjustable parameter and must be chosen as a trade-off between the complexity of the algorithm and its precision: too low a value of θ leads to too coarse a clustering, and too high a value leads to too many dense clusters. In general, we use 0.5 as the starting value of θ.

The algorithm described above might not, by itself, give the correct solution. For the second step, we need a second criterion on the proximity relation among the clusters to arrive at the final clustering. We identify the connected components of the partition for the relation: $C_i$ and $C_j$ are connected if their inter-density is greater than a given threshold that we call the connectivity threshold. This threshold is determined by the analysis of the density matrix. Similar thresholds have also been used by Fischer and Poland (2004) for their conductivity matrix, or by Ben-Hur et al. (2001) for their Gaussian kernel spread. As in these two approaches, the determination of our threshold is rather empirical but, in our case, it is guided by the density matrix. The initial clustering reduces the complexity of the analysis: the inter-cluster kernel density is high when clusters are close or adjacent, which facilitates the choice. We can also use an assignment-like algorithm to progressively link the clusters and reduce the partition size to the desired final number of connected clusters. The density matrix in Fig. 4, for instance, shows that cluster 1 is linked to clusters 2–9 because of the light gray squares (1,2)–(1,9). However, darker squares like (3,4) or (2,5) show stronger relations between clusters 3, 4, 2, and 5. It is the value corresponding to these darker regions that we used as the connectivity threshold to separate the rings correctly. This clustering has been done without any prior knowledge of the original space or the feature space.
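The following is a minimal, illustrative sketch of the two-step procedure under simplifying assumptions: a single greedy pass for the covering step, no Algorithm 2 refinement and no constraint-(4) merging. All function names are ours.

```python
import numpy as np

def covering_clusters(K, theta):
    """Greedy covering clustering (in the spirit of Algorithm 1).

    Each point joins the existing cluster with the highest average kernel
    value to its members, provided that average exceeds theta; otherwise
    it starts a new cluster. K is the precomputed kernel matrix.
    """
    clusters = []  # list of lists of point indices
    for x in range(K.shape[0]):
        gains = [K[x, C].mean() for C in clusters]
        best = int(np.argmax(gains)) if gains else -1
        if gains and gains[best] > theta:
            clusters[best].append(x)
        else:
            clusters.append([x])
    return clusters

def density_matrix(K, clusters):
    """Density matrix of Eq. (1); the diagonal ignores the k(x, x) terms."""
    p = len(clusters)
    D = np.zeros((p, p))
    for i, Ci in enumerate(clusters):
        for j, Cj in enumerate(clusters):
            block = K[np.ix_(Ci, Cj)]
            if i == j:
                n = len(Ci)
                D[i, j] = (block.sum() - np.trace(block)) / (n * n - n) if n > 1 else 0.0
            else:
                D[i, j] = block.mean()
    return D

def connected_components(D, connectivity):
    """Group initial clusters whose inter-density reaches the connectivity threshold."""
    p = D.shape[0]
    labels = [-1] * p
    current = 0
    for seed in range(p):
        if labels[seed] != -1:
            continue
        stack = [seed]
        labels[seed] = current
        while stack:
            i = stack.pop()
            for j in range(p):
                if labels[j] == -1 and D[i, j] >= connectivity:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Usage on a kernel matrix K (e.g. the Gaussian kernel matrix of the ring data):
# initial = covering_clusters(K, theta=0.5)
# D = density_matrix(K, initial)
# component_of_cluster = connected_components(D, connectivity=0.2)
# final_label = {x: component_of_cluster[i] for i, C in enumerate(initial) for x in C}
```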
2.3 Complexity
The complexity of the algorithm is $O(ln^2)$ kernel operations, where l is the number of iterations necessary to improve the initial covering clustering and n the number of objects to cluster, plus $O(n^2)$ kernel operations to compute the density matrix. This is comparable to the SMO⁷ approach (Platt, 1998), in which $O(n^2)$ kernel operations are needed to compute the radius function plus $O(dn^2)$ kernel operations to compute the connected components, where d is a constant depending on the desired precision of the connected-component computation. The spectral method described in Fischer and Poland (2004) is greedier because of the spectral decomposition of the kernel matrix and the conductivity matrix computation, which require $O(n^3)$ operations.

⁷ Sequential minimal optimization.
Algorithm 1 (initial covering clustering)

    Π = ∅
    for each x in X:
        bestGain = 0
        for each C in Π:
            gain = (1 / |C|) · Σ_{y ∈ C} k(x, y)
            if bestGain < gain:
                bestGain = gain
                bestClass = C
        if bestGain > θ:
            C = bestClass
            Π = (Π − {C}) ∪ {C ∪ {x}}          /* x is put in C */
        else:
            C = ∅                               /* a new empty cluster C is created */
            Π = Π ∪ {C ∪ {x}}                   /* x is put in C */

Algorithm 2 (refinement of the covering clustering)

    Π = Π_init
    do:
        for each x in X:
            let C_x be the cluster of x
            xDensity = (1 / (|C_x| − 1)) · Σ_{y ∈ C_x, y ≠ x} k(x, y)
            bestGain = 0
            for each C in Π, C ≠ C_x:
                gain = (1 / |C|) · Σ_{y ∈ C} k(x, y) − xDensity
                if bestGain < gain:
                    bestGain = gain
                    bestClass = C
            if bestGain > 0:
                C = bestClass
                Π = (Π − {C_x, C}) ∪ {C_x − {x}, C ∪ {x}}      /* x is moved to C */
            else if xDensity < θ:
                C = ∅                                           /* a new empty cluster C is created */
                Π = (Π − {C_x}) ∪ {C_x − {x}, C ∪ {x}}          /* x is moved to the new cluster C */
    while Π has changed and the number of iterations is below the allowed maximum

When the feature space dimension is finite, the feature space is represented by $\mathbb{R}^m$ and there exists $\phi : X \to \mathbb{R}^m$, $x \mapsto (\phi_i(x))_{1\le i\le m}$, such that

$$
k(x, y) = \sum_{i=1}^{m} \phi_i(x)\,\phi_i(y) = \langle \phi(x), \phi(y) \rangle .
$$

We then have, for all $A, B \subset X$,

$$
\sum_{x\in A}\sum_{y\in B} k(x, y) = \sum_{i=1}^{m}\left(\sum_{x\in A}\phi_i(x)\right)\left(\sum_{y\in B}\phi_i(y)\right).
$$
In such a case the complexity of the algorithm can be $O(lnpm)$, where p is the number of clusters, because the kernel sums can be computed directly in the feature space. If $pm \ll n$ then $lnpm$ will be much less than $ln^2$. In addition, when the vectors in the feature space are very sparse, the complexity is $O(lnpu) \ll O(ln^2)$, where u is the average number of non-null components of a vector in the feature space, with $u \ll m$ and $p \ll n$. When dealing with texts this characteristic is important because the algorithm can be extremely fast, allowing large quantities of documents to be processed.
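The gain from working directly in a sparse feature space can be illustrated with a small sketch: by maintaining, for each cluster, the sum of its members' feature vectors, the kernel sum $\sum_{y\in C} k(x, y)$ needed by Algorithms 1 and 2 reduces to a single sparse dot product. The sketch assumes a kernel with an explicit finite-dimensional feature map (here a plain linear kernel); the data structures and names are ours.

```python
from collections import defaultdict

# A sparse feature vector is a dict {feature_index: value}. For a linear kernel
# k(x, y) = <phi(x), phi(y)>, the sum over a cluster C of k(x, y) equals
# <phi(x), sum_{y in C} phi(y)>, which costs O(u) for a vector with u non-zeros.

def add_to_cluster_sum(cluster_sum, phi_x):
    """Update the running feature sum of a cluster when x joins it."""
    for idx, value in phi_x.items():
        cluster_sum[idx] += value

def kernel_sum(phi_x, cluster_sum):
    """Compute sum_{y in C} k(x, y) as a sparse dot product with the cluster sum."""
    return sum(value * cluster_sum.get(idx, 0.0) for idx, value in phi_x.items())

# Example with tiny bag-of-words style vectors:
docs = [{0: 1.0, 3: 2.0}, {0: 2.0, 5: 1.0}, {7: 4.0}]
cluster_sum = defaultdict(float)
for phi in docs[:2]:                      # cluster C = {docs[0], docs[1]}
    add_to_cluster_sum(cluster_sum, phi)
print(kernel_sum(docs[2], cluster_sum))   # 0.0: docs[2] shares no features with C
print(kernel_sum(docs[0], cluster_sum))   # 1*3 + 2*2 = 7.0
```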
3 Discussion

In this section, we present some of the benefits of the proposed approach, its ability to deal with difficult problems, and its consistency.

3.1 Non-convex clusters

The above example, also described in Ben-Hur et al. (2001), is a difficult problem because the natural clusters are non-convex. The kernel value for two opposed points of a ring is very low and, worse, can be much lower than the kernel value of two points belonging to different rings. In order to separate the three rings, Ben-Hur et al. (2001) also use two parameters: the kernel spread and a slack variable used to eliminate the outliers. Then, a connected-component algorithm is used to retrieve clusters in the original space. In this sense our method is comparable; we use a kernel density threshold θ of 0.5 for the initial covering clusters and a connectivity threshold of 0.2. However, by working with the density matrix of the initial covering clusters we reduce the complexity of the analysis, because the density matrix is much smaller than the kernel matrix. The data analysis, no matter what the original space is or how big the feature space is, is simpler. In the algorithm described above, the number of clusters does not have to be specified. However, it is also possible to set a number of clusters a priori,⁸ making the density matrix analysis tractable when dealing with very large datasets.

3.2 Strongly overlapping clusters

Another type of difficult problem, cited by Ben-Hur et al. (2001) and Li, Zhang, and Jiang (2004), arises when clusters strongly overlap. Consider for instance the synthetic two-component Gaussian data (Fig. 7) described in Li et al. (2004).

⁸ Using a trash cluster, for instance, in order to collect outliers.
Fig. 7 The bottom-left component consists of 800 points and the upper one of 400. The means of the two components are (0, 0) and (1, 1), and their covariance matrices are $\begin{pmatrix} 1 & -0.3 \\ -0.3 & 1 \end{pmatrix}$ and $\begin{pmatrix} 1 & 0.3 \\ 0.3 & 1 \end{pmatrix}$, respectively
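A minimal sketch of how such data could be drawn; the counts, means and covariances follow the caption of Fig. 7, while the random seed is an arbitrary assumption of ours.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary choice

# 800 points around (0, 0) with negative correlation, 400 around (1, 1) with positive correlation.
lower = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, -0.3], [-0.3, 1.0]], size=800)
upper = rng.multivariate_normal(mean=[1.0, 1.0], cov=[[1.0, 0.3], [0.3, 1.0]], size=400)

X = np.vstack((lower, upper))  # (1200, 2) array of strongly overlapping clusters
```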
Fig. 8 Covering clusters of the data set described in Fig. 7
• For a kernel density threshold of 0.5 we obtain the initial covering clusters in Fig. 8.
• The associated density matrix is shown in Fig. 9. We can see that the preliminary clusters are highly connected and separated into two groups, towards the north-west and south-east corners.
• The covering clusters are further joined in order to satisfy both constraints (3) and (4). The new clustering is shown in Fig. 10 and its density matrix in Fig. 11.
• Connecting the final clusters for a connectivity threshold of 0.4, we obtain the final results in Fig. 12. A careful observation shows that this clustering does not correspond exactly to the original one (Fig. 7) because of the overlapping; however, the difference is small and the resulting clustering is close to what our intuition would suggest.
• The kernel matrix in block-diagonal arrangement is shown in Fig. 13. The separation between the two clusters is evident.

3.3 Consistency

One common drawback of clustering methods purely based on distance is that they compare things of different natures. This is particularly obvious for qualitative and ordinal variables, which are very often arbitrarily treated as components of real vectors on which we compute meaningless distances and centroids.
Fig. 9 Density matrix of the covering clusters in Fig. 8
Fig. 10 Covering clusters of Fig. 8 joined in order to satisfy constraint (4). Each cluster has a density above 0.5 and the inter-density between two clusters is below 0.5
Fig. 11 Density matrix of the covering clusters in Fig. 10. We clearly have a choice of further aggregating these clusters
Fig. 12 Final results as connected components of density matrix of Fig. 11 with a connectivity threshold of 0.4
Fig. 13 Kernel matrix associated with the data shown in Fig. 12. The separation between the two clusters is evident
For instance, considering 0 for male and 1 for female in a Gender variable, and 0 for mammal, 1 for reptile and 2 for bird in a Biological Taxonomy variable, we can exhibit a chimera with a Gender value of 0.75 and a biological group of 1.33. In contrast, by using the kernel approach our method allows us to handle variables of different types in a consistent way, because kernels are closed under linear combination.

3.4 Independence of kernel type

The method depends on two parameters: a kernel density threshold and a connectivity threshold. These two parameters are independent of the kernel used; in fact the procedure could be applied to any positive definite matrix. This is in contrast to other methods which use the kernel structure; for instance, Ben-Hur et al. (2001) use the kernel spread. This characteristic makes the proposed method very general, as it can deal with any similarity function; for example, the Condorcet criterion presented in Marcotorchino (1981).
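As an illustration of the points made in Sects. 3.3 and 3.4 (kernels are closed under linear combination, and the procedure only needs a positive definite similarity matrix), here is a minimal sketch of a combined kernel for records mixing a quantitative and a qualitative attribute. The weights and the particular component kernels are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def mixed_kernel(a, b, w_num=0.5, w_cat=0.5, gamma=1.0):
    """Convex combination of a Gaussian kernel on the numeric part and a
    simple match (delta) kernel on the categorical part.

    a, b : (numeric_value, category_label) tuples.
    Both components are positive definite kernels, so their non-negative
    linear combination is again a positive definite kernel.
    """
    num_a, cat_a = a
    num_b, cat_b = b
    k_num = np.exp(-gamma * (num_a - num_b) ** 2)   # Gaussian kernel on the quantitative part
    k_cat = 1.0 if cat_a == cat_b else 0.0          # delta kernel on the qualitative part
    return w_num * k_num + w_cat * k_cat

# Example records: (age in years, biological group)
records = [(30.0, "mammal"), (31.0, "mammal"), (2.0, "bird")]
K = np.array([[mixed_kernel(a, b) for b in records] for a in records])
print(np.round(K, 3))  # similarities stay meaningful: no 0.75-gender chimera is ever computed
```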
4 Conclusion

We have proposed a novel clustering method based on kernels. Our method does not need the number of clusters to be specified or their shapes to be known a priori, as many other algorithms do. It has two parameters: a kernel density threshold, which defines a set of initial covering clusters, and a connectivity threshold, derived from the cluster density matrix, which allows us to group them into connected components. The cluster density matrix representation provided by our algorithm allows us to simplify the analysis and to fine-tune these two parameters by looking at its characteristic shape. Our method depends on the shape of the kernel matrix for general cluster analysis but is independent of the feature space, the original space and the type of kernel used. The experimental results illustrated above indicate that our algorithm performs as well as or better than other algorithms such as k-means, Minimum Entropy Clustering or Support Vector Clustering.

We have used this method to cluster millions of bank customer records. We have also used different variable definition schemes to take into account the difference between ordinal and qualitative variables. The results compare favorably with those of other data mining clustering tools, in terms of rapidity and of population coherence within segments. In the future, we plan to apply this algorithm to sentence and text clustering based on convolution kernels, as described by Haussler (1999). We also plan to generate the density thresholds automatically by a statistical analysis of the kernel matrix.

Acknowledgements The authors are grateful to one referee for his critical reading of the manuscript and for many thoughtful and valuable comments that helped us to improve our paper considerably.
References

Ben-Hur, A., Horn, D., Siegelmann, H., & Vapnik, V. (2001). Support vector clustering. Journal of Machine Learning Research, 2, 125–137.
Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–167.
Fischer, I., & Poland, J. (2004). New methods for spectral clustering. Technical report No. IDSIA-12-04.
Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz.
Jain, A., Murty, M., & Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31, 264–323.
Li, H., Zhang, K., & Jiang, T. (2004). Minimum entropy clustering and applications to gene expression analysis. In Proceedings of the 2004 IEEE Computational Bioinformatics Conference.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1965, University of California, 281–297.
Marcotorchino, F. (1981). Agrégation des similarités. PhD dissertation.
Michaud, P. (1985). Agrégation à la majorité: Analyse de résultats d'un vote. IBM Scientific Center of Paris, Technical report F052.
Ng, A. Y., Jordan, M., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14.
Platt, J. C. (1998). A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14.
Roberts, S. J. (1996). Parametric and non-parametric unsupervised cluster analysis. Pattern Recognition, 30(2), 261–272.