Clustering Using Elements of Information Theory

Daniel de Araújo 1,2, Adrião Dória Neto 2, Jorge Melo 2, and Allan Martins 2

1 Federal Rural University of Semi-Árido, Campus Angicos, Angicos/RN, Brasil
[email protected]
2 Federal University of Rio Grande do Norte, Department of Computer Engineering and Automation, Natal/RN, Brasil
{adriao,jdmelo,allan}@dca.ufrn.br

Abstract. This paper proposes an algorithm for clustering using an information-theoretic criterion. The cross entropy between elements in different clusters is used as a measure of the quality of the partition. The proposed algorithm uses "classical" clustering algorithms to initialize small regions (auxiliary clusters) that are then merged to construct the final clusters. The algorithm was tested on several databases with different spatial distributions.

Key words: Clustering, Cluster Analysis, Information Theoretic Learning, Complex Datasets, Entropy

1 Introduction

Clustering techniques can be applied in many fields, such as marketing, biology, pattern recognition, image segmentation and text processing. Clustering algorithms attempt to organize unlabeled data points into clusters so that samples within a cluster are "more similar" to each other than to samples in different clusters [1]. To achieve this task, several algorithms have been developed using different heuristics. Although in most clustering tasks no information about the underlying structure of the data is used during the clustering process, the majority of clustering algorithms require the number of classes as a parameter to be given a priori. Moreover, the spatial distribution of the data is another problematic issue in clustering tasks, since most algorithms are biased toward a specific cluster shape. For example, single linkage hierarchical algorithms are sensitive to noise and outliers and tend to produce elongated clusters, while the k-means algorithm yields elliptical clusters.

Incorporating spatial statistics of the data gives a good measure of the spatial distribution of the objects in a dataset. One way of doing that is to use information-theoretic elements to help the clustering process. More precisely, Information Theory involves the quantification of information in a dataset using some statistical measures. Recently, [2-4] achieved good results using elements of information theory to help clustering tasks. Based on that, this paper proposes an information-theoretic heuristic for clustering datasets. In fact, we propose an iterative two-step algorithm that tries to find the best label configuration by switching labels according to a cost function based on the cross entropy [3]. The use of statistically based measures enables the algorithm to cluster spatially complex datasets.

The paper is organized as follows: in Sect. 2 we make some considerations about the information theory elements used in the clustering algorithm; in Sect. 3 we describe the information-theoretic clustering criterion; in Sect. 4 we present the proposed clustering algorithm; Sect. 5 shows some results obtained; and in Sect. 6 conclusions and final considerations are made.

2 Information Theoretic Learning

Information Theory involves the quantification of information in a dataset using some statistical measures. The most used information-theoretic measures are entropy and its variations. Entropy is a measure of uncertainty about a stochastic event or, alternatively, it measures the amount of missing information associated with an event [5]. From the idea of entropy arose other measures of information, like mutual information [4], the Kullback-Leibler divergence [6], cross entropy [4] and joint entropy [4].

Let us consider a dataset $X = \{x_1, x_2, ..., x_n\} \in \mathbb{R}^d$ with independent and identically distributed (iid) samples. The most traditional measure of information is Shannon's entropy $H_s$, which is given by [7]:

$$H_s(x) = \sum_{k=1}^{n} p_k I(p_k) \qquad (1)$$

where $\sum_{k=1}^{n} p_k = 1$ and $p_k \geq 0$. Later, in the 1960s, Alfred Renyi proposed another measure of entropy, known as Renyi's entropy [8]:

$$H_R(x) = \frac{1}{1-\alpha} \ln \int f^{\alpha}(x)\,dx, \qquad \alpha \geq 0,\ \alpha \neq 1 \qquad (2)$$

The most used variation of the Renyi entropy is its quadratic form, where $\alpha = 2$:

$$H_R(x) = -\ln \int f_x(x)^2\,dx \qquad (3)$$

In (3), $f_x(x)$ is a probability density function (PDF). It is therefore necessary to estimate that PDF and, since we are working in a clustering task, we have no information about the underlying structure of the data. Hence, we used one of the most popular approaches to nonparametric estimation, the Parzen window estimator [9], which can be written as:

$$f(x) = \frac{1}{N} \sum_{i=1}^{N} G(x - x_i, \sigma^2) \qquad (4)$$

where $G(x, \sigma^2)$ is the multivariate Gaussian function defined as:

$$G(x, \sigma^2) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right) \qquad (5)$$

where $\Sigma$ is, in this case, the covariance matrix and $d$ is the dimension of $x$. Substituting (4) and (5) into (3), we obtain:

$$H_R(x) = -\ln \int \left(\frac{1}{N}\sum_{i=1}^{N} G(x - x_i, \sigma^2)\right)\left(\frac{1}{N}\sum_{j=1}^{N} G(x - x_j, \sigma^2)\right) dx = -\ln \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G(x_i - x_j, 2\sigma^2) \qquad (6)$$

According to [4], this quantity is known as the Information Potential, because of its similarity to the potential energy between physical particles. The Information Potential has been successfully used in several works as a distance measure or clustering criterion [3, 2, 10]. As we can notice, entropy measures the information of a single random variable. When we are interested in quantifying the interaction between two different datasets, one choice is to compute the cross entropy between them [3]. Extending the concepts of Renyi's entropy, we can formally define the cross entropy between two random variables $X = (x_i)_{i=1}^{N}$ and $Y = (y_j)_{j=1}^{M}$ as:

$$H(X; Y) = -\log \int p_x(t)\, p_y(t)\, dt = -\log \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} G(x_i - y_j, 2\sigma^2) \qquad (7)$$

The cross entropy criterion is very general and can be used in either a supervised or an unsupervised learning framework. Here we use an information-theoretic criterion based on the maximization of the cross entropy between clusters.
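For concreteness, the following NumPy sketch (an illustration added here, not taken from the original implementation) computes the Parzen-window estimators of Eqs. (6) and (7) with an isotropic Gaussian kernel; the kernel width sigma is a free parameter that would have to be chosen for each dataset.

import numpy as np

def _kernel_sum(A, B, sigma):
    """Sum over all pairs of G(a_i - b_j, 2*sigma^2), the Gaussian kernel
    appearing inside Eqs. (6) and (7) (isotropic covariance assumed)."""
    d = A.shape[1]
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    var = 2.0 * sigma ** 2                    # kernel variance 2*sigma^2
    norm = (2.0 * np.pi * var) ** (d / 2.0)   # Gaussian normalization constant
    return np.sum(np.exp(-sq_dists / (2.0 * var))) / norm

def renyi_quadratic_entropy(X, sigma=1.0):
    """Parzen-window estimate of Renyi's quadratic entropy, Eq. (6)."""
    n = len(X)
    return -np.log(_kernel_sum(X, X, sigma) / n ** 2)

def cross_entropy(X, Y, sigma=1.0):
    """Parzen-window estimate of the cross entropy H(X; Y), Eq. (7)."""
    return -np.log(_kernel_sum(X, Y, sigma) / (len(X) * len(Y)))

With these helpers, the cross entropy between two candidate clusters is a single call, e.g. cross_entropy(cluster_a, cluster_b, sigma=0.5).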

3 The Proposed Clustering Criterion

Two major issues in clustering are how to measure the similarity (or dissimilarity) between objects or clusters in the dataset and which criterion function to optimize [1]. For the first issue, the most obvious solution is to use the distance between the samples. If the distance is used, one would expect the distance between samples in the same cluster to be significantly smaller than the distance between samples in different clusters [1]. The most common distance used in clustering tasks is the Euclidean distance [1, 11, 12]. Clusters formed using this kind of measure are invariant to translation and rotation in feature space. Some applications, like gene expression analysis, rather use correlation or association coefficients to measure similarity between objects [13, 12].

Regarding the criterion function, one of the most used criteria is the sum of squared error. Clusterings of this type produce a partition containing clusters with minimum variance. However, the sum of squared error is best suited when the natural clusters form compact and well-separated clouds [1, 14]. Another usual class of algorithms are the agglomerative hierarchical algorithms. This class represents the dataset as a hierarchical tree where the root consists of one cluster containing all objects and the leaves are singleton clusters [1]. The spatial shape of the partitions produced by hierarchical clustering algorithms depends on the linkage criterion used: single linkage algorithms are sensitive to noise and outliers, while average and complete linkage produce elliptical clusters.

This paper proposes the use of cross entropy as a cost function to define the clusters of a given dataset. The objective of the algorithm is to maximize the cross entropy between all clusters. As pointed out earlier, the cross entropy is based on an entropy measure that requires the estimation of the data density, so the approach adopted in this work is the cross entropy computed with the Parzen window estimator described in Sect. 2. Using elements of information theory as a clustering criterion takes advantage of the underlying statistical information that the data carries. Indeed, the algorithm makes no assumption about the statistical distribution of the data; instead, it estimates that distribution and uses it as a measure of similarity between clusters. When the cross entropy is used in the clustering context, the relation between the groups is taken into account, expressed as the influence that one cluster has on another.
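As one concrete reading of this criterion (our sketch, not code from the paper), a candidate partition can be scored by summing the Parzen-window cross entropies of Eq. (7) over all pairs of clusters; the pairwise summation is an assumption, since the text does not spell out how the per-pair values are aggregated. The helper cross_entropy is the one sketched in Sect. 2.

from itertools import combinations
import numpy as np

def partition_cost(X, labels, sigma=1.0):
    # Sum of cross entropies over every pair of clusters induced by `labels`;
    # the proposed algorithm tries to maximize this quantity.
    clusters = [X[labels == k] for k in np.unique(labels)]
    return sum(cross_entropy(a, b, sigma) for a, b in combinations(clusters, 2))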

4 The Proposed Clustering Algorithm

The main goal of a clustering algorithm is to group the objects in the dataset, putting into the same cluster samples that are similar according to a specific clustering criterion. Iterative distance-based algorithms form an effective class of techniques for clustering tasks. They work by minimizing the total squared distance of the samples to their cluster centers. However, they have some problematic issues, such as sensitivity to the initial guesses of the cluster centers and restrictions related to the spatial distribution of the natural groups [11].


One way of using iterative distance-based algorithms efficiently is to cluster the dataset using a high number of clusters, i.e., more clusters than the intended number in the final partition, and afterwards merge those clusters to form larger and more homogeneous clusters. Many authors apply this approach in their work and obtained good results for spatially complex datasets [3, 2, 10]. We use cross entropy as a cost function, and its calculation with the Parzen window estimator involves all data points of each cluster, so the larger the cluster, the longer it takes to compute the cross entropy. Splitting the dataset into several small regions therefore addresses two issues: the small regions are usually more homogeneous and easier to cluster with algorithms that could not correctly cluster the entire dataset, like k-means, and with smaller clusters the time needed to compute the cross entropy decreases.

Based on that, the proposed clustering algorithm works as an iterative two-step procedure. First, the dataset is divided into a large number of compact small regions, named auxiliary clusters. Each region is then randomly labeled according to the specified number of clusters, not yet corresponding to the final partition labels; e.g., if we are dealing with a two-class dataset, two kinds of labels will be distributed among the auxiliary clusters, and the two clusters are composed of all regions sharing the same label. The second step consists in switching the label of each small region and checking whether this change increases the cost function. Every time the cost function increases, the change that caused the increase is kept; otherwise it is reversed. This process is repeated until there are no more changes in the labels.

The small regions found in the initial phase work as auxiliary clusters that will be used to discover the final clusters in the dataset. The task of finding the auxiliary clusters can be carried out by a "traditional" clustering algorithm like k-means, competitive neural networks or hierarchical algorithms. In our case, we used the well-known k-means algorithm [1]. The label switching process takes each auxiliary cluster and changes its label to a different one; the cost function is then calculated and, if an increase is observed, the change is recorded and the new label configuration is adopted. After all auxiliary cluster labels have been tried, the process starts again and continues until there are no new changes in any auxiliary cluster label. This can be seen as a search for the optimal label configuration of the auxiliary clusters and, consequently, for the optimal configuration of the clusters in the final partition. Due to the initial random assignment, the proposed algorithm is nondeterministic and can produce different results for different initializations. Also, the number of auxiliary clusters directly influences the overall performance.
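A minimal sketch of this two-step procedure might look as follows; it is an illustration under the stated assumptions, not the authors' implementation. It reuses the hypothetical partition_cost helper sketched in Sect. 3, creates the auxiliary clusters with scikit-learn's k-means, and greedily switches auxiliary-cluster labels while the cost increases. The initial random labelling is drawn so that every final-cluster label is used at least once (an assumption about the initialization).

import numpy as np
from sklearn.cluster import KMeans

def cross_entropy_clustering(X, n_clusters, n_aux, sigma=1.0, seed=None):
    rng = np.random.default_rng(seed)

    # Step 1: compact auxiliary clusters via k-means and a random initial
    # labelling of the auxiliary clusters.
    aux = KMeans(n_clusters=n_aux, n_init=10, random_state=seed).fit_predict(X)
    aux_labels = rng.permutation(np.arange(n_aux) % n_clusters)

    def cost(assign):
        return partition_cost(X, assign[aux], sigma)    # per-sample labels

    # Step 2: switch the label of each auxiliary cluster and keep any change
    # that increases the total cross entropy; repeat until nothing changes.
    best = cost(aux_labels)
    changed = True
    while changed:
        changed = False
        for r in range(n_aux):
            for lab in range(n_clusters):
                if lab == aux_labels[r]:
                    continue
                trial = aux_labels.copy()
                trial[r] = lab
                if len(np.unique(trial)) < n_clusters:
                    continue                            # keep all final clusters non-empty
                c = cost(trial)
                if c > best:
                    aux_labels, best, changed = trial, c, True
    return aux_labels[aux]                              # final label of every sample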

5 Experimental Results

To test the performance of the proposed clustering algorithm, we used some traditional clustering datasets with simple and complex spatial distributions.


Figure 1 illustrates all datasets. Notice that dataset A (Fig. 1a) is a classic case of two well-separated clouds and is the simplest dataset used in this paper. Dataset B (Fig. 1b) and dataset C (Fig. 1c) have more complex spatial distributions.

Fig. 1: Datasets. (a) Dataset A: two well-separated classes. (b) Dataset B: two half-moons. (c) Dataset C: two circles.

If we use the same traditional center-based technique (k-means) that we used to create the initial auxiliary clusters of our algorithm, it is able to correctly separate the clusters of only one dataset, the simplest one (dataset A). This happens because the clusters in that dataset have spherical shapes. For the rest of the tested datasets, k-means could not achieve good results. Figure 2 shows the performance of k-means on all datasets tested with our proposed algorithm.

Fig. 2: k-means clustering for all datasets.
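The original datasets are not distributed with the paper, but scikit-learn's synthetic generators produce qualitatively similar data. The short baseline below (our sketch, with assumed sample sizes and noise levels) reproduces the behaviour described above, with k-means expected to separate only the spherical clouds of dataset A.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_circles, make_moons

# Qualitatively similar stand-ins for datasets A, B and C (not the authors' data).
datasets = {
    "A (two separated clouds)": make_blobs(n_samples=400, centers=2,
                                           cluster_std=0.6, random_state=0),
    "B (two half-moons)": make_moons(n_samples=400, noise=0.05, random_state=0),
    "C (two circles)": make_circles(n_samples=400, noise=0.05, factor=0.4,
                                    random_state=0),
}

for name, (X, y_true) in datasets.items():
    y_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    # Fraction of samples on which k-means agrees with the true classes,
    # up to a possible swap of the two labels.
    agree = max(np.mean(y_kmeans == y_true), np.mean(y_kmeans == 1 - y_true))
    print(f"{name}: agreement with true classes = {agree:.2f}")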

As pointed out earlier, the number of auxiliary clusters and the number of final clusters are needed to start the process. Running some pre-experimental tests, always with the number of clusters equal to the real number of classes, we could find which number of auxiliary clusters is more suitable based on the cross entropy value. Also, since there is some randomness in the initial label assignment, we ran 10 simulations for each dataset and show here the one with the greatest cross entropy.

The results achieved using the proposed algorithm are shown in Figs. 3, 4 and 5. For each dataset the entire clustering process is shown: the first picture shows the dataset divided into auxiliary clusters, the second illustrates the initial labels randomly assigned to the auxiliary clusters, and the remaining pictures show the label switching process leading to the last picture, which represents the final partition. As we can see, the algorithm performed the correct separation of the classes in all cases. Dataset A could be correctly clustered using any number of auxiliary clusters. The other datasets, despite their spatial complexity, could be correctly clustered using the specified parameters.
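The best-of-ten selection just described could be sketched as follows, reusing the hypothetical cross_entropy_clustering and partition_cost helpers from the previous sections; X stands for one of the synthetic datasets above, n_aux = 11 follows the half-moons setting of Fig. 4, and sigma = 0.2 is an assumed kernel width not reported in the paper.

import numpy as np

# Keep the run with the highest cross-entropy cost over 10 random initializations.
X = datasets["B (two half-moons)"][0]      # from the baseline sketch above
best_labels, best_cost = None, -np.inf
for seed in range(10):
    labels = cross_entropy_clustering(X, n_clusters=2, n_aux=11,
                                      sigma=0.2, seed=seed)
    c = partition_cost(X, labels, sigma=0.2)
    if c > best_cost:
        best_labels, best_cost = labels, c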


Fig. 3: Partition achieved using five auxiliary clusters.

Fig. 4: Partition achieved using 11 auxiliary clusters.


Fig. 5: Partition achieved using 10 auxiliary clusters.

Those results are in agreement with other works using Information Theory to help the clustering process. For instance, [10] used Renyi's entropy as a clustering criterion, [15] proposed a co-clustering algorithm using mutual information, and [2] developed a clustering algorithm based on the Kullback-Leibler divergence.

6 Conclusions

In this paper we proposed a clustering algorithm that uses elements of information theory as a cost function. A two-step heuristic creates an iterative procedure to form the clusters. We tested the algorithm on datasets with simple and complex spatial distributions; when the correct number of auxiliary clusters is used, the algorithm performs perfectly. The use of statistically based measures enables the algorithm to cluster spatially complex datasets by exploiting the underlying structure of the data. It is reasonable to expect that the algorithm inherits some issues from its building blocks, namely the k-means clustering algorithm and the kernel function used to estimate the probability density. Nevertheless, considering the good results of these initial experiments, there are many variables that can be tuned to improve the capacity of the algorithm for other clustering tasks.

References

1. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley (2001)
2. Martins, A.M., Neto, A.D.D., Costa, J.D., Costa, J.A.F.: Clustering using neural networks and Kullback-Leibler divergency. In: Proc. of the IEEE International Joint Conference on Neural Networks, Volume 4 (2004) 2813-2817
3. Rao, S., de Medeiros Martins, A., Príncipe, J.C.: Mean shift: an information theoretic perspective. Pattern Recogn. Lett. 30(3) (2009) 222-230
4. Principe, J.C.: Information theoretic learning, Chapter 7. John Wiley (2000)
5. Principe, J.C., Xu, D.: Information-theoretic learning using Renyi's quadratic entropy. In: Proceedings of the First International Workshop on Independent Component Analysis and Signal Separation, Aussois (1999) 407-412
6. Kullback, S., Leibler, R.A.: On information and sufficiency. The Annals of Mathematical Statistics 22(1) (1951) 79-86
7. Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27 (1948) 379-423, 623-656
8. Cover, T.M., Thomas, J.A.: Elements of Information Theory. 2nd edn. John Wiley (1991)
9. Parzen, E.: On the estimation of a probability density function and the mode. Annals of Mathematical Statistics 33 (1962) 1065-1076
10. Gokcay, E., Principe, J.C.: Information theoretic clustering. IEEE Trans. Pattern Anal. Mach. Intell. 24(2) (2002) 158-171
11. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. 2nd edn. Morgan Kaufmann, San Francisco, CA (2005)
12. Hair, J. (ed.): Multivariate Data Analysis. 6th edn. Pearson/Prentice Hall, Upper Saddle River, NJ (2006)
13. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439) (1999) 531-537
14. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Upper Saddle River, NJ, USA (1988)
15. Dhillon, I., Mallela, S., Modha, D.: Information-theoretic co-clustering. In: Proc. of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2003)
