K-medoid-style Clustering Algorithms for Supervised Summary Generation

Nidal Zeidat
Dept. of Computer Science, University of Houston
E-mail: [email protected]

Christoph F. Eick
Dept. of Computer Science, University of Houston
E-mail: [email protected]

Abstract


This paper centers on the discussion of k-medoid-style clustering algorithms for supervised summary generation. This task requires clustering techniques that identify class-uniform clusters. This paper investigates a novel clustering technique that we term supervised clustering. Our work focuses on the generalization of k-medoid-style clustering algorithms. We investigate two supervised clustering algorithms: SRIDHCR (Single Representative Insertion/Deletion Hill Climbing with Restart) and SPAM, a variation of PAM. The solution quality and run time of these two algorithms, as well as of the traditional clustering algorithm PAM, are evaluated using a benchmark consisting of four data sets. Experiments show that supervised clustering algorithms enhance class purity by 7% to 19% over the traditional clustering algorithm PAM, and that SRIDHCR finds better solutions than SPAM.

Key Words: supervised summary generation, clustering classified examples, k-medoid clustering algorithms, data mining.

1. Introduction

This paper centers on a novel data mining technique we term supervised summary generation. The objective of supervised summary generation is the creation of class-centered summaries that recognize the patterns that are typical of one class; it also identifies how those patterns deviate from the patterns that characterize other classes. In order to create such summaries, supervised summary generation requires the recognition of class-uniform clusters that have high probability density. This paper proposes supervised clustering algorithms for this purpose. Clustering is typically applied in an unsupervised learning framework using particular error functions, e.g. an error function that minimizes the distances inside a cluster. Supervised clustering, on the other hand, deviates from traditional clustering in that it is applied to classified examples, where the objective is to identify clusters that have high probability density with respect to a single class. Moreover, in supervised clustering, we would also like to keep the number of clusters small, and objects are assigned to clusters using a notion of closeness with respect to a distance function.

[Figure 1: two scatter plots over Attribute 1 and Attribute 2; black points: Setosa, white points: Virginica. Panel a (Traditional Clustering) shows clusters A–D; panel b (Supervised Clustering) shows clusters E–H.]

Figure 1: Differences between Traditional Clustering and Supervised Clustering

Figure 1 illustrates the differences between traditional and supervised clustering. Let us assume that the black examples and the white examples in the figure represent subspecies of Iris plants named Setosa and Virginica, respectively. A traditional clustering algorithm would, very likely, identify the four clusters depicted in Figure 1.a, because traditional clustering focuses on minimizing intra-cluster dissimilarities regardless of the classes that the objects belong to. If our objective is to generate summaries for the Virginica and Setosa classes of the Iris plants, the clustering in Figure 1.a would not be very attractive, since it combines Setosa and Virginica objects in cluster A and puts examples of the Virginica class into two different clusters, B and C. A supervised clustering algorithm that maximizes class purity, on the other hand, would split cluster A into the two clusters E and F. Another characteristic of supervised clustering is that it tries to keep the number of clusters low. Consequently, clusters B and C would be merged into a single cluster G, as illustrated in Figure 1.b, without compromising class purity while reducing the number of clusters.

The remainder of this paper centers on the discussion of algorithms for supervised clustering and on the empirical evaluation of their performance with respect to speed and solution quality. Section 2 discusses related work. Section 3 describes k-medoid-style clustering. Section 4 introduces the clustering algorithms investigated. Section 5 presents experimental results, and Section 6 concludes the paper and sheds light on our future work.

2. Related Work

Although we are not aware of research that directly centers on supervised summary generation and supervised clustering, work with some similarity to ours has been done under the heading of semi-supervised clustering. Semi-supervised clustering attempts to enhance a clustering algorithm by using side information in the clustering process, which usually consists of a "small set" of classified examples. Xian [4] (and similarly [1]) takes the classified training examples, transforms them into constraints, and derives a modified distance function that minimizes the distance between points in the data set that are known to be similar with respect to these constraints, using classical numerical methods. The K-means clustering algorithm in conjunction with the modified distance function is then used to compute clusters. Klein [2] proposes a shortest-path algorithm to modify a Euclidean distance function based on prior knowledge. Demiriz [6] proposes an evolutionary clustering algorithm to obtain clusters that minimize (the sum of) cluster dispersion and impurity. Although our work shares some algorithms and techniques with the work reviewed above, it should be stated clearly that its focus is not on traditional clustering, but rather on using clustering as a preprocessing step to enhance existing classification algorithms and as a framework for summary generation.

3. K-medoid-style Algorithms for Supervised Clustering

K-medoid-style clustering aims at finding a set of k representatives among all objects in a data set that best characterize those objects. Clusters are created by assigning each object to the cluster of the representative (medoid) that is closest to that object.
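As an illustration of this assignment step, the following Python sketch (not the authors' implementation; the function name, the use of NumPy, and the toy data are our own assumptions) assigns every object to its closest medoid given a precomputed inter-object distance matrix:

```python
import numpy as np

def assign_to_medoids(dist_matrix, medoid_indices):
    """Assign each object to the cluster of its closest representative (medoid).

    dist_matrix: (n, n) matrix of pairwise inter-object distances.
    medoid_indices: indices of the k objects currently chosen as representatives.
    Returns an array of length n giving, for each object, the position (0..k-1)
    of its closest medoid in medoid_indices.
    """
    # Distances from every object to each of the k medoids: shape (n, k)
    dist_to_medoids = dist_matrix[:, medoid_indices]
    # Each object joins the cluster of its nearest medoid
    return np.argmin(dist_to_medoids, axis=1)

# Toy usage: 6 points in 2D, medoids chosen at indices 0 and 3
points = np.random.default_rng(0).random((6, 2))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
print(assign_to_medoids(dist, [0, 3]))
```

Because the assignment only reads entries of the distance matrix, no "new" distances need to be computed once the matrix is available, which is one of the practical advantages mentioned below.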

One might wonder why our work centers on developing k-medoid-style supervised clustering algorithms. The reason is that we believe medoids are quite useful for class-specific data summarization, because a medoid is the most prototypical object among the members of its cluster. Moreover, algorithms that restrict representatives to objects belonging to the data set, such as k-medoid, explore a smaller solution space than centroid-based clustering algorithms, such as k-means, which search a much larger set of possible representatives. Finally, when using k-medoid-style clustering algorithms, only an inter-object distance matrix is needed and no "new" distances have to be computed during the clustering process, as is the case with k-means.

As mentioned earlier, the fitness functions used for supervised clustering are significantly different from those used by traditional clustering algorithms. Supervised clustering evaluates a clustering based on the following two criteria:

• Class impurity, Impurity(X): the percentage of minority examples in the different clusters of a clustering X. A minority example is an example that belongs to a class different from the most frequent class in its cluster.

• Number of clusters, k: in general, we like to keep the number of clusters low; trivially, having clusters that only contain a single example is not desirable, although it maximizes class purity.

In particular, we used the following fitness function in our experimental work (lower values of q(X) indicate a better clustering solution X):

q(X) = Impurity(X) + β * Penalty(k)                              (1)

where

    Impurity(X) = (# of Minority Examples) / n

    Penalty(k) = sqrt((k − c) / n)    if k ≥ c
               = 0                    if k < c

with n being the total number of examples in the data set and c the number of classes.
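For concreteness, the sketch below computes q(X) for a given clustering. It is a minimal illustration rather than the authors' code; the function name, the choice of β = 0.1, and the toy example are our own assumptions.

```python
import math
from collections import Counter

def q(cluster_labels, class_labels, beta=0.1):
    """Fitness q(X) = Impurity(X) + beta * Penalty(k); lower values are better.

    cluster_labels: cluster id assigned to each example (the clustering X).
    class_labels:   known class of each example.
    """
    n = len(class_labels)
    clusters = set(cluster_labels)
    k = len(clusters)                      # number of clusters in X
    c = len(set(class_labels))             # number of classes in the data set

    # Impurity(X): fraction of minority examples over all clusters
    minority = 0
    for cl in clusters:
        members = [class_labels[i] for i in range(n) if cluster_labels[i] == cl]
        most_frequent_count = Counter(members).most_common(1)[0][1]
        minority += len(members) - most_frequent_count
    impurity = minority / n

    # Penalty(k): sqrt((k - c) / n) if k >= c, else 0
    penalty = math.sqrt((k - c) / n) if k >= c else 0.0

    return impurity + beta * penalty

# Toy example: 8 examples, 2 classes, clustered into 3 clusters
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
classes  = ["A", "A", "B", "B", "B", "A", "A", "A"]
print(q(clusters, classes, beta=0.1))   # impurity 1/8 plus a small penalty for k > c
```

The penalty term grows with the number of clusters beyond the number of classes, which captures the second criterion above: splitting the data into many tiny clusters would drive impurity to zero but is discouraged by Penalty(k).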
