Algorithms for Clustering High Dimensional and Distributed Data
Tao Li, Shenghuo Zhu, Mitsunori Ogihara
Computer Science Department, University of Rochester, Rochester, NY 14627-0226, USA
[email protected]
Abstract
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. The clustering problem has been widely studied in machine learning, databases, and statistics. This paper studies the problem of clustering high dimensional data. The paper proposes an algorithm called the CoFD algorithm, which is a non-distance-based clustering algorithm for high dimensional spaces. Based on the Maximum Likelihood Principle, CoFD attempts to optimize its parameter settings to maximize the likelihood of the data points under the model generated by the parameters. The distributed versions of the algorithm, called the D-CoFD algorithms, are also proposed. Experimental results on both synthetic and real data sets show the efficiency and effectiveness of the CoFD and D-CoFD algorithms. Keywords: CoFD, clustering, high dimensional, maximum likelihood, distributed.
The contact author is Tao Li. Contact information: email: [email protected]; telephone: +1-585-275-8479; fax: +1-585-273-4556.
1 Introduction
The problem of clustering data arises in many disciplines and has a wide range of applications. Intuitively, the clustering problem can be described as follows: Let W be a set of n data points in a multi-dimensional space. Find a partition of W into classes such that the points within each class are similar to each other. To measure similarity between points a distance function is used. A wide variety of functions has been used to define the distance function. The clustering problem has been studied extensively in machine learning [17, 67], databases [32, 74], and statistics [13] from various perspectives and with various approaches and focuses.

Most clustering algorithms do not work efficiently in high dimensional spaces due to the curse of dimensionality. It has been shown that in a high dimensional space, the distance between every pair of points is almost the same for a wide variety of data distributions and distance functions [14]. Many feature selection techniques have been applied to reduce the dimensionality of the space. However, as demonstrated in [1], the correlations among the dimensions are often specific to data locality; in other words, some data points are correlated with a given set of features and others are correlated with respect to different features. As pointed out in [40], all methods that overcome the dimensionality problems use a metric for measuring neighborhoods, which is often implicit and/or adaptive.

In this paper, we present a non-distance-based algorithm for clustering in high dimensional spaces called CoFD (CoFD stands for Co-learning between Feature maps and Data maps). The main idea of CoFD is as follows. Suppose that a data set W with feature set S needs to be clustered into K classes, C1, ..., CK, with the possibility of recognizing some data points to be outliers. The clustering of the data is then represented by two functions, the data map D : W → {0, 1, ..., K} and the feature map F : S → {0, 1, ..., K}, where 1, ..., K correspond to the clusters and 0 corresponds to the set of outliers. The accuracy of such a representation is measured using the log-likelihood. Then, by the Maximum Likelihood Principle, the best clustering is the representation that maximizes the likelihood. In CoFD, several approximation methods are used to optimize D and F iteratively. The CoFD algorithm can also be easily adapted to estimate the number of classes when the value K is not given as part of the input. An added bonus of CoFD is that it produces interpretable descriptions of the resulting classes since it produces an explicit feature map. Furthermore, since the data and feature maps provide natural summary and representative information about the datasets and cluster structures, CoFD can be naturally extended to cluster distributed datasets. We present its distributed version, D-CoFD.

The rest of the paper is organized as follows: Section 2 presents the CoFD algorithm. Section 3 describes the D-CoFD algorithms. Section 4 shows our experimental results on both synthetic and real data sets. Section 5 surveys the related work. Finally, our conclusions are presented in Section 6.
2 The CoFD Algorithm
This section describes CoFD and the core idea behind it. We first present the CoFD algorithm for binary data sets. Then we will show how to extend it to continuous or non-binary categorical data sets in Section 2.6.
2.1 The Model of CoFD

Suppose we wish to divide W into K classes with the possibility of declaring some data points to be outliers. Such a clustering can be represented by a pair of functions (F, D), where F : S → {0, 1, ..., K} is the feature map and D : W → {0, 1, ..., K} is the data map. Given a representation (F, D), we wish to be able to evaluate how good the representation is. To accomplish this we use the concept of positive features. Intuitively, a positive feature is one that best describes the class it is associated with. Suppose that we are dealing with a data set of animals in a zoo, where the vast majority of the animals are monkeys and the vast majority of the animals have four legs. Then, given an unidentified four-legged animal in the zoo, it is quite natural to guess that the animal is a monkey, because the conditional probability of that event is high. Therefore, we regard the feature "having four legs" as a positive (characteristic) feature of that class. In most practical cases, characteristic features of a class do not overlap with those of another class. Even if some overlaps exist, we can add combinations of those features to the feature space.

Let N be the total number of data points and let d be the number of features. Since we are assuming that the features are binary, W can be represented as an N × d matrix, which we call the data-feature matrix. Let 1 ≤ i ≤ N and 1 ≤ j ≤ d. We say that the ith data point possesses the jth feature if and only if Wij = 1. We also say that the jth feature is active in the ith point if and only if Wij = 1.

The key idea behind the CoFD algorithm is the use of the Maximum Likelihood Principle, which states that the best model is the one that has the highest likelihood of generating the observed data. We apply this principle by regarding the data-feature matrix as the observed data and the representation (D, F) as the model. Let the data map D and the feature map F be given. Consider the N × d matrix Ŵ(D, F) defined as follows: for each i, 1 ≤ i ≤ N, and each j, 1 ≤ j ≤ d, the (i, j) entry of the matrix is 1 if 1 ≤ D(i) = F(j) ≤ K and 0 otherwise. This is the model represented by F and D; the entry Ŵij(D, F) is interpreted as the consistency of D(i) and F(j). Note that Ŵij(D, F) = 0 if D(i) ≠ F(j). For all i, 1 ≤ i ≤ N, j, 1 ≤ j ≤ d, b ∈ {0, 1}, and c ∈ {0, 1}, we consider P(Wij = b | Ŵij(D, F) = c), the probability of the jth feature being observed as b in the ith data point in the real data conditioned upon its value being c in the model represented by D and F. We assume that this conditional probability depends only on the values of b and c. Let Q(b, c) denote the probability of an entry being observed as b while the corresponding entry in the model is equal to c. Also, let p(b, c) denote the proportion of pairs (i, j) such that Wij = b and Ŵij(D, F) = c. Then the log-likelihood of the model can be expressed as follows:

  log L(D, F) = log ∏_{i,j} P(Wij | Ŵij(D, F))
              = log ∏_{b,c} Q(b, c)^{dN p(b,c)}
              = dN ∑_{b,c} p(b, c) log Q(b, c)
              = −dN H(W | Ŵ(D, F))                                   (1)

By the Maximum Likelihood Principle, the best representation is

  (D̂, F̂) = arg max_{D,F} log L(D, F)                                 (2)

We apply a hill-climbing method to maximize log L(D, F), i.e., we alternately optimize one of D and F while fixing the other. First, we try to optimize F while fixing D. The problem of optimizing F over all data-feature pairs can be approximately decomposed into subproblems of optimizing each F(j) over all data-feature pairs of feature j, i.e., minimizing the conditional entropy H(Wj | Ŵj(D, F)). If D and F(j) are given, these entropies can be estimated directly. Therefore, F(j) is assigned to the class in which the conditional entropy of Wj is the smallest. Optimizing D while fixing F is the dual problem; hence we only need to minimize H(Wi | Ŵi(D, F)) for each i. A straightforward approximation method for minimizing H(Wj | Ŵj(D, F)) is to assign F(j) to arg max_k ||{i | Wij = 1 ∧ D(i) = k}||. This approximation method is also applicable to the minimization of H(Wi | Ŵi(D, F)), where we assign D(i) to arg max_k ||{j | Wij = 1 ∧ F(j) = k}||. A direct improvement of the approximation is possible by using the idea of the entropy-constrained vector quantizer [21] and assigning F(j) and D(i) to arg max_k (||{i | Wij = 1 ∧ D(i) = k}|| + log(d(k)/d)) and arg max_k (||{j | Wij = 1 ∧ F(j) = k}|| + log(N(k)/N)), respectively, where d(k) and N(k) are the number of features and the number of points in class k.
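To make Equation (1) concrete, the following sketch evaluates the log-likelihood of a candidate pair (D, F) on a binary data-feature matrix, with Q(b, c) replaced by its maximum-likelihood estimate. The function name and the NumPy-based representation are our own illustration, not the authors' implementation.

import numpy as np

def log_likelihood(W, D, F):
    # W: N x d binary data-feature matrix; D: length-N data map; F: length-d feature map
    # (labels 1..K, with 0 reserved for outliers).
    # Model matrix: What[i, j] = 1 iff 1 <= D[i] = F[j] <= K.
    What = (D[:, None] == F[None, :]) & (D[:, None] > 0)
    ll = 0.0
    for b in (0, 1):
        for c in (False, True):
            n_bc = np.sum((W == b) & (What == c))   # entries observed as b with model value c
            n_c = np.sum(What == c)                 # entries with model value c
            if n_bc > 0:
                ll += n_bc * np.log(n_bc / n_c)     # dN * p(b,c) * log Q(b,c), Q estimated from the data
    return ll                                       # equals -dN * H(W | What(D, F))

A higher value indicates a more consistent pair of maps; CoFD drives this quantity up through the alternating assignments described above rather than by direct global optimization.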
2.2 Algorithm Description

There are two auxiliary procedures in the algorithm: EstimateFeatureMap estimates the feature map from the data map, and EstimateDataMap estimates the data map from the feature map. Chi-square tests are used to decide whether a feature is an outlier or not. The main procedure, CoFD, attempts to find a best clustering by an iterative process similar to the EM algorithm. The algorithm iteratively estimates the data and feature maps based on the estimates made in the previous round, until no more changes occur in the feature map. The pseudo-code of the algorithm is shown in Figure 1. It can be observed from the pseudo-code that the time complexities of both EstimateFeatureMap and EstimateDataMap are O(K · N · d). The number of iterations in CoFD is not related to N or d.
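As a concrete illustration of the procedures just described (omitting the chi-square outlier test and the refinements of Section 2.4), a minimal Python sketch of the two estimation steps and the main loop could look as follows; the names and the stopping rule follow the description above, but this is our simplified rendering rather than the authors' code.

import numpy as np

def estimate_feature_map(W, D, K):
    # Assign each feature j to the class k maximizing |{i : W[i, j] = 1 and D[i] = k}|.
    F = np.zeros(W.shape[1], dtype=int)
    for j in range(W.shape[1]):
        counts = np.bincount(D[W[:, j] == 1], minlength=K + 1)[1:]   # ignore label 0 (outliers)
        F[j] = counts.argmax() + 1 if counts.sum() > 0 else 0
    return F

def estimate_data_map(W, F, K):
    # Dual step: assign each point i to the class k maximizing |{j : W[i, j] = 1 and F[j] = k}|.
    D = np.zeros(W.shape[0], dtype=int)
    for i in range(W.shape[0]):
        counts = np.bincount(F[W[i, :] == 1], minlength=K + 1)[1:]
        D[i] = counts.argmax() + 1 if counts.sum() > 0 else 0
    return D

def cofd(W, K, seed_map, max_iter=100):
    # seed_map: initial data map with a few seed points labeled 1..K and 0 elsewhere.
    D, F_prev = seed_map.copy(), None
    for _ in range(max_iter):
        F = estimate_feature_map(W, D, K)
        D = estimate_data_map(W, F, K)
        if F_prev is not None and np.array_equal(F, F_prev):
            break                                   # stop once the feature map no longer changes
        F_prev = F
    return D, F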
2.3 An Example

To illustrate how CoFD works, an example is given in this section. Suppose we have the data set presented in Table 1. Initially, the data points 2 and 5 are chosen as seed points, where data point 2 is in class A and data point 5 is in class B, i.e., D(2) = A and D(5) = B. EstimateFeatureMap returns that the features a, b and c are positive in class A, that the features e and f are positive in class B, and that features d and g are outliers. In other words, F(a) = F(b) = F(c) = A, F(e) = F(f) = B, and F(d) = F(g) = 0. Then EstimateDataMap assigns the data points 1, 2 and 3 to class A and the data points 4, 5 and 6 to class B. Then EstimateFeatureMap asserts that the features a, b and c are positive in class A, the features d, e and f are positive in class B, and the feature g is an outlier. In the next iteration, the result does not change, so the algorithm stops. The resulting clusters are: class A for the data points 1, 2 and 3, and class B for the data points 4, 5 and 6.
2.4 Refining Methods

Clustering results are sensitive to the initial seed points. Randomly chosen seed points may cause the search to become trapped in local minima. We use a refining procedure whose idea is to use conditional entropy to measure the similarity between a pair of clustering results. CoFD attempts to find the best clustering result, namely the one having the smallest average conditional entropy against all others.

Clustering a large data set may be time consuming. To speed up the algorithm, we focus on reducing the number of iterations. A small set of data points, for example 1% of the entire data set, may be selected as the bootstrap data set. First, the clustering algorithm is executed on the bootstrap data set. Then, the clustering algorithm is run on the entire data set using the data map obtained from clustering the bootstrap data set (instead of using randomly generated seed points).

CoFD can be easily adapted to estimate the number of classes instead of using K as an input parameter. We observed that the best number of classes yields the smallest average conditional entropy between clustering results obtained from different random seed point sets. Based on this observation, the algorithm for guessing the number of clusters is described in Figure 3.
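A hedged sketch of this selection procedure is given below; run_cofd stands for any routine that returns a data map for a given K (it is a placeholder, not a function defined in the paper), and the conditional entropy is computed between pairs of clusterings obtained from different random seeds.

import math
from collections import Counter

def conditional_entropy(labels_a, labels_b):
    # H(A | B): information about clustering A that remains unknown given clustering B.
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    marg_b = Counter(labels_b)
    return -sum((c / n) * math.log(c / marg_b[b]) for (_, b), c in joint.items())

def guess_num_classes(W, candidate_ks, run_cofd, num_runs=5):
    # Score each candidate K by the average pairwise conditional entropy between
    # clusterings produced from different random seed sets; smaller is better.
    best_k, best_score = None, float("inf")
    for k in candidate_ks:
        runs = [run_cofd(W, k) for _ in range(num_runs)]
        pairs = [(a, b) for idx, a in enumerate(runs) for b in runs[idx + 1:]]
        avg = sum(conditional_entropy(a, b) for a, b in pairs) / len(pairs)
        if avg < best_score:
            best_k, best_score = k, avg
    return best_k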
2.5 Using Graph Partitioning to Initialize the Feature Map

In this section, we describe the method of using graph partitioning to initialize the feature map. For binary datasets, we can find the frequent associations between features and build the association graph of the features. Then we use a graph partitioning algorithm to partition the features into several parts, i.e., to induce the initial feature map.

2.5.1 Feature Association

The problem of finding all frequent associations among attributes in categorical ("basket") databases [3], called association mining, is one of the most fundamental and most popular problems in data mining. Formally, association mining is the problem of, given a database D each of whose transactions is a subset of a universe I (the set of all items) and a threshold value σ, 0 < σ < 1 (the so-called minimum support), enumerating all nonempty S ⊆ I such that the proportion of transactions in D containing all the elements of S is at least σ, that is, ||{T | T ∈ D ∧ S ⊆ T}|| ≥ σ ||D||. Such a set S is called a frequent itemset. (There are variations in which the task is to enumerate all frequent itemsets having certain properties, e.g., [10, 22, 29].) There are many practical algorithms for this problem in various settings, e.g., [4, 5, 20, 34, 38, 43, 73] (see the survey by Hipp et al. [41]), and, using current commodity machines, these practical algorithms can be made to run reasonably quickly.

2.5.2 Graph Partitioning
After finding all the frequent associations (in our experiments, we use the Apriori algorithm with a minimum support of 0.05), we build the association graph. In the association graph, the nodes represent the features and the edge weights are the supports of the associations. We then use graph partitioning to obtain an initial feature map. Graph partitioning divides the nodes of a graph into several parts such that the total weight of the edges connecting nodes in different parts is minimized. The partitioning of the association graph therefore divides the features into several subsets so that the connections between the subsets are minimized. In our experiments, we apply the Metis algorithms [50] to obtain an initial feature map.

Using graph partitioning to initialize the feature map enables our algorithm to explicitly consider the correlations between features. In general, the feature space is much smaller than the data space, so graph partitioning on the feature space is very efficient.
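The association-graph construction can be sketched as follows; this builds only the weighted adjacency matrix over features (restricted here to pairwise associations), and the actual partitioning is delegated to a K-way graph partitioner such as Metis, which we do not reproduce. The function name and the pairwise restriction are our simplifications.

import numpy as np

def feature_association_graph(W, min_support=0.05):
    # W: N x d binary data-feature matrix.  The weight of the edge between features
    # j and j' is the support of the pair, i.e., the fraction of points possessing both;
    # edges below the minimum support threshold are dropped.
    N = W.shape[0]
    support = (W.T @ W) / N          # d x d matrix of pairwise supports
    np.fill_diagonal(support, 0.0)   # no self-loops
    support[support < min_support] = 0.0
    return support                   # weighted adjacency matrix for the graph partitioner

The part index returned by the partitioner for each feature can then serve directly as its initial value F(j).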
2.6 Extending to Non-binary Data Sets

In order to handle non-binary data sets, we first translate raw attribute values of data points into binary feature spaces. The translation scheme in [68] can be used to discretize categorical and continuous attributes. However, this type of translation is vulnerable to outliers that may drastically skew the range. In our algorithm, we use two other discretization methods.

The first combines the Equal Frequency Intervals method with the idea of CMAC [6]. Given m instances, the method divides each dimension into k bins, with each bin containing m/k + δ adjacent values. (The parameter δ can be a constant for all bins or different constants for different bins depending on the distribution density of the dimension; in our experiments, we set δ = ⌊m/(5k)⌋.) In other words, with the new method each dimension is divided into several overlapping segments, and the size of the overlap is determined by δ. An attribute is then translated into a binary sequence whose bit-length equals the number of overlapping segments, where each bit indicates whether the attribute value belongs to the corresponding segment.

We can also use Gaussian mixture models to fit each attribute of the original data sets, since most of them are generated from Gaussian mixture models. The number of Gaussian distributions n can be obtained by maximizing the Bayesian information criterion of the mixture model. Each value is then translated into n feature values in the binary feature space: the jth feature value is 1 if the probability that the value of the data point is generated from the jth Gaussian distribution is the largest.
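A minimal sketch of the second (Gaussian mixture) translation, assuming scikit-learn is available, is shown below; the cap on the number of components is an arbitrary choice for the sketch, and scikit-learn's bic() is defined so that lower values are better, which corresponds to the criterion described above.

import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_binarize_column(x, max_components=5):
    # Fit mixtures with 1..max_components Gaussians to one attribute, keep the model
    # preferred by BIC, and one-hot encode each value by its most likely component.
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    models = [GaussianMixture(n_components=n, random_state=0).fit(x)
              for n in range(1, max_components + 1)]
    best = min(models, key=lambda m: m.bic(x))
    labels = best.predict(x)                          # index of the most likely Gaussian
    binary = np.zeros((len(x), best.n_components), dtype=int)
    binary[np.arange(len(x)), labels] = 1             # jth feature is 1 for the winning component
    return binary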
3 Distributed Clustering
In this section, we first briefly review the motivation for distributed clustering and then give the D-CoFD algorithms for distributed clustering.
3.1 Motivation for Distributed Clustering

Over the years, data set sizes have grown rapidly with the advances in technology, the ever-increasing computing power and computer storage capacity, the permeation of the Internet into daily life, and increasingly automated business, manufacturing and scientific processes. Moreover, many of these data sets are, by nature, geographically distributed across multiple sites. For example, the huge number of sales records of hundreds of chain stores are stored at different locations. To cluster such large and distributed data sets, it is important to investigate efficient distributed algorithms that reduce the communication overhead, central storage requirements, and computation times. With the high scalability of distributed systems and the easy partitioning and distribution of a centralized dataset, distributed clustering algorithms can also bring the resources of multiple machines to bear on a given problem as data sizes scale up.
3.2 D-CoFD Algorithms

In a distributed environment, data sites may be homogeneous, i.e., different sites containing data for exactly the same set of features, or heterogeneous, i.e., different sites storing data for different sets of features, possibly with some common features among sites. As mentioned above, the data and feature maps provide natural summary information about the dataset and the cluster structures. Based on this observation, we present extensions of the CoFD algorithm for clustering homogeneous and heterogeneous datasets.

3.2.1 Clustering Homogeneous Datasets
We first present an extension of the CoFD algorithm, D-CoFD-Hom, for clustering homogeneous datasets. In this paper, we assume that the data sets have similar distributions and hence the number of classes at each site is usually the same.

Algorithm D-CoFD-Hom
begin
1. Each site sends a sample of its data to the central site;
2. The central site decides the parameters for data transformation (e.g., the number of mixture components for each dimension) and broadcasts them to each site;
3. Each site preprocesses its data, performs CoFD, and obtains its data map Di and its corresponding feature map Fi;
4. Each site sends its feature map Fi to the central site;
5. The central site decides the global feature map F and broadcasts it to each site;
6. Each site estimates its data map Di from the global feature map F.
end

The task of estimating the global feature map at the central site can be described as follows. First we need to identify the correspondence between the classes of each pair of sites. If we view each class as a collection of features (the feature indices are the same among all sites), we need to find a mapping between the classes of two sites that maximizes the mutual information between the two sets of features. This can be reduced to a maximum bipartite matching problem, which can be solved in polynomial time with matching algorithms such as [42]. In our experiments, however, the number of classes is relatively small, so we used brute-force search to find the mapping between the classes of two sites, as sketched below. Once the best correspondence has been found for each pair of sites, the global feature map can be established using EstimateFeatureMap, i.e., by assigning each feature to a class.
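For a small number of classes, the brute-force correspondence search can be sketched as follows; the use of total agreement in the confusion matrix as the matching score is our simplification of the mutual-information criterion, and the function name is ours.

from itertools import permutations
import numpy as np

def best_class_correspondence(fa, fb, K):
    # fa, fb: feature maps from two sites over the same feature indices
    # (labels 1..K, 0 for outliers).  confusion[p, q] counts features positive in
    # class p+1 at the first site and class q+1 at the second site.
    fa, fb = np.asarray(fa), np.asarray(fb)
    confusion = np.array([[np.sum((fa == p) & (fb == q))
                           for q in range(1, K + 1)]
                          for p in range(1, K + 1)])
    best_perm, best_score = None, -1
    for perm in permutations(range(K)):       # perm[p] = class at the second site matched to p
        score = sum(confusion[p, perm[p]] for p in range(K))
        if score > best_score:
            best_perm, best_score = perm, score
    return {p + 1: best_perm[p] + 1 for p in range(K)}   # class at site A -> class at site B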
3.2.2 Clustering Heterogeneous Datasets

Here we present another extension of the CoFD algorithm, D-CoFD-Het, for clustering heterogeneous datasets. We assume that each site contains relational data and that the row indices are used as the key linking the rows from different sites. The duality of the feature map and the data map in CoFD makes it natural to extend the algorithm to clustering heterogeneous datasets.

Algorithm D-CoFD-Het
begin
1. Each site preprocesses its data, performs CoFD, and obtains its data map Di and its corresponding feature map Fi;
2. Each site sends Di to the central site;
3. The central site decides the global data map D and broadcasts it to each site, using the same approach as for estimating the global feature map;
4. Each site estimates its feature map Fi from the global data map D.
end
4 Experimental Results
4.1 Experiments with the CoFD Algorithm

There are many ways to measure how accurately CoFD performs. One is the confusion matrix, which is described in [1]. Entry (o, i) of a confusion matrix is the number of data points assigned to output class o and generated from input class i. The input map I is a map of the data points to the input classes, so the information of the input map can be measured by the entropy H(I). The goal of clustering is to find an output map O that recovers this information. Thus, the conditional entropy H(I|O) is interpreted as the information of the input map given the output map O, i.e., the portion of information which is not recovered by the clustering algorithm. Therefore, the recovering rate of a clustering algorithm, defined as 1 − H(I|O)/H(I) = MI(I, O)/H(I), where MI(I, O) is the mutual information between I and O, can also be used as a performance measure for clustering.

The purity [75], which measures the extent to which each cluster contains data points from primarily one class, is also a good metric for cluster quality. The purity of a clustering solution is obtained as a weighted sum of individual cluster purities and is given by

  Purity = ∑_{i=1}^{K} (n_i / n) P(S_i),   P(S_i) = (1 / n_i) max_j n_i^j,

where S_i is a particular cluster of size n_i, n_i^j is the number of documents of the j-th input class that were assigned to the i-th cluster, K is the number of clusters, n is the total number of points, and P(S_i) is also called the individual cluster purity. In general, the larger the value of purity, the better the clustering solution. For a perfect clustering solution, the purity is 1.

All the experiments are performed on a Sun Ultra 5/333 machine with 128 MB of memory, running SunOS 5.7.
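Both measures can be computed directly from the input and output labels; a minimal sketch (our own helper functions, not part of the original experimental code) follows.

import math
from collections import Counter

def recovering_rate(input_labels, output_labels):
    # 1 - H(I|O)/H(I): the fraction of input-class information recovered by the clustering.
    n = len(input_labels)
    h_i = -sum((c / n) * math.log(c / n) for c in Counter(input_labels).values())
    joint = Counter(zip(input_labels, output_labels))
    out = Counter(output_labels)
    h_i_given_o = -sum((c / n) * math.log(c / out[o]) for (_, o), c in joint.items())
    return 1.0 - h_i_given_o / h_i

def purity(input_labels, output_labels):
    # Weighted sum of individual cluster purities: (1/n) * sum over clusters of the
    # size of the largest input class inside each cluster.
    n = len(input_labels)
    clusters = {}
    for i_label, o_label in zip(input_labels, output_labels):
        clusters.setdefault(o_label, Counter())[i_label] += 1
    return sum(max(counts.values()) for counts in clusters.values()) / n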
4.1.1 A Binary Synthetic Data Set

First we generate a binary synthetic data set W to evaluate our algorithm. N = 400 points are drawn from K = 5 clusters. Each point has d = 200 binary features. Each cluster has l = 35 positive features and 140 negative features. The positive features of a point have a probability of 0.8 of being 1 and 0.2 of being 0; the negative features of a point have a probability of 0.8 of being 0 and 0.2 of being 1; the remaining features have a probability of 0.5 of being 0 and 0.5 of being 1. Table 2 shows the confusion matrix of this experiment. In this experiment, p(D|1) = 83/83 = 1, p(E|2) = 1, p(C|3) = 1, p(A|4) = 1, p(B|5) = 1, and the rest are zeros. So H(I|O) = −∑ p(i|o) log p(i|o) = 0, i.e., the recovering rate of the algorithm is 1 and all data points are correctly recovered.
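For reference, a data set with the characteristics just described can be generated as follows; this is a sketch under the stated parameters, not the authors' original generator, and the random seed and generator choices are ours.

import numpy as np

def generate_binary_synthetic(N=400, K=5, d=200, n_pos=35, n_neg=140, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(1, K + 1, size=N)            # input class of each point
    W = (rng.random((N, d)) < 0.5).astype(int)         # remaining features: 1 with probability 0.5
    for k in range(1, K + 1):
        feats = rng.permutation(d)
        pos, neg = feats[:n_pos], feats[n_pos:n_pos + n_neg]
        members = labels == k
        W[np.ix_(members, pos)] = (rng.random((int(members.sum()), n_pos)) < 0.8).astype(int)
        W[np.ix_(members, neg)] = (rng.random((int(members.sum()), n_neg)) < 0.2).astype(int)
    return W, labels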
4.1.2 A Continuous Synthetic Data Set

In this experiment we attempt to cluster a continuous data set. We use the method described in [1] to generate a data set W1. W1 has N = 100,000 data points in a 20-dimensional space, with K = 5. All input classes are generated in some 7-dimensional subspace. Five percent of the data points are chosen to be outliers, which are distributed uniformly at random throughout the entire space. Using the second translation method described in Section 2.6, we map all the data points into a binary space with 41 features.

Then, 1000 data points are randomly chosen as the bootstrap data set. By running the CoFD algorithm on the bootstrap data set, we obtain a clustering of the bootstrap data set. Using the bootstrap data set as the seed points, we run the algorithm on the entire data set. Table 3 shows the confusion matrix of this experiment. About 99.65% of the data points are recovered. The conditional entropy H(I|O) is 0.0226 while the input entropy H(I) is 1.72. The recovering rate is thus 1 − H(I|O)/H(I) = 0.987. We make a rough comparison with the result reported in [1]. From the confusion matrix reported in that paper, their recovering rate is calculated as 0.927, which seems to indicate that our algorithm is better than theirs in terms of recovering rate.

4.1.3 Zoo Database
We also evaluate the performance of the CoFD algorithm on the zoo database available at the UC Irvine Machine Learning Repository. The database contains 100 animals, each of which has 15 boolean attributes and 1 categorical attribute. (The original data set has 101 data points, but one animal, "frog," appears twice, so we eliminated one of them; we also eliminated the two attributes "animal name" and "type.") We translate each boolean attribute into two features, indicating whether the feature is active and whether it is inactive. We translate the numeric attribute, "legs," into six features, which correspond to 0, 2, 4, 5, 6 and 8 legs, respectively.

Table 4 shows the confusion matrix of this experiment. The conditional entropy H(I|O) is 0.317 while the input entropy H(I) is 1.64. The recovering rate of this algorithm is 1 − H(I|O)/H(I) = 0.807. In the confusion matrix, we found that the clusters with a large number of animals are likely to be correctly clustered (for example, input cluster no. 1 is mapped into cluster C, input cluster no. 2 into cluster B, and so on).

CoFD comes with an important by-product: the resulting classes can be easily described in terms of features, since the algorithm produces an explicit feature map. For example, the positive features of class A are "no eggs," "no backbone," "venomous," and "eight legs"; the positive features of class B are "feather," "airborne," and "two legs"; the positive features of class C are "domestic" and "catsize"; the positive feature of class D is "no legs"; the positive features of class E are "aquatic," "no breathes," "fins" and "five legs"; the positive features of class F are "six legs" and "no tail"; and the positive feature of class G is "four legs." Hence, class B can be described as animals having feathers and two legs and being airborne, which are the representative features of the birds. Class E can be described as animals that are aquatic, do not breathe, and have fins and five legs. (The animals "dolphin" and "porpoise" are in input class 1 but were clustered into class E, because their attributes "aquatic" and "fins" make them more like the animals in class E than their attribute "milk" does for class C.)
4.2 Scalability Results

In this subsection we present the computational scalability results for our algorithm. The results are averaged over five runs in each case to eliminate randomness effects. We report the scalability results in terms of the number of points and the dimensionality of the space.

Number of points: The data sets we test have 200 features (dimensions). They all contain 5 clusters, and each cluster has an average of 35 features. Figure 4 and Figure 5 show the scalability of the algorithm in terms of the number of data points. The value of the y coordinate in Figure 4 is the running time in seconds. We implemented the algorithm using Octave 2.1.34 on Linux 2.4.10; if it were implemented in C, the running time could be reduced considerably. Figure 4 contains two curves: one is the scalability result with bootstrap and one without. It can be seen from Figure 4 that with bootstrap, our algorithm scales sub-linearly with the number of points, and without bootstrap, it scales linearly with the number of points. The value of the y coordinate in Figure 5 is the ratio of the running time to the number of points. The linear and sub-linear scalability properties can also be observed in Figure 5.

Dimensionality of the space: The data sets we test have 10,000 data points. They all contain 5 clusters, and the average number of features (dimensions) for each cluster is about 1/6 of the total number of features (dimensions). Figure 6 and Figure 7 show the scalability of the algorithm in terms of the dimensionality of the space. The value of the y coordinate in Figure 6 is the running time in seconds. It can be seen from Figure 6 that our algorithm scales linearly with the dimensionality of the space. The value of the y coordinate in Figure 7 is the ratio of the running time to the number of features.
4.3 Document Clustering Using CoFD

In this section, we apply our CoFD algorithm to cluster documents and compare its performance with other standard clustering algorithms. Document clustering has been used as a means for improving document retrieval performance. In our experiments, documents are represented using the binary vector-space model, where each document is a binary vector in the term space and each element of the vector indicates the presence of the corresponding term.

4.3.1 Document Datasets

Several datasets are used in our experiments:

URCS: The URCS dataset is the collection of technical reports published in the year 2002 by the computer science department at the University of Rochester (the dataset can be downloaded from http://www.cs.rochester.edu/u/taoli/data). There are 3 main research areas at URCS: AI (Artificial Intelligence), Systems and Theory, while AI has been loosely divided into two sub-areas: NLP (Natural Language Processing) and Robotics/Vision. Hence the dataset contains 476 abstracts that are divided into 4 different research areas.

WebKB: The WebKB dataset contains webpages gathered from university computer science departments. There are about 8300 documents and they are divided into 7 categories: student, faculty, staff, course, project, department and other. The raw text is about 27MB. Among these 7 categories, student, faculty, course and project are the four most populous entity-representing categories; the associated subset is typically called WebKB4. In this paper, we did experiments on both the 7-category and 4-category datasets.

Reuters: The Reuters-21578 Text Categorization Test collection contains documents collected from the Reuters newswire in 1987. It is a standard text categorization benchmark and contains 135 categories. In our experiments, we use a subset of the data collection which includes the 10 most frequent categories among the 135 topics; we call it Reuters-top 10.

K-dataset: The K-dataset is from the WebACE project [35] and was used in [15] for document clustering. The K-dataset contains 2340 documents consisting of news articles from the Reuters news service via the Web in October 1997. These documents are divided into 20 classes.

The datasets and their characteristics are summarized in Table 5. To preprocess the datasets, we remove the stop words using a standard stop list and perform stemming using a Porter stemmer; all HTML tags are skipped and all header fields except subject and organization of the posted article are ignored. In all our experiments, we first select the top 1000 words by mutual information with class labels. The feature selection is done with the rainbow package [56].
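A rough scikit-learn equivalent of this preprocessing (standing in for the rainbow package, and omitting the stemming step) is sketched below; the vectorizer settings are illustrative assumptions rather than the exact configuration used in the paper.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def binary_term_matrix(documents, labels, k=1000):
    # Binary vector-space model with English stop words removed, keeping the
    # top-k terms by mutual information with the class labels.
    vectorizer = CountVectorizer(stop_words="english", binary=True)
    X = vectorizer.fit_transform(documents)
    selector = SelectKBest(mutual_info_classif, k=min(k, X.shape[1]))
    return selector.fit_transform(X, labels), vectorizer, selector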
4.3.2 Document Clustering Comparisons

Document clustering methods can be mainly categorized into two types: partitioning and hierarchical clustering [69]. Partitioning methods decompose a collection of documents into a given number of disjoint clusters which are optimal in terms of some predefined criterion functions. For example, the traditional K-means method tries to minimize the sum-of-squared-errors criterion function. The criterion functions of adaptive K-means approaches usually take more factors into consideration and hence are more complicated. There are many different hierarchical clustering algorithms with different policies on combining clusters, such as single-linkage, complete-linkage and the UPGMA method [45]. Single-linkage and complete-linkage use the minimum and the maximum distance between the two clusters, respectively, while UPGMA (Unweighted Pair Group Method with Arithmetic mean) uses the distance of the cluster centers to define the similarity of two clusters for aggregating.

In our experiments, we compare the performance of CoFD on these datasets with K-means, partitioning methods with various adaptive criterion functions, and hierarchical clustering algorithms using single-linkage, complete-linkage and UPGMA aggregating measures. For the partitioning and hierarchical algorithms, we use the CLUTO clustering package described in [75, 76]. The comparisons are shown in Table 6. Each entry is the purity of the corresponding column algorithm on the row dataset. P1, P2, P3 and P4 are the partitioning algorithms with the different adaptive criterion functions shown in Table 7. The Slink, Clink and UPGMA columns are the hierarchical clustering algorithms using the single-linkage, complete-linkage and UPGMA aggregating policies. In our experiments, we use the cosine measure of two document vectors as their similarity.

From Table 6, we observe that CoFD achieves the best performance on the URCS and Reuters-top 10 datasets. On all other datasets, the results obtained by CoFD are close to the best results. On WebKB4, K-means has the best performance of 0.647 and CoFD gives the result of 0.638. P2 achieves the best result of 0.505 on WebKB and P1 gives the highest performance of 0.762 on K-dataset; the results obtained by CoFD on WebKB and K-dataset are 0.501 and 0.754, respectively. Figure 8 shows the graphical comparison. The comparison shows that, although there is no single winner on all the datasets, CoFD is a viable and competitive algorithm in the document clustering domain.

To get a better understanding of CoFD, Table 8 gives the confusion matrices built from the clustering results on the URCS dataset. The columns of the confusion matrix are NLP, Robotics/Vision, Systems and Theory, respectively. The result shows that Systems and Theory are much different from each other, and different from NLP and Robotics/Vision; NLP and Robotics/Vision are similar to each other; and AI is more similar to Systems than Robotics/Vision is.
4.4 Experimental Results of the D-CoFD Algorithms

This subsection presents the experimental results of the D-CoFD algorithms for distributed clustering.
4.4.1 An Artificial Experiment

We generate the test data set W2 using the algorithm in [59]. W2 has N = 10,000 data points with K = 5, generated in a 35-dimensional space that includes noise dimensions. The first translation method described in Section 2.6 is used for preprocessing. We first apply the CoFD algorithm on the whole dataset, for which the recovering rate was 1, i.e., all data points are correctly recovered.

We then horizontally partition the dataset into five subsets of equal size and apply the D-CoFD-Hom algorithm. An important step of the D-CoFD-Hom algorithm is to identify the correspondence between the classes of one site and those of another for each pair of sites. We use a brute-force approach to find the mapping that maximizes the mutual information between them. For example, given the feature maps FA and FB from site A and site B respectively, we want to establish the correspondence between the classes at site A and those at site B. We first construct the confusion matrix of the two feature maps, where entry (p, q) is the number of features that are positive in both class p at site A and class q at site B. From this confusion matrix, we can easily derive the correspondence. Table 9 gives a detailed example of a confusion matrix and its consequent correspondence. Overall, all the points are again correctly clustered, and thus the recovering rate was 1.

In another experiment, we vertically partition the dataset into three subsets, each containing all the data points but only a subset of the features. As in the homogeneous case, to identify the correspondence between the sets of classes, we use brute-force search to find the mapping that maximizes the mutual information between them. For example, given the data maps DA and DB of site A and site B, we want to establish the correspondence between sites A and B. We first construct the confusion matrix of the two data maps, where entry (p, q) is the number of points that are clustered into both class p at site A and class q at site B; the confusion matrix of the data maps is built analogously to that of the feature maps in the homogeneous case. Then, from the confusion matrix of the data maps, we can easily derive the mapping. Table 10 presents the confusion matrix and its correspondence. The outcome of the D-CoFD-Het algorithm is shown in Table 11, and the recovering rate in this case was 0.982.

The above experiments demonstrate that the results of distributed clustering are very close to those of centralized clustering.
4.4.2 Dermatology Database

We also evaluate our algorithms on the dermatology database from the UC Irvine Machine Learning Repository. The database is used for the differential diagnosis of erythemato-squamous diseases. These diseases all share the clinical features of erythema and scaling, with very little difference. The dataset contains 366 data points over 34 attributes and was previously used for classification. In our experiment, we use it to demonstrate our distributed clustering algorithms. The confusion matrix of centralized clustering, i.e., CoFD, on this dataset is presented in Table 12, and its recovering rate is 0.686. We then horizontally partition the dataset into 3 subsets, each containing 122 instances. The confusion matrix obtained by D-CoFD-Hom is shown in Table 13.
5 Related Work
Self-Organizing Map (SOM) [53] is widely used for clustering, and is particularly well suited to the task of identifying a small number of prominent classes in a multidimensional data set. SOM uses an incremental approach where points/patterns are processed one by one. However, the "centroids" with huge dimensionality are hard to interpret in SOM. Other traditional clustering techniques can be classified into partitional, hierarchical, density-based and grid-based. Partitional clustering attempts to directly decompose the data set into K disjoint classes such that the data points in a class are nearer to one another than the data points in other classes. Hierarchical clustering proceeds successively by building a tree of clusters. Density-based clustering groups the neighboring points of a data set into classes based on density conditions. Grid-based clustering quantizes the object space into a finite number of cells that form a grid structure and then performs clustering on the grid structure. Most of these methods use distance functions as objective criteria and are not effective in high dimensional spaces. Next we review some recent clustering algorithms which have been proposed for high dimensional spaces or without distance functions and are most closely related to our work.

CLIQUE [2] is an automatic subspace clustering algorithm for high dimensional spaces. It uses equal-size cells and cell density to find dense regions in each subspace of a high dimensional space. The cell size and the density threshold need to be provided by the user as inputs. CLIQUE does not produce disjoint classes, and the highest dimensionality of subspace classes reported is about ten. Our algorithm produces disjoint classes and does not require additional parameters such as the cell size and density; it can also find classes with higher dimensionality. Aggarwal et al. [1] introduce projected clustering and present algorithms for discovering interesting patterns in subspaces of high dimensional spaces. The core idea is a generalization of feature selection which allows the selection of different sets of dimensions for different subsets of the data sets. However, the algorithms are based on the Euclidean or Manhattan distance and their feature selection method is a variant of singular value decomposition. Also, the algorithms assume that the
number of projected dimensions is given beforehand. Our algorithm does not need the distance measures or the number of dimensions for each class. Also, our algorithm does not require all projected classes to have the same number of dimensions. Ramkumar and Swami [64] propose a method for clustering without distance functions. The method is based on the principles of co-occurrence maximization and label minimization, which are normally implemented using association and classification techniques. The method does not consider correlations of dimensions with respect to data locality.

The idea of co-clustering of data points and attributes dates back to [7, 39, 62]. Co-clustering is a simultaneous clustering of both points and their attributes by utilizing the canonical duality contained in the point-by-attribute data representation. Govaert [31] studies simultaneous block clustering of the rows and columns of a contingency table. Dhillon [24] presents a co-clustering algorithm using a bipartite graph between documents and words. Our algorithms, however, use the association graph of the features to initialize the feature map and then apply an EM-type algorithm. Han et al. [36] propose a clustering algorithm based on association rule hypergraphs, but they do not consider the co-learning problem of the data map and the feature map. A detailed survey on hypergraph-based clustering can be found in [37]. Cheng et al. [19] propose an entropy-based subspace clustering method for mining numerical data. The criterion of the entropy-based method is that a low-entropy subspace corresponds to a skewed distribution of unit densities. A decision-tree-based clustering method is proposed in [55]. There is also some recent work on clustering categorical datasets, such as ROCK [33] and CACTUS [30]. Our algorithm can be applied to both categorical and numerical data.

There are also many probabilistic clustering approaches, in which the data is considered to be a sample independently drawn from a mixture model of several probability distributions [58]. The area around the mean of each distribution constitutes a natural cluster. Generally the clustering task is then reduced to finding the parameters of the distributions, such as the mean and variance, that maximize the probability that the data is drawn from the mixture model (also known as maximizing the likelihood). The Expectation-Maximization (EM) method [23] is then used to find the mixture model parameters that maximize the likelihood. Moore [60] suggests accelerating the EM method with a special data index, the KD-tree, where data is divided at each node into two descendants by splitting the widest attribute at the center of its range. The algorithm AutoClass [18] utilizes a mixture model and covers a wide range of distributions including Bernoulli, Poisson, Gaussian and log-normal distributions. The algorithm MCLUST [28] uses Gaussian models with ellipsoids of different volumes, shapes and orientations. Smyth [67] uses probabilistic clustering for multivariate and sequential data, and Cadez et al. [16] use mixture models for customer profiling based on transactional information.
Barash and Friedman [9] use a context-specific Bayesian clustering method to help in understanding the connections between transcription factors and functional classes of genes. The context-specific approach has the ability to deal with many attributes that are irrelevant to the clustering and is suitable for the underlying biological problem. Kearns et al. [51] give an information-theoretic analysis of hard assignments (used by K-means) and soft assignments (used by the EM algorithm) for clustering and propose a posterior partition algorithm which is close to the soft assignments of EM. The relationship between maximum likelihood and clustering is also discussed in [47].

Minimum encoding length approaches, including Minimum Message Length (MML) [70] and Minimum Description Length (MDL) [65], can also be used for clustering. Both approaches perform induction by seeking a theory that enables the most compact encoding of both the theory and the available data, and both admit probabilistic interpretations. Given prior probabilities for both theories and data, minimization of the MML encoding closely approximates maximization of the posterior probability, while an MDL code length defines an upper bound on the "unconditional likelihood" [66, 71]. Minimum encoding is an information-theoretic criterion for parameter estimation and model selection; using minimum encoding for clustering, we look for the cluster structure that minimizes the size of the encoding. The algorithm SNOB [72] uses a mixture model in conjunction with the Minimum Message Length (MML) principle. An empirical comparison of prominent criteria is given in [63], which concludes that Minimum Message Length appears to be the best criterion for selecting the number of components for mixtures of Gaussian distributions. Fasulo [26] gives a detailed survey of clustering approaches based on mixture models [8], dynamical systems [52] and clique graphs [11]. Similarity based on shared features has also been analyzed in cognitive science, for example in the family resemblances study by Rosch and Mervis. Some methods have been proposed for classification using maximum entropy (or maximum likelihood) [12, 61]. The classification problem is a supervised learning problem, for which the data points are already labeled with known classes. However, the clustering problem is an unsupervised learning problem, as no labeled data is available.

Recently there has been a lot of interest in distributed clustering. Kargupta et al. [49] present collective principal component analysis for clustering heterogeneous datasets. Johnson and Kargupta [46] study hierarchical clustering from distributed and heterogeneous data. Both of the above works are geared toward vertical partitioning. In our D-CoFD algorithms, we handle data that is horizontally partitioned as well as vertically partitioned. Lazarevic et al. [54] propose a distributed clustering algorithm for learning regression modules from spatial datasets. Hulth and Grenholm [44] propose a distributed clustering algorithm which is intended to be incremental and to work in a real-time situation. McClean et al. [57] present algorithms to cluster databases that hold aggregate count data on a set of attributes that have been classified according to heterogeneous classification schemes. Forman and Zhang [27] present distributed K-means, K-Harmonic-Means and EM algorithms for homogeneous (horizontally partitioned) datasets. Parallel clustering algorithms are also discussed, for example, in [25]. More work on distributed clustering can be found in [48].
6 Conclusions
In this paper, we first propose a novel clustering algorithm, CoFD, which does not require a distance function for high dimensional spaces. We offer a new perspective on the clustering problem by interpreting it as the dual problem of optimizing the feature map and the data map. As a by-product, CoFD also produces the feature map, which can provide interpretable descriptions of the resulting classes. CoFD also provides a method to select the number of classes based on the conditional entropy. We introduce the recovering rate, an accuracy measure of clustering, to measure the performance of clustering algorithms and to compare the clustering results of different algorithms. We extend CoFD and propose the D-CoFD algorithms for clustering distributed datasets. Extensive experiments have been conducted to demonstrate the efficiency and effectiveness of the algorithms.

Documents are usually represented as high-dimensional vectors in term space. We evaluate CoFD for clustering on a variety of document collections and compare its performance with other widely used document clustering algorithms. Although there is no single winner on all the datasets, CoFD outperforms the rest on two datasets and its performance is close to the best on the other three datasets. In summary, CoFD is a viable and competitive clustering algorithm.
Acknowledgments

The project is supported in part by NIH Grants 5-P41-RR09283, RO1-AG18231, and P30-AG18254, and by NSF Grants EIA-0080124, EIA-0205061, CCR-9701911, and DUE-9980943.
References

[1] Charu C. Aggarwal, Joel L. Wolf, Philip S. Yu, Cecilia Procopiuc, and Jong Soo Park. Fast algorithms for projected clustering. In ACM SIGMOD Conference, pages 61–72, 1999.

[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD-98, 1998.

[3] R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in massive databases. In Proceedings of the ACM-SIGMOD 1993 International Conference on Management of Data, pages 207–216, 1993.

[4] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI/MIT Press, 1996.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th Conference on Very Large Databases, pages 487–499, 1994.

[6] J. S. Albus. A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Trans. of the ASME, J. Dynamic Systems, Measurement, and Control, 97(3):220–227, September 1975.

[7] M. R. Anderberg. Cluster Analysis for Applications. Academic Press Inc., 1973.

[8] J. Banfield and A. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803–821, 1993.

[9] Yoseph Barash and Nir Friedman. Context-specific Bayesian clustering for gene expression data. In RECOMB, pages 12–21, 2001.

[10] R. J. Bayardo Jr. and R. Agrawal. Mining the most interesting rules. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, pages 145–154, New York, NY, 1999. ACM Press.

[11] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. J. of Comp. Biology, 6(3/4):281–297, 1999.

[12] Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 1996.
[13] M. Berger and I. Rigoutsos. An algorithm for point clustering and grid generation. IEEE Trans. on Systems, Man and Cybernetics, 21(5):1278–1286, 1991.

[14] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? In ICDT Conference, 1999.

[15] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore. Document categorization and query generation on the world wide web using WebACE. AI Review, 13(5).

[16] I. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transactional data with applications to profiling, visualization, and prediction. In KDD-2001, pages 37–46, San Francisco, CA, 2001.

[17] P. Cheeseman, J. Kelly, and M. Self. AutoClass: A Bayesian classification system. In ICML'88, 1988.

[18] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.

[19] C.-H. Cheng, A. W.-C. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical data. In KDD-99, 1999.

[20] D. Cheung, J. Han, V. T. Ng, A. W. Fu, and Y. Fu. Fast distributed algorithm for mining association rules. In International Conference on Parallel and Distributed Information Systems, pages 31–42, 1996.

[21] P. A. Chou, T. Lookabaugh, and R. M. Gray. Entropy-constrained vector quantization. IEEE Trans., ASSP-37(1):31, 1989.

[22] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering, 13(1):64–78, 2001.

[23] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, (39):1–38, 1977.

[24] I. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. Technical Report 2001-05, UT Austin CS Dept, 2001.
[25] Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, pages 245–260, 2000.

[26] D. Fasulo. An analysis of recent work on clustering algorithms. Technical Report 01-03-02, U. of Washington, Dept. of Comp. Sci. & Eng., 1999.

[27] G. Forman and B. Zhang. Distributed data clustering can be efficient and exact. SIGKDD Explorations, 2(2):34–38, 2001.

[28] C. Fraley and A. Raftery. MCLUST: Software for model-based cluster and discriminant analysis. Technical Report 342, Dept. Statistics, Univ. of Washington, 1999.

[29] A. Fu, R. Kwong, and J. Tang. Mining n most interesting itemsets. In Proceedings of the 12th International Symposium on Methodologies for Intelligent Systems, pages 59–67. Springer-Verlag Lecture Notes in Computer Science 1932, 2000.

[30] Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan. CACTUS: Clustering categorical data using summaries. In Knowledge Discovery and Data Mining, pages 73–83, 1999.

[31] G. Govaert. Simultaneous clustering of rows and columns. Control and Cybernetics, (24):437–458, 1985.

[32] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the 1998 ACM SIGMOD Conference, 1998.

[33] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5):345–366, 2000.

[34] E. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In ACM SIGMOD Conference, pages 277–288. ACM Press, 1997.

[35] Eui-Hong Han, Daniel Boley, Maria Gini, Robert Gross, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore. WebACE: A web agent for document categorization and exploration. In Katia P. Sycara and Michael Wooldridge, editors, Proceedings of the 2nd International Conference on Autonomous Agents (Agents'98), pages 408–415, New York, 9–13, 1998. ACM Press.
[36] Eui-Hong Han, George Karypis, Vipin Kumar, and Bamshad Mobasher. Clustering based on association rule hypergraphs. In Research Issues on Data Mining and Knowledge Discovery, 1997.

[37] Eui-Hong Han, George Karypis, Vipin Kumar, and Bamshad Mobasher. Hypergraph based clustering in high-dimensional data sets: A summary of results. Data Engineering Bulletin, 21(1):15–22, 1998.

[38] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD Conference, pages 1–12. ACM Press, 2000.

[39] J. Hartigan. Clustering Algorithms. Wiley, 1975.

[40] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

[41] J. Hipp, U. Güntzer, and G. Nakhaeizadeh. Algorithms for association rule mining—a general survey and comparison. SIGKDD Explorations, 2(1):58–63, 2000.

[42] J. Hopcroft and R. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231, 1973.

[43] M. Houtsma and A. Swami. Set-oriented mining of association rules. Research Report RJ 9567, IBM Almaden Research Center, San Jose, CA, October 1993.

[44] N. Hulth and P. Grenholm. A distributed clustering algorithm. Technical Report 74, Lund University Cognitive Science, 1998.

[45] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[46] Erik L. Johnson and Hillol Kargupta. Collective, hierarchical clustering from distributed, heterogeneous data. In M. Zaki and C. Ho, editors, Large-Scale Parallel Systems, pages 221–244. Springer, 1999.

[47] M. I. Jordan. Graphical Models: Foundations of Neural Computation. MIT Press, 2001.

[48] Hillol Kargupta and Philip Chan, editors. Advances in Distributed and Parallel Data Mining. AAAI Press, 2000.
[49] Hillol Kargupta, Weiyun Huang, Krishnamoorthy Sivakumar, and Erik Johnson. Distributed clustering using collective principal component analysis. In ACM SIGKDD-2000 Workshop on Distributed and Parallel Knowledge Discovery, 2000. [50] George Karypis and Vipin Kumar. Multilevel algorithms for multi-constraint graph partitioning. Technical Report 98-019, 1998. [51] M. Kearns, Y. Mansour, and A. Y. Ng. An information-theoretic analysis of hard and soft assignment methods for clustering. In Learning in Graphical Models. Kluwer AP, 1998. [52] Jon M. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. pages 599–608, 1997. [53] Teuvo Kohonen. The Self-Organizing Map. In New Concepts in Computer Science, 1990. [54] A. Lazarevic, D. Pokrajac, and Z. Obradovic. Distributed clustering and local regression for knowledge discovery in multiple spatial databases. In 8th European Symposium On Artificial Neural Networks,ESANN2000, 2000. [55] Bing Liu, Yiyuan Xia, and Philip S. Yu. Clustering through decision tree construction. In SIGMOD-00, 2000. [56] Andrew Kachites McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/ mccallum/bow, 1996. [57] S. McClean, B. Scotney, K. Greer, and R. Pairceir. Conceptual clustering of heterogeneous distributed databases. In Workshop on Ubiquitous Data Mining, PAKDD01, 2001. [58] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, NY, 1988. [59] G. Milligan. An algorithm for creating artificial test clusters. Psychometrika, 50(1):123–127, 1985. [60] A. Moore. Very fast em-based mixture model clustering using multiresolution kd-trees. In Proceedings of Neural Information Processing Systems Conference, 1998.
24
[61] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67, 1999. [62] S. Nishisato. Analysis of Categorical Data: Dual Scaling and its Applications. University of Toronto Press, Toronto, 1980. [63] J. J. Oliver, R. A. Baxter, and C. S. Wallace. Unsupervised Learning using MML. In Machine Learning: Proceedings of the Thirteenth International Conference (ICML 96), pages 364–372. Morgan Kaufmann Publishers, 1996. [64] G.D. Ramkumar and A. Swami. Clustering data without distance function. IEEE Data(base) Engineering Bulletin, 21:9–14, 1998. [65] J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):416–431, 1983. [66] Jorma Rissanen. Stochastic complexity. Journal of the Royal Statistical Society B, 49(3), 1987. [67] P. Smyth. Probabilistic model-based clustering of multivariate and sequential data. In Proceedings of Artificial Intelligence and Statistics, pages 299–304, San Mateo CA, 1999. Morgan Kaufman. [68] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In SIGMOD-96, 1996. [69] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In In KDD Workshop on Text Mining, 2000. [70] C.S. Wallace and D.M. Boulton. An information measure for classification. Computer Journal, 11(2):185–194, 1968. [71] C.S. Wallace and P.R. Freeman. Estimation and inference by compact coding. Journal of the Royal Statistical Society (Series B), 49(3), 1987. [72] S. Wallace and D.L. Dowe. Intrinsic classification by mml – the snob program. In Proceedings of the 7 Australian Joint Conference on Artificial Intelligence, pages 37–44. World Scientific, 1994. [73] M. Zaki, S. Parthasarathy, M. Ogihara, and W. LI. Parallel algorithms for discovery of association rules. Data Mining and Knowledge Discovery, 1:343– 373, 1998.
25
[74] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Conference, 1996. [75] Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. Technical Report 01–40, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2001. [76] Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. Technical Report 02-022, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2002.
26
List of Tables

 1  A data set example.
 2  Confusion matrix for W1.
 3  Confusion matrix for W1.
 4  Confusion matrix of the zoo data.
 5  Document data set descriptions.
 6  Document clustering comparison table.
 7  Criteria functions for partitioning algorithms.
 8  Confusion matrix of technical reports by CoFD.
 9  Experiments of W2 in the homogeneous setting.
10  Experiments of W2 in the heterogeneous setting.
11  Confusion matrix on Site A.
12  Confusion matrix of the centralized result for the Dermatology database.
13  Confusion matrix of the distributed result for the Dermatology database.
data point \ feature    a   b   c   d   e   f   g
        1               1   1   0   0   1   0   0
        2               1   1   1   1   0   0   1
        3               1   0   1   0   0   0   0
        4               0   1   0   0   1   1   0
        5               0   0   0   1   1   1   1
        6               0   0   0   1   0   1   0

Table 1: A data set example.
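For later illustration, the binary matrix of Table 1 can be written down directly as a small array; the encoding below, with one row per data point and one column per feature a through g, is our own and is used only as a running example.

import numpy as np

# Rows: data points 1-6; columns: features a-g, as in Table 1.
W = np.array([[1, 1, 0, 0, 1, 0, 0],   # point 1
              [1, 1, 1, 1, 0, 0, 1],   # point 2
              [1, 0, 1, 0, 0, 0, 0],   # point 3
              [0, 1, 0, 0, 1, 1, 0],   # point 4
              [0, 0, 0, 1, 1, 1, 1],   # point 5
              [0, 0, 0, 1, 0, 1, 0]])  # point 6
features = list("abcdefg")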
Output \ Input      A     B     C     D     E
      1             0     0     0    83     0
      2             0     0     0     0    81
      3             0     0    75     0     0
      4            86     0     0     0     0
      5             0    75     0     0     0

Table 2: Confusion matrix for W1.
Output \ Input        A       B       C       D       E   Outliers
      1               1       0       1   17310       0       12
      2               0   15496       0       2      22      139
      3               0       0       0       0   24004        1
      4              10       0   17425       0       5       43
      5           20520       0       0       0       0        6
   Outliers          16      17      15      10      48     4897

Table 3: Confusion matrix for W1.
Output \ Input     1    2    3    4    5    6    7
      A            0    0    0    0    0    0    1
      B            0   20    0    0    0    0    0
      C           39    0    0    0    0    0    0
      D            0    0    2    0    0    0    0
      E            2    0    1   13    0    0    4
      F            0    0    0    0    0    8    5
      G            0    0    2    0    3    0    0

Table 4: Confusion matrix of the zoo data.
Dataset           # documents   # classes
URCS                      476           4
WebKB4                   4199           4
WebKB                    8280           7
Reuters-top 10           2900          10
K-dataset                2340          20

Table 5: Document data set descriptions.
Dataset          CoFD    K-means    P1      P2      P3      P4     Slink   Clink   UPGMA
URCS             0.889    0.782    0.868   0.836   0.872   0.878   0.380   0.422   0.597
WebKB4           0.638    0.647    0.639   0.644   0.646   0.640   0.392   0.446   0.395
WebKB            0.501    0.489    0.488   0.505   0.503   0.493     NA      NA      NA
Reuters-top 10   0.734    0.717    0.681   0.684   0.724   0.691   0.393   0.531   0.498
K-dataset        0.754    0.644    0.762   0.748   0.716   0.750   0.220   0.514   0.551

Table 6: Document clustering comparison table. Each entry is the purity of the corresponding column algorithm on the row dataset.
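Purity is commonly computed by assigning each output cluster to its dominant input class and reporting the fraction of points that fall in the dominant class of their cluster. A minimal helper for computing it from a confusion matrix such as Table 2 (our own utility, written in Python for illustration):

import numpy as np

def purity(confusion: np.ndarray) -> float:
    # Sum of per-cluster majority counts divided by the total number of points.
    return confusion.max(axis=1).sum() / confusion.sum()

# The confusion matrix of Table 2 is diagonal up to a permutation, so its purity is 1.0.
table2 = np.array([[0, 0, 0, 83, 0],
                   [0, 0, 0, 0, 81],
                   [0, 0, 75, 0, 0],
                   [86, 0, 0, 0, 0],
                   [0, 75, 0, 0, 0]])
print(purity(table2))  # -> 1.0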
Algorithm   Criterion Function

P1          Maximize  \sum_{i=1}^{K} \frac{1}{n_i} \sum_{x,y \in D_i} d(x,y)

P2          Maximize  \sum_{i=1}^{K} \sqrt{\sum_{x,y \in D_i} d(x,y)}

P3          Minimize  \sum_{i=1}^{K} n_i \frac{\sum_{x \in D_i, y \in D} d(x,y)}{\sqrt{\sum_{x,y \in D_i} d(x,y)}}

P4          Maximize  \sum_{i=1}^{K} \frac{\sqrt{\sum_{x,y \in D_i} d(x,y)}}{\sum_{x \in D_i, y \in D} d(x,y)}

Table 7: Criteria functions for different partitioning algorithms. D is the dataset and the D_i's are the clusters of size n_i. K is the number of clusters. d(x,y) is the distance/similarity between x and y.
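For illustration, one representative criterion of this family, the sum over clusters of the square root of the total within-cluster similarity, can be evaluated with a few lines of Python. The use of cosine similarity and all names below are our own assumptions rather than choices made in the paper.

import numpy as np

def sqrt_within_cluster_criterion(X: np.ndarray, labels: np.ndarray) -> float:
    # Assumes nonnegative feature vectors, so within-cluster totals are nonnegative.
    Xn = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)
    S = Xn @ Xn.T                    # cosine similarity between every pair of points
    return float(sum(np.sqrt(S[np.ix_(idx, idx)].sum())
                     for idx in (np.where(labels == c)[0] for c in np.unique(labels))))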
Output \ Input     1    2    3    4
      A           68    0    8    0
      B            8    1    4  120
      C            0    0  160    0
      D           25   70    6    6

Table 8: Confusion matrix of technical reports by CoFD.
Site A \ Site B   Outlier    1    2    3    4    5
    Outlier           7      0    0    0    0    0
       1              1      0    0   25    2    0
       2              1      0    0    0   27    0
       3              0     29    0    0    0    0
       4              1      0    0    0    1   28
       5              0      0   28    0    0    0

Site A cluster    1   2   3   4   5
Site B cluster    3   4   1   5   2

Table 9: Confusion matrix and correspondence of feature maps of W2 in homogeneous setting.
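The correspondence listed under the matrix can be recovered from the confusion matrix itself by pairing each Site A cluster with the Site B cluster it overlaps most. The greedy helper below is our own illustration of that step, not code from the paper; a maximum-weight bipartite matching would serve equally well.

import numpy as np

def match_clusters(confusion) -> dict:
    # Greedy one-to-one matching: repeatedly take the largest remaining overlap.
    C = np.array(confusion, dtype=float)
    mapping = {}
    for _ in range(min(C.shape)):
        r, c = np.unravel_index(np.argmax(C), C.shape)   # largest remaining entry
        mapping[int(r)] = int(c)
        C[r, :] = -1                                     # retire the matched row
        C[:, c] = -1                                     # retire the matched column
    return mapping

Applied to the 5-by-5 block of Table 9 (outlier row and column removed), this recovers, up to 0- versus 1-based indexing, the correspondence 1->3, 2->4, 3->1, 4->5, 5->2 listed above.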
Site A \ Site B      1     2     3     4     5
       1          1989     0     0     3     6
       2             1     2     0     0  1997
       3             2     0     0  1958     1
       4             0  1998     3    14     0
       5             0     0  1998    28     0

Site A cluster    1   2   3   4   5
Site B cluster    1   5   4   2   3

Table 10: Confusion matrix and correspondence of data maps of W2 in heterogeneous setting.
Output \ Input      1     2     3     4     5
      1             0     0     1     0  1997
      2          1999     0     0     0     1
      3             1     0  1958     0     2
      4             0  2000    13     2     0
      5             0     0    28  1998     0

Table 11: Confusion matrix on Site A.
Output \ Input      1    2    3    4    5    6
      A             0    5    0    1    0   20
      B           112   13    0    2    4    0
      C             0    0   72    1    0    0
      D             0   14    0   15    5    0
      E             0    7    0    4   43    0
      F             0   22    0   26    0    0

Table 12: Confusion matrix of the centralized result for the Dermatology database.
Output \ Input      1    2    3    4    5    6
      A             3    0    9    0   41    0
      B            99    6    0    1    0    0
      C             1    0   63    1    0    0
      D             5    4    0    2    8   20
      E             2   16    0    3    3    0
      F             2   35    0   42    0    0

Table 13: Confusion matrix of the distributed result for the Dermatology database.
List of Figures

1  Description of the CoFD algorithm.
2  Clustering and refining algorithm.
3  Algorithm of guessing the number of clusters.
4  Scalability with the number of points: running time.
5  Scalability with the number of points: running time/#points.
6  Scalability with the number of features: running time.
7  Scalability with the number of features: running time/#features.
8  Purity comparisons on various document datasets.
Procedure EstimateFeatureMap(data points: W, data map: D)
begin
    for j in 1..d do
        if max_C |{i : D(i) = C and W_ij = 1}| / |{i : W_ij = 1}| >= 1/K then
            F(j) = argmax_C |{i : D(i) = C and W_ij = 1}| / |{i : W_ij = 1}|;
        else
            F(j) = outlier;
        endif
    end
    return F;
end

Procedure EstimateDataMap(data points: W, feature map: F)
{If no outliers, T_outlier = 0. Otherwise, T_outlier = 1/K.}
begin
    for i in 1..N do
        if max_C |{j : F(j) = C and W_ij = 1}| / |{j : W_ij = 1}| >= T_outlier then
            D(i) = argmax_C |{j : F(j) = C and W_ij = 1}| / |{j : W_ij = 1}|;
        else
            D(i) = outlier;
        endif
    end
    return D;
end

Algorithm CoFD(data points: W, # of classes: K)
begin
    let W1 be a set of nK distinct data points randomly chosen from W;
    assign each n of them to one class, say the map be D1;
    assign EstimateFeatureMap(W1, D1) to F;
    repeat
        assign F to F1;
        assign EstimateDataMap(W, F1) to D;
        assign EstimateFeatureMap(W, D) to F;
    until the conditional entropy H(F | F1) is zero;
    return D;
end

Figure 1: Description of the CoFD algorithm.
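The following compact Python sketch mirrors the alternating structure of Figure 1 for a binary data point by feature matrix W. The initialization and the stopping test are simplified (a random hard assignment and exact convergence of the feature map in place of the bootstrap sample and the conditional-entropy test), so it should be read as an illustration of the procedure rather than the authors' implementation.

import numpy as np

OUTLIER = -1

def estimate_feature_map(W, D, K):
    # Map each feature to the class whose points use it most often (cf. Figure 1).
    F = np.full(W.shape[1], OUTLIER)
    for j in range(W.shape[1]):
        support = np.where(W[:, j] == 1)[0]            # data points having feature j
        if support.size == 0:
            continue
        counts = np.array([(D[support] == c).sum() for c in range(K)])
        if counts.max() / support.size >= 1.0 / K:     # below the threshold -> outlier feature
            F[j] = int(counts.argmax())
    return F

def estimate_data_map(W, F, K, t_outlier=0.0):
    # Map each data point to the class whose features it carries most often (cf. Figure 1).
    D = np.full(W.shape[0], OUTLIER)
    for i in range(W.shape[0]):
        active = np.where(W[i, :] == 1)[0]             # features present in data point i
        if active.size == 0:
            continue
        counts = np.array([(F[active] == c).sum() for c in range(K)])
        if counts.max() / active.size >= t_outlier:    # t_outlier = 0 when outliers are not modeled
            D[i] = int(counts.argmax())
    return D

def cofd(W, K, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Simplified seeding: a random hard assignment of all points to the K classes.
    D = rng.integers(0, K, size=W.shape[0])
    F = estimate_feature_map(W, D, K)
    for _ in range(max_iter):
        F_prev = F.copy()
        D = estimate_data_map(W, F_prev, K)
        F = estimate_feature_map(W, D, K)
        if np.array_equal(F, F_prev):                  # stand-in for H(F | F_prev) = 0
            break
    return D, F

For example, cofd(W, 2) can be run directly on the small matrix of Table 1.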
Algorithm Refine(data points: W, the number of classes: K)
begin
    do m runs of CoFD(W, K) and assign the results to C_i for i = 1, ..., m;
    compute the average conditional entropies T(C_i) = (1/m) * sum_{j=1}^{m} H(C_j | C_i) for i = 1, ..., m;
    return C_{i*}, where i* = argmin_i T(C_i);
end

Figure 2: Clustering and refining algorithm.

Algorithm GuessK(data points: W)
{L is the estimated maximum number of clusters.}
begin
    for K in 2..L do
        do m runs of Cluster(W, K) and assign the results to C_{K,i} for i = 1, ..., m;
        compute the average conditional entropies T(C_{K,i}) = (1/m) * sum_{j=1}^{m} H(C_{K,j} | C_{K,i}) for i = 1, ..., m;
    end
    return argmin_K min_i T(C_{K,i});
end

Figure 3: Algorithm of guessing the number of clusters.
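Both the refining algorithm and GuessK score competing clusterings with conditional entropies H(C_j | C_i). The small helper below, our own utility, computes this quantity from two label vectors; it is zero exactly when the first clustering is determined by the second, for instance when the two are relabelings of the same partition.

import numpy as np

def conditional_entropy(labels_j, labels_i) -> float:
    # H(C_j | C_i) in bits: residual uncertainty about C_j once C_i is known.
    labels_j, labels_i = np.asarray(labels_j), np.asarray(labels_i)
    n = labels_i.size
    h = 0.0
    for a in np.unique(labels_i):
        mask = labels_i == a
        p_a = mask.sum() / n
        probs = np.unique(labels_j[mask], return_counts=True)[1] / mask.sum()
        h -= p_a * float(np.sum(probs * np.log2(probs)))
    return h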
[Figure: running time in seconds versus the number of points (100 to 100,000), with and without bootstrapping.]

Figure 4: Scalability with the number of points: running time.
[Figure: running time per point (running time/#points) versus the number of points (100 to 100,000), with and without bootstrapping.]

Figure 5: Scalability with the number of points: running time/#points.
[Figure: running time in seconds versus the number of features (20 to 1,000).]

Figure 6: Scalability with the number of features: running time.
[Figure: running time per feature (running time/#features) versus the number of features (20 to 1,000).]

Figure 7: Scalability with the number of features: running time/#features.
[Figure: purity of CoFD, K-means, P1, P2, P3, P4, Slink, Clink, and UPGMA on the URCS, WebKB4, WebKB, Reuters-top 10, and K-dataset collections.]

Figure 8: Purity comparisons on various document datasets.