Clustering Categorical Data based on Information Loss Minimization Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller, Kenneth C. Sevcik University of Toronto {periklis,tsap,miller,kcs}@cs.toronto.edu June 20, 2003
1 Introduction As the size of databases continues to grow, understanding their structure becomes more difficult. This, together with the lack of documentation and the unavailability of the original database designers, makes it harder for researchers and professionals to understand the structure of large and complex databases. At the same time, data sources are distributed over several sites, and their integration introduces anomalies and often results in “dirty” databases, i.e., databases that contain erroneous or duplicate data records. Our research focuses on the application of data mining, and in particular clustering techniques, to aid the process of recovering and understanding high-level views of data sets. Clustering is a problem of great practical importance that has been the focus of substantial research in several domains for decades [13, 14]. As storage capacities grow, we have at hand larger amounts of data available for analysis, and clustering plays an instrumental role in this analysis. This trend has created a surge of research activity in the construction of clustering algorithms that can handle large amounts of data and produce results of high quality. The problem of clustering is defined as that of partitioning data objects into groups, such that objects in the same group are similar, while objects in different groups are dissimilar. This definition assumes that there is some well-defined notion of similarity or distance between data objects. For example, when the objects are defined over a set of numerical attributes, there are natural definitions of distance that draw from geometric analogies. The definition of distance allows us to define a quality measure for a clustering (e.g., the mean square distance of points from the centroids of their clusters). Clustering then becomes the problem of grouping together points such that the quality measure is optimized. This problem becomes more challenging when the data is categorical, that is,
when there is no inherent distance measure between data objects. In many domains, data is described by a set of descriptive attributes, many of which are neither numerical nor inherently ordered in any way. A data set often comes with a schema, the means through which we understand and query the underlying data. Most systems assume that the schema is predefined and that the data conforms to the structure and constraints of this schema. However, when it comes to large and heterogeneous data sources, their integration may introduce inconsistencies, and the current schema may not represent the data accurately [1]. Designing a database is a process usually carried out by humans (a database designer or a domain expert); therefore, the resulting schema is often subjective. Moreover, as systems age, the insertion of new data into the source and the introduction of new constraints force the relaxation of older constraints and the degradation of the initial schema. As a result, we may need to redefine schemas and represent the data using an alternative schema. Using data mining techniques, we intend to analyze a given database instance and produce a more descriptive schema. We are designing a system that uses clustering to decompose a database instance into horizontal partitions, and then finds a vertical decomposition of each partition. As mentioned, clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data objects. As a concrete example, consider a database that stores information about movies. A movie is a tuple expressed over the attributes “director”, “actor/actress”, and “genre”. An instance of this database is shown in Table 1. In this setting it is not immediately obvious what the distance is between the values “Coppola” and “Scorsese”, or between the tuples “Vertigo” and “Harvey”. More importantly, it is not obvious how to define a quality measure for the clustering. On the other hand, humans have an intuitive notion of clustering quality: a good clustering is one where the clusters are informative about the tuples they contain. Since tuples are expressed over attribute values, we require that the clusters be informative about the attribute values of the tuples they hold. That is, given a cluster, we should be able to predict the attribute values of the tuples in the cluster to the greatest degree possible. In this case, the quality measure of the clustering is the information that the clusters hold about the attributes. Since a clustering is a summary of the data, some information may be lost. Our objective is to minimize this loss, or equivalently to minimize the increase in uncertainty. Consider again the example in Table 1, and assume that we want to partition it into two clusters. One possible clustering is to group the first two movies into one cluster c1 and the remaining four into another cluster c2. Note that cluster c1 preserves all information about the actor and the genre of the movies it holds; given cluster c1, the probability of predicting the values for these attributes is 1, i.e., p(actor = R. De Niro | c1) = p(genre = Crime | c1) = 1. Furthermore, it is not hard to see that any other clustering will result in greater information loss.
For example, moving “Vertigo” and “N by NW” from cluster c2 to cluster c1 decreases the uncertainty about the attributes within c2, but at the same time causes a greater increase in the uncertainty of cluster c1, since our predictive ability for the actor attribute in c1 is diminished.
movie               director       actor       genre
Godfather II        M. Scorsese    R. De Niro  Crime
Good Fellas         F. F. Coppola  R. De Niro  Crime
Vertigo             A. Hitchcock   J. Stewart  Thriller
N by NW             A. Hitchcock   C. Grant    Thriller
The Bishop's Wife   H. Koster      C. Grant    Comedy
Harvey              H. Koster      J. Stewart  Comedy

Table 1: An instance of the movie database
This intuitive idea was formalized by Tishby, Pereira and Bialek [19]. They recast clustering as the compression of one random variable into a compact representation that preserves as much information as possible about another random variable. Their approach was named the Information Bottleneck (IB) method, and it has been applied to a variety of different areas. In this work, we consider the application of the IB method to the problem of clustering categorical data, and we make the following contributions.
• We formulate the categorical clustering problem within the Information Bottleneck framework.
• We use the IB method directly to define the distance between objects.
• We propose LIMBO, a scalable algorithm for clustering categorical data. To the best of our knowledge, LIMBO is the first scalable application of the IB method. LIMBO requires at most three passes over the data set.
• We empirically evaluate the quality of the clusterings produced by LIMBO relative to other categorical clustering algorithms.
• We discuss the possible use of the IB method in the problem of structure discovery.
The outline of this paper is as follows. In Section 2, we introduce some basic definitions and the Information Bottleneck (IB) method. In Section 3, we formulate the problem of categorical data clustering using IB and introduce LIMBO, a scalable algorithm that clusters categorical data. Preliminary empirical results are given in Section 4. In Section 5 we present related work on categorical data clustering. Finally, we conclude in Section 6, where we also present ongoing and future research directions.
2 Background In this section, we present the main concepts from information theory that will be used in the rest of the paper. We also give an outline of the Information Bottleneck method and how it is used in clustering.
2.1 Information Theory basics We now present the notation that we will use for the rest of the paper. The following definitions can be found in any Information Theory textbook [8]. Let T denote a discrete random variable that takes values over the set T and let p(t) denote the distribution of T. The entropy H(T) of variable T is defined by

    H(T) = − Σ_{t∈T} p(t) log p(t)
The entropy H(T) can be thought of as the average number of bits required to describe the random variable T. Intuitively, entropy captures the “uncertainty” of variable T; the higher the entropy, the lower the certainty with which we can predict the value of the variable T. Now, let T and A be two random variables that range over sets T and A respectively, and let p(t, a) denote their joint distribution. The conditional entropy H(A|T) is defined as

    H(A|T) = Σ_{t∈T} p(t) H(A|T = t) = − Σ_{t∈T} p(t) Σ_{a∈A} p(a|t) log p(a|t)
Given T and A, the mutual information, I(T; A), quantifies the amount of information that the variables hold about each other. The mutual information between two variables is the amount of uncertainty (entropy) in one variable that is removed by knowledge of the value of the other one. Specifically, we have

    I(T; A) = Σ_{t∈T} Σ_{a∈A} p(t, a) log [ p(t, a) / (p(t) p(a)) ] = H(T) − H(T|A) = H(A) − H(A|T)

Mutual information is symmetric, non-negative and equals zero if and only if T and A are independent. Finally, Relative Entropy, or the Kullback-Leibler (KL) divergence, is a standard information-theoretic measure of the difference between two probability distributions. Given two distributions p and q over a set T, the relative entropy is

    D_KL[p || q] = Σ_{t∈T} p(t) log [ p(t) / q(t) ]
Intuitively, the relative entropy D_KL[p || q] is a measure of the inefficiency of an encoding that assumes the distribution q, when the true distribution is p.
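To make these definitions concrete, the following Python sketch (ours, not part of the paper; the function names are our own) computes the quantities above from a joint distribution using numpy; entropies are measured in bits (log base 2).

```python
import numpy as np

def entropy(p):
    """H(p) = -sum p log2 p, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(T; A) = H(T) + H(A) - H(T, A) for a joint distribution over (t, a)."""
    p_t = joint.sum(axis=1)              # marginal p(t)
    p_a = joint.sum(axis=0)              # marginal p(a)
    return entropy(p_t) + entropy(p_a) - entropy(joint.ravel())

def kl_divergence(p, q):
    """D_KL[p || q]; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Example: a tiny joint distribution over 2 values of T and 3 values of A.
joint = np.array([[0.2, 0.1, 0.2],
                  [0.1, 0.3, 0.1]])
print(mutual_information(joint))         # I(T; A) in bits
```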
2.2 The Information Bottleneck Method Let T denote a set of objects that we want to cluster (e.g., customers, tuples in a relation, web documents), and assume that the elements in T are expressed as vectors in a feature space A (e.g., items purchased, words, attribute values, links). Let n = |T| and d = |A|. Our data can then be represented as an n × d matrix M, where each row holds the feature vector of an object in T. Now let T and A be random variables that range over the sets T and A respectively. We normalize matrix M so that the entries of each row sum up to one. For some object t ∈ T, the corresponding row of the normalized matrix holds the conditional probability p(A|T = t). The information that one variable contains about the other can be quantified using the mutual information I(T; A). Furthermore, if we fix a value t ∈ T, the conditional entropy H(A|T = t) gives the uncertainty of a value for variable A selected among those associated with the object t. We introduce a third random variable C taking values from a set C. The mutual information I(A; C) quantifies the information about the values of A provided by the identity of a cluster (a given value of C). The higher this quantity is, the more informative the cluster identity is about the features of its constituents. Our goal is to find a mapping of all values t ∈ T into clusters c ∈ C such that every object t is mapped to a cluster c(t), and the cluster c(t) provides essentially the same prediction, or information, about the values in A as the specific knowledge of t. This idea was formalized by Tishby, Pereira and Bialek [19]. Given the joint distribution of two random variables T and A, the problem of clustering is recast as the problem of compressing variable T as much as possible, while at the same time preserving as much information as possible about variable A. For some c ∈ C,

    p(c|t) = 1 if t ∈ c, and 0 otherwise
    p(c) = Σ_{t∈c} p(t)
    p(a|c) = (1/p(c)) Σ_{t∈c} p(t) p(a|t)
The compactness of the clustering is directly related to the mutual information I(T; C) between T and C: the lower I(T; C), the more compact the clustering. Tishby et al. define clustering as an optimization problem, where for any number k of clusters, we wish to maximize I(A; Ck) while keeping I(T; Ck) as small as possible. Intuitively, in this procedure, the information contained in T about A is “squeezed” through a compact “bottleneck” of clusters C, which is forced to represent the “relevant” part of T with respect to A. Finding the optimal clustering is an NP-complete problem [10]. Slonim and Tishby [17] propose a greedy agglomerative approach, the Agglomerative Information Bottleneck (AIB) algorithm, for finding a locally optimal clustering. The algorithm starts with a clustering C, where each object t ∈ T is assigned to its own cluster. Therefore, we have that C ≡ T, and I(A; C) is maximized. The algorithm then proceeds iteratively, until k clusters are formed. At each step we merge two clusters ci, cj ∈ C into a single component c* to produce a new clustering C', so that the information loss δI(ci, cj) = I(A; C) − I(A; C') is minimized. The new component c* has

    p(c*|t) = 1 if t ∈ ci or t ∈ cj, and 0 otherwise                  (1)
    p(c*) = p(ci) + p(cj)                                              (2)
    p(a|c*) = [p(ci)/p(c*)] p(a|ci) + [p(cj)/p(c*)] p(a|cj)            (3)
Tishby et al. show that

    δI(ci, cj) = [p(ci) + p(cj)] · D_JS[p(a|ci), p(a|cj)]

where D_JS is the Jensen-Shannon (JS) divergence, defined as follows. Let pi = p(a|ci) and pj = p(a|cj), and let

    p̄ = [p(ci)/p(c*)] p(a|ci) + [p(cj)/p(c*)] p(a|cj)

denote the weighted average distribution of pi and pj. The D_JS distance is defined as

    D_JS[pi, pj] = [p(ci)/p(c*)] D_KL[pi || p̄] + [p(cj)/p(c*)] D_KL[pj || p̄]

The D_JS distance is the weighted average D_KL distance of pi and pj from p̄. It is non-negative and equals zero if and only if pi ≡ pj. It is also bounded above by 1, and it is symmetric. The value δI(ci, cj) measures the information loss incurred when merging ci and cj. We can also view the information loss as the increase in uncertainty. Recall that I(A; C) = H(A) − H(A|C). Since the value H(A) remains unaffected by changes in C, maximizing the mutual information I(A; C) is the same as minimizing the entropy of the clustering H(A|C).
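The following Python sketch (ours, assuming the cluster-conditional distributions are stored as numpy arrays) implements the information loss δI(ci, cj) in its Jensen-Shannon form and the merge rules of Equations (2) and (3); a greedy AIB pass would repeatedly apply these two functions to the pair of clusters with the smallest loss.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def delta_I(p_ci, p_cj, pa_ci, pa_cj):
    """Information loss dI(ci, cj) = [p(ci) + p(cj)] * D_JS[p(a|ci), p(a|cj)]."""
    p_cstar = p_ci + p_cj
    wi, wj = p_ci / p_cstar, p_cj / p_cstar
    p_bar = wi * pa_ci + wj * pa_cj               # weighted average distribution
    d_js = wi * kl(pa_ci, p_bar) + wj * kl(pa_cj, p_bar)
    return p_cstar * d_js

def merge(p_ci, p_cj, pa_ci, pa_cj):
    """Return p(c*) and p(a|c*) for the merged cluster (Equations (2) and (3))."""
    p_cstar = p_ci + p_cj
    pa_cstar = (p_ci / p_cstar) * pa_ci + (p_cj / p_cstar) * pa_cj
    return p_cstar, pa_cstar

# Greedy AIB: start from singleton clusters and repeatedly merge the pair with
# the smallest delta_I until k clusters remain (O(n^2) candidate pairs per step).
```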
3 IB Clustering and LIMBO In this section, we formulate the problem of clustering categorical data in the context of the Information Bottleneck method, and we introduce LIMBO, a scalable algorithm to cluster such data.
3.1 Clustering Categorical Data using the IB method The input to our problem is a set T of n tuples on m attributes A1, A2, ..., Am. The domain of attribute Ai is the set Ai = {Ai.v1, Ai.v2, ..., Ai.vki}. A tuple t ∈ T takes exactly one value from the set Ai on each attribute Ai. Let A = A1 ∪ ... ∪ Am denote the set of all possible attribute values, where identical values from different attributes are treated as distinct values. Let d = k1 + k2 + ... + km denote the size of A. We represent our data as an n × d matrix M, where each t ∈ T is a d-dimensional row vector in M. M[t, a] = 1 if tuple t contains attribute value a, and zero otherwise. Since every tuple contains one value for each attribute, each tuple vector contains exactly m 1's. Now, let T and A be random variables that range over sets T and A respectively. Following the formalism of Section 2.2, we define p(t) = 1/n, and we normalize the matrix M so that the t-th row holds the conditional probability distribution p(A|t). Since each tuple contains exactly m attribute values, for some a ∈ A, p(a|t) = 1/m if a appears in tuple t, and zero otherwise. Table 2 shows the normalized matrix M for the movie database example. (We use abbreviations for the attribute values; for example, d.H stands for director.Hitchcock.)
Given the normalized matrix, we can proceed with the application of the IB method to cluster the tuples in T. Note that matrix M can be stored as a sparse matrix, so we do not need to materialize all n × d entries. We can now define the mutual information I(T; A) and proceed with the Information Bottleneck method. Note that our approach merges all attribute values into one variable, without taking into account the fact that the values come from different attributes. Alternatively, we can define a random variable for every attribute Ai. Let pi(a|t) denote the conditional probability of variable Ai given T. We can show [3] that in the case of relational data, for the Information Bottleneck method, considering all attributes together is equivalent to considering each attribute independently.

      d.S   d.C   d.H   d.K   a.DN  a.S   a.G   g.Cr  g.T   g.C   p(t)
t1    1/3   0     0     0     1/3   0     0     1/3   0     0     1/6
t2    0     1/3   0     0     1/3   0     0     1/3   0     0     1/6
t3    0     0     1/3   0     0     1/3   0     0     1/3   0     1/6
t4    0     0     1/3   0     0     0     1/3   0     1/3   0     1/6
t5    0     0     0     1/3   0     0     1/3   0     0     1/3   1/6
t6    0     0     0     1/3   0     1/3   0     0     0     1/3   1/6

Table 2: The normalized movie table
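As an illustration of the construction in Section 3.1, the sketch below (our own code; the list-of-dictionaries input format is an assumption, and the two example tuples are taken from Table 1) builds the normalized matrix M with p(a|t) = 1/m and the uniform prior p(t) = 1/n.

```python
from collections import OrderedDict

tuples = [
    {"director": "M. Scorsese", "actor": "R. De Niro", "genre": "Crime"},
    {"director": "A. Hitchcock", "actor": "J. Stewart", "genre": "Thriller"},
]

# Build the set A of attribute values; values from different attributes are
# kept distinct by prefixing the attribute name (e.g. "director.A. Hitchcock").
values = OrderedDict()
for t in tuples:
    for attr, v in t.items():
        values.setdefault(f"{attr}.{v}", len(values))

n, d, m = len(tuples), len(values), len(tuples[0])
M = [[0.0] * d for _ in range(n)]          # normalized n x d matrix
for i, t in enumerate(tuples):
    for attr, v in t.items():
        M[i][values[f"{attr}.{v}"]] = 1.0 / m    # p(a | t)
p_t = [1.0 / n] * n                               # uniform prior over tuples
```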
3.2 ScaLable InforMation BOttleneck (LIMBO) Clustering The Agglomerative Information Bottleneck algorithm suffers from high computational complexity, namely O(n²), which is prohibitive for large data sets. We now introduce the scaLable InforMation BOttleneck (LIMBO) algorithm, which uses distributional summaries in order to deal with larger data sets. We employ an approach similar to the one presented in the BIRCH clustering algorithm for clustering numerical data [20]; however, the definition of the summaries and the notion of similarity that we use to produce the solution are different. LIMBO is based on the idea that we do not need to keep whole tuples or whole clusters in main memory, but instead just sufficient statistics to describe them.
3.3 Distributional Cluster Features We summarize a cluster of tuples in a Distributional Cluster Feature (DCF). We will use the information in the relevant DCFs to compute the distance between two clusters or between a cluster and a tuple. Let T denote a set of tuples over a set A of attributes, and let T and A be the corresponding random variables, as described in Section 3.1. Also let C denote a clustering of the tuples in T and let C be the corresponding random
variable. For some cluster c ∈ C, the Distributional Cluster Feature (DCF) of cluster c is defined by the pair

    DCF(c) = ⟨ n(c), p(a|c) ⟩

where n(c) is the number of tuples in c and p(a|c) is the conditional probability of the attribute values given the cluster c. In the case that c consists of a single tuple t ∈ T, n(c) = 1 and p(a|c) = 1/m for all attribute values a ∈ A that appear in tuple t, and zero otherwise, where m is the number of attributes. For larger clusters, the DCF is computed recursively as follows. Let c1 and c2 be two clusters, and let c* denote the cluster we obtain by merging c1 and c2. The DCF of the cluster c* is equal to

    DCF(c*) = ⟨ n(c1) + n(c2), p1 + p2 ⟩                               (4)

where

    p1 = [n(c1) / (n(c1) + n(c2))] p(a|c1)
    p2 = [n(c2) / (n(c1) + n(c2))] p(a|c2)

We define the distance between two clusters c1 and c2, d(c1, c2), to be the information loss δI(c1, c2) = I(A; C) − I(A; C'), where C and C' denote the clusterings before and after the merge of clusters c1 and c2. We note that the information loss is independent of the clustering C; that is, the amount of information we lose depends only on the clusters c1 and c2, and not on the rest of the clustering. Therefore, d(c1, c2) is a well-defined distance measure that does not depend on C. From the discussion in Section 2.2, we have that

    d(c1, c2) = [ n(c1)/n + n(c2)/n ] · D_JS[p(a|c1), p(a|c2)]

where n is the total number of tuples in the data set. The DCFs can be stored and updated incrementally. Each DCF provides a summary of the corresponding cluster which is sufficient for computing the distance between two clusters. We will often interchange the terms cluster and DCF; thus, we can talk about merging DCFs, or computing the distance between two DCFs. We will now describe the structure that holds the DCFs.
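Before turning to the tree structure, here is a small Python sketch (ours) of the two DCF operations just defined: the merge of Equation (4) and the distance d(c1, c2). A DCF is represented as a pair (n(c), p(a|c)) with the conditional distribution stored as a numpy array; the KL helper is repeated so the block is self-contained.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def merge_dcf(dcf1, dcf2):
    """DCF(c*) = < n(c1)+n(c2), weighted mixture of p(a|c1) and p(a|c2) >."""
    (n1, p1), (n2, p2) = dcf1, dcf2
    n_star = n1 + n2
    p_star = (n1 / n_star) * p1 + (n2 / n_star) * p2
    return (n_star, p_star)

def dcf_distance(dcf1, dcf2, n_total):
    """d(c1, c2) = [n(c1)/n + n(c2)/n] * D_JS[p(a|c1), p(a|c2)]."""
    (n1, p1), (n2, p2) = dcf1, dcf2
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    p_bar = w1 * p1 + w2 * p2
    d_js = w1 * kl(p1, p_bar) + w2 * kl(p2, p_bar)
    return (n1 + n2) / n_total * d_js
```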
3.4 The DCF tree The DCF tree is a height-balanced tree, as depicted in Figure 1. It is characterized by two parameters: the branching factor B, which is the maximum number of children for a node, and the threshold τ(φ), which is an upper bound on the distance between a leaf cluster and a tuple, if the tuple is to be merged into that leaf cluster. We discuss the choice of the value of τ(φ) in Section 3.6. All nodes of the tree store DCFs. At any point in the construction of the tree, the DCFs at the leaves store a clustering of the tuples seen so far. Each non-leaf node stores a DCF that is produced by merging the DCFs of its children.
[Figure 1: A DCF tree with branching factor B = 6. The root and non-leaf nodes hold DCF entries with child pointers; the leaf nodes, chained by prev/next pointers, hold the DCFs that summarize the data.]

The DCF tree is built in a dynamic fashion as new tuples arrive and are placed in the appropriate leaves. After the whole data set is read from the disk, the DCF tree embodies a compact representation of it: the data set is summarized by the information in the DCFs of the leaves.
3.5 The LIMBO clustering algorithm The LIMBO algorithm proceeds in three phases. In the first phase, the DCF tree is constructed to summarize the data. In the second phase, the leaves of the tree are clustered to produce the desired number of clusters. In the third phase, we label the data, that is, we assign tuples to clusters.

Phase 1: Insertion into the DCF tree. Tuples are loaded one by one. Tuple t is converted into DCF(t), as described in Section 3.3. Then, starting from the root, we trace a path downward in the DCF tree. At a non-leaf node, we compare DCF(t) to each DCF entry of the node (at most B entries), find the DCF entry closest to DCF(t), and follow the child pointer of this entry to the next level of the tree. At a leaf node, let DCF(c) denote the DCF entry in the leaf node that is closest to DCF(t); DCF(c) is the summary of some cluster c in the current clustering C. At this point we need to decide whether DCF(t) will be absorbed into DCF(c) or not. If the distance d(c, t), that is, the information loss incurred by merging t into c, is less than the threshold τ, then we proceed with the merge; otherwise t forms a cluster on its own. In the latter case, if there is space for another entry in the leaf node, DCF(t) is inserted and all DCFs on the path toward the root are updated using Equation (4). If there is no space, the leaf node has to be split. This is done in a similar manner as in BIRCH [20]: the two DCFs in the node that have the greatest distance between them are selected as seeds for the two new leaves, and the remaining DCFs, along with DCF(t), are placed in the leaf that contains the seed DCF to which they are closest. When a leaf node is split, resulting in the creation of a new leaf node, its parent is updated, and a new entry is created that describes the newly inserted node. If there is space in the non-leaf node, we proceed with the addition of the entry; otherwise the non-leaf node must be split as well. This process continues upwards in the tree until the root is either updated or split itself. In the latter case, the height of the tree increases by one. The I/O complexity of this phase is equal to the cost of reading the data set from the disk, which requires only one scan. The CPU complexity is computed as follows. When a new point is inserted, we descend the tree through at most log_B n nodes. At each node, in order to find the closest entry, we need to check at most B entries, and each computation of the information loss is proportional to d. Hence, the CPU cost of inserting all n points into the tree is O(dBn log_B n).

Phase 2: Perform clustering. After the construction of the DCF tree, the leaf nodes hold the DCFs of a clustering C of the tuples in T. In this phase, our algorithm employs the Agglomerative Information Bottleneck (AIB) algorithm to cluster the DCFs in the leaves and produce a clustering C̃ of the DCFs. The time for this phase depends upon the number L of leaf entries of the DCF tree. We assume that φ is selected such that the DCF tree is compact enough to fit in main memory; hence, there is no I/O cost involved in this phase, since it only involves the clustering of the leaf nodes of the DCF tree. The CPU cost of running the AIB algorithm is O(L² log L). In our experiments, φ was chosen to be sufficiently large so that L ≪ n, which makes this phase computationally feasible.

Phase 3: Labeling of the data. Phase 2 results in a set of k clusters and their corresponding DCFs. In the last phase, we perform a scan over the data set to assign each tuple to the cluster whose DCF is closest to the tuple. The I/O cost of this phase is equal to reading the data set from the disk once. The CPU complexity is O(kdn), since each tuple is compared against the k DCFs that describe the clusters.
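The sketch below (ours) shows the core decision of Phase 1 in a deliberately simplified form: it keeps a flat list of leaf DCFs instead of the height-balanced DCF tree, so the B-way descent and node splits are omitted. It relies on the merge_dcf and dcf_distance helpers from the Section 3.3 sketch.

```python
def insert_tuple(leaves, tuple_dcf, tau, n_total):
    """Merge tuple_dcf into its closest leaf DCF if the information loss is at
    most tau; otherwise start a new leaf DCF (a cluster of its own)."""
    best_i, best_d = None, float("inf")
    for i, leaf in enumerate(leaves):
        d = dcf_distance(leaf, tuple_dcf, n_total)   # information loss of the merge
        if d < best_d:
            best_i, best_d = i, d
    if best_i is not None and best_d <= tau:
        leaves[best_i] = merge_dcf(leaves[best_i], tuple_dcf)   # absorb the tuple
    else:
        leaves.append(tuple_dcf)                                # new cluster
```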
3.6 Threshold value The threshold value τ controls the decision to merge a new tuple into an existing cluster, or to place it in a cluster by itself. Therefore, it affects the size of the DCF tree, as well as the granularity of the resulting representation. A good choice for τ is necessary to produce a concise and useful summarization of the data set. In LIMBO, we adopt a heuristic for setting the value of τ that makes use of the mutual information that the set of tuples T contains about the set of attribute values A. Before running LIMBO, we read the data once from the disk in order to compute the value of I(A; T). There exist n tuples in T, so “on average” every tuple contributes I(A; T)/n to the mutual information I(A; T). We define τ(φ) as

    τ(φ) = φ · I(A; T) / n

where φ, 0 ≤ φ ≤ n, denotes the multiple of the “average” mutual information that we are willing to lose when merging a tuple into a cluster. If the merge incurs information loss of more than φ times the “average” mutual information, then the new tuple is placed in a cluster by itself.
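A sketch of this heuristic (ours), reusing the mutual_information helper from the Section 2.1 sketch; M is the normalized n × d matrix of Section 3.1, so M/n is the joint distribution p(t, a).

```python
import numpy as np

def threshold(M, phi):
    """tau(phi) = phi * I(A; T) / n, computed in one pass over the data."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    joint = M / n                     # p(t, a) = p(t) * p(a|t) with p(t) = 1/n
    return phi * mutual_information(joint) / n
```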
4 Experimental Evaluation In this section, we present a preliminary comparative experimental evaluation of the LIMBO algorithm on both real and synthetic data sets. We also explore the effect of φ on the quality and efficiency of the algorithm. The section concludes with an efficiency evaluation of LIMBO. We now review the algorithms that we consider.
4.1 Algorithms ROCK Algorithm. ROCK [12] assumes a similarity measure between tuples, and defines a link between two tuples whose similarity exceeds a threshold θ. The aggregate interconnectivity between two clusters is defined as the sum of links between their tuples. ROCK proceeds hierarchically, merging the two most interconnected clusters at each step; thus, ROCK is not applicable to large data sets. We use the Jaccard coefficient for the similarity measure, as suggested in the original paper. For data sets that appear in the original ROCK paper we set the threshold θ to the value suggested there; otherwise we set θ to the value that gave us the best results in terms of quality. For our experimentation, we use the implementation of Guha et al. [12]. STIRR Algorithm. STIRR [11] uses a linear dynamical system to iterate over multiple copies of a hypergraph of weighted attribute values, until a fixed point is reached. Each copy of the hypergraph contains two groups of attribute values, one with positive and another with negative weights, which define the two clusters. Handling more than two clusters involves a post-processing step not defined in the original paper. In our experiments, we use our own implementation and report results for ten iterations. We augment tuples with an extra tuple-id attribute to track which tuples are placed in each of the clusters. COOLCAT Algorithm. The approach most similar to ours is the COOLCAT algorithm [6], by Barbará, Couto and Li. The COOLCAT algorithm uses the same quality measure for a clustering as our approach, namely the entropy of the clustering. It differs from our approach in that it relies on sampling, and it is non-hierarchical. COOLCAT starts with a sample of points and identifies a set of k initial tuples such that the minimum pairwise distance among them is maximized. These serve as representatives of the k clusters. It then places each remaining tuple of the data set in one of the clusters such that, at each step, the increase in the entropy of the resulting clustering is minimized. For the experiments, we implemented COOLCAT based on the CIKM paper by Barbará et al. [6]. LIMBO Algorithm. We observed experimentally that the branching factor B does not significantly affect the quality of the clustering. We set B = 4, so that the DCF tree is of manageable size and the effect on the creation of the tree is minimal. We explored a large range of values for φ. Larger values of φ delay leaf-node splits and create a smaller tree with a coarse representation of the data set; smaller values of φ incur more splits but preserve a more detailed summary of the initial data set. In our implementation, we store DCFs as sparse vectors using a sparse vector library (http://www.genesys-e.org/ublas/), resulting in significant savings in space.
4.2 Data Sets We experiment with the following data sets. The first three have been previously used for the evaluation of the categorical clustering techniques mentioned above [6, 11, 12]. The synthetic data sets are also used for our scalability studies. Congressional Votes: This relational data set was taken from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). It contains 435 tuples of votes from the U.S. Congressional Voting Record of 1984. Each tuple is a congress-person's vote on 16 issues, and each vote is boolean, either YES or NO. There is an additional attribute that gives the label for each tuple, Republican or Democrat. There are a total of 168 Republicans and 267 Democrats. There are 288 missing values, which we treat as separate values. Mushroom: The Mushroom relational data set also comes from the UCI Repository. It contains 8,124 tuples, each representing a mushroom characterized by 22 attributes, such as color, shape, odor, etc. The total number of distinct attribute values is 117. It also contains an extra attribute that labels each mushroom as either poisonous or edible. There are 4,208 edible and 3,916 poisonous mushrooms in total, and 2,480 missing values. Synthetic Data Sets: We produce synthetic data sets using a data generator available on the web (http://www.datgen.com/). This generator offers a wide variety of options in terms of the number of tuples, attributes, and attribute domain sizes. We specify the number of classes in the data set by the use of conjunctive rules of the form (Attr1 = a1 ∧ Attr2 = a2 ∧ ...) ⇒ Class = c1. The rules may involve an arbitrary number of attributes and attribute values. We name these synthetic data sets by the prefix DS followed by the number of clusters, e.g., DS5 or DS10. The data sets contain 5,000 tuples and 10 attributes, with domain sizes between 20 and 40 for each attribute. Three attributes participate in the rules the data generator uses to produce the cluster labels. All data sets are summarized in Table 3.
Data Set    Records  Attributes  Attr. Values  Missing
Votes       435      16          48            288
Mushroom    8,124    22          117           2,480
DS5         5,000    10          314           0
DS10        5,000    10          305           0

Table 3: Summary of the data sets used
4.3 Quality measures for Clustering Clustering quality lies in the eye of the beholder; given two clusterings, determining the better one usually depends on subjective criteria. Consequently, we will use several quantitative measures of clustering performance. We use the information loss, IL = I(A; T) − I(A; C), for comparing clusterings. The lower the information loss, the better the clustering. With a clustering of low information loss, given a cluster, we can predict the attribute values of the tuples in the cluster with relatively high accuracy. We also measure the value of the Category Utility, CU [15]. Category utility is defined as the difference between the expected number of attribute values that can be correctly guessed given a clustering, and the expected number of correct guesses with no such knowledge. Let C be a clustering. If Ai is an attribute with values vij, then CU is given by the following expression:

    CU = Σ_{c∈C} (|c|/n) Σ_i Σ_j [ P(Ai = vij | c)² − P(Ai = vij)² ]
Many data sets commonly used in testing clustering include a hidden variable, which specifies the class with which each tuple is associated. While there is no guarantee that any given classification corresponds to an optimal clustering, it is nonetheless enlightening to compare clusterings with pre-specified classifications of tuples. To do this, we use performance measures from Information Retrieval, namely Precision and Recall [5], for the resulting clusters. Assume that the tuples in T are already classified into k classes G = {g1, ..., gk}, and let C denote a clustering of the tuples in T into k clusters {c1, ..., ck} output by a clustering algorithm. Consider a one-to-one mapping, f, from classes to clusters, such that each class gi is mapped to the cluster f(gi). The classification error of the mapping is defined as

    E = Σ_{i=1}^{k} | gi \ f(gi) |

where |gi \ f(gi)| measures the number of tuples in class gi that received the wrong label. We compute the optimal mapping between clusters and classes, that is, the mapping that minimizes the classification error. We use Emin to denote the classification error of the optimal mapping. Without loss of generality, assume that the optimal mapping assigns class gi to cluster ci. We define precision, P, and recall, R, for a cluster ci, 1 ≤ i ≤ k, as follows:

    Pi = |ci ∩ gi| / |ci|    and    Ri = |ci ∩ gi| / |gi|
Pi and Ri take values between 0 and 1; intuitively, Pi measures the accuracy with which cluster ci reproduces class gi, while Ri measures the completeness with which ci reproduces class gi. We define the precision and recall of the clustering as the weighted averages of the precision and recall of each cluster. More precisely,

    P = Σ_{i=1}^{k} (|gi| / |T|) Pi    and    R = Σ_{i=1}^{k} (|gi| / |T|) Ri
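The following sketch (ours; it assumes integer class and cluster labels in {0, ..., k−1} and searches all k! mappings, which is only practical for small k) computes Emin and the weighted precision and recall under the optimal mapping.

```python
from itertools import permutations

def evaluate(classes, clusters, k):
    """Return (Emin, P, R) for integer class/cluster labels in {0, ..., k-1}."""
    n = len(classes)
    # counts[g][c] = number of tuples of class g placed in cluster c
    counts = [[0] * k for _ in range(k)]
    for g, c in zip(classes, clusters):
        counts[g][c] += 1
    class_size = [sum(row) for row in counts]                          # |g_i|
    cluster_size = [sum(row[c] for row in counts) for c in range(k)]   # |c_j|
    # Optimal one-to-one mapping f: class i -> cluster f[i], by brute force.
    best_err, best_f = None, None
    for f in permutations(range(k)):
        err = sum(class_size[i] - counts[i][f[i]] for i in range(k))
        if best_err is None or err < best_err:
            best_err, best_f = err, f
    # Weighted precision and recall under the optimal mapping.
    P = sum((class_size[i] / n) * (counts[i][best_f[i]] / max(1, cluster_size[best_f[i]]))
            for i in range(k))
    R = sum((class_size[i] / n) * (counts[i][best_f[i]] / max(1, class_size[i]))
            for i in range(k))
    return best_err, P, R
```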
We think of precision, recall and classification error as indicative values of the ability of the algorithm to reconstruct the existing classes in the data set. In our experiments, we report the values of information loss IL, category utility CU, precision P, recall R, and classification error Emin. For LIMBO and COOLCAT, all numbers are averages over 100 runs with different (random) orderings of the tuples.
4.4 Quality-Time tradeoffs for LIMBO As previously discussed, the behavior of the LIMBO algorithm depends heavily upon the parameter φ. The value of φ offers a tradeoff between the compactness (number of leaf entries) of the DCF tree and the expressiveness (information preserved) of the summarization it produces. For small values of φ, we obtain a fine-grained representation of the data set at the end of Phase 1; however, this results in a tree with a large number of leaf entries, which leads to a higher computational cost for Phase 2 of the algorithm. For large values of φ, we obtain a compact representation of the data set (a small number of leaf entries), which results in faster execution time, at the expense of increased information loss. We experimented with a range of values for φ on all data sets. In the extreme case of φ = 0.0, we only merge identical tuples and no information is lost; the LIMBO algorithm then reduces to the AIB algorithm, and we obtain the best clustering quality. For all data sets, for φ ≤ 1.0, we observe essentially no change (up to the third decimal digit) in the quality of the clustering. For φ = 1.0, the number of leaf entries is much smaller, leading to very low execution times. Table 4 shows the reduction in the number of leaf entries for each data set when we set φ = 1.0.

            Votes    Mushroom  DS5     DS10
Reduction   94.71%   99.77%    98.71%  98.82%

Table 4: Leaf entries' reduction for φ = 1.0

Figures 2 and 3 show the reduction in leaf entries and the change in quality measures, respectively, for the DS5 data set. In Figure 3, P and R are very close for all φ except φ = 1.5. Similar figures can be observed for all other data sets [3]. These figures demonstrate that we can obtain significant compression of the data sets at no expense in the final clustering quality. The consistency of LIMBO can be attributed in part to the effect of Phase 3, which assigns the tuples to cluster representatives and hides some of the information loss incurred in the previous phases. Thus it is sufficient for Phase 2 to discover k well-separated representatives in order to produce a good clustering.
As a result, even for large values of φ, LIMBO may obtain exactly the same clustering as for φ = 0.0.

[Figure 2: DS5 Leaf Entries. Number of leaf entries (×1000) as a function of φ, annotated with the corresponding CU values.]

[Figure 3: DS5 Quality. IL, P, R and Emin as a function of φ.]
4.5 Comparative Evaluations Our goal in this section is to demonstrate the quality of the clusterings that LIMBO produces. Table 5 shows the results for all algorithms on all quality measures for the Votes and Mushroom data sets. For LIMBO we present results for φ = 1.0, which are essentially the same as for φ = 0.0. The size entry in the table holds the number of leaves for LIMBO, and the sample size for COOLCAT. For the Votes data set we use the whole data set as the sample, while for Mushroom we use 1,000 tuples.

Votes (2 clusters)
Algorithm          size   IL(%)  P     R     Emin  CU
LIMBO (φ = 1.0)    23     72.26  0.89  0.87  0.13  2.91
COOLCAT            435    73.55  0.87  0.85  0.15  2.78
ROCK (θ = 0.7)     -      74.00  0.87  0.86  0.16  2.63
STIRR              -      97.00  0.66  0.64  0.36  0.30

Mushroom (2 clusters)
Algorithm          size   IL(%)  P     R     Emin  CU
LIMBO (φ = 1.0)    18     81.45  0.91  0.89  0.11  1.71
COOLCAT            1,000  84.57  0.76  0.73  0.27  1.46
ROCK (θ = 0.8)     -      86.00  0.77  0.57  0.43  0.59
STIRR              -      83.00  0.66  0.64  0.36  1.01

Table 5: Results for real data sets

As Table 5 indicates, LIMBO's quality is superior to that of ROCK, STIRR and COOLCAT in both data sets. In terms of IL, LIMBO created clusters which retained most of the initial information about the attribute values. With respect to the other measures, LIMBO outperforms all other algorithms, exhibiting the highest CU, P and R in all data sets tested, as well as the lowest Emin.
We also evaluate LIMBO's performance on two synthetic data sets, namely DS5 and DS10. These data sets allow us to evaluate our algorithm on data with more than two classes. We do not present STIRR's results here, since it is designed to produce just two clusters. The results are shown in Table 6. We observe again that LIMBO has the lowest information loss and produces nearly optimal results with respect to precision and recall. Our experiments show that ROCK is very sensitive to the threshold value θ and, in many cases, the clustering produces one giant cluster that includes tuples from most classes. This results in poor precision and recall. The STIRR algorithm produces clusterings of poor quality on both data sets, a fact that might be due to the choice of uniform initial weights for the attribute values; however, defining better initial weights would require prior knowledge about the data.

DS5 (n = 5,000, 10 attributes, 5 clusters)
Algorithm          size  IL(%)  P      R      Emin   CU
LIMBO (φ = 1.0)    66    76.66  0.998  0.998  0.002  2.56
COOLCAT            125   77.02  0.995  0.995  0.05   2.54
ROCK (θ = 0.0)     -     85.00  0.839  0.724  0.28   0.44

DS10 (n = 5,000, 10 attributes, 10 clusters)
Algorithm          size  IL(%)  P      R      Emin   CU
LIMBO (φ = 1.0)    59    72.70  0.997  0.997  0.003  2.83
COOLCAT            125   73.32  0.979  0.973  0.026  2.74
ROCK (θ = 0.0)     -     78.00  0.830  0.818  0.182  2.13

Table 6: Results for synthetic data sets
4.6 Efficiency Evaluation In this section, we study the scalability of the LIMBO algorithm, and we investigate how its parameters affect the execution time. First, we use the DS10 data set to measure the execution time of each phase separately, for φ ≥ 0.25. Results are shown in Figure 4, where for each value of φ we also include the size of the DCF tree in memory. In this figure, we observe that for small φ the computational bottleneck of the algorithm is Phase 2. As φ increases, the time for Phase 2 decreases in a quadratic fashion. The time for Phase 1 also decreases, but at a much slower rate. Phase 3, as expected, remains unaffected, taking 1.43 seconds for all values of φ. For φ ≥ 0.75 the number of leaf entries becomes sufficiently small that the computational bottleneck of the algorithm becomes Phase 1. This is the range of φ in which we are interested.

[Figure 4: Execution times for DS10. Time per phase (Phases 1, 2 and 3) as a function of φ, annotated with the in-memory DCF tree size.]

We now study Phase 1, where the DCF tree is built, in more detail. We experiment with large data sets of 10K, 500K and 1M tuples, produced with the synthetic data generator, having 10 attributes with 20 to 40 values each. As the graph in Figure 5 demonstrates, the execution time of Phase 1 remains constant for values of φ up to 0.8. For 0.8 < φ < 1.4, the execution time drops significantly. This decrease is due to the reduced number of splits and the decrease in the DCF tree size. Figure 6 plots the number of leaves as a function of φ (the y-axis of Figure 6 has a logarithmic scale); in the same plot we show the size of the tree. We observe that for the range of values of φ in which we are interested (0.8 < φ < 1.4), LIMBO produces a manageable DCF tree with a small number of leaves, which means fast execution time for the AIB algorithm in Phase 2. Furthermore, we observed that the vectors that we maintain remain relatively sparse: the average density of the DCF tree vectors, i.e., the average fraction of non-zero entries, remains between 41% and 87%.

[Figure 5: Execution times for Phase 1 as a function of φ, for data sets of 10,000, 500,000 and 1,000,000 tuples.]

[Figure 6: Leaf entries for Phase 1 as a function of φ (logarithmic y-axis), for the same data sets, annotated with the DCF tree size.]
[Figure 7: Execution times for large data sets. Left: total time versus number of tuples for φ = 1.0 and φ = 1.1. Right: total time versus number of tuples for approximately 500 and 1,500 leaf entries.]

From the above observations and the discussion in Section 4.4, it appears that, for every data set, there exists a range of φ values for which the decrease in clustering quality is negligible, and at the same time, the size of the DCF tree is small enough that Phase 1 is sped up significantly and the input for Phase 2 can be handled efficiently. For example, for DS10, this range is the interval [0.75, 1.0]. In our final experiment, we study the execution time for all three phases of LIMBO as a function of the size of the data set. We consider three data sets of sizes 100K, 500K, and 1M tuples, each containing 15 clusters. The first two data sets are samples of the 1M data set. The left graph in Figure 7 shows the execution time for LIMBO on these three data sets, for φ = 1.0 and φ = 1.1. In this figure, we observe that execution time scales in a near-linear fashion with respect to the size of the data set. Furthermore, Table 7 shows that the clustering quality of LIMBO for this range of φ values remains unaffected, and it is almost the same across data sets. We ran LIMBO again on the three data sets, picking values for φ within this range, such that the number of leaves at the end of Phase 1 is approximately the same for all three data sets. The right graph in Figure 7 shows that in this case the execution time scales linearly.

Data Set  IL(%)  P     R     Emin  CU
100K      76     0.99  0.99  0.01  2.61
500K      75     0.99  0.99  0.01  2.62
1M        75     0.99  0.99  0.01  2.62

Table 7: Quality for large data sets for φ ∈ [1.0, 1.1]
5 Related Work CACTUS [9], by Ganti, Gehrke and Ramakrishnan, uses summaries of information constructed from the data set that are sufficient for discovering clusters. The algorithm defines attribute value clusters with overlapping cluster-projections on any attribute. This makes the assignment of tuples to clusters unclear. Our approach is based on the Information Bottleneck (IB) method, introduced by Tishby, Pereira and Bialek [19]. The Information Bottleneck method has been used in an agglomerative hierarchical clustering algorithm [17] and applied to the clustering of documents [18]. Recently, Slonim, Friedman and Tishby [16] introduced the sequential Information Bottleneck (sIB) algorithm, which reduces the running time relative to the agglomerative approach. However, it depends on an initial random partition and requires multiple passes over the data for different initial partitions. In the future, we plan to experiment with sIB in Phase 2 of LIMBO. Finally, another algorithm that uses an extension to BIRCH [20] is given by Chiu, Fang, Chen, Wang and Jeris [7]. Their approach assumes that the data follows a multivariate normal distribution. The performance of the algorithm has not been tested on categorical data sets.
6 Conclusions and Future Work In this paper, we have described and evaluated the LIMBO algorithm for clustering categorical data sets. LIMBO's hierarchical nature, together with its merging threshold φ, makes it capable of producing a range of clusterings of different sizes even on very large data sets. Our experiments show that LIMBO's execution time scales in an almost linear fashion with the size of the data set, and that this performance is achieved at no expense in clustering quality. Compared to several other categorical clustering algorithms, LIMBO shows superior cluster quality both with respect to direct measures of clustering quality and with respect to consistency with meaningful pre-specified classifications. Our ongoing and future work includes the following tasks.
• We plan to extend LIMBO so that it works with both numerical and categorical data.
• We plan to apply it as a data mining technique for schema discovery and management [2]. We are currently examining the benefits of using categorical clustering to discover groupings of tuples that share similar structural characteristics.
• We have studied the performance of LIMBO in the domain of software clustering [4]. In this domain, we are interested in decomposing large legacy software systems into groups of cohesive modules, for the purposes of understanding and effectively maintaining these systems.
References
[1] P. Andritsos, R. Fagin, A. Fuxman, L. M. Haas, M. Hernandez, C.-T. Ho, A. Kementsietsidis, R. J. Miller, F. Naumann, L. Popa, Y. Velegrakis, C. Vilarem, and L.-L. Yan. Schema Management. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 25(3):32–38, Sept. 2002.
[2] P. Andritsos and R. J. Miller. Using Categorical Clustering in Schema Discovery. In IJCAI Workshop on Information Integration on the Web, page 211, Acapulco, Mexico, 9–10 Aug. 2003.
[3] P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik. LIMBO: A Linear Algorithm to Cluster Categorical Data. Technical Report CSRG-467, University of Toronto, Department of Computer Science, 2003.
[4] P. Andritsos and V. Tzerpos. Software Clustering based on Information Loss Minimization. Technical report, submitted for publication, 2003.
[5] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley-Longman, 1999.
[6] D. Barbará, J. Couto, and Y. Li. COOLCAT: An Entropy-Based Algorithm for Categorical Clustering. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM), McLean, VA, USA, 4–9 Nov. 2002. ACM Press.
[7] T. Chiu, D. Fang, J. Chen, Y. Wang, and C. Jeris. A Robust and Scalable Clustering Algorithm for Mixed Type Attributes in Large Database Environment. In Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining (KDD), pages 263–268, San Francisco, CA, USA, 26–29 Aug. 2001. AAAI Press.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, NY, USA, 1991.
[9] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering Categorical Data Using Summaries. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD), pages 73–83, San Diego, CA, USA, 15–18 Aug. 1999. ACM Press.
[10] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
[11] D. Gibson, J. M. Kleinberg, and P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), pages 311–322, New York, NY, USA, 24–27 Aug. 1998. Morgan Kaufmann.
[12] S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In Proceedings of the 15th International Conference on Data Engineering (ICDE), pages 512–521, Sydney, Australia, 23–26 Mar. 1999. IEEE Press.
[13] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, 1988.
[14] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[15] M. A. Gluck and J. E. Corter. Information, Uncertainty, and the Utility of Categories. In Proceedings of the 7th Annual Conference of the Cognitive Science Society, pages 283–287, Irvine, CA, USA, 1985.
[16] N. Slonim, N. Friedman, and N. Tishby. Unsupervised Document Classification using Sequential Information Maximization. In Proceedings of the 25th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), pages 129–136, Tampere, Finland, 11–15 Aug. 2002. ACM Press.
[17] N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In Neural Information Processing Systems: Natural and Synthetic 12 (NIPS-12), pages 617–623, Breckenridge, CO, USA, Dec. 1999. MIT Press.
[18] N. Slonim and N. Tishby. Document Clustering Using Word Clusters via the Information Bottleneck Method. In Proceedings of the 23rd Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), pages 208–215, Athens, Greece, 24–28 July 2000. ACM Press.
[19] N. Tishby, F. C. Pereira, and W. Bialek. The Information Bottleneck Method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–387, Urbana-Champaign, IL, USA, 22–24 Sept. 1999.
[20] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 103–114, Montreal, QC, Canada, 4–6 June 1996. ACM Press.