Building a Concept Hierarchy from a Distance Matrix

Huang-Cheng Kuo1 and Jen-Peng Huang2

1 Department of Computer Science and Information Engineering, National Chiayi University, Taiwan 600
[email protected]
2 Department of Information Management, Southern Taiwan University of Technology, Taiwan 710
[email protected]
Abstract. Concept hierarchies are important in many generalized data mining applications, such as multiple-level association rule mining. In the literature, a concept hierarchy is usually given by domain experts. In this paper, we propose algorithms that automatically build a concept hierarchy from a provided distance matrix. Our approach modifies the traditional hierarchical clustering algorithms. For evaluation, a distance matrix is derived from the concept hierarchy built by our algorithm, and the root mean squared error between the provided distance matrix and the derived distance matrix is used as the evaluation criterion. We compare the traditional hierarchical clustering and our modified algorithm under three strategies of computing cluster distance, namely single link, average link, and complete link. Empirical results show that the traditional algorithm under the complete link strategy performs better than under the other strategies. Our modified algorithms perform almost the same under the three strategies, and they perform better than the traditional algorithms under various situations.
1 Introduction
Generalization on nominal data is frequently studied, for example in mining multiple-level association rules by means of a concept hierarchy [8,5,3]. In a concept hierarchy of categories, the similarity between two categories is reflected by the length of the path connecting them. The similarity between two concepts does not necessarily stay unchanged over time. Consider the scenario that lawyers and doctors share common habits in a certain period of time; these common habits may change in the next period. So, with respect to habits, the similarity between lawyer and doctor changes. Concept hierarchies used in generalized data mining applications are usually given by domain experts. However, it is difficult to maintain a concept hierarchy when the number of categories is huge or when the characteristics of the data change frequently. Therefore, there is a need for an algorithm that automatically builds a concept hierarchy over a set of nominal values. In this paper, our approach is to modify traditional hierarchical clustering algorithms for this purpose. The input to our method is a distance matrix, which
can be computed by CACTUS [2] for tabular data, or by the Jaccard coefficient for transactional data. Our contribution in modifying the traditional agglomerative hierarchical clustering algorithm, called the traditional clustering algorithm in the rest of this paper, is twofold. (1) The traditional clustering algorithm builds a binary tree, whereas in a concept hierarchy it is very likely that more than two concepts share a common general concept. In order to capture this characteristic, our modified agglomerative hierarchical clustering algorithm allows more than two clusters to merge into a cluster. (2) The leaves of the binary tree generated by the traditional clustering algorithm are not necessarily at the same level, which can cause an inversion problem. Consider the merging of a deep subtree and a single-node subtree: the longest path between two leaves of the deep subtree is longer than the path from a leaf of the deep subtree to the leaf of the single-node subtree. We solve this inversion problem by keeping all the leaves at the same level. In addition to the modified algorithm, we devise a novel quality measure for the built concept hierarchy by deriving a distance matrix from it; the root mean squared error between the input distance matrix and the derived distance matrix can then be computed.

The paper is organized as follows. In Section 2, we illustrate the need for a concept hierarchy. In Section 3, the measurement for the algorithms is presented and the way to obtain the input distance matrix is discussed. Section 4 discusses the algorithms that build a concept hierarchy from a given distance matrix among the categories. The experiments and results are described in Section 5. The conclusion is in Section 6.
2 The Need for a Concept Hierarchy
Concept hierarchies, represented by taxonomies or sets of mapping rules, can be provided by domain experts. The following are examples of mapping rules [4] for the "Status" and "Income" attributes:

Status: {freshman, sophomore, junior, senior} → undergraduate
Status: {graduate, undergraduate} → student
Income: {1,000 ~ 25,000} → low income
Income: {25,001 ~ 50,000} → mid income

Multiple-level association rule mining uses "support" for obtaining frequent itemsets [5,8]. By moving up a level in the concept hierarchy, the support of an itemset increases, possibly making it frequent. Other applications, such as data warehouses, require dimension tables for drill-down and roll-up operations [3]. Automatically constructing a concept hierarchy for a huge number of categories would relieve the burden on domain experts.
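To make the effect of generalization concrete, the following is a minimal sketch (in Python, with hypothetical records and rule tables) of how mapping rules lift raw attribute values to more general concepts and thereby raise itemset support:

```python
# Hypothetical mapping rules and records, for illustration only.
status_rules = {
    "freshman": "undergraduate", "sophomore": "undergraduate",
    "junior": "undergraduate", "senior": "undergraduate",
    "undergraduate": "student", "graduate": "student",
}

records = ["freshman", "senior", "graduate", "junior", "graduate"]

# Support of a raw value, e.g. "freshman": 1/5 = 0.2.
support_freshman = records.count("freshman") / len(records)

# Generalize each record one level up the concept hierarchy.
level1 = [status_rules[r] for r in records]
support_undergraduate = level1.count("undergraduate") / len(records)  # 3/5 = 0.6

# One more level: everything maps to "student", so support reaches 1.0.
level2 = [status_rules.get(v, v) for v in level1]
support_student = level2.count("student") / len(records)              # 5/5 = 1.0
```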
3 Measurements for Concept Hierarchy
There are metrics for the quality of clustering, for example, how well the intra-cluster distance is minimized, how well the inter-cluster distance is maximized, or how high the accuracy is when the algorithm is run on a pre-classified test dataset. However, there are no existing metrics for measuring the quality of a concept hierarchy built automatically by an algorithm. The above-mentioned clustering metrics are not applicable to concept hierarchies, since the number of clusters is not a concern for a concept hierarchy building algorithm.

The input to the algorithms is a distance matrix, denoted the provided distance matrix. The output of an algorithm is a concept hierarchy. Since a correct concept hierarchy is usually not available, we propose an indirect measurement. In order to compare with the provided distance matrix, we convert the output concept hierarchy into a distance matrix, denoted the derived distance matrix. An element of the derived distance matrix is the length of the path from one category to another. Min-max normalization is applied to the derived distance matrix so that it has the same scale as the provided distance matrix. The root mean squared error between the two distance matrices is then taken as the quality of the output concept hierarchy.

Definition: Concept Hierarchy Quality Index. Given a distance matrix $M_{provided}$ over a set of categories $\{c_1, c_2, \ldots, c_n\}$, the quality index of a concept hierarchy with respect to $M_{provided}$ is the root mean squared error between $M_{provided}$ and $M_{derived}$, where $M_{derived}(c_i, c_j)$ is the normalized length of the path from $c_i$ to $c_j$ in the concept hierarchy. The root mean squared error is defined as

$$\sqrt{\frac{1}{N}\sum_{\forall c_i, c_j}\bigl(M_{provided}(c_i, c_j) - M_{derived}(c_i, c_j)\bigr)^{2}}$$

where $N$ is the number of category pairs.
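A minimal sketch of computing this quality index, assuming the built concept hierarchy is available as a child-to-parent mapping over nodes (the [0, 1] target range of the min-max normalization is an assumption; the paper only states that the two matrices are brought to the same scale):

```python
import itertools
import math

def ancestors(node, parent):
    """Return node and all of its ancestors, ordered from node to root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def path_length(a, b, parent):
    """Number of edges on the path between categories a and b in the hierarchy."""
    chain_a, chain_b = ancestors(a, parent), ancestors(b, parent)
    common = set(chain_a) & set(chain_b)
    up_a = next(i for i, n in enumerate(chain_a) if n in common)
    up_b = next(i for i, n in enumerate(chain_b) if n in common)
    return up_a + up_b

def quality_index(provided, parent, categories):
    """Root mean squared error between the provided distance matrix and the
    min-max normalized path-length (derived) distance matrix.
    provided: dict keyed by the same (ci, cj) pairs that combinations yields."""
    pairs = list(itertools.combinations(categories, 2))
    raw = {p: path_length(p[0], p[1], parent) for p in pairs}
    lo, hi = min(raw.values()), max(raw.values())
    derived = {p: (v - lo) / (hi - lo) if hi > lo else 0.0 for p, v in raw.items()}
    return math.sqrt(sum((provided[p] - derived[p]) ** 2 for p in pairs) / len(pairs))
```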
There are methods for obtaining a distance matrix from different types of data. For data in relational tables, we adopt the similarity definition from CACTUS [2] with simplification; for transactional data, the Jaccard coefficient can be used. After obtaining the similarities between pairs of categories, we normalize the similarities into distances in the range of 0 and 1. Since the distance between two different categories should be greater than zero, the normalization reserves a nonzero expected distance between a category and its most similar category [7].
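For transactional data, a sketch of one plausible way (our assumption, not the exact normalization of [7]) to turn Jaccard similarities into the provided distance matrix:

```python
def jaccard_distance_matrix(transactions, categories):
    """Provided distance matrix from transactional data:
    distance(a, b) = 1 - |T(a) ∩ T(b)| / |T(a) ∪ T(b)|,
    where T(c) is the set of transactions containing category c."""
    occurs = {c: {i for i, t in enumerate(transactions) if c in t}
              for c in categories}
    dist = {}
    for a in categories:
        for b in categories:
            if a == b:
                dist[(a, b)] = 0.0
                continue
            union = occurs[a] | occurs[b]
            inter = occurs[a] & occurs[b]
            sim = len(inter) / len(union) if union else 0.0
            # a further adjustment, as in [7], can keep distinct categories strictly apart
            dist[(a, b)] = 1.0 - sim
    return dist
```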
4 Algorithms
It is intuitive to use traditional agglomerative hierarchical clustering for building a concept hierarchy of categorical objects. We first describe the traditional
algorithm and point out two of its drawbacks. Then, we propose a modified version of the hierarchical clustering algorithm.

4.1 Traditional agglomerative hierarchical clustering
Hierarchical clustering treats each object as a singleton cluster, and then successively merges clusters until all objects have been merged into a single remaining cluster. The dendrogram built in this way is a binary tree. Leaf nodes in such a tree are likely at different levels of the tree. In this paper, we study three strategies for computing the distance between a pair of clusters, namely, single link [9], average link, and complete link [6]. Agglomerative hierarchical clustering merges the pair of clusters with the smallest distance into a cluster. The three strategies define the distance between a pair of clusters. The distances $dist_{single}$, $dist_{average}$, and $dist_{complete}$ between clusters $C_1$ and $C_2$ are defined below.

$$dist_{single}(C_1, C_2) = \min_{x \in C_1, y \in C_2} dist(x, y)$$

$$dist_{average}(C_1, C_2) = \operatorname*{avg}_{x \in C_1, y \in C_2} dist(x, y)$$

$$dist_{complete}(C_1, C_2) = \max_{x \in C_1, y \in C_2} dist(x, y)$$
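A minimal sketch of the three strategies, with clusters represented as sets of categories and dist as a pairwise distance lookup:

```python
def dist_single(c1, c2, dist):
    """Single link: smallest pairwise distance between the two clusters."""
    return min(dist[(x, y)] for x in c1 for y in c2)

def dist_average(c1, c2, dist):
    """Average link: mean of all pairwise distances between the two clusters."""
    return sum(dist[(x, y)] for x in c1 for y in c2) / (len(c1) * len(c2))

def dist_complete(c1, c2, dist):
    """Complete link: largest pairwise distance between the two clusters."""
    return max(dist[(x, y)] for x in c1 for y in c2)
```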
With regard to building a concept hierarchy tree, there are two major drawbacks of traditional hierarchical clustering algorithms. First, in a concept hierarchy the degree of a node can be larger than 2. For example, in Figure 1 there are more than two kinds of juice, and they are all specific concepts of juice. However, the traditional hierarchical clustering algorithm builds a binary tree concept hierarchy.

[Fig. 1. A Concept Hierarchy: "drink" generalizes "beverage" and "alcohol"; "juice", "coke", "whisky", and "beer" are their descendants; "grape juice", "apple juice", "orange juice", and "lemonade" are all children of "juice".]
A possible way of perceiving the similarity between two categories in a concept hierarchy is the length of the path connecting them. The second drawback, therefore, is that the distance relationships among the categories might not be preserved by the traditional algorithm.
[Fig. 2. Concept Hierarchy Built by Traditional Algorithm: the same categories as in Fig. 1, but the binary tree introduces intermediate nodes such as "grape-or-apple-juice" and "orange-juice-or-lemonade" below "juice".]
In Figure 2, the path from grape juice to orange juice is longer than the path from grape juice to coke. This contradicts the intention specified by the user in Figure 1. In order to remove or mitigate these drawbacks, we propose modified hierarchical clustering algorithms with two important features: (1) the leaves of the tree are all at the same level, and (2) the degree of an internal node can be larger than 2, i.e., a node may join another node at the upper level.

4.2 Multiple-way agglomerative hierarchical clustering
We propose a new hierarchical clustering algorithm that improves on the traditional one. Initially, every item is a singleton cluster and a leaf node. In order to guarantee that all leaves are at the same level, the algorithm merges or joins clusters level by level; in other words, clusters of the upper level are not merged until every cluster of the current level has a parent cluster. Two clusters of the same level can be merged, and a new cluster is created as the parent of the two; the newly created cluster is placed at the upper level of the tree. We also propose a new operation by which a cluster can join a cluster at the upper level, such that the upper-level cluster becomes the parent of the current-level cluster. The process continues until the root of the tree is created.

In the following discussion, a node is a cluster that contains one or more categorical objects. First, we discuss the "join" operator for hierarchical clustering. Consider the four clusters A, B, C, and D in Figure 3. Assume that dist(A, B) is the smallest among all pairs of clusters, and A and B are merged
into cluster E. Assume that either dist(A, C) or dist(B, C) is less than dist(C, D); in other words, cluster C is better merged with A or B than with D. In the traditional hierarchical clustering algorithm, C is merged with either D or E. Merging with D is not good for C. Merging with E may be good, but (1) if dist(A, B), dist(A, C), and dist(B, C) are about the same, the clustering result makes C quite different from A and B, and (2) the leaf nodes in the subtree rooted at C will not be at the same level as the rest of the tree. With the join operator, C can instead attach directly to E as a third child.
Fig. 3. Hierarchical Clustering with Join Operator
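The paper does not give pseudocode for the modified algorithm, so the following is only a rough sketch of one level-by-level pass with merge and join operations, reusing a linkage function such as dist_single above. The exact join criterion used here, joining when an already-created upper-level parent is closer than the best same-level merge, is our assumption:

```python
def build_hierarchy(items, dist, cluster_dist):
    """Multiple-way agglomerative clustering sketch.
    items: categorical objects; dist: pairwise distance lookup;
    cluster_dist: one of dist_single / dist_average / dist_complete.
    Returns a child-cluster -> parent-cluster mapping whose leaves all
    sit at the same level of the resulting concept hierarchy."""
    level = [frozenset([x]) for x in items]
    parent = {}
    while len(level) > 1:
        upper = []          # parent clusters created for the next level
        members = {}        # parent cluster -> its child clusters
        remaining = list(level)
        while remaining:
            if len(remaining) == 1 and upper:
                # last cluster of this level joins its closest existing parent
                c = remaining.pop()
                best = min(upper, key=lambda u: cluster_dist(c, u, dist))
                members[best].append(c)
                continue
            # closest pair among the clusters that still lack a parent
            a, b = min(
                ((x, y) for i, x in enumerate(remaining) for y in remaining[i + 1:]),
                key=lambda p: cluster_dist(p[0], p[1], dist))
            d_merge = cluster_dist(a, b, dist)
            # join instead of merge if an already-created parent is closer
            joined = False
            for c in (a, b):
                for u in upper:
                    if cluster_dist(c, u, dist) < d_merge:
                        members[u].append(c)
                        remaining.remove(c)
                        joined = True
                        break
                if joined:
                    break
            if not joined:      # ordinary merge: create a new parent of a and b
                new_parent = a | b
                upper.append(new_parent)
                members[new_parent] = [a, b]
                remaining.remove(a)
                remaining.remove(b)
        # finalize this level: parents absorb the objects of joined children
        next_level = []
        for u in upper:
            full = frozenset().union(*members[u])
            for child in members[u]:
                parent[child] = full
            next_level.append(full)
        level = next_level
    return parent
```

Tie-breaking and the precise join threshold are deliberately simple in this sketch; the important properties are that an upper-level node may acquire more than two children and that every leaf stays at the bottom level.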
5 Experiment Results
In this paper, we evaluate the algorithms with generated data. A provided distance matrix over n objects is generated with the assistance of a tree, which is built bottom-up. The data generation is described in the following steps.

1. Let each object be a leaf node.
2. A number of nodes of the same level are grouped together, and an internal node is created as their parent. This process continues until only one internal node is created at a level, i.e., until the root is created. Every internal node of the tree has at least two children; the degree of an internal node is uniformly distributed in the interval between two and span, a given parameter.
3. The distance between any pair of leaf nodes is proportional to the length of the path between them in the tree. The distances are divided by the length of the longest path, i.e., normalized so that the largest distance is one.
4. Noise is applied to the distance matrix: uniformly distributed numbers between 1 − noise and 1 + noise are multiplied with the distance values. In the experiments, we generate distance matrices with noise = 0.1 and noise = 0.2. The distance values, after the noise is applied, are truncated to the interval of zero and one.

The tree generated in step 2 can be regarded as a perfect concept hierarchy.
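A sketch of this generation procedure (in Python; absorbing a leftover single node into the last group is our simplification):

```python
import random

def generate_provided_matrix(n_items, span, noise):
    """Generate a provided distance matrix from a random bottom-up tree,
    normalize by the longest leaf-to-leaf path, then apply multiplicative
    noise and truncate to [0, 1]."""
    parent = {}
    level = [("leaf", i) for i in range(n_items)]
    next_internal = 0
    while len(level) > 1:
        upper = []
        i = 0
        while i < len(level):
            degree = random.randint(2, span)
            if len(level) - (i + degree) == 1:   # do not leave a lone child
                degree += 1
            node = ("internal", next_internal)
            next_internal += 1
            for child in level[i:i + degree]:
                parent[child] = node
            upper.append(node)
            i += degree
        level = upper

    def chain(x):                        # node and its ancestors up to the root
        out = [x]
        while x in parent:
            x = parent[x]
            out.append(x)
        return out

    def path_len(a, b):                  # edges on the leaf-to-leaf path
        ca, cb = chain(a), chain(b)
        common = set(ca) & set(cb)
        return (next(i for i, n in enumerate(ca) if n in common)
                + next(i for i, n in enumerate(cb) if n in common))

    leaves = [("leaf", i) for i in range(n_items)]
    raw = {(a, b): path_len(a, b) for a in leaves for b in leaves if a != b}
    longest = max(raw.values())
    matrix = {}
    for pair, d in raw.items():
        noisy = (d / longest) * random.uniform(1 - noise, 1 + noise)
        matrix[pair] = min(max(noisy, 0.0), 1.0)
    return matrix
```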
Fig. 4. Experiment Results for noise = 0.1
Since there is noise in the provided distance matrix, the quality index of the perfect concept hierarchy with respect to the provided distance matrix is not zero. In the experiments, we illustrate the performance of the algorithms under three parameters, namely noise, span, and the number of items. For each parameter combination, the root mean squared error values of 30 tests are averaged. For generating provided distance matrices, we build trees with four intervals of spans: [2, 4], [2, 6], [2, 8], and [2, 10]. Figure 4 depicts the quality indices of the algorithms where the noise of the input datasets is 0.1. The lines NAL, NCL, and NSL represent the performance of our new modified algorithm under the average link, complete link, and single link strategies, respectively. The lines TAL, TCL, and TSL represent the performance of the traditional algorithm under the three strategies. The line Perf represents the quality index of the perfect concept hierarchy. The results show that our proposed methods perform much better than the traditional agglomerative hierarchical clustering algorithm for all the input distance matrices. However, the strategy of cluster-to-cluster distance
does not affect the results of our algorithms, whereas for the traditional algorithm the single link strategy performs better than the other two strategies. The reason might be that we generate the input distance matrices from trees. All the algorithms perform worse for wider spans. Comparing the performance on data with different noise levels, all the algorithms perform worse for noisier data.
Fig. 5. Experiment Results for noise = 0.2
Figure 5 depicts the quality indices of the algorithms where the noise of the input datasets is 0.2. Compared to the results for input data with noise 0.1, the root mean squared error of our new algorithms increases from 0.07 to 0.12 where span = [2, 4]. Similar comparisons can be observed for the other spans.
6 Conclusions and Future Works
A concept hierarchy is a useful mechanism for representing the generalization relationships among concepts, so that tasks such as multiple-level association rule mining can be conducted. In this paper, we build a concept hierarchy from a distance matrix with the goal of preserving the distance between any pair of concepts as much as possible. We adopt the traditional agglomerative hierarchical clustering with two major modifications: (1) a cluster not only merges with another cluster but may also join another cluster, and (2) the leaf nodes are all at the same level of the concept hierarchy. Empirical results show that our modified algorithm performs much better than the original algorithm.

Some areas of this study warrant further research. (1) A frequently mentioned drawback of hierarchical clustering algorithms is that they do not roll back merge or division decisions. If re-assignment of an object from one cluster to another is allowed at certain stages, the clustering result may improve. (2) All the edge lengths, i.e., weights on the edges of the concept hierarchy, are the same. If the weights on the edges can be trained, the distance relationships between concepts may be better preserved.
References

1. U. M. Fayyad, K. B. Irani, "Multi-interval Discretization of Continuous-Valued Attributes for Classification Learning," Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, 1993, pp. 1002-1027.
2. V. Ganti, J. Gehrke, and R. Ramakrishnan, "CACTUS - Clustering Categorical Data Using Summaries," ACM KDD, 1999, pp. 73-83.
3. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
4. J. Han and Y. Fu, "Dynamic Generation and Refinement of Concept Hierarchies for Knowledge Discovery in Databases," Workshop on Knowledge Discovery in Databases, 1994, pp. 157-168.
5. J. Han and Y. Fu, "Discovery of Multiple-Level Association Rules from Large Databases," VLDB Conference, 1995, pp. 420-431.
6. A. K. Jain, R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Inc., 1988.
7. Huang-Cheng Kuo, Yi-Sen Lin, Jen-Peng Huang, "Distance Preserving Mapping from Categories to Numbers for Indexing," International Conference on Knowledge-Based Intelligent Information Engineering Systems, Lecture Notes in Artificial Intelligence, Vol. 3214, 2004, pp. 1245-1251.
8. R. Srikant and R. Agrawal, "Mining Generalized Association Rules," VLDB Conference, 1995, pp. 407-419.
9. R. Sibson, "SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method," Computer Journal, Vol. 16, No. 1, 1972, pp. 30-34.