Pattern Recognition 34 (2001) 2561–2563
Rapid and Brief Communication
www.elsevier.com/locate/patcog
Efficient clustering of large data sets
V.S. Ananthanarayana, M. Narasimha Murty ∗, D.K. Subramanian
Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India
Received 15 March 2001; accepted 23 March 2001
∗ Corresponding author. Tel.: +91-80-309-2779; fax: +91-80-360-2911.
E-mail addresses: [email protected] (V.S. Ananthanarayana), [email protected] (M.N. Murty), [email protected] (D.K. Subramanian).
0031-3203/01/$20.00 © 2001 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(01)00097-8

1. Introduction

Clustering is the activity of finding abstractions from data; these abstractions can then be used for decision making [1]. In this paper, we select the cluster representatives as prototypes for efficient classification [3]. A variety of clustering algorithms have been reported in the literature. However, clustering algorithms that perform multiple scans of large databases (terabytes in size) residing on disk demand prohibitive computation times. As a consequence, there is growing interest in designing clustering algorithms that scan the database only once. Algorithms such as BIRCH [2], Leader [5] and the Single pass k-means algorithm [4] belong to this category.

In the Leader algorithm, each cluster is represented by its leader, which is a member of the cluster. A leader represents the cluster of all patterns that fall within a user-defined threshold distance from it. BIRCH also forms clusters based on a distance threshold. Both Leader and BIRCH (i) are incremental and (ii) scan the database only once. Further, the number, size and shape of the resulting clusters depend on the order in which the data are presented. In fact, BIRCH may be viewed as an extension of Leader; a major difference is that BIRCH employs a B+-tree-like structure (the CF-tree) to store the ‘cluster features’. In the current study, we consider Leader because it is conceptually simpler.

In the Single pass k-means algorithm, a part of the data is transferred from the disk to a buffer, and the data in the buffer are grouped into k clusters using the k-means algorithm. The algorithm retains the centroids of the k clusters in the buffer and discards the loaded data. This process is repeated until all the patterns in the large data set residing on the disk have been considered. The algorithm partitions the data into k clusters, and the size and shape of the clusters obtained may vary with the order in which blocks of patterns are transferred. A major limitation of the algorithm is that the centroids it generates as abstractions may not always represent the clusters properly.

In this paper, we propose an efficient single-pass clustering algorithm that generates clusters and their descriptions using a novel data structure called the pattern count tree (PC-tree). The descriptions of the clusters are based on the frequent co-occurrence of features in the patterns. We show that the proposed algorithm is incremental, order-independent and generates clusters and their descriptions using a single database scan. We use the cluster representatives obtained with the PC-tree based, Single pass k-means and Leader algorithms as prototypes for classifying a large handwritten digit data set [3]. We compare the resulting classifiers on classification accuracy, space and time requirements.

2. Pattern count tree (PC-tree) based clustering (PCBClu)

The pattern count tree is a data structure used to store all the patterns of the data set in a compact way. Each node of the PC-tree consists of four fields:
(i) ‘Feature’: specifies a nonzero positional value of the pattern;
(ii) ‘Count’: specifies the number of patterns represented by the portion of the path reaching this node;
(iii) ‘Child pointer’: points to the node that continues the current path;
(iv) ‘Sibling pointer’: points to the node that starts the next alternative path from the node under consideration.

Here, a cluster description is the set of features along a path from the root to a leaf of the PC-tree. All the patterns that are mapped onto that path form the members of the cluster. PC-tree requires less space to
hold the prototypes for two reasons: (i) if two or more patterns share a prefix, P, then P is stored only once; (ii) only nonzero positional values are stored in the PC-tree.

Fig. 1. Data patterns for ‘1’ and ‘7’: (A) patterns in 4 × 4 matrix form; (B) representation of the patterns in (A); (C) PC-tree holding the data patterns.

2.1. Algorithm

STORE:
For each pattern, Pi ∈ D (database):
    If any prefix subpattern, SPi, of Pi exists as a prefix in a branch eb:
        Put the features of SPi in eb by incrementing the count field values of the corresponding nodes in the PC-tree.
        Put the remaining subpattern, if any, of Pi by appending additional nodes, each with count field value equal to 1, to the path in eb.
    Else
        Put Pi as a new branch of the PC-tree.

RETRIEVE:
For each branch, Bi ∈ PC (PC-tree):
    For each node, Nj ∈ Bi:
        Output Nj.Feature.

The features corresponding to the nodes in a branch of the PC-tree, from root to leaf, constitute a prototype.
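The STORE and RETRIEVE steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `PCNode`, `PCTree`, `store` and `retrieve` are hypothetical, the sentinel root node is an assumption, and the five input patterns are only those named in the worked example for Fig. 1 (the complete pattern list of Fig. 1B is not reproduced in the text).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class PCNode:
    """One PC-tree node with the four fields described in Section 2."""
    feature: int                        # nonzero position of the pattern
    count: int = 0                      # number of patterns passing through this node
    child: Optional["PCNode"] = None    # first node continuing the current path
    sibling: Optional["PCNode"] = None  # next alternative path at the same level


class PCTree:
    def __init__(self) -> None:
        # Assumption: a sentinel root that carries no feature of its own.
        self.root = PCNode(feature=0)

    def store(self, pattern: Iterable[int]) -> None:
        """STORE: insert one pattern, given as its ordered nonzero positions."""
        parent = self.root
        for feature in pattern:
            # Search the sibling chain below `parent` for this feature.
            node, last = parent.child, None
            while node is not None and node.feature != feature:
                last, node = node, node.sibling
            if node is None:
                # No shared prefix from here on: append a new node.
                node = PCNode(feature)
                if last is None:
                    parent.child = node
                else:
                    last.sibling = node
            node.count += 1  # new nodes end at 1, shared nodes are incremented
            parent = node

    def retrieve(self) -> List[List[int]]:
        """RETRIEVE: list the features of every root-to-leaf branch."""
        branches: List[List[int]] = []

        def walk(node: Optional[PCNode], prefix: List[int]) -> None:
            while node is not None:
                path = prefix + [node.feature]
                if node.child is None:
                    branches.append(path)  # leaf reached: one prototype
                else:
                    walk(node.child, path)
                node = node.sibling

        walk(self.root.child, [])
        return branches


# Patterns named in the paper's worked example (positions 1..16 of a 4 x 4 grid).
patterns = [
    [3, 7, 11, 14, 15, 16],
    [3, 7, 11, 14, 15],
    [3, 7, 11],
    [3, 7, 11, 15],
    [1, 2, 3, 7, 11, 15],
]

tree = PCTree()
for p in patterns:
    tree.store(p)

print(tree.retrieve())
# [[3, 7, 11, 14, 15, 16], [3, 7, 11, 15], [1, 2, 3, 7, 11, 15]]
```

Five patterns collapse into three branches because shared prefixes are stored once and a pattern that is a prefix of another adds no branch of its own. Inserting the same patterns in a different order changes only the order in which branches are listed, not the set of branches.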
For example, consider the patterns for the digits ‘1’ and ‘7’ shown in Fig. 1A. Each pattern is a matrix of size 4 × 4, in which a ‘1’ in a location indicates the presence of the feature and a ‘0’ its absence. Fig. 1B gives an equivalent representation of the patterns that lists only the features that are present. Fig. 1C shows the PC-tree for these data patterns. Here, the cluster descriptions are {3, 7, 11, 14, 15, 16}, {3, 7, 11, 15} and {1, 2, 3, 7, 11, 15}. In the case of the PC-tree, the cluster description itself forms the prototype. The members of the cluster with description {3, 7, 11, 14, 15, 16} are {3, 7, 11, 14, 15, 16}, {3, 7, 11, 14, 15} and {3, 7, 11}.

2.2. Prototype selection

Let D be the data set. Let {C1, C2, ..., Ck} be the k clusters generated by a clustering algorithm, and let R = {R1, R2, ..., Rk} be the corresponding cluster descriptions. Note that Ri is the leader of cluster Ci in the case of the Leader algorithm, the centroid of Ci in the case of the Single pass k-means algorithm, and a branch of the PC-tree in the case of the PCBClu algorithm. In clustering, |R| < |D|, which reduces both the space and the classification time when the descriptions are used for classification. In the case of PCBClu, the reduction is twofold: (i) it reduces the number of cluster descriptions and (ii) it stores only the nonzero features, in a compact way.
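To illustrate how the descriptions in R can stand in for D at classification time, the sketch below labels a pattern with the class of its nearest prototype. Several things here are assumptions made for illustration: it uses a plain nearest-neighbour rule with Hamming distance rather than the modified NNC of [3], whose details are not given in this paper; the names `distance` and `classify` are hypothetical; and the class labels attached to the three example prototypes are read off their shapes in the 4 × 4 grid, not taken from the paper.

```python
from typing import FrozenSet, Iterable, List, Tuple

# A prototype is (set of nonzero feature positions, class label).
Prototype = Tuple[FrozenSet[int], str]


def distance(a: FrozenSet[int], b: FrozenSet[int]) -> int:
    """Hamming distance between two binary patterns stored as feature sets."""
    return len(a ^ b)  # positions present in exactly one of the two patterns


def classify(pattern: Iterable[int], prototypes: List[Prototype]) -> str:
    """Return the label of the prototype nearest to `pattern`."""
    query = frozenset(pattern)
    _, label = min(prototypes, key=lambda proto: distance(query, proto[0]))
    return label


# Cluster descriptions from the worked example; labels are assumed.
prototypes: List[Prototype] = [
    (frozenset({3, 7, 11, 14, 15, 16}), "1"),
    (frozenset({3, 7, 11, 15}), "1"),
    (frozenset({1, 2, 3, 7, 11, 15}), "7"),
]

print(classify([3, 7, 11, 14, 15], prototypes))  # 1
print(classify([1, 2, 3, 7, 11], prototypes))    # 7
```

Each query is compared against |R| prototypes instead of |D| training patterns, and each comparison touches only the nonzero positions, which is the twofold saving described above.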
Table 1
Comparison of various classifiers

Algorithm             CA (%)    T (in s)            S (in bytes)    P
                                DT       CT
Single pass k-means   36.78     460      600        1,544,000       2000
                      40.98     780      901        2,316,000       3000
                      48.79     2120     1832       3,860,000       5000
Leader                91.4      308      600        1,544,000       2000
                      92.71     587      1201       3,088,000       4000
                      92.74     696      1905       4,804,928       6224
PCBClu                89.56     2        350        451,820         1971
                      91.67     3        708        900,500         3924
                      93.61     7        1174.5     1,494,224       6507

CA: classification accuracy; DT: design time; CT: classification time; S: in-memory space; P: number of prototypes.
Table 2
Comparison between PCBClu and Leader algorithms

In-memory space (in bytes)    PCBClu (%)    Leader (%)
451,820                       89.56         78.46
900,500                       91.67         86.89
1,494,224                     93.61         88.32
3. Experiments

We implemented the Leader, Single pass k-means and PCBClu algorithms to examine the prototypes generated for handwritten digit data. The training data consist of 667 patterns for each class of digits 0 to 9, totalling 6670 patterns. The test data consist of 3333 patterns [3]. Table 1 shows, for the Leader, Single pass k-means and PCBClu algorithms: the classification accuracy (CA); the time requirement (T), split into the design time (DT) needed to generate the prototypes and the classification time (CT) needed to classify the patterns using the generated prototypes; the in-memory space required to generate the prototypes (S); and the number of prototypes generated (P). The classification accuracy of the prototypes generated by all algorithms is determined using the modified NNC [3]. From Table 1 it is clear that the time required for prototype generation in the case of PCBClu is very small (2 to 7 s, against 308 to 2120 s for the other two algorithms), since, unlike the other two algorithms considered in this paper, it involves no in-memory computations. The space requirement for prototypes in the case of PCBClu is also small because it stores the prototypes in a very compact
way based on nonzero positional values. Note that such a space reduction is one of the important requirements in clustering large databases. For the same in-memory requirements as those of PCBClu, the classification accuracy of the Single pass k-means algorithm is the worst, so we do not consider it for further experimentation. Table 2 shows the classification accuracy of the PCBClu and Leader algorithms for the same in-memory space requirements. In Table 2, PCBClu is more accurate than Leader at every memory budget, by between 4.78 and 11.10 percentage points.

3.1. Characteristics of PCBClu

(1) Clustering requires a single scan of the database.
(2) Space and time requirements for prototypes are small.
(3) The algorithm is incremental, yet order-independent.

Our experimental results discussed in Section 3 justify points (1) and (2) above; point (3) is clear from the PCBClu algorithm.

4. Conclusion

In this paper, we have proposed a clustering scheme for prototype selection, based on a novel data structure called the PC-tree. The advantages of the proposed scheme are: (i) cluster representations can be obtained using a single database scan; (ii) the space needed to hold the clusters and the time required to generate them are smaller than for the other single-pass clustering algorithms considered.

References

[1] A.K. Jain, M. Narasimha Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surveys 31 (3) (1999) 264–323.
[2] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the ACM SIGMOD, 1996.
[3] T. Ravindra Babu, M. Narasimha Murty, Comparison of genetic algorithm based prototype selection scheme, Pattern Recognition 34 (2) (2001) 523–525.
[4] F. Farhstrom, J. Lewis, Fast, single pass k-means algorithm, http://citeseer.nj.nec.com.
[5] H. Späth, Cluster Analysis: Algorithms for Data Reduction and Classification of Objects, Ellis Horwood, UK, 1980.