Towards an Extensible Library System for Data Mining

Phan Nhat Hai, Nguyen Hoang Anh, Tran Minh Quang, Ly Hoang Hai, Dang Tran Khanh
Computer Science and Engineering Faculty, Ho Chi Minh City University of Technology, 268 Ly Thuong Kiet Street, Ward 14, District 10, HCMC, Vietnam
{phannhathai171185, nguyenhoangchip}@yahoo.com, {quangtran, lhhai, dtkhanh}@cse.hcmut.edu.vn
Abstract. The main purpose of this research is to design and implement a library system that helps practitioners and researchers inherit and develop previous achievements in the Data Mining area. The design of this library focuses mainly on the usability, extensibility, and flexibility of the system. It contains basic and classic algorithms in Cluster Analysis and Association Rules Mining, decomposed into fine-grained components. Each component is cohesive and independent enough to be reused in multiple algorithms. As a result, users can build new algorithms by using or modifying some of these components. Compared with other libraries, this library is more extensible and efficient. Several prominent algorithms in Association Rules Mining and Cluster Analysis have been integrated into this library, such as Apriori, FP-Growth, K-Means, Rock, Farnstrom, and DBScan. Moreover, a vivid visual interface is also provided, which makes the system more user-friendly and usable.

Keywords: data mining library, cluster analysis, association rules mining.
1 Introduction
Data mining is the process of discovering potentially useful, interesting, and previously unknown patterns from large datasets. It has many applications in economics, science, and medicine. Given the enormous benefit it can bring to people, creating a tool that exploits it has become a big demand in society nowadays. There are many libraries that support Data Mining quite well. They, however, still have some disadvantages, such as a lack of flexibility and extensibility for the purpose of improvement by other people. For example, Weka [4], a famous library that supplies many basic features, is hard to extend. The research goal, therefore, is to construct an extensible library system with the following features:
- It is based on two major techniques of Data Mining: Clustering and Association Rules Mining.
- It follows an extensible orientation to increase flexibility, so that the available algorithms and components can be used or modified easily.
Association Rules Mining and Clustering were chosen among the mining techniques to develop the system because of their popular usage. However, users are free to integrate other techniques into this library system according to their purposes. At the moment, several components have been integrated into the library. They can be successfully associated to implement the following algorithms: Apriori[7], FPGrowth[10], Closet[11], FP-Max[12], ExMiner[13], K-Means[3], K-medoids[5], Modified KMeans[1], Incremental KMeans[2], Incremental Clustering[3], Rock[6], Farnstrom[14], and DBScan[15]. Additionally, a visual interface, which includes control structures, is also provided. The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 states the problems and solutions in building this library system. Its design is addressed in Section 4. Section 5 continues with the experimental evaluation, and the last section gives the conclusion and future work.
2 Related Work
There are many tools supporting Data Mining. For example, Estard Data Miner, which is designed for both experts and ordinary users, is widely used in various fields. Besides, some libraries have also been developed, such as Weka [4], Rapid Miner [8], and Tanagra [9]. Weka is a set of modules used in data preprocessing, machine learning, and modeling techniques. It is compatible with many platforms, open-source, and has a visual interface. However, the "smoothness"¹ of the components within a given algorithm is not good enough to support flexibility: it is difficult to improve the available algorithms in the Weka library, as well as to create new algorithms from its available components. Rapid Miner, which is also compatible with many platforms and open-source, was developed to support machine learning algorithms. It contains more than 400 machine learning, preprocessing, and visualization operators. Tanagra supplies researchers and students with a friendly tool. Besides offering a visual interface like the above libraries, it lets users easily integrate their own methods. Nevertheless, as in Weka, the flexibility of Rapid Miner and Tanagra is not high.
3 Problems and Solutions in Building the Library System
3.1 Problems
From what is stated above, the following major problems have to be thoroughly solved to make the library usable:
¹ "Smoothness": the way an algorithm is decomposed into components.
1. Extensibility and flexibility are required so that an object can be used not only diversely but also actively; besides, users must be able to overwrite existing algorithms or parts of an algorithm.
2. Inheritance is also needed to facilitate the process of building new algorithms.
3. The library system has to supply multiple algorithms so that users have multiple choices.
4. The library system is required to have a friendly interface to increase the interaction between it and its users.
3.2 Solutions
The library system needs to be designed and implemented following an extensible orientation. Every algorithm integrated into it must be carefully analyzed so that it can be decomposed into reasonable components. Each component is cohesive and independent enough to be reused in multiple algorithms, allowing users to build new algorithms with little effort. The design follows the Template Method design pattern, mainly to make a skeleton for each algorithm, so it is very easy to overwrite or modify the available components. For a user who only needs the result without understanding the association of components, composite components are available: the user only needs to either call these components in a program or manipulate them through the provided interface.
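The skeleton-and-hook structure described above can be sketched as follows; the class and method names are illustrative stand-ins, not the library's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the Template Method design described above. MiningAlgorithm
// fixes the skeleton; subclasses overwrite only the components they change.
abstract class MiningAlgorithm {
    protected final List<String> trace = new ArrayList<>();

    // The template method: the skeleton of every algorithm run.
    public final List<String> run() {
        initial();
        do {
            step();
        } while (!done());
        return trace;
    }

    // Hook methods filled in by concrete algorithms.
    protected abstract void initial();
    protected abstract void step();
    protected abstract boolean done();
}

// A user supplies only the components that differ.
class TwoStepAlgorithm extends MiningAlgorithm {
    private int iterations = 0;
    protected void initial() { trace.add("initial"); }
    protected void step()    { trace.add("step" + (++iterations)); }
    protected boolean done() { return iterations == 2; }
}

public class TemplateDemo {
    public static void main(String[] args) {
        // The skeleton stays fixed while the steps vary per subclass.
        System.out.println(new TwoStepAlgorithm().run());
        // prints [initial, step1, step2]
    }
}
```

Because the skeleton (`run`) is final, overwriting a single hook changes one component of an algorithm without touching the rest of its control flow.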
4 Library System Design
The library design attempts to create components whose reusability and shareability are as high as possible. Decomposing an algorithm into components relies on its specific steps or tasks. Therefore, algorithms integrated into this library must be carefully analyzed based on their properties and the correlation between them. Consequently, they can be divided into reasonable components that satisfy the requirement. Several examples are given below to illustrate the two major features of this library: the association of components to generate algorithms, and the inheritance of components to create new algorithms.

4.1 Cluster Analysis

4.1.1 Association of Components
The following part addresses KMeans[3] as an example of the association of components in Clustering.

Basic steps in the KMeans algorithm:
1. Select an initial partition with K clusters containing randomly chosen samples, and compute the centroids of the clusters.
2. Generate a new partition by assigning each sample to the closest cluster centre.
3. Compute new cluster centres as the centroids of the clusters.
4. Repeat steps 2 and 3 until an optimum value of the criterion function is found (or until the cluster membership stabilizes).

Fig. 1. The interaction between components in the KMeans algorithm (initial, addAllElementToCluster, ComputeCentroids, TerminalCondition, resetCentroidImage, and resetAllInstanceList, all communicating through an array of clusters; figure omitted).
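The component chain of Fig. 1 can be sketched as a minimal one-dimensional KMeans. The component names echo the figure, but the signatures and the Cluster class are illustrative assumptions, not the library's real code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal 1-D sketch of the KMeans component chain in Fig. 1.
public class KMeansSketch {
    static class Cluster {
        double centroid;
        List<Double> instances = new ArrayList<>();
        Cluster(double c) { centroid = c; }
    }

    // initial: create clusters with arbitrary centres.
    static Cluster[] initial(double[] centres) {
        Cluster[] cs = new Cluster[centres.length];
        for (int i = 0; i < centres.length; i++) cs[i] = new Cluster(centres[i]);
        return cs;
    }

    // addAllElementToCluster: assign every sample to its closest centre.
    static Cluster[] addAllElementToCluster(Cluster[] cs, double[] data) {
        for (double x : data) {
            Cluster best = cs[0];
            for (Cluster c : cs)
                if (Math.abs(x - c.centroid) < Math.abs(x - best.centroid)) best = c;
            best.instances.add(x);
        }
        return cs;
    }

    // computeCentroids: recompute each centre as the mean of its members.
    static Cluster[] computeCentroids(Cluster[] cs) {
        for (Cluster c : cs) {
            double sum = 0;
            for (double x : c.instances) sum += x;
            if (!c.instances.isEmpty()) c.centroid = sum / c.instances.size();
        }
        return cs;
    }

    // resetAllInstanceList: clear cluster memberships before the next pass.
    static Cluster[] resetAllInstanceList(Cluster[] cs) {
        for (Cluster c : cs) c.instances.clear();
        return cs;
    }

    // terminalCondition: stop when no centre moved more than eps.
    static boolean terminalCondition(double[] old, Cluster[] cs, double eps) {
        for (int i = 0; i < cs.length; i++)
            if (Math.abs(old[i] - cs[i].centroid) > eps) return false;
        return true;
    }

    // The chain: each component's output (an array of clusters) feeds the next.
    public static double[] cluster(double[] data, double[] centres) {
        Cluster[] cs = initial(centres);
        while (true) {
            double[] old = new double[cs.length];
            for (int i = 0; i < cs.length; i++) old[i] = cs[i].centroid;
            cs = computeCentroids(addAllElementToCluster(cs, data));
            if (terminalCondition(old, cs, 1e-9)) break;
            cs = resetAllInstanceList(cs); // resetCentroidImage analogue omitted
        }
        double[] out = new double[cs.length];
        for (int i = 0; i < cs.length; i++) out[i] = cs[i].centroid;
        return out;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.2, 0.8, 10.0, 10.2, 9.8};
        System.out.println(Arrays.toString(cluster(data, new double[]{0.0, 5.0})));
        // centres converge near 1.0 and 10.0
    }
}
```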
- initial: initiates arbitrary centres and other parameters, such as the threshold, and loads data into memory. The output of this component is an array of clusters².
- addAllElementToCluster: the above array of clusters is used as the input of this component. It assigns all data samples to the appropriate clusters (step 2). After that, the centres of these clusters are updated by the component ComputeCentroids (step 3).
- TerminalCondition: checks the ending condition of the algorithm. If it is satisfied, TerminalCondition returns the result; otherwise it transfers data to the component resetCentroidImage to keep the current positions of the centres. In addition, the entire list of data patterns of the clusters is deleted by the component resetAllInstanceList. The loop continues until the ending condition is satisfied.

Note in Fig. 1 that all components communicate with each other through an array of clusters: the output of one component is the input of the next. Together, the components form a chain that returns the expected results. The library system also provides basic components for other algorithm families in Cluster Analysis, such as Rock, K-medoids, IncrementalClustering, Farnstrom, and DBSCAN.

4.1.2 Inheritance
An example of inheriting basic components of one algorithm to implement another, an important purpose of this library, is given below; it describes the construction of Modified_KMeans[1] from the classic KMeans.

Basic steps in Modified KMeans:
1. Choose arbitrary K objects as the K cluster centres.
2. Assign each object in the training set to the closest cluster and update the centres of the clusters.
3. If the cluster criterion is satisfied (the cluster centres do not move), go to step 4; else go to step 2.
4. If there is a cluster that can be moved to a better position to reduce the total sum of the distortion errors, move it to the new position and then go to step 2; else stop.

² A cluster contains the list of data points, the centre, and the weight of the cluster.

Fig. 2. Components in the Modified KMeans algorithm (the KMeans components of Fig. 1 plus the new MoveCentroid component; figure omitted).
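The inheritance shown in Fig. 2 can be sketched as a subclass that reuses the whole KMeans skeleton and adds only the new component. KMeansBase and its hook are hypothetical stand-ins for the library's classes, with the component chain abbreviated to a traced pipeline:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative base class standing in for the library's KMeans components.
class KMeansBase {
    protected final List<String> pipeline = new ArrayList<>();

    public List<String> run() {
        pipeline.add("initial");
        pipeline.add("addAllElementToCluster");
        pipeline.add("computeCentroids");
        afterConvergence(); // hook: does nothing in classic KMeans
        pipeline.add("return");
        return pipeline;
    }

    protected void afterConvergence() { }
}

// Modified_KMeans inherits every KMeans component and adds MoveCentroid.
class ModifiedKMeans extends KMeansBase {
    @Override
    protected void afterConvergence() { pipeline.add("moveCentroid"); }
}

public class InheritanceDemo {
    public static void main(String[] args) {
        System.out.println(new ModifiedKMeans().run());
        // prints [initial, addAllElementToCluster, computeCentroids, moveCentroid, return]
    }
}
```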
As illustrated in Fig. 2, Modified_KMeans is implemented by inheriting all components of KMeans and adding some new components, such as MoveCentroid, which moves a centre from a disadvantageous position to a better one. The components still communicate through an array of clusters. In summary, components are used not only in their own algorithm but also in many others; therefore, the effort required to implement an algorithm decreases significantly.

4.2 Association Rule Mining (ARM)
Although association rule mining was developed only recently, the number of algorithms proposed for ARM is very large. Thus the library cannot contain all the components needed to implement all of these algorithms. Instead, some basic components are included as a foundation from which different algorithms can be generated. The following algorithms were chosen because their components can interact with one another:
1. Apriori[7]: a classic algorithm which has had a deep impact on subsequent algorithms. It is used to find all the interesting patterns in a dataset.
2. FP-Growth[10]: developed based on the FP-Tree structure. After the emergence of this tree structure, it became so popular that most later algorithms are based on it. Like Apriori, this algorithm finds all the interesting patterns in a dataset.
3. Closet[11]: a new perspective on the set of frequent itemsets; it finds only the frequent closed itemsets. An itemset X is a frequent closed itemset if (1) X is frequent, and (2) there exists no itemset X' such that X' is a proper superset of X and every transaction containing X also contains X'.
4. FP-Max[12]: another new perspective on the set of frequent itemsets, which reduces the size of the set considerably. This algorithm is used to find all maximal frequent itemsets. An itemset X is a maximal frequent itemset if (1) X is frequent, and (2) no superset of X is frequent.
5. ExMiner[13]: sometimes the support threshold is not known in advance, and a poorly chosen one gives bad results. In some cases users want to find the most interesting patterns but do not know a reasonable support threshold. In that case, ExMiner is the choice: instead of providing a threshold, users provide the number of interesting patterns they want, namely K, and ExMiner finds the set of top-K frequent patterns.

Fig. 3 describes the component interactions that generate the algorithms mentioned above.

Fig. 3. Interaction between components in ARM (components: countSupportOfAllItem, sortFrequentItem, findFrequentItemBaseOnThreshold, findFrequentItemBaseOnK, generateCandidate, pruneCandidate, countSupport, sortItemInTransaction, buildTree, FP-Growth, FP-Max, Closet, and VirtualGrowth; figure omitted).
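The frequent-closed and maximal-frequent definitions used by Closet and FP-Max above can be illustrated with a small brute-force check. The helper names are hypothetical, and the check is far less efficient than the actual algorithms:

```java
import java.util.*;

// Brute-force sketch of the frequent-closed and maximal-frequent
// itemset definitions. Helper names are hypothetical.
public class ItemsetKinds {
    static int support(List<Set<String>> db, Set<String> x) {
        int n = 0;
        for (Set<String> t : db) if (t.containsAll(x)) n++;
        return n;
    }

    // X is frequent closed if it is frequent and no proper superset has the
    // same support. It suffices to test one-item extensions: if some larger
    // superset had equal support, the one-item extension toward it would too.
    static boolean isClosed(List<Set<String>> db, Set<String> x,
                            Set<String> items, int minsup) {
        int s = support(db, x);
        if (s < minsup) return false;
        for (String i : items) {
            if (x.contains(i)) continue;
            Set<String> y = new HashSet<>(x);
            y.add(i);
            if (support(db, y) == s) return false;
        }
        return true;
    }

    // X is maximal frequent if it is frequent and no superset is frequent;
    // again one-item extensions suffice because support is anti-monotone.
    static boolean isMaximal(List<Set<String>> db, Set<String> x,
                             Set<String> items, int minsup) {
        if (support(db, x) < minsup) return false;
        for (String i : items) {
            if (x.contains(i)) continue;
            Set<String> y = new HashSet<>(x);
            y.add(i);
            if (support(db, y) >= minsup) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
            new HashSet<>(Arrays.asList("a", "b", "c")),
            new HashSet<>(Arrays.asList("a", "b")),
            new HashSet<>(Arrays.asList("a", "c")));
        Set<String> items = new HashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> ab = new HashSet<>(Arrays.asList("a", "b"));
        // {a,b} has support 2; adding c drops support to 1, so it is closed,
        // and with minsup = 2 no frequent superset exists, so it is maximal.
        System.out.println(isClosed(db, ab, items, 2));  // prints true
        System.out.println(isMaximal(db, ab, items, 2)); // prints true
    }
}
```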
The following parts explain the component interaction in detail. Apriori and ExMiner are described in full; the remaining algorithms are treated similarly.

4.2.1 Association of Components
Like Clustering, Apriori[7] is used to illustrate the association of components in ARM.

Basic steps in the Apriori algorithm (Lk is the frequent k-itemset set, where a k-itemset is an itemset containing k items; Ck is the candidate set, in which each candidate has k items):
1. Find the frequent items, L1.
2. Generate Ck from L(k-1).
3. Prune a candidate from Ck if any subset of it is not in L(k-1).
4. Count the support of the candidates in Ck; if a candidate's support is greater than or equal to the threshold, store it in Lk, else dismiss it. If Lk has only one element, stop; else go to step 2.

Fig. 4. Component association to create the Apriori algorithm (components generateCandidate, countSupportOfAllItem, sortFrequentItem, findFrequentItemBaseOnK, findFrequentItemBaseOnThreshold, pruneCandidate, and countSupport; figure omitted).
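The four Apriori steps above can be sketched as a compact implementation; the method names echo the components of Fig. 4, but the signatures are illustrative, not the library's API:

```java
import java.util.*;

// Compact sketch of the Apriori steps listed above.
public class AprioriSketch {

    static int countSupport(List<Set<String>> db, Set<String> c) {
        int n = 0;
        for (Set<String> t : db) if (t.containsAll(c)) n++;
        return n;
    }

    // Step 2: join frequent (k-1)-itemsets to build k-candidates.
    static List<Set<String>> generateCandidate(List<Set<String>> lPrev, int k) {
        List<Set<String>> cand = new ArrayList<>();
        for (int i = 0; i < lPrev.size(); i++)
            for (int j = i + 1; j < lPrev.size(); j++) {
                Set<String> u = new TreeSet<>(lPrev.get(i));
                u.addAll(lPrev.get(j));
                if (u.size() == k && !cand.contains(u)) cand.add(u);
            }
        return cand;
    }

    // Step 3: prune a candidate if any (k-1)-subset is not frequent.
    static List<Set<String>> pruneCandidate(List<Set<String>> cand,
                                            List<Set<String>> lPrev) {
        List<Set<String>> kept = new ArrayList<>();
        for (Set<String> c : cand) {
            boolean ok = true;
            for (String item : c) {
                Set<String> sub = new TreeSet<>(c);
                sub.remove(item);
                if (!lPrev.contains(sub)) { ok = false; break; }
            }
            if (ok) kept.add(c);
        }
        return kept;
    }

    public static List<Set<String>> apriori(List<Set<String>> db, int minsup) {
        // Step 1: find the frequent 1-itemsets, L1.
        Set<String> items = new TreeSet<>();
        for (Set<String> t : db) items.addAll(t);
        List<Set<String>> lPrev = new ArrayList<>(), all = new ArrayList<>();
        for (String i : items) {
            Set<String> s = new TreeSet<>(Collections.singleton(i));
            if (countSupport(db, s) >= minsup) lPrev.add(s);
        }
        for (int k = 2; !lPrev.isEmpty(); k++) {
            all.addAll(lPrev);
            List<Set<String>> lk = new ArrayList<>();
            // Step 4: keep candidates whose support reaches the threshold.
            for (Set<String> c : pruneCandidate(generateCandidate(lPrev, k), lPrev))
                if (countSupport(db, c) >= minsup) lk.add(c);
            lPrev = lk;
        }
        return all;
    }

    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
            new HashSet<>(Arrays.asList("a", "b", "c")),
            new HashSet<>(Arrays.asList("a", "b")),
            new HashSet<>(Arrays.asList("a", "c")),
            new HashSet<>(Arrays.asList("b", "c")));
        System.out.println(apriori(db, 2));
        // frequent itemsets with support >= 2: [a], [b], [c], [a, b], [a, c], [b, c]
    }
}
```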
- countSupportOfAllItem: counts the support of all items occurring in the dataset.
- sortFrequentItem: sorts items into decreasing support order.
- findFrequentItemBaseOnThreshold: finds all items whose support is not smaller than the threshold.
- findFrequentItemBaseOnK: finds the K items with the greatest support in the dataset.
- generateCandidate: generates the candidate set from the frequent itemset set.
- pruneCandidate: prunes candidates that have a non-frequent subset.
- countSupport: counts support to obtain the frequent itemsets.

Fig. 4 shows the association of components that makes up the Apriori algorithm. All other algorithms, such as FP-Growth, Closet, and FP-Max, are combined in the same way, as described in Fig. 3.

4.2.2 Inheritance
ExMiner[13] is used to illustrate inheritance in ARM.

Basic steps in ExMiner:
1. Scan the dataset to count the support of all 1-itemsets.
2. According to K, set border-sup and generate the F-list. Insert the support values of the first K items in the F-list into a queue, say supQueue.
3. Construct an FP-Tree according to the F-list.
4. Call VirtualGrowth to explore the FP-Tree and set the final internal support threshold minsup to the smallest element, minq, of supQueue.
5. Mine the FP-Tree with support threshold minsup to output the top-K frequent patterns.

Fig. 5. ExMiner components: VirtualGrowth followed by FP-Growth (figure omitted).
- VirtualGrowth: virtually mines the FP-Tree to find a suitable threshold.
- FP-Growth: with the support threshold provided, mines the FP-Tree to find the top-K frequent patterns.

From Fig. 5 it is easy to see that ExMiner is created by inheriting the components of another algorithm, FP-Growth. First, the component VirtualGrowth virtually mines the FP-Tree to find a suitable support threshold. After that, the component FP-Growth is called to actually mine the FP-Tree and obtain the frequent patterns. By inheriting the components of the FP-Growth algorithm, ExMiner is implemented faster than usual. The library system can thus be extended in the same way.
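The way ExMiner derives its internal threshold from K, via supQueue, can be sketched with a min-heap that retains the K largest supports; the method name is illustrative:

```java
import java.util.PriorityQueue;

// Sketch of ExMiner's idea: derive the internal support threshold minsup
// from K, the number of patterns the user wants, instead of asking the
// user for a threshold. The queue mirrors supQueue in the description.
public class TopKThreshold {

    // Keep the K largest supports seen so far in a min-heap; after the
    // (virtual) mining pass its head is minq, the internal threshold.
    static int minsupForTopK(int[] supports, int k) {
        PriorityQueue<Integer> supQueue = new PriorityQueue<>();
        for (int s : supports) {
            supQueue.offer(s);
            if (supQueue.size() > k) supQueue.poll(); // drop the smallest
        }
        return supQueue.peek(); // minq: smallest of the top-K supports
    }

    public static void main(String[] args) {
        int[] supports = {7, 2, 9, 4, 5, 1};
        System.out.println(minsupForTopK(supports, 3)); // prints 5
    }
}
```

Mining with this derived minsup then yields exactly the top-K frequent patterns, which is why ExMiner can hand the rest of the work to the inherited FP-Growth component.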
5 Experimental Evaluation
Thirteen experiments were done to determine the differences between the KMeans algorithm implemented in this library and the same algorithm implemented in Weka. In order to measure the accuracy of a clustering algorithm, the true cluster means must be compared with the estimated cluster means. If the number of clusters k is small, it is possible to get high accuracy; thus the number of clusters is set to 5 (k = 5). In addition, to measure the performance of the algorithms, metrics such as CPU runtime and memory usage are thoroughly considered in these experiments. Due to space limitations, only some experimental results are reported below. The synthetic data have 3 dimensions, and the number of clusters is set to 5 (k = 5). The number of data points ranges from 100,000 to 1,000,000, and the proportion of noise data points is set to 10%. The experiments are carried out on 13 different synthetic datasets. For each dataset, each KMeans algorithm generates 100 different runs from different initial conditions. The best of these 100 models is retained for the comparison of accuracy and performance between this library and Weka, since KMeans is known to be sensitive to how the cluster means are initialized. These experiments are conducted on an Intel P4/3GHz machine with 1 GB of main memory (RAM), running Windows XP Service Pack 2. The library is implemented on the Java JDK 1.6 platform.
Fig. 6. CPU runtime (in seconds) of the library's K-Means versus Weka's K-Means, for datasets of 100,000 to 1,000,000 points (x-axis: number of points ×10,000; figure omitted).
Fig. 7. Maximum used memory (MB) of the library's K-Means versus Weka's K-Means, for datasets of 100,000 to 1,000,000 points (x-axis: number of points ×10,000; figure omitted).

As the results in Fig. 6 show, the execution time of KMeans on Weka is considerably longer than on this library system. For example, with 1,000,000 data points, the execution time is 16.36 seconds on this library and 27.59 seconds on Weka. The reason is that the clusters in the Weka program are objects that store the data points themselves, so the storage is duplicated: data points are stored both in the dataset and in the clusters. As a result, the memory used by Weka's program increases significantly, as illustrated in Fig. 7. For instance, with 1,000,000 data points, the library's program consumes 153 MB of memory while Weka consumes 302 MB. Moreover, inserting data points into clusters and getting them back out also slows down the execution, as shown in Fig. 6. To make the program faster and more efficient, only the key values of the data points are manipulated; this is the approach employed in the library program. That is why the execution time of the library's program is shorter and acceptable.
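The design choice described above, manipulating only the key values of data points instead of duplicating them inside clusters, can be sketched as follows (the class is a hypothetical illustration, not the library's actual data structure):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a cluster stores only member indices into the
// shared dataset, so data points are never copied into the cluster.
public class IndexCluster {
    final List<Integer> memberIndices = new ArrayList<>();

    double centroid(double[] dataset) {
        double sum = 0;
        for (int i : memberIndices) sum += dataset[i]; // read via index, no copy
        return sum / memberIndices.size();
    }

    public static void main(String[] args) {
        double[] dataset = {1.0, 2.0, 9.0};
        IndexCluster c = new IndexCluster();
        c.memberIndices.add(0);
        c.memberIndices.add(1);
        System.out.println(c.centroid(dataset)); // prints 1.5
    }
}
```

Holding an `int` index per member instead of a full data-point object is what keeps memory roughly proportional to the dataset alone, rather than the dataset plus a second copy spread across the clusters.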
6 Conclusion and Future Work
This paper proposed an extensible library system. Compared with other libraries in the Data Mining field, inheritance and flexibility are the advanced features of this library; users can be more active in using the available components for their own purposes. At present, the library system provides components whose associations generate the following algorithms: Apriori, FP-Growth, Closet, FP-Max, ExMiner, K-Means, K-medoids, Incremental KMeans, Modified KMeans, Incremental Clustering, Rock, Farnstrom, and DBScan. The accuracy and performance of most of these components have already been tested. In the future, the library system will continue to be tested and widened, in both the number of components and the number of Data Mining techniques covered. Additionally, we will investigate SOA (Service-Oriented Architecture) technology to improve the component integration method and the graphical user interface.
References

1. B. Fritzke: The LBG-U method for vector quantization, an improvement over LBG inspired from neural networks. Neural Processing Letters (1997), Vol. 5, No. 1, pp. 35-45.
2. Nguyễn Đức Cường: Flexible Information Management Strategies in Machine Learning and Data Mining. Thesis submitted to the University of Wales, Cardiff, United Kingdom (2004).
3. Mehmed Kantardzic: Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons (2003), ISBN 0471228524.
4. Weka. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Distributed under the GNU public license.
5. Kaufman, L. and Rousseeuw, P. J.: Clustering by means of medoids. Statistical Data Analysis Based on the L1 Norm, Elsevier (1987), pp. 405-416.
6. Sudipto Guha, Rajeev Rastogi, Kyuseok Shim: ROCK: A Robust Clustering Algorithm for Categorical Attributes. Information Systems (2000), Vol. 25, No. 5, pp. 345-366.
7. Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules. In Proc. of VLDB '94, Santiago, Chile (1994), pp. 487-499.
8. Mierswa, Ingo, Wurst, Michael, Klinkenberg, Ralf, Scholz, Martin, and Euler, Timm: YALE: Rapid Prototyping for Complex Data Mining Tasks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006).
9. Ricco Rakotomalala: TANAGRA: a free software for research and academic purposes. In Proceedings of EGC'2005, RNTI-E-3 (2005), Vol. 2, pp. 697-702.
10. Jiawei Han, Jian Pei, Yiwen Yin: Mining Frequent Patterns without Candidate Generation. In Proc. of the ACM SIGMOD Conference on Management of Data (2000), pp. 1-12.
11. Jiawei Han, Jian Pei, Runying Mao: CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. In Proc. of the ACM SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (2000), pp. 11-20.
12. Gösta Grahne and Jianfei Zhu: High Performance Mining of Maximal Frequent Itemsets. In 6th SIAM International Workshop on High Performance Data Mining, San Francisco (2003).
13. Tran Minh Quang, Shigeru Oyanagi, and Katsuhiro Yamazaki: ExMiner: An Efficient Algorithm for Mining Top-K Frequent Patterns. ADMA 2006, Springer-Verlag Berlin Heidelberg (2006), pp. 436-447.
14. Fredrik Farnstrom, James Lewis, Charles Elkan: Scalability for Clustering Algorithms Revisited. SIGKDD Explorations, ACM SIGKDD (2000), Vol. 2, Issue 1, pp. 51-57.
15. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (1996), pp. 226-231.