Horizontal vs. Vertical Partitioning in Association Rule Mining: A Comparison Anjan Das Department of Computer Science, St Anthony’s College, Shillong, Meghalaya, India
[email protected]
Dhruba K Bhattacharyya Department of Information Technology, Tezpur University, Napaam, Assam India
[email protected]
Jugal K. Kalita Department of Computer Science, University of Colorado, Colorado Springs, CO 80933
[email protected]
Abstract
Association rules identify associations among data items and were introduced in [1,2,3]. Several useful rule mining algorithms [4] are based on the horizontal partitioning approach. These algorithms partition the database, find the frequent itemsets in each partition, and combine the itemsets from all partitions to obtain the global candidate itemsets as well as the global supports of the items. In this paper, a novel rule mining algorithm is presented based on the vertical partitioning approach. Experimental results are included to establish the superiority of the proposed algorithm over [4].
1. Introduction
An example of an association rule is the statement: 80% of the customers who buy bread also buy butter. Association rules have numerous applications in areas such as decision support, understanding customer behavior, and fault detection in telecommunication networks. The terms used most frequently in relation to association rules are itemset, support, confidence, frequent itemsets and large itemsets. An itemset is a non-empty set of items. The support of an itemset X, denoted by σ(X), is the number of transactions in a database that contain X. An itemset with support greater than a pre-defined minimum support is called a frequent (large) itemset. The confidence of X → Y, denoted by ρ(X → Y), is the percentage of transactions containing X that also contain Y; it is calculated as ρ(X → Y) = σ(X ∪ Y)/σ(X). An association rule between two disjoint and frequent itemsets X and Y, denoted by X → Y, exists if the support of X ∪ Y is at least some pre-defined minimum support, and the confidence is at least some pre-defined minimum confidence.
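The definitions above can be made concrete with a small, self-invented example (the transaction database and function names below are illustrative, not from the paper). Support counts the transactions containing an itemset; confidence is the ratio ρ(X → Y) = σ(X ∪ Y)/σ(X).

```python
# Toy illustration of support and confidence for association rules.
# Transactions are modeled as Python sets of items.

def sigma(itemset, transactions):
    """Support count: number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(x, y, transactions):
    """Confidence rho(X -> Y) = sigma(X u Y) / sigma(X)."""
    return sigma(x | y, transactions) / sigma(x, transactions)

db = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

print(sigma({"bread"}, db))                   # 4
print(sigma({"bread", "butter"}, db))         # 3
print(confidence({"bread"}, {"butter"}, db))  # 0.75
```

Here the rule bread → butter has 75% confidence, matching the style of the opening example.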
2. The Horizontal Partition Algorithm
A horizontal partitioning algorithm [4] works in two phases. In phase I, it logically divides the database into a number of non-overlapping horizontal partitions. It considers each partition individually, generates large itemsets for that partition, and merges the individual large itemsets to generate the set of potentially large itemsets. In phase II, the actual supports for these itemsets are computed, and the large itemsets are found. The partition sizes are chosen such that each partition can be accommodated in main memory, so that the partitions are read only once in each phase. The notations used are given in Table 1. Figure 1 gives the algorithm.
Notation   Meaning
c_k^p      A local candidate k-itemset in partition p
l_k^p      A local large k-itemset in partition p
C_k^p      Set of local candidate k-itemsets in partition p
L_k^p      Set of local large k-itemsets in partition p
C_k^G      Set of global candidate k-itemsets
C^G        Set of all global candidate itemsets
L_k^G      Set of global large k-itemsets
L^i        The set of large itemsets in partition i
P          The set of all vertical partitions
L^G        The set of global large itemsets

Table 1: Notations used in the algorithms
The algorithm finds all large itemsets in the database and, as demonstrated in [4], works faster than Apriori [1]. The reason is that the algorithm needs only two scans of the database and finds the support of an itemset by intersecting the transaction-id lists (tidlists) of its items. Proper choice of the number of partitions and the partition sizes is crucial for this algorithm. The execution time depends on the size of the global candidate set: as the size of the global candidate set increases, the execution time also increases. Although there are many factors that affect the size of the global candidate set, one main factor is the dimensionality of the records or transactions. With an increase in dimensionality, the size of the global candidate set, and hence the execution time, also increases.

1.  P = partition_database(D)
2.  n = number of partitions
3.  for i = 1 to n do begin                    // Phase I
4.      read_in_partition(p_i ∈ P)
5.      L^i = gen_large_itemsets(p_i)
6.  end
7.  for (i = 2; L_i ≠ φ; i++) do begin         // Merge phase
8.      C_i^G = ∪_{j=1,…,n} L_i^j
9.  end
10. C^G = ∪_i C_i^G
11. for i = 1 to n do begin                    // Phase II
12.     read_in_partition(p_i ∈ P)
13.     ∀ candidates c ∈ C^G: gen_count(c, p_i)
14. end
15. L^G = { c ∈ C^G | c.count ≥ minsup }
16. Answer = L^G
Figure 1: Algorithm Horizontal Partition
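The two-phase scheme of Figure 1 can be sketched in Python. This is a minimal illustrative sketch, not the implementation of [4]: gen_large_itemsets here is a brute-force stand-in that enumerates all subsets of each transaction, whereas [4] uses a more efficient level-wise procedure. The key property shown is that any globally large itemset must be locally large in at least one partition, so phase II only has to count the merged candidates.

```python
# Sketch of the two-phase horizontal-partition scheme (illustrative only).
from itertools import combinations

def gen_large_itemsets(partition, local_minsup):
    """Brute-force stand-in: all itemsets that are large within one partition."""
    counts = {}
    for t in partition:
        for k in range(1, len(t) + 1):
            for c in combinations(sorted(t), k):
                counts[frozenset(c)] = counts.get(frozenset(c), 0) + 1
    return {s for s, n in counts.items() if n >= local_minsup}

def horizontal_partition_mine(db, n_partitions, minsup_ratio):
    # Phase I: locally large itemsets in each horizontal partition.
    size = -(-len(db) // n_partitions)           # ceiling division
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    cg = set()                                   # merge phase: global candidates
    for p in parts:
        cg |= gen_large_itemsets(p, max(1, int(minsup_ratio * len(p))))
    # Phase II: count the actual global support of every candidate.
    return {c for c in cg
            if sum(1 for t in db if c <= t) >= minsup_ratio * len(db)}

db = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
result = horizontal_partition_mine(db, 2, 0.6)
print(result)  # the globally large itemsets at 60% minimum support
```

With minimum support 60% (3 of 5 transactions), only {bread}, {butter} and {bread, butter} survive phase II.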
3. The Vertical Partition Algorithm
In this algorithm, the database is scanned only once to find the transaction-id list (tidlist) of each item. The tidlist is the array of transaction-ids of the transactions in which the item occurs. The algorithm divides the database vertically into logical partitions. The large itemsets for each partition are determined using depth-first search (DFS) on each partition. The large itemsets found in each partition are combined to form the global candidate sets; supports are also calculated. The global candidate set is a superset of the actual large itemsets. The algorithm finds all possible large itemsets. Moreover, the number of candidate itemsets is reduced because one partition is handled at a time. It is assumed that transactions are given in the form <TID, 1,0,1,1,0,…> and that the items are in lexicographic order. The algorithm can also be adapted to the case where the transactions are given as lists of items. It is also assumed that the TIDs are in monotonically ascending order and that the data are in secondary storage.
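The single scan that produces the tidlists can be sketched as follows; the helper name and data layout are mine, chosen to match the assumed <TID, 1,0,1,1,0,…> bitmap form.

```python
# Hypothetical helper: build per-item tidlists from bitmap transaction rows.
def build_tidlists(rows, items):
    """rows: (tid, bits) pairs where bits[i] == 1 iff items[i] occurs in tid."""
    tidlists = {item: [] for item in items}
    for tid, bits in rows:              # TIDs assumed monotonically ascending
        for item, bit in zip(items, bits):
            if bit:
                tidlists[item].append(tid)
    return tidlists

items = ["A", "B", "C"]
rows = [(1, [1, 0, 1]), (2, [1, 1, 0]), (3, [0, 1, 1])]
print(build_tidlists(rows, items))  # {'A': [1, 2], 'B': [2, 3], 'C': [1, 3]}
```

Because TIDs arrive in ascending order, each tidlist is produced already sorted, which keeps later intersections cheap.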
3.1 The Algorithm
The algorithm creates logical partitions of the database vertically and finds large itemsets in each partition using DFS. It combines the large itemsets of the first two partitions to get the candidate large itemsets and finds the actual large itemsets among them. The resulting large itemsets are combined with the large itemsets of the next partition. This process continues until all partitions have been considered. The algorithm is given in Figure 2.

3.1.1 Generating Large Itemsets in a Partition
The procedure gen_large_itemsets generates large itemsets in a partition. The procedure takes a partition as input and returns the set of large itemsets in the partition. The procedure first finds the frequent items in the partition and creates a directed graph with the frequent items as nodes. A pair of nodes (m, n) is connected by an edge if both occur in a transaction and n occurs after m. Next, the algorithm uses DFS to find large itemsets. Each traversed path in the search represents an itemset consisting of the items denoted by the nodes in the path. The support of an itemset is obtained by intersecting the tidlists of the items in the set. At each step of the DFS, the common transaction-id list of the items is stored for use at the next depth. It also stores the tidlist of every large itemset. These tidlists of the large itemsets are used by gen_count to find the supports of the potential candidate large itemsets. The search continues as long as the itemset denoted by the path up to the current node (item) is large. All the large itemsets in a partition are also actual large itemsets in the database.
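The DFS described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: instead of materializing the directed item graph, it extends a path with every lexicographically later frequent item and lets the tidlist-intersection support check do the pruning, which yields the same large itemsets. Tidlists are kept as Python sets for simple intersection.

```python
# Sketch of gen_large_itemsets for one vertical partition: DFS over frequent
# items, extending a path only while the itemset on the path stays large.
def gen_large_itemsets(tidlists, minsup):
    items = sorted(i for i, t in tidlists.items() if len(t) >= minsup)
    large = {}                                   # itemset -> its tidlist (set)
    def dfs(path, path_tids, start):
        for j in range(start, len(items)):
            item = items[j]
            tids = path_tids & tidlists[item]    # intersect along the path
            if len(tids) >= minsup:              # extend only while large
                itemset = path + (item,)
                large[frozenset(itemset)] = tids # store tidlist for gen_count
                dfs(itemset, tids, j + 1)
    for j, item in enumerate(items):
        large[frozenset([item])] = tidlists[item]
        dfs((item,), tidlists[item], j + 1)
    return large

tidlists = {"A": {1, 2, 3}, "B": {1, 2}, "C": {2, 3}, "D": {4}}
print(sorted(map(sorted, gen_large_itemsets(tidlists, 2))))
```

With minimum support 2, item D is pruned immediately, and the search stops at {A, B, C} because its path tidlist shrinks to a single transaction.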
1.  Read the database D and create the tidlists for each of the items
2.  P = vertical_partition_database(D)
3.  n = number of partitions
4.  for i = 1 to n do begin
5.      L^i = gen_large_itemsets(p_i ∈ P)
6.  end
7.  L^G = L^1
8.  for i = 2 to n do begin
9.      C^G = combine_local_large(L^G, L^i)
10.     for all candidates c ∈ C^G: gen_count(c)
11.     L^G = L^G ∪ L^i ∪ { c ∈ C^G | c.count ≥ minimum support }
12. end
13. Answer = L^G
Figure 2: The Vertical Partition Algorithm

3.1.2 Combining Local Large Itemsets
The global set of potentially large itemsets is obtained by combining the large itemsets found in the partitions: a large itemset of one partition is concatenated with a large itemset of another partition. Combining the large itemsets of all the partitions at once generates too many candidate sets, so they are combined incrementally. The process starts with the first two partitions, finding the potentially large itemsets and then the actual large itemsets among them. The resulting large itemsets are combined with those of the next partition, and so on.

3.1.3 Support Count
The supports of the global candidate sets generated by combining the large itemsets of the partitions are obtained by the function gen_count(). For each candidate set, the function uses the tidlists, generated in gen_large_itemsets, of its two constituent large itemsets: it simply counts the number of common transaction-ids in the two tidlists.
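The combination and counting steps can be sketched as below. The function names follow Figure 2, but the signatures and data layout are assumptions of this sketch: candidates are kept as pairs of constituent itemsets, and gen_count receives the stored tidlist map explicitly.

```python
# Sketch of the incremental combination step and the support count.
def combine_local_large(lg, li):
    """Global candidates: pair each itemset in lg with each itemset in li."""
    return {(x, y) for x in lg for y in li}

def gen_count(candidate, tidlists):
    """Support of X u Y = |tidlist(X) & tidlist(Y)|; both are already stored."""
    x, y = candidate
    return len(tidlists[x] & tidlists[y])

# Tidlists of large itemsets from two partitions, as stored by
# gen_large_itemsets (sets of transaction-ids).
tidlists = {frozenset({"A"}): {1, 2, 3}, frozenset({"C"}): {2, 3}}
cands = combine_local_large({frozenset({"A"})}, {frozenset({"C"})})
for c in cands:
    print(c, gen_count(c, tidlists))  # support of {A, C} is 2
```

No rescan of the database is needed here: the support of every candidate follows from one set intersection of tidlists computed during phase I.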
4. Experimental Results
All experiments were carried out with synthetic data generated using the algorithm in [2], on a Pentium IV machine with 128 MB RAM and a 10 GB SCSI disk. For all data sets, the number of items and the number of potentially large itemsets were taken as 250 and 500, respectively.
Database: T5I2100K, Number of Transactions: 100,000, Minimum Support: 5%

                        Execution Time (in sec)
No. of      Items=100      Items=150      Items=200      Items=250
Partitions  [4] Proposed   [4] Proposed   [4] Proposed   [4] Proposed
2            3     3        3     8        6    12       20    16
5            3     3        3     6        3     8       11     9
10           3     3        6     6        8     8       11     9
15           3     3        6     3        9     5       11     9
20           3     3        9     3       15     5       18     9
Database: T10I2100K, Number of Transactions: 100,000, Minimum Support: 5%

                        Execution Time (in sec)
No. of      Items=100      Items=150      Items=200      Items=250
Partitions  [4] Proposed   [4] Proposed   [4] Proposed   [4] Proposed
2           15     3        9     5       10     7       12    11
5            3     2        7     3       10     5       24     8
10           4     2       11     3       17     4       52     6
15           9     2       36     3       83     4       54     6
20         126     2       61     3      100     4       60     6
Database: T20I4100K, Number of Transactions: 100,000, Minimum Support: 5%

                        Execution Time (in sec)
No. of      Items=100      Items=150      Items=200       Items=250
Partitions  [4] Proposed   [4] Proposed    [4] Proposed    [4] Proposed
2           17     4       74    10       520    63       750   110
5           15     3       89     9       590    56       891    88
10          20     2      242     7       788    35      1010    65
15          76     2      312     7       968    28      1312    52
20         100     2      500     7      1125    25      1512    38
Database: T20I6100K, Number of Transactions: 100,000, Minimum Support: 5%

                        Execution Time (in sec)
No. of      Items=100      Items=150      Items=200      Items=250
Partitions  [4] Proposed   [4] Proposed   [4] Proposed   [4] Proposed
2            5     4       38    13       143    31      151    77
5            4     2       40     5       368    19      388    56
10           5     2      143     2       388    14      409    46
15           8     2      160     2       423    14      445    45
20           8     2      198     4       481    14      501    44
Table 3: Comparison results with [4] in terms of execution time

From the experiments, we can conclude the following. (i) For both [4] and the proposed algorithm, the execution time increases with the number of items. It can be seen (for items = 150 and 200) that [4] performs better on the database T5I2100K, where the number of items is small. However, as the number of items increases, the proposed algorithm has been found to perform better than [4]. (ii) In the case of [4], the execution time increases with the number of partitions, whereas in the case of the proposed algorithm, the execution time decreases with the number of partitions and then becomes almost constant.
5. Conclusion
This paper presents a novel association rule mining algorithm based on vertical partitioning. It also reports a comparative study of the proposed algorithm with [4], establishing that the proposed algorithm performs better when the number of items as well as the number of partitions are reasonably large.
References
1. R. Agrawal, H. Mannila, R. Srikant, et al. Fast discovery of association rules. In U. Fayyad et al., editors, Advances in Knowledge Discovery and Data Mining, pp. 307-328, MIT Press, 1996.
2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of VLDB, pp. 487-499, September 1994.
3. H. Mannila, H. Toivonen and I. Verkamo. Efficient algorithms for discovering association rules. In Proc. of KDD Workshop, pp. 181-192, 1994.
4. A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. of VLDB, pp. 432-444, 1995.
5. A. K. Pujari. Data Mining Techniques. Universities Press (India) Ltd., Hyderabad, India, 2001.