Journal of the Chinese Institute of Engineers, Vol. 29, No. 3, pp. 391-401 (2006)
A MINIMAL PERFECT HASHING SCHEME FOR MINING ASSOCIATION RULES FROM FREQUENTLY UPDATED DATA
Judy C. R. Tseng, Gwo-Jen Hwang*, and Wen-Fu Tsai
ABSTRACT

Data mining techniques have attracted the attention of researchers from various areas. One of the most important issues in data mining is the mining of association rules from very large and frequently updated databases. In this paper, an algorithm for mining association rules based on a minimal perfect hashing scheme is presented. By generating collision-free hash tables, the novel approach is especially suitable for handling very large databases containing huge numbers of transactions and frequently updated data. Experiments have been performed on databases with the number of transactions ranging from 10,000 to 1,000,000. The experimental results show that our novel approach performs much better than previously proposed methods.

Key Words: data mining, databases, data warehouse, association rules.
*Corresponding author. (Tel: 886-915396558; Fax: 886-63017001; Email: [email protected]) J. C. R. Tseng is with the Department of Computer Science and Information Engineering, Chung Hua University, Hsinchu, Taiwan 300, R.O.C. G. J. Hwang is with the Department of Information and Learning Technology, National University of Tainan, Tainan, Taiwan 700, R.O.C. W. F. Tsai is with the Department of Information Management, National Chi Nan University, Nan-Tou, Taiwan 545, R.O.C.

I. INTRODUCTION

Data mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules (Berry and Linoff, 1997). For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. It discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays; on Thursdays, however, they bought only a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue, such as placing the beer closer to the diapers or making sure beer and diapers are sold at full price on Thursdays (Cheung et al., 1996).

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both. Data mining allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Various applications have demonstrated the benefits of data mining, such as finding the relationships among products, analyzing customer behavior for e-commerce, and diagnosing student behaviors for distance learning (Chen et al., 1996). However, as computer and communication technologies proliferate, the amount of data to be analyzed becomes very large, which reveals the importance of performance considerations in data mining research.

One of the most important data mining issues is the mining of association rules. Association rules represent the relationships among items in a given database such that the presence of some items in a transaction implies the presence of other items in the same transaction (Sarawagi et al., 1998; Houtsma and Swami, 1995; Han et al., 2000; Han and Fu, 2000;
Hipp et al., 2000; Liu et al., 2000; Wang et al., 2000; Aggarwal and Yu, 2001; Aly et al., 2001). The problem of mining association rules in the context of a database was first explored by Agrawal et al. (1993). This pioneering work showed that mining association rules can be decomposed into two subproblems: identifying the frequent itemsets and producing the association rules. A set of items is called an itemset. First, we need to identify all itemsets that are contained in a sufficient number of transactions, above the minimum support requirement; these itemsets are referred to as frequent itemsets. Second, once all frequent itemsets are obtained, the desired association rules can be generated in a straightforward manner. Various algorithms have been proposed to discover the frequent itemsets (Zaki et al., 1997; Zaki and Hsiao, 1999; Pasquier et al., 1999; Zaki, 2000a; 2000b). Generally speaking, these algorithms first construct a candidate set of frequent itemsets based on some heuristics, and then discover the subset that indeed contains frequent itemsets. This process can be done iteratively, in the sense that the frequent itemsets discovered in one iteration are used as the basis to generate the candidate set for the next iteration. Each iteration involves generating candidate itemsets and scanning the database.

The heuristic used to construct the candidate set of frequent itemsets is crucial to performance (Agrawal and Shafer, 1996): the larger the candidate set, the higher the processing cost required to discover the frequent itemsets. As reported previously, the processing in the initial iterations dominates the total execution cost, and the initial candidate set generation, especially for the frequent 2-itemsets, is the key issue in improving the performance of data mining (Agrawal and Shafer, 1996). Another performance-related issue is the amount of data that has to be scanned during frequent itemset discovery. A straightforward implementation scans all transactions in the database in each iteration; reducing the number of transactions to be scanned and trimming the number of items in each transaction can improve the data-mining efficiency in later stages. Park et al. (1997) proposed the Direct Hashing and Pruning (DHP) algorithm to cope with these problems. DHP utilizes a hashing technique to filter out ineffective candidate frequent itemsets, especially when determining the frequent 2-itemsets. The number of candidate frequent itemsets generated by DHP is smaller than that generated by previous methods, which significantly alleviates the performance bottleneck of the whole process. DHP also avoids database scans in some passes so as to reduce the disk I/O cost involved.
Fig. 1 Illustrative example of a database with four transactions
The performance of the DHP algorithm depends greatly on the hashing function selected. A high collision rate in the hashing function degrades the performance of DHP; moreover, to find the frequent 2-itemsets, DHP and the other previously proposed methods require a second scan of the database. In this paper, we propose an algorithm based upon a multi-phase indexing and pruning scheme to cope with these problems. Experimental results show that our novel approach can significantly improve the performance of mining association rules from very large and frequently updated databases.

II. REVIEW OF RELEVANT WORKS

Let I = {i1, i2, ..., im} be a set of literals, called the data items in the following discussion, and let DB be a set of transactions. Each transaction T is a set of data items such that T ⊆ I, and is associated with an identifier called the TID. Let X be a set of items. A transaction T is said to contain X if and only if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set DB with confidence c if c% of the transactions in DB that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set DB if s% of the transactions in DB contain X ∪ Y (Park et al., 1997).

1. Apriori Algorithm

In the Apriori algorithm, a candidate set of frequent itemsets is constructed at each iteration, with the number of occurrences of each candidate itemset being counted (Agrawal and Shafer, 1996). The frequent itemsets are then generated based on a predetermined minimum support, which represents the acceptable minimum number of occurrences for an itemset to be considered frequent. Consider the example given in Fig. 1: there are four transactions over five data items (i.e., A, B, C, D, and E) in the database DB. In the first iteration, Apriori simply scans all the transactions to count the number of occurrences of each data item. The set of candidate 1-itemsets, C1, is shown in Fig. 2. Assuming that the minimum support is 2, the set of frequent 1-itemsets, L1, can then be determined by selecting the items in C1 whose numbers of occurrences are greater than or equal to the minimum support required.
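As a concrete check of the support and confidence definitions in Section II, the following sketch counts both over the database of Fig. 1. It is an illustrative Python fragment written for this presentation, not code from the original work; TID100 and TID200 are given in the text, while the other two transactions are reconstructed from the running example.

# Transactions of Fig. 1 (TID100 and TID200 are given in the text; the
# remaining two are reconstructed from the running example).
DB = {100: {"A", "C", "D"}, 200: {"B", "C", "E"},
      300: {"A", "B", "C", "E"}, 400: {"B", "E"}}

def support_count(itemset, db):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in db.values() if itemset <= t)

def confidence(x, y, db):
    # Fraction of the transactions containing X that also contain Y.
    return support_count(x | y, db) / support_count(x, db)

print(support_count({"B", "E"}, DB))   # 3, so {B E} is frequent for s = 2
print(confidence({"B"}, {"E"}, DB))    # 1.0: rule B => E holds with 100% confidence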
Fig. 2 Generating candidate itemsets and frequent itemsets by Apriori

Fig. 3 Illustrative example of constructing hash tree for C2
The Apriori candidate generation function takes as argument Lk-1, the set of all frequent (k-1)-itemsets, and returns a superset of the set of all frequent k-itemsets. The function serves as the base candidate generation function of all the algorithms introduced in this paper, and consists of two steps:

Step 1: Join Lk-1 with Lk-1:
    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: Delete itemset c ∈ Ck if any (k-1)-subset of c is not in Lk-1:
    for all itemsets c ∈ Ck do
        for all (k-1)-subsets s of c do
            if (s ∉ Lk-1) then delete c from Ck;

A runnable rendering of these two steps is sketched below.
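The join-and-prune steps translate directly into the following illustrative Python sketch (our own rendering, not code from the paper); itemsets are kept as sorted tuples so that the join condition on the first k-2 items can be tested directly.

from itertools import combinations

def apriori_gen(L_prev, k):
    # Step 1: join L_{k-1} with itself on the first k-2 items.
    joined = {
        p + (q[-1],)
        for p in L_prev for q in L_prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    }
    # Step 2: prune any candidate having an infrequent (k-1)-subset.
    return {
        c for c in joined
        if all(sub in L_prev for sub in combinations(c, k - 1))
    }

L2 = {("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")}
print(apriori_gen(L2, 3))   # {('B', 'C', 'E')}, as in the Fig. 2 example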
Consider the example given in Fig. 2: Apriori uses L1 * L1 to generate a candidate set of itemsets C2, where * is a concatenation operation. Next, the four transactions in DB are scanned again and the number of occurrences of each candidate itemset is counted. The set of frequent 2-itemsets, L2, is then constructed from the candidates in C2 whose supports satisfy the minimum support. C3 is generated from L2 according to the following steps. First, two frequent 2-itemsets in L2 with an identical first item, such as {B C} and {B E}, are selected. Second, Apriori tests whether each 2-item combination of {B C E} is in L2. Since the 2-itemset {C E}, which consists of the second items of {B C} and {B E}, is a frequent itemset in L2, we know that all the subsets of {B C E} are frequent, and hence {B C E} becomes a candidate 3-itemset. Third, as there is no other candidate 3-itemset from L2 after the testing, Apriori scans all the transactions in DB and discovers the frequent 3-itemsets, shown as L3 in Fig. 2. Since no candidate 4-itemset can be constituted from L3, Apriori ends the process of discovering frequent itemsets.

In Apriori, the candidate itemsets Ck are stored in a hash tree. A node of the hash tree contains either a list of itemsets (a leaf node) or a hash table (an interior node). In an interior node, each bucket of the hash table points to another node. The root of the hash tree is defined to be at depth 1, and an interior node at depth d points to nodes at depth d + 1. Itemsets are stored in the leaves. When an itemset c is to be inserted, the tree is traced from top to bottom until a leaf is reached; at each interior node at depth d, a hash function determines which branch to follow for the itemset. All nodes are initially created as leaf nodes, and when the number of itemsets in a leaf node exceeds a specified threshold, the leaf node is converted into an interior node (Agrawal and Shafer, 1996). Taking C2 in Fig. 2 as an example, the corresponding hash tree can be constructed as shown in Fig. 3. When the {A C} itemset is generated, the hash tree is traced from the root toward the leaves: hash function (1) is used to determine the branch direction for item A, and the branch for C is then determined by hash function (2). If the A-C branch does not exist in the hash tree, a new leaf node is created; if a leaf node comes to contain too many itemsets, it is converted into an interior node. To explain how the hash tree structure works in the database scanning and itemset counting phase, the transaction TID200, which contains B, C and E, is taken as an example. As shown in Fig. 4, the candidate itemsets for TID200 are {B C} and {B E}; in each interior node, such as nodes 1 and 3, the data item is hashed to determine the branch direction until a leaf node is reached. A simplified sketch of this structure follows.
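The following is a deliberately simplified illustration written for this review, not the authors' implementation; the single toy hash function shared by all depths and the small leaf threshold are our own assumptions.

class Node:
    def __init__(self):
        self.leaf = True
        self.itemsets = []    # itemsets held while this is a leaf
        self.children = {}    # bucket -> child, once converted to interior

LEAF_THRESHOLD = 2            # leaf capacity before conversion (toy value)

def bucket(item):
    # Toy per-depth hash function: alphabetical position modulo 3.
    return (ord(item) - ord("A")) % 3

def insert(node, itemset, depth=0):
    # Trace interior nodes by hashing the item at the current depth.
    if not node.leaf:
        child = node.children.setdefault(bucket(itemset[depth]), Node())
        insert(child, itemset, depth + 1)
        return
    node.itemsets.append(itemset)
    # Convert an overfull leaf into an interior node, re-inserting its
    # itemsets one level deeper (only while items remain to hash on).
    if len(node.itemsets) > LEAF_THRESHOLD and depth < len(itemset):
        node.leaf, pending, node.itemsets = False, node.itemsets, []
        for s in pending:
            child = node.children.setdefault(bucket(s[depth]), Node())
            insert(child, s, depth + 1)

root = Node()
for c in [("A","B"), ("A","C"), ("A","E"), ("B","C"), ("B","E"), ("C","E")]:
    insert(root, c)           # builds a tree analogous to Fig. 3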
Fig. 4 Illustrative example of constructing hash tree for transaction TID200
2. DHP (Direct Hashing and Pruning) Algorithm

In Apriori, the number of candidate itemsets in Ck can be as large as |Lk-1| × (|Lk-1| - 1)/2, which is extremely large if |Lk-1| is large. The DHP algorithm employs a hashing technique to filter out ineffective candidate itemsets during the generation of the candidate itemsets for the next iteration (Park et al., 1997). Instead of treating each k-itemset in Lk-1 * Lk-1 as a candidate itemset in Ck, DHP selects a k-itemset into Ck only if it has the possibility of becoming a frequent itemset in the next iteration. While the number of occurrences of each itemset in Ck-1 is counted by scanning the database, DHP accumulates information about Ck in advance by hashing all possible k-itemsets of each transaction into a hash table. Each bucket in the hash table contains an integer representing the number of itemsets that have been hashed to the bucket so far. Based on the resulting hash table, a bit vector is constructed, in which a bit is set to one if the count in the corresponding bucket of the hash table is greater than or equal to the minimum support.

An example of generating candidate itemsets by DHP is given in Fig. 5. For the candidate set of frequent 1-itemsets, i.e., C1 = {{A}, {B}, {C}, {D}, {E}}, all transactions of the database are scanned to count the supports of these 1-itemsets. In this step, a hash tree for C1 is built on the fly for the purpose of efficient counting: DHP tests whether each item already exists in the hash tree; if so, it increases the count of the item by one, and otherwise it inserts the item into the hash tree with a count of one. As each transaction is read from the database, DHP counts the occurrences of all its 1-subsets and hashes all its possible 2-subsets into a hash table H2; when a 2-subset is hashed to bucket i, the value of bucket i is increased by one. After the database is scanned, the occurrences of all the 1-subsets have been counted and each bucket of the hash table H2 holds the number of 2-itemsets hashed to it.
Fig. 5 Example of generating candidate 2-itemsets by DHP
Given the hashing function H(x, y) = ((order of x) × 10 + (order of y)) mod 7 for the hash table H2 and a minimum support s = 2, using the buckets of H2 whose counts are greater than or equal to s to filter the 2-itemsets from L1 * L1, we have C2 = {{A C}, {B C}, {B E}, {C E}}, as shown in Fig. 5, instead of the C2 = {{A B}, {A C}, {A E}, {B C}, {B E}, {C E}} produced by Apriori in Fig. 2. Note that collisions may occur while DHP hashes all the possible 2-subsets of each transaction into the hash table, e.g., ({C E}, {A D}) and ({A C}, {C D}). That is, the counts in the hash table can only be used to filter out the itemsets whose numbers of occurrences are less than the minimum support; for counts greater than or equal to the minimum support, we can only say that the corresponding 2-itemsets are candidate 2-itemsets, and another database scan is required to confirm whether these 2-itemsets are frequent. Clearly, the collision rate increases quickly as the database becomes larger; moreover, if the collision rate of the hashing function is too high, the size of C2 may be close (or equal) to that of Apriori.

In addition to the use of hash functions, the DHP algorithm also utilizes the fact that any subset of a frequent itemset must itself be a frequent itemset. For example, {B C E} ∈ L3 implies {B C} ∈ L2, {B E} ∈ L2, and {C E} ∈ L2. That is, if a transaction contains some frequent (k+1)-itemset, any item contained in that (k+1)-itemset will appear in at least k of the candidate k-itemsets in Ck. Consequently, an item in transaction t can be trimmed if it does not appear in at least k of the candidate k-itemsets of t. DHP constructs a trimmed database Dk in each iteration for the scan of the next phase. The bucket counting and filtering just described are sketched below.
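A minimal sketch of DHP's filtering of C2, assuming the Fig. 1 transactions and the hash function H(x, y) = ((order of x) × 10 + (order of y)) mod 7 quoted above; the variable names are ours.

from itertools import combinations

# Transactions of Fig. 1, reconstructed from the running example.
DB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
order = {item: i + 1 for i, item in enumerate("ABCDE")}   # A = 1, ..., E = 5

def h2(x, y):
    # DHP bucket of an ordered 2-itemset (x before y).
    return (order[x] * 10 + order[y]) % 7

# While counting 1-itemsets, hash every 2-subset of each transaction into H2.
H2 = [0] * 7
for t in DB:
    for x, y in combinations(sorted(t), 2):
        H2[h2(x, y)] += 1

s = 2                          # minimum support count
L1 = ["A", "B", "C", "E"]      # frequent 1-itemsets for s = 2 (cf. Fig. 2)

# Keep a pair from L1 * L1 only if its bucket count reaches s; collisions
# (e.g., {A C} and {C D} share bucket 6) can let false candidates through.
C2 = [(x, y) for x, y in combinations(L1, 2) if H2[h2(x, y)] >= s]
print(C2)   # [('A', 'C'), ('B', 'C'), ('B', 'E'), ('C', 'E')]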
3. Bottlenecks in Finding Frequent Itemsets in Early Phases

It can be seen that the larger the candidate set, the greater the processing cost required to discover the frequent itemsets. Some researchers have reported that the processing in the initial iterations in fact dominates the total execution cost (Agrawal and Shafer, 1996; Park et al., 1997). They also indicated that the candidate itemsets generated during an early iteration are generally, by orders of magnitude, larger than the set of frequent itemsets they really contain. Therefore, the initial candidate itemset generation, especially for the frequent 2-itemsets, is the key issue in improving the performance of data mining. The factors that cause bottlenecks in finding frequent itemsets in early phases, especially in the second iteration, are as follows:

(1) As C2 consists of |L1|(|L1| - 1)/2 2-itemsets, C2 will become extremely large if |L1| is large. Since the number of occurrences of a 1-itemset can easily be greater than or equal to the minimal support, C2 is likely to contain many more ineffective candidate itemsets than the candidate sets of later phases.
(2) In the second iteration, the depth of the hash tree is only 3. However, as C2 may be very large, it is time-consuming to convert leaf nodes into interior nodes and to reallocate memory to keep those interior nodes in the hash tree.
(3) As the number of nodes in the hash tree of C2 gets larger, the search cost becomes relatively higher.
(4) As it is difficult to evaluate the performance of the hash function of DHP before starting the mining process, the collision rate may be very high as the number of transactions increases; therefore, in some cases, the performance of DHP may be worse than that of Apriori.
III. MULTI-PHASE INDEXING AND PRUNING SCHEME

Although the DHP algorithm uses a hashing technique that filters out many ineffective itemsets before the next candidate itemset generation, its performance depends greatly on the hashing function selected. If the collision rate of the hashing function is high, the performance of DHP can be significantly degraded. In the following, an algorithm based on a minimal perfect hashing scheme is proposed to cope with these problems.

1. MPIP (Multi-Phase Indexing and Pruning) Algorithm

If a hashing function can be found that is one-to-one from the set of keys to the address space, it is a
perfect hashing function. If a perfect hashing function maps the set of keys onto an address space of the same size, it is called a minimal perfect hashing function (Chang, 1984). Although some researchers have proposed algorithms to generate collision-free hashing functions, the computation cost is very high owing to the use of a large number of prime number multiplications (Chang, 1984). To cope with this problem, the MPIP algorithm employs a minimal perfect hashing function to produce collision-free hash tables, reducing the time needed for scanning the entire database and for searching data items in a very large hash tree. Moreover, by employing the minimal perfect hashing function together with a transaction pruning strategy, MPIP is capable of reducing the execution cost without increasing the memory space requirement.

Let the candidate 1-itemset C1 = {X1, X2, ..., XN}, where Xi is the identification of the i-th data item in the database and N is the number of data items. P(Xi) denotes the function that maps the identification of Xi to its position, that is, P(Xi) = i. In a database with real data, a primary key value usually consists of zero or more heading letters followed by a sequential number, such as P001, P002, Q001 and Q002. Some supermarkets employ a compound number to represent a data item, such as 091800003985, where 0918 may represent the classification number of the data item and 00003985 is the serial number. Therefore, it is not difficult to map a data item to a number that represents its position in the 1-itemset. For the example given in Fig. 2, each data item is mapped to a sequential integer, that is, P(A) = 1, P(B) = 2, P(C) = 3, P(D) = 4 and P(E) = 5. The MPIP algorithm is given as follows:

Algorithm MPIP
begin
    s = the minimum support;
    I = {I1, I2, ..., IN}, the set of distinguishable items in the database;
    N = |I|, the number of data items in the database;
    |DB| = the number of transactions in the database;
    k = the number of data items in the frequent itemsets found in the k-th iteration;
    g = the number of the g-th group in the database;
    item_i = the i-th data item in a k-itemset;
    P(item_i) = the position of item_i;
    Minimal perfect hash function for k-itemsets:
        Create a hashing function with C(N, k) hash addresses;  //construct the hash function
Table 1 3-itemsets for the example given in Fig. 1

3-itemset   Group name   Group number g   Inter-Group-Offset                                        Intra-Group-Offset   Hash address
{A B C}     AB           1                0                                                         P(C) - P(B) = 1      1
{A B D}     AB           1                0                                                         P(D) - P(B) = 2      2
{A B E}     AB           1                0                                                         P(E) - P(B) = 3      3
{A C D}     AC           2                5 - P(B) = 3                                              P(D) - P(C) = 1      4
{A C E}     AC           2                5 - P(B) = 3                                              P(E) - P(C) = 2      5
{A D E}     AD           3                5 - P(B) + 5 - P(C) = 5                                   P(E) - P(D) = 1      6
{B C D}     BC           4                5 - P(B) + 5 - P(C) + 5 - P(D) = 6                        P(D) - P(C) = 1      7
{B C E}     BC           4                5 - P(B) + 5 - P(C) + 5 - P(D) = 6                        P(E) - P(C) = 2      8
{B D E}     BD           5                5 - P(B) + 5 - P(C) + 5 - P(D) + 5 - P(C) = 8             P(E) - P(D) = 1      9
{C D E}     CD           6                5 - P(B) + 5 - P(C) + 5 - P(D) + 5 - P(C) + 5 - P(D) = 9  P(E) - P(D) = 1      10
    H_k(item_1^g, item_2^g, ..., item_{k-1}^g, item_k^g)
        = Inter-Group-Offset + Intra-Group-Offset
        = (Σ_{r=1}^{g-1} Inter(item_1^r, item_2^r, ..., item_{k-1}^r)) + Intra(item_k^g)
        = (Σ_{r=1}^{g-1} (N - P(item_{k-1}^r))) + (P(item_k^g) - P(item_{k-1}^g));
    For all transactions t ∈ DB do  //scan the database
    begin
        For all k-subsets (item_1, ..., item_k) of t do
            H_k(item_1, item_2, ..., item_{k-1}, item_k)++;
    end
    L_k = {k-subsets | H_k(item_1, item_2, ..., item_{k-1}, item_k) ≥ s};
    Apriori with pruning:
    For (z = k + 1; L_{z-1} ≠ ∅; z++) do
    begin
        C_z = Apriori-gen(L_{z-1});  //new candidates
        TransTrim[1 ... |DB|] = 0;   //initialize the trimming array
        For all transactions t ∈ DB with TransTrim[t.TID] = 0 do
        begin
            C_t = SubsetCount-Pruning(C_z, z, t);
            //count the occurrences of each candidate in the hash tree,
            //with simple transaction pruning
        end
        L_z = {c ∈ C_z | c.count ≥ s};
    end
end MPIP
Procedure SubsetCount-Pruning(C_z, z, t)
begin
    For all c such that c ∈ C_z and c (= t_{i1} ... t_{iz}) ⊆ t do
    begin
        c.count++;
        for (j = 1; j ≤ z; j++) a[i_j]++;
    end
    TransTrim[t.TID] = 1;
    for (i = 0; i < |t|; i++)
        if (a[i] ≥ z) { TransTrim[t.TID] = 0; break; }
end Procedure

Initially, a hash function H_k is constructed for the k-th iteration based on the multi-phase indexing function. The hash address of each k-itemset is the sum of an Inter-Group-Offset and an Intra-Group-Offset. The Inter-Group-Offset of the g-th group represents the accumulated number of itemsets in the groups from the 1st to the (g-1)-th. For k-itemsets, the Inter-Group-Offset of group g can be
obtained by computing Σ_{r=1}^{g-1} (N - P(item_{k-1}^r)), where N is the total number of data items and item_{k-1}^r represents the (k-1)-th data item of the r-th group. Considering the example of 3-itemsets given in Table 1, we have k = 3, N = 5, I1 = A, I2 = B, I3 = C, I4 = D and I5 = E, and hence P(A) = 1, P(B) = 2, P(C) = 3, P(D) = 4 and P(E) = 5. The Inter-Group-Offset of the 5th group (g = 5) is obtained by Σ_{r=1}^{4} (N - P(item_2^r)) = (5 - P(B)) + (5 - P(C)) + (5 - P(D)) + (5 - P(C)) = 3 + 2 + 1 + 2 = 8.
The Intra-Group-Offset represents the order of a k-itemset inside its group, and is obtained by computing P(item_k) - P(item_{k-1}). For example, the Intra-Group-Offset of {B C D} is P(D) - P(C) = 4 - 3 = 1, and the Intra-Group-Offset of {B C E} is P(E) - P(C) = 5 - 3 = 2. Therefore, the hash address of {B D E} can be obtained as follows:

    H_3(B, D, E)
        = (Σ_{r=1}^{4} Inter(item_1^r, item_2^r)) + Intra(item_3^5)
        = (Σ_{r=1}^{4} (5 - P(item_2^r))) + (P(item_3^5) - P(item_2^5))
        = (5 - P(B)) + (5 - P(C)) + (5 - P(D)) + (5 - P(C)) + (P(E) - P(D))
        = 3 + 2 + 1 + 2 + 1 = 9.

Thus, by the generalized MPIP scheme, each k-itemset can be matched efficiently to a unique hash address. The MPIP scheme generates the hashing function for the k-th iteration, which is the bottleneck of the frequent-itemset extraction process; more complicated hash functions with higher memory costs can be generated as k increases. After determining the hashing function, MPIP scans all transactions in the database and tests all possible k-itemsets in each transaction. Unlike Apriori, MPIP needs to build neither candidate k-itemsets nor complicated hash trees. To skip the bottleneck steps in the initial mining phases, MPIP matches k-itemsets to the collision-free hash table and finds the frequent itemsets based on the user-defined minimal support. After extracting the frequent k-itemsets L_k, MPIP continues to find the frequent (k+1)-itemsets, up to the largest frequent itemsets of the database, by employing the Apriori algorithm in subsequent iterations. Moreover, MPIP employs a transaction pruning strategy that trims ineffective transactions to enhance the performance.

2. Illustrative Example of Generating Frequent 2-Itemsets by MPIP

Fig. 6 Example of generating frequent 2-itemsets by MPIP

By applying the 2-itemset indexing function H_2(I_1, I_2) to replace the hash function given in Fig. 5, we obtain the new procedure for finding frequent 2-itemsets shown in Fig. 6. Note that only one database scan is needed in this procedure, since the hash table can be used directly to find the frequent 2-itemsets. Moreover, MPIP does not have to be re-executed for any change to the database; instead, only the hash table entries corresponding to the updated, inserted or removed data need to be re-computed. A code sketch of the indexing function follows.
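To make the indexing function concrete, the following Python sketch (our own rendering, not the authors' code) computes H_k from the item positions P; the preceding groups are enumerated naively for clarity, although a closed form clearly exists. It reproduces the addresses of Table 1 and the one-scan construction of the collision-free H2 table of Fig. 6.

from itertools import combinations

def mpip_address(itemset, P, N):
    # Hash address of a k-itemset (k >= 2): the Inter-Group-Offset sums
    # N - P(last prefix item) over all lexicographically preceding
    # (k-1)-item groups, and the Intra-Group-Offset is
    # P(item_k) - P(item_{k-1}).
    pos = tuple(sorted(P[x] for x in itemset))
    k, prefix = len(pos), pos[:-1]
    inter = sum(N - g[-1]
                for g in combinations(range(1, N + 1), k - 1) if g < prefix)
    return inter + (pos[-1] - pos[-2])

P = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}   # item positions, N = 5
print(mpip_address(("B", "D", "E"), P, 5))     # 9, matching Table 1

# One scan then counts every 2-itemset without collisions, since each of
# the C(5, 2) = 10 addresses belongs to exactly one pair (cf. Fig. 6).
DB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
H2 = [0] * 11                                  # addresses 1..10, 1-indexed
for t in DB:
    for pair in combinations(sorted(t), 2):
        H2[mpip_address(pair, P, 5)] += 1

s = 2
L2 = [p for p in combinations("ABCDE", 2) if H2[mpip_address(p, P, 5)] >= s]
print(L2)   # [('A', 'C'), ('B', 'C'), ('B', 'E'), ('C', 'E')]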
In addition, it can be seen that MPIP generates L2 after one database scan, without the need to generate candidate itemsets in the 1st and 2nd iterations; that is, the time needed for hash tree search and construction is also avoided. This is why MPIP has a more efficient and stable performance in the initial phases of mining association rules.

3. Pruning Strategy in MPIP

MPIP constructs a transaction-trimming array TransTrim[1 ... n] to record the effectiveness of each transaction in the database, where n is the number of transactions. For each transaction t, TransTrim[t] is initially set to 0, and the hash tree of Ck is traced to determine a[i], the number of occurrences of the i-th data item of t in the k-subsets of t that are in Ck: if the i-th data item belongs to a k-subset that is a member of Ck, a[i] is increased by one. For example, transaction TID100 = {A C D}, and hence a[0] represents the occurrences of A, a[1] represents the occurrences of C, and a[2] represents those of D. The 2-subsets of TID100 are {A C}, {A D} and {C D}, among which only {A C} is in C2; therefore, we have a[0] = 1, a[1] = 1, and a[2] = 0. If any a[i] ≥ k, the transaction may still contain a candidate (k+1)-itemset, and hence TransTrim[t.TID] is set to 0; otherwise, TransTrim[t.TID] is set to 1. For example, for transaction TID200 = {B C E}, since {B C}, {B E} and {C E} of TID200 are all in C2, we have a[0] = 2, a[1] = 2 and a[2] = 2, and hence TransTrim[TID200] = 0. After determining the TransTrim[t.TID] value of each transaction, MPIP only needs to scan the transactions with TransTrim[t.TID] = 0, instead of reconstructing a new trimmed database in each iteration. An illustrative example of trimming transactions by MPIP is given in Fig. 7, and the bookkeeping is sketched below.
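The trimming bookkeeping can be rendered as the following illustrative Python fragment, under our own naming; as in the paper, 0 marks a transaction that must still be scanned and 1 marks one that can be skipped.

from itertools import combinations

def mark_trimmable(db, Ck, k):
    # trans_trim[tid] = 1 if the transaction can be skipped in later
    # scans, 0 if it may still contain a candidate (k+1)-itemset.
    trans_trim = {}
    for tid, t in db.items():
        a = {item: 0 for item in t}
        for subset in combinations(sorted(t), k):
            if subset in Ck:            # subset counted as a candidate
                for item in subset:
                    a[item] += 1
        # Keep t only if some item lies in at least k candidate k-subsets.
        trans_trim[tid] = 0 if any(c >= k for c in a.values()) else 1
    return trans_trim

DB = {100: {"A", "C", "D"}, 200: {"B", "C", "E"},
      300: {"A", "B", "C", "E"}, 400: {"B", "E"}}
C2 = {("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")}
print(mark_trimmable(DB, C2, 2))   # {100: 1, 200: 0, 300: 0, 400: 1}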
Fig. 7 Illustrative example of trimming transactions by MPIP

Fig. 8 Experiment on T5.I2.D100k (execution time in minutes vs. minimum support (%) for Apriori, DHP and MPIP)

Fig. 9 Experiment on T10.I4.D100k (execution time in minutes vs. minimum support (%) for Apriori, DHP and MPIP)
IV. EXPERIMENTS AND EVALUATION

To evaluate the performance of MPIP, several experiments were conducted on an HP LH3000 NetServer equipped with two PIII 1 GHz processors and 4 GB of memory. One processor and 1.5 GB of memory were allocated to execute the Java-based programs for the experiments. In our experiments, the original data generation program of Apriori (Agrawal and Shafer, 1996) was used to generate databases with the following parameters:

N: number of data items
|D|: number of transactions in the database
|T|: average size of the transactions
|I|: average size of the maximal potentially frequent itemsets used to generate the datasets
|L|: number of maximal potentially frequent itemsets

1. Comparison of the MPIP, DHP and Apriori Algorithms

Several experiments were conducted to compare the performance of MPIP, Apriori and DHP. We adopted the same pruning strategy in MPIP and DHP to show that the multi-phase indexing function is more efficient and stable than the hashing function of DHP. Figs. 8, 9 and 10 show the execution time of each algorithm on the three databases T5.I2, T10.I4 and T20.I6, each containing 100,000 transactions (D100K).
Fig. 10 Experiment on T20.I6.D100k (execution time in minutes vs. minimum support (%) for Apriori, DHP and MPIP)
It can be seen that MPIP and DHP achieve much better performance than Apriori, especially as the size of the database grows. Such experimental results are reasonable, since Apriori needs to maintain a large number of candidate itemsets in the earlier phases, which leads to a large and inefficient hash tree. On the contrary, DHP can reduce the number of candidate itemsets and prune ineffective transactions for the next phase, and MPIP can obtain the frequent itemsets directly, without maintaining a hash tree or scanning the database twice.
Fig. 11 Performance of MPIP and DHP with increasing number of transactions (execution time in minutes vs. number of transactions (K) for T10.I4, T15.I4 and T20.I4)
Figure 11 shows the performance of DHP and MPIP with the number of transactions in the database ranging from 100K to 1 million. It can be seen that the performance of MPIP becomes better than that of DHP as the number of transactions increases. Moreover, the curve of MPIP has a gentler slope than that of DHP; in other words, MPIP executes stably on large-scale databases. Figure 12 shows the performance of MPIP and DHP with the number of data items ranging from 1000 to 10000 on the databases T5.I2 and T10.I4 with 100K transactions; the minimal support was set to 0.75. It can be seen that MPIP achieves more stable and more efficient performance than DHP.

2. MPIP with Update Operations

In the real world, the number of transactions in a database can increase rapidly at any moment, and the decision maker may need to access the most up-to-date information. With Apriori or DHP, the data mining procedure may need to be re-executed whenever new transactions are inserted into the database, which could be time-consuming and might delay access to the most up-to-date information. As MPIP employs a perfect hashing function to count the number of occurrences of each possible combination of the itemsets, it can cope with the frequent-update problem by re-hashing the itemsets of each updated transaction into the hash table and updating the corresponding numbers of occurrences directly, instead of re-mining the entire database. Therefore, in MPIP, only the relevant itemsets of the updated transactions need to be checked to determine whether they are frequent itemsets.
Fig. 12 Performance of MPIP and DHP with increasing number of data items (execution time in minutes vs. number of distinguishable data items for T5.I2.D100k and T10.I4.D100k)

Fig. 13 Performance of MPIP with 100K new-coming transactions (execution time in minutes vs. database size (K) plus 100K update data, T20.I4; MPIP-Update vs. MPIP-NonUpdate)
To evaluate the performance of applying MPIP to frequently updated databases, we used the T20.I4 database, with the original number of transactions ranging from 100K to 1000K and 100K new-coming transactions, to test the performance of MPIP with update operations. Fig. 13 shows the experimental results, comparing the performance of MPIP with and without the update operations. It can be seen that MPIP with update operations significantly improves the performance by dealing only with the new-coming transactions rather than re-mining the entire database, and hence the most up-to-date information can be derived on time. The update path is sketched below.
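A sketch of this update path, under our own assumptions about the persisted data structures (the table here is keyed directly by the 2-itemset, which is equivalent to the collision-free H2 table of Section III):

from itertools import combinations

H2 = {}   # persisted bucket counts, keyed directly by the 2-itemset

def apply_update(table, transaction, delta=1):
    # Fold one inserted (delta = +1) or deleted (delta = -1) transaction
    # into the hash table; no other bucket or transaction is touched.
    for pair in combinations(sorted(transaction), 2):
        table[pair] = table.get(pair, 0) + delta

for t in [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]:
    apply_update(H2, t)               # initial mining pass over Fig. 1

apply_update(H2, {"A", "B", "E"})     # one new-coming transaction

s = 2
# Only the 2-subsets of the updated transaction need re-checking.
changed = list(combinations(sorted({"A", "B", "E"}), 2))
print([p for p in changed if H2[p] >= s])
# [('A', 'B'), ('A', 'E'), ('B', 'E')] -- {A B} and {A E} become frequent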
Fig. 14 Size of memory needed in MPIP in determining frequent 2-itemsets (T20.I4; memory size in megabytes vs. number of data items)
V. MEMORY ISSUES IN MPIP AND DHP

Another concern in applying a data mining approach is the size of memory required, that is, the memory needed for maintaining the hash table and the hash tree. Without loss of generality, we shall discuss the memory needed for the most critical phase, i.e., the phase for determining the frequent 2-itemsets. Fig. 14 shows the size of memory needed by MPIP in determining the frequent 2-itemsets. The solid line depicts the memory space needed for maintaining the hash table, while the dashed line indicates the wasted memory space that could be avoided by performing an additional database scan. It can be seen that the wasted memory space is nearly 25% of the hash table; there is thus a trade-off between the time needed for performing an additional scan and the additional memory space needed for maintaining the hash table. As MPIP attempts to avoid additional database scans by hashing the 1-itemsets and 2-itemsets of the transactions at the same time, it requires additional memory space for maintaining the hash table. For MPIP, the memory space of the hash table is C(N, 2) = N × (N - 1)/2 entries, where N is the number of data items. It can be seen that the memory size needed by MPIP depends only on the number of data items N, and is not influenced by the number of transactions in the database. Moreover, the size of the hash tree in MPIP is zero, since MPIP does not maintain a hash tree. For DHP, as reported by Park et al. (1997), at least C(N, 2) memory needs to be allocated for maintaining the hash table; otherwise, the performance will approach that of Apriori as the number of transactions in the database increases. For Apriori, the memory needed depends on the number of candidate itemsets, that is, at least C(N, 2) memory space is needed; when the minimal support s is small, the hash tree of Apriori can grow extremely large and require even more memory space than C(N, 2). To sum up, MPIP can achieve better performance than DHP and Apriori with the same memory requirement.
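For a sense of scale, a back-of-the-envelope computation: the bucket count C(N, 2) comes from the text, while the item count and the 4-byte counter width below are our own assumptions.

N = 10_000                         # assumed number of distinct data items
buckets = N * (N - 1) // 2         # C(N, 2) = 49,995,000 buckets
bytes_per_counter = 4              # assumed 32-bit occurrence counters
print(buckets * bytes_per_counter / 2**20)   # about 190.7 MB, regardless
                                             # of the transaction count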
VI. CONCLUSIONS AND FUTURE WORKS

In this paper, we propose an efficient algorithm, MPIP, for finding frequent k-itemsets. MPIP uses a minimal perfect hashing scheme to replace the hash tree used in previously proposed methods. MPIP has several advantages:

(1) MPIP is very efficient. It generates the frequent 2-itemsets directly from one scan over the database, without generating C1, L1 and C2; moreover, it replaces the hash tree with a hash table to reduce the tree-search cost. It is therefore suitable for handling a very large database with a huge number of transactions.
(2) The performance of MPIP is stable when applied to different kinds of databases.
(3) The memory size needed by the MPIP scheme depends only on the number of data items. In other words, we can estimate the memory size needed, and thus assess the performance, before starting the mining operations.
(4) The MPIP algorithm is simpler and easier to implement than other efficient algorithms, such as DHP.
(5) The MPIP algorithm can be integrated with other database pruning or scan-reduction methods (e.g., the pruning method of DHP) to achieve better performance under different circumstances.
(6) MPIP achieves good performance on a database with new-coming transactions, and hence can provide up-to-date information to decision makers.

Currently, we are applying MPIP to several data mining problems, including the mining of a portfolio database of an e-learning system.

ACKNOWLEDGMENTS

This study is supported in part by the National Science Council of the Republic of China under contract numbers NSC-93-2524-S-009-004-EC3 and NSC-93-2524-S-260-003.

REFERENCES

Agrawal, R., Imielinski, T., and Swami, A., 1993, “Mining Association Rules between Sets of Items in Large Databases,” Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., USA, pp. 207-216.
Agrawal, R., and Shafer, J. C., 1996, “Parallel Mining of Association Rules,” IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 962-969.
Aggarwal, C. C., and Yu, P. S., 2001, “A New Approach to Online Generation of Association Rules,” IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 4, pp. 527-540.
Aly, H. H., Amr, A. A., and Taha, Y., 2001, “Fast Mining of Association Rules in Large-Scale
Problems,” Proceedings of the Sixth IEEE Symposium on Computers and Communications, Hammamet, Tunisia, pp. 107-113.
Berry, M. J. A., and Linoff, G., 1997, Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons, Inc., New York, NY, USA.
Chang, C. C., 1984, “The Study of an Ordered Minimal Perfect Hashing Scheme,” Communications of the ACM, Vol. 27, No. 4, pp. 384-387.
Cheung, D. W., Ng, V. T., Fu, A. W., and Fu, Y., 1996, “Efficient Mining of Association Rules in Distributed Databases,” IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 911-922.
Chen, M. S., Han, J., and Yu, P. S., 1996, “Data Mining: An Overview from a Database Perspective,” IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883.
Han, E. H., Karypis, G., and Kumar, V., 2000, “Scalable Parallel Data Mining for Association Rules,” IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 3, pp. 337-352.
Han, J., and Fu, Y., 2000, “Mining Multiple-Level Association Rules in Large Databases,” IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 5, pp. 798-805.
Han, J., Lakshmanan, L. V. S., and Ng, R. T., 1999, “Constraint-Based, Multidimensional Data Mining,” IEEE Computer, pp. 46-50.
Hipp, J., Guntzer, U., and Nakhaeizadeh, G., 2000, “Algorithms for Association Rule Mining - A General Survey and Comparison,” ACM SIGKDD Explorations, Vol. 2, No. 1, pp. 58-64.
Houtsma, M., and Swami, A., 1995, “Set-Oriented Mining for Association Rules in Relational Databases,” Proceedings of the 11th International Conference on Data Engineering, Taipei, Taiwan, pp. 25-33.
Liu, B., Hsu, W., Chen, S., and Ma, Y., 2000, “Analyzing the Subjective Interestingness of Association Rules,” IEEE Intelligent Systems, Vol. 15, No. 5, pp. 47-55.
Park, J. S., Chen, M. S., and Yu, P. S., 1997, “Using a Hash-Based Method with Transaction Trimming and Database Scan Reduction for Mining Association Rules,” IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 5, pp. 813-825.
Pasquier, N., Bastide, Y., Taouil, R., and Lakhal, L.,
1999, “Discovering Frequent Closed Itemsets for Association Rules,” Proceedings of the 7th International Conference on Database Theory, Jerusalem, Israel, pp. 398-416.
Sarawagi, S., Thomas, S., and Agrawal, R., 1998, “Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications,” Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, Washington, USA, pp. 343-354.
Wang, W., Yang, J., and Yu, P. S., 2000, “Efficient Mining of Weighted Association Rules (WAR),” Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, pp. 270-274.
Zaki, M. J., 2000a, “Generating Non-Redundant Association Rules,” Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, pp. 34-43.
Zaki, M. J., 2000b, “Scalable Algorithms for Association Mining,” IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 3, pp. 372-390.
Zaki, M. J., Li, S. P. W., and Ogihara, M., 1997, “Evaluation of Sampling for Data Mining of Association Rules,” Proceedings of the Seventh International Workshop on Research Issues in Data Engineering, Birmingham, UK, pp. 42-50.
Zaki, M. J., and Hsiao, C. J., 1999, “CHARM: An Efficient Algorithm for Closed Association Rule Mining,” Technical Report, Computer Science, Rensselaer Polytechnic Institute, NY, USA.

Manuscript Received: Aug. 16, 2004
Revision Received: Mar. 20, 2005
Accepted: May 27, 2005