
A Framework for Efficient Scalable Mining of Rule Variants

Kok-Leong Ong    Wee-Keong Ng    Ee-Peng Lim

Nanyang Technological University, Nanyang Ave, N4-B3C-14, Singapore 639798, SINGAPORE
[email protected]

Abstract
Association rule mining is an important data mining problem. Since its inception, different variants of rules have been proposed in the literature. In each case, different attributes (e.g., weight and quantity) are considered to obtain more informative rules. To our knowledge, each proposal is based on the Apriori algorithm, which is, in the modern context, inefficient. Methods that outperform Apriori (e.g., FP-Growth and DiffSets) are restricted to the discovery of plain vanilla rules, and do not adapt well to mining other variants. In this paper, we present a unifying framework for mining variants of rules that separates the scalability and performance aspects of Apriori-based algorithms from the constraints of mining each specific variant. This framework is easy to instantiate with the algorithms proposed to date, and supports new algorithms for future variants. More importantly, it retains the simplicity of the Apriori algorithm, lifts its performance to that of FP-Growth, and maintains a simple scalability model.

1. Introduction
Association rules [1] are, in the present era of data mining, a well-researched domain. The insights produced are easy to comprehend and apply well to several classes of business problems [8]. As a result, several variants of rules have been proposed [5, 9, 10, 11, 16, 17]. Each variant is a rule that considers certain attributes in the process of discovery, and presents a more informative set of insights than its plain-vanilla cousin. Examples of such attributes include weights [16], quantities [17], and multiple concept levels [4]. An interesting observation among these variants is that each has its own algorithm for discovering rules containing the attribute of interest. While these algorithms are unique, they are also similar. Each algorithm is unique by the merit of the attribute under consideration, and all are similar due to the "generate and test" paradigm popularized by the Apriori algorithm proposed in [1]. Since derivative

works on these variants were based on Apriori, they inherit the performance bottlenecks of their parent. In recent years, efficient techniques have been proposed to overcome the limitations of Apriori. These include FP-Growth [6] and DiffSets [18], which are both efficient and scalable. However, these modern techniques are constrained to mining plain vanilla rules. In other words, these new algorithms are not flexible enough to support the mining of rules that consider attributes such as those discussed earlier. From our own experience [12], extending them results in awkward implementations with poorer performance than the original design. Without support for other rule variants, the mining of informative insights would be lost. Clearly, losing informative insights in favor of speed defeats the original purpose of knowledge discovery, and adapting the modern algorithms for each variant would be a daunting task that is both costly and time consuming. This is especially true given the complexity of these modern mechanisms for discovering association rules.

In this paper, we present a framework for scaling up the existing algorithms proposed by the authors of each variant. This general framework maintains the simplicity of the Apriori model (i.e., the "generate and test" process) while improving its performance and scaling up its ability to handle large data sets. The main insight, based on an analysis made by [7] and on our own experience, is Apriori's dependency on the database size and the transaction size. Given an itemset, the time to determine its support is primarily bounded by the size of each transaction and the number of transactions in the database. With large databases and long transactions, Apriori-based algorithms hence suffer heavy performance penalties. One way to reduce this cost is to minimize the number of candidates to be tested. The other is to reduce the cost of testing each candidate. Modern mechanisms reduce the cost of testing each candidate by constructing a data structure containing meta-data about the database. Using this information, the candidates and their supports can be derived quickly. The disadvantage of this approach lies in

the rigid nature of the data structure: it is optimized for a particular variant, which makes it difficult to fit onto similar algorithms. Our approach is to separate the scalability and performance issues of data mining from the constraints imposed by the attribute(s). By decoupling them from the algorithmic process, any scalability or performance enhancement translates into a variant-wide improvement. This is what modern mechanisms are missing, i.e., they tend to tie the discovery and performance issues together, and this is what makes the modern mechanisms a poor fit for the algorithm of each variant. The kernel of our framework is the construction of a transaction graph (T-Graph), which is a compact representation of the database. Rather than constructing a meta-data structure about the database, the T-Graph is a re-organized database that is compact and efficient for deriving the support count of a given candidate. Its construction cost is low, and it naturally supports partitioning, which is a key factor in scaling Apriori-based algorithms to large data sets. More importantly, since the T-Graph is not a meta-data structure, it can be used to discover any variant of association rules without modification.

The rest of the paper is organized as follows. In the next section, we present the related work. Section 3 then introduces the unifying framework. Here, we discuss the idea of a unifying approach that adjusts existing algorithms for performance and scalability on large data sets. We next discuss the T-Graph, the underlying mechanism for enhancing Apriori-based algorithms, in Section 4. Section 5 compares the performance of our framework against Apriori and FP-Growth. Our experimental results show, on average, an order of magnitude of speedup against Apriori, which brings our proposal on par with FP-Growth. Finally, we conclude our discussion in Section 6.

2. Related Work
Since Apriori was proposed in 1994, a large body of work in rule mining has addressed efficiency issues. This is largely motivated by the fact that Apriori's performance is not sustainable under varying data conditions, particularly with large databases and long transactions. Initial attempts at breaking the Apriori bottleneck largely addressed the inefficiencies of the algorithm through peephole optimizations; that is, the focus was on efficient implementations. As a result, the performance improvements were minor. Some of these techniques include transaction trimming [14], candidate hashing [14], bitmaps [3] and sampling [15]. Later studies took a bird's eye view of association rule mining. The focus was to look beyond the "generate and test" paradigm and to eliminate the dependencies exhibited by Apriori. This produced several modern mechanisms, including FP-Growth [6] and DiffSets [18], that are typically an order of magnitude

faster than Apriori on large databases. These mechanisms are characterized by the use of meta-data structures (e.g., the FP-Tree in FP-Growth) or novel techniques for support count tabulation (e.g., TID intersection in DiffSets). Most of the time, these methods are more sophisticated than Apriori's direct counting, and are also more scalable to large data sets, a property missing from Apriori's design. Regardless of the view taken, the approaches discussed above have one common characteristic: they tightly couple the discovery process to the meta-data structure or to its support tabulation mechanism. This makes it difficult to fit the methodology onto other rule variants. As a matter of fact, this strict coupling forces the analyst to decide between speed and informative insights, which are, most of the time, opposing factors. Thus, our main contribution is a framework that equips each variant with the performance and scalability of modern mechanisms without giving up the simplicity of the Apriori model. In other words, our framework is capable of raising the performance and scalability of these Apriori-based algorithms to that of FP-Growth with minimum modifications.

3. The Unifying Framework
Figure 1 shows the conceptual framework of any Apriori-based algorithm. In the discovery of frequent itemsets, an Apriori-based algorithm begins by filtering out unique items in the database that fail to meet the minimum support threshold (i.e., minsup). The surviving items are known as frequent 1-itemsets or large 1-itemsets, and are stored in L1 as shown in line 1. Using these frequent itemsets as the seed, we next generate candidate itemsets containing two items each (i.e., C2) as shown in lines 2-3. By using frequent itemsets as the seed, we eliminate the need to test candidates that are guaranteed to fail the current support requirement. The candidates in C2 are then individually tested against each transaction in the database D. Depending on the attributes under consideration, a transaction may declare support for a candidate c, which causes c's support count to be incremented by some value (lines 4-7). Once we complete the pass through the database, the actual support count of each candidate is known. We therefore determine whether it is frequent by evaluating it against the support criteria as shown in line 8. This repeats until no more frequent itemsets are found, and the algorithm then proceeds to generate rules, from these itemsets, that satisfy the confidence requirements.

An important advantage of the "generate and test" approach is the ability to easily embed attribute considerations into the process of rule discovery. As a result, all algorithms for known variants to date are Apriori-based [5, 9, 10, 11, 16, 17]. In favor of speed, modern mechanisms disregard the flexibility of the "generate and test" approach, making their proposals rigid to attribute extensions.

Procedure FindLargeItemsets(Items I, Database D, Support minsup)
01   L1 := FindLargeOneItemsets(D, minsup)
01b  G  := ConstructTGraph(D, L1)                  (framework refinement, Section 3)
02   for (k := 2; L(k-1) is not empty; k := k + 1) do
03     Ck := GenerateCandidates(L(k-1))
04     foreach candidate c in Ck do
05       foreach transaction t in Q(c, G) do       (refinement; originally t in D)
06         if (t supports c) then increment c's support count
07       endfor
08       if (c's support count satisfies minsup) then add c to Lk
09     endfor
10   endfor
11   Answer := the union of all Lk

Figure 1. The conceptual framework for Apriori-based algorithms.
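To make the control flow of Figure 1 concrete, the following minimal Python sketch renders the generate-and-test skeleton, without the line 1b / line 5 refinement. The names find_large_itemsets, generate_candidates, supports and is_frequent are ours, not the paper's; supports and is_frequent stand in for the variant-specific logic of lines 6 and 8.

from itertools import combinations

def find_large_itemsets(db, min_support, supports, is_frequent):
    """Generate-and-test skeleton of Figure 1 (plain Apriori, full scans).

    db          -- list of transactions, each a frozenset of items
    supports    -- supports(t, c): by how much transaction t raises the
                   support count of candidate c (line 6; variant-specific)
    is_frequent -- is_frequent(c, count, min_support): criterion of line 8
    """
    # Line 1: find the frequent 1-itemsets.
    counts = {}
    for t in db:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    large = [{frozenset([i]) for i, n in counts.items() if n >= min_support}]

    k = 2
    while large[-1]:                                        # line 2
        candidates = generate_candidates(large[-1], k)      # line 3
        support = {c: 0 for c in candidates}
        for c in candidates:                                # line 4
            for t in db:                                    # line 5: full database scan
                support[c] += supports(t, c)                # lines 6-7
        large.append({c for c in candidates
                      if is_frequent(c, support[c], min_support)})  # line 8
        k += 1
    return set().union(*large)                              # line 11

def generate_candidates(prev_large, k):
    """Join frequent (k-1)-itemsets and prune candidates that contain an
    infrequent (k-1)-subset (the Apriori property)."""
    out = set()
    for a in prev_large:
        for b in prev_large:
            u = a | b
            if len(u) == k and all(frozenset(s) in prev_large
                                   for s in combinations(u, k - 1)):
                out.add(u)
    return out

For plain-vanilla rules, supports(t, c) is simply 1 when c is contained in t and 0 otherwise, and is_frequent compares the count with min_support.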

We argue against this approach in two ways. First, the contribution of efficient mechanisms should extend easily to other variants of association rules for the benefit of a larger community. Second, the Apriori approach remains an important conceptual framework for the future development of informative rules. Considering the above, we therefore propose that any performance and scalability enhancement be created within the context of this conceptual framework.

However, the key problem with this approach is the need for multiple database scans. When transactions are long and the database is large, testing a candidate becomes costly. One remedy is to test candidates selectively; this is Apriori's own method, where candidates confirmed to be infrequent are eliminated from the coming passes without testing. However, it does not address the cost of testing the candidates that survive the pruning process. Thus, we also need to reduce the cost of testing the remaining candidates. We observe the following characteristics in Apriori-based algorithms.

• For small databases with short transactions, Apriori performs relatively well. In some cases, it is faster than complex mechanisms that carry the overhead of building a meta-data structure.

• Most of the time, an itemset is supported by only a fraction of the transactions in the entire database. In other words, most of the tests for transaction support, and the associated counting, are "wasted".

• Testing short itemsets on long transactions is costly, and in level-wise searches the majority of itemsets are short.

From these three observations, we realize that it is possible to speed up the Apriori algorithm if we can test each itemset efficiently. In other words, if we can reduce the size of the database (by the first and second observations) and shorten the transactions (by the third observation), we will speed up the loops in lines 4-9. Of course, the hypothesis here is that we are able to reduce the number of transactions, and the number of items each transaction contains, at a cost that is negligible or very small compared to the full scans made on the original database.

To do that, the framework refinement is shown in Figure 1 as line 1b and line 5. Line 1b is the additional step made after the frequent 1-itemsets are found. Here, we use the frequent 1-itemsets to construct a compact database representation G. This representation is then used in line 5, which is modified to scan a downsized database Q(c, G) that is constructed from the information in G under the constraints of the candidate c. At this point, we assume that G is some efficient database representation that allows retrieval of the fraction of transactions that are related to c. Any transaction not related to c in any way will not be returned by Q(c, G), which removes the need for a complete database scan. At the same time, we also assume that the transactions returned by Q(c, G) have had unrelated items removed, so that each is in its shortest possible form. With fewer transactions and shorter transactions, Apriori-based algorithms gain significant speedup. The problem of performance and scalability therefore reduces to the challenge of realizing G and Q(c, G).

To conclude this section, notice that the framework refinement is minimal and direct. In essence, it retains the structure of the algorithm, and does not disturb any attribute constraints encoded in line 3 (i.e., GenerateCandidates()), line 6 (i.e., the conditions for a transaction t to support c, and by how much to increment c's support), or line 8 (i.e., the criteria for an itemset to be called frequent). In addition, the framework refinement also respects the database by creating G as a compact representation of D. This ensures that the construction of G remains consistent throughout, and therefore needs to be implemented only once. We now discuss G, the transaction graph (T-Graph), and how it can be used to realize Q(c, G), in the next section.
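The refined inner loop of lines 4-7 can then be sketched as follows; query stands for Q(c, G) and, like supports, is a hypothetical hook rather than an interface defined in the paper.

def count_support(candidates, G, query, supports):
    """Refined lines 4-7: each candidate is tested only against the trimmed
    transactions returned by Q(c, G), never against the whole database."""
    support = {c: 0 for c in candidates}
    for c in candidates:
        for t in query(c, G):            # Q(c, G): relevant, trimmed transactions
            support[c] += supports(t, c)
    return support

# Plain-vanilla case: a transaction supports a candidate iff it contains all
# of the candidate's items, and the increment is simply 1.
plain_supports = lambda t, c: 1 if set(c) <= set(t) else 0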

4. Transaction Graph

Let D = {t_1, t_2, ..., t_|D|} be a transaction database, where each t in D is a transaction with a unique ID (i.e., TID) that contains a set of items drawn from I = {i_1, i_2, ..., i_n}. Each item i in I has a set of attributes A = {a_1, a_2, ..., a_k}, where each a in A is an attribute of an item that has a value in the context of a transaction or of the database. A transaction graph (or T-Graph in short) is a graph-based structure defined with the following characteristics.

1. It consists of a node table with two fields, node-label and node-link, where the node-link is an edge from the node table to the node representing the item whose literal is the node-label.

2. Each node in the graph contains three fields, node-label, in-link, and out-link, where node-label registers which item the node represents, in-link references the incident edges, and out-link references its outgoing edges.

3. Each node also contains a lookup table holding attributes of the item it represents. Attributes stored on the node of an item have values that are consistent throughout the database, and are known as uniform attributes (denoted a(i, D)).

4. Each edge in the graph contains two fields, t-tids and n-tids. The t-tids field holds the TIDs of transactions having the incident node as their last item (when the items are ordered lexicographically by their label). Otherwise, the TIDs are placed in the n-tids field.

5. Each TID stored in an edge has a lookup table that stores attributes of the item the edge is incident on. These attributes take different values in the context of each transaction, and are known as transaction dependent attributes (denoted a(i, t)).
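One possible in-memory rendering of items 1-5, sketched here in Python; the class and field names are ours and purely illustrative.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Edge:
    """An edge incident on `child` (items 4-5): TIDs are kept in two
    compartments, and each TID maps to the transaction dependent
    attributes a(i, t) of the item the edge is incident on."""
    child: "Node"
    t_tids: Dict[int, dict] = field(default_factory=dict)  # child is the transaction's last item
    n_tids: Dict[int, dict] = field(default_factory=dict)  # child appears earlier in the transaction

@dataclass
class Node:
    """One node per item (items 2-3); `uniform` holds the uniform attributes a(i, D)."""
    node_label: str
    uniform: dict = field(default_factory=dict)
    in_links: List[Edge] = field(default_factory=list)        # incident edges
    out_links: Dict[str, Edge] = field(default_factory=dict)  # outgoing edges, keyed by child label

# Item 1: the node table maps an item literal (node-label) to its node-link edge.
NodeTable = Dict[str, Edge]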

4.1. Construction Algorithm
Let us illustrate the construction of the T-Graph (the algorithm is shown in Figure 2) with an example. Figure 3 shows a transaction database where each item has two attributes: Qty, which stores the number of times an item appears in a transaction, and Weight, which indicates the importance of an item relative to the others. The combination of these two attributes forms a variant that has not been discussed in the existing literature, and intuitively yields a more informative set of rules when applied to the supermarket scenario [13]. Following the algorithm in Figure 2, each transaction read is first sorted by its item literals. In the case of the first transaction t_1, we have (after sorting) {a(1, 0.1), b(2, 0.3), d(1, 0.8), e(1, 0.9)}. The first number in the brackets indicates the quantity and the second its weight. Therefore, Qty(a, t_1) = 1 and Qty(a, t_2) = 4. Notice that this attribute value differs among transactions and hence is a transaction dependent attribute. On the other hand, the weight given to an item i is a uniform attribute, since for all x, y in {1, ..., |D|}, Weight(i, t_x) = Weight(i, t_y). The ordering is important since each path through the T-Graph follows this order, and multiple transactions with similar prefixes can be merged as long as their specific details are maintained. Since the T-Graph is initially empty, the first item inserted from t_1 is a. According to the algorithm, an entry for a is added to the node table. Then a node n_a and an edge e_a representing a are created. The edge is initialized in the node-link field of a and points to n_a.

Procedure ConstructTGraph(Database D, NodeTable N)
01   foreach transaction t in D do
02     foreach lexicographically ordered item i in t do
03       if (i is not in N) then
04         n  := CreateNode(i)
05         e  := CreateEdge(i, N, n)              (the node-link edge of i)
06         add the entry (i, e) to N
07         store the uniform attributes a(i, D) on n
08         if (i is the first item in t) then
09           AssignTIDs(e)
10         else
11           e' := CreateEdge(i, n', n)           (n': node of the item preceding i in t)
12           AssignTIDs(e')
13         endif
14       else
15         n := the node registered for i in N
16         if (i is the first item in t) then
17           AssignTIDs(the node-link edge of i)
18         else
19           e' := GetEdge(n', n)
20           if (e' is NULL) then
21             e' := CreateEdge(i, n', n)
22           endif
23           AssignTIDs(e')
24         endif
25       endif
26     endfor
27   endfor

Procedure AssignTIDs(Edge e)
If i is the last item in t, then add the TID of t (and the transaction dependent attributes a(i, t)) to e.t-tids; otherwise, add them to e.n-tids.

Procedure CreateEdge(Item i, Parent n_p, Child n_c)
Creates an edge for i that emits from n_p and is incident on n_c.

Procedure GetEdge(Parent n_p, Child n_c)
Returns the edge connecting nodes n_p and n_c; it returns NULL if none exists.

Figure 2. T-Graph Construction algorithm.
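Continuing the Edge/Node/NodeTable sketch from above, a compact and simplified Python rendering of the construction algorithm might look as follows; the input format and the uniform parameter are our assumptions, not the paper's.

def construct_tgraph(db, uniform=()):
    """Sketch of ConstructTGraph (Figure 2).

    db      -- iterable of (tid, txn) pairs; txn maps an item literal to its
               attribute values, e.g. {"a": {"Qty": 1, "Weight": 0.1}, ...}
    uniform -- names of the uniform attributes a(i, D), stored once per node;
               every other attribute is treated as a(i, t) and stored per TID.
    Returns (node_table, nodes)."""
    nodes = {}   # item literal -> Node
    table = {}   # node table: item literal -> node-link Edge
    for tid, txn in db:
        ordered = sorted(txn)                       # lexicographic order (line 02)
        prev = None
        for item in ordered:
            attrs = dict(txn[item])
            if item not in nodes:                   # lines 03-07: new node + node-link edge
                nodes[item] = Node(node_label=item)
                table[item] = Edge(child=nodes[item])
            node = nodes[item]
            node.uniform.update({a: attrs.pop(a) for a in uniform if a in attrs})
            if prev is None:                        # first item of the transaction
                edge = table[item]
            else:                                   # edge from the previous item's node
                edge = nodes[prev].out_links.get(item)
                if edge is None:                    # GetEdge returned NULL: create the edge
                    edge = Edge(child=node)
                    nodes[prev].out_links[item] = edge
                    node.in_links.append(edge)
            # AssignTIDs: the transaction's last item goes into t-tids.
            compartment = edge.t_tids if item == ordered[-1] else edge.n_tids
            compartment[tid] = attrs
            prev = item
    return table, nodes

Run on the first transaction of Figure 3 with uniform=("Weight",), this produces the chain a -> b -> d -> e, with TID 1 in the n-tids of the first three edges and in the t-tids of the edge incident on e, matching the walkthrough in the text.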

The attributes (i.e., Weight(a, D) and Qty(a, t_1)) are stored in n_a and e_a respectively. Since item a is the first item of t_1, and there are still items to be processed, we enter the TID (and the transaction dependent attributes) into e_a.n-tids. Subsequent items of t_1 are entered in the same way, except for item e, whose details are entered in the t-tids field instead of the n-tids field. The rationale for putting TIDs, and hence their associated attributes, in separate compartments is to allow efficient determination of the related transactions for a given candidate. This will be elaborated in Section 4.4.

Upon completing t_1, we repeat the process for the next transaction t_2. We process item a first, as it is the first item in t_2. Since a node for a has already been created, we simply include the TID and the transaction dependent attributes in the edge incident on a. Notice that the algorithm neither repeats the inclusion of the uniform attributes nor creates a new node. Instead, all edges share a common node and all uniform attributes are included only once. This admits a more efficient use of the available memory for the nodes and tagged attributes, something many existing proposals missed. Continuing with the remaining transactions, we construct the complete T-Graph shown in Figure 4. For clarity, we have excluded the explicit representation of the attributes of items that are stored in the respective nodes and edges. In the case of our example, node a contains the weight attribute (i.e., Weight(a, D) = 0.1). On the edges, each TID has a lookup table that stores the transaction dependent attributes of the item the edge is incident on. For example, the edge from node a to node b carries the two transactions 1 and 4. Since the edge is incident on b, it stores the values of Qty(b, t_1) and Qty(b, t_4). Finally, the TIDs on the left of '|' are stored in the n-tids and the TIDs on the right in the t-tids. Hence, in the case of the edge connecting node b and node d, TIDs 1, 3 and 4 are stored in n-tids and TID 6 is stored in t-tids.

TID   Items (Qty, Weight)
1     b(2, 0.3), a(1, 0.1), e(1, 0.9), d(1, 0.8)
2     a(4, 0.1), d(9, 0.8), e(2, 0.9)
3     d(6, 0.8), b(3, 0.3), e(2, 0.9)
4     d(2, 0.8), e(1, 0.9), a(4, 0.1), b(8, 0.3)
5     e(2, 0.9), a(1, 0.1), c(2, 0.4)
6     b(3, 0.3), d(6, 0.8)
7     d(1, 0.8), c(7, 0.4)

Figure 3. A transaction database where items have two attributes – quantity and weight.

Figure 4. The complete T-Graph constructed for the transaction database in Figure 3.

4.2. Cost Analysis
We analyze the cost of constructing the T-Graph in three aspects: I/O, space and time. The T-Graph requires exactly one scan of the database for its construction. However, if the support threshold is known beforehand, an additional scan may be included to find all frequent 1-itemsets so that a smaller T-Graph can be constructed. In either case, this caps the I/O cost of the T-Graph at a maximum of two passes, which puts the T-Graph on par with the FP-Tree, the meta-data structure created by FP-Growth.

Lemma 1 Given a database containing n unique items, the number of nodes created in the T-Graph is at most n.

Proof: Suppose more than n nodes exist in the T-Graph. Since there are only n unique items, there must exist a node with the same node-label as another. However, by definition of

the algorithm, the creation of this node will not happen. This is because a node is created if and only if its literal is missing from the node table. Hence, an existing node cannot be created again and, therefore, at most n uniquely labelled nodes can exist in the T-Graph.

In terms of node cost, each new item added to the T-Graph results in exactly one new node being created. For the FP-Tree, the number of nodes depends on the characteristics of the database. This is also the reason why the FP-Tree needs to make two I/O scans: by finding the frequent 1-itemsets in the first scan and performing the construction in the second, the FP-Tree increases the overlap of prefixes. This overlapping in turn reduces duplicate nodes, which gives the FP-Tree its compact representation. However, in most cases the FP-Tree will still contain some duplicate nodes due to the way transaction paths are created. Hence, the T-Graph shines in this aspect, as it uses the minimum number of nodes.


Lemma 2 Given a database containing n unique items, the upper bound on the number of edges in the T-Graph is $\sum_{j=1}^{n-1}(n-j) + n$.

Proof: For each transaction inserted into the T-Graph, all items are sorted according to their lexicographical order before insertion. Since items are processed in this manner, an edge emitted from a node is only incident on another node whose node-label is lexicographically ordered after its own. Hence, if i and j in I are the first and second items in this ordering, i can potentially have edges to the remaining n-1 items. Item j has one possible edge fewer than i (i.e., n-2), since an edge to i will not be created, by definition of the algorithm. Hence, the edges among the nodes contribute at most $\sum_{j=1}^{n-1}(n-j)$ edges. The remaining n edges are the links from the node table to the individual nodes in the T-Graph. This is the maximum number of edges possible for n unique items.

In the FP-Tree, the number of edges is strictly dependent on the database characteristics (i.e., the number of unique items, the number of node links between branches, and the number of nodes in the FP-Tree). Therefore, a direct comparison of the edge cost is difficult. Between the FP-Tree and the T-Graph, it can be shown that either one has a lower edge count than the other by carefully selecting the database parameters. Hence, while we provide an upper bound for the T-Graph in Lemma 2, the reader should note that the edge cost, as for the FP-Tree, is often lower due to the overlapping items in the transactions. This is further compensated by the low node count of the T-Graph, where no two nodes are identical.

Lemma 3 Let the time to insert one item of a transaction into the T-Graph be $\tau$ units. Then, the time to construct the T-Graph from a database D is $\sum_{i=1}^{|D|} |t_i| \cdot \tau$.

Proof: The algorithm makes one pass through the database and processes each transaction in sequence, item by item. Each item in a transaction is inserted sequentially into the T-Graph, and the algorithm terminates when the last item of the last transaction has been inserted. Hence, Lemma 3.

Lemma 3 shows that the T-Graph's time complexity is similar to that of the FP-Tree: both process each transaction and its items in sequence. The difference lies in the sorting criteria for the items in each transaction. The T-Graph sorts each item by its lexicographical order, while the FP-Tree sorts by 1-item support count. From the three cost analyses, we see that the T-Graph is an attractive solution that shares the construction performance of the FP-Tree. However, the T-Graph additionally supports mining itemsets with different attributes. This is the key to achieving speedup with minimal modifications to the algorithm of each rule variant. Taking the FP-Tree as an example, fitting it to a particular attribute (e.g., Qty) requires modification to its meta-data structure, which can be a daunting task, as shown in [12].

4.3. Completeness and Compactness

Since the mining of frequent itemsets no longer occurs on the database, but via the T-Graph, we need to show that the T-Graph completely captures the database for the discovery

of different itemset variants in a compact manner. This is done by differentiating between uniform attributes and transaction dependent attributes. Uniform attributes, being consistent throughout the database, are stored only once, while transaction dependent attributes cannot be dropped, or the integrity of the representation would be compromised.

Lemma 4 The T-Graph holds complete information of the transaction database.

Proof: To verify that the T-Graph holds complete information of the database, we can attempt to reconstruct the database from the T-Graph. That is, if we can reconstruct every transaction with all its items and attributes, then the T-Graph is a complete representation of the database. In fact, this can be done with one pass through the node table, as follows. Begin with an empty database. For every entry in the node table, if the node-link edge contains TIDs, then create a transaction for each TID found in its t-tids or n-tids field. Write the label of the incident node and its attributes (from the edge and the node) to each such transaction. For the current edge, if the TID is in the t-tids field, then its corresponding transaction is complete and can be entered into the database. While there remain transactions not yet entered into the database, traverse the outgoing edges of the current node. Continue to insert the current item and its attributes into the transactions for the current set of TIDs (i.e., ignore TIDs not represented by the current set of transactions). Stop when all transactions for the current entry in the node table have been entered, and repeat for the next entry until all entries are processed. The database is now reconstructed.

Lemma 5 The memory requirement of the T-Graph is bounded by the size of the database.

Proof: From the algorithm, we see that the main consumers of memory are the TIDs. For each transaction t containing |t| items, |t| TIDs are added to the T-Graph. Hence, the number of TIDs in the T-Graph is determined by the number of transactions and their sizes. Mathematically, this is $\sum_{i=1}^{|D|} |t_i|$, and hence Lemma 5.
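A sketch of the reconstruction procedure in the proof of Lemma 4, written against the construct_tgraph output assumed earlier; it is illustrative only, not the paper's implementation.

def reconstruct_database(table):
    """Rebuild the transactions from a T-Graph in one pass over the node table."""
    txns = {}                                        # TID -> {item: attributes}
    for first, link in table.items():
        node = link.child
        live = set()                                 # TIDs whose transaction continues
        for tid, attrs in link.t_tids.items():       # transaction consists of `first` only
            txns[tid] = {first: {**node.uniform, **attrs}}
        for tid, attrs in link.n_tids.items():
            txns[tid] = {first: {**node.uniform, **attrs}}
            live.add(tid)
        frontier = [(node, live)]
        while frontier:
            cur, tids = frontier.pop()
            for label, edge in cur.out_links.items():
                child, cont = edge.child, set()
                for tid, attrs in edge.t_tids.items():
                    if tid in tids:                  # this TID's path ends here
                        txns[tid][label] = {**child.uniform, **attrs}
                for tid, attrs in edge.n_tids.items():
                    if tid in tids:                  # this TID's path continues
                        txns[tid][label] = {**child.uniform, **attrs}
                        cont.add(tid)
                if cont:
                    frontier.append((child, cont))
    return txns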

4.4. Finding Candidate Transactions

The T-Graph is constructed to realize the G that we discussed in Section 3. Up to this point, we have shown the details of line 1b, the first part of the framework refinement. The other part of the framework (line 5 after refinement) is to use the constructed T-Graph (i.e., G) to select the transactions that have a chance of supporting a given candidate. This selection process returns a subset of D, where each transaction in this subset is trimmed to contain

only the relevant items. The general idea behind this selection process is to intersect TIDs to determine the relevant transactions for a given itemset. This must be fast, so that the cost of querying G is substantially lower than making a pass through the database.

Returning to Figure 4, we see that one inherent characteristic of the T-Graph is that it implicitly partitions the TIDs based on the co-occurrences of items in a transaction. For example, if we start from the node table and follow the edge to the node for item b, the TIDs we pick up (i.e., 3 and 6) indicate that these two transactions have b as their lexicographically smallest item. Using this knowledge, if we have the itemset {a, b}, then there is no need to pick up TIDs 3 and 6, since the transactions on this edge cannot contain a. This implicit partitioning keeps each item's TID set small, so that TID intersections can be carried out efficiently. To further minimize TID intersection, we also maintain the information about the lexicographically last item in each transaction. This information is embedded in the T-Graph by recording the TIDs in separate compartments: n-tids and t-tids. Consider the itemset {a, d, f}: if a transaction contains d as its last item, then including its TID in the intersection would be unnecessary, because the transaction cannot possibly support the itemset; its TID will not appear in the set of TIDs for f (by definition of the lexicographical order of items in the transaction). Hence, without regard to the attributes, we can safely exclude this particular transaction from the results to be returned by Q(X, G).

For the purpose of discussion, let X be an itemset to be passed to Q(X, G), and let i_x in X be the x-th item of X when all items in X are ordered lexicographically. Based on the above general idea, we have the following rules for determining the set of TIDs for each item i_x in X, such that an efficient TID intersection can be performed to return the relevant transactions of Q(X, G).

Rule 1: (i_x in X, x = 1) Collect the TIDs in the n-tids of all edges incident on node i_x; if x = |X|, then collect the TIDs in the t-tids of all incident edges as well.

Rule 2: (i_x in X, 1 < x < |X|) If there exists an edge incident on node i_x such that i_{x-1} and i_x are lexicographically ordered one after another, then collect all TIDs in the n-tids of this edge; else collect all TIDs in the n-tids of all incident edges except the edge from the node table.

Rule 3: (i_x in X, x = |X|) If there exists an edge incident on node i_x such that i_{x-1} and i_x are lexicographically ordered one after another, then collect all TIDs of this edge; else collect all TIDs of all incident edges except the edge from the node table.

Rule 4: (for every i in X) At any point after the collection of TIDs for an item, evaluate whether the number of TIDs for this item fails to satisfy the support requirement. If so, there is no need to perform the intersection, and the T-Graph can immediately return no relevant transactions, avoiding redundant processing.

Example 1: Assume that we are interested in finding the relevant transactions for the itemset {a, c, d} using the database in Figure 3 and the T-Graph in Figure 4. To find the relevant transactions, we first find, for each item, the TIDs relevant to the other items in the itemset. Since a is the first item, we use Rule 1 to determine its relevant set of TIDs; that is, we collect all TIDs in the n-tids section of all incident edges. From Figure 4, this set of TIDs is {1, 2, 4, 5}. In the same way, we apply Rule 2 to item c, which gives {5}, and Rule 3 to item d, which gives {7}. Notice that Rules 2 and 3 are very similar; the only difference is whether TIDs are collected from the n-tids only or from both the n-tids and the t-tids. Once we have obtained the TIDs for each item, we simply perform the intersection to determine that no transaction supports this itemset.

Rule 4 is an optimization step that avoids redundant TID intersections. In our example, if the minimum support is 2, then there is no need to perform any intersection, because one of the items (item c) has only one TID. In other words, the maximum number of TIDs that can support an itemset is bounded by the smallest number of TIDs among its items. Even when the attempt to avoid the TID intersection fails, the intersection cost is still substantially lower than that of the TID intersection proposed in [1]. We show this in the example below.

Example 2: Continuing from Example 1, using the method in [1], item a has TIDs {1, 2, 4, 5}; c has TIDs {5, 7}; and d has TIDs {1, 2, 3, 4, 6, 7}. Assuming that these TIDs are stored in a hash table, the minimum number of intersection operations is 3 units. Using the T-Graph, the TIDs derived for c and d are {5} and {7} respectively; in this case, the intersection cost is only 1 unit.

At the end of the intersection, the TIDs that remain determine the transactions to be returned. And since we perform the intersection by the items of the itemset, we can easily pick up the attribute values to reconstruct each transaction. The difference here is that, in the process of picking up TIDs, we only look at nodes that are relevant to the itemset of interest. As such, the reconstruction phase only includes the items of the itemset and their attribute values. This effectively trims the transactions, as discussed earlier. Thus, if we are looking

at the itemset {a, e}, then it will be supported by transaction 1. However, due to the reconstruction, transaction t_1 is implicitly trimmed and returned as {a, e} instead.
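The intersection step itself is simple; the sketch below takes the per-item TID sets produced by Rules 1-3 and applies Rule 4 before intersecting (the function name is ours).

def relevant_tids(tid_sets, min_support):
    """Rule 4 plus the intersection: skip the work entirely when some item's
    TID set is already smaller than the minimum support."""
    if min(len(s) for s in tid_sets) < min_support:    # Rule 4
        return set()
    return set.intersection(*(set(s) for s in tid_sets))

# Example 1 with a minimum support of 2: no intersection is performed at all.
assert relevant_tids([{1, 2, 4, 5}, {5}, {7}], 2) == set()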

5. Experimental Results
The purpose of our experiments is to compare the performance of our framework and to evaluate its success in enhancing Apriori-based algorithms against modern mechanisms. For space reasons, we show this by applying the framework to the original Apriori algorithm, and then comparing the results against FP-Growth.

5.1. Methodology
We conducted the experiments on a single-CPU Pentium III notebook at 700 MHz, with a 256 KB cache and 384 MB of RAM, running Windows XP. Each database is a text file, where each line represents a transaction and each item is an integer. No attributes were generated for our tests, to facilitate a fair comparison against FP-Growth. We then applied our framework to Apriori and compared all three algorithms (Apriori, framework-enhanced Apriori and FP-Growth) on two different data sets. The synthetic data sets were generated with the ARMiner [2] package and then converted into text format for our algorithms. For the first test, we generated a database containing 100K transactions with an average transaction size of 25, 10K different items, 50K potential patterns with an average frequent pattern length of 20, and a correlation level of 0.5. For the second test, we doubled the number of transactions and increased the correlation level to 0.75, keeping the rest of the parameters the same as in our first test set.

We implemented our algorithms in .NET and executed all tests on top of the .NET runtime. The Apriori and FP-Growth implementations are adapted from the ARMiner project [2], both of which are efficient implementations. Since they are written in Java, we used J# (a .NET Java-compatible language) to import the code and recompile it within the .NET environment. We then implemented the T-Graph in C# (another .NET language) and compiled it for the same runtime environment. The language difference is insignificant because the .NET environment compiles the different languages to the same intermediate code (i.e., IL), which is similar to Java's byte code.

5.2. Performance Comparison
Figure 5 shows the runtime performance of the three algorithms on the two data sets described above. We ran the experiment with support thresholds of 1-5%, noting the total runtime cost, which includes the construction of the meta-data structure in FP-Growth and of the T-Graph in our framework-enhanced Apriori.

Even with these overheads included, we see that both FP-Growth and our framework scale much better than Apriori. As the support threshold goes below 2%, the number as well as the length of the frequent itemsets increases dramatically. The candidate sets that Apriori must handle become extremely large, and testing each candidate itemset against the long transactions becomes very expensive. In the case of our framework, the use of the T-Graph (i.e., G) eliminates the need to scan the entire database and reduces the cost of testing an itemset against a long transaction. Instead, selective TID intersections are used to quickly determine the set of relevant transactions. In the case of scaling up Apriori, there are no attribute considerations; therefore, the number of transactions returned for a given query to the T-Graph maps directly to the support count of the itemset.

Although the candidate counts are lower at the higher support thresholds of 3-5%, our framework consistently reaped significant speedups against Apriori. At the same time, we observed that FP-Growth lags behind our framework in speedup. This is because there are fewer TIDs under high support requirements. Together with the inherent partitioning by the edges and the storing of TIDs in separate compartments, fewer intersections are needed. This, in turn, translates to a lower cost for computing the relevant transactions, which turns out to be cheaper than the process of building the conditional FP-Trees in FP-Growth.

From the plot of the results, we see that the performance gain from the framework refinement discussed in Section 3 and Figure 1 is significant. The simple modification has allowed Apriori to rival the performance of FP-Growth without a complete overhaul. On average, our framework achieved an order of magnitude of performance over Apriori, putting it on par with FP-Growth. More importantly, this gain comes at a lower cost and with greater benefit: the cost is lower because of the small framework refinement, and the greater benefit comes from the fit of our approach to any existing Apriori-based algorithm. A future variant that considers a new attribute for its domain can focus on the insights and quickly mould the constraints using the "generate and test" approach, and can then scale up with our proposal, saving the time of creating an efficient algorithm from scratch.

Figure 5. Scale-up performance of Apriori, Apriori (Framework Enhanced) and FP-Growth on the data sets T25I20N10KD100K (correlation 0.5) and T25I20N10KD200K (correlation 0.75): runtime in seconds for support thresholds of 1% to 5%.

5.3. Scalability Results
Performance is only one part of the framework's benefit; the other is scalability. Since data mining is primarily concerned with processing large data sets, a good data mining algorithm must be scalable with respect to the size of the database. Most Apriori-based algorithms do not address scalability in their literature. As such, the other motivation of our framework is to build in this capability in the same way that we boosted performance. Again, the advantage of our approach is the ability to scale up transparently. As a matter of fact, the framework refinement discussed in Figure 1 is also the refinement required for scalability.

Our proposal for scaling up to large data sets is to use a number of personal computers connected by a local area network. This configuration needs zero setup cost and is very common in any organization today. No special machine is needed, and a heterogeneous mix of hardware configurations (e.g., different amounts of available memory) is acceptable. The data file resides on a file server accessible by the machines participating in the task. An important characteristic of the T-Graph is that it is a representation of the database rather than a meta-data structure; its difference from a conventional database is its ability to return relevant transactions efficiently. Therefore, partitioning the database is very straightforward: we simply construct a T-Graph for each partition. In other words, we have the following lemma.

Lemma 6 Let D_1, D_2, ..., D_p be the partitions of D such that D_1 ∪ D_2 ∪ ... ∪ D_p = D. Let G_i be the T-Graph constructed from partition D_i. Then, for any candidate c, Q(c, G) = Q(c, G_1) ∪ Q(c, G_2) ∪ ... ∪ Q(c, G_p).

Proof: By Lemma 4, a T-Graph is a compact representation of the database. Thus, a T-Graph constructed from a partition is a compact representation of that partition. Since (by definition) the run of Q(c, G_i) returns the relevant transactions from partition i represented by G_i, the union of the results from each T-Graph is equivalent to the result of querying Q(c, G).

Notice that each partition's T-Graph can be constructed in parallel, as the construction process is read-only. Ignoring network traffic conditions, the effective construction time reduces to the time required to build the T-Graph on the largest

partition. A controller (representing Q(c, G)), which runs the framework-enhanced algorithm, continues to query G, which is now a set of T-Graphs residing on different machines. Given a candidate itemset c, the query Q(c, G) is first transmitted to each machine housing a T-Graph. Since the query to a T-Graph asks for relevant transactions, each T-Graph returns the relevant transactions for c in the partition it represents. Again, this process executes in parallel. Assuming equal partition sizes and an even distribution of the transactions related to c, the process of finding relevant transactions is bounded by the slowest machine in the network. Once a T-Graph computes its relevant transactions, the results are returned to the controller, which assembles all the transactions and returns them as its answer. Hence, from the viewpoint of the algorithm, the process is transparent. More importantly, the hardware setup is easy to achieve and scales up well by simply partitioning the database further. The parallelism in the TID intersection not only improves speed, but also compensates for the overheads involved in transporting data across the network and in assembling the transactions. Although we have omitted the experimental results, our discussion is sufficient to show that the proposal is scalable to the needs of very large databases, beyond those discussed in this paper.
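Lemma 6 translated into a short sketch: the controller fans the query out to the per-partition T-Graphs and unions the answers. Here threads stand in for the networked machines, and query is the hypothetical per-partition Q(c, G_i).

from concurrent.futures import ThreadPoolExecutor

def query_partitioned(candidate, tgraphs, query):
    """Q(c, G) over a partitioned database: ask every partition's T-Graph in
    parallel and concatenate the relevant transactions (TIDs are disjoint
    across partitions, so a plain union suffices)."""
    with ThreadPoolExecutor(max_workers=max(1, len(tgraphs))) as pool:
        results = pool.map(lambda g: query(candidate, g), tgraphs)
    merged = []
    for part in results:
        merged.extend(part)
    return merged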

6. Summary
In this paper, we introduced a simple but powerful framework for scaling up existing Apriori-based algorithms. Using our framework, minimal modification is needed to gain significant speedup and the ability to handle large data sets. This is made possible by a systematic modification of the algorithmic process, and by the use of an efficient database representation (i.e., the T-Graph). Unlike meta-data structures, our proposal re-organizes the database into a form that is well-suited to the task of association rule mining. Its main design characteristics are to avoid entire database scans for each candidate, and to create short transactions that keep the cost of support testing low. In our experimental results, we achieved an average speedup of 10 times over Apriori, and outperformed FP-Growth by a factor of 3. The tests were conducted on data sets with a large number of patterns and a high correlation level; as a result, they stress each algorithm on the number of patterns it can handle within the limited available memory. Clearly, under this high-load condition, our framework performs particularly well.

We also discussed how our framework scales to large data sets. The advantage of our proposal is the use of inexpensive personal computers connected by a local area network, a setup that is very common and affordable. Parallelism is achieved by partitioning the database and then constructing a T-Graph for each partition. More importantly, the scale-up process is totally transparent. In other words, any Apriori-based algorithm using our framework also benefits from such a scale-up without any modification.

From the results above, we believe our proposal makes an important contribution to the state of the art in association rule mining. Earlier proposals, while efficient, have ignored the existence of other variants, leading to mechanisms that are rigid with respect to attribute extensions. This is an important problem, since earlier works on mining informative rules are primarily Apriori-based; they therefore cannot benefit from proposals such as FP-Growth without a large overhaul. In contrast, our framework allows the "generate and test" paradigm to be retained while performance and scalability are enhanced. By respecting the "generate and test" paradigm, we allow all existing variants (as well as future ones) to benefit from our proposal.

Acknowledgement
We would like to thank Zehua Liu for reviewing the early draft of the paper, and for the initial implementation of the framework.

References
[1] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. of VLDB, pages 487-499, Santiago, Chile, 1994.
[2] L. Cristofor. Association Rules Miner (ARMiner). http://www.cs.umb.edu/~laur/ARMiner, 2002.
[3] G. Gardarin, P. Pucheral, and F. Wu. Bitmap Based Algorithms For Mining Association Rules. In Proc. Actes des Journées Bases de Données Avancées, Tunisia, Oct. 1998.
[4] J. Han. Mining Knowledge at Multiple Concept Levels. In Proc. of the 4th Int. Conf. on Information and Knowledge Management, pages 19-24, Baltimore, Maryland, Nov. 1995.
[5] J. Han and Y. Fu. Discovery of Multiple-Level Association Rules from Large Databases. In Proc. of VLDB, Zurich, Switzerland, 1995.
[6] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proc. of SIGMOD, Dallas, TX, May 2000.
[7] J. L. Han and A. W. Plank. Background for Association Rules and Cost Estimate of Selected Mining Algorithms. In Proc. of CIKM, Rockville, MD, USA, 1996.
[8] G. H. John. Enhancements to the Data Mining Process. PhD thesis, Stanford University, Department of Computer Science, Mar. 1997.
[9] K. Koperski and J. Han. Data Mining Methods for the Analysis of Large Geographic Databases. In Proc. of Conf. on Geographic Information Systems, Canada, Mar. 1996.
[10] B. Liu, W. Hsu, and Y. Ma. Mining Association Rules with Multiple Minimum Supports. In Proc. of ACM SIGKDD, San Diego, CA, USA, Aug. 1999.
[11] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering Frequent Episodes in Sequences. In Proc. of ACM SIGKDD, Montreal, Canada, Aug. 1995.
[12] K.-L. Ong, W.-K. Ng, and E.-P. Lim. Mining Multi-Level Rules with Recurrent Items Using FP'-Tree. In Proc. of the 3rd Int. Conf. on Information, Communications and Signal Processing, Singapore, Oct. 2001.
[13] K.-L. Ong, W.-K. Ng, and E.-P. Lim. Mining Variants of Association Rules Using the CrystalBall Framework. CAIS Technical Report, http://www.cais.ntu.edu.sg:8000/~ongkl/papers/tr.pdf, 2002.
[14] J. S. Park, M.-S. Chen, and P. S. Yu. Using a Hash-Based Method with Transaction Trimming and Database Scan Reduction for Association Rules. IEEE TKDE, 9(5):813-825, Oct. 1997.
[15] H. Toivonen. Sampling Large Databases for Association Rules. In Proc. of VLDB, pages 134-145, Mumbai, India, Sept. 1996.
[16] W. Wang, J. Yang, and P. S. Yu. Efficient Mining of Weighted Association Rules (WAR). In Proc. of ACM SIGKDD, Boston, MA, USA, Aug. 2000.
[17] O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. In Proc. of ICDE, San Diego, Mar. 2000.
[18] M. J. Zaki and K. Gouda. Fast Vertical Mining Using Diffsets. RPI Technical Report 01-1, 2001.