Custom Memory Placement for Parallel Data Mining

S. Parthasarathy, M. J. Zaki, and W. Li (‡)
[email protected]

(‡) Oracle Corporation, 500 Oracle Parkway, Redwood Shores, CA 94065
[email protected]

The University of Rochester
Computer Science Department
Rochester, New York 14627

Technical Report 653
November 1997

Abstract

Many data mining tasks, such as Associations, Sequences, and Classification, use complex pointer-based data structures that typically suffer from sub-optimal locality. In the multiprocessor case, shared access to these data structures may also result in false sharing. Most of the optimization techniques for enhancing locality and reducing false sharing have been proposed in the context of numeric applications involving array-based data structures; they are not applicable to dynamic data structures, which are allocated from the heap at arbitrary addresses. Within the context of data mining it is commonly observed that the building phase of these large recursive data structures, such as hash trees and decision trees, is random and independent of the access phase, which is usually ordered and typically dominates the computation time. In such cases locality- and false-sharing-sensitive memory placement of these structures can enhance performance significantly. We evaluate a set of placement policies over a representative data mining application (association rule discovery) and show that simple placement schemes can improve execution time by more than a factor of two. More complex schemes yield an additional 5-20% gain.
Keywords: Knowledge Discovery, Data Mining, Association Rules, Memory Placement, Memory Allocation, Improving Locality, Reducing False Sharing, Dynamic Data Structures, Hash Tree
This work was supported by the University of Rochester Computer Science Department, and in part by an NSF Research Initiation Award (CCR-9409120) and ARPA contract F19628-94-C-0057.
1 Introduction

Data mining is an emerging area whose goal is to extract significant patterns or interesting rules from databases. Data mining is in fact a broad area which combines research in machine learning, statistics, and databases. It can be broadly classified into these categories [1]: Classification (Clustering), finding rules that partition the database into finite, disjoint, and previously known (unknown) classes; Sequences, extracting commonly occurring sequences in ordered data; and Associations (a form of summarization), finding the set of most commonly occurring groups of items. A common feature of most of these tasks is that they are compute intensive and operate on large sets of data. This makes them attractive candidates for implementation on shared-memory multiprocessors, where they can benefit from the available parallelism.

As the number of processors increases it is not easy to provide one shared memory with fast access times for all processors. Therefore, most contemporary designs include private coherent caches for each processor. Due to the increasing gap between processor and memory subsystem performance in modern machines [15], for data-intensive tasks improving data locality (i.e., keeping as much of the data as possible local to the processor cache) is extremely important to ensure that most accesses are to fast local memory. Further, in a shared-memory multiprocessor system there are multiple local caches that have to be kept coherent with one another. A key to efficiency is to minimize the cost of maintaining coherence. The elimination of false sharing (i.e., the problem where coherence traffic is incurred even when memory locations are not actually shared) is an important step towards achieving this objective.

Most algorithms for mining tasks use recursive data structures, such as those found in Associations (hash trees, lists), Classification (decision trees), and Sequences (hash trees, lists), which typically exhibit poor locality. Such structures consist of individual nodes that are allocated arbitrarily from a central heap and are linked together by pointers. Unfortunately, traditional locality optimizations found in the literature [9, 7] cannot be applied directly to such structures. These optimizations are essentially array based and rely on the fact that consecutive array elements are at consecutive addresses, so that small strided accesses can be placed on the same cache line. The arbitrary allocation of memory in data mining applications also makes detecting and eliminating false sharing a difficult proposition. An additional problem is that locality enhancement policies may increase the amount of false sharing in these data structures. Reducing false sharing while optimizing locality at the same time is thus a crucial issue for performance.

With this goal in mind, we propose a set of mechanisms and policies for controlling the allocation and memory placement of dynamic data structures. The mechanism takes the form of a custom memory allocation library. We evaluate a set of placement policies on a shared-memory multiprocessor using a representative parallel data mining application, viz. association rule discovery. Our solution to the problem is to understand the access patterns of the program and allocate memory in such a way that data likely to be accessed together are grouped together (memory placement) while minimizing false sharing.
Our experiments show that simple placement schemes can be quite effective, and for the datasets we looked at they improve the execution time of our application by up to a factor of two.
More complex schemes yield an additional 5-20% gain. While we present results for one mining application, we observe that its access patterns and data structures are essentially the same as those found in implementations of sequence mining and quantitative association mining. We expect that for these applications similar, if not better, improvements will be observed. The rest of the paper is organized as follows. We describe the parallel association mining algorithm in Section 2. Details of the memory placement library are provided in Section 3. Section 4 describes the custom placement policies for the association mining application. We present our experimental results in Section 5. Relevant related work is discussed in Section 6, and we present our conclusions in Section 7.
2 Parallel Association Mining

2.1 Problem Formulation

The problem of mining associations over basket data was introduced in [2]. It can be formally stated as follows. Let I = {i1, i2, ..., im} be a set of m distinct attributes, also called items. Each transaction T in the database D of transactions has a unique identifier and contains a set of items, called an itemset, such that T ⊆ I; i.e., each transaction is of the form <TID, i1, i2, ..., ik>. An itemset with k items is called a k-itemset. A subset of length k is called a k-subset. An itemset is said to have support s if s% of the transactions in D contain the itemset. An association rule is an expression A ⇒ B, where itemsets A, B ⊆ I, and A ∩ B = ∅. The confidence of the association rule, given as support(A ∪ B)/support(A), is simply the conditional probability that a transaction contains B, given that it contains A.

The data mining task for association rules can be broken into two steps. The first step consists of finding all frequent itemsets, i.e., itemsets that occur in the database with a certain user-specified frequency, called minimum support. The second step consists of forming implication rules among the frequent itemsets [5]. The second step is relatively straightforward. Once the support of frequent itemsets is known, rules of the form X − Y ⇒ Y (where Y ⊂ X) are generated for all frequent itemsets X, provided the rules meet the desired confidence. On the other hand, the problem of identifying all frequent itemsets is hard. Given m items, there are potentially 2^m frequent itemsets. However, only a small fraction of the whole space of itemsets is frequent. Discovering the frequent itemsets requires a lot of computation power, memory, and I/O, which can only be provided by parallel computers.
2.2 Sequential Association Algorithm

Our parallel shared-memory algorithm is built on top of the sequential Apriori association mining algorithm proposed in [3]. During iteration k of the algorithm a set of candidate k-itemsets is generated. The database is then scanned and the support for each candidate is found. Only the frequent k-itemsets are retained for future iterations, and are used to generate a candidate set for the next iteration. A pruning step eliminates any candidate which has an infrequent subset. This iterative process is repeated for k = 1, 2, 3, ..., until there are no more frequent k-itemsets to be found. The general structure of the algorithm is given in Figure 1. In the figure Fk denotes the set of frequent k-itemsets, and Ck the set of candidate k-itemsets. There are two main steps in the algorithm: candidate generation and support counting.

F1 = { frequent 1-itemsets };
for (k = 2; Fk-1 ≠ ∅; k++)
    /* Candidate itemset generation */
    Ck = set of new candidates;
    /* Support counting */
    for all transactions t ∈ D
        for all k-subsets s of t
            if (s ∈ Ck) s.count++;
    /* Frequent itemset generation */
    Fk = { c ∈ Ck | c.count ≥ minimum support };
Set of all frequent itemsets = ∪k Fk;

Figure 1: Sequential Association Mining
Candidate Itemset Generation

In candidate itemset generation, the candidates Ck for the k-th pass are generated by joining Fk-1 with itself, which can be expressed as:

Ck = { x | x[1 : k-2] = A[1 : k-2] = B[1 : k-2], x[k-1] = A[k-1], x[k] = B[k-1], A[k-1] < B[k-1], where A, B ∈ Fk-1 }

where x[a : b] denotes the items at index a through b in itemset x. Before inserting x into Ck, we test whether all (k-1)-subsets of x are frequent. If at least one subset is not frequent, the candidate is pruned.
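The join and prune step above can be made concrete with a short sketch. The following C fragment is only illustrative: the itemset representation (sorted integer arrays), the helper is_frequent(), and the use of plain malloc() are assumptions, not the implementation used in the paper (which allocates from the custom regions of Section 3).

/* Illustrative sketch of Apriori candidate generation (join + prune).
 * Each itemset in F (= Fk-1) is a sorted integer array of length k-1,
 * and F itself is assumed to be lexicographically sorted. */
#include <stdlib.h>
#include <string.h>

typedef struct { int *items; } KItemset;

extern int is_frequent(const int *items, int k_minus_1);  /* assumed lookup in Fk-1 */

int gen_candidates(KItemset *F, int nF, int k, KItemset **Ck)
{
    int count = 0;
    *Ck = malloc(sizeof(KItemset) * (size_t)nF * nF);     /* loose upper bound; sketch only */
    for (int i = 0; i < nF; i++) {
        for (int j = i + 1; j < nF; j++) {
            /* join step: the first k-2 items of A = F[i] and B = F[j] must match */
            if (memcmp(F[i].items, F[j].items, (k - 2) * sizeof(int)) != 0)
                break;                                    /* F is sorted, so no later j can match */
            int *x = malloc(k * sizeof(int));
            memcpy(x, F[i].items, (k - 1) * sizeof(int));
            x[k - 1] = F[j].items[k - 2];                 /* A[k-1] < B[k-1] holds by sort order */
            /* prune step: every (k-1)-subset of x must itself be frequent */
            int keep = 1;
            int *sub = malloc((k - 1) * sizeof(int));
            for (int d = 0; d < k && keep; d++) {
                int p = 0;
                for (int q = 0; q < k; q++)
                    if (q != d) sub[p++] = x[q];
                if (!is_frequent(sub, k - 1)) keep = 0;
            }
            free(sub);
            if (keep) (*Ck)[count++].items = x; else free(x);
        }
    }
    return count;
}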
Hash Tree Data Structure: The candidates are stored in a hash tree to facilitate fast support counting. An internal node of the hash tree at depth d contains a hash table whose cells point to nodes at depth d + 1. All the itemsets are stored in the leaves in a sorted linked list. To insert an itemset into Ck, we start at the root, and at depth d we hash on the d-th item in the itemset until we reach a leaf. The itemset is then inserted in the linked list of that leaf. If the number of itemsets in that leaf exceeds a threshold value, that node is converted into an internal node. The maximum depth of the tree in iteration k is k. Figure 2 shows a hash tree containing candidate 3-itemsets, and Figure 3 shows the structure of internal and leaf nodes in more detail.
[Figure 2: Candidate Hash Tree (C3). Hash function: h(B) = h(D) = 0; h(A) = h(C) = h(E) = 1. Internal nodes at depths 0, 1, and 2 hash items into bucket 0 (B, D) or bucket 1 (A, C, E); the leaves hold the candidate 3-itemsets ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, and CDE.]

[Figure 3: Structure of Internal and Leaf Nodes. An internal hash tree node (HTN) contains a hash table (an array of HTN pointers) and a NULL itemset list; a leaf HTN contains a NULL hash table and an itemset list of list nodes (LN), each pointing to a candidate 3-itemset.]
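The node layout of Figure 3 translates into roughly the following C declarations and insertion routine. This is a minimal sketch under stated assumptions: the field names, the hash() function, the leaf threshold, and the split_leaf() helper are illustrative, and the actual implementation allocates nodes from the custom regions described in Section 3 rather than from malloc().

#include <stdlib.h>

#define HASH_SIZE 2            /* two buckets in the Figure 2 example; larger in practice */
#define LEAF_THRESHOLD 4       /* assumed threshold for converting a leaf into an internal node */

typedef struct ListNode {
    int *itemset;              /* candidate k-itemset (sorted item ids) */
    struct ListNode *next;
} ListNode;

typedef struct HashTreeNode {
    struct HashTreeNode **hash_table;   /* non-NULL for internal nodes */
    ListNode *itemset_list;             /* used only by leaves */
    int num_itemsets;
} HashTreeNode;

extern int hash(int item);                                     /* assumed hash function */
extern void split_leaf(HashTreeNode *leaf, int depth, int k);  /* convert a full leaf into an internal node */

/* Insert a candidate k-itemset: hash on successive items until a leaf is
 * reached, then link it into the leaf's itemset list (the paper keeps the
 * list sorted; a head insert is shown here for brevity). */
void ht_insert(HashTreeNode *root, int *itemset, int k)
{
    HashTreeNode *node = root;
    int depth = 0;
    while (node->hash_table != NULL)                 /* internal node: descend one level */
        node = node->hash_table[hash(itemset[depth++])];
    ListNode *ln = malloc(sizeof(ListNode));         /* the paper allocates this from a region */
    ln->itemset = itemset;
    ln->next = node->itemset_list;
    node->itemset_list = ln;
    if (++node->num_itemsets > LEAF_THRESHOLD && depth < k)
        split_leaf(node, depth, k);
}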
Support Counting

To count the support of candidate k-itemsets, for each transaction T in the database, we conceptually form all k-subsets of T in lexicographical order. For each subset we then traverse the hash tree and look for a matching candidate.
Hash Tree Traversal: The search starts at the root by hashing on successive items of the transaction. At the root we hash on all transaction items from 0 through (n - k + 1), where n is the transaction length and k the current iteration. If we reach depth d by hashing on item i, then we hash on the remaining items i through (n - k + 1) + d at that level. This is done recursively until we reach a leaf. At this point we traverse the linked list of itemsets and increment the count of all itemsets that are contained in the transaction.

Frequent Itemset Example

Once the support for the candidates has been gathered, we traverse the hash tree in depth-first order, and all candidates that have the minimum support are inserted into Fk, the set of frequent k-itemsets. This set is then used for candidate generation in the next iteration. For a simple example demonstrating the different steps, consider an example database D = {T1 = (1,4,5), T2 = (1,2), T3 = (3,4,5), T4 = (1,2,4,5)}, where Ti denotes transaction i. Let the minimum support value be 2. Running through the iterations, we obtain
F1 = {(1), (2), (4), (5)}
C2 = {(1,2), (1,4), (1,5), (2,4), (2,5), (4,5)}
F2 = {(1,2), (1,4), (1,5), (4,5)}
C3 = {(1,4,5)}
F3 = {(1,4,5)}
Note that while forming C3 by joining F2 with itself, we get three potential candidates, (1,2,4), (1,2,5), and (1,4,5). However, only (1,4,5) is a true candidate. The first two are eliminated in the pruning step, since they have a 2-subset which is not frequent, namely the 2-subsets (2,4) and (2,5), respectively.
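Using the node structures from the earlier sketch, the hash tree traversal described above can be written roughly as follows. The contains() and increment_candidate() helpers are assumptions (the latter stands in for the locked counter update discussed in Section 2.3), and item positions are 0-indexed in this sketch.

/* Count a transaction t of n (sorted) items against the candidate hash tree.
 * At depth d we may hash on items at positions start .. (n - k) + d; at a
 * leaf we test each candidate for containment in the transaction. */
extern int contains(const int *t, int n, const int *itemset, int k);  /* assumed subset test */
extern void increment_candidate(ListNode *ln);                        /* assumed locked increment */

void count_subsets(HashTreeNode *node, const int *t, int n,
                   int start, int depth, int k)
{
    if (node->hash_table == NULL) {                       /* leaf node */
        for (ListNode *ln = node->itemset_list; ln != NULL; ln = ln->next)
            if (contains(t, n, ln->itemset, k))
                increment_candidate(ln);
        return;
    }
    for (int i = start; i <= n - k + depth; i++)          /* internal node */
        count_subsets(node->hash_table[hash(t[i])], t, n, i + 1, depth + 1, k);
}

/* Called once per transaction as: count_subsets(root, t, n, 0, 0, k); */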
2.3 Parallel Association Mining on SMP Systems We now brie y describe the Common Candidate Partitioned Database (CCPD) [30] parallel association mining algorithm on shared-memory multiprocessors. As the name suggests CCPD keeps a common candidate hash tree among all processors, but the database is logically split among them in a blocked fashion. The new candidates are generated in parallel by all the processors, and are also inserted into the candidate hash tree in parallel. Once all the candidates have been generated, each processor scans its local database and counts the support for each itemset in parallel. Finally, the master process selects the frequent itemsets. 5
A brief description of the two key steps of parallel candidate generation and support counting is given below. A more detailed study on the optimizations and parallel performance of CCPD can be found in [30].
Parallel Candidate Itemset Generation

Equivalence Classes: Let F2 = {AB, AC, AD, AE, BC, BD, BE, DE}. Then C3 = {ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE}. Assuming that Fk-1 is lexicographically sorted, we can partition the itemsets in Fk-1 into equivalence classes based on their common length-(k-2) prefixes, i.e., the equivalence class of a ∈ Fk-2 is given as:

Sa = [a] = { b ∈ Fk-1 | a[1 : k-2] = b[1 : k-2] }

Candidate k-itemsets can simply be generated from itemsets within a class by joining all (|Si| choose 2) pairs. For our example F2 above, we obtain the equivalence classes SA = [A] = {AB, AC, AD, AE}, SB = [B] = {BC, BD, BE}, and SD = [D] = {DE}. We observe that itemsets produced by the equivalence class [A], namely those in the set {ABC, ABD, ABE, ACD, ACE, ADE}, are independent of those produced by the class [B] (the set {BCD, BCE, BDE}). Any class with only one member can be eliminated, since no candidates can be generated from it. Thus we can discard the class [D].
Parallel Candidate Generation: Since each equivalence class can generate an independent set of candidates, we assign to each processor a set of distinct classes. To achieve a good load balance among the processors, we first assign a weight to each equivalence class. The weight corresponds to the number of possible candidates that can be generated from the class, which is given as wi = (|Si| choose 2) = |Si|(|Si| - 1)/2. The classes are then sorted on weight and assigned to the processors in a greedy fashion: we remove each class in turn from the sorted list and assign it to the processor with the least total weight.
Parallel Hash Tree Formation: Once the classes have been scheduled, each processor generates its candidates and inserts them into the candidate hash tree in parallel. To avoid race conditions among the multiple processors, a lock is associated with each leaf node in the hash tree. When processor i wants to insert a candidate itemset into the hash tree, it starts at the root node and hashes on successive items in the candidate itemset until it reaches a leaf node. At this point it acquires the lock on this leaf node for mutual exclusion while inserting the itemset. With this locking mechanism, each processor can insert itemsets into different parts of the hash tree in parallel.
Hash Tree Balancing: Another important factor affecting performance is the hash function used in the internal nodes of the hash tree. A simple "mod" function on the hash table length can result in unbalanced trees with long search paths. In [30] we presented a new hash function which utilizes the equivalence classes to find the frequency of occurrence of different items in the candidates. Using this information a hash function is created such that
the resulting tree is balanced to a great extent, resulting in shorter paths and consequently faster support counting.
Parallel Support Counting

Each itemset in the candidate hash tree has a single count field associated with it. Since the tree is shared among the processors, more than one processor may try to access the count field to increment its support. We thus need a locking mechanism to provide mutual exclusion among the processors while incrementing the count. The itemset data structure is shown in Figure 4. The basic steps remain the same as in the sequential algorithm. Each processor scans its local database partition, and for each transaction a check is made to see whether any candidates are contained in it. Once all processors have finished the support counting, the master extracts those candidates that have minimum support and inserts them into Fk. The support counting step typically accounts for 80% of the total computation time [30]. Our locality enhancement techniques are essentially geared towards improving the performance of this step.

ITEMSET:
    int Count;
    volatile int Lock;
    int *Itemset;    /* the items, e.g., A B C */

Figure 4: Itemset Data Structure
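The per-itemset lock of Figure 4 is used roughly as sketched below. The spin lock built on the GCC __sync built-ins is an illustrative assumption (the paper does not specify the lock primitive); only the structure mirrors Figure 4.

/* Locked support-count update for the itemset structure of Figure 4. */
typedef struct {
    int Count;
    volatile int Lock;
    int *Itemset;
} CandidateItemset;

void increment_support(CandidateItemset *c)
{
    while (__sync_lock_test_and_set(&c->Lock, 1))   /* acquire: spin until the lock is free */
        ;
    c->Count++;                                     /* critical section: bump the support */
    __sync_lock_release(&c->Lock);                  /* release */
}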
Optimized Candidate Search: In [30] we presented an optimization to speed up the subset counting phase. It utilizes the fact that the subsets of the transaction are generated in lexicographic order to pre-empt the candidate search. Each time a hash tree node is visited for the first time, it is marked as having been seen. If we happen to visit the same node again later (which depends on the hash function used), we do not have to process it again.
3 Custom Memory Placement Library

In this section we briefly discuss some of the features of our runtime memory allocation library. The library, which is built upon the Unix malloc library, provides the mechanisms to control memory placement policies for dynamic data structures encountered in data mining applications. In contrast to the standard Unix malloc library, our library allows greater flexibility in controlling memory placement. At the same time it offers a faster memory freeing option and efficient reuse of preallocated memory. Furthermore, it does not pollute the cache with boundary tag information. Applications like association mining, which use large temporary dynamic data structures like the hash tree repeatedly in each iteration, can take advantage of the global memory freeing and the efficient memory reuse options. These features are also quite useful when data is to be remapped. Data mining applications can also benefit from aligning and grouping together related data structures based on access patterns, which is likely to improve locality. Another benefit of aligning and padding is to reduce the effect of false sharing. With the above issues in mind, the key features of our library are as follows (a sketch of a possible interface appears after the list):
Memory Allocation: Memory is allocated from user-defined regions. For example, the user may specify that the candidate itemsets be allocated from a specific region. Each region is composed of segments of user-specified size. Initially a region is composed of a single segment. Subsequent calls to the custom malloc() routine allocate memory from this preallocated segment. If there is no more space remaining in the current segment, we call the standard malloc (or alternatively allocate memory from another region) to allocate a new segment. The old segment is placed in a list of allocated segments associated with the region. Each of the allocation routines returns a pointer to space, suitably aligned for storage of any type of object, from the current segment.
Data Alignment: The library allows the user to request a contiguous memory region with the first byte aligned to a requested boundary. The functionality is similar to the memalign routine in the standard malloc library.
Padding: The ability to pad the allocated memory to the nearest alignment boundary
(for example, a cache line or a page) is present in our system. Furthermore, no restrictions are made on the size of the padded space. Combining alignment and padding is also possible.
Grouping Related Data: We allow related data to be grouped together using reservations. On an allocation, a data structure can reserve space for an allied data structure via the padding mechanism above. A pointer to the reserved memory and its size are then returned to the caller. We provide routines to allocate memory from the reserved space instead of a segment; the caller can use those routines to allocate memory for a related data structure at a later point.
Freeing Memory: Another feature of our library is the ability to free memory on a region basis. Freeing memory is application controlled, and an entire region can be freed by a single call. A simple list of allocated memory segments associated with each region suffices to keep track of memory that is used/unused. Maintaining and searching detailed free lists, as in the standard malloc library, can be avoided.
Efficient Memory Reuse: With this feature all the segments of a region can either be deleted, or they can be marked as unused. If the region is deleted, then subsequent allocations will allocate new segments, as described above. However, in association mining we may make use of the fact that a large data structure like the hash tree is deallocated and rebuilt once per iteration. In such cases, by just marking the region as unused, subsequent allocation calls can be made more efficient by reusing the segments allocated in the previous phase.
Thread Safety: Our library also has thread-safe and thread-unsafe calls to each library
function. This allows the application to avoid the unnecessary locking overhead if it knows beforehand that two processors will not allocate memory from the same region simultaneously.
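To make the feature list concrete, here is a minimal sketch of what a region/segment allocator with these capabilities might look like in C. All names and signatures (Region, region_alloc, region_alloc_reserve, and so on) are illustrative assumptions and not the library's actual interface; only the core bump-pointer allocation path is fleshed out.

/* Illustrative sketch of a region/segment allocator with the features
 * listed above. Names and signatures are assumptions, not the paper's API. */
#include <stdlib.h>
#include <stddef.h>

typedef struct Segment {
    char *base, *cur, *end;          /* bump-pointer allocation within a segment */
    struct Segment *next;
} Segment;

typedef struct {
    Segment *segments;               /* list of segments owned by this region */
    size_t   seg_size;               /* user-specified segment size */
} Region;

Region *region_create(size_t seg_size);                 /* new region with one segment */
void   *region_alloc(Region *r, size_t size);           /* allocate from the current segment */
void   *region_memalign(Region *r, size_t align, size_t size);       /* aligned allocation */
void   *region_alloc_padded(Region *r, size_t size, size_t boundary); /* pad to a cache line/page */
void   *region_alloc_reserve(Region *r, size_t size, size_t reserve,
                             void **reserved);          /* grouping via reservations */
void    region_free_all(Region *r);                     /* free every segment of the region */
void    region_mark_unused(Region *r);                  /* keep segments for efficient reuse */

/* Core path: bump-pointer allocation; a new segment is obtained from the
 * standard malloc() only when the current segment is exhausted. */
void *region_alloc(Region *r, size_t size)
{
    size = (size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);    /* round up for alignment */
    Segment *s = r->segments;
    if (s == NULL || s->cur + size > s->end) {
        size_t seg = size > r->seg_size ? size : r->seg_size;
        s = malloc(sizeof(Segment) + seg);
        s->base = s->cur = (char *)(s + 1);
        s->end  = s->base + seg;
        s->next = r->segments;       /* older segments stay on the region's list */
        r->segments = s;
    }
    void *p = s->cur;
    s->cur += size;
    return p;
}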
4 Memory Placement Policies for Association Mining

Processor speeds are improving by 50-100% every year, while memory access times are improving by only 7% every year [15]. Memory latency is therefore becoming an increasingly important performance bottleneck in modern multiprocessors. While cache hierarchies alleviate this problem to an extent, application-level locality optimization is crucial for improving performance. It is even more important for data mining applications due to the use of dynamic data structures like linked lists, hash tables, and hash trees. In this section we describe a set of custom memory placement policies for improving the locality and reducing the false sharing of parallel association mining. The proposed techniques are directly applicable to other mining algorithms such as quantitative and multi-level associations, sequential patterns, and so on.
4.1 Improving Reference Locality

We consider several placement policies relevant to the parallel data mining algorithm. All the locality-driven placement strategies are guided by the order in which the hash tree and its building blocks (hash tree nodes, hash tables, itemset lists, list nodes, and the itemsets) are traversed with respect to each other. The specific policies we looked at are described below:
Common Candidate Partitioned Database (CCPD): This is the original program that includes calls to the standard Unix memory allocation library available on the target platform.
Simple Placement Policy (SPP): This policy does not use the program access patterns. All the different hash tree building blocks enumerated above are allocated memory from a single region. This scheme does not rely on any special placement of the blocks based on traversal order; placement is implicit in the order of hash tree creation. Data structures allocated by consecutive calls to the memory allocation routine are adjacent in memory. The runtime overhead involved in this policy is minimal. There are three possible variations of the simple policy, depending on where the data structures reside: 1) Common Region: all data structures are allocated from a single global region; 2) Individual Regions: each data structure is allocated memory from a separate region; and 3) Grouped Regions: data structures are grouped together using program semantics and allocated memory from the same region. The SPP scheme used in this paper uses the Common Region approach.
Localized Placement Policy (LPP): This is an instance of a policy that groups related data structures together using local access information present in a single subroutine. In this policy we utilize the reservation mechanism to ensure that a list node and its associated itemset in the leaves of the final hash tree are together whenever possible, and that the hash tree node and the itemset list header are together. The rationale behind this placement is the way the hash tree is traversed in the support counting phase: an access to a list node is always followed by an access to its itemset, and an access to a leaf hash tree node is followed by an access to its itemset list header.
Global Placement Policy (GPP): This policy uses global access information to apply the placement. We utilize the knowledge of the entire hash tree traversal order to place the hash tree building blocks in memory, so that the next structure to be accessed lies in the same cache line in most cases. The construction of the hash tree proceeds in a manner similar to SPP. We then remap the entire tree according to the tree access pattern in the support counting phase. In our case the hash tree is remapped in depth-first order, which closely approximates the hash tree traversal. The remapped tree is allocated from a separate region. Remapping costs are comparatively low, and the policy also benefits from delete aggregation and the lower maintenance costs of our custom library.
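A sketch of the remapping step, reusing the node structures and region allocator from the earlier sketches: the tree is copied into a fresh region in depth-first order so that nodes visited consecutively during support counting are adjacent in memory. Whether the itemsets themselves are also copied is an implementation choice; the paper's exact remapping routine is not shown here.

/* GPP remapping: copy the hash tree into a new region in depth-first order,
 * approximating the traversal order of the support counting phase. */
HashTreeNode *remap_depth_first(const HashTreeNode *node, Region *r)
{
    HashTreeNode *copy = region_alloc(r, sizeof(HashTreeNode));
    *copy = *node;
    if (node->hash_table != NULL) {                       /* internal node */
        copy->hash_table = region_alloc(r, HASH_SIZE * sizeof(HashTreeNode *));
        for (int i = 0; i < HASH_SIZE; i++)
            copy->hash_table[i] = node->hash_table[i]
                ? remap_depth_first(node->hash_table[i], r) : NULL;
    } else {                                              /* leaf: copy the list nodes in order */
        ListNode **tail = &copy->itemset_list;
        *tail = NULL;
        for (const ListNode *ln = node->itemset_list; ln != NULL; ln = ln->next) {
            ListNode *nl = region_alloc(r, sizeof(ListNode));
            nl->itemset = ln->itemset;    /* itemsets could also be copied into the region */
            nl->next = NULL;
            *tail = nl;
            tail = &nl->next;
        }
    }
    return copy;
}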
Custom Placement Example

A graphical illustration of the CCPD, Simple Placement, and Global Placement policies appears in Figure 5. The figure shows an example hash tree that has three internal nodes and two leaf nodes. A hash tree node is labeled HTN, a hash table HTNP, a linked list node LN, and an itemset Itemset. The numbers that follow each structure show the creation order when the hash tree is first built. Each internal node points to a hash table, while the leaf nodes have a linked list of candidate itemsets.
CCPD: The original CCPD code uses the standard malloc library. Thus the different structures are allocated in the order of creation and may come from different locations in the heap. The box on the top right-hand side labeled Standard Malloc corresponds to this case.
SPP: The middle box on the right-hand side labeled Custom Malloc corresponds to the SPP placement policy. It should be clear from the figure that no access information is used. The custom malloc library simply ensures that all allocations are contiguous in memory, and can therefore greatly assist cache locality.
[Figure 5: Improving Reference Locality via Custom Memory Placement of Hash Tree. The example hash tree has three internal nodes and two leaf nodes; #x denotes the order of node creation. Standard Malloc (CCPD): the structures HTN#1 through Itemset#14 are scattered at arbitrary heap addresses. Custom Malloc (SPP): the structures are laid out contiguously in creation order: HTN#1 HTNP#2 HTN#3 HTNP#4 HTN#5 LN#6 Itemset#7 LN#8 Itemset#9 HTN#10 HTNP#11 HTN#12 LN#13 Itemset#14. Remapping in Depth-First Order (GPP): the same structures are laid out in traversal order: HTN#1 HTNP#2 HTN#10 HTNP#11 HTN#12 LN#13 Itemset#14 HTN#3 HTNP#4 HTN#5 LN#6 Itemset#7 LN#8 Itemset#9. LEGEND: HTN (Hash Tree Node), HTNP (HTN Pointer Array), LN (List Node).]
GPP: The last box, labeled Remapping in Depth-First Order, corresponds to the GPP policy. In the support counting phase each subset of a transaction is generated in lexicographic order, and a tree traversal is made. This order roughly corresponds to a depth-first search. We use this global access information to remap the different structures so that those most likely to be accessed one after the other are also contiguous in memory.
4.2 Reducing False Sharing

The support counting phase of our algorithm suffers from false sharing when updating the itemset support counters. A processor has to acquire the lock, update the counter, and release the lock. During this phase, any other processor that had cached the particular line on which the lock was placed gets its data invalidated.
Padding and Aligning: As a simple solution one may place unrelated data that might
be accessed simultaneously on separate cache lines. We can thus align the locks to separate cache lines, or simply pad out the rest of the coherence block after the lock is allocated. Unfortunately this is not a very good idea for association mining because of the large number of candidate itemsets involved in some of the early iterations (around 0.5 million itemsets; see also the intermediate hash tree sizes in Section 5). While padding will eliminate false sharing, it results in unacceptable memory space overhead and, more importantly, a significant loss in locality.
Segregate Read-Only Data: Instead of padding, we chose to separate out the locks and counters from the itemsets, i.e., to segregate read-only data from read-write data. All the data in the hash tree that is read-only comes from one region, while the locks and counters come from a separate region. (As an additional feature, during the candidate generation phase, which happens in parallel, each processor allocates memory for locks and counters from private regions using a thread-unsafe version of the malloc calls, knowing that no other processor will allocate from its private region; this saves additional locking overhead when allocating memory in parallel.) This ensures that read-only data is never falsely shared. Although this scheme does not eliminate false sharing completely, it does eliminate unnecessary invalidations of the read-only data, at the cost of some loss in locality and an additional overhead due to the extra placement cost. To each of the policies for improving locality we added the feature that the locks and the support counters come from a separate region, in order to reduce false sharing. There are three resulting strategies: 1) Lock-region Simple Placement Policy (L-SPP), 2) Lock-region Localized Placement Policy (L-LPP), and 3) Lock-region Global Placement Policy (L-GPP).
Privatize (and Reduce): Another technique for eliminating false sharing is called Software Caching [8] or Privatization. It involves making a private copy of the data that will be used locally, so that operations on that data do not cause false sharing. For association mining, we can utilize the fact that the support counter increment is a simple addition operation, one which is associative and commutative. This property allows us to keep a local array of counters per processor, allocated from a private lock region. Each processor can increment the support of an itemset in its local array during the support counting phase. This is followed by a global sum reduction. This scheme eliminates false sharing completely, with acceptable levels of memory wastage. It also eliminates the need for any kind of synchronization among processors, but has to pay the cost of the extra reduction step. We combine this scheme with the global placement policy to obtain a scheme which has good locality and eliminates false sharing completely. We call this scheme the Local Counter Array-Global Placement Policy (LCA-GPP).
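A minimal sketch of the privatized counter scheme used by LCA-GPP, assuming each candidate carries a small integer id (an assumption made here for illustration): every processor increments its own array, and a final reduction produces the global supports.

/* Per-processor counting with no locks and no false sharing: each processor's
 * array comes from a private region. */
void count_local(int *local_count, int candidate_id)
{
    local_count[candidate_id]++;
}

/* After all processors finish, sum the private arrays into the shared counts. */
void reduce_counts(int **local_counts, int nprocs,
                   int *global_count, int ncandidates)
{
    for (int c = 0; c < ncandidates; c++) {
        int sum = 0;
        for (int p = 0; p < nprocs; p++)
            sum += local_counts[p][c];
        global_count[c] = sum;
    }
}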
5 Experimental Evaluation

All the experiments were performed on a 12-node SGI Power Challenge shared-memory multiprocessor. Each node is a MIPS processor running at 100MHz. There is a total of 256MB of main memory. The primary cache size is 16 KB (64-byte cache lines), with separate instruction and data caches, while the secondary cache is 1 MB (128-byte cache lines). The databases are stored on an attached 2GB disk. All processors run IRIX 5.3, and data is obtained from the disk via an NFS file server.

Database        Total Size   CCPD total execution time, 0.1% (0.5%) support
                             1 proc            4 procs           8 procs
T5.I2.D100K     2.6MB        43.9 (12.6)       27.9 (9.4)        26.9 (7.5)
T10.I4.D100K    4.3MB        133.4 (66.4)      70.0 (31.8)       56.6 (23.6)
T20.I6.D100K    7.9MB        3027.2 (325.3)    1780.8 (126.0)    1397.1 (80.8)
T10.I6.D400K    17.1MB       895.3 (225.8)     397.7 (80.6)      288.9 (55.5)
T10.I6.D800K    34.6MB       2611.1 (463.1)    1382.7 (159.4)    999.0 (95.1)
T10.I6.D1600K   69.8MB       2831.4 (883.6)    1233.0 (280.1)    792.7 (160.6)
T10.I6.D3200K   136.9MB      6084.7 (2143.6)   2591.1 (815.2)    1676.5 (461.3)

Table 1: Database properties and Total Execution Times for CCPD
[Figure 6: Intermediate Hash Tree Size (0.1% Minimum Support). The plot shows the hash tree size in MBytes (log scale, roughly 0.01 to 100) over iterations 2 through 10 for the databases T5.I2.D100K, T10.I4.D100K, T20.I6.D100K, T10.I6.D400K, T10.I6.D800K, and T10.I6.D1600K.]

We used different synthetic databases generated using the procedure described in [3]. These databases have been used as benchmarks for many association rule algorithms [5, 16, 24, 26, 3]. They mimic the transactions in a retailing environment. Each transaction has a unique ID followed by a list of items bought in that transaction. The mining task provides information about the sets of items generally bought together. Table 1 shows the databases used and their properties. The number of transactions is denoted as |D|, the average transaction size as |T|, and the average maximal potentially large itemset size as |I|. T5.I2.D100K (see Table 1) corresponds to a database with an average transaction size of 5, 100,000 transactions, and an average maximal potentially large itemset size of 2. In all the datasets, we set the number of maximal potentially large itemsets |L| = 2000, and the number of items N = 1000. We refer the reader to [3] for more detail on the database generation. All our experiments were performed with minimum support values of 0.1% and 0.5%. Table 1 also shows the total execution times for the CCPD algorithm on 1, 4, and 8 processors, to give an indication of how long each run took. Figure 6 shows the intermediate hash tree sizes for the different databases at 0.1% support.
5.1 Improving Reference Locality and Reducing False Sharing

[Figure 7: Memory Placement Policies: 1 Processor. Two panels of execution times normalized to CCPD, at 0.5% and 0.1% support, comparing CCPD, SPP, LPP, and GPP on the databases T5.I2.D100K through T10.I6.D1600K.]

In designing the different placement policies, our goal was to minimize the total execution time of the algorithm. We measured the overall time spent in allocating and freeing memory in all the sequential runs using the Unix malloc library versus using our custom malloc library. For the Unix malloc, less than 1% of the total execution time was spent in malloc/free for the larger databases. While we get improvements over those times using our custom malloc, the gains are not a significant part of the total execution time. The large benefits from our placement policies are mainly due to locality enhancements and false sharing reductions.
Uniprocessor Performance

In the sequential run we compare the three locality-enhancing strategies on different databases. The results are shown in Figure 7. All times are normalized with respect to the CCPD time for each database. We observe that the simple placement (SPP) strategy does extremely well, with a 40-55% improvement over the base algorithm. The simple scheme also does better than the local placement strategy. This is due to the fact that SPP is the least complex in terms of runtime overhead, yet its purely dynamic placement approaches the placement of the local placement strategy for the most part. The global placement strategy involves the overhead of remapping the entire hash tree (less than 2% of the running time). On smaller datasets we observe that the gains in locality are not sufficient to overcome this overhead, but as we move to larger datasets the global strategy performs the best. Furthermore, as the size of the dataset increases the locality enhancements become more prominent, as more time is spent in the support counting phase of the program. While this trend continues for larger databases, the gains in locality due to remapping have to be balanced against the increasing cost of remapping for larger trees. This tradeoff becomes apparent for the T20.I6.D100K and T10.I6.D800K databases. Both of these result in large intermediate hash tree sizes, increasing the remapping cost and undermining the locality gains. Another important factor influencing the simple placement policy is that the creation order of the hash tree is approximately the same as the access order, that is, the candidates are inserted in the tree in lexicographic order. Later, in the support counting phase, we form the subsets of a transaction also in lexicographic order. However, due to the dynamic nature of tree construction, SPP could still benefit from improved locality. This is exactly where the global scheme gains (minus the remapping cost).
Multiprocessor Performance

For the multiprocessor case, we present results for 4- and 8-processor runs. The results for the various placement policies are shown in Figures 8 and 9. All times are normalized with respect to the CCPD time for each database. In the graphs, the strategies have been arranged in increasing order of complexity from left to right. We observe that with larger databases (both in the number of transactions and in the average size of the transactions), the effect of improving locality becomes more apparent. This is due to the fact that larger databases spend more time in the support counting phase, and consequently there are more locality gains. We note that the more complex the strategy, the better it does on larger data sets, since there are enough gains to offset the added overhead.
[Figure 8: Memory Placement Policies: 4 Processors. Two panels of execution times normalized to CCPD, at 0.5% and 0.1% support, comparing CCPD, SPP, L-SPP, L-LPP, GPP, L-GPP, and LCA-GPP on the databases T5.I2.D100K through T10.I6.D3200K.]

Furthermore, with an increasing number of processors the effect of improving locality becomes less apparent. The reason for this is that since the database is partitioned among the processors, less time is spent in the support counting phase per processor. Although the relative benefits from locality decrease, the relative benefits from reducing false sharing increase with the number of processors. Our lock-region based strategies to reduce false sharing tend to do better in this situation. With an increasing number of processors the amount of wasted computation increases for the global placement schemes, since the remapping happens sequentially. However, the added overheads of locking and remapping are overcome when the dataset is large enough, and we see gains with the more complex policies. On comparing the bar charts for 0.5% and 0.1% support, we note that the relative benefits of our optimizations become more apparent at lower support (for the T20.I6.D100K database the relative benefit for the 0.1% support case is 30% more than for the 0.5% support
case). It is also interesting to note that for 0.5% support the only database in which GPP performs better than SPP is the largest database in our experiments, whereas for 0.1% support GPP performs better than SPP for all except the two smallest databases. Both these observations can be attributed to the fact that the ratio of memory placement overhead to the total execution time is higher for higher support values. Further, on going from a lower support to a higher one, the net effective benefit of placement also reduces with the reduction in candidate hash tree sizes.
[Figure 9: Memory Placement Policies: 8 Processors. Two panels of execution times normalized to CCPD, at 0.5% and 0.1% support, comparing CCPD, SPP, L-SPP, L-LPP, GPP, L-GPP, and LCA-GPP on the databases T5.I2.D100K through T10.I6.D3200K.]

We will now discuss some policy-specific trends. The base CCPD algorithm suffers from poor reference locality and high amounts of false sharing. As explained earlier, this is mainly due to the dynamic nature of memory allocations. The simple placement policy, SPP, performs extremely well. This is not surprising since it imposes the smallest overhead of all the special placement strategies. As we explained earlier, the dynamic creation order to an
extent matches the hash tree traversal order in the support counting phase. We thus obtain a gain of 40-60% by using the simple scheme alone. The lock-region simple placement scheme, L-SPP, does slightly better than the SPP policy. As with all lock-region based schemes, it loses some locality, but with increasing data sets the benefits of reducing false sharing outweigh the overhead. L-LPP was designed to recover the loss of locality due to the separate lock region in L-SPP, by trying to place related data together in some cases. However, it also has extra complexity, and consequently it is generally very close to L-SPP. From a pure locality viewpoint, the global placement policy performs the best with increasing data size. It uses the hash tree traversal order to re-arrange the hash tree to maximize locality, resulting in significant performance gains. We try to reduce the false sharing of the GPP policy by coupling it with a separate lock region. L-GPP does worse than GPP on the smaller data sets due to its increased overhead (remapping plus separate lock segments). However, for the biggest database it clearly outperforms the previous policies. Finally, GPP with the local lock region per processor, LCA-GPP, eliminates false sharing completely, and at the same time eliminates all synchronization, while retaining good locality. It thus performs the best for the large data sets. We expect the benefits of reducing false sharing to be more pronounced for software distributed shared memory, due to larger coherence blocks.
6 Related Work

6.1 Association Mining

Sequential Algorithms

Several algorithms for mining associations have been proposed in the literature [2, 22, 5, 16, 24, 17, 26, 3, 27]. The Apriori algorithm [22, 5, 3] was shown to have superior performance to earlier approaches [2, 24, 16, 17] and forms the core of almost all current algorithms. The key observation used is that all subsets of a frequent itemset must themselves be frequent. It makes multiple database scans, as described in Section 2.2, and may thus incur high I/O overhead. The Partition algorithm [26] minimizes I/O by scanning the database only 2 or 3 times. It partitions the database into small chunks which can be handled in memory. In the first pass it generates the set of all potentially frequent itemsets (any itemset locally frequent in a partition), and in the second pass their global support is obtained. Another way to minimize the I/O overhead is to work with only a small random sample of the database. An analysis of the effectiveness of sampling for association mining was presented in [32], and [27] presents an exact algorithm that finds all rules using sampling. We recently proposed new association mining algorithms that usually make only 3 scans [33, 34]. They use novel itemset clustering techniques, based on equivalence classes and maximal hypergraph cliques, to approximate the set of potentially maximal frequent itemsets. Efficient lattice traversal techniques, based on bottom-up and hybrid search, are then used to generate the frequent itemsets contained in each cluster. The new algorithms were shown to perform better than both Apriori and Partition.
Parallel Algorithms

Distributed Memory Machines: There has been relatively little work on parallel mining of associations. Three different parallelizations of Apriori on a distributed-memory machine (IBM SP2) were presented in [4]. The Count Distribution algorithm is a straightforward parallelization of Apriori. Each processor generates the partial support of all candidate itemsets from its local database partition. At the end of each iteration the global supports are generated by exchanging the partial supports among all the processors. The Data Distribution algorithm partitions the candidates into disjoint sets, which are assigned to different processors. However, to generate the global support each processor must scan the entire database (its local partition, and all the remote partitions) in all iterations. It thus suffers from a huge communication overhead. The Candidate Distribution algorithm also partitions the candidates, but it selectively replicates the database, so that each processor proceeds independently. The local portion is scanned once during each iteration. Other parallel algorithms improving upon these ideas in terms of communication efficiency or aggregate memory utilization have also been proposed [10, 14]. The PDM algorithm [25] presents a parallelization of the DHP algorithm [24]. However, PDM performs worse than Count Distribution [4]. In recent work we presented new parallel association mining algorithms [35]. They utilize the fast sequential algorithms proposed in [31, 33] as the base algorithms. The database is also selectively replicated among the processors so that the portion of the database needed for the computation of associations is local to each processor. After the initial set-up phase, the algorithms do not need any further communication or synchronization. The algorithms minimize I/O overheads by scanning the local database portion only twice. Furthermore, they have excellent locality since only simple intersection operations are used to compute the frequent itemsets.
Shared Memory Machines: To the best of our knowledge the CCPD algorithm [30] is the only parallel algorithm targeting shared memory machines. In [30] we presented the parallel performance of CCPD, along with the benefits of optimizations like hash tree balancing, parallel candidate generation, and short-circuited subset checking. The present paper studies the locality and false sharing problems encountered in CCPD, and proposes solutions for alleviating them.
6.2 Improving Locality

Several automatic techniques like tiling, strip-mining, loop interchange, and uniform transformations [9, 7, 11, 20] have been proposed, on a wide range of architectures, to improve the locality of array-based programs. However, these cannot directly be applied to dynamic data structures. Prefetching is another suggested mechanism to help tolerate the latency problem. Automatic prefetching based on locality analysis of array-based programs was suggested in [23]. Building upon this work, a compiler-based prefetching strategy for recursive data structures was more recently presented in [21]. They propose a data linearization scheme, which linearizes one data class in memory to support prefetching, whereas we suggest an approach
where related structures are grouped together based on access patterns to further enhance locality. Our experimental results indicate increased locality gains when grouping related data structures, rather than linearizing a single class.
6.3 Reducing False Sharing

Five techniques mainly directed at reducing false sharing were proposed in [28]. They include padding, aligning, and allocating memory requested by different processors from different heap regions. Those optimizations result in a good performance improvement for the applications they considered. In our case study, padding and aligning were not found to be very beneficial. Other techniques to reduce false sharing in array-based programs include indirection [12], software caching [8], and data remapping [6]. Our placement policies have utilized some of these techniques.
6.4 Memory Allocation

General-purpose algorithms for dynamic storage allocation have been proposed and honed for several years [19, 18, 29]. An evaluation of the performance of contemporary memory allocators on 5 allocation-intensive C programs was presented in [13]. Their measurements indicated that a custom allocator can be a very important optimization for memory subsystem performance. They pointed out that most of the extant allocators suffered from cache pollution due to the maintenance of boundary tags, and that optimizing for space (minimizing allocations) can work against speed, due to lower locality. All of these points were borne out by our experimental results and went into the design of our custom memory placement library. Our work differs from theirs in that they rely on custom allocation, while we rely on custom memory placement (where related data are placed together) to enhance locality and reduce false sharing.
7 Conclusions

Locality and false sharing are important issues in modern multiprocessor machines due to the increasing gap between processor and memory subsystem performance. This is particularly true for data mining applications, which operate on large sets of data and use large pointer-based dynamic data structures, so that memory performance is critical. Such applications typically suffer from poor data locality and high levels of false sharing where shared data is written. Most of the previous work dealing with enhancing locality and reducing false sharing has been proposed in the context of array-based applications and cannot be applied to recursive data structures such as those found in the data mining domain. In this paper, we proposed a set of mechanisms and policies for controlling the allocation and memory placement of such data structures, to achieve the conflicting goals of maximizing locality while minimizing the level of false sharing. The mechanism takes the form of a custom memory allocation library, and we evaluated a set of placement policies on a shared-memory multiprocessor using a representative parallel data mining application, association rule discovery. Our experiments show that simple placement schemes can be quite effective, and for the datasets we looked at they improve the execution time of our application by up to a factor of two. We also observed that as the datasets become larger (alternately, at lower support values) and there is more work per processor, the more complex placement schemes improve matters further (up to 20% for the largest dataset). It is likely that for larger datasets this gain from more complex placement schemes will increase. While we evaluated these policies on one data mining application, viz. association rules, its access patterns and data structures are similar to those found in sequence mining and quantitative association mining. We expect that for these applications similar, if not better, improvements will be observed. The reason for our optimism lies in the fact that for these applications the hash trees are likely to be larger, and the support counting phase more compute intensive. We are currently evaluating our placement policies in these domains.
References

[1] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Trans. on Knowledge and Data Engg., 5(6):914-925, 1993.
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Intl. Conf. Management of Data, May 1993.
[3] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast discovery of association rules. In U. Fayyad et al., editors, Advances in Knowledge Discovery and Data Mining. MIT Press, 1996.
[4] R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Trans. on Knowledge and Data Engg., 8(6):962-969, 1996.
[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In 20th VLDB Conf., September 1994.
[6] J. M. Anderson, S. P. Amarasinghe, and M. Lam. Data and computation transformations for multiprocessors. In ACM Symp. Principles and Practice of Parallel Programming, 1995.
[7] J. M. Anderson and M. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In ACM Conf. Programming Language Design and Implementation, June 1993.
[8] R. Bianchini and T. J. LeBlanc. Software caching on cache-coherent multiprocessors. In 4th Symp. Parallel and Distributed Processing, pages 521-526, December 1992.
[9] S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. In 6th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 252-262, October 1994.
[10] D. Cheung, V. Ng, A. Fu, and Y. Fu. Efficient mining of association rules in distributed databases. IEEE Trans. on Knowledge and Data Engg., 8(6):911-922, 1996.
[11] M. Cierniak and W. Li. Unifying data and control transformations for distributed shared-memory machines. In ACM Conf. Programming Language Design and Implementation, June 1995.
[12] S. J. Eggers and T. E. Jeremiassen. Eliminating false sharing. In Intl. Conf. Parallel Processing, pages I:377-381, August 1991.
[13] D. Grunwald, B. Zorn, and R. Henderson. Improving the cache locality of memory allocation. In ACM Conf. Programming Language Design and Implementation, June 1993.
[14] E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In ACM SIGMOD Conf. Management of Data, May 1997.
[15] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1995.
[16] M. Holsheimer, M. Kersten, H. Mannila, and H. Toivonen. A perspective on databases and data mining. In 1st Intl. Conf. Knowledge Discovery and Data Mining, August 1995.
[17] M. Houtsma and A. Swami. Set-oriented mining of association rules in relational databases. In 11th Intl. Conf. Data Engineering, 1995.
[18] C. Kingsley. Description of a very fast storage allocator. Documentation of 4.2 BSD Unix malloc implementation, February 1982.
[19] D. E. Knuth. Fundamental Algorithms: The Art of Computer Programming, Volume 1. Addison-Wesley, 1973.
[20] W. Li. Compiler cache optimizations for banded matrix problems. In Intl. Conf. on Supercomputing, pages 21-30, July 1995.
[21] C.-K. Luk and T. C. Mowry. Compiler-based prefetching for recursive data structures. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 1996.
[22] H. Mannila, H. Toivonen, and I. Verkamo. Efficient algorithms for discovering association rules. In AAAI Wkshp. Knowledge Discovery in Databases, July 1994.
[23] T. C. Mowry, M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In 5th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 62-73, October 1992.
[24] J. S. Park, M. Chen, and P. S. Yu. An effective hash based algorithm for mining association rules. In ACM SIGMOD Intl. Conf. Management of Data, May 1995.
[25] J. S. Park, M. Chen, and P. S. Yu. Efficient parallel data mining for association rules. In ACM Intl. Conf. Information and Knowledge Management, November 1995.
[26] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In 21st VLDB Conf., 1995.
[27] H. Toivonen. Sampling large databases for association rules. In 22nd VLDB Conf., 1996.
[28] J. Torrellas, M. S. Lam, and J. L. Hennessy. Shared data placement optimizations to reduce multiprocessor cache miss rates. In Intl. Conf. Parallel Processing, pages II:266-270, August 1990.
[29] C. B. Weinstock and W. A. Wulf. An efficient algorithm for heap storage allocation. ACM SIGPLAN Notices, 23(10):141-148, October 1988.
[30] M. J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li. Parallel data mining for association rules on shared-memory multi-processors. In Supercomputing'96, November 1996.
[31] M. J. Zaki, S. Parthasarathy, and W. Li. A localized algorithm for parallel association mining. In 9th ACM Symp. Parallel Algorithms and Architectures, June 1997.
[32] M. J. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of sampling for data mining of association rules. In 7th Intl. Wkshp. Research Issues in Data Engg., April 1997.
[33] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.
[34] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. Technical Report URCS TR 651, University of Rochester, April 1997.
[35] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New parallel algorithms for fast discovery of association rules. Data Mining and Knowledge Discovery: An International Journal, to appear, December 1997.