
Exploiting GPU and Cluster Parallelism in Single Scan Frequent Itemset Mining

Youcef Djenouri (a), Djamel Djenouri (b), Asma Belhadi (c), Alberto Cano (d)

(a) Dept. of Mathematics and Computer Science, Southern Denmark University, Odense, Denmark
(b) CERIST Research Center, Algiers, Algeria
(c) RIMA Lab, USTHB, Algiers, Algeria
(d) Dept. of Computer Science, Virginia Commonwealth University, Richmond, VA, USA

Abstract

This paper considers discovering frequent itemsets in transactional databases and addresses the time complexity problem by using high performance computing (HPC). Three HPC versions of the Single Scan (SS) algorithm are proposed. The first (GSS) implements SS on a GPU (Graphics Processing Unit) architecture using an efficient mapping between thread blocks and the input data. The second (CSS) implements SS on a cluster architecture by scheduling independent jobs to workers in a cluster. The third (CGSS) accelerates the frequent itemset mining process by using multiple cluster nodes equipped with GPUs. Moreover, three partitioning strategies are proposed to reduce GPU thread divergence and cluster load imbalance. Results show that CGSS outperforms SS, GSS, and CSS in terms of speedup. Specifically, CGSS provides up to a 350 times speedup for low minimum support values on large datasets. CGSS also outperforms the state-of-the-art HPC-based algorithms on big databases.

Keywords: Frequent Itemset Mining, High-Performance Computing, Support Computing, Big data, GPU.

Preprint submitted to Information Sciences, June 21, 2018

1. Introduction

Frequent Itemset Mining (FIM) aims to discover sets of items that appear frequently in a transactional database (i.e., frequent itemsets). Let T be a set of m transactions {T1, T2, ..., Tm} in a transactional database, and I be a set of n distinct items or attribute values {I1, I2, ..., In}. An itemset, X, is a subset of the set of items, i.e., X ⊆ I. The support of an itemset X is defined as the ratio of the number of transactions that contain X to m. An itemset X is frequent if its support is no less than a user-defined minimum support threshold minsup [2]. The problem of FIM is to find all itemsets that have a support no less than the minsup threshold. FIM has applications in many domains such as market basket analysis [16], social network analysis [4], decision making [12, 13, 17] and information retrieval [15, 14].

The "Big Data" revolution, where database instances have large sizes, yields new challenges for FIM. The motivation to overcome these challenges is well founded by the numerous applications of FIM where Big Data instances are found, such as frequent gene extractions from DNA in bio-informatics [38], frequent itemset extraction from Twitter streams for social network analysis [29], or estimating a particular quantile of a distribution [3].

FIM algorithms can generally be classified into two main categories. The first category is Apriori based algorithms [2], which adopt a generate and test strategy to explore the search space. They first find frequent items, and then recursively combine frequent itemsets containing k items to generate candidate frequent itemsets containing k + 1 items. Then, the support (occurrence frequency) of these candidates is calculated to obtain the frequent (k + 1)-itemsets. The second category is FP-Growth based algorithms [25]. They adopt a divide and conquer strategy, first compressing the transactional database in main memory using an efficient tree structure and then recursively applying the procedure to find frequent itemsets. Since both approaches discover all frequent itemsets with a support no less than minsup, they are called exact approaches. However, these approaches are computationally expensive, especially when the number of attributes increases. In general, generate and test approaches perform multiple database scans, which slows down frequent itemset discovery for large databases.
FP-Growth based approaches perform a limited number of database scans, but they store the entire database in memory, which results in a large memory footprint.

Bio-inspired meta-heuristics are another approach to solve the FIM problem. These include swarm intelligence and genetic algorithms [21]. These approaches are relatively efficient when dealing with big databases. Contrary to exact approaches, bio-inspired heuristics are not guaranteed to find all the frequent itemsets having a support no less than minsup.

Several FIM algorithms have recently been designed for handling big data, such as FiDoop [43], Apriori [32, 36], and MEGPU [19]. However, these algorithms simply adapt existing approaches to High Performance Computing (HPC) platforms. They still require multiple scans of the transactional database, and their performance is usually highly sensitive to variations of the minimum support threshold.
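As a concrete illustration of the support and minsup definitions given in the Introduction, the following Python sketch computes the support of an itemset as the fraction of transactions that contain it. The function names and the toy database are hypothetical and serve only to illustrate the definitions; they are not taken from any of the cited algorithms.

```python
from typing import List, Set


def support(itemset: Set[str], transactions: List[Set[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def is_frequent(itemset: Set[str], transactions: List[Set[str]], minsup: float) -> bool:
    """An itemset is frequent if its support is no less than minsup."""
    return support(itemset, transactions) >= minsup


# Hypothetical toy database with m = 4 transactions.
TOY_DB = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]

if __name__ == "__main__":
    print(support({"a", "b"}, TOY_DB))           # contained in 2 of 4 transactions
    print(is_frequent({"b", "c"}, TOY_DB, 0.5))  # contained in 1 of 4 transactions
```

Every exact algorithm discussed here must return the same set of frequent itemsets under this definition; they differ only in how many database scans and how much memory they need to get there.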


Recently, Single Scan (SS) was introduced in [18]. It generates all possible itemsets from each transaction. If an itemset generated by processing a transaction has already been generated when processing a previous transaction, then its support is incremented by one. Otherwise, its support is set to one. This process is repeated until all transactions in the database have been processed. SS is a correct and complete algorithm that extracts all frequent itemsets by performing a single scan of a transactional database. It was shown that SS has excellent performance in terms of computational time. Moreover, an advantage of SS is that it is insensitive to the minimum support threshold. However, the itemset generation process of SS is costly, since all possible candidate itemsets are extracted from each transaction.

To address this limitation, the current paper proposes three HPC-based versions of SS for the frequent itemset mining problem. The first approach (GSS) has been implemented for the GPU architecture by developing an efficient mapping between thread blocks and input data. Every thread block is responsible for one database partition, and processes this partition independently. The second approach (CSS) implements SS on a cluster architecture by proposing an efficient distribution of independent jobs among workers. Each worker mines one partition given by the master. Finally, the third approach (CGSS) combines both cluster and GPU to further boost the runtime performance of the SS algorithm. Moreover, three partitioning strategies are proposed to reduce GPU thread divergence and cluster load imbalance. GSS and CSS are analyzed theoretically, and the three algorithms (GSS, CSS, and CGSS) are evaluated using standard database instances commonly used to benchmark FIM algorithms. These instances are of various sizes (large and big data).
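The Single Scan procedure described above (one pass over the database, one support counter per generated itemset) can be sketched sequentially as follows. This is a minimal illustration written from the prose description, not the implementation of [18]; the names `generate_all_itemsets` and `single_scan` are hypothetical, and minsup is assumed to be given as a ratio.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Iterator, List


def generate_all_itemsets(transaction: Iterable[str]) -> Iterator[FrozenSet[str]]:
    """Yield every non-empty subset of the items of one transaction."""
    items = sorted(set(transaction))
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def single_scan(transactions: List[List[str]], minsup: float) -> Dict[FrozenSet[str], float]:
    """Scan the database once, counting each generated itemset in a hash table."""
    counts: Counter = Counter()
    for transaction in transactions:
        for itemset in generate_all_itemsets(transaction):
            counts[itemset] += 1  # first occurrence starts at 1, later ones increment
    m = len(transactions)
    # The threshold is applied only after the scan, so the scan itself
    # does not depend on the value of minsup.
    return {s: c / m for s, c in counts.items() if c / m >= minsup}


if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"]]
    for itemset, sup in sorted(single_scan(db, 0.5).items(), key=lambda kv: sorted(kv[0])):
        print(sorted(itemset), sup)
```

A transaction with n items yields 2^n − 1 itemsets, which is the generation cost that the parallel versions aim to spread over many threads or workers.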
Experimental results show that GSS, CSS and CGSS achieve substantial speedups over the sequential SS algorithm for large databases. The results also reveal that CGSS outperforms the state-of-the-art HPC-based FIM approaches for big databases.

The remainder of this paper is organized as follows. Section 2 reviews FIM algorithms. Section 3 describes the three proposed parallel versions of SS. Section 4 analyzes the proposed approaches from a theoretical perspective. Finally, Section 5 presents the performance evaluation and Section 6 draws the conclusions.

2. Related Work

This section provides a literature review of HPC-based approaches for FIM, where GPU and cluster-based FIM algorithms are discussed.


2.1. GPU-based FIM Algorithms

Parallel FIM on GPU was first proposed by Fang et al. [22]. The transaction datasets, itemsets, and transactions are represented by binary bitmap data structures, which improves GPU support counting of candidate itemsets. Zhou et al. [48] developed the GPU-FPM algorithm. Inspired by the Apriori algorithm, it uses a vertical representation of the dataset to overcome the memory limitation of GPUs and the Mempack structure to store different types of data. Syed et al. [1] proposed a new Apriori algorithm for GPU architecture, where the generation of itemsets is first performed by the GPU. Each thread block computes the support of a set of itemsets and the generated itemsets are then sent back to the CPU, which assembles the frequent itemsets.

Silvestri et al. [37] proposed a parallel version of the DCI algorithm [34] in which the intersection and computation operators (which are the most frequent and costly operations of DCI) are parallelized. Two strategies have been proposed, namely the transaction-wise approach and the candidate-wise approach. In the first, all GPU cores work on the same candidate and each thread controls a portion of the data [6]. In the second, many candidates are handled by the GPU at the same time, and a contiguous sub-sequence of candidates is assigned to each thread block. Zhang et al. [47] developed the GPApriori algorithm, which uses two data structures to accelerate itemset support counting. The CUDA-Apriori algorithm [9] first divides transactions among different threads. Then, k-candidate itemsets are generated and handled by one thread using only the portion of the dataset assigned to the corresponding block. Wang et al. [41] parallelized FP-Growth using CUDA by developing an efficient strategy to build and mine an FP-Tree on a GPU host.

Gu et al. [23] proposed the Bit Q Apriori algorithm to simplify the process of candidate generation and support counting. Unlike the Apriori algorithm, the Bit Q Apriori algorithm generates an itemset containing k items by combining an item with a frequent itemset containing k − 1 items. The bitset structure is used to store the identifiers of transactions containing each candidate. Therefore, support counting can be implemented using the Boolean AND operator, which avoids multiple database scans. Schlegel [36] proposed the cApriori algorithm, which compresses the transactional database to store it in the shared memory of the GPU blocks. Jian et al. [26] designed the CU-Apriori algorithm, which develops two strategies for parallelizing both candidate itemset generation and support counting on


the GPU. To generate candidates, each pair of frequent (k − 1)-itemsets is assigned to a thread, which compares them to check if they share a common (k − 2)-prefix. If so, a k-sized candidate itemset is generated by combining the two itemsets. To evaluate the support of generated candidates, each candidate is assigned to a thread, which then counts the itemset's support by scanning the transactions. The evaluation of frequent itemsets has also been improved using mapping and sum reduction techniques to merge all support counts of itemsets [19]. This process was further improved by developing three strategies for minimizing the impact of GPU thread divergence [20]. Li et al. [31] developed a multilevel layer data structure to enhance the performance of support counting. It divides vertical data into several layers, where each layer is an index table of the next layer. This strategy can completely represent the original vertical structure. In a vertical structure, each item corresponds to a fixed-length binary vector. However, with this strategy the length of each vector varies depending on the number of transactions containing the corresponding item.

2.2. Cluster-based FIM Algorithms

Parallel versions of Apriori have been implemented on a cluster architecture using the MapReduce framework [49, 10, 27]. In [27], the authors parallelize the itemset support counting process using MapReduce. The candidate itemsets are first assigned to each cluster node, each machine mapping the given itemset to a transaction. Then, the reducer sums the values found by the mappers to determine the support count of each itemset. The algorithm in [10] is applied to extract relational entities on the Web. The algorithm was shown to be effective for a small number of items. However, increasing the number of items increases the number of candidates, which degrades the overall performance of the approach. Zhou et al. [49] proposed an approach where both support counting and candidate generation are parallelized. The map function is used to combine pairs of k-itemsets to obtain itemsets of length 2 × k. The reduce function is then launched to generate itemsets of length k + 1 from each itemset of length 2 × k. This algorithm was shown to outperform the first in terms of runtime, but it does not ensure synchronization when generating candidates and performing support counting.

A new parallel FIM algorithm based on the MapReduce framework is proposed in [45]. It includes two steps. The first is pre-processing, where mappers divide input transactions according to a neighborhood scheme. Then, each mapper checks neighbor constraints to avoid duplicate neighbor transactions. In the second step, every mapper extracts frequent itemsets from the neighborhood transactions. The reducer computes a prevalence measure and outputs prevalent co-located event sets. Ravi et al. [35] reviewed several Apriori-based architectures. Li et al. [30] proposed a parallel version of the FP-Growth algorithm, called PFP. It partitions the computation step in such a way that each machine executes an independent set of tasks, and it proposes solutions to the memory consumption problem of the sequential FP-Growth algorithm. However, it requires synchronization between all machines during its grouping and aggregation steps. A structure called Peano Count Tree (P-tree) [11] was proposed for parallel FIM. The P-tree structure provides a lossless and compressed representation of spatial data. This ensures fast support calculation during the mining process.

Wu et al. [42] discussed some challenges in big data analytics, such as mining evolving data streams and the need to handle many exabytes of data across various application areas like social network analysis. Moens et al. [33] presented the BigFIM algorithm, which combines principles from both Apriori and Eclat. BigFIM is implemented using the MapReduce paradigm. Partitions are determined using Eclat and each of them is processed in parallel using the Apriori approach. In [50], a compressed data layout scheme is developed that allows high off-chip memory bandwidth utilization. This data structure reduces the memory requirements of support counting, especially when dealing with big data instances. Xun et al. [43] proposed FiDoop, a Hadoop implementation based on MapReduce programming for frequent itemset mining. It incorporates the concept of the FIU-tree rather than the traditional FP-tree structure of the FP-Growth algorithm, for the purpose of improving the storage of candidate itemsets. An improved version, FiDoop-DP, proposes an efficient strategy to partition datasets among mappers [44]. This allows better exploration of the search space using the cluster hardware architecture by avoiding job redundancy.
All aforementioned HPC-based approaches are designed to improve the performance of classical FIM algorithms such as Apriori and FP-Growth. To the best of our knowledge, there is no HPC-based FIM algorithm that parallelizes the recent SSFIM algorithm [18]. In this paper, three HPC-based approaches have been designed for parallelizing the SS approach; they are presented in the next section.

3. High Performance Computing for Single Scan

In this section, we propose three HPC-based approaches to parallelize SS. The first (GSS) is designed to be run on a GPU, the second (CSS) on a cluster, and the third (CGSS) combines the advantages of both cluster and GPU architectures. The three proposed approaches perform a pre-processing step, which consists of

sorting the input transactional database in ascending order according to the number of items per transaction. This reduces thread divergence on the GPU architecture and load balancing costs on the cluster architecture.

3.1. GSS

Graphics Processing Units (GPUs) were originally designed for video games and multimedia purposes. However, they have also been used for many other general-purpose tasks in the last decade due to their powerful computing capabilities [7, 40]. GPU computing involves both the host (CPU side computations and memory) and the device (GPU side computations and memory). GPUs have a massively parallel, many-core architecture capable of running millions of threads organized in blocks. Each thread in a block communicates with other threads using a low-latency shared memory, whereas the communication between blocks relies on a high-bandwidth global memory. The CPU/GPU communication is conducted through the PCI-E bus.

In the following, the proposed version of SS for the GPU architecture is called GSS. In GSS (see Fig. 1), the transactional database is first sorted according to the number of items per transaction in ascending order and then transmitted to the GPU. This data transfer is required only once to initialize the GPU memory. Next, the transactional database is partitioned into p partitions, where p is the number of blocks available in the GPU. Each block of threads is mapped to a partition and every thread is assigned to process a transaction. The partitions are built in m/p rounds, where m is the number of transactions and p is the number of blocks. During the ith round, the (((i − 1) × p) + 1)th transaction is assigned to the first block, the (((i − 1) × p) + 2)th transaction is assigned to the second block, and so on until the (((i − 1) × p) + p)th transaction is assigned to the pth block. This minimizes thread divergence during the generation process. Indeed, threads of a block deal with the same number of items per transaction.
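The round-based assignment described above amounts to a round-robin distribution of the sorted transactions over the p blocks. The following Python sketch shows this on the host side; `round_robin_partition` is a hypothetical helper written from the prose, not code from the paper, and it uses 0-based indices where the text uses 1-based ones.

```python
from typing import List, Sequence


def round_robin_partition(transactions: Sequence[Sequence], p: int) -> List[List[Sequence]]:
    """Sort by transaction length, then deal transactions to p partitions in rounds.

    In round i (1-based), the ((i - 1) * p + j)th sorted transaction goes to
    partition j, which is equivalent to `index % p` with 0-based indices.
    """
    if p <= 0:
        raise ValueError("p must be a positive number of partitions")
    ordered = sorted(transactions, key=len)  # pre-processing step: ascending size
    partitions: List[List[Sequence]] = [[] for _ in range(p)]
    for index, transaction in enumerate(ordered):
        partitions[index % p].append(transaction)
    return partitions


if __name__ == "__main__":
    db = [[1, 2, 3], [1], [1, 2], [1, 2, 3, 4]]
    for block_id, part in enumerate(round_robin_partition(db, 2)):
        print(block_id, part)
```

Because the input is sorted first, the transactions dealt within one round have neighboring sizes, so each partition receives a similar mix of short and long transactions.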
After the partition phase, the generation phase is launched. m/p threads per block are required to process the database using only one scan. GSS defines a local hash table, table_i, located in the shared memory of each block b_i, for computing the support of the itemsets of the partition p_i, and a global hash table to determine the global support of all itemsets. The global hash table is maintained in the global memory of the GPU. The jth thread, t_i^j, allocated to the ith thread block generates the candidate itemsets from the transaction T_j of the partition p_i. Next, a local sum reduction technique is applied on each block to construct the local hash table. A global sum reduction is then applied to calculate the global hash table of all itemsets. Finally, the GPU sends only the itemsets that meet the minimum support constraint to the CPU. The pseudo-code of the GSS algorithm is given in Algorithm 1.

Figure 1: GSS Framework.

3.2. CSS

A computing cluster is composed of several heterogeneous machines (nodes), each having its own memory and processor. The user submits a job to the cluster via a network. Once the cluster receives a job, the frontend node schedules and distributes smaller tasks to the compute nodes. The output of the compute nodes is returned to the frontend node and merged. While many models for running programs on a cluster architecture have been proposed [5], the most widely used is the master/workers model. The master distributes independent tasks to workers, then each worker performs its tasks independently before returning the results to the

Algorithm 1 GSS.
  Input: T: transactional database. Υsup: user-defined minimum support threshold.
  Output: F: the set of frequent itemsets.
  idx ← blockIdx.x × blockDim.x + threadIdx.x
  S ← GenerateAllItemsets(T_idx)
  for each itemset t ∈ S do
    if t ∈ LH[idx] then
      LH[idx, t] ← LH[idx, t] + 1
    else
      LH[idx, t] ← 1
    end if
  end for
  syncthreads()
  LH ← LocalSumReduction(blockIdx.x)
  syncthreads()
  GH ← GlobalSumReduction()
  F ← ∅
  for each itemset t ∈ GH do
    if GH[t] ≥ Υsup then
      F ← F ∪ {t}
    end if
  end for
  cudaMemcpy(F, cudaMemcpyDeviceToHost)

  Notation:
  blockIdx.x: the index of the block that contains the current thread.
  threadIdx.x: the index of the current thread within its block.
  blockDim.x: the number of threads in the block of the current thread.
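The data flow of Algorithm 1 (per-block local hash tables, a local reduction, a global reduction, and a final threshold filter) can be simulated on the CPU to check the logic. The Python sketch below is only such a simulation, under the assumptions that each partition stands in for one thread block, each transaction for one thread, and the threshold is an absolute count; it does not model shared memory, synchronization, or the CUDA kernel itself, and the names `all_itemsets` and `gss_simulation` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence


def all_itemsets(transaction: Sequence[str]) -> List[FrozenSet[str]]:
    """All non-empty subsets of one transaction (the work of a single thread)."""
    items = sorted(set(transaction))
    return [frozenset(c)
            for size in range(1, len(items) + 1)
            for c in combinations(items, size)]


def gss_simulation(partitions: List[List[Sequence[str]]], min_count: int) -> Dict[FrozenSet[str], int]:
    """Sequential stand-in for the GSS data flow."""
    local_tables: List[Counter] = []
    for block in partitions:                 # one thread block per partition
        local: Counter = Counter()
        for transaction in block:            # one thread per transaction
            local.update(all_itemsets(transaction))
        local_tables.append(local)           # result of the local sum reduction

    global_table: Counter = Counter()
    for local in local_tables:               # global sum reduction over blocks
        global_table.update(local)

    # Only itemsets meeting the threshold would be copied back to the host.
    return {s: c for s, c in global_table.items() if c >= min_count}


if __name__ == "__main__":
    blocks = [[["a", "b"], ["a"]], [["a", "b", "c"]]]
    for itemset, count in gss_simulation(blocks, 2).items():
        print(sorted(itemset), count)
```

Since the global table is a plain sum of the local tables, the result does not depend on how transactions are assigned to blocks; the partitioning only affects how evenly the work is spread.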

master. The master merges all the results returned by the workers to generate a global solution to the problem. The master communicates with workers via the Message Passing Interface (MPI). The main challenges of such a design are to minimize the time required for synchronization and communication among nodes, and to balance the load between nodes.

CSS (the parallel version of SS for the cluster architecture, see Fig. 2) uses the master/workers model. The master first sorts the transactional database according to the number of items per transaction in ascending order and then partitions the transactional database into k partitions, where k is the number of available workers. Two pointers (Left, Right) are created and initialized to the first and last transactions, respectively. Here, the partitions are built using m/(2 × k) rounds. In the ith round, the (((i − 1) × Left) + 1)th and (((i − 1) × Right) + 1)th transactions are assigned to the first worker, the (((i − 1) × Left) + 2)th and (((i − 1) × Right) + 2)th transactions are assigned to the second worker, and so on until the (((i − 1) × Left) + k)th and (((i − 1) × Right) + k)th transactions are assigned to the kth worker. The Left pointer is incremented by k, whereas the Right pointer is decremented by k. This minimizes the load imbalance between the different workers during the generation process. Indeed, all workers deal with the same number of items per transaction.

After this step, the master sends each partition p_i to the worker w_i. The latter processes the transactions of the partition p_i and generates all candidate itemsets from each transaction t_i^j, gradually creating a local hash table table_i to compute the support of the generated itemsets. When a worker w_i has scanned all transactions of the partition p_i, it sends table_i to the master. When the master has received all hash tables from the k workers, it merges them into a global hash table. Finally, it extracts all frequent itemsets from the global hash table. The pseudo-code of the CSS algorithm is given in Algorithm 2.

Figure 2: CSS Framework.

3.3. CGSS

In CGSS (see Fig. 3), the master/workers model is used, and is applied in five steps. The first step consists of dividing the entire transactional database into p partitions using a naive approach. Each partition is assigned to a GPU worker. The second step is to split each partition P_i among the blocks of the ith GPU worker using intelligent partitioning to reduce thread divergence. In the third step, each thread block generates candidate itemsets using its transactions, and creates a

Algorithm 2 CSS.
  MPI_Init(&argc, &argv).
  MPI_Comm_size(MPI_COMM_WORLD, &K).
  MPI_Comm_rank(MPI_COMM_WORLD, &myid).
  if myid = 0 then
    Partitioning(K)
    i ← 1.
    while i
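The master-side logic of CSS, as described in Section 3.2, consists of a two-pointer partitioning of the sorted database followed by a merge of the workers' hash tables. The Python sketch below illustrates one possible reading of that description, in which each worker receives, per round, one transaction from the short end and one from the long end of the sorted database. It runs in a single process and does not use MPI; `two_pointer_partition` and `merge_tables` are hypothetical names, and the exact pairing of Left and Right transactions is an assumption.

```python
from collections import Counter
from typing import List, Sequence


def two_pointer_partition(transactions: Sequence[Sequence], k: int) -> List[List[Sequence]]:
    """Pair short and long transactions so that each worker gets a similar load."""
    if k <= 0:
        raise ValueError("k must be a positive number of workers")
    ordered = sorted(transactions, key=len)      # ascending number of items
    partitions: List[List[Sequence]] = [[] for _ in range(k)]
    left, right = 0, len(ordered) - 1            # Left and Right pointers
    worker = 0
    while left <= right:
        partitions[worker].append(ordered[left])
        if left != right:                        # avoid adding the middle one twice
            partitions[worker].append(ordered[right])
        left += 1
        right -= 1
        worker = (worker + 1) % k                # after k workers, a new round starts
    return partitions


def merge_tables(local_tables: List[Counter]) -> Counter:
    """Master step: sum the workers' local hash tables into a global one."""
    global_table: Counter = Counter()
    for table in local_tables:
        global_table.update(table)
    return global_table


if __name__ == "__main__":
    db = [list(range(n)) for n in range(1, 9)]   # transaction sizes 1..8
    parts = two_pointer_partition(db, 2)
    print([sum(len(t) for t in part) for part in parts])
```

In an MPI setting, each partition would be sent to one worker and each local table returned to the master before the merge; those communication steps are left out here.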
