2014 Sixth International Symposium on Parallel Architectures, Algorithms and Programming

Efficient Parallelization of a Two-List Algorithm for the Subset-Sum Problem on a Hybrid CPU/GPU Cluster

Letian Kang, Lanjun Wan, and Kenli Li
College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
Email: [email protected]


Abstract—Recently, hybrid CPU/GPU clusters have been widely used to deal with compute-intensive problems, such as the subset-sum problem. The two-list algorithm is a well-known approach to solving this problem. However, a hybrid MPI-CUDA dual-level parallelization of the algorithm on such a cluster is not straightforward. The key challenge is how to allocate the most suitable workload to each node so as to achieve good load balancing between nodes and minimize the communication overhead. Therefore, this paper proposes an effective workload distribution scheme which aims to reasonably assign workload to each node. Based on this scheme, an efficient MPI-CUDA parallel implementation of the two-list algorithm is presented. A series of experiments is conducted to compare the performance of the hybrid MPI-CUDA implementation with that of the best sequential CPU implementation, the single-node CPU-only implementation, the single-node GPU-only implementation, and the hybrid MPI-OpenMP implementation with the same cluster configuration. The results show that the proposed hybrid MPI-CUDA implementation not only offers significant performance benefits but also has excellent scalability.

Keywords—MPI-CUDA implementation; hybrid CPU/GPU cluster; two-list algorithm; subset-sum problem; knapsack problem.

I. INTRODUCTION

Given n positive integers W = [w_1, w_2, ..., w_n] and a positive integer M, the subset-sum problem (SSP) is the decision problem of finding a binary n-tuple solution X = [x_1, x_2, ..., x_n] for the equation

$\sum_{i=1}^{n} w_i x_i = M, \quad x_i \in \{0, 1\}.$   (1)

SSP is a special case of the 0/1 knapsack problem and has been applied to different engineering fields such as capital budgeting, workload allocation and job scheduling. In recent decades, many exact and heuristic algorithms have been developed to solve SSP. A well-known approach is the two-list algorithm proposed by Horowitz and Sahni [1], which solves SSP in O(n 2^{n/2}) time with O(2^{n/2}) memory space. In order to reduce the computation time of solving SSP, parallelizations of the two-list algorithm based on the SIMD (Single Instruction Multiple Data) shared-memory model have been published in [2]–[4]. However, to date, no efficient parallelization of the two-list algorithm on a hybrid CPU/GPU cluster has been reported.
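To make this baseline concrete, the following minimal sketch illustrates the sequential two-list idea; it is our own illustrative code, not the implementation evaluated in this paper, and all identifiers are ours. The items are split into two halves, all subset sums of each half are enumerated and sorted, and the two sorted lists are scanned from opposite ends.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Illustrative sketch of Horowitz and Sahni's two-list algorithm [1].
    // Returns true if some subset of w sums to M; the real algorithm also
    // recovers the solution vector X.
    bool twoListSSP(const std::vector<uint64_t>& w, uint64_t M) {
        size_t h = w.size() / 2;
        auto sums = [&](size_t lo, size_t hi) {
            std::vector<uint64_t> s{0};            // all 2^(hi-lo) subset sums
            for (size_t i = lo; i < hi; ++i) {
                size_t n = s.size();
                for (size_t j = 0; j < n; ++j) s.push_back(s[j] + w[i]);
            }
            std::sort(s.begin(), s.end());
            return s;
        };
        std::vector<uint64_t> A = sums(0, h);        // nondecreasing list A
        std::vector<uint64_t> B = sums(h, w.size()); // scanned from the top
        size_t i = 0, j = B.size();
        while (i < A.size() && j > 0) {              // two-pointer search
            uint64_t t = A[i] + B[j - 1];
            if (t == M) return true;
            if (t < M) ++i; else --j;
        }
        return false;
    }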

Over the past few years, with the rapid development of hybrid CPU/GPU systems, some parallel algorithms [5], [6] have been implemented on a GPU to accelerate the solution of knapsack problems. However, the computational and memory resources available in a single machine are limited and may not be sufficient for solving large-scale knapsack problems. Recently, the use of hybrid CPU/GPU clusters to accelerate compute-intensive problems has become increasingly prevalent. Nevertheless, how to best exploit a hybrid CPU/GPU cluster for solving knapsack problems has not been well studied; only a few works [7], [8] report successful GPU cluster implementations for knapsack problems.

In this paper, we propose an efficient MPI-CUDA parallel implementation of the two-list algorithm for solving SSP on a hybrid CPU/GPU cluster. In order to allocate the most suitable workload to each node, achieve good load balancing between nodes, and minimize the communication overhead, we design an effective workload distribution scheme. We conduct a series of experiments on a cluster with 32 nodes, where each node has two six-core Intel Xeon X5670 CPUs and one NVIDIA Tesla M2050 GPU. The results show that the two-list algorithm can be effectively parallelized using MPI and CUDA on the cluster. The main contributions of this paper include:
• A generic hybrid MPI-CUDA dual-level parallelization scheme of the two-list algorithm is proposed to efficiently solve SSP on a hybrid CPU/GPU cluster.
• An effective workload distribution scheme is designed to allocate the most suitable workload to each node, achieving good load balancing between nodes and minimizing the communication overhead.
• A series of experiments is conducted to compare the performance of the hybrid MPI-CUDA implementation with that of the best sequential CPU implementation, the single-node CPU-only implementation, the single-node GPU-only implementation, and the hybrid MPI-OpenMP implementation with the same cluster configuration.

The rest of this paper is organized as follows. Section II introduces the hybrid MPI-CUDA parallel programming model. Section III describes the hybrid MPI-CUDA parallel implementation of the two-list algorithm. Section IV gives the experimental results and performance analysis. Section V concludes this paper and discusses future work.


II. THE HYBRID MPI-CUDA PARALLEL PROGRAMMING MODEL

In this section, we briefly describe the hybrid MPI-CUDA parallel programming model, which is widely utilized to accelerate parallel applications on a hybrid CPU/GPU cluster, as shown in Figure 1. To facilitate our discussion, we assume that all computing nodes are homogeneous at the node level and that each computing node has multiple CPUs and one GPU.

[Figure 1. Hybrid MPI-CUDA Parallel Programming Model: within each node, the CPUs are connected to the GPU over the PCI-E bus, and the nodes communicate via MPI over a high-speed interconnection network.]

In this model, MPI is used to control the application and perform the communication operations between computing nodes, while CUDA is used to compute the tasks on the GPU. As seen in Figure 1, only one MPI process is launched on one CPU core within each computing node, and it is used to control and communicate with the GPU. In general, an MPI process first transfers the input data from CPU to GPU through the PCI-E bus, then invokes the CUDA kernel, after which all GPU threads run the kernel in parallel; finally, the MPI process transfers the output data from GPU to CPU. It is clear that this hybrid programming model provides dual-level parallelism: coarse-grained data and task parallelism (suitable for MPI), and fine-grained data and thread parallelism (suitable for CUDA). Thus, the hybrid model is effective for applications that exhibit two levels of parallelism and contain compute-intensive tasks that can be processed on a GPU.
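As an illustration of this per-node data flow, the following is a minimal skeleton, assuming one MPI rank per node and a placeholder kernel; all identifiers are ours, not from the paper.

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Hypothetical kernel standing in for a node's compute task.
    __global__ void nodeTask(const int* in, int* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2;   // placeholder computation
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        int *h_in = new int[n], *h_out = new int[n];
        for (int i = 0; i < n; ++i) h_in[i] = i + rank;  // rank-specific input

        int *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(int));
        cudaMalloc(&d_out, n * sizeof(int));

        // 1) host-to-device transfer over the PCI-E bus
        cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
        // 2) the MPI process invokes the CUDA kernel
        nodeTask<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        // 3) device-to-host transfer of the results
        cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

        cudaFree(d_in); cudaFree(d_out);
        delete[] h_in; delete[] h_out;
        MPI_Finalize();
        return 0;
    }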

III. THE HYBRID PARALLEL IMPLEMENTATION OF THE TWO-LIST ALGORITHM

This section describes how to implement Li et al.'s parallel two-list algorithm [4] on a hybrid CPU/GPU cluster. First, an effective workload distribution scheme is proposed. Then, the inter-node parallelization with MPI and the intra-node parallelization with CUDA are discussed.

A. Workload Distribution Scheme

This section describes a workload distribution scheme which aims to reasonably assign workload to each node of a hybrid CPU/GPU cluster. The determination of the workload distribution scheme needs to consider the processing capability, memory capacity and communication rate of each node. In order to achieve better performance, a workload distribution scheme should ensure a balanced distribution of workload between nodes, and minimize both the inter-node communication overhead and the CPU-GPU communication overhead within each node. Here, we consider all nodes to be homogeneous at the node level, namely each node has the same processing capability, memory capacity and communication rate. Hence, the basic idea of the workload distribution scheme is to evenly divide the workload into as many chunks as there are nodes, and then assign each chunk to a node.

Given an n-element input vector W = [w_1, w_2, ..., w_n], Figure 2 illustrates the workload distribution scheme for a cluster with p nodes, where p is a power of 2.

[Figure 2. Workload Distribution Scheme: W is split into W_0, W_1 and W_2; each node m generates lists A, B and C, forms B_m by adding c_m to B, and searches A against B_m.]

The workload distribution scheme consists of the following five steps:

Step 1: We first take the first t_0 elements of vector W, namely extract a subvector W_0 from W, where t_0 = log p and W_0 = [w_1, w_2, ..., w_{t_0}]. Then we take the remaining n − t_0 elements of W and divide them into two equal or approximately equal parts W_1 and W_2. W_1 contains t_1 elements, where t_1 = ⌊(n − log p)/2⌋ and W_1 = [w_{t_0+1}, w_{t_0+2}, ..., w_{t_0+t_1}]. W_2 contains t_2 elements, where t_2 = n − t_0 − t_1 and W_2 = [w_{t_0+t_1+1}, w_{t_0+t_1+2}, ..., w_n].

Step 2: At each node, we produce all 2^{t_1} possible subset sums of W_1, sort them in nondecreasing order, and store them as the list A = [a_1, a_2, ..., a_{2^{t_1}}].

Step 3: At each node, we produce all 2^{t_2} possible subset sums of W_2, sort them in nonincreasing order, and store them as the list B = [b_1, b_2, ..., b_{2^{t_2}}].

Step 4: At each node, we produce all p subset sums of W_0 and store them as the list C = [c_1, c_2, ..., c_p]. After that, at the m-th node we produce a new list B_m by adding the element c_m to each element of the list B, where B_m = [b_1 + c_m, b_2 + c_m, ..., b_{2^{t_2}} + c_m] and 1 ≤ m ≤ p.

Step 5: At the m-th node, we perform the search operation on the two sorted lists A and B_m to find a solution of SSP, where 1 ≤ m ≤ p.

Note that the same lists A, B and C are generated at each node, because the time for transferring them between nodes would exceed the time for generating them locally. Besides, at each node the data to be processed comes from local memory, and the results do not need to be transferred to other nodes. Therefore, hardly any communication between different nodes is needed.
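Under our reading of Step 1, the split sizes can be computed as in the sketch below; the function and variable names are ours, and it assumes p is a power of two, as the scheme requires.

    #include <cstdio>

    // Sketch of the Step 1 split: W -> W0 (first t0 items), then W1 and W2.
    void splitSizes(int n, int p, int& t0, int& t1, int& t2) {
        t0 = 0;
        while ((1 << t0) < p) ++t0;      // t0 = log2(p)
        t1 = (n - t0) / 2;               // |W1| = floor((n - log p) / 2)
        t2 = n - t0 - t1;                // |W2| gets the remainder
    }

    int main() {
        int t0, t1, t2;
        splitSizes(54, 32, t0, t1, t2);  // e.g., n = 54 items on p = 32 nodes
        // Here t0 = 5, t1 = 24, t2 = 25, so |A| = 2^24 and |B| = 2^25.
        std::printf("t0=%d t1=%d t2=%d\n", t0, t1, t2);
        return 0;
    }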


B. Inter-Node Parallelization with MPI

This section describes the basic MPI-CUDA implementation of Li et al.'s parallel two-list algorithm. The algorithm contains three stages: the parallel generation stage, the parallel pruning stage, and the parallel search stage. Each of them can be implemented with the hybrid MPI-CUDA programming model described in Section II. Here, we adopt coarse-grained task parallelism through MPI at the inter-node level and fine-grained data parallelism with CUDA at the intra-node level. In our hybrid MPI-CUDA implementation, we run one MPI process per node, and each MPI process drives the CPU and GPU to perform a series of operations. Figure 3 shows the operations performed by the CPU and GPU at each node. Evidently, each node has the same workload and operations.

Algorithm 1 The Basic MPI-CUDA Implementation of the Parallel Two-List Algorithm
Input: An n-element input vector W = [w_1, w_2, ..., w_n]
Output: A solution of SSP or NULL
1: Initialize the MPI execution environment;
2: Divide W into three parts: W_0, W_1, and W_2;
3: Allocate device memory for W_0, W_1 and W_2;
4: Initialize the knapsack capacity: M = (1/2) Σ_{i=1}^{n} w_i;
5: Copy W_0, W_1, W_2 and M from host to device;
6: Generate three lists A, B and C on GPU;
7: Generate the list B_m by adding the element c_m to each element of B on GPU at the m-th node, where 1 ≤ m ≤ p;
8: Initialize isFound to zero;
9: Perform the pruning operation on the two sorted lists A and B_m on GPU at the m-th node, where 1 ≤ m ≤ p;
10: Copy the pruning results from device to host;
11: if isFound = 1 then
12:     call MPI_Abort;  ▷ a solution is found
13: else
14:     Perform the search operation on the two sorted lists A and B_m on GPU at the m-th node, where 1 ≤ m ≤ p;
15:     Copy the search results from device to host;
16: end if
17: Free device memory;
18: Terminate the MPI execution environment;

[Figure 3. Operations Performed by CPU and GPU at Each Node: every node runs MPI_Init, divides W into W_0, W_1, W_2, transfers them and M to the device, generates lists A, B, C and B_m on the GPU, performs the pruning operation (calling MPI_Abort if a solution is found), otherwise performs the search routine, transfers the results back, and calls MPI_Finalize.]

The basic MPI-CUDA implementation of the parallel two-list algorithm is described in Algorithm 1. Specifically, we first initialize the MPI execution environment. Second, according to the proposed workload distribution scheme, we divide the input vector W into three parts W_0, W_1 and W_2, and allocate device memory for them. Third, we copy W_0, W_1 and W_2 to device memory, and copy the knapsack capacity M to constant memory. Next, we generate the three lists A, B and C on the GPU using CUDA. Then, we generate a new list B_m by adding the element c_m to each element of B on the GPU at the m-th node, where 1 ≤ m ≤ p. After that, in order to reduce the search space, we perform the pruning operation presented in [4] on lists A and B_m on the GPU at the m-th node. After pruning, if a solution is found, we call MPI_Abort() to terminate the MPI execution environment; otherwise, we perform the search operation presented in [4] on lists A and B_m on the GPU. Finally, we copy the search results back to host memory and output them.
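The early-exit control flow of lines 10-16 of Algorithm 1 might be realized as in the following sketch; this is a hedged reading with hypothetical kernel names, and the authors' actual implementation is described in [10].

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Sketch of the post-pruning control flow of Algorithm 1 (illustrative).
    // d_isFound is a device flag set by the (hypothetical) pruning kernel.
    void pruneThenSearch(int* d_isFound /*, device lists A and Bm ... */) {
        // pruningKernel<<<blocks, threads>>>(..., d_isFound);  // see Algorithm 4

        int isFound = 0;
        // line 10: copy the pruning result from device to host
        cudaMemcpy(&isFound, d_isFound, sizeof(int), cudaMemcpyDeviceToHost);

        if (isFound == 1) {
            // lines 11-12: a solution was found during pruning, so tear down
            // every MPI process in the job immediately.
            MPI_Abort(MPI_COMM_WORLD, 0);
        } else {
            // line 14: otherwise run the search kernel on the surviving
            // block pairs and copy the results back (line 15).
            // searchKernel<<<blocks, threads>>>(...);
            // cudaMemcpy(..., cudaMemcpyDeviceToHost);
        }
    }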


C. Intra-Node Parallelization with CUDA

This section describes how to implement the three stages of the parallel two-list algorithm by exploiting fine-grained data parallelism with CUDA. Within each computing node, the generation stage consists of the following phases:

1) Generating the nondecreasing list A. Specifically, we first initialize A_1 = [0, w_{t_0+1}]. Next, we use 2 GPU threads to add the item w_{t_0+2} to each element of the list A_1 in parallel, generating a new list A_1^1 = [0 + w_{t_0+2}, w_{t_0+1} + w_{t_0+2}]. Then, we use the optimal parallel merging algorithm presented in [9] to merge lists A_1 and A_1^1 into a new nondecreasing list A_2. The above process is repeated until all the remaining items of W_1 have been processed. After processing the last item w_{t_0+t_1}, we obtain the final list A_{t_1}, that is, the nondecreasing list A. The procedure of generating list A is described in Algorithm 2.


Algorithm 2 The Procedure of Generating List A with CUDA


Input: A_1 = [0, w_{t_0+1}]
Output: The nondecreasing list A with 2^{t_1} subset sums
1: for i = 1 to t_1 − 1 do
2:     k = min(g_max, 2^i), where g_max is the maximum number of GPU threads available;
3:     for all k GPU threads do in parallel
4:         Produce a new list A_i^1 by adding the item w_{i+t_0+1} to each element of the list A_i;
5:         Use the optimal parallel merging algorithm to merge lists A_i and A_i^1 into a new nondecreasing list A_{i+1};
6:     end for
7: end for
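As a rough CUDA rendering of one iteration of Algorithm 2, the sketch below uses one thread per element for the add step (equivalent to the k-thread loop when 2^i ≤ g_max) and thrust::merge as a stand-in for the optimal parallel merge of [9]; all names are ours.

    #include <cuda_runtime.h>
    #include <thrust/device_vector.h>
    #include <thrust/merge.h>

    // Add step of one iteration of Algorithm 2: A1i[j] = Ai[j] + item.
    __global__ void addItem(const unsigned long long* Ai,
                            unsigned long long* A1i,
                            int len, unsigned long long item) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j < len) A1i[j] = Ai[j] + item;
    }

    // One generation iteration (illustrative). Ai is sorted nondecreasing;
    // adding a constant keeps A1i sorted, so a merge yields the new list.
    thrust::device_vector<unsigned long long>
    growList(const thrust::device_vector<unsigned long long>& Ai,
             unsigned long long item) {
        int len = static_cast<int>(Ai.size());
        thrust::device_vector<unsigned long long> A1i(len);
        addItem<<<(len + 255) / 256, 256>>>(
            thrust::raw_pointer_cast(Ai.data()),
            thrust::raw_pointer_cast(A1i.data()), len, item);
        cudaDeviceSynchronize();

        thrust::device_vector<unsigned long long> next(2 * len);
        thrust::merge(Ai.begin(), Ai.end(), A1i.begin(), A1i.end(),
                      next.begin());   // stand-in for the merge of [9]
        return next;
    }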

2) Generating the nonincreasing list B. The procedure of generating list B is almost the same as that of generating list A, so we omit it for brevity. The detailed CUDA implementation of generating list A (B) is elaborated in our previous work [10].

3) Generating the list C. At first, we initialize C_1 = [0, w_1]. Next, we use k GPU threads to add the item w_{i+1} to each element of the list C_i in parallel, generating a new list C_i^1 = [c_{i.1} + w_{i+1}, c_{i.2} + w_{i+1}, ..., c_{i.2^i} + w_{i+1}], where 1 ≤ i ≤ t_0 − 1. Then, we use k GPU threads to merge lists C_i and C_i^1 into a new list C_{i+1} in parallel. After t_0 − 1 iterations, we finally obtain the list C_{t_0} with 2^{t_0} subset sums, that is, the needed list C. The procedure of generating list C is described in Algorithm 3.

Algorithm 3 The Procedure of Generating List C with CUDA
Input: C_1 = [0, w_1]
Output: The list C with p subset sums
1: for i = 1 to t_0 − 1 do
2:     k = min(g_max, 2^i);
3:     for all k GPU threads do in parallel
4:         Produce a new list C_i^1 by adding the item w_{i+1} to each element of the list C_i;
5:         Merge lists C_i and C_i^1 into a new list C_{i+1};
6:     end for
7: end for

4) Generating the list B_m at the m-th node. We use k GPU threads to add the element c_m to each element of the list B in parallel, generating a new list B_m = [b_1 + c_m, b_2 + c_m, ..., b_{2^{t_2}} + c_m], where 1 ≤ m ≤ p.

After lists A and B_m have been generated at the m-th node, where 1 ≤ m ≤ p, in order to reduce the search space of each GPU thread we perform the pruning operation on lists A and B_m within each node as follows. At first, the list A is evenly divided into k blocks, where each block contains e_A = 2^{t_1}/k elements. Similarly, the list B_m is evenly divided into k blocks, where each block contains e_B = 2^{t_2}/k elements. For clarity, let A = [A_1, A_2, ..., A_i, ..., A_k] and B_m = [B_1, B_2, ..., B_j, ..., B_k], where A_i = [a_{i.1}, a_{i.2}, ..., a_{i.e_A}] and B_j = [b_{j.1}, b_{j.2}, ..., b_{j.e_B}]. Then, the block A_i and the entire list B_m are assigned to the GPU thread P_i, where 1 ≤ i ≤ k. Finally, based on the pruning rule presented in [4], we use k GPU threads to perform the pruning operation described in Algorithm 4 in parallel. Once a block pair has been picked, it is written to device memory.

Algorithm 4 The Pruning Subroutine Implemented with CUDA
Input: Lists A and B_m, each evenly divided into k blocks
Output: A solution or all the picked block pairs
1: for all GPU threads P_i do in parallel
2:     for j = 1 to k do
3:         X = a_{i.1} + b_{j.e_B};
4:         Y = a_{i.e_A} + b_{j.1};
5:         if X = M or Y = M then
6:             isFound = 1;  ▷ a solution is found
7:             stop;
8:         else if X < M and Y > M then
9:             Write (A_i, B_j) to the device memory;
10:        end if
11:    end for
12: end for

After pruning, the picked block pairs are evenly assigned to the k GPU threads, and each GPU thread performs the search subroutine of Horowitz and Sahni's two-list algorithm [1] in parallel, so as to find a solution of SSP. The detailed CUDA implementations of the pruning and search stages are presented in our previous work [10].
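For illustration, the boundary test of Algorithm 4 could be written as the following CUDA kernel; identifiers are ours, and the production version is described in [10]. Because A is nondecreasing and B_m is nonincreasing, X = a_{i.1} + b_{j.e_B} and Y = a_{i.e_A} + b_{j.1} are the minimum and maximum sums over the block pair (A_i, B_j), so the pair can contain M only if X ≤ M ≤ Y.

    #include <cuda_runtime.h>

    // Illustrative sketch of Algorithm 4: thread i tests its block Ai
    // against every block Bj of Bm using only the four boundary elements.
    __global__ void pruneBlocks(const unsigned long long* A,   // 2^t1 items
                                const unsigned long long* Bm,  // 2^t2 items
                                int eA, int eB, int k,
                                unsigned long long M,
                                int2* picked, int* numPicked,
                                int* isFound) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= k) return;
        for (int j = 0; j < k; ++j) {
            // A is nondecreasing, Bm is nonincreasing, so:
            unsigned long long X = A[i * eA] + Bm[j * eB + eB - 1]; // min sum
            unsigned long long Y = A[i * eA + eA - 1] + Bm[j * eB]; // max sum
            if (X == M || Y == M) { *isFound = 1; return; } // solution found
            if (X < M && Y > M) {
                int pos = atomicAdd(numPicked, 1);  // record pair (Ai, Bj)
                picked[pos] = make_int2(i, j);
            }
        }
    }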

IV. EXPERIMENTAL EVALUATION

In this section, we first present the experimental setup, and then evaluate the performance of the proposed hybrid MPI-CUDA implementation.

A. Experimental Setup

Our experiments are carried out on a hybrid CPU/GPU cluster with 32 computing nodes. Each node is configured with two six-core Xeon X5670 CPUs, one Tesla M2050 GPU and 32 GB of main memory. On each node, we configure one MPI process to communicate with the other nodes. The MPI version is MPICH2-1.2. The compilers used are GCC 4.4.7 and NVIDIA nvcc 5.0. Due to the exponential growth of the memory requirement, the problem size is limited by the available memory. Therefore, we test ten different problem sizes scaling from 42 to 60. For each problem size, we use a random number generator to produce 100 different instances of SSP, and we report the average execution time over the 100 instances.


[Figure 4. The Speedup of the Hybrid MPI-CUDA Implementation over the Best Sequential CPU Implementation for Different Problem Sizes under Clusters with 4-32 Nodes]

[Figure 5. The Execution Time and Speedup of the Hybrid MPI-CUDA Implementation for Different Numbers of Computing Nodes When n = 54]

In order to accurately evaluate the performance of the proposed MPI-CUDA implementation, for each instance we specify M = 0.5 \sum_{i=1}^{n} w_i and conduct the following experiments: (1) sequential CPU implementation, i.e., we run Horowitz and Sahni's sequential two-list algorithm on a single CPU; (2) single-node CPU-only implementation, i.e., we implement Li et al.'s parallel two-list algorithm on two CPUs using OpenMP; (3) single-node GPU-only implementation, i.e., we implement it on a single GPU using CUDA; (4) hybrid MPI-OpenMP implementation, i.e., we implement it on multiple nodes using MPI and OpenMP; (5) hybrid MPI-CUDA implementation, i.e., we implement it on multiple nodes using MPI and CUDA.
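For reproducibility, one way to generate such an instance is sketched below; the weight range and RNG are our assumptions, since the paper does not specify them.

    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    // Generate one random SSP instance of size n with M = 0.5 * sum(w),
    // as in the experiments above. The weight range is an assumption.
    std::pair<std::vector<uint64_t>, uint64_t> makeInstance(int n, uint32_t seed) {
        std::mt19937_64 rng(seed);
        std::uniform_int_distribution<uint64_t> dist(1, 1000000); // assumed
        std::vector<uint64_t> w(n);
        for (auto& wi : w) wi = dist(rng);
        uint64_t M = std::accumulate(w.begin(), w.end(), uint64_t{0}) / 2;
        return {w, M};
    }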

B. Results and Discussion

In this section, the performance of the hybrid MPI-CUDA implementation is compared with that of the best sequential CPU implementation, the single-node CPU-only implementation, the single-node GPU-only implementation, and the hybrid MPI-OpenMP implementation with the same cluster configuration.

1) Comparison with the Sequential CPU Implementation: Figure 4 demonstrates the speedup of the hybrid MPI-CUDA implementation over the best sequential CPU implementation for different problem sizes under clusters with 4-32 nodes. It is clear that the speedup of the MPI-CUDA implementation increases with the problem size and gradually reaches a peak. The MPI-CUDA implementation does not give a substantial speedup for small problem sizes, mainly because there is not enough work to fully utilize the available computational resources of the GPU cluster. However, once SSP grows to a reasonable scale, the MPI-CUDA implementation provides a significant speedup; for example, it achieves up to 28.76× speedup on a cluster with 32 nodes when n = 60.

Figure 5 illustrates the execution time and speedup of the hybrid MPI-CUDA implementation for different numbers of computing nodes when the problem size is fixed at 54, where the bars represent the execution time and the line represents the speedup over the sequential CPU implementation. From the figure, we can see that the speedup increases from 10.30× to 27.74× as the number of computing nodes increases from 4 to 32, with the best speedup achieved at 32 nodes. This indicates that the hybrid MPI-CUDA implementation scales well with the number of computing nodes and exhibits excellent strong scalability.

2) Comparison with the Single-Node CPU-Only / GPU-Only Implementations: Figure 6 shows the performance comparison among three different parallel implementations for different problem sizes. The bar marked "single-node CPU-only" represents the execution time of the single-node CPU-only implementation, the bar marked "single-node GPU-only" denotes the execution time of the single-node GPU-only implementation, and the bar marked "32-nodes MPI-CUDA" refers to the execution time of the hybrid MPI-CUDA implementation on a cluster with 32 nodes. The line "Cmp-CPU-only" represents the speedup of the MPI-CUDA implementation over the single-node CPU-only case, and the line "Cmp-GPU-only" refers to its speedup over the single-node GPU-only case.

As seen in Figure 6, the execution time of the hybrid MPI-CUDA implementation is reduced by an average of 85.21% compared with the single-node CPU-only case and by 79.80% compared with the single-node GPU-only case. As the problem size increases from 46 to 54, the speedup over the single-node CPU-only case increases from 6.52× to 6.94×, whereas the speedup over the single-node GPU-only case increases from 4.85× to 5.05×. The performance achieved on the cluster with 32 nodes is stable, because the speedup remains approximately constant as the problem size increases.


[Figure 6. The Performance Comparison Among Three Different Parallel Implementations for Different Problem Sizes: execution times (ms) of the single-node CPU-only, single-node GPU-only, and 32-nodes MPI-CUDA implementations for n = 46-54, with the speedup lines Cmp-CPU-only and Cmp-GPU-only.]

3) Comparison with the Hybrid MPI-OpenMP Implementation: For a fair comparison, the performance of the hybrid MPI-CUDA implementation is compared with that of the hybrid MPI-OpenMP implementation. Figure 7 shows the speedup of the hybrid MPI-CUDA implementation over the hybrid MPI-OpenMP implementation for four different cluster sizes when the problem size scales from 46 to 54. The MPI-CUDA implementation clearly outperforms the MPI-OpenMP implementation with the same cluster configuration; e.g., the speedup increases from 1.34× to 1.38× when the problem size scales from 46 to 54 on the cluster with 32 nodes.

[Figure 7. The Speedup of the Hybrid MPI-CUDA Implementation over the Hybrid MPI-OpenMP Implementation for Different Problem Sizes under Clusters with 4-32 Nodes]

V. CONCLUSION

In this paper, an efficient MPI-CUDA dual-level parallel implementation of the two-list algorithm for solving SSP on a hybrid CPU/GPU cluster is proposed. In order to allocate the most suitable workload to each node in the cluster, an effective workload distribution scheme is designed. The performance of the proposed hybrid MPI-CUDA implementation is compared with that of the best sequential CPU implementation, the single-node CPU-only implementation, the single-node GPU-only implementation, and the hybrid MPI-OpenMP implementation with the same cluster configuration. The results show that the hybrid MPI-CUDA implementation not only provides significant performance benefits but also has good scalability. Our work confirms that the two-list algorithm can be effectively parallelized using MPI and CUDA on a hybrid CPU/GPU cluster. In future work, we will explore new techniques to make full use of all the available computational resources of each node in the cluster, namely an effective method to combine CPUs and GPUs to cooperatively accelerate the solution of large-scale SSP.

ACKNOWLEDGMENT

This work was partially funded by the Key Program of the National Natural Science Foundation of China (Grant No. 61133005) and the National Natural Science Foundation of China (Grant Nos. 61070057, 61370095, 61173013, 61370098, 61350011).

REFERENCES

[1] E. Horowitz and S. Sahni, "Computing partitions with applications to the knapsack problem," Journal of the ACM, vol. 21, no. 2, pp. 277-292, 1974.
[2] D.-C. Lou and C.-C. Chang, "A parallel two-list algorithm for the knapsack problem," Parallel Computing, vol. 22, no. 14, pp. 1985-1996, 1997.
[3] C. A. A. Sanches, N. Y. Soma, and H. H. Yanasse, "An optimal and scalable parallelization of the two-list algorithm for the subset-sum problem," European Journal of Operational Research, vol. 176, no. 2, pp. 870-879, 2007.
[4] K.-L. Li, R.-F. Li, and Q.-H. Li, "Optimal parallel algorithms for the knapsack problem without memory conflicts," Journal of Computer Science and Technology, vol. 19, no. 6, pp. 760-768, 2004.
[5] V. Boyer, D. El Baz, and M. Elkihel, "Solving knapsack problems on GPU," Computers & Operations Research, vol. 39, no. 1, pp. 42-47, 2012.
[6] S. S. Bokhari, "Parallel solution of the subset-sum problem: an empirical study," Concurrency and Computation: Practice and Experience, vol. 24, no. 18, pp. 2241-2254, 2012.
[7] V. Boyer, D. El Baz, and M. Elkihel, "Dense dynamic programming on multi GPU," in Proc. PDP, 2011, pp. 545-551.
[8] J. Jaros, "Multi-GPU island-based genetic algorithm for solving the knapsack problem," in Proc. IEEE Congress on Evolutionary Computation (CEC), 2012, pp. 1-8.
[9] S. G. Akl and N. Santoro, "Optimal parallel merging and sorting without memory conflicts," IEEE Transactions on Computers, vol. C-36, no. 11, pp. 1367-1369, 1987.
[10] L. Wan, K. Li, J. Liu, and K. Li, "GPU implementation of a parallel two-list algorithm for the subset-sum problem," Concurrency and Computation: Practice and Experience, 2014, doi: 10.1002/cpe.3201.

