A GPU parallelization of Branch-and-Bound for Multiproduct Batch Plants Optimization

Andrey Borisenko · Michael Haidl · Sergei Gorlatch
Abstract Branch-and-Bound (B&B) is a popular approach to accelerating the solution of optimization problems, but its parallelization on Graphics Processing Units (GPUs) is challenging because of B&B's irregular data structures and poor computation/communication ratio. The contributions of this paper are as follows: 1) we develop two CUDA-based implementations (iterative and recursive) of B&B on systems with GPUs for a practical application scenario, the optimal design of multi-product batch plants, with a particular example of a Chemical-Engineering System (CES); 2) we propose and implement several optimizations of our CUDA code by reducing branch divergence and by exploiting the properties of the GPU memory hierarchy; and 3) we evaluate our implementations and their optimizations on a modern GPU-based system and report our experimental results.

Keywords GPU computing · CUDA · parallel branch-and-bound · combinatorial optimization · multi-product batch plant design

This work was partially supported by the Deutsche Forschungsgemeinschaft (DFG), Cells-in-Motion Cluster of Excellence (EXC 1003 – CiM), University of Muenster, Germany. Andrey Borisenko was supported by the DAAD (German Academic Exchange Service) and by the Ministry of Education and Science of the Russian Federation under the "Mikhail Lomonosov II" Programme.

Andrey Borisenko
Tambov State Technical University, Russia
Tel.: +7-4752-630706
Fax: +7-4752-631813
E-mail: [email protected]

Michael Haidl, Sergei Gorlatch
University of Muenster, Germany
Tel.: +49-251-8332740
Fax: +49-251-8332742
E-mail: [email protected]
E-mail: [email protected]
1 Motivation and Related Work

Combinatorial optimization [8], when applied in practical cases, is often very time-consuming due to the "combinatorial explosion": the number of combinations to be examined grows exponentially, such that even the fastest supercomputers would require an intolerable amount of time. A common remedy is to formulate a mixed-integer nonlinear programming (MINLP) model [6, 14] and to exploit the Branch-and-Bound (B&B) technique for solving it. In B&B, the search space is represented as a tree whose root is the original problem, whose internal nodes are partially solved subproblems, and whose leaves are the potential solutions. B&B proceeds in several iterations in which the best solution found so far (the upper bound) is progressively improved: a bounding mechanism is used to eliminate the subproblems that cannot lead to a better solution and to cut their corresponding subtrees. This reduces the size of the explored search space, which can nevertheless still be very large in practice, so that acceleration, for example by parallel computing, is required.

Parallelization of B&B has been studied extensively, recently with a focus on systems comprising Graphics Processing Units (GPUs); usually the most time-consuming part, the bounding mechanism, is addressed [9]. The main difficulties of B&B on GPUs are its irregular data structures, which are not well suited for GPU computing, and its low computation/communication ratio. In [3], a hybrid implementation of B&B for the knapsack problem demonstrates that for small problem sizes it is not efficient to launch the B&B computation kernels on the GPU. A parallel CUDA implementation in [4] makes use of data compression. In [10], a hybrid CPU-GPU implementation is presented, and [15] studies the design of a parallel B&B for large-scale heterogeneous compute environments with multiple shared-memory cores, multiple distributed CPUs, and GPU devices.

In this paper, we parallelize B&B and illustrate it with a practical application: the optimal selection of equipment for multi-product batch plants [11]. We develop and evaluate an implementation of B&B on a CPU-GPU system using the CUDA programming environment [13] in two versions, an iterative and a recursive one; we describe their optimizations and compare the versions to each other. We report experimental results on the speedup of our GPU-based implementations as compared to the sequential CPU version.
2 Application Case Study

Our application use case is the optimization of a Chemical-Engineering System (CES). A CES is a set of equipment (reactors, tanks, filters, dryers, etc.) that implements a sequence of $I$ processing stages; the $i$-th stage can be equipped with equipment units from a finite set $X_i$, with $J_i$ being the number of equipment unit variants in $X_i$. All equipment unit variants of a CES are described as $X_i = \{x_{i,j}\}$, $i = \overline{1,I}$, $j = \overline{1,J_i}$, where $x_{i,j}$ is the main size $j$ (working volume, working surface) of the unit suitable for processing stage $i$.
A CES variant $\Omega_e$, $e = \overline{1,E}$ (where $E = \prod_{i=1}^{I} J_i$ is the number of all possible system variants) is an ordered set of equipment unit variants, selected from the respective sets. Each variant $\Omega_e$ of a system must be in an operable condition (compatibility constraint), i.e., it must satisfy the conditions of a joint action for all processing stages: $S(\Omega_e) = 0$. An operable CES variant must run at a given production rate in a given period of time (processing time constraint), i.e., it satisfies the restriction on the duration of its operating period: $T(\Omega_e) \leq T_{max}$. The design of an optimal CES can be formulated as the following optimization problem [1]: to find a variant $\Omega^* \in \{\Omega_e\}$, $e = \overline{1,E}$ of a CES, where the optimality criterion – equipment costs $Cost(\Omega_e)$ – reaches a minimum, and the compatibility and the processing time constraints are satisfied:

$$\Omega^* = \operatorname{argmin} Cost(\Omega_e), \quad e = \overline{1,E} \qquad (1)$$

$$\Omega_e = \{x_{1,j_1}, x_{2,j_2}, \ldots, x_{I,j_I} \mid j_i = \overline{1,J_i},\ i = \overline{1,I}\}, \quad e = \overline{1,E} \qquad (2)$$

$$x_{i,j} \in X_i, \quad i = \overline{1,I}, \quad j = \overline{1,J_i} \qquad (3)$$

$$S(\Omega_e) = 0, \quad e = \overline{1,E} \qquad (4)$$

$$T(\Omega_e) \leq T_{max}, \quad e = \overline{1,E} \qquad (5)$$
Figure 1 shows the search space as a tree: all possible variants of a CES with $I$ stages are represented by a tree of height $I$. Each tree level corresponds to one processing stage of the CES, and each edge corresponds to a selected equipment variant from $X_i$, the set of possible variants at stage $i$. For example, the edges leaving level 0 correspond to the elements of $X_1$. Each node $n_{i,k}$ of the tree layer $N_i = \{n_{i,1}, n_{i,2}, \ldots, n_{i,K_i}\}$, $i = \overline{1,I}$, $k = \overline{1,K_i}$, $K_i = \prod_{l=1}^{i} J_l$, corresponds to a variant of a beginning part of the CES, composed of equipment units for stages 1 to $i$. Each path from the root to one of the leaves thus represents a complete variant of the CES.
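For instance, with $I = 3$ stages and $(J_1, J_2, J_3) = (2, 3, 2)$, the layers contain $K_1 = 2$, $K_2 = 6$ and $K_3 = 12$ nodes, the last value being the total number of system variants $E$. The running product can be computed as in this small (hypothetical) helper:

/* K_i = J_1 * ... * J_i, the number of nodes on layer i (1-based);
   for i = I this equals E, the number of complete CES variants      */
long nodes_on_layer(const int *J, int i) {
    long K = 1;
    for (int l = 0; l < i; l++)
        K *= J[l];
    return K;
}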
Fig. 1 The search tree and its depth-first traversal.
To enumerate all possible variants of a CES, a depth-first traversal of the tree is performed as shown in Figure 1: the process continues recursively for all valid beginning parts that result from appending the device variants of the current level to the valid beginning parts of the previous levels. When a leaf is reached, the recursion stops and the new solution is compared to the current optimal solution, possibly replacing it. A complete tree traversal (selecting a device at each edge) with checking the constraints (Equations 4 and 5) requires considerable computational effort. For example, for a CES consisting of 16 stages where each processing stage can be equipped with devices of 5 to 12 sizes [2], the number of choices is $5^{16}$ to $12^{16}$ (approximately $10^{11}$ to $10^{17}$). Hence, performing an exhaustive search (brute force) for finding a global optimum is usually impractical when performed sequentially on the CPU.

3 Parallelization for GPU

Figure 2 illustrates our strategy of dividing the search tree into subtrees for parallel processing: the sequential host process on the CPU dispatches computations to multiple threads on the GPU and then gathers the results from these threads. The tree-like organization of B&B provides a potential for parallelization, as all branches of the tree can be processed simultaneously. Both the sequential part and, later, each GPU thread perform essentially the same depth-first enumeration with pruning, as the following sketch illustrates.
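The sketch below is a minimal C illustration of this depth-first enumeration, reusing the (hypothetical) declarations from the Section 2 sketch; PartialS, which validates a beginning part of stages 0 to depth, is likewise an assumed stub, not the exact code of our implementation:

#include <float.h>

#define I_STAGES 16
typedef struct { int j[I_STAGES]; } Variant;  /* as in the Section 2 sketch */

extern int    J[I_STAGES];
extern double Tmax;
int    PartialS(const Variant *v, int depth); /* compatibility of a beginning part */
double T(const Variant *v);
double Cost(const Variant *v);

double  minCost = DBL_MAX;     /* best upper bound found so far */
Variant best;                  /* corresponding best variant    */

void Enumerate(Variant *v, int depth) {
    if (depth == I_STAGES) {                   /* leaf: a complete variant */
        double c = Cost(v);
        if (T(v) <= Tmax && c < minCost) {     /* constraint (5) and bound */
            minCost = c;
            best    = *v;
        }
        return;
    }
    for (int k = 0; k < J[depth]; k++) {
        v->j[depth] = k;
        if (PartialS(v, depth) != 0)           /* constraint (4): cut the    */
            continue;                          /* subtree of an invalid part */
        Enumerate(v, depth + 1);
    }
}

Calling Enumerate on the root with depth 0 traverses the whole tree; the parallel version described next runs the same procedure on each subtree independently.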
Fig. 2 Dividing the search tree into subtrees for parallel processing.
All nodes $N_i = \{n_{i,1}, n_{i,2}, \ldots, n_{i,K_i}\}$, $i = \overline{G+1,I}$, $k = \overline{1,K_i}$, $K_i = \prod_{l=1}^{i} J_l$ at the layers below level $G$ are traversed in independent threads. The total number of threads is $N_{threads} = K_G = \prod_{l=1}^{G} J_l$, $1 \leq G \leq I$. The granularity parameter $G$ limits the number of threads: each subtree below the granularity level is processed by one thread on the GPU. E.g., Figure 2 corresponds to an example CES consisting of 4 stages ($I = 4$) where each stage can be equipped with 2 devices ($J_1 = J_2 = J_3 = J_4 = 2$); the number of possible system variants is $2^4 = 16$. We use granularity $G = 2$, so the number of threads is $2^2 = 4$. All nodes at layers 0 to 2 are processed on the CPU; the partial solutions are then transferred to the GPU, and all nodes at layers 3 and 4 are processed in parallel on the GPU.
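Each GPU thread can recover its level-$G$ beginning part directly from its global thread index via mixed-radix decoding with the radices $J_1, \ldots, J_G$. The following CUDA fragment is a sketch under this assumption; the name FindSolutionSketch and the result layout (one cost slot per thread) are illustrative and not the exact kernel of Listing 1:

#define I_MAX 16                           /* assumed upper bound on I */

__global__ void FindSolutionSketch(const int *J, int G, int numThreads,
                                   double *Cost /* one slot per thread */) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numThreads) return;         /* guard the padded last block */

    /* decode tid into the equipment choices of stages 0 .. G-1
       (mixed-radix digits with radices J[0], ..., J[G-1])       */
    int choice[I_MAX];
    int rest = tid;
    for (int i = G - 1; i >= 0; i--) {
        choice[i] = rest % J[i];
        rest     /= J[i];
    }
    /* ... continue the depth-first enumeration over stages G .. I-1
       of this subtree and write the best found cost to Cost[tid] ... */
}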
3.0.1 Host Code

The host (see Listing 1) starts its work by loading the input data from a file by calling ReadInputData(). We use the CUDA Runtime API [13].
Listing 1 Host code of the B&B implementation.

main() {
  /* prepare all necessary data */
  ReadInputData(inData);
  /* number of threads */
  numThreads = ThreadsNumber(inData, G);
  /* prepare all operational data */
  PrepOperationalData(oprData, inData, numThreads);
  /* start tree traversal for dividing the tree into subtrees
     and creating beginning parts of the CES */
  EnumerateHost(0, 0);
  /* send all necessary data to the device */
  cudaMemcpyHtoD(inData, oprData, W, G);
  /* define parameters for the kernel launch */
  blocksPerGrid   = numThreads / MAX_THREADS_PER_BLOCK + 1;
  threadsPerBlock = MAX_THREADS_PER_BLOCK;
  /* start the kernel function on the device */
  FindSolution<<<blocksPerGrid, threadsPerBlock>>>(G);
  /* synchronize the device */
  cudaDeviceSynchronize();
  /* copy the results from device to host */
  cudaMemcpyDtoH(W, Cost, minCost);
  /* find the global optimal solution among the per-thread results */
  for (n = 1; n < numThreads; n++)
    if (Cost[n] < minCost)
      minCost = Cost[n];
}
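Note that computing blocksPerGrid as numThreads / MAX_THREADS_PER_BLOCK + 1 launches one superfluous (empty) block whenever numThreads is an exact multiple of the block size, which the in-kernel thread-index guard must then skip. A common alternative is the rounding-up idiom (a minor variant, not the authors' code):

/* round up without a superfluous block for exact multiples */
blocksPerGrid = (numThreads + MAX_THREADS_PER_BLOCK - 1) / MAX_THREADS_PER_BLOCK;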