Solving Combinatorial Problems on HPC with bobpp

Tarek Menouer, Bertrand Le Cun, Pascal Vander-Swalmen
University of Versailles Saint-Quentin-en-Yvelines
Email: [email protected]
Abstract—This paper presents a high-level framework for High Performance Computing named bobpp¹. The aim of bobpp is to provide an interface between combinatorial optimization problems and parallel computers. Specifically, bobpp is designed to propose, on the one hand, abstract tree search algorithms which are widely used for exact solving in Combinatorial Optimization, and on the other hand, several parallelizations of these tree search algorithms according to the architecture of the machine.
I. Introduction

The parallel resolution of Combinatorial Optimization problems has been widely studied in the literature. These problems belong to the NP-Hard complexity class and require an exponential solving time in the worst case. Hence, after having designed the best sequential algorithm to solve a problem on existing computers, it is natural to reduce the computation time by using a parallel machine. Parallel tree search algorithms have been widely studied [?], [?]. Several software frameworks have been proposed for solving Combinatorial Optimization problems. Like bobpp [?], they establish the interface between the users and the parallel machines. These tools include BCP [?], PEBBL [?], PICO [?], ALPS [?], [?], [?], Bob [?], [?], PUBB [?], [?], and PPBB [?]. However, all of them deal only with Branch and Bound like algorithms: Branch and Bound, Branch and Price, and Branch and Cut. Furthermore, they propose only one parallelization, suitable for one type of machine and generally one kind of application. bobpp proposes different tree search algorithms (Branch and X, Divide and Conquer), making it possible to solve Quadratic Assignment Problems (QAP) with a custom lower bound [?], [?] and Mixed Integer Problems [?]. Recently, the Google or-tools library [?] was used with bobpp to obtain parallel Constraint Programming solvers, together with different parallelizations based on pthreads, mpi or kaapi [?].

Section II presents the other frameworks in more detail. Section III describes the problems bobpp is able to solve and the architectures it can use; it details how the framework solves the problems and parallelizes the tree search, and briefly explains how to use bobpp. Section IV shows some experiments made with this framework for different kinds of problems solved on two types of parallel computers. Finally, section V gives some perspectives for bobpp.
II. Related works

Several frameworks for Combinatorial Optimization problems have been proposed in the literature. They may be classified according to two major criteria: 1) the node search algorithm involved in the search process; these algorithms include Divide and Conquer (D&C), Branch-and-Bound (B&B) and its derived algorithms Branch and Price and Branch and Cut, and A*; 2) the programming environment used to implement the parallelization: pthreads, openmp, mpi, pvm, etc. Many of the available parallel search frameworks are specialized both in the algorithm they implement and in a specific programming environment and/or parallel architecture. For example, the first such framework, PPBB [?], proposed a Branch and Bound interface parallelized on a distributed architecture using pvm. BCP [?] is an implementation of the Branch-and-Price-and-Cut algorithm, which runs only with mpi. Some other frameworks tend to diversify the proposed functionalities. SYMPHONY [?], for example, solves Mixed-Integer Programming (MIP) problems using pvm for distributed memory machines or openmp for shared memory machines. ALPS [?], [?], [?], which is in some way a successor of SYMPHONY [?] and BCP [?], generalizes the node search to any tree search, which of course enables the B&B search, among others. However, the only available programming environment for ALPS [?], [?] is mpi. In a similar manner, PEBBL [?] integrates the B&B search from PICO [?], allowing the implementation of a larger variety of solvers than MIP solvers. PEBBL and PICO are part of the ACRO project [?], while ALPS, BCP and SYMPHONY are part of the COIN-OR project [?].

bobpp provides several search algorithm classes while being able to use different parallelization methods. The aim is to propose a single framework for most classes of Combinatorial Optimization problems, which can be solved on as many different parallel architectures as possible. Figure 1 shows how bobpp interfaces between high-level applications (QAP, TSP, ...) and different parallel architectures using several parallel programming environments. bobpp has been developed in C++ and proposes a C++ API composed of basic classes which are extendable by the user.

¹This work is funded by the ANR HORUS project and the pajero OSEO-ISI project. bobpp sources are available at http://forge.prism.uvsq.fr/projects/bobpp

Fig. 1. The bobpp framework: user algorithms and external solvers (Constraint Problems, QAP, TSP, VRP, ...) on top of bobpp, which targets several parallel programming environments (Sequential, Posix threads, Athapascan, MPI).
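The separation shown in Figure 1 can be illustrated with a small sketch. This is a hypothetical illustration, not the real bobpp API: the user algorithm is written once against an abstract "environment", and the environment chosen at instantiation time decides how the work is actually executed (sequentially, with Posix threads, ...).

```cpp
#include <thread>
#include <vector>

// Hypothetical environment types -- names invented for illustration only.
struct SeqEnv {
    template <typename F> void run(int n, F f) {
        for (int i = 0; i < n; ++i) f(i);        // one worker, in order
    }
};

struct ThreadEnv {
    template <typename F> void run(int n, F f) {
        std::vector<std::thread> workers;
        for (int i = 0; i < n; ++i) workers.emplace_back(f, i);
        for (auto& w : workers) w.join();        // one thread per work item
    }
};

// The "user algorithm" is written only once, for any environment.
template <typename Env>
long count_work(Env& env, int n) {
    std::vector<long> partial(n, 0);
    env.run(n, [&](int i) { partial[i] = i + 1; });
    long total = 0;
    for (long p : partial) total += p;           // same result in both cases
    return total;
}
```

With `SeqEnv` or `ThreadEnv`, `count_work(env, 4)` returns the same value; only the execution strategy differs. This mirrors the intent of figure 1, where the same user algorithm targets the Sequential, Posix threads, Athapascan or MPI back-ends.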
III. bobpp framework

A. Functionalities for solving combinatorial problems

Minimal functionalities are required to solve combinatorial problems. The algorithm has to be able to explore a search-tree in different ways (depth-first, best-first, width-first searches, etc.). The goal of the search may be to get the first solution, the best solution, the number of best solutions, etc. Together, these functionalities make it possible to write code solving numerous problems such as the Vehicle Routing Problem (VRP), the QAP, the Knapsack Problem, the N-queens problem, etc.
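These exploration orders can all be seen as different orderings of the pool of pending nodes. The following sketch (illustrative C++, not bobpp code; the names are invented) shows that a single pool based on `std::priority_queue` yields a depth-first-like order when sorted by depth and a best-first order when sorted by evaluation:

```cpp
#include <functional>
#include <queue>
#include <vector>

// A pending node: its depth in the search-tree and its evaluation (bound).
struct SNode { int depth; double eval; };
using Before = std::function<bool(const SNode&, const SNode&)>;

// Pops every node and returns the evaluations in pop order. The exploration
// strategy is entirely encoded in the comparator given to the pool.
std::vector<double> pop_order(std::vector<SNode> nodes, Before cmp) {
    std::priority_queue<SNode, std::vector<SNode>, Before> pq(cmp);
    for (const auto& n : nodes) pq.push(n);
    std::vector<double> order;
    while (!pq.empty()) { order.push_back(pq.top().eval); pq.pop(); }
    return order;
}
```

With the comparator `a.depth < b.depth` the deepest node is popped first (a depth-first flavour); with `a.eval > b.eval` the node with the smallest evaluation comes first (best-first, for minimization).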
B. The interface

bobpp manages the search on the tree, which is the common structure of these problems. Several objects and methods have to be defined to perform the search. The provided classes are:
• Class Instance: stores all the global data used during the search;
• Class Node: represents a node of the search-tree;
• Class Genchild: contains the method generating the children of a node;
• Class Algo: the algorithm to use;
• Class Goal: the stop criterion of the algorithm;
• Class Stat: the statistics of the execution.
These objects and methods must be adapted to the problem and chosen according to the algorithm used to solve it (B&B, D&C, etc.). For example, the child generation for the QAP implements the branching operation and the evaluation operation known as the Gilmore and Lawler lower bound [?], [?]. For the VRP, however, the child generation method must call an evaluation function based on column generation, using an external linear solver such as glpk, CPLEX or Gurobi. Some examples, using different strategies, are given in the source code of bobpp.

C. The parallel search

To explore the search-tree on parallel computers, the main issue is load-balancing. Furthermore, bobpp must be able to explore the search-space according to the user configuration. The idea is to provide a pool of nodes from which the threads pick nodes up. Since this pool is sometimes sorted by a user criterion, this object is called the PQ (short for Priority Queue) in bobpp. Once a thread has removed a node from a PQ, bobpp applies the child generation method to it; all the resulting nodes are then inserted into a PQ. This mechanism is shown in figure 2. Threads have no knowledge of the path between the root and the node they are computing; their only task is to compute children (building a local subtree). To remove the bottleneck caused by too many threads contending on the same PQ, bobpp can manage several PQs; in that case an explicit load balancing is done between them. The best number of PQs for a given run depends on the nature of the problem: some problems develop a small tree with "difficult" nodes (in terms of computation time or of the memory required to store them), while others develop huge trees of very "small" nodes. From the interface point of view, the user chooses the desired parallel environment by calling the appropriate bobpp classes or methods in his code.

Fig. 2. The work of one thread in bobpp: remove a node from the Priority Queue, apply child generation, and insert the children back.
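The loop of figure 2 can be sketched as follows. This is a simplified, self-contained illustration rather than the bobpp implementation (class and function names are invented, and a single mutex-protected pool stands in for bobpp's PQs): each worker repeatedly removes a node from the shared pool, applies child generation, and pushes the resulting children back until the whole tree has been explored.

```cpp
#include <atomic>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

struct Node { int depth; };

// A shared pool of pending nodes ("PQ"); here ordered by depth, deepest first.
class PQ {
    std::mutex m_;
    std::priority_queue<int> q_;
public:
    void push(Node n) { std::lock_guard<std::mutex> l(m_); q_.push(n.depth); }
    std::optional<Node> pop() {
        std::lock_guard<std::mutex> l(m_);
        if (q_.empty()) return std::nullopt;
        Node n{q_.top()}; q_.pop(); return n;
    }
};

// Explores a full binary tree of the given depth with n_threads workers and
// returns the number of nodes processed. "Child generation" here is just
// binary branching; a real solver would evaluate and prune the children.
long run_search(int max_depth, int n_threads) {
    PQ pq;
    std::atomic<long> explored{0};
    std::atomic<int> pending{1};               // nodes pushed but not yet done
    pq.push(Node{0});
    auto worker = [&] {
        for (;;) {
            std::optional<Node> n = pq.pop();
            if (!n) {
                if (pending.load() == 0) return;  // tree fully explored
                std::this_thread::yield();        // others may still push
                continue;
            }
            explored++;
            if (n->depth < max_depth)             // child generation step
                for (int c = 0; c < 2; ++c) { pending++; pq.push(Node{n->depth + 1}); }
            pending--;                            // this node is done
        }
    };
    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) workers.emplace_back(worker);
    for (auto& t : workers) t.join();
    return explored.load();
}
```

`run_search(10, 4)` processes all 2047 nodes of the depth-10 binary tree whatever the interleaving; the `pending` counter performs the termination detection that any such scheme needs before threads may stop.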
IV. Experiments

A. Protocol

The computers used to benchmark bobpp are a bi-processor Intel Xeon X5650 (2.67 GHz) with Hyper-Threading (12 physical cores, 24 hardware threads) and 48 GB of RAM, and a quad-processor AMD Opteron 6176 (2.3 GHz, 48 cores) with 256 GB of RAM. Two kinds of problems were solved: the QAP and the Golomb ruler.

B. Normal use of bobpp

For this benchmark, the QAP was directly modeled as a user algorithm in bobpp. The best solution is required, so the critical tree is explored to prove that the solution found is the best one. In the benchmarks shown here, there is no search anomaly [?], [?]. Three instances were solved (the mean sequential solving time is 1704 seconds on the Intel processor and 2416 seconds on the AMD one). Each instance was solved twice. The mean size of the trees is 74,070,656 nodes, and 6 PQs were used for all runs except the sequential run and the 4-thread run, which used only 4 PQs.

Fig. 3. Mean computation time and speedup for the QAP on Intel.

The mean computation time and the speedup according to the number of threads on the Intel-based computer are given in figure 3. The reader may notice the break in the curves beginning at 12 threads. This is due to the Hyper-Threading technology: on the one hand the QAP node size is very small, and on the other hand the Intel processor has a large cache (12 MB). bobpp gives enough work to each thread, so the cores do not need to fetch data from RAM and Hyper-Threading is no longer useful. Using 12 threads on 12 cores, bobpp divides the computation time by 9.3, but using 24 threads on 12 cores the speedup is only 11.6.

Fig. 4. Mean computation time and speedup for the QAP on AMD.

The mean computation time and the speedup according to the number of threads on the AMD-based computer are given in figure 4. The break due to the Hyper-Threading technology is not visible here; the curve is smooth except at the end. If all the cores of this computer are used, performance falls: main memory is accessed too heavily by the threads because of their small cache (512 KB per core).

C. Using an external solver

The bobpp framework is able to parallelize an external solver. As an example, the or-tools solver, a Constraint Programming solver developed by Google, has been parallelized. It provides a library to model and solve satisfaction or optimization problems in Constraint Programming using exact methods. Like bobpp, or-tools develops a tree to find a solution, which is why parallelizing this search was interesting. To evaluate the performance of bobpp with this external solver, some experiments were performed with the Golomb ruler problem. A Golomb ruler is a set of marks at different positions such that all pairwise distances between marks are distinct. This is an optimization problem. For this benchmark, the Golomb ruler is modeled within or-tools, and bobpp handles the roots of several subtrees to share among different runs of or-tools. All tests were run at least three times on the problem of size 13. The tree developed by or-tools contains 72,431,530 nodes; the number of nodes managed by bobpp is 77,757.
Fig. 5. Execution time and speedup to solve the Golomb-13.
The mean computation time and the speedup according to the number of threads on the AMD-based computer are given in figure 5. The reader may notice a large gap between the speedups obtained for this problem and those obtained for the QAP. The main difference lies in the amount of computation done in each node managed by bobpp. Indeed, without total control over the external solver, it was necessary to compute as many nodes of the search tree as possible within a single node managed by bobpp. Here, two levels of nodes exist: a node for bobpp is a subtree for or-tools. This behavior increases locality during the computation and decreases the threads' need for the PQ. Thus, the load-balancing is made explicit when using bobpp with or-tools. To validate the load-balancing efficiency on 48 cores, figure 6 gives two lines: the first one is the computation time for each thread, and the second one shows the number of nodes developed in or-tools by each thread. The load-balancing is good since all the threads receive almost the same amount of work and compute for approximately the same time. Only one thread computes for more than 248 seconds, while the other threads compute between 215 and 248 seconds and manage between 1,274,277 and 1,957,476 or-tools nodes. The mean computation time is 225 seconds and the mean number of nodes per thread is 1,508,990.

Fig. 6. Load balancing on 48 threads solving the Golomb-13: computation time and number of or-tools nodes per thread.
V. Conclusion and Perspectives

bobpp achieves good speedups on shared memory architectures. These results are obtained with several types of combinatorial problems on different computers, thanks to the abstraction provided by bobpp. When bobpp retains good locality, speedups are very good. That is why, among future works, bobpp will provide a criterion to force local treatment without using the PQs. In the near future bobpp will parallelize algorithms on distributed memory architectures. The prototype with mpi is ready but not yet reliable enough for us to report on this part. bobpp will also be tested in a heterogeneous context: distributed and shared memory HPCs. From a user algorithm point of view, bobpp is able to solve problems based on Linear Programming using an external solver. The interface currently used is obsolete, so the Open Solver Interface (OSI) will be integrated into bobpp in the coming months.

References
[1] T. Crainic, B. Le Cun, and C. Roucairol, Parallel Branch and Bound Algorithms. USA: John Wiley and Sons, 2006, ch. 1, pp. 1–28.
[2] B. Gendron and T. G. Crainic, "Parallel branch-and-bound algorithms: Survey and synthesis," Operations Research, vol. 42, no. 6, pp. 1042–1066, 1994.
[3] "Bob++: Framework to solve Combinatorial Optimization Problems," http://forge.prism.uvsq.fr/projects/bobpp.
[4] M. J. Saltzman, COIN-OR: An Open Source Library for Optimization. Boston: Kluwer, 2002.
[5] J. Eckstein, C. A. Phillips, and W. E. Hart, "PEBBL 1.0 User Guide," RUTCOR, RRR 19-2006, August 2006.
[6] J. Eckstein, C. A. Phillips, and W. E. Hart, "PICO: An object-oriented framework for parallel branch and bound," in Proceedings of the Workshop on Inherently Parallel Algorithms in Optimization and Feasibility and their Applications, ser. Studies in Computational Mathematics, Elsevier Scientific, Ed., 2001, pp. 219–265.
[7] Y. Xu, T. Ralphs, L. Ladányi, and M. Saltzman, "ALPS: A Framework for Implementing Parallel Search Algorithms," in Proceedings of the Ninth INFORMS Computing Society Conference, 2005.
[8] T. Ralphs, L. Ladányi, and M. Saltzman, "A Library Hierarchy for Implementing Scalable Parallel Search Algorithms," The Journal of Supercomputing, vol. 28, no. 2, pp. 215–234, May 2004.
[9] T. K. Ralphs, L. Ladányi, and M. Saltzman, "Parallel Branch, Cut, and Price for Large-scale Discrete Optimization," Mathematical Programming, vol. 98, no. 253, 2003.
[10] B. Le Cun, C. Roucairol, and the PNN team, "Bob: a unified platform for implementing branch-and-bound like algorithms," Laboratoire PRiSM, Université de Versailles - Saint Quentin en Yvelines, RR 95/16, Sep. 1995.
[11] V.-D. Cung, S. Dowaji, B. Le Cun, T. Mautor, and C. Roucairol, "Concurrent data structures and load balancing strategies for parallel branch-and-bound / A* algorithms," in III Annual Implementation Challenge Workshop, DIMACS, New Brunswick, USA, Oct. 1994.
[12] Y. Shinano, M. Higaki, and R. Hirabayashi, "Generalized utility for parallel branch-and-bound algorithms," in Proceedings of the 1995 Seventh Symposium on Parallel and Distributed Processing, no. 392. Los Alamitos, CA: IEEE Computer Society Press, 1995.
[13] Y. Shinano, M. Higaki, and R. Hirabayashi, "An Interface Design for General Parallel Branch-and-Bound Algorithms," in Workshop on Parallel Algorithms for Irregularly Structured Problems, 1996, pp. 277–284.
[14] S. Tschoke and T. Polzer, "Portable parallel branch-and-bound library user manual, library version 2.0," Department of Computer Science, University of Paderborn, Tech. Rep., 1996.
[15] A. Djerrah, "Résolution exacte d'un problème d'optimisation combinatoire NP-difficile sur grilles de machines," Ph.D. dissertation, Université de Versailles-Saint-Quentin, July 2006, in French.
[16] A. Djerrah, S. Jafar, V.-D. Cung, and P. Hahn, "Solving QAP on clusters with a bound of reformulation linearization techniques," in Proceedings of the 17th IMACS World Congress on Scientific Computation, Applied Mathematics and Simulation (IMACS 2005), Paris, France, July 2005.
[17] C. Louat, "Étude et mise en œuvre de stratégies de coupes efficaces pour des problèmes entiers mixtes 0-1," Ph.D. dissertation, Université de Versailles-Saint-Quentin, Jan. 2009, in French.
[18] "or-tools: Operations research tools developed at Google," http://code.google.com/p/or-tools/.
[19] "Kaapi: a library for high performance parallel computing based on an abstract representation to adapt the computation to the available computing resources," http://kaapi.gforge.inria.fr/.
[20] T. Ralphs and M. Guzelsoy, "The SYMPHONY Callable Library for Mixed Integer Programming," in Proceedings of the Ninth INFORMS Computing Society Conference, 2005.
[21] "ACRO: A common repository for optimizers," https://software.sandia.gov/trac/acro/.
[22] "COIN-OR: Computational infrastructure for operations research," http://www.coin-or.org/.
[23] P. Gilmore, "Optimal and suboptimal algorithms for the quadratic assignment problem," SIAM Journal on Applied Mathematics, vol. 10, pp. 305–313, 1962.
[24] E. Lawler, "The quadratic assignment problem," Management Science, vol. 9, pp. 586–599, 1963.
[25] G. Li and B. Wah, "Coping with anomalies in parallel branch-and-bound," IEEE Transactions on Computers, vol. C-35, no. 6, pp. 568–573, June 1986.
[26] M. J. Quinn and N. Deo, "An upper bound for the speedup of parallel best-bound branch-and-bound algorithms," BIT, vol. 26, pp. 35–43, 1986.