A parallel interior-point algorithm for linear programming on a shared memory machine

Erling D. Andersen*    Knud D. Andersen†

January 23, 1998

Abstract
The XPRESS^1 interior point optimizer is an "industrial strength" code for the solution of large-scale sparse linear programs. The purpose of the present paper is to discuss how the XPRESS interior point optimizer has been parallelized for a Silicon Graphics multiprocessor computer. The major computational task performed in each iteration of the interior-point method implemented in the XPRESS interior point optimizer is the solution of a symmetric and positive definite system of linear equations. Therefore, the parallelization of the Cholesky decomposition and of the triangular solve procedure is discussed in detail. Finally, computational results are presented to demonstrate the parallel efficiency of the optimizer. It should be emphasized that the methods discussed can also be applied to the solution of large-scale sparse linear least squares problems.

Acknowledgment: We appreciate the comments made by an anonymous referee appointed by CORE, which helped us to improve the manuscript. This work was carried out while the second author held an EC TMR fellowship at CORE. The second author would also like to thank Laurence Wolsey for making the fellowship possible.
Key words: linear programming, interior-point methods, parallel computing.
* Department of Management, Odense University, DK-5230 Odense M, Denmark. E-mail: [email protected].
† CORE, Catholic University of Louvain, 34 Voie Roman du Pays, B-1348 Louvain-la-Neuve, Belgium. E-mail: [email protected].
^1 Available from Dash Associates, see http://www.dash.co.uk/
1 Introduction

Linear programming (LP) is an important tool in many areas of science, but unfortunately LPs often become very large in terms of the number of constraints and variables when they have to represent the underlying problem accurately. Hence, efficient methods for the solution of large-scale LPs are required. Moreover, the methods should be able to exploit parallel computers efficiently, because parallel computers now offer a lot of computing power cheaply.

The two most popular methods for the solution of general LPs are the classical simplex method and the more recent interior-point methods. However, recent computational results presented in [2, 8, 24] indicate that for many medium to large-scale LPs the interior-point methods are significantly more efficient than the simplex method. Therefore, in this paper we restrict our attention to the parallelization of an interior-point method.

Several researchers have already considered the parallelization of an interior-point algorithm. For general problems, parallel implementations of an interior-point algorithm for a distributed memory computer are presented in [7, 20, 18, 19], although the results obtained are less promising. Furthermore, the methods described in [7, 20, 18, 19] do not employ state-of-the-art algorithms such as the predictor-corrector primal-dual algorithm, see [3]. Hence, the sequential performance of the authors' codes is likely to be substandard when compared to the best implementations of an interior-point LP algorithm. A recent implementation of an interior-point LP optimizer for a distributed memory machine is presented in [9]. Lustig and Rothberg [25] discuss parallelization of the CPLEX LP barrier code for a Silicon Graphics shared memory multi-processor computer (SGI-MP). They report good computational results for LP problems with a fairly dense Cholesky factorization. Parallelization of interior-point methods for special classes of LP problems has been considered in [22, 16, 30], which address the solution of multicommodity and stochastic programming problems.

The work presented in this paper is in the same spirit as the work presented in [25]; that is, we discuss parallelization of an existing and highly efficient interior-point LP solver. Moreover, we are targeting general LP problems and the same computer architecture, i.e. the SGI-MP. However, in contrast to [25] we present the methods employed to parallelize the code in detail. Another difference is that a different interior-point algorithm is employed here, namely the homogeneous algorithm rather than the primal-dual algorithm employed in [25]. It should be mentioned, though, that the tasks that have to be parallelized within the primal-dual and the homogeneous algorithms are identical.

The outline of the paper is as follows. In Section 2 we briefly review the interior-point algorithm implemented in the XPRESS interior point optimizer and identify the computationally intensive parts that have to be parallelized. In Section 3 we discuss the actual parallelization of the XPRESS interior point optimizer. Section 4 presents computational results that document the speed-up that can be obtained using an SGI-MP and the XPRESS optimization package. Finally, in Section 5 we present our conclusions.
2 The homogeneous algorithm

The purpose of this section is to give a brief introduction to the so-called homogeneous interior-point algorithm, which is employed by the XPRESS interior point optimizer. The main advantage of the homogeneous algorithm compared to the more popular primal-dual algorithm is that it detects infeasibility and unboundedness reliably.

The algorithm presented below solves the LP problem in the standard form
$$
(P)\qquad
\begin{array}{ll}
\mbox{minimize}   & c^T x \\
\mbox{subject to} & Ax = b,\ x \geq 0,
\end{array}
$$
where $b \in \mathbb{R}^m$, $A \in \mathbb{R}^{m \times n}$, and $c, x \in \mathbb{R}^n$. Without loss of generality we assume that $A$ is of full row rank. The dual of (P) is
$$
(D)\qquad
\begin{array}{ll}
\mbox{maximize}   & b^T y \\
\mbox{subject to} & A^T y + s = c,\ s \geq 0,
\end{array}
$$
where $s \in \mathbb{R}^n$. In practice $m$ and $n$ may be very large. However, the matrix $A$ is sparse and typically less than 1% of the entries in $A$ are nonzero. We do not assume that $A$ has any particular structure such as being block-angular.

In the following we use the notation that if $x$ is a vector, then capital $X$ denotes the corresponding diagonal matrix,
$$
X := \mathop{\rm diag}(x).
$$
Moreover, if $u$ and $v$ are two column vectors, then $(u, v)$ is also a column vector with $u$ stacked on top of $v$.
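Although it is not stated explicitly above, the algorithm below can be interpreted as a primal-dual method applied to the standard homogeneous (self-dual) model that combines (P) and (D) through the two additional scalar variables $\tau$ and $\kappa$. The following formulation is the textbook version of that model and is included here only as background:
$$
\begin{array}{rcl}
Ax - b\tau & = & 0, \\
A^T y + s - c\tau & = & 0, \\
-c^T x + b^T y - \kappa & = & 0, \\
(x, \tau, s, \kappa) & \geq & 0.
\end{array}
$$
A strictly complementary solution of this model with $\tau > 0$ yields an optimal primal-dual pair for (P) and (D), whereas $\kappa > 0$ certifies that (P) or (D) is infeasible, which is why the algorithm can detect infeasibility and unboundedness reliably. Now the homogeneous algorithm can be stated as Algorithm 2.1.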
Algorithm 2.1

1. Choose $(x^0, \tau^0, y^0, s^0, \kappa^0)$ such that $(x^0, \tau^0, s^0, \kappa^0) > 0$, and choose $\varepsilon_f, \varepsilon_g > 0$. Set $k := 0$.

2. LOOP: Compute the residuals
$$
r_p^k := b\tau^k - Ax^k,\qquad
r_d^k := c\tau^k - A^T y^k - s^k,\qquad
r_g^k := \kappa^k + c^T x^k - b^T y^k.
$$

3. If $\|(r_p^k, r_d^k, r_g^k)\| \leq \varepsilon_f$ and $(x^k)^T s^k + \tau^k \kappa^k \leq \varepsilon_g$, then terminate.

4. Choose $\gamma \in [0, 1)$ and solve
$$
\begin{array}{rcl}
A d_x - b d_\tau & = & (1-\gamma) r_p^k, \\
A^T d_y + d_s - c d_\tau & = & (1-\gamma) r_d^k, \\
-c^T d_x + b^T d_y - d_\kappa & = & (1-\gamma) r_g^k, \\
S^k d_x + X^k d_s & = & -X^k s^k + \gamma \mu^k e, \\
\kappa^k d_\tau + \tau^k d_\kappa & = & -\tau^k \kappa^k + \gamma \mu^k
\end{array}
\eqno(1)
$$
for $(d_x, d_\tau, d_y, d_s, d_\kappa)$, where $e$ is the vector of all ones and $\mu^k := ((x^k)^T s^k + \tau^k \kappa^k)/(n+1)$.

5. For some $\hat{\alpha} \in (0, 1)$ let
$$
\begin{array}{lll}
\alpha^k := & \mbox{maximize}   & \alpha \\
            & \mbox{subject to} & (x^k, \tau^k, s^k, \kappa^k) + \alpha (d_x, d_\tau, d_s, d_\kappa) \geq 0, \\
            &                   & \alpha \leq \hat{\alpha}^{-1}.
\end{array}
$$

6. Set
$$
\begin{array}{rcl}
(x^{k+1}, \tau^{k+1}) & := & (x^k, \tau^k) + \alpha^k (d_x, d_\tau), \\
(y^{k+1}, s^{k+1}, \kappa^{k+1}) & := & (y^k, s^k, \kappa^k) + \alpha^k (d_y, d_s, d_\kappa).
\end{array}
$$

7. $k := k + 1$.

8. GOTO LOOP.

Details and motivation of the homogeneous algorithm are presented in [2]. Furthermore, it should be emphasized that Algorithm 2.1 is a simplification of the algorithm implemented in practice. For example, a variant of Mehrotra's predictor-corrector heuristic is used in step 4 of the algorithm to choose $\gamma$. This has the consequence that the linear equation system (1) is solved several times for different right-hand sides.

The major computational tasks in the homogeneous algorithm are the matrix-vector products with $A$ performed in step 2 and the solution of the Newton equation system (1) in step 4. The remaining tasks have complexity $O(n)$ and hence in most cases they do not dominate the computations. Therefore, the subsequent discussion mainly treats the parallelization of matrix-vector products with $A$ and the solution of the Newton equation system.
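To make the first of these tasks concrete, the sketch below shows how the products $Ax$ and $A^T y$ can be computed when $A$ is stored column-wise in compressed sparse column (CSC) format. The data structure and function names are illustrative assumptions and not the actual XPRESS internals.

/*
 * Illustrative sketch (not the actual XPRESS data structures): matrix-vector
 * products with a sparse matrix A stored in compressed sparse column (CSC)
 * format.  A has m rows and n columns; column j holds its nonzeros in
 * val[colptr[j] .. colptr[j+1]-1] with row indices in rowind[...].
 */
typedef struct {
    int     m, n;      /* dimensions        */
    int    *colptr;    /* length n+1        */
    int    *rowind;    /* length colptr[n]  */
    double *val;       /* length colptr[n]  */
} csc_matrix;

/* y := A*x */
void csc_ax(const csc_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->m; i++)
        y[i] = 0.0;
    for (int j = 0; j < A->n; j++)
        for (int p = A->colptr[j]; p < A->colptr[j + 1]; p++)
            y[A->rowind[p]] += A->val[p] * x[j];
}

/* z := A^T*y; each column contributes one independent inner product */
void csc_aty(const csc_matrix *A, const double *y, double *z)
{
    for (int j = 0; j < A->n; j++) {
        double sum = 0.0;
        for (int p = A->colptr[j]; p < A->colptr[j + 1]; p++)
            sum += A->val[p] * y[A->rowind[p]];
        z[j] = sum;
    }
}

Note that in the product $A^T y$ the columns of $A$ can be processed completely independently, whereas in $Ax$ different columns update the same entries of the result vector; this difference matters when the two products are parallelized.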
2.1 The solution of the Newton equation system
In this section we demonstrate how the solution of the Newton equation system can be reduced to solving a system of linear equations with a positive definite coefficient matrix. To simplify the discussion, assume that the linear system
$$
\begin{bmatrix}
A      & -b      &     &   &      \\
       & -c      & A^T & I &      \\
-c^T   &         & b^T &   & -1   \\
S      &         &     & X &      \\
       & \kappa  &     &   & \tau
\end{bmatrix}
\begin{bmatrix} d_x \\ d_\tau \\ d_y \\ d_s \\ d_\kappa \end{bmatrix}
=
\begin{bmatrix} \hat r_p \\ \hat r_d \\ \hat r_g \\ \hat r_{xs} \\ \hat r_{\tau\kappa} \end{bmatrix}
\eqno(2)
$$
must be solved for several right-hand sides. Note that $X$ and $S$ are positive definite diagonal matrices, implying that if $A$ is of full row rank, the solution to (2) is unique. Define
$$
K := \begin{bmatrix} -X^{-1}S & A^T \\ A & 0 \end{bmatrix};
\eqno(3)
$$
then it can be verified that $K$ is a nonsingular matrix if $A$ is of full row rank, which we assume. This implies that each of the two linear systems
$$
K(p, q) = (c, b)
\eqno(4)
$$
and
$$
K(u, v) = (\hat r_d - X^{-1}\hat r_{xs},\ \hat r_p)
\eqno(5)
$$
has a unique solution. If the vectors $(u, v)$ and $(p, q)$ are known, the search direction can easily be obtained using the formulas
$$
\begin{array}{rcl}
d_\tau & = & \dfrac{\hat r_g + \tau^{-1}\hat r_{\tau\kappa} + c^T u - b^T v}{\tau^{-1}\kappa - c^T p + b^T q}, \\[2mm]
d_x & = & u + p\, d_\tau, \\
d_y & = & v + q\, d_\tau, \\
d_s & = & X^{-1}(\hat r_{xs} - S d_x), \\
d_\kappa & = & \tau^{-1}(\hat r_{\tau\kappa} - \kappa d_\tau).
\end{array}
$$
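These formulas follow by eliminating $d_s$ and $d_\kappa$ from (2). The short derivation below is a reconstruction consistent with the system as stated above and is included only to make the reduction explicit:
\[
% the 4th and 5th block rows of (2) give
d_s = X^{-1}(\hat r_{xs} - S d_x), \qquad
d_\kappa = \tau^{-1}(\hat r_{\tau\kappa} - \kappa d_\tau);
\]
\[
% substituting into the 1st and 2nd block rows gives
\begin{bmatrix} -X^{-1}S & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} d_x \\ d_y \end{bmatrix}
=
\begin{bmatrix} \hat r_d - X^{-1}\hat r_{xs} \\ \hat r_p \end{bmatrix}
+ d_\tau \begin{bmatrix} c \\ b \end{bmatrix},
\]
so that, by (4) and (5), $d_x = u + p\, d_\tau$ and $d_y = v + q\, d_\tau$. Finally, the 3rd block row of (2) yields
\[
d_\tau \left(\tau^{-1}\kappa - c^T p + b^T q\right)
= \hat r_g + \tau^{-1}\hat r_{\tau\kappa} + c^T u - b^T v,
\]
which is the expression for $d_\tau$ given above.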
Note that even though the system (2) has to be solved for different right-hand sides, the system (4) is only solved once in each iteration. Furthermore, it can be observed that the two systems (4) and (5) can be solved independently and hence simultaneously. Solving (4) and (5) amounts to solving
$$
\begin{bmatrix} -X^{-1}S & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} u \\ v \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \end{bmatrix}
\eqno(6)
$$
for different values of the right-hand side $(r_1, r_2)$. This system can be further reduced to the normal equation system using
$$
u = D(r_1 - A^T v), \quad \mbox{where } D := -XS^{-1}.
\eqno(7)
$$
Now $v$ is given as the solution to
$$
Mv = ADr_1 - r_2,
\eqno(8)
$$
where $M := -ADA^T = AXS^{-1}A^T$ is symmetric and positive definite. Hence, a Cholesky decomposition of $M$ exists, that is
$$
M = LL^T,
$$
where $L$ is a lower triangular matrix with positive diagonal elements. Using the Cholesky decomposition, the linear equation system (8) can then be solved easily. The main advantage of this approach is that numerical pivoting is not required to secure numerical stability [11]. Therefore, the pivot order can be chosen to minimize fill-in in the Cholesky decomposition. Moreover, the pivot order can be chosen once and for all, because the sparsity pattern of $M$ is independent of $D$. The main drawback of this approach is that even though $A$ is very sparse, $L$ may be dense. Moreover, an approach based on solving (6) directly using symmetric Gaussian elimination may be numerically more stable. However, the analyses presented by Wright [29] indicate that the normal equation approach works well in most cases.

In practice the Cholesky decomposition is used as follows. First, a symmetric ordering of the rows and columns of $M$ is chosen such that the fill-in in the Cholesky factor $L$ is limited. The problem of choosing an ordering which minimizes the amount of fill-in is NP-hard. Therefore, the problem can only be solved approximately in a reasonable amount of time. The approximate minimum degree heuristic (AMD) proposed by Davis, Amestoy, and Duff [1], including a modification suggested by Rothberg [27], is the default ordering method in the XPRESS interior point optimizer. Second, when an ordering has been determined, the data structure for the Cholesky decomposition is initialized. This is referred to as the symbolic phase, because no numerical computations are involved. Finally, the actual numerical factorization takes place using the data structures computed in the symbolic phase. The ordering and symbolic phases are only executed once, before the first numerical factorization.
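To illustrate how the factorization is used, the following dense sketch solves $LL^T v = r$ by one forward and one backward substitution; a sparse implementation performs the same two sweeps but only touches the nonzeros of $L$. The function name and array layout are illustrative assumptions.

/*
 * Illustrative dense sketch: solve L*L^T*v = r given the Cholesky factor L
 * (lower triangular, stored row-major in an m-by-m array).
 */
void cholesky_solve(int m, const double *L, const double *r, double *v)
{
    /* forward substitution: solve L*w = r (w is stored in v) */
    for (int i = 0; i < m; i++) {
        double sum = r[i];
        for (int j = 0; j < i; j++)
            sum -= L[i * m + j] * v[j];
        v[i] = sum / L[i * m + i];
    }
    /* backward substitution: solve L^T*v = w */
    for (int i = m - 1; i >= 0; i--) {
        double sum = v[i];
        for (int j = i + 1; j < m; j++)
            sum -= L[j * m + i] * v[j];
        v[i] = sum / L[i * m + i];
    }
}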
2.2 The sparse push Cholesky
The algorithm for computing the Cholesky decomposition can be organized in different ways, leading to different computational performance. This has been studied in [5, 23, 28, 17]. In particular the push Cholesky, the pull Cholesky, and the multifrontal Cholesky have been studied. These three methods are mathematically equivalent, but lead to different computational efficiency depending on the computer architecture. In the following we mainly treat the push Cholesky, because our experience is that it is the most efficient method to use within an interior-point method executed on the most frequently occurring computer architectures.

Computation of the Cholesky decomposition is based on the following matrix decomposition:
$$
M = \begin{bmatrix} L_{11} & L_{21}^T \\ L_{21} & L_{22} \end{bmatrix}
  = \begin{bmatrix} L_{11}^{\frac{1}{2}} & 0 \\ L_{21}L_{11}^{-\frac{1}{2}} & I \end{bmatrix}
    \begin{bmatrix} I & 0 \\ 0 & L_{22} - L_{21}L_{11}^{-\frac{1}{2}}\bigl(L_{21}L_{11}^{-\frac{1}{2}}\bigr)^T \end{bmatrix}
    \begin{bmatrix} \bigl(L_{11}^{\frac{1}{2}}\bigr)^T & \bigl(L_{21}L_{11}^{-\frac{1}{2}}\bigr)^T \\ 0 & I \end{bmatrix},
\eqno(9)
$$
where $L_{11}$ is a square matrix and $L_{11}^{\frac{1}{2}}$ is the Cholesky decomposition of $L_{11}$. By using this matrix decomposition recursively, the algorithm for the computation of the Cholesky decomposition can be stated as follows.
Algorithm 2.2
1: for $j = 1, \ldots, m$
2: $\quad l_{jj} := \sqrt{l_{jj}}$
3: $\quad l_{(j+1:m)j} := l_{(j+1:m)j} / l_{jj}$
4: $\quad$ for $k = j+1, \ldots, m$
5: $\qquad$ if $l_{kj} \neq 0$ then
6: $\qquad\quad l_{(k:m)k} := l_{(k:m)k} - l_{kj}\, l_{(k:m)j}$

We use the notation $l_{(:)k}$ for the $k$th column of $L$ and $l_{(i:j)k}$ for the part of the $k$th column from row $i$ to row $j$. An implicit assumption in Algorithm 2.2 is that only the nonzeros of $L$ are stored. Therefore, for example, in step 6 only the nonzero components of $l_{(k:m)j}$ are subtracted from $l_{(k:m)k}$. This implies that if $M$ is a sparse matrix, then Algorithm 2.2 is much faster than its dense equivalent. Algorithm 2.2 is denoted the push Cholesky, because the update
$$
L_{22} - L_{21}L_{11}^{-\frac{1}{2}}\bigl(L_{21}L_{11}^{-\frac{1}{2}}\bigr)^T
\eqno(10)
$$
is performed immediately after $L_{21}L_{11}^{-\frac{1}{2}}$ has been computed, whereas in the pull Cholesky the update of a column in (10) is delayed until the column itself is factorized.
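For concreteness, a minimal dense version of this right-looking ("push") scheme is sketched below. It only serves to make the column operations of Algorithm 2.2 explicit and ignores sparsity, supernodes, and the storage issues discussed next; the function name and storage layout are illustrative assumptions.

/*
 * Illustrative dense right-looking ("push") Cholesky following the column
 * operations of Algorithm 2.2.  A is symmetric positive definite, stored
 * column-major in an m-by-m array; on return its lower triangle holds L.
 * A sparse implementation would skip the zero entries tested in step 5.
 */
#include <math.h>

void push_cholesky(int m, double *A)
{
    for (int j = 0; j < m; j++) {
        /* step 2: l_jj := sqrt(l_jj) */
        A[j + j * m] = sqrt(A[j + j * m]);
        /* step 3: scale the remainder of column j */
        for (int i = j + 1; i < m; i++)
            A[i + j * m] /= A[j + j * m];
        /* steps 4-6: immediately push the update of column j
           onto all later columns k with l_kj != 0 */
        for (int k = j + 1; k < m; k++) {
            double lkj = A[k + j * m];
            if (lkj != 0.0)
                for (int i = k; i < m; i++)
                    A[i + k * m] -= lkj * A[i + j * m];
        }
    }
}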
The straightforward implementation of the Cholesky decomposition as presented in Algorithm 2.2 chooses $L_{11}$ to be a scalar in (9). However, in practice it is worthwhile to let $L_{11}$ be a matrix. In particular, the nonzeros in $L$ are often located such that a set of adjacent columns in $L$ has a dense diagonal block and an identical nonzero pattern below the dense diagonal block. For historical reasons such a set of similarly structured columns is called a supernode [10]. In Figure 1 we show an example of a supernode.

Figure 1: An example of a supernode in $L$.

The resulting Cholesky decomposition based on supernodes works as follows. First $L_{11}^{\frac{1}{2}}$ is computed, next $L_{21}L_{11}^{-\frac{1}{2}}$ is computed, and finally the update $L_{22} - L_{21}L_{11}^{-\frac{1}{2}}(L_{21}L_{11}^{-\frac{1}{2}})^T$ is performed. Hence, each operation is performed for all the columns in the supernode in one step instead of for each column in the supernode individually. This leads to several computational advantages.

First, the use of supernodes makes loop unrolling possible, see [3, p. 220]. Indeed, our implementation employs up to 16-way loop unrolling, depending on the number of floating-point registers the computer has.

Second, workstations have a memory hierarchy consisting of a large, relatively slow main memory and a small high-speed cache memory. The size of the cache tends to grow slowly with the size of the main memory. On computers having such a memory hierarchy the most recently used part of the main memory resides in the cache, implying that if subsequent memory references are to the same part of the memory, then the data can be fetched efficiently from the cache. The supernodal push Cholesky tends to perform more work on the same block of memory than Algorithm 2.2 and hence exploits the cache better. To improve the cache behavior further, supernodes that require more space than the cache size are partitioned into several smaller supernodes which fit into the cache, as suggested in [28].

Third, as already mentioned, only the nonzeros in $L$ are stored explicitly. This implies that there is an additional computational overhead associated with the computation of
$$
L_{22} - \bigl(L_{21}L_{11}^{-\frac{1}{2}}\bigr)\bigl(L_{21}L_{11}^{-\frac{1}{2}}\bigr)^T.
\eqno(11)
$$
The computation of (11) is organized as a sequence of supernode-to-supernode updates, each of which corresponds to the computation
$$
T := T - UU^T,
$$
where $U$ is a rectangular matrix and $T$ is a symmetric matrix. Hence, it is only necessary to update the lower triangular part of $T$. However, due to the sparsity both $U$ and $T$ are stored compactly, as visualized in Figure 2. Note that by the definition of a supernode each column within $U$ and $T$ is structurally identical (only below the diagonal for $T$). Therefore, it is sufficient to store the nonzero rows and a row index for each row in $U$ and $T$. In the example of Figure 2 the rows 1, 2, 4, and 6 of $T$ are nonzero, whereas in $U$ only the rows 1, 4, and 6 are nonzero. This implies that when $T - UU^T$ is formed, only the rows 1, 4, and 6 of $T$ are modified by the update, and the positions marked u in the left-hand side of Figure 2 are exactly those positions in the updated $T$ which are modified. Therefore, to perform the update of $T$ with $UU^T$ correctly it is necessary to form a "map" between the rows of $UU^T$ and $T$, which leads to a computational overhead called the sparse overhead.
Figure 2: Example of a sparse update (x = a nonzero position, u = an update position).

The map can be computed efficiently using a linear search. As such, the sparse overhead is not insignificant, but by exploiting the supernodes the map array is only formed once for each supernode and not once for each column in the supernode, leading to a large reduction in the sparse overhead. A version of Algorithm 2.2 that exploits the supernodes is called a supernodal push Cholesky, and it is the algorithm employed in the sequential version of the XPRESS interior point optimizer. Further details about the algorithm can be found in [2]. One important fact about our implementation is that the update (11) is implemented as a set of BLAS-like subroutines for triangular matrices, and these subroutines perform the majority of the work.
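To make the map-based update concrete, the following sketch performs a single supernode-to-supernode update $T := T - UU^T$ using a row map built by a linear search. The data layout is a simplified assumption (in particular, $T$ is treated as the diagonal block of the target supernode, so that its rows and columns share one index list); it is not the actual XPRESS data structure.

/*
 * Illustrative sketch of one sparse update T := T - U*U^T with a row map.
 * U  : nu-by-w dense block, column-major; its rows carry the sorted global
 *      row indices urow[0..nu-1].
 * T  : nt-by-nt symmetric block (lower triangle used), column-major; its
 *      rows/columns carry the sorted global indices trow[0..nt-1], which
 *      is a superset of urow.
 * map: map[i] = local position of global row urow[i] inside trow[].
 */
void build_row_map(int nu, const int *urow, int nt, const int *trow, int *map)
{
    int t = 0;
    for (int i = 0; i < nu; i++) {       /* both index lists are sorted    */
        while (trow[t] != urow[i]) t++;  /* linear search = sparse overhead */
        map[i] = t;
    }
    (void)nt;                            /* nt only documents the bound     */
}

void sparse_update(int nu, int w, const double *U, const int *map,
                   int nt, double *T)
{
    for (int j = 0; j < nu; j++) {           /* column j of U*U^T          */
        for (int i = j; i < nu; i++) {       /* lower triangle only        */
            double sum = 0.0;
            for (int k = 0; k < w; k++)      /* (U*U^T)_{ij}               */
                sum += U[i + k * nu] * U[j + k * nu];
            /* scatter through the map into the compact storage of T       */
            T[map[i] + map[j] * nt] -= sum;
        }
    }
}

Because the map is built once per supernode rather than once per column, the cost of the linear searches is amortized over all columns of the update, which is the reduction in sparse overhead described above.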
3 Parallelization of the algorithm

In this section we turn to the parallelization of the XPRESS interior point optimizer. The actual implementation targets the SGI architecture, which is based on shared memory and symmetric multiprocessing. Therefore, the subsequent discussion is to some extent based on how the SGI-MP computer works. However, recently a new standard called OpenMP (see http://www.openmp.org/) has been created. This standard should make it easy to port our SGI based code to machines from other computer vendors which use the shared-memory architecture.

The SGI-MP computer can for our purposes be considered as a shared memory machine, where the processors communicate with the memory and with each other through a shared bus. To reduce the communication on the shared bus each processor is equipped with a small first level cache (16 KB) and a large second level cache (4 MB). We have chosen to target our implementation at the SGI-MP architecture, because it is fairly easy to port an existing serial code to the shared-memory architecture. Greater difficulties are expected when porting to a distributed memory parallel computer, because then it is necessary to distribute the data between the processors and to keep the communication low. On a shared-memory computer it is not necessary to distribute data, and communication is fast as long as the shared bus is not saturated.

Hence, the goal of this study is to develop a parallel version of the XPRESS interior point optimizer where data is shared on a coarse grain, and in such a way that the local cache of each processor is used to reduce the communication with the global shared memory and to keep the bus load low. Furthermore, there is a cost associated with splitting a given job between several processors, called the parallel overhead, which we aim to keep low. Moreover, to ensure that the code is parallel efficient, an additional objective has been that the work should be distributed equally among the processors.

With these considerations in mind, and the fact that a shared-memory computer is not scalable, the overall goal has been to develop a parallel code which is efficient on a low number of processors, say less than 24, independent of the structure in the coefficient matrix of the LP problem. However, we assume that the LP problem is large and expensive to solve in sequential mode.
3.1 The Power C language
The XPRESS interior point optimizer has been written in the C programming language, and it is not turned into a parallel program by default when compiled with the SGI Power C compiler on the SGI-MP. However, in the SGI Power C language it is possible to specify certain compiler directives in the source code to instruct the compiler which operations should be parallelized. Therefore, in this section we give a short introduction to the elements of the Power C language which are important for the subsequent discussion.

The compiler directives are given in the form of "#pragma" constructs that are modeled after the Parallel Computing Forum (PCF) directives for parallel Fortran [15]. The programming framework used in the Power C language is thread based, which can be illustrated by the following code fragment:

#pragma parallel
#pragma pfor (i=0; 1000; i)
{
    for(i=0; i<1000; i++)
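For comparison, and since OpenMP is mentioned above as the emerging standard, the fragment corresponds roughly to the following OpenMP sketch; the loop body is purely illustrative because the original example is cut off at this point.

/* Rough OpenMP analogue of the Power C fragment above (illustrative only;
   the loop body is a placeholder, not taken from the original example). */
#include <omp.h>

void scale_vector(double *a, const double *b)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < 1000; i++)
        a[i] = 2.0 * b[i];   /* each iteration is independent, so the
                                runtime may split the 1000 iterations
                                among the available threads */
}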