A Data Parallel Augmenting Path Algorithm for the Dense Linear Many-To-One Assignment Problem*

OLOF DAMBERG                                                    [email protected]
Department of Mathematics, Linköping Institute of Technology, S-581 83 Linköping, Sweden

SVERRE STORØY                                                   [email protected]
TOR SØREVIK                                                     [email protected]
Department of Informatics, University of Bergen, Thormøhlensgate 55, N-5020 Bergen, Norway
Abstract. The purpose of this study is to describe a data parallel primal-dual augmenting path algorithm for the dense linear many-to-one assignment problem, also known as semi-assignment. This problem could for instance be described as assigning n persons to m (≤ n) job groups. The algorithm is tailored specifically for massive SIMD parallelism and employs, in this context, a new efficient breadth-first-search augmenting path technique which is shown to be faster than the shortest augmenting path search normally used in sequential algorithms for this problem. We show that the best known sequential computational complexity of O(mn²) for dense problems is reduced to the parallel complexity of O(mn) on a machine with n processors supporting reductions in O(1) time. The algorithm is easy to implement efficiently on commercially available massively parallel computers. A range of numerical experiments is performed on a Connection Machine CM200 and a MasPar MP-2. The tests show the good performance of the proposed algorithm.
Keywords: Parallel SIMD computers, parallel algorithm, assignment problem, bipartite matching, semi-assignment, primal-dual method.
1. Introduction

The classical linear assignment problem (LAP) may be described as assigning n persons to n distinct jobs on a one-to-one basis such that the total cost of the assignment is minimized. This is a fundamental problem in combinatorial optimization. It also serves as an important building block in the solution of more complex optimization problems. Hence, it is a well studied problem, and a number of efficient algorithms have been proposed in recent decades. In the present paper we will study a more general assignment problem, the many-to-one assignment problem (MTOAP), which is a special case of the multi-assignment problem (see [28], pages 157-160). A multi-assignment problem can be viewed as a minimum cost matching in an undirected bipartite graph G = (M ∪ N, E), where a node i ∈ M may be assigned to several nodes in N, and a node j ∈ N may be assigned to several nodes in M.

* Submitted to Computational Optimization and Applications (COAP515-94; August 29, 1994). This revision sent August XX, 1995.
Pairs (i, j) themselves can occur with multiplicities x_ij. The linear transportation problem falls into this class (where x_ij would correspond to the flow from i to j), and the LAP is a special case of this problem where only one-to-one matches with multiplicity one are allowed. We are concerned with the MTOAP, which is the special case of multi-assignment where it is allowed to assign several (k_i) nodes in N to a node i ∈ M, but only one node in M may be assigned to a node in N. MTOAP is known in the literature as the semi-assignment problem. However, this term is also associated with the greedily solvable case of linear assignment where there is no restriction on how many nodes in N can be assigned to a node in M (see e.g., [23], page 265). Hence, for clarity, we use the term many-to-one assignment. The MTOAP may be formulated as the following special linear program:

    min   Σ_{i=1..m} Σ_{j=1..n} c_ij x_ij
    s.t.  Σ_{j=1..n} x_ij = k_i,   i ∈ M = {1, ..., m},          (1)
          Σ_{i=1..m} x_ij = 1,     j ∈ N = {1, ..., n},          (2)
          x_ij ≥ 0,  ∀ i, j,                                     (3)

where the set M corresponds to rows (or row nodes) and the set N corresponds to columns (or column nodes); m ≤ n. Let c_ij denote the cost of assigning column j ∈ N to row i ∈ M and k_i the number of columns that must be assigned to row i ∈ M. If m = n and k_i = 1, ∀i, we have a LAP. Let x_ij = 1 if column j is assigned to row i and x_ij = 0 otherwise. A feasible solution clearly exists if and only if Σ_{i=1..m} k_i = n. Note that if (1) were inequality constraints, it is always possible to convert these to equality constraints by introducing dummy columns with zero cost. By the total unimodularity property of the constraint matrix one can show that an optimal solution yields x_ij ∈ {0, 1}, ∀ i, j. The dual problem to MTOAP is

    max   Σ_{i=1..m} k_i u_i + Σ_{j=1..n} v_j
    s.t.  u_i + v_j ≤ c_ij,  ∀ i, j,                             (4)

where u_i are the dual variables associated with constraints (1) and v_j are the dual variables associated with constraints (2). Let c̄_ij denote the reduced cost, i.e., c̄_ij = c_ij − u_i − v_j. The optimality conditions, or Karush-Kuhn-Tucker conditions, for a solution to MTOAP are primal feasibility (1)-(3), dual feasibility (4) and complementary slackness:

    x_ij c̄_ij = 0,  ∀ i, j,                                      (5)
which states that an optimal assignment (i, j) must have c̄_ij = 0. This problem occurs in a multitude of applications such as project planning, scheduling, capital budgeting and planning, network design and military battle planning. Particular instances include the problems of connecting n terminals to m concentrators in a computer network [15] or assigning groups of m weapons to n targets [11]. The present study was prompted by the need to quickly find feasible solutions to large concentrator location problems when solving them with a massively parallel algorithm. Here the MTOAP appears as a subproblem inside a subgradient algorithm [15]. MTOAP is clearly a special case of the classical transportation problem. Therefore algorithms such as network simplex, primal or dual simplex, primal-dual methods and shortest augmenting path algorithms may be applied to solve MTOAP. Specially adapted primal simplex algorithms for solving LAP and MTOAP were developed in [3], [4]. Their alternating basis algorithm is an adaptation of the network simplex method, which considers only a subset of bases called the alternating path bases, thus making the method more efficient by reducing the impact of degeneracy considerably. The same type of algorithm for more general network flow problems was also proposed independently in [14]. Auction algorithms appear in e.g., [7], [6], and a shortest augmenting path algorithm for the transportation problem is given in [33]. In [25] a process parallel primal simplex algorithm is described. It is also possible to replicate each node i ∈ M (together with the associated edges) into k_i assignment nodes, thus yielding an ordinary assignment problem. Then any algorithm for solving LAP may be applied. The best of these (i.e., the Hungarian and the shortest augmenting path methods) have a sequential computational complexity of O(n³); see e.g., [19], [10]. A competing algorithm, which is very fast especially for large sparse problems, is the auction algorithm; see e.g., [5], [6], [7], [8]. A range of studies on process and data parallel implementations of the auction algorithm can be found in, e.g., [5], [8], [11], [12], [22], [20], [34]. Corresponding studies on the Hungarian and shortest augmenting path algorithms appear in, e.g., [9], [2], [22], [30], [31]. Recently, in [21], a sequential algorithm for the solution of the MTOAP was proposed. It is a generalization of the shortest augmenting path algorithm for the LAP by [19]. An O(mn²) time bound for dense problems is obtained, which is a considerable improvement for problems where m ≪ n. To our knowledge there is presently no parallel algorithm, specifically developed for MTOAP, proposed in the literature. Inspired by the fact that a sequential O(mn²) algorithm can be obtained [21], we decided to try to obtain an O(mn) data parallel algorithm (however, not by directly parallelizing the SAP_SA algorithm in [21], as motivated later). As mentioned above there is a plethora of papers on parallel algorithms for LAP, and since MTOAP has a very similar structure we chose, after studying the alternatives, to adopt a known data distribution technique for the LAP, coupled with a new solution technique.
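As a concrete illustration of the row-replication reduction just mentioned, the following sketch builds the n × n replicated cost matrix and hands it to an off-the-shelf LAP solver (assuming NumPy and SciPy are available; the function name mtoap_by_row_replication and the toy data are ours, not part of the paper). It is only a serial reference implementation for small instances, not the parallel algorithm developed here.

import numpy as np
from scipy.optimize import linear_sum_assignment

def mtoap_by_row_replication(C, k):
    """Solve a small MTOAP by replicating row i of C k[i] times and
    solving the resulting n x n linear assignment problem."""
    C = np.asarray(C)
    k = np.asarray(k)
    assert k.sum() == C.shape[1], "feasibility requires sum(k_i) = n"
    owner = np.repeat(np.arange(C.shape[0]), k)   # original row of each replicated copy
    rows, cols = linear_sum_assignment(C[owner])  # optimal copy -> column matching
    assign = np.zeros(C.shape, dtype=int)
    assign[owner[rows], cols] = 1                 # fold the copies back onto the rows
    return assign

# Tiny example: m = 2 rows with capacities (2, 1), n = 3 columns.
C = [[4, 2, 8],
     [3, 7, 1]]
print(mtoap_by_row_replication(C, [2, 1]))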
In Section 2 we provide a description of the proposed algorithm, tailored specifically for data parallel implementation and dense problems. The algorithm is based on primal-dual augmenting path concepts; hence, it is not theoretically new. It employs an efficient breadth-first-search to find augmenting paths, and we show that in a massively parallel setting this is superior to the best-first-search (shortest augmenting path) normally used in sequential algorithms for LAP and MTOAP. A similar technique is used in [30], [31], where competitive massively data parallel, O(n²) time, augmenting path algorithms for LAP are presented. We also prove the correctness and the worst case time complexity of O(mn) of the algorithm. Extensive experiments on a MasPar MP-2 and a Connection Machine CM200 are performed to test the implementation of the algorithm. Comparisons are made with the sequential SAP_SA code run on a Sun SPARCserver 20/61. These document that the parallelization strategy applied to our algorithm performs very well in practice. It is also revealed that the practical performance of the algorithm depends not only on m and n, but also on the cost range of the cost matrix and the differences in the node capacities k_i. The numerical results are reported in Section 3. Finally, in Section 4, we close with our conclusions and suggestions for further research.
2. A data parallel algorithm for MTOAP

The control structure of a data parallel program is equivalent to the one found in a sequential program. The data parallel program, however, achieves its parallelism by operating on multiple instances of similar data distributed across the processors. A good data distribution is crucial for maximum efficiency. In general, one should avoid communication between data sets, and if communication is necessary, one should arrange data so that nearest neighbor instead of router communication can be utilized. See e.g., [35] for a survey on data parallelism in conjunction with network structured optimization problems.
2.1. Parallel operations and distribution of data

As mentioned in the introduction, we have used a well known data distribution technique in the proposed algorithm; see e.g., [12], [30], [31], [22], [20]. With the elements of all n-dimensional vectors distributed to n processors in a one-to-one mapping, we are able to do an elementary vector operation in one time unit, reducing the complexity from O(n) in the sequential case to O(1) in parallel, implying a perfect speed-up of n, provided the vector length equals the number of processors. Matrices are distributed row-wise across the processors, i.e., a matrix column is local to a processor. This implies that any row of a matrix can be operated upon in constant time. No operations on entire columns of a matrix will take place in our algorithm.
Not all vector elements need to take part in every operation. We assume that the ability to mask out elements exists. Besides elementary operations such as element-wise add, multiply and compare, we also need to do global operations, also known as reductions, to find the maximum or minimum element of a vector. Communication between processors occurs only in the form of reductions, which can be performed in constant time; see e.g., [13], [27]. No other exchange of data between processors is necessary in our algorithm. In summary, all parallel operations needed in our algorithm can be performed in constant time.
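A minimal serial emulation of these primitives, with one NumPy array element playing the role of each processor (all names below are ours), shows that masked element-wise updates and global reductions are indeed the only operations required:

import numpy as np

# One array element per (virtual) processor; a boolean mask plays the role of the
# activity bits, and min/argmin play the role of the global reductions.
v = np.array([7.0, 3.0, 9.0, 5.0])
mask = np.array([True, False, True, True])

v[mask] += 2.0                      # masked element-wise update
delta = v[mask].min()               # reduction restricted to the active elements
j = int(np.flatnonzero(mask)[0])    # pick one active index (an index reduction)
print(v, delta, j)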
2.2. Parallelization of the augmenting path search

The key element in any algorithm for the MTOAP (or LAP) which is based on the primal-dual algorithm (see e.g., [26], Chapters 5-7 and 9-11, which cover the theoretical basis) is an augmenting path algorithm. Its purpose is to find a path which starts from an incompletely assigned row (i.e., a row with capacity left), alternately consists of an unmatched edge (with zero reduced cost) and a matched edge, and finally ends at an unassigned column. By reversing the assignments along this path one is able to augment the assignments by one (cf. [26], Theorem 10.1 and Lemmas 10.1 and 10.2). Alternatively one can equally well start at an unassigned column and end up at an incompletely assigned row. In [21] this is done, but as will become evident later, this limits the possibility for massive parallelism. We propose a breadth-first-search (BFS) procedure for finding an augmenting path and show that it is very suitable for massive parallelization. We will exemplify the BFS search procedure, which is done by a standard labeling technique (see e.g., [26], pages 120-124), by finding an augmenting path relative to a partial assignment as given in Figure 1. This construction is obtained when forming the restricted primal (RP) problem in the primal-dual algorithm. The objective is to maximize the flow on the admissible edges, which have c̄_ij = 0 (see e.g., [26], page 145). The search starts in r1, since it is incompletely assigned (reachable from the source s). By the construction of the RP problem the edges between rows and columns have infinite capacity, hence columns c1, c2, c3 and c4 are reachable from r1, and they are all labeled as such. However, by the definition of an alternating path only adjacent columns reachable through free edges may be part of the final path. Hence, nodes c1 and c4 can be eliminated from further discussion. An augmenting path must pass either through c2 or c3. The reachable nodes are found by a search through all columns (in a given row). This can be done in parallel by simply distributing the n columns over the processors, along with all relevant column information. With n processors, all of them can simultaneously decide whether or not they have an admissible edge and do all the necessary updates on the column data. In our example we continue by letting the appropriate processors examine the column nodes (c2 and c3).
Figure 1. The restricted primal problem: a maximum flow problem using only admissible edges. The bipartite graph has a source s connected to the row nodes r1, r2, r3 by capacitated auxiliary edges (capacities 3, 2 and 1), column nodes c1, ..., c6 connected to the sink t by capacitated auxiliary edges of capacity 1, and, between rows and columns, free (flow = 0) and matched (flow = 1) admissible edges.
Here, two cases exist: i) the column is unassigned, or ii) the column is already assigned. In the first case we can terminate the search, since an unassigned column has been reached and an augmenting path exists. By reversing the assignments along the path, the number of assignments will increase by one. (In our example this case does not occur in the first stage.) In the second case the column is assigned to a row through a matched edge and, by definition, the edge can be part of an alternating path. All rows that can be reached this way are marked as labeled. Here, r2 can be reached from both c2 and c3. The procedure now alternates between the row and column stage in a breadth-first manner until an unassigned column is found. If an unassigned column cannot be found, the labeling of the nodes makes it possible to update the dual variables according to the following (cf. [26], pages 146-147):

    δ = min { c_ij − u_i − v_j : i ∈ LR, j ∉ LC },              (6)
    u_i ← u_i + δ,   i ∈ LR,                                    (7)
    u_i ← u_i,       i ∉ LR,                                    (8)
    v_j ← v_j − δ,   j ∈ LC,                                    (9)
    v_j ← v_j,       j ∉ LC,                                    (10)

where LR and LC are the sets of labeled rows and columns. The search can continue from the stage at which this condition was detected. This is possible since, by Proposition 1, no admissible edge traversed in the search graph can become inadmissible. Hence, the previous search is still valid.

Proposition 1. By adjusting the dual variables according to (6)-(10), no previously admissible matched edge can become inadmissible, nor can any admissible free edge which connects two labeled nodes become inadmissible. All edges traversed in the search graph remain admissible.

Proof: See [16].
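A serial NumPy sketch of the dual update (6)-(10) may make the bookkeeping concrete (the array names C, u, v, lr, lc are ours; lr and lc are boolean masks for the labeled rows and columns):

import numpy as np

def dual_update(C, u, v, lr, lc):
    """Apply (6)-(10): raise u on labeled rows and lower v on labeled columns by
    the smallest reduced cost on edges from labeled rows to unlabeled columns."""
    reduced = C - u[:, None] - v[None, :]
    delta = reduced[np.ix_(lr, ~lc)].min()   # Eq. (6)
    u = u + np.where(lr, delta, 0.0)         # Eqs. (7)-(8)
    v = v - np.where(lc, delta, 0.0)         # Eqs. (9)-(10)
    return u, v

# Toy data: 2 rows and 3 columns, with row 0 and column 0 labeled.
C = np.array([[4.0, 2.0, 8.0], [3.0, 7.0, 1.0]])
u = np.array([2.0, 1.0]); v = np.array([0.0, 0.0, 0.0])
lr = np.array([True, False]); lc = np.array([True, False, False])
print(dual_update(C, u, v, lr, lc))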
Note that several columns can lead to the same row, since multiple columns can be assigned to a row (this cannot happen in LAP), and that several rows can lead to the same column through free edges. Hence, if an appropriate multi-path predecessor labeling procedure is used, it may be possible to augment the assignment along several (disjoint) paths. During the search, however, we remove redundant alternating paths, so that every node has only one predecessor, i.e., when a row is reached from a column across a matched edge, we remove all columns assigned to this row from the set of columns to be examined. This also ensures that the number of columns that must be visited is no more than m. Coupled with the fact that all operations performed while examining a row or column node can be done in O(1) time, this gives a total time complexity of O(m) to find one augmenting path; see Proposition 3. Starting the augmenting path from an unassigned column (as in [21]) implies that one has to search through m rows. This can be parallelized just as easily, but will, of course, limit the number of processors one can successfully apply to m, giving a parallel complexity of O(n²). This clearly reduces the potential for improvement, since this time complexity could be obtained without using a special MTOAP algorithm; this is for instance what is reported in [31]. Thus parallelizing by distributing columns to processors seems to be the obvious choice, and this prevents us from doing a straightforward parallelization of the SAP_SA algorithm in [21]. An alternative augmenting path search choice is the best-first-search (BestFS), where the objective is to find the shortest augmenting path in terms of the reduced costs. This is normally used in sequential LAP and MTOAP algorithms and has so far proven to be the best choice there. We have implemented this by adapting the shortest path code (see e.g., [19], [12], [22], [33]) to MTOAP, obtaining an O(m) time algorithm for finding an augmenting path. The pseudo code may be found in the Appendix. BFS fits perfectly with the data distribution chosen, since when examining edges from a row, all edges can be checked in parallel in constant time. This is also the case when using BestFS, since the entire `forward star' can be checked simultaneously to update the node prices. Computational results given in Section 3 reveal, however, that BFS is faster than BestFS.
2.3. The algorithm - pseudo code, correctness and complexity

In principle, a procedure to find augmenting paths is all that is needed to solve an MTOAP or LAP, but in order to speed up the computation all efficient implementations add an initialization heuristic, by which a partial assignment is obtained. Several more or less sophisticated initializations exist; see e.g., [21], [19], [10]. Most of these succeed in getting the major part of the assignment in place before one needs to start the augmentation. We also employ such a heuristic and complete the assignment process by the BFS augmenting path technique described above.
We denote vectors which we intend to operate on in a data parallel fashion by bold face, lower case letters (a). Scalar data are displayed in lower case italic letters (a). A single element i of a parallel vector a is denoted a_i. Matrices are displayed in bold face upper case letters (A). A row i of a matrix A is denoted by A_i (an entire column is never referenced), and A_ij is an element at row i and column j. Furthermore, for enhanced clarity, a parallel elementary operation involving x processors is marked by `∥x', and a reduction by `Rx', where the statement occurs. We use the notation where (logic expression) to denote a masking operation. Consider the following example: where (m) a ← a + 2. Here, m is a masking vector, which normally is implemented as a logical data type having the values `true' (or 1) for active elements or `false' (or 0) for inactive elements. This statement says that wherever an element of m is active, the corresponding element of a is incremented by two. The remaining elements (if any) of a are unchanged. Note that the dimensions of the vectors m and a must be the same. Multiple statements within a where statement are permitted. Nested where statements are also allowed, and the active elements are then determined by a logical `and' operation on the masks involved. There are a few instances where we need to obtain an arbitrary element of a set, where only some of the elements may be feasible (and not known beforehand). This operation is executed by the any function in our algorithm. It is of no importance for the algorithm's correctness which element is picked, but it is required that the pick is done in constant time. In practice, this is implemented by a global minimization (or maximization), i.e., a reduction on the indices of the feasible elements of the set, where the feasible indices are defined by a masking vector. Finally, to compactify the pseudo code, no end blocks such as endif and endwhere have been used; the indentation of the code should make this perfectly clear anyway.

Let k̂_i denote the residual capacity of row i. Let the parallel vectors u and v hold the dual variables u_i, i = 1, ..., m, and v_j, j = 1, ..., n, respectively. The vector u is spread across m processors and v across n processors. Let C hold the cost matrix {c_ij}, and X the assignment matrix {x_ij}. X_i will act as a mask, i.e., if column j is assigned to row i then element X_ij is active. Both matrices, which are of dimension m × n, are spread row-wise across n processors. By this distribution we can operate on all elements in a row simultaneously. Let r2c, which is a parallel n-vector, hold the complementary information to X, i.e., r2c_j = i if row i is assigned to column j, and r2c_j = 0 otherwise.

 1. algorithm MTO;                               Input: C, m, n, k̂. Output: X
 2.   initialize();                              Initialization heuristic
 3.   for i = 1, ..., m do
 4.     while (k̂_i > 0) augment(i);              Augmenting path by BFS
 5. end (MTO).
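The where masking and the any operation introduced above map directly onto array code; a small serial illustration (NumPy; the data is arbitrary and the names are ours):

import numpy as np

a = np.array([5, 1, 7, 3])
m = np.array([True, False, True, True])     # masking vector

a = np.where(m, a + 2, a)                   # where (m) a <- a + 2
inner = m & (a > 6)                         # nested where: logical 'and' of the masks
a = np.where(inner, 0, a)

# any: pick an arbitrary feasible index via a reduction over the masked indices.
feasible = np.flatnonzero(m)
j = int(feasible.min())
print(a, j)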
For an efficient implementation one has to find a good trade-off between how much work it is worthwhile to spend on the initialization in order to save work in the augmentation phase. The optimal choice of initialization procedure will not be discussed at any length in this paper. We describe a simple initialization procedure which works well. We have, however, also implemented a version of the more sophisticated initialization developed in [19] for LAP and adapted to MTOAP in [21], since this is what is normally used in a sequential augmenting path algorithm. The pseudo code is available in the Appendix. We comment on the difference in efficiency in Section 3. In procedure initialize we display a simple row and column reduction heuristic (cf. [10]) adapted to MTOAP, which gives a dual feasible solution having at least one reduced cost c̄_ij = 0 in each row and column. The complementary slackness conditions now allow an assignment (i, j) wherever c̄_ij = 0, and this is used to get a partial set of assignments. In the row reduction phase we look at the rows cyclically, since we found that this gave substantially more assignments compared to trying to get as many assignments as possible in one row before looking at the next. The latter approach gives a clustering effect which reduces the possibility of finding assignments further on in the initialization phase.

 1. procedure initialize();
 2. for i = 1, ..., m                                           Row reduction (2-15)
 3.   ∥n  X_i ← 0;                                              Row i has no assignments
 4.   Rn  u_i ← minval(C_i);                                    u_i = min_j {c_ij}; v = 0
 5.   insert(i, RS);                                            Add row to list
 6. ∥n  r2c ← 0;                                                Columns not assigned
 7. while (not empty(RS))                                       Cycle through row list
 8.   i ← next(RS);
 9.   if (k̂_i > 0)                                              Row capacity left
10.     ∥n  m ← (C_i = u_i) and (r2c = 0);                      Unassigned columns with c̄_ij = 0
11.     Rn  if (m is nonempty)
12.     Rn    where (m) j ← any{1, ..., n};                     Pick a column
13.           k̂_i ← k̂_i − 1; r2c_j ← i; X_ij ← 1;               Assign (i, j)
14.     else delete(i, RS);                                     No unassigned column available
15.   else delete(i, RS);                                       No capacity available
16. ∥n  v ← ∞;                                                  Column reduction (16-23)
17. for i = 1, ..., m
18.   ∥n  where (C_i − u_i < v)
19.   ∥n    v ← C_i − u_i; r ← i;                               v_j = min_i {c_ij − u_i}; r_j = argmin_i {c_ij − u_i}
20. for j = 1, ..., n
21.   i ← r_j;                                                  c̄_ij = 0 for this row
22.   if (r2c_j = 0 and k̂_i > 0)                                Unassigned column and row capacity left
23.     k̂_i ← k̂_i − 1; r2c_j ← i; X_ij ← 1;                     Assign (i, j)
24. end (initialize).

Note: Lines 11-12 can (and should in an implementation) be combined into one reduction operation.
Proposition 2. i) Procedure initialize maintains dual feasibility (4) and the complementary slackness conditions (5). ii) The parallel computational complexity of procedure initialize is O(n) using n processors.
Proof: See [16].
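A compact serial sketch of the row-and-column-reduction idea behind procedure initialize is given below (NumPy; it performs one greedy pass per row rather than cycling through the row list, so it is a simplification of the pseudo code above, not a literal transcription, and all names are ours):

import numpy as np

def simple_init(C, k):
    """Row/column reduction plus greedy zero-reduced-cost assignments.
    Returns (u, v, r2c, kres): dual variables, column-to-row assignment
    (-1 = unassigned) and residual row capacities."""
    m, n = C.shape
    u = C.min(axis=1)                       # row reduction: u_i = min_j c_ij
    v = (C - u[:, None]).min(axis=0)        # column reduction on the reduced costs
    r2c = np.full(n, -1)
    kres = np.array(k, dtype=int)
    for i in range(m):                      # assign greedily where c_ij - u_i - v_j = 0
        zero = np.flatnonzero((C[i] - u[i] - v == 0) & (r2c == -1))
        take = zero[: kres[i]]
        r2c[take] = i
        kres[i] -= len(take)
    return u, v, r2c, kres

C = np.array([[4, 2, 8], [3, 7, 1]])
print(simple_init(C, [2, 1]))

Both dual feasibility and complementary slackness hold by construction, exactly as stated in Proposition 2 for the full procedure.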
The more sophisticated initialization (which is based on [19]) given in the Appendix has a complexity of O(mn) and works very well, especially for high cost ranges. However, its cost is significantly higher compared to the simple heuristic given above, and only in some cases does it give a smaller total execution time. If modified not to perform the intermediate reduction transfer (see the Appendix), the procedure is slightly faster than the simple initialization; see Section 3. We also tested a parallelized version of the initialization procedure given in [21]. However, it does not perform well (in terms of execution time) since it is adapted to operate on the `transposed' problem (i.e., using m processors).

Procedure augment is used to find an augmenting path from an incompletely assigned row to an unassigned column by a BFS labeling procedure. If no such path is found, the dual variables are updated and the search continues. Finally, when a path to an unassigned column is found, the assignments along the path are reversed, thus obtaining an additional assignment. Procedure augment consists of three intermixed phases: i) BFS search, ii) dual update, and iii) assignment update. During the BFS search, data is collected on the reduced costs of the inadmissible edges (pda), in case it is necessary to update the dual variables in order to find an augmenting path. By marking each node in the BFS search graph with its predecessor node (rpc and cpr), this information can be utilized by the assignment updating phase, which backtracks along the augmenting path and reverses the assignments. The following additional vectors are used for bookkeeping: i) lc, a parallel n-vector, which will act as a masking vector for the set (LC) of labeled (flow-reachable) columns, ii) cs, a parallel n-vector, which is the masking vector for the set of columns to be examined in the BFS column stage, and iii) lr, a parallel m-vector, which will act as a masking vector for the set (LR) of labeled rows.

 1. procedure augment(r);
 2. ∥n  lc ← 0; cs ← 0; pda ← ∞;
 3. ∥m  cpr ← 0; lr ← 0;
 4. insert(r, RS);                                              Start row r into row list
 5. forever                                                     Always exit at line 22
 6.   while (not empty(RS))                                     BFS row stage (6-10)
 7.     i ← pop(RS); lr_i ← 1;                                  Examine row i (and label)
 8.     ∥n  where (X_i) lc ← 1;                                 Label assigned columns
 9.     ∥n  where (C_i − u_i − v < pda and not lc)              min {c_ij − u_i − v_j : j ∉ LC} → pda
10.     ∥n    pda ← C_i − u_i − v; rpc ← i;                     Also update column predecessor
11.     ∥n  where (pda = 0 and not lc)                          Admissible columns from row i
12.     ∥n    cs ← 1; lc ← 1;                                   enter column set and are labeled
13.   if (cs is empty)                                          Dual update (13-18): no admissible columns
14.     ∥n  where (not lc)
15.     Rn    δ ← minval(pda);                                  δ = min {c_ij − u_i − v_j : i ∈ LR, j ∉ LC}
16.     ∥n    pda ← pda − δ;                                    Adjust to the new dual variable values
17.     ∥n  where (lc) v ← v − δ;                               Compare Eqs. (9)-(10)
18.     ∥m  where (lr) u ← u + δ;                               Compare Eqs. (7)-(8)
19.     ∥n  where (pda = 0 and not lc) cs ← 1;                  New admissible columns
20.   Rn  if (cs and r2c = 0 is nonempty)                       Admissible unassigned columns?
21.   Rn    where (cs and r2c = 0) j ← any{1, ..., n};          Column j is
22.         goto REARRANGE;                                     unassigned. Augmenting path!
23.   repeat until (cs is empty)                                BFS column stage (23-27)
24.     Rn  where (cs) j ← any{1, ..., n};                      Examine column j, assigned to row i
25.     i ← r2c_j; cpr_i ← j;                                   Update row predecessor
26.     insert(i, RS);                                          Add row i to row list
27.     ∥n  where (X_i) cs ← 0;                                 Eliminate other columns assigned to row i
28. REARRANGE:                                                  Assignment update (28-34)
29.   repeat until (i = r);                                     Backtrack until start row r is reached
30.     i ← rpc_j;
31.     r2c_j ← i; X_ij ← 1;                                    Assign (i, j)
32.     j ← cpr_i;
33.     if (j > 0) X_ij ← 0;                                    Deassign (i, j)
34.   k̂_r ← k̂_r − 1;
35. end (augment).

Notes: i) a repeat loop is always executed at least once, and the until test is executed last; ii) lines 20-21 can (and should in an implementation) be combined into one reduction operation, as can lines 23-24.

Proposition 3. i) Procedure augment finds an augmenting path from an incompletely assigned row to an unassigned column and reverses the assignments along the augmenting path; hence the number of assignments is increased by one. Dual feasibility (4) and complementary slackness (5) are preserved. ii) The parallel computational complexity of procedure augment is O(m) using n processors.
Proof: See [16].
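To make the claimed O(1) cost per BFS stage concrete, the following serial NumPy sketch emulates the column scan of one BFS row stage (cf. lines 9-12 of procedure augment); each of the n processors would execute one vector position of this code simultaneously. Names and array layout are ours:

import numpy as np

def bfs_row_stage(C, u, v, i, lc, pda, rpc):
    """Scan all columns of row i at once: record reduced costs of the
    not-yet-labeled columns in pda, and label the admissible ones."""
    red = C[i] - u[i] - v                   # reduced costs of row i, all columns at once
    better = (~lc) & (red < pda)
    pda = np.where(better, red, pda)        # candidate dual adjustment per column
    rpc = np.where(better, i, rpc)          # column predecessor
    admissible = (~lc) & (pda == 0)
    lc = lc | admissible                    # newly labeled columns
    return lc, pda, rpc, admissible         # 'admissible' is the new column set cs

C = np.array([[4.0, 2.0, 8.0], [3.0, 7.0, 1.0]])
u = np.array([2.0, 1.0]); v = np.array([0.0, 0.0, 0.0])
lc = np.zeros(3, dtype=bool)
pda = np.full(3, np.inf); rpc = np.full(3, -1)
print(bfs_row_stage(C, u, v, 0, lc, pda, rpc))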
Since the augmenting path algorithm is called at most n times, and since procedure initialize has an O(n) complexity, we can solve an MTOAP with our parallel algorithm in O(mn) time. Thus we have the main result:

Proposition 4. i) The algorithm MTO solves MTOAP correctly, i.e., gives a solution which satisfies the optimality conditions [(1)-(3), (4), (5)]. ii) The total parallel computational complexity of algorithm MTO is O(mn) using n processors.

Proof: See [16].

Note that an algorithm with the more sophisticated initialization and with best-first-search also parallelizes with the same technique, and gives a parallel complexity of O(mn). Pseudo code is available in the Appendix.
3. Numerical experiments

To see how well our algorithm performs in practice and to compare with the theoretical time complexity bounds given in Section 2.3, we have run extensive experiments on dense problems. Sparse problems have not been tested, since our algorithm is designed for dense problems only. An efficient code for sparse problems would require a sparse storage scheme with associated pointers for efficient traversal of the graph. How this should be implemented in parallel is not within the scope of this paper. The tests were run on a Connection Machine CM200 and a MasPar MP-2, which are massively parallel single instruction multiple data (SIMD) systems, suitable for data parallel applications. The systems consist of a front end computer and a parallel processing unit. The CM200 we have used has a Sun SPARC front end and a parallel processing unit consisting of 16K one-bit processing elements (PEs). The CM processing node (PN) consists of two processor chips, each having 16 one-bit PEs with 128K Byte of memory, and a 64-bit floating point unit (FPU). The PNs are connected via a network where each PN has two connections. The 16K CM forms a boolean 9-cube with a total of 512 FPUs and 2G Byte of memory. The MasPar MP-2 architecture is somewhat different. Our test machine has 16K 32-bit processors with 64K Byte of memory per processor, for a total of 1G Byte of memory. No floating point accelerators are available on the MP. The MP processors are arranged in a torus wrapped mesh connected network. The CM and MP support two general mechanisms for communication. The first is nearest neighbor communication, which is the optimal way to send and get data given that one can arrange data accordingly. A router is also available for general purpose communication. The latter should, if possible, be avoided since it is much slower than nearest neighbor communication. Note that in our implementation, no communication of either of the above kinds is necessary. However, a special kind of communication (combined with arithmetic computations), known as a reduction, is used. A reduction is used to get a global result, a scalar, from a vector or a matrix, for instance the maximum, minimum or the sum of all elements. Special hardware and software support is available on both machines for efficient execution of reductions. It is well documented [13], [27] that the MP-2 computes minimum or maximum reductions in constant time. The CM also supports this operation in O(1) time, as pointed out by an anonymous referee. On the CM, the CM Fortran [32] programming language permits usage of a special slicewise mode of operation. It implies that the FPUs perform all computations. The one-bit PEs are not directly involved, other than through the use of the memory attached to them. This mode has been shown to be very advantageous for most applications, and this is what we have used for our implementation. On the MasPar, the MasPar Application Language (MPL) [24], an ANSI C derivative with extensions for data parallel operations, was used. The reason for these choices is simply that the compilers for the languages chosen normally generate the most efficient code for the respective machine.
All tested instances of the many-to-one assignment problem were randomly generated. The (integer) elements of the cost matrix were drawn from a uniform distribution on the interval 0 ≤ c_ij < C, where C is the cost range. The random number generators used were machine dependent. To avoid spurious effects due to randomness, each test result reported is averaged over 5 separate problem runs. The individual differences between these 5 instances were in all cases small. The practical performance of our implementation will certainly depend on the problem sizes m and n, as predicted by the complexity analysis, but it may also depend on the cost range, C, and the differences in the row node capacities, k_i. To examine these dependencies we have run experiments for four different cost ranges and generated the row capacities in two different ways: i) uniform row capacities, and ii) randomly generated row capacities. In the first case the row capacities k_i are n/m, while in the second case they are drawn from a uniform distribution on the interval [1, 2n/m] while keeping Σ k_i = n. For LAP (n = m), k_i = 1, ∀i.
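Such instances are easy to reproduce; a sketch of a generator along these lines is given below (NumPy; the actual generators and seeds used on the CM200 and MP-2 were machine dependent, and the paper does not state how the random capacities were rescaled to sum to n, so the repair step is one simple choice of ours):

import numpy as np

def generate_instance(m, n, cost_range, uniform=True, seed=0):
    """Random dense MTOAP instance: integer costs uniform on [0, cost_range),
    row capacities either all n/m or random on [1, 2n/m] adjusted to sum to n."""
    rng = np.random.default_rng(seed)
    C = rng.integers(0, cost_range, size=(m, n))
    if uniform:
        k = np.full(m, n // m)
    else:
        k = rng.integers(1, 2 * n // m + 1, size=m)
        diff = n - k.sum()
        while diff != 0:                    # repair so that sum(k_i) = n exactly
            i = rng.integers(m)
            if diff > 0:
                k[i] += 1; diff -= 1
            elif k[i] > 1:
                k[i] -= 1; diff += 1
    return C, k

C, k = generate_instance(250, 1000, 1000, uniform=False)
print(C.shape, k.sum())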
3.1. Computational results

As discussed in Section 2.3 there are different varieties of the initialization as well as of the augmenting path procedure. We have implemented three different initialization procedures: i) the simple one displayed in procedure initialize, ii) the sophisticated one based on [19] given in the Appendix, and iii) a direct parallelization of the initialization given in [21]. Together with the two different augmenting path procedures, i) BFS (augment) and ii) BestFS (see the Appendix), this gives a total of 6 different combinations. Here we only report results for the two combinations which performed best. These combinations are denoted MTO1 (simple initialization and breadth-first-search) and MTO6 (sophisticated initialization and best-first-search). When we refer to algorithm MTO, the results are valid for both variants.

The main result from our experiments is that our implementation of the MTO algorithm behaves somewhat better than predicted by the complexity analysis. The running time is sublinear in m and about linear in n. Noting that these are randomly generated problems, this is not surprising, since it is known (see e.g., [29]) that for randomly generated LAPs the sustained running time is smaller than predicted by worst case complexity analysis. This is displayed in Figure 2, where we have plotted the factor by which the time increases when doubling the problem size in one of the variables while keeping the other one fixed. In these plots we have aggregated the results over the different cost ranges. The actual data is from variant MTO1. The empirical running time behavior is displayed in Tables 1 and 2. In Table 2 we have also compared MTO to an algorithm, LAP, for solving pure assignment problems given in [30]. In order to solve MTOAP instances with LAP, the rows must be duplicated to obtain a square assignment problem. Assuming that the LAP algorithm does not have any knowledge about the special structure of the problem, we expect that the time complexity will be O(n²).
Figure 2. Left plot: time increase factor when doubling the number of columns, n; the factors are computed as the time for a problem of size (m = 250, n = 250·2^q) divided by the time for a problem of size (m = 250, n = 250·2^(q-1)), q = 1, ..., 6. Right plot: time increase factor when doubling the number of rows, m; the factors are computed as the time for a problem of size (m = 250·2^q, n = 16000) divided by the time for a problem of size (m = 250·2^(q-1), n = 16000), q = 1, ..., 5. Times are averaged over all cost ranges. Solid line - MP-2 (Uni), dashed line - MP-2 (Rndm), dotted line - CM200 (Uni), dash-dotted line - CM200 (Rndm). (Both panels plot the factor against q; the horizontal line labeled `linear' marks a factor of 2.)
Data is furthermore available for a comparison with the sequential algorithm SAP_SA of [21]. This code has kindly been provided by Professor J. L. Kennington. From Tables 1 and 2 we observe that the LAP algorithm, as predicted, has a practical time complexity of O(n²). Thus the MTO algorithm is an order of magnitude faster than the LAP algorithm when m ≪ n. In addition, the MTO implementation is also more economical with memory space, since no splitting of rows is necessary in the MTO case. (As an example (Table 2), a 250 × 16000 instance could not be solved by LAP due to memory shortage; the size would be 16000 × 16000.) The experiments also uncover differences in performance due to different cost ranges and/or differences in the distribution of row capacities. More detailed statistics on the experiments than presented in the tables show that the more variation we get in the span of the row capacities, the less successful the initialization is. This is true for all variants of the initialization. The following argument might explain this phenomenon. During the row reduction phase we expect (on randomly generated data) to create approximately the same number (n/m) of zeros for each row in the reduced cost matrix. This indicates that the number of assignments obtained is likely to be highest when all capacities equal n/m, which is exactly what we see. In Tables 1 and 2 data is also available for a comparison of the CM and MP-2 implementations of MTO. On the CM200 only 512 64-bit processors are available, as opposed to 16K 32-bit processors on the MP-2. For the problems we have tested this means that no virtual processors (i.e., a physical processor handling multiple data by dividing its memory and serially looping over all slices) are needed on the MP-2, while virtual processors are needed in most cases on the CM200. Thus a strict application of our complexity analysis is only valid for the MP-2.
Table 1. Comparison of MTO implementations for MTOAP instances with increasing number of rows, m. The number of columns is fixed at n = 16000. `Time' - wall clock time in seconds averaged over 5 separate problem runs, `Uni' - uniform row capacities, `Rndm' - randomly generated row capacities. MTO1 - simple init + BFS, MTO6 - sophisticated init + BestFS.

                    MTO1, MP-2         MTO6, MP-2         MTO1, CM200
     m        C      Uni     Rndm       Uni     Rndm       Uni      Rndm
   250      100     2.61    14.95      4.66    10.21     22.25    122.13
           1000     4.84     8.60      4.54     8.17     18.38     47.21
          10000     2.01     8.49      1.83     8.66     11.54     56.40
         100000     2.34    12.92      2.22    13.21     17.59     98.07
   500      100     2.11     6.56      3.34     4.24     19.71     77.21
           1000     8.20    12.07     11.07    12.94     34.63     67.38
          10000     3.17    10.93      3.15    11.49     17.49     68.18
         100000     3.89    17.32      3.90    17.72     27.19    125.64
  1000      100     2.32     4.11      3.55     3.93     20.66     40.27
           1000    13.25    20.08     26.67    22.76     47.62    116.96
          10000     5.55    13.52      5.81    15.39     28.19     85.69
         100000     6.56    21.71      6.86    22.74     45.58    154.25
  2000      100     2.59     3.40      4.52     4.60     21.34     28.10
           1000    19.95    39.80     28.36    45.72     47.74    200.88
          10000     9.28    17.17      9.44    20.88     47.32    104.09
         100000    11.15    27.87     11.47    29.20     73.20    187.06
  4000      100     2.90     3.30      6.01     6.02     23.14     27.44
           1000     7.06    15.57     20.61    34.31     54.32    211.79
          10000    16.69    21.40     18.72    28.25     83.07    128.73
         100000    19.21    32.46     19.78    35.01    123.00    226.24
  8000      100     3.31     3.53      8.74     8.68     25.98     25.65
           1000     6.44     7.70     17.20    20.49     56.17     88.54
          10000    27.53    30.37     34.23    41.45    137.07    253.67
         100000    29.64    39.03     31.23    42.73    197.09    283.86
The most expensive parallel operation in our algorithm is, by a large margin, the global reduction. The reason for this is the cost of the global communication needed in this operation. On the CM200, when virtual processors are used, this communication part of the reduction is constant, while the local reduction scales linearly. However, the communication part will dominate even for large vectors; hence the global reduction on the CM200 also appears to be virtually constant for vectors of small to moderate size. Only for the highest values of n do we observe an increase in time significantly higher on the CM200 than for the corresponding experiments on the MP-2. In conclusion, the MP-2 is consistently faster than the CM200, since many fewer processors are available on the CM and the network used for global reductions is substantially slower on the CM. (This has also been observed in [12], where an AMT DAP-510 was compared to a CM-2.)
Table 2. Comparison of LAP, MTO and SAP_SA implementations for MTOAP instances with increasing number of columns, n. The number of rows is fixed at m = 250. `Time' - wall clock time in seconds averaged over 5 separate problem runs, `Uni' - uniform row capacities, `Rndm' - randomly generated row capacities. MTO1 - simple init + BFS, MTO6 - sophisticated init + BestFS. (For n = 250, m = n and all k_i = 1, so the uniform and random capacity cases coincide.)

                    LAP, MP-2        MTO1, MP-2       MTO6, MP-2       MTO1, CM200       SAP_SA, SS20/61
      n        C     Uni     Rndm     Uni    Rndm      Uni    Rndm      Uni     Rndm       Uni     Rndm
    250      100    0.30    0.30     0.17    0.17     0.18    0.18     0.96     0.96      0.08    0.08
            1000    0.22    0.22     0.22    0.22     0.21    0.21     1.15     1.15      0.08    0.08
           10000    0.19    0.19     0.23    0.23     0.22    0.22     1.56     1.56      0.08    0.08
          100000    0.25    0.25     0.25    0.25     0.24    0.24     1.63     1.63      0.08    0.08
    500      100    1.12    1.39     0.27    0.35     0.28    0.36     1.46     1.76      0.24    0.24
            1000    0.65    0.83     0.30    0.36     0.27    0.33     1.78     2.04      0.30    0.30
           10000    0.69    0.96     0.42    0.51     0.46    0.45     2.41     3.08      0.30    0.46
          100000    0.71    0.97     0.41    0.48     0.37    0.46     2.59     3.44      0.34    0.48
   1000      100    4.51    7.00     0.40    0.82     0.49    0.83     2.09     3.82      0.47    0.86
            1000    1.54    3.27     0.46    0.64     0.41    0.63     2.23     3.51      0.78    0.98
           10000    2.10    4.05     0.48    0.90     0.48    0.87     3.45     5.91      0.99    1.05
          100000    2.25    4.28     0.58    0.98     0.56    0.94     4.19     6.56      0.88    1.39
   2000      100   18.07   30.53     0.53    1.50     0.73    1.39     2.90     7.76      0.58    1.40
            1000    3.88   12.91     0.64    1.16     0.58    1.13     2.88     5.93      1.53    2.63
           10000    4.92   15.56     0.69    1.62     0.69    1.55     4.16    10.00      2.26    3.72
          100000    6.27   16.47     0.83    2.01     0.80    1.95     5.96    13.38      3.17    3.62
   4000      100   74.01  126.73     0.72    3.11     1.35    2.57     5.09    16.65      1.58    4.27
            1000   17.29   49.32     1.13    2.04     1.02    1.96     4.71     9.65      4.14    9.03
           10000   12.67   55.17     0.98    2.96     0.94    2.92     6.00    17.53      5.68    9.22
          100000   14.79   67.65     1.25    3.77     1.20    3.73     8.01    25.21      8.29    9.96
   8000      100  268.36  564.03     1.24    6.49     2.25    5.61     9.95    40.24      5.45   12.50
            1000   73.63  252.40     2.11    3.99     2.00    3.97     7.96    20.15     10.40   16.43
           10000   36.16  200.59     1.40    5.17     1.28    5.32     7.39    28.87     13.85   21.56
          100000   45.67  270.95     1.77    6.89     1.69    6.93    11.51    50.10     18.83   32.99
  16000      100     N/A     N/A     2.61   14.95     4.66   10.21    22.25   122.13     17.08   30.37
            1000     N/A     N/A     4.84    8.60     4.54    8.17    18.38    47.21     33.64   41.07
           10000     N/A     N/A     2.01    8.49     1.83    8.66    11.54    56.40     31.52   66.33
          100000     N/A     N/A     2.34   12.92     2.22   13.21    17.59    98.07     48.55   88.32
To examine the efficiency of our parallel algorithm compared with a well coded sequential algorithm, we ran the same tests with the SAP_SA algorithm of [21]. The authors claim it to be the most efficient sequential algorithm available to date for dense problems, and we have not found any evidence to the contrary. We tested it on a Sun SPARCstation 20/61 workstation with 96M Byte of memory. This machine runs substantially faster than each of the nodes on the MP-2 and CM200. We have measured the CPU speed ratio of the MP-2 and CM200 relative to the SS20/61 for this sort of computation (integer addition and comparisons) to be 130 and 50, respectively, in favor of the SS20/61.
Figure 3. Efficiency of the MTO implementation in comparison with SAP_SA. Efficiency = sequential time / (parallel time × number of processors), scaled so that processor speeds are equal. Times are averaged over all cost ranges. Data for the left plot (many-to-one assignment problem; m = 250, n = 250·2^(q-1), q = 1, ..., 7) are from Table 2: solid line - MP-2 (Uni), dashed line - MP-2 (Rndm), dotted line - CM200 (Uni), dash-dotted line - CM200 (Rndm). Data for the right plot (assignment problem; n = m = 250·2^(q-1), q = 1, ..., 5) are from Table 3: solid line - MP-2, dotted line - CM200. (Both panels plot efficiency against q; the left panel is labeled MTOAP and the right panel LAP.)
As expected, the sequential code executes faster than our data parallel implementations on small problems. In addition to using a much faster processor, the code also incorporates a number of speed enhancing features that cannot be taken advantage of in a simple data parallel implementation. In spite of this, on larger problems the parallel code becomes superior. Furthermore, the largest problems could not be run on the SPARCstation at all due to memory limitations. In most parallel algorithms there remain purely sequential parts. For a fixed size problem the efficiency of the algorithm will therefore decrease as the number of processors increases. This is usually referred to as Amdahl's law [1]. Against this one can argue that the reason for using parallel computers and adding more processors is that one wants to solve larger problems [18]. This has led to an interest in the scalability of parallel algorithms on specific machines. Perfect scalability is usually described as keeping the efficiency constant when increasing the problem size and the number of processors simultaneously [17]. The efficiency of our algorithm is displayed in Figure 3. We see that the efficiency decreases by a factor of 2-4 on the MP-2 when the problem size increases by a factor of 128. This is considered to be highly scalable [17]. Note that the figures for the CM200 do not describe scalability as defined above, because the CM200 runs out of processors at n = 512. Increasing n beyond this leads to more work on each processor on the CM200. Thus the increased efficiency we notice is due to more work on each processor. Finally, we have tested our implementation on pure assignment problems and compared it to the LAP code on the MP-2 and to the SAP_SA code on the SPARCstation.
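For concreteness, the efficiency plotted in Figure 3 can be recomputed from the tabulated times; the sketch below uses one natural reading of the caption's scaling, with the timings taken from Table 2 and the processor count and CPU speed ratio taken from the text (the resulting number is illustrative only and is not claimed to reproduce a particular point of the figure):

# Scaled efficiency: put sequential and parallel processor speeds on a common footing
# by multiplying the sequential time by the measured CPU speed ratio, then divide by
# (parallel time x number of processors).
def scaled_efficiency(t_seq, t_par, processors, speed_ratio):
    return (t_seq * speed_ratio) / (t_par * processors)

# Example: m = 250, n = 16000, C = 100, uniform capacities (Table 2):
# SAP_SA on the SS20/61 needs 17.08 s, MTO1 on the 16K-processor MP-2 needs 2.61 s,
# and the text gives an MP-2 : SS20/61 CPU speed ratio of 130 in favor of the SS20/61.
print(scaled_efficiency(17.08, 2.61, 16 * 1024, 130))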
Table 3. Comparison of LAP, MTO and SAP_SA implementations for LAP instances. Times are wall clock times in seconds. MTO1 - simple init + BFS, MTO6 - sophisticated init + BestFS.

  n = m        C      LAP     MTO1     MTO6     MTO1    SAP_SA
                     MP-2     MP-2     MP-2    CM200   SS20/61
    250      100     0.30     0.17     0.18     0.68      0.08
            1000     0.22     0.22     0.21     0.82      0.06
           10000     0.19     0.23     0.22     1.14      0.07
          100000     0.25     0.25     0.24     1.10      0.06
    500      100     0.55     0.24     0.33     1.21      0.31
            1000     0.46     0.45     0.45     1.63      0.36
           10000     0.63     0.62     0.54     2.67      0.42
          100000     0.62     0.57     0.56     2.88      0.47
   1000      100     0.38     0.38     0.68     1.65      0.59
            1000     1.62     0.99     0.95     3.54      2.06
           10000     1.60     1.29     1.27     5.36      2.36
          100000     1.73     1.57     1.48     7.07      3.09
   2000      100     0.41     0.70     1.45     2.49      2.65
            1000     8.38     3.18     3.21     8.19     12.25
           10000     3.54     2.93     3.01    12.45     12.80
          100000     4.66     4.05     3.58    15.69     16.47
   4000      100     0.58     1.32     3.16     5.55       N/A
            1000    25.99     3.55     8.49    18.24       N/A
           10000     9.46     6.33     6.35    27.18       N/A
          100000    12.67     9.84     8.76    40.49       N/A
   8000      100     1.02     2.43     6.65    10.81       N/A
            1000    15.79     4.07    10.96    27.95       N/A
           10000    24.76    15.50    17.32    58.17       N/A
          100000    27.56    21.83    20.63    98.58       N/A
A comparison is given in Table 3. The major difference between MTO and LAP is that a BFS search is used in MTO and a DFS search in LAP in the augmenting path procedure. Statistics (not presented in the tables) indicate that the BFS search gives significantly shorter augmenting paths (in terms of the number of edges), with about the same amount of work to find a path. As a consequence, the rearrangement of assignments along the path is much faster. In [30], [31] LAP is compared to several other parallel assignment codes, and is shown to be clearly competitive. We also show, in Table 4, results on the LAP collected from various papers. From this we deduce that our implementation of MTO competes with the best dedicated LAP codes on pure assignment problems. Finally, by summarizing the results of the tables for the many-to-one assignment problems and the pure assignment problems, we can compute that, on average, the proposed BFS procedure is about 20% faster than BestFS, and that the sophisticated initialization procedure is about 3% faster than the simple initialization procedure proposed.
Table 4. Collection of results for the pure assignment problem (times in seconds).

  n = m        C      [34]      [2]     [20]     [22]
    500      100      0.55     0.24     0.33     1.21     0.31
            1000      0.46     0.45     0.45     1.63     0.36
           10000      0.63     0.62     0.54     2.67     0.42
          100000      0.62     0.57     0.56     2.88     0.47
   1000      100      0.38     0.38     0.68     1.65     0.59
            1000      1.62     0.99     0.95     3.54     2.06
           10000      1.60     1.29     1.27     5.36     2.36
          100000      1.73     1.57     1.48     7.07     3.09
   2000      100      0.41     0.70     1.45     2.49     2.65
            1000      8.38     3.18     3.21     8.19    12.25
           10000      3.54     2.93     3.01    12.45    12.80
          100000      4.66     4.05     3.58    15.69    16.47
4. Conclusions and further research

In this paper we have devised a data parallel algorithm for MTOAP with a worst case time complexity of O(mn), provided that n processors are available. The theoretical results are supported by empirical studies. For large problems with m ≪ n, the special MTO algorithm is an order of magnitude faster than a corresponding LAP code. The advantage increases with n/m. The efficiency and scalability of the algorithm are also demonstrated. We have also shown that the proposed breadth-first-search procedure is faster than the best-first-search procedure normally employed in sequential augmenting path algorithms. Clearly, the empirical running time depends not only on the problem size, but also on the cost range. Large amplitude in the row capacities k_i also tends to complicate the problem. These dependencies are not captured by the standard worst case complexity analysis. To develop theoretical tools for analyzing and understanding these dependencies is a topic for future research. The data parallel programming model is easily applied to this problem. One only needs to formulate the algorithm in terms of vector operations, and then decide how to distribute data to processors such that operations on all elements of a vector take place simultaneously. We believe that the data parallel approach taken here can be applied to a wider class of network optimization problems, such as the transportation problem, and we intend to do so in future work.
Acknowledgments

This research was in part supported by the Nordic Academy for Advanced Study (J.nr. 93.30.067/00), the Swedish Transport and Communications Research Board (Dnr. 92-128-63) and the Norwegian Research Council (J.nr. 100433/410). We thank the Center for Parallel Computers at the Royal Institute of Technology, Stockholm, Sweden, for the use of their Connection Machine CM200, and Para//ab at the University of Bergen, Bergen, Norway, for the use of their MasPar MP-2. Professor J. L. Kennington kindly provided us with the SAP_SA code of [21]. We thank two anonymous referees for constructive criticism.
Appendix

Pseudo code for the sophisticated initialization and the best-first-search. These are parallelized versions, adapted to MTOAP, of algorithms given in [19], [12]. The procedures as given in [33], [21] would parallelize only over m processors.

 1. procedure initialize_sophisticated();
 2.   Standard column reduction ...                             O(m) + O(n) = O(n) time
 3.   Reduction transfer (not used in our implementation):      O(m²) time
 4.   for i = 1, ..., m
 5.     if (k̂_i = 0)                                            Full row; transfer is possible
 6.       pt ← ∞;                                               Possible transfer: min_{r≠i} {c_rj − u_r − v_j : r2c_j = i}
 7.       for r = 1, ..., m | r ≠ i
 8.         ∥n  where (C_r − u_r − v < pt) pt ← C_r − u_r − v;
 9.       ∥n  where (X_i)
10.       Rn    δ ← minval(pt);
11.       ∥n    v ← v + δ;                                      Columns assigned to row i cheaper to other rows
12.       u_i ← u_i − δ;                                        Row i more expensive to other columns
13.   Augmenting row reduction:                                 O(mn) time
14.   (Several laps could be attempted; two laps in our implementation.)
15.   Put all incompletely assigned rows into a list (RS) ...
16.   while (not empty(RS))
17.     r ← pop(RS); nx ← 0;
18.     repeat until (u1 = u2 or i = 0 or nx ≥ m)               I.e., until no reduction transfer, an unassigned column is found, or the number of path extensions exceeds m (guarantees O(mn) complexity)
19.       ∥n  where (not X_r)
20.       Rn    u1 ← minval(C_r − v);
21.       Rn    where (C_r − v = u1) j1 ← any{1, ..., n};
22.       ∥n  where ({1, ..., n} ≠ j1)
23.       Rn    u2 ← minval(C_r − v);
24.       Rn    where (C_r − v = u2) j2 ← any{1, ..., n};
25.       u_r ← u2;
26.       ∥n  where (X_r)
27.              v ← C_r − u2;                                  Adjust for already assigned columns
28.       if (u1 < u2)
29.         v_j1 ← v_j1 − (u2 − u1);
30.       else if (r2c_j1 > 0) j1 ← j2;
31.       i ← r2c_j1;
32.       if (i > 0)
33.         X_i,j1 ← 0; k̂_i ← k̂_i + 1; nx ← nx + 1;
34.       X_r,j1 ← 1; k̂_r ← k̂_r − 1;
35.       r2c_j1 ← r; r ← i;
36. end (initialize_sophisticated).
 1. procedure augment_best-first-search(r);
 2. ∥n  lc ← 1; cs ← 0; rpc ← r; npv ← 0;                       npv holds the node prices for the columns
 3. ∥m  cpr ← 0; lr ← 0; npu ← 0;                               npu holds the node prices for the rows
 4. lr_r ← 1;
 5. ∥n  where (not X_r)                                         Possible admissible columns
 6. ∥n    lc ← 0; npv ← C_r − u_r − v;                          Compute node prices
 7. Rn  np ← minval(npv);                                       Get best node price
 8. ∥n  where (npv = np)                                        For the columns with the best node price
 9. ∥n    lc ← 1; cs ← 1;                                       Label and put in the column set to search
10. forever                                                     Always exit at line 13
11.   Rn  if (cs and r2c = 0 is nonempty)                       Unassigned columns with the best node price?
12.   Rn    where (cs and r2c = 0) j ← any{1, ..., n};          Column j is
13.         goto REARRANGE;                                     unassigned. Augmenting path!
14.   Rn  where (cs) j ← any{1, ..., n};                        Pick a column from the search set
15.   cs_j ← 0;                                                 Eliminate it from the search set
16.   i ← r2c_j;                                                Get the row assigned to column j
17.   npu_i ← np;                                               Update the row node price
18.   cpr_i ← j; lr_i ← 1;                                      Update the row predecessor and label
19.   ∥n  where (X_i)                                           Columns assigned to row i are
20.   ∥n    cs ← 0; lc ← 1; npv ← np;                           eliminated from the search set, labeled and given the node price
21.   ∥n  where (not lc)                                        Possible admissible columns
22.   ∥n    where (C_i − u_i − v + np < npv)                    Update node prices
23.   ∥n      npv ← C_i − u_i − v + np;                         Decrease if better
24.   ∥n      rpc ← i;                                          Update the column predecessor
25.   ∥n  where (npv = np)                                      Any new columns with the current best node price
26.   ∥n    lc ← 1; cs ← 1;                                     Label and update the search set
27.   Rn  if (cs is empty)                                      Search set empty?
28.   ∥n    where (not lc)                                      Possible admissible columns
29.   Rn      np ← minval(npv);                                 Get new best node price
30.   ∥n    where (npv = np)                                    For the columns with the best node price
31.   ∥n      lc ← 1; cs ← 1;                                   Label and put in the column set to search
32. REARRANGE:                                                  Dual and assignment update
33.   ∥n  where (lc) v ← v + npv − np;                          Update dual variables
34.   ∥m  where (lr) u ← u − npu + np;
35.   repeat until (i = r);                                     Backtrack until the start row r is reached
36.     i ← rpc_j;
37.     r2c_j ← i; X_ij ← 1;                                    Assign (i, j)
38.     j ← cpr_i;
39.     if (j > 0) X_ij ← 0;                                    Deassign (i, j)
40.   k̂_r ← k̂_r − 1;
41. end (augment_best-first-search).

Notes: Lines 11-12 can (and should in an implementation) be combined into one reduction operation. Also (in an implementation) the reduction at line 14 can be avoided as long as the if-reduction at line 27 yields `false'; use some appropriate bookkeeping to keep track of the column.
References

1. G. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, Washington D.C., 1967. Thompson Books.
2. E. Balas, D. Miller, J. Pekny, and P. Toth. A parallel shortest path algorithm for the assignment problem. J. Assoc. Comput. Mach., 38:985-1004, 1991.
3. R. Barr, F. Glover, and D. Klingman. The alternating basis algorithm for the assignment problem. Math. Programming, 13:1-13, 1977.
4. R. Barr, F. Glover, and D. Klingman. A new alternating basis algorithm for semi-assignment networks. In W. White, editor, Computers and Mathematical Programming, pages 223-232. National Bureau of Standards Special Publication, U.S. Government Printing Office, Washington D.C., 1978.
5. D. P. Bertsekas. The auction algorithm: A distributed relaxation method for the assignment problem. Ann. Oper. Res., 14:105-123, 1988.
6. D. P. Bertsekas. The auction algorithm for assignment and other network flow problems: A tutorial. Interfaces, 20:133-149, 1990.
7. D. P. Bertsekas and D. A. Castañón. The auction algorithm for the transportation problem. Ann. Oper. Res., 20:67-96, 1989.
8. D. P. Bertsekas and D. A. Castañón. Parallel synchronous and asynchronous implementations of the auction algorithm. Parallel Comput., 17:707-732, 1991.
9. D. P. Bertsekas and D. A. Castañón. Parallel asynchronous Hungarian methods for the assignment problem. ORSA J. Comput., 5:261-274, 1993.
10. G. Carpaneto, S. Martello, and P. Toth. Algorithms and codes for the assignment problem. Ann. Oper. Res., 13:193-223, 1988.
11. D. A. Castañón. Development of advanced WTA algorithms for parallel processing. Final Report TR-457, ALPHATECH, Inc., 111 Middlesex Turnpike, Burlington, MA 01803, 1989.
12. D. A. Castañón, B. Smith, and A. Wilson. Performance of parallel assignment algorithms on different multiprocessor architectures. Technical Report TP-1245, ALPHATECH, Inc., 111 Middlesex Turnpike, Burlington, MA 01803, 1989.
13. P. Christy. Software to support massively parallel computing on the MasPar MP-1. In Proceedings of IEEE Compcon Spring 1990. IEEE, February 1990.
14. W. Cunningham. A network simplex method. Math. Programming, 11:105-116, 1976.
15. O. Damberg and A. Migdalas. A data parallel space dilation algorithm for the concentrator location problem. In P. M. Pardalos et al., editors, Parallel Processing of Discrete Optimization Problems, volume 22 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 57-80. AMS, 1995.
16. O. Damberg, S. Storøy, and T. Sørevik. A data parallel primal-dual algorithm for the dense linear many-to-one assignment problem. Report LiTH-MAT-R-1994-15, Department of Mathematics, Linköping Institute of Technology, S-581 83 Linköping, Sweden, 1994.
17. A. Gupta and V. Kumar. Performance properties of large scale parallel systems. J. Parallel Distrib. Comput., 19:234-244, 1993.
18. John L. Gustafson. Reevaluating Amdahl's law. Comm. ACM, 31:532-533, 1988.
19. R. Jonker and A. Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38:325-340, 1987.
20. D. N. Kempka, J. L. Kennington, and H. A. Zaki. Performance characteristics of the Jacobi and the Gauss-Seidel versions of the auction algorithm. ORSA J. Comput., 3:92-106, 1991.
21. J. Kennington and Z. Wang. A shortest augmenting path algorithm for the semi-assignment problem. Oper. Res., 40:178-187, 1992.
22. J. L. Kennington and Z. Wang. An empirical analysis of the dense assignment problem: Sequential and parallel implementations. ORSA J. Comput., 3:299-306, 1991.
23. E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, New York, 1976.
24. MasPar Computer Corporation, Sunnyvale, CA. MasPar Parallel Application Language (MPL). Reference Manual. Software Version 3.2, May 1993.
25. D. L. Miller, J. F. Pekny, and G. L. Thompson. Solution of large dense transportation problems using a parallel primal algorithm. Oper. Res. Lett., 9:319-324, 1990.
26. C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Englewood Cliffs, NJ, 1982.
27. L. Prechelt. Measurement of MasPar MP-1216A communication operations. Technical Report 01/93, Institut für Programmstrukturen und Datenorganisation, Fakultät für Informatik, Universität Karlsruhe, 1993.
28. R. T. Rockafellar. Network Flows and Monotropic Optimization. John Wiley & Sons, New York, 1984.
29. T. Sørevik. Average case complexity analysis of algorithms for the linear assignment problem. In Proceedings of NOAS '93, NTH, Trondheim, Norway, 1993.
30. S. Storøy and T. Sørevik. An SIMD, fine-grained, parallel algorithm for the dense linear assignment problem. Report 72, Department of Informatics, University of Bergen, Bergen, Norway, 1992.
31. S. Storøy and T. Sørevik. Massively parallel augmenting path algorithms for the assignment problem. Report, Department of Informatics, University of Bergen, Bergen, Norway, 1994.
32. Thinking Machines Corporation. CM Fortran Language Reference Manual. Version 2.1. Cambridge, MA, January 1994.
33. Z. Wang. The Shortest Augmenting Path Algorithm for Bipartite Network Problems. PhD thesis, Southern Methodist University, Dallas, TX, 1990.
34. J. M. Wein and S. A. Zenios. On the massively parallel solution of the assignment problem. J. Parallel Distrib. Comput., 13:228-236, 1991.
35. S. A. Zenios. Data parallel computing for network-structured optimization problems. Comput. Optim. Appl., 3:199-242, 1994.