A PARALLEL UNSYMMETRIC-PATTERN MULTIFRONTAL METHOD

STEVEN M. HADFIELD† AND TIMOTHY A. DAVIS‡

This project is supported by the National Science Foundation (ASC-9111263, DMS-9223088).
† Department of Mathematical Sciences, US Air Force Academy, Colorado, USA. phone: (719) 472-4470, email: hadfieldsm%dfms%[email protected].
‡ Computer and Information Sciences Department, University of Florida, Gainesville, Florida, USA. phone: (904) 392-1481, email: [email protected].
Technical Report TR-94-028, CIS Dept., Univ. of Florida, August 5, 1994.

Abstract. The sparse LU factorization algorithm by Davis and Duff [4] is the first multifrontal method that relaxes the assumption of a symmetric-pattern matrix. While the algorithm offers significant performance advantages for unsymmetric-pattern matrices, the underlying computational structure changes from a tree (or forest) to a directed acyclic graph. This paper discusses some key issues in the parallel implementation of the unsymmetric-pattern multifrontal method when the pivot sequence is known prior to factorization. The algorithm was implemented on the nCUBE 2 distributed memory multiprocessor, and the achieved performance is reported.

Key words. LU factorization, unsymmetric sparse matrices, multifrontal methods, parallel algorithms

AMS subject classifications. 65F50, 65F05.

1. Introduction. The multifrontal approach to sparse matrix factorization decomposes a sparse matrix into a collection of smaller dense submatrices (frontal matrices) that can partially overlap one another. This overlapping causes data dependencies between the frontal matrices. A computational graph structure is built for the factorization by representing the frontal matrices as nodes and the data dependencies between frontal matrices as edges. Parallelism is available within this computational structure both between independent frontal matrices and within the partial factorization of individual frontal matrices. One or more steps of Gaussian (or Cholesky) factorization of the sparse matrix are done within each frontal matrix. An important advantage of the multifrontal method is the avoidance of indirect addressing within the factorization of the dense frontal matrices.

If the sparse matrix has (or assumes) a symmetric pattern, the resulting computational structure is a tree (or forest). The independence of subtrees allows for efficient scheduling. Prior to the work of Davis and Duff [4], all multifrontal algorithms had assumed a symmetric pattern, and several parallel distributed memory implementations have resulted [10, 14, 15]. When the sparse matrix has a significantly unsymmetric pattern, a more general multifrontal method can be employed which takes advantage of the unsymmetric pattern to reduce the required computations and expose greater inter-node parallelism. This unsymmetric-pattern multifrontal approach was developed by Davis and Duff [4] and has proven to be very competitive, with significant potential parallelism [12]. The unsymmetric-pattern multifrontal approach does, however, result in a computational structure that is a directed acyclic graph (DAG) instead of a tree (or forest).

Figure 1 illustrates the LU factorization of a 7-by-7 matrix. The example matrix is factorized with four frontal matrices. The notation $a_{ij}$ refers to a fully summed entry in a pivot row or column of the active submatrix; $c_{ij}$ refers to an update from previous pivots that is not yet added into $a_{ij}$. An update term $c_{ij}$ is computed via a Schur

complement within a frontal matrix. In Figure 1, the updates $c_{44}$ and $c_{45}$ are passed to the third frontal matrix. All other updates from the first frontal matrix are passed to the second frontal matrix.

\[
A = \begin{pmatrix}
a_{11} & 0      & 0      & a_{14} & a_{15} & 0      & 0      \\
a_{21} & a_{22} & a_{23} & 0      & a_{25} & 0      & 0      \\
a_{31} & a_{32} & a_{33} & 0      & 0      & 0      & a_{37} \\
a_{41} & 0      & 0      & a_{44} & a_{45} & a_{46} & 0      \\
0      & a_{52} & a_{53} & 0      & a_{55} & a_{56} & 0      \\
0      & 0      & 0      & 0      & 0      & a_{66} & a_{67} \\
a_{71} & a_{72} & 0      & 0      & a_{75} & 0      & a_{77}
\end{pmatrix}
\]

\[
\begin{pmatrix}
a_{11} & a_{14} & a_{15} \\
a_{21} & c_{24} & c_{25} \\
a_{31} & c_{34} & c_{35} \\
a_{41} & c_{44} & c_{45} \\
a_{71} & c_{74} & c_{75}
\end{pmatrix}
\quad
\begin{pmatrix}
a_{22} & a_{23} & a_{24} & a_{25} & 0 \\
a_{32} & a_{33} & a_{34} & a_{35} & a_{37} \\
a_{52} & a_{53} & c_{54} & c_{55} & c_{57} \\
a_{72} & a_{73} & c_{74} & c_{75} & c_{77}
\end{pmatrix}
\quad
\begin{pmatrix}
a_{44} & a_{45} & a_{46} & 0 \\
a_{54} & a_{55} & a_{56} & a_{57} \\
a_{74} & a_{75} & c_{76} & c_{77}
\end{pmatrix}
\quad
\begin{pmatrix}
a_{66} & a_{67} \\
a_{76} & a_{77}
\end{pmatrix}
\]

Fig. 1. Multifrontal factorization of a 7-by-7 sparse matrix (the first, second, third, and fourth frontal matrices, left to right).
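To make the update terms concrete, the following is a minimal sketch (plain C, not the authors' code; the numerical values are invented placeholders) of how the first frontal matrix of Figure 1 produces its contribution block. That frontal matrix holds rows 1, 2, 3, 4, 7 and columns 1, 4, 5 of A; after the single pivot step on $a_{11}$, each contribution $c_{ij} = -a_{i1} a_{1j} / a_{11}$ is held until it is assembled (added) into the second or third frontal matrix.

#include <stdio.h>

/* Contribution block of the first frontal matrix of Figure 1:
 * rows 1, 2, 3, 4, 7; columns 1, 4, 5; single pivot a11.
 * The values in F are invented placeholders.                   */
#define NR 5
#define NC 3

int main(void)
{
    int rows[NR] = {1, 2, 3, 4, 7};          /* global row indices    */
    int cols[NC] = {1, 4, 5};                /* global column indices */
    double F[NR][NC] = {
        { 4.0, 1.0, 2.0 },                   /* a11 a14 a15 */
        { 2.0, 0.0, 0.0 },                   /* a21  0   0  */
        { 1.0, 0.0, 0.0 },                   /* a31  0   0  */
        { 3.0, 0.0, 0.0 },                   /* a41  0   0  */
        { 2.0, 0.0, 0.0 },                   /* a71  0   0  */
    };

    double pivot = F[0][0];                  /* a11 */
    for (int i = 1; i < NR; i++) {
        double l = F[i][0] / pivot;          /* multiplier: an entry of column 1 of L    */
        for (int j = 1; j < NC; j++) {
            double c = -l * F[0][j];         /* c_ij, later added into a_ij on assembly  */
            printf("c_%d%d = %g\n", rows[i], cols[j], c);
        }
    }
    return 0;
}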

Scheduling and task-to-processor mapping of this generalized DAG structure is a more challenging problem than in the symmetric-pattern multifrontal method, especially for a distributed memory computer. This paper addresses these issues for the parallel implementation of the unsymmetric-pattern multifrontal method on the nCUBE 2 distributed memory multiprocessor, and reports the achieved performance. In this paper, allocation refers to the number of processors that factorize a particular frontal matrix. Scheduling is an ordering imposed on the execution. Subcube assignment is the designation of a particular set of processors for each frontal matrix. Data assignment is the distribution of the submatrix's entries across the participating processors.

2. Mechanisms. Parallelism is limited to the numerical factorization. Definition of the computational DAG structure and the scheduling, allocation, and assignment are done sequentially in a preprocessing step. The computational DAG structure is determined from UMFPACK, Davis and Duff's unsymmetric-pattern multifrontal method [3]. This structure is called the assembly DAG, and it is used in scheduling, allocation, and assignment.

Once the assembly DAG is defined, we determine the number of processors to allocate to each frontal matrix. Experiments have shown that use of both inter- and intra-frontal matrix parallelism is necessary for peak performance [12, 7, 5]. As inter-frontal matrix parallelism (between independent frontal matrices) is most efficient, it is preferred when available. Intra-frontal matrix parallelism (multiple processors cooperating on a specific frontal matrix) is most useful for larger frontal matrices. The performance characteristics of intra-frontal parallelism can be accurately modeled via an analytical formula [11], which significantly aids both the allocation and scheduling processes.

Allocation of processor sets to specific frontal matrix tasks is done via blocks of tasks.

Specifically, $Q_{eligible}$ is defined as the next block of tasks and contains the next group of ready tasks (tasks that are available to execute because all of their predecessors have completed). The size of this set is limited to the number of processors in the system. Each task in $Q_{eligible}$ has its sequential processing time predicted by analytical formulas. This predicted time is divided by the sum of the predicted times for all tasks in $Q_{eligible}$ and then multiplied by the number of available processors. This gives the frontal matrix task's portion of the available processors, which is rounded down to the next power of two (that is, the next full subcube). Once this is done for all tasks in $Q_{eligible}$, their outgoing dependencies are considered satisfied and new tasks are added to the queue of ready tasks as necessary. The process continues until all frontal matrices have received allocations. The initial set of ready tasks is then reconstructed for scheduling.

We schedule the frontal matrix tasks using a critical path priority scheme. For each frontal matrix, its critical path priority is defined as the weight of the heaviest weighted path from the frontal matrix to any exit node in the assembly DAG. Weights are assigned to frontal matrix tasks based on their sequential execution time estimates.

As each frontal matrix task is scheduled, it is assigned a specific subcube of processors. Subcube assignment is done so as to allow the largest possible number of inter-frontal matrix messages to be eliminated by placing that data on the same processor from which it is sent. We refer to this as overlapping. Management of the subcubes is done using a binary buddy management system [13] in which subcubes are split by a fixed ordering of the hypercube's dimensions. Alternative binary buddy management systems are available that can split across several alternative dimensions, but they were not used here [2, 1]. However, the binary buddy subcube manager is augmented to improve the chances of overlapping data assignments. Specifically, when no subcubes of the requested size are available, all available subcubes of the next larger available size are fragmented. The subcube with the best overlap potential is selected and the unselected subcubes are coalesced. Scheduling of available tasks continues until there is no subcube for the next task or all ready tasks have been scheduled. A simulated run time is maintained using the task execution times predicted by the analytical models. These models predict task execution time for any processor set allocation. When the simulated run time indicates that a task has completed, its subcube is returned to the binary buddy manager and any outgoing dependencies are considered satisfied, with new tasks becoming ready as appropriate.
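To make the allocation step described earlier in this section concrete, the sketch below (plain C, not the authors' code; the task times and processor count are invented, and the minimum of one processor per task is our assumption) splits the available processors among the tasks of $Q_{eligible}$ in proportion to their predicted sequential times, rounding each share down to a power of two, that is, to a full subcube.

#include <stdio.h>

/* Largest power of two not exceeding n (n >= 1). */
static int floor_pow2(int n)
{
    int p = 1;
    while (2 * p <= n)
        p *= 2;
    return p;
}

int main(void)
{
    /* Predicted sequential factorization times for the tasks in Q_eligible
     * (invented values), and the number of available processors.           */
    double t[] = { 12.0, 6.0, 3.0, 1.5 };
    int ntasks = 4;
    int nprocs = 16;

    double total = 0.0;
    for (int k = 0; k < ntasks; k++)
        total += t[k];

    for (int k = 0; k < ntasks; k++) {
        int share = (int)((t[k] / total) * nprocs);   /* proportional share        */
        if (share < 1)
            share = 1;                                /* assumed minimum of one    */
        int subcube = floor_pow2(share);              /* round down to a subcube   */
        printf("task %d: %d-processor subcube\n", k, subcube);
    }
    return 0;
}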
As each task is scheduled and assigned a subcube, its frontal matrix data is assigned to the subcube's processors. All data assignments are column-oriented, and a number of options are implemented. Overlapping tries to assign columns of data passed between frontal matrices to the same processor as they were assigned in the source frontal matrix. This eliminates message passing. When overlapping is not possible, clustering per children is attempted: the columns of data to be sent to the current frontal matrix from a predecessor (child) that are assigned to a single processor in the child are also assigned to a single processor in the current frontal matrix. While the source and destination processors are different, all the data may be passed in a single message and the additional, costly message setups are foregone. Clustering per parents looks forward from the current frontal matrix to its immediate successors. If a number of columns are to be sent to a particular successor, they can be assigned to a single processor in the current frontal matrix to improve the chances for subsequent overlapping and clustering per children.
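The following is a rough sketch of the first two column-assignment preferences just described (plain C; the processor numbers and data structures are hypothetical, and the proportional-share limit enforced by the actual method is omitted). A column keeps its owner from the child frontal matrix when that processor belongs to the current subcube (overlapping); otherwise all columns held by the same child processor are routed to one processor of the current subcube (clustering per children).

#include <stdio.h>

#define NCOLS 6

static int subcube[4] = {4, 5, 6, 7};     /* processors of the current subcube (hypothetical) */

static int in_subcube(int p)
{
    for (int i = 0; i < 4; i++)
        if (subcube[i] == p)
            return 1;
    return 0;
}

int main(void)
{
    int child_owner[NCOLS] = {4, 9, 9, 6, 2, 2};  /* owners in the child frontal matrix (hypothetical) */
    int owner[NCOLS];                             /* assignment in the current frontal matrix          */
    int cluster_map[16];                          /* child processor -> chosen processor in subcube    */
    int next = 0;                                 /* round-robin pointer into the subcube              */

    for (int p = 0; p < 16; p++)
        cluster_map[p] = -1;

    for (int j = 0; j < NCOLS; j++) {
        int cp = child_owner[j];
        if (in_subcube(cp)) {
            owner[j] = cp;                        /* overlapping: no message needed               */
        } else {
            if (cluster_map[cp] == -1)            /* first column seen from this child processor  */
                cluster_map[cp] = subcube[next++ % 4];
            owner[j] = cluster_map[cp];           /* clustering: one message per child processor  */
        }
        printf("column %d -> processor %d\n", j, owner[j]);
    }
    return 0;
}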

We examined all possible combinations of overlapping, clustering per children, and clustering per parents. Each assignment strategy attempts to minimize message-passing overhead. However, the strategies are subject to the constraint that no single processor can be assigned more than its proportional share of the frontal matrix's data. We also examined block and scattered (column-wrap) assignments.

Once scheduling, allocation, and assignment are completed, the resulting schedules and assignments are passed to the processors and the parallel numerical factorization starts. For each frontal matrix, the assigned processors allocate their portion of the frontal matrix's storage, assemble (add) in the original values and the contributions from previous frontal matrices, and then partially factorize the frontal matrix. The partial factorization is done with a column-oriented, pipelined fan-out routine similar to that found in [9], but generalized to allow any type of column assignment scheme. Entries in the Schur complement of each frontal matrix are forwarded to subsequent frontal matrices. Message typing conventions allow these forwarded contributions to be selectively read by the receiving frontal matrices. The portions of the frontal matrix that fall within the L and U factors are retained in separate storage for use by subsequent triangular solves.

3. Results. The parallel numerical factorization was evaluated using the matrices described in Table 1. The table reports the matrix name, order, number of nonzeros, speedup on 8 and 64 processors, and memory scalability on 8 and 64 processors. All these matrices are highly unsymmetric in pattern. GEMAT11 is an electrical power problem [6]; the others are chemical engineering matrices [16]. Speedup was determined by dividing the method's single-processor execution time by its parallel execution time. The memory scalability is defined as $M_1/(p\,M_p)$, where $M_p$ is the largest amount of memory required on any single processor when $p$ processors are used to solve the problem. These results compare favorably with similar results using symmetric-pattern multifrontal methods on matrices of like order [10, 14, 15]. The memory requirements scale extremely well with additional processors, indicating that increasingly larger problems can be solved by using additional processors.
Table 1

Speedup and memory scalability

Matrix      n      nonzeros   Speedup (8)   Speedup (64)   Scalability (8)   Scalability (64)
RDIST1     4134     94408         5.2           20.2             0.69              0.31
EXTR1      2837     11407         4.5           12.2             0.54              0.15
GEMAT11    4929     33108         5.1           16.8             0.50              0.11
RDIST2     3198     56834         4.4           17.3             0.56              0.19
RDIST3A    2398     61896         5.0           17.8             0.67              0.22
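As a worked example of the memory scalability measure (using the Table 1 values for RDIST1, and the definition given above), a 64-processor scalability of 0.31 means that the largest per-processor memory requirement is roughly 5% of the single-processor requirement:

\[
\frac{M_1}{p\,M_p} = 0.31, \quad p = 64
\;\Longrightarrow\;
M_p = \frac{M_1}{64 \times 0.31} \approx 0.050\, M_1 .
\]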

The competitiveness of the new parallel algorithm was evaluated by comparing its single-processor execution time to that of the MA28B algorithm [8] running on a single nCUBE 2 processor. Due to memory limitations, these results could only be obtained for two of the matrices. The results are shown in Table 2, with the parallel unsymmetric-pattern multifrontal code called PRF (run times are in seconds). The speedup of PRF in this table is the MA28B run time divided by the PRF ($p = 64$) run time.

The communication-reducing data assignment features of overlapping, clustering per children, and clustering per parents were effective in reducing the amount of required communication, with reductions of as much as 22% for a 64-processor configuration. However, the resulting irregular distribution of columns had adverse effects on the performance of the partial dense factorization routine.

Table 2

Competitiveness Results

                     Run times (seconds)
Matrix      MA28B   PRF (p = 1)   PRF (p = 64)   Speedup
EXTR1        1.61       2.61           0.24          6.8
GEMAT11      3.10       2.52           0.18         17.0

The partial dense factorization routine typically accounted for 80% to 95% of execution time, while communication between frontal matrices accounted for only 2% to 10% of execution time. Thus, the best performance on the nCUBE 2 was obtained using a strictly scattered assignment for small hypercubes (of dimension 2 or less) and a blocked assignment for the larger hypercubes. The mechanisms we use to reduce communication would be more important on a parallel computer with slower communication (relative to computation speed).

4. Conclusion. We have found that the unsymmetric-pattern multifrontal method of Davis and Duff has significant parallel potential that can be effectively exploited even within a distributed memory environment. The results obtained here are comparable to similar results for distributed memory implementations of symmetric-pattern multifrontal methods.

REFERENCES

[1] S. Al-Bassam and H. El-Rewini, Processor allocation for hypercubes, Journal of Parallel and Distributed Computing, 16 (1992), pp. 394–401.
[2] M.-S. Chen and K. G. Shin, Processor allocation in an n-cube multiprocessor using gray codes, IEEE Transactions on Computers, C-36 (1987), pp. 1396–1407.
[3] T. A. Davis, Users' guide for the unsymmetric-pattern multifrontal package (UMFPACK), Tech. Rep. TR-93-020, Computer and Information Sciences Department, University of Florida, Gainesville, FL, June 1993.
[4] T. A. Davis and I. S. Duff, An unsymmetric-pattern multifrontal method for sparse LU factorization, SIAM J. Matrix Anal. Appl. (submitted March 1993, under revision).
[5] I. S. Duff, Parallel implementation of multifrontal schemes, Parallel Computing, 3 (1986), pp. 193–204.
[6] I. S. Duff, R. G. Grimes, and J. G. Lewis, User's guide for the Harwell-Boeing sparse matrix collection (Release I), Tech. Rep. TR/PA/92/86, Computer Science and Systems Division, Harwell Laboratory, Oxon, U.K., October 1992.
[7] I. S. Duff and L. S. Johnsson, Node orderings and concurrency in structurally-symmetric sparse problems, in Parallel Supercomputing: Methods, Algorithms, and Applications, G. F. Carey, ed., John Wiley and Sons Ltd., New York, NY, 1989, pp. 177–189.
[8] I. S. Duff and J. K. Reid, Some design features of a sparse matrix code, ACM Trans. Math. Softw., 5 (1979), pp. 18–35.
[9] G. A. Geist and M. Heath, Matrix factorization on a hypercube, in Hypercube Multiprocessors 1986, M. Heath, ed., Society for Industrial and Applied Mathematics, Philadelphia, PA, 1986, pp. 161–180.
[10] A. George, M. Heath, J. W.-H. Liu, and E. G.-Y. Ng, Solution of sparse positive definite systems on a hypercube, J. Comput. Appl. Math., 27 (1989), pp. 129–156.
[11] S. Hadfield, On the LU Factorization of Sequences of Identically Structured Sparse Matrices within a Distributed Memory Environment, PhD thesis, University of Florida, Gainesville, FL, April 1994.
[12] S. Hadfield and T. Davis, Potential and achievable parallelism in the unsymmetric-pattern multifrontal LU factorization method for sparse matrices, in Fifth SIAM Conference on Applied Linear Algebra, 1994.
[13] K. C. Knowlton, A fast storage allocator, Communications of the ACM, 8 (1965), pp. 623–625.
[14] R. Lucas, T. Blank, and J. Tiemann, A parallel solution method for large sparse systems of equations, IEEE Transactions on Computer-Aided Design, CAD-6 (1987), pp. 981–991.

[15] A. Pothen and C. Sun, A mapping algorithm for parallel sparse Cholesky factorization, SIAM J. Sci. Comput., 14 (1993), pp. 1253–1257.
[16] S. E. Zitney and M. A. Stadtherr, Supercomputing strategies for the design and analysis of complex separation systems, Ind. Eng. Chem. Res., 32 (1993), pp. 604–612.
