Performance of a Fully Parallel Sparse Solver

Michael T. Heath, Department of Computer Science and NCSA, University of Illinois, 1304 West Springfield Ave., Urbana, IL 61801, e-mail: [email protected]

Padma Raghavan, Department of Computer Science, University of Tennessee, 107 Ayres Hall, Knoxville, TN 37996, e-mail: [email protected]

March 12, 1996

This research was supported by the Advanced Research Projects Agency through the Army Research Office under contract number DAAL03-91-C-0047.

Abstract

The performance of a fully parallel direct solver for large sparse symmetric positive definite systems of linear equations is demonstrated. The solver is designed for distributed-memory, message-passing parallel computer systems. All phases of the computation, including symbolic processing as well as numeric factorization and triangular solution, are performed in parallel. A parallel Cartesian nested dissection algorithm is used to compute a fill-reducing ordering for the matrix and an appropriate partitioning of the problem across the processors. The separator
tree resulting from nested dissection is used to identify and exploit large-grain parallelism in the remaining steps of the computation. The parallel performance of the solver is reported for a series of test problems on the Thinking Machines CM-5 and the Intel Touchstone Delta. The parallel efficiency, scalability, and absolute performance of the solver, as well as the relative importance of the various phases of the computation, are investigated empirically.

Introduction

Background

Large sparse systems of linear equations arise in many areas of computational science and engineering, particularly in the solution of partial differential equations, as in elliptic boundary value problems and implicit methods for time-dependent problems. The solution of sparse linear systems consumes the dominant portion of computing time in many applications, and is thus an obvious target for a parallel implementation. Unfortunately, in adapting conventional computer codes to parallel computers, the solution of sparse linear systems has often proved to be a major bottleneck in attaining good parallel performance. There are numerous reasons for this disappointing performance, including the irregular structure of many sparse problems, the relative complexity of sparse data structures (which often employ indirect addressing), and the relatively small amount
of computation (compared to that for dense matrices) over which to amortize the communication necessary in a parallel implementation. Trying to overcome this bottleneck was the main motivation for the development of the distributed sparse solver whose performance we report here. The solver was developed as part of ARPA's Scalable Parallel Libraries initiative, whose goal is to provide prototype scalable library routines for massively parallel processing (MPP) roughly comparable to the mathematical subroutine libraries available on conventional computers. While our personal focus has been on direct methods for solving sparse linear systems, other participants in this multi-site project have been working on iterative methods, and their results will be reported elsewhere. In this paper we report only on our experience with symmetric positive definite (SPD) systems associated with an underlying geometry, such as those from finite-element and finite-difference methods. Our related results on direct methods for nonsymmetric and nonsquare sparse systems are reported elsewhere (Raghavan 1995a). The relevant matrix factorization for solving an SPD linear system Ax = b directly is Cholesky factorization, A = LL^T, where L is a lower triangular matrix. The Cholesky factor L can be used to compute the solution x by forward and back substitution, respectively, in the triangular systems Ly = b and L^T x = y. For sparse systems, Cholesky factorization may incur fill, that is, matrix entries that are zero in A may become nonzero in L. The amount of such fill is strongly affected by the ordering of the rows and columns of the matrix,
so a judicious choice of ordering is needed to limit fill, and thereby reduce the computational resources (time and memory) required to solve the system. A number of effective ordering heuristics are known, including minimum degree and nested dissection. To summarize, the standard approach to solving sparse SPD systems by Cholesky factorization involves four distinct steps (George and Liu 1981):

1. Ordering, in which the rows and columns of the matrix are reordered so that the Cholesky factor suffers relatively little fill.

2. Symbolic factorization, in which all fill is anticipated and data structures are allocated in advance to accommodate it.

3. Numeric factorization, in which the numeric entries of the Cholesky factor are computed.

4. Triangular solution, in which the solution is computed by forward and backward substitution.

Note that the first two steps involve no floating-point computation. Throughout this paper, we use the term step to refer to one of the four steps listed above; these are further divided into substeps that we call phases.
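For reference, the following self-contained sketch (ours, not taken from CAPSS) carries out dense Cholesky factorization and the two triangular solves for a tiny 3 x 3 system, purely to fix the notation A = LL^T, Ly = b, L^T x = y; it deliberately ignores sparsity, fill, and ordering, which are the subject of the remainder of this paper.

#include <math.h>
#include <stdio.h>

#define N 3

/* Factor the symmetric positive definite matrix A in place: on exit the
 * lower triangle of A holds the Cholesky factor L. */
void cholesky(double A[N][N]) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < j; k++)
            for (int i = j; i < N; i++)
                A[i][j] -= A[i][k] * A[j][k];
        A[j][j] = sqrt(A[j][j]);
        for (int i = j + 1; i < N; i++)
            A[i][j] /= A[j][j];
    }
}

/* Solve L*y = b (forward substitution) and then L^T*x = y (back substitution). */
void solve(const double L[N][N], const double b[N], double x[N]) {
    double y[N];
    for (int i = 0; i < N; i++) {          /* forward substitution */
        y[i] = b[i];
        for (int j = 0; j < i; j++) y[i] -= L[i][j] * y[j];
        y[i] /= L[i][i];
    }
    for (int i = N - 1; i >= 0; i--) {     /* back substitution */
        x[i] = y[i];
        for (int j = i + 1; j < N; j++) x[i] -= L[j][i] * x[j];
        x[i] /= L[i][i];
    }
}

int main(void) {
    double A[N][N] = {{4, 2, 0}, {2, 5, 2}, {0, 2, 5}};
    double b[N] = {6, 9, 7}, x[N];
    cholesky(A);
    solve(A, b, x);
    printf("x = %g %g %g\n", x[0], x[1], x[2]);   /* expect 1 1 1 */
    return 0;
}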

Relation to Previous Work

Due to their obvious importance, there has been a great deal of research effort on parallel methods for solving sparse linear systems. Most of this work up to
1991 is surveyed by Heath, Ng, and Peyton (Heath et al. 1991), and there have also been numerous important accomplishments subsequently. Much of this work has concentrated on the numeric factorization, since it is usually the most computationally intensive step on conventional architectures. More recently, more attention has been paid to the ordering step, resulting in a number of ordering algorithms intended to produce orderings that both limit fill and enhance parallelism in the subsequent factorization. In attempting to produce a fully parallel, prototype scalable solver for sparse systems, we have borrowed freely from previous work, but we have also tried to bear in mind a guiding principle of simplicity. Accordingly, we have developed a simple framework into which techniques of varying levels of sophistication can be incorporated. In many cases, the more advanced ideas are still the subject of largely theoretical study and have not been implemented in practical, publicly available codes (particularly not in parallel codes), nor have they been integrated effectively with the other steps of the overall computation. Some of the design decisions we have made in the interest of keeping our implementation task to a manageable size are the following:

• The ordering technique we use exploits an embedding of the problem in Euclidean space to compute small line or plane separators in parallel, subject to a balance constraint (Heath and Raghavan 1995). This ordering technique is suitable for linear systems associated with geometric information, such as those from finite-element and finite-difference methods.
Other recent ordering methods, including spectral (Hendrickson et al. 1992; Pothen et al. 1990), geometric (Miller et al. 1991; Vavasis 1991), and combinatorial (Bui and Jones 1993) methods, have great potential, but these are not yet generally available in parallel implementations. Another approach is Rothberg's parallel implementation of the traditional Multiple Minimum Degree ordering (Rothberg 1994).

• Our symbolic and numeric factorization algorithms are based on the simple concept of a separator tree, forgoing more precise characterizations of fill, such as that provided by clique trees (Peyton 1986; Pothen and Sun 1991).

• In the numeric factorization, we use a one-dimensional, column-oriented data mapping rather than a theoretically more scalable two-dimensional, submatrix-oriented assignment (Rothberg 1993; Schreiber 1992).

• In assigning columns to processors, we use an equal weighting of the subtrees in the separator tree (see the sketch below), rather than making the processor subsets proportional in size to the subtrees or to the total arithmetic work in each subtree, or other such refinements that would require further communication to redistribute data (Pothen and Sun 1993).

In each case, we have opted for a relatively simple strategy that can be implemented with a reasonable amount of effort and makes maximum use of established technology. We view all of these issues as potential opportunities for further improvement of our code.
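As a small illustration of the equal-weighting choice in the last item above, the sketch below (ours, not CAPSS code) splits a range of processors evenly between the two subtrees at every level of dissection, regardless of the relative size or work of the subtrees.

#include <stdio.h>

/* Assign the processor range [lo, hi] to a subtree of the separator tree,
 * splitting the range evenly between the two children at every level.
 * Recursion stops when a single processor remains; that processor then
 * owns one local subgraph. */
void assign(int lo, int hi, int level) {
    printf("%*slevel %d: processors %d..%d\n", 2 * level, "", level, lo, hi);
    if (lo == hi) return;
    int mid = (lo + hi) / 2;
    assign(lo, mid, level + 1);       /* first subtree  */
    assign(mid + 1, hi, level + 1);   /* second subtree */
}

int main(void) {
    assign(0, 7, 0);                  /* P = 8 processors */
    return 0;
}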

Our main goal was to develop a prototype, not a definitive package; a feasibility study, not the last word on the subject. Our near-term goal was to develop a fully parallel solver that is scalable to the machines of today, having a few hundred processors. We wanted a package that could be used now to begin experimenting with adapting applications codes that require a sparse solver to current MPP architectures. The result of this effort is a package we call CAPSS (CArtesian Parallel Sparse Solver), which is designed to solve sparse SPD systems on distributed-memory, message-passing parallel computer systems, performing all phases of the computation, including symbolic processing as well as numeric factorization and triangular solution, in parallel.

Overview

We present the results of an experimental study of the performance of CAPSS. We examine empirically its parallel efficiency, scalability, and absolute performance. Our computational experiments were performed on a Thinking Machines CM-5 and an Intel Delta. Since there is no universally agreed upon definition of scalability, we consider a number of relevant measures of performance as the number of processors varies. We focus particularly on the relative importance of the various steps of the computation and how their proportions of the overall computation change with the number of processors, which was not possible in past studies that have largely concentrated on only one or two steps of the computation.

Given the inherent difficulty of the problem and the relatively poor ratio of communication speed to computation speed of current machines, we did not expect to attain ideal scalability, but we did hope to see evidence of a useful degree of scalability on current machines, and perhaps trends that would encourage more sophisticated implementations in the future. At the very least we have demonstrated that obvious serial bottlenecks can be avoided, such as relegating the ordering and other symbolic processing to a front-end host or other serial machine, as has often been done in the past.

Algorithms

In this section we present a brief outline of the algorithms used in the sparse solver. We provide only the minimum details needed to understand the performance issues involved; a detailed discussion of the algorithms can be found in papers by Heath and Raghavan (Heath and Raghavan 1993, 1995). Our overall approach is simply to parallelize each of the four main steps in the standard sequential method outlined earlier. Each step is broken into a distributed phase that requires cooperation and communication among processors, and a local phase in which the processors operate independently on separate portions of the problem. The sequence of steps and phases is illustrated in Figure 1.

Ordering

To compute the ordering in parallel is problematic because most distributed parallel algorithms depend on data locality for efficiency, but determining an appropriate locality-preserving mapping of the problem onto the processors is essentially equivalent to the ordering problem itself. For this reason, we took a geometric approach, which we call Cartesian nested dissection (CND), that is based on coordinate information for the underlying graph of the matrix, in which nodes represent variables (unknowns) and edges represent nonzero entries in the matrix. Such coordinate information is generally available for practical problems from finite element and finite difference methods. However, such information may not be available for linear systems from other problem domains such as circuit simulation or linear programming. Coordinate bisection, in which nodes are partitioned according to their geographic locations, has been used in a number of related contexts in distributed computing, such as load balancing and mesh partitioning (see, e.g., Berger et al. 1987; Vaughan 1991; Williams 1991). In our case, we seek to determine a separator (a set of nodes whose removal splits the problem into two disconnected subgraphs) that not only results in roughly equal subgraphs (to maintain good load balance) but also minimizes the size of the separator (to reduce fill and communication requirements). To allow the necessary flexibility in seeking a small separator, we introduce a user-selected balance parameter, α, which controls the minimum allowable
size of the subgraphs, as a proportion of the total, resulting from coordinate bisection. For example, α = 1/2 would insist on an exact balance, while α = 1/3 would allow one subgraph to be at most twice the size of the other. In our performance tests we have generally used a value of α = 0.4, except for square grid problems, where we took α = 0.49 in order to force the algorithm to find theoretically ideal separators. A discussion of the effectiveness of CND in limiting fill can be found in our earlier papers (Heath and Raghavan 1995; Raghavan 1993). This coordinate-based separator algorithm is applied repeatedly to further subdivide the resulting subgraphs until there are as many subgraphs as processors. At each level of nested dissection, the algorithm chooses the smallest possible separator consistent with the balance constraint imposed by the given value of α. This requires counting and searching operations in each coordinate dimension for a series of trial coordinate values, and these operations are carried out in a distributed parallel manner using global communication operations analogous to parallel prefix. Initially, the problem can be mapped onto the distributed processor memories essentially arbitrarily. Once the dissection process has produced P subgraphs, where P is the number of processors, the problem data can then be redistributed so that each of the P lowest level subgraphs determined by dissection is assigned entirely to a separate processor (and hence is termed a local subgraph). After such redistribution, the ordering process can continue
with no further interprocessor communication until all nodes have been numbered. This local phase can continue to employ the CND ordering algorithm (as we have done in the tests reported here) or can use any sequential ordering algorithm desired. To summarize, the ordering step is broken into three phases:

1. A distributed phase, which we denote by d-order, in which processors communicate and collaborate in identifying separators that split the graph into P pieces.

2. A redistribution phase, which we denote by redist, in which the problem data are rearranged so that each subgraph identified in the first phase is mapped onto a separate processor.

3. A local phase, which we denote by l-order, in which each processor completes the ordering of the local subgraph assigned to it.

The first two of these phases require global communication among processors, while the third requires no communication.
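The toy routine below illustrates, in serial form and for a single coordinate direction, the kind of search just described: trial cut coordinates are examined, cuts that leave either side with fewer than a fraction α of the nodes are rejected, and the smallest remaining separator is kept. Treating the nodes that lie exactly on the cut line as the separator, and the small grid used in main, are simplifications of ours; the actual algorithm searches every coordinate dimension and performs the counting with parallel prefix style global communication.

#include <stdio.h>

#define ALPHA 0.4   /* balance parameter, as in the text */

typedef struct { int x, y; } node;

/* Return the best cut coordinate in x, or -1 if no balanced cut exists. */
int best_cut(const node *v, int n, int xmin, int xmax) {
    int best_c = -1, best_sep = n + 1;
    for (int c = xmin; c <= xmax; c++) {
        int left = 0, sep = 0, right = 0;
        for (int i = 0; i < n; i++) {
            if (v[i].x < c) left++;
            else if (v[i].x == c) sep++;
            else right++;
        }
        if (left < ALPHA * n || right < ALPHA * n) continue;  /* unbalanced */
        if (sep < best_sep) { best_sep = sep; best_c = c; }
    }
    return best_c;
}

int main(void) {
    node v[15];                       /* a 5 x 3 grid of nodes */
    int n = 0;
    for (int x = 0; x < 5; x++)
        for (int y = 0; y < 3; y++)
            v[n].x = x, v[n].y = y, n++;
    printf("cut at x = %d\n", best_cut(v, n, 0, 4));  /* expect x = 2 */
    return 0;
}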

Symbolic Factorization

In conventional sparse matrix codes, the symbolic factorization step is designed to anticipate all fill and allocate the necessary data structures to accommodate it. In a sequential setting, it is relatively easy to compute the fill exactly, based on the corresponding elimination tree, whose structure reflects precisely the
column dependences in the Cholesky factor. In a distributed parallel setting, however, a significant amount of communication would be required to compute the fill exactly, so instead we have opted to use the simpler structure given by the separator tree, which results naturally from the nested dissection process. In effect, we assume that the submatrix corresponding to each separator is dense, which may result in an overestimate of the fill. In our experience, however, the additional storage required is rarely excessive in practice, and is more than offset by the ease with which the corresponding data structures can be allocated and manipulated in a parallel setting. This point is illustrated in our numerical experiments. During symbolic factorization, storage is estimated for the columns of the Cholesky factor. Columns corresponding to the nodes in a given local subgraph are assigned to the processor owning that subgraph, while columns corresponding to separator nodes are mapped cyclically among the subset of processors owning columns corresponding to nodes that are connected to the given separator. Thus, the symbolic factorization step consists of two separate phases, local and distributed, but because of the use of the separator tree that is a direct byproduct of the nested dissection ordering step, the distributed phase is particularly simple, requiring very little communication or execution time. The local phase may be somewhat more substantial, but is still small compared to other steps of the computation. Since the total time for symbolic factorization is relatively small, and the distributed portion is imperceptible when plotted on
the scale of the overall computation, in the graphical presentation of our results we show only the overall time for symbolic factorization, which we denote by sfact.
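The fragment below is a bare-bones rendition of the column-to-processor map just described, under simplifying assumptions of ours (a contiguous processor range per separator and a plain cyclic deal); it is not the CAPSS data structure.

#include <stdio.h>

/* Owner of a column from the local subgraph held by processor p. */
int owner_local(int p) { return p; }

/* Owner of the k-th column of a separator whose underlying subgraphs are
 * owned by the (assumed contiguous) processors p_lo .. p_hi: the separator
 * columns are wrapped cyclically over that subset. */
int owner_separator(int k, int p_lo, int p_hi) {
    int nprocs = p_hi - p_lo + 1;
    return p_lo + k % nprocs;
}

int main(void) {
    /* For example, columns 0..5 of a top-level separator over processors
     * 0..3 land on processors 0, 1, 2, 3, 0, 1. */
    for (int k = 0; k < 6; k++)
        printf("separator column %d -> processor %d\n",
               k, owner_separator(k, 0, 3));
    return 0;
}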

Numeric Factorization

The numeric factorization step is essentially a multifrontal implementation of sparse Cholesky factorization (Duff and Reid 1983; Liu 1992; Lucas 1987), which starts at the leaves of the separator tree, merging portions of the corresponding dense submatrices and propagating the resulting update information upward to higher levels in the tree. Initially the computation is entirely local, as each processor factors the matrix columns corresponding to the internal nodes of the subgraph that it has been assigned, with no dependence on any columns owned by other processors. We denote this local phase by l-nfact. Eventually interprocessor communication is required in order to factor matrix columns corresponding to separator nodes, which depend on data from multiple processors. For factoring the dense submatrices corresponding to each separator, we use a distributed fan-in algorithm (Geist and Heath 1986). An explicit wrap mapping prior to applying the dense numeric kernel would yield a good load balance, but would require an expensive redistribution of data. Instead, we weave the redistribution into the computation by wrap mapping the successive targets of the fan-ins among the processors involved, which effectively yields a wrap mapping (desirable for the subsequent triangular solution
as well) without any additional communication. In this manner, update information from different processors is incorporated as the computation moves up the separator tree, involving an expanding hierarchy of processor subsets until reaching the root (i.e., the highest level separator), at which point all processors cooperate in the computation. We denote this distributed phase by d-nfact.
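To make the fan-in concrete, here is a heavily simplified fragment for a single separator column. MPI is used purely for illustration and may differ from the message-passing layer of the actual code; the column index, the contribution values, and the omission of all multifrontal bookkeeping are ours.

#include <mpi.h>
#include <stdio.h>

#define LEN 4   /* length of the column at and below the diagonal */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int j = 5;                      /* index of one separator column      */
    int owner = j % nprocs;         /* wrap mapping of the fan-in targets */

    double contrib[LEN], col[LEN];  /* this rank's update contribution    */
    for (int i = 0; i < LEN; i++)
        contrib[i] = 0.01 * rank;   /* placeholder values                 */

    /* The fan-in: contributions from every rank are summed onto the owner
     * of column j, which would then complete (factor) the column. */
    MPI_Reduce(contrib, col, LEN, MPI_DOUBLE, MPI_SUM, owner, MPI_COMM_WORLD);

    if (rank == owner)
        printf("rank %d owns column %d and applies the summed update\n",
               rank, j);

    MPI_Finalize();
    return 0;
}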

Triangular Solution

The triangular solution has a structure similar to that of the numeric factorization, again corresponding to a traversal of the separator tree. There are two major differences, however: there is far less computation, and there are two separate substeps in the triangular solution, namely forward and back substitution. Like the numeric factorization, the forward substitution begins with purely local computation on the local subgraphs, and then proceeds up the separator tree from leaves to root, with a fan-in algorithm (Heath and Romine 1988) again providing the relevant dense kernel. The forward substitution can be largely overlapped with the factorization step (assuming that the right-hand side of the equation is known in advance), in which case its execution time becomes essentially invisible compared to that of the factorization. The back substitution process is just the opposite, beginning at the root and proceeding downward toward the leaves of the separator tree, using a fan-out algorithm (Heath and Romine 1988) as the dense kernel. Thus, for the back substitution, the distributed phase precedes the final local phase that completes
the computation of the solution. Because the individual phases of the triangular solution are relatively small and difficult to distinguish, we report only the overall time for the entire process, which we denote by solve.
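The toy program below is a serial rendition of the two sweeps, using a made-up unit lower triangular factor whose only off-diagonal entries connect each separator-tree node to its parent; in the real solver each tree node corresponds to a block of columns and the sweeps are distributed using the fan-in and fan-out kernels cited above.

#include <stdio.h>

#define NNODES 7
/* A small separator tree: nodes 0-3 are the leaf (local) subgraphs, nodes 4
 * and 5 are lower-level separators, and node 6 is the root separator. */
static const int parent[NNODES] = {4, 4, 5, 5, 6, 6, -1};

int main(void) {
    /* Toy factor L: unit diagonal plus an entry of 0.5 in row parent[v],
     * column v, for every non-root node v (lower triangular because
     * children are numbered before parents). */
    double b[NNODES] = {1, 1, 1, 1, 1, 1, 1}, y[NNODES], x[NNODES];

    /* Forward substitution (L y = b), leaves toward root: when node v
     * updates its parent, all updates into v have already arrived. */
    for (int v = 0; v < NNODES; v++) y[v] = b[v];
    for (int v = 0; v < NNODES; v++)
        if (parent[v] >= 0) y[parent[v]] -= 0.5 * y[v];

    /* Back substitution (L^T x = y), root toward leaves: each node needs
     * only the already-computed solution value of its parent. */
    for (int v = NNODES - 1; v >= 0; v--) {
        x[v] = y[v];
        if (parent[v] >= 0) x[v] -= 0.5 * x[parent[v]];
    }

    for (int v = 0; v < NNODES; v++) printf("x[%d] = %g\n", v, x[v]);
    return 0;
}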

Figure 1: The sequence of steps (and phases) in parallel direct solution. [In the figure: Ordering consists of a distributed phase (d_order), a redistribution phase (redist), and a local phase (l_order); Symbolic Factorization of a local phase and a distributed phase (s_fact); Numeric Factorization of a local phase (l_nfact) and a distributed phase (d_nfact); and Triangular Solution (solve) of forward substitution with local then distributed phases, followed by backward substitution with distributed then local phases.]

Testing Environment

In this section we describe the software and hardware environments and the test problems used in our empirical performance study.

Computational Environments

The code implementing the above algorithms is written in C, using double-precision floating-point arithmetic for the numeric parts of the computation. We use an explicitly parallel, message-passing MIMD programming style suitable for most distributed-memory parallel computers. We used this code for all our experiments except those reported in Table 4. Our experiments were run on Thinking Machines CM-5 and Intel Touchstone Delta machines. Our code is easily adaptable to additional platforms within this basic computational paradigm, including networked clusters of workstations. The code can be obtained from netlib or by ftp from ftp.cs.utk.edu (directory pub/padma/CAPSS). Although the two machines we used support the same basic computational paradigm, they differ in significant ways. The Intel Touchstone Delta we used has 512 processors, each with 16 Mbytes of memory and a mesh interconnection network. The theoretical peak double-precision floating-point rate per processor is about 60 Mflops for the Delta, but is very difficult to achieve in practice. Compiled code typically achieves perhaps 10% of peak speed on the Delta, while assembler code may reach 50% or so of peak. The CM-5 we used has 512 nodes, each with 32 Mbytes of memory, interconnected by a fat tree communication network. Each node of the CM-5 is a Sparc processor augmented by four vector units. The vector units were designed for use with data-parallel SIMD programs. Use of the vector units is not feasible for message-passing MIMD programs because adapting to their different memory organization would require significant
code development in an assembler-like language. Thus, the experiments we ran on the CM-5 used only the floating-point unit of the Sparc processor on each node, which has a peak performance of roughly 5 Mflops (the peak with vector units is 128 Mflops per node). Another important difference between the two machines is their ratios of communication speed to computation speed. The Delta processors have faster
floating-point speed than the Sparc processor of the CM-5, but the communication speed of the Delta is slower than that of the CM-5. Thus, the CM-5 (without vector units) is a better balanced machine for which it is easier to attain good parallel efficiency, but it is often slower overall than the Delta for the same number of processors. This effect is further amplified by the difference in memory capacity: with twice the memory per node, the CM-5 is capable of running much larger problems for a given number of processors, which provides a more favorable computational granularity, thereby tending to enhance parallel efficiency. We give test results for both machines, but because of its greater memory capacity, the CM-5 is capable of solving a wider range of problems on a broader range of numbers of processors.

Test Problems

The test problems that we used came from a variety of sources, including some that we created ourselves using a commercial finite element package, PATRAN.

Our collection of test problems is far too large to give all of our results here, so we present results for a representative sample of test problems described in Table 1, which indicates the size of each problem (i.e., the number of equations and unknowns), the number of nonzeros in half of the symmetric matrix, and the dimension of the embedding of the problem (i.e., the number of Cartesian coordinates). In the last two columns of Table 1, we provide the number of nonzeros in the factor and the total number of operations to compute the factor; these correspond to the balance factors used in our performance tests. The problems in the table have been divided into small, medium, and large size categories, for use with varying ranges of numbers of processors. All problems except those labeled gxxx are two- and three-dimensional finite-element problems; gxxx, where xxx = k, denotes the k × k model grid problem. Using a test suite such as the Harwell-Boeing collection is not a viable option because geometric information is not available; although many of the Harwell-Boeing systems are from finite-element and finite-difference methods, the associated coordinate information has not been preserved.

Performance of Sparse Solver

We report in this section on a series of experimental tests of the performance of the parallel sparse solver. We consider various measures of efficiency and scalability, the relative importance of each step in the overall computation as the number of processors varies, and the absolute performance of the solver.

Table 1: Description of problems used in performance testing; the last two columns contain the number of nonzeros in the factor and the number of operations to compute the factor.

Label         Order    Nonzeros   Dim.   Nonzeros in L   Operations
                                           (thousands)   (millions)
hammond       4,720      18,442     2             152           11
barth4dual   11,451      28,331     2             209           11
shuttle      10,429      57,014     3             391           29
barth5       15,606      61,484     2             566           59
sphere6      16,386      65,538     3             698          103
gsq1         17,443      68,797     2             653          123
gl1          17,320      68,831     2             831          198
kall3        10,556      86,665     3           1,623          486
vaughan      29,681     111,476     3           4,500        1,817
g300         90,000     269,400     2           3,366          490
brack        62,631     429,190     3          12,370        5,589
flap         51,537     531,157     3          22,049       13,627
g600        360,000   1,078,800     2          16,206        4,124

Relative Cost of Steps

The relative costs of the various steps depend strongly on the architecture on which they are implemented. The numeric factorization has the greatest computational complexity, but the high degree of optimization of floating-point computation in some environments may make it relatively less expensive compared to the symbolic steps, which often involve relatively inefficient list traversals and indirection. In addition, in our approach the ordering step includes a redistribution phase that benefits all subsequent steps. Figures 2 through 5 show the execution times for two medium-sized problems from Table 1 on the CM-5 and the Intel Delta. The number of processors varies from 8 to 128 on the CM-5 and from 16 to 128 on the Intel Delta. The execution time is broken down into the incremental costs of the various phases, with the total given by the upper curve. The phases are plotted in the same order in which they occur in the algorithm, so each curve shows the cumulative execution time through the given phase. A number of trends in these graphs can be noted. For the problem vaughan, the numeric factorization step is still dominant, but for g300, the symbolic steps are comparable in cost to the numeric factorization. All of the steps (except solve) generally decline as the number of processors increases, but the local phases tend to decline more rapidly than the corresponding distributed phases, since the proportion of a fixed problem done in local mode decreases as the number of processors increases. Although it still forms a relatively small portion
of the total time, the triangular solution is an obvious potential bottleneck in moving to larger numbers of processors, and more sophisticated algorithms than the simple column-oriented fan-out and fan-in algorithms that we have used for this step are clearly needed. A new approach to improving the efficiency of the triangular solution step is discussed in a recent report (Raghavan 1995b).

Figure 2: Execution time for problem vaughan on CM-5.

Figure 3: Execution time for problem vaughan on Intel Delta.

Comparing the two machines for the same problems, we see that the overall behavior is roughly similar, but the distributed phases are relatively more costly on the Intel Delta than the corresponding local phases, due to the less favorable communication-to-computation ratio. For example, the cost of the distributed ordering phase for problem g300 decreases on the CM-5 but increases slightly on the Intel Delta with the number of processors, and the triangular solution is even more obviously a potential problem on the Intel Delta. We also note that for this implementation in C, the Intel Delta is roughly three to four times faster than the CM-5 for the same number of processors. This advantage diminishes as the number of processors grows, as the superior processor speed of the Delta is offset by inferior parallel efficiency.

Parallel Efficiency

We have seen that the overall execution time for the solver declines as more processors are used, but is this decline rapid enough to use the additional processors cost-effectively? To consider this question, we next turn to the parallel efficiency of the solver as the number of processors varies. There are several different notions of parallel speedup and efficiency, depending on how the size of the problem scales with the number of processors. For a fixed problem, speedup is given by S = T1/Tp, where T1 is the execution time on one processor and Tp is the execution time on p processors. Efficiency is then defined as E = S/p = T1/(p Tp). As noted by Amdahl (Amdahl 1967), such
Figure 4: Execution time for problem g300 on CM-5.

Figure 5: Execution time for problem g300 on Intel Delta.

fixed-problem scaling almost inevitably yields poor efficiency, as communication and other overhead becomes increasingly significant as the number of processors grows, since there is not a proportionate increase in computation over which to amortize it. Moreover, for a distributed-memory parallel computer, this definition of speedup and efficiency limits one to problems that are small enough to fit in the memory of a single processor. Among our test problems, only the first eight in Table 1 can be solved on a single processor of the CM-5; Figures 6, 8, and 10 show the efficiency based on conventional fixed-problem speedup for these problems. Figures 7, 9, and 11 show the efficiency for the first two problems on the Intel Delta. Figures 6 and 7 include all of the symbolic processing combined, Figures 8 and 9 include both the local and distributed phases of the numeric factorization, and Figures 10 and 11 combine all of the steps, including triangular solution. As expected, the efficiency drops off fairly rapidly as these small problems are spread ever more thinly across increasingly many processors.
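For concreteness, the trivial fragment below evaluates the definitions of S and E above for one pair of timings; the numbers are invented for illustration only.

#include <stdio.h>

int main(void) {
    double t1 = 128.0;   /* hypothetical time on 1 processor (seconds)  */
    double tp = 8.0;     /* hypothetical time on p processors (seconds) */
    int p = 32;
    double speedup = t1 / tp;          /* S = T1/Tp           */
    double efficiency = speedup / p;   /* E = S/p = T1/(p Tp) */
    printf("S = %.1f, E = %.2f\n", speedup, efficiency);  /* S = 16.0, E = 0.50 */
    return 0;
}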

Figure 6: Efficiency of symbolic processing on CM-5.

Figure 7: Efficiency of symbolic processing on Intel Delta.

Figure 8: Efficiency of numeric factorization on CM-5.

Figure 9: Efficiency of numeric factorization on Intel Delta.

Scalability

We turn now to an examination of the scalability of the sparse solver. The inevitable decline in efficiency that we have observed for a fixed problem as the number of processors grows is not necessarily cause for alarm, since larger computers are normally used to solve larger problems. There is general agreement that the problem being solved should grow with the number of processors, but there is some diversity of opinion concerning how rapidly the problem should scale up with the number of processors. The rate of problem growth can be characterized by keeping some quantity constant as the number of processors varies. Some plausible invariants are:

• total problem size (Amdahl 1967)

• work per processor (Gustafson 1988)

• total execution time (Worley 1990)

• memory per processor (Sun 1993)

• efficiency (Grama et al. 1993)

• computational error (e.g., discretization error) (Singh et al. 1993)

While a fixed problem size is generally too restrictive, keeping the amount of memory used per processor constant as the number of processors grows often allows the problem to grow at an impractically high rate, since the amount of
1:0 e f f i c i e n c y

0:8 0:6 0:4 0:2 0:0

1:0 e f f i c i e n c y

0:8 0:6 0:4 0:2 0:0

.... ......... .............. ................ ..................... ............ ..... ................ ....... .................. ......... .................. .......... ................... .......... ............... .......... .................... ............. ......... ................ .......... .................. ......... ................. .................................. ............ .... .......................................... ........... ... ........................................... ........... .... ........................... ............ ... ............ ............ .... .............. ............... .... .... ........... ............ .... .......... .... ......... ............... ..... .... .......... ... ......... ...... .... ......... ... .......... ....... .... ...... . . . .... ... ........... ....... ..... .... ...... ... .............. ....... ...... .... ... ...................... ....... ...... . . .... ... ....... .................... ....... ...... .... ... .............................. ....... ...... ........................... ...... .... ...... ... . . . . . . . . . . . . . . . . ...... ..................... ...... ...... .... ...... ...................... ....... . . .... ...... ..................... ....... ........................... ...... .... ...... ................................... ...... ....... .... ...... ................................... ....... . ... ...... .. . . .. ... ...... ..... ....... ................................................................................. .......... ....... .................. ......... ........................ ...... ...... ...... .............. ......... ...................... ...... ...... .................. ........ . ... ...... ...... ..................... ............ ....................................... ...... ...... ......... ........... ................ .......... . . . . . . . . . . . ...... ...... .. ............................. ............ .. ....... ...... .................... .............. ........... .................................... ...... ............... .......................... .......... .......... ...... ...... ................................................... .......... .......... . ...... ........................................................ ........... ........... ........... ...... ........ ..................................... ......... ......... ... ...... ............................ .......... ......... ......... ...... . . . . . . . . . . . . . . ............. .................................... ... ..................................................................................... ................................ ................. ................... ................. ... .................. .................. .................. ................ ................ ....

1

2

4

8

16

sphere6 kall3 barth4dual gl1 gsq1 barth5 hammond shuttle

32

number of processors Figure 10: Eciency of overall computation on CM-5. ... ... ...... ........ ........ ...... ....... ...... ........ ....... ...... ....... ...... ....... ....... ......... ... ..... ... ... ... ... ... .... .... ... .... ..... ... ..... ... ... ... ..... ... ....... ...... ... ...... .... ...... ...... ...... ...... ...... ....... ...... ...... ...... ...... ...... ...... ...... ..... ........ ...... ........... ...... ............ ....... ........... ...... ........... ...... ............ ...... .......... ....... ....... ........ ....... ......... ........ ......... ........ ....... ........ ......... ....... ......... ....... ........ ........ ....... ....... ........ ........ ........ ....... ....... ........ ....... ....... ........ ........ ....... ....... ........ ....... ....... ..... ........ ....... ........ ....... ....... ........ ........ ........ ..

[Figure 11: Efficiency of overall computation on the Intel Delta for the test problems barth4dual and hammond, on 1 to 32 processors.]


computation often grows faster than linearly with the amount of memory, and hence the total execution time may grow unacceptably large even though the efficiency may be very high. A reasonable compromise between these extremes is to solve as large a problem as possible subject to a fixed limit on the total execution time. A closely related criterion is to maintain a constant amount of work per processor, in which case a perfectly scalable algorithm should maintain a fixed execution time. In the case of solving sparse linear systems, the amount of work per processor is relatively easy to control, so we fix this quantity as our problem scaling criterion. We need a family of problems of similar structure whose size, and resulting work, is parameterizable. A convenient class of problems for this purpose is the regular k × k square grid, which we denote by gxxx, where xxx = k. We have already seen g300 and g600 in Table 1. Figures 12 and 13 show the execution time (again showing the incremental costs of individual phases) for a series of square grids on the CM-5 and Intel Delta with the number of processors varying from 1 to 128. The grid size k for each number of processors was chosen so that the amount of work per processor in the numeric factorization is approximately the same in each case. Another way of saying this is that the grids were chosen so that when the number of processors doubles, the total amount of work in the factorization doubles. The hope, then, is that the total execution time will remain constant, which should be the case if the proportion of parallel overhead does not grow with the number of processors.
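To make the problem-scaling rule concrete, the sketch below picks grid sizes for a doubling processor count under the standard assumption that nested-dissection factorization of a k × k grid costs on the order of k^3 operations; the constant cancels, so k_p ≈ k_1 · p^(1/3). This is only an illustrative approximation (the grid sizes used in the experiments were evidently chosen from more precise work estimates), with the base size k_1 = 132 taken from Figure 12.

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double k1 = 132.0;                /* grid size used on 1 processor */
    for (int p = 1; p <= 128; p *= 2) {
        /* total work should be p times the 1-processor work; with
           flops(k) ~ c*k^3 this gives k_p = k_1 * p^(1/3)                   */
        double kp = k1 * cbrt((double) p);
        printf("p = %4d  ->  k approximately %4.0f\n", p, kp);
    }
    return 0;
}

Running this gives sizes close to, though not identical with, the 132-to-624 sequence used in the experiments.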

[Figure 12: Execution time for a series of grid problems on the CM-5. Grid sizes 132, 164, 205, 256, 320, 400, 500, and 624 on 1 to 128 processors; the stacked curves show cumulative time through the d-order, redist, l-order, sfact, l-nfact, d-nfact, and solve phases, with seconds on the vertical axis and number of processors on the horizontal axis.]
Let T1 be the execution time on one processor and Tp the execution time on p processors to solve a scaled problem with total work p times that of the problem used for a single processor. Ideally, we would want Tp to be the same as T1, in which case the scaled efficiency Ep = T1/Tp = 1. Consider the graph in Figure 12 for the CM-5. We see that the overall execution time (upper curve) is fairly flat, but appears to be trending upward as we reach 128 processors, so the solver obviously falls short of being perfectly scalable. Using the reported times to compute scaled efficiencies, we observe that E2 = 0.79, which increases to E16 = 0.94 and then drops to E128 = 0.83. By considering the individual phases, we can see some of the reasons for this behavior. The execution time degrades significantly in going from 1 to 2 processors, primarily due to the redistribution phase that is required in the parallel algorithm. (If we had assumed, as many authors have, that a locality-preserving mapping of the problem to the processors were available from the outset, making the redistribution unnecessary, then the parallel efficiencies we report in this and previous sections would be significantly higher.) We observe further that as the number of processors increases, execution times for the distributed phases grow at a slightly faster rate than those for the corresponding local phases decline, yielding an overall rise in execution time. The scalability of the code is affected by the communication to computation ratio; recall that the CM-5 is a more balanced machine, with a lower communication to computation ratio, than the Intel Delta.

Figure 13 shows performance on the Intel Delta for 1 to 128 processors (we were unable to run g164 on two processors because of memory limitations). One can immediately see that the rise in execution time is sharper because of the larger communication to computation ratio. The scaled efficiencies are E4 = 0.91, E16 = 0.91, and E128 = 0.58, with the largest growth coming from the factorization and triangular solution. The latter adds significantly to the total time, and the scaled efficiencies excluding triangular solution (corresponding to the line labeled `d-nfact') are higher, at E4 = 0.90, E16 = 0.91, and E128 = 0.67.
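The scaled efficiencies quoted above are simply ratios of total execution times for the scaled runs, Ep = T1/Tp. A minimal sketch, using placeholder timings since the exact times plotted in Figure 12 are not tabulated:

#include <stdio.h>

int main(void)
{
    /* Placeholder timings for scaled runs; these are NOT the measured
       CM-5 times, which appear only graphically in Figure 12.          */
    const int    procs[]  = { 1, 2, 16, 128 };
    const double time_s[] = { 20.0, 25.3, 21.3, 24.1 };
    const int n = (int)(sizeof procs / sizeof procs[0]);

    for (int i = 1; i < n; i++)
        printf("E_%d = %.2f\n", procs[i], time_s[0] / time_s[i]);  /* Ep = T1/Tp */
    return 0;
}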


[Figure 13: Execution time for a series of grid problems on the Intel Delta. Grid sizes 132, 205, 256, 320, 400, 500, and 624 on 1, 4, 8, 16, 32, 64, and 128 processors; the stacked curves show cumulative time through the d-order, redist, l-order, sfact, l-nfact, d-nfact, and solve phases, with seconds on the vertical axis and number of processors on the horizontal axis.]


The growth in numeric factorization times stems in large part from our use of a one-dimensional column-to-processor mapping within each dense submatrix factorization. With more than 128 processors, the degradation in efficiency is even larger, as demonstrated by the results in Table 2, which for consistency with the previous graphs shows cumulative times through each phase. For larger numbers of processors, we expect that the numeric factorization time will be substantially improved by using a two-dimensional mapping for each dense partial factorization. However, this is not likely to resolve the growth in triangular solution time, which remains a serious problem.

Table 2: Execution time on 128 to 512 processors of the Intel Delta (in seconds, cumulative through each phase).

processors    grid    d-order    redist    l-order    sfact    l-nfact    d-nfact    solve
       128     624       2.6       4.8        5.6       6.0        8.7       29.6     35.9
       256     777       5.2       8.4        9.4      10.4       13.3       53.2     67.0
       512     958       9.2      12.8       13.3      13.7       14.8      165.3    202.0
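Since Table 2 reports times cumulatively through each phase (to match the stacked curves in Figures 12 and 13), the cost of any single phase is the difference between adjacent entries. The following sketch recovers the per-phase times for the 128-processor row:

#include <stdio.h>

int main(void)
{
    /* Cumulative times (seconds) from the 128-processor, g624 row of Table 2. */
    const char  *phase[] = { "d-order", "redist", "l-order", "sfact",
                             "l-nfact", "d-nfact", "solve" };
    const double cum[]   = { 2.6, 4.8, 5.6, 6.0, 8.7, 29.6, 35.9 };
    const int n = (int)(sizeof cum / sizeof cum[0]);

    double prev = 0.0;
    for (int i = 0; i < n; i++) {
        printf("%-8s %5.1f s\n", phase[i], cum[i] - prev);  /* incremental cost */
        prev = cum[i];
    }
    return 0;
}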

To examine scalability in greater detail, we focus on the most time-consuming step, numeric factorization. For any parallel algorithm, the two principal factors inhibiting efficiency are computational load imbalance and communication overhead. Table 3 gives the computational load (in millions of floating-point arithmetic operations) and communication flux (in thousands of floating-point numbers sent) for the numeric factorization, including both distributed and local phases. Both the average across all processors and the maximum on any one processor are shown.

Table 3: Computational load and communication flux for series of grid problems.

number    grid    computational load (millions)    communication flux (thousands)    avg. flux/
 proc.    size      avg.    max.    ratio             avg.     max.    ratio         avg. ops
     1     132        37      37     1.00                0        0     1.00             0.0
     2     164        37      37     1.01                7        7     1.01             0.2
     4     205        37      37     1.01               39       40     1.00             1.1
     8     256        37      43     1.15              121      122     1.01             3.3
    16     320        37      46     1.24              276      293     1.06             7.5
    32     400        37      47     1.27              539      610     1.13            14.6
    64     500        37      49     1.32              965     1118     1.16            26.2
   128     624        37      47     1.26             1629     1876     1.15            43.9
   256     777        37      48     1.29             2642     3120     1.18            71.3
   512     958        37      47     1.26             4148     4816     1.16           112.1

We see from Table 3 that the average computational load is the same for any number of processors, which reflects how the problem sizes were chosen, but the maximum computational load rises as the number of processors increases, yielding a growing load imbalance. Such behavior is an almost inevitable consequence of our simple strategy of statically mapping work to processors based on the separator tree produced by nested dissection, since subgraph size (which is what we control) does not necessarily correlate perfectly with the work required to compute the corresponding columns of the factor matrix. Moreover, an imbalance at any level of dissection propagates through (and may be amplified by) any further levels of dissection, so that the load imbalance tends to worsen as the number of processors grows. There is also some imbalance in the communication flux, but it grows more slowly than the computational load imbalance. A more important factor is that the average communication flux grows with the number of processors while the average computational load remains fixed. The last column of Table 3 shows the number of floating-point numbers sent per thousand floating-point arithmetic operations, indicating that the algorithm incurs increasing communication overhead relative to computational work, resulting in a drop in efficiency as the number of processors grows.

Despite this apparent lack of scalability, we nevertheless find these results encouraging, since there is significant room for improvement in our implementation. In particular, the communication overhead could be reduced by using a two-dimensional matrix partitioning, and the computational load balance could be improved through more sophisticated mapping strategies (there is a tradeoff, however, since achieving a better load balance may require additional communication). Moreover, even in its present form the solver is able to maintain a reasonably constant execution time for appropriately scaled problems over a fairly wide range of processor counts.
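The last column of Table 3 follows directly from the tabulated averages: dividing the average flux (thousands of numbers sent) by the average load (millions of operations) gives numbers sent per thousand operations. A short sketch over a few of the rows:

#include <stdio.h>

int main(void)
{
    /* Selected rows of Table 3: average load (millions of operations) and
       average flux (thousands of numbers sent) in the numeric factorization. */
    const int    procs[]  = { 2, 16, 128, 512 };
    const double load_M[] = { 37.0, 37.0, 37.0, 37.0 };
    const double flux_k[] = { 7.0, 276.0, 1629.0, 4148.0 };
    const int n = (int)(sizeof procs / sizeof procs[0]);

    for (int i = 0; i < n; i++)
        printf("p = %3d : %6.1f numbers sent per 1000 operations\n",
               procs[i], flux_k[i] / load_M[i]);
    return 0;
}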


Absolute Performance

For reasons discussed earlier, the sparse solver cannot attain a substantial fraction of the peak performance of either the CM-5 or the Intel Delta through C programming alone, and thus its absolute performance is unimpressive. As a test of its absolute performance potential, we adapted the solver to use single-precision, i860 assembler-coded level-1 BLAS as the dense matrix kernels in the numeric factorization. This version was used to solve the three largest problems in Table 1 using 64 processors on the Intel Delta, and the results are presented in Table 4. To illustrate the performance gain from using the tuned assembler-coded BLAS1 routines, we also provide the times for the C single- and double-precision versions. For the tuned code, we observe a maximum aggregate floating-point rate of almost 0.5 Gflops. For comparison, the amount of arithmetic work in these sparse problems is similar to that required for dense Cholesky factorization of a matrix of order 3000 to 6000, for which the execution rate on the same machine is less than 1 Gflops (Demmel et al. 1993, page 328) using a two-dimensional block data mapping and double-precision BLAS2. We note that if the scalability experiment corresponding to Figure 13 is performed using this version of the code, the plot remains roughly the same. The execution time is decreased, but the scaled efficiency is similar to that for the double-precision C code. In other words, if Tpa (Tpb) denotes the numeric phase time on 2 ≤ p ≤ 128 processors using the tuned code (C double-precision code), then Tpb > Tpa, but the ratios Tpa/T1a and Tpb/T1b are nearly equal.


Much of the execution speed of the sparse code is due to large-grained task parallelism, in which the processors work independently on different subtrees. The fine-grained data parallelism, in which multiple processors cooperate in solving dense subproblems, involves dense matrices of order at most a few hundred for these test problems. Better utilization of the memory hierarchy, as well as greater scalability, should be possible through the use of higher-level BLAS for these dense kernels.

Table 4: Execution time using single-precision assembler BLAS1 on 64 processors of the Intel Delta (times in seconds, rate in Mflops).

label    order    redist    sfact                  nfact               solve    total
                                                  time      rate
brack     9.19     2.19      0.40    Asm-single   15.12      369        3.70    30.60
                                     C-single     46.79      119        3.70
                                     C-double     57.82       97        3.80
flap      9.53     2.31      0.42    Asm-single   24.78      549       10.90    47.94
                                     C-single    128.31      107       10.91
                                     C-double    158.22       86       10.96
g600      3.94     3.24      0.77    Asm-single    8.26      499        5.03    21.24
                                     C-single     25.40      163        5.22
                                     C-double     31.60      130        5.25
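As a consistency check on Table 4, the total column for the Asm-single rows equals the sum of the order, redist, sfact, nfact, and solve times, and the product of the nfact time and Mflops rate gives an implied operation count for the factorization (the flop counts below are inferences from the table, not figures reported in the paper):

#include <stdio.h>

int main(void)
{
    /* Asm-single rows of Table 4 (seconds; rate in Mflops).  The implied
       flop counts are inferred from time * rate and are not reported
       in the paper.                                                     */
    struct row { const char *label;
                 double order, redist, sfact, nfact, rate_mflops, solve; };
    const struct row r[] = {
        { "brack", 9.19, 2.19, 0.40, 15.12, 369.0,  3.70 },
        { "flap",  9.53, 2.31, 0.42, 24.78, 549.0, 10.90 },
        { "g600",  3.94, 3.24, 0.77,  8.26, 499.0,  5.03 },
    };
    const int n = (int)(sizeof r / sizeof r[0]);

    for (int i = 0; i < n; i++) {
        double total = r[i].order + r[i].redist + r[i].sfact
                     + r[i].nfact + r[i].solve;
        double gflop = r[i].nfact * r[i].rate_mflops / 1000.0;
        printf("%-6s total = %6.2f s, implied nfact work = %5.1f Gflop\n",
               r[i].label, total, gflop);
    }
    return 0;
}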

Conclusions and Future Work

We conclude from this study that it is indeed possible to develop an effective, fully parallel sparse direct solver for distributed-memory, message-passing multicomputers. Our solver was shown to achieve reasonable efficiency for sufficiently large problems using up to 128 processors. Not unexpectedly, the solver falls short of the desired scalability, but a number of algorithmic improvements are already evident that should enable us to come closer to this goal in the future. Even in its current state, the solver is capable of solving a useful range of problems on today's distributed-memory machines, and will therefore permit us to experiment with fully parallel, distributed solutions of large-scale applications that require a sparse direct solver.

In future work we expect to incorporate a number of the potential improvements mentioned earlier, including two-dimensional partitioning of the dense frontal matrices and more sophisticated data mapping and load balancing strategies. We also plan to compare the efficiency and effectiveness of our parallel Cartesian nested dissection ordering with those of the other recent ordering methods cited earlier, as soon as parallel implementations of these become generally available. In addition, we are developing an alternative parallel ordering algorithm that does not require coordinate information and hence will complement the CND ordering used in the present study.

Acknowledgements

Our computational experiments were performed on the Thinking Machines CM-5 at the National Center for Supercomputing Applications at the University of Illinois, and on the Intel Touchstone Delta operated by the California Institute of Technology on behalf of the Concurrent Supercomputing Consortium.

References

Amdahl, G. M. 1967. Validity of the single processor approach to achieving large-scale computing capabilities. Proc. AFIPS, 30:483–485.

Berger, M. J., and Bokhari, S. H. 1987. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Trans. Comput., C-36:570–580.

Bui, T. N., and Jones, C. 1993. A heuristic for reducing fill-in in sparse matrix factorization. In R. Sincovec, D. Keyes, M. Leuze, L. Petzold, and D. Reed, editors, Sixth SIAM Conference on Parallel Processing for Scientific Computing, pages 445–452, Philadelphia, PA, SIAM Publications.

Demmel, J., Dongarra, J., van de Geijn, R., and Walker, D. 1993. LAPACK for distributed memory architectures: The next generation. In R. Sincovec, D. Keyes, M. Leuze, L. Petzold, and D. Reed, editors, Sixth SIAM Conference on Parallel Processing for Scientific Computing, pages 323–329, Philadelphia, PA, SIAM Publications.

Duff, I. S., and Reid, J. K. 1983. The multifrontal solution of indefinite sparse symmetric linear equations. ACM Trans. Math. Software, 9:302–325.


Grama, A., Gupta, A., and Kumar, V. 1993. Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Parallel Distrib. Tech., 1(3):12–21.

Geist, G. A., and Heath, M. T. 1986. Matrix factorization on a hypercube multiprocessor. In M. T. Heath, editor, Hypercube Multiprocessors, Philadelphia, PA, SIAM Publications.

George, J. A., and Liu, J. W-H. 1981. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall Inc., Englewood Cliffs, NJ.

Gustafson, J. L. 1988. Reevaluating Amdahl's law. Comm. Assoc. Comput. Mach., 31:532–533.

Hendrickson, B., and Leland, R. 1992. An improved spectral graph partitioning algorithm for mapping parallel computations. Technical Report SAND92-1460, Sandia National Laboratories, Albuquerque, NM 87185.

Heath, M. T., Ng, E., and Peyton, B. W. 1991. Parallel algorithms for sparse linear systems. SIAM Review, 33:420–460.


Heath, M. T., and Romine, C. H. 1988. Parallel solution of triangular systems on distributed-memory multiprocessors. SIAM J. Sci. Stat. Comput., 9:558–588.

Heath, M. T., and Raghavan, P. 1993. Distributed solution of sparse linear systems. Technical Report UIUCDCS-R-93-1793, Department of Computer Science, University of Illinois, Urbana, IL 61801, February 1993.

Heath, M. T., and Raghavan, P. 1995. A Cartesian parallel nested dissection algorithm. SIAM J. Matrix Anal. Appl., 16(1):235–253.

Lucas, R., Blank, W., and Tieman, J. 1987. A parallel solution method for large sparse systems of equations. IEEE Trans. Comput. Aided Design, CAD-6:981–991.

Liu, J. W-H. 1992. The multifrontal method for sparse matrix solution: theory and practice. SIAM Review, 34:82–109.

Miller, G. L., Teng, S., and Vavasis, S. A. 1991. A unified geometric approach to graph separators. In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science, pages 538–547. IEEE.

Peyton, B. W. 1986. Some applications of clique trees to the solution of sparse linear systems. PhD thesis, Department of Mathematical Sciences, Clemson University, Clemson, SC.

Pothen, A., and Sun, C. 1991. A distributed multifrontal algorithm using clique trees. Technical Report CS-91-24, Dept. of Computer Science, Pennsylvania State University, University Park, PA 16802.

Pothen, A., and Sun, C. 1993. A mapping algorithm for parallel sparse Cholesky factorization. SIAM J. Sci. Comput., 14:1253–1257.

Pothen, A., Simon, H. D., and Liou, K.-P. 1990. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl., 11:430–452.

Raghavan, P. 1995a. Distributed sparse Gaussian elimination and orthogonal factorization. SIAM J. Sci. Comput., 16:1462–1477.

Raghavan, P. 1995b. Efficient parallel triangular solution with selective inversion. Technical Report CS-95-314, Department of Computer Science, University of Tennessee, Knoxville, TN 37996.

Raghavan, P. 1993. Line and plane separators. Technical Report UIUCDCS-R-93-1794, Department of Computer Science, University of Illinois, Urbana, IL 61801.

Rothberg, E. 1994. A parallel implementation of the multiple minimum degree heuristic. Presentation, Fifth SIAM Conference on Applied Linear Algebra, Snowbird, Utah.

Rothberg, E. 1993. Performance of panel and block approaches to sparse Cholesky factorization on the iPSC/860 and Paragon multiprocessors. Technical report, Intel Supercomputer Systems Division, 14924 N. W. Greenbrier Parkway, Beaverton, OR 97006.

Schreiber, R. 1993. Scalability of sparse direct solvers. In A. George, J. Gilbert, and J. Liu, editors, Graph Theory and Sparse Matrix Computation, pages 191–209. Springer-Verlag.

Singh, J. P., Hennessy, J. L., and Gupta, A. 1993. Scaling parallel programs for multiprocessors: methodology and examples. IEEE Computer, 26(7):42–50.

Sun, X. H., and Ni, L. M. 1993. Scalable problems and memory-bound speedup. J. Parallel Distrib. Comput., 19:27–37.

Vaughan, C. T. 1991. Structural analysis on massively parallel computers. Comput. Systems Engrg., 2:261–267.

Vavasis, S. A. 1991. Automatic domain partitioning in three dimensions. SIAM J. Sci. Stat. Comput., 12:950–970.

Williams, R. D. 1991. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency: Practice and Experience, 3:457–481.

Worley, P. H. 1990. The effect of time constraints on scaled speedup. SIAM J. Sci. Stat. Comput., 11:838–858.