
A High Performance Version of Parallel LAPACK: Preliminary Report

Peter Strazdins, E-mail: [email protected], Australian National University.
Hari Koesmanro, E-mail: [email protected], Australian National University.

Department of Computer Science, Australian National University, Acton, ACT 2600, AUSTRALIA
Phone: +61 6 249 5140; Fax: +61 6 249 0010

Abstract

Dense linear algebra computations require the technique of `block-partitioned algorithms' for their efficient implementation on memory-hierarchy multiprocessors. Most existing studies and libraries for this purpose, for example ScaLAPACK, assume that the block or panel width ω for these algorithms must be the same as the matrix distribution block size r. We present a project in progress to extend ScaLAPACK using the `distributed panels' technique, i.e. to allow ω > r, which has the twofold advantages of improving performance for memory-hierarchy multiprocessors and yielding a simplified user interface. A key element of the project is a general Distributed BLAS implementation, which was developed primarily for the Fujitsu AP series of multiprocessors but is now fully portable. Other key components are versions of the BLAS and BLACS libraries to achieve high performance cell computation and communication, respectively, on the required target multiprocessor architectures. Preliminary experiences and results using the Fujitsu AP1000 multiprocessor indicate that good performance improvements are possible for relatively little effort. Performance models indicate that similar improvements can be expected on multiprocessors with relatively low communication costs and large (second-level) caches. Future work in the project includes improving the DBLAS to `cache' previously communicated data, and the porting and testing of the codes on other multiprocessor platforms.

1 Introduction

Dense linear algebra computations require the technique of `block-partitioned algorithms' for their efficient implementation on memory-hierarchy parallel computers. Here, the register, cache and off-processor memory levels of the memory hierarchy all affect the optimal block-partition size for such algorithms. Most existing studies on dense linear algebra computations have assumed the block-partition size or panel width for the algorithm, ω, to be the same as the matrix distribution block size, r, where a square (r × r) block-cyclic matrix distribution is employed on a P × Q rectangular processor configuration. Here the choice of ω = r is essentially determined by the off-processor memory level of the memory hierarchy. This means that the panel formation part of the computation is not fully parallelized, as the panel is fully contained in a row or column of processors, and hence only that subset of processors can participate in panel formation.

On this assumption, a parallel version of the extensive dense linear algebra library LAPACK (Linear Algebra PACKage), called ScaLAPACK (Scalable LAPACK), has been produced at the University of Tennessee, Knoxville and at Oak Ridge National Laboratory, and release 1.1 is currently available in the public domain [3, 1]. Corresponding to the fact that LAPACK is built in terms of the Basic Linear Algebra Subroutines (BLAS) library, ScaLAPACK is based on a version of parallel BLAS called the PBLAS [2], as well as the BLAS itself and the Basic Linear Algebra Communication Subprograms (BLACS).

This assumption has been re-examined in the context of matrix factorization computations on scalar-based distributed memory parallel processors, such as the Fujitsu AP1000 [7]. There, considerations of the register and cache levels of the memory hierarchy require a large panel width ω. `Distributed panels' versions of these computations, where ω > r (and typically ω > rP), allow full parallelization of the panel formation stage. It has been shown that on the AP1000, the `distributed panels' technique yields a 15%–25% overall improvement in speed for LU and Cholesky factorization [7]. It should be further noted that successors to these parallel processors, such as the Fujitsu AP3000, are likely to have secondary caches, which will enhance the benefits of the distributed panels technique.

As well as performance improvements, the distributed panels technique offers a simpler interface and ease of use in a parallel dense linear algebra library. For the traditional techniques, the optimal block size r is an empirically determined function of the type of computation, the architecture and, least convenient of all, the matrix size. However, for the distributed panels technique, it is found that r = 1 is optimal (or sufficiently near to optimal) [7].

The foundation for the `distributed panels' matrix factorization algorithms has been the production of a parallel version of the Basic Linear Algebra Subroutine (BLAS) library. This parallel version (developed for the AP1000) is called the DBLAS (Distributed BLAS) [6, 8]. Its current implementation has the deficiency that, in the context of these dense linear algebra computations, applying the `distributed panels' technique leads to redundant communication, typically doubling the communication volume. This occurs with distributed panels because matrix rows or columns first have to be communicated during panel formation, and then again later when the whole panel is communicated.

For these reasons, a Project has begun to extend the ScaLAPACK codes to use the distributed panels technique, and to measure the performance benefits on the Fujitsu AP line of multicomputers and other scalar-based multicomputers.

The remainder of this paper is organized as follows. Section 2 gives the detailed objectives for this Project, and strategies for meeting these objectives are discussed in Section 3. Preliminary experiences and results are given in Section 4. A description of a detailed performance model of parallel LU factorization, and its predictions for various architectures, is given in Section 5. Plans for future work and conclusions are given in Section 6.
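To make concrete how the panel width ω interacts with the distribution block size r, the following small sketch (ours, not taken from ScaLAPACK; the function names are illustrative only) maps global column indices to process columns under a block-cyclic distribution with block size r, and lists which of the Q process columns hold a panel of width ω. For ω = r the panel lies entirely in one process column, whereas for ω ≥ rQ every process column participates in its formation.

```python
# Block-cyclic column ownership: which process columns hold a panel of width omega?
# Illustrative sketch only; not ScaLAPACK code.

def owner_col(j, r, Q):
    """Process column owning global column j under a block-cyclic distribution
    with block size r over Q process columns (zero-based, no source offset)."""
    return (j // r) % Q

def panel_owners(j0, omega, r, Q):
    """Process columns participating in forming the panel of global columns
    j0 .. j0 + omega - 1."""
    return sorted({owner_col(j, r, Q) for j in range(j0, j0 + omega)})

if __name__ == "__main__":
    Q = 8
    print(panel_owners(0, omega=16, r=16, Q=Q))  # omega = r = 16 -> [0]: one process column
    print(panel_owners(0, omega=64, r=1,  Q=Q))  # omega = 64, r = 1 -> [0..7]: all take part
```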

2 Objectives for the Parallel LAPACK Project

This Project can be partitioned into two parts. Part I forms the foundations, and some of this work has already been published elsewhere [6, 7, 8].


1. to investigate the performance (speed) benefits of the `distributed panels' technique for LU, Cholesky and QR factorization computations, and for the Hessenberg, bi-diagonal and tri-diagonal reduction algorithms, on the Fujitsu AP1000/AP+. The latter form the dominant computational components of QR-based eigenvalue computation and singular value decomposition.

2. to develop a theoretical performance model to compare `distributed panels' versions of these computations versus the traditional versions. From this, suitable scalar-based (and possibly even vector-based) parallel computer platforms can be identified.

3. to investigate how a parallel BLAS library (i.e. the DBLAS) can be implemented to avoid producing redundant communications without compromising software engineering considerations.

Part II involves the production of a version of parallel LAPACK of improved performance. Two fundamental approaches exist: to largely produce the software from scratch, or to (minimally) extend ScaLAPACK to achieve this aim. The latter approach has the benefit of being able to re-use completely (or at least with minimal change, see Section 3) many already existing components (including documentation). While the former is beyond our resources to attempt, it can still be considered for some limited parts of the system. The objectives of Part II are:

1. to produce a PBLAS interface to the (portable) DBLAS, to enable the DBLAS (which can efficiently support distributed panels computations) to be called from the ScaLAPACK codes.

2. to extend ScaLAPACK 1.1 to accommodate `distributed panels' (single and double precision datatypes only).

3. to port the extended ScaLAPACK codes to other suitable parallel computer platforms and test their performance.

4. to release the improved version of parallel LAPACK for researchers in computational science. These improvements include both performance gains and a simplified user interface.

3 Strategies to Extend ScaLAPACK

In this section, we discuss various strategies to meet the above objectives.

Vital components of ScaLAPACK are high performance and reliable versions of the BLAS and BLACS for each multicomputer platform of interest. For SPARC processors (under the Solaris 1 operating system), no vendor-supplied BLAS is available. However, a version has already been produced (in 1994) by the DBLAS project; this is of high performance only for the level 3 BLAS, however. Two versions of the BLACS have already been produced for the AP1000. The first was produced in 1994 for the α-release of ScaLAPACK [4]: this is of high performance, but unfortunately the BLACS interface and functionality have greatly changed since, rendering this version unusable in its present form. The second version was produced in 1995 by the ANU-Fujitsu Area 3 Project for the release 1.0 ScaLAPACK port [5]; unfortunately its performance is not nearly as high (one reason is that the ApLib fast row and column broadcasts are not used). Near-optimal BLAS and BLACS for the AP1000/AP+ will ultimately be required by this project.

Extending the DBLAS implementation so that it can avoid generating redundant communications when used in the context of matrix factorizations or reduction computations (Aim I.3) is non-trivial. A tractable approach to this problem is to employ an internal software caching strategy in the DBLAS, where recently received matrix rows/columns are `remembered' in each processor so that they need not be recommunicated. Issues such as cache invalidation are not trivial, but they have been solved in similar situations (e.g. hardware caches for shared memory parallel processors) and present no fundamental obstacle here. `Annotation routines' to specify which matrix operands are to be cached, and which cached matrix rows/columns are to be invalidated, can provide a tractable `first solution' to this problem.
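As a rough sketch of how such annotation routines might drive a software cache (this is our own illustration, not the actual DBLAS interface; the class and method names are hypothetical), received rows or columns are keyed by operand and index, and explicit invalidation calls play the role that a coherence protocol plays in a hardware cache:

```python
# Sketch of a software cache for communicated matrix rows/columns, driven by
# explicit annotation calls; hypothetical interface, not the actual DBLAS one.

class PanelCache:
    def __init__(self):
        self._store = {}        # (matrix_id, 'row'|'col', index) -> data
        self._enabled = set()   # matrix ids currently annotated for caching

    # --- annotation routines ---
    def enable(self, matrix_id):
        """Annotate: rows/columns of this operand may be cached after receipt."""
        self._enabled.add(matrix_id)

    def invalidate(self, matrix_id, kind=None, index=None):
        """Annotate: discard cached data that a subsequent update makes stale."""
        self._store = {k: v for k, v in self._store.items()
                       if not (k[0] == matrix_id
                               and (kind is None or k[1] == kind)
                               and (index is None or k[2] == index))}

    # --- used inside the communication layer ---
    def get_or_receive(self, matrix_id, kind, index, receive_fn):
        """Return a cached row/column if present, otherwise receive and cache it."""
        key = (matrix_id, kind, index)
        if key in self._store:
            return self._store[key]          # redundant communication avoided
        data = receive_fn()                  # e.g. a row or column broadcast
        if matrix_id in self._enabled:
            self._store[key] = data
        return data
```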


ScaLAPACK is a very large software system, and the approach of how to extend it for this purpose must be carefully considered. The current version of ScaLAPACK (1.1) has been (on the whole) skillfully software engineered and is suitable for such an extension. It can be extended to incorporate distributed panels as follows:

• the top-level (`level 3') ScaLAPACK routines (involving ≈ 4000 lines of Fortran code for double precision) will mainly require the panel width to be re-defined (e.g. from r to a suitable value of ω); this value is then typically passed down to the `level 2' and PBLAS routines (a serial sketch of the role of the panel width as a free parameter is given after this list). If done carefully, this involves only a few lines of code per routine. Some relaxation of error checking is also required in places. At a later stage (when performance is required), the DBLAS `annotation routine' calls can be added where needed.

• the `level 2' ScaLAPACK routines are more difficult. Those already written largely in terms of the PBLAS (involving ≈ 1400 lines of Fortran code for double precision) must be carefully examined for implicit assumptions that ω = r, but are likely to require only a small amount of change. Those having the ω = r assumption fundamentally built into their coding (generally, these contain BLAS/BLACS calls rather than PBLAS calls) will have to be largely rewritten; these involve ≈ 3500 lines of Fortran code for double precision. A promising approach is to interface these routines (e.g. the QR routines PDLARFT() and PDLARFB()) to their DBLAS-based counterparts, which has several advantages in expressing such complex parallel computations (see Section 5.1 of [8]).

• minor modifications to the ScaLAPACK test programs (there are 8 for double precision) will also be required, in order to port them to the AP1000/AP+. The Host Access Package (hap) produced for the AP1000/AP+ can make this process relatively easy, the largest change being in the reading of the test parameters: these must be read from standard input rather than from a named file. Also, the panel width ω must be added to the test parameters.

Note that the `level 1' ScaLAPACK routines and the ScaLAPACK auxiliary routines (e.g. the `Tools' routines) are unaffected by this extension.
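To illustrate the algorithmic role of the panel width referred to above, here is a small serial sketch (ours, in plain Python with NumPy, not the ScaLAPACK code) of a right-looking blocked LU factorization in which the panel width nb, the analogue of ω, is a free parameter independent of any distribution block size:

```python
# Serial sketch of right-looking blocked LU (no pivoting), with the panel
# width nb as a free algorithmic parameter. Illustrative only; ScaLAPACK's
# routines are organized differently.
import numpy as np

def blocked_lu(A, nb):
    """In-place LU (L unit lower, U upper, both stored in A), panel width nb."""
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # level 2: factorize the panel A[k:, k:k+b] column by column
        for j in range(k, k + b):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:k+b] -= np.outer(A[j+1:, j], A[j, j+1:k+b])
        if k + b < n:
            # level 3: triangular solve for the U block row, then trailing update
            Lkk = np.tril(A[k:k+b, k:k+b], -1) + np.eye(b)
            A[k:k+b, k+b:] = np.linalg.solve(Lkk, A[k:k+b, k+b:])
            A[k+b:, k+b:] -= A[k+b:, k:k+b] @ A[k:k+b, k+b:]
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((64, 64)) + 64 * np.eye(64)   # diagonally dominant: safe without pivoting
    LU = blocked_lu(A.copy(), nb=8)
    L = np.tril(LU, -1) + np.eye(64)
    U = np.triu(LU)
    print(np.allclose(L @ U, A))                 # True
```

In the parallel setting, it is the panel factorization loop above that the distributed panels technique parallelizes over all P × Q processors, rather than over a single process row or column.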

4 Experiences and Preliminary Results

The implementation phase of the Project began in August 1996, and we report here early experiences and results. The (re-)port of ScaLAPACK 1.1 to the AP1000 itself involved a significant amount of work, even when made easier by the HAP package. This was largely because of the size and complexity of the many components required, and the extensive testing needed to ensure their reliability. At this stage, we have concentrated on the matrix factorization codes and test programs, with only the level 3 PBLAS interface to the DBLAS implemented.

Debugging, fundamentally difficult for large parallel computations at any time, has been problematic for several reasons. Firstly, when an error is reported, it is often difficult to determine which component (e.g. modified ScaLAPACK routines, BLACS, PBLAS, DBLAS, checking routines, or test program) is at fault. In particular, the test programs are large, complex and lacking in documentation; also, the assumption of ω = r is implicitly made in some places here as well. Secondly, while the HAP package is useful for porting the test programs, all cells' output is concatenated, which makes execution tracing difficult (especially upon abnormal termination of one of the cells, where some cell output may also be lost). Thirdly, Fortran (compared with C) is a cumbersome language for software engineering, and lacks safety in parameter checking¹.

¹ For example, a minor modification which moved a comma in a PBLAS call from column 72 to 73 resulted in the following parameters being corrupted. No warning message was given by the compiler.

Another unforeseen problem was that some decisions in the ScaLAPACK algorithms on the representation of data implicitly assumed ω = r. For example:

• in LU decomposition, the pivot vector is assumed to be row-and-column replicated for the level 2 factorization routine PDGETF2(), and is assumed to be only column-replicated for the main routine, PDGETRF(). For ω = r, it turns out that PDGETRF() can call PDGETF2() safely to set its pivot vector, but extension to ω > r required nontrivial changes, including a routine to perform compression of the pivot vector (a hypothetical sketch of such a step is given after this list).

• in the level 3 QR factorization routine PDGEQRF(), an intermediate triangular factor matrix, T, is formed, and is in upper triangular form. For ω ≤ r, T is contained completely in one cell. However, for ω > r, the formation of an upper (instead of lower) triangular T requires an explicit parallel transposition, which will degrade performance appreciably.
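As a purely hypothetical illustration of what a pivot-vector compression step involves (the actual ScaLAPACK/DBLAS data layouts are more involved, and the function below is not the routine added to PDGETRF()), suppose each process row ends up holding pivot entries only for the panel rows it owns; producing a single vector ordered by global row index then amounts to merging these fragments and discarding replicas:

```python
# Hypothetical sketch: merging per-process-row pivot fragments into one pivot
# vector ordered by global row index. Not the ScaLAPACK routine itself.

def compress_pivots(fragments):
    """fragments: list (one per process row) of dicts {global_row: pivot_row}.
    Returns the merged pivot vector as a list of (global_row, pivot_row) pairs."""
    merged = {}
    for frag in fragments:
        for grow, piv in frag.items():
            # replicas must agree; keep one copy per global row
            assert merged.get(grow, piv) == piv
            merged[grow] = piv
    return [(grow, merged[grow]) for grow in sorted(merged)]

if __name__ == "__main__":
    # e.g. a panel of 4 rows spread over 2 process rows (r = 1, P = 2)
    frag_p0 = {0: 2, 2: 3}     # process row 0 owns global rows 0 and 2
    frag_p1 = {1: 1, 3: 3}     # process row 1 owns global rows 1 and 3
    print(compress_pivots([frag_p0, frag_p1]))
```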


Figure 1: Improvement in ScaLAPACK LU factorization of an N × N matrix with r × r blocks and panel width ω, on an 8 × 8 AP1000 (MFLOPS/cell vs. N, comparing r = ω = 16 with r = 1, ω = 64).

Figure 2: Improvement in ScaLAPACK Cholesky factorization of a lower triangular N × N matrix with r × r blocks and panel width ω, on an 8 × 8 AP1000 (MFLOPS/cell vs. N, comparing r = ω = 8 with r = 1, ω = 64).

Preliminary results for LU and Cholesky decomposition of N × N matrices on the AP1000 are given in Figures 1 and 2. These show the performance of the unmodified ScaLAPACK algorithms with the optimal block size r for that range of N, and the performance of the extended (distributed panels) versions (which use the level 3 DBLAS), with ω = 64 and r = 1. Both use the same BLAS and BLACS, with `line-sending' mode required for reasonable BLACS performance. These show a 15–20% improvement for N > 2000 for LU decomposition, and a slightly smaller improvement for Cholesky factorization. While both of these curves can be improved considerably with faster BLACS and BLAS (cf. Section 5), these results are encouraging when one considers that the ω = 64 version currently performs redundant row/column broadcast communications, which detract from its performance. This effect is heightened by the current BLACS, which is about 3 times slower than the native AP row/column broadcast, and has most impact for the middle ranges of N. Secondly, the use of slow level 2 BLAS affects the ω = 64 version more, since the amount of level 2 computation is O(ωN²). Finally, the level 3 DBLAS routines used for the ω = 64 version were in transition at the time the results were taken, and have large software overheads (including internal error checking); the effect of this would be greatest for low values of N for the Cholesky factorization.

5 Performance Modelling

In this section, we will outline a detailed performance model for parallel LU factorization, after the manner of the Distributed Linear Algebra Machine Model (DLAM) [1]. However, in order to predict the advantages (or disadvantages) of the distributed panels version of this algorithm, a much higher degree of detail is required. We assume that the time (in software) to send an ordinary message of length n is given by Tc(n; α, β) = α + nβ. Table 1 gives the various constants and their empirically determined values on the AP1000; those in the second half are now required for modelling the distributed panels computation, as the larger values of ω imply a greater proportion of level 2 BLAS and TRSM/TRMM computations. Furthermore, γ3 must be more accurately modelled by a function γ3(ω), where here ω is identified with the minimum dimension of the level 3 computation. This is because, for ω = r, load balance considerations may require ω to be much smaller than the value required for the optimal γ3 (to a lesser extent, γ3T and γ2 are also affected by ω).

         description                   AP1000
  α      commun. startup time          40 μs
  β⁻¹    commun. bandwidth             1.21 μs
  γ3⁻¹   level 3 BLAS compute speed    0.2 μs
  γ3T⁻¹  BLAS TRSM/TRMM speed          0.27 μs
  γ2⁻¹   level 2 BLAS compute speed    0.4 μs
  γ1⁻¹   level 1 BLAS compute speed    1 μs

Table 1: Machine-dependent quantities used for the performance model.

Below, we give the model for the formation of the lower panel L (see Figure 4 of [7]) for an N × N LU factorization of a matrix of block size r, using a panel width ω on a P × Q processor grid. The total time for panel formation over the whole computation is given by:

    γ1 N(N-1)/(2P)                               (find local max. in lj)
  + N comb(P) (α + 2β)                           (find global max. in lj)
  + 2N (α + g2l(ω, r, Q) β)                      (swap pivot row in L)
  + γ1 N(N-1)/(2P)                               (scale lj by max.)
  + bc(P) (Nα + N(ω-1)/(2 Ed(ω, r, Q)) β)        (column b/c of uj)
  + γ2 N(N-ω)(ω-1)/(2 Ed(ω, r, Q))               (rank-1 update lj uj^T)
  + bc(Q) (Nα + N(N-ω)/(2P) β)                   (row b/c of lj)

Here lj denotes the pivot column, and uj denotes the pivot row in the upper triangular portion of L. comb(P) is the weighting coefficient for global combine operations across a column of P processors (generally = lg2(P)), and bc(P) is the coefficient for a broadcast along a column of P processors (= 1 on the AP1000). Ed = Ed(ω, r, Q) is the efficiency due to load balance of a distributed panels computation [7], and g2l(ω, r, Q) is the local length on cell 0 of a vector of size ω and block size r across a row of Q processors. As can be seen, this performance model is complex and requires numerical solution via a computer program.
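Since the model must be solved numerically, a small program suffices. The following is our own illustrative rendering (in Python) of the panel-formation term above, not the program used to produce Figure 3. In particular, Ed(ω, r, Q) is defined in [7] rather than here, so a simple load-balance surrogate is assumed, and the level 3 constants of Table 1 are not needed for this term.

```python
# Illustrative evaluation of the panel-formation time model of Section 5.
# Constants follow Table 1 (AP1000, microseconds); Ed() is an assumed
# surrogate, since its exact definition is given in [7], not in this paper.
from math import log2

alpha  = 40.0    # commun. startup time
beta   = 1.21    # time per word (beta^-1 is the commun. bandwidth)
gamma1 = 1.0     # level 1 BLAS time per flop
gamma2 = 0.4     # level 2 BLAS time per flop

def g2l(omega, r, Q):
    """Local length on cell 0 of a block-cyclic vector of size omega, block size r, over Q cells."""
    full_cycles, rem = divmod(omega, r * Q)
    return full_cycles * r + min(r, rem)

def Ed(omega, r, Q):
    """Assumed load-balance efficiency of a distributed panel of width omega
    (average local width / maximum local width); the true definition is in [7]."""
    return (omega / Q) / g2l(omega, r, Q)

def comb(P):   # weighting coefficient for a combine across P cells
    return log2(P)

def bc(P):     # coefficient for a broadcast along P cells (1 on the AP1000)
    return 1.0

def panel_formation_time(N, P, Q, omega, r):
    """Total panel-formation time (microseconds) over the whole LU factorization."""
    e = Ed(omega, r, Q)
    t  = gamma1 * N * (N - 1) / (2 * P)                          # find local max. in l_j
    t += N * comb(P) * (alpha + 2 * beta)                        # find global max. in l_j
    t += 2 * N * (alpha + g2l(omega, r, Q) * beta)               # swap pivot row in L
    t += gamma1 * N * (N - 1) / (2 * P)                          # scale l_j by the max.
    t += bc(P) * (N * alpha + N * (omega - 1) / (2 * e) * beta)  # column b/c of u_j
    t += gamma2 * N * (N - omega) * (omega - 1) / (2 * e)        # rank-1 update l_j u_j^T
    t += bc(Q) * (N * alpha + N * (N - omega) / (2 * P) * beta)  # row b/c of l_j
    return t

if __name__ == "__main__":
    # Compare the two configurations of Figure 1 on an 8 x 8 grid.
    for (r, omega) in [(16, 16), (1, 64)]:
        t = panel_formation_time(N=4000, P=8, Q=8, omega=omega, r=r)
        print(f"r={r:2d} omega={omega:2d}: panel formation ~ {t/1e6:.2f} s")
```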

Figure 3: Comparison of actual (DBLAS-based) LU factorization with the performance model, for an N × N matrix with r × r blocks and panel width ω, on an 8 × 8 AP1000 (MFLOPS/cell vs. N; actual and modelled curves for r = 1, ω = 64 and for r = ω = 16).

Figure 3 gives a comparison of this model, tuned to the AP1000 (assuming redundant communications if ω > r), versus an optimized DBLAS-based algorithm. This shows generally a very close agreement of the model with the actual, with deviations at N = 2048 and 3072 probably due to direct-mapped cache conflicts, and those for N ≤ 1024 probably due to O(N) software overheads within the DBLAS (cf. Section 5.3 of [8]).

For the AP+, the performance model was less accurate, generally predicting about 3 MFLOPS greater speed than the actual. This was possibly because the performance parameters were not as well calibrated, and possibly because the O(N) software overheads are greater on the AP+.

While developing such a model required a considerable amount of work, and `debugging' such a model is fundamentally difficult (there is no real way of checking the answer), such models have the following advantages, once calibrated on the architecture of interest:


• they can be used to predict performance for algorithm variants much faster than it can be measured, especially if the variants are yet to be implemented. For example, the model predicts that if redundant communications are eliminated, a 3–7% further improvement for ω > r is expected on the AP1000. Also, the optimal ω for r = 1 and the optimal r for ω = r can be easily predicted by the model.

• they can be used to perform a scalability analysis of a parallel algorithm.

• they can be used to predict on which architectures an algorithm will be optimal, given sufficient knowledge of that architecture.

• they can be used to perform `performance debugging'. For example, the program producing the model computes the speed (per cell) and percentage of overall time for each component. Grouping these components together, and comparing these with the measurements of the corresponding sections of code², can help identify which sub-computation is unexpectedly slow (a small sketch of this bookkeeping is given after this list). In the graph of Figure 3, this method identified that the DBLAS call (DDGER()) performing the communication of lj and uj and their rank-1 update was the greatest source of the suspected O(N) software overheads.

² Here, care must be taken not to have these measurements too fine-grained, otherwise they can be affected by timer overheads, and care must be taken that the measurement of one component group does not include `idle time' caused by another group.
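A minimal sketch (ours, with made-up component names and times) of this `performance debugging' bookkeeping: group the modelled per-component times and compare each group's share of the total against timings of the corresponding sections of code.

```python
# Sketch of model-vs-measurement comparison for 'performance debugging'.
# Component names and numbers are illustrative only.

def compare(model_times, measured_times, groups):
    """model_times / measured_times: dicts of component -> seconds.
    groups: dict of group name -> list of component names."""
    mt, et = sum(model_times.values()), sum(measured_times.values())
    for g, comps in groups.items():
        m = sum(model_times[c] for c in comps)
        e = sum(measured_times[c] for c in comps)
        print(f"{g:12s} model {100*m/mt:5.1f}%  measured {100*e/et:5.1f}%")

if __name__ == "__main__":
    model    = {"rank1_update": 3.0, "row_bcast": 1.0, "col_bcast": 1.0, "trsm": 2.0}
    measured = {"rank1_update": 5.5, "row_bcast": 1.1, "col_bcast": 1.0, "trsm": 2.1}
    compare(model, measured,
            {"panel": ["rank1_update", "col_bcast"], "update": ["row_bcast", "trsm"]})
```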

6 Conclusions and Future Work

In this paper, we have described a project in progress to produce a high performance version of parallel LAPACK by extending the existing parallel library ScaLAPACK 1.1 to incorporate the distributed panels technique. Not only can this result in a simplified user interface, but it can also yield better performance for memory-hierarchy multiprocessors. Preliminary results for LU and Cholesky factorizations indicate a 10–20% improvement, with scope for further improvements. The main challenges in the extension of the remaining ScaLAPACK codes themselves have been identified to be the inherent difficulties of debugging such a large parallel software system, and the representation of some of the secondary data structures in the existing ScaLAPACK which is incompatible with the use of distributed panels.

Improving the performance of the (level 1 and 2) BLAS and BLACS libraries for the AP1000 and AP+ is important for proving our approach. For the latter, porting the MPI BLACS to the AP/AP+ would be an ideal solution, since high performance MPI is supported on these platforms. Unfortunately the first attempt failed, apparently due to assumptions made by the MPI BLACS about the MPI implementation that do not hold for the AP MPI implementation. An alternative currently being investigated is to upgrade the earlier version of the BLACS [4] to the current BLACS functionality (which includes multiple BLACS contexts). Communication performance within the DBLAS must also be improved, by the elimination of redundant communication via `panel caching'.

The porting and performance testing of the extended ScaLAPACK codes on various multicomputer platforms will be an important element of this project. For this purpose, the detailed performance models of these computations, shown to be in close agreement with the actual LU factorization on the AP1000, will be useful for predicting which platforms will be the most promising.

Acknowledgements

We would like to acknowledge the support of the Small ARC Grant F96042, the Department of Computer Science at the ANU and the ANU-Fujitsu CAP Project in sponsoring this work. We would also like to thank David Sitsky for helpful advice on HAP and MPI, and Andrew Tridgell for assistance in detecting dynamic memory errors.


References

[1] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance. Technical Report CS-95-283, Computer Science Dept, University of Tennessee, Knoxville, 1995.

[2] J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley. A Proposal for a Set of Parallel Basic Linear Algebra Subprograms. Technical Report CS-95-292, Computer Science Dept, University of Tennessee, Knoxville, 1995.

[3] J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers. In Frontiers '92: Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, Virginia, October 1992.

[4] A. Lynes. Linear Algebra Communication and Algorithms on the Fujitsu AP1000. Honours Thesis, Department of Computer Science, Australian National University, Nov 1994.

[5] A. Rendell. Porting ScaLAPACK (and PBLAS) to the Fujitsu AP1000. Technical report, ANU Supercomputing Facility, Australian National University, 1995.

[6] P. E. Strazdins. Prototyping Parallel LAPACK using Block-Cyclic Distributed BLAS. In Third Parallel Computing Workshop for the Fujitsu PCRF, pages P1-R-1 – P1-R-7, Kawasaki, November 1994.

[7] P. E. Strazdins. Matrix Factorization using Distributed Panels on the Fujitsu AP1000. In IEEE First International Conference on Algorithms And Architectures for Parallel Processing (ICA3PP-95), pages 263–273, Brisbane, April 1995.

[8] P. E. Strazdins. A High Performance, Portable Distributed BLAS Implementation. In Fifth Parallel Computing Workshop for the Fujitsu PCRF, pages P2-K-1 – P2-K-10, Kawasaki, November 1996.
