Parallel Krylov Methods for Econometric Model Simulation

Giorgio Pauletto
Hoover Institution, Stanford University, Stanford, CA 94305-6010

Manfred Gilli
Department of Econometrics, University of Geneva, 1211 Geneva 4, Switzerland
Abstract

This paper investigates parallel solution methods to simulate large-scale macroeconometric models with forward-looking variables. The method chosen is the Newton-Krylov algorithm. We concentrate on a parallel solution to the sparse linear system arising in the Newton algorithm, and we empirically analyze the scalability of the GMRES method, which belongs to the class of so-called Krylov subspace methods. The results obtained using an implementation of the PETSc 2.0 software library on an IBM SP2 show a near linear scalability for the problem tested.

Keywords: Parallel computing, Newton-Krylov methods, sparse matrices, forward-looking models, GMRES, scalability.

JEL Classification: C63, C88, C30.
1 Introduction

There are many engineering problems for which parallel computing has proven efficient. Economic problems are, however, often quite different in both structure and quantification. This is particularly true for systems of equations representing large economic models, whose Jacobian matrices are typically sparse and nonsymmetric and whose main diagonal may contain zeros. Furthermore, the values of the nonzero elements often have a large range.
The purpose of this paper is to explore the potential of parallel computing for the simulation of such large economic models. We consider a Newton method for the solution of the nonlinear model, and our investigation concentrates only on the solution of the linear system in the Newton step. Parallel computing has evolved rapidly and has seen its main objective move through the following stages (according to Keyes (1997)):

Solving faster. At first, parallel computation was used to gain speed.

Solving bigger. Then, as the architecture of parallel computers was scaled up, providing larger memory and more powerful processing units, and as Amdahl's law became a serious bottleneck for some applications, parallel computing became a means for solving larger problems in a constant time.

Solving cheaper. Then, as the cost of parallel hardware decreased relative to fast workstations, there came a trend toward clusters of complete computers using a standard communication interface; this provides an efficient and very cheap alternative parallel machine. This is exemplified by Beowulf class cluster computers (see http://beowulf.gsfc.nasa.gov/ for a description of the Beowulf project), which use PC-based workstations, a fast Ethernet switch and Linux as the operating system.

Solving smarter. Finally, with better algorithms and cheaper hardware, we may find that a parallel environment is the researcher's most important companion for computationally intensive tasks.
2 Important Issues in Parallel Programming

To some extent, the problems encountered with parallel programming can be compared to those that early serial programmers faced. According to Demmel et al. (1993), programmers of current serial machines can ignore many of the details
that earlier programmers could ignore only at the risk of significantly slower programs. With modern parallel computers, we must once again be concerned with such details as data transfer time between memory and processors and efficient numerical operations. Two features are of basic importance in understanding how to design and analyze parallel algorithms: locality and regularity of computations.

Locality. Locality refers to the proximity of the arithmetic and the storage (memory) components of the computer. There are three levels of memory hierarchy: (a) limited, fast, expensive cache memory, (b) extensive, slower, economical conventional RAM memory and (c) disk memory. Useful arithmetic can be performed only on data stored in cache, and so data must be moved from the slower levels of memory to the cache to participate in computation. In a distributed memory parallel computer, there are even more hierarchical levels, since each processor must also communicate with the memories of all the other processors. This is illustrated in Figure 1, which shows a set of processors, each with its own memory, connected by a network.
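As a small illustration of the locality issue (this sketch is added here and is not part of the original paper), the following C program sums the entries of a matrix twice: once row by row, which in C gives unit-stride, cache-friendly accesses, and once column by column, which jumps across memory at every step. The arithmetic is identical, yet on most machines the second loop is noticeably slower.

    /* Locality sketch (not from the paper): row-wise versus column-wise
       traversal of a matrix stored in row-major order, as C does. */
    #include <stdio.h>
    #include <time.h>

    #define N 2000

    static double a[N][N];

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;

        clock_t t0 = clock();
        double s1 = 0.0;
        for (int i = 0; i < N; i++)          /* unit-stride accesses */
            for (int j = 0; j < N; j++)
                s1 += a[i][j];
        double t_row = (double)(clock() - t0) / CLOCKS_PER_SEC;

        t0 = clock();
        double s2 = 0.0;
        for (int j = 0; j < N; j++)          /* stride-N accesses */
            for (int i = 0; i < N; i++)
                s2 += a[i][j];
        double t_col = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("row-wise %.3f s, column-wise %.3f s (sums %.0f and %.0f)\n",
               t_row, t_col, s1, s2);
        return 0;
    }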
Figure 1: Parallel computer.

Storing or fetching data held in the memory of another processor is called communication, and it occurs over a network. Depending on the machine, communication may be done automatically by the hardware whenever nonlocal data is referenced, or it may be necessary to write explicit statements for sending or receiving messages. A simple model for the time needed to move n data items from one location to another is

    t = α + β n,

where α represents the start-up time (or latency) and β is the time needed to move a single data item (1/β is called the bandwidth). It is important to notice that, in practice, α ≫ β, so successive transmissions of short messages may be very time-consuming. This motivates designing algorithms that communicate as infrequently as possible, exhibiting a so-called coarse-grained parallelism in contrast to a fine-grained parallelism. Algorithm designs also depend on whether the algorithm must run on a machine having shared memory or one having distributed memory.

Regularity. Computations that have simple patterns and a high degree of regularity are executed fastest on parallel machines. To be efficient, computations must be decomposable into repeated applications of these regular patterns. This regularity must include both the arithmetic operations and the communications. In this paper, we are particularly interested in solving systems of linear equations by applying regular sequences of elementary linear algebraic operations. To help focus our attention on the design of parallel algorithms, we analyze three Basic Linear Algebra Subroutines (BLAS) operations in terms of floating point operations (Flops), the minimum number of memory references, and their ratio (Table 1, taken from Demmel et al. (1993)).

Table 1: Memory references and operation counts for the BLAS.
    Operation        Definition    Memory references   Flops   Ratio
    saxpy            y = αx + y    3n + 1              2n      2/3
    matrix-vector    y = Ax + y    n^2 + 3n            2n^2    2
    matrix-matrix    C = AB + C    4n^2                2n^3    n/2
We see from Table 1 that only matrix-matrix multiplication offers an opportunity to increase the ratio between Flops and memory references as n grows. Hence an effective approach for designing parallel algorithms attempts to decompose the computations as much as possible into a sequence of dense matrix multiplications.
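To make the counts in Table 1 concrete, the short program below (added here as an illustration, not taken from the paper) evaluates the three ratios for a few values of n; only the matrix-matrix ratio keeps growing.

    /* Flops, memory references and their ratio for the three BLAS operations
       of Table 1, evaluated for a few dimensions n. */
    #include <stdio.h>

    int main(void) {
        const double sizes[] = {100.0, 1000.0, 10000.0};
        printf("%8s %10s %14s %14s\n", "n", "saxpy", "matrix-vector", "matrix-matrix");
        for (int k = 0; k < 3; k++) {
            double n = sizes[k];
            double r_saxpy  = (2.0 * n)         / (3.0 * n + 1.0);   /* tends to 2/3 */
            double r_matvec = (2.0 * n * n)     / (n * n + 3.0 * n); /* tends to 2   */
            double r_matmat = (2.0 * n * n * n) / (4.0 * n * n);     /* equals n/2   */
            printf("%8.0f %10.2f %14.2f %14.2f\n", n, r_saxpy, r_matvec, r_matmat);
        }
        return 0;
    }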
3 Sparse Direct Methods

Direct methods for solving nonsymmetric sparse linear systems of equations are generally based on modifications of Gaussian elimination (LU factorization). This is the method of choice for dense systems, and its main advantages are robustness and the ability to control the conditioning of the problem. In the sparse case, the problem is to avoid costly fill-in, i.e. the creation of new nonzero elements in the sparse factors. This can be achieved by reordering the elements of the matrix and using a threshold strategy in the partial pivoting, an approach belonging to the class of submatrix-based methods. By contrast, in column-based methods, ordinary partial pivoting is used and any preordering for sparsity is completely separate. However, since pivoting typically influences the number of nonzero elements in the factors, the symbolic factorization phase cannot be separated completely from the numeric factorization. Since the columns are not reordered dynamically, these methods can lead to a higher level of fill-in. Such codes may be improved by grouping columns with the same nonzero structure into a supernode, allowing them to be treated as dense for storage and computation. When using parallel computers, the problem is to order the matrix so as to distribute the computations evenly (referred to as load balancing) and to minimize the interprocessor communication. As one might expect, there is a tradeoff between maximizing parallelism and minimizing fill-in in the sparse factors. Large projects addressing the issues of parallel dense and sparse factorizations are the ScaLAPACK (dense and banded) package and the related SuperLU MT (sparse) package, both publicly available on Netlib at http://www.netlib.org/.
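The effect of the ordering on fill-in can be seen on a very small example. The sketch below (added for illustration; it is not code from the paper and uses a dense array for simplicity) factorizes an arrowhead matrix with plain Gaussian elimination: if the dense row and column are eliminated first, the factors fill in completely, whereas eliminating them last creates no fill-in at all.

    /* Fill-in sketch (not from the paper): LU factorization of an arrowhead
       matrix under two different orderings. */
    #include <stdio.h>
    #include <math.h>

    #define N 8

    /* In-place LU factorization without pivoting; returns nnz of L+U. */
    static int lu_nonzeros(double a[N][N]) {
        for (int k = 0; k < N; k++)
            for (int i = k + 1; i < N; i++) {
                a[i][k] /= a[k][k];
                for (int j = k + 1; j < N; j++)
                    a[i][j] -= a[i][k] * a[k][j];
            }
        int nnz = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (fabs(a[i][j]) > 1e-12) nnz++;
        return nnz;
    }

    /* Diagonal matrix plus one full row and column in position 'head'. */
    static void arrowhead(double a[N][N], int head) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 0.0;
        for (int i = 0; i < N; i++) {
            a[i][i] = 4.0;                 /* strong diagonal, no pivoting needed */
            a[head][i] = a[i][head] = 1.0;
        }
        a[head][head] = 4.0;
    }

    int main(void) {
        double a[N][N];
        arrowhead(a, 0);       /* dense row/column eliminated first: full fill-in */
        printf("head first: nnz(L+U) = %d\n", lu_nonzeros(a));
        arrowhead(a, N - 1);   /* dense row/column eliminated last: no fill-in    */
        printf("head last:  nnz(L+U) = %d\n", lu_nonzeros(a));
        return 0;
    }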
4 Sparse Iterative Methods

Fill-in typically causes direct methods applied to sparse matrices to produce large memory demands and degraded computational speed. For large problem sizes, Krylov iterative methods may therefore be preferable, see Gilli and Pauletto (1998). The situation is generally more difficult for matrices with no specific structure, such as nonsymmetric or indefinite matrices. However, with general projection-type methods such as GMRES or BiCGSTAB and suitable preconditioners, we can expect very good results. Dense linear systems can sometimes be solved at computational speeds approaching the peak performance of the hardware, even on parallel computers. This is not, however, the situation in the sparse case, for reasons we develop in the following. A particular challenge is the use of Krylov methods on parallel computer architectures (see van der Vorst and Chan (1997)). The main computational features of these methods are matrix-vector products, inner products, and vector updates. In a sparse framework, all of these computations involve O(n) operations on O(n) data, where n is the size of the system. Hence, the average operation count per datum is O(1). In contrast, Gaussian elimination on dense linear systems involves O(n^3) operations on O(n^2) data, leading to an average of O(n) operations per datum. The small operation count per datum of the sparse methods emphasizes the role of memory traffic, whereas the larger operation count of the dense method can hide delays due to memory movements. High performance on parallel machines relies on level 3 BLAS operations (matrix operations), in which the communication costs are buried beneath the computations.

The Krylov method we use in this application is the generalized minimal residual method (GMRES) introduced by Saad and Schultz (1986). This procedure solves a nonsymmetric linear system Ax = b iteratively by minimizing the residual b - Ax over the Krylov subspace span{r_0, A r_0, A^2 r_0, ..., A^(m-1) r_0}. In order to save storage and computation, GMRES is usually restarted after each m iterations, and the method is then referred to as GMRES(m). Since we are dealing with a nonsymmetric system, we need to orthogonalize each new basis vector against all previously generated vectors. Two schemes have been proposed in the literature: a modified Gram-Schmidt procedure and a Householder orthogonalization process. The Householder orthogonalization is preferred in a parallel environment since it involves level 2 BLAS operations. However, we still have to deal with inner products, which on distributed memory machines create synchronization points. To compute an inner product, each processor first receives a corresponding part of the two vectors and computes a partial inner product. Then these values are added through a reduction operation. This forces all processors to wait until the global sum is eventually computed, therefore imposing a synchronization point. The delaying of updates used to achieve better parallelism, which is possible for the conjugate gradient method, cannot be applied here since the update is computed only after m steps. One way to obtain parallelism is first to generate the basis of the Krylov subspace and then to orthogonalize it. This is the s-step GMRES(m) approach proposed in Chronopoulos and Kim (1990). We can note that such a procedure does not generate exactly the same subspace as the standard GMRES(m) algorithm. To get closer to the original method, some authors (de Sturler (1991), Bai et al. (1991)) propose generating the basis through suitable matrix polynomials and then orthogonalizing it. To overlap communication and computation, one can split the steps in the Gram-Schmidt process cleverly, as in de Sturler (1991). This leads to increases in speed for sufficiently large problems solved with GMRES(m) in parallel shared memory environments.
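The synchronization point created by an inner product can be made explicit with a few lines of MPI (a sketch added here, not code from the paper): each process computes a partial sum over its slice of the two vectors, and MPI_Allreduce blocks every process until the global sum is available.

    /* Distributed inner product sketch: local partial sums followed by a
       global reduction, which is the synchronization point discussed above. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int nlocal = 100000;            /* length of the local slice (assumed) */
        double *x = malloc(nlocal * sizeof *x);
        double *y = malloc(nlocal * sizeof *y);
        for (int i = 0; i < nlocal; i++) { x[i] = 1.0; y[i] = 2.0; }

        /* Local partial inner product: no communication at all. */
        double local = 0.0;
        for (int i = 0; i < nlocal; i++) local += x[i] * y[i];

        /* Global reduction: every process waits here for the complete sum. */
        double global;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("x'y = %g on %d processes\n", global, size);

        free(x); free(y);
        MPI_Finalize();
        return 0;
    }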
5 Some Results with the IBM SP2

Hardware Environment

The parallel computer used for the experiments that follow is an IBM RS/6000 Scalable POWER2 Parallel System (hereafter SP2). The SP2 at the University of Geneva consists of 15 processors, 14 of which are thin nodes and 1 is a wide node. Each thin node is a complete RS/6000 workstation built on a POWER2 superscalar pipelined chip running at 66 MHz and containing internal data and instruction cache memories. Moreover, each of these nodes has access to 192 MBytes of RAM (two nodes have 256 MBytes) and 4 GBytes of disk. Each workstation is claimed to have a peak performance of 266 MFlops. A more useful benchmark is reported in Dongarra (1998), where 53 MFlops is found to be the LINPACK benchmark for size 100 dense systems and 181 MFlops for size 1000 with hand optimization. A value of 125 MFlops is reported for this machine by Chopard (1997). The wide node has a larger cache memory, bus size, and disk space; this node operates as the file system server and is used for interactive sessions. All processing units are interconnected by a high speed switch that topologically is a complete graph, allowing for communication between each pair of nodes. The theoretical value for the latency is 0.5 μs and the bandwidth is 40 MBytes/s. (A latency of 30 μs and a bandwidth of 30 MBytes/s are reported by Chopard (1997) for the SP2 at Geneva.)
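Plugging the figures reported by Chopard (1997) into the communication model of Section 2 shows why message aggregation matters on this machine. The small program below (an illustration added here, not part of the paper) estimates the time to ship one MByte split into k messages: the total grows with k because each message pays the 30 μs start-up cost.

    /* Transfer-time estimates with t = alpha + beta * n per message, using the
       latency and bandwidth quoted above (about 30 microseconds, 30 MBytes/s). */
    #include <stdio.h>

    int main(void) {
        const double alpha = 30e-6;        /* start-up time in seconds        */
        const double beta  = 1.0 / 30e6;   /* seconds per byte at 30 MBytes/s */
        const double bytes = 1e6;          /* one MByte to transfer           */

        for (int k = 1; k <= 1000; k *= 10) {
            double t = k * alpha + beta * bytes;   /* k messages of bytes/k each */
            printf("%5d message(s): %7.2f ms\n", k, 1e3 * t);
        }
        return 0;
    }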
Software Environment

The software employed in our experiments is PETSc 2.0 (Portable, Extensible Toolkit for Scientific Computation), developed and maintained at the Mathematics and Computer Science Division of the Argonne National Laboratory. The package is a suite of routines and data structures that facilitate setting up and solving large-scale computational problems in parallel environments. It uses MPI for all message passing communications. The software is freely available and may be ported to a variety of platforms. The language used for programming PETSc is C, and it can easily be linked to C, C++ or Fortran programs. The main purpose of PETSc is to address problems modeled as partial differential equations, but it provides general solvers that suit our purposes. The library contains components able to manipulate parallel sparse matrices and to solve large-scale systems using preconditioned Krylov subspace methods. The different components of PETSc are illustrated in Figures 2 and 3, reprinted from the users manual (Balay et al. (1998)). The development of compilers for distributed memory computers is slow paced and has, so far, provided limited functionality in dealing with sparse matrices and the algorithms in which they are used. High Performance Fortran (HPF) compilers, for example, have not yet come to maturity and are more likely to be relevant for dense and data parallel applications. Therefore libraries such as PETSc aim at bridging the gap between writing complex message passing instructions and waiting for more sophisticated hardware and software solutions. PETSc also leverages existing libraries such as BLAS, LAPACK, MINPACK and MPI, and provides
means for the programmer to avoid recoding, and thoroughly studying, large pieces of tedious code that require great care to obtain good performance; see Balay et al. (1997). The library could be used simply as a black box, but it also allows the researcher to add pieces of code or fine tune already existing code.

Figure 2: Organization of the PETSc library (application codes and PDE solvers built on the TS, SNES, SLES, KSP and PC components, which in turn rest on matrices, vectors, index sets, BLAS, LAPACK, MPI and Draw).

Figure 3: Numerical components of PETSc (time steppers, nonlinear solvers, Krylov subspace methods such as GMRES, CG, CGS, Bi-CG-Stab, TFQMR, Richardson and Chebychev, preconditioners such as additive Schwarz, block Jacobi, Jacobi, ILU, ICC and LU, several sparse and dense matrix formats, index sets and vectors).
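As an illustration of how little code a PETSc solver requires, the sketch below sets up and solves a distributed sparse system with GMRES(30) and a block Jacobi preconditioner, the combination used in the experiments that follow. It is not the authors' code: it is written against the current PETSc API (the SLES interface of PETSc 2.0 used in the paper is organized slightly differently), and the tridiagonal test matrix is only a stand-in for the stacked model Jacobian.

    /* Minimal PETSc sketch: GMRES(30) with block Jacobi preconditioning. */
    #include <petscksp.h>

    int main(int argc, char **argv) {
        Mat A; Vec x, b; KSP ksp; PC pc;
        PetscInt n = 1000, Istart, Iend, its;

        PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

        /* Distributed sparse matrix in compressed row (AIJ) format. */
        PetscCall(MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE,
                               n, n, 3, NULL, 2, NULL, &A));
        PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
        for (PetscInt i = Istart; i < Iend; i++) {
            if (i > 0)     PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
            if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
            PetscCall(MatSetValue(A, i, i, 4.0, INSERT_VALUES));
        }
        PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
        PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

        PetscCall(MatCreateVecs(A, &x, &b));
        PetscCall(VecSet(b, 1.0));                 /* right-hand side of ones */

        /* GMRES(30) with block Jacobi; in current PETSc the default solver
           on each block is an incomplete factorization, ILU(0). */
        PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
        PetscCall(KSPSetOperators(ksp, A, A));
        PetscCall(KSPSetType(ksp, KSPGMRES));
        PetscCall(KSPGMRESSetRestart(ksp, 30));
        PetscCall(KSPGetPC(ksp, &pc));
        PetscCall(PCSetType(pc, PCBJACOBI));
        PetscCall(KSPSetFromOptions(ksp));         /* allow run-time overrides */
        PetscCall(KSPSolve(ksp, b, x));

        PetscCall(KSPGetIterationNumber(ksp, &its));
        PetscCall(PetscPrintf(PETSC_COMM_WORLD, "converged in %d iterations\n", (int)its));

        PetscCall(MatDestroy(&A)); PetscCall(VecDestroy(&x)); PetscCall(VecDestroy(&b));
        PetscCall(KSPDestroy(&ksp));
        PetscCall(PetscFinalize());
        return 0;
    }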
Application

To easily vary the size of the system to be solved, we examine an economic model having 413 equations and containing endogenous variables with lags and leads. Such a model with r lags and s leads can be formally written as

    h_i(y_t, y_{t-1}, ..., y_{t-r}, y_{t+1}, ..., y_{t+s}) = z_{it},    i = 1, ..., n,   t = 1, ..., T.    (1)

Considering the model over a horizon of T periods gives rise to a system of nT equations. The presence of leads in the variables creates a system which, in this particular
case, is set up to be interdependent (for an analysis of the structure of such stacked models, see Gilli and Pauletto (1997)). The Newton method is applied to solve the nonlinear model (1), and we concentrate our experiment on the solution of the linear system J s = b arising in the computation of the Newton step s, where J is the Jacobian matrix of size nT × nT and b is a constant vector depending on the current iterate. The values of the nonzero entries in the Jacobian matrix have been generated randomly, and the elements in the right-hand-side vector have been set to one. Figure 4 exhibits the structure of the Jacobian matrix for a small stacked system. In this study, stack sizes of T = 40, 60, 90, 150, 250 and 300 are used, corresponding to linear systems ranging from 16,520 to 123,900 equations. The average number of nonzero entries per row is 7, and the bandwidth is determined by the highest lead and lag: the width of the upper band is approximately 2,000 and the width of the lower band is approximately 1,200.
Figure 4: Structure of the Jacobian matrix for a small stacked system (nz = 24,848).

The linear systems resulting from the different stack sizes are solved on the SP2 with the PETSc software using a GMRES(30) algorithm. The preconditioning step is performed by a single block Jacobi iteration using an incomplete LU factorization with no fill-in, ILU(0). Figure 5 summarizes the performance in solving the different linear systems executed on 1 to 8 processors. The performance is measured in MFlops, calculated as the sum of Flops over all processors divided by the maximum computing time over all processors. Execution times can depend upon the load of the machine, and therefore the figures shown are average values over different runs. We see that a maximum speed-up of about 4.5 has been achieved when solving a linear system with 123,900 equations on 7 processors. Two features of the parallel solution algorithm appear clearly in Figure 5: the approach is only interesting for large problem instances, and the performance depends on the size of the problem; as expected, it is not possible to increase the performance arbitrarily by augmenting the number of processors.

Figure 5: Performance in MFlops as a function of problem size and number of processors (1 to 8).

To gain insight in interpreting these results and to identify the bottlenecks existing in the parallel algorithm, we give a condensed outline of GMRES in Algorithm 1. We briefly comment on each stage in the algorithm and then, in Table 2, compare execution times for the smallest and the largest problem solved on 7 processors. For the smaller system (n = 16,520) GMRES converges in 10 iterations, and for the larger one (n = 123,900) it takes 12. This makes comparison more difficult, as the statements in loop 3 have been executed 10 × 11/2 = 55 times for the smaller system and 12 × 13/2 = 78 times for the larger one. As the total number
of iterations is less than the restart value m = 30, the loop in Statement 1 need not be repeated. The computations PETSc performs in the different stages are as follows: MatMult computes the matrix-vector product u = Av, PCApply solves Mw = u for the preconditioning, VecDot computes w'v, VecAXPY computes w = w + αv, where α is a scalar, VecNorm computes ||w||_2, and VecScale computes v = w/α. The statements solve and update on lines 11 and 12 in the algorithm are executed only once and therefore are not included in the performance comparison. Table 2 compares the performance in MFlops and the time spent in the different stages for the two problem sizes. The last row gives the total time in percent, which does not sum up to 100% as minor steps have been neglected in this analysis, and the overall performance, which is the sum of the performances in each single stage weighted by the time spent in the stage. As Table 2 contains rounded figures, the computed average performance is a little below what is shown in Figure 5. One can see that the gain in performance is achieved in particular in the stages MatMult and PCApply, which, as explained in Table 1, have a ratio between Flops and memory references of 2 (PCApply essentially computes a matrix-vector product).
Algorithm 1: Sketch of GMRES(m) for solving Ax = b.

     1: for i = 1, m do
     2:   solve M w = A v_i
     3:   for j = 1, i do
     4:     h_{j,i} = w' v_j
     5:     w = w - h_{j,i} v_j
     6:   end for
     7:   h_{i+1,i} = ||w||_2
     8:   if h_{i+1,i} is small enough then exit loop
     9:   v_{i+1} = w / h_{i+1,i}
    10: end for
    11: solve the least-squares problem min_y || ||r_0|| e_1 - H y ||_2
    12: update x
    13: repeat if necessary
Table 2: Comparison of execution times for the different stages of GMRES (MatMult, PCApply, VecDot, VecAXPY, VecNorm and VecScale): percentage of total time and MFlops for the smallest (n = 16,520) and largest (n = 123,900) problems solved on 7 processors.
Scalability

In this section, we analyze empirically the scalability of the solution algorithm and the parallel system presented earlier.
The time taken to execute the algorithm on a single-processor machine is called the serial execution time and is usually denoted by T_1. Correspondingly, the parallel execution time on p processors is T_p. The ratio of T_1 to T_p is the speed-up with p processors, denoted S_p. Generally there is some overhead in the parallel algorithm not present in the serial one; this is due to communication, synchronization, and other aspects of parallelization. The efficiency E_p is the ratio of S_p to p: it ranges from 0 to 1 and provides a measure of the proportion of time devoted to performing useful computational work. First we compute the speed-ups obtained using the MFlops observed during the various executions. Since MFlops vary inversely with execution time, the speed-up is computed as the ratio of the number of MFlops on p processors to that on a single processor.

Table 3: Speed-ups S_p for some problem sizes n and numbers of processors p.
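The computation behind Tables 3 and 4 is straightforward and can be summarized in a few lines (a sketch with made-up MFlops figures, not the measurements of the paper): the speed-up is the ratio of the MFlops rate on p processors to the single-processor rate, and the efficiency divides this by p.

    /* Speed-up and efficiency from MFlops measurements; the values below are
       placeholders for illustration only. */
    #include <stdio.h>

    int main(void) {
        /* MFlops observed on p = 1, 2, ..., 8 processors for one problem size. */
        const double mflops[] = {10.0, 18.0, 25.0, 31.0, 36.0, 40.0, 43.0, 45.0};
        const int np = sizeof mflops / sizeof mflops[0];

        for (int p = 1; p <= np; p++) {
            double speedup    = mflops[p - 1] / mflops[0];  /* S_p = M_p / M_1 */
            double efficiency = speedup / p;                /* E_p = S_p / p   */
            printf("p = %d  S_p = %.2f  E_p = %.2f\n", p, speedup, efficiency);
        }
        return 0;
    }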
The efficiency of the parallel system is then computed as the speed-up divided by the number of processors. A parallel system is said to be scalable if the efficiency can be kept constant when increasing the problem size together with the number of processors. Thus, for the same level of efficiency, the slower the increase in the problem size with respect to the increase in the number of processors, the higher the scalability. One measure of scalability in a parallel system is the isoefficiency function (Kumar et al. (1994), Kumar and Gupta (1994)), which defines this concept precisely. Figure 6 illustrates the results and shows the isoefficiency curves for efficiency values of approximately 0.75 and 0.60. Some of the values have been interpolated, since the given (n, p) combinations do not exactly produce the stipulated efficiency values.
Table 4: Efficiencies E_p for problem sizes n between 16,520 and 123,900 and p = 1 to 8 processors.
This empirical relationship appears quite linear and suggests high scalability.
Figure 6: Isoefficiency curves for efficiency levels of approximately 0.75 and 0.60.