International Journal of Parallel Programming, Vol. 30, No. 2, April 2002 (© 2002)

Component-Based Derivation of a Parallel Stiff ODE Solver Implemented in a Cluster of Computers

Jose M. Mantas Ruiz,1 Julio Ortega Lopera,2 and Jose A. Carrillo de la Plata3

Received October 24, 2000; revised May 2001

A component-based methodological approach to derive distributed implementations of parallel ODE solvers is proposed. The proposal is based on the incorporation of explicit constructs for performance polymorphism into a methodology to derive group parallel programs of numerical methods from SPMD modules. These constructs enable the structuring of the derivation process into clearly defined steps, each one associated with a different type of optimization. The approach makes it possible to obtain a flexible tuning of a parallel ODE solver for several execution contexts and applications. Following this methodological approach, a relevant parallel numerical scheme for solving stiff ODEs has been optimized and implemented on a PC cluster. This numerical scheme is obtained from a Radau IIA Implicit Runge–Kutta method and exhibits a high degree of potential parallelism. Several numerical experiments have been performed by using several test problems with different structural characteristics. These experiments show satisfactory speedup results.

KEY WORDS: Component-based software development; numerical algorithms with multilevel parallelism; parallel linear algebra libraries; stiff ordinary differential equations; distributed memory machines.

1 Dpto. Lenguajes y Sistemas Informáticos, E.T.S.I. Informática, Univ. Granada, Avda. Andalucía 38, 18071 Granada, Spain. E-mail: [email protected]
2 Dpto. Arquitectura y Tecnología de Computadores, E.T.S.I. Informática, Univ. Granada, Avda. Andalucía 38, 18071 Granada, Spain. E-mail: [email protected]
3 Dpto. Matemática Aplicada, Facultad de Ciencias, Univ. Granada, Avda. Fuentenueva s/n, 18071 Granada, Spain. E-mail: [email protected]


1. INTRODUCTION

Differential equations are an extremely important tool in most fields of science and engineering because the behaviour of many processes is modelled by differential equations. The Ordinary Differential Equations (ODEs) arising from the modelling process may take different forms. (1) One important formulation is that of the Initial Value Problem (IVP) for ODEs. (2) The goal in the IVP is to find a function $y: \mathbb{R} \to \mathbb{R}^d$ given its value at an initial time $t_0$ and a recipe $f: \mathbb{R} \times \mathbb{R}^d \to \mathbb{R}^d$ for its slope:

$$y'(t) = f(t, y), \qquad y(t_0) = y_0 \in \mathbb{R}^d, \qquad t \in [t_0, t_f] \tag{1}$$

Several interesting examples of IVPs can be found in Refs. 3 and 4. Since differential equations are usually far too complicated to solve analytically, scientists and engineers rely on computers for a numerical approximation of the solution to an IVP within a designated interval $[t_0, t_f]$. (5)

Numerical methods for integrating IVPs generally work in a step-by-step manner: the interval $[t_0, t_f]$ is divided into subintervals $[t_0, t_1], [t_1, t_2], \ldots, [t_{N-1}, t_N]$, where $t_N = t_f$, and approximations $y_1, y_2, \ldots, y_N$ to the solution at the end of each subinterval are computed in a so-called integration step. The length of the nth integration subinterval $[t_{n-1}, t_n]$ is called the stepsize. The degree to which the numerical solution obtained by a method corresponds to the true solution is given by the order. Supposing a fixed stepsize h, if the error committed in one step is $O(h^{p+1})$, so that the error accumulated over the whole interval is $O(h^p)$, then the order of the method is p.

The numerical methods used to solve IVPs can be categorized as implicit or explicit (2) according to whether or not they require the solution of a non-linear system of equations at every integration step. Since the solution of a non-linear system is computationally expensive, explicit methods are preferred whenever they can be used with reasonable stepsizes. The computational cost of the implicit methods, however, may be justifiable, since for certain IVPs (stiff IVPs) an implicit method is much more efficient than comparable explicit ones.

Stiff IVPs (2) are an important class of IVPs which arises when the underlying physical system has components with widely varying time scales. For example, stiff IVPs can arise in the modelling of gas mixtures in which the time scales of the reactants are much smaller than the movement times. Historically, there seems to have been some disagreement about how to define stiffness mathematically (Chapter 6 of Ref. 2 deals with this question in detail). Stiff IVPs are characterized by the fact that the numerical


solution of slow smooth movements is considerably perturbed by rapid nearby solutions. The stability (4, 6) of a numerical method is an important property which characterizes the ability of the method to manage these perturbations suitably during their evolution in time. Stiff IVPs impose severe stability demands on the numerical methods of solution. Since this requirement is only satisfied by some implicit methods, one commonly adopted definition considers stiff equations to be IVPs where explicit methods do not work. (4)

The solution of large stiff ODE systems is indispensable for modelling a wide variety of time-dependent processes in science and engineering (4, 5, 7) (pollution models, chemical kinetics, biological modelling, combustion and nuclear reactions, control systems, electrical circuits, etc.). Typically, stiff IVPs appear in the analysis of steady states of evolutionary problems. In this case, solutions consist of a smooth or slowly varying part (steady component) F(t) and a rapidly decaying part (transient component) G(t), so that y(t) = F(t) + G(t). After a short period of time, the transient component fades away, leaving only the smooth component. Therefore, there are two phases: a transient phase in which the transient component is still active, and a smooth phase in which the numerical method must compute the smooth component while suppressing the transient.

In the transient phase, a small stepsize must be taken to follow the solution and ensure accuracy. In this phase, explicit and implicit methods work in a similar way, and so an explicit method should be used to keep computational costs at a minimum. Only in the smooth phase can the problem be said to be stiff. In this phase, explicit methods are not practical because very small stepsizes must be used (for linear stability reasons) despite the smoothness of the solution. Therefore, stable implicit methods, which demand a great deal of computing power (because they involve the solution of a non-linear system at every integration step), must be used in the smooth phase.
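To make this two-phase behaviour concrete, the following minimal Python sketch (a standard textbook experiment, not taken from the paper) integrates the stiff test problem $y'(t) = \lambda (y - \cos t) - \sin t$ with $\lambda = -1000$ and $y(0) = 1$, whose solution is the smooth function $y(t) = \cos t$. The explicit Euler method is stable only for stepsizes $h < 2/|\lambda| = 0.002$, whereas the implicit (backward) Euler method tracks the smooth component accurately with much larger steps:

    import numpy as np

    lam = -1000.0  # stiff decay rate of the transient component (made-up test problem)

    def f(t, y):
        # y'(t) = lam*(y - cos t) - sin t; the solution through y(0) = 1 is y(t) = cos t
        return lam * (y - np.cos(t)) - np.sin(t)

    def explicit_euler(h, tf=1.0):
        t, y = 0.0, 1.0
        while t < tf:
            y += h * f(t, y)   # amplification factor 1 + h*lam: unstable if h > 2/|lam|
            t += h
        return y

    def implicit_euler(h, tf=1.0):
        # Backward Euler; for this linear problem the implicit equation has a
        # closed-form solution: y_new = (y - h*lam*cos(t_new) - h*sin(t_new)) / (1 - h*lam)
        t, y = 0.0, 1.0
        while t < tf:
            t += h
            y = (y + h * (-lam * np.cos(t) - np.sin(t))) / (1.0 - h * lam)
        return y

    for h in (0.1, 0.01, 0.001):
        print(h, explicit_euler(h), implicit_euler(h))
    # Explicit Euler blows up for h = 0.1 and h = 0.01; implicit Euler stays close to cos(1).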


The enormous computing power required to solve stiff IVPs can be achieved by using efficient parallel algorithms on Distributed-Memory Parallel Machines (DMPMs). However, most of the activity in parallel methods for stiff ODEs has been concentrated on the exploration of promising new numerical schemes, rather than on the implementation of efficient, reliable and robust mathematical software for DMPMs.

There is a strong connection between the algorithms used to solve stiff IVPs and linear algebra techniques, because the numerical solution of the non-linear systems which arise when an implicit method is applied requires efficient linear algebra routines. Thus, if the system is stiff and large, enhanced performance on DMPMs can be gained through the use of parallel linear algebra techniques. (5) Consequently, the numerical methods used to solve stiff ODEs exhibit two different levels of potential parallelism: (8)

• task parallelism, owing to the fact that these methods can be decomposed into submethods which can be executed independently by disjoint groups of processors, and
• data parallelism, because the most basic submethods of the decomposition are typically linear algebra computations which are susceptible to parallelization following an SPMD style (matrix-vector products, factorizations, etc.).

Stiff ODE solvers exist that are designed and implemented for distributed-memory multiprocessor architectures. One of the most outstanding solvers is the ParSODES package, (9) which implements fully Implicit Runge–Kutta (IRK) methods using MPI. (10) However, it does not exploit the potential data parallelism arising from the linear algebra operations, because its design and implementation are based on sequential versions of the linear algebra libraries BLAS (11) and LAPACK. (12)

The parallel design of a numerical algorithm to solve IVPs must be structured to exploit three factors:

• The task and data parallelism exhibited by the numerical method.
• The structure of the problem itself (dense, banded, tridiagonal, sparse, etc.). There are many special classes of IVPs which have their own peculiar attributes, and the design and implementation of a parallel algorithm may vary drastically depending on the structure of the problem. For example, time-dependent parabolic partial differential equations in which the spatial variables are discretized lead to large stiff ODE systems which are block-banded, where the bandwidth depends on the dimension of the underlying partial differential equation and on the finite difference scheme. The use of parallel linear algebra routines specialized to deal with the particular structure of the matrices is of crucial importance for the performance of the parallel solution of these stiff ODE systems.
• The characteristics of the particular parallel architecture.

The Group Single Program Multiple Data (GSPMD) (8) programming model is especially suitable for exploiting the multilevel parallelism of these applications on DMPMs. In a GSPMD computation, several independent subprograms are executed by independent groups of processors in parallel (see Fig. 1). A relevant methodological approach for the derivation of GSPMD programs from existing SPMD modules is presented in Ref. 8.
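As a minimal illustration of the GSPMD model (an mpi4py sketch, not code from the paper; the static two-group schedule is invented for the example), the global communicator can be split into disjoint process groups, each of which executes a different SPMD computation:

    from mpi4py import MPI

    def run_spmd_module(comm, name):
        # Placeholder SPMD computation: every process of the group contributes its
        # rank and the group reduces the values (stands in for a basic module Ci).
        return name, comm.allreduce(comm.Get_rank(), op=MPI.SUM)

    world = MPI.COMM_WORLD
    rank, size = world.Get_rank(), world.Get_size()

    # Hypothetical schedule: the first half of the processes form one group, the
    # rest form another; the two groups run independent basic modules in parallel.
    color = 0 if rank < max(size // 2, 1) else 1
    group = world.Split(color=color, key=rank)

    result = run_spmd_module(group, "C1" if color == 0 else "C2")

    # The groups synchronize only where the data dependencies between modules require it.
    world.Barrier()
    print("process", rank, "->", result)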

Fig. 1. Parallel program implemented in a GSPMD style. [Figure: SPMD basic computations C1–C7, connected by data dependencies, are mapped onto processor groups G1–G4 of a distributed memory architecture.]

This approach, termed TwoL, allows the exploitation of the two levels of parallelism exhibited by many numerical methods and facilitates the appropriate reuse of modules of parallel linear algebra libraries. (13–15) The TwoL approach has also been applied to the implementation of ODE integrators on multicomputers. (16, 17) Moreover, this approach may benefit from explicit constructs to:

• maintain and select among multiple implementations of an operation in an elegant way. This characteristic, the so-called performance polymorphism, (18) is very desirable to improve flexibility in performance tuning during the design of a GSPMD program, because it makes it possible to improve the performance of component-based numerical software by selecting among several implementations of the submethods.
• deal with different IVP structures in order to exploit the particular structure of the matrices and obtain substantial performance improvements.

The present work focuses on the component-based development of GSPMD software for the numerical solution of large stiff ODE systems on clusters of computers. In particular, we examine the parallelization of the class of one-step Radau IIA IRK methods. (4) These methods combine high accuracy with excellent stability properties, and it is also possible to derive parallel numerical schemes of this class with a large amount of potential parallelism at the two levels mentioned above. In order to maintain flexibility and reusability when deriving parallel implementations of ODE solution methods,


we propose an extension of the TwoL approach which has explicit constructs for performance polymorphism (18) and takes into account the exploitation of the particular structure of the problems to be solved. As a consequence, the methodological proposal includes three clearly defined methodological steps in which optimizations of different types can be carried out:

(a) The first step produces a generic and reusable description of the functionality and potential task parallelism of the method.
(b) The second step allows the adaptation of this description to exploit the structure of the problem to be solved, by using a specialization mechanism, in a machine-independent way.
(c) Finally, the third step takes into account both the parameters of the problem and those of the target machine in order to take all the parallel design decisions which affect the performance of the final program.

On the basis of our methodological proposal, we perform optimizations on an advanced numerical method to solve stiff ODEs and derive efficient distributed implementations of this method on a PC cluster. (19) This stiff ODE solver makes effective use of existing parallel linear algebra software components and exploits the two levels of potential parallelism of the numerical method as well as the structure of the ODE systems. The speedups obtained show that it is possible to derive efficient cluster implementations of an advanced stiff ODE solver by combining previously existing SPMD parallel linear algebra components.

The algorithmic description of the particular numerical scheme is the result of several decisions described in Section 2. Section 3 briefly introduces the TwoL approach (8) for the derivation of GSPMD numerical programs. Our methodological proposal is introduced and applied to the component-based derivation of an efficient stiff ODE solver for a PC cluster in Sections 4 and 5. Section 6 presents and analyses the experimental results obtained by executing the solver for several test problems with different structures and dimensions; the results show satisfactory speedups with respect to the sequential solver RADAU5. (4) Finally, some conclusions are drawn and further work is briefly outlined in Section 7.

2. THE NEWTON-PILSRK METHOD

Although the integration of IVPs is considered to be inherently sequential, there exist several approaches to the parallel solution of IVPs. (5, 20) One of the most interesting consists of modifying sequential


algorithms in order to exploit the parallelism within an integration step (parallelism across the method (5, 20)). The fully Implicit Runge–Kutta (IRK) methods are especially suitable for solving stiff IVPs because they exhibit good stability and convergence properties. (4, 2) However, they demand enormous computational power because a large non-linear system of equations must be solved at every integration step. An advanced numerical method to achieve parallelism across an IRK method is described in Ref. 21. This method, called the Newton-Parallel Iterative Linear System Solver for Runge–Kutta methods (abbreviated as Newton-PILSRK), is based on the inclusion of a Parallel Iterative Linear Solver for the Newton systems that arise when an IRK method of type Radau IIA (4) is applied. An implementation of this numerical scheme on a four-processor shared-memory CRAY C98/4256 has been obtained, (22) but this implementation is centred only on the exploitation of the task parallelism of the numerical scheme and does not work on distributed memory multiprocessors.

We have carried out optimizations of the Newton-PILSRK method, and efficient distributed implementations which exploit task and data parallelism have been derived. The selection of the parameters which define the method exactly is considered in Section 2.3. As a result, we obtain an algorithm which describes the main sources of task parallelism in the method.

2.1. Implicit Runge–Kutta Methods

The Runge–Kutta methods (2) are one-step numerical methods to solve IVPs, which approximate the solution by averaging several approximations obtained at intermediate points (called stages) of an integration step. A particular Runge–Kutta method with s stages can be conveniently represented by a matrix $A \in \mathbb{R}^{s \times s}$ and two vectors $b, c \in \mathbb{R}^s$. The matrix A, called the Runge–Kutta matrix, is the main parameter of a particular Runge–Kutta method. The vector $c = (c_1, \ldots, c_s)^T$ is called the abscissa vector and contains the values $c_i$ which determine where the solution is approximated in each stage. The vector $b = (b_1, \ldots, b_s)$ contains the weights of the Runge–Kutta method. When these parameters are applied to an IVP-ODE given by

$$y'(t) = f(t, y), \qquad y(t_0) = y_0 \in \mathbb{R}^d, \qquad t \in [t_0, t_f]$$

one has the following numerical scheme, which describes, using tensor notation, (5) how to obtain the vector $y_n \in \mathbb{R}^d$ which approximates the


solution at the point $t_n$ from the approximation $y_{n-1}$ obtained in the previous integration step:

$$Y_n = (\mathbb{1} \otimes I_d)\, y_{n-1} + h_n (A \otimes I_d)\, F(Y_n), \qquad n = 1, \ldots, N$$
$$y_n = y_{n-1} + h_n (b^T \otimes I_d)\, F(Y_n) \tag{2}$$

Here, the vector Yn ¥ R s · d is the so-called stage vector whose d-dimensional components Yn, i approximate y(tn − 1 +ci hn ), i.e.,

|} Y Tn, 1

Yn =

Y Tn, 2 x

where Yn, i % y(tn − 1 +ci hn ) ¥ R d,

i=1,..., s

Y Tn, s 1 stands for the s-dimensional unit vector (1,..., 1) T, Id denotes the identity matrix of dimension d × d and F(Yn ) means the componentwise f-evaluation of Yn , i.e.:

|

f(tn − 1 +c1 hn , Yn, 1 ) T f(tn − 1 +c2 hn , Yn, 2 ) T

F(Yn )=

x

}

¥ Rs·d

f(tn − 1 +cs hn , Yn, s ) T hn is the stepsize, i.e., the length (hn =tn − tn − 1 ) of the nth integration subinterval [tn − 1 , tn ]. The symbol é denotes the direct product of matrices which is defined as:

$$A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B \\ a_{21}B & a_{22}B & \cdots & a_{2n}B \\ \vdots & \vdots & & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{pmatrix}$$

where $A = (a_{ij}) \in \mathbb{R}^{m \times n}$ and B is a matrix of arbitrary dimension.

The structure of the matrix A is crucial to how easy the system (2) is to solve. If A is strictly lower triangular, one has explicit Runge–Kutta methods and $Y_n$ is easily computed. Otherwise, we have implicit Runge–Kutta methods. If A is lower triangular, one has Diagonally Implicit Runge–Kutta (DIRK) methods, and the solution of s non-linear systems with


dimension d is required. When A is a full matrix, we have fully Implicit Runge–Kutta (IRK) methods, which require the solution of an $s \cdot d$-dimensional non-linear system. We concentrate on the latter type of methods because they are more powerful than the DIRK methods and it is possible to achieve very stable parallel methods of this class. Specifically, we restrict our parallelization to the Radau IIA IRK methods because they possess a high order (the order of an s-stage Radau IIA method is $2s - 1$), stiff accuracy (20) and excellent stability behaviour. (23)

If we apply an s-stage Radau IIA IRK method to approximate the solution of (1), we have to solve, at every integration step, a system of $s \cdot d$ non-linear equations. This system is usually solved by the modified Newton iteration, which yields a sequence $Y_n^{(0)}, Y_n^{(1)}, Y_n^{(2)}, \ldots$ defined by

$$(I_{s \cdot d} - A \otimes h_n J_n)(Y_n^{(j)} - Y_n^{(j-1)}) = R(Y_n^{(j-1)}), \qquad j = 1, 2, \ldots, m \tag{3}$$

where $J_n$ is an approximation to the Jacobian of f at $(t_{n-1}, y_{n-1})$, and

$$R(X) = h_n (A \otimes I_d)\, F(X) + (E \otimes I_d)\, Y_{n-1} - X, \qquad \forall X \in \mathbb{R}^{s \cdot d}$$

where $E_{s \times s} = [\,\mathbf{0} \ \cdots \ \mathbf{0} \ \ \mathbb{1}\,]$ is the $s \times s$ matrix whose last column contains ones and whose remaining entries are zero (so that $(E \otimes I_d)\, Y_{n-1}$ replicates the last d-block of $Y_{n-1}$ in every stage block).

To obtain an approximation of $Y_n^{(0)}$, we can use a simple formula called the predictor. (22) Expression (3) is called the corrector. The integer m in (3) must be as large as is necessary to make $Y_n^{(m)}$ sufficiently close to the true solution of (2). When a Radau IIA IRK method (4) is applied, the definitive approximation of $y(t_n)$ is the last d-dimensional block of the vector $Y_n^{(m)}$, i.e., $y_n = Y_{n,s}^{(m)}$.
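The following NumPy sketch illustrates the modified Newton iteration (3) on a small invented problem (it is not the paper's implementation; for brevity it uses the 2-stage Radau IIA coefficients, a linear test function, and a fixed number of iterations instead of a convergence test). The frozen iteration matrix $I_{s \cdot d} - A \otimes h_n J_n$ is assembled with a Kronecker product and factorized once per step:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    d, s, hn = 3, 2, 0.01
    A = np.array([[5/12, -1/12],          # 2-stage Radau IIA Runge-Kutta matrix
                  [3/4,   1/4]])
    c = np.array([1/3, 1.0])              # abscissa vector
    E = np.zeros((s, s)); E[:, -1] = 1.0  # last column of ones: replicates y_{n-1}

    def f(t, y):                          # made-up mildly stiff test function
        return -50.0 * (y - np.sin(t))

    def F(t_prev, Y):                     # componentwise f-evaluation of the stage vector
        return np.concatenate([f(t_prev + c[i]*hn, Y[i*d:(i+1)*d]) for i in range(s)])

    t_prev, y_prev = 0.0, np.zeros(d)
    Y_prev = np.tile(y_prev, s)           # previous stage vector Y_{n-1} (trivial predictor)
    Jn = -50.0 * np.eye(d)                # frozen Jacobian of f at (t_{n-1}, y_{n-1})

    M = np.eye(s*d) - np.kron(A, hn*Jn)   # I - A (x) h_n J_n, factorized only once
    lu = lu_factor(M)

    def R(X):                             # residual of the stage equations (2)
        return hn * np.kron(A, np.eye(d)) @ F(t_prev, X) \
               + np.kron(E, np.eye(d)) @ Y_prev - X

    Y = Y_prev.copy()                     # Y^{(0)}
    for j in range(5):                    # m = 5 fixed iterations in this sketch
        Y = Y + lu_solve(lu, R(Y))
    print("residual norm:", np.linalg.norm(R(Y)))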

2.2. A Parallel Iterative Linear System Solver for IRK Methods

A linear system of dimension $s \cdot d$ arises in every iteration of (3). A traditional approach to reduce the computational costs associated with the solution of these systems is based on transforming them into block-diagonal form (each block corresponding to an eigenvalue of the matrix A) by applying a similarity transformation. (22) By applying this approach, it is possible to decouple the linear systems into s independent subsystems of dimension d if the Runge–Kutta matrix A possesses only real positive eigenvalues. Unfortunately, the matrix A of powerful methods like Radau IIA has complex eigenvalues, and we would have to solve a d-dimensional linear system for each real eigenvalue of A and a 2d-dimensional system for each complex conjugate pair of eigenvalues. Although this approach is not


suitable for a parallel solution because of its limited task parallelism, it has been successfully applied to the 3-stage Radau IIA Runge–Kutta method to obtain the sequential stiff ODE solver RADAU5. (4)

A relevant approach for our purposes is proposed in Ref. 21. This approach is a mixture of iterative and direct algorithms to solve the systems arising in the modified Newton iteration (3) with frozen Jacobian. The basic idea is to decompose the matrix A as $B + (A - B)$ and then bring the term containing $A - B$ to the right-hand side of the equation. Finally, a second iterative method (a fixed-point iteration (24)) can be applied to (3) to compute $Y_n^{(j)}$, yielding:

$$(I_{s \cdot d} - B \otimes h_n J_n)\, \Delta Y^{(j,v)} = ((A - B) \otimes h_n J_n)(Y_n^{(j,v-1)} - Y_n^{(j-1)}) + R(Y_n^{(j-1)}), \qquad v = 1, \ldots, r \tag{4}$$

where $\Delta Y^{(j,v)} = Y_n^{(j,v)} - Y_n^{(j-1)}$, the value of r depends on the convergence behaviour of the iteration, $Y_n^{(j,0)} = Y_n^{(j-1,r)}$, and $Y_n^{(j,r)}$ is accepted as the solution $Y_n^{(j)}$ of (4) and as the jth iterate of (3). The selection of the matrix B which approximates A is performed in Ref. 21 so as to fulfil the following requirements:

1. B is nondefective (it can be transformed to a diagonal matrix by similarity transformations) and has only positive real eigenvalues.
2. The transient component of the solution is removed from the iteration error in the second iteration.
3. The inner iterations (the PILSRK method) converge whenever the eigenvalues of A and $J_n$ are in the positive and nonpositive half-planes, respectively.
4. The spectral radius of the iteration-error-amplification matrix is minimized. (21)

Equation (4) can be reorganized to obtain a Parallel Iterative Linear System Solver for IRK methods (the PILSRK method), which is defined by the following iteration:

$$(I_{s \cdot d} - B \otimes h_n J_n)(Y_n^{(j,v)} - Y_n^{(j,v-1)}) = (I_{s \cdot d} - A \otimes h_n J_n)(Y_n^{(j-1)} - Y_n^{(j,v-1)}) + R(Y_n^{(j-1)}), \qquad v = 1, \ldots, r \tag{5}$$

The composition of (3) and (5) is termed the Newton-PILSRK method. (21) In this composition, $Y_n^{(m,r)}$ is accepted as the solution $Y_n$ of the non-linear system (2).


Because B is nondefective and has positive real eigenvalues, it is similar to a diagonal matrix $T = \mathrm{diag}(d_1, d_2, \ldots, d_s)$ with positive entries. Therefore $B = STS^{-1}$, where S is a real, regular matrix, and it is possible to diagonalize (5) by the similarity transformation $Y_n^{(j,v)} = (S \otimes I_d)\, X^{(j,v)}$ to obtain another form of the Newton-PILSRK method:

$$(I_{s \cdot d} - T \otimes h_n J_n)\, \Delta X^{(j,v)} = D^{(v-1)} + R, \qquad v = 1, \ldots, r, \quad j = 1, \ldots, m \tag{6}$$

where:

• $D^{(v-1)} = (I_{s \cdot d} - L \otimes h_n J_n)(X^{(j-1)} - X^{(j,v-1)})$, with $L = S^{-1}AS \in \mathbb{R}^{s \times s}$,
• $R = h_n (S^{-1}A \otimes I_d)\, F(Y_n^{(j-1)}) + (S^{-1}E \otimes I_d)\, Y_{n-1} - X^{(j-1)}$,
• $\Delta X^{(j,v)} = X^{(j,v)} - X^{(j,v-1)}$,
• $X^{(j,0)} = X^{(j-1)}$ (so that $D^{(0)} = 0$), $X^{(j)} = X^{(j,r)}$, $Y_n = (S \otimes I_d)\, X^{(m)}$,
• $X^{(0)} = (S^{-1} \otimes I_d)\, P(Y_{n-1})$, where $P(\cdot)$ denotes a predictor formula.

Since the $sd \times sd$ matrix $I_{s \cdot d} - T \otimes h_n J_n$ is block diagonal with blocks of dimension $d \times d$, the system which arises in (6) is equivalent to s independent subsystems of dimension d:

$$\begin{aligned} (I_d - d_1 h_n J_n)\, \Delta X_1^{(j,v)} &= D_1^{(v-1)} + R_1 \\ (I_d - d_2 h_n J_n)\, \Delta X_2^{(j,v)} &= D_2^{(v-1)} + R_2 \\ &\;\;\vdots \\ (I_d - d_s h_n J_n)\, \Delta X_s^{(j,v)} &= D_s^{(v-1)} + R_s \end{aligned}$$

where, for every vector $V \in \mathbb{R}^{s \cdot d}$, $V_i$ denotes the ith d-block of $V = (V_1^T, \ldots, V_s^T)^T$ and $d_i$ denotes the ith entry of the diagonal matrix T. A graphical representation of this partitioning with s=4 stages is shown in Fig. 2.

Fig. 2. Partitioning of a Newton-PILSRK system into s independent subsystems when s=4.
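A minimal NumPy sketch of this decoupling (dimensions, Jacobian, and right-hand side are invented for the example): because $I_{s \cdot d} - T \otimes h_n J_n$ is block diagonal, one d-dimensional factorization and solve per stage suffices, and the s solves are mutually independent, so in the parallel solver each one can be assigned to a different processor group.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    d, s, hn = 5, 4, 0.01
    rng = np.random.default_rng(0)
    Jn = -np.eye(d) + 0.1 * rng.standard_normal((d, d))   # stand-in Jacobian
    T_diag = np.array([0.3, 0.5, 0.7, 0.9])               # stand-in eigenvalues of B
    rhs = rng.standard_normal(s * d)                      # stand-in D^{(v-1)} + R

    # Solve the s independent d-dimensional systems (I_d - d_i*h_n*J_n) dX_i = rhs_i.
    dX = np.empty(s * d)
    for i in range(s):
        lu = lu_factor(np.eye(d) - T_diag[i] * hn * Jn)
        dX[i*d:(i+1)*d] = lu_solve(lu, rhs[i*d:(i+1)*d])

    # Sanity check against the full (s*d)-dimensional block-diagonal system.
    M = np.eye(s * d) - np.kron(np.diag(T_diag), hn * Jn)
    assert np.allclose(M @ dX, rhs)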

2.3. A Particular Parallel Numerical Algorithm to Solve Stiff IVPs

We have applied the Newton-PILSRK scheme to the Radau IIA IRK method with s=4 stages. (4) With four stages it is possible to obtain an acceptable degree of task parallelism, and this IRK method has order $2s - 1 = 7$, which is very suitable for most real applications. (4, 2) After a suitable selection of B with s=4, and after performing the similarity transformation, we can identify three main sources of task parallelism in the method:

(a) The solution of a linear system of dimension $s \cdot d$ with coefficient matrix $I_{s \cdot d} - A \otimes h_n J_n$ is replaced by the solution of s independent systems of dimension d, with matrices $I_d - d_i h_n J_n$.

(b) Since the matrix $L = S^{-1}AS$, which is necessary to compute $D^{(v-1)}$ in (6), is block diagonal with two blocks (due to the complex eigenvalues of A), the computations in which L appears can be decoupled into two independent computations on different blocks of L.

(c) The evaluation of $F(Y^{(j-1)})$ is equivalent to s independent evaluations of f.

The method also exhibits substantial data parallelism, mainly due to linear algebra operations (such as the solution of d-dimensional systems, matrix-vector products, computation of the Jacobian matrix, etc.) which can be implemented following a data-parallel style.

In order to define a particular numerical algorithm, several details have to be determined:

• The number of iterations of the PILSRK method, i.e., the value of r in (5). We consider only two inner iterations (r=2) because this


is sufficient for convergence with our choice of the matrix B. (21) With r=2, the PILSRK scheme reduces to two consecutive linear systems with the same coefficient matrix (a small sketch of this two-solve structure is given below), which can be written as:

$$\begin{aligned} LU \cdot \Delta X^{(j,1)} &= R && \text{(first linear system, } v=1) \\ LU \cdot \Delta X^{(j,2)} &= R - M\, \Delta X^{(j,1)} && \text{(second linear system, } v=2) \\ X^{(j)} &= X^{(j-1)} + \Delta X^{(j,1)} + \Delta X^{(j,2)} \end{aligned}$$

where $LU = I_{s \cdot d} - T \otimes h_n J_n$ is the coefficient matrix of the two linear systems and $M = I_{s \cdot d} - L \otimes h_n J_n$.

• The value of m in (6). In our case, it is determined dynamically by using a convergence control mechanism for the Newton process similar to that used in the sequential stiff ODE solver RADAU5. (4)

• The initial approximation of the stage vector in every integration step ($Y_n^{(0)}$). To obtain $Y_n^{(0)}$, we use the fourth-order extrapolation predictor based on the previous stage vector, which is defined by $Y_n^{(0)} = (P \otimes I_d)\, Y_{n-1}$, (22) with $P \in \mathbb{R}^{s \times s}$.

The pseudocode description of the numerical scheme which results from these decisions is presented in Table I. In this description, the main computational steps are labelled (i1), (i2), etc.
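A small SciPy sketch of the two-solve structure mentioned above (invented data; the full scheme appears in Table I below): the coefficient matrix is factorized once per Newton iterate and the factorization is reused for both inner solves.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    sd = 8                                   # stand-in for s*d
    rng = np.random.default_rng(1)
    LU_mat = np.eye(sd) + 0.1 * rng.standard_normal((sd, sd))  # stand-in for I - T (x) h_n J_n
    M = np.eye(sd) + 0.1 * rng.standard_normal((sd, sd))       # stand-in for I - L (x) h_n J_n
    R = rng.standard_normal(sd)
    X_prev = rng.standard_normal(sd)         # X^{(j-1)}

    lu = lu_factor(LU_mat)                   # one factorization per Newton iterate
    dX1 = lu_solve(lu, R)                    # first solve  (v = 1)
    dX2 = lu_solve(lu, R - M @ dX1)          # second solve (v = 2) reuses the factors
    X_new = X_prev + dX1 + dX2               # X^{(j)}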

Table I. Description of the Newton-PILSRK Method Algorithm

    (i1)   Y_0 = (1 ⊗ I_d) y_0
           for n = 1, ..., N {                    (N is the number of integration steps)
    (i2)     J_n ≈ ∂f/∂y (y_{n-1}, t_{n-1});
    (i3)     parfor i = 1, 2    { M_i = I_{2d} − L_i ⊗ h_n J_n }
    (i4)     parfor i = 1,...,4 { LU_i = I_d − d_i h_n J_n }
    (i5)     W = (S^{-1}E ⊗ I_d) Y_{n-1};  Y_n^{(0)} = (P ⊗ I_d) Y_{n-1};  X^{(0)} = (S^{-1}P ⊗ I_d) Y_{n-1};
             for j = 1, 2, ..., m {
    (i6)       parfor i = 1,...,4 { F_i = f(t_{n-1} + c_i h_n, Y_{n,i}^{(j-1)}) }
    (i7)       R = h_n (S^{-1}A ⊗ I_d) F + W − X^{(j-1)};
    (i8)       parfor i = 1,...,4 { ΔX_i^{(j)} = LU_i^{-1} R_i }
    (i9)       parfor i = 1, 2    { R_{i,2d} = R_{i,2d} − M_i ΔX_{i,2d}^{(j)} }
    (i10)      parfor i = 1,...,4 { ΔX_i^{(j)} = ΔX_i^{(j)} + LU_i^{-1} R_i }
    (i11)      X^{(j)} = X^{(j-1)} + ΔX^{(j)};  Y_n^{(j)} = Y_n^{(j-1)} + (S ⊗ I_d) ΔX^{(j)}
             }
             Y_n = Y_n^{(m)};  y_n = last d-block of Y_n
           }
           y_f = y_N

Notation: L_i is the ith 2×2 block of L; V_i is the ith d-block of V = (V_1^T, ..., V_4^T)^T, V ∈ R^{4d}; V_{i,2d} is the ith 2d-block of V = (V_{1,2d}^T, V_{2,2d}^T)^T, V ∈ R^{4d}.

3. DERIVATION OF GSPMD PROGRAMS: THE TWOL APPROACH

In Ref. 8, a relevant parallel programming methodology to derive structured parallel implementations of numerical methods from previously existing SPMD modules was presented. This approach, termed TwoL, is based on the GSPMD model and makes it possible to exploit the multilevel parallelism exhibited by many numerical methods. In TwoL, the derivation of a parallel program for a particular DMPM starts from a mathematical description of the numerical method to be implemented and proceeds in three steps (see Fig. 3).

Fig. 3. Derivation of a numerical parallel program in TwoL. [Figure: module composition (Section 3.1) turns the parallel numerical method and basic modules into a module specification capturing data dependencies and task parallelism; parallel design decisions (Section 3.2) — scheduling, load balancing and data distribution, driven by the problem size and architecture parameters — yield a parallel frame program; translation (Section 3.3) produces the final GSPMD program.]

3.1. Definition of the Module Specification for the Numerical Method

In this step, the programmer must specify the hierarchical structure of the numerical method by combining submethods which are realized by modules. The modules which appear in the decomposition can be basic or composed modules. The basic modules represent regular algebraic computations on homogeneous data structures that work internally following an SPMD style, according to a certain data distribution. These modules hide the potential data parallelism of the submethods. The composed modules are structured combinations of basic modules. The data dependencies between related modules are expressed by constructors of sequential composition, while constructors of concurrent composition are used to express the possibility of parallel execution. As a result of this step, we obtain a module specification which expresses the maximum degree of task parallelism of the method by making the data dependencies between modules explicit.
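TwoL's concrete specification notation is not reproduced in this excerpt; purely as an illustration (all names invented), a module specification with sequential and concurrent composition constructors could be modelled as follows, here for steps (i6)–(i8) of Table I:

    from dataclasses import dataclass
    from typing import List, Union

    @dataclass
    class Basic:            # basic module: an SPMD computation (data parallelism hidden)
        name: str

    @dataclass
    class Seq:              # sequential composition: expresses data dependencies
        steps: List["Module"]

    @dataclass
    class Par:              # concurrent composition: expresses potential task parallelism
        branches: List["Module"]

    Module = Union[Basic, Seq, Par]

    # Steps (i6)-(i8) of Table I: four independent f-evaluations, then the
    # assembly of R, then four independent d-dimensional solves.
    spec = Seq([
        Par([Basic(f"eval_f_{i}") for i in range(1, 5)]),    # (i6)
        Basic("assemble_R"),                                 # (i7)
        Par([Basic(f"solve_LU_{i}") for i in range(1, 5)]),  # (i8)
    ])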

3.2. Deriving the Parallel Frame Program

A parallel frame program is a module specification augmented with several parallel design decisions. These decisions attempt to adapt the module specification to the parameters of a particular architecture (taking into account a particular parallel programming model) and to the problem size, in order to achieve good performance. A brief description of these decisions is given below.

• Scheduling. The execution order of modules without data dependencies (concurrent composition) is determined. These modules can be executed in parallel on different groups of processors or consecutively on the same group.
• Load balancing. The number of processors used to execute each basic module is determined.
• Selection of the data distributions for the basic modules. The parameters of the basic modules which follow parameterized data distributions must be selected in order to achieve an optimal runtime solution. The cost associated with data redistributions between submodules must be taken into account, because locally optimal data distributions can give a non-optimal global solution. To cope with this problem, data distribution types and a dynamic programming algorithm for the automatic selection of optimal data distributions are proposed in the TwoL framework (25) (a simplified sketch of such a selection is given after this list).

These decisions cannot be taken independently because they are interrelated.
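The data-distribution selection can be viewed as a shortest-path problem over a chain of modules. The following sketch only illustrates the idea, with invented costs and distribution types; it is not the algorithm of Ref. 25:

    # exec_cost[m][d]: runtime of module m under distribution d (made-up numbers);
    # redist_cost[d1][d2]: cost of redistributing data from d1 to d2 between modules.
    dists = ["row-block", "col-block", "block-cyclic"]
    exec_cost = [
        {"row-block": 4.0, "col-block": 6.0, "block-cyclic": 5.0},   # module 0
        {"row-block": 7.0, "col-block": 3.0, "block-cyclic": 4.0},   # module 1
        {"row-block": 5.0, "col-block": 5.0, "block-cyclic": 2.0},   # module 2
    ]
    redist_cost = {a: {b: (0.0 if a == b else 1.5) for b in dists} for a in dists}

    # best[d]: minimal cost of the chain processed so far when it ends in distribution d.
    best = {d: exec_cost[0][d] for d in dists}
    choice = [{d: None for d in dists}]
    for m in range(1, len(exec_cost)):
        new_best, new_choice = {}, {}
        for d in dists:
            prev = min(dists, key=lambda p: best[p] + redist_cost[p][d])
            new_best[d] = best[prev] + redist_cost[prev][d] + exec_cost[m][d]
            new_choice[d] = prev
        best = new_best
        choice.append(new_choice)

    # Backtrack to recover the globally optimal sequence of distributions.
    end = min(dists, key=lambda d: best[d])
    seq = [end]
    for m in range(len(exec_cost) - 1, 0, -1):
        seq.append(choice[m][seq[-1]])
    print("optimal distributions:", list(reversed(seq)), "cost:", best[end])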


3.3. Obtaining the Parallel Implementation

The above decisions must be translated into a message-passing program for the target parallel machine. Ideally, portable implementations, using a standard procedural language augmented with the communication functions of a message-passing library which supports communication contexts, (26, 10) would be generated automatically, according to a syntax-directed translation scheme, because the important design decisions have already been taken.

4. INTEGRATING PERFORMANCE POLYMORPHISM INTO TWOL

When deriving GSPMD implementations for complex numerical applications like stiff ODE integration, the functionality of a submethod can be implemented by using different parallel numerical algorithms which differ only in their performance characteristics. Therefore, the derivation of these programs should explicitly support:

• the maintenance of several implementations for each submethod (from different libraries and vendors), and
• the selection of the best implementation for each context, without having to deal with the details of the respective implementations.

This desirable characteristic is called performance polymorphism. (18) The selection of the best implementation depends fundamentally on the target machine, the problem to be solved and the data distribution scheme of the application. One approach to achieve performance polymorphism is based on the definition of polyalgorithms. (27, 15) However, the polyalgorithmic approach involves the design of efficient heuristic decision routines to select the best algorithm at runtime (which introduces overheads), and the incorporation of new algorithms into an existing component interface requires a great deal of effort to modify these heuristic routines. The TwoL methodological approach may benefit from explicit constructs for performance polymorphism which also allow the exploitation of the problem structure. On the basis of component-based development approaches, (28, 29) a proposal to integrate performance polymorphism within the TwoL framework is presented here.
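As a toy illustration of what such explicit constructs could look like (a hypothetical Python API, not TwoL's actual notation), one concept can be bound to several realizations, with the selection driven by an explicit context instead of by runtime heuristics:

    from typing import Callable, Dict, Tuple

    # Hypothetical registry: one functional concept, several realizations, each
    # advertising the (matrix structure, machine class) context it is tuned for.
    _registry: Dict[str, Dict[Tuple[str, str], Callable]] = {}

    def realization(concept: str, structure: str, machine: str):
        def register(impl: Callable) -> Callable:
            _registry.setdefault(concept, {})[(structure, machine)] = impl
            return impl
        return register

    def select(concept: str, structure: str, machine: str) -> Callable:
        # Selection happens when the program is composed, not inside the solver.
        return _registry[concept][(structure, machine)]

    @realization("solve", structure="dense", machine="cluster")
    def solve_dense_cluster(A, b):
        ...   # e.g. wrap a distributed dense solver here

    @realization("solve", structure="banded", machine="cluster")
    def solve_banded_cluster(A, b):
        ...   # e.g. wrap a distributed banded solver exploiting the IVP structure

    solver = select("solve", structure="banded", machine="cluster")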


4.1. Types of Components

Software components are independently developed software units which can be composed without knowledge of how they have been implemented. (30) To use a component, only the knowledge of its specification is necessary, not its implementation. In order to integrate performance polymorphism into TwoL, we need to decouple the description of the functional behaviour of a linear algebra operation from the multiple implementations which provide the same functionality. We encapsulate each entity (descriptions and implementations) as a separate component, and a means of performance specification is provided to use the implementations and to guide the selection of the best one. The central idea of our methodological proposal is the construction of a TwoL basic module by linking two different kinds of components: concepts (descriptions) and realizations (implementations).

4.1.1. Concepts

A concept formally defines the functional properties of a basic module, including all functional aspects of the module interface. Assumptions regarding the parallel nature of the module and its performance should not be made. The operation signature must be presented together with a formal description of its functionality. A concept that denotes the computation of an approximation to the Jacobian of a function f at a point $(t, y) \in \mathbb{R}^{d+1}$ using forward differences is presented in Fig. 4. This operation frequently appears in numerical methods for the solution of stiff IVPs. The formal parameters of the operation denoted by a concept can have three abstract modes (IN, OUT, and INOUT):

    CONCEPT MJacobian (IN Integer d, Double t, y(d),
                       FUNCTION f FROM [Integer d, Double t, yo(d)] TO [Double y1(d)],
                       OUT Double Jn(d,d))
    ENSURES for all i,j: {x:Integer where ((x>=1) and (x

Fig. 4. The concept MJacobian.
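The functionality that MJacobian specifies can be sketched in NumPy as follows (a minimal forward-difference approximation written for this presentation, not one of the paper's realizations):

    import numpy as np

    def forward_difference_jacobian(d, t, y, f, eps=1e-8):
        # Approximate Jn[i, j] ~ d f_i / d y_j at (t, y) by forward differences.
        f0 = f(d, t, y)
        Jn = np.empty((d, d))
        for j in range(d):
            h = eps * max(1.0, abs(y[j]))   # scale the perturbation to y_j
            y_pert = y.copy()
            y_pert[j] += h
            Jn[:, j] = (f(d, t, y_pert) - f0) / h
        return Jn

    # Example with a linear function whose exact Jacobian is the matrix A itself:
    A = np.array([[-2.0, 1.0], [0.0, -3.0]])
    f = lambda d, t, y: A @ y
    print(forward_difference_jacobian(2, 0.0, np.array([1.0, 1.0]), f))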
