AS-180

Laboratoire d'Informatique Fondamentale de Lille


Publication

A new parallel adaptive block-based Gauss-Jordan algorithm N. Melab, S. Petiton and E-G. Talbi February 1998

© LIFL USTL

UNIVERSITÉ DES SCIENCES ET TECHNOLOGIES DE LILLE, U.F.R. d'I.E.E.A., Bât. M3, 59655 Villeneuve d'Ascq Cedex. Tél. 03 20 43 47 24, Télécopie 03 20 43 65 66

A new parallel adaptive block-based Gauss-Jordan algorithm

N. Melab, S. Petiton and E-G. Talbi
Laboratoire d'Informatique Fondamentale de Lille (LIFL-CNRS URA 369)
Université des Sciences et Technologies de Lille 1
59650 Villeneuve d'Ascq Cedex
E-mail: melab,petiton,[email protected] .fr
Tel +33 3 20 43 45 39, Fax +33 3 20 43 65 66

Abstract

Parallelism in adaptive execution environments requires a parallel adaptive programming methodology. In this paper we present such a methodology applied to the block-based Gauss-Jordan algorithm, used in numerical analysis to solve linear systems. The application includes a work scheduling strategy and is, to some extent, fault tolerant. It is implemented and evaluated with the MARS parallel adaptive programming environment. The results show that an absolute efficiency of 92% is achievable on a farm of DEC/ALPHA processors interconnected by a Gigaswitch network, that an absolute efficiency of 67% can be obtained on an Ethernet network of SUN-Sparc 4 workstations, and that on each of these networks perfect relative efficiency is often reached. Moreover, experiments on a network of heterogeneous machines show that the overhead induced by the management of adaptivity is small.

Keywords: Gauss-Jordan Method, Adaptive Parallelism, Networks of Workstations (NOWs), Heterogeneous Systems, Fault Tolerance.

1 Introduction

Over the past few years, parallel and distributed computing has seen the emergence of networks of workstations (NOWs) to the detriment of supercomputers. NOWs raise issues that have a great impact on parallel programming methodology. The main ones are the following: workstations are heterogeneous, i.e. they do not all have the same architecture and operating system; moreover, NOWs are dynamic environments, i.e. workstations frequently leave and enter the network due to availability or failure. These factors make it necessary to find new solutions to problems of parallelism such as fault tolerance [12]. It is also necessary to change the parallel programming methodology to take the adaptive nature of the applications into account; we refer to this as parallel adaptive programming. An application programmed according to this philosophy is called a parallel adaptive application (PAA), i.e. a parallel application whose processes vary in number and location as a function of the load (availability of processors, CPU time, etc.) of the machines.

In this paper, we present a parallel adaptive programming methodology and its experimentation on the block-based Gauss-Jordan numerical method. The latter is a direct method used in numerical analysis for solving linear systems A(n,n)·X(n) = Y(n), where A(n,n) is an invertible matrix of dimension n and X(n) and Y(n) are two vectors of dimension n. The method consists of inverting A(n,n) in order to compute X(n) = B(n,n)·Y(n), where B(n,n) = A(n,n)^{-1}. The methodology aims at:

- Exploiting the wasted time of the machines, i.e. as soon as a machine becomes available it is enlisted to execute some work of the application. That operation is referred to as an application unfold.
- Respecting machine ownership: when a machine is requisitioned by its owner, the part of the application's work running on it is pulled out. That operation is referred to as an application fold.
- Solving, to some extent, the fault tolerance problem: if a given machine fails, only the work running on it is re-started.

The principle of our parallel adaptive block-based Gauss-Jordan application is the following: a block server maintains the two matrices A and B (B initialized to the identity matrix). According to the availability or requisitioning of machines, it extracts blocks from these matrices and distributes or redistributes them among the machines. These blocks are used by workers created on available machines to compute new blocks, which are returned to the server, which in turn updates the matrices A and B. If a machine computing a block is requisitioned, the partial work is folded and later unfolded on another machine, which resumes the work at exactly its interruption point. Moreover, if a machine fails, its work is re-started.
In order to achieve the above objectives, we use the MARS system [6], a new programming support for parallel adaptive applications developed at LIFL (Université de Lille 1). The remainder of this paper is organized as follows. Section 2 briefly reviews work related to parallel scientific computing. Section 3 describes the MARS environment at the system and application levels. Section 4 presents our proposed fault-tolerant parallel adaptive Gauss-Jordan algorithm and its implementation with MARS. In Section 5 we show a performance evaluation of the algorithm. In Section 6 we draw conclusions and give some of our future investigations.

This work is a collaboration between three members of three different teams of LIFL: N. Melab from PALOMA, S. Petiton from MAP and E-G. Talbi from GOAL.


2 Parallel Scientific Computing

Long-running scientific applications have for many years been performed on expensive supercomputers such as the Paragon, MP-2, CM5 and SP2. Recently, this tradition has been questioned due to the emergence of networks of workstations (NOWs), which are becoming more and more competitive. The main reasons are the following: first, workstations are cheaper and more powerful than they used to be; second, networks have undergone a great technological evolution that makes them much faster; third, the current decade has seen the arrival of parallel programming environments such as PVM [5] and MPI [4], well suited to NOW programming, and of high-performance libraries such as ScaLAPACK [3] and BLAS [1].

Many parallel scientific applications [13] have been developed on supercomputers and NOWs. However, most of them have been designed according to classical parallelism, i.e. assuming a non-dynamic execution environment. As a consequence, especially on NOWs, these applications neither exploit all the power of the machines nor respect workstation ownership. In order to overcome these problems, adaptive parallelism is required. Furthermore, prior work suggests that the management of adaptivity need not be costly. For example, in [2] the authors show, with a multi-block Navier-Stokes solver implemented on a PVM NOW, that if the number of processors does not vary frequently, the data redistribution cost is small compared with the time required for the actual calculations. In [12], the authors experiment with a checkpointing algorithm for fault tolerance on a SUN NOW and an IBM-SP2; the results on the two systems are similar.

In this paper, we experiment with adaptive parallelism on a numerical application, the block-based Gauss-Jordan method. A classical parallel version of this application has already been developed in [11, 14].
In [11], the application is implemented on the MANU numerical machine, made up of 8 PEs (Processor Elements), 4 data migration controllers and a 3-level shared memory, interconnected by an OMEGA network. The method includes a real-time scheduler that schedules the tasks of the application according to a task graph determined by hand. In [14], the proposed method is implemented in the Occam language on a network of 8 transputers connected as a cube. It is based on block-row operations, i.e. a block row of the matrix to be inverted is sent to each transputer; during the computation, each processor produces a block of intermediate results and broadcasts it to all the transputers. In contrast to these versions, our parallel approach is adaptive, as stated in the introduction, and we implemented it on a network of heterogeneous workstations with the MARS environment. The latter has already been successfully used for parallel optimization problems [6]. It is presented in the next section.


3 MARS: a Programming Environment for Parallel Adaptive Applications

MARS is a multi-threaded environment for programming parallel adaptive applications. It is built (see figure 1) on top of PM2 [10], a parallel multi-threaded execution support which couples two libraries: the PVM communication library and a thread package called MARCEL [10].

(Figure: MARS sits on top of PM2, which couples PVM and MARCEL above UNIX.)

Figure 1: A macroscopic view of MARS

The main characteristics of MARS [6] are:

- Portability: MARS currently runs on Sparc/SunOS, Sparc/Solaris, Alpha/OSF-1 and PC/Linux.
- Respect of machine ownership: when a machine executing part of a MARS application is requisitioned by its owner, the MARS system folds this application.
- Fault tolerance: MARS integrates a fault tolerance mechanism that includes a checkpointing algorithm [7]. This characteristic is developed in paragraph 4.3.3.

Other interesting characteristics of MARS, such as dynamic reconfiguration, protection, transparency, scalability and ease of use, are described in [6].

3.1 MARS at system level: Architecture

As shown in figure 2, MARS has a hierarchical architecture with three levels. At the first level, a global scheduler, the GMS (Global MARS Scheduler), controls a set of group servers (GS) and ensures their cooperation. At the second level, each GS manages a cluster of nodes selected according to a given criterion, which can be the geography, the homogeneity, etc. of the nodes. At the third level of the hierarchy, each of these nodes executes a daemon named NM (Node Manager). The role of an NM is to inspect the state of its node in order to inform the GS on which it depends about state changes. The GS uses the collected node state change information to schedule the MARS applications executing in its cluster. This information can also be used by other GSs, through the GMS, to schedule their applications.

(Figure: the GMS controls several GSs; each GS manages the NMs of its cluster's nodes (Node 1 to Node n), alongside the AMM and WMs of a running application.)

Figure 2: Architecture of MARS

3.2 MARS at application level: Programming Methodology

The programming model of MARS is basically the classical SPMD model. A MARS application is composed of two modules: the AMM (Application Manager Module) and the WM (Worker Module). At runtime, the AMM is executed by one task, the master. The WM is executed by several tasks, the workers; their number depends on the number of available nodes (one worker per node). The role of the master task is to partition the whole application into work units and to distribute them among the workers. Each worker continually asks the master for work and sends back the results when it completes the work. As indicated above, since MARS respects machine ownership, a worker may be stopped when the machine on which it is running is requisitioned by its owner. Consequently, its work has to be folded and to remain on the master until a machine becomes available. At that moment, the (partial) work is unfolded on the available machine according to a scheduling strategy. The execution of the folded work must resume exactly at its stopping point; this requires specifying what a partial work is. In summary, programming a parallel adaptive application with MARS consists basically of:

- Defining what the work and the partial work are;
- Indicating how to fold the partial work, that is, specifying a folding handler (cleanup function) to be executed by MARS at folding time; this function is the interface between the application and the MARS system;
- Specifying how to unfold and resume the partial work, together with a scheduling strategy for (complete or partial) work;
- Writing the code of the two modules of the application: AMM and WM.

4 Fault Tolerant Parallel Adaptive Block-based Gauss-Jordan

4.1 Sequential Block-based Gauss-Jordan

Let A and B be two matrices of dimension n, with B the inverse of A. Let A and B be partitioned into q × q matrix blocks of a fixed dimension b (b = n/q). The sequential block Gauss-Jordan algorithm can then be written as follows:

Input: A, B = I(n,n), q.    /* I(n,n): identity matrix of dimension n */
Output: B = A^{-1}

For k = 1 to q do
    B^k_{k,k} = (A^{k-1}_{k,k})^{-1}
    For j = k+1 to q do                      /* loop 1: pivot row of A */
        A^k_{k,j} = (A^{k-1}_{k,k})^{-1} · A^{k-1}_{k,j}
    End For
    For j = 1 to k-1 do                      /* loop 2: pivot row of B */
        B^k_{k,j} = (A^{k-1}_{k,k})^{-1} · B^{k-1}_{k,j}
    End For
    For j = k+1 to q do                      /* loop 3: update of A */
        For i = 1 to q, i ≠ k do
            A^k_{i,j} = A^{k-1}_{i,j} - A^{k-1}_{i,k} · A^k_{k,j}
        End For
    End For
    For j = 1 to k-1 do                      /* loop 4: update of B */
        For i = 1 to q, i ≠ k do
            B^k_{i,j} = B^{k-1}_{i,j} - A^{k-1}_{i,k} · B^k_{k,j}
        End For
    End For
    For i = 1 to q, i ≠ k do                 /* loop 5: column k of B */
        B^k_{i,k} = -A^{k-1}_{i,k} · B^k_{k,k}
    End For
End For
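As a concrete sequential illustration, the recurrences above can be sketched in plain Python (matrices as lists of lists). The helper names are ours, the code ignores pivoting concerns, and it makes no attempt at the parallel or adaptive aspects discussed later:

```python
def mat_inv(M):
    """Invert a small dense matrix by scalar Gauss-Jordan (for pivot blocks)."""
    n = len(M)
    A = [row[:] for row in M]
    B = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n):
        p = A[k][k]
        for j in range(n):
            A[k][j] /= p
            B[k][j] /= p
        for i in range(n):
            if i != k:
                f = A[i][k]
                for j in range(n):
                    A[i][j] -= f * A[k][j]
                    B[i][j] -= f * B[k][j]
    return B

def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mat_sub(Z, T):
    return [[Z[i][j] - T[i][j] for j in range(len(Z[0]))] for i in range(len(Z))]

def mat_neg(X):
    return [[-v for v in row] for row in X]

def block_gauss_jordan(Ablk, q):
    """Ablk is a q x q array of b x b blocks; returns B = A^{-1} in block form."""
    A = [[[row[:] for row in Ablk[i][j]] for j in range(q)] for i in range(q)]
    b = len(A[0][0])
    ident = lambda: [[float(i == j) for j in range(b)] for i in range(b)]
    zero = lambda: [[0.0] * b for _ in range(b)]
    B = [[ident() if i == j else zero() for j in range(q)] for i in range(q)]
    for k in range(q):
        piv = mat_inv(A[k][k])                     # pivot inversion
        B[k][k] = piv
        for j in range(k + 1, q):                  # loop 1: pivot row of A
            A[k][j] = mat_mul(piv, A[k][j])
        for j in range(k):                         # loop 2: pivot row of B
            B[k][j] = mat_mul(piv, B[k][j])
        for j in range(k + 1, q):                  # loop 3: update of A
            for i in range(q):
                if i != k:
                    A[i][j] = mat_sub(A[i][j], mat_mul(A[i][k], A[k][j]))
        for j in range(k):                         # loop 4: update of B
            for i in range(q):
                if i != k:
                    B[i][j] = mat_sub(B[i][j], mat_mul(A[i][k], B[k][j]))
        for i in range(q):                         # loop 5: column k of B
            B[i][k] = mat_neg(mat_mul(A[i][k], piv)) if i != k else B[i][k]
    return B
```

Note that loop 3 uses the blocks A^k_{k,j} produced by loop 1 and loop 4 uses the blocks B^k_{k,j} produced by loop 2; these are exactly the dependences exploited in section 4.2.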

4.2 Parallelization

The parallelization of the sequential block-based Gauss-Jordan algorithm involves, as shown in figure 3, two kinds of parallelism: inter-iteration parallelism and intra-iteration parallelism. The first means that the iterations are partially or completely executed in parallel, which involves managing complex matrix-block dependences. Partial parallelism means that only the inversion of the pivot A^k_{k+1,k+1} of the next iteration is executed in anticipation; unlike complete parallelism, the remainder of the next iteration is triggered only at the end of the current iteration. Intra-iteration parallelism consists essentially of exploiting the parallelism induced by each of the five loops of the algorithm. It falls into two categories: inter-loop parallelism and intra-loop parallelism. Only intra-iteration parallelism is exploited in our implementation, so we develop it in the next sections.

(Figure: the parallelism splits into inter-iteration parallelism (pivot inversion only, or the whole iteration) and intra-iteration parallelism (inter-loop and intra-loop).)

Figure 3: The Gauss-Jordan parallelism

4.2.1 The inter-loops parallelism

When loops are executed in parallel, the dependences between them must be managed. In the sequential algorithm we identify two: the internal loop of the third loop depends on the first loop, because it uses the block A^k_{k,j} computed by the first loop; similarly, the internal loop of the fourth loop depends on the second loop, because it uses the block B^k_{k,j} computed by the second loop. Consequently, the parallel execution of internal loop j of the third (respectively fourth) loop is triggered only after the computation of the block A^k_{k,j} (respectively B^k_{k,j}).

4.2.2 The intra-loops parallelism

Intra-loop parallelism consists of executing each of the five loops of the algorithm in parallel; in particular, the internal loops of loops 3 and 4 are each run in parallel. The work units for these loops are of two kinds: matrix products and matrix triadics, i.e. respectively X·Y and Z - X·Y, where X, Y and Z are matrix blocks. In the implementation, in order to control the loops correctly, each work unit must carry the number of the loop it belongs to. Moreover, a crucial criterion to take into account is the granularity of the work units: the size of the matrix blocks involved in each work unit must be such that its computation cost balances the communication cost of sending it to a remote node.

4.2.3 The parallel Gauss-Jordan algorithm

Like any MARS PAA, the Gauss-Jordan PAA is an SPMD application. It consists of two modules: AMM and WM. At runtime, the AMM is executed by the master task and the WM by the worker tasks. Figure 4 illustrates the architecture of the Gauss-Jordan PAA; the details of the figure are given in the following paragraphs. In the remainder of this article, we will use AMM interchangeably with master task, and WM with worker task.

(Figure: the master task extracts blocks from the matrices A and B and sends a work message to a worker task; the message carries the data blocks d1, d2 and d3, the resulting block descriptor b(1,i,j), the work type wtype (here MT) and the loop number l. The worker task sends back the result: b(1,i,j), wtype, l and the data res.)

Figure 4: The parallel Gauss-Jordan algorithm

4.2.3.1 The AMM module

The AMM module loads the two matrices A and B and then manages them until it obtains the final result. At the beginning of each iteration k (1 ≤ k ≤ q) of the algorithm, it computes the inverse of the current pivot A^{k-1}_{k,k}. It then controls and synchronizes the execution of the loops; an array of synchronization bits is used to manage the loop dependences described in paragraph 4.2.

The AMM module extracts matrix blocks from A and B to constitute the work units (matrix products and matrix triadics), which are sent to the WMs according to a receiver-initiated strategy. Each work sending is done by an LRPC, which creates a worker thread (WT) on a WM to execute the work unit. Once the unit is completed, the AMM uses the resulting matrix block to update the corresponding block of A or B. For example, figure 4 shows the AMM sending the work corresponding to the matrix triadic of loop number 4, i.e. B^k_{i,j} = B^{k-1}_{i,j} - A^{k-1}_{i,k} · B^k_{k,j}, and the WM sending back the resulting block B^k_{i,j}. The work components are the three data blocks B^{k-1}_{i,j} (d2), A^{k-1}_{i,k} (d1) and B^k_{k,j} (d3), the resulting block descriptor b, the type of work wtype (here Matrix Triadic (MT); the other type is Matrix Product (MP)) and the loop number l (for control and synchronization). The resulting block descriptor memorizes the identity of the block to update in A or B: its three components indicate which of the matrices must be updated (0 for A, 1 for B), the row number i and the column number j of the block to update.
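A work-unit message of this kind could be sketched as follows (a hypothetical illustration: the field and class names are ours, not those of the MARS implementation):

```python
from dataclasses import dataclass

# Work types, as in the paper: MT = matrix triadic Z - X.Y, MP = matrix product X.Y
MT, MP = "MT", "MP"

@dataclass
class BlockDescriptor:
    matrix: int  # which matrix to update: 0 -> A, 1 -> B
    i: int       # block row number
    j: int       # block column number

@dataclass
class WorkUnit:
    data: tuple                    # the operand blocks (d1, d2, d3)
    result_desc: BlockDescriptor   # where the resulting block goes
    wtype: str                     # MT or MP
    loop: int                      # loop number, for control/synchronization
```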

4.2.3.2 The WM module

As indicated above, work distribution follows the receiver-initiated strategy: each WM continually asks the AMM for work. If there is no work, the WM waits for WAIT_PERIOD (which we fixed at 10 seconds); otherwise, it receives the work to do and executes it, then sends the resulting block to the AMM. In figure 4, the result is characterized by its descriptor b, its data res, the work type wtype identifying the matrix to update, and the number l of the loop it belongs to.
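The WM's polling cycle can be sketched as follows (the callables stand in for the actual LRPC calls, which this sketch does not model):

```python
import time

WAIT_PERIOD = 10  # seconds, as fixed in the paper

def worker_loop(ask_master_for_work, execute, send_result, running):
    """Receiver-initiated work distribution: the WM repeatedly polls the
    AMM for work, sleeping WAIT_PERIOD when none is available."""
    while running():
        work = ask_master_for_work()
        if work is None:
            time.sleep(WAIT_PERIOD)   # no work available: wait and retry
        else:
            send_result(execute(work))
```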

4.3 MARS-based Adaptivity

Since MARS aims at respecting machine ownership and at exploiting wasted time, nodes are dynamically pulled out of and/or added to the virtual machine (NOW) executing the PAA. As a consequence, work must be dynamically stopped and resumed later. Then, as stated in paragraph 3.2, one has to define what the partial work is, how to fold/unfold it and how to schedule it.

4.3.1 The partial work

In addition to the components describing a complete work given above, i.e. b, l and wtype, other information must be kept to describe a partial work. This information is shown in bold in figure 5: the block descriptors b1, b2 and b3 of respectively B^{k-1}_{i,j} (d2), A^{k-1}_{i,k} (d1) and B^k_{k,j} (d3), and the fields I, J, tid and res. The block descriptors avoid storing the blocks d1, d2 and d3 themselves; the blocks are extracted from the matrices A and B using their descriptors at unfolding time. The parameters I and J designate the row and column numbers of the next element to compute (the resumption point) if a machine computing a matrix product is requisitioned. Note that a matrix product calculation can either be a matrix product work on its own or part of a matrix triadic work. Moreover, a triadic is never interrupted while the block subtraction is being computed: once X·Y of the triadic Z - X·Y is completed (say T = X·Y), the work Z - T cannot be folded; it runs to completion because it is not costly in CPU time. The parameter res contains the partial result of the matrix product. The purpose of the field tid is explained in paragraph 4.3.3.

The partial work (its descriptors) is stored in the Partial Work List (PWL in figure 5) managed by the AMM. The role of the other list, AWL, is presented in paragraph 4.3.3.
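The idea of a matrix product that can be folded at element (I, J) and resumed exactly there can be sketched as follows. This is our own illustration: the real WM is folded through MARS's cleanup handler, not through a callback, and works on blocks rather than Python lists:

```python
def resumable_product(X, Y, state=None, should_fold=lambda: False):
    """Element-wise block product X.Y that can be folded at element (I, J)
    and resumed later from exactly that point.

    Returns (result, None) on completion, or (None, state) when folded;
    state carries the resumption point I, J and the partial result res."""
    n, m, p = len(X), len(Y[0]), len(Y)
    if state is None:
        state = {"I": 0, "J": 0, "res": [[0.0] * m for _ in range(n)]}
    I, J, res = state["I"], state["J"], state["res"]
    for i in range(I, n):
        # J only offsets the first (resumed) row
        for j in range(J if i == I else 0, m):
            if should_fold():
                return None, {"I": i, "J": j, "res": res}  # partial work
            res[i][j] = sum(X[i][k] * Y[k][j] for k in range(p))
    return res, None
```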

4.3.2 Application folding/unfolding

(Figure: a worker thread on Node j holds the data blocks d1, d2 and d3 and the work structure (b1(1,i,j), b2(0,i,k), b3(1,k,j), b(1,i,j), l, wtype, I, J, tid, res); at folding time the NM sends the partial-work descriptor back to the master task, which stores it in PWL next to AWL; at unfolding time the data blocks are re-extracted from the matrices A and B.)

Figure 5: Folding/unfolding work

A folding operation occurs when a machine is requisitioned by its owner. This operation is handled by the NM of the machine, which executes a folding handler named cleanup defined in the PAA and then kills the WM. The role of this primitive is to collect the partial work (the structure shown in figure 5) and send it back to the AMM; the sending is done by an LRPC which adds the partial work to PWL. The folding handler is the interface between the PAA in general and the MARS system. An unfolding operation occurs when a machine becomes idle: its WM asks for work through an LRPC, which pulls a partial work out of PWL and creates a worker thread on the WM to resume it. The block descriptors taken from PWL are used to extract the actual data blocks from the matrices A and B, as shown in figure 5. Our implementation places no restriction on the number of foldings/unfoldings of the same work, but, as is the case for migration mechanisms, it is worth examining whether a limit is necessary and how to fix it.

4.3.3 Work scheduling and fault tolerance

The scheduling problem arises when a WM asks for work: which kind of work must the AMM send, a partial one or a new one? Sending new work first does not bring any benefit; on the contrary, that strategy requires more memory for storing PWL. We use the other alternative: when PWL is not empty, a partial work is selected first, according to the Last In First Out (LIFO) policy. If PWL is empty, a new work unit is activated from one of the five loops of the algorithm.

Failures can arise in MARS at four levels: GMS and GS, NM, AMM and WM. The first three are handled at the system level by a checkpointing-based technique [7]. The fault tolerance of WMs is managed at the application level. In our implementation of the Gauss-Jordan algorithm, the Active Work List (AWL) is used for this purpose. Whenever a work unit (partial or complete) is sent to a WM, the AMM creates an instance (a work descriptor) of it and puts it in AWL. Three situations are then possible for an active work: it is completed, and its descriptor is removed from AWL; it is folded, in which case it is removed from AWL and put in PWL; or the WM fails, and the active work is re-started on another WM. The field tid (identity of the WM), which appears in figures 4 and 5, is used to find in AWL the active work whose WM failed.
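The AMM-side bookkeeping around PWL and AWL can be sketched as follows (a hypothetical illustration; names and the exact re-start behaviour are ours):

```python
class Scheduler:
    """Sketch of the AMM scheduling policy: partial work first (LIFO from
    PWL), otherwise a new unit; AWL tracks active work so it can be folded
    back into PWL or re-queued after a WM failure."""

    def __init__(self, new_work_source):
        self.pwl = []        # Partial Work List (LIFO)
        self.awl = {}        # Active Work List: tid -> work descriptor
        self.new_work = new_work_source  # iterator over new work units

    def get_work(self, tid):
        work = self.pwl.pop() if self.pwl else next(self.new_work, None)
        if work is not None:
            self.awl[tid] = work         # record the work as active
        return work

    def completed(self, tid):
        self.awl.pop(tid)                # work done: drop from AWL

    def folded(self, tid, partial):
        self.awl.pop(tid)
        self.pwl.append(partial)         # keep for later unfolding

    def worker_failed(self, tid):
        # the paper re-starts the work from scratch on another WM;
        # re-queueing its descriptor approximates that here
        self.pwl.append(self.awl.pop(tid))
```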

5 Performance Evaluation

In this section we first present the influence of the granularity of parallelism on the efficiency of the parallel execution of the block-based Gauss-Jordan algorithm. Then we study the scalability of the algorithm. Finally, we evaluate the processing and communication costs induced by the parallel execution of the algorithm in an adaptive context.

5.1 Granularity of parallelism

The granularity problem is crucial, especially for irregular applications [8, 9]. It consists of determining the work unit size for which the work processing balances the communication cost involved in the work distribution. Its management can be either static, meaning that the size of work units is fixed at compile or load time, or dynamic, i.e. the granularity of parallelism is determined at run time according to the state of the application and its execution support. A classification of the different existing strategies is presented in [9]. In this paper, only static granularity management is considered.

The problem can be stated as follows: what is the appropriate size of the matrix blocks of the elementary works, i.e. matrix products and matrix triadics? That size depends on the size of the matrices A and B. For example, table 1 shows the results obtained on a farm of 7 DEC/ALPHA processors (133 MHz clock, 64 MB RAM and 1 GB disk) interconnected by a Gigaswitch network (200 Mbps) and running OSF-1. The dimension of the matrices A and B is fixed at 800 (i.e. 0.64 million elements). The table shows, for different block sizes: the sequential execution time (SET) in seconds, i.e. the execution time of the sequential version of the algorithm on a single DEC/ALPHA processor; the parallel execution time on a single processor (PET1) in seconds; the parallel execution time (averaged over 100 runs of the same application) on N = 7 processors (PETN) in seconds; the absolute speed-up (ASUP = SET/PETN); the relative speed-up (RSUP = PET1/PETN); the absolute efficiency (AE = ASUP/N); and the relative efficiency (RE = RSUP/N). The best absolute efficiency, 84%, is obtained with a block size of 200. For smaller block sizes more communications are generated and so a smaller efficiency is obtained. On the other hand, if the block size is larger, parallelism is lost, again leading to a smaller efficiency. Note also that, as is the case for a block size of 200, perfect relative speed-ups can be obtained with our implementation.

Table 1: Matrices' size: 800, Number N of machines: 7

Block size   SET(s)    PET1(s)   PETN(s)   ASUP  RSUP  AE    RE
100          480.672   902.270   143.056   3.36  6.28  0.48  0.90
200          502.110   802.011    85.124   5.90  9.42  0.84  1.35
400          612.098   754.709   203.600   3.01  3.71  0.43  0.53

Table 2 shows further results obtained from 100 runs of the same application with matrices of dimension 1500 (i.e. 2.25 million elements) and different block sizes. For that dimension, the best absolute efficiency, 92%, is obtained with a grain size of 300. One can note that a perfect relative speed-up is again obtained.

Table 2: Matrices' size: 1500, Number N of machines: 7

Block size   SET(s)     PET1(s)    PETN(s)    ASUP  RSUP   AE    RE
150          3374.065   6377.290   1056.380   3.19   6.04  0.46  0.86
300          3735.654   6002.310    580.776   6.43  10.33  0.92  1.48
500          4165.055   5397.670    861.425   4.84   6.27  0.69  0.90

5.2 Scalability

Table 3 shows the experimental results obtained on a network of SUN-Sparc 4 workstations running Solaris 2. The size of the matrices A and B and the block size are fixed at 1500 and 300 respectively. The parameters evaluated above, i.e. PETN (averaged over 100 runs of the same application), ASUP, RSUP, AE and RE, are expressed as functions of the number Nproc of processors used. In order to compute the absolute speed-up ASUP, we measured the execution time of the sequential version of the block-based Gauss-Jordan algorithm on one SUN-Sparc 4 workstation and found it equal to 4839.9 seconds.

Table 3: Matrices' size: 1500, Blocks' size: 300

Nproc   PETN(s)    ASUP   RSUP   AE     RE
1       7582.19     0.64   1.00  0.64   1.00
4       1811.41     2.67   4.19  0.67   1.05
8       1303.85     3.71   5.82  0.46   0.73
16       481.17    10.06  15.76  0.63   0.98
24       401.40    12.06  18.89  0.50   0.78
28       402.87    12.01  18.82  0.43   0.67

The main remarks that can be made from table 3 are the following. Absolute efficiencies greater than 60% are obtained with between 1 and 16 processors, and the corresponding relative efficiencies are perfect except for 8 processors (a speed-up anomaly); see figure 6. Note that the absolute efficiency of a 4-processor system (67%) is greater than that of a 16-processor system (63%), for the following reason: at each iteration, the number of generated work units is q² - 1, i.e. 24 (because q = 1500/300 = 5). In paragraph 4.2 we stated that the parallel execution of internal loop j of the third (respectively fourth) loop is triggered only after the computation of the block A^k_{k,j} (respectively B^k_{k,j}). Consequently, at each iteration, only 2(q - 1), i.e. 8, work units can be processed at the beginning, and the 16 other work units (internal loops) are processed afterwards. So, on a 4-processor system all the processors are always busy, but on a 16-processor system 8 processors are idle at the start of each iteration, which causes a certain loss of speed-up. The efficiency decreases when the number of processors exceeds 16 because the maximum possible degree of parallelism is 16.

Figure 6: The scalability of the algorithm (absolute and relative speed-up versus N)
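The counting argument of section 5.2 can be checked directly (the per-loop counts below follow our reading of the five loops of the algorithm):

```python
def work_units_per_iteration(q, k):
    """Work units generated at iteration k (1 <= k <= q) for q x q blocks."""
    loop1 = q - k              # pivot-row products in A
    loop2 = k - 1              # pivot-row products in B
    loop3 = (q - k) * (q - 1)  # triadics updating A (internal loops)
    loop4 = (k - 1) * (q - 1)  # triadics updating B (internal loops)
    loop5 = q - 1              # products for column k of B
    return loop1 + loop2 + loop3 + loop4 + loop5

q = 1500 // 300  # q = 5 in the experiment of section 5.2
# every iteration generates q^2 - 1 = 24 units; loops 1, 2 and 5 together
# (2(q - 1) = 8 units) are runnable immediately, the internal loops only
# after the corresponding pivot-row block is available
```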

5.3 Adaptivity

In this section we aim at experimentally studying the overhead induced by the application's reaction to the adaptivity of the execution support. For our experiments we considered a meta-system of heterogeneous machines including 8 DEC-ALPHA/OSF-1 processors, 16 SUN-Sparc 4/Solaris 2 workstations and 3 SUN-Sparc 4/SunOS workstations; all 27 machines are interconnected by an Ethernet network. On the application side, we fixed the size of the matrices at 1500 and the size of the blocks at 300. We ran the same application 50 times in our laboratory on a working day, so machines frequently joined and left the meta-system due to availability and failures. The number of machines effectively available varied between 20 (minimum over the 50 runs) and 27 (maximum over the 50 runs) until the end of execution, where it decreased to 0. The average number of machines effectively used (average of the per-run averages over the 50 runs) was 22.47. Table 4 presents some of the results obtained.

Table 4: Matrices' size: 1500, Blocks' size: 300, Number N of machines: 27

                               Non adaptive   Adaptive   Overhead
CPU (s)                        108.011        111.744    3%
Communication and waiting (s)  355.193        391.218    9%
Total execution (s)            463.204        502.962    8%

The parameters considered in our study are the CPU time, the communication and waiting time, and the total execution time of the master task (AMM module). The CPU time is the computing time, i.e. the time spent extracting blocks from the matrices A and B, updating the two matrices with the results returned by the workers, etc. The communication and waiting time is the time spent by the master task distributing work and waiting for results. The total execution time is the sum of the two. The "Non adaptive" and "Adaptive" entries of table 4 were obtained in a non-adaptive context (a week-end night) and an adaptive context (a working day) respectively. The "Overhead" entry gives, for each parameter, the overhead induced by the management of adaptivity, computed as overhead = (adaptive - non adaptive) / adaptive.

One can note that the total overhead involved in adaptivity management is not very important compared, for example, with the 10% obtained in [12] for managing fault tolerance on other numeric applications. The computation overhead is particularly small.

6 Conclusions and Future Work We have proposed a new parallel adaptive version of the block-based Gauss-Jordan algorithm. This latter exploits the inter-loops and intra-loops SPMD parallelisms. Thanks to the MARS adaptive programming environment, our application exploits the wasted time i.e. when a machine becomes idle, work (matrix product or matrix triadic) is sent to it (application unfolding). 15

It also respects machine ownership, meaning that as soon as a user requisitions his machine, the work running on it is stopped (application folding) and the remaining partial work is resumed later, possibly on another machine. Moreover, our application integrates a scheduling strategy which uses a list PWL to store partial work. In this policy, no new work is generated as long as PWL is not empty; PWL is managed according to the LIFO policy. Another list, AWL, is used for fault tolerance of the workers. Indeed, every time a work unit becomes active its descriptor is stored in AWL, and if the worker running it fails, the work is re-started by another worker with the help of its descriptor. The complete fault tolerance problem is considered in [7]. The application was developed and experimented with the MARS system. The obtained results show that the efficiency depends on the size of the blocks. The absolute efficiency can reach 92% for an application with matrices of size 1500 and blocks of size 300 on a farm of 7 DEC-ALPHA/OSF-1 processors interconnected by a Gigaswitch network. On the other hand, absolute efficiencies greater than 60% are obtained on a network of a reasonable number of SUN-Sparc 4/Solaris 2 workstations. Moreover, perfect relative efficiency is often reached on both architectures. Finally, other results obtained on a network of 27 heterogeneous machines show that the management of the adaptivity is not costly. For example, for an application with matrices of size 1500 and blocks of size 300, and a number of machines fluctuating between 20 and 27 with an average of 22.47, the computation overhead is 3% and the total execution overhead is 8%. In our implementation, the matrices A and B are held in the main memory of the machine where the AMM resides, but that is not a constraint. Indeed, they could be stored in the memories of several machines or on a hard disk. These alternatives will be one of our further investigations. Another feature to address in the near future is heterogeneity, by taking into account the difference in processing speed between the processors when scheduling work. Finally, we plan to experiment with our application on a national wide-area network.

Acknowledgements

We would like to thank Zouhir Hafidi, a doctoral student at the Laboratoire d'Informatique Fondamentale de Lille, for having kindly put his long experience in developing MARS at our disposal.

References

[1] J. Choi, J. Dongarra, R. Pozo, and D. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proc. of the 4th Symp. on the Frontiers of Massively Parallel Computation, IEEE Publishers, pages 120-127, 1992.

[2] Guy Edjlali, Gagan Agrawal, Alan Sussman, and Joel Saltz. Data parallel programming in an adaptive environment. In Proc. of the 9th IPPS, Santa Barbara, California, USA, pages 827-832, Apr 25-28, 1995.

[3] L.S. Blackford et al. ScaLAPACK: A linear algebra library for message-passing computers. SIAM Conference on Parallel Processing, Mar 1997.

[4] MPI Forum. MPI: A Message Passing Interface standard. Technical report, Apr 1994.

[5] A. Geist, A. Beguelin, J. Dongarra, et al., editors. PVM: Parallel Virtual Machine, A User's Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.

[6] Zouhir Hafidi, El Ghazali Talbi, and Jean-Marc Geib. Meta-systemes: Vers l'integration des machines paralleles et les reseaux de stations heterogenes [Meta-systems: towards the integration of parallel machines and networks of heterogeneous workstations]. Calculateurs Paralleles, Ed. Hermes, 1997.

[7] D. Kebbal, E.G. Talbi, and J.M. Geib. A new approach for checkpointing parallel adaptive applications. In PDPTA'97 Proceedings, Las Vegas, Nevada, USA, pages 1643-1651, Jun 1997.

[8] N. Melab, N. Devesa, M.P. Lecouffe, and B. Toursel. Adaptive load balancing of irregular applications. A case study: IDA* applied to the 15-puzzle problem. Springer-Verlag LNCS 1117, Proc. of the Third Intl. Workshop IRREGULAR'96, Santa Barbara, California, USA, pages 327-338, 19-21 Aug 1996.

[9] Nouredine Melab. Gestion de la granularite et regulation de charge dans le modele P3 d'evaluation parallele des langages fonctionnels [Granularity management and load regulation in the P3 model of parallel evaluation of functional languages]. PhD thesis, Universite de Lille I, 1997.

[10] Raymond Namyst. PM2: Un environnement pour une conception portable et une execution efficace des applications paralleles irregulieres sur architectures distribuees [PM2: an environment for portable design and efficient execution of irregular parallel applications on distributed architectures]. PhD thesis, Universite de Lille I, 1997.

[11] Serge G. Petiton. Parallelization on an MIMD computer with real-time scheduler, Gauss-Jordan example. In Aspects of Computation on Asynchronous Parallel Processors, M.H. Wright (editor), Elsevier Science Publishers B.V. (North-Holland), IFIP, 1989.

[12] James S. Plank, Youngbae Kim, and Jack J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In 25th Symposium on Fault-Tolerant Computing, Pasadena, CA, June 1995.

[13] Peter M.A. Sloot. Bibliography of the Parallel Scientific Computing and Simulation Group at the University of Amsterdam. Dept. of Mathematics and Computer Science, Univ. of Amsterdam, Kruislaan, Netherlands, 1996.

[14] Ole Tingleff. Systems of linear equations solved by block Gauss-Jordan method using a transputer cube. Technical report IMM-REP-1995-08, Institute of Mathematical Modelling, Technical University of Denmark, 1995.

1 Introduction

Over the past few years, parallel and distributed computing has seen the emergence of networks of workstations (NOWs) to the detriment of supercomputers. These NOWs have introduced some issues which have a great impact on the parallel programming methodology. The main ones are the following: workstations are heterogeneous, i.e. they do not have the same architecture and operating system. Moreover, NOWs are dynamic environments, i.e. workstations frequently leave and enter the network due to their availability or failures. These factors make it necessary to find new solutions for the problems of parallelism, such as fault tolerance [12]. It is also necessary to change the parallel programming methodology in order to take into account the adaptive nature of the applications; we refer to this as parallel adaptive programming. An application programmed according to this philosophy is called a parallel adaptive application (PAA), i.e. a parallel application whose processes vary in number and location as a

function of the load (availability of processors, CPU time, etc.) of the machines. In this paper, we present a parallel adaptive programming methodology and its experimentation on the block-based Gauss-Jordan numerical method. The latter is a direct method used in numerical analysis for solving linear systems A(n,n).X(n) = Y(n), where A(n,n) is an invertible matrix of dimension n, and X(n) and Y(n) are two vectors of dimension n. The method consists of inverting A(n,n) in order to compute X(n) = B(n,n).Y(n), where B(n,n) = A(n,n)^-1. The methodology aims at:
- Exploiting the wasted time of the machines, i.e. as soon as a machine becomes available it is taken over to execute some work of the application. That operation is referred to as an application unfold.
- Respecting machine ownership: when a machine is requisitioned by its owner, the work (a part of the application) is pulled out. That operation is referred to as an application fold.
- Solving, in some way, the fault tolerance problem: if a given machine fails, only the work running on that machine is re-started.
The principle of our parallel adaptive block-based Gauss-Jordan application is the following: a block server maintains the two matrices A and B (with the identity matrix as the initial value of B). According to the availability/requisitioning of machines, it extracts blocks from these matrices and distributes or redistributes them among the machines. These blocks are used by workers created on available machines to compute new blocks, which are returned to the server, which updates the matrices A and B. If a machine computing a block is requisitioned, the partial work is folded and unfolded later on another machine, which resumes the work at exactly its interruption point. Moreover, if a machine fails, its work is re-started.
In order to achieve the above objectives, we use the MARS system [6], a new programming support for parallel adaptive applications developed at LIFL (Universite de Lille I). The remainder of this paper is organized as follows: Section 2 briefly presents related work in parallel scientific computing. Section 3 describes the MARS environment at the system and application levels. Section 4 presents our proposed fault-tolerant parallel adaptive Gauss-Jordan algorithm and its implementation with MARS. Section 5 shows a performance evaluation of the algorithm. Section 6 draws conclusions and gives some of our future investigations.

This work is a collaboration between three members of three different teams of LIFL: N. Melab from PALOMA, S. Petiton from MAP and E-G. Talbi from GOAL.


2 Parallel Scientific Computing

Long-running scientific applications have, for many years, traditionally been performed on expensive supercomputers like the Paragon, MP-2, CM5, SP2, etc. Recently, this tradition has been questioned due to the emergence of networks of workstations (NOWs), which have become more and more competitive. The main reasons for that are the following: first, workstations are cheaper and more powerful than they were. Second, networks have undergone a great technological evolution which has made them much faster. Third, the current decade has seen the arrival of parallel programming environments like PVM [5] and MPI [4], well suited for NOW programming, and of high-performance libraries like ScaLAPACK [3] and BLAS [1]. On the other hand, many parallel scientific applications [13] have been developed on supercomputers and NOWs. Most of them have been designed according to classical parallelism, i.e. under the assumption that the execution environment is not dynamic. As a consequence, especially on NOWs, these applications do not exploit all the power of the machines and do not respect the workstations' ownership. In order to overcome these problems, adaptive parallelism is required. In addition, the management of the adaptivity need not be costly. For example, in [2], the authors show, with the help of a multi-block Navier-Stokes solver implemented on a PVM NOW, that if the number of processors does not vary frequently, the data redistribution cost is not important compared with the time required for the actual calculations. In [12], the authors experiment with a checkpointing algorithm for fault tolerance on a SUN NOW and an IBM SP2; the results on both systems are close together. In this paper, we experiment with adaptive parallelism on a numerical application, the block-based Gauss-Jordan method. A classical parallel version of this application has already been developed in [11, 14]. In [11], the application is implemented on the MANU numerical machine, which consists of 8 PEs (processing elements), 4 data migration controllers and a 3-level shared memory, interconnected by an OMEGA network. The method includes a real-time scheduler that schedules the tasks of the application according to a task graph determined by hand. In [14], the proposed method is implemented in the Occam language on a network of 8 transputers connected as a cube. It is based on block row operations, i.e. a block row of the matrix to be inverted is sent to each transputer. During computation, each processor produces a block of intermediate results and broadcasts it to all the transputers. In contrast to these versions, our parallel approach is adaptive, as stated in the introduction, and we implemented it on a network of heterogeneous workstations with the MARS environment. The latter has already been successfully used on parallel optimization problems [6]. It is presented in the next section.


3 MARS: a Programming Environment for Parallel Adaptive Applications

MARS is a multi-threaded environment for programming parallel adaptive applications. It is built (see Figure 1) on top of PM2 [10], a parallel multi-threaded execution support which couples two libraries: the PVM communication library and a thread package called MARCEL [10].

Figure 1: A macroscopic view of MARS (MARS runs on top of PM2, which couples PVM and MARCEL above UNIX)

The characteristics of MARS [6] are mainly:
- Portability: MARS currently runs on Sparc/SunOS, Sparc/Solaris, Alpha/OSF-1 and PC/Linux.
- Respect of machine ownership: when a machine executing a part of a MARS application is requisitioned by its owner, the MARS system folds this application.
- Fault tolerance: MARS integrates a fault tolerance mechanism which includes a checkpointing algorithm [7]. This characteristic is developed in Section 4.3.3.
Other interesting characteristics of MARS, such as dynamic reconfiguration, protection, transparency, scalability and user-friendliness, are described in [6].

3.1 MARS at system level: Architecture

As shown in Figure 2, MARS has a hierarchical architecture with three levels. At the first level, a global scheduler GMS (Global MARS Scheduler) controls a set of group servers GS (Group Server) and ensures their cooperation. At the second level, each GS manages a cluster of nodes selected according to a given criterion, which can be the geography, the homogeneity, etc., of the nodes. At the third level of the hierarchy, each node executes a daemon named NM (Node Manager). The role of an NM is to inspect the state of its node in order to inform the GS on which it depends about its changes. The GS uses this collected node state-change information to schedule the MARS applications which execute in its cluster. This information can also be used by other GSs, through the GMS, to schedule their applications.

Figure 2: Architecture of MARS (the GMS coordinates the GSs; each GS manages a cluster of nodes, each running an NM; the AMM and WM modules of an application execute on the nodes)

3.2 MARS at application level: Programming Methodology

The programming model of MARS is basically the classical SPMD model. A MARS application is composed of two modules: the AMM (Application Manager Module) and the WM (Worker Module). At runtime, the AMM is executed by one task, the master. The WM is executed by several tasks, the workers. Their number depends on the number of available nodes (one worker on each node). The role of the master task is to partition the whole application into work units and to distribute them among the workers. Each worker continually asks the master for work and sends back the results when it completes the work. As indicated above, since MARS respects machine ownership, a worker may be stopped when the machine on which it is running is requisitioned by its owner. Consequently, its work has to be folded and kept on the master until a machine becomes available. At that moment, the (partial) work is unfolded on the available machine according to a scheduling strategy. The execution of the folded work must resume exactly at its stopping point; we therefore need to specify what a partial work is. In summary, programming a parallel adaptive application with MARS consists basically of:
- defining what the work and the partial work are;
- indicating how to fold the partial work, that is to say specifying a folding handler (cleanup function) to be executed by MARS at folding time; this function represents an interface between the application and the MARS system;
- specifying how to unfold and resume the partial work; it is also necessary to define a scheduling strategy for (complete or partial) work;
- writing the code of the two modules of the application: the AMM and the WM.

4 Fault Tolerant Parallel Adaptive Block-based Gauss-Jordan

4.1 Sequential Block-based Gauss-Jordan

Let A and B be two matrices of dimension n, with B the inverse matrix of A. Let A and B be partitioned into (q x q) matrix blocks of a fixed dimension b (b = n/q). In the following, A^k(i,j) denotes the value of block (i,j) of A after iteration k, and '.' denotes the block (matrix) product. The sequential block Gauss-Jordan algorithm can then be written as follows:

Input: A, B = I(n,n), q.   /* I(n,n): identity matrix of dimension n */
Output: B = A^-1

For k = 1 to q do
    A^k(k,k) = (A^(k-1)(k,k))^-1                       /* pivot inversion */
    B^k(k,k) = A^k(k,k)
    For j = k+1 to q do                                /* loop 1 */
        A^k(k,j) = A^k(k,k) . A^(k-1)(k,j)
    End For
    For j = 1 to k-1 do                                /* loop 2 */
        B^k(k,j) = A^k(k,k) . B^(k-1)(k,j)
    End For
    For j = k+1 to q do                                /* loop 3 */
        For i = 1 to q and i != k do
            A^k(i,j) = A^(k-1)(i,j) - A^(k-1)(i,k) . A^k(k,j)
        End For
    End For
    For j = 1 to k-1 do                                /* loop 4 */
        For i = 1 to q and i != k do
            B^k(i,j) = B^(k-1)(i,j) - A^(k-1)(i,k) . B^k(k,j)
        End For
    End For
    For i = 1 to q and i != k do                       /* loop 5 */
        B^k(i,k) = - A^(k-1)(i,k) . A^k(k,k)
    End For
End For
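To make the pseudocode concrete, here is a small, pure-Python sketch of the sequential algorithm (the helper functions and names are ours; the paper gives only the pseudocode above). Blocks use 1-based indices, as in the pseudocode:

```python
# Pure-Python sketch of the sequential block Gauss-Jordan inversion.
# Matrices are lists of lists; blocks are b x b sub-matrices.

def mat_mul(X, Y):
    n, m, p = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def mat_inv(X):
    # plain (scalar) Gauss-Jordan inversion of one small block, with pivoting
    m = len(X)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(m)]
           for i, row in enumerate(X)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        d = aug[col][col]
        aug[col] = [v / d for v in aug[col]]
        for r in range(m):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[m:] for row in aug]

def get_blk(M, i, j, b):          # 1-based block indices, as in the pseudocode
    return [row[(j - 1) * b:j * b] for row in M[(i - 1) * b:i * b]]

def set_blk(M, i, j, b, V):
    for r in range(b):
        M[(i - 1) * b + r][(j - 1) * b:j * b] = V[r]

def block_gauss_jordan(A, b):
    n = len(A); q = n // b
    A = [row[:] for row in A]                       # work on a copy of A
    B = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(1, q + 1):
        piv = mat_inv(get_blk(A, k, k, b))          # pivot inversion
        set_blk(A, k, k, b, piv)
        set_blk(B, k, k, b, piv)
        for j in range(k + 1, q + 1):               # loop 1
            set_blk(A, k, j, b, mat_mul(piv, get_blk(A, k, j, b)))
        for j in range(1, k):                       # loop 2
            set_blk(B, k, j, b, mat_mul(piv, get_blk(B, k, j, b)))
        for j in range(k + 1, q + 1):               # loop 3
            for i in range(1, q + 1):
                if i != k:
                    prod = mat_mul(get_blk(A, i, k, b), get_blk(A, k, j, b))
                    old = get_blk(A, i, j, b)
                    set_blk(A, i, j, b,
                            [[old[r][c] - prod[r][c] for c in range(b)]
                             for r in range(b)])
        for j in range(1, k):                       # loop 4
            for i in range(1, q + 1):
                if i != k:
                    prod = mat_mul(get_blk(A, i, k, b), get_blk(B, k, j, b))
                    old = get_blk(B, i, j, b)
                    set_blk(B, i, j, b,
                            [[old[r][c] - prod[r][c] for c in range(b)]
                             for r in range(b)])
        for i in range(1, q + 1):                   # loop 5
            if i != k:
                prod = mat_mul(get_blk(A, i, k, b), piv)
                set_blk(B, i, k, b,
                        [[-prod[r][c] for c in range(b)] for r in range(b)])
    return B
```

Note that within iteration k, column k of A (for i != k) is never overwritten, so the reads of A^(k-1)(i,k) in loops 3, 4 and 5 can safely use the in-place matrix.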

4.2 Parallelization

The parallelization of the sequential block-based Gauss-Jordan algorithm involves, as shown in Figure 3, two kinds of parallelism: inter-iteration parallelism and intra-iteration parallelism. The first means that the iterations are partially or completely executed in parallel, which involves the management of complex dependences between matrix blocks. Partial inter-iteration parallelism means that only the inversion of the pivot A^k(k+1,k+1) of the next iteration is executed in anticipation; unlike complete parallelism, the remainder of the next iteration is triggered only at the end of the current iteration. Intra-iteration parallelism consists essentially of exploiting the parallelism induced by each of the five loops of the algorithm. It falls into two categories: inter-loops parallelism and intra-loops parallelism. Only intra-iteration parallelism is exploited in our implementation, so we develop it in the next sections.

Figure 3: The Gauss-Jordan parallelism (inter-iteration parallelism: pivot inversion only, or the whole iteration; intra-iteration parallelism: inter-loops or intra-loop)

4.2.1 The inter-loops parallelism

Inter-loops parallelism means that the loops are executed in parallel; the dependences between the loops must then be managed. In the sequential algorithm, we identify two of them: the internal loop of the third loop depends on the first loop because it uses the block A^k(k,j), which is computed by the first loop; likewise, the internal loop of the fourth loop depends on the second loop because it uses the block B^k(k,j), which is computed by the second loop. Consequently, the parallel execution of internal loop number j of the third (respectively fourth) loop is triggered only after the computation of the block A^k(k,j) (respectively B^k(k,j)).

4.2.2 The intra-loops parallelism

Intra-loops parallelism consists of executing each of the five loops of the algorithm in parallel. In particular, the internal loops of loops 3 and 4 are each run in parallel. The work units for these loops are of two kinds: the matrix product and the matrix triadic, i.e. respectively X.Y and Z - X.Y, where X, Y and Z are matrix blocks. In the implementation, in order to correctly control the loops, one must associate with each work unit the number of the loop to which it belongs. Moreover, a crucial criterion which must be taken into account is the granularity of the work units. Indeed, the size of the matrix blocks involved in each work unit must be such that its computation cost balances the communication cost incurred by sending it to a remote node.
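As a rough, back-of-envelope illustration of this balance (our model, not the paper's): a matrix-triadic work unit on b x b blocks performs on the order of 2b^3 floating-point operations but only exchanges about 4b^2 matrix elements (three input blocks sent, one result returned), so the computation-to-communication ratio grows linearly with b:

```python
# Rough static-granularity model for a matrix triadic Z - X.Y on b x b blocks.
def flops(b):
    return 2 * b**3 + b**2        # ~2b^3 for X.Y, b^2 for the subtraction

def elements_moved(b):
    return 4 * b**2               # three input blocks in, one result block out

def compute_comm_ratio(b):
    return flops(b) / elements_moved(b)
# The ratio is roughly b/2: larger blocks amortize communication better,
# at the price of fewer work units per iteration (less parallelism).
```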

4.2.3 The parallel Gauss-Jordan algorithm

Like any MARS PAA, the Gauss-Jordan PAA is an SPMD application. It consists of two modules: the AMM and the WM. At runtime, the AMM is executed by the master task and the WM by the worker tasks. Figure 4 illustrates the Gauss-Jordan PAA architecture; the details of the figure are explained in the following paragraphs. In the remainder of this article, we will use AMM interchangeably with 'master task' and WM with 'worker task'.

Figure 4: The parallel Gauss-Jordan algorithm (the master task extracts blocks from the matrices A and B and sends a work unit — the data blocks d1, d2, d3, the resulting block descriptor b(1,i,j), the work type wtype(=MT) and the loop number l — to a worker task, which sends back the result res with the same descriptor)

4.2.3.1 The AMM module

The AMM module loads the two matrices A and B and then manages them until it obtains the final result. At the beginning of each iteration k (1 <= k <= q) of the algorithm, it computes the inversion of the current pivot A^(k-1)(k,k). Then, it controls and synchronizes the execution of the loops. An array of synchronization bits is used to manage the loop dependences quoted in Section 4.2.

The AMM module extracts matrix blocks from A and B in order to constitute the work units (matrix products and matrix triadics), which are sent to the WMs according to a receiver-initiated strategy. Each work sending is done by an LRPC which creates a worker thread (WT) on a WM to execute the work unit. After the work unit is completed, the AMM uses its resulting matrix block to keep the corresponding block in A or B up to date. For example, Figure 4 shows the sending by the AMM of the work which characterizes the matrix triadic of loop number 4, i.e. B^k(i,j) = B^(k-1)(i,j) - A^(k-1)(i,k) . B^k(k,j). It also illustrates the sending back by the WM of the resulting block, i.e. B^k(i,j). The work components are the three data blocks A^(k-1)(i,k) (d1), B^(k-1)(i,j) (d2) and B^k(k,j) (d3), the resulting block descriptor b, the type of work wtype (in our example MT for Matrix Triadic; MP stands for Matrix Product) and the loop number l (for control and synchronization). The resulting block descriptor is used to memorize the identity of the block to update in the matrix A or B. Its three components represent respectively which of the matrices A or B must be updated (0 for A and 1 for B), the line number i and the column number j of the block to update.
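The message layout just described can be sketched as follows (the field names are ours, mirroring Figure 4; the actual MARS structures are not given in the paper):

```python
from dataclasses import dataclass

@dataclass
class BlockDescriptor:
    matrix: int   # 0: update matrix A, 1: update matrix B
    i: int        # line number of the block
    j: int        # column number of the block

@dataclass
class WorkUnit:
    d1: list                 # data block A^(k-1)(i,k)
    d2: list                 # data block B^(k-1)(i,j) (triadic only)
    d3: list                 # data block B^k(k,j)
    b: BlockDescriptor       # which block the result must update
    wtype: str               # "MT" (matrix triadic) or "MP" (matrix product)
    l: int                   # loop number, for control and synchronization

# Example: the triadic of loop 4 shown in Figure 4, updating block (i,j) of B.
w = WorkUnit(d1=[], d2=[], d3=[], b=BlockDescriptor(1, 2, 1), wtype="MT", l=4)
```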

4.2.3.2 The WM module

As indicated above, the work distribution follows a receiver-initiated strategy. Indeed, each WM continually asks the AMM for work. If there is no work, the WM waits for a WAIT_PERIOD time (we fixed it at 10 seconds). Otherwise, it receives the work to do and executes it. After that, it sends the resulting block to the AMM. In Figure 4, the block is characterized by its descriptor b, its data res, its work type wtype and the number l of the loop it belongs to.
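The receiver-initiated loop of a WM can be sketched as follows (a simplification with callback names of our own; the real WM uses MARS LRPCs and threads):

```python
import time

WAIT_PERIOD = 10  # seconds, as fixed in our implementation

def worker_loop(ask_for_work, execute, send_result, stopped, sleep=time.sleep):
    """Receiver-initiated work distribution: the worker pulls work units."""
    while not stopped():
        work = ask_for_work()        # request a work unit from the AMM
        if work is None:
            sleep(WAIT_PERIOD)       # no work available: wait and retry
            continue
        res = execute(work)          # matrix product or matrix triadic
        send_result(work, res)       # return the resulting block
```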

4.3 MARS-based Adaptivity

Since MARS aims both at respecting machine ownership and at exploiting wasted time, nodes are dynamically pulled out of and/or added to the virtual machine (NOW) which executes the PAA. As a consequence, work must be dynamically blocked and resumed later. Then, as stated in Section 3.2, one has to define what the partial work is, how to fold/unfold it and how to schedule it.

4.3.1 The partial work

In addition to the components describing a whole work unit given above, i.e. b, l and wtype, other information must be considered to describe the partial work. This information is shown in bold in Figure 5: the block descriptors b1, b2 and b3 of respectively B^(k-1)(i,j) (d2), A^(k-1)(i,k) (d1) and B^k(k,j) (d3), plus I, J, tid and res. The block descriptors are used in order to avoid storing the blocks d1, d2 and d3 themselves; these blocks are extracted from the matrices A and B using their descriptors at unfolding time. The parameters I and J designate the line and column numbers of the next element (the resumption point) to compute if a machine computing a matrix product is requisitioned. One has to note that the matrix product calculation can either be a matrix product work unit on its own or a part of a matrix triadic work unit. Moreover, the latter is never interrupted while the block subtraction is being computed: if X.Y of the matrix triadic Z - X.Y is completed (let us note T = X.Y), the work Z - T cannot be folded; it runs to completion because it is not costly in terms of CPU time. The parameter res contains the partial result of the matrix product. The purpose of the information tid is indicated in Section 4.3.3.

The partial work (its descriptor) is stored in the Partial Work List (PWL in Figure 5) managed by the AMM. The role of the other list, AWL, is presented in Section 4.3.3.
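The resumption point (I, J) and partial result res can be illustrated with a toy resumable product (our sketch; the real worker computes b x b blocks and is folded by the MARS cleanup handler):

```python
def resumable_product(X, Y, I=0, J=0, res=None, should_fold=lambda: False):
    """Compute X.Y element by element; if folded, return the partial work
    (resumption point I, J and partial result res) instead of the result."""
    n, m, p = len(X), len(Y), len(Y[0])
    if res is None:
        res = [[0.0] * p for _ in range(n)]
    i, j = I, J
    while i < n:
        while j < p:
            if should_fold():                 # owner requisitioned the machine
                return None, (i, j, res)      # partial work, to be put in PWL
            res[i][j] = sum(X[i][k] * Y[k][j] for k in range(m))
            j += 1
        i, j = i + 1, 0
    return res, None                          # completed work
```

Unfolding simply calls the function again with the saved (I, J, res), so the computation resumes exactly at its interruption point.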

4.3.2 Application folding/unfolding

Figure 5: Folding/unfolding work (at folding time, the NM sends the partial-work descriptor — the block descriptors b1(1,i,j), b2(0,i,k), b3(1,k,j), the resulting block descriptor b(1,i,j), l, wtype, the resumption point I, J, the worker identity tid and the partial result res — back to the master task, which stores it in PWL; at unfolding time, the data blocks d1, d2, d3 are re-extracted from A and B and sent to a worker task)

A folding operation occurs when a machine is requisitioned by its owner. This operation is handled by the NM of the machine, which executes a folding handler named cleanup, defined in the PAA, and then kills the WM. The role of this primitive is to take the partial work (contained in the structure of Figure 5) and to send it back to the AMM. The sending back is done using an LRPC which adds the partial work to PWL. The folding handler is the interface between the PAA in general and the MARS system. An unfolding operation occurs when a machine becomes idle. The WM of that machine asks for work with the help of an LRPC, which pulls a partial work out of PWL and creates a worker thread on the WM to resume it. The block descriptors are taken from PWL and are used to extract the concrete data blocks, as shown in Figure 5, from the matrices A and B. In our implementation, no restriction is imposed on the number of foldings/unfoldings of the same work but, as is the case for migration mechanisms, it is worth examining whether such a limit is necessary and how to fix it.

4.3.3 Work scheduling and fault tolerance

The scheduling problem arises when a WM asks for work: what kind of work must the AMM send, a partial or a new one? Sending new work first does not bring any benefit; on the contrary, this strategy requires more memory space for storing PWL. We use the other alternative: when PWL is not empty, a partial work is selected first, according to the Last In First Out (LIFO) policy. If PWL is empty, then one work unit is activated from one of the five loops of the algorithm. Failures can arise in MARS at four levels: GMS and GS, NM, AMM and WM. The first three kinds of failure are managed at the system level by a checkpointing-based technique [7]. The fault tolerance of WMs is managed at the application level. In our implementation of the Gauss-Jordan algorithm, the Active Work List (AWL) is used for this purpose. Whenever a work unit (partial or complete) is sent to a WM, the AMM creates an instance (a work descriptor) of it and puts it in AWL. Three situations are then possible for the active work:
- It is completed; its descriptor is then removed from AWL.
- The active work is folded; in this case it is removed from AWL and put in PWL.
- The WM fails; the active work is then re-started on another WM. The information tid (identity of the WM), which appears in Figures 4 and 5, is used to find in AWL the active work whose WM failed.
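The policy just described can be summarized in a small sketch (class and method names are ours, not the MARS API):

```python
class AMMScheduler:
    """Partial work first (LIFO on PWL), then new work; AWL tracks active work."""
    def __init__(self, new_work):
        self.new_work = list(new_work)   # work units not yet activated
        self.pwl = []                    # Partial Work List (folded work)
        self.awl = {}                    # Active Work List: worker id -> work

    def get_work(self, wid):
        if self.pwl:
            work = self.pwl.pop()        # LIFO: most recently folded first
        elif self.new_work:
            work = self.new_work.pop(0)  # activate a new work unit
        else:
            return None
        self.awl[wid] = work             # remember it for fault tolerance
        return work

    def fold(self, wid):                 # owner requisitioned the machine
        self.pwl.append(self.awl.pop(wid))

    def worker_failed(self, wid):        # WM failure: re-offer the work
        self.new_work.insert(0, self.awl.pop(wid))

    def completed(self, wid):
        self.awl.pop(wid)
```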

5 Performance Evaluation

In this section, we first present the influence of the granularity of parallelism on the efficiency of the parallel execution of the block-based Gauss-Jordan algorithm. Then, we study the scalability of the algorithm. Finally, we evaluate the processing and communication costs induced by the parallel execution of the algorithm within an adaptive context.

5.1 Granularity of parallelism

The granularity problem is crucial, especially for irregular applications [8, 9]. It consists of determining the work-unit size which makes the work processing time balance the communication cost involved in the work distribution. Its management can be either static, meaning that the size of work units is fixed at compile or load time, or dynamic, i.e. the granularity of parallelism is determined at run time according to the state of the application and of its execution support. A classification of the different existing strategies is presented in [9]. In this paper, only static granularity management is considered. The problem can be stated as follows: what is the appropriate size of the matrix blocks of the elementary work units, i.e. matrix products and matrix triadics? That size depends on the size of the matrices A and B. For example, Table 1 shows the results obtained on a farm of 7 DEC/ALPHA processors (133 MHz clock, 64 MB RAM and 1 GB disk) interconnected by a Gigaswitch network (200 Mbps) and operating under OSF-1. The dimension of the matrices A and B is fixed at 800 (i.e. a size of 0.64 million elements). The table shows, for different block sizes, the sequential execution time (SET) in seconds, i.e. the execution time of the sequential version of the algorithm on a single DEC/ALPHA processor; the parallel execution time on a single processor (PET1) in seconds; the parallel execution time (average of 100 execution times of the same application) on N = 7 processors (PETN) in seconds; the absolute speed-up (ASUP = SET/PETN); the relative speed-up (RSUP = PET1/PETN); the absolute efficiency (AE = ASUP/N); and the relative efficiency (RE = RSUP/N). The best absolute efficiency, i.e. 84%, is obtained with a block size of 200. For smaller block sizes more communications are generated and so a smaller efficiency is obtained. On the other hand, if the block size is larger, parallelism is lost, again leading to a smaller efficiency. One also remarks that, as is the case for a block size of 200, perfect relative speed-ups can be obtained with our implementation.

Table 1: Matrices' size: 800, number N of machines: 7

Block size   SET (in s)   PET1 (in s)   PETN (in s)   ASUP   RSUP   AE     RE
100          480.672      902.27        143.056       3.36   6.28   0.48   0.90
200          502.110      802.011       85.124        5.90   9.42   0.84   1.35
400          612.098      754.709       203.600       3.01   3.71   0.43   0.53
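The derived columns of Table 1 follow directly from the measured times and the definitions above; for example, for the block-size-200 row (a check of ours):

```python
def derived_metrics(SET, PET1, PETN, N=7):
    """ASUP = SET/PETN, RSUP = PET1/PETN, AE = ASUP/N, RE = RSUP/N."""
    asup = SET / PETN
    rsup = PET1 / PETN
    return asup, rsup, asup / N, rsup / N

asup, rsup, ae, re_ = derived_metrics(502.110, 802.011, 85.124)
# Block size 200: ASUP ~ 5.90, RSUP ~ 9.42, AE ~ 0.84, RE ~ 1.35
```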

Table 2 shows other results obtained with 100 runs of the same application, with matrices of dimension 1500 (i.e. a size of 2.25 million elements) and different block sizes. For that dimension, the best absolute efficiency, i.e. 92%, is obtained with a grain size of 300. One can note that a perfect relative speed-up is again obtained.

Table 2: Matrices' size: 1500, number N of machines: 7

Block size   SET (in s)   PET1 (in s)   PETN (in s)   ASUP   RSUP    AE     RE
150          3374.065     6377.290      1056.380      3.19   6.04    0.46   0.86
300          3735.654     6002.310      580.776       6.43   10.33   0.92   1.48
500          4165.055     5397.670      861.425       4.84   6.27    0.69   0.90

5.2 Scalability

The table 3 shows the experiment results obtained on a network of SUN-Sparc 4 workstations running under Solaris 2. The size of the matrices A and B and the blocks' size are xed as equal to respectively 1500 and 300. The parameters evaluated above i.e. PETN (average of 100 runnings of the same application), ASUP , RSUP , AE and RE are expressed as functions of the number Nproc of used processors. In order to compute the absolute speed-up ASUP we measured the execution time of the sequential version of the block-based Gauss-Jordan algorithm on one SUN-Sparc 4 workstation and we found it equal to 4839.9 seconds. Table 3: Matrices' size: 1500, Blocks' size: 300

Nproc   PETN (in s)   ASUP    RSUP    AE     RE
1       7582.19        0.64    1.00   0.64   1.00
4       1811.41        2.67    4.19   0.67   1.05
8       1303.85        3.71    5.82   0.46   0.73
16       481.17       10.06   15.76   0.63   0.98
24       401.40       12.06   18.89   0.50   0.78
28       402.87       12.01   18.82   0.43   0.67

The main remarks that can be made from Table 3 are the following. Absolute efficiencies greater than 60% are obtained with a number of processors between 1 and 16, and the corresponding relative efficiencies are perfect except for 8 processors (speed-up anomaly) (see Figure 6). Let us note that the absolute efficiency (67%) of a 4-processor system is greater than that (63%) of a 16-processor system, which is due to the following reason: at each iteration, the number of generated work units is equal to q^2 - 1, i.e. 24 (because q = 1500/300 = 5). In paragraph 4.2, we stated that the parallel execution of the internal loop number j of the third (respectively the fourth) loop is triggered after the computation of the block A_k (respectively B_{k,j}). Consequently, at each iteration, only 2(q - 1), i.e. 8, work units are processed at the beginning, and the 16 other work units (internal loops) are processed afterwards. So, for a 4-processor system all the processors are always used, but for a 16-processor system 8 processors are unused at the start of the iteration, which causes a certain loss of speed-up. The efficiency decreases when the number of processors is greater than 16 because the maximum possible degree of parallelism is 16.
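The work-unit arithmetic above can be checked with a short sketch (the variable names are ours, not from the implementation):

```python
# Work-unit counts per iteration of the block-based Gauss-Jordan
# algorithm, for matrices of dimension 1500 split into 300x300 blocks.

matrix_size = 1500
block_size = 300

q = matrix_size // block_size      # number of blocks per matrix dimension
work_units = q * q - 1             # work units generated per iteration
immediate = 2 * (q - 1)            # units ready at the start of the iteration
deferred = work_units - immediate  # internal-loop units processed afterwards

print(q, work_units, immediate, deferred)  # 5 24 8 16
```

With only 8 units available at the start of an iteration, a 16-processor system leaves half of its processors idle until the internal loops are released, which explains the efficiency dip observed in Table 3.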

[Figure 6: The scalability of the algorithm — absolute and relative speed-up as functions of the number N of processors.]

5.3 Adaptivity

In this section, we aim at experimentally studying the overhead induced by the management of the reaction of the application to the adaptivity of the material execution support. We considered, for our experimentations, a meta-system of heterogeneous machines including 8 DEC-ALPHA/OSF-1 processors, 16 SUN-Sparc 4/Solaris 2 workstations and 3 SUN-Sparc 4/SunOS workstations; all 27 machines are interconnected by an Ethernet network. On the application side, we fixed the size of the matrices at 1500 and the size of the blocks at 300. We ran the same application 50 times in our laboratory on a working day, so machines frequently joined and left the meta-system due to their availability and failures. The number of machines effectively available varied between 20 (minimum over the 50 runs) and 27 (maximum over the 50 runs) until the end of the execution, where it decreased to 0. The average number (average of the average numbers of machines used in each of the 50 runs) of machines effectively used is 22.47. Table 4 presents some of the results obtained. Table 4: Matrices' size: 1500, Blocks' size: 300, Number N of machines: 27

               CPU (in s)   Communication and waiting (in s)   Total execution (in s)
Non adaptive      108.011                            355.193                  463.204
Adaptive          111.744                            391.218                  502.962
Overhead               3%                                 9%                       8%

The parameters considered in our study are the CPU time, the communication and waiting time, and the total execution time of the master task (AMM module). The CPU time is the computing time, i.e. the time spent extracting blocks from the matrices A and B, updating the two matrices with the results returned by the workers, etc. The communication and waiting time represents the time spent by the master task in distributing work and waiting for results. The total execution time is the sum of the CPU time and the communication and waiting time. The second and third lines of Table 4 show the results obtained by considering respectively a non-adaptive context (week-end night) and an adaptive context (working day). The last line represents the overhead induced by the management of the adaptivity for each parameter. It is computed, for each column, using the following formula: overhead = (3rd line - 2nd line) / 3rd line.

One can note that the total overhead involved in the adaptivity management is not very important in comparison, for example, with the 10% obtained in [12] for managing fault tolerance in other numeric applications. The computation overhead is particularly small.
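The overhead line of Table 4 can be recomputed from the measured times with the formula given above; the sketch below does so (the row labels are ours):

```python
# Overhead of adaptivity management, per column of Table 4:
# overhead = (adaptive - non_adaptive) / adaptive, i.e. (3rd - 2nd) / 3rd line.

rows = {
    "CPU":                       (108.011, 111.744),
    "Communication and waiting": (355.193, 391.218),
    "Total execution":           (463.204, 502.962),
}

for name, (non_adaptive, adaptive) in rows.items():
    overhead = (adaptive - non_adaptive) / adaptive
    print(f"{name}: {overhead:.0%}")
```

This reproduces the 3%, 9% and 8% figures of the table, and also confirms that the total times are the sums of the CPU and communication-and-waiting times.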

6 Conclusions and Future Work

We have proposed a new parallel adaptive version of the block-based Gauss-Jordan algorithm. It exploits the inter-loop and intra-loop SPMD parallelisms. Thanks to the MARS adaptive programming environment, our application exploits the wasted time, i.e. when a machine becomes idle, work (a matrix product or a matrix triadic) is sent to it (application unfolding).

It also respects the machines' ownership, meaning that as soon as a user requisitions his machine, the work running on it is stopped (application folding) and the remaining partial work is resumed later, possibly on another machine. Moreover, our application integrates a scheduling strategy which uses a list PWL to store the partial work. In this policy, no new work is generated as long as PWL is not empty. PWL is managed according to a LIFO policy. Another list, AWL, is used for fault tolerance of the workers. Indeed, every time a work unit becomes active, its descriptor is stored in AWL, and as soon as the worker running it fails, the work is restarted by another worker with the help of its descriptor. The complete fault tolerance problem is considered in [7]. The application was developed and experimented with the MARS system. The results obtained show that the efficiency depends on the size of the blocks. The absolute efficiency can reach 92% for an application with matrices of size 1500 and blocks of size 300 on a farm of 7 DEC-ALPHA/OSF-1 processors interconnected by a Gigaswitch network. On the other hand, absolute efficiencies greater than 60% are obtained on a network of a reasonable number of SUN-Sparc 4/Solaris 2 workstations. Moreover, perfect efficiency is often reached on both architectures. Finally, other results obtained on a network of 27 heterogeneous machines show that the management of the adaptivity is not costly. For example, for an application with matrices of size 1500 and blocks of size 300 and a number of machines fluctuating between 20 and 27 with an average of 22.47, the computation overhead is 3% and the total execution overhead is 8%. In our implementation, the matrices A and B are stored in the main memory of the machine where the AMM resides, but that is not a constraint. Indeed, they could be stored in the memories of several machines or on a hard disk. These alternatives will be one of our further investigations.
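The PWL/AWL scheduling policy summarized above can be sketched as follows. This is a minimal illustration with hypothetical class and method names, not the MARS implementation:

```python
# Sketch of the scheduling policy: partial work interrupted by application
# folding goes to a LIFO list PWL and is resumed before any new work is
# generated; descriptors of active work are kept in AWL so that the unit
# of a folded or failed worker can be restarted elsewhere.

class Scheduler:
    def __init__(self):
        self.pwl = []   # Partial Work List, managed LIFO
        self.awl = {}   # Active Work List: worker id -> work descriptor

    def next_work(self, generate_new):
        # Resume partial work first; generate new work only if PWL is empty.
        if self.pwl:
            return self.pwl.pop()   # LIFO: last interrupted, first resumed
        return generate_new()

    def assign(self, worker, work):
        self.awl[worker] = work     # remember which worker runs which unit

    def fold(self, worker):
        # The machine's owner reclaimed it: keep the partial work for later.
        self.pwl.append(self.awl.pop(worker))

    def fail(self, worker):
        # The worker crashed: its unit is restarted from the stored descriptor.
        self.pwl.append(self.awl.pop(worker))
```

For example, after a worker folds, its interrupted unit is handed out again before any freshly generated matrix product or triadic.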
Another feature to address in the near future is heterogeneity, by taking into account the differences in processing speed between the processors when scheduling work. Finally, we plan to experiment with our application on a national wide-area network.

Acknowledgements

We would like to thank Zouhir Hafidi, a doctoral student at the Laboratoire d'Informatique Fondamentale de Lille, for having kindly put his long experience in developing MARS at our disposal.

References

[1] J. Choi, J. Dongarra, R. Pozo, and D. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proc. of the 4th Symp. on the Frontiers of Massively Parallel Computation, IEEE Publishers, pages 120-127, 1992.


[2] Guy Edjlali, Gagan Agrawal, Alan Sussman, and Joel Saltz. Data parallel programming in an adaptive environment. Proc. of the 9th IPPS, Santa Barbara, California, USA, pages 827-832, Apr 25-28, 1995.

[3] L.S. Blackford et al. ScaLAPACK: A linear algebra library for message-passing computers. SIAM Conference on Parallel Processing, Mar 1997.

[4] MPI Forum. MPI: A Message Passing Interface standard. Technical report, Apr 1994.

[5] A. Geist, A. Beguelin, J. Dongarra, et al., editors. PVM: Parallel Virtual Machine, A User's Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.

[6] Zouhir Hafidi, El Ghazali Talbi, and Jean-Marc Geib. Meta-systemes: vers l'integration des machines paralleles et les reseaux de stations heterogenes [Meta-systems: towards the integration of parallel machines and networks of heterogeneous workstations]. Calculateurs Paralleles, Ed. Hermes, 1997.

[7] D. Kebbal, E.G. Talbi, and J.M. Geib. A new approach for checkpointing parallel adaptive applications. PDPTA'97 Proceedings, Las Vegas, Nevada, USA, pages 1643-1651, Jun 1997.

[8] N. Melab, N. Devesa, M.P. Lecouffe, and B. Toursel. Adaptive load balancing of irregular applications. A case study: IDA* applied to the 15-puzzle problem. Springer-Verlag LNCS 1117, Proc. of the Third Intl. Workshop, IRREGULAR'96, Santa Barbara, California, USA, pages 327-338, Aug 19-21, 1996.

[9] Nouredine Melab. Gestion de la granularite et regulation de charge dans le modele P3 d'evaluation parallele des langages fonctionnels [Granularity management and load balancing in the P3 model of parallel evaluation of functional languages]. PhD thesis, Universite de Lille 1, 1997.

[10] Raymond Namyst. PM2: un environnement pour une conception portable et une execution efficace des applications paralleles irregulieres sur architectures distribuees [PM2: an environment for the portable design and efficient execution of irregular parallel applications on distributed architectures]. PhD thesis, Universite de Lille 1, 1997.

[11] Serge G. Petiton. Parallelization on an MIMD computer with real-time scheduler, Gauss-Jordan example. Aspects of Computation on Asynchronous Parallel Processors, M.H. Wright (editor), Elsevier Science Publishers B.V. (North-Holland), IFIP, 1989.

[12] James S. Plank, Youngbae Kim, and Jack J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. 25th Symposium on Fault-Tolerant Computing, Pasadena, CA, June 1995.

[13] Peter M.A. Sloot. Bibliography of the Parallel Scientific Computing and Simulation Group at the University of Amsterdam. Dept. of Mathematics and Computer Science, Univ. of Amsterdam, Kruislaan, Netherlands, 1996.

[14] Ole Tingleff. Systems of linear equations solved by block Gauss-Jordan method using a transputer cube. Technical report IMM-REP-1995-08, Institute of Mathematical Modelling, Technical University of Denmark, 1995.
