LABORATOIRE D’INFORMATIQUE DE L’UNIVERSITE DE FRANCHE-COMTE EA 4157

Load balancing of the direct linear multisplitting method in a grid computing environment
Christophe Denis — Raphaël Couturier — Fabienne Jézéquel

Research Report No. RR 2008–01 — Theme 3 — January 2008

Load balancing of the direct linear multisplitting method in a grid computing environment
Christophe Denis, Raphaël Couturier, Fabienne Jézéquel
Theme 3 — Algorithmique Numérique Distribuée
Research Report — January 2008

Abstract: Many scientific applications need to solve very large sparse linear systems in order to simulate phenomena close to reality. Grid computing is an answer to the growing demand for computational power, but communication times are significant and the bandwidth is variable, so frequent synchronizations slow down performance. The parallel linear multisplitting method reduces the number of synchronizations needed by a parallel direct algorithm, but it suffers from load imbalance. This paper presents a load balancing algorithm for the parallel linear multisplitting method in a grid computing environment, yielding gains on the execution time of up to 74%.
Key-words: load balancing, multisplitting, grid

Laboratoire d'Informatique de l'Université de Franche-Comté, Antenne de Belfort — IUT Belfort-Montbéliard, rue Engel Gros, BP 527, 90016 Belfort Cedex (France)
Téléphone : +33 (0)3 84 58 77 86 — Télécopie : +33 (0)3 84 58 77 32

Load balancing of the direct linear multisplitting method in a grid environment

Abstract: Many scientific applications require the solution of very large linear systems in order to simulate physical phenomena close to reality. Grid computing answers a growing demand for computing resources, but communication times are significant and the bandwidth is variable, which is why frequent synchronizations slow down performance. The parallel linear multisplitting method reduces the number of synchronizations compared with a direct parallel algorithm, but it may require load balancing techniques. This paper presents a load balancing algorithm for this method in a grid computing context, yielding execution time gains of up to 74%.
Keywords: load balancing, multisplitting, grid



Load balancing of the direct linear multisplitting method in a grid computing environment

Christophe Denis (a,b), Raphaël Couturier (c), Fabienne Jézéquel (a)

(a) Laboratoire d'Informatique de Paris 6, Université Pierre et Marie Curie - Paris 6, 4 place Jussieu, 75252 Paris CEDEX 05, France. Email: {Christophe.Denis,Fabienne.Jezequel}@lip6.fr
(b) School of Electronics, Electrical Engineering & Computer Science, The Queen's University of Belfast, Belfast BT7 1NN, U.K.
(c) Laboratoire d'Informatique de l'Université de Franche-Comté, BP 527, 90016 Belfort CEDEX, France. Email: [email protected]

February 8, 2008

1 Introduction

Many scientific applications need to solve very large sparse linear systems in order to simulate phenomena close to reality. Grid computing is an answer to the growing demand for computational power. In a grid computing environment, communication times are significant and the bandwidth is variable, so frequent synchronizations slow down performance. It is therefore desirable to reduce the number of synchronizations in a parallel direct algorithm. Inspired by multisplitting techniques, the GREMLINS (GRid Efficient Linear Systems) solver we develop consists in solving several linear problems obtained by splitting the original one. The design of the GREMLINS solver has been supported by the French National Research Agency (ANR). Each linear system is solved independently on a cluster using a direct method. The decomposition is usually built with a graph partitioning approach, which is not well suited to all parallel applications: with such a decomposition, the computing times over the subdomains can vary by a factor of two for our parallel multisplitting method. However, the global computing time of the method can be decreased by balancing the subdomain computing times. In this article we present a load balancing method that corrects, in terms of computational volume, an initial decomposition produced by graph partitioning tools. The load balancing method is driven by estimators of the computational volume of the parallel method.


The paper is organized as follows. Section 2 presents the parallel linear multisplitting method used in the GREMLINS library. Our load balancing algorithm is described in Section 3, and experimental results are analysed in Section 4. Finally, we present our concluding remarks and our future work.

2 The parallel linear multisplitting method used in the GREMLINS library

We first define the linear multisplitting method. The parallelization of this method in the GREMLINS (Grid Efficient Methods for LInear Systems) library is based on a domain decomposition. The matrix decomposition is then presented, together with its associated parallel algorithm.

2.1 Definition

In this section we follow the general formulation of multisplitting algorithms given in [1]. Let us consider the iterative solution of a large sparse linear system $Ax = b$ where $A$ is non-singular. The definition of a multisplitting of $A$ was introduced in [2].

Definition 1 A multisplitting of a non-singular matrix $A \in \mathbb{R}^{n \times n}$ is a collection of $n \times n$ matrices $(M_l, N_l, E_l)$ where $A = M_l - N_l$, the $M_l$ are non-singular, the $E_l$ are non-negative diagonal, and $\sum_{l=1}^{L} E_l = I$.

Let
– $L$ be a positive integer and, for $l = 1, ..., L$,
– $S_l$ be non-empty subsets of $\{1, ..., n\}$ such that $\bigcup_{l=1}^{L} S_l = \{1, ..., n\}$;
– $n_l = |S_l|$, the cardinality of $S_l$.

The sets $S_l$ do not necessarily need to be pairwise disjoint, and in this case we would have $\sum_{l=1}^{L} n_l > n$. Assume in addition that the $i$-th diagonal entry of $E_l$ is zero if $i \in \{1, ..., n\} \setminus S_l$.

Multisplitting algorithms are described by the iterations generated by the successive approximations associated with an extended fixed point mapping defined from $(\mathbb{R}^n)^L$ into itself, where $n$ is the dimension of the problem and $L$ is the number of processors. This fixed point mapping is defined as follows:

$$T : (\mathbb{R}^n)^L \longrightarrow (\mathbb{R}^n)^L, \qquad X = (x^1, ..., x^L) \longmapsto Y = (y^1, ..., y^L), \qquad (1)$$

such that, for $l \in \{1, ..., L\}$,

$$y^l = T^{(l)}(z^l), \qquad z^l = \sum_{k=1}^{L} E_{lk} x^k, \qquad (2)$$

where the $E_{lk}$ are weighting matrices satisfying

$$E_{lk} \text{ diagonal}, \quad E_{lk} \ge 0, \quad \sum_{k=1}^{L} E_{lk} = I_n \text{ (identity matrix)}, \quad \forall l \in \{1, ..., L\}. \qquad (3)$$

In (2),

$$T^{(l)}(z^l) = M_l^{-1} N_l z^l + M_l^{-1} b \qquad (4)$$

where

$$A = M_l - N_l, \qquad l = 1, ..., L. \qquad (5)$$

Algorithm 1 summarizes this iterative method, called the linear multisplitting method.

Algorithm 1 The linear multisplitting method
Require: an initial vector $x^0$
1: for $i = 0, 1, ...$ until convergence do
2:   for $l = 1$ to $L$ do
3:     solve $M_l y^l = N_l x^i + b$
4:   end for
5:   $x^{i+1} = \sum_{l=1}^{L} E_l y^l$
6: end for
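As an illustration, the following minimal Python sketch runs the synchronous iteration of Algorithm 1 sequentially, assuming the simplest multisplitting: disjoint row blocks, $M_l$ equal to the diagonal block of $A$ on the rows of $S_l$, and $E_l$ selecting those rows (i.e. the block Jacobi particular case). The function name and the even row split are illustrative, not part of the GREMLINS library.

# Sequential sketch of Algorithm 1 (synchronous case), assuming a disjoint
# row-block splitting: M_l is the diagonal block of rows S_l, N_l = M_l - A,
# and E_l selects the rows of S_l.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def multisplitting_solve(A, b, L, tol=1e-8, max_iter=200):
    n = A.shape[0]
    bounds = np.linspace(0, n, L + 1, dtype=int)   # rows of block l: [bounds[l], bounds[l+1])
    x = np.zeros(n)
    for _ in range(max_iter):
        x_new = np.empty(n)
        for l in range(L):
            lo, hi = bounds[l], bounds[l + 1]
            ASub = A[lo:hi, lo:hi]                 # M_l restricted to S_l
            # right-hand side of M_l y^l = N_l x + b restricted to the rows of S_l
            rhs = b[lo:hi] - A[lo:hi, :] @ x + ASub @ x[lo:hi]
            x_new[lo:hi] = spla.spsolve(ASub.tocsc(), rhs)
        if np.linalg.norm(x_new - x) < tol * np.linalg.norm(x_new):
            return x_new
        x = x_new
    return x

# Example: a small diagonally dominant sparse system split over L = 4 blocks.
if __name__ == "__main__":
    n = 400
    A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)
    x = multisplitting_solve(A, b, L=4)
    print(np.linalg.norm(A @ x - b))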

2.2 Matrix decomposition

In the following we consider an n-dimensional linear system

$$Ax = b. \qquad (6)$$

The matrix decomposition used in the GREMLINS library consists in splitting the matrix into horizontal rectangular matrices. Each processor is then responsible for the management of one rectangular matrix. Let L be the number of processors involved in the parallel computation; the subset $S_l$ represents the subset of rows of A distributed to processor l. With this distribution, a processor knows the offset of its computation. This offset defines the submatrix, denoted ASub, which the processor is in charge of managing. The part of the rectangular matrix before the submatrix represents the left dependencies, called DepLeft, and the part after the submatrix represents the right dependencies, called DepRight. Similarly, XSub represents the part of the unknown vector to solve and BSub the part of the right-hand side involved in the computation. Fig. 1 describes the decomposition of A, b and x into the several parts (DepLeft, ASub, DepRight, XLeft, XSub, XRight, BSub) required locally by a processor.

Fig. 1 – Decomposition of the matrix: the rectangular matrix of a processor is split into DepLeft, ASub and DepRight, the unknown vector into XLeft, XSub and XRight, and the local right-hand side part is BSub

The matrix decomposition needs to find the subsets $S_l$ for l = 1, ..., L. We call $D = \{S_0, ..., S_{L-1}\}$ the decomposition set. Let TabOffset be an integer array of size L + 1. The matrix is globally reordered such that the rows belonging to each subset $S_l$ lie between TabOffset[l] and TabOffset[l+1] - 1.
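The sketch below illustrates this decomposition with SciPy, assuming CSR storage and an even row split; the helper local_parts is hypothetical and only mirrors the DepLeft/ASub/DepRight naming of Fig. 1.

# Hypothetical helper mirroring Fig. 1, assuming the rows are already globally
# reordered so that TabOffset[l] .. TabOffset[l+1]-1 belong to S_l.
import numpy as np
import scipy.sparse as sp

def local_parts(A, TabOffset, l):
    lo, hi = TabOffset[l], TabOffset[l + 1]
    rect = A[lo:hi, :].tocsc()          # horizontal rectangular matrix of processor l
    DepLeft  = rect[:, :lo]             # columns before the diagonal block
    ASub     = rect[:, lo:hi]           # square diagonal block (to be factorized)
    DepRight = rect[:, hi:]             # columns after the diagonal block
    return DepLeft, ASub, DepRight

# Even split of an n-row matrix over L processors.
n, L = 1000, 4
TabOffset = np.linspace(0, n, L + 1, dtype=int)
A = sp.random(n, n, density=0.01, format="csr") + sp.eye(n, format="csr")
DepLeft, ASub, DepRight = local_parts(A, TabOffset, l=2)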

2.3 The parallel linear multisplitting method

The decomposition of the matrix A is illustrated in Fig. 1. The submatrix (denoted ASub) is the square matrix that a processor is in charge of. With such a decomposition, at each step a processor computes XSub by solving the following subsystem:

$$ASub * XSub = BSub - DepLeft * XLeft - DepRight * XRight \qquad (7)$$

As soon as it has computed the solution of this subsystem, the processor sends it to all the processors depending on it. The four main steps of the linear multisplitting method are described as follows:

1. Initialization. The way the matrix is loaded or generated is free. Each processor manages the loading of its rectangular matrix (in the algorithm the rectangular matrix corresponds to DepLeft + ASub + DepRight). Then, until convergence, each processor iterates on:

2. Computation. At each iteration, each processor computes BLoc = BSub - DepLeft * XLeft - DepRight * XRight. Then it computes XSub using the Solve(ASub, BLoc) function.


3. Data exchange. Each processor sends its dependencies to its neighbours. As the dependencies can differ from one processor to another, the function PartOf computes the part dedicated to processor i. When a processor receives a part of the solution vector (denoted XSub) from one of its neighbours, it updates its XLeft or XRight vector according to the rank of the sending processor.

4. Convergence detection. Two methods are possible to detect the convergence: either the centralized algorithm described in [3], or a decentralized and more general version as described in [4].

This algorithm may look similar to the block Jacobi algorithm. It actually generalizes the block Jacobi method, which is only a particular case of the multisplitting method. The two main differences of the multisplitting method are:
– it may use asynchronous iterations, in which case the execution time may be reduced (for a complete description of asynchronous algorithms, interested readers may consult [5]);
– some components may be overlapped and computed by more than one processor; overlapping some components of the system may drastically reduce the number of iterations needed to reach convergence, whatever the chosen threshold is.

From a practical point of view, the use of asynchronous iterations consists in using non-blocking receptions, dissociating computations from communications using threads, and using an appropriate convergence detection algorithm. Overlapping some components is simple to implement, since one only needs to change the variables Offset and SizeSub so that some components are computed by two processors, and to define how overlapped components are taken into account.

The sequential solver used in the algorithm is free: it can be a direct one or an iterative one. The linear multisplitting method is called
– direct linear multisplitting method if the sequential solver is direct;
– iterative linear multisplitting method if the sequential solver is iterative.
For the former class (direct), the most time-consuming part is the factorization, which is carried out only at the first iteration. With large matrices, even after the decomposition process, the size of the submatrix that a processor has to solve sequentially may be quite large, so the time required to factorize it may be long. In return, the use of a sequential direct solver allows the subsequent iterations to be computed very quickly, because only the right-hand side changes at each iteration. For the latter class (iterative), all iterations require approximately the same time to be computed. So, if the user knows in advance some properties of the matrix to be solved, he can choose the appropriate class of solver.
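A minimal per-processor sketch of this direct variant is given below, assuming SciPy's SuperLU in place of the MUMPS solver used in the paper; exchange and converged stand for the data-exchange and convergence-detection steps and are placeholders, not library functions.

import scipy.sparse.linalg as spla

def processor_loop(ASub, DepLeft, DepRight, BSub, XLeft, XRight, exchange, converged):
    # Factorization of the diagonal block: done once, at the first iteration (costly step).
    lu = spla.splu(ASub.tocsc())
    while not converged():
        # Step 2 (Computation): new right-hand side, then reuse the factors.
        BLoc = BSub - DepLeft @ XLeft - DepRight @ XRight
        XSub = lu.solve(BLoc)
        # Step 3 (Data exchange): send XSub to the dependent processors and
        # receive their updated parts of XLeft / XRight.
        XLeft, XRight = exchange(XSub)
    return XSub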


The important characteristics of a sparse matrix are its pattern and the spectral radius of its iteration matrix. The pattern of a submatrix determines the time required to factorize it at the first iteration of a solve with a sequential direct solver. Some previous works, carried out on the load balancing of the multiple front method [6, 7], have been successfully adapted to the linear multisplitting solver, as presented in the next section. The spectral radius of the iteration matrix is correlated with the number of iterations required to solve the system: the closer the spectral radius is to 1, the more iterations are required, as with all iterative methods. The condition for the asynchronous version to converge is slightly different and more restrictive than for the synchronous one, i.e. in some rare practical cases the synchronous version would converge whereas the asynchronous one would not. As the explanation of this condition is quite complex and relies on several mathematical tools, we invite interested readers to consult [8].

3 Load balancing

This paper focuses on reducing the execution time of the direct linear multisplitting method. The performance of the direct linear multisplitting method is strongly influenced by the matrix decomposition set. In particular, if the factorization computing times of a decomposition set are not balanced, some processors will be idle. The parallel factorization of each diagonal block $A_l$ is performed during the first iteration of the method, which is the most time-consuming one. In this paper we use the sequential direct solver coming from the MUMPS software [9]. The decomposition set issued from the matrix distribution presented in Section 2.2 provides submatrices $A_l$, l = 1, ..., L, that are load balanced in terms of number of rows but unfortunately not in terms of computational volume. Indeed, as sparse factorization techniques are used, the computational volume depends on the pattern of $A_l$ and not on its size. The aim of the load balancing method is to better distribute the factorization computational volume over the diagonal blocks by predicting their computing times. We first present how we estimate the factorization times and then the load balancing (LB) algorithm. Experiments have been conducted on the GRID'5000 architecture, a nation-wide experimental grid [10]. Currently, the GRID'5000 platform is composed of an average of 1300 bi-processors located in 9 sites in France: Bordeaux, Grenoble, Lille, Lyon, Nancy, Orsay, Rennes, Sophia-Antipolis and Toulouse. Most of these sites have a Gigabit Ethernet network for local machines, and the links between the different sites range from 2.5 Gbps up to 10 Gbps. Most processors are AMD Opteron. For more details on the GRID'5000 architecture, interested readers are invited to visit the website www.grid5000.fr.


3.1 Estimation of the factorization times

Let Nop(A) be the estimated number of operations required to factorize the sparse matrix A. We propose the following simple model to predict the estimated factorization time $T^{estim}_{fac}(A)$:

$$T^{estim}_{fac}(A) = \alpha \, Nop(A) + \beta \qquad (8)$$

The number of operations required to factorize a diagonal block can be obtained directly from the analysis phase of the direct solver; this is the case for the direct solver included in the MUMPS software. Unfortunately, some solvers do not provide this piece of information. However, it is possible to study the algorithmic behaviour of the direct linear solver to estimate the number of operations for a given matrix. This work has been successfully achieved for the multiple front method [6]. The parameters α and β are determined by using a least squares method: n factorizations of diagonal blocks $A_i$, i = 1, ..., n, are carried out on the same computer, and the number of operations $Nop(A_i)$ and the measured factorization times $T^{mesu}_{fac}(A_i)$ are obtained for each factorization. The least squares method leads to the overdetermined linear system

$$\begin{pmatrix} Nop(A_1) & 1 \\ \vdots & \vdots \\ Nop(A_n) & 1 \end{pmatrix} \cdot \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} T^{mesu}_{fac}(A_1) \\ \vdots \\ T^{mesu}_{fac}(A_n) \end{pmatrix} \qquad (9)$$

which is then solved in the least squares sense. We have performed this process by carrying out 100 sparse matrix factorizations on the paravent cluster (99 AMD Opteron 246 2.0 GHz) of Grid'5000 to obtain $Nop(A_i)$ and $T^{mesu}_{fac}(A_i)$; the parameters α and β are then obtained by solving Eq. 9. To test the robustness of this approach, we have performed other submatrix factorizations on the Lille cluster (53 AMD Opteron 248 2.2 GHz) of Grid'5000, which has a slightly different processor speed. Fig. 2 shows the measured factorization times and the factorization times estimated using Eq. 8 with the α and β computed on the paravent cluster. The relative errors between the estimated and the measured computing times are represented in Fig. 3: the maximum relative error is 17.5% whereas the mean relative error is 12%. One could argue for a finer model that counts exactly the number of instructions required to factorize a matrix and multiplies it by the processor cycle time. From our point of view this approach has no interest: the compiler carries out optimizations which are difficult to control, processors have speculative instruction schedulers that also make a finer prediction difficult, and it is difficult to model the real behaviour of optimized routines such as BLAS. Last but not least, we need a robust model as we work on a heterogeneous grid.
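A sketch of this calibration with NumPy is given below; the Nop and time arrays are placeholder values for illustration only, not the measurements of the paper.

import numpy as np

# Placeholder calibration data: Nop(A_i) from the solver's analysis phase and the
# measured factorization times (in seconds) on the calibration cluster.
nop = np.array([1.2e9, 3.5e9, 7.8e9, 1.6e10, 4.1e10])
t_mesu = np.array([0.9, 2.6, 5.8, 12.1, 30.4])

# Least squares fit of Eq. (9): [Nop(A_i) 1] . [alpha beta]^T = T_fac^mesu(A_i).
M = np.column_stack([nop, np.ones_like(nop)])
(alpha, beta), *_ = np.linalg.lstsq(M, t_mesu, rcond=None)

def t_fac_estim(nop_block):
    """Estimated factorization time of a diagonal block, Eq. (8)."""
    return alpha * nop_block + beta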

Fig. 2 – Comparison between the estimated and the measured factorization computing times (in s)

Fig. 3 – Relative error (in %) between the estimated and the measured computing times

3.2 The load balancing algorithm

The algorithm takes as input an initial decomposition set $D_i$. Its aim is to decrease the maximum factorization computing time. The outline of the LB algorithm is presented in Algorithm 2; the reordering and estimation statements are executed in parallel on the L processors without any communication. Direct solvers usually perform a matrix reordering before the factorization step in order to decrease the amount of computation.


In our context the reordering process is the Approximate Minimum Degree method [11], as it is the one often used by the MUMPS software. It is thus important to reorder each diagonal block with the method used by the direct solver, in order to estimate its factorization computing time correctly. We denote by $T^{\star}_{max} = \max_l T^{\star}_{fac}(A_l)$ the maximum factorization time, where $\star$ is replaced by estim if these computing times are estimated or by mesu if they are measured. The algorithm returns a final decomposition $D_c$ having a smaller (or equal, in the worst case) $T^{estim}_{max}$ value.

Algorithm 2 Outline of the LB algorithm
1: continue ← true
2: Build the initial decomposition set $D_i$
3: D ← $D_i$
4: Reorder in parallel each block $A_l$ by using the appropriate reordering method
5: Compute $T^{estim}_{fac}(A_l)$, l = 1, ..., L, in parallel
6: Compute $T^{estim}_{max}$
7: k ← 0
8: best_solution ← $T^{estim}_{max}$
9: repeat
10:   k ← k + 1
11:   Find the diagonal block $A_l$ having the maximum estimated factorization time
12:   Transfer nt rows from $A_l$ to its neighbour block (see Subsection 3.2.1)
13:   Reorder in parallel each block $A_l$ by using the appropriate reordering method
14:   Compute $T^{estim}_{fac}(A_l)$, l = 1, ..., L, in parallel
15:   Compute $T^{estim}_{max}$
16:   if $T^{estim}_{max}$ < best_solution then
17:     $D_c$ ← D
18:     best_solution ← $T^{estim}_{max}$
19:   end if
20:   continue ← EndLoop?() (see Subsection 3.2.2)
21: until continue = true
22: Return $D_c$
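The following Python sketch mirrors Algorithm 2 under simplifying assumptions: estimate_tfac and transfer_rows are placeholders for the reordering/estimation of Eq. (8) and for the row transfer of Subsection 3.2.1, and the EndLoop? test is reduced to a fixed number of iterations NbIt.

import copy

def load_balance(TabRow, L, estimate_tfac, transfer_rows, NbIt=5):
    # Estimated factorization times of the L blocks (done in parallel in the real solver).
    t = [estimate_tfac(l, TabRow) for l in range(L)]
    best, best_TabRow = max(t), copy.deepcopy(TabRow)
    for k in range(1, NbIt + 1):
        block_max = max(range(L), key=lambda l: t[l])    # block with the largest estimated time
        transfer_rows(TabRow, t, block_max)              # move nt rows towards the neighbour block
        t = [estimate_tfac(l, TabRow) for l in range(L)]
        if max(t) < best:                                # keep the best decomposition seen so far
            best, best_TabRow = max(t), copy.deepcopy(TabRow)
    return best_TabRow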

3.2.1 The transfer of nt rows

The decomposition set of the matrix is stored in a global array TabRow. Each diagonal block l, l ∈ {0, ..., L-1}, holds the contiguous rows of A between TabRow[l] and TabRow[l+1] - 1. Each diagonal block $A_l$ is formed in parallel on the L processors. Then, the estimated factorization computing times $T^{estim}_{fac}(A_l)$, l = 1, ..., L, are computed in parallel. Each processor l sends its $T^{estim}_{fac}(A_l)$ to the master processor 0, which stores these estimators in a local array Q, so that Q[l] = $T^{estim}_{fac}(A_l)$.


The following process is performed serially by the master processor. Thanks to the array Q, it determines the diagonal block BlockMax having the maximum estimated factorization time $T^{estim}_{max}$. The aim of transferring nt rows from BlockMax to another block BlockDest is to decrease $T^{estim}_{max}$. Note that, due to the use of a reordering method on each block after the transfer step, $T^{estim}_{max}$ may increase (see Fig. 5).

The choice of BlockDest depends on the value of BlockMax. We have arbitrarily chosen that if BlockMax ∈ {0, ..., L-2}, the last nt rows of BlockMax are transferred to the diagonal block BlockDest = BlockMax + 1. We now need to consider the case where BlockMax is equal to L - 1. A first simple approach consists in transferring the first nt rows of BlockMax to the diagonal block BlockMax - 1. Unfortunately, this method often produces a local transfer cycle between BlockMax and BlockMax - 1. To avoid this, we choose BlockDest = 0 if BlockMax is equal to L - 1, that is to say that nt rows are added to block 0, implicitly reducing the rows of BlockMax. Let NbRowBlockMax be the number of rows of BlockMax; nt is computed as follows:

$$nt = \frac{1}{2} \times \frac{Q[BlockMax]}{Q[BlockMax] + Q[BlockMax + 1]} \times NbRowBlockMax \qquad (10)$$

The transfer process is summarized in Alg. 3.

Algorithm 3 Transfer of nt rows
1: Compute nt by using Eq. 10
2: if BlockDest ≠ 0 then
3:   TabRow[BlockDest] ← TabRow[BlockDest] - nt
4: else
5:   for i = 1 to L - 1 do
6:     TabRow[i] ← TabRow[i] + nt
7:   end for
8: end if
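A hypothetical Python transcription of Eq. (10) and Algorithm 3 is given below; when BlockMax = L - 1 it plugs Q[BlockDest] = Q[0] into Eq. (10), which is an assumption since the text states the formula with Q[BlockMax + 1].

def transfer_rows(TabRow, Q, BlockMax):
    L = len(Q)
    # BlockDest as described in the text: next block, or block 0 when BlockMax = L-1.
    BlockDest = BlockMax + 1 if BlockMax <= L - 2 else 0
    NbRowBlockMax = TabRow[BlockMax + 1] - TabRow[BlockMax]
    # Eq. (10); Q[BlockDest] replaces Q[BlockMax + 1] in the wrap-around case (assumption).
    nt = int(0.5 * Q[BlockMax] / (Q[BlockMax] + Q[BlockDest]) * NbRowBlockMax)
    if BlockDest != 0:
        TabRow[BlockDest] -= nt            # BlockDest absorbs the last nt rows of BlockMax
    else:
        for i in range(1, L):              # block 0 grows by nt, block L-1 implicitly shrinks
            TabRow[i] += nt
    return nt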

3.2.2 Number of iterations of the load balancing algorithm

The optimal number of iterations $k_{opt}$ of our LB algorithm is difficult to predict. In a first approach, we could overestimate this number and stop the algorithm after, for example, 100 iterations. This is useful to show the behaviour of the load balancing algorithm, as presented in Subsection 4.1. In practice, however, the use of the LB algorithm is obviously only interesting if the gain obtained by load balancing is greater than the computing time of the load balancing itself.


The gain obtained by load balancing at iteration k is defined in Eq. 11:

$$gain(k) = T^{estim}_{max}(0) - T^{estim}_{max}(k) - T_{LB}(k) \qquad (11)$$

where
– $T^{estim}_{max}(k)$ is the maximum estimated factorization computing time at iteration k;
– $T_{LB}(k)$ is the elapsed computing time of the LB algorithm at iteration k.

Remark 1 $T^{estim}_{max}(0)$ is the maximum estimated factorization computing time obtained without load balancing.

According to Eq. 11, we then define an upper limit $k_{lim}$ on the number of load balancing iterations:

$$k_{lim} = \min \{ k \;\text{such that}\; T^{estim}_{max}(0) - T^{estim}_{max}(k) - T_{LB}(k) \le 0 \} \qquad (12)$$

It would be better to find the optimal number of iterations $k_{opt}$ such that:

$$k_{opt} = \arg\max_{k} \; \big( T^{estim}_{max}(0) - T^{estim}_{max}(k) - T_{LB}(k) \big) \qquad (13)$$

Let us suppose, in a first approach, that $T^{estim}_{max}(k)$ strictly decreases and $T_{LB}(k)$ strictly increases over the iterations, as presented in Fig. 4. It is then possible to compute the gain gain(k) at each iteration k and to stop at the iteration $k_1$ where $gain(k_1)$ starts to decrease. In this case $k_{opt} = k_1 - 1$.

Fig. 4 – Theoretical case: evolution of $T^{estim}_{max}(k)$ and $T_{LB}(k)$ (the gain is maximum around k = 35, where the global computing time is minimum)

Unfortunately, $T^{estim}_{max}(k)$ does not strictly decrease: the reordering process performed at each iteration of the LB algorithm makes the variation of $T^{estim}_{max}(k)$ irregular.


Fig. 5 – Real case (cage15 matrix, 64 processors): evolution of $T^{estim}_{max}(k)$ and $T_{LB}(k)$ (the global computing time is minimum at k = 4)

Fig. 5 represents the variation of $T^{estim}_{max}(k)$ for a real case. It is not possible to find $k_{opt}$ during the run of the LB algorithm, but only $k_{lim}$. In practice, we have adopted the following strategy: the LB algorithm stops if $k = k_{lim}$ or $k > NbIt$, a fixed number of iterations provided by the user. In our experiments, NbIt = 5 appears to be a good compromise, as presented in Subsection 4.2.
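Expressed as code, this stopping rule boils down to the small test below (a sketch; the time arguments are whatever the implementation records at iteration k):

def should_stop(k, t_max_estim_0, t_max_estim_k, t_lb_k, NbIt=5):
    # Stop when the estimated gain of Eq. (11) becomes non-positive (k = k_lim)
    # or when the user-provided iteration budget NbIt is exceeded.
    gain = t_max_estim_0 - t_max_estim_k - t_lb_k
    return gain <= 0 or k > NbIt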

4 Experimental results

The objective of the experiments is to show the benefit of the load balancing method, both without and with taking the load balancing computing time into account. The experiments have been conducted on the GRID'5000 architecture. The initial decomposition required by the multisplitting method strongly influences the efficiency of the direct linear multisplitting method. To underline this fact, we have chosen the three following decomposition methods:
– NAT: the initial decomposition set is built by dividing the rows of the matrix evenly among the processors;
– MET: the matrix A is modelled as a graph and the multilevel graph partitioning tool METIS [12] is applied to it to obtain the initial decomposition set;
– RCM: the matrix is reordered with the Reverse Cuthill-McKee method [13], then the initial decomposition set is built by dividing the rows of the reordered matrix evenly among the processors.
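For illustration, the NAT and RCM decompositions above can be reproduced with SciPy as sketched below (the helper names are ours; a MET decomposition would call a METIS binding instead):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def nat_offsets(n, L):
    # NAT: even split of the rows as they are.
    return np.linspace(0, n, L + 1, dtype=int)

def rcm_decomposition(A, L):
    # RCM: reorder the matrix with Reverse Cuthill-McKee, then split evenly.
    perm = reverse_cuthill_mckee(sp.csr_matrix(A), symmetric_mode=False)
    A_rcm = A.tocsr()[perm, :][:, perm]
    return A_rcm, nat_offsets(A.shape[0], L)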


Table 1 presents the matrices taken from the University of Florida Sparse Matrix Collection [14].

Tab. 1 – The three matrices used in the experiments
Matrix name   Number of rows   Number of nonzero entries
cage13        445,315          7,479,343
cage14        1,505,785        27,130,349
cage15        5,154,859        99,199,551

In order to measure the effect of the load balancing, the direct linear multisplitting method is run with and without the LB algorithm (without load balancing corresponds to $T_{LB} = 0$). The experiments described in Subsections 4.1 and 4.2 have been carried out on a local cluster of Grid'5000; they differ in the number of iterations of the LB algorithm. The experiments shown in Subsection 4.3 have been carried out on one local cluster and on two distant clusters.

4.1 Results obtained for NbIt = 100

We present here the results obtained on the Orsay cluster of Grid'5000 (186 AMD Opteron 246 2.0 GHz). The number of iterations of the LB algorithm is overestimated at 100; the aim is to show the effect of the load balancing without taking into account its cost in terms of computing time. The input matrices are the cage13 and cage14 matrices, successively split with the three decomposition methods NAT, RCM and MET presented above: into 8 and 16 parts for the cage13 matrix and into 16 and 32 parts for the cage14 matrix. Table 2 presents:
– the maximum measured factorization time $T^{mesu}_{max}$;
– the load balancing computing time $T_{LB}$;
– the resolution computing time $T_{res}$, which does not take $T_{LB}$ into account;
– the global computing time $T_{glob}$, which includes $T_{LB}$;
– the load balancing criterion $\Delta(D)$.
Let $\{q_0, ..., q_i, ..., q_{L-1}\}$ be the measured factorization times associated with D. The load balancing criterion $\Delta(D)$ is defined as:

$$\Delta(D) = \frac{\frac{1}{L} \sum_{i=0}^{L-1} q_i}{\max_{i=0}^{L-1} q_i} \times 100 \qquad (14)$$

The load balancing criterion $\Delta(D)$ of a decomposition set ideally load balanced in terms of the quantities $q_i$ is equal to 100%. In all cases, the LB algorithm decreases the maximum factorization computing time by better distributing the computational load over the processors.
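For reference, the criterion of Eq. (14) is computed as follows (a small sketch; the time lists in the usage lines are illustrative):

import numpy as np

def delta(q):
    # q: measured factorization times of the L diagonal blocks.
    q = np.asarray(q, dtype=float)
    return 100.0 * q.mean() / q.max()

print(delta([20.0, 22.0, 19.5, 21.0]))   # well balanced: close to 100%
print(delta([5.0, 5.0, 5.0, 60.0]))      # one dominant block: low Delta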


For example, let us consider the case where the cage14 matrix is split into 32 parts. The initial decomposition MET provides $T^{mesu}_{max} = 308$ s. The use of the load balancing method decreases $T^{mesu}_{max}$ to 26.5 s and increases $\Delta(D)$ from 5.8% to 48%. As the factorization is the most time-consuming part of the multisplitting process, the resolution computing time also decreases ($T_{res} = 427$ s without the LB algorithm and $T_{res} = 137$ s with it). Figures 6 and 7 show the evolution of $T^{mesu}_{max}$, $T_{res}$ and $\Delta(D)$ with and without the LB algorithm for the cage13 matrix on 8 and 16 processors on a local cluster. Unfortunately, as the number of iterations of the LB method is overestimated, the load balancing computing time is higher than the gain obtained by load balancing. Consider again the cage14 matrix split into 32 parts with the MET initial partition: the load balancing time is $T_{LB} = 314$ s whereas the time reduction on $T_{res}$ is 290 s. Consequently, although there is a time reduction on $T_{res}$, the global computing time is higher with the LB method. The number of iterations of the LB method needs to be reduced to obtain a real benefit on the global computing time, as shown in the following subsection. The initial decompositions built with RCM or NAT provide lower global computing times, with and without load balancing, than the one built with MET. Remark also that, for the cage14 matrix split into 16 parts with the NAT initial decomposition set, the solving could not be carried out without the LB algorithm.

Fig. 6 – $T^{mesu}_{max}$ and $T_{res}$ with and without the use of the LB algorithm for the cage13 matrix on 8 and 16 processors on a local cluster

4.2 Results obtained for NbIt = 5

The experimental protocol differs from that of the previous subsection only in the number of iterations of the LB algorithm: NbIt is reduced to 5.


Fig. 7 – The load balancing criterion with and without the use of the LB algorithm for the cage13 matrix on 8 and 16 processors on a local cluster

Consider again the case where the cage14 matrix is split over 32 processors with MET as the initial decomposition. The use of the load balancing method with NbIt = 5 decreases $T^{mesu}_{max}$ to 119 s and increases $\Delta(D)$ from 5.8% to 22%. The resolution computing time $T_{res}$ decreases from 427 s to 321.3 s. These results obviously seem less impressive than for NbIt = 100, but the load balancing computing time is only $T_{LB} = 22.7$ s whereas the time reduction on $T_{res}$ is 189 s. So, thanks to the LB algorithm, we obtain a gain on the global computing time of $g_{T_{glob}} = 19.4\%$. We observe the same phenomenon in the other cases, as shown in Figures 8, 9, 10 and 11.
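For instance, the 19.4% figure can be checked from the global times of Table 3 ($T_{glob} = 427$ s without the LB algorithm, 344 s with it), assuming the relative gain is defined with respect to the global time obtained without load balancing:

$$g_{T_{glob}} = \frac{T_{glob}^{\text{without LB}} - T_{glob}^{\text{with LB}}}{T_{glob}^{\text{without LB}}} \times 100 = \frac{427 - 344}{427} \times 100 \approx 19.4\%$$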

4.3 Results obtained for 128 processors on two distant clusters

The main aim of the experiments presented here is to compare the behaviour of the load balancing algorithm on a local cluster and on two distant clusters. We also increase the size of the problem by taking the cage15 matrix and 128 processors. The METIS and RCM decompositions are not used on cage15, as they require a sequential reordering of the whole matrix, which produces a bottleneck for larger matrices such as cage15. In these experiments we use the NAT domain decomposition, and the number of iterations of the LB algorithm varies from 0 (without the LB algorithm) to 20. The results show that the LB algorithm is scalable: its computing time does not increase much when it runs on two distant clusters.


Tab. 2 – Results obtained for 100 iterations: the maximum measured factorization time $T^{mesu}_{max}$, the load balancing computing time $T_{LB}$, the resolution computing time $T_{res}$, the global computing time $T_{glob}$ and the load balancing criterion $\Delta$

Matrix   Processors   Partition   Tmax^mesu (s)   TLB (s)   Tres (s)   Tglob (s)   ∆ (%)
cage13   8    NAT   191    0      215     215     17.8
cage13   8    NAT   20     337    40      377     68
cage13   8    RCM   98     0      125.3   125.3   12.5
cage13   8    RCM   32     274    57      331     82
cage13   8    MET   245    0      265     265     22.4
cage13   8    MET   58     318    88      406     89
cage13   16   NAT   28.6   0      46.2    46.2    13.46
cage13   16   NAT   3.3    177    21      198     80
cage13   16   RCM   14.4   0      36.9    36.9    9.33
cage13   16   RCM   2.4    164    24      188     56
cage13   16   MET   47     0      72.7    72.7    9.2
cage13   16   MET   8.2    161    34      195     93
cage14   16   NAT   OOC    0      OOC     OOC     4.3
cage14   16   NAT   135    651    866     ∞       47
cage14   16   RCM   494    0      1534    1534    4
cage14   16   RCM   16.5   592    109     701     59
cage14   16   MET   308    0      427     427     5.8
cage14   16   MET   26.5   314    137     451     45
cage14   32   NAT   183    0      183     256     4.3
cage14   32   NAT   8.9    292    82      374     47
cage14   32   RCM   41     0      128     128     4.17
cage14   32   RCM   2.16   314.2  73.8    388     63
cage14   32   MET   308    0      427     427     5.8
cage14   32   MET   26.5   314    137     451     45

On the other hand, the matrix deployment computing time increases a lot on two distant clusters, as the amount of communication is high, unlike with the LB method.

5 Conclusion and future work

We have presented in this paper an LB algorithm that decreases the global computing time of the direct linear multisplitting method. There is a trade-off between the load balancing of the decomposition set and the effective gain on the computing time. An optimal number of iterations of the LB method exists, but it is difficult to predict, as the decrease of the maximal factorization time is irregular due to the reordering process.


Fig. 8 – The global computing time with and without the use of the LB algorithm for the cage13 matrix on 8 and 16 processors on a local cluster (NbIt = 5)

Fig. 9 – The relative gain obtained on the global computing time by using the LB algorithm for the cage13 matrix on 8 and 16 processors on a local cluster (NbIt = 5)

The load balancing algorithm can be used both on a local cluster and on distributed clusters, as the overhead in computing time due to communication is not high. Future work consists in testing the scalability of the LB algorithm on more processors and, more importantly, in improving the deployment of the matrix, which is very time-consuming on distributed clusters. On the other hand, it seems that some discretization methods, like the finite element method, could be used with the direct multisplitting method, as is the case in the parallel multiple front method (see for example [15]).


Fig. 10 – The global computing time with and without the use of the LB algorithm for the cage14 matrix on 16 and 32 processors on a local cluster (NbIt = 5)

Fig. 11 – The relative gain obtained on the global computing time by using the LB algorithm for the cage14 matrix on 16 and 32 processors on a local cluster (NbIt = 5)

Comparing these two approaches on the GRID'5000 distributed grid would be an interesting perspective.


Tab. 3 – Results obtained for 5 iterations: the measured factorization time $T^{mesu}_{fac}$, the load balancing computing time $T_{LB}$, the resolution computing time $T_{res}$, the global computing time $T_{glob}$, the gain $G_{T_{glob}}$ obtained on $T_{glob}$ by load balancing and the load balancing criterion $\Delta$

Matrix   Processors   Partition   Tfac^mesu (s)   TLB (s)   Tres (s)   Tglob (s)   GTglob (%)   ∆ (%)
cage13   8    NAT   191    0     215     215     0      17.8
cage13   8    NAT   29     11    45      56      74     56
cage13   8    RCM   98     0     125.3   125.3   0      12.5
cage13   8    RCM   57.4   14    78      92      26.5   67
cage13   8    MET   245    0     265     265     0      22.4
cage13   8    MET   78     15    103     118     55.5   67
cage13   16   NAT   28.6   0     46.2    46.2    0      13.46
cage13   16   NAT   9      9     30      39      15.6   35
cage13   16   RCM   14.4   0     36.9    36.9    0      9.33
cage13   16   RCM   8.4    3.6   31.7    35.3    4.3    28
cage13   16   MET   47     0     72.7    72.7    0      9.2
cage13   16   MET   21     10    52      62      14.7   52
cage14   16   NAT   OOC    0     OOC     OOC     0      OOC
cage14   16   NAT   67     25    130     155     ∞      44
cage14   16   RCM   494    0     720     720     0      8
cage14   16   RCM   152    31    154     185     74.3   18
cage14   16   MET   547    0     723     723     0      9.9
cage14   16   MET   367    36    464     500     30.8   36
cage14   32   NAT   183    0     256     256     0      4.3
cage14   32   NAT   41     17.3  120.7   138     46     20
cage14   32   RCM   33     0     108     108     0      4.17
cage14   32   RCM   14     10    89      99      8.3    11.2
cage14   32   MET   308    0     427     427     0      5.8
cage14   32   MET   119    22.7  321.3   344     19.4   22

6 Acknowledgments

This work has been supported by the French National Research Agency (ANR). Experiments presented in this paper were carried out using the Grid'5000 experimental testbed, an initiative from the French Ministry of Research through the ACI GRID incentive action, INRIA, CNRS, RENATER and other contributing partners (see https://www.grid5000.fr).




Tab. 4 – Results obtained on a local cluster and on two distant clusters: the number of iterations iter of the LB algorithm, the LB computing time $T_{LB}$, the maximum measured factorization computing time $T^{mesu}_{max}$, the global computing time $T_{glob}$ and the load balancing criterion $\Delta$

Matrix   Processors   Partition   iter   TLB (s)   Tmax^mesu (s)   Tglob (s)   ∆ (%)
1 local cluster
cage15   128   NAT   0    0     249   612    1
cage15   128   NAT   5    26    68    460    7
cage15   128   NAT   10   50    32    449    11
cage15   128   NAT   15   73    31    455    15
cage15   128   NAT   20   96    22    487    15
2 distant clusters
cage15   128   NAT   0    0     252   1431   1
cage15   128   NAT   5    32    70    1108   7
cage15   128   NAT   10   59    31    1002   11
cage15   128   NAT   15   89    32    1004   15
cage15   128   NAT   20   118   24    1084   15

Fig. 12 – The evolution of the load balancing computing time $T_{LB}$ on a local cluster and on two distant clusters

Fig. 13 – The evolution of the global computing time $T_{glob}$ on a local cluster and on two distant clusters

References

[1] J. M. Bahi, J.-C. Miellou, K. Rhofir, Asynchronous multisplitting methods for nonlinear fixed point problems, Numerical Algorithms 15 (3,4) (1997) 315–345.
[2] J. R. O'Leary, R. E. White, Multi-splittings of matrices and parallel solution of linear systems, SIAM J. Algebra Discrete Methods 6 (1985) 630–640.
[3] J. M. Bahi, S. Contassot-Vivier, R. Couturier, Evaluation of the asynchronous iterative algorithms in the context of distant heterogeneous clusters, Parallel Computing 31 (5) (2005) 439–461.
[4] J. M. Bahi, S. Contassot-Vivier, R. Couturier, F. Vernier, A decentralized convergence detection algorithm for asynchronous parallel iterative algorithms, IEEE Transactions on Parallel and Distributed Systems 1 (2005) 4–13.
[5] D. P. Bertsekas, J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice Hall, Englewood Cliffs NJ, 1989.
[6] C. Denis, J. Boufflet, P. Breitkopf, A load balancing method for a parallel application based on a domain decomposition, in: IPDPS, International Parallel and Distributed Processing Symposium, 2005.
[7] C. Denis, J. Boufflet, P. Breitkopf, M. Vayssade, B. Glut, Load balancing issues for a multiple front method, in: International Conference on Computational Science, 2004.
[8] J. Bahi, R. Couturier, Parallelization of direct algorithms using multisplitting methods in grid environments, in: IPDPS 2005, IEEE Computer Society Press, 2005, pp. 254b, 8 pages.
[9] P. Amestoy, I. Duff, J. Koster, J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal of Matrix Analysis and Applications 23 (1) (2001) 15–41.
[10] R. Bolze, F. Cappello, E. Caron, M. Daydé, F. Desprez, E. Jeannot, Y. Jégou, S. Lanteri, J. Leduc, N. Melab, G. Mornet, R. Namyst, P. Primet, B. Quetier, O. Richard, E.-G. Talbi, I. Touche, Grid'5000: A large scale and highly reconfigurable experimental grid testbed, International Journal of High Performance Computing Applications 20 (4) (2006) 481–494.
[11] P. R. Amestoy, T. A. Davis, I. S. Duff, An approximate minimum degree ordering algorithm, SIMAX 17 (4) (1996) 886–905.
[12] G. Karypis, V. Kumar, Multilevel k-way partitioning scheme for irregular graphs, Journal of Parallel and Distributed Computing 48 (1998) 96–129.
[13] E. Cuthill, J. McKee, Reducing the bandwidth of sparse symmetric matrices, in: Proc. 24th Nat. Conf. ACM, 1969.
[14] T. Davis, University of Florida sparse matrix collection, NA Digest.
[15] J. A. Scott, Parallel frontal solvers for large sparse linear systems, ACM Trans. Math. Softw. 29 (4).

Laboratoire d'Informatique de l'université de Franche-Comté
UFR Sciences et Techniques, 16, route de Gray - 25030 Besançon Cedex (France)
LIFC - Antenne de Belfort : IUT Belfort-Montbéliard, rue Engel Gros, BP 527 - 90016 Belfort Cedex (France)
LIFC - Antenne de Montbéliard : UFR STGI, Pôle universitaire du Pays de Montbéliard - 25200 Montbéliard Cedex (France)

http://lifc.univ-fcomte.fr