
Numerical Algorithms and Software Tools for Efficient Meta-Computing

M. Garbey (a), M. Hess (b), Ph. Piras (a,b), M. Resch (b), D. Tromeur-Dervout (a)

(a) Center for the Development of Scientific Parallel Computing, CDCSP/ISTIL, University Lyon 1, 69622 Villeurbanne cedex, France
(b) High Performance Computing Center Stuttgart (HLRS), Allmandring 30, D-70550 Stuttgart, Germany

This paper reports on the development of a numerical Aitken-Schwarz domain decomposition method, associated with the PACX-MPI message passing library, to perform the computation of a Navier-Stokes model in velocity-vorticity formulation in a metacomputing framework.

1. Introduction

The development of cost-affordable parallel computer hardware has evolved towards less and less integration of the components (CPU, cache, memory) in a single location. From the shared memory architectures of the early 80s (Cray X-MP), architectures moved to distributed memory in the 90s (Intel Paragon, Cray T3E), followed at the end of the 90s by hybrid hardware architectures, namely clusters built of shared memory systems linked by a dedicated communication network (e.g. the DEC Alpha architecture with Memory Channel). One can further extend the concept by considering metacomputing environments, where the architecture is a constellation of supercomputers linked through national or international communication networks. Indeed, large scale computing on a network of parallel computers seems to be mature enough, from the computer science point of view, to allow experiments on real simulations.

Nevertheless, the old but persistent problem of designing efficient parallel programs for such architectures remains: to reduce, or to relax, the time needed to access distant data. This constraint is quite strong, because there are at least two orders of magnitude between the communication bandwidth inside a parallel computer and the bandwidth of the network linking distant parallel computers. One can argue that improvements of wide area networks will overcome this constraint: dedicated links of some hundred megabyte/s can already be used, and some research projects expect a dedicated bandwidth of some gigabyte/s [1]. But, as the final objective is to run real applications involving hundreds of processors in a non-dedicated environment, the communication bandwidth will typically be shared, and thus reduced; the gap between intercommunication and intracommunication bandwidth will therefore persist.

∗ The work of CDCSP was supported by the Région Rhône-Alpes.

Obviously, this kind of computation, which requires huge computational and network resources, will be useful for time-critical applications where the simulation is conducted to provide data as a basis for making the right decisions. Examples of such critical applications are the concentration of human forces and sea-barrier resources on the seaboard in case of oil pollution, the computation of risks in order to choose burial sites for nuclear waste, or the determination of contaminated areas in case of deadly air pollution in order to protect the population.

This paper reports some first findings on how to realize this kind of metacomputing application, the problems encountered, and the user environments that need to be developed. Efficient metacomputing needs the development of communication tools as well as of numerical algorithms, in order to take advantage of the hardware of each MPP and to minimize the number and the volume of communications on the slow network. As an example, we address the challenge of metacomputing with two distant parallel computers linked by a slow, affordable network providing a bandwidth of about 2-10 Mbit/s, running the numerical approximation of a Navier-Stokes model. By solving an elliptic problem, we show that efficient metacomputing requires new parallel numerical algorithms designed for such multi-cluster architectures. On top of the local parallel algorithms within each cluster, we develop new robust parallel algorithms that work with several clusters linked by a slow communication network. We use a two-level domain decomposition algorithm based on an Aitken or Steffensen acceleration procedure combined with the Schwarz method for the outer loop, and a standard parallel domain decomposition for the inner loop [7].

2. Coupling tools for metacomputing

As far as we know, several tools exist to couple applications on different hardware architectures [2,3]. Most applications run one physical model on each architecture and exchange the needed fields between the applications. Several coupling software packages have been developed based on the Corba technology [4]. Although this technology has been successful in industrial applications for integrating several components of the industrial process, we do not use this approach, for the following reasons. First, we are interested here in running one single application and distributing it over several distant computers. The Corba technology uses a quite complex and time consuming security protocol that is not essential when doing parallel computation on a single scientific application, where both the data types and the communication patterns are quite regular, even on irregular meshes. Second, this approach requires the development of a communication layer on top of the original code, which can be quite difficult to program. Many of the people who deal with large-scale computation, such as mechanical engineers, applied mathematicians and physicists, know procedural languages like C or Fortran and communication libraries like PVM or MPI quite well; however, they are typically not so familiar with the object-oriented concepts used in Corba. Although coupling tools are available to couple applications, some numerical developments are still necessary in order to obtain good performance in a metacomputing environment. For this purpose we develop algorithmic schemes, called C(p, q, j), to solve coupled systems of PDEs [9].

In the case of a single application, on which we focus here, software such as LSF allows one to manage an application running on several distant resources. Although these facilities to distribute processes on physically distant processors are available, it is at present not possible to specify the location of a specific process on a processor of a specific parallel computer involved in the computing pool. This capability is of leading importance in the metacomputing framework in order to relax the communication constraints of the numerical algorithm between distant processors. The user must be able to distribute processes precisely according to his needs.

The HPC Center of Stuttgart (HLRS) provides a communication library called PACX-MPI [13,10] that allows MPI applications to run on meta-computers [11,12]. PACX-MPI concentrates on extending MPI for use in an environment that couples different platforms. With this library, the communication inside each MPP relies on the highly tuned MPI version of the vendor, while communication between the two MPPs is based on the standard TCP/IP protocol. For the data exchange between the two MPPs, each side has to provide two extra nodes for the communication, one for each direction. While one node is always waiting for MPI commands from the inner nodes to be transferred to the other MPP, the other node executes the commands received from the other MPP and hands the data over to its own inner nodes. This tool is well adapted to multi-cluster parallel computer architectures.

3. An application example: elliptic solver of a Navier-Stokes code

As an example of computation, we focus on the numerical solution of a Navier-Stokes model written in velocity-vorticity (V - ω) formulation. From the numerical point of view, we have to solve six equations with 3D fields of unknowns: three equations are devoted to the transport of the vorticity and three others to the computation of the divergence-free velocity field. The computation of the three velocity equations is the most time-consuming part, because the corresponding residuals must be kept very low in order to satisfy the divergence-free constraint. Mathematically, the PDE problem is written as follows:

  \frac{\vec{\omega}^{\,n+1} - \vec{\omega}^{\,n}}{\Delta t} - \vec{\nabla} \times (\vec{V}^{n} \times \vec{\omega}^{\,n+1}) = \frac{1}{Re}\,\nabla^2 \vec{\omega}^{\,n+1},    (1)

  \nabla^2 \vec{V}^{n+1} = -\vec{\nabla} \times \vec{\omega}^{\,n}.    (2)

We use second-order finite differences in space and a second-order scheme in time. The time-marching scheme for the vorticity is the ADI scheme with the correction of Douglas [5]. The velocity is solved with a classical multigrid method, combined with a dual Schur complement domain decomposition for the coarse grid problem [6] for parallel efficiency. The 3D lid-driven cavity is the target application. The geometry under consideration is rectangular ((0, 3) × (0, 1)²), but the methodology could be extended to complex geometries with fictitious domain methods.

The goal is to solve this problem across distant parallel computers. As we already have an MPI code that treats this lid-driven cavity problem, we split the computational domain into macro domains with a variable overlap area. Each macro domain is associated with one pool of processors belonging to the same parallel computer. We apply an additive Schwarz domain decomposition to solve the velocity equations, using a multigrid solver as the inner solver on each macro domain. The intercommunication thus corresponds to the communications involved in the Schwarz algorithm, while the communications of the inner solver on each macro domain take place only between processors belonging to the same computer. We use PACX-MPI for the intercommunication, in order to use the best communication network wherever possible. An overlap area of one mesh size is numerically sufficient for the Aitken-Schwarz methodology developed below.
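The original code is written with standard message passing calls (MPI, in Fortran or C); the following Python/mpi4py fragment is only an illustrative sketch, not the authors' implementation, of how the two communication levels can be organized: an intra-cluster communicator for the multigrid inner solver, and a single point-to-point exchange of interface traces between macro domains, which is the only traffic crossing the slow link (and which PACX-MPI would route over TCP/IP). The rank layout, number of macro domains and interface size are assumptions made for the example.

```python
# Illustrative sketch only (not the authors' code): two-level communication layout
# for the Schwarz solver. Rank layout, number of macro domains and interface size
# are assumed for the example.
from mpi4py import MPI
import numpy as np

world = MPI.COMM_WORLD
rank, size = world.Get_rank(), world.Get_size()

n_macro = 2                                   # one macro domain per (distant) cluster
macro = rank * n_macro // size                # assumes contiguous ranks per cluster
intra = world.Split(color=macro, key=rank)    # communicator for the multigrid inner solver

# Interface trace exchanged between the two macro domains at each Schwarz iteration.
n_interface = 64 * 64                         # assumed Ny x Nz interface size
trace_out = np.zeros(n_interface)             # values produced by the local inner solve
trace_in = np.empty(n_interface)

if intra.Get_rank() == 0:
    # one designated process per macro domain talks over the slow inter-cluster link
    partner = size // 2 if macro == 0 else 0  # assumed rank of the other designated process
    world.Sendrecv(trace_out, dest=partner, sendtag=0,
                   recvbuf=trace_in, source=partner, recvtag=0)

# ...the multigrid inner solver then uses only `intra` for its communications...
```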

4. Aitken-Schwarz Domain Decomposition

It is well known that the Schwarz algorithm is a very slow but robust algorithm for elliptic operators. Its linear convergence rate depends only on the geometry of the domain and on the operator. We have developed a numerical procedure, called Aitken-Schwarz, to accelerate this algorithm; it is based on the linear convergence of the method and on the separability of the elliptic operator. The Aitken-Schwarz domain decomposition was first presented at the Parallel CFD 99 conference [7] and was extended to more general linear and nonlinear problems at the Domain Decomposition conference DD12 [8]. We keep the sequence of traces of a few Schwarz iterates at the artificial interfaces. As the error between these traces and the trace of the exact solution converges purely linearly, Aitken acceleration can compute the trace of the exact solution at the artificial interfaces between the subdomains of the domain decomposition.

Let us exemplify the algorithm for the velocity solver of Navier-Stokes in velocity-vorticity formulation. Our Poisson problem writes u_xx + u_yy + u_zz = f in the domain (0, 3) × (0, 1)² with homogeneous Dirichlet boundary conditions. We partition the domain into two overlapping stripes (0, a) × (0, 1)² and (b, 3) × (0, 1)² with b < a, so that the stripes overlap on (b, a) × (0, 1)². We introduce the regular discretization in the y and z directions, y_i = (i-1) h_y with h_y = 1/(N_y - 1) and z_j = (j-1) h_z with h_z = 1/(N_z - 1), and central second-order finite differences in the y and z directions. Let us denote by û_ij (respectively f̂_ij) the coefficients of the sine expansion of u (respectively f). The semi-discretized equation for each sine wave is then

  \hat{u}_{ij,xx} - \left( \frac{4}{h_y^2}\,\sin^2\!\Big(\frac{i\,h_y}{2}\Big) + \frac{4}{h_z^2}\,\sin^2\!\Big(\frac{j\,h_z}{2}\Big) \right) \hat{u}_{ij} = \hat{f}_{ij}.    (3)
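Equation (3) reduces the 3D Poisson problem to a family of independent 1D two-point boundary value problems in x, one per sine wave (i, j). The small sketch below, an illustration under assumed sizes rather than the code of the paper, solves one such 1D problem on a stripe with second-order central differences and Dirichlet data at both ends.

```python
# Sketch of the 1D problem left for one sine wave (i, j) after the y-z expansion:
# u''(x) - lam*u = fhat on (0, L) with Dirichlet data at both ends, discretized with
# second-order central differences and solved directly. Names and sizes are illustrative.
import numpy as np

def solve_wave(fhat, lam, L, left, right):
    """Solve u_xx - lam*u = fhat on (0, L), u(0)=left, u(L)=right (interior unknowns)."""
    n = fhat.size                      # number of interior grid points
    h = L / (n + 1)
    A = np.zeros((n, n))
    np.fill_diagonal(A, -2.0 / h**2 - lam)
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0 / h**2
    A[idx + 1, idx] = 1.0 / h**2
    rhs = fhat.copy()
    rhs[0] -= left / h**2              # fold the Dirichlet values into the right-hand side
    rhs[-1] -= right / h**2
    return np.linalg.solve(A, rhs)

# Example: wave (i, j) with hy = hz = 1/63 and the coefficient of equation (3);
# the right-hand Dirichlet value 0.2 stands for assumed interface data.
hy = hz = 1.0 / 63
i, j = 3, 5
lam = 4 / hy**2 * np.sin(i * hy / 2)**2 + 4 / hz**2 * np.sin(j * hz / 2)**2
u = solve_wave(np.ones(124), lam, 3.0, 0.0, 0.2)   # 124 interior points on (0, 3)
```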

The algorithm for this specific case is the following. First, we compute three iterates with the additive Schwarz algorithm. Second, we compute the sine wave expansion of the traces of these iterates on the interfaces Γi with fast Fourier transforms. Third, we compute the limit of the sequence of wave coefficients via Aitken acceleration, separately for each interface:

  \hat{u}^{\infty}_{kl|\Gamma_i} = \frac{\hat{u}^{0}_{kl|\Gamma_i}\,\hat{u}^{3}_{kl|\Gamma_i} - \hat{u}^{1}_{kl|\Gamma_i}\,\hat{u}^{2}_{kl|\Gamma_i}}{\hat{u}^{3}_{kl|\Gamma_i} - \hat{u}^{2}_{kl|\Gamma_i} - \hat{u}^{1}_{kl|\Gamma_i} + \hat{u}^{0}_{kl|\Gamma_i}}.    (4)

A last solution step in each subdomain, with the newly computed boundary values, gives the final solution. We note that a better implementation of the method can be achieved: three Schwarz iterates are needed if one takes into account the coupling (û_{kl|Γ1}, û_{kl|Γ2}) between the interfaces and accelerates (û_{kl|Γ1}, û_{kl|Γ2}) globally, and only two Schwarz iterates are necessary if one computes analytically the amplification vector of each wave û_{kl|Γi} [8].
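As an illustration of the acceleration step, the short script below applies formula (4) wave by wave to four iterates of a synthetic, purely linear interface sequence and recovers its limit up to round-off. It is a sketch of the formula, not the implementation used for the runs reported here; the number of waves and the convergence rates are made up for the test.

```python
# Sketch of the Aitken acceleration (4) on one interface, assuming the traces of the
# Schwarz iterates are already expanded in sine waves. A synthetic linearly converging
# sequence is used so that the recovered limit can be checked.
import numpy as np

nwaves = 32
rng = np.random.default_rng(0)
u_exact = rng.standard_normal(nwaves)             # "exact" interface coefficients
delta = rng.uniform(0.1, 0.9, nwaves)             # per-wave linear convergence rates

# Four trace values: u^{p+1} = delta * u^p + (1 - delta) * u_exact (purely linear sequence)
u = [rng.standard_normal(nwaves)]                 # u^0: arbitrary initial trace
for _ in range(3):
    u.append(delta * u[-1] + (1.0 - delta) * u_exact)
u0, u1, u2, u3 = u

# Aitken acceleration, formula (4), applied wave by wave
u_inf = (u0 * u3 - u1 * u2) / (u3 - u2 - u1 + u0)

print(np.max(np.abs(u_inf - u_exact)))            # close to machine precision
```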

In the lid-driven cavity problem, we adapt the Aitken-Schwarz method to solve the three elliptic velocity equations with non-homogeneous boundary conditions. In this problem there is one non-homogeneous boundary condition, for the second velocity component on the boundary z = 1/2 corresponding to the moving lid. We shift the solution by Φ, the analytical solution of

  \Delta \Phi = 0 \ \text{on}\ \Omega, \qquad \Phi|_{\partial\Omega \setminus \{z=1/2\}} = 0, \qquad \Phi|_{z=1/2} = 1,    (5)

and retrieve a numerical problem with homogeneous boundary conditions.
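The lifting (5) can be illustrated as follows: once a harmonic function Φ with value 1 on the lid and 0 on the rest of the boundary is available, subtracting it from the velocity component turns the non-homogeneous problem into one with homogeneous Dirichlet data. The sketch below builds such a Φ on a small 2D slice by plain Jacobi iteration; the grid size and iteration count are illustrative, and this is not the analytical Φ used in the paper.

```python
# Illustration of the boundary lifting (5) on a small 2D slice: Phi is harmonic, equal to
# 1 on the lid and 0 on the remaining boundary (computed here by Jacobi iteration purely
# for illustration; the paper uses an analytical Phi).
import numpy as np

n = 33                                    # illustrative grid size
phi = np.zeros((n, n))
phi[-1, :] = 1.0                          # Phi = 1 on the moving lid, 0 elsewhere

for _ in range(5000):                     # Jacobi sweeps for Laplace(Phi) = 0 (interior only)
    phi[1:-1, 1:-1] = 0.25 * (phi[:-2, 1:-1] + phi[2:, 1:-1] +
                              phi[1:-1, :-2] + phi[1:-1, 2:])

# For any u with Laplacian(u) = f, u = 1 on the lid and u = 0 elsewhere on the boundary,
# the shifted field w = u - phi satisfies Laplacian(w) = f with w = 0 on the whole boundary.
```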

5. Numerical results and performances

The target computers used for these first experiments are four AlphaServer 4100 nodes with four EV5 400 MHz processors each, gathered in a Digital Tru64 cluster named MOBY-DECK, with an internal latency of 5 µs and a bandwidth of 800 Mbit/s.

5.1. Performance of the code in a metacomputing context

First, we report on an experiment that demonstrates the need for new algorithms like Aitken-Schwarz for metacomputing. We consider 3 Laplace solves and a total of 197000 unknowns. The first two rows of Table 1 show that with a good spreading of the macro domains, that is one macro domain on each parallel computer, we can lower the communication bandwidth between the two computers without losing much efficiency. On the other hand, if we split a macro domain between two distant computers, the elapsed time increases by almost an order of magnitude (from 29.4 s to 220.7 s). As a matter of fact, the communications of the inner solver (multigrid) require much more than 10 Mbit/s to be efficient.

Second, we report on an experiment solving the three velocity components with Aitken-Schwarz, using MPICH between a SUN Enterprise 10000 located at ENS Lyon, MOBY-DECK, and a Compaq DS20 located at University of Lyon, the three computers being connected by a 2 Mbit/s non-dedicated link. The total number of unknowns in this last experiment is 288000. The processors of MOBY-DECK are 30% faster than those of the SUN Enterprise 10000 for this application. Table 2 shows that the elapsed time increases by only 7.4% when the communication bandwidth is degraded by a factor of 50.

5.2. Overhead induced by PACX on a single MPP

The drawback of MPICH is that it does not use the best MPI implementation on each machine. For example, on MOBY-DECK, MPICH uses the 100 Mbit/s (theoretical) bandwidth of the FDDI network but not the 800 Mbit/s (theoretical) bandwidth of the Memory Channel network. In this section we show that the PACX-MPI communication software allows us to overcome this difficulty; we do not consider the overall performance of the numerical method here. The experiment has been run with the full Navier-Stokes code described in Section 3.

In PACX-MPI each parallel computer is considered as a partition, and two additional processors per partition are devoted to the communication between partitions. In the following we call such communications external communications, and communications inside a partition internal communications. We timed point-to-point communications on MOBY-DECK within one partition of the machine. The overhead introduced by PACX-MPI for internal communication is around 3 microseconds of latency, and the bandwidth is reduced by about 3%. We next considered two partitions of the machine and a communication from a processor of one partition to a processor of the other partition, where an additional overhead is induced by the access to the TCP protocol. An effective communication bandwidth of 60 Mbit/s and a latency of 1.1 ms have been measured for these inter-partition communications with PACX-MPI.
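Latency and effective bandwidth figures such as these are usually obtained with a simple ping-pong test between two processes. The sketch below (in Python/mpi4py, an illustration rather than the benchmark actually used) measures the one-way time for a small and a large message; under PACX-MPI the same code measures internal or external communication depending on whether the two ranks sit in the same partition.

```python
# Minimal ping-pong sketch (illustration, not the benchmark used by the authors):
# run with exactly two processes; small messages expose the latency, large ones the
# effective bandwidth.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank
nrep = 100

for nbytes in (8, 1 << 20):
    buf = np.zeros(nbytes, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(nrep):
        if rank == 0:
            comm.Send(buf, dest=peer, tag=0)
            comm.Recv(buf, source=peer, tag=0)
        else:
            comm.Recv(buf, source=peer, tag=0)
            comm.Send(buf, dest=peer, tag=0)
    t = (MPI.Wtime() - t0) / (2 * nrep)          # one-way time per message
    if rank == 0:
        print(f"{nbytes:8d} bytes: {t * 1e6:9.1f} us  {nbytes * 8 / t / 1e6:9.2f} Mbit/s")
```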

Table 3 gives the time in seconds of the external communications and of the total communication (the sum of internal and external communications) for a domain split into two macro domains of 2 × 2 processors. The MPI row corresponds to one partition of the machine and the vendor MPI communication library with 8 processes. The PACX-MPI row corresponds to two partitions of the machine with the PACX-MPI library; in this case, 4 processes dedicated to the communication between partitions have been added to the 8 compute processes. The overhead induced by PACX on the total communication is almost entirely the overhead of the external communication; we can hardly see any difference for the internal communication. The low overhead in the resolution time between the two tests demonstrates the suitability of PACX-MPI in a metacomputing context.

Table 4 compares the communication elapsed time of the classical multigrid solver on one macro domain (without domain decomposition for the coarse grid problem) with the communication elapsed time of the Aitken-Schwarz domain decomposition method, for the solution of the three velocity equations. The first method uses the vendor MPI on one partition of the machine, the second uses the PACX-MPI communication library on two partitions of the machine with one macro domain each. Two global sizes of the computational domain, 125 × 32 × 64 and 125 × 63 × 64, have been considered. The communication time for the velocity is larger for the Aitken-Schwarz code than for the multigrid solver on one macro domain in the (125 × 63 × 64) case. The communication time of one inner solver iterate on the whole domain competes with the communication time of four inner solver iterates on the reduced domains. When the number of communications in the inner solver increases (twice as many for the (125 × 63 × 64) case as for the (125 × 32 × 64) case), Aitken-Schwarz becomes less competitive because of the four subdomain solves it requires. Nevertheless, this implementation, based on (4), is not the optimal one [8].

6. Conclusions

This paper demonstrates the need for new algorithms, associated with efficient communication software, to solve numerical PDEs in a metacomputing framework. The experiments shown still use only a few processors, but experiments using more than 1000 processors are under investigation with the Aitken-Schwarz methodology and PACX-MPI.

We must report some difficulties that occur in the metacomputing framework, in order to indicate some directions of development. The main problem encountered in this experiment is the validation of the code on different platforms, due to differences in the capabilities of the vendor MPI implementations depending on the hardware resources. For example, we developed a version of the code using buffered non-blocking communications that runs without problems on the DEC Alpha but is limited to small problem sizes on the SUN Enterprise 10000 of ENS Lyon. At the present time, our metacomputing experiments require quite heavy management of source files, data files and makefiles on several distant computational sites. The software environment of each machine (scientific libraries, compiler optimizations, development tools and batch system) requires the know-how for several architectures to be accumulated and maintained.
However, we still need to develop:
• generic compilation procedures that adapt themselves to the software and hardware environment;

• automatic management of source codes, data files and job submissions, as provided in TME [14];
• tools that give an accurate picture of the availability of the network and computer resources, notably the latency, the average and real-time values of the communication bandwidth between two distant computers, and the load of the network.

We are currently investigating the development of such tools.

REFERENCES

1. T. Eickermann, H. Grund and J. Henrichs: "Performance Issues of Distributed MPI Applications in a German Gigabit Testbed", in J. Dongarra, E. Luque, T. Margalef (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing, Lecture Notes in Computer Science, Springer, 1998, pp. 180-195.
2. I. Foster, C. Kesselman: "GLOBUS: A Metacomputing Infrastructure Toolkit", International Journal of Supercomputer Applications, 11, pp. 115-128, 1997.
3. A. Grimshaw, A. Ferrari, F. Knabe, M. Humphrey: "Legion: An Operating System for Wide-Area Computing", Technical Report CS-99-12, University of Virginia, 1999.
4. P. Beaugendre, T. Priol, G. Allon, D. Delavaux: "A Client/Server Approach for HPC Applications within a Networking Environment", in Proc. of HPCN'98, LNCS 1401, Springer, pp. 518-525, Amsterdam, The Netherlands, April 1998.
5. J. Douglas Jr.: "Alternating direction methods for three space variables", Numerische Mathematik, 4, pp. 41-63, 1962.
6. G. Edjlali, M. Garbey, D. Tromeur-Dervout: "Interoperability Parallel Programs Approach to Simulate 3D Frontal Polymerization Processes", Journal of Parallel Computing, Vol. 25, pp. 1161-1191, 1999.
7. M. Garbey, D. Tromeur-Dervout: "Operator Splitting and Domain Decomposition for Multicluster", 11th International Conference Parallel CFD99, D. Keyes, A. Ecer, N. Satofuka, P. Fox, J. Periaux (Eds.), North-Holland, ISBN 0-444-50571-7, Williamsburg, pp. 27-36, 1999.
8. M. Garbey, D. Tromeur-Dervout: "Two Level Domain Decomposition for Multiclusters", 12th International Conference on Domain Decomposition Methods DD12, T. Chan, T. Kako, H. Kawarada, O. Pironneau (Eds.), DDM.org, 2000, http://applmath.tg.chiba-u.ac.jp/dd12/proceedings/Garbey.ps.gz
9. M. Garbey, D. Tromeur-Dervout: "A Parallel Adaptive Coupling Algorithm for Systems of Differential Equations", J. Comput. Phys., Vol. 161, pp. 401-427, 2000.
10. E. Gabriel, M. Resch, T. Beisel, R. Keller: "Distributed Computing in a Heterogeneous Computing Environment", in V. Alexandrov, J. Dongarra (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, Springer, 1998, pp. 180-188.
11. M. A. Brune, G. E. Fagg, M. Resch: "Message-Passing Environments for Metacomputing", Future Generation Computer Systems, 15(5-6), pp. 699-712, 1999.
12. M. Resch, D. Rantzau, R. Stoy: "Metacomputing Experience in a Transatlantic Wide Area Application Testbed", Future Generation Computer Systems, 15(5-6), pp. 807-816, 1999.
13. T. Beisel, E. Gabriel, M. Resch: "An Extension to MPI for Distributed Computing on MPPs", in M. Bubak, J. Dongarra, J. Wasniewski (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, Springer, 1997, pp. 75-83.

14. TME (Task Mapping Editor), Japan Atomic Energy Research Institute (JAERI), JAERI-Data/Code 2000-010, Tokyo, 2000.

Table 1
Performance of the analytical Aitken-Schwarz algorithm on the intranet.

Cluster 1 (3 or 4 processors)        | Cluster 2 (2, 3 or 4 processors)    | Elapsed time (s) | Network bandwidth
4 ev5 CDCSP-MOBY                     | 4 ev5 CDCSP-MOBY                    | 28.4             | 100 Mbit/s
4 ev5 CDCSP-MOBY                     | 2 ev6 CDCSP-DS20                    | 29.4             | 10 Mbit/s
2 ev5 CDCSP-MOBY + 1 ev6 CDCSP-DS20  | 2 ev5 CDCSP-MOBY + 1 ev6 CDCSP-DS20 | 220.7            | 10 Mbit/s

Table 2
Performance of the analytical Aitken-Schwarz algorithm on the city network.

Cluster 1 (4 processors) | Cluster 2 (4 processors) | Cluster 3 (4 or 2 processors) | Elapsed time (s) | Network bandwidth
4 PSMN-SDF1              | 4 PSMN-SDF1              | 4 PSMN-SDF1                   | 28.8             | not available
4 ev5 CDCSP-MOBY         | 4 ev5 CDCSP-MOBY         | 4 ev5 CDCSP-MOBY              | 20.7             | 100 Mbit/s
4 PSMN-SDF1              | 4 ev5 CDCSP-MOBY         | 2 ev6 CDCSP-DS20              | 31.2             | 2 Mbit/s

Table 3
Overhead induced by PACX on a single MPP (communication times in seconds).

           | Vorticity                | Velocity
           | External com | Total com | External com | Total com
MPI        | 0.88         | 1.36      | 0.29         | 10.0
PACX+MPI   | 1.00         | 1.5       | 0.7          | 10.4

Table 4
Comparison of the communication elapsed times (in seconds) for the velocity problem.

Velocity                         | Macro_x (px × py) = 2 (2 × 1)        | Macro_x (px × py) = 2 (2 × 2)
                                 | (Ngx × Ngy × Ngz) = (125 × 32 × 64)  | (Ngx × Ngy × Ngz) = (125 × 63 × 64)
                                 | External com | Total com             | External com | Total com
MGM + MPI                        | -            | 4.66                  | -            | 7.1
Aitken-Schwarz + MGM + PACX-MPI  | 0.28         | 4.3                   | 0.7          | 10.4