DNS CODE IMPLEMENTATIONS ON HIGH PERFORMANCE PARALLEL COMPUTERS

B. TROFF, G. COUSSEMENT, J. RYAN, P. SAGAUT
ONERA, 29 av. de la Division Leclerc, 92322 Châtillon Cedex, FRANCE

H. YAMAZAKI, M. YOSHIDA, T. IWAMIYA
NAL, 7-44-1 Jindaiji-higashimachi, Chofu city, Tokyo 182, JAPAN
Abstract. This paper presents a cooperative work between the French National Establishment for Aerospace Research (ONERA) and the Japanese National Aerospace Laboratory (NAL) to investigate the potential of distributed environments for high performance computing in CFD. For this purpose the PEGASE Navier-Stokes solver, developed at ONERA, was implemented on distributed processor environments with two parallelization strategies: one based on domain partitioning with message passing and another based on loop-parallelization with compiler directives. The results obtained on different supercomputers such as the Cray, the Paragon and the Numerical Wind Tunnel at NAL are shown.
1 Introduction
The solution of the three-dimensional time-dependent Navier-Stokes equations by DNS approaches is of crucial importance in fluid dynamics to represent the complete physical phenomena, but involves huge amounts of time-dependent data. The most powerful computers, both existing and announced, are based on parallel architectures. With the advent of such high performance parallel computers, the range of Reynolds numbers and configuration complexity that can be studied with this method increases. This progress is not the concern of a single country and, to be accelerated, needs a strengthening of the interaction between countries. A cooperative work between two national aerospace research centers, the Japanese National Aerospace Laboratory (NAL) and the French National Establishment for Aerospace Research (ONERA), has been conducted to investigate the potential of distributed environments for high performance Computational Fluid Dynamics (CFD). For this purpose a flow code was implemented on distributed memory environments using various parallelization strategies. The simulation code used is the PEGASE Navier-Stokes solver, developed at ONERA for the direct simulation of turbulent incompressible flows. Initially implemented on a Cray vector computer, this code was parallelized on various supercomputers with distributed memory systems such as the Paragon and the NAL Numerical Wind Tunnel (NWT). Two parallelization strategies are presented and compared for three-dimensional calculations. One is based on a domain partitioning strategy coupled with a message passing protocol to ensure communication between processors. The other is based on compiler directives added to the code to ensure a loop-parallelization; the latter was performed on the NAL NWT with Fujitsu Fortran compiler directives.
2 PEGASE Code
2.1 Basic Numerical Method
The PEGASE code, developed at ONERA, is used for the direct simulation of turbulent incompressible flows. The basic equations of fluid motion are the unsteady incompressible three-dimensional Navier-Stokes equations. A hybrid conservative finite-difference/finite-element scheme is implemented to solve these equations. Using the velocity-pressure variables on a non-staggered grid system, the solution is obtained with an approximate projection method based on the resolution of the Pressure Poisson Equation (PPE formulation) with a Bi-CGSTAB solver [10]. Consistent boundary conditions [4], derived from the momentum equations, are used to solve the latter equation.
The spatial scheme is derived from the finite element discretization using the Galerkin method with piecewise trilinear polynomial basis functions defined on hexahedral elements. It is applied to the pressure gradient term and to the nonlinear convection term, as in the so-called Group Finite Element Method [3]. It ensures a strong coupling between the spatial directions, inhibiting the development of spurious modes during long-time integration [8, 9]. Time cycles are carried out with a first-order forward Euler scheme. Initial and boundary conditions (Dirichlet, Neumann, or periodic) are associated with these equations. The numerical method ensures good numerical properties such as stability, consistency and accuracy, and is optimized for non-uniform Cartesian meshes. Some properties of the scheme are discussed and benchmark validation tests are detailed in Lê et al. [5].
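For clarity, a schematic of one time step of this general projection/PPE approach is recalled below. This is a continuous-level sketch only; the discrete operators, consistent boundary conditions and Bi-CGSTAB solution procedure actually used in PEGASE are detailed in [4, 5, 10] and differ in detail from this outline.

$$
\begin{aligned}
\mathbf{u}^{*} &= \mathbf{u}^{n} + \Delta t \left[ -(\mathbf{u}^{n}\cdot\nabla)\mathbf{u}^{n} + \nu\,\nabla^{2}\mathbf{u}^{n} \right],\\
\nabla^{2} p^{n+1} &= \frac{1}{\Delta t}\,\nabla\cdot\mathbf{u}^{*} \quad \text{(Pressure Poisson Equation)},\\
\mathbf{u}^{n+1} &= \mathbf{u}^{*} - \Delta t\,\nabla p^{n+1}.
\end{aligned}
$$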
2.2 PEGASE on a MIMD Machine
Two strategies were chosen for the parallelization of three-dimensional calculations: one, developed at ONERA [8] on the Paragon, based on a domain partitioning strategy with non-overlapping pencils over a two-dimensional array of processors; the other, developed at NAL [2] on NWT, based on Fujitsu compiler directives added to the code to ensure a loop-parallelization.
2.2.1 Introduction to NAL NWT
The Numerical Wind Tunnel (NWT) is a parallel computer consisting of 166 processor elements (PEs) connected by a cross-bar network. Each PE is a vector processor with pipelines of multiplicity 8; the add, multiply, load and store pipelines can be operated simultaneously. The capacity of the vector registers is 128 KB. A scalar instruction is a Long Instruction Word (LIW) which issues 3 scalar instructions or 2 floating point instructions at once. Each PE has a main memory of 256 MB, except for 4 PEs with 1 GB. The cross-bar network connects any pair of PEs exclusively, without any influence from the other PEs. The total peak performance is 280 GFLOPS and the total capacity of the main memory is 44.5 GB. The IEEE Gordon Bell prizes in 1994, 1995 and 1996 were awarded to researchers using NWT [6, 7, 11].
2.2.2 Domain partitioning version
A message passing version (referred to as the MP version below) was developed at ONERA to run on MIMD computers. A two-dimensional non-overlapping block decomposition was implemented. This domain partitioning strategy requires communication between processors to handle the continuity between subdomains. This is performed with standard message passing libraries such as NX (on the Paragon), PVM or MPI. An advantage of this strategy is its portability, since these communication libraries are available on a wide range of computers. This is the case of PVM, which is implemented on the Paragon, the IBM SP2, clusters of workstations [1], the Cray and the NAL NWT. The computation of spatial derivatives (a second-order mixed finite-difference/finite-element 27-point scheme) was written with special care in the treatment of block border points and split into three sequential steps, corresponding to the three directions (East-West, North-South, Top-Bottom) of the 3-D block grid:

1. For border points: contributions (between blocks) to the derivatives of points located in East-West neighbor blocks are first computed and sent asynchronously (by asynchronous, one means that the send and receive calls return immediately, allowing the program to do further work while the message is processed); a third of the derivatives at inner points are then computed; then the border point derivatives are computed, the partial contributions having also been read asynchronously (a schematic sketch of such an exchange is given after this list). The same is done in the North-South direction and then in the Top-Bottom direction.
Figure 1: Computation of derivatives at center border points (left) and at corner points (right)

2. For center border points (at a point represented by a black square in Figure 1), derivatives are computed in the following way (points in the second direction of space, which is left uncut, are not shown): contribution c2 is sent to domain A; the computations involving points in domain B are made; contribution c1 is then received and added.

3. For corner points (see Figure 1), involving four different domains, the computation is done in several steps. First, contribution c2 is sent to domain A while contribution c5 to the derivatives of points in domain D is sent; contribution c3 is included in c5. In domain D, contribution c5 is received and added to contribution c4, which is then sent to domain B. In domain B, the contributions from domain A are received and added to the computations involving points in the domain; contribution c22 is sent to domain D, and contribution c4 is read and added.
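A minimal sketch of the asynchronous East-West exchange of border contributions is given below. It is written with standard MPI non-blocking calls as a stand-in for the NX/PVM primitives of the actual code; the array shape, neighbor ranks and packing are hypothetical placeholders.

subroutine exchange_east_west(f, nx, ny, nz, east, west, comm)
  ! Sketch of the asynchronous border exchange in one direction
  ! (MPI stand-in, not the original NX/PVM code).
  use mpi
  implicit none
  integer, intent(in) :: nx, ny, nz, east, west, comm
  real(8), intent(inout) :: f(nx, ny, nz)
  real(8) :: send_e(ny*nz), send_w(ny*nz), recv_e(ny*nz), recv_w(ny*nz)
  integer :: req(4), ierr

  ! Pack the border contributions destined for the East and West neighbors.
  send_e = reshape(f(nx, :, :), (/ ny*nz /))
  send_w = reshape(f(1,  :, :), (/ ny*nz /))

  ! Asynchronous sends/receives: the calls return immediately, so the
  ! derivatives at inner points can be computed while messages are in flight.
  call MPI_Irecv(recv_e, ny*nz, MPI_DOUBLE_PRECISION, east, 1, comm, req(1), ierr)
  call MPI_Irecv(recv_w, ny*nz, MPI_DOUBLE_PRECISION, west, 2, comm, req(2), ierr)
  call MPI_Isend(send_e, ny*nz, MPI_DOUBLE_PRECISION, east, 2, comm, req(3), ierr)
  call MPI_Isend(send_w, ny*nz, MPI_DOUBLE_PRECISION, west, 1, comm, req(4), ierr)

  ! ... compute the inner-point part of the derivatives here ...

  ! Wait for the partial contributions, then finish the border points.
  call MPI_Waitall(4, req, MPI_STATUSES_IGNORE, ierr)
  f(nx, :, :) = f(nx, :, :) + reshape(recv_e, (/ ny, nz /))
  f(1,  :, :) = f(1,  :, :) + reshape(recv_w, (/ ny, nz /))
end subroutine exchange_east_west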
Figure 2: Iso-enstrophy surface of the Kelvin-Helmholtz rolling
2.2.3 Loop-parallelization version
The sequential version of PEGASE was also parallelized on the basis of a one-dimensional loop-parallelization with NWT Fortran compiler directives (the LP version). For this, the DO loop enclosed by the SPREAD DO and END SPREAD directives operates on the partitioned local array assigned to each PE. When dividing the domain and assigning the subdomains to PEs, the domain should be divided along the outermost index in order to get longer vector lengths and to access contiguous memory addresses: loads and stores from/to contiguous addresses are faster than those from/to strided addresses. For efficient references to array data with indices such as i±1, we employed the overlapping partitioning supported by NWT FORTRAN. The overlapping part of the index range is called a wing. The WINGFIX directive specifies the replacement of the wing values with those of the corresponding index range on another PE.
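The structure of a loop-parallelized kernel in the LP version can be sketched as follows. The directive lines are written here as comments because the exact NWT FORTRAN spelling of SPREAD DO, END SPREAD and WINGFIX is not reproduced; the array names and the stencil are hypothetical.

! Sketch of an LP-version kernel: the outermost (k) loop is partitioned
! over the PEs by the SPREAD DO / END SPREAD directives, and the "wing"
! (overlap) cells of u are refreshed by WINGFIX so that references to
! k-1 and k+1 remain local to each PE.
!
!   <WINGFIX directive: refresh the wing cells of u from the neighboring PE>
!   <SPREAD DO directive: partition the following k loop over the PEs>
      do k = 2, nz - 1
        do j = 2, ny - 1
          do i = 2, nx - 1
            lap(i,j,k) = (u(i+1,j,k) - 2.0d0*u(i,j,k) + u(i-1,j,k)) / dx**2 &
                       + (u(i,j+1,k) - 2.0d0*u(i,j,k) + u(i,j-1,k)) / dy**2 &
                       + (u(i,j,k+1) - 2.0d0*u(i,j,k) + u(i,j,k-1)) / dz**2
          end do
        end do
      end do
!   <END SPREAD directive>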
3 Results
3.1 Mixing layer test case
For the mixing layer simulation [9] used in the performance measurements, periodic boundary conditions are prescribed in the streamwise and spanwise directions and a non-reflecting condition is used on the top and bottom boundaries. The initial condition is a hyperbolic tangent velocity field with a superimposed 3-D perturbation. An illustration of the flow structure is given in Figure 2, which shows an instantaneous 3-D enstrophy field.
3.2 Performances
The aim of the performance measurement is to investigate the potential of distributed environments for large-scale CFD problems, either to reduce the response time for a given problem or to extend the problem size that can be solved. For this, two different sets of tests have been conducted: one to assess the speed-up performance (the capability, for a fixed problem size, to reduce the computing restitution time with respect to the number of processors); the other to assess the scalability performance (the capability to sustain a constant restitution time for a problem size increased proportionally to the number of processors used). For most of the results, the rectangular computational domain used in the mixing layer test case is a non-uniform Cartesian grid with 70 × 42 × 85 points in the i, j, k directions respectively (fixed size for speed-up, per-processor size for scalability). For some results, a 127 × 63 × 256 grid is used for speed-up cases in order to reduce the impact of either under-loading or poor vector performance due to loop shortening. Table 1 gives the restitution times and time ratios obtained with the code on some of the computers used in single-processor mode (reference: Cray YMP) for the 250,000-point case.

Table 1: Comparison of restitution time obtained on one PE

  Machine        Cray YMP   NWT    Paragon
  Time (in s)    97.1       14.7   1153
  Time ratio     1          0.15   12
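One common way to formalize these two measurements (our notation, not taken from the original), writing T(N, P) for the restitution time of an N-point problem on P PEs, is:

$$
S(P) = \frac{T(N,1)}{T(N,P)} \quad \text{(speed-up, fixed total size } N\text{)}, \qquad
\frac{T(P N,\, P)}{T(N,1)} \approx 1 \quad \text{(scalability, size grown with } P\text{)}.
$$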
3.3 Paragon results
Figure 3(a) shows the speed-up obtained on the Paragon up to 64 PEs using the MP version with the NX communication library in the case of a 250 000-point mesh. For more than 32 PEs, the communication overhead time (proportional to the square root of the number of PEs) becomes the main cause of the global loss in efficiency. This is due to the insufficient workload on each PE, as can be seen in Figure 3(b), where the speed-up is plotted in the case of a 2-million-point mesh: the deficiency in performance when increasing the number of PEs no longer appears up to 64 PEs. Figure 3(c) gives the results of the scalability test. The influence of communication on computation is negligible and the discrepancies in time are about 1%.
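One way to see this scaling (our reading; the paper does not spell it out): for a two-dimensional decomposition of an n × n × n grid over P PEs, the per-PE computation shrinks like 1/P while the exchanged face data shrink only like 1/√P, so the communication-to-computation ratio grows like √P:

$$
\frac{t_{\mathrm{comm}}}{t_{\mathrm{comp}}} \;\propto\; \frac{n^{2}/\sqrt{P}}{\,n^{3}/P\,} \;=\; \frac{\sqrt{P}}{n}.
$$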
Figure 3: MP version performances on the Paragon (time in seconds versus number of PEs). (a) Speed-up for 250 000 points (full problem size 70 × 42 × 85); (b) speed-up for 2 million points (full problem size 127 × 63 × 256); (c) scalability (subdomain size 70 × 42 × 85).
3.4 NWT results

3.4.1 MP version
The MP version was also implemented on NWT using the PVM communication library. The first results showed a deficient behavior due to global communications, see Figure 4. This behavior was not seen on the Paragon because of the poor performance of its PEs.
Figure 4: Basic MP version performances on NWT; speed-up and scalability

Two improvements have been made to increase these performances: on the one hand the global functions have been improved, and on the other hand the most time-consuming module (named Lapuloc) has been tuned. In the global communication part of the results shown before, each PE sends its data to all PEs and receives data from all PEs. With this crude algorithm, the communication time is proportional to the number of PEs. To obtain the global sum and maximum of the data of all PEs, it is not necessary for each PE to communicate with all the others. The communication time becomes proportional to log2(number of PEs) with the following minimal spanning tree method: first gather all data to the root through a minimal spanning tree (at each stage computing locally the components of the global operation), then broadcast the final result to all nodes. This algorithm is valid for any number of nodes, but with this method some PEs, after sending their own data, remain idle for a long time before receiving the global data. The NWT cross-bar network allows a PE to send to and receive from another PE directly and simultaneously, without disturbance from other PEs' communications. Hence it is more suitable to make PE groups, compute partial sums or maxima, and swap them between groups, as schematically shown in Figure 5 (a sketch is given at the end of this subsection). This ensures a faster simultaneous global communication, also proportional to log2(number of PEs), with the restriction that the number of PEs must be a power of two. We then tuned the Lapuloc subroutine, in the calculation part of the PEGASE program, to make it suitable for the NWT vector performance. The original Lapuloc subroutine accounts for more than half of the calculation time; when the number of PEs increases in the speed-up case, the Lapuloc time does not decrease much, in spite of the reduced amount of calculation, because of vector performance degradation. The results obtained after the global operation and vectorization optimizations are plotted in Figure 6.

Figure 5: Global communication: (left) ring communication; (center) gather and broadcast; (right) simultaneous communication.

Figure 6: Improved MP version performances on NWT; speed-up and scalability
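A minimal sketch of the simultaneous pairwise global sum is given below, using MPI_Sendrecv as a stand-in for the PVM/NWT primitives of the actual code; it assumes, as stated above, that the number of PEs is a power of two.

subroutine global_sum(value, comm)
  ! Recursive-doubling global sum (sketch): at stage s each PE swaps its
  ! partial sum with the partner whose rank differs in bit s, so the full
  ! sum is known on every PE after log2(nproc) simultaneous exchanges.
  use mpi
  implicit none
  real(8), intent(inout) :: value
  integer, intent(in) :: comm
  real(8) :: partner_value
  integer :: rank, nproc, dist, partner, ierr

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nproc, ierr)     ! assumed to be a power of two
  dist = 1
  do while (dist < nproc)
    partner = ieor(rank, dist)              ! partner PE in the paired groups
    ! Send our partial sum and receive the partner's one simultaneously.
    call MPI_Sendrecv(value, 1, MPI_DOUBLE_PRECISION, partner, 0, &
                      partner_value, 1, MPI_DOUBLE_PRECISION, partner, 0, &
                      comm, MPI_STATUS_IGNORE, ierr)
    value = value + partner_value
    dist = dist * 2
  end do
end subroutine global_sum

For a global maximum, the addition is simply replaced by value = max(value, partner_value).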
3.4.2 LP version
We parallelized the PEGASE program by adding the NWT compiler directives to the sequential single-PE program. The results obtained with the LP version are presented below. Figures 7 and 8 show the speed-up obtained with the 250 000-point case and the 2 000 000-point case respectively, and Figure 9 presents the scalability results. For scalability, this loop-parallelization strategy presents a behavior similar to that of the domain partitioning strategy with PVM; however, a better global efficiency is reached with loop-parallelization. For speed-up, the loop-parallelization in the k direction (85 points) cannot ensure good load balancing for arbitrary PE numbers, which leads to a fast degradation of performance when the load is not equivalent on each processor (a worked illustration is given after the tables). To enable better load balancing with 2^N processors and better vector performance, a 127 × 63 × 256 grid was used. The following tables summarize the performances obtained with the LP version.

Speed-up:
  NWT PE number   1 PE      16 PE
  Grid size       250 000   250 000
  Time (in min)   90        8.3
  Time ratio      1         1/10.8

Scalability:
  NWT PE number   1 PE      64 PE
  Grid size       250 000   16 000 000
  Time (in min)   15        18
  Time ratio      1         1.2
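A worked illustration of the load-balancing constraint (numbers ours, assuming the k planes are distributed as evenly as possible over the PEs):

$$
k = 85,\ P = 64:\ \left\lceil \tfrac{85}{64} \right\rceil = 2 \ \Rightarrow\ S_{\max} = \tfrac{85}{2} = 42.5,
\qquad
k = 256,\ P = 64:\ \tfrac{256}{64} = 4 \ \Rightarrow\ S_{\max} = 64.
$$

With 85 planes, 21 PEs hold two planes and 43 hold one, so the most loaded PEs limit the achievable speed-up, whereas 256 = 2^8 divides evenly for any power-of-two PE count up to 256.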
3.5 Concluding remarks
Results obtained both with loop-parallelization and domain partitioning are presented in Figures 7, 8 and 9. With a similar speed-up behavior, the LP version presents better performances than the MP version. For distributed memory systems with fast processors, such as the NAL NWT, communication deficiencies in the parallelization process are clearly brought to the fore, in contrast with slow-processor machines such as the Paragon, where good parallel performances may still be obtained with poor parallel communication settings. The use of these MIMD facilities requires modifying the existing codes and designing new algorithms for their parallel execution. It appears that the human and algorithmic efforts for these modifications are relatively important with a domain partitioning strategy, which, in return, offers high portability across different computers thanks to the standardization of communication protocols. With loop-parallelization directives, the human investment is more limited but, because the compiler directives are machine specific, portability is low (a tentative route to portability being offered by High Performance Fortran or FORGE).
4 Conclusion
A collaborative work between ONERA and NAL has been conducted to investigate the potential of distributed environments for high performance computing in CFD. For this purpose the ONERA benchmark PEGASE code was implemented on MIMD facilities such as the NAL NWT. The results obtained with different parallelization strategies, namely domain partitioning with a message passing protocol and loop-parallelization with compiler directives, showed the potential of these two approaches. The present work has shown that high performance parallel computers with fast processor elements, such as NWT, should contribute in the near future to progress, via DNS or LES, in the understanding and prediction of fundamental and applied aerospace fluid dynamics.
Figure 7: Comparison of the MP and LP versions: speed-up, problem size 70 × 42 × 85

Figure 8: Comparison of the MP and LP versions: speed-up, problem size 127 × 63 × 256

Figure 9: Comparison of the MP and LP versions: scalability, subdomain size 70 × 42 × 85
References
[1] G. Coussement, Parallelization of a mesh optimization code on a RS/6000 cluster. Proceedings of the Share Europe Spring Meeting '93, Hamburg, Germany, April 19-22, 1993.
[2] G. Coussement, B. Troff, J. Ryan, H. Yamazaki, M. Yoshida, T. Iwamiya, High performance computing for direct turbulence simulation in computational fluid dynamics. High Performance Computing, IEEE Computer Society Press, 1997, 168-177.
[3] C.A.J. Fletcher, Computational Techniques for Fluid Dynamics. Springer Verlag, 1988, 355-360.
[4] P.M. Gresho, R.L. Sani, On pressure boundary conditions for the incompressible Navier-Stokes equations. Int. J. Num. Methods Fluids, Vol. 7, 1987, 1111-1145.
[5] T.H. Lê, B. Troff, P. Sagaut, K. Dang-Tran, T.P. Loc, PEGASE: a Navier-Stokes solver for direct numerical simulation of incompressible flows. Int. J. Numer. Fluids, Vol. 24, 1997, 833-861.
[6] H. Miyoshi, M. Fukuda, T. Takamura, M. Kishimoto et al., Development and achievement of NAL Numerical Wind Tunnel (NWT) for CFD computations. Proceedings of Supercomputing '94, Washington D.C. (USA), November 13-18, 1994.
[7] T. Nakamura, T. Iwamiya, M. Yoshida, Y. Matsuo, M. Fukuda, Simulation of the three-dimensional cascade flow with Numerical Wind Tunnel (NWT). Proceedings of Supercomputing '96, Pittsburgh (USA), November 17-22, 1996.
[8] J. Ryan, P. Leoncini, U. Berrino, B. Troff, Direct simulation and graphics post-processing of three-dimensional turbulent flows. AGARD Conference Proceedings 551, 1994, 37/1-37/6.
[9] J. Ryan, P. Sagaut, B. Troff, P. Cambon, PEGASE: a parallel Navier-Stokes solver applied to a rotating mixing layer. Parallel CFD '95, San Francisco, CA (USA), June 26-28, 1995.
[10] H. Van der Vorst, Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., Vol. 13, 1992, 631-644.
[11] M. Yoshida, M. Fukuda, T. Nakamura, A. Nakamura, S. Hioki, Quantum chromodynamics simulation on NWT. Proceedings of Supercomputing '95, San Diego (USA), December 4-8, 1995.