Towards Petascale Computing with Parallel CFD codes

A.G. Sunderland, M. Ashworth, N. Li, C. Moulinec, STFC Daresbury Laboratory, Warrington, UK
Y. Fournier, J. Uribe, University of Manchester, UK

Keywords: Petascale, High-End Computing, Parallel Performance, Direct Numerical Simulations, Large Eddy Simulations

1. Introduction

Many world-leading high-end computing (HEC) facilities now offer around 50-100 Tflop/s of performance, and several initiatives have begun to look forward to Petascale computing¹ (10^15 flop/s). For example, Oak Ridge National Laboratory (ORNL) has a DoE-funded programme to install a Petascale system by the end of 2008. Computing at the Petascale raises a number of significant challenges for parallel computational fluid dynamics (CFD) codes. Most significantly, further improvements to the performance of individual processors will be limited, and therefore Petascale systems are likely to contain 100,000+ processors. A critical aspect of utilising high-end Terascale and Petascale resources is therefore the scalability of the underlying numerical methods, both the scaling of execution time with the number of processors and the scaling of time with problem size. In this paper we analyse the performance of several CFD codes for a range of datasets on some of the latest high performance computing architectures. This includes Direct Numerical Simulations (DNS) via the SBLI [1] and SENGA2 [2] codes, and Large Eddy Simulations (LES) using both STREAM-LES [3] and the general purpose open source CFD code Code_Saturne [4].

¹ Petascale here assumes tens of Petaflops peak performance and 1 Petaflop sustained performance on HEC applications.

2. Parallel CFD codes

We analyse the parallel performance of several parallel CFD codes on the target high-end computing systems. The codes have been chosen to reflect a range of applications (e.g. turbulence at the shock/boundary-layer interaction, combustion) using both DNS-based and LES-based computational methods. All codes are written in Fortran with MPI [5] for data transfer between processors. The Code_Saturne package also has modules written in the C programming language and the Python scripting language.

SBLI

Fluid flows encountered in real applications are invariably turbulent. There is, therefore, an ever-increasing need to understand turbulence and, more importantly, to be able to model turbulent flows with improved predictive capabilities. As computing technology continues to improve, it is becoming more feasible to solve the governing equations of motion, the Navier-Stokes equations, from first principles. The direct solution of the equations of motion for a fluid, however, remains a formidable task, and simulations are only possible for flows with small to modest Reynolds numbers. Within the UK, the Turbulence Consortium (UKTC) has been at the forefront of simulating turbulent flows by direct numerical simulation (DNS). UKTC has developed the parallel code SBLI to solve problems associated with shock/boundary-layer interaction. SBLI [1] is a compressible DNS code based on a finite difference method, using high-order central differencing in space and explicit Runge-Kutta time marching. A grid transformation routine enables the code to simulate flows in relatively complex geometries. The parallel version is under active development and its parallel performance has been fine-tuned. A set of test cases, some with complex geometry involving multiple Cartesian-topology blocks, has been specified for its testing and benchmarking on a range of HPC platforms.

SENGA2

The SENGA2 [2] code has been developed at the University of Cambridge and has been designed to facilitate combustion DNS with any desired level of chemistry, from single-step Arrhenius mechanisms, through all classes of reduced reaction mechanisms, up to fully detailed reaction mechanisms. The Navier-Stokes momentum equations are solved in fully compressible form together with the continuity equation and a conservation equation for the stagnation internal energy, as well as any required number of balance equations for species mass fractions. Each component of the reacting mixture is assumed to obey the equation of state for a semi-perfect gas. Boundary conditions are specified using an extended form of the Navier-Stokes Characteristic Boundary Condition formulation, and available boundary conditions include periodic as well as several types of walls, inflows and outflows. The numerical framework is based on a finite-difference approach for spatial discretisation together with a Runge-Kutta algorithm for time-stepping. High-order explicit schemes are preferred due to their speed of execution and ease of parallel implementation, and a 10th-order explicit scheme is standard for interior points. The code is fully parallel using domain decomposition over a cubic topology. Current HEC architectures permit 3D DNS of turbulent flow fields but with only limited representation of the combustion chemistry and a highly simplified representation of the geometry. At the Petascale it will be possible to move towards more complex configurations that are much closer to industrial requirements.

STREAM-LES

STREAM-LES [3] is a CFD package developed at Imperial College London for Large Eddy Simulations (LES) of incompressible flow. Its numerical framework rests on a general structured, multi-block, collocated-storage finite volume method with non-orthogonal mesh capability. The spatial scheme is second-order central and the time-marching is based on a fractional-step method in which a provisional velocity field is made divergence-free through the solution of the pressure-Poisson equation. The code is fully parallelised using MPI through standard domain decomposition and runs on several high-end computing platforms.

Code_Saturne

Code_Saturne [4] is an open source general purpose computational fluid dynamics software package developed by EDF [6]. It is based on a co-located finite volume approach that accepts meshes with any type of cell (tetrahedral, hexahedral, prismatic, pyramidal, polyhedral) and any type of grid structure (unstructured, block structured, hybrid, conforming or with hanging nodes). Its basic capabilities enable the handling of either incompressible or expandable flows, with or without heat transfer and turbulence (mixing length, 2-equation models, v2f, Reynolds stress models, Large Eddy Simulation, etc.). Dedicated modules are available for specific physics such as radiative heat transfer, combustion (e.g. gas, coal), magneto-hydrodynamics, compressible flows and two-phase flows (Euler-Lagrange approach with two-way coupling), with extensions to specific applications (e.g. the Mercure_Saturne code for the atmospheric environment).
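All of the codes above are parallelised by domain decomposition, and (as discussed for SBLI and SENGA2 in Section 4) the dominant communication pattern is the exchange of halo data between neighbouring sub-domains before each stencil or flux evaluation. The Fortran/MPI fragment below is a minimal, self-contained sketch of that pattern for a 1-D periodic decomposition; the array sizes, variable names and single ghost layer are illustrative assumptions only and do not reproduce the actual data structures of SBLI, SENGA2, STREAM-LES or Code_Saturne.

program halo_exchange_sketch
  use mpi
  implicit none
  integer, parameter :: n = 64                 ! interior points per direction (illustrative)
  double precision   :: u(0:n+1, n, n)         ! local block, one ghost plane in x
  double precision   :: sbuf(n, n), rbuf(n, n) ! contiguous pack/unpack buffers
  integer :: ierr, rank, nprocs, cart, left, right
  integer :: stat(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! 1-D periodic Cartesian decomposition in x; left/right are the neighbour ranks
  call MPI_Cart_create(MPI_COMM_WORLD, 1, (/ nprocs /), (/ .true. /), .true., cart, ierr)
  call MPI_Cart_shift(cart, 0, 1, left, right, ierr)

  u = dble(rank)                               ! dummy field data

  ! send the last interior x-plane to the right neighbour and
  ! receive the left neighbour's plane into the left ghost plane
  sbuf = u(n, :, :)
  call MPI_Sendrecv(sbuf, n*n, MPI_DOUBLE_PRECISION, right, 0, &
                    rbuf, n*n, MPI_DOUBLE_PRECISION, left,  0, cart, stat, ierr)
  u(0, :, :) = rbuf

  ! and the mirror exchange: first interior plane to the left neighbour
  sbuf = u(1, :, :)
  call MPI_Sendrecv(sbuf, n*n, MPI_DOUBLE_PRECISION, left,  1, &
                    rbuf, n*n, MPI_DOUBLE_PRECISION, right, 1, cart, stat, ierr)
  u(n+1, :, :) = rbuf

  ! ... the finite-difference stencil sweep over the interior would follow here ...

  call MPI_Finalize(ierr)
end program halo_exchange_sketch

In the production codes the decomposition is two- or three-dimensional and the high-order stencils require deeper halos, so an exchange of this kind would be repeated for each coordinate direction and halo layer.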

3. High-End Computing Platforms

HPCx

HPCx [7] is the UK's National Capability Computing service, located in the Computational Science and Engineering Department at STFC Daresbury Laboratory [8] and comprising 160 IBM eServer 575 nodes. Each eServer node contains 16 1.5 GHz POWER5 processors, giving a total of 2560 processors for the system. The total main memory of 32 GBytes per frame is shared between the 16 processors of the frame. The frames in the HPCx system are connected via IBM's High Performance Switch. The current configuration has a theoretical peak performance of 15.4 Tflop/s and is positioned at No. 101 in the Top 500 list [9].

HECToR

HECToR [10] is the UK's new high-end computing resource, located at the University of Edinburgh and run by the HPCx consortium. It is a Cray XT4 system comprising 1416 compute blades, each of which has 4 dual-core processor sockets. This amounts to a total of 11,328 cores, each of which acts as a single CPU. The processor is a 2.8 GHz AMD Opteron. Each dual-core socket shares 6 GB of memory, giving a total of 33.2 TB in all. The theoretical peak performance of the system is 59 Tflop/s, positioning the system at No. 17 in the Top 500 list. Where appropriate, comparisons are also made with the Cray XT3 machine at the Swiss National Supercomputing Centre CSCS [11], and an alternative Cray XT4 machine named 'jaguar' at ORNL [12].

BlueGene

As of November 2007, the two fastest supercomputers in the world are BlueGene/L and BlueGene/P systems with 212,992 and 65,536 processors respectively [9]. STFC Daresbury Laboratory has recently acquired single-rack BlueGene/L and BlueGene/P systems. Both systems contain 1024 chips, with 2 processor cores per chip in the L system and 4 processor cores per chip in the P system, giving totals of 2048 cores and 4096 cores respectively. Memory is provided at 512 MBytes per core in both systems. The basic processor in the L system is the PowerPC 440 running at 700 MHz, whilst the P system uses a processor from the PowerPC 450 family running at 850 MHz. Inter-processor communication takes place via two different networks: a 3-D torus for general communications and a tree network for collective communication patterns. The philosophy behind the BlueGene design is that the speed of the individual processor is traded in favour of very dense packaging and low power consumption. As a consequence of these features, the BlueGene/P at STFC Daresbury Laboratory currently resides at No. 1 in the Green Top 500 supercomputer list [13].

4. Initial Results

Initial results are presented in this section for the codes and platforms under investigation. It is intended that the performance analysis will be much more comprehensive, both in terms of multiple datasets and complete sets of performance comparisons across the alternative HEC platforms, by the time of presentation at the conference and of final paper submission.

SBLI

The benchmark is a simple turbulent channel flow using a grid size of 360 × 360 × 360, run for 100 iterations. The most important communications structure is a halo exchange between adjacent computational sub-domains. Provided the problem size is large enough to give a small surface-area-to-volume ratio for each sub-domain, the communications costs are small relative to computation and do not constitute a bottleneck, and we see almost linear scaling on all systems out to 1024 processors. The performance, shown in Figure 1, gives the IBM p5-575 an advantage over the BlueGene/L which rises from a factor of 2.79 at 256 processors to 4.13 at 1024. Cray XT4 performance is very similar to that of the IBM p5-575 up to 1280 processors, though at 1536 processors the IBM performance dips markedly; this behaviour is currently under investigation. Hardware profiling studies of this code have shown that its performance is highly dependent on cache utilisation and bandwidth to main memory [14].
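To make the surface-area-to-volume argument concrete (the estimate below is a back-of-envelope sketch under assumed decomposition parameters, not a measured SBLI communication cost), consider a cubic sub-domain holding n³ local grid points with a halo of depth h set by the finite-difference stencil. The halo exchange moves data proportional to the face area, while the stencil update touches the whole volume:

\[
  \frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} \;\propto\; \frac{6\,n^{2}\,h}{n^{3}} \;=\; \frac{6h}{n} .
\]

On 1024 processors the 360³ benchmark grid still leaves roughly 360³/1024 ≈ 4.6 × 10^4 points per sub-domain, so this ratio remains small and near-linear scaling can be expected.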

Figure 1. Performance of SBLI on up to 1536 processors of high-end systems

Figure 2. Performance of SBLI on up to 12288 processors

Figure 2 shows the parallel performance on the Cray XT4 on up to 12288 processors. The code scales very well up to 8192 processors, but thereafter a 50% increase in processor count yields only a 6.6% increase in performance. The performance of SBLI on larger problem sizes is currently being analysed at large processor counts on all the high-end systems under investigation.

STREAM-LES

The test case is a turbulent channel flow using 2,097,152 grid points, calculated for 75,000 time steps. The performance shown is in arbitrary units, relative to the time taken to complete the simulation on 128 processors of BlueGene/P. Parallel efficiency lies between around 76% and around 96% on the three platforms up to 256 processors. However, from 256 to 512 processors the parallel efficiency falls to approximately 45% (BlueGene/P), 53% (Cray XT4) and 64% (IBM POWER5). Further investigation of the parallel performance of STREAM-LES is currently being undertaken, including the analysis of larger datasets where possible.
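The efficiency figures quoted above follow the usual definition of parallel efficiency relative to a baseline run on P₀ processors (P₀ is assumed here to be the smallest processor count run on each platform, e.g. 128 on BlueGene/P):

\[
  E(P) \;=\; \frac{P_{0}\,T(P_{0})}{P\,T(P)} ,
\]

where T(P) is the time to solution on P processors. In particular, if E(P) halves when P is doubled, the time to solution is essentially unchanged.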

Figure 3. Performance of STREAM-LES on high-end systems

Code_Saturne

Figure 4. Performance of Code_Saturne on HPCx

The parallel performance of Code_Saturne on HPCx is shown in Figure 4. The test case is an LES of channel flow at a Reynolds number, based on the friction velocity, of 395. The dimensions of the box are 6.4 × 2 × 3.2 and two meshes are considered: 256 × 256 × 256 (16M cells) and 1024 × 256 × 256 (67M cells). Periodic boundary conditions apply in the streamwise and spanwise directions, with no-slip conditions in the wall-normal direction. Van Driest damping is used at the walls and the flow is driven by a constant pressure gradient. The parallel scaling for both datasets is excellent, and the code will soon be ported to BlueGene and HECToR for further performance comparisons on higher processor counts.

SENGA2

The two datasets examined here are outlined below.

1. A 4-step calculation, which is an early example of a reduced methane-air mechanism due to Peters and Williams. It uses 4 forward steps (two of them reversible) and 7 species, and is notoriously stiff; it is included to test the ability of the code to cope with stiff chemistry. The benchmark calculation is undertaken with each processor holding 32³ grid points, so the global grid size expands linearly with the number of processors used; this case is used to assess the weak scaling properties of the code.

2. A 1-step calculation, using a simple generic global non-reversible Arrhenius-type mechanism (of the form sketched after this list). It is representative of the previous generation of combustion DNS and is inexpensive to run. A volume of 1 cm³ of air with periodic boundary conditions is simulated over 10 time steps of 1 ps. Initial conditions are 300 K and atmospheric pressure with an initial turbulent field. Snapshot data is dumped every 5 steps (twice over the simulation length).
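For context, a generic single-step irreversible mechanism of this kind is usually written with a reaction rate of the standard Arrhenius form shown below; the symbols A, T_a and the quadratic density dependence are the textbook convention and are not taken from the SENGA2 input for this benchmark:

\[
  \dot{\omega}_{F} \;=\; -A\,\rho^{2}\,Y_{F}\,Y_{O}\,\exp\!\left(-\frac{T_{a}}{T}\right) ,
\]

where Y_F and Y_O are the fuel and oxidiser mass fractions, ρ is the density, T the temperature and T_a = E_a/R the activation temperature.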

In common with other Direct Numerical Simulation codes, memory bandwidth is expected to be the dominant constraint on performance. Communications within SENGA2 are dominated by halo exchange between adjacent computational sub-domains. Provided the problem size is large enough to give a small surface-area-to-volume ratio for each sub-domain, the communications costs are small relative to computation and do not constitute a bottleneck. This is exemplified in Figures 5, 6 and 7 below, where both the weak scaling and strong scaling properties are very good on the target platforms. Scaling on the BlueGene/P is generally good, but overall speed is around 2.5 times lower than on the Cray XT4. However, it should be noted that this performance ratio is better than the ratio of the machines' underlying processor clock speeds (2800 MHz versus 850 MHz, a factor of about 3.3).

Figure 5. Weak scaling performance of SENGA2 – 4-step reduced methane-air mechanism with 32³ grid points per processor

Figure 6. Strong scaling performance of SENGA2 – 1-step Arrhenius-type mechanism with 500³ global grid points

Figure 7. Strong scaling performance of SENGA2 – 1-step Arrhenius-type mechanism with 1000³ global grid points

Acknowledgements

This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725. The authors would also like to express their gratitude to Dr Stewart Cant at the University of Cambridge for his input.

References

1. N.D. Sandham, M. Ashworth and D.R. Emerson, 'Direct Numerical Simulation of Shock/Boundary Layer Interaction', http://www.cse.clrc.ac.uk/ceg/sbli.shtml.
2. SENGA2, http://www.escience.cam.ac.uk/projects/cfd/senga.xsd.
3. L. Temmerman, M.A. Leschziner, C.P. Mellen and J. Frohlich, 'Investigation of wall-function approximations and subgrid-scale models in large eddy simulation of separated flow in a channel with streamwise periodic constrictions', International Journal of Heat and Fluid Flow, 24(2), 157-180, 2003.
4. F. Archambeau, N. Méchitoua and M. Sakiz, 'Code_Saturne: a finite volume code for the computation of turbulent incompressible flows - industrial applications', Int. J. on Finite Volumes, February 2004.
5. MPI: A Message Passing Interface Standard, Message Passing Interface Forum, 1995, http://www.netlib.org/mpi/index.html.
6. EDF Research and Development, http://rd.edf.com/107008i/EDF.fr/Research-and-Development/softwares/CodeSaturne.html.
7. HPCx - The UK's World-Class Service for World-Class Research, www.hpcx.ac.uk.
8. STFC's Computational Science and Engineering Department, http://www.cse.scitech.ac.uk/.
9. Top 500 Supercomputer Sites, http://www.top500.org.
10. HECToR - UK National Supercomputing Service, http://www.hector.ac.uk.
11. Swiss National Supercomputing Centre, CSCS, http://www.cscs.ch.
12. Supercomputing at Oak Ridge National Laboratory, http://computing.ornl.gov/supercomputing.shtml.
13. The Green Top 500 List, http://www.green500.org/lists/2007/11/green500.php.
14. M. Bull, 'Single Node Performance Analysis of Applications on HPCx', HPCx Technical Report HPCxTR0703, 2007, http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0703.pdf.