16.4-Tflops Direct Numerical Simulation of Turbulence by a Fourier Spectral Method on the Earth Simulator

Mitsuo Yokokawa¹†, Ken'ichi Itakura², Atsuya Uno², Takashi Ishihara³ and Yukio Kaneda³

¹ Earth Simulator Research and Development Center, Japan Atomic Energy Research Institute, 6-9-3 Higashi-Ueno, Taito-ku, Tokyo 110-0015, Japan
² Earth Simulator Center, Japan Marine Science and Technology Center, 3173-25 Showa-machi, Kanazawa-ku, Yokohama 236-0001, Japan, {itakura,uno}@es.jamstec.go.jp
³ Graduate School of Engineering, Nagoya University, Chikusa-ku, Nagoya 464-8603, Japan, {ishihara,kaneda}@cse.nagoya-u.ac.jp

† Currently at the Grid Technology Research Center, National Institute of Advanced Industrial Science and Technology; e-mail: [email protected]
Abstract

High-resolution direct numerical simulations (DNSs) of incompressible turbulence with numbers of grid points up to 4096³ have been executed on the Earth Simulator (ES). The DNSs are based on the Fourier spectral method, so that the equation for mass conservation is accurately solved. In DNS based on the spectral method, most of the computation time is consumed in calculating the three-dimensional (3D) Fast Fourier Transform (FFT), which requires huge-scale global data transfer and has been the major stumbling block preventing truly high-performance computing. By implementing new methods to efficiently perform the 3D-FFT on the ES, we have achieved DNS at 16.4 Tflops on 2048³ grid points. The DNS yields an energy spectrum exhibiting a wide inertial subrange, in contrast to previous DNSs with lower resolutions, and therefore provides valuable data for the study of the universal features of turbulence at large Reynolds number.
1 Introduction
Direct numerical simulation (DNS) of turbulence provides us with detailed data on turbulence that is free of experimental uncertainty. DNS is therefore a powerful means not only for finding directly applicable solutions to problems in practical application areas that involve turbulent phenomena, but also for advancing our understanding of turbulence itself: the last outstanding unsolved problem of classical physics, and a phenomenon that appears in many areas of societal impact. Sufficiently high computational performance is essential to the DNS of turbulence. Without it, we can simulate turbulence only at insufficient resolution, or only for low or moderate values of the Reynolds number Re, which represents the degree of non-linearity of the flow in a turbulent system.
However, one will then be missing the essence of the turbulence. For example, our recent experience has shown that DNS with a resolution of only around 512³ grid points results in a significant overestimate of the Kolmogorov constant, one of the most important constants in the theory of turbulence. To obtain asymptotically correct higher-order statistics of small-scale eddies at large Re, which today forms the core of much of the effort in turbulence research, the required resolution for the DNS of incompressible turbulent flow is estimated to be at least 2048³ or 4096³ grid points.

The computer that runs a DNS of this type must have (M) enough memory to accommodate the huge number of degrees of freedom, (S) high enough speed to run the DNS within a tolerable time, and (A) high enough accuracy to resolve the motion of small eddies whose velocity amplitudes are much smaller than those of the energy-containing eddies. The Earth Simulator (ES) provides a unique opportunity in these respects. On the ES, we have recently achieved DNS of incompressible turbulence under periodic boundary conditions (BC) by a spectral method on 2048³ grid points with double-precision arithmetic, and on 4096³ grid points with time integration in single-precision arithmetic; double-precision arithmetic was used to obtain the convolution sums for evaluating the nonlinear terms. Being based on a spectral method, our DNS accurately satisfies the law of mass conservation, as explained below (§ 3); this is not achieved by a conventional DNS based on a finite-difference scheme. Such accuracy is of crucial importance in the study of turbulence, and in particular for the resolution of small eddies.

On the other hand, the execution of DNS code that implements the Fourier spectral method is well suited to evaluating the performance of newly developed distributed-memory systems such as the ES from the viewpoints of computational performance, node-to-node bandwidth, and I/O capabilities. The DNS code is a very good benchmark program for the ES, and it has also been employed in the final adjustment of the hardware system. The maximum number of degrees of freedom used in the evaluation is O(10¹¹). The computational speed of the DNSs was measured by using up to 512 processor nodes of the ES for runs with different numbers of grid points. The best sustained performance of 16.4 Tflops was achieved in a DNS on 2048³ grid points. An overview of the ES and of the numerical methods applied in the DNS, along with results of the system-performance evaluation and of the DNS itself, are presented in this paper.
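As a rough, back-of-the-envelope illustration of requirement (M) and of the O(10¹¹) degrees of freedom quoted above, the short Python sketch below (ours, not part of the DNS code; the assumed count of nine resident double-precision fields is purely illustrative) estimates the degrees of freedom and the memory footprint at the resolutions discussed in this paper.

```python
# Illustrative estimate only: degrees of freedom and memory for a periodic DNS.
# The number of resident fields (velocity components plus work arrays) is a guess.
def dns_footprint(n, bytes_per_word=8, fields=9):
    """Degrees of freedom and memory (TB) for an n^3 grid."""
    dof = 3 * n**3                                  # three velocity components
    tb = fields * n**3 * bytes_per_word / 1e12      # resident double-precision fields
    return dof, tb

for n in (512, 2048, 4096):
    dof, tb = dns_footprint(n)
    print(f"N = {n:4d}: {dof:.2e} degrees of freedom, ~{tb:.2f} TB")
```

Even with this illustrative field count, a 4096³ run approaches half of the 10 TB of main memory of the ES, which is consistent with requirement (M) above.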
2 Overview of the Earth Simulator
2.1 Structure

The ES is a parallel computer system of the distributed-memory type, and consists of 640 processor nodes (PNs) connected by 640 × 640 single-stage crossbar switches. Each PN is a shared-memory system consisting of 8 vector-type arithmetic processors (APs), a 16-GB main memory unit (MMU), a remote access control unit (RCU), and an I/O processor. The peak performance of each AP is 8 Gflops. The ES as a whole thus consists of 5120 APs with 10 TB of main memory and a peak performance of 40 Tflops [1].

Figure 1: System configuration of the ES (640 PNs connected by a single-stage full crossbar switch, 12.3 GB/s × 2 per node).

Each AP consists of a 4-way super-scalar unit (SU), a vector unit (VU), and a main memory access control unit on a single LSI chip. The AP operates at a clock frequency of 500 MHz, with some circuits operating at 1 GHz. Each SU is a super-scalar processor with 64-KB instruction caches, 64-KB data caches, and 128 general-purpose scalar registers; branch prediction, data prefetching, and out-of-order instruction execution are all employed. Each VU has 72 vector registers, each of which can hold 256 vector elements, along with 8 sets of six different types of vector pipelines: addition/shifting, multiplication, division, logical operations, masking, and load/store. Vector pipelines of the same type work together under a single vector instruction, and pipelines of different types can operate concurrently. The VU and SU support the IEEE 754 floating-point data format.

The RCU is directly connected to the crossbar switches and controls inter-node data communication at a 12.3-GB/s bidirectional transfer rate for both sending and receiving data, so the total bandwidth of the inter-node network is about 8 TB/s. Several data-transfer modes, including access to three-dimensional (3D) sub-arrays and indirect access modes, are realized in hardware. In an operation that involves access to the data of a sub-array, the data is moved from one PN to another in a single hardware operation, and relatively little time is consumed by this processing.

The overall MMU is divided into 2048 banks, and the sequence of bank numbers corresponds to increasing addresses of locations in memory. Peak throughput is therefore obtained by accessing contiguous data assigned to locations in increasing order of memory address.

The fabrication and installation of the ES at the Earth Simulator Center of the Japan Marine Science and Technology Center was completed by the end of February 2002 (Fig. 2) [2].

Figure 2: A model of the ES system in the gym-like building. The building is 50 m × 65 m × 17 m and has two stories; it includes a seismic isolation system.
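For reference, the aggregate figures quoted in this subsection follow directly from the per-component numbers; the small sketch below (ours, for illustration only) reproduces them.

```python
# Aggregate peak figures of the ES from the per-component numbers quoted above.
nodes            = 640      # processor nodes (PNs)
aps_per_node     = 8        # arithmetic processors (APs) per PN
ap_peak_gflops   = 8        # peak performance per AP
mem_per_node_gb  = 16       # shared memory per PN
link_gb_per_s    = 12.3     # bidirectional inter-node rate per PN

total_aps    = nodes * aps_per_node                 # 5120 APs
peak_tflops  = total_aps * ap_peak_gflops / 1000    # ~41 Tflops ("40 Tflops" class)
total_mem_tb = nodes * mem_per_node_gb / 1024       # 10 TB
net_tb_per_s = nodes * link_gb_per_s / 1000         # ~7.9 TB/s ("about 8 TB/s")

print(f"{total_aps} APs, {peak_tflops:.2f} Tflops peak, "
      f"{total_mem_tb:.0f} TB memory, ~{net_tb_per_s:.1f} TB/s aggregate network")
```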
2.2 Parallel programming on the ES

If we consider vector processing as a sort of parallel processing, then three levels of parallel programming must be combined to attain high sustained performance on the ES. The first level is vector processing within an individual AP; this is the most fundamental level of processing on the ES. Automatic vectorization is applied by the compilers to programs written in conventional Fortran 90 and C. The second level is shared-memory parallel processing within an individual PN. Shared-memory parallel programming is supported by microtasking and OpenMP. The microtasking capability is similar in style to that provided for Cray supercomputers, and the same functionality is realized on the ES. Microtasking is applied in two ways: one (AMT) is automatic parallelization by the compilers, and the other (MMT) is the manual insertion of microtasking directives before target do loops. The third level is distributed-memory parallel processing among the PNs. The distributed-memory parallel programming model is supported by the Message Passing Interface (MPI).

The performance of the MPI put function of the MPI-2 specification was measured on this system [3]. The maximum throughput and latency for MPI put are 11.63 GB/s and 6.63 µsec, respectively. Only 3.3 µsec is required for barrier synchronization, because the system includes dedicated hardware for global barrier synchronization among the PNs.
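For intuition about these numbers, a simple latency-plus-bandwidth cost model built from the measured MPI put figures (a sketch of ours; the crossover size is a property of this linear model, not an additional measurement) shows roughly how large a message must be before a transfer becomes bandwidth-bound.

```python
# Linear cost model using the measured MPI put figures quoted above.
LATENCY_S  = 6.63e-6      # measured MPI put latency (seconds)
BW_BYTES_S = 11.63e9      # measured MPI put throughput (bytes/second)

def put_time(nbytes):
    """Estimated transfer time under the latency + size/bandwidth model."""
    return LATENCY_S + nbytes / BW_BYTES_S

# Message size at which the bandwidth term equals the latency term.
n_half = LATENCY_S * BW_BYTES_S
print(f"crossover ~ {n_half / 1e3:.0f} KB; "
      f"a 1-MB put takes ~ {put_time(1 << 20) * 1e6:.0f} microseconds")
```

Under this model, messages much smaller than about 100 KB are latency-dominated, while the large messages involved in the global data transfer of the 3D-FFT are bandwidth-bound.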
3 Numerical Methods
3.1 Basic Equations and Spectral Method

The problem of incompressible turbulence under periodic boundary conditions (BC) is one of the most canonical problems in the study of turbulence, and has in fact been extensively studied. It retains the essence of turbulence (nonlinear convection, pressure, and dissipation due to viscosity) while being free of such extra complexities as those due to the fluid's compressibility, which often make the reliability of DNS less transparent. We here consider the flow of an incompressible fluid as described by the Navier-Stokes (NS) equations,

\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\nabla p + \nu \nabla^2 \mathbf{u} + \mathbf{f},   (1)

under a periodic boundary condition with period 2π, where u = (u_1, u_2, u_3) is the velocity field, p is the pressure, and f is the external force satisfying ∇ · f = 0; the fluid density is assumed to be unity. The pressure term p can be eliminated by means of the incompressibility condition

\nabla \cdot \mathbf{u} = 0.   (2)

Let us rewrite (1) in the form

\frac{\partial \mathbf{u}}{\partial t} = \mathbf{u} \times \boldsymbol{\omega} - \nabla \Pi + \nu \nabla^2 \mathbf{u} + \mathbf{f},   (3)
where ω = rot u = (ω_1, ω_2, ω_3) is the vorticity and Π = p + |u|²/2. Then, taking the divergence of (3) and using (2), we obtain

\nabla^2 \Pi = \nabla \cdot [\mathbf{u} \times \boldsymbol{\omega}].   (4)

In DNS of turbulence, the accurate solution of equations of this type, i.e., Poisson's equation, is important, because a violation of the equation(s) implies a violation of mass conservation, one of the most fundamental laws of physics. However, it is in general not easy to solve Poisson's equation accurately by using a finite-difference (FD) scheme. In fact, most of the CPU time consumed in a DNS by FD is known to be spent in solving Poisson's equation, and this CPU time increases rapidly with the required accuracy level. This difficulty can be overcome by using the Fourier spectral method, where (3) is written as

\left( \frac{d}{dt} + \nu k^2 \right) \hat{u}(\mathbf{k}) = \hat{s}(\mathbf{k}) - \mathbf{k}\,\frac{\mathbf{k} \cdot \hat{s}(\mathbf{k})}{k^2} + \hat{f}(\mathbf{k}), \qquad \hat{s}(\mathbf{k}) = \widehat{(\mathbf{u} \times \boldsymbol{\omega})}(\mathbf{k}),   (5)

where we have used (4). In (5), k is the wave vector, k = |k|, and the hat denotes the Fourier coefficient; for example, \hat{u}(\mathbf{k}) is defined by

\mathbf{u}(\mathbf{x}) = \sum_{\mathbf{k}} \hat{u}(\mathbf{k}) \exp(i \mathbf{k} \cdot \mathbf{x}).
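To make the structure of (5) concrete, the following minimal serial sketch in Python/NumPy (our illustration only, not the parallel code run on the ES; the function name rhs_spectral, the grid size, and the random initial field are ours, and forcing and dealiasing are omitted) evaluates the right-hand side of (5): the nonlinear term u × ω is formed in physical space, transformed with 3D FFTs, and its compressive part is removed by subtracting k(k · ŝ)/k².

```python
import numpy as np

def rhs_spectral(u_hat, nu, k, k2):
    """Right-hand side of (5) for Fourier coefficients u_hat of shape (3, N, N, N).

    Serial NumPy illustration only: nonlinear term via 3D FFTs, pressure part
    removed by projection; external forcing and dealiasing are omitted.
    """
    # Velocity and vorticity in physical space (omega_hat = i k x u_hat).
    u = np.real(np.fft.ifftn(u_hat, axes=(1, 2, 3)))
    w = np.real(np.fft.ifftn(1j * np.cross(k, u_hat, axis=0), axes=(1, 2, 3)))

    # s = u x omega in physical space, then transform to obtain s_hat(k).
    s_hat = np.fft.fftn(np.cross(u, w, axis=0), axes=(1, 2, 3))

    # Remove the pressure contribution: subtract k (k . s_hat) / k^2, cf. (5).
    s_hat -= k * (np.sum(k * s_hat, axis=0) / k2)

    # Add the viscous term -nu k^2 u_hat (forcing f_hat omitted).
    return s_hat - nu * k2 * u_hat

# Wavenumber grid for a 2*pi-periodic box with N points per direction.
N, nu = 32, 1e-3
k1 = np.fft.fftfreq(N, 1.0 / N)                       # integer wavenumbers
k = np.array(np.meshgrid(k1, k1, k1, indexing="ij"))  # shape (3, N, N, N)
k2 = np.sum(k * k, axis=0)
k2[0, 0, 0] = 1.0                                     # avoid division by zero at k = 0

# Example: right-hand side for a random (illustrative) initial field.
u_hat = np.fft.fftn(np.random.randn(3, N, N, N), axes=(1, 2, 3))
dudt_hat = rhs_spectral(u_hat, nu, k, k2)
```

In the real DNS the 3D FFTs are distributed across the PNs, and it is this global transform, rather than the local pointwise operations above, that dominates the computation time, as noted in the abstract.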
Figure 4: Isosurfaces of intense vorticity at the level |ω| = ω̄ + 4σ; ω is the vorticity, and ω̄ and σ are the mean and standard deviation of |ω|, respectively. The size of the display domain is (5984² × 1496)η, periodic in the vertical and horizontal directions. η is the Kolmogorov length scale and Rλ = 732 (see Table 3).

Figure 5: A closer view of the inner square region of Fig. 4; the size of the display domain is (2992² × 1496)η.
Figure 6: The same isosurfaces as in Fig. 4; a closer view of the inner square region of Fig. 5. The size of the display domain is (1496² × 1496)η.
Figure 7: The same isosurfaces as in Fig. 4; a closer view of the inner square region of Fig. 6. The size of the display domain is (748² × 1496)η.