Acceleration of a high order Finite-Difference WENO Scheme for Large-Scale Cosmological Simulations on GPU

Chen Meng1, 2, Long Wang1 , Zongyan Cao1, 2, Xianfeng Ye1, 2, Long-Long Feng3, 4 †

1. Supercomputing Center, Computer Network Information Center, CAS, Beijing, China 2. Graduate University of Chinese Academy of Sciences, Beijing, China 3. Purple Mountain Observatory, Chinese Academy of Sciences, Beijing, China 4. Center for Astrophysics, University of Science and Technology of China, Anhui, China {mengchen, wangl, zycao, yexf }@sccas.cn, [email protected]

Abstract—In this work, we present our implementation of a three-dimensional 5th order finite-difference weighted essentially non-oscillatory (WENO) scheme in double precision on CPU/GPU clusters, which targets large-scale cosmological hydrodynamic flow simulations involving both shocks and complicated smooth solution structures. At the level of MPI parallelization, we subdivided the domain along each of the three axial directions. On each node, we ported the WENO computation to the GPU. The method is memory-bound, owing to the calculation of the weights, and this becomes an even greater challenge for a 3D high-order problem in double precision. To make full use of the impressive computing power of the GPU and work around its memory limitations, we performed a series of optimizations focused on the memory access pattern at all levels. We subjected the code to a number of typical tests to evaluate its effectiveness and efficiency. Our tests indicate that, against a mono-thread CPU reference, the GPU version achieves a 12~19 times speedup, and the computation part is about 19~36 times faster than the Fortran code on the CPU. We analyzed the results on two GPUs with different architectures and discussed future improvements and requirements for GPU cards. We also outlined what is needed to further increase the speed by reducing the time spent on communication, along with other future work.

Keywords- WENO, 3D, double precision, GPU, cosmological hydrodynamics.

I. INTRODUCTION

As a result of the high nonlinearity of gravitational clustering in the universe, the velocity of the cosmological flow exceeds the sound speed [1]. The most distinctive feature of compressible supersonic flow is the occurrence of shock waves. A shock wave is a propagating disturbance that produces strong discontinuities in the properties of the fluid (i.e., density, velocity, pressure, temperature) within complex smooth structures [2, 3]. A variety of numerical schemes for solving supersonic flow systems have been developed in the past decades [4]. Modern approaches based on structured grids and designed for high-resolution shock capturing work extremely well in both low- and high-density regions, as well as in shocks. In this paper, we present an accelerated implementation of a 3D 5th order WENO scheme, a finite-difference grid method of high robustness and accuracy for solving the Euler equation. The WENO scheme uses adaptive stencils in the reconstruction procedure, based on the local smoothness of the numerical solution, to automatically achieve high-order accuracy and a non-oscillatory property near discontinuities. It can simultaneously provide high-order resolution for the smooth part of the solution and a sharp representation of shocks. Moreover, a significant advantage of WENO is its ability to retain high accuracy on coarser meshes and to achieve better resolution on the larger meshes allowed by computer hardware.

† Corresponding author: [email protected]

FIGURE 1. TRADITIONAL PARALLEL SCHEME ON CPU CLUSTERS

The WENO scheme has been widely used in supersonic flow simulations, which call for high-order accuracy and shock capturing. Because of the large amount of data in 3D large-scale simulations and the high computing power demanded by the WENO scheme, the computational workload should be suitably distributed among the processors. However, the main disadvantage of that approach is that the higher the order of accuracy, the more ghost data is needed. As a result, a parallelization scheme based on traditional domain decomposition can scale poorly as the workload is scattered further (Figure 1). The advent of many-core processors, especially GPUs, opens new horizons for further accelerating large-scale, high-resolution simulations. The WENO scheme can significantly benefit from the tremendous advances of GPUs, mainly for the following three reasons: (1) A GPU program is executed following the Single Instruction Multiple Threads (SIMT) programming paradigm on hundreds or even thousands of cores on a GPU chip. SIMT is less restrictive than Single Instruction Multiple Data (SIMD), the traditional programming model of vector processors, and it allows parallelization in both data and instructions. The overlapping of memory access and computation can therefore yield a great performance improvement for the memory-bound WENO computation. (2) Single-precision solvers for the Euler equation on GPU have been developed [5, 6], but engineers and researchers need a high-order, double-precision solver, and this need is served extremely well by GPUs with SM_2.x (compute capability). The Tesla C2075 with the Fermi architecture provides a peak performance of 515 Gflops for double-precision operations, and the Tesla K20m with the Kepler architecture, the successor of Fermi, provides 1.17 Tflops. (3) Some GPU implementations for two-dimensional problems have achieved good speedups [7]. All these implementations exploited the memory hierarchy of the CUDA programming model. Although WENO is able to retain high accuracy on coarser meshes, the amount of data in a three-dimensional mesh is large; a 3D solver requires more space at every level of memory and requires programmers to use that memory more efficiently. We have implemented a double-precision three-dimensional Euler solver [8, 9] based on the 5th order WENO scheme for hyperbolic conservation laws using CUDA, with a 3rd order low-storage Runge-Kutta scheme [10] for the time integration. The GPU code is integrated with the cubic domain-decomposition MPI parallelization framework, so the solver can run on CPU/GPU clusters. Our implementation has been applied to Wigeon, a hybrid cosmological hydrodynamic/N-body simulation code based on the WENO scheme [1].

This article is organized as follows. In Section 2, we introduce the algorithm of the WENO scheme as well as our model for the solution of the Euler equation. Section 3 outlines the implementation and optimization details of both the MPI parallelization and our GPU code. In Section 4, we measure and analyze the performance of our implementation. Section 5 summarizes the results and gives a short outlook.

II. NUMERICAL ALGORITHM

A. Governing Equation
The Euler equation for an inviscid fluid, which is the governing equation for systems of hyperbolic conservation laws, can be written in the compact form

U_t + \frac{\partial f(U)}{\partial X} + \frac{\partial g(U)}{\partial Y} + \frac{\partial h(U)}{\partial Z} = F(t, U),   (1)

where U and the fluxes f(U), g(U), and h(U) are five-component column vectors,

U = \begin{pmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ E \end{pmatrix}, \quad
f(U) = \begin{pmatrix} \rho u \\ \rho u^2 + P \\ \rho u v \\ \rho u w \\ u(E+P) \end{pmatrix}, \quad
g(U) = \begin{pmatrix} \rho v \\ \rho u v \\ \rho v^2 + P \\ \rho v w \\ v(E+P) \end{pmatrix}, \quad
h(U) = \begin{pmatrix} \rho w \\ \rho u w \\ \rho v w \\ \rho w^2 + P \\ w(E+P) \end{pmatrix}.   (2)

Here t is the time and (X, Y, Z) are the coordinates; ρ is the density, V = (u, v, w) is the velocity vector, E is the total energy, and P is the pressure:

E = \frac{P}{\gamma - 1} + \frac{1}{2}\rho\left(u^2 + v^2 + w^2\right).   (3)

The left-hand side of the equation is written in conservative form for mass, momentum, and energy, and the "force" term on the right-hand side includes the contributions from gravitation.

B. WENO Scheme Algorithm
The discretization of the fluxes for solving the governing equation is based on the 5th order WENO finite difference [11, 12]. Taking \partial f(u)/\partial x as an example, along the x line with y and z fixed,

\frac{\partial f(u)}{\partial x}\bigg|_{x = x_j} \approx \frac{1}{\Delta x}\left(\hat{f}_{j+1/2} - \hat{f}_{j-1/2}\right),

where \hat{f}_{j+1/2} is the numerical flux. If f'(u) \ge 0, the 5th order finite-difference WENO scheme gives the flux as

\hat{f}_{j+1/2} = w_1 \hat{f}^{(1)}_{j+1/2} + w_2 \hat{f}^{(2)}_{j+1/2} + w_3 \hat{f}^{(3)}_{j+1/2},

where the \hat{f}^{(i)}_{j+1/2} are fluxes on three different stencils, given by

\hat{f}^{(1)}_{j+1/2} = \frac{1}{3} f(u_{j-2}) - \frac{7}{6} f(u_{j-1}) + \frac{11}{6} f(u_j),
\hat{f}^{(2)}_{j+1/2} = -\frac{1}{6} f(u_{j-1}) + \frac{5}{6} f(u_j) + \frac{1}{3} f(u_{j+1}),
\hat{f}^{(3)}_{j+1/2} = \frac{1}{3} f(u_j) + \frac{5}{6} f(u_{j+1}) - \frac{1}{6} f(u_{j+2}).

Note that the above formulas follow from the assumption f'(u) \ge 0. The key to the success of the WENO scheme lies in the design of the nonlinear weights w_i,

which are given by

w_i = \frac{\tilde{w}_i}{\sum_{k=1}^{3} \tilde{w}_k}, \qquad \tilde{w}_k = \frac{\gamma_k}{(\varepsilon + \beta_k)^2},

where the linear weights \gamma_k are chosen to yield 5th order accuracy and are given by

\gamma_1 = \frac{1}{10}, \qquad \gamma_2 = \frac{3}{5}, \qquad \gamma_3 = \frac{3}{10}.

The smoothness indicators \beta_k are given by

\beta_1 = \frac{13}{12}\left[f(u_{j-2}) - 2 f(u_{j-1}) + f(u_j)\right]^2 + \frac{1}{4}\left[f(u_{j-2}) - 4 f(u_{j-1}) + 3 f(u_j)\right]^2,
\beta_2 = \frac{13}{12}\left[f(u_{j-1}) - 2 f(u_j) + f(u_{j+1})\right]^2 + \frac{1}{4}\left[f(u_{j-1}) - f(u_{j+1})\right]^2,
\beta_3 = \frac{13}{12}\left[f(u_j) - 2 f(u_{j+1}) + f(u_{j+2})\right]^2 + \frac{1}{4}\left[3 f(u_j) - 4 f(u_{j+1}) + f(u_{j+2})\right]^2.

Finally, \varepsilon is a parameter that keeps the denominator from becoming zero and is usually taken as \varepsilon = 10^{-6}. This finishes the description of the 5th order WENO scheme under the assumption f'(u) \ge 0. We now turn to the scheme in our case, the more complex situation without the property f'(u) \ge 0. In this paper, we use the Lax-Friedrichs flux splitting:

f(u) = f^+(u) + f^-(u),
f^+(u) = \frac{1}{2}\left(f(u) + \alpha u\right), \qquad \frac{d f^+(u)}{du} \ge 0,
f^-(u) = \frac{1}{2}\left(f(u) - \alpha u\right), \qquad \frac{d f^-(u)}{du} \le 0,
\alpha = \max_u \left|f'(u)\right|.
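To make the reconstruction concrete, the following device function is a minimal sketch (in CUDA, double precision) of the scalar 5th order WENO flux for the case f'(u) >= 0 described above; the function name and argument layout are illustrative and not taken from the authors' code.

__device__ double weno5_flux(double fm2, double fm1, double f0,
                             double fp1, double fp2)
{
    // Candidate fluxes on the three stencils.
    double f1 = (1.0/3.0)*fm2 - (7.0/6.0)*fm1 + (11.0/6.0)*f0;
    double f2 = -(1.0/6.0)*fm1 + (5.0/6.0)*f0 + (1.0/3.0)*fp1;
    double f3 = (1.0/3.0)*f0 + (5.0/6.0)*fp1 - (1.0/6.0)*fp2;

    // Smoothness indicators beta_k.
    double d1 = fm2 - 2.0*fm1 + f0, e1 = fm2 - 4.0*fm1 + 3.0*f0;
    double d2 = fm1 - 2.0*f0 + fp1, e2 = fm1 - fp1;
    double d3 = f0 - 2.0*fp1 + fp2, e3 = 3.0*f0 - 4.0*fp1 + fp2;
    double b1 = (13.0/12.0)*d1*d1 + 0.25*e1*e1;
    double b2 = (13.0/12.0)*d2*d2 + 0.25*e2*e2;
    double b3 = (13.0/12.0)*d3*d3 + 0.25*e3*e3;

    // Nonlinear weights from the linear weights (1/10, 3/5, 3/10), eps = 1e-6.
    const double eps = 1.0e-6;
    double a1 = 0.1/((eps + b1)*(eps + b1));
    double a2 = 0.6/((eps + b2)*(eps + b2));
    double a3 = 0.3/((eps + b3)*(eps + b3));

    return (a1*f1 + a2*f2 + a3*f3)/(a1 + a2 + a3);   // \hat f_{j+1/2}
}

A mirror-image version of the same routine, applied on the reversed stencil, handles the f^-(u) part of the Lax-Friedrichs split.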

Then, we apply the above procedure to f^+(u) and a mirror-image procedure to f^-(u). For the 5th order WENO we adopt, this simple Lax-Friedrichs flux splitting has very small numerical viscosity and works well. For systems of hyperbolic conservation laws, the nonlinear part of the WENO procedure is carried out in local characteristic fields. Thus, we first define an average \bar{u}_{j+1/2} = [\bar{\rho}, \bar{u}, \bar{v}, \bar{w}, \bar{p}]^T of u_j and u_{j+1} by using the Roe average:

\bar{\rho} = \sqrt{\rho_j \rho_{j+1}}, \qquad
\bar{V} = \frac{V_j \sqrt{\rho_j} + V_{j+1} \sqrt{\rho_{j+1}}}{\sqrt{\rho_j} + \sqrt{\rho_{j+1}}}, \qquad
\bar{p} = \frac{P_j \sqrt{\rho_j} + P_{j+1} \sqrt{\rho_{j+1}}}{\sqrt{\rho_j} + \sqrt{\rho_{j+1}}},

and use it to compute the left and right eigenvectors of f'(\bar{u}_{j+1/2}):

R^{-1}_{j+1/2}\, f'(\bar{u}_{j+1/2})\, R_{j+1/2} = \Lambda_{j+1/2}.

One then projects all the quantities needed for evaluating the numerical flux \hat{f}(\bar{u}_{j+1/2}) onto the local characteristic space by left-multiplying them with R^{-1}_{j+1/2}, and re-projects the result back to the original physical space by right-multiplying with R_{j+1/2}.

C. Time Discretization
In this paper, we use the finite-difference WENO scheme only as a method of discretizing the spatial variables. To solve the Euler equation, we use the 3rd order nonlinearly stable Runge-Kutta time discretization [10].
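For reference, one common form consistent with [10] is the third-order TVD Runge-Kutta scheme; it is shown here as an illustration of the time stepping, and the exact low-storage variant used in the code may differ:

u^{(1)} = u^n + \Delta t\, L(u^n),
u^{(2)} = \frac{3}{4} u^n + \frac{1}{4} u^{(1)} + \frac{1}{4} \Delta t\, L(u^{(1)}),
u^{n+1} = \frac{1}{3} u^n + \frac{2}{3} u^{(2)} + \frac{2}{3} \Delta t\, L(u^{(2)}),

where L(u) denotes the WENO spatial discretization of the flux terms together with the force term F(t, U).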

III. IMPLEMENTATION

The Euler solver advances the flow system through multiple iterations (Figure 2). Each iteration begins with a data exchange of the ghost cells. The CPU then transfers the values on the cubic mesh to GPU memory. The 3D WENO procedure can take 98% of the total CPU computation time, so we concentrate on this WENO part. It is split into three kernels: WENO_fx, which calculates the finite difference in the X-axis direction, WENO_gy for the Y-axis, and WENO_hz for the Z-axis. We do this splitting for two reasons: (1) both the memory access and the computation pattern in each of these three kernels are more concise than in one big kernel, which also makes further optimizations easier; and (2) some applications have different accuracy demands in the three directions. After getting the right-hand-side values from the GPU, the CPU computes the new values on the cubic mesh with the Runge-Kutta method.

FIGURE 2. AN ILLUSTRATION OF THE COMPLETE FLOW CHART OF SOLVING THE EULER EQUATION WITH THE WENO SCHEME (CPU: MPI_Sendrecv, SET GHOST DATA, RUNGE-KUTTA; GPU: WENO_fx, WENO_gy, WENO_hz)

A. Parallel Scheme
In addition to the huge amount of data, two other significant features emerge in 3D WENO computations, which pose more challenges than schemes that do not target high accuracy and shock capturing. One is the high demand for computing power: the computational workload should be suitably distributed among the processors. The other is that high-order WENO requires wide stencils; for example, the 5th order scheme requires at least a five-point-wide stencil, which gives rise to the so-called ghost data (Figure 3).

FIGURE 3. DOMAIN DECOMPOSITION ON CPU-CLUSTER AND GHOST DATA OF A SUBDOMAIN

In this paper, the numerical solution of the Euler equation was initially implemented in Fortran-90 for single-block structured meshes. Parallelization of the code for clusters was achieved through the Message Passing Interface (MPI). In the MPI version, the domain decomposition was performed by subdividing the domain cubically (Figure 3). This kind of decomposition generates the least total ghost data at a given parallel scale and also ensures the best scalability. Theoretically, the cubical decomposition can work on 3D meshes of any size and at any parallel scale. However, as the workload is scattered further, the effects of the ghost data and the communication overhead become prominent, which degrades the scalability of our parallelization scheme. We therefore decided to port our code to a heterogeneous scheme involving CPU and GPU co-processing. Each process involves one CPU and one GPU: the GPU handles the WENO part of the computation, and the CPU handles the remaining part, including the communication between nodes and other light computations.
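The following host-side sketch illustrates the cubic decomposition and a one-axis ghost exchange with MPI_Cart_create and MPI_Sendrecv; the buffer names and the ghost width NG are assumptions for illustration, not the production code.

#include <mpi.h>

#define NG 3   /* ghost width assumed here for the 5th order stencil */

/* Exchange the X-direction ghost layers with the two X-neighbours. */
void exchange_ghosts_x(double *send_lo, double *send_hi,
                       double *recv_lo, double *recv_hi,
                       int count, MPI_Comm cart)
{
    int left, right;
    MPI_Cart_shift(cart, 0, 1, &left, &right);   /* neighbours along axis 0 */

    /* Our low-side boundary layers go to the left neighbour; the right
       neighbour's low-side layers arrive into our high-side ghost region,
       and vice versa. */
    MPI_Sendrecv(send_lo, count, MPI_DOUBLE, left,  0,
                 recv_hi, count, MPI_DOUBLE, right, 0, cart, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send_hi, count, MPI_DOUBLE, right, 1,
                 recv_lo, count, MPI_DOUBLE, left,  1, cart, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    int nprocs, dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Comm cart;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);                     /* cubic decomposition */
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);
    /* ... allocate the local subdomain and, each step, pack the boundary
       layers and call exchange_ghosts_x/y/z before the WENO kernels ... */
    MPI_Finalize();
    return 0;
}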

The subdomain distributed on each node undergoes a second decomposition on the local GPU. The WENO procedure for \partial f(U)/\partial X, for example, is actually a series of X-axis directional computations. Attaining high GPU speed requires generating as many identical, independent threads as possible. The selected strategy is therefore to assign each point on the Y-Z plane of the local cubic mesh to a different thread, which processes the group of all points along the X-axis (Figure 4). The parallel strategy for the WENO procedures of \partial g(U)/\partial Y and \partial h(U)/\partial Z is the same, but along a different axis direction.

FIGURE 4. PARALLEL STRATEGY FOR WENO_FX ON GPU
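A sketch of this mapping for WENO_fx is given below: one thread per (y, z) point, sweeping the whole X line. The array layout, ghost width, and kernel signature are illustrative assumptions, and weno5_flux is the device function sketched in Section II.

__device__ double weno5_flux(double, double, double, double, double); // sketch from Sec. II

__global__ void weno_fx_sketch(const double *f, double *dfdx,
                               double dx, int NX, int NY, int NZ)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per Y-Z point
    int z = blockIdx.y * blockDim.y + threadIdx.y;
    if (y >= NY || z >= NZ) return;

    for (int x = 3; x < NX - 2; ++x) {               // interior points, 3 ghost cells assumed
        size_t j = ((size_t)z * NY + y) * NX + x;    // X is the fastest-varying index
        double fp = weno5_flux(f[j-2], f[j-1], f[j],   f[j+1], f[j+2]); // \hat f_{x+1/2}
        double fm = weno5_flux(f[j-3], f[j-2], f[j-1], f[j],   f[j+1]); // \hat f_{x-1/2}
        dfdx[j] = (fp - fm) / dx;
    }
}

// Possible launch configuration:
//   dim3 block(16, 16);
//   dim3 grid((NY + 15) / 16, (NZ + 15) / 16);
//   weno_fx_sketch<<<grid, block>>>(d_f, d_dfdx, dx, NX, NY, NZ);

Note that with this illustrative X-fastest layout, the threads of a warp differ in y and therefore read addresses 8*NX bytes apart, which is exactly the partially uncoalesced pattern of WENO_fx discussed in the next subsection; the Y- and Z-direction kernels do not suffer from it.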

B. Optimization Technologies
1) Memory throughput
The kernel for the WENO computations is memory-bound [13]: its instructions-per-byte ratio is low, that is, the time spent accessing memory is longer than the time spent executing instructions. This is partly due to the nature of the WENO algorithm and partly due to the use of double precision. We therefore regard memory access as the biggest performance limiter. Optimizations should focus on global memory first and then make full use of the other kinds of hierarchical memory, including texture memory, shared memory, local memory, and registers.
a) Address pattern in global memory
Global memory is accessed via 32-, 64-, or 128-byte memory transactions whose first address is a multiple of these sizes [14]. When a warp (32 threads with consecutive indices) executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more memory transactions, depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads. In our code, one element in double precision occupies 8 bytes, and one memory access by a warp involves 256 bytes. Ideally, such an access is coalesced into 2 memory transactions; in the worst case it requires 32 (Figure 5). The fluxes in the WENO scheme are five-component column vectors, and memory-access locality for a CPU thread is achieved by storing the data as an array of five-component vectors, which is the worst address pattern for GPU threads.

FIGURE 5. MEMORY TRANSACTIONS IN ONE WARP

So we change the array of structures (AoS) into a structure of arrays (SoA), in which all elements of each component of the vectors lie at consecutive addresses. The intermediate variables in global memory for each thread are organized in the same way. Global memory resides in device memory and is the main memory on the GPU; it can be written and read and is cached by L1 and L2. The results computed at each time step are stored in global memory, and the 3D WENO computation benefits from its large capacity.
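The layout change can be sketched as follows; the struct and field names are illustrative only.

// Array of structures (AoS): one record per cell.  A warp in which thread i
// reads cell[i].rho touches addresses 40 bytes apart, so the 32 loads cannot
// be coalesced into a few transactions.
struct CellAoS { double rho, mx, my, mz, e; };

// Structure of arrays (SoA): each component is stored contiguously, so thread i
// reading rho[i] produces one contiguous 256-byte request per warp.
struct StateSoA {
    double *rho, *mx, *my, *mz, *e;   // five device arrays of length NX*NY*NZ
};

__global__ void scale_density(StateSoA s, double factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) s.rho[i] *= factor;    // fully coalesced access per warp
}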

b) Hierarchical memory
The access latency of global memory is tens or hundreds of times longer than that of the other memories. To improve memory throughput further, we need to use the GPU memory hierarchy appropriately.
We use texture memory for the data that are processed before each WENO computation begins, because reading such data is safe after binding them to texture memory. Texture memory resides in device memory but is optimized for 2D and 3D spatial locality and benefits from its own caching mechanism. Although the capacity of texture memory is limited, it solves two problems that global memory has in our WENO code. First, it reduces the cost of uncoalesced addresses in the reading part of the X-axis direction: global memory can only coalesce 1D memory accesses, whereas texture memory can be seen as a special kind of global memory with 2D and 3D "coalescing". Second, the separate texture cache does not cause any cache misses in global memory. Texture memory cannot be fetched in double precision directly, so we read the data as "int2" and convert them to "double" with the function "__hiloint2double()" [14].
Shared memory on chip has much higher bandwidth and much lower latency than device memory, but it is a limited resource per streaming multiprocessor (SM). Over-use of shared memory severely reduces the number of active threads, which directly degrades performance. Double precision and the features of the algorithm both make it impossible to gain performance from shared memory in the computation part of the main kernel, where fast thread switching across numerous active threads matters more. We therefore use shared memory as a bridge for data writes and reads in the light-workload kernels that have uncoalesced-address problems, including the kernel for data-block transposition and the kernel that only writes the results back. As mentioned earlier, each element of the data block in WENO has five components; we process the components one by one to minimize the shared-memory usage per SM.
Registers and local memory are both private to a thread. Register access has no latency, while local memory resides in device memory and takes as long to access as global memory, cached by L1. The limit for GPUs with SM_2.x is 63 registers per thread; when this limit is exceeded, the compiler "spills" registers to local memory. By default, arrays declared in a kernel are stored in local memory. On the one hand, we break the short arrays in our code into several individual variables to make full use of registers; on the other hand, we do our best to reduce register usage to avoid register spilling.
2) Instruction throughput
Instruction throughput can be increased by removing instruction serialization and overlapping latency. We consolidate "if" statements in the code to remove the instruction serialization caused by branch divergence within a warp, and we unroll as many small loop bodies as possible. A high occupancy rate ensures high instruction throughput, meaning there are enough concurrent threads per multiprocessor to hide latency. Following the CUDA architecture specification, we select the optimal block size and control the use of resources on an SM so that latency can be hidden by threads from distinct blocks.
3) Other optimization methods
Because the main WENO kernel has a heavy workload, with many memory transactions and many calculations, it is harder to optimize than small kernels. We split some functionally independent parts, such as the head or the tail of the big main kernel, into separate kernels. These new kernels have more resources available and are easier to optimize.
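The texture read path described above can be sketched with the legacy texture-reference API of that CUDA generation; the reference and helper names are illustrative.

texture<int2, 1, cudaReadModeElementType> tex_rho;   // file-scope texture reference

__device__ inline double fetch_rho(int i)
{
    int2 v = tex1Dfetch(tex_rho, i);     // read the double as two 32-bit words
    return __hiloint2double(v.y, v.x);   // high word in .y, low word in .x
}

// Host side, once per time step before launching the WENO kernels:
//   cudaBindTexture(0, tex_rho, d_rho, n * sizeof(double));
//   ... kernels call fetch_rho(i) instead of reading d_rho[i] directly ...
//   cudaUnbindTexture(tex_rho);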

Compiler arguments such as -use_fast_math and -Xptxas -dlcm=cg/ca bring some performance improvements. In our code, the compiler arguments include "-Xptxas -dlcm=cg", which means using non-caching loads and stores that bypass L1. Memory accesses cached in L1 are serviced with 128-byte memory transactions, whereas memory accesses cached in L2 are serviced with 32-byte memory transactions [14], so non-caching loads are more efficient for scattered or partially filled access patterns. In our code, bypassing L1 therefore reduces over-fetch and potentially causes less contention with spilled registers in the L1 cache. After the above optimizations, one step (Figure 6) of the iterative solution of the Euler equation achieves high memory and instruction throughput.

FIGURE 6. AN ILLUSTRATION OF ONE STEP OF SOLVING THE EULER EQUATION WITH THE WENO SCHEME AFTER GPU OPTIMIZATIONS (CPU: SET GHOST DATA, MEMCPY, RUNGE-KUTTA; GPU: TRANSPOSE AOS->SOA, BIND/UNBIND TEXTURE, WENO_fx/WENO_gy/WENO_hz CALCULATION AND WRITE-BACK KERNELS, TRANSPOSE SOA->AOS)

IV. TESTING AND ANALYSIS

A. Numerical Tests
The WENO scheme in our code has been subjected to a variety of numerical tests, especially for situations in which shocks and complicated smooth flow features coexist. In this section we run the following two tests [15]: (1) the oblique Sod shock tube test, and (2) the Sedov spherical blast wave test.
1) Oblique Shock Tube Test
The Sod shock tube problem has been widely used to test the ability of hydrodynamic codes to capture shocks. It is set up as a straight tube of gas divided by a membrane into two chambers. We set the initial density and pressure jump on either side of the membrane to

(\rho_L, P_L, V_L) = (1.5,\ 1.0,\ 0.0), \qquad (\rho_R, P_R, V_R) = (1.0,\ 0.2,\ 0.0).

The Sod shock tube is actually a one-dimensional problem. In the three-dimensional case, the shock propagates along the main diagonal of the region, the line from (0,0,0) to (1,1,1) in the cube, which is called the oblique Sod shock tube. We compare the numerical results at t = 0.510 (Figure 7) with the analytical solution, using 64 cells in each direction. The shock discontinuity is resolved within two cells, without oscillation or over-smoothing.

FIGURE 7. DENSITY ALONG THE MAIN DIAGONAL LINE OF THE 3D REGION FOR THE SOD SHOCK TUBE TEST AT T=0.510.
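As an illustration of the oblique setup, the following host-side sketch fills the density and pressure fields on the unit cube. It assumes a cell-centred mesh and an initial membrane on the plane x + y + z = 1.5 normal to the main diagonal; both are assumptions for illustration rather than details given in the text.

void init_oblique_sod(double *rho, double *p, int N)
{
    for (int k = 0; k < N; ++k)
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i) {
                double x = (i + 0.5) / N, y = (j + 0.5) / N, z = (k + 0.5) / N;
                size_t idx = ((size_t)k * N + j) * N + i;
                int left = (x + y + z < 1.5);     // which side of the membrane
                rho[idx] = left ? 1.5 : 1.0;      // density jump
                p[idx]   = left ? 1.0 : 0.2;      // pressure jump
                /* all velocity components start at zero */
            }
}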

2) Spherical Sedov-Taylor Blast Wave Test
The Sedov blast test is a comprehensive test of the WENO scheme, as a typical high Mach number problem [2, 3]. It is an intense explosion caused by a point of energy deposited at the center of the simulation box. The explosion develops a spherical blast wave that propagates outward along the radial direction. The shock front propagates according to

r_s(t) = \xi_0 \left(\frac{E_0 t^2}{\rho_0}\right)^{1/5},   (4)

where \xi_0 = 1.15 for an ideal gas with polytropic index \gamma = 5/3. In our calculations, the initial density is \rho_0 = 1 and the initial pressure is set to a negligibly small value; the explosion energy E_0 is injected at the center of the box. The test is performed on a uniform grid. From Figure 8, we can see that the numerical solution captures the shock well, though there is some scatter in the points because of the Cartesian coordinates.

FIGURE 8. DENSITY ALONG THE DISTANCE FROM THE ENERGY POINT OF 3D REGION FOR THE SEDOV BLAST WAVE TEST AT T=2.5.

V. PERFORMANCE EVALUATION

1) GPU testing machine
In order to evaluate the performance of our implementation, we conducted experiments on different experimental platforms. The platform for the mono-GPU tests consists of a host with a quad-core AMD Phenom(tm) 9850 processor at a clock rate of 2.5 GHz and 8 GB of main memory, one NVIDIA Tesla C2075 with the Fermi architecture, and one NVIDIA Tesla K20m with the Kepler architecture. The Tesla C2075 makes 6 GB of memory available, with 14 SMs; the Tesla K20m makes 5 GB of memory available, with 13 SMX units [16]. The GPU cards are connected to the CPU cores by PCI-Express 2.0. We used the Intel Fortran compiler as the basic compiler in our tests and OpenMPI 1.3.3 for the MPI communication routines. In the multi-CPU/GPU tests, each MPI process was bound to one CPU core and attached to a single Tesla C2075 GPU card, so we counted one CPU core plus one GPU card as one CPU/GPU computing unit.
2) The experimental results
The total computation time can be divided into three parts: the numerical operation time, the MPI communication time, and the CPU-GPU data copy time. The numerical operation time includes the time integration with the RK method on the CPU and the finite-difference operations based on the WENO scheme on the GPU. Table 1 contains the execution time profile of one step of solving the Euler equation on a single core. The execution time of one step is fully dominated by the WENO computation, which is what we ported to the GPU.

TABLE 1. EXECUTION TIME OF ONE STEP ON THE SCALE OF 128^3 ON CPU

component        Time(ms)
WENO fx          1703
WENO gy          1869
WENO hz          2038
Runge-Kutta      116
Total            5726

According to the optimization technologies discussed in Section 3.2, we can draw the GPU-WENO code learning curve [17] on the Tesla C2075 (Figure 9). The horizontal axis shows the optimization steps, in order: the initial kernel without any optimizations; removing 'if' and 'for'; global memory coalescing; using texture memory; selecting the optimal block size and compiler arguments; controlling the number of registers; splitting kernels; and using shared memory. The vertical axis shows the speedup of the three main kernels (WENO_fx, WENO_gy, WENO_hz) after completing all optimization steps up to that point. The effect of each optimization step can be measured by the slope of the curve. Clearly, global memory coalescing made the most significant contribution.


FIGURE 9. WENO LEARNING CURVE ON TESLA C2075
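The per-kernel times reported below can be collected, for example, with CUDA events around each launch; this is an illustration of a measurement method rather than a statement of the authors' exact tooling (the paper cites the command-line profiler [18] for code analysis).

float time_kernel_ms(void (*launch)(void))
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    launch();                          // e.g. a wrapper that launches WENO_fx
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);        // wait until the kernel has finished
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}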

In the final version with MPI and CUDA, the CPU and GPU computations execute synchronously, without overlapping in time. Table 2 contains the execution time profile of one step on a pair of one CPU core and one GPU. The parts marked with an asterisk (*) run on the GPU.

TABLE 2. EXECUTION TIME (MS) OF ONE STEP ON THE SCALE OF 128^3 ON A SINGLE CPU/GPU NODE

component                         C2075     K20m
MPI communication                    20       20
Data copy CPU-GPU                   160      160
Data transposition*   fx           3.65     2.61
                      gy           2.42     1.73
                      hz           2.37     1.70
WENO*                 fx            113       67
                      gy             85       56
                      hz             89       56
Runge-Kutta                         116      116
Total                            591.44   421.04
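As a cross-check, the headline WENO speedups discussed below follow directly from Tables 1 and 2:

\frac{1703 + 1869 + 2038}{113 + 85 + 89} = \frac{5610}{287} \approx 19.5 \ \ (\text{C2075}), \qquad
\frac{5610}{67 + 56 + 56} = \frac{5610}{179} \approx 31 \ \ (\text{K20m}).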

Figure 10 shows the speedups obtained by comparing Table 1 with Table 2. We observe that the whole WENO computation executes 19.5 times faster on the C2075 than on the CPU, and 31 times faster on the K20m. In particular, WENO_gy and WENO_hz achieve higher speedups than WENO_fx because of the partially uncoalesced reads in WENO_fx: on the C2075 and K20m, WENO_gy achieves ×22 and ×33, while WENO_hz achieves ×23 and ×36. When the data-copying time is taken into consideration, the speedup of the whole WENO part drops to ×12 and ×17. As the problem size increases, up to 256×256×224 in double precision on a single GPU, the speedup rises on the K20m but only very slightly on the C2075.

FIGURE 10. SPEEDUPS OF WENO COMPUTATION ON GPU IMPLEMENTATIONS COMPARED TO SEQUENTIAL EXECUTION ON CPU ON THE SCALE OF 128^3 (LEGEND: F_SpeedUp, K_SpeedUp; CATEGORIES: weno_Fx, weno_Gy, weno_Hz, weno+data_copy)

Table 3 shows the execution time of ten steps of solving the Euler equation on multiple CPU/GPU units and the efficiency of the MPI parallelization.

TABLE 3. EXECUTION TIME(MS) OF TEN STEPS OF SOLVING THE EULER EQUATION ON MULTI-CPU/GPU ON THE SCALE OF 256×128×256 AND THE EFFICIENCY OF MPI PARALLELIZATION

Number of nodes    CPU       Efficiency_CPU    CPU/GPU
1                  403       -                 70.56
2                  207.1     97.3%             41.35
4                  104.86    96.1%             24.3
8                  52.58     95.8%             15.56

3) Result analysis and future improvements
From the experimental results, we now discuss possible future improvements and the architecture requirements for the machine, in two respects: computation and data transfer. (1) The WENO computation performs better on the K20m with the Kepler architecture than on the C2075 with the Fermi architecture by a factor of about 1.5~2, which is caused by the higher double-precision peak GFLOPS and the much higher maximum number of registers one thread can use without register spilling: 255 on the K20m versus 63 on the C2075. One of the bottlenecks of our code is its high register requirement, which may lead to many local-memory accesses. We used the CUDA profiler to analyze the running code [18] and found that one thread needs about 200 registers, which generates thousands of bytes of register spilling on the C2075 and none on the K20m. The WENO computation uses these registers for the intermediate variables in the calculation of the weights, which directly determine the accuracy and the non-oscillatory behaviour of the WENO scheme. The maximum number of registers per thread on a GPU is therefore very important for the performance of the WENO computation. (2) The data transfer between CPU and GPU takes about 35% of the total time of the whole WENO part on the C2075 and 46% on the K20m. When this time is taken into consideration, the gains we have made in the WENO computation are discounted. For a particular application, we can address this problem without a faster PCI-Express link: if we port the remaining part of our code to the GPU, only the ghost data would still have to be transferred between CPU and GPU.

VI. CONCLUSIONS

A 5th order WENO finite-difference method was implemented for parallel processing with GPUs. At the level of the MPI-based domain decomposition, we subdivided the domain along each of the three axial directions to achieve the smallest amount of ghost data at a given scale and the best scalability. We then ported the WENO computation to the GPU on each CPU/GPU node. The oblique Sod shock tube and the Sedov blast wave were used as computational examples to evaluate effectiveness and efficiency. After a series of optimizations covering the memory access pattern, instruction arrangement, and other tricks tailored to the features of the scheme, the GPU version achieves a 12~19 times speedup against a mono-thread CPU reference, and the computation part is about 19~36 times faster than the Fortran code on the CPU. We analyzed the results on two GPUs with different architectures and discussed future improvements and requirements for GPU cards. We also presented the test results on multiple CPU/GPU units. Our future work is to overlap the time spent on data transfer and MPI communications with computation.

ACKNOWLEDGMENTS
We would like to thank Weishan Zhu and Liang Wang for helping us with professional knowledge about cosmological hydrodynamics. We would also like to thank Weile Jia for advice on optimizations and for updates on GPU cards. The work of the first four authors is supported by the National Basic Research Program of China 2010CB832702; NSF of China 61202054, 10972215 and 60873113; Knowledge Innovation Program of CAS CNIC_ZR_201202; and 863 Program 2010AA012301 and 2010AA012402.

REFERENCES
[1] Long-Long Feng, Chi-Wang Shu, Mengping Zhang, 2004. A hybrid cosmological hydrodynamic/N-body code based on a weighted essentially non-oscillatory scheme. The Astrophysical Journal (September 2004).
[2] Anderson, John D. Jr. Fundamentals of Aerodynamics (3rd ed., January 2001).
[3] Robert W. F., Alan T. M. Introduction to Fluid Mechanics, Fourth Edition.
[4] Oscar A., Ben M., Joachim S., et al., 2007. Fundamental differences between SPH and grid methods. Monthly Notices of the Royal Astronomical Society (2007), 963-978.
[5] Athanasios S. A., Konstantinos I. K., Eleftherios D. P., John A. E., 2010. Acceleration of a Finite-Difference WENO Scheme for Large-Scale Simulations on Many-Core Architectures. The American Institute of Aeronautics and Astronautics (2010).
[6] Michael G., Peter Z. A multi-GPU accelerated solver for the three-dimensional two-phase incompressible Navier-Stokes equations. Computer Science - Research and Development (2010), 65-73.
[7] Liu Mingqin, W. L. Wei, Lv Bin, X. J. Zhao, Sh. Li, 2011. Simulation for 2D flows in a rectangular meandering channel. International Symposium on Water Resource and Environmental Protection (2011).
[8] Tobias B., Graham P., 2008. Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware. The American Institute of Aeronautics and Astronautics (2008).
[9] Paulius M. 3D finite difference computation on GPUs using CUDA. Architectural Support for Programming Languages and Operating Systems (2009), 79-84.
[10] Chi-Wang S. Total Variation Diminishing Time Discretizations. SIAM Journal on Scientific and Statistical Computing (1988).
[11] Jiang, G.S. and Shu, C.W., 1996. Efficient Implementation of Weighted ENO Schemes. J. Computational Physics (1996), 202-208.
[12] Balsara, D.S. and Shu, C.W., 2000. Monotonicity Preserving Weighted Essentially Non-oscillatory Schemes with Increasingly High Order of Accuracy. J. Computational Physics (2000), 405-452.
[13] Jairo P., Thiago T., Paulo R. P. de Souza Filho, et al. Accelerating Kirchhoff Migration by CPU and GPU Cooperation. Symposium on Computer Architecture and High Performance Computing (2009), 26-32.
[14] Paulius M., 2010. Analysis-Driven Optimization. SC10, ACM, 2010.
[15] CUDA C Programming Guide. PG-02829-001_v5.0 (October 2012).
[16] NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110 (v1.0, 2012).
[17] http://www.physics.mcmaster.ca/~taskere/codecomparison/codecomparison_intro.html
[18] Compute Command Line Profiler User Guide. DU-05982-001_v03 (November 2011).
