
Chapter 10

Simulation of 1D Condensing Flows with CESE Method on GPU Cluster Wei Ran, Wan Cheng, Fenghua Qin and Xisheng Luo

Abstract We realized the space-time Conservation Element and Solution Element (CESE) method on a GPU and applied it to a condensation problem in a 1D shock tube of infinite length. In the present work, the CESE method has been implemented successfully on a 9800GT graphics card with the overlapping scheme. The condensation problem in the 1D infinite shock tube was then investigated using this scheme, for which the speedup is 71× (9800GT versus E7300). The influence of different meshes on the asymptotic solution in an infinite shock tube with condensation was studied using a single GPU and a GPU cluster. It is found that the asymptotic solution is trustworthy and mesh-insensitive when the grid is fine enough to resolve the condensation process. It is worth mentioning that the peak computing performance reaches 0.88 TFLOPS when a GPU cluster with 8 GPUs is employed. Keywords GPU cluster · CESE method · Shock tube · Condensation

10.1 Introduction

Historically, CFD has developed rapidly thanks to the swift increase of CPU performance and the steady reduction of hardware prices. However, because of physical limits, CPU performance can no longer be improved simply by raising the clock frequency (Kish 2002). To increase the computational ability of the CPU, large and expensive caches are integrated and multi-core systems are employed (Geer 2005). Small-scale CFD problems can be handled by a PC with a multi-core, shared-memory CPU. For a large-scale problem, however, a PC with a few cores cannot offer enough computational capability, and a cluster with many CPU cores is needed.


Nevertheless, the memory bottleneck, which appears in the form of bandwidth limitation and fetching latency, restricts the performance of many-core systems. Meanwhile, the graphics processing unit (GPU), which has recently turned into a general-purpose programmable device, was the first to abandon expensive caches and to combat latency with massive parallelism instead (Kirk and Hwu 2010; NVIDIA 2009). The GPU is now widely used in many fields of scientific computing. Preis et al. implemented a GPU-accelerated Monte Carlo simulation of the 2D and 3D Ising model (Preis et al. 2009). Klöckner et al. used CUDA to accelerate discontinuous Galerkin methods (Klöckner et al. 2009). Corrigan et al. realized CFD solvers on GPUs using unstructured grids (Corrigan et al. 2009). Here we present the implementation of the space-time Conservation Element and Solution Element (CESE) method (Chang 1995) on a GPU and on a GPU cluster with CUDA. The space-time CESE method, originally proposed by Chang (1995), has many unique features compared with conventional methods such as finite volume, finite difference and finite element methods. It offers a new approach to solving conservation laws such as the Navier-Stokes or Euler equations, and it can obtain highly accurate numerical solutions for flow problems involving discontinuities (e.g. shocks and contact surfaces), vortices, acoustic waves, boundary layers and chemical reactions (Yu et al. 2009; Cheng et al. 2010). To accurately calculate a complex flow field, a fine mesh with an enormous number of grid points is normally employed, which demands a large amount of computing resources. Such a demand arises not only in 2D or 3D problems but also in some 1D problems, for example the simulation of the asymptotic behavior in an infinite shock tube with homogeneous condensation (Cheng et al. 2010). It is expected that the GPU-accelerated CESE method developed here can greatly reduce the computing time and therefore meet the demand for computing resources in these problems. The rest of this paper is organized as follows: Sect. 10.2 gives an overview of the CESE method. The implementation of the CESE method on a GPU and on a GPU cluster is described in Sect. 10.3. Section 10.4 presents the simulation results and the computational performance. Finally, conclusions are summarized in Sect. 10.5.

10.2 Overview of the Method

According to Yu and Chang (1997), an explicit treatment of stiff source terms is easily achieved within the CESE framework. The governing equation differs slightly from the Euler equations in that a source term vector is appended, so that it can be expressed as

$$
\frac{\partial U}{\partial t} + \frac{\partial F}{\partial x} = S.
$$


$U$, $F$ and $S$ are:

$$
U=\begin{pmatrix}\rho\\ \rho u\\ \rho E\\ \rho g\\ \rho Q_2\\ \rho Q_1\\ \rho Q_0\end{pmatrix};\quad
F=\begin{pmatrix}\rho u\\ \rho u^2+p\\ (\rho E+p)\,u\\ \rho g u\\ \rho Q_2 u\\ \rho Q_1 u\\ \rho Q_0 u\end{pmatrix};\quad
S=\begin{pmatrix}0\\ 0\\ 0\\ 4\pi\rho_l\left(J r_c^3 + 3\rho Q_2\,\dfrac{\mathrm{d}r}{\mathrm{d}t}\right)/3\\[4pt] J r_c^2 + 2\rho Q_1\,\dfrac{\mathrm{d}r}{\mathrm{d}t}\\[4pt] J r_c + \rho Q_0\,\dfrac{\mathrm{d}r}{\mathrm{d}t}\\[4pt] J\end{pmatrix},
$$

where $Q_0$, $Q_1$ and $Q_2$ stand for a finite number of moments of the droplet size distribution function which describe the conservation of the liquid phase, $g$ is the liquid mass fraction, $\rho_l$ is the density of liquid water, $J$ is the nucleation rate, $r_c$ is the critical radius and $\mathrm{d}r/\mathrm{d}t$ is the averaged droplet growth/shrinkage rate. Here we use the liquid mass fraction $g$, which is directly related to the third-order moment $Q_3$ by $g = 4\pi\rho_l Q_3/3$ (Luo et al. 2007). With the explicit treatment of source terms, the governing equation is first divided into a homogeneous and an inhomogeneous part (Yu and Chang 1997):

$$
\frac{\partial U^{\mathrm{hom}}}{\partial t} + \frac{\partial F}{\partial x} = 0,\qquad
\frac{\partial U}{\partial t} = S\!\left(U^{\mathrm{hom}}\right),
$$

where $U^{\mathrm{hom}}$ is the solution of the homogeneous part. The homogeneous part is an Euler-type equation which can be solved by the CESE method. The governing equation solved by the CESE method can be written in the differential form

$$
u_{mt} + f_{mx} = 0.
$$

Defining the Jacobian

$$
A = \frac{\partial f_m}{\partial u_k},
$$

the intermediate variable $f_m$ and its time derivative $f_{mt}$ at time level $n$ and point $j$ are given by

$$
(f_m)_j^n = A_j^n\,(u_m)_j^n,\qquad (f_{mt})_j^n = A_j^n\,(u_{mt})_j^n.
$$

The $a$-$\alpha$ scheme of the CESE method is

$$
(u_m)_j^n = \frac{1}{2}\left[(u_m)_{j-1/2}^{n-1/2} + (u_m)_{j+1/2}^{n-1/2} + (c_m)_{j-1/2}^{n-1/2} - (c_m)_{j+1/2}^{n-1/2}\right],
$$

$$
(u_{mx})_j^n = \frac{\left|v_{xr}{}_j^n\right|^{\alpha} v_{xl}{}_j^n + \left|v_{xl}{}_j^n\right|^{\alpha} v_{xr}{}_j^n}{\left|v_{xr}{}_j^n\right|^{\alpha} + \left|v_{xl}{}_j^n\right|^{\alpha}},
$$


where

$$
(c_m)_j^n = \frac{\mathrm{d}x}{4}\,(u_{mx})_j^n + \frac{\mathrm{d}t}{\mathrm{d}x}\,(f_m)_j^n + \frac{(\mathrm{d}t)^2}{4\,\mathrm{d}x}\,(f_{mt})_j^n,
$$

$$
v_{xl}{}_j^n = -\,\frac{(u_m)_{j-1/2}^{n-1/2} - (u_m)_j^n + (\mathrm{d}t/2)\,(u_{mt})_{j-1/2}^{n-1/2}}{\mathrm{d}x/2},
$$

$$
v_{xr}{}_j^n = +\,\frac{(u_m)_{j+1/2}^{n-1/2} - (u_m)_j^n + (\mathrm{d}t/2)\,(u_{mt})_{j+1/2}^{n-1/2}}{\mathrm{d}x/2},
$$

with $\mathrm{d}t$ and $\mathrm{d}x$ being the time step and the space step, respectively. $(c_m)_j^n$ is an intermediate variable, and $v_{xl}{}_j^n$ and $v_{xr}{}_j^n$ are the left and right gradients of $(u_m)_j^n$, respectively. The explicit treatment of the source terms is simple, and its computation depends only on local data. The physical model is given in Smolders (1992).
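To make the update rule concrete, the following is a minimal sketch of one $a$-$\alpha$ step for a single conserved scalar with a constant Jacobian $a$ (linear advection, $f = a u$), written as host-side C/CUDA code. It is not the authors' implementation; the function name, the array layout (old-level values stored at consecutive half-integer mesh points) and the zero-gradient convention when both weights vanish are illustrative assumptions.

#include <math.h>

/* One a-alpha CESE step for a scalar with constant Jacobian a (f = a*u).
   u_old/ux_old hold n values at the old half-time level; n-1 new values of
   u and u_x are produced.  Illustrative sketch only, not the authors' code. */
void cese_a_alpha_step(const double *u_old, const double *ux_old,
                       double *u_new, double *ux_new,
                       int n, double a, double dx, double dt, double alpha)
{
    for (int j = 0; j + 1 < n; ++j) {
        double uL  = u_old[j],   uR  = u_old[j + 1];   /* (u_m) at j-1/2, j+1/2  */
        double uxL = ux_old[j],  uxR = ux_old[j + 1];  /* (u_mx) at j-1/2, j+1/2 */

        /* For f = a*u: f = a*u, u_t = -f_x = -a*u_x, f_t = a*u_t = -a*a*u_x. */
        double fL  = a * uL,        fR  = a * uR;
        double ftL = -a * a * uxL,  ftR = -a * a * uxR;
        double utL = -a * uxL,      utR = -a * uxR;

        /* Intermediate variable (c_m) at the two old-level points. */
        double cL = 0.25 * dx * uxL + (dt / dx) * fL + dt * dt / (4.0 * dx) * ftL;
        double cR = 0.25 * dx * uxR + (dt / dx) * fR + dt * dt / (4.0 * dx) * ftR;

        /* New value (u_m)^n_j. */
        double u = 0.5 * (uL + uR + cL - cR);

        /* One-sided gradients and their weighted average (u_mx)^n_j. */
        double vxl = -(uL - u + 0.5 * dt * utL) / (0.5 * dx);
        double vxr =  (uR - u + 0.5 * dt * utR) / (0.5 * dx);
        double wl = pow(fabs(vxr), alpha), wr = pow(fabs(vxl), alpha);

        u_new[j]  = u;
        ux_new[j] = (wl + wr > 0.0) ? (wl * vxl + wr * vxr) / (wl + wr) : 0.0;
    }
}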

10.3 Implementation

As described above, the simple data dependency of the difference scheme makes the CESE method easy to parallelize. Consider a problem with a total number of grid points $NG$; $NG$ also represents the total number of threads in a Grid. Let $NT$ denote the number of Threads in a Block, with $NT = 2^T$, $0 < T < 9$ (usually $T = 7$); then the number of Blocks $NB$ is determined by $NB = NG/NT$.
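As an illustration of this decomposition (the kernel name, its arguments and the empty kernel body are assumptions, not the authors' interface), the mapping of $NG$, $NT$ and $NB$ onto a CUDA launch can look as follows.

#include <cuda_runtime.h>

/* Placeholder kernel: each of the NG threads would update one grid point. */
__global__ void cese_kernel(float *u, float *ux, int ng)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < ng) { /* ... CESE update of point i ... */ }
}

int main(void)
{
    const int T  = 7;
    const int NT = 1 << T;      /* Threads per Block, NT = 2^T = 128        */
    const int NG = 1 << 14;     /* grid points = total threads in the Grid  */
    const int NB = NG / NT;     /* number of Blocks                         */

    float *d_u, *d_ux;
    cudaMalloc(&d_u,  NG * sizeof(float));
    cudaMalloc(&d_ux, NG * sizeof(float));

    cese_kernel<<<NB, NT>>>(d_u, d_ux, NG);
    cudaDeviceSynchronize();

    cudaFree(d_u);
    cudaFree(d_ux);
    return 0;
}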

10.3.1 The Overlapping Scheme

The number of threads in a Block in the overlapping scheme is one more than the number of grid points handled by that Block. The method is based on a principle similar to a cache: a cache reads more nearby data from memory than the program actually needs. As shown in Fig. 10.1 and Algorithm 1, the total number of threads is $NT + 1$, which means that Shared Memory reads $NT + 1$ vectors of $(u_m)_j^{n-1/2}$ and $(u_{mx})_j^{n-1/2}$ from Device Memory. Then $NT + 1$ values of $(f_m)_j^{n-1/2}$, $(f_{mx})_j^{n-1/2}$ and $(c_m)_j^{n-1/2}$ are calculated. With these $NT + 1$ values of $(u_m)_j^{n-1/2}$ and $(c_m)_j^{n-1/2}$, $NT$ values of $(u_m)_j^n$ and $(u_{mx})_j^n$ can be calculated. This scheme apparently avoids communication between Blocks; in fact, the communication is finished before the calculation, because the overlapping points are shared by two adjacent Blocks. The source code of the kernel function is listed in Appendix A.
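The kernel in Appendix A is written for the full system of equations; the sketch below only illustrates, for a single scalar component with constant Jacobian $a$, how the overlapping load into Shared Memory can be organized. The variable names, the single-component layout, the boundary handling and the launch with $NT + 1$ threads per Block are illustrative assumptions, not the authors' code.

#include <cuda_runtime.h>

#define NT 128   /* grid points produced per Block; NT + 1 values are loaded */

/* Overlapping CESE step for one scalar (f = a*u).  Launch with blockDim.x =
   NT + 1 so that each Block loads one extra old-level value shared with its
   right neighbour; no inter-Block communication is needed afterwards. */
__global__ void cese_overlap_step(const float *u_old, const float *ux_old,
                                  float *u_new, float *ux_new, int n_old,
                                  float a, float dx, float dt, float alpha)
{
    __shared__ float su[NT + 1], sux[NT + 1];

    int tid  = threadIdx.x;              /* 0 .. NT                        */
    int gidx = blockIdx.x * NT + tid;    /* adjacent Blocks overlap by one */

    if (gidx < n_old) {                  /* cooperative load of NT+1 values */
        su[tid]  = u_old[gidx];
        sux[tid] = ux_old[gidx];
    }
    __syncthreads();

    if (tid < NT && gidx + 1 < n_old) {  /* only NT threads write results   */
        float uL  = su[tid],   uR  = su[tid + 1];
        float uxL = sux[tid],  uxR = sux[tid + 1];

        float cL = 0.25f * dx * uxL + (dt / dx) * a * uL
                 - dt * dt / (4.0f * dx) * a * a * uxL;
        float cR = 0.25f * dx * uxR + (dt / dx) * a * uR
                 - dt * dt / (4.0f * dx) * a * a * uxR;
        float u  = 0.5f * (uL + uR + cL - cR);

        float vxl = -(uL - u - 0.5f * dt * a * uxL) / (0.5f * dx);
        float vxr =  (uR - u - 0.5f * dt * a * uxR) / (0.5f * dx);
        float wl  = powf(fabsf(vxr), alpha), wr = powf(fabsf(vxl), alpha);

        u_new[gidx]  = u;
        ux_new[gidx] = (wl + wr > 0.0f) ? (wl * vxl + wr * vxr) / (wl + wr) : 0.0f;
    }
}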


Fig. 10.1 Main computation procedures of the overlapping scheme for the CESE Method

10.3.2 Implementation on GPU Cluster

For a large-scale problem, the computational ability of a single GPU is not enough, so a GPU cluster is considered. The implementation of the CESE method on a GPU cluster is also achieved in our study by using MPI. As depicted in Fig. 10.2, in the GPU cluster application MPI threads are used to start and control the Devices. The calculation on each Device is the same as in the single-GPU application; the difference is the communication between Devices, which must be handled to ensure the correctness of the results. In a GPU cluster the bandwidth between Devices is relatively low: the bandwidth within a Device is above 100 GB/s (with coalesced accesses) and 8 GB/s between Device and Host (PCIe x16), while between Devices it is 10 GB/s (through host RAM) or 1.25 GB/s (InfiniBand). The bottleneck is obviously the communication between Devices, so the frequency of this communication must be limited. Two methods are employed to reduce the traffic through the low-bandwidth channel. The first is packing and unpacking the data on the Device. With this method, packing and unpacking the data takes less time than doing it on the Host, because


the bandwidth on the Device is much higher than on the Host. This method also reduces the frequency of communication between Device and Host. The second method is to set a large buffer in order to reduce the frequency of communication between Devices.
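A minimal sketch of the first method (packing the halo data on the Device before the MPI exchange) is given below. The kernel, the halo width, the use of MPI_Sendrecv and the assumption that every MPI process drives exactly one Device are illustrative choices, not the authors' implementation.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Pack the boundary cells that the neighbouring Device needs into a
   contiguous buffer while still on the Device (fast on-Device bandwidth). */
__global__ void pack_boundary(const float *u, float *sendbuf, int offset, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        sendbuf[i] = u[offset + i];
}

/* Exchange the packed halo with the right-hand neighbour through the Host.
   rank/size come from MPI; d_u is the Device solution array. Illustrative. */
void exchange_right_halo(float *d_u, int n_local, int halo,
                         int rank, int size, MPI_Comm comm)
{
    float *d_send;
    cudaMalloc(&d_send, halo * sizeof(float));

    /* 1. Pack on the Device. */
    pack_boundary<<<(halo + 127) / 128, 128>>>(d_u, d_send, n_local - halo, halo);

    /* 2. Copy the small packed buffer to the Host (PCIe). */
    float *h_send = (float *)malloc(halo * sizeof(float));
    float *h_recv = (float *)malloc(halo * sizeof(float));
    cudaMemcpy(h_send, d_send, halo * sizeof(float), cudaMemcpyDeviceToHost);

    /* 3. Exchange with the neighbouring MPI process (one process per Device). */
    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    MPI_Sendrecv(h_send, halo, MPI_FLOAT, right, 0,
                 h_recv, halo, MPI_FLOAT, left, 0, comm, MPI_STATUS_IGNORE);

    /* 4. Copy the received halo back to the Device (unpacking mirrors step 1). */
    cudaMemcpy(d_u, h_recv, halo * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_send);
    free(h_send);
    free(h_recv);
}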

Algorithm 1 Algorithm of the overlapping scheme for the CESE method

As implied above, the overlapping scheme attains higher computing performance. Here we use the multiple-thread overlapping scheme to set up the buffer.

Fig. 10.2 MPI based CUDA, one MPI thread controls one CUDA Device


Fig. 10.3 Multiple-thread overlap: the number of effective points decreases as time steps accumulate

Fig. 10.4 The shock tube with both ends open. The tube is infinitely long and is divided into the HPS and the LPS by the diaphragm D

With this method, if the buffer size is doubled, the frequency of communication between Devices is halved. However, the multiple-thread overlap reduces the number of effective points, as shown in Fig. 10.3. Thus, the balance between the loss of effective points and the reduction of the communication frequency should be considered carefully. In our tests the best choice is a buffer size of 1/8 of the total data size.

10.4 Example and Results

A condensation problem in a shock tube of infinite length is simulated in this section. Initially the shock tube is divided into a high pressure section (HPS) and a low pressure section (LPS) of equal length by a diaphragm D, as shown in Fig. 10.4. The initial pressure is 1.0 bar in the HPS and 0.3 bar in the LPS. The initial temperature of both the HPS and the LPS is 295 K. The gas in the tube is humid nitrogen with an initial saturation ratio of 0.8. The initial mesh size is dx = 0.05 mm and the time step is dt = 0.2 µs. The problem of condensation in an infinite shock tube has an asymptotic solution after an infinitely long time (Cheng et al. 2010). Here, the GPU is applied to verify that the numerical solution implied in Smolders (1992) is mesh-independent. The overlapping scheme is chosen for the following computations because it is the fastest. The calculation method is to double the space step and the time step whenever the shock arrives at the end of the tube, at which point the length of the tube is also doubled (Cheng et al. 2010). With this method of doubling dx and dt, 5 different grids ranging from $2^{12}$ to $2^{16}$ points are employed in the calculation. A global view of the results is presented in Fig. 10.5a. As depicted in Fig. 10.5b, all the results are nearly the same at 20 years and approach the theoretical solution.
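The doubling step itself is simple; the following host-side sketch shows one way to coarsen the solution array when the tube length, dx and dt are doubled. The averaging of adjacent cells and the function name are our own illustrative assumptions; filling the newly added half of the doubled tube with the undisturbed end states is omitted.

/* Coarsen the solution when the tube is doubled: merge adjacent cell pairs
   (simple averaging assumed) and double the space and time steps.  The newly
   added portion of the doubled tube is not filled here.  Illustrative only. */
void coarsen_for_doubling(float *u, int n, float *dx, float *dt)
{
    for (int i = 0; i < n / 2; ++i)
        u[i] = 0.5f * (u[2 * i] + u[2 * i + 1]);
    *dx *= 2.0f;    /* doubled space step */
    *dt *= 2.0f;    /* doubled time step (CFL number unchanged) */
}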


Fig. 10.5 Condensation problem in an infinite shock tube at t = 20 years (single card). a Density evolution, Grid $(NG) = 2^{14}$. b Asymptotic solution with different grids at the time of 20 years


Fig. 10.6 Condensation problem in an infinite shock tube at t = 20 years, Grid $(NG) = 2^{17}$ (multi-card). a Density evolution. b Asymptotic solution with different grids at the time of 20 years

space step’s length have little effect on the final results at the same evolution time. So we put forward that the theoretical solution is mesh-independence. The acceleration ratio of single card computing is 71. The GPU used is the NVIDIA graphics card 9800GT and the CPU is Intel CPU Core Dual E7300. The peak value of GPU is about 75 GFLOPS. Both the CPU and GPU’s key parameter are listed in Appendix B. For a large scale problem, computational ability of single GPU is not enough. Thus GPU cluster is considered. Implementation of CESE method on GPU cluster is also achieved in our study. Here we present the results of condensation problem in an infinite shock tube which is computed on GPU cluster. Figure 10.6a shows the global view of the density evolution at t = 20 years. Figure 10.6b illustrates the platform in the asymptotic solution.


For this case, the peak performance of the GPU cluster is 0.88 TFLOPS (using 8 Tesla C1060 cards), and the time spent on communication between Devices is less than 0.1 %.

10.5 Conclusion

We presented a GPU-accelerated version of the CESE method with an explicit treatment of source terms. The approach of GPU acceleration with CUDA was applied, and a good gain in computational performance was obtained with an old-fashioned 9800GT graphics card that only supports single-precision floating-point numbers. To optimize the code performance, Shared Memory is employed: the overlapping scheme performs its computations in Shared Memory. The principle of the overlapping scheme can be applied to other, similar algorithms. The scheme also achieved a good speedup of 71× for the problem of condensation in a shock tube. We have also implemented the CESE method on a GPU cluster and reached a peak performance of 0.88 TFLOPS (using 8 Tesla C1060 cards). With the simulation results of the single GPU and the GPU cluster, we showed that the asymptotic solution of the condensation problem in an infinite shock tube is mesh-independent. In future work, we will use the GPU and CUDA to solve 2D and 3D problems with the CESE method.

Acknowledgments This research was carried out with the support of the National Natural Science Foundation of China under grant 10972214.


Appendix A

Here we list the code of the kernel function:



Appendix B

Table B.1 Key facts of the Intel Core 2 Duo E7300:

Number of cores: 2
L2 cache: 2 MB
Clock rate: 2.66 GHz

Table B.2 Key facts of the NVIDIA GeForce 9800GT graphics card:

Number of streaming multiprocessors: 14
Number of streaming processors: 112
Global memory: 1 GB
Shared memory per multiprocessor: 16 KB
Register memory per multiprocessor: 8192 * 4 B
Clock rate: 1.50 GHz
Compute capability: 1.1

Table B.3 Key facts of the NVIDIA Tesla C1060 computing card:

Number of streaming multiprocessors: 30
Number of streaming processors: 240
Global memory: 4 GB
Shared memory per multiprocessor: 16 KB
Register memory per multiprocessor: 16384 * 4 B
Clock rate: 1.296 GHz
Compute capability: 1.3

References

Chang SC (1995) The method of space-time conservation element and solution element: a new approach for solving the Navier-Stokes and Euler equations. J Comput Phys 119:295–324
Cheng W, Luo X, Yang J, Wang G (2010) Numerical analysis of homogeneous condensation in rarefaction wave in a shock tube by the space-time CESE method. Comput Fluids 39:294–300
Cheng W, Luo X, van Dongen MEH (2010) On condensation-induced waves. J Fluid Mech 651:145–164
Corrigan A, Camelli F, Löhner R, Wallin J (2009) Running unstructured grid based CFD solvers on modern graphics hardware. In: 19th AIAA computational fluid dynamics conference. American Institute of Aeronautics and Astronautics, San Antonio, Texas, USA
Geer D (2005) Chip makers turn to multicore processors. Computer 38(5):11–13
Kirk D, Hwu W (2010) Programming massively parallel processors: a hands-on approach. Morgan Kaufmann Publishers
Kish L (2002) End of Moore's law: thermal (noise) death of integration in micro and nano electronics. Phys Lett A 305(3–4):144–149
Klöckner A, Warburton T, Bridge J, Hesthaven JS (2009) Nodal discontinuous Galerkin methods on graphics processors. J Comput Phys 228:7863–7882
Luo X, Wang M, Yang J, Wang G (2007) The space-time CESE method applied to phase transition of water vapor in compressible flows. Comput Fluids 36:1247–1258
NVIDIA Corporation (2009) NVIDIA CUDA programming guide, version 2.3.1
Preis T, Virnau P, Paul W, Schneider JJ (2009) GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model. J Comput Phys 228:4468–4477
Smolders HJ (1992) Non-linear wave phenomena in a gas-vapour mixture with phase transition. Ph.D. thesis, Eindhoven University of Technology, Eindhoven


Yu ST, Chang SC (1997) Treatments of stiff source terms in conservation laws by the method of space-time conservation element/solution element. AIAA Paper 97-0435
Yu S-TJ, Yang L, Lowe RL, Bechtel SE (2009) Numerical simulation of linear and nonlinear waves in hypoelastic solids by the CESE method. Wave Motion 47:168–182
