Seismic Wave Propagation Simulation Using Support Operator Method on multi-GPU system Shenyi Song1,2 , Tingxing Dong1,2 , Yichen Zhou1,3 , David A. Yuen1,4 , and Zhonghua Lu2 1

1 Minnesota Supercomputing Institute, University of Minnesota
2 Computer Network Information Center, Chinese Academy of Sciences
3 Department of Computer Science, University of Minnesota
4 Department of Geology & Geophysics, University of Minnesota

Abstract The Support Operator Method (SOM) is a numerical method based on the finite difference method. We use SOM to simulate seismic wave propagation by solving the three-dimensional viscoelastic equations. The Support Operator Rupture Dynamics (SORD) code has been shown to be highly scalable in large-scale multi-processor computing. This paper discusses accelerating SORD on a multi-GPU system using NVIDIA CUDA C and MPI. Compared to the original version on a multi-CPU system, we achieve a maximum 15.0X speedup.

1 Introduction

1.1 The method of Support Operator

The method of Support Operator is a generalized finite difference method introduced by Samarskii et al. (1981) and Shashkov (1996). SOM is a general scheme for discretizing the differential form of partial differential equations. The Support Operator Rupture Dynamics (SORD) code, an application of this method to the simulation of earthquake rupture dynamics, was developed by Ely et al. (2008). It uses a single-precision floating-point implementation of SOM. SORD can be used to investigate idealized wave propagation and rupture dynamics problems and to simulate potential future earthquakes with realistic fault and basin models; one example is the simulation of Mw 7.6 earthquake scenarios on the southern San Andreas fault (Ely et al., 2010).


1.2 Solving Partial Differential Equations on GPU

A variety of applications require solving partial differential equations (PDEs), such as the Laplace equation in image denoising, the Poisson equation in image and mesh editing, and the Navier-Stokes equations in fluid simulation. Numerical simulation of PDEs usually demands intensive computation and large amounts of computational resources (Zhao, 2008). Built from multiple SIMD processing units, the GPU has inherent parallelism that is well suited to explicit, lattice-based computations. Solid-earth geophysics remains one of the last bastions to have resisted the use of GPUs, especially in geodynamics. GPU programming for scientific computation became significantly easier with the introduction of the Compute Unified Device Architecture (CUDA) by NVIDIA at the end of 2006, which is relatively easy to learn because its syntax is similar to C (NVIDIA, 2009).

2 Support Operator Rupture Dynamics

2.1 Theoretical Formulation

The governing equations of wave propagation in a 3D, isotropic viscoelastic medium are:

    g_ij = ∂_j (u_i + γ v_i),                  (1)
    σ_ij = λ δ_ij g_kk + μ (g_ij + g_ji),      (2)
    a_i  = (1/ρ) ∂_j σ_ij,                     (3)
    v̇_i  = a_i,                                (4)
    u̇_i  = v_i,                                (5)

where σ is the stress tensor, u and v are the displacement and velocity vectors, ρ is density, λ and μ are elastic moduli, and γ is viscosity.

2.2 Numerical Method

The Finite Difference Method (FDM) is widely used in modeling three-dimensional seismic wave propagation and rupture dynamics problems. We apply the Support Operator Method (SOM); many simple FDMs are special cases of SOM. The approach constructs discrete analogs of continuum derivative operators that satisfy important integral identities, such as the adjoint relation between gradient and divergence. SOM brings to an FDM-type formulation the FEM advantage that energy is conserved in the semi-discrete equations. The scheme is explicit in time and discretized on a hexahedral, logically rectangular mesh. On the mesh we define the space of nodal functions H^N, defined on the hexahedra vertices, and the space of cell functions H^C, defined on the hexahedra volumes. Differencing a variable in H^N yields a variable in H^C, and differencing a variable in H^C yields a variable in H^N, so we define two discrete difference operators (Ely et al., 2009):

    D_i : H^N → H^C   and   D̄_i : H^C → H^N.    (6)

On the nodes we have (ρ, γ, β, u, v, a) ∈ H^N, and on the cells we have (λ, μ, y, σ, g) ∈ H^C. Using the two operators we can obtain a variable in H^N from variables in H^C and vice versa; here D_i is called the natural operator and D̄_i the support operator. In time we adopt a centered difference of second-order accuracy. The discretized difference equations are:

    g_ij = D_j (u_i^n + γ v_i^{n−1/2}),                        (7)
    σ_ij = Λ δ_ij g_kk + M (g_ij + g_ji),                      (8)
    a_i  = R D̄_j σ_ij − Q̄_k y Q_k (u_i^n + β v_i^{n−1/2}),     (9)
    v_i^{n+1/2} = v_i^{n−1/2} + ∆t a_i,                        (10)
    u_i^{n+1}   = u_i^n + ∆t v_i^{n+1/2}.                      (11)

The material variables incorporate the cell volumes V^C and the node volumes V^N:

    Λ = λ / V^C,          (12)
    M = μ / V^C,          (13)
    R = 1 / (ρ V^N).      (14)

Viscous as well as stiffness hourglass control may be used, for which we define the viscosity β and the stiffness

    y = μ(λ + μ) / (6(λ + 2μ)).    (15)

The form we choose for the hourglass stiffness y is based on the approximate analysis of Kosloff and Frazier (1978). Instabilities in the numerical method due to non-uniform stress modes are corrected by the hourglass operators

    Q_i : H^N → H^C   and   Q̄_i : H^C → H^N.    (16)

We use a modified form of the hourglass control scheme described by Flanagan and Belytschko (1981) and more recently by Day et al. (2005) and Ma and Liu (2006).


3 Implementation on GPU using CUDA

3.1 GPU system

We use the GPU cluster at SCCAS (Supercomputing Center, Chinese Academy of Sciences). This cluster has 90 computing nodes, including 18 AMD Phenom 9850 CPUs, 72 Intel Xeon E5410 CPUs and 192 NVIDIA Tesla C1060 GPUs; each GPU contains 240 stream processors running at 1.296 GHz and has 4 GB of memory, with a theoretical memory bandwidth of 102 GB/s. DDR 4X InfiniBand cards are used on each node, and the point-to-point bandwidth is 2.0 GB/s. C/C++/FORTRAN code is built with the GNU compilers and CUDA code with CUDA 2.3; the MPI implementation is OpenMPI.

3.2 Algorithm - thread strategy

As a second-order finite difference method on a 3D structured grid, SORD scales well on the GPU. The program flow is shown in Figure 1. SORD is written in FORTRAN 95 and parallelized with MPI. Compared to C, FORTRAN 95 has better dynamic array performance and much simpler array operations, so we use FORTRAN 95 for the CPU part of the program. However, FORTRAN cannot call CUDA kernels directly unless the PGI compiler is used, so we modify SORD to have the FORTRAN code call C subroutines, which in turn call the CUDA kernels. The CUDA kernels are basic data operations, such as addition, multiplication, and the difference operators. About 60 percent of the kernel code implements the four operators D_i, D̄_i, Q_i, Q̄_i, and these four operators take more than 80 percent of the computing runtime. The resulting expressions for D_i and D̄_x are rather complex and are not available elsewhere. If elements are restricted in shape to rectangular parallelepipeds, the operators simplify to:

    (D_x F)_000 = (1/4)(Z_1 − Z_0)(Y_1 − Y_0)
                  (F_111 + F_100 − F_010 − F_001 − F_000 − F_011 + F_101 + F_110),

    (D̄_x W)_111 = (1/4)(Z_2 − Z_1)[(Y_2 − Y_1)(W_111 − W_011) + (Y_1 − Y_0)(W_101 − W_001)]
                 + (1/4)(Z_1 − Z_0)[(Y_2 − Y_1)(W_110 − W_010) + (Y_1 − Y_0)(W_100 − W_000)].    (17)

We can generate the operator code from Equation (17). We store D_i in a memory array, and D̄_x can be obtained directly from D_i via the adjoint relation. Storing D_i requires more memory, but gives a 65% reduction in runtime for exactly integrated operators and a 50% reduction for one-point quadrature (Day and Bradley, 2001). The kernel that computes D_i is:

[Flowchart: each MPI node sets up the MPI communicator and maps a GPU device, sets up the GPU device and allocates data arrays, reads input data and initializes output files, generates the compute grid, initializes material, PML and source, and copies memory from host to device. Each time step then computes velocity & displacement, stress, and acceleration on the GPU, copies ghost cells device-to-host, swaps them between nodes by MPI, and copies them back host-to-device, looping while timestep < t; finally the output is written.]

Figure 1: Program flowchart of SORD on GPU.

__global__ void kernel_diff_node2cell(
    float *df, float *f, float *b,
    int i, int a,
    int j_start, int k_start, int l_start,
    int j_end, int k_end, int l_end,
    int dimx, int dimy, int dimz,
    int off_1d_f, int off_2d_f, int off_3d_f,
    int off_3d_b, int off_4d_b)
{   // Compute array df from array f; the operator is stored in array b.
    int id = threadIdx.y + blockIdx.x*dimx + blockIdx.y*dimx*dimy;
    if ((threadIdx.y >= j_start && threadIdx.y < j_end) &&
        (blockIdx.x  >= k_start && blockIdx.x  < k_end) &&
        (blockIdx.y  >= l_start && blockIdx.y  < l_end)) {
        df[id] = b[id + 0*off_3d_b] * f[id + 1 + off_1d_f + off_2d_f]
               + b[id + 1*off_3d_b] * f[id + 1]
               + b[id + 2*off_3d_b] * f[id + off_1d_f]
               + b[id + 3*off_3d_b] * f[id + off_2d_f]
               + b[id + 4*off_3d_b] * f[id]
               + b[id + 5*off_3d_b] * f[id + off_1d_f + off_2d_f]
               + b[id + 6*off_3d_b] * f[id + 1 + off_2d_f]
               + b[id + 7*off_3d_b] * f[id + 1 + off_1d_f];
    }
}

3.3 Data structure - memory strategy

We implement three kinds of kernels:

1. Directly access global memory, without shared memory. All data are stored in global memory and registers. The GPU schedules the threads automatically, but each access to global memory takes about 500 cycles.

2. Through shared memory. Data are first copied from global memory into shared memory; a single thread is assigned to each unit cell, and the shared memory is shared by the whole block, so a thread can easily access its neighboring cells. Shared memory has very short access latency, but there is very little of it per streaming multiprocessor, and copying from global memory into shared memory also costs about 500 cycles.

3. Directly access global memory and use texture memory. The texture memory space is read-only and resides in device memory. Since it is cached, a texture fetch costs one read from global memory only on a cache miss; otherwise it costs just one read from the texture cache. We bind some data to texture memory so it can be accessed quickly.

We compared the performance of the kernel that directly accesses global memory with the one using shared memory:

Table 1: Performance of the kernel directly accessing global memory versus the one using shared memory. The time is what one time step takes.

    Data Zone      global memory(ms)  shared memory(ms)  speedup
    61*61*61       78.3               120                0.652
    73*73*73       135                152                0.889
    85*85*85       213                205                1.04
    97*97*97       315                290                1.09
    109*109*109    433                372                1.16
    121*121*121    601                733                0.820
    133*133*133    800                854                0.937

In the 7 cases of Table 1, the kernel using shared memory shows no clear speedup: the gain from shared memory is limited because shared memory is too small and copying data into it takes too much time. So we do not use the shared-memory kernels in this version of the program. Reading device memory through texture fetching, on the other hand, has advantages over reading it directly from global memory. In SORD on CUDA, we store a five-dimensional array that is read-only after it is first computed, so it can be placed in texture memory. Table 2 shows the speedup from texture memory:

Table 2: Speedup from texture memory. The time is what one time step takes.

    Data Zone      without texture memory(ms)  with texture memory(ms)  speedup
    101*101*161    100                         97.8                     1.02
    131*131*161    181                         160                      1.13

3.4 Swap Ghost Cell Data by MPI

The Support Operator Method needs data from neighboring cells, so at the end of every time step each GPU must swap ghost cells with its neighbors. CUDA C provides the subroutine cudaMemcpy(), which copies data between host memory and device memory; we then use MPI to swap the ghost cell data, as Figure 2 shows. However, we found it takes a long time to copy the ghost cells (the 6 boundary surfaces of one hexahedron) separately from device memory to host memory. It costs much more than copying the whole hexahedral array, because the startup overhead of each cudaMemcpy() call is remarkable.

Figure 2: 3D communicator of MPI.

4 Layered Model Test

To test SORD on one GPU, we reproduce the double-couple point source test LOH.1 of Day and Bradley (2001). The model, diagrammed in Figure 3, consists of a 1 km thick surface layer over a 7 km thick underlying layer. In the surface layer Vs = 2,000 m/s, Vp = 4,000 m/s and density ρ = 2,600 kg/m3; in the underlying layer Vs = 3,464 m/s, Vp = 6,000 m/s and density ρ = 2,700 kg/m3. A double-couple point source located at 2 km depth acts for 0.1 s, producing seismic waves that then propagate across the whole space. We run the calculations on a rectangular mesh with node spacing ∆x = 50 m, with grid sizes from 81*81*161 to 241*241*161; the point source is at node (0, 0, 41). The full case is run for 1000 steps with a time step of ∆t = 0.004 s.

The baseline single-thread CPU performance is measured on an Intel Xeon E5410 system running at 2.33 GHz with 8 GB main memory; the CPU code is serial FORTRAN. The CUDA C code runs on a Tesla C1060 with 240 stream processors at 1.296 GHz and 4 GB global memory. For timing, each test runs for 200 time steps, and the reported time is the whole running time divided by 200. Figure 4 shows the simulated seismic wave propagation computed on the GPU. Table 3 lists the CPU and GPU times per step for different grid sizes. The speedup ranges from 9.36X to 12.8X, with the maximum 12.8X at data size 81*81*161. As the data size increases, the execution time grows almost linearly, showing good scaling.

Figure 3: Perspective view of the layer. The layer is 1 km thick; the source is located at 2 km depth.

Table 3: Performance of CPU and GPU for the layered model test. The time is what one time step takes.

    Data Zone      Data size(*10^3)  CPU time(ms)  GPU time(ms)  Speedup
    81*81*161      1056              1530          120           12.8
    101*101*161    1642              2141          171           12.5
    121*121*161    2357              3141          255           12.3
    141*141*161    3201              4209          343           12.3
    161*161*161    4173              4482          409           11.0
    181*181*161    5275              5681          584           9.73
    201*201*161    6505              6514          606           10.7
    221*221*161    7863              7333          718           10.2
    241*241*161    9351              7906          845           9.36

To test multi-GPU performance, we use a larger data size: the data zone is 401*401*401, with other properties the same as LOH.1. Table 4 shows that the multi-GPU speedup is about 15X, from 4 processing units to 64 processing units.

5 Conclusion and future work

We presented an implementation of the Support Operator Method on a multi-GPU system using MPI and CUDA. We optimized the code by using many threads to hide latency and by using texture memory for acceleration. Compared to 64 CPUs, 64 Tesla C1060 GPUs give more than a 14X speedup.

Figure 4: Seismic waves propagating in the layers. Simulated ground velocity at different times: (a) t=0.4s, (b) t=1.0s, (c) t=2.0s, (d) t=3.0s.

Table 4: Performance of multi-CPU and multi-GPU. The data zone is 401^3. The time is what one time step takes.

    CPU size  CPU number  CPU time(ms)  GPU time(ms)  Speedup
    1*2*2     4           24915         1660          15.0
    2*2*2     8           15410         990           15.6
    2*2*4     16          7051          480           14.7
    2*4*4     32          3620          250           14.5
    4*4*4     64          1892          130           14.5


GPU memory access bandwidth is limited, so a highly efficient data structure is needed to reduce accesses to global memory. Data transmission between host and device (CPU and GPU) usually hurts performance, because the bandwidth of the PCI-E bus is much smaller than the GPU memory bandwidth; reducing host-device data transmission is therefore necessary.

References

Day, S. M. and C. R. Bradley (2001). Memory-efficient simulation of anelastic wave propagation. Bulletin of the Seismological Society of America 91, 520–531.

Day, S. M., L. A. Dalguer, N. Lapusta, and Y. Liu (2005). Comparison of finite difference and boundary integral solutions to three-dimensional spontaneous rupture. Journal of Geophysical Research-Solid Earth 110, 23.

Ely, G. P., S. M. Day, and J. B. Minster (2008). A support-operator method for viscoelastic wave modelling in 3-D heterogeneous media. Geophysical Journal International 172, 331–344.

Ely, G. P., S. M. Day, and J. B. Minster (2009). A support-operator method for 3-D rupture dynamics. Geophysical Journal International 177, 1140–1150.

Ely, G. P., S. M. Day, and J. B. Minster (2010). Dynamic rupture models for the southern San Andreas fault. Bulletin of the Seismological Society of America 100, 131–150.

Flanagan, D. P. and T. Belytschko (1981). A uniform strain hexahedron and quadrilateral with orthogonal hourglass control. International Journal for Numerical Methods in Engineering 17, 679–706.

Kosloff, D. and G. A. Frazier (1978). Treatment of hourglass patterns in low order finite-element codes. International Journal for Numerical and Analytical Methods in Geomechanics 2, 57–72.

Ma, S. and P. C. Liu (2006). Modeling of the perfectly matched layer absorbing boundaries and intrinsic attenuation in explicit finite-element methods. Bulletin of the Seismological Society of America 96, 1779–1794.

NVIDIA (2009). NVIDIA CUDA Programming Guide Version 2.3. Santa Clara: NVIDIA Corporation.

Samarskii, A. A., V. F. Tishkin, A. P. Favorskii, and M. Y. Shashkov (1981). Operational finite-difference schemes. Differential Equations 17, 854–862.

Shashkov, M. Y. (1996). Conservative Finite-Difference Methods on General Grids. Boca Raton: CRC Press.

Zhao, Y. (2008). Lattice Boltzmann based PDE solver on the GPU. Visual Computer 24, 323–333.