Optimized OpenCL implementation of the Elastodynamic Finite Integration Technique for viscoelastic media✩

M. Molero-Armenta a, Ursula Iturrarán-Viveros b,∗, S. Aparicio a, M.G. Hernández a

a Instituto de Tecnologías Físicas y de la Información ‘‘Leonardo Torres Quevedo’’ (ITEFI), CSIC, Madrid, Spain
b Facultad de Ciencias, Universidad Nacional Autónoma de México, Circuito Escolar S/N, Coyoacán 04510, México D.F., Mexico
Article history: Received 24 August 2013; Received in revised form 2 April 2014; Accepted 19 May 2014; Available online xxxx

Keywords: EFIT; Kelvin–Voigt; GPUs; PyOpenCL; OpenCL
Abstract

Development of parallel codes that are both scalable and portable for different processor architectures is a challenging task. To overcome this limitation we investigate the acceleration of the Elastodynamic Finite Integration Technique (EFIT) to model 2-D wave propagation in viscoelastic media by using modern parallel computing devices (PCDs), such as multi-core CPUs (central processing units) and GPUs (graphics processing units). For that purpose we choose the industry open standard Open Computing Language (OpenCL) and an open-source toolkit called PyOpenCL. The implementation is platform independent and can be used on AMD or NVIDIA GPUs as well as classical multi-core CPUs. The code is based on the Kelvin–Voigt mechanical model, which has the advantage of not requiring additional field variables. OpenCL performance can, in principle, be improved by reducing global memory access latency through the use of local memory. Our main contribution is the implementation of local memory and an analysis of the performance of local versus global memory on eight different computing devices (including Kepler, one of the fastest and most efficient high-performance computing technologies) with various operating systems. The full implementation of the code is included.

Program summary

Program title: EFIT2D-PyOpenCL
Catalogue identifier: AETF_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AETF_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 38079
No. of bytes in distributed program, including test data, etc.: 2949059
Distribution format: tar.gz
Programming language: Python.
Computer: Computers having a GPU or multi-core CPU with OpenCL drivers.
Operating system: Multi-platform.
Has the code been vectorized or parallelized?: Yes.
RAM: 2 GB
Classification: 6.5.
External routines: Numpy, scipy, matplotlib, glumpy, pyopencl
Nature of problem: Development of parallel codes that are both scalable and portable for different processor architectures is a challenging task. To overcome this limitation we investigate the acceleration of the Elastodynamic Finite Integration Technique (EFIT) to model 2-D wave propagation in viscoelastic media by using modern parallel computing devices (PCDs), such as multi-core CPUs (central processing units) and GPUs (graphics processing units).
✩ This paper and its associated computer program are available via the Computer Physics Communications homepage on ScienceDirect (http://www.sciencedirect.com/science/journal/00104655).
∗ Corresponding author. Tel.: +52 55 56 22 54 11.
E-mail addresses: [email protected] (M. Molero-Armenta), [email protected], [email protected] (U. Iturrarán-Viveros), [email protected] (S. Aparicio), [email protected] (M.G. Hernández).
http://dx.doi.org/10.1016/j.cpc.2014.05.016 0010-4655/© 2014 Elsevier B.V. All rights reserved.
Solution method: We choose the industry open standard Open Computing Language (OpenCL) and an open-source toolkit called PyOpenCL. The implementation is platform independent and can be used on AMD or NVIDIA GPUs as well as classical multi-core CPUs. The code is based on the Kelvin–Voigt mechanical model, which has the advantage of not requiring additional field variables. OpenCL performance can, in principle, be improved by reducing global memory access latency through the use of local memory. Our main contribution is the implementation of local memory and an analysis of the performance of local versus global memory on eight different computing devices (including Kepler, one of the fastest and most efficient high-performance computing technologies) with various operating systems.
Restrictions: Wave propagation simulation only in 2-D scenarios; OpenCL drivers needed.
Running time: This code can run a wave propagation simulation within a few minutes on a typical current computer with a GPU or multi-core CPU.
© 2014 Elsevier B.V. All rights reserved.
1. Introduction

Wave modeling is a valuable tool for seismic interpretation and non-destructive testing, and it is an essential part of inversion algorithms. The unconsolidated nature of the shallow layers of the Earth requires an anelastic stress–strain relation to model the dissipation of the wave field. Some of the most common rheologies used in the literature are described in [1]: the Maxwell (or generalized Maxwell), the Kelvin–Voigt, and the Zener (or generalized Zener) models. The EFIT is a technique that goes back to the early 1990s and is widely used for numerical time-domain modeling of acoustic (AFIT), electromagnetic (FIT, 1980s) and elastodynamic (EFIT) waves. With the advent of new parallel computing devices (PCDs), this 20-year-old method, among others, has the opportunity to progress without introducing revolutionary changes to its fundamental principles.

GPUs originated in the high-end computer gaming market; they offer high computing power for important large-scale applications of computational science and engineering at a very attractive price per flop. Their highly parallel structure makes them more effective than general-purpose CPUs for algorithms that process large blocks of data in parallel. Parallel Finite Difference (FD) algorithms have been implemented either with the Message-Passing Interface (MPI) or with OpenMP, see [2,3]. Although these implementations can significantly reduce the overall computation time for modeling viscoelastic media, building such systems is very expensive. It is a real challenge to develop scientific codes that are not only portable between the very different hardware architectures currently available on the market but will also transparently scale their parallelism in the future. We choose OpenCL since it unifies the process of code development for heterogeneous computing systems by using one programming environment (compiler) to target substantially different processing elements. In addition, PyOpenCL combines a dynamic high-level scripting language with the massive performance of PCDs, giving easy access to OpenCL parallel computation through Python. The advantages of PyOpenCL are that it is a complete, mature application programming interface (API) wrapper, it is easy to program, it requires no explicit compilation step, and it provides automatic error checking. A CUDA (Compute Unified Device Architecture) code for the 2-D acoustic case by [4] was the inspiration for this implementation. Previous work using PyOpenCL to model wave propagation in anisotropic elastic media and in fluid-filled boreholes considers only the use of global memory, see [5,6]. In addition, comparisons between experimental results, analytical models and numerical simulations using our global memory implementation of EFIT can be found in [7,8]. OpenCL performance results for these applications are scarce in the literature. Therefore, the main contributions of the present work are the new PyOpenCL implementation using local memory and the performance analysis obtained when testing local and global memory on different PCDs, including the latest Kepler architecture.

2. Mathematical formulation and discretization

The EFIT is based on the linear elastodynamic equations in Cartesian coordinates. This time-domain numerical scheme is used to simulate wave propagation in heterogeneous materials, see for example [9–13].
The field component to be calculated is placed at the center of an appropriate control volume, and the corresponding equation is integrated over this cell. To implement the Kelvin–Voigt constitutive equation we follow the approach given in [1,14–17]. An equivalent formulation, in which ultrasonic wave propagation in concrete is simulated by modeling the material as an inhomogeneous, dissipative (viscoelastic) isotropic medium, is due to [18]. The 2-D equations of momentum conservation can be expressed as
\rho\,\ddot{u}_i = \frac{\partial \tau_{ij}}{\partial x_j} + f_i, \qquad i, j = \{x, y\},  (1)
where ρ is the density, u_i are the displacement components, τ_ij are the stress components and f_i are the body forces. A dot denotes a time derivative. The stress–strain relations for a Kelvin–Voigt solid are given by
\tau_{ij} = \left(\lambda\theta + \lambda'\dot{\theta}\right)\delta_{ij} + 2\mu\,\varepsilon_{ij} + 2\mu'\,\dot{\varepsilon}_{ij},  (2)
where λ and µ are the Lamé constants, λ′ and µ′ are the corresponding anelastic parameters and δij is Kronecker’s delta. The strains are given by
\varepsilon_{ij} = \frac{1}{2}\left(\frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i}\right),  (3)

\theta = \frac{\partial u_i}{\partial x_i}.  (4)
For a given frequency (say the central frequency of the source ω0 ) the anelastic parameters can be obtained as follows:
\lambda' = \frac{1}{\omega_0}\left(\frac{E}{Q_{P0}} - \frac{2\mu}{Q_{S0}}\right) \quad \text{and} \quad \mu' = \frac{\mu}{\omega_0\, Q_{S0}},  (5)
where Q_P0 and Q_S0 are the quality factors at ω = ω_0, and E and µ are the moduli at ω = 0. The moduli can be obtained from the P- and S-wave phase velocities at ω = ω_0, v_p0(P) and v_p0(S) respectively, and are given by

E = \rho\, v_{p0}^{2}(P)\, g(Q_{P0}) \quad \text{and} \quad \mu = \rho\, v_{p0}^{2}(S)\, g(Q_{S0}),  (6)
where

g(a) = \frac{1}{2\sqrt{1 + \dfrac{1}{a^{2}}}}\left(1 + \frac{1}{\sqrt{1 + \dfrac{1}{a^{2}}}}\right),  (7)
see [14] for further details. Introducing the particle-velocity components v_i = u̇_i, Eq. (1) becomes
\dot{v}_i = \frac{1}{\rho}\left(\frac{\partial \tau_{ij}}{\partial x_j} + f_i\right), \qquad i, j = \{x, y\}.  (8)
Using (3) and (4) the time derivative of the stress–strain relations (2) becomes
\dot{\tau}_{ij} = \left(\lambda\frac{\partial v_i}{\partial x_i} + \lambda'\frac{\partial \dot{v}_i}{\partial x_i}\right)\delta_{ij} + \mu\left(\frac{\partial v_i}{\partial x_j} + \frac{\partial v_j}{\partial x_i}\right) + \mu'\left(\frac{\partial \dot{v}_i}{\partial x_j} + \frac{\partial \dot{v}_j}{\partial x_i}\right).  (9)
Substituting (8) into (9) yields
\dot{\tau}_{ij} = \lambda\frac{\partial v_i}{\partial x_i}\delta_{ij} + \lambda'\frac{\partial}{\partial x_i}\left[\frac{1}{\rho}\left(\frac{\partial \tau_{ij}}{\partial x_j} + f_i\right)\right]\delta_{ij} + \mu\left(\frac{\partial v_i}{\partial x_j} + \frac{\partial v_j}{\partial x_i}\right) + \mu'\left\{\frac{\partial}{\partial x_j}\left[\frac{1}{\rho}\left(\frac{\partial \tau_{im}}{\partial x_m} + f_i\right)\right] + \frac{\partial}{\partial x_i}\left[\frac{1}{\rho}\left(\frac{\partial \tau_{jm}}{\partial x_m} + f_j\right)\right]\right\},  (10)
with m ∈ {x, y}. Let us define the auxiliary variables
\Pi_x = \frac{1}{\rho}\left(\frac{\partial \tau_{xx}}{\partial x} + \frac{\partial \tau_{xy}}{\partial y} + f_x\right),  (11)

\Pi_y = \frac{1}{\rho}\left(\frac{\partial \tau_{xy}}{\partial x} + \frac{\partial \tau_{yy}}{\partial y} + f_y\right),  (12)

\psi = \frac{\partial \Pi_x}{\partial x} + \frac{\partial \Pi_y}{\partial y}  (13)

and

\vartheta = \dot{\theta} = \frac{\partial v_x}{\partial x} + \frac{\partial v_y}{\partial y}.  (14)
Then the linear viscoelastic equations (8) and (9) can be written in terms of their velocity (vx , vy ) and stress (τxx , τyy , τxy ) components as follows
\dot{v}_x = \Pi_x,  (15)

\dot{v}_y = \Pi_y,  (16)

\dot{\tau}_{xx} = \lambda\vartheta + \lambda'\psi + 2\mu\frac{\partial v_x}{\partial x} + 2\mu'\frac{\partial \Pi_x}{\partial x} = (\lambda + 2\mu)\frac{\partial v_x}{\partial x} + \lambda\frac{\partial v_y}{\partial y} + (\lambda' + 2\mu')\frac{\partial \Pi_x}{\partial x} + \lambda'\frac{\partial \Pi_y}{\partial y},  (17)

\dot{\tau}_{yy} = \lambda\vartheta + \lambda'\psi + 2\mu\frac{\partial v_y}{\partial y} + 2\mu'\frac{\partial \Pi_y}{\partial y} = (\lambda + 2\mu)\frac{\partial v_y}{\partial y} + \lambda\frac{\partial v_x}{\partial x} + (\lambda' + 2\mu')\frac{\partial \Pi_y}{\partial y} + \lambda'\frac{\partial \Pi_x}{\partial x},  (18)

\dot{\tau}_{xy} = \mu\left(\frac{\partial v_x}{\partial y} + \frac{\partial v_y}{\partial x}\right) + \mu'\left(\frac{\partial \Pi_x}{\partial y} + \frac{\partial \Pi_y}{\partial x}\right).  (19)
Eqs. (11)–(14) and (15)–(19) constitute the velocity–stress formulation for the Kelvin–Voigt model in isotropic media. The spatial discretization of the EFIT consists of integrating the differential equations (15)–(19) over a certain control volume or integration cell. For instance, the area integration of (15) according to Green's theorem gives [19]:
Fig. 1. Velocity and stress components at the staggered grid used in the 2-D EFIT model.
\int_{A_{xy}} \rho\,\dot{v}_x\, dx\, dy = \int_{A_{xy}} \left(\frac{\partial \tau_{xx}}{\partial x} + \frac{\partial \tau_{xy}}{\partial y} + f_x\right) dx\, dy  (20)

= \oint_{\partial A_{xy}} \left(\tau_{xx}\, dy - \tau_{xy}\, dx\right) + \int_{A_{xy}} f_x\, dx\, dy.  (21)
The above integrals can be calculated approximately by multiplying the mean value of the integrand by the corresponding area of the integration cell, yielding the following discrete approximation at the central point of the integration cell (i, j), see Fig. 1:
\rho\,\dot{v}_{x,i,j}\,\Delta x \Delta y = \left(\tau_{xx,i+1,j} - \tau_{xx,i,j}\right)\Delta y + \left(\tau_{xy,i,j} - \tau_{xy,i,j-1}\right)\Delta x + f_{x,i,j}\,\Delta x \Delta y.  (22)
Therefore, the above finite integration approach leads to a staggered grid as shown in Fig. 1. Assuming a leapfrog scheme for the time stepping, the updates of velocities and stresses are staggered in time by ∆t/2. Therefore, the space and time discretization (with ∆x = ∆y) of Eqs. (15)–(19) can be written as:
v_{x,i,j}^{n} = v_{x,i,j}^{n-1} + \frac{\Delta t}{\Delta x}\, B_x \left(\tau_{xx,i+1,j}^{n-1/2} - \tau_{xx,i,j}^{n-1/2} + \tau_{xy,i,j}^{n-1/2} - \tau_{xy,i,j-1}^{n-1/2}\right) + \Delta t\, B_x\, f_{x,i,j},  (23)

v_{y,i,j}^{n} = v_{y,i,j}^{n-1} + \frac{\Delta t}{\Delta x}\, B_y \left(\tau_{xy,i,j}^{n-1/2} - \tau_{xy,i-1,j}^{n-1/2} + \tau_{yy,i,j+1}^{n-1/2} - \tau_{yy,i,j}^{n-1/2}\right) + \Delta t\, B_y\, f_{y,i,j},  (24)

\tau_{xx,i,j}^{n+1/2} = \tau_{xx,i,j}^{n-1/2} + \frac{\Delta t}{\Delta x}\left[(\lambda + 2\mu)\left(v_{x,i,j}^{n} - v_{x,i-1,j}^{n}\right) + \lambda\left(v_{y,i,j}^{n} - v_{y,i,j-1}^{n}\right) + (\lambda' + 2\mu')\left(\Pi_{x,i,j} - \Pi_{x,i-1,j}\right) + \lambda'\left(\Pi_{y,i,j} - \Pi_{y,i,j-1}\right)\right] + \Delta t\, g_{xx,i,j},  (25)

\tau_{yy,i,j}^{n+1/2} = \tau_{yy,i,j}^{n-1/2} + \frac{\Delta t}{\Delta x}\left[\lambda\left(v_{x,i,j}^{n} - v_{x,i-1,j}^{n}\right) + (\lambda + 2\mu)\left(v_{y,i,j}^{n} - v_{y,i,j-1}^{n}\right) + (\lambda' + 2\mu')\left(\Pi_{y,i,j} - \Pi_{y,i,j-1}\right) + \lambda'\left(\Pi_{x,i,j} - \Pi_{x,i-1,j}\right)\right] + \Delta t\, g_{yy,i,j},  (26)

\tau_{xy,i,j}^{n+1/2} = \tau_{xy,i,j}^{n-1/2} + \frac{\Delta t}{\Delta x}\,\bar{\mu}_{xy}\left(v_{x,i,j+1}^{n} - v_{x,i,j}^{n} + v_{y,i+1,j}^{n} - v_{y,i,j}^{n}\right) + \frac{\Delta t}{\Delta x}\,\bar{\mu}'_{xy}\left(\Pi_{x,i,j+1} - \Pi_{x,i,j} + \Pi_{y,i+1,j} - \Pi_{y,i,j}\right) + \Delta t\, g_{xy,i,j},  (27)

where g_{ij} (i, j ∈ {x, y}) are the stress sources, B_x and B_y are effective buoyancies, and \bar{\mu}_{xy} is the effective rigidity, defined as:

B_x = \frac{2}{\rho_{i+1,j} + \rho_{i,j}},  (28)

B_y = \frac{2}{\rho_{i,j+1} + \rho_{i,j}},  (29)

\bar{\mu}_{ij} = \frac{4}{\dfrac{1}{\mu_{i,j}} + \dfrac{1}{\mu_{i+1,j}} + \dfrac{1}{\mu_{i,j+1}} + \dfrac{1}{\mu_{i+1,j+1}}}.  (30)
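To make the structure of the scheme explicit, the following serial NumPy sketch evaluates the interior part of the velocity updates, Eqs. (23)–(24); array names are illustrative and the body-force and boundary terms are omitted, so this is a reference sketch rather than the OpenCL kernels of the distributed code.

import numpy as np

def update_velocities(vx, vy, txx, tyy, txy, Bx, By, dt, dx):
    """One leapfrog velocity half-step, Eqs. (23)-(24), on the interior of the
    staggered grid (body forces omitted for brevity). Arrays are indexed [i, j]."""
    # Eq. (23): vx(i,j) uses txx(i+1,j) - txx(i,j) and txy(i,j) - txy(i,j-1)
    vx[1:-1, 1:-1] += (dt / dx) * Bx[1:-1, 1:-1] * (
        txx[2:, 1:-1] - txx[1:-1, 1:-1] + txy[1:-1, 1:-1] - txy[1:-1, :-2])
    # Eq. (24): vy(i,j) uses txy(i,j) - txy(i-1,j) and tyy(i,j+1) - tyy(i,j)
    vy[1:-1, 1:-1] += (dt / dx) * By[1:-1, 1:-1] * (
        txy[1:-1, 1:-1] - txy[:-2, 1:-1] + tyy[1:-1, 2:] - tyy[1:-1, 1:-1])
    return vx, vy

The stress updates (25)–(27) follow the same pattern; in the distributed code the analogous stencils are evaluated inside the OpenCL velocity and stress kernels.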
Note that when we need the viscoelastic effective constant \bar{\mu}'_{ij} we should replace \mu_{ij} by \mu'_{ij} in (30). To guarantee numerical stability of the EFIT code, the spatial and time steps must satisfy ∆x ≤ Vmin/(10 Fmax) and ∆t ≤ (1/√2) ∆x/Vmax, respectively, where Vmin and Vmax are the lowest and highest velocities in the heterogeneous medium and Fmax is the highest frequency component in the signal with respect to the operating frequency Fc (e.g. Fmax = 2Fc). The source we apply is given by

f(x, y, t) = P(t),  (31)

where the source is actually a line of sources and P(t) is the raised-cosine function

P(t) = \begin{cases} \left(1 - \cos(\pi f_0 t)\right)\cos(2\pi f_0 t), & t \leq 2/f_0, \\ 0, & t > 2/f_0, \end{cases}  (32)
Fig. 2. A set of snapshots for a three-layer medium. The layers at the top and at the bottom are viscoelastic and the layer in the middle is elastic. The velocities and density of the elastic medium are Vp = 3000 m/s, Vs = 1730 m/s and ρ = 2.6 kg/m3. The viscoelastic layers have the following parameters: Vp = 1800 m/s, Vs = 1040 m/s, ρ = 2.0 kg/m3, λ′ = 6.8348 (Pa s), µ′ = 13.7672 (Pa s).
with 2f0 the cutoff frequency. To implement the stress-free boundary conditions, a vacuum formulation (VCF) is used, in which the elastic moduli and the buoyancies Bx and By tend to zero, see [17]. We may consider vacuum conditions for all boundaries or use absorbing boundaries as in [20].

3. Numerical results

In this section we describe how the parallel elastic/viscoelastic EFIT codes have been applied to simulate wave propagation in a layered medium. We consider a three-layer medium with two viscoelastic layers (one at the top and one at the bottom) and an elastic layer in the middle of the model, see Fig. 2. The velocities and density of the elastic medium are Vp = 3000 m/s, Vs = 1730 m/s and ρ = 2.6 kg/m3. The viscoelastic layers have the following parameters: Vp = 1800 m/s, Vs = 1040 m/s, ρ = 2.0 kg/m3, QP0 = 60, QS0 = 50, λ′ = 6.8358 (Pa s), µ′ = 13.7672 (Pa s). The size of the numerical grid is NX = 2307, NY = 2307, the central frequency is f0 = 500 kHz, ∆t = 6.1282 ns, and ∆x = ∆y = 52.0 µm. The thicknesses of the viscoelastic and elastic layers are 25 mm and 50 mm, respectively. A set of snapshots for the three-layer medium is shown in Fig. 2. Synthetic seismograms for 100 equally spaced receivers located at the top of the free surface are depicted in Fig. 3.
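To illustrate how the quantities defined in Sections 2 and 3 fit together, the following sketch computes the anelastic parameters of Eq. (5), the stability bounds on ∆x and ∆t, and the raised-cosine source of Eq. (32) for the viscoelastic layers of this example. Variable names are illustrative; the density is taken here as 2000 kg/m3 (2.0 in g/cm3) and the factor g(Q) of Eq. (7) is treated as approximately 1 for these quality factors, so that the computed λ′ and µ′ reproduce the values quoted above.

import numpy as np

# Viscoelastic layer of the three-layer example (values quoted in the text);
# the density is assumed to be 2000 kg/m^3 so that lambda' and mu' come out
# close to the quoted 6.84 and 13.77 Pa s.
Vp, Vs, rho = 1800.0, 1040.0, 2000.0
QP0, QS0 = 60.0, 50.0
f0 = 500e3                       # central frequency of the source (Hz)
w0 = 2.0 * np.pi * f0

mu = rho * Vs**2                 # shear modulus, Eq. (6) with g(QS0) ~ 1
E = rho * Vp**2                  # P-wave modulus (lambda + 2*mu), g(QP0) ~ 1

# Anelastic parameters, Eq. (5)
lam_p = (E / QP0 - 2.0 * mu / QS0) / w0    # ~6.8 Pa s
mu_p = mu / (w0 * QS0)                     # ~13.8 Pa s

# Stability bounds (the run described in the text uses finer values)
Vmin, Vmax, Fmax = Vs, 3000.0, 2.0 * f0
dx_max = Vmin / (10.0 * Fmax)              # ~104 micrometers
dt_max = dx_max / (np.sqrt(2.0) * Vmax)

# Raised-cosine source, Eq. (32)
def P(t):
    return np.where(t <= 2.0 / f0,
                    (1.0 - np.cos(np.pi * f0 * t)) * np.cos(2.0 * np.pi * f0 * t),
                    0.0)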
[Fig. 3 panels: (a) Elastic Txx; (b) Viscoelastic Txx; (c) Elastic Tyy; (d) Viscoelastic Tyy. Axes: y (mm) versus t (µs).]
Fig. 3. Synthetic seismograms for 100 equally spaced receivers located at the top of the free-surface. The elastic (left) and viscoelastic (right) cases are considered.
The elastic (left) and viscoelastic (right) cases are considered. Note that for both components the signal is attenuated in the lossy viscoelastic medium.

4. Open Computing Language

OpenCL is the first open, royalty-free standard for cross-platform, parallel programming across CPUs, GPUs and other processors. OpenCL was initially developed by Apple Inc., and the standard is being actively developed and maintained by the Khronos Group, a large multi-vendor consortium. This platform gives software developers portable and efficient access to the computing power of these architectures and eliminates vendor-specific definitions. Furthermore, OpenCL allows applications to use a host and one or more computing devices as a single heterogeneous parallel computing system. The OpenCL framework consists of the platform layer, which allows the host program to query the capabilities of OpenCL devices and to create contexts; the runtime, which allows the host program to manipulate contexts once they have been created; and the compiler, which builds the OpenCL kernels executed on the computing devices. The context defined by the host includes the following resources: (i) OpenCL devices, (ii) kernels (functions that run on OpenCL devices), (iii) program objects (the program source and executable that implement the kernels) and (iv) memory objects. The host creates a data structure called a command queue to coordinate the execution of kernels, memory operations and synchronizations within the context. An index space is created when a kernel is submitted by the host for execution on the OpenCL device. An instance of the kernel (called a work item) executes for each point of the index space. A work item can issue one instruction per clock cycle, and work items are organized into workgroups, providing a coarser-grained decomposition of the index space. The block of work items that are executed together is called a wavefront; the size of a wavefront can differ between GPU compute devices. OpenCL has various memory domains (see Table 1): private, local, global and constant; the AMD Accelerated Parallel Processing system also recognizes host (CPU) and PCI Express (PCIe) memory. Accessing local memory is typically an order of magnitude faster than accessing host memory through global memory (VRAM), which in turn is an order of magnitude faster than PCIe. However, stream cores do not access memory directly; instead, they issue memory requests through dedicated hardware units. When a work item tries to access memory, it is transferred to the appropriate fetch unit and is then deactivated until the access unit finishes accessing memory. Meanwhile, other work items can be active within the compute unit, contributing to better performance. When OpenCL code executes on a GPU, local memory is usually implemented in hardware as a user-managed cache (fast shared memory), enabling higher bandwidth than the other memories. However, when OpenCL executes on a CPU, the programmer does not have direct control over the cache memory, see [21]. Therefore, the use of local memory on a CPU does not yield higher bandwidth but requires additional copying of data between memory regions.
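As a concrete illustration of the host-side steps just described (platform/context creation, command queue, program build, memory objects and a kernel launch with an explicit workgroup size), a minimal PyOpenCL sketch is given below; the kernel and its arguments are placeholders, not the EFIT2D kernels of the distributed code.

import numpy as np
import pyopencl as cl

# Platform layer: pick a device and create a context and a command queue
ctx = cl.create_some_context()          # interactive / environment-driven device choice
queue = cl.CommandQueue(ctx)

# Compiler: build an OpenCL program from source (placeholder kernel)
src = """
__kernel void scale(__global float *field, const float factor)
{
    int gid = get_global_id(0);
    field[gid] *= factor;
}
"""
prg = cl.Program(ctx, src).build()

# Memory objects: copy a host array to global device memory
host_field = np.ones(2048 * 2048, dtype=np.float32)
mf = cl.mem_flags
dev_field = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=host_field)

# Kernel launch: global index space and an explicit workgroup (local) size
global_size = (host_field.size,)
local_size = (256,)                     # must divide the global size
prg.scale(queue, global_size, local_size, dev_field, np.float32(0.5))

# Read the result back to the host
cl.enqueue_copy(queue, host_field, dev_field)

The EFIT2D class of the distributed code wraps steps of this kind and, in addition, selects between the global and local memory kernels (see Appendix A).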
Table 1
Different memory types in OpenCL.

Private memory: Specific to a work item; it is not visible to other work items.
Local memory: Specific to a workgroup; accessible only by work items belonging to that workgroup.
Global memory: Accessible to all work items executing in a context, as well as to the host (read, write and map commands).
Constant memory: Read-only region for host-allocated and initialized objects that are not changed during kernel execution.
Host (CPU) memory: Host-accessible region for an application's data structures and program data.
PCIe memory: Part of host (CPU) memory accessible from, and modifiable by, the host program and the GPU compute device. Modifying this memory requires synchronization between the GPU compute device and the CPU.
Table 2
Technical specification of the experiment platforms. The numbers of the devices correspond to the numbers of their operating systems.

(1) NVIDIA GTX Titan (GPU): Linux Ubuntu 12.04
(2) AMD 7970 (GPU): Linux Ubuntu 12.04
(3) NVIDIA GeForce GTX 660 Ti (GPU): Linux Ubuntu 12.04
(4) NVIDIA GeForce GTX 560 Ti (GPU): Linux Ubuntu 11.10
(5) NVIDIA GeForce GTX 780 M (GPU): MacOS Apple Mavericks 10.9
(6) Tesla T10 (GPU): Linux Red Hat
(7) AMD 6770M (GPU): MacOS Apple Mountain Lion 10.8
(8) Intel(R) Core(TM) i7-4771 @ 3.5 GHz, 4 cores (CPU): MacOS Apple Mavericks 10.9
Table 3
Technical specification of the experiment platforms. The number of Cuda cores is given by the number of streaming multiprocessors (SM) times the number of streaming processors (SP).

Device     | Cuda/stream cores (SM * SP) | Processor clock speed (MHz) | Memory clock speed (GHz) | Global memory bandwidth (GB/s)
GTX Titan  | 2688 | 836  | 6.0 | 288.4
GTX 660 Ti | 1344 | 915  | 6.0 | 144.2
GTX 560 Ti | 352  | 820  | 4.0 | 128
GTX 780 M  | 1536 | 823  | 5.0 | 160
Tesla T10  | 240  | 1300 | 0.8 | 102
AMD 7970   | 2048 | 925  | 5.5 | 288
AMD 6770M  | 480  | 750  | 3.6 | 57.6
For comprehensive documentation on OpenCL see [22,23], and for PyOpenCL see [24,25]. A brief description of the implementation of EFIT on a single PyOpenCL device is given in Appendix A.

5. Performance analysis

The main purpose of this paper is to explore the speedup obtained by using local memory instead of global memory to accelerate the EFIT computation. To this end, average computation times are calculated to compare the effectiveness of the parallel PyOpenCL code running on a GPU with local versus global memory. The performance of the CPU is included only as a reference; it is known that the performance of OpenCL on a CPU is undermined because the user cannot control the cache memory. We have tested the code speed for four different grid sizes, and for each of these grids we evaluate various workgroup sizes when using local memory. The experiment platforms are the seven GPUs and one multi-core CPU listed in Table 2. Some technical specifications of the GPUs, such as the number of processors (Cuda cores/stream processors) and the processor and memory speeds, are given in Table 3. The architectures change from model to model and between brands, so it is difficult to build a unified table with the same characteristics listed for all GPUs. The set of experiments consists of running the code for both the elastic and the viscoelastic case using global and local memory. We recorded the execution time for 4526 time iterations on each of the PCDs. Fig. 4 compares the performance, in terms of computational time, when using global memory on the PCDs with their different operating systems. The computational time depends on whether the experiment is for an elastic (Fig. 4(a)) or a viscoelastic (Fig. 4(b)) medium, because the number of operations executed differs in these two cases. The computational time for the viscoelastic case is greater than for the elastic case: when the viscoelastic constants λ′ and µ′ are different from zero, more operations are introduced, slowing down the computation with respect to the elastic case. In Fig. 5 we show the computational time for the case of local memory, including a comparison of performance for two different workgroup sizes (8 × 8 and 16 × 16). We observe that there is an improvement when using a workgroup of size 16 × 16 with respect to the case of global memory. Since the trends are very similar for the elastic case, we omit it. The Gflops are also taken into account and are computed as follows:
\text{Gflops} = \begin{cases} \dfrac{39 \cdot GS}{(\text{Kernel}_V + \text{Kernel}_\sigma) \cdot 10^{9}}, & \text{Elastic;} \\[1ex] \dfrac{61 \cdot GS}{(\text{Kernel}_V + \text{Kernel}_\sigma) \cdot 10^{9}}, & \text{Viscoelastic,} \end{cases}  (33)
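The kernel times entering Eq. (33) can be obtained, for example, from OpenCL event profiling; the helper functions below are an illustrative sketch with hypothetical names (the distributed code may measure its timings differently).

import pyopencl as cl

def profiling_queue(ctx):
    """Command queue with profiling enabled, so kernel events carry timestamps."""
    return cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

def kernel_seconds(event):
    """Elapsed time of a kernel launch from the OpenCL profiling counters (ns -> s)."""
    event.wait()
    return (event.profile.end - event.profile.start) * 1e-9

def gflops(grid_size, t_kernel_v, t_kernel_sigma, ops_per_cell):
    """Eq. (33): ops_per_cell is 39 (elastic) or 61 (viscoelastic); times in seconds."""
    return ops_per_cell * grid_size / ((t_kernel_v + t_kernel_sigma) * 1e9)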
Fig. 4. Comparison of performance using global memory in terms of computational time of eight different devices running the program for various grid sizes. The computational time depends on whether the experiment is for an elastic (a) or viscoelastic (b) medium.
Fig. 5. Comparison of performance in terms of time using local memory for the viscoelastic case. For each PCD we measure the computational time using workgroups of sizes (8 × 8 and 16 × 16) and we compare the performance for the four grid sizes. We observe that there is an improvement when using a workgroup of size 16 × 16 with respect to the case of global memory. We omitted the elastic case because the trends are very similar.
where GS is the grid size of the computational model, Kernel_V is the time of the kernel that computes the velocities, Kernel_σ is the time of the kernel that computes the stresses, and 39 and 61 are the total numbers of sums and multiplications carried out in the kernels that compute velocities and stresses for the elastic and viscoelastic cases, respectively. The main difference between the kernels using global and local memory is that work items can access local memory faster than global memory. Therefore, we need to read the data located in global memory and save it into local memory blocks in order to process it; once the processing of the local data has finished, the resulting data is transferred back to global memory (see the implementation given in Appendix A). The speed does not depend on the complexity of the model in terms of different materials, but on the grid size and on whether the materials are viscoelastic or elastic. We plot the performance in terms of Gflops using global memory (four different grid sizes are considered) for the elastic, Fig. 6(a), and viscoelastic, Fig. 6(b), cases. The trends are similar in both cases, and the best performance is obtained with the GTX Titan. The performance results in terms of Gflops using global memory are shown in Fig. 6(b) for the viscoelastic case; four different grid sizes and seven GPUs are compared, and the CPU is included as a reference. Similar results are obtained for the elastic case in Fig. 6(a). In Fig. 7 we show the results in terms of Gflops using local memory (with workgroup sizes of 8 × 8 and 16 × 16) for the seven GPUs. Note that the most powerful GPU is the GTX Titan, and this card also supports workgroups of up to size 32 × 32 (see Fig. 8). Fig. 8(a) shows a comparison of performance using local memory for the three workgroup sizes supported by this card. In order to assess the speedup achieved when using local memory, we define it as the following ratio:
\text{Speedup} = \frac{T_{\text{ref}}}{T_{\text{Local}}},  (34)
where T_Local is the computational time using local memory and T_ref is the reference computational time, which depending on the case is the global memory time or the CPU time using either OpenCL or a serial C implementation. We expect Speedup > 1, i.e., T_ref > T_Local.
Fig. 6. Performance in terms of Gflops using global memory (four different grid sizes are considered). Elastic (a) and viscoelastic (b) cases. We have similar trends in both cases. Note that the best performance is the GTX Titan which is the latest and fastest (Kepler) technology.
Fig. 7. The performance results in terms of Gflops using local memory for the viscoelastic case. Four different grid sizes and four GPUs are compared for two different workgroup sizes (8 × 8 and 16 × 16). The most powerful GPU is the GTX Titan, followed by the GTX 660 Ti.
On the other hand, the speedup achieved using local memory with the GTX Titan is shown in Fig. 8(b). The different bars correspond to the ratio between the computational time of the CPU (executing the code using the OpenCL implementation with global memory or using a serial C implementation) and the computational time using local memory on the GTX Titan. These results show that the speedup factor is around 60X using local memory with a workgroup of size 32 × 32 when compared with the execution of the same program in parallel on a quad-core CPU (OpenCL implementation), or nearly 80X when compared with the serial C implementation on the CPU. Although we believe that an MPI implementation would be the best choice to take full advantage of the CPU performance, and therefore this might not be a fair comparison, we include it for reference. The most intriguing fact is that the speedup factor is up to 1.37X when comparing the performance of global and local memory (with a workgroup size of 32 × 32 running on the GTX Titan). In Fig. 8(c) we can see the speedups for the different cards when the execution times with global and local memory are compared. Note that for workgroup sizes of 8 × 8 and 16 × 16 there is little advantage in using local memory over global memory, except on the Tesla card, where the speedup is up to 4X. However, there is a marginal gain for workgroups of size 32 × 32 on the GTX Titan, GTX 660 Ti and GTX 780 M. With these few experiments we observe that some architectures are better suited than others to speed up this algorithm, and since this is a fairly recent technology we are still in the process of exploring the real benefits and drawbacks of modeling wave propagation using PCDs.

6. Conclusions

We have developed codes that can be used to simulate wave propagation in elastic/viscoelastic media using OpenCL. The use of graphics cards with local memory enables us to increase the simulation speed in some cases by nearly 4X (using a large-scale grid) with respect to global memory. In addition, the speedup factor is around 40X using local memory with a workgroup of size 32 × 32 (GTX Titan) when compared with the execution on a quad-core CPU (OpenCL implementation, global memory), or nearly 80X when compared with the serial C implementation on the CPU. It is important to mention that the performance strongly depends on the algorithm, so further speed improvements could be achieved with optimally parallelized algorithms; this is a subject of further scrutiny. The use of PCDs and the resulting acceleration of the computation opens the door, in the very near future, to studying not only forward problems but also the more interesting full waveform inverse problems using this technology.
Fig. 8. (a) Comparison of performance in terms of time for the local memory for the three workgroup sizes supported by the GTX Titan. The speedup achieved using local memory with the GTX Titan is shown in (b). The different bars correspond to the ratio between the computational time of the CPU (with OpenCL and with a serial C implementation using global memory). (c) Speedups for the different cards using global and local memory.
Fig. 9. Implementation of the modules that contain the core functions to run the 2-D EFIT. The main class is EFIT2D, which depends on other classes and on the kernels that model elastic or viscoelastic media.
Further work should address the scalability of 3-D simulations and the use of multiple GPUs. Since parallel computing with graphics cards is a relatively new subject in the scientific community, development tools equivalent to the CPU tools that shield the programmer from having to deal with the full complexity of the hardware are unavailable or rudimentary. We aim to contribute by exploring its characteristics and potential, hoping that each step along this long path will provide some insight and help this new technology to mature. Finally, the code described here and used to obtain the results presented in this work is available as an open-source tool to share with the scientific community.

Acknowledgments

We would like to thank two anonymous reviewers for their comments and suggestions. This work was partially supported by the Spanish Economy and Competitiveness Ministry (TEC2012-38402-C04-03), DGAPA-UNAM under project IN116114-2 and Conacyt México under program PROINNOVA, project number 212923. UI-V thanks Pilar Ladrón de Guevara for her crucial help in locating useful references.

Appendix A. Implementation of EFIT on a single PyOpenCL device

We describe the general algorithm to run the EFIT. A version of the code can be found at https://github.com/mmolero/efit2d-pyopencl (see Appendix C). The EFIT2D class is the core of the program: it contains the functions to initialize the OpenCL environment, set up the inspection, set the staggered material properties, configure the air boundary layer or absorbing layers, initialize all the fields (all the needed variables), set the receivers, run the kernels for elastic or viscoelastic media, and choose between global and local memory. The EFIT2D class depends on another module of classes called EFIT2D_Classes, and there are two separate kernel modules for the viscoelastic and the elastic cases, see Fig. 9. The main program, which runs the elastic or viscoelastic case with global or local memory, is outlined in the following pseudocode.
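In outline, the driver performs the following steps; class and method names in this sketch are simplified and illustrative, so see the distributed main.py and the EFIT2D class for the actual interface.

# Pseudocode outline of the main program (names are illustrative, not the
# exact API of the distributed EFIT2D-PyOpenCL sources).
from EFIT2D import EFIT2D                                   # hypothetical import path
from EFIT2D_Classes import Material, Source, Inspection    # hypothetical names

# 1. Describe the scenario: materials, source and receivers
materials = [Material("viscoelastic"), Material("elastic"), Material("viscoelastic")]
source = Source(frequency=500e3)        # raised-cosine line source, Eq. (32)
inspection = Inspection(receivers=100)

# 2. Build the simulation object: initialize the OpenCL context, stage the
#    staggered material properties and all field arrays on the device, and
#    select elastic/viscoelastic kernels with global or local memory
sim = EFIT2D(materials, source, inspection,
             medium="viscoelastic", memory="local", workgroup=(16, 16))

# 3. Time loop: leapfrog update of velocities and stresses
for n in range(sim.num_steps):
    sim.run_kernel_velocities()         # Eqs. (23)-(24)
    sim.run_kernel_stresses()           # Eqs. (25)-(27)
    sim.apply_source(n)
    sim.record_receivers(n)

# 4. Copy results back to the host and save seismograms / snapshots
sim.save_output()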
Here we show a comparison between the kernels that compute the velocities using global and local memory. The main difference between these kernels is that work items can access local memory faster than global memory. Therefore, we need to read the data located in global memory and save it into local memory blocks in order to process it; once the processing of the local data has finished, the resulting data is transferred back to global memory. To ensure the correct processing of data in both the global and the local memory model, synchronization of work items and workgroups is important whenever work items must finish the calculation of an intermediate result that will be used in a subsequent computation. This is done using barrier commands, which prevent subsequent commands from executing until every preceding command has completed. In the case of synchronizing work items within a workgroup, the barrier command forces a work item to wait until every other work item in the group reaches the barrier.
Global memory implementation
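A schematic global-memory version of the velocity kernel is sketched below as a PyOpenCL source string; it contains only the v_x update of Eq. (23), and its argument list and indexing are simplified with respect to the distributed code.

# Schematic global-memory velocity kernel (vx update of Eq. (23) only;
# body-force term omitted, argument list simplified).
KERNEL_GLOBAL_SRC = """
__kernel void update_velocity_global(__global float *vx,
                                     __global const float *txx,
                                     __global const float *txy,
                                     __global const float *Bx,
                                     const float dtx,      // dt/dx
                                     const int NX, const int NY)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    if (i < 1 || i >= NX - 1 || j < 1 || j >= NY)
        return;
    int idx = i * NY + j;
    // Every stress value is fetched directly from global memory.
    vx[idx] += dtx * Bx[idx] * (txx[(i + 1) * NY + j] - txx[idx]
                                + txy[idx] - txy[i * NY + (j - 1)]);
}
"""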
Local memory implementation
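The corresponding local-memory sketch first stages the stress values needed by the workgroup (a tile plus a one-cell halo) into local memory, synchronizes with barrier(CLK_LOCAL_MEM_FENCE) so that all staging writes are visible before any work item applies the stencil, and then performs the same update. The tile size, which must match the workgroup size (here 16 × 16), and the indexing are illustrative, not the distributed kernels.

# Schematic local-memory variant of the same velocity kernel.
KERNEL_LOCAL_SRC = """
#define BLOCK 16   // must equal the workgroup size in each dimension

__kernel void update_velocity_local(__global float *vx,
                                    __global const float *txx,
                                    __global const float *txy,
                                    __global const float *Bx,
                                    const float dtx,      // dt/dx
                                    const int NX, const int NY)
{
    __local float s_txx[BLOCK + 1][BLOCK];   // halo row for the i+1 access
    __local float s_txy[BLOCK][BLOCK + 1];   // halo column for the j-1 access

    int i  = get_global_id(0);
    int j  = get_global_id(1);
    int li = get_local_id(0);
    int lj = get_local_id(1);
    int idx = i * NY + j;

    // Stage the tile plus the halo cells required by the stencil.
    if (i < NX && j < NY) {
        s_txx[li][lj]     = txx[idx];
        s_txy[li][lj + 1] = txy[idx];
        if (li == BLOCK - 1 && i + 1 < NX)
            s_txx[li + 1][lj] = txx[(i + 1) * NY + j];
        if (lj == 0 && j >= 1)
            s_txy[li][0] = txy[i * NY + (j - 1)];
    }
    barrier(CLK_LOCAL_MEM_FENCE);   // all loads complete before any work item reads

    if (i < 1 || i >= NX - 1 || j < 1 || j >= NY)
        return;
    // Same stencil as the global-memory kernel, now read from local memory.
    vx[idx] += dtx * Bx[idx] * (s_txx[li + 1][lj] - s_txx[li][lj]
                                + s_txy[li][lj + 1] - s_txy[li][lj]);
}
"""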
Appendix B. Installation of some Python libraries (including PyOpenCL) in MacOS via Macports

In order to run the present code you need the following open-source software: Python (we use version 2.7.3), NumPy, Matplotlib, Scipy and PyOpenCL (OpenCL library). To run the code you just have to type python main.py; this invokes the EFIT2D class and all its necessary constructors to start the time loop. There is a complementary file called EFIT2D_Classes.py that contains the classes for the source and the inspection. In this appendix we illustrate how to install the components needed to run this code under MacOS 10.9.2 (Mavericks) via Macports. For a MacOS system with Macports installed, run the following commands to install the various libraries:
• sudo port install python27
• sudo port install py27-numpy
• sudo port install py27-scipy
• sudo port install py27-opengl
• sudo port install py27-matplotlib
• sudo port install py27-pyopencl
In addition, one should download glumpy, untar and uncompress it, and copy this folder into the folder where EFIT2D is located.

Appendix C. Supplementary data

Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.cpc.2014.05.016.

References
[1] J. Carcione, Wave Fields in Real Media: Wave Propagation in Anisotropic, Anelastic, Porous and Electromagnetic Media, Elsevier, Netherlands, 2007.
[2] D.-H. Sheen, K. Tuncay, C.-E. Baag, P.J. Ortoleva, Comput. Geosci. 32 (2006) 1182–1191.
[3] T. Bohlen, Comput. Geosci. 28 (2002) 887–899.
[4] O. Roy, I. Jovanovic, A. Hormati, R. Parhizkar, M. Vetterli, in: J. D'Hooge, S.A. McAleavey (Eds.), SPIE Medical Imaging, Vol. 76290, New York, 2010.
[5] M. Molero, U. Iturrarán-Viveros, Ultrasonics 53 (2013) 815–822.
[6] U. Iturrarán-Viveros, M. Molero, Comput. Geosci. 56 (2013) 161–169.
[7] M. Molero, L. Medina, D. Lluveras, M. Izquierdo, J. Anaya, IOP Conference Series: Materials Science and Engineering, vol. 42, 2012, pp. 1–2.
[8] M. Molero, L. Medina, Ultrasonics 52 (7) (2012) 809–814.
[9] P. Fellinger, R. Marklein, K. Langenberg, S. Klaholz, Wave Motion 21 (1) (1995) 47–66.
[10] F. Schubert, Ultrasonics 42 (2004) 221–229.
[11] F. Schubert, A. Peiffer, B. Koehler, T. Sanderson, J. Acoust. Soc. Am. 104 (5) (1998) 2604–2614.
[12] F. Schubert, B. Koehler, J. Comput. Acoust. 9 (4) (2001) 1543–1560.
[13] D. Calvo, K. Rudd, M. Zampolli, W. Sanders, Wave Motion 47 (8) (2010) 616–634.
[14] J.M. Carcione, F. Poletto, D. Gei, J. Comput. Phys. 196 (2004) 282–297.
[15] J.M. Carcione, Geophysics 60 (2) (1995) 537–548.
[16] J.M. Carcione, Geophysics 58 (1) (1993) 110–120.
[17] V.A. Barkhatov, Russ. J. Nondestr. Test. 45 (6) (2009) 58–75.
[18] R. Marklein, R. Bärmann, K. Langenberg, in: D. Thompson, D.E. Chimenti (Eds.), Review of Progress in Quantitative Nondestructive Evaluation, Vol. 14, New York, 1995, pp. 251–255.
[19] F. Schubert, Ausbreitungsverhalten von Ultraschallimpulsen in Beton und Schlussfolgerungen für die zerstörungsfreie Prüfung, Ph.D. thesis, Universität Dresden, 2000 (in German).
[20] C. Cerjan, D. Kosloff, R. Kosloff, M. Reshef, Geophysics 50 (1985) 705–708.
[21] T. Stefanski, S. Benkler, N. Chavannes, N. Kuster, Int. J. Numer. Modelling, Electron. Netw. Devices Fields 26 (2013) 355–365.
[22] M. Scarpino, OpenCL in Action, Manning, Shelter Island, 2012.
[23] A. Munshi, B.R. Gaster, T.G. Mattson, J. Fung, D. Ginsburg, OpenCL Programming Guide, Addison Wesley, New York, 2012.
[24] A. Klöckner, PyOpenCL, 2010. http://mathema.tician.de/software/pyopencl.
[25] A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, A. Fasih, Parallel Comput. 38 (3) (2012) 157–174.