GPU-Accelerated Parallel Finite-Difference Time-Domain ... - arXiv

GPU-Accelerated Parallel Finite-Difference Time-Domain Method for Electromagnetic Waves Propagation in Unmagnetized Plasma Media Xi-min Wang, Lang-lang Xiong, Song Liu, Zhi-yun Peng, and Shuang-ying Zhong

Abstract—The finite-difference time-domain (FDTD) method has been commonly utilized in the numerical solution of electromagnetic (EM) waves propagation through the plasma media. However, the FDTD method may bring about a significant increment in additional run-times consuming for computationally large and complicated EM problems. Graphics Processing Unit (GPU) computing based on Compute Unified Device Architecture (CUDA) has grown in response to increased concern for reduction of run-times. We represent the CUDA-based FDTD method with the Runge-Kutta exponential time differencing scheme (RKETD) for the unmagnetized plasma implemented on GPU. In the paper, we derive the RKETD-FDTD formulation for the unmagnetized plasma comprehensively, and describe the detailed flowchart of CUDA-implemented RKETD-FDTD method on GPU. The accuracy and acceleration performance of the posed CUDA-based RKETD-FDTD method implemented on GPU are substantiated by the numerical experiment that simulates the EM waves traveling through the unmagnetized plasma slab, compared with merely CPU-based RKETD-FDTD method. The accuracy is validated by calculating the reflection and transmission coefficients for one-dimensional unmagnetized plasma slab. Comparison between the elapsed times of two methods proves that the GPU-based RKETD-FDTD method can acquire better application acceleration performance with sufficient accuracy. Index Terms—Parallel FDTD, Unmagnetized plasma, Compute unified device architecture (CUDA), Graphic processing unit (GPU)

I. INTRODUCTION Since the finite-difference time-domain (FDTD) method was initially delivered to numerically resolve the Maxwell’s equations by Yee in 1966 [1], it has been widely used in the numerical solution of electromagnetics (EMs) problems. The FDTD method has obvious advantages over many other numerical methods as it updates naturally the field values at every separate cell in discrete time steps by leap-frog integrations without complex computations [2]. Over the past decade, the FDTD numerical modeling approach has led to applications to diverse difficulties, involving modeling of objects ranging from aerospace and biological systems to geometric shapes, analysis and design of complicated microwave circuits and fast time-varying systems as well as various other engineering applications [3]. Plentiful numerical schemes related to FDTD formulations used to simulate the EM waves propagation in the dispersive media are addressed, including the recursive convolution (RC) method [4,5], frequency-dependent Z transform method [6,7], direct integration (DI) method [8, 9], JE convolution (JEC) method [10], the auxiliary differential equation (ADE) method [11],

Manuscript received; This work is supported by National Nature Science foundation of China (No. 61261006 & 11563006 &1116501), the State Key Laboratory of Millimeter Waves Open Research Program (No. K201605), and it was also supported by the Natural Science Foundation of Jiangxi Province (No. 20151BAB202024). Xi-min Wang is with Electrical Engineering, Nanchang University, Nanchang Jiangxi 330031, China. Lang-lang Xiong is with School of Sciences, Nanchang University, Nanchang Jiangxi 330031, China. Song Liu is with School of Sciences, Nanchang University, Nanchang Jiangxi 330031, China; and State Key Laboratory of Millimeter Waves, Nanjing Jiangsu 210016, China. (Email: [email protected]) Zhi-yun Peng is with Electrical Engineering, Nanchang University, Nanchang Jiangxi 330031, China. Shuangying Zhong is with School of Sciences, Nanchang University, Nanchang Jiangxi 330031, China; (Email: [email protected]) (corresponding author Email: [email protected]).

piecewise linear recursive convolution (PLRC) method [12], piecewise linear current density recursive convolution (PLCDRC) method [13], Runge-Kutta exponential time differencing (RKETD) [14]. Simulation of EM waves propagating through plasma media is a unique and fascinating application built on FDTD formulations for dispersive media, where the appearing nonlinear phenomena that are not totally understood can be explicitly refined by numerical simulation.

Furthermore, aforementioned various FDTD scheme for

CUDA. Section Ⅳ designs a numerical simulation that is

dispersive media can be applied to the plasmas.

carried out to prove the accuracy and efficiency of the

Although the FDTD schemes above are well-suited to

GPU-based RKETD-FDTD method for unmagnetized plasma.

numerical simulation, the original FDTD method can bring II. RKETD-FDTD FORMULATION

about a significant increment in additional run-times consuming for computationally large and complicated EM issues. However, FDTD method is naturally a massively

The famous Maxwell’s equations in time domain for the unmagnetized plasma are provided by

parallel algorithm, thus it can benefit a lot from most advances in parallel computing techniques to acquire considerable decrease in time spending. Graphics Processing Units (GPUs) have been industrialized rapidly in recent years, which are considered as the most popular hardware accelerator in parallel

Ñ´ H = e ¶E + J 0 ¶t

(1)

Ñ´ E = -µ ¶H 0 ¶t

(2)

dJ +n J = e w 2 E 0 p dt

(3)

computation to settle massively parallel computations with

where, H is the magnetic intensity, E is the electric field, J

better application acceleration performance obtained.

is the polarization current density, n is the electron collision

Since the Compute Unified Device Architecture (CUDA) was first introduced by NVIDIA as a parallel computing

e 0 and µ0 are the

frequency, w p is plasma frequency,

platform and software programming model, GPUs have turned

permittivity and permeability of free space, respectively.

out to be much more formidable and generalized for developers

Considering one-dimensional equations, the one-dimensional

to be utilized to tackle general-purpose parallel computations

component of the equations can be written as

with high performance. Previous researches about diverse FDTD schemes based on CUDA-enabled GPU have been undertaken over the past decade [15-20]. What’s more, FDTD method carried out on multi-GPU clusters has triggered great interest to further accelerate large-scale computations with improved speedup performance [21-24]. This paper presents a GPU-based RKETD-FDTD method with CUDA for the unmagnetized plasma media to acquire better acceleration performance, compared with merely CPU-based RKETD-FDTD method. Numerical simulation of the GPU-based RKETD-FDTD method for the unmagnetized plasma media is undertaken both on CPU and GPU respectively. The reflection and transmission coefficients through an unmagnetized plasma layer in one dimension are calculated to validate the accuracy of the method. Speedup ratios are computed to confirm the speedup performance of the method. This paper is arranged as follows. In Section Ⅱ, we describe the Maxwell equations for the unmagnetized plasma and derives the FDTD formulation with RKETD numerical scheme. Section Ⅲ illustrates the CUDA programming model and implementation of GPU-based RKETD-FDTD method with

-

¶H y ¶Ex =e + Jx 0 ¶t ¶z

(4)

¶H y ¶Ex = -µ 0 ¶t ¶z

(5)

dJ x +n J x = e w 2p Ex 0 dt

(6)

To derive FDTD formulation with RKETD scheme for the unmagnetized plasma, equation (6) is multiplied by the integrating factor

enDt and then integrated from tn to tn +1 ,

letting Dt = tn+1 - tn and t = tn + t . The result is given by Dt

J xn+1 = e -nDt J xn + e -nDt ò ent F (tn + t )dt 0

(7)

Where, F (tn + t ) = e 0w p2 Ex (tn + t ) . To derive the second-order RKETD method, first define K = e -nDt J xn +

F (tn , J x )(1 - e -nDt ) n

(8)

Next the approximation is taken to give F (tn + t ) = F (tn , J x ) +

t [ F (tn + Dt, K ) - F (tn , J x )] + o ( (Dt )2 ) Dt

Combining equation (7) and (9) yields Dt

J xn+1 = e -nDt J xn + F (tn , J x )e -nDt ò ent dt 0

(9)

+ [ F (tn + Dt , K ) - F (tn , J x ) ]

The x component of J at

e-nDt Dt

ò

Dt

0

(10)

t ent dt

n + 1 time step can be arranged as

J xn+1 = e-nDt J xn + +

1- e

n

-nDt

compute capability of GPU hardware more efficiently. CUDA programs involve codes running on two different platforms concurrently: host codes on CPU and device codes on GPU. There are distinct hardware differences between CPU

e 0w p2 Exn

e -nDt - 1 + nDt e 0w p2 ( Exn+1 - Exn ) n 2 Dt

and GPU, whose separate memory named host and device (11)

memory respectively. Hence, the host is the location where the

So equation (11) can be used to update the x component of the

sequential parts of CUDA programs are executed, and the

polarization current density J by utilizing the x component of

device is where the intensively data-parallel parts are executed

the electric field E at both n and n + 1 time steps.

in the form of kernels. Kernels are critical and unique to CUDA

The discretization of equation (4) follows the Yee grid and

programs. A kernel is always executed on the device after being

leapfrog-style algorithm, which is taken to give

invoked on the host, explicitly specified as the calculation and

Exn+1 (k ) = Exn (k ) -

memory manipulation for solely a single thread.

1 1 n+ Dt é n+ 2 1 1 ù 2 ê H y (k + ) - H y (k - ) ú 2 2 û

e 0 Dz ë

-

Dt é J xn+1 (k ) + J xn (k ) ùû 2e 0 ë

As soon as a kernel is launched on the host, a large number of (12)

1

given by 1

each

performing

structure, that is, CUDA organizes blocks and grids in three (13)

formulation for vacuum, and the update equation of H is

1

generated,

and group blocks into grids. In the light of the hierarchy

The discretization of equation (5) strongly resembles the FDTD

n+ n1 1 Dt H y 2 (k + ) = H y 2 (k + ) [ Exn (k + 1) - Exn (k )] 2 2 µ0 Dx

concurrently

two-level thread hierarchy, which divides threads into blocks

w p2 Dt -nDt (e - 1) Dt n+1 2n Ex (k ) = (1 + ) Exn (k ) (1 + e-nDt ) J xn (k ) w p2 -nDt 2e 0 1 + 2 (e - 1 + nDt ) 2n 1

are

calculations of the same pattern in parallel. CUDA reveals a

Substituting equation (11) into equation (12) gives

n+ n+ Dt 1 1 [ H 2 (k + ) - H y 2 (k - )] e 0 Dz y 2 2

threads

(14)

dimension, programmers can utilize the hardware resources for optimization more straightforwardly. Figure 1 shows an illustrative instance of a thread hierarchy structure with a two-dimensional grid containing two-dimensional blocks. Effective thread configuration becomes essential to CUDA programming for the obtainment of better application acceleration performance. Device(GPU)

Host (CPU)

III. CUDA IMPLEMENTATION Grid 0

A. CUDA Programming Model Parallel programming techniques are compatible with computations that could be operated on concurrence of

Kernel 0

Block (0,0)

Block (1,0)

Block (2,0)

Block (0,1)

Block (1,1)

Block (2,1)

Kernel 1

plentiful data elements in parallel, with the intention of acquiring better acceleration performance by mapping data

Block(1,1)

elements to numerous threads concurrently on GPU. During the

Thread (0,0)

Thread (1,0)

Thread (2,0)

Thread (3,0)

Thread (4,0)

process, a bridge is built by CUDA programming model

Thread (0,1)

Thread (1,1)

Thread (2,1)

Thread (3,1)

Thread (4,1)

between applications and their implementations on NVIDIA

Thread (0,2)

Thread (1,2)

Thread (2,2)

Thread (3,2)

Thread (4,2)

GPU. Moreover, CUDA programming model abstracts the computer

architecture

to

offer

programmers

effective

Figure 1 (color online) a thread hierarchy structure with a two-dimensional grid containing two-dimensional blocks

perspectives to manipulate memory and organize threads on

Moreover, CUDA programming model represents the

CUDA-supported GPU, so that they are able to harness the

complete memory hierarchy clearly to access and manage

memory

for

optimal

performance.

Several

types

of

intensity and E-update-kernel to update the electric field. The

programmable device memory are presented, such as local

two kernels with dotted line border are launched sequentially

memory, global memory, shared memory, constant memory

and executed in parallel on GPU. Shared memory is often used

and texture memory. Figure 2 illustrates the GPU memory

to efficiently implement the FDTD update kernels for optimum,

hierarchy exposed in CUDA memory model. These memory

compared to merely global memory utilized. The execution

spaces slightly differ in the lifetimes, scopes and caching

configuration for each kernel is governed by the complete

behaviors. Local memory is allocated for threads in a kernel.

domain and allocated shared memory, for the reason that the

Shared memory is shared by all threads of the same block,

calculating of a Yee’s cell is mapped to a thread on the device.

which can be configured by threads of a block with lifetime as

START

block. Global memory accessible by all threads is expected to be much slower than shared memory. Constant and texture memory accessible by all threads are read-only memory, optimized for diverse memory practices. Global, constant and texture memory are continuing with same lifetime as an application [25].

Organize threads for E-update-kernel

Initialize fields and PML parameters

Allocate shared memory for H-fields

Allocate GPU (global) memory

(Device) Grid

Update E-fields with different update schemes in plasma media and free space respectively

Copy data from CPU memory to GPU memory

Block (1, 1)

Update J-fields in plasma media

Shared Memory

Host

Allocate CPU (host) memory

Invoke kernels

Registers

Registers

Thread (0, 0)

Thread (1, 0)

Local Memory

Local Memory

？

Copy data back from GPU memory to CPU memory

Global Memory

Update H-fields in computational domain

Deallocate GPU memory and CPU memory

Texture Memory

Figure 2 (color online) GPU memory hierarchy in CUDA memory model

END

Figure 3 Flowchart of GPU-based FDTD method with CUDA

Compared with RKETD-FDTD method on CPU, we

IV. NUMERICAL SIMULATION

illustrate the flowchart of GPU-based RKETD-FDTD method displayed in Figure 3. The host code is implemented to

Allocate shared memory for E-fields

Retrieve results

Constant Memory

B. CUDA-implemented RKETD-FDTD method

Organize threads for H-update-kernel

A. Simulation environment

complete fields and parameters initialization, device memory

As briefly stated before, it’s theoretically predicted that the

allocation and release, and data transfer between host and

GPU-based RKETD-FDTD method can acquire better

device. To efficiently implement GPU-based RKETD-FDTD

acceleration performance compared with the corresponding

method with CUDA, device codes are considered to be divided

simply

into two kernels: H-update-kernel to update the magnetic

RKETD-FDTD method is utilized to simulate EM waves

CPU-based

RKETD-FDTD

method,

when

traveling

through

the

unmagnetized

plasma

media.

The source wave exploited in numerical experimentation is the

Nevertheless, the application acceleration performance of the

Gaussian-derivative pulsed plane wave, which is given in time

GPU-based RKETD-FDTD method isn’t always as satisfying

domain by

as estimated in practical applications, which frequently changes

æ (t - 5t ) 2 ö Ei (t ) = (t - 5t ) exp ç ÷, 2t 2 ø è

with problem complexities, GPU hardware properties and

(15)

here, t = 15Dt .

programmers’ skills. To evaluate the acceleration performance of GPU-based

The execution configurations of the two kernels are arranged

RKETD-FDTD method, we prefer CUDA as the parallel

both as 16 blocks in each grid and 256 threads in each block in

programming platform to employ the acceleration of the

case of the computational domain, with CUDA programming

GPU-based RKETD-FDTD method for the unmagnetized

model constructed.

plasma media. The numerical experiment is performed on the

To validate the accuracy of the GPU-based RKETD-FDTD

laptop provided with CUDA-supported NVIDIA GPU. The

method with CUDA compared with the RKETD-FDTD method

CPU of the laptop is Intel Core i5 3230M, and GPU NVIDIA

merely CPU-based, reflection and transmission coefficients are

GeForce GT 650M. GeForce GT 650 is a low-end product

analyzed with the Fast Fourier Transform (FFT). Figure 4 and 5

based on graphics processor of NVIDIA Kepler architecture,

clarify

the

magnitudes

developed for desktop applications. Some key specifications of

transmission

coefficient

GT650 are tabulated in Table 1. The development environment

CUDA-implemented RKETD-FDTD method on GPU and

is Microsoft Visual Studio 2013 (Community Edition) with

RKETD-FDTD method on CPU with those of the analytical

CUDA toolkit 7.5 assembled, Windows 10 as operating system.

solution. From Figure 4 and 5, it’s taken to show that the

Specification

GeForce GT 650M

Chip

GK107

CUDA cores

384

Processor clock

835MHz

Memory clock

900MHz

Memory size

4096MB

B. Numerical Results To exhibit the simulation of abovementioned RKETD-FDTD formulation for the unmagnetized plasma media, we choose appropriate parameters that are listed as follows. The entire computational domain is 4096Dz in z axis, where Dz is selected to be Dz = 75 µ m as size of a cell. A single time step is taken to be Dt = 125ns . The unmagnetized plasma slab with the thickness of 9cm occupies the middle 1200 cells, the PML absorbing boundary conditions [26] are implemented at both ends of 5 cells to avoid undesired reflections, and the rest space is vacuum. The key parameters of unmagnetized plasma are 9 wp = 2p ´ 28.7 ´ 10 rad / s and n = 20GHz .

CUDA-implemented

reflection

computed

RKETD-FDTD

coefficient

and

respectively

by

method

achieves

agreement in reasonable accuracy with the RKETD-FDTD method and the analytical solution at higher frequency. 0

Reflection coefficient magnitude (dB)

Table 1 Specifications of NVIDIA 650M

of

-20

-40

-60

-80

FDTD on CPU FDTD on GPU Analytical solution 0

10

20

30

40

50

60

70

80

90

100

Frequency (GHz)

Figure 4 (color online) Reflection coefficient magnitude versus frequency

60 Free Space

Plasma Slab

Free Space

40 -20

20 FDTD on CPU FDTD on GPU Analytical solution

Time Step = 3000

Ex(V/m)

Transmission coefficient magnitude (dB)

0

-40

0

-20

-40 -60

0

10

20

30

40

50

60

70

80

90

FDTD on CPU FDTD on GPU

100

-60

Frequency (GHz)

0

1000

Figure 5 (color online) Transmission coefficient magnitude versus frequency

Figure 6 through 9 show the electric field magnitude at each FDTD cell after various time steps to demonstrate a plasma

slab

with

n = 50GHz

3000

4000

Figure 7 (color online) Electric field magnitude versus Yee’s cells after 3000 time steps

Gaussian-derivative pulsed plane wave propagating through an unmagnetized

2000

FDTD Cells

60

Free Space

and

Plasma Slab

Free Space

40

9 wp = 2p ´ 50 ´ 10 rad / s . Some sharp corners occur at the

boundary conditions must be satisfied.

Ex(V/m)

border of the plasma slab and vacuum, for the reason that the

20

Time Step = 4000 0

-20

-40

60 Free Space

Plasma Slab

-60

Free Space

FDTD on CPU FDTD on GPU 0

1000

3000

4000

Figure 8 (color online) Electric field magnitude versus Yee’s cells after 4000

20

Time Steps = 2000

Ex(V/m)

2000

FDTD Cells

40

time steps

0 60

-20

Free Space

Plasma Slab

Free Space

40

-40

0

1000

2000

3000

4000

FDTD Cells

20

Ex(V/m)

-60


Time Step = 5500 0

-20

Figure 6 (color online) Electric field magnitude versus Yee’s cells after 2000

-40


time steps -60

0

1000

2000

3000

4000

FDTD Cells

Figure 9 (color online) Electric field magnitude versus Yee’s cells after 5500 time steps

Due to the fact that the application speedup performance may vary with GPU device properties and developers’ programming levels, speedup ratio is calculated to evaluate the practical application speedup performance, which is termed as the ratio

between the elapsed times of FDTD computation part only on

CPU. Numerical experiments of these two methods have been

CPU and on GPU in the simulation. Table 2 presents the

done respectively on NVIDIA GPU and on CPU. The accuracy

elapsed time of FDTD computation part only on CPU and on

and

GPU and corresponding speedup ratio at diverse time steps

GPU-implemented RKETD-FDTD method were assessed by

with Yee’s cells unchanged.

the numerical simulation of the EM waves traveling through the

Table 2 Speedup ratio at diverse time steps in the simulation

unmagnetized plasma slab, compared with simply CPU-based

acceleration

performance

of

the

represented

CPU Times

GPU Times

Speedup

RKETD-FDTD method. The accuracy was verified by

(s)

(s)

Ratios

calculating the reflection and transmission coefficients for

1000

0.258

0.058

4.45

one-dimensional unmagnetized plasma slab. The comparison

3000

0.770

0.263

2.93

between the elapsed times of two methods proved that the

5000

1.292

0.481

2.69

posed GPU-based RKETD-FDTD method implemented with

7000

1.832

0.689

2.66

CUDA can acquire decent acceleration with sufficient accuracy.

9000

2.321

0.887

2.62

The further research will be predicted to obtain more

10000

2.579

1.014

2.54

satisfactory acceleration performance while the numerical

15000

3.906

1.513

2.58

results of GPU-based RKETD-FDTD method are generalized

20000

5.063

2.048

2.47

to two-dimensional and even three-dimensional conditions for

Time Steps

practical

applications.

Besides,

more

sophisticated

The speedup ratio is slowly decreasing with time step

CUDA-supported GPUs specialized in scientific computation

increasing. The speedup ratios calculated in Table 2 validate

such as Tesla can be utilized to acquire better acceleration

the speedup performance of the GPU-based RKETD-FDTD

performance.

method with respect to that only on CPU, by numerically simulating EM waves traveling through the unmagnetized plasma slab in one direction. We can easily make further improvement in speedup performance for practical applications when the numerical

REFERENCES [1] Yee, Kane S. "Numerical solution of initial boundary value problems involving Maxwell’s equations in isotropic media." IEEE Trans. Antennas Propag., vol. 14, no. 3, pp. 302-307, 1966. [2] Taflove, Allen, and Susan C. Hagness. Computational electrodynamics. Artech house, 2005.

results of GPU-based RKETD-FDTD method are generalized

[3] Taflove, Allen. "Review of the formulation and applications of the

to the two- dimensional and even three-dimensional conditions

finite-difference time-domain method for numerical modeling of

to get full use of the GPU computing resources, compared with

electromagnetic wave interactions with arbitrary structures." Wave Motion,

the one dimensional condition just to confirm the speedup performance. Besides, NVIDIA GeForce GT 650M used in the simulation is one product of the low-end GPUs that are not developed for scientific computation. More sophisticated CUDA-supported GPU such as Tesla can be upgraded to obtain better speedup performance.

vol. 10, no. 6, pp. 547-582, 1988. [4] Luebbers, Raymond J., and Forrest Hunsberger. "FDTD for N th-order dispersive media." IEEE Transactions on Antennas and Propagation, vol.40, no. 11, pp. 1297-1301, 1992. [5] Siushansian, Riaz, and Joe LoVetri. "A comparison of numerical techniques for modeling electromagnetic dispersive media." IEEE Microwave and Guided Wave Letters, vol. 5, no. 12, pp. 426-428, 1995. [6] Sullivan, Dennis M. "Frequency-dependent FDTD methods using Z transforms." IEEE Transactions on Antennas and Propagation, vol. 40, no.

V. CONCLUSION AND PERSPECTIVES In this letter, we presented a CUDA-based RKETD-FDTD method for the unmagnetized plasma implemented on GPU to acquire better acceleration performance, when compared with the previous RKETD-FDTD method simply implemented on

10, pp. 1223-1230, 1992. [7] Pereda, José, Ángel Vegas, and Andrés Prieto. "FDTD modeling of wave propagation in dispersive media by using the Mobius transformation technique." IEEE Transactions on Microwave Theory and Techniques, vol. 50, no. 7, pp. 1689-1695, 2002.

[8] Young, Jeffrey L. "Propagation in linear dispersive media: Finite difference

[22]Shams R, Sadeghi P. “On optimization of finite-difference time-domain

time-domain methodologies." IEEE Transactions on Antennas and

(FDTD) computation on heterogeneous and GPU clusters”. Journal of

Propagation, vol. 43, no. 4, pp. 422-426, 1995.

Parallel and Distributed Computing, vol. 71, no. 4, pp. 584-593, 2011.

[9] Young, Jeffrey L. "A higher order FDTD method for EM propagation in a

[23]Kim K H, Park Q H. “Overlapping computation and communication of

collisionless cold plasma." IEEE Transactions on Antennas and

three-dimensional FDTD on a GPU cluster”. Computer Physics

Propagation, vol. 44, no. 9, pp. 1283-1289, 1996.

Communications, vol. 183, no. 11, pp. 2364-2369, 2012.

[10] Chen, Qing, Makoto Katsurai, and Paul H. Aoyagi. "An FDTD

[24]Nagaoka, T., and Watanabe S. Accelerating “three-dimensional FDTD

formulation for dispersive media using a current density." IEEE

calculations on GPU clusters for electromagnetic field simulation”.

Transactions on Antennas and Propagation, vol. 46, no. 11, pp. 1739-1746,

International Conference of the IEEE Engineering in Medicine and

1998. [11] Takayama, Yoshihisa, and Werner Klaus. "Reinterpretation of the auxiliary differential equation method for FDTD." IEEE Microwave and Wireless Components Letters, vol. 12, no. 3, pp. 102-104, 2002. [12] Kelley, David F., and Raymond J. Luebbers. "Piecewise linear recursive convolution for dispersive media using FDTD." IEEE Transactions on Antennas and Propagation, vol. 44, no. 6, pp. 792-797, 1996. [13] Shaobin Liu, Naichang Yuan, and Jinjun Mo. "A novel FDTD formulation for dispersive media." IEEE Microwave and Wireless Components Letters, vol. 13, no. 5, pp. 187-189, 2003. [14] Song Liu, Shuangying Zhong, and Shaobin Liu. "Finite-difference time-domain algorithm for dispersive media based on Runge-Kutta exponential time differencing method." International Journal of Infrared and Millimeter Waves, vol. 29, no. 3, pp. 323-328, 2008. [15] Wang X, Liu S, Li X, Zhong S.Y. "GPU-accelerated finite-difference time-domain method for dielectric media based on CUDA," International Journal of RF and Microwave Computer-Aided Engineering, to be published. [16] Stefanski, Tomasz P., and Timothy D. Drysdale. "Acceleration of the 3D ADI-FDTD method using Graphics Processor Units." In Microwave Symposium Digest, 2009. MTT'09. IEEE MTT-S International. 2009. [17] De Donno, Danilo, Alessandra Esposito, Luciano Tarricone, and Luca Catarinucci. "Introduction to GPU computing and CUDA programming: A case study on FDTD." IEEE Antennas and Propagation Magazine, vol. 52, no. 3, pp. 116-122, 2010. [18] Zunoubi, Mohammad Reza, Jason Payne, and William P. Roach. "CUDA Implementation of TEz-FDTD Solution of Maxwell's Equations in Dispersive Media." IEEE Antennas and Wireless Propagation Letters, no. 9, pp. 756-759, 2010. [19] Lee, Kim Huat, Iftikhar Ahmed, Rick Siow Mong Goh, Eng Huat Khoo, Er Ping Li, and Terence Gih Guang Hung. "Implementation of the FDTD method based on Lorentz-Drude dispersive model on GPU for plasmonics applications." Progress In Electromagnetics Research vol. 116, pp. 441-456, 2011. [20] Livesey, Matthew, James Francis Stack, Fumie Costen, Takeshi Nanri, Norimasa Nakashima, and Seiji Fujino. "Development of a CUDA Implementation of the 3 D FDTD Method." IEEE Antennas & Propagation Magazine, vol. 54, no. 5, pp. 186-195, 2012. [21]Ong C, Weldon M, Cyca D, Okoniewski M. “Acceleration of large-scale FDTD simulations on high performance GPU clusters”. 2009 Antennas & Propagation Society International Symposium, 2009, 1-4.

Biology Society, 2012: 5691-4. [25]Cheng, John, Max Grossman, and Ty McKercher. “Professional CUDA C Programming”. John Wiley & Sons, 2014. [26] J.-P. Berenger, “A perfectly matched layer for the absorption of electromagnetic waves,” J. Comput. Phys., vol. 114, no. 1, pp. 185–200, 1994.

GPU-Accelerated Parallel Finite-Difference Time-Domain ... - arXiv

GPU-Accelerated Parallel Finite-Difference Time-Domain ... - arXiv

Suggest Documents

A massively parallel GPUaccelerated model for analysis of fully ...

A massively parallel GPUaccelerated model for analysis of ... - DTU Orbit

Parallel Polarization State Generation - arXiv

Stability of Inviscid Parallel Flows between Two Parallel Walls - arXiv

Optimistic Parallel State-Machine Replication - arXiv

Decision Sort and its Parallel Formulation - arXiv

Robustness maximization of parallel multichannel systems - arXiv

Parallel Monitors for Self-adaptive Sessions - arXiv

Stiffness Analysis of Overconstrained Parallel Manipulators - arXiv

a 4-dof parallel mechanism simulating - arXiv

heterogeneous highly parallel implementation of matrix ... - arXiv

Massively-Parallel Lossless Data Decompression - arXiv

Computer Assisted Parallel Program Generation - arXiv

Threadbare Parallel Port DAQ Card - arXiv

Parallel magnetic field induced strong negative ... - arXiv

Parallel Self-Sorting System for Objects - arXiv

Massively-Parallel Lossless Data Decompression - arXiv

Open architecture for multilingual parallel texts - arXiv

Multiprocessor Scheduling Using Parallel Genetic Algorithm - arXiv

The Glasgow Parallel Reduction Machine: Programming ... - arXiv

Threadbare Parallel Port DAQ Card - arXiv

Forest Packing: Fast, Parallel Decision Forests - arXiv

Parallel distinguishability of quantum operations - arXiv

Numerical Reproducibility and Parallel Computations: Issues ... - arXiv