Particle-in-Cell Plasma Simulation on CPUs, GPUs and Xeon Phi Coprocessors

Sergey Bastrakov (1), Iosif Meyerov (1), Igor Surmin (1), Evgeny Efimenko (2), Arkady Gonoskov (1,2), Alexander Malyshev (1), and Mikhail Shiryaev (1)

(1) Lobachevsky State University of Nizhni Novgorod, Russia
(2) Institute of Applied Physics, Russian Academy of Sciences, Nizhni Novgorod, Russia

Simulation of plasma dynamics with the Particle-in-Cell method is an area of computational physics currently in high demand. Solving present-day physical problems often requires large-scale plasma simulation. Given the growing popularity of GPGPU computing and the advent of Intel Xeon Phi coprocessors, there is considerable interest in high-performance implementations of the method for heterogeneous systems.

PICADOR [1] is a tool for three-dimensional, fully relativistic plasma simulation based on the Particle-in-Cell method. Its features include FDTD and NDF field solvers, the Boris particle pusher, CIC and TSC particle form factors, Esirkepov current deposition, ionization, and a moving frame. The code runs on heterogeneous cluster systems with CPUs, GPUs, and Xeon Phi coprocessors and supports dynamic load balancing. Each MPI process handles a part of the simulation area (a domain) using a multicore CPU via OpenMP, a GPU via CUDA, or a Xeon Phi coprocessor. MPI exchanges occur only between processes handling neighboring domains.

The Particle-in-Cell method operates on two major sets of data: an ensemble of charged particles (electrons and ions of various types) and grid values of the electromagnetic field and current density. A key aspect of a high-performance implementation is an efficient memory access pattern in the most computationally intensive particle-grid operations: field interpolation and current deposition. We store the particles of each cell in a separate array and process particles in cell-by-cell order; this scheme improves memory locality and allows vectorization of particle loops, as sketched below. These optimizations yield a combined 4x to 7x performance improvement over the baseline implementation. Throughout, we employ a performance metric widely used for Particle-in-Cell simulations: the computational time per particle per time step, expressed in nanoseconds per particle update.
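As an illustration of the cell-by-cell storage scheme described above, the following minimal sketch keeps the particles of each cell in their own structure-of-arrays buffers, parallelizes the outer loop over cells with OpenMP, and leaves the contiguous per-cell inner loop to the compiler's vectorizer. All names (CellParticles, pushCell, pushAllParticles) are hypothetical and this is not PICADOR's actual data layout:

    // Minimal sketch of per-cell particle storage; not PICADOR's actual data layout.
    #include <cstddef>
    #include <vector>

    // Structure-of-arrays storage for the particles of a single cell.
    struct CellParticles {
        std::vector<double> x, y, z;    // positions
        std::vector<double> vx, vy, vz; // velocities
    };

    // Simplified "push" (free streaming, no fields) that only illustrates the
    // memory-access pattern; the real code uses the Boris pusher with CIC/TSC
    // field interpolation.
    void pushCell(CellParticles& cp, double dt)
    {
        const std::size_t n = cp.x.size();
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i) {  // contiguous, unit-stride accesses
            cp.x[i] += cp.vx[i] * dt;
            cp.y[i] += cp.vy[i] * dt;
            cp.z[i] += cp.vz[i] * dt;
        }
    }

    // Outer loop over cells is distributed across cores; the inner loop over the
    // particles of one cell is vectorized.
    void pushAllParticles(std::vector<CellParticles>& cells, double dt)
    {
        #pragma omp parallel for schedule(dynamic)
        for (std::ptrdiff_t c = 0; c < (std::ptrdiff_t)cells.size(); ++c)
            pushCell(cells[c], dt);
    }

Because all particles of a cell touch the same small stencil of grid values, this access pattern also keeps the relevant field and current values cache-resident during field interpolation and current deposition.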

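For reference, a minimal sketch, with hypothetical names, of how the nanoseconds-per-particle-update figure defined above is obtained from a run's wall-clock time:

    // Wall-clock time per particle per time step, in nanoseconds.
    // Hypothetical helper, shown only to make the metric explicit.
    double nsPerParticleUpdate(double wallTimeSeconds,
                               long long numParticles,
                               long long numTimeSteps)
    {
        return wallTimeSeconds * 1e9 / (double(numParticles) * double(numTimeSteps));
    }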
In a simulation of dense plasma with first-order field interpolation and current deposition in double precision, PICADOR achieves 12 nanoseconds per particle update on an 8-core Intel Xeon E5-2690 CPU with 99% strong scaling efficiency on shared memory, which is competitive with state-of-the-art implementations [2, 3]. The Xeon Phi implementation is essentially the same C++/OpenMP code as for CPUs, with minor differences in the compiler directives that control vectorization. A Xeon Phi 7110X coprocessor in native mode achieves 8 nanoseconds per particle update on the same benchmark, outperforming the Xeon E5-2690 CPU by a factor of 1.5. A heterogeneous Xeon + Xeon Phi configuration, with one process running on the processor and another on the coprocessor, achieves 6 nanoseconds per particle update. However, other heterogeneous configurations, such as 2x Xeon + Xeon Phi or Xeon + 2x Xeon Phi, yield no further performance benefit due to the high overhead of MPI exchanges. A major performance drawback on CPUs, and particularly on the Xeon Phi, is that the scatter of field components on the Yee grid hinders efficient memory access in the vectorized field interpolation.

The GPU implementation employs a variation of the widely used supercell technique [4], with a CUDA block processing the particles of a supercell. The main performance challenge in the GPU implementation is current deposition, which requires a reduction over the contributions of all threads in a block. We have two implementations of this operation: reduction in shared memory and reduction via atomic operations (the latter is sketched below). The first performs better on Fermi-generation GPUs, while the second is preferable on Kepler-generation GPUs, achieving 4x and 10x speedups over 8 CPU cores in single precision, respectively.
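The atomic-operation variant of current deposition can be sketched as follows. This is a minimal illustration under assumptions not stated above (zeroth-order deposition of a single current component, particles already sorted by supercell with prefix-sum offsets); all names are hypothetical and this is not PICADOR's actual kernel:

    #include <cuda_runtime.h>

    struct Particle {
        float x, y, z;     // position
        float vx, vy, vz;  // velocity
        float weight;      // macroparticle weight
    };

    // Particles are stored supercell by supercell; supercellOffsets holds prefix
    // sums, so supercell s owns particles [supercellOffsets[s], supercellOffsets[s+1]).
    // One CUDA block processes one supercell, as in the supercell technique.
    __global__ void depositJx(const Particle* particles,
                              const int* supercellOffsets,
                              float* Jx,            // flattened global grid of the Jx component
                              int nx, int ny,       // grid extents in x and y
                              float invDx, float invDy, float invDz)
    {
        const int first = supercellOffsets[blockIdx.x];
        const int last  = supercellOffsets[blockIdx.x + 1];

        // Each thread takes a strided subset of the supercell's particles.
        for (int p = first + threadIdx.x; p < last; p += blockDim.x) {
            const Particle pt = particles[p];

            // Zeroth-order (nearest-cell) deposition keeps the sketch short; the
            // real code spreads each contribution with CIC/Esirkepov form factors.
            const int ix = (int)(pt.x * invDx);
            const int iy = (int)(pt.y * invDy);
            const int iz = (int)(pt.z * invDz);
            const int cell = (iz * ny + iy) * nx + ix;

            // Threads of this block (and of blocks handling neighboring supercells)
            // may write to the same cell, so contributions are accumulated with
            // atomics rather than an explicit per-block reduction.
            atomicAdd(&Jx[cell], pt.weight * pt.vx);
        }
    }

The shared-memory variant would instead accumulate into a per-block tile in shared memory and flush it to the global grid once per block, trading atomics for an extra reduction step; as noted above, that version is the faster choice on Fermi-generation hardware.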

PICADOR is developed and used by the HPC Center of the University of Nizhni Novgorod and the Institute of Applied Physics of the Russian Academy of Sciences for the simulation of laser-matter interaction. The code architecture is extensible with additional stages and devices and is capable of using modern heterogeneous cluster systems with CPUs, GPUs, and Intel Xeon Phi coprocessors. The performance and scaling efficiency are competitive with other implementations. Future work includes better load balancing between CPUs, GPUs, and Xeon Phi coprocessors, further optimization of the GPU and Xeon Phi implementations, and the development and optimization of additional modules to address a larger set of problems.

The study was supported by the RFBR, research project No. 14-07-31211.

References

1. Bastrakov, S., Donchenko, R., Gonoskov, A., Efimenko, E., Malyshev, A., Meyerov, I., Surmin, I.: Particle-in-cell plasma simulation on heterogeneous cluster systems. Journal of Computational Science 3 (2013) 474-479.
2. Fonseca, R.A., Vieira, J., Fiuza, F., Davidson, A., Tsung, F.S., Mori, W.B., Silva, L.O.: Exploiting multi-scale parallelism for large scale numerical modelling of laser wakefield accelerators. Plasma Physics and Controlled Fusion 55 (2013).
3. Decyk, V.K., Singh, T.V.: Particle-in-Cell algorithms for emerging computer architectures. Computer Physics Communications 185 (2014) 708-719.
4. Burau, H., Widera, R., Honig, W., Juckeland, G., Debus, A., Kluge, T., Schramm, U., Cowan, T.E., Sauerbrey, R., Bussmann, M.: PIConGPU: A Fully Relativistic Particle-in-Cell Code for a GPU Cluster. IEEE Transactions on Plasma Science 38 (2010) 2831-2839.