Particle-in-Cell Plasma Simulation on CPUs, GPUs and Xeon Phi Coprocessors

Sergey Bastrakov (1), Iosif Meyerov (1), Igor Surmin (1), Evgeny Efimenko (2), Arkady Gonoskov (1,2), Alexander Malyshev (1), and Mikhail Shiryaev (1)

(1) Lobachevsky State University of Nizhni Novgorod, Russia
(2) Institute of Applied Physics, Russian Academy of Sciences, Nizhni Novgorod, Russia

Simulation of plasma dynamics with the Particle-in-Cell method is an area of computational physics currently in high demand. Solving present-day physical problems often requires large-scale plasma simulation. Given the growing popularity of GPGPU computing and the advent of Intel Xeon Phi coprocessors, there is considerable interest in high-performance implementations of the method for heterogeneous systems.

PICADOR [1] is a tool for three-dimensional, fully relativistic plasma simulation based on the Particle-in-Cell method. Its features include FDTD and NDF field solvers, the Boris particle pusher, CIC and TSC particle form factors, Esirkepov current deposition, ionization, and a moving frame. The code runs on heterogeneous cluster systems with CPUs, GPUs, and Xeon Phi coprocessors and supports dynamic load balancing. Each MPI process handles a part of the simulation area (a domain) using a multicore CPU via OpenMP, a GPU via CUDA, or a Xeon Phi coprocessor. MPI exchanges occur only between processes handling neighboring domains.

The Particle-in-Cell method operates on two major sets of data: an ensemble of charged particles (electrons and ions of various types) and grid values of the electromagnetic field and current density. A key aspect of a high-performance implementation is an efficient memory access pattern in the most computationally intensive particle-grid operations: field interpolation and current deposition. We store the particles of each cell in a separate array and process particles in cell-by-cell order; this scheme improves memory locality and allows vectorization of particle loops, as sketched below. These optimizations yield a combined 4x to 7x performance improvement over the baseline implementation. Throughout, we employ a performance metric widely used for Particle-in-Cell simulations: the computational time per particle per time step, expressed in nanoseconds per particle update.
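As an illustration of the cell-by-cell storage scheme described above, the following minimal sketch keeps the particles of each cell in their own structure-of-arrays buffers, parallelizes the outer loop over cells with OpenMP, and leaves the contiguous per-cell inner loop to the compiler's vectorizer. All names (CellParticles, pushCell, pushAllParticles) are hypothetical and this is not PICADOR's actual data layout:

    // Minimal sketch of per-cell particle storage; not PICADOR's actual data layout.
    #include <cstddef>
    #include <vector>

    // Structure-of-arrays storage for the particles of a single cell.
    struct CellParticles {
        std::vector<double> x, y, z;    // positions
        std::vector<double> vx, vy, vz; // velocities
    };

    // Simplified "push" (free streaming, no fields) that only illustrates the
    // memory-access pattern; the real code uses the Boris pusher with CIC/TSC
    // field interpolation.
    void pushCell(CellParticles& cp, double dt)
    {
        const std::size_t n = cp.x.size();
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i) {  // contiguous, unit-stride accesses
            cp.x[i] += cp.vx[i] * dt;
            cp.y[i] += cp.vy[i] * dt;
            cp.z[i] += cp.vz[i] * dt;
        }
    }

    // Outer loop over cells is distributed across cores; the inner loop over the
    // particles of one cell is vectorized.
    void pushAllParticles(std::vector<CellParticles>& cells, double dt)
    {
        #pragma omp parallel for schedule(dynamic)
        for (std::ptrdiff_t c = 0; c < (std::ptrdiff_t)cells.size(); ++c)
            pushCell(cells[c], dt);
    }

Because all particles of a cell touch the same small stencil of grid values, this access pattern also keeps the relevant field and current values cache-resident during field interpolation and current deposition.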

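For reference, a minimal sketch, with hypothetical names, of how the nanoseconds-per-particle-update figure defined above is obtained from a run's wall-clock time:

    // Wall-clock time per particle per time step, in nanoseconds.
    // Hypothetical helper, shown only to make the metric explicit.
    double nsPerParticleUpdate(double wallTimeSeconds,
                               long long numParticles,
                               long long numTimeSteps)
    {
        return wallTimeSeconds * 1e9 / (double(numParticles) * double(numTimeSteps));
    }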
In a simulation of dense plasma with first-order field interpolation and current deposition in double precision, PICADOR achieves 12 nanoseconds per particle update on an 8-core Intel Xeon E5-2690 CPU with 99% strong scaling efficiency on shared memory, which is competitive with state-of-the-art implementations [2, 3]. The Xeon Phi implementation is essentially the same C++/OpenMP code as for CPUs, with minor differences in the compiler directives that control vectorization. A Xeon Phi 7110X coprocessor in native mode achieves 8 nanoseconds per particle update on the same benchmark, outperforming the Xeon E5-2690 CPU by a factor of 1.5. A heterogeneous Xeon + Xeon Phi configuration, with one process running on the processor and another on the coprocessor, achieves 6 nanoseconds per particle update. However, other heterogeneous configurations, such as 2x Xeon + Xeon Phi or Xeon + 2x Xeon Phi, yield no further performance benefit due to the high overhead of MPI exchanges. A major performance drawback on CPUs, and particularly on the Xeon Phi, is that the scatter of field components on the Yee grid hinders efficient memory access in the vectorized field interpolation.

The GPU implementation employs a variation of the widely used supercell technique [4], with a CUDA block processing the particles of a supercell. The main performance challenge in the GPU implementation is current deposition, which requires a reduction over the contributions of all threads in a block. We have two implementations of this operation: reduction in shared memory and reduction via atomic operations (the latter is sketched below). The first performs better on Fermi-generation GPUs, while the second is preferable on Kepler-generation GPUs, achieving 4x and 10x speedups over 8 CPU cores in single precision, respectively.
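The atomic-operation variant of current deposition can be sketched as follows. This is a minimal illustration under assumptions not stated above (zeroth-order deposition of a single current component, particles already sorted by supercell with prefix-sum offsets); all names are hypothetical and this is not PICADOR's actual kernel:

    #include <cuda_runtime.h>

    struct Particle {
        float x, y, z;     // position
        float vx, vy, vz;  // velocity
        float weight;      // macroparticle weight
    };

    // Particles are stored supercell by supercell; supercellOffsets holds prefix
    // sums, so supercell s owns particles [supercellOffsets[s], supercellOffsets[s+1]).
    // One CUDA block processes one supercell, as in the supercell technique.
    __global__ void depositJx(const Particle* particles,
                              const int* supercellOffsets,
                              float* Jx,            // flattened global grid of the Jx component
                              int nx, int ny,       // grid extents in x and y
                              float invDx, float invDy, float invDz)
    {
        const int first = supercellOffsets[blockIdx.x];
        const int last  = supercellOffsets[blockIdx.x + 1];

        // Each thread takes a strided subset of the supercell's particles.
        for (int p = first + threadIdx.x; p < last; p += blockDim.x) {
            const Particle pt = particles[p];

            // Zeroth-order (nearest-cell) deposition keeps the sketch short; the
            // real code spreads each contribution with CIC/Esirkepov form factors.
            const int ix = (int)(pt.x * invDx);
            const int iy = (int)(pt.y * invDy);
            const int iz = (int)(pt.z * invDz);
            const int cell = (iz * ny + iy) * nx + ix;

            // Threads of this block (and of blocks handling neighboring supercells)
            // may write to the same cell, so contributions are accumulated with
            // atomics rather than an explicit per-block reduction.
            atomicAdd(&Jx[cell], pt.weight * pt.vx);
        }
    }

The shared-memory variant would instead accumulate into a per-block tile in shared memory and flush it to the global grid once per block, trading atomics for an extra reduction step; as noted above, that version is the faster choice on Fermi-generation hardware.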

PICADOR is developed and used by the HPC Center of the University of Nizhni Novgorod and the Institute of Applied Physics of the Russian Academy of Sciences for the simulation of laser-matter interaction. The code architecture is extensible with additional stages and devices and is capable of using modern heterogeneous cluster systems with CPUs, GPUs, and Intel Xeon Phi coprocessors. The performance and scaling efficiency are competitive with other implementations. Future work includes better load balancing between CPUs, GPUs, and Xeon Phi coprocessors, further optimization of the GPU and Xeon Phi implementations, and the development and optimization of additional modules to address a larger set of problems.

The study was supported by the RFBR, research project No. 14-07-31211.

References

1. Bastrakov, S., Donchenko, R., Gonoskov, A., Efimenko, E., Malyshev, A., Meyerov, I., Surmin, I.: Particle-in-cell plasma simulation on heterogeneous cluster systems. Journal of Computational Science 3 (2013) 474-479.
2. Fonseca, R.A., Vieira, J., Fiuza, F., Davidson, A., Tsung, F.S., Mori, W.B., Silva, L.O.: Exploiting multi-scale parallelism for large scale numerical modelling of laser wakefield accelerators. Plasma Physics and Controlled Fusion 55 (2013).
3. Decyk, V.K., Singh, T.V.: Particle-in-Cell algorithms for emerging computer architectures. Computer Physics Communications 185 (2014) 708-719.
4. Burau, H., Widera, R., Honig, W., Juckeland, G., Debus, A., Kluge, T., Schramm, U., Cowan, T.E., Sauerbrey, R., Bussmann, M.: PIConGPU: A Fully Relativistic Particle-in-Cell Code for a GPU Cluster. IEEE Transactions on Plasma Science 38 (2010) 2831-2839.