Institute of Physics Publishing doi:10.1088/1742-6596/16/1/066
Journal of Physics: Conference Series 16 (2005) 481–485 SciDAC 2005
Advanced 3D Poisson solvers and particle-in-cell methods for accelerator modeling
David B. Serafini, Peter McCorquodale and Phillip Colella
Lawrence Berkeley National Lab, Applied Numerical Algorithms Group, SciDAC Applied Differential Equations Center
E-mail: {dbs,petermc,colella}@hpcrdm.lbl.gov
Abstract. We seek to improve on the conventional FFT-based algorithms for solving the Poisson equation with infinite-domain (open) boundary conditions for large problems in accelerator modeling and related areas. In particular, improvements in both accuracy and performance are possible by combining several technologies: the method of local corrections (MLC), the James algorithm, and adaptive mesh refinement (AMR). The MLC enables the parallelization (by domain decomposition) of problems with large domains and many grid points. This improves on the FFT-based Poisson solvers typically used because it does not need the all-to-all communication pattern of parallel 3D FFT algorithms, which tends to be a performance bottleneck on current (and foreseeable) parallel computers. In initial tests, good scalability up to 1000 processors has been demonstrated for our new MLC solver. An essential component of our approach is a new version of the James algorithm for infinite-domain boundary conditions in three dimensions. By using a simplified version of the fast multipole method in the boundary-to-boundary potential calculation, we improve on the Hockney algorithm typically used, reducing the number of grid points by a factor of 8 and the CPU cost by a factor of 3. This is particularly important for large problems where computer memory limits are a consideration. The MLC also allows the use of adaptive mesh refinement, which reduces the number of grid points and increases the accuracy of the Poisson solution. This improves on the uniform-grid methods typically used in PIC codes, particularly in beam problems where the halo is large. In addition, the number of particles per cell can be controlled more closely with adaptivity than with a uniform grid. Using AMR with particles is more complicated than using uniform grids: it affects how particles are deposited on the non-uniform grid, how particles are reassigned when the adaptive grid changes, and how the load balance between processors is maintained as grids and particles move. New algorithms and software are being developed to solve these problems efficiently. We are using the Chombo AMR software framework as the basis for this work.
1. Introduction
The SciDAC APDEC has been working on developing new simulation software for problems involving the interaction of particles and fields. The driving application for this work has been electrostatic particle accelerator simulations, exemplified by the MaryLie/Impact (ML/I) [1] code being developed in the Accelerator Science and Technology SciDAC project. There are two parts to the APDEC effort: solver methods for the Poisson equation, and particle-in-cell methods with adaptive mesh refinement (PIC/AMR).
The APDEC development effort is motivated by several problems facing existing methods for this application space. Most existing PIC methods use uniform grids, which are inherently inefficient when the particle density varies significantly across the spatial domain (e.g. halos, injectors). The infinite-domain boundary condition often used in accelerator simulations is not supported by most advanced Poisson solver methods. High-accuracy particle simulations can require extremely large numbers of particles (billions), which demand large amounts of memory and computational capability. These requirements can only be met on large parallel computers, but most of the existing software does not scale well enough to run efficiently on such machines.

The APDEC approach combines solver methods that can scale to thousands of processors with an AMR approach that can significantly reduce memory and computational resource requirements compared to using a uniform grid. Alas, there is no free lunch: AMR complicates the main tasks of PIC methods, namely depositing particle data to the grid, interpolating grid data back to the particles, and "pushing" particles between processors. Algorithms and data structures for handling particles efficiently and scalably on adaptive meshes are being developed in conjunction with the new solver methods.

Another significant issue is the integration of new software into existing applications that have been deployed for years and have established user communities. One of the goals of the APDEC development effort is to design the application programming interfaces (APIs) for our new software to fit with the methods used in legacy applications. Also, since existing uniform-grid PIC methods combined with FFT-based Poisson solvers are essentially optimal for small and medium sized problems, it is desirable that our new methods be able to coexist with the existing methods in an application, so that a user may choose the best approach for each problem. The ML/I and Warp [2] codes have been used as integration targets throughout the development process.

In reality, the adoption of new software methods into existing applications usually meets significant resistance from application developers. To be successful in the field, the APDEC methods and implementation must be more accurate than existing methods, able to solve larger problems with better performance, and relatively easy to use. Our ultimate goal is to fulfill all these requirements.

2. Solving Poisson's equation
We have developed two new Poisson solvers, one for Dirichlet and Neumann boundary conditions and one for infinite domains. The former has been discussed in previous SciDAC reviews [3]; it uses AMR and multigrid technologies that have been developed at LBNL for many years and are at the core of the Chombo [4] software.¹ The infinite-domain solver is more recent.² It uses a new algorithm based on the method of James [5] for solving on infinite domains, combined with the Method of Local Corrections of Balls and Colella [6] for high performance on large parallel computers. The algorithm and implementation of this solver are described in detail in [7]; a brief description follows.

The solution to Poisson's equation ∆u = f with infinite-domain boundary conditions on a finite domain D containing the support of f can be computed from a pair of solutions with Dirichlet boundary conditions on finite domains D ⊂ D1 ⊂ D2 (Figure 1).
First, Poisson's equation is solved with homogeneous Dirichlet boundary conditions on D1. Then the normal derivative of that solution on the boundary of D1 is convolved with a Green's function to obtain boundary values on D2, providing an inhomogeneous Dirichlet boundary condition for solving Poisson's equation on D2. This gives the solution to the infinite-domain problem on the D portion of D2. The computational cost of this method is dominated by the convolution.
¹ The Chombo software is available at .
² The infinite-domain solver is not included in the latest release of Chombo, but will be included in a future release.
Figure 1. James's algorithm: solve ∆u1 = f on D1 with u1 = 0 on ∂D1; compute q = ∂u1/∂n on ∂D1 and g = G ◦ q on ∂D2; solve ∆u2 = f on D2 with u2 = g on ∂D2.
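To make the cost of this step concrete, the sketch below evaluates the boundary-to-boundary potential g = G ◦ q by direct summation over all pairs of source points on ∂D1 and target points on ∂D2, which is O(n⁴) work for O(n²) points on each surface. This is a minimal illustration only, not the APDEC implementation; the free-space kernel G(r) = −1/(4π|r|), the sampled boundary points, and the uniform area element are assumptions made for the example.

    import numpy as np

    def direct_boundary_convolution(src_pts, src_q, src_area, tgt_pts):
        """Direct O(n^4) evaluation of g(x) = sum_y G(x - y) q(y) dA(y)
        with the free-space Green's function G(r) = -1/(4*pi*|r|).

        src_pts : (Ns, 3) points sampling the inner boundary (dD1)
        src_q   : (Ns,)   normal-derivative "charge" q at those points
        src_area: scalar area element associated with each source point
        tgt_pts : (Nt, 3) points sampling the outer boundary (dD2)
        returns : (Nt,)   potential g at the target points
        """
        g = np.empty(len(tgt_pts))
        for i, x in enumerate(tgt_pts):                    # loop over targets on dD2
            r = np.linalg.norm(src_pts - x, axis=1)        # distances to all sources
            g[i] = np.sum(-src_q * src_area / (4.0 * np.pi * r))
        return g

    # Toy usage: sampled points standing in for dD1 and one face of dD2.
    rng = np.random.default_rng(0)
    src = rng.uniform(-0.5, 0.5, size=(400, 3))            # points on/near dD1
    tgt = np.column_stack([rng.uniform(-1.5, 1.5, size=(200, 2)),
                           np.full(200, 1.5)])             # one face of dD2 (z = 1.5)
    q = rng.standard_normal(400)
    g = direct_boundary_convolution(src, q, src_area=1e-2, tgt_pts=tgt)
    print(g.shape)

Replacing this all-pairs sum with a multipole approximation is what reduces the cost to O(Kn²), as described next.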
In the APDEC method, a multipole algorithm is used to approximate the convolution, reducing its cost from O(n⁴) to O(Kn²), where n is the number of grid points along a side and K depends on the domain decomposition used in the multipole algorithm. K does increase with problem size, but it can be chosen so that the convolution no longer dominates the total cost.

The Dirichlet Poisson solves use an FFT-based algorithm, which does not parallelize scalably because of the large amount of interprocessor communication required. Using the Method of Local Corrections, the whole domain is decomposed into disjoint subdomains, and the infinite-domain solver is applied to each subdomain independently. Thus the parallel computational cost scales with the subdomain size, not the total problem size. These subdomain solutions are coupled by projecting them onto a single coarse global grid with a discrete Laplacian operator and solving there. The global solution is interpolated back to the local grids and another solve is done, this time with Dirichlet boundary conditions computed by combining the fine and coarse solutions. This method scales well in parallel because communication is limited to the coarse-fine interactions (there are no fine-fine interactions, as the subdomains are independent). Also, the cost of the fine-level Poisson solves depends on the size of the fine-grid subdomains, which can be held relatively constant as the problem size grows. The main performance bottleneck becomes the coarse-level solve, which is not large enough to parallelize well. The APDEC implementation makes a reasonable tradeoff and parallelizes the coarse-level solve using a small (fixed) number of processors.³

3. Particle handling
The issues of algorithm design and implementation of PIC methods for uniform grids are well understood. Adaptive meshes add some issues and complicate others. Achieving scalable performance is a primary issue (scalable with respect to the number of particles, the number of grid points, and the number of processors). Managing the particle data on multiple grid levels is another. Interfacing the new PIC/AMR implementations with existing uniform-grid applications is an issue that cuts across all others.

The Chombo AMR software framework has been used as the foundation for our PIC/AMR development effort. Chombo provides powerful data structures for multilevel adaptive grids in parallel. The Chombo data structures have been augmented for particles to leverage the similarity between the block-structured meshes used by Chombo and the uniform grids used by most existing methods. This development work is ongoing.
³ On seaborg, 8 processors are used for the coarse solve: although each SMP node has 16 CPUs, the node's memory bandwidth is insufficient to support all of them.
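For reference, the uniform-grid deposit step that these augmented particle containers must generalize to multiple levels is straightforward. The sketch below is a minimal cloud-in-cell (trilinear) deposit in NumPy, not code from Chombo, ML/I or Warp; the grid shape, spacing and particle layout are assumptions made for the example.

    import numpy as np

    def cic_deposit(positions, charges, shape, h, origin=0.0):
        """Cloud-in-cell (trilinear) deposit of particle charges onto a
        node-centered uniform 3D grid. Illustrative only.

        positions : (Np, 3) particle coordinates (assumed inside the grid)
        charges   : (Np,)   particle charges
        shape     : (nx, ny, nz) number of grid nodes per dimension
        h         : grid spacing
        """
        rho = np.zeros(shape)
        s = (positions - origin) / h          # normalized coordinates
        i0 = np.floor(s).astype(int)          # lower node index per dimension
        w1 = s - i0                           # fractional offset in [0, 1)
        w0 = 1.0 - w1
        for p in range(len(charges)):
            ix, iy, iz = i0[p]
            for dx, wx in ((0, w0[p, 0]), (1, w1[p, 0])):
                for dy, wy in ((0, w0[p, 1]), (1, w1[p, 1])):
                    for dz, wz in ((0, w0[p, 2]), (1, w1[p, 2])):
                        rho[ix + dx, iy + dy, iz + dz] += charges[p] * wx * wy * wz
        return rho / h**3                     # charge per cell volume

    # Toy usage: 1000 unit charges on a 32^3 node grid covering [0, 1]^3.
    rng = np.random.default_rng(1)
    pos = rng.uniform(0.0, 1.0, size=(1000, 3))
    rho = cic_deposit(pos, np.ones(1000), shape=(32, 32, 32), h=1.0 / 31.0)

The AMR generalization must additionally decide which level each particle deposits to and how to treat deposits near coarse-fine interfaces, which is where the complications discussed below arise.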
Table 1. Timings for solving Poisson's equation with parallel MLC. The global solve is done on eight processors. P is the number of processors; N³ is the total number of grid points in the original problem; q is the number of pieces into which the domain is subdivided in each dimension; C is the coarsening ratio to the global grid; Work is the number of grid points per processor; WallTime is the elapsed time; time/point is P × WallTime / N³; Comm. Time is the percentage of WallTime spent in communication.

       P      N    q    C   Work (N³/P)   WallTime (s)   time/pt (µs)   Comm. Time (% of WallTime)
     128    768    8    6    3.5 × 10⁶        36.7           10.3              5.7
     256   1024    8    8    4.2 × 10⁶        43.5           10.3              5.0
     512   1280    8   10    4.1 × 10⁶        44.5           10.8              5.2
    1024   1536   16   12    3.5 × 10⁶        47.1           13.3              8.6
    2048   2048   16   16    4.2 × 10⁶        59.6           14.2              6.9
    4096   2560   16   20    4.1 × 10⁶        59.2           17.9              5.6
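As a reader-side consistency check of the derived columns (assuming only the formulas stated in the caption), the first row reproduces the tabulated values to within rounding:

    # First row of Table 1: P = 128, N = 768, wall time = 36.7 s.
    P, N, wall_time = 128, 768, 36.7
    work = N**3 / P                                 # grid points per processor
    time_per_point = P * wall_time * 1e6 / N**3     # microseconds per point
    print(f"work ~ {work:.1e}, time/pt ~ {time_per_point:.1f} us")
    # -> work ~ 3.5e+06, time/pt ~ 10.4 us  (the table lists 3.5e6 and 10.3 us)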
Load balancing is a more important, and more difficult, aspect of the performance problem with particles and adaptive meshes. The load depends on the number of particles per processor and the number of grid points, as well as the distribution of both to processors. It is impractical to predict the load accurately a priori because the ratio of computational work per particle to work per grid point is unknown and depends on many factors. Unfortunately, as the number of particles and processors becomes very large, it becomes more important to balance the load well.

Another performance problem due to adaptive meshes is the cost of moving particle data between grids, and particle and grid data between processors, when the mesh adapts. This is strongly related to the load-balancing problem. There are no silver bullets here: the goal of minimizing data motion can be achieved in a variety of ways, involving tradeoffs among accuracy, performance and implementation complexity. There is still much work to be done in this area. This also relates to the issue of interfacing to existing applications, since they may constrain when the mesh may be changed.

One numerical algorithm issue of continuing concern is the so-called "self-force" problem. This pertains to interpolation between particles and grid variables at coarse-fine interfaces in adaptive grids. For a particle near such an interface, the interpolation functional on the coarse-grid side of the particle will be slightly different from that on the fine-grid side, because of the different spacings of the grid points used in the interpolation.
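The effect can be seen in a small one-dimensional sketch (purely illustrative, with assumed spacings and a simple linear gather; this is not the interpolation scheme used in the APDEC code): a particle at a fixed position near the interface sees different stencil nodes and weights depending on whether the fine spacing h or the coarse spacing 2h is used, so the gathered field changes discontinuously across the interface and a particle's own deposited contribution no longer cancels exactly.

    import numpy as np

    def linear_gather_weights(x, h):
        """Weights of the two nodes bracketing x on a uniform grid of spacing h."""
        i = int(np.floor(x / h))
        w_right = x / h - i
        return {i: 1.0 - w_right, i + 1: w_right}

    # A particle just to the left of a coarse-fine interface at x = 1.0.
    x_particle = 0.98
    h_fine, h_coarse = 1.0 / 16, 1.0 / 8

    print(linear_gather_weights(x_particle, h_fine))    # fine side: nodes 15, 16
    print(linear_gather_weights(x_particle, h_coarse))  # coarse side: nodes 7, 8
    # The two stencils use different node locations and weights, so the
    # interpolated field is not identical on the two sides of the interface,
    # which is the source of the spurious "self-force" near the interface.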
4. Status
A parallel infinite-domain MLC solver has been implemented and tested. Initial results indicate the expected second-order accuracy. The solver currently supports only one level of refinement; development of multi-level refinement is in progress. Performance results are shown in Table 1. The solver runs on up to 4096 processors on the IBM SP system at NERSC (seaborg) with at most 9% of the time spent in communication. More than half of the computation time is spent in fast Fourier transforms. In the examples run on the SP, the processor time per point varies from 10 to 18 microseconds for up to 4096 processors.

On the PIC/AMR software front, several prototypes have been developed. An interface between the AMR solver for Dirichlet boundary conditions and the WARP accelerator code was discussed in [8]. This solver was also integrated with the ML/I code, with satisfactory results. Another prototype integrating this solver with the QuickPIC code [9] is in development. An early, non-adaptive version of the infinite-domain James/MLC solver was integrated with the ML/I code as a proof of concept; the results of this test were very encouraging. A prototype PIC code using the latest adaptive solver is in the early stage of development. This code will be used to investigate the performance issues discussed above: parallel scalability, especially load balancing, and particle data movement. Lessons learned from this prototype will be used to integrate the AMR solver with the ML/I code. The goal is to be able to perform calculations with billions of particles on grids sufficiently large to achieve engineering accuracy.

5. Future plans
Future solver efforts will focus on adding multi-level adaptivity, improving performance for large parallel problems, and solving the self-force problem. Future PIC/AMR efforts will focus on finishing the prototype code and tuning its performance for large problems. The resulting code will be integrated with ML/I and tested on large problems of interest to the SciDAC accelerator modeling community.

6. References
[1] R. Ryne, J. Qiang, A. Dragt, S. Habib, T. Mottershead, F. Neri, R. Samulyak, and P. Walstrom. MaryLie/IMPACT: a parallel beam dynamics code with space charge. In Proceedings of the Computational Accelerator Physics Conference, October 2002.
[2] D. P. Grote, A. Friedman, J.-L. Vay, and I. Haber. The WARP code: modeling high intensity ion beams. In AIP Conference Proceedings, volume 749, page 55, 2005.
[3] J. B. Bell et al. The Applied Differential Equations Center (APDEC): software development. In 2004 SciDAC PI Meeting, March 2004. http://www.csm.ornl.gov/workshops/DOE SciDAC/papers.html.
[4] P. Colella, D. T. Graves, T. J. Ligocki, D. F. Martin, D. Modiano, D. B. Serafini, and B. Van Straalen. Chombo Software Package for AMR Applications: Design Document. Unpublished, 2000.
[5] R. A. James. The solution of Poisson's equation for isolated source distributions. J. Comput. Phys., 25:71–93, 1977.
[6] G. T. Balls and P. Colella. A finite difference domain decomposition method using local corrections for the solution of Poisson's equation. J. Comput. Phys., 180:25–53, 2002.
[7] P. McCorquodale, P. Colella, G. T. Balls, and S. B. Baden. A scalable parallel Poisson solver in three dimensions with infinite-domain boundary conditions. In 7th International Workshop on High Performance Scientific and Engineering Computing (HPSEC-05), Oslo, Norway, pages 814–822, June 2005.
[8] P. McCorquodale, P. Colella, D. P. Grote, and J.-L. Vay. A node-centered local refinement algorithm for Poisson's equation in complex geometries. J. Comput. Phys., 201(1):34–60, November 2004.
[9] C. Huang, V. Decyk, S. Wang, E. S. Dodd, C. Ren, and W. B. Mori. A parallel particle-in-cell code for efficiently modeling plasma wakefield acceleration: QuickPIC. In Proceedings of the 18th Annual Review of Progress in Applied Computational Electromagnetics, Monterey, CA, page 557, March 2002.