Modeling Particle Accelerators using C++ and the POOMA Framework¹
Graham A. Mark, William F. Humphrey, Julian C. Cummings, Timothy J. Cleland, Robert D. Ryne, and Salman Habib
Los Alamos National Laboratory, Los Alamos, NM
1 Introduction

This paper concerns the use of C++ and the POOMA Framework [1] to model high-intensity particle accelerators. This work is part of the Computational Accelerator Physics Grand Challenge, sponsored by the U.S. Department of Energy. Another paper in this conference, "The DOE Grand Challenge in Computational Accelerator Physics" [2], describes the goals and progress to date of the project. This Grand Challenge project requires implementation of well-known numerical methods in electromagnetic particle-in-cell simulation on the latest parallel computer architectures, development of alternative computational approaches, and smooth interaction of multiple physics packages. For these reasons, we have chosen to use an object-oriented design for our linear accelerator code. By structuring our model in terms of abstractions ("objects" and "classes") relevant to accelerator physics, we can develop code that is relatively easy to understand, maintain, and extend.
2 C++ Language Features

C++ has many features that make it an attractive language for object-oriented scientific application codes. C++ classes provide the means to define abstractions relevant to a particular problem domain. These classes can contain both data and methods that act on the data. The contents of a class may be either hidden from or visible to other code modules, a device that allows class internals to be encapsulated and its external interface to be fixed. In addition, C++ classes can be developed in a hierarchy, so that a child class inherits properties of parent classes. Inheritance can greatly aid in adapting and specializing an existing class for a new purpose.

Classes can be created to describe the features and behaviors of physical objects such as charged particles, particle beams, and beamline elements. Similarly, physics entities such as electric and magnetic fields and computational objects such as grids for spatial discretization can be represented by classes. A specific accelerator model may be constructed by creating the appropriate objects with appropriate internal state. This approach leads to a tremendous amount of flexibility and code reuse.

In addition to the object model of programming, C++ offers the capabilities of polymorphism and generic programming. Polymorphism allows an object to specify its behavior or characteristics when a particular function acts on it. This permits a very flexible style of coding in which methods are invoked on heterogeneous collections of objects; each object determines how the methods are to do their tasks. Polymorphism is achieved in C++ through virtual member functions, overloaded functions and operators, and the definition of "traits" classes using C++ templates. Templates underlie generic programming. They allow the C++ programmer to parameterize a class or function with an unspecified type.
Specifying the parametric type creates a particular kind of object (an "instantiation"); different instantiations result from different parametric types. This technique allows the same piece of code to be reused in many different settings. The judicious use of templates and generic programming can produce compact but highly powerful code.

¹ Work supported by the U.S. Department of Energy, Division of Mathematical, Information, and Computational Sciences and Division of High Energy Physics.
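As an illustration of the template and "traits" mechanisms described in Section 2, the following is a minimal sketch and is not taken from our accelerator code or from POOMA. The names AccumTraits and mean are hypothetical. A single generic function is instantiated for several element types, and a traits class selects, at compile time, the type in which each instantiation accumulates its sum.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical traits class: maps an element type T to the type used
// to accumulate sums of T. By default, accumulate in T itself.
template <class T>
struct AccumTraits {
    using type = T;
};

// Specializations: accumulate floats and ints in double precision.
template <>
struct AccumTraits<float> {
    using type = double;
};

template <>
struct AccumTraits<int> {
    using type = double;
};

// Generic function: one definition, instantiated once per element type.
template <class T>
typename AccumTraits<T>::type mean(const std::vector<T>& values) {
    using Accum = typename AccumTraits<T>::type;
    assert(!values.empty());  // the mean of an empty set is undefined
    Accum total = 0;
    for (const T& v : values) {
        total += v;
    }
    return total / static_cast<Accum>(values.size());
}
```

Calling mean on a std::vector<int>, a std::vector<float>, and a std::vector<double> produces three distinct instantiations from the same source text; in this sketch the first two return double because of the specializations.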
3 Parallel Programming and POOMA

These language features can be used to address a problem known as the "Parallel Platform Paradox": the time it takes to develop a typical physics application code for the latest supercomputer roughly equals the lifetime of that computer. This problem exists because custom code must be used if one is to exploit the novel features of the latest supercomputer. This custom code may include such things as message passing or load balancing algorithms, as well as architecture-specific data structures or numerical optimizations. Learning about the new system and writing the code takes time, however, and the newest supercomputer rather quickly becomes obsolete.

Any given problem domain requires certain commonly used data structures and operations. Representing these structures and operations efficiently in a program often requires substantial amounts of optimized architecture-specific code. It would make sense to collect these data structures and operations in a C++ class library, where the optimization and custom coding for the target machine would be done just once. Classes in the library would provide interfaces for the structures and operations and would simultaneously encapsulate them, keeping custom code out of the application code.

Another area that often requires custom code is parallelism. All modern supercomputers rely on parallel processing of some form. Machines differ, however, in exactly how they undertake parallelism and how their parallel architecture is best exploited. C++'s encapsulation can hide an algorithm's implementation, whether parallel or serial. Using a library of suitably encapsulated algorithms, the application developer can construct a physics model without worrying about the exact target architecture. The resulting code should be both portable and efficient.
POOMA, an acronym for "Parallel Object-Oriented Methods and Applications," is a C++ class library designed to provide all of these services, and thus to resolve the Parallel Platform Paradox. Application code that relies on POOMA can be compiled and run without any change wherever POOMA itself exists: on a parallel supercomputer, on a workstation cluster, or on a desktop system. The problems of porting code and of efficient exploitation of each computer system become problems for the maintainers of the POOMA Framework. The person writing application code can concentrate on physics rather than on machine-specific programming. This division of labor speeds the development of new applications and propagates code optimizations across a rather broad class of physics codes.
4 Object-Oriented Accelerator Model

We began with a High Performance Fortran program written by R. Ryne and S. Habib. The program models transport of an intense charged particle beam in a magnetic quadrupole channel. The central routines of the code follow a collection of particles moving through successive elements in the channel. The electromagnetic field of each element, and the beam's self-field, affect the particles' positions and momenta. The program's major computational job is integrating the particles forward in time through each of the beamline elements using the charged-particle equations of motion in an electromagnetic field.

We defined classes that correspond to the main entities in the model: a class BeamlineElement, with subclasses Drift and Quadrupole to describe specific types of elements; a class called Beamline, consisting of a collection of BeamlineElements; and a Beam class that consists of a set of charged particles. To tie it all together, we created an Accelerator class that contains a Beamline and a Beam and describes our complete physical system.

Some of these classes, Beamline and BeamlineElement for example, are useful only in an accelerator code, and we defined them from scratch for this project. Others, like the Beam class, rely on concepts that are useful in other kinds of physics applications. This is precisely the sort of general physics-based abstraction that POOMA provides. POOMA has a base class, ParticleBase, from which our Beam class was derived. ParticleBase provides a minimal description of a particle collection (a position and ID number for each particle), along with interfaces for a variety of useful operations such as data-parallel computations and interpolation to and from a grid. The Beam class inherits these features and adds data specific to our
charged-particle representation.

Another POOMA class of this sort is the Field class, which represents a multidimensional array. The Field class provides several characteristics usually expected of field quantities in physics models, such as built-in boundary conditions, existence on a discretized mesh, and the ability to have scalar, vector, or tensor fields. Moreover, the POOMA Field supports array syntax, stencil operations, differential operators, and reductions. In the accelerator code, we use the Field class to represent charge density, electrostatic potential, and the electric field. POOMA also contains an FFT class that operates on Fields and is used extensively within the field solver portion of the code.

POOMA's ParticleBase and Field classes contain parallel data structures, which are automatically distributed across processors. By using these classes, we avail ourselves of the many data-parallel operations that are built into POOMA. We can compile and run our code without change on any platform to which POOMA has been ported; POOMA will utilize that particular hardware and parallel system as efficiently as possible. In addition to this portable parallelism, POOMA applications such as ours can take advantage of the many built-in features of the physics-based abstractions contained in the POOMA Framework.
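The class structure described in this section can be sketched roughly as follows. This is an illustrative, serial sketch under several assumptions, not our actual code. Beam is a plain container here rather than a class derived from POOMA's ParticleBase. The member function names (advance, append, transport, run) are hypothetical. Each element applies a simple linear map (a field-free drift or a thin-lens quadrupole kick) instead of the time integration used in the real program, and the beam's self-field is ignored.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical transverse phase-space coordinates of one particle.
struct Particle {
    double x, xp;  // horizontal position and slope
    double y, yp;  // vertical position and slope
};

// Stand-in for the Beam class: a plain collection of particles.
struct Beam {
    std::vector<Particle> particles;
};

// Abstract base class: each element knows how to advance a beam.
class BeamlineElement {
public:
    virtual ~BeamlineElement() = default;
    virtual void advance(Beam& beam) const = 0;
};

// Field-free drift: positions change, slopes do not.
class Drift : public BeamlineElement {
public:
    explicit Drift(double length) : length_(length) { assert(length >= 0.0); }
    void advance(Beam& beam) const override {
        for (Particle& p : beam.particles) {
            p.x += length_ * p.xp;
            p.y += length_ * p.yp;
        }
    }
private:
    double length_;
};

// Thin-lens quadrupole (an assumption made for brevity):
// for positive strength it focuses in x and defocuses in y.
class Quadrupole : public BeamlineElement {
public:
    explicit Quadrupole(double strength) : strength_(strength) {}
    void advance(Beam& beam) const override {
        for (Particle& p : beam.particles) {
            p.xp -= strength_ * p.x;
            p.yp += strength_ * p.y;
        }
    }
private:
    double strength_;  // inverse focal length
};

// A Beamline owns an ordered, heterogeneous collection of elements.
class Beamline {
public:
    void append(std::unique_ptr<BeamlineElement> element) {
        elements_.push_back(std::move(element));
    }
    void transport(Beam& beam) const {
        // Virtual dispatch: each element decides how to act on the beam.
        for (const auto& element : elements_) element->advance(beam);
    }
private:
    std::vector<std::unique_ptr<BeamlineElement>> elements_;
};

// The Accelerator ties a Beamline and a Beam together.
struct Accelerator {
    Beamline beamline;
    Beam beam;
    void run() { beamline.transport(beam); }
};
```

In this arrangement a new element type is added by deriving another class from BeamlineElement; Beamline and Accelerator need no changes.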
5 Performance Issues

Despite all of these benefits, the use of C++ in general and of POOMA in particular would make no sense if the performance of the resulting code were substantially worse than the performance of equivalent custom-coded Fortran. Until very recently, numerical codes written in C++ did not perform well in comparison to equivalent Fortran, but the situation is rapidly changing [3]. One reason for the poor performance of C++ has been the absence of good optimizing compilers. The KCC compiler from Kuck and Associates, Inc. (KAI) has filled that gap well, and other good optimizing compilers that are fully compliant with the ANSI C++ standard are on the horizon. Another cause of poor performance is inherent in the C++ language. Consider the following code example:

    class Matrix { /* ... */ };
    Matrix A, B, C, D;
    /* ... */
    A = B + C + D;                                        (1)

Suppose that class Matrix overloads the operators "+" and "=" to perform elementwise addition and assignment. The final line will be evaluated in a series of binary operations. These will involve temporary Matrix objects that store intermediate results: tmp1 = B + C; tmp2 = tmp1 + D; A = tmp2. Creation and destruction of temporary objects can severely degrade performance, especially if each object contains a lot of data. This problem has been recognized for some time, and various attempts have been made to solve it. The best solution to date is "expression templates" [4], a flexible and general device that avoids the creation of temporaries. POOMA relies heavily on expression templates to optimize data-parallel expressions involving particles and fields. POOMA applications thereby retain the benefits of overloaded operators with no loss in performance.
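The idea behind expression templates can be sketched as follows. This is a deliberately small illustration and not POOMA's implementation: the classes Vec and Sum are hypothetical, Vec is a one-dimensional stand-in for the Matrix class of example (1), and only addition is supported. Here operator+ returns a lightweight node that records its operands by reference, and the assignment operator evaluates the whole expression in a single loop, so no temporary array is created.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Node for the unevaluated elementwise sum lhs + rhs.
// It holds references only; no element data is copied.
template <class L, class R>
class Sum {
public:
    Sum(const L& lhs, const R& rhs) : lhs_(lhs), rhs_(rhs) {}
    double operator[](std::size_t i) const { return lhs_[i] + rhs_[i]; }
    std::size_t size() const { return lhs_.size(); }
private:
    const L& lhs_;
    const R& rhs_;
};

// Hypothetical array class standing in for Matrix.
class Vec {
public:
    explicit Vec(std::size_t n, double value = 0.0) : data_(n, value) {}
    double operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    // Assigning from an expression: one loop over the elements,
    // evaluating B[i] + C[i] + D[i] directly into this array.
    template <class L, class R>
    Vec& operator=(const Sum<L, R>& expr) {
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data_[i] = expr[i];
        return *this;
    }
private:
    std::vector<double> data_;
};

// Vec + Vec builds a node instead of computing a result array.
inline Sum<Vec, Vec> operator+(const Vec& a, const Vec& b) {
    return Sum<Vec, Vec>(a, b);
}

// (expression) + Vec nests the existing node inside a new one.
template <class L, class R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& a, const Vec& b) {
    return Sum<Sum<L, R>, Vec>(a, b);
}
```

With these definitions, the statement A = B + C + D for four Vec objects of equal size builds a Sum<Sum<Vec, Vec>, Vec> object at compile time and runs one loop at run time. The nodes store references, so in this sketch an expression must be assigned within the statement that builds it.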
6 Project Status

Our goal is to produce a "dimension-independent" linear accelerator model capable of simulating beam behavior for a variety of beamline elements. We will use classes that are parameterized by dimension using C++ templates. This means that a single code base will support both 2D and 3D models. (Other dimensionalities are formally possible but have little practical use.) POOMA provides classes templated on dimension, so our accelerator code can use this feature and derive templated classes from POOMA classes as needed.
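As a rough sketch of what parameterizing by dimension means in practice, consider the following. The names Particle and drift are hypothetical and do not come from POOMA or from our code; the example is serial and handles only free-particle motion. The spatial dimension is a template parameter, so the same source text serves the 2D and 3D cases.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle whose number of coordinates is fixed at
// compile time by the template parameter Dim.
template <std::size_t Dim>
struct Particle {
    std::array<double, Dim> position;
    std::array<double, Dim> momentum;
};

// Hypothetical dimension-independent free-streaming step:
// position += dt * momentum / mass in every coordinate direction.
template <std::size_t Dim>
void drift(std::vector<Particle<Dim>>& beam, double dt, double mass) {
    assert(mass > 0.0);
    for (Particle<Dim>& p : beam) {
        for (std::size_t d = 0; d < Dim; ++d) {
            p.position[d] += dt * p.momentum[d] / mass;
        }
    }
}
```

A std::vector<Particle<2>> and a std::vector<Particle<3>> can both be passed to drift; the compiler generates a separate instantiation for each dimension.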
We have a 2D prototype code implemented in C++ and POOMA. It supports a K-V or Gaussian initial beam distribution in the x-y plane and integrates the beam particles through a series of drift and quadrupole elements. The integration is performed using a split-operator approach. The beam's self-consistent electrostatic potential is computed by scattering charge density into a Field, performing an FFT, applying a Green's function in Fourier space, and inverting the FFT. POOMA provides simple functions for scattering the particle charge density, computing the gradient of the electrostatic potential, and gathering the resulting electric field at the particle positions. Our results are in agreement with results of Ryne and Habib's 2D HPF code.

The POOMA code is instrumented to send particle and field data to ACLVIS, a Los Alamos visualization package, during a code run. This provides real-time data visualization capabilities that enable users to spot problems in code behavior quickly and to study the effects of various beamline elements. Furthermore, POOMA provides a simple mechanism for profiling application codes with the Tau profiling tools [5]. Simple macros in the accelerator code generate timing data. Tau uses the data to chart the CPU time spent in each instrumented routine by each processor. We are using these profiling tools to analyze the performance of our code, and to compare it with the performance of the HPF code. Our most recent tests, run on an Origin 2000 symmetric multiprocessor computer, indicate that the performance of the POOMA code is comparable to that of the HPF code. More studies need to be done before specific performance data can be provided. Our future work includes such studies and recasting the current code into a generic templated form.
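The FFT-based field solve described in this section can be summarized by the following relations. They are a sketch under stated assumptions: SI units, and a periodic domain, for which the Green's function takes a particularly simple form in Fourier space. The text does not specify the boundary conditions or the Green's function actually used in the code, so the expression for \hat{G} below is illustrative only.

```latex
% Poisson's equation for the electrostatic potential of the beam
\nabla^2 \phi(\mathbf{x}) = -\frac{\rho(\mathbf{x})}{\epsilon_0}

% In Fourier space the solve becomes a pointwise multiplication
\hat{\phi}(\mathbf{k}) = \hat{G}(\mathbf{k})\,\hat{\rho}(\mathbf{k}),
\qquad
\hat{G}(\mathbf{k}) = \frac{1}{\epsilon_0\,|\mathbf{k}|^2}
\quad (\mathbf{k} \neq \mathbf{0},\ \text{periodic case})

% Inverse transform, then the field gathered at the particle positions
\phi(\mathbf{x}) = \mathcal{F}^{-1}\!\left[\hat{G}\,\hat{\rho}\right](\mathbf{x}),
\qquad
\mathbf{E}(\mathbf{x}) = -\nabla \phi(\mathbf{x})
```

Each line corresponds to one step of the procedure: the scattered charge density \rho is transformed, multiplied by \hat{G}, and transformed back to give \phi, whose gradient gives the electric field.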
References

[1] J. V. Reynders, V. W. John, P. J. Hinker, J. C. Cummings, S. R. Atlas, S. Banerjee, W. F. Humphrey, S. R. Karmesin, K. Keahey, M. Srikant, M. Tholburn, in Parallel Programming Using C++, G. V. Wilson and P. Lu, eds. (MIT Press, Cambridge, 1996).
[2] R. D. Ryne, S. Habib, K. Ko, Z. Li, W. Mi, C.-K. Ng, J. Qiang, M. Saparov, V. Srinivas, Y. Sun, X. Zhan, Proceedings ICNSP'98.
[3] T. Veldhuizen, http://monet.uwaterloo.ca/~tveldhui/DrDobbs2/drdobbs2.html
[4] T. Veldhuizen, C++ Report 7:5, 26 (June 1995). Reprinted in C++ Gems, Stanley B. Lippman, ed. (Sigs Books, NY, 1996).
[5] Tau, http://www.acl.lanl.gov/tau/