Rapid Application Development and Enhanced Code Interoperability using the POOMA Framework

Julian C. Cummings, James A. Crotinger, Scott W. Haney, William F. Humphrey, Steve R. Karmesin, John V.W. Reynders, Stephen A. Smith and Timothy J. Williams

Abstract. The Parallel Object-Oriented Methods and Applications (POOMA) Framework, written in ANSI/ISO C++, has demonstrated both great expressiveness and efficient code performance for large-scale scientific applications on computing platforms ranging from workstations to massively parallel supercomputers. The POOMA Framework provides high-level abstractions for multi-dimensional arrays, computational meshes, physical field quantities and collections of particles. POOMA also exploits advanced C++ programming techniques, such as expression templates and compile-time polymorphism, to optimize serial performance while encapsulating the details of parallel computation within a simple data-parallel array syntax. Consequently, scientists can quickly assemble parallel simulation codes without being bothered with the technical difficulties of parallel programming. The POOMA Framework is currently being used to develop application codes targeting several of the Department of Energy's Grand Challenge research areas. In addition, POOMA is providing infrastructure for one of the main code projects of the Accelerated Strategic Computing Initiative (ASCI) at Los Alamos National Laboratory. Work is now in progress on POOMA II, a complete rewrite of the POOMA Framework intended to increase expressiveness and code performance. In this presentation, we illustrate some of the successes of the POOMA approach in real-world scientific applications and highlight anticipated improvements in POOMA II.

1 Introduction

The Department of Energy (DoE) has recently launched its Accelerated Strategic Computing Initiative (ASCI) in an effort to radically advance the state of the art in high-performance computing and numerical simulation. The DoE anticipates the need for 100-teraflop computers and petabyte memory systems within the next decade in order to fulfill its mission of Science-based Stockpile Stewardship (SBSS). Simulations of nuclear weapons must be fully three-dimensional with much finer resolution and more sophisticated physics models than previously attempted to compensate for the halting of nuclear weapons testing and maintain confidence in the nuclear stockpile. Teraflop computer systems have been purchased at the three major weapons laboratories, and multi-teraflop systems are coming soon.

This work was performed under the auspices of the U.S. Department of Energy by Los Alamos National Laboratory under contract No. W-7405-Eng-36. Los Alamos National Laboratory, Los Alamos, NM 87545. ({julianc, jac, swhaney, bfh, karmesin, reynders, sa smith, [email protected])

Equal in importance to the new computer hardware being acquired, however, is ASCI funding for software development. The ASCI project faces a two-pronged challenge in utilizing the new computer systems being brought online. Much of the knowledge of nuclear weapons physics accrued over years of research and testing is encased in archaic computer codes which were not written to be easily maintained and extended and do not take advantage of parallel processing concepts. Somehow the crucial physics of these codes must be extracted and expressed in a manner that leads to greater code portability. At the same time, the SBSS mission demands complete weapon physics simulation and a much tighter integration of the many existing software packages that model particular subsystems. Unfortunately, most physics codes in the past have not been designed with strong interoperability in mind. Thus, some ASCI code projects are focusing not only on developing the necessary physics models and getting them running on the new ASCI computing platforms, but also on designing frameworks and component architectures that support general concepts of scientific simulation on parallel computers and facilitate rapid application development. The Advanced Computing Laboratory (ACL) at Los Alamos is providing such support for the ASCI mission through its Parallel Object-Oriented Methods and Applications (POOMA) project[1]. The POOMA team has developed a C++ class library that contains physics-based data abstractions and parallel numerical algorithms designed to accelerate development of scientific simulations on parallel architectures. The primary goal of the POOMA project is to use modern, object-oriented software development techniques to encapsulate the computer science aspects of parallel processing, expose the physics contained in numerical algorithms, and increase code reuse across application domains.
POOMA began as a toolkit focused on particle-in-cell simulation techniques commonly used in plasma simulation, but it has since been extended to support hydrodynamics, Monte Carlo simulations, and molecular dynamics modeling. The POOMA Framework is now providing the basis for one of the main ASCI code projects at Los Alamos National Laboratory, as well as two DoE Grand Challenge simulation projects. Furthermore, work is now underway on the next generation of the framework, called POOMA II, which will offer improved data abstractions, greater programming flexibility and more sophisticated code optimizations.

2 Short Tour of POOMA

Briefly stated, the POOMA Framework is an integrated set of C++ classes organized into several abstraction layers and designed to facilitate development of scientific simulations for parallel architectures (see Fig. 1). Each layer of classes builds upon the layers below. The uppermost layer consists of complete user applications composed of POOMA objects and functionalities. POOMA applications are generally built from the POOMA global data objects (meshes, fields and particles) and parallel algorithms (Fourier transforms, interpolators and spatial operators). Management of parallel issues such as domain decomposition and load balancing is handled by classes in the parallel abstraction layer. Finally, the lowermost layer contains classes that take care of nitty-gritty computer science details involved in making the layers above function, such as providing generic (serial) containers and algorithms via the STL[2] and optimizing POOMA data-parallel expressions using expression templates[3].

Fig. 1. Design of the POOMA Framework.

The POOMA Field class is a distributed data container used to describe a physical field quantity. With the use of C++ templates, a Field can have any number of dimensions and a variety of element types, including vectors and tensors. One of several POOMA Mesh types may be used to discretize the simulation domain, and Field elements can reside on the Mesh with a variety of centerings (vertex, cell, face, etc.). Fields can participate in data-parallel expressions using the familiar Fortran 90 array syntax. In addition, POOMA provides an Index class which allows the user to write stencil expressions such as

B[I] = 0.5 * (A[I+1] + A[I-1]);

where A and B are conforming Fields and I is an Index spanning some portion of their domain. If I spans the entire domain of A and B, then this stencil expression would overrun the boundaries of A. In this case, we can attach a guard cell layer with a width of one element to A in order to handle this overrun, and we can apply various boundary conditions at the global boundaries of the simulation domain. Besides using the data-parallel interface of Fields, the user can also manipulate individual elements, apply POOMA spatial operators such as Div and Grad or reduction functions such as sum, or perform a parallel fast Fourier transform (FFT) on a Field.

To support particle simulations, POOMA provides a ParticleBase class that gives a minimal description of a set of particles. ParticleBase contains two ParticleAttribs, one for each particle's global position in the simulation domain and one for a global ID number. Each ParticleAttrib is a distributed list of elements of arbitrary type. The user derives his or her own class from ParticleBase and includes additional ParticleAttribs sufficient to describe the desired simulation particles. Besides a minimal data description, ParticleBase provides many important functionalities, such as the ability to create or destroy particles, to swap particles between processors in order to maintain locality with Field data, and to generate the pair lists needed for efficient particle-particle interactions. In addition, POOMA has a selection of Interpolator classes that implement common schemes for interpolating data between particle positions and Field element

locations. As with POOMA Fields, ParticleAttribs can participate in data-parallel expressions and perform reduction functions such as summation.

POOMA automatically handles all of the interprocessor communication needed to manage the global data objects that it provides. However, the user is afforded some level of control over the data distribution. Field data is broken into chunks that are distributed amongst virtual nodes, or vnodes, where each processor may own one or more vnodes. This scheme insulates the data distribution from the physical processor layout. The user may decide which axes of a Field are to be decomposed and how many vnodes should be generated. ParticleAttrib data is not decomposed into vnodes. Instead, it is typically distributed across processors such that each particle's data is owned by the same processor that owns the Field elements nearest the particle's position, although other particle data layout strategies are available. The data distribution may also be affected by invocation of a POOMA load balancing strategy. Each strategy takes as input a Field of values representing the amount of computational work to be done for each element of the domain and then subdivides the domain in an effort to balance the sum of the elements within each subdomain.

POOMA also provides several other tools which can be extremely useful for application development, including code profiling and parallel data I/O and visualization. POOMA is instrumented with Tau[4] profiling macros that collect timing information and other statistics during code execution. This data can be processed using either a prof-like utility for ASCII output or a GUI tool for graphical results. Tau profiling indicates the number of calls to a function and the amount of CPU time spent in a function or section of code on each processor, which simplifies assessment of load balance and identification of code bottlenecks.
Command-line flags allow conditional profiling of certain sections of code as desired, and users can include Tau profiling macros in application code. POOMA handles transfer of its distributed data objects via the DataConnect class. DataConnect sets up a connection between a Field or ParticleAttrib and some outside entity such as a file or a visualization routine. Whenever the connection is updated by the user code, data is transferred from the global data object to the client. A file connection will write the data to output files in parallel using a simple metadata format. A visualization connection will send fresh data to the visualization routine and can pass control to this routine, allowing the user to modify the data visualization parameters in real time before continuing the simulation. DataConnect is a generic facility allowing POOMA to share data with a wide variety of outside entities.
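To make the guard-cell mechanism concrete, the following sketch shows, in plain C++ rather than actual POOMA code, what the stencil statement B[I] = 0.5 * (A[I+1] + A[I-1]) requires when I spans the whole domain: a one-element guard layer on each end of the field, filled by a boundary condition before the sweep. The class and function names here are illustrative, not POOMA's internals.

```cpp
#include <cassert>
#include <vector>

// A 1-D field with a one-element guard layer on each end, so the
// stencil can sweep the whole physical domain [0, n) without reading
// out of bounds.
struct GuardedField {
    std::vector<double> data;   // size = n physical cells + 2 guards
    explicit GuardedField(int n) : data(n + 2, 0.0) {}
    double& at(int i) { return data[i + 1]; }  // valid for i in [-1, n]
    int size() const { return static_cast<int>(data.size()) - 2; }
};

// Apply a periodic boundary condition by filling the guard cells.
void fillGuards(GuardedField& f) {
    int n = f.size();
    f.at(-1) = f.at(n - 1);
    f.at(n)  = f.at(0);
}

// The data-parallel statement B[I] = 0.5 * (A[I+1] + A[I-1]).
void stencil(GuardedField& B, GuardedField& A) {
    fillGuards(A);
    for (int i = 0; i < A.size(); ++i)
        B.at(i) = 0.5 * (A.at(i + 1) + A.at(i - 1));
}
```

In the parallel setting the same guard layer also caches copies of neighboring vnodes' boundary elements, so filling the guards is where interprocessor communication would occur.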

3 Sample Applications

Several complete physics application codes have now been developed using the tools and framework provided by the POOMA library. Most of these applications were developed jointly by POOMA team members and other researchers as the POOMA Framework has continued to grow and mature. In general, our experiences have been very positive. We have found that the use of POOMA accelerates the development of new applications by providing data abstractions and numerical algorithms for scientific simulations that have the desired functionalities and hide the messy implementation details. User code works almost entirely with high-level structures, so it becomes much simpler to write, understand and maintain. At the same time, POOMA classes are structured in a way that can be easily extended to produce new capabilities, unlike most standard numerical methods libraries. Porting of existing codes to use the POOMA Framework is relatively straightforward


for codes written in a data-parallel style (e.g., using High-Performance Fortran). Message-passing codes can also be converted to use POOMA, but this transition is usually more difficult. Although POOMA allows user-level message passing through the use of a supporting communications library such as MPI[5], this is typically not the most efficient way of doing things in POOMA and is generally discouraged. In any effort to port code to POOMA, it is always best to first re-examine the problem domain from an object-oriented viewpoint and decide how best to represent your problem with POOMA tools.

In order to elucidate some of the benefits and drawbacks of using POOMA, we will discuss a couple of sample applications. The first example is part of a DoE Grand Challenge project in the area of particle accelerator physics. A research group at LANL had developed some simple two- and three-dimensional models of charged particles progressing through a series of beamline accelerating and focusing elements[6]. These codes were written in High-Performance Fortran (HPF) because of its simple data-parallel syntax, and they implemented the standard particle-in-cell (PIC) model for the charged particles to account for space-charge effects on the particle beam. In PIC simulations, the electric charge of the particles is scattered onto a simulation grid using an interpolation method to form a charge density. One then solves the Poisson equation (e.g., using Fourier transforms) to compute the electrostatic potential and the electric field, and gathers the electric field values back to the particles in order to calculate their acceleration. Hence, these accelerator models perform many different types of computations: particle data-parallel operations, particle-mesh interpolations, parallel FFTs, and spatial gradients.
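The scatter and gather steps of the PIC cycle described above can be sketched in plain C++ with linear ("cloud-in-cell") weighting between a particle and its two nearest grid points. This is an illustrative one-dimensional stand-in for what POOMA's Interpolator classes do generically; the names and layout are invented for this sketch. (Note that the cell index and weight are computed identically in both routines, which is exactly the redundancy the interpolation caching discussed later removes.)

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Deposit each particle's charge q onto the grid, splitting it
// linearly between the two grid points bracketing the particle.
void scatterCharge(const std::vector<double>& x, double q, double dx,
                   std::vector<double>& rho) {
    for (double xi : x) {
        int    cell = static_cast<int>(std::floor(xi / dx));
        double w    = xi / dx - cell;   // fractional position within cell
        rho[cell]     += q * (1.0 - w);
        rho[cell + 1] += q * w;
    }
}

// Interpolate a grid field (e.g., the electric field from the Poisson
// solve) back to each particle's position with the same weights.
std::vector<double> gatherField(const std::vector<double>& x, double dx,
                                const std::vector<double>& E) {
    std::vector<double> Ep;
    for (double xi : x) {
        int    cell = static_cast<int>(std::floor(xi / dx));
        double w    = xi / dx - cell;
        Ep.push_back((1.0 - w) * E[cell] + w * E[cell + 1]);
    }
    return Ep;
}
```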
Researchers were interested in converting these codes to POOMA because of concerns about code performance, extension to more complex beamline element models, and interaction with other physics codes. Over a period of about three months, a POOMA version of the particle accelerator model, called Linac, was developed[7]. We redesigned the code by decomposing the problem into a Beam object containing the charged particles and a Beamline object containing a list of BeamlineElement objects (see Fig. 2). The accelerator simulation can be entirely described in terms of the effects on the Beam of each BeamlineElement as the Beam passes through. Each BeamlineElement applies some combination of electric and magnetic fields and then integrates the Beam particles forward in time through the BeamlineElement using the PIC algorithm. The BeamlineElement can be viewed as a map which converts an initial set of particle positions and momenta to a final set. By constructing the code in this fashion, it was very easy to test different algorithms for time integration or try out new beamline element models. Unlike the HPF codes, Linac makes the interactions between the various data structures in the code very explicit, so that the side effects or surprises of adding a new model are minimized.

Linac exercises a tremendous fraction of the POOMA Framework. The Beam class derives from ParticleBase and adds the ParticleAttribs necessary to model charged beam particles, such as particle momentum and the local value of the electric field. The particles are initialized in the desired distribution in phase space using random numbers produced by POOMA SequenceGen classes. Fields are used to represent the charge density and the electric field on the simulation mesh. The particles can scatter their charge or gather the electric field values using one of several available Interpolator classes. The POOMA FFT class is used to perform Fourier transforms and solve Poisson's equation.
Particle positions and momenta are advanced in time using simple data-parallel expressions. Finally, because POOMA classes are templated on dimensionality, the same Linac code base can support both 2D and 3D simulations.
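The Beam/Beamline decomposition described above can be sketched with ordinary C++ polymorphism. This is a reconstruction from the text and Fig. 2, not the actual Linac source: the class names follow the figure, but the particle state and the drift element's physics are deliberately minimal.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Stand-in for the full particle state (positions, momenta, ...).
struct Beam {
    std::vector<double> z;   // longitudinal positions only, for brevity
};

// Each element is a map from an incoming particle state to an
// outgoing one, applied by its integrate() method.
class BeamlineElement {
public:
    virtual ~BeamlineElement() = default;
    virtual void integrate(Beam& beam) const = 0;
};

// A trivial element model: a field-free drift of a given length.
class BeamlineDrift : public BeamlineElement {
public:
    explicit BeamlineDrift(double length) : length_(length) {}
    void integrate(Beam& beam) const override {
        for (double& zi : beam.z) zi += length_;
    }
private:
    double length_;
};

// The Beamline applies its elements in order, so a new element model
// (quad, RF gap, ...) can be added without touching any other code.
class Beamline {
public:
    void add(std::unique_ptr<BeamlineElement> e) {
        elems_.push_back(std::move(e));
    }
    void run(Beam& beam) const {
        for (const auto& e : elems_) e->integrate(beam);
    }
private:
    std::vector<std::unique_ptr<BeamlineElement>> elems_;
};
```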

Fig. 2. Object design for the Linac simulation code.

Once the code was complete and we verified that the physics results were identical to those of the HPF codes for a variety of test problems, we began to examine code performance. Using the Tau profiling tools, we discovered that the parallel FFTs performed in Linac were relatively slow due to the costliness of shuffling the Field data between different layouts. So we made improvements to the algorithm for moving Field data from one layout to another, and we made adjustments to Linac that reduced the number of data transpositions required per parallel FFT. In addition, we found that the performance of the particle-mesh interpolations in Linac was not as impressive as we had hoped. One advantage that the HPF code had in this area was that it reused interpolation information acquired during the charge density scatter later on in the electric field gather, since the particles did not change positions in between. So we extended the POOMA gather and scatter methods to allow the user to cache and reuse interpolation information. With these improvements, Linac now runs about four times faster than the equivalent HPF code on the SGI Origin 2000. Most of this performance advantage is attributable to POOMA's ability to maintain locality between particle and field data, which makes particle-mesh interpolation a local operation. Even more importantly, the changes and extensions made to POOMA in the process of optimizing Linac benefited all of our POOMA-based applications simultaneously.

The Linac code not only runs faster than the original HPF version, it is also more robust, extensible and interactive. Compile-time and run-time error checking, plus the added safety of C++ type checking, help to catch coding mistakes quickly. The Beam class can be modified to represent different particle characteristics without breaking existing code.
Similarly, a new class representing a new model for a beamline element can be derived from the BeamlineElement base class and inserted into the Beamline without making any code changes elsewhere. Furthermore, Linac uses the DataConnect class and the ACLVIS package[8] to provide real-time visualization of particle motions and the evolution of field

Fig. 3. Visualizations from a Linac 2D simulation. (a) Particle positions colored by kinetic energy. (b) Charge density field.

quantities (see Fig. 3). This allows for simple visual verification of correct code behavior and greater physics understanding. The same DataConnect abstraction can also be used to export data from Linac particles and fields to other codes for additional analysis.

Our second application code example is MC++, a Monte Carlo neutron transport code[9]. MC++ was designed to fulfill the needs of ASCI for estimates of criticality for a given system of fissile and non-fissile materials. Such a capability was needed quickly, and the code had to be portable to the new ASCI computing platforms. MC++ was loosely based on a previous code written in CM Fortran for the old Connection Machines, but it was essentially written from scratch in about five months. It estimates the criticality of a system by determining the effective neutron multiplier required to keep the neutron population in balance. MC++ tracks a sample of neutrons through the system, using probabilistic Monte Carlo techniques to determine what scattering, fissioning and absorption events will occur to each of them. Interestingly, this is largely not a data-parallel algorithm, because each particle will undergo a different series of events during its lifetime. Thus, it provides an excellent test of the versatility of the POOMA Framework.

MC++ uses many, but not all, of the same POOMA features as Linac. MC++ has a NeutronParticles class to represent active neutrons in the calculation, a BankParticles class to store newly produced neutrons for the next cycle, and an InterfaceParticles class to represent material interfaces. It uses data-parallel expressions in a few instances, such as updating the neutron positions and track lengths. Most sections of the code, however, contain loops over particles in which each particle's data is accessed by indexing of ParticleAttribs. POOMA does not lock the user into a data-parallel paradigm, but instead allows the flexibility of individual element access when it is necessary.
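The element-wise access pattern described above can be sketched as a Monte Carlo event loop in plain C++: each particle takes a different branch, so there is no single data-parallel expression to write. The event categories match the text; the probability thresholds, names, and the use of a precomputed random-number vector are invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Event { Scatter, Absorb, Fission };

// Sample an event from a uniform random number r in [0, 1).
// The 60/30/10 split is arbitrary, for illustration only.
Event sampleEvent(double r) {
    if (r < 0.6) return Event::Scatter;
    if (r < 0.9) return Event::Absorb;
    return Event::Fission;
}

// Advance one cycle: scattered neutrons survive, absorbed neutrons
// are removed, and fissions bank two new neutrons for the next cycle.
void cycle(std::vector<double>& alive, std::vector<double>& bank,
           const std::vector<double>& rng) {
    std::vector<double> next;
    for (std::size_t i = 0; i < alive.size(); ++i) {
        switch (sampleEvent(rng[i])) {
        case Event::Scatter:
            next.push_back(alive[i]);
            break;
        case Event::Absorb:
            break;                       // particle removed
        case Event::Fission:
            bank.push_back(alive[i]);    // two offspring banked
            bank.push_back(alive[i]);
            break;
        }
    }
    alive.swap(next);
}
```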
Being a Monte Carlo code, MC++ makes very extensive use of POOMA RandomNumberGen classes. These classes are simple wrappers around some well-known algorithms for producing sequences of pseudo-random numbers. Nevertheless, they can be plugged into the SequenceGen class and then participate along with Fields or ParticleAttribs in data-parallel expressions by generating a new random number each time they are evaluated. MC++ does use a couple of Fields to store the ID number and density of the material in each cell. Cells containing more than one material refer to the InterfaceParticles object that stores the data for

Fig. 4. Neutron tracks through a uranium sphere in MC++.

the material on both sides of every interface.

MC++ was benchmarked against the MCNP code, a comprehensive neutronics package written in Fortran 77 and parallelized using MPI. We ran the double-density Godiva test problem, which models a bare uranium sphere (see Fig. 4), using each code on a variety of workstations and the Cray T3D. We found that serial code performance was comparable and that MC++ had superior parallel scaling properties on the T3D[9]. Because of its relative simplicity, MC++ was easily ported to new parallel architectures and quickly became the first ASCI-relevant physics code to run successfully on all three ASCI computing platforms. This rapid success encouraged several groups of ASCI researchers working on hydrodynamics applications to begin utilizing the POOMA Framework for their code development as well. These researchers now form the Blanca code team, and at present, their POOMA-based hydrodynamics models have utilized over 1000 SGI R10000 processors in parallel and have run calculations on meshes containing as many as 60 million cells.

4 Future Improvements in POOMA II

Despite the many encouraging successes outlined in the above discussion of sample application codes, we also discovered some things about using POOMA in real-world applications that we did not like. The performance of POOMA codes on cache-based architectures has not been as good as we had hoped. It turns out that writing codes in terms of a series of data-parallel statements is not ideal when working with large distributed arrays that do not fit entirely into the data cache. Instead, one would like to load a subset of each array into the cache, perform several calculations with that data, and then load the next subset of each array. Furthermore, many operations simply are not well expressed as data-parallel statements. If only a subset of a given array needs to be manipulated, the user needs a means of efficiently addressing just that subset of array elements. If circumstances


dictate that different operations must be performed on different subsets of an array, a task-parallel paradigm might make the most sense. Finally, the porting of existing codes into the POOMA Framework could be greatly simplified if there were a way for POOMA to manipulate data structures created by those codes. For these reasons, we were motivated to redesign certain aspects of POOMA and provide the user with additional flexibility.

The core of the POOMA II redesign consists of the Domain and Array abstractions. A Domain is a set of discrete points in some space, and an Array can be thought of as a map from one Domain to another. Domains provide all of the expected domain calculus capabilities, such as subsetting and intersection. Arrays depend only on the interface of Domains. Thus, a subset or view of an Array can be manipulated in all the same ways as the original Array. Arrays can perform indirect addressing because the output Domain of one Array can be used as the input Domain of another Array. Arrays also provide individual element access, as well as the same sort of array syntax for expressions as the original POOMA Framework. The Array class will provide the basis for Fields in POOMA II.

The Array class is templated on an Engine type that handles the actual implementation of the mapping from input to output. Thus, the Array interface is completely separate from the implementation, which could be a simple C array, a function of some kind, or some other mechanism. This flexibility allows an expression itself to be viewed through the Array interface. Thus, one can write something like

foo(A*B+C);

where A, B and C are Arrays and foo is a function taking an Array as an argument. The expression A*B+C will only be evaluated by the expression engine as needed by foo. In fact, one can even write Engines which are wrappers around external data structures created in non-POOMA codes and know how to manipulate these structures. Once this is done, the external entities have access to the entire Array interface and can utilize all of the powerful features of POOMA II. This should make POOMA much more attractive to users with large and complex existing codes in search of a nicer parallel interface.

POOMA II also contains several improvements in the area of expression evaluation. It uses an enhanced version of our Portable Expression Template Engine (PETE) for parsing expressions involving Arrays. This version of PETE reduces the compile time of user codes and utilizes compile-time knowledge of expression Domains for better optimization. For example, more efficient loops for evaluating an expression can be generated if PETE knows that the Domain has unit stride in memory. In addition, Domains can provide information about the best way to subset an expression involving distributed Arrays for fast evaluation and minimal communication. Through the use of a new fuse function, users will be able to tell the evaluator that a series of data-parallel expressions may be evaluated together in one loop, rather than one at a time. This capability should greatly enhance cache utilization.

Finally, POOMA II will make use of a new parallel run-time system called SMARTS[10] that is under development at the ACL. SMARTS supports lightweight threads, so the evaluator will be able to farm out data communication tasks and the evaluation of subsets of an expression to multiple threads, thus increasing the overlap of communication and computation. Threads will also be available at the user level for situations in which a task-parallel approach is deemed appropriate.
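The lazy evaluation behind foo(A*B+C) can be illustrated with a toy expression-template sketch in plain C++. This is a miniature modeled on the idea behind PETE, not PETE's actual code: A*B+C is captured as a type, no temporary arrays for A*B or A*B+C are materialized, and a consumer walks the whole expression element by element in a single loop.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <type_traits>
#include <vector>

struct ExprTag {};   // marker: every expression node derives from this

struct Array : ExprTag {
    std::vector<double> v;
    Array(std::initializer_list<double> d) : v(d) {}
    double operator[](std::size_t i) const { return v[i]; }
    std::size_t size() const { return v.size(); }
};

// A node capturing "l Op r"; evaluation happens only in operator[].
template <class L, class R, class Op>
struct BinExpr : ExprTag {
    const L& l; const R& r;
    BinExpr(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return Op::apply(l[i], r[i]); }
    std::size_t size() const { return l.size(); }
};

struct Mul { static double apply(double a, double b) { return a * b; } };
struct Add { static double apply(double a, double b) { return a + b; } };

// The operators participate only for expression types, so ordinary
// arithmetic on doubles is untouched.
template <class L, class R,
          class = std::enable_if_t<std::is_base_of<ExprTag, L>::value &&
                                   std::is_base_of<ExprTag, R>::value>>
BinExpr<L, R, Mul> operator*(const L& l, const R& r) { return {l, r}; }

template <class L, class R,
          class = std::enable_if_t<std::is_base_of<ExprTag, L>::value &&
                                   std::is_base_of<ExprTag, R>::value>>
BinExpr<L, R, Add> operator+(const L& l, const R& r) { return {l, r}; }

// A consumer in the role of foo: one fused loop over the expression.
template <class Expr>
double sum(const Expr& e) {
    double s = 0.0;
    for (std::size_t i = 0; i < e.size(); ++i) s += e[i];
    return s;
}
```

Evaluating several statements through one such loop is also the essence of what the fuse facility described above would let the user request explicitly.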
All of these new capabilities should greatly enhance the performance of POOMA codes by allowing them to use the right approach for each computational task.


5 Conclusions

Our experiences developing simulation codes with the POOMA Framework indicate that we are moving in the right direction. Providing parallel data containers and algorithms suited for scientific simulations allows users to rapidly develop new application codes without having to worry about the computer science aspects of parallel programming. The resultant applications tend to more clearly express the key physics being modeled. Because of their object-oriented design and explicit interface mechanisms, these applications reuse large amounts of code and more readily interoperate with other packages. Performance testing of these application codes has generally yielded good results, but has also indicated areas in which improvement is needed. By retooling our abstractions for data representation and expression evaluation, we aim to provide our users with even greater code efficiency and flexibility in the future.

References

[1] J. V. W. Reynders et al., POOMA: A Framework for Scientific Simulations on Parallel Architectures, in Parallel Programming using C++, MIT Press, Cambridge, MA, 1996.
[2] D. R. Musser and A. Saini, STL Tutorial and Reference Guide, Addison-Wesley, Reading, MA, 1996.
[3] T. Veldhuizen, Expression Templates, C++ Report, June 1995.
[4] S. Shende et al., Portable Profiling and Tracing for Parallel Scientific Applications using C++, in Proceedings of SPDT'98: ACM SIGMETRICS Symposium on Parallel and Distributed Tools, 1998, pp. 134-145.
[5] W. Gropp, E. Lusk and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, Cambridge, MA, 1994.
[6] R. D. Ryne and S. Habib, Beam Dynamics Calculations and Particle Tracking using Massively Parallel Processors, Part. Accel., 55 (1996), p. 365.
[7] W. F. Humphrey et al., Particle Beam Dynamics Simulations using the POOMA Framework, in 1998 International Scientific Computing in Object-Oriented Parallel Environments Conference: ISCOPE 98, Springer-Verlag, Berlin, Germany, 1998.
[8] J. Ahrens et al., See http://www.acl.lanl.gov/Viz/aclvis.html.
[9] S. R. Lee, J. C. Cummings and S. D. Nolen, MC++: Parallel Portable Monte Carlo Neutron Transport in C++, Report No. LA-UR-96-4808, 1996.
[10] S. Vajracharya et al., See http://www.acl.lanl.gov/smarts.