A Parallel Scalable Approach to Short-Range Molecular Dynamics on the CM-5

Roscoe Giles
Department of Electrical, Computer, and Systems Engineering and Center for Computational Science, Boston University, Boston, MA 02215

Pablo Tamayo
Thinking Machines Corp., Cambridge, MA 02142
Abstract

We present a scalable algorithm for short-range Molecular Dynamics which minimizes interprocessor communication at the expense of a modest computational redundancy. The method combines Verlet neighbor lists with coarse-grained cells. Each processing node is associated with a cubic volume of space, and the particles it owns are those initially contained in that volume. Data structures for "own" and "visitor" particle coordinates are maintained in each node. Visitors are particles owned by one of the 26 neighboring cells but lying within an interaction range of a face. The Verlet neighbor list includes pointers to own-own and own-visitor interactions. To communicate, each of the 26 neighbor cells sends a corresponding block of particle coordinates using message-passing calls. The algorithm has the numerical properties of the standard serial Verlet method and is efficient for hundreds to thousands of particles per node, allowing the simulation of large systems with millions of particles. Preliminary results on the new CM-5 supercomputer are described.
1 Introduction

Over the last 30 years Molecular Dynamics (MD) methods have been used to study properties of liquids, solids, polymers and phase transitions in general[1-4]. In the last decade, the availability of fast vector and parallel supercomputers has made it possible to apply Molecular Dynamics to more realistic and challenging problems in Chemistry, Biology, Physics and Materials Science[5-8]. Today, with the advent of high-performance parallel computers[9], we face the challenge of developing new methods to make optimal use of these unprecedented computational resources. It is important to develop algorithms with a sufficient degree of uniformity and appropriate scaling properties to exploit the capabilities of scalable SIMD and MIMD parallel architectures.
2 Computational Problems of MD Simulations

A molecular dynamics simulation requires the integration of the classical equations of motion of a system of N particles. The main computational problem in MD simulations is the calculation of inter-particle forces. The force between two particles is usually given by the gradient of a pair-potential. In the general case, each particle can interact with all the other particles in the system, and one is then solving an N-body problem with O(N^2) non-zero pair interactions. For long-range potentials such as the Coulomb electrostatic interaction, this is inevitable. Methods for making long-range potentials tractable for large N typically involve combining and regrouping terms in the force expression so as to avoid direct summation over the N(N-1)/2 pairs. However, for short-range, rapidly decaying potentials, such as Lennard-Jones, the interaction can be treated as zero for particles whose separation is larger than some cutoff radius. This means that each particle interacts only with neighbors inside a sphere of that radius. This reduces the computational complexity to O(N) at the expense of identifying and tracking the neighbors of each particle. The actual number of interacting neighbors is a function of the molecular density. For our typical simulations the number of interacting neighbors is of order 50 to 70. For the Lennard-Jones potentials we consider, the computation of a single pair-interaction requires about 30-40 floating-point operations. Therefore, a complete force calculation requires on the order of 2000 floating-point operations per particle.

The key issue for short-range MD on serial computers is the identification of interacting pairs. The basic tradeoff is between time invested in identifying the non-zero interactions (so as to reduce time spent computing zero forces) and the actual force calculation itself. Traditional methods are based on the use of Verlet neighbor-tables[10] and/or spatial cells. Spatial cell methods divide physical space into a regular grid of cells. The search for interactions of particles in a given cell is restricted to a local neighborhood. In a "pure" version, no further search within the neighboring cells is performed and all particles in neighboring cells are considered.
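As an illustration (a sketch, not the implementation used here), a truncated Lennard-Jones pair force of the kind discussed above might be coded as follows, assuming reduced units with sigma = epsilon = 1 and an assumed cutoff of 2.5 sigma:

#define LJ_RCUT2 (2.5 * 2.5)            /* square of the cutoff radius (assumed) */

/* Force exerted on particle i by particle j, given the separation
   vector (dx, dy, dz) = r_i - r_j; the result is written into f[3]. */
void lj_pair_force(double dx, double dy, double dz, double f[3])
{
    double r2 = dx * dx + dy * dy + dz * dz;
    if (r2 >= LJ_RCUT2) {               /* beyond the cutoff: zero force */
        f[0] = f[1] = f[2] = 0.0;
        return;
    }
    double inv_r2 = 1.0 / r2;
    double inv_r6 = inv_r2 * inv_r2 * inv_r2;
    /* for V(r) = 4 (r^-12 - r^-6), the force is F(r)/r = 24 (2 r^-12 - r^-6) / r^2 */
    double fr = 24.0 * inv_r6 * (2.0 * inv_r6 - 1.0) * inv_r2;
    f[0] = fr * dx;
    f[1] = fr * dy;
    f[2] = fr * dz;
}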
3 Verlet Neighbor-Table Method

Consider a system of N particles interacting through a short-range potential of range R_int. If the average particle density is ρ, each particle on average interacts with N_b ≈ (4π/3) ρ R_int^3 neighbors. This gives an average number of non-zero pair forces of (1/2) N N_b. At fixed density, this is O(N) rather than O(N^2). At each stage of the computation the algorithm maintains a table of "bonds" (pairs of particles) for which the interaction forces are actually computed. Though some of the bonds may result in zero forces, it is an error if any pair with a non-zero force is absent from the table. The overhead for constructing the table is usually significant compared to the force computation, so the tables are reused for several time-steps. At a given time step, the tables are constructed by finding all pairs of particles whose separation is less than R_s = R_int + δ. Here, δ is a "safety" margin of distance. The tables can be reused for as long as no pair of particles originally farther apart than R_s gets closer together than R_int. This is a time of order δ/v, where v is a typical velocity. The parallel scalable algorithm we consider is numerically equivalent to the Verlet neighbor-table method. Thus, the relationship between δ, the table rebuilding interval, and the time step is exactly as in the standard version of the algorithm. The difference is in its parallelism.
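As an illustration (a sketch with assumed names, not the implementation used here), a serial Verlet table can be built by recording every pair closer than R_s, so that the table remains valid until particles have moved far enough to violate the safety margin:

typedef struct { int i, j; } bond_t;    /* one entry of the neighbor table */

/* x, y, z hold the coordinates of n particles; rs2 is R_s squared.
   bonds must be large enough to hold all pairs found; the count is returned. */
int build_verlet_table(const double *x, const double *y, const double *z,
                       int n, double rs2, bond_t *bonds)
{
    int nbonds = 0;
    for (int i = 0; i < n - 1; i++) {
        for (int j = i + 1; j < n; j++) {        /* straightforward O(n^2) scan */
            double dx = x[i] - x[j];
            double dy = y[i] - y[j];
            double dz = z[i] - z[j];
            if (dx * dx + dy * dy + dz * dz < rs2) {
                bonds[nbonds].i = i;             /* record each pair only once */
                bonds[nbonds].j = j;
                nbonds++;
            }
        }
    }
    return nbonds;
}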
4 Parallel MD Algorithms

Parallel MD methods are typically based on extensions of traditional methods. However, the need to communicate information between processors adds a new dimension to the parallel MD problem. Usually interprocessor communication costs (latency and bandwidth) are significantly larger than processor-memory data transfer costs. For MD this means that not only do we want to minimize the number of floating-point operations (which reflect local processor-memory interactions), as in serial algorithms, but we also want to organize data so as to minimize communication requirements and to get the most reuse out of any data that must be sent from processor to processor. The most natural way to achieve this is to capture the spatial locality inherent in the problem in the computational locality of a scalable parallel processing system. A direct way of accomplishing this is to map node-processors to cells, where the granularity of the cells is related to the number, complexity and connectivity of the node-processors in the machine. Examples of this for the fine-grained massively parallel CM-2 have been described previously[12, 13]. In this paper we address some of these issues and present a MIMD (more properly an SPMD, Single Program Multiple Data) method which combines Verlet neighbor lists with coarse-grained cells. In this situation, many particles are mapped to each cell. Particles in a given cell will necessarily interact with particles in neighboring cells. Computation of such interactions requires that the coordinates of one or the other particle be communicated across cell boundaries. The method we have implemented minimizes the net amount of data crossing cell boundaries by sending such particle coordinates across the cell boundaries exactly once. The proposed method reduces interprocessor communication, is fully scalable, and has been tested on the new CM-5 supercomputer.
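As an illustration of the geometric mapping (a sketch with assumed names and a simple row-major node numbering, not the actual CM-5 layout), the node owning a particle can be computed directly from its coordinates, assuming cubic cells of edge L and periodic boundaries:

#include <math.h>

typedef struct { int px, py, pz; double L; } decomposition_t;

/* Node rank (in a simple row-major numbering) owning a particle at (x, y, z),
   for a px * py * pz grid of cubic cells of edge L, one cell per node. */
int owning_node(const decomposition_t *d, double x, double y, double z)
{
    int ix = ((int)floor(x / d->L) % d->px + d->px) % d->px;
    int iy = ((int)floor(y / d->L) % d->py + d->py) % d->py;
    int iz = ((int)floor(z / d->L) % d->pz + d->pz) % d->pz;
    return (iz * d->py + iy) * d->px + ix;
}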
5 CM-5 Characteristics

Before we describe our method we briefly review some general characteristics of the CM-5.
The CM-5 is a scalable, partitionable network of MIMD processing nodes optimized for data-parallel and SPMD computation. It has independent high-speed data and control networks. The processing nodes can fetch and execute their local instructions independently, but the control network can keep them synchronized for data-parallel mode. The processing nodes are off-the-shelf SPARC processors, and each has access to four 64-bit vector units and 32 Mbytes of memory. The networks have cross-section bandwidths scalable up to 16k processors. The operating system (UNIX) supports both timesharing and partitioning. Operating in MIMD mode the machine supports C and Fortran plus calls to the CMMD message-passing library. Important features for this application are:

1. The nodes are relatively powerful independent processors. At the hardware level we therefore have coarse-grained parallelism (hundreds or thousands of particles per node). Since our goal is to minimize communication costs, we will explicitly represent this processor granularity rather than subsume it in some higher-level fine-grain data-parallel model.

2. There are fast synchronization and global reduction operations through the control network, so we need not worry much about the communication overhead associated with loosely synchronizing the activities of the processors.

3. Long data blocks are transmitted through the network more efficiently than short blocks. Our algorithm gathers data into long blocks before transmittal.

4. The local UNIX kernel on the nodes supports dynamic memory allocation. We use the rich set of memory allocation options and pointer manipulations available conventionally in C.

5. The existence of vector units in each processor allows for the possibility of employing techniques already developed for vector supercomputers[11]. We have not yet exploited the vector units.
6 A Parallel Scalable Coarse-Grained Method

Our algorithm is designed to minimize interprocessor communications and associated overhead for short-range MD on the CM-5. We do so at the expense of a (modest) redundancy of computations and at the expense of increased memory use. Our algorithm is efficient for hundreds to thousands of particles per node. This is appropriate for simulations of millions of particles on hundreds or thousands of nodes.

Figure 1: The "proper" and "extended" volumes corresponding to one cell (processing node) in our method.

At any stage of the algorithm, each particle is "owned" by a processing node of the CM-5. If particles were distributed randomly among P processors with N ≫ P, we would expect most bonds to require a communication. To avoid this, we distribute particles among processor nodes geometrically: each processor is associated with a rectangular volume of space and the particles it owns are those initially contained in this "proper" volume. For convenience of exposition, we assume that the volume of each cell is cubic (L × L × L). Also, we assume that the cells are large enough that R_s < L. In this case, every particle within the interaction range R_int of a given particle lies in that particle's own cell or one of its 26 neighbors. An "extended" space is defined to include the particles in neighboring cells lying within an interaction range of the boundary (see Fig. 1). If a particle's coordinates do have to be communicated from one cell to a neighboring one, it is likely that they will be needed for more than one bond.
This makes it possible to minimize the data flow from processor to processor by using block-mode communication. To properly perform the force computations for all the "own" particles, a number of data structures are maintained in each processing node:

- An array of "own" particle coordinates and velocities. These are the (on average) ρL^3 particles initially in the cell.
- An array of "visitor" particle coordinates. These are particles owned by one of the 26 neighboring cells but lying within an interaction range of a face. There are on average ρ((L + 2R_s)^3 - L^3) such particles.
- A neighbor list of pairs of pointers to particles that interact. This Verlet list includes only own-own and own-visitor interactions, not visitor-visitor interactions.
- 26 lists of pointers to "own" particle coordinates to be sent to each of the neighboring cells.
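As an illustration (a sketch with assumed names, not the actual implementation), these structures might be declared in C as follows; the arrays are allocated dynamically at each rebuilding step:

typedef struct { double x, y, z; } coord_t;
typedef struct { int i, j; } pair_t;     /* i: own index; j: own or visitor index */

typedef struct {
    /* "own" particles: those initially inside this node's proper volume */
    int      n_own;
    coord_t *own_pos;
    coord_t *own_vel;

    /* "visitor" particles: owned by one of the 26 neighbors but lying
       within an interaction range of a face of this cell */
    int      n_visitor;
    coord_t *visitor_pos;

    /* Verlet neighbor list: own-own and own-visitor pairs only */
    int      n_pairs;
    pair_t  *pairs;

    /* for each of the 26 neighbor directions, the indices of the own
       particles whose coordinates must be sent to that neighbor */
    int      n_send[26];
    int     *send_index[26];
} node_state_t;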
These data structures are schematically represented by boxes in Figure 2.

Figure 2: Data structures maintained in each node to compute the forces between particles, including the proper and extended volumes. The arrows show data transfers inside node memory and block-mode communications (CMMD). (a) shows the state of the particles right after rebuilding the data structures; (b) shows the particles after a few regular time steps. Notice that some particles have left the proper volume of the cell but are still considered "own" particles, as their pointers will not be changed until the next rebuilding step. The extended region takes into account the same safety distance needed for the Verlet list, and consequently no interactions are missed.

A regular time step consists of a communication step followed by in-node force calculations and integration of the equations of motion (a sketch combining these steps appears at the end of this section):

1. For each of the 26 neighbor directions, all cells send a corresponding block of particle coordinates, as determined by their pointer lists. Note that each cell must independently do a gather and a "block-mode" communication to a neighboring cell (this is done by calling the CM-5 CMMD message-passing functions).

2. Each cell computes the forces corresponding to the local neighbor pair lists. The result is the total force on all "own" particles.

3. Each cell integrates the equations of motion to get new velocities and positions for the "own" particles.

Regular steps can be executed only after the data structures and tables have been constructed. This rebuilding is done in a regular, local way:

1. Perform a communication step as in the force calculation to update the positions of the "visitors" in each cell.

2. Each cell scans through all particles (own and visitor) and keeps only those particles that actually reside in the proper volume of the cell. If there has been no error (i.e., no particle has moved more than δ), this is guaranteed to place all the particles in the correct cell without a global re-sorting! An error can be detected by simply counting particles at this stage.

3. Communicate velocities to the proper cells.

4. Each cell determines the 26 lists of its own particles that must be sent to neighbors, based on whether they are close enough to the faces of the cell.

5. Send the count of such particles to the corresponding neighbor. This allows each cell to allocate memory for the "visitors" it will receive in 26 blocks.

6. Do a coordinate communication step with the new visitor lists.

7. Each cell scans the coordinates of its "own + visitor" particles to produce new Verlet neighbor tables.
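The following sketch (assumed names, not the production code) combines the three parts of a regular time step. It reuses the node_state_t and lj_pair_force sketches above, and the gather and CMMD block-mode transfers of step 1 are hidden behind a placeholder rather than reproducing the actual CMMD calls. Note that own-visitor forces are also computed by the visitor's owner from its own copy of the pair; this is the modest computational redundancy mentioned in the abstract.

void lj_pair_force(double dx, double dy, double dz, double f[3]);  /* Section 2 sketch */
void exchange_visitor_coordinates(node_state_t *s);                /* placeholder for step 1 */

void regular_time_step(node_state_t *s, coord_t *force, double dt)
{
    /* 1. communication: refresh the visitor coordinates from the 26 neighbors */
    exchange_visitor_coordinates(s);

    /* 2. forces from the local Verlet list (own-own and own-visitor pairs);
          visitors are indexed as n_own, n_own+1, ... in this sketch */
    for (int k = 0; k < s->n_own; k++)
        force[k].x = force[k].y = force[k].z = 0.0;
    for (int b = 0; b < s->n_pairs; b++) {
        int i = s->pairs[b].i;
        int j = s->pairs[b].j;
        coord_t pj = (j < s->n_own) ? s->own_pos[j]
                                    : s->visitor_pos[j - s->n_own];
        double f[3];
        lj_pair_force(s->own_pos[i].x - pj.x,
                      s->own_pos[i].y - pj.y,
                      s->own_pos[i].z - pj.z, f);
        force[i].x += f[0]; force[i].y += f[1]; force[i].z += f[2];
        if (j < s->n_own) {                 /* Newton's third law applies only */
            force[j].x -= f[0];             /* to own-own pairs                */
            force[j].y -= f[1];
            force[j].z -= f[2];
        }
    }

    /* 3. integrate the equations of motion for the own particles
          (a simple kick-drift update with unit mass, for illustration only;
          the production code follows the standard Verlet scheme) */
    for (int k = 0; k < s->n_own; k++) {
        s->own_vel[k].x += dt * force[k].x;
        s->own_vel[k].y += dt * force[k].y;
        s->own_vel[k].z += dt * force[k].z;
        s->own_pos[k].x += dt * s->own_vel[k].x;
        s->own_pos[k].y += dt * s->own_vel[k].y;
        s->own_pos[k].z += dt * s->own_vel[k].z;
    }
}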
7 Results and Discussion

The algorithm above has been implemented on CM-5 machines ranging in size from 32 to 512 nodes, without vector units installed. A significant improvement in performance is expected when the vector units become available. In our runs, overall average communication rates of 6 Mbytes/sec/node are typical. Communications take only about 5-10% of the total execution time. Most of the time is spent computing forces and constructing the Verlet list. The update times on a 256-node CM-5 are less than 8 µs per particle-update (for N = 51,200 particles). We are currently investigating efficient ways of building the neighbor tables and dynamically allocating memory for the associated data structures. We are also studying vectorization schemes for the in-node force calculations.

Though the precise scaling of this algorithm is still the subject of empirical study, it is straightforward to characterize in a general way. The critical parameter for the scaling of this algorithm is the average number of particles per processor, n. In our runs so far n has ranged from about 200 to 2000. In addition to the obvious linear impact on local memory and ALU operation time, the value of n has the following effects: (1) For small n, communications become less efficient due to fixed overheads and decreasing block lengths. This is especially pronounced at the 8 corners and 12 edges of the cells. (2) For sufficiently large n the communication times scale linearly. (3) The time required to build neighbor tables depends on the local algorithm for doing so, which we have not focused on here. As for purely serial implementations of the neighbor-table method, we expect that the strategy of using an O(n^2) method for small n and an O(n) (local) cell method for larger n will be best. This gives an overall asymptotic linear scaling with n (though it is unlikely that one will try to reach asymptopia in n as opposed to N). Along a "data parallel" scaling line, with N → ∞ at fixed n, we would expect times which grow polylogarithmically due to the performance of the network.
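A back-of-the-envelope estimate, based only on the visitor count given in Section 6 and not on separate measurements, makes the role of n explicit. With n = ρL^3 own particles per node and the safety radius R_s held fixed,

\[
\frac{\text{visitors}}{\text{own particles}}
  = \frac{\rho\left[(L + 2R_s)^3 - L^3\right]}{\rho L^3}
  = \frac{6R_s}{L} + \frac{12R_s^2}{L^2} + \frac{8R_s^3}{L^3}
  \;\approx\; \frac{6R_s}{L}
  = 6R_s \left(\frac{\rho}{n}\right)^{1/3}
  \qquad (L \gg R_s),
\]

so the extra memory for visitors and the volume of coordinates communicated per own particle both fall off as n^{-1/3}, while the force work per own particle is essentially independent of n. At the smaller values of n used in our runs the higher-order terms are not negligible, so this asymptotic form should be taken only as a guide.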
Acknowledgments

We want to thank W. Klein, A. Mel'cuk, H. Gould, B. Boghosian, J. Mesirov, G. Grest, S. Plimpton and R. Brower for interesting discussions about MD methods. We also want to express our appreciation to the CM-5 CMMD team: A. Greenberg, L. Tucker, M. Drumheller, G. Drescher, A. Landy, R. Fry and C. Feynman for their valuable assistance and many useful suggestions.
References

[1] G. Ciccotti, D. Frenkel and I. R. McDonald, eds., "Simulations of Liquids and Solids", North-Holland, Elsevier Science Publishers, 1987.
[2] R. W. Hockney and J. W. Eastwood, "Computer Simulation Using Particles", McGraw-Hill, New York, 1981.
[3] M. P. Allen and D. J. Tildesley, "Computer Simulation of Liquids", Clarendon Press, Oxford, 1987.
[4] H. Gould and J. Tobochnik, "Computer Simulation Methods", Addison-Wesley, 1988.
[5] D. C. Rapaport, Comp. Phys. Comm. 62, 198 (1991).
[6] D. C. Rapaport, Comp. Phys. Comm. 62, 217 (1991).
[7] W. Smith, Comp. Phys. Comm. 62, 229 (1991).
[8] F. F. Abraham, Advances in Physics 35, 1 (1986).
[9] B. M. Boghosian, "Computational Physics on the Connection Machine", Computers in Physics, Jan/Feb 1990.
[10] L. Verlet, Phys. Rev. 165, 201 (1968).
[11] G. S. Grest, B. Dünweg and K. Kremer, Comp. Phys. Comm. 55, 269 (1989).
[12] A. I. Mel'cuk, R. C. Giles and H. Gould, "Molecular Dynamics Simulation of Liquids on the Connection Machine", Computers in Physics, May/June 1991.
[13] P. Tamayo, J. P. Mesirov and B. M. Boghosian, "Parallel Approaches to Short-Range Molecular Dynamics Simulations", Proceedings of Supercomputing '91 (1991).