
Submitted to Comp. Phys. Comm.

Revised Version 7/4/1997

A domain decomposition parallel processing algorithm for molecular dynamics simulations of systems of arbitrary connectivity. David Brown, Hervé Minoux and Bernard Maigret Laboratoire de Chimie Théorique - URA CNRS 510, Université Henri Poincaré - Nancy I, BP 239, 54506 Vandoeuvre-lès-Nancy Cedex, France.

Abstract
We describe in this paper methods for applying domain decomposition to a general purpose molecular dynamics program. The algorithm is suitable for either distributed memory parallel computers or shared memory machines with message passing libraries. A method is discussed in detail which allows molecules of arbitrary connectivity to be simulated within the domain decomposition approach. The algorithm also contains techniques to handle both rigid bond constraints and special CH2 constraints which involve five atoms. Examples are given for molecular systems containing both types of constraints plus two-, three- and four-body intramolecular potentials as well as short-range and long-range non-bonded potentials. The algorithm has been implemented on a Silicon Graphics Power Challenge machine using the MPI message passing library.

1. Introduction
The basic techniques of the molecular dynamics (MD) computer simulation method are now relatively well established, as are its uses and limitations [1-3]. Prominent amongst these limitations are the accessible time and length scales, which are restricted by the available computing power. For atomistic simulations of actual physical systems this is, in all practical terms, currently in the nanoscale regime. Thus many technologically and biochemically interesting systems which extend over distances greater than, say, 10 nm and/or relax on time scales longer than a nanosecond are time consuming to simulate. The advent of parallel computers, on the other hand, has resulted in an unprecedented increase in the available processing power which, if it can be efficiently harnessed, offers the prospect of simulating systems of a much larger scale than previously possible and thus of advancing our understanding of many processes at a fundamental level. The prospects offered by parallel computers have inevitably led to much interest in applying them to molecular dynamics simulations and there is now quite a history in the subject. Initial attempts were largely limited to atomic systems and to highly specific problems, e.g. [4-8]. During this time, a number of strategies were proposed for mapping MD onto multiple-instruction multiple-data (MIMD) machines based on either particle decomposition, assigning particles to home processors using their indices, or domain decomposition, assigning particles to processors according to their position in space [4, 5, 9-13].


Although there have been more recent developments involving force decomposition [14, 15] and attempts to combine strategies [16], the problem has remained that the best parallel solution for a particular simulation depends on factors like the average number of particles per processor and the speed of the individual processors. However, for large scale simulations, where there are many more particles than processors, there is little doubt that domain decomposition (DD) is the optimum approach. In the DD strategy, each processor element (PE) is responsible for particles residing in a particular sub-volume of the MD box and each processor accumulates the total forces on the particles in its domain and integrates their equations of motion. Although DD is relatively straightforward to implement for atomic systems in which only pair interactions exist (and there are now numerous examples [7, 11, 17-23]), the connectivity of molecular systems introduces a number of new difficulties [24, 25]. To appreciate these difficulties, a good working knowledge of the common practices in the molecular dynamics of molecular systems is required, and these will be reviewed briefly later, but for the meantime the added complexity molecular systems bring to DD can be summarized as follows:

(1) Multiparticle potentials and specialized constraints require that the group of n atoms involved in the calculation be simultaneously known by one PE, which is assigned to carry out the necessary computations.

(2) Particles subject to rigid constraints may lie in different domains. The procedure SHAKE, generally used to impose rigid constraints, implies a communication between processors at each of several iterations.

(3) Not all pairs of atoms interact via non-bonded pair potentials.

As a result of the above, most attempts to parallelize general purpose MD codes have relied on simpler-to-implement particle decomposition techniques [26-32], which are unsuitable for large-scale simulations as each processor requires a knowledge of all atomic positions. Some improvement is possible using force decomposition techniques [14] but, at this point in time, there exist only a few examples in the literature of DD MD programs being used for molecular systems [24, 25, 33-35]. Chynoweth et al [33] have described a DD method in which space is divided in just two dimensions (2-D), so that each PE is responsible for a column of the system. They then used what they called a local multiple systolic loop algorithm to calculate the forces and energy but gave few details of how they handled the connectivity problem. Esselink and Hilbers [24] have also used a two-dimensional partitioning of space and outline and justify a method which guarantees that all n atoms of a multi-particle potential will be found by just one PE. In partitioning space in only two dimensions, both these latter methods incur some communication penalties for not using the optimum, 3-D, decomposition [19]. Furthermore, they also perform more communications than are required for a search of all interacting pairs. Clark et al [34] have recently presented a program called EULERGROMOS, which is a parallel version of the general purpose MD program GROMOS. They use domain decomposition to calculate the non-bonded forces but incur some redundancy by evaluating the same pair forces in different domains.

Their method has the advantage, though, that it is not limited to domains of size greater than the potential truncation distance and can also be operated in a self load-balancing mode. However, to evaluate the bonded forces, each processor has to scan the global connectivity table to extract those terms involving atoms in its own domain, a process which probably also involves some redundant calculations and which they admit could limit the scalability of the algorithm. This may be a marginal issue in the case where the CPU costs are dominated by the calculation of the non-bonded forces but is likely to be fairly crucial in the cases where they are not (see section 5.1 later). Plimpton et al [35] have also recently described a fairly comprehensive general purpose parallel MD program, called LAMMPS, which incorporates multiple time step methods and efficient Coulombic sums. For the bonded forces, however, they either scan a global connectivity table or use a 'hash table', but few details are given of the latter approach. In the past, we have described a method which allows the optimum 3-D decomposition of space to be combined with an efficient minimum communication scheme and a non-redundant search for all interacting pairs [19]. Furthermore, we have shown how this can be retained in a DD code capable of handling model linear homo-polymers [25]. This algorithm for molecular DD incorporated a parallel SHAKE algorithm, to constrain bond lengths, and a crude, but effective, method to handle the three- and four-body forces which were used to restrict the flexibility of the linear molecules. Being able to use a parallel SHAKE algorithm was seen as an important part of this work. The generally quoted reason for choosing rigidly constrained bonds as opposed to flexible ones in molecular dynamics simulations is that of saving CPU time; the relatively high frequency of most realistic chemical bond vibrations implies that short time steps have to be used for stable integration of the equations of motion. However, a second and much more fundamental, although less well appreciated, problem with the dynamics of classical systems, where there are modes of motion with widely separated frequencies present, is that of equipartition of kinetic energy. It is possible for the coupling between modes to be so poor as to make equilibration times extremely long [36]. For this latter reason alone it is important to retain the constraints routine in the DD framework. It is worth pointing out that although multiple time step methods, e.g. [37], can overcome the problem of short time steps, they cannot address the problem of poor coupling. In this article, we describe in detail a new method which allows domain decomposition to be applied to molecular systems of arbitrary connectivity. We retain many of the features used previously [25] but implement a new method for searching for all n atoms which constitute a group (henceforward referred to as an ‘n-plet’) required for a multi-particle potential or a rigid constraint. The implementation issues within the framework of a generic general purpose MD program are addressed. Finally, we illustrate the new method in simulations of systems containing not only pair potentials, three- and four-atom multi-particle potentials and rigid bonds, but also special CH2 constraints [38] which involve n-plets of five atoms.

2. Molecular dynamics
It is perhaps useful to briefly review some of the common practices carried out in molecular dynamics computer simulation. This will be by no means a comprehensive treatment, standard texts being available for that purpose [1-3], but will serve as an introduction to some of the difficulties involved in applying domain decomposition to a molecular system of arbitrary connectivity.


First of all, in an MD simulation using classical mechanics, a molecule is generally modelled as a number of atoms connected by a network of bonds. Normally, all atoms in one molecule interact with all atoms in another through some form of what is often referred to as a non-bonded pair potential. That is to say, any two atoms on different molecules will experience a force between them independent of all the other atoms present. In reality, this is not truly the case, but expedience dictates that this is an approximation which is almost universally made in classical MD. The main purpose of the non-bonded potential is usually to exclude two atoms from the same region of space (repulsive interaction). Attractive terms in the potential add the ‘glue’ which, in the absence of containing walls, holds liquid/solid systems together and influences the preferred packing structure of the molecules to a greater or lesser extent depending on their exact nature. Two examples of non-bonded pair potentials used in simulations are the Lennard-Jones 12-6 potential

ΦLJ(|rij|) = 4ε [ (σ/|rij|)¹² − (σ/|rij|)⁶ ]   (1)

and the Coulombic potential

Φc(|rij|) = qiqj / (4πε₀|rij|)   (2)

In Eqs. 1 and 2 the pair of interacting atoms, {i,j}, have a separation vector denoted by rij (= ri − rj). In Eq. 1, ε is the well-depth of the potential and σ is the distance at which the potential is zero. In Eq. 2, qi and qj are the partial charges on the two atoms and ε₀ is the permittivity of the vacuum. In principle, both these potentials imply that particles interact whatever their distance apart. In practice, the high powers used in Eq. 1 mean that the interaction energy reaches relatively small values, compared to ε, within a few σ and it is customary to ignore the forces between atoms further apart than some cut-off distance, rc. For this reason the Lennard-Jones potential is often referred to as being ‘short-range’. On the other hand, the Coulombic potential diminishes slowly with range and is often referred to as a ‘long-range’ potential, despite the fact that it has a very significant influence at short range too. From several studies already conducted [39-42], it is very clear that approximate techniques which either simply truncate the 1/r potential or force it to zero using a switching function lead to drastic unphysical effects which only disappear when the non-negligible long range part of the Coulomb potential is handled in a proper manner. In the program to be described here, we have used the Ewald summation method [43, 44]. In 3-D periodic boundary conditions, the Ewald summation nominally fits well into a DD strategy as, firstly, the reciprocal space summation, formally a sum over pairs of atoms, can be factorized into a single sum involving atomic positions only, within a sum over reciprocal lattice vectors, k, defined with respect to the matrix of basis vectors, h, specifying the shape and size of the MD box as

k = 2π (h⁻¹)ᵗ (L, M, N)ᵗ   (3)

with L, M and N integers. The parallelization of this part is then straightforward with each PE performing the sum over k-vectors for each atom in its domain (see Section 4.2.1).
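For concreteness, the factorization can be written as follows. This is a standard form of the reciprocal space sum in SI units; the structure factor notation S(k) is introduced here for illustration only and is not the notation of refs. [43, 44]:

\[
S(\mathbf{k}) = \sum_{j=1}^{N} q_j \, e^{\,i\mathbf{k}\cdot\mathbf{r}_j},
\qquad
U_{\mathrm{rec}} = \frac{1}{2V\varepsilon_0} \sum_{\mathbf{k}\neq\mathbf{0}}
\frac{e^{-k^2/4\alpha^2}}{k^2}\,\bigl|S(\mathbf{k})\bigr|^2 ,
\]

where V is the volume of the MD box and α is the Ewald separation parameter. Each PE can accumulate the contribution to S(k) from its own atoms, a single global summation then completing S(k) on every processor.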


Secondly, the real space part of the Ewald summation is short-ranged and can be simply incorporated within the non-bonded pair force calculation. Problems do exist with the Ewald summation, though, when systems become large in extent, in that, for a fixed cut-off in real space, the number of k with a modulus less than a certain value (the determining factor in the convergence of the reciprocal space sum) increases. The scaling of the method is thus non-linear and ultimately the cost becomes prohibitive. Alternative methods, such as the fast multipole method (FMM) [45] and particle mesh Ewald (PME), have already been used in large scale simulations [35, 46-49]. Although the fast multipole method has linear scaling with n, the number of particles, and particle mesh Ewald has order n log₂ n scaling, at least one recent comparison of the two methods comes out substantially in favour of PME [50] as the preferred method for both scalar and parallel MD, based on the timings obtained, ease of implementation, treatment of non-orthorhombic cells and suitability for parallelization. Given the rather chequered history of comparisons between FMM and Ewald (see for example refs. [51, 52] for a review), with crossover numbers varying by several orders of magnitude, it is probably premature to say that the definitive statement has been made on which is the best method. There is currently much activity in the area of efficient Coulombic sums which warrants closer inspection [45, 51-64]. However, this is not the main issue of this paper. For the time being we continue to use the Ewald sum, although we are currently investigating these other possibilities. Now, atoms within the same molecule may also interact in the same way as described above, but this depends on whether their interactions are deemed to be ‘non-bonded’ or ‘bonded’; it is at this point that the common terminology of MD starts to become slightly convoluted. In principle, it would be simpler if atoms interacted with each other in a completely consistent pair additive way, i.e. an atom of type A always interacts with an atom of type B by the same pair potential irrespective of where they are in space. Such ‘central force’ models do exist, e.g. for water [65], but in general it has proved simpler to employ specific intramolecular potentials to maintain the known geometry of molecules. So, typically, atom pairs forming a chemical bond will be kept at a fixed distance apart by a rigid constraint or, alternatively, quite close to it by a continuous potential. The harmonic spring is a popular form

Φb(|rij|) = ½ kb (|rij| − b₀)²   (4)

where for a bonded pair of atoms, i and j, kb is the force constant and b₀ the equilibrium bond length. To maintain a low amplitude motion, though, the force constant has to be quite high and this in turn implies high frequency motions. High frequency vibrations lead to a short time step in the integration of the equations of motion of the atoms and, as mentioned in the introduction, can lead to difficulties with equipartition of energy if the frequency is so high as to be largely uncoupled from other motions in the system. For these reasons, it is often preferable to constrain bonds rigidly. To maintain bond angles close to their equilibrium values, it is customary to apply some form of bending potential. Typically this will be parametrized in terms of a function of the bond angle, Φ(θ), and several forms are in common use, including the following

Φ(θ) = ½ kθ (cos θ − cos θ₀)²   (5)

where kθ is a constant with units of energy determining the flexibility of the angle and θ₀ is the equilibrium bond angle. The angle θ is a function of the positions of the three contiguous atoms, i.e. {i, j, k} with i bonded to j and j bonded to k and i ≠ k, in the molecule

cos θ = (rij • rkj) / (|rij| |rkj|)   (6)

Equation 5 has computational advantages over the harmonic form,

Φ(θ) = ½ kθ′ (θ − θ₀)²   (7)

in that no expensive inverse cosine function calls have to be made and the potential is well behaved in the region of θ = π, where the form of Eq. 7 introduces a singularity in the force due to a (sin θ)⁻¹ term appearing. In general, the bonding and bending terms are deemed ‘bonded’ interactions and therefore the atoms involved do not interact through the non-bonded potential. However, this simple distinction does not stretch to the one other intramolecular potential in common use, the torsional potential restricting rotations around bonds. In this case, there are at least two possible approaches. Both of these use, either in part or wholly, a parametrization in terms of the torsion angle, τ. Given four contiguous atoms {i, j, k, l} in a molecule, the torsion (or dihedral) angle between the planes of the two groups {i, j, k} and {j, k, l} can be defined as

cos τ = − (rij × rjk) • (rjk × rkl) / (|rij × rjk| |rjk × rkl|)   (8)

In this convention the dihedral angle varies from −180° to +180°, with τ = 0° corresponding to the trans conformation, i.e. where all four atoms lie in the same plane and atoms i and l are at their furthest distance apart. A typical parametrization takes the form of a polynomial in cos τ

Φ(τ) = ∑_{m=0}^{ν} Cm cosᵐ τ   (9)

where the Cm are the coefficients of a polynomial of order ν. One possibility is then to have part of the torsional energy described by Eq. 9 whilst non-bonded interactions between the atoms at the ends of the torsion, i and l, make up the remainder. The alternative is to choose the coefficients in Eq. 9 to represent the entire torsional energy, in which case i and l are deemed not to interact through the non-bonded potential. The purpose here is not to say which, if either, is a better method but simply to point out some of the circumlocution in MD of molecular systems; in this latter case, we have the situation where a nominally ‘bonded’ intramolecular potential can involve non-bonded potentials. The above set of potentials forms a generic force field typical of that used in classical MD simulations. Given a set of initial coordinates and velocities for the atoms, the boundary conditions and a force field like the above, MD progresses by calculating the total force on each atom and then integrating the equations of motion forward by a short time interval, ∆t, using finite difference methods. Done repeatedly, this process builds up a trajectory for each atom in the system from which various thermodynamic and dynamic properties can be calculated. The equations of motion can be of the Newtonian form or be more complicated in a number of ways, e.g. in order to force the system to have a certain average temperature and pressure [1]. In addition to the above, a molecular system will often contain a number of rigid constraints which have to be taken into account when solving the equations of motion. The simplest form of constraint is that of a rigid bond between neighbouring atoms which, as mentioned already, is an alternative to the flexible bond model. However, other more complicated forms [38, 66] can involve more than two atoms. A particular example is the special constraints required to freeze all the degrees of freedom of the hydrogen atoms in a CH2 group [38]. For a X-CH2-Y group, holding the mid-point of the H-H vector on the bisector of the X-C-Y bending angle and maintaining the H-H vector perpendicular to the vector joining the X and Y atoms removes all the degrees of freedom of the hydrogens. Such a constraint involves the positions of five atoms. Although there are different methods available to incorporate a solution of the constraints problem into the equations of motion, we will restrict the discussion in this paper to probably the most common one, based on the iterative procedure ‘SHAKE’ [67].
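Expressed as equations (our own rendering of the verbal description above, not the formulation of [38]; rM = (rH1 + rH2)/2 denotes the hydrogen mid-point and ûCX, ûCY are unit vectors from the carbon towards X and Y), the two conditions read

\[
(\mathbf{r}_M - \mathbf{r}_C)\times(\hat{\mathbf{u}}_{CX}+\hat{\mathbf{u}}_{CY}) = \mathbf{0},
\qquad
(\mathbf{r}_{H_1}-\mathbf{r}_{H_2})\cdot(\mathbf{r}_X-\mathbf{r}_Y) = 0 ,
\]

the first condition holding the mid-point on the bisector of the X-C-Y angle (along which ûCX + ûCY lies) and the second maintaining the required perpendicularity; the rigid bond-length constraints supply the remaining conditions.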

3. Domain decomposition
In this section, we introduce the domain decomposition method by reviewing some of the techniques that we have developed in the past for atomic systems [19] and which were later used for linear chain systems [25]. These will form the basis of the new method. The decomposition of space used here is the same as described previously [25] in that initially atoms are assigned to Npx×Npy×Npz = Np processors, i.e. a 3-D decomposition of space. The exact shape of the MD box does not have to be orthorhombic and assignments are actually carried out using the set of scaled coordinates si = h⁻¹ri, where the matrix h = {a, b, c} is defined from the basis vectors of the MD box. The scaled space is cubic and facilitates easy assignment of particles to processors. The scaled space also facilitates the assignment of atoms to the Ncx×Ncy×Ncz = Nc link-cells within each domain. The link-cell strategy [1] is useful both as a way to speed up the creation of a neighbour table [1] and as a means of determining which atoms lie close to domain boundaries and thus have to be communicated to neighbouring domains. The above assignment of particles to PEs means that the processors are responsible for domains in real space which are not necessarily orthorhombic. This causes no problems and is completely consistent with the link-cell method applied to non-orthorhombic MD boxes. Care, of course, has to be taken with respect to the perpendicular distance between faces of the MD box and the potential truncation distances, particularly in simulations where h is a dynamical variable, but these considerations are no different from those for a scalar program [68]. At this point each PE is only aware of particles in its own domain. As atoms are assigned to processors it is inevitable that some molecules will be distributed among two or more of them. Alternative decompositions based on molecular centres-of-mass [69] are not considered here as we require a method which is general enough to cope with the worst possible case where the MD box contains just one large molecule.
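As an illustration of this assignment stage, the following minimal sketch in C computes the home processor and link-cell of an atom from its scaled coordinates; all names (hinv, Npx, Ncx, ...) are illustrative and not those of the actual program:

/* Sketch: assign an atom to a processor domain and to a link-cell
   within that domain via the scaled coordinate s = h^-1 r.  The
   inverse box matrix hinv is assumed precomputed.  A production code
   would also guard against s*Np rounding up to Np. */
#include <math.h>

static double fold(double s) { return s - floor(s); }  /* into [0,1) */

void assign(const double hinv[3][3], const double r[3],
            int Npx, int Npy, int Npz,   /* processors per dimension */
            int Ncx, int Ncy, int Ncz,   /* link-cells per domain    */
            int *pe, int *cell)
{
    double s[3];
    for (int a = 0; a < 3; ++a)
        s[a] = fold(hinv[a][0]*r[0] + hinv[a][1]*r[1] + hinv[a][2]*r[2]);

    int px = (int)(s[0]*Npx), py = (int)(s[1]*Npy), pz = (int)(s[2]*Npz);
    *pe = (px*Npy + py)*Npz + pz;        /* home processor of the atom */

    /* position within the domain, rescaled to link-cell indices */
    int cx = (int)((s[0]*Npx - px)*Ncx);
    int cy = (int)((s[1]*Npy - py)*Ncy);
    int cz = (int)((s[2]*Npz - pz)*Ncz);
    *cell = (cx*Ncy + cy)*Ncz + cz;
}

Because the indexing is done entirely in the cubic scaled space, the same code serves orthorhombic and non-orthorhombic boxes alike.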


Once the assignments to domains and link-cells have been made, the first step is to perform a search for all interacting pairs of atoms. This requires the communication of those coordinates required by other domains to complete their "working" environment. Details of the modification of the conventional link-cell search pattern which minimises the amount of communication at this stage have been given previously, see Fig. 4 of [19] and related discussion. Briefly, a conventional link-cell search for interacting pairs treats each link-cell in the same way. Particles within a link-cell are interacted amongst themselves and then with the contents of half of the 26 nearest neighbour link-cells [1]. If this search method were applied in a domain decomposition code then it would require communications, direct or indirect, with 17 of the nearest neighbour processors. By treating link-cells differently and modifying their search pattern according to their position within a domain, communications (direct and indirect) are required with only 7 of the nearest neighbour domains and can be accomplished efficiently with just three communication steps. The flow of information during these three communication steps has also been explained in detail previously, see Fig. 1 of [25] and related discussion. In brief, a subject domain's working environment of link-cells is built up by communications in each of the three perpendicular directions, i.e. with respect to the aforementioned scaled space used to assign particles initially to PEs. The nomenclature we have used previously is that the neighbouring PE in the −y (“southward”) direction is called South, that in the +x (“eastward”) direction is called East and that in the +z (“upward”) direction is called Above. In the first stage of communication each PE simultaneously sends information for all particles found on the southern layer of link-cells to its southern neighbour. At the same time, each PE receives some information from its northern neighbour. This increases the working environment of a PE by one layer of link-cells. In the second stage each PE now simultaneously communicates information concerning particles on its eastern layer of link-cells, including that received at the previous step, to its eastern neighbour. Again each PE receives a new layer of link-cells on its western face. There then follows an analogous third stage in the Above direction. Following this initial communication stage, each processor has access to enough of the coordinate information to perform its share of the search for all interacting pairs. This initial pair search not only allows the interaction of all non-bonded pairs but also the formation of neighbour tables of non-bonded pairs. These neighbour tables are used in subsequent steps until deemed redundant. Redundancy occurs automatically when any site has diffused a distance greater than or equal to half the shell width (a parameter used in the construction of the neighbour table [25]) since the last rebuild. Note also that only at this stage is it necessary for sites to be reassigned to different PEs should they have diffused out of their previous home PEs, so as to ensure sites appear in their correct link-cells for the next rebuild. Subsequent steps are handled by the neighbour tables and other lists formed, so it is irrelevant if a site wanders out of the region that its home PE is nominally responsible for.
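This three-step construction can be sketched with MPI as follows. The sketch is schematic only: the packing and unpacking of link-cell layers is omitted, all names are illustrative rather than those of the actual code, and a 3-D Cartesian communicator with dimensions ordered (x, y, z) is assumed:

/* Sketch: build the working environment in three exchanges.
   Each PE sends a boundary layer South (-y), then East (+x), then
   Above (+z), forwarding in later steps what it received earlier. */
#include <mpi.h>

#define MAXBUF 100000

void build_working_env(MPI_Comm cart)  /* 3-D Cartesian communicator */
{
    static double sendbuf[MAXBUF], recvbuf[MAXBUF];
    MPI_Status st;
    int dim[3]  = {1, 0, 2};           /* y, then x, then z          */
    int disp[3] = {-1, +1, +1};        /* -y, +x, +z                 */

    for (int step = 0; step < 3; ++step) {
        int src, dst;
        MPI_Cart_shift(cart, dim[step], disp[step], &src, &dst);

        /* pack the outgoing layer of link-cells, including data
           received at previous steps, into sendbuf (not shown)      */
        int nsend = 0 /* = pack_layer(step, sendbuf) */, nrecv;

        /* exchange counts, then the particle data itself            */
        MPI_Sendrecv(&nsend, 1, MPI_INT, dst, 0,
                     &nrecv, 1, MPI_INT, src, 0, cart, &st);
        MPI_Sendrecv(sendbuf, nsend, MPI_DOUBLE, dst, 1,
                     recvbuf, nrecv, MPI_DOUBLE, src, 1, cart, &st);

        /* unpack recvbuf into the new layer of link-cells (not shown) */
    }
}

The forwarding at the second and third steps of data received earlier is what gives each PE access to edge and corner regions with only three exchanges.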

4. Domain decomposition for molecular systems
Given the overviews of MD applied to molecular systems and of domain decomposition, it is possible to expand upon some of the difficulties mentioned in the introduction regarding the application of the domain decomposition technique to molecular systems of arbitrary connectivity. In this section, we concentrate on one solution to these difficulties.


The main problem encountered in applying domain decomposition to molecular systems is the unique identification of the n-plet groups required for multi-particle potentials and constraints. In their work on a domain decomposition algorithm for a molecular system, Esselink and Hilbers [24] show that, as long as the distance between any two atoms in the n-plet multi-particle potential is no greater than that of the non-bonded pair potential, it is certain that at least one processor will have all the n atoms in its working environment. As a result, no extra communication costs are incurred beyond those of the search for all interacting pairs. In the past, we too found that for a system of linear chains [25] it was possible to identify uniquely all the groups of three sites (triplets) involved in the angle bending and groups of four sites (quadruplets) involved in torsions. This was despite the fact that we used a different communication environment to that of Esselink and Hilbers [24]. Given the above guiding principle, the problem thus reduces to one of identifying the processor which will be temporarily responsible for a particular n-plet and signalling to it that this is the case. In the work on linear homo-polymers, advantage was taken of the fact that all atoms were of the same type and that atoms bonded together were identifiable by their global indices [25]. Despite this relatively simplified connectivity, the unique identification of the triplets and quadruplets proved quite a difficult problem. In the most general case, atom type and connectivity, although both fully specified, are completely arbitrary. For this reason we have abandoned our previous approach in favour of a more general one applicable to n-plets of any size. The details of the solution will be set out later but, in summary, it involves the following new features:

(i) In addition to the coordinates and velocities, each atom carries around with it details of its atom type, global atomic index, global molecular index and information concerning the neighbours with which it has ‘bonded’ interactions.

(ii) All bends, torsions and groups requiring special constraints, e.g. those for CH2 groups as in [38], are assigned to a particular atom and are also carried by this atom whichever processor it moves to.

(iii) At the first step, and at other steps where the neighbour lists etc. have to be rebuilt, a table is required indicating the local index and processor identity of all atoms known to each PE, i.e. both those a processor is directly responsible for and all those it receives during the communication stages prior to the search for all interacting pairs.

(iv) Using the tables formed in (iii), any two atoms found during the pair search to be involved in a ‘bonded’ interaction make further lists which contain each other's indices and processor identities in all the PEs they are known in. This information is communicated back to each atom’s home PE at the end of the pair search.

By doing this, each atom knows, for each of its ‘bonded’ neighbours, all the PEs in which they both appear and their local indices therein. It is then a simple matter of finding for each n-plet the processor which is aware of all n atoms and signalling to it that it is responsible for that n-plet.


This procedure allows each PE to build lists of all the n-plets it will handle for this and following steps until the lists are deemed redundant. Although this increases some of the communication costs, much of the information is invariant for the period that each processor works with the same set of atoms, i.e. between neighbour table rebuilds, and so the effect of the extra overhead is ameliorated. In the following sections, we describe in detail the changes required to the basic DD scheme to apply it to systems of arbitrary connectivity. In what follows we assume that the system contains all the features outlined in section 2. Extension to other more complex multi-particle potentials or constraints should follow from the details we give.

4.1 Preliminary operations
As stated already, to generalize the DD method for systems of arbitrary molecular complexity, a number of changes are required compared to the method published previously [25]. Firstly, we make no assumptions regarding an atom’s global index, ig, and its connectivity. The connectivity is initially read from the configuration file and simply takes the standard form of a list of the Nbonds(ig) global indices with which ig forms bonded pairs, i.e. ig is bonded to atoms having the global indices Ibond(ig,1), Ibond(ig,2), ..., Ibond(ig,Nbonds(ig)). In addition, each atom is assigned a certain type, Itype(ig), which defines its mass and, to a large extent, the way it interacts with other atoms in the system; apart from the Coulombic energy, determined by each atom’s individually assignable partial charge, the potential energy and the rigid constraints are defined in terms of atom types. We have implemented both the normal rigid bond constraints and the special CH2 and CH3 group constraints of [38], which remove all the degrees of freedom associated with hydrogens except for the end rotation of terminal methyl groups. In a first global operation, all possible CH3 groups are identified and the extra rigid bonds are recorded by adding extra entries, Ibond(ig,Nbonds(ig)+1, ..., Nbondx(ig)), to the respective connectivity tables. In a second global operation, the input connectivity table, Ibond(ig,1, ..., Nbonds(ig)), is scanned and all valid triplets, i.e. ig bonded to jg bonded to kg with a valid entry in the table of angle-bending potentials Iangtype(Itype(ig),Itype(jg),Itype(kg)), are identified and associated with ig. This is done by building a list of the pointers to the global indices in the connectivity table of ig, e.g. if Ibond(ig,jp) = jg and Ibond(ig,kp) = kg then jp and kp are stored in ig’s list of triplets. At the same time, the lists of ‘bonded’ atoms of atoms ig and kg are extended by adding kg and ig, respectively, to each other's list, provided they are not there already. These extra global indices are stored in the entries Ibond(ig,Nbondx(ig)+1, ..., Nbends(ig)) and become important later for defining those pair interactions which are non-bonded. In a completely analogous fashion, a scan is then made to identify all quadruplets {ig,jg,kg,lg} for the torsional potentials and a list is made for each atom. The ‘bonded’ lists of end atoms in a torsion are also extended as for the bends, and with the same proviso, such that at the end of the search each atom has a list of all the other Nnbneig(ig) atoms it is involved with in either bonds, angle bends or torsions.


Ordering the list so that the entries Ibond(ig,Nbends(ig)+1, ..., Nnbneig(ig)) are due to ‘1...4’ interactions in torsions means that they can be distinguished easily in case they are deemed to interact via the non-bonded potential; the program has the option of switching the 1...4 non-bonded interactions on and off. In a final search, all CH2 groups suitable for the special perpendicular and bisector constraints of Hammonds and Ryckaert [38] are identified. This involves forming a list, for each suitable carbon atom, of the pointers to the global indices in its connectivity table of the four other atoms involved, i.e. the two hydrogens and the two other atoms bonded to the central carbon. At this point, each atom now has associated with it a set of information which defines all its ‘bonded’ neighbours, all the triplets it is responsible for, all the quadruplets it is responsible for and the atoms with which it forms a CH2 group subject to the special constraints. Along with its atom type, global index, global molecular index Im(ig), coordinates, velocities and partial charge, this data is communicated from the host processor to all the processor elements (PEs) in the parallel array; details of how space is decomposed in all three dimensions and how particles are allocated to PEs have been given previously [25]. Each PE only receives data for particles initially assigned to its domain and stores this in local arrays indexed from 1 to NPE, the number of atoms it is currently responsible for. Thus in the PEs, Ig(i) is a mapping from the local index to the global index of a particle and Ibond(i,1, ..., Nnbneig(i)) contains the global indices of the 'bonded' neighbours of i.

4.2 Procedure at a list building step prior to pair search
After these preliminary operations have been completed, the molecular dynamics proceeds by a calculation of the forces acting at the current time step. As before [25], the first time step is one at which the neighbour lists of non-bonded pair interactions are constructed. At the same time, we need to create lists of bonds, bends, torsions and CH2 groups. All these lists are then utilized in subsequent steps until deemed redundant.

4.2.1 Reciprocal space part of Ewald summation
As we use the Ewald method [43] to evaluate the Coulombic forces, there are summations both in reciprocal space and in real space; the exact form, as applicable to molecular systems, has been given before [70] and we do not reproduce it here. The summation in reciprocal space can be factorized into sums only dependent on particle positions. Thus its calculation within the domain decomposition framework follows very straightforwardly [71]; for each reciprocal space vector included in the summation, each processor first performs the necessary sum over the particles it is responsible for and then standard calls to global summation routines calculate the overall total and return the result to each PE so as to allow the calculation of the forces and the contributions to the pressure tensor [70]. In real space, there remains a pair summation over non-bonded atoms and a pair summation over ‘bonded’ atoms; the latter corrects for the fact that the reciprocal space sum is indiscriminate and is tantamount to interacting all possible pairs, whereas in many cases explicit Coulomb interactions are defined in the potential model to be restricted to non-bonded pairs and to be implicitly contained within the ‘bonded’ terms [44, 70]. These real space sums are carried out at the same time as the non-bonded van der Waals interactions (see later).
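A sketch of the parallel part of this loop is given below; the array layout and names are ours, and the prefactors, forces and pressure contributions are omitted:

/* Sketch: each PE sums the cos/sin components of the structure
   factor over its own atoms; a global sum then forms the complete
   S(k) on every PE. */
#include <math.h>
#include <mpi.h>

void recip_sum(int nk, const double kvec[][3],      /* k-vectors      */
               int npe, const double r[][3],        /* local atoms    */
               const double q[], double ssum[][2])  /* Re, Im of S(k) */
{
    for (int k = 0; k < nk; ++k) {
        double local[2] = {0.0, 0.0};
        for (int i = 0; i < npe; ++i) {
            double kr = kvec[k][0]*r[i][0] + kvec[k][1]*r[i][1]
                      + kvec[k][2]*r[i][2];
            local[0] += q[i]*cos(kr);               /* Re part */
            local[1] += q[i]*sin(kr);               /* Im part */
        }
        /* global sum over all PEs; the result is returned to every
           PE so that each can compute the forces on its own atoms  */
        MPI_Allreduce(local, ssum[k], 2, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
    }
}

In practice the components for all k-vectors would be accumulated and summed in a single global operation, since the latency of many small global sums would otherwise dominate.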


4.2.2 Communications required for pair search
The sequence of three communication steps required before the pair search can begin has been described in detail before [19, 25]. This remains the same here, except that extra information must accompany the coordinates of an atom in order to perform the non-bonded interactions and to facilitate the formation of the various lists. So, along with the coordinates and partial charges, the variables Ig(i), Im(i), Itype(i), Nbonds(i), Nbondx(i), Nbends(i), Nnbneig(i) and Ibond(i,1, ..., Nnbneig(i)) must be communicated. As all but the coordinates are invariant during the life span of the lists, everything else needs only to be communicated at the step when the lists are built.

4.2.3 Operations required by general n-plet search prior to pair search
With the above information the pair search could be accomplished as shown previously [19, 25]. However, in order to form the lists of triplets (bending potentials), quadruplets (torsional potentials) and quintuplets (CH2 special constraints) in a completely general way, we have found it necessary to generate tables for each atom of the identity of each PE in which it exists (a maximum of 8 currently) and its local index in these PEs. These tables are accumulated during the first phase of three forward communication steps and then communicated back to the home PE via three reverse steps. At the end of this procedure, only the home PE of each atom can guarantee that it knows its identity in all other PEs. This information is, however, required by all the PEs having a copy of the atoms, so a further phase of forward communications is required in order to ensure that all PEs know the identities of all atoms they are aware of.

4.3 Pair search at a list building step
After this last stage of communications, the pair search can be carried out. The pair search is performed using the modified link-cell search pattern adopted previously [19, 25], which ensures that pairs within the given truncation distance are found uniquely, thus removing any redundant calculations. For each pair of atoms {i,j} found, it has to be decided whether they form a ‘bonded’ or non-bonded pair. All atoms with differing global molecular indices, Im(i) ≠ Im(j), are deemed to be non-bonded, and the calculation proceeds to their minimum image separation and hence to the non-bonded interaction energy and forces; non-bonded interactions implying here the van der Waals term and also the real space part of the Ewald summation if both atoms carry a non-zero charge. If i and j belong to the same molecule, Im(i) = Im(j), then a check is made to see if i is a bonded neighbour of j, i.e. if there exists an Ibond(j,kj) = Ig(i) for 1 ≤ kj ≤ Nnbneig(j). If i and j are non-bonded then the calculation of separations, energies and forces proceeds as in the case of atoms belonging to different molecules. If i and j are in each other's ‘bonded’ lists then a number of further operations are undertaken so as to facilitate the formation of lists for the intramolecular potentials and the special CH2 constraints.

4.3.1 Operations required by general n-plet search during pair search
The first of these is to find the corresponding position of j in i’s list of ‘bonded’ neighbours, i.e. Ibond(i,ki) = Ig(j). Using the tables already created of atom-indices/PE-identity, further lists are then produced, for the kj-th ‘bonded’ neighbour of j, of all the locations of i in all PEs. Similarly, all the atom-indices/PE-identity of j are copied into a table indexed by i and ki.
The reason for doing this will become apparent when we come to building lists of triplets etc.
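The classification of a pair {i, j} found by the search can be sketched as follows (C, zero-based indices; Im, Ig, Ibond and Nnbneig follow the notation above, but the data layout shown is illustrative):

/* Sketch: decide whether a pair {i,j} found by the link-cell search
   is non-bonded or 'bonded'.  If 'bonded', kj returns the position
   of i in j's list, used in sections 4.3.2 and 4.3.3. */
enum PairKind { NON_BONDED, BONDED };

enum PairKind classify_pair(int i, int j,
                            const int Im[], const long Ig[],
                            const long *Ibond[], const int Nnbneig[],
                            int *kj)
{
    if (Im[i] != Im[j])              /* different molecules: always   */
        return NON_BONDED;           /* a non-bonded pair             */

    for (int k = 0; k < Nnbneig[j]; ++k)   /* same molecule: scan j's */
        if (Ibond[j][k] == Ig[i]) {        /* 'bonded' neighbour list */
            *kj = k;
            return BONDED;
        }
    return NON_BONDED;               /* same molecule, not 'bonded'   */
}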


4.3.2 Formation of bonded pair lists
At this stage, we can determine immediately whether i and j are truly bonded together, as in that case ki ≤ Nbondx(i) (or equivalently kj ≤ Nbondx(j)). Atoms i and j are then added to the appropriate bonded lists; we allow for both flexible bonds and rigidly constrained bonds and so maintain two lists. As the pair search finds pairs uniquely, it is guaranteed that all bonded pairs will appear only once on one of the lists.

4.3.3 Screening for non-bonded 1...4 interactions
The final check to make is to determine whether i and j will interact via the non-bonded potential; this is only possible if they are the end atoms in a torsion and 1...4 non-bonded interactions are switched on. If the non-bonded 1...4 interactions are switched on and Nbends(i) < ki ≤ Nnbneig(i) (or equivalently Nbends(j) < kj ≤ Nnbneig(j)), then the non-bonded interaction is carried out for i and j.

4.3.4 Correcting for the reciprocal space part of the Ewald summation
Finally, it should be noted that, as stated above, in all cases where i and j are deemed to be ‘bonded’ and both carry a non-zero charge, a calculation is made to correct the energy, forces and pressure for the reciprocal space part of the Ewald summation.

4.3.5 Formation of the neighbour list
In the process of carrying out the above pair search, a neighbour list is generated in each PE so as to reduce the amount of computation required at subsequent steps. This follows the same pattern as described previously [19, 25] in that, in the case where i and j interact through the non-bonded potential, atom j is added to atom i’s neighbour list should their pair separation be less than the potential truncation plus shell width distance. In addition to this, if i and j are ‘bonded’ and it is necessary to calculate at each step the Ewald correction term, i.e. the charges on i and j are both non-zero, then −j is added to i’s list, with the minus sign indicating that it is a ‘bonded’ interaction. It is important to note that no distance criterion is placed on this latter type of interaction, as the reciprocal space part of the Ewald sum always couples all i and j irrespective of their distance apart.

4.4 Procedures peculiar to a list building step after the pair search
Following the pair search, the lists required for the flexible bond pair interactions are formed, as well as the lists of rigid bonds which will be required for the parallel constraints routine. Also formed are the neighbour lists containing both the non-bonded pairs and the ‘bonded’ pair interactions requiring the real space correction of the Ewald sum. At this stage too, the non-bonded energy and forces have been calculated. The next stage is therefore to form the lists of n-plets required.

4.4.1 Formation of n-plet lists
The first stage in the general n-plet search scheme is to perform the necessary communications (a three step reverse process) to ensure that the home PE of each atom has access to the full tables discussed in section 4.3.1. With this accomplished, each atom will have a copy of the local-index/PE-identity entry for each atom in its ‘bonded’ list to go along with its own local-index/PE-identity list formed before the pair search. As all atoms involved in the n-plets for which an atom is responsible must be in its list of ‘bonded’ neighbours, we thus have all the information required to form the n-plet lists.


The procedure is simply one of finding for each n-plet the PE which has all the atoms involved in its working environment and adding the known local indices of these particles to the correct list. Should the PE responsible for the n-plet be the home PE of the atom which ‘owns’ that n-plet, then the list in that PE is updated. If not, then it is added to one of three other lists depending on whether it has to be sent to the South, East or Above neighbouring PE at the first step of the usual three step communication process. Thus, following the processing of all the home atoms' n-plet lists, the first communication to the South is performed, with corresponding receptions from the North, and n-plets received which are labelled as belonging to this PE are stored at the end of the list of those already accumulated, whilst others labelled for further communication are added to the lists for communicating either to the East or Above. After similar communication and list appending operations in the east and upward directions, the n-plet lists in each PE are complete.

4.4.2 Formation of lists required for parallel SHAKE
With the lists of all n-plets now complete, there remains one final operation peculiar to the list building step. From the list of rigidly bonded pairs and quintuplets forming a CH2 group on a particular PE, it is possible to specify which entries involve atoms not resident in this domain. These atoms will form a subset of the total number received during the communication stage and are obviously related to the number of rigid bonds and CH2 groups which span domain boundaries. To reduce the amount of data which has to be communicated at each SHAKE iteration, lists are formed of the sites involved in either of these entities which have to be sent, and of those which will be received, at each of the three stages of communication.

4.5 Procedures common to list building or list using steps
4.5.1 Calculation of intramolecular forces
Following the building of the various n-plet lists, the procedure for the calculation of the intramolecular terms in the potential (flexible bonds, bends and torsions) is quite straightforward and is common to all steps, whether list building or list using. All that remains afterwards is to communicate back that part of the force on atoms not calculated in the home processor. As there is no redundancy in any of the force calculations, this is a simple three stage pass back and sum process as described before [25]. Each atom then knows the total force on it due to the potentials in operation and the algorithm can then continue to the integration phase.

4.5.2 Integration of the equations of motion incorporating parallel SHAKE
To integrate the equations of motion, we use the leapfrog form of the Verlet [1] algorithm in conjunction with the iterative procedure, SHAKE, for solving the rigid constraints problem [67]. The algorithm also incorporates a loose-coupling method [68] for performing simulations at constant temperature and pressure, with the pressure tensor being calculated in the manner described in [70]. Previously, we discussed a parallel implementation of SHAKE appropriate for rigid bond constraints [25] and this approach has only to be slightly adapted to accommodate the special CH2 group constraints. At each SHAKE iteration, the rigid bond constraints are imposed first, in the same way as described before [25], and all displacements to atoms occurring in other PEs are sent back to the home PEs.


The resulting positions, updated for the imposition of the rigid bond constraints, are then used as input to an iteration of the CH2 special constraints. As for the rigid bonds, this involves an outward communication of current positions followed by a return of any displacements to atoms which occur in other PEs, with the tables built in section 4.4.2 helping to reduce communication overheads. Using this approach, the algorithm converges within a reasonable number of iterations. An alternative approach is to dispense with the communication steps after the imposition of the bond constraints and before applying the special CH2 constraints, but this leads to a very slow convergence and is not a worthwhile trade-off. With the constraint forces calculated, the full pressure tensor is now known and, if required, any integration of the MD basis vector equations of motion can take place [68].

4.5.3 Particle re-allocation
Following the integration of the equations of motion, it can be determined from the particle displacements since the last rebuild of the lists whether the lists are redundant [25]. If this is the case then particles are re-allocated to PEs in exactly the same manner as described before [25], except that the particles carry with them the extra information regarding their partial charge, atom type, connectivity, bonded neighbours etc. (see section 4.1). A list-building step will then follow.

4.6 Procedure at a list-using step
At a list-using step, the procedure is very much simpler. The reciprocal space part of the Ewald summation is first conducted as before. Then a communication is made of just the coordinates required in other PEs; all other information concerning atom types, connectivity etc. remains the same as at the previous list-building step. The neighbour list is then used to calculate the non-bonded terms and the ‘bonded’ corrections to the Coulombic energy. The rest of the step then follows as described in section 4.5.

4.7 Remark
The parallel code as described is limited to a minimum configuration of 2×2×2 = 8 processors, which ensures that a domain never has an immediate neighbour which is in fact itself, and thus a processor never communicates with itself. In the case where a processor does communicate with itself, a problem is encountered in the search for the n > 2 n-plets in that a PE may have two or more copies of images of the same atom. The searches described in Section 4.4.1 would then not guarantee that the atoms in an n-plet were in fact the nearest images. To overcome this problem, a flag is set if there are fewer than 2 PEs allocated to any one of the 3 dimensions. When this flag is set, nearest image transformations are used in the calculation of separation vectors in the bending, torsion and CH2 constraint routines. For the pair separations in the non-bonded interactions and bonds this is not necessary, as the method of searching guarantees they are always nearest images.
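The nearest image transformation referred to here is the standard scaled-coordinate construction; a minimal sketch (names illustrative) is:

/* Sketch: nearest-image separation rij for a general (possibly
   non-orthorhombic) MD box defined by the matrix h; rint() folds
   the scaled separation into (-1/2, 1/2]. */
#include <math.h>

void nearest_image(const double h[3][3], const double hinv[3][3],
                   const double ri[3], const double rj[3], double rij[3])
{
    double d[3], s[3];
    for (int a = 0; a < 3; ++a)
        d[a] = ri[a] - rj[a];
    for (int a = 0; a < 3; ++a) {    /* scaled separation s = hinv d  */
        s[a] = hinv[a][0]*d[0] + hinv[a][1]*d[1] + hinv[a][2]*d[2];
        s[a] -= rint(s[a]);          /* shift by whole box vectors    */
    }
    for (int a = 0; a < 3; ++a)      /* back to real space: rij = h s */
        rij[a] = h[a][0]*s[0] + h[a][1]*s[1] + h[a][2]*s[2];
}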

5. Benchmarks
To verify and assess the performance of the parallel domain decomposition algorithm described here, an existing general purpose scalar MD program was converted to a parallel form. The development work was carried out on the Charles Hermite Centre’s Silicon Graphics Power Challenge machines in Nancy. The SG-supplied MPI library of communication routines was used for all message passing. Once implemented and debugged, the parallel code was verified against results from the scalar program for a

series of short benchmark simulations carried out on systems of differing complexity. All benchmarks were conducted under standard constant energy conditions and used periodic boundaries in all three dimensions. Details of the individual test simulations follow but some general comments apply to all of them. First, results obtained from the parallel code in all cases corresponded as well as could be expected with results from the scalar code. MD is an intrinsically “divergent” method in that any small change in the order of calculation of, say, the forces, e.g. as caused by compiling using different optimization levels, leads to different round-off errors and ultimately to trajectories which diverge, despite the same initial and simulation conditions. (Indeed MD is a very powerful probe for testing the hardware consistency of nominally identical CPUs in multi-processor machines; the same code forced to run on a specific CPU should produce exactly the same output as on another.) Depending on the number of significant digits compared, this divergence may or may not be noticeable immediately. Clearly the parallel code described here will naturally change the order of the force calculation and the order of the SHAKE iterations and thus lead to similar divergence, hence great care has to be taken in deciding whether this is at an acceptable level or the result of some subtle programming error. Of course, all obvious errors, such as a change in the global number of bonded interactions etc., are continually screened for and, after debugging, have not occurred so far. At the first step, the number of non-bonded interactions has to be the same, since the coordinates used are those read in, but thereafter a change in the least significant bit of a coordinate can be enough to alter the distance between two particles such that, for example, they no longer fall within one another's cut-off sphere and hence no longer interact. In general then, we expect a level of divergence between the parallel and scalar codes of the same order as that found for the same program compiled at different levels of optimization. Within this criterion, we are satisfied that the results obtained with the parallel code for all the benchmarks presented are consistent with the equivalent scalar ones. Second, performance timings for the parallel code are difficult to measure reliably on multi-user machines which are continuously under varying, often heavy, load. As we are primarily concerned here with presenting a method to account for arbitrary connectivity, an exhaustive effort has not been made to obtain precise timings. The benchmarks serve rather as a rough guide to any obvious impact the algorithm might have on performance. Third, the parallel code is currently limited in that the domains cannot be reduced in size below that specified by the truncation radius of the non-bonded potentials. To do so would imply communications, either direct or indirect, with next-nearest neighbours. Although this is feasible [23, 24], it introduces extra complications and we have not implemented such a scheme as yet. This is an important consideration as, with this limitation, we are restricted to a maximum number of processors that can be used for a given system. With the above points in mind, we now give details of the various benchmark systems used to verify the new parallel algorithm.

5.1 Linear chains
The DD program was first tested on a relaxed polymer melt of 640 linear chains with each chain containing 100 sites, i.e. 64000 sites in total. The density, ρ, used was 700 kg m⁻³ and the temperature, T,


was close to 500 K. This is exactly the same benchmark system as used previously [25] and thus one for which there already exist timing results with which comparisons can be made. In this model, the ‘united-atom’ CH2 sites in the chains are linked by rigidly constrained bonds and intramolecular flexibility is restricted by bond angle bending and torsion potentials. A purely repulsive short-ranged non-bonded potential operates between sites separated by at least three others along the chains and between all atoms on different chains; for further details see the original paper [25]. As a benchmark test, the equilibrated system was simulated for 1000 time steps, each of length 2.5 fs, using a relative tolerance in the constraints of 10⁻⁴. The results of both the original tests and the new ones are shown in Table 1. In Table 1, the program multiq is a program for simulating linear chains optimized for the Fujitsu VPX 240 vector supercomputer, ddmultiq is the parallel version of multiq which uses the parallel approach described in [25] and ddgmq is the parallel MD program developed using the techniques described here. The hardware specification of the Fujitsu AP1000 used in the results shown in Table 1 has been given in detail before [19]. Essentially, it is a distributed memory single-user machine consisting of an array of up to 1024 SPARC (25 MHz) processors. In contrast, the SG Power Challenge is a multi-user shared memory machine containing, in the configuration used here, MIPS R8000 (90 MHz) processors. Comparisons of the timings obtained on the two parallel platforms largely reflect the difference between the raw scalar speed of the MIPS and older SPARC chips. Both can be seen to deliver better than vector supercomputer performance. The comparison of the results for different numbers of processors for the parallel program ddgmq seems to indicate that the new algorithm performs at better than 100% efficiency: an apparent factor of 10 speed-up for 8 processors over the 1 processor timing. We attribute this to the aforementioned uncertainty in the timings. Also shown in Table 1 is the timing for a simulation in which the rigid bonds are replaced by flexible ones, kb = 200 kg s⁻² in Eq. 4. This gives an idea of the time spent in the SHAKE routine. Of course, this test does not take into account the loss of stability introduced by flexible bonds. Our tests suggest the time step would have to be reduced by a factor of at least two to return a similar level of energy conservation, so flexible bonds do not introduce any real savings even in the parallel case.

5.2 PVC
The second benchmark was carried out on a system of poly(vinyl chloride) (PVC) chains, CH3-(CHCl-CH2)m-CHCl-CH3, containing 297 molecules with m = 14 (26433 atoms in all) in a cubic box. Full details of the simulations and the model of PVC used have appeared before [72]. The density was fixed at 962.5 kg m⁻³ and the temperature was ~450 K. In the model of the chains, the terminal CH3 groups were represented by a single united-atom interaction site but all other atoms were modelled explicitly. All bonds were maintained rigid and all degrees of freedom associated with hydrogens were removed using the special CH2 constraints. Bond angle and torsion potentials were used to restrain intramolecular flexibility. The Lennard-Jones 12-6 potential was used for all ‘non-bonded’ interactions, which were applied to all pairs of atoms separated by two others along a chain and to all intermolecular interactions.
In addition, all atoms carried a partial charge and Coulombic interactions were handled using the Ewald sum.
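To make the Ewald machinery concrete, the following minimal Python sketch (purely illustrative, not code from ddgmq) evaluates the screened real-space pair term and counts the reciprocal-space vectors implied by a Kmax-style cutoff, using the benchmark values quoted below (α = 0.15 Å-1, Kmax = 6); the Coulomb conversion factor and the sample charges and distance are assumptions made for the example.

```python
import itertools
import math

ALPHA = 0.15        # Ewald separation parameter / A^-1 (PVC benchmark value)
KMAX = 6            # reciprocal-space cutoff (PVC benchmark value)
COULOMB = 1389.35   # e^2/(4 pi eps0) in kJ mol^-1 A (assumed conversion factor)

def real_space_pair(qi, qj, r):
    """Screened real-space term q_i q_j erfc(alpha*r)/r; it decays smoothly
    with r and so can safely be truncated at r_c."""
    return COULOMB * qi * qj * math.erfc(ALPHA * r) / r

# For a cubic box of side L = V**(1/3), k = (2*pi/L)*n with integer triples n,
# so the criterion |k|*V**(1/3)/(2*pi) <= Kmax reduces simply to |n| <= Kmax.
n_vectors = [n for n in itertools.product(range(-KMAX, KMAX + 1), repeat=3)
             if n != (0, 0, 0)
             and math.sqrt(n[0]**2 + n[1]**2 + n[2]**2) <= KMAX]

print(real_space_pair(0.5, -0.5, 3.0))  # sample pair term, hypothetical charges
print(len(n_vectors))                   # k-vectors before using k/-k symmetry
```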


For the benchmark tests reported here, the LJ potentials were truncated at 13 Å. For the Ewald sum, the separation parameter, α, was set to 0.15 Å-1, the real space part of the Ewald sum was truncated at rc = 16.5 Å and in reciprocal space the cutoff was set to Kmax = 6, i.e. |k|V1/3/2π ≤ 6. The time step was set to 1 fs and the relative tolerance for the constraints was 10-6. Results for short simulations of 100 time steps are given in Table 2. This benchmark introduces a number of extra complications over the relatively straightforward example given in section 5.1: there is more than one atom type, partial charges exist and so the Coulomb potential is in operation, 'non-bonded' 1...4 interactions are included and, in addition, we employ the special CH2 constraints. Despite this, the parallel program performs well, with near linear speed-up.

5.3 Water

In the third benchmark an equilibrated system of 7101 H2O molecules in a cubic box of side 60 Å (ρ = 982.68 kg m-3) was used at T~298 K. In the three-site model of water used, the O-H bond lengths were fixed at 0.96 Å and the H-O-H bond angle was allowed to change subject to the form of potential given in Eq. 5 with θo = 104.5° and kθ = 418.4 kJ mol-1. Only the oxygens interact via the LJ 12-6 potential (σ = 3.16552 Å and ε/kB = 78.20813 K), truncated at 10 Å. All atoms carry partial charges (qH = 0.41e, qO = -0.82e) which interact through the Coulomb potential, for which the Ewald sum is used with α = 0.24 Å-1, rc = 13.5 Å and Kmax = 10. Timings for simulations of 100 steps, each of 1 fs duration, are given in Table 3. Although this benchmark introduces no extra complications, and the timings again show good speed-up, this kind of simple three-site model of water is typical of that used in simulations of biological systems. In the next benchmark we introduce a small peptide into the middle of this system.

5.4 Small peptide in water

Echistatin is a small peptide containing 49 amino acids (713 atoms) which is found in the venom of the saw-scaled viper, Echis carinatus. It is a member of the disintegrin family of natural peptides which inhibit blood coagulation by blocking platelet-fibrinogen binding. Many members of this family have been studied since antiplatelet therapy became a useful means of preventing acute thrombo-embolic artery occlusions in cardiovascular disease. Echistatin is known to be a particularly potent inhibitor of platelet aggregation, interacting with platelets through its binding sequence Arg24-Gly25-Asp26. In addition, the high affinity interaction of echistatin with platelets may also depend upon its secondary structure [73-75]. It is, therefore, interesting to study the structure of echistatin in solution using molecular dynamics.

Eight 1H-NMR refined structures of echistatin in solution were proposed by Atkinson et al. [76-78] and are available on the Brookhaven National Laboratory Protein Data Bank WWW server under the reference 2ech. Each of the eight proposed structures was first energy minimized in vacuum using the CVFF force-field to describe the interactions; these computations were carried out using the Discover 93.0/3.0.0® program from Biosym/MSI of San Diego. The lowest energy conformation (model 1 in the PDB file) was placed at the centre of a cubic (periodic) cell of side 60 Å filled with water molecules. Water molecules overlapping with the echistatin molecule were removed before the entire system was energy minimized. A water molecule far from the peptide was then replaced by a chloride ion to neutralize the charge of the cell. The resulting system, containing 6903 water molecules, an echistatin molecule and one chloride ion, served as the starting point for the molecular dynamics simulations.

MD simulations were carried out both using our own code and, for comparison, the Discover program. Discover exists in a parallelized version in which a simple loop-splitting (shared-memory) approach has been used. Due to several factors this comparison is not exact. First, although bond lengths were maintained fixed in both programs, in our case we constrain them to the equilibrium values given by the harmonic bond stretching potential in CVFF, whereas Discover fixes them at their input values. Furthermore, constraints are satisfied to a relative tolerance, 10-5 in this case, in our program, whereas in Discover they are satisfied to an absolute tolerance of 10-5 Å. As many of the bond lengths in this simulation are close to 1 Å, these should be reasonably equivalent. Second, valence angle bending was described by the potential given in Eq. 5 in our program, whereas in CVFF the angle bending is harmonic in the angle itself. We obtain the corresponding force constant by equating the curvatures at the minimum of the potential (a sketch of this conversion, under an assumed form for Eq. 5, is given at the end of this section). Third, there is only one user-adjustable parameter in Discover controlling the accuracy of the Ewald summation, compared to the normal three (rc, Kmax and α). We first adjusted these latter three parameters so as to obtain agreement to better than 10 bar between the indirect and direct calculations of the reciprocal space contribution to the pressure [70]; the values used were as for the previous pure water benchmark. We then adjusted the single parameter in Discover to give roughly the same level of convergence in the Coulombic energy. As simulation times depend critically on this parameter, we also carried out further benchmark tests in which there was no reciprocal space calculation. For Discover, this involved a direct summation of the 1/r term in real space, with a switching function employed to force it to zero at the 13.5 Å cutoff. As described in section 2, such a procedure is highly suspect in general; we use it here simply to furnish a meaningful comparison. In our program we simply set Kmax = 0, which leaves the smoothly decaying real space part of the Ewald summation potential.

The results of these benchmarks are given in Table 4. First, the comparison with the results for the previous pure water system is reasonable, showing a slight increase in the time per step consistent with the increased complexity and the increase in the total number of atoms. In addition, the near linear speed-up again obtained suggests that the addition of the protein, with its intricate connectivity, does not unduly affect the parallel program. The comparison between ddgmq and the Discover program for the calculation using the full Ewald sum is clearly not good in absolute terms, almost certainly due to the lack of control over the Ewald sum parameters in Discover. The second set of results, in which only a real space calculation is made, gives a more reasonable comparison of performance.
The interesting point, in both the with- and without-Ewald-sum calculations, is that whereas the parallel version of Discover already shows significantly less than 100% efficiency on 8 processors, our domain decomposition code maintains a good relative speed-up over the one processor result.
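The curvature-matching used in the comparison above can be made concrete, but only under an assumption, since Eq. 5 is not reproduced in this section. A minimal sketch, assuming Eq. 5 to be harmonic in the cosine of the angle (a common choice; the form below is hypothetical):

```python
import math

# Assumed form of Eq. 5: U(t) = 0.5 * k_cos * (cos t - cos t0)**2.
# Then d2U/dt2 evaluated at t = t0 is k_cos * sin(t0)**2, so equating
# curvatures with a CVFF-style in-angle harmonic U = 0.5*k_harm*(t - t0)**2
# gives k_harm = k_cos * sin(t0)**2 (conversely, k_cos = k_harm / sin(t0)**2).
theta0 = math.radians(104.5)   # equilibrium H-O-H angle from section 5.3
k_cos = 418.4                  # kJ mol^-1, force constant from section 5.3
k_harm = k_cos * math.sin(theta0) ** 2
print(k_harm)                  # equivalent in-angle constant, kJ mol^-1 rad^-2
```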

Conclusion

A domain decomposition program capable of handling systems of arbitrary molecular connectivity has been successfully implemented on a Silicon Graphics Power Challenge machine. The program has been thoroughly tested against results obtained using the corresponding scalar code, and the timing comparisons show the effectiveness of the strategy employed. For systems which are too large, or whose relaxation times are too long, to be simulated realistically on a single processor, the domain decomposition approach introduced here should prove useful to the molecular modelling community.

Acknowledgments

DB would like to thank the CNRS for the award of a “Position Rouge”. We acknowledge the Charles Hermite Centre in Nancy for the provision of computer time on the SG Power Challenge machines and the kind help of their technical staff. Sylvie Neyertz is thanked for a critical reading of the original manuscript.

References

[1] M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids (Clarendon Press, Oxford, 1987).
[2] J. M. Haile, Molecular Dynamics Simulation: Elementary Methods (John Wiley & Sons Ltd., New York, 1992).
[3] D. Frenkel and B. Smit, Understanding Molecular Simulation: From Algorithms to Applications (Academic Press Inc., San Diego, 1996).
[4] D. Fincham, Molecular Simulation 1 (1987) 1-45.
[5] D. C. Rapaport, Phys. Rev. A 36 (1987) 3288-3299.
[6] F. Bruge and S. L. Fornili, Comput. Phys. Commun. 60 (1990) 31-38.
[7] S. Y. Liem, D. Brown and J. H. R. Clarke, Comput. Phys. Commun. 67 (1991) 261-267.
[8] S. Y. Liem, D. Brown and J. H. R. Clarke, Phys. Rev. A 45 (1992) 3706-3713.
[9] H. G. Petersen and J. W. Perram, Molecular Physics 67 (1989) 849-860.
[10] A. R. C. Raine, D. Fincham and W. Smith, Comput. Phys. Commun. 55 (1989) 13.
[11] M. R. S. Pinches, D. J. Tildesley and W. Smith, Molecular Simulation 6 (1991) 51-87.
[12] W. Smith, Comput. Phys. Commun. 62 (1991) 229-248.
[13] W. Smith, Theoretica Chimica Acta 84 (1993) 385-398.
[14] S. Plimpton and B. Hendrickson, International Journal of Modern Physics C 5 (1994) 295-298.
[15] S. Plimpton and B. Hendrickson, J. Comp. Chem. 17 (1996) 326-337.
[16] F. Bruge, Computer Physics Communications 90 (1995) 59-65.
[17] P. Tamayo, J. P. Mesirov and B. M. Boghosian, in: Proceedings of Supercomputing '91, Albuquerque, New Mexico (1991), pp. 462-470.
[18] D. C. Rapaport, Computer Physics Communications 76 (1993) 301-317.
[19] D. Brown, J. H. R. Clarke, M. Okuda and T. Yamazaki, Computer Physics Communications 74 (1993) 67-80.
[20] R. K. Kalia, A. Nakano, D. L. Greenwell, P. Vashishta and S. W. Deleeuw, Supercomputer 10 (1993) 11-25.
[21] F. Hedman and A. Laaksonen, International Journal of Modern Physics C 4 (1993) 41-48.
[22] D. M. Beazley and P. S. Lomdahl, Parallel Computing 20 (1994) 173-195.
[23] S. Plimpton, Journal of Computational Physics 117 (1995) 1-19.
[24] K. Esselink and P. A. J. Hilbers, Journal of Computational Physics 106 (1993) 108-114.
[25] D. Brown, J. H. R. Clarke, M. Okuda and T. Yamazaki, Computer Physics Communications 83 (1994) 1-13; ibid. 86 (1995) 312.
[26] S. L. Lin, B. M. Pettitt and G. N. Phillips, FASEB Journal 6 (1992) A464.
[27] H. Sato, Y. Tanaka and T. Yao, Fujitsu Scientific & Technical Journal 28 (1992) 98-106.
[28] J. F. Janak and P. C. Pattnaik, Journal of Computational Chemistry 13 (1992) 1098-1102.
[29] S. E. Debolt and P. A. Kollman, Journal of Computational Chemistry 14 (1993) 312-329.
[30] T. G. Mattson and G. Ravishanker, ACS Symposium Series 592 (1995) 133-150.
[31] Y. S. Hwang, R. Das and J. H. Saltz, IEEE Computational Science & Engineering 2 (1995) 18-29.
[32] H. J. C. Berendsen, D. van der Spoel and R. van Drunen, Computer Physics Communications 91 (1995) 43-56.
[33] S. Chynoweth, U. Klomp and L. Scales, Comput. Phys. Commun. 62 (1991) 297-306.
[34] T. W. Clark, R. van Hanxleden, J. A. McCammon and L. R. Scott, in: Proc. Scalable High Performance Computing Conference-94, Knoxville, TN (1994), pp. 95.
[35] S. Plimpton, R. Pollock and M. Stevens, to be published in: Proceedings of the SIAM Conference on Parallel Processing for Scientific Computing (1997), preprint.
[36] D. Brown, in: Proceedings of Faraday Symposium 27, J. Chem. Soc. Farad. Trans. (1992), pp. 1783.
[37] M. E. Tuckerman, G. J. Martyna and B. J. Berne, J. Chem. Phys. 93 (1990) 1287-1291.
[38] K. D. Hammonds and J.-P. Ryckaert, Comput. Phys. Commun. 62 (1991) 336-351.
[39] H. Schreiber and O. Steinhauser, Chemical Physics 168 (1992) 75-89.
[40] D. M. York, T. A. Darden and L. G. Pedersen, Journal of Chemical Physics 99 (1993) 8345-8348.
[41] K. F. Lau, H. E. Alper, T. S. Thacher and T. R. Stouch, Journal of Physical Chemistry 98 (1994) 8785-8792.
[42] G. S. Delbuono, T. S. Cohen and P. J. Rossky, Journal of Molecular Liquids 60 (1994) 221-236.
[43] P. P. Ewald, Ann. Phys. 64 (1921) 253.
[44] W. Smith, Computer Physics Communications 67 (1992) 392-406.
[45] J. Shimada, H. Kaneko and T. Takada, Journal of Computational Chemistry 15 (1994) 28-43.
[46] F. Zhou and K. Schulten, Journal of Physical Chemistry 99 (1995) 2194-2207.
[47] A. Nakano, R. K. Kalia and P. Vashishta, Computer Physics Communications 83 (1994) 197-214.
[48] J. A. Board, J. W. Causey, J. F. Leathrum, A. Windemuth and K. Schulten, Chemical Physics Letters 198 (1992) 89-94.
[49] P. Procacci, T. Darden and M. Marchi, Journal of Physical Chemistry 100 (1996) 10464-10468.
[50] E. L. Pollock and J. Glosli, Comput. Phys. Commun. 95 (1996) 93-110.
[51] H. G. Petersen, Journal of Chemical Physics 103 (1995) 3668-3679.
[52] B. A. Luty, I. G. Tironi and W. F. van Gunsteren, Journal of Chemical Physics 103 (1995) 3014-3021.
[53] T. Darden, D. York and L. Pedersen, J. Chem. Phys. 98 (1993) 10089-10092.
[54] A. Windemuth, ACS Symposium Series 592 (1995) 151-169.
[55] R. H. Zhou and B. J. Berne, Journal of Chemical Physics 103 (1995) 9444-9459.
[56] K. Esselink, Computer Physics Communications 87 (1995) 375-395.
[57] R. Kutteh and J. B. Nicholas, Computer Physics Communications 86 (1995) 236-254.
[58] R. Kutteh and J. B. Nicholas, Computer Physics Communications 86 (1995) 227-235.
[59] C. A. White and M. Head-Gordon, Journal of Chemical Physics 101 (1994) 6593-6605.
[60] P. E. Smith and B. M. Pettitt, Computer Physics Communications 91 (1995) 339-344.
[61] D. York and W. T. Yang, Journal of Chemical Physics 101 (1994) 3298-3300.
[62] B. A. Luty, M. E. Davis, I. G. Tironi and W. F. van Gunsteren, Molecular Simulation 14 (1994) 11-20.
[63] U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen, Journal of Chemical Physics 103 (1995) 8577-8593.
[64] B. A. Luty and W. F. van Gunsteren, Journal of Physical Chemistry 100 (1996) 2581-2587.
[65] F. H. Stillinger and A. Rahman, J. Chem. Phys. 68 (1978) 666-670.
[66] J.-P. Ryckaert, Molecular Physics 55 (1985) 549-556.
[67] J.-P. Ryckaert, G. Ciccotti and H. J. C. Berendsen, J. Comput. Phys. 23 (1977) 327-341.
[68] D. Brown and J. H. R. Clarke, Comput. Phys. Commun. 62 (1991) 360-369.
[69] G. Eisenhauer and K. Schwan, Journal of Parallel and Distributed Computing 35 (1996) 76-90.
[70] D. Brown and S. Neyertz, Mol. Phys. 84 (1995) 577-595.
[71] F. Hedman and A. Laaksonen, Molecular Simulation 14 (1995) 235-244.
[72] S. Neyertz, D. Brown and J. H. R. Clarke, J. Chem. Phys. 105 (1996) 2076-2088.
[73] B. Savage, U. M. Marzec, B. H. Chao, L. A. Harker, J. M. Maraganore and Z. M. Ruggeri, J. Biol. Chem. 265 (1990) 11766-11772.
[74] V. M. Garsky, P. K. Lumma, R. M. Freidinger, S. M. Pitzenberger, W. C. Randall, D. V. Veber, R. J. Gould and P. A. Friedman, Proc. Natl. Acad. Sci. USA 86 (1989) 4022-4026.
[75] Z. R. Gan, R. J. Gould, J. W. Jacobs, P. A. Friedman and M. A. Polokoff, J. Biol. Chem. 263 (1988) 19827-19832.
[76] R. A. Atkinson, V. Saudek and J. T. Pelton, Int. J. Pept. Protein Res. 43 (1994) 563-572.
[77] V. Saudek, R. A. Atkinson, P. Lepage and J. T. Pelton, Eur. J. Biochem. 202 (1991) 329-338.
[78] V. Saudek, R. A. Atkinson and J. T. Pelton, Biochemistry 30 (1991) 7369-7372.


Table 1. The results obtained for benchmark tests on a system of 640 linear molecules, each consisting of 100 united-atom CH2 sites per chain. The state point was T~500 K and ρ = 700 kg m-3. NP is the number of processors used and ts is the mean CPU time per time step for the 1000 step run; SG PC denotes the SG Power Challenge. See the main text for the significance of the program names.

Program     Machine             NP      ts / s
multiq      Fujitsu VPX 240       1     1.58
ddmultiq    Fujitsu AP1000     1000     0.187
ddgmq       SG PC                 1     2.47
ddgmq       SG PC                 2     1.14
ddgmq       SG PC                 4     0.55
ddgmq       SG PC                 8     0.25
ddgmq       SG PC                12     0.18
ddgmq       SG PC                16     0.13
ddgmq       SG PC                 8     0.22*

* rigid bonds replaced by flexible ones
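As a quick check of the efficiency figures discussed in section 5.1, the following minimal sketch (illustrative only) recomputes the speed-up S = t1/tNP and efficiency E = S/NP from the ddgmq timings in Table 1:

```python
# Speed-up and parallel efficiency from the ddgmq entries of Table 1 (SG PC).
timings = {1: 2.47, 2: 1.14, 4: 0.55, 8: 0.25, 12: 0.18, 16: 0.13}  # mean s/step

for nproc, ts in sorted(timings.items()):
    speedup = timings[1] / ts
    efficiency = speedup / nproc
    print(f"NP={nproc:2d}  S={speedup:5.2f}  E={100 * efficiency:5.1f}%")

# NP=8 gives S ~ 9.9 (2.47/0.25), i.e. an apparent efficiency above 100%,
# which the text attributes to the timing uncertainty on a multi-user machine.
```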

Table 2. The results obtained for benchmark tests carried out on the SG Power Challenge machine for a system of 297 PVC molecules with degree of polymerization 14, i.e. 26433 atoms in total. The state point was T~450 K and ρ = 962.5 kg m-3. NP is the number of processors used and ts is the mean CPU time per time step for the 100 step run.

Program     NP      ts / s
ddgmq        1      56.7
ddgmq        2      28.1
ddgmq        4      14.3
ddgmq        8       7.3


Table 3. The results obtained for benchmark tests carried out on the SG Power Challenge machine for a system of 7101 water molecules in a cubic box of side 60 Å. The state point was T~298 K and ρ = 982.68 kg m-3. NP is the number of processors used and ts is the mean CPU time per time step for the 100 step run.

Program     NP      ts / s
ddgmq        1      52.6
ddgmq        2      25.2
ddgmq        4      12.7
ddgmq        8       6.8

Table 4. The results obtained for benchmark tests carried out on the SG Power Challenge machine for a system of one echistatin molecule surrounded by 6903 water molecules in a cubic box of side 60 Å. The state point was T~298 K and ρ = 997.965 kg m-3. NP is the number of processors used and ts is the mean CPU time per time step for the 100 step run; the first timing column is for the full Ewald sum, the second for the real-space-only calculations (see footnotes).

Program      NP     ts / s (full Ewald)    ts / s (real space only)
ddgmq         1          54.8                    44.0*
ddgmq         2          26.7                    22.1*
ddgmq         4          13.3                    11.2*
ddgmq         8           7.1                     5.6*
Discover      1         537                     104†
Discover      8          71.3                    16.8†

* Kmax = 0, i.e. real space part of the Ewald sum only.
† Direct 1/r summation with switching function.
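For completeness, the kind of switching function referred to in the dagger footnote can be sketched as follows; the exact functional form and switch-on distance used by Discover are not specified in the text, so the quintic form and the 10 Å onset below are assumptions made for illustration:

```python
# Hedged sketch of a switched direct Coulomb sum: S(r) takes the 1/r pair term
# smoothly to zero at the cutoff (13.5 A here). Form and r_on are assumptions.
def switch(r, r_on=10.0, r_off=13.5):
    """C2-continuous quintic switch: 1 for r <= r_on, 0 for r >= r_off."""
    if r <= r_on:
        return 1.0
    if r >= r_off:
        return 0.0
    x = (r - r_on) / (r_off - r_on)
    return 1.0 - x**3 * (10.0 - 15.0 * x + 6.0 * x**2)

def switched_coulomb(qi, qj, r, coulomb=1389.35):
    """Direct 1/r pair term multiplied by the switching function; the
    conversion factor (kJ mol^-1 A, charges in e) is an assumed constant."""
    return coulomb * qi * qj * switch(r) / r
```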
