PARALLEL COMPUTER SIMULATION TECHNIQUES FOR THE STUDY OF MACROMOLECULES Mark R. Wilson and Jaroslav M. Ilnytskyi Department of Chemistry, University of Durham, South Road, Durham DH1 3LE United Kingdom
[email protected]
Abstract
This article will review some of the progress made recently in developing parallel simulation techniques for macromolecules. It will start with simple methods for molecular dynamics, involving replicated data techniques, and go on to show how parallel performance can be improved by careful load-balancing and reduction of message passing. Domain decomposition MD methods are then presented as a way of reducing message passing further, so that effective parallelisation can occur with even the slowest of communication links (ethernet). Finally, parallel techniques for conducting Monte Carlo are reviewed, and ways of combining parallel methods are presented. The latter looks likely to become an effective way of using massively parallel architectures for macromolecules, without the need to simulate huge system sizes.
Keywords:
polymer, parallel, molecular dynamics, Monte Carlo
Introduction

In recent years two important developments in computing have occurred. At the high-cost end of the scale, supercomputers have become parallel computers. The ultra-fast (specialist) processors and the expensive vector computers of a few years ago have largely given way to systems which combine extremely large numbers of processors with fast inter-processor communications. At the low-cost end of the scale, cheap PC processors have started to dominate the market. This has led to the growth of distributed computing, with clusters of individual PCs linked by slow (but very cheap) communications such as simple ethernet. For both types of computer system, effective parallel simulation techniques are essential if simulators are to utilise these machines for macromolecular simulations.
This article reviews some of the progress made in using parallel processor systems to study macromolecules. After an initial introduction to the key concepts required to understand parallelisation, the main part of the article focuses on molecular dynamics. It is shown that simple replicated data methods can be used to carry out molecular dynamics effectively, without the need for major changes from the approach used in scalar codes. Domain decomposition methods are then introduced as a path toward reducing inter-processor communication costs further to produce truly scalable simulation algorithms. Finally, some of the methods available for carrying out parallel Monte Carlo simulations are discussed.
1. Parallelisation: the basic concepts
Types of parallel machine There are many possible types of computer architecture, including parallel architectures [Fountain, 1994, Hwang and Briggs, 1985]. However, by far the most common one for simulators is the so-called MIMD architecture (multiple instruction stream - multiple data stream). On MIMD machines, each processing element is able to act independently (unlike some other specialist parallel machines). This provides maximum flexibility to the programmer, and unsurprisingly MIMD machines make up most of today's parallel systems. Within the MIMD class of machines, programmers have to deal with shared memory and distributed memory machines. As the name implies, on shared memory machines each processor is able (in principle) to see the whole of the physical memory. This is tremendously useful, and makes for efficient parallelisation (as will be seen below). However, the technology required to achieve this is complicated, and as a consequence shared memory over many processors carries significant financial costs and therefore belongs to the regime of today's supercomputers. In contrast, on a distributed memory machine each processor has access only to its own memory. This is easy to implement, in the sense that a parallel system can be built from an independent array of computers of similar specification, such as a PC cluster. (Actually, even this is not a requirement for a distributed memory machine today, as it is perfectly reasonable to build a distributed memory parallel machine from a set of workstations with different specifications and different operating systems.) The drawback of a distributed memory machine is that, unless the parallel application is embarrassingly parallel (such as the same application running with different starting conditions on different processors), message passing is required between processors at points in code execution where data is needed that is not stored in local memory. This dramatically slows down a distributed memory system, and for this reason applications on distributed systems often scale poorly when large numbers of processors are used.
Message Passing For distributed memory machines a mechanism is required to pass data from one processor to another. In terms of hardware this can be done by communication channels between separate processors. In the crudest of parallel machines, such as a PC cluster, the message passing hardware would simply consist of an ethernet connection between the separate PCs. In terms of software, a set of communication routines is required to carry out the communication. Fortunately for the programmer, a number of public domain software libraries now exist, which provide routines that can be linked with a simulation program to handle the communications. The most commonly used are the interfaces PVM [PVM, 2002] (parallel virtual machine) and MPI [MPI, 2003] (message passing interface), both of which work with Fortran or C code. Most MIMD systems will have an implementation of one or both of these libraries. This includes PCs, workstations, specialist distributed memory computers with high-speed communication links and even shared memory machines, where transfer of data between separate distributed memory segments is not required. Consequently, an investment in programming using (say) MPI can be rewarded with a portable code that runs on anything from a PC cluster to a supercomputer.
Typical parallel programs for distributed memory machines The two most common types of parallel program are the master-slave program and the N-identical workers approach. In the master-slave approach, one master program calls the shots and spawns a set of slave processes (on separate processors), which are responsible for doing the work. The master farms out work to each of the processors and gathers in the results at the end. This algorithm is easy to implement and has a range of applications. However, the second approach of N-identical workers (fig. 1) is better suited to molecular simulation. In this approach N identical programs are started, one on each processor. Each process is given a separate task ID (TID), and a list of TIDs for the other processes that are running. This process is known as enrolment. Each processor then executes the same code, up to the point where a parallel operation is possible. At that point each processor does separate work, and the results are combined at the end. In this stage of program execution, message passing may be needed to share data between processes, or indeed to combine the results at the end of a parallel section. Processors make use of their own TIDs to know which part of the code they are responsible for executing, and which processes they may need to pass data to or receive data from. There may be just one, or many, sections of the code where parallelisation is possible in this execution stage. Finally, each process should terminate and free the processor that it is running on. Many specialist parallel machines start from the assumption that everything can be programmed within the N-identical workers framework; so they work simply by starting N copies of the same program, one on each processor. Hence the alternative name Single Program Multiple Data (SPMD) is often used for this form of parallel program execution.
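As an illustration, the enrolment and termination stages of an N-identical workers (SPMD) program might look like the following sketch, written here with the MPI library; the MPI routine names are standard, but the surrounding skeleton is only illustrative.

      program spmd_skeleton
      use mpi
      implicit none
      integer :: ierr, idnode, nodes

      ! enrolment: every copy of the program joins the parallel job
      call MPI_INIT(ierr)
      ! idnode plays the role of the task ID (TID); it runs from 0 to nodes-1
      call MPI_COMM_RANK(MPI_COMM_WORLD, idnode, ierr)
      ! nodes is the total number of processes that were started
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nodes, ierr)

      ! ... identical code executed by every process; each process uses
      !     idnode to select its own share of any parallel section ...

      ! termination: free the processor that this process is running on
      call MPI_FINALIZE(ierr)
      end program spmd_skeleton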
Figure 1. Schematic diagram illustrating the four stages of a parallel program using the N-identical workers approach.
The global sum operation One of the most important types of message passing operation is the global sum. This often occurs in molecular simulation where a quantity is required which represents the sum of a set of independently calculated quantities. For example,

E = \sum_i E_i                                    (1)

where the quantities E_i may be (say) the energy of particle i, and where different values of E_i have been calculated on different processors. At the end of the calculation each processor will have carried out its own sum of E_i values, but to complete the answer each processor needs to know the complete sum, including the values from every other processor. This can be achieved in several ways [Wilson, 2000]. The simplest (but not necessarily best) way involves each processor sending its partial sum to one single nominated processor. That processor adds up all the answers and then broadcasts the final result back
to each processor. At the end of this "global sum" operation, each processor knows the correct sum, E, over all values of i. In message passing interfaces such as MPI, a global sum operation (called a "reduce" operation in MPI) can be carried out by calling a single routine that will carry out the global sum for the programmer. In specialist parallel machines, fast routines are available to do this operation using the quickest possible means, often taking advantage of the architecture and connectivity of the machine itself.
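For instance, with MPI the partial sums held by the individual processors can be combined in a single call to the reduction routine MPI_ALLREDUCE; the short sketch below (an illustration, not code from any particular program) leaves every processor holding the total.

      program global_sum_example
      use mpi
      implicit none
      integer :: ierr, idnode, nodes
      real(kind=8) :: e_local, e_total

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, idnode, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nodes, ierr)

      ! e_local stands for the partial sum of E_i computed by this processor
      ! (a dummy value is used here)
      e_local = dble(idnode)

      ! the "reduce" operation: after this call every processor holds the
      ! sum of the partial sums from all processors in e_total
      call MPI_ALLREDUCE(e_local, e_total, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)

      call MPI_FINALIZE(ierr)
      end program global_sum_example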
Pointers to successful parallelisation There are two key factors leading to successful parallelisation. These apply to any parallel application, not just to molecular simulation. The first involves successful load balancing. This simply means that each processor must have a roughly equal share of the work. It stands to reason that if some processors do more work than others, some processors will be left idle for some of the time and parallel efficiency will be reduced. The second important consideration is minimisation of communication costs. If processors have to wait for an inter-processor communication to receive data, then they remain idle during that period and parallel efficiency again suffers. In many parallel algorithms the main technical problem to be faced simply involves finding ways of minimising the ratio of communication cost to computational cost. In molecular simulation, both rise with the size of the system. However, computational costs (which usually depend on the number of pair interactions in the system) tend to rise more quickly than communication costs as the size of the system increases. Consequently, most parallel algorithms are reasonably efficient for huge system sizes. The challenge, however, is to make algorithms work well in parallel for the typical system sizes that are usually tackled on scalar machines. Only then will parallelisation be able to tackle important problems such as speeding up equilibration in simulations of macromolecules.
2. Parallel molecular dynamics: the replicated data approach
The replicated data concept The replicated data method [Smith, 1991, Smith, 1992, Wilson et al., 1997] uses the N-identical workers strategy described above. The same molecular dynamics program is run on each processor, and each processor has a copy of the coordinates, velocities and forces. At a point where parallelisation is possible, work is split between the different processors. Afterwards a global sum operation is usually required to restore, on every processor, a copy of the correct data summed over each process. The name replicated data comes from the fact that the main simulation data needs to be duplicated on each processor. As will be shown below, this is a relatively easy parallel strategy to implement, requiring only minor changes to scalar code. The drawbacks of the approach are that memory usage is high (due to the duplication of data) and communication costs are quite high. The latter limits scalability to large numbers of processors unless the system itself is very large, while the former can prevent really large systems from being studied if the memory limit available to individual processors is reached.
Application to atomic simulation
In a typical molecular dynamics program, the forces, f_i, on the particles are computed from the potential, V,

f_i = - \partial V / \partial r_i                                    (2)

and the equations of motion are integrated using a finite difference algorithm, such as the leap-frog algorithm [Allen and Tildesley, 1987] (equations 3 and 4), to update the velocities, v_i, and the positions, r_i, at successive time-steps, \Delta t,

v_i(t + \Delta t/2) = v_i(t - \Delta t/2) + \Delta t \, f_i(t) / m_i    (3)

r_i(t + \Delta t) = r_i(t) + \Delta t \, v_i(t + \Delta t/2)            (4)
Analysis of the time taken for this algorithm is quite informative. For small systems (around 256 particles) at least 80% of CPU time will be spent in the force loop (equation 2). The next largest use of time is the integration stage (equations 3-4), which will take up around 10% of the time. However, as the system gets larger the time for force evaluation grows as O(N^2), while the time for integration grows only as O(N). This means that for relatively large systems, or complicated potentials, the force calculation totally dominates in terms of CPU time. Consequently, if the force loop can be parallelised, then the algorithm should work well in parallel regardless of how well the rest of the program parallelises. Figure 2 shows how a simple force loop can be parallelised using the replicated data approach. Instead of looping from I = 1, N-1, as in a normal force loop, the loop jumps forward in steps of size NODES, the number of processors. Consequently, each processor in turn takes successive values of I, as shown on the right hand side of the figure for an eight processor example. At the end of the loop each processor will have summed up the forces for each atom i, but each sum will be incomplete. So a final global sum operation is required (GDSUM), which is equivalent to the sum

f_i = \sum_{p=0}^{NODES-1} f_i^{(p)}                                 (5)

(where f_i^{(p)} is the incomplete force sum held on processor p)
Figure 2. Example showing the parallelisation of a simple force loop from a molecular dynamics program. IDNODE ranges from 0 to NODES-1 and represents the processor number. Each step in the loop is taken by successive processors, as shown for an eight processor system on the right of the figure. The call to the routine GDSUM at the end of the loop ensures a global sum for each one of the force vectors f_i.
for each atomic force vector f_i. The simplicity of the parallel method in figure 2 explains why it is now possible to parallelise loops such as this automatically on a shared memory computer by means of a parallelising compiler. The compiler is able to identify where execution of a loop involves independent operations for each trip round the loop, and then parallelise accordingly. A de-facto standard for shared memory computing, OpenMP [OpenMP, 2003], is now available, which allows the user to add compiler directives to their program to help the compiler parallelise loops efficiently. OpenMP is a simpler approach to parallelisation than PVM or MPI message passing, where the user must organise data transfer and processor synchronisation themselves. However, only the latter two allow portability of code between shared memory and distributed memory machines.
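For distributed memory machines, a minimal sketch of the replicated data force loop of figure 2, written with MPI rather than the GDSUM routine shown in the figure, might look as follows; the array names fx, fy, fz and the commented-out pair force evaluation are illustrative only.

      subroutine rd_forces(n, idnode, nodes, x, y, z, fx, fy, fz)
      ! replicated data force loop: each processor takes every NODES-th
      ! value of i (as in figure 2) and the partial force arrays are then
      ! summed over processors
      use mpi
      implicit none
      integer, intent(in) :: n, idnode, nodes
      real(kind=8), intent(in)  :: x(n), y(n), z(n)
      real(kind=8), intent(out) :: fx(n), fy(n), fz(n)
      integer :: i, j, ierr

      fx = 0.0d0;  fy = 0.0d0;  fz = 0.0d0

      do i = idnode + 1, n - 1, nodes
         do j = i + 1, n
            ! ... evaluate the i-j pair force from x, y, z and accumulate
            !     the contributions into fx, fy and fz ...
         end do
      end do

      ! global sum of the partial force arrays (equation 5), so that every
      ! processor ends up with the complete forces
      call MPI_ALLREDUCE(MPI_IN_PLACE, fx, n, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)
      call MPI_ALLREDUCE(MPI_IN_PLACE, fy, n, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)
      call MPI_ALLREDUCE(MPI_IN_PLACE, fz, n, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)
      end subroutine rd_forces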
Improved load balancing A close look at the example of figure 2 shows that, as the value of I increases, the number of pair forces to be generated for that I drops. This means that processor 0 will always end up doing more work than processor 1, which in turn does more work than processor 2, and so on. Better load balancing can therefore be achieved if this workload is spread more evenly. This can be done by using Brode-Ahlrichs decomposition [Brode and Ahlrichs, 1986], shown schematically in figure 3. Code to implement this scheme is given in reference [Wilson, 2000].
Figure 3. Brode-Ahlrichs decomposition for a 9 atom and a 10 atom system. The rows represent atom i and the columns represent atom j. Each circle represents an interaction to be calculated. Reading across a row gives the interactions to be computed for atom i. Note that Brode-Ahlrichs decomposition is slightly more efficient for systems containing odd numbers of atoms, as it gives perfect load-balancing in this case.
In practice, Brode-Ahlrichs decomposition is usually combined with a Verlet neighbour list for each atom. One nice feature of this is that each node only needs to hold the neighbour list for the atoms it is responsible for. This means that the full neighbour list, which can take up large amounts of memory, can be distributed over the processors in the system, with each one storing only a partial neighbour list.
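A minimal sketch of the Brode-Ahlrichs loop structure, for the perfectly load-balanced case of an odd number of atoms n, is given below (illustrative only); each row of figure 3 pairs atom i with the (n-1)/2 atoms that follow it cyclically, so that every pair occurs exactly once and every row has the same length.

      subroutine brode_ahlrichs_pairs(n, idnode, nodes)
      ! Brode-Ahlrichs pair generation for an odd number of atoms n:
      ! row i of figure 3 pairs atom i with the (n-1)/2 atoms that follow
      ! it cyclically, so every row has the same length and every pair
      ! occurs exactly once; whole rows are shared between processors
      implicit none
      integer, intent(in) :: n, idnode, nodes
      integer :: i, j, k

      do i = idnode + 1, n, nodes
         do k = 1, (n - 1) / 2
            j = mod(i - 1 + k, n) + 1
            ! ... evaluate the i-j pair interaction ...
         end do
      end do
      end subroutine brode_ahlrichs_pairs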
A practical example for a Gay-Berne liquid crystal The leap-frog algorithm of equations 3 and 4 can easily be extended to an anisotropic system, such as the Gay-Berne mesogen, by introducing leap-frog equations for rotation about the centres of mass of a linear molecule with long axis vector u_i and orientational derivative \dot{u}_i, subject to the constraints u_i \cdot u_i = 1 and u_i \cdot \dot{u}_i = 0 [Fincham, 1984, Allen and Tildesley, 1987] (equations 6-8). The quantities driving the rotational motion are the so-called gorques, g_i, calculated from the derivatives of the potential with respect to the molecular long axis vectors, together with the constraint terms that keep u_i a unit vector (equations 9-11).
Figure 4 plots results for the force evaluation (including the anisotropic parts) of systems of Gay-Berne particles on a Cray T3D. Increasing the number of particles leads to a reduction in the ratio of communication costs to computational costs, and hence to an improvement in parallel performance with system size, as shown in the figure. It is clear that, for the smallest system studied, increasing the number of processors beyond 128 actually leads to the program running slower. This is termed parallel slowdown, and arises when an algorithm becomes communications bound. This will always occur at some point as the number of processors increases, and represents the limit of parallelisation. It should be noted from figure 4 that the replicated data algorithm scales pretty well. This is because fast communication links were available for the Cray system used. However, for a workstation cluster with slow communications over ethernet, parallel slowdown can occur using only a few processors. The integration part of the algorithm can also be parallelised. However, here there is only a single loop over the number of particles using equations 3, 4 and 6-8, and this must be followed by a global sum for coordinates, velocities, orientations, u_i, and orientational derivatives, \dot{u}_i. Not surprisingly the ratio of communication costs to computational costs tends to be poor, and so for many systems parallel integration is slower than scalar integration [Wilson et al., 1997].
Figure 4. Results for parallel force evaluation (including anisotropic terms) for systems of 256, 2048, 16384 and 65536 Gay-Berne particles, as described in reference [Wilson et al., 1997]; speed-up is plotted against the number of processors. The results use standard PVM calls on a Cray T3D. Improved performance over these results is possible by using cache-cache data transfers for the global sums at the end of the force evaluation.

Extension to macromolecular systems The replicated data approach is readily extendable to macromolecules. Many MD studies of polymers have used the simple bead-spring model [Binder, 1995], where Lennard-Jones sites are linked together by simple springs. Alternatively, combinations of isotropic Lennard-Jones and anisotropic Gay-Berne particles can easily be linked together to form an oligomer [Wilson, 1997], polymer [Lyulin et al., 1998] or dendrimer [Wilson et al., 2003], or a full molecular mechanics force field can be used to represent the polymer [Krushev et al., 2002, Smith et al., 2002]. Evaluation of simple intramolecular bond stretching, bond bending and torsional terms can be parallelised easily by analogy to the force loop shown above. Because data is replicated, each processor has a copy of the connectivity list for each type of molecule, so parallelisation is straightforward. As an example, parallelisation of a loop over bond angles would require a single loop of the form:

      DO NUM = IDNODE+1, NANGLES, NODES
         !find atom numbers defining the angle
         i = ang1(num)
         j = ang2(num)
         k = ang3(num)
         !find energy and forces for angle i-j-k
         ..............
      ENDDO
This would replace a scalar do-loop starting: DO NUM = 1, NANGLES. It should be noted that no additional global sum is required for the forces at the end of this angle loop, because a global sum of the forces will occur anyway for the nonbonded terms. The only additional global sum required is for the total angle interaction energy, and this can be done at the same time as the other intra- and intermolecular energies. In this sense, parallelisation of all the intramolecular forces comes virtually free of charge in terms of additional communication costs. The simplicity of implementing this parallelisation scheme means that a large number of replicated data MD programs are available for macromolecules, including DLPOLY (a general purpose parallel MD program [Forester and Smith, 1995] that is well suited to large molecules), and the GBMOL program from our own laboratory [Wilson, 1996, Wilson, 2000].
3. Parallel molecular dynamics: the domain decomposition approach
The domain decomposition concept Figure 5 illustrates the domain decomposition concept. The simulation box (domain) is divided into a series of regions, each of which is assigned to a separate processor. Within the regions the box is further subdivided into cells, which have dimensions slightly greater than the cut-off for the nonbonded potential. The idea here is that great savings can be made, in terms of both memory usage and communication costs, by avoiding the need to replicate all data across the whole system. Each processor can be responsible for calculating the interactions of its own particles and integrating their equations of motion. Communication between processors should only be required to calculate the interactions for particles near the boundaries between different regions. Such communications are obviously minimised when the boundary regions are small in relation to the size of the simulation box. For short range potentials, domain decomposition (DD) has already shown itself to be effective in a large number of studies [Smith, 1991, Wilson et al., 1997, Esselink et al., 1993, Esselink and Hilbers, 1993, Rapaport, 1988, Brown et al., 1993, Brown et al., 1994, Jabbarzadeth et al., 1997, Hilbers and Esselink, 1992]. Load balancing is generally good for this algorithm provided that, for equal sized regions, the density remains homogeneous throughout the simulation box.
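For a rectangular decomposition, assigning a particle to a region is a simple arithmetic operation on its coordinates. The sketch below is an illustration only; it assumes a cubic box of side boxl centred on the origin, divided into px*py*pz equal regions, and a particular ordering of the processors.

      integer function owner(x, y, z, boxl, px, py, pz)
      ! which processor owns the region containing the point (x,y,z)?
      ! assumes a cubic box of side boxl centred on the origin, divided
      ! into px*py*pz equal regions, with processors numbered first along
      ! x, then y, then z (an illustrative convention only)
      implicit none
      real(kind=8), intent(in) :: x, y, z, boxl
      integer, intent(in) :: px, py, pz
      integer :: ix, iy, iz

      ix = min(px - 1, max(0, int((x / boxl + 0.5d0) * px)))
      iy = min(py - 1, max(0, int((y / boxl + 0.5d0) * py)))
      iz = min(pz - 1, max(0, int((z / boxl + 0.5d0) * pz)))
      owner = ix + px * (iy + py * iz)
      end function owner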
Figure 5. Diagram illustrating the decomposition of a 3d simulation box into separate regions, which are to be controlled by separate processors. The 2d example shows the situation for a 4 processor system, where each region is subdivided into cells which are slightly larger than the cut-off for the nonbonded potential. (Note that for the small number of processors shown here, the linear dimensions of the whole simulation box are spanned by just two processors, so one processor will have the same neighbour to its right and to its left.)

The force evaluation strategy There are two possible force evaluation strategies for domain decomposition [Smith, 1991]. These are illustrated in figure 6. Prior to the force calculation, data about the coordinates and (in the case of anisotropic sites) orientations is exchanged between processors. In the first strategy, all data in the boundary cells of each region is exchanged between neighbouring nodes. This must be done with three separate message passing events: first data must be passed in the x-direction, then the y-direction and then the z-direction. Within each event, the passing of data in the positive and negative directions can be done simultaneously. In between each data passing event, the arriving coordinates must be sorted, in case any of the arriving coordinates need to be passed on to another node. This is essential to ensure that a region also receives coordinates from neighbouring regions that connect only via the edge of a cell, or only via a cell vertex. Figure 7 illustrates how, for a 2d example, a single region receives data from 8 surrounding cells, and so comes to hold a set of particle data from other cells as well as from resident atoms. The forces can then be computed. It should be noted that the force evaluation involves some duplication for the boundary atoms; obviously, duplication can be reduced by using large regions. It also means being careful not to count contributions to the total energy twice. An alternative force evaluation strategy involves sending coordinate information in one direction only (i.e. in a single direction for each of x, y and z), prior to the force calculation, as shown in the second force evaluation strategy diagram in figure 6. This removes duplication in the force calculation, but necessitates a second transfer, of forces, in the opposite direction to the transfer of coordinates shown in figure 6, so that each processor can have a copy of the true forces for the particles in its boundary cells. The best strategy for any particular simulation will depend on system size, speed of communications and cost of force evaluation. Small system sizes, fast communications, and costly force calculations favour the second strategy.
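As a much simplified illustration of one of the three message passing events of the first strategy, the sketch below exchanges pre-packed boundary-cell coordinate buffers with the two neighbouring regions along x using MPI; the buffer packing, the neighbour ranks node_px and node_mx, and the maximum buffer length maxb are all assumed to be set up elsewhere, and every name is illustrative.

      subroutine halo_exchange_x(maxb, nplus, bplus, nminus, bminus, &
                                 node_px, node_mx,                   &
                                 nrecv_p, brecv_p, nrecv_m, brecv_m)
      ! one of the three communication events of the first strategy:
      ! exchange of boundary-cell coordinate buffers with the neighbouring
      ! regions in the +x and -x directions; the two transfers are posted
      ! as combined send/receives so they proceed simultaneously
      use mpi
      implicit none
      integer, intent(in)  :: maxb, nplus, nminus, node_px, node_mx
      real(kind=8), intent(in)  :: bplus(maxb), bminus(maxb)
      integer, intent(out) :: nrecv_p, nrecv_m
      real(kind=8), intent(out) :: brecv_p(maxb), brecv_m(maxb)
      integer :: ierr, stat(MPI_STATUS_SIZE)

      ! send to the +x neighbour while receiving from the -x neighbour
      call MPI_SENDRECV(bplus,  nplus, MPI_DOUBLE_PRECISION, node_px, 1, &
                        brecv_m, maxb, MPI_DOUBLE_PRECISION, node_mx, 1, &
                        MPI_COMM_WORLD, stat, ierr)
      call MPI_GET_COUNT(stat, MPI_DOUBLE_PRECISION, nrecv_m, ierr)

      ! and the reverse: send to the -x neighbour, receive from +x
      call MPI_SENDRECV(bminus, nminus, MPI_DOUBLE_PRECISION, node_mx, 2, &
                        brecv_p, maxb,  MPI_DOUBLE_PRECISION, node_px, 2, &
                        MPI_COMM_WORLD, stat, ierr)
      call MPI_GET_COUNT(stat, MPI_DOUBLE_PRECISION, nrecv_p, ierr)
      end subroutine halo_exchange_x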
Integration and reallocation The beauty of the DD method is that the integration can occur perfectly in parallel. Each node integrates the equations of motion for its resident particles. However, at the end of the simulation time-step, it may be the case that integration has led to particles moving outside the region handled by their host node. In this situation, reallocation of particles must occur, and the data stored by the host node on behalf of each such particle (absolute atom number, coordinates and orientations) must be transferred to another node. Again, this occurs in the same three-stage process as used prior to the force calculation.
Figure 6. The two alternative force evaluation strategies for domain decomposition.

A practical example for the Gay-Berne liquid crystal Trials for a Gay-Berne system using the domain decomposition strategy on a Cray T3D have yielded successful linear speed-ups for up to 256 processors [Wilson et al., 1997], as shown in figure 8. Interestingly, for the fast inter-processor communications available on the Cray, small systems (256 and 2048 particles) are slower with domain decomposition than with replicated data on 8 nodes. This reflects the duplication in the force calculation for the DD case. This finding is completely reversed if the fast Cray communications are replaced by standard ethernet, or if the algorithm is scaled up to more than 8 processors. The latter is shown in figure 9, where performance improvements of more than 13 times are achieved for DD over RD for a system of 65536 particles running on 256 processors of a Cray T3D.
Figure 7. Diagram showing how a central region receives data from 8 surrounding cells prior to the force calculation, when force evaluation strategy 1 is used for a 2d example. On the right of the diagram the boxes correspond to nodes, the dark circles are resident particles and the lighter circles are particles transferred from another node. In 3 dimensions the central box must receive data from 26 other boxes. This is handled in three sets of overlapped communications with neighbouring boxes in the x, y and z directions.
Figure 8. Results showing the parallel performance of a domain decomposition molecular dynamics program using the first force evaluation strategy; speed-up is plotted against the number of processors, up to 256. The results are for a system of 16384 Gay-Berne particles using standard PVM calls on a Cray T3D. Improved performance over these results is possible by using cache-cache data transfers for the global sums at the end of the force evaluation.

Extension to macromolecular systems The basic problem to be faced when DD is applied to molecular systems is illustrated in figure 10. When a molecule is broken over several separate regions, a major problem occurs when it comes to computing intramolecular interactions such as bond stretching, bond angle bending and torsional potentials. This is because, in general, each processor only knows about the atoms in its own region. There have been many elegant attempts to solve this problem for linear chains [Esselink and Hilbers, 1993, Brown et al., 1994, Surridge et al., 1996, Jabbarzadeth et al., 1997] and for more general molecules [Nelson et al., 1996, Lim et al., 1997, Brown et al., 1997, Srinivasan et al., 1997a, Srinivasan et al., 1997b]. It should be stressed that some of these methods are quite complex, and consequently DD MD is much harder to implement than RD MD. The authors of this article have implemented a very simple scheme, which can be applied to molecules of arbitrary topology composed of spherical or nonspherical sites. Each node keeps a copy of the data which does not change during the simulation, such as the topology of each different type of molecule, and each atom is given a unique atom number, which is stored by its resident processor. The unique atom numbers of all neighbouring atoms involved in intramolecular interactions (bonds, angles, dihedrals) with a resident atom can be generated from this stored information. To be successful this scheme relies upon the fact that the distance between the four atoms defining any dihedral angle (and by implication the distance between the atoms of each angle and each bond also) is always less than the cut-off distance of the potential. This means that the usual message passing stage prior to the force calculation guarantees that each node has a copy of all the atomic coordinates it needs to carry out the force evaluation for any intramolecular force field term. This scheme has been quite successful and is described in detail in references [Ilnytskyi and Wilson, 2001] and [Ilnytskyi and Wilson, 2002].
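As an illustration of the bookkeeping involved, the sketch below shows how the unique (global) number of an intramolecular partner can be generated from the replicated topology data; the storage layout assumed here (molecules of one type numbered contiguously, nsites atoms per molecule) is illustrative only and is not necessarily that used in the programs cited above.

      integer function global_site(iglob, nsites, jsite)
      ! for a resident atom with unique (global) number iglob, return the
      ! global number of the atom occupying local site jsite of the same
      ! molecule; assumes molecules of one type are numbered contiguously
      ! with nsites atoms each (an illustrative storage layout only)
      implicit none
      integer, intent(in) :: iglob, nsites, jsite
      integer :: imol

      imol        = (iglob - 1) / nsites     ! molecule index, counted from 0
      global_site = imol * nsites + jsite    ! unique number of site jsite
      end function global_site

The local site numbers of the bonded, angle and dihedral partners of each site are fixed by the (replicated) molecular topology, and the corresponding coordinates are guaranteed to be available on the node by the boundary communication step described above.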
4. Parallel Monte Carlo
Why does standard Monte Carlo perform so badly? The standard Metropolis Monte Carlo approach works very badly in parallel. Such an approach involves calculating the energy of a particle, moving the particle, recalculating the energy, and deciding whether to accept or reject the move. A standard replicated data approach to parallelising this algorithm would involve each processor taking part in the energy evaluation, followed by a global sum operation (so that each processor could have the total energy summed over each processor). The problem with this idea is immediately apparent. Unlike molecular dynamics, where all pair interactions in the system must be computed each step, the energy evaluation for a single move is usually comparatively cheap, involving only a few interactions. So the ratio of communication time to calculation time is high. (Indeed, in the worst-case scenario of a hard particle system, finding one overlap would be sufficient for the move to be rejected without further checking of other particles.) In some cases, the potential may be sufficiently expensive that the communication/calculation ratio improves, but the authors do not know of any systems where parallelisation in this way has usefully been extended to more than a few processors.

Figure 9. Results showing the ratio of the parallel performance of a domain decomposition molecular dynamics program to that of a replicated data molecular dynamics program; the ratio of CPU times is plotted against the number of processors. The results are for systems of 256, 2048, 16384 and 65536 Gay-Berne particles using standard PVM calls on a Cray T3D. (Data are taken from reference [Wilson et al., 1997].)
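To make the communication overhead of a replicated data approach to Monte Carlo explicit, the sketch below (our own illustration, with hypothetical names) evaluates the energy change for a single-particle trial move; the loop contains only of order N/NODES cheap pair terms, yet every attempted move must still be followed by a global sum.

      subroutine trial_move_energy(n, imove, idnode, nodes, de_total)
      ! replicated data evaluation of the energy change for a trial move
      ! of atom imove: each processor sums the pair terms for its share of
      ! the atoms, then a global sum is needed before the move can be
      ! accepted or rejected
      use mpi
      implicit none
      integer, intent(in)  :: n, imove, idnode, nodes
      real(kind=8), intent(out) :: de_total
      real(kind=8) :: de_local
      integer :: j, ierr

      de_local = 0.0d0
      do j = idnode + 1, n, nodes
         if (j /= imove) then
            ! ... add u(trial position, j) - u(old position, j) ...
         end if
      end do

      ! this communication is paid on every attempted move, even though
      ! the loop above contains only of order n/nodes cheap pair terms
      call MPI_ALLREDUCE(de_local, de_total, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)
      end subroutine trial_move_energy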
Figure 10. Sketch of a molecule overlapping several regions, illustrating the basic problem when domain decomposition is applied to molecular systems.
Embarrassingly parallel Monte Carlo A simple parallel method that can often be applied (for MD as well as MC) is to set up N independent simulations and then combine the statistics. Of course, this is often easier said than done! Statistically independent starting configurations are required; and since for macromolecular simulations there is often a major problem in simulating for long enough to reach equilibration, such an approach cannot be applied to speed up the equilibration process. However, within its area of applicability it represents a popular use of parallelisation.
Parallel configurational-bias Monte Carlo In configurational-bias Monte Carlo, the whole, or part, of a polymer chain is deleted and regrown [D. Frenkel, 2001]. For a linear chain, a number of trial replicas of each new atom are generated according to the Boltzmann weight associated with its bonded interactions (bond length, bond angle and dihedral angle), and one of these is chosen according to the Boltzmann weight associated with its nonbonded interactions. This ensures that the total probability of accepting the new position of an atom is given by its Boltzmann factor (which includes all interactions) and that the regrown chain follows a "low energy route". After the new trial configuration has been generated it is accepted or rejected based
on the ratio of Rosenbluth factors for the new and old configurations, which corrects for the bias introduced in the regrowing process. Two forms of parallelism are possible. First, the processors can generate the trial replicas of a new atom in parallel. However, a communication step is then required to choose one trial position and update the coordinates on each processor. This synchronisation must occur before a processor can proceed to the next atom. A second parallel strategy involves allowing each processor to grow one or several copies of a new chain. One chain can then be accepted based on its Rosenbluth weight. This strategy of multiple chain growth has been tried on a scalar machine, and has been shown to improve sampling efficiency within configurational-bias MC [Esselink et al., 1995]. It should also translate well to parallel machines, because a reasonable amount of computational time is required to generate a chain (or multiple chains) on each processor, relative to the inter-processor communication required at the end of chain regrowth.
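At the heart of configurational-bias regrowth is the Boltzmann-weighted selection among the k trial positions of an atom, sketched below for illustration (the trial energies u are assumed to have been computed already; in the parallel variants discussed above it is the generation of the k trials, or of whole chains, that is distributed over processors).

      subroutine select_trial(k, beta, u, ichosen, w)
      ! choose one of k trial positions with probability proportional to
      ! exp(-beta*u(i)); w returns the Rosenbluth weight of this step
      implicit none
      integer, intent(in)  :: k
      real(kind=8), intent(in)  :: beta, u(k)
      integer, intent(out) :: ichosen
      real(kind=8), intent(out) :: w
      real(kind=8) :: boltz(k), csum, r
      integer :: i

      boltz = exp(-beta * u)
      w     = sum(boltz)

      call random_number(r)
      r = r * w
      csum    = 0.0d0
      ichosen = k
      do i = 1, k
         csum = csum + boltz(i)
         if (csum >= r) then
            ichosen = i
            exit
         end if
      end do
      end subroutine select_trial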
Multi-move Monte Carlo Multi-move Monte Carlo relies on dividing the simulation box up into domains slightly larger in dimension than the cut-off (figure 11). Small molecules separated by more than the cut-off of the potential can be moved independently in parallel without affecting each other. Alternatively, two segments of a polymer chain that fulfil the same requirement (as shown in figure 11) can be moved independently. Parallelisation enters because the energy changes for these moves can be calculated on different processors. Obviously, this approach relies on the site-site potentials being short-range, and it is quite difficult to implement for long chain molecules. It has, however, been successfully applied by the Mainz group in simulations of the bond fluctuation model [Wittmann and Kremer, 1990], and a simpler variant has been used for off-lattice simulations of linear polyethylene chains [Uhlherr et al., 2002].
Figure 11. A domain decomposition approach for use with multi-move Monte Carlo.

Hybrid Monte Carlo Hybrid Monte Carlo is generally taken to mean the combination of Monte Carlo and molecular dynamics methods. This can be done in a number of different ways. For example, long molecular dynamics steps can be taken (much longer than ones which would conserve energy in normal molecular dynamics), and these steps can then be accepted or rejected with the usual acceptance criterion. Alternatively, a series of smaller MD time-steps can be taken, and this sequence of steps can be treated as a single Monte Carlo trial move. Doing Monte Carlo in this way allows all the parallel methods that are applied to molecular dynamics simulations to be applied to the hybrid MC scheme. There are, however, drawbacks, the most obvious being that in several studies with scalar algorithms no big advantages have been seen over standard MD. Of course it is possible that there may be some advantages available if this approach can correctly be combined with other types of Monte Carlo move. There have been no systematic studies of hybrid MD/MC methods with different mixes of moves, so the true effectiveness of this method is still an open question.
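For illustration, the acceptance step of a hybrid MC move might be sketched as below: a short MD segment, run with any of the parallel MD methods of the previous sections, is treated as a trial move and accepted with the Metropolis criterion applied to the change in the total energy H (kinetic plus potential), with the momenta normally resampled before each segment. The function name and argument list are illustrative only.

      logical function accept_hybrid(beta, h_old, h_new)
      ! accept a short MD segment, treated as a single MC trial move, with
      ! probability min(1, exp(-beta*(h_new - h_old))), where h is the
      ! total (kinetic plus potential) energy; if the move is rejected the
      ! old coordinates and momenta are restored by the caller
      implicit none
      real(kind=8), intent(in) :: beta, h_old, h_new
      real(kind=8) :: r

      call random_number(r)
      accept_hybrid = (r < exp(-beta * max(0.0d0, h_new - h_old)))
      end function accept_hybrid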
Parallel tempering In parallel tempering [Geyer and Thompson, 1995, Vlugt and Dunweg, 2001, Bunker and Dunweg, 2001] several simulations are carried out at the same time, and Monte Carlo moves are made to swap systems. This can be done by running several different temperatures at the same time, or by softening the potential and running different potentials at the same time [Bunker and Dunweg, 2001]. The idea is illustrated in figure 12, where four temperatures are simulated at the same time and MC moves are carried out to swap systems. Over a period of time, the simulation at the lowest temperature is able to sample states from the higher temperatures.

Figure 12. Idealised diagram showing parallel tempering over four temperatures. The dashed lines between temperatures represent Monte Carlo moves swapping systems.

Parallel tempering is geared towards improving phase space sampling: sampling with the correct Boltzmann statistics from higher temperature simulations can speed up the path through phase space and overcome problems caused by bottle-necks. In a similar way, softening the potential can remove barriers to equilibration. In the most extreme case, the potential can be softened to such an extent that the particles are able to pass through each other, altogether removing all barriers. However, it is sometimes difficult to move between this softened potential and the full repulsive potential, even with many steps in between. Parallel tempering can, of course, be done in parallel with almost 100% efficiency. It can also be used with separate molecular dynamics simulations as well as with Monte Carlo. However, here it is often difficult to move between systems without causing occasional problems in the solution of the equations of motion, which can only be avoided by using a small time-step.
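The swap move itself is straightforward: for two replicas at inverse temperatures beta_i and beta_j, with current potential energies e_i and e_j, the exchange is accepted with probability min[1, exp((beta_i - beta_j)(e_i - e_j))], as in the illustrative sketch below.

      logical function accept_swap(beta_i, beta_j, e_i, e_j)
      ! parallel tempering swap test: e_i and e_j are the current potential
      ! energies of the configurations held at inverse temperatures beta_i
      ! and beta_j; the swap is accepted with probability
      ! min(1, exp((beta_i - beta_j)*(e_i - e_j)))
      implicit none
      real(kind=8), intent(in) :: beta_i, beta_j, e_i, e_j
      real(kind=8) :: delta, r

      delta = (beta_i - beta_j) * (e_i - e_j)
      call random_number(r)
      accept_swap = (r < exp(min(0.0d0, delta)))
      end function accept_swap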
5. Summary
This article summarises the current state of play in parallel simulations of macromolecules. Molecular dynamics works well, both with replicated data and with domain decomposition methods. The former is the easier to implement, requiring only minor changes from a scalar code; however, the latter uses less memory, and is more efficient for large numbers of processors and large system sizes. Parallel Monte Carlo is still in its infancy, and is not yet proven for large numbers of processors. However, several interesting methods have been proposed recently, and the possibility of combining parallel methods (such as parallel tempering, parallel configurational bias, hybrid MC/MD and multi-move MC) looks attractive. Finally, a note of caution should be sounded. It is easy to quote spectacular performance figures for parallel computing by showing the results from MD simulations of large systems. Here the computational costs far outweigh any communication costs and parallelisation works extremely well. However, the practising polymer simulator will know that the most common practical problem faced is not going to larger system sizes, but rather going to longer simulation times (or, in MC terms, sampling more of phase space). For many problems, system sizes of around 10000 sites are all that is required, and the practical problem is then to use parallel methods to speed up phase space sampling. It is here that newer methods, such as parallel tempering, parallel configurational bias MC and parallel hybrid MC/MD methods, are likely to make the biggest impact over the next few years.
Acknowledgments The authors wish to thank the UK EPSRC for funding High Performance Computers at the University of Durham, for providing computer time on a Cray T3D, and for providing funding for JMI. MRW and JMI thank NATO for providing funds towards attending this Erice meeting on polymers and liquid crystals. They also wish to thank Profs Paolo Pasini, Claudio Zannoni and Slobodan Zumer for invitations to attend the workshop and for the splendid organisation that made this an excellent meeting.
References [Allen and Tildesley, 1987] Allen, M. P. and Tildesley, D. J. (1987). Computer Simulation of Liquids. Oxford University Press, Oxford. [Binder, 1995] Binder, K. (1995). Monte Carlo and Molecular Dynamics Simulations in Polymer Science. Oxford University Press. [Brode and Ahlrichs, 1986] Brode, S. and Ahlrichs, R. (1986). Comp. Phys. Comm., 42:51. [Brown et al., 1993] Brown, D., Clarke, J. H. R., Okuda, M., and Yamazaki, T. (1993). Comp. Phys. Comm., 74:67. [Brown et al., 1994] Brown, D., Clarke, J. H. R., Okuda, M., and Yamazaki, T. (1994). Comp. Phys. Comm., 83:1. [Brown et al., 1997] Brown, D., Minoux, H., and Maigret, B. (1997). Comp. Phys. Comm., 103:170–186. [Bunker and Dunweg, 2001] Bunker, A. and Dunweg, B. (2001). Phys. Rev. E, 63:016701. [D. Frenkel, 2001] D. Frenkel, B. S. (2001). Understanding molecular simulation : from algorithms to applications. Academic Press. [Esselink and Hilbers, 1993] Esselink, K. and Hilbers, P. A. J. (1993). J. Comput. Phys., 106:108. [Esselink et al., 1995] Esselink, K., Loyens, L. D. J. C., and Smit, B. (1995). Phys. Rev. E, 51:1560–1568. [Esselink et al., 1993] Esselink, K., Smit, B., and Hilbers, P. A. J. (1993). J. Comput. Phys., 106:101. [Fincham, 1984] Fincham, D. (1984). CCP5 Quarterly, 12:47. [Forester and Smith, 1995] Forester, T. R. and Smith, W. (1995). DL POLY. DL POLY is a package of molecular simulation routines written by W. Smith and T. R. Forester, copyright The Council for the Central Laboratory of the Research Councils, Daresbury Laboratory at Daresbury, Nr. Warrington (1996). [Fountain, 1994] Fountain, T. J. (1994). Parallel Computing principles and practice, chapter 2. Cambridge University Press. [Geyer and Thompson, 1995] Geyer, C. J. and Thompson, E. A. (1995). J. Am. Stat. Assoc., 90:909–920.
[Hilbers and Esselink, 1992] Hilbers, P. and Esselink, K. (1992). Parallel computing and molecular dynamics simulations. In Allen, M. P. and Tildesley, D. J., editors, Computer Simulations in Chemical Physics, pages 473–493. Kluwer, The Netherlands. [Hwang and Briggs, 1985] Hwang, K. and Briggs, F. A. (1985). Computer Architecture and Parallel Processing. McGraw-Hill Book Company. [Ilnytskyi and Wilson, 2001] Ilnytskyi, J. M. and Wilson, M. R. (2001). Comput. Phys. Comm., 134:23. [Ilnytskyi and Wilson, 2002] Ilnytskyi, J. M. and Wilson, M. R. (2002). Comput. Phys. Comm., 148:43. [Jabbarzadeth et al., 1997] Jabbarzadeth, A., Atkinson, J. D., and Tanner, R. I. (1997). Comp. Phys. Comm., 107:123. [Krushev et al., 2002] Krushev, S., Paul, W., and Smith, G. D. (2002). Macromolecules, 35:4198–4203.
[Lim et al., 1997] Lim, K.-T., Brunett, S., Iotov, M., McClurg, R. B., Vaidehi, N., Dasgupta, S., Taylor, S., and Goddard III, W. A. (1997). J. Comput. Chem., 18:501. [Lyulin et al., 1998] Lyulin, A. V., Barwani, M. S. A., Allen, M. P., Wilson, M. R., Neelov, I., and Allsopp, N. K. (1998). Macromolecules, 31:4626. [MPI, 2003] MPI (2003). Free implementations of MPI and excellent on-line MPI references can be found at http://www-unix.mcs.anl.gov/mpi/ or by following links from this page. Two freely available portable implementations of MPI are MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/) and LAM-MPI (http://www.lam-mpi.org). [Nelson et al., 1996] Nelson, M. T., Humphrey, W., Gursoy, A., Dalke, A., Kale, L. V., Skeel, R. D., and Schulten, K. (1996). The International Journal of Supercomputing Applications and High Performance Computing, 10:251. [OpenMP, 2003] OpenMP (2003). The home page for OpenMP is http://www.openmp.org; most manufacturers of shared memory machines produce their own implementations of the OpenMP standard. [PVM, 2002] PVM (2002). Excellent on-line references for PVM can be found at http://www.csm.ornl.gov/pvm/ (PVM source code is also available free from here). [Rapaport, 1988] Rapaport, D. C. (1988). Comp. Phys. Rep., 9:1. [Smith et al., 2002] Smith, G. D., Borodin, O., and Paul, W. (2002). J. Chem. Phys., 117:10350–10359.
[Smith, 1991] Smith, W. (1991). Comp. Phys. Comm., 62:229. [Smith, 1992] Smith, W. (1992). Comp. Phys. Comm., 67:392. [Srinivasan et al., 1997a] Srinivasan, S. G., Ashok, I., Jonsson, H., Kalonji, G., and Zahorjan, J. (1997a). Comp. Phys. Comm., 102:28–43. [Srinivasan et al., 1997b] Srinivasan, S. G., Ashok, I., Jonsson, H., Kalonji, G., and Zahorjan, J. (1997b). Comp. Phys. Comm., 102:44–58. [Surridge et al., 1996] Surridge, M., Tildesley, D. J., Kong, Y. C., and Adolf, D. B. (1996). Parallel Computing, 22:1053. [Uhlherr et al., 2002] Uhlherr, A., Leak, S. J., Adam, N. E., Nyberg, P. E., Doxastakis, M., Mavrantzas, V. G., and Theodorou, D. N. (2002). Comp. Phys. Comm., 144:1–22. [Vlugt and Dunweg, 2001] Vlugt, T. J. H. and Dunweg, B. (2001). J. Chem. Phys., 115:8731– 8741.
[Wilson, 1996] Wilson, M. R. (1996). GBMOL: A replicated data molecular dynamics program to simulate combinations of Gay-Berne and Lennard-Jones sites. Author: Mark R. Wilson, University of Durham, (1996). [Wilson, 1997] Wilson, M. R. (1997). J. Chem. Phys., 107:8654. [Wilson, 2000] Wilson, M. R. (2000). Parallel molecular dynamics techniques for the simulation of anisotropic systems. In Pasini, P. and Zannoni, C., editors, Advances in computer simulation of liquid crystals, volume 545 of Series C: Mathematical and Physical Sciences, chapter 13. Kluwer Academic Publishers. [Wilson et al., 1997] Wilson, M. R., Allen, M. P., Warren, M. A., Sauron, A., and Smith, W. (1997). J. Comput. Chem., 18:478. [Wilson et al., 2003] Wilson, M. R., Ilnytskyi, J. M., and Stimson, L. M. (2003). J. Chem. Phys., 119:3509–3515. [Wittmann and Kremer, 1990] Wittmann, H. P. and Kremer, K. (1990). Comp. Phys. Comm., 61:309–330.