Parallel Object Oriented Monte Carlo Simulations

Matthias Troyer¹,², Beat Ammon², and Elmar Heeb²

¹ University of Tokyo, Tokyo, Japan
² ETH Zürich, Zürich, Switzerland

Abstract. We discuss the parallelization and object-oriented implementation of Monte Carlo simulations for physical problems. We present a C++ Monte Carlo class library for the automatic parallelization of Monte Carlo simulations. Besides discussing the advantages of object-oriented design in the development of this library, we show examples of how C++ template techniques have allowed very generic, yet still optimal, algorithms to be implemented for wide classes of problems. These parallel and object-oriented codes have allowed us to perform the largest quantum Monte Carlo simulations ever done in condensed matter physics.

1 Introduction

The Monte Carlo method [1] has been one of the most successful, if not the most successful, numerical methods for the simulation of physical systems. Its applications span all length scales, ranging from large astrophysics simulations of galaxy clusters, over simulations of the properties of solids and liquids, down to simulations of quarks and gluons, the constituents of protons and neutrons. In solid state physics the usual Monte Carlo algorithms were easy to vectorize and thus ideally suited for vector supercomputers. In the most interesting cases, however, close to phase transitions, these "local" Monte Carlo algorithms suffer from so-called "critical slowing down," which leads to an extra factor of $L^2$ in the CPU time (where $L$ is the linear system size). Modern "cluster" algorithms [2, 3] beat this slowing down, but one has to deal with much more complex data structures and with algorithms that do not vectorize well. In this paper we show how almost all kinds of Monte Carlo simulations, including the cluster algorithms, can be parallelized very efficiently, and we introduce a Monte Carlo class library and application framework that performs this parallelization automatically. Additionally, we report our experiences in using C++ template techniques to write generic Monte Carlo programs for a wide class of model systems, and in using these programs for more than 600 years of CPU time on a wide variety of workstation clusters and massively parallel machines.

2 Monte Carlo Simulations

Monte Carlo simulations are the only useful way to evaluate high-dimensional integrals. Such integrals are very common in the simulation of many-body systems.


For example, in a classical molecular dynamics simulation of $M$ particles, the phase space has dimension $6M$ (3 coordinates each for position and velocity). Standard numerical integration techniques are very slow for such high-dimensional integrals: with the Simpson rule in $d$ dimensions with $N$ equidistant points, the error decreases only as $N^{-4/d}$. For the corresponding Monte Carlo summation with $N$ random points $x_i$ sampled from some distribution $p(x_i)$, the integral is estimated by $\int f(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)/p(x_i)$, and the statistical error decreases as $N^{-1/2}$, independent of the dimension. For $d > O(10)$ the Monte Carlo integration method is thus faster. Usually the points $x_i$ are generated by a Markov process $x_1 \to x_2 \to \dots \to x_i \to \dots$. Starting from a random configuration $x_1$, the Markov process must be iterated for a certain number $N_{eq}$ of equilibration steps before it produces $N$ random samples with the correct probabilities. This will be important for the performance of the parallel implementation. For more details about Monte Carlo methods, especially "importance sampling" and other techniques for reducing the statistical error, we refer to standard textbooks [1].
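As a concrete illustration of the estimator above, the following minimal sketch (our own example, not part of the library; the integrand $f(x) = x^2$ and the uniform sampling density $p(x) = 1$ are arbitrary choices) estimates $\int_0^1 x^2\,dx = 1/3$ together with its $N^{-1/2}$ statistical error:

#include <cstdlib>
#include <cmath>
#include <iostream>

// Estimate integral_0^1 x^2 dx = 1/3 by sampling x_i uniformly,
// i.e. p(x) = 1, so the estimator is (1/N) sum f(x_i). The
// statistical error decreases as N^{-1/2}, independent of dimension.
int main()
{
    const long N = 1000000;
    double sum = 0.0, sum2 = 0.0;
    for (long i = 0; i < N; ++i) {
        double x = std::rand() / (RAND_MAX + 1.0);
        double f = x * x;
        sum  += f;
        sum2 += f * f;
    }
    double mean = sum / N;
    // variance of the mean gives the N^{-1/2} statistical error
    double error = std::sqrt((sum2 / N - mean * mean) / N);
    std::cout << "estimate = " << mean << " +/- " << error << std::endl;
    return 0;
}

Halving the statistical error thus requires four times as many samples, regardless of the dimension of the integral.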

3 Parallelization and Performance of Monte Carlo Simulations

Our typical Monte Carlo simulations are easy to parallelize at several levels of granularity:

– Often we need many simulations for hundreds of different parameter sets (system sizes, temperatures, and so forth). Being independent, they can be parallelized trivially, with negligible overhead and almost perfect scaling, since almost no inter-processor communication is needed. For example, we found a speedup of 95.5 on 96 nodes of an Intel Paragon. Whenever the number of simulations exceeds the number of available nodes, this level of parallelization is efficient.
– Within one Monte Carlo simulation, uncorrelated Markov chains $\{x_i\}$ of statistical samples can be generated on different nodes by starting independent Monte Carlo runs with different random seeds. This level of parallelization incurs a slight overhead, however, since each run needs to be equilibrated individually. On $P$ nodes this leads to a theoretical maximal speedup of $P(1 + N_{eq}/N)/(1 + P N_{eq}/N)$ (see the sketch after this list). Since typically $N \approx 100 N_{eq}$, this level of parallelization scales well to about 20 times more nodes than simulations.
– Only if the equilibration time $N_{eq}$ is very long, or if memory requirements demand it, is it worth parallelizing a single Monte Carlo run, for example by distributing the particles of the simulation over different nodes. This is, however, rarely done because of the communication overhead.
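To see where the speedup formula in the second item saturates, the following minimal sketch (our own helper, not library code) evaluates it for the typical ratio $N_{eq}/N = 0.01$:

#include <iostream>

// Theoretical maximal speedup of P independent Markov chains: each
// chain must spend N_eq equilibration steps before contributing its
// share of the N measurement steps, giving
//   S(P) = P (1 + N_eq/N) / (1 + P N_eq/N).
double speedup(int P, double r)     // r = N_eq / N
{
    return P * (1.0 + r) / (1.0 + P * r);
}

int main()
{
    const double r = 0.01;          // typical case: N ~ 100 N_eq
    const int nodes[] = {1, 10, 20, 50, 100};
    for (int i = 0; i < 5; ++i)
        std::cout << "P = " << nodes[i]
                  << "  speedup = " << speedup(nodes[i], r) << std::endl;
    return 0;
}

For $P = 20$ this gives a speedup of about 16.8, while for $P = 100$ it is only about 50, which is why this level of parallelization is best kept to roughly 20 nodes per simulation.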


We found, however, that the main bottleneck in scaling to a large number of nodes is the disk I/O needed at the beginning and end of each job (a simulation typically takes several weeks and thus has to be split into many separate jobs, requiring us to temporarily store the configurations on disk). Due to limitations of the parallel file system of the Hitachi SR2201 we used, this I/O time grows faster than the data size as we increase the number of nodes (250 s for 256 nodes, 780 s for 512 nodes, and 2970 s for 1024 nodes for a typical quantum Monte Carlo simulation). On this machine the CPU time for a large job is unfortunately limited to one hour per node, so a typical program scales well only up to 256 nodes, where the disk I/O overhead is of the order of 15%.

4 The Alea Application Framework and Class Library

Monte Carlo simulations typically need a large amount of CPU time but, fortunately, parallelize well. With the application framework and class library we have developed, scientists with no experience in parallel computing can use the power of massively parallel computers for their Monte Carlo simulations. The Alea library (Latin for "dice"), written in C++, automatically parallelizes many types of Monte Carlo simulation at the two generic levels discussed above.

4.1 Classes for Monte Carlo Simulations

From a user's point of view, the library consists of three main classes, from which the classes specific to a particular Monte Carlo simulation are derived (see the sketch below):

– A simulation class handles the parallelization of the different runs and the merging of their results. The user only has to override a work() member function that specifies the amount of work still to be done on this simulation. This value is used for load balancing and serves as a termination criterion once it reaches zero.
– A run class implements the actual Monte Carlo simulation. The following member functions have to be implemented for this class:
  – a constructor to start a new run;
  – functions to access data in a dump, as discussed in Sec. 4.2 below;
  – a criterion is_thermalized() that tells whether the run is in equilibrium;
  – a function do_step() that performs one Monte Carlo step and measurement.
– A container class, measurements, collects all the Monte Carlo statistics.

This is all the information the library needs to know about a specific Monte Carlo simulation. The library takes care of parameter input and startup, hardware-independent checkpointing (see Sec. 4.2), parallelization (see Sec. 4.3) and dynamic load balancing, and the evaluation and output of results.
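As an illustration, here is a minimal sketch of what the user's derived classes might look like. The member-function names follow the description above, while the base-class signatures and all model details are our own assumptions, not the actual Alea interface:

// Minimal stand-ins for the library's base classes (signatures assumed
// from the description in the text, not taken from the Alea headers):
class simulation {
public:
    virtual ~simulation() {}
    virtual double work() const = 0;   // remaining work; 0 terminates
};

class run {
public:
    virtual ~run() {}
    virtual bool is_thermalized() const = 0;
    virtual void do_step() = 0;        // one Monte Carlo step + measurement
};

// Hypothetical user classes for a simple spin model:
class my_simulation : public simulation {
public:
    my_simulation(long todo) : sweeps_left(todo) {}
    virtual double work() const { return double(sweeps_left); }
private:
    long sweeps_left;
};

class my_run : public run {
public:
    my_run() : steps(0) { /* set up a random initial configuration */ }
    virtual bool is_thermalized() const { return steps > 1000; }
    virtual void do_step() { /* one sweep plus measurement */ ++steps; }
private:
    long steps;
};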

4.2 Object Serialization

An object serialization scheme was introduced to enable the reading and writing of objects from and to data files, as well as the transmission of objects to remote nodes. Unlike Java, C++ has no built-in object serialization. The C++ "iostream" library is not suitable either, since it is designed for text output and does not guarantee that an object can be recreated from its textual representation.


[Figure 1: a master node runs a master_scheduler holding remote_simulation and remote_run proxy objects; slave nodes run slave_schedulers holding the actual user_simulation and user_run objects, each derived from abstract_simulation/simulation and abstract_run/run.]

Fig. 1. Illustration of the parallelization and remote object creation. Each box represents an object. The labels of each object, from top to bottom, represent the class hierarchy from base class to most derived class. Solid lines denote the creation of local objects, which in the case of proxies send a message to a slave node to request the creation of the actual object. Illustrated is the creation of two simulations, which subsequently create three runs.

However, our implementation of object serialization is modelled after the "iostream" library. Objects can be written to odump streams using operator<< and read from idump streams using operator>>. Extensions to new classes are made just as in the iostream library, by overloading these operators. In particular, we have implemented two important types of such "dumps":

– xdr_odump and xdr_idump use the XDR format to write the data in a hardware-independent binary format. They are used for hardware-independent checkpoint files and for storing the results of simulations.
– mp_odump and mp_idump use an underlying message-passing library to send objects from one node to another. These classes allow easy parallelization using distributed objects.
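Extending the mechanism to a new class then looks exactly like extending iostream. A minimal sketch follows; the dump interfaces are assumed from the description above, and spin_configuration is a hypothetical user class:

#include <vector>

// Minimal stand-ins for the library's dump base classes (assumed
// interface; the real odump/idump support all built-in types):
class odump {
public:
    odump& operator<<(int x) { /* write x to the dump */ return *this; }
};
class idump {
public:
    idump& operator>>(int& x) { /* read x from the dump */ return *this; }
};

// Hypothetical user class made serializable by overloading the
// operators, exactly as one would extend iostream:
struct spin_configuration {
    int length;
    std::vector<int> spins;
};

odump& operator<<(odump& d, const spin_configuration& c)
{
    d << c.length;
    for (int i = 0; i < c.length; ++i) d << c.spins[i];
    return d;
}

idump& operator>>(idump& d, spin_configuration& c)
{
    d >> c.length;
    c.spins.resize(c.length);
    for (int i = 0; i < c.length; ++i) d >> c.spins[i];
    return d;
}

Because the same operators serve the XDR dumps and the message-passing dumps alike, a class made serializable once can be checkpointed and sent across nodes without further work.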

4.3 Parallelization Using Distributed Objects

Simulations are parallelized as discussed in Sec. 3. The master node determines how much work needs to be done by each simulation and distributes the simulations across the available nodes accordingly. It then creates the simulation objects remotely, which in turn create one or several run objects. Remote object creation is done by creating a proxy object (called either remote_simulation or remote_run), which sends a message to the remote node requesting the creation of the actual object. Remote method invocation similarly invokes the method of the proxy object, which then sends a message to the remote node requesting the invocation of the method, and possibly waits for a return value.


Figure 1 shows this class hierarchy and method invocation. The scheme is greatly simplified compared to general distributed object systems, since on each node there exists at most one simulation and one run object. Slave nodes check for messages and perform the requested method invocations. If no message needs to be processed, they call the do_step method of the local Monte Carlo run object to perform Monte Carlo steps.
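The proxy side of this scheme can be sketched as follows; the message tags and the send/receive helpers are hypothetical stand-ins for the library's mp_odump/mp_idump layer, not its actual interface:

// Sketch of the proxy idea behind remote_run: the proxy implements
// the abstract interface but forwards each call as a message to the
// slave node that owns the real run object.
enum message_tag { CREATE_RUN, DO_STEP, IS_THERMALIZED };

void send_message(int node, message_tag tag);  // assumed helper
bool receive_bool(int node);                   // assumed helper

class abstract_run {
public:
    virtual ~abstract_run() {}
    virtual void do_step() = 0;
    virtual bool is_thermalized() const = 0;
};

class remote_run : public abstract_run {
public:
    remote_run(int n) : node(n) { send_message(node, CREATE_RUN); }
    virtual void do_step() { send_message(node, DO_STEP); }
    virtual bool is_thermalized() const {
        send_message(node, IS_THERMALIZED);
        return receive_bool(node);             // wait for the return value
    }
private:
    int node;
};

A fire-and-forget call like do_step() needs no reply, whereas is_thermalized() blocks until the slave returns a value, which is why the slaves interleave message checks with local Monte Carlo steps.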

4.4 Support Classes

In addition to the Monte Carlo simulation classes, the library provides a variety of useful classes for parameter input, for Monte Carlo measurements and their error analysis, for the analysis of time series, and for the generation of plots.

4.5 Failure Tolerance

The library was designed to be tolerant to failures of single workstations in a workstation cluster when using PVM as the underlying message-passing library. This is important because it allows us to perform calculations on workstations in environments where other users reboot machines. Failure recovery is implemented by periodic checkpointing and by automatically restarting failed simulations from the latest checkpoint. Since the C++ exception mechanism is not yet fully supported by compilers, the implementation of this feature of the library has been deferred to a future release.

5 Object-Oriented Techniques for Monte Carlo Simulations

The first version of the above Monte Carlo library was developed in 1994 and 1995. At that time, the performance of C++ for scientific simulations was not good enough to allow the use of object-oriented techniques in the CPU-intensive parts of the actual Monte Carlo simulations; these were coded in C-style C++ or even in FORTRAN. Since then, the template mechanism has been extended and is supported by more compilers. The use of "light objects" [4] and expression templates [4, 5] allows a higher level of abstraction and the use of object-oriented design without any abstraction penalty in performance. In the past year we have, with good success, made extensive use of such template techniques to develop generic, but still optimal, algorithms for a variety of condensed matter problems, and have used these programs successfully for a large number of simulations. This is, to our knowledge, one of the first applications of such techniques to large high-performance numerical calculations.
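To illustrate the idea behind expression templates (our own minimal example, not code from the library): an arithmetic expression is represented by a light compile-time object holding only references, so that a whole vector expression collapses into a single loop without temporaries:

#include <cstddef>

// A "light object" representing the sum of two vector-like operands.
class vec;

template <class L, class R>
struct sum_expr {
    const L& l; const R& r;
    sum_expr(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

class vec {
public:
    explicit vec(std::size_t n) : n_(n), data_(new double[n]) {}
    ~vec() { delete[] data_; }
    double  operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i)       { return data_[i]; }
    std::size_t size() const { return n_; }
    // assigning any expression triggers one fused loop over all elements
    template <class E>
    vec& operator=(const E& e) {
        for (std::size_t i = 0; i < n_; ++i) data_[i] = e[i];
        return *this;
    }
private:
    vec(const vec&);            // sketch: forbid copying
    std::size_t n_;
    double* data_;
};

inline sum_expr<vec, vec> operator+(const vec& a, const vec& b)
{ return sum_expr<vec, vec>(a, b); }

template <class L, class R>
sum_expr<sum_expr<L, R>, vec> operator+(const sum_expr<L, R>& a, const vec& b)
{ return sum_expr<sum_expr<L, R>, vec>(a, b); }

With this pattern, d = a + b + c compiles into a single loop over the elements, exactly what a hand-written C or FORTRAN loop would produce, while the source code keeps its high-level form.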

5.1 Generic Simulations Using Templates

Simulations of condensed matter problems have to be carried out for a variety of crystal lattice structures. It is thus advantageous to write a simulation program for general lattice structures. Usually this is done by storing the lattice as a two-dimensional array of all neighbors of all sites. Using modern C++ compilers, we can instead describe the lattice by a class, shown here for a chain of sites:

class chain_lattice {
public:
  typedef unsigned int site_number;
  chain_lattice(site_number l) : length(l) {}
  site_number volume() { return length; }
  site_number neighbors(site_number site) { return 2; }
  inline site_number neighbor(site_number site, site_number nb)
  {
    return (nb ? (site==0 ? length-1 : site-1)
               : (site==length-1 ? 0 : site+1));
  }
private:
  site_number length;
};

This class can then be used as a template parameter of the run class:

template <class LATTICE, class MODEL>
class user_run : public run {
private:
  LATTICE lattice;
  MODEL model;
public:
  virtual void do_step();
  ...
};

In the CPU-intensive part (the function do_step()), most of the time is spent evaluating an interaction energy or cost function like:

for (typename LATTICE::site_number i=0; i<lattice.volume(); ++i)
  ...
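For concreteness, here is a sketch of how such a loop might look inside user_run<LATTICE, MODEL>::do_step(); the model's interaction() member and the ising_model class below are hypothetical illustrations, not the actual application code:

// Generic energy evaluation: only the lattice interface shown above is
// used, so the same loop works for any lattice class. interaction() is
// a hypothetical member of the (assumed) model class.
double energy = 0;
for (typename LATTICE::site_number i = 0; i < lattice.volume(); ++i)
    for (typename LATTICE::site_number n = 0; n < lattice.neighbors(i); ++n)
        energy += model.interaction(i, lattice.neighbor(i, n));

Because chain_lattice's member functions are small and known at compile time, the compiler can inline the neighbor() calls, so this generic loop runs as fast as one hand-written for a chain; instantiating the program for a concrete case is then as simple as user_run<chain_lattice, ising_model>.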