
Kinetic Monte Carlo on Parallel Computers

Ankita Biswas
Ruhr-Universität Bochum, Universitätsstraße 150, 44801 Bochum
(Dated: February 7, 2018)

Kinetic Monte Carlo (kMC) has been used considerably as a powerful tool to study non-equilibrium processes. It has also been used for simulating the time evolution and equilibrium behaviour of large systems of particles. The significance of kMC lies in the fact that it can be used as a scale-bridging tool between the atomic and continuum scales. Despite being inherently serial with respect to time evolution, kMC simulations of large systems can be performed on parallel computers. Here we present a general study of the applications and implementations of kinetic Monte Carlo simulations using parallel computing techniques.

I. INTRODUCTION

The results gained by interpreting the outcome of any simulation can only be as reliable as the theory at its basis that solves the quantum-mechanical problem of the system of electrons and nuclei. Hence any simulation that aims at predictive power must start from the subatomic scale of the electronic many-particle problem. One of the foremost methods for the dynamical evolution of systems in materials science is Molecular Dynamics (MD). If the interatomic potential defining the interactions between the particles that constitute a system is accurate, then MD will quite accurately predict the dynamical evolution of the system by integrating the classical equations of motion forward in time. But the timestep required to resolve atomic vibrations is very small (about 10^-15 s), so the total simulated time is limited to roughly a microsecond. In reality, though, the phenomena and properties of interest in a material are manifested at much larger time scales. This is where kMC comes into play: it treats state-to-state transitions directly and reaches much longer time scales, of a second or well beyond. Figure 1 shows how kMC acts as a bridge across the spatiotemporal gap between atomistic methods and meso- or macroscale methods.

One way to accelerate large-scale kMC simulations is to perform the calculations on parallel computers, especially for simulations that require high computational efficiency due to more complicated growth models. In kMC simulations the time for an event depends on the system configuration; in particular, the time for an event executed by one processor may be affected by events in neighbouring processors. As a result, a local time condition is not sufficient to guarantee an accurate parallel evolution. Before turning to the different methods of implementing kMC simulations, let us look at the range of applications of kMC, to understand its importance in modeling real physical systems and phenomena. Kinetic Monte Carlo has been used to study surface diffusion, dislocation mobility [13], surface growth [14], vacancy diffusion in alloys [15], coarsening of domain evolution, defect mobility and clustering in ion- or neutron-irradiated solids (including damage accumulation and amorphization/recrystallization models), and viscoelasticity of physically crosslinked networks [16].

II. METHODS

A. The Basics

Suppose a system of N species can undergo a set of M possible transitions, with each transition k associated with a rate p_k. The probability of the system undergoing transition k is then proportional to p_k. The total activity of the system is defined as the sum of the rates of all possible events that the system could undergo at any one time, i.e.,

W(t) = Σ_{k=1}^{M} p_k(t)    (1)

FIG. 1. The different methods of materials simulations at different length and time scales [1].

where M is the total number of events that could take place in the system. The probability of a specific event taking place at time t is its rate divided by the sum of rates,

ρ(l) = p_l / W(t)    (2)

It is a normalized probability, since summing p_l over all possible events gives the activity W(t) from Eq. 1. Suppose the events are ordered in some way 1, 2, ..., M. To choose an event, a random number ξ is drawn uniformly on (0,1) and the event m satisfying

(1/W) Σ_{k=1}^{m-1} p_k < ξ ≤ (1/W) Σ_{k=1}^{m} p_k

is selected. A kinetic Monte Carlo calculation can then be carried out as follows:

• At a time t, determine which events could happen in the system and add up their rates, determining W.
• Calculate the probability of each possible event by dividing its rate by W.
• Create a list of the events, weighted by their probabilities.
• Pick a random number ξ on (0,1) and choose which event will occur.
• Enact the chosen event, changing the configuration of the system as required.
• Advance time by δt: t → t + δt.
• Repeat these steps.

Thus kMC is highly dependent on having a good list of the possible events of the system under study and accurate rates for them. If a critical event is left out, or if a supplied rate is incorrect, the kMC results are not likely to be a good model of the dynamical evolution of the system. However, for systems whose dynamics are determined by thermally activated events, kMC is often the only way to proceed. There are many variants of kMC algorithms. We can start exploring with the basic rejection kMC (rkMC) and rejection-free kMC (rfkMC) [2].

1. Rejection-free kMC

This variant of kMC is usually called the rfkMC algorithm. It is used for simulating the time evolution of a system in which the rates of the occurring processes are known. It can be written, for instance, as follows:

1. The time is set to t = 0.
2. An initial state k is chosen.
3. A list of all N_k possible transition rates p_ki in the system, from state k into a generic state i, is formed. States not communicating with k have p_ki = 0.
4. The cumulative function P_{k,i} = Σ_{j=1}^{i} p_{kj} is calculated for i = 1, 2, ..., N_k. The total rate is W = P_{k,N_k}.
5. A uniform random number u ∈ (0,1] is chosen.
6. The event i to carry out is found from P_{k,i-1} < uW ≤ P_{k,i}.
7. Event i is carried out and the current state updated (k → i).
8. A new uniform random number u' ∈ (0,1] is selected.
9. Time is updated as t = t + δt with δt = W^{-1} ln(1/u').
10. The procedure is then reiterated from step 3.

This algorithm is known variously as the residence-time algorithm, the n-fold way, or the Bortz-Kalos-Lebowitz (BKL) algorithm. An important point is that the timestep is a function of the probability that none of the events i occurred. Here, an infrequent-event system [8] is considered, caught in a particular energy basin such that, if it stays there for a long time (relative to the time of one vibrational period), it forgets how it got there. For each possible escape pathway to an adjacent basin there is then a rate constant characterizing the probability, per unit time, of escaping to that state, and these rate constants are independent of what state preceded the present one.

2. Rejection kMC

This variant of kMC facilitates easier handling of data and faster computation per step, since the time-consuming evaluation of all the p_ki is not needed. The time advanced at each step is, however, smaller than for rfkMC. Whether this is an advantage or a disadvantage typically depends on the system concerned and its properties. The steps for performing this method can be written as follows:

1. The time is set to t = 0.
2. An initial state k is chosen.
3. The number N_k of all possible transitions from state k is obtained.
4. A candidate event i is found by uniformly sampling from the N_k transitions above.
5. The event is accepted with probability f_ki = p_ki / p_0, where p_0 is a suitable upper bound for p_ki.
6. If accepted, event i is carried out and the current state is updated (k → i).
7. A new uniform random number u' ∈ (0,1] is selected.
8. Time is updated as t = t + δt with δt = (N_k p_0)^{-1} ln(1/u').
9. The procedure is reiterated from step 3.

This algorithm is usually called a standard algorithm.
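As a concrete illustration, one rejection-free (BKL) step as listed above can be sketched in Python. This is a minimal sketch under our own naming (`rfkmc_step` is not from any cited code): it selects the event from the cumulative rates and advances time by the exponential increment δt = W^{-1} ln(1/u').

```python
import math
import random

def rfkmc_step(rates, t, rng=random):
    """One rejection-free (BKL) kMC step.

    rates: list of transition rates p_ki out of the current state.
    Returns (chosen_event_index, new_time).
    """
    # Step 4: cumulative function P_{k,i} and total rate W.
    cumulative = []
    total = 0.0
    for p in rates:
        total += p
        cumulative.append(total)
    # Steps 5-6: draw u in (0,1] and locate event i with
    # P_{k,i-1} < u*W <= P_{k,i}.
    u = 1.0 - rng.random()          # random() is in [0,1); this gives (0,1]
    target = u * total
    event = next(i for i, c in enumerate(cumulative) if target <= c)
    # Steps 8-9: advance time by an exponentially distributed increment.
    u2 = 1.0 - rng.random()
    dt = math.log(1.0 / u2) / total
    return event, t + dt
```

The rejection variant would instead sample one of the N_k events uniformly, accept it with probability p_ki/p_0, and advance time by (N_k p_0)^{-1} ln(1/u').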

B. Parallelization

A parallel kMC algorithm allows events to be performed simultaneously on multiple processors, thus saving time. kMC is inherently a serial algorithm, in that the selection of a single event depends, in principle, on the current set of probabilities of all events. Thus, to enable parallelism, some approximation must be made.

1. The Stochastic Simulation Algorithm (SSA)

Gillespie developed the stochastic simulation algorithm (SSA) via Monte Carlo methods in 1976 [6, 7]. The original SSA was implemented at a computational cost of O(N) per reaction, i.e. linear in N, the number of reactions in the domain concerned. Since the time increment per reaction also tends to decrease with increasing N, such scaling limits the size of domains that can be efficiently simulated. An alternative implementation of the SSA was proposed which scales as O(log2 N), enabling much larger domains to be modeled [5].

The algorithms discussed in this context solve the generalized problem of random variate generation (RVG) from a dynamic discrete probability distribution. The generated variate determines the kind of event taking place in the next time increment. Efficient RVG is capable of modelling quite a diverse range of phenomena. A particular RVG algorithm known as Composition and Rejection, when applied to the SSA, is well suited to the simulation of large domains, because its scaling is O(1): the computational cost to perform a reaction is constant, independent of N. The following are some primary kMC methods that can be parallelized.

The Linear Time Algorithm

A step of the original SSA, the Gillespie stochastic simulation algorithm, scales as O(N) in the number of reactions N [5]. Here a "propensity" is computed for each reaction, proportional to its probability of occurrence relative to the other reactions. The propensity for each reaction and the sum of propensities over all N reactions are pre-computed.

The Logarithmic Time Algorithm

A step of the SSA-GB, the Gibson/Bruck stochastic simulation algorithm, scales as O(log2 N) [5]. It differs from the original SSA in that it implements a binary-tree search to pick a reaction.

The Constant Time Algorithm

A step of the SSA-CR, the Composition and Rejection stochastic simulation algorithm, scales as O(1), independent of the number of reactions N [5].

The algorithm-complexity correspondence is summarized in Table 1.

Table 1. The algorithms and their respective complexities, scaled with respect to the number of reactions N involved.

Algorithm    Complexity
L(∇A)        O(N)
log(∇A)      O(log2 N)
C(∇A)        O(1)

• L(∇A) → the Linear Time Algorithm
• log(∇A) → the Logarithmic Time Algorithm
• C(∇A) → the Constant Time Algorithm.
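The logarithmic-time idea can be illustrated with a binary sum tree over the propensities: both updating a propensity and selecting the next reaction walk a single root-to-leaf path, giving O(log2 N) cost per reaction. This is an illustrative sketch under assumed names (`PropensityTree` and its interface are ours, not an API from [5]):

```python
class PropensityTree:
    """Binary sum tree over reaction propensities, giving O(log2 N)
    update and selection, in the spirit of the logarithmic-time SSA step."""

    def __init__(self, propensities):
        n = 1
        while n < len(propensities):
            n *= 2
        self.n = n                       # number of leaf slots (power of two)
        self.tree = [0.0] * (2 * n)      # tree[1] is the root (total rate W)
        for i, p in enumerate(propensities):
            self.tree[n + i] = p
        for i in range(n - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def total(self):
        return self.tree[1]

    def update(self, i, p):
        """Change the propensity of reaction i; O(log N)."""
        i += self.n
        self.tree[i] = p
        i //= 2
        while i:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def select(self, u):
        """Pick the reaction hit by target u*W by descending the tree."""
        target = u * self.total()
        i = 1
        while i < self.n:
            i *= 2                       # go to left child
            if target > self.tree[i]:
                target -= self.tree[i]   # skip left subtree, go right
                i += 1
        return i - self.n
```

The constant-time (Composition and Rejection) variant instead groups reactions into bands of similar propensity, samples a band by composition, and then accepts or rejects uniformly within it.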

2. The Synchronous Parallel kMC (SPKMC)

In kMC, the cumulative event rate W grows as O(N), and since the time scale of the simulation is determined by W^{-1}, the time step becomes progressively smaller in larger systems, as seen above. To overcome this problem, kMC simulations can be parallelized in a divide-and-conquer (DC) fashion, using an asynchronous formulation and graph colouring to avoid conflicting events [3]. This is achieved by introducing a local transition-state (LTS) approximation, in which events outside a cut-off distance are assumed to be statistically independent. The simulated system can then be subdivided into spatially localized domains, and local events in a particular domain can be sampled independently of those in the other domains. The SPKMC algorithm partitions the three-dimensional space into spatially localized, mutually exclusive domains.

FIG. 2. An example of two-dimensional domain decomposition into an array of 3 × 2 domains in the x and y directions [3].

In SPKMC, a cumulative rate W^(d) is computed for each non-overlapping domain d, summed only over the local events within that domain: denoting the rate of the eth event in domain d as W_e^(d), we have W^(d) = Σ_e W_e^(d), and the global maximum rate is W_max = max_d W^(d). The SPKMC simulation can then be performed concurrently in all the domains, similarly to the sequential kMC algorithm, except that each domain also generates a null event with rate W_0^(d) = W_max − W^(d). A globally synchronous time evolution is thus achieved, with the time incremented by δt = −ln(ξ1)/W_max at each kMC step, where ξ1 is a uniform random number in [0,1] common to all the domains. By keeping the size of the domains constant and increasing their number, W_max remains nearly constant, so δt = O(1) instead of O(1/N) as in the conventional kMC method [3].
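The null-event padding can be sketched as follows, assuming per-domain event-rate lists and the LTS independence approximation; the function and variable names are ours, not from [3]. Each domain draws from an effective total rate W_max, so that all domains advance by the same globally synchronous time increment.

```python
import math
import random

def spkmc_step(domain_rates, t, rng=random):
    """One synchronous SPKMC step over independent domains (sketch).

    domain_rates: list of per-domain event-rate lists; events in different
    domains are assumed statistically independent (LTS approximation).
    Returns (chosen, new_time), where chosen[d] is the selected event index
    in domain d, or None for a null event.
    """
    totals = [sum(rates) for rates in domain_rates]
    w_max = max(totals)
    chosen = []
    for rates, w_d in zip(domain_rates, totals):
        # Targets in [w_d, w_max) correspond to the null event,
        # whose rate is W_0^(d) = W_max - W^(d).
        target = rng.random() * w_max
        if target >= w_d:
            chosen.append(None)          # null event: nothing happens
            continue
        acc = 0.0
        for i, p in enumerate(rates):
            acc += p
            if target < acc:
                chosen.append(i)
                break
    # Globally synchronous time increment, common to all domains.
    xi = 1.0 - rng.random()
    dt = -math.log(xi) / w_max
    return chosen, t + dt
```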

3. The Synchronous Sublattice (SL) algorithm

In this algorithm, different parts of the system are assigned to different processors via spatial decomposition. As can be seen in Fig. 3, each processor's domain is further divided into sublattices, to avoid conflicts between processors due to the synchronous nature of the algorithm. At the beginning of a cycle each processor's local time is initialized to zero. One of the sublattices is then randomly selected, so that all processors operate on the same sublattice during a given cycle. Each processor then simultaneously and independently carries out kMC simulations in the selected sublattice until the time of the next event exceeds the cycle length T. As in the usual serial kMC, each event is carried out with time increment δt = −ln(ξ1)/W, where ξ1 is a uniform random number between 0 and 1 and W is the total kMC event rate. Each processor then communicates any necessary changes (boundary events) to its neighboring processors, updates its event rates, and moves on to the next cycle with a new randomly chosen sublattice. By picking the cycle length T less than or equal to the average time of the fastest possible activated event, one obtains results identical to those of serial kMC except for very small sublattice sizes [4].

FIG. 3. Two possible methods of spatial and sublattice decomposition: (a) square sublattice decomposition, (b) strip sublattice decomposition. Solid lines correspond to processor domains; dashed lines indicate the sublattice decomposition; dotted lines in (a) and (b) indicate the ghost region surrounding the central processor [4].
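The cycle structure of the SL algorithm might be sketched as below. `ToyProcessor` is a hypothetical stand-in for a processor domain (none of these names come from [4]), and the loop over processors would execute in parallel in a real code.

```python
import math
import random

class ToyProcessor:
    """Minimal stand-in for one processor's domain (illustrative only)."""
    num_sublattices = 2

    def __init__(self, rate):
        self.rate = rate             # total event rate W in this domain
        self.local_time = 0.0
        self.events_done = 0

    def sample_waiting_time(self, sublattice):
        # dt = -ln(xi)/W with xi uniform in (0,1]
        return -math.log(1.0 - random.random()) / self.rate

    def execute_event(self, sublattice):
        self.events_done += 1

    def exchange_boundaries(self):
        pass                         # boundary-event communication goes here

def sl_cycle(processors, T):
    """One cycle of the synchronous sublattice algorithm (schematic)."""
    # All processors operate on the same randomly chosen sublattice.
    sub = random.randrange(processors[0].num_sublattices)
    for proc in processors:          # runs in parallel in a real implementation
        proc.local_time = 0.0
        while True:
            dt = proc.sample_waiting_time(sublattice=sub)
            if proc.local_time + dt > T:
                break                # next event would cross the cycle boundary
            proc.local_time += dt
            proc.execute_event(sublattice=sub)
    for proc in processors:          # synchronize at the end of the cycle
        proc.exchange_boundaries()
```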

4. Other Methods

The approximate algorithm described above as a parallelization scheme for kMC has proven very promising: it resolves boundary inconsistencies in a straightforward fashion and requires no global communications, in contrast to synchronous relaxation schemes [10]. In another approach, a spatial decomposition of the Markov operator defining the kinetic Monte Carlo algorithm into a hierarchy of operators has been developed [10]. The decomposition is tailored to the processor architecture. Based on this operator decomposition, Fractional Step Approximation schemes were formulated by employing the Trotter product formula. These approximating schemes in turn determine the communication schedule between processors, through the sequential application of the operators in the decomposition, and the time step employed in the particular fractional step scheme. It has been shown that the Fractional Step kMC schemes allow a serial kMC simulation (called a kernel) to be run independently on each processor in each fractional time-step window. Processor communication is also straightforward at the end of each fractional time-step, with no global communications or rollbacks involved.

III. CONCLUSION

Any parallel kMC (pkMC) algorithm should rigorously solve the same master equation as the sequential (serial) methods. This does not necessarily imply that both approaches give the same sequence of events, but that, on average, both give the same kinetic evolution, resulting in the same statistical distributions as a function of time [9]. Amongst the parallel algorithms tested, the approximate algorithm had the highest parallel efficiency [10].

1. Parallelization Issues

There have been a number of attempts to design parallel kMC algorithms, with various levels of success. All these parallel versions are based, in one way or another, on a domain decomposition of the total simulation area among parallel processors. However, the existence of a global time in the kMC algorithm prevents the parallel processors from working independently. In practice, most parallel kMC algorithms let each processor run independently for some time interval that is small with respect to the time scale of the whole simulation, but still long enough to comprise a large number of events. After each time interval the processors are synchronised and exchange data about the actual configurations of their neighbours. This kind of communication among processors must be done very frequently during program execution. Hence the parallel efficiency depends strongly on the latency and bandwidth of the communication network [11].

The time evolution of the processors may proceed quite inhomogeneously, and idle processors may have to wait a long time until other, active processors have caught up. Events near the boundary between processors may give rise to bigger problems: such events may turn out to be impossible after synchronisation, because the neighbouring processor may have modified the boundary region prior to the execution of the event in question. Had the actual state of the neighbouring processor been known, the event would have occurred with a different rate, or perhaps not at all. In this case a 'roll-back' is required, i.e., the simulation must be set back to the event prior to the occurrence of the conflicting boundary event, and the later simulation steps must be discarded. While such roll-backs are manageable in principle, they can lead to a significant decrease in the efficiency of a parallel kMC algorithm. Still, it is assumed that these problems can be controlled by choosing a suitable synchronisation interval. This is the idea behind the synchronous relaxation algorithm [11]. Several variants of parallel algorithms exist that trade a rigorous treatment of the basic simulation rules for higher efficiency.

In the semi-rigorous synchronous sublattice algorithm [4], the first, coarse domain decomposition of the simulation area is further divided into sublattices. Then, in each time interval between synchronisations, events are simulated alternately within only one of the sublattices. This introduces an arbitrary rule that additionally restricts the possible processes, and thus may compromise the validity of the results obtained. It does, however, allow one to minimise or even completely eliminate conflicting boundary events. Consequently, roll-backs, which are bad for parallel efficiency, can be reduced or avoided. Even then, the scalability of parallel kMC simulations for typical tasks is practically limited to a few processors with the currently available parallel algorithms [11]. In this context, a parallel lattice kMC simulation including long-range particle-particle interactions, based on the BKL method and running on a GPGPU, has been proposed, which resolves boundary conflicts non-rigorously using the sublattice approach. The inclusion of particle-particle interactions is computationally expensive; on a GPGPU, the bottleneck is caused in particular by copying data back and forth between the CPU and the GPU. This can be avoided by performing the entire procedure on the GPU [12].

[1] Dierk Raabe, Scales in plasticity modeling. http://www.dierk-raabe.com/multiscale-modeling/
[2] A. B. Bortz, M. H. Kalos, J. L. Lebowitz, A new algorithm for Monte Carlo simulation of Ising spin systems. Journal of Computational Physics 17 (1975) 10-18
[3] Hye Suk Byun, Mohamed Y. El-Naggar, Rajiv K. Kalia, Aiichiro Nakano, Priya Vashishta, A derivation and scalable implementation of the synchronous parallel kinetic Monte Carlo method for simulating long-time dynamics. Computer Physics Communications 219 (2017) 246-254
[4] Y. Shim, Jacques G. Amar, Recent advances in parallel kinetic Monte Carlo: synchronous sublattice algorithm (2005)
[5] Steve Plimpton, Corbett Battaile, Mike Chandross, Liz Holm, Aidan Thompson, Veena Tikare, Greg Wagner, Ed Webb, Xiaowang Zhou, Crossing the Mesoscale No-Man's Land via Parallel Kinetic Monte Carlo. Sandia Report SAND2009-6226, October 2009
[6] D. T. Gillespie, Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81(25) (1977) 2340-2361
[7] D. T. Gillespie, A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of Computational Physics 22 (1976) 403-434
[8] Arthur F. Voter, Introduction to the Kinetic Monte Carlo Method. Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
[9] E. Martinez, J. Marian, M. H. Kalos, J. M. Perlado, Synchronous parallel kinetic Monte Carlo for continuum diffusion-reaction systems. Journal of Computational Physics 227(8) (2008) 3804-3823
[10] Giorgos Arampatzis, Markos A. Katsoulakis, Petr Plecháč, Michela Taufer, Lifan Xu, Hierarchical fractional-step approximations and parallel kinetic Monte Carlo algorithms. Journal of Computational Physics 231(23) (2012) 7795-7814
[11] Peter Kratzer, Monte Carlo and Kinetic Monte Carlo Methods - A Tutorial. In: Multiscale Simulation Methods in Molecular Sciences, J. Grotendorst, N. Attig, S. Blügel, D. Marx (Eds.), Institute for Advanced Simulation, Forschungszentrum Jülich, NIC Series, Vol. 42, ISBN 978-3-9810843-8-2, pp. 51-76, 2009
[12] N. J. van der Kaap, L. J. A. Koster, Massively parallel kinetic Monte Carlo simulations of charge carrier transport in organic semiconductors. Journal of Computational Physics 307 (2016) 321-332
[13] Wei Cai, Vasily V. Bulatov, João F. Justo, Ali S. Argon, Sidney Yip, Intrinsic Mobility of a Dissociated Dislocation in Silicon. Physical Review Letters 84(15) (2000)
[14] B. Meng, W. H. Weinberg, Dynamical Monte Carlo studies of molecular beam epitaxial growth models: interfacial scaling and morphology. Surface Science 364 (1996) 151-163
[15] W. M. Young, E. W. Elcock, Monte Carlo studies of vacancy migration in binary ordered alloys: I. Proceedings of the Physical Society 89 (1966)
[16] Stephan A. Baeurle, Takao Usami, Andrei A. Gusev, A new multiscale modeling approach for the prediction of mechanical properties of polymer-based nanomaterials. Polymer 47 (2006) 8604-8617