From Massively Parallel Algorithms and Fluctuating Time Horizons to Non-equilibrium Surface Growth

G. Korniss,^1 Z. Toroczkai,^{2,3} M. A. Novotny,^1 and P. A. Rikvold^{1,4}

arXiv:cond-mat/9909114v2 [cond-mat.stat-mech] 1 Feb 2000

^1 Supercomputer Computations Research Institute, Florida State University, Tallahassee, Florida 32306-4130
^2 Department of Physics, University of Maryland, College Park, MD 20742-4111
^3 Department of Physics, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061-0435
^4 Center for Materials Research and Technology and Department of Physics, Florida State University, Tallahassee, Florida 32306-4350

(January 14, 2000)

We study the asymptotic scaling properties of a massively parallel algorithm for discrete-event simulations where the discrete events are Poisson arrivals. The evolution of the simulated time horizon is analogous to a non-equilibrium surface. Monte Carlo simulations and a coarse-grained approximation indicate that the macroscopic landscape in the steady state is governed by the Edwards-Wilkinson Hamiltonian. Since the efficiency of the algorithm corresponds to the density of local minima in the associated surface, our results imply that the algorithm is asymptotically scalable.

PACS numbers: 89.80.+h, 02.70.Lq, 05.40.-a, 68.35.Ct

To efficiently utilize modern supercomputers requires massively parallel implementations of dynamic algorithms for various physical, chemical, and biological processes. For many of these there are well-known and routinely used serial Monte Carlo (MC) schemes which are based on the realistic assumption that attempts to update the state of the system form a Poisson process. The parallel implementation of these dynamic MC algorithms belongs to the class of parallel discrete-event simulations, which is one of the most challenging areas in parallel computing [1] and has numerous applications not only in the physical sciences, but also in computer science, queueing theory, and economics. For example, in lattice Ising models the discrete events are spin-flip attempts, while in queueing systems they are job arrivals. Since current special- or multi-purpose parallel computers can have 10^4 − 10^5 processing elements (PE) [2], it is essential to understand and estimate the scaling properties of these algorithms.

In this Letter we introduce an approach to investigate the asymptotic scaling properties of an extremely robust parallel scheme [3]. This parallel algorithm is applicable to a wide range of stochastic cellular automata with local dynamics, where the discrete events are Poisson arrivals. Although attempts have been made to estimate its efficiency under some restrictive assumptions [4], the mechanism which ensures the scalability of the algorithm in the “steady state” was never identified. Here we accomplish this by noting that the simulated time horizon is analogous to a growing and fluctuating surface. The local random time increments correspond to the deposition of random amounts of “material” at the local minima of the surface. This correspondence provides a natural ground for cross-disciplinary application of well-known concepts from non-equilibrium surface growth [5] and driven systems [6] to a certain class of massively parallel computational schemes. To estimate the efficiency of this algorithm one must understand the morphology of the surface associated with the simulated time horizon. In particular, the efficiency of this parallel implementation (the fraction of the non-idling processing elements) exactly corresponds to the density of local minima in the surface model. We show that the steady-state behavior of the macroscopic landscape is governed by the Edwards-Wilkinson (EW) Hamiltonian [7], implying that the density of the local minima does not vanish when the number of PEs goes to infinity. This ensures that the simulated time horizon propagates with a non-zero average velocity in the steady state. Thus the algorithm is asymptotically scalable! Further, based on the strong analogy between the evolution of the simulated time horizon and the single-step surface growth model [8], we describe the asymptotic scaling properties of the parallel scheme.

The difficulty of parallel discrete-event simulations is that the discrete events (update attempts) are not synchronized by a global clock. The traditional dynamic MC algorithms were long believed to be inherently serial, i.e., in spin language, the corresponding algorithm could attempt to update only one spin at a time. Lubachevsky nevertheless presented an approach for parallel simulation of these systems [3] without changing the dynamics of the underlying model [9]. Applications of his scheme include modeling of cellular communication networks [12], ballistic particle deposition [13], and metastability and hysteresis in kinetic Ising models [14]. Here we consider the case of one-dimensional systems with only nearest-neighbor interactions (e.g., Glauber spin-flip dynamics) and periodic boundary conditions. We restrict ourselves to the case where each PE carries one site (e.g., one spin) of the underlying system. Non-zero efficiency for this algorithm implies non-zero efficiency for the case where a PE carries a block of sites [3,14].

Also, the stochastic model for the simulated time horizon is an exact mapping only for the synchronous algorithm, where the main simulation cycles are executed in lock-step on each PE. Our goal is to show for this one-site-per-PE synchronous algorithm (which can be regarded as the worst-case scenario) that the efficiency does not go to zero as the number of PEs goes to infinity.

The basic synchronous parallel scheme [3] is as follows. The size of the underlying system, and thus the number of PEs, is L. Update attempts at each site are independent Poisson processes with the same rate, independent of the state of the underlying system. Hence, the random time interval between two successive attempts is exponentially distributed. Without loss of generality we use time increments of mean one (in arbitrary units). The Poisson arrivals correspond to attempted instantaneous changes in the state of the site. In the parallel algorithm each PE generates its own local simulated time for the next update attempt, denoted by τ_i(t), i = 1, 2, ..., L. Here, t is the discrete index of the parallel steps simultaneously performed by each PE. Initially τ_i(0) = 0 for each site, and an initial configuration for the underlying system is specified. The simulated time of the first update attempt is determined by τ_i(1) = τ_i(0) + η_i(0), where {η_i} are independent exponential variables. For parallel steps t ≥ 1, each PE must compare its local simulated time to the local simulated times of its neighbors. If τ_i(t) ≤ min{τ_{i−1}(t), τ_{i+1}(t)}, the change of state of the site is attempted (and decided by the rules of the underlying system), and its local simulated time is incremented by an exponentially distributed random amount, τ_i(t+1) = τ_i(t) + η_i(t). Otherwise, the change of state is not attempted and the local simulated time remains the same, τ_i(t+1) = τ_i(t), i.e., the PE waits (“idles”). The comparison of the nearest-neighbor simulated times, and waiting if necessary, ensures that information passed between PEs does not violate causality. The algorithm is obviously free from deadlock, since at worst the PE with the absolute minimum local time can make progress. After the initial step, the probability density of the simulated time horizon {τ_i(t)} is a continuous measure, so the probability that updates for two nearest-neighbor sites are attempted at the same simulated time is of measure zero.

When modeling the efficiency, we ignore communication times between PEs, since they typically contribute to a scalable overhead. Thus, the efficiency is simply the fraction of non-idling PEs (inherent utilization). This exactly corresponds to the density of local minima of the simulated time horizon. The question naturally arises: Is it possible that the fraction of non-idling PEs goes to zero in the L→∞ limit? This would obviously make the algorithm unscalable and the performance of the actual implementation poor, if not disastrous. To study this problem, we focus on the evolution of the simulated time horizon {τ_i(t)}, which is completely independent of the underlying model. The above
algorithmic steps can be compactly summarized as:

τ_i(t+1) = τ_i(t) + Θ(τ_{i−1}(t) − τ_i(t)) Θ(τ_{i+1}(t) − τ_i(t)) η_i(t) .    (1)

The η_i(t) are drawn from an exponential distribution independently at every time t and site i, and independent of {τ_i(t)}. Here Θ(·) is the Heaviside step-function. This stochastic evolution model is very simple and easily simulated on a serial computer. Alternatively, one can consider the evolution of the local slopes, φ_i = τ_i − τ_{i−1}:

φ_i(t+1) − φ_i(t) = Θ(−φ_i(t)) Θ(φ_{i+1}(t)) η_i(t) − Θ(−φ_{i−1}(t)) Θ(φ_i(t)) η_{i−1}(t) .    (2)

The periodic boundary conditions for {τ_i} impose the constraint Σ_{i=1}^{L} φ_i = 0. In this representation the operator for the density of local minima is

u(t) = (1/L) Σ_{i=1}^{L} Θ(−φ_i(t)) Θ(φ_{i+1}(t)) .    (3)
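Because the update rule (1) and the measurement (3) are so simple, the serial emulation mentioned above takes only a few lines. The following is a minimal Python sketch of such an emulation (the function and variable names are our own illustrative choices; this is not the parallel implementation of Ref. [3]):

```python
import numpy as np

def simulate_time_horizon(L=1000, steps=5000, seed=0):
    """Serial emulation of the virtual time horizon, Eq. (1).

    Sites that are local minima of {tau_i} (periodic boundaries)
    advance by an exponentially distributed increment of mean one.
    Returns the single-run width w^2(t) and the local-minimum
    density u(t) of Eq. (3)."""
    rng = np.random.default_rng(seed)
    tau = np.zeros(L)
    width, density = [], []
    for _ in range(steps):
        left = np.roll(tau, 1)     # tau_{i-1}
        right = np.roll(tau, -1)   # tau_{i+1}
        is_min = (tau <= left) & (tau <= right)        # local minima
        density.append(is_min.mean())                  # u(t), Eq. (3)
        tau[is_min] += rng.exponential(1.0, size=int(is_min.sum()))
        width.append(np.mean((tau - tau.mean()) ** 2)) # w^2(t), one run
    return np.array(width), np.array(density)

if __name__ == "__main__":
    w2, u = simulate_time_horizon()
    print("late-time density of local minima:", u[-1000:].mean())
```

Consistent with what is reported below, u(t) measured this way decreases monotonically and levels off slightly below 1/4 for large L; the single-run width w²(t) feeds into the scaling analysis that follows.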

The process described by (2) is a microscopic realization of biased diffusion [6]. The random amount of material “deposited” at a local minimum τ_i corresponds to the transfer of this amount from φ_{i+1} to φ_i. Since the noise in (2) is independent of {φ_i(t)}, the average can be simply taken. This yields a transparent continuity equation, ⟨φ_i(t+1)⟩ − ⟨φ_i(t)⟩ = −(⟨j_i⟩ − ⟨j_{i−1}⟩), where the average current is ⟨j_i⟩ = −⟨Θ(−φ_i)Θ(φ_{i+1})⟩. Translational invariance implies that ⟨u⟩ = ⟨Θ(−φ_i)Θ(φ_{i+1})⟩, which is the same as the magnitude of the average current or the mean velocity of the surface.

To gain some insight into the evolution of the surface, we perform a naive coarse-graining by taking an ensemble average on (2) and replacing Θ(φ) with a smooth representation. The procedure is independent of the actual form of the representation. We use Θ_κ(φ) = (1/2)[tanh(φ/κ) + 1], so lim_{κ→0} Θ_κ(φ) = Θ(φ). To leading order in φ/κ,

⟨φ_i(t+1)⟩ − ⟨φ_i(t)⟩ = (1/(4κ)) ⟨φ_{i+1} − 2φ_i + φ_{i−1}⟩ − (1/(4κ²)) ⟨φ_i (φ_{i+1} − φ_{i−1})⟩ .    (4)

Taking the naive continuum limit one obtains

∂_t φ̂ = ∂²φ̂/∂x² − λ ∂(φ̂²)/∂x    (5)

for the coarse-grained field φ̂, where, roughly speaking, λ, the coefficient of the nonlinear term, carries the details of the coarse-graining procedure. Equation (5) is the nonlinear biased diffusion or Burgers’ equation [15]. Through φ̂ = ∂τ̂/∂x it is simply related to the deterministic part of the KPZ equation for the coarse-grained surface height fluctuations [16],

∂_t τ̂ = ∂²τ̂/∂x² − λ (∂τ̂/∂x)² .    (6)
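As a quick consistency check (our own spelling-out of the step referred to here, not an equation from the paper), taking ∂/∂x of (6) and identifying φ̂ = ∂τ̂/∂x reproduces (5):

```latex
% differentiate Eq. (6) with respect to x and substitute \hat{\phi} = \partial_x \hat{\tau}
\partial_x\bigl[\partial_t \hat{\tau}\bigr]
  = \partial_x\Bigl[\partial_x^2 \hat{\tau} - \lambda\,(\partial_x \hat{\tau})^2\Bigr]
\quad\Longrightarrow\quad
\partial_t \hat{\phi} = \partial_x^2 \hat{\phi} - \lambda\,\partial_x \hat{\phi}^{\,2},
```

which is Eq. (5).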

To capture the fluctuations one typically extends the above equations with appropriate noise, i.e., conserved for (5), and non-conserved for (6). This implies that the evolution of the simulated time horizon is KPZ-like. In one dimension the steady state of such systems (on coarse-grained length scales) is governed by the EW Hamiltonian [7], H_EW ∝ ∫ dx (∂τ̂/∂x)². This corresponds to a simple random-walk surface, where the coarse-grained slopes are independent in the thermodynamic limit, yielding 1/4 for the density of local minima. Obviously this value will be different for our specific microscopic model. However it cannot vanish: a zero density of local minima in the L→∞ limit would imply that it is zero at all length scales. This would contradict our finding that the steady state at the coarse-grained level is governed by the EW Hamiltonian. The non-zero density of local minima is an important characteristic of this (steady-state) universality class. It ensures that our specific microscopic surface propagates with a non-zero average velocity in the steady state. Models belonging to other universality classes do not necessarily have non-zero extremal-point densities. For example, we can show [17] that the density of local minima vanishes for a one-dimensional curvature-driven random Gaussian surface.

We now present our MC results to test the coarse-graining approach. First we follow the time evolution of the width ⟨w²(t)⟩ = (1/L)⟨Σ_{i=1}^{L} [τ_i(t) − τ̄(t)]²⟩, where τ̄(t) = (1/L) Σ_{i=1}^{L} τ_i(t) and the average ⟨·⟩ is taken over many independent runs. After the early-time regime, which is strongly affected by the intrinsic width, and before saturation, we find ⟨w²(t)⟩ ∼ t^{2β} [Fig. 1(a)]. Although the system exhibits very strong corrections to scaling, for our largest system, L = 10^5, we find β = 0.326 ± 0.005, which includes within two standard errors the exact KPZ exponent, 1/3. In the steady state the width is stationary and ⟨w²⟩ ∼ L^{2α} for large L. Here the corrections to scaling are somewhat smaller than in the earlier regime, and the above scaling is obeyed for L ≥ 10^3 with α = 0.49 ± 0.01. This agrees with the prediction that the long-distance behavior is governed by the EW Hamiltonian, for which α = 1/2. Further, plotting rescaled width ⟨w²⟩/L^{2α} vs rescaled time t/L^z, with z = α/β, confirms dynamic scaling for the intermediate-to-late time crossover [18] [Fig. 1(a) inset]. We also measured the average steady-state structure factor of {τ_i}, finding the expected ∼1/k² behavior for small wave vectors k. The spatial two-point correlations decay linearly for {τ_i} and are short ranged for the slopes, {φ_i}.

To further probe the universal properties of the surface in the steady state we construct the full width distribution, P(w²). The EW class is characterized by a universal scaling function, Φ(x), such that P(w²) = ⟨w²⟩⁻¹ Φ(w²/⟨w²⟩) [19], where

Φ(x) = (π²/3) Σ_{n=1}^{∞} (−1)^{n−1} n² e^{−(π²/6) n² x} .    (7)

Systems with L ≥ 10^3 show convincing data collapse [Fig. 1(b)] onto this exact scaling function.
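For reference, Eq. (7) is straightforward to tabulate numerically. A minimal sketch (the function name and the 200-term truncation are our own choices) against which the scaled histograms of w² can be compared:

```python
import numpy as np

def ew_width_distribution(x, n_terms=200):
    """Universal EW width-distribution scaling function Phi(x), Eq. (7),
    evaluated by truncating the alternating series after n_terms."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    n = np.arange(1, n_terms + 1)
    terms = (-1.0) ** (n - 1) * n**2 * np.exp(-(np.pi**2 / 6.0) * n**2 * x)
    return (np.pi**2 / 3.0) * terms.sum(axis=1)

# Example: tabulate Phi(x) over 0.1 <= x <= 6
x = np.linspace(0.1, 6.0, 60)
print(ew_width_distribution(x).round(3))
```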

FIG. 1. (a) Time evolution and dynamic scaling (inset) for the surface width. (b) Steady-state width distribution (inset: on linear-log scales).
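The exponent β quoted above comes from a power-law fit of ⟨w²(t)⟩ in the intermediate-time window. A minimal sketch of such a fit (the function name and window arguments are our own; the width data would come from averaging many runs of the simulation sketch above):

```python
import numpy as np

def growth_exponent(t, w2, t_min, t_max):
    """Estimate beta from <w^2>(t) ~ t^(2*beta): least-squares slope of
    log<w^2> vs log t, restricted to [t_min, t_max] (i.e., after the
    early-time regime and before saturation)."""
    t, w2 = np.asarray(t, dtype=float), np.asarray(w2, dtype=float)
    window = (t >= t_min) & (t <= t_max) & (w2 > 0)
    slope, _ = np.polyfit(np.log(t[window]), np.log(w2[window]), 1)
    return 0.5 * slope
```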

Next we estimate the efficiency of the algorithm. When simulating the system described by (1) and measuring the average local minimum density ⟨u(t)⟩, we observe that for every system size it monotonically decreases as a function of time and approaches a constant slightly smaller than 1/4 for large systems. Using the close similarity between our model and the single-step solid-on-solid surface-growth model [8], we can understand the scaling behavior for the steady-state average ⟨u⟩ and the fluctuations σ² = ⟨u²⟩ − ⟨u⟩². In the single-step model the height differences (i.e., the local slopes) are restricted to ±1, and the evolution consists of particles of height 2 being deposited at the local minima. The advantage of this model is that it can be mapped onto a hard-core lattice gas for which the steady-state probability distribution of the configurations is known exactly [8,20]. This enables one to find arbitrary moments of the local minimum density operator, analogous to (3). For the single-step model we find that ⟨u⟩_L = (1/4)(1 − 1/L)^{−1} = 1/4 + 1/(4L) + O(1/L²), and σ_L² = 1/(16L) + O(1/L²). We propose that the scaling of the local minimum density for large L in our model follows the same form, i.e.,

⟨u⟩_L − ⟨u⟩_∞ ∝ L^{−1} ,    σ_L ∝ L^{−1/2} .    (8)

Our reasons are: (i) both models in their steady states belong to the EW universality class (short-range correlated local slopes), which guarantees that ⟨u⟩_∞ is non-zero; (ii) the constraint Σ_{i=1}^{L} φ_i = 0 in our model and the conservation of the number of particles in the lattice gas will produce similar finite-size effects. Our simulation results show very good agreement with (8) [Fig. 2(a)].
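The extrapolation to L→∞ reported below amounts to a linear fit in 1/L, per the first relation in (8). A minimal sketch (our own helper; the inputs are steady-state averages measured, e.g., with the simulation sketch given earlier, after discarding the transient):

```python
import numpy as np

def extrapolate_u_infinity(sizes, u_means):
    """Fit <u>_L = <u>_inf + a/L (first relation in Eq. (8)) and return
    the intercept <u>_inf together with the slope a."""
    inv_L = 1.0 / np.asarray(sizes, dtype=float)
    a, u_inf = np.polyfit(inv_L, np.asarray(u_means, dtype=float), 1)
    return u_inf, a

# usage (u_means to be filled in with measured steady-state values):
# u_inf, a = extrapolate_u_infinity([1000, 2000, 4000, 8000], u_means)
```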

Further, extrapolating to L→∞ yields ⟨u⟩_∞ = 0.24641 ± 7 × 10⁻⁶. The scaling relation (8) also implies that u is a self-averaging macroscopic quantity: its full distribution P_L(u), for large L, is Gaussian with the parameters in (8) [Fig. 2(b)]. Thus, in the L→∞ limit it approaches a delta-function centered about ⟨u⟩_∞.

FIG. 2. Steady-state scaling for the parallel efficiency (density of local minima). (a) The average and the variance (inset). (b) The full probability density. Inset: scaled probability densities, P̃_L(ũ) with ũ = (u − ⟨u⟩_L)/σ_L, collapse onto a Gaussian with zero mean and unit variance.

Finally, we point out the lack of up-down symmetry in our model in the steady state. This is most easily noticeable at short distances, either by looking at a snapshot [Fig. 3(a)], or through the high degree of asymmetry in the nearest-neighbor two-slope distribution [Fig. 3(b)]: the hilltops are sharp and the valley bottoms are flattened. Such stationary-state skewness is generally observable in one-dimensional KPZ growth, but has only recently received serious attention [21,22].

FIG. 3. (a) A short segment (50 sites) of a typical steady-state surface configuration. (b) Density plots for the nearest-neighbor two-slope distribution, L = 10^4.

In summary, we studied the asymptotic scaling properties of a general parallel algorithm by regarding the simulated time horizon as a non-equilibrium surface. We conclude that the basic algorithm (one site per PE) is scalable for one-dimensional arrays. The same correspondence can be applied to model the performance of the algorithm for higher-dimensional logical PE topologies. While this will involve the typical difficulties of surface-growth modeling, such as an absence of exact results and very long simulation times, it establishes potentially fruitful connections between two traditionally separate research areas.

We thank S. Das Sarma, S. J. Mitchell, and G. Brown for stimulating discussion. We acknowledge the support of DOE through SCRI-FSU, NSF-MRSEC at UMD, and NSF through Grant No. DMR-9871455.

[1] R. M. Fujimoto, Commun. ACM 33, 30 (1990).
[2] For example, the 9472-node ASCI Red at Sandia, the 12,288-node QCDSP Teraflop Machine at Brookhaven, and the Connection Machine CM-2 with 65,536 PEs.
[3] B. D. Lubachevsky, Complex Systems 1, 1099 (1987); J. Comput. Phys. 75, 103 (1988).
[4] B. D. Lubachevsky, in Distributed Simulation 1989, SCS Simulation Series Vol. 21 (1989), p. 100; D. M. Nicol, J. ACM 40, 304 (1993).
[5] A.-L. Barabási and H. E. Stanley, Fractal Concepts in Surface Growth (Cambridge University Press, Cambridge, 1995).
[6] B. Schmittmann and R. K. P. Zia, in Phase Transitions and Critical Phenomena, Vol. 17, edited by C. Domb and J. L. Lebowitz (Academic Press, New York, 1995).
[7] S. F. Edwards and D. R. Wilkinson, Proc. R. Soc. London, Ser. A 381, 17 (1982).
[8] P. Meakin, P. Ramanlal, L. M. Sander, and R. C. Ball, Phys. Rev. A 34, 5091 (1986); M. Plischke, Z. Rácz, and D. Liu, Phys. Rev. B 35, 3485 (1987).
[9] This contrasts with other algorithms [10,11] that obtain the correct equilibrium properties but change the dynamics.
[10] R. Friedberg and J. E. Cameron, J. Chem. Phys. 52, 6049 (1970).
[11] R. H. Swendsen and J.-S. Wang, Phys. Rev. Lett. 58, 86 (1987); Y. S. Choi, J. Machta, P. Tamayo, and L. X. Chayes, Int. J. Mod. Phys. C 10, 1 (1999), and references therein.
[12] A. G. Greenberg, B. D. Lubachevsky, D. M. Nicol, and P. E. Wright, in Proceedings, 8th Workshop on Parallel and Distributed Simulation (PADS ’94), Edinburgh, UK, 1994, p. 187.
[13] B. D. Lubachevsky, V. Privman, and S. C. Roy, J. Comput. Phys. 126, 152 (1996).
[14] G. Korniss, M. A. Novotny, and P. A. Rikvold, J. Comput. Phys. 153, 488 (1999).
[15] J. M. Burgers, The Nonlinear Diffusion Equation (Reidel, Boston, 1974).
[16] M. Kardar, G. Parisi, and Y.-C. Zhang, Phys. Rev. Lett. 56, 889 (1986).
[17] Z. Toroczkai, G. Korniss, S. Das Sarma, and R. K. P. Zia, to be submitted.
[18] F. Family and T. Vicsek, J. Phys. A 18, L75 (1985).
[19] G. Foltin, K. Oerding, Z. Rácz, R. L. Workman, and R. K. P. Zia, Phys. Rev. E 50, R639 (1994).
[20] F. Spitzer, Adv. Math. 5, 246 (1970).
[21] J. Neergaard and M. den Nijs, J. Phys. A 30, 1935 (1997).
[22] P. A. Rikvold and M. Kolesik, e-print cond-mat/9909188 (1999).