Parallel Implementation of Triangular Cellular Automata for Computing Two-Dimensional Elastodynamic Response on Arbitrary Domains

Michael J. Leamy and Adam C. Springer

Abstract In this research we report a parallel implementation of a Cellular Automata-based simulation tool for computing elastodynamic response on complex, two-dimensional domains. Elastodynamic simulation using Cellular Automata (CA) has recently been presented as an alternative, inherently object-oriented technique for accurately and efficiently computing linear and nonlinear wave propagation in arbitrarily-shaped geometries. The local, autonomous nature of the method should lead to straightforward and efficient parallelization. We address this notion on symmetric multiprocessor (SMP) hardware using a Java-based object-oriented CA code implementing triangular state machines (i.e., automata) and the MPI bindings written in Java (MPJ Express). We use MPJ Express to reconfigure our existing CA code to distribute a domain's automata to the cores of a dual quad-core shared-memory system (eight cores in total). We note that this message-passing parallelization strategy is directly applicable to cluster computing, which will be the focus of follow-on research. Results on the shared-memory platform indicate nearly ideal, linear speed-up. We conclude that the CA-based elastodynamic simulator is easily configured to run in parallel and yields excellent speed-up on SMP hardware.

Keywords Wave propagation • Cellular automata • Computational mechanics • Distributed computing • MPI

M.J. Leamy () • A.C. Springer
Georgia Institute of Technology, Atlanta, GA, USA
e-mail: [email protected]; [email protected]

J. Náprstek et al. (eds.), Vibration Problems ICOVP 2011: The 10th International Conference on Vibration Problems, Springer Proceedings in Physics 139, DOI 10.1007/978-94-007-2069-5_98, © Springer Science+Business Media B.V. 2011


1 Introduction

The cellular automata paradigm [1, 2] has recently been adopted to simulate wave propagation in two-dimensional linear and nonlinear elastic domains of arbitrary shape [3, 4]. The approach shares an idea central to all cellular automata modeling: domain discretization using autonomous cells (usually rectangular or hexagonal) whose state is updated via simple rules. In the cited elastodynamic work, a rule set for non-uniform triangular cells has been developed which allows multiply-connected domains of arbitrary shape to be simulated efficiently and accurately. In fact, the method has been shown to effectively avoid spurious oscillations at sharp wave fronts without the need for specialized treatment, unlike other methods such as the finite element method. The method is briefly reviewed next; full details can be found in the cited work, and Java source code can be found on the first author's research web page.

Figure 1 provides a graphical overview of the method. Non-uniform triangles are employed to discretize a domain into multiple state machines termed automata. The rule set governing the temporal update of each cell is arrived at by applying a balance of momentum to a target cell, summing the forces acting on each of its faces. Doing so requires the computation of strains, which are categorized as either Type I (derivatives in the normal direction) or Type II (tangential derivatives). Numerical evaluation of the spatial derivatives then follows from simple finite difference expressions using the appropriate states of neighboring automata. By choice, we use only von Neumann and what we term secondary von Neumann neighbors.

Fig. 1 Left: target cell (or target automaton) geometry with neighbors identified. Top Right: multiply-connected domain with automata mesh. Bottom Right: rectangular automata illustrating strains needed for rule set development


Fig. 2 Comparison of x-displacement results at a snapshot in time (finite element [bottom] vs. cellular automata [top]) for a sharply loaded domain with an interior hole – see Fig. 1 for the meshed domain. The left subfigures employ an isometric perspective, while the right subfigures employ a perspective from the y-axis. The material simulated is aluminum, and the loading occurs on the left boundary in the form of an imposed Dirichlet boundary condition at the start of the simulation (Figure reproduced from [3])

Finally, a forward-Euler time integration of the momentum equation yields the target cell's explicit rule set. Simulation proceeds by requesting that each automaton update its state (displacement and velocity components) at each time step. Note that the method relies solely on local interactions, avoids partial differential equations and their complexity, and is fully object-oriented. In fact, the traditional process of assembling and solving a matrix set of equations is traded for requests to the automata objects to update their state.

Simulation results from the elastodynamic CA approach have been compared to those from other methods, including commercially available finite element simulators, and excellent agreement has been documented [3, 4]. For smooth loading histories, the two methods yield nearly identical results. For sharply discontinuous loading, such as that experienced during impact, the CA approach avoids spurious oscillations (see Fig. 2), which are a well-known artifact in interpolation-based approaches such as the finite element method. For this reason, the CA method appears particularly attractive for studying elastodynamic problems in which loading produces sharp wave fronts.
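To make the update cycle concrete, the following is a minimal Java sketch of a cell object with a forward-Euler step method. The class and member names are hypothetical, and the per-face force terms are left as placeholders; the actual Type I/Type II strain and force expressions are those derived in [3].

// Schematic sketch of a triangular automaton (cell) as a Java object with a
// forward-Euler state update. Class and member names are hypothetical;
// faceForceX/faceForceY stand in for the per-face strain/force rules of [3].
import java.util.ArrayList;
import java.util.List;

class Cell {
    double ux, uy;      // displacement components (state)
    double vx, vy;      // velocity components (state)
    double mass;        // lumped mass of the triangular cell
    final List<Cell> neighbors = new ArrayList<>();  // von Neumann and secondary von Neumann neighbors

    // One explicit time step of size dt: sum the forces acting on each face
    // from neighbor states, then integrate the momentum balance forward in time.
    void step(double dt) {
        double fx = 0.0, fy = 0.0;
        for (Cell n : neighbors) {
            fx += faceForceX(n);    // placeholder for the Type I/Type II strain-based tractions
            fy += faceForceY(n);
        }
        double vxOld = vx, vyOld = vy;
        vx += dt * fx / mass;       // forward-Euler velocity update
        vy += dt * fy / mass;
        ux += dt * vxOld;           // forward-Euler displacement update
        uy += dt * vyOld;
    }

    double faceForceX(Cell n) { return 0.0; }   // stand-ins for the published rule set
    double faceForceY(Cell n) { return 0.0; }
}

A driver would simply loop over all cells and call step(dt) at every time step, which is the "request to update state" described above; a full implementation would also double-buffer the state so that every cell updates from the previous time step's values.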


2 Parallelization Approach and Results


The Message Passing Interface (MPI) was chosen for parallelizing the CA-based elastodynamic simulator on both symmetric multiprocessor (SMP) shared-memory systems and distributed-memory clusters. MPI is the de facto standard for passing messages (i.e., data), which is the central enabler of distributed computing. While C implementations of MPI are common, Java implementations are not; furthermore, the Java language has received little notice as a serious High Performance Computing (HPC) language. To justify using Java, we first performed a preliminary assessment of its HPC potential using the example of computing Pi with a Monte Carlo technique.

The Monte Carlo estimation of Pi was written in both Java and C so that we could compare the runtime performance of C-based MPI and MPJ Express. The approach generates random numbers on the interval [0, 1] for both the x and y coordinates, and then tallies a one if the pair lies inside the top-right quadrant of the unit circle, and a zero otherwise. The probability of a pair lying inside the quarter circle is one quarter of Pi, so after n iterations the estimate for Pi is four times the total tally divided by n. By performing 'inner' iterations, in which each processor generates n random (x, y) pairs, and m 'outer' iterations, in which each processor reports its Pi estimate back to a master processor after completing an inner iteration, we can control the ratio of time spent computing Pi to time spent passing messages. The comparison thus captures both the inherent calculation speeds of C and Java and the message-passing overheads.

The Monte Carlo estimates of Pi were run on a 64-bit Windows 7 machine with two Intel Xeon E5506 quad-core processors and 24 GB of available RAM. The C simulation employed the MPICH2 implementation of MPI as well as Cygwin, a Unix-like porting layer for Windows, to facilitate message passing [5]. The Java version utilizes MPJ Express [6] and runs natively through the NetBeans IDE. The coded algorithms are nearly identical in terms of the number of lines of code needed, the variable types employed, and the overall program flow. Figure 3 illustrates scaling results from the simulation.
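As an illustration, the following is a minimal sketch (not the benchmark code used in this study) of the Java version, assuming the mpiJava-style API exposed by MPJ Express (the mpi.MPI class with Rank, Size, and Reduce); the iteration counts and random seeding are illustrative only.

// Monte Carlo estimate of Pi with MPJ Express: each process performs 'inner'
// iterations, and each 'outer' iteration reduces the per-process estimates to
// the master (rank 0), which averages and reports them.
import mpi.MPI;
import java.util.Random;

public class MonteCarloPi {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        final long n = 10000000L;   // 'inner' iterations per process
        final int outer = 10;       // 'outer' iterations reported to the master
        Random rng = new Random(1234 + rank);

        for (int m = 0; m < outer; m++) {
            long tally = 0;
            for (long i = 0; i < n; i++) {
                double x = rng.nextDouble();
                double y = rng.nextDouble();
                if (x * x + y * y <= 1.0) tally++;   // pair lies inside the quarter circle
            }
            double[] local = { 4.0 * tally / n };    // this process's Pi estimate
            double[] global = new double[1];
            // The master receives the sum of all estimates and averages them.
            MPI.COMM_WORLD.Reduce(local, 0, global, 0, 1, MPI.DOUBLE, MPI.SUM, 0);
            if (rank == 0) {
                System.out.printf("outer iteration %d: pi estimate = %.6f%n", m, global[0] / size);
            }
        }
        MPI.Finalize();
    }
}

Varying n and m changes the ratio of computation to communication, which is how the weak and strong scaling comparisons of Fig. 3 were generated.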

Fig. 3 Weak and strong scaling, C (MPI) vs. Java (MPJ Express), using a parallel Monte Carlo algorithm for computing Pi. Both panels plot run time (seconds) against the number of cores


Fig. 4 Left: Graphical depiction of the parallel calculation flow, in which the master broadcasts chunks of cell objects (Chunk[1], Chunk[2], ..., Chunk[n]) to the slave tasks, each slave updates the state of its cells and sends the updated cell states back, and the master receives them and outputs the cell states. Right: Achieved and ideal speedup as a function of the number of processors (1–8) employed on a multi-core shared-memory system

The graph on the left shows the weak scaling comparison (fixed problem size per core) using ten million iterations per core. Strong scaling is also provided and illustrates how the solution time varies with the number of cores employed for a fixed total problem size; we again used ten million iterations as the problem size. The results indicate that, as implemented, Java performs on the same order as, or better than, C. This conclusion has also been reached in studies conducted by the MPJ Express developers [6].

Development of a distributed CA algorithm amounts to splitting an array of automata objects into 'chunks' which can then be computed on each processor. Since each cell (or automaton) is a Java object encapsulating data and methods (e.g., pointers to neighbors and a step method), the entire object can be passed to a processor as a single entity. After completing a state update (i.e., a step), the object requires only that neighbor states residing on other processors be passed to it before completing another state update. We accomplish this neighbor state sharing using blocking send/receive MPJ Express communications. Communications are often the bottleneck in a parallelization strategy due to their non-negligible latency. To decrease the number of communications per processor, we populate each processor by exploiting the neighbor information already stored by the CA cells: when a cell is added to a processor, its neighbor cells are added next, and those neighbor cells are placed in a first-in, first-out queue so that their neighbors can in turn be added to the same processor. This process repeats until the processor holds a predetermined number of cells; a sketch of the procedure is given below. The straightforward CA parallelization approach just described can be contrasted with traditional elastodynamic simulation methods, such as most finite element approaches, which require a sparse linear system solver and complex decomposition schemes prior to HPC deployment [7].
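A minimal sketch of this neighbor-aware partitioning, written as a breadth-first traversal of the cell neighbor graph, follows. The class and method names are hypothetical, and it assumes a Cell type holding a list of its neighbors (as in the earlier sketch); the actual code on the first author's research page may differ.

// Sketch of neighbor-aware partitioning: cells are assigned to a processor by
// traversing the neighbor graph with a first-in, first-out queue until the
// processor holds its target number of cells. Assumes a connected automata
// mesh; any leftover cells could simply be appended to the last chunk.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

class Partitioner {
    static List<List<Cell>> partition(List<Cell> cells, int numProcs) {
        int chunkSize = (cells.size() + numProcs - 1) / numProcs;  // target cells per processor
        List<List<Cell>> chunks = new ArrayList<>();
        Set<Cell> assigned = new HashSet<>();

        for (Cell seed : cells) {
            if (assigned.contains(seed)) continue;     // already placed on a processor
            List<Cell> chunk = new ArrayList<>();
            Queue<Cell> queue = new ArrayDeque<>();    // FIFO queue of candidate cells
            queue.add(seed);
            while (!queue.isEmpty() && chunk.size() < chunkSize) {
                Cell c = queue.remove();
                if (!assigned.add(c)) continue;        // skip cells grabbed by an earlier chunk
                chunk.add(c);
                for (Cell n : c.neighbors) {           // keep neighbors on the same processor
                    if (!assigned.contains(n)) queue.add(n);
                }
            }
            chunks.add(chunk);
        }
        return chunks;
    }
}

Once the chunks are distributed, each process steps its own cells and then exchanges the states of cells bordering other chunks via the blocking MPJ Express send/receive calls mentioned above, before the next time step begins.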

Figure 4 provides a graphical depiction of one version of this parallel algorithm, in which the master processor coordinates all sends and receives. On a symmetric multi-core, shared-memory system, we analyzed the performance of our algorithm using the ratio of the best sequential running time to the parallel running time. Figure 4 also shows the achieved speedup as a function of the number of cores employed while simulating just over 20,000 cells; for comparison purposes, the ideal speedup curve is provided as well. For low numbers of processors, the CA speedup is nearly ideal. As the number of processors increases, the ratio of MPI communications to cells processed increases, and the achieved speedup deviates from ideal. However, even when employing all eight processors, and thus competing for computing resources with the operating system, the speedup is a very respectable 6.7 (ideal being 8.0). Inexpensive machines with core counts well in excess of eight are expected on the market in the near future, and based on the presented results, near-ideal speedup of the CA elastodynamic simulator can be expected when employing a large proportion of those cores.

3 Concluding Remarks

This paper describes a new simulation technique in solid mechanics for computing elastodynamic response in arbitrary two-dimensional domains using multiple processors. The paper documents near-linear speedup with respect to the number of available processors on shared-memory systems. The developed method is notable for its straightforward formulation based on local interactions, its ability to accurately simulate sharp wave fronts, and its compatibility with both modern object-oriented software paradigms and parallel processing techniques. Follow-on work will address performance on distributed-memory computer clusters, and may also consider distributed computing using GPU-based systems.

References

1. Chopard, B., Droz, M.: Cellular Automata Modeling of Physical Systems. Collection Aléa-Saclay 6. Cambridge University Press, Cambridge/New York (1998)
2. von Neumann, J.: Theory of Self-Reproducing Automata. University of Illinois Press, Urbana (1966)
3. Hopman, R.K., Leamy, M.J.: Triangular cellular automata for computing two-dimensional elastodynamic response on arbitrary domains. J. Appl. Mech. Trans. ASME 78(2), 021020 (2011)
4. Leamy, M.J.: Application of cellular automata modeling to seismic elastodynamics. Int. J. Solids Struct. 45(17), 4835–4849 (2008)
5. Noer, G.: Cygwin: a free Win32 porting layer for UNIX applications. In: Proceedings of the 2nd USENIX Windows NT Symposium, Seattle, WA, USA. USENIX Association (1998)
6. Baker, M., Carpenter, B., Shafi, A.: MPJ Express: towards thread safe Java HPC. In: IEEE International Conference on Cluster Computing, Barcelona, Spain (2007)
7. Paszynski, M., Kurtz, J., Demkowicz, L.: Parallel, fully automatic hp-adaptive 2D finite element package. Comput. Methods Appl. Mech. Eng. 195(7–8), 711–741 (2006)