Performance of Synchronous Parallel Discrete-Event Simulation

Bradley L. Noble, Gregory D. Peterson, and Roger D. Chamberlain
Computer and Communications Research Center
Washington University, St. Louis, Missouri
Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

1 Introduction

This paper explores the performance of three synchronous discrete-event simulation algorithms. We examine the effects of granularity and present empirical data to illustrate at what granularity each algorithm attains reasonable performance. We also investigate two techniques for decreasing both synchronization overhead and load imbalance. In addition, we examine how various execution platforms impact the performance of the simulation, providing empirical data from a network of workstations and a shared-memory multiprocessor. The impact of shared computational resources on simulation performance is also explored.

The simulated system is a network of queues connected in a torus topology. The queues are FCFS, the service requirements are exponentially distributed with a minimum service time, and the routing probabilities are uniformly distributed across the neighboring queueing stations.
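
To make the model concrete, the following sketch (in Python) samples service times and routing decisions for such a network; the torus dimensions and service-time parameters are illustrative values of our own, not figures taken from the paper.

    # Sketch of the simulated model: FCFS queues on a torus, exponential
    # service with a minimum service time, and uniform routing to the
    # neighboring stations. All numeric parameters are illustrative.
    import random

    N, M = 8, 8            # torus dimensions (N*M queueing stations)
    MIN_SERVICE = 1.0      # minimum service time (the CL lookahead)
    MEAN_EXTRA = 4.0       # mean of the exponential part of the service

    def service_time(rng):
        # Exponentially distributed service requirement with a minimum.
        return MIN_SERVICE + rng.expovariate(1.0 / MEAN_EXTRA)

    def neighbors(i, j):
        # The four neighbors of station (i, j) on the wrap-around grid.
        return [((i - 1) % N, j), ((i + 1) % N, j),
                (i, (j - 1) % M), (i, (j + 1) % M)]

    def route(i, j, rng):
        # Routing probabilities uniformly distributed over the neighbors.
        return rng.choice(neighbors(i, j))

    rng = random.Random(1)
    print(service_time(rng), route(0, 0, rng))
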
In the global clock (GC) algorithm, during each iteration, processors are constrained to process events at a single point in simulated time. As part of the barrier synchronization at the end of each iteration, the processors cooperate to determine the next simulated time with events to process. Although the GC algorithm is straightforward to understand, implement, and debug, it can significantly reduce the amount of parallelism that is exploited by the simulator, as well as the granularity of the simulation, since the critical path of the simulation is determined by the slowest processor at each simulation time step.

The conservative lookahead (CL) algorithm exploits knowledge of the simulated system to allow the execution of events timestamped later than the current simulated time. Since our queueing network uses a FCFS discipline with indistinguishable jobs and a single job class, job order at each simulation time step is not important. This allows future events with timestamps up to the minimum service time beyond the current time to be safely processed. If all events for a subsequent point in simulated time are processed during the current iteration, the subsequent iteration can be skipped, thereby reducing the number of barriers. The increased workload per iteration increases the granularity of the algorithm.
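
The sketch below contrasts the two iteration rules in a single-processor harness. The function names, the event representation, and the barrier_min_reduce stub (standing in for the cooperative minimum computed at the barrier) are our own simplifications; the paper's implementation distributes the stations across processors and exchanges event messages through PVM.

    # Sketch of one simulation iteration under GC and CL. Only the
    # event-window logic differs between the two algorithms.
    import heapq

    def process(event):
        # Stand-in: applying an event would update a queueing station
        # and schedule follow-on arrival/departure events.
        pass

    def gc_iteration(pending, barrier_min_reduce):
        # GC: all processors agree on the single next event time, then
        # process events at exactly that point in simulated time.
        local_min = pending[0][0] if pending else float('inf')
        t = barrier_min_reduce(local_min)
        while pending and pending[0][0] == t:
            process(heapq.heappop(pending)[1])
        return t

    def cl_iteration(pending, barrier_min_reduce, lookahead):
        # CL: every event with timestamp at most t + lookahead is safe,
        # since no remote message can carry a timestamp less than the
        # minimum service time beyond t. Wider windows mean fewer
        # barriers and a larger grain of work per iteration.
        local_min = pending[0][0] if pending else float('inf')
        t = barrier_min_reduce(local_min)
        while pending and pending[0][0] <= t + lookahead:
            process(heapq.heappop(pending)[1])
        return t

    # Single-processor usage: the "barrier" reduction is the identity.
    pending = [(0.0, "arrival@(0,0)"), (0.0, "arrival@(3,2)"),
               (1.2, "departure@(0,0)")]
    heapq.heapify(pending)
    gc_iteration(pending, lambda x: x)   # processes only the t=0.0 events
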
Speculative computation (SC) utilizes the idle time at the barrier by allowing event processing to proceed beyond the safety limits imposed by CL. While waiting for the barrier to complete, events are processed speculatively, in the hope that an event message arriving from a remote processor does not subsequently invalidate them. Once the barrier is complete, the speculated events are tested for correctness and either committed or discarded. Additional details on the three algorithms investigated are presented in [1, 2].
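
The sketch below illustrates this commit-or-discard structure. Since the paper does not describe the rollback mechanism, the per-event state snapshots and the conservative invalidation rule shown here are illustrative assumptions.

    # Sketch of the speculative phase: process events past the CL safety
    # limit while the barrier is outstanding, then commit or discard.
    import copy
    import heapq

    def apply_event(event, state):
        # Stand-in for optimistically applying an event to local state.
        pass

    def sc_phase(pending, state, barrier_done, remote_arrival_times):
        # state: a dict of local station state; snapshots are our own
        # rollback device, not something the paper prescribes.
        speculated = []                      # (event_time, state before it)
        while not barrier_done() and pending:
            t, event = heapq.heappop(pending)
            speculated.append((t, copy.deepcopy(state)))
            apply_event(event, state)        # speculative processing

        # Barrier complete: remote event arrivals are now known.
        earliest = min(remote_arrival_times(), default=float('inf'))
        for t, before in speculated:
            if t >= earliest:
                # An earlier (or equal) remote message may invalidate this
                # event and everything after it: discard by rolling back.
                # (A real simulator would also re-queue the discarded events.)
                state.clear()
                state.update(before)
                return False                 # some speculation discarded
        return True                          # all speculated events commit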

2 Performance Results

The PVM system is used as a message-passing environment on two execution platforms. The first consists of 10 identical Ethernet-connected (10 Mb/s) NeXT workstations, each equipped with a Motorola 68040 CPU, 8 MB of RAM, and a local swap disk. The second, a Sun SPARCcenter 2000 (SC2000), is equipped with 640 MB of global RAM, 20 SuperSPARC processors (each with a 1 MB secondary cache), and an XDBus interconnect (500 MB/s).

2.1 Problem Size and Granularity

We investigate the impact of granularity on the performance of the simulation by varying the problem size over a wide range on a dedicated, fixed processor set. Figure 1 illustrates the performance of the GC, CL, and SC algorithms on a network of 5 NeXT machines and 5 processors of an SC2000. For the GC algorithm, the problem size needs to be quite large before any reasonable speedup is attained. This is largely due to the effects of frequent barrier synchronizations. Both the CL and SC algorithms fared much better, requiring a problem size roughly half as large to achieve the same performance. This is a direct result of the reduction in the number of barrier synchronizations. The speedup achieved on the SC2000 is lower than that of the NeXTs, primarily due to the communication overheads associated with the PVM environment.

2.2 Synchronous Algorithm Performance

Figure 2 illustrates the performance of the simulator on both the NeXTs and the SC2000 for a scaled problem size of 16,000 queueing stations per processor. Again, the performance of the GC algorithm is poor compared to the CL and SC algorithms. As the number of processors is scaled up, the performance gains (especially for the GC algorithm) diminish for the NeXTs while performance continues to improve on the SC2000. This distinction is attributable to the different mechanisms used for interprocessor communication on the two platforms. For the NeXTs, the Ethernet will eventually saturate and become a performance bottleneck. This effect is much less pronounced on the SC2000, since communication occurs across a high-speed bus. The SC algorithm is marginally better than CL on the NeXTs and is negligibly better on the SC2000. Measurements indicate that there was only a small variance in processor load each iteration, and consequently little idle time during the simulation for speculative computation to occur.


[Figure 1: Performance impact of grain size. Speedup vs. number of queueing stations (thousands) for the GC, CL, and SC algorithms on the NeXTs and the SC2000.]

[Figure 2: Synchronous algorithm performance. Speedup vs. number of processors for the GC, CL, and SC algorithms on the NeXTs and the SC2000.]

2.3 Background Load

To investigate the performance of the algorithms in the presence of background load, we generate a synthetic background load in order to support controlled experimentation. We use a repetitive FFT calculation to serve as the background load. A host program loops forever; on each loop, it chooses a random processor (from the entire set) and instructs that processor to compute the FFT of an image. As the number of processors executing the simulation increases, the total background load perceived by the simulation grows (i.e., the effective background load on 1 processor is 1/7 of the effective background load on 7 processors) [2]. The performance effect this synthetic background load has on the NeXTs is listed in Table 1. Here, the speedup is relative to a single-processor execution contending with the background load. Notice that although overall performance is down due to the increased load, there is a greater separation between CL and SC as the number of processors used in the simulation increases. Using 7 processors, the improvement from CL to SC increases from 4% to 10% (3.442/3.135 ≈ 1.10) in the presence of background load. With a higher variance in processor completion times each iteration (due to the background load), the speculative computation algorithm does a better job of balancing the workload across iterations.
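
A sketch of such a load generator appears below; it uses local worker processes and numpy FFTs in place of the paper's networked NeXT workstations, and the pool size, image size, and pacing are illustrative assumptions.

    # Sketch of the synthetic background load: a host loops forever,
    # picks a random processor, and instructs it to compute the FFT of
    # an image. Worker processes stand in for the paper's NeXTs.
    import random
    import time
    import numpy as np
    from multiprocessing import Process, Queue

    def worker(jobs):
        # Each "processor" computes the FFT of an image on demand.
        while True:
            seed = jobs.get()
            image = np.random.default_rng(seed).random((256, 256))
            np.fft.fft2(image)

    if __name__ == "__main__":
        n = 7                                # processors under load
        queues = [Queue() for _ in range(n)]
        for q in queues:
            Process(target=worker, args=(q,), daemon=True).start()
        seed = 0
        while True:                          # host program loops forever
            random.choice(queues).put(seed)  # a random processor's queue
            seed += 1
            time.sleep(0.05)                 # pacing (our addition)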

Table 1: Scaled Speedup with Background Load

    Processors   Conservative Lookahead   Speculative Computation
        2                1.540                    1.540
        5                2.615                    2.783
        7                3.135                    3.442

3 Conclusions

We have presented empirical results that quantify the impact of problem granularity on simulator performance, relate the performance of three synchronous algorithms on two different execution platforms, and investigate the impact of background load on performance. The granularity results show that large problem sizes are necessary for acceptable performance. Across the board, the CL algorithm significantly outperforms the GC algorithm by reducing the number of barriers. The SC algorithm exploits the idle time during barrier synchronization by conditionally processing events that must then be checked for correctness once the barrier synchronization is complete. Although its performance gains are limited on dedicated systems, its performance degrades more slowly than CL's on systems experiencing background loads.

References

[1] R. M. Fujimoto. Parallel Discrete-Event Simulation. Comm. of the ACM, 33(10):30-53, Oct. 1990.

[2] B. Noble, G. Peterson, and R. Chamberlain. Performance of Synchronous Parallel Discrete-Event Simulation. Technical Report WUCCRC-94-13, Computer and Communications Research Center, Washington University, St. Louis, MO, 1994.

This material is based upon work supported by the NSF under grant MIP-9309658 and the NIH under grant GM28719.

