A Trace-based Performance Prediction Tool for Parallel Discrete Event Simulation

Zoltan Juhasz 1,3, Stephen Turner 2,3, Miklos Gerzson 4 and Krisztian Kuntner 1

1 Dept of Information Systems, University of Veszprem, Hungary
2 School of Computer Engineering, Nanyang Technological University, Singapore
3 Dept of Computer Science, University of Exeter, UK
4 Dept of Automation, University of Veszprem, Hungary

[email protected], [email protected], [email protected]
ABSTRACT

Developing parallel discrete event simulation programs is a complex task that too often results in disappointingly low runtime performance. Developers need performance prediction tools capable of providing information on the future performance of a parallel program during the design phase, prior to implementation. This paper describes such a prediction tool, being developed for parallel discrete event simulation programs. Its unique feature is that it does not require the parallel program to exist; it "simulates" the parallel execution from runtime trace information generated by a sequential simulation run. This enables one to determine, using a sequential simulation program, whether or not a parallel implementation is feasible and, if so, which approach is best for the given simulation problem. The parallel version then needs to be developed only if it promises to be successful. The paper describes the main features and structure of the analyser program.
KEY WORDS: parallel simulation, performance prediction, critical path.

1. INTRODUCTION

Parallel computing technology has the potential to speed up the execution of large and complex sequential discrete event simulations. However, despite more than twenty years of research, parallel discrete event simulation techniques have not yet gained widespread acceptance within the discrete event simulation community. This is due to the fact that developing parallel simulation programs is a difficult and time-consuming process requiring special expertise, and it is not guaranteed that the end product will deliver the required, improved performance. The achievable performance gain for any parallel simulation protocol depends on the particular simulation problem (application) as well as on many software- and hardware-related parameters. There is no universal best parallel simulation protocol, so selecting the best parallel implementation approach is frequently done by trial and error. Users running sequential simulation programs will only adopt parallel simulation techniques if there is a justified benefit, and this can only be determined with suitable performance prediction tools that help in exploring the future behaviour of the parallel simulation program under development [1][2].

This paper describes the first version of a general-purpose performance prediction tool for parallel discrete event simulation that we are developing in our current project. The tool is a post-mortem parallelism analyser that predicts parallel execution time for different simulation protocols, process-to-processor mappings and hardware parameters, based on run-time information gathered during the execution of a sequential simulation program. Using the tool, the developer can obtain important information on the behaviour of the simulation under various parallel implementation settings and, based on this, select the best approach for the implementation.
2. BACKGROUND AND RELATED WORK

The tool's prediction method belongs to the family of critical path analysis approaches [3]. The key concept of the method is that once the parallel execution graph of a program is known, the critical path gives the shortest possible execution time for a conservative simulation (it is possible to execute a simulation in less than critical time using optimistic techniques). The program is designed to examine parallel simulation using conservative or optimistic simulation protocols. The terms conservative and optimistic refer to the way a parallel simulation program ensures that events are executed in the correct order without violating causality. The conservative method executes an event if and only if it is guaranteed that the process will not later receive an event with a smaller timestamp than its current simulation time; until this condition becomes true, the process is blocked. The optimistic method, on the other hand, lets events be executed in any order, but when a causality error occurs it recovers from it: the simulation rolls back to the last correct simulation state and restarts execution in the correct event order.

One way to obtain the information necessary for building the execution graph is to create an execution trace of the parallel simulation. This, unfortunately, requires the implemented parallel program and thus cannot be used for prediction. To predict, one would like to work from a sequential simulation program and implement a parallel one only if that proves worthwhile. Using a straightforward sequential trace for prediction is not adequate, as it does not include overheads present only in the parallel algorithm, e.g. the event synchronisation that depends on the simulation protocol used.
The sequential execution of this simulation ensures the correct execution order of events, resulting in the sequence e1, e2, …, e12 with an overall execution time Tseq of 36 time units, as shown in Figure 3. Since the event precedence graph is acyclic, there is a maximum-weight path (a critical path) that represents the minimum parallel execution time. This representation ignores the effects of parallel simulation protocols; therefore the critical path execution time (assuming zero communication delays) gives an absolute lower bound on the parallel execution time. In our example, the critical path is e1, e2, e3, e4, e11, e12, e13, with an execution time of 21 time units. This results in a maximum achievable speedup Sp of 36/21 = 1.71, regardless of the number of processors.
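The critical-path bound described above amounts to a longest-weighted-path computation over the acyclic event precedence graph. The sketch below illustrates the idea on a small hypothetical graph of our own (not the paper's exact example); the function name and event set are assumptions made for illustration only.

```python
# Sketch: critical-path lower bound on parallel execution time.
# Events and dependencies below are a hypothetical illustration.
def critical_path_time(cost, deps):
    """Longest weighted path through an acyclic event graph.
    cost: {event: execution time}; deps: {event: [predecessor events]}."""
    memo = {}
    def finish(e):  # earliest possible finishing time of event e
        if e not in memo:
            memo[e] = cost[e] + max((finish(p) for p in deps.get(e, [])), default=0)
        return memo[e]
    return max(finish(e) for e in cost)

# Tiny example: each event costs 3 units, e4 depends on both branches.
cost = {e: 3 for e in ["e1", "e2", "e3", "e4", "e5"]}
deps = {"e2": ["e1"], "e3": ["e1"], "e4": ["e2", "e3"], "e5": ["e4"]}
t_par = critical_path_time(cost, deps)  # longest chain e1-e2-e4-e5: 12 units
t_seq = sum(cost.values())              # sequential time: 15 units
print(t_par, t_seq / t_par)             # 12 and a speedup bound of 1.25
```

The same recursion, applied to the paper's 36-unit example, yields the 21-unit bound and the 1.71 speedup limit quoted above.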
One approach to creating a parallelism analyser based on sequential execution is to build the analyser into a sequential simulator, which then performs the analysis as the sequential simulation runs. Two such systems are reported in [4] and [5]. An alternative is post-mortem analysis, which uses a sequential trace for the same type of analysis; this is the approach taken in our analyser and prediction tool.
Figure 2. The event precedence graph of the system. Black arrows represent event scheduling dependence, whereas grey ones mark event causality constraints.
If this simulation is executed correctly under a conservative simulation protocol, there will be inevitable synchronization delays as processes can only process safe events.
The problems of predicting the execution time of a parallel simulation program are best illustrated by the following small example. Assume a simulation system consisting of five processes P1, P2, …, P5, as shown in Figure 1. The arrows show communication paths, or links, on which events generated for other processes can be transmitted. Also assume that each process both executes and schedules events.
Figure 1. A simulation system of five processes.
An example event precedence graph, describing the dependencies of events, is shown in Figure 2. Here process P1 generates and schedules new events for itself (e5, e7), for process P2 (e2) and for P3 (e6, e8). P2 and P3 in turn generate and schedule events for processes P4 (e3, e10) and P5 (e11, e12). Finally, P4 generates events (e4, e13) for P5. We further assume that each event has the same timestamp as its index (for instance, e2 occurs at simulation time 2) and that the execution of each event takes 3 time units.
Figure 3. Sequential execution of the sample simulation.
There are two rules that ensure that events are executed in correct timestamp order. The input waiting rule states that a process must receive a message on all of its input links and must select the one with the smallest timestamp for execution. The output waiting rule states that a newly generated event can only be sent to another process if its timestamp is smaller than the timestamp of the next event plus the lookahead value; i.e. the next event cannot generate an event with a smaller timestamp than that of the just-generated new event. In our example, a lookahead value of 1 is assumed for all processes. It should be noted that these waiting rules apply only to our selected conservative protocol; there are conservative protocols without an output waiting rule.

Figure 4 illustrates the timing of parallel execution using a conservative protocol with null-messages. Based on the rules just described, the conservative parallel execution introduces the following delays. Process P4 needs to wait for a message on each of its input links before it can proceed (input waiting rule), whereas event message e3 cannot be sent out at the end of e2 because P2 does not yet know its next event. Once P1 has finished its simulation and sent an end-of-simulation message to P2 at the end of e7, P2 can decide that e3 can be sent to P4. Now P4 has one event from each of its input links (e3 and e10), so it can choose e3 as the safe event. In Figure 4, the small boxes represent buffers where messages wait until they are considered safe for sending and processing.
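The input waiting rule above can be sketched as a small safe-event selection routine. The link and message structures here are illustrative assumptions of ours, not the tool's actual data structures; null messages are represented simply as events with a `None` payload.

```python
# Sketch of the input waiting rule in a null-message conservative protocol:
# a process may execute only when every input link holds a message, and it
# must then pick the one with the smallest timestamp.
def next_safe_event(input_links):
    """input_links: {link_id: [(timestamp, event), ...]} FIFO queues.
    Returns the safe (timestamp, event) pair, or None if some link is
    empty and the process must block."""
    if any(not q for q in input_links.values()):
        return None  # block: an earlier event could still arrive on the empty link
    # Every link has a pending message: the globally smallest head is safe.
    link = min(input_links, key=lambda l: input_links[l][0][0])
    return input_links[link].pop(0)

links = {"from_P1": [(5, "e5")], "from_P2": []}
print(next_safe_event(links))        # None: must wait for a message from P2
links["from_P2"].append((3, None))   # a null message carrying only a timestamp
print(next_safe_event(links))        # (3, None): consumed first, unblocking e5
```

This is exactly why null messages help: they fill otherwise-empty links with timestamp guarantees, letting the receiver declare real events safe without waiting indefinitely.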
Figure 4. Execution of the sample simulation under the conservative protocol with null-messages.
As a result, the execution time has increased to 24 units (Sp = 1.5). In realistic systems there are other delays as well; for example, communication cost normally cannot be neglected. Consequently, execution times will be longer, and speedups smaller, than those indicated by the critical path analysis. The effects of these parameters can be crucial and, as they are difficult to analyse, performance prediction tools are indispensable.
3. THE PERFORMANCE ANALYSER

The analyser tool we have developed helps developers answer the following types of questions:
− What is the shortest possible execution time (or largest speedup) that a parallel implementation could provide?
− Which simulation protocol will give the best results for my particular application?
− How will the speedup scale if we increase the number of processors or the size of the problem?
− Which parallel architecture is the most suitable for our implementation?
We have decided to use a post-mortem analysis approach as this has the following advantages over the integrated analyser-in-simulator approach:
− The real simulation needs to be executed only once. Subsequent analyses can be executed much faster and as many times as needed. This helps users to study the effects of different hardware, architecture, mapping and algorithmic (simulation protocol) settings.
− This approach does not require the use of any particular simulation or programming language. Existing sequential simulation programs can be used as long as they generate the required trace information, which normally requires only minimal program modification.
− Since simulation and analysis are de-coupled, analysis can be performed on any platform and at any physical location.
− The separation makes future extension and modification of the analyser very simple.
The only disadvantage of our approach is that trace information is needed; in fact, the trace can be very large for complex and long-running simulations. We believe that the ability to explore the effects of different hardware, software and architectural parameters quickly will outweigh this and help developers gain insight into the behaviour and sensitivity of the designed parallel simulation program.

Using the analyser

The user can select the required sequential simulation trace file in the File-Open menu. As the trace file is processed, events are stored in memory for faster subsequent analyses. During the analysis, a conservative Chandy-Misra simulation protocol analysis algorithm is used to "simulate" the parallel execution of the sequential simulation and ensure the correct timing of events. This algorithm is based on [5] but has been modified to become suitable for post-mortem analysis. Currently this is the only simulation protocol the analyser provides; we are working on conservative protocol variations and on the optimistic protocol.

The analysis result, assuming one process per processor and zero communication cost, is presented in a space-time diagram as shown in Figure 5. The horizontal axis represents time, whereas the vertical one shows the processors in increasing index order. Events are displayed at the times when they are processed, along with the messages that send them from one process to another. The sequential execution time is displayed as a marker at the right. The parallel execution time (the finishing time of the last event execution) represents the best achievable parallel execution time, given the neglected costs and the maximum number of processors used.

Effect of Communication and Mapping

If the execution time and speedup results indicate that a parallel implementation is worthwhile, the user can investigate the effects of changes in hardware and architectural parameters, as well as the consequences of using different mapping approaches and simulation protocols.
Figure 5. The result of the trace analysis under the conservative protocol and default architectural parameters.
This can be done in a hardware parameter dialog window, shown in Figure 6. By default, this window opens with the number of processors set to the number of processes and with a fully-connected interconnection topology having zero communication cost; thus the best achievable performance is the starting point. The user has the option to change the value of several parameters, such as the relative speed of the processors, the interconnection system and its topology, the message startup time, the per-word transfer time, and the per-hop time.

Communication Cost

The analyser uses the cost model Tcomm = Ts + kTw + rTh for message communication, where Ts is the message startup time, Tw is the per-word transfer time, Th is the per-hop time, k is the number of words in the message and r is the number of routers the message has to pass through (if the interconnect is switch-based). Message startup time is often the most important parameter in a parallel system. Its value can range from approximately a few µs (the fastest dedicated parallel computers) to approximately 600 µs (standalone workstations connected by a TCP/IP network) [6]. Since the per-word transfer and per-hop times are normally sub-microsecond values, it is easy to see how startup time can dominate message transfer time. By changing the value of the startup parameter, the user can quickly identify how much additional delay will be introduced by the target communication hardware, and decide on the range of acceptable values for Ts. Should the user consider the values of Tw and/or Th important, they can be adjusted similarly. The effect of the changes is computed and displayed immediately.
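The cost model is simple enough to state directly in code. The parameter values below are illustrative order-of-magnitude figures in the ranges discussed above, not measurements.

```python
# Sketch of the Tcomm = Ts + k*Tw + r*Th message cost model.
def comm_cost(ts, tw, th, k, r=0):
    """Message transfer time: startup + per-word transfer + per-hop routing.
    ts, tw, th in seconds; k = words in message; r = routers traversed."""
    return ts + k * tw + r * th

# A 100-word message on workstation-class TCP/IP hardware: the 600 us
# startup dominates the sub-microsecond per-word term.
lan = comm_cost(ts=600e-6, tw=0.1e-6, th=0.0, k=100)          # ~610 us
# The same message on a dedicated machine over a 5-hop switched route.
mpp = comm_cost(ts=5e-6, tw=0.05e-6, th=0.5e-6, k=100, r=5)   # ~12.5 us
print(lan, mpp)
```

Even this toy comparison shows roughly a fifty-fold gap dominated almost entirely by Ts, which is why the dialog exposes startup time as the first parameter to vary.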
Figure 6. Dialog window for changing architectural and hardware parameters.
It is also possible to change the type of interconnect. Choices include bus and switch-based interconnects and various point-to-point topologies, so the additional delays generated by contention or increased message distance can be studied. Figure 7 shows our example system on a line topology with large startup time values. By comparing the results with Figure 5, the increase in communication time can be seen.

Mapping

In many cases we may not have as many processors as processes (very large systems), or we could perhaps achieve the same performance using fewer of them. Changing the number of processors is closely related to the problem of mapping processes to processors. It can be seen in Figure 4, for example, that there is no parallelism between processes P1 and P5, or between P3 and P4: relative to one another, they execute their events sequentially. Thus, without degrading performance, we could map P1 and P5 onto the same processor, and similarly P3 and P4. This way, we can solve the same problem in the same time with only three processors, as shown in Figure 8. We could also try to map P1 and P2 onto the same processor; this would only marginally reduce performance but leave us with a system of two processors.
Figure 8. Parallel execution with process P5 mapped onto processor 1 and process P4 mapped onto processor 2. The execution time is the same as in Figure 5.

Figure 7. Parallel execution on a line topology with very large message startup time.

For large systems, unfortunately, finding a suitable mapping is far more difficult. It has been shown that the general mapping problem is NP-hard [7]. Many polynomial-time heuristic methods have been proposed, based on either bin-packing or some node-merging algorithm. One of the best-known mapping algorithms is MULTIFIT [8], which has been extended to take inter-task communication costs into account for bus-based parallel systems [9]. The analyser includes options for performing manual mapping, or for using MULTIFIT, MULTIFIT-COM and some other heuristics.

The Current Implementation

The analyser is an object-oriented program written in C++ with a visual front end implemented in Visual C++. An analyser class hierarchy has been developed that enables us to derive specific parallel simulation protocols as well as interconnects very effectively. The central data structures of the analyser are an event set (a hash table) that stores all events generated during the sequential simulation, an event queue that stores events in timestamp order for later analysis, and temporary event buffers for storing pending messages (messages waiting to be delivered or processed).
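The MULTIFIT idea [8] can be sketched as a binary search over processor capacity, each candidate capacity tested with first-fit-decreasing packing. The process load values below are hypothetical; the real tool also offers manual mapping and the communication-aware MULTIFIT-COM variant, which this sketch does not model.

```python
# Sketch of a MULTIFIT-style process-to-processor mapping heuristic.
def ffd_fits(loads, capacity, n_proc):
    """First-fit-decreasing: can all loads be packed into n_proc bins?"""
    bins = [0.0] * n_proc
    for load in sorted(loads, reverse=True):
        for i in range(n_proc):
            if bins[i] + load <= capacity:
                bins[i] += load
                break
        else:
            return False  # no bin could take this load
    return True

def multifit(loads, n_proc, iters=20):
    """Binary-search the smallest capacity (makespan estimate) that FFD packs."""
    lo = max(max(loads), sum(loads) / n_proc)      # no schedule can beat this
    hi = max(max(loads), 2 * sum(loads) / n_proc)  # FFD always fits here
    for _ in range(iters):
        mid = (lo + hi) / 2
        if ffd_fits(loads, mid, n_proc):
            hi = mid
        else:
            lo = mid
    return hi

# Five hypothetical process loads mapped onto two processors.
print(multifit([9, 7, 6, 5, 5], 2))  # converges near the balanced split of 16
```

MULTIFIT's appeal for the analyser is that it is polynomial-time yet comes with a known worst-case bound relative to the optimal makespan, making it a reasonable default when manual mapping is impractical.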
The analyser works from an ASCII-format trace file with the following line syntax:

current process ID, current event ID, source process ID, timestamp, sim event exec time, real event exec time, []
The last, optional field holds the list of future events generated by the current event. As the lines of the trace file are parsed and processed, incomplete future events are stored in the event set; if a processed current event is already in the set as a future event, the missing pieces of information are added to it. By the end of file processing all events have the required data, so subsequent analyses can be run from memory, which makes them more efficient.
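A parser for this line format might look like the following sketch. The field names are taken from the syntax description above; the dictionary layout and example values are our own illustrative assumptions, and the real tool (written in C++) will differ in detail.

```python
# Sketch: parsing one line of the comma-separated ASCII trace format.
def parse_trace_line(line):
    """Fields: current process ID, current event ID, source process ID,
    timestamp, sim event exec time, real event exec time,
    then an optional list of future events scheduled by this event."""
    fields = [f.strip() for f in line.split(",")]
    return {
        "proc": fields[0], "event": fields[1], "src": fields[2],
        "timestamp": float(fields[3]),
        "sim_time": float(fields[4]),   # simulated execution time
        "real_time": float(fields[5]),  # measured wall-clock time
        "future": fields[6:],           # IDs of events this event schedules
    }

r = parse_trace_line("P1, e2, P1, 2.0, 3.0, 0.8, e3, e6")
print(r["timestamp"], r["future"])  # 2.0 ['e3', 'e6']
```

Each parsed record either creates a new entry in the event set or completes a previously seen incomplete future event, exactly as described in the text above.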
4. CONCLUSION

We have described a post-mortem analyser developed to provide an easy and quick way to analyse and predict the behaviour of parallel simulation programs based on sequential run-time trace information. Our analyser algorithm currently uses the Chandy-Misra conservative simulation protocol with null messages. This algorithm can be altered to deal with different null-message generation policies and deadlock avoidance algorithms, and we are working on adding various null-message policy variations and further simulation protocol analyser algorithms to the program. A final and very important issue is the accuracy of our analyser/predictor. The current version ignores certain costs normally present in a parallel implementation, such as the time taken up by housekeeping, local queue management and task switching. We will test the analyser with real simulation programs and compare our results with actual runtime figures obtained on parallel computers in order to validate the tool.
5. ACKNOWLEDGEMENTS This work has been supported by the Hungarian Ministry of Education under Grant No. FKFP-0035/2000 and by the Hungarian Scientific Research Fund (OTKA) under Grant No. F 032155.
REFERENCES
[1] E.H. Page and R.E. Nance, Parallel Discrete Event Simulation: A Modelling Methodological Perspective, Virginia Polytechnic Inst. and State University Technical Report TR-94-05, 1994.
[2] E.H. Page, Beyond Speedup: PADS, the HLA and Web-Based Simulation, in Proc. PADS '99, 1999, 2–11.
[3] S. Srinivasan and P.F. Reynolds, On Critical Path Analysis of Parallel Discrete Event Simulations, Computer Science Report TR-93-29, 1993.
[4] C.-C. Lim, Y.-H. Low, B.-P. Gan, S. Jain, W. Cai, W.J. Hsu and S.Y. Huang, Performance Prediction Tools for Parallel Discrete-Event Simulation, in Proc. PADS '99, 1999, 148–155.
[5] Y.-C. Wong, S.-Y. Hwang and J.Y.-B. Lin, A Parallelism Analyzer for Conservative Parallel Simulation, IEEE Transactions on Parallel and Distributed Systems, 6(6):628–638, June 1995.
[6] J. Dongarra and T. Dunigan, Message-Passing Performance of Various Computers, University of Tennessee Technical Report CS-95-299, May 1996.
[7] S.H. Bokhari, On the Mapping Problem, IEEE Trans. Computers, Vol. C-30, No. 3, March 1981.
[8] E.G. Coffman Jr., M.R. Garey and D.S. Johnson, An Application of Bin-Packing to Multiprocessor Scheduling, SIAM J. Comput., 7 (1978), 1–17.
[9] C.M. Woodside and G.G. Monforton, Fast Allocation of Processes in Distributed and Parallel Systems, IEEE Trans. Parallel and Distributed Systems, Vol. 4, No. 2, February 1993.