Experiences in Verifying Parallel Simulation Algorithms

John Penix, Dale Martin, Peter Frey, Ramanan Radharkrishna, Perry Alexander and Philip A. Wilsey
Department of Electrical & Computer Engineering and Computer Science
The University of Cincinnati, Cincinnati, OH

August 29, 1997
Abstract
Parallelization is a popular technique for improving the performance of discrete event simulation. Due to the complex, distributed nature of parallel simulation algorithms, debugging implemented systems is a daunting, if not impossible, task. Developers are plagued with transient errors that prove difficult to replicate and eliminate. Recently, researchers at The University of Cincinnati developed a parallel simulation kernel, warped, implementing a generic parallel discrete event simulator based on the Time Warp optimistic synchronization algorithm. The intent was to provide a common base from which domain-specific simulators can be developed. Due to the complexity of the Time Warp algorithm and the dependence of many simulators on the simulation kernel's correctness, a formal specification was developed and verified for critical aspects of the Time Warp system. This paper describes these specifications, their verification, and their interaction with the development process.
1 Introduction

The Time Warp mechanism is an emerging technique for synchronizing parallel discrete event simulators [1, 2]. In a Time Warp synchronized simulator, the simulation objects (executing in parallel) exchange time-stamped event messages and execute optimistically, without strict enforcement of the causal order between simulation events. Thus, the protocol permits out-of-order event processing to occur and, whenever such processing does occur, the simulator is forced to rollback and reprocess the events in their correct causal order. Due to their parallel nature and weak synchronization semantics, Time Warp simulators are difficult to design and implement. Transient errors are difficult to replicate. Frequently, monitoring code inserted to pinpoint errors causes those errors to disappear. The sheer number of simulation events processed by a parallel simulator causes even low-probability errors to present themselves. The authors have been working with Time Warp simulation for over five years and this experience has repeatedly demonstrated these problems and motivated a desire to pursue alternate design approaches based on formal methods.

Support for this work was provided in part by the Advanced Research Projects Agency, contracts F33615-93-C-1315 and F33615-93-C-1316 monitored by Wright Laboratory and contract J-FBI-93-116 monitored by the Department of Justice. The authors also wish to thank Wright Labs and ARPA for their continuing support.
Figure 1: Time Warp Logical Process interactions.

This paper presents an application of formal methods to the specification and verification of a Time Warp parallel simulation system. Two models of the Time Warp system are constructed to verify: (i) causal ordering of event processing in the simulation system; and (ii) the correctness and monotonic nature of global virtual time. The first models event processing in a logical process. It is used to prove that at any given time, events have been processed in time order by the logical process. The second models the interaction of logical processes with the global virtual time manager. It is used to prove that global virtual time represents the minimum active timestamp in the simulation system. This result is used to show that fossil collection does not discard potentially relevant information and that the overall simulation activity makes temporal progress. Using Larch/C++, the Larch Shared Language, and the Larch Prover, specifications are developed and analyzed with respect to desired characteristics.
1.1 Time Warp-Based Parallel Simulation
A parallel simulation with distributed synchronization is generally organized as a set of simulation objects interacting with each other by exchanging time-stamped event messages [1]. These communicating objects are referred to as logical processes (Figure 1). The Time Warp mechanism is an optimistic synchronization protocol based on the virtual time paradigm [1, 2]. In a Time Warp simulation, no explicit synchronization occurs between the logical processes. The lack of explicit synchronization permits the parallel simulators to advance their local simulation times at different rates. Consequently, the possibility exists for incoming event messages to arrive with time-stamps in the simulated past of the receiving simulator. Such messages are called stragglers (or straggler messages) and receipt of a straggler message forces the simulator to rollback to an earlier time to process the straggler in its proper order.

In a Time Warp simulator, each logical process (Figure 2) operates as a distinct discrete event simulator, maintaining input and output event lists, a state queue, and a local simulation time (called Local Virtual Time or LVT). The state and output queues are required to support rollback processing. That is, upon receipt of a straggler message, the logical process must rollback to undo some work that has been done (the dotted lines in Figure 2 show premature processing being rolled back to the solid lines). Rollback involves two steps: (i) restoring the state to a time preceding the time-stamp of the straggler; and (ii) killing any output event messages that were erroneously sent (by sending anti-messages to nullify the prematurely sent output messages). After rollback, the events are then re-executed in their proper order.

Figure 2: Graphical description of a logical process (input queue, output queue, state queue, antimessages, and Local Virtual Time).

One important overhead associated with checkpointing state and event information for rollback is the memory space required for the saved data. This space can be freed only when global progress of the simulation advances beyond the (simulation) time at which the saved information is needed. The process of identifying and reclaiming this space is called fossil collection. The global time against which fossil collection algorithms operate is called the global virtual time (or GVT) and several algorithms for GVT estimation have been proposed [3, 4, 5, 6, 7, 8]. In addition to its use for fossil collection, GVT is also useful for deciding when irrevocable operations (such as I/O) can be performed and, in some instances, when the simulation has completed.
1.2 General Approach
A simulation system is operating correctly when: (i) it processes simulation events in causal order; and (ii) it makes temporal progress. To verify that these conditions hold for a Time Warp based simulation, two models were constructed and verified. The first models the arrival and processing of events in a logical process. Using this model it is shown that at any moment in time, a logical process has processed each event scheduled so far in time order. The second models the calculation of global virtual time (GVT). Using this model it is shown that GVT is the minimum time stamp in the system and that this value monotonically increases. From this result it follows that fossil collection based on GVT does not discard potentially useful information and that the system simulation makes temporal progress.
2 Event Processing Algorithm Specification

Two behaviors associated with the logical process were deemed important for verification: (i) messages and antimessages cancel in the input queue; and (ii) at any time, the logical process has processed events scheduled to that point in time order. To verify these characteristics, a simple model of the logical process was constructed. This model consisted of: (i) a basic model of events and antimessages; (ii) a signed priority queue modeling the input queue; and (iii) a rollback queue combining the input queue with a history stack of processed events.
The basic event model is captured in the BasicEventTrait specification:

    BasicEventTrait(Event, Time): trait
      assumes TotalOrder(Time)
      includes SignTrait(PosNeg)
      Event tuple of sender: Int, id: Int
      introduces
        __.sign        : Event -> PosNeg
        __.receiveTime : Event -> Time
        __.sendTime    : Event -> Time
        __.dest        : Event -> Int
        opposite       : Event, Event -> Bool
        anti           : Event -> Event
        inbefore, inafter, outafter : Event, Event -> Bool
      asserts \forall e1, e2: Event
        opposite(e1, e2) == e1 = e2 /\ e1.sign != e2.sign;
        opposite(e1, anti(e1));
        inbefore(e1, e2) == e1.receiveTime < e2.receiveTime;
        outafter(e1, e2) == e1.sendTime > e2.sendTime
      implies \forall e: Event
        e.sendTime = anti(e).sendTime

The rollback queue (RQueue) specification states two implications, which are the properties discharged in the verification below:

    implies \forall r: RQueue, e1, e2: Event
      % implication 1
      rollback(go(r)) = r;
      % implication 2
      ordered(r.stack)

Figure 6: Input queue with rollback mechanism.
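For readers more familiar with the implementation language than with LSL, the following C++ sketch mirrors the intent of the BasicEventTrait specification above: an event carries an identity, a sign, and send/receive timestamps, and anti() yields the corresponding antimessage. The struct layout and function names are illustrative assumptions, not the warped kernel's actual event interface.

    #include <cassert>

    enum class Sign { Positive, Negative };

    // Mirrors BasicEventTrait: an event is identified by (sender, id) and
    // carries a sign plus send/receive timestamps and a destination.
    struct Event {
        int    sender;
        int    id;
        int    dest;
        Sign   sign;
        double sendTime;
        double receiveTime;
    };

    // anti(e): same identity and timestamps, opposite sign.
    Event anti(const Event& e) {
        Event a = e;
        a.sign = (e.sign == Sign::Positive) ? Sign::Negative : Sign::Positive;
        return a;
    }

    // opposite(e1, e2): same event identity but differing sign, as in the trait.
    bool opposite(const Event& e1, const Event& e2) {
        return e1.sender == e2.sender && e1.id == e2.id && e1.sign != e2.sign;
    }

    // inbefore/outafter order events by receive and send time, respectively.
    bool inbefore(const Event& e1, const Event& e2) { return e1.receiveTime < e2.receiveTime; }
    bool outafter(const Event& e1, const Event& e2) { return e1.sendTime > e2.sendTime; }

    int main() {
        Event e{/*sender*/0, /*id*/42, /*dest*/1, Sign::Positive, /*send*/3.0, /*recv*/5.0};
        assert(opposite(e, anti(e)));            // corresponds to opposite(e1, anti(e1))
        assert(anti(e).sendTime == e.sendTime);  // corresponds to the implied equation
        return 0;
    }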
2.4 Verification of Time Ordering
The logical process is modeled as a rollback queue consisting of a signed priority queue and a stack. As events are processed, they are taken from the signed priority queue and placed on the history stack. Thus, the order of the events on the stack will be the order in which they were processed. Showing that events are processed in the correct order therefore requires proving that the elements of the stack are in the correct order. Showing message cancellation requires proving that an event message and its associated antimessage cannot be found together in any input queue. For brevity, only the ordered history stack proof is shown here as an example verification. Complete specifications and proof codes for all verifications are available from the authors.

The time ordering proof is done at the level of the rollback queue defining a logical process. It is assumed that the priority queue section of the rollback queue is properly ordered, based on the implication defined and verified in the SignedPriorityQueue trait definition. Based on this assumption, the proof can be divided into two sequential goals. The first is to show that the head of the priority queue has a timestamp value greater than or equal to the timestamp of the event message on the top of the history stack. Based on this fact, the second is to show that the top of the history stack has a time-stamp not less than the time-stamps of the messages on the remainder of the stack.
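As a concrete (and deliberately simplified) illustration of the model used in this proof, the C++ sketch below keeps pending events in a priority queue ordered by receive time and pushes processed events onto a history stack; historyOrdered corresponds to the ordered(r.stack) obligation. Rollback itself is omitted, and all class and member names are hypothetical.

    #include <cassert>
    #include <cstddef>
    #include <queue>
    #include <vector>

    struct Event {
        double receiveTime;
        int    id;
    };

    // Comparator so that the event with the smallest receive time is processed first.
    struct LaterFirst {
        bool operator()(const Event& a, const Event& b) const {
            return a.receiveTime > b.receiveTime;
        }
    };

    // Rollback-queue model: a priority queue of pending events plus a history
    // stack of processed events (the rollback operation is not modeled here).
    class RollbackQueue {
    public:
        void schedule(const Event& e) { pending_.push(e); }

        void processNext() {
            assert(!pending_.empty());
            history_.push_back(pending_.top());   // processed events go onto the stack
            pending_.pop();
        }

        // ordered(r.stack): each processed event's timestamp is >= its predecessor's.
        bool historyOrdered() const {
            for (std::size_t i = 1; i < history_.size(); ++i) {
                if (history_[i].receiveTime < history_[i - 1].receiveTime) return false;
            }
            return true;
        }

    private:
        std::priority_queue<Event, std::vector<Event>, LaterFirst> pending_;
        std::vector<Event> history_;
    };

    int main() {
        RollbackQueue lp;
        lp.schedule({5.0, 1});
        lp.schedule({2.0, 2});
        lp.schedule({9.0, 3});
        for (int i = 0; i < 3; ++i) lp.processNext();
        assert(lp.historyOrdered());   // events were processed in timestamp order
        return 0;
    }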
3 Global Virtual Time Algorithm Specification

Two behaviors associated with GVT calculation are deemed important for verification: (i) fossil collection does not discard potentially useful state information; and (ii) the simulation makes progress. To verify these characteristics, a simple model of the GVT Manager and the logical processes was constructed. This model consists of: (i) a model for maintaining LVT^1 information and calculating GVT; (ii) a model for communicating between logical processes and the GVT Manager; and (iii) a model of the simulation system state.

Both verification obligations can be discharged by observing the state of the simulation system. In each system state, GVT must be the minimum of all LVT values to assure that discarding state information earlier than GVT is safe. This is a matter of showing that the GVT calculation algorithm produces a minimum value and that the value never decreases. This result additionally guarantees that the system simulation makes progress. If GVT increases monotonically, then all LVT values are also increasing. Thus, all logical processes are making progress.

The GVT calculation algorithm is operating correctly when GVT monotonically increases. Each increase in GVT value represents movement forward in the slowest process' simulation time and thus progress towards a completed simulation. Although individual logical processes may rollback, GVT should never decrease, as it represents a lower bound for LVT values. Intuitively, the smallest LVT value represents the earliest event active in the simulation. Allowing GVT to decrease would cause the entire system to rollback, eliminating the guarantee that "progress" is being made. Furthermore, since GVT is used for fossil collection, allowing GVT to decrease violates the assumption that information from states earlier than GVT will not be required later. As in the previous verification activity, all obligations were verified using the Larch Prover. Details of this verification activity have been reported previously (see [9]) and are omitted here for brevity.

^1 Please note that the term LGVT used in the specification and the term LVT used in the text are synonymous. The different terms result from terminology used in the description of the specific GVT calculation algorithm explored in this verification.
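To make the minimum-and-monotonicity argument concrete, the following C++ sketch models a centralized GVT manager that takes the minimum of the reported LVT values and checks that successive estimates never decrease. It is illustrative only: the pGVT algorithm verified in [9] is distributed and must also account for messages in transit, and the class and method names here are invented.

    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Hypothetical centralized GVT manager used only to illustrate the verified
    // properties: GVT is the minimum reported LVT, and estimates never decrease.
    class GVTManager {
    public:
        explicit GVTManager(std::size_t numLPs) : lvt_(numLPs, 0.0), gvt_(0.0) {}

        // A logical process reports its current LVT (in a real algorithm, the
        // minimum timestamp of its unacknowledged messages is folded in too).
        void report(std::size_t lp, double lvt) { lvt_[lp] = lvt; }

        // Recompute GVT as the minimum active timestamp in the system.
        double estimate() {
            double newGvt = *std::min_element(lvt_.begin(), lvt_.end());
            assert(newGvt >= gvt_);   // monotonicity: GVT never decreases
            gvt_ = newGvt;
            return gvt_;
        }

    private:
        std::vector<double> lvt_;
        double gvt_;
    };

    int main() {
        GVTManager mgr(3);
        mgr.report(0, 4.0); mgr.report(1, 7.0); mgr.report(2, 3.0);
        assert(mgr.estimate() == 3.0);
        mgr.report(2, 10.0);              // the slowest LP advances
        assert(mgr.estimate() == 4.0);    // GVT moves forward, never backward
        return 0;
    }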
4 Discussion
4.1 Verification Results
The verification results show that the logical processes and the global virtual time calculation and management are specified correctly. Logical process input queues were shown to: (i) schedule and process events in time order; and (ii) cancel message/antimessage pairs correctly. Thus, the logical process preserves causal ordering even when out-of-order events must be scheduled and processed. The GVT manager and calculation algorithm were shown to: (i) generate a GVT value that represents a minimum timestamp value, given that LVT is calculated correctly in each logical process; and (ii) generate monotonically increasing GVT values. From the latter result, it can be concluded that if all logical processes make temporal progress, then the overall simulation makes progress. If logical processes make progress, their LVT values increase and thus GVT must also increase. It cannot be concluded that all Time Warp simulations make progress, because even a single logical process "stuck" in time will cause GVT to remain constant. However, fossil collection will work correctly in such situations.
4.2 Specification and Implementation
Due to time constraints placed on the warped implementation, it was necessary for system implementation to begin before specification and verification were completed. Although initial lsl and Larch/C++ specifications were completed before implementation began, significant verification activities had to be completed while implementation progressed. Initially this was believed to be problematic; however, working the verification and implementation activities in parallel had surprising, positive results.

The first and most important result was the interaction between specifiers and developers. Proximity of the people performing the specification and implementation activities facilitated frequent interaction. Additionally, the specifiers were not parallel simulation experts and required frequent input from the developers. The precision required to write the formal specifications caused the specifiers to ask detailed questions of the developers. In answering these questions, the developers considered and resolved details that previously went unnoticed.

An example of a detail overlooked by developers is in the specifics of processing antimessages. The only function of an antimessage is to cancel an existing message. The following question was posed by the specifiers: "What should occur if an antimessage arrives at a logical process before its associated message?" The response from the developers was that this situation did not occur. The specifiers then asked what system properties ensured that this was in fact the case. After a brief, but heated, discussion, no such properties could be identified. In ideal operation, antimessages will indeed not arrive before their associated messages, but transport delays and errors could cause such a situation to occur. Although such events have extremely low probability, the logical process implementations operate on the order of 50,000 events/second. In long simulations involving many logical processes communicating over local area networks, the likelihood of such an error occurring is not small. Additionally, if such a situation did occur, it would represent a transient error that is extremely difficult to replicate and debug. Such errors are of exactly the type that formal analysis and verification should catch. Because of the efficiency required in logical process code, assumptions such as proper message/antimessage ordering were not uncommon. Several additional assumptions and errors were discovered and discussed during the development activity. Mitigating these problems contributed to the elimination of numerous transient bugs in the implemented systems.
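To make the issue concrete, the sketch below shows one way an input queue could tolerate an antimessage that arrives before its positive message: the orphan antimessage is retained so that the pair still annihilates when the positive message eventually arrives. This is an illustrative policy, not necessarily the one adopted in warped, and the class and function names are invented for the example.

    #include <cstddef>
    #include <iostream>
    #include <list>

    struct Event {
        int  sender;
        int  id;
        bool negative;   // true for an antimessage
    };

    // If the matching opposite message is already queued, annihilate the pair.
    // Otherwise keep the "orphan" antimessage so it can cancel its positive
    // message on arrival instead of being silently dropped.
    class InputQueue {
    public:
        void insert(const Event& e) {
            auto match = findOpposite(e);
            if (match != events_.end()) { events_.erase(match); return; }   // annihilation
            events_.push_back(e);
        }

        std::size_t size() const { return events_.size(); }

    private:
        std::list<Event>::iterator findOpposite(const Event& e) {
            for (auto it = events_.begin(); it != events_.end(); ++it) {
                if (it->sender == e.sender && it->id == e.id && it->negative != e.negative)
                    return it;
            }
            return events_.end();
        }

        std::list<Event> events_;
    };

    int main() {
        InputQueue q;
        q.insert({0, 7, true});    // antimessage arrives first (out of order)
        q.insert({0, 7, false});   // its positive message arrives later...
        std::cout << q.size() << '\n';   // ...and the pair still annihilates: prints 0
        return 0;
    }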
4.3 Resource Savings
It is difficult to assess the savings gained by undertaking the specification activity. Naively, it can be claimed that performing specification and verification significantly reduced implementation time for the system. However, supporting such a statement with quantitative data is difficult. The warped kernel is not the first parallel simulator developed by this group, so previous experience contributed substantially to improved development time. However, warped is the first attempt at building a general purpose simulation kernel, so differences in the development goals did exist. Furthermore, time to working prototype is not necessarily an accurate measure of productivity, as prototypes exist at widely different maturity levels.

What can be said definitively is that components of the warped kernel were running after three months of design and implementation. A completed system was available (and publicly released [10]) after eight months. Furthermore, both the early components and the implemented system were extremely stable for their relatively early development stage. Experience implementing parallel simulators suggests that this represents a significant time savings.

The formal specifications continue to be used after the initial verification. Specifically, the specifications have been used to verify optimization techniques prior to their introduction into the system [11]. The formal specification activity also caused the developers to slow down and think about abstract decisions. Because other simulators had been developed, the developers tended to concentrate on low level "fixes" rather than high level design decisions. The specifications forced the developers to think at a more abstract level.
4.4 Formal Methods and Engineering
In this exercise, the formalisms were used like traditional engineering mathematics. The Larch models were used to predict the behaviors of critical algorithms and components in a manner similar to traditional engineering domains. A complete mathematical model of Time Warp was not developed. Instead, models for non-routine system elements were developed and verified. This is congruent with the use of continuous mathematics in traditional engineering domains to approximate the behavior of physical systems under design or construction.

Although Larch/C++ specifications were written for all significant system modules, their role in the activity was limited to specification and analysis using the Larch/C++ type checker. No verification beyond syntax and type checking was performed. What the interface specifications did contribute was a common communication mechanism for specifiers and developers. The specifiers found it much simpler to learn domain-specific information by centering their questions on the interface specifications. The developers found the interface specifications very useful in organizing and finding pertinent shared language specifications. Additionally, the interface specifications gave the developers intuition for understanding the formal specifications.
4.5 Tools and Practitioners
All specifications were written and verified using freely available tools without modification or enhancement. Larch Shared Language specifications were sort checked and Larch Prover code generated using lsl version 3.1beta3. Verification was performed using lp version 3.1. Larch/C++ specifications were sort checked using version 4.1 of the Larch/C++ tool set. Some specifications in this paper have been modified from their original form for purely aesthetic reasons. All original specification and proof sources are available from the authors.
The specifications were written and verified by three first-year graduate students with some background in formal methods. The students had taken two courses in formal methods using Z [12] and Larch. The Larch course provided an introduction to the Larch specification style and the use of the Larch Prover. Some outside study of algebraic specification, particularly constructive specification, was undertaken by one student during the specification activity. All of the students had traditional electrical or computer engineering backgrounds. These results provide further evidence that formal specification is approachable by traditionally trained engineers given some additional study and preparation. Furthermore, these results suggest that effective application of formal methods does not require complete retraining in discrete mathematics.
5 Conclusions

This paper described the use of the Larch language family to aid the design of a parallel simulation system. First, the specification goals were presented and justified. Two system components were selected as critical for correct operation: (i) the logical process; and (ii) the GVT manager. Important correctness characteristics for each model were identified: namely, the correct time ordering of event processing and antimessage cancellation in the logical process, and the correct calculation and monotonic nature of GVT in the GVT manager. Models were developed for each using the Larch Shared Language and Larch/C++, and their characteristics were verified using the lsl checker, the Larch Prover, and to a lesser extent the lcpp parser.

If Yuri Manin's statement that "a good proof is one that makes us wiser" [13] is to be believed, then this specification and verification activity was highly successful. Formal specification and verification were used in an engineering activity in a manner congruent with other engineering disciplines. A pure, complete mathematical model of the Time Warp system was not developed and was not an intended outcome. Instead, critical components were selected, mathematically modeled, and verified to assist with engineering design decision making. The resulting success provides anecdotal evidence that formal methods can be pragmatically applied in a manner similar to other, more traditional engineering disciplines.
References

[1] R. Fujimoto. Parallel discrete event simulation. Communications of the ACM, 33(10):30-53, October 1990.

[2] D. Jefferson. Virtual time. ACM Transactions on Programming Languages and Systems, 7(3):405-425, July 1985.

[3] H. Bauer and C. Sporrer. Distributed logic simulation and an approach to asynchronous GVT calculation. In 6th Workshop on Parallel and Distributed Simulation, pages 205-208. Society for Computer Simulation, January 1992.

[4] S. Bellenot. Global virtual time algorithms. In Distributed Simulation, pages 122-127. Society for Computer Simulation, January 1990.

[5] L. M. D'Souza, X. Fan, and P. A. Wilsey. pGVT: An algorithm for accurate GVT estimation. In Proc. of the 8th Workshop on Parallel and Distributed Simulation (PADS 94), pages 102-109. Society for Computer Simulation, July 1994.

[6] Yi-Bing Lin and E. Lazowska. Determining the global virtual time in a distributed simulation. In 1990 International Conference on Parallel Processing, pages III-201-III-209, 1990.

[7] F. Mattern. Efficient algorithms for distributed snapshots and global virtual time approximation. Journal of Parallel and Distributed Computing, 18(4):423-434, August 1993.

[8] A. I. Tomlinson and V. K. Garg. An algorithm for minimally latent global virtual time. In Proc. of the 7th Workshop on Parallel and Distributed Simulation (PADS), pages 35-42. Society for Computer Simulation, July 1993.

[9] B. Kannikeswaran, R. Radharkrishna, P. Frey, P. Alexander, and P. A. Wilsey. Formal specification and verification of the pGVT algorithm. In Proceedings of Formal Methods Europe '96, volume 1051 of Lecture Notes in Computer Science, Oxford, UK, March 1996.

[10] D. E. Martin, T. McBrayer, and P. A. Wilsey. warped: A time warp simulation kernel for analysis and application development, 1995. (Available on the WWW at http://www.ece.uc.edu/~paw/warped/.)

[11] K. Umamageswaran, K. Subramani, P. A. Wilsey, and P. Alexander. Formal specification and verification of the rollback relaxation algorithm. Submitted to Journal of Systems Architecture, Special Issue on Parallel and Distributed Simulation, August 1996.

[12] J. M. Spivey. The Z Notation: A Reference Manual. International Series in Computer Science. Prentice Hall, New York, NY, 2nd edition, 1992.

[13] Y. Manin. A Course in Mathematical Logic. Springer-Verlag, 1977.