Published in the Proceedings of the 12th Workshop on Parallel and Distributed Simulation, PADS-1998. © 1998 IEEE. Personal use of this material is permitted. However, permission to reprint or republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Optimizing Communication in Time-Warp Simulators
Malolan Chetlur, Nael Abu-Ghazaleh, Radharamanan Radhakrishnan, and Philip A. Wilsey
Dept. of ECECS, PO Box 210030, University of Cincinnati, Cincinnati, OH 45221–0030
Abstract

In message passing environments, the message send time is dominated by overheads that are relatively independent of the message size. Therefore, fine-grained applications (such as Time-Warp simulators) suffer high overheads because of frequent communication. In this paper, we investigate the optimization of the communication subsystem of Time-Warp simulators using dynamic message aggregation. Under this scheme, Time-Warp messages with the same destination LP, occurring in close temporal proximity, are dynamically aggregated and sent as a single physical message. Several aggregation strategies that attempt to minimize the communication overhead without harming the progress of the simulation (because of messages being delayed) are developed. The performance of the strategies is evaluated for a network of workstations and an SMP, using a number of applications that have different communication behavior.

1 Introduction

In distributed environments, the performance of the communication subsystem has a significant impact on the overall performance of the application. The cost of communication operations is significantly higher than that of computation operations because of the overhead involved in preparing a message and the electrical delay necessary for signal propagation across the bandwidth-limited physical network links. Under such conditions, it is important to minimize the frequency of communication, but not necessarily the size of the messages, in order to arrive at an efficient implementation.

Distributed simulators synchronized using the Time-Warp model [9] require frequent communication. That is, in addition to messages required by the model to communicate events, messages are generated by the kernel to implement synchronization (anti-messages in the case of rollbacks) and to estimate Global Virtual Time (GVT). Hence, the communication overhead suffered during simulation is much larger than the time spent in computation, and any improvement in the communication cost will be reflected in the performance of the simulation. Considerable effort has been expended on optimizing Time-Warp simulators. Optimizations that address the communication overhead have focused almost exclusively on minimizing the number of messages generated by the simulation. For example, improved partitioning strategies have been developed to minimize the number of messages generated by the application [1, 2, 10, 18]. Similarly, several optimizations that minimize messages generated by the kernel have been studied (lazy cancellation [16], efficient GVT calculations [12], bounding optimism [5, 15]). In contrast, there have been no efforts to optimize the operation of the communication subsystem of the simulator kernel.

The work described in this paper optimizes the operation of the communication subsystem by matching the message behavior to the characteristics of the communication fabric. More precisely, we use dynamic message aggregation to collect consecutive messages with the same destination process and deliver them using a single physical message. Thus, the number of communication messages is reduced and the effective granularity of the simulation is increased. Arbitrarily delaying messages may harm the performance of the receiving process. For example, if an event message has a time stamp lower than the local time of the receiving process, the ongoing computation at the receiving process is erroneous (until the message is received and a rollback is triggered); delaying such a message is likely to be counterproductive. Dynamic aggregation must balance the gain from aggregation against the potential harm in delaying the messages. Moreover, the decisions made during aggregation have to be lightweight and, therefore, be based on information available locally. In this paper, several policies for dynamic message aggregation are suggested. The policies are incorporated into the communication manager module of WARPED [11] — a parallel discrete event simulation kernel using MPI [8]. The performance of the simulation using message aggregation is studied using representative simulation models. We investigate the effect of the communication fabric on the success of message aggregation. More precisely, message aggregation performs runtime matching of the communication behavior of the application with the abilities of the communication system. Accordingly, we study the performance of the policies for two different communication networks (a symmetric multiprocessor (SMP) and a network of workstations (NOW)).

The remainder of this paper is organized as follows. Section 2 overviews optimistic parallel discrete event simulation (Time Warp). Section 3 explains the different cost components of a message send in a message passing environment. Section 4 overviews some related approaches to reducing the communication cost in a Time-Warp parallel simulation. In Section 5, dynamic message aggregation in Time Warp is discussed in more detail. Section 6 presents some aggregation policies and outlines their implementation. In Section 7, the performance of the aggregation policies is studied for two simulation models. Finally, Section 8 presents some concluding remarks.

(Support for this work was provided in part by the Advanced Research Projects Agency under contract numbers DABT63–96–C–0055 and J–FBI–93–116.)
2 Time Warp

In a Time-Warp synchronized discrete event simulation, virtual time is used to model the passage of time in the simulation [6]. Changes in the state of the simulation occur as events are processed at specific virtual times. In turn, events may schedule other events at future virtual times. The virtual time defines a total order on the events of the system. The simulation state (and time) advances in discrete steps as each event is processed. The simulation is executed via several simulator processes, called Logical Processes (LPs). Each LP has an associated event queue and maintains a clock keeping its Local Virtual Time (LVT). LPs interact with each other by exchanging time-stamped event messages. The LPs must be synchronized in order to maintain the causality of the simulation; although each LP processes its local events in correct time order, events are not globally ordered. A causality error arises if an LP receives a message with a time stamp earlier than its LVT value (a straggler message). In order to allow recovery, the state of the LP and the output events generated are saved in history queues as each event is processed. When a straggler message is detected, the erroneous computation must be undone — a rollback occurs. The rollback process consists of: (i) restoring the state of the LP to a state prior to the straggler message time stamp; and (ii) canceling erroneously sent output messages (by sending an anti-message with the earliest erroneous message time stamp and nullifying subsequent messages from the same source).

The communication behavior of distributed optimistically synchronized simulations is highly dynamic and unpredictable. For example, when a rollback occurs, a burst of messages canceling previous outputs is generated. Depending on the relative LVT values of the LPs, these messages may result in rolling back the target LPs. It is nearly impossible for the model builder to forecast, let alone control, the communication behavior. In general, the communication messages exchanged among the LPs are small in size; event information, as well as control messages, contain little data per message. In addition, the simulated objects tend to be of low granularity. More specifically, little computation is required before outputs change and events are posted to other LPs.

3 MPI and Communication Cost

The Message Passing Interface (MPI) [8] is the de facto standard for message passing primitives on distributed systems. MPI provides a uniform high-level interface to the underlying hardware, allowing programmers to write portable programs without compromising efficiency and functionality. The performance of MPI has been studied for different MPI implementations, and on different platforms [19]. These studies concluded that the cost of communication in a message passing environment can be divided into two components: (i) an overhead that is independent of the message size (s), and (ii) a cost that varies with the size of the message (n · r, where n is the size of the message and r is the variable cost of a unit-size message) [14, 13, 7]. The overhead time includes the time to: (i) context switch to the kernel, (ii) reserve buffers and package the message, and (iii) set up the physical network path. Conversely, the variable cost is the time required to send the message through the network, taking into account the channel bandwidth as well as any software overheads that scale with the size of the message (e.g., splitting the message into packets). Typically, the static overhead cost (s) is large (up to two orders of magnitude higher than r). Therefore, it is more efficient to communicate two data items using a single physical message than to use two messages. Message aggregation is based on this observation. More precisely, an application should identify messages that occur in close proximity and are targeted towards the same destination, and group them into a single message. By aggregating two messages of sizes n1 and n2, we reduce the communication cost from 2s + r(n1 + n2) to s + r(n1 + n2), for a gain of s. The more messages aggregated, the higher the efficiency of the message passing environment.
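This cost model lends itself to a short worked example. The C++ sketch below computes the aggregation gain for two messages; the constants s and r are illustrative assumptions, not measurements from the paper or from [13, 14, 19].

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Linear send-cost model from Section 3: sending n bytes costs s + n*r.
// The constants below are illustrative placeholders only.
constexpr double kS = 100.0;  // fixed per-message overhead (microseconds)
constexpr double kR = 0.01;   // variable cost per byte (microseconds)

double sendCost(std::size_t bytes) {
    return kS + kR * static_cast<double>(bytes);
}

int main() {
    // Two 64-byte event messages, sent separately versus aggregated.
    std::vector<std::size_t> messages = {64, 64};
    double separate = 0.0;
    std::size_t totalBytes = 0;
    for (std::size_t n : messages) {
        separate += sendCost(n);  // accumulates 2s + r(n1 + n2)
        totalBytes += n;
    }
    double aggregated = sendCost(totalBytes);  // s + r(n1 + n2)
    // Aggregating k messages saves (k - 1) * s of fixed overhead.
    std::printf("separate: %.2f us  aggregated: %.2f us  gain: %.2f us\n",
                separate, aggregated, separate - aggregated);
    return 0;
}
```

With s two orders of magnitude larger than r, the fixed overhead dominates, so the saving of (k − 1)·s grows directly with the number of messages aggregated.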
4 Related Work

Message aggregation at the application level is a well-known technique for optimizing performance. If two data items need to be sent along a communication channel, it is more efficient to send them using a single physical message. Gropp, in a lecture on optimizing application performance using MPI, listed message aggregation as an important optimization for minimizing the effect of the communication overhead [7]. When message aggregation is conducted at the application level, the implementor examines the program and aggregates messages that occur in close proximity in the source. For example, in the simulation kernel implementation for the WARPED simulator, some low-priority control information is often delayed and piggy-backed on the next event message sent on the same outgoing channel [11]. Application-level message aggregation is only possible if the application has direct control of the communication, and if the communication behavior is statically known. In addition, it requires considerable effort on the part of the application writer.

Felten identified the problem of the high overhead associated with communication primitives — a problem he calls the "communication gap" [4]. Felten investigated protocol compilation, where a communication protocol specific to the application is compiled with it for improved performance. For several applications, an average speedup of 7% was reported. Carothers et al. studied the effect of communication on the efficiency of Time-Warp simulation [3]. They discovered that increasing communication delays significantly reduces performance. The degradation in performance was found to be greatest for applications where simulation objects have small computation granularity [3].

5 Message Aggregation

This paper explores the optimization of the communication module of Time-Warp simulators via dynamic message aggregation (DyMA). Using DyMA, the communication module for each LP collects application messages destined to the same LP in an aggregation buffer. Note that there is an aggregation buffer associated with each receiving LP; the number of buffers is equal to the number of LPs with which the sending LP communicates. Application messages are aggregated, and the aggregate is periodically sent as a single physical message according to some aggregation policy. At the receiving LP, the application messages are extracted from the received aggregate in their send order. The operation of the communication manager send module under DyMA is shown in Figure 1.

[Figure 1. Message Aggregation at Run-time: application messages from Process 0 pass through a message aggregation layer and are delivered as single MPI messages to Process 1 and Process 2.]

Aggregating messages dynamically is a more difficult problem than application-level aggregation because it operates without the benefit of domain-specific application information (including the communication pattern). Thus, in many instances, heuristic decisions must be made about when to delay messages in the hope of receiving more messages, and when to send the aggregate such that the delay does not harm the critical path of the application. Furthermore, these decisions must be made at little computational cost.

In Section 3, a simplified model of the message send costs was presented. The model forecasts that the higher the number of messages aggregated, the greater the reduction in the communication overhead per message. Ignoring the effect of delaying messages on the receiving LP, aggregating messages results in an effectively faster delivery of messages. Thus, the longer the messages are delayed, the greater the number of messages aggregated, and the greater the saving in communication overhead. We call this effect the Aggregation Optimistic Factor (AOF). The AOF is directly proportional to the rate of reception of messages; if this rate is high, a large number of messages can be aggregated without an excessive delay. While the AOF accounts for the potential benefit of aggregation, it does not take into account the harm caused by delaying the messages; delaying a message on the critical path of the simulation will likely harm performance (Note 1). This effect, called the Aggregation Pessimistic Factor (APF), is proportional to the time that the messages are delayed (Note 2). Successful aggregation policies must balance the two factors while maintaining a low overhead for implementing the policy, such that the benefit of aggregation is not offset by decision overheads. The aggregation overhead is a fixed cost that depends on the aggregation policy, as well as the costs necessary to aggregate and de-aggregate messages. Note that both of these factors vary with the nature of the application, and may change dynamically within the lifetime of a simulation.

Note 1: Delaying the messages may improve performance as a form of limiting optimism [17].
Note 2: While each application message arrives at the destination LP in a shorter time on average, the simulation progress rate is also faster at the receiving LP because it spends less time communicating. Thus, the proportionality between the time that a message is delayed and the potential harm to the receiving LP persists.

In addition to using estimates of the AOF and APF to control aggregation decisions, the communication module can use local LP information to decide whether or not to send the aggregate. For example, delaying the send of anti-messages increases the erroneous work done by recipient LPs. Hence, the detection of an anti-message during aggregation increases the APF (in our implementations, we send the aggregate when an anti-message is detected). Similarly, an idle LP (no events in the input queue) forecasts that the message generation rate is going to drop. Usually an LP is idle because it has not received any messages from other LPs; this situation may arise because the LP is aggregating too aggressively. Thus, if the input queue is empty, the aggregate is sent regardless of the AOF and APF estimates.
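To make the aggregation buffer concrete, the C++/MPI sketch below keeps one buffer per destination LP and length-prefixes each message so the receiver can extract the messages in their send order. The class name and the framing format are assumptions made for illustration; the paper does not give WARPED's actual communication manager interface.

```cpp
#include <mpi.h>
#include <cstring>
#include <map>
#include <vector>

// Minimal sketch of a per-destination aggregation buffer (Section 5).
class AggregationBuffer {
    std::map<int, std::vector<char>> buffers_;  // one buffer per destination LP
public:
    // Append one application message; a length prefix preserves message
    // boundaries within the aggregate.
    void add(int dest, const void* msg, int len) {
        auto& buf = buffers_[dest];
        const char* lenBytes = reinterpret_cast<const char*>(&len);
        buf.insert(buf.end(), lenBytes, lenBytes + sizeof(len));
        const char* msgBytes = static_cast<const char*>(msg);
        buf.insert(buf.end(), msgBytes, msgBytes + len);
    }

    // Send the whole aggregate as a single physical MPI message.
    void flush(int dest, int tag, MPI_Comm comm) {
        auto& buf = buffers_[dest];
        if (buf.empty()) return;
        MPI_Send(buf.data(), static_cast<int>(buf.size()), MPI_BYTE,
                 dest, tag, comm);
        buf.clear();
    }

    // Receiver side: extract the application messages from a received
    // aggregate, in their original send order.
    static std::vector<std::vector<char>> deaggregate(const char* data, int len) {
        std::vector<std::vector<char>> msgs;
        int pos = 0;
        while (pos + static_cast<int>(sizeof(int)) <= len) {
            int msgLen = 0;
            std::memcpy(&msgLen, data + pos, sizeof(int));
            pos += static_cast<int>(sizeof(int));
            msgs.emplace_back(data + pos, data + pos + msgLen);
            pos += msgLen;
        }
        return msgs;
    }
};
```

The decision of when to call flush is exactly the policy question that the AOF/APF trade-off and the strategies of the next section address.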
6 Dynamic Aggregation Strategies

In this section, several DyMA strategies are suggested. The aggregation strategies should be lightweight, and should be able to balance the AOF and APF in order to achieve an optimal execution time. The strategies differ in their estimates of the two factors, and in the decision policy used to balance them. The pessimistic factor is proportional to the delay time of the messages. Using a precise estimate of time to model the age of the aggregate would prohibitively increase the overhead. Instead, in our implementation, the aggregation layer maintains an age estimate by periodically, and non-intrusively, incrementing a local counter; one tick of the age counter is equivalent to the average execution time of a single event. Other methods for maintaining the aggregate age are possible, including deferring the responsibility to the application. The optimistic factor is proportional to the message arrival rate over the lifetime of the aggregate; applications that generate more than one message per event executed increase the message arrival rate to the aggregation layer, thereby enabling higher aggregation. In addition, all the strategies immediately send the aggregate whenever:

1. An anti-message is received: an anti-message is an indication that the receiving LP is progressing on a wrong path (assuming aggressive cancellation), and any delay in the transmission of the anti-message will increase the amount of erroneous computation the receiving LP performs.

2. The input queue is empty: the number of events to be processed is an indication of the activity in a process; the higher the number of events, the greater the probability that messages will be generated. A process with no events to process is idle, and its AOF can be reduced.

3. A high-priority message is received: certain kernel messages are tagged as high priority because any delay in their transmission can slow down the simulation. For example, initialization messages are considered high-priority messages.

Given the above conditions, two strategies have been implemented and studied. The remainder of this section discusses these policies; a sketch of both appears at the end of the section.

6.1 Fixed Aggregation Window (FAW)

In this policy, decisions are made based only on the APF (the age of the aggregate). In this basic policy, messages are aggregated for a constant age window. More precisely, the age of the first message received by the aggregation layer is tracked. Once this age reaches a constant value (the size of the window), the aggregate message is sent along the associated channel. The advantage of this policy is its low overhead; only a single check of the current aggregate age (the time that the aggregate has been alive) against the constant window size is required. This policy provides a static balance between the pessimistic and optimistic factors, making it insensitive to transient trends in the communication behavior of the application. No matter how high (or low) the message arrival rate is, the fixed window size is used. The chosen window size significantly affects the performance of this policy. The optimal threshold age varies with the application and the execution environment. Profiling can be used to aid in the selection of the window size. This strategy has been tested for different values of age, and the effect of the window size on performance is studied in Section 7.

6.2 Simple Adaptive Aggregation Window (SAAW)

This policy is an extension of FAW that adapts the window size as a function of the message arrival rate. The initial aggregation window is specified statically, as in the case of FAW. During simulation, the message rate achieved by an aggregate is calculated when the aggregate is sent, and is used to decide what the aggregation window for the next aggregate should be. Changing the aggregation window size allows the policy to adapt to the behavior of the application. For example, if the application is exhibiting bursty communication behavior, the aggregation window size is increased to take advantage of the higher optimistic factor (message arrival rate). An estimate of the expected message arrival rate is maintained. If the message arrival rate achieved by the current aggregate exceeds (falls below) this estimate, the window size is increased (decreased) proportionately to the difference between the rate and the estimate. Upper and lower limits on the window size are enforced to ensure that the system does not diverge. The expected rate estimate is updated using the previous estimate and the rate achieved by the current aggregate. The weight placed on the two factors determines the sensitivity of the policy to changes in the message behavior. For example, a high weight given to the previous estimate makes the policy insensitive to transient changes in the message arrival rate. The overhead for implementing SAAW is slightly higher than that of FAW; there is an additional computation to determine the window size when the aggregate is sent. In addition to accounting for the APF, this strategy also considers the AOF in the form of the rate of message arrivals. Thus, it attempts to strike a balance between the two factors. SAAW requires an initial window estimate; however, it is less sensitive to this estimate than FAW because of its ability to change the window size.
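The following C++ sketch makes the two policies and the immediate-send conditions concrete. The paper describes SAAW only at the level of "proportional adjustment with a weighted estimate", so the exact formula, the smoothing constant, and all names here are illustrative assumptions rather than WARPED's implementation.

```cpp
#include <algorithm>
#include <cstddef>

// Sketch of FAW and SAAW (Section 6). Ages are in ticks of the
// event-granularity age counter described above.
struct AggregationPolicy {
    std::size_t window;          // current aggregation window (FAW: constant)
    std::size_t minWin, maxWin;  // SAAW bounds so the window cannot diverge
    double expectedRate = 0.0;   // SAAW: smoothed message arrival rate
    double alpha = 0.75;         // assumed weight on the previous estimate

    // Conditions 1-3 above: flush immediately, regardless of the window.
    static bool mustFlushNow(bool antiMessage, bool inputQueueEmpty,
                             bool highPriority) {
        return antiMessage || inputQueueEmpty || highPriority;
    }

    // FAW: send once the aggregate's age reaches the fixed window.
    bool shouldSendFAW(std::size_t age) const { return age >= window; }

    // SAAW: called when an aggregate of `count` messages is sent after
    // being held for `age` ticks; adapts the window for the next aggregate.
    void adaptSAAW(std::size_t count, std::size_t age) {
        double rate = static_cast<double>(count) /
                      static_cast<double>(std::max<std::size_t>(age, 1));
        if (expectedRate > 0.0) {
            // Grow the window when arrivals are brisker than expected,
            // shrink it when slower, in proportion to the difference.
            double next = static_cast<double>(window) *
                          (1.0 + (rate - expectedRate) / expectedRate);
            window = std::clamp(static_cast<std::size_t>(std::max(next, 0.0)),
                                minWin, maxWin);
        }
        // Exponentially smoothed estimate of the expected arrival rate;
        // a larger alpha makes the policy less sensitive to transients.
        expectedRate = (expectedRate > 0.0)
                           ? alpha * expectedRate + (1.0 - alpha) * rate
                           : rate;
    }
};
```

The clamp to [minWin, maxWin] implements the divergence guard mentioned for SAAW, and the single comparison in shouldSendFAW reflects FAW's low per-message overhead.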
[Figure 2. Performance of the strategies for SMMP on a network of workstations: execution time (seconds) versus aggregate age for FAW, SAAW, and the unaggregated version.]

[Figure 3. Performance of the strategies for RAID on a network of workstations: execution time (seconds) versus aggregate age for FAW, SAAW, and the unaggregated version.]
[Figure 4. Performance of the strategies for SMMP on a 4-processor shared memory machine: execution time (seconds) versus aggregate age for FAW, SAAW, and the unaggregated version.]
7 Analysis

Dynamic message aggregation was used to optimize the performance of the WARPED simulation kernel. The policies discussed in Section 6 were implemented and their relative performance compared. Performance studies were conducted using a 4-LP configuration of the simulation. The processes were partitioned across two 4-processor SMP workstations connected by Ethernet (forming a NOW). In addition, the 4-LP configuration was also tested on a single 4-processor SMP workstation. The performance was studied using the following two models:
SMMP: a model of a shared memory multiprocessor. Each processor is assumed to have a local cache with access to a common global memory. For the experiments in this paper, a 16-processor machine model was simulated with the following settings: a 10ns cache, a 100ns main memory, and a cache hit ratio of 90%. In terms of queuing model objects, there were 100 queuing objects in this simulation. Each event represents a request for memory access from a processor to the common global memory. While this model occasionally rolls back, it does so in a periodic fashion due to its symmetric nature.

RAID: a model of a nine-disk RAID level 5 disk array of IBM 0661 3.5" 320MB SCSI disk drives with a flat left-symmetric parity placement policy. Sixty processes send requests for stripes of random lengths and locations to forks, which split the requests into individual requests for each disk according to the placement policy. Thirty servers process the requests in a FCFS fashion and route the appropriate requests back to the source process. This simulation had a total of 94 queuing objects to model the RAID disk array. Every event carries information about stripe lengths and location so that the servers may carry out the appropriate stripe retrieval. The nature of the model is such that it is highly dynamic, with a large number of rollbacks.
[Figure 5. Performance of the strategies for RAID on a 4-processor shared memory machine: execution time (seconds) versus aggregate age for FAW, SAAW, and the unaggregated version.]

[Figure 6. Frequency Profile of the Aggregates: frequency (log scale) versus size of aggregate for SMMP with FAW and SAAW (age 5), and RAID with FAW and SAAW (age 10).]
The performance of the FAW and SAAW strategies versus the unaggregated implementation is illustrated for the SMMP and RAID models. Figures 2 and 3 show the performance of the policies for different aggregate ages on a network of workstations. Figures 4 and 5 show the performance of the policies for different aggregate ages on a 4-processor shared memory system. Clearly, aggregation yields considerable speedup on both NOWs and SMPs. While the speedup is greater on a network of workstations (30% in the best case), there is considerable speedup even on a shared memory system (15–20% in the best case). There appears to be an "optimal" window size for which the aggregation performance is best for each application. Window sizes smaller than this are too conservative; additional aggregation is possible without hurting performance. Conversely, window sizes greater than the optimal value delay the messages excessively, nullifying the benefit obtained from the additional aggregation. For some applications, it is possible for the optimal window size to change during the lifetime of the application. For such applications, the SAAW strategy is superior to FAW because it is able to converge on the optimal window size dynamically. This explains the (slightly better) performance obtained using the SAAW strategy. We expect that with more sophisticated adaptation of the window size, additional performance improvement can be obtained.

In order to study the success of the policies in aggregating messages, the frequency profile of the aggregate sizes for the window sizes that yielded the best performance is shown in Figure 6. This figure demonstrates the success of the policies in aggregating messages (the optimistic factor). The picture is not complete until the effect of the aggregation on the behavior of the simulation is studied: does aggregation increase the rollback behavior (the pessimistic factor)? This question is addressed in the remainder of this section.

Ideally, we would like message aggregation to improve the effective work done by the LP in addition to reducing the cost of communication. However, since message delivery is delayed, message aggregation may increase the rollback behavior. Experiments were conducted to study the rollback behavior for the different aggregation policies. Figures 7 and 8 illustrate the effect of different aggregate ages on the number of rollbacks in the simulation. In addition, Figures 9 and 10 show the effect of aggregation on the average rollback distance. From the graphs, it is clear that for RAID, the optimal aggregate age (the age that gave the best execution time) produced fewer rollbacks than the unaggregated version. As a result, the effective work done during the simulation is improved. For larger aggregate ages, the number of rollbacks increases almost linearly with age, resulting in poor performance. In comparison, for SMMP, the policies were not able to improve the effective work done, but were still able to produce speedup over the unaggregated version because of the reduction in communication costs. From Figures 9 and 10, it is also clear that the optimal aggregate age actually reduces the average rollback distance for both RAID and SMMP. This implies that the aggregation policies with the optimal aggregate age do not influence the critical path of the simulation as much as larger aggregate ages do.

[Figure 7. Effect of Aggregate Age on Number of Rollbacks for SMMP on a network of workstations: number of rollbacks versus aggregate age for FAW, SAAW, and the unaggregated version.]

[Figure 8. Effect of Aggregate Age on Number of Rollbacks for RAID on a network of workstations: number of rollbacks versus aggregate age for FAW, SAAW, and the unaggregated version.]

[Figure 9. Effect of Aggregate Age on Average Rollback Distance for SMMP on a network of workstations: average rollback distance versus aggregate age for FAW, SAAW, and the unaggregated version.]

[Figure 10. Effect of Aggregate Age on Average Rollback Distance for RAID on a network of workstations: average rollback distance versus aggregate age for FAW, SAAW, and the unaggregated version.]

8 Conclusion
Minimizing the communication overhead is a primary concern of distributed applications. The nature of message passing environments is such that the number of messages affects the overhead of communication more than the size of the messages. Time-Warp simulators are an example of a low-granularity distributed application — an application with a high number of relatively small messages. Consequently, the communication overhead represents a large portion of the execution time.

This paper investigates the use of Dynamic Message Aggregation (DyMA) for run-time matching of the application communication behavior with the underlying communication fabric. Application-level message aggregation is a useful and established optimization for fine-grained distributed applications; because of the significant constant overhead associated with each message, it is efficient to group data items together in a single physical message. However, the communication behavior is often abstracted away from the application developer, and application-specific optimizations to the communication are not possible. Moreover, the communication pattern of the application can be dynamic and unpredictable — complicating static optimization of the communication.

Dynamic message aggregation is implemented in the communication module of the Time-Warp LPs in the following way. LPs request message sends from their communication manager module. Instead of sending the messages directly, as in regular Time Warp, the communication manager aggregates messages destined to the same process. Periodically, the aggregated items are sent to their destination using a single physical message, thereby reducing the communication overhead. Two aggregation policies that balance the benefit of aggregating more messages against the loss due to delaying messages were investigated. The performance of the policies was studied, and DyMA was shown to produce a significant improvement in performance (around 30% in the best case for the models that were simulated). The simulations were analyzed to isolate the optimistic and pessimistic effects that the policies attempt to balance. It was shown that the policies were successful in aggregating a large number of messages without significantly increasing the rollback behavior of the simulations.

Other optimizations to the communication behavior of Time-Warp simulators have focused almost exclusively on minimizing the number of messages produced. These optimizations have been carried out at both the application level (by enhancing the partitioning of the model across the processes) and the kernel level (by implementing optimizations that require fewer kernel messages to be exchanged). The work presented in this paper differs from these previous efforts in that it addresses the overhead at a different level — inside the communication manager. No matter what the message generation behavior is, dynamic message aggregation attempts to match the communication behavior to the underlying communication fabric, thereby reducing the overhead of communication. Matching the application communication pattern dynamically with the underlying hardware at run-time is an intriguing idea. We are currently looking for other optimizations that can be implemented within this framework.
References

[1] M. L. Bailey, J. V. Briner, Jr., and R. D. Chamberlain. Parallel logic simulation of VLSI systems. ACM Computing Surveys, 26(3):255–294, September 1994.
[2] J. V. Briner, Jr. Parallel Mixed-Level Simulation of Digital Circuits using Virtual Time. PhD thesis, Duke University, Durham, North Carolina, 1990.
[3] C. D. Carothers, R. M. Fujimoto, and P. England. Effect of communication overheads on time warp performance: An experimental study. In Proceedings of the 8th Workshop on Parallel and Distributed Simulation, pages 118–125, 1994.
[4] E. W. Felten. Protocol compilation: High-performance communication for parallel programs. Technical report, Dept. of Computer Science, University of Washington, 1993.
[5] A. Ferscha. Probabilistic adaptive direct optimism control in time warp. In Proceedings of the 9th Workshop on Parallel and Distributed Simulation (PADS 95), pages 120–129, June 1995.
[6] R. Fujimoto. Parallel discrete event simulation. Communications of the ACM, 33(10):30–53, October 1990.
[7] W. Gropp and E. Lusk. Tuning MPI programs for peak performance. http://www.mcs.anl.gov/mpi/.
[8] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA, 1994.
[9] D. Jefferson. Virtual time. ACM Transactions on Programming Languages and Systems, 7(3):405–425, July 1985.
[10] Y. H. Levendel, P. R. Menon, and S. H. Patel. Special purpose computer for logic simulation using distributed processing. Bell System Technical Journal, 61(10):2873–2909, 1982.
[11] D. E. Martin, T. J. McBrayer, and P. A. Wilsey. WARPED: A time warp simulation kernel for analysis and application development. In 29th Hawaii International Conference on System Sciences (HICSS-29), January 1996.
[12] F. Mattern. Efficient algorithms for distributed snapshots and global virtual time approximation. Journal of Parallel and Distributed Computing, 18(4):423–434, August 1993.
[13] N. Nevin. The performance of LAM 6.0 and MPICH 1.0.12 on a workstation cluster. Technical Report OSC-TR-1996-4, Ohio Supercomputer Center, Columbus, Ohio, 1996.
[14] N. Nupairoj and L. Ni. Performance evaluation of some MPI implementations. Technical Report MSU-CPS-ACS-94, Dept. of Computer Science, Michigan State University, September 1994.
[15] A. Palaniswamy and P. A. Wilsey. Parameterized time warp: An integrated adaptive solution to optimistic PDES. Journal of Parallel and Distributed Computing, 37(2):134–145, September 1996.
[16] R. Rajan, R. Radhakrishnan, and P. A. Wilsey. Dynamic cancellation: Selecting time warp cancellation strategies at runtime. International Journal in Computer Simulation, 1997 (forthcoming).
[17] P. L. Reiher, F. Wieland, and D. R. Jefferson. Limitation of optimism in the time warp operating system. In Winter Simulation Conference, pages 765–770. Society for Computer Simulation, December 1989.
[18] S. P. Smith, B. Underwood, and M. R. Mercer. An analysis of several approaches to circuit partitioning for parallel logic simulation. In Proceedings of the 1987 International Conference on Computer Design, pages 664–667. IEEE, New York, 1987.
[19] Z. Xu and K. Hwang. Modeling communication overhead: MPI and MPL performance on the IBM SP. IEEE Parallel & Distributed Technology, 4(1):9–23, Spring 1996.