Scheduling Critical Channels in Conservative Parallel Discrete Event Simulation

Z. Xiao, B. Unger, R. Simmonds
Computer Science Dept., University of Calgary, Canada
xiao,unger,[email protected]

J. Cleary
Computer Science Dept., University of Waikato, New Zealand
[email protected]

Abstract

This paper introduces the Critical Channel Traversing (CCT) algorithm, a new scheduling algorithm for both sequential and parallel discrete event simulation. CCT is a general conservative algorithm that is aimed at the simulation of low-granularity network models on shared-memory multi-processor computers. An implementation of the CCT algorithm within a kernel called TasKit has demonstrated excellent performance for large ATM network simulations when compared to previous sequential, optimistic and conservative kernels. TasKit has achieved two to three times speedup on a single processor with respect to a splay tree central-event-list based sequential kernel. On a 16 processor (R8000) Silicon Graphics PowerChallenge, TasKit has achieved an event rate of 1.2 million events per second and a speedup of 26 relative to the sequential kernel for a large ATM network model. Performance is achieved through a multi-level scheduling scheme that supports the scheduling of large grains of computation even with low-granularity events. Performance is also enhanced by supporting good cache behavior and automatic load balancing. The paper describes the algorithm and its motivation, proves its correctness and briefly presents performance results for TasKit.

1. Introduction

The Critical Channel Traversing (CCT) algorithm extends the Chandy-Misra-Bryant (CMB) [3][1] algorithm with the addition of rules that determine when a logical process (LP) should be scheduled to execute events. CCT attempts to schedule the LPs with the largest number of events that are ready to execute. This is accomplished by identifying critical channels. TasKit is a simulation kernel based on the CCT algorithm that is designed for high performance simulation on small to
medium sized shared memory multi-processors. CCT enables the scheduling of large grains of computation even in very low granularity models. This is achieved through a multi-level scheduling algorithm. The performance of CCT is also enhanced by ensuring good program cache behavior and automatic load balancing, and through a limited form of time parallelism [4]. In this section we first outline a network modeling paradigm along with a cell level asynchronous transfer mode (ATM) modeling environment. Next we present motivation for the CCT algorithm that is largely deduced from our experience with developing and testing network models. In section 2 we define a number of terms and outline multilevel scheduling. Section 3 considers how simulations are partitioned. The CCT scheduling algorithm is then formally defined in section 4. Proofs of correctness and completion are presented in section 5. Finally, we present selected performance results and conclusions in section 6.

1.1. Network Modeling and Simulation

We define a network model as a model that extends the general logical process model through the addition of unidirectional channels. A channel connects a sender LP to a receiver LP: events can be scheduled by the sender for the receiver, but not the reverse. We refer to events scheduled for an LP by another LP as external events. An LP can also schedule events for itself; we refer to such events as self events. Further, events must be scheduled by a sender LP for a given receiver LP (through a channel) in non-decreasing timestamp order.

TasKit has been implemented using the CCT algorithm. It uses SimKit [8], an existing discrete event simulation API that has been modified with the addition of channels. Using this API has allowed the new kernel to be compared with existing parallel discrete event simulation (PDES) kernels that use the SimKit API. The performance of TasKit has been compared with an optimistic kernel, WarpKit, and another conservative kernel, WaiKit, for 6 large ATM Traffic and

Network (ATM-TN) benchmark scenarios in [15]. WarpKit [16] is a Time Warp kernel [9] developed at the University of Calgary that was originally derived from GTW [6]. WaiKit is an optimized CMB kernel developed at the University of Waikato [2]. Using SimKit has also allowed us to compare the performance of TasKit to that of CelKit, a splay tree central-event-list based sequential kernel developed at the University of Calgary.

The ATM-TN is a cell level network modeling environment [13]. The largest benchmark characterizes the Canadian National Test Network (NTN), operated by CANARIE (the Canadian Network for the Advancement of Research, Industry and Education), with 93 switches and routers; it generates over 40 million events per simulated second [14]. The average granularity of events in this benchmark scenario is on the order of 10 microseconds (i.e., the average application level computation per event on a Silicon Graphics R8000 processor). ATM-TN simulations typically have a wide range of event densities, and may involve the execution of trillions of events.
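To make the channel discipline of section 1.1 concrete, the sketch below enforces the non-decreasing timestamp rule at the sender. It is a minimal illustration in C++; the names Event, Channel and send are ours and are not SimKit's actual API.

```cpp
// A minimal sketch of the network-model channel discipline. The names
// Event, Channel and send are ours, not SimKit's API; the assert
// enforces the non-decreasing timestamp rule on a channel's sender.
#include <cassert>
#include <cstdio>
#include <deque>

struct Event {
    double timestamp;
    int    payload;
};

// A uni-directional channel from one sender LP to one receiver LP.
class Channel {
    std::deque<Event> queue_;
    double last_sent_ = 0.0;        // timestamp of the last event sent
public:
    void send(const Event& e) {     // called by the sender LP only
        assert(e.timestamp >= last_sent_ && "channel timestamp order violated");
        last_sent_ = e.timestamp;
        queue_.push_back(e);
    }
    bool empty() const { return queue_.empty(); }
    Event receive() {               // called by the receiver LP only
        Event e = queue_.front();
        queue_.pop_front();
        return e;
    }
};

int main() {
    Channel c;
    c.send({1.0, 42});
    c.send({2.5, 43});              // OK: non-decreasing order
    while (!c.empty()) std::printf("t=%g\n", c.receive().timestamp);
}
```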

1.2. Motivation

Work with the ATM-TN simulator led to the realisation that the optimistic WarpKit kernel and the conservative WaiKit kernel (which was already optimized for low granularity network models) both had a number of performance problems for network applications. WarpKit provides robust relative speedup for a wide range of network topologies, message densities and messaging patterns including, for example, TCP loops. However, we were continually faced with limited absolute speedup caused by the high system overhead associated with executing each event.

WaiKit was designed to have very low per-event system overhead and was demonstrated to perform well, providing good absolute speedup, for simulations with heavy message densities. However, we found WaiKit's performance susceptible to cycles with small lookahead, whether resulting from "errors" in coding models or from particular network topologies. Also, achieving good performance required careful partitioning of LPs to processors, something that is difficult to do for large network models due to the dynamic nature of the traffic being modeled and the difficulty of accurately measuring the execution load of LPs when the time to read a clock can be longer than the time to execute an event. Because of this, WaiKit often failed to provide good performance on large network models.

Consideration of where these systems do well and where they do badly led us to some conclusions about what is required to get consistently good performance with network models, and in particular with the ATM-TN simulator. These include:

- low per-event system overhead,
- dynamic load balancing, and
- good program cache locality.

The requirement for low event overhead, and the occasional positive results with WaiKit's approach to achieving this [2], led us to adopt a conservative paradigm with two levels of scheduling above the level of event scheduling. Scalable speedup for large models requires solving the load balancing problem [12]. We were forced to accept the need for dynamic load balancing and thus had to re-examine the way that scheduling was being performed. We realized that processors can acquire their next units of work from a central list if the granularity of that work is large enough. The requirement for good cache locality forced us to think about how event messages are passed around the system and therefore what sequences of operations should be performed by an individual processor. As shown in [7], optimising cache locality is vital to achieving good performance on modern shared memory multi-processors. The major approach used here, and inherited from WaiKit, is to group as many events as possible together for execution by one LP on the same processor.

2. Multi-Level Scheduling

On a shared memory multi-processor, a naive approach to scheduling and load balancing would be to have a single centralized event queue from which each processor takes events. With this scheme the work load is almost perfectly balanced since any available processor takes any available work. It is well known that such an approach yields very poor performance because of contention for the shared queue and poor cache locality for both LP state and event buffers. The poor cache locality is a product of LPs and event buffers moving from processor to processor at random. This results in a high probability that both the destination LP state and the event buffer are not in the cache of whichever processor gets to execute a particular event.

Because of these problems, the usual strategy is to preallocate LPs to processors. This removes contention for the shared resource and improves the cache locality. Unfortunately this static allocation process is not easy to do well. If some processors end up doing far more work than others, performance will be poor. Since the work load in the ATM-TN simulator is dynamic and difficult to predict, a good static partitioning of the simulation model is difficult, if not impossible, to achieve.

These problems led us to look again at the use of a centralized queue. The problems caused by the central queue in the naive scheduling scheme stem from the granularity of the work obtained from the queue. If the granularity were increased, each processor would access the queue less often and the contention problem would be eliminated. Also,

if each piece of work obtained required accessing the same memory locations many times, the cache locality would be improved. This led to the concept of grouping LPs into what we refer to as a task. A task is a group of LPs that are scheduled as a single entity. A task is constructed from a group of LPs that have a high dependence on one another: if an event is executed by one LP in a task, it is likely that the execution of this event will lead to the generation of an event for another LP in the same task. The construction of tasks is discussed in section 3. In a system using the CCT algorithm, scheduling is performed on three levels:

- Task scheduling: the top level, using a single shared scheduling queue.
- LP scheduling: the scheduling of LPs within a task.
- Event scheduling: the scheduling of events by the destination LP.

A system without tasks could use CCT by allowing the kernel to execute LPs directly rather than having tasks control this job, but the advantages of using tasks are clear. Before describing the three scheduling levels we define a channel more formally. Then we describe event, LP and task scheduling.

A channel is defined as a uni-directional link between two LPs. The channel (i,j) is used for sending messages from sender LP_i to receiver LP_j. Note that to send messages from LP_j to LP_i another channel, (j,i), is required. Each channel has a clock and a delay value. T_{i,j}, the clock value for channel (i,j), is a lower bound on the timestamp of any event that will be placed into the channel in the future. The delay, δ_{i,j}, represents the minimum lookahead on the channel.

2.1. Event Scheduling

Events are scheduled for execution by their destination LP. Once an LP is scheduled for execution by its parent task, i.e., the task to which the LP is allocated, an attempt is made to execute a large number of events at this LP. The more events executed in each execution session, the lower the per-event system overhead and the greater the overall program cache locality. As in other channel based conservative algorithms, it is possible to calculate a safe time for the execution session. The safe time is the latest simulation time up to which events can be executed without introducing the risk of a causality error [5]. The event scheduling part of the CCT algorithm calculates the safe time dynamically while events are being executed. The method used incorporates new information in a lazy fashion and as a result is cheap to calculate. It also allows

for increases in the safe time that occur during an LP execution session to be detected, thus allowing more events to be executed. The event scheduling algorithm is described in detail in section 4.

2.2. LP Scheduling

The way that an LP is scheduled depends on whether it is a source LP, that is, a logical process with no input channels, or a non-source LP, a logical process with at least one input channel. Source LPs re-schedule themselves at the end of each execution session. It would be possible for a source LP to run to completion in a single execution session, but this is avoided (see section 4.3).

A non-source LP, LP_i, is scheduled for execution by the LP connected to LP_i's critical channel. A critical channel is an input channel that must have its clock time increased in order for the LP's clock to advance. A non-source LP determines which of its input channels should be set critical at the end of each of its execution sessions. The sender LP on the critical channel then schedules the receiver LP if the channel time increases during the sender's execution session. Note that LP_i may have other channels with the same clock values as its critical channel. In this case it is possible that LP_i's clock will not advance in the ensuing execution session and that another channel at the original critical time will be marked as critical.

A logical process can only execute when the task in which it resides executes. So for an LP to execute, either its parent task has to be scheduled or the LP has to be scheduled by another LP in the same task.
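As a rough illustration of these rules, the following sketch marks the empty input channel with the lowest clock as critical at the end of an execution session, and shows the sender-side check that schedules the receiver when a critical channel's clock advances. The names are ours; TasKit's internals are not given in the paper.

```cpp
// Sketch of the critical-channel rule of section 2.2. All names are
// illustrative; TasKit's internals are not shown in the paper.
#include <limits>
#include <vector>

struct Chan {
    double clock    = 0.0;   // T_{i,j}: lower bound on future timestamps
    bool   critical = false; // written by the receiver, read by the sender
    bool   empty    = true;
};

struct LP {
    std::vector<Chan*> inputs;

    // At the end of an execution session, mark as critical the empty
    // input channel whose clock is blocking this LP's advance.
    void pick_critical_channel() {
        Chan*  lowest = nullptr;
        double best   = std::numeric_limits<double>::infinity();
        for (Chan* c : inputs) {
            c->critical = false;
            if (c->empty && c->clock < best) { best = c->clock; lowest = c; }
        }
        if (lowest) lowest->critical = true;
    }
};

// Sender side: after raising a channel's clock in its own session, the
// sender schedules the receiver if that channel was marked critical.
void sender_check(Chan& c, double old_clock, LP& receiver, void (*schedule)(LP&)) {
    if (c.critical && c.clock > old_clock) schedule(receiver);
}

int main() {
    Chan a, b;
    b.empty = false;                 // b holds a message; a is empty
    LP lp;
    lp.inputs = {&a, &b};
    lp.pick_critical_channel();      // a becomes the critical channel
    return a.critical ? 0 : 1;
}
```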

2.3. Task Scheduling

The highest level scheduling is performed on tasks. All processors take tasks from a single centralized task queue. The design of TasKit means that only a single operating system lock is required in the simulation kernel; this lock guards the access point of the task queue. Tasks are placed in the task queue as part of the LP scheduling mechanism explained in section 2.2. A task is placed in the task queue, in increasing time order, when either a source LP within the task schedules itself, or when an LP outside the task schedules an LP within the task. Channels leading to LPs within a task from LPs in other tasks are called external input channels, and channels from LPs in a task that lead to LPs in other tasks are called external output channels. A task is scheduled via one of its external input channels and schedules other tasks via one of its external output channels.
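A minimal version of such a centralized task queue, guarded by the kernel's single lock, might look as follows. This is a sketch under our own naming, not TasKit's implementation.

```cpp
// Sketch of the single shared task queue of section 2.3.
#include <mutex>
#include <queue>
#include <vector>

struct Task {
    double time;                  // time at which the task was scheduled
    int    id;
};
struct Later {
    bool operator()(const Task& a, const Task& b) const {
        return a.time > b.time;   // min-heap on scheduling time
    }
};

class TaskQueue {
    std::priority_queue<Task, std::vector<Task>, Later> heap_;
    std::mutex lock_;             // the kernel's single OS lock
public:
    void push(Task t) {
        std::lock_guard<std::mutex> g(lock_);
        heap_.push(t);
    }
    bool try_pop(Task& out) {     // processors poll this for work
        std::lock_guard<std::mutex> g(lock_);
        if (heap_.empty()) return false;
        out = heap_.top();
        heap_.pop();
        return true;
    }
};

int main() {
    TaskQueue q;
    q.push({2.0, 1});
    q.push({1.0, 2});
    Task t;
    while (q.try_pop(t)) { /* execute LPs inside task t */ }
}
```

Since every push and pop takes the same lock, enlarging the grain of work per task directly reduces how often the lock is contended.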

3. Task Types and Partitioning

Different types of task can be defined, with each type being optimized to handle groups of LPs with particular behaviors or that are connected in particular topologies. Currently TasKit has two types of task: the pipe-task type, used for groups of LPs that are connected in a pipeline, and the cluster-task type, used for groups of LPs that appear in low lookahead cycles. The LPs in a cluster-task act locally like a simple event list based simulator and share a single event queue.

LPs have to be allocated to specific tasks. This task partitioning process is simpler than the static partitioning of LPs to processors required in systems such as WaiKit because it is based only on simple topological and task type information. It is not, for example, necessary to estimate or measure total execution loads for individual LPs as in WaiKit.

Pipe-tasks can be executed by two or more processors concurrently. Simple mutual exclusion rules within the task prevent more than one processor executing on the same LP at any time. The capability for more than one processor to execute in a task concurrently makes the system less sensitive to poor partitioning decisions than static LP to processor partitioning schemes. This in turn makes task partitioning easier to automate.

Figure 1. A task partitioning. [Figure omitted: it shows source and non-source LPs grouped into tasks, with internal and external channels and the direction of message flow marked.]

Figure 1 shows the task partitioning of a simple network model. In this case the LPs representing hosts and switches are partitioned into tasks in such a way that event streams form pipelines where most events start at the beginning of the pipe and pass through each LP before exiting the task. This allows the execution of events generated along each pipeline to occur in a single execution session of the task. The heuristics used in constructing Figure 1 are particular to ATM-TN. However, the basic rules used should extend easily to other network models.

Consider how LPs are scheduled in a simple pipe-task that consists of a linear sequence of LPs, LP_1, LP_2, ..., LP_n, each with one input and one output channel. The external input channel to LP_1 will be marked as critical. When an event is sent down this channel, LP_1 will be scheduled, which will in turn cause its parent task to be placed in the task queue. Eventually a processor, say P_1, will remove the task from the task queue and begin executing the first LP, LP_1. The execution of LP_1 will consume events on its input channel, typically producing an equal number of events on its output channel. When P_1 finishes with LP_1 the external input channel will again be critical. P_1 will move on to execute LP_2, and similarly LP_3, LP_4 and so on. Another processor P_2 could subsequently pick up the same task and again begin executing its first LP, i.e. LP_1, before P_1 has completed executing in the task. P_2 would then also execute LP_2, LP_3, etc., following P_1 down the pipe. A key effect of this is extremely good program cache behavior, since each processor moves down the pipe with the same set of event buffers. Notice that, viewed from the level of a task, we have multiple processors executing the task, each within non-overlapping time windows. Such time parallelism was first described in [4].

4. The CCT Algorithm

This section explains the CCT algorithm in detail. As with all discrete event simulation algorithms, if event m_k with timestamp t_k is generated as a result of executing event m_i with timestamp t_i, then it is a requirement that

    t_k ≥ t_i    (1)

The algorithm places two additional conditions on the messages passed between pairs of LPs. These conditions are shared by most conservative PDES simulation algorithms:

- If there is a cycle in the directed graph of LPs and channels, the sum of the channel delays in the cycle must be greater than zero. That is, if LP_0, ..., LP_n form a cycle made up of channels (0,1), ..., (n-1,n), (n,0), then

      Σ_{i=0}^{n-1} δ_{i,i+1} + δ_{n,0} > 0    (2)

- The timestamps of messages passed along a channel must be monotonically non-decreasing. That is, given m_0, ..., m_n, a sequence of messages on channel (i,j), then

      t_k ≤ t_{k+1}  for all k ∈ {0, ..., n-1}    (3)
In a CCT system, LPs can be in one of three logical states: ready, executing and waiting. If a logical process is ready it means that it has been scheduled and will be worked on when a processor becomes available. If it is executing, it is currently being worked on by a processor. If it is waiting, it is not currently scheduled and will have to be scheduled (i.e., be made ready) before it can execute.

A task assures that only one processor is executing an LP at any time. This is not explained as part of the LP scheduling algorithm, since the best way of assuring this will be specific to a task type. For example, only one processor is allowed to execute in a cluster-task at any time which assures the required property without any additional constraints. In the pipe-task, the linear nature of the pipeline allows this to be achieved using a single state bit per LP; the task queue lock (see section 2.3) prevents two processors from entering the pipeline simultaneously. The mutual exclusion for LP execution is the only contribution a task makes to the correctness of the algorithm and therefore there is no mention of tasks in the following explanation of the algorithm.
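One plausible realization of that single state bit, shown purely as an assumption since the paper does not give TasKit's exact mechanics, is an atomic test-and-set flag per LP:

```cpp
// Hypothetical realization of the per-LP state bit used by pipe-tasks.
// The paper does not specify TasKit's mechanism; this is one way.
#include <atomic>

struct LPSlot {
    std::atomic_flag busy = ATOMIC_FLAG_INIT;
};

// Returns true if the calling processor acquired the LP.
inline bool try_enter(LPSlot& lp) {
    return !lp.busy.test_and_set(std::memory_order_acquire);
}
inline void leave(LPSlot& lp) {
    lp.busy.clear(std::memory_order_release);
}

int main() {
    LPSlot lp;
    if (try_enter(lp)) {
        // ... execute the LP's events ...
        leave(lp);
    }
}
```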

4.1. System State

Before we describe the algorithm we describe the state used by the algorithm. Each LP has the following state. For LP_i:

- T_i : the local clock.
- queue_i : an event queue (a priority queue) local to LP_i.

Initially T_i is set to zero and queue_i is empty. Each channel has the following state. For channel (i,j):

- T_{i,j} : the channel clock.
- δ_{i,j} : the channel delay.
- critical_{i,j} : a bit to indicate if (i,j) is critical.
- sampled_{i,j} : a bit to indicate if there is a message from (i,j) in queue_j.
- busy_{i,j} : a bit to indicate if LP_j is working on (i,j).

Initially T_{i,j} is set to zero, δ_{i,j} is set to the delay value specific to the channel, and the three state bits are all unset. The channel clock is only modified by the sender LP, while the three channel state bits are only modified by the receiver LP. The channel delay value does not change during the simulation.

During an execution session LP_i maintains a local window time WT_i, an upper bound on the safe time that will be found in this execution session, and keep_c_id_i, the identity of the channel that currently defines the value of WT_i. It should be noted that the window time is refined downwards as the execution session continues, so the value of WT_i calculated at the start of an execution session may be higher than the true safe time for this LP.

At the start of a simulation all LPs are in the ready state. The simulation terminates when all LPs have advanced their clocks beyond the end time and are in the waiting state.

4.2. Estimating the Local Window Time

At several points during the execution of events by LP_i, it is necessary to calculate a new value of the local window time, WT_i. This is done when an empty input channel is encountered. Suppose that (j,i) is an empty channel; then the channel time CT_{j,i} is calculated from

    CT_{j,i} = max(T_{j,i}, t_k)    (4)

where t_k is the timestamp of the last message to pass along (j,i). The value of t_k is considered since another processor might be working on LP_j and may not have updated T_{j,i} since sending a message. Failing to see any message sent at this time will not cause CCT to fail; at worst it could cause a lower value of WT_i to be calculated. If CT_{j,i} < WT_i, then keep_c_id_i is set to indicate that (j,i) currently has the smallest channel time of any empty channel encountered:

    keep_c_id_i = j    (5)

Note that another channel could be encountered with an equal channel time, but we only need to mark one channel as having the lowest channel time. Now the new value of WT_i is calculated from

    WT_i^{new} = min(WT_i^{old}, CT_{j,i})    (6)
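The following sketch collects the per-LP and per-channel state above, together with the window update of equations (4) to (6). Field names and layout are ours; TasKit's actual structures are not published here.

```cpp
// Sketch of the per-LP and per-channel state of section 4.1 plus the
// window update of equations (4)-(6). Names and layout are ours.
#include <algorithm>
#include <limits>
#include <queue>
#include <vector>

struct Event { double timestamp; };
struct Later {
    bool operator()(const Event& a, const Event& b) const {
        return a.timestamp > b.timestamp;    // min-heap: earliest first
    }
};

struct Channel {
    double T = 0.0;           // channel clock T_{j,i}
    double delta = 0.0;       // channel delay, fixed for the whole run
    bool   critical = false;  // modified by the receiver only
    bool   sampled  = false;  // a message from this channel is in queue_i
    bool   busy     = false;  // the receiver is currently working on it
    double last_sent = 0.0;   // t_k: timestamp of the last message sent
};

struct LP {
    double T = 0.0;           // local clock T_i
    std::priority_queue<Event, std::vector<Event>, Later> queue;
    double WT = 0.0;          // local window time, per execution session
    Channel* keep_c_id = nullptr;  // channel currently defining WT_i

    // Called whenever input channel (j,i) is found empty.
    void update_window(Channel& c) {
        double CT = std::max(c.T, c.last_sent);     // equation (4)
        if (CT < WT) {                              // refine WT downwards
            keep_c_id = &c;                         // equation (5)
            WT = CT;                                // equation (6)
        }
    }
};

int main() {
    LP lp;
    lp.WT = std::numeric_limits<double>::infinity();
    Channel c{3.0, 1.0};      // T = 3.0, delta = 1.0, rest defaulted
    lp.update_window(c);      // WT_i becomes 3.0 and keep_c_id points at c
}
```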
4.3. Description of the Algorithm

Stage 1: Initial Local Window Calculation

When LP_i's execution session begins the following actions are performed. First WT_i is set to ∞; the value of keep_c_id_i is not defined at this point. Then each input channel is accessed, its busy bit set, its critical bit unset, and a test performed to determine if its sampled bit is set. If the sampled bit is set there is no need to perform any other action on this channel at this stage. If the sampled bit is not set, an attempt is made to remove a message from the channel. If a message is present, the message is removed, placed in queue_i, and the channel's sampled bit is set. If no message is found, the local window time calculation (see section 4.2) is performed on this channel.

Source LPs have no incoming channels and need to be treated rather differently when calculating the window. The natural value for WT_i in this case is ∞, and the LP would then run to completion in a single execution session. This is undesirable since a large number of events could be generated, possibly leading to buffer exhaustion. Instead a global constant Δ is used to limit the advance. Thus, if LP_i is a source LP,

    WT_i = T_i + Δ    (7)

The use of Δ and event buffer management are discussed in section 4.5.
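Stage 1 can be transcribed into the same sketch style. Here try_receive and enqueue are stubs standing in for the real channel and event-queue operations, and delta_limit plays the role of Δ from equation (7).

```cpp
// Sketch of stage 1 (initial local window calculation).
#include <algorithm>
#include <limits>
#include <vector>

struct Msg { double timestamp; };

struct Channel {
    double T = 0.0, last_sent = 0.0;
    bool critical = false, sampled = false, busy = false;
    bool try_receive(Msg&) { return false; }   // stub: real channels hand over queued messages
};

struct LP {
    std::vector<Channel*> inputs;     // empty for a source LP
    double T = 0.0, WT = 0.0;
    double delta_limit = 0.0;         // plays the role of Δ in equation (7)
    Channel* keep_c_id = nullptr;

    void enqueue(const Msg&) {}       // stub: insert into queue_i
    void update_window(Channel& c) {  // equations (4)-(6)
        double CT = std::max(c.T, c.last_sent);
        if (CT < WT) { keep_c_id = &c; WT = CT; }
    }

    void stage1() {
        if (inputs.empty()) {                  // source LP: equation (7)
            WT = T + delta_limit;
            return;
        }
        WT = std::numeric_limits<double>::infinity();
        keep_c_id = nullptr;
        for (Channel* c : inputs) {
            c->busy = true;
            c->critical = false;
            if (c->sampled) continue;          // its message is already queued
            Msg m;
            if (c->try_receive(m)) { enqueue(m); c->sampled = true; }
            else update_window(*c);            // empty: fold into the window
        }
    }
};

int main() {
    LP src;                    // no input channels, so treated as a source
    src.delta_limit = 5.0;
    src.stage1();              // WT = T + Δ = 5.0
}
```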

Stage 2: Event Execution

While there are events in queue_i with timestamps less than or equal to WT_i the following actions are performed. If the timestamp of the event at the head of queue_i is less than or equal to WT_i, the event is removed from queue_i and executed (any events generated with timestamps greater than the end time are discarded). If the event was self scheduled (sent from LP_i to LP_i), then the actions associated with this event are complete; otherwise the channel from which the event was received is examined and an attempt is made to get another event from it. If there is a message at the head of the channel, the message is removed and placed in queue_i. Otherwise, the channel's sampled bit is unset and the local window calculation is performed on the channel. This process continues until queue_i holds no events with timestamps less than or equal to WT_i.

A facility is provided to allow lookahead optimization: LP_i may increase the value of the channel clock of any of its output channels during stage 2. Note that if this is done, any message subsequently sent on the same channel must have a timestamp greater than or equal to the new channel time.

Figure 2. Diagrams showing the state of the same LP at the start and end of an execution session. [Figure omitted; it is walked through in section 4.4.]
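A sketch of the stage 2 loop follows; execute() stands in for the model-level event handler, and the state mirrors the earlier sketches rather than TasKit itself.

```cpp
// Sketch of the stage 2 event-execution loop.
#include <algorithm>
#include <queue>
#include <vector>

struct Channel {
    double T = 0.0, last_sent = 0.0;
    bool sampled = false, critical = false, busy = false;
    std::queue<double> pending;            // timestamps still in transit
    bool try_receive(double& t) {
        if (pending.empty()) return false;
        t = pending.front();
        pending.pop();
        return true;
    }
};

struct Ev { double t; Channel* from; };    // from == nullptr: self event
struct Later {
    bool operator()(const Ev& a, const Ev& b) const { return a.t > b.t; }
};

struct LP {
    double WT = 0.0;
    Channel* keep_c_id = nullptr;
    std::priority_queue<Ev, std::vector<Ev>, Later> queue;

    void update_window(Channel& c) {       // equations (4)-(6)
        double CT = std::max(c.T, c.last_sent);
        if (CT < WT) { keep_c_id = &c; WT = CT; }
    }
    void execute(const Ev&) { /* model code runs here */ }

    void stage2() {
        while (!queue.empty() && queue.top().t <= WT) {
            Ev e = queue.top();
            queue.pop();
            execute(e);
            if (!e.from) continue;         // self event: nothing more to do
            double t;                      // refill from the event's channel
            if (e.from->try_receive(t)) queue.push({t, e.from});
            else { e.from->sampled = false; update_window(*e.from); }
        }
    }
};

int main() {
    Channel ch;
    ch.pending.push(1.0);
    ch.pending.push(4.0);
    LP lp;
    lp.WT = 3.0;
    double t;
    ch.try_receive(t);
    lp.queue.push({t, &ch});   // prime queue_i with ch's first message
    lp.stage2();               // executes t=1.0; t=4.0 exceeds WT and waits
}
```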

Stage 3: Update Clock and Set Critical Channel

LP_i's clock is now set to the current value of WT_i:

    T_i = WT_i    (8)

If LP_i is a non-source LP and T_i is less than or equal to the simulation end time, the input channel (keep_c_id_i, i) is set critical. Then, the busy bit is unset in each input channel.
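Stage 3 is short enough to transcribe directly in the same sketch style:

```cpp
// Sketch of stage 3: commit the window time as the LP clock (equation
// (8)), mark the channel that defined it as critical, and release the
// busy bits taken in stage 1.
#include <vector>

struct Channel { bool critical = false, busy = false; };

struct LP {
    double T = 0.0, WT = 0.0;
    bool is_source = false;
    Channel* keep_c_id = nullptr;
    std::vector<Channel*> inputs;

    void stage3(double end_time) {
        T = WT;                                      // equation (8)
        if (!is_source && T <= end_time && keep_c_id)
            keep_c_id->critical = true;
        for (Channel* c : inputs) c->busy = false;   // release the channels
    }
};

int main() {
    Channel c0, c1;
    LP lp;
    lp.inputs = {&c0, &c1};
    lp.WT = 7.0;
    lp.keep_c_id = &c1;
    lp.stage3(100.0);          // T_i = 7.0, c1 critical, busy bits cleared
}
```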

Stage 4: Update Channel Clocks and Schedule LPs

Access each output channel and perform the following actions. For channel (i,j), store the value of T_{i,j} in the temporary variable T_{i,j}^{old} and update the value of T_{i,j}. The value given to T_{i,j} depends on whether a message has been sent on (i,j) during the current execution session. If any messages have been sent, let t_k be the timestamp of the last message sent along (i,j) and update the value of T_{i,j} using

    T_{i,j} = max(t_k, T_{i,j}^{old}, T_i + δ_{i,j})    (9)

The value of T_{i,j}^{old} is considered since it is possible that LP_i increased the value of T_{i,j} during stage 2 (not necessarily in the current execution session). If no message has been passed along channel (i,j) during this execution session, update T_{i,j} using

    T_{i,j} = max(T_{i,j}^{old}, T_i + δ_{i,j})    (10)

Next, if busy_{i,j} is set, spin and wait for it to become unset. Note that the busy bits of all LP_i's input channels were unset at the end of stage 3, so no other processor can be spinning waiting for LP_i at this time. Then, if critical_{i,j} is set and

    T_{i,j} > T_{i,j}^{old}    (11)

LP_i schedules LP_j. If LP_i is a source LP and its clock is less than the simulation end time, LP_i schedules itself at this point. This completes LP_i's execution session.
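Stage 4 in the same sketch style: the busy bit is atomic in this version because the sender spins on a flag that the receiver clears from another processor. As before, the names are ours and the scheduling call is a stub.

```cpp
// Sketch of stage 4: advance each output channel clock (equations (9)
// and (10)), wait for the receiver to leave the channel, and schedule
// the receiver if its critical channel advanced (condition (11)).
#include <algorithm>
#include <atomic>
#include <vector>

struct LP;

struct Channel {
    double T = 0.0;                  // channel clock T_{i,j}
    double delta = 0.0;              // channel delay delta_{i,j}
    double last_sent = 0.0;          // t_k of the last message this session
    bool sent_this_session = false;
    bool critical = false;
    std::atomic<bool> busy{false};
    LP* receiver = nullptr;
};

struct LP {
    double T = 0.0;                  // already set to WT_i in stage 3
    double end_time = 0.0;
    bool is_source = false;
    std::vector<Channel*> outputs;

    void schedule(LP*) { /* place the receiver (via its task) in the task queue */ }

    void stage4() {
        for (Channel* c : outputs) {
            double old = c->T;
            c->T = c->sent_this_session
                 ? std::max({c->last_sent, old, T + c->delta})   // equation (9)
                 : std::max(old, T + c->delta);                  // equation (10)
            c->sent_this_session = false;
            while (c->busy.load()) { /* spin: receiver still sampling */ }
            if (c->critical && c->T > old)                       // condition (11)
                schedule(c->receiver);
        }
        if (is_source && T < end_time) schedule(this);           // re-arm source
    }
};

int main() {
    LP sender;
    sender.T = 10.0;
    Channel ch;
    ch.delta = 2.0;
    ch.critical = true;
    sender.outputs = {&ch};
    sender.stage4();     // ch.T advances to 12.0 and, since the channel is
                         // critical and advanced, the receiver is scheduled
}
```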

4.4. Example

Figure 2 shows the state of an LP before and after an execution session. The LP in question has three input channels, 0, 1 and 2, and two output channels, 3 and 4. The diagram on the left shows the state of the LP before the execution session begins. The local event queue holds one self scheduled event, and one event from each of channels 1 and 2, each tagged with the channel number. The positions that the two external events would hold in each channel are marked by the shaded area at the front of each channel. Channel 0 is empty and there is no representative event for channel 0 in the local event queue. The value of the last safe time calculated is the LP's current local time.

At the start of the execution session, since channel 0 is the only empty channel, it is the only channel added to the set of empty channels used in the local window time calculation. During the execution session an event is removed from an input channel after an event that arrived on the same channel is executed. When channel 2 is found to be empty it is added to the set of empty channels used in the window time calculation. In this case, as there are no more events with timestamps less than clock2, stage 2 of the execution session is complete.

The LP then updates its clock to the new safe time (clock2 in this case) and sets channel 2 as critical. Finally, it sets the clock values in its output channels and schedules the LP connected to channel 4; it does this as a consequence of channel 4 being critical. The diagram on the right shows the state of the LP when the session is complete. The new value of clock4 is T + delay_4, where T is the new clock value of the LP and delay_4 is the channel delay for channel 4. The value of clock3 has increased further: since the last message sent on channel 3 had a timestamp greater than T + delay_3, the new value of clock3 takes the value of the last message's timestamp.

4.5. Note on Memory Usage

Like most loosely coupled PDES algorithms, CCT does not itself offer a robust solution to the buffer exhaustion problem. It is theoretically possible to construct cases where an LP generates events faster than adjacent LPs can execute them, eventually consuming all the event buffers in the system. We have found that with an appropriate Δ (see stage 1 in section 4.3) this situation does not arise for the models we are interested in. In fact, TasKit often uses less memory than CelKit. However, we do feel that the requirement for the user to select the value of Δ is undesirable, and realize that certain models could be more prone to buffer exhaustion. We are currently considering several methods that remove the need for user parameter setting and guard against buffer exhaustion.

5. Proof of Correctness

This section provides correctness proofs for the CCT algorithm. We prove that events will be executed in the correct order at each LP; this assures the causal correctness of a simulation using the algorithm [5]. We also prove that any simulation that will terminate correctly using the CMB algorithm will also terminate correctly using CCT.

The following additional notation will be used in the proofs. EndTime is the value of the simulation's end time. The set of input channels to LP_i will be referred to as I_i. The subset of channels in I_i that did not hold an event the last time they were considered is referred to as EC_i (LP_i's empty channels).

5.1. Proof of Event Ordering

In this section we prove that events will be executed in the correct order at each LP. A number of lemmas used in the final proof are given first. The proofs of Lemmas 1 and 2 are omitted for brevity.

Lemma 1 The sampled bit of channel (j,i), sampled_{j,i}, will be set if and only if there is an external event received on (j,i) in LP_i's event queue.

Lemma 2 For any message m_k with timestamp t_k sent on channel (i,j), t_k ≥ T_{i,j}.

Lemma 3 During stage 2 of LP_i's execution session, the value of WT_i will be less than or equal to the channel times of any empty channels.

proof: By Lemma 2 and by (4), the channel time values for each channel form a monotonically non-decreasing function. Therefore, the lemma holds for all channels that are empty and have already been included in the calculation of WT_i. Since this is true for any channel that is empty in stage 1, the lemma could only fail if a channel becomes empty during stage 2, but any channel that becomes empty during stage 2 will immediately be included in the WT_i calculation.

Lemma 4 No event will be inserted into LP_i's event queue, queue_i, that has a timestamp less than the timestamp of the last event removed from queue_i.

proof: The lemma is trivially true for the first event inserted into queue_i. Assume that the first n-1 events added to queue_i satisfy the condition and consider adding the nth event to queue_i. Let e_n, with timestamp t_n, be the nth event added to queue_i, and e_l, with timestamp t_l, be the last event to have been removed from queue_i. If e_n is self scheduled, then it was generated during the execution of e_l and by (1) we are done. Now suppose that e_n is an external event that arrived on (j,i) ∈ I_i. e_n could have been removed from (j,i) in one of two situations:

i) in stage 1, if sampled_{j,i} was not set.
ii) in stage 2, if e_l is an external event that arrived on (j,i).

For (ii), by Lemma 2 and by (3), since e_l arrived via (j,i) before e_n, t_n ≥ t_l. For (i), by Lemma 1, if sampled_{j,i} was not set there will not be an external event from (j,i) in queue_i. Since no events are executed in stage 1, e_l must have been removed from (j,i) during stage 2 of a previous execution session. By Lemma 3, t_l ≤ WT_i^{old}, where WT_i^{old} was the value of the window time at the end of the execution session in which e_l was removed from the queue. By (3) and by (8), since WT_i at the end of an execution session represents the minimum time of any message to arrive on an empty channel in the future, t_n ≥ WT_i^{old}. Therefore, t_n ≥ WT_i^{old} ≥ t_l. Thus by induction on the number of events added to queue_i, we are done.

Theorem 1 Events will be executed at each LP in non-decreasing timestamp order.

proof: By Lemma 4 and by the definition of a priority queue, events will always be removed from queue_i in non-decreasing timestamp order.

5.2. Proof of Termination

The purpose of this section is to prove that a simulation using CCT will terminate correctly provided any CMB execution of the simulation also terminates. Task level scheduling will not be considered in this proof; instead only the LP and event scheduling levels will be considered. The individual task types each use different algorithms and their individual properties need to be proven separately.

Lemma 5 Every LP that is waiting and has a clock value less than the simulation end time will have exactly one channel marked as critical.

proof: The result is trivially true for a source LP as it can never be waiting if its clock is less than the end time. Let LP_i be a non-source LP. At the start of the simulation, LP_i is ready and will only become waiting after an execution session. The critical bits of all channels in I_i are unset during stage 1 of LP_i's execution session. During stage 3, the critical bit is set in one of the channels in I_i if T_i < EndTime. Therefore, at the end of the execution session, if T_i < EndTime, exactly one of LP_i's input channels is critical. All input channels are uni-directional, so for any channel there is a unique receiver LP. Since only a receiver LP updates a channel's critical bit, the state of the critical bit of any channel in I_i cannot change between LP_i's execution sessions.

Theorem 2 Given a simulation that terminates when executed using a CMB simulator, the CCT algorithm will run to completion with all LPs having advanced their clocks beyond the simulation end time.

proof: Consider the CCT execution of a simulation. There are two possibilities that we will consider. First, that every LP whose clock is less than the simulation end time is eventually scheduled and then executed. If this is true then CCT is acting as a standard CMB simulator and it will terminate correctly (because each LP is repeatedly executed, the LP clocks will advance and terminate eventually). The second possibility is that this condition fails. If so there will be some minimum simulation time t, and at least one LP, LP_l, whose clock is at time t, which will be in wait state. LP_l will be in this state when the CCT algorithm terminates, or will remain in this state forever if CCT does not terminate. Now take a snapshot of the state of the CCT algorithm either when it terminates or after the point when all the LP clocks are greater than or equal to t. (There must be such a point by the minimality condition assumed for t.) Given this snapshot, create a directed graph between the LPs. A link, or an edge, is created in the di-graph from LP_j to LP_i (in the same direction as channel (j,i)) whenever T_j = T_{j,i} = T_i = t. Each LP that has an edge leading to it in the graph will be in wait state. Also, by (2), such a graph must be acyclic. Choose some LP_i which is a root of this directed acyclic graph. LP_i is a non-source LP (as a source LP will only become waiting when its clock advances beyond EndTime). Consider a channel (j,i) leading into LP_i; either:

- T_{j,i} < T_i = t, but this contradicts the minimality assumption on t.
- T_j = T_{j,i} = T_i = t, but this contradicts the assumption that LP_i is a root in the graph.
- T_j < T_{j,i} = T_i = t, but this again contradicts the minimality assumption on t.
- T_{j,i} > T_i = t.

By Lemma 5 one of the channels (j,i) must be critical. But the last (real) time when LP_i was scheduled it must have been true that T_{j,i} = T_i = t, or else the channel would not have been marked as critical. So at some point after the last scheduling of LP_i, T_{j,i} must have been advanced. The only way that this could have happened is in stage 4 of the algorithm, which would have seen that the critical bit was set and scheduled LP_i. (This assumes the correctness of the synchronisation of the critical bits via shared memory. We omit this for the sake of brevity.) But this contradicts the assumption that LP_i is never scheduled again. Thus the first possibility, that all LPs are scheduled as often as necessary, must be true.

6. Performance and Conclusions

Experiments with ATM-TN models have demonstrated the potential performance of the Critical Channel Traversing algorithm. The sequential performance of ATM-TN on TasKit, a simulation kernel using CCT, has proved consistently superior to the performance of ATM-TN on our splay tree central-event-list based sequential kernel (CelKit). The parallel performance of ATM-TN on TasKit has significantly outstripped the performance on our previous simulation kernels.

Figure 3 compares the event rates for the ATM-TN NTN benchmark on TasKit and three other simulation kernels. The benchmark models a real ATM network that consists of 93 ATM switches and routers, and 355 traffic source/sink pairs of various types (LAN, MPEG, TCP, etc.). The model comprises 1381 LPs grouped into 112 tasks. The event rate using CelKit is 47,000 events per second. The speedup of TasKit relative to CelKit is almost linear up to four processors (2.8 on one and 10 on four processors). On 16 processors, the event rate reaches 1.23 million events per second, giving a speedup of 26.3 (relative to CelKit).

The CCT algorithm incorporates scheduling decisions into a conservative PDES causal event ordering algorithm. It aims to schedule the LPs that will make most progress at any time and provide coarse grain parallelism in a fine grain system.

Figure 3. Event rates for the ATM-TN simulator running the NTN model on a 16 processor SGI R8000 PowerChallenge. [Plot omitted: event rate (×1000) versus number of processors for TasKit, WarpKit, WaiKit and CelKit.]

We believe that CCT goes a long way towards achieving these goals. A system without tasks could use CCT by allowing the kernel to execute LPs directly rather than having tasks perform this job. However, our results indicate the benefit of using task level scheduling. Also note that a system using tasks is not restricted to using a single centralized task queue. For the ATM-TN simulation models we have worked with, the single task queue does not appear to cause contention problems on up to 16 processors. However, synthetic workload models can be constructed where the effects of contention are apparent. We are currently experimenting with ways of using multiple queues without losing the automatic load balancing property of the single centralized queue.

It should be noted that this work does not solve the general PDES problem. Our results are dependent on models with sparse channel connectivity and enough lookahead to enable the execution of a significant number of events in each LP execution session. This has enabled us to get good absolute speedup even in the face of low granularity events. Our previous experience indicates that Time Warp is a good robust general solution for systems with high granularity events, but that its high overheads restrict its applicability for our current problems. The goal of a general PDES simulation algorithm for low granularity events remains unreached.

References

[1] R. Bryant, "Simulation of Packet Communication Architecture Computer Systems", MIT/LCS/TR-188, MIT, November 1977.
[2] J. Cleary and J. Tsai, "Conservative Parallel Simulation of ATM Networks", Proceedings of the 10th Workshop on Parallel and Distributed Simulation, May 1996.
[3] K. Chandy and J. Misra, "Distributed Simulation: A Case Study in Design and Verification of Distributed Programs", IEEE Transactions on Software Engineering, pp. 440-452, September 1979.

[4] K. Chandy and R. Sherman, "Space, Time and Simulation", SCS Transactions on Distributed Simulation, Vol. 21, No. 2 (PADS89), pp. 93-99, March 1989.
[5] R. Fujimoto, "Parallel Discrete Event Simulation", Communications of the ACM, Vol. 33, No. 10, pp. 31-35, October 1990.
[6] S. Das, R. Fujimoto, K. Panesar, D. Allison, and M. Hybinette, "A Time Warp System for Shared Memory Multiprocessors", Proceedings of the 1994 Winter Simulation Conference, December 1994.
[7] R. Fujimoto and K. Panesar, "Buffer Management in Shared-Memory Time Warp Systems", Proceedings of the 9th Workshop on Parallel and Distributed Simulation (PADS95), pp. 149-156, 1995.
[8] F. Gomes, S. Franks, B. Unger, Z. Xiao, J. Cleary, and A. Covington, "SimKit: A High Performance Logical Process Simulation Class Library in C++", Proceedings of the 1995 Winter Simulation Conference, Arlington, VA, December 1995.
[9] D. Jefferson, "Virtual Time", ACM Transactions on Programming Languages and Systems, pp. 405-425, July 1985.
[10] W. Loucks and B. Preiss, "The Role of Knowledge in Distributed Simulation", Proceedings of the SCS Multiconference on Distributed Simulation, pp. 9-16, San Diego, January 1990.
[11] D. Nicol and R. Fujimoto, "Parallel Simulation Today", Annals of Operations Research, Vol. 53, pp. 249-286, December 1994.
[12] D. Nicol, "Scalability, Locality, Partitioning and Synchronization in PDES", Proceedings of the Parallel and Distributed Simulation Workshop (PADS98), Banff, May 1998.
[13] B. Unger, A. Covington, P. Gburzynski, F. Gomes, T. Ono-Tesfaye, S. Ramaswamy, C. Williamson and Z. Xiao, "A High Fidelity ATM Traffic and Network Simulator", Proceedings of the Winter Simulation Conference, Washington D.C., December 1995.
[14] B. Unger, Z. Xiao, J. Cleary, J. Tsai and C. Williamson, "Parallel Shared-Memory Simulator Performance for Large ATM Network Scenarios", to be submitted.
[15] B. Unger, Z. Xiao and J. Cleary, "High Performance Task-Based Parallel Simulation of ATM Networks", to be submitted.
[16] Z. Xiao and B. Unger, "Report on WarpKit - Performance Study and Improvement", Technical Report 98-628-19, Computer Science Department, University of Calgary, May 1995.
