Synchronizations and Rollbacks in Optimistic Distributed Simulation Scheme

Xiannong Meng
Department of Mathematics and Computer Science
University of Texas-Pan American
Edinburg, TX 78539
[email protected]

June 30, 1995

Abstract

This paper describes a study of coordination issues in optimistic distributed simulations implemented over a loosely coupled environment. We study in detail the relations among factors such as the number, depth, and frequency of rollbacks, the event intensity, and the speedup of the distributed simulations. The study shows that centralized coordination can perform very effectively in certain types of simulation tasks; that rollbacks have different characteristics at different event intensities; and how the overall performance of optimistic distributed simulations is affected by message intensity, communication overhead, and rollback frequency.

Key words:

parallel and distributed simulation, discrete event simulation, synchronization

1 Introduction

The theory and practice of parallel and distributed simulation (PADS) have made significant progress in recent years [7, 5, 11, 6, 12, 9]. In distributed simulation, the simulation task is divided into a number of logical processes (LPs), which are then distributed to one or more processors. These LPs can execute the simulation in parallel, so a significant speedup can often be achieved. The correctness of the simulation is guaranteed by the Local Causality Constraint (LCC) [6]. There are two classes of distributed simulation algorithms that ensure this correctness: conservative and optimistic. In a conservative approach, an LP will not proceed unless it is certain that no LCC violation can occur. In an optimistic approach, each LP proceeds at its own pace. If a causality violation is later detected, that is, an event with a lower timestamp than the local simulation clock is received, a rollback is carried out to recover from the error. A rollback restores the state of a process to a prior one, from which the incoming events can be processed in correct timestamp order. The optimistic approach potentially allows much more concurrency than the conservative approach and thus has received more research attention in the past few years.

This paper describes a study of coordination issues in optimistic distributed simulations implemented over a loosely coupled environment. The subject of the simulation is a bus-type, contention-based computer network. We present a detailed study of the relations among factors such as the number, depth, and frequency of rollbacks, the event intensity, and the speedup of the distributed simulations. The study shows that centralized coordination can perform very effectively in certain types of simulation tasks; that rollbacks have different characteristics at different event intensities; and how the overall performance of optimistic distributed simulations is affected by message intensity, communication overhead, and rollback frequency.

We state the problem under study in Section 2. In Section 3, we review related work. Section 4 presents and analyzes the experimental results, followed by some concluding remarks in Section 5.
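A minimal sketch of this rollback rule, assuming a simple event-list LP (the names and data structures are illustrative, not the paper's implementation):

```python
# Sketch of the optimistic rule described above: an LP processes events
# at its own pace, and an arriving event timestamped before the local
# simulation clock (a straggler) forces a rollback. Names illustrative.
import heapq

class LogicalProcess:
    def __init__(self):
        self.clock = 0.0       # local simulation clock
        self.future = []       # pending events: (timestamp, payload)
        self.processed = []    # executed events, kept so they can be undone

    def receive(self, timestamp, payload):
        if timestamp < self.clock:
            self.rollback(timestamp)   # causality violated: roll back first
        heapq.heappush(self.future, (timestamp, payload))

    def rollback(self, to_time):
        # Undo every event executed after the straggler's timestamp and
        # re-enqueue it so it is replayed in correct timestamp order.
        while self.processed and self.processed[-1][0] > to_time:
            heapq.heappush(self.future, self.processed.pop())
        self.clock = to_time

    def step(self):
        timestamp, payload = heapq.heappop(self.future)
        self.clock = timestamp
        self.processed.append((timestamp, payload))
```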

2 Coordination, Global Virtual Clock, and Rollbacks

In this section we describe the problem under study and the issues we investigated. Simulation of computer networks has long been an interesting research topic (see Section 3 for a brief review). With the development of simulation technology, more and more research projects concentrate on applying parallel and distributed simulation to network simulation problems. A key to the success of a distributed simulation is how to coordinate all participating agents.

Typically in optimistic PADS, each agent proceeds at its own pace. If an external, out-of-chronological-order event (a straggler) [5] is received by an agent, the agent performs a rollback by canceling all events it generated after the timestamp of the straggler. This rollback may trigger further rollbacks on other agents. Rollbacks are generally carried out in a distributed manner, i.e., with no central coordinator. However, when simulating a subject such as a bus-type, contention-based network, it is very efficient to use the central point (the bus network) as a coordinator to control the progress of the optimistic PADS. In our study, the simulation is divided into a number of LPs. Each LP represents one or more computers connected to the network being simulated. One LP represents the bus network and acts as the central coordinator. For a detailed description of the simulation set-up, see [8].

2.1 The Central Coordinator

The central contention point in our simulation is the bus to which every computer is connected. These computers compete for bus access. Each computer attached to the network has its own packet arrival stream. The access protocol is 1-persistent CSMA/CD: when a packet arrives, the computer checks whether the network is free; if it is, the packet is sent immediately; if not, the computer waits until the network becomes free and then sends the packet. The bus network is modeled as one LP. Each computer must consult this bus LP for access by sending it a message. The bus LP collects requests from its input queue, grants access if the network is free, and rejects the request if it is not. If at any time a straggler is found, the bus LP sends a rollback message with the timestamp of the straggler to each agent. Each agent acts accordingly when it receives this message. After the rollback is performed, the simulation proceeds. A sketch of this grant/reject logic is given below.
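The following sketch illustrates the bus LP's role as coordinator; the message shapes, field names, and the particular straggler test are our own assumptions for exposition (the actual set-up is described in [8]):

```python
# Illustrative bus-LP coordinator for 1-persistent CSMA/CD access.
# Message tuples and field names are assumptions, not the paper's code.
class BusLP:
    def __init__(self):
        self.busy_until = 0.0   # simulated time at which the bus becomes free
        self.last_seen = {}     # highest request timestamp seen per agent

    def handle_request(self, agent_id, timestamp, transmit_time):
        # A request timestamped earlier than one already handled for this
        # agent is treated as a straggler: broadcast a rollback message
        # carrying the straggler's timestamp to every agent.
        if timestamp < self.last_seen.get(agent_id, 0.0):
            return ("rollback", timestamp)
        self.last_seen[agent_id] = timestamp
        if timestamp >= self.busy_until:
            # 1-persistent: the medium is free, so grant access immediately.
            self.busy_until = timestamp + transmit_time
            return ("grant", timestamp)
        # Medium busy: reject; the agent waits until the bus frees up.
        return ("reject", self.busy_until)
```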

2.2 Calculation of the Global Virtual Clock

The advance of the simulation is indicated by the global virtual clock (GVC). The simulation is complete when the GVC is greater than or equal to the total simulation time specified. The GVC is defined as the minimum of the timestamps of all events in the system, including events in transit (for a general discussion of GVC, see, for example, [5]). There are different ways of calculating the GVC [5], decentralized or centralized. The use of a bus LP enables centralized calculation of the GVC in our simulation. The protocol for calculating the GVC is the following. Each agent sends a value of zero as its local simulation clock (LSC) to the bus LP at the beginning of the simulation. The bus LP keeps one LSC variable for each agent and updates the GVC to the minimum of all LSCs.
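A minimal sketch of this centralized computation, assuming (with our own naming) a table of reported LSCs held by the bus LP; note that a fully general GVC must also cover events in transit, as defined above:

```python
# Centralized GVC: the bus LP keeps one LSC per agent and takes the
# minimum over all of them. This sketch assumes the reported LSCs
# already account for events in transit.
lsc = {"agent1": 0.0, "agent2": 0.0, "agent3": 0.0}  # all start at zero

def update_gvc(lsc_table):
    return min(lsc_table.values())

lsc["agent1"] = 42.7      # agent1 reports progress
print(update_gvc(lsc))    # prints 0.0: the GVC trails the slowest agent
```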

2.3 Rollbacks

Each time a straggler is detected, the bus LP sends a message with the timestamp of the straggler to every agent, indicating that a rollback on that agent may be needed. Each agent then compares its own LSC with the timestamp of the straggler. If the LSC is greater than that of the straggler, a rollback is executed. The total number of rollback events is recorded, as is the number of events affected by each rollback. At the end of the simulation, the average number of events per rollback (the rollback depth) is calculated.
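This bookkeeping reduces to two counters per run; a sketch with our own naming:

```python
# Rollback statistics as described above: count each rollback and the
# events it cancels, then report the average events per rollback.
class RollbackStats:
    def __init__(self):
        self.rollbacks = 0
        self.cancelled = 0

    def record(self, num_cancelled):
        self.rollbacks += 1
        self.cancelled += num_cancelled

    def depth(self):
        # Average number of events affected per rollback (rollback depth).
        return self.cancelled / self.rollbacks if self.rollbacks else 0.0
```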

3 Related Work

There have been many studies of simulating computer networks using distributed and parallel techniques. Chai and Ghosh [3] presented a detailed study of distributed simulation of a large-scale Broadband-ISDN network on a set of loosely coupled parallel processors. They used up to 50 workstations connected through a regular network. Their simulation is based on the conservative mechanism, in that each logical model executes only up to the minimum clock over all its incoming links. Mouftah and Sturgeon [10] discussed the design and implementation of DISDESNET, a distributed simulation package for general communication networks. Their discussion centered on a multiprocessor environment; the tests were done on a two Sun-workstation network. Baker et al. [1] described a package that simulates and prototypes radio communication networks in a loosely coupled environment. Carothers et al. [2] studied explicitly the effect of communication overheads on Time Warp performance. The study of distributed simulation of computer networks can be traced back to as early as the late 1970s [4]. [5] gives a brief survey of different mechanisms used to calculate the GVC. To the best knowledge of the author, there have been no published results relating the performance of distributed simulations to factors such as the number of events affected by each rollback (the rollback depth).

4 Results and Discussion

The experiments were conducted for three different simulation configurations: one client, two clients, and four clients. The results presented in this section are divided into four groups across the three configurations: wall-clock time; the number of events processed; the number of rollbacks that occurred during simulation; and the number of events affected by the rollbacks.

4.1 Wall-clock Time Comparison

The wall-clock time needed to complete the simulations is shown in Figure 1 (for detailed numbers, standard deviations, and confidence intervals, see the Appendix). The figure shows that simulations with four clients consistently perform the worst and those with two clients perform the best. The primary reason is that the rollback overhead (see the comparisons below) is too high for simulations with more clients. The two-client simulation performs better than the one-client case because the computation load is distributed across two clients. This result indicates a limit on the benefit of having more clients participate in a distributed simulation. When the amount of communication between clients is limited, one would expect the performance of a distributed simulation to improve as the number of participants increases. However, the advantage of distributed simulation diminishes because the communication overhead grows too quickly when more clients are involved.

4.2 Events Processed on Each Client

Figure 2 compares the average number of events processed by each client in the three simulation configurations. Because the simulation is configured so that each configuration executes the same amount of logical work, the number of useful events per client is smaller in the four-client case than in the two-client case. The per-client event counts nevertheless appear about the same in Figure 2 because the number of rollbacks is far greater in the four-client case than in the two-client case (see the comparison in the next subsection). In this comparison, the one-client simulation can be used as a baseline. The number of events processed by each client in the distributed cases (two and four clients) is between 12% and 18% lower than in the one-client case; the sum of the event counts over all participating clients is higher in the distributed cases because of rollbacks. Since the number of events processed by each client is about the same in the two-client and four-client configurations, and it takes more time to communicate among all clients in the four-client configuration, the total time needed to complete the simulation is longer in the four-client case than in the two-client case.

4.3 Number of Rollbacks

The average number of rollbacks that occurred during each simulation run is presented in Figure 3. Compared to the total number of events, the ratio of rollbacks is relatively high. This is mainly because the simulation is carried out in a loosely coupled environment: the communication delay is relatively long compared to the time needed to carry out any computation. It is therefore very easy for the agents to lose synchronization with one another, causing a relatively large number of rollbacks.

4.4 Number of Events Affected by Rollbacks

Cancellation was carried out aggressively when rollbacks occurred; no lazy cancellation policy was implemented. The number of events canceled during rollbacks was recorded and is shown in Figure 4. The information is collected on a per-client basis, i.e., the number shown is the average number of events affected by rollbacks as seen on each client. The total number of events affected by rollbacks across all clients is the sum over the clients. Although the exact number of rollbacks and the total number of events vary, the ratio of the total number of rollbacks to the total number of events processed is surprisingly close to constant across the different simulation configurations. Table 1 shows this result; a quick check of the ratios using the table's own numbers follows the table. The reason for this phenomenon is not clear at this point: the ratio is statistically very stable, and further study is needed to uncover the reason behind it.


            4-client configuration                    2-client configuration
load   total events  rollbacks  rollbacks/event  total events  rollbacks  rollbacks/event
0.2        3024         320          0.11            1526         146          0.096
0.5        8020         844          0.11            4026         368          0.091
0.8       13964        1474          0.11            6946         591          0.085

Table 1: Rollbacks per Event
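As a quick check, the snippet below recomputes the rollbacks-per-event ratio from Table 1's own numbers:

```python
# Recompute the rollbacks/event ratio from Table 1.
table1 = {
    # load: (4-client events, 4-client rollbacks,
    #        2-client events, 2-client rollbacks)
    0.2: (3024, 320, 1526, 146),
    0.5: (8020, 844, 4026, 368),
    0.8: (13964, 1474, 6946, 591),
}
for load, (e4, r4, e2, r2) in table1.items():
    print(f"load {load}: 4-client {r4 / e4:.3f}, 2-client {r2 / e2:.3f}")
# load 0.2: 4-client 0.106, 2-client 0.096
# load 0.5: 4-client 0.105, 2-client 0.091
# load 0.8: 4-client 0.106, 2-client 0.085
```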

5 Conclusions

We presented in this paper some performance figures for optimistic distributed simulation in a loosely coupled environment. The data were collected from a simulation program that lends itself to a centralized coordinator. The results indicate that it is very easy to compute the global virtual clock in such an environment. They also show that in a loosely coupled environment, the cost of communication can outweigh the advantages of distributed computing as the number of participating clients increases, and that the total number of rollbacks increases with the number of participating clients. It is interesting to note that while the number of rollbacks varies, the ratio of rollbacks to the number of events is close to a constant (configuration invariant). Future work includes expanding this experiment to different simulation problems where a central coordinator may be needed. It will also be very interesting to investigate the issue of configuration invariants in distributed simulations.

References

[1] D.J. Baker, J.P. Hauser, and W.A. Thoet. A distributed simulation and prototyping testbed for radio communication networks. IEEE Journal on Selected Areas in Communications, 6(1), January 1988.

[2] Christopher Carothers, Richard Fujimoto, and Paul England. Effect of communication overheads on Time Warp performance: An experimental study. In Proceedings of the 1994 PADS Workshop, 1994.

[3] Arthur Chai and Sumit Ghosh. Modeling and distributed simulation of a Broadband-ISDN network. IEEE Computer, 26(9), September 1993.

[4] K.M. Chandy, V. Holmes, and J. Misra. Distributed simulation of networks. Computer Networks, 3, 1979.

[5] Alois Ferscha and Satish K. Tripathi. Parallel and distributed simulation of discrete event systems. Technical Report CS-TR-3336, Department of Computer Science, University of Maryland, College Park, MD 20742, 1994.

[6] Richard Fujimoto. Parallel discrete event simulation. Communications of the ACM, 33(10), October 1990.

[7] Yi-Bing Lin and Paul A. Fishwick. Asynchronous parallel discrete event simulation. Technical Report tr95005, Department of Computer and Information Sciences, University of Florida, Gainesville, FL, 1995.

[8] Xiannong Meng. Distributed simulation in a loosely coupled environment using the TCP/IP protocol. In Proceedings of the Fourteenth Annual International Phoenix Conference on Computers and Communications, pages 122-127, 1995.

[9] Jayadev Misra. Distributed discrete-event simulation. Computing Surveys, 18(1), March 1986.

[10] H.T. Mouftah and R.P. Sturgeon. Distributed discrete event simulation for communication networks. IEEE Journal on Selected Areas in Communications, 8(9), December 1990.

[11] David Nicol and Richard Fujimoto. Parallel simulation today. Technical report, Department of Computer Science, College of William & Mary, Williamsburg, VA 23187, 1992.

[12] Rhonda Righter and Jean Walrand. Distributed simulation of discrete event systems. Proceedings of the IEEE, 77(1), January 1989.


Appendix

        4-client configuration    2-client configuration    1-client configuration
load     mean    dev.    c.i.      mean    dev.    c.i.      mean    dev.    c.i.
0.2      36.5    4.68    4.33     25.55    1.81    1.68      27.5    1.95    1.81
0.5      71.9    4.26    3.94      56.2    1.93    1.78      66.4    3.79    3.50
0.8     120.4    4.02    3.71      93.8    2.60    2.41     102.2    4.77    4.41

Table 2: Wall-clock Time (seconds)

        4-client configuration    2-client configuration    1-client configuration
load     mean    dev.    c.i.      mean    dev.    c.i.      mean    dev.    c.i.
0.2       757    40.2    37.2     763.8    32.0    29.6     908.9    20.8    19.3
0.5      2005    88.9    82.2      2013    67.9    62.8      2472   113.5   104.9
0.8      3491   128.0   118.4      3472    95.2    88.0      3958   187.1   173.1

Table 3: Number of Events Processed (per client)

        4-client configuration    2-client configuration
load     mean    dev.    c.i.      mean    dev.    c.i.
0.2       320    24.0    22.2       146    10.0    9.27
0.5       845    45.6    42.2       368    17.5    16.2
0.8      1474    63.0    58.2       591    12.5    11.6

Table 4: Number of Rollbacks

        4-client configuration    2-client configuration
load     mean    dev.    c.i.      mean    dev.    c.i.
0.2       591    32.8    30.3       300    26.8    24.8
0.5      1572    86.3    79.8       725    32.1    29.7
0.8      2576    111     103       1131    44.2    40.9

Table 5: Number of Events Affected by Rollbacks


[Figure 1: Wall-clock Time Comparison. Elapsed time (sec.) versus load, for one, two, and four clients.]

[Figure 2: Total Number of Events (seen from the client side). Average number of events processed versus load, for one, two, and four clients.]


[Figure 3: Number of Rollbacks. Number of rollbacks versus load, for two and four clients.]

[Figure 4: Number of Events Affected by Rollbacks. Number of events affected versus load, for two and four clients.]
