The applicability of importance sampling in performance simulation of networks

Poul E. Heegaard
Norwegian University of Science and Technology, Dept. of Telematics, N-7034 Trondheim, Norway
Telephone: +47 73 55 0494, Fax: +47 73 53 2586, E-mail: [email protected]
Keywords: rare event simulation, importance sampling, networks, adaptive biasing, heuristics, generalisations

Abstract
In this paper the focus is on the application of importance sampling in performance simulation of well-engineered networks. An adaptive approach has previously been proposed for the dynamic adjustment of the simulation parameters that is required to make importance sampling efficient. The applicability of this approach is demonstrated by results from two network simulation studies. The experience gained from these and other experiments is also included. The most important observation is that the likelihood ratio may serve as an indication of the correctness of the simulation results. It can be used when neither exact values nor direct simulation results are available. The second part of this paper contains a discussion of possible extensions of the work with importance sampling in network simulations. The discussion includes the generality of the arrival and service processes, performance measures other than blocking, state space constraints and size, and operation on very different time scales. All these topics must be considered if importance sampling is to be included in a tool for evaluating the quality of service in telecommunication networks.
1 Introduction
Simulation is a flexible means for assessing the quality of service offered by a telecommunication system. However, when very strict requirements are put on the quality of service, the simulation becomes inefficient because the performance depends on the occurrence of rare events. A rare event is, for instance, a cell loss or a system breakdown. A simulation technique that speeds up the experiments must therefore be added. Various techniques are known from the literature, and they should be combined to achieve additional speedup. The most efficient techniques for rare event simulation are importance sampling and RESTART. Both have documented impressive simulation speedups compared to direct simulation. However, most of the applications reported in the literature concern single-dimensional models, e.g. GI/GI/1/N queues. This paper focuses on the application of importance sampling in rare event simulation of telecommunication systems. The efficiency of this technique is rather critical to the change of measure, i.e. the change of simulation parameters, denoted as parameter biasing in the following. Hence, it is essential for the success of importance sampling that methods exist for optimal parameter biasing, or at least heuristics for obtaining good parameters. Up to now, much work has been done on (asymptotically) optimal biasing in single queueing networks, or in networks where the system performance, e.g. the system servability, mainly depends on the performance of only one queue [Fra93]. See [Hei95] for a thorough survey of parameter biasing. In a well-engineered telecommunication network, the system performance depends on the performance of several interacting queues. The simulation model is described in a multidimensional state space with a large number of resource constraints. To obtain good estimates of the system performance it is insufficient to bias the parameters to observe, for example, loss in only one specific queue. Instead, a recently proposed adaptive parameter biasing [Hee98] always considers the queue with the most significant contribution to the system performance as viewed from the current state. In Section 2, the modelling assumptions and the basic idea behind the adaptive approach are outlined. The feasibility of the adaptive biasing and some other experience are reported in Section 3. Section 4 discusses the extensions to the modelling framework that should be made before it can be used as a general performance tool for communication systems. The paper closes with some final remarks in Section 5.
2 Importance sampling in network simulation
2.1 Model

The simulation process { X(t) ∈ Ω^K ; t ≥ 0 } is a K-dimensional continuous time Markov chain (CTMC). It is defined on a discrete state space Ω^K = { 0, 1, …, M_k }^K. (When resource limitations are added to the model, e.g. finite buffer capacities, the feasible region of Ω^K is reduced; a common resource limitation will, for instance, cut the corners of this state cube.) The model describes a network comprising K generators (e.g. traffic sources or link failures) and J resource pools (e.g. queues or links). The dimensionality of the state space is given by the number of generators, while the number of states in each dimension depends on the number of resources in each pool (e.g. the link capacity) and on the generators' relation to the pools (e.g. the traffic routing). The simulated events, e.g. arrivals and service completions, are triggered and controlled within the generators. The generators have the following attributes:
• arrival and departure rates,
• capacity requirement,
• priority level,
• routing strategy, i.e. a set of resource pools from which resources are requested,
• rerouting strategies, for instance on overload or link failure.
The generators can model traffic classes with different rates and quality of service requirements. Traffic classes with high priority are allowed to pre-empt lower priority traffic classes. Figure 1 illustrates a typical communication system that can be modelled within this framework. The generator framework also applies to the description of dependability aspects, like link failures, with the occurrence and repair of link failures as events. These failure generators have the highest priority level of all generators. The occurrence of a failure corresponds to the arrival of a "call" that requests the entire link capacity and will pre-empt all current resource allocations of lower priority. More advanced modelling of dependability aspects, e.g. partial link capacity degradation, is also possible. This flexible model framework allows combinations of both traffic and dependability simulation, as depicted in Figure 1.
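To make the generator framework concrete, the following is a minimal Python sketch of how a generator and its resource-pool relations might be represented. All names (ResourcePool, Generator, route_available) and the example rates and capacities are illustrative assumptions, not the data structures of the simulator described in this paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ResourcePool:
    """A resource pool, e.g. a link with a given capacity."""
    name: str
    capacity: int          # e.g. number of channels on the link
    used: int = 0

@dataclass
class Generator:
    """A traffic (or failure) generator, one dimension of the state space."""
    name: str
    arrival_rate: float    # lambda_k
    departure_rate: float  # mu_k
    capacity_req: int      # capacity requested per arrival ("call")
    priority: int          # higher value may pre-empt lower values
    primary_route: List[ResourcePool] = field(default_factory=list)
    secondary_route: List[ResourcePool] = field(default_factory=list)

    def route_available(self, route: List[ResourcePool]) -> bool:
        """True if every pool on the route can carry one more call."""
        return all(p.used + self.capacity_req <= p.capacity for p in route)

# Example: two generators sharing a link; the failure generator requests the
# entire link capacity and has the highest priority, so it pre-empts calls.
link = ResourcePool("link-1", capacity=40)
type_a = Generator("user type A", arrival_rate=2.0, departure_rate=1.0,
                   capacity_req=1, priority=1, primary_route=[link])
failure = Generator("link-1 failure", arrival_rate=1e-4, departure_rate=1e-2,
                    capacity_req=40, priority=99, primary_route=[link])
```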
[Figure 1: Fictitious backbone network covering Norway. The figure shows a ten-node backbone with two user types (type A, priority 1; type B, priority 2), their primary and secondary routes, and annotations for the generator attributes: service classes, quality of service requirements, pre-emptive priority levels, rerouting on overload and link/node failure, and link and node failures.]
2.2 Adaptive biasing

The adaptive biasing scheme combines optimal biasing from single server networks [PW89, CFM83] with ideas from a technique called failure distance biasing [Car91]. The major objective of this work has been to develop a biasing scheme that is robust and efficient for large networks, rather than being optimal for all kinds of general arrival and service processes. The adaptive biasing has been successively improved since it was first introduced in [Hee95]. Application to a network was first presented in [Hee96], and the scheme was later extended to cover both traffic and dependability models in [Hee97]. In [Hee98], the modelling framework was further generalised and the adaptive biasing improved. The objective of the adaptive technique is to ensure that the biasing emphasises the most likely of the unlikely sequences of events leading to a target. A target is a state sub-space where a rare event of interest is observed.
The following algorithm is proposed:
(i) Choose, at random, one of the targets in accordance with its current relative importance,
(ii) bias the parameters to induce a positive drift towards this single target,
(iii) move to the next state with the new set of parameters.
For every new state, the algorithm is repeated until a target is visited or the experiment is terminated. The targets' relative importance is denoted the target distribution. It changes as the state changes; the illustration below shows how the relative importance of three targets changes as the simulation process moves through the state space. The target distribution is the critical factor of this algorithm because it is the basis for the next direction decision. Its calculation must be efficient, since it should ideally be recalculated for every new state visited in a simulation experiment. Using this algorithm, the parameters are biased adaptively to the current state by inducing a positive drift towards the chosen target, which is, in mean, the currently most important target. Even though a target is chosen after every state change, experiments have shown that the resulting sequence of events is far more directed towards the targets than in experiments with biasing towards all targets simultaneously. It is also observed that the targets are visited nearly in accordance with their relative contributions to the system performance, whenever a good estimate of the target distribution exists. In addition to being robust and precise, the estimate has to be efficient because it is determined after nearly every state transition. In [Hee98], an estimate of the distribution that takes these objectives into account is proposed.
[Illustration: the relative importance of targets 1, 2 and 3 for a TYPE A generator. Initially a target is chosen at random and the measure is changed towards it; after approaching target 1, and later target 3, the target is chosen again according to the new relative importance estimates.]
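To illustrate the algorithm, the following is a minimal Python sketch of one importance sampling replication with adaptive biasing, under simplifying assumptions: the user supplies the target-distribution estimate (target_weight), the original and biased transition rates, and a target test. The likelihood ratio is accumulated over the embedded jump chain only, and holding-time contributions are omitted. It shows the adaptive loop, not the specific biasing or the target distribution estimate of [Hee98].

```python
import random

def adaptive_is_replication(x0, targets, target_weight, rates, biased_rates,
                            is_target, max_steps=10_000):
    """One importance sampling replication with adaptive biasing (sketch).

    x0            -- initial state (e.g. a tuple of generator states)
    targets       -- list of target identifiers
    target_weight -- target_weight(state, target): relative importance estimate
    rates         -- rates(state): {event: (rate, next_state)}, original measure
    biased_rates  -- biased_rates(state, target): {event: rate}, biased towards target
    is_target     -- is_target(state): True when a rare event of interest occurs
    Returns (hit, L): whether a target was hit and the accumulated likelihood
    ratio of the embedded jump chain (holding times are ignored here).
    """
    x, L = x0, 1.0
    for _ in range(max_steps):
        if is_target(x):
            return True, L
        # (i) choose one target at random according to its relative importance
        weights = [target_weight(x, t) for t in targets]
        chosen = random.choices(targets, weights=weights, k=1)[0]
        # (ii) bias the parameters to induce a drift towards the chosen target
        original = rates(x)                    # {event: (rate, next_state)}
        biased = biased_rates(x, chosen)       # {event: biased rate}
        # (iii) draw the next event from the biased embedded (jump) chain
        events = list(original)
        event = random.choices(events, weights=[biased[e] for e in events], k=1)[0]
        # accumulate the likelihood ratio: original vs. biased jump probability
        p_orig = original[event][0] / sum(r for r, _ in original.values())
        p_bias = biased[event] / sum(biased.values())
        L *= p_orig / p_bias
        x = original[event][1]                 # move to the next state
    return False, L
```

Any cheap state-dependent estimate of the target distribution can be plugged in as target_weight; the rare event probability is then estimated by the sample mean, over replications, of L for replications that hit a target and 0 otherwise.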
3 Simulation experiments
A large number of simulation experiments with importance sampling has been conducted on a variety of models. In Section 3.2, simulations of complex, multidimensional network examples are reported, while in Section 3.1, heuristics are presented based on observations from these simulations, in addition to simulations of simple one- and two-dimensional models.
3.1 Importance sampling heuristics

Based on simulation experience and analytic results from a single queue, it is observed that the likelihood ratio may serve as an indication of the goodness of the simulation results. This is important in cases where neither analytic nor direct simulation results can be obtained. The expected value of the likelihood ratio is 1. Thus:
1. If the mean observed likelihood ratio is close to 1 with a small relative error, then the estimate from the importance sampling simulation is likely to be good if its relative error is also small.
2. If the mean observed likelihood ratio is much less than 1, or has a large relative error, then the estimate from the importance sampling simulation is likely to be poor, even if its relative error is small.
A reasonable explanation for this was obtained by studying analytic results from a single queue. These revealed that when the parameters are too heavily biased, the sampling distribution of the likelihood ratio, and hence also of the importance sampling estimates, becomes heavy tailed. This means that the estimates have infinite variance, and after a finite number of samples the mean value will most likely be (much) less than the expected value.
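As a hedged illustration of these heuristics, the sketch below estimates the probability that an M/M/1 queue reaches level N before emptying, starting from one customer, using the standard change of measure for this toy model (swapping the arrival and service probabilities). It reports the estimate, the exact gambler's-ruin value, and the mean observed likelihood ratio with their relative errors. The model, the biasing and the parameter values are assumptions chosen only to demonstrate the diagnostic; they are not the adaptive scheme of Section 2.2. With this biasing the mean likelihood ratio should come out close to 1 with a small relative error, as in heuristic 1.

```python
import math
import random

def is_replication(lam, mu, N):
    """One biased replication of the embedded M/M/1 jump chain: start in
    state 1, stop when the queue empties (0) or reaches level N.
    Biasing: the up/down step probabilities are swapped."""
    p = lam / (lam + mu)      # original probability of an up step (arrival)
    p_b = mu / (lam + mu)     # biased probability of an up step (swapped)
    x, L = 1, 1.0
    while 0 < x < N:
        if random.random() < p_b:            # biased up step
            L *= p / p_b
            x += 1
        else:                                # biased down step
            L *= (1 - p) / (1 - p_b)
            x -= 1
    return x == N, L

def mean_and_rel_error(samples):
    m = sum(samples) / len(samples)
    var = sum((s - m) ** 2 for s in samples) / (len(samples) - 1)
    return m, (math.sqrt(var / len(samples)) / m if m > 0 else float("inf"))

def run(lam=0.3, mu=1.0, N=20, reps=100_000):
    estimates, ratios = [], []
    for _ in range(reps):
        hit, L = is_replication(lam, mu, N)
        estimates.append(L if hit else 0.0)  # contributes to P(reach N before 0)
        ratios.append(L)                     # contributes to the mean likelihood ratio
    est, est_err = mean_and_rel_error(estimates)
    mlr, mlr_err = mean_and_rel_error(ratios)
    r = mu / lam
    exact = (r - 1) / (r ** N - 1)           # gambler's ruin, start in state 1
    print(f"IS estimate          : {est:.3e} (rel. error {est_err:.2%})")
    print(f"exact value          : {exact:.3e}")
    print(f"mean likelihood ratio: {mlr:.3f} (rel. error {mlr_err:.2%}, expected value 1)")

if __name__ == "__main__":
    run()
```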
3.2 Feasibility study
The quantities of interest are the blocking probabilities of all, or a selection of, the user types, i.e. the generators.

Case 1: No priority and no alternative routing. When all the generators have the same priority and a fixed routing, it is feasible to obtain exact blocking probabilities for a model of moderate size, e.g. by the convolution method [Ive87]. Hence, the simulation results from this case can be compared with exact values. In [Hee96] and [Hee97] the efficiency and precision of the simulation method are demonstrated. The figure from this study shows very good correspondence between the exact and the importance sampling simulation results.
[Figure: blocking probabilities per generator (0-10), in the range 1e-11 to 1e-07; simulation results with error bars plotted against exact results.]
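For readers unfamiliar with such exact computations, the following is a minimal Python sketch of the well-known Kaufman-Roberts recursion for a single link shared by several traffic classes. It is a simpler, single-link relative of the network convolution algorithm of [Ive87], included only to make the notion of exact blocking probabilities concrete; the capacity and offered loads in the example are made-up values, not those of the study above.

```python
def kaufman_roberts(capacity, classes):
    """Exact per-class call blocking on a single link shared by several
    traffic classes (Kaufman-Roberts recursion).

    capacity -- link capacity in channels
    classes  -- list of (offered_traffic_erlang, channels_per_call) tuples
    Returns one blocking probability per class.
    """
    q = [0.0] * (capacity + 1)               # unnormalised occupancy distribution
    q[0] = 1.0
    for j in range(1, capacity + 1):
        q[j] = sum(a * d * q[j - d] for a, d in classes if j >= d) / j
    total = sum(q)
    # a class call is blocked when fewer than d channels are free
    return [sum(q[capacity - d + 1:]) / total for _, d in classes]

# Illustrative values only: a 40-channel link, two classes
print(kaufman_roberts(40, [(10.0, 1), (2.0, 4)]))
```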
Case 2: Improving the QoS by rerouting. This case considers the use of rerouting. Two simulation series were produced, one with, and one without, an alternative (secondary) route. The improvement of the quality of service in terms of reduced call blocking is estimated. The figure from this study shows the simulated blocking probabilities for the case without rerouting, compared with rough blocking approximations (exact results were not provided).
[Figure, panel (a), with primary route only: blocking probabilities for generators 1-15, in the range 1e-10 to 1e-04; upper bounds of blocking plotted together with the simulated blocking probabilities.]
This, in addition to the simulation results in [Hee98], shows that it is feasible to use the adaptive strategy to conduct network simulations within a framework where users have different service requirements and pre-emptive priorities, and may switch to an alternative route when the primary route is blocked.
4 Generality
The adaptive biasing of the importance sampling parameters was developed under the assumption of a continuous time, discrete event Markov chain. This makes the strategy applicable to a broad class of simulation problems, not only the performance simulation of networks treated here. However, it also imposes some restrictions on the generality. This section addresses a few issues that ought to be considered before the adaptive biasing approach can be included in a general performance simulation tool for telecommunication networks.
4.1 Arrival and departure processes

The model assumes Poisson processes with constant rates or with rates dependent on the state of a single generator. Extending the model to Poisson processes with globally state dependent rates will affect the calculation of the relative target importance, which must therefore be changed. Globally state dependent rates are needed if, for instance, failure propagation or access control based on network feedback is to be modelled. Extending the process to a generalised semi-Markov process (GSMP) [Dam93, Gly89] requires a redesign of the underlying model and hence also of the adaptive strategy. The basic idea of adaptively changing the parameters will, however, remain unchanged. Non-Poisson processes can be modelled by uniformisation as described in [GSHG92], with the necessary changes made to the adaptive strategy. A redesign of the importance sampling strategy should not be carried out without reconsidering other rare event simulation approaches like RESTART [VA+94].
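As an illustration of the uniformisation step mentioned above, the sketch below applies the classical construction to a small, arbitrary CTMC generator matrix: a uniformisation rate at least as large as the fastest total outflow rate is chosen, and the chain is represented as a discrete-time jump chain P = I + Q/rate observed at the events of a Poisson process with that rate. This is only the textbook construction, not the specific formulation of [GSHG92].

```python
import numpy as np

def uniformise(Q, rate=None):
    """Uniformisation of a CTMC: return (rate, P) with P = I + Q/rate, where
    rate >= max_i |Q[i, i]| and P is the transition matrix of the jump chain
    observed at the events of a Poisson process with that rate."""
    Q = np.asarray(Q, dtype=float)
    if rate is None:
        rate = float(np.max(-np.diag(Q)))
    P = np.eye(Q.shape[0]) + Q / rate
    return rate, P

# Arbitrary 2-state example: an on/off (working/failed) generator
Q = [[-0.01,  0.01],   # failure rate 0.01
     [ 1.00, -1.00]]   # repair rate 1.0
rate, P = uniformise(Q)
print(rate)
print(P)
```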
4.2 Performance measure

In most applications of importance sampling up till now, performance measures like the steady state blocking probability have been studied, and the importance sampling parameters have been optimised with respect to these kinds of measures. Obviously, other measures, like delay, the waiting time distribution [Par98], and various transient measures, should also be taken into account. This will probably change the biasing.
4.3 State space constraints

The generators in the model in Section 2.1 are pure birth-death processes which span simple Markov chains in each dimension of the state space. The resource capacities of the network constrain the model. A set of generators that share a resource gives a surface barrier in the state space. Recall from Section 2.2 that this barrier is the target sub-space used in the adaptive biasing strategy. Figure 2 below shows a 3-dimensional example with 4 resource types. If generators that are not pure birth-death processes are included in the model, e.g. with deferred repair or batch arrivals, the surface barriers in the state space will be less regular than those in the current model. This will, at least, complicate the calculation of the relative target importance (the target distribution).
4.4 Different time scales

As mentioned in Section 2.1, the modelling framework allows both dependability and traffic aspects to be included. If these are mixed in the same experiment, the result will often be a combination of generators whose arrival and departure processes operate on very different time scales.
[Figure 2: Irregular resource limitations in a 3D example. The state diagram spans three generators with arrival rates λ1, λ2, λ3, departure rates µ1, µ2, µ3, per-generator capacities N1=3, N2=4, N3=6, and shared resource limitations N12=5 and N123=8 that cut into the state cube.]
For instance, the time between critical link failures should be significantly longer than the time between call arrivals, while the time to link repair will normally be longer than the duration of a call. Simulating both link failures and call arrivals may then result in an enormous number of calls between each link failure. This problem is not solved by importance sampling biasing strategies. Under Markov and steady state assumptions, the arrival and departure rates can be re-scaled as follows. Let λ_k and µ_k be the arrival and departure rates, respectively, of dimension (generator) k. If the departure rate in each dimension is rescaled to unity, µ'_k = 1, then the arrival rates must be rescaled correspondingly, λ'_k = λ_k / µ_k. No similar method is known in the general case. Mixing dependability and traffic processes is normally not recommended, but even with pure dependability or pure traffic processes the problem is not always avoided.
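A minimal numerical check of this rescaling, under the assumption (made only for this illustration) of a single birth-death generator with constant rates: its stationary distribution depends only on the ratio λ_k/µ_k, so replacing (λ_k, µ_k) by (λ_k/µ_k, 1) leaves the steady-state behaviour unchanged.

```python
def birth_death_stationary(lam, mu, capacity):
    """Stationary distribution of a single birth-death generator with
    constant arrival rate lam, departure rate mu and finite capacity."""
    rho = lam / mu
    weights = [rho ** i for i in range(capacity + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# A slow failure/repair generator versus its rescaled version (mu' = 1)
original = birth_death_stationary(lam=1e-4, mu=1e-2, capacity=3)
rescaled = birth_death_stationary(lam=1e-4 / 1e-2, mu=1.0, capacity=3)
print(all(abs(a - b) < 1e-12 for a, b in zip(original, rescaled)))  # True
```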
4.5 State space dimensionality and size

The current implementation of the adaptive biasing strategy has been tested on a model with 50 dimensions, and on a model with 300 resources. The efficiency drops more than linearly, but less than exponentially, as the number of dimensions and resources increases. In addition, the feasibility demonstration in Section 3.2 used a model with up to 32 dimensions and 10 different resource pools, each with a capacity of at most 40. This is beyond trivial models, but still not a large system. Hence, further testing and possibly improvements of the implementation are recommended.
5 Closing remarks
To develop a performance simulation tool for the evaluation of rare events in a telecommunication network, a list of issues must be considered. This paper has only briefly discussed some of them. The setup of the rare event simulation technique, in this case the importance sampling, ought to be automated, or at least supported by good guidelines, before it can be included in a tool. The setup is not a straightforward task, and it is unfortunately also very critical with respect to the efficiency of importance sampling simulations. If the biasing is far from optimal, the results may be completely wrong. Hence, an indication of the goodness of the simulation results is required when no exact or approximate results exist, for example the observed likelihood ratio proposed in this paper. For single-dimensional models there exist well-specified biasings that guarantee efficient simulations. For multidimensional network models this is not the case, although the recently proposed adaptive biasing strategy seems to be a step in the right direction. This approach works well on continuous time, discrete event Markov chains.
RESTART is an alternative technique whose efficiency is said to be more robust to the choice of simulation parameters than is the case for importance sampling. However, to the author's knowledge, no work has yet been done on multidimensional network models other than in a symmetric case, i.e. where all generators have the same arrival and departure process with the same parameters [Vill98].
REFERENCES

[Car91] Juan A. Carrasco. Failure distance based simulation of repairable fault-tolerant computer systems. In G. Balbo and G. Serazzi, editors, Proceedings of the Fifth International Conference on Computer Performance Evaluation: Modelling Techniques and Tools, pages 351-365. North-Holland, Feb. 15-17, 1991.
[CFM83] Marie Cottrell, Jean-Claude Fort, and Gérard Malgouyres. Large deviations and rare events in the study of stochastic algorithms. IEEE Transactions on Automatic Control, AC-28(9):13-18, September 1983.
[Dam93] Halim Damerdji. Parametric inference for generalised semi-Markov processes. In G. W. Evans, M. Mollagasemi, E. C. Russel, and W. E. Biles, editors, Proceedings of the Winter Simulation Conference, pages 323-328, Los Angeles, California, USA, Dec. 1993.
[Fra93] Michael R. Frater. Fast simulation of buffer overflows in equally loaded networks. Australian Telecom Research, 27(1):13-18, 1993.
[GSHG92] A. Goyal, P. Shahabuddin, P. Heidelberger, and P. W. Glynn. A unified framework for simulating Markovian models for highly dependable systems. IEEE Transactions on Computers, 41(1):36-51, 1992.
[Gly89] Peter W. Glynn. A GSMP formalism for discrete event systems. Proceedings of the IEEE, 77(1):14-23, Jan. 1989. Invited paper.
[Hee95] Poul E. Heegaard. Rare event provoking simulation techniques. In Proceedings of the International Teletraffic Seminar (ITS), pages 17.0-17.12, Bangkok, Thailand, 28 Nov.-1 Dec. 1995. Session III: Performance Analysis I, Regional ITC Seminar.
[Hee96] Poul E. Heegaard. Adaptive optimisation of importance sampling for multi-dimensional state space models with irregular resource boundaries. In Peder J. Emstad, Bjarne E. Helvik, and Arne H. Myskja, editors, The 13th Nordic Teletraffic Seminar (NTS-13), pages 176-189, Trondheim, Norway, 20-22 August 1996. Tapir Trykk.
[Hee97] Poul E. Heegaard. Efficient simulation of network performance by importance sampling. In Teletraffic Contributions for the Information Age, Washington D.C., USA, June 23-27, 1997.
[Hee98] Poul E. Heegaard. A scheme for adaptive biasing in importance sampling. AEÜ International Journal of Electronics and Communications, Special Issue on Rare Event Simulation, 52(3):172-182, May 1998.
[Hei95] Philip Heidelberger. Fast simulation of rare events in queueing and reliability models. ACM Transactions on Modeling and Computer Simulation, 5(1):43-85, January 1995.
[Ive87] Villy B. Iversen. A simple convolution algorithm for the exact evaluation of multi-service loss system with heterogeneous traffic flows and access control. In Ulf Körner, editor, The 7th Nordic Teletraffic Seminar (NTS-7), pages IX.3-1 to IX.3-22, Lund tekniska högskola, Sweden, 25-27 August 1987. Studentlitteratur.
[PW89] Shyam Parekh and Jean Walrand. Quick simulation of excessive backlogs in networks of queues. IEEE Transactions on Automatic Control, 34(1):54-66, 1989.
[Par98] Shyam P. Parekh. Quick simulation of stationary delay tail probabilities in tandem queues. AEÜ International Journal of Electronics and Communications, Special Issue on Rare Event Simulation, 52(3):162-164, May 1998.
[Vill98] José Villén-Altamirano. RESTART method for the case where rare events can occur in retrials from any threshold. AEÜ International Journal of Electronics and Communications, Special Issue on Rare Event Simulation, 52(3):183-189, May 1998.
[VA+94] M. Villén-Altamirano et al. Enhancement of the accelerated simulation method RESTART by considering multiple thresholds. In J. Labetoulle and J. W. Roberts, editors, The 14th International Teletraffic Congress (ITC'14), pages 797-810, Antibes Juan-les-Pins, France, June 6-10, 1994. Elsevier.