A Queuing Model for Service Availability of Systems with Rejuvenation
Felix Salfner, International Computer Science Institute, Berkeley, [email protected]
Katinka Wolter, Humboldt-Universität zu Berlin, [email protected]
This work has been supported by Deutscher Akademischer Austauschdienst (DAAD). © 2008 IEEE 978–1–4244–3417–6/08/$25.00

Abstract—In this paper we present a queuing model to investigate the effect of time-based system rejuvenation on service availability. The model is formulated as a stochastic colored Petri net, which allows us to use realistic distributions such as the lognormal distribution. We define a metric for service availability and derive how it can be estimated from the model. Experiments show that the optimal rejuvenation interval as well as the achievable service availability improvement depend significantly on system utilization.

I. INTRODUCTION
Traditionally, system availability has been concerned with failure and repair of computing systems. Steady-state system availability is formally defined as the ratio of the mean time to failure (MTTF) to the total time, i.e., the sum of MTTF and mean time to repair (MTTR). The most popular methods for improving availability include reactive methods such as checkpointing [1] and the use of spare components [2], and proactive methods such as preventive maintenance [3] and in particular software rejuvenation [4]. The optimal choice of parameters in reactive as well as proactive methods is often determined through the analysis of stochastic models. An overview is given in [5] and the references therein.

With the shift in paradigm from a system's view to a services view, system availability has become less relevant. The attention now focuses on the availability of a service rather than the availability of the system hosting it. While in earlier work [6] we investigated improvements of service availability through failure prevention, this paper focuses on rejuvenation. We model a service as a simple queuing system processing jobs/requests using a stochastic colored Petri net (SCPN). More specifically, we model environments with single points of failure and hence our model incorporates failure and repair. In our experiments we assume the failure distribution to be lognormal, which has been shown to be a good approximation of time-to-failure distributions (see, e.g., [7]). The model incorporates time-based rejuvenation, which requires additional down time of the server without losing the jobs in the queue.

Our approach is related to [3] and [8], which also build on a system model and a queuing model to represent jobs in the system but do not focus on service availability. In [9], the authors present a formalism to compute user-perceived service availability including user behavior on the basis of stochastic reward nets (SRN), with the limitation that exponential distributions have to be used to model system behavior such as failures. The model proposed here is relatively simple, which means that the rejuvenation process is modeled at a high level of abstraction. On the other hand, it allows the use of realistic distributions for the occurrence of failures such as the lognormal distribution. Furthermore, our model involves only a few parameters which need to be set in experiments.

It is the goal of this paper to investigate the effects of the frequency of rejuvenation and the level of system utilization on service availability. As we will show, service availability is a completely different measure from steady-state system availability. We also show that system utilization has a significant impact on service availability. Nevertheless, rejuvenation proves to be an effective way to increase service availability.

The paper is organized as follows: Starting from a description of the proposed model in Section II, we derive a formula for computing service availability in Section III. Experiments and results are described in Section IV, while Section V concludes the paper.

II. THE MODEL
We model the service providing system as a finite queue subject to failures. We use stochastic colored Petri nets (SCPN) as modeling technique in order to determine the measures needed to compute service availability. The model is described in this section while the measures are introduced in Section III.

The Petri net model is shown in Figure 1. It consists of two parts: the queuing model on the left and the operational state model on the right. Each token represents one place in the queue. We assume that the time between two job arrivals follows an exponential distribution and hence the rate of job arrivals is characterized by a single parameter arrival time. Once a job arrives, it is enqueued if there are empty places left in the queue (transition enqueue in the model).
If the queue is full, the job is lost. The queue has a fixed finite capacity and each queue place is modeled by a token. Jobs in the queue are processed sequentially (transition serve) and the time needed to complete a job is exponentially distributed, determined by parameter service time. We assume that the system is in one of three states:
• Up. The system is up and running, processing jobs in the queue.
• Down. In case of a failure the system goes down (transition fail). It loses all jobs currently in the queue (transition loose jobs). While it is down, arriving jobs cannot be enqueued and are hence lost until the server is repaired (transition repair).
• Rejuvenating. Rejuvenation means a planned restart of the system even though no failure has occurred. Since it is a planned restart, the jobs currently in the queue are saved and are processed after the system has resumed service.

Fig. 1. Stochastic colored Petri net model for a queuing system with rejuvenation subject to failures.

Several studies (e.g., [7]) have shown that exponential distributions are not well-suited for time-to-failure (TTF) and repair time distributions. In [7], it has been concluded from data of a commercial telecommunication platform that the distribution of time-to-failure can be approximated best by a lognormal distribution. Hence in the experiments described in Section IV, we chose transition times of the fail transition to be lognormally distributed. However, our model is not restricted to lognormal distributions; other well-known distributions such as the Weibull distribution could be used as well.

The repair transition is executed to bring the system back to operation after a failure has occurred. In some cases a simple restart is sufficient; in other cases repair also involves other actions such as reconfiguration or exchange of components. In order to represent this, transition times of repair are uniformly distributed between a lower and an upper limit.

In this paper, we only consider time-based, periodic rejuvenation. Therefore, the rejuvenate transition is deterministic, with its delay determined by parameter rejuvenate time. System restart is modeled by the deterministic transition restart, and its duration is the same as the lower limit of transition repair. Table I summarizes the types of transitions and also lists the global guard expressions by which the two parts of the model in Figure 1 are connected.

TABLE I
TYPE OF DISTRIBUTION AND GLOBAL GUARD EXPRESSIONS FOR TRANSITIONS OF FIGURE 1. # REFERS TO THE NUMBER OF TOKENS IN THE GIVEN PLACE.

Transition    Distribution     Guard expression
fail          Lognormal        –
repair        Uniform          –
rejuvenate    Deterministic    –
restart       Deterministic    –
enqueue       Exponential      #up > 0
serve         Exponential      #up > 0
loose jobs    –                #down > 0

III. COMPUTING SERVICE AVAILABILITY
Several definitions for service availability have been proposed in the literature. In this paper, we focus on a straightforward definition that is applicable to atomic jobs/requests, i.e., there are several (many) distinct jobs and each job is either considered to be successfully completed or lost. Then, service availability is simply the ratio of completed jobs to all jobs in a given time interval Δt:

\[ A_s = \frac{E[\text{no. of completed jobs}]}{E[\text{no. of completed jobs}] + E[\text{no. of lost jobs}]} \tag{1} \]

where E[·] denotes the expected value. The expected values in (1) can be replaced by rates times Δt. Canceling out Δt yields

\[ A_s = \frac{\text{completion rate}}{\text{completion rate} + \text{loss rate}} \tag{2} \]
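As a rough cross-check of (1), the ratio can also be estimated by counting completed and lost jobs in a small event-driven simulation. The following sketch is not the TimeNET model used in the paper. It is a simplified stand-in with several assumptions: the function name and its parameters are hypothetical; the lognormal fail distribution is parameterized by a median of 4320 h and a geometric standard deviation of 1.5 (an assumed reading of the parameter table in Section IV, chosen because it reproduces the expected TTF of 4690.1 h reported there); the fail and rejuvenation clocks both restart whenever the system enters the up state; and a job interrupted by rejuvenation is given a fresh service time on resume, which is equivalent for exponential service times.

```python
import math
import random


def simulate_service_availability(tau, service_time, arrival_time=8.0,
                                  capacity=25, median_ttf=4320.0, geo_sd=1.5,
                                  repair=(0.5, 5.0), restart=0.5,
                                  horizon=1_000_000.0, seed=1):
    """Estimate A_s as completed / (completed + lost); all times in hours.

    tau is the rejuvenation interval; math.inf disables rejuvenation.
    """
    rng = random.Random(seed)

    def exp_sample(mean):
        return rng.expovariate(1.0 / mean)

    def next_outage(now):
        # Assumption: a fresh time to failure is drawn on every entry to "up".
        ttf = rng.lognormvariate(math.log(median_ttf), math.log(geo_sd))
        if ttf < tau:
            return now + ttf, "fail"
        return now + tau, "rejuvenate"

    queue, up = 0, True
    completed = lost = 0
    arrival = exp_sample(arrival_time)   # time of next job arrival
    service = math.inf                   # time of next job completion
    change, kind = next_outage(0.0)      # time and type of next state change

    while True:
        t = min(arrival, service, change)
        if t > horizon:
            break
        if t == arrival:
            if up and queue < capacity:
                queue += 1
                if queue == 1:           # server was idle, start serving
                    service = t + exp_sample(service_time)
            else:
                lost += 1                # queue full or server not up
            arrival = t + exp_sample(arrival_time)
        elif t == service:
            completed += 1
            queue -= 1
            service = t + exp_sample(service_time) if queue else math.inf
        elif kind == "fail":
            lost += queue                # a failure drops all queued jobs
            queue = 0
            up, service = False, math.inf
            change, kind = t + rng.uniform(*repair), "up"
        elif kind == "rejuvenate":
            up, service = False, math.inf    # queued jobs are kept
            change, kind = t + restart, "up"
        else:                            # repair or restart has finished
            up = True
            if queue:
                service = t + exp_sample(service_time)
            change, kind = next_outage(t)

    return completed / (completed + lost)
```

Calling the function once with `tau=math.inf` and once with a finite interval gives a reference value and a rejuvenated value for the same load. Jobs still waiting at the end of the horizon are counted neither as completed nor as lost.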
The task is hence to compute completion rate and loss rate from the Petri net model.

A. Completion Rate
The completion rate is the number of successfully completed jobs per time unit. If the server worked on jobs all the time without interruption, the completion rate would be 1/service time. However, the server can only complete jobs if there are jobs in the queue and the server is up. The probability that jobs are in the queue is denoted as P(serve), while P(up) denotes the probability that the server is up. The completion rate is hence the product of P(serve) and P(up) divided by service time:

\[ \text{completion rate} = \frac{P(\text{serve}) \cdot P(\text{up})}{\text{service time}} \tag{3} \]
Translating this into the Petri net model, the two probabilities are equivalent to the following expressions:

\[ P(\text{serve}) = P(\#\text{in queue} \geq 1) \tag{4} \]
\[ P(\text{up}) = P(\#\text{up} > 0) \tag{5} \]

where, e.g., #in queue denotes the number of tokens in place in queue.

B. Loss Rate
There are three reasons why jobs can be lost in the modeled system:
• An arriving job cannot be enqueued when the queue is full.
• An arriving job cannot be enqueued while the server is down or rejuvenating.
• When the server fails, all jobs that are currently in the queue are lost.
The first two reasons apply to arriving jobs while the third affects jobs that are already in the queue. Therefore the total loss rate is the sum of an enqueuing-related loss rate and a failure-related loss rate.
1) Enqueuing-related loss rate: The number of jobs that cannot be enqueued per time unit depends on the number of jobs that arrive at the system (arrival time) and the probability that either the queue is full, P(full), or the server is not running. The latter applies both to the case where the server is down and to the case where it is restarted due to rejuvenation:

\[ \text{loss rate}_{\text{enqueue}} = \frac{P(\text{full}) + \bigl(1 - P(\text{up})\bigr) - P(\text{full})\,\bigl(1 - P(\text{up})\bigr)}{\text{arrival time}} \tag{6} \]

P(full) relates to the Petri net expression

\[ P(\text{full}) = P(\#\text{queue places} = 0) \tag{7} \]

2) Failure-related loss rate: The number of jobs per time unit that are lost due to the occurrence of failures is determined by the average number of jobs in the queue and the average rate of failure occurrence, which is the inverse of the expected TTF. Hence,

\[ \text{loss rate}_{\text{fail}} = \frac{E[\text{jobs in queue}]}{E[TTF]} \tag{8} \]

The expected number of jobs in the queue is determined by

\[ E[\text{jobs in queue}] = E[\#\text{queue places}] \tag{9} \]

while the expected (effective) TTF depends on the fail distribution as well as the rejuvenation process. A formula for E[TTF] is derived in the next paragraph.

C. Effective Distribution of TTF
The idea of rejuvenation is to eliminate effects of software aging. In our model we assume that rejuvenation puts the system back into the fault-free "up" state. It follows that sampling for the fail transition starts over again, so the effective distribution of TTF, i.e., the distribution of the time to the next failure that occurs although rejuvenation is in place, differs from the distribution of the fail transition. Figure 2 plots the resulting probability density for TTF using a lognormal distribution.

The described behavior is the same as for restart mechanisms applied to reduce completion time, e.g., of web pages. A detailed analysis can be found in [10]. The article provides a formula for the expected value of TTF, repeated here for completeness. Let τ denote the rejuvenation interval, c the time needed to restart the system, f(t) the probability density of the fail transition, and F(t) its cumulative distribution. Then the expected value of TTF is given by

\[ E[TTF] = \frac{M(\tau)}{F(\tau)} + \frac{1 - F(\tau)}{F(\tau)}\,(\tau + c) \tag{10} \]

where M(τ) is the partial first moment of f(t) up to the rejuvenation interval τ:

\[ M(\tau) = \int_0^{\tau} t\, f(t)\, dt \tag{11} \]

Fig. 2. Distribution of TTF with rejuvenation for a rejuvenation interval τ of 5000h and a restart time c of 0.5h. (Axes: time [h]; probability density of TTF.)

D. Service Availability
Substituting (3), (6), and (8) into (2) yields the formula for service availability. It can be simplified by introducing utilization ρ:

\[ \rho = \frac{\text{service time}}{\text{arrival time}} \tag{12} \]

which results in

\[ A_s = \frac{P(\text{serve}) \cdot P(\text{up})}{P(\text{serve}) \cdot P(\text{up}) + \rho\,\bigl[1 - P(\text{up})\,\bigl(1 - P(\text{full})\bigr)\bigr] + \frac{\text{service time}}{E[TTF]}\, E[\text{jobs in queue}]} \tag{13} \]

It can be seen that service availability depends not only on ρ but also on the ratio of service time to E[TTF]. For better comparability and plotting of relative improvements we use logarithmic service unavailability $\log_{10}(U_s)$, where $U_s$ is defined by

\[ U_s = 1 - A_s \tag{14} \]

IV. EXPERIMENTS AND RESULTS
In order to explore the effects of the rejuvenation interval and system load (utilization) we simulated the model shown in Figure 1 using TimeNET [11]. Utilization was investigated by varying service time using a fixed arrival time of eight hours. We used a fixed queue size of 25, which is, of course, an arbitrary value. We have also conducted experiments with different queue sizes, showing that the effects are in principle the same although, not surprisingly, the actual numbers are different.

Parameter values have been chosen to represent a scientific computing environment where on average a computing job arrives every eight hours. Based on our experience with data of a commercial telecommunication system [7] we used a lognormal fail distribution which causes the server to crash approximately twice a year (without rejuvenation). We assumed a restart takes half an hour and a repair takes up to five hours. The complete set of values is provided in Table II.
A. Service Unavailability
Simulating all 200 combinations of values in Table II and computing logarithmic service unavailability for each run yields the plot shown in Figure 3. For each combination we have simulated more than 30 years of system lifetime.
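The per-run computation can be sketched as follows. The code evaluates service availability once from the individual rates in (2), (3), (6), and (8) and once from the simplified form (13), then applies (14) to obtain logarithmic unavailability. The function names are hypothetical, and the numeric inputs in the usage note are made-up placeholders, not measurements from the paper.

```python
import math


def availability_from_rates(p_serve, p_up, p_full, jobs_in_queue,
                            service_time, arrival_time, expected_ttf):
    """Equation (2), with the rates taken from (3), (6) and (8)."""
    completion_rate = p_serve * p_up / service_time
    p_not_up = 1.0 - p_up
    loss_rate_enqueue = (p_full + p_not_up - p_full * p_not_up) / arrival_time
    loss_rate_fail = jobs_in_queue / expected_ttf
    loss_rate = loss_rate_enqueue + loss_rate_fail
    return completion_rate / (completion_rate + loss_rate)


def availability_simplified(p_serve, p_up, p_full, jobs_in_queue,
                            service_time, arrival_time, expected_ttf):
    """Equation (13), expressed with utilization rho from (12)."""
    rho = service_time / arrival_time
    served = p_serve * p_up
    lost_on_arrival = rho * (1.0 - p_up * (1.0 - p_full))
    lost_on_failure = service_time / expected_ttf * jobs_in_queue
    return served / (served + lost_on_arrival + lost_on_failure)


def log_unavailability(a_s):
    """log10 of U_s = 1 - A_s, see (14)."""
    return math.log10(1.0 - a_s)
```

For example, `availability_simplified(0.125, 0.9994, 0.0, 0.14, 1.0, 8.0, 4690.1)` uses placeholder values for the probabilities and the queue length. Both functions return the same value up to floating-point rounding, because (13) is (2) with numerator and denominator multiplied by service time.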
TABLE II
MODEL PARAMETER VALUES USED IN THE EXPERIMENTS. TIMES ARE IN HOURS.

Parameter                   List of values (in hours)
rejuvenation interval τ     Values between 10 and 8640
service time                1, 2, 3, 4, 5, 6, 6.5, 7, 7.5, 8
arrival time                8.0
µ of lognorm f(t)           4320
σ of lognorm f(t)           1.5
repair time distribution    uniform(0.5, 5.0)
restart time c              0.5
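With the Table II values, the expected TTF under rejuvenation from (10) and (11) can be evaluated in closed form for a lognormal fail distribution. The sketch assumes that µ = 4320 is the median of the distribution and σ = 1.5 its geometric standard deviation, i.e., that the logarithm of TTF is normal with mean ln 4320 and standard deviation ln 1.5. This reading is an assumption; it is used here because it reproduces the expected TTF without rejuvenation of 4690.1 h reported in this section. The function names are hypothetical. The last function computes the usual steady-state unavailability MTTR / (MTTF + MTTR) with the mean of the uniform repair time as MTTR.

```python
import math

MU_LOG = math.log(4320.0)   # assumption: Table II mu is the median TTF [h]
SIGMA_LOG = math.log(1.5)   # assumption: Table II sigma is a geometric std. dev.


def std_normal_cdf(z):
    # erfc keeps precision in the far lower tail (very short intervals)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def mean_ttf_without_rejuvenation():
    """Mean of the lognormal fail distribution."""
    return math.exp(MU_LOG + 0.5 * SIGMA_LOG ** 2)


def fail_cdf(t):
    """F(t) of the lognormal fail transition."""
    return std_normal_cdf((math.log(t) - MU_LOG) / SIGMA_LOG)


def partial_moment(tau):
    """M(tau) from (11), closed form of the lognormal partial expectation."""
    z = (math.log(tau) - MU_LOG - SIGMA_LOG ** 2) / SIGMA_LOG
    return mean_ttf_without_rejuvenation() * std_normal_cdf(z)


def expected_ttf(tau, c=0.5):
    """E[TTF] from (10) for rejuvenation interval tau and restart time c."""
    f_tau = fail_cdf(tau)
    return partial_moment(tau) / f_tau + (1.0 - f_tau) / f_tau * (tau + c)


def steady_state_unavailability(repair_low=0.5, repair_high=5.0):
    """MTTR / (MTTF + MTTR) with a uniformly distributed repair time."""
    mttf = mean_ttf_without_rejuvenation()
    mttr = 0.5 * (repair_low + repair_high)
    return mttr / (mttf + mttr)
```

For very long intervals `expected_ttf` approaches the mean of the fail distribution, since F(τ) tends to 1 and M(τ) tends to the full mean. Under the stated assumption, `steady_state_unavailability()` agrees with the steady-state system unavailability of 5.9 · 10^-4 quoted in Section IV-A after rounding.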
In order to have a reference, we also simulated a model without rejuvenation, i.e., lacking the place rejuvenating and the transitions rejuvenate and restart. The resulting service unavailabilities are plotted in Figure 5. As can be seen from Figures 3 and 5, service (un)availability depends on utilization (i.e., service time in the experiments). In the case of a system without rejuvenation, service unavailability varies between $8.4 \cdot 10^{-4}$ and $5.0 \cdot 10^{-2}$, while in contrast steady-state system unavailability (computed from the fail distribution and repair times) is $5.9 \cdot 10^{-4}$. This demonstrates that service availability is a completely different measure from system availability. In comparison, the minimum service unavailability achieved with rejuvenation is $2.9 \cdot 10^{-4}$.

B. Rejuvenation Interval
In order to visualize the effects of the rejuvenation interval, we plot the ratio of service unavailability with rejuvenation to the corresponding service unavailability of the reference model as a function of the rejuvenation interval τ (see Figure 4) for the case of low utilization and for high utilization.

1) Lower limit: If the rejuvenation interval is chosen too small, service availability can get much worse than for a system without rejuvenation. For very long rejuvenation intervals it approaches the value of systems without rejuvenation.
However, the minimum rejuvenation interval above which rejuvenation helps to improve service availability depends on utilization (load): In case of a utilization of 12.5% (Figure 4a), the minimum rejuvenation interval is about 651h (obtained from linear interpolation between rejuvenation intervals of 336h and 720h). In case of 100% utilization (Figure 4b), the value goes down to about 54h (interpolation between 24h and 72h). It follows that the rejuvenation interval should be chosen on the longer side, so that rejuvenation does not decrease service availability in cases of low utilization.

2) Optimal interval: All plots show the existence of an optimum value for the rejuvenation interval. However, the optimum also depends on system utilization. For low utilization it is around 2160h, while for high utilization it goes down to 1200h. Note that the fail distribution is the same for all experiments and the expected TTF without rejuvenation is 4690.1h. We do not provide a formula for the optimal rejuvenation interval here, but our experiments suggest that formulas for optimal rejuvenation intervals should take system load into account, as is done, e.g., in [3]. Our experiments also suggest that settings with low utilization should be used to determine the optimal rejuvenation interval, since the optimum is more clearly observable and the optimal rejuvenation interval for low utilization still achieves reasonably good results in cases of high system utilization.

C. Service Availability Improvement
In order to highlight service availability improvement, Figure 6 shows optimal service unavailability improvement as a function of utilization. The graph plots the formula

\[ \frac{U_s^{\text{ref}}(\rho) - U_s(\rho)}{U_s^{\text{ref}}(\rho)} \cdot 100\% \tag{15} \]
where $U_s^{\text{ref}}(\rho)$ denotes service unavailability of the reference model for a given utilization ρ. It can be observed that, again, the optimum achievable service unavailability improvement depends on system utilization. In our experiments, service unavailability can be reduced by up to 89.4% for a utilization of 75% (service time = 6).

Fig. 3. Logarithmic service unavailability for all combinations of parameter values. (Axes: rejuvenation interval [h]; service time [h]; logarithmic service unavailability.)

Fig. 4. Service unavailability ratio for a low utilization of 12.5% and high utilization of 100%. (Panels: (a) low utilization, (b) high utilization. Axes: rejuvenation interval [h]; service unavailability ratio.)

Fig. 5. Logarithmic service unavailability for a system without rejuvenation. (Axes: service time [h]; logarithmic service unavailability.)

Fig. 6. Relative optimal reduction of service unavailability depending on utilization (service time). (Axes: service time [h]; optimal service unavailability improvement [%].)

V. CONCLUSIONS AND FUTURE WORK
We have investigated service availability by modeling a system with a finite queue subject to failures performing rejuvenation. Such a scenario typically appears in low-budget scientific computing where a failure causes the entire system to crash due to a lack of fault tolerance. One example of such a system is a blade server without redundancy in power supply, network, or memory. We assessed the effect of time-based rejuvenation, which corresponds to a periodic restart of the system.

In order to assess service availability, we proposed a stochastic colored Petri net (SCPN) and derived a formula for computing service availability from the model. This involved computation of the expected time-to-failure, which can be computed using theory developed in the context of restart of Internet requests. In order to investigate the effect of the rejuvenation interval and system utilization on service availability, we simulated 200 combinations of model parameters.

From our experiments we conclude that system utilization has a significant impact on rejuvenation. This applies to the optimum length of the rejuvenation interval, to the minimum rejuvenation interval below which rejuvenation makes service availability worse, as well as to overall service availability improvement. However, regardless of the dependence on utilization, rejuvenation is an effective technique to reduce service unavailability by 30% to almost 90%.

Future work will focus on an extension to multiple servers, more elaborate rejuvenation techniques such as workload-based rejuvenation, and inclusion of software aging.

REFERENCES
[1] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, “A Survey of Rollback-Recovery Protocols in Message-Passing Systems,” ACM Computing Surveys, vol. 34, no. 3, pp. 375–408, 2002.
[2] H. Sun, J. Han, and H. Levendel, “A Generic Availability Model for Clustered Computing Systems,” in Proc. Pacific Rim Symp. on Dependable Computing, 2001, pp. 241–248.
[3] S. Garg, A. Puliafito, M. Telek, and K. Trivedi, “Analysis of preventive maintenance in transactions based software systems,” IEEE Trans. Comput., vol. 47, no. 1, pp. 96–107, 1998.
[4] Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, “Software Rejuvenation: Analysis, Module and Applications,” in Proc. 25th Symposium on Fault Tolerant Computing. Pasadena, CA: IEEE, June 1995, pp. 381–390.
[5] K. Trivedi, G. Ciardo, B. Dasarathy, M. Grottke, A. Rindos, and B. Varshaw, “Achieving and Assuring High Availability,” in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, 2008, pp. 1–7.
[6] F. Salfner and K. Wolter, “Service availability of systems with failure prevention,” in To appear in: IEEE Proceedings of International Workshop on Dependable and Secure Services Computing (DSSC), Jiaosi, Yilan, Taiwan, 2008.
[7] F. Salfner, Event-based Failure Prediction: An Extended Hidden Markov Model Approach. Berlin, Germany: dissertation.de - Verlag im Internet GmbH, 2008, available at http://www.rok.informatik.hu-berlin.de/Members/salfner/publications/salfner08event-based.pdf.
[8] K. S. Trivedi, K. Vaidyanathan, and K. Goseva-Popstojanova, “Modeling and analysis of software aging and rejuvenation,” in Proceedings of the IEEE Annual Simulation Symposium, Apr. 2000.
[9] D. Wang and K. Trivedi, “Modeling user-perceived service availability,” in Service Availability, ser. Lecture Notes in Computer Science (LNCS). Springer, 2005, vol. 3694, p. 107.
[10] A. van Moorsel and K. Wolter, “Analysis of restart mechanisms in software systems,” IEEE Transactions on Software Engineering, vol. 32, no. 8, Aug. 2006.
[11] A. Zimmermann, J. Freiheit, R. German, and G. Hommel, Petri net modelling and performability evaluation with TimeNET 3.0, ser. LNCS. Springer, 2000, vol. 1786, pp. 188–202.