Performability Models for Multi-Server Systems with High-Variance Repair Durations

Hans-Peter Schwefel, Center for Teleinfrastruktur, Aalborg University, Email: [email protected]
Imad Antonios, Dept. of Computer Science, Southern Connecticut State University, Email: [email protected]

Abstract
We consider cluster systems with multiple nodes where each server is prone to run tasks at a degraded level of service due to some software or hardware fault. The cluster serves tasks generated by remote clients, which are potentially queued at a dispatcher. We present an analytic queueing model of such systems, represented as an M/MMPP/1 queue, and derive and analyze exact numerical solutions for the mean and tail probabilities of the queue-length distribution. The analysis shows that the distribution of the repair time is critical for these performability metrics. Additionally, in the case of high-variance repair times, the model reveals so-called blow-up points, at which the performance characteristics change dramatically. Since this blow-up behavior is sensitive to changes in model parameters, it is critical for system designers to be aware of the conditions under which it occurs. Finally, we present simulation results that demonstrate the robustness of this qualitative blow-up behavior to several model variations.

1. Introduction and Motivation
Performability modeling seeks to capture the behavior of systems that exhibit degradable performance. In recent years, as both the size and the pervasiveness of distributed systems have increased in support of mission-critical and high-performance applications, the need to assess the performance of such systems has become more important. In this paper, we consider a model of a cluster with a small number N of nodes, where each node is prone to degradation in service, which in the limit also includes crash failures. We assume, and argue, that servers recover from their degraded state after a period of time that shows high variance in most practical scenarios. For such high-variance distributions, however, performability metrics of such cluster systems show very peculiar behavior that has previously only been observed in recent telecommunication models.

Within the body of research on performability, Mitrani [12] surveyed several queueing models in which tasks are fed into unreliable servers, and [8] studied the completion time of tasks in a fault-prone queueing environment for various failure-handling strategies. Solutions to such models are presented under the assumption that task service times, up times, and breakdown durations are exponentially distributed. More recently, a study provided evidence that hyperexponential distributions provide a better fit for the repair times associated with crash failures [13]. Under certain parameter settings, this repair behavior can lead to power-tail distributed durations. Since in practice repair times are bounded, a truncated power-tail (TPT) distribution, introduced in [6], is used. Put in a queueing context, long breakdown periods that are symptoms of high variance inevitably lead to long queue backups, making the study of systems with such behavior worthwhile. In developing a queueing model for the system laid out above, it is easy to recognize that each server can be represented as an ON/OFF model. Our model extends this to an aggregation of servers fed by a single queue with Poisson arrivals. The system can thus be represented as an M/MAP/1 queueing system, for which analytical solutions are provided in [9]. The model construction makes the assumption of load independence, that is, the task processing rate is independent of the number of tasks in the system. We show through simulation that this approximation of the physical system has little bearing on our analysis results. We highlight the symmetry between the multi-server model and a MAP/M/1 traffic model termed N-Burst, and the applicability of results from the latter to understanding the performance characteristics of the degradable system. With our analytic results as a baseline, we use simulation to explore failure-handling strategies for systems that allow for node crash failures. Additionally, we consider such systems with nonexponential task service times and look at how these affect queueing performance.

The contributions of this paper are as follows: 1) the development of an analytic queueing model with variations amenable to exact solutions of queue-length distributions, 2) a characterization of blow-up points denoting a change in the qualitative behavior of the mean queue-length at specific parameter settings, and 3) simulation results showing that the qualitative behavior is robust towards small model variations including failure-handling strategies.

2. System Model
The type of system modeled in this paper is a cluster of N servers fed by a FIFO queue at which tasks arrive according to some process with average rate λ. A fail-safe dispatcher assigns a task to the first available server where, given the server is in its fully operational state, it executes for an exponential service time with mean 1/νp. The exponential task service time makes it possible to model the N-server cluster system by a single server with an MMPP for the service times; however, we will show via simulations in Section 4 that the qualitative performance results in most scenarios are insensitive to this assumption. A server alternates between two states: the UP state, denoting an operational server at full capacity, with a mean duration corresponding to the MTTF, and the DOWN state, representing a degraded level of service, lasting for a mean duration corresponding to the MTTR. The level of service degradation is captured by a fixed degradation factor δ, where δ = 0 represents a crash failure, and 0 < δ < 1 may be viewed as a non-catastrophic fault at the server which slows down the execution of a task, such as when some erratic process consumes a large amount of CPU time. We assume that faults causing the service degradation are independent of the currently processed tasks, of faults on other servers, and of faults in subsequent UP-DOWN cycles at the same node. We also assume that the dispatcher has instantaneous and always correct information about the nodes in case of crash faults (ideal failure detection). Consistent with convention, we express the server's availability as

A := MTTF / (MTTR + MTTF).    (1)

This definition of A does not depend on the fault type, meaning that it is independent of δ. With respect to the recovery behavior of tasks interrupted by crash failures, a case that only occurs for δ = 0, we consider three strategies:

• Discard: The interrupted task is removed from the cluster. Such an approach can be applicable in soft real-time systems, where the utility of the result of a computation decreases with time.

• Restart: The identical task is restarted, either at the original node after it is repaired or at a different node. This is different from the case considered in [14], where the task completion time can benefit from restarting it. The restart strategy adopted here would require that the dispatcher maintain the necessary task activation information until the task is completed. A restarted task can be handled using two approaches: (1) add it to the head of the queue, or (2) add it to the end of the queue.

• Resume: The server nodes apply ideal checkpointing to the task execution, with the consequence that the dispatcher can ask another node to resume the execution of the task at the point where it stopped. Compared to Restart, this has the advantage that the remaining processing time is the task's residual time, which for exponential task times is also exponential with the same mean. However, the disadvantage of Resume is that checkpointing is rather costly and may only be applicable in limited cases. As with Restart, a resumed task may be placed at either the front or the tail of the queue.

Note that with respect to the influence on the queue-length process, Discard is the best strategy, Resume second, and Restart worst; the price for Resume is the increased cost of checkpointing, and for Discard, that some tasks are not successfully completed (even when there is no QoS/delay bound). For the queue-length process in the case of exponential task times, it is irrelevant whether the resumed task is stored at the head of the queue or at the tail, due to the properties of residual times of exponential distributions. However, there is an impact on the system-time distribution, which is defined as the sum of queueing delay and total service time, including potentially multiple restarts. The impact on the system-time distribution is even more pronounced for the two restart cases, which are not equivalent even for the queue-length based metrics.

In summary, the basic system assumptions are:

• A cluster consisting of a fixed number N of statistically identical nodes.

• Independent failures and repairs of each node; failures lead either to performance degradation (slowdown by a factor 0 < δ < 1) or to a complete crash (δ = 0).

• A dispatcher maintains the queue of tasks to be executed in a transactional manner on one of the cluster nodes. The dispatcher never fails. In case of crash failures (not the main focus of this paper), the dispatcher has instantaneous and always correct fault detection.

• Tasks are generated by clients according to some general process with rate λ.

• The task service time can in principle also be general, see Section 4, but the analytic model is based on exponential task times.

• The time to failure (TTF) and time to repair (TTR) can be generally distributed.

For ease of description of the analytic model, we will however include more limiting assumptions that can easily be removed and that are shown in Section 4 to be not relevant performance-wise. Most of these assumptions can be circumvented by utilizing matrix-exponential distributions [10] or Markovian Arrival Processes (MAPs) [15] at the cost of an increased model state space, typically also with an impact on the accuracy of the numerical evaluation of the analytic model. Furthermore, complex distribution types increase the parameter space of the model, which makes the numerical analysis more cumbersome. Finally, in order to gain an understanding of the causes of certain performance behavior, it is in most cases advisable to use the most parsimonious model by which such behavior can be created.

2.1. Distribution of UP and DOWN Periods of the Servers
The model presented in the next section allows for general, matrix-exponential [10] UP and DOWN times. We use the latter interchangeably with repair time to refer to a server being in either a degraded or a crashed state. Using analogies from teletraffic models, see Section 2.3 and [21], we will show later that the actual distribution of the UP times influences queue performance only marginally beyond its mean, and as such the analytic results in Section 3 will be presented using exponential UP times for convenience. Since the goal of analytic modeling is to focus the model on the aspects that are influential for the considered metrics, here the mean queue length and the probability that a task with a QoS/delay bound finishes successfully, the exponential assumption for UP times is not only a modeling convenience but also helps to isolate the performability-relevant aspects of the cluster system. Regarding the repair/DOWN time, the results in Section 3 will show a dramatic impact, both on the average queue length and on the tail probabilities of the queue-length distribution. Since the latter is closely related to the probability of exceeding a certain system time, it can be used to approximate the fraction of successfully completed tasks under certain delay constraints. The DOWN period corresponds to the fault detection time and repair time of an individual server.

Depending on the type of fault, a repair time can range from a few seconds for the restart of a small process, to the order of minutes for a system reboot, hours for hardware faults with spare parts in stock, up to days or weeks for the replacement of the faulty machine or hardware component. Assuming that each of these fault types leads to exponential repair times, but with different rates, the repair-time distribution can be represented as a hyperexponential distribution with increasing average holding times for the different fault severities (in terms of repair effort). For certain parameter settings, namely geometric decay of the entrance probabilities combined with geometric growth of the mean state holding times, these hyperexponentials can exhibit power-tail behavior until the reliability function finally drops off exponentially (see [6] for details). This exponential drop-off denotes a truncation in the tail, and it corresponds to the longest repair time. Regardless of whether a hyperexponential distribution with high variance or, as a special case, a truncated power-tail distribution is assumed for the repair times, the performability metrics of the cluster system show very peculiar behavior. Section 3 analyzes and explains this behavior, first for the example of truncated power-tail distributions (in resemblance to the teletraffic models of [17, 19]) and then for 2-state hyperexponential distributions with large variance.
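
For concreteness, the following sketch illustrates this geometric construction. It follows one common parameterization of the truncated power-tail family (entrance probabilities decaying like θ^i and phase means growing by γ = θ^(−1/α)); the exact normalization in [6] may differ slightly, and the example values T = 9, α = 1.4, θ = 0.2 and mean 10 simply mirror the settings used in the figures later in the paper.

```python
import numpy as np

def tpt_hyperexp(T, alpha, theta, mean=1.0):
    """T-phase hyperexponential with geometrically decaying entrance
    probabilities (~ theta**i) and geometrically growing phase means
    (~ gamma**i, gamma = theta**(-1/alpha)), rescaled to the given mean.
    This yields power-tail behavior with index alpha up to a truncation
    determined by the largest phase mean."""
    gamma = theta ** (-1.0 / alpha)
    p = (1 - theta) / (1 - theta ** T) * theta ** np.arange(T)  # entrance probabilities
    m = gamma ** np.arange(T)                                   # unscaled phase means
    m *= mean / (p @ m)                                         # normalize the overall mean
    return p, m

def sample_repair_times(p, m, size, rng=None):
    """Draw repair-time samples: choose a phase, then draw an exponential."""
    rng = np.random.default_rng() if rng is None else rng
    phase = rng.choice(len(p), size=size, p=p)
    return rng.exponential(m[phase])

# Example close to the settings of Section 3: T = 9, alpha = 1.4, theta = 0.2, MTTR = 10
p, m = tpt_hyperexp(T=9, alpha=1.4, theta=0.2, mean=10.0)
samples = sample_repair_times(p, m, size=100_000)
```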

2.2. Matrix Representation of the Server Model
With exponential task times, the proposed multi-node cluster system is equivalent to a single-server system with modulated service rate; namely, the number of servers that are UP and the number of tasks in the system influence the instantaneous service rate as follows:

ν(t) = νp · SourcesUP(t) + δ νp (N − SourcesUP(t)),    (2)

given that the number of tasks in the system is larger than or equal to N. Otherwise, the number of tasks in the system has a limiting impact, since not all servers can be utilized. In order to simplify the model specification, we make the following assumptions, see Sect. 2.4 for more discussion:

• We do not consider the limiting influence that occurs when the number of tasks in the system is smaller than N, i.e., Eq. (2) is always assumed to be exactly true. As such, the analytic model yields a lower bound on the performance behavior, but Section 4 demonstrates that for the scenarios of interest this lower bound is very close to the actual exact result. The analytic model below can be extended to include this load dependence using the same approach as in [7, 20]. The model extension however makes the numerical analysis computationally more expensive and also numerically much less stable.

• We assume that task arrivals are Poisson (although other distributions can easily be included, see Sect. 2.4).

• Task execution times are exponential, as stated at the beginning of this section.

• The number of servers, N, is a low integer, e.g. between 2 and 10 or 20. There are three main reasons for this assumption: (1) computational effort and numerical accuracy: the size of the state space of the model as introduced below grows exponentially with N; (2) high-availability clusters with redundantly stored processing state typically consist of only a few, closely coupled (within the same IP subnet) nodes; (3) the performance and dependability impact that we identify in this paper is particularly pronounced for settings with low N.

• The dispatcher queue is infinite.

Utilizing the first assumption/approximation together with the exponential task times, the collective N nodes in the cluster system can be represented as a single-server Markov Modulated Poisson Process (MMPP). We assume that all servers are independent and identical, with matrix-exponential DOWN (repair) and UP periods represented by the vector-matrix pairs <pdown, Bdown> and <pup, Bup>, respectively, following the notation of [10]. Consequently, the modulating Markov process for the service rate of a single server has the generator matrix

Q1 = [ −Bdown               Bdown εdown pup ]
     [ Bup εup pdown        −Bup            ],

and the corresponding Poisson service rates on the diagonal of the matrix

L1 = diag( δ νp Idown , νp Iup ).

Note that the individual blocks may have different dimensions. I is the identity matrix, and ε is a column vector with all elements equal to 1, both of the corresponding dimension as indicated by the subscript. For multiple independent servers, the service process can be expressed by multiple Kronecker sums of the matrices Q1 and L1,

QN = Q1^(⊕N),    LN = L1^(⊕N),

but more efficient representations can be used, since some of the states are redundant if the servers do not need to be distinguished.
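
As an illustration of this construction, the sketch below builds Q1 and L1 and forms the N-fold Kronecker sums numerically, without the reduced state-space representation mentioned above. The specific HYP-2 DOWN-time parameters (equal entrance probabilities, phase means 2 and 18, i.e. mean repair time 10) are assumptions chosen only for illustration; the paper does not prescribe them.

```python
import numpy as np

def kron_sum(A, B):
    """Kronecker sum A (+) B = A (x) I + I (x) B."""
    return np.kron(A, np.eye(B.shape[0])) + np.kron(np.eye(A.shape[0]), B)

def single_server(p_down, B_down, p_up, B_up, nu_p, delta):
    """Generator Q1 of one server's modulating chain (DOWN block first, then UP)
    and the diagonal matrix L1 of its instantaneous service rates."""
    e_down = np.ones((B_down.shape[0], 1))
    e_up = np.ones((B_up.shape[0], 1))
    Q1 = np.block([
        [-B_down,              B_down @ e_down @ p_up],
        [B_up @ e_up @ p_down, -B_up                 ],
    ])
    L1 = np.diag(np.concatenate([delta * nu_p * np.ones(B_down.shape[0]),
                                 nu_p * np.ones(B_up.shape[0])]))
    return Q1, L1

def aggregate(Q1, L1, N):
    """N-fold Kronecker sums QN, LN for N i.i.d. servers (redundant states
    of indistinguishable servers are not collapsed here)."""
    QN, LN = Q1, L1
    for _ in range(N - 1):
        QN, LN = kron_sum(QN, Q1), kron_sum(LN, L1)
    return QN, LN

# Illustration: exponential UP with mean 90 and an assumed HYP-2 DOWN time with mean 10.
p_up, B_up = np.array([[1.0]]), np.array([[1.0 / 90.0]])
p_down, B_down = np.array([[0.5, 0.5]]), np.diag([1.0 / 2.0, 1.0 / 18.0])
Q1, L1 = single_server(p_down, B_down, p_up, B_up, nu_p=2.0, delta=0.2)
QN, LN = aggregate(Q1, L1, N=2)
```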

The cluster model can therefore be represented as an M/MMPP/1 queueing system with a standard Quasi-Birth-Death (QBD) representation, for which a matrix-geometric solution can be numerically obtained, see [15, 9]. Note that in principle QBDs can also be described with other methods, such as infinite stochastic Petri nets [16], but since we have mapped the cluster system to a simple M/MMPP/1 queueing system, in which the MMPP has a structured form when using matrix-exponential distributions, the use of high-level description tools is not necessary. Alternatively, the heavy-tailed repair periods can be modeled as occasional heavy-tailed services, where a repair and the consecutive re-service are viewed as one long service; in that case the model would lend itself to an M/G/1 or M/G/c type analysis, see [2, 10]. The matrix-geometric solution makes it possible to compute explicit formulas for the mean queue length and for the tail probabilities of the queue-length distribution. The tail probability of the queue-length distribution, Pr(Q > k), is equivalent to the probability that the queue length seen by an arriving customer exceeds k, which for large k is closely related to the system time; i.e., the following approximation links queue-length tail probabilities to the system time S: Pr(S > d) ≈ Pr(Q > d ν̄), where ν̄ = N νp (A + δ(1 − A)) is the average service rate. In the case that tasks have to meet some delay requirement d, this approximation allows us to determine the probability of violating the requirement.
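
A minimal sketch of such a matrix-geometric computation is given below, building on the QN and LN matrices of the previous sketch. It uses the simple, slowly converging fixed-point iteration for the rate matrix R; production code would use logarithmic reduction or a library along the lines of [15, 9]. The boundary treatment exploits that the level-0 block of this queue equals A1 + A2, so the queue-length distribution is matrix-geometric from level 0 onwards. This is a sketch of the standard recipe under those assumptions, not the authors' implementation.

```python
import numpy as np

def qbd_mmpp_queue(QN, LN, lam):
    """Matrix-geometric solution sketch for the M/MMPP/1 queue of Section 2.2:
    level = number of tasks, phase = state of the modulating chain.
    Blocks: A0 = lam*I (arrival), A2 = LN (service completion),
    A1 = QN - A0 - A2 (local transitions); stationary vectors pi_k = pi_0 R^k."""
    n = QN.shape[0]
    A0 = lam * np.eye(n)
    A2 = LN
    A1 = QN - A0 - A2
    # naive fixed-point iteration for R:  A0 + R A1 + R^2 A2 = 0
    R = np.zeros((n, n))
    A1_inv = np.linalg.inv(A1)
    for _ in range(200_000):
        R_new = -(A0 + R @ R @ A2) @ A1_inv
        if np.abs(R_new - R).max() < 1e-12:
            R = R_new
            break
        R = R_new
    # boundary: pi_0 (A1 + A2 + R A2) = 0, normalized by pi_0 (I - R)^{-1} 1 = 1
    I, ones = np.eye(n), np.ones(n)
    lhs = np.vstack([(A1 + A2 + R @ A2).T, np.linalg.solve(I - R, ones)])
    rhs = np.zeros(n + 1); rhs[-1] = 1.0
    pi0 = np.linalg.lstsq(lhs, rhs, rcond=None)[0]
    inv = np.linalg.inv(I - R)
    mean_q = pi0 @ R @ inv @ inv @ ones                                   # E[Q]
    tail = lambda k: pi0 @ np.linalg.matrix_power(R, k + 1) @ inv @ ones  # Pr(Q > k)
    return mean_q, tail

# Example usage with the matrices from the previous sketch:
# mean_q, tail = qbd_mmpp_queue(QN, LN, lam=1.8); print(mean_q, tail(500))
```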

2.3. Resemblance to Bursty Teletraffic Models
The MMPP model developed in Section 2.2 as a single-server approximation of the cluster service process very closely resembles a class of models well known in communication network performance analysis, although there the MMPP models have been used for arrival processes. Packet-based network traffic in many cases shows burstiness, i.e., fluctuating arrival rates, which can be modeled by MMPPs, see [11]. In particular, ON/OFF behavior of traffic sources has a long history, and more recently matrix-exponential distributions have also been utilized [17, 19] to reflect burstiness on multiple time scales, or even self-similar or long-range dependent traffic models, see e.g. [22, 3]. When using ON/OFF traffic sources, the aggregated arrival rate is modulated by the number of sources in an ON period, which closely resembles the situation for the service times in the cluster model of this paper. In fact, the MMPP model in Sect. 2.2 is equivalent to the aggregated traffic model first proposed in [17].

Those traffic models represent a set of N statistically identical and independent sources that intermittently emit data. Each of the sources is an ON/OFF model with a peak rate of λp (during an ON period) and a mean rate of κ, leading to the aggregate arrival rate λ = N · κ. The burstiness of the traffic is expressed by the burst parameter b, which is the fraction of the time that a source is OFF. The following table illustrates the resemblance of the two models by comparing the parameters:

  Cluster Model                                    Telco Model
  M/MMPP/1 queue                                   MMPP/M/1 queue
  number of servers N                              number of sources N
  service rate during UP: νp                       arrival rate during ON: λp
  availability A = MTTF/(MTTF + MTTR)              burst parameter b = OFF/(ON + OFF)
  avg. service rate ν̄ = N νp A                     avg. arrival rate λ = N λp (1 − b)

Note that the average service rate for the cluster model is given in the table for the case of δ = 0 (crash faults). A somewhat similar notion to the degraded service rate δνp of the cluster model exists in traffic models when a background Poisson process is assumed for the aggregation of other, non-bursty traffic. As the queueing analysis in Section 3 will show, the performance behavior of the cluster model also has many similarities to the performance behavior observed in network performance models. Also, the mechanisms that lead to poor performance in the case of high-variance DOWN times (corresponding to high-variance ON periods in the traffic models) are very similar.

2.4. Variations of the Analytic Model
Most of the assumptions of the analytic model can easily be removed within the matrix-analytic framework, leading however to more complex matrices and possibly more complexity in the queueing analysis. We do not implement these model modifications here, but in order to show the power of the modeling approach, we highlight how the extensions can be incorporated:

• Nonexponential task arrival processes: Any finite-dimensional matrix-exponential renewal process, or even any MAP, can be included in the analytic model. The state space of the arrival process then has to be included in the overall state space of the queue-length process (a sketch of the corresponding QBD blocks is given after this list).

• Finite task queue at the dispatcher: The finite QBD representing an ME/MMPP/1/K queue also has a matrix-geometric solution. For large buffer sizes, however, qualitative results are expected to be unchanged, see [18] for arguments in a comparable setting.

• Nonexponential TTF: The model in Section 2.2 already includes matrix-exponential TTF. However, the analysis results in the subsequent section will be based on exponential TTFs, since earlier results from the corresponding teletraffic models indicate that high-variance distributions are most significant performance-wise in what here corresponds to the TTR, see [21].

• Including queue-size dependence when fewer than N tasks are present: This modification of the queue-length dependence would require a modification of the service events in the first N block-rows of the QBD matrix representation, see [20] for an example.

• Hyperexponential task times: By some extension of the state space, namely by keeping track of the selected phase for the (residual) times of the tasks being processed at the nodes, nonexponential task times can also be modeled. Furthermore, in the scenario of crash faults, discard strategies for the task under execution at the failing server can also be represented by using a MAP for the service process: transitions corresponding to failures of a node then lead to a reduction of the queue size by one (one specific instance of a 'service' event, although an unsuccessful one).
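
For the first variation, the extension amounts to replacing the Poisson blocks of the QBD by Kronecker products with the MAP matrices (D0, D1). A minimal sketch of the resulting blocks, as an illustration of the standard construction rather than code from the paper:

```python
import numpy as np

def map_mmpp1_blocks(D0, D1, QN, LN):
    """QBD blocks for a MAP/MMPP/1 variant of the cluster model.
    Joint phase = (arrival phase, service phase); level = queue length."""
    Ia, Is = np.eye(D0.shape[0]), np.eye(QN.shape[0])
    A0 = np.kron(D1, Is)                          # arrival: level up, arrival phase may change
    A2 = np.kron(Ia, LN)                          # service completion: level down
    A1 = np.kron(D0, Is) + np.kron(Ia, QN - LN)   # transitions within a level
    return A0, A1, A2

# Poisson arrivals are recovered as the special case D0 = [[-lam]], D1 = [[lam]].
```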

3. Discussion of Analytic Results
We first present and discuss queue performance results from the analytic model, more specifically the behavior of the mean queue length and of the tail probabilities of the queue-length distribution. All participating processes (task arrivals, task service times, UP times) except for the repair time are thereby assumed to be exponential. First, we look at the case of truncated power-tail distributions for the repair time, since these closely resemble recently used teletraffic models.

3.1. Task Queue Behavior for TPT Repair Times
We illustrate the behavior of the mean size of the task queue (also counting the tasks in service) using a cluster with N = 2 nodes and a degradation factor δ = 0.2, while varying the task arrival rate λ and thereby the utilization ρ := λ/ν̄. Figure 1 shows the resulting normalized mean queue length; the normalization is performed with respect to an M/M/1 queue at the same utilization, mainly in order to avoid the vertical asymptote of the mean queue length for ρ → 1. For an exponential repair time (solid line at the bottom), the normalized mean queue length shows no surprising behavior: it grows monotonically and steadily with ρ.

Figure 1. Normalized mean queue-length for a 2-node cluster under varying task-arrival rate (panel: M/2-Burst/1 queue, ON=90, OFF=10, νp=2.0, δ=0.2, α=1.4, θ=0.20; curves for T = 1, 5, 9, 10): For TPT distributions with larger range (T = 9, 10), the mean queue-length shows peculiar blow-up behavior at the points marked by the dotted vertical lines, see text. Note that for ρ → 1, the normalized mean queue-length of all models converges, i.e. the mean queue-length shows the same vertical 1/(1 − ρ) growth as the M/M/1 queue.

The growth is a consequence of the fluctuations in the service rates due to failures of the servers. However, when truncated power-tail distributions with large range are used (T = 9, 10), three different regions with respect to ρ have to be distinguished:

• For small ρ (approximately ρ < 21.7% for the chosen example), the mean queue length is rather insensitive to the repair-time distribution.

• For the intermediate range 21.7% < ρ < 60.9%, the normalized mean queue length is significantly higher than for exponential repair times, and with longer tails of the repair-time distribution this difference grows slowly.

• For large ρ > 60.9%, the mean queue length jumps to huge values, 100 times larger than for an M/M/1 model; note the log scale on the y-axis in the figure. With increasing power-tail range, the mean queue length rapidly increases.

Figure 2 shows the probability mass function of the queue-length distributions corresponding to utilization values in the three different regions and, for comparison, that of an M/M/1 queue at the largest of these utilization values. The queue-length distributions show (truncated) power-law behavior, which in the log-log plot appears as a straight line, for the two parameter settings belonging to the intermediate and worst performance regions in Figure 1. The slope of the linear part, corresponding to the power-tail exponent, however differs between the two curves.

Figure 2. Probability mass function (pmf) of the queue length for the 2-node cluster model with TPT-distributed repair times (panel: M/2-Burst/1 queue, UP=90, DOWN=10, νp=2.0, δ=0.2, α=1.4, T=9, θ=0.20): The shape of the distribution changes significantly for the different utilization values ρ = 0.1, 0.3, 0.7. Also shown for comparison is the queue-length distribution of an M/M/1 queue at ρ = 0.7.

In the region of small ρ (solid curve), the queue-length distribution decays exponentially, as for an M/M/1 model. Similar behavior was observed first in [17] and later analyzed in more detail for teletraffic models in [19]. The underlying mechanism that causes this remarkable behavior of the mean queue length of the task queue in the cluster model is the same as for the teletraffic models: a truncated power-tail distribution for the repair time allows large repair times to occur with non-negligible probability. Temporarily, during time intervals in which i servers are simultaneously in a LONG repair time, the mean service rate of the cluster degrades to

νi = (N − i)(νp A + δ νp (1 − A)) + i δ νp,    i = 1, 2, ..., N.    (3)

Note that 0 < νN < ... < ν2 < ν1 < ν̄ for N > 2 and 0 < δ < 1. Hereby, ν̄ =: ν0 is the overall long-term average service rate. Hence, if the task arrival rate λ is smaller than νN, even a simultaneous long repair time of all N servers does not cause any oversaturation period, since the degraded modes of the servers can still handle the average arrival rate. This setting corresponds to the leftmost region in Figure 1, and the queue-length distribution decays geometrically. For the case of crash faults, δ = 0, νN = 0, and the model thus always operates in a setting for which (truncated) power-tailed queue-length probability mass functions are observed. If and only if the condition

νi < λ < νi−1,    i = 1, ..., N,    (4)

holds, at least i servers have to be in a long repair period in order to create an oversaturation period, and using equivalent residual-time arguments as in [19] it can be shown that the probability mass function of the queue-length distribution shows a (truncated) power-tail with exponent βi = i(α − 1) + 1, where α is the tail exponent of the DOWN-time distribution of each server. The condition in Eq. (4) can be reformulated in terms of the utilization ρ as νi/ν̄ < ρ < νi−1/ν̄, or in terms of any of the other parameters N, A, νp and δ that influence the blow-up behavior. Note that the mean TTF and mean TTR do not have any impact on the location of the blow-up points.
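
The blow-up boundaries quoted for the example of Figure 1 follow directly from Eqs. (3) and (4); a small sketch, using the example's availability A = 90/(90+10) = 0.9:

```python
import numpy as np

def blowup_boundaries(N, nu_p, delta, A):
    """Degraded service rates nu_i of Eq. (3) and the blow-up boundaries of
    Eq. (4), expressed as utilizations nu_i / nu_bar."""
    nu_bar = N * nu_p * (A + delta * (1 - A))
    nu = np.array([(N - i) * (nu_p * A + delta * nu_p * (1 - A)) + i * delta * nu_p
                   for i in range(1, N + 1)])
    return nu_bar, nu, nu / nu_bar

# Setting of Figure 1: N = 2, nu_p = 2.0, delta = 0.2, A = 0.9
nu_bar, nu, rho_bounds = blowup_boundaries(N=2, nu_p=2.0, delta=0.2, A=0.9)
print(nu_bar, nu, rho_bounds)
# 3.68, [2.24, 0.8], [0.609, 0.217]  -> the ~60.9% and ~21.7% boundaries of Figure 1
```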

To shed some light on how these results extend to a QoS analysis setting, we consider the tail probabilities of the queue-length distribution, which are shown in Figure 3 for varying values of the TPT parameter T. In particular, we compute the probability that an arriving task sees more than 500 customers in the queue. For larger values of T, the blow-up points can be distinguished quite clearly in the plot and correspond to those from Figure 1. The case of exponential repair times (T = 1, solid curve) is qualitatively comparable to that of the M/M/1 queue and only shows non-negligible tail probabilities for ρ close to 1.

Figure 3. Tail distribution of the length of the task queue for the 2-node cluster with different repair-time distributions under varying utilization (Pr(Q ≥ 500); curves for T = 1 (EXP), 5, 9, 10): The blow-up behavior observed for the mean queue-length also occurs for these tail probabilities, which can be translated into probabilities of violating delay bounds.

3.2. Hyperexponential Repair Times with High Variance
In our analysis thus far we employed TPT distributions to describe the repair-time durations and have argued that they serve as good descriptors for degradation periods spanning multiple time scales. Here we show that under weaker assumptions, namely with a high-variance 2-state hyperexponential (HYP-2) DOWN-time distribution, the blow-up behavior still holds and in some cases is even more pronounced. As only two states are required to represent such a hyperexponential distribution, this reduces computation time and in many cases increases numerical accuracy compared to a T-state TPT distribution, hence allowing numerical results to be obtained for a larger number of servers, N.

To illustrate the blow-up behavior, we use a 2-stage hyperexponential distribution (HYP-2) for the repair time, for which we set the three parameters such that it has the same first three moments as the corresponding TPT distribution used in Figure 1. Figure 4 shows the resulting normalized mean queue length, which is subject to the same blow-up behavior as for the TPT distributions. In the worst blow-up region at the right-hand side, even the actual values closely match those from Figure 1, while in the intermediate region the normalized mean queue length is slightly lower in this case.

Figure 4. Blow-up points for the 2-node cluster model using HYP-2 distributed repair times (panel: M/2-Burst/1 queue, ON=90, OFF=10, νp=2.0, δ=0.2): The parameters are chosen such that the first 3 moments of the HYP-2 distribution match the moments of the TPT distribution used earlier.
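
The paper does not spell out the fitting procedure behind Figure 4. One straightforward numerical possibility is to compute the first three raw moments of the TPT mixture in closed form and hand them to a generic root finder; the sketch below illustrates this idea only, and neither convergence of fsolve from this starting point nor feasibility of a three-moment HYP-2 fit is guaranteed in general.

```python
from math import factorial
import numpy as np
from scipy.optimize import fsolve

def hyperexp_moments(p, m, k_max=3):
    """Raw moments of a hyperexponential mixture: E[X^k] = k! * sum_i p_i * m_i^k."""
    p, m = np.asarray(p, float), np.asarray(m, float)
    return np.array([factorial(k) * (p @ m ** k) for k in range(1, k_max + 1)])

def fit_hyp2(target):
    """Solve for (p1, mean1, mean2) of a 2-phase hyperexponential matching
    three raw moments (a sketch; robustness is not guaranteed)."""
    def residual(x):
        p1, m1, m2 = x
        return hyperexp_moments([p1, 1.0 - p1], [m1, m2]) - target
    return fsolve(residual, x0=[0.95, 0.5 * target[0], 10.0 * target[0]])

# TPT repair time of Figure 1 (T = 9, alpha = 1.4, theta = 0.2, mean 10):
theta, alpha, T = 0.2, 1.4, 9
p = (1 - theta) / (1 - theta ** T) * theta ** np.arange(T)
m = theta ** (-np.arange(T) / alpha)
m *= 10.0 / (p @ m)

p1, mean1, mean2 = fit_hyp2(hyperexp_moments(p, m))
```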

Finally, we use the model setting with HYP-2 distributions to illustrate the blow-up behavior when varying the availability A of the individual server nodes in Fig. 5. A decrease of the availability is thereby achieved by a reduction of the mean UP duration while at the same time increasing the mean repair time accordingly, such that the average duration of an UP-DOWN cycle is kept constant. Note that not the whole range of A between 0 and 1 can be covered, since for the fixed arrival rate λ = 1.8 the cluster becomes unstable for values of A below approximately 31%, marked by the vertical dash-dotted line. Note also that for the chosen λ = 1.8 in Figure 5, and for the given settings of δ = 0.2 and νp = 2, the mean service rate during long repair times of both nodes simultaneously is ν2 = 0.72; hence for any A < 1, the model is immediately at least in the intermediate blow-up region. The region of operation with insensitivity towards the repair-time distribution is here thus reduced to a single point at A = 1, at which the model reduces to a plain M/M/1 queue, since repair times are infinitely short.

Figure 5. Blow-up point for the two-node cluster model while varying the availability A of the individual nodes (panel: M/2-Burst/1 queue, λ=1.80, cycle=100, νp=2.0, δ=0.20, HYP-2): high-variance HYP-2 distributions are used for the repair times.

In the general case, the reformulation of Eq. (4) in terms of the availability A leads to the following condition for blow-up region i = 1, 2, ..., N − 1:

(λ − N νp δ) / ((N − i + 1) νp (1 − δ))  <  A  <  (λ − N νp δ) / ((N − i) νp (1 − δ)).

The last blow-up region, i = N, corresponds to A > (λ − N νp δ) / (νp (1 − δ)) and is only present if and only if λ > N νp δ.
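
Evaluated for the setting of Figure 5, this condition reproduces the boundaries visible in the plot; a small sketch:

```python
def blowup_regions_in_A(N, nu_p, delta, lam):
    """A-intervals of blow-up regions i = 1, ..., N-1 from the condition above,
    plus the lower A-threshold of region i = N (present only if lam > N*nu_p*delta)."""
    num = lam - N * nu_p * delta
    regions = {i: (num / ((N - i + 1) * nu_p * (1 - delta)),
                   num / ((N - i) * nu_p * (1 - delta)))
               for i in range(1, N)}
    region_N = num / (nu_p * (1 - delta)) if num > 0 else None
    return regions, region_N

# Setting of Figure 5: N = 2, nu_p = 2.0, delta = 0.2, lam = 1.8
print(blowup_regions_in_A(2, 2.0, 0.2, 1.8))
# ({1: (0.3125, 0.625)}, 0.625): worst region for 0.3125 < A < 0.625,
# intermediate region for A > 0.625, instability below A ~ 0.31 (the ~31% of the text)
```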


Although the discussion of the blow-up points has been general for any N ≥ 1, the numerical examples so far focused on the case N = 2. Since for hyperexponential distributions with 2 states larger settings of N can easily be computed even without reduced state-space representations, we conclude this section by demonstrating that the blow-up points in terms of tail probabilities of the queue-length distribution can also be clearly seen for larger N; in the case of Figure 6, for N = 5, all five blow-up points are very pronounced.

Figure 6. Tail probabilities for the 5-node cluster model with high-variance HYP-2 repair times (panel: M/5-Burst/1 queue, ON=90, OFF=10, νp=2.0, δ=0.2): The five blow-up points are clearly visible.

4. Simulation Experiments
With the analytic results as a baseline for comparison, the simulation experiments served two purposes: first, to evaluate the effect of the load-independence assumption in the analytic model, and second, to explore our model under more general assumptions. In particular, we perform experiments that simulate the failure-handling strategies presented in Section 2. Before we present these results, we discuss the difficulties inherent in creating simulation experiments that sample the TPT distribution. As discussed in Section 2, the repair-time distribution can show power-tail-like behavior over a wide range of time scales, but eventually the repair-time distribution is expected to drop off exponentially, corresponding to a truncated tail. Truncated tails can also be an artifact of the finiteness of sample sets in measurements or simulation experiments. For instance, taking a large set of K samples of inter-arrival times of the TPT-DOWN model with infinite tail corresponds to sampling on average L := K/(λ E(UP + DOWN)) DOWN periods, or L power-tail samples. Since for high-quality components the UP periods can be very long (days to months), the number of power-tail samples, L := N · (simulated time)/E(UP + DOWN), is rather small. Simulation experiments would therefore require immensely long simulated virtual time in order to assure adequate sampling of the tails.
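
As a rough illustration of the orders of magnitude involved (an illustration only; the target of ten hits of the slowest phase is an arbitrary assumption, not part of the paper's experiment design):

```python
def virtual_time_for_tail(T, theta, mean_cycle, N, hits_of_last_phase=10):
    """Rough estimate of the simulated time needed until the rarest (longest)
    TPT repair phase has been entered about `hits_of_last_phase` times."""
    p_last = (1 - theta) / (1 - theta ** T) * theta ** (T - 1)   # entrance prob. of phase T
    cycles_needed = hits_of_last_phase / p_last                  # DOWN periods required
    return cycles_needed * mean_cycle / N                        # N servers sample in parallel

# T = 10, theta = 0.2, UP+DOWN cycle of 90+10 = 100 time units, N = 2 servers
print(virtual_time_for_tail(10, 0.2, 100.0, 2))   # ~1.2e9 time units vs. task times of order 1
```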

Figure 7 compares the exact analytic result of an M/2-Burst/1 queue with a simulation of the same system, as well as with a simulation of the actual corresponding multi-processor system. Since it is difficult to obtain stable results for large values of T, we limited our test case to T = 5. The main difference of the multi-processor system is that i ≤ N operational servers can only be fully utilized if at least i tasks are in the queue. Hence, the service rate is not only modulated by the number of active servers but also by the queue length. This impact is however only visible for small queue lengths, as Figure 7 shows.

Figure 7. Simulation of a 2-server system (panel: ON=90, OFF=10, ν=2.0, α=1.4, T=5, θ=0.50; curves: analytic result, simulation of the M/2-Burst/1 model, simulation of the multi-processor system, M/M/1 queue): The plot marked with circles shows that the effect of the load-independence assumption can only be observed for short queue lengths. The one marked by crosses corresponds to a simulation of exactly the analytical model and is used to validate our numerical results.

In Figure 8, we compare the simulation results of the three failure-handling strategies for varying values of ρ against the analytic computations. Each sample point in the figure represents the mean of 10 independent runs that each use 2 × 10^5 UP/DOWN cycles. The results show that the failure-handling strategies behave almost identically, with Restart being the worst and Discard the best. The choice of parameter values for the UP and DOWN periods, namely 90 and 10, relative to the mean task service time of 1, was made to allow us to obtain results within reasonable simulation times. To improve the stability of the simulation results and to provide for a more realistic choice of parameter values, the application of certain rare-event techniques, such as importance sampling, may be investigated. Such a technique, as applied to systems with heavy-tailed properties, is, however, still a subject of ongoing research [1].

Figure 8. Comparison between calculations from the analytic model and simulation results (panel: M/2-Burst/1, exponential tasks, νs=2.0, ON=90, OFF=10, T=10, θ=0.2, α=1.4): The simulations show that the failure-handling strategies behave almost identically with respect to mean queue length when task times are exponentially distributed. The 95% confidence interval plotted is for Discard.

Another variation that we considered is that of the task service time being nonexponential. Intuitively, the Restart strategy can be expected to perform worst, since a restarted task's duration is biased in that it must take at least as long to execute as its elapsed execution duration when the server failed. In fact, it is shown in [4] that the completion time of a restarted task exhibits power-tail behavior. In Figure 9, we use a HYP-2 distribution with variance 5.3 to model the task service-time distribution. The simulation results in that figure show that the ordering in which the strategies perform holds, although the difference in mean queue length has grown significantly compared to that in Figure 8. The blow-up behavior, however, can still be observed for all three variations. Other simulation experiments, not shown here due to space limitations, show that for the Resume and Restart recovery models, placing the interrupted task at the back of the queue is better than placing it at the front.

Figure 9. Simulation of the M/2-Burst/1 system with hyperexponential task service times (variance 5.3; νs=2.0, ON=90, OFF=10, T=10, θ=0.2, α=1.4) for the different fault-handling mechanisms Discard, Resume and Restart.

5. Summary
This paper presents an analytic model of a cluster of N nodes which are subject to failure and repair. We formulate the analytic model for general matrix-exponential repair and failure times, but then focus the analysis on high-variance repair times, due to their practical relevance. Under certain assumptions, most of which can easily be removed and are also shown to have no major influence on performance, the cluster model can be expressed as a single-server M/MMPP/1 queue, which bears resemblance to earlier teletraffic models. The analysis of performability metrics mainly focuses on the mean queue length and on tail probabilities of the queue-length distribution, since the latter can be mapped to successful task-completion probabilities under delay constraints.

Variations of any of the model parameters, i.e., the number of server nodes N, the availability A of the server nodes, the degradation factor δ of the service rate during DOWN periods, the service rate νp during UP periods, and the task arrival rate λ, can lead to a dramatic change of all performance metrics, referred to as blow-up points, if high-variance repair-time distributions are present. The exact placement of the blow-up boundaries in the parameter space is obtained in Section 3. The analytic results are confirmed by simulation experiments, which show that the main qualitative result, namely the existence of the blow-up points for such systems, is robust to model variations, in particular the type of participating distributions and the failure-handling strategy employed.

ACKNOWLEDGEMENTS: This research was partially supported by the EU IST FP6 project 'HIghly DEpendable ip-based NETworks and Services – HIDENETS', see www.hidenets.aau.dk. The authors would like to thank the HIDENETS consortium, in particular Felicita Di Giandomenica, ISTI-CNR, Italy, and Andrea Bondavalli, University of Florence, Italy, for their helpful comments. Furthermore, the authors would like to thank Michael Clark for the implementation of the simulation, Lester Lipsky at the University of Connecticut for the discussions, and the anonymous reviewers for their helpful comments.

References
[1] Asmussen, S., Fuckerieder, P., Jobmann, M., and Schwefel, H.-P.: Large Deviations and Fast Simulation in the Presence of Boundaries. Stochastic Processes and Applications, 102, pp. 1–23, 2002.
[2] Borst, S.C., Boxma, O.J., and Nunez-Queija, R.: Heavy Tails: The Effect of the Service Discipline. Proceedings of Performance Tools 2002, pp. 1–30, London, 2002.
[3] Crovella, M. and Bestavros, A.: Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes. Proceedings of ACM Sigmetrics, pp. 160–169, Philadelphia, PA, 1996.
[4] Fiorini, P., Sheahan, R., Lipsky, L., and Asmussen, S.: On the Completion Time Distribution for Tasks that Must Restart from the Beginning if a Failure Occurs. Proceedings of SPECTS 2006, Calgary, Canada.
[5] Gaver, D.P., Jacobs, P.A., and Latouche, G.: Finite Birth-and-Death Models in Randomly Changing Environments. Advances in Applied Probability, 16, pp. 715–731, 1984.
[6] Greiner, M., Jobmann, M., and Lipsky, L.: The Importance of Power-tail Distributions for Telecommunication Traffic Models. Operations Research, 47, No. 2, pp. 313–326, March 1999.
[7] Krieger, U., and Naumov, V.: Analysis of a Delay-Loss System with a Superimposed Markovian Arrival Process and State-Dependent Service Times. Proceedings of the MMB Conference, University of Trier, September 1999.
[8] Kulkarni, V.G., Nicola, V.F., and Trivedi, K.S.: The Completion Time of a Job on Multimode Systems. Advances in Applied Probability, 19, No. 4, pp. 932–954, 1987.
[9] Latouche, G., and Ramaswami, V.: Introduction to Matrix Analytic Methods in Stochastic Modeling. ASA-SIAM Series on Statistics and Applied Probability, 5, 1999.
[10] Lipsky, L.: Queueing Theory: A Linear Algebraic Approach. MacMillan Publishing Company, New York, 1992.
[11] Meier-Hellstern, K. and Fischer, W.: MMPP Cookbook. Performance Evaluation, 18, pp. 149–171, 1992.
[12] Mitrani, I.: Queues with Breakdowns. In: Haverkort, B.R., et al. (eds.), Performability Modelling: Techniques and Tools. Wiley, 2001.
[13] Palmer, J., and Mitrani, I.: Empirical and Analytical Evaluation of Systems with Multiple Unreliable Servers. Technical Report CS-TR-936, University of Newcastle, 2005.
[14] van Moorsel, A., and Wolter, K.: Analysis and Algorithms for Restart. Proceedings of the First International Conference on the Quantitative Evaluation of Systems (QEST), pp. 195–204, 2004.
[15] Neuts, M.: Matrix-Geometric Solutions in Stochastic Models. Johns Hopkins University Press, London, 1981.
[16] Ost, A., and Haverkort, B.: Evaluating Computer-Communication Systems Using Infinite-State Petri Nets. Proceedings of the 3rd International Conference on Matrix Analytic Methods, pp. 295–314, 2000.
[17] Schwefel, H.-P. and Lipsky, L.: Performance Results for Analytic Models of Traffic in Telecommunication Systems, Based on Multiple ON-OFF Sources with Self-Similar Behavior. In: P. Key and D. Smith (eds.), Teletraffic Engineering in a Competitive World, Vol. 3A, pp. 55–66. Elsevier, 1999.
[18] Schwefel, H.-P.: Performance Analysis of Intermediate Systems Serving Aggregated ON/OFF Traffic with Long-Range Dependent Properties. PhD Dissertation, Technische Universität München, 2000.
[19] Schwefel, H.-P. and Lipsky, L.: Impact of Aggregated, Self-Similar ON/OFF Traffic on Delay in Stationary Queueing Models (extended version). Performance Evaluation, 41, pp. 203–221, 2001.
[20] Schwefel, H.-P.: Behavior of TCP-like Elastic Traffic at a Buffered Bottleneck Router. Proceedings of IEEE Infocom, 2001.
[21] Schwefel, H.-P., Antonios, I., and Lipsky, L.: Performance-Relevant Network Traffic Correlation. Submitted to ASMTA 2007, Prague, Czech Republic.
[22] Willinger, W., Taqqu, M., Sherman, R., and Wilson, D.: Self-Similarity Through High-Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level (Extended Version). Proceedings of ACM Sigcomm, 1995.
