
A Hidden Semi-Markov Model for Web Workload Self-similarity

Shun-Zheng Yu, Zhen Liu, Mark S. Squillante, Cathy Xia, Li Zhang
IBM Research Division, T. J. Watson Research Center, Yorktown Heights, NY 10598

0-7803-7371-5/02/$17.00 ©2002 IEEE

Abstract: Hidden semi-Markov models (HSMMs) have been well studied and successfully applied to many engineering and scientific problems. The advantage of using an HSMM is its efficient forward-backward algorithms for estimating the model parameters that best account for an observed sequence. In this paper, we propose an HSMM for modeling Web workloads. We show that this model asymptotically characterizes second-order self-similar workloads when some duration distributions of the hidden states are heavy-tailed. A recursive formula is developed for estimating the Hurst parameter of self-similarity. We validate our model and estimation methods with respect to two sets of empirical data (requests per second) collected from two different Web servers. We then use this model to generate self-similar workloads that exhibit the same statistical properties. These measurements show that we can use as few as 4 states together with a simple Poisson process and heavy-tailed Pareto holding time distributions to accurately model the Web workloads considered in this study.

1. Introduction

Measurements of real workloads indicate that significant traffic variability is present on a wide range of time scales. Leland, Taqqu, Willinger and Wilson [1] demonstrate that network packet traffic has self-similar or long-range dependent characteristics. These effects can have a significant impact on the performance of networks and systems [2][3][4]. Therefore, a better understanding of the nature of self-similarity in Web workloads is critical to properly designing and implementing Web servers.

Willinger, Taqqu, Sherman and Wilson [5] attempt to explain the occurrence of self-similarity in LAN traffic, and show that the superposition of many ON/OFF traffic sources can result in self-similar aggregate traffic when the number of sources grows to be large. Crovella and Bestavros [6] argue that self-similarity in Web traffic may be explained by the underlying distribution of transferred document sizes, the effects of caching and user preference in file transfer, the effect of user "think time", and the superposition of many such transfers in a network. Several studies have shown that file sizes often follow heavy-tailed distributions [7][8][9]. Moreover, it has been suggested that the distribution of file sizes in the Web is the primary determiner of Web traffic self-similarity [6][10].

Traditionally, the commonly assumed models for network traffic are Markovian. In a simple Markov or Markov-renewal [11] traffic model, each state transition represents a new arrival, where the interarrival times are exponentially distributed. In a Markov-modulated process model, an auxiliary Markov process is used to control (or modulate) the arrival process. While in state $m$, the arrivals occur with rate $\lambda_m$. The most commonly used Markov-modulated model is the Markov-Modulated Poisson Process (MMPP). MMPPs allow the modeling of time-varying sources while maintaining an analytically tractable solution of related queueing-theoretic measures. For example, performance measures such as the queue-length distribution and moments of the delay distribution can be obtained using MMPP/G/1 queueing analysis [12]. Moreover, the parameters of an MMPP can be estimated from empirical data, and a number of algorithms have been developed for doing so. An important advantage of Markovian models is that their solution is analytically tractable, for example via matrix-analytic methods, a well-established area of research that has developed computationally efficient and numerically stable algorithms [13]. However, traditional Markovian models cannot in general model traffic that exhibits long-range dependence or self-similarity, because they have an exponentially decaying autocorrelation function [5]. Robert and Boudec [14][15] used a Markov chain to model short-range dependence. Andersen and Nielsen [16][17] consider the superposition of two-state Markovian sources for the modeling of long-range dependence over several time scales.

In this paper, we propose a hidden semi-Markov model (HSMM) to characterize the user request patterns of Web workloads. The model is used to describe the statistical properties for both peak and off-peak traffic intensities. We show that this model can characterize the self-similarity and long-range dependence of the workload when at least one of the state duration distributions is heavy-tailed. Similar to the MMPP model and the Markov-modulated fluid model [18][19], the HSMM is a doubly stochastic process with a general state duration distribution and an arbitrary arrival distribution.

2. Hidden semi-Markov model for workload

In this section, we introduce a hidden semi-Markov model to characterize the user request patterns of Web workloads. More specifically, consider a continuous-time semi-Markov process $\{s(t)\}$ taking values in state space $S = \{1, \ldots, M\}$, where we assume that the times are discretized so that $t \in \{1, 2, \ldots, T\}$. This assumption is due to the fact that available data is often finite and on a discrete time scale. Let $h_m$ be the amount of time (or duration) that the process stays in state $m$ before making a transition into a different state. We assume $h_m$ has probability density function $p_m(d)$, $d = 1, \ldots, D$, independent of time $t$, where $D$ is the maximum duration of all states. The tail distribution of duration $h_m$ is then given by

$$c_m(d) = \sum_{i=d}^{D} p_m(i), \qquad d = 1, \ldots, D. \tag{1}$$

When the process leaves state $m$, it will next enter state $m'$ with some probability $a_{mm'}$, where $\sum_{m' \ne m} a_{mm'} = 1$ and we assume that $a_{mm} = 0$ for all $m \in S$. Clearly, when the duration distributions $c_m(d)$ are general distributions, then $\{s(t)\}$ is a semi-Markov process.

Let $s_t$ be the state that the process is in at time $t$. Denote by $\tau_t$ the amount of time that the process has been in the current state $s_t$ at time $t$. Define $X_t = (s_t, \tau_t)$, $t = 1, 2, \ldots, T$. Note that the transition probability from $(s_{t-1} = m', \tau_{t-1} = d')$ to $(s_t = m, \tau_t = d)$, denoted by $a_{(m',d')(m,d)}$, is independent of time $t$. To see this, observe that the state transitions can only occur in the following two cases: $(s_{t-1} = m', \tau_{t-1} = d') \to (s_t = m, \tau_t = 1)$ when $m \ne m'$, and $(s_{t-1} = m', \tau_{t-1} = d') \to (s_t = m', \tau_t = d'+1)$ when $m = m'$. We then have

$$a_{(m',d')(m,d)} = \begin{cases} p_{m'}(d')\,a_{m'm}/c_{m'}(d'), & m \ne m',\ d = 1, \\ c_{m'}(d'+1)/c_{m'}(d'), & m = m',\ d = d'+1, \\ 0, & \text{for all others}. \end{cases} \tag{2}$$

Therefore, the pair $X_t = (s_t, \tau_t)$, $t = 1, 2, \ldots, T$, forms a Markov chain with transition probability $a_{(m',d')(m,d)}$.

Let $o_t$ denote the observable output (i.e., the number of requests in the $t$'th interval) associated with state $s_t$. We assume that there are $K$ distinct values that $o_t$ can take, i.e., $o_t \in \{0, 1, \ldots, K-1\}$. We define $b_m(k) \equiv \Pr[o_t = k \mid s_t = m]$ to be the probability that the observed value is $k$ when the system is in state $m$, which is assumed to be independent of time $t$. Since this semi-Markov process is not directly observable (i.e., hidden), the state sequence $\{s_t\}$ and the model parameters are estimated from the observed sequence $\{o_t\}$ of the empirical workload data set. The process $s(t)$ is often called the hidden semi-Markov model, which is doubly stochastic in the sense that it has general state duration distributions and general modulated arrival processes.

3. Hidden semi-Markov model for self-similarity

In this section, we will show that an HSMM can characterize the long-range dependence (or asymptotic self-similarity) of Web workloads. First, we derive recursive formulae for the workload autocovariance and variance over different time scales for the HSMM with general state duration distributions and general modulated arrival processes. Then, we use these recursive formulae to show that the resulting doubly stochastic process is asymptotically self-similar when the durations of some of the hidden states follow a heavy-tailed distribution.

Assuming this Markov chain to be irreducible, its stationary distribution $\pi(m,d)$ is given by the unique solution of

$$\pi(m,d) = \sum_{m',d'} \pi(m',d')\,a_{(m',d')(m,d)} \quad \text{and} \quad \sum_{m,d} \pi(m,d) = 1.$$

These equations can be further reduced to

$$\pi(m,d) = \pi(m,1)\,c_m(d), \tag{3}$$

where $\pi(m,1)$ is the stationary probability of entering state $m$. Now we compute the autocovariance function $\rho_i$ over different time scales as follows:

$$\rho_i = \sum_{m',d'} \sum_{m,d} \pi(m',d')\,\lambda_{m'}\,a^{(i)}_{(m',d')(m,d)}\,\lambda_m - \mu_1^2, \tag{4}$$

where $a^{(i)}_{(m',d')(m,d)}$ is the transition probability from $(m',d')$ at time $t$ to $(m,d)$ at time $t+i$, $\lambda_m$ is the mean number of arrivals per unit time when the process is in state $m$, and $\mu_1$ is the mean number of arrivals per unit time for the overall process. Obviously,

$$a^{(i)}_{(m',d')(m,d)} = \sum_{m_1,d_1} a^{(i-1)}_{(m',d')(m_1,d_1)}\,a_{(m_1,d_1)(m,d)}$$

and

$$\mu_1 = E[o_t] = \sum_{m,d,k} \pi(m,d)\,b_m(k)\,k = \sum_{m,d} \pi(m,1)\,c_m(d)\,\lambda_m.$$

Writing $\varphi_i(m,d)\,c_m(d) = \sum_{m',d'} \pi(m',d')\,\lambda_{m'}\,a^{(i)}_{(m',d')(m,d)}$, we have

$$\varphi_i(m,d)\,c_m(d) = \sum_{m_1,d_1} \varphi_{i-1}(m_1,d_1)\,c_{m_1}(d_1)\,a_{(m_1,d_1)(m,d)}. \tag{6}$$

Applying (2) and $c_m(1) = 1$, we obtain the following recursive formulae for the computation of $\rho_i$ over the time scales $i = 1, 2, \ldots$:

$$\varphi_0(m,d) = \pi(m,1)\,\lambda_m, \qquad d \ge 1,\ m \in S, \tag{7}$$

$$\varphi_i(m,1) = \sum_{m' \ne m} a_{m'm} \sum_{d'} \varphi_{i-1}(m',d')\,p_{m'}(d'), \qquad m \in S, \tag{8}$$

$$\varphi_i(m,d+1) = \varphi_{i-1}(m,d), \qquad m \in S,\ d \ge 1, \tag{9}$$

and

$$\rho_i = \sum_{m,d} \left[\varphi_i(m,d) - \pi(m,1)\,\mu_1\right] c_m(d)\,\lambda_m.$$

[...]

and, in the general case, $\varphi_i(m,1) - \pi(m,1)\,\mu_1$ will approach 0 exponentially fast. The second term of (14) will asymptotically tend to $L\,i^{1-\alpha}$ as $i \to \infty$, because

[...]

$$\sum_{i=1}^{j} \rho_i \sim L\,j^{2-\alpha} \to \infty, \qquad \text{as } j \to \infty,\ 1 < \alpha < 2. \tag{17}$$

Hence, under the assumption that some of the state duration distributions are heavy-tailed (Pareto with parameter $\alpha$ in $(1,2)$), the process will have the property of long-range dependence or asymptotic self-similarity with Hurst parameter $H = (3-\alpha)/2$, which clearly indicates long-range dependence (or asymptotic self-similarity) since $0.5 < H < 1$.
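As a small numerical companion to the relation between the Pareto tail index and the Hurst parameter, the following Python sketch evaluates $H = (3-\alpha)/2$ and builds one possible discrete, truncated Pareto-type duration distribution. The function names, the truncation point `D`, and the particular discretization (tail $c(d) = d^{-\alpha}$, with the remaining mass lumped at `D`) are illustrative assumptions and are not taken from the paper.

```python
from typing import List


def hurst_from_pareto_index(alpha: float) -> float:
    """Hurst parameter H = (3 - alpha) / 2 for a tail index alpha in (1, 2)."""
    if not 1.0 < alpha < 2.0:
        raise ValueError("alpha must lie strictly between 1 and 2")
    return (3.0 - alpha) / 2.0


def pareto_duration_pmf(alpha: float, D: int) -> List[float]:
    """Hypothetical discretization of a Pareto-type duration on 1..D.

    p(d) = d^-alpha - (d+1)^-alpha for d < D, and the leftover tail mass
    D^-alpha is placed at d = D, so that the tail is c(d) = d^-alpha.
    """
    if D < 1:
        raise ValueError("D must be at least 1")
    pmf = [d ** -alpha - (d + 1) ** -alpha for d in range(1, D)]
    pmf.append(float(D) ** -alpha)  # lump the truncated tail at d = D
    return pmf


if __name__ == "__main__":
    alpha = 1.5  # assumed example value
    print("H =", hurst_from_pareto_index(alpha))
    print("first duration probabilities:", pareto_duration_pmf(alpha, 1000)[:3])
```

A heavier tail (smaller $\alpha$) gives a larger $H$. The finite `D` only approximates the heavy tail, so long-range dependence in a truncated model can hold only up to time scales comparable to `D`.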

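The construction of Section 2 and the autocovariance computation of Section 3 can also be illustrated numerically. The sketch below, in plain Python, builds the Markov chain on pairs (state, elapsed duration) from the duration distributions and the state-transition probabilities, obtains its stationary distribution, and evaluates the lag-$i$ autocovariance of the per-state mean arrival rates by repeatedly applying the one-step transition probabilities. All function names and example parameters are hypothetical. Durations are truncated at a finite maximum and states are indexed from 0. The stationary vector of the embedded chain is found by a simple damped iteration, chosen here for brevity and not prescribed by the paper.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (hidden state m, elapsed duration d), d = 1..D


def tail(p: List[float]) -> List[float]:
    """Tail c(d) = sum_{i >= d} p(i); entry d-1 holds c(d)."""
    c, acc = [0.0] * len(p), 0.0
    for idx in range(len(p) - 1, -1, -1):
        acc += p[idx]
        c[idx] = acc
    return c


def expanded_chain(p: List[List[float]],
                   a: List[List[float]]) -> Dict[Pair, Dict[Pair, float]]:
    """One-step transition probabilities between (state, duration) pairs."""
    trans: Dict[Pair, Dict[Pair, float]] = {}
    for m, pm in enumerate(p):
        c = tail(pm)
        for d in range(1, len(pm) + 1):
            if c[d - 1] <= 0.0:
                continue  # duration d is never reached in state m
            row: Dict[Pair, float] = {}
            leave = pm[d - 1] / c[d - 1]  # the stay ends exactly at duration d
            for m2, prob in enumerate(a[m]):
                if m2 != m and prob > 0.0 and leave > 0.0:
                    row[(m2, 1)] = leave * prob
            if d < len(pm) and c[d] > 0.0:
                row[(m, d + 1)] = c[d] / c[d - 1]  # stay one more time unit
            trans[(m, d)] = row
    return trans


def stationary(p: List[List[float]], a: List[List[float]],
               iters: int = 5000) -> Dict[Pair, float]:
    """Stationary distribution pi(m, d), proportional to nu_m * c_m(d)."""
    M = len(p)
    nu = [1.0 / M] * M
    for _ in range(iters):  # damped iteration towards nu = nu * a
        step = [sum(nu[i] * a[i][j] for i in range(M)) for j in range(M)]
        nu = [0.5 * x + 0.5 * y for x, y in zip(nu, step)]
    tails = [tail(pm) for pm in p]
    norm = sum(nu[m] * sum(tails[m]) for m in range(M))
    return {(m, d + 1): nu[m] * c / norm
            for m in range(M) for d, c in enumerate(tails[m]) if c > 0.0}


def autocovariance(trans: Dict[Pair, Dict[Pair, float]], pi: Dict[Pair, float],
                   lam: List[float], max_lag: int) -> Tuple[float, List[float]]:
    """Overall mean rate and autocovariances of the mean rates at lags 1..max_lag."""
    mu = sum(w * lam[m] for (m, _), w in pi.items())
    v = {x: w * lam[x[0]] for x, w in pi.items()}  # lag-0 weights pi(x) * lam(x)
    rho: List[float] = []
    for _ in range(max_lag):
        nxt = dict.fromkeys(pi, 0.0)
        for x, row in trans.items():
            for y, pxy in row.items():
                nxt[y] += v[x] * pxy  # propagate the weights one step forward
        v = nxt
        rho.append(sum(w * lam[y[0]] for y, w in v.items()) - mu * mu)
    return mu, rho


if __name__ == "__main__":
    p = [[0.5, 0.5], [0.2, 0.3, 0.5]]  # assumed duration distributions
    a = [[0.0, 1.0], [1.0, 0.0]]       # assumed state-transition probabilities
    lam = [1.0, 5.0]                   # assumed mean arrivals per unit time
    print(autocovariance(expanded_chain(p, a), stationary(p, a), lam, 5))
```

For lags $i \ge 1$ this equals the autocovariance of the request counts only if, given the hidden states, the counts in different intervals are independent; that is an assumption of this sketch.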