Characterizing Normal Operation of a Web Server: Application to Workload Forecasting and Problem Detection

Proceedings of the Computer Measurement Group, 1998

Joseph L. Hellerstein (1), Fan Zhang (2), and Perwez Shahabuddin (2)

(1) IBM Research Division, IBM T.J. Watson Research Center, Hawthorne, NY
(2) Industrial Engineering and Operations Research, Columbia University, NYC, NY
This paper describes a systematic, statistical approach to characterizing normal system operation for time-varying workloads in a web server. We consider the influences of time-of-day, day-of-week, and month, as well as time-serial correlations. We apply our approach to two areas of capacity management: workload forecasting and problem detection. For forecasting workloads, we address the following questions: (1) What will the workloads be at a time in the future? (2) When will the workloads grow beyond a specific limit? (3) When will this limit be exceeded for a specific time-of-day or day-of-week? For problem detection, we use the characterization to remove known behavior so as to better detect anomalies when problems are present and to avoid false alarms when no problem is present.

1 INTRODUCTION

Managing the capacity of information systems requires being able to characterize normal system operation. Typically, this is done by establishing baseline values of key metrics for resource utilizations, workload demands, etc. In capacity planning, such characterizations are used to extrapolate future workload demands. In problem detection, characterizations are used to set threshold values of measurement variables so as to detect abnormal behavior. For both areas, a significant challenge is dealing with time-varying workloads. This challenge is particularly difficult in Internet-attached servers in that they often have complex usage patterns (e.g., due to work days that span the globe). Herein, we describe a systematic, statistical approach to characterizing normal system operation for time-varying workloads, and we apply this approach to workload forecasting and problem detection for a web server. The data we consider were collected over eight months (June, 1996 through January, 1997) from a production web server at a large

corporation using the collection facilities documented in [4]. Each observation contains approximately twenty variables that are aggregated over five-minute intervals; 288 five-minute intervals are reported for each day. The measurement variables consisted of system utilization measures (e.g., CPU usage measures, disk configuration measures, network load measures) and TCP network measures. Also considered are measures of HTTP (hypertext transfer protocol) operations, such as get and post, and the rate at which cgi (common gateway interface) scripts are initiated. Fig. 1 displays values of these variables for a single day. We focus on the HTTP operations per second (httpop/s) since this is an overall indicator of the demands placed on the web server. Fig. 2 displays httpop/s over several days.

We have two objectives in this paper. First, we show how simple statistical techniques can be applied to characterizing normal operations. Others have done similar work in networks (e.g., [9]), employed filtering techniques to data from distributed systems (e.g., [3]), and used time series models to characterize system behavior (e.g., [7], [8], and [10]). However, prior work has not incorporated a methodology that evaluates the quality of the characterizations constructed. Our second objective is to show how characterizations of normal operation can be applied to capacity management. Two areas of application are considered. The first is workload forecasting. Typically, workload forecasts employ linear extrapolations or other curve-fitting techniques. While such approaches address simple trends, they do not capture time-of-day variations, such as peak loads during "prime shift." A second application of the characterization developed herein is in the area of problem detection. In current practice, characterizations are used to obtain threshold values of measurement variables. We propose a different approach in which characterizations are used to remove known patterns, such as having a heavier load on Mondays than on Fridays. In essence, we use the characterization to filter the raw data. The filtered data are then input into standard algorithms for detecting change points (e.g., a change in the mean or variance), thereby identifying patterns that are not expected under normal operations. The anomalies detected in this way are analyzed further to determine if a problem is present.

The remainder of this paper is organized as follows. Section 2 presents our characterization of normal operation in the web server data. Section 3 applies this characterization to workload forecasting. Section 4 applies the characterization to problem detection. Our conclusions are contained in Section 5.

2 STATISTICAL CHARACTERIZATION OF NORMAL OPERATION OF A WEB SERVER

This section develops a characterization of normal operation of the web server described in Section 1. More specifically, we characterize the variable httpop/s. Our approach is to construct a statistical model that estimates httpop/s based on the time-of-day, day-of-week, and month. We consider both seasonal (periodic) components and trends. Also addressed are time-serial correlations.

We begin by considering the effect of time-of-day. Let y_il be the value of httpop/s for the i-th five-minute interval (time-of-day value) and the l-th day in the data collected. Fig. 3 part (a) plots y_il for a work week (Monday through Friday) in June of 1996 and a work week in November of 1996. The x-axis is time, and the y-axis is httpop/s.

Figure 1: Measurement Data Collected for a Single Day from a Web Server. (Each panel plots one variable against the hour of the day: usr cpu, sys cpu, get/s, post/s, httpop/s, tcpIn/s, tcpOut/s, Ipkt/s, Opkt/s, Retran %, pdb, and mdb.)

We proceed by using notation from analysis of variance (e.g., [5]). We partition y_il into three components: the grand mean, the deviation from the mean due to the i-th time-of-day value (e.g., 9:05 am), and a random error term that captures daily variability. (The "error" here is an error term in a statistical sense; no software error or exception is intended. Rather, the error refers to deviation from the expected value predicted by the statistical model.) The grand mean is denoted by μ. The i-th time-of-day deviation from the grand mean is denoted by α_i. (Note that Σ_i α_i = 0.) The error term is denoted by ε_il. The model is:

    y_il = μ + α_i + ε_il.    (1)

The ε_il are random variables with a mean of zero. Thus, the expected value of y_il is μ + α_i. We estimate μ and α_i using ANOVA and use μ̂ and α̂_i (respectively) to denote these estimates.

How good is the model in Eq. (1)? To answer this question, we subtract the expected values of the y_il, as computed from the model, from the observed values in the data. These differences are called residuals. The residual for y_il is denoted by e_il, where e_il = y_il − μ̂ − α̂_i. Note that e_il is an estimator of the error term ε_il. The residuals are used in two ways. First, the variance of the residuals indicates the variability that is not explained by the characterization. We compute the fraction of variability unexplained by taking the ratio of the variance of the residuals to the variance of the raw data; one minus this ratio is the fraction of variability that is explained by the characterization. For the model above as applied to the data in Fig. 3 part (a), the fraction of variability explained is 53.24%. The second way in which we evaluate the characterization is qualitative. We look for systematic changes in the residuals that can be removed with a better model. Fig. 3 part (b)

Figure 2: Measurements of HTTP Operations Per Second for Several Days. (Panels show httpop/s by hour for Monday 06/10 through Friday 06/14.)

plots the residuals for Eq. (1). Observe that much of the rise in the middle of the day (as evidenced in part (a)) has been removed. A further examination of part (b) in Fig. 3 indicates that there is a weekly pattern. For example, note that Tuesday 6/11 and Wednesday 6/12 both tend to have larger values of httpop/s than Friday 6/14. Similarly, Tuesday 11/19 and Wednesday 11/20 both tend to be larger than Friday 11/22. Removing this pattern requires that we extend the model. Let β_j denote the effect of the j-th day of the work week. As with α_i, this is a deviation from the grand mean (μ). Thus, Σ_j β_j = 0. Our extended model is:

    y_ijl = μ + α_i + β_j + ε_ijl.    (2)

Note that since we include another parameter (day-of-week), another subscript is required for both y and ε. The inclusion of day-of-week allows us to account for 56.53% of the variability of the data. The residuals of this model are plotted in Fig. 3, part (c).

While the foregoing model is an improvement, we observe that another pattern remains: httpop/s is larger in November than it is in June. To eliminate this, we extend our model to consider the month. Let γ_k denote the effect of the k-th month. As with α and β, Σ_k γ_k = 0. The model here is:

    y_ijkl = μ + α_i + β_j + γ_k + ε_ijkl.    (3)

Once again, another subscript is added to both y and ε. While Eq. (3) has a similar form to the preceding equations, we estimate the γ_k differently from the other parameters. Specifically, we use month to indicate the trend in httpop/s values. Thus, instead of estimating the γ_k using ANOVA, we use least-squares regression (e.g., [6]). The resulting model accounts for 64.18%

Figure 3: Data Used to Illustrate Steps in Building the Characterization Model: (a) Raw data for a work week in June (06/10–06/14) followed by a work week in November (11/18–11/22); (b) Residuals after removing time-of-day effects (Eq. (1)) from (a); (c) Residuals after removing time-of-day and day-of-week effects (Eq. (2)) from (a); (d) Residuals after removing time-of-day, day-of-week, and month effects (Eq. (3)) from (a).

percent of the variability in the original data. Fig. 3 part (d) displays the residuals once the effect of months has also been removed from the data.

Fig. 4 depicts the values of the parameters of the Eq. (3) model as estimated from a training set of normal days of web server data. Part (a) of this figure displays the 288 values of μ̂ + α̂_i over a twenty-four hour period. Part (b) shows the effect of day-of-week for Monday (1) through Friday (5). Part (c) shows the influence of the month. Note that an upward trend is apparent in plot (c).

In computer and communications systems, factors such as queueing relationships and the nature of end-user interactions often result in time-serial dependencies between measurements. Typically, these dependencies have the following form: if a measurement variable has a large (small) value at time t, it is highly likely that this variable will have a large (small) value at time t + 1. Such dependencies can be identified by examining autocorrelations. Autocorrelations quantify the extent to which a measurement is correlated with predecessors that occur a specified number of time units (or lags) prior to the current observation. The autocorrelation function (ACF) expresses the relationship between the lag and the correlation between measurements at that lag. Thus, the ACF at lag 1 indicates the correlation between observations that are separated by one time unit; the ACF at lag 2 indicates the correlation between

Figure 4: Estimates of Parameters of Characterization Model: (a) grand mean plus time-of-day parameters (μ + α_i); (b) day-of-week parameters (β_j); (c) month parameters (γ_k) obtained from least-squares regression (indicated by asterisks) and what the month parameters would have been had ANOVA estimation been used (indicated by "x"s).

observations separated by two time units; and so on.

Fig. 5 part (a) plots the residuals of Eq. (3) for June (i.e., the first half of part (d) of Fig. 3). Part (b) of this figure plots the ACF of these residuals. The y-axis is the correlation value (which lies between -1 and 1); the x-axis is the lag (number of time intervals) between measurements that are used for that correlation. The dashed line specifies the ACF value that is statistically identical to zero (i.e., would not reject the hypothesis of zero autocorrelation at a significance level of 5%). The correlation at lag 0 is always one since this is the variable correlated with itself at the same lag. Note that in part (b) of the figure, all correlations are above the dashed line. This suggests that the data

contain significant time-serial dependencies. To remove these dependencies, we extend the characterization in Eq. (3). We assume that the time index t can be expressed as a function of (i, j, k, l). Then, we consider the following model:

    ε_t = φ_1 ε_{t−1} + φ_2 ε_{t−2} + u_t.    (4)

This is a second-order autoregressive model (AR(2)). Here, φ_1 and φ_2 are parameters of the model (which are estimated from the data), and the u_t are independent and identically distributed random variables. The model parameters are estimated using standard techniques [2]. Fig. 5 part (c) plots the residuals of Eq. (4), that is, e_t − φ̂_1 e_{t−1} − φ̂_2 e_{t−2}, where φ̂_i is an estimator of φ_i. (Recall that e_t estimates ε_t.) In

Figure 5: Autocorrelations of Data Used to Illustrate Building the Characterization Model: (a) residuals of the Eq. (3) model for the first week in Fig. 3; (b) ACF of the residuals in (a); (c) residuals after the AR(2) model (Eq. (4)) is applied to (a); (d) ACF of the residuals in (c).

our data, φ̂_1 = 0.4632 and φ̂_2 = 0.2111. Part (d) displays the ACF of the residuals of Eq. (4). Observe that almost all correlation values lie within the dashed lines. This suggests that the autocorrelations have been removed.
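The two steps just described — inspecting the sample ACF and filtering with an AR(2) model — can be sketched in a few lines of pure Python. The φ values below are the paper's estimates, but the residual series is synthetic, so this is only an illustration of the mechanics:

```python
import random

def acf(x, max_lag):
    """Sample autocorrelation of series x at lags 0..max_lag."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x)
    return [sum((x[t] - m) * (x[t - k] - m) for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

random.seed(1)
phi1, phi2 = 0.4632, 0.2111        # the paper's AR(2) estimates
eps = [0.0, 0.0]
for _ in range(2000):              # simulate an autocorrelated residual series
    eps.append(phi1 * eps[-1] + phi2 * eps[-2] + random.gauss(0, 1))

rho = acf(eps, 5)                  # large positive low-lag correlations

# AR(2) filtering: u_t = e_t - phi1*e_{t-1} - phi2*e_{t-2}
u = [eps[t] - phi1 * eps[t - 1] - phi2 * eps[t - 2]
     for t in range(2, len(eps))]
rho_u = acf(u, 5)                  # near zero at every nonzero lag

print(round(rho[1], 2), round(rho_u[1], 2))
```

As in Fig. 5, the raw residuals show substantial lag-1 correlation while the filtered series does not.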

From the analysis in this section, we conclude that Eq. (3) in combination with Eq. (4) (applied to the residuals of Eq. (3)) is a good characterization of normal behavior of httpop/s in the web server data.
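The evaluation loop this section describes — fit a model, compute residuals, check the fraction of variability explained — can be sketched for the simplest model, Eq. (1). The data here are synthetic (a daytime hump plus noise standing in for httpop/s), and all names are illustrative:

```python
import random
from statistics import mean, pvariance

random.seed(0)

# Synthetic httpop/s: 288 five-minute intervals per day, 5 days.
INTERVALS, DAYS = 288, 5
y = [[8.0 * max(0.0, 1 - abs(i - 144) / 144) + random.gauss(0, 1)
      for i in range(INTERVALS)] for _ in range(DAYS)]

flat = [v for day in y for v in day]
mu = mean(flat)                                   # grand mean (mu-hat)
alpha = [mean(y[l][i] for l in range(DAYS)) - mu  # time-of-day deviations
         for i in range(INTERVALS)]               # (alpha-hat_i; sum to ~0)

# Residuals e_il = y_il - mu-hat - alpha-hat_i
resid = [y[l][i] - mu - alpha[i]
         for l in range(DAYS) for i in range(INTERVALS)]

# Fraction of variability explained: 1 - Var(residuals) / Var(raw data)
explained = 1 - pvariance(resid) / pvariance(flat)
print(round(explained, 3))
```

The same loop, applied to the measured data with the day-of-week, month, and AR(2) terms added, yields the 53.24%, 56.53%, and 64.18% figures reported above.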


3 APPLICATION TO WORKLOAD FORECASTING

This section applies the characterization described in Eq. (3) to capacity planning. We begin by identifying the questions to address. We then show how the characterization in Section 2 is used to answer these questions. Our focus is forecasting future workloads, a sub-problem within capacity planning. Key questions here are:

1. What will the workload be at a specific time in the future?

2. When will the workload grow beyond a specified limit?

3. When will this limit be exceeded during a time interval of interest (e.g., time-of-day or day-of-week)?

Figure 6: Extrapolated Values of HTTP Operations Per Second by time-of-day for Monday through Friday of One Week in Each of Five Months (February 1997 through June 1997). The dashed line is a threshold for httpop/s.

The first question is valuable when forecasting demand for a specific date and time. Answering this question requires estimating values of httpop/s at future times. To do this, we compute the expected value of y_ijkl for the time-of-day, day-of-week, and month of interest. From Eq. (3), the expected value of y_ijkl is μ + α_i + β_j + γ_k (since ε_ijkl has a mean of 0, by assumption). Substituting our estimates for these parameters, we have:

    ŷ_ijkl = μ̂ + α̂_i + β̂_j + γ̂_k.

We use this equation to plot expected httpop/s for a desired time horizon. Fig. 6 displays such a plot (solid line) for five one-(work)week periods in February through June of 1997, a time period for which we did not obtain measurement data. To obtain a forecast, the specific date and time are located on the x-axis, and the associated value of expected httpop/s is located on the curve.

The second question provides insight into when a specified limit on resource demands will be exceeded. To illustrate, suppose that the limit is 15 httpop/s. Once again, we use Eq. (3) to extrapolate values of httpop/s, as displayed in Fig. 6. Then, we draw a line at 15 httpop/s. This is indicated by the dashed line in the figure. We then observe where the dashed line intersects the solid line. The value of the x-coordinate at the point of intersection provides the answer to our question. In the figure, this occurs in the morning of the Tuesday in June.
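Answering questions 1 and 2 amounts to evaluating ŷ_ijkl = μ̂ + α̂_i + β̂_j + γ̂_k over the horizon of interest. A sketch of that computation follows; every parameter value here is a hypothetical placeholder, not an estimate from the paper:

```python
# Sketch of forecasting with a fitted characterization:
# y-hat_ijkl = mu + alpha_i + beta_j + gamma_k.
# All parameter values are hypothetical placeholders.

mu = 8.0                                       # grand mean (httpop/s)

def forecast(i, j, k):
    """Expected httpop/s at interval i, weekday j (1=Mon..5=Fri), month k."""
    alpha = 4.0 if 96 <= i < 204 else -2.0     # crude prime-shift effect
    beta = {1: 0.5, 2: 0.7, 3: 0.4, 4: 0.1, 5: -0.6}[j]
    gamma = 0.35 * k                           # linear monthly trend
    return mu + alpha + beta + gamma

# Question 1: workload at a specific future time (a Tuesday mid-morning,
# ten months past the start of the training data).
q1 = forecast(130, 2, 10)

# Question 2: first month index at which any forecast exceeds 15 httpop/s.
limit = 15.0
q2 = next(k for k in range(24)
          if any(forecast(i, j, k) > limit
                 for j in range(1, 6) for i in range(288)))
print(q1, q2)
```

The scan in `q2` is the programmatic analogue of finding where the solid curve in Fig. 6 first crosses the dashed threshold line.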

The third question is a variation on the second. Here, we are interested in when demands become excessive during a specific time interval. Such a question is motivated by situations in which a service level objective applies only to a specific time-of-day or day-of-week. We use Eq. (3) to extrapolate httpop/s over several months and then draw a horizontal line at the threshold value. Suppose that we are interested in the first Wednesday when httpop/s exceeds 15. We then look for an intersection of the curves on a Wednesday. In Fig. 6, this occurs in June.

Figure 7: Change-points in Data: (a) Raw data without anomalies (7/15/96); (b) Filtered data for (a); (c) Raw data that has modest anomalies (8/5/96); (d) Filtered data for (c); (e) Raw data that has extreme anomalies (11/11/96); (f) Filtered data for (e).
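The day-of-week-restricted search can be sketched the same way: fix j to the weekday of interest and scan months. The `forecast` helper and its parameter values are hypothetical placeholders, as before:

```python
# Sketch: first month in which the Wednesday (j=3) forecast exceeds a limit.
# All parameter values are hypothetical placeholders.

def forecast(i, j, k, mu=8.0):
    alpha = 4.0 if 96 <= i < 204 else -2.0               # prime-shift effect
    beta = {1: 0.5, 2: 0.7, 3: 0.4, 4: 0.1, 5: -0.6}[j]  # Mon..Fri
    gamma = 0.35 * k                                     # monthly trend
    return mu + alpha + beta + gamma

limit = 15.0
# Scan Wednesdays only, over all 288 intervals of the day.
first_wed = next(k for k in range(24)
                 if any(forecast(i, 3, k) > limit for i in range(288)))
print(first_wed)
```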

4 APPLICATION TO PROBLEM DETECTION

Early detection of performance problems is central to effectively managing information systems. Herein, we apply the characterization in Section 2 to proactive detection of performance problems. We begin by describing the issues related to proactive detection. Then, we demonstrate the use of the characterization in Section 2 for detecting performance problems.

Current practice for problem detection is to establish threshold values for measurement variables. If the observed value violates its threshold, an alarm is raised. Threshold values are obtained from historical data, such as the 95th quantile (e.g., [8]). Unfortunately, there is a significant difficulty with this approach in practice: normal load fluctuations are so great that a single threshold is inadequate. That is, a single threshold either results in an excessive number of false alarms, or the threshold fails to raise an alarm when a problem occurs. Some performance management products attempt to overcome this difficulty by allowing installations to specify different thresholds at different times of the day, day of the week, etc. Even so, such an approach cannot handle trends, such as the effect of month in Eq. (3) (as can be seen in Fig. 4). Further, requiring that installations supply additional thresholds greatly adds to the burden of management.

We propose a fundamentally different approach. We use the characterization model to remove all known patterns in the measurement data, including the time-serial dependencies. For httpop/s in the web server data, this means using Eq. (3) to remove "low frequency" behavior, and then applying Eq. (4) to the residuals of this equation so as to remove time-serial dependencies. The residuals of Eq. (4) constitute filtered data from which all patterns in the characterization have been removed. Last, a change-point detection algorithm is applied to this filtered data to detect anomalies, such as an increase in the mean or the variance.

There are many algorithms for change-point detection. (See [1] for a survey.) Herein, we use the GLR (Generalized Likelihood Ratio) algorithm. This is an on-line technique that examines observations in sequence rather than en masse. When a change has been detected, an alarm is raised.

First, we introduce some terminology. Let u_t be the t-th residual obtained by filtering the raw data using a characterization such as Eq. (3) and Eq. (4). We consider two time windows, that is, sets of time indexes at which data are obtained. The first is the reference window; values in this window are used to estimate parameters of the "null hypothesis" in the test for a change point. The reference window starts with the time at which the last change point was detected; it continues through the current time (t). Within the reference window, u_t has variance σ_u². The second time window is the test window. Values in this window are used to estimate parameters of the "alternative hypothesis" that a change point has occurred. The test window spans t − L through t.

L should be large enough to get a stable estimate of σ'_u² (the variance of u_t in the test window), but small enough so that change points are readily detected. The GLR algorithm operates as follows.

1. Compute s_i = ln[p_1(u_i) / p_0(u_i)] for t − L ≤ i ≤ t, where p_0(u_i) is the likelihood of u_i using a normal distribution with a mean of 0 and a variance (σ_u²) estimated from the reference window, and p_1(u_i) is the likelihood of u_i using a normal distribution with the variance (σ'_u²) and mean estimated from data in the test window.

2. Let S_{t−L}^{t} = Σ_{i=t−L}^{t} s_i.

3. If S_{t−L}^{t} > h, then raise an alarm. ([1] describes how to calculate h.)

We apply the foregoing approach to the web server data collected on July 15, 1996, a day for which no anomaly is apparent. Fig. 7, part (a) displays httpop/s for this day. The vertical lines indicate where change points are detected using the GLR algorithm. Note that not taking into account normal load fluctuations, as is often done in practice, would have resulted in six alarms even though no problem is apparent. Part (b) plots the residuals after using Eq. (3) to filter the raw data and Eq. (4) to filter the residuals produced by Eq. (3). Observe that the GLR algorithm does not raise any alarm.

How well does our approach to detection work when anomalies are present? To answer this, we consider data collected on August 5, 1996, a day when abnormal behavior is apparent. Fig. 7, part (c) plots the raw data for this day, and part (d) displays the data after being filtered by our characterization (as described in the preceding paragraph). Observe that change points are detected in the filtered data. While the anomalies on this day are modest, the automation provided by filtering the data (using the characterization presented in Section 2) and then applying change-point detection allows us to identify anomalies automatically. A more extreme example is displayed in part (e) of the figure. These data are taken from November 11, 1996. Here, the shape of the time-of-day variations changes from a gradual rise to an abrupt jump. Our approach to anomaly detection catches this, as indicated in part (f).
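The three steps above can be sketched directly. In this sketch the residual series is synthetic white noise with an injected level shift, and the threshold h is set ad hoc rather than by the procedure in [1]:

```python
import math
import random

def glr_alarms(u, L=20, h=10.0):
    """Scan residual series u with a GLR test for a change point.

    p0: normal with mean 0 and variance from the reference window;
    p1: normal with mean and variance fitted to the test window u[t-L:t].
    Returns the indices t at which the summed log-likelihood ratio
    exceeds h.  (h is ad hoc here; [1] describes how to choose it.)
    """
    def logpdf(x, m, v):
        return -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)

    alarms, start = [], 0                  # reference window begins at the
    for t in range(len(u)):                # most recent change point
        if t < start + 2 * L:
            continue                       # let both windows fill up
        ref, test = u[start:t], u[t - L:t]
        v0 = sum(x * x for x in ref) / len(ref)        # null: mean 0
        m1 = sum(test) / len(test)
        v1 = sum((x - m1) ** 2 for x in test) / len(test)
        S = sum(logpdf(x, m1, v1) - logpdf(x, 0.0, v0) for x in test)
        if S > h:                          # summed log-likelihood ratio
            alarms.append(t)
            start = t                      # restart the reference window
    return alarms

random.seed(2)
u = [random.gauss(0, 1) for _ in range(150)]
u += [random.gauss(3, 1) for _ in range(50)]   # injected level shift at t=150
alarms = glr_alarms(u)
print(alarms[0])
```

On the quiet prefix the statistic stays small, so no alarm is raised; the first alarm fires shortly after the test window begins to cover the shifted samples.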

5 SUMMARY AND CONCLUSIONS

Characterizing normal system operation is a critical part of managing capacity in information systems. This paper describes an approach to characterization that systematically models the behavior of metrics. This is applied to web server data, in particular the variable HTTP operations per second. Central to our approach is evaluating the quality of the model at each step in its refinement. For example, we began by considering only time-of-day. By examining the residuals of this model (what is left unexplained), we determined that day-of-week effects are needed. Further, the residuals of this refined model suggest that a month effect is needed as well. Our methodology can capture both periodic effects (e.g., time-of-day, day-of-week) and trends (e.g., growth in user demand from month to month).

We show how characterizations of normal behavior of workload metrics can be applied to two aspects of capacity management. For workload forecasting, we address three questions: (1) What will the workload be at a specific time in the future? (2) When will the workload grow beyond a specific limit? (3) When will this limit be exceeded during a specific time-of-day or day-of-week? Having a statistical model that characterizes normal behavior

allows us to extrapolate values of metrics (e.g., HTTP operations) so that these questions can be answered. A second area of capacity management that we consider is problem detection. A major challenge here is removing known behavior (e.g., time-of-day variations) so that anomalies can be discerned. We show how the characterization of normal behavior can be used to filter out known effects. By so doing, we are better able to detect anomalies when problems are present, and we are less likely to raise an alarm when no problem is present.

Acknowledgements

This work was in part supported by NSF Career Award Grant DMI-96-25291 and a grant from the IBM Corporation.

References

[1] M. Basseville and I. Nikiforov: Detection of Abrupt Changes: Theory and Application, Prentice Hall, 1993.

[2] George E. P. Box and Gwilym M. Jenkins: Time Series Analysis: Forecasting and Control, Prentice Hall, 1976.

[3] Jeffrey Buzen and Annie Shum: "MASF - Multivariate Adaptive Statistical Filtering," Proceedings of the Computer Measurement Group, 1995, pp. 1-10.

[4] Adrian Cockcroft: "Watching your Web server," SunWorld Online, http://www.sunworld.com/swol-03-1996/swol-03-perf.html, 1998.

[5] Wilfrid J. Dixon and Frank J. Massey: Introduction to Statistical Analysis, McGraw-Hill, 1969.

[6] N. R. Draper and H. Smith: Applied Regression Analysis, John Wiley and Sons, 1968.

[7] C. S. Hood and C. Ji: "Proactive Network Fault Detection," Proceedings of INFOCOM, Kobe, Japan, 1997.

[8] P. Hoogenboom and J. Lepreau: "Computer System Performance Problem Detection Using Time Series Models," Proceedings of the Summer USENIX Conference, pp. 15-32, 1993.

[9] Roy A. Maxion: "Anomaly Detection for Diagnosis," Proceedings of the 20th International Symposium on Fault-Tolerant Computing (FTCS-20), June 1990, pp. 20-27.

[10] Marina Thottan and Chuanyi Ji: "Adaptive Thresholding for Proactive Network Problem Detection," IEEE Third International Workshop on Systems Management, Newport, Rhode Island, April 22-24, 1998, pp. 108-116.
