Estimation of WEB Servers' Reliability with Symptoms of Software Aging

Rivalino Matias Jr.¹, Paulo J. F. Filho¹, Luciano Guedes², Acires Dias³

¹ PerformanceLab – INE-PPGEP-CTC, ² LabCADCAM – EMC-CTC, ³ NEDIP – EMC-CTC
Federal University of Santa Catarina – Florianópolis, SC – Brazil

Abstract. Web servers run continuously for long periods and under quite varied workloads. Such characteristics make them potential candidates for a degenerative phenomenon called software aging. We conducted controlled experiments that monitored aging in the Apache web server. Based on the experimentation, a simulation model was created in order to obtain aging-related failure times for fifty web servers. By means of goodness-of-fit tests applied to the web servers’ lifetime data, it was verified that the most adequate lifetime model was the three-parameter Weibull. This model showed a minimum server lifetime (gamma) of 1,780.04 hours and a shape parameter (beta) of 18.13. Both parameters allow determining the optimal time for the servers’ preventive maintenance.

1. Introduction

Modern society has become ever more dependent on computer systems. This scenario demands computational resources that are increasingly complex and subject to high quality requirements. The consequences of failures in computer systems may vary from small inconveniences to financial losses, and may even represent a risk to human lives [Lyu 1996]. In this context, the field of software reliability stands out as being of utmost importance in software engineering. Basically, software reliability engineering (SRE) follows the same theoretical principles of reliability engineering applied in other industrial fields. The main difference in applying reliability engineering to software resides in the failure mechanisms [Musa 1998]. Several research papers have empirically shown that the accumulation of faults during the execution of a software process¹ is the cause of many problems related to performance degradation and software failures [Avritzer and Weyuker 1997]. The term ‘software aging’ has been used to identify this phenomenon [Huang et al. 1995]. It has been observed mostly in software processes that run for long periods of time, and it is also influenced by variations in the workload.

¹ In this context, the term 'software process' means an instance (process) of the software loaded and running in the computer memory.

This paper presents a study of the reliability of Apache web servers (httpd). We consider memory leaks as the main failure mode, following [Trivedi et al. 2000], which identified Apache’s aging symptoms. Based on the analysis of the web server’s physical memory degradation, caused by the aging of httpd processes during laboratory-controlled experiments, a simulation model was created to generate lifetime data for fifty web servers. Through this model, the servers’ failure times were obtained and used in the Apache reliability study. The remainder of this article is organized as follows. Section 2 presents the methodology used to conduct the laboratory tests. Section 3 describes the modeling and simulation of web server aging. Section 4 presents the lifetime data analysis used for the reliability calculation. Finally, Section 5 presents the conclusions of the research.

2. Experimentation description

To characterize the aging of httpd processes, with the purpose of simulating their behavior, we conducted controlled experiments that generated workloads for the web servers, monitored them, and produced the data later used in the simulation modeling. Next, the environment set up for these purposes is described.

2.1. Test bed environment

Two computers were used as traffic generators and one as the web server. The computers were interconnected through an Ethernet switch. Figure 1 illustrates this setup.

Figure 1. Test bed: two Linux traffic-generator machines (running httperf/ab) connected through a 100 Mbps switch to the Linux web server running httpd and monitored by sysmon

On the server side, the httpd processes as well as operating system variables were monitored. This measurement was done by sysmon, a program built for that purpose. Workload generation was performed with the httperf tool [Mosberger and Jin 1998]. Apache 2.0 was chosen as the web server software. Due to the capacity of the server computer, we used 200 httpd processes, thus supporting a maximum of 200 simultaneous connections. All 200 processes were created and kept handling requests throughout the test period, without recycling, thus maximizing their exposure to software aging effects. The methodology used for the workload modeling is presented next.

2.2. Workload modeling

The creation of the workload model took into account the usual operational condition of a web server; therefore, a workload of up to 50% of the server’s capacity was considered. Based on Trivedi et al. (2000), we selected three parameters: APS (average page size), IAT (inter-arrival time) and PT (page type).

The APS was based on a data sample equivalent to six months of accesses to an Internet provider’s web server. These data were collected from the logs of the provider’s Apache server. We selected the ten most accessed pages in this period for the calculation of the APS; the value obtained was 198 kilobytes. To model the IAT, it was necessary to define a value that would not raise the server’s load above 50% of its capacity, as described previously. It was verified that a rate of 15 requests per second represented 50% of the server’s load capacity; this was established with httperf. Complementarily, a goodness-of-fit test was conducted on a sample (n = 156,926) of IATs obtained from the Apache log. The statistical test adopted was the chi-square (χ²) [Jain 1991]. Based on the χ² results, the adherence of nine models to the sample data was assessed in order to select the one with the best fit (a minimal sketch of such a test is shown after Table 1). Table 1 lists the models ordered by square error. Considering these results, the model that best represented the IAT was the exponential (β = 15).

Table 1. Models' adherence to the IAT sample

Ranking   Model         Square error
1st       Exponential   0.0074
2nd       Gamma         0.0158
3rd       Beta          0.0214
4th       Normal        0.0287
5th       Triangular    0.0306
6th       Uniform       0.0394
7th       Lognormal     0.0499
8th       Poisson       0.0874
9th       Weibull       0.1860
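As an illustration of the adherence test described above, the sketch below shows one way to run a chi-square goodness-of-fit test of the IAT sample against an exponential model using Python and scipy. It is a hypothetical reconstruction, not the tool actually used in the study; the input file name and the number of bins are assumptions.

```python
# Hypothetical sketch of the chi-square adherence test for the IAT sample.
# "apache_iat_sample.txt" (one inter-arrival time, in seconds, per line) and
# the bin count are illustrative assumptions.
import numpy as np
from scipy import stats

iat = np.loadtxt("apache_iat_sample.txt")
lam = 1.0 / iat.mean()                       # MLE of the exponential rate (~15 req/s here)

k = 20                                       # number of equiprobable bins (assumption)
edges = stats.expon.ppf(np.linspace(0, 1, k + 1), scale=1.0 / lam)
edges[-1] = max(edges[-2], iat.max()) + 1.0  # cap the open-ended last bin
observed, _ = np.histogram(iat, bins=edges)
expected = np.full(k, len(iat) / k)          # equal expected counts per bin

# One parameter (the rate) was estimated from the data, hence ddof=1.
chi2, p_value = stats.chisquare(observed, expected, ddof=1)
print(f"rate = {lam:.2f} req/s, chi2 = {chi2:.1f}, p = {p_value:.4f}")
```

The same procedure, repeated for the other eight candidate distributions, yields the ranking reported in Table 1.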

The definition of the third parameter of the load model was based on the ten most accessed pages obtained from the provider’s data sample. It was verified that all these pages are dynamically generated, which pointed to the use of dynamic pages during the tests. These three parameters were used to build the workload model configured in the httperf generator for the experiments. Three tests were carried out, each lasting approximately eight hours. The aim of this stage was to expose the httpd processes to uninterrupted operation and to monitor their memory utilization, since the failure mode under study was the memory leak. Figure 2 presents the result of one test; the behavior of the other experiments was the same, so their graphs were omitted.
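For illustration only, a workload with the characteristics above (about 15 requests per second, exponentially distributed inter-arrival times, a single dynamic page of roughly 198 KB) could be handed to httperf roughly as sketched below. The host name, URI and connection count are assumptions, and the option names follow the httperf documentation as recalled here; the paper does not list the exact command line used, so this is not the original configuration.

```python
# Hypothetical httperf invocation wrapped in Python; all values are assumptions.
import subprocess

cmd = [
    "httperf",
    "--server=webserver.lab.local",  # assumed server host name
    "--port=80",
    "--uri=/index.php",              # assumed dynamic page (~198 KB)
    "--period=e0.0667",              # exponential inter-arrivals, mean 1/15 s
    "--num-conns=432000",            # 15 req/s x 8 h of test time
    "--num-calls=1",                 # one request per connection
]
print(" ".join(cmd))
subprocess.run(cmd, check=True)
```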


Figure 2. Physical memory consumption of httpd processes

As can be observed, the httpd processes start out occupying approximately 66 megabytes and, after eight hours of execution, are using approximately 125 megabytes. At the end of each test, the Apache processes were kept loaded in order to verify whether they would return to their original size in the resting state. After 4 hours without workload, the size of the processes remained the same. When the same load generation was repeated, memory consumption continued to rise, which confirms the aging of the httpd processes. Based on the data collected on Apache’s memory consumption during the tests, a simulation model was created to estimate failure times due to exhaustion of the server’s physical memory, as described next.
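The sysmon program itself is not described in detail in the paper; the sketch below illustrates, under assumptions, the kind of measurement loop it performs on a Linux server: summing the resident memory of all httpd processes at 60-second intervals. Process-name matching via /proc and the page size are assumptions of this sketch.

```python
# Minimal sysmon-like monitor (sketch): every 60 s, sum the resident set size
# of all processes named "httpd". Assumes a Linux /proc filesystem.
import os
import time

PAGE_SIZE = 4096  # bytes; typical x86 Linux page size (assumption)

def httpd_rss_mb():
    """Return the total resident memory (MB) of all httpd processes."""
    total_pages = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                if f.read().strip() != "httpd":
                    continue
            with open(f"/proc/{pid}/statm") as f:
                total_pages += int(f.read().split()[1])  # 2nd field: resident pages
        except OSError:
            continue  # the process exited between listing and reading
    return total_pages * PAGE_SIZE / (1024 * 1024)

while True:
    print(time.strftime("%H:%M:%S"), f"{httpd_rss_mb():.1f} MB")
    time.sleep(60)  # 60-second sampling interval, as in the experiments
```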

3. Aging modeling and simulation

The effects caused by software aging usually require long periods of experimentation, due to the non-deterministic nature of this phenomenon. Based on the results presented in Figure 2, it becomes evident that exhausting the server’s physical memory would require a long period of uninterrupted execution. This conclusion follows from the available physical memory (approximately 512 megabytes) and the slow aging rate observed: reaching exhaustion would demand long test executions, increasing the cost of experimentation. In order to reduce these costs, an alternative is the use of computer simulation. We adopted this approach to obtain failure times related to physical memory exhaustion. The simulation model was created using the Arena 7.0 framework². Fifty simulation replications were carried out, generating a sample of 50 failure times to be used in the reliability study (Section 4). Aside from the parameters described in Section 2.2, it was necessary to model the software aging behavior, which required establishing a relationship between memory consumption (the aging effect) and workload. For that purpose, the data collected by sysmon in the bench tests were used. The data were collected at 60-second intervals; therefore, 480 values were collected during the eight-hour period. Based on the IAT adopted, a total of 900 requests were processed in each 60-second interval. Having established the relationship between the aging behavior (480 values) and ‘batches’ of 900 requests, an empirical distribution function (edf) was built to represent the evolution of the aging phenomenon. The load generation and the other variables of the simulation model followed the same parameters used in the bench tests. The result of the simulation model was the exhaustion-related failure times of fifty servers, considering 500 megabytes as the exhaustion threshold. Based on these lifetime data, the results of the web server reliability analysis are presented in the next section.

² Arena is a product of Systems Modeling Corporation.
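The Arena model is not reproduced here, but the simplified sketch below conveys the idea under stated assumptions: the per-minute memory increments observed in the bench test are treated as an empirical distribution and resampled until the 500-megabyte exhaustion threshold is crossed, once per simulated server. The input file name and the simple resampling scheme are illustrative, not the actual model logic.

```python
# Simplified Monte Carlo sketch of the aging simulation (not the Arena model).
import numpy as np

rng = np.random.default_rng(42)

# 480 per-minute memory totals (MB) of the httpd processes from the 8-hour test;
# the file name is an assumption.
usage = np.loadtxt("httpd_memory_8h.txt")
increments = np.diff(usage)          # observed growth per 60-second batch of ~900 requests
start_mb = usage[0]                  # ~66 MB at the beginning of the experiments
THRESHOLD_MB = 500.0                 # exhaustion threshold used in the study

failure_hours = []
for _ in range(50):                  # fifty simulated web servers
    mem, minutes = start_mb, 0
    while mem < THRESHOLD_MB:
        mem += rng.choice(increments)  # resample one per-minute increment
        minutes += 1
    failure_hours.append(minutes / 60.0)

print(f"mean time to exhaustion: {np.mean(failure_hours):.1f} h")
```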

4. Reliability analysis

According to Musa (1998), the quantification of software reliability must be defined with regard to the software’s execution time. Equation (1) presents reliability based on the probability of failure up to a certain instant of time.

$R(t) = 1 - F(t)$  (1)

where F(t) is the unreliability function (cdf). The average time to failure is obtained through Equation (2).

$MTTF = \int_{0}^{\infty} R(t)\,dt$  (2)

Another metric used in this study is the instantaneous failure rate, given by Equation (3).

$h(t) = \frac{f(t)}{R(t)}$  (3)

In all equations, the definition of the probability density function (pdf), represented by f(t), is fundamental. For that, a goodness-of-fit test was used. The test indicated the theoretical distribution that best represented the failure times obtained from the simulation. The distributions tested were the same as those used in Section 2.2. Among them, the three-parameter Weibull showed the best adherence to the data; its pdf is presented in Equation (4).

$f(T) = \frac{\beta}{\eta}\left(\frac{T-\gamma}{\eta}\right)^{\beta-1} e^{-\left(\frac{T-\gamma}{\eta}\right)^{\beta}}$  (4)

where f(T) ≥ 0, T ≥ γ, β > 0, η > 0, −∞ < γ < ∞ (η = scale, β = shape, γ = location). The method adopted for estimating the model parameters was maximum likelihood estimation (MLE). A description of the application of MLE to the Weibull distribution can be found in [Lloyd and Lipow 1962]. The values calculated through MLE were: β = 18.13, η = 3.89 and γ = 1,780.04. The high beta value indicates a high concentration of failures within a short period of time. Complementarily, the value of the gamma parameter shows that there is a guaranteed minimum lifetime of 1,780.04 hours.
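As an illustration of how such a model can be fitted and used, the sketch below assumes the 50 simulated failure times are available in a text file (the name is an assumption) and relies on scipy’s weibull_min, whose parameters map onto Equation (4) as c = β (shape), scale = η and loc = γ. It is a sketch of the procedure, not the original tooling, and the three-parameter MLE fit can be numerically delicate.

```python
# Sketch: fit a three-parameter Weibull to the simulated failure times and
# evaluate R(t), h(t) and the MTTF. The input file name is an assumption.
import numpy as np
from scipy import stats
from scipy.special import gamma as gamma_fn

times = np.loadtxt("failure_times_hours.txt")

# Maximum-likelihood estimation of shape (beta), location (gamma) and scale (eta).
beta, gamma_loc, eta = stats.weibull_min.fit(times)

dist = stats.weibull_min(beta, loc=gamma_loc, scale=eta)
R = dist.sf                                      # reliability R(t) = 1 - F(t), Eq. (1)
h = lambda t: dist.pdf(t) / dist.sf(t)           # instantaneous failure rate, Eq. (3)
mttf = gamma_loc + eta * gamma_fn(1 + 1 / beta)  # closed form of Eq. (2)

# Time by which reliability first drops to 99%, a candidate point for
# scheduling preventive maintenance (rejuvenation).
t99 = dist.ppf(0.01)
print(f"beta = {beta:.2f}, eta = {eta:.2f}, gamma = {gamma_loc:.2f}, MTTF = {mttf:.1f} h")
print(f"R({t99:.1f}) = {R(t99):.3f}, h({t99:.1f}) = {h(t99):.4f} per hour")
```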


Figure 3. R(t) with confidence bound of 95%

Figure 3 presents the reliability chart created from the fitted model. Starting at time 1,783.42, reliability decreases rapidly. It was also verified, with a 95% confidence bound, that the server has 99% reliability in the interval from 1,782.896 up to 1,783.247 hours. The same confidence bound also indicates a failure-free period from time 0 until 1,780.04 hours.

5. Conclusions

Experimental studies in the software aging field have proven to be costly, mainly due to the need for long uninterrupted software execution times. These long runs are necessary for the aging symptoms to manifest and become observable. Such costs are even higher in research involving software reliability analysis under aging, because obtaining lifetime data may require even longer executions. As an example, in the failure times obtained from the simulation results, failures started only after seventy-four days of uninterrupted execution. Therefore, the modeling and simulation of software aging, aiming at the reduction of experimentation time, is the most important contribution of this study. Special attention is given to the relationship established between workload and aging, modeled by means of an edf obtained from the data observed during the bench tests. This simulation-based approach allowed the failure times of the 50 servers to be obtained in just eleven hours, considerably reducing the data collection period compared with an approach without simulation. Based on the failure times, we verified that the three-parameter Weibull distribution is an adequate model for software reliability analysis, as it has long been used in reliability applications in other industrial fields. It is worth highlighting that the interpretation of the high beta and gamma values allows the analyst to conclude that there is a minimum lifetime for the web server studied, in this case 1,780.04 hours. Based on that information, preventive maintenance routines may be scheduled to avoid web server failures.

References

Avritzer, A. and Weyuker, E. J. (1997) “Monitoring Smoothly Degrading Systems for Increased Dependability”, Empirical Software Engineering Journal, vol. 2, nº 1, p. 59-77.

Huang, Y., Kintala, C., Kolettis, N. and Fulton, N. D. (1995) “Software Rejuvenation: Analysis, Module and Applications”, In: Proceedings of the 25th Symposium on Fault Tolerant Computer Systems, p. 381-390, USA.

Jain, R. (1991) “The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling”, John Wiley & Sons.

Lloyd, K. D. and Lipow, M. (1962) “Reliability: Management, Methods and Mathematics”, Prentice-Hall Space Technology, USA.

Lyu, M. R. (1996) “Handbook of Software Reliability Engineering”, McGraw-Hill, USA.

Mosberger, D. and Jin, T. (1998) “Httperf – A Tool for Measuring Web Server Performance”, In: First Workshop on Internet Server Performance, Madison, USA.

Musa, J. (1998) “Software Reliability Engineering”, McGraw-Hill, USA.

Trivedi, K. S., Vaidyanathan, K. and Popstojanova, K. G. (2000) “Modeling and Analysis of Software Aging and Rejuvenation”, In: Proceedings of the 33rd Annual Simulation Symposium, p. 270-279, IEEE Computer Society Press, USA.
