102
IEEE TRANSACTIONS ON RELIABILITY, VOL. 59, NO. 1, MARCH 2010
Accelerated Degradation Tests Applied to Software Aging Experiments Rivalino Matias, Jr., Pedro Alberto Barbetta, Kishor S. Trivedi, Fellow, IEEE, and Paulo J. Freitas Filho
Abstract—In the past ten years, the software aging phenomenon has been systematically researched, and recognized by both academic, and industry communities as an important obstacle to achieving dependable software systems. One of its main effects is the depletion of operating system resources, causing system performance degradation or crash/hang failures in running applications. When conducting experimental studies to evaluate the operational reliability of systems suffering from software aging, long periods of runtime are required to observe system failures. Focusing on this problem, we present a systematic approach to accelerate the software aging manifestation to reduce the experimentation time, and to estimate the lifetime distribution of the investigated system. First, we introduce the concept of “aging factor” that offers a fine control of the aging effects at the experimental level. The aging factors are estimated via sensitivity analyses based on the statistical design of experiments. Aging factors are then used together with the method of accelerated degradation test to estimate the lifetime distribution of the system under test at various stress levels. This approach requires us to estimate a relationship model between stress levels and aging degradation. Such models are called stress-accelerated aging relationships. Finally, the estimated relationship models enable us to estimate the lifetime distribution under use condition. The proposed approach is used in estimating the lifetime distribution of a web server with software aging symptoms. The main result is the reduction of the experimental time by a factor close to 685 in comparison with experiments executed without the use of our technique.
DOE GOF IPL ISP LSE MLE NFS OS K-S
design of experiments goodness-of-fit test inverse power law Internet service provider least squares parameter estimation method maximum likelihood parameter estimation method network file system computer operating system Kolmogorov-Smirnov test
SAA SUT
stress-accelerated aging system under test
NOTATION level of -significance proportion of tests to run at each ADT’s stress level Pearson’s linear correlation coefficient number of levels in a factorial design standard deviation of K-S’s statistic degradation path of unit at time degradation path critical value degrees of freedom for error
Index Terms—Accelerated degradation tests, design of experiments, software aging, web server software reliability.
ACRONYMS ADT ALT ANOVA CI
aging factor accelerated degradation test accelerated life test analysis of variance -confidence interval
Manuscript received August 05, 2008; revised April 08, 2009; accepted June 02, 2009. First published December 04, 2009; current version published March 03, 2010. This work was supported in part by the US National Science Foundation under grant NSFCNS-08-31325. Associate Editor J.-C. Lu. R. Matias, Jr. was with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA. He is now with the Computing School, Federal University of Uberlândia, Uberlândia, MG 38400-902 Brazil (e-mail: rivalino@ facom.ufu.br). P. A. Barbetta and P. J. F. Filho are with the Informatics and Statistics Department, Federal University of Santa Catarina, Florianópolis, SC 88040-900 Brazil (e-mail:
[email protected];
[email protected]). K. S. Trivedi is with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA (e-mail:
[email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TR.2009.2034292
-
0018-9529/$26.00 © 2010 IEEE
tolerated sampling error distribution function observed degradation measurement in unit at time mean value of the response variable in a factorial design number of factors in a factorial design characteristic life (e.g., mean life) the log-likelihood function normally distributed with mean zero, and standard deviation reliability function -coefficient of determination implies: statistical(ly) pilot sample variance effect’s standard error natural logarithm of times-to-failure
MATIAS et al.: ACCELERATED DEGRADATION TESTS APPLIED TO SOFTWARE AGING EXPERIMENTS
mean of the natural logarithms of the times-to-failure stress level
I. INTRODUCTION
S
OFTWARE aging can be defined as a growing degradation of a software’s internal state during its operational life [1]. The causes of software aging have been verified as the accumulated effect of faults activation [2] during the lifetime of software process [3], [4]. In this paper, the term software process means a computer program being executed in the computer’s physical memory. Due to the cumulative property of the software aging phenomenon, it occurs more intensively in continuously running processes that are executed over a long period of time. Studies [5] have already demonstrated that workload variability can also influence the aging manifestation. Problems such as data inconsistency, numerical errors, and exhaustion of operating system resources, among others, are examples of software aging consequences [6]. These problems often lead to progressive performance degradation, occasionally causing system lockout or system crash [7]. A difficulty in experimental studies on systems that fail due to software aging is the observation of failure times, which are usually required to assess reliability metrics. In this case, the nature of the phenomenon requires that the system run uninterrupted for a long period of time. Reported experimentation times utilized in software aging studies have covered 600 hours [8], 2160 hours [9], and 6480 hours [7]. Furthermore, the majority of software aging experiments found in the literature were terminated before the observation of system failures. The authors terminated the experiments early because the aim of many of these efforts was the identification and mitigation of aging effects, which does not necessarily require system observation until its failure. Depending on the data analysis techniques used, a certain minimal number of failures need to be observed, implying a substantial experimentation time. The problem worsens when the system under test is composed of highly reliable components, such as those developed for telecommunications, military, safety-critical, and long-life systems. The contribution of this paper is a systematic approach that reduces the experimentation time to estimate the times-to-failure for systems that fail due to software aging. Accelerated degradation tests (ADT) technique [10] is employed, together with the design of experiments (DOE) framework [11] to statistically plan the ADT experiments. The goal is to accelerate the manifestation of the software aging effects to minimize the experimentation time. To the best of our knowledge, ADT has not been applied to software experiments. This conclusion is based on an intensive literature search in fields such as experimental software engineering, software reliability, and system dependability, among others. Both techniques, ADT and DOE, have been successfully applied to several other areas [9]–[11]. Indeed, the necessity for reducing the time to obtain lifetime data in experimental software engineering is not any
103
different than in other areas, particularly when investigating software systems designed for high reliability. In several engineering fields, the use of accelerated life tests (ALT) [9], as well as ADT, is a practice increasingly adopted to significantly decrease the experimentation costs [9]. The most important differences when using ALT/ADT techniques applied to software systems in comparison to their utilization in other engineering fields are the acceleration methods, and stress variables definitions. The accelerated tests (ALT/ADT) theory [9] is based on physical laws, which are related to physical-chemical phenomenon (e.g., temperature, and humidity) associated with the systems/products analyzed. The non-existence of a physics law foundation would possibly justify the lack of ADT applications for software studies. This paper contributes in adapting the ADT technique to experimental software engineering studies focused on the software aging phenomenon. The rest of this paper is organized as follows. Section II gives a brief overview of previous work in software aging, and accelerated life tests applied to software engineering. Section III presents the core concepts of ADT, and the assumptions adopted in applying ADT to software aging experimentation. Section IV is dedicated to expounding our ADT-centric approach. The experimental study used to test our method is described in Section V. Finally, Section VI summarizes the contributions of this work. II. RELATED WORK In [7], a study of software aging in a telecommunication system switch considers a programmed restart of the switch for the restoration of its internal state after a certain uninterrupted execution period. Garg et al. [1] describe an experimental characterization of software aging via system monitoring, and the prediction of the time to exhaustion for each monitored resource. In a subsequent study [12], based on the same dataset used in [1], Vaidyanathan et al. found that a parameterized model, with both execution time, and workload, offered a higher accuracy when compared to those models based only on the execution time. One limitation of these studies is the difficulty of the mean time to failure (MTTF) estimation, as these studies estimate mean time to exhaustion of one resource at a time. The authors suggested carrying the study further to obtain a global prediction of the MTTF, which would take into consideration the combined aging effects among several resources. Shereshevsky et al. [4] use a different approach than [12], criticizing the linear models to estimate the aging tendency of the monitored resources. They point out that the approach is not adequate if a large increase in the -confidence intervals is observed, which occurs in periods when the studied variables show a non-linear behavior. To solve this problem, they proposed the use of fractal theory to estimate the time to resource exhaustion. Similar to our experimental work (see Section V), a study of software aging in a web server system is presented in [8]. Grottke et al. [8] used time series ARMA models to identify, characterize, and estimate aging effects in the Apache web server. The models were built from the operating system’s collected data during their controlled experiments. An issue not discussed in [8] is the frequency of updates of the models’
104
parameters to keep up the aging estimations updated with the changes in the monitored environment. As mentioned in Section I, there are not many publications that apply the ALT/ADT statistical methods to empirical software engineering studies. We now give a brief overview of three related papers that we found during an extensive literature review. In [13], the traditional accelerated life test method was applied for the software reliability assessment of a telecommunication network restoration system. They considered three different acceleration conditions that corresponded to 10, 79, and 130 times the usual rate of background processing in field conditions. Considering the possibility of the effect of one state carrying over to, and affecting the performance of, subsequent system states, they adopted a balanced design of experiments obtained from two 4 4 Latin squares. They considered the four states of interest, and eight replications of each state. All 32 treatments were tested for each acceleration condition in a single run, and the failures were analyzed through a Poisson regression model. The results indicated the log-linear relationship between the “mean number of failures,” and the “acceleration rate” as the best fit. Through this model, the mean number of failures at the no-acceleration setting (system’s use condition) was obtained. Further analysis considered the two-parameter Weibull distribution as compared to the log-linear model. In [14], accelerated stress testing (AST) was applied to both hardware, and software systems. The ALT was compared with AST. Given that the main purpose of [14] was to give the AST fundamentals, the ALT method was not fully explored. Following a different approach, [15] provided empirical evidence of software failure acceleration. Although they did not consider the ALT/ADT methods, we summarize their method of acceleration because some of the concepts are used in our approach. The failure acceleration approach discussed in [15] reduces the fault and error latencies, while increasing the probability of a fault causing a failure. Ideal failure acceleration is achieved when , , and , where is the probability of a fault causing a failure. The authors suggested two basic controls to achieve failure acceleration: the fault size, and the workload. In the former, a large fault can be injected, which increases the probability of software failure. The latter allows the decrease of error latency through increased system usage, which was also partially adopted in [13]. The experiments in [15] focused on obtaining life data of an NFS server based on two levels of workload (low, and high) in terms of server’s processor utilization, which ranged from 15% to 30% (low), and 25% to 50% (high). The results showed a substantial difference in the number of observed failures between low and high acceleration. The failure probability went up from 53% to 65%, validating the proposed approach. Chillarege et al. [15] emphasize the decrease in error propagation as a consequence of failure acceleration, which makes sense because the increase in acceleration should decrease error latency, consequently reducing the chance for errors to propagate. In line with [13], and [15], we also consider the workload control to reduce the time to failure. However, for many highly reliable systems, the failures due to software aging are difficult to achieve, even using a high usage rate. Therefore, our main effort is on the overstress acceleration. 
Both acceleration methods in our proposal will be
IEEE TRANSACTIONS ON RELIABILITY, VOL. 59, NO. 1, MARCH 2010
presented in Section IV-A. Next, we present the ADT fundamentals, together with the assumptions we considered, to apply the ADT method to software aging experiments. III. ACCELERATED DEGRADATION TESTS ADT extends the accelerated life tests (ALT) [9], and degradation test (DT) [10] techniques, which are used to obtain lifetime data quickly. The design of high-reliability systems implies failures are rare events, even during long periods of execution, and under high workload. For such highly reliable systems, an alternative approach to ALT and DT are accelerated degradation tests (ADT), which do not look for failure times, but instead for a degradation measure of a product’s performance characteristic taken over time, and under specific stress conditions [16]. In software aging research, the current difficulty to experimentally or empirically observe times to failure caused by software aging is similar to the highly reliable systems case mentioned above; hence our use of ADT techniques. Because ADT techniques were designed for physical systems, we had to establish a mapping between core concepts in software aging theory and ADT methods to make it applicable to software aging studies. Section IV describes our proposed approach. ADT involves overstress testing; and typical accelerating stresses include temperature, voltage, and thermal cycling. For many products, there are accelerating stresses documented in engineering standards tables, but for others there are no well-known standard stresses. Then, the experimenter needs to determine suitable accelerating stresses based on pilot experiments. To date, software engineering does not have standards related to accelerating stresses for ALT/ADT, so we adopt an experimental approach to determine the accelerating stresses (variables), and their respective acceleration levels (see Section IV-A). Besides the identification of accelerating stresses, it is also necessary to establish the stress loading that defines how to apply the stress to the system under test. Two possible stress-loading schemes are time-independent (constant stress), and time-dependent (varying stress) [17]. According to [9], the theories for the effect of varying stress on product life are in development, and mostly unverified. While [9] is at least 5 years old at the publication of our article, and there has been development since, we decided to employ constant stress in this paper for wider applicability. ADT has some advantages over ALT because performance degradation data can be analyzed sooner, even before any experimental units fail [9]. Also, performance degradation can yield better insights into the degradation process. However, such advantages can be achieved only if one has a suitable degradation model that establishes the relationship between the system degradation, and the accelerated stress variables [10]. According to [9], four common assumptions are adopted by the current degradation models: a) degradation is not reversible, and performance always gets monotonically worse; b) usually, the model applies to a single degradation process; and in case of simultaneous degradation processes, each one requires its own model; c) degradation of a unit’s performance before the test starts is insignificant; and d) performance is measured with negligible random error.
MATIAS et al.: ACCELERATED DEGRADATION TESTS APPLIED TO SOFTWARE AGING EXPERIMENTS
In [10], the degradation path of a particular unit over time , . A random sample of test units is denoted by . For each inare observed at pre-specified times spection, a performance measurement is registered for each test unit, and referred to as . The inspection times are not required to be the same for all units, or even equidistant. Consider as the th time measurement or inspection of the th unit. The observed degradation measurement in unit at time is repre; and at the end of the test, the degradation path sented by , for is registered as pairs . The observed sample degradation of unit at time is the unit’s actual degradation, plus measurement error, and is given by (1) is a vector of model parameters where is a residual dewhich has dimension , and viation of the th unit at time . The deterministic form of is usually based on empirical analysis of the degradation process under study. The vector corresponds to unknown effects which determine the degradation path of unit in the mea, or 4 paramsures. Typically, a path model will have eters [10]. Some of these parameters can vary from unit to unit, and others could be modeled as constant across all units [18]. It is reasonable to assume that the random effects of the vector are s-independent of the deviations. Also, it is assumed that deviations are i.i.d for , and . the are measured sequentially in time, Due to the fact that the there is, however, a potential for autocorrelation among the , values, especially when there are many closely spaced observations. As stated in [10], in many practical situations involving inference on the degradation of units, if the model fit is good, and if the testing & measurement processes are in control, the autocorrelation is typically weak. It is dominated by the unit-to-unit variability in the values, and thus autocorrelation can be ignored. In the general degradation path model, the proportion of failures at time is equivalent to the proportion of degradation paths that exceed the critical level at time . Thus, it is possible to define the distribution of time-to-failure from (1) as (2) For fixed , the distribution of depends on the distribution . can be expressed in closed form for simple of degradation models. For complex models, specially when is nonlinear, and more than one of parameters is random, it is with numerical methods. necessary to evaluate IV. OUR APPROACH In [5], the software aging phenomenon is defined as the continuous, increasing deterioration of the process’ internal state, or as the degradation of system resources. In the former case, because the phenomenon is confined to the process memory (internal state), measuring its progress is difficult. For example,
105
consider the accumulation of round-off errors in a global numeric variable inside a process memory image. In this case, the monitoring of aging should be possible if the program code is instrumented. When this cannot be easily implemented (e.g., software system is composed of closed third-party software components), the monitoring of aging alternatively could be possible through the individual process’ performance measures, and resource consumption observable externally. For the later case of degradation system resources, we can follow the software aging evolution through the monitoring of operating system resources. Our approach can be applied to any scenario where significant measures of the process/system degradation are observable. We the progress of the aging efconsider as degradation path fects on the system under test (SUT). This systematic approach can be divided into four main steps: 1) selection of accelerating stress, 2) ADT planning and execution, 3) definition of the stress-accelerated aging relationship, and 4) estimation of underlying life distribution for the use condition. The following sections will discuss each step in detail. A. Accelerating Stress (Aging Factor) The main differences between applying ALT/ADT techniques to software aging experiments in comparison with other research fields (e.g., materials science) are the definition of degradation mechanisms, and how to accelerate such mechanisms. In other areas, the degradation mechanisms (e.g., wear) are usually related to physical-chemical properties of the SUT [9], which are used for failure/degradation acceleration purpose. In the case of software aging, the degradation mechanisms can be understood as the degenerative effects caused by the activation of software faults related to software aging. In [19], this specific class of faults is called aging-related faults. When these faults are activated, they cause errors that result in the degradation (aging) of operating system resources or process’ performance. For example, a fault in a process’ memory management routine can cause an error when freeing memory pages that result in a memory leak. The recurrent activation of this fault has as a consequence the degradation of the OS virtual memory. Hence, we consider the concept of accelerating stress, from an ALT/ADT theory perspective, as being those operation patterns that activate the aging-related faults. In our approach, we will refer to an activation pattern as an . We also posit that, in systems that display aging factor software aging effects, it is possible to accelerate the degradation by use-rate, and by overstressing. The first can be achieved by increasing the frequency of the system usage to reduce the latency. A similar technique was adopted in [13], and [15]. In some cases, increasing the usage rate is not sufficient due to has in relation the low probability of the occurrence that the to the remaining operations from the SUT operational profile use-rate with re[20]. Hence, in addition to increasing the spect to the SUT use condition, we also consider the accelerated should be degradation by overstressing. We believe that the defined to allow different levels of influence in the SUT degracould also be comdation acceleration. If it is necessary, the bined with other secondary factors (e.g., environmental factors) to provide multiples stress levels. Individually or combined, the control of the frequency (use-rate), and intensity (stress level)
106
of the is achieved through the system workload. Therefore, as one of the synthetic workload parameters we consider the used during the ADT. Unlike the other areas that usually are based on physical laws from a for the accelerating stress definition, we select the sensitivity analysis of the aging phenomenon with respect to workload parameters at use condition. Thus, we consider the as the workload parameter that contributes the most to the increase in the SUT aging effects. When more than one parameter, individually or through interactions, have significant influwill be their combination. In a comence over aging, the bined form, the can be seen as the operational mode [20] that causes greater influence on the system’s aging effects. Using this approach, we are able to maximize the aging accelcontrol inside the workload. This coneration through the trol is important because, in many practical situations, a high workload does not guarantee the aging acceleration, because the operational mode used could not create the necessary conditions for the activation of aging-related faults. For this reason, the workload characterization, and the sensitivity analysis of its parameters on the aging effects, are fundamental for the correct selection. In [21], several techniques are presented to support the workload characterization, such as principal component analysis, multi-parameter histograms, clustering analysis, and Markov models. Each one of these techniques deals with specific requirements, and based on the experimenter’s objectives they are appropriately selected. Given the chosen method of workload characterization, we selecthen use the statistical design of experiment for the a measure tion. We consider as the DOE response variable that indicates the level of system aging during each run execution. In [11], Montgomery suggests several experimentation strategies such as best-guess, one-factor-at-a-time, and factorial. The correct approach in dealing with several factors is to conduct a factorial design because all factors are varied together instead of one at a time, this being an efficient method to study the effects of two or more factors on the response variable [11]. This identification. Among its varistrategy is thus chosen for the ants (two-factor, fractional, , mixed-levels, etc.) we use the factorial design, as it is particularly useful in factor screening experiments [11], and thus is suited for our approach that is fo. cused on the aging characterization to identify the In terms of workload specification, for each level of the factorial design, we selected their values to implement two load patterns: regular, and high. Both levels are taken with respect to the SUT total capacity, where the level regular causes a load of 50%, and the level high causes a load of 90% of the SUT nominal capacity. For the setup of treatment combinations, and sequence of runs, we adopted the signal matrix method [21], which was arranged according to the Yates’ order [11]. The computing of each factor’s effects on is obtained through solving the signal matrix following Yates’ algorithm [11]. As a result, we have a ranking of individual, and combined factors that are sorted by their influence degree on . Finally, supported by this ranking, we can choose the individual factors or their combinations that . better represents the
IEEE TRANSACTIONS ON RELIABILITY, VOL. 59, NO. 1, MARCH 2010
B. ADT Planning
, we have to plan the ADT. This activity After selecting the involves, initially, the specification of the following elements: number of stress levels, the amount of stress applied at each level, the allocation proportion in each level, and the sample size. The number of stress levels must be balanced while considering the test objectives, and restrictions. In ALT, at least two accelerating stress levels are mandatory to obtain a stress-life relationship [17], which is also required for the ADT. The more levels we have, the better off we are in correctly fitting the model to the data set. The experimenter should decide how many stress levels to adopt based on time, and resource constraints. For example, [13] adopted three different acceleration levels that corresponded to 10, 100, and 200 times the actual rate of the acceleration factor in field conditions. use-rate for each stress In contrast with [13], we adopt an level, because we believe that the use-rate is dependent on the system capacity, and not on the use condition. Thus, we aim to avoid defining workloads that exceed the maximum system capacity, which could cause side effects such as failure or degradation modes unrealistic in practical situations, and hence undesirable in the study [9], [10]. The amount of stress applied in each level should not exceed the design limits of the SUT. Usually, these levels are out of the specification limits, but within the design limits [17]. For many cases, exceeding the design limits can cause undesired failure modes. The allocation proportion, , is the division of the total number of test units among the stress levels. In the traditional ADT implementation, the sample size is the number of units being tested. Because we are dealing with software components, the sample size in our approach is the number of replications the ADT is executed on. In [9], and [10], algorithms are provided to calculate the sample size for ALT/ADT experiments. The specification of the abovementioned four elements follows one of the three most commonly used test plans: traditional, optimal, and compromise plans. The traditional plans usually consist of three or four levels, equispaced, with the same number of replications (test units) allocated per level. The op, and low timal plans specify only two levels of stress: high . Meeker & Escobar [10] stated that the value should be value, the maximal allowed within the design limits; and the and its allocation proportion , should be selected to minimize the variance of the estimators of interest. Nelson [9] suggests that this allocation be based on the fraction that minimizes the variance of the estimator at the use level of the stress variable. Assuming a sample size , the number of allocated replications is the closest integer value of , where the remaining in . The compromise plans usually tests are then allocated to work with three or four stress levels, non-equispaced, and use an unequal allocation proportion. An example of a well-known compromise plan is the Meeker-Hahn plan [9], which considers three stress levels, and follows an allocation proportion rule of , 4:2:1. This allocation specifies, for a sample of units, , and test units allocated, respectively, to , , and , where is an intermediate stress level. In general, the
MATIAS et al.: ACCELERATED DEGRADATION TESTS APPLIED TO SOFTWARE AGING EXPERIMENTS
compromise plans adopt a value of based on practical asis equal to pects of the SUT, mainly the design limits. The , assuring that the levels are equidistant. Thus, the value has to be specified to calculate the , which according to [9] should be chosen taking into account the required accuracy for the estimates studied, at use level. A more detailed description of the three plans can be found in [9], and [10]. In addition to the aforementioned four elements, another important quantity to be specified in ADT planning is the threshold. This value depends on the specific characteristics of the SUT, as well as on the experimenter’s objectives. The instrumentation adopted to measure the degradation evolution usually is the same used to obtain the values of the until DOE’s response variable described in Section IV-A. The differis taken in several repetitions ence is that the inspection of during the ADT, and not just at the end of the test. The number of these measurements should be separated of repetitions for in time to minimize possible autocorrelations among values [10]. In Section V, we present an experimental study where real values are assigned to each of these parameters. C. Stress-Accelerated Aging Relationship In ADT, the samples of failure or pseudo-failure times are obtained in use-rates, and accelerated stress conditions that differ from the system’s use condition. Thus, we need to establish a model that relates the system’s degradation observed in the evaluated stress levels to estimate a proper underlying lifetime distribution for the SUT’s use condition. This model is called the stress-accelerated degradation relationship [9], which in our case translates into the term stress-accelerated aging (SAA). Such a model is usually based on traditional ALT models, such as Arrhenius, Eyiring, Inverse Power, Coffin Manson, etc. [9]. Due to the lack of equivalent models established for ADT applied to software experiments, we studied the models currently used in other areas [9], [10], [18]. We adopted the Inverse Power Law (IPL) as a model applicable to any type of positive stress, unlike the others that apply to specific types of stress variables (e.g., Arrhenius for temperature). The IPL model is presented in detail during the experimental study developed in Section V-D. In situations where the IPL is found to be inadequate, an alternative empirical model may be used. Based on the established SAA model, the next step is to estimate the underlying life distribution for each stress level, as well as for the use condition. D. Lifetime Distribution Estimation Once the SAA is established, the next step is to estimate the underlying lifetime distribution for each stress level, and then to use them to estimate the for the use condition. First, we need a sample of failure times or pseudo-failure times for each stress level. For those degradation paths whose failure times within the test period, the failure are observed times sample is taken directly. Otherwise, we use the accelerated degradation data set from each degradation path to establish , and then to estimate an accelerated degradation model, model can be (but need not be) pseudo-failure times. The
107
Fig. 1. Test bed.
the same for each degradation path. In ADT, steps to estimate the lifetime distribution, called the approximation method [10], [16], are as follows. for 1) For the chosen stress levels, fit the model each unit . The model effects are considered as fixed for each unit, and random across them. for the unit by 2) Estimate the vector means of the least-squares method. for , and call the solu3) Solve the equation tion . 4) Repeat the procedure for each sample’s path to obtain the for that stress level. pseudo-failure times 5) To the samples of failure or pseudo-failure times, apply the for usual lifetime data analysis [10] to determine the each stress level. Through the SAA relationship previously established, and the estimated for each stress level, we obtain the for the use condition. Several dependability metrics can be estimated for for the system under test, once we have estimated the for the its use condition level. In addition, based on the use condition, software rejuvenation [12] mechanisms can be applied proactively to prevent system failure due to software aging effects. For example, [22] discusses algorithms to obtain the optimal software rejuvenation schedule based on the closed . form for V. EXPERIMENTAL STUDY The approach described in Section IV was employed on an experimental study focused on the life distribution estimation for a web server system that suffers software aging. The web server software chosen was Apache [23] for which previous studies (e.g., [8]) have verified the presence of software aging. Another reason to use Apache is that it is the most commonly used web server to date [24]; this will facilitate the repetition of our experiments, as well as the applicability of our results. A. Test Environment Three computers were used, two of them as traffic generators (CPU: Celeron 1.2 GHz, RAM: 512 megabytes, NIC: 100 Mbps), and the third hosting the web server (CPU: P4 2.8 GHz, RAM: 512 megabytes, NIC: 100 Mbps). The interconnection was via an Ethernet switch (Fig. 1). A program called sysmon was built to monitor the web server’s environment. As shown in Fig. 1, sysmon carries out two basic types of monitoring. The first refers to the resources of the operating system, and the second collects specific data on Apache’s daemon processes , such as physical, and shared memory usage. In both cases, the data are collected from several files in the Linux /proc directory (e.g., /proc/meminfo).
108
The [25], and [23] tools were used as workload was used to validate the results obtained generators. The . We used Apache version 2.0.46 (compiled as from MPM-prefork), and defined the number of processes to be 300 for the characterization phase (see Section V-B), and ADT processes were iniexperiments (see Section V-C). All tiated at the beginning of each run, and they remained active until the test conclusion. Based on this environment, the next sections show the application of the proposed approach. B. Accelerating Stress Selection The first stage of our approach is the accelerating stress selection. For this stage, we need to plan the DOE to characterize the aging effects, and to analyze the sensitivity of these effects in relation to experimental variables. As discussed in [8], the aging effect on the Apache processes is the increase in the server’s main memory consumption, which could cause a system failure due to resource exhaustion. Therefore, the DOE chosen is the total resident set size of the response variable processes. The DOE factors, and their respective levels were partially based on [8]. As a result, we employed the following factors: page size, page type, and request rate (the number of HTTP , , and requests per second), referred to as , respectively. Two levels were defined for each factor, full factorial design with replications [11]. resulting in a Note that the factor represents the use-rate that will frequency during the ADT. be used for accelerating the These levels (regular, and high) were chosen as 50%, and 90% of the web server’s capacity, respectively. To choose the value associated with the level regular of the , we examined the usage of 6,000 web pages of a factor selected ISP. The ten most visited pages over a period of three months were selected, yielding an average page size of 196 kilobytes. We compared this value to other page sizes from several Internet web sites, selected in an ad hoc manner, and found the value to be adequate. For the upper level (high) of this factor, we chose a representative size of those cases in which the web server transfers data objects that are not all HTML pages. Note that it is a common practice to distribute large binary data objects (e.g., software updates) through web servers instead of FTP repositories. To define a value that would represent these cases, we used the file size of antivirus updates. We collected three samples that averaged 2 megabytes. Therefore, this value was . assumed to represent the upper level of the factor , refers to the mode in which the The second factor, content of pages is available: static, or dynamic. For its regular level, we used the static content, while for its high level we used dynamic content. The generation of dynamic content was achieved through a program built to simulate database accesses, and to create HTML pages according to the sizes defined by the factor. For the factor , we previously described conducted preliminary load tests to obtain the maximal capacity of the web server for each factor and level combination (treatment). The performance metric used was the reply rate, known to be adequate to evaluate the capacity of web servers [17]. , Fig. 2 shows the results for the combination ( ). Based on the maximum capacity found, and
IEEE TRANSACTIONS ON RELIABILITY, VOL. 59, NO. 1, MARCH 2010
Fig. 2. Load test for T1 (pgSize = 196 kB; pgT ype = static).
TABLE I FACTOR AND LEVEL VALUES
the levels (regular, and high) were derived for this specific treatment (T1). The same procedure was used for all other treatments are shown in (T2 to T8), and their respective values of Table I. Table I summarizes the experimental plan adopted as part of selection step. These values are specific for the studied the test environment, though the procedure used to determine them is sufficiently general to be applied to any other web server software. In Table I, is the treatment index (factors or interactions in each level) that will be evaluated. The number of treatments for a full factorial design is given by
(3) The total number of runs (treatments execution) is then , where is the number of replications per treatment [11]. The value of is determined based on hypothesis tests chosen to detect differences among the means of response variables’ estimates, for every treatment, in comparison with a given reference value [11]. As our approach uses the DOE’s results to identify the factors that most influence the response variable, and thus upon the effects caused by each factor on the response variable, we calculate the value of taking into account the required -confidence intervals assigned to the effect’s estimates. We use the effects’ standard error equation provided in [26]: (4) where is the number of replications, and is the population variance that is approximated by the pilot sample . Therefore, given the effect’s -confidence interval by (5) is the estimated effect of a factor or interwhere actions in the signal matrix [11], [21], and the given tolerated
MATIAS et al.: ACCELERATED DEGRADATION TESTS APPLIED TO SOFTWARE AGING EXPERIMENTS
109
TABLE II ANOVA RESULTS
Fig. 3. Apache memory consumption for T1 to T8.
sampling error in [26], we get (6) by fixing a priori the with a -confidence interval of . value of
TABLE III SIGNAL MATRIX FOR COMPUTING THE EFFECTS
(6) where is the tabulated value for the Student’s -distribution at a given -significance level , and degrees of freedom for error , with equal to the size of the pilot being provided by the experimenter in the same sample, and unit of the DOE’s response variable. Using a 95% -confidence , the explicit value is level, (7) The pilot sample necessary to calculate was generated by performing a single treatment, replicated twenty times, with all factors assigned to the means of their regular, and high level values provided in Table I. Based on this sample, the number of replications was calculated as
assuming a tolerated sampling error of 500 bytes, and a -significance level . Hence, we replicated each run 21 times, where each run performed 5,000 requests, and the response variable measurements were taken every 100 requests. processes In Fig. 3, observe that for all treatments the had an initial period of size increase within the first lot of 500 requests. This initial size increase is normal related to the initial dynamic allocation, and thread creation. After this transient period, for the first four treatments, the processes did not show any sign of aging in terms of memory increase. On the other hand, for the treatments T5 to T8, we see a monotonic increase in the processes’ memory size after the initial period lasting until the end of the runs. For T5, and T6, the growth was approximately 17, and 28 megabytes, respectively; and for T7, and T8 were observed 31, and 30 megabytes, respectively. These results show no evidence of aging manifestation for treatments T1 to T4, while for treatments T5 to T8 aging effects on the Apache processes memory size can be clearly observed. For a long uninterrupted runtime, this incremental degradation of the OS physical memory will cause the failure of the web
server system due to resource exhaustion, because the physical memory availability is critical to maintain the system in a healthy operational state. The result of the runs provided evidence for the existence of an activation pattern for the Apache aging-related faults in the last four treatments. As described in Section IV-A, to evaluate which factors are part of the aging activation pattern, the next step is to conduct a quantitative analysis of the influence of each factor on the aging of the SUT. We carry out this analysis in two are conducted steps. First, an ANOVA, and test to identify the factors with -significant effects on aging, which will be considered as candidate(s) for . Next, the influence of each candidate factor, and their combinations (interaction) on aging is quantitatively evaluated through solving a signal matrix. As a result, a rank with the most influential factors is built to support the selection of the factor(s) that will be considered . Note that, in our experimental study, the ANOVA, and as test would not be necessary considering that the first four treatments did not show any variability. Despite this special case, we include both results (Table II) to illustrate every step of our approach. Analysing the test’s values, we conclude that the -signifi, , and their interaccant effects are from factors tion, because the computed statistics exceed the critical value . This result is confor the distribution sistent with the outcome obtained when computing the signal matrix presented in Table III. Table III presents the signal matrix used to quantify the influence of each factor on the response variable. The values 1, and 1 represent the regular, and high levels, respectively. The rows of column contain the mean values of the response variable, expressed in kilobytes, which were calculated from the observed
110
IEEE TRANSACTIONS ON RELIABILITY, VOL. 59, NO. 1, MARCH 2010
TABLE IV SIGNAL MATRIX SOLVED
measurements in each treatment’s replication. The signal matrix was computed, and its results shown in Table IV. The algorithms for this computation can be found in [11], [21], and [26]. The second row of Table IV shows the values of each factor’s effect on . The percentage shown in the fourth row is the contribution of each factor or interactions (e.g., sum of squares for factor A) divided by the total variation (SST, total sum of squares). Based on these numerical results, we conclude that is the factor with the most influence (89.20%) among all those analyzed. Its level high (dynamic) was the one that individumost influenced the Apache aging. The factor ally, and in combination with , influenced the aging to a and , small degree. The sum of variations due to plus the influence of their interactions, explains 99.94% of the processes. variation in the aging of the
TABLE V ADT PLAN
C. ADT Experimental Plan Execution For the ADT execution, we used as the combination of the , and factors. As demonstrated in Section V-B, had influence over the aging intensity when comthe factor (at level high). Through the ADT bined with the execution, we aim to estimate the SUT life distribution under use condition. Therefore, it was necessary to define the use-rate for the SUT in its use condition. To obtain a reference value, we used the log files of the same web server mentioned in Section V-B. These log files refer to three months of the web server’s operation. From this sample, we filtered all requests to dynamic pages with sizes between 178 and 208 kilobytes, this range being defined from the ten most accessed pages from this sample. We found 49,770 requests that fitted this pattern . Based on this filtered sample, we obtained the inter-arfor the web server under use conrival times distribution of ditions. The best fit was obtained by a lognormal model with , and based on the Kolmogorov-Smirnov test . Based on the estimated lognormal request rate equal to 0.11 parameters, we considered the . Hence, the workload used to represent the SUT req/s use condition with respect to was , , and . We defined three stress levels for the ADT plan: S1, S2, S3. These stress levels were equivalent to 2, 3, and 4 times the page size defined for the web server under use condition. The use-rate for each stress level was the maximum capacity supported in each stress level. As verified in Section V-B, the inhad on the aging was fluence that the factor close to zero. This result allowed us to plan the ADT using use-rate supported by the SUT without any the maximum unwanted influence on the results. Thus, we optimize the execution of each degradation path. The maximum rate for each
Fig. 4. Degradation paths for each stress level.
stress level was obtained from load tests as done in Section V-B. The value was chosen based on preliminary aging experiments conducted on a test bed configuration equivalent to the abovementioned ISP’s web server (Section V-B), which had 512 megabytes of physical memory. These preliminary aging tests showed that, when the total amount of memory used by Apache processes crossed 410 megabytes, the Linux kernel executed its value equal paging routines intensively. Thus, we chose the to 400 megabytes. For each stress level, we processed nearly based requests. This value (49,770 rounded up) rep50,000 traffic in the reference ISP. For each resents three months of 100 requests, we made one inspection to measure the resident processes, and we used the memory size allocated to all term “cycle” to represent a batch of 100 requests. ADT planning, as also required in the DOE phase, demands the selection of a sample size that represents the number of units tested. A more appropriate interpretation for this value, when applied to software systems, would be the number of replications of the ADT. Thus, to calculate the number of ADT replications, we applied the algorithms proposed in [9]. As a result, we obtained thirty-six replications for our ADT experiment, which meant twelve tests for every stress level. Table V shows a summary of the ADT plan used in our experimental study. D. ADT Results Fig. 4 presents the degradation paths obtained for each stress level. We specified the duration time (abscissa axis) in number of cycles that allowed the comparison of the results among the
MATIAS et al.: ACCELERATED DEGRADATION TESTS APPLIED TO SOFTWARE AGING EXPERIMENTS
111
TABLE VI ACCELERATED AGING MODELS (AVERAGE VALUES)
TABLE VII MODELS FOR THE AGING ACCELERATED FAILURE TIMES
stress levels, because the real time is different from one level to the other due to the different use-rates. The degradation paths for S1 and S2 did not reach the threshold during the test period. Hence, we calculated pseudofailure times for each degradation path obtained from S1 and S2 according to the steps described in Section IV-D. For both stress levels, several regression models were tested, and the logarithmic regression showed the best fit, which was obtained by the least squares method (LSE). We fitted the models using the second half of the sample (from cycle 250 onwards) to avoid the instability present during the initializing period. Through each one of the fitted models, we obtained the pseudo-failure time for each degradation path . Table VI shows the models adjusted for the average values of stress levels S1, and S2. From the samples of failure and pseudo-failure times, we obtained the relevant probability distributions. The criteria used to , build the best-fit ranking were the log-likelihood function , whose paand the Pearson’s linear correlation coefficient rameter estimation methods were MLE, and LSE, respectively. The Kolmogorov-Smirnov test (K-S) was not used due to the small sample size. The goodness-of-fit test results for the first three ranked models are shown in Table VII. The lognormal probability distribution showed the best fit for the three data sets as demonstrated by the numerical assessment ( , and ), which corroborates the graphical analysis through the probability plot in Fig. 5. Table VIII presents the lognormal parameters estimated for each stress level’s data set. As can be seen, the ADT assumption [10] of equal standard deviations for the estimated distribution model in each stress level was partially satisfied. We verify that the sigma values for S1, and S2 are within the intersection region of their estimated -confidence intervals. The same occurred with these values for S2, and S3 in relation to their estimated -confidence intervals’ intersection region. The intersection region between S1 and S2 is wider than that between S2 and S3. We analyzed the raw data measured during the accelerated tests executed for S3, and identified that, , the when the size of the Apache processes came close to OS started its virtual memory paging mechanism. As expected,
Fig. 5. Lognormal multiply probability plot with lognormal MLE fits for each stress level.
TABLE VIII LOGNORMAL PARAMETERS FOR EACH STRESS LEVEL
The log likelihood was Lk = 80:48. The confidence intervals were calculated based on the normal-approximation method.
the memory paging influenced the processes’ resident set size, and consequently the measurements taken for the last values. Thus, we adjusted the sigma value to 0.09 to keep it within the confidence intervals of S1 and S2, and as close as possible to S3’s -confidence interval. The next step was the estimation of the stress-accelerated degradation model to obtain the lifetime distribution of the system in its use condition. As described in Section IV-C, we adopted the IPL relationship to model the stress-accelerated aging relationship, shown in (8). (8) where , and are model parameters to be determined. The life distribution with the best fit for all levels was the lognormal whose the probability density function (pdf) is
(9) where is times-to-failure. The life characteristic for the lognormal distribution is its median value given by (10)
112
IEEE TRANSACTIONS ON RELIABILITY, VOL. 59, NO. 1, MARCH 2010
TABLE IX IPL-LOGNORMAL (SAA) PARAMETERS
Fig. 7. (a) Use condition level lognormal probability plot; (b) lognormal probability plot of the standardized residuals from the IPL-lognormal model fitted to the software aging data set.
Fig. 6. (a) The IPL-lognormal model’s ML estimates and 90% s-confidence intervals for F (t) at 196 kB; (b) scatter plot showing cycles to failure versus stress, and the IPL-lognormal model fitted to the web server aging data.
The pdf for the IPL-lognormal model can be obtained first by in (8). Then setting (11) Therefore, (12) Substituting (12) into (9) yields the IPL-lognormal pdf as (13) The estimated IPL-lognormal parameters are listed in Table IX. Afterwards, we used (13) to estimate the underlying life distribution for the use condition. Fig. 6(a) presents the fit of the IPL-lognormal model estimated for the three stress levels, and for the use condition. Fig. 6(b) shows the stress-life relationship projected for all stress levels with the abscissas values beginning at the use condition level. The ordinates values correspond to the time-to-failure in batches of 100 HTTP requests containing pattern. The intersection point between the y-axis, and the the line obtained by linearizing the IPL-lognormal model, is the MTTF estimated for the use condition level. The probability plot for the use condition level (Fig. 7(a)), and the standardized residuals plot (Fig. 7(b)) confirm the good fit of the estimated stress-accelerated aging model. From this model, the MTTF estimated for the web server at use condition was 289,000 cycles (90% CI: 262,000–318,770). requests of type, Assuming the relation use-rate equals 0.11 req/s, then the estimated avand an erage time to failure caused by memory exhaustion
Fig. 8. Reliability function for the use condition level.
is approximately 100 months. Fig. 8 shows the plot of the reliability function for the use condition, which was obtained through the IPL-lognormal model. request’s This MTTF value was calculated considering an inter-arrival time interval of nine seconds. This large interval was due to the small size of the ISP used as reference, which has a relatively low inbound rate of web traffic. However, if we consider a lower interval of say 0.10 seconds (10 requests of type per second), then the MTTF would be about 33 days. In addition, we performed a sensitivity analysis of the rerequests, liability function with respect to the number of and the stress level . We observed that, when the size of dynamically generated pages increases, or its request rate increases, the MTTF decreases. These relationships are captured through the stress-accelerated aging relationship (IPL-lognormal) described above. Fig. 9 presents these results graphically. Several reliability metrics can be obtained from the IPL-lognormal model estimated for the use condition level. For
MATIAS et al.: ACCELERATED DEGRADATION TESTS APPLIED TO SOFTWARE AGING EXPERIMENTS
Fig. 9. 3D graphical representation of the IPL-lognormal reliability function plotted against time and stress (A ). TABLE X EXPERIMENTATION TIME
Several reliability metrics can be obtained from the IPL-lognormal model estimated for the use-condition level. For example, in addition to the MTTF described previously, we also calculated the total number of aging-factor requests that this web server can process with a success probability of 99.999%, which resulted in a mean value of 19,378,000 requests (90% CI: 17,110,000–21,947,000). Based on these results, proactive fault-tolerance mechanisms (e.g., software rejuvenation) may be scheduled to increase the system dependability. Finally, we compared the time required to implement the proposed approach to the estimated time to observe the SUT's failure. Table X shows the amount of time spent on each experiment executed during this study, as described in Section V. The total time for all experimental phases was 105.2 hours. Assuming the MTTF of 100 months estimated for the use-condition level (without acceleration), the proposed approach reduces the time required to observe one failure by a factor of approximately 685. If we consider a sample of thirty-six failure times, as obtained in the proposed approach, then this reduction factor should be multiplied by the sample size. Even if we assume the aging factor's request rate maximized to 5.2 req/s (90% of server load; see Table I), we still obtain a reduction factor of approximately 14.67 (289,000 × 100 / 5.2 / 3,600 / 105.2), which, multiplied by the sample size (36 in this study), yields a total reduction by a factor of 528.12.
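As a quick check of the arithmetic above, the following sketch reproduces the 14.67 and ~528 reduction factors from the quantities quoted in the text; it is an illustration added here, not part of the original study.

```python
# Reproduce the reduction-factor arithmetic quoted above:
# 289,000 cycles x 100 requests at the maximized aging-factor rate of 5.2 req/s,
# compared against 105.2 h of total experimentation time and 36 runs.
MTTF_REQUESTS = 289_000 * 100        # aging-factor requests until failure (use condition)
MAX_RATE_REQ_S = 5.2                 # 90% of server load (Table I)
EXPERIMENT_HOURS = 105.2
RUNS = 36

hours_per_failure = MTTF_REQUESTS / MAX_RATE_REQ_S / 3_600.0   # ~1,543.8 h to observe one failure
per_failure_reduction = hours_per_failure / EXPERIMENT_HOURS   # ~14.67

print(f"reduction per failure: {per_failure_reduction:.2f}")
print(f"total reduction (x{RUNS} runs): {per_failure_reduction * RUNS:.1f}")  # ~528
```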
VI. CONCLUSION

We presented a theoretical and experimental study on the acceleration of software aging effects. Among the contributions of this paper, we highlight two that we consider most relevant to the experimental software engineering literature. The first is the introduction of the aging-factor concept. This approach, oriented to the factors with the most influence on the aging effects, allows software aging to be characterized from the identification of aging-related fault activation patterns. These patterns can be individual operations of the operational profile, as well as the system's operational modes. Hence, when appropriately applied, this approach allows fine control of the software aging rate through the handling of the aging factors in the workload, or even through environmental variables. In addition, the results provide insights regarding the location of aging-related faults. For example, the results of our experimental study indicate that the aging-related faults in the web-server version investigated are located in one or more modules (e.g., mod_cgi, mod_php, etc.) executed during the handling of requests addressed to dynamic pages (the aging-factor pattern). Also, the quantitative results can support development and maintenance activities in focusing their efforts to increase the software's operational reliability.

The second contribution is the aging acceleration. As mentioned in Section I, experiments in software aging require long periods of time to obtain a reasonable sample of failure times. In this context, our systematic approach showed encouraging results: the execution of all the experiments reported in this paper took approximately 105.2 hours in total, whereas the experimentation times in our survey of previous research were all higher. We point out that for each stress level we executed 12 replications, resulting in 36 runs, in contrast to the preceding studies, the majority of which were based on only one run. A natural extension of this research is the assessment and development of other stress-degradation relationship models suitable for accelerated aging data sets.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers, the Associate Editor, and the Managing Editor for their constructive comments and suggestions.
Rivalino Matias, Jr. received his B.S. (1994) in informatics from the Minas Gerais State University, Brazil. He earned his M.S. (1997) and Ph.D. (2006) degrees in computer science, and in industrial and systems engineering, from the Federal University of Santa Catarina, Brazil, respectively. In 2008 he was with the Department
of Electrical and Computer Engineering at Duke University, Durham, NC, working as a research associate under the supervision of Dr. Kishor Trivedi. He also works with IBM Research Triangle Park on research related to embedded-system availability and reliability analytical modeling. He is currently an Associate Professor in the Computing School at the Federal University of Uberlândia, Brazil. Dr. Matias has served as a reviewer for IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, JOURNAL OF SYSTEMS AND SOFTWARE, and several international conferences. His research interests include reliability engineering applied to computing systems, software aging theory, dependability analytical modeling, and diagnosis protocols for computing systems.
Pedro Alberto Barbetta is an Associate Professor in the Informatics and Statistics Department at the Federal University of Santa Catarina, Brazil. He received his Ph.D. (1998) degree in Industrial Engineering from the same university. He is the author of two statistics textbooks. Dr. Barbetta is a member of the Brazilian Statistical Association. His areas of interest are design of experiments and multivariate data analysis.
Kishor S. Trivedi (M'86–SM'87–F'92) holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke University, Durham, NC. He has been on the Duke faculty since 1975. He is the author of a well-known text entitled Probability and Statistics with Reliability, Queuing and Computer Science Applications, published by Prentice-Hall; a thoroughly revised second edition (including its Indian edition) of this book has been published by John Wiley. He has also published two other books, Performance and Reliability Analysis of Computer Systems, published by Kluwer Academic Publishers, and Queueing Networks and Markov Chains, published by John Wiley. He is a Fellow of the Institute of Electrical and Electronics Engineers, and a Golden Core Member of the IEEE Computer Society. He has published over 420 articles, and has supervised 42 Ph.D. dissertations. He is on the editorial boards of IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, JOURNAL OF RISK AND RELIABILITY, INTERNATIONAL JOURNAL OF PERFORMABILITY ENGINEERING, and INTERNATIONAL JOURNAL OF QUALITY AND SAFETY ENGINEERING. He is the recipient of the IEEE Computer Society Technical Achievement Award for his research on Software Aging and Rejuvenation. His research interests are in reliability, availability, performance, performability, and survivability modeling of computer and communication systems. He works closely with industry in carrying out reliability/availability analysis, providing short courses on reliability, availability, and performability modeling, and in the development and dissemination of software packages such as SHARPE and SPNP.
Paulo J. Freitas Filho is an Associate Professor in the Informatics and Statistics Department at the Federal University of Santa Catarina, Brazil. He is a member of the Society for Computer Simulation (SCS) and of the Brazilian Computer Society (SBC). His research interests include simulation of computer systems for performance improvement, risk modeling and simulation, and input modeling and output analysis.