
JOURNAL OF SOFTWARE MAINTENANCE AND EVOLUTION: RESEARCH AND PRACTICE J. Softw. Maint. Evol.: Res. Pract. 2011; 23:21–33 Published online 4 May 2010 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/smr.467

Optimal software maintenance policy considering unavailable time Cheng-Jie Xiong∗, † , Min Xie and Szu-Hui Ng Department of Industrial and Systems Engineering, National University of Singapore, Singapore

SUMMARY
With the enhancement of hardware and software engineering, the effectiveness and correctness of software is less and less doubted and customers are more aware about whether software services are available or not when needed. Software maintenance is one of the main reasons that make software unavailable and it is often very expensive to perform maintenance tasks. Common approaches of studying software maintenance are to consider it as a static by-product of software operation and only the maintenance cost is covered. In this paper, software maintenance policies are studied with the consideration of unavailable service time. A non-homogeneous continuous Markov chain is adopted for modeling the software operation and maintenance process, and the cost of software unavailability that is brought in by software maintenance is investigated and analyzed for searching the optimal maintenance policy, which aims at minimizing the average maintenance time cost. The optimality of our proposed policy is shown and checked by numerical examples with discussions of its possible application perspectives.

Copyright © 2010 John Wiley & Sons, Ltd.

Received 5 March 2009; Revised 1 March 2010; Accepted 4 March 2010

KEY WORDS: software maintenance; cost; software maintenance policy; unavailable time

∗ Correspondence to: Cheng-Jie Xiong, Department of Industrial and Systems Engineering, National University of Singapore, Singapore.
† E-mail: [email protected]
Contract/grant sponsor: A*Star (Agency for Science, Technology and Research) in Singapore; contract/grant number: 072 1340050

1. INTRODUCTION
Software maintenance is the process of correcting, perfecting, or adapting existing programs so that their performance more closely meets the needs of the organization using the software [1]. With the increasing size and capability of software systems, the effort put into software maintenance is also increasing significantly. According to Sommerville, approximately 250 billion lines of source code were being maintained by organizations and companies worldwide in the year 2000, and this figure would keep increasing at a very fast pace [2]. Hanebutte and Oman [3] pointed out that software maintenance is costly and time/effort consuming, and that the maintainability [4–6] of a software system is one of the most important criteria on which companies or organizations decide whether to deploy it.

Many authors have focused on software maintenance. Since maintenance only happens during the operational phase, software maintenance is different from software testing. Although the maintenance process may differ among organizations and the techniques utilized vary, one thing in common is that software maintenance is expensive [7, 8]. Many researchers and software practitioners have long been devoted to this field in order to find solutions for better maintenance and lower cost. Some focused on the fundamental software structure to compare different design
patterns' impact on software maintenance [9], whereas others investigated different maintenance strategies [10, 11]. There are also some discussions on the topic of software maintenance policy, in which different policies are compared and the optimality is concluded [12, 13]. However, in most of the existing literature, software maintenance is regarded as driven by users' requests, only the maintenance activities themselves are investigated, and such activities are deemed independent of software operation.

On the other hand, as the quality of software keeps increasing, users are also placing higher requirements on software [14]. Not only is the system expected to function correctly and efficiently, but it is also expected to be available all the time. Unavailability of service is frustrating and there may be a huge penalty cost associated with the unavailability of software systems [15]. Since maintenance is the main reason that causes a software system to be unavailable, it is no longer reasonable to isolate maintenance activities from operation activities, and the optimal software maintenance policy needs to be reconsidered with unavailable service time. However, there are limited resources to which we can refer. Thus, we are motivated to investigate software operation and maintenance as equally weighted counterparts. The process of software operation and maintenance is modeled under a unified framework. Our proposed model gives us a new approach for analyzing software maintenance activities from the viewpoint of unavailable time. The optimality problem of maintenance policy is then investigated from a cost-effectiveness point of view.

The remainder of the paper is organized in the following way. Section 2 first introduces the scenario of software operation and maintenance, and then the process of software operation and maintenance is modeled with an illustration of how the model can be estimated using historical data.
The problem of software maintenance policy is briefly reviewed in Section 3. Unavailable time and cost that are caused by maintenance are then investigated and a cost model is proposed. The problem of optimal maintenance policy is then modeled based on a cost-effective aspect. A simulation-based case study and the sensitivity analysis are presented in Section 4 for validation of our proposed approach’s ability in searching the optimal maintenance policy with a further discussion on the proposed solution’s application perspective in real industrial cases. Section 5 concludes the paper with the summary of the research and directions for future research.

2. MODELING OF SOFTWARE OPERATION AND MAINTENANCE PROCESS
Software maintenance happens when a software system is released and deployed. In this section, a unified framework is established for studying the operation and the maintenance process of software systems, which provides us with the information required for analyzing the maintenance policy.

2.1. Modeling of software operation and maintenance processes
In deployment, a software system may encounter many problems that violate the normal operation of software [15]. Some are due to improper software manipulation; some are due to external environmental interference, such as a strong magnetic field; but more are caused by faults within the software system, which we call software failures. For the purpose of clarity, we define software failures as events of unexpected output of software, such as wrong results of calculation, control violation of hardware, etc. All software failures can be related to certain software fault(s), by which we mean an error of logic in the source code [14]. Software failures do not necessarily hamper software operation, but they may lead to unpredictable results in the system in the long run. No actual software is perfect, and encountered software faults need to be removed eventually in order to guarantee the operation. Fault-removal activities are classified as maintenance during the operational phase.

Chapin et al. [16] classified software maintenance into three categories, namely corrective, perfective and adaptive maintenance. Corrective maintenance is the activity performed on the software system in order to remove a residual fault while leaving the intended semantics unchanged. Adaptive and perfective maintenance, including enhancive maintenance, are activities performed on a software system in order to implement changed functionality and/or
to change software properties such as security, performance, usability, and so on, or to adapt to a new platform. Among the three types of maintenance, corrective maintenance is the most common and it is the most highly involved maintenance activity throughout the software deployment life cycle [17]. Adaptive maintenance usually occurs when a software system is first deployed and changes are required for adapting to a new environment. Perfective maintenance usually refers to changes that make the software system perform better, such as functionality enhancement and documentation update.

The work required for the above-mentioned three types of maintenance differs considerably. Adaptive maintenance is often described as a 'one-time-job', such as class encapsulation when switching from the C language to C++/Java [18]. Perfective maintenance is often performed only when the target system reaches a stable stage. According to the data of some existing system change tracing systems, such as Bugzilla [19], which can trace and classify different types of software maintenance, over 90% of the software maintenance performed is corrective maintenance. Since corrective maintenance plays the main role in maintenance tasks and it is possible to distinguish corrective maintenance from other types of software maintenance, we mainly focus on corrective maintenance in this research.

One major issue in studying software operation and maintenance is how to determine the operation and maintenance time. In fact, the time to failure and the time to repair of software systems are not deterministic and they depend heavily on the environment and the state of the software. It is reported by many authors [20–22] that software reliability grows as more faults are removed, but faults encountered in later phases usually have higher complexity and require more time to correct. There is an extensive literature on modeling the time to failure of software.
Most of these models are based either on Markov chains or on the non-homogeneous Poisson process (NHPP), and Gokhale et al. [23, 24] unified most of the proposed models that describe the software failure process as the Non-homogeneous Continuous Time Markov Chain (NHCTMC). Basically, the stochastic failure process that is described by NHCTMC, denoted by $\{X(t)\}$, counts the number of failures observed in an interval of length $t$, and it only depends on the failure rate, which also depends on the state of the software. Owing to the similarity in probabilistic nature between time to failure and time to repair, Gokhale and Lyu [25] further argued that the repair time of software failures could also be modeled using NHCTMC.

The NHCTMC is adopted in this research to model the behaviors of both software operation and maintenance events. An NHCTMC can be uniquely characterized by its transition rate. We denote the processes of software operation and maintenance as $\{X_o(t)\}$ and $\{X_m(t)\}$, respectively. Under the NHCTMC framework, the software failure rate and repair rate are referred to as $\lambda(n, t)$ and $\mu(n, t)$, respectively, where $n$ stands for the total number of software failures encountered up to the present time $t$. The mean value functions of the operation and the maintenance processes, denoted as $m_o(t)$ and $m_m(t)$, can be obtained by integrating the failure rate and the repair rate as shown in the following equations:

$$ m_o(t) = E[X_o(t)] = \int_0^t \lambda(n, s)\, ds \tag{1} $$

$$ m_m(t) = E[X_m(t)] = \int_0^t \mu(n, s)\, ds \tag{2} $$

With proper forms of the failure rate and repair rate obtained from historical data, Equations (1) and (2) provide a mathematical approach for analyzing software failure and repair events. However, one major issue in the modeling is how to estimate the failure and repair rates, which is covered in the following section.

2.2. Parameter estimation
Different forms of failure rate and repair rate have been discussed by Gokhale et al. [23] and Gokhale and Lyu [25]. The parameters of the failure rate and repair rate can be estimated solely from sufficient historical data. In this research, we assume that both the failure and repair rate follow
a certain NHPP, which is independent of $n$ and a special case of NHCTMC. Here we adopt the classical Goel–Okumoto model [26]. With this Goel–Okumoto model, the failure and the repair rates can be written as

$$ \lambda(n, t) = a b\, e^{-bt}, \quad a, b > 0 \tag{3} $$

$$ \mu(n, t) = \alpha \beta\, e^{-\beta t}, \quad \alpha, \beta > 0 \tag{4} $$

where $a$, $b$, $\alpha$, $\beta$ are parameters to be estimated. These parameters also have physical meanings [20]: $a$ and $\alpha$ stand for the total number of failures that can be eventually detected and corrected, respectively; $b$ and $\beta$ represent the efficiency of failure detection and correction. The mean value functions of the operation and maintenance processes can be obtained as

$$ m_o(t) = a\,(1 - e^{-bt}) \tag{5} $$

$$ m_m(t) = \alpha\,(1 - e^{-\beta t}) \tag{6} $$
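As a quick consistency check between the rate functions and the mean value functions, the following minimal Python sketch integrates a Goel–Okumoto rate numerically and compares the result with the closed form. The function names, the midpoint rule and the symbol names `ALPHA_HAT`/`BETA_HAT` for the repair-side parameters are illustrative assumptions; the numerical values are the estimates reported in Table II.

```python
import math

# Estimates reported in Table II (the repair-side symbol names are assumed).
A_HAT, B_HAT = 74.37, 0.5545          # operation: failure detection
ALPHA_HAT, BETA_HAT = 66.81, 0.2139   # maintenance: fault correction

def go_rate(t, total, eff):
    """Goel-Okumoto event rate: total * eff * exp(-eff * t)."""
    return total * eff * math.exp(-eff * t)

def go_mean(t, total, eff):
    """Closed-form mean value function: total * (1 - exp(-eff * t))."""
    return total * (1.0 - math.exp(-eff * t))

def integrate_rate(t, total, eff, steps=10_000):
    """Midpoint-rule approximation of the integral of the rate over [0, t]."""
    h = t / steps
    return h * sum(go_rate((k + 0.5) * h, total, eff) for k in range(steps))

if __name__ == "__main__":
    for week in (1, 4, 12):
        closed = go_mean(week, A_HAT, B_HAT)
        numeric = integrate_rate(week, A_HAT, B_HAT)
        print(f"week {week:2d}: closed form {closed:.4f}, numerical {numeric:.4f}")
```

The two columns should agree to several decimal places, which is simply the statement that the mean value function is the integral of the corresponding rate.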

The parameters in the above equations can be estimated using either maximum likelihood estimation (MLE) or least-squares estimation (LSE) [22, 27]. For illustration, a set of real software operation and maintenance data is adopted in this research. Apache [28] is a very popular web server software system and nearly 60% of the world's web servers run on it. The latest Apache version is 2.2.13, which was released on 08/08/2009. Version 2.0.35 has been available to the public since 06/04/2002. It is the first release of Apache's major version 2.0, and we select this release as our example. The Apache project adopts Bugzilla [28] for tracking system changes, and the operation and maintenance data are open to the public. Operation and maintenance records of Apache 2.0.35 are summarized in Table I.

Please note that we select the Goel–Okumoto model in this research merely as a matter of convenience. The proposed approach allows other models to be used as well. It is also important to verify the validity of the models used when our approach is applied to other real data sets. We have used LSE to estimate the parameters and the results are shown in Table II. In this study, the Goel–Okumoto model's performance is acceptable and shows reasonable goodness-of-fit. Fitting curves are provided in Figure 1.

Table I. Operation and maintenance records in Release 2.0.35 (Apache 2.0.35).

Week    Cumulative no. of      Cumulative no. of
        detected failures      corrected failures
   1           27                      7
   2           47                     18
   3           62                     32
   4           71                     42
   5           73                     47
   6           73                     49
   7           74                     52
   8           74                     55
   9           74                     56
  10           74                     60
  11           74                     62
  12           74                     63
  19           74                     64
  21           74                     65
  24           74                     66
  27           74                     68
  31           74                     69
  32           74                     70

Table II. Model estimates and goodness-of-fit.

Estimated values                                      MSE
$\hat{a} = 74.37$, $\hat{b} = 0.5545$                 2.393
$\hat{\alpha} = 66.81$, $\hat{\beta} = 0.2139$        4.674




Figure 1. Actual data set versus fitting curve of Apache 2.0.35.

As can be seen from the illustration, our proposed approach can properly model both the software operation and maintenance process. With the help of our proposed approach, decision of maintenance policy can be made based on the analysis of the operation and maintenance processes.
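A least-squares fit of the Goel–Okumoto mean value function to the records in Table I can be sketched as follows. This is only an illustrative sketch and not necessarily the estimation procedure used for Table II: for each candidate efficiency parameter on a grid, the best total-count parameter is available in closed form because the model is linear in it, and the pair with the smallest squared error is kept. The grid bounds, the step size, the function names and the definition of MSE as the squared error divided by the number of records are all assumptions, so the printed values need not coincide with Table II.

```python
import math

# Table I, Apache 2.0.35: week index and cumulative failure counts.
WEEKS     = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 19, 21, 24, 27, 31, 32]
DETECTED  = [27, 47, 62, 71, 73, 73, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74]
CORRECTED = [7, 18, 32, 42, 47, 49, 52, 55, 56, 60, 62, 63, 64, 65, 66, 68, 69, 70]

def sse(total, eff, ts, ys):
    """Sum of squared errors of total * (1 - exp(-eff * t)) against the data."""
    return sum((y - total * (1.0 - math.exp(-eff * t))) ** 2 for t, y in zip(ts, ys))

def fit_go(ts, ys, eff_step=0.0005, eff_max=2.0):
    """Grid search over eff; for each eff the optimal total is a linear LS solution."""
    best = None
    for k in range(1, int(round(eff_max / eff_step)) + 1):
        eff = k * eff_step
        g = [1.0 - math.exp(-eff * t) for t in ts]
        total = sum(y * gi for y, gi in zip(ys, g)) / sum(gi * gi for gi in g)
        err = sse(total, eff, ts, ys)
        if best is None or err < best[2]:
            best = (total, eff, err)
    total, eff, err = best
    return total, eff, err / len(ts)   # assumed MSE definition: SSE / number of records

if __name__ == "__main__":
    print("operation   (total, efficiency, MSE):", fit_go(WEEKS, DETECTED))
    print("maintenance (total, efficiency, MSE):", fit_go(WEEKS, CORRECTED))
```

A gradient-based or library optimizer could replace the grid search; the grid is used here only because it needs no external dependencies and cannot diverge.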

3. SOFTWARE MAINTENANCE POLICY CONSIDERING UNAVAILABLE TIME

3.1. Overview of software maintenance problem
Maintenance is a common practice in industry and many authors have studied the problem of optimal maintenance policy in industrial systems [29]. As software is a unique type of industrial system, the problem of software maintenance has been studied by many authors [1, 2, 7, 30], but few have considered the impact of software maintenance on the software system's service quality.

Software maintenance takes time. Once a software system is deployed, software maintenance should be carefully scheduled. Unlike debugging during the software development phase, software maintenance cannot be performed while the software is still in operation, since software faults within the system need to be traced and the corresponding errors in the source code need to be mapped and fixed [3]. Such activities require part or even all of the resources within the software system and, since there are usually no spare systems, the software system cannot be online until maintenance is done. Hence, the software operation and maintenance processes alternate.

The software system's service quality is degraded by the unavailable time that is caused by software maintenance. The unavailable status of software servicing usually brings severe consequences, such as a huge monetary loss. For a given period, the higher the unavailability, the more costly it will be. Usually great efforts are spent in maintenance with the aim of shortening the maintenance time [31]. Software failures usually require immediate attention [13] and it is desirable to request maintenance once a failure occurs. However, maintenance itself is also costly [32], and every time maintenance is performed a setup cost is incurred. This setup cost may include the cost of arranging repair facilities and personnel, as well as the cost of configuring the software systems for repair.
For large software systems, the setup cost for maintenance is sometimes so high that customers cannot afford frequent maintenance. Luckily, since software does not degrade as hardware does, a software outage can be temporarily resolved by releasing all resources it is using and deleting all transactions it is executing
and the most effective and efficient way of doing this is rolling back the corresponding transaction. This is called software rejuvenation [33–35]. Thanks to the existence of software rejuvenation, software practitioners are able to record and roll back software transactions when a software failure occurs instead of calling for maintenance. Maintenance is not requested until certain threshold criteria are met. However, it is not easy to decide the threshold at which maintenance should be called for. Most of the existing literature that deals with maintenance policy avoids this problem by assuming that maintenance is driven by users' requests [7], and explicit fault-removing activities are not covered.

3.2. Software maintenance policy analysis

3.2.1. Preliminary assumptions. The structure of software differs from system to system and the forms of software maintenance can also differ between stages within the same software system. It is hard to cover all the issues of software maintenance in one study. We aim at building a general model that can represent the most typical scenarios in industrial practice. In this research, only corrective maintenance is considered. We further assume that software maintenance is carried out by maintenance personnel rather than by online update or software patch installation. Several assumptions are adopted in this research.

Assumptions:
1. Software failures are caused solely by software faults. Once a new failure occurs, it is recorded and then either the corresponding software transaction is rolled back or the software system goes under maintenance. The time for rolling back software transactions is negligible and the software operation is assumed to be unaffected by rolling back the transactions.
2. The operation and maintenance environment remains the same throughout the operational phase.
3. If a software maintenance schedule is initiated, the maintenance process removes all the recorded pending software failures. The maintenance activities will not introduce new software faults. In the meanwhile, operation is suspended until maintenance is done. The software operation and maintenance processes both follow NHCTMC.
4. A software failure will not occur again if its corresponding software fault is removed. If a recorded software failure keeps occurring, it is not regarded as a new failure.

The above assumptions are illustrated in Figure 2. In Figure 2, the first maintenance is requested when the third software failure is encountered, and all three software faults are corrected during the first maintenance interval. The two horizontal lines represent operation and maintenance activities, respectively. $F_i$, $i = 1, 2, \ldots$, denotes the time at which the $i$th failure occurs and $R_i$, $i = 1, 2, \ldots$, denotes the time at which the $i$th fault is corrected.

Figure 2. Demonstration of operation and maintenance of software systems.

3.2.2. Cost analysis of software maintenance considering unavailable time penalty. The cost incurred during software development and maintenance interests many researchers [13, 36, 37]. Maintenance cost commonly consists of three parts: setup cost, work cost and penalty cost [12]
that is caused by unavailable time. Here we decompose the total cost of a maintenance schedule into the above-mentioned three main components.

(1) Setup cost: the cost incurred for preparing and arranging the resources required to perform maintenance activities. In most cases, this type of cost is compulsory and relatively stable when compared with other types of cost, and is modeled as a known constant [38]. In this research, the setup cost is regarded as a known constant that is denoted as $C_s$.

(2) Work cost: the cost incurred for attempting to remove the software faults within the software system. In most cases, this type of cost increases linearly with the maintenance time, which can simply be expressed as

$$ C_w = c_w t \tag{7} $$

where $c_w$ is the unit time maintenance cost and $t$ is the maintenance time.

(3) Penalty cost: the cost incurred during the maintenance when the software system is unable to provide services. It is called a penalty cost because it usually increases dramatically as the unavailable time increases; e.g., 1 min of unavailable time in the New York City inter-banking clearance system would have caused a billion-dollar-scale loss [15]. One apparent characteristic of the penalty cost is that it starts from zero but increases very fast with time. However, the penalty cost cannot increase to infinity, since decision makers would prefer to switch to a brand new system if the unavailable time is too long. In this research, we propose a compound linear–exponential expression to model these properties of the penalty cost caused by the unavailable time. The penalty cost can be expressed as

$$ C_p = c_p t\,(1 - \exp\{-\gamma t\}) \tag{8} $$

where $c_p$ is the cost coefficient, $\gamma$ is a shape factor and $t$ is the maintenance time. Equation (8) implies that the penalty cost first increases smoothly if the maintenance time is short but increases dramatically if the maintenance takes longer. If the system breaks down completely and cannot be maintained anymore, then $c_p$ stands for the ultimate unit time penalty cost, since $C_p / t \to c_p$ as $t \to \infty$. The total cost of a maintenance schedule can then be derived as

$$ C_T(t) = C_s + C_w + C_p = C_s + c_w t + c_p t\,(1 - \exp\{-\gamma t\}) \tag{9} $$
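The three cost components can be evaluated directly. The sketch below is a minimal illustration using the preset values that Section 4 adopts for the cost model ($C_s = 100$, $c_w = 10$, $c_p = 1000$ and a shape factor of 0.5); the function names and the symbol name `GAMMA` for the shape factor are assumptions made for this example.

```python
import math

# Illustrative presets taken from Section 4; GAMMA is the shape factor of Equation (8)
# (the symbol name is an assumption).
C_S, C_W, C_P, GAMMA = 100.0, 10.0, 1000.0, 0.5

def work_cost(t, c_w=C_W):
    """Equation (7): work cost grows linearly with maintenance time t."""
    return c_w * t

def penalty_cost(t, c_p=C_P, gamma=GAMMA):
    """Equation (8): linear-exponential penalty for unavailable time t."""
    return c_p * t * (1.0 - math.exp(-gamma * t))

def total_cost(t, c_s=C_S, c_w=C_W, c_p=C_P, gamma=GAMMA):
    """Equation (9): setup + work + penalty cost of one maintenance schedule."""
    return c_s + work_cost(t, c_w) + penalty_cost(t, c_p, gamma)

if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 5.0):
        print(f"t={t:4.1f}  C_w={work_cost(t):8.2f}  "
              f"C_p={penalty_cost(t):9.2f}  C_T={total_cost(t):9.2f}")
```

For short maintenance times the penalty term is small relative to the setup cost, while for long maintenance times the penalty per unit time approaches `C_P`, which matches the interpretation of $c_p$ given above.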

Equation (9) provides a mathematical approach for quantitative analysis of maintenance cost. $C_s$, $c_w$ and $c_p$ could be obtained from testing records or prior releases [21, 39]. However, as our aim is to minimize the total maintenance cost throughout the operational life cycle, Equation (9) may not serve as a good decision factor for the maintenance policy, since it only represents the total cost of one maintenance schedule. If the length of each maintenance schedule is reduced, the total cost of each schedule is reduced but the number of maintenance schedules increases, thus resulting in a high cost of maintaining all the failures in the maintenance process. But if the length of each schedule is increased, the number of maintenance schedules is reduced but the total cost of each schedule increases, and this may also result in a high cost of maintaining all the failures. To resolve this trade-off, we define the unit time maintenance cost as a better metric. If the unit time maintenance cost is minimized in each maintenance schedule, the total maintenance cost of the whole operational life cycle can also be minimized. Thus, the objective can be set as minimizing the expected cost per unit maintenance time.

However, time is not the only factor that software practitioners need to consider when scheduling a maintenance task. As the main purpose of software maintenance is to correct software faults, at least one software fault needs to be removed during a software maintenance schedule; otherwise such software maintenance is meaningless. Since the time required to maintain software failures varies each time, the maintenance time itself is not a good maintenance objective. In industrial practice, software practitioners often set the number of software failures to be corrected in the next maintenance schedule as the maintenance objective. Denote the number of software faults to be corrected during a maintenance schedule as $N$. It is obvious that $N$ is inter-related with
$t$ and this figure is adopted as the decision variable in our research. Denote the expected unit time maintenance cost as $E[C_A(t)]$; the optimality problem can then be modeled as the following non-linear programming problem:

$$ \text{Minimize:} \quad E[C_A(t)] $$

subject to:

$$ C_A(t) = \frac{C_T(t)}{t} \tag{10} $$

$$ C_T(t) = C_s + C_w + C_p = C_s + c_w t + c_p t\,(1 - \exp\{-\gamma t\}) \tag{11} $$

$$ X_m(t) = N \tag{12} $$

where $N$ is the decision variable and a positive integer. Equation (12) implies that the maintenance schedule ends when $N$ failures have been maintained. A closed-form solution of Equation (10) alone can be obtained by taking its first-order derivative and setting it equal to zero:

$$ \frac{\partial C_A(t)}{\partial t} = 0 \tag{13} $$

Owing to the discrete constraint in Equation (12), Equation (13) cannot serve directly as the final solution and we are not able to obtain a closed-form solution for the above non-linear programming problem. However, due to the continuity of Equation (10), we can assert that the optimal solution lies either in $\lfloor X_m(t) \rfloor$ or $\lfloor X_m(t) \rfloor + 1$, where $\lfloor X_m(t) \rfloor$ stands for the largest integer that is no larger than $X_m(t)$. If sufficient information is gained and the cost model is built, the problem can easily be solved numerically with the help of computers.
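The numerical route can be sketched in a few lines of Python. The sketch below minimizes the unconstrained unit time cost of Equation (10) by golden-section search and then lists the two neighbouring integer targets. Several points are assumptions made only for illustration: the cost presets are those of Section 4, the search bracket [0.01, 5] is chosen because the unit time cost is unimodal there for these presets, the random quantity $X_m(t)$ is replaced by its mean value function $m_m(t)$ with the Table II estimates, and the cost model and the fitted repair model are taken to share one time unit. The function and symbol names are likewise hypothetical.

```python
import math

C_S, C_W, C_P, GAMMA = 100.0, 10.0, 1000.0, 0.5   # Section 4 presets; GAMMA name assumed
ALPHA_HAT, BETA_HAT = 66.81, 0.2139               # Table II repair-side estimates; names assumed

def unit_time_cost(t):
    """C_A(t) = C_T(t) / t from Equations (10) and (11)."""
    return C_S / t + C_W + C_P * (1.0 - math.exp(-GAMMA * t))

def golden_section_min(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):          # minimum lies in [a, d]
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:                    # minimum lies in [c, b]
            a, c = c, d
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)

def candidate_targets(t_star):
    """Floor of the expected number of corrected faults at t_star, and the next integer."""
    expected = ALPHA_HAT * (1.0 - math.exp(-BETA_HAT * t_star))
    n = max(1, math.floor(expected))   # at least one fault must be removed
    return n, n + 1

if __name__ == "__main__":
    t_star = golden_section_min(unit_time_cost, 0.01, 5.0)
    print("unconstrained minimizer t*:", t_star)
    print("unit time cost at t*:", unit_time_cost(t_star))
    print("candidate values of N:", candidate_targets(t_star))
```

Which of the two candidates is better still has to be decided by evaluating the expected unit time cost for each, for example by simulation, because the time needed to correct $N$ faults is random.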

4. NUMERICAL VALIDATION OF OPTIMALITY
One major problem in the quantitative analysis of software maintenance policy is the lack of supporting data. Software maintenance cost data, which are usually regarded as a part of business operation cost in practice, are often treated as commercially confidential and are hard for the public to access. Many authors discussed maintenance policy from the qualitative point of view [9–12] in order to avoid such problems. In this section, the example in Section 2.2 is extended to incorporate the cost issues. Since it is hard to get real industrial cost data, and for the purpose of illustration, we use some preset values for the proposed cost model, namely $C_s = 100$, $c_w = 10$, $c_p = 1000$, $\gamma = 0.5$, in the following analysis, with the aim of demonstrating our proposed model's ability to solve the optimality problem. If field cost data are accessible, software practitioners can easily change the values and obtain new results.

4.1. Theoretical support and simulation procedures
Since a closed-form solution for the policy decision model is not derivable, validation of optimality is a must. Simulation can take into account the operation process as well as the maintenance process in an integrated manner. Rate-based simulation is suitable for NHCTMC processes [25] and it is adopted here to simulate the operation and the maintenance processes. A pure birth process is considered. If an event has not occurred by time $t$, then for a pure birth process, the conditional probability that an event occurs in an infinitesimal interval $[t, t + dt]$ is given by $\lambda(t)\,dt$, where $\lambda(t)$ is called the event occurrence rate; for the operation and the maintenance processes, it is the failure rate $\lambda(n, t)$ and the repair rate $\mu(n, t)$, respectively. It can be proven that the probability that an event does not occur in $[t, t + \Delta t]$ is given by $P = \exp\{-[m(t + \Delta t) - m(t)]\}$, where $m(t)$ can be either Equation (5) or (6).
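The rate-based scheme described above can be sketched as a small Python routine: time advances in steps of dt, and in each step an event is generated with probability equal to the current rate times dt. This is a minimal sketch under stated assumptions rather than the simulation program used in the paper: the function names, the default step size, the seed handling and the direction of the comparison between the random number and the rate are all assumptions, and the failure rate uses the Table II estimates.

```python
import math
import random

def go_failure_rate(n, t, a=74.37, b=0.5545):
    """Equation (3) with the Table II estimates; n is unused because the rate is NHPP."""
    return a * b * math.exp(-b * t)

def simulate_event_times(rate, horizon, dt=0.001, seed=0):
    """Rate-based simulation of a pure birth process on (0, horizon].

    rate(n, t) is the event occurrence rate given n events so far at time t.
    dt must be small enough that rate(n, t) * dt stays well below 1.
    """
    rng = random.Random(seed)
    n, t, times = 0, 0.0, []
    for _ in range(int(round(horizon / dt))):
        t += dt
        x = rng.random()
        if x < rate(n, t) * dt:   # assumed rule: an event occurs when x falls below rate * dt
            n += 1
            times.append(t)
    return times

if __name__ == "__main__":
    failures = simulate_event_times(go_failure_rate, horizon=32.0)
    print(len(failures), "simulated failures in 32 weeks")
```

Averaged over many seeds, the number of simulated failures over a long horizon should be close to the mean value function of Equation (5) evaluated at that horizon; a repair process can be simulated the same way by passing a repair rate function instead.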


The simulation procedure is described as follows:
(i) Proper values of the parameters of the failure rate, repair rate and cost model are chosen. The time increment dt is set to a reasonable value.
(ii) N is assigned a value. Time is reset to 0.
(iii) Time is increased by dt. A random number x is generated and compared with $\lambda(n, t)\,dt$. If x