
IEEE TRANSACTIONS ON RELIABILITY, VOL. 64, NO. 4, DECEMBER 2015


Heterogeneous 1-Out-of-N Warm Standby Systems With Dynamic Uneven Backups

Gregory Levitin, Senior Member, IEEE, Liudong Xing, Senior Member, IEEE, and Yuanshun Dai, Member, IEEE

Abstract—In this paper, mission reliability, expected mission completion time, and cost of non-repairable 1-out-of-N: G warm standby sparing systems subject to uneven backup actions are modeled and optimized. The backup actions are used to facilitate the data recovery process in the case of an online operating element failure, which enables an activated standby element to take over the mission task through subsequent data retrievals. Both data backup and retrieval times are dynamic, and physically dependent on the amount of work performed. The system elements are not necessarily identical; each element can be characterized by a different time-to-failure distribution, a different performance, and a different level of readiness to take over the system task during the warm standby mode. An iterative numerical method is first proposed to simultaneously evaluate mission reliability, expected mission completion time, and the cost of the considered heterogeneous warm standby systems. Due to the non-monotonic effect of the backup distribution on the mission reliability, time, and cost, we formulate and solve the optimal backup distribution problem considering different combinations of optimization objectives and constraints. In the case of system elements being non-identical, their activation order can influence the mission reliability, expected mission completion time, and mission cost significantly. Therefore, we also formulate and solve the optimal element sequencing problem for the considered system. Furthermore, new integrated optimization problems are formulated and addressed. The integrated optimization aims to identify the optimal combination of backup distribution and element activation order that maximizes the mission reliability, or minimizes the expected mission time or mission cost. As shown through examples, the proposed methodology can implement a tradeoff analysis among the three mission requirements of reliability, cost, and completion time, leading to the optimal decision on both backup and standby policies of warm standby systems.

Index Terms—Uneven backup, warm standby, backup distribution, mission cost, mission time, optimization, sequencing.

Acronyms and Abbreviations

cdf     cumulative distribution function
pdf     probability density function
pmf     probability mass function
r.v.    random variable
GA      genetic algorithm
WS      warm standby

Manuscript received July 15, 2014; revised September 09, 2014, October 10, 2014, and January 04, 2015; accepted February 22, 2015. Date of publication March 11, 2015; date of current version November 25, 2015. This work was supported in part by the National Natural Science Foundation of China (No. 61170042) and Jiangsu Province development and reform commission (No. 2013-883). Associate Editor: S. Eryilmaz.
G. Levitin is with Collaborative Autonomic Computing Laboratory, School of Computer Science, University of Electronic Science and Technology of China, and also with the Israel Electric Corporation, Haifa 31000, Israel (e-mail: [email protected]).
L. Xing is with the Department of Electrical & Computer Engineering, University of Massachusetts, Dartmouth, MA 02747 USA (e-mail: lxing@umassd.edu).
Y. Dai is with Collaborative Autonomic Computing Laboratory, School of Computer Science, University of Electronic Science and Technology of China.
Digital Object Identifier 10.1109/TR.2015.2407873

Nomenclature

number of system elements
number of backup procedures throughout the mission
total number of operations to be performed during the mission (excluding backups)
maximum allowed mission time
maximum allowed number of time intervals in the mission
minimum possible number of time intervals in the mission
per time unit cost of element being in the operation mode, and warm standby mode respectively
replacement cost, time of warm standby element
minimal recognized time interval
life-time deceleration factor for element
system reliability
expected total mission cost
expected mission completion time
fraction of the entire mission task that should be performed between ( -1)-th and -th backup procedures
backup distribution vector:
number of operations needed for -th backup procedure
number of operations needed to retrieve the data stored in -th backup procedure
number of operations in each discrete portion of work
number of work portions needed for -th backup procedure
number of work portions that should be completed between the ( -1)-th and -th backup actions
number of work portions needed to retrieve data stored by -th backup procedure
number of work portions that should be performed between the mission beginning and the end of the -th backup procedure
number of work portions needed to accomplish the mission when no failures occur

0018-9529 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.



index of the element, which should be initiated, given it is still working, after elements with indices have failed
probability that the number of the last backup that was completed by the sequence of elements is and the number of the time interval when the last element from this sequence failed is
number of operations needed to save data generated after performing fraction of the entire mission task
number of operations needed to retrieve the data saved after performing fraction of the entire mission task
performance (number of operations per time unit) of -th element
number of work portions performed by -th element in time unit:
integer number for which
integer number closest to
expected cost of using element in WS mode given it should remain in this mode till time interval
expected cost of using element given it should be initiated in time interval after -th backup
expected cost of using element given it is never initiated during the mission
time-to-failure cdf for element
discretization function:

I. INTRODUCTION

The warm standby sparing technique has been widely applied to enhance the reliability and availability of critical systems in various application areas such as storage systems [1], high performance computing systems [2], flight controls [3], space missions [4], and power systems [5]–[7]. In a warm standby system, one or multiple elements are on-line and operating, with some elements serving as standby spares. Before being put into operation, standby elements are exposed to certain stresses, though their failure rates in the standby mode are typically less than their corresponding full operational failure rates. When an on-line element experiences a failure, it is removed from operation, and a standby element is activated to take over the system task from the failed element [8]–[10]. A specific example is a computer server with primary and redundant hard disks. The primary disks are used for normal data access. The standby disks are spinning, and thus exposed to operation stresses. However, they do not provide access to data, hence their positioning mechanisms are idle, which makes the disks in the standby mode less failure prone than in the operation mode [11].

Warm standby is effective in upgrading a system's reliability. However, such benefits cannot be obtained without additional overhead [12], [13]. Particularly, to enable system reconfiguration in the case of an online operating element failing, each online element usually conducts a number of data backup operations during its lifetime based on a predetermined data backup policy. These backup operations are associated with additional mission time and additional capital cost for backup storage. The data backup time incurred in each backup action is typically dynamic, depending on the amount of work conducted since the last backup or since the beginning of the mission. When an online operating element failure happens, a replacement procedure is initiated to activate a selected warm standby element, and to transfer the previously-saved data from the backup storage to the selected element. The time and cost needed for conducting such a replacement procedure also contribute to the extra overhead of the warm standby system. In addition, each activated warm standby element, before performing any of the remaining uncompleted mission task, requires additional cost and time for re-performing all the work portions that were done by the last active element since the previous successful backup.

Considering high competition to provide both reliable and cost-effective system operation under limited budget and time constraints, it is essential to perform a trade-off analysis among the three mission requirements of reliability, time, and cost. To facilitate such a trade-off study, in particular, for 1-out-of-N: G warm standby systems with non-identical elements subject to uneven backups, we first propose a numerical method for evaluating system reliability, expected mission time, and the cost of the considered warm standby systems. The method allows for dynamic data backup and retrieval times. Based on the suggested numerical evaluation method, a collection of optimization problems are then addressed.
First, because the backup distribution can have non-monotonic influence on mission reliability, time, and cost (refer to Section V for details), we formulate and solve the optimal backup distribution problem for the considered systems. Second, when the system elements are different, their activation sequence can greatly influence the mission performance indices of reliability, time, and cost. Therefore, we formulate and solve the optimal standby element sequencing problem for the considered system. Lastly, integrated optimization problems are solved for identifying the optimal combination of backup distribution and element activation sequence that can offer the best combination of system reliability, expected mission time, and estimated system cost.

The remainder of the paper is arranged as follows. Section II presents relevant works on standby sparing system optimizations. Section III gives the system model. Section IV presents the mission reliability, expected mission time, and the cost evaluation method proposed for 1-out-of-N: G heterogeneous non-repairable warm standby systems with uneven backup actions. Section V describes numerical examples and analysis results. Effects of different parameters on the evaluation results are also discussed in this section. Section VI gives the optimization problem formulation and example solutions. Section VII concludes the work.

II. RELATED WORK

Extensive research efforts have been expended on formulating and solving optimization problems for warm standby systems as well as other types of standby systems (i.e., hot and


cold) [14]–[16]. In contrast to warm standby elements which are partially working and thus partially ready to take over the task from the failed element [17], [18], a hot standby element operates in parallel with the on-line active element, and is ready to take over the mission task at any time [13]. Thus it can provide fast system recovery in the case of a failure occurring. However, the hot standby redundancy involves high maintenance or operation overhead as the working hot standby element consumes energy and materials as much as the online element does. As the other extreme case, a cold standby element is unpowered, and does not operate or consume any energy or materials until it is needed to replace the failed on-line element. In this case, the maintenance cost of a standby element is minimal, and its failure rate can be assumed to be zero before being activated. However, significant restoration delays or high startup costs are inevitable for cold standby sparing systems [19]–[21]. Essentially, both hot and cold standby systems are special cases of the warm standby model studied in this paper.

The optimization methods used for these three types of standby systems are roughly categorized into exact, and heuristic or meta-heuristic methods. For example, the exact methods of integer programming, Lagrange multipliers, and dynamic programming were adopted to solve the redundancy allocation problem (RAP) of 1-out-of-N: G hot standby series-parallel systems with a homogeneous backup strategy (one type of system elements can be substituted only with the same type of system elements) [22]–[24]. The meta-heuristic methods such as genetic algorithm (GA), Tabu search, and ant colony optimization algorithm were applied to solve the RAP of 1- or k-out-of-N: G hot standby series-parallel systems with a heterogeneous backup strategy (one type of system element can be substituted by a different type of system element with equivalent functionality) [25]–[28]. In [29], GA was also adapted for series-parallel hot-standby systems with uncertain component Weibull scale parameters.

Sample works on solving the RAP of cold standby systems include an integer programming method in [30], and a hybrid algorithm based on GA and fuzzy theory in [31]. The integer programming approach was also extended for addressing the RAP of 1- or k-out-of-N: G heterogeneous series-parallel systems with spare elements being configured in either hot or cold standby in each parallel subsystem [32], [33]. In [34], [35], GA was adapted to solve the RAP for 1- or k-out-of-N: G series-parallel systems with a combination of heterogeneous cold and hot spare elements co-existing within one parallel subsystem. Recently, GA was also implemented to address the optimal element sequencing problem for 1- or k-out-of-N: G heterogeneous cold standby systems with single-phase [36], [37], or multi-phase [38] missions, where a practical cost model evaluating the mission cost as a function of the elements' actual working times was used. In addition, there exist works integrating the cold standby and active redundancy technique (a technique similar to hot standby but without using switching mechanisms [39]); a multi-objective version of GA in [40], and a hybrid intelligent algorithm based on GA, neural networks, and fuzzy theory in [41] have been suggested to analyze such hybrid systems.


While the majority of optimization efforts were dedicated to the special cases of cold or hot standby systems, there exist few works on the optimization of general warm-standby systems. For example, integer programming and GA-based methods were proposed for solving the RAP of series-parallel warm standby systems in [17], [42]. The GA-based method was also implemented to address the optimal element sequencing problem for 1-out-of-N: G heterogeneous warm standby systems considering perfect [11], and imperfect [43] switching mechanisms.

None of the aforementioned works claimed to have considered the effects of backup actions in the standby system modeling and optimization; a pioneering work addressing backups is [44], where a restricted backup model with evenly distributed backups (uniformly distributed along the mission time), fixed data backup time, and negligible data retrieval time was considered. As mentioned in the Introduction, in practice the time incurred in the data backup action, and hence the subsequent data retrieval, depends greatly on the amount of work performed since the last backup, which is typically not constant. Moreover, the proposed methodology in [44] is only applicable to the special class of cold-standby systems; it cannot be applied to the general, more complex warm-standby systems. Recent work [45] studied the non-coherent behavior of standby systems with evenly distributed backups, and negligible backup and replacement times.

In this work, we make new contributions by addressing the effects of generic, uneven backups with dynamic data backup and retrieval times for modeling and analyzing 1-out-of-N: G warm standby systems with heterogeneous and non-repairable system elements. Standby systems with even backups appear as a special case of the proposed model. We also formulate and solve a set of optimization problems relevant to these systems considering mission reliability, time, and cost.

III. THE MODEL

The system consists of N non-identical elements; each element is characterized by a specific time-to-failure distribution, given by its cumulative distribution function, and by its performance (number of operations executed in a time unit). The computational complexity of the mission is the total number of operations to be performed. If no failures or backup procedures happen, then an element needs a time equal to this total divided by its performance to complete the entire mission task. At any given time, only one element is working and online, with the remaining un-failed elements waiting in the warm standby (WS) mode, i.e., the system is a 1-out-of-N: G WS system. In the beginning, based on a prespecified sequence of element activation, the first element of the sequence is activated.

This model is motivated by practical systems, for example, local networks consisting of computers using different processors. All the processors are able to perform the computational task. Depending on location (ambient conditions), type of the processor, and exploitation history, the processors have different computation speeds and different time-to-failure distributions. When one of the processors performs the mission task, the rest of them can wait in an idle mode or perform low priority tasks.



According to the pre-specified sequence, the standby processors take over the mission task in the case of failure of the operating processor.

During the mission, any operating element performs data backup actions, such that each backup is performed when a prescribed fraction of the whole mission task has been accomplished since the last backup. The number of operations needed by a backup procedure is dynamic: its value can be a function of the work completed since the previous backup, or since the beginning of the mission, depending on the backup technique adopted. Specifically, when an incremental backup scheme is applied, the amount of data saved during the backup action, and thus the number of operations needed for the backup, depends on the amount of work accomplished since the last successful backup. When a total backup technique is used, the number of operations depends on the total amount of work completed since the beginning of the mission, as expressed in (1).

When the working online element fails, it is substituted with a WS element according to the pre-specified activation sequence. The state of any WS element is checked, and immediately and perfectly detected, at the moment when it should be activated. If the WS element fails before it should be activated, then the next element is checked, and so on. The replacement or activation process includes element startup and warming up, and it starts immediately after failure of the online operating element. The startup process presumes installing, powering, and connecting the element to the system. The warming up process presumes preparing the powered element for typical function (for example, the hard disk drive should reach its nominal speed before it can be used). The activation process can be performed automatically, or by technical personnel. The activated element is exposed to the operation stresses during the replacement procedure. The replacement cost and time are assumed to be fixed for each element, and logically independent from success or failure of the replacement procedure. This assumption is made because the replacement procedure cannot be stopped immediately after its failure, and the failed element should be removed or switched off, which takes a certain amount of time and effort.

After activation, the WS element first retrieves the backup data. The retrieval process needs a certain number of operations, which depends on the amount of total work completed before the last backup procedure; the number of operations needed to retrieve the data stored after a given backup procedure is given by (2). As illustrated in Fig. 1, each activated WS element, after the data retrieval, starts executing the mission task from the operation directly following the previous successful backup action. Any WS element activated after a given backup succeeds in performing a given number of subsequent backups if it does not fail during the time given by (3).

The mission time corresponding to the case of no failures during the mission is given by (4). Note that this time may not be the minimal possible mission time. Consider, for example, a case in which one element fails immediately after performing the first backup procedure, and another element completes the remaining part of the mission. In this case, the mission time is given by (5), which may be less than (4) if condition (6) holds. If the mission time exceeds the maximum allowed value, then the mission fails.

The considered system model also has the following assumptions.
1) The system fault detection and switching mechanism, as well as the backup mechanism, are fully reliable.
2) The replacement is activated as long as the allowed mission time is not elapsed, regardless of the time remaining to complete the mission, even if the system cannot complete the mission within the allowed time.
3) The mission task is executed evenly in time by any working element.
4) The system and elements are non-repairable during the mission.

(5)

(6)
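The backup model of this section can be illustrated with a minimal Python sketch. It is not the authors' formulation: the function names, the convention that the mission is split into K+1 fractions with a backup after each of the first K, and the linear cost functions in the usage lines are all assumptions made for illustration. It computes the operations needed by each backup under the incremental and the total scheme, the operations needed to retrieve each backup, and the failure-free mission time of a single element.

```python
def backup_plan(total_ops, fractions, backup_ops, retrieve_ops, total_scheme=True):
    """Per-backup operation counts (hypothetical helper).

    fractions    : K+1 fractions of the mission task summing to 1; a backup
                   follows each of the first K fractions (assumed convention).
    backup_ops   : function mapping an amount of work to backup operations.
    retrieve_ops : function mapping total completed work to retrieval operations.
    total_scheme : True for a total backup (depends on all work since the
                   mission start), False for an incremental backup (depends
                   only on the work since the previous backup).
    """
    done = 0.0
    plan = []
    for frac in fractions[:-1]:
        chunk = frac * total_ops          # work performed since the last backup
        done += chunk                     # work performed since mission start
        saved = done if total_scheme else chunk
        plan.append({"work": chunk,
                     "backup": backup_ops(saved),
                     "retrieve": retrieve_ops(done)})
    return plan


def failure_free_time(total_ops, plan, performance):
    """Mission time when the first element never fails: all task operations
    plus all backup operations, divided by operations per time unit."""
    return (total_ops + sum(p["backup"] for p in plan)) / performance


# Usage with illustrative (assumed) linear cost functions.
save = lambda work: 10 + 0.1 * work
load = lambda work: 5 + 0.05 * work
total_plan = backup_plan(1000, [0.25, 0.25, 0.5], save, load, total_scheme=True)
incr_plan = backup_plan(1000, [0.25, 0.25, 0.5], save, load, total_scheme=False)
print(failure_free_time(1000, total_plan, performance=5))
print(failure_free_time(1000, incr_plan, performance=5))
```

Under these assumed numbers the total scheme spends more operations on the second backup than the incremental scheme, which is why the two failure-free times differ.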



IV. RELIABILITY, EXPECTED MISSION TIME, AND COST EVALUATION ALGORITHM

A. Discretization of Mission Performance and Time

We divide the entire mission task into equal work portions with operations in each. Thus, the entire mission consists of (7) work portions. The processing speed of each element can be measured in terms of the number of work portions performed in a time unit: . Each backup procedure is associated with performing (8) work portions. Each data retrieval procedure is associated with performing (9) work portions. The number of work portions that should be accomplished between the ( -1)-th and -th backup actions is (10). As corresponds to the mission beginning, we define . The -th backup procedure is completed when (11) work portions are completed (see Fig. 1). By definition, , which corresponds to the mission completion. For any amount of the completed work portions , the number of the last completed backup procedure can be obtained as such that .

If an element activated after backups successfully retrieves the backup data (which presumes performing work portions), and then does not fail during performing work portions, it succeeds in completing the next backup procedures. Thus, the performed work portions should not be redone again, because, if the element fails earlier, then the task should be re-performed starting from the work portion that follows the last succeeded backup, or from the beginning of the mission in the case of no backups being performed yet.

Consider, for example, element that is activated after the failure of element which occurred after completion of the first backup, but before completion of the second backup (Fig. 1). First, element retrieves data corresponding to the first backup, which requires performing work portions. If the element fails after performing more than but less than - work portions, the index of the last backup performed by element is . Thus, the next activated element should retrieve the data stored after the second backup, which requires work portions. Then this element should continue the mission task from the work portion that follows the second backup. Note that, if , then no element should perform greater than work portions because, after performing this amount of work, the entire mission is accomplished, thus the element is turned off.

To discretize time, we introduce the time interval , and measure the time periods by an integer number of time intervals using the function . The maximal number of time intervals in the mission is , where is the maximal allowed mission time.

B. Probability of an Element Failure

Any element can be in WS, and in operation modes that are characterized by different parameters of time-to-failure distributions. To consider the effect of different modes, we apply the cumulative exposure model (CEM) that uses the equivalent age concept [46]. According to the CEM, the cumulative failure probability of an element is a function of the stress-dependent cumulative exposure time in which the element's time spent in WS is multiplied by a deceleration factor (reflecting lower stresses that the element experiences in the WS mode) [46], [47]. Specifically, if element remains for a time in the WS mode, and then works for a duration of time in the operation mode, then where is the deceleration factor. Let denote the cumulative time-to-failure distribution of element in the operation condition. The cumulative time-to-failure distribution function of element that works in WS and then in operation modes is thus . For any element that should be activated at time , the cumulative distribution function of the failure time is thus (12) where is the time of the element staying in the WS mode, and is the time during which the element is exposed to the operation stresses. If the element fails before its activation, then . The probability that element which should be activated at time fails in the interval between and is , where for .

Notice that the probability that an element operates during a certain time (from to ) is affected by because the time of the element exposure to standby and operation stresses depends on , as expressed in (12). The probability that element activated at time fails after performing exactly work portions is equivalent to the probability that the element fails between times and , which can be obtained as (13). The probabilities that the WS element that should be activated at time fails in standby mode, or during the replacement procedure, are , and respectively. The probability that the WS element activated at time after backups completes the mission is (14).
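The cumulative exposure model described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's equations (12) to (14): all names are hypothetical, the element is assumed to start working immediately after the replacement time, and a Weibull distribution is used only as an example of an operation-mode cdf. The equivalent age at activation is the standby time multiplied by the deceleration factor; the replacement time and the working time are then added at full operation stress.

```python
import math


def weibull_cdf(t, eta, beta):
    """Weibull cdf with scale eta and shape beta (operation-mode distribution)."""
    return 0.0 if t <= 0 else 1.0 - math.exp(-((t / eta) ** beta))


def failure_profile(t_act, delta, tau, v, max_portions, cdf):
    """Event probabilities for an element that should be activated at t_act.

    delta        : deceleration factor applied to time spent in standby.
    tau          : replacement time (spent under operation stress).
    v            : work portions performed per time unit.
    max_portions : work portions the element has to perform.
    cdf          : time-to-failure cdf in the operation mode.
    Returns (P[fails in standby], P[fails during replacement],
             [P[fails after exactly m portions] for m = 0..max_portions-1],
             P[performs all max_portions without failure]).
    """
    age = delta * t_act                       # equivalent age at activation
    p_standby = cdf(age)
    p_replace = cdf(age + tau) - cdf(age)
    p_after = [cdf(age + tau + (m + 1) / v) - cdf(age + tau + m / v)
               for m in range(max_portions)]
    p_done = 1.0 - cdf(age + tau + max_portions / v)
    return p_standby, p_replace, p_after, p_done


# Usage with illustrative (assumed) parameter values.
p_s, p_r, p_m, p_d = failure_profile(
    t_act=10.0, delta=0.3, tau=2.0, v=4.0, max_portions=20,
    cdf=lambda t: weibull_cdf(t, eta=50.0, beta=1.5))
print(p_s, p_r, sum(p_m), p_d)
```

The four outcomes are disjoint and exhaustive, so their probabilities sum to one; this is a convenient sanity check when changing the timing conventions.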



C. Performed Work and Time Distribution for a Sequence of Elements

Let be a pair of random values representing the number or index of the last backup procedure that was completed by the sequence of elements , and the number or index of the time interval when the last element from this sequence fails. by definition because the mission starts at time 0 before which no backups were performed. The probability mass function of can be defined by a matrix , where .

If element is activated in the time interval given the index of the last completed backup is , then the element can operate without failure until the mission completion, or until time when the mission is terminated because it exceeds the time limit. Thus, the element can perform no more than remaining work portions, and no more than work portions, corresponding to the remaining time. Let (15) be the maximum number of the work portions element can perform given ( ).

If element activated at time interval after the -th completed backup fails immediately after performing work portions, then see (16). The probability of this event is shown in (17). If element , activated at time interval after the -th backup, does not fail before performing the remaining work portions, then it is switched off. In this case, (18) (19) and the probability of this event is shown in (20). If element fails in the WS mode before being activated given , then (21). The probability of this event is (22). If element fails during the replacement procedure given , then (23). The probability of this event is shown in (24).

Thus, having and , one can obtain using the following iterative procedure.

1. Assign for
2. For
  2.1. For
    2.1.1.
    2.1.2.
    2.1.3.
    2.1.4.
    2.1.5. For
      2.1.5.1. If
      2.1.5.2.
      2.1.5.3.
    2.1.6.

In this procedure, Step 2.1.1. corresponds to the case when element fails in the WS mode, Step 2.1.3. corresponds to the case when it fails during the replacement procedure, Step 2.1.5.3. corresponds to the case when it fails after performing portions of work in the operating mode, and Step 2.1.6. corresponds to the case when it does not fail before the end of the mission.

(15) (16) (17) (20) (24)
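The four cases named above (failure in standby, failure during replacement, failure after some work portions, and mission completion) can be combined into a small dynamic-programming sketch. It is a simplified reading of this section and not a reproduction of the authors' procedure: the data layout, all function and key names, the rounding-based discretization, and the convention that a failure is time-stamped at the end of the work portion in progress are assumptions. The sketch propagates the joint distribution of (last completed backup, failure interval) through the activation sequence and accumulates the completion-time probabilities, from which the mission reliability and the conditional expected completion time follow as a sum and a normalized weighted sum.

```python
import math


def weibull_cdf(t, eta, beta):
    return 0.0 if t <= 0 else 1.0 - math.exp(-((t / eta) ** beta))


def disc(t, dt):
    """Assumed discretization: nearest whole number of intervals of length dt."""
    return int(math.floor(t / dt + 0.5))


def evaluate(elements, W, r, i_max, dt):
    """elements: dicts in activation order with keys eta, beta (Weibull),
         delta (deceleration factor), tau (replacement time),
         v (work portions per time unit).
       W: cumulative work portions at the end of each backup, with W[0] = 0
          and W[-1] the portions of the whole failure-free mission.
       r: r[h] = portions needed to retrieve backup h (r[0] = 0).
       Returns (reliability, expected completion time given success)."""
    Q = {(0, 0): 1.0}          # (last backup, failure interval) -> probability
    done = {}                  # completion interval -> probability
    for e in elements:
        F = lambda t, e=e: weibull_cdf(t, e["eta"], e["beta"])
        nxt = {}
        for (h, i), q in Q.items():
            t0 = i * dt
            age = e["delta"] * t0                  # equivalent age at activation
            # Case 1: already failed in standby; the state is unchanged.
            nxt[(h, i)] = nxt.get((h, i), 0.0) + q * F(age)
            # Case 2: fails during the replacement procedure.
            i_rep = disc(t0 + e["tau"], dt)
            if i_rep < i_max:
                p = q * (F(age + e["tau"]) - F(age))
                nxt[(h, i_rep)] = nxt.get((h, i_rep), 0.0) + p
            # Case 3: fails after performing exactly m work portions.
            need = r[h] + W[-1] - W[h]             # retrieval plus remaining work
            k = h
            for m in range(need):
                while k + 1 < len(W) - 1 and r[h] + W[k + 1] - W[h] <= m:
                    k += 1                         # one more backup was completed
                i_fail = disc(t0 + e["tau"] + (m + 1) / e["v"], dt)
                if i_fail >= i_max:
                    break                          # time limit reached: mission fails
                p = q * (F(age + e["tau"] + (m + 1) / e["v"])
                         - F(age + e["tau"] + m / e["v"]))
                nxt[(k, i_fail)] = nxt.get((k, i_fail), 0.0) + p
            # Case 4: performs all remaining work and completes the mission.
            i_end = disc(t0 + e["tau"] + need / e["v"], dt)
            if i_end <= i_max:
                p = q * (1.0 - F(age + e["tau"] + need / e["v"]))
                done[i_end] = done.get(i_end, 0.0) + p
        Q = nxt
    R = sum(done.values())
    T = sum(i * dt * p for i, p in done.items()) / R if R > 0 else float("nan")
    return R, T
```

A usage example with illustrative numbers: `evaluate([dict(eta=10, beta=1, delta=0.5, tau=0, v=2), dict(eta=10, beta=1, delta=0.5, tau=1, v=2)], W=[0, 6, 12, 16], r=[0, 1, 1], i_max=20, dt=1)` evaluates a two-element sequence with two backups. Probability mass that is dropped when the time limit is reached corresponds to mission failure, so the returned reliability never exceeds one.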

D. Mission Reliability, and Expected Completion Time

After applying the above described procedure, represents the probability that the sequence of elements completes the mission at time . Notice that for some values of . For example, this is the case for any because the mission time cannot be less than the time needed by the fastest element to perform the entire task without failures. In the algorithm presented in the previous section, each is obtained using , where runs from 0 to , and runs from 0 to . , and are excluded from consideration because corresponds to successful mission completion by elements , and for corresponds to the entire mission failure. In both of these cases, elements are not activated. Thus, represents the probability that elements complete the mission at time given elements fail to complete the mission: . The system reliability is computed as a sum of the probabilities of disjoint events, shown in (25). Having for , one can obtain the mission reliability as (26) and the conditional expected mission completion time given the mission succeeds as (27).

E. Expected Mission Operation Cost

If element , activated at time interval after -th backup, fails after performing work portions, then the operation cost associated with using this element is (28). The probability of this event is given in (18). For , and thus the operation cost associated with using element is . If element activated at time interval after the -th backup does not fail after performing work portions, then it is switched off because the entire mission is completed, or it is terminated because of the time limit violation. In this case, its operation cost is (29). The probability of this event is given in (20).

The expected cost of using element that fails in the WS mode at time interval or before can be obtained as (30) where is the probability density function of the element time-to-failure. This expression can be approximated as shown in (31). If element activated at time fails during the replacement procedure, then the expected cost of using this element is (32). The probability of this event is given in (24). Thus, the expected operation cost associated with using element given for and (which means that the mission is not completed by the previous elements) is shown in (33), where we see (34) and (35).

(25)

(31)

(33)



If for any element completes the mission at time interval or works until time interval , corresponding to the maximum allowed mission time, then element is not activated, and remains in the WS mode until the end of the mission. In this case, the expected cost of using element is

(36)

where , and are the expected costs of using element when it fails, and when it does not fail in the WS mode before the end of the mission respectively, given the mission ends in the time interval . is obtained using (31), and (37) because the element is switched off in interval . The total expected cost of using element is

(38)

The total expected mission operation cost is

(39)

F. Evaluation Algorithm for System Reliability, Expected Mission Time, and Cost

As a summary, below is the pseudo code of the evaluation algorithm for analyzing system reliability , expected mission completion time , and expected mission cost for non-repairable 1-out-of- : G warm standby systems subject to uneven backup actions.

1. For the given and functions , determine and using (1), (2), (8)–(11);
2. Assign ;
3. For : For : Assign ; Assign ;
4. For : For : obtain using (31) for the given ;
5. For :
5.1. Assign for
5.2. For
5.2.1. For
5.2.1.1. ;
5.2.1.2.
5.2.1.3.
5.2.1.4.
5.2.1.5.
5.2.1.6.
5.2.1.7. ;
5.2.1.8. For
5.2.1.8.1. If (
5.2.1.8.2.
5.2.1.8.3.
5.2.1.8.4.
5.2.1.8.5.
5.2.1.8.6.
5.2.1.9.
5.2.1.10.
5.3. For
5.3.1.
5.3.2.
5.3.3.
5.4.

The computational complexity of the above evaluation algorithm is less than .

V. EXAMPLES OF MISSION RELIABILITY, EXPECTED TIME, AND COST EVALUATION

Consider a 1-out-of-5 non-repairable warm standby system. The system elements are characterized by Weibull time-to-failure distributions with cdf $F(t) = 1 - \exp\{-(t/\eta)^{\beta}\}$ (40) where $\eta$ and $\beta$ are respectively the scale parameter and the shape parameter, with values presented in Table I. The replacement , the standby and operation costs (per time unit, ), the deceleration factors , the replacement times , and the processing speeds are also presented in this table. The mission parameters are presented in Table II. It is assumed that the number of operations needed for each backup data saving and retrieval are linear functions of the total amount of work performed before the backup:

(41)

Values of , and are given in Table II. Note that the total backup technique in (1) is assumed for the example analysis and results. The minimal total number of operations needed to complete the mission (including the backups) is . The minimal mission time given the elements are activated in increasing numerical order is . The obtained mission reliability, expected cost, and time for elements activated in an increasing numerical order are , and , respectively.

To investigate the impact of the discretization parameters on the accuracy of the obtained results, the values of , and are calculated for different numbers of operations in each portion of work ranging from 4 to 200. For any given , the minimal recognized time interval was chosen equal to the time needed by the fastest element to perform one portion of work: . Fig. 2 presents values of and as functions of . Fig. 3 presents the system reliability as well as the running time of the proposed algorithm on a Pentium 2 GHz PC as functions of . The difference between the results obtained by the algorithm for and is 0.1% for , 0.3% for , and 0.14% for . The difference between these results obtained for and lowers to 0.025%, 0.07%, and 0.03% respectively. This result pattern illustrates the quick convergence of the algorithm.

To verify the algorithm, its results have been compared with results of Monte Carlo simulation that uses software developed for cold standby systems with periodic backups. For the model with negligible data retrieval times ( ), even backups ( for ), cold standby ( for ), and constant data backup times, the suggested algorithm produced the same results as the simulation procedure described in [44].

Fig. 4 presents the obtained mission reliability, expected mission cost, and expected mission completion time as functions of the maximum allowed mission time. The rest of the mission parameters are taken from Table II. When , the mission reliability is zero because there is no chance to complete the mission in time less than . However, the mission cost is nonzero because the elements are activated and work until time . For , the mission reliability is equal to the probability that the first element completes the mission, and the expected mission time is equal to . Indeed, is the minimal time needed by element 2 to replace element 1 and to complete the mission, when element 1 fails at time 0. Thus, only the first element is able to complete the mission in time less than 161. When becomes greater than the maximal possible time of the mission completion, which equals 523.6, the value of does not affect , and any more.

(34) (35)

Fig. 1. Example of a successful mission with three backups and four replacements.

TABLE I ELEMENT PARAMETERS FOR THE NUMERICAL EXAMPLE

TABLE II MISSION PARAMETERS FOR THE NUMERICAL EXAMPLE

Fig. 2. , and for the example 1-out-of-5 standby system as functions of .

Fig. 3. Obtained mission reliability for a 1-out-of-5 standby system, and the running time needed for the , and evaluation algorithm as functions of .

Fig. 4. Mission reliability , expected cost , and time as functions of the maximum allowed mission time.

Fig. 5. Number of backups , mission reliability , expected cost , and completion time as functions of the fraction of the mission task that should be performed between evenly distributed backups.
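For evenly distributed backups, the backup count and the position of the last backup follow directly from the fraction of the task performed between backups. The sketch below is a hypothetical helper (the name `even_backup_schedule` and the use of exact fractions are assumptions, not part of the original algorithm). It counts the multiples of that fraction that fall strictly before task completion, which matches the two cases discussed in this section: 0.2499 gives four backups ending at 0.9996 of the task, and 0.25 gives three backups ending at 0.75.

```python
import math
from fractions import Fraction


def even_backup_schedule(delta):
    """Return (number_of_backups, backup_points) for even backups.

    delta is the fraction of the mission task performed between backups.
    A backup is placed at every multiple of delta strictly below 1, since
    a backup at the very end of the task would serve no purpose.
    """
    step = Fraction(str(delta))          # exact arithmetic avoids float round-off
    if step <= 0:
        raise ValueError("delta must be positive")
    count = math.ceil(1 / step) - 1      # multiples of delta strictly less than 1
    points = [float(i * step) for i in range(1, count + 1)]
    return count, points


if __name__ == "__main__":
    for d in (0.2499, 0.25, 1.0):
        print(d, even_backup_schedule(d))
```

The jump from four backups to three as the fraction moves from 0.2499 to 0.25 is the discontinuity that produces the abrupt changes in the plotted curves.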

Fig. 5 presents the number of backups , mission reliability , expected cost , and time as functions of the fraction of the mission task that should be performed between backups assuming that they are evenly distributed: for all . The rest of the mission parameters are taken from Table II. In this case, the number of backups equals the greatest integer number not exceeding . The impact of on the system reliability and expected mission cost and time is two-fold. On the one hand, increasing the value of results in less frequent backup actions, leading to an increase of the work to be re-performed when element failures occur. On the other hand, the increase of causes a reduction of the number of backups during the mission, which reduces the total time required to accomplish the entire mission. Hence, the system reliability, expected mission completion time, and expected mission cost are non-monotonic functions of .

The abrupt jumps in functions , and happen when the value of changes. Consider with being an integer number. This case corresponds to backup actions throughout the entire mission. If the value of is reduced by a negligibly small amount, the value of immediately changes from to , causing a sharp increase in the minimal possible mission time. Because the variation in the value of is negligible, the work portions that should be re-performed when failures occur have no considerable changes. Hence, when increases to such a value, the system reliability increases, and the expected mission time and cost decrease abruptly due to the decrease in the value of . Thus, in the case of even backups, the maximal system reliability and the minimal expected mission time and cost are always obtained when is valued as . For example, a fraction of 0.2499 corresponds to four backups, with the last backup performed when 0.9996 of the entire mission task is completed, whereas a fraction of 0.25 corresponds to three backups, with the last backup performed when 0.75 of the entire mission task is completed. Performing the fourth backup when the mission is almost completed is ineffective because the probability that the system fails after this backup is negligible. Therefore, when increases from 0.2499 to 0.25, the expected mission completion time and cost drop drastically, and the mission reliability increases abruptly.

Fig. 6 presents mission reliability , expected cost , and time as functions of the scale parameters for , and 4. When for certain varies, the rest of the element parameters are fixed, and taken from Table I. The mission parameters are again taken from Table II. When the first element fails before the first backup procedure occurs, the next element starts the task from the beginning regardless of the time when the first element fails. Thus, the sooner the first element fails, the sooner the inevitable standby element activation starts, which decreases the overall expected mission time as well as the time of the standby element's exposure to the warm standby stresses. Therefore, when the first element is very unreliable, an increase in its reliability (i.e., its expected failure time) leads to a decrease in the entire mission reliability as well as an increase in the expected mission time and cost. When the reliability of the first element continues to increase, the probability that the element completes several backups and even completes the entire mission task increases. This change results in a decrease in the expected mission cost and time, and in an increase in the system reliability. Such an effect of non-monotonic functional dependence of system reliability, time, and cost on a single element's reliability is more pronounced for elements activated earlier than for elements activated later. The reason for this effect is that the elements that should be activated later have a greater chance of failing in the WS mode before being activated if they are more unreliable.

Fig. 6. Mission reliability , expected cost , and time as functions of scale parameters .

VI. STANDBY SYSTEM OPERATION OPTIMIZATION PROBLEMS

A. Optimization of Backups

As shown earlier, the mission reliability, expected time, and cost depend on the number of backups and their distribution non-monotonically. Thus, the optimal backup distribution problem for the considered system can be formulated as follows. Find the backup distribution vector that maximizes (minimizes or ), subject to the constraints on the rest of the mission success indices. For example, see (42) at the bottom of the page. Having the procedure for evaluating , and suggested in Section IV, one can solve the optimization problem using any algorithm for multidimensional optimization. In this work, the GA meta-heuristic is used [48], [49]. GA generates numbers such that each belongs to the interval (0, 1.1). The number of backups during the mission is determined by the conditions

For example, when , no backups are performed. Thus, the number of backups can vary from 0 to (in this work, was used). For each vector generated by the GA, values of , and are obtained by the algorithm described in Section IV.F. The objective function, shown in (43) at the bottom of the page, is evaluated, where are penalty coefficients. The GA seeks a that minimizes the function Ξ. Notice that, when , the problem reduces to the expected mission cost minimization; when , and , the problem reduces to the mission reliability maximization; and when , the problem reduces to the expected mission time minimization.

(42)

Fig. 7. Mission reliability , expected cost , expected time , and the number of backups, corresponding to the optimal solutions as functions of the backup time function parameter .

B. Numerical Examples

Table III presents sample optimal solutions for the example 1-out-of-5 warm standby system given the elements are activated in an increasing numerical order.

Ξ

TABLE III OPTIMAL BACKUP DISTRIBUTION SOLUTIONS
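The optimization in Section VI-A has the GA generate numbers in the interval (0, 1.1), from which the number and positions of the backups are derived. A minimal sketch of one possible decoding follows. It assumes, and this rule is not confirmed by the text reproduced here, that each number is the fraction of work performed between consecutive backups and that backups stop once the cumulative fraction reaches 1, so that a first number of at least 1 yields no backups. The name `decode_backup_vector` is hypothetical.

```python
def decode_backup_vector(genes):
    """Turn GA-generated numbers into cumulative backup positions.

    genes: iterable of floats, each assumed to lie in (0, 1.1).
    Returns the list of task fractions at which backups are performed.
    Decoding stops at the first cumulative value that reaches 1, because
    a backup at or after task completion would never be used.
    """
    points = []
    done = 0.0
    for g in genes:
        done += g
        if done >= 1.0:
            break
        points.append(done)
    return points


if __name__ == "__main__":
    print(decode_backup_vector([1.05, 0.2]))            # first gene >= 1: no backups
    print(decode_backup_vector([0.3, 0.3, 0.5, 0.1]))   # two backups, then the task ends
```

Allowing values slightly above 1 gives the search a direct way to represent the no-backup solution with a single gene.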

The proposed algorithm can facilitate a study of the effects of different parameters on the optimal backup distribution . For example, the maximal possible values of system reliability, and minimal possible values of expected mission cost and time, as well as the corresponding optimal number of the backup procedures obtained by the optimization procedure, are presented in Fig. 7 as functions of which determines the backup time.

(43)
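One common way to fold constraints into a single GA objective is a weighted sum with penalty terms. The sketch below illustrates that general idea only: the exact form of (43) is not reproduced here, and the function name, the weights `w_r`, `w_c`, `w_t`, the thresholds, and the `penalty` value are all assumptions made for illustration.

```python
import math


def penalized_objective(reliability, cost, time,
                        w_r, w_c, w_t,
                        r_min=0.0, c_max=math.inf, t_max=math.inf,
                        penalty=1.0e6):
    """Scalar value to be minimized by the search procedure.

    The weighted part rewards high reliability (negative sign) and low
    cost and time. Each violated constraint adds a term proportional to
    the size of the violation, so infeasible solutions rank poorly.
    """
    value = w_c * cost + w_t * time - w_r * reliability
    violation = (max(0.0, r_min - reliability)
                 + max(0.0, cost - c_max)
                 + max(0.0, time - t_max))
    return value + penalty * violation


if __name__ == "__main__":
    # Pure cost minimization: only the cost weight is nonzero.
    print(penalized_objective(0.9, 100.0, 50.0, w_r=0.0, w_c=1.0, w_t=0.0))
    # Same solution, now penalized for missing a reliability floor of 0.95.
    print(penalized_objective(0.9, 100.0, 50.0, w_r=0.0, w_c=1.0, w_t=0.0, r_min=0.95))
```

Setting all weights but one to zero reduces the sketch to a single-objective problem, which mirrors the special cases described in Section VI-A.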



TABLE IV OPTIMAL ELEMENT ACTIVATION SEQUENCING SOLUTIONS FOR FIXED

TABLE V OPTIMAL ELEMENT ACTIVATION SEQUENCING AND BACKUP DISTRIBUTION SOLUTIONS

TABLE VI COMPARISON OF THE OPTIMAL SOLUTIONS

Observe that, as the data backup time increases, the optimal value of always decreases, and eventually becomes zero, meaning the backup actions should not be conducted at all. For example, when the expected mission time is minimized, the optimal value of drops to zero for . From this point on, the data backup time has no effect on system reliability, expected mission cost, and time. The number of the backup procedures needed to maximize the mission reliability is usually greater than the number of the backup procedures needed to minimize the expected mission cost and time.

C. Optimal Element Activation Sequencing

When system elements are non-identical, the optimal element activation sequencing problem arises. The problem is formulated as follows. Find the activation order that maximizes (minimizes or ) subject to constraints on the rest of the mission success indices. When the number of system elements is small, the optimal element activation sequence can be identified through a brute-force enumeration of all possible permutations of numbers . When the value of is large, some heuristic algorithms can be used [16], [48], [49]. Examples of the optimal activation sequencing for the 1-out-of-5 standby system with parameters taken from Tables I and II are presented in Table IV.

D. Integrated Optimization Problem

In the case when both the backup distribution and the system element activation order are changeable, a combination of the vector and the activation sequence providing the best combination of the system reliability and expected mission time and cost should be identified. Table V illustrates such solutions for the example 1-out-of-5 warm standby system with uneven backups. The optimal sequences vary depending on the problem, though the fastest elements 1 and 2 tend to be used first when the mission reliability is the main concern. When the mission cost is restricted or should be minimized, the inexpensive element 4 is used first. The minimal expected mission time is obtained when no backups are used and the fastest elements are activated first. Solving the integrated optimization problem yields better results than solving the optimal element activation sequencing and optimal backup distribution problems separately. Table VI compares results obtained for max , min , and min problems when only backup distribution, only element activation sequencing, or both backup distribution and element activation sequencing are optimized. Relative differences among the obtained solutions of the separate optimization problems and the integrated optimization problem are presented in parentheses. Note also that, in the considered case, the backup distribution optimization provides better solutions than the activation sequence optimization.

VII. CONCLUSION

The warm standby sparing model provides an effective design solution to enhance system reliability while balancing operation cost and the replacement cost of standby elements. Its benefit is always accompanied by additional overhead including both time and budget. Therefore, a trade-off analysis among mission reliability, expected time, and cost is essential in the practical operation management of warm standby systems for providing the best combination of the three mission performance indices. To facilitate such a trade-off study, a numerical evaluation method is first proposed for analyzing the mission reliability, expected time, and the cost of non-repairable 1-out-of- : G heterogeneous warm standby systems subject to uneven backups. Both dynamic data backup and dynamic data retrieval times are considered. A set of system operation optimization problems is then formulated and addressed, including the optimal backup distribution, the optimal element activation sequencing, and the integration of the former two problems. Solutions to these optimization problems consider different combinations of objectives in maximizing reliability, minimizing expected time or cost, as well as system-level constraints. Effects of different parameters on the system characteristics and optimization solutions are also investigated, which include the maximum allowed mission time, fraction of mission task performed between backups, element reliability, as well as data backup time. As demonstrated through examples, the proposed model can assist in making the optimal decision regarding the system's backup policy and standby policy, leading to both reliable and cost-effective operation of 1-out-of- : G warm standby systems.

REFERENCES

[1] J. G. Elerath and M. Pecht, “A highly accurate method for assessing reliability of redundant arrays of inexpensive disks (RAID),” IEEE Trans. Comput., vol. 58, no. 3, pp. 289–299, Mar. 2009.
[2] C. Hsieh and Y. Hsieh, “Reliability and cost optimization in distributed computing systems,” Comput. Oper. Res., vol. 30, pp. 1103–1119, 2003.
[3] B. W. Johnson and P. M. Julish, “Fault-tolerant computer system for the A129 helicopter,” IEEE Trans. Aerosp. Electron. Syst., vol. 21, no. 2, pp. 220–229, 1985.
[4] G. Sinaki, “Ultra-reliable fault tolerant inertial reference unit for spacecraft,” in Proc. Annu. Rocky Mountain Guidance and Control Conf., San Diego, CA, USA, 1994, pp. 239–248.
[5] K. Durga Rao, V. Gopika, V. V. S. Sanyasi Rao, H. S. Kushwaha, A. K. Verma, and A. Srividya, “Dynamic fault tree analysis using Monte Carlo simulation in probabilistic safety assessment,” Rel. Eng. Syst. Safety, vol. 94, no. 4, pp. 872–883, Apr. 2009.
[6] Q. Zhai, R. Peng, L. Xing, and J. Yang, “Reliability of demand-based warm standby systems subject to fault level coverage,” Appl. Stochast. Models Business Ind., to be published.
[7] T. Zhang, M. Xie, and M. Horigome, “Availability and reliability of k-out-of: G warm standby systems,” Rel. Eng. Syst. Safety, vol. 91, no. 4, pp. 381–387, 2006.
[8] S. V. Amari, H. Pham, and R. B. Misra, “Reliability characteristics of k-out-of-n warm standby systems,” IEEE Trans. Rel., vol. 61, pp. 1007–1018, 2012.
[9] E. Papageorgiou and G. Kokolakis, “Reliability analysis of a two-unit general parallel system with warm standbys,” Eur. J. Oper. Res., vol. 201, no. 3, pp. 821–827, 2010.
[10] O. Tannous, L. Xing, and J. B. Dugan, “Reliability analysis of warm standby systems using sequential BDD,” in Proc. 57th Annual Reliability & Maintainability Symp., Orlando, FL, USA, Jan. 2011.
[11] G. Levitin, L. Xing, and Y. Dai, “Optimal sequencing of warm standby elements,” Comput. Ind. Eng., vol. 65, pp. 570–576, 2013.
[12] X. Yang, Z. Wang, J. Xue, and Y. Zhou, “The reliability wall for exascale supercomputing,” IEEE Trans. Comput., vol. 61, no. 6, pp. 767–779, Jun. 2012.
[13] B. W. Johnson, Design and Analysis of Fault Tolerant Digital Systems. Reading, MA, USA: Addison-Wesley, 1989.
[14] W. Kuo, V. R. Prasad, F. A. Tillman, and C. Hwang, Optimal Reliability Design: Fundamentals and Applications. London, U.K.: Cambridge Univ. Press, 2001.
[15] M. Gen and Y. Yun, “Soft computing approach for reliability optimization: State-of-the-art survey,” Rel. Eng. Syst. Safety, vol. 91, no. 9, pp. 1008–1026, 2006.
[16] W. Kuo and R. Wan, “Recent advances in optimal reliability allocation,” IEEE Trans. Syst., Man, Cybern., Part A: Syst. Humans, vol. 37, no. 2, pp. 143–156, 2007.
[17] S. V. Amari and G. Dill, “Redundancy optimization problem with warm-standby redundancy,” in Proc. Ann. Reliability and Maintainability Symp., Jan. 2010, pp. 1–6.
[18] J. E. Ruiz-Castro and G. Fernández-Villodre, “A complex discrete warm standby system with loss of units,” Eur. J. Oper. Res., vol. 218, no. 2, pp. 456–469, 2012.
[19] C. Wang, L. Xing, and S. V. Amari, “A fast approximation method for reliability analysis of cold-standby systems,” Rel. Eng. Syst. Safety, vol. 106, pp. 119–126, Oct. 2012.
[20] D. Pandey, M. Jacob, and J. Yadav, “Reliability analysis of a powerloom plant with cold standby for its strategic unit,” Microelectron. Rel., vol. 36, no. 1, pp. 115–119, 1996.
[21] L. Xing, O. Tannous, and J. B. Dugan, “Reliability analysis of non-repairable cold-standby systems using sequential binary decision diagrams,” IEEE Trans. Syst., Man, Cybern., Part A: Syst. Humans, vol. 42, no. 3, pp. 715–726, May 2012.
[22] D. E. Fyffe, W. W. Hines, and N. K. Lee, “System reliability allocation and a computation algorithm,” IEEE Trans. Rel., vol. 17, pp. 64–69, 1968.
[23] K. B. Misra and U. Sharma, “An efficient algorithm to solve integer programming problems arising in system-reliability design,” IEEE Trans. Rel., vol. 40, no. 1, pp. 81–91, 1991.
[24] K. B. Misra, “Reliability optimization of a series-parallel system,” IEEE Trans. Rel., vol. R-21, no. 4, pp. 230–238, 1972.
[25] D. W. Coit and A. E. Smith, “Reliability optimization of series-parallel systems using a genetic algorithm,” IEEE Trans. Rel., vol. 45, no. 2, pp. 254–260, 1996.
[26] L. Y. Chia and A. E. Smith, “An ant colony optimization algorithm for the redundancy allocation problem (RAP),” IEEE Trans. Rel., vol. 53, no. 3, pp. 417–423, 2004.
[27] J. Onishi, S. Kimura, R. J. W. James, and Y. Nakagawa, “Solving the redundancy allocation problem with a mix of components using the improved surrogate constraint method,” IEEE Trans. Rel., vol. 56, no. 1, pp. 94–101, 2007.
[28] T.-C. Chen and P.-S. You, “Immune algorithms-based approach for redundant reliability problems with multiple component choices,” Comput. Ind., vol. 56, no. 2, pp. 195–205, 2005.
[29] D. W. Coit and A. E. Smith, “Genetic algorithm to maximize a lower-bound for system time-to-failure with uncertain component Weibull parameters,” Comput. Ind. Eng., vol. 41, pp. 423–440, 2002.
[30] D. W. Coit, “Cold-standby redundancy optimization for non-repairable systems,” IIE Trans., vol. 33, pp. 471–478, 2001.
[31] R. Zhao and B. Liu, “Standby redundancy optimization problems with fuzzy lifetimes,” Comput. Ind. Eng., vol. 49, no. 2, pp. 318–338, 2005.
[32] D. W. Coit and J. Liu, “System reliability optimization with k-out-of-n subsystems,” Int. J. Rel., Qual., Safety Eng., vol. 7, no. 2, pp. 129–143, 2000.
[33] D. W. Coit, “Maximization of system reliability with a choice of redundancy strategies,” IIE Trans., vol. 35, no. 6, pp. 535–544, 2003.
[34] P. Boddu and L. Xing, “Reliability evaluation and optimization of series-parallel systems with k-out-of-n: G subsystems and mixed redundancy types,” Proc. IMechE, Part O, J. Risk Rel., vol. 227, no. 2, pp. 187–198, Apr. 2013.
[35] P. Boddu, L. Xing, and O. Tannous, “Optimal design of heterogeneous series-parallel systems with mixed redundancy types,” in Proc. 7th Int. Conf. Mathematical Methods in Reliability (MMR 2011), Beijing, China, Jun. 2011, pp. 99–105.
[36] G. Levitin, L. Xing, and Y. Dai, “Cold-standby sequencing optimization considering mission cost,” Rel. Eng. Syst. Safety, vol. 118, pp. 28–34, Oct. 2013.
[37] G. Levitin, L. Xing, and Y. Dai, “Sequencing optimization in k-out-of-n cold-standby systems considering mission cost,” Int. J. General Syst., vol. 42, no. 8, pp. 870–882, 2013.
[38] G. Levitin, L. Xing, and Y. Dai, “Minimum mission cost cold-standby sequencing in non-repairable multi-phase systems,” IEEE Trans. Rel., vol. 63, no. 1, pp. 251–258, Mar. 2014.
[39] V. da, C. Bueno, and I. M. do Carmo, “Active redundancy allocation for a k-out-of-n: F system of dependent components,” Eur. J. Oper. Res., vol. 176, no. 2, pp. 1041–1051, 2007.
[40] A. Chambari, S. Rahmati, A. Najafi, and A. Karimi, “A bi-objective model to optimize reliability and cost of system with a choice of redundancy strategies,” Comput. Ind. Eng., vol. 63, no. 1, pp. 109–119, 2012.
[41] R. Zhao and B. Liu, “Redundancy optimization problems with uncertainty of combining randomness and fuzziness,” Eur. J. Oper. Res., vol. 157, no. 3, pp. 716–735, 2004.
[42] O. Tannous, L. Xing, R. Peng, M. Xie, and S. H. Ng, “Redundancy allocation for series-parallel warm-standby systems,” in Proc. IEEE Int. Conf. Industrial Engineering and Engineering Management, Singapore, Dec. 2011.
[43] G. Levitin, L. Xing, and Y. Dai, “Mission cost and reliability of 1-out-of-N warm standby systems with imperfect switching mechanisms,” IEEE Trans. Syst., Man, Cybern.: Syst., vol. 44, no. 9, pp. 1262–1271, Sep. 2014.
[44] G. Levitin, L. Xing, B. W. Johnson, and Y. Dai, “Mission reliability, cost and time for cold standby computing systems with periodic backup,” IEEE Trans. Comput., to be published.
[45] G. Levitin, L. Xing, and Y. Dai, “Reliability of non-coherent warm standby systems with reworking,” IEEE Trans. Rel., to be published.
[46] A. A. Alhadeed and S. S. Yang, “Optimal simple step-stress plan for cumulative exposure model using log-normal distribution,” IEEE Trans. Rel., vol. 54, no. 1, pp. 64–68, 2005.
[47] S. V. Amari, K. B. Misra, and H. Pham, “Tampered failure rate load-sharing systems: Status and perspectives,” in Handbook of Performability Engineering, K. B. Misra, Ed. New York, NY, USA: Springer, 2008, ch. 20, pp. 291–308.
[48] G. Levitin, “Genetic algorithms in reliability engineering. Guest editorial,” Rel. Eng. Syst. Safety, vol. 91, no. 9, pp. 975–976, 2006.
[49] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA, USA: Addison-Wesley, 1989.

Gregory Levitin (M'97–SM'99) is presently a distinguished visiting professor at University of Electronic Science and Technology of China, and a senior expert at the Reliability Department of the Israel Electric Corporation. His current interests are in operations research and artificial intelligence applications in reliability, defense, and power systems. In this field, he has published more than 230 papers, and four books.


Prof. Levitin is chair of the ESRA Technical Committee on System Reliability. He is an Associate Editor of the IEEE TRANSACTIONS ON RELIABILITY, area coordinator of International Journal of Performability Engineering, and member of editorial boards of Reliability Engineering & System Safety, Journal of Risk and Reliability, and Reliability and Quality Performance.

Liudong Xing (S'00–M'02–SM'07) received the B.E. degree in computer science from Zhengzhou University, China in 1996; and the M.S. and Ph.D. degrees in electrical engineering from the University of Virginia in 2000 and 2002, respectively. She is a Professor with the Department of Electrical and Computer Engineering, University of Massachusetts (UMass) Dartmouth, USA. Her research focuses on reliability modeling and analysis of complex systems and networks. Prof. Xing is an Associate Editor for International Journal of Systems Science and International Journal of Systems Science: Operations & Logistics. She is also an Assistant Editor-in-Chief for International Journal of Performability Engineering. She is the recipient of the 2010 Scholar of the Year Award, 2011 Outstanding Women Award of UMass Dartmouth, and the IEEE Region 1 Technological Innovation (Academic) Award in 2007. She is also the co-recipient of the Best Paper Award at the IEEE International Conference on Networking, Architecture, and Storage in 2009.

Yuanshun Dai (S'02–M'03) received the B.S. degree from Tsinghua University, Beijing, China, in 2000, and the Ph.D. degree from the National University of Singapore, Singapore, in 2003. He is Associate Dean of the School of Computer Science and Engineering, University of Electronic Science and Technology of China. He is also a Chaired Professor, and the Director of the Collaborative Autonomic Computing (CAC) Laboratory. He serves as Chairman of the Professor Committee in the School since 2012, and as the Associate Director at the Youth Committee of the “National 1000er Plan” in China. He has published more than 100 papers and 5 books, where there are 50 papers indexed by SCI including 25 IEEE/ACM Transactions papers. His current research interests include Cloud Computing and Big Data, Reliability and Security, Modeling, and Optimization. Dr. Dai has served as a Guest Editor of the IEEE TRANSACTIONS ON RELIABILITY. He is also on the editorial boards of several journals.