Chapter 14
Virtual Accelerated Life Testing of Complex Systems
Michael T. Todinov
Abstract. A method has been developed for virtual accelerated testing of complex systems. The method includes an algorithm and a software tool for extrapolating the life of a complex system from the accelerated lives of its components. This makes the expensive task of building test rigs for life testing of complex engineering systems unnecessary and drastically reduces the amount of time and resources needed for accelerated life testing of complex systems. The impact of the acceleration stresses on the reliability of a complex system can also be determined by using the developed method. The proposed method is based on Monte Carlo simulation and is particularly suitable for topologically complex systems containing a large number of components. The method also includes an algorithm for finding paths in complex networks. Compared to existing path-finding algorithms, the proposed algorithm determines the existence of paths to multiple end nodes and not only to a single end node. This makes the proposed algorithm ideal for revealing the reliability of engineering systems where more than a single operating component is controlled.
Michael T. Todinov, Department of Mechanical Engineering and Mathematical Sciences, Oxford Brookes University, Oxford OX33 1HX, UK, e-mail: [email protected]
P. Bouvry et al. (Eds.): Intelligent Decision Systems, SCI 362, pp. 293–314. © Springer-Verlag Berlin Heidelberg 2011, springerlink.com
14.1 Introduction
Acceleration stress is anything that leads to accumulation of damage and faster wearout. Examples of acceleration stresses are temperature, humidity, cycling, vibration, speed, pressure, voltage, current, concentration of particular ions, etc. This list is only a sample of possible acceleration stresses and can be extended significantly. Because acceleration stresses lead to a faster wearout, they entail a higher propensity to failure for groups of components. Components affected by an
acceleration stress acting as a common cause are more likely to fail, which reduces the overall system reliability. A typical example of this type of common cause failure is high temperature, which increases the susceptibility to deterioration of several electronic components. By simultaneously increasing the hazard rates of the affected components, deterioration due to a high temperature increases the probability of system failure. Humidity, corrosion or vibrations affect all components exposed to the acceleration stress. This increases the joint probability of failure of the affected components and shortens the system's life.
A common cause failure is usually due to a single cause with multiple failure effects which are not consequences of one another [1]. Acceleration stresses acting as common causes increase the joint probability of failure for groups of components or for all components in a complex system. Even in blocks with a high level of built-in redundancy, in case of a common cause failure, all redundant components in the block can fail within a short period of time and the advantage from the built-in redundancy is lost. Failure to account for the acceleration stresses acting as common causes usually leads to optimistic reliability predictions: the actual reliability is smaller than the predicted one.
For a number of common engineering components, accelerated life models already exist. They have been built by using a well-documented methodology [3, 6, 8]. Building an accelerated life model for a component starts with the time to failure model for the component. The time to failure model gives the distribution of the time to failure of each component in the system [6, 8]. The most common time to failure model is the Weibull distribution:
$$F(t) = 1 - \exp\left[-(t/\eta)^{\beta}\right] \qquad (14.1)$$
where F(t) is the cumulative distribution of the time to failure, and β (shape parameter) and η (characteristic life/scale parameter) are constants determined from experimental data. This model is commonly used in the case where the hazard rate depends on the age of the component. Another common time to failure model is the negative exponential distribution:
$$F(t) = 1 - \exp\left[-(t/\mathrm{MTTF})\right] \qquad (14.2)$$
where F(t) is the cumulative distribution of the time to failure and MTTF is the mean time to failure. The negative exponential distribution can be obtained as a special case from the Weibull distribution for β = 1 and is used in cases where the hazard rate characterising the component does not practically depend on its age. The scale parameter η in the Weibull distribution and the mean time to failure MTTF in the negative exponential distribution depend on the acceleration stresses through the stress-life relationships [3, 6, 8, 11]. When the stress-life dependence is substituted in equations 14.1 and 14.2, the acceleration time to failure model for the component is obtained. The acceleration time to failure model is the time to failure model at particular levels of the acceleration stresses.
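The two time to failure models (14.1) and (14.2) can be written down directly. The following is a minimal Python sketch; the function names and the numerical parameter values are illustrative assumptions and are not taken from the chapter.

```python
import math

def weibull_cdf(t, eta, beta):
    """Equation (14.1): F(t) = 1 - exp[-(t/eta)^beta]."""
    return 1.0 - math.exp(-((t / eta) ** beta))

def exponential_cdf(t, mttf):
    """Equation (14.2): F(t) = 1 - exp(-t/MTTF)."""
    return 1.0 - math.exp(-t / mttf)

# Assumed, purely illustrative values (time in years).
t, eta = 2.0, 5.0

# With beta = 1 the Weibull model reduces to the negative exponential model.
special_case_matches = math.isclose(weibull_cdf(t, eta, 1.0),
                                    exponential_cdf(t, eta))

# At t = eta the Weibull CDF equals 1 - 1/e whatever the shape parameter.
at_characteristic_life = weibull_cdf(eta, eta, 2.5)
```

The second check illustrates why η is called the characteristic life: about 63.2% of the components are expected to have failed by t = η.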
14.1.1 Arrhenius Stress-Life Relationship and Arrhenius-Type Acceleration Life Models
For this type of accelerated life model, the relationship between the life and the level V of the acceleration stress is
$$L(V) = C \exp(B/V) \qquad (14.3)$$
where L(V) is a quantifiable life measure and C and B are constants obtained from experimental measurements. The Arrhenius stress-life relationship is appropriate in cases where the acceleration stress is thermal, for example temperature. The temperature values must be in absolute units [K]. In the case of a Weibull time to failure model
$$L(V) \equiv \eta = C \exp(B/V) \qquad (14.4)$$
where η is the characteristic life (scale parameter) calculated in years. Substituting this in the Weibull time to failure model 14.1 yields the Arrhenius-Weibull time-to-failure accelerated life model:
$$F(t, V) = 1 - \exp\left(-\left[\frac{t}{C \exp(B/V)}\right]^{\beta}\right) \qquad (14.5)$$
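As an illustration of equations (14.4) and (14.5), the sketch below evaluates the Arrhenius-Weibull model in Python. The constants C, B and β and the two temperature levels are hypothetical values chosen only for the example, and the acceleration factor computed at the end is a common derived quantity rather than something defined in the text.

```python
import math

def arrhenius_eta(V, C, B):
    """Equation (14.4): characteristic life eta = C * exp(B / V), V in kelvin."""
    return C * math.exp(B / V)

def arrhenius_weibull_cdf(t, V, C, B, beta):
    """Equation (14.5): probability of failure before time t at stress level V."""
    return 1.0 - math.exp(-((t / arrhenius_eta(V, C, B)) ** beta))

# Hypothetical constants (not from the chapter).
C, B, beta = 1.0e-3, 3000.0, 1.5
V_use, V_test = 293.15, 393.15   # normal and elevated temperature in K

# Ratio of characteristic lives: how much faster the elevated level wears the part.
acceleration_factor = arrhenius_eta(V_use, C, B) / arrhenius_eta(V_test, C, B)

# For the same time t, the probability of failure is higher at the elevated level.
p_use = arrhenius_weibull_cdf(1.0, V_use, C, B, beta)
p_test = arrhenius_weibull_cdf(1.0, V_test, C, B, beta)
```

Because C cancels in the ratio, the acceleration factor depends only on B and the two temperatures.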
14.1.2 Inverse Power Law Relationship (IPL) and IPL-Type Acceleration Life Models
The relationship between the life of the component and the level V of the acceleration stress is
$$L(V) = \frac{1}{K V^{n}} \qquad (14.6)$$
where L(V) is a quantifiable life measure; K and n are constants obtained from experimental measurements. The IPL stress-life relationship is appropriate for non-thermal acceleration stresses like 'load', 'pressure', 'contact stress'. It can also be applied in cases where V is a stress range or even in cases where V is a temperature range (in case of fatigue caused by thermal cycling). In the case of a Weibull time to failure model, the life measure is assumed to be the characteristic life:
$$L(V) \equiv \eta = \frac{1}{K V^{n}} \qquad (14.7)$$
where η is the characteristic life (scale parameter). Substituting this in the Weibull time to failure model 14.1 yields the IPL-Weibull accelerated life model:
$$F(t, V) = 1 - \exp\left[-(t K V^{n})^{\beta}\right] \qquad (14.8)$$
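A short Python sketch of equations (14.7) and (14.8) follows. The constants K, n and β and the stress levels are hypothetical. The example shows one consequence of the inverse power law: doubling the stress level divides the characteristic life by 2^n.

```python
import math

def ipl_eta(V, K, n):
    """Equation (14.7): characteristic life eta = 1 / (K * V^n)."""
    return 1.0 / (K * V ** n)

def ipl_weibull_cdf(t, V, K, n, beta):
    """Equation (14.8): F(t, V) = 1 - exp[-(t * K * V^n)^beta]."""
    return 1.0 - math.exp(-((t * K * V ** n) ** beta))

# Hypothetical constants (not from the chapter).
K, n, beta = 1.0e-6, 3.0, 2.0

# Doubling the stress from 100 to 200 units shortens the life by a factor 2^n.
life_ratio = ipl_eta(100.0, K, n) / ipl_eta(200.0, K, n)

# Equation (14.8) agrees with the generic Weibull form 1 - exp[-(t/eta)^beta].
t, V = 0.5, 100.0
generic = 1.0 - math.exp(-((t / ipl_eta(V, K, n)) ** beta))
direct = ipl_weibull_cdf(t, V, K, n, beta)
```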
14.1.3 Eyring Stress-Life Relationship and Eyring-Type Acceleration Life Models
The relationship between the life and the acceleration stress level V is
$$L(V) = \frac{1}{V} \exp\left[-(A - B/V)\right] \qquad (14.9)$$
where L(V) is a quantifiable life measure; A and B are constants obtained from experimental measurements. Similar to the Arrhenius stress-life relationship, the Eyring stress-life relationship is appropriate in the case of thermal acceleration stresses. It can, however, also be used for non-thermal acceleration stresses such as humidity. In the case of a Weibull time to failure model
$$L(V) \equiv \eta = \frac{1}{V} \exp\left[-(A - B/V)\right] \qquad (14.10)$$
where η is the characteristic life (scale parameter, calculated in years). Substituting this in the Weibull time to failure model 14.1 yields the Eyring-Weibull accelerated life model:
$$F(t, V) = 1 - \exp\left[-\left(t V \exp(A - B/V)\right)^{\beta}\right] \qquad (14.11)$$
There also exist stress-life models involving two acceleration stresses simultaneously, for example temperature and humidity [11]. Such are the Temperature-Humidity (TH) relationship and TH-type acceleration life models, and the Temperature-Non-thermal (T-NT) relationship and T-NT-type acceleration life models.
Despite the increasing research in both the area of accelerated life testing [5] and the area of common cause failure modeling [9, 7, 4], no models and software tools are currently available for building the accelerated life model of a complex system from the accelerated life models of its components.
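Returning to the single-stress Eyring model, equations (14.10) and (14.11) can be checked against each other numerically. The Python sketch below uses hypothetical values for A, B and β, and absolute temperatures chosen only for illustration.

```python
import math

def eyring_eta(V, A, B):
    """Equation (14.10): eta = (1/V) * exp[-(A - B/V)]."""
    return math.exp(-(A - B / V)) / V

def eyring_weibull_cdf(t, V, A, B, beta):
    """Equation (14.11): F(t, V) = 1 - exp[-(t * V * exp(A - B/V))^beta]."""
    return 1.0 - math.exp(-((t * V * math.exp(A - B / V)) ** beta))

# Hypothetical constants (not from the chapter).
A, B, beta = 5.0, 3000.0, 1.8
t, V = 0.05, 350.0

# (14.11) is the Weibull model (14.1) with eta taken from (14.10).
generic = 1.0 - math.exp(-((t / eyring_eta(V, A, B)) ** beta))
direct = eyring_weibull_cdf(t, V, A, B, beta)

# For positive B, a higher temperature gives a shorter characteristic life.
eta_350, eta_400 = eyring_eta(350.0, A, B), eyring_eta(400.0, A, B)
```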
14.1.4 A Motivation for the Proposed Method
The effect of the acceleration stresses acting as common causes in a complex system can be revealed if an accelerated life model for the system is built from the accelerated life models of its components. Apart from revealing the impact of the acceleration stresses on the system's performance, building an accelerated life model for a complex system has another significant advantage. During life testing of complex systems, estimating the system reliability under normal operating conditions requires special test rigs and a large amount of time and resources and can be a very complex and expensive task. This task, however, does not have to be addressed if a method is developed for building an accelerated life model of a complex system
from the accelerated life models of its components. Deducing the time to failure distribution of the complex system under normal operating conditions from the accelerated life models of its components will be referred to as 'virtual life testing of a complex system'. The significant advantages of the virtual accelerated life testing can be summarized as follows:
• The virtual accelerated life testing does not require building test rigs for the various engineering systems, which is an expensive and difficult task. In cases of very large systems, building such test rigs is impossible.
• The virtual accelerated life testing reduces drastically the amount of time and resources needed for accelerated life testing of complex engineering systems.
• It permits testing a large variety of systems built with components whose accelerated life models are known. The virtual accelerated life testing offers enormous flexibility in specifying various levels for the acceleration stresses.
Consequently, the objectives of this work are: (i) to propose an efficient algorithm for building an accelerated life model of a complex system from the accelerated life models of its components and (ii) to propose a software tool with the capability of extrapolating the system's life under normal operating conditions.
14.2 Limitations of Available Analytical Methods for Determining the Reliability of Large and Complex Systems
Many common engineering systems have reliability networks which cannot be described by a series arrangement or a combination of series-parallel arrangements. Consider for example the common system in Fig. 14.1(a) which consists of a power supply (PS), a power cable (PC), a block of four electronic switches (S1, S2, S3 and S4) and four electro-mechanical devices M1, M2, M3 and M4. The reliability of the electronic switches could be a problem, and the cable could suffer accidental damage as well. In many safety-critical applications, all of the electromechanical devices must be operational on demand. The reliability on demand of the system in Fig. 14.1(a) can be improved significantly by making the low-cost and low-reliability components redundant (e.g. the power cable and the electronic switches). As a result, from the system in Fig. 14.1(a) we arrive at the system in Fig. 14.1(b). This system has a significantly higher reliability than the simple system in Fig. 14.1(a) and has a number of useful potential applications in cases where uninterrupted power supply is highly desirable: a power supply to fans or pumps cooling chemical reactors, to pumps dispensing water in case of fire, to life support systems, to automatic shut-down systems, control systems, etc.
For the system in Fig. 14.1(b), the electro-mechanical device M1 for example will still operate if the power cable PC or the switch S1 fails, because power supply will be maintained through the alternative (redundant) power cable and switch. The same applies to the rest of the electromechanical devices. The power supply
to an electromechanical device will fail only if both power supply channels fail. The reliability network of the system in Fig. 14.1(b) is given in Fig. 14.2, where block/component 1 stands for the power supply, blocks ’2’ and ’3’ represent the two cables, blocks 4-11 represent the switches and blocks 12-15 the electro-mechanical devices.
Fig. 14.1 a) A simple system supplying power to four electromechanical devices; b) a highly reliable power supply system with complex topology.
The system in Fig. 14.2 fails whenever at least one of the components 12-15 stops operation because of failure or because of a loss of both power supply channels. This system is not a series-parallel system and its reliability cannot be obtained by a network reduction technique [10] oriented towards systems composed of components arranged in series and parallel. Telecommunication systems and electronic control systems can also have very complex reliability networks, which cannot be represented as series-parallel arrangements.
Fig. 14.2 A reliability network of the dual control system from Fig. 14.1(b).
The decomposition method [10] works for topologically complex systems, but is not suitable for large systems. Each selection of a key component splits a large system into two systems, each of which is in turn split into two new systems and so on. For a large number n of components in the initial system, the number of product systems generated from the selection of the key components increases exponentially and quickly becomes unmanageable.
Some of the most important methods for determining the reliability of complex networks are based on minimal paths and cut sets [10, 2, 1]. A path is a set of components which, when working, connect the start node with the end node, thereby guaranteeing that the system is in a working state. A minimal path is a path without loops. A cut set is a set of components which, when failed, disconnect the start node from the end node, so that the system is in a failed state. A minimal cut set is a cut set for which no component can be returned to a working state without creating a path between the start and the end node, thereby returning the system into a working state. For large and complex systems, an approach based on minimal paths and cut sets is not feasible because the number of minimal paths and cut sets increases exponentially with the size of the system. This can be demonstrated immediately with the example in Fig. 14.3. The reliability network in the figure has $N^{N} + N$ minimal cut sets and $N^{N+1}$ minimal paths. Even the moderate N = 10 results in $10^{10} + 10$ (more than ten billion) cut sets and $10^{11}$ (100 billion) minimal paths. The storage and manipulation of such a large number of cut sets and path sets is impossible.
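The growth quoted for the network in Fig. 14.3 can be reproduced with a few lines of Python. The two formulas are the ones stated in the text; the function names are chosen here for illustration only.

```python
def count_minimal_cut_sets(n):
    """Number of minimal cut sets quoted for Fig. 14.3: N^N + N."""
    return n ** n + n

def count_minimal_paths(n):
    """Number of minimal paths quoted for Fig. 14.3: N^(N+1)."""
    return n ** (n + 1)

cut_sets_10 = count_minimal_cut_sets(10)   # 10**10 + 10
paths_10 = count_minimal_paths(10)         # 10**11

# Growth for a few sizes, to show how quickly enumeration becomes impractical.
growth = [(n, count_minimal_cut_sets(n), count_minimal_paths(n)) for n in (2, 5, 10)]
```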
Fig. 14.3 An example of a system where the number of minimal cut sets and minimal paths increases exponentially with the size of the system.
As can be verified, in the general case, the reliability of large systems with complex topology cannot be revealed by analytical methods. The described limitations of the existing analytical methods for determining the reliability of large and complex systems can be avoided if the system reliability is determined as the probability that, at the end of a specified time interval, paths through working components exist from the start node to each of the end nodes of the corresponding reliability network.
14.3 Efficient Representation of Reliability Networks with Complex Topology and a Large Number of Components
The main concepts here will be explained on the basis of the reliability network of the dual-power supply system presented in Fig. 14.1(b). Each of the electromechanical devices (marked by circles with numbers 12-15) receives power from two channels, only one of which is sufficient to maintain the device working. A system failure occurs if an electromechanical device fails or if both power channels have been lost because of failures of components along the paths of the power channels. The reliability network has been modeled by a set of nodes (the filled small circles in Fig. 14.2, numbered from n1 to n12) and components (1, 2, 3, ... ,15) connecting them. The system works only if there exist paths through working components between the start node n1 and each of the end nodes marked by n9, n10, n11 and n12. As a result, the reliability network in Fig. 14.2 can be modeled conveniently by an undirected graph. The nodes are the vertices and the components that connect the nodes are the edges of the graph. Each component connects exactly two nodes. If node j can be reached from node i through a single component, node j will be referred to as 'immediately reachable from node i'. For example, n4 is immediately reachable from n2. The converse is also true because edge (n2, n4) is undirected. Node n5, however, is not immediately reachable from node n2 because there is no
single edge connecting these two nodes. Two nodes separated by a single component will be referred to as 'adjacent nodes'. Now, suppose that in order to increase the reliability of the system in Fig. 14.1(b), redundancy has been provided for some of the components. As a result, between any two adjacent nodes, there can be more than one component and the corresponding edges in the graph are called parallel edges.
14.3.1 Representing the Topology of a Complex Reliability Network by an Array of Pointers to Dynamic Arrays
The reliability network in Fig. 14.2 will be represented by an array of pointers (memory addresses) to adjacency dynamic arrays. The topology of the network in Fig. 14.2 is fully represented by the array N[12] of twelve pointers which correspond to the twelve nodes in the network (Fig. 14.4). For each pointer N[i], exactly the amount of memory necessary to accommodate the indices (the names) of all nodes adjacent to the i-th node is reserved. In this way, the topology of the network is described by twelve pointers and twelve dynamic arrays where N[i], i = 1, . . . , 12 is the address of the i-th dynamic array. N[i][0], which is the first cell of the i-th dynamic array, is reserved for the number of nodes adjacent to the i-th node. Thus N[3][0] = 5 because node 3 has exactly five adjacent nodes. The indices of the actual nodes adjacent to the i-th node are stored sequentially in N[i][1], N[i][2], . . ..
Fig. 14.4 Presenting the topology of the reliability network from Fig. 14.2 by an array N[12] of twelve pointers to dynamic arrays (adjacency arrays)
For the second node for example, N[2][1] = 1; N[2][2] = 3 and N[2][3] = 4, because the nodes with indices '1', '3' and '4' can be reached from node '2'. As can be verified, the representation of the reliability network topology by adjacency arrays is very compact. In order to keep track of the parallel edges in the reliability network, another set of dynamic arrays, called link arrays L[12], is also created, which corresponds one-to-one to the set of adjacency arrays N[12]. Instead of the index of a reachable node, however, L[i][j] contains the number of parallel edges from node i to the j-th node listed in N[i]. L[1][1] = 1, for example, gives the number of parallel edges
from node 1 to the node whose index is listed first in the N[1] dynamic array (the node index is kept in N[1][1]). All values in the link dynamic arrays from Fig. 14.5 are set to '1' because no redundant components (parallel edges) are present in the reliability network from Fig. 14.2 and because all edges are undirected. If, for example, two parallel edges exist between nodes i and j (two parallel components between nodes i and j), the corresponding value in the link arrays will be set to '2'. The structure of the link arrays for the network in Fig. 14.2 is given in Fig. 14.5.
Fig. 14.5 Structure of the link array L[12] for the reliability network in Fig. 14.2.
The adjacency arrays, together with the link arrays, fully describe the topology of any complex reliability network which can be represented as a graph. The structure based on adjacency arrays and link arrays is also suitable for describing reliability networks including a combination of directed and undirected edges or only directed edges. If the edge between two nodes i and j is a directed edge (represented by an arrow from node i to node j), the adjacency array N[i] will contain the node j but the adjacency array N[j] will not contain the node i. This is because node j is immediately reachable from node i but the converse is not true. Adjacency linked lists are common for representing network topology [13] and can also be used. The main features of linked lists, however, such as the possibility to insert and delete an element without the need to move the rest of the elements, are not used here because there is no need for such operations. As a result, the proposed solution based on adjacency arrays and link arrays is sufficient, simpler and more efficient compared to a representation based on linked lists.
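The adjacency arrays and link arrays described in this section can be sketched in Python, with lists standing in for the pointers to dynamic arrays. The helper `build_arrays` and the four-node network below are hypothetical and are not the network of Fig. 14.2; as in the text, cell 0 of each adjacency array holds the number of adjacent nodes.

```python
def build_arrays(num_nodes, components, directed=frozenset()):
    """Build adjacency arrays N and link arrays L (1-based node indices).

    components maps a component index to the pair of nodes (i, j) it connects;
    directed holds the indices of components that are directed edges i -> j.
    """
    N = [[0] for _ in range(num_nodes + 1)]  # N[i][0] = number of adjacent nodes
    L = [[0] for _ in range(num_nodes + 1)]  # L[i][k] = parallel edges i -> N[i][k]

    def add_edge(i, j):
        if j in N[i][1:]:
            L[i][N[i].index(j, 1)] += 1      # one more parallel edge to node j
        else:
            N[i].append(j)
            L[i].append(1)
            N[i][0] += 1

    for x, (i, j) in components.items():
        add_edge(i, j)
        if x not in directed:
            add_edge(j, i)                   # undirected: reachable both ways
    return N, L

# Hypothetical network: components 2 and 3 are parallel between nodes 2 and 3,
# component 4 is a directed edge from node 3 to node 4.
components = {1: (1, 2), 2: (2, 3), 3: (2, 3), 4: (3, 4)}
N, L = build_arrays(4, components, directed={4})
```

In this sketch `N[2]` is `[2, 1, 3]` and `L[2]` is `[0, 1, 2]`: node 2 has two adjacent nodes, with one edge to node 1 and two parallel edges to node 3. Node 4 has an empty adjacency array because the only edge touching it is directed towards it.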
14.3.2 Updating the Link Arrays after a Component Failure
The link arrays keep track of the topology of the system, which changes dynamically as the separate components fail during operation. When components fail, edges between the nodes of the reliability network disappear and the corresponding entries in the link arrays need to be updated upon each component failure. Three basic pieces of information are kept for each component: the component index and the two nodes to which it is connected. Suppose that a component with index x is connected to
nodes i and j. Failure of component x causes one of the links between nodes i and j to disappear. Consequently, the link arrays are updated upon failure of component x. For this purpose, two specially designed arrays named IJ_link and JI_link, with a number of cells equal to the number of components in the reliability network, are created. If the failed component x corresponds to an edge (i, j) directed from node i to node j, IJ_link[x] contains the address of Link[i][j], which holds the number of parallel edges from node i to node j. The value contained in Link[i][j] is updated immediately by Link[i][j] = Link[i][j] − 1. If the failed component x corresponds to an undirected edge (i, j), two values in the link arrays are updated immediately upon failure: Link[i][j] = Link[i][j] − 1 and also Link[j][i] = Link[j][i] − 1. The updating reflects the circumstance that failure of component x has simultaneously decreased both the number of edges going from i to j and the number of edges going from j to i.
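The update rule can be sketched in Python as follows. Python has no raw addresses, so a (node, position) pair stands in for the address of a link-array cell. It is also assumed here that JI_link[x] locates Link[j][i] for an undirected component and is empty for a directed one; the text does not state this explicitly. The three-node network is hypothetical.

```python
# Hypothetical network: components 1 and 2 are parallel, undirected edges
# between nodes 1 and 2; component 3 is a directed edge from node 2 to node 3.
N = [[0], [1, 2], [2, 1, 3], [0]]      # adjacency arrays, N[i][0] = count
L = [[0], [0, 2], [0, 2, 1], [0]]      # link arrays, L[i][k] pairs with N[i][k]

# (node, position) pairs play the role of the addresses of link-array cells.
IJ_link = {1: (1, 1), 2: (1, 1), 3: (2, 2)}
JI_link = {1: (2, 1), 2: (2, 1), 3: None}   # None: directed edge, no reverse link

def fail_component(x, L, IJ_link, JI_link):
    """Remove one edge i -> j and, for an undirected component, one edge j -> i."""
    i, k = IJ_link[x]
    L[i][k] -= 1
    if JI_link[x] is not None:
        j, k = JI_link[x]
        L[j][k] -= 1

fail_component(1, L, IJ_link, JI_link)   # one of the two parallel components fails
after_first = [row[:] for row in L]
fail_component(3, L, IJ_link, JI_link)   # the directed component fails
```

Each failure costs a constant number of operations, because the cells to decrement are located directly instead of being searched for in the adjacency arrays.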
14.4 Existence of Paths to Each End Node in a Complex Reliability Network Represented by Adjacency Arrays and Link Arrays
Reliability of a system presented by a reliability network can be conveniently defined as the probability of existence of paths through working components, from the start node to each of the end nodes, at the end of a specified time interval. The reliability network in Fig. 14.2 for example has four end nodes (n9, n10, n11 and n12). If no path through working components exists from the start node n1 to the node n9 for example, this means either that component (12) has failed or that both power channels to component (12) have been lost because of component failures. In both cases, lack of a path from the start node n1 to the end node n9 means 'a system failure'. The system is considered to be in a failed state if at least one of the components 12, 13, 14 or 15 stops working. Consequently, central to determining the probability of a system failure is an algorithm for determining the existence of a path through working components to each end node. An algorithm in pseudocode performing this task is presented next. The function path_to_each_end_node() returns '1' if a path between the start node and each of the end nodes exists and '0' otherwise.
The algorithm works as follows. A stack 'stack[]' is formed first, and the start node is put into the stack. While there exists at least a single node in the stack, the node 'r_node' at the top of the stack is removed and marked as 'visited'. This is achieved by using the array 'marked[]' containing information about which nodes have been visited. The array 'visited_end_nodes[]' contains information about which end nodes have been reached. The counter 'num_reached_end_nodes' counts the number of different reached end nodes; the statement 'end_nodes_zone = num_nodes - num_end_nodes +1' marks the index from which the end nodes start, that is, the end nodes are numbered last.
Algorithm 16. function path_to_each_end_node()
stack[]; sp, num_nodes, num_end_nodes;
marked[], in_the_stack[], visited_end_nodes[];
num_reached_end_nodes = 0;
end_nodes_zone = num_nodes - num_end_nodes + 1;
for i=1 to num_nodes do
    marked[i]=0; in_the_stack[i]=0;
end for
for i=1 to num_end_nodes do
    visited_end_nodes[i]=0;
end for
sp=1; stack[sp]=1;
while (sp > 0) do
    r_node = stack[sp]; marked[r_node]=1; sp=sp-1;
    for i=1 to N[r_node][0] do
        node_index = N[r_node][i];
        if (marked[node_index]=0 and L[r_node][node_index] > 0) then
            if (node_index >= end_nodes_zone) then
                if (visited_end_nodes[node_index - end_nodes_zone + 1] = 0) then
                    visited_end_nodes[node_index - end_nodes_zone + 1]=1;
                    num_reached_end_nodes = num_reached_end_nodes+1;
                    if (num_reached_end_nodes = num_end_nodes) then
                        return 1;
                    end if
                end if
            else
                if (in_the_stack[node_index]=0) then
                    sp=sp+1; stack[sp]=node_index;
                    in_the_stack[node_index]=1;
                end if
            end if
        end if
    end for
end while
return 0;
Next, a check is conducted whether an end node is among the immediately reachable non-marked (non-visited) nodes of the removed node 'r_node'. If an end node which has not been reached before is among them, it is flagged as reached, and the function returns true ('1') as soon as all end nodes have been reached. All other non-marked nodes immediately reachable from the removed node are stored in the stack, but only if they are not already there. If node 'i' is reachable from the removed node 'r_node', this is indicated by a greater than zero element
L[r_node][i] > 0 in the corresponding link array. The statement 'node_index = N[r_node][i]' retrieves the index of the node which is immediately reachable from node 'r_node', removed from the top of the stack. Using the conditional statement 'if (marked[node_index]=0 and L[r_node][node_index] > 0)', another check is performed in the link arrays whether there are still parallel edges remaining from 'r_node' to the current node; here L[r_node][node_index] denotes the link-array entry for the edge from 'r_node' to 'node_index', which is stored at position i of L[r_node]. A value 'L[r_node][node_index]=0' indicates that all parallel edges (components) from node 'r_node' to node 'node_index' have failed. The algorithm then continues with removing another node from the top of the stack. By removing nodes from the top of the stack, the network is traversed depth-first, towards the end nodes, so that end nodes tend to be discovered quickly. If a path to an end node has not been found, a non-visited node is pulled from the top of the stack and an alternative path is explored. If the stack becomes empty before all end nodes have been reached, no path exists between the start node and at least one of the end nodes. After visiting an end node, the node is flagged as 'reached' and the search for paths to the end nodes continues until all end nodes have been marked as 'reached'. The 'num_reached_end_nodes' counter is incremented only if a non-reached end node is encountered. If the 'num_reached_end_nodes' counter becomes equal to the number of end nodes, paths to all of the end nodes exist and the function 'path_to_each_end_node()' returns '1' (true). Otherwise, there exists at least one end node to which a path through working components does not exist and the function returns '0'. An important feature of the proposed algorithm is that loading of nodes already existing in the stack is prevented by the statement 'if (in_the_stack[node_index]=0) then'. The array 'in_the_stack[node_index]' contains '1' if the node with index 'node_index' is already in the stack.
Consequently, the maximum number of nodes that can be pushed into the stack is limited to the number of nodes n in the network. The block of statements in the loop 'for i=1 to N[r_node][0] do' checks the adjacent nodes of the current node. The worst-case number of these checks is O(m), where m is the number of edges. After these checks, the algorithm will mark all nodes in the network and terminate its work. The worst-case running time of the algorithm is therefore O(m). If the size of the system is increased, no matter how complicated the system topology is, the computation time will be proportional to the number of components in the network. Although path-finding algorithms based on depth-first search have already been presented in the literature [13], these methods deal with a single end node and are not suitable for reliability networks with multiple end nodes, which are typical for engineering systems where more than a single operating component is controlled. The proposed algorithm searches for paths to multiple end nodes and is ideal for revealing the reliability of engineering systems where multiple operating components are controlled. The proposed algorithm also uses adjacency arrays and link arrays rather than linked lists, and the implementation is not based on recursion.
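A Python sketch of the path-finding function described in this section is given below. It is an illustration under stated assumptions and not the author's software tool: node 1 is the start node, the end nodes carry the highest indices, the link arrays are indexed by position (L[r_node][i] pairs with N[r_node][i]), and the array of reached end nodes is indexed by node index for simplicity. The six-node network is hypothetical.

```python
def path_to_each_end_node(N, L, num_nodes, num_end_nodes):
    """Return 1 if every end node is reachable from node 1 through working
    components, otherwise 0 (iterative depth-first search with an explicit stack)."""
    end_nodes_zone = num_nodes - num_end_nodes + 1
    marked = [0] * (num_nodes + 1)
    in_the_stack = [0] * (num_nodes + 1)
    visited_end_nodes = [0] * (num_nodes + 1)
    num_reached_end_nodes = 0
    stack = [1]
    in_the_stack[1] = 1
    while stack:
        r_node = stack.pop()
        marked[r_node] = 1
        for i in range(1, N[r_node][0] + 1):
            node_index = N[r_node][i]
            if marked[node_index] == 0 and L[r_node][i] > 0:
                if node_index >= end_nodes_zone:
                    if visited_end_nodes[node_index] == 0:
                        visited_end_nodes[node_index] = 1
                        num_reached_end_nodes += 1
                        if num_reached_end_nodes == num_end_nodes:
                            return 1
                elif in_the_stack[node_index] == 0:
                    stack.append(node_index)
                    in_the_stack[node_index] = 1
    return 0

# Hypothetical network: edges 1-2, 1-3, 2-4, 3-4, 4-5, 4-6; end nodes are 5 and 6.
N = [[0], [2, 2, 3], [2, 1, 4], [2, 1, 4], [4, 2, 3, 5, 6], [1, 4], [1, 4]]
L = [[0], [0, 1, 1], [0, 1, 1], [0, 1, 1], [0, 1, 1, 1, 1], [0, 1], [0, 1]]

intact = path_to_each_end_node(N, L, 6, 2)

L[1][1] = 0; L[2][1] = 0                 # the component between nodes 1 and 2 fails
one_channel_lost = path_to_each_end_node(N, L, 6, 2)

L[4][4] = 0; L[6][1] = 0                 # the component between nodes 4 and 6 fails
end_node_cut_off = path_to_each_end_node(N, L, 6, 2)
```

The system survives the loss of one channel, since node 4 is still fed through node 3, but fails once end node 6 is cut off. As in the pseudocode, end nodes are never pushed onto the stack, so the search does not continue through an end node.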
14.5 Accelerated Time to Failure of a Complex System
The method for determining the accelerated time to failure of a complex system consists of the following. After the accelerated times to failure of the separate components have been generated, they are placed in a linked list. Subsequently, the linked list is traversed and the system topology updated, considering the failed components. Upon each update of the system topology, a system reliability analysis is performed to check whether the system is in an operational state. If the system is working and the end of the operation interval has not been reached, traversing of the list of component failure times is continued. If the system is working and the end of the operation interval has been reached, the reliability counter is incremented and a new simulation trial is initiated. If the system is in a failed state after a component failure, the failure counter is incremented, the time to failure of the component becomes a time to failure of the system and this time to failure is placed in the array containing the times to failure of the system from the simulation trials. The algorithm in pseudocode is presented next:
Algorithm 17.
function path_to_each_end_node();
procedure make_copy_of_link_arrays();
function generate_time_to_failure(k);
f_counter = 0; rel_counter = 0;
for k = 1 to Number_of_trials do
    make_copy_of_link_arrays();
    for i = 1 to Number_of_components do
        time_to_failure = generate_time_to_failure(i);
        place the sampled time to failure into the list of times to failure;
    end for
    while (there are components in the list of failed components) do
        retrieve and remove from the list the component 'm', characterised by the smallest time to failure;
        update the copy of the link arrays;
        paths_exist = path_to_each_end_node();
        if (paths_exist = 0) then
            f_counter = f_counter + 1;
            cumul_array[f_counter] = the time to failure of the component with index m;
            if (the time to failure of the component with index m > a) then
                rel_counter = rel_counter + 1;
            end if
            break;
        end if
    end while
end for
Reliability = rel_counter/Number_of_trials;
Sort in ascending order the times to failure of the system in cumul_array[]
Here we show step by step how the algorithm works.
The function ‘path_to_each_end_node()’ determines the existence of paths from the start node to each of the end nodes and has been defined in the previous section. It returns ‘1’ if a path to each end node exists; otherwise, it returns ‘0’. The procedure ‘make_copy_of_link_arrays()’ creates a copy of the link dynamic arrays, which is subsequently used for updating the network topology upon failure of components. The function ‘generate_time_to_failure(k)’ returns the time to failure of the component with index ‘k’.
A counter named ‘f_counter’ is initialised first. It accumulates the number of trials resulting in non-existence of a path through working components from the start node to one of the end nodes. In other words, ‘f_counter’ counts the number of system failures during the simulations. The counter ‘rel_counter’ is also initialised. It accumulates the number of simulation trials resulting in no failure of the system during the specified time interval (0, a).
Next, a Monte Carlo simulation loop with control variable k is entered. For each simulation trial, a fresh copy of the link dynamic arrays L[] is made and subsequently used to update the system topology upon component failures. The times to failure characterizing the separate components in the network are generated in the nested loop with control variable i, by using standard sampling methods [12]. The sampled time to failure of each component is inserted into the list of component times to failure in such a way that the list always remains sorted in ascending order. Consequently, the smallest time to failure is always at the head of the list. In the while-do loop, the failed components for the current simulation trial are scanned one by one, always starting with the component with the smallest time to failure. This procedure is essentially an efficient way of tracking all discrete events (failures) in the system.
Retrieving the smallest time to failure essentially means that the corresponding component has failed at that point in time. This permits the state of the system (working or failed) to be checked after each component failure. After scanning each failed component, the copy of the Link arrays is immediately updated in order to reflect the changing network topology. Due to the two additional arrays (IJ_link[] and JI_link[]), the Link dynamic arrays are updated very efficiently upon component failure, without any searching. After updating the corresponding parts of the Link dynamic arrays upon a component failure, the function path_to_each_end_node() is called, and a check is performed whether it has returned zero. In other words, a check is performed whether there exist paths through working components from the start node to each of the end nodes of the system. Nonexistence of a path to even a single end node indicates a system failure and the ’f_counter’ is incremented by one. For the next Monte Carlo simulation trial, the content of the Link-arrays is restored by making a copy from the original Link-arrays, corresponding to a state where no component failures exist. The ’break’ statement in the algorithm means that upon encountering it, the current loop is exited immediately.
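The simulation loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: a binary heap stands in for the ordered linked list, a per-component `alive` flag stands in for the copy of the Link dynamic arrays, a breadth-first search stands in for path_to_each_end_node(), and every component is assumed to have a negative exponential time to failure with a given mean. The function names and the `edges`/`mttf` input formats are hypothetical.

```python
import heapq
import random
from collections import deque


def path_to_each_end_node(adj, alive, start, end_nodes):
    """Return True if every end node is reachable from start through working components."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for comp, nxt in adj.get(node, ()):
            if alive[comp] and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return all(e in seen for e in end_nodes)


def simulate(edges, mttf, start, end_nodes, a, n_trials, seed=0):
    """edges: {component: (first_node, second_node, directed)}; mttf: {component: mean life}.

    Assumes exponential component lives and that start is not itself an end node.
    Returns (reliability estimate for (0, a), sorted system times to failure).
    """
    rng = random.Random(seed)
    adj = {}
    for comp, (u, v, directed) in edges.items():
        adj.setdefault(u, []).append((comp, v))
        if not directed:
            adj.setdefault(v, []).append((comp, u))

    rel_counter = 0
    cumul_array = []
    for _ in range(n_trials):
        alive = {comp: True for comp in edges}  # fresh, failure-free topology
        heap = [(rng.expovariate(1.0 / mttf[c]), c) for c in edges]
        heapq.heapify(heap)  # smallest time to failure is always popped first
        while heap:
            t, comp = heapq.heappop(heap)
            alive[comp] = False  # this component has failed at time t
            if not path_to_each_end_node(adj, alive, start, end_nodes):
                cumul_array.append(t)  # component failure time becomes system failure time
                if t > a:
                    rel_counter += 1  # system survived the interval (0, a)
                break
    cumul_array.sort()
    return rel_counter / n_trials, cumul_array
```

For two components in series with unit mean lives, the estimate for a = 0.5 should be close to the analytical value exp(−1), which gives a quick sanity check of the sketch.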
The reliability of the system associated with a specified time interval (0, a) is calculated by counting the number of trials for which system failure has not occurred during the interval and dividing this number by the total number of simulation trials.
For a single simulation trial, the worst-case complexity of this algorithm can be estimated as follows. Suppose that m is the number of components in the network. The largest possible number of component failures per simulation is m. As shown earlier, for each failure, the worst-case number of operations to verify whether the system is in a failed state is $O(m)$. Consequently, the running time of the proposed algorithm per single simulation is $O(m^2)$.
At the end of the simulations, the times to failure of the system, which have been stored in ‘cumul_array[]’, are sorted in ascending order, for example, by using the Quicksort algorithm [13]. The minimum failure-free operating period (MFFOP) guaranteed with probability 1 − α (0 < α < 1) [14] is calculated from the sorted array containing the time to failure distribution of the system. A cut-off point is calculated first from ‘cut_off_point = [α × f_counter]’, where [α × f_counter] is the integer part of the product α × f_counter. The MFFOP guaranteed with probability 1 − α is then determined from MFFOP(α) = cumul_array[cut_off_point]. The MTTF for the system is obtained by determining the expected value of the system time to failure from the sorted array containing the times to failure of the system.
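The post-processing step can be sketched as follows. This is an illustrative Python fragment, not the original tool: the function names are hypothetical, and because the pseudocode fills cumul_array starting at index 1, the sketch assumes 1-based indexing in the source and shifts the cut-off point by one for Python's 0-based lists. The MTTF is estimated as the arithmetic mean of the simulated system times to failure.

```python
import math


def mffop(cumul_array, alpha):
    """MFFOP guaranteed with probability 1 - alpha, from sorted system failure times.

    Assumption: the source array is 1-based, so cumul_array[cut_off_point] in the
    text corresponds to index cut_off_point - 1 here (clamped at 0).
    """
    f_counter = len(cumul_array)
    cut_off_point = math.floor(alpha * f_counter)  # integer part of alpha * f_counter
    return cumul_array[max(cut_off_point - 1, 0)]


def mttf(cumul_array):
    """Sample mean of the simulated system times to failure."""
    return sum(cumul_array) / len(cumul_array)
```

With ten sorted failure times and α = 0.5, the cut-off point is 5, so the fifth smallest time is returned.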
14.6 A Software Tool
The input information is coded in input text files. Special software routines have been developed for reading the input files and presenting the necessary information in appropriate data structures. The input is based on four text files: general_system_data.txt, acceleration_stresses.txt, system_structure.txt and failure_modes.txt, where all necessary input information is provided by the user. Organising the input in separate files simplifies the input and reduces the possibility of errors. The input files can be edited with any available text editor.
The first input file (general_system_data.txt) specifies general information about the system: the total number of components in the system, the total number of nodes, the total number of end nodes and the total number of different types of acceleration stresses. The number of simulation trials, the required confidence level 1 − α for determining the MFFOP (the life at the 1 − α confidence level) and a required MFFOP (minimum failure-free operating period) are also specified. Nonexistence of a path from the start node to any one of the end nodes indicates a system failure. Usually, reliability networks are defined by a single start node
and a single end node. In order to expand the capability of the software tool, the algorithm presented here handles reliability networks with many end nodes.
The next file, ‘acceleration_stresses.txt’, specifies the indices and the corresponding levels of the acceleration stresses. The structure of the file is given in Table 14.1. For the acceleration stress with index 1 (e.g. temperature), the value 333 K has been specified. The acceleration stress with index 2 is also temperature and its level has been set to 413 K. The third acceleration stress is pressure, and its level has been set to 60 MPa. Because the acceleration stresses are accessed by their indices, this is an efficient way of representing various types and sets of acceleration stresses.
The next data file, ‘system_structure.txt’, defines the topology of the reliability network. The topology is specified in a text file containing four columns (Table 14.2). The first column lists the index of the component; the second and the third columns contain the indices of the nodes to which the corresponding component is connected.
Table 14.1 Structure of the input text file acceleration_stresses.txt.

    Accelerated stress index    Accelerated stress level
    1                           333
    2                           413
    3                           60
Table 14.2 Structure of the input text file system_structure.txt representing the topology of the reliability network in Fig. 14.2.

    Comp No    First node    Second node    Orientation of the edge
    1          1             2              0
    2          2             3              0
    3          2             4              0
    ...        ...           ...            ...
    15         8             12             0
The fourth column determines whether the connecting edge (component) is directed or undirected. If the edge is directed, the column contains ‘1’, which means that the starting node of the edge is in the column labelled ‘First node’ and the edge points to the node listed in the column labelled ‘Second node’. If the edge is undirected, the fourth column contains ‘0’. This basic information is sufficient to represent the topology of any reliability network. For the reliability network in Fig. 14.2, the input file defining the system topology is given in Table 14.2. The software routines necessary for restoring the
topology of the reliability network after reading the input file specifying the system topology have also been developed. The purpose of these routines is to build the dynamic arrays representing the system topology from the information coded in the input file system_structure.txt. In order to restore the reliability network topology, a ‘reverse engineering’ of the system is performed. The complexity is concentrated in the decoding procedure, in order to simplify the user input as much as possible. This principle has been followed during the development of all input routines.
The final input file, failure_modes.txt, specifies input information about the failure modes of the separate components and the parameters of the time-to-failure distributions and the stress-life models. The generic input is capable of reproducing various time-to-failure models for each component in a system. Different models for the components are obtained by combining the basic time-to-failure distributions with the basic stress-life relationships, with the additional possibility that no stress acceleration is present. The number of failure modes specified in the input file failure_modes.txt is used to create dynamic arrays with size equal to the number of failure modes characterizing the component. These dynamic arrays hold the parameters of the time-to-failure distributions characterizing each failure mode of the component. As can be seen from equations (14.1) and (14.2), describing the time-to-failure distribution associated with any failure mode requires no more than two parameters. Some of these parameters are supplied directly by the user; others are calculated from the stress-life model supplied by the user. Because the memory for the arrays is allocated dynamically, the number of failure modes specified for any of the components in the system is not restricted. It is limited only by the available memory of the computer system.
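As an illustration of how the four-column topology file could be decoded, the following Python sketch builds adjacency lists from rows in the format of Table 14.2. It is not the author's decoding routine: the function name is hypothetical, the rows are assumed to be whitespace-separated integers with no header line, and plain adjacency lists are used in place of the Link, IJ_link and JI_link dynamic arrays.

```python
def parse_system_structure(lines):
    """Build {node: [(component, neighbour), ...]} from four-column topology rows.

    Each row is assumed to hold: component index, first node, second node,
    orientation (1 = directed from first to second node, 0 = undirected).
    """
    adjacency = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        comp, first, second, directed = map(int, fields)
        adjacency.setdefault(first, []).append((comp, second))
        adjacency.setdefault(second, [])  # make sure the node exists even without outgoing edges
        if directed == 0:
            adjacency[second].append((comp, first))  # undirected edge is traversable both ways
    return adjacency
```

For example, passing the first three rows of Table 14.2 (`"1 1 2 0"`, `"2 2 3 0"`, `"3 2 4 0"`) yields node 2 connected to nodes 1, 3 and 4 through components 1, 2 and 3 respectively. In practice the rows would come from iterating over the opened input file.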
In order to maintain a high computational speed in the presence of stress-life relationships, all numerical constants in the accelerated time-to-failure distribution are calculated while reading the input files. This avoids the need for recalculation during the simulation stage where a repeated sampling of the time-to-failure distributions is taking place. As a result, the computational speed is not diminished by the presence of stress-life models. An output file ‘system_times_to_failure.txt’ contains the full data set regarding the accelerated time to failure for the system. The rank estimates for the probability of failure are obtained from:

$$F_i = \frac{i - 0.5}{\text{f\_counter}} \qquad (14.12)$$

where f_counter is the total number of system failures. The empirical (simulated) distribution of the accelerated time to failure of the system is obtained by plotting $F_i$ versus ‘cumul_array[i]’.
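The rank estimates of equation (14.12) can be computed with a few lines of Python. This is a minimal sketch with a hypothetical function name; it returns the points that would be plotted, pairing each sorted system time to failure with its rank estimate, where the rank i runs from 1 to f_counter.

```python
def empirical_cdf(cumul_array):
    """Return [(time_to_failure, F_i), ...] with F_i = (i - 0.5) / f_counter, i = 1..f_counter."""
    f_counter = len(cumul_array)
    ordered = sorted(cumul_array)
    return [(t, (i - 0.5) / f_counter) for i, t in enumerate(ordered, start=1)]
```

The offset of 0.5 keeps every estimate strictly between 0 and 1, so the smallest and largest simulated times are never assigned probabilities of exactly 0 or 1.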
14.7 A Solved Test Example
The test example which illustrates the algorithm is based on the dual-power supply system in Fig. 14.1(b). This system has been selected for analysis because it clearly demonstrates that even relatively simple systems can have a complex topology which cannot be described in terms of series-parallel arrangements of the building components.
Component 1 (Fig. 14.2) is characterized by the Arrhenius stress-life model (14.3), where the acceleration stress is temperature, set at a level V = 333 K (see equation (9)), and the constants in the equation are B = 461.64 and C = 2. Components 2 and 3 are also characterized by Arrhenius stress-life models, with constants B = 118.77 and C = 1.4, where the acceleration stress is temperature, also set at a level V = 333 K. Components 4-11 in Fig. 14.2 are characterized by the Eyring stress-life relationship (9), with constants A = 1.8 and B = 3684.8, where the acceleration stress is temperature, set at a level V = 413 K. Finally, components 12-15 in Fig. 14.2 are characterized by the inverse power law stress-life relationship (6) with constants K = 9e-5 and n = 1.7, where the acceleration stress is pressure, set at a level V = 60 MPa. All components are characterized by a negative exponential time to failure distribution.
The duration of the time interval for which reliability was calculated was a = 1.5 years. The execution of the programme yielded: ‘Reliability = 0.162’; ‘MFFOP(0.1) = 0.173 years’ and ‘MTTF of the system = 0.887 years’. The computational speed is very high: 100,000 simulations have been performed within 1.03 s on a laptop with an Intel(R) T7200 2.00 GHz processor. The distribution of the times to failure for the system in Fig. 14.1(b) is shown in Fig. 14.6. Finally, an extrapolation of the time to failure of the system under normal conditions has been made.
This constitutes the main advantage of the developed software tool: estimating the reliability of a complex system working in normal conditions without allocating time and resources for real testing. The normal conditions correspond to the levels of the acceleration stresses specified in Table 14.3 (room temperature and normal, low pressure).
Table 14.3 Structure of the input text file acceleration_stresses.txt under normal operating conditions

    Accelerated stress index    Accelerated stress level
    1                           293
    2                           293
    3                           1
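To see how the stress levels in Tables 14.1 and 14.3 translate into component lives, the following Python sketch evaluates the three stress-life relationships for the constants of the test example. The functional forms are an assumption: they are the forms commonly given in the accelerated-testing literature (Arrhenius L = C exp(B/V), Eyring L = (1/V) exp(−(A − B/V)), inverse power law L = 1/(K Vⁿ)), and the chapter's own equations, which are not reproduced in this section, may be parameterised differently. All function and variable names are hypothetical.

```python
import math


def arrhenius_life(V, B, C):
    """Assumed Arrhenius form: L = C * exp(B / V)."""
    return C * math.exp(B / V)


def eyring_life(V, A, B):
    """Assumed Eyring form: L = (1 / V) * exp(-(A - B / V))."""
    return math.exp(-(A - B / V)) / V


def inverse_power_life(V, K, n):
    """Assumed inverse power law form: L = 1 / (K * V**n)."""
    return 1.0 / (K * V ** n)


# Stress levels keyed by acceleration stress index (Tables 14.1 and 14.3).
accelerated = {1: 333.0, 2: 413.0, 3: 60.0}
normal = {1: 293.0, 2: 293.0, 3: 1.0}


def component_lives(stress):
    """Characteristic life of each component group at the given stress levels."""
    return {
        "component 1 (Arrhenius)": arrhenius_life(stress[1], 461.64, 2.0),
        "components 2-3 (Arrhenius)": arrhenius_life(stress[1], 118.77, 1.4),
        "components 4-11 (Eyring)": eyring_life(stress[2], 1.8, 3684.8),
        "components 12-15 (inverse power law)": inverse_power_life(stress[3], 9e-5, 1.7),
    }


if __name__ == "__main__":
    for label, stress in (("accelerated", accelerated), ("normal", normal)):
        for name, life in component_lives(stress).items():
            print(f"{label:12s} {name:40s} {life:12.3f}")
```

Under these assumed forms, the accelerated stress levels give lives of roughly 8, 2 and 3 years for component 1, components 2-3 and components 4-11 respectively, and every component group has a longer life at the normal stress levels than at the accelerated ones.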
The execution of the programme yielded: ‘Reliability = 0.634’; ‘MFFOP (0.1) = 0.53 years’ and ‘MTTF of the system = 2.5 years’. Here is a simple example based on the reliability network from Fig. 14.2 revealing the effect of the temperature acting as a common cause acceleration stress.
Fig. 14.6 Distribution of the times to failure for the system in Fig. 14.1(b)
All components with indices 1-11 have been assumed to be identical, characterized by the Arrhenius-Weibull distribution of the time to failure (14.5), where β = 2.3 and C = 2. The time to failure of components 12-15 has been assumed to follow the negative exponential distribution, with MTTF = 60 years. System failure occurs when both power channels to any of the components 12-15 are lost or when any of components 12-15 fails. For an acceleration stress V = 293 K, the simulation yielded a probability Rs = 0.624 that the system from Fig. 14.1(b) will survive 4 years of continuous operation without failure. If, however, the acceleration stress (the temperature) is raised to V = 523 K, the simulation yields a probability of only Rs = 0.1 that the system will survive 4 years of continuous operation without failure. Another advantage of the developed software tool is therefore that it reveals the impact of acceleration stresses acting as common causes in topologically complex systems, where no simple analytical solution for the system reliability exists.
14.8 Conclusions
1. A method has been proposed for virtual accelerated testing of complex systems. As a result, the life of a complex system under normal operating conditions can now be extrapolated from the accelerated life models of its components.
2. The developed algorithm and software tool are based on Monte Carlo simulation and are particularly suitable for topologically complex systems, containing a large number of components.
3. The proposed method can be used for virtual accelerated life testing of complex systems. This makes building test rigs for complex systems unnecessary, which can be an expensive and difficult task. It also reduces drastically the amount of time and resources needed for accelerated life testing of complex engineering systems.
4. Building a life model for a complex system, based on the accelerated life models of its components, also reveals the impact of the acceleration stresses on the reliability of the system.
5. The proposed method and algorithm handle an important class of engineering systems where more than a single operating component is controlled. This is a significant advantage compared to standard algorithms for system reliability analysis dealing with reliability networks with a single end node.
6. The developed accelerated life model of a complex system has a running time $O(m^2)$ per simulation trial, where m is the number of components in the system. The algorithm handles systems of any complexity, any size, and any number of failure modes, limited only by the available computer memory.
7. The accelerated life model of a complex system is capable of calculating the accelerated time to failure distribution of the system, its reliability, MTTF and MFFOP at a specified confidence level.
8. An algorithm for finding paths in a reliability network has been proposed. Compared to existing path-finding algorithms, the algorithm determines the existence of paths to multiple end nodes and not only to a single end node. This makes the proposed algorithm ideal for revealing the reliability of engineering systems where more than a single operating component is controlled.
9. The proposed algorithm and software tool have been used to reveal the impact of an acceleration stress acting as a common cause, in the case of topologically complex systems where no simple analytical solution for the system reliability exists.
References
1. Billinton, R., Allan, R.: Reliability evaluation of engineering systems, 2nd edn. Plenum Press, New York (1992)
2. Hoyland, A., Rausand, M.: System reliability theory. John Wiley and Sons, Chichester (1994)
3. Kececioglu, D., Jacks, J.A.: The Arrhenius, Eyring, inverse power law and combination models in accelerated life testing. Reliability Engineering 8, 1–6 (1984)
4. Kvam, P., Miller, J.G.: Common cause failure prediction using data mapping. Reliability Engineering and System Safety 76, 273–278 (2002)
5. Meeker, W., Escobar, L.A.: A review of recent research and current issues in accelerated testing. International Statistical Review 60(1), 147–168 (1993)
6. Nelson, W.: Accelerated testing, Statistical models, test plans and data analysis. Wiley, Chichester (2004)
7. Parry, G.: Common cause failure analysis: a critique and some suggestions. Reliability Engineering and System Safety 34, 309–326 (1991)
8. Porter, A.: Accelerated testing and validation. Newnes, Oxford (2004)
9. Prentice, R.: The analysis of failure times in the presence of competing risks. Biometrics 34, 541–554 (1978)
10. Ramakumar, R.: Engineering reliability: Fundamentals and applications. Prentice-Hall, Englewood Cliffs (1993)
11. ReliaSoft: Accelerated Life Testing On-Line Reference, ReliaSoft’s eTextbook for accelerated life testing data analysis (2007)
12. Ross, S.: Simulation, 2nd edn.
13. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. MIT Press and McGraw-Hill (2001)
14. Todinov, M.T.: Reliability and Risk models: Setting reliability requirements. Wiley, Chichester (2005)