ARTICLE IN PRESS
The Journal of Systems and Software xxx (2004) xxx–xxx www.elsevier.com/locate/jss
Evolving car designs using model-based automated safety analysis and optimisation techniques

Yiannis Papadopoulos a,*, Christian Grante b

a Department of Computer Science, University of Hull, Hull HU6 7RX, UK
b Volvo Cars, Göteborg, Sweden
Received 15 December 2003; received in revised form 15 April 2004; accepted 15 June 2004
Abstract

Development processes in the automotive industry need to evolve to address increasing demands for integration of car functions over common networked infrastructures. New processes must address cost and safety concerns and maximise the potential for automation to address the problem of increasing technological complexity. In this paper, we propose a design process in which techniques for semi-automatic safety and reliability analysis of system models are combined with multi-objective optimisation techniques to assist the gradual development of designs that can meet reliability and safety requirements and maximise profit within pragmatic development cost constraints. The proposed process relies on tools to automate some aspects of the design that we believe could be automated, and thus simplified, without loss of the creative input brought to the process by designers.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Fault tree synthesis; Automated safety analysis; Software hazard analysis; Fault tolerance; Multi-objective optimization
* Corresponding author. Tel.: +44 1482 46 5981; fax: +44 1482 46 6666.
E-mail addresses: [email protected] (Y. Papadopoulos), [email protected] (C. Grante).

0164-1212/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.jss.2004.06.027

1. Introduction and background

Economic pressures in the automotive and other safety critical industries are causing a shift towards systems of increased complexity, tighter function integration, and open architectures based on networks of embedded systems. In the automotive industry, research on networks of embedded systems is mainly driven by two factors. First, such systems provide opportunities for technological innovation, and for achieving competitive edge through provision of new desirable functions, for example advanced driver assistance functions such as obstacle and collision avoidance. Second, networks of embedded systems provide opportunities for building cost-effective and flexible design solutions and for reducing production costs further. Note that cost reduction has become crucial in an increasingly competitive industry, where the expectation is that more functions should be provided at lower costs and where products such as cars must constantly address an increasing number of justified environmental and other societal concerns.

In contemporary automobiles, the potential for cost reduction seems to be greater in the power train and chassis subsystems. These represent the least visible aspects of vehicle design, and in that sense they can be rationalised and streamlined without a visible impact on the design of the car. The automotive industry is therefore aiming to rationalise and standardise the design of these subsystems by developing distributed solutions that can be based on standard interchangeable and reusable modules. Such flexible modules could be produced in high volumes by suppliers and used by automotive manufacturers in a way similar to the braking systems of today. We should note, though, that such modules must provide a high degree of flexibility to
enable automobile producers to design vehicles that have individual characteristics. Also, new technologies in this area must carefully address issues of safety, since the power-train and chassis subsystems support critical vehicle functions such as steering and braking.

A solution currently under investigation is the autonomous corner module (ACM). In this concept, each wheel of the car is controlled by a module that can accelerate, steer, and brake the wheel. In some versions of this concept, ACM modules also provide basic damper and suspension functionality. General Motors in their Autonomy Concept Vehicle, SKF and Bertone in their Filo concept and DaimlerChrysler in their concept car F400 Carving have demonstrated that this concept can be used to increase flexibility in design and provide new advanced functions. In these applications, ACM modules form a network of embedded systems and control the car in a co-ordinated manner by exchanging sensory data, status and commands over one or more replicated buses. Current trends show that future car control systems will be built using this or a similar distributed concept.

Such systems are expected to behave autonomously in the sense that forces applied on the wheels of the car will not correspond directly to the intention of the driver but will also take into account the state of the vehicle and environmental conditions. In many cases, systems will incorporate complex distributed software intended to provide driving assistance in case of vehicle instability and crash imminence, which of course raises serious safety concerns. To address some of those concerns, a body of academic and industrial work has been directed towards defining appropriate interfaces and protocols that could provide sufficient levels of safety and reliability in network communication in order to ensure the correct and uninterrupted delivery of critical functions. As a result of this work, several reliable communication technologies have emerged.
These include communication protocols such as the Time-Triggered Protocol (Kopetz and Grunsteidl, 1994), the Time-Triggered Controller Area Network (ISO, 2000) and FlexRay (Leen and Heffernan, 2002). The key characteristic of such architectures is that significant events, including tasks and messages, do not occur at random points in time, but according to a pre-determined schedule. This, in turn, means predictable temporal behaviour, which gives the ability to guarantee deadlines and to detect node failures from the omission of scheduled communication in the network. These are certainly important features that simplify the design of critical systems and make integration of critical and non-critical functions over a common bus a real possibility in car design. Indeed, open networked architectures supported by fault tolerant communication technologies could in principle safely deliver new and improved functions at
lower costs whilst leading to systems that occupy less space than traditional mechanical and electrical alternatives. By sharing sensory information as well as basic actuator functionality, complex functions could be provided economically while robust fault-tolerant communication protocols would guarantee the integrity of data on the bus.

The integration of functions and its implications, though, i.e. interaction, interoperation and the sharing of hardware resources, raises safety concerns which go beyond issues that can simply be resolved by using robust communication technologies. Such concerns include the effects on the network of new classes of failure modes such as the commission of functions, the possibility of common cause failure, and the potential for unpredicted dependent failure of critical functions caused by malfunction of non-critical functions.

Traditionally, such concerns were addressed using well-established classical safety analysis techniques such as Failure Modes and Effects Analysis (FMEA) and Fault Tree Analysis (FTA). These techniques are still widely practised in many industries. However, increasing complexity in technology questions the applicability of these techniques to new complex programmable designs. Perhaps the biggest obstacle is the manual nature of classical techniques, which makes the classical safety analysis of complex systems increasingly difficult to complete within the realistic constraints of most projects. To address this problem, analysts often make simplifying assumptions about the behaviour of the system or restrict the analysis to selected "critical" sub-systems. Such compromises on the precision or completeness of the analysis, though, may give rise to a number of problems, which include omissions and inconsistencies in the results and difficulty in relating the results of such partial analyses to a coherent view of system failure and recovery.
The automotive and other safety critical industries are therefore confronted with a situation in which, on the one hand, economic pressures cause a shift towards function integration, while on the other, safety concerns push for a re-thinking of the presently established safety assessment processes. Clearly, new development processes will need to emerge that address both these economic and safety concerns to ensure the viability and success of integrated technologies and open architectures.

A new generation of safety standards is emerging to address those concerns. Such standards include, for example, the EUROCAE & SAE standards in the aerospace industry (EUROCAE, 1997, 1999; SAE, 1996), the CENELEC standards in the railway industry (CENELEC, 1999a,b,c), and IEC 61508, the international standard on safety related systems (IEC, 1997). These standards recommend safety processes that start early in the design life-cycle and proceed in parallel to the design of the system. In these processes, design
and safety analysis activities are always integrated and interact, so that the requirements derived at one stage of the safety process can influence subsequent stages of design. Beyond general processes, though, complex safety assessments also require suitable methods that can be applied effectively in the context of the proposed processes. And the question remains whether manually performed classical safety analysis techniques can be applied to complex systems in the iterative, design-integrated manner envisaged by the standards. Another difficulty with emerging standards is that, although general principles are clearly set out, there is little practical guidance and there are few examples on how these principles can be translated into pragmatic processes that can achieve a continuous, consistent and cost-effective safety assessment proceeding in parallel to the design of the system.

In our view, new integrated design and safety analysis processes in the spirit of the new standards should address cost and profitability as well as reliability and safety issues from the early stages of the design in order to avoid expensive mistakes and unnecessary, costly design iterations. To gain acceptability in the industry, they should also build upon current achievements in model-based product development, and upon already established processes in which electronic models of the system are used for system design, prototyping, fault simulation and source code generation.

A critical factor in dealing with the increasing complexity of systems in the future is automation. Indeed, the key to accelerating the development of successful designs, we believe, will be to identify those aspects of the development and assessment process that could be automated, and thus simplified, without loss of creativity in the design process.
Code generation from system models (Heimdahl and Keenan, 1997) and the automated analysis of design specifications, for example using model checking techniques, are two problems in this area on which a large body of research is presently focused. Our contribution to that body of research is work towards the development of a model-based safety analysis technique called HiP-HOPS (Hierarchically Performed Hazard Origin and Propagation Studies) that largely automates the safety and reliability analysis of complex systems (Papadopoulos et al., 2001). In this work, we have shown that it is possible to automatically generate safety analyses, such as fault trees and FMEAs, from design models by augmenting the structure of the model with information on local failure propagation. This has the beneficial effect of simplifying the analysis while keeping the analyses consistent with the design information.

In this paper, we propose a concept in which this earlier work on automated safety analysis is combined with multi-objective optimisation techniques to assist the
development of complex safety critical systems. To address in a balanced way both the economic and safety concerns that arise in the design of such systems we propose a tool-supported and largely automated design and safety analysis process. The proposed process integrates design and first-principle safety analysis into a single thread that allows the consistent safety assessment of evolving designs from the early stages of the design. In the following sections, we present the process focusing on methods that we develop to automate and simplify aspects of the process. We also outline the current state of the tools that assist the application of these methods and draw tentative conclusions based on preliminary results from a case study on an embedded vehicle steering control system.
2. Overview of the proposed process

The proposed design and safety analysis process is illustrated in Fig. 1. The process starts early in the design of a system with the identification of promising design concepts, i.e. configurations of functions that can potentially maximise profit within a range of acceptable development costs. Multi-objective optimisation techniques are employed at this stage to assist and largely automate this part of the process. Promising design concepts are then refined into architectural models, and these models are assessed using HiP-HOPS to determine whether they can meet given safety and reliability requirements. This type of semi-automatic safety and reliability analysis helps to identify design flaws and stimulate useful design iterations. The analysis can be repeated on progressively more detailed models of the system until designers and analysts are satisfied that all hazards associated with the operation or failure of the system have been identified and dealt with by appropriate design measures.

When the application of this technique shows that the design does not satisfy all safety or reliability requirements, a further automatic step can be initiated in which a multi-objective optimisation algorithm will attempt to determine where component redundancies can be introduced in the model to achieve the given requirements. If this final step shows that meeting the given requirements entails an unacceptable increase in development cost (caused by replication of components), then the current design can be rejected, requirements can be reviewed, or the process can be iterated for other design concepts. On the other hand, if the analysis shows that safety and reliability requirements can be met with a moderate increase in cost, then further effort could be committed to the implementation of the current design.
In the following sections, we discuss in more detail the three steps of the above process: (a) identification of promising design concepts, (b) refinement of concepts into models and safety analysis of these models, and (c) optimal allocation of component redundancies to achieve safety and reliability requirements.

Fig. 1. The proposed integrated design and safety analysis process.
3. Identification of promising functional design concepts

The proposed process starts at the conceptual level, where the design for a new system, for example a vehicle control system, is typically initiated. At this stage, the functions to be provided by the system are decided in the context of constraints like the cost of components, the cost of development and the production capabilities of the company. There are usually a number of design possibilities that can be translated into different concepts which could provide different functions. Issues like potential markets, volumes, costs and ultimately profit have to be addressed for a concept to be translated into a successful design. Typically, the motivation for proceeding further in the design with one or more of those concepts is the potential for profit, measured absolutely or, more commonly, as return on investment.

There is often a wealth of information from past experience or competitor products upon which the potential of a new design could be evaluated. Such information typically includes knowledge of
desirable functions, of the value that customers are prepared to pay for those functions, and of the components that are likely to be needed to deliver such functions. In practice, though, it is not always obvious how this diverse information can be used to decide which combination of functions and components should be selected for the new design. This is especially true when there is a vast number of viable combinations of functions and components, because a decision on which combination is best would require evaluation of all those design options using, for example, simple or more complicated calculations of profit and cost. Realistically, and for pragmatic designs which may have hundreds of functions and thousands of components, exhaustive evaluation of all options is impossible even if the simplest calculations were adopted.

To address this problem, in the proposed process we employ a tool that we have recently developed to automate and accelerate the above search and optimisation process (Grante and Andersson, 2003). The program takes as input the list of functions considered for the new design, an estimate of the value that customers would be prepared to pay for each function, the components needed to deliver the functions (note that each component can be shared by more than one function) and the cost of each component. This input is provided
in the form of a matrix that relates potential functions to sets of components, as illustrated in Fig. 1. A genetic algorithm is then used to systematically search for those configurations of functions and components that optimise profit within certain budget constraints. The output of the program is a set of candidate concepts for further development that are optimal in the sense that they achieve maximum profit for varying (higher or lower) levels of expenditure that fall within the constraints of the budget. To arrive at this set, in the course of an iterative evolutionary optimisation process, the genetic algorithm calculates the fitness of candidate designs in terms of profit and cost and then ranks potential solutions according to their "degree of dominance" (Fonseca and Fleming, 1998), thus eventually forming a "Pareto front"¹ of optimal solutions (Pareto, 1896). In this approach, expenditure is calculated as the sum of component costs, while profit is calculated as the sum of function values minus the overall cost.

In the context of the proposed design process, application of this technique helps to arrive at one or more promising configurations for further development. Each configuration is effectively a design concept which defines a set of functional requirements and a set of components likely to be included in the implementation. To proceed further in the design, such abstract functional configurations will need to be interpreted into more detailed architectural models (e.g. Simulink models) which can show how components need to interact in order to deliver the desired functions.
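The profit and cost calculation described above can be made concrete with a small sketch. All function names, customer values and component costs below are invented for illustration, and a brute-force enumeration stands in for the genetic algorithm; the real tool needs a genetic algorithm precisely because exhaustive evaluation is infeasible at realistic scales.

```python
from itertools import combinations

# Hypothetical input matrix: function -> (customer value, components it needs).
FUNCTIONS = {
    "cruise_control": (400, {"speed_sensor", "ecu"}),
    "lane_keeping":   (600, {"camera", "ecu", "steer_act"}),
    "auto_braking":   (700, {"radar", "ecu", "brake_act"}),
}
COMPONENT_COST = {
    "speed_sensor": 50, "camera": 200, "radar": 250,
    "ecu": 150, "steer_act": 180, "brake_act": 170,
}

def evaluate(funcs):
    """Expenditure = cost of the *union* of required components (shared
    components, e.g. the ECU, are paid for once); profit = value - cost."""
    comps = set().union(*(FUNCTIONS[f][1] for f in funcs)) if funcs else set()
    cost = sum(COMPONENT_COST[c] for c in comps)
    value = sum(FUNCTIONS[f][0] for f in funcs)
    return cost, value - cost

def pareto_front(candidates, budget):
    """Keep configurations within budget that no other candidate dominates
    (i.e. no candidate is at most as costly AND at least as profitable,
    and strictly better in one of the two)."""
    scored = [(c, *evaluate(c)) for c in candidates]
    scored = [s for s in scored if s[1] <= budget]
    front = []
    for c, cost, profit in scored:
        dominated = any(
            c2 != c and cost2 <= cost and profit2 >= profit
            and (cost2 < cost or profit2 > profit)
            for c2, cost2, profit2 in scored
        )
        if not dominated:
            front.append((sorted(c), cost, profit))
    return front

all_configs = [frozenset(c) for r in range(len(FUNCTIONS) + 1)
               for c in combinations(FUNCTIONS, r)]
front = pareto_front(all_configs, budget=800)
```

The resulting front contains one maximally profitable configuration per attainable level of expenditure, which is exactly the set of "candidate concepts for further development" the text describes.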
4. Refinement of concepts into models and automated safety analysis

Explosive growth in the area of model-based product development means that such models are becoming increasingly available in the industry as they progressively find more useful applications in the course of the design life-cycle. Such models are currently employed for the specification of "architectural plans" that can be used to co-ordinate complex, distributed system development processes. Executable models are also used for simulation, fault injection and the verification of designs against specified properties, while more recently modelling tools have also acquired capabilities of source code generation.

¹ To avoid repetition, the concepts of "dominance", "ranking by degree of dominance" and "Pareto front" are explained in Section 5, where application of these concepts is reported on a more complex optimisation problem.

In the proposed process, the utility of such models is pushed further into a hitherto unexplored area, that of safety and reliability analysis. More specifically, we use
our earlier work on automated safety and reliability analysis (i.e. HiP-HOPS) for evaluating whether the design under development can meet safety and reliability requirements. Application of this work can start once a concept has been interpreted into a Simulink model. Simulink was chosen as a modelling environment because it is both a widely used engineering tool and a tool for which we have in the past developed an automated reliability and safety analysis algorithm (Papadopoulos and Maruhn, 2001). The applicability of the proposed techniques, however, is not restricted to Simulink models. Any model that provides the topology of the system, i.e. components and connections, is suitable for this type of analysis. In Papadopoulos and Petersen (2003), for example, we have demonstrated application to models of marine system designs developed in SimulationX, a different modelling and simulation tool (ITI, 2003). The proposed analysis can be performed on abstract or more detailed versions of the model as it is progressively refined in the course of the design life-cycle.

The first step in the analysis is the establishment of the local failure behaviour of each component in the model as a set of "failure expressions" which show how "output failures" of the component can be caused by "internal malfunctions" and "deviations of component inputs". At this stage, a variant of Hazard and Operability Studies (HAZOP) is used to identify plausible output failures such as the "omission" or "commission" of each output and then to determine the local causes of such events (Papadopoulos et al., 2001). Other classes of output malfunctions are typically also considered in order to achieve a more detailed and informative analysis. These include conditions such as the output being delivered at a higher or lower level than expected ("value failures"), and deviations such as the early or late provision of the output ("timing failures").
The input deviations that are specified as causes of such output events represent similar conditions, i.e. omission, commission, timing and value failures of input parameters that by themselves, or in conjunction with other events, cause the output failure. Internal component malfunctions, on the other hand, represent component failure modes such as electrical or mechanical problems caused by wear or environmental conditions for which the component is not qualified. For such failure modes the analysis may also record a failure rate λ (in failures per hour or other appropriate units) that may be obtained from component manufacturers or large reliability databases. If they are provided, such failure rates can later be used for quantitative analysis and evaluation of system reliability.

Collectively, the results of the above analysis, which is performed at component level, form a specification of the local failure behaviour of the component under examination, which contains component failure data (failure modes and failure rates) and a definition of the local
effects and propagation of failure across the component interfaces, from inputs to outputs. Once this local analysis has been completed for all components, the structure of the model is then used to automatically determine how local failures propagate through connections in the model and cause functional failures at the outputs of the system. This global view of failure is captured in a set of fault trees which are automatically constructed by traversing the model of the system backwards, moving from the final elements of the design, i.e. the actuators, towards system inputs, and by evaluating the failure expressions of the components encountered during this traversal.

The fault trees synthesised using this approach show how functional failures or malfunctions at the outputs of the system are caused by logical combinations of component failures. These fault trees may share branches and basic events, in which case they record dependencies in the model and common causes of failure, i.e. component failures that contribute to more than one system failure. Thus, in general, the result of the fault tree synthesis process is a network of interconnected fault trees which record logical relationships between component and system failures, as illustrated in Fig. 2. The top events of these fault trees represent system failures. Leaf nodes represent component failure modes, while the body of intermediate events (and intervening logic) records the propagation of failure in the system and the progressive transformation of component malfunctions to system failures. In the final step of the process, the complex body of intervening fault propagation logic is removed from the analysis by an automated algorithm which translates the network of interconnected fault trees into a simple table of direct relationships between component and system failures.
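The backward traversal described above can be sketched on a hypothetical three-component chain. The component names, failure modes and deviation labels below are invented, and real HiP-HOPS failure expressions are of course richer (AND/OR combinations, commission, value and timing failures); the sketch only shows the substitution mechanism.

```python
# Hypothetical local failure expressions: for each component, how an output
# deviation ("O-x" = omission of signal x) arises from internal failure
# modes and input deviations. Expressions are (op, [terms]) trees.
FAILURE_EXPRESSIONS = {
    "sensor":     {"O-reading": ("OR", ["sensor_failed"])},
    "controller": {"O-cmd":     ("OR", ["cpu_failed", "O-reading"])},
    "actuator":   {"O-force":   ("OR", ["motor_failed", "O-cmd"])},
}
# The model's connections: which component produces each signal deviation.
PRODUCER = {"O-reading": "sensor", "O-cmd": "controller", "O-force": "actuator"}

def synthesise(deviation):
    """Build a fault tree for a signal deviation by traversing the model
    backwards: each input deviation is replaced by the producing
    component's own failure expression; everything else is a basic event
    (a component failure mode) and becomes a leaf node."""
    op, terms = FAILURE_EXPRESSIONS[PRODUCER[deviation]][deviation]
    children = []
    for term in terms:
        if term in PRODUCER:               # input deviation: recurse upstream
            children.append(synthesise(term))
        else:                              # basic event: leaf of the tree
            children.append(term)
    return (op, children)

# Top event: omission of force at the actuator, the final element of the design.
tree = synthesise("O-force")
```

Starting the traversal at the actuator yields a tree whose leaves are the component failure modes and whose intermediate nodes record the propagation of failure through the chain.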
To achieve this transformation, the cut-sets of the synthesised fault trees are first calculated by applying a depth-first, bottom-up traversal strategy, in the course of which the logic of each tree is progressively established and then simplified using classical Boolean reduction techniques (Fussell and Vesely, 1972). A traversal of all cut-sets is then performed to establish direct relationships between component failures and system failures and record these relationships in a single table
as illustrated in Fig. 3. In a similar way to a classical FMEA, this table determines, for each component in the system and for each failure mode of that component, the effect of that failure mode on the system, i.e. whether, and how, the failure mode contributes to one or more system failures and malfunctions (i.e. the top events of fault trees). Note that in a classical manual FMEA only the effects of single failures are typically assessed. Thus, one advantage of generating an FMEA from fault trees is that fault trees record the effects of combinations of component failures, and this useful information can also be transferred into the FMEA. To accommodate this additional information, effects in the resultant FMEA tables are presented in two columns, one containing the direct effects on the system, i.e. those effects caused by single component failures, and the other containing further effects, i.e. those effects caused in conjunction with other failure modes. This allows clear and easy access to the most critical information, the single points of failure. Perhaps more importantly, the FMEA shows all functional effects that a particular component failure mode causes. The latter is particularly useful, as a failure mode that contributes to multiple system failures (e.g. failure mode C6 in the illustrated example of Fig. 3) is potentially more significant than those that only cause a single top event.

In the context of the proposed process, this semi-automatic analysis of the system model determines whether the design can meet given reliability or safety requirements. We should note that this analysis does not necessarily depend upon credible component failure rates to produce useful results. In the case of software modules, or components without a sufficient history of use, such failure rates would be impossible or very difficult to obtain anyway.
In such cases, the logical reduction of fault trees into a single tabular form which is equivalent to a classical FMEA can still indicate single points of failure in the system and point out potential design weaknesses that may lead to useful design iterations. This means that this type of analysis can be employed while the model of the system is still under development in order to drive the allocation of critical safety requirements during the design process. An example of this is discussed in the context of the case study which is presented in Section 6.
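The reduction of cut-sets into the two-column FMEA described above can be sketched as follows. The cut-sets, failure-mode names and top-event names are hypothetical (C6 mirrors the multi-effect failure mode of the Fig. 3 discussion); in the actual process the cut-sets come from Boolean reduction of the synthesised fault trees.

```python
# Hypothetical minimal cut-sets per system failure (fault tree top event).
CUT_SETS = {
    "loss_of_braking":  [{"C1"}, {"C2", "C3"}],
    "loss_of_steering": [{"C6"}, {"C4", "C5"}],
    "loss_of_warning":  [{"C6"}],
}

def fmea(cut_sets):
    """Traverse all cut-sets once. Direct effects: top events the failure
    mode causes on its own (singleton cut-sets, i.e. single points of
    failure). Further effects: top events it contributes to only in
    conjunction with other failure modes."""
    table = {}
    for top_event, sets in cut_sets.items():
        for cut_set in sets:
            for failure_mode in cut_set:
                entry = table.setdefault(
                    failure_mode, {"direct": set(), "further": set()})
                column = "direct" if len(cut_set) == 1 else "further"
                entry[column].add(top_event)
    return table

table = fmea(CUT_SETS)
# C6 causes two system failures single-handedly, flagging it as a single
# point of failure that is potentially more significant than failure
# modes contributing to only one top event.
```

The resulting dictionary is exactly the flattened, FMEA-like view of the fault tree network: the intervening propagation logic is gone, leaving only direct relationships between component and system failures.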
Fig. 2. A network of automatically created fault trees.

5. Introducing fault tolerance in the model

At the early stages of the design, the model that provides the basis for the above analysis may be functional and record the decomposition of functions into networks of lower-level sub-functions. As functions are allocated to hardware, though, entities in the model will start to represent real components, for example sensors
and actuators, or processors that enclose networks of tasks running upon those processors.

Fig. 3. Synthesised fault trees and FMEA.

The semi-automatic safety and reliability analysis of such detailed models will in practice often reveal that certain safety and reliability requirements cannot be met by the proposed architecture. It is usually possible to correct this and achieve higher levels of reliability and safety by improving the degree of fault tolerance in the system, i.e. by replicating components or subsystems so that functions are still provided when components or subsystems fail. In a typical design, though, there are many options to replicate at different places in the system and at different levels of the design. It may be possible, for example, to achieve the same level of safety or reliability by replicating a few sensors in one part of the system or an actuator in another part. Different solutions, however, will lead to different additional costs, and the question here is which solution is optimal, i.e. one that achieves a certain level of reliability and safety with minimal additional cost. Because in a non-trivial design there is a vast number of options for replication, it is virtually impossible for designers to address the above question systematically. Thus, designers tend to rely on experience and evaluate only a few design options. Some automation in this area would therefore be useful to designers, and could help them achieve successful trade-offs between cost and reliability in the design of fault tolerant systems.

In our approach, the above is achieved by combining the fault tree synthesis concept of HiP-HOPS with recent advances in design optimisation. More specifically, we use synthesised fault trees and genetic algorithms in order to progressively "evolve" an initial design model in which there is no replication of components to a design where replicas have been allocated in a way that minimises the cost of replication while achieving safety and reliability requirements.
In the course of the evolutionary process, the genetic algorithm generates populations of candidate designs that incorporate a range of replication strategies based on widely used fault tolerant schemes such as hot or cold standbys and triple-modular redundancy with majority voting. For the algorithm to
progress towards an optimal solution, a selection process is necessary in which the fittest designs survive and their genetic make-up is passed to the next generation of candidate designs. The fitness of each design in this approach depends on cost and reliability. To calculate fitness, therefore, we need fast ways to calculate these two elements automatically. An indication of the cost of a system can easily be calculated as the sum of the costs of its components (although for more accurate calculations life-cycle costs should also be taken into account, such as production, assembly and maintenance costs). However, while calculation of cost is relatively easy to automate, the evaluation of safety or reliability is more difficult, as conventional methods rely on manual construction of the reliability model (e.g. the system fault tree). The fault tree synthesis concept that we developed in HiP-HOPS, though, automates the development and calculation of the reliability model, and therefore facilitates the evaluation of fitness as a function of reliability (or safety), thus enabling a selection process through which the genetic algorithm can progress towards an optimal solution. Indeed, in the course of the optimisation process, fragments of fault trees that correspond to modifications made to the initial model by the genetic algorithm are mechanically re-constructed, and all fault trees for the current population of models are then re-evaluated in order to determine the fitness of individuals.

One reason for using a genetic algorithm (as opposed to another optimisation technique) is that genetic algorithms carry out a multi-directional search and can therefore easily be designed to generate several good or nearly optimal solutions. In the context of the reliability optimisation problem, this characteristic provides the essential flexibility required in making successful trade-offs among cost, reliability and other factors.
Theoretically optimal solutions, for instance, may require unavailable space or imply awkward physical arrangements of components, in which case sub-optimal solutions may be preferable. In another scenario, one solution may be optimal in the sense that it achieves a given reliability target at minimal cost. However, it
may also be possible to achieve much higher reliability with a rather small increase in cost. To enable this type of trade-off among reliability, cost and other parameters, we opted for a "Pareto set" optimisation. In this approach, the genetic algorithm searches not for a single optimal solution but for a set of Pareto optimal solutions, also known as "non-dominated" solutions. A solution x is said to dominate a solution y if x is as good as y in terms of all objectives and better in terms of at least one. Thus, the Pareto set (also called the "Pareto front") of optimal solutions in our problem consists of all designs whose cost is lower than that of all other designs that are equally or more reliable.

To calculate the Pareto front, in each iteration of the evolutionary optimisation the genetic algorithm determines the relative fitness of candidate designs by ranking them according to the scheme proposed by Fonseca and Fleming (1998). In this scheme, ranking is performed using the "degree of dominance", which for each individual equals the number of individuals by which it is dominated, plus one. Design concepts with a degree of one therefore lie on the Pareto front and represent the fittest designs in each population of candidate designs. An example of a population ranked according to degree of dominance is illustrated in Fig. 4. The two parameters of optimisation in this example are reliability and cost; the Pareto front consists of the designs that are less costly than all other designs that are equally or more reliable. To progressively improve the Pareto front in the course of the evolutionary process, the genetic algorithm creates new generations of candidate designs using an implementation of a recently proposed scheme (Andersson and Wallace, 2002).
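The degree-of-dominance ranking can be sketched as follows (a minimal rendition under our own assumptions: two minimised objectives, cost and unreliability, with invented objective values):

```python
# Sketch of the Fonseca-Fleming "degree of dominance" ranking: each design is
# a (cost, unreliability) tuple and both objectives are minimised.

def dominates(x, y):
    """x dominates y: at least as good in all objectives, better in one."""
    return (all(a <= b for a, b in zip(x, y)) and
            any(a < b for a, b in zip(x, y)))

def degree_of_dominance(population):
    """Rank of each design: one plus the number of designs that dominate it."""
    return [1 + sum(dominates(other, d) for other in population if other != d)
            for d in population]

population = [(70.0, 1e-4), (120.0, 1e-6), (130.0, 1e-4), (70.0, 1e-3)]
ranks = degree_of_dominance(population)
print(ranks)  # [1, 1, 3, 2]
```

Under this ranking the first two designs have a degree of one and therefore lie on the Pareto front; the third is dominated by both of them and the fourth by the first only.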
In the scheme of Andersson and Wallace (2002), parents are chosen from the two most recent generations of candidate designs, and basic genetic mixing and modification mechanisms such as crossover and mutation are then applied to create children. For each child, the most genetically similar individual in the entire population is identified and its fitness is compared with that of the child. If the child is fitter, it replaces the older individual in the population. There is evidence that this replacement strategy counteracts the genetic drift that reduces population diversity and leads to inbreeding. In practice, poor diversity means that the population clusters at the extremes of the Pareto front, so that trade-offs achievable in the middle of the front are not clearly identified.

The result of this optimisation process is a set of designs that minimise cost at different levels of reliability, which may lie within a range of acceptable reliability requirements. Once more, we should note that quantitative reliability analysis based on failure rates is not necessary to drive this process; it could instead be driven by qualitative safety criteria, for example the number and criticality of single points of failure in each design. Single points of failure are component failure modes that by themselves cause one or more system failures. As we saw in Section 4, such conditions can be determined via automated analysis of the cut-sets of the synthesised fault trees and the construction of an FMEA. The criticality of these failures can be established from the severity of their system-level effects (i.e. top events in the fault trees or system effects in the FMEA).

In the context of the proposed design process, this last stage in the optimisation of the model determines whether and where redundancies are needed to achieve the given reliability and safety requirements. If the analysis suggests that the cost of replication is unacceptably high, the current design can be rejected, requirements can be reviewed, or the process can be iterated for other design concepts. On the other hand, if the analysis shows that safety and reliability requirements can be met with a moderate increase in cost, we can proceed to the implementation of the current design.
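A qualitative criterion of this kind can be derived directly from the minimal cut-sets, since a single point of failure is simply an order-1 cut-set. A minimal sketch (failure-mode names, top events and severities are invented for illustration):

```python
# Sketch: identifying single points of failure from minimal cut-sets.
# A single point of failure is an order-1 minimal cut-set: one component
# failure mode that by itself causes a system-level failure. Its criticality
# is taken from the severity of the top event it causes.
# (All names and severities below are invented for illustration.)

def single_points_of_failure(cut_sets_by_top_event, severity):
    """Map each order-1 cut-set member to the severities it can cause."""
    spof = {}
    for top_event, cut_sets in cut_sets_by_top_event.items():
        for cut_set in cut_sets:
            if len(cut_set) == 1:           # order-1 cut-set
                (failure_mode,) = cut_set
                spof.setdefault(failure_mode, set()).add(severity[top_event])
    return spof

cut_sets_by_top_event = {
    "omission_of_steering":   [{"cpu.fail"}, {"sensor.fail", "sensor_b.fail"}],
    "commission_of_steering": [{"cpu.value_fail"}],
}
severity = {"omission_of_steering": "marginal",
            "commission_of_steering": "catastrophic"}

print(single_points_of_failure(cut_sets_by_top_event, severity))
# {'cpu.fail': {'marginal'}, 'cpu.value_fail': {'catastrophic'}}
```

A design with many catastrophic single points of failure would then rank poorly even without any failure-rate data.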
6. Tool support and applications of proposed techniques A number of tools have been developed to support and partly automate aspects of the proposed process.
Fig. 4. Pareto front of optimal solutions (candidate designs plotted by cost and reliability, each labelled with its degree of dominance).

6.1. Support for optimisation of early design concepts
The first of these tools is an optimisation tool that assists the search for promising design concepts early in the design process. The tool provides a Microsoft Excel interface which designers and analysts can use to specify and relate potential functions that could be included in a new design to components or sub-systems that would be needed to support the delivery of each function. Given this input, the tool then applies the process specified in Section 3 to identify design concepts that potentially maximise profit within given cost constraints.
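As a rough illustration of the underlying selection problem (a brute-force sketch over an invented toy example; the tool itself uses the genetic algorithm of Section 3, and all function names, profits and costs here are our own):

```python
# Illustrative sketch of the concept-selection problem the tool addresses:
# choose a set of functions maximising profit, where each function needs
# certain components and shared components are paid for only once.
# (Brute force over a toy instance; the real tool uses a genetic algorithm.)
from itertools import combinations

functions = {  # function -> (expected profit, components it needs)
    "lane_keeping":        (30, {"camera", "ecu"}),
    "collision_avoidance": (50, {"radar", "ecu", "brake_actuator"}),
    "parking_assist":      (20, {"camera", "ultrasonic"}),
}
component_cost = {"camera": 15, "radar": 25, "ecu": 20,
                  "brake_actuator": 10, "ultrasonic": 5}

def evaluate(concept):
    """Profit and cost of a set of functions; shared components counted once."""
    needed = set().union(*(functions[f][1] for f in concept)) if concept else set()
    profit = sum(functions[f][0] for f in concept)
    cost = sum(component_cost[c] for c in needed)
    return profit, cost

def best_concept(budget):
    """Most profitable set of functions whose component cost fits the budget."""
    best, best_profit = set(), 0
    for r in range(len(functions) + 1):
        for concept in combinations(functions, r):
            profit, cost = evaluate(concept)
            if cost <= budget and profit > best_profit:
                best, best_profit = set(concept), profit
    return best, best_profit

print(best_concept(60))  # ({'collision_avoidance'}, 50)
```

With dozens of functions this search space explodes (the case study below reaches tens of billions of concepts), which is why an evolutionary search replaces exhaustive enumeration in practice.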
The tool has been tested by Volvo on a case study that included 52 active safety functions implemented over 48 components and sub-systems (Grante and Andersson, 2003). This configuration could result in more than 70 billion possible design concepts; manual evaluation of all of those concepts with regard to cost and profit would certainly have been impossible. With the aid of the tool, however, it was possible to arrive at a few promising design concepts that potentially maximise profit at different levels of development cost. An indication of the performance of the tool is that the genetic algorithm takes a few seconds to arrive at the optimal Pareto front of solutions starting from a random population of candidate design concepts.

6.2. Support for automated safety analysis

The second tool that we developed to support the proposed process is a safety and reliability analysis tool that generates system fault trees and FMEAs from Matlab Simulink or Simulation X models, as described in Section 4. The architecture of this tool is illustrated in Fig. 5. The tool provides a graphical user interface (GUI) that enables annotation of components in the model with the failure modes and failure expressions required for the fault tree synthesis. These data become part of the model and are automatically saved and retrieved by the modelling tool every time the model is opened or closed by a user. Failure annotations reference only attributes of the corresponding components (i.e. failure modes and deviations at component input
or output ports), which means that such annotations can, in principle, be re-used within the same model or across different models, with the obvious benefit of simplifying the manual part of the analysis. Once a model has been annotated, it is saved in the format of the corresponding modelling tool. The second component of the safety analysis tool is a parser that interprets such files and reconstructs the enclosed annotated models for the purposes of fault tree synthesis. The synthesis itself is performed by the third component of the tool, the fault tree synthesiser. To generate fault trees, the algorithm performs a backward traversal from each output of the model, in the course of which it evaluates the failure expressions contained in the local analyses of the components encountered during the traversal. The resultant fault trees are then logically reduced into minimal cut-sets and, finally, in a single traversal of these cut-sets, an FMEA synthesiser generates the FMEA table. In the current implementation of the tool, the synthesis is separated from the display of results: fault tree and FMEA stores are first created in memory, and different views of these results can then be generated. An HTML generator can parse the stores and create web pages that contain the FMEA and the individual fault tree analyses. The advantages of this medium include easy distribution and display, and the ability, through hyperlinks, to navigate different aspects of this information. Beyond offering its own analysis and display capabilities, the tool also exports the synthesised fault trees to
Fig. 5. Architecture of safety analysis tool.
Fault Tree Plus (FT+) (Isograph, 2003), a widely used fault tree analysis tool. It can, therefore, be used in conjunction with FT+ to simplify the safety and reliability analysis of systems.

Case studies using the automated safety analysis tool on an advanced brake-by-wire car prototype and the fuel system of a container ship have been reported in collaborative work with DaimlerChrysler (Papadopoulos et al., 2001) and Germanischer Lloyd (Papadopoulos and Petersen, 2003) respectively. In the context of the work presented in this paper, Volvo is currently using the tool to drive the design of an advanced steer-by-wire prototype car. A functional model of the system was developed in Matlab Simulink and was deliberately designed without any degraded or fallback modes, in order to test whether the results of the automated safety analysis could help in the systematic identification and design of such modes. The model was annotated with information about the local failure behaviour of functions, and 16 interconnected fault trees and an FMEA were then constructed by the automated safety analysis tool. An indication of the complexity of the failure logic incorporated in this model is that the analysis yields a few thousand cut-sets. It takes a couple of seconds on a standard personal computer to automatically generate and evaluate the fault trees and FMEA from this model. Two failure modes were considered during the analysis: the omission and the commission of functions. The synthesised fault trees and FMEA therefore show how omissions and commissions of input, processing and
actuator functions cause system-level effects, i.e. omissions or commissions of steering functions. A classification of the severity of those effects into marginal and catastrophic helped to identify the criticality of their causes (i.e. failures of input, processing and actuator functions), and this in turn assisted the design of these basic functions. For example, wherever the analysis indicated that the omission of a function had only marginal effects while its commission had catastrophic effects, a recommendation was made to design the function so that it fails silent. This led to the identification of several degraded modes in which non-critical steer-by-wire functions may fail silent with only marginal effects on the system. A hierarchical state-chart was then constructed to show how graceful transition to such modes could be achieved.

It is beyond the scope of this paper to present in detail the voluminous safety analyses, i.e. the synthesised fault trees and FMEA, derived in this study. To illustrate the practical value of the analysis, though, in Figs. 6 and 7 we present the high-level mode-chart that was derived from a straightforward interpretation of these results. The mode-chart of Fig. 6 shows how a number of "critical" (as indicated by the analysis) failures of input, processing and actuator functions should lead the system safely into assisted-mechanical and progressively unassisted steering modes. The decomposition of the main "steer-by-wire" mode into the mode-chart of Fig. 7 also identifies a number of "less critical" (as indicated by the analysis) functional
Fig. 6. Degraded modes introduced after the automatic safety analysis. (The mode-chart shows the steering system degrading from the main steer-by-wire mode to mechanical backup assisted by the Rack_Position_Provider, mechanical backup assisted by the Torque_Provider, and mechanical backup without assistance, together with end-stop working and failed-silent states; transitions are guarded by failure expressions over functions such as Input_Rack_Position, Input_SteeringWheel_Angle, Required_Rack_Position, Required_Torque, Torque_Provider and Input_Torque_reading.)
Fig. 7. Sub-modes within the main Steer-By-Wire mode. (Each accounted input, i.e. vehicle speed, yaw, rack force, lateral acceleration, longitudinal acceleration, steering wheel speed and selected mode, has a corresponding silent sub-mode that is entered when the respective input, e.g. Input_Vehicle_Speed or Input_Yaw, fails silent.)
failures that should lead the system into sub-modes where some of the steer-by-wire functionality is lost but the system can safely remain in the normal steer-by-wire mode. Driven by these results, a design iteration is currently underway to enable detection of the specified functional failures and to facilitate the safe transition of the system to the new degraded modes specified in the mode-chart.

6.3. Support for redundancy allocation

While the safety analysis tool is relatively mature, the redundancy allocation tool is still under development. It is therefore currently impossible to comment on its performance or to make credible comparisons of the proposed approach with earlier approaches to the redundancy allocation problem, at least in terms of speed and scalability. However, we can already identify a few conceptual differences between our approach and earlier work. To the best of our knowledge, for example, earlier approaches calculate reliability using a mathematical function derived straight from the system topology, on the assumption that the system is formed as a series–parallel configuration of components and that these components and the system itself "work" or "fail" in a single
failure mode, which typically represents a complete loss of function (Coit and Smith, 1995, 1996). It is usually also assumed that a series configuration fails when any of its components fails, while a parallel configuration fails only when all of its constituent components fail. In addition to work on series–parallel systems, some work has also been done to solve the reliability optimisation problem on more complex network topologies, in which networks are assumed to function between nodes as long as there is at least one working path connecting those nodes (Dengiz et al., 1997; Deeter and Smith, 1997).

The difference between the proposed approach and this earlier work is that, in our approach, the safety and reliability model (i.e. a set of automatically derived system fault trees) is not generated simply from the topology of the system, but from an engineering model which also includes information about the failure behaviour of components. In this model, components do not need to be in series–parallel configurations. The model may, for example, include bridges between parallel paths, hierarchically nested series and parallel connections, as well as complex functional dependencies caused by control loops. Perhaps a more significant departure from earlier work is that our basic failure assumption goes beyond the classical "success–failure" model. Indeed, in our approach, components can exhibit more than one failure mode, including the "loss" but also the "commission" of functions as well as "value" and "timing" failures (McDermid et al., 1995; Papadopoulos et al., 2001). Such failures can in turn propagate through connections in the system model and cause hazardous failures at system outputs. Hazardous failures may also be stopped by fault-tolerant mechanisms, or be detected by system monitors and transformed into less severe failures.
A typical example of the latter is the transformation of a hazardous timing failure into an omission, as can be seen in the case of a communication controller that detects early or late data at its input and, in response, falls silent to avoid corruption of other data on the bus (Kopetz, 1995). We hope that this type of more detailed and realistic failure modelling, which is possible within the framework of the proposed techniques, will help to improve the accuracy of the automated analysis performed by the tools and, ultimately, the quality of the solutions reached. This, we believe, will in turn help to improve the industrial relevance and uptake of the proposed techniques. Clearly, though, much more work is required in the future to validate this belief.
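For contrast, the classical success–failure model assumed by the earlier redundancy-allocation work can be sketched in a few lines (our own illustration; the reliability values are invented):

```python
# The classical "success-failure" model assumed in earlier redundancy
# allocation work (e.g. Coit and Smith, 1995, 1996), sketched for contrast:
# a series block fails if any component fails; a parallel block fails only
# if all of its components fail.

def series_reliability(rs):
    """Series configuration: R = R1 * R2 * ... * Rn."""
    r = 1.0
    for ri in rs:
        r *= ri
    return r

def parallel_reliability(rs):
    """Parallel configuration: R = 1 - (1-R1)(1-R2)...(1-Rn)."""
    q = 1.0
    for ri in rs:
        q *= (1.0 - ri)
    return 1.0 - q

# A three-way parallel stage feeding a series controller:
stage = parallel_reliability([0.9, 0.9, 0.9])   # 1 - 0.1^3 = 0.999
system = series_reliability([stage, 0.99])      # ~ 0.98901
print(system)
```

These closed-form expressions are what the fault-tree-based model replaces: they break down as soon as the architecture contains bridges, control loops, or multiple failure modes per component.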
7. Conclusions Development processes in the automotive industry need to evolve to address increasing demands for
integration of car functions over common networked infrastructures. In this paper, we argued that new processes must address both cost and safety concerns and maximise the potential for automation in order to address the problem of increasing technological complexity. We also proposed a process which combines work on automated safety analysis with multi-objective evolutionary optimisation techniques to achieve some of the attributes that we believe should characterise an effective development process. The proposed process relies on tools to automate those aspects of the design of critical systems that we believe could be automated, and thus simplified, without loss of the creative input brought to the process by designers.

These tools are currently at different stages of development. The optimisation tool that assists the search for promising design concepts early in the design process is quite mature and has been tested by Volvo on a case study of realistic complexity. The automated safety analysis tool is also mature and allows fault tree and FMEA analysis of complex design models developed in popular modelling and simulation tools. The redundancy allocation tool, on the other hand, is less mature and currently under development in a project between Volvo and the University of Hull. In this project, a case study is also being carried out on a full-scale drive-by-wire prototype car with individual four-wheel steering which has been developed by Volvo Cars. The first results from this case study lie mainly in the area of automated safety analysis. These results are encouraging and show that the proposed process and tools can be effectively applied from the early stages of the design, integrating safety analysis into the design and potentially elevating it into a tool that can drive the design process. There is not yet sufficient project experience to judge the practicability of this approach.
However, at least in the area of safety analysis, early experience suggests that it is relatively easy to produce the failure annotations needed at component level, and that some re-use of this information is possible. The proposed process is therefore likely to give economic benefits by simplifying the construction and analysis of safety and reliability models (i.e. fault trees and FMEAs) for complex systems. We expect soon to be in a position to report further on this work and on new results in the area of design optimisation which, we hope, could provide us with further insight into the practicability, scalability and limitations of this approach.
Acknowledgements We would like to thank Volvo Cars Corporation for funding this work, in particular Johan Wedlin and Mats
Willanders for their useful comments and for supporting the development of the ideas presented in this paper.
References

Andersson, J., Wallace, D., 2002. Pareto optimization using the struggle genetic crowding algorithm. Engineering Optimization 34 (6), 623–643.
CENELEC, 1999a. Railway applications: The specification and demonstration of dependability, reliability, availability, maintainability and safety, EN 50126, European Committee for Electrotechnical Standardisation.
CENELEC, 1999b. Railway applications: Safety related electronic railway control and protection systems, EN 50129, European Committee for Electrotechnical Standardisation.
CENELEC, 1999c. Railway applications: Software for railway control and protection systems, EN 50128, European Committee for Electrotechnical Standardisation.
Coit, D.W., Smith, A.E., 1995. Optimisation approaches to the redundancy allocation problem for series–parallel systems. In: 4th Industrial Engineering Research Conference, Nashville, TN, pp. 342–349.
Coit, D.W., Smith, A.E., 1996. Reliability optimisation of series–parallel systems using a genetic algorithm. IEEE Transactions on Reliability R45 (2), 254–260.
Dengiz, B., Altiparmak, F., Smith, A.E., 1997. Efficient optimisation of all-terminal reliable networks using an evolutionary approach. IEEE Transactions on Reliability R46, 18–26.
Deeter, D.L., Smith, A.E., 1997. Heuristic optimisation of network design considering all terminal reliability. In: Annual Reliability and Maintainability Symposium, Philadelphia, pp. 194–199.
EUROCAE, 1997. Certification considerations for highly-integrated or complex aircraft, ED-79/ARP-4754, European Organisation for Civil Aviation Equipment.
EUROCAE, 1999. Software considerations in airborne systems and equipment certification, ED-12B/DO-178B, European Organisation for Civil Aviation Equipment.
IEC, 1997. Functional safety of electrical/electronic/programmable electronic safety-related systems, IEC-61508, International Electrotechnical Commission 65A/179–185.
ITI, 2003. Simulation X, ITI GmbH Dresden. Available from .
Fonseca, C., Fleming, P., 1998. Multi-objective optimization and multiple constraint handling with evolutionary algorithms. IEEE Transactions on Systems, Man & Cybernetics 28, 26–37.
Fussell, J.B., Vesely, W.E., 1972. A new methodology for obtaining cut-sets for fault trees. Transactions of the American Nuclear Society 15, 262–263.
Grante, C., Andersson, J., 2003. Safety considerations in optimisation of design specification content. In: International Conference on Engineering Design, Stockholm.
Heimdahl, M.P.E., Keenan, D.J., 1997. Generating code from hierarchical state-based requirements. In: International Symposium on Requirements Engineering, Annapolis, pp. 210–221.
ISO, 2000. Road vehicles - Controller Area Network (CAN) - Part 4: Time-Triggered Communication, ISO/CD11898-4, International Organization for Standardization, Geneva.
Isograph, 2003. Fault Tree Plus, Isograph Ltd, Manchester. Available from .
Kopetz, H., 1995. The time-triggered approach to real-time system design. In: Predictably Dependable Computing Systems, ESPRIT Basic Research Series, Springer-Verlag, Berlin.
Kopetz, H., Grunsteidl, G., 1994. TTP: a protocol for fault-tolerant real-time systems. IEEE Computer 27 (1), 14–23.
Leen, G., Heffernan, D., 2002. Expanding automotive electronic systems. IEEE Computer 35 (1), 88–93.
McDermid, J.A., Nicholson, M., Pumfrey, D.J., Fenelon, P., 1995. Experience with the application of HAZOP to computer-based systems. In: COMPASS'95, Gaithersburg, MD.
Papadopoulos, Y., Maruhn, M., 2001. Model-based automated synthesis of fault trees from Simulink models. In: International Conference on Dependable Systems and Networks, Göteborg, pp. 77–82.
Papadopoulos, Y., Petersen, U., 2003. Combining ship machinery system design and first principle safety analysis. In: 8th International Marine Design Conference, Athens, vol. 1, pp. 415–426.
Papadopoulos, Y., McDermid, J.A., Sasse, R., Heiner, G., 2001. Analysis and synthesis of the behaviour of complex programmable electronic systems in conditions of failure. Reliability Engineering and System Safety 71 (3), 229–247.
Pareto, V., 1896. Cours d'Économie Politique. Rouge, Lausanne.
SAE, 1996. Guidelines and methods for conducting the safety assessment process on civil airborne systems and equipment, ARP 4761, Society of Automotive Engineers.

Yiannis Papadopoulos holds a degree in Electrical Engineering from the Aristotelian University of Thessaloniki in Greece, an MSc in Advanced Manufacturing Technology from Cranfield University and a PhD in Computer Science from the University of York. In 1989, he joined the Square D Company where he took up a leading role in the development of an experimental Ladder Logic compiler with fault injection capabilities and a simulator for the Symax programmable logic controllers. Between 1994 and 2001 he worked as a research associate and research fellow in the Department of Computer Science
at the University of York. Today, he is a senior lecturer in the Department of Computer Science at the University of Hull, where he continues his research on safety and reliability of computer systems and software. Dr. Papadopoulos is the author of more than 25 scientific journal and conference papers. His research engages with a body of work on safety analysis and design optimisation, contributing new methods and algorithms that automate aspects of safety analysis, fault-tolerant design and operational monitoring of complex computer-based systems. His work currently develops through extensive technical collaborations with European industry, mainly Volvo Cars, Jaguar-Landrover, DaimlerChrysler and Germanischer Lloyd.

Christian Grante graduated in 1998 with an MSc in Mechatronics from Chalmers University of Technology and in 1999 with an MSc in Engineering Design from the Georgia Institute of Technology. He joined the Department of Research and Development at Volvo Car Corporation in 1999 to work in the area of dynamic embedded safety-critical systems. His work is presently focused on the development and evaluation of safety assessment methodologies and tools for the integrated design and safety assessment of such systems. Christian has contributed to a number of research and development projects in this area. Recently, he has managed the Sirius project, a joint project between Volvo and Luleå University of Technology in which an X-by-wire prototype car with individual steering on four wheels was developed. Christian is also currently completing a part-time PhD programme on safety of mechatronic systems at Linköping University of Technology. His doctoral work is looking into the development of industrially applicable design and safety assessment processes that could achieve trade-offs among a number of parameters which include functionality, safety, development cost and profit.