Workshop on Dependable Systems of Systems

Technical Papers

Editor: George Despotou

5-6 September 2011

University of York, York, UK

Introducing the Challenges of Dependable Systems of Systems
George Despotou

The term Systems of Systems (SoS) has increasingly been used to describe classes of systems exhibiting a combination of characteristics such as emergence, autonomy and dynamic reconfiguration. Although there is a good understanding of the kind of system paradigms (e.g. air traffic control, network enabled capability, UAV swarms) we consider as being SoS, a consensus on the exact definition of a SoS has yet to be established. A number of terms have been used to describe similar paradigms of systems. Despite the different nomenclature, all terms converge on the fact that this type of system exhibits a number of characteristics (to various degrees) that make it discernible, such as:

• Overall objectives: A System of Systems is tied together by a set of high level goals of interest to its stakeholders, such as provision of air traffic management (for ATC) or accomplishment of a mission (for NCW).
• Complexity: SoS are typically used in problems of high complexity (e.g. aircraft route planning or targeting).
• Multiple elements: A System of Systems consists of many elements which are systems in their own right and can be, or have been, developed independently from the SoS (e.g. a radar or an unmanned aircraft).
• Autonomy: Elements are able to make their own decisions with varying degrees of autonomy (e.g. autonomous vehicles).
• Geographical dispersion: SoS deployed in the real world often involve elements that are geographically dispersed and mobile – changing their position according to the overall SoS objectives.
• Collaboration: SoS elements collaborate, each contributing different functions in order to achieve the overall SoS objectives (e.g. sharing of intelligence requires that some elements sense while others analyse data).
• Communication: Collaboration between the SoS elements requires exchanges of information.
• Independent development: The elements that constitute a SoS are often built without the operational context of a particular SoS in mind, potentially with different technologies used. Elements often include legacy systems that are configured to collaborate in a particular way.

By and large SoS can be seen as systems consisting of elements that have been developed independently, which can be described as systems in their own right. Despite their independence, the elements of a SoS need to collaborate, combining their functionality, in order to achieve higher SoS objectives. This often includes adapting to environmental conditions by changing configuration (to deliver different objectives). Configurations can be pre-determined, but can also be the result of autonomous behaviour. This can raise obvious challenges when engineers need to be confident about the deterministic operation of the SoS with regard to attributes such as safety.

Overall, a System of Systems can be considered as being an organised complex unity assembled from dispersed, highly collaborating, autonomous systems – each of which is capable of operating independently.

These characteristics introduce a number of challenges that make hazard analysis difficult to perform, such as constant evolution and reconfiguration, emergence, functional composition from independent systems, and lack of clear ownership. SoS constantly evolve and adapt according to their environment (and the needs of the operator): their concept of operations changes, new elements are added and new capabilities are implemented. This results in a fluid operational context that cannot be assumed to be fixed during hazard analysis. Furthermore, a typical SoS capability involves the functions of many individual SoS elements, such as sensing and processing elements, human supervision, input and processing, as well as considerable network infrastructure. The human is also considered a significant characteristic of SoS. The human is often unpredictable and prone to err, and is often part of accident sequences, but can also be seen as 'the hero' resolving situations that the SoS cannot. These aspects of a SoS often obscure the way a hazard can develop. Failures in a system may propagate through collaboration between elements and transform into different types of failures, eventually manifesting as hazards. Moreover, it can be the case that all elements perform as expected, but the resulting overall emergent behaviour is unsafe.

Apart from hazard analysis, the SoS characteristics can also introduce difficulties when making claims about the safety of its operation, either implicitly or explicitly (e.g. in a safety case). Different elements are developed using different standards. Synthesis of a SoS safety case out of existing safety cases is possible; however, it raises problems of (among others) heterogeneity of the evidence used, as well as the relevance of the assumed operational context in which each (SoS element) case has been created. Even if this is achieved, there is the hurdle of maintaining the validity of the case during operation. As SoS exhibit a dynamically reconfigured structure, there may be configurations that have not been analysed. This can affect the relevance of the safety case, as the included argument may assume a number of known configurations. Even if a safety case about the SoS design is achieved, there remains the challenge of uncertainty in the operational aspect of the safety case. A number of approaches that may produce evidence reducing the uncertainty of the operational safety case of a SoS are considered (e.g. simulation), but their effectiveness is based on anecdotal evidence and remains to be determined.

From a development point of view, achieving safety will require a number of design or architectural decisions from the inception of the SoS. Often, because SoS elements cannot be modified (at least not without significant cost and time overhead), hazard management will need to occur at the SoS level. The safety policy will need to be effective and inclusive in order to prevent all identified hazards. In case safety-derived requirements are not met by the SoS constituent elements, the SoS behaviour will need to be revised so that realistic safety-related requirements may be assigned to elements. Notwithstanding the cost and time of development, there will be cases when the result of these decisions is at odds with the interests of the stakeholders and the operational fitness of the SoS in general, an outcome that can be counter-intuitive when safety is treated as being of utmost importance. Stakeholders need to be in a position to evaluate the end effect of operational and system (design) changes, understanding and prioritising the (unwanted) impact of the trade-offs.

Many of the (safety-related) obstacles observed in SoS are not new. Although there is no exact consensus on SoS, there is also no contention; practice demonstrates that their characteristics further exacerbate these obstacles to an extent that can be unmanageable. Existing methods often fall short of providing a clear picture of the SoS and of generating suitable evidence that will convince the developers and operators about its safety. Continuing the ongoing discussion is necessary to achieve the necessary degree of convergence in SoS engineering.

Starting to Consider the Relationship between Human Work and Emergent Properties in Systems of Systems, by C. Fairburn, offers a perspective on how the human can contribute to the emergent behaviour of a SoS and affect its dependable operation. Delivering COTS based Dependable Solutions, by A. Harrison, offers a practitioner's perspective on lessons learnt from the use of open architectures to create (SoS) capabilities. The Dependability Case Delivering Value for Money Programmes, by J. Hursell, provides an insight into the concept of dependability cases, from a standards perspective. Iterative and Simultaneous Development of Embedded Control Software and Dependability Cases for Consumer Devices, by Y. Matsuno et al., discusses the challenges of engineering SoS of consumer devices, also highlighting the issue of assuring dependable operation. Interface Contracts for Architectural Specification and Assessment: a SysML Extension, by R. Payne and J. Fitzgerald, describes a bottom-up approach to capturing and formalising the interfaces between components of an architecture, contributing to the understanding of the composition of SoS capabilities.

The Workshop on Dependable Systems of Systems (WDSoS'11) was an attempt to bring together academic and industry researchers and practitioners interested in the subject of Systems of Systems, in order to understand the main challenges and ways forward.

Starting to Consider the Relationship between Human Work and Emergent Properties in Systems of Systems
Christian Fairburn
BAE SYSTEMS Submarine Solutions1
Barrow-in-Furness, Cumbria LA14 1AF, United Kingdom
[email protected]

ABSTRACT As the notion of Systems of Systems (SoS) as a distinct analytical problem gains momentum, it is being recognised that traditional safety assessment methods are unlikely to be fully equipped to account for the routes to hazard offered by these structures. It is important that methodological responses to these concerns take into account the human elements as well as the technical elements of such systems during safety assessment. Previous work in which the author has been involved proposed a generic approach intended to increase appreciation of the Human Factors (HF) of relevance to SoS Hazard Identification. A call is made here for future analyses to take explicit account of a key area of HF concern, namely the ways in which people may operate at a cognitive level during joint or collaborative work. Such work is expected to be both commonplace and potentially complex within a SoS. This paper draws upon Distributed Cognition (DC) and Distributed Situation Awareness (DSA) to throw a spotlight on this area of concern and provoke debate around notions of emergence, cognition and joint work in a SoS safety sense. It is becoming clear that predicting emergent properties in a SoS is not simply a technology-centred issue. By virtue of the possibility of joint and distributed work within a SoS, emergent properties are likely to be routine from an HF perspective. Not only does the presence of emergent properties have the potential to degrade SoS performance, human work frequently relies on emergent properties for success. Understanding the positive and negative contributions of emergent phenomena to SoS dependability is a significant HF concern which practitioners will need help to address in SoS safety assessment.

Categories and Subject Descriptors H.1.2 [Models and Principles]: User/Machine Systems - Human factors, Human information processing.

General Terms Human Factors.

Keywords Systems of Systems, Safety, Joint Work, Collaboration, Human Reliability Analysis (HRA), Distributed Cognition, Distributed Situation Awareness.


1. INTRODUCTION Systems of Systems (SoS) have been distinguished from Complex Systems by virtue of certain key characteristics [1, 2, 3]. These characteristics have been used as a basis to argue that traditional safety assessment methods may fall short for the purposes of SoS analysis [3]. One defining trait is that the nodes or entities within a SoS are able to have a ‘life of their own’ exhibiting a degree of autonomy in their own right on an independent basis. A SoS is formed when Systems or Complex Systems come together in order to pursue some common overall objective. In doing so, it has been noted that a SoS may exhibit so-called ‘emergent properties’ which are meta-properties of the SoS and which cannot be localized to the contributing systems themselves. It is also recognized that a SoS may exhibit a different structure from one instance of operation to the next. While the functional purpose of a SoS may remain fairly stable, membership may change and specific nodes or node types may not always be present. The functions performed within a SoS may therefore be allocated in different ways on different occasions of operation. A node within this structure may find its role or remit of responsibility vary. In fact, function allocation, constituent roles and SoS membership may vary dynamically as work continues to progress and functions continue to be performed. Indeed, this type of behaviour is seen by some as a key goal for modern Network Centric Warfare (NCW) [4]. On this basis, it seems that the key to understanding an SoS and SoS reliability is to not only understand the overall function and underlying sub-functions of the architecture well (the common objectives and contributions to those objectives), but also to appreciate the ‘glue’ that might act to hold the constituent parts of the SoS together and enable work to be managed and combined such that a common purpose may be realised. In particular, the key to a dependable SoS must involve robust mechanisms whereby the variable contributions of SoS nodes are integrated and combined. For Maier, an important facet of this ‘glue’ is achieved through communication and communication standards [2]. Maier has stated that the strengths and weaknesses of a SoS exist at the interfaces between nodes and he has provided guidance to those designing a SoS on this basis. Although this assertion was made with a focus on the technical, it is likely that this type of argument also holds true in a Human Factors (HF) sense. For humans and groups of humans to join together within a SoS and work towards common objectives, successful communication is of course essential. What is of interest in relation to the human-human

1 Note: the opinions expressed within this paper are those held personally by the author and should not be taken to constitute a formal Company position.

working that may take place within a SoS, is its intimate connection with the notion of emergence. It is argued here that distributed multi-agent work and human collaboration are well recognized in the Cognitive Systems Engineering (CSE), socio-technical systems and even conversation and joint work analysis literature as a potential ‘hotbed’ of what those writing in SoS literature might consider to be emergent properties. This paper aims to bring together some of this literature in the context of SoS safety assessment, focusing in particular on the ways in which a SoS may operate at a cognitive level during work. It is argued that this and related existing literature should feed into the debate around SoS dependability, acting as a potentially rich resource for those engaged in Hazard Identification and Hazard Analysis activities in a SoS context.

2. EMERGENT PROPERTIES The notion of emergent properties and of emergence itself can be treacherous ground where written words are concerned, therefore time is taken here to describe in plain terms the perspective from which this paper is written. The Oxford University Press views emergence as “the process of coming into being or becoming prominent” and defines emergent properties as those which arise as “an effect of complex causes” and which are not “analysable simply as the sum of their effects” [5]. The dictionary is cited here as a shorthand narrative device to draw to our attention the way in which, by its nature, emergence is very much a ‘run-time’ phenomenon. It is dynamic and, from a SoS perspective, closely connected with the contribution and interaction of systems in their pursuit of the common SoS objective. The dictionary also reminds us that emergent properties are not necessarily clear in their nature even after the fact. According to the dictionary, looking at the effects of emergent properties, the overall outcome, doesn’t necessarily give us insight to what ‘emerged’. Other authors state that emergent properties remain elusive when looking ahead of time, the ‘input’ to the dynamic moment in terms of SoS architecture and structure does not give us perfect insight into the nature of emergent properties either, an SoS is not simply the sum of its parts and a ‘supersystem’ emerges only through interaction of its components [2]. This paper self-consciously treads a fine line. In a literal sense, anticipating or fully understanding a true emergent property is forever impossible. The word itself has close links with unpredictability. If one perfectly anticipates what one terms an emergent property, then there will be those who can legitimately argue against its categorisation as being truly emergent in the first place. On this basis, the author will try to leave the philosophy to the philosophers and instead put forward a set of characteristics, which for him at least, take pragmatic steps to describing in the written word some of what SoS dependability needs to appreciate in relation to emergence. These characteristics will then be borne in mind as the paper moves on to review what has been written about human work in contexts of relevance to SoS analysis. In this paper, within these bounds, emergent properties are considered as: Phenomena linked with SoS operation which may come to exist (and cease to exist) in a dynamic way. Phenomena which are imperfectly predictable ahead of time on the basis of ‘input’ (e.g. SoS architecture / membership).

Phenomena which are imperfectly distinguishable after the fact on the basis of ‘output’ (e.g. work performed, goals achieved). Macro in nature, a property of the SoS as a whole rather than individual SoS members / nodes themselves. Linked with a ‘more than the sum of its parts’ view of SoS operation. This perspective allows for the fact that emergent properties may be positive or negative in their nature when considered in terms of SoS dependability. Also, even when the impact on dependability is negative, emergent properties tend not to be the result of failure but normal operation. It is also compatible with statements that have been made characterising properties such as Resilience and Safety as emergent properties [for example, see 6]. The following sections recruit material from the HF literature to ‘think out loud’ about the close connections between human work and emergence in a SoS environment.

3. EXAMINATION OF THE LITERATURE
3.1 Contextual View of Human Error
The contextual view of human error takes a strong view that, when considering higher order error forms connected with decision making and problem solving, humans have a very strong tendency to do things that make sense to them at the time [7]. Dekker argues that it is rare for professionally trained experts with their lives on the line operating within safety critical contexts to behave complacently, in a slap-dash, arrogant or dim-witted way. This may sound obvious, but it is argued that HF aspects of accident analysis frequently run the risk of failing to translate 'context-specific data' (information found within a detailed accident account) to 'concept-dependent' data in the language of HF without mishap. For example, Dekker is concerned to avoid the self-referential, circular and ultimately meaningless application of broad unqualified HF labels (e.g. 'loss of situation awareness'). He argues successfully that such strap lines are endowed with little or no productive meaning in terms of accident diagnosis and prevention. He also cautions that analysts should move away from a 'broken component' approach to human action and realize that understanding the circumstances and context of action surrounding the actor and its influences on behaviour is a much more productive investigation than the identification of specific actors as 'broken' or 'faulty' system components. According to Dekker's account, putting different pilots in the same seats would very rarely prevent an accident and would very rarely constitute a successful safeguard for future operation. Dekker's 'field guide to human error investigations' rests upon this premise and guides accident investigators to reconstruct the "unfolding mindset" of those involved in an accident [7]. To understand failures properly, one needs to avoid the perils of hindsight and understand the accident context from the insiders' point of view. Dekker drives home his point by illustrating the contextual view on error with a tube or pipe-like diagram (see Figure 1). This illustration serves to highlight that accident investigators are likely to have much wider access to information pertaining to an accident than the protagonists involved in its evolution. By virtue of hindsight, investigators can follow what appears to them a causal evolutionary chain of events, knowing that a negative outcome was on the horizon. Actors working at the time, however, operated within some kind of contextual tube, the walls of which may have related to information access, personal experience, hypotheses, misunderstandings or any number of things which had a bearing on what made sense to them at the time events were unfolding. Dekker's more recent work on safety culture and also the notion of so-called Drift into Failure is an extension of this type of perspective which illustrates a maturing view in this direction and an even stronger quest to avoid a linear or causal view of complex accidents [8].

Figure 1: Contextual view of human error, after [7].

The contextual view of human error (in fact, Dekker is reluctant to use the 'error' term without qualifying the risks of hindsight bias) has significance for analysis at the SoS level. From a SoS perspective, it suggests that human mental and physical actions are related to the embedded view of humans acting in real-time in the SoS context with a limited or perhaps distorted view of ground truth. Their views of the world, the walls of their 'tubes', must be at least partly related to the SoS architecture and the way in which the SoS as a whole functions on this particular occasion. The contextually embedded nature of the mindsets adopted across a SoS and therefore the nature of any 'errors' that may be made are starting to sound very closely connected with emergence. Relating this view to safety assessment, it becomes clear that those producing and qualifying safety critical systems or SoS would also benefit from somehow embracing a contextual view of human reliability.

3.2 Internal and External Representation
The Distributed Cognition (DC) perspective adopts a view of cognition as something which takes place and resides both internally and externally across an environment of work [9, 10]. In other words, information may be represented, translated and combined in actors' heads but it may also be similarly processed externally outside people's heads by virtue of artefacts such as instrumentation, user interfaces, paper based objects or other physical items which may be said to have the ability to represent information in various explicit, implicit, direct or indirect ways. For DC, cognition is therefore connected with the way in which information is combined, translated and passed around a system of human actors and automated systems ('agents') via different artefacts. These artefacts, whether they are paper based forms or highly complex computer interfaces, may influence the quantity of work required of people and they may also transform the nature of the work that people have to perform [9, 10]. By way of a domestic example, determining the approximate time you have until you need to return to your car in a Pay and Display car park using a traditional analogue watch could be a rule based spatial judgement task (concerning hand positions and their proximity to the numbers on the dial). A digital watch does not afford this approach and instead represents time in a way that is likely to require some (albeit simple) mathematics on the part of the human user. Alternatively, one could offload aspects of this task in a different way and set an alarm. In this latter case the work of the human is concerned with remembering why an alarm was set and determining its meaning with reference to the tasks at hand when it actually goes off. From a SoS perspective, the DC view points towards cognition as being both dynamic and distributed. Cognition is about the bringing together of information representations in different ways, about the transformation and movement of information across people, technology and objects in the systems within a SoS and across the SoS as a whole.

3.3 Mutual Knowledge and Distributed Situation Awareness

Studies conducted by Clark on joint work and what he has termed ‘Common Ground’ have expanded beyond the original domain of conversation analysis to tell us much about what those working in DC would consider to be some of the internal representations implicated in distributive collaborative work. Clark used the ‘Joint Project’ term as an aid to the analysis of talk-based co-ordination between actors [11]. However, as a unit of analysis it can extend beyond mere talk to encompass aspects of physical work that involve more than one actor operating in a co-ordinated way for other purposes. For example, previous studies have extended this language of analysis to account for work in Air Traffic Control (ATC) where joint projects can include things such as aircraft routings, separation and movements [11, 12]. The aim is not to reproduce a full account of Clark’s theories here, but to draw attention to one particular aspect of multi-agent work which he describes particularly well and which, for him and those analyses based on his work, is very much central to the reliable achievement of joint projects. This relates to notions of mutuality and the way in which mutual knowledge may come to exist between collaborating actors as they work. For information to be mutual, or in Clark’s terms to be part of ‘Common Ground’, actors are required to know an item of information and also to know that each knows the information. In this way, mutuality is something of a recursive phenomenon (‘I know x, she knows x and we both know that we know x …’). Such recursion may be achieved by direct and public agreement – for example those in direct communication may explicitly establish an element of information as something they agree upon during verbal communication. Alternatively, collaborating actors may make larger leaps in assumption, for example knowing that a piece of data is displayed on their own screen and on someone else’s by virtue of system design might lead an actor to believe that the data concerned is mutually held. When humans work together, information may be held or deduced implicitly via mutual knowledge. The ability of humans to infer the plans and goals of others and to interpret the meaning of the actions that others may make, or choose not to make rests heavily on the ability to understand incoming information in the context of existing mutual knowledge. Mutuality is unique to human interaction and it has been argued that mutual knowledge is unattainable by non-human automated systems by virtue of its recursive nature [13]. From an emergence perspective, mutual knowledge could not be said to be directly traceable as a property belonging to individual

actors. Actors will develop and maintain their own internal representations of what is and what isn’t believed to be mutual knowledge, but mutuality as a shared and commonly established resource for action seems very much a property of the act of collaboration itself, something which comes to exist as a function of work in practice and which would be, in a SoS environment, more of an emergent property of the work system as a whole than individual actors or groups of actors within it. This notion of a system wide (or SoS wide) emergent property with close connections to the internal mental representations of collaborating actors has also been explored by other researchers. Elsewhere in the literature, Distributed Situation Awareness (DSA) applies an explicitly DC style approach to the study of Situation Awareness (SA). DSA is intended to assist consideration of the different ways in which collaborating networks of people, artefacts and automated systems might be designed to facilitate processes of situation assessment. For DSA, SA may exist internally and externally across the environment of human work. Much has been written on this approach, but the concepts of ‘Compatibility’ and ‘SA Transactions’ are of particular interest from an SoS perspective and, as we will see, have a connection with what people like Clark have to say about mutuality. The concept of Compatibility is based on the realization that no two individuals will hold exactly the same perspective on a situation. However, for DSA researchers this is not necessarily problematic. Firstly, it is generally the case that not everyone working together needs to know absolutely everything there is to know; rather they only need to possess the knowledge that is most crucial for their role in the work at hand. Secondly, even when people do need to know the same things, DSA only requires the beliefs held by actors to be ‘Compatible’ for work to succeed. The crucial thing from a DSA view of safety is that co-operating actors don’t have conflicting views of a situation; it is conflict and outright contradiction that damages work. Subtle differences may safely persist providing that, in terms of the work at hand, likemindedness still exists across actors at a level sufficient for common objectives to be achieved in a safe manner. The notion of Compatibility is therefore inextricably contextual and tied up with the roles and common goals of actors at a particular point in time. Though DSA discounts the notion that true, perfect shared awareness never exists between actors, it does not discount the fact that sharing of information takes place. SA Transactions describe the way in which the passing of information between people can, possibly as a by-product, enhance actors’ awareness of one another through exchanges of SA information. SA Transactions, as the name suggests, applies to exchanges of information which also could be said to represent an exchange in awareness between actors. The SA-relevant content of these information exchanges may be implicit or embedded within the information exchange rather being the main message or constituting the overall purpose for communication. The meaning within these transactions can furnish an actor with a greater awareness of what others are doing, what they know or perhaps what they don’t know. DSA is argued to be an emergent property of collaboration or teamwork [14]. 
Each actor’s SA (and so the overall DSA) is updated by SA Transactions which may involve actor to actor communication or the posting of information to an external representation, for example an artefact such as a computer interface. On the face of it, DSA may appear to go against notions of mutual knowledge, but this isn’t so when one takes into account the

acknowledgement by authors such as Clark of the way in which the establishment of mutuality, although a joint activity, tends to result in individually held beliefs regarding the degree of mutuality that has been achieved. DSA describes how SA information may be stored internally and externally, Clark focuses in on how mutual internal representations may be formed. In this way, these viewpoints are in harmony and point towards emergent coordination mechanisms between collaborating actors of clear relevance to reliable SoS operation.

4. EMERGENT PROPERTIES & THE HUMAN CONTRIBUTION TO SYSTEM OF SYSTEMS DEPENDABILITY Examination has highlighted some important and influential areas of the HF literature that would argue for very close connections between the human contribution to SoS operation and the notion of emergent properties. Though some of this material may have been produced with an initially smaller scale focus on the behaviour of single actors or the interaction of just two collaborating people, all are capable of being applied to work in complex socio-technical systems. It is argued here that these same principles also have important messages for analysis at the SoS level. In broad terms, the material that has been cited takes a view that emergence is a routine facilitator of multi-agent human work and that emergent properties such as DC, DSA or mutual knowledge frequently hold the key both to success and failure. Thinking in these terms from a SoS standpoint suggests that the contributions of human actors within a SoS are likely to be integrated and combined via mechanisms which from an analytical perspective could constitute emergent properties of the SoS. According to this perspective, at least some of the ‘glue’ holding the human contributions together within an SoS and therefore considered to be important to dependability will be emergent in nature. Returning back to Maier’s concerns regarding communication, it is unlikely that human to human communication would be reliable without them. This may provoke concern with regard to predictive safety assessment. Emergent properties are acknowledged as intangible and imperfectly predictable, they are dynamic - coming into existence and ceasing to exist over time. In a SoS they are also likely to be complex in their form and the manner in which they arise. Yet, the perspective on human behaviour that is discussed here urges us that we need to be able to design to encourage and shape some of them even if we can’t expect to fully understand their precise nature before run-time. To argue that performance in a human-rich SoS may routinely depend upon emergent properties for success is different to acknowledging that SoS dependability can be vulnerable to erosion because emergent properties may become present. It throws up complexities and a level of intangibility that could be unsettling from a SoS system design and qualification point of view. However, the fact that various frameworks and models already exist to describe at least the likely flavour of these emergent properties should provide reassurance that we can start to incorporate some acknowledgment of their potential within our SoS Hazard Identification and Hazard Analysis methods. In particular, a Differential Analysis approach to Hazard Identification [3] would seem to be sufficiently generic to afford inclusion of multi-agent models of human work such as

Delivering COTS based Dependable Solutions

Anthony Harrison
Thales

[email protected]

Abstract

The defence market has increasingly adopted COTS hardware and software systems as part of a move from proprietary closed systems towards open systems. The move to an open system is not without its challenges, but the benefits of an open system often exceed the original goals, particularly with regard to reliability. This paper describes some of the additional benefits that were realised when migrating a large sonar system to a COTS architecture, due to the selected technical solution.

Key words
COTS, Defence, Open Architecture, SOA

1 INTRODUCTION

Over the last 10 years, the defence market has undergone a number of significant changes following the increasing adoption of COTS hardware and software systems as part of the move from proprietary closed systems towards open systems. This is to meet the customer's objectives of avoiding the potential for vendor lock-in and controlling costs in order to get 'value for money'. A greater reliance on systems built using recognised standards, and a need for systems to be increasingly flexible to continually changing mission needs, has introduced some key architectural principles, which have been used to develop systems that are now entering operational service. This paper first describes the key architectural principles that are now being adopted. It then describes a recent implementation and the lessons learned from this implementation.

1.1 Background

In the past (last century), the majority of defence systems were typically developed by a single prime contractor who was responsible for the specification, design, implementation and ongoing support of the system. System developments took a long time, were often based on proprietary and bespoke components, and required extensive testing to demonstrate the reliability and quality of the system. In some cases it was impossible to satisfactorily prove the reliability of the system prior to entering operational service, due to the complexity of the implementation and the tight coupling of components. This approach has become increasingly untenable, particularly when systems needed to be replaced due to obsolescence of the processing hardware, which often required a full replacement of the system rather than an evolution of the existing in-service system. In order to meet these undoubted challenges there has been a clear move to use COTS components as an alternative to proprietary components, and to develop an open architecture, which allows the system to be constructed from components sourced from many different providers and co-ordinated by a system integrator. For the benefits of an open architecture to be realised, it is necessary to define a number of supporting principles to be followed by the component providers in order for the integrated system to deliver the required functionality. One of the key principles is the concept of modularity, which divides a system into a number of discrete modules which, when operating together, deliver the required system capability. This concept has the effect of creating a System of Systems, as each module can be considered to be a system in its own right due to the supporting system architecture. Modules are capable of being procured independently, such that each module has an independent life cycle with regard to implementation, operation, enhancement and disposal.


Recent proposed evolutions in the architectures of the complete Combat System of a vessel, which encompasses the Sonar, also reflect this modularity goal, leading to multiple layers of the Systems of Systems concept. In fact, it is believed that the application of these concepts at the many layers of the "Battlespace" will improve interoperability and thus improve the flexibility of operations.

2 ARCHITECTURAL INFLUENCES

2.1 Open Architecture

An open architecture defines the overall system structure in terms of its components and their interconnectivity (data and control interactions) together with the properties of components and their interactions. It is a critical element of a system in that it:
• Enables the delivery of complex capability (by integrating multiple functionality, integrating multiple technologies, and supporting progressive integration);
• Accommodates short-life components (with commercially sourced components (COTS) having a practicable life far shorter than that of an operational defence platform);
• Supports capability change (by allowing the evolution over time to meet new demands and counter emerging threats, and through permitting the inclusion of new technology as it becomes available);
• Delivers the non-functional requirements (e.g. security, safety, performance, reliability), which form a major part of required capability.

The implementation of the architecture typically requires that a number of discrete layers are employed to separate out the largely military-specific applications functionality, which constitutes much of the delivered military capability, from the COTS components, which provide the software infrastructure, and the hardware, which constitutes the enabling functionality. Layering also allows a separation of concerns, such as exist between software and hardware, and between delivering military capability and delivering secure behaviour, thereby easing capability upgrade and technology update. An open architecture accommodates a high degree of modularity through the exploitation of open standards; its realisation should also enable the efficient adoption of a variety of COTS products with their associated rapid life cycles.

[Figure: Openness features – business model, management process, technology, lack of IPR barriers, standards, accessibility, granularity, modularity, availability and architecture]

Open Architecture requires both a technical and a business model in order for the benefits to be realised. The technical model considers such aspects as standards (the extent to which the system is based on proven, unambiguous and published standards), availability (the extent to which the COTS components or standards are supported by multiple independent vendors) and modularity (the degree to which the system encapsulates functional entities to enable easy removal or replacement without adversely affecting the remainder of the system). The business model defines the system management process, which allows customers and suppliers to produce, subcontract and maintain an open architecture solution whilst controlling access to IPR to ensure that it can be made available without excessive cost. The structure of a modular architecture requires that the constituent parts are compartmentalised into logical groups without compromising the openness of the architecture. Factors that need to be considered in determining the grouping of components within the architecture include:
• The products typically supplied by the (various tiers of the) industrial base;
• The coupling and cohesion between such components, such that the number of interfaces between components is minimised;
• Whether the products are operationally useful, supportable and separately procurable;
• The handling of overall system integration risk and the parties best able to handle it;
• Manageability of the overall integration task – current practices and competences are typically centred around integrating an intermediate number of pre-integrated components;
• Export implications, potentially unique to a military context, which constrain what the UK industrial base may export.

2.2 Modularity

Establishing the correct modularity of a system is a complex activity due to the inevitable compromises that need to be addressed. Strict adherence to the modularity principle within the architecture is important so that modules can be easily added to, removed from, or substituted within the system. These modules may also:
• Be provided by a different supplier (i.e. reduced dependency upon proprietary hardware & software components);
• Be a different version of the module that has been substituted (which is the case when defects are corrected);
• Provide a different implementation of the same functionality (like for like comparison);
• Provide upgraded functionality (new or changed);
• Employ a different technology (e.g. technology refresh (stay the same) and technology insertion (grow));
• Be a combination of the above.

If the principles are followed, then each module should be capable of replacement with minimal impact on the rest of the system. This means that components can be replaced with minimum system downtime (under operational circumstances). However, as Section x describes, it may also be possible to replace modules without any system downtime if appropriate implementation policies have been followed. This has the additional benefit of creating a more resilient system and also ensures that failure of one module does not result in complete loss of a military capability.

2.3 Reliability or Dependability

Every military system is required to achieve key reliability targets because system users need assurance that the facilities on which they depend to perform their jobs will be available to agreed expectations and can be relied upon to deliver functionality when required. Increasingly, reliability targets are required even when a system is only partially operational and is required to be recovered in the event of failure within pre-defined timescales. Whilst reliability has typically been defined as a pre-defined 'Non-Functional Requirement', the apportionment of the requirement within an open architecture becomes increasingly complex. Typical approaches at a system level may be achieved by considering various strategies including:
• Elimination of single points of failure throughout the delivery chain;
• Use of fault-tolerant hardware that allows graceful degradation of performance in the event of module failures;
• Use of standby systems (operating in either Cold, Warm or Hot) to invoke a changeover to an alternative component.

However, these are difficult to control in the development of a system based on open architecture principles and developed independently by many suppliers. As there can no longer be reliance on a single strategy to deliver the required system reliability targets, it is necessary for each module to be responsible for achieving reliable operation and to be isolated from failure of another module in the system. This can be achieved by ensuring:
• That the boundary of a module is well defined;
• That the amount of information exchanged internally within a module (cohesion) is always greater than the information exchanged with the rest of the system (coupling);
• That modules are always explicit in their communication, with interfaces that adequately describe the dynamic nature of their interactions, i.e. behavioural, synchronisation and Quality of Service (see the illustrative sketch after this list);
• That modules are not engineered using radically different approaches, do not have significantly different physical lives, and do not have different intervals for maintenance and technology update;
• That application modules are isolated from the infrastructure;
• That key interface points are identified and that generic techniques are employed in order to implement those interfaces;
• That subdivision is maintained through conformance testing and interface management;
• That module containment and boundaries embrace the Non-Functional Requirements, to reflect the different constraints or characterisation levels that need to be applied to the various areas of system functionality.

Section 3 describes an approach which has been adopted to meet these goals.
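As a purely illustrative aid (not part of the original paper and not the Thales design; every type, field and module name below is hypothetical), the following C++ sketch shows one way a module boundary could state the dynamic nature of its interactions – behaviour, synchronisation and Quality of Service – explicitly enough for conformance testing and interface management to check them independently of the module's internals.

    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical description of one interface exposed at a module boundary.
    // Synchronisation style and Quality of Service are stated explicitly, so
    // they can be verified without knowledge of the module implementation.
    enum class Synchronisation { Asynchronous, RequestReply };

    struct QualityOfService {
        std::chrono::milliseconds max_latency;  // worst-case delivery time
        double min_availability;                // e.g. 0.999
        std::size_t max_message_bytes;          // bounds bandwidth per exchange
    };

    struct InterfacePoint {
        std::string topic;             // logical name, not a physical address
        Synchronisation sync;
        QualityOfService qos;
        std::uint32_t schema_version;  // versioned behavioural contract
    };

    // A module advertises only InterfacePoints; its internals stay hidden,
    // keeping cohesion inside the boundary higher than coupling across it.
    struct ModuleContract {
        std::string module_name;
        std::vector<InterfacePoint> provided;
        std::vector<InterfacePoint> required;
    };

    int main() {
        ModuleContract tracker{"TrackManagement",
            {{"tracks", Synchronisation::Asynchronous,
              {std::chrono::milliseconds(100), 0.999, 4096}, 1}},
            {{"contacts", Synchronisation::Asynchronous,
              {std::chrono::milliseconds(50), 0.999, 8192}, 3}}};
        std::cout << tracker.module_name << " provides "
                  << tracker.provided.size() << " interface(s)\n";
        return 0;
    }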

3 IMPLEMENTATION USING COTS

3.1 System overview

For many years Thales has delivered world leading Sonar systems to the Navies across the world. These technically complex systems have always required innovative solutions in order to meet the increasing capability requirements of the customer. The systems are characterised by remaining in operational service for at least 20 years and are required to adapt to the continually changing capability requirements of the customer. Around 2003, following experience from the US Navy, the UK decided that an open architecture was required for future evolutions of the sonars installed in the UK Navy. Key members of the industrial supplier base developed a detailed architectural definition that met the open architecture requirements of the UK customer. The architecture did not mandate any particular implementation or technology but each implementation was to be subjected to an independent openness assessment to ensure that the proposed implementation did not compromise or constrain future evolutions of the system. This resulting architecture formed the basis for a technology refresh of one of the largest sonars installed on the UK submarine fleet. Sonar forms the main sensor outfit for the Trafalgar class and the new Astute class submarines. The fully integrated sonar consists of bow, flank, and towed arrays containing thousands of hydrophones, many times the number fitted to earlier submarines. The processing power required for the inboard data handling, data processing, and display processing subsystems is equivalent to many thousands of desktop PCs. The original solution to the Sonar architecture made extensive use of parallel processing using INMOS T-9000 transputers; the technology refresh was to use COTS components for all of the processing elements.


3.2 Development approach

The technology refresh of the Sonar was the first stage in a future evolution of the capability. Whilst the technology being used for the processing was changing, the functionality was unchanged. However, as the existing architecture was not conformant with the requirements of an open systems architecture, a revised architectural definition had to be developed. A sonar system consists of many separate functions that together deliver the required capability. Whilst these functions can be considered to be relatively separate, a key characteristic of the processing is the significant amount of data being transferred between the functions. As the mechanism for inter-communication between these functions (in excess of 200) was one of the major changes from the existing implementation to a COTS based solution, it was essential that an efficient approach to producing the communications 'glue' was used. The existing implementation had a diverse set of communication technologies, which was necessary due to the different hardware employed throughout the system. The COTS based architecture established a single network infrastructure (Ethernet), which enabled for the first time commonality in the communication approach. A layered architecture was developed, with the system decomposed into subsystems and then into modules. At the lowest level (the module), a blueprint is defined which describes the context of the module in terms of interfaces. As the goal of the revised architecture was also to exploit any commonality of processing within the system, this blueprint was generic so that the specific module implementations could be defined at run-time. By exploiting commonality, particularly when looking at similar processing across the different sensors, it is possible to reduce the amount of unique software code to be developed (the pre-COTS based implementation was in excess of 10 million lines of code). The architecture adopted a model driven engineering (MDE) approach using the Rhapsody toolset supplemented by bespoke tools. The processing logic was captured within Rhapsody, but the data exchange between modules was automatically generated by bespoke tools to ensure a consistent approach was followed.

3.3 Inter-Process Communications

In a large system the number of inter-process communications is significant. Every data exchange is a potential area for failure, particularly in handling loss of connections between modules. However, by examining the architecture it was possible to categorise each data exchange into a small number of data exchange patterns, each of which had specific behaviour associated with it. This resulted in over 1600 separate data exchanges being reduced to instances of 6 different data exchange patterns. Mapping each data exchange to one of these patterns, and using an MDE approach, dramatically improved the reliability of the communications infrastructure. This was a direct consequence of the move towards COTS and the use of common hardware throughout the processing.
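The paper does not enumerate the six patterns, so the sketch below is only a hypothetical illustration of the idea (the pattern names, fields and example module names are invented): each concrete exchange is declared as an instance of a small, well-tested pattern vocabulary, from which generation tools can then produce the communication code consistently.

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical pattern vocabulary; the real system reduced more than 1600
    // exchanges to six such patterns, each with fixed, well-tested behaviour.
    enum class ExchangePattern {
        PeriodicPublish,  // sensor-style streaming at a fixed rate
        EventPublish,     // publish on change
        RequestReply,     // synchronous query
        CommandAck,       // command with explicit acknowledgement
        BulkTransfer,     // large, resumable data blocks
        Heartbeat         // liveness monitoring between modules
    };

    struct DataExchange {
        std::string producer;
        std::string consumer;
        ExchangePattern pattern;
        std::size_t max_bytes;  // used for bandwidth budgeting
    };

    int main() {
        // Declaring exchanges as data lets tooling generate the communications
        // code for each instance of a pattern in a uniform way.
        std::vector<DataExchange> exchanges = {
            {"BowArrayProcessing", "TrackManagement",
             ExchangePattern::PeriodicPublish, 4096},
            {"OperatorConsole", "TrackManagement",
             ExchangePattern::RequestReply, 512},
        };
        for (const auto& e : exchanges)
            std::cout << e.producer << " -> " << e.consumer << '\n';
        return 0;
    }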

3.4 Improving Reliability

As an open architecture has to ensure that modules are not tied to a specific physical resource, except in the case of some specific functions where legacy interface support is required, a service-orientated architecture (SOA) approach is often selected due to the great flexibility it offers in constructing system solutions. The discrete functions are organised into interoperable, loosely coupled, standards-based services, which are combined and reused quickly to meet customer needs. The SOA approach followed by Thales encapsulated the three stages of registration, discovery and bind within an infrastructure built on top of COTS components. By encapsulating the underlying implementation of the broker, which managed the registration and discovery activities within the system, it was possible to create a very dynamic, distributed and scalable implementation which allowed the effort of integrating the modules together to form subsystems to be significantly reduced. This resulted in a more rapid build-up of capability, which can then be subjected to the extensive testing required of such a complex system. Due to the dynamic implementation of the broker technology, it was possible to replace modules 'in situ' without having to reset the rest of the system. The system was therefore building up excellent reliability growth data early in the integration phase, giving greater confidence that, once operational, the required reliability of the system would be achieved. Due to the modularity of the system, the system has the ability to carry on running, albeit with some limited functionality being temporarily unavailable, and can therefore avoid loss of capability during critical operations.
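To make the registration/discovery/bind cycle concrete, here is a minimal, hypothetical sketch (not the Thales infrastructure; the Broker class, method names and endpoints are invented). It illustrates why binding through a broker, rather than holding physical addresses, allows a module to be replaced in situ: the replacement simply re-registers and subsequent binds resolve to the new instance.

    #include <iostream>
    #include <map>
    #include <optional>
    #include <string>

    // Minimal, illustrative broker: services register an endpoint under a
    // logical name; clients discover and then bind by name only.
    class Broker {
    public:
        void register_service(const std::string& name, const std::string& endpoint) {
            registry_[name] = endpoint;                // registration
        }
        std::optional<std::string> discover(const std::string& name) const {
            auto it = registry_.find(name);            // discovery
            if (it == registry_.end()) return std::nullopt;
            return it->second;
        }
    private:
        std::map<std::string, std::string> registry_;
    };

    int main() {
        Broker broker;
        broker.register_service("beamformer", "tcp://node-3:5001");

        if (auto ep = broker.discover("beamformer"))   // bind happens against *ep
            std::cout << "bound to " << *ep << '\n';

        // Module replaced on different hardware: clients re-bind transparently.
        broker.register_service("beamformer", "udp://node-7:6001");
        if (auto ep = broker.discover("beamformer"))
            std::cout << "re-bound to " << *ep << '\n';
        return 0;
    }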

Because the 6 data exchange patterns limited the number of distinct inter-process communication behaviours, once these patterns had been extensively tested there was great confidence in deploying each pattern many times throughout the system. System reliability, and in particular network bandwidth monitoring, was enhanced in a number of different ways:
• By limiting the scope of some of the services, which ensured that the number of clients that could subscribe to a service was manageable;
• By allowing the technology of the service to be changed, e.g. from TCP to UDP, without any application change.

Through the adoption of SOA paradigms, realised by modularity and scalable interfaces, increased reliability of the overall system can be achieved without the need for complex fault-tolerant or redundant architectures and technologies. Failures in the system are easily detected, either through continuous monitoring of the platform itself or through the termination of the service connectivity. Simple system management facilities can be provided to restart the modules impacted by the failure, either in place or in alternative locations. The dynamic nature of the system ensures reconnection and thus continuation of the capability. Of course, this approach is highly appropriate for the essentially real-time nature of the Sonar system and cannot today be considered a global solution to address safety criteria such as those identified for safety systems in the aircraft or rail domains and standards such as ARINC (DAL) / CENELEC (SIL).
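A hypothetical sketch of such a management facility follows (illustrative only; the class, the heartbeat scheme and the node names are assumptions, not details from the paper): modules report heartbeats, a stale heartbeat or dropped service connection triggers a restart in place, and if the hosting node itself is lost the module can be relocated.

    #include <chrono>
    #include <iostream>
    #include <map>
    #include <string>

    // Illustrative management facility: if a module's heartbeat is stale,
    // restart it in place; if the hosting node is unreachable, relocate it.
    struct ModuleStatus {
        std::string node;
        std::chrono::steady_clock::time_point last_heartbeat;
    };

    class SystemManager {
    public:
        void heartbeat(const std::string& module, const std::string& node) {
            status_[module] = {node, std::chrono::steady_clock::now()};
        }
        void check(std::chrono::seconds timeout, const std::string& spare_node) {
            const auto now = std::chrono::steady_clock::now();
            for (auto& [module, s] : status_) {
                if (now - s.last_heartbeat < timeout) continue;  // healthy
                // Restart in place first; fall back to an alternative location.
                std::cout << "restarting " << module << " on " << s.node << '\n';
                std::cout << "  (or relocating to " << spare_node << ")\n";
                s.last_heartbeat = now;  // assume the restart succeeded (sketch)
            }
        }
    private:
        std::map<std::string, ModuleStatus> status_;
    };

    int main() {
        SystemManager mgr;
        mgr.heartbeat("flank-array-processing", "node-2");
        mgr.check(std::chrono::seconds(0), "node-9");  // force the timeout path
        return 0;
    }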

4 LESSONS LEARNED

The migration to an open architecture has delivered the benefits required by the customer. A modular system provides great flexibility, but it is clear that the supporting infrastructure for each module is a key component. The reliability of the system far exceeded expectations. This would appear to be due to a number of aspects and decisions, specifically:
• The architectural definition, through the use of a service-orientated architecture and the choice of distributed discovery mechanisms;
• The selection of COTS components, in particular by standardising on a small number of components and technologies;
• The development approaches, in particular the use of model driven engineering for all of the module definitions and data exchanges to ensure consistency throughout the system;
• The simplification of data exchanges between modules so that common behaviour patterns can be de-risked early in the development life cycle.

The benefits of the new SoS architecture now provide a more reliable and dependable solution and allow for a change in COTS technology to be achieved in far faster timescales. This means that as COTS technology becomes more reliable, complex systems are now able to reap the benefit from this improvement as part of the continual evolution of the system.

5 REFERENCES

[1] Thomas, M., "How Closed is Open", presented at the Maritime Systems and Technology Conference, 2008.

6 BIOGRAPHY

Anthony Harrison is a Software Architect at Thales with over 25 years' experience in the IT/defence industry, having held key technical positions in a variety of major projects. He has a BSc in Computer Science and Mathematics from Manchester University, an MSc in Mathematical Modelling from Manchester Metropolitan University, and is a Member of the British Computer Society with Chartered Engineer and Chartered IT Professional certifications.


The Dependability Case: Delivering Value for Money Programmes

Jo Hursell
BMT Reliability Consultants Ltd, 12 Little Park Farm Road, Fareham, Hampshire.

ABSTRACT

In accordance with BS5760-18 [1], the Dependability Case is "A reasoned, auditable argument created to support the contention that a defined system will satisfy the dependability requirements". The Dependability Case is initiated by the identification of risks that the defined system will not meet its requirements. These risks are then analysed, and appropriate controls and mitigations are identified to form a programme of activities that is then implemented. The outputs from the planned activities provide progressive assurance that the risks are being managed and therefore that the requirements are being, or will be, satisfied. The Dependability Case approach therefore designs a programme of activities that focuses resources on managing the dependability risks. As a result, the Dependability Case not only provides progressive assurance but also determines a programme of activities that offers value for money in meeting the dependability requirements. This paper provides a definition of 'dependability' and introduces the Dependability Case. It discusses what can go wrong if risks are not managed, based on research conducted on a number of UK defence vehicle reliability growth programmes. It then provides a simple process for developing a value-for-money programme through the identification of the control and mitigation measures that should be in place to manage the dependability risks.

INTRODUCTION

In accordance with IEC 60050-191 [3], dependability is the "ability of an item to perform as and when required". An item can be a component, device, functional unit, equipment, subsystem, or system. The dependability characteristics of an item are reliability, maintainability, recoverability and maintenance support (including management of obsolescence). The dependability of an item contributes to a number of performance measures including availability, safety, serviceability, cost, survivability, and capability. Dependability is not itself a quantifiable term; therefore, where this paper refers to dependability requirements, it means the requirements for the dependability characteristics and performance measures, e.g. reliability and availability. The purpose of the Dependability Case, as defined in BS5760-18 [1], is to provide "A reasoned, auditable argument created to support the contention that a defined system will satisfy the dependability requirement". Starting with the initial statement of requirement, the Dependability Case includes identified perceived and actual risks, strategies for the management, control and mitigation of these risks, and the associated evidence. The Dependability Case approach designs a programme of activities that invests resources where they are required to manage the dependability risks; as such, it aims to obtain maximum benefit from the resources available in order to meet the dependability requirements. The approach records the objective of each activity and, upon completion, a review is conducted to determine whether the activity has achieved its objective or whether further activities are required to mitigate the dependability risks.

WHAT CAN GO WRONG IF RISKS ARE NOT MANAGED

In order to provide some practical insight into the possible outcomes when risks are not adequately managed, four examples are presented relating to UK defence vehicle programmes.
Failing to manage the reliability risks resulted in increased costs to the programmes, which had to fund further reliability growth testing in order for the vehicles to achieve the required reliability performance.

Programme 1: The vehicle reliability at the start of the Reliability Growth Trial (RGT) was much lower than indicated by the sub-contractor trials. The shortfall arose because the R&M programme had been planned against the incorrect assumption that 'the reliability was acceptable from the beginning'. This assumption was made because the prime contractor was only required to integrate the vehicle, and the sub-contractors had stated, without exception, that the subsystems were already compliant with the apportioned requirement based on in-service data. This programme failed to manage the risk that 'the starting point for reliability at the commencement of the RGT was less than that indicated by the sub-contractor trials'.

Programme 2: There was a drop in reliability between the completion of the accelerated RGT and the Reliability Qualification Test (RQT). This reduction was believed to be the result of fixes carried out between the RGT and RQT that were hurried through, without any testing to determine whether each fix had designed out the failure mode prior to the RQT. This programme failed to manage the risk that 'fixes identified during the RGT were not successfully implemented prior to the RQT'.

Programme 3: The vehicle did not meet its reliability targets due to the inadequacy of the prototype RGT. The prototype vehicles did not include all the subsystems, either at the start of the RGT or for the entire RGT, and not all modifications were implemented on the prototype vehicles. The prototype vehicles were run against a modified Battlefield Mission (BFM) and, when the programme timescales were not met, the RGT was truncated. This programme failed to manage the risk that 'the RGT was overoptimistic or ineffective at meeting its reliability growth targets'.

Programme 4: The RGT was not carried out as planned as a result of a delay in the production of the prototype and a series of failures to the prototype vehicle that took several weeks to fix. This programme failed to manage the risk that 'the prototype vehicle(s) were not available when required'.

HOW TO CONSTRUCT A DEPENDABILITY CASE

A programme of activities is developed through the identification of the control and mitigation measures that should be in place to manage the dependability risks. The completeness of the programme is then assessed through a review of the dependability assurance argument. There are six sequential steps to the construction of a Dependability Case (a structural sketch of these steps follows the list):

1. Identification of the risks that would result in the system failing to meet its dependability requirements;
2. Analysis of the risks to determine the significance (and priority) of the inherent risk;
3. Evaluation of the risks to determine whether each risk is acceptable or whether action is required to address/modify it;
4. Identification of risk mitigation activities;
5. Construction of the argument that demonstrates the dependability requirements will be/have been met, by showing that the identified risks are mitigated to an acceptable level and that the acceptance criteria for the evidence are defined;
6. Review of the dependability assurance argument for completeness. If the argument is incomplete, further risk management activities should be conducted to complete the argument.
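The following is a minimal structural sketch, in Python, of how these six steps fit together as an iterative loop. It is illustrative only: the helper functions (analyse, acceptable, plan_mitigation, review_argument) are trivial placeholders, not anything prescribed by BS5760-18 or by this paper.

```python
# Structural sketch of the six construction steps; all helpers are
# illustrative placeholders so that the control flow can be read end to end.
def analyse(risk: str) -> str:
    """Step 2: determine the significance of the inherent risk (placeholder)."""
    return "high"


def acceptable(significance: str) -> bool:
    """Step 3: evaluate the risk against the tolerable area (placeholder)."""
    return significance == "low"


def plan_mitigation(risk: str, plan: list) -> None:
    """Step 4: identify a mitigation activity for the risk (placeholder)."""
    plan.append(f"mitigation activity for: {risk}")


def review_argument(risks: list, plan: list) -> bool:
    """Step 6: the argument is complete once every risk has a mitigation."""
    return len(plan) >= len(risks)


def construct_dependability_case(requirements: list) -> list:
    # Step 1: identify a risk for each dependability requirement.
    risks = [f"the system fails to meet its {r} requirement" for r in requirements]
    plan: list = []
    while True:
        for risk in risks:
            if not acceptable(analyse(risk)):      # steps 2 and 3
                plan_mitigation(risk, plan)        # step 4
        # Step 5 (constructing the assurance argument itself) is elided here;
        # step 6 reviews completeness and iterates until nothing is missing.
        if review_argument(risks, plan):
            return plan


if __name__ == "__main__":
    print(construct_dependability_case(["reliability", "availability"]))
```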

Figure 1: Dependability Case Construction Process

Once the argument is complete, the dependability plan is implemented, which will include the iteration/evolution of the Dependability Case to keep track of newly-identified risks.

Risk Identification

The dependability characteristics required (and therefore the dependability risks) for a particular item will depend upon its application and performance requirements. A good starting place is to create a risk for each dependability requirement, e.g. if there is a reliability requirement, then create the following risk: "There is a risk that the item will fail to meet its reliability requirement". A risk should as a minimum be defined by the following: Risk Title, Risk Description, Effects and Causes (an example is shown in Table 1, and a minimal data-structure sketch follows the table).

Risk Title: Reliability
Risk Description: There is a risk that the system will fail to meet its reliability requirement
Effects: System fails to be accepted into service; system requires re-design
Causes: Failure to develop a robust and binding contract with sufficient incentives; reliability analysis fails to influence the design; prototype reliability growth trial is overoptimistic or ineffective at meeting reliability growth targets (expanded further as a risk); reliability programme plan in terms of schedule and resources is not coherent with the rest of the programme; etc.

Table 1: Example of a Risk
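As a minimal illustration of how a risk record of this shape might be held, and of how a cause with insufficient detail is expanded into a risk in its own right (discussed next), the sketch below uses a simple Python data class. The field and class names are assumptions made for the sketch, not a schema defined by the paper.

```python
# Illustrative sketch only: a risk record mirroring the Table 1 fields, with
# a helper that expands an under-specified cause into a child risk.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Risk:
    title: str
    description: str
    effects: List[str]
    causes: List[str]
    expanded_risks: List["Risk"] = field(default_factory=list)

    def expand_cause(self, cause: str) -> "Risk":
        """Create a new risk for a cause that has insufficient detail."""
        child = Risk(title=cause,
                     description=f"There is a risk that {cause}",
                     effects=list(self.effects),
                     causes=[])
        self.expanded_risks.append(child)
        return child


reliability = Risk(
    title="Reliability",
    description="There is a risk that the system will fail to meet its "
                "reliability requirement",
    effects=["System fails to be accepted into service",
             "System requires re-design"],
    causes=["Reliability analysis fails to influence the design",
            "the prototype RGT is overoptimistic or ineffective at meeting "
            "its reliability growth targets"],
)
# The RGT cause lacks the detail needed for analysis, so it is expanded into
# a risk in its own right, as the table notes.
rgt_risk = reliability.expand_cause(reliability.causes[1])
```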

If a risk cannot be analysed because one or more of its causes have insufficient detail, a new risk should be created for each cause which requires expansion (an example is provided at cause 4 in Table 1). This process should be continued until it is possible to analyse all the risks.

Risk Analysis

The objective of risk analysis is to determine the significance (and priority) of the risks. This is determined by: considering the probability of the root cause occurring; identifying and quantifying the consequence of the risk in terms of performance, time and cost; and calculating the risk level using a Probability Impact Diagram (PID)/Risk Acceptance Matrix. A PID is a graphical method of assessment which evaluates the level of risk as the product of the probability of the risk and the impact if the risk is realised. The level of risk is represented by coloured areas, for example green signifies a low risk. The PID is usually defined by the project risk manager in accordance with the Project Risk Management Plan, covering not only dependability but all key elements of the programme. If a risk management tool is used, the criteria will have been defined within the tool.
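The PID lookup itself is simple enough to sketch in code. The band names, matrix entries and colour thresholds below are illustrative assumptions only; a real project would take them from its Project Risk Management Plan or risk management tool.

```python
# Illustrative sketch only: a PID/Risk Acceptance Matrix realised as a lookup
# from probability and impact bands to a risk level (colour).
PROBABILITY_BANDS = ["very low", "low", "medium", "high", "very high"]
IMPACT_BANDS = ["negligible", "minor", "moderate", "major", "severe"]

# Rows are probability bands, columns are impact bands; the colours and their
# placement are invented for the sketch and would come from the project PID.
PID = [
    ["green",  "green",  "green",  "yellow", "yellow"],
    ["green",  "green",  "yellow", "yellow", "red"],
    ["green",  "yellow", "yellow", "red",    "red"],
    ["yellow", "yellow", "red",    "red",    "red"],
    ["yellow", "red",    "red",    "red",    "red"],
]


def risk_level(probability: str, impact: str) -> str:
    """Read the risk level off the PID as the product of probability and impact."""
    return PID[PROBABILITY_BANDS.index(probability)][IMPACT_BANDS.index(impact)]


# For the reliability risk of Table 2: probability assessed as high and the
# impact severe (the system would not enter service), which falls in the red,
# intolerable area of this illustrative matrix.
print(risk_level("high", "severe"))   # -> red
```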

Figure 2: PID/Risk Acceptance Matrix

A tolerable area is identified on the PID, which defines the level of exposure that is acceptable to the project. This area is used to assess the residual risk in order to decide whether further mitigation activities are required. Table 2 provides an example of risk analysis conducted for a reliability risk.

Probability Description: The probability has been assessed to be High based on the difficulties encountered by similar projects, as it is highly likely that the reliability requirement will not be met if mitigation activities are not implemented.

Impact Description: The system will not enter service, in order to carry out re-design and testing activities, until the reliability requirement is met.

Point of Impact: This risk will impact at the acceptance into service date.

Trigger Point(s): The risk will trigger during the pre-production verification tests and production reliability acceptance tests. (Note: the point at which it is known that, if the planned programme is followed, the risk will impact.)

Inherent Risk Probability: 80%

Inherent Risk Impact: Cost: Very High (2 year development costs + 10% in Unit Price Cost); Time: Very High (2 years); Performance: Nil (system does not enter service, therefore no impact on performance).

Risk Level: High

Table 2: Example of Risk Analysis

Risk Evaluation and Treatment

Where risks fall outside the tolerable area (the green zone in Figure 2), it is important to determine what action is required to address/modify the risk. If the risk is unacceptable, targets should be defined for the residual risk (an example is shown in Table 3).

Residual Risk Target Probability
Residual Risk Target Impact

Significance Target for Residual Risk