2014 IEEE International Conference on Systems, Man, and Cybernetics October 5-8, 2014, San Diego, CA, USA
A Flexible Contracts Approach to System Resiliency
Michael Sievers,1 Senior Member, IEEE
Azad M. Madni, Life Fellow, IEEE
Jet Propulsion Laboratory California Institute of Technology Pasadena, CA, USA
[email protected]
Viterbi School of Engineering University of Southern California Los Angeles, CA, USA
[email protected]
Abstract—Contract-based design (CBD) employs formalisms that explicitly define system requirements, constraints, and interfaces. This paper explores a contract-based design paradigm for expressing system resiliency features. Specifically, resilience formalisms are defined in terms of invariant and flexible assertions. A flexible assertion is one that is learned during system operation and can accommodate unpredicted system behaviors. Invariant assertions are fixed system constraints that are known a priori. A general model structure comprising four key features that contribute to system resilience is presented. In particular, the concept of flexible contracts is operationalized using the Hidden Markov Model (HMM) construct. A system architecture based on flexible contracts and lightweight error monitoring and resiliency response mechanisms is also presented. The proposed framework can serve as a testbed to experiment with different system resiliency approaches.
Keywords—MBSE, system resiliency, contract-based design
I. INTRODUCTION
Resiliency is a system property that allows a system to continue to provide useful service in the face of disruptive events [1]. There are three general categories of disruption: external disruption – caused by factors outside the control of the system, such as a natural disaster; systemic disruption – a service interruption due to an internal fault; and human agent-triggered disruption – the result of human error or misuse of the system [1]. A system failure occurs when the disruption prevents the system from providing the services it was designed to deliver.

Disruptive events can be predictable (e.g., known from experience or through reliability analysis) or unpredictable (e.g., resulting from unanticipated usage or faults). Predictable faults are typically mitigated by a combination of fault avoidance (e.g., better parts, more testing, design margins) and fault tolerance (e.g., restoration of service after occurrence of a fault). In general, fault-tolerance strategies are applied to faults that can be localized to specific hardware or software.

Mitigating unpredictable disruptions is a more daunting problem because it means dealing with "unknown unknowns." Ad-hoc application of a so-called safety-net strategy is a typical means for managing the unpredictable [2]. Safety-net functions are typically implemented by monitoring the correctness of system-level behaviors or results. By their very nature, safety-net functions do not isolate a fault to a specific hardware or software component; rather, they isolate
a fault to a high-level function. When a safety-net fault is detected, all entities associated with the affected function are considered suspect, and typically a system-level recovery is employed.

Several researchers have explored ways to formally and precisely define system behaviors with the goal of improving safety-net coverage. Some of the better known approaches are: computational tree logic (CTL) [3] [4]; linear temporal logic (LTL) [5] [6] [7]; Hidden Markov Model (HMM) analysis [8] [9]; and contract-based design (CBD) [10] [11] [12]. CTL, LTL, and CBD create formal, checkable system representations that are useful in defining agents and other error detection mechanisms. HMMs create black-box models that are trained to recognize nominal behavior and flag unusual behavior.

In this paper, we propose an approach that combines the benefits of up-front formal descriptions with self-adapting interfaces that continually learn and monitor behavior. The up-front descriptions are defined by a set of invariant assertions and preconditions that fully define the domain of legal inputs. The post-condition is either correct for those inputs, or an error is declared. Accommodating self-adaptation necessitates a new concept that allows for incomplete specification of legal inputs and a flexible definition of post-condition correctness.

II. CONTRACT-BASED DESIGN
Various researchers have explored formalisms for representing and applying contract-based design (CBD) [13] [14]. At its core, CBD comprises explicitly defined system requirements, constraints, and interfaces. A contract C is defined by a pair of assertions, C = (A, G), in which A is an assumption made on the environment and G is the guarantee a system makes if the assumption is met. Assumptions are system invariants and pre-conditions, while guarantees are system post-conditions. A contract is refined when its assumption is weakened and its guarantee strengthened with respect to the assumption.
Contract assertions may also be added to a system (or a system component) as implementation progresses. Contract addition is analogous to incorporating new requirements in a classical design process. An implementation, M, satisfies a contract if it satisfies the guarantee when the assumptions hold. More formally [15]:

M ∩ A ⊆ G    (1)
This work was done as a private venture and not in the author's capacity as an employee of the Jet Propulsion Laboratory, California Institute of Technology
978-1-4799-3840-7/14/$31.00 ©2014 IEEE
1002
and the notation M |= C is used when M satisfies C.
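The satisfaction relation in (1) can be made concrete by viewing A, G, and M as predicates over system states, with M |= C holding when every state M admits that also meets the assumption lies inside the guarantee. This is only an illustrative sketch over a finite state space; the function and state names are not from the paper.

```python
# Minimal sketch of contract satisfaction, M |= C = (A, G):
# every state accepted by M that meets assumption A must also
# meet guarantee G, i.e., M ∩ A ⊆ G.
def satisfies(M, A, G, state_space):
    """Check M |= (A, G) by enumerating a finite state space."""
    return all(G(s) for s in state_space if M(s) and A(s))

# Toy example over integer "states": M produces values below 10,
# A assumes non-negative values, G guarantees values below 100.
states = range(-50, 50)
M = lambda s: 0 <= s < 10
A = lambda s: s >= 0
G = lambda s: s < 100
print(satisfies(M, A, G, states))  # True: every M-state meeting A is in G
```

Note that states violating the assumption are ignored entirely; a contract says nothing about behavior outside its assumed environment.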
Figure 1: System decomposition into subsystems
Contracts may be combined using parallel composition [15]. If C1 = (A1, G1) and C2 = (A2, G2), then the parallel composition, C = (A, G) = C1 ∥ C2, must satisfy the guarantees of both contracts, implying an intersection of the guarantees. As noted in [14], a simple conjunction of the assumptions is not correct because the composition of C1 and C2 may satisfy A1 by forming part of the environment for C1. Consequently, G2 may relax A1 and G1 may relax A2, which complicates the definition of the assumptions' composition. The combined assumption and guarantee are defined as:

A = (A1 ∩ A2) ∪ (G1 ∩ G2)′
G = G1 ∩ G2    (2)

Parallel composition enables decomposing a system hierarchically, as shown in Figure 1. The decomposition preserves the input and output ports at each level of decomposition. In Figure 1, the guarantees on Out-1 and Out-2 depend on the guarantees associated with subsystem-2 (Sub-2) and are subject to In-1 assumptions. Sub-2 guarantees are subject to the assumptions placed on In-2. For example, in Figure 1, Sensor-1 could be read at a relatively high rate and Sensor-2 at a slower rate. Sub-2 reads Sensor-2, performs unit or coordinate transformations, and sends the result to Sub-1, which propagates Sensor-2 data between reads and updates the Actuator to align the sensor reads with the expected readings. If C1 defines the contract applicable to Sub-1 and C2 is the contract applicable to Sub-2, then C1 depends on C2.

A contract for the system depicted in Figure 1 might look like:

a1: If (|Sensor-1 – E(Sensor-1)| < Normal_Delta && No_Faults)
g1: |Actuator_Command(t) – Actuator_Command(t-1)| < Normal_Command

where t is time, Actuator_Command is the command applied to the actuator, Normal_Command is the maximum normal actuator command, and Normal_Delta is the expected normal difference between the expected reading of Sensor-1 and the current value of Sensor-1. System behavior is affected by sensor noise, actuator irregularities, and faults affecting the computations, the interfaces within and external to the system, and the way in which the system is used.

A relatively simple set of assertions is possible for the system model in Figure 1. This set covers correctness of the sensors, transformations of data, and control of the actuator. From a resiliency perspective, disruptions may be tolerated or masked in any number of classical ways [16]. In fact, one could create contracts that specify goal-based error detection and fault response, e.g.:

a2: If (|Sensor-1 – E(Sensor-1)| >= Normal_Delta)
g2: Actuator_Command(t) = Actuator_Command(t-1)

Contract assertions may be applied locally to components within a system, or to the system as a whole. Local assertions focus on the input/output/behavior characteristics of the component, while global contracts apply to assertions on combinations of component contracts. Generally, local contracts are more narrowly focused than global contracts and, consequently, cannot be relied on solely to ascertain system health. Conversely, global assertions apply to higher-level capabilities and may not detect subtle but important differences in system health.

Factors that complicate the detection and management of disruptive behaviors include: system complexity; modal, state, and time variations; hidden states (states that can neither be observed nor controlled); human activity; incomplete understanding of contract dependencies; and poorly understood environmental assumptions. Moreover, adding new contracts or performing decomposition during design can potentially introduce unexpected side-effects. By definition, though, resilient systems are expected to operate within performance bounds, degrade gracefully, or fail safely in spite of the prevailing unpredictability.

III. INVARIANT AND FLEXIBLE CONTRACTS
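The paired contracts (a1, g1) and (a2, g2) can be sketched as a single command-filtering function: when the sensor reading is near its expected value, the actuator command may change only within the normal bound; otherwise the previous command is held. The threshold values here are illustrative placeholders, not from the paper.

```python
# Sketch of the goal-based contracts (a1, g1) and (a2, g2) from the
# text: a1/g1 bound the per-step actuator command change when the
# sensor deviation is normal; a2/g2 freeze the command otherwise.
NORMAL_DELTA = 5.0      # assumed normal sensor deviation bound
NORMAL_COMMAND = 2.0    # assumed maximum normal command change

def next_command(sensor, expected, prev_cmd, new_cmd, no_faults=True):
    """Apply contracts (a1, g1) and (a2, g2) to an actuator command."""
    if abs(sensor - expected) < NORMAL_DELTA and no_faults:
        # a1 holds: enforce g1 by clamping the command change
        delta = max(-NORMAL_COMMAND, min(NORMAL_COMMAND, new_cmd - prev_cmd))
        return prev_cmd + delta
    # a2 holds: g2 holds the actuator at its previous command
    return prev_cmd

print(next_command(10.2, 10.0, 1.0, 1.5))  # 1.5: sensor normal, change allowed
print(next_command(20.0, 10.0, 1.0, 1.5))  # 1.0: a2 triggers, command held
```

Clamping rather than rejecting the new command is one design choice; a stricter reading of g1 would instead flag any out-of-bound command as a contract violation.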
We define an invariant contract as one that must always be satisfied when its assumptions are true. A flexible contract is one in which assumptions and guarantees may not be known upfront, or may not be predictable with time or usage. A flexible contract may include invariant assertions.

As noted earlier, contract flexibility is needed for modeling unknown unknowns. However, there are other very practical aspects of resiliency related to error monitoring and system resiliency response that require flexibility. In fact, flexibility is a prerequisite to resiliency [1] [16].

Error monitors are functions that look for violations of system assertions and trigger resiliency responses. These responses can take a variety of forms, including some that are potentially risky, or that involve reconfiguring some aspects of the system.

It is important to recognize that errors might result from transient disruptions or permanent fault conditions. For this reason, error monitors typically comprise two parts: one that looks for assertion violations and the other that evaluates the persistence of the violations. A persistent violation is a more serious problem than a transient violation and needs a different response. Distinguishing between transient and permanent
disruptions depends on system use, the environment, and other contextual factors and unknowns. Additionally, unknown unknowns may trigger an error monitor because the system may be operating in an unplanned state. Even though that state was not anticipated, the system could still be capable of providing the right services correctly and, therefore, does not require a resiliency response. Resiliency responses to disruptions vary with system usage, availability of redundant resources, risk posture, seriousness of the condition, and other such considerations.

A further complication pertains to global versus local resiliency contracts. Figure 2 shows a contract hierarchy. Specifically, it shows a system contract as the parent of the Sub-1 and Sub-2 contracts. Similarly, Sub-1.1 and Sub-1.2 are the children of Sub-1. The parallel composition equation presented earlier is used to incorporate the child subcontracts into the parent contract, with the implication that parent assertions are at least partially defined by the child assertions. A parent component may also have assertions that cannot be directly tied to component assertions and, moreover, that may conflict with child assertions. For example, in Figure 1, Sub-1 and Sub-2 will check contracts related to the system sensors and the controls sent to the actuator. However, a human operator or an external system may notice that the system is not behaving as expected by observing a trend not known to either Sub-1 or Sub-2. A conflict may also arise when a local resilience contract drives an action that is counter to the action that a global contract requires. Similarly, the global contract could cause violations of local contracts and trigger local fault monitors. Typically, conflicting actions are managed by priorities, while local contract violations caused by global assertions are dealt with by temporarily suspending local assertions.
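The two-part error monitor described above can be sketched as a small class: one part checks the assertion, and the other tracks consecutive violations so that transient and persistent conditions draw different responses. The persistence threshold is an illustrative choice, not from the paper.

```python
# Sketch of a two-part error monitor: an assertion checker plus a
# persistence evaluator. Transient violations reset once the
# assertion holds again; repeated violations escalate to persistent.
class ErrorMonitor:
    def __init__(self, assertion, persist_threshold=3):
        self.assertion = assertion          # callable: state -> bool
        self.threshold = persist_threshold  # consecutive violations
        self.count = 0

    def check(self, state):
        """Return 'ok', 'transient', or 'persistent' for this sample."""
        if self.assertion(state):
            self.count = 0
            return "ok"
        self.count += 1
        return "persistent" if self.count >= self.threshold else "transient"

mon = ErrorMonitor(lambda s: s < 10)
print([mon.check(s) for s in [1, 12, 13, 14, 2]])
# ['ok', 'transient', 'transient', 'persistent', 'ok']
```

A real monitor might use a time window rather than a consecutive count, so that intermittent faults separated by good samples still accumulate toward a persistent verdict.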
Figure 2: Contract Hierarchy
A. General Model Structure
Against the foregoing backdrop, we identify four key features that are needed for system resiliency:
1. Invariant and flexible contracts
2. Contract priority
3. Contract suspension
4. Resiliency response contracts
Specifically, we propose a concept in which a system is initially defined by local component contracts and global contracts on assemblies of components. These contracts are associated with the ports defined by the system structure. Semantically, we use SysML blocks and full ports in our model. A block is a general systems construct used to represent structures and behaviors. A full port handles incoming and outgoing items, performs operations, and is typed by a block.

Figure 3 shows our port contract stereotype and a generalized PortContract. PortContract comprises ContractState (Suspended, Active), ContractType (Invariant, Flexible), ContractPriority, Valid (is the contract guarantee trusted, or is it suspect), AssumptionMethod, and GuaranteeMethod. Specializations of PortContract may redefine (Contract 1) or inherit (Contract 2) the parent methods. Port contracts are constructed hierarchically in correspondence with block decomposition, as shown in Figure 4, and are defined as invariant or flexible. We explicitly designate the port contract as invariant or flexible to enable consistency checking. For example, an inconsistency occurs if an invariant input port is associated with a flexible output port.

Resiliency-related contracts serve five purposes:
1. establish consistent states and data needed for some recovery strategies,
2. monitor behavior and signal error conditions,
3. determine whether actions are needed immediately when an error is detected,
4. implement priority-based responses when multiple errors are detected or multiple contracts are violated, and
5. trigger execution of recovery processes.

System resiliency responses are conditioned by associations implemented as messages or signals that alter ContractState. Those associations may include state transition activities, for example, that configure safe operation before executing resiliency responses.

Figure 3: Contract Stereotype & Port
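The PortContract structure described above can be sketched in code, with Python enums and a dataclass standing in for the SysML stereotype. The field names follow the text; the check logic and example values are illustrative assumptions.

```python
# Sketch of the PortContract stereotype: state, type, priority,
# validity flag, and assumption/guarantee methods, with a check()
# that treats suspended contracts as vacuously satisfied.
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class ContractState(Enum):
    SUSPENDED = "Suspended"
    ACTIVE = "Active"

class ContractType(Enum):
    INVARIANT = "Invariant"
    FLEXIBLE = "Flexible"

@dataclass
class PortContract:
    contract_type: ContractType
    priority: int
    assumption: Callable[[dict], bool]   # AssumptionMethod
    guarantee: Callable[[dict], bool]    # GuaranteeMethod
    state: ContractState = ContractState.ACTIVE
    valid: bool = True                   # is the guarantee trusted?

    def check(self, sample: dict) -> bool:
        """A suspended contract never flags; otherwise check A -> G."""
        if self.state is ContractState.SUSPENDED:
            return True
        if not self.assumption(sample):
            return True                  # assumption unmet: vacuously ok
        return self.guarantee(sample)

c = PortContract(ContractType.INVARIANT, priority=1,
                 assumption=lambda s: s["in"] >= 0,
                 guarantee=lambda s: s["out"] < 100)
print(c.check({"in": 5, "out": 42}))   # True
```

Suspension implemented as "always passes" is what lets a global response temporarily silence local contracts, as discussed earlier.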
Figure 4: Hierarchical Port Contracts
Figure 5: State-based Contract
Some care is required in implementing message- or signal-based transactions to avoid race conditions and inconsistent fault opinions. A state transition activity, for example, is one mechanism for protecting against race conditions or inconsistencies by synchronizing events and determining the proper response.

We define assumptions and guarantees using block method properties. Methods are general functions that may be described by state, activity, or sequence diagrams, LTL assertions, mathematical models, constraint models, or any other suitable behavioral or computational representation. State dependencies can be conveniently modeled by associating each state with an assertion method. For example, in Figure 5, state S2 is associated with the behavior, Contract1, defined by the activity diagram.

As noted earlier, invariant assertions are either atomic, or derived, in part, by parallel composition of component models. The flexibility needed for resiliency, however, is a more complicated concept because there needs to be a notion of autonomous learning and configurability. Configurability is relatively straightforward and generally achieved through parameterized assertions. There are three categories of autonomous learning: supervised (a human observer guides the process, providing corrections); unsupervised (no human in the loop); and reward-based (a human provides an assessment of the quality of the learned process).

While any of the three learning methods may be used to capture unknown unknowns, we have explored a supervised method based on Hidden Markov Models (HMMs). HMMs comprise state knowledge known by design and state knowledge learned during system test and operation. We note that methods such as updating a posteriori estimates in Kalman filters [17] may be more amenable to contracts typical in control applications.
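The supervised learning category can be illustrated with a simple count-based update of an HMM emission table, where each human-confirmed (state, observation) label updates counts and probabilities are re-derived with add-one smoothing. The class, names, and smoothing choice are illustrative assumptions, not the paper's method.

```python
from collections import defaultdict

# Sketch of supervised learning of an emission table: labeled
# (state, observation) pairs update counts, and P(obs | state) is
# re-derived with add-one (Laplace) smoothing over known outputs.
class EmissionTable:
    def __init__(self, observations):
        self.observations = observations
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, state, obs):
        """A human-confirmed label (state, obs) updates the counts."""
        self.counts[state][obs] += 1

    def prob(self, state, obs):
        """P(obs | state) with add-one smoothing."""
        total = sum(self.counts[state].values()) + len(self.observations)
        return (self.counts[state][obs] + 1) / total

t = EmissionTable(["O1", "O2", "O3"])
for _ in range(8):
    t.observe("S1", "O1")
t.observe("S1", "O2")
print(round(t.prob("S1", "O1"), 2))  # 0.75: (8 + 1) / (9 + 3)
```

Smoothing keeps unseen observations at a small nonzero probability, which matters later when deciding whether an unexpected output signals a fault or a genuinely new state.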
B. Flexible Contracts
A Hidden Markov Model (HMM) is a Markov model that includes observable and hidden (unobservable) states, state transitions, and emissions. Observable states correspond to invariant contracts, while the hidden states represent flexible contracts. Figure 6 shows an HMM for a system, comprising observable states at the top, hidden states at the bottom, and observed outputs in the middle. The hidden processes in Figure 6 assume that the system is a black box in which only inputs are known and outputs observed. The architecture of the hidden state diagram is unknown but is trained with system use. A trained HMM can be used to monitor disruptive conditions and trigger resiliency responses as needed.

State Sn in the hidden process of Figure 6 is an interesting feature that represents a previously unknown state autonomously added to the hidden model, as described in [9]. Sn may represent a fault condition, or a normal behavior that was not anticipated by system designers. Figure 6 shows that the observable and hidden states share some of the same outputs, but hidden states S1 and Sn produce unexpected outputs. The persistence of these outputs and the perceived system behavior determine whether a response is needed or a new state added. For example, if O6 is observed but all assertions are otherwise satisfied, then Sn can be added to the model. Conversely, if O4 is observed and the system is not meeting its other contracts, then O4 can be associated with a fault condition that requires establishing a state transition trajectory back to a working state [8].
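The decision just described — add a new state when an unexpected output appears while all other contracts hold, otherwise treat it as a fault — can be sketched against a trained emission model. The probabilities, threshold, and state/output names below are illustrative placeholders, not the paper's trained model.

```python
# Sketch of flagging outputs that no known hidden state explains well,
# the situation in which a new state Sn might be added to the HMM.
EMISSIONS = {              # P(output | hidden state) for known states
    "S1": {"O1": 0.6, "O2": 0.4},
    "S2": {"O2": 0.5, "O3": 0.5},
}
NOVELTY_THRESHOLD = 0.05   # assumed cutoff for "unexpected output"

def max_likelihood(obs):
    """Best explanation of obs by any known hidden state."""
    return max(probs.get(obs, 0.0) for probs in EMISSIONS.values())

def classify(obs, other_contracts_ok):
    """Decide among nominal, candidate new state, and fault."""
    if max_likelihood(obs) >= NOVELTY_THRESHOLD:
        return "nominal"
    # Unexpected output: candidate new state if everything else checks
    # out (the O6 case), otherwise a fault condition (the O4 case).
    return "add-state" if other_contracts_ok else "fault"

print(classify("O2", True))   # nominal
print(classify("O6", True))   # add-state
print(classify("O4", False))  # fault
```

In practice the persistence of the unexpected output would also be weighed, as the text notes, before committing a new state to the model.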
Relating HMM states to contracts is rather straightforward. Each state has a set of input and output transitions as well as a set of emissions. Conditions defined by the input transitions represent contract assumptions. The set of emissions and output transitions represents the post-condition guarantees. The implication is that assertions are represented by mathematical constraints derived from the transition and emission probabilities. Training and analysis of the observable and hidden aspects of the model are well known, as are implementation issues and model options [18].

Figure 6: Hidden Markov model with added state, Sn

C. Resiliency Contracts
Resiliency contracts guarantee that a system will continue providing useful services in the face of disruptive events. There are at least eight basic capabilities that may be included in a resiliency contract:
1) Distinguishing between acceptable and unacceptable system behavior. Performed by error monitors, this capability involves searching for global and local contract violations. This may be as simple as hardware check circuits or may require comparing system behavior to the acceptable state space in flexible contracts.
2) Diagnosing the source and seriousness of the unacceptable behavior. The source of a disruption may be inferred from the invalidated contracts, but at higher levels of integration this most likely will not be the case. Seriousness might also be inferred from the invalidated contracts, but at higher levels of integration it can be assumed that the problem is serious enough to require a response because lower-level responses have not corrected the problem.
3) Taking corrective actions only when needed. This capability depends on the accuracy of item #2.
4) Assuring that corrective actions are available when needed. Mechanisms responsible for monitoring and responding need periodic checking. The checks may be done concurrently with normal operation, if system performance, risk, and implementation costs permit. Alternatively, offline testing may be used.
5) Assuring that corrective actions do not worsen the disruptive condition. This capability involves contracts that compare system health before and after a corrective action. This may require operator collaboration.
6) Protecting against permanent damage or degradation. Generally, gross system parameters such as system stability, temperature, and power usage are monitored. As needed, actions are taken that put the system into a safe state/configuration.
7) Checking that any corrective actions taken are successful. This relates to item #6. After a response, the system needs to be checked to verify that acceptable operation is restored, i.e., that the key contract guarantees are valid.

IV. SYSTEM ARCHITECTURE
A trained HMM is well suited for describing flexible contracts and also for developing lightweight error monitoring and resiliency response mechanisms. We note, though, that a practical implementation of resiliency needs operational independence between the components providing the service contracts and the components involved with the resiliency contracts. While this isn't always possible, operational independence improves the likelihood of providing the capabilities described in the previous section.

Figure 7 updates Figure 4 with resiliency blocks that provide the features needed at local and global levels. In Figure 7, the top-level ports are typed by contract blocks (Contract1 and Contract2), which enables assigning assertions and checks to the ports. Additionally, the contracts owned by Sub1 have been augmented with a resiliency contract. Parallel composition combines the earlier contracts with the added resiliency contract. The instantiation of Sub-1 internals shown in Figure 7 shows the resiliency component snooping on the inputs to Sub1 and also capturing and checking the outputs from Sub1.2. We add another port type to the mix – a proxy port. This port is not an actual structural element but represents the ability to look at the internals of the owner block. The proxy ports let the resiliency module look into subcomponents and evaluate their behaviors. Not shown in Figure 7 are the detailed internals of Sub1.1 and Sub1.2. However, these internals will appear similar to the internals shown in Figure 7 for the integrated structure.
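The priority-based response behavior described earlier — act on the highest-priority violation and temporarily suspend lower-priority local contracts so a global response does not trigger spurious local fault monitors — can be sketched as a small dispatcher the resiliency component might run. Names and structure here are illustrative assumptions.

```python
# Sketch of priority-based resiliency response: when several contracts
# are violated at once, execute the highest-priority response and
# report which lower-priority contracts should be suspended.
def respond(violations):
    """violations: list of (priority, name, response_fn); a lower
    number means higher priority. Returns (acted_on, suspended)."""
    if not violations:
        return None, []
    ordered = sorted(violations, key=lambda v: v[0])
    _, name, response = ordered[0]
    response()                               # execute winning response
    suspended = [n for _, n, _ in ordered[1:]]
    return name, suspended

log = []
violations = [
    (2, "local-sensor-check", lambda: log.append("reset sensor")),
    (1, "global-safe-mode", lambda: log.append("enter safe mode")),
]
acted, suspended = respond(violations)
print(acted, suspended)  # global-safe-mode ['local-sensor-check']
print(log)               # ['enter safe mode']
```

A fuller implementation would re-activate the suspended contracts once the winning response completes and its post-action health check passes.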
Figure 7: Resiliency Model

The allocation of functions to hardware and software, as well as the resiliency hierarchy, depend on cost, risk, test complexity, sensitivity of the resiliency contracts, and the impact of adding resiliency on system performance. However, the structural model shown in Figure 7 reflects a classical architecture. What is new is "under the hood" – a combination of strongly defined invariant assertions and flexible contracts that accommodates both predicted and unpredicted disruptions.

CONCLUSIONS AND FUTURE WORK
This paper has presented an approach for extending the definition of a design contract to include resiliency features. It has also described a means to define flexible contracts using Hidden Markov Models (HMMs). The paper has also provided appropriate semantics for incorporating these contracts into system models. Additional concerns that will be addressed in future work include the development of overall resiliency semantics and semantically compatible architectures. Issues of affordable adaptability and effectiveness [19], which are key elements of resilient systems, will also be addressed in follow-on research.

REFERENCES
[1] Madni, A.M., and Jackson, S., "Towards a Conceptual Framework for Resilience Engineering," IEEE Systems Journal, Special Issue on Resilience Engineering, vol. 3, no. 2, pp. 181–191, 2009.
[2] Fault Management Handbook, NASA-HDBK-1002, April 2012.
[3] McCabe-Dansted, J., et al., "On the Expressivity of RoCTL," 16th International Symposium on Temporal Representation and Reasoning, 2009.
[4] McCabe-Dansted, J., and Dixon, C., "CTL-Like Fragments of a Temporal Logic of Robustness," 17th International Symposium on Temporal Representation and Reasoning, 2010.
[5] Pokorny, L.R., and Ramakrishnan, C.R., "Modeling and Verification of Distributed Autonomous Agents Using Logic Programming," Declarative Agent Languages and Technologies II, Springer Berlin Heidelberg, 2005, pp. 148–165.
[6] Baltrop, K.J., and Pingree, P.J., "Model Checking Investigations for Fault Protection System Validation," 2003 International Conference on Space Mission Challenges for Information Technology, June 2003.
[7] Cimatti, A., Dorigatti, M., and Tonetta, S., "OCRA: A Tool for Checking the Refinement of Temporal Contracts," IEEE/ACM 28th International Conference on Automated Software Engineering (ASE), 2013.
[8] Williams, B.C., Ingham, M.D., Chung, S.H., and Elliott, P.H., "Model-based Programming of Intelligent Embedded Systems and Robotic Space Explorers," Proceedings of the IEEE: Special Issue on Modeling and Design of Embedded Software, vol. 91, no. 1, pp. 212–237, 2003.
[9] Wong, W.C., and Lee, J.H., "Fault Detection in Process Systems Using Hidden Markov Disturbance Models," 8th International Symposium on Dynamics and Control of Process Systems, Cancun, Mexico.
[10] Sangiovanni-Vincentelli, A., Damm, W., and Passerone, R., "Taming Dr. Frankenstein: Contract-Based Design for Cyber-Physical Systems," European Journal of Control, vol. 18, no. 3, pp. 217–238, 2012.
[11] Meyer, B., "Towards More Expressive Contracts," Journal of Object-Oriented Programming, pp. 39–43, 2000.
[12] Le Traon, Y., and Baudry, B., "Design by Contract to Improve Software Vigilance," IEEE Transactions on Software Engineering, vol. 32, no. 8, August 2006.
[13] Cimatti, A., and Tonetta, S., "A Property-Based Proof System for Contract-Based Design," 38th Euromicro Conference on Software Engineering and Advanced Applications, 2012.
[14] Modgil, S., Faci, N., Meneguzzi, F., Oren, N., Miles, S., and Luck, M., "A Framework for Monitoring Agent-Based Normative Systems," Proc. 8th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2009), Decker, Sichman, Sierra, and Castelfranchi (eds.), Budapest, Hungary, May 10–15, 2009.
[15] Benveniste, A., et al., "Multiple Viewpoint Contract-Based Specification and Design," Proceedings of the Software Technology Concertation on Formal Methods for Components and Objects (FMCO'07), Lecture Notes in Computer Science, vol. 5382, Springer, October 2008, pp. 200–225.
[16] Rasmussen, R., "GN&C Fault Protection Fundamentals," AAS 08-31, 31st Annual AAS Guidance and Control Conference, February 1–6, 2008.
[17] Kalman, R.E., "A New Approach to Linear Filtering and Prediction Problems," Transactions of the ASME – Journal of Basic Engineering, March 1960, pp. 35–45.
[18] Rabiner, L., "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, February 1989, pp. 257–286.
[19] Neches, R., and Madni, A.M., "Towards Affordably Adaptable and Effective Systems," Systems Engineering, vol. 15, no. 1, 2012.