Panel Session on
Limits in Dependability
Moderator: Jean-Claude Laprie, LAAS-CNRS, Toulouse, France
Panelists: Gérard Le Lann, INRIA, Rocquencourt, France; Michele Morganti, Italtel, Milano, Italy; John Rushby, SRI International, Menlo Park, California, USA
Scope and Context of the Discussion
Dependable computing systems are specified, developed, operated and maintained according to assumptions which are relative to a) the expected function(s) to be fulfilled or service(s) to be delivered, b) the environment where the computing system is to be operated (load, perturbations from the physical environment, behavior of the operators and maintainers), and c) the faults which are likely to manifest themselves, in terms of their modes and frequencies of action. The achieved dependability depends crucially on the validation of a) the actual system with respect to these assumptions, b) the assumptions themselves with respect to reality, and, recursively, c) the assumptions of the validation itself (e.g., the criteria according to which fault removal is conducted, or the distributions according to which fault forecasting is performed). Limiting factors to dependability can thus originate from a variety of sources, due to imperfections either in the assumptions or in the validation.
The topic of Limits in Dependability was the theme of a two-and-a-half-day workshop held during the last IFIP 10.4 meeting (Islamorada, Florida, January 27-31, 1993). The questions raised, the opinions on the currently available approaches, and the ensuing discussions have evidenced the need for this concern about limits in the achievement and assessment of dependability to be shared by a larger audience, hence this panel. During the workshop, the limits of dependability achievement and assessment were discussed according to three viewpoints: Distributed Systems, Formal Specification and Verification, and Software Reliability. The program of the workshop is given in the annex. Three of the panelists present the salient points which came out of the presentations and discussions during the workshop, and emphasize their own viewpoint as well: Gérard Le Lann for Distributed Systems, John Rushby for Formal Specification and Verification, and I for Software Reliability. Michele Morganti provides, with telecommunication services and networks, an illustration of complex computer-based systems where the notion of limit in dependability is especially felt.
Limits in Dependability in the Context of Distributed Systems
Gérard Le Lann
Limits in dependability within the context of distributed systems were investigated from five different viewpoints. A summary of the position statements and ensuing discussions is given below, in the order of the presentations.
I defended the view that ultra-high dependability (e.g., a solution to the 10^-9 problem) is achievable only if on-line algorithms are utilized. Under an on-line approach, control algorithms are devised or selected at design time under zero or minimal advance knowledge of future system usage. Examples of areas where such algorithms are useful are Concurrency Control and Task Scheduling. This is in contrast with off-line (traditional) approaches, where design decisions rest on clairvoyance assumptions. The dependability "proofs" that come with off-line computations are therefore no more "trustable" than their on-line counterparts, as the validity of the former fully depends on the validity of the clairvoyance predictions, that is, on their coverage ratio (see D. Powell's presentation). I argued that: (i) ultra-high dependability implies reliance upon redundant architectures and decentralized control algorithms, which implies solving scheduling problems akin to those raised by "real-time" systems (how to service waiting queues of clients that have timeliness requirements); (ii) assuming full clairvoyance is invalid when considering distributed systems. I used the example of the conventional "hard real-time" view, where external events are looked at periodically, thus easing the problem of how to schedule tasks (i.e., how to prevent timing failures). Obviously, whenever variable delays and distributed computations are considered, the "magic" external periodicity vanishes. Internal events
being inevitably aperiodic, it is easy to demonstrate that off-line approaches are inadequate. It was counter-argued that on-line algorithms are more complicated than off-line computations, and that it is implicitly assumed that these algorithms are not faulty. Also, doubts were raised relative to the possibility of devising "provably correct" algorithms under no assumptions.
The concept of assumption coverage was presented by David Powell, who showed how to construct a failure mode implication graph (a lattice), starting from simple orderings for value errors (none -> non-code -> arbitrary) and timing errors (permanent omission -> bounded number of omissions -> only omissions -> only late errors -> arbitrary). Node X being a predecessor of node Y in the resulting 21-node graph implies that P(Y) is strictly higher than P(X), where P(A) is the coverage of assumption A. The dilemma faced by a designer is to identify the optimal trade-off in the conservatism of the assumptions. Increasing the degree of "conservatism" of the failure assumptions yields increased system complexity, i.e., lower dependability. Conversely, increased "conservatism" yields increased assumption coverage, i.e., higher dependability. Through an example (broadcast over single-writer busses), David showed that such ultra-conservative assumptions as "Byzantine" errors do not necessarily yield higher dependability. Figures of unreliability and unavailability versus assumption coverage were given. Arbitrary behavior turned out to be the major topic of discussion. Merits (e.g., "Byzantine" protocols are simple) and drawbacks (e.g., they need high levels of redundancy -- are they costly or not?) were debated. An agreement was reached that the concept of assumption coverage is indeed useful, provided that coverage can be accurately estimated.
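A minimal sketch, assuming the two orderings above but using the 15-node product of the two chains rather than Powell's actual 21-node graph, illustrates the implication order and the monotonicity of coverage along it:

from itertools import product

VALUE_MODES  = ["none", "non-code", "arbitrary"]                 # value-error chain
TIMING_MODES = ["permanent omission", "bounded omissions",
                "omissions only", "late only", "arbitrary"]      # timing-error chain

# Each node is a pair (value mode index, timing mode index).
NODES = list(product(range(len(VALUE_MODES)), range(len(TIMING_MODES))))

def precedes(x, y):
    """x -> y in the implication graph: y is at least as general as x in both dimensions."""
    return x[0] <= y[0] and x[1] <= y[1] and x != y

def coverage_is_monotone(P):
    """Check that P(Y) >= P(X) whenever X precedes Y (P maps node -> assumed coverage)."""
    return all(P[y] >= P[x] for x in NODES for y in NODES if precedes(x, y))

# Illustrative (invented) coverage figures: the more general the assumption,
# the higher its coverage, at the price of a more complex system.
P = {(v, t): 1 - 0.5 ** (1 + v + t) for v, t in NODES}
print(coverage_is_monotone(P))   # True for this toy assignment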
Dany Dolev gave a presentation on how to exploit the broadcast nature of local area networks in order to build highly available communication services (the Transis transport layer system). The design assumptions, e.g., arbitrary topology, arbitrary delays, network partitioning, no malicious errors, correspond to what is sometimes called an "asynchronous" model. Dany described the algorithms used to implement causal multicast within Transis (based on each node counting its own messages and piggybacking acks and nacks). A unique DAG (of pending or all-acked messages) is maintained by all participating nodes in every partition, although it is "revealed" asynchronously to each of them. Preliminary performance results showed that impressive throughput can be achieved, essentially by exploiting the natural "shared knowledge" property of broadcast nets, which is obtained at low cost (close to 1 message per knowledge update). For this to work correctly, however, the membership problem must be solved. The Transis algorithm is live. There is no contradiction with the Fischer-Lynch-Paterson impossibility result, for the reason that live and correct nodes can be arbitrarily and unjustifiably excluded from a group. The ensuing discussions centered around the probabilistic nature of the Transis protocols (there is no predictable lower bound on the size of the groups/partitions) and on to which extent this relates to "limits to dependability". Dany responded by asking for an exact definition of the concept of dependability.
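A minimal sketch of the idea, assuming a simplified scheme in which each message piggybacks the identifiers of the messages its sender had already delivered (an illustration of causal delivery over such a DAG, not the actual Transis protocols):

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Msg:
    sender: str
    seq: int                  # each node counts its own messages
    acks: frozenset           # ids (sender, seq) of causally preceding messages

@dataclass
class Node:
    name: str
    delivered: set = field(default_factory=set)   # ids delivered so far
    pending: list = field(default_factory=list)   # received but not yet deliverable
    seq: int = 0

    def broadcast(self):
        self.seq += 1
        m = Msg(self.name, self.seq, frozenset(self.delivered))
        self.delivered.add((m.sender, m.seq))      # the sender delivers its own message
        return m

    def receive(self, m):
        self.pending.append(m)
        self._try_deliver()

    def _try_deliver(self):
        progress = True
        while progress:
            progress = False
            for m in list(self.pending):
                if m.acks <= self.delivered:       # all causal predecessors delivered
                    self.delivered.add((m.sender, m.seq))
                    self.pending.remove(m)
                    progress = True

# Usage: b's message causally follows a's, so c delivers them in that order
# even if they arrive the other way around.
a, b, c = Node("a"), Node("b"), Node("c")
m1 = a.broadcast()
b.receive(m1)
m2 = b.broadcast()
c.receive(m2)                 # held as pending: m1 not yet delivered at c
c.receive(m1)                 # now both become deliverable, m1 before m2
print(sorted(c.delivered))

Membership changes and partitions, which the actual system must handle, are deliberately left out of this sketch.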
Kane Kim stressed the need for identifying failure modes that would be more accurate and more realistic than the fail-silent unit (FSU) model or the malicious unit (MaU) model. The claim was made that these models are not appropriate for designing and validating large-scale safety-critical systems. The difficulty of implementing the FSU abstraction increases with the number of I/O channels. Conversely, a system consisting of FSUs only is easy to analyze and validate. The other extreme abstraction, the MaU, has been useful in facilitating theoretical investigation of some limits of fault-tolerant algorithms. However, Kane argued, existing MaU models appear to have fundamental flaws, the major one being that the probability of occurrence of malicious behavior (p) is orders of magnitude smaller than the probability of some non-conservative failure assumption being violated. Hence, the cost induced by "Byzantine" protocols is hardly justifiable. A few examples were given, showing that "conventional" (low-complexity) fault-tolerant algorithms could cope with malicious errors, or that probability p is infinitesimally small (10^-25 was computed for the clock example). The conclusion was that (i) models sitting in between the FSU and the MaU extremes are needed, and (ii) probability theory should be used more extensively. It was counter-argued that work along those recommendations has been carried out for some time by various research teams. Many diverse failure models have been and are being investigated. It was also observed that the existence of model-dependent lower bounds in time/message complexity cannot be ignored.
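As an illustration of how such infinitesimally small figures can arise, a sketch assuming independent per-unit malicious failures with invented numbers (not the figures of the clock example) computes the probability that more than f of n units turn malicious, the only case an f-resilient protocol cannot mask:

from math import comb

def prob_more_than_f_malicious(n, f, q):
    """P[number of malicious units > f], assuming each of n units independently
    turns malicious during the mission with probability q."""
    return sum(comb(n, k) * q**k * (1 - q)**(n - k) for k in range(f + 1, n + 1))

# Toy numbers: 4 units tolerating f = 1, per-unit malicious probability 1e-9.
print(prob_more_than_f_malicious(n=4, f=1, q=1e-9))   # on the order of 6e-18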
Is software dependability limited by our inability to conduct perfect debugging of distributed programs? This issue was addressed by Michel Raynal. Formal methods (a priori proofs) were contrasted with practical methods (a posteriori debugging). Michel stated that debugging is more powerful when the complexity of the distributed computing model goes beyond some threshold. It was acknowledged that looking for possibly erroneous past behavior is not equivalent to demonstrating the impossibility of erroneous future behavior. Nevertheless, it was observed that many programs have been made dependable through extensive
debugging. Also, conceptual tools and algorithms exist whereby the lack of desired properties can be detected for practically all kinds of distributed computations. This was illustrated with a description of how a lattice of causally ordered global states, or consistent cuts, could be used to check predicates. Two recent techniques were reviewed, and their limitations pointed out [semantic limitations, practical (time/space) limitations]. Michel then presented his own approach, which is based on the idea of grouping causally related local predicates into global "atoms". The claim was made that atomic sequences of local predicates facilitate the expression of global predicates; also, such sequences have linear complexity. The ensuing discussion centered on whether it is true that those global properties which cannot be expressed (under the atomic sequence approach) are indeed irrelevant. Another, more general, question was also discussed, namely to which extent high dependability is achieved given that the debugging of particular executions of a distributed program has a coverage (relative to all possible executions) which might be only roughly estimated.
Being given the opportunity to stress my own viewpoint in this panel, I would like to get back to the questions raised by my presentation. There are examples of proofs established under zero knowledge of the future. In the context of distributed transactional systems, the well-known 2-phase locking algorithm has been proved to be safe under zero advance knowledge concerning times of transaction activation, transaction durations, and the number of locks to be acquired by each transaction. When trying to establish the existence of upper bounds on response times, one must choose between devising a solution which is valid only under "substantial" advance knowledge, and devising a solution which is valid under "minimal" advance knowledge. I obviously favor the second approach, which clearly separates system design (a formal exercise) from system dimensioning (an engineering exercise). I also object to the view that on-line algorithms have to be more complex than off-line computation based schemes. A good counter-example is the earliest-deadline-first algorithm, whose complexity is strictly identical to that of (off-line computed) fixed-priority schemes (see the sketch below). Furthermore, for those cases where on-line means higher complexity, what should be assessed is what is gained, in coverage and in performance, in comparison with an off-line approach. The intellectual confusion that has surrounded those on-line versus off-line debates in the past is coming to an end, thanks to such emerging disciplines as Competitive Analysis, which works well for distributed algorithms. After all, it should not be too surprising that Information Theory and Game Theory can help us in devising optimal solutions, for any given finite amount of advance knowledge at design time. From an algorithmic viewpoint, I believe that the best current approach to the problem of achieving quantifiable ultra-high dependability consists in conducting an analysis of how (possibly optimal) competitive ratios and assumption coverage ratios are related, for a selected set of candidate algorithmic solutions and targeted architectures.
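A small sketch, assuming a simple ready-queue model (not taken from the panel material), makes the complexity claim concrete: both policies reduce to popping the task with the smallest key from the same priority queue; only the key differs (absolute deadline versus precomputed priority):

import heapq

def dispatch(ready, key):
    """Pop the name of the next task to run; 'ready' is a list of task dicts."""
    heap = [(key(t), t["name"]) for t in ready]
    heapq.heapify(heap)            # O(n); each pop is O(log n) for either policy
    return heapq.heappop(heap)[1]

tasks = [{"name": "T1", "deadline": 12, "priority": 3},
         {"name": "T2", "deadline":  7, "priority": 2},
         {"name": "T3", "deadline":  9, "priority": 1}]

print(dispatch(tasks, key=lambda t: t["deadline"]))   # earliest deadline first: T2
print(dispatch(tasks, key=lambda t: t["priority"]))   # fixed priority: T3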
Limits in Formal Specification and Verification
John Rushby
certification are then those of evaluating the utility of the properties proved, and of validating the accuracy of the modeling employed.
It should be easier to validate the accuracy of a fairly abstract model that makes a few broad assumptions than that of a detailed model that depends on many intricate details: in fact, it may be easier and more convincing to test sequential code in execution than to validate the highly detailed model of its execution environment that would support its formal verification. It therefore seems to me that, for maximum utility in contributing to assurance at the limits of dependability, formal methods should mainly be applied in a "fully formal" manner to problems where:
- other techniques are absent or ineffective,
- the problems are crucial to overall success, and
- validation of the accuracy and utility of the modeled assumptions and properties is relatively straightforward.
The problems that best meet these criteria are likely to be the hardest parts of the overall design problem, treated early in the life cycle and at a relatively high level of abstraction. Examples include problems of distributed and parallel execution, timing, fault tolerance, and combinations of these. Testing can explore only a tiny fraction of the possible behaviors of such elements of design, and can in any case only be performed relatively late in the development life cycle. By working early in the life cycle, on relatively abstracted representations of the problem, fully formal methods can provide compelling evidence that these crucial elements of design are correct; furthermore, they can provide this evidence early enough to be useful, cheaply enough to be feasible, and on the basis of modeling that is simple enough to be credible. I know of no other method that offers comparable assurance for these elements of design.
Limits in Software Reliability
Jean-Claude Laprie
Software reliability is currently felt to be the bottleneck of computing systems dependability. This is substantiated either by statistical evidence, such as Tandem's field data published in the October 1990 issue of the IEEE Transactions on Reliability, or by large outages such as those of the AT&T network in January 1990 and in June and July 1991. When dealing with software reliability, the well-known problem of the distance between the estimated reliability and the actual reliability is exacerbated, a consequence of which is the current absence of reliability estimations performed for high-dependability, especially safety-critical, software.
Barbara Kolkhorst presented the approach adopted by IBM Federal Systems in Houston for building large aerospace software systems, culminating in the Space Shuttle software with 500,000 lines of on-board software and 1,200,000 lines of ground support software. The approach is dominated by static analyses conducted all along software development, and by a continuous improvement of the software building process based on feedback from the results of the static analyses. She presented statistics in terms of fault density (per thousand lines of source code) which illustrated the high ratio of faults uncovered during static analysis as compared to those found in testing, and the improvements achieved over the years.
Dick Hamlet presented an approach aimed at amplifying the results drawn from testing in estimating reliability. He showed that combining testing and testability measurements using the operational profile can yield an estimate of the probability of correctness. Improvements in the results obtained can be achieved when resorting to systematic testing over subdomains of the input space. The conclusion was that the approach seems promising, although it depends on some open questions such as the dilemma between bounds and confidence, the validation of operational profiles, or the selection of adequate subdomains.
I defended the view that the current approaches to software reliability prediction from observations during validation, based either on failure data or on results of tests without failure, in fact generally provide figures that grossly underestimate the actual reliability which will be exhibited in operation. On the other hand, today's complex software is rarely built from scratch; it rather results from evolutions of previously existing software. There is thus the notion of a software family, a new software system being in fact a new generation of a family. I proposed to take advantage of this notion of family by making use, via a Bayesian approach, of extensive field reliability data relative to previous members of the family. I showed that combining both validation data and past experience data can drastically improve reliability predictions (a small sketch of such a combination is given below). Such an approach is clearly crucially dependent on the relation between the various members of a family of software systems.
Lorenzo Strigini presented joint work with Bev Littlewood from City University, London, UK. In this work, they have gone over various methods for validation with respect to dependability requirements for safety-critical software, namely reliability growth modeling, inference from failure-free operation, and other sources of evidence for validation (past experience, structural modeling, proofs and formal methods, step-wise improvement, combination of different kinds of evidence). The conclusion was that no solution is in sight without radical progress in knowledge, and thus that the requirements on the computers should be reduced by changing the design of the
system or enterprise using it.
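A minimal sketch, assuming a generic conjugate Gamma-Poisson model with invented figures rather than the specific model presented at the workshop, shows how family field data can serve as a prior that validation data then update:

def gamma_poisson_posterior(prior_failures, prior_exposure, new_failures, new_exposure):
    """Gamma(a, b) prior with shape a = prior_failures and rate b = prior_exposure (hours),
    updated by a Poisson count of failures observed during the new member's validation."""
    a_post = prior_failures + new_failures
    b_post = prior_exposure + new_exposure
    return a_post / b_post            # posterior mean failure rate (failures per hour)

# Illustrative numbers: 12 field failures over 2,000,000 h across the family,
# 1 failure in 5,000 h of validation of the new generation.
rate = gamma_poisson_posterior(12, 2_000_000, 1, 5_000)
print(f"posterior mean failure rate: {rate:.2e} per hour")   # about 6.5e-06

How much weight the family prior deserves clearly depends on how closely the new member resembles its predecessors, which is precisely the dependence noted above.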
Let me come back to the overall problem of limits in dependability as stated in the first paragraph of the section
devoted to the scope and context of the discussion, in terms of assumptions and validation. I would like to point out that the vast majority of the work which has been done in the past decades on computer dependability, and thus on the reliance we can place on computer operation, has been dominated by a product view. This is a priori legitimate, as what is of real interest to a user is indeed the dependability of the computer system he is interacting with. However, as any fault in a computer system is ultimately a design fault, pushing forward the limits in dependability can only be achieved by lowering the sources of design faults, via actions on the process. This necessitates accounting for past experience, which is currently largely done implicitly and informally.
Limits in Dependability: the Telecom Case
Michele Morganti
Billions of kilometers of transmission links interconnecting hundreds of millions of terminations and hundreds of thousands of nodes, each node in itself a smaller but complete computer network, whilst other computers control and supervise the operation of each network element, forming an elaborate pattern of overlaid processing levels. Today's global telecommunications network definitely represents one of the most challenging, and possibly the most complex, monument to computing technology ever built. Entirely dependent on computers, critical but not life-critical, the telecommunications network covers such a variety of applications and presents such a variety of requirements that it is the natural battleground on which operators and manufacturers can confront the validity and the effectiveness of their products and of their alternative solutions. Further, the large scale of operation, allowing a large number of replicas to be simultaneously installed, together with an extended operational life (with respect to that of most other application areas) and the continuous monitoring functions, allows for direct observation of otherwise too rare events and eventually contributes to a better understanding and a more objective assessment of the different approaches.
The network continues to evolve. Every day thousands of older network elements are replaced by new ones, more powerful, more flexible, and most likely more complex. Entirely new technologies, such as photonics and ATM, can be accommodated and experimented with in coexistence with much older ones, and the evolution pace, far from slowing down, has now accelerated to the point where the change process itself has become barely manageable.
New software releases are dispatched every year, or every six months in some cases, to offer new features and entirely new classes of services. Combined with the increased complexity of the network architecture and with the more stringent requirements imposed by new services, this acceleration is progressively invalidating many of the assumptions on which the almost legendary dependability of the network was traditionally based, first of all that of fault-free hardware and software designs. Fibers, with their intrinsically higher bandwidth granularity, one hundred to one thousand times that of copper, are leading to a significant reduction of network connectivity, thus requiring explicit redundancy to be added to it. Stored program control, digital switching techniques, the new synchronous digital transmission hierarchy and ATM are all leading to a significant reduction of the number of network nodes, while enormously increasing their size, their physical distribution, their complexity and, in the end, their criticality. The intelligent network architecture and the continuous growth of mobile services are making the network more and more dependent on the availability of very large databases where information concerning customer profiles, status, and instantaneous location is maintained and will eventually have to be accessed on a per-call basis. New regulations, requiring the opening of networks to customer control and to competing or complementary service providers, are raising safety and security challenges never considered before. At the same time, the extension and the complexity of the network, the total unpredictability of its environment (which normally includes, in some part or other, the catastrophic effects of fires, earthquakes and even wars) and of its users, the diversity of the many coexisting solutions, and the amount of software changed daily, make any deterministic approach completely unrealistic and impractical.
Today, no satisfactory solutions seem to exist for many of these dependability problems, at least within the current approach. Yet some recent catastrophic failure histories, lately traced back to some minor fault (a dust particle in a circuit, a fiber rupture, a small fire, etc.), clearly indicate that the situation may soon be beyond control... ...unless new and renewed attention is given to the whole problem of network dependability, a new approach is taken, and entirely new solutions are envisaged and investigated.
Annex: Programme of the Workshop on Limits in Dependability held during the 23rd meeting of IFIP WG 10.4 on Dependable Computing and Fault Tolerance (Islamorada, Florida, January 27-31, 1993)
The workshop was intended to discuss the limits of dependability achievement and assessment with respect to three viewpoints: Distributed Systems, Formal Specification and Verification, and Software Reliability. The presentations were as follows:
Distributed systems:
- Gérard Le Lann (INRIA, Rocquencourt, France), on Clairvoyance Assumptions Seen as a Limiting Factor in Dependability
- David Powell (LAAS-CNRS, Toulouse, France), on Assumption Coverage Limits to Fault Tolerance
- Danny Dolev (Hebrew University of Jerusalem, Israel), on Fault Tolerant Protocols in Broadcast Domains
- Kane Kim (University of California at Irvine, USA), on A Fair Distribution of Concerns in Fault-Tolerant System Design and Analysis
- Michel Raynal (IRISA, Rennes, France), on Is Distributed Debugging Limited?
Formal specification and verification:
- Flaviu Cristian (University of California at San Diego, USA), on Specifying and Verifying Fault-Tolerant Systems: How Far Can One Go in Principle?
- Dan Craigen (Odyssey Research Associates, Ottawa, Canada), on Formal Methods: Current Usage and Limitations
- Michael Melliar-Smith (University of California, Santa Barbara, USA), on Practical Limits to Formal Verification
- John Rushby (SRI International, Menlo Park, USA), on Formal Verification: Instrument of Justification or Tool for Discovery?
Software reliability:
- Dick Hamlet (Portland State University, USA), on Amplifying Software Reliability with Other Techniques
- Barbara Kolkhorst (IBM Federal Systems, Houston, USA), on Experience (or Lessons Learned) from a Mature Software Process
- Jean-Claude Laprie (LAAS-CNRS, Toulouse, France), on A Product-in-a-Process View of Software Reliability Evaluation
- Lorenzo Strigini (CNR, Pisa, Italy), on Validation of Ultra-High Dependability for Software-Based Systems
The workshop was concluded by synthesis reports on the presentations and discussions, by Brian Randell (University of Newcastle upon Tyne, UK) on Distributed Systems, Jack Goldberg (SRI International, Menlo Park, USA) on Formal Specification and Verification, and Yoshi Tohma (Tokyo Institute of Technology, Japan) on Software Reliability.