Needed Foundations for Assuring the Desirable Behavior of Software-Reliant Systems

Linda Northrop, Mark Klein, John Goodenough, Dennis Smith

Software Engineering Inst., 4500 Fifth Avenue, Pittsburgh, PA 15213-2612

+1 412-268-7638, +1 412-268-7615, +1 412-268-6391, +1 412-268-6850


ABSTRACT

Future trends and current limitations presage a need for interdisciplinary foundations for designing, constructing, maintaining, adapting, and rapidly deploying software-reliant systems with assured system capabilities at all scales.

Categories and Subject Descriptors

D.2.4 [Software Engineering]: Software/Program Verification – reliability, validation. F.3.1 [Theory of Computation]: Logics and Meanings of Programs – logics of programs, invariants.

General Terms

Design, Reliability, Theory, Verification.

Keywords

architecture, system of systems, ultra-large-scale systems, cyber-physical systems, software assurance

1. INTRODUCTION

In general, the ability to rapidly develop and field software-reliant capability is deficient. Part of this deficiency can be attributed to business and governance issues. But part of the reason is technical—software systems science and engineering knowledge is inadequate to ensure that software-reliant capabilities can be rapidly fielded while being sufficiently safe, reliable, secure, responsive, and adaptable to change. This deficiency applies across a spectrum of systems that rely on software to be functional (software-reliant systems). Such systems include safety-critical and embedded systems, IT applications, cyber-physical systems, families of similar systems, systems of independent systems, and ultra-large-scale systems. Today, the greatest challenges come when attempting to field or upgrade systems of systems (SoSs). When we say SoSs we don’t mean avionics systems, airplanes, or cars, but rather multi-system configurations in which the constituents are independent. The constituent systems in SoSs are decisionally autonomous and interact independently but need to interoperate to provide local and global capabilities. SoSs can manifest as cyber-physical systems, socio-technical systems, or cyber-physical-social systems. SoSs serve the needs of various users by providing access to data and services created by different organizations. SoSs may include safety-critical elements running in concert with IT applications or enterprise systems, such as health-monitoring devices that communicate with both device-monitoring systems and individual electronic health records. They may include families of similar systems, where interoperability challenges are only exacerbated when such systems are developed and maintained separately with no systematic means to take advantage of their commonality.

Techniques for rapidly fielding and evolving systems of systems are clearly inadequate today. For example, today’s testing and security-certification processes are major barriers to rapid fielding of software-reliant systems because there is inadequate technology to work quickly and still be confident that serious problems will be detected before fielding. Determining what constitutes acceptable behavior has itself become a problem. What does reliability mean in this context? How does one “test” for agility and adaptability? To conduct early and ongoing analyses to ensure acceptable behavior of all software-reliant systems, software and system engineers need appropriate abstractions for each system type and for combinations of system types. Architecture has begun to serve that role. While there is evidence that architecture-centric practices are being adopted, the use of effective architecture practices is still limited. This is evidenced in recent studies by NASA [3] and the NRC [8]. Increasing system scale and multisystem configurations pose additional problems for architecture-centric practices. There is a dearth of effective architectural principles and techniques that remain applicable as systems become more complex and as their social, physical, and computational elements become intertwined in SoSs.

Future systems will be increasingly more complex. They will push beyond the bounds of today’s systems of systems in number of lines of code, amount of data, number of people involved, number of hardware and software elements, and number of connections and interdependencies. They will be ultra-large-scale (ULS) systems [15]. Their inherent decentralization, continuous evolution and adaptation, heterogeneity, routine failures, and social embedding will further undermine the assumptions of traditional engineering approaches. New principles and techniques are needed to cope with these conditions. Future trends and current limitations presage a need for interdisciplinary foundations for designing, constructing, maintaining, adapting, and rapidly deploying software-reliant systems with assured system capabilities at all scales.

Copyright © 2010 Carnegie Mellon University. FoSER 2010, November 7–8, 2010, Santa Fe, New Mexico, USA. ACM 978-1-4503-0427-6/10/11.


2. CURRENT LIMITATIONS

Current practice is to understand, develop, assure, and then finally field systems. In the understanding and developing phases, software and system engineers often fail to recognize that acceptable behavior of software-reliant systems depends not just on acceptable functionality but also on non-functional properties such as security, reliability, interoperability, openness, agility, and so on. These non-functional properties, or quality attributes, are not addressed sufficiently early in the system-development or evolution life cycle, resulting in frequent unacceptable system behavior and delays in fielding. There are too few early architectural tradeoff analyses based on quality attribute reasoning.

Many traditionally developed systems are taking longer to field than ever, and their behavior is not improved in the process. Systems are traditionally developed in one environment (the “engineering environment”) and deployed for use in another environment (the “operational environment”). Since new opportunities, capabilities, and threats are not always predictable and are continuously changing, this whole approach is becoming increasingly untenable. Moreover, there is no way to know before fielding a system how it will behave in every possible situation. In many cases, attempts to do so are resulting in increasingly long concept-to-deployment timeframes. And finally, in some cases the only way to match the tempo of market needs or, in the defense world, of adversaries is to allow end users to make changes directly within the operational environment. Systems then need to be designed to ensure acceptable behavior in the face of user-added variations, and this is not trivial.

Though there is research aimed at overcoming the limitations of current practice, the research pales in comparison to the scope of technical challenges to be addressed. Gaps in current research include

• quality attribute foundations for cyber-physical systems
• real-time scheduling algorithms and concurrency strategies for multi-core platforms in cyber-physical systems
• architecture principles and patterns that support the open, loose coupling required of SoSs
• data modeling, requirements management, configuration management, identity management, and validation and certification techniques that directly support SoSs
• SoS technology limitations (for example, service-oriented architecture (SOA)-based approaches are limited in the way they handle security, service orchestration, service discovery, mediation, versioning, and application to constrained environments)
• quality attribute measures and associated architecture principles for ULS systems, where humans are not merely users but are elements in the systems
• assurance approaches that address the special problems of SoSs
• failure modes of SoSs and how to impose design constraints that limit or reduce such failure modes
• technologies that will permit, and even encourage, innovation to occur within the operational environment
• automatic adaptation mechanisms to relax the need to completely anticipate user needs and how users will exploit system capabilities to meet new needs

3. NEEDED FOUNDATION FOR SUCCESS

We strongly believe that research is needed with a focus on the structure and behavior of software-reliant systems and the intimate relationship between structure and manifested quality attributes. The goal of this research would be to develop theories, analyses, technologies, techniques, and methods for acquiring evidence used to predict and bound system behavior for systems in all domains and at all scales (embedded systems, stand-alone systems, cyber-physical systems, product line systems, SoSs, and ULS systems). At the same time, there is a need for the technical foundations to accommodate change and facilitate rapid and in situ deployment in the face of increasing system scale, range, and heterogeneity; acceleration of the pace of change; and uncertainty presented by new technologies, new threats, and unanticipated system uses.

We believe this will require a combination of formal notations, quantitative analyses, and qualitative methods that

• address conflicts between quality attributes, interfaces between the computational and physical aspects of systems, and systems with concerns of varying levels of criticality
• provide foundations for handling multiple critical quality attributes in dynamic and resource-constrained environments
• provide improved theories of system structure for bounding the behavior of systems of systems
• explicitly recognize that systems of increasingly large scale will manifest new quality attributes; will require the use of non-traditional theories and analyses for their software design, construction, deployment, adaptation, and evolution; and will require new technologies and methods to accommodate their scale

3.1 Quality Attribute Foundations/Analyses

Cyber-physical systems have critical quality attribute requirements such as reliability and real-time performance requirements and hence warrant deep analyses to assure their behavior. However, theoretical limitations hinder performing such analyses. Moreover, techniques do not exist for analytically determining the ways that multiple attributes affect one another and ultimately for making tradeoffs between conflicting attributes. Traditional statistical techniques for analyzing hardware reliability are not applicable to software. These techniques assume that designs evolve slowly. They also assume that the longer a software-reliant system executes, the more likely it is that latent defects will be discovered [14]. This approach is problematic because system designs can evolve rapidly and are often used in continuously changing environments. Moreover, since many defects will manifest only after system integration, they will be discovered too late in development and therefore will be very costly to fix. The system's architecture, in its very structure and properties, contains important information that should be exploited to improve software-reliant system reliability through analysis and testing. But methods and techniques do not exist to exploit it. To date, architecture has been exploited in testing only in minor ways, such as to inform regression-testing plans [13] and unit testing [12]. Researchers have attempted to apply various formal methods to the problem of detecting component mismatch [6, 7], but these rely on the presence of behavioral and interface specifications that are far more complete and formal than are found in practice.

The eventual migration of cyber-physical systems to multi-core platforms will require new foundations for providing performance-related guarantees. Most recent real-time scheduling research has focused on single-core allocation [2, 5], and initial work in parallelizing jobs for multi-core has been very restrictive [10]. Migration to multi-core platforms will introduce new concurrency patterns leading to different and possibly undesirable runtime behavior. Problems introduced as a consequence of migration will be difficult to diagnose and fix. Tools and techniques for analyzing concurrent systems continue to be limited by the “state space explosion” problem. Verification has proven to be non-scalable when scheduling disciplines and synchronization protocols are ignored, as they are in state-of-the-art approaches [9, 4].

Consolidation of multiple single-core systems into a single multi-core system could also result in critical and non-critical functionality being co-resident on the same platform. The mixed criticality resulting from such consolidation will need to be addressed. Traditional techniques use isolation approaches that ignore criticality [16, 1], leading to the possibility of critical tasks being prevented from acquiring resources.

To fully exploit multi-core platforms, new flexible scheduling algorithms for parallelized jobs must be developed and analyzed.

3.2 Architecture-Centric Practices

Practices do not exist for uniformly applying architecture and quality attribute principles across SoSs and their constituent systems. Design and evolution strategies are ad hoc. Many architectural patterns have been created for software architecture, but not for system-of-systems architectures. In particular, standard strategies for end-to-end resource management (including handling degraded modes of operation) do not exist for SoSs.

The increasing scale and complexity of long-lived systems also underscores the role of architecture in balancing the need to prepare a system for future modifications with the need to provide immediate capability. Within the software community, there is a shift toward adopting software-development processes that claim to accelerate delivery of technical capabilities in the face of an uncertain future. While agile software development techniques provide some guidance for quick capability delivery, they do not account for unanticipated interactions within complex unprecedented systems (either during their initial development or as they evolve over their lifespan). On the other hand, too much upfront architecting can compromise rapid delivery of capability and can be wasteful in anticipation of future needs that change. Techniques for determining the right balance of architecture and agility do not exist, but are vital to the ongoing debate about the appropriateness of agile methods for today’s systems.

3.3 Architecture Principles for ULS Systems

Increasing scale poses additional problems for systems that have some, if not all, of the ULS system characteristics. Well-established notions of architecture and quality attributes for ULS systems do not exist. Overall health indicators have not been defined. And generally techniques for designing, evaluating, and evolving ULS systems need to be developed.

We regard human and social behavior as a critical locus of adaptive capability in systems that will operate in environments that are intrinsically unstable, and about which information will be necessarily incomplete, fragmented, distributed, and of variable veracity. This requires new architectural concepts inspired by well-studied analytic theories developed in the human sciences. For example, we have demonstrated how to allocate tactical network bandwidth in response to changing human needs by using auction mechanisms. These mechanisms build on well-studied microeconomic foundations that account for human value, incentive, and rational behavior [11].

Human self-interest is an example of the ULS system characteristic “erosion of the people/system boundary: people will not just be users of a ULS system; they will be elements of the system, affecting its overall emergent behavior” [15]. Therefore, resource-allocation decisions will require theories and techniques that account for the interactions between the social and computational aspects of systems. Moreover, since ULS systems are constantly evolving, the factors that cause change need to be better understood and principles for how ULS systems react to these factors need to be developed.

We need to invert our thinking from the traditional notion that humans play prescribed roles in technical systems to one where computation plays prescribed roles in social systems. That is, instead of human behavior being bounded by what the system allows, we need to define the system to support unpredictable human needs and interactions. This inversion necessarily forces us to rethink the role and principles of software engineering, but is critical to achieving the required tempos for software adaptation and innovation. The ultimate goal is to field socio-technical systems that are fundamentally self-adaptive.

4. CONCLUSION

From a software engineering perspective, we are currently ill-equipped to meet the challenges of assuring the desirable behavior of many of today’s and certainly tomorrow’s software-reliant systems. The needed research agenda is broad and deep. However, success implies faster availability of assured and flexible software-reliant system capability. Instead of taking years to acquire integrated systems and certify them through the testing process, new techniques will allow for incremental delivery of assured capability. These techniques will enable the delivered capability to be sufficiently safe, reliable, secure, responsive, and adaptable to change. In particular, a stronger technical foundation will enable

• early detection of potential system and system-of-systems problems
• efficient use of the available processing power of multi-core platforms
• analysis of social and computational interaction in systems to allocate system resources more safely and efficiently
• identification of elements of the assurance process that contribute most to establishing justified confidence in system behavior
• reduced time needed to field new software-reliant systems and achieve adequate confidence in overall system behavior, even when the development and evolution of system constituents are not completely under centralized control
• ability of end users to introduce software changes to meet their local needs with minimal negative impact on other users (for example, on overall system reliability, security, and performance)

5. ACKNOWLEDGEMENTS

The ideas in this paper are the result of the research and ongoing discussion of the members of the Research, Technology, and System Solutions Program of the Carnegie Mellon University’s Software Engineering Institute. The authors are technical leaders of that organization and are greatly indebted to their colleagues.

6. REFERENCES

[1] Aeronautical Radio Inc. 2010. ARINC 653 Specification. http://www.arinc.com.

[2] Brandenburg, B.B. and Anderson, J.H. On the Implementation of Global Real-Time Schedulers. In Proceedings of the 2009 30th IEEE Real-Time Systems Symposium (Washington, DC, USA, December 1 - 4, 2009). RTSS 2009. IEEE, Washington, DC, 214-224. ISSN 1052-8725, ISBN 978-0-7695-3875-4.

[3] Dvorak, D. L. 2009. NASA Study on Flight Software Complexity. Technical Report, NASA Office of Chief Engineer Technical Excellence Program.

[4] Farzan, A. and Madhusudan, P. Causal Dataflow Analysis for Concurrent Programs. In Proceedings of Thirteenth International Conference on Tools and Algorithms for the Construction and Analysis of Systems, Lecture Notes in Computer Science 4424/2007 (Braga, Portugal, March 24 – April 1, 2007). TACAS 2007. Springer, Berlin/Heidelberg, 102-116. DOI 10.1007/978-3-540-71209-1.

[5] Guan, N., Stigge, M., Yi, W., and Yu, G. New Response Time Bounds for Fixed Priority Multiprocessor Scheduling. In Proceedings of the 2010 31st IEEE Real-Time Systems Symposium (San Diego, CA, USA, November 30 – December 3, 2010). RTSS 2010. IEEE, Washington, DC, 387-397. ISSN 1052-8725, ISBN 978-0-7695-3875-4.

[6] Inverardi, P., Yankelevich, D., and Wolf, A. Checking Assumptions in Components Dynamics at the Architectural Level. In Lecture Notes in Computer Science 1282/1997. Springer, Berlin/Heidelberg, 46-63. DOI 10.1007/3-540-63383-9_72.

[7] Inverardi, P., Yankelevich, D., and Wolf, A.L. Static Checking of Systems Behaviors Using Derived Component Assumptions. ACM TOSEM 9, 3 (July 2000), 239-272.

[8] Jackson, D., ed. 2007. Software for Dependable Systems: Sufficient Evidence? Committee on Certifiably Dependable Software Systems, National Research Council. National Academies Press, Washington, DC. ISBN: 0-309-10857-8.

[9] Kahlon, V., Wang, C. and Gupta, A. Monotonic Partial Order Reduction: An Optimal Symbolic Partial Order Reduction Technique. In Proceedings of the 21st International Conference on Computer Aided Verification, Lecture Notes in Computer Science 5643/2009 (Grenoble, France, June 26 – July 2, 2009). CAV 2009. Springer, Berlin/Heidelberg, 398-413. ISBN 978-3-642-02657-7.

[10] Kato, S. and Ishikawa, Y. Gang EDF Scheduling of Parallel Task Systems. In Proceedings of the 2009 30th IEEE Real-Time Systems Symposium (Washington, DC, USA, December 1 - 4, 2009). RTSS 2009. IEEE, Washington, DC, 459-468. ISSN 1052-8725, ISBN 978-0-7695-3875-4.

[11] Klein, M., Moreno, G. A., Parkes, D. C., Plakosh, D., Seuken, S., and Wallnau, K. Handling Interdependent Values in an Auction Mechanism for Bandwidth Allocation in Tactical Data Networks. In Proceedings of the 3rd International Workshop on Economics of Networked Systems (Seattle, WA, USA, August 22, 2008). NetEcon '08. ACM, New York, NY, 73-78.

[12] Muccini, H., Bertolino, A. and Inverardi, P. Using Software Architecture for Code Testing. IEEE Transactions on Software Engineering 30, 3 (2004), 160-171.

[13] Muccini, H., Dias, M. and Richardson, D. Software architecture-based regression testing. Journal of Systems and Software 79, 10 (October 2006), 1379-1396.

[14] Musa, J. 2004. Software Reliability Engineering: More Reliable Software Faster and Cheaper 2nd Edition. Authorhouse, Bloomington, IN. ISBN-13: 978-1418493882.

[15] Northrop, L. et al. 2006. Ultra-Large-Scale Systems: The Software Challenge of the Future. Carnegie Mellon University, Software Engineering Institute, Pittsburgh, PA.

[16] Oikawa, S. and Rajkumar, R. 1999. Portable RK: A Portable Resource Kernel for Guaranteed and Enforced Timing Behavior. In Proceedings of the Fifth IEEE Real-Time Technology and Applications Symposium (Vancouver, Canada, June 02 - 04, 1999). IEEE RTAS ’99. IEEE, Washington, DC, 111-120. ISBN: 0-7695-0194-X.