Defining Survivability for Engineering Systems

Paper #77

Matthew G. Richards
Massachusetts Institute of Technology, 77 Massachusetts Ave., 41-205, Cambridge, MA 02139

Daniel E. Hastings
Massachusetts Institute of Technology, 77 Massachusetts Ave., 4-110, Cambridge, MA 02139

Donna H. Rhodes
Massachusetts Institute of Technology, 77 Massachusetts Ave., NE20-388, Cambridge, MA 02139

Annalisa L. Weigel
Massachusetts Institute of Technology, 77 Massachusetts Ave., 33-404, Cambridge, MA 02139

Abstract

This paper introduces an ongoing doctoral research track on the role of survivability as an attribute in the design of complex system architectures. Survivability may be defined generally as the ability of a system to minimize the impact of a finite disturbance on value delivery, achieved through either the satisfaction of a minimally acceptable level of value delivery during and after a finite disturbance or the reduction of the likelihood or magnitude of a disturbance. While survivable design of physical system components is a well-understood practice that is best left to established domains, architecting for survivability is a poorly understood socio-technical challenge increasingly relevant to engineering systems. After describing the motivation for and the scope of the research, survivability is conceptualized as a meta-framework for robustness and changeability. Survivability includes both passive and active techniques, which may manifest themselves in the physical, operational, and organizational design of engineering systems. Where passive survivability relies on the design principles of hardness, stealth, redundancy, and diversity to enable a system to resist a projected disturbance environment, active survivability embraces uncertainty in the projection of future disturbances by stressing architectural agility through the ability to regenerate, evolve, relocate, and retaliate.

Introduction

As the interdependence of large-scale, distributed systems has grown since the advent of modern telecommunications, so too has the risk to systems from disturbances that rapidly propagate across networks, damage critical infrastructure, and trigger catastrophic failures of systems of systems. This risk is further exacerbated by the emergence of new sources of disturbances. Physical disturbances may now be traced to terrorism in addition to traditional threats from state aggression and natural disasters. Electronic disturbances arise from information warfare and cyber-attacks. Furthermore, rapidly changing contexts may disturb the value propositions of complex systems with long developmental and operational lives.

While traditionally specified as a requirement in military systems, survivability is an increasingly important attribute of all engineering systems [1], which must be robust to environments characterized by frequent disturbances. Survivability is a critical property of lifecycle utility for many types of system architecture, and the domains in which it is applied are quite diverse, ranging from the radiation hardening of satellites to the security protocols of automated teller machines. While systems engineering tools exist for survivable design within these domains, these methodologies are often “reductionist” in nature (e.g., risk analysis, bottom-up verification), arrive at subjective definitions of survivability (i.e., based on domain-specific operating scenarios and presupposed disturbance environments), and provide limited insights for senior decision makers considering survivability at higher levels in the system architecture (e.g., satellite constellation trades between cost, utility, and survivability). Development of a generic survivability framework and associated design methodologies represents both a need (Neumann 2000) and an opportunity for growth within systems architecting and engineering.

The goals of this paper are threefold: (1) integrate definitions of survivability from different disciplines, (2) develop a framework that defines the dimensions and different forms of survivability, and (3) illustrate how survivability maps to the temporal system properties known as the “ilities” (e.g., flexibility, adaptability). After the motivation for the research is provided, two critical research questions are identified, and a preliminary research design is outlined, a taxonomy is introduced in which, at a macroscopic level, survivability is shown to arise from a combination of robustness and changeability. By explicitly defining, classifying, and relating the dimensions and forms of survivability in engineering systems to existing fields, this exploratory paper proposes scope and structure for future research on the design of survivable system architecture.

[1] An engineering system can be defined as a complex, dynamic, and technologically-enabled system characterized by non-trivial feedback from multiple stakeholders. (Non-trivial feedback is present when modifications to the structure of the system are required due to divergent stakeholder preferences.) An example of an engineering system is the Space Surveillance Network (SSN), which consists of a global network of radar and optical sensors; is operated by the U.S. Army, Navy, and Air Force; and fulfills a variety of civil and military space missions.

CSER 2007, Stevens Institute of Technology, ISBN 0-9787122-1-8, PROCEEDINGS CSER 2007, March 14-16, Hoboken, NJ, USA

Motivation

The operational environment of engineering systems is increasingly characterized by disturbances that may asymmetrically degrade performance, particularly for systems with networked structures. In recent years, hostile actors have preyed upon infrastructures which may be linked, whether physically or psychologically. For example, businesses incurred an estimated $5.5 billion in damages from the 2000 ILOVEYOU internet virus, which generated thousands of emails (an assault on internet links by overwhelming limited bandwidth) and overwrote important files on servers and workstations (an assault on internet nodes). The tragic events of September 11, 2001, which injured and killed thousands of people, may also be viewed as a psychological attack on our interdependent economy, with four physical disturbances causing a $1.2 trillion loss in the valuation of U.S. stocks in the week following the tragedy (Kean, Hamilton, et al. 2004).

Engineering systems are also vulnerable to unintelligent threats arising from the natural environment. For example, the outage of a generating plant in Parma, OH, in 2003 triggered a massive power outage across the Northeast, affecting 40 million people in eight states (U.S.-Canada Power System Outage Task Force 2004). In 2005, Hurricane Katrina breached the levees of New Orleans, subsequently flooding 80% of the city and costing 2,000 lives and over $80 billion in damages (Knabb, Rhome, and Brown 2005).

In addition to the observations noted above, research on survivable system architecture is motivated by national studies of vulnerabilities in critical U.S. infrastructure and the emergence of resilience engineering as a research priority at government and academic institutions.

• In 2001, the Rumsfeld Commission to Assess U.S. National Security Space Management and Organization surveyed satellite vulnerabilities to various hostile acts (e.g., denial and deception, interference, jamming, microsatellite attacks, nuclear detonation) and found that the impact of such surprise attacks could constitute a “Pearl Harbor” in space. Risks are further exacerbated by reliance on unhardened commercial systems and inadequate space situational awareness (Rumsfeld, et al. 2001).

• Following the terrorist attacks on September 11th, hundreds of national studies have identified vulnerabilities in critical economic infrastructures (including information, transportation, energy, retail, manufacturing, and finance) which reside in the private sector and are largely not hardened (Parfomak 2005).

• Recent research at the Massachusetts Institute of Technology (MIT) has focused on corporate security and resilience with an emphasis on creating enterprises with supply chains robust to high-impact disturbances. An important finding by Sheffi (2005) is that the pressure to achieve cost efficiencies has led to the “leaning” of global supply chains that are now extremely fragile to disruptions. Empirical evidence indicates the need for balance between security, redundancy, and short-term profits.

• In October 2006 at USC’s Center for Systems and Software Engineering, stakeholders from commercial organizations (e.g., Motorola, Bosch), defense companies (Lockheed Martin, Northrop Grumman, and Boeing), and academia (USC and MIT) identified resilience engineering as the top priority for system of systems architecting (Axelband, Valerdi, et al. 2007). An extension of the traditional fields of reliability engineering and safety management, resilience engineering incorporates the principles of the Santa Fe Institute and Highly Optimized Tolerance (HOT) (Carson and Doyle 2000), positing that safety emerges from an aggregate of system components, subsystems, software, organizations, human behaviors, and their interactions. To be resilient, systems must not only be reliable but also able to recover from disturbances through the design of proactive organizations and processes (Hollnagel, Woods, Levenson, et al. 2006).

Problem Statement

Survivable design of physical system components is a well-understood practice best left to established domains (e.g., Journal of Aircraft Survivability is a rich source for information on the design of aeronautical systems with minimized acoustic, infrared, optical, electro-optical, and radar signatures). However, survivability at the system architecture level is a poorly understood system property that extends beyond questions of component reliability and node hardening (Neumann 2000) to internalize the role of operational behavior, human factors, supporting infrastructures, and the technical system architecture in the assessment of survivability (Hollnagel, Woods, Levenson, et al. 2006). Indeed, when considered above the component level, survivability is an emergent property of system architecture that has meaning primarily in the overall context to which it relates. Unfortunately, survivability requirements are extremely difficult to specify, develop, procure, operate, and maintain (Neumann 2000).

Existing systems engineering methodologies have difficulty modeling low-probability, high-consequence events and evaluating the benefits of protective measures. Furthermore, existing models of survivable architectures, developed primarily during the Cold War with virtually unlimited resources (e.g., nuclear command and control), are not readily deployable to the design of future engineering systems (Blair 1985). The fundamental goal of this research is to address this challenge by developing a design process that enables cost-benefit analysis of survivability to be performed at the architecture level (e.g., to be able to answer the question: “Which is more survivable, a distributed constellation of Iridium spacecraft or a hardened MILSTAR satellite?”). An ideal outcome of this research would be a systems engineering methodology that allows decision makers to explicitly trade between utility, cost, and survivability in system architecture design. The following two questions provide scope for this research opportunity.

1. How can survivability be quantified and used as a decision metric in exploring tradespaces during conceptual design of complex, socio-technical systems?
   a. What are the dimensions of survivability and how do these map to the “ilities”?
   b. What are the fundamental tensions in designing cost-effective, survivable engineering systems?
   c. How should a designer allocate passive versus active techniques in order to achieve system survivability?
2. What are the architectural aspects of designing survivable engineering systems?
   a. Which statistical methods can be employed to identify conditions when a system is vulnerable to catastrophic failure?
   b. Which supporting infrastructures and cultural attributes (including the network of developers, customers, suppliers, operators, and maintainers) reduce vulnerabilities to catastrophic failures?

Research Design

The research design involves four phases (Table 1). These are not discrete steps but rather an abstraction of an iterative, concurrent process of continuous learning, revisiting of assumptions, and development and testing of theory.

Table 1: Research Design

1. Knowledge Capture and Synthesis (Descriptive), Fall 2006 – Fall 2007
   a) Literature review
   b) Survivability definition
   c) Survey & structured interviews
   d) Descriptive case studies for each component of preliminary framework (functional view)
      - Disturbance mechanism: Terrorism
      - Detection mechanism: Space Situational Awareness (SSA)
      - Decision mechanism: National Command Authorities
      - Response mechanism: Nuclear “Triad”
2. Theory Development (Normative), Winter 2006 onward
   - Hypotheses
   - Framework
   - Metrics
   - Heuristics
3. Computer Experiments (1-2 Mapping), Spring 2007 onward
   - Multidisciplinary System Design Optimization
   - Quantitative Methods
   - SSA Modeling
4. Case Applications (Prescriptive), Spring 2008 onward
   - Air Force TACSAT program
   - Homeland Security technology
   - Critical infrastructure protection

The first phase, knowledge capture and synthesis, includes four steps: (a) review literature on survivability and resilience engineering, (b) integrate definitions of survivability from different disciplines, (c) administer a survey and conduct structured interviews with key stakeholders (to motivate and refine the research questions, not for statistical significance), and (d) review real-world disturbance phenomena and existing engineering systems deemed survivable (e.g., nuclear command and control system) to ground the subsequent theoretical investigation.

The second phase, theory development, includes the development of a survivability framework, a mapping of survivability to the temporal system properties known as the “ilities,” and the formation of hypotheses, metrics, and heuristics.

In the third phase, computer experiments will be used to map the descriptive research conducted during knowledge capture and synthesis to the normative work performed during theory development. The purpose of the computer experiments is to test the framework and hypotheses in an idealized modeling environment. Potential methods for these computer experiments are categorized in Table 2.

Table 2: Potentially Relevant Disciplines

System Representation
• Architecture frameworks (DoDAF/MoDAF)
• Network analysis
• Hierarchical structures
• Engineering Systems Model (ESM)

Modeling & Simulation
• Agent-based modeling
• System dynamics
• Discrete event simulation
• Parametric modeling
• Time-based simulations

Decision Theory
• Utility theory
• Multi-Attribute Utility Theory (MAUT)
• Analytic Hierarchy Process (AHP)
• Prospect theory

Optimization
• Genetic algorithms
• Simulated annealing
• Gradient search
• Programming: Linear (LP), Mixed-Integer (MIP), and Dynamic (DP)

Robust Design
• Axiomatic design
• Taguchi methods
• Highly Optimized Tolerance (HOT)

Uncertainty Management
• Hazard analysis
• Failure Modes and Effects Analysis (FMEA)
• Real options
• Portfolio theory
• Decision tree analysis

In the fourth and final phase, case applications, the survivability framework and hypotheses will be used to make prescriptive statements. Interviews, primary and secondary sources, and computer-based models and simulations will be employed to answer the research questions in the “messy” real world. One candidate engineering system for case application, for which design data and program management are accessible, is the Air Force’s tactical satellite program (TACSAT). In developing a standardized microsatellite bus to facilitate the on-demand launch of payloads tailored for specific missions, the TACSAT program aims to improve the flexibility and operational responsiveness of the U.S. space enterprise. Such a paradigm shift could provide a means for rapidly reconstituting capabilities on-orbit in the event of catastrophic losses, a key design principle for survivable architecture.

Definition of Survivability

Definitions for survivability vary across the biological, network security, and aerospace and defense domains (Table 3). While the “continuation of life” is a simple, clear definition for the life sciences, there is less clarity in defining survivability for engineering systems. Formerly proposed metrics for survivability include: the range of environments within which an entity remains operational, the disturbance threshold above which an entity will cease to function, the degree to which performance remains following a disturbance, and the time required to restore health following a compromising disturbance. A general definition of survivability is a helpful step towards the development of a design process for survivability that may be applied across domains.

Table 3: Spectrum of Survivability Definitions

Domain | Definition / Criteria | Source
Biology | Environmental fitness of organisms; evolutionary longevity of species to natural selection | Darwin 1859
Communication Networks | Percentage of stations both surviving the physical attack and remaining in electrical connection with the largest single group of surviving stations | Baran 1964
Communication Networks | Probability of retaining connection between representative pairs of nodes | Al-Noman 1998
Communication Networks | Ability of a system to perform required functions at a given instant in time after a subset of components become unavailable | Yurcik and Doss 2002
Aerospace / Defense | Quantified ability of a system, subsystem, equipment, process, or procedure to continue to function during and after a natural or man-made disturbance | MIL-STD-188
Aerospace / Defense | Capability of a system and crew to avoid or withstand a man-made hostile environment without suffering an abortive impairment of its ability to accomplish its designated mission | Naval Air Warfare Center 2001
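To make one of the network definitions in Table 3 concrete, the following is a minimal sketch of the Baran (1964) criterion: the share of stations that both survive an attack and remain connected to the largest group of survivors. The function name `baran_survivability`, the adjacency-dictionary representation, and the six-station ring network are illustrative assumptions rather than anything specified in the cited work, and the fraction is taken here relative to the original number of stations.

```python
from collections import deque


def baran_survivability(adjacency, destroyed):
    """Fraction of all original stations that survive the attack and belong
    to the largest connected group of surviving stations.

    adjacency: dict mapping each station to a list of neighboring stations
    destroyed: iterable of stations removed by the attack
    (Both the representation and the normalization are assumptions.)
    """
    survivors = set(adjacency) - set(destroyed)
    seen = set()
    largest = 0
    for start in survivors:
        if start in seen:
            continue
        # Breadth-first search restricted to surviving stations.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor in survivors and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        largest = max(largest, size)
    return largest / len(adjacency)


# Hypothetical six-station ring: each station links to its two neighbors.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

one_loss = baran_survivability(ring, destroyed=[0])      # survivors form one chain of 5
two_losses = baran_survivability(ring, destroyed=[0, 3])  # ring splits into two pairs
print(round(one_loss, 3), round(two_losses, 3))
```

In this toy ring, a second, well-placed loss fragments the network, so the metric falls much faster than the raw count of surviving stations. This is the kind of structural effect the definition is meant to capture.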

The literature on survivable system design extends from the long history of naval architecture to the design of combat aircraft in the 20th century (Yurcik and Doss 2002). In the 1960s, the U.S. Department of Defense (DOD) formally defined survivability as “the system capacity to resist a hostile environment so that it can fulfill its mission” (MIL-STD-721, DOD-D-500.3). Other DOD definitions are noted in Table 3. Traditionally calculated by military planners with specific operational scenarios in mind, survivability is often defined subjectively, dependent not only on susceptibility (e.g., how likely am I to be hit?) and vulnerability (e.g., am I killed if I’m hit?) but also on the range and breadth of proposed missions, threat levels, and the availability of supporting assets within missions (Hall 2004).

Even within domains, there are definitional discrepancies. For example, survivability of communications networks has been defined in terms of reliability as the “probability of retaining connection between representative pairs of nodes” (Al-Noman 1998), in terms of the capability of a system “to perform required functions at a given instant in time after a subset of components becomes unavailable” (Yurcik and Doss 2002), and in terms of regeneration as the “degree to which systems recover from attacks” (Moitra and Konda 2000).

Given that all systems exist to deliver value, an objective definition will achieve domain-neutrality by defining survivability in terms of value. Furthermore, as survivability is an aggregate system property that reveals itself over time, it is critical that its temporal aspects are captured. These principles and the desire for a quantitative formulation guided the development of the following definition.

Survivability may be defined generally as the ability of a system to minimize the impact of a finite disturbance on value delivery, achieved through either (1) the satisfaction of a minimally acceptable level of value delivery during and after a finite disturbance or (2) the reduction of the likelihood or magnitude of a disturbance.

[Figure 1. Definition of Survivability. Plot of value V(t) against time across Epoch 1a, Epoch 2 (disturbance), and Epoch 1b (recovery), marking the original state, the recovered state, the emergency value threshold Ve, the expected value threshold Vx, and the permitted recovery time Tr. Epoch: time period with a fixed context; characterized by static constraints, design concepts, available technologies, and articulated attributes (Ross 2006).]

Figure 1 illustrates this definition of survivability across two epochs, time periods of a fixed environment (Ross 2006). Following successful value delivery during Epoch 1a, the system sustains a finite disturbance during Epoch 2 that degrades performance. Once the disturbance ceases, the environment reverts back to the original context, Epoch 1b. In this example, the system achieves survivability by maintaining value delivery [V(t)] at a level above the emergency value threshold [Ve] and then recovering to deliver value above the expected value threshold [Vx] within the permitted recovery time [Tr].

It is helpful to distinguish the subtle difference between robustness and survivability. While robust systems are able to accommodate permanent changes in context (e.g., policy shift), survivable systems are able to recover from finite changes in context (e.g., impulse event). Therefore, survivability can be considered a special case of value robustness with a finite condition on disturbance duration. Figure 2 illustrates this distinction. Whereas the robust system might be able to accommodate three new epochs, the survivable system must only sustain value or utility [U] delivery during and after a finite disturbance (e.g., the loss of U2 in Epoch 2 followed by partial recovery of U2 in Epoch 1b).

[Figure 2. Distinguishing Between Robustness and Survivability. Robust panel: accommodating a permanent change in context, with design vector $DV_{1a} = [DV_1, DV_2, DV_3]^T$ carried from Epoch 1 into Epochs X, Y, and Z. Survivable panel: recovers from a finite change in context, with $U_{1a} = [U_1, U_2, U_3]^T = [1, 2, 3]^T$ in Epoch 1a; $U_2 \to 0$, giving $U_2 = [1, 0, 3]^T$ in Epoch 2; and partial recovery in Epoch 1b from $U_{1b,\mathrm{start}} = [1, 0, 3]^T$ to $U_{1b,\mathrm{end}} = [1, 1, 3]^T$.]
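The value-based criterion illustrated in Figure 1 can be sketched as a simple check on a sampled value history. This is only one possible reading under stated assumptions: the function name `is_survivable` and its arguments are hypothetical, the permitted recovery time Tr is assumed to be measured from the end of the disturbance, and "recovered" is taken to mean that every sample after that deadline is at or above Vx.

```python
def is_survivable(times, values, t_start, t_end, v_e, v_x, t_r):
    """Check a sampled value history V(t) against the survivability definition.

    times, values  : paired samples of V(t)
    t_start, t_end : start and end of the finite disturbance (Epoch 2)
    v_e            : emergency value threshold Ve
    v_x            : expected value threshold Vx
    t_r            : permitted recovery time Tr, assumed here to be
                     measured from the end of the disturbance
    """
    samples = list(zip(times, values))

    # Condition 1: from disturbance onset onward, V(t) never drops below Ve.
    if any(v < v_e for t, v in samples if t >= t_start):
        return False

    # Condition 2: by t_end + Tr, V(t) is back at or above Vx and stays there.
    recovered = [v for t, v in samples if t >= t_end + t_r]
    return bool(recovered) and all(v >= v_x for v in recovered)


# Illustrative history: nominal value 10, disturbance over 3 <= t <= 5.
times = list(range(10))
values = [10, 10, 10, 4, 4, 6, 8, 10, 10, 10]

print(is_survivable(times, values, t_start=3, t_end=5, v_e=3, v_x=8, t_r=2))
print(is_survivable(times, values, t_start=3, t_end=5, v_e=5, v_x=8, t_r=2))
```

The first call satisfies both conditions. The second fails because the value dips below the stricter emergency threshold during the disturbance. A quantitative metric for tradespace exploration would likely need more than this pass/fail test, but the sketch shows how Ve, Vx, and Tr jointly bound an acceptable value trajectory.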

The “ilities” (e.g., flexibility, adaptability, serviceability) are temporal system properties that specify the degree to which systems are able to maintain or even improve function in the presence of change. Although the ilities are desired attributes of systems that characterize their interaction with uncertainties, there is a great deal of confusion associated with their definitions and relationships (McManus and Hastings 2005). Therefore, before mapping survivability to the ilities, it is necessary to define the relationships among the ilities themselves.

[Figure 3. “Ility Space.” Three axes: Physical System (Design Variables), Stakeholder Values (Utilities), and Environment (Context). Labeled elements: robustness, changeability, versatility, survivability, and the Plane of Value Robustness.]

The “Ility Space” (Figure 3) is a depiction of how the various ilities relate to one another (McManus et al. 2007). The three axes represent sources of change: (1) the physical system, which is specified by design variables, (2) stakeholder values, which may be articulated as a utility function, and (3) environmental context. A changeable system is able to have its design variables modified by either an internal change agent (adaptability) or an external change agent (flexibility) to maintain or improve value delivery in the presence of shifting environments and requirements (Ross 2006). Changeability may be applied along the physical system axis of Figure 3 as a technique for achieving active value robustness (i.e., maintaining value in the presence of changing contexts through system modifications). Along the environment axis, robustness specifies an unchanging system that is able to accommodate various environments while maintaining value delivery. Along the value axis, versatility refers to an unchanging system that is able to satisfy a variety of needs (McManus and Hastings 2005).

An example of an actively survivable system in the Ility Space is one that loses some function following an environmental disturbance (without falling below the emergency value threshold) and is able to recover functionality from the new state through some path of internal or external adaptation. More generally, survivable systems leverage a combination of robustness and changeability to remain in this “plane of value robustness.” The following two sections elaborate on the passive (robust) and active (changeable) techniques available to engineering systems to achieve survivability.

Passive Survivability

Passive survivability is the ability of a system to maintain value delivery despite environmental disturbances. As a special case of robustness, passive survivability seeks to resist disturbances with defensive barriers. For a projected disturbance environment, designers develop a closed-loop architecture using various combinations of four design principles.

1) Hardness – resistance of a system to deformation; increased cost to attacker (e.g., MILSTAR)
2) Stealth – ability of a system to conceal itself within its operating environment (e.g., B-2 multi-role bomber)
3) Redundancy – duplication of critical system components to increase reliability (e.g., fly-by-wire flight control systems)
4) Diversity – variation in the range of systems within an architecture; mitigation of HOT fragility in changing contexts (e.g., nuclear triad)

Figure 4 illustrates how passive survivability maps to the Ility Space. While Path AB shows how a hardened system is able to accommodate environmental changes without changes to the physical system, Path AC shows how redundant and diverse systems may incur physical losses from environmental disturbances and remain on the plane of value robustness. A survivable system utilizing stealth avoids disturbances altogether and is excluded from the plot.

[Figure 4. Passive Survivability. Ility Space axes (Environment, Physical System, Stakeholder Values) showing Path AB (hardened) and Path AC (redundant, diverse).]

Active Survivability

Active survivability is the ability of a system to react to environmental disturbances through changeable design variables. Rather than resisting disturbances, the design focus is on agile architectures which are able to react to environmental feedback. While design for passive survivability presupposes knowledge of the disturbance environment, design for active survivability acknowledges uncertainty in the projection of future disturbances. In contrast to the design principles of passive survivability, which are designated as nouns, the design principles of active survivability are best specified as verbs:

1) Regenerate – restoration of capability through repair and replacement activities (e.g., emergency resupply, refreshment of cryptographic key values)
2) Evolve – system modification to maintain and possibly extend capability (e.g., on-orbit servicing)
3) Relocate – movement in position (e.g., mobile Scud launchers)
4) Retaliate – provision of negative consequences to origin of disturbance (e.g., nuclear deterrence)

Figure 5 shows how active survivability may be mapped to the Ility Space. In this case, regeneration is illustrated. Following an environmental disturbance (AB), the system incurs losses in the physical (BC) and value (CD) dimensions. Once the disturbance subsides (DE), value is actively restored (FA) through the regeneration of physical capability (EF).

[Figure 5. Active Survivability. Ility Space axes (Environment, Physical System, Stakeholder Values) with path segments: AB, environmental disturbance; BCD, physical and value losses (utility loss); DE, disturbance subsides; EFA, physical and value recoveries (regeneration case).]

Table 4 hypothesizes differences between the passive and active techniques of survivability. To first order, the distinction is philosophical: passive survivability is something that a system has while active survivability is something that a system does.

Table 4: Passive vs. Active Survivability

Attribute | Passive Survivability | Active Survivability
Philosophy | Survivability is something that a system has | Survivability is something that a system does
Characteristics | proactive, resistant, robust | reactive, flexible, adaptive
Design Principles | hardness, stealth, redundancy, diversity | regenerate, evolve, relocate, retaliate
Forecasting | Presupposes knowledge of disturbance environment | Acknowledges uncertainty in projection of future disturbances
Architecture | Closed (static) | Open (dynamic)
Design Focus | Defensive barriers at system-level to resist disturbances | Architectural agility to avoid, deter, and recover from disturbances
Failures | Causal chain (often linear) | Tight couplings, functional resonance (nonlinear)
Relevant Disciplines | Component reliability, safety engineering, risk analysis, domain-specific technologies | Real options, organizational theory, process design, domain-specific technologies
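As a small worked illustration of the redundancy principle listed under passive survivability, the sketch below computes the probability that at least one of several duplicated components remains functional. It rests on a textbook assumption that is not part of the paper's framework: units fail independently of one another. The function name `parallel_reliability` and the example figure of 0.9 per unit are hypothetical.

```python
def parallel_reliability(unit_reliability, copies):
    """Probability that at least one of `copies` redundant units functions,
    assuming every unit works independently with probability
    `unit_reliability` (an idealization; common-cause failures break it)."""
    if not 0.0 <= unit_reliability <= 1.0:
        raise ValueError("unit_reliability must lie in [0, 1]")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system fails only if all copies fail.
    return 1.0 - (1.0 - unit_reliability) ** copies


# Hypothetical component that survives a given disturbance with probability 0.9.
for n in (1, 2, 3):
    print(n, round(parallel_reliability(0.9, n), 6))
```

The independence assumption is the weak point. A single disturbance that disables every identical copy at once defeats pure duplication, which is arguably one reason diversity is listed as a separate design principle alongside redundancy.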

Next Steps

Three efforts comprise the next steps of this doctoral research: (1) structured interviews with stakeholders of survivable engineering systems, (2) a descriptive case study of nuclear command and control, and (3) computer experiments of space situational awareness.

Structured Interviews. Input from three stakeholder communities (system developers, acquirers, and users) will be used to further motivate and refine the scope of the research. The working list of questions below aims to elicit problems with the current design paradigm as well as opportunities for improving the design process of survivable system architecture.

1) How do you define survivability?
2) How would you characterize the disturbance environment of your systems?
   - Benign ↔ Hostile
   - Unintelligent ↔ Intelligent
3) How are system vulnerabilities determined?
4) How does the importance of survivability change over time in your domain?
   - During what phases of the design process is survivability addressed (if at all)?
   - Relative to other architecture considerations, what level of priority is given to survivability requirements?
5) How are survivability requirements articulated in the design process?
   - Constraints
   - Traditional requirements
   - Attributes to maximize
   - Emergent system properties
6) At what level in the program is survivability addressed?
   - Component (e.g., reliability)
   - System (e.g., hardening)
   - Architecture (e.g., redundant nodes)
   - Enterprise (e.g., supply chain robustness, robust workforce)
7) What design principles are utilized to achieve survivability?
8) How is survivability managed in the tradespace of conceptual design?
   - Is survivability performance aggregated into a larger attribute set or considered at the highest levels of programmatic decision-making (with cost and utility)?
   - How is survivability quantified (if at all)?
   - What systems engineering methods and tools are available for survivable design (and at what level in the architecture do they address survivability)?
9) What is the greatest challenge in survivable design?

Nuclear Command and Control Case Study. As an extreme example of a survivable architecture, the U.S. nuclear command and control system for strategic deterrence during the Cold War may offer theoretical insights for the design of engineering systems in hostile environments. “Best practices” may span the design of physical components, organization, and operational behavior. As noted previously, a key challenge here is to selectively apply lessons learned to design paradigms where resources for achieving survivability are limited. Other potential systems for descriptive case studies on survivable architecture include the naval Carrier Battle Group (for which 10 to 17 ships are deployed to provide a degree of survivability to the carrier with its 35 to 45 offensive aircraft) and the Airborne Warning and Control System aircraft (for whose protection up to 24 fighter aircraft are deployed) (Giffen 1982).

Space Situational Awareness Computer Experiment. The first planned computer experiment is to investigate the design of the next-generation ground and space segments of the U.S. Space Situational Awareness (SSA) architecture. In particular, the degree of coupling among various sensors will be varied to assess the impacts of network structure and stakeholder collaboration on survivability. Issues to be investigated include how loosely coupled architectures might enable design flexibility for systems of systems and how designers might trade off between passive and active survivability techniques (e.g., hardening vs. flexible routing).

Conclusion

This paper introduced an ongoing doctoral research track on the role of survivability as an attribute in the design of complex system architectures. After describing the motivation for and the scope of the research, the three goals set in the introduction were accomplished. First, definitions of survivability from various disciplines were integrated to arrive at a value-centric definition that may be applied to different domains. Second, a framework was developed which illustrated survivability as a superposition of robustness and changeability. The framework included an explanation of the passive and active techniques available for the design of survivable systems. Third, survivability was explicitly mapped to the “ilities,” relating this research to ongoing research efforts on the temporal aspects of complex system architectures. It is hoped that this exploratory paper may serve as a foundation for future research on survivable system architectures and that survivability is recognized as critical to the design of value robust systems.

Acknowledgements

As a rich source of ideas and steadfast encouragement, Dr. Adam Ross has been instrumental in the formulation and development of this research topic. The authors also thank Dr. Hugh McManus, Dr. Ricardo Valerdi, and Nirav Shah for their valuable insights. Funding for this work was provided by the Program on Emerging Technologies, an interdisciplinary research effort of the National Science Foundation at MIT.

References

Al-Noman, A., “Analysis and Evaluation of Survivability of Various Configured Communication Networks.” International Journal of Communication Systems, 11: 305-310, 1998.

Axelband, E., Valerdi, R., Baehren, T., Boehm, B., Brown, W., Colbert, E., Dorenbos, D., Jackson, S., Madni, A., Nadler, G., Robertson, R., Robitaille, P., Settles, S., and Tran, T., “A Research Agenda for Systems of Systems Architecting.” Conference on Systems Engineering Research, Hoboken, NJ, March 2007.

Baran, P., “On Distributed Communications Networks.” IEEE Transactions on Communications Systems, March 1964.

Blair, B., Strategic Command and Control. The Brookings Institution, Washington, DC, 1985.

Carlson, J., and Doyle, J., “Highly Optimized Tolerance: Robustness and Design in Complex Systems.” Physical Review Letters, 84(11): 2529-2532, March 2000.

Carter, A., “The Architecture of Government in the Face of Terrorism.” International Security, 26(3): 5-23, Winter 2001/2002.

Darwin, C., On the Origin of Species. John Murray, London, UK, 1859.

Giffen, R., US Space System Survivability. National Defense University Press, Washington, DC, 1982.

Hall, D., “Integrated Survivability Assessment (ISA) in the Acquisition Lifecycle.” AIAA Structures, Structural Dynamics, and Materials Conference, Palm Springs, CA, April 2004.

Hollnagel, E., Woods, D., Leveson, N., et al., Resilience Engineering. Ashgate, Hampshire, England, 2006.

Kean, T., Hamilton, L., et al., National Commission on Terrorist Attacks Upon the United States. Government Printing Office, 2004.

Knabb, R., Rhome, J., and Brown, D., Tropical Cyclone Report: Hurricane Katrina. National Hurricane Center, December 2005.

McManus, H., and Hastings, D., “A Framework for Understanding Uncertainty and its Mitigation and Exploitation in Complex Systems.” INCOSE Symposium, Rochester, NY, July 2005.

McManus, H., Richards, M., Shah, N., Valerdi, R., and Hastings, D., “A Framework for Incorporating “ilities” in Tradespace Studies.” Submitted for publication in 2007.

Moitra, S., and Konda, S., A Simulation Model for Managing Survivability of Networked Information Systems. Carnegie Mellon Software Engineering Institute, December 2000.

Naval Air Warfare Center, Aerospace Systems Survivability Handbook Series. Joint Technical Coordinating Group on Aircraft Survivability, May 2001.

Neumann, P., Practical Architectures for Survivable Systems and Networks. SRI International for U.S. Army Research Laboratory, June 2000.

Parfomak, P., Vulnerability of Concentrated Critical Infrastructure: Background and Policy Options. Congressional Research Service, December 2005.

Ross, A., Managing Unarticulated Value: Changeability in Multi-Attribute Tradespace Exploration. Engineering Systems Division, doctoral dissertation. Massachusetts Institute of Technology, Cambridge, MA, 2006.

Rumsfeld, D., et al., Commission to Assess U.S. National Security Space Management and Organization. September 2001.

Sheffi, Y., The Resilient Enterprise. MIT Press, Cambridge, MA, 2005.

U.S.-Canada Power System Outage Task Force, Final Report on the August 14th Blackout in the United States and Canada, April 2004.

Yurcik, W., and Doss, D., “A Survivability-Over-Security (SOS) Approach to Holistic Cyber-Ecosystem Assurance.” IEEE Workshop on Information Assurance, West Point, NY, June 2002.

Biographies

Matthew Richards is a graduate student at MIT pursuing a Ph.D. in Engineering Systems. Matt’s research focuses on system architecture and design, space systems engineering, and innovation management. His work experience includes Mars rover mission design at the Jet Propulsion Laboratory and systems engineering support on two autonomous vehicle programs for the Defense Advanced Research Projects Agency. Matt has B.S. and M.S. degrees in Aeronautics and Astronautics (MIT 2004, 2006) and an M.S. degree in Technology and Policy (MIT 2006).

Daniel Hastings is a Professor of Aeronautics and Astronautics and Engineering Systems at MIT. Dr. Hastings has taught courses and seminars in plasma physics, rocket propulsion, advanced space power and propulsion systems, aerospace policy, technology and policy, and space systems engineering. He served as Chief Scientist to the U.S. Air Force from 1997 to 1999 and as Director of MIT’s Engineering Systems Division from 2004 to 2005. He is a member of the National Science Board, the International Academy of Astronautics, the Applied Physics Lab Science and Technology Advisory Panel, and the Air Force Scientific Advisory Board.

Donna Rhodes is a Principal Research Engineer at MIT in the Engineering Systems Division. Her research interests are focused on systems engineering, systems management, and enterprise architecting. Dr. Rhodes has twenty years of experience in the aerospace, defense systems, systems integration, and commercial product industries. Prior to joining MIT, she held senior level management positions at IBM Federal Systems, Lockheed Martin, and Lucent Technologies in the areas of systems engineering and enterprise transformation. Dr. Rhodes is a Past-President and Fellow of the International Council on Systems Engineering (INCOSE).

Annalisa Weigel is an Assistant Professor of Aeronautics and Astronautics and Engineering Systems at MIT. Dr. Weigel’s teaching and research areas are focused on space system architecture and design, systems engineering, systems-of-systems analysis, aerospace policy, and finance. As an engineer at Adroit Systems from 1995-1998, she worked in support of the DoD Space Architect Office during its stand-up and initial space system architecture design studies in the areas of satellite communications, satellite operations, and launch on demand. After completing her Ph.D. at MIT, Dr. Weigel worked for a year as a research associate at Lehman Brothers in New York City.


PROCEEDINGS CSER 2007 , March 14-16, Hoboken, NJ , USA