Can Complex Systems Be Engineered? Lessons from Life

Anthony H. Dekker
Defence Science and Technology Organisation
DSTO Fern Hill, Department of Defence, Canberra ACT 2600, Australia
Email: [email protected]

Abstract. Recently there has been considerable debate as to whether Complex Systems can be engineered. Can engineering techniques be applied to Complex Systems, or are there fundamental attributes of Complex Systems that would prevent this? A first glance at the Complex Systems literature suggests a negative answer, but this is partly because complex systems theorists often look for unstable, chaotic, and "interesting" behaviour modes, while engineers look for stable, regular, and predictable modes. Looking more deeply at the Complex Systems literature, we suggest that the answer is "yes," and draw out a number of design principles for engineering in the Complex Systems space. We illustrate these principles on the one hand by commonly-studied complex systems such as the Game of Life, and on the other hand by real socio-technical systems. These design principles are particularly related to properties of the underlying network topology of the Complex System under consideration.

Keywords: Complex Systems, Systems Engineering, Software Safety, Network, Game of Life
1. INTRODUCTION

Is it possible to engineer Complex Systems? This question has prompted considerable debate (Wilson et al. 2007), although engineers have been working with Complex Systems for many years, and the problems that arise in today's Complex Systems differ from those of the past only in degree, not in kind. So-called "Systems of Systems" are Complex Systems which are particularly problematic because they:

• involve a network of stakeholders, rather than a rigid management hierarchy, so that problem resolution requires negotiation;

• require creative thinking when problems arise;

• need engineers to balance multiple conflicting objectives; and

• are in a constant state of flux.

Engineers have always faced these issues, of course – but as systems become larger and more complex, these issues become more significant. In this paper, we will show how the theory of Complex Systems leads to a number of general approaches which help to tame system complexity. We will draw these examples firstly from that prototypical Complex System, Conway's Game of Life (Sarkar 2000), and secondly from a number of real-world examples.
2. COMPLEX SYSTEMS

Complex Systems are characterised by interactions between system components that produce emergent system properties. The study of Complex Systems involves several complementary viewpoints:

• The Social Viewpoint focuses on human aspects of systems (Heyer 2004, Checkland 1981). Many Complex Systems incorporate a human component.

• The Biological Viewpoint draws on the study of complex biological systems existing in nature, such as ecosystems and even individual organisms (Solé and Goodwin 2000). Biological and social systems both involve adaptivity as conditions change. As the philosopher Heraclitus pointed out 2500 years ago, change and adaptivity are ubiquitous, so that "you cannot step twice into the same stream," since the stream is constantly changing (Copleston 1946). It is important to know the ways in which a system will adapt and the implications of its doing so.

• The Mathematical Viewpoint focuses on the topology of the underlying network of the system (Barabási 2002, Watts 2003), and on measurable system attributes. Much of the theory of Complex Systems arises from applying the mathematical viewpoint to physical and biological systems.

• The Engineering Viewpoint concentrates on building Complex Systems with desirable characteristics. This overlaps with the mathematical viewpoint when analysing and simulating systems, but also overlaps with the social viewpoint in areas such as process design.

Figure 1 summarises these four viewpoints:
Figure 1: Four Viewpoints on Complex Systems

This paper will concentrate on the engineering viewpoint, which attempts to tame complexity rather than studying its fascinating characteristics. In Complex Systems Engineering, both the system being "built" and the organisation put together to build it are Complex Systems, and our comments will apply to systems of both kinds.

3. STATE TRANSITION FORESTS

Let us consider a Complex System in terms of the set S1…SN of all possible states (the number of states N may be extremely large). As time progresses, states change according to a deterministic state transition operation: Si → Sj → Sk → … We can arrange the states into a State Transition Forest (STF) by enumerating states one by one and adding the chains of states they transition to. If a state transitions to a previously examined state in a different chain, the chains join to form a tree. If the state transitions to a previous state in the same chain, a back-link is formed. The back-link defines a cycle, and
hence a periodic behaviour mode for the system as a whole. Figure 2 shows a simple illustration: the system with the numbers 0…10 as states, and transitions defined by x → x^2 mod 11. The state transition forest for this system contains one large tree and two small ones. Among the important features of this forest are the in-degrees of the states, that is, the number of incoming arrows. For example, state 4 has an in-degree of 2:
Figure 2: State Transition Forest for x → x^2 mod 11

For each system, we can calculate the average of the non-zero in-degrees in the state transition forest, which we call din. For the example in Figure 2, din = (1+2+2+2+2+2)/6 = 1.833. When din > 1, the expected depth δ of the trees in the forest is approximately:
δ ≈ (log N) / (log din)

For the example in Figure 2, this gives δ ≈ (log 11)/(log 1.833) = 4.0, which in this case is precisely the depth of the largest tree (though in general it will not be exactly equal). Since cycles of states are produced from back-links in the state transition forest, the expected value of the longest period is approximately the same as δ. In Figure 2, it is 4, which is in fact equal to δ (although again it will not be exactly equal in general). For systems where din is close to 1, the expected length of the longest period is of the order of √N, by a version of the Birthday Paradox. A more specific result is provided by Wolfram et al. (1984). Summarising, the expected longest period for a system with N states is therefore approximately:

√N                    if din ≈ 1
(log N) / (log din)   if din > 1
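As a concrete illustration of these formulas, the following minimal sketch (the function names are our own, and any deterministic next-state function could be substituted) computes din, the predicted depth, and the actual longest period for the x → x^2 mod 11 system:

```python
from math import log

def mean_nonzero_in_degree(states, step):
    # Count the incoming arrows for each state in the transition forest,
    # then average over the states with at least one incoming arrow.
    indeg = {s: 0 for s in states}
    for s in states:
        indeg[step(s)] += 1
    nonzero = [d for d in indeg.values() if d > 0]
    return sum(nonzero) / len(nonzero)   # this is din

def longest_period(states, step):
    # Follow each chain until a state repeats; the repeat closes a cycle,
    # whose length is the period of that chain.
    best = 0
    for s in states:
        seen, t = {}, 0
        while s not in seen:
            seen[s] = t
            s, t = step(s), t + 1
        best = max(best, t - seen[s])
    return best

states = list(range(11))
step = lambda x: x * x % 11

din = mean_nonzero_in_degree(states, step)
print(din)                            # 1.833..., as computed above
print(log(11) / log(din))             # predicted depth/period: ~4.0
print(longest_period(states, step))   # actual longest period: 4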
Systems with an average non-zero in-degree din > 1 are systems which erase old state information (so that the details of the "from" state are lost), and the higher din, the more rapidly old state information is erased. Why is erasing old state information important? It contributes to system stability, and fewer or shorter system oscillations. Imagine an organisation with constantly changing equipment and procedures, but which also temporarily outposts staff to other units. When outposted staff return, they incorporate old state information in the form of out-of-date procedures. This can cause serious organisational problems. Organisational methods for addressing this include compulsory training, and specialised refresher training for staff returning from outposted positions. Distribution of key standards, requirements, and decisions is also important, and should be combined with mechanisms that force staff to return out-of-date copies. In software systems, old state information is erased by, for example, setting blocks of memory to all zeros before they are reused. This helps to avoid obscure bugs.

4. THE GAME OF LIFE

The Game of Life (invented by John Conway in 1970) is one of the most famous Complex Systems (Sarkar 2000), and one of the easiest to describe. It is a cellular automaton where cells in a grid can be live (1) or dead (0), and all cells change state (in parallel) based on the number of live neighbours (horizontal, vertical, or diagonal), as in Table 1.

Table 1: Cell Transition Rules for the Game of Life

Old Cell   Live Neighbours   New Cell   Explanation
0          3                 1          Birth
0          ≠3                0          No change
1          0, 1              0          Death from isolation
1          2, 3              1          Survival
1          4…8               0          Death by overcrowding

When studying finite grids for the Game of Life, the usual practice is to use a rectangle with opposite edges viewed as connected (topologically this corresponds to a torus). For an m×n grid, the number of different states or patterns is N = 2^(mn). We define the period of a pattern p as:

• 0, if the pattern p eventually dies out (becomes all zeros);

• 1, if the pattern p becomes stable: p → … q → q → …

• k ≥ 2, if the pattern p becomes a k-loop: p → … q1 → q2 → … qk → q1 → …

On a finite grid, every pattern has a well-defined period. For a particular size and shape of finite grid, we define the spectrum to be the distribution of periods for randomly chosen starting patterns. Figure 3 shows the spectrum for an 8×8 toroidal Game of Life, showing examples of patterns with period 0, 1, 2, 6, 32, 48, and 132 (periods 4, 8, 9, 16, and 20 are also possible, but rare). Notice that the maximum period of 132 is very small compared to the number of states N = 2^64, but has the same order of magnitude as the logarithm of the number of states. This is consistent with a state transition forest with din > 1, i.e. one erasing old state information.

Figure 3: Spectrum for the Game of Life on an 8×8 Toroidal Grid
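To make the spectrum experiment concrete, here is a minimal (unoptimised) sketch of how one sample of the 8×8 spectrum could be measured, directly encoding the rules of Table 1; the function and variable names are our own:

```python
import random

def step(grid, m, n):
    # One synchronous update of the toroidal grid, per Table 1.
    new = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            live = sum(grid[(i + di) % m][(j + dj) % n]
                       for di in (-1, 0, 1) for dj in (-1, 0, 1)
                       if (di, dj) != (0, 0))
            # Birth on exactly 3 live neighbours; survival on 2 or 3.
            new[i][j] = 1 if live == 3 or (grid[i][j] == 1 and live == 2) else 0
    return new

def period(grid, m, n):
    # Iterate until a state repeats, then measure the cycle it closes.
    seen, t = {}, 0
    key = lambda g: tuple(map(tuple, g))
    while key(grid) not in seen:
        seen[key(grid)] = t
        grid, t = step(grid, m, n), t + 1
    k = t - seen[key(grid)]
    # Period 0 means the pattern died out (the cycle is the empty grid).
    return 0 if not any(c for row in grid for c in row) else k

m = n = 8
grid = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
print(period(grid, m, n))   # one sample from the spectrum in Figure 3
```

Repeating this for many random starting grids yields the distribution of periods plotted in Figure 3.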
The one-dimensional analogue of the Game of Life, sometimes called "Rule 90" (Wolfram 2002), uses a ring of n cells, with state changes defined by Table 2:
Table 2: Cell Transition Rules for the One-Dimensional Game of Life

Old Cell   Live Neighbours   New Cell   Explanation
0          1                 1          Birth
0          0, 2              0          No change
1          0                 0          Death from isolation
1          1                 1          Survival
1          2                 0          Death by overcrowding
For rings of size n where n is a prime number not close to a power of 2, the most common period is 2^((n–1)/2) – 1. For example, random starting states on rings of size 11, 13, 19, and 23 give periods of 2^((n–1)/2) – 1 (i.e. periods of 31, 63, 511, and 2047) 99.9% of the time (for other ring sizes, the period is less). The period 2^((n–1)/2) – 1 is approximately ½√(2N), i.e. of the order of √N.
This is consistent with a state transition forest having din ≈ 1, i.e. one not erasing old state information. Disturbances propagate symmetrically along both sides of the ring, and "reflect" back from the other side. Each propagation changes the cells it passes, and this in turn alters the impact of the next propagation. Because old state information is not erased, the disturbance continues to move around the ring many times, giving an extremely large period.

In contrast, if we produce a version of the Game of Life where every cell is connected to all the others, it is easy to show that the period will always be at most 2, no matter which variation of Table 1 we use. This corresponds to a state transition forest with very large din (e.g. the square root or cube root of N).

The diameter of the underlying network is an important factor in these variations of the Game of Life. An n×n square Game of Life grid has diameter n/2, but the fully connected version has diameter 1 (since all cells are one link apart). Systems with low diameter tend to "damp out" propagated disturbances more quickly (since the disturbance "returns" after less time), and hence tend to have much higher values of din.
Sometimes networks are defined so that this relationship between diameter and din does not hold, and it is din that is the more significant in such cases. For example, in the random Boolean networks of Kauffman (1995), large diameters are associated with extremely high din and hence periodic behaviour, while high connectivity is associated with no erasure of state information (din ≈ 1) and hence a period of √N.

We can summarise the different kinds of system behaviour demonstrated within these variations of the Game of Life by adapting Wolfram's empirical classification of cellular automata (Sarkar 2000). In our modified classification, it is the structure of the state transition forests that is significant. There are four classes of system:

Class 1 (stable): all patterns have period 0 or 1.

Class 2 (oscillatory): this extends class 1, in that all patterns have small periods (much smaller in size than log N). This corresponds to state transition forests with very high din, and systems with underlying networks of very small diameter. The fully connected Game of Life is an example.

Class 3 (chaotic): class 1 or 2 behaviour can occur, but some patterns have a very large period, of order √N. This corresponds to state transition forests with din ≈ 1, and systems with underlying networks of very large diameter. The one-dimensional Game of Life is an example. In real-world systems, the period of order √N will often be longer than the lifespan of the Universe, giving essentially random behaviour.

Class 4 (complex): class 1 or 2 behaviour can occur, but also patterns with periods of order log N, giving a characteristic spectrum like that in Figure 3. This corresponds to state transition forests with din > 1, but where din is small compared to N. The underlying network of the system typically has a diameter of moderate size. The ordinary Game of Life is an example.

When we engineer Complex Systems, we would prefer their behaviour to be restricted to classes 1 and 2. We can do this by putting in place "damping" mechanisms to erase old state information, and by reducing the diameter of the system's underlying network, so that disturbances do not propagate for long periods of time.
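The effect of diameter reduction can be seen directly by computing diameters with breadth-first search. The following sketch is our own illustrative construction (assuming cells linked to their eight grid neighbours): it compares an 8×8 toroidal grid with the same grid augmented by a single "hub" component linked to every cell:

```python
from collections import deque

def diameter(nodes, neighbours):
    # Longest shortest-path distance, found by BFS from every node.
    best = 0
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbours(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

n = 8
cells = [(i, j) for i in range(n) for j in range(n)]

def grid_nbrs(u):
    # The eight toroidal grid neighbours of a cell.
    i, j = u
    return [((i + di) % n, (j + dj) % n)
            for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]

def hub_nbrs(u):
    # The same grid, plus a single hub component linked to every cell.
    return cells if u == "hub" else grid_nbrs(u) + ["hub"]

print(diameter(cells, grid_nbrs))            # 4, i.e. n/2 for the torus
print(diameter(cells + ["hub"], hub_nbrs))   # 2: any two cells meet at the hub
```

A single carefully-controlled hub thus collapses the diameter, which is the structural role played by buses, shared standards, and information repositories in the discussion that follows.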
Although the Game of Life is a famous example of complexity, aficionados do in fact "engineer" complicated Game of Life patterns to achieve non-trivial goals. This engineering is accomplished by:

(a) defining an ontology of stable or oscillatory sub-patterns, such as "blocks," "gliders," "glider guns," "eaters," "spaceships," etc. [see Honour and Valerdi (2006) for applications of ontologies to Systems Engineering];

(b) conducting computer searches to find patterns with specified properties;

(c) "assembling" patterns by placing them adjacent to each other; and

(d) controlling interactions so that most adjacent patterns do not interact, and the ones that do interact do so in carefully specified ways, for example by exchanging "gliders."

In general, engineering Complex Systems requires controlling interactions. However, we must not eliminate so many interactions that the diameter of the underlying system network is increased. One way of reconciling these conflicting goals is to introduce "hub" components which interact with many different parts of the system, but which do so in very carefully specified ways. The "bus" inside a computer or microprocessor is an example of such a hub. Widely circulated documents, standards, ontologies, and models are also hubs, particularly when they are part of an easily accessible online information repository. However, they need to be carefully written so that different people will interpret them in the same way. If that is difficult, then processes need to be put in place to resolve ambiguities and widely disseminate the resolution.

Organisationally, a small underlying network diameter can also be obtained by moving at least partially from a hierarchical structure to an "Edge Organisation" (Alberts and Hayes 2003, 2006), and by holding regular cross-organisational meetings and workshops on issues of wide interest. Many natural Complex Systems obtain a small diameter for the underlying network with a scale-free network topology (Barabási 2002). This is also possible for designed systems – in modern software packages, a low diameter can result from an approximately scale-free topology
of module interconnectivity (Wen and Dromey 2006).

5. INTERACTION PROBLEMS

Safety expert Nancy Leveson points out that "the most challenging problems in building complex systems today arise in the interfaces between components." In controlling interactions between system components, we wish to avoid two dangers:

• an excessively large number of interactions (as in software systems written in assembly language, or using shared-memory threads); or

• a large underlying network diameter, which may result in complex or chaotic behaviour, as described in the previous section.

However, there are other features of the underlying system network which are potentially dangerous. One of them occurs when two system components A and B both control a component C (the "troublesome triangle"):
Figure 4: The "Troublesome Triangle"

Failure to coordinate inputs from components A and B can result in incorrect behaviour by C, and possibly system failure. Where A and/or B have a human component, adaptation may reduce the effectiveness of the coordination. This can occur when people learn more "efficient" short-cut procedures which rely on assumptions which may not always be true (Leveson 2004).

One example involving this "troublesome triangle" was the tragic death of 71 people in a midair collision over Ueberlingen in 2002 (Nunes and Laursen 2004). A contributing factor to the collision was conflicting instructions to the pilots from air traffic control on the one hand, and the onboard Traffic Collision Avoidance System (TCAS) on the other. The overall system did not contain mechanisms for coordinating or resolving conflicts between air traffic control and TCAS. Figure 5 gives a simplified view of this system:

Figure 5: Troublesome Triangles for the Traffic Collision Avoidance System (TCAS)

Another example was the shooting down of two US Army Black Hawk helicopters by US Air Force F-15s in the "No Fly Zone" over Iraq in 1994. Among the contributing factors to this "friendly fire" incident was the fact that helicopters and F-15s in the same airspace were guided by different controllers on the same AWACS aircraft (Leveson et al. 2002). Consequently, three uncoordinated influence chains impacted on the unfortunate Black Hawk pilots, as shown in Figure 6:

Figure 6: Simplified Control Structure for No-Fly Zone Friendly Fire Incident

Organisational examples of the "troublesome triangle" involve responsibilities for C shared between organisational units A and B. Managing such shared responsibilities requires effective information exchange and liaison mechanisms, including cross-posted staff, shared training, and opportunities to discuss issues of common interest.

6. SOFTWARE SYSTEMS

Some of the most complex systems ever designed by human beings have incorporated large software subsystems. Some of these systems have failed spectacularly, either in use, or by being abandoned partway through development (Brooks 1975; Fiadeiro 2007).

The Therac-25 X-ray therapy machine is one of the more serious examples. The Therac-25 fatally overdosed several patients, as a result of uncontrolled component interactions (assembly language and shared-memory concurrency were used) and a "troublesome triangle" in the control of the radiation beam (Leveson 1995). Figure 7 illustrates the main system components:

Figure 7: The Troublesome Triangle in the Therac-25 X-ray System

Traditional software techniques for controlling interactions include:

• replacing goto commands by higher-level if, for, and while constructs;
• controlled exception-handling, as in Java (Gosling et al. 1996), which simplifies the system state on exceptional conditions (an example of erasing old state information);

• discouraging globally writable data, in favour of information hiding and modularisation (Pressman 1992);

• reducing sub-program side-effects (Storey 1996), or eliminating them completely as in functional programming languages (Reade 1989);

• using communicating processes (as in Unix), rather than shared-memory threads;

• type systems, assertions, and array bounds checks (as in Java), to control the way that data is accessed, and to provide safety checks;
• module interfaces which control and specify interactions, including the use of types, access control, and contracts (Fiadeiro 2007);

• well-structured operating systems, such as Unix, which provide a controlled environment for software to run in;

• static (compile-time) analysis of programs to identify at least some undesirable interactions; and

• user interfaces which clearly communicate a model of the system's current state, and its expected state following user action.

All these methods are instances of the more general techniques of reducing the number of interactions, controlling the kinds of interactions, and making the interactions more predictable. These general techniques are applicable to all Complex Systems.

7. DISCUSSION

Yes, it is possible to engineer Complex Systems. Complex Systems Engineering is like Traditional Systems Engineering, only more so. Engineering Complex Systems is, however, difficult. It demands even higher levels of creativity and willingness to collaborate than ordinary engineering does. This in turn requires explicit management commitment to growing the necessary staff – particularly in developing their people skills.

Unexpected things can occur in Complex Systems, and dealing with this requires a combination of planning ahead and responding to problems as they arise. The boundaries of Complex Systems need to be drawn widely – if there is debate about the system boundary, then the broader definition is probably the right one. Intelligent risk management needs to be applied to potential problems near the system boundary. The human components of Complex Systems need to be understood, and appropriate techniques (Heyer 2004) need to be applied to understand them.

An important aspect of Complex Systems Engineering is erasing old state information, and replacing it with new information. This includes continually updated online information repositories, and appropriate education and training for the human elements of the system.
This applies to both the system being designed, and the system responsible for designing it.

The underlying network of the system should have a low diameter. One way of achieving this is "hub" components which tie the whole system together. Information repositories are hubs in this sense, and within organisational systems, workshops and meetings can fill the same role. The hubs require particular care to ensure that they are error-free.

It is also important to use traditional engineering approaches for reducing sub-component interactions and making them predictable. These techniques include modularisation, interface specifications, and avoidance of "troublesome triangles," in which components A and B both control C, but without adequate coordination. However, reducing interactions should not compromise the low diameter of the underlying system network.

Finally, it is important to understand the ways in which the system can adapt, and to be aware of the positive and negative impacts of such adaptivity.

8. ACKNOWLEDGEMENTS

The author is grateful to Ed Kazmierczak, Bernard Colbert, Anne-Marie Grisogono, Åse Jakobsson, Jon Rigter, and Tim Smith for discussions on Complex Systems Engineering.

9. REFERENCES

Alberts, D.S. and Hayes, R.E. (2003), Power to the Edge, CCRP Press, Washington. Available at www.dodccrp.org/files/Alberts_Power.pdf

Alberts, D.S. and Hayes, R.E. (2006), Understanding Command and Control, CCRP Press, Washington. Available online at www.dodccrp.org/files/Alberts_UC2.pdf

Barabási, A.-L. (2002), Linked, Perseus Publishing.
Blanchard, B.S. and Fabrycky, W.J. (1998), Systems Engineering and Analysis, 3rd edition, Prentice Hall.

Brooks, F.P. (1975), The Mythical Man-Month: Essays on Software Engineering, Addison-Wesley.

Checkland, P. (1981), Systems Thinking, Systems Practice, Wiley.
Copleston, F. (1946), A History of Philosophy, Volume 1: Greece and Rome, Continuum Books.

Dekker, A.H. (2007), "Using Tree Rewiring to Study 'Edge' Organisations for C2," Proceedings of SimTecT 2007.

Fiadeiro, J.L. (2007), "Designing for Software's Social Complexity," Computer 40 (1), January, pp 34–39.

Gosling, J., Joy, B. and Steele, G. (1996), The Java Language Specification, Addison-Wesley.

Heyer, R. (2004), Understanding Soft Operations Research: The methods, their application and its future in the Defence setting, DSTO Report DSTO-GD-0411: www.dsto.defence.gov.au/publications/3451/DSTO-GD-0411.pdf

Honour, E. and Valerdi, R. (2006), "Advancing an Ontology for Systems Engineering to Allow Consistent Measurement," 4th Conference on Systems Engineering Research, Los Angeles, April.

Kauffman, S.A. (1995), At Home in the Universe: The Search for the Laws of Self-Organization and Complexity, Oxford University Press.

Leveson, N. (1995), Safeware: System Safety and Computers, Addison-Wesley.

Leveson, N. (2000), "Intent Specifications: An Approach to Building Human-Centered Specifications," IEEE Transactions on Software Engineering, 26 (1), January, pp 15–35. At sunnyday.mit.edu/papers/intenttse.pdf

Leveson, N. (2004), "A New Accident Model for Engineering Safer Systems," Safety Science, 42 (4), April, pp 237–270. Available at sunnyday.mit.edu/accidents/safetysciencesingle.pdf

Leveson, N., Allen, P. and Storey, M.-A. (2002), "The Analysis of a Friendly Fire Accident Using a Systems Model of Accidents," Proc. International Conference of the System Safety Society. Available online at sunnyday.mit.edu/accidents/issc-bl-2.pdf
Nunes, A. and Laursen, T. (2004), "Identifying the Factors That Contributed to the Ueberlingen Midair Collision," Proc. 48th Annual Chapter Meeting of the Human Factors and Ergonomics Society, Sept 20–24, New Orleans.

Pressman, R. (1992), Software Engineering: A Practitioner's Approach, 3rd edition, McGraw-Hill.

Reade, C. (1989), Elements of Functional Programming, Addison-Wesley.

Sarkar, P. (2000), "A Brief History of Cellular Automata," ACM Computing Surveys, 32 (1), March, pp 80–107.

Solé, R. and Goodwin, B. (2000), Signs of Life: How Complexity Pervades Biology, Basic Books.

Storey, N. (1996), Safety-Critical Computer Systems, Addison-Wesley.

Watts, D. (2003), Six Degrees: The Science of a Connected Age, Vintage.

Wen, L. and Dromey, R.G. (2006), "Component Architecture and Scale-Free Networks," Australian Software Engineering Conference (ASWEC) Workshop on Complexity in ICT Systems and Projects.

Wilson, S., Boyd, C., and Smeaton, A. (2007), "Beyond Traditional SE: Report on Panel at SETE 2006," SESA Newsletter, No. 42, March, pp 6–20.

Wolfram, S. (2002), A New Kind of Science, Wolfram Media, Champaign, IL.

Wolfram, S., Martin, O., and Odlyzko, A.M. (1984), "Algebraic Properties of Cellular Automata," Communications in Mathematical Physics, 93, March, pp 219–258.
Anthony Dekker obtained his PhD in Computer Science from the University of Tasmania in 1991. After working as an academic for five years, he joined DSTO, where his interests include networks, agent-based simulation, complex systems, and organisational structures.