Simulation and Human Factors in Modeling of Spaceflight Mission Control Teams

International Symposium on Resilient Control Systems (ISRCS)
15 August 2012, Salt Lake City, UT

Barrett S. Caldwell, PhD
Jeffrey D. Onken, PhD
Background: Four Team Processes (GROUPER History)

• Asking (2005 paper, by Caldwell): A novice comes to experts with a question.
• Sharing (2006 paper, by Ghosh & Caldwell): Experts share opinions, experience, and backgrounds.
• Solving (this paper): An expert finds a problem, and the team needs to solve it.
• Learning (future work): Experts study together to expand their knowledge.
NASA Flight Control Team

Flight controllers are highly trained experts, all on one team. Their job is to monitor the crew and the Shuttle subsystems via their consoles.
The flight director is the decision lead of the team. The team is grouped according to engineering and scientific disciplines. They communicate on voice loops.
They solve problems together; no one simply dictates instructions.
Foundational Work: Distributed Supervisory Coordination

[Diagram: human supervisory controllers, each connected through human–system interfaces to the engineering system being controlled, coordinate through human–human communication interfaces under a human supervisory coordinator. Annotations: situation awareness, dimensions of expertise, group performance, information flow, and group decision processes ("Solving").]

The Asking and Sharing modules have been modeled previously. This work models the Solving module.
Problem Statement: Analysis of Alternatives

How do we guide the design of an MCC? An offline research MCC would be prohibitively expensive and would still take a long time. Trial and error takes years, is very risky, and is limited to only slight variations. A model & simulation is cheap, fast, and safe.

[Table: analysis-of-alternatives matrix. Figures of merit, each with a weight (total weight 21): Extensibility, Sustainability, Adaptability, Effectiveness, Robustness, Safety. Alternatives: Status Quo MCC, Skeleton MCC, Smaller MCC (-10), Larger MCC (+10), Unmonitored Sleep, Mobile Team. The cells are "?": the scores the model and simulation are meant to supply. A weighted-sum sketch follows.]
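Once the simulation supplies per-figure-of-merit scores, each alternative reduces to a weighted sum. A minimal sketch in Python; the figures of merit come from the slide, but the weights and scores below are illustrative placeholders, not values from the paper:

```python
# Minimal sketch of analysis-of-alternatives weighted-sum scoring.
FIGURES_OF_MERIT = ["Extensibility", "Sustainability", "Adaptability",
                    "Effectiveness", "Robustness", "Safety"]

def aoa_score(weights: dict[str, float], scores: dict[str, float]) -> float:
    """Weighted sum over the figures of merit for one alternative."""
    return sum(weights[fom] * scores[fom] for fom in FIGURES_OF_MERIT)

# Hypothetical usage: compare two alternatives once the simulation
# has produced their per-FOM scores (all values here are placeholders).
weights = {fom: 1.0 for fom in FIGURES_OF_MERIT}
status_quo = {fom: 0.5 for fom in FIGURES_OF_MERIT}
smaller_mcc = {fom: 0.6 for fom in FIGURES_OF_MERIT}
print(aoa_score(weights, smaller_mcc) > aoa_score(weights, status_quo))
```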
Challenges for a Good Model: Integrating Human Factors and Simulation Disciplines

An effective model must include:
• Realistic examples of anomalies, taken from an actual spaceflight mission
• Believable and realistic dynamics of agents, based on the authors' experience and the research literature on "intellective" and decision-making tasks, permitting "non-rationality" (but "truth wins" convincing)
• Recognition of task demands and constraints: dynamic function allocation and mission safety criteria
• Stability and convergence of estimates of parameter and scenario performance: Monte Carlo analyses, 300-trial convergence (see the sketch after this list)
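One way to check the 300-trial convergence criterion is to watch the running mean of a scenario outcome stabilize across Monte Carlo trials. A minimal sketch, assuming a stand-in outcome function (the real outcome would be a full scenario run of the model):

```python
import random
import statistics

def run_trial(seed: int) -> float:
    """Stand-in for one scenario simulation; returns an outcome metric
    (e.g., time to first consensus). Replace with the real model run."""
    rng = random.Random(seed)
    return rng.lognormvariate(6.0, 1.0)  # illustrative heavy-tailed outcome

def running_means(n_trials: int = 300) -> list[float]:
    """Running mean of the outcome over successive Monte Carlo trials."""
    outcomes, means = [], []
    for seed in range(n_trials):
        outcomes.append(run_trial(seed))
        means.append(statistics.fmean(outcomes))
    return means

means = running_means(300)
# Convergence check: the running mean has stopped drifting by trial 300.
drift = abs(means[-1] - means[-30]) / means[-1]
print(f"relative drift over last 30 trials: {drift:.3%}")
```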
Telemetry and Anomalies

Telemetry is data from subsystems. Regular telemetry is nominal; anomalies are off-nominal telemetry, i.e., irregularities. Flight-control consoles display telemetry.
Anomalies may be evidence of underlying problems, or may be only a blip in a sensor or the comm link. Either way, anomalies are problems that the team must solve.
Decisions must be made by the team, all together, to keep the crew safe. The flight controllers cannot ignore any anomalies: the one they ignore may be the one that kills the crew.
Fault Detection, Isolation, and Recovery (FDIR)

The JPL Purdue Architecture Analysis (PAA) project introduced an FDIR model. Anomalies:
1. start appearing in telemetry at some mission elapsed time (MET)
2. disappear after some set lifetime
3. impart some penalty based on whether they are degrading

There is a process of anomaly detection, another process of anomaly isolation, and another process for anomaly resolution. The anomaly model and anomaly-resolution process here are based on this, as sketched below.
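A minimal sketch of this anomaly lifecycle (appear at some MET, expire after a set lifetime, carry a degrading flag). The names are illustrative, not taken from the PAA code:

```python
from dataclasses import dataclass

@dataclass
class Anomaly:
    """One anomaly in the PAA-style lifecycle: appears at a mission
    elapsed time, expires after a set lifetime, and imparts a penalty
    if it is degrading."""
    met_start: float       # mission elapsed time when it appears (s)
    lifetime: float        # how long it persists if unresolved (s)
    degrading: bool        # degrading anomalies impart a penalty
    resolved: bool = False

    def active(self, met: float) -> bool:
        """True while the anomaly is visible in telemetry."""
        return (not self.resolved and
                self.met_start <= met < self.met_start + self.lifetime)

# Hypothetical usage: a degrading anomaly appearing on MET day 3.
a = Anomaly(met_start=3 * 86400, lifetime=2 * 3600, degrading=True)
print(a.active(met=3 * 86400 + 60))  # True: one minute after onset
```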
Working the Problems, When They Arise

Anomalies are evidence of problems; they could be anything from nonissues to catastrophes. The team monitors for faults continuously: NASA calls problems faults.
The FDIR flow:
• Fault Detection requires situation awareness.
• Fault Isolation requires hypotheses.
• Fault Recovery requires a solution.
Fault Tolerance: some on-orbit systems may not tolerate any downtime. The MCC works to maximize reliability and availability. A sketch of this pipeline follows.
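A minimal sketch of the detection-isolation-recovery flow as a staged progression. The stage names follow the slide; the prerequisite flags are illustrative stubs:

```python
from enum import Enum, auto

class FdirStage(Enum):
    MONITORING = auto()
    DETECTED = auto()    # reached via situation awareness
    ISOLATED = auto()    # reached via a confirmed hypothesis
    RECOVERED = auto()   # reached via an applied solution

def step(stage: FdirStage, *, anomaly_seen: bool,
         hypothesis_confirmed: bool, solution_applied: bool) -> FdirStage:
    """Advance one FDIR stage whenever its prerequisite is met."""
    if stage is FdirStage.MONITORING and anomaly_seen:
        return FdirStage.DETECTED
    if stage is FdirStage.DETECTED and hypothesis_confirmed:
        return FdirStage.ISOLATED
    if stage is FdirStage.ISOLATED and solution_applied:
        return FdirStage.RECOVERED
    return stage
```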
Anomaly Model

All anomalies expire after some lifetime. Anomalies have a severity and may or may not be degrading.

Severity:
• Minor (no effect): minor anomalies do not penalize the score.
• Impairing: impairing anomalies penalize the score.
• Critical (worst effect): critical anomalies make the score zero.

Some anomalies affect objectives. Non-degrading anomalies do not block objectives; degrading anomalies block all objectives and must expire or be resolved to unblock them.

Resolving Anomalies: agents are not privy to anomaly details. Agents resolve anomalies through hypotheses, as sketched below.
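A minimal sketch of the severity and degrading rules above. The penalty magnitude for the impairing case is an assumption, since the slide does not give one:

```python
from enum import Enum

class Severity(Enum):
    MINOR = "minor"          # no effect on the score
    IMPAIRING = "impairing"  # penalizes the score
    CRITICAL = "critical"    # worst effect: score goes to zero

def apply_penalty(score: float, severity: Severity) -> float:
    """Apply the anomaly model's severity rule to a mission score."""
    if severity is Severity.MINOR:
        return score
    if severity is Severity.IMPAIRING:
        return score * 0.5   # illustrative penalty; magnitude not in the slide
    return 0.0               # CRITICAL makes the score zero

def objectives_blocked(active_degrading_anomalies: int) -> bool:
    """Degrading anomalies block all objectives until expired or resolved."""
    return active_degrading_anomalies > 0
```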
Agent Model: Agent-based Modeling

Agents represent flight controllers; they are the dynamic model elements.
Character stats simplify complexity through all-inclusive attributes. Agents have four of the six dimensions of expertise:
• Subject-matter expertise
• Communication expertise
• Situational-context expertise
• Expert-identification expertise
Agents have two more character stats:
• Saturation: total demand of simultaneous tasks
• Independence: autonomy to do their own tasks without discussing them with others
A minimal sketch of such an agent follows.
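A minimal sketch of an agent carrying the character stats named above. The field names and the effectiveness formula are illustrative assumptions, not the model's actual equations:

```python
from dataclasses import dataclass

@dataclass
class Agent:
    """A flight-controller agent with the character stats above.
    The expertise stats and independence are fixed for a run;
    saturation varies with the agent's current task load."""
    subject_matter: float         # subject-matter expertise
    communication: float          # communication expertise
    situational_context: float    # situational-context expertise
    expert_identification: float  # expert-identification expertise
    independence: float           # works without discussing tasks with others
    saturation: float = 0.0       # total demand of simultaneous tasks (dynamic)

    def effectiveness(self) -> float:
        """Illustrative effectiveness: mean expertise discounted by saturation."""
        expertise = (self.subject_matter + self.communication +
                     self.situational_context + self.expert_identification) / 4
        return expertise * max(0.0, 1.0 - self.saturation)
```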
Multidisciplinary Modeling Framework

Agents each have four character stats; three are constant, while saturation is dynamic. These stats influence the actor's effectiveness, as do the information requirements for acquiring and maintaining situation awareness.
Actors communicate on voice loops (high-level messages); consoles deliver telemetry (low-level messages). Information sharing occurs via the loops.
Anomalies occur within loop domains, not across them. Degrading anomalies block progress on mission objectives.
To resolve an anomaly, the team must arrive at a consensus to select the best hypothesis; a sketch of that consensus step follows.
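A minimal sketch of the consensus step: agents on one voice loop each back a hypothesis, and the loop selects the best one only when everyone agrees. The unanimity rule and the controller call signs are illustrative; the model's actual convincing dynamics ("truth wins") are richer:

```python
from collections import Counter
from typing import Optional

def loop_consensus(endorsements: dict[str, str]) -> Optional[str]:
    """Agents on one voice loop each endorse a hypothesis
    (agent name -> hypothesis id). Returns the consensus
    hypothesis if the whole loop agrees, else None."""
    tally = Counter(endorsements.values())
    hypothesis, votes = tally.most_common(1)[0]
    return hypothesis if votes == len(endorsements) else None

# Hypothetical usage with illustrative controller call signs.
print(loop_consensus({"EECOM": "h2", "EGIL": "h2", "INCO": "h2"}))  # "h2"
print(loop_consensus({"EECOM": "h2", "EGIL": "h1", "INCO": "h2"}))  # None
```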
The Baseline: STS-135 Timeline (Scenario 0)

[Timeline over mission elapsed time (MET), days 0-14; legend: scenario objective, anomaly, degrading anomaly.]

Objectives: check heat shield (#0); docking with ISS & leak checks (#1); install Raffaello module to Harmony (#2); inventory swap with ISS (#3); spacewalk (#4); microbial air samples (#5); uninstalling Raffaello from ISS (#6); undocking from ISS (#7); photographing the ISS (#8); deploying picosat (#9).
Anomalies: Atlantis' GPC3 failed (#0); orbital debris tracking (#1); toilet problem (#2); Atlantis' GPC4 failed due to single-event upset (#3); bypass planned communications outage to fix GPC4 (#4); broken LiOH latches (#5).

These objectives and anomalies were taken from the mission status reports, with times estimated from context.
Discussion: Scenario 0 Convergence and Anomaly Resolution

Outcomes measured across 300 Monte Carlo trials: time to first consensus, and time to resolution (after first consensus). There were big differences between complex and simple anomalies.

[Bar chart: number of trials (out of 300) in which each anomaly (#0-#5) was resolved vs. expired; per-anomaly resolution counts include 297, 277, 273, 186, and 174 trials, with the complements (3, 23, 27, 114, and 126) expired.]
[Log-scale charts: "Resolved Consensus Times by Anomaly Types" (consensus time in seconds vs. sorted trials, anomalies 0-5) and "Resolution Times by Team" (average resolution time in seconds vs. sorted trials; series: Scenario 0 Complex, Scenario 0 DYN, Scenario 0 MECH).]
Scenarios

• Scenario 0 (baseline): probability of correct hypothesis 0.2; team sizes 3, 13, 11; Level 2 integration; probability of hypothesis competition 0.1.
• Scenario 1: probability of correct hypothesis increased to 0.8.
• Scenario 2: team sizes decreased to 3, 3, 3.
• Scenario 3: team sizes increased to 13, 23, 21.
• Scenario 4: integration increased to Level 4.
• Scenario 5: integration increased to Level 4; hypothesis competition increased to 0.5.
• Scenario 6: team sizes decreased to 3, 9, 7; integration increased to Level 4.
• Scenario 7: team sizes decreased to 3, 9, 7; integration increased to Level 4; hypothesis competition increased to 0.5.
• Scenario 8: team sizes decreased to 3, 9, 7.

All other parameters in Scenarios 1-8 remain at baseline values. (See the configuration sketch below.)
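Each scenario is a small set of departures from the baseline, which maps naturally onto configuration overrides. A minimal sketch; the parameter values come from the table above, while the key names are invented for illustration:

```python
# Baseline (Scenario 0) parameters, from the scenarios table.
BASELINE = {
    "p_correct_hypothesis": 0.2,
    "team_sizes": (3, 13, 11),
    "integration_level": 2,
    "p_hypothesis_competition": 0.1,
}

# Each scenario overrides only the parameters it changes.
OVERRIDES = {
    1: {"p_correct_hypothesis": 0.8},
    2: {"team_sizes": (3, 3, 3)},
    3: {"team_sizes": (13, 23, 21)},
    4: {"integration_level": 4},
    5: {"integration_level": 4, "p_hypothesis_competition": 0.5},
    6: {"team_sizes": (3, 9, 7), "integration_level": 4},
    7: {"team_sizes": (3, 9, 7), "integration_level": 4,
        "p_hypothesis_competition": 0.5},
    8: {"team_sizes": (3, 9, 7)},
}

def scenario(n: int) -> dict:
    """Full parameter set for scenario n (0 is the baseline)."""
    return {**BASELINE, **OVERRIDES.get(n, {})}

print(scenario(7)["team_sizes"])  # (3, 9, 7)
```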
Variability: Model Parameter

Changing a model parameter to see how the results vary. Increasing the probability of correct hypotheses (Scenario 1) results in a decrease in the time to first consensus for simple anomalies and no change for complex anomalies.

[Log-scale chart: "Scenario 1 vs. Scenario 0 in Consensus Time"; consensus time in seconds vs. sorted trials; series: Scenario 1 Complex, Scenario 1 Simple, Scenario 0 Complex, Scenario 0 Simple.]
Variability: Scenario Parameter

Changing a scenario parameter to see how the results vary. Decreasing the teams' sizes to three members each (Scenario 2):
• does not affect consensus time,
• increases the speed at which the teams get through the anomaly-resolution process for simple anomalies, and
• prevents the teams from reaching consensus for complex anomalies.

[Log-scale chart: "Resolution Times in Scenarios 2 and 0"; average time to resolution in seconds vs. sorted trials; series: Scenario 2 DYN, Scenario 2 MECH, Scenario 0 DYN, Scenario 0 MECH.]

Increasing the teams' sizes by ten members each (Scenario 3) increases the number of anomalies that expire unresolved.

[Bar chart: "Expired Anomalies in Scenarios 3 and 0"; number of trials (out of 300) in which each anomaly (#0-#5) expired, Scenario 3 vs. Scenario 0; Scenario 3 reaches 299 expirations for one anomaly.]
Analysis of Alternatives Results

Effectiveness at resolving anomalies, by alternative (columns: increased level of integration on consoles; reduced team size; increased probability of hypothesis competition):
• Scenario 0 (Level 2 integration; teams 3, 13, 11; hypothesis competition 0.1): Good
• Scenario 4 (integration increased to Level 4): Almost As Good
• Scenario 5 (integration increased to Level 4; hypothesis competition increased to 0.5): Pretty Bad
• Scenario 6 (integration increased to Level 4; teams reduced to 3, 9, 7): Good
• Scenario 7 (integration increased to Level 4; teams reduced to 3, 9, 7; hypothesis competition increased to 0.5): Pretty Bad
• Scenario 8 (teams reduced to 3, 9, 7): The Best
Analysis of Alternatives Results: Anomalies Resolved

[Bar chart: number of trials (out of 300) in which each anomaly (#0-#5) was resolved, for Scenarios 0, 4, 5, 6, 7, and 8.]
Sample Unexpected Insight: Cliffs in Consensus Time

Cliffs appear in the distribution of consensus time, in both simple and complex anomalies. These are consensuses that the team reaches hours or days after their first attempts fail, long after the anomaly disappears from their telemetry. This is not rational agent behavior!
These are retrospective consensuses: the team revisits old problems to try again to solve them. Small mutations in hypotheses occur at a steady rate; eventually they lead to breakthroughs. (A sketch of such hypothesis mutation follows.)

[Log-scale charts: "AoA: Scenarios 4 vs. 0 in Consensus Time" (series: Scenario 4 Complex, Scenario 4 Simple, Scenario 0 Complex, Scenario 0 Simple) and "AoA: Scenarios 7, 6, & 0 in Consensus Time" (series: Scenario 7 Complex, Scenario 6 Complex, Scenario 0 Complex); average consensus time in seconds vs. sorted trials.]
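One way to read the cliff behavior is as a mutation process: each revisit slightly perturbs a stored hypothesis, and a rare lucky run of mutations finally convinces the team long after the anomaly expired. A minimal sketch of that reading; the mutation and acceptance rules here are illustrative assumptions, not the model's actual dynamics:

```python
import random

def revisits_until_breakthrough(quality: float, threshold: float = 0.9,
                                step: float = 0.02, seed: int = 0) -> int:
    """Repeatedly mutate a stored hypothesis's quality by small random
    steps until it crosses the convincing threshold; returns the number
    of revisits. Long waits correspond to the late, retrospective
    consensuses that show up as cliffs in the charts."""
    rng = random.Random(seed)
    revisits = 0
    while quality < threshold:
        quality += rng.uniform(-step, step)  # small mutation each revisit
        quality = max(0.0, min(1.0, quality))
        revisits += 1
    return revisits

# Hypothetical usage: a hypothesis starting well below threshold can take
# very many revisits, producing a heavy right tail in consensus time.
print(revisits_until_breakthrough(quality=0.5))
```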
Some Conclusions

• Cheaper than Building New Control Rooms, but Not Trivial
• Effective Analysis of Alternatives Planning and Design Requires Plausible Agents, Plausible Scenarios, Plausible Mission and Task Dynamics
• Such Multidisciplinary Integration Supports Novel Links of "Simulation-informed Human Factors" and "Human Factors-informed Simulation"
• Agent Models Demonstrate Learning and "Retrospective" Behaviors Suggesting Memory and Integration: Are Such Findings Reliable? Useful?
Acknowledgements and Contact

Material from J. Onken's dissertation, "Modeling Real-time Coordination of Distributed Expertise and Event Response in NASA Mission Control" (2012). Dr. Onken is now at Northrop Grumman; both Prof. Caldwell and Dr. Onken have been supported by NASA. However, this work was not supported by NG and does not reflect any official positions of NG, NASA, or others.

http://www.grouperlab.org
Barrett Caldwell: [email protected]
Jeffrey Onken: [email protected]