
Aviation Human Factors Division
Institute of Aviation
University of Illinois at Urbana-Champaign
1 Airport Road
Savoy, Illinois 61874

Imperfect Conflict Alerting Systems for the Cockpit Display of Traffic Information

Xidong Xu, Christopher D. Wickens, and Esa Rantanen

Technical Report AHFD-04-8/NASA-03-2
June 2004

Prepared for NASA Ames Research Center, Moffett Field, CA
Contract NASA NAG 2-1535

Abstract

This experiment assessed pilots' abilities to use an imperfect (partially reliable) alerting system to detect conflicts on a cockpit display of traffic information (CDTI). An 83% reliability level of automated conflict detection was chosen to simulate the sort of unreliability that might be characteristic of such a system when predicting conflicts with longer look-ahead times in a probabilistic airspace, subject to turbulence, uncertainty, and future pilot control actions between the time of alert and the time of closest approach with the conflicting traffic. Twenty-four licensed pilots viewed a series of dynamic encounters on a 2D CDTI that varied widely in their difficulty, as influenced by lateral conflict geometry (conflict angle, speed, and distance and time until closest approach). Pilots were asked to estimate the point and time of closest approach at varying times before that point was reached. A three-level alert system provided a correct categorical estimate of the projected miss distance on 83% of the trials. The remaining 17% of alerts were equally divided between automation misses and false alarms, of large and small magnitude. The data of these pilots were compared with those of a matched sample of "baseline" pilots who viewed identical trials without the aid of automated alerts. The results revealed that roughly half the pilots depended on automation and used it to improve their performance (accurate estimation of miss distance) relative to the baseline pilots who did not receive the automation. Those pilots who depended on automation did so more on the more difficult traffic trials, and were able to improve their performance on the 83% of trials when automation was correct, without incurring harm (relative to the non-automated group) on the 17% of automation-error trials. The presence of automated alerts appeared to lead pilots to inspect the raw data more closely. Pilots did not appear to be differentially hurt by automation misses versus false alerts. There was some evidence that the presence of automation, while assisting the accurate prediction of miss distance, led to an underestimation of the time remaining until the point of closest approach. The results point to the benefits of even imperfect automation for the strategic alerts characteristic of the CDTI, at least as long as this reliability remains high (above 80%).


1. INTRODUCTION

1.1 Overview

In Xu, Wickens, and Rantanen (2004), which is attached as Appendix A, we examined the parameters of lateral aircraft traffic encounters that imposed challenges on pilots' ability to detect conflicts (loss of separation) using a cockpit display of traffic information and to understand the spatial-temporal properties of those conflicts. It was revealed that longer distances, slower speeds, and greater time until the point of closest approach was reached all increased the difficulty of conflict understanding. Pilots also exhibited optimal risk-aversive behavior. In the current report, we describe a follow-on experiment in which pilots' conflict detection and understanding was assessed on a subset of geometries that had been documented as "easy" and "hard," when the pilots were now aided by imperfect conflict alerts. This alerting automation tool is imperfect, as would be expected in real-world application, simply because of the challenges of predicting future trajectories in an uncertain airspace.

1.2 Automation in Conflict Detection: an Overview

Because of humans' limited spatial working memory capacity and vulnerability in predicting an aircraft's future status, automation should have a very important role to play in supporting the human operator in the prediction task involved in conflict detection (Wickens, Mavor, Parasuraman, & McGee, 1998). In fact, automation has been used extensively for conflict detection both in the cockpit and in ATC. Typically, automated conflict detection involves the use of predictor algorithms providing automatic prediction of conflicts over a certain temporal span, indicating which aircraft will constitute a conflict and when and where it will happen. Many studies on conflict detection involved automated conflict detection tools that provide a dichotomous "conflict" or "no conflict" prediction (e.g., Metzger & Parasuraman, 2001), but some provided information on the three critical variables underlying conflict understanding (Xu et al., 2004): the miss distance (MD) at the closest point of approach (CPA), the orientation of the target to ownship at the CPA (OCPA), and the time remaining until that CPA is reached (TCPA). These parameters are shown graphically in Figure 1.1 (Merwin & Wickens, 1996; Wickens, Gempler, & Morphew, 2000; Wickens, Helleberg, & Xu, 2002). It should be noted, however, that the information regarding the MD, OCPA, and TCPA in those studies was available to assist the pilot in avoiding traffic, but it was not part of the pilot's task to estimate the values of these parameters, and data on pilots' estimation of these parameters were not collected or analyzed.


Figure 1.1. [Schematic of a traffic encounter showing the miss distance (MD) at the closest point of approach (CPA), the orientation of the target relative to ownship at the CPA (OCPA), and the time remaining until the CPA is reached (TCPA).]
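To make these three parameters concrete, the following minimal sketch (ours, not drawn from the original studies) computes them for two aircraft on straight, constant-velocity, co-altitude tracks, which is the geometry used throughout this report; the coordinate frame and the clockwise-bearing convention for OCPA are illustrative assumptions.

import math

def conflict_parameters(own_pos, own_vel, intr_pos, intr_vel):
    # Relative position and velocity of the intruder with respect to ownship
    # (positions in nm, velocities in nm per minute, x = east, y = north).
    rx, ry = intr_pos[0] - own_pos[0], intr_pos[1] - own_pos[1]
    vx, vy = intr_vel[0] - own_vel[0], intr_vel[1] - own_vel[1]

    # TCPA minimizes the separation |r + v*t|; clamp at zero if already diverging.
    v2 = vx * vx + vy * vy
    tcpa = 0.0 if v2 == 0 else max(0.0, -(rx * vx + ry * vy) / v2)

    # Relative position at the closest point of approach gives the miss distance.
    cx, cy = rx + vx * tcpa, ry + vy * tcpa
    md = math.hypot(cx, cy)

    # OCPA is taken here as the intruder's bearing at the CPA relative to
    # ownship's track, in degrees clockwise (an assumed convention).
    ocpa = math.degrees(math.atan2(cx, cy) - math.atan2(own_vel[0], own_vel[1])) % 360.0
    return md, tcpa, ocpa

# Example: intruder 8 nm to the east, crossing ownship's northbound track at 90 degrees.
md, tcpa, ocpa = conflict_parameters((0, 0), (0, 4), (8, 0), (-4, 0))
print(f"MD = {md:.2f} nm, TCPA = {tcpa:.1f} min, OCPA = {ocpa:.0f} deg")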

The form of automation we examine here—an example of stage 2 (information analysis and inference) automation characterized by cueing and alerting (Parasuraman, Sheridan, & Wickens, 2000)—is one that is rarely 100% reliable: Misses and false alarms will occur, and it is the implications of this unreliability that are of interest to us.

1.3 Uncertainty of Flight Paths and Automation Reliability

Most of the studies investigating automated conflict detection in aviation have treated the future paths of aircraft as deterministic, with the exceptions of Metzger and Parasuraman (2001) and Wickens et al. (2000); thus, automated conflict detection is also performed in a deterministic fashion. However, future flight paths are inherently uncertain due to a number of factors in the flight environment, making perfect predictions impossible (Kuchar, 2001; Thomas, Wickens, & Rantanen, 2003). Examples of these factors include wind shifts, air turbulence, pilots' intentions to change flight plans, and the prediction time span (i.e., how far into the future the prediction is made). The effect of future uncertainty is depicted in Figure 1.2, which shows the distance between ownship and intruder traffic flying at the same altitude, on converging but not colliding courses and at constant speeds, as a parabolic function of time. The minimum miss distance (MD) at the time of the closest point of approach (TCPA) is shown in the figure.


[Figure 1.2 graph: distance between aircraft plotted against time, showing best-case, nominal risk function, and worst-case curves; the 5-mile and 3-mile separation values, the time of separation loss (TL), and the TCPA are marked.]

Figure 1.2. Conflict risk (distance between ownship and intruder traffic) as a function of time and uncertainty thereof due to changes in flight environment. Adapted from Rantanen, Wickens, Xu, & Thomas (2003) and Thomas, Wickens, & Rantanen (2003).
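The widening envelope in Figure 1.2 can be sketched numerically. The following illustrative code (our own construction, with made-up numbers) propagates the nominal separation forward in time and brackets it with plus or minus two standard deviations, treating the positional uncertainty as growing linearly with the prediction span; this corresponds roughly to the 95% interval discussed below.

import math

def separation_envelope(r0, v_rel, sigma0=0.1, sigma_growth=0.3, t_max=12.0, dt=2.0):
    """Nominal distance between two aircraft over time plus best/worst-case bounds.

    r0: current relative position (nm); v_rel: relative velocity (nm/min);
    sigma0, sigma_growth: positional uncertainty now and its growth per minute (nm).
    All numbers are illustrative, not taken from the experiment.
    """
    rows = []
    t = 0.0
    while t <= t_max:
        nominal = math.hypot(r0[0] + v_rel[0] * t, r0[1] + v_rel[1] * t)
        sigma = sigma0 + sigma_growth * t           # uncertainty grows with prediction span
        best = nominal + 2.0 * sigma                # safer than nominal (larger separation)
        worst = max(0.0, nominal - 2.0 * sigma)     # riskier than nominal (smaller separation)
        rows.append((t, nominal, best, worst))
        t += dt
    return rows

# A converging, non-colliding geometry like Figure 1.2: the nominal MD stays outside
# a 5-mile protected zone, but the worst-case bound crosses it well before TCPA.
for t, nom, best, worst in separation_envelope((30.0, 6.0), (-3.0, 0.0)):
    print(f"t = {t:4.1f} min   nominal = {nom:5.1f}   best = {best:5.1f}   worst = {worst:5.1f}")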

Figure 1.2 also presents the growing uncertainty of the future distance as it might be predicted at time 0, due to changes in the flight environment. The nominal or deterministic risk function shows that the MD will be greater than the 5-mile radius of the protected zone (i.e., a nonconflict) if there were no uncertainty as to the positions of the two aircraft. However, unpredictable changes in the environment will cause the aircraft's positions to be nondeterministic, presumably with a normal distribution around the nominal positions of the aircraft (Magill, 1997). This distribution is represented by "best case" and "worst case" boundaries, which could be thought of as representing a 95% confidence interval of a normal distribution. Moreover, the degree of uncertainty is amplified as a function of time into the future, with the best case being safer (greater MD) and the worst case being riskier (shorter MD) than the nominal situation. Note that the worst case in Figure 1.2 assumes that there will be a loss of separation occurring at time TL (time of separation loss) for a 5-mile radius protected zone. These uncertainty factors essentially constitute the various sources of unreliable or imperfect alert automation in this context, leading to two types of errors: automation misses (conflicts are not predicted) and automation false alarms (safe separations are treated as conflicts) (Parasuraman, Hancock, & Olofinboba, 1997; Wickens & Xu, 2002). The same serious safety consequences as those outlined at the beginning of this chapter may result if pilots over-trust and/or over-rely on these erroneous outcomes of the automation. However, due to the adaptive nature of human behavior, both high false alarm rates and high miss rates may cause operators to mistrust the system, which may in turn cause under-reliance and even disuse of the system (Dixon & Wickens, 2004; Parasuraman & Riley, 1997; Wickens & Xu, 2002), although
the patterns of attention allocation behavior may be somewhat different as induced by the two types of automation errors (Dixon & Wickens, 2004; Meyer, 2004). Further, because of the growing level of uncertainty with the passage of time as shown in Figure 1.2, the magnitude of these errors is a function of prediction span: Both the miss and false alarm rates increase as the prediction span increases (Wickens, Rantanen, Thomas, & Xu, 2004; see Appendix B). In addition to the growing uncertainty and automation error with increasing prediction time span (favoring shorter spans), there is a second factor which trades off with prediction span, which is the time available for traffic avoidance maneuver (Kuchar, 2001). For example, the more accurate prediction associated with shorter prediction span leaves shorter time for maneuvering to avoid a conflict if one is present. Thus, it is important to consider the effects of unreliability of CDTI-based automation of longer prediction spans, where errors of prediction are inevitable (i.e., the automation is unreliable or imperfect), and the trade-off must be made by the automation designer between misses and false alarms (Dixon & Wickens, 2004). A third factor influencing this situation is the base rate of the events to be detected – the conflicts. If this base rate is quite low (conflicts are rare), the alert threshold will need to be set at a level that produces many false alarms, in order to avoid a high miss rate (Parasuraman et al., 1997; Wickens et al., 2004). 1.4 Effect of Automation Reliability on Performance An important issue is how reliability of an automated alert system influences performance of the human and automation working together as a system. Automation reliability can be specified by the number of correct operations divided by the total number of operations in which a task is automated. Since it is almost impossible for an automated alert system to be perfectly reliable for reasons of future uncertainty described above, reliability is typically less than 1.0. The general finding is that correctly or accurately functioning automation improves performance, relative to non-automated, or manual performance (e.g., Dixon & Wickens, 2003; Yeh & Wickens, 2001). Automation benefits may not always be present especially when the task being automated is easy when manually performed (Rovira & Parasuraman, 2002), whereas the benefits are found when the task is more difficult in its manual (non-automated) form (e.g., Dixon & Wickens, 2003, 2004; Galster et al., 2001; Maltz & Shinar, 2003; Yeh, Merlo, Wickens, & Brandenburg, 2003). On the other hand, costs of inaccurate automation may be larger for the difficult task than for the easy task (Dixon & Wickens, 2003, 2004; Maltz & Shinar, 2003; Wickens et al., 2000). This issue will be of particular importance in the present experiment. These costs and benefits can easily be interpreted with the mediating concept of automation dependence. As a task becomes more difficult in its unaided (manual) form, users become more dependent on automation to assist them (to improve performance and/or to reduce effort), a dependence that will provide greater benefits when the automation is correct, but greater costs when it fails due to reduced situation awareness and/or skill degradation resulting from complacency (Parasuraman, Sheridan, & Wickens, 2000; Wickens & Hollands, 2000). 
(Note: we prefer the term “dependence” over the oft-used term “reliance” because reliance is a term that refers to a specific type of dependence on automated alerting systems (Meyer, 2004).) At the time the operator experiences the first automation failure in a previously perfect system, large performance decrements (performance level often lower than manual level) due to
complacency have been observed (e.g., Metzger & Parasuraman, 2001; Molloy & Parasuraman, 1996; Yeh & Wickens, 2001; Young & Stanton, 1997; Wickens, 2000). It is also clear that performance may still suffer on failure trials following the first failure as a consequence of some continued dependence upon the automation even if it is now known to be imperfect (Molloy & Parasuraman, 1996; Wickens et al., 2000; Yeh et al., 2003). Typically these subsequent automation failures produce smaller costs than the "first failure" trials (Yeh et al., 2003). Of equal importance is to examine the effect of imperfect automation on subsequent trials where the automation is accurate or correct. Under certain circumstances, performance on the "automation correct" trials also suffers, as compared to trials of perfect automation prior to the first failure (Yeh & Wickens, 2001). This phenomenon can be explained by a decreased level of dependence on automation, resulting in either manually performing the task (hence reducing performance if performance with perfect automation is higher than manual performance) or the use of a sub-optimal strategy even if some automation use is retained. Relative to manual performance, the results on those "correct" trials are mixed: Some studies indicate retained benefits over manual performance (Yeh et al., 2003; Kantowitz, Hanowski, & Kantowitz, 1997; Galster et al., 2001), some suggest no such benefits (Yeh & Wickens, 2001), and still others show continuing costs (Dzindolet, Pierce, Beck, & Dawe, 1999), particularly if the overall reliability is low (Dixon & Wickens, 2004). Based on a literature review of the effects of automation reliability on performance, Wickens and Xu (2002) conclude that correct automation improves performance in relation to manual performance, and that automation failures, especially the first failure, negatively impact performance. Inconsistent findings or qualifications regarding how automation reliability influences performance, however, have nonetheless been observed. When it comes to overall performance under imperfect automation assistance (across correct and incorrect automation conditions), a certain reliability threshold or "cutoff point" seems to be needed in order for it to show benefits compared to unassisted or manual performance. However, there is inconsistency as to the value of this threshold in the literature. Table 1.1 summarizes the findings of the benefits and costs for automation-assisted performance in relation to manual performance associated with different automation reliability. Note that all the studies summarized in Table 1.1 involved single-task performance, except those of Dixon and Wickens (2003, 2004) and Rovira, Zinni, and Parasuraman (2002), which involved multiple-task performance. Dixon and Wickens (2004) noted that the reason why performance assisted by automation below a "cutoff point" reliability is worse than manual performance may be attributable to the strategy of resource allocation between concurrent tasks under high workload. Under high workload, it is possible that the operator may depend on the imperfect automation for one task out of necessity, even though he or she does not fully trust it, such that more resources can be allocated to another task. This strategy will degrade performance of the automated task even as it enhances concurrent task performance, and the findings in Dixon and Wickens (2003) and Rovira et al. (2002) were consistent with this interpretation.
One general trend that emerges from Table 1.1, which is important to the structuring of the current experiment, is that when the task is difficult and automation reliability is higher than 80%, benefits, not costs (excluding "first failure" effects), are almost always obtained.


Table 1.1. Benefits and Costs for Automation-Assisted Performance in Relation to Unaided Manual Performance with Different Automation Reliability (Measurement of Costs Excludes "First Failure" Effects)

Study                        Automation reliability   Benefits/costs relative to manual performance
Ben-Yaacov et al. (2002)     60%                      Benefits
Dixon and Wickens (2003)     67%                      Benefits for concurrent task but costs for automated task
Dixon and Wickens (2004)     80%                      Benefits
                             60%                      Costs
Dzindolet et al. (1999)      70%                      No costs
Galster et al. (2001)        67%                      Benefits
Kantowitz et al. (1997)      70%                      Benefits
Lehto et al. (2000)          60%                      Benefits
Maltz and Meyer (2003)       90%                      Benefits
Maltz and Shinar (2003)      90%                      Costs when the manual version of the task was relatively easy
                             60%-80%                  Costs
Rovira et al. (2002)         50%                      Benefits for concurrent tasks
Yeh et al. (2003)            70%                      Benefits
Yeh and Wickens (2001)       70%                      No benefits

Despite the importance of human response to the above-mentioned imperfect automation in aviation conflict detection, it appears that only two experiments have examined this issue, and each will now be described in some detail. An important study conducted by Metzger and Parasuraman (2001) had subjects perform a conflict detection task first in the manual condition, then in the reliable automation condition, followed by two unreliable automation conditions. Each condition consisted of one or two scenarios, each in turn containing two conflicts and three self-separations, where the aircraft would have created conflicts had they not initiated avoidance maneuvers, and subjects were required to detect them before the self-avoidance maneuvers started. The two unreliable automation conditions and the manual condition contained an additional failure event as will be described below. In the three automation conditions, a red circle would appear around two aircraft six min before the aircraft lost separation or would have lost it if the aircraft had not self-separated. In the manual condition, there would be a red circle only when separation had been lost. In the unreliable automation conditions, there was an automation failure in which one aircraft deviated from its planned flight path and was on a conflict course with another aircraft, a situation that was not detected by the automation routine, that is, an "automation miss." There was one such automation failure in each of the two unreliable automation conditions, generating the first and the second automation failures, respectively, and the controller still needed to detect the developing conflict before the actual loss of separation occurred. The manual condition contained a similar event to be compared with the unreliable conditions.


The automation (the conflict detection aid) improved controller’s performance on conflict detection and reduced controller mental workload relative to the manual condition when it was 100% reliable. However, when the automation was less than 100% reliable, controller performance in detecting the conflicts that the automation missed was worse than in the manual condition, as a result of complacency that had developed in all the perfect automation trials that preceded the failure events. The evidence of complacency was revealed by the eye movement finding that controllers who did not detect the failure had fewer fixations and shorter dwell times on the radar display in the automated than in the manual conditions, whereas controllers who did detect showed no difference. However, there was no difference in detection rate between the first and second failure events. In addition, controllers rated trust in the automation higher under the reliable than the unreliable automation, but no difference was shown between the two failure events. It is noted that no data were available to compare the controllers’ detection performance in the reliable and the unreliable automation conditions for those conflicts that were correctly indicated by the automation. These results are consistent with the general pattern that perfect automation is beneficial to performance, and a performance decrement occurs at the time of the first automation failure (following a period of perfectly functioning automation), along with a decrease in trust in the automation. Interestingly, the performance and trust in automation did not change between the times of the first and the second automation failures. The other important study of automated aircraft conflict alert was conducted by Wickens et al. (2000), who examined the effect of imperfect automation on pilots using a CDTI. In their second experiment, pilots were required to maintain pre-assigned flight path, to detect and avoid conflicts with other aircraft with the aid of a CDTI, which had an overall reliability of 83.3%. During the automation error trials, the traffic changed heading or vertical speed (climb or descent), but the conflict predictor line would continue to point in the direction based on the original flight parameters. It was found that the cost of erroneous prediction trials (time spent in a predicted loss of separation, as well as deviation from prescribed flight path) relative to correct prediction trials was least on trials where the conflict traffic was level (i.e., a 2-D traffic problem), and was greatest on trials with a descending traffic aircraft (i.e., a more difficult 3-D traffic problem). Another finding is that there was no difference between the trial immediately prior to an error trial and that immediately following the error trial regarding the safety measures (i.e., time spent in predicted and actual conflicts and flight path deviation), indicating that there was no temporary change in trust in the automation as a consequence of each failure. However, the authors did not compare the overall performance with the imperfect automation with the manual baseline condition, which was part of their first experiment, to see if there was an overall benefit or cost. Furthermore, the researchers did not investigate conflict detection per se. According to Wickens et al. 
(2000), these results suggest that pilots were more dependent on the predictor to assist them in conflict avoidance on the most difficult trials (descending traffic) than on the less difficult trials (level and ascending traffic) due to the greater complexity associated with the former. Consequently, when encountering the error trials, the greater dependence on, and presumably the more attention allocated to, the predictor on these trials led to the greater problems. The authors inferred that the problems were not caused by complacency, because pilots did not modulate their behavior before versus after an error trial, suggesting
instead a reasonably good calibration between the allocation of attention and the true level of reliability. The findings of Metzger and Parasuraman (2001) and Wickens et al. (2000) collectively show that inaccurate automation in conflict detection poses costs, which may be high on the first failure and may remain so if the interval between the first and the subsequent failures is long (Metzger & Parasuraman, 2001). However, the costs may not be so high if the operator can experience the failures with some regularity and can allocate attention well between the automation and the raw data of the traffic situation (Wickens et al., 2000). As noted, Wickens et al. and other studies also found that costs and benefits are more likely to emerge on difficult (vs. easy) task conditions, a circumstance likely to make people depend on automation.

1.5 Relative Costs of Automation False Alarms versus Automation Misses

As noted above, an automation false alarm occurs when the automation indicates an abnormal situation or a failure that does not exist in the world, or something that is more serious than it actually is. An example of the latter kind of automation false alarm in conflict detection is when an automation tool indicates to the pilot a miss distance (MD) shorter than its true value. An automation miss occurs when the automation fails to alert the operator of an event that does exist in the world or indicates it as a less serious event than it actually is. Again, in conflict detection, an automation miss would occur when the automated conflict detection tool alerts the pilot of an MD longer than its true value. Automation false alarms and misses have different performance and behavior consequences that Meyer (2001, 2004) has attributed to the loss of compliance and the loss of reliance, respectively, both subclasses of the loss of dependence. According to Meyer (2001, 2004), automation false alarm rate influences the extent to which the operator follows or complies with the automation alert of a failure, with higher false alarm rate correlated with lower compliance. The initial consequence of false alarms is that the operator will make unnecessary responses to the automation's advice, and the long-term consequence (in particular with a high false alarm rate) is that the operator will distrust the automation and not comply with the automation's advice (Meyer & Ballas, 1997), a phenomenon known as the "cry-wolf syndrome" (Breznitz, 1983) and observed with early versions of TCAS (Rantanen, Wickens, Xu, & Thomas, 2004). On the other hand, automation miss rate influences the extent to which the operator relies on the automation to detect a failure for him/her, with higher miss rate correlated with lower reliance. The short-term consequence of high reliance is that operators will themselves miss the rare failure that the automation also misses. The long-term problem of low reliance (resulting from a high automation miss rate) will be that more attention is allocated to the raw data such that less attention will be available for a concurrent task (Dixon & Wickens, 2004). Only a few studies have examined whether automation false alarms degrading compliance have a greater impact on performance than automation misses degrading reliance. An analysis of a safety database of civil and military aviation has shown that under certain circumstances automation false alarms have caused more accidents and incidents than have automation misses (Bliss, 2003).
Maltz and Shinar (2003) investigated the effects of automation false alarms and misses on compliance and reliance in a military setting. Subjects were to search for military targets that were cued with less than 100% reliability, and combining three miss rates
and three false alarm rates yielded nine reliability conditions. Maltz and Shinar found that as false alarm rate increased, target detection performance decreased and the level of compliance (i.e., following the cue) decreased too. However, increasing miss rate did not have a significant effect on performance and had less of an effect on the level of reliance on the cue. The authors explained that the reason for the different effects of false alarms and misses was possibly that the true percentage of correct cueing (hit rate or "1 – miss rate") was not known to the observers, whereas the false alarms were more apparent to them, hence more distracting and misleading. In two parallel studies, Dixon and Wickens (2003, 2004) examined the impact of varying the threshold of an automated failure alerting system on pilots controlling a simulated unmanned air vehicle (UAV). Consistent with the reliance-compliance distinction, they observed qualitatively different patterns of behavior from the false alarm-prone automation, compared to the miss-prone automation, with some evidence that the former had more overall disruption, but the latter imposed more cost on some concurrent tasks. When applied to traffic conflict detection systems, the distinction between false alarms and misses is not quite as clear as in the studies of Maltz and Shinar and Dixon and Wickens, because such systems are not likely to truly "miss" predicting a conflict; they will only be late in issuing an alert, a delayed alert that will allow the pilot less time to respond (Kuchar, 2001). The issue of false alarms versus delayed alerts has been examined in surface traffic (i.e., vehicle) conflict detection systems (e.g., Gupta, Bisantz, & Singh, 2001; Cotté, Meyer, & Coughlin, 2001), with the general conclusion that false-alarm-prone systems are more disruptive than late-alert systems. However, this issue has not been examined in air traffic conflict detection. To summarize, it seems to be the case that, overall, automation false alarms have a greater adverse effect on performance than automation misses. But, based on our review of the literature, no studies have investigated the relative impact of automation false alarms and automation misses in the context of conflict detection alerts in aviation.

1.6 Strategies of Automation Use: Automation Dependence

Maltz and Shinar (2003) describe several different philosophies or "styles" of human interaction with imperfect alerting or attention-guidance automation, three of which are relevant here. An automation dependent style is one in which users entirely depend on the automation, responding as it does. When the automation is correct (in saying "signal" or "no signal"), humans will also be correct. When automation is in error (through misses or false alarms), humans will also be in error. Hence human performance, as measured by a signal detection sensitivity parameter, will be neither better nor worse than automation performance. Stated another way, the human shifts the response criterion beta to be "risky" (always saying "yes") when the automation says yes, and conservative (always saying "no") when the automation says "no" (Maltz & Shinar, 2003; Meyer, 2001). An optimal style is exhibited when the human uses automation to its advantage when it is correct, but can ignore it when it is in error.
For example, an automation alert signaling a signal would lead the human to inspect the raw data more closely to ensure that there is in fact a signal present, hence improving sensitivity without causing a false alarm on those occasions when the
automation gave a false alarm. Adopting such a style, it is assumed that when the automation is silent (signaling no event), humans will rely on the automation, and hence have a miss rate equal to the automation miss rate. In signal detection theory terms, sensitivity is improved, perhaps relative to either the human alone or the automation alone, and the response criterion is little influenced by this style of interaction. A skeptical style is one in which the human user simply pays no attention to the automation, and hence performance is no different from unaided human performance, and may be worse than automation performance, if the latter is good. Maltz and Shinar's study of response to imperfect target cueing appeared to yield a mixture of styles. In another imperfect target cueing study, Yeh and Wickens (2001) distinguished two styles corresponding to the dependent and optimal styles of Maltz and Shinar (2003). The dependent (response bias) style triggered the user to respond in whatever way the automation target cueing did, and the optimal or "sensitivity" style triggered a closer inspection of the raw data underlying an automation target cue. Their results indicated that, following the first automation failure (a cued non-target), operators showed an extreme response bias, and then following subsequent failures, behavior tended to be more closely approximated by a mixture of skeptical and dependent observing, showing little evidence of improved sensitivity. Importantly, operators did not take advantage of the cue, even a partially (70%) reliable one, to help them inspect the raw data and to improve their overall target detection performance. An important aspect of the current study (in Experiment 2) will be to contrast these three styles of use of the automated conflict detection aid. In conclusion, the literature has accumulated a considerable body of research on the effects of automation reliability on performance (Wickens & Xu, 2002; Dixon & Wickens, 2004). Accurate automation improves performance, but inaccurate automation results in poorer performance than that with accurate automation, and often poorer than unaided performance, particularly if (a) the unaided task is easy or (b) the level of reliability is below around 75%. This has also been the case with automation in conflict detection. Again, no studies on conflict detection with automation have been found that have addressed MD, OCPA, and TCPA judgment accuracy. As noted, automation benefits emerge more often when the task being automated is difficult (Dixon & Wickens, 2003, 2004; Galster et al., 2001; Maltz & Shinar, 2003; Yeh et al., 2003) than when it is easy (Rovira & Parasuraman, 2002) in its manual (non-automated) form. It has also been observed that performance would suffer more for difficult than for easy tasks upon automation failures (Dixon & Wickens, 2003, 2004; Maltz & Shinar, 2003; Wickens et al., 2000). Finally, the findings in the literature seem to point to a larger adverse impact on performance by automation false alarms than by automation misses.

1.7 Dichotomous Versus Multi-Level Alert System

As mentioned briefly above, many conflict alert systems provide a dichotomous warning (conflict vs. no-conflict) (e.g., Metzger & Parasuraman, 2001). The Traffic Alert and Collision Avoidance System (TCAS), however, employs a three-level alert.
If intruder traffic is within 50 seconds of a loss of separation with ownship, a traffic advisory is issued in the form of a verbal warning, "Traffic Traffic," and the intruder icon changes to yellow on the display. If the traffic is within 25 seconds of loss of separation, the intruder icon turns red and a resolution
advisory is issued verbally, along with an indication of the required range of climb/descent rate to avoid the loss of separation (Ho & Burns, 2003). A multi-level alerting system or multi-level "likelihood alarm" (Sorkin & Woods, 1985) is a more refined or more accurate form of alerting than a two-level or dichotomous one. It is assumed that an imperfect multiple-level alerting system is less likely to produce "bad" errors and thus is more tolerable than an imperfect two-level system. Multiple-level alerting has been advocated over the two-level form (Rantanen et al., 2003; Sorkin, Kantowitz, & Kantowitz, 1988; Sorkin & Woods, 1985; Wickens, 2003). Surprisingly, few studies have actually compared the efficacy of multi-level alarms with dichotomous ones. Sorkin et al. (1988) observed better concurrent task performance when a multi-level likelihood alarm was employed, but found no improvement on the alerted task itself. St. Johns and Mannes (2002) did find an improvement on the alerting task, although their study was not one of conflict detection. We used a three-level alerting system in the current study. The three levels are defined by the degree of predicted risk of a pending encounter, as risk was operationally defined by the projected miss distance (MD) at the closest point of approach.

1.8 Overview of Experiment 2

The first goal of Experiment 2 was concerned with how an alerting system that predicted conflicts could alleviate the biases and the poor performance on difficult conflict problems (long time and distance, slow speed) found in Xu, Wickens, and Rantanen (2004; see Appendix A). The second goal focused on how the reliability of prediction automation influenced performance. Specifically, we investigated how correct and erroneous predictions of an imperfect (83% reliable) automation alert affected performance in relation to manual performance in Xu et al. (2004); how the effect of automation reliability was modulated by individual differences in automation dependence and by task difficulty; and finally whether automation error magnitude ("modest" vs. "bad") and automation error type (false alarms vs. misses) would each have an impact on performance.

2. METHOD

2.1 Subjects

Eight flight instructors and 16 certified non-instructor pilots (22 male and two female; ages ranging from 18 to 25 years, with a mean of 19.83 years) were recruited from the Institute of Aviation, University of Illinois at Urbana-Champaign. Each subject was paid $8/h for his/her participation.

2.2 Simulation and Display

The CDTI depicted ownship and intruder in a map (top-down) view (see Figure 2.1). The display represented ownship by a white triangle and the intruder by a solid circle in cyan. The ownship icon was positioned in the center of the display throughout the whole experiment, thus yielding an egocentric view of the traffic situation in which the ownship icon appeared to be stationary to the participant. The ownship and the intruder were flying at the same altitude on straight converging courses and at constant but not necessarily equal speeds. Participants individually observed the development of a conflict scenario for 15 sec, after which the scenario froze. They were then required to mentally extrapolate the development of the scenario, press a
key at the time when they estimated that the CPA was reached had the trajectory not been frozen, thereby providing a measure of TCPA estimation accuracy, and move the cursor to the location that they believed was the CPA, thus providing measures of MD and OCPA estimation accuracy. As in Xu et al. (2004; see Appendix A), conflict traffic proceeded from either the left or right side, could pass in front of or behind ownship, could proceed at overtaking (45°), crossing (90°), or approaching (135°) conflict angles, and could proceed at slow, medium, or fast relative speeds.

Figure 2.1. Schematic illustration of key components of the experimental paradigm and independent variables. The ownship icon was stationary to the participant.

Alerting automation was implemented as follows: at the start of a trial, a conflict predictor automatically provided a three-level MD alert. More specifically, the predictor did not alert the pilot of MD if its value was greater than 3.5 miles. It provided the low level of MD alert if MD was shorter than 3.5 miles but longer than 1.5 miles, and provided the high level of alert if MD was shorter than 1.5 miles. As shown in Table 2.1, the three levels of MD alert were indicated by different colors of the traffic icon as well as different verbal warnings. The traffic icon retained the color throughout a trial for a given level of MD alert, and the verbal warning was given once. We chose the three-level alert for two reasons. First, it is somewhat consistent with the alerting algorithm of the current Traffic Alert and Collision Avoidance System (TCAS), which has been discussed earlier in this chapter. Second, the three-level MD alerting system is a form of multi-level "likelihood" alert, and it is assumed to be more advantageous than a two-level alert system (Rantanen et al., 2003; Sorkin, Kantowitz, & Kantowitz, 1988; Sorkin & Woods, 1985; St. Johns & Manes, 2002; Wickens, 2003).

Table 2.1. Three-Level MD Alert

Alert level      Color of traffic icon   Verbal warning        MD (mile)
No alert         Cyan                    None                  > 3.5
Medium alert     Yellow                  "Traffic Traffic"     1.5 – 3.5
High alert       Red                     "Conflict Conflict"   < 1.5
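Expressed as code, the alerting rule in Table 2.1 is a simple threshold function over the predicted miss distance (a sketch of the logic only; the function name is ours, not the experiment software):

def md_alert_level(predicted_md_miles):
    """Three-level MD alert corresponding to Table 2.1."""
    if predicted_md_miles > 3.5:
        return "no alert"       # icon stays cyan, no verbal warning
    if predicted_md_miles > 1.5:
        return "medium alert"   # icon turns yellow, "Traffic Traffic"
    return "high alert"         # icon turns red, "Conflict Conflict"

print(md_alert_level(2.4))      # -> "medium alert"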


To simulate a less than perfectly reliable predictor, on some trials (one in every six trials) the automation provided an erroneous prediction of MD. Using the terms of signal detection theory, there were two general types of predictor errors here, one in which the predictor indicated an MD that was in a greater separation category than the actual value (an automation miss), and the other in which it indicated an MD in a smaller separation category than the actual value (an automation false alarm). As shown in Figure 2.2, half of the error trials were false alarms and the other half misses. Furthermore, these two error types each had two levels of magnitude (i.e., modest and bad misses, modest and bad false alarms). The notion of multiple levels of false alarms and misses is consistent with fuzzy signal detection theory (Parasuraman, Masalonis, & Hancock, 2000). According to this theory, both signal and response can be continuous variables (e.g., for a signal with a probability of .8, there can be .8 hit, .1 false alarm, 0 miss, and .1 correct rejection), whereas in traditional or "crisp" signal detection theory, both signal and response must be binary (e.g., for a non-signal, the response must be 0 hit and 0 miss, and either 0 false alarm/1 correct rejection or 1 false alarm/0 correct rejection).

                     MD predictor indication
True MD              No alert            Low alert            High alert
> 3.5 miles          Correct rejection   Modest false alarm   Bad false alarm
1.5 – 3.5 miles      Modest miss         Hit                  Modest false alarm
< 1.5 miles          Bad miss            Modest miss          Hit

Figure 2.2. Different outputs provided by the MD predictor.
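The classification in Figure 2.2 follows from comparing the indicated separation category with the true one: an indication more severe than the truth is a false alarm, a less severe one is a miss, and a two-category discrepancy is a "bad" rather than "modest" error. A minimal sketch of that logic (ours, using the categories above, not the experiment code):

# Alert categories ordered from most to least severe indicated risk
# (MD < 1.5 miles, 1.5-3.5 miles, > 3.5 miles, respectively).
LEVELS = ["high alert", "medium alert", "no alert"]

def classify_trial(true_level, indicated_level):
    truth, shown = LEVELS.index(true_level), LEVELS.index(indicated_level)
    if shown == truth:
        return "correct rejection" if true_level == "no alert" else "hit"
    magnitude = "bad" if abs(shown - truth) == 2 else "modest"
    kind = "false alarm" if shown < truth else "miss"   # overstated vs. understated risk
    return f"{magnitude} {kind}"

print(classify_trial("no alert", "high alert"))      # -> bad false alarm
print(classify_trial("high alert", "medium alert"))  # -> modest miss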

2.3 Task

The task for the pilot was identical to that used in Xu et al. (2004; see Appendix A). That is, after the simulation froze, the pilot was to estimate TCPA, and then MD and OCPA by positioning the cursor where the point of closest approach was predicted to be (see Figure 2.1). However, the pilot's judgment was assisted by the automated MD alert system, as noted. Pilots were instructed that when the MD alert was correct, they were supposed to take advantage of the automation assistance. However, when they believed that the predictor provided an invalid MD prediction, the pilots were asked to ignore it and make their estimations based on their own judgments.


2.4 Experimental Design

The current experiment employed a repeated-measures design. However, the data for pilots in this experiment using automation were statistically compared with data from a matched set of pilots, on identical conflict trials, performing without the aid of automation, as reported in Xu, Wickens, and Rantanen (2004). In the following, these previous baseline (i.e., non-automated) data will be referred to as "Experiment 1," and the current data, collected with imperfect automation, will be referred to as "Experiment 2." The geometries in Experiment 1 that produced the easy and hard trials were chosen to create an independent variable of task difficulty for this experiment (Experiment 2), which was varied within subjects. By the process outlined in Figure 2.3, in order to create the two task difficulty levels, the trials in Experiment 1 were first divided into three groups according to the values of distance to CPA: short (1.33 miles), medium (2.67 miles), and long (4.0 miles). For each of the short (1.33 mile) and medium (2.67 mile) distance to CPA groups, the 108 trials were first sub-grouped according to the MD levels (< 1.5 miles for short, 1.5 – 3.5 miles for medium, and > 3.5 miles for long), each of which was again subdivided into easy and hard trial groups (18 trials each) based on the weighted performance measure or error score of [.4 * (absolute TCPA estimate error) + .4 * (absolute MD estimate error) + .2 * (absolute OCPA estimate error)], which had been measured in Experiment 1 (see Chapter 2 for an explanation of the rationale for assigning different weights to the three error measures). That is, the 18 trials that had generated higher error scores (i.e., the weighted measure of absolute TCPA, MD, and OCPA estimate errors) in Experiment 1 were classified as hard trials and the other 18 with lower scores as easy trials, with the error scores derived from the average of pilots participating in Experiment 1. Then 12 trials were randomly selected from the 18 trials. Therefore, there were a total of 36 easy and 36 hard trials for each of the 1.33 and 2.67 mile distance to CPA groups, where "easy" and "hard" were defined based upon the level of performance in Experiment 1.
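As an illustration of the difficulty classification just described (a sketch with an assumed data layout, not the scripts actually used), the weighted error score and the easy/hard split within a set of trials could be computed as follows:

def error_score(trial):
    """Weighted composite of Experiment 1 estimation errors (weights from the text)."""
    return (0.4 * trial["abs_tcpa_error"]
            + 0.4 * trial["abs_md_error"]
            + 0.2 * trial["abs_ocpa_error"])

def split_easy_hard(trials):
    """Rank trials by error score; the lower-scoring half is 'easy', the rest 'hard'."""
    ranked = sorted(trials, key=error_score)
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]

# Applied separately to the trials within each MD level of a DCPA group, this yields
# 18 easy and 18 hard trials per level, from which 12 were then randomly sampled.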

Short DCPA:   Short MD (Easy & Hard)   Medium MD (Easy & Hard)   Long MD (Easy & Hard)
Medium DCPA:  Short MD (Easy & Hard)   Medium MD (Easy & Hard)   Long MD (Easy & Hard)
Long DCPA:    Short MD (Easy & Hard)   Medium MD (Easy & Hard)   Long MD (Easy & Hard)

Figure 2.3. Method of creating two task difficulty levels.


For the long (4.0 mile) distance to CPA group, all the 72 trials in Experiment 1 were employed for this experiment (i.e., no sampling was involved). The 72 trials for the 4.0 mile distance to CPA were also sub-grouped according to the three MD levels, each of which was also sub-divided into easy and hard trial groups according to performance in Experiment 1, each containing 18 trials. Therefore, there were also a total of 36 easy and 36 hard trials for the 4.0 mile distance to CPA group. For each DCPA group, the easy trials in each of the three MD levels were put into an “easy” geometry category and the hard trials in each of the three distance to CPA levels in a “hard” category; thus, on average, the “easy” category was assumed to be easier than the “hard” category based on the performance of the participants in Experiment 1. Another independent variable that was incorporated into the design was automation validity (error vs. correct), and it was also varied within subjects. For each level of distance to CPA, there were 12 automation error trials and 60 correct automation trials, thus yielding an automation reliability of 83%. Nested within the 12 error automation trials, there were four modest misses, two bad misses, four modest false alarms, and two bad false alarms, being equally represented by easy and hard trials (e.g., two of the modest misses were for easy trials and the other two for hard trials, and one bad miss for an easy trial and the other bad miss for a hard trial). The reason that there were more modest errors than “bad” errors reflects the statistical (normal) distribution that would be expected from such a diagnostic system. Two additional independent variables were then created from those automation error trials. One was automation error magnitude (modest automation error vs. bad automation error) and another automation error type (miss vs. false alarm), both of which were varied within subjects. The 24 pilots were assigned to the three distance to CPA groups (8/group) in such a way that the flying experience of the pilots was roughly equal to that of the corresponding pilots in Experiment 1 (e.g., Subject 1 in both experiments were flight instructors with similar ratings and flying hours). In this way, matched pairs of subjects between the two experiments were created. Note that although distance to CPA was varied between-subjects, DCPA was not an independent variable examined for this experiment. The 72 trials in each DCPA group were presented in quasi-random order to the pilot. The automation error trials were quasi-randomly distributed within the total 72 trials. The error trials that created false alarms and those creating misses were in turn quasi-randomly distributed among those error trials such that the order would appear random to the pilot. Also, the error trials were quasi-randomly distributed in a different manner for each participant so these occurred across a wide range of conflict geometry. Appendix C provides a listing of the actual conflict geometries for easy and hard trials. 2.5 Dependent Measures For this experiment, five dependent variables (absolute and signed TCPA estimate errors, absolute and signed MD estimate errors, and absolute OCPA estimate error) that were also employed in Xu et al. (2004) were collected and analyzed. In addition, three new dependent variables were derived for the testing of Hypotheses 3A and 3B stated below. 
Specifically, automation-induced differences in absolute MD, TCPA, and OCPA estimate errors were computed by subtracting absolute MD, TCPA, and OCPA estimate errors in Experiment 1 from those in Experiment 2. These subtractions were done on a trial-by-trial basis between each subject in Experiment 2 and the matched pair encountering the same geometry in Experiment 1. As a consequence of this procedure we could directly assess the benefits and/or costs of
imperfect automation, and how these costs and benefits were modulated by variables of task difficulty and automation correctness. As we will note below, greatest attention was focused on the MD measures, since pilot-estimated miss distance represented the most safety-critical aspect of the pilot's assessment of conflict risk.

2.6 Procedure

Pilots participated in one experimental session to complete two blocks of 36 trials each, lasting approximately one to two hours in total, depending on which distance (DCPA) group a subject was assigned to. Before the experimental session, pilots participated in a practice session, in which they read instructions and were shown the task and display symbology to become familiar with the simulation, and were explicitly told that the predictor would not be 100% reliable and might indicate an erroneous MD. Then they performed ten practice trials, with a valid predictor for six trials and an invalid predictor for four trials (one false alarm, one bad false alarm, one miss, and one bad miss), being informed explicitly of the invalidity of the latter four trials. Upon completion of the practice session, pilots participated in one experimental session to complete the 72 trials. Pilots took a short break between the two blocks to avoid fatigue. Upon the completion of the experimental session, pilots were required to indicate their estimation of the reliability of the MD predictor as an explicit measure of automation trust (Wickens et al., 2000). The means of the estimated reliability were 85.28%, 83.79%, and 81.17% for the 1.3 mile, 2.7 mile, and 4.0 mile DCPA groups, respectively.

2.7 Hypotheses

Hypothesis 1: Conflict detection or awareness performance in Experiment 2 would be better than in Experiment 1. This hypothesis is based on the assumption that generally valid (83% reliable in this experiment) automation benefits performance relative to manual performance (Dixon & Wickens, 2004).

Hypothesis 2: Correct automation with a valid MD alert would improve performance (Hypothesis 2A) and error automation with an invalid MD alert would hinder performance (Hypothesis 2B) relative to manual performance on equivalent difficulty trials in Experiment 1.

Hypothesis 3: Increasing trial difficulty would amplify the effect of reliability. More specifically, for correct automation, automation will provide greater performance improvement relative to manual performance in Experiment 1 for hard trials than for easy trials (Hypothesis 3A); and for error automation, automation will induce greater performance costs relative to manual performance in Experiment 1 for hard trials than for easy trials (Hypothesis 3B). Figure 2.4 illustrates Hypothesis 3. This hypothesis is consistent with the finding that automation benefits are often greater in difficult than in easy tasks (Dixon & Wickens, 2003, 2004; Galster et al., 2001; Maltz & Shinar, 2003; Yeh et al., 2003), but costs of automation failure are also greater for tasks that are difficult in their manual forms (Wickens et al., 2000). Both findings link to the fact that increasing task difficulty fosters greater dependence on the automation, with positive consequences if correct, but negative ones if in error (Maltz & Shinar, 2003). Note that in Figure 2.4, we can define automation dependence as the difference between
the benefit of correct and the cost of error automation (Maltz & Shinar, 2003). While dependence is expected to differ between easy and difficult task conditions or problems, we will also see that such dependence varied between subjects, providing the basis for an important aspect of our analysis.

[Figure 2.4 graph: MD error plotted against task difficulty (easy vs. hard) for the manual baseline (Exp. 1), correct automation, and error automation; the benefit of correct automation (Hypothesis 3A) and the cost of error automation (Hypothesis 3B) both grow with difficulty, and their combined span indexes automation dependence (low to high).]

Figure 2.4. Illustration of Hypothesis 3.

Hypothesis 4: Bad automation errors would induce greater performance costs than modest automation errors.

Hypothesis 5: Automation false alarms would induce greater costs than automation misses.

Hypothesis 6: The effects associated with the above hypotheses (1-5) would be more significant for those pilots who were more dependent upon the automation than for those who were less dependent upon the automation, an issue we address as follows.

2.8 Automation Dependence

We note that Hypotheses 1-5 were made based on the assumption that participants would utilize or depend upon the automation, with varied levels of dependence across participants; hence Hypothesis 6. Regarding Hypothesis 6, a good measure of dependence on automation is the difference in performance between conditions of error and correct automation (see Figure 2.5; Maltz & Shinar, 2003). A large difference would be indicative of heavy dependence on automation. Figure 2.5 illustrates schematically the different levels of dependence on automation of five hypothetical subjects. Subjects 1 and 2 have a higher level of dependence than Subjects 3, 4, and 5 in that the performance difference between error and correct automation trials for Subjects 1 and 2 is greater than that for Subjects 3, 4, and 5. Within the two heavy dependence subjects, while the benefit relative to the manual baseline performance due to depending on correct automation is the same, the cost due to error automation is greater for Subject 1 than for
Subject 2. Subject 1's performance (large benefit with correct automation and large cost with error automation by blindly following the automation) is analogous to the beta shift in signal detection theory terms (i.e., more likely to report "signal"), whereas Subject 2's performance (only benefit with small cost by increased processing of raw data) is similar to an increased sensitivity in signal detection theory (Yeh & Wickens, 2001). Note that Subjects 1 and 2's performance also represents the automation dependent style and the optimal style, respectively, as described by Maltz and Shinar (2003).

[Figure 2.5 graph: MD error for five hypothetical subjects (Sub 1-Sub 5) under correct automation, error automation, and the manual baseline; the error-minus-correct difference (the dependence measure) is large for Sub 1 and Sub 2 (heavy dependence) and small or reversed for Sub 3-Sub 5 (low dependence).]

Figure 2.5. A schematic illustration of different levels of dependence on automation that is not perfectly reliable.

There are also differences among the three low dependence subjects. Subject 3 has a small benefit with correct automation but also a small cost with erroneous automation. Subject 4 totally ignores the automation (thus neither benefit nor cost), thus representing the skeptical style identified by Maltz and Shinar (2003). For Subject 5, the automation error might be so obvious that it attracts the pilot to scrutinize the conflict geometry evident in the raw data more closely, thus yielding better performance than does correct automation. Given that only MD prediction was automated in this experiment, the automation dependence measure that we employed was based on the difference in absolute MD estimate error between trials of error MD alert and those of correct MD alert. In the following analysis, this difference was calculated individually for each pilot, and the 24 pilots were then divided into two dependence groups (light and heavy) according to the value of this automation dependence measure using a median-split method. Automation dependence would be heavy to the extent that the difference was large and positive, and light to the extent that the value was small or negative. For pilots who did not heavily depend on the MD alert, it was possible that their estimation errors in the error automation trials were smaller in magnitude than in the correct automation trials, hence the negative value of the difference in estimation errors between
these two trial types. This behavior is typical of the hypothetical Subject 5 in Figure 2.5, and was observed for some subjects in the current study.

3. RESULTS

The results are presented in two sections, one for pilots who heavily depended on the automated MD alert, hereafter referred to as the heavy dependence group, and another for pilots who did not depend as heavily on the automation, hereafter referred to as the light dependence group. As described in the context of Figures 2.4 and 2.5, the measure of dependence on automation is the difference in absolute MD estimate error between the trials with an automation error and those with correct automation (Maltz & Shinar, 2003). The difference was calculated separately for each individual pilot, thus yielding two levels of automation dependence (light and heavy) for the 24 pilots using a median-split method to assign 12 pilots to each level. Figure 3.1 presents the data on the difference in absolute MD estimate error between trials with error and correct automation for the two groups of pilots. Pilots with negative and positive values belonged to the light and heavy dependence groups, respectively. Note that the light dependence pilots mostly encountered trials with short DCPA (1.3 miles), whereas the heavy dependence pilots mostly had trials with medium (2.7 miles) and long (4.0 miles) DCPAs, which were more difficult than the short DCPA trials. Data analyses were then performed separately for the two dependence groups, with greatest emphasis below placed on the heavy dependence group.


Figure 3.1. Difference in absolute MD estimate error (|estimated MD – true MD|) between error and correct automation trials for light and heavy automation dependence pilots. The letters "L," "M," and "S" below the subject numbers designate long, medium, and short DCPA groups, respectively, to which the subjects were assigned. Note the prevalence of easier, short (S) DCPAs in the light dependence group.
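As a concrete illustration of the dependence measure and median split described above, the following sketch derives each pilot's difference in absolute MD estimate error between error-automation and correct-automation trials and assigns dependence groups. The variable names, data layout, and values are hypothetical and for illustration only; the actual analyses were run in SPSS.

```python
import numpy as np
import pandas as pd

# Hypothetical trial-level data: one row per trial per pilot.
trials = pd.DataFrame({
    "pilot":        [1, 1, 1, 1, 2, 2, 2, 2],
    "auto_valid":   ["correct", "correct", "error", "error"] * 2,
    "estimated_md": [2.5, 1.0, 3.2, 0.6, 2.9, 1.3, 2.4, 1.1],  # miles
    "true_md":      [2.7, 1.3, 2.7, 1.3, 2.7, 1.3, 2.7, 1.3],  # miles
})

trials["abs_md_error"] = (trials["estimated_md"] - trials["true_md"]).abs()

# Mean absolute MD error per pilot, separately for error and correct automation trials.
per_pilot = trials.pivot_table(index="pilot", columns="auto_valid",
                               values="abs_md_error", aggfunc="mean")

# Dependence measure: error-trial error minus correct-trial error.
per_pilot["dependence"] = per_pilot["error"] - per_pilot["correct"]

# Median split: pilots at or above the median are "heavy", the rest "light".
median_dep = per_pilot["dependence"].median()
per_pilot["group"] = np.where(per_pilot["dependence"] >= median_dep, "heavy", "light")
print(per_pilot)
```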


Among the three measures of conflict awareness (absolute TCPA, MD, and OCPA estimate errors), the most important one is absolute MD estimate error, since MD best reflects the true conflict risk, and thus its estimate error best represents the pilot's true understanding of that risk. Furthermore, for this experiment, we were interested in particular in how automation (vs. non-automation in Experiment 1) would influence conflict awareness when pilots depended on automation that provided three levels of predicted MD. Therefore, this chapter puts the greatest emphasis on the effects of the automation variable on absolute MD estimate error for the 12 heavy dependence pilots, shown on the right side of Figure 3.1. Most of the tests below were based on planned contrasts between the data for Experiment 2 (automation) and Experiment 1 (no automation), using one-tailed t-tests to examine predicted costs or benefits of automation. Unless otherwise stated, all of the following analyses were carried out only on the 12 subjects of the heavy dependence group, comparing them against their matched-pair counterparts in Experiment 1. All statistical analyses were conducted using SPSS version 11.5 for Windows, and all error bars depicted on the figures below represent a 95% confidence interval.

3.1 Analyses for Heavy Dependence Group

3.1.1 Overall effect of automation. Hypothesis 1 predicts that overall conflict awareness in Experiment 2 (with MD alert) would be better than in Experiment 1 (without MD alert). Figure 3.2 presents the data on absolute MD estimate error in Experiment 2 and for the corresponding manual trials in Experiment 1. Absolute MD estimate error was .13 mile smaller in Experiment 2 than in Experiment 1, t(22) = -1.83, p = .04, suggesting that the automated alerts used here, even though imperfect, nonetheless served to benefit MD estimation performance.


Figure 3.2. Absolute MD estimate errors (|estimated MD – true MD|) for the heavy dependence group in Experiment 2 and the corresponding trials in Experiment 1.
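The planned contrasts reported in this section compare the 12 heavy dependence pilots in Experiment 2 against their 12 matched baseline pilots from Experiment 1 (hence t(22)), using one-tailed tests of the directional predictions. A minimal sketch of such a contrast, using hypothetical per-pilot mean errors rather than the experimental data, is shown below.

```python
import numpy as np
from scipy import stats

# Hypothetical per-pilot mean absolute MD estimate errors (miles); illustrative values only.
exp2_auto   = np.array([0.28, 0.31, 0.25, 0.40, 0.22, 0.35, 0.30, 0.27, 0.33, 0.29, 0.24, 0.36])
exp1_manual = np.array([0.45, 0.38, 0.41, 0.52, 0.36, 0.47, 0.44, 0.39, 0.50, 0.42, 0.37, 0.49])

# Independent-samples contrast (matched groups, but different pilots), df = 12 + 12 - 2 = 22.
t_stat, p_two_tailed = stats.ttest_ind(exp2_auto, exp1_manual)

# One-tailed p for the directional prediction that automation reduces error.
p_one_tailed = p_two_tailed / 2 if t_stat < 0 else 1 - p_two_tailed / 2
print(f"t(22) = {t_stat:.2f}, one-tailed p = {p_one_tailed:.3f}")
```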

Absolute TCPA and OCPA estimate errors did not differ significantly between Experiments 2 and 1, t(22) = .63, p = .27 for absolute TCPA estimate error, and t(22) = -.61, p = .27 for absolute OCPA estimate error. Therefore, Hypothesis 1 was generally substantiated in that the most important measure of conflict risk estimation, absolute MD estimate error, was reduced with automation assistance.

3.1.2 Effect of automation validity (correct vs. erroneous). Hypothesis 2A predicts that correct automation would improve performance, and Hypothesis 2B predicts that error automation would hinder performance, relative to baseline unaided performance in Experiment 1. Figure 3.3 shows the data on absolute MD estimate error, plotted separately for the error and correct automation trials of the heavy dependence group in Experiment 2 and for their corresponding manual trials in Experiment 1. What is evident immediately from the figure is that both correct and error automation conditions appeared to show some form of benefit (smaller MD error in Experiment 2 than in Experiment 1); so, surprisingly, even when the automation was in error, there was no cost. Ideally, this hypothesis would be tested by a 2 (manual vs. automation) × 2 (correct automation vs. error automation) ANOVA. However, this was made difficult because the important variable of automation validity (correct vs. error) was not defined in Experiment 1. Therefore, we performed two planned contrasts between the data on absolute MD estimate error for Experiment 2 and Experiment 1 using one-tailed t-tests. When the automation was in error, there was no significant difference in absolute MD estimate error between the two experiments, t(22) = .72, p = .24. However, when the automation was correct, absolute MD estimate error in Experiment 2 was significantly lower (by .15 mile) than that in Experiment 1, t(22) = 2.01, p = .025, suggesting performance improvement with correct automation assistance.


Figure 3.3. Absolute MD estimate errors (|estimated MD – true MD|) for error and correct automation trials of the heavy dependence group in Experiment 2 and the corresponding manual trials in Experiment 1.


There was no significant effect of experimental condition on absolute TCPA estimate error for correct automation, t(22) = -.97, p = .17, but for error automation, absolute TCPA estimate error was slightly smaller in Experiment 2 than in Experiment 1 (a 2.28 sec difference), a result of marginal significance, t(22) = 1.48, p = .08. However, since the effect of error automation on TCPA estimate error was in the opposite direction to that predicted, a two-tailed t-test was conducted, and this revealed that the effect was not significant, t(22) = 1.48, p = .15. There was no significant effect of experimental condition on absolute OCPA estimate error for error automation, t(22) = .40, p = .34, or correct automation, t(22) = .61, p = .27. Therefore, Hypothesis 2A was generally substantiated in that correct automation benefited the most important variable--MD estimation--but not TCPA and OCPA estimations. Hypothesis 2B did not obtain support from the data, in that error automation did not hurt any aspect of performance. Based on the above analyses, imperfect automation provided a benefit when correct, but no cost when in error.

Given the reduction in error of estimating MD in Experiment 2 relative to Experiment 1, we can ask, "How was this error reduced?" Was it simply because the estimates were less variable around the true MD value? Or did the automation move the signed estimates closer to the true value, perhaps reducing the bias to underestimate MD that was apparent in Experiment 1 (Xu, Wickens, & Rantanen, 2004)? The data on absolute MD estimate error cannot discriminate these, so the data on signed MD estimate error were evaluated for the heavy dependence group in Experiment 2 and the corresponding trials in Experiment 1. These data are shown in Figure 3.4.


Figure 3.4. Signed MD estimate errors (estimated MD – true MD) for error and correct automation trials of the heavy dependence group and the corresponding manual trials in Experiment 1. A negative reading indicates underestimate of MD (or overestimate of risk).

Assisted by the correct MD alert, pilots estimated MD with greater accuracy by reducing the underestimate of MD relative to Experiment 1, t(22) = 1.46, p = .08, a result of marginal significance. However, with error automation, there was no significant change in signed MD estimate error from that in Experiment 1, t(22) = .56, p = .29.

3.1.3 Effect of automation validity modulated by task difficulty: Miss distance. The following predictions were made according to Hypothesis 3 (see Figure 2.4), as a consequence of the prediction that automation dependence would be greater with more difficult problems: For correct automation, automation would provide greater performance improvement (relative to manual performance in Experiment 1) for hard trials than for easy trials (Hypothesis 3A); and for erroneous automation, automation would induce greater performance costs for hard trials than for easy trials (Hypothesis 3B). As with the testing of Hypothesis 2, this hypothesis would ideally be tested by a 2 (automation vs. manual) × 2 (correct automation vs. error automation) × 2 (easy vs. hard) ANOVA. However, this was again made difficult because the variable of automation validity was not defined in Experiment 1. Hence, our analysis procedure computed the difference between the score of each subject in Experiment 2 and the corresponding value of the baseline (manual) subject with whom he or she was matched, on the basis of matching trials in the two experiments and equivalent experience. Thus, each raw score (i.e., the difference) can be considered a benefit (or cost) relative to the baseline.

Analyses on MD estimate measures: E2 – E1 difference scores. The cost-benefit data on absolute MD estimate error are plotted as a function of automation correctness and task difficulty in Figure 3.5. Statistical evaluation of the four data points within Figure 3.5 was carried out via two planned contrasts using one-tailed tests. The t-test comparing the two data points on the left in the figure (differences between the two experiments for error automation trials) revealed that there was no significant effect of task difficulty for the error automation trials, t(11) = .53, p = .30. However, the t-test comparing the two data points on the right (differences for correct automation trials) revealed that the absolute MD estimate error difference between the two experiments was significantly greater for the hard trials than for the easy trials, t(11) = 2.31, p = .021.



Figure 3.5. Difference in absolute MD estimate error (|estimated MD – true MD|) between heavy dependence group in Experiment 2 and the corresponding group in Experiment 1 by automation validity and task difficulty. A negative value indicates reduced MD estimate error in Experiment 2 compared to that in Experiment 1.
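A minimal sketch of the difference-score procedure described above, using hypothetical matched-pair values rather than the experimental data: each Experiment 2 pilot's error is compared with that of his or her matched Experiment 1 baseline pilot, and the resulting benefit/cost scores are then contrasted across difficulty levels within the group (hence t(11) for 12 pilots).

```python
import numpy as np
from scipy import stats

# Hypothetical mean absolute MD errors (miles) on correct-automation trials,
# one value per pilot, split by task difficulty; illustrative numbers only.
exp2_hard = np.array([0.35, 0.40, 0.30, 0.45, 0.38, 0.33, 0.42, 0.36, 0.39, 0.31, 0.44, 0.37])
exp1_hard = np.array([0.62, 0.70, 0.55, 0.74, 0.66, 0.58, 0.71, 0.63, 0.69, 0.57, 0.73, 0.65])
exp2_easy = np.array([0.22, 0.25, 0.20, 0.28, 0.24, 0.21, 0.27, 0.23, 0.26, 0.19, 0.29, 0.24])
exp1_easy = np.array([0.30, 0.34, 0.27, 0.37, 0.32, 0.28, 0.36, 0.31, 0.35, 0.26, 0.38, 0.33])

# Benefit (negative) or cost (positive) relative to each matched baseline pilot.
diff_hard = exp2_hard - exp1_hard
diff_easy = exp2_easy - exp1_easy

# Within-group contrast of the difference scores across difficulty (paired, df = 11).
t_stat, p_two = stats.ttest_rel(diff_hard, diff_easy)
p_one = p_two / 2 if t_stat < 0 else 1 - p_two / 2  # one-tailed: greater benefit on hard trials
print(f"t(11) = {t_stat:.2f}, one-tailed p = {p_one:.3f}")
```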

The finding of this automation benefit, specific to hard conflicts when automation was correct, is corroborated by separate post-hoc contrasts performed to assess whether a benefit was present at all for each of the four conditions. There was no significant benefit for the two error automation conditions compared to Experiment 1, either for easy trials, t(11) = -.57, p = .29, or for hard trials, t(11) = -.94, p = .18. However, there was a marginally significant correct-automation benefit for easy trials, t(11) = -1.38, p = .098, and a significant correct-automation benefit for hard trials, t(11) = -2.7, p = .011. Taken together, it can be concluded that when the automation was in error, MD estimation performance did not change significantly as compared to Experiment 1, and was not influenced by task difficulty. However, when the automation was correct, MD estimate accuracy improved for both easy and hard trials, but this improvement was considerably greater for hard trials than for easy trials, thus supporting Hypothesis 3A. On the other hand, Hypothesis 3B was not supported, since pilots did not suffer more when automation was in error on hard than on easy trials, once again replicating the finding of Hypothesis 2; that is, imperfect automation appeared to provide benefits when correct, but no costs when in error, in terms of MD estimation.


Analyses on MD estimate measures: Absolute errors in Experiments 2 and 1. Hypothesis 3A implies that when the automation was correct, there would be no difference in performance between the easy and the hard trials, or that performance would still be better for the easy than for the hard trials but the difference would be reduced relative to that in Experiment 1; still another possibility is that performance for the hard trials would be better than for the easy trials, as shown in Figure 2.4.¹ These possibilities cannot be examined from the data on the difference in absolute MD estimate error between Experiments 2 and 1 (Figure 3.5), so we analyzed the absolute performance levels (rather than the differences). Figure 3.6 presents the data on absolute MD estimate error by automation validity and task difficulty in Experiment 2 and for the corresponding trials in Experiment 1. A 2 (manual vs. automation) × 2 (easy vs. hard) mixed ANOVA for the error automation trials and the corresponding trials in Experiment 1 (shown on the left side of Figure 3.6) revealed that performance did not differ between the two experiments, F(1, 22) = .56, p = .46, and that performance on hard trials was poorer than on easy trials in both experiments, as indicated by the significant main effect of task difficulty, F(1, 22) = 17.40, p < .0001. Furthermore, the same ANOVA revealed that the difference in performance between easy and hard trials in Experiment 2 was not reduced compared to that in Experiment 1, when automation was absent, since the interaction between experimental condition and task difficulty was not significant, F(1, 22) = .31, p = .59.

¹ The last possibility is conceivable if pilots had used the automation only when the task was hard but not when it was easy.


Figure 3.6. Absolute MD estimate errors (|estimated MD – true MD|) for the heavy dependence group in Experiment 2 by automation validity and task difficulty and the corresponding manual trials in Experiment 1.
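The 2 × 2 mixed ANOVAs reported around Figure 3.6 cross a between-subjects factor (experiment: manual vs. automation) with a within-subjects factor (task difficulty: easy vs. hard). A hedged sketch of how such an analysis could be run in Python with the pingouin package is shown below; the data layout and values are hypothetical, and the original analyses were performed in SPSS.

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format data: one row per pilot per difficulty level,
# with the pilot's mean absolute MD estimate error (miles) in that cell.
data = pd.DataFrame({
    "pilot":      [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "experiment": ["manual"] * 6 + ["automation"] * 6,
    "difficulty": ["easy", "hard"] * 6,
    "abs_md_err": [0.30, 0.70, 0.28, 0.65, 0.33, 0.72,
                   0.22, 0.35, 0.25, 0.40, 0.20, 0.33],
})

# Mixed ANOVA: 'experiment' varies between pilots, 'difficulty' within pilots.
aov = pg.mixed_anova(data=data, dv="abs_md_err",
                     within="difficulty", subject="pilot",
                     between="experiment")
print(aov[["Source", "F", "p-unc"]])
```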

In contrast, a 2 (manual vs. automation) × 2 (easy vs. hard) mixed ANOVA for the correct automation trials and the corresponding trials in Experiment 1 (shown on the right side of Figure 3.6) revealed that performance in Experiment 2 was better than in Experiment 1, F(1, 22) = 4.19, p = .053, and that performance on hard trials was poorer than on easy trials, as indicated by the significant main effect of task difficulty, F(1, 22) = 36.73, p < .0001. Most importantly, the difference in performance between the easy and the hard trials in Experiment 2 was reduced compared to that in Experiment 1, because there was a significant interaction between experimental condition and task difficulty, F(1, 22) = 6.89, p = .015. A contrast between easy and hard trials for correct automation revealed that this difference was still significant, t(11) = 3.44, p = .005, indicating that correct automation did not entirely eliminate the effect of task difficulty.

Finally, a 2 (error automation vs. correct automation) × 2 (easy vs. hard) within-subjects ANOVA on the Experiment 2 data only revealed that absolute MD error was significantly smaller for the correct automation than for the error automation trials, F(1, 11) = 6.65, p = .026. Moreover, while automation validity did not significantly influence the error for the easy trials, t(11) = 1.07, p = .15 (one-tailed test), it did so for the hard trials such that the

error was smaller when the automation was correct than when it was in error, t(11) = 2.04, p = .033 (one-tailed test), thus firmly validating that the 12 "high dependence" subjects truly (and statistically reliably) did depend on automation. Nevertheless, examination of Figure 3.1 suggests that several of these 12 subjects showed relatively little dependence, as operationally defined here. Hence, the analyses of the 12 subjects' data might have diluted the dependence effect. To examine this possibility, the same analyses were also performed for the seven pilots who depended most heavily on the automation (those represented on the far right side of Figure 3.1). Figure 3.7 presents the data on absolute MD estimate error by automation validity and task difficulty for the seven most heavily dependent pilots in Experiment 2 and for the corresponding pilots in Experiment 1. That is, it is the same data representation as in Figure 3.6, but with the restricted set of pilots.


Figure 3.7. Absolute MD estimate errors (|estimated MD – true MD|) for the seven most heavily dependent pilots in Experiment 2 by automation validity and task difficulty and the corresponding manual trials in Experiment 1.

As described below, an identical statistical pattern of results was found with this restricted analysis of the most dependent pilots. A 2 (manual vs. automation) × 2 (easy vs. hard) mixed ANOVA for the error automation trials and the corresponding trials in Experiment 1 (shown on the left side of Figure 3.7) revealed that performance did not differ between the two

experiments, F(1, 12) = .73, p = .41, and that performance on hard trials was poorer than on easy trials in both experiments, as indicated by the significant main effect of task difficulty, F(1, 12) = 12.35, p = .004. Furthermore, the same ANOVA revealed that the difference in performance between easy and hard trials in Experiment 2 was not reduced compared to that in Experiment 1, when automation was absent, since the interaction between experimental condition and task difficulty was not significant, F(1, 12) = .06, p = .89.

In contrast, a 2 (manual vs. automation) × 2 (easy vs. hard) mixed ANOVA for the correct automation trials and the corresponding trials in Experiment 1 (shown on the right side of Figure 3.7) revealed that performance in Experiment 2 was better than in Experiment 1, F(1, 12) = 5.10, p = .043, and that performance on hard trials was poorer than on easy trials, as indicated by the significant main effect of task difficulty, F(1, 12) = 65.04, p < .0001. Most importantly, the difference in performance between the easy and the hard trials in Experiment 2 was again reduced compared to that in Experiment 1, because there was a highly significant interaction between experimental condition and task difficulty, F(1, 12) = 23.42, p < .0001 (much more significant than the 12-subject interaction, despite the smaller sample size). A contrast between easy and hard trials for correct automation revealed that this difference was still significant, t(6) = -2.67, p = .037. Thus, the results for the most heavily dependent pilots showed the same pattern as those for the larger heavy dependence group consisting of the 12 pilots.

A 2 (error automation vs. correct automation) × 2 (easy vs. hard) within-subjects ANOVA was also performed for the seven most heavily dependent pilots. The ANOVA again revealed that absolute MD error was significantly smaller for the correct automation than for the error automation trials, F(1, 6) = 6.89, p = .039. Moreover, while automation validity did not significantly influence the error for the easy trials, t(6) = 1.18, p = .19 (one-tailed test), it did so for the hard trials such that the error was smaller when the automation was correct than when it was in error, t(6) = 1.95, p = .05 (one-tailed test). Therefore, we can conclude that, assisted by the correct MD alert, although performance on hard trials was still poorer than on easy trials as in Experiment 1, the difference between these two difficulty levels was greatly reduced relative to that in Experiment 1 (and more so for the pilots who depended on the automation most heavily). The amplified significance of the interaction for the "heavy seven" despite the smaller sample size provided strong evidence that the effect of difficulty was more pronounced with higher dependence. Moreover, there was a benefit for correct automation compared to error automation when the task was hard, but not when the task was easy. This was true for both the heavy dependence group and the seven most dependent pilots within this group.

3.1.4 Analyses on time estimate measures: E2 – E1 difference scores. Shown in Figure 3.8 are the data on the difference in absolute time (TCPA) estimate error between the two experiments as a function of automation validity and task difficulty. The data revealed that when the automation was in error, there was no significant effect of task difficulty on the absolute TCPA estimate error difference between the two experiments, t(11) = -.65, p = .26.
However, when the automation was correct, there was a significant effect of task difficulty on absolute TCPA estimate error difference, t(11) = -2.68, p = .011, suggesting an automation cost on hard trials, but not on easy trials.



Figure 3.8. Difference in absolute TCPA estimate error (|estimated TCPA – true TCPA|) between the heavy dependence group in Experiment 2 and the corresponding manual group in Experiment 1 by automation validity and task difficulty. A negative reading indicates reduced TCPA estimate error in Experiment 2 compared to that in Experiment 1.

Separate post-hoc contrasts were performed to assess whether a benefit or a cost was present at all for each of the four conditions. For error automation, absolute TCPA estimate error was smaller than in Experiment 1 for easy trials, based on both a one-tailed test, t(11) = -2.53, p = .014, and a two-tailed test, t(11) = -2.53, p = .028, but not for hard trials, t(11) = -.53, p = .30 (one-tailed test). For correct automation, absolute TCPA estimate error did not differ significantly between the two experiments for easy trials, t(11) = -.45, p = .33; the error was marginally greater with automation than in Experiment 1 for hard trials using a one-tailed test, t(11) = 1.51, p = .08, but this difference was not significant using a two-tailed test, t(11) = 1.51, p = .16. Thus, the finding with respect to absolute TCPA estimate error was contradictory to Hypothesis 3, in that the error MD alert did not hurt TCPA estimation nor did task difficulty affect it, but the correct MD alert actually tended to induce a cost on hard trials but not on easy trials. Therefore, there is evidence suggesting that while correct automation benefited miss distance estimation performance, in particular for hard trials (Figures 3.6 and 3.7), this benefit was achieved at a cost to time estimation performance, in particular for hard trials.


Analyses on time estimate measures: Absolute errors in Experiments 2 and 1. We cannot tell from Figure 3.8 whether the time estimate error was larger or smaller on hard trials than on easy trials, or whether it was the same between the two difficulty levels in each condition. Analogous to the analysis represented in Figure 3.6 for MD estimate error, the data were also analyzed on absolute TCPA estimate error for the erroneous and correct automation trials by task difficulty (easy and hard) separately in Experiment 2 and for the corresponding trials in Experiment 1 (see Figure 3.9). The results of a 2 (manual vs. automation) × 2 (easy vs. hard) mixed ANOVA revealed that when the automation was in error (left half of the figure), performance did not differ significantly from that in Experiment 1, F(1, 22) = 2.35, p = .14; performance on the hard trials was consistently poorer than on the easy ones in both experiments, F(1, 22) = 28.86, p < .0001. The lack of a significant interaction between the two factors suggests that the difference between easy and hard trials did not change significantly across the two experiments, F(1, 22) = .40, p = .53.


Figure 3.9. Absolute TCPA estimate errors (|estimated TCPA – true TCPA|) for the heavy dependence group in Experiment 2 by automation validity and task difficulty and the corresponding manual trials in Experiment 1.

However, when the automation was correct (right half of the figure), the increase in task difficulty imposed a greater cost on performance, as indicated by greater absolute TCPA estimate error than in Experiment 1. This was confirmed by the results of a 2 (manual vs. automation) × 2 (easy vs. hard) mixed ANOVA, which revealed that the hard trials were still harder than the easy ones in Experiment 2, F(1, 22) = 69.32, p < .0001, but the significant interaction between


experimental condition and task difficulty suggests that the hard trials (with automation) in Experiment 2 induced greater TCPA estimation error than the corresponding hard trials (without automation) in Experiment 1 (i.e., a greater difference in error between easy and hard trials than in Experiment 1), F(1, 22) = 6.37, p = .019. Therefore, we can conclude that with correct alerts, TCPA estimation performance on hard trials was not only poorer than on easy trials, as in Experiment 1, but the difference between these two difficulty levels also increased in Experiment 2 relative to Experiment 1.

Similar to the analyses on signed MD estimate error, because of the importance of signed TCPA estimate error (estimating the conflict to be too early or too late), we also looked at these signed errors for each condition individually in both experiments. These data are shown in Figure 3.10, and suggest a general tendency to underestimate TCPA in both experiments. A 2 (manual vs. automation) × 2 (easy vs. hard) mixed ANOVA for the error automation trials and their corresponding trials in Experiment 1 (left half of the figure) revealed that TCPA was more underestimated in Experiment 2 than in Experiment 1, F(1, 22) = 10.26, p = .004. It is interesting to note that with error automation, while absolute TCPA estimate error did not differ between the two experiments (left side of Figure 3.9), TCPA in Experiment 2 was mostly underestimated, even for the easy trials. The ANOVA also revealed that TCPA was more underestimated on hard than on easy trials in both experiments, F(1, 22) = 8.85, p = .007. There was no significant interaction between the two factors, F(1, 22) = .56, p = .46.


Figure 3.10. Signed TCPA estimate errors (estimated TCPA – true TCPA) for the heavy dependence group in Experiment 2 by automation validity and task difficulty and the corresponding manual trials in Experiment 1. A negative reading indicates underestimate of TCPA (or overestimate of risk).


A 2 (manual vs. automation) × 2 (easy vs. hard) mixed ANOVA was also performed for the correct automation trials and their corresponding trials in Experiment 1 (right side of the figure). The results revealed a pattern identical to that for the error trials. That is, TCPA was more underestimated in Experiment 2 than in Experiment 1, F(1, 22) = 6.9, p = .015, and it was more underestimated on hard than on easy trials in both experiments, F(1, 22) = 132.12, p < .0001. Again, there was no interaction between the two factors, F(1, 22) = .002, p = .96. The issue of why both increases in conflict difficulty and the presence of automation biased pilots toward estimating conflicts as occurring sooner than they actually would will be addressed in the Discussion.

3.1.5 Analyses on OCPA estimate measures. Task difficulty did not significantly influence the difference in absolute OCPA estimate error between Experiments 2 and 1, t(11) = .95, p = .18 for the error automation trials, and t(11) = .96, p = .18 for the correct automation trials.

3.1.6 Conclusion: Hypothesis 3. Hypothesis 3A was partially supported in that when the automation was correct, there was a greater improvement in MD estimation accuracy (relative to that in Experiment 1) for the hard trials than for the easy trials, whereas there was a decrease in TCPA estimate accuracy (relative to that in Experiment 1) for the hard trials. Hypothesis 3B was not supported in that, with erroneous automation, task difficulty did not adversely affect any aspect of the performance difference between the two experiments.

3.1.7 Effect of automation error magnitude. Given the three-level alert system used in the experiment, Hypothesis 4 states that "bad" automation errors would induce greater performance costs than modest automation errors. There was no significant effect of automation error magnitude (modest errors vs. bad errors) on the difference in absolute MD estimate error between the two experiments, t(11) = -.08, p = .47. There was also no significant effect on the difference in absolute TCPA estimate error, t(11) = .35, p = .37, or on the difference in absolute OCPA estimate error, t(11) = .32, p = .38, between the two experiments. Therefore, the results did not support Hypothesis 4.

3.1.8 Effect of automation error type. Hypothesis 5 states that automation false alarms would induce greater costs than automation misses. No significant effect of automation error type (automation misses vs. automation false alarms) was found on the difference in absolute MD estimate error between the two experiments, t(11) = -.61, p = .28. No significant effect of error type was found on the difference in absolute TCPA estimate error, t(11) = -.67, p = .26, or on the difference in OCPA estimate error, t(11) = .29, p = .39, between the two experiments. Therefore, the results did not support Hypothesis 5.

In interpreting the null effects regarding Hypotheses 4 and 5, it should be noted that the power of these statistical tests is considerably lower than in the tests of Hypotheses 1-3, because of the rarity of the different classes of automation errors per subject (miss, false alarm, modest, and bad). Hence, we have less confidence in firmly accepting the null hypotheses.


3.1.9 Summary of effects of independent variables on dependent variables. Table 3.1 summarizes the effects of all the independent variables on the five dependent measures for the heavy dependence group.

Table 3.1. Summary of Effects of Independent Variables on Dependent Variables for Heavy Dependence Group

Experimental condition (automation vs. manual):
- Absolute MD estimate error: Automation led to smaller error (Hypothesis 1).
- Signed MD estimate error: No effect.
- Absolute TCPA estimate error: No effect (Hypothesis 1).
- Signed TCPA estimate error: Automation led to greater underestimate.
- Absolute OCPA estimate error: No effect (Hypothesis 1).

Automation validity (error vs. correct), relative to the manual condition:
- Absolute MD estimate error: Correct automation led to smaller error than manual (Hypothesis 2A); error automation had no effect (Hypothesis 2B).
- Signed MD estimate error: Neither correct nor error automation had an effect.
- Absolute TCPA estimate error: Neither correct nor error automation had an effect (Hypothesis 2).
- Signed TCPA estimate error: Both correct and error automation led to greater underestimate.
- Absolute OCPA estimate error: Neither correct nor error automation had an effect (Hypothesis 2).

Task difficulty (easy vs. hard):
- Absolute MD estimate error: Hard trials led to greater errors than easy trials. Correct automation led to greater benefit for hard than for easy trials (Hypothesis 3A), but error automation did not lead to greater cost for hard than for easy trials (Hypothesis 3B).
- Signed MD estimate error: Hard trials led to greater underestimate for the error automation and corresponding manual conditions, but not for the correct automation and corresponding manual conditions; the effect of task difficulty was not affected by automation validity.
- Absolute TCPA estimate error: Hard trials led to greater errors than easy trials. Correct automation led to greater cost for hard than for easy trials (Hypothesis 3A), but error automation did not lead to greater cost for hard than for easy trials (Hypothesis 3B).
- Signed TCPA estimate error: Hard trials led to greater underestimate than easy trials; the effect of task difficulty was not affected by automation validity.
- Absolute OCPA estimate error: No effect (Hypothesis 3).

Automation error magnitude (modest vs. bad):
- Absolute MD estimate error: No effect (Hypothesis 4).
- Signed MD estimate error: Not tested.
- Absolute TCPA estimate error: No effect (Hypothesis 4).
- Signed TCPA estimate error: Not tested.
- Absolute OCPA estimate error: No effect (Hypothesis 4).

Automation error type (false alarm vs. miss):
- Absolute MD estimate error: No effect (Hypothesis 5).
- Signed MD estimate error: Not tested.
- Absolute TCPA estimate error: No effect (Hypothesis 5).
- Signed TCPA estimate error: Not tested.
- Absolute OCPA estimate error: No effect (Hypothesis 5).


3.2 Analyses for Light Dependence Group

Table 3.2 shows the results of the hypothesis testing for the light dependence group. According to Hypothesis 6, the effects associated with Hypotheses 1-5 would be more significant for the heavy dependence group than for the light dependence group, where they might not be significant at all. As is evident in Table 3.2, none of Hypotheses 1-5 was supported for the light dependence group, thereby confirming Hypothesis 6.

Table 3.2. Results of Hypothesis Testing for Light Dependence Group

Hypothesis 1. Absolute MD, TCPA, and OCPA estimate errors did not significantly differ between Experiments 2 and 1. Statistical tests: t(22) = -.11, p = .46; t(22) = -.22, p = .42; and t(22) = -.37, p = .36, for absolute MD, TCPA, and OCPA estimate errors, respectively.

Hypothesis 2. There was no significant effect of experimental condition on absolute MD, TCPA, or OCPA estimate errors for either error or correct automation trials. Statistical tests: t(22) = .53, p = .30, and t(22) = -.19, p = .43, for absolute MD estimate error for error and correct automation, respectively; t(22) = .55, p = .30, and t(22) = -.38, p = .36, for absolute TCPA estimate error for error and correct automation, respectively; t(22) = 1.28, p = .11, and t(22) = -.69, p = .25, for absolute OCPA estimate error for error and correct automation, respectively.

Hypothesis 3. There was no significant effect of task difficulty on the difference in absolute MD, TCPA, or OCPA estimate errors between Experiments 2 and 1 for either error or correct automation trials. Statistical tests: t(11) = .48, p = .32, and t(11) = .49, p = .32, for the difference in absolute MD estimate error for error and correct automation, respectively; t(11) = -.93, p = .19, and t(11) = -1.45, p = .09, for the difference in absolute TCPA estimate error for error and correct automation, respectively; t(11) = 1.24, p = .12, and t(11) = .01, p = .50, for the difference in absolute OCPA estimate error for error and correct automation, respectively.

Hypothesis 4. There was no significant effect of automation error magnitude on the difference in absolute MD, TCPA, or OCPA estimate errors between Experiments 2 and 1. Statistical tests: t(11) = -.56, p = .30; t(11) = .62, p = .28; and t(11) = -.92, p = .19, for the differences in absolute MD, TCPA, and OCPA estimate errors, respectively.

Hypothesis 5. There was no significant effect of automation error type on the difference in absolute MD, TCPA, or OCPA estimate errors between Experiments 2 and 1. Statistical tests: t(11) = .47, p = .33; t(11) = 1.43, p = .09; and t(11) = .77, p = .23, for the differences in absolute MD, TCPA, and OCPA estimate errors, respectively.


Table 3.2 reveals a collective pattern of results suggesting that the light dependence group did not use the automation, and hence was unaffected by its properties. There are two possible explanations for this "failure to use." One is that those pilots chose not to use the automation because the trials defined to be "hard conflicts" were not actually hard for this group, and they could perform as accurately without the automation as on the easy trials. The other explanation is that this group simply chose not to use the automation for other reasons (e.g., perhaps because it was imperfect and mistrusted, and/or because it was effortful to use), even though they could have benefited from its advice. To discriminate between these two explanations, we examined the difference in MD error between easy and hard trials for the light dependence group in Experiment 2, as well as for the counterpart pilots in Experiment 1. These data are shown in Figure 3.11.


Figure 3.11. Absolute MD estimate errors (|estimated MD – true MD|) for light dependence group in Experiment 2 by automation validity and task difficulty and the corresponding manual trials in Experiment 1.

According to the first explanation (automation rejected because it was unneeded), the hard-problem decrement should be eliminated in Experiment 2 for the light dependence group. A 2 (manual or Experiment 1 vs. automation or Experiment 2) × 2 (easy vs. hard) mixed ANOVA for the error automation conditions and their corresponding manual conditions in Experiment 1 revealed that absolute MD estimate error did not differ significantly between the two


experiments, F(1, 22) = .057, p = .59, and that the error was greater on the hard trials than on the easy trials, F(1, 22) = 10.97, p = .003. It was also revealed that the interaction between task difficulty (easy vs. hard) and experimental condition (manual vs. automation) was not significant, F(1, 22) = .096, p = .76. A 2 (manual vs. automation) × 2 (easy vs. hard) mixed ANOVA for the correct automation conditions and their corresponding manual conditions in Experiment 1 revealed that, unlike the analyses for the heavy dependence group, absolute MD estimate error did not differ significantly between the two experiments, F(1, 22) = .051, p = .82, and that the error was greater on the hard trials than on the easy trials, F(1, 22) = 16.36, p = .001. It was also revealed that the interaction between task difficulty (easy vs. hard) and experimental condition (manual vs. automation) was not significant, F(1, 22) = .047, p = .83. The complete additivity between task difficulty and experimental condition for both the error and correct automation conditions indicates that the performance difference between hard and easy trials was not reduced or eliminated with automation.

The data shown in Figure 3.11 thus do not support the explanation that the light dependence pilots rejected automation because they did not need it. On the contrary, the data support the second explanation: the pilots in the light dependence group should have used automation to support better performance, but rejected it, perhaps because of its known unreliability, because it was effortful to use, or because they felt that the task was too easy to require its assistance. We also analyzed the data on signed MD estimate errors, but none of the effects or interactions was significant.

As with the analyses for MD estimate error, the data were also analyzed on absolute TCPA estimate error for the erroneous and correct automation trials in Experiment 2 and for the corresponding trials in Experiment 1 (see Figure 3.12). A 2 (manual vs. automation) × 2 (easy vs. hard) mixed ANOVA was performed for both the error and correct automation conditions and their corresponding conditions in Experiment 1. It was revealed that hard trials led to poorer performance under all circumstances, F(1, 22) = 17.66, p < .0001, for the error automation conditions and their corresponding manual conditions, and F(1, 22) = 64.91, p < .0001, for the correct automation conditions and their corresponding manual conditions. No other effects or interactions were significant.



Figure 3.12. Absolute TCPA estimate error (|estimated TCPA – true TCPA|) for light dependence group in Experiment 2 by automation validity and task difficulty and their corresponding trials in Experiment 1.

Thus, the non-significant effect of experimental condition and the non-significant interaction between experimental condition and task difficulty for absolute TCPA estimate error suggest that pilots in the light dependence group did not use the MD alert as those in the heavy dependence group had done. For the latter, hard trials with correct automation induced greater TCPA estimate error than in Experiment 1, and the performance gap between easy and hard trials increased relative to Experiment 1, presumably due to the use of the correct MD alert information (see Figure 3.9).

To summarize, pilots in the light dependence group did not use the MD automation, or at least did not use it as extensively (Figure 3.11) as the heavy dependence pilots did (Figure 3.6). The data shown in Figure 3.12 seem to confirm this finding, since there was no cost to TCPA estimation accuracy, as there was for the heavy dependence group (Figure 3.9). It was also found that pilots in this group did not reject the automation because they did not need it, but rather because they chose not to use it.


4. DISCUSSION

The general goal of the present experiment was to show that imperfect automation (an alerting system) for conflict detection could assist unaided pilot performance when using a CDTI, as well as to demonstrate the modulating role of task difficulty in that assistance. While there are ample data showing that perfect automation can aid performance (e.g., Metzger & Parasuraman, 2001; Dixon & Wickens, 2003; Yeh & Wickens, 2001) and does so more on difficult than on easy tasks (e.g., Dixon & Wickens, 2003, 2004; Galster et al., 2001; Maltz & Shinar, 2003; Yeh et al., 2003), and some data showing that imperfect automation can aid performance (e.g., Dixon & Wickens, 2003, 2004; Rovira et al., 2002; Rovira & Parasuraman, 2002), the latter class of data is sparse, and only Dixon and Wickens (2003, 2004) and Maltz and Shinar (2003) have shown the role of task difficulty in modulating the benefits of imperfect automation. Importantly, neither study used a traffic conflict detection paradigm. The data from Xu, Wickens, and Rantanen (2004) were used to create a wide range of difficulty of conflict detection problems, and problems from the low and high ends of that range were sampled for the present experiment, with more difficult problems generally involving slower speeds, longer time and distance to CPA, and oblique conflict angles (see Appendix A).

The results supported some, but not all, of our hypotheses. First, we had not originally anticipated the wide range of automation dependence among participants. Given such a range, it made sense to focus our hypothesis testing regarding automation properties primarily upon those who depended on automation in the first place, since those who did not would be expected to show generally null effects of automation correctness (and indeed they did). The low dependence pilots tended to be those paired with pilots in the short DCPA (1.3 mile) group in Experiment 1. Since shorter DCPAs were generally "easier" in Experiment 1, the low dependence pilots in Experiment 2 were generally those who received easier (short DCPA) problems (see the bottom row of Figure 3.1). It appears that those low dependence pilots did not feel the need to obtain assistance from the automation, presumably because the task was relatively easy, although they should still have used it to improve performance, in particular when encountering the harder conflict problems (Figure 3.11).

In contrast, since the heavy dependence subjects received the more difficult trials (mostly 2.7 and 4.0 mile DCPAs; see the bottom row of Figure 3.1), it might have appeared to be an advisable strategy to depend on the automation to enhance performance. Indeed, those pilots generally were found to benefit from automation on the most critical safety-relevant measure of conflict understanding, the estimation of the miss distance (MD) at the closest point of approach. Performance of these pilots was better than that of their demographically matched counterparts, facing problems of equivalent difficulty but unaided, in Experiment 1, thus supporting Hypothesis 1. Furthermore, this benefit was enhanced on problems of greater difficulty, supporting Hypothesis 3, so that, for the heavy automation dependence group, performance was little affected by difficulty (Figures 3.6 and 3.7). The detailed analysis examining Hypothesis 2 revealed, not surprisingly, that benefits were only realized when automation was correct and not when it was in error (Figure 3.3).
However, the results were a little surprising in that even on the automation error trials, performance was no worse than its level had been in Experiment 1, and sometimes showed a hint


of being better. That is, unlike other findings, erroneous automation did not yield a "complacency cost" of over-reliance (e.g., Maltz & Shinar, 2003; Metzger & Parasuraman, 2001; Yeh & Wickens, 2001). One partial explanation is that pilots were clearly pre-warned of the less-than-perfect characteristics of the automation, and so were presumably not "caught" by a first failure effect, which is typically used to document the effects of overtrust, overreliance, or "complacency" (e.g., Yeh et al., 2003; Yeh & Wickens, 2001).

How did the heavy dependence pilots show a benefit from imperfect automation when it was correct, but no cost when it was wrong? Part of the answer may be that the pilots' response (positioning the cursor on the location of the projected CPA) was different from the actual guidance given by the automation (the predicted category of miss distance at the CPA). In interpreting our results, we assume that when the highest level alert appeared, pilots invested a high level of perceptual and cognitive processing in the raw data, a careful inspection, in order to most accurately estimate the CPA. This effort investment was greater than that of the corresponding pilots in Experiment 1, who did not receive the high level alert. Such behavior would lead to enhanced accuracy even when the alert was incorrect. When the alert was "silent," in contrast, pilots maintained a level of inspection equivalent to that of their Experiment 1 counterparts. This behavior may have penalized them somewhat when the alert was silent but should have sounded (the automation miss), but not enough to offset the real benefits of careful inspection when the alarm signaled the most dangerous level, as described above, including when the latter was a false alert.

Another, parallel, way of accounting for the data is to assume an overall improvement in performance in Experiment 2 versus Experiment 1, perhaps due to a motivational increase from having the automation available (Beck, Dzindolet, Pierce, & Piatt, 2003; Ben-Yaacov et al., 2002). Within the overall improved performance, the cost-benefit differences associated with error versus correct automation still existed (at least on the difficult problems; see Figures 3.5 and 3.6). However, any cost of error automation was entirely offset by the overall benefit of improved motivation and performance, particularly when the alert sounded, as described above, triggering a closer inspection of the raw data.

Importantly, the data show that with an error rate of 17% (83% reliability), pilots clearly benefited from imperfect automation, a data point that adds to the general conclusion that imperfect automation above a 70-75% reliability rate is better than no automation at all when workload is high or the task is difficult (Dixon & Wickens, 2004). In the current experiment the balance of automation misses and false alerts was equal. However, we might presume that an imbalance, caused by an alerting criterion that penalizes misses more than false alerts (and therefore elevates the false alert rate), might not disrupt performance, given the benefit (inferred above) of an alert sound that triggers a closer inspection of the raw data. This benefit can be assumed to exist whether the alert is true or false, as long as the likelihood of the latter event remains reasonably low and the user is calibrated to this false alert frequency.
However, it remains uncertain what might occur if the base rate of conflicts drops to a much lower value than that used in the current experiment, so that the posterior probability P(conflict | alert) is quite low (Parasuraman et al., 1997), perhaps dropping below 50%.
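A short worked example, with hypothetical hit and false alert rates chosen only for illustration, shows how a low conflict base rate can drive the posterior probability of a conflict given an alert well below 50%, even for a fairly reliable alerting system.

```python
def posterior_conflict_given_alert(base_rate, hit_rate, false_alert_rate):
    """Bayes' rule: P(conflict | alert), using illustrative alert characteristics."""
    p_alert = hit_rate * base_rate + false_alert_rate * (1.0 - base_rate)
    return (hit_rate * base_rate) / p_alert

# Hypothetical alert characteristics: alerts on 90% of true conflicts,
# and on 10% of non-conflict encounters.
for base_rate in (0.50, 0.20, 0.05):
    p = posterior_conflict_given_alert(base_rate, hit_rate=0.90, false_alert_rate=0.10)
    print(f"base rate {base_rate:.2f} -> P(conflict | alert) = {p:.2f}")
```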


We do believe, however (but cannot confirm), that some of the problems that might otherwise have occurred were mitigated by the use of the three-state likelihood alarm.

The finding that automation benefits emerged on high difficulty trials is a familiar and expected one (e.g., Dixon & Wickens, 2003, 2004; Maltz & Shinar, 2003). It is also important to note that a major feature of the high difficulty was the long distance to CPA, creating a lengthening of the space over which projection must take place. Such lengthening would be typical as we extrapolate the current results to the more strategic uses of the CDTI that are envisioned (e.g., 24 minute look-ahead time). In such a case, pilots would either have to project across a larger region of the display or, if the display scale were minified, project across a slower velocity of symbol movement, a prediction task which is also more difficult and so, again, would be likely to benefit from imperfect automation (Xu et al., 2004).

The current results did reveal three important departures from the anticipated findings. First, in contrast to our predictions of Hypothesis 4, we found that "bad errors" were no worse than "modest errors." As noted above, we believe that the potential cost of higher automation error magnitude was mitigated by pilot strategy, whereby the sounding of an alert (or the highest level of alert) led to a closer scrutiny of the raw data. (Indeed, it is possible that the most urgent (level 3) alert led to an even closer inspection than the modest (level 2) alert.)

Second, we had predicted in Hypothesis 5 that false alerts would be more disruptive than misses (e.g., Bliss, 2003), but did not find this. In interpreting this null effect, we reiterate, as noted above, that the high level alert may have supported a closer inspection that more than offset the costs of the occasional "falseness" of that alert, given that the pilots had perceptual access to the raw data on the visual display. We also note that some of the studies that have attributed more problems to false alarms than to misses (Maltz & Shinar, 2003; Dixon & Wickens, 2004) have varied the frequency of these two types of events, by creating "false alarm-prone" and "miss-prone" systems, as might be the case were the designer to vary the threshold of the alert. As noted above, we did not impose such a variation in our experimental design. Furthermore, some of the studies in which false alarms have proven more problematic than misses have been carried out in a dual task setting, where false alarms, if heeded, force a disengagement from ongoing task(s), a potentially disruptive and annoying attention switch that was not required in the single task context used here.

A third finding that was not anticipated was the distance-time estimation accuracy tradeoff that was produced by automation. That is, while automation appeared to improve the accuracy of performance on the most critical task associated with conflict estimation (what the miss distance at the CPA would be), it actually disrupted the accuracy of estimating the time until that CPA would occur. A possible diagnosis of why this occurred is suggested by the time-underestimation data shown in Figure 3.10. These data reflect three independent effects. The time to CPA was more underestimated with: (1) the increased difficulty of conflicts, (2) the presence of automation, and (3) automation being correct (versus in error).
The first two of these can both be accounted for by a resource tradeoff: more difficult problems, as well as the requirement to process both the automated alert and the raw visual data, require more resources. Such resources are diverted from the time estimation/projection process, which is itself resource limited (Zakay, Block, & Tsal, 1999). Given, then, that time would be more poorly estimated as a consequence of this resource diversion, pilots adopt a "conservative strategy" to


underestimate that time; that is, to give themselves less time available than they really have. It is less clear, however, how the third of these influences (automation correctness) might also lead to an underestimation of time to conflict.

In conclusion, the results have clearly illustrated the benefits that can be provided by even imperfect or "unreliable" CDTI alerting, at least given the relatively high reliability level (about 80%). Such benefits, without costs, are, we believe, the result of three factors: (1) raw data were available to be inspected; (2) pilots were calibrated to the approximate reliability level; and (3) a multilevel alert was employed. We might project that increases in task workload would amplify the benefits, as would decreasing the automation error rate. However, it is possible that these two changes, while amplifying the benefits of correct automation, might have led to the emergence of costs on automation-error trials.

REFERENCES

Beck, H. P., Dzindolet, M. T., Pierce, L. G., & Piatt, N. (2003). Looking forward: A simulation of decision aids in tomorrow's classroom. Proceedings of the 47th Annual Meeting of the Human Factors and Ergonomics Society (pp. 330-334). Santa Monica, CA: Human Factors and Ergonomics Society.

Ben-Yaacov, A., Maltz, M., & Shinar, D. (2002). Effects of an in-vehicle collision avoidance warning system on short- and long-term driving performance. Human Factors, 44(2), 335-342.

Bliss, J. (2003). An investigation of alarm related accidents and incidents in aviation. International Journal of Aviation Psychology, 13(3), 249-268.

Breznitz, S. (1983). Cry-wolf: The psychology of false alarms. Hillsdale, NJ: Lawrence Erlbaum.

Cotté, N., Meyer, J., & Coughlin, J. F. (2001). Older and younger drivers' reliance on collision warning systems. Proceedings of the 45th Annual Meeting of the Human Factors and Ergonomics Society (pp. 277-280). Santa Monica, CA: Human Factors and Ergonomics Society.

Dixon, S. R., & Wickens, C. D. (2003). Imperfect automation in unmanned aerial vehicle flight control (Technical Report AHFD-03-17/MAAD-03-2). Savoy, IL: University of Illinois, Aviation Human Factors Division.

Dixon, S. R., & Wickens, C. D. (2004). Reliability in automated aids for unmanned aerial vehicle flight control: Evaluating a model of automation dependence in high workload (Technical Report AHFD-04-05/MAAD-04-1). Savoy, IL: University of Illinois, Aviation Human Factors Division.

Dzindolet, M. T., Pierce, L. G., Beck, H. P., & Dawe, L. A. (1999). Misuse and disuse of automated aids. Proceedings of the 43rd Annual Meeting of the Human Factors and Ergonomics Society (pp. 339-343). Santa Monica, CA: Human Factors and Ergonomics Society.


Galster, S. M., Bolia, R. S., Roe, M. M., & Parasuraman, R. (2001). Effects of automated cueing on decision implementation in a visual search task. Proceedings of the 45th Annual Meeting of the Human Factors and Ergonomics Society (pp. 321-325). Santa Monica, CA: Human Factors and Ergonomics Society.

Gupta, N., Bisantz, A. M., & Singh, T. (2001). Investigation of factors affecting driver performance using adverse condition warning systems. Proceedings of the 45th Annual Meeting of the Human Factors and Ergonomics Society (pp. 1699-1703). Santa Monica, CA: Human Factors and Ergonomics Society.

Ho, D., & Burns, C. M. (2003). Ecological interface design in aviation domain: Work domain analysis of automated conflict detection and avoidance. Proceedings of the 47th Annual Meeting of the Human Factors and Ergonomics Society. Santa Monica, CA: Human Factors and Ergonomics Society.

Kantowitz, B., Hanowski, R., & Kantowitz, S. (1997). Driver acceptance of unreliable traffic information in familiar and unfamiliar settings. Human Factors, 39, 164-176.

Kuchar, J. K. (2001). Managing uncertainty in decision-aiding and alerting system design. Proceedings of the 6th CNS/ATM Conference, Taipei, Taiwan, March 27-29, 2001.

Lehto, M. R., Papastavrou, J. D., Ranney, T. A., & Simmons, L. A. (2000). An experimental comparison of conservative versus optimal collision avoidance warning system thresholds. Safety Science, 36(3), 185-209.

Magill, S. A. N. (1997). Trajectory predictability and frequency of conflict avoiding action. Defence Evaluation and Research Agency (DERA), paper presented at the CEAS Free Flight Conference, 1997.

Maltz, M., & Meyer, J. (2003). Use of warnings in an attentionally demanding detection task. Human Factors, 43(2), 217-226.

Maltz, M., & Shinar, D. (2003). New alternative methods of analyzing human behavior in cued target acquisition. Human Factors, 45(2), 281-295.

Merwin, D. H., & Wickens, C. D. (1996). Evaluation of perspective and coplanar cockpit displays of traffic information to support hazard awareness in free flight (Technical Report ARL-96-5/NASA-96-1). Savoy, IL: University of Illinois, Aviation Research Lab.

Metzger, U., & Parasuraman, R. (2001). Conflict detection aids for air traffic controllers in free flight: Effects of reliable and failure modes on performance and eye movements. Proceedings of the 11th International Symposium on Aviation Psychology. Columbus, OH: The Ohio State University.

Meyer, J. (2001). Effects of warning validity and proximity on responses to warnings. Human Factors, 43(4), 563-572.

Meyer, J. (2004). Conceptual issues in the study of dynamic hazard warnings. Human Factors.


Meyer, J., & Ballas, E. (1997). A two-detector signal detection analysis of learning to use alarms. Proceedings of the 41st Annual Meeting of the Human Factors and Ergonomics Society (pp. 186-189). Santa Monica, CA: Human Factors and Ergonomics Society.

Molloy, R., & Parasuraman, R. (1996). Monitoring an automated system for a single failure: Vigilance and task complexity effects. Human Factors, 38(2), 311-322.

Parasuraman, R., Hancock, P. A., & Olofinboba, O. (1997). Alarm effectiveness in driver centered collision warning systems. Ergonomics, 40, 390-399.

Parasuraman, R., Masalonis, A. J., & Hancock, P. A. (2000). Fuzzy signal detection theory: Basic postulates and formulas for analyzing human and machine performance. Human Factors, 42(4), 636-659.

Parasuraman, R., & Riley, V. (1997). Humans and automation: Use, misuse, disuse, and abuse. Human Factors, 39(2), 230-253.

Parasuraman, R., Sheridan, T. B., & Wickens, C. D. (2000). A model for types and levels of human interaction with automation. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 30(3), 286-297.

Rantanen, E. M., Wickens, C. D., Xu, X., & Thomas, L. C. (2003). Developing and validating human factors certification criteria for cockpit displays of traffic information avionics. Paper presented at the FAA General Aviation, Aviation Maintenance, and Vertical Flight Human Factors Research Program Review Conference, University of Nevada, Reno.

Rantanen, E. M., Wickens, C. D., Xu, X., & Thomas, L. C. (2004). Developing and validating human factors certification criteria for cockpit displays of traffic information avionics (AHFD-04-1/FAA-04-1). Savoy, IL: University of Illinois, Aviation Human Factors Division.

Rovira, E., & Parasuraman, R. (2002). Sensor to shooter: Task development and empirical evaluation of the effects of automation unreliability. Paper presented at the Annual Midyear Symposium of the American Psychological Association, Divisions 10 (Military Psychology) and 21 (Engineering Psychology), Ft. Belvoir, VA.

Rovira, E., Zinni, M., & Parasuraman, R. (2002). Information and decision uncertainty: Effects of unreliable automation on multi-task performance and workload. Paper presented at the Annual Midyear Symposium of the American Psychological Association, Divisions 10 (Military Psychology) and 21 (Engineering Psychology), Ft. Belvoir, VA.

Sorkin, R. D., Kantowitz, B. H., & Kantowitz, S. C. (1988). Likelihood alarm displays. Human Factors, 30, 445-460.

Sorkin, R. D., & Woods, D. D. (1985). Systems with human monitors: A signal detection analysis. Human-Computer Interaction, 1, 49-75.


St. John, M., & Manes, D. I. (2002). Making unreliable automation useful. Proceedings of the 46th Annual Meeting of the Human Factors and Ergonomics Society. Santa Monica, CA: Human Factors and Ergonomics Society.

Thomas, L. C., Wickens, C. D., & Rantanen, E. M. (2003). Imperfect automation in aviation traffic alerts: A review of conflict detection algorithms and their implications for human factors research. Proceedings of the 47th Annual Meeting of the Human Factors and Ergonomics Society. Santa Monica, CA: Human Factors and Ergonomics Society.

Wickens, C. D. (2000). Imperfect and unreliable automation and its implications for attention allocation, information access and situation awareness (Final Technical Report ARL-0010/NASA-00-2). Savoy, IL: University of Illinois, Aviation Research Laboratory.

Wickens, C. D. (2003). Aviation displays. In P. Tsang & M. Vidulich (Eds.), Principles and practices of aviation psychology (pp. 147-199). Mahwah, NJ: Lawrence Erlbaum Publishers.

Wickens, C. D., Gempler, K., & Morphew, M. E. (2000). Workload and reliability of predictor displays in aircraft traffic avoidance. Transportation Human Factors, 2(2), 99-126.

Wickens, C. D., Helleberg, J., & Xu, X. (2002). Pilot maneuver choice and workload in free flight. Human Factors, 44(2), 171-188.

Wickens, C. D., & Hollands, J. G. (2000). Engineering psychology and human performance (3rd ed.). Upper Saddle River, NJ: Prentice Hall.

Wickens, C. D., Mavor, A. S., Parasuraman, R., & McGee, J. P. (Eds.). (1998). The future of air traffic control: Human operators and automation. Washington, DC: National Academy Press.

Wickens, C. D., Rantanen, E. M., Thomas, L., & Xu, X. (2004). Imperfect automation and CDTI alerting: Implications from literature and systems analysis for display design. Abstracts of the Aerospace Medical Association 75th Annual Scientific Meeting and Supplement to Aviation, Space & Environmental Medicine, 75(4), Section II, B138.

Wickens, C. D., & Xu, X. (2002). Automation trust, reliability and attention HMI 02-03 (Technical Report AHFD-02-14/MAAD-02-2). Savoy, IL: University of Illinois, Aviation Human Factors Division.

Xu, X., Wickens, C. D., & Rantanen, E. (2004). Effects of air traffic geometry on pilots' conflict detection with cockpit display of traffic information. Proceedings of the 48th Annual Meeting of the Human Factors and Ergonomics Society. Santa Monica, CA: Human Factors and Ergonomics Society.

Yeh, M., Merlo, J. L., Wickens, C. D., & Brandenburg, D. L. (2003). Head up versus head down: The costs of imprecision, unreliability, and visual clutter on cue effectiveness for display signaling. Human Factors, 45(3), 390-407.


Yeh, M., & Wickens, C. D. (2001). Display signaling in augmented reality: The effects of cue reliability and image realism on attention allocation and trust calibration. Human Factors, 43(3), 355-365.

Young, M. S., & Stanton, N. A. (1997). Automotive automation: Effects, problems and implications for driver mental workload. In D. Harris (Ed.), Engineering psychology and cognitive ergonomics, Vol. 1 (pp. 347-354). Brookfield, VT: Ashgate.

Zakay, D., Block, R., & Tsal, Y. (1999). Prospective time judgments and workload. In D. Gopher & A. Koriat (Eds.), Attention and performance XVII. Cambridge, MA: MIT Press.


APPENDIX A

EFFECTS OF AIR TRAFFIC GEOMETRY ON PILOTS' CONFLICT DETECTION WITH COCKPIT DISPLAY OF TRAFFIC INFORMATION

Xidong Xu, Christopher D. Wickens, and Esa M. Rantanen
Aviation Human Factors Division, Institute of Aviation
University of Illinois at Urbana-Champaign

We explored the effects of conflict geometry on pilot conflict understanding, manifested in estimation accuracy of three continuous variables: miss distance, time to closest point of approach, and orientation at the closest point of approach. Results indicated (a) increased difficulty of understanding for conflicts that occurred at slower speeds, a longer time into the future, and a longer distance into the future; (b) a tendency for pilots' judgments to be conservative, judging that conflicts were both more risky and would occur sooner than was actually the case; and (c) a "distance-over-speed" bias, such that two aircraft viewed farther apart and converging rapidly were perceived as less risky than two aircraft that were closer to each other and converging at a slower rate, even though the time until a conflict occurred was identical.

INTRODUCTION

The cockpit display of traffic information (CDTI) will play a key role in free flight, allowing pilots to detect and avoid potential conflict with other aircraft, termed hereafter "intruder." Most of the studies that investigated conflict detection in aviation have been conducted in the context of air traffic control (e.g., Endsley, Mogford, & Stein, 1997; Galster, Duley, Masalonis, & Parasuraman, 2001; Metzger & Parasuraman, 2001a, 2001b; Remington, Johnston, Ruthruff, Gold, & Romera, 2000), and only a few have focused on airborne conflict detection by pilots with the CDTI (e.g., Merwin & Wickens, 1996). Several studies have addressed only conflict resolution (as opposed to detection) using a CDTI (e.g., Alexander, Wickens, & Merwin, in press; Scallen, Smith, & Hancock, 1996; Wickens, Gempler, & Morphew, 2000; Wickens, Helleberg, & Xu, 2002). In those investigations, when there was a failure in avoiding a conflict, it has not always been clear whether the conflict was not detected at all, was detected too late, or was detected in time but the maneuver to avoid it was unsuccessful. Furthermore, most of the studies that did investigate conflict detection evaluated only detection rate and response time, using a dichotomous criterion based on whether a cylindrical protected zone was penetrated or not. It has been shown that such a binary criterion is not the best measure of conflict risk (e.g., Masalonis & Parasuraman, 2003). A task analysis (Xu, 2003) demonstrated that the true risk of conflict between the ownship, whose pilot uses the CDTI, and an intruder aircraft is best represented by the miss distance (MD) between the two aircraft at the closest point of approach (CPA), the intruder's orientation relative to the ownship's heading at the CPA (OCPA), and the intruder's time to the CPA (TCPA) (see Figure 1). We believe that the estimation accuracy of these three conflict features better reflects pilots' true understanding of conflict situations and their implications for future maneuvering than does the simple dichotomous measure.

Given the importance of conflict detection with the CDTI in free flight and its under-representation in research, the present study investigated the effects of conflict geometry on pilots' conflict awareness using a CDTI. Dependent variables were estimate errors of the above-described continuous measures of conflict risk (TCPA, MD, and OCPA). The goals of this experiment were to identify the features that would make unaided conflict detection difficult or easy and to identify the biases that would affect estimation performance. Based on our review of the literature regarding the factors influencing time-to-contact estimation and conflict detection performance (Xu, 2003; Xu, Wickens, & Rantanen, 2004), we hypothesized that estimation would be made more difficult, manifested in increased errors, by increasing TCPA (either by increasing the intruder's distance to the CPA, or DCPA, or by reducing the traffic's speed) and by increasing MD. It was further predicted that for the same TCPA, pilots would estimate this time to be longer with a longer distance (DCPA) and faster intruder speed than with a shorter DCPA and slower speed, a phenomenon known as the distance-over-speed bias (Law et al., 1993).

Figure 1. Conflict scenario for two aircraft flying at the same altitude at constant speeds on straight, converging courses, as shown on a CDTI. The ownship would appear to be stationary to the pilot in an egocentric frame of reference.
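To make the geometry of Figure 1 concrete, the following minimal Python sketch (ours, not part of the original study materials) computes TCPA and MD for two aircraft on straight, constant-speed courses; the function and variable names are illustrative assumptions, with positions in nautical miles and speeds in knots.

import numpy as np

def cpa_parameters(own_pos_nm, own_vel_kt, intruder_pos_nm, intruder_vel_kt):
    """Return (TCPA in hours, MD in nm) for two aircraft on straight, constant-speed courses."""
    rel_pos = np.asarray(intruder_pos_nm, float) - np.asarray(own_pos_nm, float)  # intruder relative to ownship
    rel_vel = np.asarray(intruder_vel_kt, float) - np.asarray(own_vel_kt, float)  # relative (closing) velocity
    closing_speed_sq = float(np.dot(rel_vel, rel_vel))
    if closing_speed_sq == 0.0:                      # no relative motion: separation never changes
        return float("inf"), float(np.linalg.norm(rel_pos))
    tcpa = max(-float(np.dot(rel_pos, rel_vel)) / closing_speed_sq, 0.0)   # time of minimum separation
    miss_distance = float(np.linalg.norm(rel_pos + rel_vel * tcpa))        # separation at the CPA
    # OCPA follows from the bearing of (rel_pos + rel_vel * tcpa) relative to the ownship heading.
    return tcpa, miss_distance

# Ownship northbound at 240 kt; intruder 2 nm east and 4 nm north of ownship, heading west at 240 kt.
tcpa_hr, md_nm = cpa_parameters((0, 0), (0, 240), (2, 4), (-240, 0))
print(f"TCPA = {tcpa_hr * 3600:.0f} s, MD = {md_nm:.2f} nm")   # about 45 s and 1.41 nm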



METHOD

Participants

Twenty-four certified flight instructors and non-instructor pilots (20 male and four female), with a mean age of 23.3 years (range 18-49 years), were recruited from the Institute of Aviation, University of Illinois at Urbana-Champaign.

Display and Task

The CDTI depicted the ownship and the intruder in a map (top-down) view (see Figure 2). The display represented the ownship by a white triangle and the intruder by a solid cyan circle. The ownship icon was positioned in the center of the display throughout the experiment, yielding an egocentric view of the traffic situation in which the ownship icon appeared stationary to the participant. The ownship and the intruder flew at the same altitude on straight converging courses at constant, but not necessarily equal, speeds. Participants individually observed the development of a conflict scenario for 15 sec, after which the scenario froze. They were then required to mentally extrapolate the development of the scenario, press a key when they estimated that the CPA was reached (providing the estimate accuracy of TCPA), and move the cursor to the location they believed was the CPA (providing the estimate accuracy of MD and OCPA).

Experimental Design

Independent variables. The independent variables were (1) the intruder's distance to the CPA (DCPA) at the freezing point (1.33 nautical miles or nm, 2.67 nm, and 4.0 nm); (2) the intruder's speed relative to ownship (RS or speed), defined as the speed at which the intruder was moving in the ownship-centered frame of reference and thus determining how rapidly the two aircraft would converge (160 knots, 240 knots, and 480 knots); and (3) miss distance (MD) (.67 nm, 2.67 nm, and 4.67 nm). Note that coupling the longest DCPA (4.0 nm) with the slowest speed (160 knots) resulted in a TCPA of 90 s, and as this TCPA might have been excessively long, resulting in participant distraction and impatience, it was excluded from the experiment. Within the above-described DCPA and speed levels, some pairs of trials had the same TCPA when the freezing occurred, but different DCPAs because of different speed levels. This allowed for the testing of the distance-over-speed bias hypothesis. DCPA was varied between subjects and the other variables were varied within subjects. For the 1.33-nm and 2.67-nm DCPA groups, crossing the three conflict angle (CA) conditions with the three speed levels and the three MD levels yielded 27 conflict geometries. Four replicates of each of the 27 conditions resulted in 108 trials in total for each of these two DCPA groups. For the 4.0-nm DCPA group, there were a total of 72 trials (3 CAs × 2 faster speeds × 3 MDs × 4 replicates). Trials were presented to the participants in a quasi-random order that appeared random to them.
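As a quick cross-check of these levels, the TCPA at the freeze point is simply DCPA divided by relative speed. The short sketch below (ours, not from the report) tabulates the implied TCPAs; the values match those listed in the appendix tables.

dcpa_nm = [1.33, 2.67, 4.0]   # intruder distance to the CPA at the freeze point
rs_kt = [160, 240, 480]       # relative (closing) speed
for d in dcpa_nm:
    for v in rs_kt:
        print(f"DCPA {d:4} nm at RS {v:3} kt -> TCPA of about {d / v * 3600:3.0f} s")
# The 4.0 nm / 160 kt pairing (about 90 s) was excluded from the design, as noted above.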

Figure 2. Schematic illustration of key components of the experimental paradigm and independent variables. The ownship icon was stationary to the participant.

Dependent variables. The dependent variables reported below were the absolute and signed MD estimate errors and the absolute and signed TCPA estimate errors, derived by subtracting the true values from the corresponding estimated values (i.e., |estimated value – true value| and estimated value – true value, respectively). The absolute errors reveal estimation accuracy, whereas the signed errors reveal the direction of estimation (under- or overestimate), an indication of bias. OCPA estimate errors were also analyzed, but their results are not reported below, both due to space constraints and because of their relatively lesser importance.
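For concreteness, here is a minimal sketch (ours, not from the report) of how the signed and absolute errors defined above are derived; the negative sign convention marks underestimates, as in the figures that follow.

def estimate_errors(estimated, true_value):
    """Return (signed error, absolute error); a negative signed error is an underestimate."""
    signed = estimated - true_value
    return signed, abs(signed)

# e.g., a true TCPA of 40 s judged to arrive at 32 s gives a signed error of -8 s (underestimate).
print(estimate_errors(32.0, 40.0))   # (-8.0, 8.0)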

Procedure

Pilots first participated in one practice session, in which they encountered representative conflict geometries. After this, participants completed two blocks (for the 4.0 nm DCPA group) or three blocks (for the 1.33 nm and 2.67 nm DCPA groups) of 36 trials each in a single session. Between blocks, participants were allowed to take a short break to avoid fatigue effects.

RESULTS

Full details of the results can be found in Xu et al. (2004), including the effect of conflict angle (CA) and the effects on OCPA estimate error.

Effects of Distance to Closest Point of Approach (DCPA) and Relative Speed (RS)

For absolute MD estimate error, error increased as DCPA increased (Figure 3), F(2, 21) = 18.37, p < .0001. There was also a main effect of speed, F(2, 28) = 15.22, p < .0001; error did not differ significantly between 160 knots and 240 knots (p > .10), but error was greater at 240 knots than at 480 knots (p < .0001).


Figure 3. Absolute MD estimate errors (in miles) as a function of DCPA (1.33, 2.67, and 4.0 nm) for three relative speed levels (160, 240, and 480 knots).

For signed MD estimate error (Figure 4), there was a greater underestimate of MD at the longest DCPA relative to the two shorter DCPA levels, F(2, 21) = 4.67, p < .05.

Figure 4. Signed MD estimate errors for three DCPA and three speed levels. A negative value indicates that the conflict was estimated to be more risky than it would actually be.

For absolute time (i.e., TCPA) estimate error (Figure 5), there was a monotonic increase in error as DCPA increased, F(2, 21) = 5.75, p < .05, and as speed decreased, F(2, 28) = 40, p < .0001. The interaction between DCPA and speed, F(2, 28) = 5.88, p < .01, suggests a greater effect of DCPA at 160 knots than at 240 knots or 480 knots.

Figure 5. Absolute TCPA estimate errors (in seconds) for three DCPA and three speed levels.

For signed time estimate error (Figure 6), there was no main effect of DCPA, F(2, 21) = 1.83, p > .10, but there was a progressively greater underestimate of time (the CPA estimated to arrive sooner than it actually would) as speed decreased, F(2, 28) = 128.15, p < .0001. The interaction between DCPA and speed, F(2, 28) = 15.69, p < .0001, suggests an amplified effect of speed at the longer DCPA levels; that is, time overestimates increased with DCPA at the fastest speed, whereas underestimates increased with DCPA at the slower speeds.

Figure 6. Signed TCPA estimate errors for three DCPA and three speed levels. A negative value indicates that the conflict was estimated to occur sooner than it actually would. The dashed lines connect pairs of points with identical times (TCPA).


Distance-over-Speed Bias

The three dashed lines in Figure 6 (for signed TCPA estimate errors) connect pairs of conditions having the same true TCPA (20, 30, and 60 s, respectively) but differing in the ratio of distance to speed: short/slow on the left and long/fast on the right. Within each pair of connected points, the estimated TCPA was always shorter (i.e., the CPA was estimated to arrive sooner) for the point with the shorter distance and slower speed. An ANOVA on the data points connected by the three dashed lines, with two levels of distance (short vs. long) and three levels of time (20, 30, and 60 sec), confirmed this trend, F(1, 42) = 24.61, p < .0001, and also revealed that longer times led to greater underestimates of TCPA, F(2, 42) = 34.55, p < .0001, with a significant interaction, F(2, 42) = 3.37, p < .05.

Effect of Miss Distance (MD)

The significant effect of MD on absolute MD estimate error, F(2, 42) = 17.66, p < .0001, indicates greater error as true MD increased. The significant interaction between DCPA and MD, F(4, 42) = 5.60, p < .005, suggests an amplified effect of MD at longer DCPA levels. MD also significantly influenced signed MD estimate error (see Figure 7), F(2, 42) = 14.49, p < .0001; the two longer MD levels were underestimated relative to the shorter MD (p < .01), but there was no difference between the two longer levels (p > .10).

MD did not influence absolute time estimate error, F(2, 42) = 2.08, p > .10, but its significant effect on signed time estimate error, F(2, 42) = 3.38, p < .05, indicates that as true MD decreased (i.e., greater conflict risk), there was a greater underestimate of time.

Figure 7. Signed MD estimate errors for three DCPA levels and three MD levels (.67, 2.67, and 4.67 miles).


DISCUSSION

The general pattern of effects that were observed can be partitioned into those that generally make conflict risk judgments more difficult (less accurate) and those that reflect two systematic forms of estimation bias manifested by the pilots. The results concerning the effects of DCPA, relative speed, and MD on the absolute estimate errors are mostly consistent with our predictions: increasing DCPA and MD, and reducing speed, made conflict detection more difficult, as manifested by increased absolute errors.

Importantly, regarding signed MD estimate error, there was a greater underestimate of MD at the longest DCPA than at the two shorter DCPA levels (Figure 4), as well as at the two longer MD levels compared to the shorter MD (Figure 7). The pilots also had a tendency to underestimate TCPA, in particular at slower speeds and over longer true TCPAs (Figure 6). These findings suggest the first pattern of bias: as the uncertainty regarding the true values of MD and TCPA increased, MD and TCPA were either underestimated or the amount of underestimate increased. We may describe this as a "safety bias" given that, with increased uncertainty, it is a safe strategy to overestimate conflict risk (i.e., to underestimate MD) and to underestimate the time before the conflict occurs (i.e., TCPA). Furthermore, as the conflict situation became more risky (decreased MD), the time to the conflict (i.e., TCPA) was progressively underestimated. The pilots might have perceived the conflict situation to be more urgent as MD decreased, even when the time to the conflict was the same, and this bias could have invoked earlier avoidance maneuvers with decreasing MD, had conflict resolution been required. The conservativeness found in this experiment is consistent with the underestimation of time-to-contact (between vehicles) in driving (Hancock & Manser, 1998) and the underestimation of distance in air traffic control (Boudes & Cellier, 2000). These findings collectively suggest an inherent bias of the operator to err on the side of caution where safety is an issue. This strategy may be good to a certain extent, but when overdone, it may invoke unnecessary avoidance maneuvers resulting in wasted fuel, passenger discomfort, or even conflict with other traffic in the nearby airspace.

The second class of bias, the distance-over-speed bias, was supported in that pilots were influenced more heavily by the distance information than by the speed information in estimating TCPA (see Figure 6). This phenomenon can be explained by Kahneman's (2003) theory of two systems (System 1 and System 2) in human perception. According to Kahneman, System 1 is intuition, which is fast and effortless, and System 2 is reasoning, which is analytical and optimal, but slow and effortful. There is evidence that humans have a tendency to substitute System 1 for System 2, especially when the information required for System 2 is not fully accessible. Time estimation in our experiment was a System 2 process involving both distance perception (a System 1 process) and speed perception (see Figure 8). When the intruder icon was not visible, the speed information was less accessible than the distance information.

Therefore, it is conceivable that the pilots simply substituted distance perception for the more complex integration of distance and speed information. This bias is potentially risky and has an important safety implication, in that the time to conflict may be estimated to be longer than it actually is when the intruder is far away but closing at a fast speed.

Figure 8. Illustration of time estimation (a System 2 process) as the integration of distance perception (a System 1 process, more accessible) and speed perception (less accessible).

In conclusion, the results reported above provide valuable information to help designers of the CDTI improve the display so as to overcome human shortcomings. For example, since it is very difficult to estimate TCPA accurately unaided, an automated TCPA alerting system might take this cognitive burden off the pilot. Automated MD prediction would likewise increase pilots' MD estimation accuracy. Automation may also reduce the extent of the biases shown in this experiment (i.e., the MD and TCPA underestimates and the distance-over-speed bias). Finally, information regarding the various biases might be incorporated into training programs so that pilots can be aware of the types of error to which they are susceptible.

ACKNOWLEDGEMENTS

This research was supported in part by a grant from the Federal Aviation Administration (Award No. DOT 02-G-032). The FAA technical monitor for this grant was Dr. William Krebs, AAR-100. The views expressed in this paper are those of the authors and do not necessarily reflect those of the FAA. We also thank Ron Carbonari for the programming support.

REFERENCES

Alexander, A. L., Wickens, C. D., & Merwin, D. H. (in press). Perspective and coplanar cockpit displays of traffic information: Implications for maneuver choice, flight safety, and mental workload. Int'l J. of Aviation Psych.

Boudes, N., & Cellier, J.-M. (2000). Accuracy of estimations made by air traffic controllers. Int'l J. of Aviation Psych, 10(2), 207-225.

Endsley, M. R., Mogford, R. H., & Stein, E. S. (1997). Controller situation awareness in free flight. Proc. HFES 41st Annual Meeting (pp. 4-8). Santa Monica, CA: HFES.

Galster, S. M., Duley, J. A., Masalonis, A. J., & Parasuraman, R. (2001). Air traffic controller performance and workload under mature free flight. Int'l J. of Aviation Psych, 11(1), 71-93.

Hancock, P. A., & Manser, M. P. (1998). Time-to-contact. In A. Feyer & A. Williamson (Eds.), Occupational injuries: Risk, prevention and intervention. London: Taylor & Francis.

Kahneman, D. (2003). A perspective on judgment and choice: Mapping bounded rationality. Am. Psychologist, 58(9), 697-720.

Law, D. J., Pellegrino, J. W., Mitchell, S. R., Fischer, S. C., McDonald, T. P., & Hunt, E. B. (1993). Perceptual and cognitive factors governing performance in comparative arrival-time judgments. JEP: HP & P, 19(6), 1183-1199.

Masalonis, A. J., & Parasuraman, R. (2003). Fuzzy signal detection theory. Ergonomics, 46(11), 1045-1074.

Merwin, D. H., & Wickens, C. D. (1996). Evaluation of perspective and coplanar cockpit displays of traffic information (ARL-96-5/NASA-96-1). Savoy, IL: UIUC ARL.

Metzger, U., & Parasuraman, R. (2001a). Conflict detection aids for air traffic controllers in free flight. Proc. 11th Int'l Symposium on Aviation Psych. Columbus, OH: OSU.

Metzger, U., & Parasuraman, R. (2001b). The role of the air traffic controller in future air traffic management. Human Factors, 43(4), 519-528.

Remington, R. W., Johnston, J. C., Ruthruff, E., Gold, M., & Romera, M. (2000). Visual search in complex displays. Human Factors, 42(3), 349-366.

Scallen, S. F., Smith, K., & Hancock, P. A. (1996). Pilot actions during traffic situations in a free-flight airspace structure. Proc. HFES 40th Annual Meeting. Santa Monica, CA: HFES.

Wickens, C. D., Gempler, K., & Morphew, M. E. (2000). Workload and reliability of predictor displays in aircraft traffic avoidance. Transp. Human Factors, 2(2), 99-126.

Wickens, C. D., Helleberg, J. R., & Xu, X. (2002). Pilot maneuver and workload in free flight. Human Factors, 44(2), 171-188.

Xu, X. (2003). Conflict detection with cockpit display of traffic information: What is it, what have been found, and what need to be done? Proc. 47th HFES Annual Meeting. Santa Monica, CA: HFES.

Xu, X., Wickens, C. D., & Rantanen, E. M. (2004). Effects of air traffic geometry and conflict alerting system reliability on pilots' conflict detection with cockpit display of traffic information. Aviation Human Factors Division Technical Report. Savoy, IL: UIUC, Institute of Aviation.


APPENDIX B

ASMA 2004, Alaska

Imperfect Automation and CDTI Alerting: Implications for Display Design from Systems Analysis and Research Literature

Christopher D. Wickens
University of Illinois at Urbana-Champaign, Institute of Aviation, Aviation Human Factors Division

The Cockpit Display of Traffic Information (CDTI) has the potential to host an automated alerting system that will alert pilots as to whether a loss of separation is likely to occur over a time horizon greater than that typical of the current TCAS system. As such, the CDTI alerting function can be represented in the context of any generic diagnostic system, as shown in Figure 1, which signals a dangerous condition (loss of separation) as either present (or predicted) or not. The pilot may or may not take an evasive action if a conflict is predicted. The conflict condition will also either exist (or occur in the future) or not. Using the signal detection matrix shown in Figure 1, four classes of joint events or outcomes can be represented. Of these, both misses (a conflict that was not predicted) and false alarms (a conflict that was predicted when none occurred) are aversive events that should be minimized. We also note that this representation can be applied to the alert system alone, or to the human and alert system working as a team, in which case the human will likely have perceptual access to the "raw data" on which the alert system bases its diagnosis. In the case of the CDTI, these raw data are represented by the visual display.
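To make the four joint outcomes concrete, here is a minimal illustrative sketch (ours, not part of the report) that tallies them from per-trial records of whether a conflict actually occurred and whether the alert predicted one; the trial data shown are made up.

from collections import Counter

def classify(conflict_present, alert_raised):
    """Map one trial onto the signal detection matrix of Figure 1."""
    if conflict_present and alert_raised:
        return "hit"
    if conflict_present and not alert_raised:
        return "miss"               # a conflict that was not predicted
    if alert_raised:
        return "false alarm"        # a conflict predicted when none occurred
    return "correct rejection"

# Hypothetical trial records: (conflict actually occurred?, alert predicted a conflict?)
trials = [(True, True), (True, False), (False, True), (False, False), (True, True), (False, False)]
counts = Counter(classify(event, alert) for event, alert in trials)
reliability = (counts["hit"] + counts["correct rejection"]) / len(trials)
print(counts, f"overall reliability = {reliability:.2f}")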

Figure 1. Signal detection (diagnostic) representation of conflict alerting. The conflict event is either present or absent, and the diagnostic decision (prediction) is either yes or no, yielding four joint outcomes: hit, miss, false alarm, and correct rejection (CR). The corresponding pilot responses are to evade or to continue.

Systems Analysis

As represented by signal detection theory, alert systems may vary in their "sensitivity" (the ability to discriminate signals from non-signals, as reflected by low miss and false alarm rates), and this is a reflection of automation reliability. For example, TCAS developers improved the algorithms from version 2 to version 6.04 to increase overall sensitivity. Pilots will be more sensitive if they are better trained and have more attention available to focus on the visual display. Given that the diagnostic system (whether automation or human) is imperfect (reliability

Table A.3. Conflict Geometries for 12 Easy and 12 Hard Trials for Short DCPA (1.33 Miles) and Long MD (> 3.5 Miles)

Task difficulty: Easy (trials 1-12), Hard (trials 13-24)

Weighted estimate error 2.61 2.63 2.83 2.95 2.98 3.16 3.53 3.62 3.64 3.76 3.94 4.18 4.56 4.61 4.74 4.99 5.26 5.59 5.64 5.65 5.85 6.03 6.36 6.53

CA (º) 90 (R) 135 (L) 90 (R) 90 (L) 135(R) 90 (L) 90 (R) 135 (R) 135 (R) 45 (L) 90 (L) 135 (L) 45 (R) 45 (R) 135 (L) 135 (R) 135 (R) 45 (L) 90 (L) 45 (L) 90 (R) 45 (R) 135 (L) 135 (R)

OCPA (º) 0 (B) 315 (B) 0 (F) 0 (F) 45 (B) 0 (F) 0 (F) 315 (F) 45 (B) 315 (F) 0 (B) 45 (F) 315 (B) 45 (F) 45 (F) 315 (F) 315 (F) 45 (B) 0 (B) 45 (B) 0 (B) 315 (B) 315 (B) 45 (B)


RS (knots) 480 480 480 480 480 240 240 240 240 480 480 240 240 240 160 480 160 480 160 240 160 480 160 160

TCPA (sec) 10 10 10 10 10 20 20 20 20 10 10 20 20 20 30 10 30 10 30 20 30 10 30 30

MD (mile) 4.67 5.37 4.67 4.67 5.37 4.43 4.43 5.13 5.13 3.97 4.67 5.13 3.73 3.73 5.60 5.37 5.60 3.97 4.90 3.73 4.90 3.97 5.60 5.60

Table A.4. Conflict Geometries for 12 Easy and 12 Hard Trials for Medium DCPA (2.67 Miles) and Short MD (< 1.5 Miles)

Task difficulty: Easy (trials 1-12), Hard (trials 13-24)

Weighted estimate error 5.52 5.79 5.79 6.15 6.69 7.53 7.70 8.02 8.03 8.07 8.84 8.92 10.20 11.02 11.10 11.52 11.64 12.53 12.96 13.14 13.66 14.24 14.55 16.79

CA (º) 90 (L) 90 (R) 90 (L) 45 (R) 135 (R) 135 (R) 90 (R) 90 (L) 135 (L) 90 (R) 45 (R) 45 (L) 135 (R) 45 (L) 45 (L) 45 (L) 135 (L) 90 (L) 45 (R) 90 (R) 135 (R) 45 (R) 90 (R) 45 (R)

OCPA (º) 0 (B) 0 (F) 0 (F) 45 (F) 315 (F) 45 (B) 0 (F) 0 (F) 45 (F) 0 (B) 315 (B) 45 (B) 45 (B) 45 (B) 315 (F) 315 (F) 45 (F) 0 (B) 45 (F) 0 (F) 315 (F) 315 (B) 0 (B) 45 (F)


RS (knots) 480 480 480 480 480 480 240 240 480 480 480 480 240 240 240 160 240 160 160 160 160 240 240 480

TCPA (sec) 20 20 20 20 20 20 40 40 20 20 20 20 40 40 40 60 40 60 60 60 60 40 40 20

MD (mile) .67 .67 .67 .57 .77 .77 .63 .63 .77 .67 .57 .57 .73 .53 .53 .60 .73 .70 .60 .70 .80 .53 .63 .57

Table A.5. Conflict Geometries for 12 Easy and 12 Hard Trials for Medium DCPA (2.67 Miles) and Medium MD (1.5-3.5 Miles)

Task difficulty: Easy (trials 1-12), Hard (trials 13-24)

Weighted estimate error 2.84 4.36 4.42 4.81 5.30 5.37 5.47 5.50 5.62 5.68 5.77 6.18 7.19 7.40 7.47 7.51 8.22 8.99 9.02 9.87 11.19 11.32 11.33 12.07

CA (º) 90 (R) 135 (L) 135 (R) 90 (L) 90 (R) 45 (R) 90 (R) 90 (L) 135 (L) 45 (L) 45 (L) 90 (R) 90 (L) 135 (L) 45 (R) 135 (R) 135 (R) 135 (L) 45 (R) 135 (R) 45 (R) 135 (L) 90 (L) 90 (L)

OCPA (º) 0 (F) 45 (F) 45 (B) 0 (F) 0 (B) 315 (B) 0 (B) 0 (F) 315 (B) 45 (B) 45 (B) 0 (F) 0 (B) 315 (B) 45 (F) 315 (F) 45 (B) 45 (F) 315 (B) 315 (F) 315 (B) 315 (B) 0 (F) 0 (B)


RS (knots) 480 480 480 480 480 480 160 240 480 480 240 240 240 240 240 240 240 160 240 160 160 160 160 160

TCPA (sec) 20 20 20 20 20 20 60 40 20 20 40 40 40 40 40 40 40 60 40 60 60 60 60 60

MD (mile) 2.67 3.07 3.07 2.67 2.67 2.27 2.80 2.53 3.07 2.27 2.13 2.53 2.53 2.93 2.13 2.93 2.93 3.20 2.13 3.20 2.40 3.20 2.80 2.80

Table A.6. Conflict Geometries for 12 Easy and 12 Hard Trials for Medium DCPA (2.67 Miles) and Long MD (> 3.5 Miles)

Task difficulty: Easy (trials 1-12), Hard (trials 13-24)

Weighted estimate error 3.74 3.80 3.95 4.00 4.12 4.85 4.99 5.39 5.59 5.61 5.96 6.34 7.18 7.23 7.38 8.44 8.59 9.23 9.25 9.28 10.11 10.12 10.62 11.41

CA (º) 90 (R) 45 (L) 90 (L) 135 (R) 90 (R) 45 (R) 90 (L) 45 (L) 90 (R) 135 (L) 135 (L) 45 (R) 135 (R) 135 (R) 90 (R) 135 (L) 90 (L) 90 (L) 45 (R) 135 (L) 90 (R) 135 (R) 45 (R) 45 (L)

OCPA (º) 0 (B) 315 (F) 0 (B) 45 (B) 0 (F) 45 (F) 0 (F) 45 (B) 0 (B) 315 (B) 45 (F) 315 (B) 45 (B) 315 (F) 0 (F) 315 (B) 0 (F) 0 (B) 45 (F) 45 (F) 0 (F) 315 (F) 315 (B) 45 (B)


RS (knots) 480 480 480 480 480 480 240 480 240 480 240 480 240 240 240 240 160 160 240 160 160 480 160 160

TCPA (sec) 20 20 20 20 20 20 40 20 40 20 40 20 40 40 40 40 60 60 40 60 60 20 60 60

MD (mile) 4.67 3.97 4.67 5.37 4.67 3.97 4.43 3.97 4.43 5.37 5.13 3.97 5.13 5.13 4.43 5.13 4.90 4.90 3.73 5.60 4.90 5.37 4.20 4.20

Table A.7. Conflict Geometries for 12 Easy and 12 Hard Trials for Long DCPA (4.0 Miles) and Short MD (< 1.5 Miles)

Task difficulty: Easy (trials 1-12), Hard (trials 13-24)

Weighted estimate error 3.63 4.28 4.59 4.70 6.61 6.93 7.43 8.68 8.90 9.41 10.42 10.62 11.42 11.42 12.04 12.28 14.05 14.15 14.27 14.71 15.35 16.05 17.11 21.70

CA (º) 90 (L) 90 (R) 135 (R) 90 (L) 90 (R) 135(L) 45 (L) 90 (R) 135 (L) 135 (R) 90 (L) 45 (R) 45 (L) 45 (L) 135 (L) 45 (R) 45 (R) 135 (L) 90 (L) 90 (R) 45 (R) 135 (R) 45 (L) 135 (R)

OCPA (º) 0 (B) 0 (F) 315 (F) 0 (F) 0 (B) 45 (F) 315 (F) 0 (F) 315 (B) 315 (F) 0 (F) 315 (B) 315 (F) 45 (B) 315 (B) 45 (F) 315 (B) 45 (F) 0 (B) 0 (B) 45 (F) 45 (B) 45 (B) 45 (B)


RS (knots) 480 480 480 480 480 480 480 240 240 240 240 480 240 240 480 480 240 240 240 240 240 480 480 240

TCPA (sec) 30 30 30 30 30 30 30 60 60 60 60 30 60 60 30 30 60 60 60 60 60 30 30 60

MD (mile) .67 .67 .77 .67 .67 .77 .57 .63 .73 .73 .63 .57 .53 .53 .77 .57 .53 .73 .63 .63 .53 .77 .57 .73

Table A.8. Conflict Geometries for 12 Easy and 12 Hard Trials for Long DCPA (4.0 Miles) and Medium MD (1.5-3.5 Miles)

Task difficulty: Easy (trials 1-12), Hard (trials 13-24)

Weighted estimate error 3.85 4.19 4.25 4.28 5.90 6.05 6.84 6.85 7.05 7.07 7.66 7.74 7.79 8.07 8.36 8.51 8.78 8.95 9.08 9.45 10.93 11.03 11.04 11.51

CA (º) 90 (R) 90 (L) 90 (L) 45 (R) 90 (R) 45 (R) 45 (L) 135 (R) 135 (L) 45 (L) 135 (R) 90 (R) 135 (R) 90 (R) 90 (L) 135 (L) 90 (L) 135 (L) 135 (R) 45 (L) 45 (R) 45 (L) 135 (L) 45 (R)

OCPA (º) 0 (B) 0 (F) 0 (B) 315 (B) 0 (F) 45 (F) 315 (F) 315 (F) 45 (F) 45 (B) 45 (B) 0 (B) 315 (F) 0 (F) 0 (B) 315 (B) 0 (F) 315 (B) 45 (B) 315 (F) 315 (B) 45 (B) 45 (F) 45 (F)


RS (knots) 480 480 480 480 480 480 480 480 480 480 480 240 240 240 240 480 240 240 240 240 240 240 240 240

TCPA (sec) 30 30 30 30 30 30 30 30 30 30 30 60 60 60 60 30 60 60 60 60 60 60 60 60

MD (mile) 2.67 2.67 2.67 2.27 2.67 2.27 2.27 3.07 3.07 2.27 3.07 2.53 2.93 2.53 2.53 3.07 2.53 2.93 2.93 2.13 2.13 2.13 2.93 2.13

Table A.9. Conflict Geometries for 12 Easy and 12 Hard Trials for Long DCPA (4.0 Miles) and Long MD (> 3.5 Miles)

Task difficulty: Easy (trials 1-12), Hard (trials 13-24)

Weighted estimate error 3.55 4.65 4.66 5.33 5.62 6.05 6.10 6.24 6.47 6.63 7.12 7.32 7.49 8.22 8.43 8.46 9.00 9.35 9.35 9.42 10.42 10.45 10.61 11.95

CA (º) 90 (L) 90 (L) 90 (R) 90 (R) 135 (R) 90 (R) 45 (R) 90 (L) 90 (L) 90 (R) 135 (L) 45 (R) 135 (L) 135 (R) 135 (L) 45 (R) 45 (L) 135 (R) 135 (R) 45 (L) 45 (R) 45 (L) 45 (L) 135 (L)

OCPA (º) 0 (F) 0 (B) 0 (F) 0 (F) 45 (B) 0 (B) 45 (F) 0 (F) 0 (B) 0 (B) 315 (B) 315 (B) 45 (F) 45 (B) 45 (F) 315 (B) 45 (B) 315 (F) 315 (F) 315 (F) 45 (F) 45 (B) 315 (F) 315 (B)


RS (knots) 480 240 480 240 480 240 480 240 480 480 480 480 480 240 240 240 240 240 480 240 240 480 480 240

TCPA (sec) 30 60 30 60 30 60 30 60 30 30 30 30 30 60 60 60 60 60 30 60 60 30 30 60

MD (mile) 4.67 4.43 4.67 4.43 5.37 4.43 3.97 4.43 4.67 4.67 5.37 3.97 5.37 5.13 5.13 3.73 3.73 5.13 5.37 3.73 3.73 3.97 3.97 5.13