Aviation Human Factors Division Institute of Aviation
AHFD
University of Illinois at Urbana-Champaign 1 Airport Road Savoy, Illinois 61874
Development and Validation of Objective Performance and Workload Measures in Air Traffic Control
Esa Rantanen
Technical Report AHFD-04-19/FAA-04-7
September 2004
Prepared for Federal Aviation Administration, Civil Aeromedical Institute, Oklahoma City, OK
Cooperative Agreement DOT 02-G-019
DEVELOPMENT AND VALIDATION OF OBJECTIVE PERFORMANCE AND WORKLOAD MEASURES IN AIR TRAFFIC CONTROL
Federal Aviation Administration Cooperative Agreement No. 02-G-019
FINAL REPORT, MAY 6, 2004—SEPTEMBER 30, 2004
Esa M. Rantanen, Ph.D.
Institute of Aviation, Aviation Human Factors Division
University of Illinois at Urbana-Champaign
Willard Airport, Savoy, IL
September 30, 2004
EXECUTIVE SUMMARY

The current POWER “suite” of air traffic control (ATC) metrics includes such measures as traffic count, control duration, and variability in aircraft headings, altitudes, and speeds, as well as latencies of handoff initiation and acceptance and records of a number of different controller activities. While these measures yield a reasonably detailed picture of events in a sector during the recording period, there is a substantial gap between them and measures of real interest, for example, measures of sector complexity and controller workload, situation awareness, and performance. Hence, although vast amounts of objective data are potentially available from the operational ATC system, derivation of valid, reliable, and meaningful measures from them remains a problem. This problem is particularly pronounced when variables of interest are not directly measurable, such as controller workload and performance.

This project represents an emphatically systematic and comprehensive approach to the measurement problem in ATC. The project involved many subtasks. The body of literature on ATC research was organized into a taxonomy of ATC measures. Development of criteria for POWER measures was approached in the larger context of establishing and validating theoretical links between direct and indirect measures, facilitated by the measures taxonomy. Evaluation and validation of POWER measures was performed by analysis of data from three sectors at the Indianapolis center (ZID). Development and validation of new measures involved an experiment using a part-task ATC simulation and POWER output from Kansas City center (ZKC) data together with associated subjective workload measures.

The ATC measures taxonomy revealed a rather sobering picture of the state of the art of ATC research and measurement. Of all the measures reported in the ATC-related literature, only a small minority were used in the vast majority of research, and only a few topics, as indicated by the indirect measures reported, received most of the research attention. The taxonomy also allowed for examination of criteria for the POWER metrics, but the literature provided no theoretical foundations for measurement of the constructs under the class of indirect measures. Based on the review of the airspace complexity and dynamic density literature, the POWER output could readily be expanded by several complexity measures. However, the fragmentary nature of the metrics reviewed and the lack of data for their validation push this possibility far into the future.

Evaluation of POWER measures from three sample sectors from ZID showed that a number of POWER metrics clearly differentiated between sectors of different characteristics, revealing important factors that might affect controller taskload. Metrics that remained invariant between sectors may reflect taskload factors that are independent of sector characteristics. The present POWER output includes many parameters that are also part of the proposed airspace complexity and dynamic density measures. Results from the experimental approach to developing novel metrics for POWER showed a substantial benefit of vertical separation in terms of conflict detection performance, possibly allowing measurement of controller workload by counting the number of aircraft pairs at the same altitude in a given sector at a given time.
Efforts to validate this measure of controller taskload using available data from ZKC and ZID were only partially successful, however, and further research is required to determine the validity and utility of the proposed measure.
TABLE OF CONTENTS

Executive Summary
Table of Contents
1. Introduction
   1.1. Background
   1.2. Purpose and Outline of the Project
      1.2.1. Taxonomy of Measures
      1.2.2. Development of Criteria for the POWER Measures
      1.2.3. Development of Secondary Measures
      1.2.4. Identification of Measurement Scales for Secondary Measures
      1.2.5. Development of Evaluation Criteria for the POWER Measures
      1.2.6. Validation of the POWER Measures
      1.2.7. Development and Validation of New Measures
   1.3. Outline of the Report
   1.4. Summary
2. Taxonomy of Measures in Air Traffic Control Research and Development
   2.1. Method
      2.1.1. Literature Review
      2.1.2. Construction of Taxonomy of Measures
   2.2. Results: Taxonomy of ATC Measures
      2.2.1. Direct Subjective and Direct Objective Measures
      2.2.2. Secondary Measures
      2.2.3. Direct Objective Measures: Human and System Measures
      2.2.4. Indirect Measures
   2.3. Tabulated Results
   2.4. Summary
3. Criteria for ATC Measures
   3.1. Theoretical Foundations for Association of Indirect Variables with Direct Variables
   3.2. Airspace Complexity and Dynamic Density Metrics: A Literature Review
      3.2.1. Terminal airspace metrics
      3.2.2. Dynamic density
      3.2.3. Risk index
      3.2.4. Predictive workload measures
      3.2.5. FAA sector complexity metrics
      3.2.6. Metron algorithm
   3.3. Summary
4. Evaluation of POWER Metrics from Three ZID Sectors
   4.1. Sector Descriptions
   4.2. Evaluation of the POWER Measures
      4.2.1. Qualitative Analysis
      4.2.2. Inferential Statistics
   4.3. Analysis of Voice Data from the ZID Sample
      4.3.1. Review of Relevant Literature
      4.3.2. Method
      4.3.3. Results
   4.4. Summary
5. Development and Validation of New Measures
   5.1. Experimental Investigation: Introduction
      5.1.1. Air Traffic Controller Conflict Detection Performance
      5.1.2. Workload and information processing
      5.1.3. Altitude
      5.1.4. Heading
      5.1.5. Speed
      5.1.6. Premises and hypotheses
   5.2. Method
      5.2.1. Participants
      5.2.2. Apparatus
      5.2.3. Design
      5.2.4. Procedure
   5.3. Results
      5.3.1. Controller group
      5.3.2. Student group
      5.3.3. Comparison of the groups
   5.4. Discussion
   5.5. Implications
   5.6. Validation of Results
      5.6.1. Subjective Workload Estimates from ZKC Data
      5.6.2. Evaluation of ZID Samples
6. Future Directions
   6.1. Premises
   6.2. Experiment 2
      6.2.1. Method
   6.3. Results
      6.3.1. Data Reduction and Coding
      6.3.2. Window Size and Response Initiation
      6.3.3. Taskload and Performance
   6.4. Discussion
7. Summary and Conclusions
8. References
Appendix A. Air Traffic Control Measures Bibliography
Appendix B. Analysis of ATC Voice Communications from ZID Samples
Appendix C. Publications Originating from This Research
1. INTRODUCTION

1.1. Background

All scientific research and subsequent engineering applications are dependent on methods of measurement (Chapanis, 1959). The need for reliable and valid measures is particularly compelling in research and development (R&D) activities on a variety of aspects of air traffic control (ATC). Ever-increasing traffic volume and the collateral demand to improve aviation safety present an unprecedented challenge to the nation’s ATC system and necessitate the introduction of new technologies and automation applications in ATC. While potentially allowing the National Airspace System (NAS) to accommodate the increasing demand, these technologies will conceivably also have a fundamental impact on the system’s functionality as well as on air traffic controllers’ working methods, workload, strategies, and performance (Hopkin, 1995; Wickens, Mavor, & McGee, 1997; Wickens, Mavor, Parasuraman, & McGee, 1998). Thorough assessment and evaluation of the consequences of new technologies on system capacity and safety on one hand, and on the working conditions and performance of individual controllers on the other, is hence of utmost importance. The success of these evaluation efforts, however, is subject to the availability of valid and reliable measures.

Until recently, observation and subjective evaluations have been the primary sources of controller performance data from both operational and simulated ATC. Direct observation, or the Over-the-Shoulder (OTS) method, can be a valid and reliable method for performance measurement if a number of prerequisites are met: The method requires a standardized checklist for the evaluators’ use, all items on the checklist must be explicitly defined, and the evaluators must be extensively trained to achieve reasonable inter-rater and intra-rater reliability. The OTS method does, however, have a number of significant disadvantages: First, it is extremely labor intensive, with a one-to-one evaluator-to-evaluatee ratio. Second, training of evaluators is time-consuming and expensive. Finally, a human evaluator may not be able to provide sufficiently accurate quantitative data for research purposes, due to the limitations of human observation capabilities; this is the case particularly in observation of simultaneous events.

Subjective measures—where subjects themselves rate their performance and workload—are relatively easy to use and inexpensive, and hence have a definite place in human factors research. Expert subject-evaluators can also readily detect and process task-related information that would otherwise require vast amounts of objective data to be recorded, stored, coded, reduced, and analyzed to yield useful measures (Wickens, Gordon, & Liu, 1997). However, evaluations by subject-raters suffer from many of the same problems as those by an outside observer; that is, they depend on the expertise and experience of the subject-raters and their ability to make absolute and comparative judgments. Furthermore, concurrent measurement is often very intrusive and interferes with the task at hand, and measures taken after the task are subject to changing perceptions and decay of memory. Sound and valid arguments have been made both for (e.g., Hennessy, 1990) and against (e.g., Kosso, 1989; Scheffler, 1967) the use of subjective measurements. In any case, it is clear that objective measures are highly desirable.
Particularly in the domain of ATC there is a need to develop valid and reliable objective evaluation methods to be used both in conjunction with high-fidelity, realistic ATC simulation and in operational settings. Additionally, objective
measures can be routinely collected and analyzed concurrently and unobtrusively during the task, and subjected to data mining techniques to detect trends in the system’s performance before any possible problems are manifested as incidents or operational errors.

The availability of measures in ATC, as in any scientific endeavor, is largely contingent on the availability of data in a usable form. Collection of usable data from an environment as dynamic, variable, and multidimensional as ATC is not a trivial matter. Recent technological advances, particularly in the area of digital technology, and the ATC modernization efforts, however, potentially make available new sources of data as well as new data collection and storage methods. An example of access to data from which ATC measures can be derived is the System Activity Recordings (SAR) that store all flight and radar information in Air Route Traffic Control Centers (ARTCCs). These data can be further processed by two specific computer programs, the Data Analysis and Reduction Tool (DART) (Federal Aviation Administration [FAA], 1993) and the National Track Analysis Program (NTAP) (FAA, 1991), which produce a number of text-based output files. These files can be further analyzed by specialized computer programs, such as the Performance and Objective Workload Evaluation Research (POWER) program (Mills, Manning, & Pfleiderer, 1999; Manning, Mills, Fox, & Pfleiderer, 2000). Currently, the POWER program derives over 40 separate measures that describe a variety of aspects of ATC.

The current POWER “suite” includes such measures as traffic count, control duration, and variability in aircraft headings, altitudes, and speeds, as well as latencies of handoff initiation and acceptance. A number of different controller activities are also recorded. Although these measures yield a reasonably detailed picture of what was going on in a sector during the recording period, there is a substantial gap between them and measures of real interest, that is, measures of sector complexity and controller workload, situation awareness, and performance. It is therefore important to distinguish between what can be termed primary—or direct—measures, such as the current POWER measures, and secondary—or indirect—measures that are based on primary measures but that make inferences about variables that were not directly measured (e.g., workload or performance). Although a number of POWER measures have been shown to correlate with other sector complexity and workload measures, their relationship with controller performance is less clear (Manning, Mills, Fox, & Pfleiderer, 2000).

In summary, vast amounts of objective data are potentially available from the operational ATC system. The problem is therefore not in the availability of data, but in the derivation of valid, reliable, and meaningful measures from the abundance of data. This problem is particularly pronounced when variables of interest are not directly measurable, such as controller workload and performance.

1.2. Purpose and Outline of the Project

This project contended with several tasks and problems for research. It was restricted to consideration of objective measures that can be derived from the data collected automatically from the ATC system computers (e.g., HOST). Hence, subjective measurement methods and over-the-shoulder observation were considered only in literature reviews and as possible tools for validation of measures derived from system data.
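To illustrate what measures derived from system data can look like in practice, the following sketch computes a few direct, POWER-like measures from simplified track and event records. It is a minimal illustration only: the record layouts, field names, and sample values are invented for this example and do not represent actual SAR, DART/NTAP, or POWER formats.

```python
from statistics import pstdev

# Hypothetical, simplified track and event records; the field names and
# values are illustrative only and do not reflect actual DART/NTAP output.
tracks = [
    # (aircraft_id, time_s, heading_deg, altitude_ft)
    ("AAL123", 0, 270, 33000), ("AAL123", 12, 272, 33000),
    ("UAL456", 0, 90, 35000), ("UAL456", 12, 95, 34800),
]
events = [
    # (aircraft_id, event, time_s)
    ("AAL123", "handoff_initiated", 30), ("AAL123", "handoff_accepted", 52),
]

# Direct measure: instantaneous traffic count at a given time.
def traffic_count(tracks, t):
    return len({ac for ac, time_s, *_ in tracks if time_s == t})

# Direct measure: variability of headings flown by one aircraft.
def heading_variability(tracks, ac):
    headings = [h for a, _, h, _ in tracks if a == ac]
    return pstdev(headings) if len(headings) > 1 else 0.0

# Direct measure: latency between handoff initiation and acceptance.
def handoff_latency(events, ac):
    times = {e: t for a, e, t in events if a == ac}
    return times["handoff_accepted"] - times["handoff_initiated"]

print(traffic_count(tracks, 0))               # 2 aircraft in the sample
print(heading_variability(tracks, "UAL456"))  # heading spread, degrees
print(handoff_latency(events, "AAL123"))      # 22 s to accept the handoff
```

Measures of this kind are direct in the sense defined below: each is computed entirely from recorded system data, without any inference about the controller's internal state.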
Because of the complexity of the issue of measurement in ATC and the ensuing necessity of a systematic and comprehensive approach to the problem, the project was originally organized into several subtasks.
1.2.1. Taxonomy of Measures

Hadley, Guttman, and Stringer (1999) listed no fewer than 170 separate measures or measurement techniques in their ATCS performance measurement database. Many of these measures, however, overlap substantially, are derivatives of each other, or measure diverse aspects of ATC functions, not all of which are relevant to the performance of an individual ATCS. A taxonomy will allow for a cross-reference between different types of measures, their purposes, and the required data, facilitating development of comprehensive models of ATC performance and of additional measures as new sources of data become available. A taxonomy of measures must therefore be seen as a prerequisite for the other tasks proposed here and in other phases of this research, as well as for any scientific study of ATC.

1.2.2. Development of Criteria for the POWER Measures

No meaningful measurement is possible without a criterion (Meister, 1989). Currently, none of the POWER measures has explicitly defined criteria, and they therefore cannot be used to measure performance (system or human). Therefore, development of criteria for each of the POWER measures is critical. For many of the system measures of POWER these criteria can possibly be derived from flight plans, and for others from sector configuration and airspace constraints (e.g., the average time to accept handoffs).

1.2.3. Development of Secondary Measures

The current POWER measures do not directly measure the variables of most interest to ATC system developers, that is, controller performance, workload, and situation awareness, or the performance of the ATC system as a whole. It is therefore important to determine the extent to which the direct POWER measures can measure these variables. The multidimensional nature of ATC is evident from the numerous attempts to measure the various aspects of the system (e.g., system performance, controller workload, controller performance) and the difficulties encountered in these efforts. The relevance of these efforts to the first subtask described above (i.e., development of an ATC measures taxonomy) is also apparent.

1.2.4. Identification of Measurement Scales for Secondary Measures

Measurement involves the assignment of a number system to represent the values of the variables of interest (Proctor & Van Zandt, 1994). There exist four distinct measurement scales—the nominal, ordinal, interval, and ratio scales—and the measured variables must be explicitly associated with the appropriate scale and its corresponding mathematical properties. The measurement scale is also a possible taxonomic criterion, as described above. Although many of the en route baseline measures computed with POWER (Mills, Manning, & Pfleiderer, 1999) are ratio measures, the measures derived from these are only ordinal at best. Hence, it is only possible to state that one sector is more complex than another, or that one controller experienced higher taskload than another. To fully utilize the measures and allow a full range of mathematical operations on them for further analysis, they should be brought to a ratio scale. This task is not trivial, but must be considered essential given the challenges to the nation’s ATC system and the criticality of the modernization efforts. The identification of measurement scales is also inexorably linked to the development of criteria, as described above. This task is crucial in order to ensure correct interpretation and proper statistical treatment of the data, and the success of subsequent modeling efforts.
1.2.5. Development of Evaluation Criteria for the POWER Measures

The POWER measures will be evaluated by their reliability, validity, sensitivity, and practicality. Specific parameters for each of these criteria will be developed, and each of the POWER measures, as well as the secondary taskload, system performance, and controller performance indices, will be evaluated against them.

1.2.6. Validation of the POWER Measures

As can be seen, all subtasks described above will contribute to the validation of the current POWER measures. Validation efforts in this phase will be limited to the analysis of existing data. Experimental work for further validation issues will be proposed below.

1.2.7. Development and Validation of New Measures

The availability of data is likely to be enhanced with the implementation of new technologies (e.g., the digital voice switch system) in TRACON and ARTCC facilities. This makes it possible to derive new measures to complement and support the existing POWER measures. Part-task simulations will allow for testing of novel measures in realistic ATC tasks but without the complexity and confounding variables present in high-fidelity simulations or operational ATC. The results from the part-task simulations will be used to direct the efforts of identifying usable data for measure development and for interpretation of both existing and new measures.

1.3. Outline of the Report

All of the above subtasks were performed. However, as there was substantial overlap among many of the subtasks, the organization of this report does not narrowly follow the project outline as originally proposed, but rather reflects a coherent approach to the development of objective performance and workload measures in ATC and the role of the POWER metrics in this larger context. Hence, I will first describe the development of the taxonomy of ATC measures (see 1.2.1. above) and the literature review to populate the taxonomy. Since secondary measures (or indirect measures, as termed in the taxonomy) are part of the taxonomy, they will be discussed in this part, too (1.2.3.). Development of criteria for POWER measures belongs to the larger context of establishing and validating sound theoretical links between direct and indirect measures, and hence subtasks 1.2.2., 1.2.3., and 1.2.4. are reported together in section 3. Evaluation and validation of the POWER measures (1.2.5. and 1.2.6.) is discussed in section 4 in light of the analysis of data from three sectors at the Indianapolis air route traffic control center (ZID ARTCC). Finally, development and validation of new measures is described in section 5; results from an experiment involving a part-task ATC simulation are reported, including their implications for developing a novel metric for controller taskload. Partial validation of the new taskload metric is offered using POWER output from Kansas City ARTCC (ZKC) data, originally reported in Manning, Mills, Fox, and Pfleiderer (2001), together with associated subjective workload measures.
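As a preview of the taskload metric examined in section 5 (the count of aircraft pairs at the same altitude in a sector at a given time, as noted in the executive summary), the following minimal sketch shows how such a count could be computed from an altitude snapshot. The data structure, callsigns, and the optional tolerance parameter are assumptions made for illustration only; they are not part of the POWER output.

```python
from itertools import combinations

def co_altitude_pairs(altitudes_ft, tolerance_ft=0):
    """Count aircraft pairs whose altitudes differ by no more than the tolerance.

    `altitudes_ft` maps callsign -> current altitude; a tolerance of 0 counts
    only exactly co-altitude pairs.
    """
    return sum(
        1 for (_, a1), (_, a2) in combinations(altitudes_ft.items(), 2)
        if abs(a1 - a2) <= tolerance_ft
    )

# Snapshot of a hypothetical sector: three aircraft at FL330, one at FL350.
snapshot = {"AAL123": 33000, "UAL456": 33000, "DAL789": 33000, "SWA101": 35000}
print(co_altitude_pairs(snapshot))  # 3 co-altitude pairs among the FL330 aircraft
```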
1.4. Summary

This project represents an emphatically systematic and comprehensive approach to the measurement problem in ATC. This approach serves a dual purpose: On one hand, it is essential to perform a thorough review of past and current research efforts and to organize the findings in a manner that facilitates the use of existing knowledge as a basis for the future evolution of ATC measurement. In other words, my purpose here is to avoid “reinventing the wheel.” On the other hand, it is prudent to proceed cautiously on an issue as complex as ATC measurement and consider carefully all the constraints, assumptions, and threats to validity that may emerge. This not only makes this undertaking more likely to succeed, but also provides a solid foundation for future research efforts and facilitates concurrent developments in the ATC measurement domain.
2. TAXONOMY OF MEASURES IN AIR TRAFFIC CONTROL RESEARCH AND DEVELOPMENT

2.1. Method

This taxonomy is based on an extensive (exhaustive) review of literature related to ATC research and development. We catalogued and classified measures used in published research, and hence the lowest level of classes is based on existing research in this domain. We predominantly sought out original and empirical papers, but in a limited number of instances had to rely on secondary sources, that is, on review articles. The most notable exception to this standard is Hadley, Guttman, and Stringer’s (1999) ATCS performance measurement database, which was included in our review in its entirety.

2.1.1. Literature Review

The literature review was limited by three criteria: (1) English-language publications that involved (2) human subjects in (3) an ATC task. Hence, purely theoretical papers and studies that involved only computer simulations or computational models were excluded from the review. The following sources were searched:

A. Conference Proceedings
   1. Proceedings of the International Symposium on Aviation Psychology, 1st (1981) through 12th (2003).
   2. Proceedings of the Human Factors and Ergonomics Society Annual Meetings, 21st (1977) through 47th (2003).

B. Journals
   1. Human Factors, Vol. 1(1), 1958, through Vol. 46(2), 2004.
   2. International Journal of Aviation Psychology, Vol. 1(1), 1991, through Vol. 14(2), 2004.
   3. Air Traffic Control Quarterly, Vol. 9(1), 2001, through Vol. 12(2), 2004.
   4. The Journal of Air Traffic Control, Vol. 40(1), 1990, through Vol. 46(3), 2004.

C. Government Document Databases
   1. MITRE
   2. MITRE CAASD
   3. NTRS: NASA Technical Reports Server
   4. Eurocontrol Experimental Centre – EEC Docs
   5. FAA Document Library
   6. WJH Technical Center
   7. CAMI
   8. Office of Aviation Research

D. University Libraries
   1. University of Illinois Library
      i. Aerospace and High Technology Database (a.k.a. Aerospace Database)
      ii. Applied Science & Technology Index
      iii. Engineering Village
      iv. IEEE/IEE Full Text
      v. PsycInfo
      vi. Science Citation Index Expanded
   2. Embry-Riddle Aeronautical University
   3. San Jose State University Library

The library databases were searched to cross-check our original literature search from A–C above, and any relevant literature that had not been found in the primary sources was retrieved and reviewed. We reviewed a total of 260 articles that met the criteria expressed above. From this review, a total of 2475 separate incidences of directly measured variables and a total of 2344 indirectly measured variables were recorded.

2.1.2. Construction of Taxonomy of Measures

There are several possible taxonomies for ATC measures. One is the dichotomy between the measurement of system performance and the measurement of an individual controller or a team of controllers (Buckley, DeBaryshe, Hitchner, & Kohn, 1983; Hopkin, 1980). System measures are defined in system terms (i.e., capacity, throughput, delays, and channel occupancy times). Although they are greatly influenced by human performance, they are usually insufficient for the measurement of the performance of an individual controller. Identified task performance, human activity, errors, omissions, physiological and biochemical indices, and subjective assessment are possible measures of an individual controller (Hopkin, 1995). Of these, task performance, activity, and possibly error measures could be derived from DART and NTAP data. Task performance measures compare the controller’s output to what is required in the task and encompass broad measures of errors and omissions. Human activity measures passively record what occurs in the task, such as radio transmissions, equipment usage, and communication and coordination with other sectors, in terms of times, frequencies, and sequences of the activities.

The main division of measures in the taxonomy proposed here is between direct and indirect measures. Direct measures are defined here as those that can be explicitly measured. Examples of such measures include a direct observation of a controller’s action, measurement of a response latency, or a count of aircraft in a sector at a given time. Indirect measures are those that cannot be measured directly but must be inferred from directly measurable variables. For example, certain actions of a controller may be indicative of his or her performance, response latency can be used to make inferences about some covert cognitive processes, and the number of aircraft in a sector can be used to signify sector complexity. We will return to indirect measures later; however, as such
measures are derived from direct measures, it is important to lay a foundation for a systematic examination of these, which is the explicit goal of our taxonomy.

2.2. Results: Taxonomy of ATC Measures

Identification of direct measures in the literature was a relatively simple task, and their classification was straightforward as well. In the resulting taxonomy, the separately reported direct measures were grouped under 65 distinct classes, down to the sixth level in some cases. The taxonomic structure is based on published research literature and hence does not contain classes under which no examples of measures could be found in the literature. Indirect measures were classified in a similar manner, under 36 distinct categories. The level of classification was dependent on the details provided in the literature; in many cases only a very coarse classification could be made due to the lack of specification of the variables of interest in the reviewed articles.

2.2.1. Direct Subjective and Direct Objective Measures

Within the class of direct measures, the next sublevel is created by differentiating between subjective and objective measures. The method for making this distinction is identification of objective criteria, a prerequisite for meaningful measurement (Meister, 1989). An example can be seen in Figure 2.1. The classification here is based on an objective criterion against which the observer bases his or her judgment. For example, an FAA OTS evaluation sheet (Form 3430) contains items such as “standard phraseology is not adhered to” and “awareness is not maintained.” The former has an objective criterion, the standard phraseology as published in the FAA Air Traffic Control Handbook (7110.65), and hence any deviation from it would warrant a check in the aforementioned box. The latter, however, while providing several guidelines for making this judgment, leaves much latitude for a subjective assessment. Note that by applying this scheme, observer ratings will appear under both subjective and objective categories; the difference between these two categories hence lies in the presence of an objective criterion.

The role of a criterion is explicit in the case of error measures. The term “error” clearly presupposes some threshold value, or outcome of an action, that separates a correct action from an incorrect or erroneous one. In other words, without explicit criteria no actions could be classified as errors.
Figure 2.1. The role of a criterion in measurement. Here, the action outcome is regarded as a direct measure (i.e., directly measurable or observable); application of a criterion allows for derivation of another measure, in this case error, which is also classified as a direct measure.
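The scheme in Figure 2.1 can be expressed compactly in code: a directly measured action outcome yields an error measure only once a criterion is applied. The separation values and the threshold in the sketch below are illustrative assumptions, not data from this study.

```python
# Directly measured action outcomes: closest horizontal separation (NM)
# achieved for a set of aircraft pairs (values are illustrative only).
observed_separation_nm = {"pair_1": 7.2, "pair_2": 4.1, "pair_3": 12.5}

# Applying a criterion turns each outcome into a second direct measure:
# an error/correct classification.
SEPARATION_CRITERION_NM = 5.0  # assumed threshold for this example

classification = {
    pair: ("error" if sep < SEPARATION_CRITERION_NM else "correct")
    for pair, sep in observed_separation_nm.items()
}
print(classification)  # {'pair_1': 'correct', 'pair_2': 'error', 'pair_3': 'correct'}
```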
2.2.2. Secondary Measures

Another example of the importance of criteria in the classification of measures is the differentiation between what are termed here primary and secondary measures. Primary measures are those that are measured directly, for example, the count of aircraft in a sector, or the number of heading changes per aircraft. Secondary measures are those derived from primary measures, for example the average traffic count in a sector, its variance, or its range. In the case of the average, the criteria are implicit (the time duration or interval during which the aircraft were counted and the number of samples) but nevertheless have an impact on the eventual measure.

2.2.3. Direct Objective Measures: Human and System Measures

The subcategory of direct, objective measures was further divided into two major sub-subcategories, system measures and human measures. The criterion for this division was simple: variables that were not directly dependent on human performance were classified as system measures. Hence, for example, the number of keystrokes required to enter data into the system is system-related; there is little the controller can do about it, although it can certainly be used as a possible metric of taskload (an indirect measure). Keystroke errors, on the other hand, would be classified as a human measure.

Direct, objective, human measures fall under three further categories: observer-rated (recall the role of explicit criteria in this class); cognitive, which is further divided into two time-based classes, response time and time required to perform a task; and a variety of psychophysiological metrics. Objective observer-rated measures predominantly consist of either graded exam-type questionnaires or problems, to which a correct answer can be known, or observed behavior or performance compared to a given standard. A primary example of measures in this class is the Situation Awareness Global Assessment Technique (SAGAT; Endsley, 2000). This measurement technique is based on queries to the subject on a variety of aspects of the given situation. In the ATC context these often involve recall of aircraft callsigns, altitudes, positions, speeds, etc. Since the subject’s responses are compared to the actual situation, the measure is objective by our definition (i.e., it has explicit criteria).

The distinction between response time and time required is somewhat fluid. Response time is extensively used to probe a wide variety of covert cognitive processes, and it is also readily available and measurable in many settings. Time required to perform a task as a separate category is justified by a similar class of system measures and as a counterpart of the time required vs. time available paradigm, which is a basis of some taskload and complexity constructs (Chatterji & Sridhar, 2001; Chiles & Alluisi, 1979; Mogford, Guttman, Morrow, & Kopardekar, 1995; Tham et al., 1993). Finally, it is perhaps noteworthy that physiological measures appeared in very few cases in the reviewed ATC research literature; the vast majority of those found were different eye movement measures.

2.2.4. Indirect Measures

The indirect measures identified in the literature are depicted in Table 2.3. Again, the main sub-classification scheme was between human and system measures. Of human measures, workload, situation awareness, and performance were clearly those of most interest. Of system measures, airspace or sector complexity has attracted much attention in recent years; the
distinction between complexity and taskload is not very rigid, as complexity is often considered an integral component of taskload. It should be noted that in this review the definitions of the indirect constructs varied widely, and often no explicit definition was provided. The classification of indirect variables is therefore based only on the authors’ own account of the measure. Indirect variables are reviewed in more detail in a separate section below.

2.3. Tabulated Results

The resultant taxonomy of ATC measures, populated with examples recorded in the relevant literature, is presented in Tables 2.1–2.38. We present these data in three different formats: First, the basic taxonomic structure is presented in Tables 2.1 (direct measures) and 2.3 (indirect measures) with both the number and proportion of recorded incidences of corresponding measures in the reviewed literature. The same data are then arranged in descending order according to the proportion of measures in each category, with a cumulative percentage calculated in a separate column, indicating the number of measure categories accounting for a given percentage of all measures reported in the literature (Tables 2.2 and 2.4, for direct and indirect measures, respectively). Typically, only a very small number of separate classes of measures accounted for a majority of the instances the measures were used in the literature; for example, only five (out of 65) classes of direct measures accounted for over 50% (51.96%) of the direct measures reported in the literature. For indirect measures, merely four (out of 36) classes accounted for nearly 75% (74.66%) of all reported indirect measures in the literature. Finally, Tables 2.5–2.38 show the associations of direct measures with indirect measures, as reported in the literature. These tables, too, are ordered according to the frequency (and proportion) of specific classes of direct measures and accompanied by a column of cumulative percentage. The picture that emerges is similar to what was discussed above: a relatively small number of direct measure categories were associated with indirect measures in the majority of cases. For example, workload was associated with only two categories of direct measures, subjective workload ratings and eye-movement recording, in over 50% (51.01%) of all incidences of workload measurement in the reviewed literature.
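The ordering used in Tables 2.2 and 2.4 is a simple computation over the tallied incidences. The sketch below ranks a handful of measure classes by frequency and accumulates their percentage shares; the four class names and counts are taken from the top of Table 2.2, but the percentages are computed over this subset only, purely for illustration.

```python
# Tallies of measure classes -> number of reported incidences
# (top four classes from Table 2.2; percentages here cover this subset only).
tallies = {"Count: Action": 528, "Time: Duration": 199,
           "Workload: Mental": 188, "Eye movement": 180}

total = sum(tallies.values())
ranked = sorted(tallies.items(), key=lambda kv: kv[1], reverse=True)

cumulative = 0.0
for measure, count in ranked:
    share = 100 * count / total
    cumulative += share
    print(f"{measure:20s} {count:5d} {share:6.2f}% {cumulative:7.2f}%")
```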
Table 2.1 ATC measures taxonomy, direct measures. The numbers in parentheses refer to counts and proportions of instances measures were used in the literature for each level in the taxonomy.

1. Direct Measures
   1.1. Subjective Measures (537, 21.73%)
      1.1.1. Self-rated (440, 17.81%)
         1.1.1.1. Taskload (63, 2.55%)
            1.1.1.1.1. Time Pressure (2, 0.08%)
            1.1.1.1.2. Activity (4, 0.16%)
            1.1.1.1.3. Acceptability (43, 1.74%)
            1.1.1.1.4. Complexity (12, 0.49%)
         1.1.1.2. Workload (189, 7.61%)
            1.1.1.2.1. Mental (188, 7.61%)
            1.1.1.2.2. Physical (1, 0.04%)
         1.1.1.3. Performance (116, 4.69%)
            1.1.1.3.1. Situation Awareness (44, 1.78%)
            1.1.1.3.2. Safety (16, 0.65%)
         1.1.1.4. Emotion (44, 1.78%)
            1.1.1.4.1. Confidence (8, 0.32%)
            1.1.1.4.2. Frustration (0, 0%)
            1.1.1.4.3. Trust (11, 0.45%)
            1.1.1.4.4. Mood (2, 0.08%)
            1.1.1.4.5. Fatigue (14, 0.75%)
            1.1.1.4.6. Stress (8, 0.32%)
      1.1.2. Observer-rated (97, 3.93%)
         1.1.2.1. Workload (11, 0.44%)
            1.1.2.1.1. Mental (10, 0.4%)
            1.1.2.1.2. Physical (0, 0%)
         1.1.2.2. Taskload (16, 0.65%)
            1.1.2.2.1. Time Pressure (0, 0%)
            1.1.2.2.2. Activity (3, 0.12%)
            1.1.2.2.3. Complexity (1, 0.04%)
         1.1.2.3. Performance (81, 3.28%)
            1.1.2.3.1. Safety (7, 0.28%)
            1.1.2.3.2. Problem-Solving (4, 0.16%)
            1.1.2.3.3. Decision-Making (3, 0.12%)
            1.1.2.3.4. Test Score (2, 0.08%)
            1.1.2.3.5. Strategy (12, 0.49%)
            1.1.2.3.6. Coordination/Communication (9, 0.36%)
   1.2. Objective Measures (1934, 78.27%)
      1.2.1. System Measures (1171, 47.39%)
         1.2.1.1. Count (807, 32.66%)
            1.2.1.1.1. Action (528, 21.37%)
            1.2.1.1.2. Event (149, 6.03%)
            1.2.1.1.3. Traffic (104, 4.21%)
            1.2.1.1.4. Infrastructure (22, 0.89%)
         1.2.1.2. Time (260, 10.52%)
            1.2.1.2.1. Duration (199, 8.05%)
            1.2.1.2.2. Latency (44, 1.78%)
         1.2.1.3. Distance (68, 2.75%)
            1.2.1.3.1. Horizontal (26, 1.05%)
            1.2.1.3.2. Vertical (5, 0.2%)
            1.2.1.3.3. Euclidean (33, 1.34%)
            1.2.1.3.4. Angular (3, 0.12%)
         1.2.1.4. Speed (30, 1.21%)
         1.2.1.5. Area (0, 0.0%)
         1.2.1.6. Volume (1, 0.04%)
         1.2.1.7. Fuel Consumption (4, 0.16%)
         1.2.1.8. Degrees of Freedom (0, 0.0%)
      1.2.2. Human Measures (763, 30.88%)
         1.2.2.1. Observer-rated (289, 11.7%)
            1.2.2.1.1. Performance (189, 7.65%)
            1.2.2.1.2. Situation Awareness (65, 2.63%)
            1.2.2.1.3. Strategy (8, 0.32%)
         1.2.2.2. Cognitive (207, 8.38%)
            1.2.2.2.1. Response Time/Latency (123, 4.98%)
            1.2.2.2.2. Time Required (2, 0.08%)
            1.2.2.2.3. Recall Ability (62, 2.51%)
         1.2.2.3. Psychophysiological (227, 9.19%)
            1.2.2.3.1. Eye Movement (180, 7.28%)
               1.2.2.3.1.1. Saccade length
               1.2.2.3.1.2. Number of fixations
               1.2.2.3.1.3. Dwell time
               1.2.2.3.1.4. Pupil diameter
               1.2.2.3.1.5. Blink rate
            1.2.2.3.2. Heart rate (25, 1.01%)
            1.2.2.3.3. EEG (8, 0.32%)
            1.2.2.3.4. Respiration rate (1, 0.04%)
            1.2.2.3.5. Biochemical (5, 0.2%)
            1.2.2.3.6. EMG (2, 0.08%)
         1.2.2.4. Demographic Data (13, 0.53%)
         1.2.2.5. Simulator Score (26, 1.05%)
Table 2.2 Direct measures ranked by their frequency of use in the literature. Cumulative percentage is given in the last column (measure classes as defined in Table 2.1).

Tax. No.       Measure                                        Count      %   Cum. %
1.2.1.1.1.     Count: Action                                    528  21.37    21.37
1.2.1.2.1.     Time: Duration                                   199   8.05    29.42
1.2.2.1.1.     Observer-Rated (Obj.): Performance               189   7.65    37.07
1.1.1.2.1.     Self-Rated Workload: Mental                      188   7.61    44.68
1.2.2.3.1.     Psychophysiological: Eye Movement                180   7.28    51.96
1.2.1.1.2.     Count: Event                                     149   6.03    57.99
1.2.2.2.1.     Cognitive: Response Time                         123   4.98    62.97
1.2.1.1.3.     Count: Traffic                                   104   4.21    67.18
1.2.2.1.2.     Observer-Rated (Obj.): Sit. Awareness             65   2.63    69.81
1.2.2.2.3.     Cognitive: Recall Ability                         62   2.51    72.32
1.1.1.3.       Self-Rated: Performance                           56   2.27    74.59
1.1.1.3.1.     Self-Rated Performance: Sit. Awareness            44   1.78    76.37
1.1.2.3.       Observer-Rated (Subj.): Performance               44   1.78    78.15
1.2.1.2.2.     Time: Latency                                     44   1.78    79.93
1.1.1.1.3.     Self-Rated Taskload: Acceptability                43   1.74    81.67
1.2.1.3.3.     Distance: Euclidean                               33   1.34    83.00
1.2.1.4.       Speed                                             30   1.21    84.22
1.1.1.         Self-Rated                                        28   1.13    85.35
1.2.2.1.       Observer-Rated (Obj.)                             27   1.09    86.44
1.2.1.3.1.     Distance: Horizontal                              26   1.05    87.49
1.2.2.5.       Simulator Score                                   26   1.05    88.55
1.2.2.3.2.     Psychophysiological: Heart Rate                   25   1.01    89.56
1.2.1.1.4.     Count: Infrastructure                             22   0.89    90.45
1.2.2.2.       Cognitive                                         20   0.81    91.26
1.2.1.2.       Time                                              17   0.69    91.95
1.1.1.3.2.     Self-Rated Performance: Safety                    16   0.65    92.59
1.1.1.4.5.     Self-Rated Emotion: Fatigue                       14   0.57    93.16
1.2.2.4.       Demographic Data                                  13   0.53    93.69
1.1.1.1.4.     Self-Rated Taskload: Complexity                   12   0.49    94.17
1.1.2.3.5.     Observer-Rated (Subj.) Performance: Strategy      12   0.49    94.66
1.1.1.4.3.     Self-Rated Emotion: Trust                         11   0.45    95.10
1.1.2.1.1.     Observer-Rated (Subj.) Workload: Mental           10   0.40    95.51
1.1.2.3.6.     Observer-Rated (Subj.): Coord./Comm.               9   0.36    95.87
1.1.1.4.1.     Self-Rated Emotion: Confidence                     8   0.32    96.20
1.1.1.4.6.     Self-Rated Emotion: Stress                         8   0.32    96.52
1.2.2.1.3.     Observer-Rated (Obj.): Strategy                    8   0.32    96.84
1.2.2.3.3.     Psychophysiological: EEG                           8   0.32    97.17
1.1.2.3.1.     Observer-Rated (Subj.) Performance: Safety         7   0.28    97.45
1.2.2.3.       Psychophysiological                                6   0.24    97.69
1.2.1.3.2.     Distance: Vertical                                 5   0.20    97.90
1.2.2.3.5.     Psychophysiological: Biochemical                   5   0.20    98.10
1.1.1.1.2.     Self-Rated Taskload: Activity                      4   0.16    98.26
1.1.2.3.2.     Observer-Rated (Subj.): Problem-Solving            4   0.16    98.42
1.2.1.1.       Count                                              4   0.16    98.58
1.2.1.7.       Fuel Consumption                                   4   0.16    98.75
1.1.2.2.2.     Observer-Rated (Subj.) Taskload: Activity          3   0.12    98.87
1.1.2.3.3.     Observer-Rated (Subj.): Decision-Making            3   0.12    98.99
1.2.1.3.4.     Distance: Angular                                  3   0.12    99.11
1.1.1.1.       Self-Rated: Taskload                               2   0.08    99.19
1.1.1.1.1.     Self-Rated Taskload: Time Pressure                 2   0.08    99.27
1.1.1.4.4.     Self-Rated Emotion: Mood                           2   0.08    99.35
1.1.2.3.4.     Observer-Rated (Subj.): Test Score                 2   0.08    99.43
1.2.2.2.2.     Cognitive: Time Required                           2   0.08    99.51
1.2.2.3.6.     Psychophysiological: EMG                           2   0.08    99.60
1.1.1.2.2.     Self-Rated Workload: Physical                      1   0.04    99.64
1.1.1.4.       Self-Rated: Emotion                                1   0.04    99.68
1.1.2.1.       Observer-Rated (Subj.): Workload                   1   0.04    99.72
1.1.2.2.       Observer-Rated (Subj.): Taskload                   1   0.04    99.76
1.1.2.2.3.     Observer-Rated (Subj.) Taskload: Complexity        1   0.04    99.80
1.2.1.         System Measures                                    1   0.04    99.84
1.2.1.3.       Distance                                           1   0.04    99.88
1.2.1.6.       Volume                                             1   0.04    99.92
1.2.2.         Human Measures                                     1   0.04    99.96
1.2.2.3.4.     Psychophysiological: Respiration Rate              1   0.04   100.00
Table 2.3 ATC measures taxonomy, indirect measures. The numbers in parentheses refer to counts and proportions of instances measures were used in the literature for each level in the taxonomy.

2. Indirect Measures
   2.1. Human
      2.1.1. Workload (445, 19.02%)
      2.1.2. Performance (1029, 43.97%)
      2.1.3. Situation Awareness (136, 5.81%)
      2.1.4. Trust (11, 0.47%)
      2.1.5. Confidence (8, 0.34%)
      2.1.6. Acceptability (9, 0.38%)
      2.1.7. Adaptation (16, 0.68%)
      2.1.8. Behavioral traits (5, 0.21%)
      2.1.9. Cognition
         2.1.9.1. Perception (3, 0.13%)
         2.1.9.2. Attention (0, 0%)
         2.1.9.3. Decision making (109, 4.66%)
         2.1.9.4. Information management (4, 0.17%)
         2.1.9.5. Memory (29, 1.24%)
         2.1.9.6. Mental models (18, 0.77%)
      2.1.10. Competence (9, 0.38%)
      2.1.11. Skill (19, 0.81%)
      2.1.12. Fatigue, sleepiness (8, 0.34%)
      2.1.13. Sleep quality (5, 0.21%)
      2.1.14. Mood (3, 0.13%)
      2.1.15. Stress (34, 1.45%)
      2.1.16. Strategy (47, 2.01%)
   2.2. System Measures
      2.2.1. Taskload (6, 0.26%)
      2.2.2. Complexity (24, 1.03%)
      2.2.3. Safety (118, 5.04%)
      2.2.4. Efficiency (137, 5.85%)
      2.2.5. Capacity (0, 0%)
      2.2.6. Economy (0, 0%)
      2.2.7. Relevance (8, 0.34%)
      2.2.8. Communications (3, 0.13%)
      2.2.9. Simulation (30, 1.28%)
         2.2.9.1. Simulation fidelity (7, 0.30%)
         2.2.9.2. Simulation validity (3, 0.13%)
         2.2.9.3. Scenario fidelity (14, 0.60%)
         2.2.9.4. Scenario difficulty (2, 0.09%)
         2.2.9.5. Scenario impact (1, 0.04%)
         2.2.9.6. Scenario qualities (3, 0.13%)
      2.2.10. Usability (41, 1.75%)
         2.2.10.1. Readability (9, 0.38%)
Table 2.4 Indirect measures ranked by their frequency of use in the literature. Cumulative percentage is given in the last column.

Tax. No.    Measure                       Freq.      %   Cum. %
2.1.2.      Performance                    1029  43.97    43.97
2.1.1.      Workload                        445  19.02    62.99
2.2.4.      Efficiency                      137   5.85    68.85
2.1.3.      Situation Awareness             136   5.81    74.66
2.2.3.      Safety                          118   5.04    79.70
2.1.9.3.    Decision Making                 109   4.66    84.36
2.1.16.     Strategy                         47   2.01    86.37
2.1.15.     Stress                           34   1.45    87.82
2.2.10.     Usability                        32   1.37    89.19
2.1.9.5.    Memory                           29   1.24    90.43
2.2.2.      Complexity                       24   1.03    91.45
2.2.5.      Capacity                         23   0.98    92.44
2.1.11.     Skill                            19   0.81    93.25
2.1.9.6.    Mental Models                    18   0.77    94.02
2.1.7.      Adaptation                       16   0.68    94.70
2.2.9.3.    Scenario Fidelity                14   0.60    95.30
2.1.4.      Trust                            11   0.47    95.77
2.1.10.     Competence                        9   0.38    96.15
2.1.6.      Acceptability                     9   0.38    96.54
2.2.10.1.   Readability                       9   0.38    96.92
2.1.12.     Fatigue                           8   0.34    97.26
2.1.5.      Confidence                        8   0.34    97.61
2.2.7.      Relevance                         8   0.34    97.95
2.2.9.1.    Simulation Fidelity               7   0.30    98.25
2.2.1.      Taskload                          6   0.26    98.50
2.1.13.     Sleep Quality                     5   0.21    98.72
2.1.8.      Behavioral Traits                 5   0.21    98.93
2.1.9.4.    Information Management            4   0.17    99.10
2.1.14.     Mood                              3   0.13    99.23
2.1.9.1.    Perception                        3   0.13    99.36
2.2.8.      Communications                    3   0.13    99.49
2.2.9.2.    Simulation Validity               3   0.13    99.62
2.2.9.6.    Scenario Qualities                3   0.13    99.74
2.2.9.4.    Scenario Difficulty               2   0.09    99.83
2.2.9.5.    Scenario Impact                   1   0.04    99.87
2.1.9.2.    Attention                         0   0.00    99.87
Table 2.5 Directly measurable variables used to measure workload. Indirect Variable: Workload (2.1.1.).

Direct Var.    Measure                                      Freq.      %   Cum. %
1.1.1.2.1.     Self-Rated Workload: Mental                    188  42.25    42.25
1.2.2.3.1.     Psychophysiological: Eye Movement               39   8.76    51.01
1.2.1.1.1.     Count: Action                                   37   8.31    59.33
1.2.1.2.1.     Time: Duration                                  37   8.31    67.64
1.2.1.1.3.     Count: Traffic                                  25   5.62    73.26
1.2.1.1.2.     Count: Event                                    22   4.94    78.20
1.2.2.3.2.     Psychophysiological: Heart Rate                 16   3.60    81.80
1.2.2.1.1.     Observer-Rated (Obj.): Performance              12   2.70    84.49
1.2.2.2.1.     Cognitive: Response Time                        11   2.47    86.97
1.1.2.1.1.     Observer-Rated (Subj.) Workload: Mental         10   2.25    89.21
1.2.1.2.2.     Time: Latency                                    8   1.80    91.01
1.2.1.4.       Speed                                            8   1.80    92.81
1.2.2.2.       Cognitive                                        7   1.57    94.38
1.2.1.1.4.     Count: Infrastructure                            5   1.12    95.51
1.2.1.2.       Time                                             3   0.67    96.18
1.2.2.1.       Observer-Rated (Obj.)                            3   0.67    96.85
1.1.1.1.1.     Self-Rated Taskload: Time Pressure               2   0.45    97.30
1.2.2.3.3.     Psychophysiological: EEG                         2   0.45    97.75
1.1.1.2.2.     Self-Rated Workload: Physical                    1   0.22    97.98
1.1.2.1.       Observer-Rated (Subj.): Workload                 1   0.22    98.20
1.1.2.2.2.     Observer-Rated (Subj.) Taskload: Activity        1   0.22    98.43
1.1.2.3.       Observer-Rated (Subj.): Performance              1   0.22    98.65
1.2.1.3.2.     Distance: Vertical                               1   0.22    98.88
1.2.1.6.       Volume                                           1   0.22    99.10
1.2.2.1.2.     Observer-Rated (Obj.): Sit. Awareness            1   0.22    99.33
1.2.2.3.       Psychophysiological                              1   0.22    99.55
1.2.2.3.4.     Psychophysiological: Respiration Rate            1   0.22    99.78
1.2.2.3.5.     Psychophysiological: Biochemical                 1   0.22   100.00
Table 2.6 Directly measurable variables used to measure performance. Indirect Variable: Performance (2.1.2.).

Direct Var.    Measure                                      Freq.      %   Cum. %
1.2.1.1.1.     Count: Action                                  308  29.96    29.96
1.2.2.1.1.     Observer-Rated (Obj.): Performance             105  10.21    40.18
1.2.2.2.1.     Cognitive: Response Time                        89   8.66    48.83
1.2.1.2.1.     Time: Duration                                  84   8.17    57.00
1.2.1.1.2.     Count: Event                                    80   7.78    64.79
1.1.1.3.       Self-Rated: Performance                         56   5.45    70.23
1.2.1.1.3.     Count: Traffic                                  37   3.60    73.83
1.2.2.2.3.     Cognitive: Recall Ability                       37   3.60    77.43
1.2.2.3.1.     Psychophysiological: Eye Movement               33   3.21    80.64
1.1.2.3.       Observer-Rated (Subj.): Performance             31   3.02    83.66
1.2.1.2.2.     Time: Latency                                   29   2.82    86.48
1.2.2.5.       Simulator Score                                 25   2.43    88.91
1.2.1.3.3.     Distance: Euclidean                             23   2.24    91.15
1.1.1.3.1.     Self-Rated Performance: Sit. Awareness          16   1.56    92.70
1.2.1.3.1.     Distance: Horizontal                            10   0.97    93.68
1.2.1.1.4.     Count: Infrastructure                            9   0.88    94.55
1.2.1.4.       Speed                                            9   0.88    95.43
1.2.1.2.       Time                                             6   0.58    96.01
1.1.2.3.6.     Observer-Rated (Subj.): Coord./Comm.             5   0.49    96.50
1.2.2.1.2.     Observer-Rated (Obj.): Sit. Awareness            4   0.39    96.89
1.1.1.         Self-Rated                                       3   0.29    97.18
1.1.2.3.3.     Observer-Rated (Subj.): Decision-Making          3   0.29    97.47
1.1.2.3.5.     Observer-Rated (Subj.): Strategy                 3   0.29    97.76
1.2.1.1.       Count                                            3   0.29    98.05
1.2.1.7.       Fuel Consumption                                 3   0.29    98.35
1.2.2.2.       Cognitive                                        3   0.29    98.64
1.1.2.3.1.     Observer-Rated (Subj.) Performance: Safety       2   0.19    98.83
1.1.2.3.4.     Observer-Rated (Subj.): Test Score               2   0.19    99.03
1.2.1.3.2.     Distance: Vertical                               2   0.19    99.22
1.2.2.1.       Observer-Rated (Obj.)                            2   0.19    99.42
1.2.2.2.2.     Cognitive: Time Required                         2   0.19    99.61
1.2.1.3.4.     Distance: Angular                                1   0.10    99.71
1.2.2.3.       Psychophysiological                              1   0.10    99.81
1.2.2.3.2.     Psychophysiological: Heart Rate                  1   0.10    99.90
1.2.2.3.3.     Psychophysiological: EEG                         1   0.10   100.00
Table 2.7. Directly measurable variables used to measure situation awareness.
Indirect Variable: Situation Awareness (2.1.3.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Human Meas. > Observer-Rated > Sit. Aware. (1.2.2.1.2.)         60       44.12  44.12
Subjective > Self-Rated > Performance > Sit. Aware. (1.1.1.3.1.)            26       19.12  63.24
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        18       13.24  76.47
Objective > Human Meas. > Cognitive > Resp. Time (1.2.2.2.1.)               10       7.35   83.82
Objective > Human Meas. > Cognitive > Recall Ability (1.2.2.2.3.)           6        4.41   88.24
Objective > Human Meas. > Observer-Rated > Performance (1.2.2.1.1.)         5        3.68   91.91
Objective > Sys. Meas. > Time (1.2.1.2.)                                    3        2.21   94.12
Subjective > Observer-Rated > Taskload > Activity (1.1.2.2.2.)              2        1.47   95.59
Objective > Human Meas. > Observer-Rated > Strategy (1.2.2.1.3.)            2        1.47   97.06
Subjective > Observer-Rated > Performance (1.1.2.3.)                        1        0.74   97.79
Subjective > Observer-Rated > Performance > Strategy (1.1.2.3.5.)           1        0.74   98.53
Objective > Sys. Meas. > Time > Latency (1.2.1.2.2.)                        1        0.74   99.26
Objective > Sys. Meas. > Distance > Angular (1.2.1.3.4.)                    1        0.74   100.00
Table 2.8. Directly measurable variables used to measure trust.
Indirect Variable: Trust (2.1.4.)
Direct measure                                                              Freq.    %       Cum. %
Subjective > Self-Rated > Emotion > Trust (1.1.1.4.3.)                      11       100.00  100.00
Table 2.9. Directly measurable variables used to measure confidence.
Indirect Variable: Confidence (2.1.5.)
Direct measure                                                              Freq.    %      Cum. %
Subjective > Self-Rated > Emotion > Confidence (1.1.1.4.1.)                 7        87.50  87.50
Subjective > Self-Rated > Performance > Sit. Aware. (1.1.1.3.1.)            1        12.50  100.00
Table 2.10. Directly measurable variables used to measure acceptability.
Indirect Variable: Acceptability (2.1.6.)
Direct measure                                                              Freq.    %      Cum. %
Subjective > Self-Rated > Taskload > Acceptability (1.1.1.1.3.)             5        55.56  55.56
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        2        22.22  77.78
Subjective > Observer-Rated > Performance > Strategy (1.1.2.3.5.)           1        11.11  88.89
Subjective > Observer-Rated > Performance > Coord./Comm. (1.1.2.3.6.)       1        11.11  100.00
Table 2.11. Directly measurable variables used to measure adaptation.
Indirect Variable: Adaptation (2.1.7.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Sys. Meas. > Count > Event (1.2.1.1.2.)                         4        25.00  25.00
Objective > Human Meas. > Observer-Rated > Performance (1.2.2.1.1.)         3        18.75  43.75
Objective > Human Meas. > Psychophysiological > Eye Movement (1.2.2.3.1.)   3        18.75  62.50
Objective > Human Meas. > Psychophysiological > EEG (1.2.2.3.3.)            3        18.75  81.25
Objective > Sys. Meas. > Count > Traffic (1.2.1.1.3.)                       2        12.50  93.75
Subjective > Self-Rated > Emotion > Fatigue (1.1.1.4.5.)                    1        6.25   100.00
Table 2.12. Directly measurable variables used to measure behavioral traits.
Indirect Variable: Behavioral Traits (2.1.8.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Human Meas. > Observer-Rated (1.2.2.1.)                         3        60.00  60.00
Objective > Human Meas. > Observer-Rated > Performance (1.2.2.1.1.)         1        20.00  80.00
Objective > Human Meas. > Demographic (1.2.2.4.)                            1        20.00  100.00
Table 2.13. Directly measurable variables used to measure cognition.
Indirect Variable: Cognition (2.1.9.)
Direct measure                                                              Freq.    %       Cum. %
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        3        100.00  100.00
Table 2.14. Directly measurable variables used to measure perception.
Indirect Variable: Perception (2.1.9.1.)
Direct measure                                                              Freq.    %      Cum. %
Subjective > Self-Rated > Taskload > Acceptability (1.1.1.1.3.)             2        66.67  66.67
Objective > Human Meas. > Cognitive (1.2.2.2.)                              1        33.33  100.00
Table 2.15. Directly measurable variables used to measure decision-making.
Indirect Variable: Decision-making (2.1.9.3.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Human Meas. > Psychophysiological > Eye Movement (1.2.2.3.1.)   102      93.58  93.58
Objective > Human Meas. > Observer-Rated > Performance (1.2.2.1.1.)         5        4.59   98.17
Objective > Human Meas. > Cognitive > Resp. Time (1.2.2.2.1.)               2        1.83   100.00
Table 2.16. Directly measurable variables used to measure information management.
Indirect Variable: Information management (2.1.9.4.)
Direct measure                                                              Freq.    %      Cum. %
Subjective > Self-Rated > Taskload > Acceptability (1.1.1.1.3.)             3        75.00  75.00
Subjective > Self-Rated (1.1.1.)                                            1        25.00  100.00
Table 2.17. Directly measurable variables used to measure memory.
Indirect Variable: Memory (2.2.9.5.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Human Meas. > Cognitive > Recall Ability (1.2.2.2.3.)           17       58.62  58.62
Objective > Human Meas. > Observer-Rated > Performance (1.2.2.1.1.)         5        17.24  75.86
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        2        6.90   82.76
Objective > Human Meas. > Cognitive > Resp. Time (1.2.2.2.1.)               2        6.90   89.66
Objective > Sys. Meas. > Count > Traffic (1.2.1.1.3.)                       1        3.45   93.10
Objective > Sys. Meas. > Time > Latency (1.2.1.2.2.)                        1        3.45   96.55
Objective > Sys. Meas. > Distance > Euclidean (1.2.1.3.3.)                  1        3.45   100.00
Table 2.18. Directly measurable variables used to measure mental models.
Indirect Variable: Mental models (2.1.9.6.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Human Meas. > Observer-Rated (1.2.2.1.)                         8        44.44  44.44
Objective > Human Meas. > Cognitive (1.2.2.2.)                              5        27.78  72.22
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        2        11.11  83.33
Subjective > Self-Rated (1.1.1.)                                            1        5.56   88.89
Subjective > Observer-Rated > Performance (1.1.2.3.)                        1        5.56   94.44
Subjective > Observer-Rated > Performance > Strategy (1.1.2.3.5.)           1        5.56   100.00
Table 2.19. Directly measurable variables used to measure competence.
Indirect Variable: Competence (2.1.10.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Human Meas. > Observer-Rated > Performance (1.2.2.1.1.)         8        88.89  88.89
Subjective > Self-Rated (1.1.1.)                                            1        11.11  100.00
Table 2.20. Directly measurable variables used to measure fatigue.
Indirect Variable: Fatigue (2.1.12.)
Direct measure                                                              Freq.    %       Cum. %
Subjective > Self-Rated > Emotion > Fatigue (1.1.1.4.5.)                    9        100.00  100.00
Table 2.21. Directly measurable variables used to measure sleep quality.
Indirect Variable: Sleep Quality (2.1.13.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Human Meas. > Psychophysiological > Eye Movement (1.2.2.3.1.)   2        40.00  40.00
Objective > Human Meas. > Psychophysiological > EMG (1.2.2.3.6.)            2        40.00  80.00
Subjective > Self-Rated > Emotion > Fatigue (1.1.1.4.5.)                    1        20.00  100.00
Table 2.22. Directly measurable variables used to measure mood.
Indirect Variable: Mood (2.1.14.)
Direct measure                                                              Freq.    %      Cum. %
Subjective > Self-Rated > Emotion > Mood (1.1.1.4.4.)                       2        66.67  66.67
Subjective > Self-Rated > Emotion (1.1.1.4.)                                1        33.33  100.00
Table 2.23. Directly measurable variables used to measure stress.
Indirect Variable: Stress (2.1.15.)
Direct measure                                                              Freq.    %      Cum. %
Subjective > Self-Rated > Emotion > Stress (1.1.1.4.6.)                     8        24.24  24.24
Objective > Human Meas. > Psychophysiological > Heart Rate (1.2.2.3.2.)     8        24.24  48.48
Objective > Human Meas. > Psychophysiological > Biochemical (1.2.2.3.5.)    4        12.12  60.61
Subjective > Self-Rated > Emotion > Fatigue (1.1.1.4.5.)                    3        9.09   69.70
Objective > Human Meas. > Cognitive > Resp. Time (1.2.2.2.1.)               2        6.06   75.76
Objective > Sys. Meas. (1.2.1.)                                             1        3.03   78.79
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        1        3.03   81.82
Objective > Sys. Meas. > Count > Traffic (1.2.1.1.3.)                       1        3.03   84.85
Objective > Sys. Meas. > Time (1.2.1.2.)                                    1        3.03   87.88
Objective > Human Meas. > Observer-Rated > Performance (1.2.2.1.1.)         1        3.03   90.91
Objective > Human Meas. > Psychophysiological (1.2.2.3.)                    1        3.03   93.94
Objective > Human Meas. > Demographic (1.2.2.4.)                            1        3.03   96.97
Objective > Human Meas. > Simulator Score (1.2.2.5.)                        1        3.03   100.00
Table 2.24. Directly measurable variables used to measure strategy.
Indirect Variable: Strategy (2.1.16.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        14       29.79  29.79
Objective > Human Meas. > Observer-Rated > Strategy (1.2.2.1.3.)            6        12.77  42.55
Objective > Human Meas. > Observer-Rated > Performance (1.2.2.1.1.)         5        10.64  53.19
Subjective > Observer-Rated > Performance > Problem-Solving (1.1.2.3.2.)    4        8.51   61.70
Objective > Human Meas. > Observer-Rated (1.2.2.1.)                         4        8.51   70.21
Subjective > Observer-Rated > Performance > Strategy (1.1.2.3.5.)           3        6.38   76.60
Objective > Human Meas. > Cognitive > Response Time (1.2.2.2.1.)            2        4.26   80.85
Subjective > Self-Rated (1.1.1.)                                            1        2.13   82.98
Subjective > Self-Rated > Performance > Sit. Aware. (1.1.1.3.1.)            1        2.13   85.11
Subjective > Self-Rated > Emotion > Confidence (1.1.1.4.1.)                 1        2.13   87.23
Subjective > Observer-Rated > Performance (1.1.2.3.)                        1        2.13   89.36
Objective > Sys. Meas. > Count > Event (1.2.1.1.2.)                         1        2.13   91.49
Objective > Sys. Meas. > Count > Traffic (1.2.1.1.3.)                       1        2.13   93.62
Objective > Sys. Meas. > Time (1.2.1.2.)                                    1        2.13   95.74
Objective > Sys. Meas. > Distance (1.2.1.3.)                                1        2.13   97.87
Objective > Human Meas. > Psychophysiological > Eye Movement (1.2.2.3.1.)   1        2.13   100.00
Table 2.25. Directly measurable variables used to measure taskload.
Indirect Variable: Taskload (2.2.1.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        2        33.33  33.33
Subjective > Observer-Rated > Taskload (1.1.2.2.)                           1        16.67  50.00
Subjective > Observer-Rated > Performance > Coord./Comm. (1.1.2.3.6.)       1        16.67  66.67
Objective > Sys. Meas. > Time > Duration (1.2.1.2.1.)                       1        16.67  83.33
Objective > Human Meas. > Cognitive > Resp. Time (1.2.2.2.1.)               1        16.67  100.00
Table 2.26. Directly measurable variables used to measure usability.
Indirect Variable: Usability (2.2.10.)
Direct measure                                                              Freq.    %      Cum. %
Subjective > Self-Rated > Taskload > Acceptability (1.1.1.1.3.)             16       53.33  53.33
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        6        20.00  73.33
Subjective > Self-Rated (1.1.1.)                                            3        10.00  83.33
Objective > Human Meas. > Observer-Rated (1.2.2.1.)                         3        10.00  93.33
Subjective > Observer-Rated > Performance (1.1.2.3.)                        1        3.33   96.67
Objective > Sys. Meas. > Count (1.2.1.1.)                                   1        3.33   100.00
Table 2.27. Directly measurable variables used to measure readability.
Indirect Variable: Readability (2.2.10.1.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        6        66.67  66.67
Objective > Human Meas. > Cognitive > Resp. Time (1.2.2.2.1.)               3        33.33  100.00
Table 2.28. Directly measurable variables used to measure complexity.
Indirect Variable: Complexity (2.2.2.)
Direct measure                                                              Freq.    %      Cum. %
Subjective > Self-Rated > Taskload > Complexity (1.1.1.1.4.)                12       48.00  48.00
Objective > Sys. Meas. > Count > Traffic (1.2.1.1.3.)                       3        12.00  60.00
Objective > Sys. Meas. > Count > Infrastructure (1.2.1.1.4.)                2        8.00   68.00
Objective > Sys. Meas. > Time > Duration (1.2.1.2.1.)                       6        24.00  92.00
Objective > Sys. Meas. > Time > Latency (1.2.1.2.2.)                        1        4.00   96.00
Objective > Sys. Meas. > Speed (1.2.1.4.)                                   1        4.00   100.00
Table 2.29. Directly measurable variables used to measure safety.
Indirect Variable: Safety (2.2.3.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        36       30.77  30.77
Objective > Sys. Meas. > Time > Duration (1.2.1.2.1.)                       15       12.82  43.59
Subjective > Self-Rated > Performance > Safety (1.1.1.3.2.)                 14       11.97  55.56
Objective > Sys. Meas. > Count > Event (1.2.1.1.2.)                         11       9.40   64.96
Objective > Human Meas. > Observer-Rated > Performance (1.2.2.1.1.)         11       9.40   74.36
Objective > Sys. Meas. > Distance > Horizontal (1.2.1.3.1.)                 9        7.69   82.05
Objective > Sys. Meas. > Distance > Euclidean (1.2.1.3.3.)                  8        6.84   88.89
Objective > Sys. Meas. > Speed (1.2.1.4.)                                   5        4.27   93.16
Subjective > Observer-Rated > Performance > Safety (1.1.2.3.1.)             3        2.56   95.73
Objective > Sys. Meas. > Distance > Vertical (1.2.1.3.2.)                   2        1.71   97.44
Subjective > Observer-Rated > Taskload > Complexity (1.1.2.2.3.)            1        0.85   98.29
Objective > Sys. Meas. > Count > Traffic (1.2.1.1.3.)                       1        0.85   99.15
Objective > Human Meas. > Observer-Rated (1.2.2.1.)                         1        0.85   100.00
Table 2.30. Directly measurable variables used to measure efficiency.
Indirect Variable: Efficiency (2.2.4.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        49       35.77  35.77
Objective > Sys. Meas. > Time > Duration (1.2.1.2.1.)                       29       21.17  56.93
Subjective > Self-Rated > Taskload > Acceptability (1.1.1.1.3.)             10       7.30   64.23
Objective > Sys. Meas. > Count > Event (1.2.1.1.2.)                         8        5.84   70.07
Subjective > Self-Rated (1.1.1.)                                            7        5.11   75.18
Subjective > Observer-Rated > Performance (1.1.2.3.)                        6        4.38   79.56
Objective > Sys. Meas. > Count > Traffic (1.2.1.1.3.)                       5        3.65   83.21
Objective > Sys. Meas. > Distance > Horizontal (1.2.1.3.1.)                 5        3.65   86.86
Subjective > Observer-Rated > Performance > Strategy (1.1.2.3.5.)           3        2.19   89.05
Objective > Sys. Meas. > Speed (1.2.1.4.)                                   3        2.19   91.24
Objective > Human Meas. > Observer-Rated > Performance (1.2.2.1.1.)         3        2.19   93.43
Subjective > Observer-Rated > Performance > Safety (1.1.2.3.1.)             2        1.46   94.89
Subjective > Observer-Rated > Performance > Coord./Comm. (1.1.2.3.6.)       2        1.46   96.35
Objective > Sys. Meas. > Time > Latency (1.2.1.2.2.)                        2        1.46   97.81
Objective > Human Meas. > Observer-Rated (1.2.2.1.)                         1        0.73   98.54
Objective > Human Meas. > Cognitive > Resp. Time (1.2.2.2.1.)               1        0.73   99.27
Objective > Human Meas. > Demographic (1.2.2.4.)                            1        0.73   100.00
Table 2.31. Directly measurable variables used to measure capacity.
Indirect Variable: Capacity (2.2.5.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Sys. Meas. > Count > Traffic (1.2.1.1.3.)                       10       43.48  43.48
Objective > Sys. Meas. > Time > Duration (1.2.1.2.1.)                       6        26.09  69.57
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        5        21.74  91.30
Objective > Sys. Meas. > Count > Event (1.2.1.1.2.)                         1        4.35   95.65
Objective > Sys. Meas. > Distance > Angular (1.2.1.3.4.)                    1        4.35   100.00
Table 2.32. Directly measurable variables used to measure relevance.
Indirect Variable: Relevance (2.2.7.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Human Meas. > Observer-Rated > Performance (1.2.2.1.1.)         7        87.50  87.50
Subjective > Self-Rated (1.1.1.)                                            1        12.50  100.00
Table 2.33. Directly measurable variables used to measure communications.
Indirect Variable: Communications (2.2.8.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Sys. Meas. > Count > Action (1.2.1.1.1.)                        1        33.33  33.33
Objective > Sys. Meas. > Time > Duration (1.2.1.2.1.)                       1        33.33  66.67
Objective > Human Meas. > Observer-Rated > Performance (1.2.2.1.1.)         1        33.33  100.00
Table 2.34. Directly measurable variables used to measure simulation fidelity.
Indirect Variable: Simulation fidelity (2.2.9.1.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Sys. Meas. > Count > Traffic (1.2.1.1.3.)                       6        85.71  85.71
Subjective > Self-Rated (1.1.1.)                                            1        14.29  100.00
Table 2.35. Directly measurable variables used to measure scenario fidelity.
Indirect Variable: Scenario fidelity (2.2.9.3.)
Direct measure                                                              Freq.    %      Cum. %
Objective > Sys. Meas. > Count > Traffic (1.2.1.1.3.)                       6        35.29  35.29
Subjective > Self-Rated (1.1.1.)                                            4        23.53  58.82
Subjective > Self-Rated > Taskload > Acceptability (1.1.1.1.3.)             3        17.65  76.47
Subjective > Observer-Rated > Performance (1.1.2.3.)                        2        11.76  88.24
Objective > Sys. Meas. > Time > Duration (1.2.1.2.1.)                       1        5.88   94.12
Objective > Human Meas. > Observer-Rated (1.2.2.1.)                         1        5.88   100.00
Table 2.36. Directly measurable variables used to measure scenario difficulty.
Indirect Variable: Scenario difficulty (2.2.9.4.)
Direct measure                                                              Freq.    %       Cum. %
Subjective > Self-Rated > Taskload (1.1.1.1.)                               2        100.00  100.00
Table 2.37. Directly measurable variables used to measure scenario impact.
Indirect Variable: Scenario impact (2.2.9.5.)
Direct measure                                                              Freq.    %       Cum. %
Subjective > Self-Rated (1.1.1.)                                            1        100.00  100.00
Table 2.38. Directly measurable variables used to measure scenario qualities.
Indirect Variable: Scenario qualities (2.2.9.6.)
Direct measure                                                              Freq.    %       Cum. %
Subjective > Self-Rated (1.1.1.)                                            3        100.00  100.00
2.4 Summary
Measurement involves the assignment of a number system to represent the values of the variables of interest. There exist four distinct measurement scales (nominal, ordinal, interval, and ratio), and the measured variables must be explicitly associated with the appropriate scale and its corresponding mathematical properties (Krantz, Luce, Suppes, & Tversky, 1971; Ghiselli, Campbell, & Zedeck, 1981). The measurement scale is also a possible taxonomic criterion. Many of the direct measures are ratio measures; however, indirect measures derived from them are often ordinal at best. Hence, it is only possible to state that one sector is more complex than another, or that one controller experienced higher workload than another. The development of the measures taxonomy will also help in the identification of measurement scales for the multitude of measures used in the ATC research literature. In other words, measures grouped under a particular class in the taxonomy all share a measurement scale specific to that class. For example, the subjective self- and observer-rated measures are only ordinal; that is, the ratings may be logically ordered, but they do not reflect the distance or difference between ratings.
The availability of data is likely to be enhanced with the implementation of new technologies in ATC (e.g., the digital voice switch system) in TRACON and ARTCC facilities. This makes it possible to derive new measures to complement and support the currently available and feasible measures. The multidimensional nature of ATC is evident from the numerous attempts to measure the various aspects of the system (e.g., system performance, controller workload, controller performance) and the difficulties encountered in these efforts. Furthermore, individual measures typically allow only a very narrow view of the behavior of the system or the human operator as a whole. Therefore, it is important to develop indices that capture the majority of the relevant variables and combine them in a meaningful and informative manner.
The proposed taxonomy represents an emphatically systematic and comprehensive approach to the measurement problem in ATC. This approach serves a dual purpose: on one hand, it is essential to perform a thorough review of past and current research efforts and to organize the findings in a manner that facilitates the use of existing knowledge as a basis for the future development of ATC measurement. On the other hand, it is prudent to proceed cautiously on an issue as complex as ATC measurement and to consider carefully all the constraints, assumptions, and threats to validity that may emerge. In the face of the unprecedented challenges to the nation's air transportation system it is imperative to secure a "toolbox" of measures that would predict controller success in his or her task and the impact of changing procedures and advancing technology on the system as a whole.
3. CRITERIA FOR ATC MEASURES
The terms criteria, or criterion variables, come from regression analysis and denote the variables to be predicted by other, predictor variables. A criterion is hence synonymous with a dependent variable, a term most commonly used in the context of experimental research. In the ATC measures taxonomy presented in the previous section, criteria, or dependent variables, are represented by the indirect measures, the classification system bearing close resemblance to Meister's (1985) and Sanders and McCormick's (1993) descriptions of criterion measures. The main problem with criteria, or dependent variables, or indirect measures in ATC research and development is that of validity; that is, do the directly measurable variables associated with indirect variables (or criteria) really measure what was intended? Alas, this research did not provide much information to establish the validity of indirect measures. In general, very few references were presented in the literature to back up associations between direct and indirect measures. Hence, our review of relevant literature is unlikely to yield valid theoretical frameworks for derivation of indirect measures from the POWER output. We describe the available data in more detail below.
A second attempt to validate the POWER metrics is based on the premise that system measures (see section 2) that are presently part of the POWER metric suite, together with other system measures that can be derived from the NDMS data, could be used as criteria for the human measures derived from the same sources. Many of the directly measurable variables used in these metrics are available either directly from the POWER output or from the NDMS data, and hence the POWER suite could easily be expanded to include these metrics as well. Unfortunately, a review of the airspace complexity and dynamic density literature painted an equally bleak picture, as these measures themselves remain to be validated and are often very speculative in nature. A detailed review of the literature is provided below.
3.1. Theoretical Foundations for Association of Indirect Variables with Direct Variables
In the following tables we list the secondary references provided in the literature as a basis for the claimed relationships between direct measures and indirect variables. Two observations are particularly noteworthy: (1) only a minority of all articles reviewed provided any such justification for making inferences about indirect variables based on direct measures, and (2) those articles that did typically cited the same sources, making the body of supporting research literature for ATC measurement remarkably small. These observations are apparent in the tables.
Table 3.1. References provided in support of measurement of controller workload. Indirect Measure: Workload (2.1.1.) Direct measure
Associated secondary references
Subjective > Self-Rated > Workload > Mental Workload (1.1.1.2.1.)
Bierwagen et al., 1995; Hart & Staveland, 1988; Moroney, Biers, Eggemeier, & Mitchell, 1992; Sollenberger, Stein & Gromelski, 1997; Stein 1985, 1991; Zijlstra & Meijman, 1989; Zijlstra & van Doorn, 1985.
Subjective > Observer-Rated > Workload > Mental Workload (1.1.2.1.1.)
Hurst & Rose, 1978
Objective > System Measure > Count > Action (1.2.1.1.1.)
Means & Gott, 1988
Objective > Sys Measure > Time > Duration (1.2.1.2.1.)
Eisler, 1968
Table 3.2. References provided in support of measurement of controller performance. Indirect Measure: Performance (2.1.2.) Direct measure
Associated secondary references
Subjective > Observer-Rated > Performance (1.1.2.3.)
FAA Form 3120-25, Sollenberger, Stein, & Gromelski, 1997; Vardaman & Stein, 1998; Vortac, Edwards, Fuller, & Manning, 1993
Objective > System Measure > Count > Action (1.2.1.1.1.)
Durso, Hackworth, Truitt, Crutchfield, Nikolic, & Manning, 1998a; Foushee, Lauber, Baetge, & Acomb, 1986; Peterson, Bailey, & Willems, 2001
Objective > System Measure > Count > Event (1.2.1.1.2.)
Broach & Brecht-Clark, 1994
Objective > System Measure > Time > Duration (1.2.1.2.1.)
Gempler & Wickens, 1998; Knecht & Hancock, 1999
Objective > System Measure > Time > Latency (1.2.1.2.2.)
Broach & Brecht-Clark, 1994
Objective > Human Measure > Observer-Rated > Performance (1.2.2.1.1)
Conejo & Wickens, 1997; Dollins, Lynch, Wurtman, Deng, Kischka, Gleason, & Lieberman, 1993; James, James, & Ashe, 1990; Feldman, 1968; Johnson, 1958; Morphew & Wickens, 1998; Sollenberger, Stein, & Gromelski, 1997; Vardaman & Stein, 1998
Table 3.3. References provided in support of measurement of controller situation awareness. Indirect Measure: Situation Awareness (2.1.3.) Direct measure
Associated secondary references
Subjective > Self-Rated > Performance > Situation Awareness (1.1.1.3.1.)
Durso, Truitt, Hackworth, Crutchfield, Ohrt, Nikolic, Moertl, & Manning, 1995; Endsley & Kiris, 1995; Taylor, 1990
Subjective > Observer-Rated > Performance > Strategy (1.1.2.3.5.)
Rodgers & Dreschler, 1993
Objective > Human Measure > Observer-Rated > Situation Awareness (1.2.2.1.2.)
Endsley, 1988; Endsley, 1995; Endsley, 2000; Endsley & Rodgers, 1994
Table 3.4. References provided in support of measurement of controller trust. Indirect Measure: Trust (2.1.4.) Direct measure
Associated secondary references
Subjective > Self-Rated > Emotion > Trust (1.1.1.4.3.)
Barber, 1983; Rempel, Holmes, & Zanna, 1985; Zuboff, 1988; Lee & Moray, 1992
Table 3.5. References provided in support of measurement of controller confidence. Indirect Measure: Confidence (2.1.5.) Direct measure
Associated secondary references
Subjective > Self-Rated > Emotion > Confidence (1.1.1.4.1.)
Conejo & Wickens, 1997
Table 3.6. References provided in support of measurement of controller behavioral traits. Indirect Measure: Behavioral Traits (2.1.8.) Direct measure
Associated secondary references
Objective > Human Measure > Observer-Rated (1.2.2.1.)
Jenkins, Zyzanski, & Rosenman, 1971; Spence, Helmreich, & Pred, 1987
Objective > Human Measure > Observer-Rated > Performance (1.2.2.1.1)
Jenkins, Zyzanski, & Rosenman, 1971; Spence, Helmreich, & Pred, 1987
Objective > Human Measure > Demographic (1.2.2.4.)
Wilde, 1963
Table 3.7. References provided in support of measurement of controller decision-making. Indirect Measure: Decision Making (2.1.9.3.) Direct measure
Associated secondary references
Objective > Human Measure > Observer-Rated > Performance (1.2.2.1.1.)
Enard, 1975
Objective > Human Measure > Cognitive > Response Time (1.2.2.2.1.)
Bouju, 1978
Table 3.8. References provided in support of measurement of controller information management. Indirect Measure: Information Management (2.1.9.4.) Direct measure
Associated secondary references
Subjective > Self-Rated > Taskload > Acceptability (1.1.1.1.3.)
Endsley & Rodgers, 1994
Subjective > Observer-Rated > Performance > Strategy (1.1.2.3.5.).
Zachary, Ryder, Ross, & Weiland, 1992
Objective > Human Measure > Cognitive (1.2.2.2.)
CTA, 1989; Wickens, 1984; Zachary, Ryder, & Zubritzky, 1989
Table 3.9. References provided in support of measurement of controller skill. Indirect Measure: Skill (2.1.11.) Direct measure
Associated secondary references
Objective > System Measure > Count > Event (1.2.1.1.2.)
Bailey, Broach, Thompson, & Enos, 1999
Objective > System Measure > Count > Traffic (1.2.1.1.3)
Bailey, Broach, Thompson, & Enos, 1999
Objective > System Measure > Time > Duration (1.2.1.2.1.)
Bailey, Broach, Thompson, & Enos, 1999
Objective > System Measure > Speed (1.2.1.4.)
Bailey, Broach, Thompson, & Enos, 1999
Table 3.10. References provided in support of measurement of controller fatigue. Indirect Measure: Fatigue (2.1.12.) Direct measure
Associated secondary references
Subjective > Self-Rated > Emotion > Fatigue (1.1.1.4.5.)
Bougrine, Mollard, Cabon, Cointot, Martel, & Coblentz, 1997; Hoddes, Zarcone, Smythe, Phillips, & Dement, 1973
Table 3.11. References provided in support of measurement of controller mood. Indirect Measure: Mood (2.1.14.) Direct measure
Associated secondary references
Subjective > Self-Rated > Emotion > Mood (1.1.1.4.4.)
Watson, Clark, & Tellegen, 1988
Table 3.12. References provided in support of measurement of controller stress. Indirect Measure: Stress (2.1.15.) Direct measure
Associated secondary references
Subjective > Self-Rated > Emotion > Fatigue (1.1.1.4.5.)
Folkard, Monk, & Lobban, 1979; Grandjean, Wotzka, Schaad, & Gilgen, 1971; Horne & Ostberg, 1976
Subjective > Self-Rated > Emotion > Stress (1.1.1.4.6.)
Halberg, Johnson, Nelson, Runge, & Sothern, 1972; Spielberger, Gorsuch, & Lurshene, 1972
Objective > Human Measure > Psychophysiological > Heart Rate (1.2.2.3.2.)
Astrand & Rodhal, 1977; Chamoux, Borel, & Catilina, 1985
Table 3.13. References provided in support of measurement of controller strategy. Indirect Measure: Strategy (2.1.16.) Direct measure
Associated secondary references
Subjective > Observer-Rated > Performance > Problem-Solving (1.1.2.3.2.)
Means & Gott, 1988
Objective > System Measure > Count > Action (1.2.1.1.1.)
Stokes, Belger, and Zhang, 1990
Objective > Human Measure > Observer-Rated > Performance (1.2.2.1.1.)
Johnson & Johnson, 1987
Objective > Human Measure > Observer-Rated > Strategy (1.2.2.1.3.)
Means & Gott, 1988
Table 3.14. References provided in support of measurement of safety. Indirect Measure: Safety (2.2.3.) Direct measure
Associated secondary references
Objective > System Measure > Count > Action (1.2.1.1.1.)
Prinzo, Britton, & Hendrix, 1995; Prinzo & MacLin, 1996 (computerized version of ATSAT)
Objective > System Measure > Count > Event (1.2.1.1.2.)
Buckley, DeBaryshe, Hitchner, & Kohn, 1983
Objective > System Measure > Distance > Vertical (1.2.1.3.2.)
Paul, Shochet, & Algoe, 1989
Table 3.15. References provided in support of measurement of efficiency. Indirect Measure: Efficiency (2.2.4.) Direct measure
Associated secondary references
Subjective > Self-Rated (1.1.1.)
Wish & Carroll, 1974
Subjective > Self-Rated > Taskload > Acceptability (1.1.1.1.3.)
Endsley & Rodgers, 1994
3.2. Airspace Complexity and Dynamic Density Metrics: A Literature Review
This section describes the presently published propositions for measurement of airspace complexity or dynamic density. Note, however, that little data is available for validation of any of these measures, which can predominantly be viewed as conceptual and even anecdotal.
3.2.1. Terminal airspace metrics
Cocanower (2002) developed a number of metrics that derive from only two simple direct measures presently available in the POWER output: (1) number of operations and (2) time. These metrics are:
(1) Transition time: Transition time is the duration of the transition from a period of light traffic to a period of heavy traffic. Cocanower's (2002) data were smoothed by a moving average procedure. This metric requires a number of criteria, however, that may be necessary to determine on a case-by-case basis, namely the definitions of light and heavy traffic load, the time epochs (or intervals) for traffic counts, and the moving average window. Essentially, transition time measures how fast traffic load increases in a given sector, a factor that could be associated with taskload.
(2) Transition slope: This metric is based on the slope of the line fitted to the traffic increase over time, and it represents a measure that combines both time and traffic load into a single number. A large (positive) slope could represent taskload.
(3) Correlation time: This measure is based on autocorrelation of the number of operations; the time (lag) at which the autocorrelation coefficient reaches zero depicts independent operations. In the terminal airspace this metric would allow for separation of different peak periods of traffic. It is not clear whether the same concept would also apply to en route sectors.
(4) Parabolic stability parameter: This is a very complex metric and the output is consequently not very intuitive. The procedure for derivation of the parabolic stability parameter is: (1) evaluate the divergence function for the time series of air traffic operations and examine the temporal distribution of the divergence values for linear segments; (2) find the Lyapunov exponent and determine its sign (positive or negative); (3) examine phase plane representations, with time (t) on the x-axis and (t + time shift) on the y-axis; (4) fit an appropriate function to these data; a parabola (y = ax^2 + bx) was fitted to example data (from ATL), and the parameter b was taken as the parabolic stability parameter. A parabolic stability parameter > 3 will yield multiple solutions for the equation, and this was considered a criterion for instability; in the ATL sample data the value was 3.02, taken to indicate optimum performance just beyond the limits of stability.
3.2.2. Dynamic density
Dynamic density is used in a variety of contexts in the literature and does not necessarily correspond to a single metric. However, Laudeman, Shelden, Branstrom, and Brasil (1998) and Sridhar, Sheth, and Grabbe (1998) have reported an equation for this construct. Directly measured variables for dynamic density include:
1. Traffic density (N)
2. Number of aircraft with heading change > 15° (NH)
3. Number of aircraft with speed change > 10 kts (NS)
4. Number of aircraft with altitude change > 750 ft (NA)
5. Number of aircraft with 3-D Euclidean distance 0–5 nm, excluding violations (S5)
6. Number of aircraft with 3-D Euclidean distance 5–10 nm, excluding violations (S10)
7. Number of aircraft with lateral distance 0–25 nm and vertical separation less than 2,000/1,000 ft above/below FL 290 (S25)
8. Number of aircraft with lateral distance 25–40 nm and vertical separation less than 2,000/1,000 ft above/below FL 290 (S40)
9. Number of aircraft with lateral distance 40–70 nm and vertical separation less than 2,000/1,000 ft above/below FL 290 (S70)
10. Time; the above variables are measured during a sample interval of 1 min.
From the above variables, dynamic density is calculated by the following equation:
DD = W1•N + W2•NH + W3•NS + W4•NA + W5•S5 + W6•S10 + W7•S25 + W8•S40 + W9•S70,
where the weights W1–W9 are derived from regression analysis of controller activity data and subjective ratings. No definition was provided for traffic density, however. It is noteworthy that all of the directly measured variables are available in the NDMS files and could be extracted by the POWER program with (presumably) minimal modifications to the code.
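To make the weighted-sum form concrete, the sketch below computes the metric for one 1-minute sample, assuming the nine counts have already been extracted (e.g., from POWER/NDMS output). The weight values and the example counts are placeholders for illustration only, not the published regression weights.

```python
# Illustrative sketch of the dynamic density weighted sum for one 1-minute sample.
# The weights below are placeholders; the real W1-W9 come from the regression
# against controller activity data and subjective ratings described in the text.
dd_weights = {"N": 1.0, "NH": 2.0, "NS": 2.0, "NA": 2.0,
              "S5": 4.0, "S10": 3.0, "S25": 3.0, "S40": 2.0, "S70": 1.0}

def dynamic_density(counts, weights=dd_weights):
    """counts: dict with the nine directly measured variables for one interval."""
    return sum(weights[name] * counts[name] for name in weights)

sample = {"N": 12, "NH": 3, "NS": 2, "NA": 4,
          "S5": 1, "S10": 2, "S25": 5, "S40": 3, "S70": 6}
print(dynamic_density(sample))
```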
3.2.3. Risk index
Risk index is an index of collision risk (Knecht, Smith, & Hancock, 1996) and it has also been referred to as dynamic density (Smith, Scallen, Knecht, & Hancock, 1998). It is derived from two directly measurable variables, (1) the number of aircraft at a given altitude, N, and (2) the distance from the ith to the jth aircraft, dij, by the following equation:
Risk = Σ (i = 1 to N−1) Σ (j = i+1 to N) 1 / (dij / c)^a,
where a and c are constants. Again, this metric is readily derivable from data in the NDMS files through the POWER program.
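A minimal sketch of the pairwise summation above, assuming aircraft positions at a given altitude are available as (x, y) coordinates in nm; the values of a and c below are illustrative constants, not those used by Knecht et al.

```python
import math

def risk_index(positions, a=2.0, c=5.0):
    """Sum 1/(d_ij/c)**a over all aircraft pairs at a given altitude.

    positions: list of (x, y) tuples in nm; a and c are placeholder constants.
    """
    total = 0.0
    n = len(positions)
    for i in range(n - 1):
        for j in range(i + 1, n):
            d_ij = math.dist(positions[i], positions[j])
            if d_ij > 0:                  # skip coincident points
                total += (d_ij / c) ** -a  # risk grows as separation shrinks
    return total

print(risk_index([(0, 0), (6, 1), (15, 12)]))
```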
3.2.4. Predictive workload measures
These measures are based on work done at the NASA Ames Research Center by Chatterji and Sridhar (2001). Directly measured variables for this metric include:
1. Number of aircraft in sector (N)
2. Most recent monitor alert threshold value (Nmax)
3. Number of aircraft in sector with climb rate > 200 ft/min (aircraft climbing, Ncl)
4. Number of aircraft in sector with climb rate ≥ –200 ft/min and ≤ 200 ft/min (aircraft level, Nlv)
5. Number of aircraft in sector with climb rate < –200 ft/min (aircraft descending, Nds)
6. Horizontal distance between aircraft in a pair (in nm, dij)
7. Vertical distance between aircraft in a pair (in nm, hij)
8. Time to go to conflict (in seconds, tij)
9. Mean and variance of ground speed
10. Crossing angle
11. Heading angle for each aircraft
As can be seen, many of the above variables can be derived from the NDMS files, albeit some will require development of specific algorithms to do so (e.g., time to go to conflict). These metrics also use a number of parameters:
1. Scaling factor Sh = 5 nm/2,000 ft
2. Scale factor sf = .0025 nm/ft
3. Vertical neighborhood ∆h = 4,000 ft if the minimum altitude of an aircraft pair is > 29,000 ft, 2,000 ft otherwise
4. Horizontal neighborhood radius r = 10 nm
5. Time to go to collision threshold ∆t = 600 s
There are 16 actual predictive workload metrics. Some are quite simple, but others involve complex equations and multi-step algorithms. The equations are available in Chatterji and Sridhar (2001); the metrics are merely listed below, with commentary.
1. Sector count as a fraction of sector capacity. This is simply the ratio of the actual aircraft count in the sector and the nominal maximum capacity of the sector.
2. Fraction of flights in a sector that are climbing
3. Fraction of flights in a sector that are level
4. Fraction of flights in a sector that are descending
5. Inverse mean weighted horizontal separation
6. Inverse mean weighted vertical separation
7. Inverse of the average minimum horizontal separation between aircraft pairs
8. Inverse average minimum vertical separation
9. Inverse of minimum horizontal separation distance for aircraft pairs in the same vertical neighborhood
10. Inverse of minimum vertical separation distance for aircraft pairs in the same horizontal neighborhood
11. Fraction of aircraft pairs with time to go to conflict less than the threshold (∆t)
12. Inverse minimum time to go to conflict of aircraft pairs with time to go to conflict less than the threshold (∆t)
13. Inverse of smallest time to go to conflict for aircraft pairs with time to go to conflict less than the threshold (∆t)
14. Variance of ground speed
15. Ratio of standard deviation of ground speed to mean of ground speed
16. Mean conflict resolution difficulty. This measure is based on lookup tables, with different difficulty values associated with different crossing angles.
Chatterji and Sridhar (2001) used a neural network model to implement the above metrics and predict controller workload. As such, the utility of deriving these metrics from the NDMS files via POWER will require further investigation.
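As an illustration of the simpler items on this list, the sketch below computes metrics 1 through 4 (sector count as a fraction of capacity and the fractions of climbing, level, and descending flights) from per-aircraft climb rates. The function names are ours, and the ±200 ft/min bands follow the variable definitions above; this is not the Chatterji and Sridhar implementation.

```python
def sector_count_fraction(n_aircraft, monitor_alert_threshold):
    # Metric 1: aircraft count as a fraction of the sector's nominal capacity (Nmax).
    return n_aircraft / monitor_alert_threshold

def climb_state_fractions(climb_rates_fpm):
    # Metrics 2-4: fractions of flights climbing, level, and descending,
    # using the +/-200 ft/min climb-rate bands defined above.
    n = len(climb_rates_fpm)
    climbing = sum(rate > 200 for rate in climb_rates_fpm)
    descending = sum(rate < -200 for rate in climb_rates_fpm)
    level = n - climbing - descending
    return climbing / n, level / n, descending / n

rates = [1500, 0, -50, -800, 30, 2200]       # illustrative climb rates, ft/min
print(sector_count_fraction(len(rates), 18))  # assumed Nmax of 18
print(climb_state_fractions(rates))
```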
3.2.5. FAA sector complexity metrics
The FAA (2001) proposed a set of metrics to reflect sector complexity. Many of these could easily be computed from data available in the NDMS files, and one is available directly from the POWER output (see section 5 below). Directly measured variables include:
1. Number of aircraft
2. Volume occupied by aircraft (one aircraft)
3. Sector volume
4. Convergence angle
5. x, y, z distance between aircraft in pairs
6. Degrees of freedom (DOF) for an aircraft pair; the nominal 6 DOF for each aircraft (12 DOF for a pair) may be reduced if other aircraft or airspace boundaries constrain maneuver options (e.g., an aircraft cannot climb)
7. Aircraft's speed and distance to coordination point
From the above, the following derivative variables can be computed:
1. Aircraft density (by volume occupied by aircraft)
2. Aircraft density (number of aircraft / sector volume)
3. Convergence recognition index (increases as the convergence angle decreases, making recognition more difficult)
4. Separation Criticality Index
5. Degrees of freedom index
6. Coordination Task Load Index, based on time to reach the sector boundary, or 5 nm before the sector boundary
The last four metrics can be calculated from relatively simple equations, but they necessitate the use of tabulated parameters.
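The two aircraft-density variants (derivative variables 1 and 2 above) are simple ratios; a sketch follows, with illustrative numbers. The POWER-derived FAA_AD2 measure used in section 4 appears to correspond to the second form (aircraft count divided by sector volume); the function names are only shorthand for the FAA (2001) definitions.

```python
def aircraft_density_by_volume(aircraft_volumes_nm3, sector_volume_nm3):
    # Derivative variable 1: share of sector volume occupied by aircraft.
    return sum(aircraft_volumes_nm3) / sector_volume_nm3

def aircraft_density_by_count(n_aircraft, sector_volume_nm3):
    # Derivative variable 2: aircraft per unit sector volume.
    return n_aircraft / sector_volume_nm3

# Example: 12 aircraft in a sector the size of DAY (7,243.44 nm^3).
print(aircraft_density_by_count(12, 7243.44))
```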
3.2.6. Metron algorithm
These metrics are based on the work by Wyndemere, Inc. (1996; see also Pawlak, Brinton, Crouch, & Lancaster, 1996, and Pawlak & Brinton, 1996). Several of the directly measured variables are available from the NDMS files, and consequently a number of complexity metrics could also be added to the POWER output. Directly measured variables include:
1. Aircraft count per sector
2. Sector area
3. Horizontal distance between aircraft pairs
4. Convergence angle between aircraft pairs
5. Number of other flights within 10 nm or between 10 nm and 15 nm of aircraft pairs
6. Distance to nearest sector boundary from aircraft in a pair
7. Distance from each aircraft to nearest sector boundary in nm
8. Aircraft vertical speed
From the above, the following derivative variables can be computed:
1. Number of aircraft pairs with less than 8 or 13 nautical miles horizontal distance between them
2. Convergence angle for aircraft pairs that are within 13 miles of each other
3. Number of aircraft in the neighborhood of an aircraft pair projected to be in conflict
4. Number of aircraft pairs that are in conflict with each other and are close to a subsector boundary
5. Number of aircraft with an altitude change greater than 500 feet per minute
6. Variation in heading
7. Number of aircraft close to a subsector boundary
8. Measure of airspace structure and the distribution of aircraft within a sector
3.3. Summary
Based on the review of the airspace complexity and dynamic density literature, the POWER output could readily be expanded by several complexity measures. These measures would not only add to the utility of the POWER program in investigating different sector characteristics from different facilities, but they would also provide ways to extract "independent variables" from ATC data against which human performance variables could be examined. However, the fragmentary nature of the metrics reviewed here and the lack of data for their validation push this possibility far into the future.
As it turned out, theoretical foundations for measurement of these constructs could not be established from the ATC research literature, as none are provided there. Instead, it often seems to be the case that the validity of inferences made about covert, not directly measurable constructs is based only on the authors' assertion that by measuring A (a directly measurable variable) they were in fact also measuring B (a covert, only indirectly measurable variable). Hence, it is also clear that much research remains to be done to create and validate a theoretical framework for establishing rigorous and reliable connections between directly measurable variables and indirect constructs of interest.
4. EVALUATION OF POWER METRICS FROM THREE ZID SECTORS
Data from three sectors from the Indianapolis air route traffic control center (ZID ARTCC) were selected for POWER analysis. The selection criterion for these sectors was simply that they should be very different from each other, with unique characteristics in terms of traffic patterns and load. A senior supervisor from ZID chose the sectors based on these requirements and his expert judgment; the sectors were the River (26, RIV) low-altitude sector, the Dayton (88, DAY) high-altitude sector, and the Wabash (99, WAB) superhigh-altitude sector.
4.1. Sector Descriptions
RIV is a low-altitude (surface to 23,000 ft) sector that primarily serves to feed arriving traffic to Cincinnati/Northern Kentucky International airport (CVG). As CVG is a busy hub airport, RIV experiences several "rush hours," or traffic peaks, every day. The traffic flow is predominantly towards the northwest (i.e., to CVG) and the primary duty of the controller is to vector traffic to a terminal entry point at intervals as prescribed by CVG approach. The volume of the entire sector is 18,589.63 nm3.
DAY is a high-altitude sector in the northern part of ZID. Within it is one VORTAC, from which 8 jet airways radiate; in addition, two airways traverse across four of these in the northern portion of the sector. The vertical dimensions of DAY are from 23,100 ft to 30,900 ft, and the volume of the sector is 7,243.44 nm3. These numbers mean that aircraft traversing this sector are not only on crossing paths in the horizontal plane, but also often climbing or descending in the vertical plane.
WAB is a superhigh-altitude sector, extending from 35,000 ft up (to the upper limits of the atmosphere for all practical purposes). The airway structure inside WAB is somewhat simpler than that of DAY, but many aircraft at these altitudes fly on direct routes (i.e., off-airways, according to the National Route Program). However, despite the extensive vertical coverage and a large volume at 44,762.87 nm3, it must be kept in mind that very few aircraft are able to utilize altitudes above FL 410 (41,000 ft) and that many aircraft will be in level flight across the sector.
4.2. Evaluation of the POWER Measures
4.2.1. Qualitative Analysis
Data were recorded from the aforementioned sectors during two one-hour periods, during busy and slow times of day, as judged by the supervisor. In addition, the data were analyzed in 10-minute epochs, reflected in the following plots. The POWER metrics that showed the most substantial differences between the sectors are qualitatively evaluated below, with commentary in the figure captions.
Figure 4.1. Number of aircraft controlled during the 1-hour data collection period; the data were collected in six 10-minute epochs, represented by the individual bars. The number of aircraft is on the y-axis. Not much difference between the sectors can be seen in the light of this metric. The busy and slow hours within each sector, however, are clearly visible in these data.
Figure 4.2. Maximum number of aircraft (on the y-axis) under control during each of the 10-minute epochs. This measure would be more representative of the controller's taskload than the previous one, the short epoch duration highlighting the simultaneity of the traffic load.
Figure 4.3. Average vertical distance between aircraft pairs, in hundreds of feet. The POWER output includes several distance variables between aircraft pairs, horizontal, vertical, and Euclidean (slant) distances, as well as averages and minima of these. However, the horizontal distance does not consider differences in aircraft altitudes, and although the Euclidean distance is a good indicator of the true proximity between aircraft, it is quite meaningless to the controller, whose taskload is determined by the standard separations between aircraft. For example, a 2,000 ft Euclidean distance between aircraft is much better for the controller than 31,000 ft, if the former is vertical and the latter horizontal. Since altitude is the easiest way to separate traffic (see next section), average vertical distance will provide an indication of controller taskload. However, this measure is tied to the number of aircraft climbing or descending, and the taskload in the RIV sector is probably much higher than suggested by the large vertical spread of the aircraft in the sector.
Figure 4.4. Number of altitude changes nicely differentiates between the sectors according to their specific characteristics. Clearly, WAB is in a class of its own, as might be expected for a superhigh sector.
Figure 4.5. This is a derivative variable not included in the standard POWER output, the proportion of climbing and descending aircraft (ALT_CHNGS/CONCOUNT). The fact that this ratio can exceed 1 is explained by a single aircraft making several altitude changes during the measurement period. Again, WAB is clearly differentiated from the other two sectors.
Figure 4.6. Handoff acceptance latency (in seconds on the y-axis) is probably best associated with controller workload, the premise being that a busy controller will have a longer latency in accepting a handoff. In the comparison between sectors, RIV is clearly distinguished. These results may reflect the fact that vectoring aircraft is a very attention-demanding task, and the controller often has to keep his or her eyes on a target to determine a correct turning point and time the vector accurately, with little opportunity to attend to other tasks (e.g., accepting handoffs).
Figure 4.7. Handoff acceptance latency depicted in Figure 4.6 should be viewed together with the number of handoffs accepted.
Figure 4.8. This metric was also not included in the original POWER output, but could be derived from other variables. This is aircraft density, derived by dividing the number of aircraft (CONCOUNT) by the sector volume. The small size of the DAY sector is clearly evident in the above plot, giving it a much higher density index than RIV, despite the fact that both sectors handle about an equal number of aircraft during the data collection period. The low index value of WAB is an artifact, due to the large volume of the sector, which in turn is due to the high upper bound of the sector. In other words, the traffic density is very unevenly distributed in WAB sector, but this is not depicted by this metric.
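The two derived variables described in Figures 4.5 and 4.8 are simple ratios of quantities already in the POWER output; a minimal sketch follows, assuming the per-epoch POWER values are available as plain numbers (the function names are ours, not POWER's).

```python
def proportion_altitude_changes(alt_chngs, concount):
    # Figure 4.5: altitude changes per aircraft controlled in the epoch; can exceed
    # 1.0 when a single aircraft makes several altitude changes during the period.
    return alt_chngs / concount if concount else 0.0

def aircraft_density(concount, sector_volume_nm3):
    # Figure 4.8: number of aircraft divided by sector volume (the FAA_AD2 form).
    return concount / sector_volume_nm3

# Illustrative values: 14 altitude changes among 11 aircraft in the RIV sector.
print(proportion_altitude_changes(14, 11), aircraft_density(11, 18589.63))
```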
4.2.2. Inferential Statistics
The above metrics were also analyzed by analysis of variance (ANOVA) for each POWER metric, using a model with sector (3) and time (2, busy and slow) as factors plus their interaction. The results appear in Table 4.1. When interpreting the results it is important to consider relationships between individual metrics and focus on those that can be viewed as truly independent. Also, some metrics are clearly more meaningful than others; for example, it is hardly surprising that the number of aircraft (CONCOUNT) would differ significantly between busy and slow times. Note also that the sample sectors did not significantly differ from each other in terms of the number of aircraft handled in the sampling period. However, it is important to note that there were significant differences between the sectors in terms of the maximum number of aircraft under the controller's responsibility at any one time (MAX_AC), with DAY and RIV experiencing much higher load in this respect than WAB. An example of the independence of metrics is the number of altitude changes (ALT_CHNG). Clearly, the count of altitude changes depends on the total number of aircraft handled, but there were significant differences between sectors in terms of the proportion of aircraft changing altitudes, with DAY and RIV again exhibiting higher load than WAB in this respect. Note that there were no differences between busy and slow times, as the proportion of aircraft changing altitudes does not depend on the number of aircraft in the sector.
While the count of handoffs accepted (H_AC_CNT) clearly depends on the total number of aircraft handled, it is interesting to note that there were significant differences between sectors in the former respect but not in the latter (CONCOUNT). DAY had the most handoffs, reflecting the nature of this sector as being literally at the "crossroads." Average latency of accepting handoffs (H_AC_DUR) is an independent measure of controller workload, long latencies presumably reflecting the busyness of the controller, and there were significant differences between sectors; the independence of this measure from the number of aircraft or handoffs can be seen in that there were no differences between busy and slow times. Of the three sample sectors, DAY was clearly in a class of its own, with RIV and WAB exhibiting significantly smaller latencies. Finally, it was hardly surprising to see significant differences between both the three sectors and busy and slow times in terms of traffic density (FAA_AD2). However, there are some problems with this measure, as was explicated earlier.
Table 4.1. Results of analyses of variance for selected metrics from POWER output for differences between sectors in the ZID sample and busy and slow times.
DV (POWER)    Effect         F        p
CONCOUNT      Sector         1.64     0.2115
              Time           72.54    0.0000
              Sector*Time    1.14     0.3335
MAXAC         Sector         3.96     0.0297
              Time           93.40    0.0000
              Sector*Time    0.62     0.5470
AV_VDIST      Sector         14.46    0.0000
              Time           0.20     0.6550
              Sector*Time    0.55     0.5838
ALT_CHNG      Sector         11.10    0.0002
              Time           39.16    0.0000
              Sector*Time    0.30     0.7462
P-ALT_CHNG    Sector         11.46    0.0002
              Time           0.21     0.6539
              Sector*Time    0.65     0.5293
H_AC_CNT      Sector         4.83     0.0152
              Time           17.96    0.0002
              Sector*Time    0.52     0.6014
H_AC_DUR      Sector         13.08    0.0001
              Time           0.29     0.5926
              Sector*Time    0.14     0.8665
FAA_AD2       Sector         48.45    0.0000
              Time           31.49    0.0000
              Sector*Time    7.31     0.0026
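For reference, the sector (3) x time (2) model underlying Table 4.1 can be run per metric as sketched below, assuming the epoch-level POWER output has been loaded into a pandas DataFrame with sector and time factor columns; the file and column names are illustrative, not those of the actual analysis.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Assumed layout: one row per 10-minute epoch, with the POWER metric of interest
# (here H_AC_DUR) plus 'sector' (RIV/DAY/WAB) and 'time' (busy/slow) factors.
df = pd.read_csv("zid_power_epochs.csv")  # hypothetical file name

model = ols("H_AC_DUR ~ C(sector) * C(time)", data=df).fit()
print(anova_lm(model, typ=2))  # F and p values for Sector, Time, and Sector*Time
```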
4.3. Analysis of Voice Data from the ZID Sample
4.3.1. Review of Relevant Literature
Much of air traffic controllers' work involves spoken communication. Presently, virtually all control actions must be communicated to pilots via voice radio. Hence, voice communications are intuitively and unsurprisingly an attractive method for examining controller taskload (note that a distinction is made here between overt taskload, which can be measured, and covert workload, which is individually experienced by controllers). However, very little research has been done to validate and quantify this putative relationship. A comprehensive study by Casali and Wierwille (1983) manipulated communication load during a simulated flight task; in addition to normal ATC instructions, the subjects were required to perform a call sign recognition task, with target call signs embedded in sets of extraneous call signs of varying difficulty. Of the 16 workload measurement techniques employed, eight were sensitive to communication load manipulations. These techniques included both subjective ratings (modified Cooper-Harper scale and multi-descriptor scale) and objective measures (time estimation, tapping rhythm, pupil diameter, errors of omission and commission, and communications response time). Hence, it is quite clear that communications load is a workload driver. However, the data reported by Casali and Wierwille (1983) do not allow for a reverse relationship to be established, that is, estimation of workload by analysis of the communication load. Several reasons prevent this: first, the article does not report any overall measures of communication load, such as the number and durations of communications, and second, there were several other sources of workload present in the experiment, for example, piloting of the simulator. It should also be noted that a communication task is very different for a pilot and a controller. A pilot typically needs to respond to only a small fraction of messages transmitted on the frequency (i.e., only to those addressed to him or her), whereas the ratio of messages controllers receive and transmit is close to one (i.e., controllers talk to all the aircraft on a frequency).
Hurst and Rose (1978) replicated an earlier study that had indicated that peak traffic and the duration of radio communications were good predictors of the behavioral response of air traffic controllers working in air route traffic control centers. This study included 3,110 observations made on radar sectors at the 13 major radar control rooms in the U.S. Comparison of the durations of radio communications to behavioral ratings made by expert observer-controllers showed that the former were good predictors of the latter.
A very strong relationship between controller workload and communications load was established in a study by Porterfield (1997). This study used ATC communications recorded from high-fidelity simulations and compared communication times to concurrently recorded subjective workload estimates (Air Traffic Workload Input Technique, ATWIT). The primary communications metric was average communication time per minute, calculated for 4-minute intervals to match the ATWIT probes. A maximum correlation coefficient of .88 is indeed very impressive, and the average communication time per minute also closely followed ATWIT ratings over a 15-minute period. However, the ATWIT ratings were generally very low, with a maximum rating of 3.5 on a scale from 0 to 7.
At a workload rating of 3.5, the communication load was 11 s per minute, or a proportion of .183.
Manning, Mills, Fox, Pfleiderer, and Mogilka (2002) analyzed 12 traffic samples from Kansas City Air Route Traffic Control Center (ZKC ARTCC). These traffic samples were viewed on SATORI (Rodgers & Duke, 1993) software, which recreated the traffic situations, by 16 ATC instructors who provided ATWIT workload estimates at 4-minute intervals. The samples were also processed by the POWER software, which extracted a number of objective ATC taskload metrics from the data. Communications were quantified by the number of communication events and their durations, categorized by their content and speaker, as well as by total communication times in 4-minute time epochs. The multitude of dependent variables was subjected to principal components analysis to reduce their number, and like measures were combined into four taskload components. The results showed significant correlations between ATWIT ratings and total number and duration of communications (r = .62, p < .01), and individual communication durations (r = .36, p < .05), as well as the number of instructional clearances (r = .65, p < .01). The activity component of taskload, which combined number of aircraft, number of simultaneously controlled aircraft, and radar controller data entries, was also correlated with total number and duration of communications (r = .63, p < .01), as well as with the number of frequency changes (r = .36, p < .05) and instructional clearances (r = .52, p < .01). Hence, it may be concluded that communication metrics may be a valid indicator of controller workload and taskload, although the r-values reported certainly leave other factors to be accounted for.
4.3.2. Method
The voice data from the ZID samples were converted to wav files at the FAA CAMI by Dennis Rester. These were analyzed with the SPWave program (SPWave is freeware and can be downloaded from http://www.itakura.nuee.nagoya-u.ac.jp/people/banno/spLibs/spwave/). This program allowed for visualization of the voice data as a spectrogram, and a zoom capability allowed for very accurate determination of transmission begin and end times. The data were coded (but not transcribed) and entered into an Excel spreadsheet. From the coded data several variables were derived. The coding scheme and the variables derived from the voice data are explained in detail in Appendix B.
4.3.3. Results
There were a total of 53 separate variables that were derived from the voice data. The results reported here, however, pertain only to those variables that either have been shown to correlate with controller workload or that showed significant differences between the different ZID sectors. Total number and duration of communications were highly correlated, as might be expected (R-squared = 0.854), and therefore only communication duration is discussed here. As Porterfield (1997) and Manning, Mills, Fox, Pfleiderer, and Mogilka (2002) had discovered, communication time was a good predictor of workload (subjective ratings) and it was therefore of interest to examine whether the three ZID sectors differed from each other in this respect (see Fig. 4.9).
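A sketch of the kind of derivation used here: given coded transmission records (speaker, begin time, end time) in seconds, the proportion of controller communication time per 10-minute epoch (the variable plotted in Figure 4.9) could be computed as below. The field names are ours for illustration and do not reproduce the actual coding scheme in Appendix B.

```python
def controller_comm_proportion(transmissions, epoch_len_s=600):
    """transmissions: list of (speaker, begin_s, end_s) tuples for one epoch."""
    talk_time = sum(end - begin
                    for speaker, begin, end in transmissions
                    if speaker == "controller")
    return talk_time / epoch_len_s

# Illustrative epoch with two controller transmissions and one pilot readback.
epoch = [("controller", 3.2, 7.9), ("pilot", 8.4, 11.0), ("controller", 15.0, 21.5)]
print(controller_comm_proportion(epoch))
```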
Figure 4.9. Proportion of controller communication time in the six samples from ZID. Note that the maximum in the DAY sector during the busy time approaches 50%, meaning that the controller was speaking for almost half of the 10-minute epoch. WAB had a much lower communication load than the other two sectors, which might be expected for a superhigh sector.

An ANOVA on the proportion of controller communication time showed nearly significant (at α = .05) differences between sectors, F(2, 29) = 2.90, p = .071, and significant differences between busy and slow times, F(1, 29) = 20.31, p < .001. The interaction between sector and time (busy or slow) was not significant. These results, however, should be moderated by the small sample size, with only 6 data points (epochs) per condition. The number of instructional clearances has also been associated with controller workload (Manning et al., 2002), and clear differences were found between the sample ZID sectors (see Figure 4.10). An ANOVA showed significant differences between sectors, F(2, 28) = 7.07, p < .0005, and between times, F(1, 28) = 19.09, p < .0005. The interaction between sector and time was not significant, however. Finally, we examined the number of frequency changes between sectors, as this variable has also been shown to correlate with workload. No statistically significant differences between sectors in the ZID sample were found, but time had a significant effect on the number of frequency changes, F(1, 27) = 17.51, p < .0005. This result is not surprising, as the number of frequency changes strongly correlates with the number of aircraft in the sample, which clearly is the main difference between busy and slow times.
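For readers wishing to reproduce this kind of analysis, a minimal sketch of the sector-by-time ANOVA follows, assuming the per-epoch proportions are in a DataFrame with illustrative columns proportion, sector, and time (not the original variable names); it uses ordinary least squares with Type II sums of squares rather than the exact software used in the original analysis.

```python
# Sketch only: a two-way ANOVA (sector x busy/slow time) on proportion of communication time.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def sector_time_anova(epochs: pd.DataFrame) -> pd.DataFrame:
    """Fit proportion ~ sector * time and return the ANOVA table."""
    model = smf.ols("proportion ~ C(sector) * C(time)", data=epochs).fit()
    return sm.stats.anova_lm(model, typ=2)  # Type II sums of squares
```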
Figure 4.10. Number of controller-issued clearances shows significant differences not only between times but also between sectors. RIV is clearly
Figure 4.11. Activity count, the sum of three POWER metrics (ALT_CHNG + HDG_CHNG + HAND_CNT), regressed against controller communication duration, which in turn has been shown to be a good predictor of workload.
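A minimal sketch of the regression summarized in Figure 4.11 (and described in the text below) follows. The POWER field names ALT_CHNG, HDG_CHNG, and HAND_CNT are taken from the text, while comm_time_s is an illustrative name for controller communication time per epoch, not an actual POWER output label.

```python
# Sketch only: Activity Count (ALT_CHNG + HDG_CHNG + HAND_CNT) regressed on communication time.
import pandas as pd
import statsmodels.formula.api as smf

def activity_count_regression(power: pd.DataFrame):
    """Return an OLS fit of communication time on the combined activity index."""
    power = power.copy()
    power["activity_count"] = power["ALT_CHNG"] + power["HDG_CHNG"] + power["HAND_CNT"]
    fit = smf.ols("comm_time_s ~ activity_count", data=power).fit()
    return fit  # fit.rsquared, fit.fvalue, and fit.f_pvalue summarize the relationship
```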
A rather roundabout way of providing criteria for the POWER metrics via analysis of voice data can nevertheless be postulated. Given that communication time has been found to be a good predictor of workload (Manning et al., 2002; Porterfield, 1997), we examined correlations between the communication time recorded from the ZID voice data and POWER metrics from the same samples. The best correlation was found between the sum of three controller activity metrics (ALT_CHNG + HDG_CHNG + HAND_CNT) and controller communication time. The premise was that because aircraft altitude and heading changes, as well as handoffs, currently necessitate a clearance, they can be combined into an index that captures most of controller activity (Activity Count). The results are depicted in Figure 4.11. A regression analysis showed a significant relationship between activity count and communication time, F(1, 31) = 43.66, p < .0001, R-squared = .5848.

4.4. Summary

There were two particularly intriguing outcomes from this effort of comparing a sample of three different ARTCC sectors in the light of POWER measures. First, a number of POWER metrics clearly differentiated between the sectors of different characteristics, revealing important factors that might affect controller taskload (e.g., maximum number of aircraft under the controller's responsibility at any one time, proportion of aircraft changing altitude, handoff acceptance latency). Equally important is to consider metrics that remained essentially invariant between sectors (e.g., number of aircraft), as these may reflect taskload factors that are independent of sector characteristics. Second, the present POWER output includes many parameters that are also part of the proposed airspace complexity and dynamic density measures. Hence, POWER measures available from real-world ATC at different centers and different sectors could be used to test and validate airspace complexity and dynamic density metrics. On the other hand, if the latter can be validated by other means (e.g., through controlled experiments and simulator studies), they could be included in the POWER program, making them immediately available from en route sectors anywhere within the NAS.
5. DEVELOPMENT AND VALIDATION OF NEW MEASURES

The purpose of this experiment was to provide an empirical foundation for the notion that controller workload, in terms of time pressure, could be determined by momentary conflict geometries between aircraft pairs in a sector. Hence, a count of aircraft pairs at the same altitude might predict controller workload, as such cases increase the time required to make conflict versus no-conflict judgments within the time available. The premises and hypotheses for the experiment are described below, followed by a detailed account of the method and results. We were also able to provide some validation of the proposed workload metric using POWER output from data from Kansas City ARTCC (ZKC), originally reported in Manning, Mills, Fox, and Pfleiderer (2001), together with associated subjective workload measures.

5.1. Experimental Investigation: Introduction

5.1.1. Air Traffic Controller Conflict Detection Performance

The job of an air traffic controller consists of numerous tasks, many of which must be carefully time-shared (Hopkin, 1995; Roske-Hofstrand & Murphy, 1998). Consequently, time pressure is an inherent characteristic of air traffic control (ATC), as is the resultant workload experienced by controllers. The primary task of air traffic controllers is to ensure that aircraft under their responsibility are always separated according to the rules and regulations set forth by the appropriate (civil aviation) authority. Internationally, these "rules of the air" are determined by the International Civil Aviation Organization (ICAO) (specifically, in Annex 11, "Air Traffic Services," and in Doc. 4444, "Air Traffic Management"). Individual countries often provide more detailed regulations, and in the U.S., the Federal Aviation Administration (FAA) has published separation standards in the air traffic control handbook (Order 7110.65; Federal Aviation Administration [FAA], 2004). These rules and standards, together with airspace structure and layout and the traffic flows and patterns within the controllers' area of responsibility, form the framework within which controllers perform their jobs and tasks. This infrastructure also constrains controllers' actions and shapes their working methods and techniques, as will be discussed in more detail later.

There are numerous standard separation minima applicable to a wide variety of situations. Although a detailed description of the ATC separation standards and practices is not possible here due to space constraints, they can be discussed briefly in terms of a few categories. The simplest classification of ATC separation is into vertical and horizontal separation. Unless two aircraft are at altitudes at least 1,000 ft apart (2,000 ft above 29,000 ft, unless reduced vertical separation minima [RVSM] apply), they must be separated horizontally (by radar) by 3 nm or 5 nm (if they are within or beyond 40 nm from the radar antenna, respectively) (FAA, 2004). In a non-radar environment, horizontal separation can be further classified as longitudinal and lateral, the former applied when aircraft are on the same or opposite trajectories and the latter when they are on crossing trajectories (as defined in FAA Order 7110.65).
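A minimal sketch of these minima, and of the same-altitude pair count proposed above as a candidate workload metric, is given below; the aircraft representation ((x_nm, y_nm, alt_ft) tuples) and the function names are illustrative assumptions, not part of POWER or of the FAA order.

```python
# Sketch only: en route separation minima as described above, plus the proposed
# count of aircraft pairs lacking vertical separation (a candidate taskload index).
from itertools import combinations

def vertical_minimum_ft(alt1_ft: float, alt2_ft: float, rvsm: bool = False) -> float:
    """1,000 ft at or below FL290 (or under RVSM); 2,000 ft above FL290 otherwise."""
    higher = max(alt1_ft, alt2_ft)
    return 1000.0 if (higher <= 29000.0 or rvsm) else 2000.0

def lateral_minimum_nm(dist_from_antenna_nm: float) -> float:
    """3 nm within 40 nm of the radar antenna, 5 nm beyond it."""
    return 3.0 if dist_from_antenna_nm <= 40.0 else 5.0

def vertically_separated(a, b, rvsm: bool = False) -> bool:
    """a and b are (x_nm, y_nm, alt_ft) tuples."""
    return abs(a[2] - b[2]) >= vertical_minimum_ft(a[2], b[2], rvsm)

def count_same_altitude_pairs(aircraft, rvsm: bool = False) -> int:
    """Number of pairs without vertical separation, i.e., pairs requiring lateral checks."""
    return sum(1 for a, b in combinations(aircraft, 2)
               if not vertically_separated(a, b, rvsm))
```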
Because the separation rules and minima are very stringent and in most cases non-negotiable (except under visual flight rules [VFR] or when pilots see each other and can maintain visual separation), and because violating them can result in severe penalties for controllers, strict adherence to them is intrinsic to controllers' jobs. Furthermore, controllers' work is characterized by two conflicting goals: (1) expediting and maintaining orderly traffic flows, and (2) adhering to strict separation standards.
Arguably, then, controllers' working methods and techniques are shaped by the rules they must abide by, which predominantly act as constraints. It is important to note that vertical separation is by definition non-radar, and that even in radar environments controllers often use non-radar separation in lieu of radar separation. The reason for this is simple: radar separation requires constant vigilance and monitoring to ensure that the required minimum distance between aircraft is maintained, whereas non-radar separation, once established, guarantees separation between the aircraft indefinitely (or until the aircraft need to change their trajectories or altitudes). There are hence good reasons to presume that vertical separation would be preferred by controllers whenever the situation warrants or permits, and that such preference would shape their working methods and practices. The general purpose of this research was to validate a hierarchical model of how such constraints affect the information acquisition behavior of controllers when judging the potential for conflicts between aircraft. Specifically, we hypothesized that the time required to perform such judgments depends on the particular information controllers choose to process, and that they seek to "economize" their time by choosing information that can be processed quickly whenever possible, before probing more effortful and time-consuming information sources.

5.1.2. Workload and information processing

Controller workload is arguably one of the most important human factors issues associated with ATC (Hopkin, 1995; Wickens, Mavor, & McGee, 1997). Yet, controller workload cannot be measured directly but only inferred from subjective reports, measurable performance, or measurable task demands. Wickens, Mavor, and McGee (1997) defined workload as the load associated with the mental (including cognitive and affective) processes of the human operator. Because mental workload by definition results in part from processing information, the particular kinds and amounts of information processed may determine the level of workload experienced by a controller. Workload can also be conceptualized by the time available versus time required paradigm, or the time pressure experienced by a controller. It is hence reasonable to suggest that minimizing the time required to perform the various individual tasks that constitute controllers' jobs (e.g., separation assurance) is an effective way of managing workload.

Vast amounts of information are present in the ATC environment that the controller must perceive and comprehend (levels 1 and 2 of situation awareness [SA]; Endsley, 1995) in order to project (level 3 SA) whether or not a conflict will occur at some point in the future. Endsley and Rodgers (1998) documented well over 50 separate information requirements that a controller must consider when managing flows of traffic. However, human attentional and working memory resources are simply too limited for an operator to simultaneously integrate the diagnostic impact of more than a few pieces of uncorrelated information (Wickens & Hollands, 2000). Consequently, given a dozen information sources, humans tend to rely almost exclusively on a favored three or four, the selection of which is determined by the weights assigned to them (Hopkin, 1995), and decision makers often switch strategies to reduce cognitive effort, increase accuracy, or respond to time pressures (Fennema & Kleinmuntz, 1995; Payne, Bettman & Johnson, 1993).
Indeed, controllers have been shown to regulate their workload by employing a set of strategies that are more economical in a demanding situation than under lighter workload (Sperandio, 1978).
Even if we limit the sources of information available to controllers to those relevant to moment-to-moment judgments of the potential for conflict between pairs of aircraft, there is still a large number of cues to process. In the modern radar environment, however, most of these cues are contained in the aircraft's data block depicted on the controller's plan view display (PVD). Willems, Allen, and Stein (1999) found that controllers spent over 75% of their time looking at the PVD, and 92% of this time was spent fixating on an aircraft's data block. Similar results have also been reported by Crawford, Burdett, and Capron (1993) and Moray, Neil, and Brophy (1983), and they certainly suggest that it is the information contained in the aircraft's data block (altitude and speed) and its position representation, from which heading information can be derived via a history trail or predictor line, that is given the most importance in the conflict detection process. A controller's choice of information necessary for performing a conflict-detection task may be narrowed further by examining a possible hierarchical structure among the cues. In the following, we review evidence from past research that indeed points to a clear hierarchy in controllers' choice of information, on which their subsequent judgments, decisions, and actions are based. In particular, three attributes of an aircraft in flight stand out as most critical to the controller: its altitude, heading (trajectory), and speed.

5.1.3. Altitude

Controllers appear to examine the information available to them in a certain order, which depends on the concurrent demands on their information processing capacity. Leplat and Bisseret (1966) have argued that the task of a controller is primarily a form of categorization, whereby each aircraft is assigned meaning on the basis of its attributes, such as altitude, speed, and heading. Vingelis, Schaeffer, Stringer, Gromelski, and Ahmed (1990) further suggested that controllers are not concerned with individual aircraft, but with pairs of aircraft, specifically the future states of aircraft pairs (conflicting pairs and others). These pairs of aircraft can be further classified as focal, of immediate value, or extrafocal, not of immediate value (Niessen, Eyferth, & Bierwagen, 1999). Leplat and Bisseret (1966) found that controllers made such determinations by comparing six specific attributes of aircraft pairs. The first attribute was always the altitude of the aircraft in question, due to its importance in ascertaining conflict likelihood. When information was withheld from controllers and made available only after a specific request, Sperandio (1971) found that under high load controllers requested fewer pieces of information than under low load and began to take into account only the flight levels of the aircraft, indicating a strong preference for altitude-based information over heading and speed when performing the sequencing task. Helbing (1997) masked the information contained within the aircraft's data block on the screen as well as the history-trail information and measured the order and frequency in which information was unmasked on the radar display. He found that altitude was always the first piece of information to be unmasked, before heading and speed, and the frequency of altitude unmasking was more than twice that of heading and speed information combined.
This may also explain why recall of altitude is usually better than recall of heading or speed information (Gronlund, Ohrt, Dougherty, Perry, & Manning, 1998). Subjective reports also show that altitude is considered the most important cue compared to heading and speed information (Willems et al., 1999), and such perceptions have also been observed in controller actions. For example, Schaefer (2001) analyzed electronic transcripts and found that altitude instructions were utilized more frequently than heading and speed instructions when managing the flow of traffic. Davison and Hansman (2003) also found a strong preference to merge aircraft into the Boston TRACON at different altitudes, due to the maximal accuracy for
ensuring separation provided by altitude and the minimal cognitive load associated with its processing. It must be acknowledged, however, that the low cognitive processing cost associated with altitude is limited to cases where the altitudes of the aircraft are constant. If aircraft are climbing or descending, the use of altitude as a means of determining conflict likelihood is much more complicated, necessitating estimation or calculation of whether the aircraft will at some point occupy the same altitude with less than adequate lateral separation between them.

5.1.4. Heading

If two aircraft are at the same altitude (i.e., no vertical separation exists between them), their headings, or rather their tracks across the earth, provide another means for ascertaining whether they might be in conflict at some time in the future. Three distinct cases of using heading information for separation estimation can be identified. First, if two aircraft are on diverging headings, they will never conflict (i.e., their relative distance will continually increase). Second, if two aircraft are on parallel headings (same or opposite) but their respective tracks are more than the required horizontal separation minimum apart, they will never conflict. Same and opposite headings with tracks less than the required separation apart are a special case and will be discussed later. However, if aircraft are on converging headings, separation between them must be ascertained by judging their respective distances to the point where their tracks intersect. Hence, unless aircraft are separated by altitude, in which case conflict likelihood can be determined with high accuracy in one step, the controller is required to estimate their headings and, if these are converging, use trajectory extrapolation to determine the point where their tracks intersect. This technique appears to be more cognitively demanding than altitude checks, given that it requires spatial working memory (see Wickens, Mavor, & McGee, 1997) versus verbal working memory in the latter case. Although we are not aware of any empirical demonstrations to support this conjecture, controllers reportedly used or accessed heading information second, after first checking the altitude of the aircraft in question to ascertain conflict likelihood (Bisseret, 1971; Helbing, 1971; Leplat & Bisseret, 1966; Willems et al., 1999). The lower frequency at which heading information is sampled would also explain why it is recalled less effectively than altitude (Gronlund et al., 1998).

The accuracy of trajectory extrapolation is highly dependent on the convergence angle between the aircraft in question, with increased conflict angle reducing controllers' conflict detection accuracy. Enard (as cited in Bisseret, 1981) attributed this to the increased visual scanning required given the increased spatial distance between targets. Conflict angle as a factor affecting conflict detection ability, however, is often confounded with speed. Clearly, the case where aircraft are traveling at the same speed is very different from one where their speeds differ, and speed has a direct impact on whether the estimation must be done on the distance or the time to the point of trajectory intersection. In the latter case, the time difference between the aircraft reaching the point of trajectory intersection must be converted back to distance to judge the adequacy of lateral separation at that point and time.
Remington, Johnston, Ruthruff, Gold, and Romera (2000) assessed the impact of convergence angle (when aircraft were at the same altitude) on controller performance, in terms of both accuracy and time of conflict detection. The study involved the effects of conflict angle (small or large), time to conflict (4–6 and 6–8 minutes), and airspace load (12, 16, or 20 aircraft). The results showed that the longest response times and lowest response accuracy occurred under high traffic load, long times to conflict, and large conflict angles. Remington
et al. (2000) postulated that the physical distance between conflicting aircraft symbols on the display was the mediating variable; when the conflict times and angles were small, aircraft would be closer to one another on the display. Similarly, longer conflict times and larger convergence angles would inherently place aircraft farther away from one another on the display.

Conflict detection in the special cases of parallel tracks with same or opposite headings does not involve estimation of trajectory intersection, but rather of the distance between the tracks. If this distance is greater than the required lateral separation, the aircraft will not conflict. If, however, the tracks are closer than the minimum separation and the aircraft are on opposite headings and have not passed each other, they will be in conflict. If the aircraft are on the same heading, the controller must determine whether the trailing aircraft might overtake the preceding aircraft (i.e., whether the former has a higher groundspeed than the latter). As groundspeed is depicted in the aircraft data block, this check is similar to the altitude check discussed in the previous section.

5.1.5. Speed

Use of speed as a means of determining conflict likelihood carries with it higher cognitive costs due to the difficulty associated with integrating speed information into the mental representation of the controller (cf. Sperandio, 1978). Speed information may be gleaned from the aircraft's data block in numerical form, or from the spacing of the history trail dots or the length of the predictor line in graphical form. Results from a number of studies have shown that controllers do not value speed as greatly as altitude and heading information in terms of what information they choose to select (Leplat & Bisseret, 1966). Controllers in Bisseret's (1971) study requested speed information much less frequently than they did altitude and heading information, and Helbing (1997) found that when deprived of information on a radar display, controllers always sought out speed information third, after altitude and heading, and with far lower frequency than the former variables. The personal views of controllers in a study by Willems et al. (1999) would seem to support this conclusion, as they rated speed as being much less important than altitude and heading information. Most recently, a study by Gronlund et al. (1998) found that recall of speed information was significantly lower than that of altitude and heading.

Empirical research demonstrating the effects of speed on accuracy has incorporated the concept of relative judgment (RJ), where a subject is required to determine which of two or more moving objects will reach an intersection point first (Delucia, 1991; Delucia & Novak, 1997; Law et al., 1993), requiring the subject to take into account the speed at which each object is traveling. Air traffic controllers perform essentially the same task, whereby in situations where the aircraft are traveling at the same altitude and are converging, controllers must ascertain the likelihood that aircraft A will reach the intersection point before aircraft B. A complementary process to RJ, known as prediction motion (PM), occurs when the subject is required to determine at what time one or more objects will reach the intersection point (Hancock & Manser, 1998; Kimball, 1970; Kimball, Hofmann, & Nossaman, 1973).
Generalization of these results to ATC is difficult, however, because controllers are interested not only in which aircraft will reach the intersection point first and at what time it will arrive there, but more importantly in whether or not separation between the aircraft reaching the intersection first and the next aircraft is assured, necessitating a resource-intensive transformation of temporal information to distance. Law et al. (1993) found that when both objects were traveling at the same speed, RJ accuracy was better. However, as the speed ratio between the objects increased, subject performance decreased. Kimball (1970) and Kimball et al. (1973) conducted studies specifically investigating the effects
of speed and controller experience on the ability to estimate at what time an object would reach the intersection point. Kimball (1970) found that the overall temporal error was greater as the angle of convergence increased and the speed of the objects decreased. The angular effects were attributed to the greater distance evident between targets on the screen, whereas the speed effects were attributed to the relative difficulty of speed integration (cf. Sperandio, 1971). As both greater distances and slower velocities result in longer temporal intervals and longer viewing or prediction times, these in turn may reduce the accuracy of prediction (Peterken, Brown, & Bowman, 1991). In a follow-up study, Kimball et al. (1973) assessed the impact of object speed differentials (ranging from 1:1 to approximately 1:3) and subject experience (controllers vs. non-controllers) on temporal awareness. The results support the notion that increasing the speed differential between converging objects increases the associated temporal error, resulting in lower accuracy, because the controller must now integrate two (rather than one) pieces of speed information and project their implications. There was no difference in performance between controllers and non-controllers, a result that suggests that the ability to make such temporal assessments based on speed may be largely generic, rather than a specific characteristic of controllers. Sperandio (1971) found that controllers were less inclined to request speed information as traffic load increased, relying on altitude and heading to perform their task. This "awareness" of the higher amount of effort associated with processing speed may also explain why controllers generally avoid using speed as a means of ascertaining conflict likelihood between aircraft (Leplat & Bisseret, 1966; Bisseret, 1971; Helbing, 1997; Willems et al., 1999).

5.1.6. Premises and hypotheses

This section reports two experiments we conducted to test the hypothesis that air traffic controllers evaluate potential conflicts in a hierarchical manner, comparing altitudes first to check vertical separation, then extrapolating aircraft trajectories to estimate lateral separation, and finally performing speed–distance computations (or estimations) to check for longitudinal separation. We argue that such a hierarchy results from time pressure and the resultant mental workload, and, together with the regulatory separation standards and procedures, shapes the strategies controllers use to perform their tasks and manage their contemporaneous workload. We first formulate the following premises, based on the results from previous research: (1) controllers typically work under severe time pressure, having to time-share between multiple tasks and divide their attention among the multiple aircraft under their responsibility; (2) mental workload can be related to a cost in terms of the time it takes to perform the numerous tasks, or the time-cost of performing additional tasks; (3) to effectively manage their workload, therefore, it is beneficial for controllers to perform tasks as quickly as possible while achieving the desired goal; (4) ensuring required separation is the primary duty of a controller; consequently, controllers must continually scan the traffic under their responsibility, evaluate developing situations, and make sure their actions, either planned or already carried out, do not result in conflicts between individual aircraft.
Our hypothesis can be stated as follows: verification that two aircraft are not in conflict can be accomplished by the following checks, ordered from easiest to hardest in terms of time-cost and mental effort expended: (1) are the aircraft level at different altitudes at least the standard vertical separation minimum apart? If yes, exit; if not, (2) are the aircraft presently separated and on diverging trajectories, or on established trajectories (e.g., airways) that are known to be conflict-
free? If yes, exit; if not, (3) are the aircraft at the same speed, and is the difference of their distances to the point where their trajectories intersect at least the minimum required lateral separation? If yes, exit; if not, (4) will the aircraft nevertheless have the required lateral separation at the point of closest approach due to their different speeds? If yes, exit; if not, a solution to the impending conflict must be formulated and implemented.

Because performing the above checks becomes progressively more time-consuming and effortful, we hypothesized that controllers prefer to perform the easiest checks (1 and 2) first, and only if these fail to verify the absence of a conflict, perform the more difficult checks (3 and 4). This hierarchical manner of checking for conflicts between aircraft would be manifested in response times (RTs) in conflict judgment tasks involving different two-aircraft conflict scenarios, with longer RTs reflecting both the number of checks performed and the time-cost incurred by the effort expended before a decision could be made. Hence, we would expect no-conflict judgments to be made quickly in cases where the aircraft are at different altitudes, as this is easily verified and no other checks would be necessary. On the other hand, if the aircraft were at the same altitude, the RT would contain both the time expended checking their altitudes and the time required for any subsequent processes, for example, trajectory extrapolation and speed–time–distance computations, and would be much longer. Creation of the specific scenarios that allowed for testing this hypothesis is described in the next section.

5.2. Method

5.2.1. Participants

Two groups of participants were recruited for this study. The first group consisted of 13 volunteer graduate and undergraduate students at the University of Illinois at Urbana-Champaign, who gave their informed consent to participate and were paid USD 8.00 per hour for their time. All subjects were familiar with the basic ATC separation procedures and minima from coursework and previous exposure to related research. The second group consisted of full performance level (FPL) air traffic controllers from Toronto (YYZ). Fourteen controllers volunteered, gave their informed consent to participate, and were compensated CAD 60.00 for their time (approximately 2 hours). All participants in this group were male, with ages ranging from 28 to 55 years (mean = 38.2 years). All were trained and certified for radar and were proficient in the operational use of the radar displays in their facility.

5.2.2. Apparatus

An ATC simulator was constructed for the purposes of this research. The simulator consisted of a depiction of a radar display, mimicking the FAA's display system replacement (DSR) technology in terms of the alphanumeric and symbolic information displayed. The dimensions of the simulated radar display were approximately 60 x 60 nm. Targets (i.e., aircraft) generated by the computer exhibited realistic dynamics, and their position update rate of 10 times per minute, or every 6 seconds, was consistent with present en route radar technology as well. The aircraft position was represented by a yellow diamond; also displayed were a history trail depicting the last 10 positions of the aircraft and a one-minute leader line. The aircraft position symbol was associated with a three-line data tag showing the aircraft's callsign, present and command altitude, and groundspeed. See Figure 5.1 for a representative screenshot from the
experiment. Another computer program was written to allow for the creation of scenarios according to the predetermined independent variables and their manipulations. A total of 160 ATC scenarios were developed for this experiment; they are described in detail below. The experiment was run on a Compaq Presario 2800CA laptop computer, with a 1.6 GHz Pentium IV processor and a 15.1-inch display with a resolution of 1280 x 830 pixels. The subjects' responses were recorded by the computer and saved in a file for analysis.

5.2.3. Design

5.2.3.1. The experimental task. These experiments involved part-task simulations of a number of ATC scenarios, each depicting two aircraft on a simulated radar screen. The subjects' task was, upon the onset of the scenario (which also marked the beginning of a trial), to determine whether the two aircraft displayed were in conflict or not. For the purposes of this experiment we defined a conflict as a situation where two aircraft, if they continued on their present trajectories, would at some future time come into spatial proximity that violated the prescribed separation standards (5 nm horizontal and 1,000 ft vertical).
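A minimal sketch of this conflict criterion for two level aircraft on constant trajectories is given below; the Aircraft fields and function names are illustrative assumptions, not the simulator's internal representation. The checks are ordered the way the hypothesis predicts controllers work: vertical separation first, then the horizontal geometry via the closest point of approach.

```python
# Sketch only: the experiment's conflict definition (5 nm lateral, 1,000 ft vertical)
# for two level aircraft on constant trajectories.
from dataclasses import dataclass
import math

@dataclass
class Aircraft:
    x_nm: float       # position east of origin, nm
    y_nm: float       # position north of origin, nm
    alt_ft: float     # level altitude, ft
    track_deg: float  # track over the ground, degrees from north
    gs_kt: float      # groundspeed, knots

def _velocity(ac: Aircraft):
    rad = math.radians(ac.track_deg)
    return ac.gs_kt * math.sin(rad), ac.gs_kt * math.cos(rad)  # nm/h (east, north)

def in_conflict(a: Aircraft, b: Aircraft,
                lateral_nm: float = 5.0, vertical_ft: float = 1000.0) -> bool:
    # Check 1: vertical separation settles the question immediately.
    if abs(a.alt_ft - b.alt_ft) >= vertical_ft:
        return False
    # Check 2: closest point of approach (CPA) of the relative motion.
    rx, ry = b.x_nm - a.x_nm, b.y_nm - a.y_nm
    (vax, vay), (vbx, vby) = _velocity(a), _velocity(b)
    dvx, dvy = vbx - vax, vby - vay
    speed2 = dvx * dvx + dvy * dvy
    t_cpa = 0.0 if speed2 == 0 else max(0.0, -(rx * dvx + ry * dvy) / speed2)
    miss_nm = math.hypot(rx + dvx * t_cpa, ry + dvy * t_cpa)   # predicted miss distance
    return miss_nm < lateral_nm
```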
Figure 5.1. A screenshot from the experiment. In this scenario, the aircraft are at the same altitude (25,000 ft) and have a 135-degree conflict angle, but, traveling at the same speed (200 kts), they will not be in conflict (DAL 001 will pass more than 7.5 nm behind SWA 452).
5.2.3.2. Independent variables. The independent variables were aircraft angle of convergence, altitude, speed, and miss distance. Angle of convergence had six levels: 0˚, 45˚, 90˚, 135˚, 180˚, and 225˚–315˚. Of these, only the 45˚, 90˚, and 135˚ angles were truly convergent; the 0˚ (same heading), 180˚ (opposite headings), and divergent headings were special cases and will be elaborated on below. Altitude was manipulated at two levels, same or different by at least 1,000 ft, which is the standard FAA vertical separation minimum below 29,000 ft. Speed had two levels as well, same or different; the speed difference ranged from 10 to 50 kts. Miss distance (the closest horizontal distance between the aircraft as they would pass each other) was either 2.5 nm or 7.5 nm. These values lie on either side of the 5 nm standard lateral separation minimum in the en route environment and were chosen to achieve the desired level of difficulty for the experimental task.

5.2.3.3. Dependent variables. The dependent variables recorded were the subjects' response time, measured from the onset of the trial to a key press according to the subject's response (key 'c' for conflict and 'n' for no conflict), and response accuracy, or the correctness of the subject's response. Although the subjects were asked to respond as quickly as possible, a relatively high response accuracy requirement (80% correct responses per block of trials) was imposed on them. This was done to minimize potential speed–accuracy tradeoffs and to improve the sensitivity of the primary dependent variable, response time.

5.2.3.4. Scenarios. A total of 160 scenarios were created through combinations of the independent variables. For converging courses there were 3 (45˚, 90˚, and 135˚) x 2 (same or different altitude) x 2 (same or different speed) = 12 different scenarios. In addition, in scenarios where the aircraft were at the same altitude, their eventual miss distance was either 2.5 or 7.5 nm, determining whether the scenario involved a conflict or not. In different-altitude scenarios, which were no-conflict by definition, the miss distance was set at 2.5 nm. Hence, the total number of different scenarios with converging courses was 18; as each was replicated 5 times, the total number of such scenarios was 90. Similarly, in scenarios where the aircraft were on the same heading, they were either at the same or different altitude and had the same or different speed. If the aircraft were at the same altitude, their trajectories were offset by either 2.5 or 7.5 nm for conflict or no-conflict scenarios. With 5 replicates of each condition, the total number of same- and opposite-heading scenarios was 60 (30 each). Finally, a total of 10 scenarios where the aircraft were on diverging headings and either at the same or different altitudes were created. The angle of divergence in these scenarios ranged from 45˚ to 90˚. Replicates were identical in terms of the independent variable combinations, but had different aircraft call signs and were rotated to different orientations to prevent the subjects from recognizing them from previous trials. All 160 scenarios were counterbalanced by dividing them into five blocks, each containing all 32 different experimental conditions. The order of presentation of trials in each block and the presentation of the blocks themselves were randomized for each subject to avoid order effects. There were a total of 110 no-conflict trials and 50 conflict trials.
The disproportionate number of no-conflict versus conflict trials was due to the fact that different-altitude scenarios were conflict-free by definition; this imbalance should be viewed as representative of a realistic ATC environment, with the conflict trials serving as catch trials in this experiment.
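A minimal sketch of how such scenarios can be constructed geometrically is given below; it works backward from the desired closest point of approach so that the prescribed convergence angle, speeds, and miss distance are met exactly. The function name, the display centre point, and the lead time are illustrative assumptions, not the original scenario-generation program.

```python
# Sketch only: construct initial positions for a two-aircraft scenario with a given
# convergence angle, speeds, and eventual miss distance, by backing up from the CPA.
import math

def make_scenario(angle_deg: float, gs_a_kt: float, gs_b_kt: float,
                  miss_nm: float, minutes_to_cpa: float = 5.0,
                  centre=(30.0, 30.0)):
    """Return initial (x, y) positions in nm for aircraft A (track 000) and B (track angle_deg)."""
    t_h = minutes_to_cpa / 60.0                      # hours until closest approach
    va = (0.0, gs_a_kt)                              # A flies due north
    rad = math.radians(angle_deg)
    vb = (gs_b_kt * math.sin(rad), gs_b_kt * math.cos(rad))
    dvx, dvy = vb[0] - va[0], vb[1] - va[1]          # relative velocity of B w.r.t. A
    norm = math.hypot(dvx, dvy)
    if norm == 0:                                    # identical velocities (same heading, same speed):
        ux, uy = 1.0, 0.0                            # the track offset itself is the constant miss distance
    else:
        ux, uy = -dvy / norm, dvx / norm             # unit vector perpendicular to relative velocity
    a_cpa = centre                                   # place the CPA near the display centre
    b_cpa = (centre[0] + miss_nm * ux, centre[1] + miss_nm * uy)
    a0 = (a_cpa[0] - va[0] * t_h, a_cpa[1] - va[1] * t_h)
    b0 = (b_cpa[0] - vb[0] * t_h, b_cpa[1] - vb[1] * t_h)
    return a0, b0
```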
5.2.3.5. Experimental design. This was a fully factorial within-subjects design, in which all participants were exposed to all combinations of the independent variables. Data analyses, however, were performed in parts, and the data from the two groups were analyzed separately.

5.2.4. Procedure

Participants were briefed on the nature of the experiment and the conditions of testing. Following this, participants signed a consent form and completed a brief demographic questionnaire. They were also familiarized with the hardware used for running the experiment as well as with the experimental task. The participants performed 20 practice trials with feedback on the correctness of their judgments; no feedback was provided during the experimental trials. The participant's only task was to determine whether a potential conflict existed between the two aircraft presented on the simulated radar display and to respond as quickly and accurately as possible by pressing the "c" key for conflicts and the "n" key for no conflicts. A minimum accuracy requirement was set at 80% per block; if accuracy was lower, the block was repeated. The participants were also made aware of this criterion. If a participant did not respond within 60 s, the trial terminated and the participant was prompted for a response. During a 60-second trial the participants could observe 10 target position updates and hence use target motion cues to determine whether the aircraft would be in conflict. The participants completed 5 blocks of trials, each block consisting of 32 trials. At the end of each block, the participants were provided with a composite score of how well they did in the block. The task was self-paced, requiring the participant to press a key to start each trial. Following completion of all five blocks, participants completed a post-experiment questionnaire and were debriefed and remunerated for their participation in the study. The complete experiment, including orientation and practice, data collection, and debriefing, lasted about two hours per participant.

5.3. Results

The results are depicted in Figures 5.2 and 5.3 for the controller and student groups, respectively. Because performance differed greatly across the experimental conditions, we analyzed the data in parts, as shown by the numbered boxes overlaid on the aforementioned plots. Furthermore, as the results were similar for both groups, we discuss the analyses of the controller data in detail first, and then examine how and where the results from the student group differed from these. A natural logarithm (ln) transformation was applied to the response time (RT) data before the analyses, both because RT distributions tend to be positively skewed and to reduce the effects of the relatively large individual differences observed in the raw RT data. A general linear model (GLM) was used to perform univariate analyses of variance (ANOVA) on all data, except where noted otherwise, and Tukey's method was used to test for pairwise differences between means. Analyses of residuals did not reveal any substantial departures from the assumptions of the models in any of the analyses, attesting to the success of the ln transformation in restoring the normality of the data.
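For illustration, a minimal sketch of this kind of analysis follows, assuming trial-level data in a DataFrame with illustrative columns rt_s, angle, speed, and subject. It uses a random intercept for subject only, a simplified stand-in for the full subject/block/trial random-effects structure described below, and is not the original analysis software.

```python
# Sketch only: ln-transform of RTs and a mixed model with subject as a random effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_lnrt_model(trials: pd.DataFrame):
    """Fit ln(RT) ~ conflict angle * speed with a random intercept per subject."""
    trials = trials.copy()
    trials["ln_rt"] = np.log(trials["rt_s"])          # RT distributions are positively skewed
    model = smf.mixedlm("ln_rt ~ C(angle) * C(speed)", data=trials,
                        groups=trials["subject"])      # random intercept for each subject
    return model.fit()
```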
5.3.1. Controller group

5.3.1.1. Different altitudes. A mixed model ANOVA was used to analyze the data from scenarios where the aircraft were at different altitudes, across all conflict angles (including diverging angles) and speeds (same or different). These data are highlighted in box no. 1 in Figure 5.2. Subject, block, and trial were included in the model as random factors; conflict angle and speed were fixed factors. In addition, the model included the interaction of the latter two.
Figure 5.2. Results from the controller group. The results are further grouped for separate analyses by experimental conditions, which contained pairs of aircraft that were either at the same or different altitude (sA, dA), at the same or different speed (sS, dS), and in conflict or not in conflict (C, NC), in different combinations. The conflict angle is depicted on the horizontal axis (DIV = diverging headings).

The results show significant between-subjects differences; while individual differences in performance are expected, there is also evidence that some learning took place from block to block. Further analyses confirmed this, as the mean RT across all other factors decreased from block 1 (1.13 s) to block 4 (0.92 s). However, Tukey's pairwise comparisons showed that only block 1 was statistically different from all other blocks. Speed was also significant, F(1, 703) =
36.28, p < .001, as was the speed–conflict angle interaction. This result is expected, however, as the analysis also involved the special case of same headings, in which case it is important for the controller to also check the speed of the aircraft to determine whether the trailing aircraft is overtaking the preceding one. The results from this particular condition support this notion: the subjects took longer to determine whether the aircraft were in conflict or not when they had different speeds than when they were traveling at the same speed. This difference may be explained by the additional time required to determine whether a faster aircraft was overtaking a slower one. The most important result, however, is that in the trials where the aircraft were at different altitudes there were no differences between the conflict angles at the α = .1 significance level (i.e., the main effect of conflict angle was not significant, F(5, 703) = 1.62, p = .152, nor were any of Tukey's pairwise comparisons). Discounting the speed–conflict angle interaction due to the same-heading special case, these results support the hypothesis that controllers first examined altitude information when determining whether two aircraft are in potential conflict, and in this experiment they clearly made their decision without taking the time to consider any other characteristics of the traffic situation (i.e., conflict angle or speed difference); hence the near-uniform RT distribution across conflict angles and speed differences.

5.3.1.2. Diverging headings. Another ANOVA was performed on data from trials with diverging headings (box no. 2 in Figure 5.2). The model included random factors of subject, block, and trial, and fixed factors of altitude and speed (same or different for both), as well as the interaction of the latter two. Altitude was significant, F(1, 79) = 5.69, p < .05, with higher RTs for aircraft pairs at the same altitude than for those at different altitudes. This implies that the participants did not make their no-conflict decisions based solely on the divergence of the aircraft trajectories, which was immediately perceptible upon the onset of the trial, but also paid attention to the altitudes of the aircraft, depicted numerically in the aircraft data tags.

5.3.1.3. Converging headings. Next, we analyzed the data from converging-heading (45˚, 90˚, and 135˚), same-altitude trials (box no. 3 in Figure 5.2). The mixed model included random factors of subject, block, and trial, and fixed factors of conflict angle, speed, and whether the aircraft were in conflict or not (as defined by their eventual miss distance). Whether the aircraft were in actual conflict or not was significant, F(1, 779) = 32.8, p < .001. The controllers were quicker to respond when the aircraft were in potential conflict than when they were not, and this tendency was more pronounced at angles of 45˚ and 135˚. Within the conflict and no-conflict conditions the controllers were also faster in their responses when the aircraft were traveling at the same speed than when their speeds were different, but the main effect of speed in this analysis was not significant (p = .12). No interactions reached significance at α = .1.

5.3.1.4. Same headings. These scenarios (see box 4 in Figure 5.2) were clearly special cases, requiring substantially different conflict detection strategies than the other experimental conditions.
If the aircraft were at the same altitude, instead of extrapolating the trajectories of the aircraft to determine their point of intersection, the controllers had to estimate the offset of the parallel trajectories to see if it was less than the 5 nm lateral separation minimum. Furthermore, even if this was the case, the aircraft could only be in conflict if a faster aircraft was behind a slower
one. A mixed model similar to the previous analyses was used, with subject, block, and trial as random factors and altitude, speed, and their interaction as fixed factors. Altitude, F(1, 297) = 58.18, p < .001, speed, F(1, 297) = 8.73, p < .005, and their interaction, F(1, 297) = 8.47, p < .005, were all significant. Qualitatively, the results conform to the hypothesized sequence of conflict detection: the response times are ordered (from fastest to slowest) from different-altitude trials (which would be checked first) to same-altitude no-conflict trials (trajectory offset checked second) to overtaking trials (which necessitated checking the aircrafts' speeds).

5.3.1.5. Opposite headings. As was the case in the same-heading trials, opposite headings are a special case as well (box 5 in Figure 5.2). In these situations the speed of the aircraft has little bearing on potential conflicts (except that an actual conflict will occur sooner if the aircraft travel faster), and hence the controller must only check the altitudes and estimate the offset of the parallel but opposite trajectories. The model and analysis of these data were identical to those in the previous section. The results show that altitude, F(1, 368) = 160.52, p < .001, was significant. The trials in which the aircraft were at different altitudes were again fastest, implying that the controllers checked the aircrafts' altitudes first and, if different, responded before looking for other cues of potential conflict. Of the other trials, those involving a conflict were detected faster (mean = 1.4 s) than non-conflict trials (mean = 1.63 s). This fact may attest to the salience of the 2.5 nm offset of the trajectories. Indeed, a separate analysis of the same-altitude scenarios showed a significant effect of conflict on RT, F(1, 278) = 17.18, p < .001.

5.3.1.6. Response accuracy. Response accuracy was restricted to 80% correct responses or higher to avoid speed–accuracy tradeoffs and to enhance the sensitivity of the response time measure. The computer kept track of response accuracy for each block, and subjects were required to repeat any block with less than 80% accuracy. However, it is possible that a speed–accuracy tradeoff could have taken place between the experimental conditions included in a block while the overall accuracy in the block still exceeded the criterion. To check for this possibility, average accuracy scores were computed for each condition across subjects and replicates. The mean accuracy across all experimental conditions was 89% (SD = 12%), but a few conditions fell below the required accuracy. The worst accuracy was 57.97%, in the 45-degree conflict angle, same altitude, different speed, no-conflict condition. This result may indicate a degree of conservatism among controllers, that is, they are likely to err on the side of safety by calling a non-conflict a conflict. Indeed, all eight conditions with less than 80% accuracy were those with no conflicts (see Table 5.1). Plotting mean RTs against response accuracy reveals that those conditions with the longest RTs were also associated with the lowest accuracy; hence, no speed–accuracy tradeoff took place (Figure 5.4).

5.3.2. Student group

Overall, the patterns in the data are strikingly similar to those observed in the controller data. The student group shows slightly more variability and slightly longer response times. The analysis of data from the student group was performed in the same manner as for the controller group, with identical models.
The results, appearing in Figure 5.3, are concisely summarized below; comparison of the results from the two groups is discussed in the following section.
Table 5.1
Response accuracy by experimental condition. For both groups, the best response accuracy was achieved in different altitude trials and the worst in same altitude trials, reflecting the difficulty of the conflict judgment task. The worst accuracy was also realized in no conflict trials, suggesting a conservative bias of the participants, or erring on the safe side by calling a no conflict a conflict. The experimental conditions in the left-hand column consisted of different conflict angles (the first number, in degrees), same or different altitudes (sA, dA), same or different speeds (sS, dS), and whether the planes were in conflict (C) or not (N).

Condition        Accuracy (%)
                 YYZ Grp    UIUC Grp
45,dA,sS,N       100.00      98.46
45,dA,dS,N       100.00     100.00
90,dA,sS,N       100.00      95.31
90,dA,dS,N       100.00      93.85
135,dA,sS,N      100.00      98.46
135,dA,dS,N      100.00     100.00
180,dA,dS,N      100.00      95.38
315,dA,sS,N      100.00     100.00
315,dA,dS,N      100.00     100.00
0,sA,dS,N         98.57      98.46
180,sA,sS,C       98.57      93.94
180,sA,dS,C       98.57      93.85
180,dA,sS,N       98.57      98.46
315,sA,dS,N       97.44     100.00
0,dA,sS,N         97.14      90.77
315,sA,sS,N       96.15      95.83
0,sA,sS,N         95.71      93.94
45,sA,dS,C        95.71      86.15
0,dA,dS,N         94.29      96.92
90,sA,sS,C        92.86      83.08
135,sA,sS,C       91.43      75.38
90,sA,dS,C        84.29      80.00
135,sA,dS,C       84.29      72.31
45,sA,sS,C        82.86      93.85
0,sA,dS,C         81.43      87.69
45,sA,sS,N        80.00      73.85
135,sA,sS,N       77.14      73.85
135,sA,dS,N       77.14      78.46
180,sA,dS,N       75.71      63.08
90,sA,sS,N        71.43      58.46
180,sA,sS,N       71.43      44.62
90,sA,dS,N        64.29      58.46
45,sA,dS,N        57.97      56.92
Figure 5.3. Results from the student group. The results are grouped in an identical manner to the controller group (see Fig. 5.2); sA = same altitude, dA = different altitude, sS = same speed, dS = different speed, C = conflict, and NC = no conflict. The conflict angle is depicted on the horizontal axis (DIV = diverging headings).
Figure 5.4. Plots of response time vs. accuracy for both experimental groups. The negative slopes of the regression lines clearly show that no speed–accuracy tradeoff took place, but that more difficult trials with long RTs also resulted in lower accuracy than easier, faster-performed trials.

An ANOVA on data from conditions involving different altitudes (see box 1 in Figure 5.3) showed significant between-subject differences; block was significant in this analysis as well, and, as was the case with the controller group, some learning occurred from block to block. Further analysis revealed a steady reduction in mean response times from block 1 (1.2 s) to block 4 (0.83 s), but Tukey's simultaneous test for pairwise differences showed that only block 1 was different from blocks 3, 4, and 5, and block 2 from blocks 4 and 5. Unlike in the controller group data, conflict angle was significant in this analysis, F(5, 657) = 5.01, p < .001. This was due to significantly higher RTs in the 90-degree conflict angle condition as well as the contribution of the 0-degree special case (see Figure 5.3). Tukey's test for pairwise differences showed that the 0-degree conflict angle indeed differed from both the 45- and 135-degree conditions, T = -3.65, p < .005, and T = -3.27, p < .05, respectively; the 45-degree angle differed from the 90-degree condition, T = 3.54, p < .01; and the 90-degree angle from the 135-degree condition, T = -3.14, p < .05. Overall, however, it again appears that conflict angle and speed had little impact on the time required to make conflict judgments in different-altitude trials.

The next analysis was performed on data from trials in which the aircraft were on diverging headings (see box 2 in Figure 5.3). The results are nearly identical to those from the controller group. Altitude was significant, F(1, 72) = 12.82, p < .005, with slower RTs observed in trials where the aircraft were at the same altitude. Evidently, the participants checked the aircraft altitudes as well, despite the fact that all of these trials were conflict-free due to the diverging headings, which should have been readily perceptible.
Analysis of data from same-altitude, converging-trajectory trials (box 3 in Figure 5.3) reveals significant differences between subjects and blocks. In the latter case, however, only the first two blocks were performed somewhat slower than the other three, and no pairwise differences were significant in Tukey's test. Speed was marginally significant, F(1, 721) = 3.65, p = .056, and the interaction between speed and conflict angle was significant, F(2, 721) = 3.38, p < .05. It is difficult to see any clear trends in these data, however.

The same-heading condition was a special case (box 4 in Figure 5.3), as discussed before. Altitude, F(4, 275) = 5.87, p < .001, and speed, F(4, 275) = 5.87, p < .001, were significant. Block 1 was performed slower than the other blocks; however, it was significantly different from only block 5 by Tukey's test for pairwise differences, T = -3.16, p < .05. The trials in which the aircraft were at different altitudes showed the fastest RTs, followed by no-conflict trials with the aircraft at the same altitude; the longest RTs were observed in overtaking trials where the aircraft trajectories were offset by less than the required separation, supporting the hypothesis.

Response accuracy of the student group was overall lower than that of the controller group (mean accuracy 85.75% vs. 89.79%) but otherwise showed a similar pattern (see Table 5.1). Plotting the percent correct responses against mean RT for each experimental condition reveals that those conditions with lower accuracy also had longer RTs, indicating that no speed–accuracy tradeoff took place (Figure 5.4). These results were nearly identical to those of the controller group.

5.3.3. Comparison of the groups

As has already been pointed out, the two experimental groups, novices and experts, produced very similar patterns in their respective data. Both groups had their fastest response times in conditions where the aircraft were at different altitudes, lending support to the hypothesis that altitude is checked first and, if different, a no-conflict decision is made based on that information only. The patterns between the two groups in the special cases of divergent, same, and opposite headings were similar, too. Finally, the conditions associated with converging headings resulted in the longest RTs for both groups, with similar patterns even within these conditions, although the student group showed much larger variability between these experimental conditions than the controllers.
Figure 5.5. Correlation of response times between the experimental groups. The coefficient of determination (R2 = .90) is very high, signifying similar performance of the skilled (YYZ controllers) and naïve (UIUC students) groups.

Comparisons between the groups were made separately for each experimental condition by performing two-sample t-tests on RT. The results are depicted in Table 5.2, ordered by the significance of the difference between groups. Of the total of 33 experimental conditions, the two groups were significantly different at p < .05 in only 10 conditions; these appear at the bottom of Table 5.2. In all other experimental conditions the performances of the two groups were indistinguishable. Figure 5.5 shows that the response times from the two groups were highly correlated (R2 = .90). We will return to the implications of these results in the next section.
Table 5.2
Comparison of response times (RT) between the experimental groups. For all but 10 of the 33 experimental conditions the groups were indistinguishable.
Condition      UIUC Grp RT (s)  YYZ Grp RT (s)  Est. Diff.    T        p
               Mean (SD)        Mean (SD)
45,dA,dS,N     2.92 (2.94)      2.99 (1.55)     -0.07        -0.16    0.874
90,sA,sS,N     7.30 (4.03)      7.03 (4.63)      0.27         0.37    0.714
0,sA,sS,N      4.74 (3.29)      4.49 (2.27)      0.25         0.52    0.605
315,dA,sS,N    3.34 (1.99)      3.06 (1.66)      0.28         0.56    0.576
135,dA,sS,N    2.79 (1.16)      2.95 (1.80)     -0.16        -0.64    0.521
315,sA,dS,N    4.32 (2.50)      3.94 (2.46)      0.38         0.67    0.503
0,dA,dS,N      4.15 (5.25)      3.67 (2.21)      0.48         0.68    0.501
45,dA,sS,N     2.64 (1.31)      2.79 (1.07)     -0.15        -0.71    0.478
45,sA,dS,N     9.16 (8.18)      8.04 (5.52)      1.12         0.92    0.357
180,dA,dS,N    3.27 (2.30)      2.95 (1.41)      0.32         0.98    0.329
90,dA,sS,N     3.42 (3.32)      2.94 (1.60)      0.48         1.06    0.292
180,dA,sS,N    3.09 (2.71)      2.71 (0.93)      0.38         1.07    0.288
315,sA,sS,N    4.00 (1.91)      3.35 (1.46)      0.65         1.34    0.187
135,sA,dS,N    9.23 (8.79)      7.56 (3.87)      1.67         1.41    0.162
315,dA,dS,N    2.52 (1.59)      3.18 (2.38)     -0.66        -1.43    0.158
135,sA,sS,N    9.50 (10.20)     7.42 (4.65)      2.08         1.51    0.136
90,sA,dS,N     9.16 (8.25)      7.32 (3.60)      1.84         1.66    0.100
45,sA,sS,N     9.69 (7.80)      7.82 (4.13)      1.87         1.72    0.088
180,sA,sS,C    5.87 (3.79)      4.86 (2.77)      1.01         1.77    0.079
0,dA,sS,N      3.10 (1.89)      2.61 (1.11)      0.49         1.82    0.071
0,sA,dS,N      4.93 (2.31)      4.05 (3.18)      0.88         1.85    0.067
90,sA,sS,C     7.95 (5.61)      6.40 (3.48)      1.55         1.91    0.059
135,dA,dS,N    2.83 (2.18)      3.56 (2.21)     -0.73        -1.91    0.058
45,sA,dS,C     7.88 (6.23)      6.10 (3.76)      1.78         2.00    0.049
180,sA,sS,N    8.67 (7.97)      6.35 (4.15)      2.32         2.10    0.039
45,sA,sS,C     7.82 (5.15)      6.19 (3.24)      1.63         2.17    0.032
0,sA,dS,C      7.53 (5.81)      5.72 (2.70)      1.81         2.30    0.024
90,sA,dS,C     10.15 (9.18)     6.80 (4.24)      3.35         2.69    0.009
180,sA,dS,C    6.01 (5.44)      4.07 (1.49)      1.94         2.78    0.007
135,sA,dS,C    10.26 (9.30)     6.69 (4.45)      3.57         2.81    0.006
180,sA,dS,N    8.83 (8.78)      5.54 (3.25)      3.29         2.85    0.006
90,dA,dS,N     3.85 (2.88)      2.73 (1.00)      1.12         2.96    0.004
135,sA,sS,C    8.87 (6.71)      5.69 (4.12)      3.18         3.29    0.001
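The per-condition comparison summarized in Table 5.2 and Figure 5.5 can be sketched as follows. This is a minimal illustration under assumed conventions, not the original analysis scripts: the data layout is invented, the example RT samples are fabricated placeholders, and Welch's (unpooled) t-test is used because the report does not state whether variances were pooled.

# Minimal sketch: two-sample t-test on RT per condition, then the correlation
# of condition means across groups (cf. Figure 5.5).
import numpy as np
from scipy import stats

def compare_groups(uiuc_rts, yyz_rts):
    """Both arguments map condition labels (e.g., '45,dA,dS,N') to arrays of RTs in seconds."""
    rows = []
    for cond in uiuc_rts:
        t, p = stats.ttest_ind(uiuc_rts[cond], yyz_rts[cond], equal_var=False)
        rows.append((cond, float(np.mean(uiuc_rts[cond])),
                     float(np.mean(yyz_rts[cond])), float(t), float(p)))
    rows.sort(key=lambda row: row[4], reverse=True)   # order by p, as in Table 5.2
    r, _ = stats.pearsonr([row[1] for row in rows], [row[2] for row in rows])
    return rows, r ** 2                               # R^2 of condition means

# Fabricated placeholder data for three conditions, for illustration only.
rng = np.random.default_rng(0)
uiuc = {c: rng.normal(m, s, 60) for c, m, s in
        [("45,dA,dS,N", 2.9, 2.9), ("90,sA,sS,C", 8.0, 5.6), ("135,sA,sS,C", 8.9, 6.7)]}
yyz = {c: rng.normal(m, s, 60) for c, m, s in
       [("45,dA,dS,N", 3.0, 1.6), ("90,sA,sS,C", 6.4, 3.5), ("135,sA,sS,C", 5.7, 4.1)]}
table, r2 = compare_groups(uiuc, yyz)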
5.4. Discussion
This research replicated in part the results of Leplat and Bisseret (1966) and corroborated their proposed organization of conflict search strategies. Our study, however, showed how the taskload associated with conflict detection can be quantified in terms of the time required for a conflict or no-conflict judgment to be made, and how controllers tend to minimize this time cost by adhering to a hierarchical strategy. The near-uniform RT distribution across conflict angles and speed differences when the aircraft were at different altitudes clearly shows that the participants
first examined altitude information when determining whether two aircraft were in potential conflict and, if vertical separation existed between the aircraft, made their no-conflict decision without taking the time to consider any other characteristics of the traffic situation (i.e., conflict angle or speed difference). Altitude was dominant in all scenarios, including those with diverging trajectories, where divergence was immediately perceptible upon the onset of the trial. This result may attest to the strength of the rule of checking altitude first, or to the fact that no route information (other than heading) was available to the participants. That altitude was checked first in all conditions is not surprising given the strong support for this hypothesis in the related literature (Bisseret, 1971; Helbing, 1971; Leplat & Bisseret, 1966; Sperandio, 1971; Willems et al., 1999). The situation is more complex when we consider the subsequent levels in the proposed hierarchy. When same-altitude aircraft are on converging trajectories, multiple factors determine whether they may be less than the required separation apart when their trajectories intersect, including convergence angle, speeds, and the distance and time to the point of intersection. These factors often result in complex interactions and make comparison of results from different experiments difficult. They may also explain why the findings of Remington et al. (2000) were not replicated in our study. Furthermore, the scenarios in Remington et al. (2000) included multiple distracter aircraft, making the visual scanning and conflict detection task markedly different from the present experiment. Hence, it is possible that performance decrements attributed to increased convergence angles were due to variations in traffic load rather than aircraft headings.
5.5. Implications
Our results have important implications for two distinct aspects of air traffic controller performance. First, the benefit of vertical separation in terms of conflict detection performance was quantifiably substantial: controllers from YYZ made no-conflict decisions in 3.0 seconds on average when the aircraft were at different altitudes vs. 6.0 seconds (twice as long) when they were at the same altitude. The corresponding numbers for UIUC participants were 3.2 vs. 7.8 seconds. Hence, the controller workload due to the time required to make conflict judgments could be estimated simply by counting the number of aircraft pairs at the same altitude in a given sector at a given time. Such a taskload measure, however, should be moderated by the fact that controllers do not constantly reassess the entire situation while working traffic, but focus only on aircraft requiring constant attention (cf. Niessen et al., 1999). The proposed metric would therefore be most directly applicable to situations where the controller is either coming to a sector and assessing the traffic situation before taking control, or rebuilding his or her situation awareness after "losing the picture" for some reason. However, further research is required to determine the validity and utility of the proposed measure. Second, the close correspondence of results from naïve participants and professional air traffic controllers has implications both for our understanding of controller expertise and for experimental research on various ATC-related issues. As Kimball et al.
(1973) have suggested, certain perceptual and decision-making abilities may be inherent characteristics of all humans rather than learned skills exhibited only by experienced air traffic controllers. What sets expert controllers apart from the general population may hence lie in the areas of planning, time-sharing, and procedural and declarative knowledge, rather than in what may here be deemed basic perceptual and cognitive processing capabilities. Therefore, experimental designs using non-controllers in
part-task settings may offer attractive and fruitful avenues for research on a myriad of basic ATC-related research questions, requiring complex high-fidelity simulations and expert controller participants only for validating research findings in operationally realistic task environments. The benefits of such an approach to ATC research and development in terms of effectiveness and cost are unequivocal.
5.6. Validation of Results
We sought to validate the results of the experiment by using POWER output from Kansas City ARTCC (ZKC) data, originally reported in Manning, Mills, Fox, and Pfleiderer (2001), together with the associated subjective workload measures. A modified POWER program counted the number of aircraft pairs at the same altitude in data from four different sectors at ZKC, with three samples (one 8-minute sample and two 20-minute samples) from each. We also examined this measure in the sample data from ZID.
5.6.1. Subjective Workload Estimates from ZKC Data
The subjective workload ratings were obtained from 16 ARTCC instructors at the FAA Academy in Oklahoma City, OK. The ZKC traffic samples were recreated with the Systematic Air Traffic Operations Research Initiative (SATORI; Rodgers & Duke, 1993). The participating controllers viewed the scenarios and provided estimates of the workload they deemed was experienced by the sector controller working the traffic presented in the samples. Workload estimates were provided via the Air Traffic Workload Input Technique (ATWIT; Stein, 1985) using a workload assessment keyboard (WAK) at 4-minute intervals while viewing the scenarios, as well as by the NASA Task Load Index (TLX; Hart & Staveland, 1988) at the end of each scenario. See Manning, Mills, Fox, and Pfleiderer (2001) for a complete description of the study. The results indicate a moderate correlation between the subjective workload measures from the Manning, Mills, Fox, and Pfleiderer (2001) study and the number of aircraft pairs at the same altitude. A regression of WAK inputs on the number of pairs of aircraft at the same altitude was significant, F(1, 44) = 8.20, p = .0064, with an R-squared of .16 (see Figure 5.6). A regression of mean TLX scores (averaged across raters and epochs in scenarios) on the mean number of pairs of aircraft was also significant, F(1, 10) = 5.19, p < .05, with an R-squared of .34 (see Figure 5.7). These results, however, must be viewed with caution. It is quite clear that the number of aircraft pairs at the same altitude correlates strongly with the total number of aircraft in the scenario. Since the latter has repeatedly been implicated as a strong workload driver (Arad, 1964; Hurst & Rose, 1978), it is worth a closer look to see whether the impact of aircraft at the same altitude could be isolated.
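For concreteness, the same-altitude pair count described above can be sketched as follows. This is a minimal illustration under assumed conventions, not the modified POWER code: the track-record format and the "same altitude" tolerance are placeholders.

# Minimal sketch of counting same-altitude aircraft pairs in one time slice of a sector.
from itertools import combinations

def same_altitude_pairs(tracks, tolerance_ft=0.0):
    """Count aircraft pairs whose altitudes differ by no more than tolerance_ft.

    tracks: list of (callsign, altitude_ft) tuples for one time slice of a sector.
    """
    return sum(
        1
        for (_, alt_a), (_, alt_b) in combinations(tracks, 2)
        if abs(alt_a - alt_b) <= tolerance_ft
    )

# Example: four aircraft, two of them level at FL350.
snapshot = [("AAL12", 35000), ("UAL345", 35000), ("DAL67", 33000), ("SWA89", 29000)]
print(same_altitude_pairs(snapshot))  # -> 1

Applied at each rating time or averaged over a sample, such a count yields the kind of pairs measure that is regressed against the workload ratings below.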
[Figures 5.6 and 5.7 (scatterplots). Figure 5.6: WAK Scores vs. No. Acft Pairs at Same Alt, fitted line y = 0.0995x + 2.6763, R² = 0.1571. Figure 5.7: TLX (average) vs. Mean No. Acft Pairs at Same Alt, fitted line y = 2.004x + 29.132, R² = 0.3418.]
Figures 5.6 and 5.7. WAK scores, averaged across raters, at 4-minute intervals from a total of 12 sample scenarios (4 sectors, 3 samples each), regressed against the number of aircraft pairs at the same altitude at the times of each rating. A moderate but clear positive correlation is apparent between these variables.
A very strong correlation was indeed confirmed between the number of aircraft (CONCOUNT), the maximum number of aircraft present at one time (MAXAC), and the number of pairs of aircraft at the same altitude (ALT_PAIRS). The correlation matrix is presented in Table 5.3.

Table 5.3. Correlation matrix for number of aircraft (CONCOUNT), maximum number of aircraft present at one time (MAXAC), pairs of aircraft at the same altitude (ALT_PAIR), and mean WAK score; the p-value is given in parentheses after each Pearson correlation coefficient.

                  CONCOUNT        MAXAC           ALT_PAIR
MAXAC             0.921 (0.000)
ALT_PAIR          0.573 (0.000)   0.566 (0.000)
Mean WAK Score    0.709 (0.000)   0.749 (0.000)   0.396 (0.006)
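A correlation matrix with p-values such as Table 5.3 can be produced from the per-sample POWER measures and mean WAK ratings along the following lines. This is a hedged sketch with fabricated placeholder data and assumed column names, not the original analysis code or the ZKC values.

# Minimal sketch: pairwise Pearson correlations with p-values.
import numpy as np
from scipy import stats

def correlation_table(columns):
    """columns maps variable names (e.g., 'CONCOUNT') to equal-length arrays."""
    names = list(columns)
    out = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r, p = stats.pearsonr(columns[a], columns[b])
            out[(a, b)] = (round(r, 3), round(p, 3))
    return out

# Placeholder data for illustration only (not the ZKC measurements).
rng = np.random.default_rng(1)
concount = rng.integers(10, 40, size=12)
data = {
    "CONCOUNT": concount,
    "MAXAC": concount - rng.integers(0, 5, size=12),
    "ALT_PAIR": rng.integers(0, 15, size=12),
    "Mean WAK": rng.uniform(1, 5, size=12),
}
for pair, (r, p) in correlation_table(data).items():
    print(pair, r, p)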
A possible method to isolate the effect of the number of aircraft pairs at the same altitude is to compute the proportion of such pairs among all possible pairings of aircraft in a sector. The latter is given by the number of combinations of aircraft taken two at a time (order does not matter):

Comb(P, G) = P! / [(P − G)! G!]
where P is the size of the population (here, CONCOUNT) and G the size of the group, here 2 for a pair. The results are depicted in Figures 5.8 and 5.9 for the two workload measures (WAK and TLX, respectively). Regression analysis confirmed what is plain in the plots: no relationship could be found between the proportion of aircraft pairs at the same altitude and the WAK scores, F(1, 44) = 0.00, p = .9813, R-squared = 0.0003, or between the mean proportion of aircraft pairs at the same altitude and the average TLX scores, F(1, 10) = 0.05, p = .8355, R-squared = 0.0045.
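The proportion metric can be sketched under the same assumptions as the earlier pair-count sketch; the input format and tolerance are again placeholders rather than POWER's actual conventions.

# Minimal sketch: same-altitude pairs divided by all possible pairings Comb(P, 2).
from itertools import combinations
from math import comb

def same_altitude_pair_proportion(altitudes_ft, tolerance_ft=0.0):
    """altitudes_ft: altitudes of all aircraft in the sector at one time slice."""
    p = len(altitudes_ft)
    if p < 2:
        return 0.0
    same_alt = sum(
        1 for a, b in combinations(altitudes_ft, 2) if abs(a - b) <= tolerance_ft
    )
    return same_alt / comb(p, 2)  # Comb(P, 2) = P! / ((P - 2)! 2!)

# Example: 4 aircraft, one same-altitude pair out of Comb(4, 2) = 6 possible pairs.
print(round(same_altitude_pair_proportion([35000, 35000, 33000, 29000]), 3))  # -> 0.167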
[Figures 5.8 and 5.9 (scatterplots). Figure 5.8: WAK vs. Prop. Acft Pairs at Same Alt, fitted line y = -0.2662x + 2.8648, R² = 0.0003. Figure 5.9: Avg. TLX vs. Mean Prop. Acft Pairs at Same Alt, fitted line y = -26.739x + 33.975, R² = 0.0045.]
Figures 5.8 and 5.9. When the workload measures were plotted against the proportion of aircraft pairs at the same altitude out of all possible aircraft pairings in the sector, no correlation could be found.
There are, however, several aspects to consider that may undermine these validation efforts. First, it may be argued that estimating controller workload by viewing SATORI recordings of ATC scenarios does not capture the workload experienced by the controller actually working the traffic. Moreover, it seems plausible that a passive observer would not
engage in conflict detection activities in the same way as a controller who is actively involved with the situation and responsible for the separation of aircraft in her or his sector. Second, the assumption that controllers would actively and continually scan all pairs of aircraft for conflicts is clearly false. Typically, controllers know the traffic flows in their sector and a relatively small number of "problem spots" where aircraft trajectories may intersect and separation assurance becomes a priority. Hence, the proportion of same-altitude pairs computed over only a subset of aircraft pairs in the sector, either those in close proximity to each other or those on converging trajectories, might have yielded better results. Unfortunately, such metrics could not be derived from the data available to us.
5.6.2. Evaluation of ZID Samples
We also examined the measure of aircraft pairs at the same altitude in the sample data from ZID. The number of aircraft pairs at the same altitude showed predictable differences between the sectors and between busy and slow times of day (Figure 5.10). A more interesting picture is revealed by examining the proportion of aircraft pairs at the same altitude (Figure 5.11). This measure clearly distinguished the WAB sector, which had a substantially larger proportion of aircraft at the same altitude than the other two sectors.
Figure 5.10. Number of aircraft pairs at the same altitude in the ZID sample.
Figure 5.11. Proportion of aircraft pairs at the same altitude in the sample ZID sectors. WAB is clearly distinguished, reflecting the characteristics of a superhigh sector.
This finding may be explained by the unique characteristics of the superhigh sector, where aircraft necessarily occupy altitudes corresponding to their performance limits, that is, at the bottom layer of the sector. This effect was accentuated by the 2,000-ft vertical separation minimum at these altitudes, which further restricts the distribution of aircraft across altitudes.
6. FUTURE DIRECTIONS
This section describes some theoretical work that was undertaken under the present project. This work pertains to the examination of operators' temporal mental models in the context of dynamic systems (e.g., ATC), the cognitive underpinnings of the mental representation of the temporal characteristics of dynamic tasks, and the consequent effects on human performance in such tasks. We also present an experimental paradigm for further investigation of these aspects of human performance and for the development of objective controller performance measures that may be derived from operational ATC data, for example, using a modified POWER program. However, as modifications to the POWER program could not be implemented under the auspices of the present project, these efforts are presented here under the heading "Future Directions."
6.1. Premises
It may be hypothesized that the operator's timing performance is a "product" of two parallel processes: an internal "clock" or some timekeeping mechanism, and the attentional and perceptual processes that sample the external environment (cf. Neisser's perceptual cycle). It may be further hypothesized that human sampling behavior is driven by the goodness of the temporal mental model, and in particular three distinct aspects of it: (1) the correct time to act or to update the mental model from cues available in the environment, (2) the time available for action or checking of cues, and (3) the time required to perform the action or check the cues. The first depends on prospective memory; the latter two would be characterized by a mental model of information access cost. We make several key assumptions for the development of time-based measures of taskload and performance: (1) a critical variable determining taskload is time pressure, which may be expressed as Time Available − Time Required; (2) an individual's ability to perform under time pressure depends on an accurate temporal mental model of the task demands (Time Available) and of one's own performance (Time Required); (3) the goodness of this temporal mental model is manifested in proper calibration of Time Required vs. Time Available, resulting in accurate and consistent timing of overt actions and correct action sequencing and scheduling; (4) the temporal mental model is affected by taskload as well as by individual differences; and (5) measurable timing performance will allow inferences to be made about both taskload and individual differences in coping with it. During any time-critical task, an individual will be confronted by three relevant features: (1) the correct time to act, or the opening of a window of opportunity, (2) the time available, or the duration of the window of opportunity, and (3) the time required to complete the task within the window of opportunity. In order to complete all tasks in an efficient manner, the operator must be aware of all three for each task. Although the optimal sequence of tasks to be completed can be computed, doing so is a resource-intensive task that is often avoided (Moray, Dessouky, Kijowski, & Adapathya, 1991). It is likely that human operators will opt for simple strategies, placing salient visual features of the timing task above the more accurate information available from mental arithmetic or duration estimation. There is evidence that participants, given visually different stimuli, tend to underestimate their window of opportunity in faster-moving trials relative to slower-moving trials, even when the duration of the trials is equal (Rantanen & Xu, 2001). Furthermore, there are
substantial performance decrements when visual stimuli are removed, forcing participants to rely on duration estimation (Xu & Rantanen, 2003). Next, we describe a multiple-task time pressure experiment that manipulated taskload using a time available/time required model and measured the subsequent effect of taskload on performance. Time available was manipulated by varying the temporal windows of opportunity, and time required was manipulated via increased scanning costs.
6.2. Experiment 2
6.2.1. Method
6.2.1.1. Participants
Twelve students from the University of Illinois at Urbana-Champaign, 20–30 years of age, participated in the study. Participants were compensated $8 for an hour-long experiment consisting of four 10-minute blocks during which they were presented with a time pressure simulation involving tasks of low cognitive demand.
6.2.1.2. Apparatus
A computer program developed specifically for this experiment presented participants with an abstract time-sharing task. The task was performance-dependent; the speed and accuracy of performance in one trial affected the onset of subsequent trials. In this way, the task mimicked the dynamic nature of real-world scenarios (e.g., air traffic control). Participants viewed a computer screen divided into four quadrants, of which only one was visible at a time (Figure 6.1). To view other quadrants, participants were required to move a cursor (using a mouse) to the desired quadrant. The previously visible portion of the screen would become blank, and the desired quadrant would become visible.
Figure 6.1. Screenshot from the experimental program depicting the experimental task. Subjects could view the separate tasks by moving a cursor to the corresponding quadrant on the display.
Each quadrant contained a red progress bar, an indication of the current percentage progressed, a mark on the bar indicating the window of opportunity for resetting the timer (the required task), and instructions for resetting the timer by typing a four-digit code (Figure 6.1). The participants were required to watch the progress bar until it reached the window of opportunity and to enter the code before the bar reached 100%.
6.2.1.3. Parameters
The progress bars, present in each quadrant, moved at one of three different speeds: 60, 30, and 15 seconds to progress from 0 to 100% (a ratio of 1x : 2x : 4x, respectively). The "opening" of the window of opportunity to reset a timer corresponded to the critical percentage point for a given trial, which occurred at 60, 80, or 90% (creating a window size of 40, 20, or 10%). While there were nine possible combinations of speeds and window sizes, there were only five possible durations of the window of opportunity: 1.5, 3, 6, 12, and 24 seconds. Pilot testing indicated that the 1.5-second durations were typically too short to complete the task and that the 24-second durations were excessively long. Both window durations were therefore removed, leaving seven combinations of speed and window size for the experiment (Table 6.1). Trials were sampled randomly from an even distribution of these seven speed/window size pairings, such that past trials held no predictive value for subsequent trials. During a block of trials, the delay parameter was set at either 0.5 or 1 second.

Table 6.1. Experimental design: window durations resulting from speed/window size pairings. The 1.5- and 24-second durations were eliminated from the experiment.

Window size (%)   Speed 1x    Speed 2x    Speed 4x
10%               6 sec       3 sec       (1.5 sec)
20%               12 sec      6 sec       3 sec
40%               (24 sec)    12 sec      6 sec
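The seven usable pairings and their durations follow directly from the parameters above and can be enumerated as in the sketch below. This is an illustration under the stated design parameters only; the variable names and the uniform sampling helper are placeholders, not the original experimental program.

# Minimal sketch: enumerate the speed/window-size pairings of Table 6.1 and
# sample trials uniformly so past trials hold no predictive value.
import random

SPEEDS_SEC = {"1x": 60, "2x": 30, "4x": 15}       # seconds for the bar to go 0-100%
WINDOW_SIZES = {"10%": 0.10, "20%": 0.20, "40%": 0.40}
EXCLUDED_DURATIONS = {1.5, 24.0}                   # dropped after pilot testing

pairings = []
for speed_label, bar_seconds in SPEEDS_SEC.items():
    for size_label, fraction in WINDOW_SIZES.items():
        duration = round(bar_seconds * fraction, 1)  # window duration in seconds
        if duration not in EXCLUDED_DURATIONS:
            pairings.append((speed_label, size_label, duration))

assert len(pairings) == 7  # nine combinations minus the 1.5-s and 24-s cells

def next_trial():
    """Sample one speed/window-size pairing uniformly at random."""
    return random.choice(pairings)

print(next_trial())  # e.g., ('2x', '20%', 6.0)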
6.2.1.4. Procedure
The participants were instructed to type the four-digit sequence in each quadrant to reset the timer, and to do so after its progress bar reached the critical percentage but before the bar reached 100%. Once a trial was successfully completed, the red progress bar reset to 0%, marking the beginning of the next trial. Participants were presented with four 10-minute blocks of such trials. While no penalties were enforced for failure to complete the task within this window of opportunity, participants were asked to be as diligent as possible.
6.2.1.5. Output
The program recorded a timeline (accurate to 0.01 seconds) of all events that transpired during the experiment, including movements from one quadrant to another, the opening and closing of
windows of opportunity, the time of task completion, keystrokes made by the participant, and the speed and critical percentage of the bar. These data were sufficient for recreating the participant's actions during the experiment. Several taskload and performance metrics were derived from these raw data.
6.3. Results
6.3.1. Data Reduction and Coding
Two metrics were used to judge individual performance during the time pressure task simulation: the proportion of trials that were successfully completed within their window of opportunity, and the percentage of the window of opportunity that elapsed prior to the initiation of a response (first key pressed). Good performance was manifested by the execution of tasks within the window of opportunity as well as by a minimal lag between the opening of a window of opportunity and the onset of a response (i.e., timely performance). The raw output of the program used in our paradigm also allowed for the derivation of several objective taskload and performance metrics. This analysis deals primarily with two such metrics whose difference constituted our primary measure of taskload. The first metric, termed here model A, relates to individual performance. Model A was calculated as the ratio of the time used to complete all trials in a given epoch (two minutes) to the average time required to complete that number of trials. "Time used" was the amount of time that elapsed from the opening of a window of opportunity until the completion of the task by the participant, summed across all tasks that occurred during an epoch. This is primarily a measure of efficiency, as lower values of A reflect more optimal performance of the tasks that occurred during a specific epoch. The second metric, termed here model B, was calculated as the ratio of the summed window durations for the tasks that occurred during an epoch to the average amount of time required for the participant to complete those tasks. Low values of B reflect minimal excess time to perform tasks, while higher values approximately represent the amount of buffer participants had within that epoch to perform the necessary tasks. Thus, model B functions primarily as a measure of task demand. The difference between model A (efficiency) and model B (task demand) constitutes the primary taskload metric: a negative value of model A − model B indicates that more time was available than was used during that epoch, and a positive value indicates that less time was available than was required.
6.3.2. Window Size and Response Initiation
Windows of opportunity varied based on the critical percentage and the speed of the moving bar, resulting in three discrete window durations (12, 6, and 3 seconds). The response initiation percentages were analyzed across the primary delay condition and the speed/window size pairings by ANOVA. The main effect of delay condition was significant, F(1, 3638) = 58.65, p < .001, as were the main effects of speed and window size, F(2, 3638) = 268.63, p < .001, and F(2, 3638) = 105.221, p < .001, respectively. Response initiation
percentages were elevated with greater delay and also became larger as window size and speed increased. Post-hoc comparisons of speed/window size pairs were performed to determine differences in performance within the three window durations. In the 0.5-second delay condition, the response initiation percentage differed significantly between the two 12-second pairings, t(489) = 4.499, p < .005. Three comparisons were necessary for the 6-second window pairings, as a 6-second window could be formed at all three speeds. The difference in performance between the 1x/10% and 2x/20% windows was significant, t(471) = 4.470, p < .005, as was that between the 1x/10% and 4x/40% pairings, t(578) = 7.057, p