Tasks, Models & Methods: System-Level Evaluation of Human-Computer Interaction (HCI) for Command & Control Decision Support
Lee Scott Ehrhart
March 1993
R&D Interim Technical Report #MISE-3-1-3 Contract #F30602-92-C-0119
Prepared for Rome Laboratory C3AB 525 Brooks Rd. Griffiss AFB, NY 13441-4505
Center of Excellence in Command, Control, Communications & Intelligence (C3I) George Mason University Fairfax, VA
Executive Summary

The evolution of technology has resulted in dramatic changes in the role of human users in command and control (C2) systems. The last thirty years have seen a shift away from conventional modes of warfare with well-understood adversaries towards limited-objective or low-intensity warfare characterized by shorter warning times, greater ambiguity, and a requirement to plan and execute responses in a greatly reduced time frame. While dramatic innovations in technology have greatly expanded information processing capabilities, the critical -- and often the most vulnerable -- components in C2 systems remain the human decisionmakers. The increased pressure on C2 systems emphasizes the need for more comprehensive human-computer interaction (HCI) evaluation methods relating human performance to system effectiveness.

Evaluation of the HCI component of a C2 system within the context of the overall system development goals is extremely difficult. Because of the complex interactions among humans, equipment, and information within the organizational structures and procedures involved in command and control, the contribution of the HCI design to overall system performance cannot be expressed in terms of a simple, direct metric. Evaluation of HCI designs requires an understanding of the interactions among the C2 system components (users, equipment, tasks, organization and procedures), the missions and functions the system supports, and the situational and environmental factors which affect those missions. At each stage in systems design and development, the various decisionmakers involved require information inputs from:
1. analyses of requirements (system objectives, functions, tasks, operational capabilities); and
2. evaluations of performance and effectiveness characteristics (current and potential).
Designs for the complex systems supporting command and control (C2) decisionmaking derive conceptual requirements from models of C2 processes. The doctrine incorporated in these models and the missions defined by the organization provide the context for identifying the functional and task requirements that structure the relationships of humans and machines. These requirements, in turn, help to determine the appropriate measures of performance (MOPs) and measures of effectiveness (MOEs) that form the selection criteria for HCI designs.
This paper presents a survey of some of the tools and techniques currently used to evaluate human performance and attempts to assess whether and how these tools and techniques may be used to support the evaluation of human-computer interaction in overall system performance. Following a brief review of the basic concepts involved in system-level analysis and a discussion of the role of analysis and evaluation in the system development life cycle (SDLC) for the design and acquisition of C2 systems, there is a discussion of the system models for command and control which provide the necessary framework for analysis and evaluation of performance and effectiveness. This is followed by a brief survey of the various analysis and evaluation methods which comprise the toolbox for studying HCI performance. Finally, these methods are related to the evaluation questions relevant at each stage of the SDLC and suggestions are made for matching methods to evaluation requirements.

This paper describes models and methods for realizing the potential utility of HCI evaluation to support the design and development of complex information systems. The approach presented embraces three basic principles:
• the design of human-computer interaction embodies the relationship of human users and computer-based aids in achieving system goals;
• the decomposition of HCI functions, processes, and tasks provides measurable indicators of the extent to which specific designs fulfill system objectives; and
• the utility of HCI evaluation to the system design process depends upon the application and interpretation of HCI measures in the context of a valid framework of objectives, functions, processes, and tasks.
While it is infeasible to define “cookbook” procedures for either the design of systems or the evaluation of those designs, the conceptual approach presented attempts to assist design decisionmakers in defining flexible, robust plans for analysis and evaluation.
Contents

Executive Summary
Contents
Figures
Tables
1. Human Performance in Complex Environments
   1.1. Introduction
   1.2. System-Level Analysis
        1.2.1. Scope & Boundary Issues
        1.2.2. Balancing Precision and Generalizability
   1.3. Performance Evaluation in the Systems Development Life Cycle
        1.3.1. C2 Systems Development
        1.3.2. Phase-Related Evaluation Requirements
2. Command and Control System Models
   2.1. Characteristics of Effective Models
   2.2. Representative Approaches to Command and Control Modeling
   2.3. Measures of Performance (MOPs) and Measures of Effectiveness (MOEs)
   2.4. Identifying and Describing Tasks for Command and Control
3. Analysis & Evaluation Methods for Complex Systems
   3.1. General Approaches to Data Collection & Organization
        3.1.1. Objective Methods
               Archival Data Analysis
               Observation
                   Raw Event/Time Records & Time Studies
                   Process Charts & Flow Process Charts
                   Gantt Charts & Multiple Activity Charts
                   FROM-TO Charts & Link Charts
                   Occurrence Sampling
                   Collecting Observational Data
        3.1.2. Subjective Assessment Methods
               Subjective Observation Methods
               Ranking Methods
               Rating Scale Methods
               Checklists
               Questionnaire & Interview Methods
               Verbal Protocol Analysis & Verbal Probe Techniques
        3.1.3. Simulation
        3.1.4. Prototyping Techniques
               Scenarios & Storyboards
               Mock-Ups
               Interactive Storyboards
               Rapid Prototyping
        3.1.5. Experimental Designs
               Laboratory Settings
               Simulated Settings
               Operational Settings
   3.2. Task Analysis Methods
        3.2.1. Hierarchical Methods
        3.2.2. Network Methods
        3.2.3. Cognitive & Knowledge Description Methods
        3.2.4. Taxonomic Methods
        3.2.5. Formal Grammar Methods
        3.2.6. Flow Charting Methods
   3.3. Cognitive Process Methods
        3.3.1. Approaches to Studying Cognitive Process
        3.3.2. Experimental Manipulations to Test Process Hypotheses
   3.4. Mental Workload Assessment
        3.4.1. Workload Measures
               Task Performance Measures
               Physiological Measures
               Subjective Measures
        3.4.2. Applying Workload Studies to Team Decisionmaking Problems
4. Implementing HCI Analysis & Evaluation to Support the Systems Design & Development Process
   4.1. Supporting the C2 System Development Life Cycle (SDLC) Phases
   4.2. Selecting and Organizing Measures to Meet Evaluation Objectives
5. Summary
6. References
Figures

1.1 Simple Model Relating System Attributes to System Performance
1.2 Three Interfaces for Evaluation Focus
1.3 Boar's Structured Development Life Cycle
1.4 Andriole's Systems Design Methodology
1.5 Evaluation Applications & Users: Conceptual Aspects of C2 System Development
1.6 Evaluation Applications & Users: Implementation Aspects of C2 System Development
1.7 DOD-STD-2167A Software Development Standard
2.1 Conceptual C2 Process Model
2.2 Orr's Conceptual Combat Operations Process Model
2.3 The SHOR Model of Tactical Decision Processes
2.4 Relationship of Measures and System Boundaries
2.5 Factors and Measures Related to Mission Effectiveness
3.1 Hierarchy of Observational Dimensions and Methods
3.2 Hypothesized Effects of Information Display on Tactical Planning
3.3 Conceptual Relationship Between Workload Factors and Measures
4.1 Contributions of Analytical & Empirical Evaluation Approaches
4.2 Modular C2 Evaluation Structure
Tables

1.1 Programmatic Aspects of C2 Development
1.2 Evaluation Requirements for C2 System Development Life Cycle Phases
2.1 Effects of Abstraction Level on Model Credibility
2.2 Classes of Command and Control Models
2.3 C2 System Evaluation Measures
2.4 Characteristics of Good Performance & Effectiveness Measures
3.1 Archival Data Analysis Methods
3.2 Objective Observation Methods
3.3 Objective Observation Methods Summary
3.4 Subjective Assessment: Ranking Methods
3.5 Subjective Assessment: Rating Scale Methods
3.6 Subjective Assessment: Checklists
3.7 Subjective Assessment: Questionnaire & Interview Methods
3.8 Subjective Methods: Verbal Protocol & Probe Techniques
3.9 Simulation Techniques
3.10 Prototyping and Evaluation to Match Changing Design Requirements
3.11 Prototyping Techniques
3.12 Summary of Alternative Evaluation Settings
3.13 Comparison of Experimental Settings
3.14 Task Analysis Methods Comparison
4.1 HCI Evaluation in Relation to SDLC Phases and Objectives
1. Human Performance in Complex Environments

1.1. Introduction
In the last thirty years, the analysis and evaluation of manned systems for command and control (C2) have expanded from the physiological/perceptual focus of ergonomics and the training focus of behavioral scientists to include a greater emphasis on cognitive factors. The growing interest in the cognitive aspects of C2 systems is driven by a variety of factors. The evolution of technology has resulted in dramatic changes in the role of human users in C2 systems. The quest for and capability of acquiring more and better intelligence on evolving situations has resulted in a volume of complex data that can no longer be managed using manual techniques. This information overload is further compounded by the dramatic increases in speed and accuracy of weapon systems. The resulting requirement for faster responses has mandated a steady increase in automated processes and a corresponding shift in the role of humans to more supervisory activities. Finally, in addition to conventional modes of warfare with well-understood adversaries, military forces must now prepare for limited-objective or low-intensity warfare in novel situations. These new modes are characterized by shorter warning times, greater ambiguity, and a requirement to plan and execute responses in a greatly reduced time frame.

In response to these changes, C2 systems researchers and designers are exploring innovative concepts in all aspects of human-computer interaction (HCI). New developments in chip architecture have made technologically and economically feasible a variety of HCI approaches previously thought exotic (e.g., natural language interfaces and multimedia information display). Enthusiasm for these new technologies can result in their introduction into systems without a clear understanding of the potential impacts. On the other hand, certain technologies may be under-exploited when designers are unaware of their potential advantages. For the most part, the guideline literature provides only generic direction for designers. Although theoretically sound, guidelines may still be inadequate, either because they are written at a level of abstraction which does not translate well to design requirements or because they focus only on the traditional forms of dialogue and display. To meet the challenges of design, there is a critical need to evaluate HCI within the system context.

The increased pressure on C2 systems emphasizes the need for more comprehensive HCI evaluation methods to relate human performance to system effectiveness. Investigations of C2 decision failures indicate critical inadequacies in the support of decisionmakers' cognitive requirements, particularly in situations characterized by high threat, time constraint, ambiguity, and uncertainty. While C2 systems are currently designed and evaluated for compliance with human factors standards, traditional models for assessing human reliability are generally unsuitable for describing and predicting the occurrence and effects of cognitive errors. Andriole [1990] points to the erroneous tendency to equate the "testing" of software -- and, by implication, of standard human factors compliance -- with the evaluation of overall system effectiveness. The need for systems that support human decisionmaking and that depend heavily upon human-computer interaction mandates a systems approach to the evaluation of the HCI component in C2 systems.

Command and control systems are often mistakenly identified as the electronic subsystems that support decisionmaking or assist in implementing those decisions.
More accurately, the Joint Chiefs of Staff Pub 0-2 [1986] defines command, control, and information systems as integrating the human operators and decisionmakers, organizational structure, doctrine and procedures, information processing systems, equipment, and facilities to support command authority at all levels in accomplishing the objectives of designated missions. The various components are interactively linked such that changes in one human, machine, or human-machine component can significantly modify the tasks performed by another. Furthermore, the system operates within an environment which includes not only its superordinate organization and related units, but also a dynamic array of potential adversaries. Finally, the system environment is also defined by its geographic location and associated climatic conditions. Meaningful evaluation of the potential effects of introducing new HCI technology into a C2 system requires an understanding of the total context of that system's operation.

This paper presents a survey of some of the tools and techniques currently used to evaluate human performance, and it attempts to assess whether and how these tools and techniques may be used to support the design and evaluation of human-computer interaction as a contributor to overall system performance in support of C2 decisionmaking. This section continues with a brief review of the basic concepts involved in system-level analysis and addresses the roles of analysis and evaluation in the system development life cycle (SDLC) for the design and acquisition of C2 systems. Section Two discusses system models for command and control which provide the necessary framework for analysis and evaluation of performance and effectiveness. This is followed in Section Three by a brief survey of the various analysis and evaluation methods that comprise the toolbox for studying HCI performance. In Section Four, these methods are related to the evaluation questions relevant at each stage of the SDLC, and suggestions are made for matching methods to evaluation requirements.
1.2. System-Level Analysis

Meister [1991] summarizes the characteristics of a system in terms of the fundamental tenets of Gestalt:
a) the whole is greater than the sum of the parts;
b) the whole determines the nature of its parts;
c) the parts cannot be understood in isolation from the whole; and
d) the parts are dynamically interrelated or interdependent.

Viewed in this light, human decisionmaking performance is essentially inseparable from the larger system in which it is manifested. The nature of the system and its characteristics, or attributes, both constrain and empower the human decisionmakers who interact as part of the system (Figure 1.1). Thus, a system cannot be viewed simply in terms of hardware and software configurations, nor merely as the interaction of human users with those configurations. The system is an organized whole that includes the decisionmaking environment, organizational structures, and related factors.

The concept of system-level analysis has several implications for human performance analysis. "True system research" as characterized by Meister [1991, 1981] bears a strong resemblance to some of the basic activities of requirements analysis:
a) incorporate all the various internal and external factors that may impact design and system performance;
b) relate cognitive/behavioral analyses, design, and evaluation to tasks, system goals, and output; and
c) model the interaction of processes and sub-processes across all levels of the functional hierarchy.
To this end, well-balanced system-level evaluation incorporates both quantitative and qualitative data using formal and informal collection methods as defined by the goals of the evaluation. Moreover, analysis and evaluation of human-computer interaction and decisionmaking performance in C2 systems require entering the real-world environment at some point. Large, complex systems cannot be brought effectively into the laboratory except as narrowly represented in subsystem simulations. This requirement for some in situ data collection does not preclude the collection and analysis of data drawn from controlled settings using formal and informal methods. However, the ultimate validity of such data and the conclusions derived from them depends upon their generalizability to the target system and environment.
[Figure: system attributes -- goals & objectives, programs & missions, organizational context, environmental factors, components, resources, and structure -- both empower and constrain human roles and capabilities; human performance (inputs, processes, outputs) in turn contributes to overall system performance.]
Figure 1.1: Simple Model Relating System Attributes to System Performance

1.2.1. SCOPE & BOUNDARY ISSUES
The systems that provide the subjects of effectiveness evaluations fall into one of three categories:
• existing systems;
• new or updated systems that do better what the old systems did (i.e., faster, more accurately, more reliably, requiring less training time, easier to maintain, less costly, etc.); and
• new systems to do what cannot be done with existing systems or to tackle new problems [Battilega & Grange, 1980].

Regardless of the nature of the system of immediate interest, there are inevitable comparisons to and interactions with current or planned systems. These complexities and interdependencies create difficulties in defining the scope of evaluation and determining the related boundaries. A broad focus, attempting to incorporate all the factors which might impact system performance, can result in an intractable number of variables. In contrast, setting too narrow a focus can result in a study that is more manageable and satisfies statistical requirements, but is of little use in answering critical design questions.

Adelman [1992] represents system and subsystem boundaries of a decision support system (DSS) in terms of a series of nested, interdependent interfaces (Figure 1.2). The kernel of this representation is the interface between the user and the DSS. At the next level is the interface between the user-DSS system and the decisionmaking organization (DMO) that it serves. The third interface is that between the DMO and the environmental context of its operation. System-level analysis and evaluation must take into account each of these interfaces and their impacts on the other two, but need not model all at the same level of detail. Ultimately, the scope and boundaries of the evaluation must be defined in accordance with the objectives of the study. This relationship is discussed in greater detail in Section Two.
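To make such boundary-setting decisions concrete, the short sketch below records which of Adelman's three interfaces a given study will model and at what level of detail. It is only an illustrative sketch under assumed names -- the classes, the three-level detail scale, and the example study are not part of the original report.

```python
# Illustrative sketch only: names, the detail scale, and the example are assumptions.
from dataclasses import dataclass, field
from enum import Enum


class Interface(Enum):
    """Adelman's three nested interfaces (Figure 1.2)."""
    USER_DSS = "user/DSS"
    USER_DSS_DMO = "user-DSS / decisionmaking organization"
    DMO_ENVIRONMENT = "DMO / environment"


class Detail(Enum):
    MINIMAL = 1
    MODERATE = 2
    HIGH = 3


@dataclass
class EvaluationScope:
    """Boundary decisions for one evaluation study."""
    objective: str
    detail: dict = field(default_factory=dict)   # Interface -> Detail

    def validate(self) -> None:
        # System-level analysis must account for all three interfaces,
        # even if some are represented only in minimal detail.
        missing = [i.value for i in Interface if i not in self.detail]
        if missing:
            raise ValueError(f"unbounded interfaces: {missing}")


# Example: a hypothetical design-phase study centered on the user/DSS interface.
scope = EvaluationScope(
    objective="Compare candidate display formats for tactical planning support",
    detail={
        Interface.USER_DSS: Detail.HIGH,
        Interface.USER_DSS_DMO: Detail.MODERATE,
        Interface.DMO_ENVIRONMENT: Detail.MINIMAL,
    },
)
scope.validate()
```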
1.2.2. BALANCING PRECISION AND GENERALIZABILITY
Designing any evaluation study, whether it is a classic experimental design or a quasi-experimental study, requires making some tradeoffs between the precision of the measurements and the generalizability of the results. Precision encompasses the reliability and internal validity of the research design, measurement instruments, and analysis methods.
Generalizability, or external validity, indicates the extent to which the findings may be applicable in somewhat different contexts. In applied studies, generalizability is often defined as the relevance of the findings to the target operational environment. Precision requires a narrowing of focus and a refining of techniques; generalizability, in contrast, involves methods which broaden the scope in terms of representative populations and settings. The balance between precision and generalizability shifts as a function of the evaluation objectives. In basic research involving the investigation and description of phenomena in controlled settings, precision and generalization are established through the rigor of research design and randomization techniques. Applied research, particularly that concerned with performance evaluation of existing or proposed systems, has somewhat different precision and generalization requirements and constraints. There, regardless of the precision achieved in terms of the statistical significance of the findings, failure to achieve an adequate level of external validity implies that the research has little relevance to the target environment. Nevertheless, sacrifices in internal validity undermine the assurance that the findings were not the result either of chance or of other, unmeasured factors.
[Figure: three nested, interdependent interfaces -- the user/DSS interface at the core, surrounded by the user-DSS/decisionmaking organization (DMO) interface, surrounded in turn by the DMO/environment interface.]
Figure 1.2: Three Interfaces for Evaluation Focus (adapted from Adelman, 1991)

1.3. Performance Evaluation in the Systems Development Life Cycle
Systems and software engineering literature abounds with various life cycle models. Central to most of the accepted models is the notion of a structured, iterative process beginning with the identification of a problem and ending with an operational system. Although development is modeled with discrete phases and flows, it is generally understood that the actual processes overlap and many occur in parallel. The various models differ in terminology, phase boundaries, and the level of detail presented; however, most identify several common life cycle stages:
• requirements definition
• design
• implementation
• testing & evaluation
• operational fielding
• maintenance
These six phases are approximated in Boar’s [1984] structured development life cycle (Figure 1.3) and Andriole’s [1987] systems design methodology (Figure 1.4). With only cursory examination, these models would seem to confine evaluation activities primarily to the last stages of development. Adelman [1992] points out, however, that judgments and decisions pervade every phase of the development process. The results of analysis and evaluation (represented as feedback loops in Andriole’s model) provide input to support development objectives at each stage and to determine whether those goals have been achieved. Thus, evaluation is a critical component in requirements-driven design.
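The feedback role of evaluation in such a requirements-driven process can be sketched in a few lines of Python. Everything here is a simplifying assumption (the phase names follow the list above; the pass/fail evaluation rule and the rework limit are invented); the point is only that each phase's products are evaluated against the requirements before development proceeds, and a failed evaluation loops back into rework.

```python
# Hypothetical sketch of a requirements-driven life cycle with evaluation feedback.

PHASES = [
    "requirements definition",
    "design",
    "implementation",
    "testing & evaluation",
    "operational fielding",
    "maintenance",
]


def evaluate(phase: str, product: dict, requirements: dict) -> bool:
    """Toy evaluation rule: a phase passes when its product addresses every
    requirement flagged for that phase."""
    needed = requirements.get(phase, [])
    return all(item in product.get("addresses", []) for item in needed)


def develop(requirements: dict, build_product, max_rework: int = 3) -> dict:
    """Walk the phases in order; a failed evaluation sends the phase back for rework."""
    products = {}
    for phase in PHASES:
        for _attempt in range(max_rework + 1):
            product = build_product(phase, products)     # design/build activity
            if evaluate(phase, product, requirements):   # evaluation gate
                products[phase] = product
                break
            # Feedback loop: evaluation results drive rework before moving on.
        else:
            raise RuntimeError(f"'{phase}' never satisfied its requirements")
    return products
```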
[Figure: Boar's structured development life cycle, beginning from an identified problem and proceeding phase by phase with phase-related goals & objectives --
Feasibility: determine technical, operational, and cost/benefit feasibility of the proposed application.
Definition: determine the requirements the system must meet.
Preliminary Design: determine a physical solution to the requirements.
Detail Design: provide an exact specification for the construction of each system component.
Implementation: construct, test, and verify the system.
Conversion: convert from the current mode of operation to the new system.
Production/Maintenance: operate the system on a day-to-day basis; revise the system to maintain operational correctness.]
Figure 1.3: Boar's Structured Development Life Cycle (adapted from Boar, 1984)
[Figure: Andriole's systems design methodology -- a sequence of steps including requirements analysis (user/task/organizational profiling with feasibility and constraint assessment), methods selection (cost-benefit comparison and requirements compatibility), modeling (e.g., flow-charting, narrative storyboarding), software selection/design, hardware and system selection/configuration, input/display/output design, system packaging (documentation, support, training), system transfer, and system evaluation, each accompanied by assessment activities and linked by feedback loops to earlier steps.]
Figure 1.4: Andriole's Systems Design Methodology (Andriole, 1987)
1.3.1. C2 SYSTEMS DEVELOPMENT

Prefatory to any discussion of C2 systems development, it is important to understand the organizational or programmatic context of these development efforts. In this context, systems comprise an organizational response to the requirements of C2 missions in terms of personnel, technology, and procedures. The process of arriving at this organizational response involves both top-down assignments of missions and functions and bottom-up assessments of capabilities and deficiencies with respect to those missions. The process culminates in the Program Objectives Memorandum (POM) and budget process as the individual services and programs compete for resources. Viewed from this programmatic level, the activities involved in developing or modifying C2 systems and subsystems may be organized into either conceptual or implementation concerns [Foster et al., 1986]. Table 1.1 presents the primary activities associated with the conceptual and implementation aspects of C2 systems development.
Conceptual Aspects
• Doctrine Development
• Requirements Generation & Validation
• C2 Contribution to Force Effectiveness
• New Technology Assessment
• Research & Development Goals

Implementation Aspects
• POM/Budget Process
• Acquisition Process
• Technical Evaluation
• Operational Evaluation

Table 1.1: Programmatic Aspects of C2 Development

The initial, conceptual aspects of C2 development principally require judgments regarding the relative merits of alternative courses of action. Doctrine development involves the combination of top-down and bottom-up methods for defining objectives (national, service, etc.), strategies and tactics, and missions ranging from global force employment to the unit level. These processes primarily involve qualitative (rather than quantitative) evaluation of alternative strategies and tactics with respect to missions and their execution within operational environments. Viewed in the context of doctrinal requirements, C2 systems support the capability to accomplish mission objectives rather than being an end in themselves. This concept lies at the heart of the identification and validation of new C2 capability requirements, evaluation of C2 system contribution to overall effectiveness, assessment of new technologies, and setting research and development (R&D) goals. Evaluation is required to compare the required and existing capabilities with respect to their effectiveness in support of the conduct or deterrence of combat. Thus, at the conceptual levels, these capabilities are still primarily expressed in qualitative terms. However, there is often inadequate consideration of innovative vs. traditional methods and a failure to recognize that "more and faster" may not be better.

The implementation aspects of C2 system development are, by definition, funding and time constrained and necessarily involve more quantitative evaluation. The POM/budget process requires tradeoffs which weigh differing capabilities against the resources required to acquire them. Acquisition expands the requirements to fully define the operational concept and determine facilities, equipment, and personnel support needs for the new system. Technical evaluation encompasses the verification of the hardware/software system with respect to the design specifications. Through acceptance testing, utility and effectiveness evaluations, operational evaluation combines the technical
component with the operational dimension (manpower and support) to determine how well the system performs in the intended environment. Evaluation is a key component in all of these activities. Process inputs take the form of internal studies and reports drawn from operational data, findings from basic and applied research, and evaluation studies from other development efforts. Detailed examination of the organizational processes involved is beyond the scope of this paper; however, Figures 1.5 & 1.6 indicate the relationships between the various decisionmakers and the activities required to support a coherent organizational response in the form of program plans and system development. The range of potential evaluation users and decisionmakers indicates the need for evaluation across the spectrum of quantitative and qualitative methods at varying levels of abstraction and detail. In fact, within any given phase the various stakeholders may reasonably base their decisions on entirely different studies. For example, the highest level decisionmakers may use aggregations of lower-level evaluations and even commission independent studies to aid their decision processes.
[Figure: decisionmakers and evaluation users in the conceptual aspects of C2 system development -- DOD, the Deputy Secretary, JCS & Service Chiefs, force commanders, operational commanders, force employment and operational force planners, development agency commanders, R&D executives, and system commands/product divisions -- mapped to the activities of doctrine development, requirements generation & validation, C2 contribution to force effectiveness, new technology assessment, and research & development goals.]
Figure 1.5: Evaluation Applications & Users: Conceptual Aspects of C2 System Development
[Figure: decisionmakers and evaluation users in the implementation aspects of C2 system development -- DOD, the Deputy Secretary, Service Chiefs, acquisition executives, force commanders and force employment planners, development agency commanders, program & project managers, system commands/product divisions, OT&E commands, and system analysts -- mapped to the POM/budget process, the acquisition process, technical evaluation, and operational evaluation, with links back to the conceptual phases.]
Figure 1.6: Evaluation Applications & Users: Implementation Aspects of C2 System Development

Design decisions related to the HCI aspects of C2 systems are first addressed in the conceptual phases of development, where HCI evaluation necessarily serves as input to new technology assessment and R&D goal setting. In the budget and acquisition phases of implementation, evaluation addresses the resources required to accomplish capability objectives. This includes not only evaluation of the technological aspects of human-computer interaction, but also tradeoffs in the allocation of support functions as dictated by the HCI component in the system concept. Finally, HCI evaluation serves as both input to and output of the technical and operational evaluations of prototypes and fielded systems.

The actual design and acquisition of C2 systems is controlled through federal acquisition regulations and development standards that mandate both the process and the review milestones by which it is monitored. Within the U.S. Department of Defense (DOD), DOD-STD-2167A, the Software Development Standard [DOD, 1988], is the primary software development management and engineering standard and is intended to cover total systems development (Figure 1.7). Although 2167A supplies detailed guidance for the management aspects of development (e.g., configuration management, product evaluation, formal and informal testing, and documentation), it provides minimal guidance with respect to hardware development and total system testing [Marciniak & Reifer, 1990]. Further specification of development reviews and review products is left to developers of the individual project's request for proposal (RFP) and to contract negotiation. Theoretically, all evaluation requirements are specified precisely. In practice, the evaluation specified and contracted for may not provide sufficient feedback at the appropriate phase to prevent costly re-design or fielding of systems which do not meet the organizational requirements.
1.3.2. PHASE-RELATED EVALUATION REQUIREMENTS
The process involved in developing or modifying C2 systems and subsystems may be modeled in terms similar to the life cycles described previously. For example, Foster et al [1986] present a four-stage development process defined as conceptual, design, acquisition, and operational phases. In this model, the conceptual phase comprises those activities involved in the identification of desired system capabilities with respect to C2 missions. System definition provides the primary objective for the design phase. The supporting activities include developing the system requirements, determining current system capabilities and deficiencies, and selecting the “best” of alternative system definitions. During the acquisition phase, the “best” design is determined and the developmental system is implemented and evaluated. The operational phase involves fielding the system, final evaluation and maintenance. In a requirements-driven design process, the judgments and decisions made during each phase determine the objectives of the analyses and evaluations required to support those decisions. These phase-related objectives further define the scope and boundary of the evaluation in terms of the extent to which a given study considers organizational interactions and environmental context as well as the level of detail with which these factors are represented.
[Figure: DOD-STD-2167A development flow -- system requirements analysis and system design precede parallel software and hardware development paths. The software path runs from software requirements analysis through preliminary design, detailed design, coding/unit testing/integration testing, and testing (the developmental configuration); the hardware path runs from hardware requirements analysis through preliminary and detailed design, fabrication, and testing. The paths converge in system integration and testing, then production and deployment, with functional, allocated, and product baselines established along the way. Formal reviews mark the milestones: SRR (System Requirements Review), SDR (System Design Review), SSR (Software Specification Review), PDR (Preliminary Design Review), CDR (Critical Design Review), TRR (Test Readiness Review), FCA (Functional Configuration Audit), PCA (Physical Configuration Audit), and FQR (Formal Qualification Review).]
Figure 1.7: DOD-STD-2167A Software Development Standard (DOD, 1988)
Table 1.2 presents the relationship between evaluation objectives and the associated scope and boundary definitions. At each phase the system is considered in the context of the organizational and environmental factors which impact performance; however, these factors are represented at varying levels of detail depending upon the phase requirements. During the conceptual phase, the system in question is modeled at a relatively high level of abstraction; the desired performance is expressed primarily in qualitative terms. The nature of the interaction with other support systems and the external environment is modeled in very low detail. As development proceeds to later phases, the specification of requirements increases in detail with respect to the system itself and its interaction with other systems in the organization and with the external environment. This specification, in turn, dictates the inclusion of more precise quantitative and qualitative measures to assure that the system meets both engineering specifications and organizational requirements.

Evaluation of the HCI component of a C2 system within the context of the overall system development goals is extremely difficult. Because of the complex interactions among humans, equipment, and information within the organizational structures and procedures involved in command and control, the HCI design embodies most of the system concept available to the user to guide his/her mental model of the system. The HCI design includes such critical design factors as:
• the representation of system states and feedback to the operator on the results of actions taken;
• the representation of information regarding situation elements external to the system (support systems, physical environment, threats, etc.);
• the allocation of tasks between the human operator/decisionmaker and the computer, as determined by the dynamics of the situation and the requirements of the analytical methods selected to support decision processes; and
• the modes in which users may interact with all of this information to explore situations, develop hypotheses, generate options, select among alternatives, and implement their decisions.
Unlike features such as targeting accuracy, the contribution of the HCI design to overall system performance cannot be expressed in terms of a simple, direct metric. Performance evaluation of the HCI design requires an understanding of the interactions among the C2 system components (users, equipment, tasks, organization and procedures), the missions and functions the system supports, and the situational and environmental factors which affect those missions. The next section undertakes an examination of this evaluation context in terms of C2 system models.
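Before turning to those models, a worked (and entirely hypothetical) illustration may help show why the HCI contribution resists a single direct metric. The sketch below rolls several task-level MOPs up into one mission-oriented MOE; the task names, reference values, weights, and the linear aggregation rule are assumptions introduced only to show that any such roll-up presupposes an explicit framework of functions, tasks, and mission context -- none of these numbers comes from the report.

```python
# Hypothetical roll-up of task-level HCI MOPs into a single MOE.
# Task names, reference values, weights, and the aggregation rule are all assumed.

mops = {                       # observed performance (lower is better here)
    "time_to_detect_track_change_s": 12.0,
    "option_generation_errors": 1,
    "plan_revision_time_s": 95.0,
}

references = {                 # mission-derived reference levels (assumed)
    "time_to_detect_track_change_s": 30.0,
    "option_generation_errors": 4,
    "plan_revision_time_s": 180.0,
}

weights = {                    # contribution of each task to the supported function;
    "time_to_detect_track_change_s": 0.5,   # weights come from the C2 model and
    "option_generation_errors": 0.2,        # mission context, not from the HCI alone
    "plan_revision_time_s": 0.3,
}


def moe(mops: dict, references: dict, weights: dict) -> float:
    """Weighted sum of normalized MOPs; 1.0 means every reference level is met exactly."""
    score = 0.0
    for name, observed in mops.items():
        normalized = references[name] / observed if observed else 2.0
        score += weights[name] * min(normalized, 2.0)   # cap so no single MOP dominates
    return score


print(f"Illustrative situation-assessment MOE: {moe(mops, references, weights):.2f}")
```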
SDLC Phase: Conceptual
  Evaluation objectives (to support or influence):
  • Determination of desirable C2 system characteristics
  Evaluation characteristics:
  • High level of abstraction
  • Primarily qualitative measures
  • Organizational and environmental interactions represented in minimal detail

SDLC Phase: Design
  Evaluation objectives (to support or influence):
  • Developing C2 system requirements & specifications
  • Determination of current system capabilities & deficiencies
  • Selection of "best" of alternative system definitions
  Evaluation characteristics:
  • System represented in moderate to high detail
  • Qualitative & quantitative measures
  • Organizational and environmental interactions modeled in moderate detail

SDLC Phase: Acquisition
  Evaluation objectives (to support or influence):
  • Determination of "best" design
  • Provision of feedback on detailed design
  • Determination of whether developmental system meets specifications
  Evaluation characteristics:
  • System modeled in moderate to high detail
  • Qualitative & quantitative measures
  • Organizational and environmental interactions modeled in moderate to high detail

SDLC Phase: Operational
  Evaluation objectives (to support or influence):
  • Determination of whether the operational system meets organizational requirements
  Evaluation characteristics:
  • Qualitative & quantitative measures
  • High detail in system and context modeling

Table 1.2: Evaluation Requirements for C2 System Development Life Cycle Phases (adapted from Foster et al., 1986)
2. Command and Control System Models

Analysis and evaluation of system processes and the roles of subsystem components in those processes inevitably require some type of modeling. The term "model" covers a wide range of representations with an equally wide variety of uses. The American Heritage Dictionary [1985] presents several definitions; three seem most appropriate to this discussion. By definition [1], a model is:
1. A tentative description of a system or theory that accounts for all of its known properties.
2. A preliminary pattern serving as the plan from which an item not yet constructed will be produced.
3. A small object, usually built to scale, that represents another, often larger object.
These three definitions encompass the three principal types of models that typically form the context for HCI evaluation: the conceptual model, the design model, and the developmental (or prototype) model. Design and development are in themselves refining processes in which analysis, modeling, and evaluation interact continually. In the early phases of development, models may be largely informal, conceptual expressions of policymakers' or designers' view of the system and its context. Evaluation of existing system operations feeds analysis to support the early stages of concept definition that, in turn, forms the first system "model." Implicit in this model is some representation of the system's purpose as it relates to organizational goals and the identification of criteria by which the achievement of those goals is recognized. As definition progresses, the current system model is analyzed in terms of the perceived deficiencies, or shortfalls, between what the system provides and what the organization needs. This process leads to the definition of yet another model -- a set of requirements for the next generation system and the criteria by which alternative designs, or system models, will be evaluated with respect to those requirements.

Evaluation and modeling continue to play a key role in supporting decisions throughout the iterative synthesis processes of design and development. Even in the early phases of development, evaluation is still being performed upon models -- in the form of system prototypes. Finally, evaluation of operational systems is accomplished with the assumption that the evaluation criteria, established in the form of measures of performance (MOPs) and measures of effectiveness (MOEs), accurately represent (or model) the relationships between component, subsystem, and system performance and the larger purpose for which the system is intended.

Modeling technological processes involving signal flows or data throughput in a large, complex system is a complicated task. Nevertheless, the problems tend to be boundable and tractable. In such cases, well-defined input-process-output models can be highly effective. The cause-and-effect relationships dictated by physical laws provide some assurance that changes in one part of the system will effect predictable changes in its related components. These relationships can be modeled, and MOPs/MOEs established, that may be used with a high degree of confidence to evaluate whether system performance meets the requirements.

Modeling the human-computer interaction aspects of a C2 system supporting human decisionmakers in a complex, dynamic, high-risk environment presents an additional set of challenges. Human performance in cognitive tasks is exceptionally resistant to representation in cleanly defined cause-and-effect models. The interaction strategies and technological features which comprise the HCI design generally cannot be linked directly to task performance -- let alone overall system performance. Moreover, once it is recognized that humans are not interchangeable components, it immediately becomes apparent that simple outcome models are insufficient. The model that provides the framework of HCI evaluation in this context must include some representation and appraisal of the processes involved in task performance as well as the outcome of that performance. HCI evaluation is almost always performed based upon a set of hypotheses that relate design features to changes effected upon processes which, in turn, effect changes in performance outcome.
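A hedged sketch of one such hypothesis chain follows; the design feature and the measure names are invented for illustration, and the record simply makes explicit what an evaluation would have to instrument at both the process and the outcome level.

```python
# Sketch of an evaluation hypothesis chain: design feature -> process -> outcome.
# All names below are hypothetical; only the structure matters.
from dataclasses import dataclass


@dataclass
class Hypothesis:
    design_feature: str          # the HCI design choice under evaluation
    process_measures: list       # MOPs expected to change (mediating processes)
    outcome_measures: list       # MOEs expected to change as a consequence
    rationale: str


h1 = Hypothesis(
    design_feature="map overlay coding unit status by color",
    process_measures=["time to locate threatened unit", "number of display queries"],
    outcome_measures=["plan revision time", "rated quality of selected option"],
    rationale="Reducing visual search should shorten situation assessment, which "
              "should in turn shorten and improve option selection.",
)

# An evaluation plan therefore needs instruments at both levels: process measures
# to confirm the mediating mechanism, outcome measures to confirm the system effect.
print(h1.design_feature, "->", h1.process_measures, "->", h1.outcome_measures)
```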
This section presents a brief examination of some of the common features of effective models and looks at command and control modeling approaches in light of those features. It continues with the identification and modeling of MOPs and MOEs with respect to the role of the HCI component in overall system performance, and it concludes with a discussion of the role of modeling in task allocation among individuals, teams, and computers.
[1] Italics added by the author.
2.1. Characteristics of Effective Models

In keeping with the definitions presented, a model may be viewed as a structured representation of some aspect of a larger reality. In the context of analysis and evaluation for system development, models may be employed for a variety of reasons. For example, Hoeber [1981] identifies three basic purposes for modeling:
• Improve problem understanding for both the analyst/designer and the user/client;
• Assist in developing solutions to complex, yet tractable, problems; and
• Provide support for making choices where uncertainty and ambiguity cannot be resolved or there are no clear-cut solutions.
In each case, the underlying motivation for developing a model is the decision or "problem." Furthermore, there is an implied assumption that the problem is complex enough to be difficult to understand or solve without the aid of a model to abstract the relevant interactions among the various critical aspects of the problem. Models are often criticized as simplifying the problem environment. However, Hoeber suggests that this feature of models is actually desirable in that it is the overwhelming complexity of the real world that limits the decisionmaker's ability to solve the specific problem at hand. Modeling tradeoffs generally attempt to balance the advantages of greater simplicity against the risks of omitting a potentially critical factor. Table 2.1 presents these tradeoffs in terms of their effect on the credibility of the model and evaluation results.

Least Abstract
  Model credibility:
  • Appearance of validity due to detail & greater match to real-world inputs & processes
  • Increased complexity may require simulation rather than an analytical model
  • Large number of variables & assumptions increases possibility of bias
  Results credibility:
  • Dependent upon the credibility of the model
  • Large number of variables makes comprehensive sensitivity analysis infeasible; potentially important variables may not be fully explored

Most Abstract
  Model credibility:
  • Appears less valid due to simplicity & lack of real-world detail
  • Can be supported with historical data or data from more detailed models
  • Increased potential of omitting important factors
  Results credibility:
  • Simpler model is easier for the client to understand
  • Simpler model is more tractable for sensitivity analyses

Table 2.1: Effects of Abstraction Level on Model Credibility (adapted from Battilega & Grange, 1980)

There is a tendency in all modeling efforts to model those aspects which are best understood and readily lend themselves to representation. As Table 2.1 indicates, models may vary in degree of abstraction from highly detailed representations of all the tasks, personnel, organizational interactions, automated support systems, and environmental factors involved in a decision domain, on the one hand, to representations of the relationships between a few highly aggregated variables on the other. Less abstract models (i.e., those with greater detail) appear more "realistic" and are accorded a high degree of face validity. However, the exhaustive detail may not help the model users to clearly identify the critical factors and interactions. Increasing the number of variables and their associated assumptions also increases the potential for bias in the results. The large number of variables required and the accompanying increase in complexity can result in a model which is unwieldy; thus, it is often infeasible to do extensive sensitivity analysis on all the potentially relevant variables.

More abstract models do not attempt to represent real-world inputs and outputs and are less costly to develop. Extraction and aggregate modeling of selected variables render models that are easier to understand and manipulate. The goal of abstraction is not the production of a thin sketch of reality. In fact, much of the "knowledge" in a model is captured not in the detail, but in the aggregation and relationship of detail into a coherent picture.
In fact, much of the “knowledge” in a model is captured not in the detail, but in the aggregation and relationship of detail into a coherent picture. The credibility of these simpler models can be supported by direct or indirect links to more detailed models or historical data. There is, however, an increased possibility that an important factor or relationship may be omitted or lost in the aggregation process. Although the credibility of evaluation results from any model is tied to the credibility of the model itself, results derived from highly abstract models are more often subject to challenge than those produced with more “realistic” models. Ultimately, the level of detail chosen must be determined by the evaluation requirements. Effective models may be characterized in terms of several key features:
• Level of detail is adequate to support evaluation of principal factors of interest at the current stage of development;
• Representation scheme and mode are appropriate to the question at hand;
• Assumptions regarding the nature and relationship of the variables can be supported by valid sources (historical data, acknowledged experts, output from other validated models); and
• Model is understandable to the responsible analysts and the critical reviewers.
2.2. Representative Approaches to Command and Control Modeling
There are a variety of approaches to modeling command and control processes and systems. Crumley and Sherman’s [1990] survey of C2 models and theories identifies five broad classes of command and control models (Table 2.2). Implementation models include those modeling efforts presenting the structure, implementation, and working processes of command centers. The US Army’s field circulars on corps and division command and control are examples of this class of models. Organizational models are principally drawn from the theoretical perspectives of business organizations. Organizational-Process models focus on the command center function in terms of a management system. These models tend to be either those dealing with decisionmaking processes (e.g., the SHOR model [Wohl, 1981]) or those presenting taxonomies or function definitions (e.g., models featuring adaptive coping and measures of organizational competence [Olmstead et al., 1973; Olmstead et al., 1978]). Organizational-Evaluative models (e.g., HEAT [Defense Systems Inc., 1983]) come out of the modeling requirements for system evaluation.
Implementational
Organizational
  • Process (decision making; taxonomy or functions)
  • Evaluative
Behavioral System
Systems Oriented
  • Information transformation
  • Architecture
Network
Table 2.2: Classes of Command and Control Models (Crumley & Sherman, 1990)
Bearing some relationship to the organizational models, Behavioral System models concentrate on individual or team behavior as the critical factor in command and control. These models, such as that reported in Robins et al [1974], focus on providing a methodology for assessing and evaluating decision quality in command and control situations. Systems Oriented models are concerned with the description and evaluation of the specific facilities and technology used to support C2 functions. These models may be further classified as Information Transformation System models or
Architecture models. The former focuses on the communication, integration, and transformation of information in C2 systems. The latter focuses more on the location and functional assignments of equipment and operational personnel in command posts. The final category, Network models, covers models employing PERT charts, Petri Nets and other network representations to model time-dependent, concurrent processes. The various example models presented in Crumley and Sherman differ primarily in perspective. Each attempts to model command and control processes or systems in terms of those aspects their particular definitions have designated salient. Their models also represent the viewpoints characteristic of their specific disciplines (management science, behavioral science, communication network architecture, etc.). Several of the organizational models enjoy relatively wide acceptance across disciplines. The model presented in Land et al [1985] is typical of a class of conceptual representations of the basic functions in C2 decisionmaking (Figure 2.1). Orr [1983] presents a similar model with the addition of intelligence functions and indications of the level of command involved at various stages (Figure 2.2). One of the most widely cited C2 decision process models is Joseph Wohl’s SHOR Model [Wohl, 1981]. As indicated in Figure 2.3, the SHOR model relates the functional activities and the information processed for each of four major elements in the decision process: stimulus, hypothesis, option, and response. Theoretically, a well-defined predictive or descriptive model of system components and processes would permit the evaluation of the potential impacts of introducing a new component or modifying the processes. This is particularly true when the object of measurement is an input-process-output sequence. However, the additive bottom-up, or data-driven, models which focus on individual, independent input-process-output sequences break down when applied to highly interactive, interdependent systems [Christie & Gardiner, 1990]. Nevertheless, several of the aforementioned approaches provide promising starting points for modeling the role of HCI in the team decision processes involved in C2 systems.
[Figure 2.1 (not reproduced here) diagrams the conceptual C2 process as a cycle of Sense, Assess, Generate, Select, Plan, and Direct functions linking stimulus and response, the environment, own forces, and the desired state.]
Figure 2.1: Conceptual C2 Process Model (Land et al., 1986)
[Figure 2.2 (not reproduced here) diagrams combat operations as Sense, Process (intelligence analysis), Decide, and Act functions operating on the environment, with higher and lower levels of command engaged at different stages.]
Figure 2.2: Orr’s Conceptual Combat Operations Process Model (Orr, 1983)
[Figure 2.3 (not reproduced here) presents the SHOR model as four generic elements: Stimulus (data), Hypothesis (perception alternatives), Option (response alternatives), and Response (action). For each element the figure lists the required functions (gather/detect, filter/correlate, aggregate/display, store/recall; create, evaluate, select; plan, organize, execute; near-real-time modification/update) and the information processed, including the questions of the commander's catechism (Where am I? Where is the enemy? What is he doing? How can I thwart him? How can I do him in? Am I in balance? How long will it take me to ...? How will it look in ... hours? What is the most important thing to do right now? How can I get it done?) and, for the Response element, the who/what/when/where/how of an air tasking order.]
Figure 2.3: The SHOR Model of Tactical Decision Processes (Wohl, 1981)
2.3. Measures of Performance (MOPs) and Measures of Effectiveness (MOEs)
One of the principal objectives in the design or modification of C2 systems is to decrease the time required to complete the loop of activities between sensing and acting, while increasing the accuracy of assessments and the effectiveness of decisions. Time savings are principally evaluated in terms of measures of performance (MOPs) -- the speed and accuracy of both equipment and personnel. Improvements in accuracy and effectiveness are assessed, relative to those performance characteristics, as measures of effectiveness (MOEs) produced by the system under study.
As there is some tendency in the literature to use MOP and MOE interchangeably, some definition of these and related terms is useful. Brady et al [1986] define measures with respect to system boundaries (Table 2.3). The dimensional parameters of a given component or physical entity are those inherent properties or attributes of the component (e.g., monitor resolution). Measures of performance include those attributes of component, subsystem or system behavior which may be attributed to the aggregation of the various dimensional parameters of the included components. Measures of effectiveness represent the system’s functional performance in the context of the operational environment and mission requirements. Meister [1990] points out that measures of performance are inherent to the system or subsystem, while measures of effectiveness are determined by the mission requirements. Finally, measures of force effectiveness are aggregates of the MOEs for C2 systems and other force-level assets with respect to the larger mission of the force. The relationship of these measures and the analysis boundaries is indicated in Figure 2.4.
Dimensional Parameters
  Description: Properties or characteristics inherent in the physical entities and structure under question whose values determine system behavior even when at rest (size, weight, aperture size, capacity, number of pixels, luminosity)
  Analysis Boundary: Inside C2 System Boundary
Measures of Performance (MOPs)
  Description: Closely related to inherent parameters (physical and structural) but measure attributes of system behavior (gain, throughput, error rate, signal-to-noise ratio)
  Analysis Boundary: Inside C2 System Boundary
Measures of Effectiveness (MOEs)
  Description: Measure of how the C2 system performs its functions within an operational environment (probability of detection, reaction time, number of targets nominated, susceptibility to deception)
  Analysis Boundary: Outside C2 System Boundary
Measures of Force Effectiveness (MOFEs)
  Description: Measure of how well a C2 system and the force (sensors, weapons, C2 system, and structure) of which it is a part perform missions (contribute to battle outcome)
  Analysis Boundary: Outside Force Boundary
Table 2.3: C2 System Evaluation Measures (adapted from Brady et al., 1985 and Meister, 1990)
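To make the boundary distinctions concrete, the following Python sketch (purely illustrative, not part of the cited methodology) shows one way an analyst might tag individual measures with their level and analysis boundary so they can be grouped for reporting. The example measure names are drawn loosely from Table 2.3; the structure itself is a hypothetical bookkeeping aid.

```python
# Illustrative sketch only: tagging evaluation measures by level and analysis
# boundary so they can be grouped for reporting. Names are hypothetical.
from dataclasses import dataclass

@dataclass
class Measure:
    name: str
    level: str      # "DP", "MOP", "MOE", or "MOFE"
    boundary: str   # e.g., "inside C2 system", "outside C2 system", "outside force"

measures = [
    Measure("display resolution (pixels)", "DP", "inside C2 system"),
    Measure("track update throughput (tracks/s)", "MOP", "inside C2 system"),
    Measure("probability of detection", "MOE", "outside C2 system"),
    Measure("contribution to battle outcome", "MOFE", "outside force"),
]

# Group measure names by level for a simple summary report
by_level: dict[str, list[str]] = {}
for m in measures:
    by_level.setdefault(m.level, []).append(m.name)
print(by_level)
```

Such a tagging scheme only keeps the bookkeeping straight; the substantive work of choosing valid measures is discussed below.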
[Figure 2.4 (not reproduced here) depicts nested analysis boundaries: dimensional parameters (D) and measures of performance (P) lie inside the C2 system boundary, measures of effectiveness (E) lie outside the C2 boundary but within the force boundary, and measures of force effectiveness (F) lie outside the force boundary in the environment. Time may be a parameter of all measures.]
Figure 2.4: Relationship of Measures and System Boundaries (adapted from Brady et al., 1986)
The selection of appropriate measures is one of the most difficult aspects of evaluation design. This is especially true when evaluating C2 systems designed to support the cognitive tasks involved in decisionmaking. For example, accuracy data (number of errors committed, etc.) may not be as relevant as the more difficult-to-capture information regarding error recovery or error prevention. While measures must be practical (i.e., feasible within time and cost constraints) and the required data collectible, they must also be valid with respect to the performance or behavioral questions addressed. Table 2.4 presents a list of criteria which should be considered in the selection of measures. Meister [1985] summarizes selection criteria in a similar list, suggesting that few measures meet all these criteria. He further cautions against the tendency to select obvious measures without fully examining their relationship to system outputs.
Mission-oriented: Relates to force/system mission
Discriminatory: Identifies real difference between alternatives
Measurable: Can be computed or estimated
Valid: Reveals the performance of interest
Practical: Required data collection is feasible within time & cost constraints
Quantitative: Can be assigned numbers or ranked
Reliable: Consistent, accurate, repeatable
Realistic: Relates realistically to the C2 system and associated uncertainties
Objective: Can be defined or derived, independent of subjective opinion (it is recognized that some measures cannot be objectively defined)
Appropriate: Relates to acceptable standards and analysis objectives; acceptable to intended users
Sensitive: Reflects changes in system variables
Inclusive: Reflects those standards required by the analysis objectives
Independent: Mutually exclusive with respect to other measures
Simple: Easily understood by the user
Table 2.4: Characteristics of Good Performance & Effectiveness Measures [adapted from Miller et al., 1986 and Hennessey, 1990]
The highly complex, interdependent systems that characterize command and control cannot be understood or adequately modeled using simple measures of outcome. Taking the viewpoint that C2 systems exist to accomplish specific mission-related functions, Alberts [1980] presents a methodology for evaluating the overall performance of C2 systems in the context of scenarios. The functional activities are broken down into tasks, with measurement and evaluation occurring at three levels:
• Measures of System Performance - characteristics of equipment (speed, accuracy, etc.);
• Measures of Information Attributes - characteristics of information flow through the system (quality, timeliness, etc.); and
• Measures of Value-Added - assessment of the utility or value added to the decision through the system configuration under study.
The multicriteria evaluation method developed independently by Adelman and Donnell [1986] resembles Alberts' approach, but expands it substantially to include measures of user and organizational acceptance. In addition, their model addresses the match between the system created by the user, equipment, and organization and the target environment and mission requirements. The aggregation and analysis of multiple measures is addressed more fully in Section Three.
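As a minimal sketch of how multiple measures might be rolled up for comparison (a generic weighted-sum illustration, not the Adelman and Donnell procedure itself), the following Python fragment combines normalized scores on several hypothetical measures, using hypothetical weights, into a composite score for two candidate HCI designs.

```python
# Minimal sketch (assumed weights and scores, not from the report): a generic
# weighted-sum roll-up of normalized measures into a composite score for
# comparing candidate HCI designs.
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine normalized measure scores (0..1) using weights that sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-6, "weights should sum to 1"
    return sum(weights[m] * scores[m] for m in weights)

candidate_a = {"system performance": 0.80, "information quality": 0.65,
               "value added to decision": 0.55, "user acceptance": 0.70}
candidate_b = {"system performance": 0.70, "information quality": 0.75,
               "value added to decision": 0.70, "user acceptance": 0.60}
weights = {"system performance": 0.25, "information quality": 0.25,
           "value added to decision": 0.30, "user acceptance": 0.20}

for name, scores in [("A", candidate_a), ("B", candidate_b)]:
    print(f"Candidate {name}: {composite_score(scores, weights):.3f}")
```

In practice, the normalization of each measure and the weights themselves would be products of the evaluation design, and sensitivity analysis over the weights is usually warranted before acting on small differences in composite scores.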
2.4. Identifying and Describing Tasks for Command and Control
The selection of measures for evaluating the HCI components of C2 systems involving human decisionmakers necessarily requires the identification and description of tasks across the range of individual human users, teams, and computer-based support subsystems. In the development and evaluation of new system concepts, there is an additional requirement to examine alternative configurations in terms of how those tasks are allocated between humans and machines (or software).
As the term suggests, HCI evaluations are principally concerned with the tasks and functions specifically requiring human-computer interaction. However, C2 systems typically involve multiple computer-based support subsystems and teams of users interacting to perform the necessary functions to accomplish a mission. Thus, there is a concomitant requirement to examine the indirect contribution of the HCI design/configuration to team and organizational effectiveness. Figure 2.5 presents a conceptual model of the relationship between the various system levels, human participants and relevant measures. This diagram does not presume to offer either a descriptive or a prescriptive model by which data collected on HCI components could be extrapolated to make inferences about force effectiveness. Rather, the intent is to provide a notional framework to guide the selection of evaluation focus and the relevant measures to best examine the factors of interest at the various stages of design and development. These relationships will be discussed more fully in Section Four with respect to the application of analysis and evaluation techniques and methods.
[Figure 2.5 (not reproduced here) is a layered diagram relating HCI component DPs, MOPs, MOEs, and the HCI component contribution to individual performance; C2 subsystem DPs, MOPs, and MOEs to team performance; C2 system DPs and MOEs to organizational performance; and, ultimately, mission effectiveness to MOFEs. A legend defines DPs, MOPs, MOEs, and MOFEs as dimensional parameters, measures of performance, measures of effectiveness, and measures of force effectiveness, respectively.]
Figure 2.5: Factors and Measures Related to Mission Effectiveness
3. Analysis & Evaluation Methods for Complex Systems
The evaluation of HCI technology and its possible impacts on human performance usually is undertaken by researchers trained in human factors or industrial psychology. Their studies primarily examine basic cognitive functions (e.g., attention), task performance behaviors, and the training requirements imposed by new technologies and procedures. The data collection and analysis methods that characterize traditional human factors research have proven reasonably effective in evaluating the single operator performing well-defined, structured tasks. Findings from this research play a significant role in the evaluation and selection of candidate HCI technologies for development and incorporation into the design of new C2 systems. As previously discussed, C2 systems must support teams making critical decisions, often in crisis situations. The tasks performed require integrating highly complex, dynamic information and making inferences where data is missing or ambiguous. Evaluation of task performance in this context is equally complex. It is not possible to establish a direct relationship between the features of specific HCI technologies and team performance. The concept of system-level analysis implies that these relationships and their impacts are more accurately understood in terms of an integrated system of humans and machines working to accomplish tasks within an organizational and environmental context. Designing the evaluation of the HCI design in a system or subsystem of interest involves the following steps:
• constructing an adequate model of the interaction and tasks supported,
• identifying the critical factors affecting the interaction and the performance of the tasks,
• determining the appropriate metrics for evaluating alternative technologies or designs, and
• designing evaluation studies employing the identified measures.
This section reviews accepted analysis and evaluation methods in light of these steps to determine their potential contribution to the evaluation of HCI designs in complex systems. Following a survey of basic data collection and organization techniques, the section continues with a discussion of how this toolbox of methods is used in task analysis, process study, and workload assessment.
3.1. General Approaches to Data Collection & Organization
The cognitive and behavioral sciences provide the principal methods for collecting and organizing data related to human performance. However, in addition to the generally recognized disciplines within psychology, the techniques also draw from the practice of social research in disciplines such as anthropology. Similarly, study of the dynamics of team performance benefits from organizational theory models and methods developed for analyzing group processes. Linguistics research provides some of the essential models for the study of man-machine dialog, including the various advanced technological approaches to facilitating natural language dialog. Several key data collection and modeling techniques cross all of these disciplines. These techniques may be categorized as follows:
• objective methods
• subjective assessment methods
• simulation methods
• experimental methods
In addition, prototyping and walk-through techniques employed to support system design and development provide an excellent means for capturing a variety of human performance data on new HCI technologies. This section continues with a brief examination of these methods and an assessment of the advantages and disadvantages associated with each. The material presented draws heavily upon several basic resources on the analysis and evaluation of human performance, most notably texts by Adelman [1992], Booher [1990], Cook and Campbell [1979], Keppel and Saufley [1980], Meister [1991, 1985], and Wilson and Corlett [1990]. The reader is directed to those writings for further elaboration on the use of the individual methods.
3.1.1. OBJECTIVE METHODS
The first step in analyzing performance involves determining what is happening -- what activities, events, states, etc., comprise a given task or process. Depending upon the tasks involved and the accessibility of relevant data, there are two
primary means of collecting objective data: archival records and direct observation. These methods supply additional insights on tasks and processes based upon quantitative data about the various task components, such as
• how many (counts),
• how often (frequency or pacing),
• how much (volume),
• how long (duration), and
• in what order (sequencing).
The degree of objectivity in these measures is influenced both by the means of collection (manual vs. automatic) and the interpretation given to the data during analysis. Automated methods are generally more reliable and less subject to recording biases than manual collection methods. However, while automated data collection often provides more data points at less cost, the easily collected data may not measure the most relevant indicators for the issue of interest. Both collection methods require making some judgments during analysis and interpretation that may introduce bias. Thus, the “objectivity” of the methods is at best relative to other collection techniques.
Archival Data Analysis
Archival data encompass the range of “hard copy” information available to the analyst for understanding the nature of tasks and procedures, the operational standards, and past performance. Manuals and other documentation resources present an obvious starting point for understanding the tasks, procedures, and other fundamental aspects of the work supported. While these sources include such rudimentary information as machine operation procedures, manuals and technical documents often exist that present the doctrinal aspects of tasks and system requirements. These are important sources for the domain knowledge critical for establishing appropriate MOEs and MOPs. Archived performance data supply information over a longer period (months or years) to complement data collected during a study. Manually or automatically recorded logs are commonly available for many process control applications. To a lesser extent, performance data exist in other domains in the form of records on resource use, service demand, etc. Data collected over an extended time period may bring to light trends or cyclical patterns in the tasks or task environment that would not be captured in short-term studies. Conversely, extending the period of the study may break down control. Even small changes in the environment, system or users introduce additional factors that may mediate the observed effects. The large number of data points available in archived performance data supports establishing statistically significant results. Furthermore, as they are drawn directly from the operational environment, the recorded data are generally accorded a high degree of face validity. Nevertheless, the available records may be of little value in evaluating the issue of interest for a number of reasons. The first and most obvious problem is that the data collected may not represent a relevant metric for the focus of the study. In addition, the archives often retain only aggregated information rather than the original raw data, making it difficult to extract the specific items of interest. Finally, in addition to the potential for accidental error (e.g., improper calibrations or simple recording mistakes), manually recorded data may be deliberately inaccurate (e.g., biased to meet some internal or external standard). Despite time and cost savings, data not collected directly by the analyst requires careful review to ensure both its relevance and accuracy. Table 3.1 summarizes the tradeoffs involved in selecting and using archival data.
Performance Data
  Advantages: • Longer time period useful in identifying trends • More data points with less effort and cost • Given certain constraints, can provide highly reliable, face-valid data • May provide benchmarks for performance evaluation
  Disadvantages: • Available data may not match information requirements • Passage of time increases number of factors potentially mediating effect • “Objectivity” of manually recorded data may be suspect based upon the original reasons for its collection
Manuals & Documentation
  Advantages: • Useful for identifying standards and developing checklists for evaluation • Present the doctrinal context for procedures • May provide insights into the “evolution” of procedures
  Disadvantages: • May not accurately reflect the way things are actually done
Table 3.1: Archival Data Analysis Methods
In general, archived materials are most useful in the early modeling of tasks and task environments. Patterns of past performance and published standards help to identify or corroborate performance benchmarks which may be useful as MOPs. Ultimately, information regarding human-human and human-machine interactions is limited to standard operating procedures (SOPs) rather than providing an accurate reflection of the “real world” dynamics.
Observation
Observation is a natural “first approach” to the investigation of phenomena, particularly where human performance is involved. Observational data provide the descriptive foundations for most studies of human-machine system performance. Observational data collection methods range from objective measures in the form of time-tagged events or system states to more subjective evaluations of observed processes. Whether the data collected is quantitative or qualitative in nature, it must be observable. Thus, observational techniques are restricted to the collection and analysis of external events, processes, and system states. Drury [1990b] describes eight objective techniques for collecting and organizing observational data. Each method involves the recording of events or system states to trace a process or characterize a task or series of tasks in terms of time and resource requirements. The methods vary in terms of the raw data collected and information which may be derived from them. Table 3.2 summarizes the methods described by Drury in terms of the data collected and the processes and tasks which may be described.
Raw Event/Time Records & Time Studies
Event/time records present a time-stamped sequence of events tracing a single process, task, or occurrence. Collection may be automatic (as in flight recorders) or manual. In addition, composites may be constructed from one or more inputs, provided all are reporting on some aspect of the same event. Removing the time element provides raw data for event frequency counts. Finally, event/time recording supports portions of other, more complex observational methods. The data from event/time records may be abstracted using statistical methods to describe task performance in terms of mean times for response, task completion, transition between tasks, etc. These metrics have been used primarily in the analysis of repetitive tasks. Performance data on most of the common HCI functions (selection, positioning pointers, etc.) exist in archival data banks or in standards documents. As such they provide baseline parameters to guide task allocation in complex, time-constrained environments and benchmark targets in prototype evaluation.
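The kind of abstraction described above can be illustrated with a short Python sketch that reduces a time-stamped event log to simple task statistics (count, mean completion time, rate). The log contents and event names are invented for illustration; an operational recorder would produce far richer data.

```python
# Hedged illustration: deriving simple task-performance statistics from a
# time-stamped event log. Event names and timestamps are invented.
from statistics import mean

# (timestamp_seconds, event) pairs; "task_start"/"task_end" bracket one task instance
log = [(0.0, "task_start"), (4.2, "task_end"),
       (10.0, "task_start"), (13.1, "task_end"),
       (20.5, "task_start"), (26.0, "task_end")]

durations = []
start = None
for t, event in log:
    if event == "task_start":
        start = t
    elif event == "task_end" and start is not None:
        durations.append(t - start)
        start = None

print("count:", len(durations))
print("mean completion time (s):", round(mean(durations), 2))
print("rate (tasks/min):", round(60 * len(durations) / (log[-1][0] - log[0][0]), 2))
```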
Process Charts & Flow Process Charts
Process charts present task sequences in spatial terms and are particularly useful in studying activities which involve multiple operators at distributed locations. The related technique of flow process charting replaces the spatial aspects of the task sequence with general process notations which resemble state-transition diagramming. In the context of team decisionmaking activities, process charts and flow process charts can be used to diagram the flow of information in a hierarchical, time-dependent decision process. This can be particularly useful where control in the decision process is passed, often using manual methods, along a hierarchy of physically separated team members. The network characteristics of flow process charts may also be used to highlight the interdependencies of information and decisions. The state-transition and information flow aspects of flow process charts resemble data flow diagramming and related techniques which are widely used to represent computer program logic. In general, the processes associated with human-computer interaction may be more accurately represented by one of the several techniques that were specifically developed for that purpose. However, these methods rarely provide a means of representing time and volume data on tasks. There may be some utility in adapting the notation of time and flow volume from flow process charting to modified HCI data flow diagrams to develop task performance baselines.
Gantt Charts & Multiple Activity Charts
Gantt charts and multiple activity charts provide the means for graphically representing the time, sequence and duration of multiple activities. Both methods permit diagramming of sequential and concurrent tasks. Whereas Gantt charts allocate only one task per line, multiple activity charts group several tasks on a single line that may represent one operator, machine, or subsystem. For example, a multiple activity chart could be used to represent the various human and machine activities in the team decision problem by placing all the tasks of each team member on a single line with an associated line for the machine interactions involved in that team member’s task. This would be particularly useful for indicating contention for certain resources where team members were simultaneously requiring access to the same workstation or contending for the same communications link, etc.
FROM-TO Charts & Link Charts
Where the activities of interest require tracing multiple tasks along multiple paths, task information may be represented using the matrix notation of FROM-TO charts. Headings for the rows and columns of the chart are identical (much like the from-to mileage charts on maps indicating distances between cities). The individual cells created by the intersection of rows and columns may be used to record flow volume or a weighted total for the transaction. The completed matrix can be converted to a link chart by representing the row and column headings as nodes and the cells as links. Link charts and related network modeling techniques, such as Petri nets, have been used on several occasions to model command and control activities. Chapanis [1959] used link charts to analyze human-human and human-machine interactions involved in C2 decisionmaking for the purpose of improving performance by increasing communication throughput.
FROM-TO charts and link charts could provide a useful representation of system-level interactions to the extent that the activities of interest were observable for manual or machine collection. Similarly, link charts have been used to model and revise spatial control layout in ergonomic studies. However, most of the relevant information from such studies has been successfully distilled into human factors principles. The more difficult task of modeling the cognitive aspects of control layout is currently being addressed in the ecological interface design research of Rasmussen & Vicente [1989].
Occurrence Sampling
Unlike the continuous process tracing methods previously described, occurrence sampling involves data collection only at predetermined intervals. This technique requires a careful design to determine the appropriate frequency of sampling and to develop a sufficiently robust classification schema to accurately and efficiently capture the information of interest. Occurrence sampling is useful only where system states are discrete and unambiguously observable. This requirement precludes its utility for collecting information on system performance where the tasks are primarily cognitive.
Collecting Observational Data
Many of the objective measures involved in the methods described may be collected in a non-intrusive fashion through the use of automated event/time recording or time/state recording. This is often the only accurate method of fully capturing complex user-machine interaction involving keystrokes and gestural input. Raw data can be used to establish
basic task or process characteristics, such as time required for completion, frequency, percentage of resources required, sequences and temporal dependencies. This data may be further examined to determine system throughput and workload. Observational data can also be collected manually, either as it occurs or through post hoc analysis of videotaped sessions. Human observers are necessary when the processes or tasks of interest cannot be characterized solely in terms of keystrokes or similar machine inputs. Contemporaneous collection requires well-trained observers and is best suited to situations where the tasks or processes of interest are relatively simple and readily observable. Furthermore, manually recorded events must occur at a rate that permits recording by a human observer. Studies involving complex processes with multiple participants may require more than one observer and videotaping to provide a complete record. Even the most “objective” observational techniques are subject to error from a number of possible sources. Meister [1985] lists eight potential means by which error, or bias, may be introduced into a study:
• over-simplification resulting in loss of detail
• observer preference for simpler, more familiar reporting categories
• focus on the beginning and end of behaviors, while ignoring intermediate behaviors
• observer conjecture regarding operator thoughts or motives
• influence of prior events/behaviors on the observation of the event/behavior of interest
• distortion of observation due to situational context
• stereotyping and prejudice
• halo effects
In addition to these error sources during data collection, the validity and reliability of objective observation are affected during design by the selection of measures used and during analysis by the analysis methods and interpretations afforded the data collected. Table 3.3 summarizes the advantages and disadvantages of objective observation methods.
Objective Observation Methods
  Advantages: • Provides objective, quantitative data for evaluation of observable tasks and processes • Well-defined study can usually be completed faster and less expensively than other methods • Some forms of data collection may be automated
  Disadvantages: • “Objective” measures still require human judgment (measure selection, data summarization, data analysis, results interpretation) • Data must be from simple, low-level, “observable” phenomena • Difficulty analyzing & interpreting large amounts of low-level data from numerous sources • Measures may not capture relevant aspects of performance, lowering reliability and validity • Not always feasible to collect relevant data in realistic settings, which reduces validity
Table 3.3: Objective Observation Methods Summary
3.1.2. SUBJECTIVE ASSESSMENT METHODS
Subjective Observation Methods
Objective observation methods provide information about what is happening with task or event counts, frequencies, duration, sequencing, etc., but do not capture qualitative values for individual tasks (e.g., how well a task is performed) or the relative value of particular system design features to overall task performance. Furthermore, objective measures afford little insight into the cognitive processes involved in task performance. To capture these cognitive activities, Meister [1986] augments objective observation methods with two categories of subjective observation techniques: evaluative and diagnostic. Evaluative methods essentially annotate the objective description of events with subjective assessments of the quality of performance, the consequences of actions taken, or notations on human errors which cannot be captured using automated recording. Diagnostic studies further extend the analysis by noting the reasons (causes) for a recorded event. In a related work, Meister [1985] presents a taxonomy of quantitative and qualitative observational methods in hierarchical form (Figure 3.1). Meister’s taxonomy categorizes observation with respect to the focus: self, others, things and events. Although each category is supported by both quantitative and qualitative methods, as the focus moves from self and others to things and events, the techniques involved include objective as well as subjective measures. Selection of subjective observation methods for an evaluation of task performance involving human-machine systems necessitates balancing tradeoffs among several features:
• content - performance, behaviors, attitudes, traits, etc.
• immediacy - real-time vs. retrospective, etc.
• detail - discrete behaviors vs. attitudes inferred from an aggregate of behaviors
• time interval - short sampling cycles vs. longer intervals
• intrusiveness - transparent data collection vs. interaction with subjects
The decisions regarding these factors directly affect reporting accuracy and the complexity of the analysis required, with subsequent effects in terms of the reliability and validity of the findings. For example, short sampling cycles are appropriate for recording simple, discrete actions, but may not capture the relationships among triggering events, responsive actions, and consequences. Longer sampling cycles may result in an unmanageable volume of complex information, making it difficult to separate the most essential factors. In evaluation study design, the selection of observation techniques must take into account the ramifications of these characteristic dimensions.
Ranking Methods
Ranking methods entail ordered comparisons intended to determine the relative importance of an individual feature, activity or event with respect to a set of associated items. Sorting exercises may precede actual ranking to construct hierarchies or taxonomic structures that are often as useful as the ranking of individual items. The principal benefits afforded by ranking methods are realized in time and cost savings. The methods, summarized in Table 3.4, are well-established with ample resources in the literature to guide design and analysis of reliable, valid studies.
[Figure 3.1 (not reproduced here) organizes observational dimensions and methods into a hierarchy: observation of self (self-report, with qualitative observation via interviews, questionnaires, diaries, etc., and quantitative observation via self-ratings); observation of others (qualitative description; quantitative external indices such as direct observation and objective performance measures, and internal indices such as ratings and critical incidents); and observation of things and events (qualitative description and checklists; quantitative scaling methods, frequency distributions, and direct observation).]
Figure 3.1: Hierarchy of Observational Dimensions and Methods (adapted from Meister, 1985)
Although providing some insights into preference for and priority of the ranked items, simple rankings record neither the relative value or weight of the individual items nor the rationale for the assignment of rank order. This information generally must be captured in post hoc interviews that, in turn, reduce some of the time and cost savings. Rankings also suffer from a tendency to be sensitive only at the extremes (i.e., “most vs. least,” “best vs. worst,” etc.). The extreme points tend to function as “anchors,” where the mid-range represents the area of weaker opinion or preference. Thus, the rankings of items between the extremes, accounting for as much as 80% of the total number, are often more variable across subjects and less reliable than those at the extreme points.
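The anchoring effect noted above can be checked directly in collected ranking data. The following Python sketch (with invented items and rankings) computes the mean and spread of the rank each item receives across subjects; items at the extremes typically show little spread, while mid-range items vary. A formal concordance statistic such as Kendall's W would normally be used for reporting, but the simple spread calculation conveys the idea.

```python
# Illustrative sketch: examining where subjects agree in a ranking exercise by
# computing the spread of ranks assigned to each item. Data are hypothetical.
from statistics import mean, pstdev

# rank 1 = most preferred; one ordered list per subject
rankings = {
    "subject_1": ["map overlay", "alert panel", "msg log", "status board", "timeline"],
    "subject_2": ["map overlay", "msg log", "alert panel", "status board", "timeline"],
    "subject_3": ["map overlay", "status board", "msg log", "alert panel", "timeline"],
}

items = rankings["subject_1"]
for item in items:
    ranks = [order.index(item) + 1 for order in rankings.values()]
    print(f"{item:12s} mean rank={mean(ranks):.1f}  spread={pstdev(ranks):.2f}")
```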
Ranking Methods (General Characteristics)
  Advantages: • Differentiates individual item with respect to a set of related items • Usually possible to administer to groups • Useful for collecting user preference data • Generally easy to create and use • Valid & reliable
  Disadvantages: • No determination of intrinsic value of ranked item • May not include information which explains the basis for the ranking
Simple Ranking Methods
  Advantages: • Relatively fast to administer and analyze • Inexpensive
  Disadvantages: • Generally only sensitive at the extremes (i.e., best & worst) • Items can only be ranked with respect to shared attributes
Repertory Grids
  Advantages: • Allow for differentiation across multiple dimensions • Iterative approach can provide very robust information • Used successfully for knowledge elicitation
  Disadvantages: • Time consuming to administer and analyze • Not suitable for groups • Requires care to avoid collecting too much trivia
Table 3.4: Subjective Assessment: Ranking Methods
Repertory grids, a variation on simple ranking, provide a structure for capturing the subtleties in the ranking of individual items. The subject compares items in groups of three to categorize them such that two items are similar in some aspect which differentiates them from the third item, and then explains the nature of that difference. Unlike simple ranking methods, the complexity of repertory grids requires a one-to-one interaction between the subject and the researcher, making them unsuitable for group settings. Repertory grids have been used in recent years to facilitate knowledge elicitation and domain modeling for knowledge-based systems. While the method makes valid claims for completeness, conducting an exhaustive battery across a number of subjects is both costly and time intensive. Furthermore, without careful planning the resulting avalanche of data may have limited relevance to the larger study. For these reasons, repertory grids seem more useful as a means of knowledge refinement for a well-focused subset of issues rather than an initial data gathering mechanism.
Rating Scale Methods
One of the most popular methods of capturing subjective data, rating scale methods are used extensively in both self-administered instruments and as collection mechanisms in the observation of phenomena. The items evaluated can include features, concepts, activities, or events. The rater assigns values for items using quantitative or qualitative scales. Although rank order is not a direct product of this method, with some rating instruments it is possible in subsequent analysis to develop rankings based upon the data collected. Rating scales differ from ranking methods in the nature of the information provided. Where ranking methods capture the relative value for a specific attribute of one item in comparison with similar items, rankings do not assign specific values for the attribute of interest. In contrast, rating scales provide either specific or “fuzzy” values for attributes, but do not compare the relative values for an attribute across similar items. Neither method captures the reason for the assignment of rank or rating. As indicated in Table 3.5, the various rating methods vary with respect to the tradeoffs involved in ease of use, time requirements, and internal checks supported. An extensive literature exists outlining methods for establishing the reliability of the results internally and across raters. Rating scales have a variety of uses and may be combined with other techniques in questionnaires to constrain responses and simplify analysis. They may be used with self-administered evaluations, although careful design is required
to prevent bias due to the way in which the ratings are elicited. Rating scales may also be used to record qualitative or quantitative data in observational settings. Here again, the rating scales provide a recognitional structure for reporting observed behaviors. This feature is useful for transforming the observations to readily analyzed data sets, but is inflexible with respect to unanticipated events or behaviors. Real-time data collection using rating scales must be limited to the observation of a limited number of carefully selected behaviors to prevent falling behind. For obvious reasons, real-time rating is infeasible in highly dynamic environments. Retrospective assignment of ratings may reduce accuracy due to hindsight bias or an attempt to rate a factor that was not specifically observed. These errors are more common when the raters are either the subjects of the evaluation or less experienced observers.
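Because Table 3.5 mentions Likert's summated ratings and the need to establish reliability internally and across raters, a brief Python sketch may help fix the idea: item ratings are summed to a scale score for each respondent, and internal consistency is estimated with Cronbach's alpha. The ratings shown are invented, and the alpha computation is the standard textbook formula rather than anything specific to this report.

```python
# Minimal sketch of Likert-style summated ratings with Cronbach's alpha as an
# internal-consistency estimate. All ratings are invented for illustration.
from statistics import pvariance

# rows = respondents, columns = items rated 1 (strongly disagree) .. 5 (strongly agree)
ratings = [
    [4, 5, 4, 4],
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [2, 3, 2, 3],
]

totals = [sum(r) for r in ratings]                 # summated score per respondent
k = len(ratings[0])                                # number of items
item_vars = [pvariance([r[i] for r in ratings]) for i in range(k)]
alpha = (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))

print("summated scores:", totals)
print("Cronbach's alpha:", round(alpha, 3))
```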
Rating Methods (General Characteristics)
  Advantages: • Scores assign a value to the item with respect to an attribute • Usually possible to administer to groups • Useful for collecting user preference data • Generally easy to create and use • Valid & reliable • Relatively fast to administer and analyze
  Disadvantages: • No information captured with regard to how one item differs from another with respect to the attribute of interest
Simple Rating Scales
  Advantages: • Inexpensive
  Disadvantages: • Some problems with bias due to extreme responses or adjustments to suit predetermined preferences • No internal checks for ambiguities associated with rating multidimensional concepts on unidimensional scales
Paired Comparisons
  Advantages: • Comparisons are easy for respondents • Increased data points allow for internal cross checks
  Disadvantages: • Limited by time requirements • Not suitable for gross differences or comparison of non-commensurate attributes
Thurston’s Equal-Appearing Interval Technique
  Advantages: • Faster than Paired Comparisons with same validity and reliability • Easy to analyze
  Disadvantages: • Limited internal checks
Likert’s Summated Ratings Method
  Advantages: • Easy for respondents to complete • Scales are easier to generate than Thurston’s technique; same validity and reliability
  Disadvantages: • Highly dependent upon wording of choices • Some problems with bias due to extreme responses or adjustments to suit predetermined preferences
Semantic Differential Technique
  Advantages: • Similar to Likert’s Method in ease of response • Factor analysis provides explanation of underlying dimensions of response
  Disadvantages: • More time required for full analysis • Results dependent on robustness of the design
Table 3.5: Subjective Assessment: Rating Scale Methods
Checklists
As with ranking and rating methods, checklists provide a relatively simple, discrete method of recording certain aspects of task performance. The method makes use of recognition and can be generated fairly rapidly from SOPs or other documentary sources. However, the data recorded is limited to either the presence or absence of a behavior, trait, event, etc., without preserving the mediating factors. Unless the semantics of the tasks and behaviors are widely agreed upon in advance, there can be additional difficulties due to uncertainty about the wording of the list. Simple checklists can be used fairly confidently in real-time settings; more detailed checklists are better suited to retrospective application or post hoc videotape analysis. Table 3.6 presents the relative tradeoffs associated with checklist methods.
Checklists
  Advantages: • Relatively easy to generate and use • May be used with groups • Validity due to high inter-rater agreement • Provides reminders to the observers
  Disadvantages: • Problems with completeness and ambiguity in lists • Problems with conflicts between accepted standards and research-based guidelines • Records only superficial attributes of performance
Table 3.6: Subjective Assessment: Checklists
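The inter-rater agreement cited as a strength of checklists is commonly quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following Python sketch applies the standard kappa formula to two hypothetical observers' present/absent checklist records; the data are invented for illustration.

```python
# Hedged sketch: inter-rater agreement on a present/absent checklist using
# Cohen's kappa. The two observers' records are hypothetical.
def cohens_kappa(rater_a, rater_b):
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement)"""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# 1 = behavior observed (checked), 0 = not observed
observer_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
observer_2 = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]
print("kappa:", round(cohens_kappa(observer_1, observer_2), 2))
```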
Questionnaire & Interview Methods
With questionnaires and interviews, the focus of the data collection shifts from what happens and the various details associated with how it happens to an attempt at understanding why it happens (Table 3.7). Time and expense constrain the number of interviews possible, particularly when subjects are not readily available. Questionnaires, in contrast, may be administered indirectly over time to a large number of people at distributed locations. The two most critical factors in the successful use of questionnaires are the design of the questionnaire and diligence in following up on responses. A well-designed questionnaire not only incorporates carefully worded and ordered questions to avoid biasing responses, but also builds in conceptual redundancy and cross-checks to ensure accuracy.
Questionnaires
  Advantages: • May include rankings and ratings • Potentially produces richer data set than rating & rankings alone • Variety of methods for clarifying responses through cross checks • Can be administered to groups
  Disadvantages: • Problems with ambiguity in responses • Follow-up required to assure completeness of sample • More difficult to code data • Considerable time and skill required to collect data and analyze results • More care required in design to assure validity and reliability of results
Interviews (General Characteristics)
  Advantages: • May include rankings and ratings • Richness of response similar to questionnaire techniques • Process allows for clarification of questions and responses
  Disadvantages: • Care required to maintain structure and control of interview • Cannot be used with groups • More difficult to code data • Considerable time and skill required to collect data and analyze results • More care required in design to assure validity and reliability of results
Focus Groups & Experts
  Advantages: • Provide a means of getting up to speed rapidly on the procedures, problems, and environment • Most useful in early phases of design • Can be structured, yet allow a fairly open-ended discussion
  Disadvantages: • May not result in data which is codable in the traditional sense • Useful results require considerable preparation time and skill on the part of the moderator and analyst • Care required to negotiate the internal politics of group members and stakeholders • May provide only limited information about ‘normal’ operations
‘Critical Incident’ Technique
  Advantages: • Used with small groups • Provides information about rare events not captured in other methods • Richness of response similar to other interview techniques • Process allows for clarification of questions and responses • Used successfully for knowledge elicitation
  Disadvantages: • Similar collection, coding, and analysis problems to other interview techniques • Requires highly skilled interviewer
Table 3.7: Subjective Assessment: Questionnaire & Interview Methods
Interviewing individuals or groups provides a variety of rich data on the human actors, the tasks, the organizational doctrine, and the environmental factors which characterize the problem domain. For example, interviews with experts, individually or as part of focus groups, often accelerate the modeling of complex domains. With careful planning, focus groups can direct attention toward the pivotal issues, aiding in the identification of needs and the most appropriate measures for assessing the relative merits of alternative HCI design options.
Somewhat more specialized, ‘critical incident’ interviewing assists in refining critical or crisis decision process models.2 This approach traces the steps taken (including cognitive processes) in making crucial decisions or responding to a crisis situation. For this reason, information drawn from these interviews is of particular interest in the design of C2 systems where the primary intent is to support the successful management of critical missions or crisis situations. In many respects the interviewing process employed is similar to the verbal protocol and verbal probe techniques discussed below, with the exception that the focus is retrospective. The information gathered in interviews is significantly more difficult to code and analyze than the data from more constrained methods. In addition, the interactive dynamics of interviewing involve considerably more training and skill than most observation tasks, particularly where focus groups or ‘critical incident’ interviewing techniques are involved. Conducting interviews individually and in groups requires experience to keep the interview focused and to detect important paths for further discussion.
Verbal Protocol Analysis & Verbal Probe Techniques
Verbal protocol analysis and verbal probe methods are real-time variations of the retrospective interview techniques. Both techniques require the subject to “think aloud” while performing a task. Verbal protocols are distinguished from verbal probe methods by the role of the interviewer/analyst. In the former, the interviewer’s role is as a passive observer; in the latter, the interviewer may interact with the subject during the performance of the task to clarify points. Although the probe technique is more intrusive, the interviewer can structure the session to facilitate expression and eliminate rambling. Neither method is suitable for exploring decision situations in which time pressure is a key factor. Table 3.8 presents the principal advantages and disadvantages associated with real-time verbalization methods.
2 The distinction between critical decisions and crisis decisions is primarily one of time and degree. Critical decisions are those which represent crucial pivot points in a given situation -- which may or may not be at a crisis stage. Crisis decisions, then, are those decisions made in response to a crisis situation.
Verbal Protocol Analysis
  Advantages: • Provides a rich source of information regarding the cognitive processes involved in a decisionmaking task • May be used with 2- & 3-person teams • Collecting process information concurrently with task performance may avoid hindsight bias and other post hoc analysis errors • Can be used to capture some sequential information about tasks
  Disadvantages: • Extremely time-consuming and difficult to code and analyze • Coding and analysis are highly subjective • Verbalizations vary with subjects, subject perceptions of analyst, organizational context • Some question as to the completeness and accuracy of what subjects verbalize vs. actual cognitive processes; data collected may be limited to only what happened during the session • Verbalization of process is highly intrusive & may interact with task • Cannot be used concurrently with time-stressed tasks or where completion time is a dependent measure
Verbal Probe Techniques
  Advantages: • Allows interaction between the subject and interviewer to clarify points as they arise • Useful approach to storyboard and prototype review • Can provide a more structured, directed interview-like session to generate more robust data
  Disadvantages: • Similar problems and constraints in collection, coding and analysis as for verbal protocols • Additional problems maintaining focus on tasks at hand
Table 3.8: Subjective Methods: Verbal Protocol & Probe Techniques
One of the rationales put forward for real-time verbalization is that it helps preclude the hindsight biases often induced when subjects are overly introspective about their reasoning processes. On the other hand, most subjects, regardless of their expertise in a domain, find it difficult to think aloud while engaged in the higher level cognitive tasks required for making complex decisions. In these cases, verbalization methods are more feasible where there are props (e.g., interactive prototypes) or other subjects to facilitate the interaction. For example, the methods are potentially effective in exploring team decisionmaking activities, where communication is a natural part of the process. It is important to recognize, however, the difference between the necessary communication for team interaction and full verbalization of the cognitive processes of all members of the team. Easily the most critical disadvantages of verbalization methods are difficulties with coding and analysis. The time and associated cost required to code even relatively simple sessions accurately make large sample sizes infeasible. Yet, in conducting fewer sessions, the researcher runs a greater risk of missing potentially critical features of the task because they were not manifested in the particular sessions recorded. Subjects often vary widely in their competence in both task and verbalization skills. This variability translates to highly variable data, particularly with a small sample size. Finally, the coding and analysis of verbalization data is extremely subjective and lacks easily implemented internal cross-checks.
3.1.3. SIMULATION
In many operational settings, data collection using intrusive methods is completely infeasible. This is especially true where the decision situation involves a team operating in a crisis environment. Simulation covers a variety of techniques where data is collected by manipulating a model (Table 3.9). Simulation may be used to collect data on task performance using current versions of fielded systems with archival data. These studies provide insights into potential enhancements required to existing systems. Finally, some form of interactive simulation is implied in the testing and evaluation of any
prototype system -- from the simplest storyboard to the most sophisticated beta version of a new system (see Section 3.1.5).

Statistical simulation is used primarily to create a descriptive model of tasks and behaviors. The inputs generally come from objective sources such as archival data or direct observation. For example, statistical simulation can be used to convert raw time/event data into descriptive task profiles. These descriptive outputs are often inputs into simulations using predictive models. For example, the descriptive data on a given task can be run against the performance parameters of a variety of HCI design alternatives to determine potential advantages or disadvantages.

The descriptive and prescriptive models developed in the area of decisionmaking usually involve game theory or a network model of beliefs. None truly model the idiosyncratic factors inherent in human judgments or the interaction of those factors with HCI design features. These aspects are better addressed using interactive simulations. Here, the simulation techniques available range from simple, PC-based programs for an individual user/operator to full-scale simulation environments for teams, such as command center or aircraft simulations. The primary tradeoff is between the cost to develop the simulation and its fidelity to the target environment. Although there are many packaged simulations available for a range of command and control scenarios, they can rarely be used reliably without modification.
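As a purely illustrative sketch of how a descriptive task profile might feed a predictive simulation, the fragment below draws task-step durations from empirically derived ranges and adds a per-step interaction overhead assumed for two hypothetical HCI design alternatives. The step names, ranges, and overhead values are invented for illustration.

```python
import random

# Hypothetical descriptive profile: (task step, min seconds, max seconds),
# as might be derived from archival time/event data or direct observation.
task_profile = [
    ("retrieve track data", 4.0, 9.0),
    ("correlate reports",   6.0, 15.0),
    ("update display",      2.0, 5.0),
    ("issue advisory",      3.0, 8.0),
]

# Assumed per-step interaction overhead (seconds) for two candidate HCI designs.
hci_designs = {"menu-based": 2.5, "command-language": 1.0}

def simulate_completion_time(profile, overhead, runs=10000, seed=1):
    """Monte Carlo estimate of total task completion time for one design."""
    rng = random.Random(seed)
    totals = sorted(
        sum(rng.uniform(lo, hi) + overhead for _, lo, hi in profile)
        for _ in range(runs)
    )
    mean = sum(totals) / runs
    p90 = totals[int(0.9 * runs)]
    return mean, p90

for design, overhead in hci_designs.items():
    mean, p90 = simulate_completion_time(task_profile, overhead)
    print(f"{design:17s} mean = {mean:5.1f} s   90th percentile = {p90:5.1f} s")
```

A sketch of this kind can rank design alternatives on throughput, but, as the surrounding discussion notes, it cannot capture the judgmental and idiosyncratic factors that only interactive simulation exposes.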
Method: Statistical Methods
Advantages:
• Useful for generating a descriptive profile of human performance ranges, tasks and inputs to use in predictive models
Disadvantages:
• Increasing precision generally means increasing cost of developing and testing the simulation

Method: Predictive Models
Advantages:
• Useful for evaluating written specifications
• May identify bottlenecks at the task and sub-task level
• May be used to evaluate consistency of the interaction design
Disadvantages:
• Simulation of tasks involving human judgment requires more time- and resource-intensive interactive simulation
• Depends upon the accuracy of estimates
• Increased precision requires more costly empirical data

Method: Interactive Simulations
Advantages:
• Can be used to collect data about human performance in existing system environment
• Detailed simulations available from training applications
Disadvantages:
• High costs associated with fidelity usually prohibit detailed simulation of complex systems
• Existing simulations may not accurately model areas of interest

Table 3.9: Simulation Techniques
3.1.4. PROTOTYPING TECHNIQUES
With the growth of interactive computing and its application in support of complex decisionmaking, prototyping has become an important tool in capturing and analyzing user requirements. Additionally, in iterative design and development processes, prototypes aid in verifying and validating the working design against the requirements. Thus, prototypes vary widely in operational detail and functional capability, usually based upon which phase of the system development life cycle (SDLC) is involved. For example, early in the development a prototype may be no more than a set of sample screens sketched on paper or a cardboard mock-up of a control panel. More commonly, the term “prototype” is applied to early functioning versions of software and hardware. Departing from the design phase orientation of the classic SDLC model, Gardiner and Christie (1990) examine the role of prototypes in addressing questions on four design levels: conceptual (the system concept), semantic (the interaction concept), syntactic (the interaction form), and lexical (the interaction detail). Table 3.10 presents the proposed evaluation focus, prototyping support, and evaluation techniques appropriate for each level.
Design Level: Conceptual
Evaluation Focus: System concept; appropriateness for user requirements and abilities
Prototyping Support: Written descriptions & scenarios; storyboards
Evaluation Tools & Techniques: Focus groups; walk-through; predictive models

Design Level: Semantic
Evaluation Focus: Interaction concept; broad definition of interaction, error feedback and user support
Prototyping Support: Interactive storyboards; hardware mock-ups
Evaluation Tools & Techniques: Informal user tests; controlled laboratory tests

Design Level: Syntactic
Evaluation Focus: Interaction form; dialogue parameters & interaction sequences
Prototyping Support: Interactive storyboards; partial prototypes
Evaluation Tools & Techniques: Formal & informal user tests; controlled laboratory tests; gaming & simulation; field tests

Design Level: Lexical
Evaluation Focus: Interaction detail; specification of HCI
Prototyping Support: Partial prototypes; functional prototypes
Evaluation Tools & Techniques: Formal user tests; gaming & simulation; field tests

Table 3.10: Prototyping and Evaluation to Match Changing Design Requirements (adapted from Christie & Gardiner, 1990)

Gardiner and Christie's model provides some useful guidelines for trading off the time and expense required for developing a prototype against the functionality and performance achieved. In addition, it indicates the extent of evaluation support possible with a relatively small investment. Five prototyping approaches are discussed in further detail below and summarized in Table 3.11.
Method: Scenarios & Storyboards
Advantages:
• Low cost, low risk method for exploring requirements
• Scenarios can be re-used for later evaluations of design
• Storyboards & scenarios can later be incorporated into interactive storyboards
Disadvantages:
• Verbal descriptions in scenarios are not as vivid as visual representations
• Paper storyboards support very limited exploration of interaction
• Little utility in identifying potential human errors

Method: Mock-Ups
Advantages:
• Low cost method for verifying the physical layout of hardware
• May be useful in simulating environment for exercises where full interaction is not required
Disadvantages:
• Limited to representing surface features
• Full capture of ergonomic aspects of performance requires more expensive representation (pushable buttons, turnable knobs, etc.)

Method: Interactive Storyboards
Advantages:
• Useful for refining requirements & identifying potential human errors
• Provides low- to medium-fidelity environment for performing usability trials
• May be developed with low to moderate cost using COTS software
Disadvantages:
• Will not identify throughput or information overload problems associated with data volume
• Must be limited to presenting feasible designs within given constraints

Method: Rapid Prototyping
Advantages:
• Useful (within limits) for evaluating performance with actual or simulated inputs
• May help prevent premature "freezing" of design
Disadvantages:
• Some developers resist showing detailed prototypes to clients
• Increasing fidelity is costly
• Moderate to high cost (costs may be reduced when CASE tools can be used to provide easily modified prototypes)

Table 3.11: Prototyping Techniques

Scenarios & Storyboards

At the lowest level in cost and complexity, written scenarios presenting the basic aspects of an example situation provide a context for exploring tasks, decisions, and cognitive processes. As such they are often used in interviews to give end-users or domain experts a representative case for reaction. Scenarios are used extensively in both manual and computer-based gaming and simulation exercises. They also play a key role in any verification and validation test plan, specifying various input parameters modeling typical operations. Accompanying scenarios with paper storyboards helps subjects to visualize the verbal descriptions in a scenario. For this reason, storyboards make extremely useful props for verbal protocol sessions. In addition, they provide a relatively low cost, low risk method for getting a preliminary feel for how the system would be used in terms of typical tasks and situations. Storyboards may be annotated, reordered or even re-designed during requirements definition interviews. Paper storyboards are limited to representation of a set scenario with little possibility of exploring the range of interaction possible with the given design. The technique presents the sequence of screens, but does not capture potential interaction errors or the cognitive workload associated with a particular design. These aspects are better addressed with interactive storyboards.
Mock-Ups

Mock-ups encompass a variety of non-functioning physical representations ranging from cardboard models of single control panels to full-scale control centers with turnable knobs and flippable switches. They are primarily used for studying the ergonomic impacts of equipment layout on physical task performance. In many cases, physical mock-ups are unnecessary for studying HCI design since most of the visible features of interest are incorporated in interactive storyboards or prototype systems. Mock-ups should not be confused with other physical props (e.g., maps, charts, etc.) that support evaluation exercises.

Interactive Storyboards

Interactive storyboards serve as a powerful means for exploring HCI design alternatives without incurring the expense of developing a working prototype. This is particularly advantageous when the investigation is focused on evaluating several advanced interaction technologies rather than supporting the design of a specific system. Interactive storyboards are also useful for working with experts or end-users to refine requirements. Subjects interact with a computer-based storyboard simulating the actual operation of the system. Interaction may take the form of informal exploration, or subjects may be presented with tasks to perform using the simulated system. In the latter case, the storyboard provides a low- to medium-fidelity environment for assessing usability and identifying potential human errors. Verbal protocol methods may be used to elicit the cognitive processes involved in the interaction. Where storyboards are used in requirements definition and refinement, care must be taken not to present something in storyboard form which is infeasible within the technological and resource constraints of a working system. Although this method can be used to identify problems with cognitive workload due to the allocation of tasks between the operator and computer, it does not task the system sufficiently to delineate user or computer performance problems related to throughput or information overload. These issues must be addressed with operational prototypes that accept real-time data.

Rapid Prototyping

Although developing prototype versions of a system is not a new concept, until recently software prototyping tended to be restricted to semi-operational beta versions of systems under construction. As such, they represented a considerable investment in time and effort, and major changes to the design were highly discouraged. Furthermore, it was not uncommon for a cost-conscious sponsor to stop development with the prototype. If the prototype offered most of the functionality of the completed system, the sponsor would take delivery on the prototype and cancel further development. Similarly, if the prototype indicated major problems with the design or development effort, the sponsor might consider it good management to cut his/her losses at that point. For obvious reasons, developers grew reluctant to show prototypes to their clients. The introduction of fourth generation languages and CASE (computer-aided software engineering) tools dramatically changed the role of prototyping in system design and development. Using the toolboxes provided in COTS (commercial off-the-shelf) software, prototypes with complete interactive displays using windows and pull-down menus can now be developed very rapidly for UNIX, DOS, Macintosh and other environments.
This rapid development capability, and the corresponding ease with which the software may be modified or even substantially re-designed, make it possible for designers to develop and use prototypes during the earliest phases of design. These early prototypes provide many of the features of interactive storyboards while reducing the possibility of presenting the user with an infeasible system concept. Nevertheless, until the system is tasked with the full volume of data expected in the target setting, actual system performance and its impacts on the users will not be fully apparent. This has important implications for the reliability and validity of HCI design evaluations.
3.1.5. EXPERIMENTAL DESIGNS
Experimental approaches to evaluating HCI designs often combine several of the data collection and analysis methods discussed previously. For example, scenarios may provide the task context with some form of prototype to facilitate the interaction and a combination of objective and subjective approaches may be employed to collect and record the performance data. The principal feature distinguishing experimental methods from other techniques is the manipulation of one or more elements (independent variables) to determine the effects upon other elements (dependent variables). The design of the experiment (including the selection of subjects, measures employed, collection and analysis methods) has a profound impact on the reliability and validity of the analyzed results, or findings. Campbell and Stanley’s [1963] classic text on experimental and quasi-experimental design serves as the foundation for most behavioral research. They present
the concepts of internal and external validity and methods for promoting both in the design of experiments. Cook and Campbell [1979] further develop these methods for application to operational settings. Both reliability and validity are essentially functions of experimental control. Thus, results are reliable if they may be systematically replicated under the same experimental conditions. However, reliability does not ensure the validity of results. Findings from empirical studies are valid to the extent that the experimental design and analysis methods adhere to accepted principles of evidence. Cook and Campbell [1979] identify four types of validity:

• Internal Validity - the assurance that the manipulation of the independent variable is in fact the reason for the changes observed in the dependent variable;

• Construct Validity - the correct operationalizing of the relationship between the independent and dependent variables;

• Statistical Conclusion Validity - the assurance of sufficient sensitivity in the study design to reliably assess the covariation of variables and make subsequent inferences about the cause and effect relationships between variables, and the use of correct statistical tests to measure these relationships (see the sketch following this list);

• External Validity - the extent to which the experimental findings may be generalized to other settings and similar conditions.
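The following minimal sketch makes statistical conclusion validity more concrete by estimating the statistical power of a simple two-group comparison of HCI design alternatives. The effect size, the per-group sample sizes, and the choice of a two-sided, two-sample t test are illustrative assumptions, not values from this report; the calculation uses the SciPy statistics routines.

```python
from math import sqrt
from scipy import stats

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample t test."""
    df = 2 * n_per_group - 2
    noncentrality = effect_size * sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability the observed t exceeds the critical value given the assumed effect.
    return (1 - stats.nct.cdf(t_crit, df, noncentrality)
            + stats.nct.cdf(-t_crit, df, noncentrality))

# Hypothetical evaluation of two HCI design alternatives, assuming a medium effect (d = 0.5).
for n in (10, 20, 40, 80):
    print(f"n = {n:3d} per group  ->  power = {two_sample_power(0.5, n):.2f}")
```

A design with low power risks failing to detect a real difference between design alternatives, which is precisely the sensitivity problem this validity type addresses.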
Each type of validity can be undermined, or threatened, by a number of factors which must be addressed in the design of the study. For example, the internal validity of within-subjects designs3 may be threatened by changes in the subjects over time due to exposure to the various treatments or events external to the study. Adelman [1991] explores specific methods for countering various threats to validity when designing classic experiments, quasi-experiments, and case studies as vehicles for empirical evaluation of decision support systems. Quasi-experimental designs are appropriate in situations where the investigator cannot adequately control all variables or guarantee fully randomized subject assignments. Thus, quasi-experimental methods are often selected for field studies.

For a variety of reasons, it is not always possible to fully maintain validity across all four dimensions. This is particularly true where studies involve complex relationships between variables and the experimental conditions do not permit rigorous control of all the potential threats to validity. These cases require making tradeoffs between the precision and generalizability of the findings based upon the purpose of the evaluation and the aspects of the study that are controllable. The experimental setting (i.e., laboratory, gaming/simulation, or operational/field) plays an important role in tradeoff decisions. Adelman & Donnell [1986] discriminate the three settings further in terms of the fidelity of each setting with respect to the target environment and organizational context (Table 3.12) and the system interfaces in Figure 1.2. The remainder of this subsection discusses the characteristics of these three settings. The advantages and disadvantages associated with each are summarized in Table 3.13.
3 In "within-subjects" or "repeated measures" designs, each participant is exposed to all levels or conditions.
Type of Evaluation (setting fidelity): Low Environment, Low DMO
Setting: Laboratory
Interface Examined: DSS/U
Question Evaluated: DA User Compatibility
Experimenter Control: High
Cost: Low

Type of Evaluation (setting fidelity): High Environment, Low DMO
Setting: Laboratory
Interface Examined: DSS/U
Question Evaluated: DA Coherence & Completeness
Experimenter Control: High
Cost: Low

Type of Evaluation (setting fidelity): Low Environment, High DMO
Setting: Gaming Simulation
Interface Examined: U/DMO
Question Evaluated: DA Compatibility with DMO
Experimenter Control: Moderate
Cost: Moderate

Type of Evaluation (setting fidelity): High Environment, High DMO
Setting: Field Test
Interface Examined: DMO/ENV
Question Evaluated: DA Effect on DMO Performance
Experimenter Control: Low
Cost: High
Table 3.12: Summary of Alternative Evaluation Settings (Adelman & Donnell, 1986)
Experimental Setting: Laboratory
Advantages:
• Greatest control over environmental factors
• Allows systematic manipulation of independent variables and fully randomized designs
• Variables may be limited to a tractable number
• Careful design can assure reproducibility and internal validity
• Sampling may be done in a non-intrusive fashion
Disadvantages:
• Problems with external validity and generalization to target system & environment
• May not identify the causal factors affecting the target environment
• Focus on statistical significance may not include factors of practical relevance
• Subjects often do not represent those in the target system

Experimental Setting: Simulation / Gaming
Advantages:
• Reduces problems of reactivity due to the introduction of new technology / procedures
• Greater fidelity to target environment than laboratory; greater control than field studies
• Facilitates non-intrusive sampling
• Computer-based support allows re-design and re-test within a short time span
Disadvantages:
• May involve disadvantages of both laboratory and field research
• Higher costs and more time required to develop
• Problems with bias due to subject conditioning

Experimental Setting: Operational / Field
Advantages:
• Most representative of target system & environment
• Generally uses subjects drawn from target population
• Extremely rich data
Disadvantages:
• Least control over environmental factors
• Fully randomized designs may not be possible
• Manipulation of some variables may be limited or infeasible
• Problems with internal validity and reproducibility
• Problems with reactivity due to the introduction of change and intrusive sampling
Table 3.13: Comparison of Experimental Settings

Laboratory Settings

Laboratory-based research is generally the preferred method for basic research in behavior. Conducting experiments in laboratory settings permits the greatest control over several factors. Although rigorous control is not automatically guaranteed as a function of the setting, the laboratory provides more support for systematic manipulation of independent
variables, exclusion of irrelevant or confounding factors, and fully randomized designs. Furthermore, eliminating the effects of certain factors reduces the variables to a tractable number. Careful design can assure reproducibility and internal validity. In many cases, laboratory equipment also permits non-intrusive sampling.

Unfortunately, laboratory-based studies present problems in terms of the generalization of findings to the target system or environment. The limited factors studied may not include or identify the critical causal factors affecting the target environment and task performance. Similarly, focusing on statistical significance may preclude exploring factors of more practical relevance. As Table 3.12 suggests, laboratory settings can claim little similarity to the target environment and the organizational context. In addition, the fully randomized study may not employ truly representative subjects. Nevertheless, depending upon the level of fidelity attempted, laboratory settings can provide useful support to the conceptual and design phases of C2 system development at relatively low cost. To a somewhat lesser extent, this setting may also address evaluation issues during the acquisition phase.

Simulated Settings

Simulation or gaming exercises attempt to replicate some or all of the features of the operational environment in a somewhat more controlled setting. Subjects may perform tasks with prototypes and mock-ups of physical settings which closely resemble those present in the target environment. However, setting the evaluation away from actual operations permits eliminating some variables to make the analysis more tractable and assure greater internal validity in the findings. The simulated setting also facilitates non-intrusive sampling. In addition, when computer-based simulation is involved, it is often possible to integrate some automated sampling as part of the simulation. Some forms of data may actually be recorded in formats that may be analyzed immediately without further coding or conversion. Finally, computer-based support may also allow for a simple re-design and re-test within a short time span. These features make simulated settings particularly useful for evaluation during the acquisition phase of development.

Although these methods enjoy greater fidelity to the target environment than laboratory experiments and afford greater control than field studies, they may also incorporate the disadvantages of both. One of the toughest problems in evaluating C2 decisionmaking is the lack of adequate models to describe the C2 decisionmaking process in terms of how information contributes to C2 decisions. By definition, simulating the "real world" expands the number of variables that must be studied and controlled. In large C2 exercises, the requirement to heavily pre-script creates a situation where the actions of the participants have little effect on the course of events. Even well-designed computer simulations do not capture all of the nuances of the "real world" situation. This is of particular concern when the subject of the study is human-human and human-machine interaction in critical decision situations. Computer simulation suffers from the inability to adequately model human processes. This may be overcome by combining human processes with computer simulation to achieve a more accurate simulation of reality. Finally, creating system and environmental fidelity is relatively expensive and usually requires extensive development time.
Operational Settings

Field studies are the backbone of operational testing and evaluation. Field settings are often preferred for the final verification and validation of new or updated systems. In addition, during the operational phase of the system life cycle, field studies monitor on-going performance and indicate areas for refinement or new development. Not all field studies employ experimental methods with specific treatments to manipulate variables and collect outcome measures. Cook and Campbell's [1979] discussion of quasi-experimental designs provides the most comprehensive guide to conducting studies in operational settings. They indicate two reasons for the extension of experimental methods into field studies:

1. the irrelevance of the controlled laboratory setting for applied research, and

2. the inadequacy of non-experimental methods for inferring causation.
Although operational settings claim as their principal advantage the highest environmental and organizational fidelity, achieving reliable, valid results in field research is extremely challenging. For example, manipulation of some variables may be limited or infeasible. Furthermore, two groups assigned to the same treatment may not experience the same environmental conditions due to uncontrollable factors external to the situation under study. Fully randomized assignment of subjects to groups and treatments may not be possible in field settings. Cook and Campbell present a set of selection parameters to identify the conditions most appropriate for attempting randomized designs as well as those which militate against using randomization.

The next subsections discuss three approaches to studying human-computer interaction: task analysis, process tracing, and workload assessment. Each method focuses on a slightly different aspect of performance; all three use some combination
of the various data collection and analysis techniques described above.
3.2. Task Analysis Methods
Task analysis covers a range of investigation activities focusing on one or more of the various factors that define a given task. Some form of task analysis is essential during system development to support requirements definition and system design. In addition, task analysis assists in the determination of appropriate foci for system test and evaluation. In the case of C2 systems, MIL-H-46855B specifies a number of behavioral assessments required, when appropriate, in defense system development [Dept. of Defense, 1979]. These include various system, task, team and individual analyses. If a task constrains C2 system performance, it is necessary to find out what characteristics of the task do so. Meister [1981] identifies five task dimensions that may affect performance:

• Functional requirements (cognition, perception, etc.)
• Complexity
• Mental workload
• Temporal factors (pace, duration, sequence, etc.)
• Criticality
The tasks performed to accomplish organizational goals are usually not independent variables in an evaluation study. Rather, the various dimensions which characterize a given task may be manipulated directly or indirectly to evaluate the effect on overall performance. Direct manipulation involves the systematic variation of a task dimension such as the pace of the interaction. In evaluation of HCI designs, for example, two or three different task paces may be used as levels in a factorial design with two or more HCI design alternatives as the treatments. Some task characteristics cannot be manipulated directly without substantially changing the task. In this case, the study is designed to test a theory-based hypothesis regarding the relationship between the dimension of interest (e.g., mental workload) and other directly variable task and system characteristics.

In addition to the direct performance requirements, tasks are defined by the organizational and environmental factors which provide the context for performance. Organizational factors include established doctrine and objectives for both the mission and the force. The structure of the organization, as represented in lines of authority and communication, also defines certain aspects of tasks. Another important factor is the way in which the team, both as a unit within the organization and as individuals, receives feedback and is rewarded for performance. Feedback and reward systems vary in terms of their focus (individuals, individuals in teams, team as a unit, etc.), the content of the feedback or reward (individual performance, team performance, or system performance), and the structure (method, speed, etc.). Finally, task analysis should include the situational or environmental factors that affect system performance. These include such seemingly disparate issues as the physical environment (noise levels, field deployment, etc.) and the likely situational context, such as joint and multinational operations.

There is no single, accepted method for accomplishing task analysis. In fact, none of the standard methods fully address the range of task factors and questions. Fleischman and Quaintance [1984] compiled the most comprehensive survey of classification methods for describing human tasks. Although not a practitioner's guide to task analysis, Fleischman and Quaintance's text does present the theoretical foundations and major findings in research on human task performance. They organize their methods survey in terms of the content of the approaches, as follows:

• Behavior Description - categorizes tasks based upon observations and descriptions of actions performed and behaviors exhibited by the operator;

• Behavior Requirements - categorizes tasks based upon a catalog of operator behaviors required for effective performance of the task;

• Ability Requirements - categorizes tasks based upon the abilities requisite in the operator to achieve effective performance of the task; and
• Task Characteristics - categorizes tasks based upon certain aspects of the task that define and constrain human performance.
Behavior description approaches generate detailed lists of overt behaviors. The methods, however, provide no direct means of prioritizing behaviors or identifying critical activities. Behavior and ability requirements methods do promote the analysis of critical aspects of task performance; however, both methods are limited by the subjective nature of the observation and categorization required. Human requirements are, by definition, ambiguous and difficult to quantify, although many of the ability requirements approaches use empirical methods which include mathematical counts and weighted factors. The resulting quantitative description is only as good as the accuracy and uniformity of the application of highly subjective semantic classifications. Finally, focusing on the task characteristics rather than the human performance of the task presents an entirely different perspective in the analysis of tasks. Task characteristic approaches also suffer from semantic difficulties in the identification of useful and appropriate descriptors. Meister's five task dimensions cited previously supply one approach to the categorization of task characteristics.

These four classification approaches do not cover all possible criteria for investigating tasks. For example, they do not include the various physiological aspects of tasks that comprise certain workload assessments. Perhaps more importantly, these approaches do not include the error classification and analysis methods that have high relevance in the design of HCI for critical systems and for the training of personnel. Summary discussions of the research findings on the identification, analysis, and mediation of error in human-machine interaction are found in Dhillon [1986], Norman [1981, 1988], Rasmussen et al. [1987], Reason [1990], and Rouse [1991].

Stammers et al. [1990] break the process of task analysis into three iterative stages: data collection, description, and analysis. Since the ultimate objective of this process is the application of task analysis findings to the design, development, or evaluation of the target system, this requirement should drive the selection of suitable methods. The range of task analysis methods they identify includes those defined by their representation techniques (e.g., hierarchical, network, and flow chart methods) or by their content (e.g., cognitive and knowledge description, taxonomies, and formal grammars). The remainder of this section presents a brief overview of several typical methods. The advantages and disadvantages of each category are summarized in Table 3.14.
3.2.1. HIERARCHICAL METHODS
Among the best documented and most widely used task analysis methods are those involving the hierarchical decomposition of tasks into their component subtasks. The decomposition process facilitates modularity and allows the analyst to control the analysis focus and the level of granularity. Perhaps equally important, hierarchical representations are relatively easy to learn and understand, making them particularly useful as a means of communicating concepts to decisionmakers and users. These methods correspond to the behavior requirements approaches cited above.

Hierarchical Task Analysis (HTA), used extensively in process description for industry, employs a process of progressive redescription that successively breaks down tasks and subtasks into finer detail until the stopping criteria are met. These stopping criteria are based upon a risk factor computed from the cost associated with non-performance of the subtask multiplied by the probability of performance failure. The determination of both cost and probability factors is often highly subjective and requires expert judgment. HTA diagrams also feature annotations of plans identifying various temporal aspects of tasks. This method has been applied successfully in the areas of training and human reliability assessment.

The GOMS (Goals, Operators, Methods, and Selection) model, developed by Card, Moran and Newell [1983], is probably the best-known model for analyzing the behavioral requirements for procedural information processing tasks such as text editing. The major constructs of the model include:

• Goals - the objectives of a task and subtask
• Operators - the elementary actions necessary to accomplish goals
• Methods - the sequence of operators and subgoals to accomplish a goal
• Selection Rules - the rules for choosing between alternative methods to achieve a goal
The GOMS model provides a means of predicting task completion time. However, the prediction is not robust enough to address delays due to errors or interruptions. Kieras [1988] presents a GOMS-based methodology for user interface design that supports prediction of human performance, learning time estimates, and execution time estimates. To the extent that tasks are procedural and the GOMS components identifiable, this method is useful for guiding HCI design and evaluating existing systems or prototypes.
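The flavor of GOMS-based execution-time prediction can be illustrated with a small keystroke-level calculation in the spirit of Card, Moran and Newell. The operator times below are typical keystroke-level estimates of this kind, and the two candidate methods are hypothetical; none of the values are taken from this report.

```python
# Illustrative keystroke-level operator times (seconds); actual values would be
# calibrated to the operators, devices, and tasks under study.
OPERATOR_TIMES = {
    "K": 0.28,   # press a key
    "P": 1.10,   # point with a pointing device
    "H": 0.40,   # home hands between keyboard and pointing device
    "M": 1.35,   # mental preparation
}

def predicted_execution_time(method):
    """Sum elementary operator times for a method expressed as an operator sequence."""
    return sum(OPERATOR_TIMES[op] for op in method)

# Two hypothetical methods for the same goal ("designate a track"):
menu_method    = list("HMPKK")        # home, think, point at menu item, two keystrokes
command_method = list("MKKKKKKKK")    # think, then type an eight-character command

for name, method in (("menu", menu_method), ("command", command_method)):
    print(f"{name:8s} predicted time = {predicted_execution_time(method):.2f} s")
```

Consistent with the caveat above, such a calculation predicts only error-free, uninterrupted execution time; it says nothing about delays caused by errors, interruptions, or decisionmaking load.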
Method: Hierarchical Methods
Advantages:
• Provides systematic & complete structure for analysis
• Broadly applicable
• Well-documented
• Easy to learn & apply
Disadvantages:
• Difficulties with representation of parallel activities
• Limited representation of cognitive factors
• Hierarchical representation may be misconstrued as prescriptive rather than descriptive

Method: Network Methods
Advantages:
• Useful supplement to other analyses
• Analysis may be performed quickly given the right data
• Relatively easy to learn
Disadvantages:
• Limited applicability and narrow focus
• Does not consider underlying cognitive or behavioral relationships

Method: Knowledge Description & Cognitive Methods
Advantages:
• Provides methods for characterizing cognitive tasks not found in other TA methods
• Consistent structure for representing task information
Disadvantages:
• Requires considerable analyst skill in knowledge elicitation
• Expert sources may not be able to adequately verbalize knowledge
• Difficulty assuring completeness in highly cognitive tasks

Method: Taxonomic Methods
Advantages:
• Relatively easy to learn & apply
• Provides an explicit categorization of task information that can be adapted to a variety of uses
• Well-documented
Disadvantages:
• Difficult to assure completeness & designation of mutually exclusive categories
• Problems with definitions of categories can result in inconsistent assignment in allocating task elements

Method: Formal Grammar Methods
Advantages:
• Provides rigorous, formal specification of task information
• Provides a mapping of tasks to actions for HCI dialogue
Disadvantages:
• Narrowly focused
• Inflexible structure
• No method for describing interrelationships between task elements

Method: Flow Chart Methods
Advantages:
• Provides a means of representing parallel user/system tasks and the information flows involved
• Well-documented
• Relatively easy to learn & apply
Disadvantages:
• Difficult to assure completeness and designation of mutually exclusive categories
• Problems with definitions of categories can result in inconsistent assignment in allocating task elements

Table 3.14: Task Analysis Methods Comparison [adapted from Stammers et al., 1990; Meister, 1985]
3.2.2. NETWORK METHODS
Network methods are appropriate for the examination of such task dimensions as temporal factors, certain workload features, and spatial relationships. Network paradigms are also useful for describing communication flows, including human-human, human-machine, and machine-machine. As discussed in Section 3.1.1, several objective methods support network analysis, including a variety of time-event charting methods, FROM-TO charts, and link analysis. Network methods appear to have some utility in the design of environments to support team performance. The most often
cited study is Chapanis' [1959] redesign of a battle cruiser command post to facilitate optimal communications. More recently, there has been some interest in applying Petri nets and similar representations to the modeling of command and control tasks. Network methods, however, do not address the underlying cognitive and behavioral factors in decisionmaking tasks.
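A minimal sketch of the kind of tabulation that underlies link analysis appears below: observed communication events between hypothetical command-center positions are aggregated into link frequencies, which could then inform layout or team-design decisions. The positions and events are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical observed communication events (speaker, listener).
events = [
    ("watch officer", "intel analyst"),
    ("intel analyst", "watch officer"),
    ("watch officer", "weapons director"),
    ("intel analyst", "plotter"),
    ("watch officer", "intel analyst"),
    ("weapons director", "watch officer"),
]

# Treat links as undirected: frequency of interaction between each pair of positions.
link_counts = Counter(frozenset(pair) for pair in events)

positions = sorted({p for pair in events for p in pair})
for a, b in combinations(positions, 2):
    count = link_counts.get(frozenset((a, b)), 0)
    if count:
        print(f"{a:17s} <-> {b:17s} {count} interactions")
```

As the surrounding discussion notes, such a frequency matrix describes who talks to whom and how often, but says nothing about the cognitive content of those exchanges.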
3.2.3. COGNITIVE & KNOWLEDGE DESCRIPTION METHODS
The traditional methods for task analysis generally do not address the cognitive processes and knowledge requirements of tasks. The development of expert systems and other knowledge-based decision support systems demanded a systematic framework for elicitation and representation of the knowledge that defines both the decision domain and the decision processes. As few of the relevant task elements are overt or otherwise observable, most of these methods employ a variety of subjective techniques, particularly detailed elicitation techniques such as the critical incident method, verbal protocol analysis and verbal probe. Methods for describing knowledge or cognitive processes are, thus, limited by their dependence upon subjective assessments and the ability of experts to verbalize their decision processes.

Cognitive Task Analysis (CTA) [Rasmussen, 1986] uses information from verbal protocols to analyze the skilled operator tasks involved in large scale control processes. Rather than analyzing the operator's information processes, CTA provides a framework for organizing the sequence of 'states of knowledge' representing what the operator knows about system operations at a given point. The schematic representation provides a systematic means for modeling human-machine interaction in highly automated systems. This technique appears to have great utility in HCI design and evaluation for C2 systems, particularly those involving both team decisionmaking and the allocation of a significant number of information processing functions to machines.

Task Analysis for Knowledge Description (TAKD) [Johnson et al., 1984] was originally designed as a means for organizing knowledge in training applications. This method has been used recently in usability testing for interface designs. The representation employed is similar to formal grammar methods and attempts to abstract task knowledge independent of the specific task. The goal is to increase the generalizability of the results to make them more useful for HCI design.
3.2.4. TAXONOMIC METHODS
In most cases, some taxonomy of behaviors lies at the root of all task analyses. There are a large number of general taxonomies in print, providing an excellent source of descriptors for checklists in any investigation. As indicated above, Fleischman and Quaintance [1984] present the most comprehensive survey of taxonomic methods, organized in terms of the types of information recorded. The principal limitations associated with taxonomic approaches stem from semantic confusion regarding the categorization or labeling of behaviors or activities. Taxonomic methods have been applied most often in job description, personnel selection, and requirements specification.

McCormick's Position Analysis Questionnaire (PAQ) [1964, 1976] focuses on job description and more recently has extended the concept of job dimensions to more closely resemble abilities requirements techniques. PAQ defines job dimensions based upon job data and attribute profile data organized into six dimensions: information input, mental processes, work output, relationships with other persons, job context, and other job characteristics. This addition of task context also indicates the utility of McCormick's PAQ for the description of task characteristics as well as the abilities and behaviors required for performance.

One of the most commonly used taxonomies in HCI requirements analysis and design is Berliner's hierarchical classification for measuring performance in military jobs [Berliner et al., 1964]. Berliner's classification specifies four processes (i.e., perceptual, mediational, communication, and motor processes) that further break down into activities with specific behaviors. It is the specific behaviors that provide the observable and measurable entry points into the classification. Berliner's method specifies the measures (e.g., times, errors, frequencies, workload, and motion dynamics) and categorizes the instruments for collecting the measures.

Fleischman's Abilities Requirements Approach [Fleischman & Quaintance, 1984] principally defines perceptual-motor ability requirements for human task performance. As such, this method primarily supports the development of ergonomic requirements for systems supporting tasks with a significant psychomotor component, such as tracking and targeting. This method is well-documented with solid empirical foundations. It has been useful in designing effective training and predicting performance.
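A hedged sketch of how a Berliner-style process/activity/behavior hierarchy might be represented for use as an evaluation checklist is shown below. The specific activities, behaviors, and measures listed are illustrative examples chosen for this sketch, not a reproduction of Berliner's published entries.

```python
# Illustrative fragment of a Berliner-style hierarchy:
# process -> activity -> observable behaviors, each mapped to candidate measures.
classification = {
    "perceptual": {
        "searching for and receiving information": {
            "detects": ["detection time", "miss rate"],
            "scans":   ["fixation frequency", "scan duration"],
        },
    },
    "mediational": {
        "information processing": {
            "estimates": ["estimate error", "response time"],
        },
    },
    "communication": {
        "exchanging information": {
            "requests":      ["request frequency"],
            "acknowledges":  ["acknowledgement latency"],
        },
    },
    "motor": {
        "discrete actions": {
            "activates control": ["execution time", "error rate"],
        },
    },
}

def measures_for(behavior):
    """Return the candidate measures associated with an observable behavior."""
    for process in classification.values():
        for activity in process.values():
            if behavior in activity:
                return activity[behavior]
    return []

print(measures_for("detects"))   # -> ['detection time', 'miss rate']
```

Representing the taxonomy explicitly in this way makes the observable entry points, and the measures attached to them, easy to audit for the completeness and category-overlap problems noted in Table 3.14.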
3.2.5. FORMAL GRAMMAR METHODS
One of the principal appeals of formal grammars is their ready translation into machine-understandable statements. This feature is useful in the development of expert systems and other knowledge-based applications. One of the primary advantages of formal grammars is their reduction of the ambiguity associated with more subjective approaches. However, formal grammars trade off completeness for precision and may not capture relevant information that does not fit into the classification scheme.

Task Action Grammar (TAG) [Payne, 1984] enables the direct mapping of tasks to actions and models user knowledge. As with other methods, the TAG approach begins with task decomposition. The mapping of tasks to actions is accomplished by applying a set of rewriting rules. Although quite comprehensive in its domain, TAG is limited to the investigation of command languages and user-computer dialogue.
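The sketch below gives a toy illustration of the rewriting-rule idea behind task-action mapping: a task is expanded through a small set of rules into a concrete action sequence. The grammar, task names, and primitive actions are hypothetical and far simpler than Payne's TAG notation.

```python
# Toy rewrite rules: each task or subtask expands into subtasks or primitive actions.
# Primitive actions are written in UPPERCASE and are not expanded further.
rules = {
    "designate-track": ["select-track", "assign-label"],
    "select-track":    ["MOVE-CURSOR", "PRESS-SELECT"],
    "assign-label":    ["OPEN-LABEL-MENU", "CHOOSE-LABEL", "CONFIRM"],
}

def expand(symbol):
    """Recursively rewrite a task symbol into its primitive action sequence."""
    if symbol not in rules:          # primitive action: no further rewriting
        return [symbol]
    actions = []
    for part in rules[symbol]:
        actions.extend(expand(part))
    return actions

print(expand("designate-track"))
# -> ['MOVE-CURSOR', 'PRESS-SELECT', 'OPEN-LABEL-MENU', 'CHOOSE-LABEL', 'CONFIRM']
```

The precision of such a mapping is exactly what the paragraph above credits to formal grammars; anything about the task that cannot be expressed as a rule simply never appears in the expansion.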
3.2.6. FLOW CHARTING METHODS
In some respects, flow charting methods resemble both hierarchical and network approaches. The ability to model both parallel and sequential activities, combined with the focus on information flows, decision points, and actions, makes flow charting methods ideal for the representation of HCI requirements. The job process chart method, developed for analyzing naval command tasks, specifies a three-level hierarchy for describing communication flows [Tainsh, 1985]. The top level of the hierarchy identifies the work stations and lines of communication between them. The next level describes the tasks performed at each station. Finally, the tasks and subtasks are defined in terms of their allocation to human or machine and the subsequent HCI requirements. Since flow charting methods employ some form of task taxonomy for identifying tasks, these methods also exhibit some of the classification ambiguity noted in other methods. Perhaps more critical in the development of C2 systems, these methods assume an unchanging external environment and a uniformity of user knowledge that undermines their validity in the target environment.
3.3. Cognitive Process Methods
Design decisions regarding the use of human-computer interaction technology to support inferencing and decisionmaking require an understanding of the cognitive processes the interaction is intended to support. Technological attributes (e.g., MOPs) are useless without an understanding of where they fit in process and performance. In general, however, traditional task analysis methods are not appropriate for examining and modeling tasks primarily characterized by high-level cognitive processes. Cognitive processes can only be inferred using task analysis approaches through the examination of observable external events or activities and the subjective reports of the participants. Validating such inferences requires examination of the hidden links between the observable events.
3.3.1. APPROACHES TO STUDYING COGNITIVE PROCESS
Researchers in social psychology use methods to examine social and behavioral processes that could be useful in understanding cognitive processes. Fiske and Taylor [1984] cite several models that attempt to address process issues. For example, covariation models propose that people gather, retrieve, and combine information across several dimensions. Similarly, schema-based process models propose that people attend to data and retrieve data from memory better when it fits into pre-existing mental structures. These models appear to be supported in the research findings regarding the superiority of recognition over recall in memory tasks [Kintsch, 1970]. Social Judgment Theory (SJT), based upon the Brunswik Lens model of perception, is a representation of human judgments about individual decision criteria using heuristic interpretation of internal and external cues [Hammond et al., 1975, 1980]. The SJT model proposes that human decisionmakers respond to information overload by developing strategies, or heuristics, to select which information will be used and which will be ignored. Preference is given to those factors (cues) which are judged to be causally linked to the criterion under consideration. Numerous biases have been identified which may affect cue selection, including representativeness, vividness, and availability [Tversky & Kahneman, 1974]. A comprehensive review of human decision making biases and heuristics is provided in Sage [1981]. Related research indicates that data presentation features, such as spatial and temporal relationships, can also influence selection [Einhorn, 1971; Fischhoff et al., 1978; Hogarth, 1987; Miller, 1971].
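In lens-model research of this kind, a judge's holistic ratings are commonly regressed on the available cues to estimate the weights implicitly given to each cue. The sketch below shows that step on fabricated data using NumPy; the cue names, values, and ratings are illustrative only.

```python
import numpy as np

# Fabricated data: each row gives cue values presented for one case
# (e.g., force ratio, warning time, terrain favorability), scaled 0-1.
cues = np.array([
    [0.8, 0.2, 0.5],
    [0.3, 0.7, 0.9],
    [0.6, 0.4, 0.2],
    [0.9, 0.8, 0.7],
    [0.2, 0.3, 0.4],
    [0.5, 0.9, 0.6],
])
# The judge's overall threat rating for each case (0-10).
judgments = np.array([6.5, 5.8, 4.1, 8.9, 2.7, 7.0])

# Fit a linear judgment policy: judgment ~ w0 + w1*cue1 + w2*cue2 + w3*cue3.
design = np.column_stack([np.ones(len(cues)), cues])
weights, _, _, _ = np.linalg.lstsq(design, judgments, rcond=None)

predicted = design @ weights
consistency = np.corrcoef(predicted, judgments)[0, 1]   # how linear the policy is

print("intercept and cue weights:", np.round(weights, 2))
print("policy consistency (R):", round(float(consistency), 2))
```

A low consistency coefficient would suggest the judge is weighting cues non-linearly or inconsistently, which is one way the heuristic and idiosyncratic factors discussed above show up in the data.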
3.3.2. EXPERIMENTAL MANIPULATIONS TO TEST PROCESS HYPOTHESES
Because of the hidden nature of cognitive processes, the factors impacting human judgment and decisionmaking are most often explored using experimental methods. Process manipulations attempt to examine cognitive activity by varying some aspect of a specific stage in the process. Typically, the information content is the same in each treatment; the variables manipulated involve timing or modality, or both.

Information timing manipulations are possibly the most common. These include not only studies which involve pacing of information presentation [i.e., time stress], but also those which examine the effects which may be due to the stage in the process at which information is available. Closely related variants are studies which manipulate the order of individual pieces of information to discover the impact on decision process and outcome [Adelman et al., 1991]. Similar means are used to assess the primacy and recency effects associated with the selection of cues in a decision process [Hogarth & Makridakis, 1981].

Of particular interest to HCI designers are those effects on process which may be associated with the mode, or channel, of information presentation (e.g., visual vs. aural, written vs. verbal, etc.). Modality studies are more difficult to analyze due to the increased possibility of undesired interactions. For example, there is a considerable literature in educational psychology on the highly individual effects of information presentation mode on attention and recall. Moreover, information presentation mode is much more difficult to isolate successfully from the content of the information. Each mode, or medium, of presentation has unique characteristics which can strongly affect the way in which information is evaluated and used in a decision process. For example, information presented in more vivid media (e.g., video, graphics, animation, etc.) can dominate that presented in pallid formats (e.g., text) in the selection and evaluation of decision cues [Nisbett & Ross, 1980]. As a consequence, it may not be possible to determine whether the results obtained are due to the mode of presentation itself or to the modal effects on content. However, despite the important distinction for basic research, in cases where HCI design decisions are involved the accurate determination of process may not be so critical as the ability to manipulate overall effect.

Fiske and Taylor [1984] emphasize the importance of process studies, in part because process appears to be domain-independent. For example, decision tasks requiring situation awareness involve similar cognitive processes. Whether the domain is command and control or economic forecasting, the human decisionmaker (DM) needs to find out what has changed in the situation and how it has changed. Using this information, the DM will make inferences (hypothesize) about the trends suggested by these changes, generate alternative responses, and select a course of action. Therefore, from an HCI design perspective, improving support for the cognitive processes involved in situation awareness would appear to have important implications for decision performance in these domains. Designing technology to support cognitive processes reiterates the requirement to understand process and the relationship of technological support to that process. The level of abstraction chosen to model this relationship determines the measures and the means by which they are collected.
In team processes, one might reasonably select a level which isolates the individual team member, while retaining that member’s functional inter-relationships to the larger team processes. Thus, to study the system-level impacts of HCI design on situation awareness in tactical planning, it is useful to model the processes relating intelligence analysts and the tactical planners they support. This model helps to map the link between the technological support of the analysts and the subsequent performance of the planners (Figure 3.2). In this case, the HCI variable would be the display of information available to the analysts. To improve situation awareness, for example, an intelligent display might highlight changes occurring during a time period according to a predetermined prioritization scheme. Animating the changes would highlight existing patterns between the various elements. The development of the individual analyst’s situation awareness might be captured with a set of objective measures associated with the cues available in the display and subjective evaluation of the individual’s interaction with the team. The analysis team’s situation awareness could be evaluated in a timed study in terms of the completeness of their assessment and accuracy of their estimate. Finally, the effects on the planners’ awareness could be measured in a timed study along with the resulting effects on planning performance based upon the stability of the plan in the face of likely and possible counter responses.
[Figure 3.2 depicts the hypothesized chain: Display (highlight changes; animate activity across time) -> Analysts' Awareness (improved?) -> Estimate Quality (assessment completeness; estimate accuracy) -> Planners' Awareness (improved?) -> Plan Quality (plan stability; contingency planning).]
Figure 3.2: Hypothesized Effects of Information Display on Tactical Planning

One approach to the study of complex processes is to perform a series of studies on different parts of the same model. After measuring a hypothesized mediator in one study, that variable is then manipulated as a treatment in a subsequent study. There are two problems associated with this method. First, it is difficult to be certain that the variable/treatment is the same in each study. Second, it is easy to develop alternative explanations for the results. These negative aspects may be countered by performing new studies with non-overlapping weaknesses. In addition, multiple views of the process phase of interest may be "triangulated" to increase accuracy. Using this method, the situation awareness study might be divided into two studies. The first would vary the display of information available to the analysts to measure the effect on situation awareness. The second would use the estimates produced using the two display treatments as conditions for the study of planning performance. Study design would have to include levels and treatments to rule out the effects of the individual composition of the planning team and differences between the various scenarios.

In contrast to multiple studies, internal analysis involves a single study of the whole model with the links between process stages inferred from correlations internal to the study and treatment. Due to the higher level of abstraction required, this method does not provide a means of determining the direction of causality. Furthermore, it does not adequately isolate the measured variables from those which are unmeasured. Thus, it is possible to focus the study on a less relevant variable while ignoring the more critical indicator. For example, in the tactical planning example, the use of the "intelligent" display may strongly affect the analysts' confidence in their estimate. In briefing their estimates to planners, the confidence of the analysts may have unmeasured effects on the way in which that briefing is heard and understood. This, in turn, is an unmeasured factor in the subsequent activities of the planners.

Clearly, the most difficult factor in studying cognitive processes is operationalizing the variables in the hypothesized process. Although it is possible to measure performance under time constraints and see a difference between treatments, the relationship between the treatments, time and performance may remain ambiguous. This has serious implications for the development of decision support systems that attempt to incorporate the perceived improvements in performance by embodying the features of the "successful" treatment in the system design. Process analysis is an attempt to trace unobservable processes by examining overt activities. Cognitive process studies enable decomposition of the inferencing
steps in decision tasks to more carefully examine the relationships between information inputs, inferencing strategies, and decision outcomes.
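As a minimal illustration of the internal-analysis approach described above, the fragment below computes pairwise correlations along the hypothesized display -> awareness -> plan-quality chain on fabricated scores; the conditions, scores, and sample size are invented. As the text notes, such correlations cannot by themselves establish the direction of causality or rule out unmeasured variables such as analyst confidence.

```python
import numpy as np

# Fabricated per-trial scores from a single study of the tactical planning model.
display_condition = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 0 = baseline, 1 = "intelligent" display
awareness_score   = np.array([4.2, 5.1, 3.8, 4.6, 6.3, 5.9, 6.8, 6.1])
plan_quality      = np.array([3.9, 4.4, 3.5, 4.1, 5.8, 5.2, 6.0, 5.6])

def r(x, y):
    """Pearson correlation between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

print("display -> awareness   r =", round(r(display_condition, awareness_score), 2))
print("awareness -> plan      r =", round(r(awareness_score, plan_quality), 2))
print("display -> plan        r =", round(r(display_condition, plan_quality), 2))
```

Strong correlations along the chain are consistent with the hypothesized process, but only the multi-study approach, with the mediator manipulated directly, can test the causal links.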
3.4. Mental Workload Assessment

The assessment of mental workload serves two main purposes in human-computer interaction design: 1) predictive - to determine the point (or load) where error or time delay will begin to appear or where individual and team performance begins to degrade -- how steep the fall off is, etc.; and 2) prescriptive - to determine the best human-machine system configuration, including the appropriate HCI design to support the user and the best allocation of tasks between the user, machine, and other team members.
The concept of "workload" appears so deceptively self-evident that the term is often used without definition. In fact, there is not even a universally accepted definition of workload. Hart and Wickens [1990] first describe workload as the human effort required to perform tasks, then proceed to diagram workload, effort and performance as three separate elements. In similar fashion, Pew [1979] presents workload as a function of the task and user variables that determine task demands, contrasted against the user's ability to meet task demands and, in turn, contribute to overall system performance. This three-way relationship between input, internal process and output is consistent with Meister's [1985] description of workload as both a multidimensional construct and an intervening variable. Workload is a multidimensional input: the "loading" or burden imposed on the human user by the task, environment, and system configuration. The user's internal processes affect the experience of the load (not to be confused with the concept of stress) and drive the attempts to relieve the load. Workload also has multidimensional output, or mediational, effects on task performance and the user's task strategies. Finally, both strategy and performance indirectly affect the system, organization, and environment. Figure 3.3 presents a conceptual representation of the possible relationships between workload factors and measures.

However defined, workload appears both to vary as a function of the tasks, users, and system configurations and to affect the human user, task strategies, and system performance. Thus, mental workload cannot be described or predicted entirely in terms of task demands or performance. For example, variations in the consequences associated with user actions taken to meet task demands present an additional dimension to workload inputs. Gopher and Donchin [1986] present the problems associated with the subjective experience of workload in a simple example contrasting the task of walking a six-inch-wide plank placed at ground level versus the same task when the plank is 50 feet above the ground. The physical and perceptual task demands have not changed, yet it is highly likely that the difference between the two heights will affect task performance measures (perhaps slower speeds with fewer errors), physiological measures (increased heart rate, etc.), and ratings of workload as perceived by the subjects. This difference has obvious implications for the assessment of workload in simulated or laboratory settings where, despite all efforts to replicate the real-world environment, the consequences of error cannot be replicated without at the very least violating the ethical constraints on the use of human subjects.
[Figure 3.3 elements: task dimensions, organizational context, environmental context, and individual differences shape the task demands placed on the human-computer system; the user's internal processes (recognition of and response strategies to workload) mediate task performance (primary and secondary task measures), physiological measures, and subjective workload ratings, with potential feedback effects.]
Figure 3.3: Conceptual Relationship Between Workload Factors and Measures

Moray [1988] indicates there are still no adequate theoretical models for predicting cognitive workload. Furthermore, despite extensive research in the field, the findings lack sufficient validity and reliability to provide a foundation for generalization -- even to similar settings -- for the purpose of evaluating HCI designs. Within the three principal classes of workload measures (i.e., task performance measures, subjective ratings, and physiological measures), the variability between subjects performing the same fixed task is very high. Furthermore, a number of studies indicate that response variability is also high between classes of measures within the same subjects. Thus, an increase in workload input may be experienced and reported by the user, but those subjective ratings will not correlate reliably with performance measures such as error rate, due to the subject's changing task strategies and other idiosyncratic factors. Moray acknowledges that much of the variability in findings may stem from the simplicity of the designs and from problems with adequate operational definitions for the factors analyzed. Unfortunately, the availability of subjects and research resources often drives the simplification of many designs.

3.4.1 WORKLOAD MEASURES
Three classes of measures comprise the toolbox for assessing workload: task performance, subjective ratings, and physiological measures. (Several good reviews of the recent literature on mental workload measures may be found in Hancock et al. [1985], Hart & Wickens [1990], Kak [1981], Meister [1985], Meshkati et al. [1990], and Moray [1988].) Task performance (or behavioral) measures and physiological measures are objective methods collected using direct observation techniques. Both performance and physiological measures have been employed extensively and reliably to evaluate the physical limits of operators (e.g., pilots) performing manual tasks. The principal factors are well-defined and lend themselves readily to measurement and prediction. Increased automation of
manual tasks has clearly reduced the physical workload in many settings while imposing additional cognitive burdens on the human supervisors of complex operations. As indicated previously, the concept of mental workload is difficult to define and, thus, difficult to measure. Insight into the "black box" of cognitive processes requires the addition of self-report (subjective) evaluation techniques.

Task Performance Measures

Since the ultimate goal in HCI design is improved performance, and it seems reasonable to assume that overloading the user will eventually degrade performance, research in mental workload often attempts to correlate performance measures (such as reaction time, accuracy, and error) with variations in task demands. Primary task measures assess the performance of the human-machine system but do not reflect the changes in cost to the human user as a result of task demands. These techniques are highly dependent upon task characteristics. For example, O'Donnell and Eggemeier [1986] suggest the performance of a simple task may not vary significantly under increased workload, while moderately difficult tasks exhibit a linear decrease in performance as workload increases. However, when confronted with increased workload while performing very difficult, complex tasks, human operators adapt in a variety of ways which effectively change the task or the task performance criteria. These dynamically revised task strategies attempt to assure completion of subtasks perceived as having greater priority and may even involve ignoring certain low-priority tasks. Hart and Wickens [1990] indicate that primary task performance measures appear to correlate with task demands and workload only for moderately difficult tasks, while at the extreme ranges of task difficulty, performance and workload dissociate. Studies suggest that primary task measures are more useful as indicators of task completion and momentary variations in workload than of overall workload.

Secondary task measures purportedly assess human resource costs by measuring the residual resources available to the user under various task loads. Theoretically, this residual capacity is indicated by the user's ability to perform a secondary task while maintaining acceptable performance in the primary task(s). Although secondary measures are among the most common, Ogden et al. [1979] indicate the extreme difficulty of using and interpreting the results, owing to the lack of guidance for selecting an appropriate secondary task. The assumptions regarding "spare capacity" and task demands are further complicated by the characteristics of the primary and secondary tasks. For example, primary and secondary tasks performed concurrently may interfere with each other when they actively compete for the same perceptual or cognitive resources. This interference results in degraded performance for one or both tasks. In addition, since most secondary tasks are performed at specific intervals, they represent only a momentary increase in workload. The fundamental construct is further confounded if users modify their task strategies for the primary task when required to perform the secondary task.
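Purely for illustration (the report prescribes no such computation), the residual-capacity notion behind secondary task measures can be expressed as a simple ratio of secondary-task performance with and without the primary task present; the function name and figures below are hypothetical.

```python
# Hypothetical "spare capacity" index: secondary-task performance under
# dual-task load expressed relative to its single-task baseline.
# Names and values are illustrative, not drawn from the report.

def spare_capacity_index(secondary_dual: float, secondary_alone: float) -> float:
    """Ratio of secondary-task performance with the primary task present to
    secondary-task performance alone. Values near 1.0 suggest ample residual
    capacity; values near 0.0 suggest the primary task consumes most of the
    available resources."""
    if secondary_alone <= 0:
        raise ValueError("baseline performance must be positive")
    return secondary_dual / secondary_alone

# Example: 12 probe responses per minute when performed alone versus
# 7 per minute while also performing the primary task.
print(f"Spare capacity index: {spare_capacity_index(7, 12):.2f}")  # 0.58
```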
Physiological Measures

Physiological measures involve collecting data from the cardiovascular system (e.g., heart rate, blood pressure, galvanic skin response), respiratory system (e.g., respiration rate), or nervous system (e.g., brain activity, pupil size) while the subject is engaged in performing tasks. It is also possible to collect body fluids for biochemical analyses, but these are generally collected before and after task performance for comparison purposes. Most physiological collection methods require physical connection to the subject and may interfere with task performance. As a result, physiological methods are typically reserved for laboratory or simulator settings. Even the less invasive methods are impractical in most operational settings.

In theory, changes in physical states indicate changes in arousal. The concept of arousal is closely associated with motivation and attention. Thus, a very low level of arousal could correlate with poor performance, perhaps related to inattention. Similarly, a very high level of arousal might correlate with poor performance due to stress, overload, or hyper-vigilant behavior. The relationship between physiological measures and cognitive workload remains tenuous despite continued investigation. Although it is sometimes possible to establish a correlation between physiological and performance indicators, even these relationships cannot be reliably interpreted through the current theoretical or empirical literature.

In a study on occupational stress, Kak [1981] rated 11 physiological methods using criteria related to their validity, reliability, representativeness, invasiveness, and practicality. Kak's study afforded heart rate, EKG, blood pressure, and body fluids the best overall ratings, including positive ratings in validity, reliability, and representativeness. Hart & Wickens [1990] describe heart rate as the only physiological measure that consistently covaries with performance or workload. However, changes in heart rate and other measures of physical activation provide few insights into cognitive processes. With the increase in CRT display of information, measures of eye function (e.g., gaze direction, gaze duration, etc.) hold some promise for assessing mental workload in highly visual, information-intensive tasks. For example, Harris
et al. [1986] used gaze direction and duration to examine task performance strategies. These methods would appear to have the greatest potential relevance to cognitive processing and decisionmaking.

Subjective Measures

In his survey of recent research, Moray [1988] identifies advances in research using subjective measures, primarily in the form of rating scales, as providing the most promise for the assessment of mental workload. Regardless of the other measures collected, studies of operator workload typically include some form of self-report from the subject. Researchers often use subjective measures to support and validate objective physiological and task performance measures. This technique is particularly useful when the objective indices contradict each other. The operational rule of thumb is that the subject's reported experience of loading overrides the other evidence and thus can be used to interpret the conflicting objective measures.

In addition to the problems associated with subjective approaches in general, subjective workload measures suffer from the various conceptual complexities described previously. In the absence of a unifying theory or empirical evidence, there is little guidance to assist the development of rating scales that capture the multidimensional aspects of cognitive workload. Meister [1985] lists five basic requirements for a generic workload scale:
• application to a variety of task situations
• adaptability to a range of user responses or experiences of workload
• correlation with other workload indices
• orthogonal dimensions
• validity
Currently, no scales or subjective measures satisfy all of Meister's criteria. There has been some promising work, however, in the development of reliable, validated rating scales for workload assessment. Committed resources and coherent programs of research have produced considerable advances in the subjective measurement of pilot workload at NASA Ames and the U.S. Air Force's Wright Laboratory. Both organizations developed and validated subjective rating instruments that run on microcomputers. Wright Laboratory's Subjective Workload Assessment Technique (SWAT) allows the pilot to rate his or her current state at one of three levels (low, medium, or high) on each of three dimensions (stress, effort, and workload). The NASA Task Load Index (TLX) employs a bipolar scale to determine empirically the minimum number of dimensions required to account for the variation in individual attitudes about workload. In a series of studies [Hart et al., 1981; Hart & Hauser, 1987; and Hart & Staveland, 1988], the NASA research team reduced these dimensions to six: time pressure, physical demand, mental demand, performance, effort, and frustration. Perhaps the most promising aspect of the NASA and Wright Laboratory research is the study by Vidulich and Tsang [1985] establishing a good correlation between the NASA and SWAT instruments when used by a single group of assessors.
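To make the scoring of such an instrument concrete, the minimal sketch below (not drawn from the report) shows one way six TLX-style subscale ratings might be combined into a single workload value. The dimension names follow the report; the 0-100 rating range and the pairwise-comparison weighting are assumptions about a common scoring procedure, included only for illustration.

```python
# Illustrative only: combining six TLX-style subscale ratings into one
# workload value with a weighted average. Rating range and weighting
# scheme are assumptions made for this sketch.

DIMENSIONS = ["time pressure", "physical demand", "mental demand",
              "performance", "effort", "frustration"]

def workload_score(ratings, weights):
    """Weighted average of subscale ratings (higher = more workload)."""
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(ratings[d] * weights[d] for d in DIMENSIONS) / total_weight

# Hypothetical ratings (0-100) and weights (how often each dimension was
# judged the more important in 15 pairwise comparisons).
ratings = {"time pressure": 70, "physical demand": 10, "mental demand": 85,
           "performance": 40, "effort": 75, "frustration": 55}
weights = {"time pressure": 4, "physical demand": 0, "mental demand": 5,
           "performance": 2, "effort": 3, "frustration": 1}

print(f"Overall workload: {workload_score(ratings, weights):.1f}")  # 71.0
```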
3.4.2 APPLYING WORKLOAD STUDIES TO TEAM DECISIONMAKING PROBLEMS
The many difficulties associated with mental workload measurement militate against its use as a major input to HCI design decisions for decision support in team contexts. The validity and reliability problems associated with workload measurement for individual users make the extension of an indirect link from individual workload to team decisionmaking performance extremely tenuous. In addition, there are potential problems with the generalizability of the traditional operator workload measures to team processes which are less manual. One difficulty is the notion that performance is best measured in terms of number of errors and time to completion. Finally, what may be measured in isolation (i.e., the single team member) cannot necessarily be combined with other measures, and yet cannot be reliably measured in context.

Assessing the relative workload associated with various HCI designs appears feasible in more bounded team decisionmaking contexts, such as a planning team for Air Tasking Orders (ATOs), where there are procedural guidelines and stable roles. Load adaptation strategies (deferral, reallocation of tasks to other personnel, etc.) are constrained within analyzable and generalizable ranges. Although the planning team is not insensitive to leadership styles, the variability in roles and tasks is considerably less than that in the larger command center setting.

Given the characteristics identified for the three workload measurement classes, the question arises: what measures appear applicable for assessing mental workload in a team decisionmaking context? The use of objective task performance measures presumes that appropriate elemental tasks are identifiable. In tracking and targeting tasks, for
example, it is possible to count the number of hits versus misses within a specific time period while varying the load with respect to the number and speed of targets on screen. However, it is not clear that sufficiently discrete, sequential subtasks exist in complex decisionmaking tasks to make primary task measurement feasible. Even the relatively bounded example of ATO planning requires the coordination of complex subtasks. The most acceptable primary task measure would be a gross measure of time to complete while varying the load in terms of the number of critical factors involved. In this case, the critical factors function as a measure of complexity. Unfortunately, this kind of gross measure provides little in the way of diagnostic information about workload. Furthermore, it is more likely that an examination of ATO planning would involve a time-constrained activity followed by expert evaluation of the delivered plan with respect to completeness, robustness, and similar qualitative metrics.

Secondary task measures seem even less useful in this context. Introduction of secondary tasks in a planning context stretches the credibility of the scenario: trivial tasks would be ignored unless required, while critical tasks would significantly redefine the nature of the exercise. However, Hart & Wickens [1990] suggest that some secondary tasks may be embedded in a natural fashion into a primary task environment. For example, built-in alerts in a computer display can serve to measure performance in a secondary monitoring activity. In this case, physiological measures related to arousal and eye tracking might supplement the information collected on the monitoring task. Other physiological measures appear to have little application to team planning tasks. In addition, physiological measures for team activities present not only collection problems but also considerable potential for interference with the planning task.

Subjective measures could be used as part of workload assessment and may provide the only overview of the process in context. Post hoc ratings, questionnaires, and interviews would seem more appropriate than more intrusive measures due to the highly integrated and parallel nature of the subtasks. Breaking the exercise at set points to gather subjective ratings would seem too intrusive. An exception might be a longer exercise where decision stages clearly exist as subjects await feedback on actions taken. Even then, such breaks would have to be handled carefully to avoid tainting the process through directed introspection.

The use of workload assessment as an input to HCI design decisions requires careful task analysis. The cognitive research literature presents some generalizable findings that may guide the operational definition of inferencing and decisionmaking tasks with respect to information display. However, most researchers seem to agree that mental workload assessment is still too ambiguous in its application and interpretation to be used as a key input in either workload prediction or mediation design. Where data collection is feasible, workload measures may provide useful support to the findings derived from other evaluation techniques.
4. Implementing HCI Analysis & Evaluation to Support the Systems Design & Development Process

At each stage in systems design and development, the various decisionmakers involved require information inputs from
1. analyses of requirements (system objectives, functions, tasks, operational capabilities) and
2. evaluations of performance and effectiveness characteristics (current and potential).

Designs for the complex systems supporting command and control (C2) decisionmaking derive conceptual requirements from models of C2 processes. The doctrine incorporated in these models and the missions defined by the organization provide the context for identifying the functional and task requirements that structure the relationships of humans and machines. These requirements, in turn, help to determine the appropriate measures of performance (MOPs) and measures of effectiveness (MOEs) that form the selection criteria for HCI designs. The previous section reviewed an array of methods currently used in collecting and interpreting technical and performance data. This section presents a framework for identifying the appropriate analyses and evaluations to support systems design and development decisionmaking.

4.1. Supporting the C2 System Development Life Cycle (SDLC) Phases
As indicated in the methods profiles, empirical studies provide a means for acquiring information about the actual performance characteristics and capabilities of systems, components, and the human users. For this reason, data must be collected from sources such as:
• interaction with existing systems or prototypes;
• validated simulations; or
• archived performance data and standards.
The empirical study approach builds information in a data-intensive, bottom-up fashion. However, while empirical evaluations can be used to determine performance benchmarks, they do not permit direct insight into the performance requirements. These requirements evolve from a top-down analysis based upon the organizational and system objectives, functions, and the tasks identified with those functions. Without the analytical framework, the measures collected in empirical studies lack context and can misdirect decisionmakers.

Rasmussen & Goodstein [1988] conceptualize the well-balanced evaluation for the design of a cognitive system as the combination of top-down analytical evaluation and bottom-up empirical assessment (Figure 4.1). System design evolves through the top-down analysis of the intended purpose and identified functions. Functions are then decomposed into the procedures and tasks allocated to the machine and the user, culminating in the design that maps the system's form. Bottom-up empirical evaluations first address the lower-level human factors issues associated with fundamental usability and continue by evaluating the support of the cognitive requirements involved in the tasks. These human requirements interact with the system's allocation of functional requirements and the capabilities afforded by the design. Despite some variations in terminology, this prescription of a combination of top-down analytical and bottom-up empirical evaluation is consistent with similar discussions in Meister [1985, 1991] and Adelman [1992]. Meister [1985] presents a series of human performance questions grouped by development stage and indicates the various analysis and evaluation methods that supply answers.

The balance between analytical and empirical evaluation approaches shifts depending upon the stage in system development. For example, in the early stages of planning and design, there is a strong reliance on top-down analysis methods supported by the available objective data and subjective judgments. The later phases of detail design, testing, and system operation employ more rigorous empirical evaluation methods and well-structured subjective measures to assess performance in terms of the functional requirements outlined in earlier phases of development.

Section 1.3 presented a four-phase life cycle model for the design, development, acquisition, and operation of C2 systems (see Table 1.2, page 6). Despite the rational appeal of such conceptual models and the considerable influence DOD standards bring to bear on the process, the development of C2 systems rarely involves the full application of all facets of each phase. As indicated previously, very often "new systems" represent the updating or enhancement of existing designs. In other cases, systems are designed and developed to exploit the possibilities of new technologies.
Only rarely is the design and development of a new system truly problem (requirements) driven. Rasmussen & Goodstein [1988] acknowledge that the notion that a system "design can be a well structured implementation of a formulated goal" is essentially an academic ideal. This is not to imply that development is or should be entirely an ad hoc process. Rather, the related cost and time requirements often prohibit the kinds of exhaustive analysis and evaluation that the conceptual models of design and development prescribe.
[Figure 4.1 (not reproduced) matches the system hierarchy of purpose, function, and form against the corresponding human dimensions (values and needs, information processing, psychological mechanisms, physiology, anthropometry); top-down analytical evaluation descends from purpose toward form to match human and job, while bottom-up empirical evaluation ascends from form toward purpose, with the annotations "functional correctness before ergonomics" and "usability before problem solving."]
Figure 4.1: Contribution of Analytical & Empirical Evaluation Approaches (adapted from Rasmussen & Goodstein, 1988)

Ultimately, decisionmakers must make design choices and set evaluation objectives within the time and cost constraints imposed. Studies indicate, however, that inadequate requirements analysis constitutes the primary source of error in systems development, with resulting schedule and cost overruns [Andriole, 1990; Boar, 1984]. To assure requirements accuracy and completeness, Adelman [1992] advocates the integration of evaluation into the early phases of the system development process rather than as a separate stage late in development. Despite its deviation from some of the traditional SDLC paradigms, the incorporation of a more coherent evaluation plan spanning the entire development process need not require a radical shift in actual practice. Many of the analysis and design activities during the early development phases incorporate the products of previous evaluations and the standards derived from those and other evaluations. In addition, the recent adoption of iterative design methods supported by rapid prototyping makes use of empirical evaluation at much earlier design stages.

Table 4.1 revisits the four phases in C2 design and development and indicates the relationship between the developmental objectives at each phase and the related HCI evaluation objectives and products. The Conceptual Phase includes the initial identification of system objectives and desired system characteristics. These initial system concepts are examined in terms of technical feasibility and their utility with respect to organizational needs. The corresponding HCI evaluation objectives at the conceptual phase aid in assessing alternative system concepts with respect to the system objectives, design constraints, and the working concept's appropriateness to the target users and environment. Before a new or enhanced system is introduced to support a command and control process, careful analysis is required to assess the potential impacts of the system on the target organization, its missions, procedures, and the human team members. Top-down analysis methods are used during this phase to create the profiles of potential users, tasks, organizational context, and operational environment that help identify desired system characteristics. For example, user and task profiles incorporate information from objective and subjective evaluation and may include analytical inputs from formal task analyses and human factors standards. The system concept incorporates the allocation of system functions between the human users and the supporting machines. The HCI aspects of this allocation are established through task analysis, cognitive process modeling, and workload assessment studies. Task scenarios developed at this point help to
flesh out the system concept and provide an initial means of verifying it. These scenarios can be used throughout development for evaluation and requirements tracing.

In the Design Phase, developmental objectives involve formalizing the system requirements and specifications based upon the needs identified in the previous phase and a more thorough appraisal of current system capabilities and deficiencies. This phase culminates in the selection of a system design for acquisition. The HCI evaluation objectives during this phase focus on supporting the evaluation of alternative system designs with respect to the requirements and objectives embodied in the system concept (see Table 4.1).
Table 4.1: HCI Evaluation in Relation to SDLC Phases and Objectives

Conceptual Phase
Phase objectives:
• Establish feasibility and utility of proposed system
• Determine system objectives
• Identify design constraints
HCI evaluation objectives & products (Evaluation of System Concept):
• Determine desirable system characteristics
• Profile users, tasks & situation/environment
• Develop task scenarios
• Determine desired task allocation across human & machine components
• Examine appropriateness for target users and environment
• Establish system concept as accurate reflection of system objectives and design constraints

Design Phase
Phase objectives:
• Develop system requirements and specifications
• Determine current system capabilities & deficiencies
• Select "best" of alternative system designs
HCI evaluation objectives & products (Evaluation of Alternative System Designs):
• Contribute to tradeoff analyses of various alternative design configurations
• Develop performance requirements
• Develop storyboards & initial prototypes, mock-ups, etc. to support evaluation
• Determine appropriateness for meeting objectives and requirements of system concept
• Examine potential usability and learnability of design alternatives

Acquisition Phase
Phase objectives:
• Establish detail design
• Test & evaluate developmental system
HCI evaluation objectives & products (Developmental Evaluation):
• Determine "best" design to meet specifications
• Support refinement of design specifications
• Develop working prototypes for incremental evaluation of evolving system
• Certify correspondence to specified performance requirements
• Verify adequacy of user support and error control & recovery
• Check appropriateness of dialogue design

Operational Phase
Phase objectives:
• Evaluate operational system to determine if it meets requirements
• Determine new system requirements resulting from capabilities & deficiencies of operational system
HCI evaluation objectives & products (Operational Evaluation):
• Provide feedback on user & organizational acceptability to support redefinition of system concept & training requirements
• Identify previously unforeseen deficiencies in new system
• Determine necessary modifications to procedures and manning as a result of new system

Task analyses, cognitive process modeling, and workload assessment studies performed to support the allocation of functions also contribute to the
identification of performance requirements for both the machine components and human users. Early HCI designs, manifested in storyboards, prototypes, and mock-ups, may be evaluated at this point using the task scenarios developed previously. Objective and subjective evaluations, conducted either with target users or with a small group of representative experts, provide additional feedback to the design process. These early studies help refine the evaluation instruments used later in system and operational testing.

Customarily, the development and acquisition of large government systems involve a split between the specification of system requirements and the detail design and development of the system. During the Acquisition Phase, the developer takes over the realization of the system concept to develop a detailed design and subsequently proceed with implementation of the accepted design. The HCI evaluation objectives and products during this phase concentrate on the iterative evaluation requirements of the development process. Evaluation now focuses on working prototypes, with particular emphasis on the empirical evaluation of system performance against the defined requirements. Task-oriented objective methods predominate at this point, supported as needed by subjective evaluations to verify the adequacy of user support and error handling. Formal system testing takes place in the final stages of the Acquisition Phase using the evaluation instruments refined during earlier phases. A well-defined set of metrics, including subjective and objective measures, is critical to a comprehensive evaluation of the system. In addition, certain measures emerge at this point as key effectiveness indicators in the overall evaluation. Depending upon their strength as indicators and the ease of collection, these measures will form the core of the operational evaluations.

In the Operational Phase, evaluation scrutinizes the installed system in context to determine whether it meets the original requirements. The delivery of a new or enhanced system also introduces new capabilities to the organization and may bring to light additional deficiencies that affect organizational missions and procedures. For this reason, HCI evaluation objectives in the Operational Phase focus not only on assessing the performance and effectiveness of the new system to assure it meets the original requirements but also on identifying any emerging applications and requirements. Many of the evaluation methods used in previous phases of development are applicable at this phase; however, the ongoing critical activities in command and control operational environments greatly constrain the selection of methods. Depending upon the constraints imposed, evaluators select appropriate metrics from the core group of key indicators established during system testing.

4.2. Selecting and Organizing Measures to Meet Evaluation Objectives
At various stages in the development process, evaluation involves determining the effectiveness of the proposed or existing system, component, or enhancement in terms of the tradeoffs across functionality, performance, reliability, maintainability, and associated costs. The review of analysis and evaluation methods presented in Section 3 highlights not only their advantages, but also their limitations. No single method presented provides the range of information required for HCI design decisions. Examining the performance and effectiveness of human-machine cooperation requires the collection and analysis of objective data and subjective judgments from a variety of sources. In a systems-oriented evaluation, these various inputs fit into a hierarchical schema of performance measures (MOPs) and effectiveness measures (MOEs) to assure a coherent evaluation plan. Previous discussions examined the definition and structuring of an MOP/MOE hierarchy and described the key attributes of appropriate measures (see Section 2.3). Figure 4.2 outlines the steps for identifying and applying metrics to evaluate C2 systems. Based upon the evaluation objectives, the system boundaries drawn define the scope of the current investigation and identify the system components of interest. The boundary established also determines the level of abstraction required to model the C2 processes the system supports. In the examination of HCI designs for team decisionmaking support, this model indicates the roles and relationships of teams, individuals, and machines. This top-down analysis continues with a hierarchical decomposition of the functions, sub-processes, and tasks the system performs with links back to the larger operational processes and requirements. The MOEs selected form the criteria for assessing the system’s effectiveness in achieving the identified function or task objectives, while MOPs measure human and/or machine performance in a specific activity. Data collection and analysis methods are selected based upon their relative utility (reliability, representativeness, validity, etc.) in establishing values for the chosen measures. The bottom-up evaluation process culminates in an aggregation of measurement values that filters back up the evaluation hierarchy for interpretation against the objectives and requirements of the C2 system and processes.
[Figure 4.2 depicts the modular evaluation flow: Evaluation Objectives → C2 System Bounding (system elements) → C2 Process Definition (functions) → Specification of Measures/Criteria (MOPs, MOEs, & MOFEs; measures for functions) → Data Generation (exercises, experiments, simulation, subjective judgments) → Values of Measures → Aggregation of Measures → Summary Score.]
Figure 4.2: Modular C2 Evaluation Structure [adapted from Sovereign & Sweet, 1986]

The various data collected and other factors entering into the selection of an alternative HCI design typically use different scales, even after individual aggregation. Furthermore, these inputs differ in their relative importance to the overall decision. For example, when accuracy is a critical factor, measures related to errors committed and time spent correcting errors may be of greater relevance than system throughput. The validity of the analysis depends upon the ability to assign weights to the individual and aggregated measures to reflect their relative value. Finally, to discriminate between design alternatives, decisionmakers need a summary score -- a "bottom-line" assessment.

Adelman and Donnell [1986] propose a three-point approach to decision support system (DSS) evaluation that appears applicable to the evaluation of HCI designs for decision aiding. Their method is based upon hierarchical combination of scores from technical, empirical, and subjective evaluations addressing the MOEs defined by the organization. The technical evaluation comprises those objective measures associated with verifying the correctness and
reliability of the software and system architecture. Empirical evaluation uses objective methods to appraise the aid’s contribution to improved decision performance or overall system effectiveness. Finally, subjective evaluation focuses on user acceptance and subjective assessments of system utility and usability, including cost-benefit tradeoffs. Aggregation of the various evaluation facets is accomplished using a multi-attribute utility (MAU) model that structures the hierarchical relationship of the individual measures and assigns weights to values for combination into a summary score. Several microcomputer-based decision support packages currently employ variations of multi-criteria decisionmaking models. The more sophisticated versions of these tools assist the user in structuring a hierarchy of values. In addition, most support conversion of subjective evaluation terms into numerical scores. However, these programs do not provide guidance for identifying appropriate measures or creating the conceptual framework that relates those measures. Further research is needed to design knowledge-based aids that assist in modeling processes and developing plans for system evaluation.
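As a purely illustrative sketch of the kind of roll-up such a multi-attribute utility model performs (the node names, weights, and scores below are hypothetical and are not presented as Adelman and Donnell's procedure), a weighted hierarchy of normalized measure scores can be aggregated recursively into a single summary score.

```python
# Hypothetical MAU-style aggregation: normalized (0-1) measure scores are
# combined with weights up a simple hierarchy to yield a summary score.
# All node names, weights, and scores are invented for this sketch.

def mau_score(node):
    """Recursively aggregate a weighted hierarchy of normalized scores."""
    children = node.get("children")
    if not children:
        return node["score"]
    total_weight = sum(child["weight"] for child in children)
    return sum(child["weight"] * mau_score(child)
               for child in children) / total_weight

candidate_design = {"children": [
    {"weight": 0.3, "children": [                 # technical evaluation
        {"weight": 0.5, "score": 0.90},           # software correctness
        {"weight": 0.5, "score": 0.80}]},         # architecture reliability
    {"weight": 0.4, "children": [                 # empirical evaluation
        {"weight": 0.6, "score": 0.70},           # decision accuracy MOP
        {"weight": 0.4, "score": 0.60}]},         # time-to-decision MOP
    {"weight": 0.3, "children": [                 # subjective evaluation
        {"weight": 0.5, "score": 0.75},           # user acceptance rating
        {"weight": 0.5, "score": 0.65}]},         # perceived utility rating
]}

print(f"Summary score: {mau_score(candidate_design):.2f}")  # 0.73
```

Comparing such summary scores across alternative HCI designs gives the "bottom-line" discrimination described above, while the intermediate branch scores retain the diagnostic detail.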
5. Summary

This paper describes models and methods for realizing the potential utility of HCI evaluation to support the design and development of complex information systems. The approach presented embraces three basic principles:
• the design of human-computer interaction embodies the relationship of human users and computer-based aids in achieving system goals;
• the decomposition of HCI functions, processes, and tasks provides measurable indicators of the extent to which specific designs fulfill system objectives; and
• the utility of HCI evaluation to the system design process depends upon the application and interpretation of HCI measures in the context of a valid framework of objectives, functions, processes, and tasks.
While it is infeasible to define “cookbook” procedures for either the design of systems or the evaluation of those designs, the conceptual approach presented attempts to assist design decisionmakers in defining flexible, robust plans for analysis and evaluation.
6. References Adelman, Leonard. Evaluating Decision Support and Expert Systems. New York: Wiley Interscience, 1992. . "Experiments, Quasi-Experiments, and Case Studies: A Review of Empirical Methods for Evaluating Decision Support Systems," IEEE Transactions on Systems, Man, and Cybernetics, 21(2) March/April 1991, 293-301. & Michael L. Donnell. "Evaluating Decision Support Systems: A General Framework and Case Study" in Stephen J. Andriole, Ed., Microcomputer Decision Support Systems: Design, Implementation, and Evaluation. Wellesley, MA: QED Information Sciences, Inc., 1986. Adelman, Leonard, Martin A. Tolcott, and Terry A. Bresnick. “Examining the Effect of Information Order on Expert Judgment,” Organizational Behavior and Human Decision Processes, 1991. Alberts, D. S. C2I Assessment: A Proposed Methodology. (MTR-80-W00041). McLean, VA: Mitre Corporation, 1980. Andriole, Stephen J. Information System Design Principles for the 90s: Getting it Right! Fairfax, VA: AFCEA International Press, 1990. . “Storyboard Prototyping for Requirements Verification,” Large Scale Systems, 12, 1987, 231-247. Andriole, Stephen J., Lee S. Ehrhart, Leonard Adelman, and Alexander H. Levis. Decision Aiding for Naval Anti-Air Warfare (AAW): The State of the Art & the State of the Expectation. (GMU/C3I-218-R) Fairfax, VA: Center of Excellence in C3I, 1991. Bainbridge, Lisanne. “Verbal Protocol Analysis,” in John R. Wilson and E. Nigel Corlett, (Eds.) Evaluation of Human Work: A Practical Ergonomics Methodology. New York: Taylor & Francis, 1990. Battilega, John A. & Judith K. Grange, (Eds.). The Military Applications of Modeling. Wright-Patterson AFB, Ohio: Air Force Institute of Technology Press, 1980. Berliner, D. C., D. Angell, and J. Shearer. “Behaviors, Measures and Instruments for Performance Evaluation in Simulated Environments.” In Proc. of the Symposium and Workshop on the Quantification of Human Performance. University of New Mexico, Albuquerque, NM, 1964. Boar, Bernard H. Application Prototyping: A Requirements Definition Strategy for the 80s. New York, Wiley Interscience, 1984. Booher, Harold R. (Ed.), MANPRINT: An Approach to Systems Integration. New York: Van Nostrand Reinhold, 1990. Brodsky, Stuart, Alexander H. Levis, Tony Richardson, Conrad Strack, Edison Tse, Clairice Veit. “Mathematics,” in Rikki Sweet, Morton Metersky, and Michael Sovereign, (Eds.). Command and Control Evaluation Workshop. Proc. of the Military Operations Research Society (MORS) Workshop on C2 Measures of Effectiveness. Monterey, CA: Naval Postgraduate School, June 1986. Brooks, Frederick P., Jr. "Grasping Reality Through Illusion -- Interactive Graphics Serving Science" in O'Hare, J., Ed., Proc. of ACM SIGCHI '88 Conference: Human Factors in Computing Systems. New York: ACM Press, 1988, 1-11. Callero, Monti, Ralph Strauch, and Jack Lind. A Computer Aided Exercise Facility for Tactical Air Command and Control Evaluation: Concepts and Design Overview. Technical Report #N-1450-AF. Santa Monica, CA: Rand Corporation, April 1980. Campbell, Donald T. and Julian C. Stanley. Experimental and Quasi-Experimental Designs for Research. Boston: Houghton Mifflin Co., 1963. Card, Stuart K., Thomas P. Moran, and Allen Newell. The Psychology of Human-Computer Interaction. Hillsdale, NJ: Erlbaum, 1983. Chapanis, A. Research Techniques in Human Engineering. Baltimore, MD: Johns Hopkins Press, 1959. Christie, Bruce and Margaret M. Gardiner. “Evaluation of the Human-Computer Interface,” in John R. Wilson and E. Nigel Corlett, (Eds.) 
Evaluation of Human Work: A Practical Ergonomics Methodology. New York: Taylor & Francis, 1990.
Cook, Thomas D. and Donald T. Campbell. Quasi-Experimentation: Design & Analysis Issues for Field Settings. Boston: Houghton Mifflin Co., 1979. Cox, Tom. “The Recognition and Measurement of Stress: Conceptual and Methodological Issues,” in John R. Wilson and E. Nigel Corlett, (Eds.) Evaluation of Human Work: A Practical Ergonomics Methodology. New York: Taylor & Francis, 1990. Crumley, Lloyd M. and Mitchell B. Sherman. Review of Command and Control Models and Theory. ARI Technical Report 915. Ft. Leavenworth, KS: U.S. Army Research Institute, September 1990. Defense Systems, Inc. HEAT User’s Manual. Draft Report. McLean, VA: Defense Systems, Inc., 1984. . Elements of C2 Theory. Draft Report. McLean, VA: Defense Systems, Inc., January 1985. Department of Defense. Defense System Software Development. DOD Standard 2167A. Washington, DC: DOD, February 29, 1988. . Human Engineering Requirements for Military Systems. MIL-H-46855B. Washington, DC: DOD, 1979. Dhillon, B. S. Human Reliability. Oxford, UK: Pergamon Press, 1986. Drury, Colin G. “Designing Ergonomics Studies and Experiments,” in John R. Wilson and E. Nigel Corlett, (Eds.) Evaluation of Human Work: A Practical Ergonomics Methodology. New York: Taylor & Francis, 1990a. . “Methods for Direct Observation of Performance,” in John R. Wilson and E. Nigel Corlett, (Eds.) Evaluation of Human Work: A Practical Ergonomics Methodology. New York: Taylor & Francis, 1990b. Einhorn, H.J. “Use of Nonlinear, Noncompensatory Models as a Function of Task and Amount of Information,” Organizational Behavior and Human Performance, 1971, 6, 1-27. Entin, Elliot E. An Investigation of the Combined Value of Shared Battle Graphics and Time-Tagged Information for Crisis Resolution. Technical Report #TR-524. Burlington, MA: Alphatech, Inc., October 1991. Fiske, S. T. and S. E. Taylor. Social Cognition. Reading, MA: Addison-Wesley, 1984. Fleischman, Edwin A. and Marilyn K. Quaintance. Taxonomies of Human Performance: The Description of Human Tasks. New York: Academic Press, 1984. Fischhoff, B., Slovic, P., and Lichtenstein, S. “Fault Trees: Sensitivity of Estimated Failure Probabilities to Problem Representation,” Journal of Experimental Psychology: Human Perception and Performance, 1978, 4, 342-355. Foster, William, Robert Allison, Robert Choisser, Edward C. Johnson, S.Z. Mikhail, Bernard Galing, Larry Rhoads. “Applications and the Need for C2 Measures,” in Rikki Sweet, Morton Metersky, and Michael Sovereign, (Eds.). Command and Control Evaluation Workshop. Proc. of the Military Operations Research Society (MORS) Workshop on C2 Measures of Effectiveness. Monterey, CA: Naval Postgraduate School, June 1986. Gardiner, Margaret & Bruce Christie, (Eds.). Applying Cognitive Psychology to User-Interface Design. New York: John Wiley, 1987. Gopher, D. and E. Donchin. “Workload -- an Examination of the Concept” in K. Boff, L. Kaufmann and B. Thomas, (Eds.), Handbook of Perception and Human Performance, Vol. 2. New York: John Wiley, 1986. Hammond, Kenneth R., McClelland, G.H., and Mumpower, J. Human Judgment and Decision Making: Theories, Methods and Procedures. New York: Hemisphere/Praeger, 1980. Hammond, Kenneth R., Stewart, T. R., Brehmer, B. and Steinman, D. O. “Social Judgment Theory.” In M.F. Kaplan and S. Schwartz (Eds.) Human Judgment and Decision Processes. New York: Academic Press, 1975. Hancock, Peter A., Najmedin Meshkati, and M. M. Robertson. “Physiological Reflections of Mental Workload.” Aviation, Space, and Environmental Medicine, 56, 1985, 1110-1114. 
Harris, R. L., B. J. Glover, and A. Spady. Analytic Techniques of Pilot Scanning and Their Application. Technical Report. NASA TP-2525. Washington, DC: National Aeronautics and Space Administration, 1986. Hart, Sandra G., M. Childress, and M. Bortolussi. “Defining the Subjective Experience of Workload” in Proceedings of the Human Factors Society 25th Annual Meeting. Santa Monica, CA: Human Factors Society, 1981.
Hart, Sandra G. and J. Hauser. “Inflight Application of Three Pilot Workload Measurement Techniques,” Aviation, Space, and Environmental Medicine, 58 (5), 1987, 402-410. Hart, Sandra G. and L. E. Staveland. “Development of NASA-TLX (Task Load Index: Results of Empirical and Theoretical Research. In Peter A Hancock and Najmedin Meshkati (Eds.), Human Mental Workload. Amsterdam: North-Holland, 1988. Hart, Sandra G. and Christopher D. Wickens. “Workload Assessment and Prediction” in Harold R. Booher (Ed.), MANPRINT: An Approach to Systems Integration. New York: Van Nostrand Reinhold, 1990. Hennessey, Robert T. “Practical Human Performance Testing and Evaluation” in Harold R. Booher (Ed.), MANPRINT: An Approach to Systems Integration. New York: Van Nostrand Reinhold, 1990. Hoeber, Francis P. Military Applications of Modeling; Selected Case Studies. New York: Gordon & Breach Science Publishers, 1981. Hogarth, R.M. Judgment and Choice. (2nd ed.) New York: John Wiley, 1987. & S. Makridakis. “Forecasting and Planning: an Evaluation,” Management Science, 27(2) 115-138, Feb 1981. Joint Chiefs of Staff. Unified Action Armed Forces (JCS Pub 0-2). Washington, DC: Joint Chiefs of Staff, 1986. Kak, A.V. “Stress: an Analysis of Physiological Assessment Devices” in G. Salvendy and M.J. Smith (Eds.), Machine Pacing and Occupational Stress. London: Taylor & Francis, 1981. Keppel, Geoffrey and William H. Saufley, Jr. Introduction to Design and Analysis; a Student’s Handbook. New York: W. H. Freeman & Co., 1980. Kieras, David E. “Towards a Practical GOMS Model Methodology for User Interface Design” in M. Helander (Ed.), Handbook of Human-Computer Interaction. Amsterdam: Elsevier, 1988. Kintsch, W. “Models for Free Recall and Recognition” in D.A. Norman (Ed.), Models of Human Memory. London: Academic Press, 1970. Kirwan, Barry. “Human Reliability Assessment,” in John R. Wilson and E. Nigel Corlett, (Eds.) Evaluation of Human Work: A Practical Ergonomics Methodology. New York: Taylor & Francis, 1990. Kleinbaum, David G., Lawrence L. Kupper, Keith E. Muller. Applied Regression Analysis and Other Multivariable Methods, 2nd Ed. Boston: PWS-KENT Publishing Co., 1988. Johnson, P., D. Diaper, and J. Long. “Tasks, Skills and Knowledge: Task Analysis for Knowledge Based Descriptions.” In Interact ‘84, Vol. 1. London: IFIP, 1984. Land, Walker, Ted Bean, Leon Godfrey, Judy Grange, Don Newman, Tony Snyder. “Model,” in Rikki Sweet, Morton Metersky, and Michael Sovereign, (Eds). Command and Control Evaluation Workshop. Proc. of the Military Operations Research Society (MORS) Workshop on C2 Measures of Effectiveness. Monterey, CA: Naval Postgraduate School, June 1986. Majone, Giandomenico & Edward S. Quade, (Eds.). Pitfalls of Analysis. New York: John Wiley & Sons, 1980. Marciniak, John J. and Donald J. Reifer. Software Acquisition Management: Managing the Acquisition of Custom Software. New York: John Wiley & Sons, 1990. McCormick, E.J. The Development, Analysis, and Experimental Application of Worker-Oriented Job Variables. Final Report. (ONR Nonr-1100[19]). Lafayette, IN: Purdue University, 1964. . “Job and Task Analysis.” In M.D. Dunnette, (Ed.), Handbook of Industrial and Organizational Psychology. Chicago: Rand McNally, 1976. Meister, David. Behavioral Analysis and Measurement Methods. New York: John Wiley, 1985. . Behavioral Research and Government Policy: Civilian and Military R & D. New York: Pergamon Press, 1981. . Psychology of System Design. Amsterdam: Elsevier, 1991. . “Simulation and Modelling,” in John R. 
Wilson and E. Nigel Corlett, (Eds.) Evaluation of Human Work: A Practical Ergonomics Methodology. New York: Taylor & Francis, 1990.
Meshkati, Najmedin, Peter A. Hancock, and Mansour Rahimi. “Techniques in Mental Workload Assessment,” in John R. Wilson and E. Nigel Corlett, (Eds.) Evaluation of Human Work: A Practical Ergonomics Methodology. New York: Taylor & Francis, 1990. Miller, P.M. “Do Labels Mislead? A Multiple Cue Study, Within the Framework of Brunswik’s Probabilistic Functionalism,” Organizational Behavior and Human Performance, 1971, 6, 480-500. Miller, Richard, Harold Glazer, Linda Hill, Charles Smith, Bruce Thieman. “Measures,” in Rikki Sweet, Morton Metersky, and Michael Sovereign, (Eds). Command and Control Evaluation Workshop. Proc. of the Military Operations Research Society (MORS) Workshop on C2 Measures of Effectiveness. Monterey, CA: Naval Postgraduate School, June 1986. Moray, Neville. “Mental Workload Since 1979” in David J. Oborne (Ed.), International Reviews of Ergonomics, Vol. 2. London: Taylor & Francis, 1988. Nisbett, R. and L. Ross. Human Inference: Strategies and Shortcomings of Social Judgment. Englewood Cliffs, NJ: Prentice-Hall, 1980. Norman, Donald A. “Categorization of Action Slips,” Psychological Review, 88 (1), 1981, 1-15. O’Donnell, R. D. and F. T. Eggemeier. “Workload Assessment Methodology,” in K. Boff, L. Kaufman and J. Thomas (Eds.), Handbook of Perception and Human Performance. New York: John Wiley & Sons, 1986. Olmstead, J. A., M.J. Baranick, B.L. Elder. Research on Training for Brigade Command Groups: Factors Contributing to Unit Combat Readiness. (TR-78-A18). Alexandria, VA: U.S. Army Research Institute for the Behavioral and Social Sciences, 1978. Olmstead, J.A., H.E. Christensen, L.L. Lackey. Components of Organizational Competence: Test of a Conceptual Framework. (TR 73-19) Alexandria, VA: Human Resources Research Organization, 1973. Orr, G. E. Combat Operations C3I: Fundamentals and Interactions (Research Report AU ARI-82-5). Maxwell, Air Force Base, AL: Airpower Research Institute, 1983. Payne, S. J. “Task-Action Grammars.” In Interact ‘84, Vol. 1. London: IFIP, 1984. Pew, Richard W. “Secondary Tasks and Workload Measurement” in Neville Moray (Ed.), Mental Workload: Its Theory and Measurement. New York: Plenum Press, 1979. . & Sheldon Baron. "Perspectives on Human Performance Modelling." Paper presented at 1982 IFAC Conference, Baden-Baden, West Germany, September 1982. Rasmussen, Jens. Information Processing and Human-Machine Interaction: An Approach to Cognitive Engineering. New York: North-Holland, 1986. , K. Duncan, J. Leplat (Eds.). New Technology and Human Error. Chichester, UK: John Wiley, 1987. Rasmussen, Jens & L. P. Goodstein. “Information Technology and Work,” in Martin Helander (Ed.), Handbook of Human-Computer Interaction. Amsterdam: Elsevier, 1988. Rasmussen, Jens & Kim J. Vicente. “Coping with Human Errors Through System Design: Implications for Ecological Interface Design.” International Journal of Man-Machine Studies (1989) 31, 517-534. Reason, James. Human Error. Cambridge, UK: Cambridge University Press, 1990. Robins, J.E., L. Buffardi, T.G. Ryan. Research on Tactical Military Decision Making: Application of a Decision Prediction Concept in a SIMTOS Environment. (Technical Paper 246). Alexandria, VA: US Army Research Institute for the Behavioral and Social Sciences, 1974. Rouse, William B. Design for Success: A Human Centered Approach to Designing Successful Products and Systems. New York: John Wiley, 1991. Sage, Andrew P. 
“Behavioral and Organizational Considerations in the Design of Information Systems and Processes for Planning and Decision Support,” IEEE Transactions on Systems, Man, and Cybernetics, Sept 1981, 11(8), 640678. Stammers, Rob B., Michael S. Carey and Jane A. Astley. “Task Analysis,” in John R. Wilson and E. Nigel Corlett, (Eds.) Evaluation of Human Work: A Practical Ergonomics Methodology. New York: Taylor & Francis, 1990.
Sweet, Rikki. “Evaluation Structure -- an Architecture,” in Rikki Sweet, Morton Metersky, and Michael Sovereign, (Eds). Command and Control Evaluation Workshop. Proc. of the Military Operations Research Society (MORS) Workshop on C2 Measures of Effectiveness. Monterey, CA: Naval Postgraduate School, June 1986. Tainsh, M. A. “Job Process Charts and Man-Computer Interaction Within Naval Command Systems.” Ergonomics, 28, 1985, 555-565. , Morton Metersky, and Michael Sovereign, (Eds). Command and Control Evaluation Workshop. Proc. of the Military Operations Research Society (MORS) Workshop on C2 Measures of Effectiveness. Monterey, CA: Naval Postgraduate School, January 1985. Tversky, A. & Kahneman, D. “Judgment Under Uncertainty: Heuristics and Biases,” Science, 1974, 185, 1124-1131. Vidulich, M. and P. Tsang. “Assessing Subjective Workload Assessment: A Comparison of SWAT and the NASABipolar Methods” in Proceedings of the Human Factors Society 29th Annual Meeting. Santa Monica, CA: Human Factors Society, 1985. Wilson, John R. and E. Nigel Corlett, (Eds.) Evaluation of Human Work: A Practical Ergonomics Methodology. New York: Taylor & Francis, 1990. Wohl, Joseph G. “Force Management Decision Requirements for Air Force Tactical Command and Control,” IEEE Transactions on Systems, Man, and Cybernetics, SMC-11, No. 9, Sept 1981, 618-639. van Gigch, John P. System Design Modeling and Metamodeling. New York: Plenum Press, 1991