Initial Validation of Tool for Interface Design Evaluation with Sensors (TIDES)

Patrice D. Tremoulet, Patrick Craven, Saki Wilcox, Susan Harkness Regli, Joyce Barton, Kathleen Stibler, and Adam Gifford
Lockheed Martin Advanced Technology Laboratories, 3 Executive Campus, 6th Floor, Cherry Hill, NJ 08002, USA
[email protected]

William Fitzpatrick
Johns Hopkins University Applied Physics Laboratory, 11100 Johns Hopkins Road, Laurel, MD 20723
[email protected]

Dervon Chang and Laura Milham
Design Interactive, Inc., 1221 E. Broadway, Suite 110, Orlando, FL 32765
[email protected]

Dr. Marianne Poythress, 2001 South Mopac Expressway, Suite 824, Austin, TX 78746, [email protected]

Chris Berka and Daniel J. Levendowski, Advanced Brain Monitoring, Inc., 2850 Pio Pico Drive, Suite A, Carlsbad, CA 92008, [email protected]

Abstract

Under an initiative sponsored by the Office of Naval Research, the Lockheed Martin Disruptive Technologies Opportunities Fund (LM DTOF) team has leveraged and built upon augmented cognition technology previously developed by Lockheed Martin to design and develop a software tool that uses physiological data to improve human-computer interfaces (HCIs). The tool, called the Tool for Interface Design Evaluation with Sensors (TIDES), displays objective measures of cognitive workload, visual engagement, distraction, and drowsiness while participants interact with an HCI prototype or system. These measures can help human factors engineers evaluate the efficacy of design alternatives. An initial concept validation experiment (CVE) was conducted in which TIDES was integrated with a prototype of a future release of the Tactical Tomahawk Weapons Control System (TTWCS). In many respects, the CVE was similar to a traditional usability study: twelve sailors interacted with the TTWCS user interface prototype to perform the tasks required to execute missile strike scenarios. However, the scenarios used during the CVE were designed not only to be operationally valid but also to include discrete phases that require specific levels of workload. Moreover, while interacting with the prototype, participants wore a set of neurological and physiological sensors, including a wireless continuous electroencephalogram (EEG) and electrocardiogram (EKG) sensor, wired EKG and galvanic skin response (GSR) sensors (the wired EKG served as a backup for the newer wireless configuration), and an off-head, binocular eye tracker that logs point of gaze and pupil diameters. These sensors generate EEG, heart rate variability (HRV), GSR, and pupillometry data that TIDES uses to compute cognitive workload, visual engagement, distraction, and drowsiness values. The CVE addressed two objectives: 1) to demonstrate the feasibility of using TIDES during human-computer interface evaluations, and 2) to provide the data necessary to validate cognitive workload values derived from the TIDES system by comparing them to other measures of cognitive workload, including NASA TLX, expert ratings, and expected workload values generated with Design Interactive's Multimodal Information Decision Support (MIDS) tool. Our results indicate that the TIDES cognitive workload, visual engagement, distraction, and drowsiness measures were obtained at high frequency without any action required of the usability study participant. Analyses of the CVE data revealed significant correlations among the average cognitive workload gauge values produced by TIDES, expert ratings, and MIDS expected workload values. These results suggest that TIDES is a valuable addition to human factors engineers' toolbox.

Distribution Statement A: Approved for Public Release; Distribution is Unlimited

1 INTRODUCTION

In 2005 and 2006, the Office of Naval Research (ONR) Disruptive Technologies Opportunity Fund supported research exploring the use of neuro-physiological data to measure cognitive workload during an HCI evaluation. The hypothesis was that a sensor-based cognitive workload value would correlate highly with other verified and validated measures of cognitive workload, such as the NASA Task Load Index (TLX). A sensor-based measure would offer a number of advantages: 1) increased precision in measuring the subject's cognitive state via moment-by-moment data collection, 2) objective measurement of cognitive workload while the task is being performed, and 3) no added distraction from a secondary task. The results of the initial research were reported in Poythress et al. (2006). Challenges encountered during that HCI evaluation led to the conception of the Tool for Interface Design Evaluation with Sensors (TIDES) and to the following goals for a follow-on study:

• Create a proof-of-concept prototype of TIDES, a Human Computer Interface (HCI) evaluation tool that provides an objective, domain-independent assessment of users' cognitive workload.
• Provide critical support for interpreting neuro-physiological data by generating estimates of users' cognitive workload, distraction, drowsiness, and visual engagement. These measures can then be used to help identify events of interest, thus reducing data load and providing timely evaluation results to suggest design changes or to validate a design.
• Conduct a Concept Validation Experiment (CVE), patterned after an HCI usability study, in which active-duty personnel run operationally valid scenarios designed such that different phases elicit different levels of cognitive workload.
• Validate the TIDES cognitive workload measures by comparing them to NASA TLX, expert ratings, and Multimodal Information Decision Support (MIDS) expected values.

This paper presents findings from this follow-on investigation.

1.1 Cognitive State Sensors

TIDES leverages previous research and development work performed to provide objective, real-time, domain-independent, task-independent workload measures derived from neuro-physiological sensors (Tremoulet and Regli, 2006). Lockheed Martin Advanced Technology Laboratories worked with Advanced Brain Monitoring (ABM) to develop an Electroencephalogram (EEG)-based gauge of cognitive workload (Berka et al., 2006), and previous work by ABM included the development of EEG indices of task engagement, distraction, and drowsiness (Berka et al., 2004, 2005). Gauge development was done using ABM's EEG acquisition system, which measures the electrical activity of the individual's brain with sensors placed on the scalp, allowing data to be captured unobtrusively. The EEG signals reflect the summated potentials of neurons in the brain firing on a millisecond timescale. The EEG workload index, which ranges from 0 to 1.0, increases with increasing working memory load and during problem solving, mental arithmetic, integration of information, and analytical reasoning, and may reflect a subset of executive functions. EEG-workload levels are significantly correlated with objective performance and subjective workload ratings in tasks with varying levels of difficulty, including forward and backward digit span, mental arithmetic, and n-back working memory tests (Berka et al., 2007).
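The published gauge construction is described elsewhere (Berka et al., 2007) and is not reproduced in this paper. Purely as an illustrative sketch, second-by-second EEG band-power features might be mapped to a 0 to 1.0 workload value with a logistic model; the band edges, feature set, and weights below are hypothetical placeholders, not the actual TIDES algorithm.

```python
import numpy as np

def band_power_features(eeg_epoch, fs=256):
    """Average power in conventional EEG bands for one 1-second epoch.

    eeg_epoch: 1-D array of samples from a single channel.
    Band edges are textbook values, not those used by TIDES.
    """
    freqs = np.fft.rfftfreq(len(eeg_epoch), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(eeg_epoch)) ** 2
    bands = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}
    return np.array([psd[(freqs >= lo) & (freqs < hi)].mean()
                     for lo, hi in bands.values()])

def workload_index(features, weights, bias):
    """Map log band powers to a 0-1.0 workload value with a logistic model.

    In a real gauge the weights would be fit against benchmark tasks
    (e.g., n-back); any values passed here are placeholders.
    """
    z = np.dot(weights, np.log(features)) + bias
    return 1.0 / (1.0 + np.exp(-z))
```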

1.2 Tool for Interface Design Evaluation with Sensors (TIDES)

The Tool for Interface Design Evaluation with Sensors (TIDES) is an HCI tool that provides critical support for interpreting physiological data collected during usability evaluations. TIDES extends the framework developed under the Lockheed Martin Disruptive Technology Opportunity Fund project by providing experimenters an instrument that supports the logging and analysis of system events, physiological responses, and user actions. All of these experimenter tasks are accomplished while study participants continue to interact with the system of interest, in this case a prototype of a future TTWCS system. An experimenter is typically interested in linking operator performance data with operator physiological data on a common timescale so that events of interest can be analyzed. Advances in neuro-technology and physiological measurement have made it possible to capture information that helps identify psychological states of interest (e.g., boredom, overload, and engagement), which aids the experimenter and HCI designers in evaluating the HCI.
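The internal data representation used by TIDES is not described in this paper. As a minimal sketch of the log-linking idea above, assuming both logs are CSV files sharing a timestamp field on a common clock (all file and field names here are hypothetical), the nearest gauge sample can be attached to each system event:

```python
import csv
from bisect import bisect_left

def load_log(path, time_field="timestamp"):
    """Read a CSV log and return its rows sorted by timestamp (seconds)."""
    with open(path, newline="") as f:
        rows = [dict(r, **{time_field: float(r[time_field])})
                for r in csv.DictReader(f)]
    return sorted(rows, key=lambda r: r[time_field])

def gauge_at(gauge_rows, t, time_field="timestamp"):
    """Return the gauge sample closest in time to t."""
    times = [r[time_field] for r in gauge_rows]
    i = bisect_left(times, t)
    candidates = gauge_rows[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda r: abs(r[time_field] - t))

# Attach the nearest workload sample to every system event:
# events = load_log("ttwcs_events.csv"); gauges = load_log("tides_gauges.csv")
# merged = [{**e, "workload": gauge_at(gauges, e["timestamp"])["workload"]}
#           for e in events]
```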

The TIDES system architecture displayed in Figure 1 supports system-of-systems integration with multiple sensor suites and domain task systems. System integration includes the SmartEye, Advanced Brain Monitoring, and Thought Technology sensors with the Cognitive State Assessor, and the test platform (TTWCS prototype) with TIDES. The TIDES interface adapter for the TTWCS prototype allows for distributed system integration via CORBA.

[Figure 1 components: Sensor Data (e.g., EEG, EKG, GSR) feeding the Cognitive State Assessor; Sensor Log and TTWCS Log; application system events; Data Preparation & Reporting producing Summary Reports; Configuration, Monitoring, and User Interface modules within TIDES.]

Figure 1: TIDES architecture

TIDES provides four critical types of information to an experimenter:
• Real-Time Logging: The experimenter can enter events into the log to represent significant events that are not automatically logged by the TTWCS system. Prior to testing, the experimenter can set up events of interest with a quick-key identifier using the Session Configuration shown in Figure 2, so that expected events can be logged during testing.

Figure 2: Real-time logging configuration

• Real-Time Monitoring: During testing, the experimenter can log events and monitor physiological sensor data as shown in Figure 3.

Figure 3: Real-time monitoring



• Time Synchronization: Time synchronization between the physiological sensor logs and the test system logs is crucial for accurately matching sensor-derived events to test platform and participant-driven events. After testing, the experimenter can view data via the Timeline summary (Figure 4) to see a combined sensor view of the task.

Figure 4: Timeline summary

• Data Extraction: Data can also be extracted and presented in CSV format, as shown in Figure 5 (a sketch of such an export follows Figure 5).

Figure 5: Epoch summary
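The actual schema of the TIDES epoch summary is not shown in this paper; the following sketch illustrates one plausible per-phase export, with hypothetical column names and data structures:

```python
import csv
from statistics import mean

def write_epoch_summary(samples, phases, out_path):
    """Average second-by-second gauge samples within each task phase.

    samples: list of (timestamp, workload) tuples on a common clock.
    phases:  list of (name, start, end) tuples on the same clock.
    Column names are illustrative, not the actual TIDES schema.
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["phase", "n_epochs", "mean_workload"])
        for name, start, end in phases:
            vals = [w for t, w in samples if start <= t < end]
            writer.writerow([name, len(vals),
                             round(mean(vals), 3) if vals else ""])
```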

2 METHOD

The LM DTOF team conducted a concept validation experiment (CVE) in Norfolk, VA from August 22-31, 2006. The CVE's purpose was to demonstrate that TIDES could produce objective measures of cognitive workload during usability evaluations without requiring participants to switch focus from their primary tasks, and to validate the TIDES cognitive workload measure. A secondary objective of the CVE was to identify potential design problems in the TTWCS v6 HCI prototype based on observed errors, participant feedback, and/or TIDES gauge values.

2.1 Participants

Twelve participants were drawn from personnel stationed at Norfolk, VA. The average age was 29, and years of service ranged from 1 to 18. Six participants had experience with one of the previous Tomahawk systems. All but one had some experience at Tomahawk workstations (averaging over six years) and had participated in an average of 52 Sea-Launched Attack Missile Exercises (SLAMEXs) and 28 theatre exercises. Two participants had experience participating in an operational launch. All participants were male; one was left-handed, and the rest were right-handed. Two participants reported vision corrected to 20/20.

2.2 Equipment and Materials

Testing was conducted at offices at the Naval Station in Norfolk, VA. The TTWCS prototype was simulated on a Dell Inspiron 8500 laptop connected to two 19-inch flat-panel LCD monitors displaying the prototype software. The monitors were positioned vertically to re-create the configuration of the actual Tomahawk workstation. The prototype system recorded timestamps for both system and user events to a log file. One video camera was located behind the participant, and video was captured on digital video tape. The output from the console monitors was sent through splitters to two 19-inch displays to allow observers to view the screen contents from a less intrusive location.

The EEG data were acquired from the wireless sensor headset developed by ABM from five channels using the following bi-polar montage: C3-C4, Cz-PO, F3-Cz, Fz-C3, Fz-PO. Bi-polar differential recordings were selected to reduce the potential for movement artifacts, which can be problematic for applications that require ambulatory conditions in operational environments. Additionally, EKG information was collected on a sixth channel. Limiting the sensors (six) and channels (five) ensured that the sensor headset could be applied within 10 minutes. The EEG data were transmitted via a radio frequency (RF) transmitter on the headset to an RF receiver on a laptop computer containing EEG collection and processing software developed by ABM. This computer was linked via a Cat 5 Ethernet cable to an Ethernet router, which carried the EEG data to TIDES.

The galvanic skin response (GSR) sensors and a second set of EKG sensors were connected to the ProComp Infiniti conversion box, which transmitted the signal via optical cable to two destinations: a laptop containing CardioPro software, which allowed the signal to be monitored, and the TIDES laptop, where software logged the GSR data and used the EKG signal to calculate heart rate variability (HRV). Eye gaze and pupil dilation data were collected using the SmartEye system, in which two cameras were positioned on either side of the monitor. An IR emitter near each camera allowed the SmartEye software to calculate the associated eye movement and dilation data.

Paper questionnaires included a background questionnaire and a user satisfaction questionnaire. An electronic version of the NASA TLX was administered on the TTWCS task machine between task phases. The test administrator recorded notes concerning the testing session and completed ratings of the effort exhibited by the participant during the test session. Subsequent to the test session, aspects of the scenarios and individual user behavior were coded and entered into the MIDS tool, which produces a second-by-second total workload value for the entire scenario.
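The paper does not state which HRV statistic the TIDES software computed from the EKG signal; RMSSD over the R-R intervals is one common time-domain choice, sketched below purely for illustration.

```python
import numpy as np

def rmssd(r_peak_times):
    """Root mean square of successive differences of R-R intervals (ms).

    r_peak_times: array of R-peak timestamps in seconds, as would be
    detected from the EKG channel. RMSSD is one common HRV statistic;
    the metric actually used by TIDES is not specified in the paper.
    """
    rr = np.diff(np.asarray(r_peak_times)) * 1000.0  # R-R intervals in ms
    return float(np.sqrt(np.mean(np.diff(rr) ** 2)))
```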

2.3 Experimental Design

The TIDES workload validation was a correlational design in which a novel measure (sensor-based cognitive workload) was compared with measures that typify accepted practice for measuring workload within the HCI and human factors engineering research communities. The workload validation was run in the context of a diagnostic usability evaluation, with only descriptive statistics reported for comparison to prior evaluations of the prototype. Two traditional usability measures were also recorded: 1) user performance, measured by the number of errors made by each study participant, and 2) user satisfaction, measured from participants' responses to a user satisfaction questionnaire issued at the end of the test. Those results are not presented here.

2.4 Procedure

The study protocol was approved in advance by the Naval Air Warfare Center - Aircraft Division (NAWC AD) Institutional Review Board. Each participant provided written informed consent before participating; the consent form explained the purpose, goals, and methods to be used during the test. Each participant's test session lasted approximately eight hours.

Participants were first briefed on the purpose and procedures of the study. To ensure anonymity, each participant was assigned a participant number, and the participant's name was kept separate from the quantitative data, the user satisfaction questionnaire ratings, and any design recommendations that they made. Each participant also completed a background questionnaire covering demographics, rank, billet, and Tomahawk experience.

After the questionnaire was completed, a cap was placed on the participant's head and six EEG sensors were positioned in it (at F3, Fz, C3, Cz, C4, and Pz). At the same time, two EKG sensors were placed on the right shoulder and just below the left rib. Participants were instructed to tell the testing staff immediately if they became uncomfortable at any time before or during the experimental session. An impedance check was done to ensure that signal impedance was at 50 ohms or below. The RF transmitter and receiver were then turned on. Once a wireless connection was established, a participant number was assigned and entered in the TIDES software interface. The participant was then asked to perform three tasks over 30 minutes to establish an individual baseline and calibrate the sensors; the tasks required reading instructions and performing basic visual and auditory monitoring tasks. Next, individual profiles were created for the SmartEye eye tracking system by taking pictures while the participant looked at five predetermined points on the displays. Additionally, during hands-on training, specified facial features were marked on the facial images to further refine the participant's profile.

Once the participant was ready to engage in the training and experiment scenarios, the EKG and GSR sensors from Thought Technology were attached. Three EKG sensors were attached: one on the soft spot below each shoulder and one on the center of the abdomen. The two GSR sensors were placed on the second and fourth toes such that each sensor sat on the pad of the toe; a towel was wrapped around the participant's foot to keep it from getting cold. The sensor hardware was turned on, and CardioPro monitoring software was used to verify that heart rate and GSR data were being generated. The TIDES software was then started, and data was sent from each sensor suite to TIDES. The software from ABM, SmartEye, and CardioPro was used throughout the experiment to monitor the signals from all sensors (Table 1). Network Time Protocol (NTP) was used to verify that all machines were synchronized to within a fraction of a millisecond.

Table 1. Sensor devices used in the CVE

Vendor             | Description                                                             | Hardware
ABM                | Collects EEG data from five channels and EKG on a sixth channel         | 6 EEG electrodes, 2 EKG electrodes, ABM sensor cap, RF transmitter and receiver
SmartEye           | Collects point-of-gaze and pupillometry data                            | 2 cameras, 2 IR flashes
Thought Technology | Collects EKG and GSR data; HRV calculated through third-party software  | ProComp Infiniti converter, 2 GSR electrodes, 3 EKG electrodes
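The paper states only that NTP verified sub-millisecond synchronization across the machines. As an illustrative sketch, the third-party ntplib package can query a time server for a machine's clock offset (the server hostname below is a placeholder; each test machine would query the same local server and log the result):

```python
import ntplib  # third-party: pip install ntplib

def clock_offset_ms(server="pool.ntp.org"):
    """Return this machine's clock offset from an NTP server, in milliseconds."""
    response = ntplib.NTPClient().request(server, version=3)
    return response.offset * 1000.0

if __name__ == "__main__":
    print(f"offset: {clock_offset_ms():.3f} ms")
```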

Once all the sensor equipment had been calibrated and all system software was up and running, participants were trained, through both lecture and hands-on practice, on using the prototype and on the tasks that they would be asked to perform. The lecture portion of training consisted of PowerPoint slides giving an overview of the system, as well as detailed pictures and descriptions of the windows and interactions that would be part of the participant's experimental task. During the hands-on portion of the training, a member of the testing staff guided the participant through the task. The participant performed one practice trial using the system, during which various scenario tasks were accomplished. The testing staff answered any questions that the participant had during training. The total time for training was one and a half hours, with one hour allocated to the lecture portion and thirty minutes to the hands-on practice portion of the training scenario. Participants were given a thirty-minute break before the test session started. The neurophysiological sensors remained on the participant during the break so that they did not have to be re-applied.

During the experimental test session, the participant was presented with a scenario that included various tasks associated with preparing and launching Tomahawk missiles, similar to those presented during training. The participant was asked to respond to system events and prompts. An experimenter observed and recorded errors and additional observations using the TIDES tool and supplemental written notes. TIDES logged the sensor data, and the TTWCS prototype logged user and system events throughout the test session. The sessions were also video and audio recorded; the video showed only the back of the participant's head and the two computer screens.

In the test scenario, participants were asked to execute two 10-missile salvos, performing all necessary operations from receipt of the first strike package until all of its missiles were launched. The scenario took approximately one hour and 15 minutes to complete. It was divided into four task phases that occurred in a specific sequence for a strike; these phases included the major tasks of validation, planning, monitoring missiles spinning, missile launch, emergent targets, and post-launch missile monitoring. The TTWCS system is fundamentally one of multi-tasking within these phases. For example, a participant could switch between activities such as final review, launching a missile, redirecting a missile in flight, and monitoring missiles in the air during the same period of time, or could work any of the missions (MSNs) in any order desired during engagement planning. This results in clusters of activities during a phase.

• Phase 1: Initial validation with no errors and engagement planning with errors
• Phase 2: Validation due to receipt of an execute strike package, preparation for execution, and emergent targets
• Phase 3: Monitor missiles
• Phase 4: Launch missiles with emergent targets; receive and prepare the second strike package

The main criterion for successful person-machine performance in this cognitive task environment was the degree to which one successfully launched missiles at their designated launch time (T=0); in other words, how many missiles are launched on time? (A minimal sketch of this metric appears at the end of this section.) Following the completion of each of the four phases of the scenario, the prototype was paused and the participant was asked to fill out a questionnaire on perceived workload (NASA TLX); then the scenario was resumed. After completion of all phases of the test scenario, the sensors were removed and the participant was asked to fill out a questionnaire on perceived satisfaction with the system. Upon the conclusion of each experimental run, the participant was debriefed to discuss any questions that they may have had.
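The paper quantifies success as missiles launched on time but does not give a formula; a trivial sketch, with a hypothetical timing tolerance:

```python
def on_time_rate(launches, tolerance_s=5.0):
    """Fraction of missiles launched within a tolerance of their T=0.

    launches: list of (scheduled_time, actual_time) pairs in seconds.
    The tolerance value is illustrative; the paper does not state one.
    """
    if not launches:
        return 0.0
    hits = sum(abs(actual - sched) <= tolerance_s
               for sched, actual in launches)
    return hits / len(launches)
```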

2.5 Data Preparation

Four measures of cognitive workload were collected during TTWCS scenarios with four different phases:

• Sensor cognitive workload (sensor-based cognitive workload value): Scores from the neurophysiologically based gauge (CW-EEG) were generated by taking logs of second-by-second gauge output and averaging the workload values during the four task phases. Data points associated with noisy sensor readings were removed before analysis; refer to Berka et al. (2004) for a description of this procedure. The construction of this gauge is detailed in Craven et al. (2006) and Berka et al. (2007).

• NASA TLX (generated by participant survey): Total workload scores from the NASA TLX were used in the analyses. The total workload score comprises weighted scores of the six subscales (mental demand, physical demand, temporal demand, performance, effort, and frustration). One participant's data was removed because his scores were constant (he did not rate any of the four task phases in either scenario as being different).

• MIDS (generated by expert observation and cognitive conflict algorithm): Scores for the MIDS measure were generated for approximately half the participants. Close examination of the task domain and videotape of the study allowed workload estimates to be generated for individual sensory channels (i.e., visual, auditory, and haptic), cognitive channels (i.e., verbal, spatial), and response channels (i.e., motor, speech). Workload is calculated over a task timeline by summing (1) the purely additive workload level across attentional channels, (2) a penalty due to demand conflicts within channels, and (3) a penalty due to demand conflicts between channels; a sketch of this summation appears after this list. The amount of attention the operator must pay to each channel in the performance of each task was rated on a 5-point subjective scale (1: very low attentional demand; 5: very high attentional demand). These estimates are combined to create an overall cognitive workload estimate for each second that the participant was completing the scenarios; refer to Hale et al. (2005) for details of this technique. Values from the six participants were averaged, and correlations were computed with average values of the other measures.

• Expert Rating (generated by expert observation): The expert ratings were generated by three HCI experts using a seven-point Likert scale. Experts rated the participants on six dimensions of interaction with the task. The mental effort rating was then extracted and scaled by rater to help eliminate individual rater tendencies.
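The full MIDS technique is described in Hale et al. (2005); the sketch below mirrors only the three-term summation described above, with hypothetical channel demands and penalty weights rather than the published MIDS values.

```python
def mids_workload(demands, within_conflicts, between_conflicts,
                  w_within=0.5, w_between=0.5):
    """Total workload for one second of the task timeline.

    demands: dict of channel -> attentional demand on a 1-5 scale,
             e.g. {"visual": 4, "auditory": 2, "motor": 3}.
    within_conflicts / between_conflicts: counts of demand conflicts
    within and between channels for this second.
    Penalty weights are placeholders, not the published MIDS values.
    """
    additive = sum(demands.values())                  # term (1)
    penalty_within = w_within * within_conflicts      # term (2)
    penalty_between = w_between * between_conflicts   # term (3)
    return additive + penalty_within + penalty_between
```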

3 RESULTS

As described above, four measures were collected during each of the four phases of the test scenario. The measures estimated workload via an external observer (experimenter) who monitored the actions of the participant, via the participant himself, or via physiological-sensor-based gauges. The results are presented below, first in terms of descriptive statistics and then in terms of correlations among the measures.

3.1 Descriptive Statistics

Table 2 shows the mean values for the four measures.

Table 2. Overall descriptive statistics

Metric                 | Phase 1 | Phase 2 | Phase 3 | Phase 4
CW-EEG (0 to 1.0)      |   .703  |   .702  |   .659  |   .708
NASA TLX (0 to 100)    |  22.56  |  28.62  |  24.64  |  41.56
Expert Rating (1 to 7) |   2.75  |   3.46  |   2.04  |   4.78
MIDS (1-5)             |  11.31  |  11.11  |   3.54  |  13.31

For all four measures, Phase 4 had the highest mean value. During this phase, participants were required to launch missiles, for which launching on time was the main criterion of success. They were also given emergent targets requiring quick responses and, in addition to these time-critical tasks, were preparing a second strike package. For CW-EEG, Expert Rating, and MIDS, Phase 3 had the lowest mean value. During this phase, the participants were primarily performing a vigilance task as the missiles were being prepared for launch, which required no specific interaction.

3.2 Validation of EEG-based Cognitive Workload

Validation of the cognitive workload EEG measure (CW-EEG) was performed by computing correlations among the scores (NASA TLX, expert rating) or average scores (CW-EEG, MIDS) across subjects for the four measures collected during the CVE. Table 3 lists the Pearson product-moment correlation coefficients (r) for the four measures of cognitive workload; a minimal sketch of this computation follows the table. Note that MIDS data was coded for only half the participants.

Table 3: Workload measures correlation table

Metric         | CW-EEG | NASA TLX | Expert Ratings | MIDS
CW-EEG         |   --   |  .19^    |  .38**         | .51**
NASA TLX       |  .19^  |   --     |  .34**         | .40*
Expert Ratings |  .38** |  .34**   |   --           | .56**
MIDS           |  .51** |  .40*    |  .56**         |  --

**p
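A minimal sketch of the correlation computation behind Table 3, using SciPy's Pearson product-moment correlation; the variable names in the usage comment are hypothetical, and no study data is reproduced here.

```python
from scipy.stats import pearsonr

def correlate(measure_a, measure_b):
    """Pearson product-moment correlation between two workload measures.

    Each input is a sequence of per-phase (or per-participant-phase)
    values for one measure, aligned so that element i of both sequences
    describes the same cell.
    """
    r, p = pearsonr(measure_a, measure_b)
    return r, p

# e.g. r, p = correlate(cw_eeg_means, expert_rating_means)
```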
