PROCEEDINGS of the HUMAN FACTORS and ERGONOMICS SOCIETY 54th ANNUAL MEETING - 2010


ASSESSING MENTAL WORKLOAD FROM SKIN CONDUCTANCE AND PUPILLOMETRY USING WAVELETS AND GENETIC PROGRAMMING

Roger Lew¹, Brian P. Dyre¹, Terence Soule², Stuart A. Ragsdale¹, Steffen Werner¹

¹Department of Psychology and Communication Studies, ²Department of Computer Science
University of Idaho, Moscow, ID

An essential component of augmented cognition (AC) is developing robust methods of extracting reliable and meaningful information from physiological measures in real time. To evaluate the potential of skin conductance (SC) and pupil diameter (PD) measures, we used a dual-axis pursuit tracking task in which the control mappings repeatedly and abruptly rotated 90° throughout the trials to provide an immediate and obvious challenge to proper system control. Using these data, a model-building technique novel to these measures, genetic programming (GP) with scaled symbolic regression and Age Layered Populations (ALPS), was compared to traditional linear discriminant analysis (LDA) for predicting tracking error and control-mapping state. Compared with traditional linear modeling approaches, symbolic regression better predicted both tracking error and control-mapping state. Furthermore, the estimates obtained from symbolic regression were less noisy and more robust.

Copyright 2010 by Human Factors and Ergonomics Society, Inc. All rights reserved. 10.1518/107118110X12829369200918

INTRODUCTION

Developing reliable instantaneous measures of human workload based on physiological measures is essential for the application of augmented cognition in reducing human error due to fatigue and stress. The most successful techniques have utilized measures that directly monitor brain activity, such as EEG (electroencephalography) and fNIR (near infrared imaging; St. John, Kobus, Morrison, & Schmorrow, 2004). Other physiological measures such as pupil diameter (PD) and skin conductance (SC, also known as galvanic skin response, GSR) have also been examined. Changes in SC are generally believed to reflect autonomic responses to anxiety or stress (Jacobs et al., 1994), while changes in PD have been linked to differences in the difficulty of tasks including sentence processing, mental calculations, and user interface evaluation (Just & Carpenter, 1993; Nakayama & Katsukura, 2007). While these measures have shown signs of promise, they generally have smaller effect sizes and less consistency between individuals than EEG and fNIR (St. John et al., 2004). Despite their limitations, these measures remain attractive for application in AC due to their comparatively low cost, high degree of portability, and low obtrusiveness.

Here we examined whether a combination of spectral estimation and model building using genetic programming (GP), a non-linear supervised learning technique that uses evolutionary concepts of fitness, recombination, and mutation to estimate optimum models, can increase the efficacy of SC and PD in predicting mental workload.

Previous studies have applied spectral analysis to PD to predict workload with some success. Nakayama and Shimizu (2002, 2004) found the power spectral density of PD between 0.1-0.5 Hz and 1.6-3.5 Hz to increase with difficult mental arithmetic. Marshall's Index of Cognitive Activity, or ICA (2000, 2002, 2007), uses wavelet analysis to estimate the PD power spectrum for measuring real-time cognitive activity. This approach converts continuously scaled wavelet components into vectors of binary (dichotomous) variables. Workload classification is then based on either summing the processed wavelet components contained in the binary vectors, or applying linear discriminant analysis or neural network estimation to all or part of the vectors of binary wavelet components.

Our purpose was to extend these studies by examining whether genetic programming can improve upon existing procedures. Like Marshall's ICA technique, the present study used wavelets to estimate the power spectrum for both PD and SC while operators performed a dual-axis tracking task with abrupt changes in task difficulty created by rotating the control mappings 90 degrees. Offline, the detail and approximation coefficients from PD and SC were used to train genetic programs to predict tracking performance and classify the task-difficulty manipulation, since these indices represent our best independent estimate of the mental workload demanded by the tracking task.

There are three important differences between the present study and previous work. First, this study used both raw time series and wavelet analysis of multiple physiological measures (SC and PD) to predict performance and task workload; the combined measures are expected to be more predictive than PD signals alone because redundant measures may better differentiate signal from noise. Second, the wavelet components were estimated as continuous variables rather than processed into binary variables. Third, this study compared two modeling approaches based on the raw time series data and the wavelet-estimated power spectrum: a traditional linear regression/discriminant analysis approach and a genetic programming (GP) approach. We expect that, due to its inherent flexibility, the GP approach will be more successful in predicting mental workload than linear modeling. While GP is not guaranteed to find an optimum model, it can often define models that are competitive with expert (human) derived models (Koza, 2005).
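As a rough illustration of what "continuous wavelet components" means in practice, the sketch below decomposes one 2048-sample epoch (the epoch length reported later under Preprocessing) with a multi-level discrete wavelet transform and keeps every coefficient as a real number rather than thresholding it to 0/1. The paper does not name the wavelet family or how bands were "stretched", so the Haar wavelet, the nearest-neighbour `stretch` helper, and the synthetic input signal are all assumptions made for illustration. The band names follow the usual DWT convention (eight detail bands plus one approximation band), which may differ from the authors' wording.

```python
import math

def haar_dwt(signal, levels):
    """Multi-level Haar DWT (assumed wavelet; the paper does not specify one).

    Returns (details, approx), where details[k] holds the level k+1 detail band.
    """
    approx = list(signal)
    details = []
    root2 = math.sqrt(2.0)
    for _ in range(levels):
        if len(approx) % 2:
            raise ValueError("signal length must be divisible by 2**levels")
        evens, odds = approx[0::2], approx[1::2]
        # Differences capture fine structure; sums carry the coarser trend.
        details.append([(a - b) / root2 for a, b in zip(evens, odds)])
        approx = [(a + b) / root2 for a, b in zip(evens, odds)]
    return details, approx

def stretch(band, length):
    """Repeat coefficients so a band spans `length` samples (hypothetical helper)."""
    return [band[i * len(band) // length] for i in range(length)]

# Synthetic stand-in for one 60 s epoch resampled to 2048 points.
n = 2048
epoch = [math.sin(2 * math.pi * 0.3 * (i * 60 / n)) for i in range(n)]

details, approx = haar_dwt(epoch, levels=8)
band_lengths = [len(d) for d in details] + [len(approx)]

# Nine continuous-valued predictor vectors of equal length per signal.
features = [stretch(band, 1024) for band in details + [approx]]
```

With these assumptions the nine bands have the lengths listed in the Preprocessing section, and each stretched band can be used directly as a continuous predictor.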

Downloaded from pro.sagepub.com by guest on November 19, 2015


METHOD

Participants

Ten University of Idaho students participated in this experiment. All had normal or corrected-to-normal Snellen visual acuity (20/20 or better). Participant 3 had limited knowledge of the hypotheses of the experiment; the remaining participants were naïve to the hypotheses.

Stimuli and Apparatus

Participants performed a pursuit tracking task in which they used a joystick to control a cursor and chase a balanced dot moving in a pseudo-random fashion (see Figure 1). The balanced dot was superimposed on a gray background and maintained equal luminance by having a small dot with high luminance surrounded by an annulus of lower luminance, such that the brightness of the entire dot was equivalent to the background. This stimulus precluded pupil dilations due to luminance changes since the entire display was approximately equiluminant. The dot's horizontal and vertical location was determined by two sum-of-sine disturbances with frequencies and amplitudes set such that the full balanced dot always remained visible within the display and moved at trackable speeds. Both the horizontal and vertical disturbances had randomly determined initial phases. The joystick had first-order control dynamics and horizontal and vertical gains of 25° per second at maximum deflection. All participants used their right hands to operate the joystick.

The simulation was presented in a darkened room on a 60 inch rear-projection monitor at a spatial resolution of 1280 x 1024 and a temporal resolution of 60 Hz with a viewing angle of 45° by 33.75°. A model ASL5000 head-mounted eye/head tracker was used to measure gaze direction, pupil diameter, and blink rate at 60 Hz. Skin conductance was measured at a temporal resolution of 256 Hz with a Thought Technologies ProComp5 Infiniti encoder using two finger-mounted sensors placed around the index and ring fingers of the left hand. Participants were instructed to leave their left hand in a stationary position during the course of the trial to reduce noise in the SC measurement.

Figure 1. (Top) Participants used a joystick to control the black cursor (crosshairs) shown below. The “balanced dot” shown to the left of the cursor moved about in a pseudo-random fashion. (Bottom) Every 60 seconds for the 480 second experiment duration the control mappings flipped. Panel labels: “Normal” Control Mappings; “Rotated” Control Mappings.

Procedure

Task difficulty was manipulated by introducing abrupt changes in control mappings. For the first minute of the experiment the control mappings were “normal”, meaning that moving the joystick forward moved the cursor up, moving the joystick backward moved the cursor down, moving the joystick right moved the cursor right, and moving the joystick left moved the cursor left. After 60 s the joystick control mappings were rotated abruptly 90° clockwise, such that moving the joystick forward-backward moved the cursor rightward-leftward, and moving the joystick leftward-rightward moved the cursor upward-downward (see Figure 1). Over the eight-minute duration of the experiment the control mappings alternated between the normal orientation and the 90° rotation every 60 seconds. The “rotated” mappings were designed to lower tracking performance (that is, to increase tracking error, defined as the Euclidean distance between the center of the balanced dot and the cursor) and to cause transient physiological indicators reflective of increased workload.

RESULTS

Tracking Error Models

Preprocessing. Offline, the 480 second trial was divided into eight epochs of 60 seconds each. Epochs 1, 3, 5, and 7 had the normal control mappings, and the even epochs (2, 4, 6, and 8) had the rotated mappings. Data analysis proceeded by smoothing the PD data using a cubic (third-order) spline, centering the data by subtracting the mean diameter, and then log transforming the result. Negative values were treated by log-transforming the absolute value and then multiplying by -1. Next, to prepare the data for wavelet analysis, SC and PD were linearly interpolated to a sampling rate of about 34.13 Hz (2048/60) so that each 60 second epoch contained 2048 samples. A discrete wavelet transform (DWT) was applied to these 60 second epochs, resulting in a vector of 1024 detail coefficients and eight vectors of approximation coefficients (of lengths 512, 256, 128, 64, 32, 16, 8, 8). Each of the approximation coefficient vectors and the smoothed tracking error were then stretched to a length of 1024 so they would be in a convenient representation for model fitting.

Models were fit to each participant's data independently. To assess the models, only epochs 1 through 6 were used for training/fitting. The predictions of each model were tested on the data from epochs 7 and 8. It is important to point out that although these predictions were calculated offline subsequent to data collection, in principle, once a model is developed and trained, its predictions could be calculated in real time by a modestly fast computer.

Genetic Programming (GP). The raw, untransformed pupil diameter and skin conductance signals were used with their respective detail and approximation wavelet coefficients to predict instantaneous tracking error with a genetic program (GP) utilizing Scaled Symbolic Regression (Keizer, 2004) with Age Layered Populations (ALPS; Hornby, 2009). For each participant GP was run eight times. GP is an iterative machine learning technique used to optimize a population of computer programs according to a fitness landscape determined by a program's ability to perform a given computational task. At each iteration, the “genetics” or genotypes of the above-average individual solutions are combined to try to form better solutions. To maintain diversity, individual solutions are sometimes mutated in an attempt to find better solutions. In this scenario the fitness landscape is defined by the ability of the programs to use the skin conductance and pupil diameter data to predict tracking error. The GP utilized Age Layered Populations to increase the robustness of the search and as a preventative measure against premature convergence (Hornby, 2009). Scaled symbolic error, as described by Keizer (2004), was used in place of root-mean-squared error to increase model performance. Scaled symbolic error allows the GP to fit the nonlinear changes of the fitness landscape while leaving the linear fitting to a regression analysis. With Keizer's method fitness is bounded between 0 and 1, where 0 indicates that the model accounts for 100% of the variability in the data. For brevity, the specifics of GP, ALPS, and scaled symbolic error cannot be included here. A more detailed description of the model and its implementation is available upon request. Some of the relevant details are summarized in Table 1.

Table 1. Overview of GP Model Parameters

Initial Individuals: “Full Method” with depth 4, 5, or 6
Number of Layers: 10
Individuals / Layer: 100
Total Iterations: 270,000
Age Calculation: 1 + (i_completed - i_created) / popsize
Max Ages: Fibonacci: [5, 8, 13, 21, 34, 55, 89, 144, 233, ∞]
Fitness: Scaled Symbolic Regression
Selection: Tournament selection with a size of 7
Non-terminals: addition, subtraction, multiplication, division, power, absolute value, square root, exponential, sin, cos, tan, and if-less-than
Terminals: scalars pi, -100, -10, -1, 0, 1, 10, and 100; as well as vectors holding raw skin conductance, raw pupil diameter, the 9 skin conductance wavelet coefficients, and the 9 pupil diameter wavelet coefficients
Crossover: 90/10 rule with “Standard” swapping
Mutation: Point mutation, 10% for scalar constants, 1% for terminals
Parsimony Pressure: Fixed at parsimony_penalty * SIZE(ind), where parsimony_penalty = .0005

To reduce overfitting, at each iteration 1024 random time samples (16.66% of the available training data) were selected to calculate fitness. The vector terminals are all of length 1024 and contain data reflecting the randomly sampled time points. The resulting output of each program is a vector of length 1024 predicting a participant's tracking error at the randomly selected points in time.

Linear Discriminant Analysis (LDA). To compare the GP models to LDA, random subsets of 1024 time points were generated and tracking error at those points was regressed on the 20 variables available to the GP. For each participant LDA was run 100 times.
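The GP fitness evaluation described above (scaled symbolic error on 1024 randomly drawn time samples, with the parsimony pressure from Table 1) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the least-squares slope and intercept are the standard linear-scaling step; normalizing by the target's variance (so that the score is 1 - r², bounded between 0 and 1) is inferred from the text; adding the parsimony term to that score is an assumption; and the names `scaled_error`, `fitness`, `program`, and `rows` are hypothetical.

```python
import random
from statistics import fmean

PARSIMONY_PENALTY = 0.0005   # value listed in Table 1
SUBSET_SIZE = 1024           # time samples drawn per fitness evaluation

def scaled_error(pred, target):
    """Return 1 - r^2 after the best linear rescaling a + b * pred.

    0 means the rescaled output explains all target variance; 1 means none.
    """
    mean_p, mean_t = fmean(pred), fmean(target)
    ss_p = sum((p - mean_p) ** 2 for p in pred)
    ss_t = sum((t - mean_t) ** 2 for t in target)
    if ss_p == 0.0 or ss_t == 0.0:
        return 1.0  # a constant output (or constant target) explains nothing
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    slope = cov / ss_p                      # linear fit is done by regression,
    intercept = mean_t - slope * mean_p     # so the GP only has to find shape
    sse = sum((t - (intercept + slope * p)) ** 2 for p, t in zip(pred, target))
    return min(1.0, sse / ss_t)

def fitness(program, size, rows, target, rng):
    """Score a candidate on a random subset of time points (lower is better).

    `program` maps one row of inputs to a prediction; `size` is its node count.
    Adding the parsimony term to the scaled error is an assumption.
    """
    idx = rng.sample(range(len(target)), min(SUBSET_SIZE, len(target)))
    pred = [program(rows[i]) for i in idx]
    return scaled_error(pred, [target[i] for i in idx]) + PARSIMONY_PENALTY * size

# Toy usage: a 13-node candidate whose output is linearly related to the target.
rows = list(range(2000))
target = [3.0 * x for x in rows]
score = fitness(lambda x: 2.0 * x + 1.0, 13, rows, target, random.Random(0))
```

Because the slope and intercept are fitted analytically, a candidate is rewarded for having the right shape even when its output has the wrong scale or offset, which is the division of labour between GP and regression described in the text.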
One hundred runs were enough to reveal that the r-squared distributions obtained by LDA are normal, with far less variability between runs than GP. This implies that multiple regression is fairly robust to the subset of time points, but it also means that its top performance is somewhat restricted. GP, on the other hand, is more of a “shotgun” approach. Due to the amount of randomness in GP there are no guarantees of converging on a “good” solution, and in practice many mediocre and poor solutions arise. Despite this fault, GP can sometimes produce solutions that are several times better than average (see Figure 2). For these reasons the best results generated by GP and LDA were compared to one another. On the test data, a paired-samples t-test was used to compare the best r-squared for each participant found by GP to the best r-squared found by multiple regression. Here we are interested in top performance rather than average performance. In a real-world setting the training would most likely take place offline, so time is not a critical factor and several potential solutions could be evaluated before they are put into practice.

Figure 2. The blue line represents measured tracking error in degrees (y) by time in seconds (x). Green shows the predicted tracking error on the training data and red shows the predicted error on the test data. Run 6 for subject 8 had an r² for the training data of 0.491 and an r² for the test data of 0.443 using only 2 physiological parameters and 13 nodes. While this solution is atypically good, it does illustrate the potential of symbolic regression.

Table 2. Multiple Regression/LDA vs. Symbolic Regression. Listed are the r² values for the tracking error models. In parentheses are the classification accuracy percentages of the classification models.

              Multiple Regression              Symbolic Regression
Participant   Best Training   Best Test        Best Training   Best Test
1             0.273 (67.6%)   0.115 (80.4%)    0.680 (80.3%)   0.119 (79.3%)
2             0.399 (74.3%)   0.179 (62.0%)    0.722 (70.8%)   0.314 (62.5%)
3             0.415 (80.9%)   0.069 (49.1%)    0.617 (86.1%)   0.187 (68.8%)
4             0.240 (56.2%)   0.177 (53.3%)    0.568 (69.6%)   0.189 (73.4%)
5             0.397 (65.6%)   0.019 (69.8%)    0.567 (81.0%)   0.181 (81.3%)
6             0.449 (50.0%)   0.314 (49.8%)    0.579 (87.6%)   0.332 (86.0%)
7             0.307 (50.1%)   0.089 (50.0%)    0.705 (70.7%)   0.104 (72.0%)
8             0.588 (92.5%)   0.481 (97.5%)    0.636 (93.1%)   0.443 (93.1%)
9             0.276 (50.1%)   0.175 (50.2%)    0.527 (84.0%)   0.160 (80.7%)
10            0.411 (83.4%)   0.115 (63.2%)    0.760 (81.1%)   0.121 (75.0%)
Averages:     0.376 (67.1%)   0.173 (62.5%)    0.636 (80.6%)   0.215 (77.2%)

Tracking Error Results

To verify that the rotated mappings did in fact hamper tracking performance (increase tracking error), a 2 x 4 ANOVA evaluated the effects of mapping (normal vs. rotated) and block (1-4) on mean tracking error. As expected a significant main effect of mapping was found [F(1, 9)=133.682, p