Univ Access Inf Soc DOI 10.1007/s10209-009-0144-5
LONG PAPER
Attentive interfaces for users with disabilities: eye gaze for intention and uncertainty estimation

Helmut Prendinger · Aulikki Hyrskykari · Minoru Nakayama · Howell Istance · Nikolaus Bee · Yosiyuki Takahasi
© Springer-Verlag 2009
Abstract Attentive user interfaces (AUIs) capitalize on the rich information that can be obtained from users' gaze behavior in order to infer relevant aspects of their cognitive state. Not only is eye gaze an excellent clue to states of interest and intention, but also to preference and confidence in comprehension. AUIs are built with the aim of adapting the interface to the user's current information need, and thus reducing the workload of interaction. Given those characteristics, it is believed that AUIs can have particular benefits for users with severe disabilities, for whom operating a physical device (like a mouse pointer) might be very strenuous or infeasible. This paper presents three studies that attempt to gauge uncertainty and intention on the part of the user from gaze data, and compares the success of each approach. The paper discusses how the application of the approaches adopted in each study to user interfaces can support users with severe disabilities.

Keywords Attentive interfaces · Disabilities · Eye gaze

H. Prendinger · N. Bee
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
e-mail: [email protected]

A. Hyrskykari (&)
Department of Computer Sciences, University of Tampere, 33014 Tampere, Finland
e-mail: [email protected]

M. Nakayama · Y. Takahasi
CRADLE, Tokyo Institute of Technology, Ookayama, Meguro-ku, Tokyo 152-8552, Japan
e-mail: [email protected]

Y. Takahasi
e-mail: [email protected]

H. Istance
School of Computing, De Montfort University, The Gateway, Leicester LE1 9BH, UK

N. Bee
Institute of Computer Science, University of Augsburg, Eichleitnerstr. 30, 86135 Augsburg, Germany
1 Introduction

Interfaces based on eye gaze hold great promise to ease and improve the communication capabilities of motor-impaired users whose only means of communication is eye gaze. A fundamental aspect of human communication is the signaling of intention, i.e., what a person wants or wants to bring about in the environment. The aim of this paper is to demonstrate how eye gaze can be used to detect automatically the intentions of users during human–computer interaction (HCI). The authors acknowledge the importance of the philosophical approach to the notion of "intention" [1, 2] with all its involved issues, and the efforts towards artificial agents with intention in artificial intelligence (e.g., [3]). In this paper, intention simply refers to the user's aim to successfully interact with the interface in order to achieve goals related to the particular software currently being used.

While not specifically directed toward disabled users, several approaches exist that rely on gaze as a primary input modality. Attentive user interfaces (AUIs [4, 5], or 'attentional user interfaces' [6]) envisage interaction styles where a user's gaze provides information about the context of the user's action, such that the system can actively adapt its own state to the user's information state. Since users with motor impairments often find deliberate command-based interaction effortful, AUIs may have considerable
potential benefits for disabled users in terms of a reduction of workload and effort during interaction. Furthermore, patterns of eye gaze can be used to make inferences about the user's intention for future actions during interaction sequences, such that the system can offer shortcuts to these intended actions, or even undertake actions automatically on the user's behalf.

A phenomenon associated with the intention to act is uncertainty, which can be observed at (at least) three levels. First, users often demonstrate uncertainty about how to act when interacting with computers, i.e., what to do next in order to complete a task. Second, uncertainty may apply to the comprehension of presented information, leading the user to initiate actions aimed at reducing uncertainty. Third, interfaces typically require the user to perform a selection of some kind, which involves a temporary uncertainty about what to choose. If a system could detect behaviors which suggest uncertainty on the part of the user, then actions could be prepared (and even taken) automatically, thereby removing some or all of the workload from the user. The usage of "uncertainty" in this paper relates to the user's cognitive state of unsureness or hesitation, and not to 'choice under uncertainty' as in decision theory (see, e.g., [7]).

This paper will compare three approaches to the automatic recognition of intention and uncertainty based on eye movement and associated data. It will examine the success of these approaches and how they could be applied to assist disabled users in communication tasks using computers. Two of these approaches seek to identify uncertainty from gaze patterns, one in the context of reading a foreign language and the other in answering multiple choice questions. The third approach goes further and attempts to predict the outcome of a forced binary choice task, or the intention to choose one option in preference to the other.

The rest of the paper is organized as follows. Section 2 describes the types of disabilities and related needs that should be addressed by attentive interfaces. Section 3 motivates the three studies of this paper, and provides brief summaries of the studies. Sections 4–6 explain each of the three studies in detail. Section 7 compares the studies and outlines application scenarios of intention and uncertainty estimation for disabled users.
2 Disabled users and system needs

2.1 Types of disabilities and associated needs

Characterizing disabled users and their interaction needs is difficult, due to the individual nature and extent of a particular disability. Practitioners in the accessibility field are often reluctant to produce or endorse
lists of system requirements for classes of disabilities. They prefer instead to work on a case-by-case basis to define and address the communication needs of an individual. There are, however, broad descriptions that are of use here. The emphasis in this paper is on users with motor disabilities, i.e., an inability to exercise precise and timely control over input devices (such as a mouse and keyboard), which makes the use of these devices impossible or very effortful and error prone.

The tasks affected by this inability fall into several broad categories. The first is the use of computer applications, such as web browsing, email and word processing. The second is communication about current needs ("I would like some water please") and conversation ("I think this about a certain issue"); related to this communication is control over the local environment (opening the curtains, changing the television channel). A third category involves mobility control and independent and safe movement around the local environment using a mobility device such as a wheelchair or scooter.

Some disabilities, such as cerebral palsy, mean that users are prone to involuntary spasmodic muscle movement when they try to carry out a deliberate, coordinated action. Other disabilities (such as motor neurone disease or ALS) involve a progressive inability to make any muscle movement. In both cases eye gaze can offer an effective means of communication. For some people with cerebral palsy, deliberate use of the eyes invokes less involuntary muscle movement than the use of other muscles. People with ALS can retain good control over their eye movements when effective control of other muscle groups has been lost.

Users with disabilities usually have a human helper to assist with implementing control actions and, very importantly, with interpreting communication by the user. The helpers are often very skilled in quickly running through possible words the user wishes to say and interpreting gestures made by the user (possibly with small eye movements) as responses.

2.2 Command-based and non-command-based interaction

The distinction here lies in whether or not a user deliberately makes an input action. Staring at a key on an on-screen keyboard, for example, to send a key event to an underlying desktop application would constitute command-based interaction. The use of eye gaze for this type of interaction is well established [8]. There has been relatively little research into non-command-based interaction for disabled users. Jacob [9] advocated using gaze for this type of interaction long ago, instead of deliberate command-based interaction. Non-command-based interaction has re-emerged in recent times in the form of attentive interfaces, primarily for able-bodied users and applications.
2.3 Attentive user interfaces

Vertegaal [10] defines an attentive user interface as follows:

    An Attentive Interface is a user interface that dynamically prioritizes information it presents to its users, such that information processing resources of both user and system are optimally distributed across tasks. The interface does this on the basis of knowledge—consisting of a combination of measures and models—of the past, present and future of the users' attention, given the availability of system resources (p. 23).

Thus, an attentive user interface embodies the notion of monitoring what the user is attending to during interaction with a system (in a broad sense of the word) and offering context-based options based upon that monitoring. Hence the system is able to offer commands based on what it thinks the user wants to do in the current context of interaction. This reduces the need for the user's attention to be distracted away from the primary task to first find and then issue a command, as that command is already offered by the system. It also reduces the workload and effort associated with giving that command. Knowledge of gaze position is an important component in attentive systems, since visual attention to objects on-screen (or in the world) gives important clues as to intent, and gaze position gives a good indication of which objects the user is visually attending to [11].

One of the first systems to monitor a user's gaze behavior and respond to mental states hypothesized from eye gaze patterns (in real time) is the 'gaze-responsive self-disclosing display' by Starker and Bolt [12]. In their application—which is inspired by Saint-Exupéry's book The Little Prince—a two-dimensional virtual narrator would comment on visualizations of everyday items on a virtual planet. For instance, if the user demonstrates interest in a staircase, the narrator tells a story about it. If the user is looking alternately at multiple staircases, the system infers that the user is interested in the staircases as a group, and responds with an appropriate comment. Recently, the iTourist system revived this concept in the context of city trip planning [13].

Eye gaze can also play a central role in automatically adapting multimedia systems to the user's interest, and thus contribute to the realization of 'interest and emotion sensitive media' (IES) [14]. In an IES media system, eye gaze is interpreted in order to determine the subsequent branch of a multiplex script board that a user is likely interested in. Such a system is 'emotion sensitive' in that it also analyzes a user's physiological activity, such as pupil dilation and blink rate, from which emotional states (e.g., arousal or stress) may be inferred [15]. Other work in eye-based
multimedia systems aims to employ gaze for effortless and rapid image and video retrieval [16]. The InVision system described in [17] processes a user's gaze directed at an interface depicting a kitchen environment, and infers from the gaze path whether the user is hungry, intending to rearrange the kitchen items, and so on. The AutoSelect system (one of the three systems contrasted in this paper) could be employed in this context to automatically choose the user's preferred dish [18]. This system will be described in detail in Section 6.

2.4 Gaze for recognizing intention

In human–computer dialogue, intention can be usefully thought of as pursuing a goal-oriented activity. This has long been a fundamental feature in various approaches to task analysis [19] and task modelling [20]. Before the ambitious question of recognising intention is tackled, it is necessary to consider just how much of users' current cognitive state is revealed by their eye gaze and gaze patterns. In a study of students undertaking a problem-solving task involving gear assemblies, the displayed information was suddenly switched off during the trial [21]. Participants were asked to report the last thing they remembered thinking about when this happened. In 73% of the cases, they reported the last thing as being a region of the screen, but they had looked at this region only 46% of the time during the last or next-to-last fixation before the display was removed. Thus, the correspondence between eye position (visual attention) and reported general attention is not complete.

Searching, knowledgeable movement, and prolonged searching were behaviours characterized by prototypical gaze patterns in [22]. The recognised patterns were based on the duration of fixations and the amplitude of saccades, and these were used to automatically adapt the responsiveness of an interface in terms of time durations. Prolonged searching was assumed to arise when the user was not in a position to make an active selection of an object on the screen, and in this case the automatic selection mechanism was switched off. The anecdotal user evaluations were positive towards these adaptive features.

Earlier work by Goldberg and Schryver [23] attempted to identify and classify gaze patterns according to whether a user intended to perform a zoom operation on a graphics display. This was in the context of an object identification task which required the user to make a conscious decision whether to zoom in, zoom out or do neither. They hypothesised that clusters of fixations which gradually became smaller signalled an intent to zoom in. Focusing visual attention towards the outer areas of the window, on the other hand, was indicative of an intent to zoom out. Gaze behaviour which did not fall into either category was
taken to characterize neither intent. Multiple discriminant analysis across a number of successive frames was used, and 61% of the data was correctly classified. This level of success was obtained in a highly contrived experimental situation. Real-time interpretation of a wider variety of patterns in an uncontrolled environment would be considerably more challenging.

2.5 Attentive interfaces for users with disabilities

One potential application lies in the emulation of a human helper. Human helpers become experts in 'reading' the eye movements of the people they look after, and interpret gaze gestures and movements as requests or statements. A user with severe mobility issues may look at the flowers in the room and decide that they need watering. Looking at the flowers for an extended period may suggest to the helper that the user wishes to do something with the flowers, and as a consequence the helper may run through a list of possible commands based on their perception of the current appearance of the flowers, for example "do you want to move the flowers to the window?" or "do you want to water the flowers?", and so on. This example should not be seen as a suggestion that an attentive interface fulfilling this function would be preferable to the human helper. Nevertheless, it would have the advantage of always being there at times when the human helper is not.

The outcome of inferences made by the system about the user's intent or interest can take different forms. The system can prepare suggestions for commands, which the user can ignore or accept. Smithe et al. [24] call this "context for action". These are intelligent default actions. An important issue here is the way in which the commands are presented and the workload inherent in attending to, or ignoring, the suggested commands. For disabled users, the primary motivation for context-based actions is the quest to reduce workload and increase efficiency of interaction. It is essential that attentive system responses do not increase workload or reduce efficiency.

Work on attentive intelligent agents [25, 26] will be useful here as a starting point. Results from this work show that the presence of an interface agent seemingly triggers natural and social interaction protocols of human users in terms of their gaze behaviour. For instance, the gaze of the user follows the agent's deictic gesture to some visual reference object, and subsequently returns to the agent's face, apparently awaiting further information from the interface agent. Based on these findings, a new generation of virtual infotainment agents, the so-called 'visual attentive presentation agents', has been implemented [27] (see http://research.nii.ac.jp/~prendinger/GALA2006/). These agents monitor the user's gaze behaviour during
their performance, and adapt their presentation to the user's visual interest (or lack of it). If the user is found to divert his or her attention from the presentation content (typically a virtual slide), the agents would interrupt their presentation and give more information on the user's object of interest, for instance the view out of the (virtual) presentation room. In this way, the attentive agents provide a highly personalised experience of the presented content to the user.

2.6 Eye behaviour and social communication

The social functions of gaze have been realized in some eye-based attentive interfaces, which can have direct benefit for disabled users. Dyadic (two-person) communication involves typical gaze patterns that regulate the flow of conversation [28]. For instance, when beginning to speak, a person tends to look away (from the listener), and toward the end of the utterance, the speaker will start to look at the listener. As well as exhibiting social gaze behaviour naturally, people also do so intentionally, e.g., by glancing at the listener as a request for a response. Eye-based interfaces can be programmed to recognize such 'eye gestures' [17] that hold social meaning, including (involuntary or purposeful) gestures like rapid blinking to indicate stress or glancing around to indicate lack of interest. Social and communicative eye behaviour is also intensively studied, and technically mimicked, in computer-mediated communication (CMC) and computer-supported cooperative work (CSCW) [29, 30].

2.7 Ethical considerations

There are significant ethical issues associated with recognising covert or private intentions, where users may not be comfortable with automatic estimation and logging of their intention, for example when using the internet for purposes that some may find questionable. Furthermore, there is an ethical issue concerning what information should be collected about an individual's assumed intention during interaction with a system. These issues are acknowledged as being important, but will not be dealt with further in this paper.
3 The three studies

To date there has been no work carried out that has specifically looked at automatically inferring intent from gaze patterns of users with disabilities. The motivation in comparing the three studies described in this paper is to assess how successful this endeavour has been in studies
with non-disabled users. From this, statements can be made about the application of the findings to the needs of disabled users. Each approach described in this paper has been implemented as a system by independent research groups. The purpose of each system will be explained, as will the gaze-based metrics used, the level of success achieved, and, at the end of the paper, the potential application for disabled user groups.

The first study (Sect. 4) considers an automatic dictionary facility for use when reading texts in a second language. It estimates when a user is uncertain about the meaning of a word in a passage of narrative text, and automatically presents a translation of the word in the user's native language. It illustrates the potential of detecting uncertainty and comprehension difficulties by analyzing gaze. Within a predefined context (concentrated reading) it is possible to infer the user's intentions (need of assistance) to a very satisfactory degree.

The second study (Sect. 5) examines how uncertainty about the answers previously given to a set of multiple choice questions can be detected from gaze patterns. It shows that gaze patterns hypothesised to indicate uncertainty when looking over the questions correlate with subjective estimates of uncertainty or "strength of belief" (SOB).

Finally, the third study (Sect. 6) demonstrates the utility of gaze in detecting preference, i.e., the intention to choose between two visually presented alternatives. A characteristic gaze pattern (the gaze 'cascade effect') allows one to observe the dynamics of uncertainty in choice situations. The study also examines the extent to which data on pupil dilation, as an indicator of preference, can be integrated with estimates based on gaze patterns.
4 Detecting reading comprehension difficulties

Detecting intention from the user's natural gaze path implies that the gaze behaviour that is typical in the application context is known. If the gaze path deviates from the expected path, one may be able to infer something about the user's cognitive processes. Eye movement behaviour in reading is a thoroughly studied field. Hence, when interpreting eye behaviour during reading, a rich platform of knowledge is available concerning what to expect from the recorded gaze path. When the user is concentrating on reading and the gaze path differs from the expected one, it can be assumed that the reader experiences uncertainty in comprehending the text being read.

iDict is a gaze-aware environment that aims to assist people reading electronic documents in a second language. Normally, when reading text documents in a foreign
language, an unfamiliar word or a phrase interrupts the flow of reading and forces the user to obtain help from either a printed or an electronic dictionary. In both cases the process of reading and the line of thought are interrupted. After the interruption, getting back to the reading context takes time, and may even affect the comprehension of the text being read. In iDict, the reader's eyes are tracked and the reading path is analyzed in order to detect deviations from the normal path of reading, indicating that the reader may be in need of help with the words or phrases being read at the time.

When gaze behaviour is used to automatically trigger system actions, interpreting the gaze path is always prone to errors. The reading may falter not only because the words or phrases used are unfamiliar, but also because, for example, the text is cognitively demanding. The divergent reading path may also be due to reasons entirely unrelated to the reading process. In iDict, several sources of information are used to make the detection of comprehension difficulties as accurate as possible. To compensate for the inherent inaccuracy of eye tracking, the measured point of gaze is algorithmically corrected [31]. An 'index of difficulty' in comprehending words in a piece of text is defined based on metrics associated with gaze behaviour [32]. When the difficulty threshold is exceeded, a brief translation or 'gloss' in the native language is displayed automatically over the word considered to cause difficulty, allowing a convenient quick glance at the available help.

4.1 Gaze metrics used

This application faced the challenge of having to generate an 'index of difficulty' for each word in the passage of text and decide on an appropriate response in real time. The problem is compounded by the need to map fixations during reading to small, closely packed target words on the screen. Based on previous research into reading behaviour, five candidate metrics were chosen and evaluated. These were (1) the duration of the first fixation on a word, (2) the cumulative sum of fixation durations on the word when the word is entered for the first time, (3) total time, the sum of the durations of all fixations on a word during reading, including regressive fixations, (4) the total number of fixations on a word, and (5) regressions, that is, the number of fixations that enter a word starting from a word appearing later in the text.

A set of experiments compared the performance of each metric. The words in the analysed text in the experiment were partitioned into problematic words and familiar words based on the reports of the individual participants. Each metric was assessed on the basis of how successfully it was able to predict which group of words a fixated word belonged to.
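The following sketch (hypothetical code, not the iDict implementation; the Fixation record and the prior mapping of fixations to words are assumed) shows how these five measures could be derived from a time-ordered fixation sequence:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Fixation:
    word_index: int    # index of the word the fixation was mapped to (already corrected)
    duration_ms: float

def word_metrics(fixations):
    """Compute the five candidate reading metrics for each word."""
    metrics = defaultdict(lambda: {
        "first_fixation_ms": None,  # (1) duration of the first fixation on the word
        "first_pass_ms": 0.0,       # (2) cumulative fixation time while the word is entered for the first time
        "total_time_ms": 0.0,       # (3) sum of all fixation durations on the word
        "n_fixations": 0,           # (4) total number of fixations on the word
        "regressions": 0,           # (5) fixations entering the word from a later word
    })
    left_first_pass = set()  # words whose first reading pass has ended
    prev = None              # word index of the previous fixation
    for fix in fixations:
        w, m = fix.word_index, metrics[fix.word_index]
        m["n_fixations"] += 1
        m["total_time_ms"] += fix.duration_ms
        if m["first_fixation_ms"] is None:
            m["first_fixation_ms"] = fix.duration_ms
        if w not in left_first_pass:
            m["first_pass_ms"] += fix.duration_ms
        if prev is not None and prev != w:
            left_first_pass.add(prev)   # moving off a word ends its first pass
            if prev > w:
                m["regressions"] += 1   # the word was entered from later in the text
        prev = w
    return metrics
```

Of these, it is the accumulating total_time_ms value that the comparison below singles out as the most robust predictor.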
The most robust metric was the total time on a word: the mean total time for problematic words was 864 ms, and the mean total time for familiar words was 398 ms.

The next question concerned the selection of the threshold time value after which the automatic display of the gloss would be triggered. The problem of choosing an appropriate threshold or trigger value can be represented by two adjacent normal distributions on the reading-time continuum. One distribution has a mean of 398 ms for familiar words, and the other has a mean of 864 ms for problematic words. The standard deviation of the familiar-word distribution was 248 ms, so the mean of the problematic-word distribution lies 1.87 standard deviations from that of the familiar-word distribution. A short trigger time will reduce the probability of missing a problematic word, but increases the probability of showing a translation where it is not needed (i.e., a false alarm). A longer trigger time increases the risk of missing problematic words but will reduce false alarms. Filtering out false alarms without raising the threshold for problematic words was tested in experiments by (1) personalizing the threshold using individual data profiles rather than group averages, (2) using word frequency (where the assumed reading time for a word is adjusted by its frequency of occurrence in the target language), and (3) using word length to adjust the time assumed to be necessary to read it.

A trigger threshold of 2.5 standard deviations above the group mean for familiar words was finally chosen. This would provide a translation for 36% of the problematic words, while reducing the probability of false alarms to 2.4%. As there were far more familiar than problematic words, reducing the probability of potentially irritating false alarms was considered to be more important than increasing the hit rate for problematic words. Furthermore, even though the hit rate for problematic words was "only" 36%, it was considered that it would increase when iDict was actually put in use later on. These percentages were computed from a normal session of reading text on a screen (without any expectation of getting help).
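As a concrete illustration of this trigger rule, the sketch below (hypothetical code, not the iDict implementation; the adjustment functions for word frequency and word length are placeholders, since the paper does not give them) flags a word once its accumulated total reading time exceeds the familiar-word mean by 2.5 standard deviations. With the group statistics above, the base threshold works out to 398 + 2.5 × 248 ≈ 1,018 ms.

```python
# Group statistics reported above (milliseconds).
FAMILIAR_MEAN_MS = 398.0
FAMILIAR_SD_MS = 248.0
TRIGGER_SDS = 2.5

def gloss_threshold_ms(personal_mean_ms=None, personal_sd_ms=None,
                       word_length=None, relative_frequency=None):
    """Total-reading-time threshold above which a gloss would be shown.

    The optional arguments stand in for the three refinements mentioned in
    the text (personalised profiles, word frequency, word length); the
    adjustments below are placeholders only.
    """
    mean = FAMILIAR_MEAN_MS if personal_mean_ms is None else personal_mean_ms
    sd = FAMILIAR_SD_MS if personal_sd_ms is None else personal_sd_ms
    threshold = mean + TRIGGER_SDS * sd
    if word_length is not None:
        threshold *= max(1.0, word_length / 6.0)               # placeholder length adjustment
    if relative_frequency is not None:
        threshold *= 1.0 + max(0.0, 1.0 - relative_frequency)  # placeholder frequency adjustment
    return threshold

def should_show_gloss(total_time_ms, **adjustments):
    """True when the accumulated total time on a word exceeds the threshold."""
    return total_time_ms > gloss_threshold_ms(**adjustments)
```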
4.2 Evaluation of success

4.2.1 Testing the efficiency of the metrics

This experiment aimed to test the usability of the system and the earlier predictions made about the number of correct hits and false alarms for the chosen trigger time. The complete text contained 641 words and was divided into three blocks of about 250 words each, presented in the Times font at 12 points with 1.5 line spacing on a 19 in. screen with a resolution of 1,024 × 768.

Two means were employed for ascertaining that the subjects were concentrating on reading. First, the text was chosen carefully, taking the participants' characteristics into account, and their comprehension of the text was checked after they had read the text blocks. The text was an extract from a short story, which generated an interesting, tense setup that enticed the subjects to read further. All of the subjects considered the text interesting, and some of them even asked for a reference to the book, because they wanted to read the whole story. Second, the subjects were motivated to comprehend what they read by telling them in advance that, after reading each text block, they would have to give a verbal review to the experiment supervisor. After the verbal review of each block, the subjects pointed out the words that they considered to be problematic. In addition, a video of the trials with the gaze path of reading overlaid was recorded, and this was reviewed by the subject afterwards, complementing the identification of problematic words after each block had been read. From the video recording the readers had an opportunity to review the glosses they had received and to double-check whether the help was expected or had been given needlessly.

From the three text blocks the participants pointed out a total of 310 problematic words, and of these they got help (a gloss for the word) for 281 words. In addition to the correctly triggered help, the participants got 178 false alarms. This means that in a real-life situation iDict triggered help for
• 91% of the problematic words (281 out of 310), and
• 2.4% of the possible false alarms (178 out of 7,382).
As expected, the percentage of correct hits was far greater than had been predicted on the basis of the earlier experiments, which were carried out when the users were just reading the text normally without the possibility of getting help from the system. The prediction made about the number of false alarms, which had been the main initial concern, was on the other hand very accurate. The way in which the system worked had been explained to the participants before the trials began, and some of the subjects took advantage of the fact that they could obtain a gloss for a word by deliberately dwelling on it until the gloss appeared. On these occasions, the subjects were using the system in 'command-based' mode rather than 'attentive' mode.

4.2.2 Testing the usability of the system

A further experiment was designed to compare three modes of operation of the system. In the first, gaze-only mode of operation, the system deduced automatically which words
give rise to comprehension difficulties, and presented glosses for these automatically. In the second, combined mode, gaze was used to define the word the user was looking at, but a simple activating command (pressing the space bar) was used instead of automatic action to launch the gloss over the word; visual feedback (the word turned grey) was given when the gaze on a word was prolonged, to show that a gloss was available. A third, mouse-only mode (analogous to a conventional interface) was included in the study, where the user selected the words for which a gloss should be shown directly with the mouse, without any reference to gaze behaviour.

There were 18 able-bodied test readers participating in the experiment. Their ages varied from 20 to 27 years; nine of them were male and nine female. They were all native Finnish speakers, and none of them was familiar with iDict. Only three of them had a little experience with eye tracking, having once participated in an experiment in which eye tracking had been used. They used iDict to read six passages of English text taken from the same narrative. The passages were grouped into three blocks of two, the three conditions (the gaze-only, combined, and mouse-only modes) were presented one per block, and the order of presentation of the conditions was counterbalanced across participants. Each of the text passages contained about 220 words. The first passage in each block was used as a practice session. To motivate the subjects to understand the text, and thus to take advantage of the help that iDict is able to provide, the subjects had to give a verbal review of each of the passages after reading. The eye tracker used in the experiment was a Tobii 1750.

In order to measure the subjective assessment of usability for each of the input modes, the participants were asked to fill in the System Usability Scale (SUS) [33] questionnaire, which asked them to state their level of agreement with ten statements relating to overall usability. The aggregate of the ratings for the ten statements gave an overall SUS score, where 40 represents the highest possible usability score. The overall score for
• Condition A (gaze-only) was 29.89, with a standard deviation of 4.9;
• Condition B (combined) was 29.94, with a standard deviation of 4.8; and
• Condition C (mouse-only) was 34.56, with a standard deviation of 4.5.
The SUS score for the mouse-only condition was significantly better than that for the combined condition (p < 0.001) and also for the gaze-only condition (p < 0.001). However, the scores for all three conditions were over 20, meaning that the test readers experienced them all positively.
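For readers unfamiliar with the questionnaire, the sketch below shows how such an aggregate score can be computed. This is the standard SUS scoring scheme, not code from the study, and it is only assumed here that the reported 0–40 aggregate is the unscaled sum of the per-item contributions (the conventional SUS score multiplies this sum by 2.5 to give a 0–100 scale).

```python
def sus_aggregate(responses):
    """Unscaled SUS aggregate in the range 0-40 (40 = best).

    `responses` are the ten Likert ratings (1-5) in questionnaire order.
    Odd-numbered SUS statements are positively worded, even-numbered ones
    negatively worded, so their contributions are mirrored.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("expected ten ratings between 1 and 5")
    return sum((r - 1) if i % 2 == 0 else (5 - r)  # i is 0-based, so even i = odd-numbered item
               for i, r in enumerate(responses))

def sus_score(responses):
    """Conventional SUS score on a 0-100 scale."""
    return sus_aggregate(responses) * 2.5
```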
Since the SUS questionnaire results are probably positively biased toward the familiar condition, a straightforward subjective opinion of the conditions was also sought from the test readers. They were therefore asked to rank the input conditions according to which one they would prefer to use (see Table 1).

Table 1 Frequencies of responses of overall input condition preferences

Overall preference           Count
Condition gaze-only (A)      5
Condition combined (B)       5
Condition mouse-only (C)     8

In the free interview after the experiment, a common reason given by users who preferred the mouse-only condition was that the mouse is the device most familiar to them, and they felt confident having full control of the application. However, several users reported that they experienced gaze-aware interaction as enjoyable and very natural, exemplified by the following comment:

    The gaze-only condition is most comfortable in the sense that it is so very natural … you only have to read. You don't need hands at all.

Even though the mouse-only condition received the highest number of top rankings (eight), it is interesting to notice that more than half of the participants did not rank the mouse-only condition first: ten participants ranked either the combined or the gaze-only condition first. There was no difference between the gaze-only and the combined gaze-and-key condition in this respect.

4.3 Overall evaluation

The experiments with the system have demonstrated that it is possible to detect uncertainty while reading passages in a second language and to respond automatically and appropriately with translations of the words causing difficulties. On the basis of the reading path, it was possible to pick out the problematic words (the words the readers needed translation help with) with an accuracy of 91%, and only 2.4% of the familiar words triggered the help function. From a usability point of view, the overall preference for the system working in one of the two attentive modes was about equal to that for the more familiar command-based mouse interface.

5 Certainty estimation for responses to multiple choice questionnaires

The previous section has shown how uncertainty can be estimated from gaze information in the context of a reading task.
In a more general sense, a typical human reaction to uncertainty is erratic or irregular eye movement [34]. This section examines the feasibility of an index of the "strength of belief" (SOB) by considering the gaze at previously chosen answers [35] to multiple choice questions. This builds on earlier eye movement analysis that has already been applied to detecting relevant text during an information retrieval experiment [36, 37].

Scan paths and transitions, which are thought to characterise high and low SOBs respectively, were used to identify those responses to questions where the participant was confident that his or her answer was correct (high SOB) or less confident (low SOB). These were compared with subjective estimates of levels of confidence in the correctness of the answers made by individual participants. This is directly analogous to the results of the first study, which derived an index of reading difficulty that was subsequently checked against the individual participants' assessments of familiar and problematic words.

The task consisted of a session where the participants answered questions, followed by a separate session where participants were asked to review their answers to the questions. Screenshots (with English translation from Japanese) of the displays for the answering and reviewing sessions are shown in Figs. 1 and 2. The multiple choice questions were arranged one question to a row, four rows to a column, as shown in Fig. 1. The question appeared in the left-most column and four sets of four candidate answers appeared in the four columns to the right-hand side. Each question gave a type of fruit (apples, oranges, grapes and strawberries, in this example), and required the participant to identify which answer was a variety of that type of fruit. For the question about 'apples' in the first row, the candidate answers were Muscat (grape), Tochiotome (strawberry), Delicious (orange), and Fukuhara (apple) in a column titled 'Variety 1', and participants had to identify which was a variety of that type of fruit by clicking on 'Fukuhara', in the case of apples. There were four rows and four columns listing a total of 16 questions in all. The content of the questions was selected in accordance with the results of a preliminary experiment.

Fig. 1 Question display for the answering session (English translation)

Fig. 2 Chosen answer display of fixation points enclosed in bold lined classification cells for the reviewing session (English translation)

5.1 Gaze metrics used

For each question, an estimate of the SOB (two categorical values: high SOB and low SOB) was made for the chosen answer, based on scan path analysis of eye movements during the reviewing session in Fig. 2. Eye movements were divided into saccades and fixations [38]. A fixation is defined as eye movement staying within a 0.3° visual angle and at a velocity of 3 deg/s or less. The detailed parameters were set according to the characteristics of eye movements [39]. Scan paths were essentially analyzed as fixation patterns, and in particular as the appearance of scan paths between the question column and the chosen answers during the reviewing session, because a participant may confirm the relationship between a question and a chosen answer about which he or she is less confident (low SOB). This is the key idea behind estimating high or low SOB. To extract those scan paths, all fixation points were classified into four cells (for question items) in the left-most column and into 16 cells (for chosen responses to the questions) in the columns on the right, as shown in Fig. 2. The fixation cell numbers are marked for each cell in the bottom-right corner (the question items: 10, 20, 30, 40; cells for chosen responses: 11–44). Therefore, scan paths
can be noted as a series of fixation cell numbers. This notation for the scan paths is similar to the work in [40]. The scan path is also illustrated in a state transition diagram, where fixation cells are defined as states (Fig. 3).

Fig. 3 Transition state diagram for fixation cells (states q_i0, q_ak and q_bk)

To determine whether scan paths are caused by uncertainty, scan paths between two fixation cells, namely state transitions, were analyzed. Following the key idea above, the focus was on the transition pattern illustrated in Fig. 3. The fixation cell q for row I and column J can be noted mathematically as q_ij {I: i = 1,…,4}, {J: j = 0,…,4; j = 0: question item}. A transition may be observed from cell q_ak for row a (a ∈ I) and column k (k ∈ J, 1 ≤ k ≤ 4) to the same cell q_ak, or to another cell q_bk for row b (a ≠ b, 1 ≤ a, b ≤ 4) in the same column k, via a question item cell q_i0. A three-fixation-cell transition (q_ak, q_i0, q_bk) suggested that a viewer had confirmed his or her own answer, or the consistency of his or her answer choices in column k.

All SOBs for the answer cells were set to high SOB in the initial condition. The SOBs for q_ak and q_bk were changed to low SOB when such a three-fixation transition occurred. If a viewer gazed at the question item cell q_i0, the SOB for q_ak, or both the SOBs for q_ak and q_bk, were recognized as low SOBs. For all other patterns, the SOBs of the answer cells involved were kept high. These two-class classifications were automatically calculated from the scan path data by a computer program using a three-cell transition model.
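The following sketch (hypothetical code, not the authors' program) shows one way this two-class classification could be implemented over a scan path given as a sequence of fixation cell numbers in the scheme of Fig. 2 (question items 10, 20, 30, 40; chosen-answer cells 11–44). The handling of partial patterns (e.g., a glance at a question item without returning to an answer cell) is not fully specified in the text, so only the complete answer–question–answer transition is treated here.

```python
def classify_sob(scan_path):
    """Two-class SOB classification from a reviewing-session scan path.

    `scan_path` is the time-ordered sequence of fixation cell numbers
    (question items 10, 20, 30, 40; chosen-answer cells 11-44).
    Returns a dict mapping each answer cell that was looked at to 'high'
    or 'low'. All answer cells start as 'high'; a three-cell transition
    answer -> question item -> answer within the same column marks both
    answer cells involved as 'low'.
    """
    def row(cell):         return cell // 10
    def col(cell):         return cell % 10
    def is_answer(cell):   return col(cell) != 0
    def is_question(cell): return col(cell) == 0

    # Collapse consecutive fixations on the same cell into a single visit.
    visits = [c for i, c in enumerate(scan_path) if i == 0 or c != scan_path[i - 1]]

    sob = {c: "high" for c in visits if is_answer(c)}
    for a, q, b in zip(visits, visits[1:], visits[2:]):
        if is_answer(a) and is_question(q) and is_answer(b) and col(a) == col(b):
            sob[a] = "low"
            sob[b] = "low"
    return sob
```

For example, the path 11 → 10 → 21 contains an answer–question–answer transition within column 1, so cells 11 and 21 would both be marked as low SOB.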
5.2 Evaluation of the success of the SOB metric

5.2.1 Procedure and design

An experiment was conducted using consecutive answering and reviewing sessions. One minute was assigned to each session. The procedure, carried out by each subject together with an experimenter, consisted of the following steps:

1. Practice session answering sample questions.
2. Calibration for eye tracking.
3. Answering session (Fig. 1).
4. Reviewing session (Fig. 2).
5. Print out of screen shot.
6. Determine an SOB score for each question.
7. Return to step 2.
Five subjects participated in both sessions. For the first minute of the experiment, answers were selected, followed in the second minute by the reviewing session, where subjects reviewed their own answers without making corrections. The times for answering and reviewing were strictly controlled by a computer timer. All participants underwent a practice session.

During the experiment, eye movements of each subject were recorded using a video-based eye tracker (nac:EMR8NL [41]). The task was displayed on a 17 in. LCD monitor (1,280 × 1,024 pixels) positioned 65 cm from the subject. A chin rest was used to stabilize a subject's head, and a small infra-red camera used as an eye tracker was
positioned between the subject and the monitor, 40 cm from the subject. The subject's hands were free. The tracker was calibrated at the beginning of each session. Eye movement was tracked on an 800 by 600 pixel screen at 60 Hz; the spatial resolution was therefore about 0.03° of visual angle per pixel. Eye movement data was recorded on a PC as time-course data. The tracking data was converted into visual angles according to the distance between the viewer and the display, so that the visual angles of the display were 21° by 27°.
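For reference, the conversion from on-screen extents to visual angle follows from simple trigonometry. The sketch below is illustrative only (the physical screen dimensions are not stated in the paper) and simply checks that the figures above are mutually consistent.

```python
import math

def visual_angle_deg(extent_cm, distance_cm):
    """Visual angle subtended by an on-screen extent at a given viewing distance."""
    return math.degrees(2.0 * math.atan((extent_cm / 2.0) / distance_cm))

def extent_cm(angle_deg, distance_cm):
    """Inverse: physical extent that subtends a given visual angle."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg / 2.0))

if __name__ == "__main__":
    # The reported 21 x 27 deg display over 600 x 800 tracked pixels gives
    # roughly 0.035 and 0.034 deg per pixel, i.e. the "about 0.03 deg" above.
    print(21.0 / 600, 27.0 / 800)
    # At a 65 cm viewing distance, 27 deg corresponds to about 31 cm of width.
    print(extent_cm(27.0, 65.0))
```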
5.2.2 Subjective estimates of the strength of belief

After the reviewing session, subjects indicated their confidence in each of their answers (their subjective SOB), rating each on a scale between 0 and 100. The proportion of questions answered correctly was calculated for each subject. The average correct proportion was 72.0% and varied between 62.5% and 76.3%; the standard deviation was 5.7, which is relatively small. There was a significant difference in subjective SOB between 'right answer (hit)' and 'wrong answer (miss)' [t(8) = 7.3, p < 0.001], so subjective SOBs were reflected in the correctness of the answers given; this confirms the correlation between confidence and the rate of correct responses found in previous research [42]. Since answer correctness can be classified as a 'hit' or a 'miss' for this question type, subjects' reports were also divided into two levels, using the overlap point of the two normal distributions for 'hit' and 'miss' as the threshold (58% on the subjective estimate scale of 0–100%). In this way subjective SOBs were classified into 'high' and 'low' SOBs, such that 64.6% of all responses were classified as 'high' and the rest were classified as 'low'.

5.2.3 Discrimination results

Using the model of fixation cell transition shown in Fig. 3, the average number of transitions across the cells was 61.1 per reviewing screen, and the average number of cells containing fixations on question items (q_i0) was 4.8. By applying the three-cell transition model of Fig. 3, an average of 2.9 out of 16 questions per subject were classified as low SOB. The results per question are summarized in Table 2. Correct classification of subjective SOBs from the scan path metric was 68.8% (57.5 + 11.3%), which is significantly higher than chance (p < 0.05). The rate of correct classification is better for high subjective SOBs. Discrimination performance was very weak for low SOBs, as more than half of these were wrongly classified as high SOBs; this presumably depends on the discrimination algorithm used for the eye movement scan paths.

Table 2 Results of discrimination for strength of belief (%) using scan path discrimination

                              Subject's report
Eye-movement estimation       SOB (High)     SOB (Low)
SOB (High)                    57.5           24.1
SOB (Low)                     7.1            11.3

5.2.4 Changing the threshold between high and low SOB

The correct estimation rate may also depend on the threshold, which is the overlap point between 'hit' and 'miss' responses. Therefore, the effect of moving this threshold on the success rate of the classification was investigated. The threshold was changed in both directions from the average of 58%, moving as far as 100 and 0, in 10% increments. The effect of the SOB threshold on the rate of discrimination is illustrated in Fig. 4, together with the significance levels of the discrimination of SOB.

Fig. 4 [figure: rate of correct responses (%) plotted against the SOB threshold]

According to Fig. 4, the rate of correct responses decreases monotonically with the threshold for SOB. The discrimination performance was better when the threshold was set to a lower value. Figure 4 shows that a significant rate of correct responses was obtained when the threshold was lower than 60%. This result suggests that a scan path between a question item and an answer area occurred when an SOB report was low. This discrimination procedure detected only low SOB responses, so the number of three-cell transitions that were extracted (as shown in Fig. 3) was small.
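A minimal sketch of the threshold sweep described above (hypothetical code and data structures; the study's own analysis scripts are not available): given the subjective confidence ratings and the set of questions that the three-cell transition model marked as low SOB, it computes the agreement rate for each candidate threshold.

```python
def agreement_rate(subjective_sob, estimated_low, threshold):
    """Share of questions where the eye-movement estimate matches the
    thresholded subjective report.

    subjective_sob: dict mapping question id -> confidence rating (0-100)
    estimated_low:  set of question ids classified as low SOB by the
                    three-cell transition model
    threshold:      ratings below this value count as a 'low' report
                    (whether the boundary itself counts as low is an
                    assumption, not stated in the paper)
    """
    matches = sum((rating < threshold) == (q in estimated_low)
                  for q, rating in subjective_sob.items())
    return matches / len(subjective_sob)

def threshold_sweep(subjective_sob, estimated_low, step=10):
    """Agreement rates for thresholds 0, 10, ..., 100 (the sweep behind Fig. 4)."""
    return {t: agreement_rate(subjective_sob, estimated_low, t)
            for t in range(0, 101, step)}
```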