Integrating Semantics into Multimodal Interaction Patterns

Ronnie Taib and Natalie Ruiz

ATP Research Laboratory, National ICT Australia,
Locked Bag 9013, NSW 1435, Sydney, Australia
{ronnie.taib,natalie.ruiz}@nicta.com.au

School of Computer Science and Engineering,
The University of New South Wales, NSW 2052, Sydney, Australia

Abstract. We report a user experiment on multimodal interaction (speech, hand position and hand shapes) designed to study two major relationships: between the level of cognitive load experienced by users and the resulting multimodal interaction patterns; and how the semantics of the information being conveyed affected those patterns. We found that as cognitive load increases, users' multimodal productions tend to become semantically more complementary and less redundant across modalities. This validates cognitive load theory as a theoretical background for understanding the occurrence of particular kinds of multimodal productions. Moreover, the results indicate a significant relationship between the temporal multimodal integration pattern (7 patterns in this experiment) and the semantics of the command being issued by the user (4 types of commands), shedding new light on previous research findings that assign a unique temporal integration pattern to any given subject regardless of the communication taking place.

1 Introduction

Multimodal interaction allows users to communicate more naturally and to interact with complex information with more freedom of expression than traditional computer interfaces. The use of multiple modalities expands the communication channel between human and computer, and hence facilitates the completion of complex tasks compared to unimodal interfaces [8]. Over the past decade, research in the field has led to a number of theoretical advances in understanding the mechanisms governing multimodal interaction. The fine-grained analysis of multimodal interaction patterns (MIP) unveils broad classes of interaction preferences across users that could benefit automatic recognition of speech or gesture. Moreover, multimodal interaction research leverages recent progress in psychology and education, using cognitive load theory (CLT) in particular as a theoretical background to explain the cognitive processes behind the acquisition and selection of modalities during human communication.

The overall aim of our research is to provide more consistent, integrated and intuitive human-computer interaction by decreasing users' cognitive load through the use of a multimodal paradigm. This paper presents the results of a user experiment addressing the effects of cognitive load on multimodal behaviour. Using speech and hand gesture as input modalities, we explored the effect of task complexity on the temporal and semantic characteristics of MIP.

1.1 Related Work on Cognitive Load Theory and Analysis

Cognitive Load Theory (CLT) can be used in multimodal interaction research as a theoretical framework for formulating hypotheses and interpreting fine-grained multimodal behaviour under different conditions. The concept of cognitive load was developed within the field of educational psychology and refers to the amount of mental demand imposed by a particular task. It is associated with novel information and the limited capacity of working memory [11]. Since high levels of cognitive load can impede performance, efficiency and learning [11], CLT has primarily been used to help teachers design more efficient educational material, including multimedia presentations. The close association between CLT and working memory can be understood through the perspective of Baddeley's theory of modal working memory [1]. This theory contends that certain areas of working memory are reserved exclusively for modal use, e.g. the visuo-spatial sketchpad for spatial and image representations, and the phonological loop for verbal, linguistic and audio representations [1]. In human-computer interaction research, the benefits of multimodal interaction in complex tasks are well established [8], and there is evidence to suggest that users adapt their multimodal behaviour in complex situations to increase their performance [8].

Currently, cognitive load can be measured in a number of ways, for a variety of scenarios. Most commonly, it is measured by soliciting subjective load ratings after a task is completed [11]. Performance measures, such as scores, error rates and time-to-completion, have also been used. Subjects are rarely assessed in real time, and the probe method, which solicits subjective ratings during task completion, interrupts the user's task flow and potentially adds to the cognitive load [11]. A new method for measuring cognitive load that is unobtrusive and provides results in real time is therefore needed.

Given the known links between cognitive load and multimodality, we hypothesise that many aspects of users' multimodal interaction will change with increases in cognitive load: the spatial and temporal arrangement of individual modalities within multimodal productions, the semantic and syntactic structure of such productions, and even the selection of the preferred modality and changes in modality-specific features could all be cues to increases in cognitive load. The detection of these trends in user interaction could lead to reliable indicators of increased load, with the added benefit of being unobtrusively captured and continually assessed automatically by the system.

In this research, we differentiate the term "designed cognitive load", which refers to the complexity of the task (intrinsic and extraneous), from the term "experienced cognitive load", which is the actual degree of demand felt by a particular subject. Individuals' experienced cognitive load for the same task may change from time to time. The designed (or expected) cognitive load of a task can provide a coarse indication of the degree of experienced (or actual) cognitive load, but the latter can only be interpreted relative to other tasks.

1.2 Related Work on Multimodal Interaction Patterns

Multimodal interaction (or integration) patterns refer to micro-level relationships between the inputs comprising a multimodal production, for example in the temporal, spatial or semantic domain. Temporal relationships, in particular, are crucial for correctly interpreting individual productions, in both human-human and human-computer communication. Multimodal input fusion (MMIF) systems rely on such knowledge to validate or reject potential fusions of inputs. Most systems in the past have employed artificial or arbitrarily defined values, even though some methods using syntactic or semantic combinations based on machine learning have been developed [3]. The qualitative and quantitative aspects of these temporal relationships have been studied in a bid to provide better designs for MMIF modules, but also to advance the fundamental understanding of human communication.

Oviatt et al. analysed the preferred integration patterns when using pen and speech in the QUICKSET system; experiments involving speech and handwriting helped distinguish two groups of multimodal integrators: sequential and simultaneous. The first group produced multimodal inputs sequentially, one modality at a time, whereas the second group overlapped modalities, at least partially, in the time dimension. The authors also showed that machine learning could be used to quickly classify a user's temporal pattern as soon as they begin to interact [5,10]. Further to simple temporal analysis, their study also determined the types of tasks more likely to induce multimodal, as opposed to unimodal, interaction. It was found that spatial location commands (e.g. modify, move) represented 86% of all multimodal inputs, against 11% for selection commands (e.g. zoom, label) and 3% for function commands (e.g. print, scroll). Finally, the study also reported that the order of semantic constituents in the multimodal interactions in their corpus differed from the order in spoken English, mainly because locative constituents appeared at the beginning of multimodal inputs but at the end of the corresponding spoken inputs [10].

The study of MIP has been less prominent for other modality combinations, especially when input recognition is a major concern, for example in systems involving hand gesture. Bolt used a Polhemus sensor, based on electromagnetic field variations, to obtain the position and orientation of the hand relative to a fixed transmitter. The worn device is a cube of about 2 cm edge and requires a cord, hence is fairly cumbersome and obtrusive while sensing only limited information [2]. Data gloves later allowed more complex gestural inputs thanks to finger position sensing, but impede movement. Vision-based hand tracking and shape recognition became a reality a few years later, involving a range of algorithms and set-ups, e.g. single or stereo cameras. However, vision-based recognition rates are not yet satisfactory, so successfully combining such input with other modalities is currently seen as the most promising path to success.
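To make the sequential/simultaneous distinction concrete, the sketch below classifies a two-input production from the annotated onset and offset of each modal input. It is our own illustration in Python, not code from the cited studies, and the data representation is an assumption.

```python
# Illustrative sketch (not from the cited studies): classify a two-input
# multimodal production as sequential or simultaneous from the annotated
# start/end times (in seconds) of each modal input.
from typing import NamedTuple

class ModalInput(NamedTuple):
    modality: str   # e.g. "speech" or "gesture"
    start: float    # onset time in seconds
    end: float      # offset time in seconds

def integration_pattern(a: ModalInput, b: ModalInput) -> str:
    """Return 'simultaneous' if the two inputs overlap in time,
    'sequential' if one ends before the other begins."""
    overlap = min(a.end, b.end) - max(a.start, b.start)
    return "simultaneous" if overlap > 0 else "sequential"

# Example: the gesture starts before speech and overlaps it
print(integration_pattern(ModalInput("speech", 1.2, 2.0),
                          ModalInput("gesture", 0.9, 1.5)))  # simultaneous
```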


Attempts at MIP analysis in gesture and speech systems have been made, such as Hauptmann's framework to formalise and quantify speech and gesture interaction with a view to automation. That work studied user preferences and intuitive behaviours when using speech and hand gesture to move a 3D virtual object [4]. Very detailed statistics were reported on the usage of words, lexicon, syntactic structure, and hand and finger gestures, highlighting the type of multimodal interaction preferred by users and when it occurs. However, the results appear as an unstructured set of percentages, rather difficult to put in perspective or implement in a generic way. Finally, a psychological approach has been used to provide a formal framework for the study of speech and gesture interaction. For example, McNeill, Quek et al. established the existence of low-level relationships between gesture, speech and gaze, suggesting cross-modal segmentation based on specific features of those modalities [12]. More recently, an HMM-based implementation was used to improve the disambiguation of speech with gestures [6].

1.3 Objectives and Hypotheses

This study was designed to identify the relationships between combined speech and gesture input productions and users' cognitive load. The two input modalities are very familiar to users and psychologically closely interrelated, both in terms of planning and execution. Specifically, we hypothesise first that variations in redundant and complementary multimodal productions can reflect the cognitive load changes experienced by the user. Redundant multimodal productions are those that semantically double up information over a number of modalities (first example in Table 1). Complementary productions are, conversely, those that convey different semantic information over a number of different modalities in the same multimodal production (second example in Table 1). Partially redundant productions occur when only part of the production (i.e. the function or the object) is expressed redundantly across modalities, while the rest of the production is unimodal or complementary. We expected many redundant productions when the cognitive load was low and, as cognitive load increased, complementary productions to become more prevalent, as users would begin to instigate strategies to maximise the use of available working memory. To this end, all multimodal productions would be classified according to the degree of semantic redundancy in each one.

Table 1. Redundant vs. complementary multimodal productions

Turn                                               | Modality | Productions       | Semantics
REDUNDANT - Ex. 1: Select the library on a map     | Speech   | "Select library"  |
                                                   | Gesture  | Point to library  |
COMPLEMENTARY - Ex. 2: Mark an event at a location | Speech   | "Event"           |
                                                   | Gesture  | Point to library  |
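As an illustration of this classification, the sketch below labels a production by comparing the function and object conveyed in each modality. The slot-based representation and the example annotations are our own simplification, not the annotation scheme used in the study.

```python
# Sketch of the redundant / partially redundant / complementary distinction,
# under an assumed representation in which each modality of a production is
# annotated with the function and/or object it conveys (None if absent).
from typing import NamedTuple, Optional

class ModalSemantics(NamedTuple):
    function: Optional[str]   # e.g. "select", "zoom", "tag"
    object: Optional[str]     # e.g. "library", "top-right quadrant"

def redundancy_class(speech: ModalSemantics, gesture: ModalSemantics) -> str:
    doubled = [slot for slot in ("function", "object")
               if getattr(speech, slot) is not None
               and getattr(speech, slot) == getattr(gesture, slot)]
    if len(doubled) == 2:
        return "purely redundant"        # every constituent doubled up
    if len(doubled) == 1:
        return "partially redundant"     # only the function or the object doubled up
    return "complementary"               # no constituent doubled up

# Hypothetical annotations: a pointing gesture read as both selecting and
# identifying the library doubles up the spoken "Select library".
print(redundancy_class(ModalSemantics("select", "library"),
                       ModalSemantics("select", "library")))   # purely redundant
print(redundancy_class(ModalSemantics("tag-event", None),
                       ModalSemantics(None, "library")))        # complementary
```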

The second hypothesis relates to the temporal structure of multimodal productions. While there is evidence that users fall into one of two major groups of integrator pattern (i.e. simultaneous vs. sequential integrators), we also hypothesised that other factors of the communication, such as the type of information being manipulated (e.g. a function or an object), would also affect the way the temporal structure of the production is planned and executed. Knowledge of these patterns, the types of users who employ them and the domains in which they are likely to occur could make a significant contribution to the algorithms and strategies implemented in adaptive multimodal input recognition and fusion systems. Temporal integration patterns between two modalities can fall into one of nine integrator 'shapes': one sequential pattern and eight distinct simultaneous patterns [8,9,10]. However, we hypothesised that the temporal integration pattern of each user's production would be affected by the semantics of the information the user is trying to convey. For example, a user may exhibit different simultaneous patterns when zooming in on the map than when selecting a map entity. Hence, our null hypothesis is that the integration pattern is chosen independently of the semantics of the information.
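This null hypothesis can be tested with a standard chi-square test of independence on the pattern-by-command contingency table (7 temporal patterns by 4 command types in this experiment). The sketch below is illustrative only and uses made-up counts rather than the observed data.

```python
# Illustrative sketch: chi-square test of independence between temporal
# integration pattern (rows) and command type (columns). The counts are
# placeholders, not the observed data from this experiment.
import numpy as np
from scipy.stats import chi2_contingency

# rows: 7 temporal patterns; columns: 4 command types (hypothetical counts)
observed = np.array([
    [12,  3,  5,  1],
    [ 4, 10,  2,  3],
    [ 6,  2,  9,  4],
    [ 3,  8,  1,  7],
    [ 5,  1,  6,  2],
    [ 2,  4,  3,  9],
    [ 7,  2,  2,  5],
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
# A small p-value would lead us to reject the null hypothesis that the
# integration pattern is chosen independently of the command semantics.
```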

2 Experiment Design

2.1 Wizard of Oz Set-Up

The modalities selected for the experiment were speech, hand motion and a set of specific hand shapes, which are very prone to errors by automatic recognisers, so we opted for a Wizard of Oz (WOz) implementation. This technique has been identified as an essential tool for the design of multimodal interfaces where novel interaction patterns are expected to appear [14]. Indeed, WOz allows intuitive interaction while removing the bias caused by input recognition errors and misinterpreted semantic fusion of multimodal signals. Our wizard only performed manual recognition of speech and hand shapes, while hand motion detection was automated with an in-house, video-based recognition module. The hand tracking module was used by the subject to move a visual pointer on the system's graphical interface, which was echoed on the wizard's machine. This allowed the wizard to be aware of actions such as button clicks or manual selections made by the subjects. Inputs generated directly by the subject, or via the wizard, were then processed by the application logic in order to update the interface and progress the task. Subjects stood 2 m away from a large, wall-sized screen displaying the application. The camcorder and video camera used for tracking were both located on the right-hand side (all subjects reported being right-handed), as shown in Fig. 1.

Fig. 1. Subject interacting with the system

2.2 Application Scenario: Tasks and Modalities

Eliciting multimodal interaction under various levels of cognitive load requires carefully crafted task design. The use of novel modalities may only become salient for novel application functionality, as opposed to well-known applications, as suggested by Lisowska and Armstrong [7]. Moreover, it has been shown that spatial location tasks such as map interaction are more likely to generate combined multimodal interaction [10]. The study scenario involved updating a geographical map with traffic condition information, using either natural speech, or manual gesture, or a combination of these. Available gestures comprised:

– Deictic pointing to map locations, items, and function buttons;
– Circling gestures for zoom functions; and
– Predefined hand shapes (fist, scissors, etc.) for item tagging.

All tasks were designed to provide as much semantic equivalence as possible, meaning that the atomic actions required could be completed using any single modality or combination thereof. Subjects were shown multiple examples of interaction involving various blends of modalities and system functionality, and could practise them during a 30-minute training session. Table 2 provides examples of atomic actions and possible realisations using various modality combinations. Instructions for each new task were high-level situation descriptions, and subjects were allowed freedom of inspection during their response actions. The typical set of actions required for each situation was taught in training.

Table 2. System functionality and examples of inputs

System Functionality                                                             | Example of Interaction
Zooming in or out of a map                                                       | [circling gesture]; or Say: "Zoom in to the top right quadrant"
Selecting a location/item of interest                                            | [pointing gesture]; or Say: "St Mary's Church"
Tagging a location of interest with an 'accident', 'incident' or 'event' marker  | [pointing gesture] and: Say: "Incident"; or scissors shape
Notifying a recipient (item) of an accident, incident or an event                | [pointing gesture] and Say: "notify"; or fist shape and [pointing gesture]

Table 3. Levels of cognitive load

Level | Entities | Actions | Distractors | Time
1     | 6        | 3       | 2           | ∞
2     | 10       | 8       | 2           | ∞
3     | 12       | 13      | 4           | ∞
4     | 12       | 13      | 4           | 90 sec.

There were four levels of cognitive load, and three tasks of similar complexity for each level. The same map was used for each level to avoid differences in visual complexity, and hence in designed cognitive load. The designed cognitive load was controlled through the number of entities present on the map and the actions required in order to complete the task. Table 3 summarises the levels, which were induced by changes in (i) the number of distinct entities in the task description; (ii) the number of distractors (entities not needed for the task); (iii) the minimum number of actions required for the task; and (iv) a time limit to complete the task.
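For concreteness, the designed-load parameters of Table 3 can be captured in a small configuration structure such as the sketch below; the encoding and field names are our own illustration, since the task controller is not described at this level of detail.

```python
# Designed cognitive load levels from Table 3, encoded as a simple
# configuration structure (an illustration only; field names are assumed).
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LoadLevel:
    entities: int                  # distinct entities in the task description
    actions: int                   # minimum actions required to complete the task
    distractors: int               # entities not needed for the task
    time_limit_s: Optional[int]    # None means no time limit

LEVELS = {
    1: LoadLevel(entities=6,  actions=3,  distractors=2, time_limit_s=None),
    2: LoadLevel(entities=10, actions=8,  distractors=2, time_limit_s=None),
    3: LoadLevel(entities=12, actions=13, distractors=4, time_limit_s=None),
    4: LoadLevel(entities=12, actions=13, distractors=4, time_limit_s=90),
}
```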

2.3 Procedure

Twelve remunerated, randomly selected, native English-speaking participants (6 females, 6 males, aged 18-49) completed the study. As mentioned above, subjects were asked to perform a set of tasks under 3 different conditions: gesture-only, speech-only and multimodal. Each set consisted of 4 levels, with 3 tasks in each. The order of the conditions and of the tasks within the levels was randomised to counterbalance rank-order effects. Video, hand position and UI interaction data were synchronised and recorded digitally. Users were also debriefed after each task level and asked to provide a subjective ranking of the level of load relative to the other levels in that condition. The video data collected from the subjects was manually annotated: the start and end times of speech and gesture were annotated with a precision of one frame (25 fps). Gesture semantics were also tagged and speech was orthographically transcribed.
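Since modal inputs were time-stamped at one-frame precision (25 fps), timing measures can be derived directly from frame indices. The helper below is a sketch that assumes each annotation is stored as a (start_frame, end_frame) pair; it is not the annotation tooling actually used.

```python
# Sketch: derive simple timing measures from frame-level annotations at
# 25 fps, assuming each modal input is a (start_frame, end_frame) pair.
FPS = 25.0

def duration_s(interval: tuple) -> float:
    start, end = interval
    return (end - start) / FPS

def onset_lag_s(speech: tuple, gesture: tuple) -> float:
    """Positive if the gesture starts before speech, negative otherwise."""
    return (speech[0] - gesture[0]) / FPS

# Example: gesture spans frames 20-50, speech spans frames 30-75
print(duration_s((30, 75)))             # 1.8  (speech lasts 1.8 s)
print(onset_lag_s((30, 75), (20, 50)))  # 0.4  (gesture leads speech by 0.4 s)
```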

3 Results

3.1 Cognitive Load Analysis Based on Multimodal Redundancy and Complementarity

Out of the 12 subjects, only the data from 9 was usable: two users had difficulty comprehending the tasks, such that they could not achieve the task goals, and one did not finish for external reasons. The data collected for the multimodal condition for Levels 1, 2 and 4 was annotated for 6 of these 9 users. In total, 1119 modal inputs were annotated, forming 394 turns and 644 productions; however, smaller numbers were used for the analysis of individual levels. To measure the perceived level of cognitive load, users ranked the tasks in increasing levels of difficulty along a 9-point Likert scale; the average difficulty score for Levels 1, 2 and 4 across these 6 users was 2.2, 4.2 and 5 respectively. Level 3 data was not annotated due to lack of time.

Fig. 2. Proportion of purely redundant turns by level

For each user, we classified the multimodal turns into three groups: purely redundant, purely complementary and partially redundant turns. Fig. 2 shows the mean percentage and range of purely redundant turns across users, for each level, over all multimodal turns. However, statistical analysis was carried out on 5 of the 6 users, as one subject exhibited hardly any redundant behaviour, preferring to interact in a complementary manner. We observed a steady decrease in redundancy as task difficulty increased. A between-users ANOVA across levels shows significant differences between the means (F = 3.88, df = 2; p
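For reference, a between-users one-way ANOVA of this kind can be computed from the per-user proportions of purely redundant turns at each level; the sketch below uses placeholder values, not the study's data.

```python
# Minimal sketch of the reported analysis: one-way ANOVA over per-user
# proportions of purely redundant turns, grouped by task level.
# The numbers below are placeholders, not the observed values.
from scipy import stats

level_1 = [0.55, 0.48, 0.60, 0.52, 0.45]   # hypothetical per-user proportions
level_2 = [0.40, 0.38, 0.45, 0.36, 0.33]
level_4 = [0.25, 0.30, 0.28, 0.22, 0.27]

groups = [level_1, level_2, level_4]
f_stat, p_value = stats.f_oneway(*groups)
df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_value:.3f}")
```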
