Toward Optimization of Multimodal User Interfaces for Tactical Audio Applications

Zeljko Obrenovic 1, Dusan Starcevic 1, Emil Jovanov 2

1 Faculty of Organizational Sciences, The University of Belgrade, Jove Ilica 154, 11000 Belgrade, +381 11 3950 894
{[email protected]; [email protected]}

2 Electrical and Computer Engineering Dept., The University of Alabama, 213 EB, Huntsville, AL 35899, (256) 824 5094
[email protected]
Abstract. Tactical audio uses audio feedback to facilitate the precise and accurate positioning of an object with respect to some other object. Existing solutions for pointing and trajectory-based Human Computer Interface (HCI) tasks have primarily explored visual feedback, sometimes in rather limited conditions. In this paper we examine different sonification paradigms for tactical audio aimed at improving the accuracy of pursuit tracking. We have developed a multimodal simulation system as an open environment for the evaluation of different sonification and visualization techniques in tactical audio applications. The environment is implemented as an audio-visual scene using the Java3D package. The paper presents the quantitative results of three pursuit tracking experiments using a combination of acoustic and visual guidance under different background conditions. Experiments with 19 participants have shown that acoustic presentation improves the quality of human-computer interaction and reduces the error during pursuit tracking tasks by up to 19%. However, the experiments have also shown that these benefits do not exist in all conditions, indicating the existence of perceptual boundaries of multimodal HCI for different scene complexities and target speeds. Significant benefits of the audio modes exist for medium complexity of interaction. Users' questionnaires indicate that users are not aware of the quantitative benefits of the applied sonification paradigms. We have shown that the most appealing paradigms are not the most effective ones, which necessitates careful quantitative analysis of proposed multimodal HCI paradigms.
1 Introduction

Sonification, as acoustic data presentation, can significantly improve the quality of human-computer interaction, particularly in the case of virtual/augmented reality systems.
One of the most important advantages of audio modes in augmented/virtual reality systems is that a computer-generated sonic scene leaves the visual field unimpaired, unobstructed, and ready for investigation of the environment for surprises [1, 2]. However, the use of sound is not a panacea, and wider acceptance of sonification and tactical audio applications will primarily depend on the quality of user interfaces and on how quickly users can learn to use the environment effectively. Exploration of the perceptual features of the plethora of sonification paradigms requires a flexible environment that can easily be customized to suit user or application needs [3].

In this paper we present our multimodal simulation and training system for tactical audio applications and the quantitative results of an experimental evaluation of selected sonification techniques. We investigated different sonification paradigms in customized sonification environments for dynamic target tracking applications. The primary contribution of this paper is the experimental evaluation of user interaction in dynamic trajectory-based environments using a combination of acoustic and visual guidance. We have shown that acoustic presentation improves the quality of human-machine interaction and reduces the tracking error during guidance tasks.

Typical issues and a survey of sonification and tactical audio applications are given in Section 2. The design and implementation of the environment are presented in Section 3. The organization of the experiments is described in Section 4, while the analysis of the experimental results is given in Section 5. The results and a discussion of user performance are given in Section 6. Section 7 concludes the paper.
2 Sonification and multimodal interaction

Sonification is the second most important presentation modality after visualization [4, 5]. The relationship between visualization and sonification is itself a complex design problem, due to the nature of cognitive information processing. The efficiency of sonification, as an acoustic presentation modality, depends on the other presentation modalities [6]. Some characteristics of visual and acoustic perception, such as spatio-temporal resolution, are complementary; therefore, sonification naturally extends visualization. A system that supports sonification must provide the ability to extract the relevant information features. Some useful guidelines on the use of sound and non-speech audio cues can be found in [7, 8]. It has been shown that the use of sound, especially when coupled with other modalities of interaction, such as pointing and gesturing, improves multimodal interaction [7]. Well-designed multimodal systems integrate complementary modalities to yield a highly synergistic mix in which the strengths of each mode are used to overcome weaknesses in the others. Such systems can function more robustly than unimodal systems [9, 10]. Sonification can also be used to facilitate insight into complex phenomena; we have applied sonification of brain electrical activity to improve the presentation of its complex spatio-temporal patterns [11].
2.1 Tactical Audio

Tactical audio systems use audio feedback to facilitate the precise and accurate positioning of an object with respect to some other object [1, 2]. Tactical audio has valuable applications in the field of surgery. Just as musicians use aural feedback to position their hands, surgeons could position instruments according to a pre-planned trajectory, pre-placed tags or cues, or anatomical models. The underlying problem is that the instrument's target position inside the body is not directly visible; although ultrasound and other imaging modalities attempt to alleviate this problem, the nature and configuration of the equipment requires the surgeon to take his/her eyes off the patient. The use of tactical audio feedback enables the surgeon to effect a precise placement by enhancing his/her comprehension of the three-dimensional position of a surgical instrument with respect to some predetermined desired position within the patient's body. This clearly benefits both patient and healthcare provider [2]. However, more research has to be done in order to develop more efficient tactical audio applications.
3 The Design of a Multimodal Simulation System

We have developed a multimodal simulation and training environment for the experimental evaluation of different sonification paradigms for tactical audio. The environment allows recording of the relevant parameters of user interaction during guidance tasks. It was developed to support quantitative evaluation of the effectiveness of sonification methods and to assess users' learning curves in a large population of users [3, 12]. In addition, our goal was to investigate the suitability of standard PC software and hardware modules for efficient implementation of tactical audio applications.

3.1 The Architecture

The architecture of the multimodal test environment is shown in Figure 1. The environment consists of an interaction space and a control interface. The main interaction between the user and the environment occurs in the interaction space, where the user tracks an animated target on a 2D polygon. The interaction space consists of a multimodal integration module and a path interpolator. The multimodal integration module performs the multimodal presentation of data. We used one graphics mode, in the form of an animated circle; in the experiments the user simply tries to "cover" the center of the target with the cursor. We also used two different sound modes: a variable-position sound object representing the target position in 3D acoustic space [13, 14], and a variable-intensity sound object representing the distance from the target, i.e. the error in tracking the target object. The user can use any combination of these three modes. The path interpolator calculates a new target position on each timer tick, using a discrete set of coordinates to provide a "smooth" trajectory in space and time.
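As an illustration of this architecture, the following minimal Java sketch shows one way the per-tick update could be organized: the target position is linearly interpolated between path waypoints, and both sound modes are driven from the same tick. This is not the authors' actual code; all class, field, and method names are hypothetical, and the pixel-to-scene and error-to-gain mappings are assumptions. Only the Java3D sound calls (PointSound, BackgroundSound, setPosition, setInitialGain) reflect the real API.

import javax.media.j3d.BackgroundSound;
import javax.media.j3d.MediaContainer;
import javax.media.j3d.PointSound;
import javax.media.j3d.Sound;
import javax.vecmath.Point3f;

// Hypothetical sketch of the multimodal integration module (names assumed).
public class MultimodalIntegration {

    private final float[][] waypoints;           // discrete (x, y) path coordinates
    private final PointSound positionalSound;    // target position in 3D acoustic space
    private final BackgroundSound distanceSound; // intensity encodes the tracking error
    private float progress;                      // fractional index into the waypoint list

    public MultimodalIntegration(float[][] waypoints, MediaContainer sample) {
        this.waypoints = waypoints;

        positionalSound = new PointSound();
        positionalSound.setSoundData(sample);
        positionalSound.setLoop(Sound.INFINITE_LOOPS);
        // Capability bits must be granted before the scene graph goes live.
        positionalSound.setCapability(PointSound.ALLOW_POSITION_WRITE);

        distanceSound = new BackgroundSound();
        distanceSound.setSoundData(sample);
        distanceSound.setLoop(Sound.INFINITE_LOOPS);
        distanceSound.setCapability(Sound.ALLOW_INITIAL_GAIN_WRITE);
    }

    // Called on each timer tick; 'step' follows from target speed and frame rate.
    public void tick(float step, float cursorX, float cursorY) {
        // Linear interpolation between consecutive waypoints gives a smooth trajectory.
        progress = Math.min(progress + step, waypoints.length - 1);
        int i = (int) progress;
        int j = Math.min(i + 1, waypoints.length - 1);
        float t = progress - i;
        float x = waypoints[i][0] + t * (waypoints[j][0] - waypoints[i][0]);
        float y = waypoints[i][1] + t * (waypoints[j][1] - waypoints[i][1]);

        // Positional sound mode: place the source at the target in the acoustic scene.
        positionalSound.setPosition(new Point3f(toScene(x), toScene(y), 0f));

        // Distance sound mode: one possible mapping, louder sound for a larger error.
        float dx = x - cursorX, dy = y - cursorY;
        float error = (float) Math.sqrt(dx * dx + dy * dy);
        distanceSound.setInitialGain(Math.min(error / 500f, 1f)); // 500 x 500 px space
    }

    private float toScene(float pixels) {
        return pixels / 250f - 1f; // assumed mapping of [0, 500] px to [-1, 1] units
    }
}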
In the control interface the user can set various parameters, including the multimodal combination, the speed, and file parameters. The control interface also manages generation of the path from a text file and writes the results to a file. A path file is a simple text file that contains the trajectory of the object as a discrete set of X and Y coordinates. A result file is an XML file that contains samples of the user's cursor coordinates together with samples of the object's coordinates, as well as a description of the experimental conditions.
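For illustration only (the paper does not specify the exact layouts), a path file could consist of one "X Y" pair per line, e.g. "100 250", "140 262", "185 240", while a result file might look something like this hypothetical fragment, with all element and attribute names assumed:

<result experiment="E2" mmc="3" speed="2" duration="35">
  <sample t="120" targetX="140" targetY="262" cursorX="137" cursorY="266"/>
  <sample t="140" targetX="144" targetY="263" cursorX="141" cursorY="264"/>
</result>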
Fig. 1. The architecture of our environment.

Although the environment is used to estimate the possible performance improvement from tactical audio support, it is designed to be flexible, allowing easy plug-and-play addition of new graphical and acoustic modes. The environment is based on commercially available technology, implemented using the Java3D package [15] and standard Java components. This approach allows transparent execution of the environment on various platforms, including stand-alone workstations as well as distributed Web environments in the form of a Java applet. The Java3D package is primarily used for acoustic 3D effects, as it allows relatively easy real-time manipulation of sound objects.

The environment can be used in two modes: training and experiment mode. In the training mode a user can arbitrarily set the multimodal combination, the speed of the object, and the start and duration of a session. This mode is particularly suitable for user training before the experiment and for pilot testing of new modes. The experiment mode is designated for predefined tests, where users are not allowed to change any of the parameters in the environment. All parameters are preset in a file. An experiment is defined as a sequence of tests; for each test we define the speed of the target object, the duration of interaction, the length of the pause between two tests, and the presentation modes used in the test.
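By way of example, such a preset file could enumerate the test sequence along the lines of the following purely hypothetical format (the actual syntax used by the environment is not given here):

# speed-ID  duration(s)  pause(s)  presentation modes
1  35  5  graphics
1  35  5  graphics+distance
1  35  5  graphics+positional
1  35  5  graphics+distance+positional
2  35  5  graphics
...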
3.2 The Multimodal Presentation Environment

We have developed and tested three presentation variants of the environment in order to facilitate testing of interaction complexity:

• E1 - Homogeneous background. The simplest environment presents the circular object moving over a homogeneous single-color background.

• E2 - Static heterogeneous background. In this environment the object moves over a static heterogeneous background, which effectively adds visual noise to the graphical presentation. The static heterogeneous background was simulated with a static picture of a map (Figure 2).
Fig. 2. The user interface of the environment with static heterogeneous background.
• E3 - Moving background. In this variant of the environment the object is part of the static heterogeneous background. The object does not move relative to the background; instead, the whole background, including the object, moves across the screen. This type of interaction can be illustrated by a flight scenario in which the pilot has to pursue a static object on the ground while flying an aircraft.
4 Experiments

The target tracking experiment was conceived to assess user performance in two-dimensional pursuit tracking applications. We organized three experiments, each with one variant of the environment, as described in Section 3.2. The experimental setup, parameters, methods, and procedures were the same for all tests.

4.1 Method

The task of the user was to pursue a moving object on the screen as closely as possible. The target object was positioned according to a predefined trajectory taken from a file, and the coordinates of both the target object and the cursor were written to the result file for off-line processing. We created four multimodal combinations (MMCs) (see Table 1). Each of the multimodal combinations was tested at four object speeds (see Table 2).

Table 1. Multimodal combinations (MMCs) used in the experiments.
                   MMC1   MMC2   MMC3   MMC4
Graphics mode       +      +      +      +
Distance sound      -      +      -      +
Positional sound    -      -      +      +
Table 2. Average speeds of the target object in the experiments.

Speed ID                              1     2     3     4
Average speed (pixels per second)    50   100   200   300
The experiments ran on an 800 MHz Intel Pentium III desktop PC with 128 MB of RAM, running the Windows 2000 Professional operating system. The computer had an S3 Incorporated Trio3D/2X graphics card, a Samsung 15'' SyncMaster 550s monitor, and an integrated CMI 8738/C3DX PCI audio device. Participants used headphones for better spatial sound perception. The screen resolution was set to 800 x 600 pixels, and the dimensions of the interaction space were 500 x 500 pixels. Mouse motion sensitivity was set to medium. The frame rate was about 50 frames per second for the first two experiments and about 30 frames per second for experiment E3.

4.2 Participants

Nineteen persons participated in the experiments. Participants were unpaid volunteers from a military academy, aged 19 to 30 years. All participants had normal or corrected-to-normal vision and normal hearing, and were right-handed. All of them regularly used a GUI and mouse. Participants were asked to fill in a questionnaire that assessed their prior experience with computers and mouse use. Twelve participants completed all three experiments; the sequence of experiments was not the same for all participants in this group. Six participants took part in only one experiment, while one participant took part in two experiments.

4.3 Experimental Procedure

All three experiments followed the same procedure. Initially, participants practiced interaction with the environment in the training mode. During this period all the speeds and multimodal combinations covered in the experiment were presented to the user. Each experiment consisted of 16 tests (4 multimodal combinations x 4 speeds). Each test lasted 35 seconds, so the total duration of one experiment for each participant was around 10 minutes. This relatively short test duration was chosen because longer sessions resulted in strong finger and wrist fatigue. Each experiment started at the lowest speed, where all combinations were tested, after which the speed was increased. Before each new test at a higher speed, the participant was able to make a pause, and we encouraged participants to rest before proceeding to the next level. After completing the experiment, participants were asked to rank the multimodal combinations from one to four, where one represents the best combination.
5 Analysis

Positions of the target object and the cursor were collected directly by the scene generation software. The data were then prepared for statistical analysis by computing the values of the tracking error for each trial, and the average tracking error for each speed and multimodal combination. We excluded trials of participants who reported "loss of control" during interaction due to lack of attention or problems with handling the mouse; this removed approximately 5% of all trials. For each sample in all experiments we calculated the error as the Euclidean distance between the user's cursor and the target object. We then calculated the mean error and standard deviation for each multimodal combination and speed. We also calculated the benefit ξ of using multimodal HCI compared to the pure graphics mode (MMC1) running at the same speed as:
\xi = 1 - \frac{error_{AUDIO}}{error_{MMC1}}    (1)
where error_MMC1 represents the average error for the first multimodal combination (MMC1), and error_AUDIO is the average error for one of the multimodal combinations with audio modes (MMC2-MMC4) running at the same speed. We thus defined three benefit values, one for each multimodal combination with sound modes. We calculated paired t-tests between the multimodal combinations with audio modes (MMC2-MMC4) and the pure graphics mode (MMC1) at the same target speed.
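The following short Java sketch summarizes this analysis step. It is a minimal re-statement of the computations described above (per-sample Euclidean error, mean error per test, and the benefit of Eq. (1)), not the authors' actual analysis code; all names are illustrative.

import java.util.List;

// Hypothetical helper mirroring the off-line analysis described above.
public class TrackingAnalysis {

    // One logged sample: target and cursor positions at the same time instant.
    public record Sample(double targetX, double targetY, double cursorX, double cursorY) {}

    // Mean Euclidean distance between cursor and target over one test.
    public static double meanError(List<Sample> samples) {
        double sum = 0;
        for (Sample s : samples) {
            double dx = s.targetX() - s.cursorX();
            double dy = s.targetY() - s.cursorY();
            sum += Math.sqrt(dx * dx + dy * dy);
        }
        return sum / samples.size();
    }

    // Benefit of an audio combination vs. MMC1 at the same speed, Eq. (1).
    // A positive value means the audio mode reduced the tracking error.
    public static double benefit(double errorAudio, double errorMmc1) {
        return 1.0 - errorAudio / errorMmc1;
    }
}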
6 Results and Discussion

Figures 3, 4, and 5 illustrate the dependencies among speed, multimodal combination, and the benefits of multimodal HCI for all three experiments. Values marked with '*' are statistically significant.
Fig. 3. Benefits of multimodal HCI in the first experiment (Homogeneous background).
Fig. 4. Benefits of multimodal HCI in the second experiment (Static heterogeneous background).
It can be seen that the spatial audio mode significantly improved pursuit precision at the second target speed: the benefits of the MMC3 and MMC4 combinations were significant and positive in all three experiments, amounting to 15% and 19% in the first experiment, 18% and 15% in the second, and 14% and 8% in the third. The second experiment also exhibited benefits for all audio modes at the third speed, although these benefits were smaller than at the second target speed (12%, 8%, and 10%). The third experiment showed a significant benefit for the distance mode (MMC2) at the first speed (18%). For the other tests the benefits were smaller and not statistically significant.
Fig. 5. Benefits of multimodal HCI in the third experiment (Moving background).
Table 3 summarizes the results of the analysis for all experiments, grouped by speed and multimodal combination. The average errors in all three experiments are
shown in the first three columns, while the benefits of using audio modes compared to the pure graphics mode at the same speed are shown in the last three columns.

Table 3. Average distance errors with standard deviations, and benefits of the multimodal combinations, for all three experiments (E1, E2, E3). Values marked with '*' are statistically significant.
             Average error in pixels (SD)                        Benefit in %
Speed  MMC   E1             E2             E3                  E1      E2      E3
1      1      7.40 (2.42)    7.18 (1.60)    7.69 (1.45)         -       -       -
       2      7.03 (2.05)    7.45 (2.40)    6.55 (1.29)*        5.3    -3.7    17.5
       3      7.47 (2.04)    7.51 (1.68)    7.92 (2.14)        -0.9    -4.4    -2.8
       4      7.27 (3.36)    7.57 (1.92)    7.69 (1.58)         1.7    -5.2     0.1
2      1     12.81 (3.30)   13.74 (3.39)   12.37 (1.68)         -       -       -
       2     12.14 (2.87)   12.99 (3.91)   11.79 (1.48)         5.6     5.8     4.9
       3     11.17 (2.22)*  11.68 (2.42)*  10.82 (1.73)*       14.7    17.6    14.3
       4     10.76 (1.75)*  11.96 (2.74)*  11.44 (2.00)*       19.1    14.8     8.2
3      1     21.21 (2.60)   24.20 (2.96)   23.19 (2.84)         -       -       -
       2     21.38 (3.33)   21.57 (3.33)*  22.30 (3.71)        -0.8    12.2     4.0
       3     20.25 (3.49)   22.45 (4.04)*  22.31 (3.42)         4.7     7.8     3.9
       4     20.24 (3.50)   22.09 (4.10)*  22.84 (4.63)         4.8     9.5     1.5
4      1     33.45 (3.79)   36.75 (4.21)   40.18 (6.79)         -       -       -
       2     34.20 (5.55)   37.24 (5.48)   38.96 (4.86)        -2.2    -1.3     3.1
       3     33.72 (3.79)   35.61 (3.86)   38.92 (6.27)        -0.8     3.2     3.2
       4     33.91 (4.78)   36.70 (5.30)   39.92 (8.86)        -1.4     0.1     0.6
Paired t-tests between the multimodal combinations with audio modes (MMC2-MMC4) and the pure graphics mode (MMC1) at the same speed revealed significant differences in average error in ten tests. In the first experiment the only significant effects existed at the second target speed, for MMC3 (p=0.009) and MMC4 (p=0.002); both combinations use the spatial sound mode. In the second experiment significant changes can be seen at the second and third target speeds. At the second speed a significant effect existed for the multimodal combination with the stereo position sound mode, MMC3 (p