Overriding Errors in a Speech and Gaze Multimodal Architecture

Qiaohui Zhang
Department of Computer and Media, University of Yamanashi
81-55-2208505
[email protected]

Atsumi Imamiya
Department of Computer and Media, University of Yamanashi
81-55-2208510
[email protected]

Kentaro Go
Center for Integrated Information Processing, University of Yamanashi
81-55-2208084
[email protected]

Xiaoyang Mao
Department of Computer and Media, University of Yamanashi
81-55-2208652
[email protected]

ABSTRACT
This work explores how to use gaze and speech commands simultaneously to select an object on the screen. Multimodal systems have long been a key means of reducing the recognition errors of individual components, but multimodal integration generates errors of its own. The present study classifies these multimodal errors, analyzes their causes, and proposes solutions for eliminating them. The goal is to gain insight into multimodal integration errors and to develop an error self-recoverable multimodal architecture, so that error-prone recognition technologies perform at a more stable and robust level within the multimodal architecture.

Categories and Subject Descriptors
H.5.2 [Information Systems]: Information interfaces and presentation – user interfaces, evaluation/methodology.

General Terms
Performance, Experimentation, Human Factors, Measurement.

Keywords
Multimodal architecture, recognition errors, multimodal errors, eye tracking, speech input.

1. INTRODUCTION
The gaze-based interface can provide a rapid, natural and convenient input method [3], and speech is a main medium of daily human communication. Our work is therefore concerned with the integration of speech and eyesight.

However, both eye tracking and speech recognition remain immature. The purpose of our research is to investigate how to integrate these two error-prone modalities in a manner that supports the mutual correction of individual recognition errors, as well as resolving the ambiguities of the speech signals.

The system of [7] addressed the inaccuracy of gaze and speech recognition through probabilistic integration of the two modalities. However, it handled the ambiguous selection problem by repositioning and enlarging the vicinity of the user's fixation, so it could not provide a rapid selection interface. The speech and pen-gesture multimodal system of [6] showed that multimodal systems can support mutual disambiguation of errors and lead to more robust performance.

Reference [4] described an interface that accepted simultaneous speech, gesture and gaze input from a user. Similar systems [1][2][5] used eyesight, the mouse, or pointing gestures to determine deictic referents in a spoken dialogue system. Although these systems could address the multiple-reference problem caused by a user's ambiguous speech input, each input was considered error-free, so no recognition errors were handled in their work. Our system attempts to correct recognition errors within the multimodal architecture, and aims to respond to the user with correct results.

2. THE GAZE- AND SPEECH-INTEGRATED MULTIMODAL SYSTEM
We make use of the n-best lists of multiple channels to produce a multimodal result. The candidates in the gaze n-best list are ranked according to the distance from the gaze fixation to the objects; the object nearest the gaze ranks first. The candidates in the speech n-best list are ranked according to the score reported by the speech recognition software; the one with the highest score ranks first.
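To make the two component lists concrete, the following is a minimal Python sketch of how they might be built. The Obj structure, the attribute-set matching, and the (attributes, score) form of the recognizer hypotheses are illustrative assumptions of ours; the paper does not specify its data representation.

import math
from dataclasses import dataclass

@dataclass
class Obj:
    name: str        # e.g. "B"
    x: float         # screen position of the object (cm)
    y: float
    attributes: set  # e.g. {"yellow", "circle", "big"}

def gaze_nbest(objects, fix_x, fix_y, radius_cm):
    # Objects inside the eye operative region, ranked by distance
    # from the gaze fixation; the nearest object ranks first.
    scored = [(math.hypot(o.x - fix_x, o.y - fix_y), o) for o in objects]
    inside = [(d, o) for d, o in scored if d <= radius_cm]
    return [o for d, o in sorted(inside, key=lambda p: p[0])]

def speech_nbest(objects, hypotheses, reject_threshold=0.0):
    # (object, score) pairs for objects whose attributes match a recognizer
    # hypothesis, ranked by recognition score; the highest score ranks first.
    # `hypotheses` is a list of (recognized_attribute_set, score) pairs.
    best = {}
    for attrs, score in hypotheses:
        if score < reject_threshold:
            continue
        for o in objects:
            if attrs <= o.attributes and score > best.get(o.name, (None, -1.0))[1]:
                best[o.name] = (o, score)
    return sorted(best.values(), key=lambda p: p[1], reverse=True)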

2.1 Integrating Speech and Gaze
The candidate with the highest speech recognition score in the intersection of the speech n-best list and the gaze n-best list becomes the multimodal integration result. With this rule, within the vicinity of the user's gaze (which we call the eye operative region), speech is the decisive factor. If multiple objects are matched simultaneously because of the user's ambiguous description, the one nearest the gaze fixation becomes the multimodal result.
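A sketch of this integration rule, reusing the hypothetical lists above: the intersection member with the highest speech score wins, and gaze proximity decides among equally scored (i.e. ambiguously described) candidates.

def integrate(gaze_list, speech_scored):
    # Multimodal result: the intersection member with the highest speech
    # recognition score; among equal scores (an ambiguous description),
    # the object nearest the gaze fixation wins. Returns None when the
    # two n-best lists do not intersect.
    gaze_rank = {o.name: i for i, o in enumerate(gaze_list)}
    candidates = [(score, gaze_rank[o.name], o)
                  for o, score in speech_scored if o.name in gaze_rank]
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c[0], -c[1]))[2]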

This mutual correction can be realized because inappropriate candidates generated by the individual recognizers are eliminated during the integration process. Each individual mode can impose semantic, temporal and spatial constraints on the multimodal integration, ruling out unreasonable items and leading to a correct multimodal result. The gaze information gives a spatial limit to the multimodal integration, whereas the speech information gives a semantic limit. As shown in Figure 1, the gaze information rules out object A, while the speech information rules out objects C, E and F, so we obtain the correct result B. Although both the eyesight and the speech are misinterpreted, the correct multimodal result can be restored.

[Figure 1 shows six objects, A (gold), B (yellow), C (silver), D (blue), E (yellow) and F (green), together with the eye cursor, the eye operative region, the intended target B, and the speech, gaze and multimodal n-best lists.]

Figure 1. An example of mutual correction of gaze and speech.
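The Figure 1 scenario can be replayed with the sketches above. The positions, shapes and recognition scores below are invented purely for illustration; only the object names and colours come from the figure.

objects = [
    Obj("A", 1.0, 1.0, {"gold", "circle"}),   Obj("B", 4.0, 1.0, {"yellow", "circle"}),
    Obj("C", 7.0, 1.0, {"silver", "square"}), Obj("D", 1.0, 4.0, {"blue", "square"}),
    Obj("E", 4.0, 4.0, {"yellow", "square"}), Obj("F", 7.0, 4.0, {"green", "circle"}),
]
# The eye tracker reports a fixation nearer to C than to the intended target B,
# and the recognizer's top hypothesis ("gold circle") is wrong.
gaze = gaze_nbest(objects, fix_x=6.0, fix_y=1.5, radius_cm=3.0)    # [C, B, F]; A is ruled out
speech = speech_nbest(objects, [({"gold", "circle"}, 0.9),
                                ({"yellow", "circle"}, 0.6)])      # [(A, 0.9), (B, 0.6)]
print(integrate(gaze, speech).name)                                # -> "B"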

3. CLASSIFYING THE MULTIMODAL ERRORS
The integration strategies are developed with the goal of integrating eye-movement data and voice commands with as few errors as possible. However, the multimodal integration may also produce erroneous results. In this section, we discuss why the integration process causes these errors, as well as the solutions for minimizing them.

The multimodal integration errors can be classified into four categories (Table 1). In Table 1, ∈ indicates that the intended target is included in the corresponding n-best list, and ∉ indicates that it is not. "RG(target)" is the rank of the intended target on the gaze n-best list, and "RG(Mresult)" is the rank of the multimodal integration result on the gaze n-best list. Similarly, RM and RS denote the ranking positions on the multimodal and the speech n-best lists, respectively.

Table 1. The classified multimodal errors.

Error type             Gaze n-best list   Speech n-best list   Ranking position
Small-boundary error   target ∉           target ∈             -
Large-boundary error   target ∈           target ∈             RS(Mresult) < RS(target), RG(target) < RG(Mresult)
Speech-related error   target ∈           target ∉             -
Unavoidable error      target ∈           target ∈             RS(Mresult) < RS(target), RG(Mresult) < RG(target)
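Under the same hypothetical structures as above, the four categories of Table 1 can be assigned mechanically; the rank comparisons below follow our reading of the descriptions in this section, not code from the paper.

def classify_error(target, result, gaze_list, speech_scored):
    # Classify an incorrect multimodal result into the Table 1 categories;
    # returns None when the result is correct.
    if result is not None and result.name == target.name:
        return None
    gaze_names = [o.name for o in gaze_list]
    speech_names = [o.name for o, _ in speech_scored]
    if target.name not in gaze_names:
        return "small-boundary error"   # eye operative region set too small
    if target.name not in speech_names:
        return "speech-related error"   # recognizer rejected or missed the target
    # Target is on both lists, yet another candidate won on the speech list.
    if result is not None and gaze_names.index(result.name) < gaze_names.index(target.name):
        return "unavoidable error"      # the result outranks the target on both lists
    return "large-boundary error"       # an over-large region let a farther object win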

The small-boundary error occurs because the intended target is not included in the gaze n-best list; that is, the threshold of the eye operative region is too small. This kind of error can be recovered from by enlarging the eye operative region so that the target is included in the gaze n-best list. The large-boundary error occurs when a candidate that ranks lower than the target on the gaze n-best list nevertheless becomes the multimodal result. This means that the eye operative region is set larger than necessary and encloses too many irrelevant items. This kind of error can be resolved by reducing the size of the eye operative region; just including the intended target in the gaze n-best list is sufficient to get a correct result.

The speech-related error is caused by speech recognition errors. The rejection threshold of the speech recognizer should be low enough for the target to be included in the speech n-best list. The unavoidable error occurs when another candidate ranks higher than the target on both component n-best lists; no matter how the size of the eye operative region is adjusted, a non-target object always ranks higher than the target on the final multimodal n-best list. Note that if an erroneous integration occurs although the intended target belongs to both component n-best lists (Table 1), the target must rank low on the speech n-best list, i.e. RS(Mresult) < RS(target), because if the target ranked high on the speech n-best list, a correct result would be produced by our integration strategies.

For the unavoidable errors, the only way to reduce them is to improve the accuracy of the eye tracking and the speech recognition. For the errors related to the eye operative region, there is a trade-off between the large-boundary errors and the small-boundary errors: a small eye operative region risks excluding the intended target, whereas a large one may include more unrelated items. Through an experiment, we investigate how to adjust this trade-off so as to minimize both kinds of errors.

4. EVALUATION EXPERIMENT
4.1 Participants and Apparatus
Twelve non-native Japanese speakers participated in the experiment. Since the speech recognition rate would be poorer for non-native speakers than for native ones, we predicted that the multimodal method would provide even more benefit for non-native speakers. A head-mounted eye tracker (NAC EMR-HM8) was connected to a PC; it tracked the user's eyesight, and fixations were calculated in real time. Our fixation detection algorithm is explained in [8]. The voice commands were processed using IBM ViaVoice 8.0 with a tabletop microphone. Subjects sat in front of a 15-inch LCD display at a distance of 70 cm.

4.2 Stimuli and Experimental Task
The stimuli in the experiment are shown in Figure 2. The objects have 10 colors, 2 sizes and 3 shapes. The big shapes measure 20 mm x 13 mm; the small shapes are two thirds of that size. The distance between the centres of two adjacent shapes is 40 mm x 27 mm.

Figure 2. The screen layout of the experimental task.



The experimental task was to select a highlighted target from the grid of objects. If the selection was successful, the next target was highlighted; otherwise, the user selected it again, with a maximum of three tries per target. This process continued until the 12 objects in the central area had been selected, which completed one session.

4.3 Parameters and Experimental Design
The independent parameters in our experiment are as follows:

(1) Speech grammars: G1=color, G2=color+shape, G3=size+color+shape

The verbal commands were the attributes of the target, and were classified into three categories in order to investigate the contribution of the gaze information to simplifying the user's speech. G3 provides a single unambiguous identification for each object, whereas G1 and G2 may refer to multiple objects.

(2) Radius of the eye operative region: R1=1cm, R3=3cm, R5=5cm, R7=7cm, R9=9cm

This parameter was varied in 2cm steps in order to investigate the influence of different eye operative regions on the performance of the multimodal architecture. A within-subjects design was used for both independent parameters. Two sessions were conducted for each condition. Each subject performed a total of 3 (grammar) x 5 (region) x 2 (session) x 12 (trial) = 360 trials.
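The within-subjects design can be written out as a simple enumeration of conditions; the labels below mirror the parameters above, and no trial ordering or counterbalancing is assumed.

from itertools import product

GRAMMARS = ["G1", "G2", "G3"]      # color / color+shape / size+color+shape
RADII_CM = [1, 3, 5, 7, 9]         # radius of the eye operative region
SESSIONS = [1, 2]
TRIALS   = range(1, 13)            # 12 highlighted targets per session

conditions = list(product(GRAMMARS, RADII_CM, SESSIONS, TRIALS))
assert len(conditions) == 3 * 5 * 2 * 12 == 360   # trials per subject, as above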


4.4 Experimental Results


The experiment revealed that the small-boundary errors dropped considerably and the large-boundary errors increased as the eye operative region was enlarged. At region R3, the errors related to the eye operative region were the lowest among all the regions. That is, when the radius of the eye operative region was set to the user's gaze deviation from the targets, the boundary errors related to the eye operative region were minimized. The user's gaze deviation is the distance between the eye cursor (the eye position reported by the eye tracker) and the actual focus position. When the radius equals the gaze deviation, the intended target is included in the gaze n-best list, and there are no other irrelevant objects in the list. When the radius threshold is smaller than the gaze deviation, the intended target is not included in the gaze n-best list, so the small-boundary errors increase. When the threshold is larger than the gaze deviation, many unnecessary objects are included in the gaze n-best list, so the large-boundary errors increase.
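The rule "set the radius to the user's gaze deviation" can be expressed directly. The helper below is hypothetical and averages the deviation over recorded samples, which is one plausible way to arrive at a per-user setting such as the 3 cm of region R3; the sample values are invented.

import math

def gaze_deviation(samples):
    # Mean distance between the reported eye cursor and the actual focus
    # position; `samples` is a list of ((cursor_x, cursor_y), (target_x, target_y)).
    return sum(math.hypot(cx - tx, cy - ty)
               for (cx, cy), (tx, ty) in samples) / len(samples)

# Example: set the region radius to the deviation measured from two samples.
samples = [((4.2, 1.3), (4.0, 1.0)), ((6.9, 1.1), (7.0, 1.0))]
radius_cm = gaze_deviation(samples)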


The speech-related errors increased as the grammar became more complicated, and were the lowest for the simplest grammar. With the complementary gaze information, the ambiguities in the speech signals were resolved, so even an incomplete description worked well. For simple, short utterances, the speech recognizer generated fewer errors than for complex, lengthy ones.


Because there are few alternatives in the speech n-best list for lengthy utterances, unavoidable errors were rare for the complex grammar G3.


5. CONCLUSIONS


This paper presents a multimodal architecture that combines speech and gaze inputs to create a robust multimodal system. With our integration strategies, the gaze information is used to resolve the ambiguities of the user's speech and to correct speech recognition errors, and gaze recognition errors can in turn be corrected by the speech information.

The experiment revealed that the speech-related errors were reduced by simplifying the user's verbal input. Therefore, when verbal commands are used to identify a referent, the best way to interact with the computer is to use a simple description.

The experiment also suggested that the integration strategies can be tuned towards the optimum by analyzing and measuring the classified multimodal integration errors case by case. The classification of the multimodal errors provides a convenient approach for analyzing performance, handling the errors and defining the optimal parameter settings for the integration architecture.

6. ACKNOWLEDGMENTS
We would like to thank Professors Kenji Ozawa and Ryutaro Ohbuchi of the University of Yamanashi for helpful discussions and comments. This research was supported in part by the Japan Society for the Promotion of Science, by the Telecommunications Advancement Organization of Japan, and by the Research Institute of Electrical Communication of Tohoku University, awarded to A. Imamiya.

7. REFERENCES
[1] Bolt, R.A. Put-that-there: Voice and gesture at the graphics interface. In Proceedings of the ACM Conference on Computer Graphics (New York, 1980), 262-270.
[2] Campana, E., Baldridge, J., Dowding, J., Hockey, B.A., Remington, R.W., and Stone, L.S. Using eye movements to determine referents in a spoken dialogue system. In Proceedings of Perceptive User Interfaces (Orlando, FL, 2001).
[3] Jacob, R.J.K. Eye tracking in advanced interface design. In W. Barfield and T. Furness (Eds.), Advanced Interface Design and Virtual Environments. Oxford University Press, Oxford, 1995, 258-288.
[4] Koons, D.B., Sparrell, C.J., and Thorisson, K.R. Integrating simultaneous input from speech, gaze and hand gestures. In M. Maybury (Ed.), Intelligent Multimedia Interfaces. MIT Press, Menlo Park, CA, 1993, 257-276.
[5] Neal, J.G., Thielman, C.Y., Dobes, A., Haller, S.M., and Shapiro, S.C. Natural language with integrated deictic and graphic gestures. In M.T. Maybury and W. Wahlster (Eds.), Readings in Intelligent User Interfaces. Morgan Kaufmann, San Francisco, CA, 1998, 38-51.
[6] Oviatt, S.L. Mutual disambiguation of recognition errors in a multimodal architecture. In Proceedings of CHI '99 (1999). ACM Press, New York, 576-583.
[7] Tanaka, K. A robust selection system using real-time multi-modal user-agent interactions. In Proceedings of the 1999 Conference on Intelligent User Interfaces (Redondo Beach, CA, 1999), 105-108.
[8] Zhang, Q.H., Imamiya, A., and Go, K. Text entry application based on gaze pointing. In Proceedings of the 7th ERCIM Workshop "User Interfaces for All" (Paris, 2002), 87-102.
