Multimodal Speech-Gesture Interface for Handfree Painting on a Virtual Paper Using Partial Recurrent Neural Networks as Gesture Recognizer

Andrea Corradini, Philip R. Cohen
Center for Human-Computer Communication
Oregon Graduate Institute of Science and Technology
20000 N.W. Walker Rd, Beaverton, OR 97006
[email protected]

ABSTRACT - We describe a pointing and speech alternative to current paint programs based on traditional devices like the mouse, pen, or keyboard. We use a simple pointing system based on a magnetic field tracker as the input device for a painting system, providing a convenient means for the user to specify paint locations on a virtual paper. The virtual paper itself is defined by the operator as a bounded planar surface in three-dimensional space. Drawing occurs through natural human pointing, using the hand to define a line in space and considering its possible intersection point with this plane. Pointing gestures are recognized by means of a partial recurrent artificial neural network. Gestures, along with several vocal commands, are used to act on the current painting in conformity with a predefined grammar.

Keywords: User-centered Interface, Painting Tool, Pointing Gesture, Speech Recognition, Communication Agent, Multimodal System, Augmented and Virtual Reality, Partial Recurrent Artificial Neural Network.

1. INTRODUCTION
The natural combination of a variety of modalities such as speech, gesture, gaze, and facial expression makes human-human communication easy, flexible, and powerful. Similarly, when interacting with computer systems, people seem to prefer a combination of several modes to any single one alone [12, 19]. Despite strong efforts and deep investigation over the last decade, human-computer interaction (HCI) is still in its infancy, and its ultimate goal of building natural perceptual user interfaces remains a challenging problem. Two concurrent factors produce this awkwardness. First, current HCI systems impose rigid rules and syntax on the individual modalities involved in the dialogue. Second, speech and gesture recognition, gaze tracking, and other channels are treated in isolation because we do not yet understand how to integrate them to maximize their joint benefit [20,21,25,30]. While the first issue is intrinsically difficult (everyone claims to know what a gesture is, but nobody can define it precisely), progress is being made in combining different modalities into a unified system. Such a multimodal system, allowing interactions that more closely resemble everyday communication, becomes more attractive to users.

1.1 Related Work
Like speech, gestures vary both from instance to instance for a given person and across individuals. Besides this temporal variability, gestures also vary spatially, which makes them more difficult to deal with. For the recognition of these single modalities, only a few systems make use of connectionist models [3,7,17,27], because such models are not considered well suited to fully address the problems of time alignment and segmentation. However, some neural architectures [10,14,29] have been put forward and successfully exploited to partially solve problems involving the generation, learning, or recognition of sequences of patterns. Recently, several research groups have more thoroughly addressed the issue of combining verbal and non-verbal behavior. Most of the resulting multimodal systems have been quite successful in combining speech and gesture [4,6,26,28] but, to our knowledge, none exploits artificial neural networks. One of the first such systems is Put-That-There [4], which uses speech recognition and allows simple deictic reference to visible entities. A text editor featuring a multimodal interface that allows users to manipulate text using a combination of speech and pen-based gestures is presented in [28]. QuickSet [6], along with a novel integration strategy, offers mutual compensation between the pen and voice modalities. Among gestures, pointing is a compelling input modality that has led to friendlier interfaces (such as the mouse-enabled GUI) in the past. Unfortunately, few 3D systems that integrate speech and deictic gesture have been built to detect when a person is pointing without special hardware support and to provide the information necessary to determine the direction of pointing. Most of these systems have been implemented by applying computer vision techniques to observe and track finger and hand motion. The hand gesture-based pointing interface detailed in [24] tracks the position of the pointing fingertip and maps it directly into 2D cursor movement on the screen. Fukumoto et al. [11] report a glove-free, camera-based system providing pointing input for applications requiring computer control from a distance (such as a slide presentation aid). Further stereo-camera

techniques for the real-time detection of pointing gestures and the estimation of pointing direction have been exploited in [5, 8, 13]. More recently, [26] describes a bimodal speech/gesture interface integrated in a 3D visual environment for computing in molecular biology. The interface lets researchers interact with 3D graphical objects in a virtual environment using spoken words and simple hand gestures. In our system, we make use of the Flock of Birds (FOB) [1], a six-degree-of-freedom tracking device based on magnetic fields, to estimate the pointing direction. In an initialization phase, the user is required to set the target coordinates in 3D space that bound his painting region. With natural human pointing behavior, the hand is used to define a line in space, roughly passing through the base and the tip of the index finger. This line does not usually lie in the target plane, but may intersect it at some point. We recognize pointing gestures by means of a hybrid partial recurrent artificial neural network (RNN) consisting of a Jordan network [14] and a static network with buffered input, which handles the temporal structure of the movement underlying the gesture. Concurrently, several speech commands can be issued asynchronously. They are recognized using Dragon 4.0, a commercial speech engine. Speech along with gestures is then used to put the system into various modes that affect the appearance of the current painting. Depending on the spoken command, we solve for the intersection point and use it either to directly render ink or to draw a graphical object (e.g. circle, rectangle, or line) at this position in the plane. Since the speech and tracking modules are implemented on different machines, we employ our agent architecture to allow the different modules to exchange messages and information.

2. DEICTIC GESTURES
As far as HCI is concerned, no comprehensive classification of natural gestures currently exists that would help establish a methodology for gesture understanding. However, there is general agreement on the definition of the class of deictic or pointing gestures [9,15,18]. The term deictic is used in reference to gestures or words that draw attention to a physical point or area in the course of a conversation. Among natural human gestures, pointing gestures are the easiest to identify and interpret. Three body parts are conventionally used to point: the hands, the head, and the eyes. Here we are concerned only with manual pointing. In western society, there are two distinct forms of manual pointing which regularly co-occur with deictic words (like "this", "that", "those", etc.): one-finger pointing, used to identify a single object or a group of objects, a place, or a direction; and flat-hand pointing, used to describe paths or the spatial evolution of roads or ranges of hills. Some researchers [23] argue that pointing has iconic properties and represents

a prelinguistic and visually perceivable event. In fact, in face-to-face communication, deictic speech never occurs without an accompanying pointing gesture. In a shared visual context, any verbal deictic expression like "there" is unspecified without a parallel pointing gesture¹. These multiple modes may seem redundant until we consider a pointing gesture as a complement to speech, one that helps form semantic units. This is easy to see when speaking with children: they compensate for their limited vocabulary by pointing more, probably because they cannot convey as much information about an object or location by speaking as they can by directing their interlocutor to perceive it with his own eyes.

2.1 An Empirical Study
As described in the previous section, pointing is an intentional behavior that aims at directing the listener's visual attention to either an object or a direction. It is controlled both by the pointer's eyes and by muscular sense (proprioception). In the real world, our pointing actions are not coupled with "cursors", yet our interlocutors can often discern the intended referents by processing the pointing action and the deictic language together. We conducted an empirical experiment to investigate how precise pointing is when no visual feedback is available. We invited four subjects to point at a target spot on the wall using a laser pointer. They did this task from six different distances from the wall (equally distributed from 0.5 m to 3 m), ten times at each distance. Each time, the subject attempted to point at the target with the beam turned off. Once a subject was convinced that he had directed the laser pointer toward the bulls-eye, we turned on the laser and determined the point he was actually aiming at. We measured the distance between the bulls-eye and the spot actually indicated with the laser pointer, and computed the overall error for each distance as the average of this distance over all trials at that distance from the wall. The subjects were requested to perform this experiment twice, in two different ways: in a "natural" way and in an "improved" way. For the natural way, we asked the subjects simply to point naturally at the target, while for the improved way we specifically asked them to try to achieve the best possible result (some people put the laser pointer right in front of one eye and closed the other, others put it right in front of the nose, etc.). The outcome of the experiment is shown in Figure 1.

¹ This does not happen in sentences that are used for referencing places, objects or events the interlocutors already have clear in their minds because of the dialogue context. E.g., in "Have you been to Italy? Yes, I have been there twice" or "I watched Nuovo Cinema Paradiso on TV yesterday. Didn't that film win an Oscar in 1989?", deictic words are not accompanied by pointing gestures. Nor are they in sentences like "There shall come a time", "They all know that Lara is cute", or "The house that she built is huge", where they are used as conjunctions or pronouns.

As expected, the error increases with distance. In addition, when the user pointed at the given spot from a distance of 1 meter, the average error decreased from 9.08 centimeters with the "natural" way to 3.89 centimeters with the "improved" way.


Figure 1: target pointing precision. Average error (cm) versus distance from the wall (m), for natural and improved pointing.

In light of this experiment, reference resolution of deictic gestures without verbal language is an issue. In particular, when small objects are placed close together, reference resolution via deictic gesture can be impossible without the help of a spoken specification. In addition, a direct mapping between the 3D user input (the hand movement) and the user's intention (pointing at the target plane) can be performed accurately only with visual feedback (information on the current position). In the next section, we describe the system that has been built according to these considerations.

3. THE PAINTING SYSTEM
3.1 Estimating the pointing direction
For the whole system to work, the user is required to wear a glove with one of the FOB's sensors mounted on top of it. The FOB is a six-degree-of-freedom tracking device based on magnetic fields, which we exploit to track the position and orientation of the user's hand with respect to the coordinate system determined by the FOB's transmitter. The hand's position is given by the position vector P reported by the sensor at a frequency of approximately 50 Hz. For the orientation, we placed the sensor near the back of the index finger with its relative x-coordinate axis directed toward the index fingertip. In this way, using the quaternion values reported by the sensor, we can apply transformations within quaternion algebra to determine the unit vector X which unambiguously defines the direction of the sensor, and therefore that of pointing (Figure 2). The point P, along with the vector X, is then used to determine the equation of the imaginary line passing through P with direction X. When the system is started for the first time, the user has to choose the region he wants to paint in. This is accomplished by letting the user choose three of the vertices of the future rectangular painting region. These points are chosen by pointing at them. However, since this procedure takes place in 3D space, the user has to aim at each vertex from two different positions; the two pointing vectors are triangulated to select a point as the vertex. In 3D space, two lines will generally not intersect. In such cases, we use the point of minimum distance from both lines. With natural human pointing behavior, the hand defines a line in space, roughly passing through the base and the tip of the index finger. Normally, this line does not lie in the target plane but may intersect it at some point. It is this point that we aim to recover.
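The geometry involved is straightforward; the sketch below, in Python with numpy, shows one possible implementation under the assumption that the FOB quaternion is given in (w, x, y, z) order and that all coordinates are expressed in the transmitter's frame. All function names are illustrative and not taken from the actual system.

```python
import numpy as np

def rotate_by_quaternion(q, v):
    """Rotate vector v by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    u = np.array([x, y, z])
    return v + 2.0 * np.cross(u, np.cross(u, v) + w * v)

def pointing_ray(P, q):
    """Origin P and unit direction X of the pointing line; the sensor's local
    x-axis is assumed to point toward the index fingertip."""
    X = rotate_by_quaternion(q, np.array([1.0, 0.0, 0.0]))
    return np.asarray(P, dtype=float), X / np.linalg.norm(X)

def ray_plane_intersection(P, X, plane_point, plane_normal, eps=1e-9):
    """Intersection of the line P + t*X with the virtual paper, or None if the
    line is (nearly) parallel to it."""
    denom = np.dot(plane_normal, X)
    if abs(denom) < eps:
        return None
    t = np.dot(plane_normal, np.asarray(plane_point) - P) / denom
    return P + t * X

def closest_point_between_lines(P1, X1, P2, X2, eps=1e-9):
    """Midpoint of the shortest segment between two (generally skew) lines,
    used when triangulating a vertex of the painting region from two
    pointing directions."""
    r = P1 - P2
    a, b, c = np.dot(X1, X1), np.dot(X1, X2), np.dot(X2, X2)
    d, e = np.dot(X1, r), np.dot(X2, r)
    denom = a * c - b * b
    if abs(denom) < eps:
        return None                          # parallel pointing directions
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    return 0.5 * ((P1 + s * X1) + (P2 + t * X2))
```

The three vertices recovered this way define both a point on the virtual paper and its normal, which is what ray_plane_intersection needs during painting.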

For this reason, when the region selected in 3D space is neither a wall screen nor some other surface on which the input can be directly displayed (a tablet, the computer's monitor, etc.), the system can be used properly only when the magnetic sensor is aligned and used together with a light pointer. However, for this situation we also implemented a rendering module that draws the actual painting on the screen regardless of the target plane chosen in 3D space.

Figure 2: selecting a graphics tablet as the target region for painting enables direct visual feedback. The frame of reference of the sensor is shown on the left. On the right, an example painting is shown as it appears on the tablet.

3.2 Motion Detection for Segmentation
In order to describe the motion detector in detail, we first need some definitions. We consider the FOB data stream static whenever the sensor attached to the user's hand remains stationary for at least five consecutive FOB reports. In this case, we also say that the user is in the resting position. Similarly, we say the user is moving and consider the data stream dynamic whenever the incoming reports change their spatial location at least five times in a row. Static and dynamic data streams are defined in such a way that they are mutually exclusive but not exhaustive: if one definition is satisfied, the other is not; however, if one definition is not satisfied, this does not imply that the other is. This non-complementarity makes the motion detector module robust against noisy data. For real-time performance purposes, the FOB data are currently downsampled to 10 Hz, so that both the static and the dynamic conditions need to be fulfilled over a time range of approximately half a second.
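As a rough sketch of this static/dynamic segmentation, assuming the 10 Hz downsampled stream is delivered one 3D position report at a time; the class name and the jitter tolerance are our own choices, not taken from the implementation:

```python
from collections import deque
import numpy as np

class MotionDetector:
    """Labels the downsampled FOB stream as 'static', 'dynamic', or undecided.

    A stream is static when the last `window` reports stayed (nearly) in place,
    and dynamic when each of the last `window` reports moved; with window = 5
    at 10 Hz both conditions cover roughly half a second.
    """

    def __init__(self, window=5, tolerance=1e-3):
        self.window = window
        self.tolerance = tolerance            # spatial jitter tolerated while "stationary" (assumed)
        self.reports = deque(maxlen=window + 1)

    def update(self, position):
        """Feed one 3D position report; return 'static', 'dynamic', or None."""
        self.reports.append(np.asarray(position, dtype=float))
        if len(self.reports) <= self.window:
            return None                       # not enough context yet
        steps = [np.linalg.norm(b - a)
                 for a, b in zip(self.reports, list(self.reports)[1:])]
        if all(s <= self.tolerance for s in steps):
            return 'static'                   # user in the resting position
        if all(s > self.tolerance for s in steps):
            return 'dynamic'                  # user is moving
        return None                           # neither definition holds: treat as noise
```

The transitions between these two labels are what the motion detector reports: static-to-dynamic starts forwarding the stream to the RNN, and dynamic-to-static triggers the query described next.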


The motion detector is in charge of reporting to the gesture recognizer whenever a transition from static into dynamic, or from dynamic into static, occurs. If the transition is from static into dynamic, the motion detector forwards the input stream to the RNN so that classification can start. The RNN is queried for the classification result if the following conditions hold simultaneously:
a) the opposite transition, from dynamic into static, is detected;
b) the imaginary line described by the sensor in space intersects the target region (the virtual paper);
c) the time elapsed between the beginning of classification and the current time is less than a given threshold, chosen according to the maximum duration of the gestures used for RNN training.
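Folded into code, the three conditions amount to a single predicate along these lines; the maximum-duration value is not given in the text and is only a placeholder here:

```python
def should_query_recognizer(transition, intersection_point,
                            classification_start, now,
                            max_gesture_duration=2.0):
    """True when the RNN should be asked for a classification result."""
    return (transition == 'dynamic_to_static'                      # condition (a)
            and intersection_point is not None                     # condition (b)
            and now - classification_start < max_gesture_duration) # condition (c)
```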

The motion detector provides the recognizer with explicit start and end points for classification. Therefore, the RNN does not need to perform segmentation to identify these characteristics of the movement. With such a motion detector, gestures need not start with the hand in a given user-defined position, either.

3.3 Gesture Recognition by means of RNN
We use artificial neural networks to capture the spatially nonlinear local dependencies and to handle the temporal structure of a gesture. The most direct way to perform sequence recognition with static artificial neural networks is to turn the largest possible part of the temporal pattern sequence into a buffer on the input layer of the network. In such a network, a part of the input sequence is presented to the network simultaneously by feeding the signal into the input buffer and then shifting it at various time intervals. The buffer size must be chosen in advance and has to be large enough both to contain the longest possible sequence and to maintain as much context-dependent information as possible. A large buffer, however, means a large number of parameters and, implicitly, a large quantity of training data required for successful learning and generalization. Partial recurrent artificial neural networks (RNNs) are a compromise between the simplicity of these feedforward nets and the complexity of fully recurrent models. RNNs are currently the most successful connectionist architectures for sequence recognition and reproduction. They are feedforward models with an additional set of fixed feedback connections that encode information from the most recent past. This recurrence does not complicate the training algorithm, since the feedback weights are fixed and hence not trainable. We deploy a partial recurrent network in which the most recent output patterns, which intrinsically depend on the input sequence, are fed back into the input layer. In addition, we furnish the input layer with a time window to accommodate a part of the input.

Figure 3: Example of RNN for Gesture Recognition.

The resulting network (Figure 3) is a hybrid combination of a Jordan network [14] and a static network with buffered input. The input layer is divided into two parts: the context unit and the buffered input units, which currently contain 5 time steps (i.e., half a second). The context unit holds a copy of the output layer and of itself at the previous time step. The weight of the recurrence from the output layer is kept fixed at 1. Denoting by µ the strength of the self-connection, the update rule at time instant t for the context unit C(t) can be expressed in mathematical form as a function of the network's output O(t) as:

$C(t) = \mu\, C(t-1) + O(t-1) = \sum_{i=0}^{t-1} \mu^{t-1-i}\, O(i)$

It turns out that the context unit accumulates the past output values of the network in a manner that depends on the choice of the parameter µ in the range [0,1]. For values of µ close to one, the memory extends further back into the past but loses sensitivity to detail. This recurrence permits the network to remember some aspects of the most recent past, giving it a kind of short-term memory. At a given time, the network output thus depends both on the current input and on an aggregate of past values. As input values for the RNN, we use the signs of the differences between the spatial components of the current and previous sensor locations. Thus, the input is three-dimensional, translation invariant, and each of its components can assume only one value among {-1, 0, 1}. During training, the desired output is always associated with the center of the buffered input window. As target, we take the sample following that center within the training sequence. An additional output neuron is presented with the value 1 when the sequence is a deictic gesture, and the value 0 otherwise. Preclassified patterns are not necessary. The input vectors constituting the sequence are stepped through the context window time step by time step. The weight adjustment is carried out by backpropagation with a least-mean-square error function. Both hidden and output neurons compute the sigmoid activation function.
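A minimal numpy sketch of the forward pass of such a hybrid network, together with the sign-of-differences input coding, might look as follows. Layer sizes, the value of µ, and all names are illustrative assumptions, and training by backpropagation is omitted.

```python
import numpy as np

def sign_deltas(positions):
    """Code a trajectory as the sign of the difference between consecutive
    sensor locations; every component of the result lies in {-1, 0, 1}."""
    return np.sign(np.diff(np.asarray(positions, dtype=float), axis=0))

class JordanBufferNet:
    """Hybrid of a Jordan network and a static net with a buffered input window."""

    def __init__(self, buffer_steps=5, in_dim=3, hidden=12, out_dim=4, mu=0.5, seed=0):
        # out_dim = 3 units predicting the next sign-coded sample
        #           + 1 unit flagging a deictic gesture (illustrative choice)
        rng = np.random.default_rng(seed)
        self.mu = mu                                   # strength of the context self-connection
        n_in = buffer_steps * in_dim + out_dim         # buffered window + context copy of the output
        self.W1 = rng.normal(0.0, 0.1, (hidden, n_in + 1))   # +1 for the bias
        self.W2 = rng.normal(0.0, 0.1, (out_dim, hidden + 1))
        self.context = np.zeros(out_dim)

    @staticmethod
    def _sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def step(self, window):
        """One forward pass over a (buffer_steps x in_dim) window of coded input."""
        x = np.concatenate([np.ravel(window), self.context, [1.0]])
        h = self._sigmoid(self.W1 @ x)
        o = self._sigmoid(self.W2 @ np.concatenate([h, [1.0]]))
        # Jordan-style recurrence: C(t) = mu * C(t-1) + O(t-1), output weight fixed at 1
        self.context = self.mu * self.context + o
        return o
```

When the motion detector asks for a result, it is the additional (last) output unit of such a network that would be checked.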

The additional output neuron is checked whenever a classification result is required. We tested the recognizer on four sequences: two of a person performing pointing gestures toward a given virtual paper, and two of the same person gesticulating during a monologue without deictic gestures. Each sequence lasts 10 minutes and is sampled at 10 Hz. One sequence from each class was used for training, the other for testing. The recognition rate was up to 89% for pointing gestures, and up to 76% for non-pointing gestures. While only a few sequences from the deictic gesture data set were misrecognized (false negatives), many more movements from the non-pointing data set were misrecognized as pointing gestures (false positives). This is not surprising, since long-lasting gestures, which occur frequently during a monologue or conversation, are very likely to contain segments that closely resemble deictic gestures. Due to the nature of the training data, the test considers only the boundary conditions (false positives/negatives). We plan to collect and transcribe data from users during a conversational event where both deictic and non-deictic gestures occur. Testing the system with this more natural data will permit a more precise assessment of the recognizer's performance.

3.4 The Speech Agent
We make use of Dragon 4.0, a Microsoft SAPI 4.0 compliant speech engine. This speech recognition engine captures an audio stream and produces a list of text interpretations (with associated probabilities of correct recognition) of that speech audio. These text interpretations are limited by a grammar that is supplied to the speech engine upon startup. The following grammar specifies the possible self-explanatory sentences:

1: <sentence> = <yes_no> | <color> | <paint_command> | <system_command>
2: <yes_no> = no | yes
3: <color> = green | red | blue | yellow | white | magenta | cyan
4: <paint_command> = draw on | draw off | zoom in | zoom out | cursor on | cursor off | line begin | line end | select begin | select end | circle begin | circle end | rectangle begin | rectangle end | copy | paste
5: <system_command> = exit | help | undo | switch to foreground | switch to background | send to background | save | free buffer | cancel | restart | delete | load

Here, <system_command> and <paint_command> refer to the sets of commands which need to be issued without and with an accompanying pointing gesture, respectively.
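On the implementation side, the grammar can be mirrored as plain word lists, from which a recognized command can also be classified as requiring a pointing gesture or not. The nonterminal names follow the reconstruction above and are illustrative:

```python
# Command vocabulary mirroring the grammar above (names are illustrative).
YES_NO = {"yes", "no"}
COLORS = {"green", "red", "blue", "yellow", "white", "magenta", "cyan"}
PAINT_COMMANDS = {                       # must co-occur with a pointing gesture
    "draw on", "draw off", "zoom in", "zoom out", "cursor on", "cursor off",
    "line begin", "line end", "circle begin", "circle end",
    "rectangle begin", "rectangle end", "select begin", "select end",
    "copy", "paste",
}
SYSTEM_COMMANDS = {                      # may be issued without a pointing gesture
    "exit", "help", "undo", "save", "load", "delete", "cancel", "restart",
    "free buffer", "switch to foreground", "switch to background",
    "send to background",
}

def is_valid_sentence(utterance: str) -> bool:
    """Accept exactly the single-command sentences generated by the grammar."""
    u = utterance.strip().lower()
    return u in YES_NO | COLORS | PAINT_COMMANDS | SYSTEM_COMMANDS
```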

The user uses voice commands to put the system into various modes that remain in effect until he changes them. Speech commands can be entered at any time and are recognized in continuous mode.

3.5 The Fusion Agent
The Fusion Agent is a finite state automaton in charge of two major functions: rendering, and the temporal fusion of speech and gesture information. The rendering is implemented with OpenGL on an SGI machine using the Virtual Reality Peripheral Network (VRPN) [2] driver for the FOB. The fusion is based on a time-out variable. Once a pointing gesture is recognized, a valid spoken command must be entered within a given time (currently 4 seconds, since speech usually follows gesture [22]) or another pointing gesture must occur. The Fusion Agent then either takes the action associated with the speech command (such as changing the drawing color or selecting the first point of a line) or issues an acoustic warning signal. The state-machine model ensures consistent command sequences (e.g., "line begin" can only be followed by "undo", "cancel" or "line end"). Depending on the performed action, the system may undergo a state change.
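A sketch of this time-out based fusion, with the 4-second window taken from the text and everything else (event names, the dispatch tuples) purely illustrative:

```python
import time

FUSION_TIMEOUT = 4.0   # seconds: a valid spoken command must follow the pointing gesture

class FusionAgentSketch:
    """Pairs a recognized pointing gesture with the next valid spoken command."""

    def __init__(self, commands_needing_gesture):
        self.needs_gesture = set(commands_needing_gesture)
        self.pending = None                    # (intersection point, timestamp) of last gesture

    def on_pointing_gesture(self, intersection_point):
        # A newer pointing gesture simply replaces the pending one.
        self.pending = (intersection_point, time.time())

    def on_speech_command(self, command):
        if command not in self.needs_gesture:
            return ('execute', command, None)          # system command: no gesture required
        if self.pending is None:
            return ('warn', command, None)             # nothing to fuse: acoustic warning
        point, stamp = self.pending
        self.pending = None
        if time.time() - stamp > FUSION_TIMEOUT:
            return ('warn', command, None)             # gesture too old: time-out expired
        return ('execute', command, point)             # fused multimodal command
```

It could be instantiated, for instance, with the PAINT_COMMANDS set from the previous sketch; the consistency of command sequences (e.g., that "line begin" may only be followed by "undo", "cancel" or "line end") would be enforced by the surrounding finite state machine, which is not sketched here.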


Figure 4: agent communication within the entire system.

3.6 Agent Architecture
The modules implemented for tracking, pointing and painting, and speech command recognition need to communicate with each other. Agents communicate by passing Prolog-type ASCII strings (Horn clauses) via TCP/IP. The central agent is the facilitator. Agents can inform the facilitator of their interest in messages which match (logically unify) with a certain expression. Thereafter, when the facilitator receives a matching message from some other agent, it passes it along to the interested agent. Since ASCII strings and TCP/IP are common across various

platforms, agents can be used as software components that communicate across platforms. In this case, the Speech Agent runs on a Windows platform, since the best off-the-shelf speech recognition engines available to us (currently, Dragon) are on Windows. On the other hand, the Flock of Birds and the VRPN server are set up for Unix. Therefore, it makes sense to tie them together with the agent architecture (Figure 4). Communication is straightforward. The Speech Agent produces messages of the type parse_speech(Message), which the facilitator forwards to the Fusion Agent. The latter, with some simple parsing, can then extract the alternate speech recognition interpretations and their associated probabilities from the message strings. The command with the highest probability value above an experimental threshold (currently 0.85) is chosen.

4. Conclusions and Future Work
The presented system is a real-time application for drawing in space on a limited two-dimensional rectangular surface. This is a first step toward a 3D multimodal speech and gesture system for computer-aided design and cooperative tasks. Such a system might recognize 3D objects from an iconic library in the user's input and refine the user's drawings accordingly. We anticipate expanding the use of speech to operate on 3D objects. Since the fusion component is an agent, we are going to make it a module in the QuickSet Adaptive Agent Architecture [16], to further use it as a sort of virtual mouse for the QuickSet [6] user interface. Possible alternative applications for this system range from hand cursor control by pointing to target selection in virtual environments.

5. ACKNOWLEDGMENTS
This research is supported by the Office of Naval Research, Grants N00014-99-1-0377 and N00014-99-1-0380. Thanks to Rachel Coulston for help editing and Richard M. Wesson for programming support.

6. REFERENCES
[1] http://www.ascension-tech.com
[2] Taylor R.M., VRPN: A Device-Independent, Network-Transparent VR Peripheral System, Proc. of the ACM Symposium on Virtual Reality Software and Technology, 2001.
[3] Boehm K., Broll W., Sokolewicz M., Dynamic Gesture Recognition using Neural Networks; A Fundament for Advanced Interaction Construction, SPIE Conf. on Electronic Imaging Science & Technology, 1994.
[4] Bolt R.A., Put-That-There: voice and gesture at the graphics interface, Computer Graphics, Vol. 14, No. 3, 262-270, 1980.
[5] Cipolla R., Hadfield P.A., Hollinghurst N.J., Uncalibrated Stereo Vision with Pointing for a Man-Machine Interface, in Proc. of the IAPR Workshop on Machine Vision Applications, 163-166, 1994.
[6] Cohen P.R., et al., QuickSet: Multimodal interactions for distributed applications, Proc. of the 5th Int'l Multimedia Conf., 31-40, 1997.
[7] Corradini A., Gross H.-M., Camera-based Gesture Recognition for Robot Control, Proc. of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, Vol. IV, 133-138, 2000.
[8] Crowley J.L., Berard F., Coutaz J., Finger Tracking as an Input Device for Augmented Reality, in Proc. of the Int'l Workshop on Automatic Face and Gesture Recognition, 195-200, 1995.
[9] Efron D., Gesture, Race and Culture, Mouton and Co., 1972.
[10] Elman J.L., Finding Structure in Time, Cognitive Science, 14:179-211, 1990.
[11] Fukumoto M., Mase K., Suenaga Y., Realtime detection of pointing actions for a glove-free interface, in Proc. of the IAPR Workshop on Machine Vision Applications, 473-476, 1992.
[12] Hauptmann A.G., McAvinney P., Gesture with speech for graphics manipulation, International Journal of Man-Machine Studies, Vol. 38, 231-249, February 1993.
[13] Jojic N., et al., Detection and Estimation of Pointing Gestures in Dense Disparity Maps, in Proc. of the Int'l Conference on Automatic Face and Gesture Recognition, 468-474, 2000.
[14] Jordan M., Serial Order: A Parallel Distributed Processing Approach, in Advances in Connectionist Theory, Lawrence Erlbaum, 1989.
[15] Kendon A., The Biological Foundations of Gestures: Motor and Semiotic Aspects, Lawrence Erlbaum Associates, 1986.
[16] Kumar S., Cohen P.R., Levesque H.J., The Adaptive Agent Architecture: Achieving Fault-Tolerance Using Persistent Broker Teams, Proc. of the 4th Int'l Conf. on Multi-Agent Systems, 159-166, 2000.
[17] Lippmann R.P., Review of Neural Networks for Speech Recognition, Neural Computation, 1:1-38, 1989.
[18] McNeill D., Hand and Mind: What Gestures Reveal about Thought, The University of Chicago Press, 1992.
[19] Oviatt S.L., Multimodal interfaces for dynamic interactive maps, in Proc. of the Conference on Human Factors in Computing Systems: CHI, 95-102, 1996.
[20] Oviatt S.L., Cohen P.R., Multimodal interfaces that process what comes naturally, Communications of the ACM, 43(3):45-53, 2000.
[21] Oviatt S.L., et al., Designing the user interface for multimodal speech and gesture applications: State-of-the-art systems and research directions, Human-Computer Interaction, 15(4):263-322, 2000.
[22] Oviatt S., De Angeli A., Kuhn K., Integration and Synchronization of Input Modes during Multimodal Human-Computer Interaction, Proc. of CHI '97, 415-422, 1997.
[23] Place U.T., The Role of the Hand in the Evolution of Language, Psycoloquy, Vol. 11, No. 7, 2000, http://www.cogsci.soton.ac.uk
[24] Quek F., Mysliwiec T.A., Zhao M., FingerMouse: A Freehand Computer Pointing Interface, in Proc. of the Int'l Conf. on Automatic Face and Gesture Recognition, 372-377, 1995.
[25] Quek F., et al., Gesture and Speech Multimodal Conversational Interaction, Tech. Rep. VISLab-01-01, University of Illinois, 2001.
[26] Sharma R., et al., Speech/Gesture Interface to a Visual-Computing Environment, IEEE Computer Graphics and Applications, 20(2):29-37, 2000.
[27] Tank D.W., Hopfield J.J., Concentrating Information in Time: Analog Neural Networks with Applications to Speech Recognition, Proc. of the 1st Int'l Conf. on Neural Networks, Vol. IV, 455-468, 1987.
[28] Vo M.T., Waibel A., A multimodal human-computer interface: combination of gesture and speech recognition, InterCHI, 1993.
[29] Waibel A., et al., Phoneme Recognition Using Time-Delay Neural Networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(12):1888-1898, 1989.
[30] Wu L., Oviatt S., Cohen P.R., Multimodal Integration – A Statistical View, IEEE Transactions on Multimedia, 1(4):334-341, 2000.