LEARNING SPOKEN WORDS FROM MULTISENSORY INPUT

Chen Yu and Dana H. Ballard
Department of Computer Science, University of Rochester, Rochester, NY 14627, USA
{yu,dana}@cs.rochester.edu

ABSTRACT

Speech recognition and speech translation are traditionally addressed by processing acoustic signals, while nonlinguistic information is typically not used. In this paper, we present a new method that learns spoken words from naturally co-occurring multisensory information in a dyadic (two-person) conversation. It has been observed that a listener has a strong tendency to look toward the objects referred to by the speaker during a conversation. In light of this, we propose to use eye gaze to integrate acoustic and visual signals and to build audio-visual lexicons of objects. With such data gathered from conversations in different languages, the spoken names of objects can be translated between languages based on their visual semantics. We have developed a multimodal learning system and report the results of experiments that use speech and video in concert with eye movement records as training data.

1. INTRODUCTION

Multimedia is more than simply the representation of information in various types of media. It is the integration of and interaction among different media types that create challenging research topics and new opportunities [1]. A recent trend in multimedia research is to exploit the interaction between audio and video, which gives rise to the integration of audio and visual processing. Research topics along this direction include automatic lip-reading, lip synchronization, joint audio-video coding, and bimodal person authentication. In this paper, we take advantage of the natural responses of human beings to speech in order to associate acoustic and visual signals. Specifically, eye movements are used as nonlinguistic information to solve problems that are conventionally addressed using only acoustic signals and speech techniques, such as learning the spoken names of objects and translating those names between languages.

Our work is motivated by the results of cognitive studies. Tanenhaus et al. [2] suggested that nonlinguistic information affects the manner in which linguistic input is processed. They argued that language processing is inextricably tied to reference and to the relevant behavioral context. Cooper [3] found that people have a strong tendency to look toward objects referred to in conversation. He showed that the eye movement response system in the presence of an ongoing conversation is characterized by a high degree of linguistic sensitivity: based on their interpretation of the language, people naturally look at objects in response to the words they hear as those words appear in context.

Despite the importance of eye gaze in the study of human linguistic processing, little work has been done on using eye gaze for speech recognition. A few studies do propose learning models to explain how words are learned from linguistic and contextual input. Among them, the work of Roy [4] is particularly relevant to ours. Motivated by the fact that infants learn words by combining information from different modalities, Roy presented a model of early word learning. The model has been implemented for the domain of shape and color name learning from microphone and camera input. However, his method relies on user-guided labeling and segmentation to synchronize the acoustic and visual signals before the learning model is applied.

This paper proposes to integrate information from multiple modalities by using eye gaze, a natural pointing movement of the human body, to bind objects in the physical world to speech. In the system we developed, the input consists of audio, visual and eye movement data from a dyadic conversation. The simultaneity between the acoustic signals generated by the speaker and the visual objects fixated by the listener when the speaker refers to those objects allows us to associate the visual representations of objects with their spoken names. The system can also translate the spoken names of objects between languages by processing data collected from conversations in different languages. Figure 1 summarizes our approach. To the best of our knowledge, there is no previous work on using eye gaze to integrate speech and vision for multimodal learning. In this approach, data from the different modalities are integrated at the signal level, without first being transformed to a symbolic representation layer.

Fig. 1. Audio-Visual Association and Speech Translation

2. THE ROLE OF EYE GAZE

Human beings continuously explore their environment by moving their eyes; they look around quickly and with little conscious effort. A considerable body of work has been devoted to tracking people's eyes and using this information as an input modality for human-computer interaction [5]. In contrast to these applications, we utilize eye gaze in two ways.

First, eye gaze plays an important role in multimodal integration. Systems that process multiple input modalities typically rely on the fact that the features of the input streams are correlated in time. This assumption may hold in certain cases, such as lip reading, but in many other situations the acoustic and visual signals are uncorrelated. We propose to solve this problem using eye gaze in the scenario of a dyadic conversation. When people are simultaneously presented with spoken utterances and a visual field containing elements semantically related to the informative items of speech, they spontaneously direct their line of sight to the elements most closely related to the meaning of the language currently heard, without prior instruction to do so. Thus, eye gaze can be used to correlate the visual objects selected by the listener with the concurrent spoken words generated by the speaker.

Second, compared with other modalities such as gesture and voice, eye gaze has a unique property: it implicitly carries information about the focus of the user's attention at a specific point in time. We can therefore utilize eye gaze to find the objects people are interested in. Although standard wearable computers have the potential to "see" as the user sees from a first-person perspective, it is not trivial to extract the objects of user interest from a cluttered scene. In our system, in contrast, we can directly use the eye position as a cue for segmenting objects in the visual scene.

3. SYSTEM DESCRIPTION

We collected visual, audio and eye movement data from a dyadic conversation and used them as training data to build a multimodal system that associates visual representations of objects with their spoken names. The hardware configuration of the system is shown in Figure 2. Both participants in the conversation wear head-mounted eye trackers from Applied Science Laboratories (ASL) and HiBall head trackers developed by the University of North Carolina (UNC) Tracker Research Group. The ASL tracker consists of a miniaturized illuminator and camera that provide an infrared image of the eye, and it determines monocular eye position by monitoring the locations of the center of the pupil and the corneal reflection [6]. The headband of the ASL holds a miniature "scene camera" to the left of the participant's head that provides video of the scene from the first-person perspective. The HiBall tracker reports precise six-dimensional (6D) information about the position and orientation of the participant's head.

Fig. 2. Hardware Configuration

The signals from the multiple sensors are collected on PC workstations. The speech of the two participants is recorded using two clip-on microphones (one per person) whose outputs are mixed and digitized at 44.1 kHz with 16-bit resolution. The video of the scene is sent to a frame grabber on the workstation and sampled at a resolution of 320 by 240 pixels at 15 Hz. Eye tracking data, consisting of both the position of eye gaze and the size of the pupil, are transmitted to the workstation through an RS-232C serial interface at 60 Hz. The system also collects head position and orientation data over a TCP/IP socket connection to the HiBall tracker server machine. We have developed a multiprocess program that records the signals from the multiple sensors with timestamps. Figure 3 summarizes the software components of the system. The input data are provided by three kinds of sensors: the eye tracker and the HiBall tracker generate eye position and head position and orientation data, the scene camera of the ASL provides the video of the scene the participant looks at, and the microphones sense the acoustic signals. The techniques for processing and integrating data from these different modalities in the current experiment (Section 4) are described as we proceed through this section.
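To illustrate the recording architecture, the following Python sketch logs several sensor streams against a single shared clock so that later stages can align the modalities by timestamp. It is a minimal sketch rather than the authors' software: the sensor stubs, rates and thread-based design are assumptions made only for illustration.

    # Minimal sketch of timestamped multi-sensor logging on one shared clock.
    # Sensor names, rates and the thread-based design are illustrative assumptions.
    import threading, time, queue

    log = queue.Queue()          # (timestamp, sensor, payload) records
    t0 = time.monotonic()        # shared clock origin for all sensors

    def record(sensor_name, read_sample, rate_hz, duration_s=1.0):
        """Poll one sensor at roughly rate_hz and push timestamped samples."""
        period = 1.0 / rate_hz
        end = time.monotonic() + duration_s
        while time.monotonic() < end:
            sample = read_sample()                       # hypothetical sensor read
            log.put((time.monotonic() - t0, sensor_name, sample))
            time.sleep(period)

    # Hypothetical stand-ins for the real eye tracker, head tracker and camera reads.
    fake_eye  = lambda: (0.0, 0.0)       # (x, y) gaze position
    fake_head = lambda: (0.0,) * 6       # 6D head pose
    fake_cam  = lambda: b"frame"         # raw video frame

    threads = [
        threading.Thread(target=record, args=("eye", fake_eye, 60)),
        threading.Thread(target=record, args=("head", fake_head, 60)),
        threading.Thread(target=record, args=("video", fake_cam, 15)),
    ]
    for t in threads: t.start()
    for t in threads: t.join()

    records = []                          # drain and time-order the multimodal record
    while not log.empty():
        records.append(log.get())
    records.sort(key=lambda r: r[0])
    print(len(records), "timestamped samples")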

Fig. 3. Overview of Software Architecture

3.1. Eye Movement Analysis

Saccades are rapid eye movements that allow the fovea to view a different portion of the display. When objects in a scene are viewed, a saccade is often followed by one or more fixations. In the context of our experiment, we categorize eye movements into three modes:

- a visual-audio interaction mode, in which the fixated target is correlated with the meaning of the concurrently heard language;
- a free-scanning mode, in which a person continually alters the direction of gaze in a manner independent of the meaning of the concurrently heard language;
- a point-fixation mode, in which a person continues to fixate the same location independent of the meaning of the concurrently heard language.

Our goal is to find the durations of the first mode. We utilize the acoustic information to omit the time intervals of the third mode (Section 3.4). To remove the eye movements of the second mode, we have developed a velocity-threshold-based algorithm that uses both eye position and head orientation data. The algorithm significantly reduces the size and complexity of the eye data by removing raw saccade data points and collapsing raw fixation points into a single representative tuple. A sample of the results of the eye data analysis is shown in Figure 4.

Fig. 4. Eye fixations in space. The first two rows show the point-to-point velocities of the eye data and the corresponding fixation groups obtained by removing saccade points. The third and fourth rows show the head orientation data (in 1D) and the head fixation groups. The bottom row shows the result of fixation identification obtained by integrating eye fixations and head fixations.

In the algorithm, a fixation whose data points have too large a standard deviation or whose duration is too short is removed.
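The following Python sketch conveys the flavor of such a velocity-threshold fixation finder. It is not the system's actual implementation: it covers only the eye-velocity part (the algorithm above additionally gates on head orientation data), and the threshold values, array names and sampling rate are assumed placeholders.

    # Velocity-threshold fixation identification over 60 Hz eye samples (a sketch;
    # thresholds and units below are assumptions, and head data is omitted).
    import numpy as np

    def find_fixations(gaze_xy, dt=1.0 / 60.0,
                       vel_thresh=30.0,      # assumed saccade velocity threshold (units/s)
                       max_std=1.0,          # assumed spatial-dispersion limit
                       min_dur=0.1):         # assumed minimum fixation duration (s)
        """gaze_xy: (N, 2) eye positions. Returns (x_mean, y_mean, t_start, t_end) tuples."""
        vel = np.linalg.norm(np.diff(gaze_xy, axis=0), axis=1) / dt
        slow = np.concatenate([[False], vel < vel_thresh])     # True where no saccade
        fixations, start = [], None
        for i, s in enumerate(np.append(slow, False)):          # sentinel closes the last run
            if s and start is None:
                start = i
            elif not s and start is not None:
                seg = gaze_xy[start:i]
                dur = (i - start) * dt
                # Collapse the run into one representative tuple, dropping runs
                # that are too dispersed or too short.
                if dur >= min_dur and seg.std(axis=0).max() <= max_std:
                    fixations.append((seg[:, 0].mean(), seg[:, 1].mean(),
                                      start * dt, i * dt))
                start = None
        return fixations

    # Example: two stable gaze clusters separated by a saccade-like jump.
    rng = np.random.default_rng(0)
    track = np.vstack([rng.normal([100, 100], 0.05, (60, 2)),
                       rng.normal([300, 150], 0.05, (30, 2))])
    print(find_fixations(track))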

3.2. Speech Processing

The linguistic information is grounded in acoustic signals originating from the microphones. As the first phase of this study, we focus on demonstrating the proposed approach in a simple case: in the current experiment (Section 4), we constrained the speech input of the system by asking participants to speak only the names of the objects in the scene. As a result, the speech stream can be directly segmented into speech segments delimited by silence. In the utterance endpoint detection algorithm we implemented, short bursts of speech are ignored since they are likely due to environmental noise, and short silences within an utterance are absorbed into a single utterance. A sequence of mel-frequency cepstral coefficients (MFCCs) is then extracted from each spoken segment and used for clustering in Section 3.5.
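To make the endpoint-detection step concrete, here is a rough Python sketch of silence-based utterance segmentation as described above. The energy threshold, frame size and duration limits are assumed values rather than the ones used in the system, and MFCC extraction is only indicated in a comment using one possible library.

    # Silence-based utterance endpoint detection (a sketch with assumed parameters).
    import numpy as np

    def detect_utterances(signal, sr=44100, frame=1024,
                          energy_thresh=1e-3,     # assumed silence threshold
                          min_speech=0.25,        # drop bursts shorter than this (s)
                          max_gap=0.3):           # absorb silences shorter than this (s)
        n = len(signal) // frame
        frames = signal[:n * frame].reshape(n, frame)
        voiced = (frames ** 2).mean(axis=1) > energy_thresh

        # Collect voiced runs as [start_frame, end_frame) pairs.
        runs, start = [], None
        for i, v in enumerate(np.append(voiced, False)):
            if v and start is None:
                start = i
            elif not v and start is not None:
                runs.append([start, i]); start = None

        # Absorb short silences between runs into a single utterance.
        merged = []
        for r in runs:
            if merged and (r[0] - merged[-1][1]) * frame / sr < max_gap:
                merged[-1][1] = r[1]
            else:
                merged.append(r)

        # Ignore short bursts (likely environmental noise) and convert to seconds.
        return [(s * frame / sr, e * frame / sr) for s, e in merged
                if (e - s) * frame / sr >= min_speech]

    # Example with a synthetic 2-second signal containing one 0.5 s tone burst:
    sr = 44100
    t = np.arange(2 * sr) / sr
    sig = np.where((t > 0.7) & (t < 1.2), 0.1 * np.sin(2 * np.pi * 440 * t), 0.0)
    print(detect_utterances(sig, sr))            # roughly [(0.7, 1.2)]

    # MFCCs for one detected segment could then be computed with a library such as
    # librosa, e.g. librosa.feature.mfcc(y=sig[int(t0*sr):int(t1*sr)], sr=sr, n_mfcc=13).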

3.3. Video Processing

The input video stream is segmented into sets of images based on the durations of the eye fixations. Visual processing then consists of four steps. In the first step, each image in an image set is sent to an image segmentation module in which the object in the image is segmented from the background by using the eye position as the seed for a region growing algorithm [7]. In the second step, the mask image obtained in the first step is sent to a feature extraction module that extracts the visual feature vector of the object. Based on the color histogram [8] and multidimensional receptive field histogram [9] methods, an object is represented by a feature vector of 192 dimensions, consisting of 64 dimensions of color histogram and 128 dimensions of 2D histograms of local features. In the third step, the feature vector of each image set is computed by averaging the feature vectors of the objects in the set. In the fourth step, visual clustering is performed by computing the distances between the feature vectors of the image sets. When the system first starts running, it has no clusters; when it receives the first image set, it creates a cluster and initializes its representation with the feature vector of that image set. In subsequent learning, for each input image set, the distances from its feature vector to the existing clusters are calculated and compared with a pre-defined split/merge threshold. If any distance is less than the threshold, the feature vector is classified into the corresponding cluster and the representation of that cluster is updated. If all of the distances are greater than the threshold, a new cluster is formed and initialized with the feature vector of the input.
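As a concrete illustration of the third and fourth steps, the sketch below implements the online split/merge clustering in Python. The Euclidean distance metric, the threshold value and the toy features are assumptions, not the settings used in the system.

    # Online visual clustering with a split/merge threshold (a sketch with assumed
    # distance metric and threshold).
    import numpy as np

    class OnlineVisualClusters:
        def __init__(self, threshold=0.5):
            self.threshold = threshold          # assumed split/merge threshold
            self.centers = []                   # one running mean vector per cluster
            self.counts = []

        def add_image_set(self, image_set_features):
            """image_set_features: (n_images, 192) per-image object feature vectors."""
            v = np.asarray(image_set_features).mean(axis=0)    # step 3: average over the set
            if not self.centers:                                # first input seeds a cluster
                self.centers.append(v); self.counts.append(1)
                return 0
            dists = [np.linalg.norm(v - c) for c in self.centers]
            k = int(np.argmin(dists))
            if dists[k] < self.threshold:                       # merge: update the cluster mean
                self.counts[k] += 1
                self.centers[k] += (v - self.centers[k]) / self.counts[k]
                return k
            self.centers.append(v); self.counts.append(1)       # split: start a new cluster
            return len(self.centers) - 1

    # Toy usage with random stand-ins for the 64-bin color + 128-bin local-feature histograms.
    rng = np.random.default_rng(1)
    clusters = OnlineVisualClusters(threshold=0.5)
    for _ in range(3):
        print(clusters.add_image_set(rng.dirichlet(np.ones(192), size=5)))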

3.4. Audio-Visual Association

The key constraint is that the eye fixations of the listener can be associated with the spoken utterances of the speaker. Consequently, the visual signals within eye fixation durations can be associated with spoken utterances. Implementing this association is straightforward: the utterance endpoint algorithm timestamps the start and end of each spoken utterance, and the fixation identification algorithm timestamps the start and end of each eye fixation. Figure 5 shows the timestamps of spoken utterances and eye fixations. For each eye fixation, two time intervals are computed: t1, the interval between the start of the eye fixation and the start of the closest preceding spoken utterance, and t2, the interval between the end of the eye fixation and the end of that utterance. The experimental results in [2] show that a roughly fixed latency is needed to program a saccadic eye movement to the target object after hearing the end of the word that uniquely specifies it. In our experiment (Section 4), the durations of spoken utterances vary over a limited range, while eye fixation durations are highly variable, ranging from a fraction of a second to several seconds, with large changes even between consecutive fixations. In practice, we assume that the listener can identify the meaning of a spoken word only after at least half of it has been heard, so the earliest moment at which the listener can begin moving the eyes is about 150 ms (300 ms/2) after the word begins. Adding the saccade programming latency gives the earliest start time of an eye fixation that could be controlled by speech; similarly, the latest time at which the listener can identify the object from speech, plus the same latency, gives the latest such start time. We also assert that if the end of an eye fixation is too close to the end of the spoken utterance, the fixation is unlikely to be controlled by speech. Based on this analysis, a spoken utterance is associated with an eye fixation only if t1 falls within the window defined by the earliest and latest start times (Condition 1) and t2 exceeds a minimum separation (Condition 2). For example, of the utterance-fixation pairs shown in Figure 5, only one is associated; of the other two, one does not satisfy Condition 2 and the other does not satisfy Condition 1.
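The association test can be summarized in a few lines of Python. In the sketch below, the timing bounds EARLIEST, LATEST and MIN_END_GAP stand in for the thresholds used in the system; their exact values are not reproduced here, so the numbers shown are assumed placeholders.

    # Timestamp-based association of utterances and fixations (assumed thresholds).
    EARLIEST = 0.15 + 0.2     # assumed: half of a 300 ms word plus a saccade latency
    LATEST = 1.5              # assumed upper bound on speech-driven fixation onset
    MIN_END_GAP = 0.2         # assumed minimum separation of the two end times

    def associated(utterance, fixation):
        """utterance, fixation: (t_start, t_end) pairs from the endpoint and
        fixation-identification algorithms, on a shared clock (seconds)."""
        t1 = fixation[0] - utterance[0]      # fixation start relative to utterance start
        t2 = fixation[1] - utterance[1]      # fixation end relative to utterance end
        cond1 = EARLIEST <= t1 <= LATEST     # fixation begins inside the plausible window
        cond2 = t2 >= MIN_END_GAP            # fixation outlasts the utterance end
        return cond1 and cond2

    def pair_up(utterances, fixations):
        """Associate each fixation with the closest preceding utterance, if any."""
        pairs = []
        for fix in fixations:
            prior = [u for u in utterances if u[0] <= fix[0]]
            if prior:
                u = max(prior, key=lambda u: u[0])
                if associated(u, fix):
                    pairs.append((u, fix))
        return pairs

    print(pair_up([(0.0, 0.8), (2.0, 2.7)], [(0.5, 1.3), (2.05, 2.4)]))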


3.5. Speech-to-Speech Translation

The system builds associations between the spoken names of objects in different languages based on their visually grounded meanings. This approach consists of two steps. First, using the audio-visual pairs described in the previous section, spoken words are clustered based on the similarities of their paired visual feature vectors. This allows us to place spoken words in the same group even though they are in different languages or were produced by different speakers. Second, dynamic time warping is used to match the acoustic features of the spoken words in each group and to cluster them into subgroups that correspond to the different languages. In this way, speech-to-speech translation of the spoken names of objects is accomplished. The key advantage of this approach is that the mapping between spoken segments in different languages is defined by semantic associations grounded in the visual information common to them.
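As an illustration of the second step, the sketch below computes a standard DTW distance between MFCC sequences and splits one visually grounded word group into acoustic subgroups with a simple distance threshold. The threshold, the toy sequences and the greedy grouping heuristic are assumptions; they are a simplification of the clustering actually used in the system.

    # DTW over MFCC sequences and threshold-based subgrouping within one word group
    # (a sketch; threshold and toy data are assumptions).
    import numpy as np

    def dtw_distance(a, b):
        """a, b: (frames, n_mfcc) MFCC sequences; classic DTW with Euclidean cost."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m] / (n + m)             # length-normalized alignment cost

    def split_by_language(mfcc_seqs, threshold=2.0):
        """Greedy threshold clustering of one word group into acoustic subgroups."""
        subgroups = []
        for seq in mfcc_seqs:
            for group in subgroups:
                if dtw_distance(seq, group[0]) < threshold:
                    group.append(seq)
                    break
            else:
                subgroups.append([seq])
        return subgroups

    # Toy usage: two acoustically similar sequences and one very different one.
    rng = np.random.default_rng(2)
    eng = rng.normal(0.0, 0.1, (20, 13))
    seqs = [eng, eng + rng.normal(0.0, 0.1, (20, 13)), rng.normal(5.0, 0.1, (25, 13))]
    print([len(g) for g in split_by_language(seqs)])   # [2, 1]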

Fig. 5. Association of eye fixation and spoken utterance

4. PRELIMINARY RESULTS

For a preliminary study of using eye gaze to integrate speech and vision, we designed an experiment in which we gathered a corpus of multisensory data from two-person interactions. Both participants sit at a table on which there are six objects: a car, a pickup, a motorcycle, a jet, a cow and an elephant. Figure 6 shows a sample view obtained from the camera mounted on the participant's head. One person is the speaker and the other is the listener. The speaker was asked to speak the names of the six objects in random order, while the listener responded naturally, without prior instruction to move the eyes intentionally. Five two-person pairs participated in the experiment: three spoke English and two spoke Chinese. We recorded the acoustic signals of the speaker as well as the video, eye gaze and head position data of the listener during their interactions. In total we collected 96 spoken utterances and approximately 12,000 images.

The expected results are audio-visual lexicons classified into six object groups. Based on the visual semantics, speech-to-speech translations are obtained by categorizing the audio-visual pairs of each object into two subgroups corresponding to English and Chinese. The correct rate of the grounded audio-visual lexicons is 93.6%, and the accuracy of speech-to-speech translation is 71.6%. We should point out that the high correct rate of the audio-visual pairs is partly due to the constrained speech and the limited number of objects that may attract the listener's attention. Nevertheless, the results demonstrate that using eye gaze to associate speech with vision is a promising approach.

Fig. 6. A sample view of the scene obtained from the head-mounted camera

5. CONCLUSION AND FUTURE WORK

The system demonstrates a novel approach to integrating multisensory input in which eye gaze is used to associate acoustic signals with visual signals and to accomplish speech-to-speech translation. From an engineering perspective, one advantage of this approach is that the system learns spoken words directly from multimodal observations, without manually generated transcriptions as "teaching" information. The speech and visual inputs are directly associated by another input modality (eye gaze) without first being transformed to a symbolic representation layer. Our immediate goal is to use natural conversations as input. We also plan to carry out experiments in more natural situations, such as a lab orientation or shopping in a supermarket, in which objects in the scene are likely to be frequently referred to in speech. In such situations, since we cannot assume that all spoken words are related to the objects in the scene, additional effort is needed to obtain the audio-visual lexicons.

6. REFERENCES

[1] T. Chen and R. Rao, "Audio-visual integration in multimodal communications," Proceedings of the IEEE, Special Issue on Multimedia Signal Processing, vol. 86, no. 5, pp. 837–852, May 1998.
[2] M. K. Tanenhaus, M. J. Spivey-Knowlton, K. M. Eberhard, and J. C. Sedivy, "Integration of visual and linguistic information in spoken language comprehension," Science, vol. 268, pp. 1632–1634, 1995.
[3] R. M. Cooper, "The control of eye fixation by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing," Cognitive Psychology, vol. 6, pp. 84–107, 1974.
[4] D. Roy, "Integration of speech and vision using mutual information," in Proceedings of the Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Istanbul, Turkey, June 2000.
[5] D. D. Salvucci and J. H. Goldberg, "Identifying fixations and saccades in eye-tracking protocols," in Proceedings of the Eye Tracking Research and Applications Symposium, ACM SIGCHI and ACM SIGGRAPH, Nov. 2000.
[6] J. B. Pelz, M. M. Hayhoe, D. H. Ballard, A. Shrivastava, J. D. Bayliss, and M. von der Heyde, "Development of a virtual laboratory for the study of complex human behavior," in Proceedings of the SPIE, San Jose, CA, 1999.
[7] R. Adams and L. Bischof, "Seeded region growing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 6, June 1994.
[8] M. J. Swain and D. H. Ballard, "Color indexing," International Journal of Computer Vision, vol. 7, no. 1, pp. 11–32, 1991.
[9] B. Schiele and J. L. Crowley, "Object recognition using multidimensional receptive field histograms," in Proceedings of the European Conf. on Computer Vision, Cambridge, UK, 1996, pp. 1039–1046.