LEARNING FROM MULTIMODAL OBSERVATIONS

Deb Roy
MIT Media Laboratory
20 Ames Street, Cambridge, MA 02139, USA
http://dkroy.www.media.mit.edu/people/dkroy/

ABSTRACT
Human-computer interaction based on recognition of speech, gestures, and other natural modalities is on the rise. Recognition technologies are typically developed in a statistical framework and require large amounts of training data. The cost of collecting manually annotated data is usually the bottleneck in developing such systems. We explore the idea of learning from unannotated data by leveraging information across multiple modes of input. We present a working system, inspired by infant language learning, which learns from untranscribed speech and images.
1. RECOGNITION TECHNOLOGIES FOR HCI

Human-computer interfaces based on natural modalities are on the rise. Machines which respond to natural speech and gestures have recently become useful in practical situations (cf. [6, 20, 1]). As design issues concerning the appropriate use of recognition technologies are better understood, the importance of these methods in HCI will increase.

The successful development of recognition technologies is due in large part to progress in applying pattern recognition techniques to the problem of recognizing human behaviours captured by microphones, cameras, tablets, and other sensors. In contrast to the consistent signals generated by keystrokes and mouse clicks, natural modalities give rise to high degrees of variation. The appearance of a hand gesture or spoken phoneme is dramatically affected by immediate context. Individual differences of physiology, style, cultural background, gender, and age compound the variations. Background noise adds yet another level of complexity. A recognition system must see past such variations and recover the intent underlying an observed behaviour. Practice has shown that the most effective approach to dealing with variability is to train statistical models from observed data. Statistical models capture expected sensory patterns and typical variations. Once trained, these models are matched against new observations to perform recognition.
2. THE BOTTLENECK: LACK OF DATA

The quantity of data available for training models is typically the limiting factor on performance of recognition systems. Higher recognition accuracy is almost always achieved by increasing the amount of training data. The problem
is that training recognition models typically requires large amounts of manually transcribed data. For example, to train a large vocabulary speech recognizer, hundreds of hours of speech must be recorded, along with human-generated transcriptions. Some of the speech must also be transcribed at the phoneme level, a difficult process which requires highly trained experts. Even with hundreds of hours of data, many context-dependent sub-word units are not observed even a single time [19]. The prospect of replicating such data collections to cover new languages and dialects is daunting. Similar problems of data collection also face developers of gesture, cursive writing, and other recognition technologies.
3. MULTIMODAL LEARNING

This situation calls for new approaches which are able to learn statistical models with little human-annotated data. In principle this must be possible since every infant solves a similar problem. The world does not provide annotated data. Instead, it provides rich streams of continuously varying information through multiple modes of input. Infants learn by combining information from multiple modalities. A promising path of research is to build machines which similarly integrate evidence across modalities to learn from naturally occurring data. The key advantage of this approach is that potentially unlimited new sources of untapped training data may be utilized to develop robust recognition technologies.
3.1. Infants Learn from Multimodal Observations

Infants are born with innately specified sense organs which constrain which aspects of the world are captured and represented by the perceptual system. Infants also possess powerful mechanisms which allow them to learn structure within and across modalities (cf. [9, 18]). By learning inter- and intra-modal patterns, infants begin to make sense of their environment, mastering sensorimotor skills in their first few years of life. To gain insights into building systems which can learn from untranscribed multimodal data, we have developed a computational model of infant language learning.
3.2. A System which Learns from Multimodal Observations

We have developed a model of cross-channel early lexical learning (CELL), summarized in Figure 1 [13, 12]. This model discovered words by searching for segments of speech which reliably predicted the presence of visually co-occurring shapes. Input consisted of spoken utterances paired with images of objects. This approximated the input that an infant might receive when listening to a caregiver while visually attending to objects in the environment. Output consisted of a lexicon of audio-visual items. Each lexical item included a statistical model (based on hidden Markov models) of a spoken word and a statistical visual model of an object class. To acquire lexical items, the system had to (1) segment continuous speech at word boundaries, (2) form visual categories corresponding to objects, and (3) form appropriate correspondences between word and object models.
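To make this organization concrete before turning to the details, the following Python sketch outlines the acquisition loop under stated assumptions: the candidate segment generator, the acoustic and visual distance metrics, and the mutual information score are hypothetical callables supplied by the caller, and the thresholds are illustrative values rather than parameters of CELL itself.

```python
# A minimal structural sketch of the CELL acquisition loop (illustration only,
# not the paper's implementation). All callables and thresholds are assumptions
# supplied by the caller.

from collections import deque

class LexicalLearner:
    def __init__(self, segment_pairs, acoustic_dist, visual_dist, mi_score,
                 stm_size=5, acoustic_thr=1.0, visual_thr=1.0, mi_thr=0.5):
        self.segment_pairs = segment_pairs   # proposes candidate speech segment pairs
        self.acoustic_dist = acoustic_dist   # distance between two speech segments
        self.visual_dist = visual_dist       # distance between two shape representations
        self.mi_score = mi_score             # mutual information of a (segment, shape) pair
        self.acoustic_thr = acoustic_thr
        self.visual_thr = visual_thr
        self.mi_thr = mi_thr
        self.stm = deque(maxlen=stm_size)    # short-term memory of recent (utterance, shape) pairs
        self.ltm = []                        # hypothesized (segment, shape) prototypes

    def observe(self, utterance, shape):
        """Process one spoken utterance paired with a co-occurring shape."""
        # Short-term recurrence filter: recurring speech segments which occur in
        # matching visual contexts are promoted to long-term memory.
        for past_utt, past_shape in self.stm:
            if self.visual_dist(shape, past_shape) > self.visual_thr:
                continue
            for seg, past_seg in self.segment_pairs(utterance, past_utt):
                if self.acoustic_dist(seg, past_seg) < self.acoustic_thr:
                    self.ltm.append((seg, shape))
        self.stm.append((utterance, shape))

    def garbage_collect(self):
        """Discard LTM hypotheses whose speech segment and shape rarely co-occur."""
        self.ltm = [(seg, shp) for seg, shp in self.ltm
                    if self.mi_score(seg, shp, self.ltm) > self.mi_thr]
```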
[Figure 1 block diagram: spoken utterances and shapes are processed by phoneme estimation and shape statistics estimation, buffered in a short-term memory, passed through a short-term recurrence filter into a long-term memory, and filtered by a long-term mutual information filter with garbage collection.]
Figure 1: The CELL model. Camera images of objects are converted to statistical representations of shapes. Spoken utterances captured by a microphone are mapped onto sequences of phoneme probabilities. The short-term memory (STM) buffers phonetic representations of recent spoken utterances paired with representations of co-occurring shapes. A short-term recurrence filter searches the STM for repeated sub-sequences of speech which occur in matching visual contexts. The resulting pairs of speech segments and shapes are placed in a long-term memory (LTM). A filter based on mutual information searches the LTM for speech-shape pairs which usually occur together, and rarely occur apart, within the LTM. These pairings are retained in the LTM, and rejected pairings are periodically discarded by a garbage collection process.

A speech processor converted spoken utterances into sequences of phoneme probabilities. We built in the ability to categorize speech into phonemic categories since similar abilities have been found in pre-linguistic infants after exposure to their native language [9]. At a rate of 100 Hz, this processor computed the probability that the past 20 milliseconds of speech belonged to each of 39 English phoneme categories or silence. The phoneme estimation was achieved by training an artificial recurrent neural network similar to [11]. The network was trained with a database of phonetically transcribed speech recordings of adult native English speakers [16]. Utterance boundaries were automatically located by detecting stretches of speech separated by silence.

A visual processor was developed to extract statistical representations of shapes from images of objects. The visual processor used "second order statistics" to represent object appearance. In a first step, edge pixels of the viewed object were located. For each pair of edge points, the normalized distance between the points and the relative angle of the edges at the two points were computed. All distances and angles were accumulated in a two-dimensional histogram representation of the shape (the "second order statistics"). Three-dimensional shapes were represented with a collection of two-dimensional shape histograms, each derived from a particular view of the object.

To gather visual data for evaluation experiments, a robotic device was constructed to gather images of objects automatically (Figure 2). The robot took images of stationary objects from various vantage points. Each object was represented by 15 shape histograms derived from images taken from 15 arbitrary poses of the robot. The chi-squared divergence statistic was used to compare shape histograms, a measure which has been shown to work well for object comparison [14]. Sets of images were compared by summing the chi-squared divergences of the four best matches between individual histograms.

Phonemic representations of multi-word utterances and co-occurring visual representations were temporarily stored in a short-term memory (STM). The STM had a capacity of five utterances, corresponding to approximately 20 words of infant-directed speech. As input was fed into the model, each new [utterance, shape] entry replaced the oldest entry in the STM. A short-term recurrence filter searched the contents of the STM for recurrent speech segments which occurred in matching visual contexts. The STM focused initial attention on input which occurred closely in time. By limiting analysis to a small window of input, computational resources for search and memory for unanalyzed sensory input are minimized, as is required for cognitive plausibility. To determine matches, an acoustic distance metric was developed [12] to compare each pair of potential speech segments drawn from the utterances stored in the STM. This metric estimated the likelihood that the segment pair in question were variations of the same underlying phoneme sequence and thus represented the same word. The chi-squared divergence metric described earlier was used to compare the visual components associated with each STM utterance. If both the acoustic and visual distances were small, the segment and shape were copied into the LTM. Each entry in the LTM represented a hypothesized prototype of a speech segment and its visual referent.

Infant-directed speech usually refers to the infant's immediate context [17]. When speaking to an infant, caregivers rarely refer to objects or events which are in another location or which happened in the past. Guided by this fact, a long-term mutual information filter assessed the consistency with which speech-shape pairs co-occurred in the LTM. The mutual information (MI) between two random variables measures the amount of uncertainty removed regarding the value of one variable given the value of the other
[5]. Mutual information was used to measure the amount of uncertainty removed about the presence of a specific shape in the learner's visual context given the observation of a specific speech segment. Since MI is a symmetric measure, the converse was also true: it measured the uncertainty removed about the presence of a particular speech segment given the visual context. Speech-shape pairs with high MI were retained, and periodically a garbage collection process removed hypotheses from the LTM which did not encode associations with high MI.
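As an illustration of this long-term filter, the sketch below estimates the mutual information between two binary events, "this speech segment appears in an LTM entry" and "this shape appears in an LTM entry", from co-occurrence counts. Treating the variables as binary and the particular counts in the example are our assumptions for exposition, not values from the experiments.

```python
# Hedged sketch: mutual information (in bits) between two binary events
# estimated from co-occurrence counts over LTM entries.

import math

def binary_mutual_information(n_both, n_speech, n_shape, n_total):
    """n_both   = entries containing both the speech segment and the shape
    n_speech = entries containing the speech segment
    n_shape  = entries containing the shape
    n_total  = total number of LTM entries"""
    mi = 0.0
    for s in (0, 1):                  # speech segment absent / present
        for v in (0, 1):              # shape absent / present
            if s and v:
                n_sv = n_both
            elif s:
                n_sv = n_speech - n_both
            elif v:
                n_sv = n_shape - n_both
            else:
                n_sv = n_total - n_speech - n_shape + n_both
            p_sv = n_sv / n_total
            p_s = (n_speech if s else n_total - n_speech) / n_total
            p_v = (n_shape if v else n_total - n_shape) / n_total
            if p_sv > 0:
                mi += p_sv * math.log2(p_sv / (p_s * p_v))
    return mi

# Hypothetical example: a segment appearing in 10 of 50 entries and a shape
# appearing in 12, co-occurring in 9, yield a clearly positive MI.
print(binary_mutual_information(n_both=9, n_speech=10, n_shape=12, n_total=50))
```

Pairs whose MI exceeds a threshold would be retained; the rest would be removed during garbage collection, as described above.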
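The second-order shape statistics and the chi-squared comparison described earlier in this section could be sketched as follows. The bin counts, the normalization scheme, and the way the "best matches" between view histograms are selected are assumptions chosen for illustration rather than the exact settings used in CELL.

```python
# Illustrative sketch (not the exact implementation) of second-order shape
# statistics and chi-squared comparison of shape histograms.

import numpy as np

def shape_histogram(edge_points, edge_angles, dist_bins=8, angle_bins=8):
    """Accumulate pairwise (normalized distance, relative angle) statistics
    for one object view into a normalized 2-D histogram."""
    pts = np.asarray(edge_points, dtype=float)    # (N, 2) edge pixel coordinates
    ang = np.asarray(edge_angles, dtype=float)    # (N,) edge orientation per point
    i, j = np.triu_indices(len(pts), k=1)         # all unordered point pairs
    dists = np.linalg.norm(pts[i] - pts[j], axis=1)
    dists = dists / dists.max()                   # normalize distances for scale invariance
    rel_angles = np.abs(ang[i] - ang[j]) % np.pi  # relative edge orientation
    hist, _, _ = np.histogram2d(dists, rel_angles,
                                bins=[dist_bins, angle_bins],
                                range=[[0.0, 1.0], [0.0, np.pi]])
    return hist / hist.sum()

def chi_squared_divergence(h1, h2, eps=1e-9):
    """Chi-squared divergence between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def shape_set_distance(hists_a, hists_b, n_best=4):
    """Compare two objects, each a set of view histograms, by summing the
    divergences of the best-matching histogram pairs (here, the n_best
    smallest divergences over all pairs)."""
    divs = sorted(chi_squared_divergence(a, b) for a in hists_a for b in hists_b)
    return sum(divs[:n_best])

# Example: two noisy views of the same random "contour" should be close.
rng = np.random.default_rng(0)
pts = rng.uniform(size=(60, 2))
angs = rng.uniform(0, np.pi, size=60)
h1 = shape_histogram(pts, angs)
h2 = shape_histogram(pts + rng.normal(scale=0.01, size=pts.shape), angs)
print(chi_squared_divergence(h1, h2))
```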
3.3. An Experiment in Language Acquisition

We gathered a corpus of audio-visual data from infant-directed interactions [13]. Six caregivers and their pre-linguistic (7-11 months) infants were asked to play with objects while being recorded. We selected 7 classes of objects commonly named by young infants [7]: balls, shoes, keys, toy cars, trucks, dogs, and horses. A total of 42 objects, six for each class, were obtained. The objects of each class varied in color, size, texture, and shape. Each caregiver-infant pair participated in 6 sessions over the course of two days. In each session, they played with 7 objects, one at a time. All caregiver speech was recorded to DAT using a wireless head-worn microphone. In total we collected approximately 7,600 utterances comprising 37,000 words across all six speakers. Most utterances contained multiple words, with a mean utterance length of 4.6 words. Speech segmentation could not rely on the existence of isolated words since these were rare in the data.

The 42 objects were imaged from various perspectives using a small CCD camera mounted on a four degree-of-freedom armature. A total of 209 images from varying perspectives were collected for each of the 42 objects, resulting in a database of 8,778 images. The speech recordings from the caregiver-infant play were combined with the images taken by the robot to provide multimodal input to the learning system (Figure 2). To prepare the corpus for processing, we performed the following steps: (1) audio was segmented at utterance boundaries, which was done automatically by finding contiguous frames of speech detected by the recurrent neural network; and (2) for each utterance, we selected a random set of 15 images of the object which was in play at the time the utterance was spoken.

We evaluated the lexicons extracted from the corpus using three measures. The first measure, M1, was the percentage of lexical items with boundaries at English word boundaries. The second, M2, was the percentage of lexical items which were complete English words with an optional attached article. The third, M3, was the percentage of lexical items which passed M2 and were paired with semantically correct visual models. For comparison, we also ran the system with only acoustic input. In this case it was not meaningful to use MI maximization, so instead the system searched for globally recurrent speech patterns, i.e., speech segments which were most often repeated in the entire set of recordings for each speaker. This acoustic-only model may be thought of as a rough approximation to a minimum description length approach to finding highly repeated speech patterns which are likely to be words of the language.
[Figure 2 block diagram: infant-directed spoken utterances and first-person perspective images of objects are input to inter-modal learning (CELL), which outputs an acquired audio-visual lexicon of paired spoken word models and visual object models.]
Figure 2: Speech was recorded from natural mother-infant interactions centered around objects. A robot was used to capture images of the objects from the infant interactions from various first-person perspectives. The speech recordings and images were provided as input for the learning experiment. The system learned statistical models of words and objects without the need for manual annotations.

Table 1: Results of evaluation on three measures (see text), averaged across all six speakers.

               M1       M2       M3
audio only     7±5%     31±8%    13±4%
audio-visual   28±6%    72±8%    57±10%

Results of the evaluation, shown in Table 1, indicate that the audio-visual clustering was able to extract a large proportion of English words from this very difficult corpus (M2), many associated with semantically correct visual models (M3). Typical speech segments in the lexicons included names of all six objects in the study, as well as onomatopoetic sounds such as "ruf-ruf" for dogs and "vroooom" for cars. The comparison with the audio-only system clearly demonstrates the improved performance when visual context is combined with acoustic evidence in the clustering process. For the difficult test of word boundary detection (M1), multimodal input led to a four-fold improvement over acoustic-only processing. Inter-modal structure enabled the system to find and extract useful knowledge without the aid of manual annotations or transcriptions.
4. CONCLUSIONS

The results with CELL show that it is possible to learn to segment continuous speech and acquire statistical models of
spoken words by providing a learning system with not only speech, but also co-occurring visual input. The converse is also true: the system learns visual categories by using the accompanying speech as labels. The resulting statistical models may be used for speech and visual recognition of words and objects [13]. Manually annotated training data is replaced by two streams of sensor data which serve as labels for each other.

This idea may be applied to a variety of domains where multimodal data is available but human annotation is prohibitively expensive. With the proliferation of broadcast media and the Internet, sources of multimodal data are limitless. Recent work has begun to explore some of the possibilities. Speech recognizers may be trained (or adapted) by listening to the audio track of a television broadcast and using co-occurring closed caption text [8]. Although this idea does rely on human transcriptions, large volumes of such transcriptions are already available and may be harvested for this purpose. Text classification is of great importance for Internet search and retrieval; pairs of hyperlinked web pages may be treated as multiple modes of data, leading to unsupervised learning of useful text statistics [3]. Audio-visual environmental signals can be clustered to model cross-modal structure [4].

More ambitious forms of multimodal learning may become feasible in the future. Learning from television may be possible by combining techniques of computer vision, speech recognition, and auditory scene analysis. Children's programs, which often contain simple visual imagery and speech referring to concrete visual scenes, might serve to bootstrap learning. Multimodal web crawlers could gather linked sets of multimodal data to feed learning algorithms.

To summarize, the success of human-computer interaction based on natural modalities will depend heavily on the availability of abundant training data. New learning methods which leverage inter-modal structure can extract knowledge from readily available sources. The situation is similar to that of an infant learning from observations of the world without access to any perfect annotations or transcriptions. Although this paper has focused on the development of recognition technologies, speech synthesis, language understanding, and language translation are also moving towards data-driven statistical methods (cf. [2, 15, 10]). Multimodal learning may be applicable to aspects of these problems as well.
5. REFERENCES

[1] F. Berard. The perceptual window: Head motion as a new input stream. In Proc. IFIP Conference on Human-Computer Interaction (INTERACT), Edinburgh, Scotland, 1999.
[2] M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and A. Syrdal. The AT&T Next-Gen TTS system. In Proc. Joint Meeting of ASA, EAA, and DAGA, Berlin, Germany, 1999.
[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proc. Conference on Computational Learning Theory, 1998.
[4] B. Clarkson and A. Pentland. Unsupervised clustering of ambulatory audio and video. In Proc. of ICASSP, Phoenix, Arizona, 1999.
[5] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 1991.
[6] A. Gorin, G. Riccardi, and J. H. Wright. How may I help you? Speech Communication, 1997.
[7] J. Huttenlocher and P. Smiley. Early word meanings: the case of object names. In P. Bloom, editor, Language Acquisition: Core Readings, pages 222-247. MIT Press, Cambridge, MA, 1994.
[8] P. Jang and A. Hauptmann. Learning to recognize speech by watching television. IEEE Intelligent Systems, 14(5):51-58, 1999.
[9] P. K. Kuhl, K. A. Williams, F. Lacerda, K. N. Stevens, and B. Lindblom. Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255:606-608, 1992.
[10] H. Ney, S. Niessen, F. Och, H. Sawaf, C. Tillmann, and S. Vogel. Algorithms for statistical translation of spoken language. IEEE Trans. Speech and Audio Processing, 8(1):24-36, 2000.
[11] T. Robinson. An application of recurrent nets to phone probability estimation. IEEE Trans. Neural Networks, 5(3), 1994.
[12] D. Roy. Integration of speech and vision using mutual information. In Proc. of ICASSP, Istanbul, Turkey, 2000.
[13] D. K. Roy. Learning Words from Sights and Sounds: A Computational Model. PhD thesis, Massachusetts Institute of Technology, 1999.
[14] B. Schiele and J. L. Crowley. Probabilistic object recognition using multidimensional receptive field histograms. In Proc. 13th International Conference on Pattern Recognition (ICPR), Volume B, pages 50-54, August 1996.
[15] R. Schwartz, S. Miller, D. Stallard, and J. Makhoul. Hidden understanding models for statistical sentence understanding. In Proc. of ICASSP, 1997.
[16] S. Seneff and V. Zue. Transcription and alignment of the TIMIT database. In Proc. Second Symposium on Advanced Man-Machine Interface through Spoken Language, Oahu, Hawaii, November 1988.
[17] C. E. Snow. Mothers' speech research: From input to interaction. In C. E. Snow and C. A. Ferguson, editors, Talking to Children: Language Input and Acquisition. Cambridge University Press, 1977.
[18] E. Spelke. The infant's acquisition of knowledge of bimodally specified objects. Journal of Experimental Child Psychology, 31:279-299, 1981.
[19] S. Young, J. Odell, and P. Woodland. Tree-based state tying for high accuracy acoustic modelling. In Proc. ARPA Human Language Technology Conference, Plainsboro, NJ, 1994.
[20] V. Zue, S. Seneff, J. Glass, J. Polifroni, C. Pao, T. Hazen, and L. Hetherington. JUPITER: A telephone-based conversational interface for weather information. IEEE Transactions on Speech and Audio Processing, 8(1):85-96, 2000.