SPEAK4IT: MULTIMODAL INTERACTION IN THE WILD

MICHAEL JOHNSTON AND PATRICK EHLEN
AT&T LABS RESEARCH

ABSTRACT

Speak4itSM is a consumer-oriented mobile search application that leverages multimodal input and output to allow users to search for and act on local business information. In addition to specifying queries by voice (e.g., “bike repair shops near the Golden Gate Bridge”), users can combine speech and gesture. For example, “gas stations” spoken while tracing a route on the display will return the gas stations along the specified route. We provide interactive demonstrations of Speak4it on both the iPhone and iPad platforms and explain the underlying multimodal architecture and the challenges of supporting true multimodal interaction in a deployed mobile service.

Index Terms— multimodal integration, speech, gesture, location-based, local search

1. INTRODUCTION
Local search has frequently been used as a test bed application for the investigation of multimodal interface techniques and processing models. The MATCH system [5] enabled users to interact using speech and drawing to search for restaurants and other businesses in several major U.S. cities. Local search and navigation was one of the key domains for the SmartKom Mobile prototype [7]. The AdApt system provided a multimodal interface for searching for apartments in Stockholm [4]. Oviatt [6] demonstrated significant improvements in task performance and user preference for multimodal over unimodal interfaces for interaction with dynamic maps in a real estate search task. QuickSet [1] addressed a related task of laying down features on a map. Thanks to a number of technical advances, including the capabilities of mobile computing devices such as smartphones and tablets, the availability of high-speed mobile networks, and the availability of speech and language processing capabilities in the cloud, it is now possible to put true multimodal interaction into the hands of customers for use in their daily lives. To the best of our knowledge, the Speak4it application presented in this demonstration is the first commercially deployed mobile local search application to support true multimodal interaction, where users can issue commands that integrate simultaneous inputs from speech and gesture. Versions of Speak4it are available for free download for both the iPhone and iPad.

2. SPEAK4IT INTERACTION

Figure 1. Speak4itSM interaction
On launching the application, users are presented with a map of their local area (Figure 1, Left), and have the option of either clicking a “Speak/Draw” button or lifting the device to the ear to initiate a speech-based query.
Users can search for businesses based on the name of the business, category, or associated keywords. For example, in Figure 1 (Left), the user pressed the ‘Push to Speak’ button and said “coffee shops.” The system zooms in and displays the closest coffee shops to the user’s current location as selectable pins on the map (Figure 1, Middle). The results can also be viewed as a spin-able text list (Figure 1, Right) or in a ‘radar’ view. In the radar view, depending on the orientation of the device, results appear either as green markers within a compass dial, or superimposed over the camera viewfinder, providing an augmented reality effect where tags describing businesses float in line with the user’s perspective. In all of the views, users can tap on businesses in order to access more detailed information and initiate actions such as making a call, getting directions, or sending business details by email or SMS. Search over all U.S. business names is supported, including both chains and local businesses, e.g., “Starbucks” or “Ninth Street Espresso.” Users can explicitly state a location in their query, for instance, by saying, “real estate attorneys in Detroit Michigan,” or “Italian restaurants near the Metropolitan Museum.” They can manipulate the map view using drag (pan) and pinch (zoom) touch gestures, or using spoken commands. If the user says “San Francisco,” the map will zoom to display that city. The system also supports multimodal commands where the location is specified directly by drawing on the map [1,4,6].
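Before turning to gestures, the speech-only query forms above can be made concrete with a small sketch. The following Python is purely illustrative: the parse structures and the describe function are invented for exposition and do not reflect Speak4it's internal query format.

    # Hypothetical illustration of the utterance types described above; the
    # parse structures are invented for exposition, not Speak4it's format.
    EXAMPLES = {
        # category or keyword search with no explicit location
        "coffee shops": {"topic": "coffee shops", "location": None},
        # business search with an explicit location phrase
        "italian restaurants near the metropolitan museum":
            {"topic": "italian restaurants",
             "location": "the metropolitan museum"},
        # bare location phrase, treated as a map command (zoom to that place)
        "san francisco": {"topic": None, "location": "san francisco"},
    }

    def describe(utterance):
        parse = EXAMPLES[utterance.lower()]
        if parse["topic"] is None:
            return "map command: zoom to " + parse["location"]
        where = parse["location"] or "the user's current location"
        return "search for " + parse["topic"] + " near " + where

    if __name__ == "__main__":
        for utterance in EXAMPLES:
            print(utterance, "->", describe(utterance))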
The system supports point, area, and line gestures (Figure 2). For example, the user can say, “French restaurants here” in conjunction with a point or area gesture. For a point gesture, the system returns the French restaurants closest to the point. For an area gesture, results within the area are returned. Users can also search for businesses along a specific route by drawing a line gesture. For example, the user can say “gas stations” and draw a line gesture (Figure 2, Right) and the system will return the gas stations closest to the route drawn.
Figure 2. Speak4it gesture inputs
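The mapping from gesture type to search constraint can be illustrated with a simplified, hypothetical sketch over planar map coordinates: a point gesture selects the nearest businesses, an area gesture keeps businesses inside the traced polygon, and a line gesture keeps businesses close to the traced route. The Business class, thresholds, and helper names below are assumptions for exposition, not the deployed implementation.

    from dataclasses import dataclass
    from math import hypot

    # Hypothetical sketch of gesture-constrained search over planar map
    # coordinates; not the deployed Speak4it implementation.

    @dataclass
    class Business:
        name: str
        x: float
        y: float

    def dist(p, q):
        return hypot(p[0] - q[0], p[1] - q[1])

    def nearest_to_point(businesses, point, k=10):
        """Point gesture: the k businesses closest to the indicated point."""
        return sorted(businesses, key=lambda b: dist((b.x, b.y), point))[:k]

    def inside_area(businesses, polygon):
        """Area gesture: businesses whose location falls inside the polygon."""
        return [b for b in businesses if _contains(polygon, (b.x, b.y))]

    def along_route(businesses, route, max_offset=0.5):
        """Line gesture: businesses close to the traced route. Distance to the
        nearest trace vertex approximates distance to the route, since ink
        traces are dense sequences of points."""
        return [b for b in businesses
                if min(dist((b.x, b.y), v) for v in route) <= max_offset]

    def _contains(polygon, p):
        """Ray-casting point-in-polygon test."""
        inside = False
        for i in range(len(polygon)):
            (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % len(polygon)]
            if (y1 > p[1]) != (y2 > p[1]):
                if p[0] < (x2 - x1) * (p[1] - y1) / (y2 - y1) + x1:
                    inside = not inside
        return inside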
3. MULTIMODAL ARCHITECTURE

The user interacts with a client application on the iPhone or iPad, which communicates over HTTP with a multimodal search platform that performs speech and gesture recognition, query parsing, geocoding, and search.

Figure 3. Multimodal architecture

The client captures and encodes the user’s speech and touch input. One of the design challenges in adding multimodality was working out how to capture ink input on a touchscreen map. The established paradigm for map manipulation on the iPhone and similar devices is to use touch gestures to perform pan and zoom operations. What we found to work well was to use the “Push to Speak” button as essentially a “Push to Interact” button (“Speak/Draw”). When the user presses the button, touch gestures on the map are interpreted as referential pointing and drawing actions rather than direct map manipulation. After the user makes a gesture, clicks ‘Stop,’ or stops speaking, the map ceases to work as a drawing canvas and returns to the direct map manipulation mode.
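A minimal, platform-neutral sketch of this “Push to Interact” design is given below, in Python rather than the iOS APIs actually used; the class and method names are illustrative assumptions, not Speak4it's client code.

    from enum import Enum, auto

    class MapMode(Enum):
        MANIPULATE = auto()  # touch gestures pan and zoom the map
        DRAW = auto()        # touch gestures are captured as ink for the query

    class InteractionController:
        """Illustrative state machine for the 'Speak/Draw' button behavior."""

        def __init__(self):
            self.mode = MapMode.MANIPULATE
            self.ink_trace = []

        def on_speak_draw_pressed(self):
            # Start of a multimodal turn: capture speech and treat touch as ink.
            self.mode = MapMode.DRAW
            self.ink_trace = []

        def on_touch(self, x, y, pan_zoom_handler):
            if self.mode is MapMode.DRAW:
                self.ink_trace.append((x, y))   # referential pointing/drawing
            else:
                pan_zoom_handler(x, y)          # direct map manipulation

        def on_turn_complete(self):
            # Gesture finished, 'Stop' tapped, or end of speech detected:
            # the map reverts to direct manipulation and the ink is returned
            # so it can be streamed to the platform with the audio.
            self.mode = MapMode.MANIPULATE
            return self.ink_trace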
The user’s ink traces and encoded speech input are streamed over HTTP to the multimodal platform. This multimodal data stream is received and de-multiplexed by a Multimodal Interaction Manager component. The user’s ink trace is passed to a gesture recognizer, which classifies it as a point, line, or area. Audio is forwarded to a speech platform, which first performs speech recognition using a statistical language model trained on previous query data. From there, the speech recognition output is passed to a natural language understanding (NLU) component that parses the query into a topic phrase designating the user’s desired search subject (e.g., “pizza”) and, if applicable, a location phrase (e.g., “San Francisco”) designating a desired location [2]. When there is an explicit location phrase (e.g., “pizza restaurants in San Francisco”), the location phrase is geocoded so that search results for the topic phrase can be sorted and displayed according to their proximity to that location. If the location is not stated explicitly in the query, the Interaction Manager passes a series of features pertaining to possible salient locations in the current interaction to a location grounding component, which uses those features to determine the currently most salient location [2].
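The server-side flow just described can be summarized in a short sketch. The following Python is a hedged approximation under stated assumptions: the asr, nlu, geocoder, grounder, and search objects are hypothetical stand-ins for the platform components, the gesture classifier is a toy heuristic rather than the deployed recognizer, and the rule that a drawn gesture supplies the location is an assumption for exposition.

    from dataclasses import dataclass
    from math import hypot
    from typing import Optional

    @dataclass
    class ParsedQuery:
        topic: str                      # e.g., "pizza"
        location: Optional[str] = None  # e.g., "San Francisco", if explicit

    def classify_gesture(ink_trace):
        """Toy classifier: point, line, or area, from the trace's shape."""
        if len(ink_trace) < 3:
            return "point"
        length = sum(hypot(x2 - x1, y2 - y1)
                     for (x1, y1), (x2, y2) in zip(ink_trace, ink_trace[1:]))
        (sx, sy), (ex, ey) = ink_trace[0], ink_trace[-1]
        closed = hypot(ex - sx, ey - sy) < 0.1 * length
        return "area" if closed else "line"

    def handle_query(audio, ink_trace, context_features,
                     asr, nlu, geocoder, grounder, search):
        """One de-multiplexed request: audio, optional ink, and context."""
        gesture = classify_gesture(ink_trace) if ink_trace else None

        text = asr.recognize(audio)             # LM trained on past queries
        query = ParsedQuery(**nlu.parse(text))  # topic / location phrases

        if query.location:                      # explicit location phrase
            center = geocoder.geocode(query.location)
        elif gesture:                           # assumption: gesture gives it
            center = ink_trace
        else:                                   # fall back to grounding
            center = grounder.most_salient(context_features)

        return search.lookup(query.topic, center, gesture)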
ACKNOWLEDGEMENTS

Thanks to Jay Lieske, Clarke Retzer, Brant Vasilieff, Diamantino Caseiro, Junlan Feng, Srinivas Bangalore, Claude Noshpitz, Barbara Hollister, Remi Zajac, Mazin Gilbert, and Linda Roberts for their contributions to Speak4it.
4. REFERENCES

[1] Cohen, P. R., M. Johnston, D. McGee, S. L. Oviatt, J. Pittman, I. Smith, L. Chen, and J. Clow. 1998. Multimodal Interaction for Distributed Interactive Simulation. In M. Maybury and W. Wahlster (eds.), Readings in Intelligent Interfaces. Morgan Kaufmann Publishers, San Francisco, CA, 562–571.

[2] Ehlen, P. and M. Johnston. 2010. Location Grounding in Multimodal Local Search. In Proceedings of ICMI-MLMI 2010.

[3] Feng, J., S. Bangalore, and M. Gilbert. 2009. Role of Natural Language Understanding in Voice Local Search. In Proceedings of Interspeech 2009, 1859–1862.
[4] Gustafson, J., L. Bell, J. Beskow, J. Boye, R. Carlson, J. Edlund, B. Granström, D. House, and M. Wirén. 2000. AdApt – A Multimodal Conversational Dialogue System in an Apartment Domain. In Proceedings of ICSLP 2000, Vol. 2, 134–137.

[5] Johnston, M., S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker, S. Whittaker, and P. Maloor. 2002. MATCH: An Architecture for Multimodal Dialogue Systems. In Proceedings of the 40th ACL, 376–383.

[6] Oviatt, S. L. 1997. Multimodal Interactive Maps: Designing for Human Performance. Human-Computer Interaction, 12, 93–129.

[7] Wahlster, W. 2006. SmartKom: Foundations of Multimodal Dialogue. Springer-Verlag, New York.

[8] http://speak4it.com/