From: Proceedings, Fourth Bar Ilan Symposium on Foundations of Artificial Intelligence. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.
Intelligent talk-and-touch interfaces using multi-modal semantic grammars

Bruce Krulwich and Chad Burkey
Center for Strategic Technology Research
Andersen Consulting LLP
100 South Wacker Drive, Chicago, IL 60606
[email protected]
Abstract

Multi-modal interfaces have been proposed as a way to capture the ease and expressivity of natural communication. Interfaces of this sort allow users to communicate with computers through combinations of speech, gesture, touch, expression, etc. A critical problem in developing such an interface is integrating these different inputs (e.g., spoken sentences, pointing gestures, etc.) into a single interpretation. For example, in combining speech and gesture, a system must relate each gesture to the appropriate part of the sentence. We are investigating this problem as it arises in our talk and touch interfaces, which combine full-sentence speech and screen-touching. Our solution, which has been implemented in two completed prototypes, uses multi-modal semantic grammars to match screen touches to speech utterances. Through this mechanism, our systems can easily support wide variations in the speech patterns used to indicate touch references. Additionally, they can ask the user specific, focused questions in the event of an inability to understand the input. They can also incorporate other semantic information, such as contextual references or references to previous sentence referents, through this single unified approach. Our two prototypes appear effective in providing a straightforward and powerful interface to novice computer users.
1. Talk and touch interfaces

Natural human dialogue is a lot more than sequences of words. Context, gestures, expressions, and implicit knowledge all play a considerable role in conveying complex thoughts and ideas in a discussion. These factors clarify the language, fill in the gaps, and disambiguate the spoken words themselves. CSTaR's talk and touch project is attempting to incorporate information of this sort into an intelligent interface based on full-sentence speech recognition and touch-screen sensing [Krulwich and Burkey, 1994]. (Our screen touches can easily be replaced with mouse clicks, etc.) Consider the following utterances, in the context of a typical electronic mail system:

1. Read this message (with a corresponding touch to a message header)
2. Send a message to him (with a corresponding touch to a name)
3. Send a message to him (without a touched name)
4. Forward this message to him and Bill (two touches: one message, one name)
5. Forward this message to him and Bill (with only a touched message)
In each of these cases, the system has to match the touched objects (references to messages or people) to the ambiguous slots in the sentence. In cases 3 and 5 there are more potential references to touched objects than there are screen touches, so additional inference is required to determine the missing information from context, or at least to determine specifically which information is unspecified. This paper discusses the approach to these issues that is embodied in our two talk and touch prototypes. The first, an interface to a typical e-mail system shown in figure 1, allows commands such as those listed above. We use this system to introduce and illustrate our approach in section 2. The second, a video communications manager shown briefly in figure 2, supports commands for establishing video calls and multi-point video-conferences, sending and receiving multimedia e-mail, and sharing and manipulating documents during a video-conference. We discuss this second system in detail in section 3.
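To make the matching problem concrete, the following is a minimal sketch in Python of pairing deictic slots with screen touches. It is an illustration only, not the prototypes' implementation, and every name in it (TouchEvent, match_touches, the slot kinds) is invented for this example.

from dataclasses import dataclass

@dataclass
class TouchEvent:
    kind: str      # "message" or "person"
    target: str    # e.g., a message id or a person's name

def match_touches(slot_kinds, touches):
    """Pair each deictic slot in the sentence (e.g., "this", "him")
    with a screen touch of the matching kind, in order of occurrence.
    Slots left unpaired must be resolved from context or by asking
    the user a focused question (cases 3 and 5 above)."""
    remaining = list(touches)
    bindings = []
    for kind in slot_kinds:
        match = next((t for t in remaining if t.kind == kind), None)
        if match is not None:
            remaining.remove(match)
            bindings.append((kind, match.target))
        else:
            bindings.append((kind, None))   # unresolved: needs context
    return bindings

# Case 4: "Forward this message to him and Bill" with two touches
print(match_touches(["message", "person"],
                    [TouchEvent("message", "msg-17"), TouchEvent("person", "John")]))
# -> [('message', 'msg-17'), ('person', 'John')]

# Case 5: the same sentence with only a touched message
print(match_touches(["message", "person"],
                    [TouchEvent("message", "msg-17")]))
# -> [('message', 'msg-17'), ('person', None)], prompting "Send to whom?"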
2. Multi-modal semantic grammars

The first step in our approach to developing an integrated interpretation of speech and screen touches is to parse the spoken text using a multi-modal semantic grammar. (Issues relating to speech recognition are discussed in section 4; in our discussion here we assume that the spoken commands have been recognized and are available in text form.) In general, semantic grammars are language grammars that parse text into semantic components instead of syntactic ones [Burton, 1976]. For example, a typical syntactic grammar would parse the sentence forward the new message from John to Bill and my team into syntactic constructs such as subject, verb, direct object, and prepositional phrase. A semantic grammar, on the other hand, would parse it in terms of semantic constructs such as command, message specification, author, and recipients. This allows for easier and more accurate interpretation, as well as improving performance through early application of semantic constraints.

We have extended this idea by adding semantic constructs to the grammar that refer to other sources of input information, such as screen touches, as well as constructs that refer to context. For example, the sentence "forward this message to Bill and Chad" would be parsed to recognize that "this message" is not only a message specification, but is also a reference either to a touched screen object or to a particular contextually-significant message.

A sample grammar of this sort, for top-level commands in our e-mail prototype, is shown in figure 3. The grammar defines nodes, or "tags" (shown in angled brackets), that correspond to blocks of text in the input string. Each tag can be seen as a slot that may be filled by a portion of an input sentence. Instead of the phrase from him being recognized as a prepositional phrase, as it might be with a syntactic grammar, it can be recognized as an author of a message (the tag <…>), and in particular as a reference to a touched person on the screen (the tag <…>).

Figure 1: A prototype talk and touch e-mail system workspace

The following are some of the tags used in parsing the sentence forward this message to Bill and Chad, and their corresponding text:

<…> = "forward this message to Bill and Chad"
<…> = "forward this message to Bill and Chad"
<…> = "this message"
<…> = "this message"
<…> = "to Bill and Chad"
<…> = "Bill and Chad"
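For illustration, such tag bindings can be pictured as a simple mapping from tags to text spans. The tag names below are invented stand-ins, not the prototype's actual tags:

parse_bindings = {
    "<command>":         "forward this message to Bill and Chad",
    "<forward-command>": "forward this message to Bill and Chad",
    "<message-spec>":    "this message",
    "<touched-message>": "this message",   # multi-modal tag: also expects a screen touch
    "<recipient-spec>":  "to Bill and Chad",
    "<recipients>":      "Bill and Chad",
}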
Figure 2: A video communications manager system
Figure 3: A multi-modal semantic grammar for top-level commands in the e-mail prototype
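As a minimal sketch of how a grammar of this kind might be represented and matched, the following Python fragment encodes a tiny semantic grammar and a top-down matcher that returns tag-to-text bindings of the sort listed above. The tags and productions are invented for illustration and are far smaller than the prototype's actual grammar.

def parse(tag, tokens, pos, grammar):
    """Try each production for `tag` starting at tokens[pos]; on success
    return the position after the matched span and the tag bindings found."""
    for production in grammar[tag]:
        p, children = pos, []
        ok = True
        for sym in production:
            if sym in grammar:                          # a nonterminal tag
                result = parse(sym, tokens, p, grammar)
                if result is None:
                    ok = False
                    break
                p, sub = result
                children.extend(sub)
            elif p < len(tokens) and tokens[p] == sym:  # a terminal word
                p += 1
            else:
                ok = False
                break
        if ok:
            return p, [(tag, " ".join(tokens[pos:p]))] + children
    return None

# A toy grammar; <message-spec> stands for a multi-modal tag whose match
# must also be paired with a screen touch (or resolved from context).
GRAMMAR = {
    "<command>":         [["<forward-command>"]],
    "<forward-command>": [["FORWARD", "<message-spec>", "TO", "<person>"]],
    "<message-spec>":    [["THIS", "MESSAGE"]],
    "<person>":          [["HIM"], ["BILL"], ["CHAD"]],
}

print(parse("<command>", "FORWARD THIS MESSAGE TO BILL".split(), 0, GRAMMAR))
# -> (5, [('<command>', 'FORWARD THIS MESSAGE TO BILL'),
#         ('<forward-command>', 'FORWARD THIS MESSAGE TO BILL'),
#         ('<message-spec>', 'THIS MESSAGE'),
#         ('<person>', 'BILL')])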