A Conversational Model of Multimodal Interaction

Adelheit Stein, Ulrich Thiel
Abstract

Multimodal interaction is employed in a variety of contemporary information systems in order to enhance the flexibility and naturalness of the user interface. In this paper, we propose a comprehensive framework which allows the modeling and specification of multimodal interactions. To this end, we employ an extended notion of 'dialogue acts' which can be performed by linguistic or non-linguistic means. On this basis we discuss the temporal structure of multimodal interaction. First, a comprehensive set of constraints is presented that describes all patterns of exchange which can be encountered during a cooperative information-seeking dialogue. Second, we introduce a strategic level of description, which allows the specification of the topical structure of the dialogue according to a selected information-seeking strategy. The model was used to design and implement the MERIT system (Multimedia Extensions to Retrieval Interaction Tools), and led to a reduction in the complexity of the user interface while preserving most of the useful, but sometimes confusing, dialogue options of advanced direct manipulative interfaces.
An abbreviated version is published in: Proc. of the 11th National Conference on Artificial Intelligence (AAAI ’93), Washington DC, USA, July 11-16, 1993. Menlo Park: AAAI Press/ The MIT Press, 1993, pp. 283-288. Also available as: Arbeitspapiere der GMD No. 741, Sankt Augustin: GMD, March 1993.
Address all correspondence to: Dr. Adelheit Stein, Dr. Ulrich Thiel, GMD-IPSI, Dolivostraße 15, D-64293 Darmstadt. E-mail: [email protected]. Phone: ++49 / 6151 / 869-841
1 Introduction

Multimodal user interfaces employ various means of conveying information, e.g. text, graphics, forms, tables, and pictures. When the presented information items become complex, most systems allow users to investigate them directly. Thus, the distance between the users' intentions and the objects they involve in their actions is reduced to a minimum (cf. Hutchins, Hollan & Norman 1986). However, this exploratory approach to information retrieval has to overcome the problems arising from browsing in a large information space. These difficulties, which are well known in hypertext applications, can be addressed by combining the direct manipulation of objects by the user with cooperative system responses. To date, cooperation has mostly been investigated in the context of natural language interfaces, which regard user-machine interaction as a dialogue between two partners. As this notion is often referred to as the conversation metaphor (cf. Reichman 1986, 1989), the proposed hybrid interaction style, which integrates graphical and natural language components in a multimedia environment, will be called multimodal conversation in the following.

In this paper, we outline a comprehensive model for multimodal conversations. The individual contributions to an interaction – called dialogue acts – are represented by complex data structures. To capture the temporal aspects of dialogues, the model encompasses sequences of dialogue acts. These sequences can be potential as well as factual in nature; thus, we can represent an actual dialogue history as well as possible continuations thereof.

The paper is organized as follows: the following section outlines the notion of multimodal dialogue acts, whereas the third section discusses the constraints which govern the temporal structure of a cooperative and coherent multimodal interaction.
In the fourth section, we present an overview of MERIT, a prototypical information system which was designed and implemented as an application of our conversational model. Some benefits of the conversational approach are sketched in the concluding part of the paper.
2 The Notion of Multimodal Dialogue Acts

In the conversational approach, user inputs such as mouse clicks, menu selections, etc. are not interpreted as invocations of methods that are executed independently of the dialogue context. Instead, the direct manipulation of an object is considered a dialogue act expressing a discourse goal of the user. The system can therefore respond in a more flexible way by taking into account the illocutionary and semantic aspects of the user's input. Additionally, this approach to human-computer interaction provides a basis for the integration of different interaction styles, such as natural language and graphics, in a multimodal information system. The choice of an appropriate combination of interaction modes is essential for a successful interface design, and finding it requires experimenting with alternative combinations of modes. Thus, we need a flexible, modular architecture for multimodal interfaces composed of interaction tools.
We distinguish three major components: first, a dialogue manager which is capable of tracking the discourse goals of the user and, consequently, planning system reactions as cooperative responses; second, a database access module providing relevant data; and, as a third component, a presentation manager generating visualizations of the retrieved information items (cf. e.g. Arens & Hovy 1990, Feiner & McKeown 1990, Oei et al. 1992).

The dialogue manager operates on internal representations of dialogue acts, which makes it possible to handle the user's direct manipulations in a context-sensitive way. Based on a representation of the dialogue history, dialogue acts of the user, which are performed by directly manipulating the display structure, can be identified. For this reason, the user's manipulations (e.g. mouse clicks, menu selections) are not only executed as methods sent to the graphical surface objects; they also effect transformations of the underlying internal representation of the ongoing dialogue. The reactions of the system are instances of generic dialogue acts, like 'inform', 'offer', or 'reject', which are modeled as frames. The slot structure of a dialogue act frame refers to the illocutionary and locutionary components of the 'utterance'. Locution is a speech-act-theoretic term denoting the part of an utterance "which is really said", in contrast to the part that has to be inferred from the context (e.g. the illocution). Since we adhere to an object-oriented presentation style, the system's locution is performed by changing or creating informational objects. An informational object represents a fragment extracted from the underlying information base together with a specification of its graphical and/or textual presentation. In general, a dialogue act performed by the system will result in the visualization of new graphical objects on the screen, combined with the generation and presentation of a 'comment' in natural language.
Thus, the user is informed about what is displayed, what is new, and how this relates to the topics of the ongoing dialogue. As the comments should not only refer to schema concepts but also mention their instances, a natural language generation component is part of the system architecture (cf. Dilley et al. 1992).
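The frame representation of dialogue acts described above can be sketched roughly as follows (an illustrative rendering in Python; the class and slot names, e.g. DialogueAct and locution, are our own shorthand, not those of the actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class InformationalObject:
    """A fragment extracted from the information base plus a
    specification of its graphical/textual presentation."""
    fragment: dict        # extracted data, e.g. {"projects_found": 5}
    presentation: str     # e.g. "table", "graph", "text"

@dataclass
class DialogueAct:
    """Frame for a generic dialogue act such as 'inform', 'offer', 'reject'."""
    act_type: str         # illocutionary component
    speaker: str          # "A" (information seeker) or "B" (provider)
    addressee: str
    locution: list = field(default_factory=list)  # informational objects shown
    comment: str = ""     # generated natural-language comment

# A system 'inform' is performed by creating informational objects
# and attaching a natural-language comment.
act = DialogueAct(
    act_type="inform", speaker="B", addressee="A",
    locution=[InformationalObject({"projects_found": 5}, "table")],
    comment="5 projects have been found.")
```

The locution slot holds the informational objects that realize the act on the screen, while act_type carries the illocutionary component.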
3 Structure of Multimodal Dialogues

While the notion of dialogue acts captures the essence of single contributions to a multimodal interaction, a conversational model also has to address the problem of combining these actions into coherent sequences. Our approach takes local as well as global patterns of dialogues into account. The local structures are governed by the interrelations of the dialogue roles and tactics of the information seeker and the information provider, and can be described by a complex network of generic dialogue acts. The global aspect relates the sequence of dialogue contributions to information-seeking strategies. Thus, the system is able to plan the content of subsequent dialogue steps according to principles of topical coherence.
3.1 The Conversational Roles Network

A formal schema of information-seeking dialogues allows dialogues to be modeled at the discourse act level or – in terms of Speech Act Theory – the illocutionary level1. In order to capture the temporal structure of the dialogue, we introduce the notion of dialogue states. During a dialogue, the dialogue acts performed by the dialogue partners change the dialogue state. Since there may be several possible continuations, our model of possible dialogues resembles a network, whereas each actual dialogue is, of course, represented as a sequence of individual acts. The network we developed for our problem domain of information retrieval is called COR (modeling "Conversational Roles"). For a detailed description of the formalism and the theoretical framework we refer to Sitter & Stein (1992) and Maier & Sitter (1992 a, b). In the following, we concentrate on its major characteristic features.

The COR approach is intended to cover the genre of cooperative information-seeking dialogues in a human-computer interaction environment. The underlying assumptions can be stated briefly as follows:

- Our model covers information-seeking exchanges between two participants, i.e. dialogues, rather than communication in groups.
- The participants act cooperatively and their contributions are matter-of-fact, in contrast to, e.g., narrative situations, sales talks, or chatting.
- Participant A seeks certain information; B is willing to provide knowledge in a specific domain.
- A and B negotiate about the problem until they reach a common interpretation and the dialogue goals can be satisfied, or the goals are withdrawn.
- Role changes between information seeker and information provider occur in certain situations. This is always the case in inserted sub-dialogues, e.g.
when the information provider poses clarifying questions to obtain more information about, for instance, a request made by the information seeker.

Basically, the COR network defines the generic dialogue acts available, e.g. asking, offering, promising, answering, and evaluating, together with their interrelations in the ongoing dialogue, thereby modeling the possible sequences of dialogue contributions. COR can be regarded as a recursive state-transition network; it defines in detail the mutual role changes of speaker and addressee with respect to the current situation of the discourse. It was influenced by the "Conversation for Action" model proposed by Winograd and Flores (1986), where discourses are interpreted as "negotiations". We extended their model for the situation of information-seeking processes, applying some concepts of Systemic Linguistics approaches to discourse modeling (cf. Fawcett et al. 1988, Halliday 1984) and Rhetorical Structure Theory (Mann & Thompson 1987). Thus, a more flexible way of interaction has been achieved, allowing both dialogue partners to correct, withdraw, or confirm their intentions as well as insert sub-dialogues for clarification.

–––––
1. The notion of "illocutionary acts" is derived from the Theory of Speech Acts as put forward by Austin and Searle. We adopt their terms as far as possible to avoid ambiguity. Searle and Vanderveken (1985) distinguish several categories of illocutionary acts like "directives", "commissives", and "assertives" that can be specified for various discourse genres. Specific dialogue acts are, for instance, asking, ordering, promising, warning, and apologizing.

Figure 1 shows the basic COR schema of a dialogue. Circles and squares represent the states on the top level of the dialogue (circles are within a dialogue, squares indicate terminal states). Arrows represent transitions between two states, i.e. the dialogue contributions. A and B are the participants, A referring to the information seeker, B to the information provider. The order of the parameters indicates the speaker-hearer roles: the first parameter indicates the speaker, the second the addressee. Note, first, that the dialogue contributions (transitions) are themselves transition networks which may contain sub-dialogues of the type of the basic schema. We distinguish between two types of networks of dialogue contributions (cf. fig. 2 and fig. 3), which are described below. Second, one should keep in mind that in COR all dialogue contributions – except the inform contribution – can be 'implicit', i.e. they may be omitted (a 'jump') when the intention can be inferred from the current context. For instance, it might be unnecessary to explicitly utter a 'withdraw offer' when one comes up with a new offer right away; or information can be given without being explicitly asked for, if one can assume that it is needed or can contribute to a cooperative dialogue.
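The top-level schema can be approximated as a transition table. The sketch below (hypothetical Python; the state numbers and the reduced transition set are our own simplification of the request path of figure 1) also illustrates implicit contributions: a non-inform contribution may be skipped when the next explicit contribution is legal from the skipped-over state:

```python
# Simplified top-level COR schema as a transition table: only the
# 'request' path of figure 1 is covered; state numbers are illustrative.
COR = {
    (1, "request(A,B)"): 2,
    (2, "promise(B,A)"): 3,
    (3, "inform(B,A)"): 4,
    (4, "be_contented(A,B)"): 5,      # terminal state: goals satisfied
    (4, "be_discontented(A,B)"): 1,   # prepare a new dialogue cycle
}

def advance(state, contribution):
    """Return the successor state. Contributions other than 'inform'
    may be left implicit (a 'jump'), so one intermediate non-inform
    contribution may be skipped."""
    if (state, contribution) in COR:
        return COR[(state, contribution)]
    for (s, c), nxt in COR.items():
        if s == state and c != "inform(B,A)" and (nxt, contribution) in COR:
            return COR[(nxt, contribution)]
    raise ValueError("illegal contribution %r in state %r" % (contribution, state))

def traverse(contributions, start=1):
    """Follow a sequence of explicit contributions through the network."""
    state = start
    for contribution in contributions:
        state = advance(state, contribution)
    return state
```

Both the fully explicit path and the path with an implicit promise reach the same terminal state, mirroring the 'jump' convention of COR.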
Figure 1: The basic COR 'dialogue' schema (a recursive transition network over dialogue states 1-11; transitions: request (A,B), promise (B,A), inform (B,A), offer (B,A), accept (A,B), continue (A,B), be contented/discontented (A,B), and the corresponding withdraw and reject moves; directives: request, accept; commissives: offer, promise)
The presentation in figure 1 conveys several underlying categorizations of the dialogue contributions by graphical means (e.g. the orientation of arrows and the placement of circles and squares). The main ideas can be described as follows. The bold arrows represent two 'idealized' straightforward courses of a dialogue:

- A utters a request for information, B promises to look it up (possibly skipping the promise) and presents the information, A is satisfied and finishes the dialogue.
- B offers to provide some information (anticipating an information need of A), A accepts the offer (or part of it), B provides the information, A finishes the dialogue.

However, such simple courses of action are rare in more problematic information-seeking contexts. Participants often depart from such a straight course, and such departures are perceived by them as quite natural. Dialogues are highly structured and normally contain many corrections, retractions, embedded clarifying sequences, etc. In our work we aim to overcome reductionist conceptions of dialogues as iterations of question-answer pairs, which seem to form the basis of most classical interfaces to information systems. Instead, a dialogue is interpreted as a complex net of mutually related commitments and expectations of the two participants. In this context we use the notion of role expectation/assignment, which has also been applied in discourse theories in computational linguistics, e.g. by Halliday (1984), who discusses this topic comprehensively.

The transitions in figure 1 are not interpreted as simple, atomic dialogue acts, but as possibly complex dialogue contributions. We inserted several transitions for withdrawing or rejecting a contribution. In every dialogue state, A and B may either return to the initial state, thus preparing a new dialogue cycle, or quit the current dialogue (reaching a terminal state).
The schemas (sub-networks) for the dialogue contributions are more interesting in our current context than the basic schema. Here we can better bring out the specific characteristics of COR and explain some aspects of our approach of combining dialogue modeling with text organization features from Rhetorical Structure Theory.

Figure 2 displays the schema for 'inform' contributions, i.e. the inform transition in the basic dialogue network. A more general term is 'assert', indicating an assertion or a statement, but in our context of information-seeking dialogues we tend to use the more specific term for this transition. The inform contribution is the only one of this type of network (all other contributions can be represented like the network in fig. 3). Here A starts with an atomic inform (atomic acts are expressed by the notation A: ...). This inform act (its locution) could be a quite long monologue, a text, or a graphical presentation comprising several propositions. In that case it would have a semantic sub-structure and consist of several elements, like sentences, but the illocutionary point would not change, i.e. no new commitments or expectations are expressed or imposed on the addressee.

Figure 2: Schema of an 'inform' ('assert') contribution (states a, b, c; the atomic act 'A: inform/assert' leads from a to b, followed either by a jump to c or by an inserted dialogue (B,A, solicit context information))

The atomic inform act can either lead directly to the final state of the contribution (a jump, which at the same time completes the inform transition in the basic dialogue network), or B may decide to initiate a sub-dialogue to solicit more context information about A's inform act, e.g. by asking a question related to it. This sub-dialogue transition is itself a traversal of a basic dialogue network.

The network in figure 3 is more complex. All dialogue contributions (request, offer, reject, withdraw, etc. – except inform/assert) follow this schema. When A intends, for instance, to make a request, she can follow one of two possible paths. On the first path A formulates the request and either 'jumps' to the final state, or appends an 'assert' (network of fig. 2), voluntarily supplying some context information related to the request. If this context information is not given by A, B might decide to start a sub-dialogue to solicit such context information, e.g. asking for details or the background of the request.

Figure 3: Schema of a 'request' contribution (states a, b, b', c; on one path the atomic act 'A: request' is followed by an optional assert (A,B, supply context information) or an inserted dialogue (B,A, solicit context information); on the other path the assert precedes the – possibly jumped – request, with a possible inserted dialogue (B,A, identify request))

The other path is similar, but with a reversed order. After an assert by A supplying some context information about the intended request (or a jump), she can either formulate the request explicitly or skip it (jump). The latter creates the situation of an indirect request, which is reminiscent of the term "indirect speech act" coined in Speech Act Theory (cf. Searle 1975). A simple example: A: "I don't know much about this RACE funding program." Even if A does not then ask directly "What is it about?", B might infer that A has expressed an indirect request by the first statement.

We distinguish between two main types of components within these dialogue contribution networks (cf. in detail Sitter & Stein 1992). The first expresses the function (illocutionary point) of the whole dialogue contribution; this is normally an atomic dialogue act, e.g. the request or inform act in our example. The second serves for exchanging additional context information, which is either supplied voluntarily (assert contributions) or requested in a sub-dialogue. Using the terminology of Rhetorical Structure Theory (RST, cf. Mann & Thompson 1987), we call a component of the first type a "nucleus" and a component of the second type a "satellite". Nucleus and satellite are related to one another by rhetorical or semantic relations. The set of relations described by RST was developed for the analysis of written texts, i.e. monologues, and has been extended in other approaches in the field of computational linguistics (overview in: Maier & Hovy 1993). However, there are some recent attempts to combine research done in the text-linguistic field with dialogue modeling approaches (cf. Maier & Sitter 1992 a, b; a similar attempt was made independently by Fawcett & Davis 1992). Maier and Sitter, for instance, extended the set of necessary relations, especially the "interpersonal relations", for the dialogue situation and combined them with the COR approach.
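The nucleus/satellite decomposition of a contribution, including the case of an indirect request where the nucleus is jumped, might be sketched as follows (illustrative Python; the names Contribution and the relation label "background" are our own, not part of COR or RST):

```python
from dataclasses import dataclass, field

@dataclass
class Contribution:
    """A dialogue contribution: an atomic nucleus act plus optional
    satellites carrying context information, linked to the nucleus by
    RST-style (relation, content) pairs."""
    nucleus: str
    satellites: list = field(default_factory=list)

def indirect_request(context_assert):
    """An 'indirect request': the explicit request (the nucleus) is
    jumped, and only the context-supplying assert (a satellite) is
    actually uttered (cf. Searle 1975)."""
    return Contribution(nucleus="A: request (implicit)",
                        satellites=[("background", context_assert)])

c = indirect_request("I don't know much about this RACE funding program.")
```

Here the hearer must recover the jumped nucleus from the satellite, which is exactly the inference step described for the RACE example above.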
This proved to be very useful in our context of human-computer retrieval dialogues, because their specifications can be directly integrated into our application described in section 4.

3.2 Modeling Information-Seeking Plans

The COR network covers the illocutionary structure of dialogues, but does not supply means for a specification on the thematic or topical level. Since the thematic level governs the selection of the contents communicated in the dialogue acts, it plays an essential role in dialogue planning. If we want the system to engage in a meaningful cooperative interaction with the user, we have to address this question by supplying a prescriptive addition to the – thus far descriptive – dialogue model.

We perceive dialogues as instantiations of more abstract entities, each of which represents a class of concrete dialogues. This is similar to approaches to the generation of multimodal utterances using plan operators (e.g. Maybury 1991, Rist & André 1992); however, the abstract representations of dialogue classes are more complex. The classes comprise dialogues that show a similar basic pattern. Usually these patterns are closely related to certain strategies which the user pursues during her interaction. In the case of information systems, Belkin, Marchetti & Cool (1993)
suggested a classification of dialogues with respect to information-seeking strategies. Based on this approach, we developed a set of typical dialogue plans or "scripts" (cf. Schank & Abelson 1977) for a given domain and task (cf. Belkin, Cool, Stein & Thiel 1993). A collection of such dialogue plans is the basis for selecting an appropriate plan for a given information need. Once a plan has been chosen, it provides suggestions to the user on how to continue in a given dialogue situation, and it specifies the cooperative system reactions.

On the implementation level, we represent dialogues as sequences of dialogue steps. The internal structure of a dialogue step is given by two parameters: the perspective of the step, and its implementation. The perspective determines the topical spectrum that can be addressed in this step without destroying the thematic coherence of the dialogue as a whole. Similar notions have been proposed by McCoy (1986) in the area of natural language interfaces and by Reichman (1986), who takes a discourse analysis approach to multimodal dialogues. The second component of a step describes the possible and actual ways to implement the corresponding dialogue step. A step may be implemented by a single dialogue act; in this case the variety is provided by the different forms the utterance may take. For instance, the presentation of a certain set of data may take the form of a list of the data records, a table, or a graphical presentation. However, the step may also be performed by a certain sequence of dialogue acts, which then forms a sub-dialogue that may – in accordance with the COR model – replace the single act. Thus, we have a means of prescribing a certain act as appropriate in the given situation, while at the same time allowing the user to perform it in a way she prefers, e.g. by requesting context or help information.
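A dialogue step with its two parameters could be sketched like this (hypothetical Python; the names DialogueStep, perspective, and implementations are illustrative, not those of the MERIT implementation):

```python
from dataclasses import dataclass, field

@dataclass
class DialogueStep:
    """One step of a dialogue plan: a topical perspective plus the
    admissible ways to realize the step."""
    perspective: str                  # topical spectrum, e.g. "present results"
    implementations: list = field(default_factory=list)

    def realize(self, choice):
        """Pick one admissible realization of the step; anything else
        would break the thematic coherence the perspective guards."""
        if choice not in self.implementations:
            raise ValueError("%r breaks the step's perspective" % (choice,))
        return (self.perspective, choice)

step = DialogueStep("present retrieved projects",
                    ["list of records", "table", "graphical presentation"])
```

The same step can thus be realized as a list, a table, or a graph, while the perspective constrains what may be addressed at all.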
If we assume that information-seeking dialogues can be classified according to parameters like information need, an obvious consequence is that some members of a class will be 'better' than others with respect to certain criteria. Such a criterion may be derived from a specific user's preference, or from a formal measure, like the number of exchanges. However, going beyond purely subjective or more or less reductionist criteria, we think that this is a matter of experience: the selection of a certain dialogue plan should be backed by the experience gathered in previous dialogues, with the same user or with others. An approach to problem solving based on past experiences is pursued in the area of case-based reasoning (CBR). In our experimental work, we adapted the ideas of CBR to the requirements of a user-guidance component (Tißen 1991, 1993; Stein, Thiel & Tißen 1992).
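The selection of a case by similarity to the information need might, in its simplest form, look as follows (an illustrative Python sketch; the feature-overlap measure and the case names are invented for the example and are much cruder than the CBR techniques actually used in the user-guidance component):

```python
def score(case_features, need_features):
    """Similarity as plain feature overlap; a stand-in for a real CBR
    similarity metric."""
    return len(set(case_features) & set(need_features))

def select_case(case_library, need_features):
    """Pick the stored case whose features best match the information need."""
    return max(case_library, key=lambda case: score(case["features"], need_features))

# A toy case library with invented case names and features.
library = [
    {"name": "browse-by-programme", "features": {"programme", "overview"}},
    {"name": "project-lookup",      "features": {"project", "partner", "detail"}},
]
best = select_case(library, {"project", "detail"})
```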
4 An Application – the MERIT System

MERIT (Multimedia Extensions to Retrieval Interaction Tools) is a prototypical knowledge-based user interface to a relational database on European research projects and their funding programs, developed at GMD-IPSI (cf. Stein, Thiel & Tißen 1992). The database contains textual and factual information (a subset of the CORDIS databases that are offered
online by the ECHO host) as well as additional interactive maps, scanned-in pictures, and audio files. The system features form-based query formulation (various form sheets for different 'perspectives' on the data) and the visualization of retrieval results in different situation-dependent graphical presentation forms. One major system component is a Case-Based Dialogue Manager (CADI, cf. Tißen 1991), which controls the retrieval dialogue and provides a library of "cases" stored after previous sessions with MERIT. The cases are used to guide the user through the current session, proposing a thematic progression by suggesting a specific order of query and result steps, each focusing on a specific perspective.

The case-based approach taken in MERIT will be combined with a multi-dimensional concept of information-seeking strategies (ISSs). In an enhanced MERIT version, cases are seen as instantiations of dialogue plans or scripts as developed by Belkin et al. (1993). In contrast to the current MERIT version, these abstract dialogue plans/scripts form the basis for the definition of cases. They model not only the sequence of query and result steps, but also feedback sequences explicitly, e.g. where user and system enter sub-dialogues for negotiating the strategies or tactics of subsequent dialogue steps.

We have applied the metaphor of 'conversational roles' proposed by the COR model in the design of the MERIT interface. The COR model is used to structure the internal dialogue plans and to visualize these structures to the user. For example, discourse acts that express a new content/proposition are displayed as complex graphical objects, e.g. an 'offer act' (display of the currently available options for how to proceed in the dialogue, cf. fig. 4) or an 'inform act' (for presenting the retrieved results, cf. fig. 5). Simple (atomic) dialogue control acts are presented by special icons.

Figure 4: A system's 'offer' (the display comprises a dialogue act label, a title with meta-information, the propositional content, and icons for related possible user actions, e.g. initiating sub-dialogues)
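The composition of such an offer object from its elements might be sketched as follows (hypothetical Python; the field names and the option labels are illustrative only, not taken from MERIT):

```python
def render_offer(options, local_actions=("reject", "check back", "help")):
    """Assemble the display structure of a system 'offer': a dialogue
    act label, a meta-information text, the propositional content
    (check-boxes), and icons for local user actions."""
    return {
        "label": "MERIT offers",
        "meta": "Please select the case that suits your information need best.",
        "content": [{"checkbox": opt, "checked": False} for opt in options],
        "actions": list(local_actions),
    }

offer = render_offer(["browse-by-programme", "project-lookup"])
```

Each local action icon corresponds to a communicative act the user may perform in response to this specific contribution.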
The example in figure 4 represents an offer made at the beginning of a dialogue session. Here the user is asked to choose the case that best suits her information need. The graphical presentation of complex dialogue contributions – like offers, requests, and inform acts – is composed of several distinct elements. The label in the upper left corner identifies the contributor and the dialogue act type (e.g. "MERIT offers", "user requests", "MERIT presents"). A short, mostly elliptic text summarizes the content of the dialogue act or gives some meta-information ("Please select ...", "5 projects have been found"). The concrete proposition is displayed below (e.g., in fig. 4, check-boxes for the alternatives among which the user may choose, or textual and factual data as in the text-boxes of fig. 5). The right bar/upper right area is reserved for icons representing possible (local) user actions that refer to the system's current contribution. The user may, for instance, reject an offer, request help, or pose clarifying questions and thereby enter a sub-dialogue (check-back icon). The small text labels of the icons (reject, check back, help) indicate the type of communicative act to be performed. We did not visualize other possible user operations like "apply" and "cancel" because they do not fit into our conversational approach. Graphical operations like scrolling text or copying and pasting are of course possible, but they are not interpreted as communicative acts.

Figure 5: Presentation of retrieved results (a system's 'inform' contribution) with additional context information the user requested in this step (photograph of the project's contact person and an interactive map highlighting the location)
5 A Coherent Interaction Model of MERIT

Like most advanced graphical interfaces, MERIT offers a wide variety of options to the user (cf. fig. 6, displaying the dialogue control objects). In a given situation, the user may be able to proceed in the current case, start sub-dialogues within the current step, switch to another case, etc. Usually, such dialogue options are presented to the user as additional components of the interface. These additions are disturbing in a direct manipulation interface, since they require context-dependent method evaluation. For instance, a help button is only useful if the invoked process renders a help message which really refers to the user's current problem. As a consequence, the invoked function must be supplied with situational parameters. Even if it is possible to define useful parameters based on careful task modeling, the user's mental model of the direct manipulative interface becomes much more complex, since she has to take the situational setting into account in order to understand the functioning of the help button. If we consider more complex options, e.g. undo functions, there are even problems with respect to an appropriate definition (e.g., what is the result of "undoing" an "undo"?), to say nothing of the user's mental model of such a functionality. Since we provide a variety of dialogue options in MERIT, an alternative approach was necessary. In the following, we outline how the conversational model of multimodal interaction allows us to integrate even complex meta-dialogic options into a coherent interpretation. Some of the options are special forms of conversational interactions.

Local, case- or situation-dependent dialogue options:

- help, check back: From the conversational perspective the user engages in a sub-dialogue related to the system's current dialogue contribution (cf. fig. 4). Thus, the parameter setting of the meta-function is determined in a natural way.
- reject offer, withdraw ..., continue: The meaning of these icons is intuitively grasped. They correspond to transitions in the basic COR schema. For instance, 'continue' leads back to the initial dialogue state, whereupon the user either formulates a new request (a query), or the system comes up with a new offer and/or information (presentation of data).
- change content/presentation: Here the user can enter a sub-dialogue related to the system's inform act and request a paraphrase and/or context information (cf. fig. 2). The options available in a given situation (dialogue state) are displayed in the generated pull-down menu. For instance, the user may choose to get more detailed information about the currently presented items, to restrict the presentation to a subset of attributes, concepts, and relations, or simply to replace a given presentation form by another one (e.g. a table by a graph).
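The context-sensitive interpretation of icon clicks as dialogue acts can be sketched as a simple mapping plus a dialogue history (illustrative Python; the icon names follow figure 4 and the act labels follow the COR notation, but the code itself is our own sketch, not the MERIT implementation):

```python
# Mapping from interface icons to COR dialogue moves. Because each
# click is interpreted as a dialogue act and recorded in the history,
# its effect depends on the dialogue state, not on the widget alone.
ICON_TO_ACT = {
    "continue":            "continue(A,B)",
    "reject":              "reject offer(A,B)",
    "check back":          "sub-dialogue: solicit context information",
    "help":                "sub-dialogue: request help",
    "change presentation": "sub-dialogue: request paraphrase",
}

def interpret_click(icon, history):
    """Turn a direct manipulation into a dialogue act and record it."""
    act = ICON_TO_ACT[icon]
    history.append(act)
    return act

history = []
interpret_click("continue", history)
interpret_click("help", history)
```

Because the help click is recorded as a sub-dialogue act on the current contribution, its situational parameters come for free from the history rather than from an ad hoc parameter-passing mechanism.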
Figure 6: Example screen of MERIT with dialogue control objects (history, change case, change presentation, interposed query, CADI tools, meta-dialogue on next step, withdraw, continue)
Inserted meta-dialogues:

- meta-dialogue on next step: Before the user decides whether to continue, she may request information about the system's further strategy. MERIT generates a situation-dependent description of the next step proposed by the current case in the current dialogue phase, as well as of all subsequent steps.
- history: By clicking one of the history icons, the user starts a short meta-dialogue referring to inform or query states of previous dialogue cycles. The respective information (the query form or the presentation of retrieved data) is displayed. The user can then compare it to the current state and decide whether to return to the current state or to go back in the dialogue history.
• interposed query: At any time the user has the opportunity to insert a short retrieval dialogue. She can pose a query and inspect the retrieved data, then return to the current path/case.
• change case: The user finishes her current path, returns to the top-level dialogue (state ) and initiates a new dialogue cycle to choose a new case.
• CADI tools: Activating the CADI tools has a function similar to ‘change case’, but here the user starts a tool-oriented system for retrieving cases from the case library. Additional functionality for storing the current session as a new case and for designing modified cases is provided as well (for details cf. Tißen 1993).
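The history meta-dialogue described above can be modeled as inspection of, and rollback within, a record of previous dialogue cycles. The following is a minimal sketch under assumed design choices (class and method names are hypothetical; this is not MERIT's implementation):

```python
# Sketch of a dialogue history supporting the 'history' meta-dialogue:
# each cycle's query and data presentation are recorded, so the user can
# inspect a previous state and either stay in the current state or
# resume the dialogue at the inspected one.
class DialogueHistory:
    def __init__(self):
        self._cycles = []    # list of (query, presentation) pairs
        self._current = None # index of the active cycle

    def record(self, query, presentation):
        """Append the completed cycle and make it the current one."""
        self._cycles.append((query, presentation))
        self._current = len(self._cycles) - 1

    def inspect(self, index):
        """View a previous cycle without leaving the current state."""
        return self._cycles[index]

    def go_back(self, index):
        """Resume at a previous cycle, discarding the later ones."""
        self._cycles = self._cycles[: index + 1]
        self._current = index
```

The separation of `inspect` from `go_back` mirrors the user's choice in the text: comparing a past state to the current one is side-effect-free, while actually going back truncates the history.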
6 Conclusions
The comprehensive model of multimodal interaction outlined here is based on an extended notion of dialogue acts which can be realized by linguistic or non-linguistic means. The temporal structure of multimodal interaction is considered under local as well as global aspects. A comprehensive set of constraints in terms of a recursive transition network describes all (local) patterns of exchange which can be encountered during a cooperative dialogue. The (global) topical structure of the dialogue is defined according to a selected information-seeking strategy. The model was applied to design and implement the MERIT system (Multimedia Extensions to Retrieval Interaction Tools), and led to a reduction in the complexity of the user interface while preserving most of the useful, but sometimes confusing, dialogue options of advanced direct manipulative interfaces. The conversational approach permits dialogue features such as undo, history, and help functions to be handled in an integrative manner, whereas in other dialogue models they are treated as separate extensions.
References
Arens, Y. & Hovy, E. (1990). How to Describe What? Towards a Theory of Modality Utilization. In: CogSci ’90, Proc. of the 12th Annual Conference of the Cognitive Science Society, 487-494. Hillsdale, NJ: Erlbaum.
Belkin, N.J., Cool, C., Stein, A. & Thiel, U. (1993). Scripts for Information Seeking Strategies. Paper presented at: AAAI Spring Symposium ’93 on Case-Based Reasoning and Information Retrieval, Stanford University, CA, March 23-25.
Belkin, N.J., Marchetti, P.G. & Cool, C. (1993). BRAQUE: Design of an Interface to Support User Interaction in Information Retrieval. Information Processing & Management, Special Issue on Hypertext 29(4) (in press).
Dilley, S., Bateman, J., Thiel, U. & Tißen, A. (1992). Integrating Natural Language Components into Graphical Discourse. In: ANLP ’92, Proc. of the 3rd Conf. on Applied Natural Language Processing, 72-79.
Fawcett, R.P., van der Mije, A. & van Wissen, C. (1988). Towards a Systemic Flowchart Model for Discourse. In: Fawcett, R.P. & Young, D. (eds.): New Developments in Systemic Linguistics, Vol. 2, 116-143. London: Pinter.
Fawcett, R.P. & Davies, B. (1992). Monologue as Turn in Interactive Discourse: Towards an Integration of Exchange Structure and Rhetorical Structure Theory. In: Proc. of the 6th International Workshop on Natural Language Generation, Trento, Italy, 151-166. Berlin: Springer.
Feiner, S.K. & McKeown, K.R. (1990). Coordinating Text and Graphics in Explanation Generation. In: AAAI ’90, Proc. of the 8th National Conference on Artificial Intelligence, Vol. I, 442-449. Menlo Park: AAAI Press/The MIT Press.
Halliday, M.A.K. (1984). Language as Code and Language as Behaviour: A Systemic-Functional Interpretation of the Nature and Ontogenesis of Dialogue. In: Fawcett, R.P. et al. (eds.): The Semiotics of Culture and Language, Vol. 1, 3-35. London: Pinter.
Hutchins, E.L., Hollan, J.D. & Norman, D. (1986). Direct Manipulation Interfaces. In: Norman, D.A. & Draper, S.A. (eds.): User Centered System Design: New Perspectives on Human-Computer Interaction, 87-124. Hillsdale, NJ: Erlbaum.
Maier, E. & Hovy, E. (1993). Organising Discourse Structure Relations Using Metafunctions. In: Horacek, H. & Zock, M. (eds.): New Concepts in Natural Language Processing, 69-86. London: Pinter.
Maier, E. & Sitter, S. (1992a). An Extension of Rhetorical Structure Theory for the Treatment of Retrieval Dialogues. In: CogSci ’92, Proc. of the 14th Annual Conference of the Cognitive Science Society, Bloomington, IN, 968-973. Hillsdale, NJ: Erlbaum.
Maier, E. & Sitter, S. (1992b). Natural Language Output in Information-Seeking Dialogues. Arbeitspapiere der GMD No. 687, Sankt Augustin: GMD, October 1992.
Mann, W.C. & Thompson, S.A. (1987). Rhetorical Structure Theory: A Theory of Text Organization. In: Polanyi, L. (ed.): Discourse Structure. Norwood, NJ: Ablex.
Maybury, M. (1991). Planning Multimedia Explanations Using Communicative Acts. In: AAAI ’91, Proc. of the 9th National Conference on Artificial Intelligence, Anaheim, CA, 61-66. Menlo Park: AAAI Press/The MIT Press.
McCoy, K.F. (1986). The ROMPER System: Responding to Object-Related Misconceptions Using Perspective. In: ACL ’86, Proc. of the 24th Annual Meeting of the Association for Computational Linguistics, New York.
Oei, S., Smit, R., Schreinemakers, J., Marinos, L. & Sirks, J. (1992). The Presentation Manager, A Method for Task-Driven Concept Presentation. In: Neumann, B. (ed.): ECAI ’92, Proc. of the European Conference on Artificial Intelligence, 774-775. Chichester: John Wiley.
Reichman, R. (1986). Communication Paradigms for a Window System. In: Norman, D.A. & Draper, S.A. (eds.): User Centered System Design: New Perspectives on Human-Computer Interaction, 285-313. Hillsdale, NJ: Erlbaum.
Reichman, R. (1989). Integrated Interfaces Based on a Theory of Context and Goal Tracking. In: Taylor, M.M., Neel, F. & Bouwhuis, D.G. (eds.): The Structure of Multimodal Dialogue, 209-228. Amsterdam: North-Holland.
Rist, T. & André, E. (1992). From Presentation Tasks to Pictures: Towards a Computational Approach to Graphics Design. In: Neumann, B. (ed.): ECAI ’92, Proc. of the European Conference on Artificial Intelligence, 765-768. Chichester: John Wiley.
Schank, R. & Abelson, R. (1977). Scripts, Plans, Goals and Understanding. Hillsdale, NJ: Erlbaum.
Searle, J.R. (1975). Indirect Speech Acts. In: Davidson, D. & Harman, G. (eds.): The Logic of Grammar, 59-82. Encino, CA: Dickinson Publishing Co.
Searle, J.R. & Vanderveken, D. (1985). Foundations of Illocutionary Logic. Cambridge, GB: Cambridge University Press.
Sitter, S. & Stein, A. (1992). Modeling the Illocutionary Aspects of Information-Seeking Dialogues. Information Processing & Management 28(2):165-180.
Stein, A., Thiel, U. & Tißen, A. (1992). Knowledge-Based Control of Visual Dialogues in Information Systems. In: Catarci, T., Costabile, M.F. & Levialdi, S. (eds.): AVI ’92, Proc. of the 1st International Workshop on Advanced Visual Interfaces, Rome, Italy, 138-155. Singapore: World Scientific Press.
Tißen, A. (1991). A Case-Based Architecture for a Dialogue Manager for Information-Seeking Processes. In: Bookstein, A. et al. (eds.): SIGIR ’91, Proc. of the 14th Annual International Conference on Research and Development in Information Retrieval, Chicago, 152-161. New York: ACM Press.
Tißen, A. (1993). Knowledge Bases for User Guidance in Information Seeking Dialogues. In: Wayne, D.G. et al. (eds.): IWIUI ’93, Proc. of the 1993 International Workshop on Intelligent User Interfaces, Orlando, FL, 149-156. New York: ACM Press.
Winograd, T. (1988). Where the Action Is. BYTE, Dec. 1988.
Winograd, T. & Flores, F. (1986). Understanding Computers and Cognition. Norwood, NJ: Ablex.