Rapid Multimodal Dialogue Design: Application in a Multimodal Meeting Retrieval and Browsing System

Miroslav Melichar1, Agnes Lisowska2, Susan Armstrong2, and Martin Rajman1

1 LIA/CGC, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland, {miroslav.melichar|martin.rajman}@epfl.ch
2 ISSCO/TIM/ETI, University of Geneva, Switzerland, {agnes.lisowska|susan.armstrong}@issco.unige.ch

Abstract. In this paper we present an extension of the EPFL’s Rapid Dialogue Prototyping Methodology to include multimodality, and show how it can be applied in the design of a multimodal application, the Archivus system. We begin with an overview of the standard speech-only rapid dialogue prototyping methodology, followed by a discussion of the extensions implemented to include multimodal capabilities. We then show how the methodology is applied in the design of Archivus, and provide an example of a multimodal interaction with the system.

1 Introduction

Computer applications involving the use of spoken language technology are increasingly making their mark in mainstream markets. One particular challenge is the development of interactive spoken language applications, in which the computer dialogues with a user. Such applications are difficult to build primarily for two reasons. First, developing a dialogue system from scratch is a time-consuming process. Second, while research efforts in both speech technology and natural language processing are improving, there remain issues of robustness that pose difficulties for launching full-scale dialogue systems beyond a restricted domain. In this paper, we propose a methodology that aims to find at least partial solutions to these two problems. We introduce a platform and architecture that allows for rapid development of flexible dialogue systems and extend this architecture to allow for multimodal input. As Oviatt points out in [1, 2], multimodal input can be helpful in resolving robustness issues, particularly when the multimodal input involves speech. This paper begins in Sect. 2 with an overview of the standard speech-only rapid dialogue prototyping methodology. Sect. 3 describes the extension of this methodology with multimodal capabilities. Sect. 4 discusses its application in the design of a specific multimodal system, Archivus, used to browse and search through a database of recorded multimodal meetings. Sect. 5 describes an example of an interaction with the Archivus system, both from the user and system perspectives, while Sect. 6 concludes.

2 Rapid Dialogue Prototyping Methodology: an overview

The general purpose of the Rapid Dialogue Prototyping Methodology (RDPM) is to provide a generic platform that can easily be tailored to a particular domain. The platform is composed of the task model, Generic Dialogue Nodes (the building blocks of the methodology) and the dialogue strategies. This section provides an overview of its main elements; a full description can be found in [3].

2.1 Task Model

The starting point of the RDPM is a generic application model for interactive user-machine interfaces that can be tailored to a range of applications. Its general purpose is to allow the user to complete a task by selecting, through interaction with the system, the targets that best correspond to their needs from a potentially large set. The model, though quite simple, covers important classes of interactive systems such as information retrieval systems (the system helps the user retrieve the desired data) and interactive control systems such as Smart Home control systems (the system allows the user to select the relevant actions to be performed). Within this framework, the assumption is that the available targets can be individually described by sets of specific attribute-value pairs. The goal of the interactive dialogue-based interface is to guide the user in expressing the relevant search criteria (i.e. the correct attribute-value pairs) leading to the fulfilment of the goal. In the simple but frequent case where all the targets can be homogeneously described by a unique set of attributes, the task model simply corresponds to the schema of a relational database, the entries of which are the individual targets. An example of such a situation is given in Fig. 1, where the entries of the database correspond to annotated transcriptions of the dialogues recorded during a meeting. These types of targets are used in the Archivus prototype that is described in more detail later in this paper.

Date               Person            Topic      DialogAct  Transcription        ...
Year  Month  Day   FirstN  FamilyN
...
2004  April    7   David   P.        sport      request    “Buy a new bike.”
2004  April   30   Susan   A.        furniture  question   “What colour?”
...

Fig. 1. An example of dialogue targets
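As an illustration only (not the actual Archivus data structures), the following minimal Java sketch shows how targets described by attribute-value pairs can be matched against the search criteria accumulated during a dialogue; the Target record and attribute names are modelled loosely on Fig. 1.

```java
import java.util.*;

// Illustrative sketch: a target is a flat set of attribute-value pairs,
// mirroring one row of the database schema in Fig. 1.
public class TaskModelSketch {

    record Target(Map<String, String> attributes) {
        boolean matches(Map<String, String> criteria) {
            // A target satisfies a query if it agrees on every specified attribute.
            return criteria.entrySet().stream()
                    .allMatch(c -> c.getValue().equals(attributes.get(c.getKey())));
        }
    }

    public static void main(String[] args) {
        List<Target> targets = List.of(
                new Target(Map.of("year", "2004", "month", "April", "day", "7",
                        "person", "David P.", "topic", "sport",
                        "dialogAct", "request", "transcription", "Buy a new bike.")),
                new Target(Map.of("year", "2004", "month", "April", "day", "30",
                        "person", "Susan A.", "topic", "furniture",
                        "dialogAct", "question", "transcription", "What colour?")));

        // Search criteria accumulated during the dialogue (attribute-value pairs).
        Map<String, String> criteria = Map.of("person", "Susan A.", "month", "April");

        targets.stream()
                .filter(t -> t.matches(criteria))
                .forEach(t -> System.out.println(t.attributes().get("transcription")));
    }
}
```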

2.2 Dialogue Model

The dialogue model of the RDPM defines the types of interactions that are possible between the system and the user. It is composed of two parts: the application-dependent specification of what we call Generic Dialogue Nodes (GDNs) and the application-independent local and global dialogue flow management strategies.

Generic Dialogue Nodes. Each attribute in the task model is associated with a particular GDN. The role of this GDN is to perform simple interactions with the user to obtain a valid value for that attribute. Three types of GDNs are available according to the nature of the values being sought: simple and list processing GDNs (to elicit a specific value for an attribute so that the system can respond to a user query) and internal GDNs (to guide the dialogue strategy).

1. Simple GDNs allow a user to directly specify a value for the associated attribute. These are useful when the number of possible choices for the value is small.
2. List Processing GDNs allow the user to browse through a list of values and select one by number (the position of the value in the list). This type is particularly useful when an attribute has a large number of possible values or when the values are linguistically complex. Reducing the interaction vocabulary to numerals adds robustness to the speech recognition component, i.e. it yields a higher recognition rate.
3. Internal GDNs are special GDNs that are not associated with any attribute in the task model, but are invoked by the dialogue manager in specific dialogue situations, such as at the beginning of the dialogue or when the user has over-specified a request.

Generic Dialogue Nodes define each particular action that is possible in the interactive system and are thus application-dependent, i.e. they must be defined by the designer for each particular application. Consequently, the EPFL dialogue platform provides guidelines as to how GDNs should be defined, but not about their content.

Dialogue Strategies. The dialogue strategy serves as the decision-making mechanism of the dialogue manager by providing rules about the progression of the dialogue. The RDPM dialogue-management approach handles dialogue strategies at two levels: local and global. Local dialogue flow management strategies handle problems related to the processing of the currently selected attribute. The goal here is to guide the user in finding an appropriate value for the attribute. Consequently, control within local strategies remains at the GDN level, since the corresponding dialogue management capabilities are part of the GDN itself. Local strategies play a key role in processing explicit requests for help, in cases where no input has been provided by the user and in re-establishing the dialogue context.

The global dialogue manager controls the higher level flow of the dialogue. This includes determining and controlling which GDNs need to be treated to accomplish the goals, activating those GDNs, processing newly populated attributes and selecting the next most relevant GDN. Additionally, the global management strategies treat issues such as resolving incoherencies in input, restarting the dialogue on user request, handling dead-ends in the dialogue, or ending the dialogue altogether.
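The division of labour between the two levels can be illustrated with a simplified sketch: each GDN runs its own local interaction for a single attribute, while a global loop selects the next GDN whose attribute is still unfilled. The interface and class names below are hypothetical and do not correspond to the actual EPFL platform API.

```java
import java.util.*;

// Hypothetical sketch of the two-level dialogue strategy.
interface GDN {
    String attribute();                       // task-model attribute this node fills
    Optional<String> interact(Scanner in);    // local strategy: prompt, validate, re-prompt
}

class SimpleGDN implements GDN {
    private final String attribute;
    private final Set<String> allowedValues;

    SimpleGDN(String attribute, Set<String> allowedValues) {
        this.attribute = attribute;
        this.allowedValues = allowedValues;
    }

    public String attribute() { return attribute; }

    public Optional<String> interact(Scanner in) {
        System.out.println("Please specify the " + attribute + " " + allowedValues);
        String answer = in.nextLine().trim();
        // Local strategy: only a valid value leaves the node; otherwise control
        // stays here and the global manager is not involved.
        return allowedValues.contains(answer) ? Optional.of(answer) : Optional.empty();
    }
}

public class GlobalDialogueManager {
    public static void main(String[] args) {
        List<GDN> nodes = List.of(
                new SimpleGDN("month", Set.of("April", "May")),
                new SimpleGDN("person", Set.of("Susan", "David")));
        Map<String, String> filled = new HashMap<>();
        Scanner in = new Scanner(System.in);

        // Global strategy: pick the next GDN whose attribute is still unset.
        while (filled.size() < nodes.size()) {
            GDN next = nodes.stream()
                    .filter(g -> !filled.containsKey(g.attribute()))
                    .findFirst().orElseThrow();
            next.interact(in).ifPresent(v -> filled.put(next.attribute(), v));
        }
        System.out.println("Collected criteria: " + filled);
    }
}
```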

3 Extending the RDPM to multimodal applications

Work in [4] laid the foundations for extending the RDPM to multimodal applications. This section describes in greater detail the specific issues and challenges faced in the implementation of these extensions and the design principles adopted.

3.1 Multimodal GDNs

Multimodal GDNs (mGDNs) are GDNs as described in Sect. 2.2, extended with additional elements for multimodal interaction. These additional (or modified) elements include: grammars for written and spoken natural language input; a set of multimodal prompts to guide the user; information about the graphical representation of the mGDN; and a definition of the role of each GUI element. Additionally, multimodal fusion (the combination of inputs coming from different channels into a single abstract message) and fission (the process of communicating a single abstract message to the user using any of the available output channels) are handled at the level of individual mGDNs rather than at the global level. Dealing with the fusion and fission of modalities in a multimodal interface is an open research question. In the multimodal RDPM (mRDPM), following the logic of GDNs, mGDNs impose a constrained interaction and define a relatively narrow interpretation context for the input and output, thus making it possible to resolve fusion/fission problems. All of the above elements must be specified in a declarative specification language so that the dialogue model and the associated multimodal dialogue-driven interface can be automatically derived from the task model.
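For illustration only, the elements of one mGDN declaration can be sketched as a plain Java record (the concrete declarative specification language is not reproduced here); the field names and grammar file paths are invented for the example.

```java
import java.util.Map;

// Purely illustrative: one mGDN declaration as a plain Java record, covering
// the elements listed above. The actual platform uses a declarative
// specification language rather than Java code.
record MGDNSpec(String attribute,
                String spokenGrammar,            // grammar for spoken NL input
                String typedGrammar,             // grammar for written NL input
                Map<String, String> prompts,     // multimodal prompts (spoken / displayed)
                String graphicalWidget,          // how the mGDN is rendered on screen
                Map<String, String> guiElementRoles) {  // role of each GUI element

    public static void main(String[] args) {
        MGDNSpec topic = new MGDNSpec(
                "topic",
                "grammars/topic_spoken.grm",
                "grammars/topic_typed.grm",
                Map.of("spoken", "What is the topic you are interested in?",
                       "displayed", "Select a topic from the list."),
                "scrollable-list",
                Map.of("list-item", "selects the value at that position",
                       "scroll-arrow", "moves the visible window of values"));
        System.out.println(topic);
    }
}
```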

3.2 Graphical Interface Elements

Extending the RDPM to allow for multiple modalities implies the inclusion of a graphical interface that the user can manipulate directly by keyboard, mouse or voice. This interface, no matter what the application, should include six basic elements.

1. A representation of the attributes whose values need to be defined (these are determined on a per-application basis).
2. A display area for the attribute and possible values currently under consideration.
3. A record of the attribute-value pairs that have already been specified by the user. Ideally, this area should be interactive, allowing users to easily browse and delete values they have entered.
4. System control options, e.g. access to help, submission of a new query, or exiting altogether.
5. A display area for system prompts. Following the tenet of multimodal input and output, system output, such as prompts, should be presented in different modalities.
6. In the case of keyboard-based input, an area to visualize what the user has typed.

An instantiation of all of these elements in the framework of a particular application can be found in Sect. 4.4.

3.3 Design Principles

All of the above elements lead to the following four design principles elaborated for the extension of the RDPM to multimodal applications.

1. mGDNs are the building blocks of the multimodal interface.
2. mGDNs represent the only interaction channel with the system. The underlying design implication is that every active piece of the graphical interface must be connected to an mGDN.
3. Every mGDN is fully multimodal. Thus the user can interact with an mGDN using any combination of the modalities available in the application. This ensures that users can interact with the system in the way that is most convenient and comfortable for them, and allows them to switch modalities if they notice that a particular modality is not having the desired effect, which improves the robustness of the system itself.
4. Only one mGDN is in focus at any given time. This principle ensures the feasibility of the overall objectives of the system by constraining, in a logical fashion, the possible interactions with the system at any given point in time. The mGDN in focus should be clearly marked. However, though only one mGDN is in focus, other mGDNs remain active, giving the user the opportunity to access them, though somewhat indirectly, if they so choose.

3.4 Architecture

Multimodal dialogue systems are in general complex pieces of software composed of a large number of modules. Furthermore, the types of these modules are not predefined and there are no guidelines fixing the responsibilities of each module, which may vary across applications. These factors imply the need for a flexible architecture that allows for simple modification or addition of new modules. Ideally, the architecture is packaged with modules ready for immediate use, such as modules for speech recognition, speech synthesis and dialogue management, and with an environment that supports all the phases of the development cycle. This allows researchers to focus on the functionalities of the system rather than the implementation details of the various components. Existing frameworks that might satisfy these criteria, such as Galaxy-II [5] and the Open Agent Architecture [6], are, in our view, too general and do not impose a predefined communication paradigm on the dialogue system. Moreover, the distributed nature of these systems makes debugging difficult. To resolve these issues we have chosen to implement the mRDPM as a set of simple Java classes. Communication between modules is performed by calling their methods through well-defined interfaces. The configuration of the system is defined in an application configuration file. When the system is launched, a special module called the Application Loader creates instances of all of the other modules, interconnects them as specified in the application configuration file and initializes them. This approach provides a highly flexible and easy-to-configure system, allowing for simple module development, while still leaving the possibility to include some distributed processing. The interaction of the modules in the proposed architecture is shown in Fig. 2.
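The following minimal sketch illustrates the Application Loader idea with a hypothetical Module interface and dummy module classes; the configuration is hard-coded where the real system reads the application configuration file.

```java
import java.util.*;

// Illustrative sketch of the Application Loader idea: modules are plain Java
// objects implementing a common interface, instantiated by name from a
// configuration, wired together, then initialized. Names are hypothetical.
interface Module {
    void connect(Map<String, Module> others);  // receive references to peer modules
    void init();                               // called once all modules are wired
}

class DummyAsr implements Module {
    public void connect(Map<String, Module> others) { /* keep needed references */ }
    public void init() { System.out.println("ASR ready"); }
}

class DummyDialogueManager implements Module {
    public void connect(Map<String, Module> others) { /* e.g. others.get("asr") */ }
    public void init() { System.out.println("Dialogue manager ready"); }
}

public class ApplicationLoader {
    public static void main(String[] args) throws Exception {
        // In the real system this mapping comes from the application
        // configuration file; here it is hard-coded for brevity.
        Map<String, String> config = Map.of(
                "asr", "DummyAsr",
                "dialogueManager", "DummyDialogueManager");

        Map<String, Module> modules = new LinkedHashMap<>();
        for (var entry : config.entrySet()) {
            Module m = (Module) Class.forName(entry.getValue())
                                     .getDeclaredConstructor().newInstance();
            modules.put(entry.getKey(), m);
        }
        modules.values().forEach(m -> m.connect(modules));  // interconnect
        modules.values().forEach(Module::init);             // initialize
    }
}
```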

Fig. 2. mRDPM architecture: components and main data flow

The Interaction Manager controls two groups of modules: the input and output modules. The role of the Fusion Manager is to combine the attributes and values coming from the different input sources. The Text Input Field module translates the user's typed input into concepts via the Natural Language Understanding module, and the same is done for text produced by the Automatic Speech Recognition (ASR) module. Possible values for the mGDN in focus are displayed in the GDN Pointing Zone, and mouse clicks on it are translated into concepts by the Pointing Understanding module. The graphical History and Criteria Selection mGDN modules work in a similar fashion, except that each of them displays only one GDN. Concepts resulting from the fusion process are supervised by the Wizard in the Concept Level Supervision module and are then sent to the Interaction Manager. The Interaction Manager processes the concepts and selects the next GDN to be in focus. The dialogue state information is then updated, the Solution Space Visualization is modified and the multimodal output is issued by the Interaction Manager. The output is sent by the Fission Manager to the System Prompt Visualization module, which displays it on the screen and sends it to the Text-to-Speech module, which gives vocal feedback to the user.
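The concept-level data flow described above can be summarized in a simplified sketch: each input channel produces attribute-value pairs (concepts), and a fusion step merges them into a single message for the Interaction Manager. The class and method names are illustrative assumptions rather than the actual module interfaces.

```java
import java.util.*;

// Illustrative sketch of multimodal fusion at the concept level: each input
// channel produces attribute-value pairs, which are merged into one message
// for the Interaction Manager. Later channels override earlier ones for the
// same attribute. All names are hypothetical, not the actual module APIs.
public class FusionSketch {

    record Concept(String attribute, String value) {}

    static List<Concept> fromSpeech(String utterance) {
        // Stand-in for ASR + Natural Language Understanding.
        return utterance.contains("Susan")
                ? List.of(new Concept("participant", "Susan"))
                : List.of();
    }

    static List<Concept> fromPointing(String clickedZone) {
        // Stand-in for the Pointing Understanding module.
        return clickedZone.equals("topicButton")
                ? List.of(new Concept("changeFocus", "Topic"))
                : List.of();
    }

    @SafeVarargs
    static Map<String, String> fuse(List<Concept>... channels) {
        Map<String, String> fused = new LinkedHashMap<>();
        for (List<Concept> channel : channels)
            for (Concept c : channel)
                fused.put(c.attribute(), c.value());   // last channel wins on conflict
        return fused;
    }

    public static void main(String[] args) {
        Map<String, String> message =
                fuse(fromSpeech("Show me the meetings which Susan attended"),
                     fromPointing("topicButton"));
        System.out.println(message);  // {participant=Susan, changeFocus=Topic}
    }
}
```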

4 The mRDPM in practice

The Archivus system, described in [4], is the result of a first attempt to apply the mRDPM in a software design process. The goal of the Archivus system is to provide users with an intuitive, multimodal way to access recorded meetings. The meetings are stored in several formats in a database, including the audio and video from the meeting, a text transcript, and annotations on the transcript including topics, dialogue acts, structure of the meeting, and events. Additionally, all documents associated with the meeting, such as the agenda and all slides and paper artifacts, are stored in electronic form. Archivus allows the user to either browse the meetings or directly search for particular sections of a meeting. The Archivus system is being developed within the standard software design cycle. This allows the developers to find a balance between fulfilling the user requirements for system functionality, based on preliminary user requirements studies [7], and the constraints imposed by making Archivus a multimodal dialogue system within the mRDPM. The issues discussed in the following sections result from trying to establish this balance and will be investigated in more detail in the future through Wizard of Oz experiments with users.

4.1 Levels of interaction

As pointed out in [4], the task that Archivus is trying to handle is relatively new, which makes it difficult to predict in advance how users will react to and interact with the interface. Consequently, the system needs to be developed to handle all three possible interaction paradigms: system driven, user driven and mixed initiative. In a system driven interface, user actions are strictly determined by the system itself. In the case of a dialogue system, the user can either answer the question or fulfil the demand of the system; no other actions are possible. While these types of systems can be useful for novice users to familiarize themselves with the capabilities of the system, more experienced users find this type of interaction too limiting and time consuming. At the other extreme, in a user driven interface, it is entirely up to the user to define all interactions, and the system simply processes the input. Such systems are useful for expert users of the application or the domain, but are extremely difficult to use for less experienced or novice users. In between the two lies the mixed initiative system, in which the user and the system have a more symbiotic relationship. The system guides the user, but in a less constrained manner, giving the user the opportunity to initiate actions not necessarily suggested by the system. We believe that for most users this latter type of interaction will be the most common, but this will only become apparent when experimentation with the system begins.

4.2 Modalities used

The choice of modalities for the Archivus system was determined primarily by practical considerations. The system is intended for deployment in a commercial or academic institution, where the software and hardware are expected to be standard and commercially available rather than high-end or laboratory prototypes. The input modalities are natural language, entered through text or voice, and direct manipulation using a mouse or a touchscreen. The output modalities are audio, text and graphics, with the addition of video for showing the meetings. At an experimental level, we will also include the passive modality of facial emotion recognition, which will be used to help steer the decision making of the dialogue manager based on the apparent attitude of the user toward the system.

4.3 Additional Design Principles

In addition to the mRDPM design principles described in Sect. 3.3, several others need to be taken into consideration for the Archivus application.

1. As Archivus is in essence a database access and search application, it is useful for the user to be aware of the effect of their search criteria on the search space (the database content). This provides immediate feedback on the effectiveness of the search criteria and a permanent window on the selected subset of the database.
2. Available modalities should be indicated by visual cues. In principle, all modalities are available to the user all the time, but the user should also be able to disable any modality. For tactile modalities, such as input via the touchscreen, keyboard or mouse, it is obvious to the user when the modality is in use and when it is not. Visual cues are particularly important for the voice modality, allowing the user to enable and disable this function with a visual display of its current state.
3. There should be consistency in the way that actions are performed across modalities. There are marked differences in what can be controlled, and how, using direct manipulation devices and voice interaction. For example, the mouse offers access to a context-sensitive menu which cannot easily be replicated with a touchscreen. When developing a multimodal system such as this one, certain seemingly standard conventions, such as right-clicking, must be suppressed so that exactly the same functionality can be accessed through every modality. The challenge is determining which functionalities and actions can be carried across all of the possible modalities, and how suitable they are for the task at hand.

4.4 The user interface

The user interface of the Archivus system is based on the metaphor of a person interacting in an archive or library (see [4] for more detail). The layout is shown in Fig. 3(a), with the graphical components corresponding to the labels below - (1) view on the search space, (2) the system prompts, (3) interactive representation of the mGDN, (4) user input area, (5) constraint buttons, (6) system control buttons, (7) interaction history.

5 An example of use

In this section we show a simple example of an interaction with the Archivus system, from both the user perspective (what the user sees and says) and the system perspective (what the system does). It is important to keep in mind that the scenario presented here is only one of several possible ways to solve the problem that the user faces. Note that each place marked with * indicates that any of the possible modalities could have been used to access that component or make a selection. Additionally, the three criteria that the user had to work with could have been entered in any order, and the order in which they were entered would have changed the interaction.

In our example we will use the following scenario: a new employee of a company is asked by their manager to write a short report about what happened at a particular meeting, but the manager doesn't remember any details about the meeting except that it took place at the end of April, Susan was there and there was a discussion about red sofas. The new employee knows that all of the meetings at the company are recorded, processed and archived in the Archivus database. He sits down in front of the Archivus system with the bits of information he has. Each step is described from the user and system perspective.

i. User: Sees the initial system screen (Fig. 3(b)).

i. System: When the initial screen is displayed, a general mGDN is the current (focused) GDN, with the prompt “What do you want to do?”. Most of the other GDNs are also active, i.e. they can process the values provided by the user if necessary.

ii. User: Says* “Show me the meetings which Susan attended”. The system adds the constraint Susan to the interaction history and highlights the books (which in the Archivus metaphor represent the individual meetings) on the bookshelf to reflect which meetings satisfy the search criteria (Fig. 3(c)).

ii. System: The Archivus natural language understanding capabilities interpret the input as the user having specified Susan as a participant of the meeting (i.e. participant:Susan is sent to the dialogue manager). This information is stored as a constraint for the search. The system realizes that the search space is still too large to be presented to the user and that the attribute that could optimally reduce the search space is the date, so control is passed to the GDN that regulates date selection and the associated prompt is used. The elements on the screen are updated (bookcase, history, and the central interactive pane, which now displays the date GDN).

Fig. 3. Archivus screenshots: (a) user interface layout; (b) opening screen of Archivus; (c) result of adding the first search criterion; (d) adding a second search criterion; (e) adding a third search criterion; (f) representation of the meeting that meets all of the search criteria

iii. User: The bookcase in Fig. 3(c) shows that there are 8 meetings which Susan attended, but in the meantime the system has suggested that the user might try defining the date as a possible search criterion. However, since the user isn't sure of the exact date, he prefers to search for the topic first. The topic criterion is selected via a mouse click* and a list of all possible topics contained in the active meetings appears (Fig. 3(d)). The user scrolls* through the list, identifies number 7 as the relevant topic and says* “number seven”. The topic is entered into the interaction history and again the bookcase changes to reflect the impact of the new search criterion.

iii. System: The selection of the Topic button is translated into the attribute-value pair changeFocus:Topic and sent to the current GDN (i.e. Date). As this pair cannot be processed by the Date GDN, it is forwarded to the global dialogue manager. The global dialogue manager chooses the Topic GDN as the new current GDN; the Topic GDN issues the prompt “What is the topic you are interested in?” and waits for user input. The screen elements are updated. The user scrolling through the list on the screen is translated by the graphical component into the pair list:down. This is understood by the current GDN as changing the list of values in the area it is responsible for, and the command is immediately performed by the GDN (without passing control to the global dialogue manager). As soon as the user says “number seven”, the natural language module translates it into the pair list:7 and sends it to the current GDN, which is able to translate it into the global pair Topic:redSofa and returns control to the global dialogue manager. The new information is added and processing continues as in the previous step (updating the search space, selecting the next optimal GDN in focus, updating the screen). The search space now contains just two meetings, which only differ in the day of the month on which they took place. The system therefore suggests the day of the month as the next current GDN (GDN in focus).

iv. User: Only two books now match the specifications made by the user (Fig. 3(e)). The system suggests that the user specify the day of the month of the meetings (since the other Date parameters are the same for both meetings). Since this is the only other piece of information that the user knows, he finally goes along with the system. He sees that there are only two meetings in April 2004, one at the beginning of April and the other towards the end. He clicks* on the date at the end of April. The system again adds the criterion to the interaction history, and the bookcase changes to reflect this. However, since there is now only one book (meeting) that matches the criteria, that book is opened in the interaction pane for the user to browse (Fig. 3(f)). Since a topic was already specified, the sections of the meeting that are related to that topic are marked by tabs on the book and highlighted on the book pages to facilitate browsing (Fig. 3(f)).

iv. System: The click on the graphical component issues the pair date:30th, which is sent to the current GDN. It is recognized as a global pair and control is passed to the global dialogue manager. The new information is added and the new content of the search space is determined. The system determines that the search space is now a reasonable size and the Open Book GDN is selected as the current GDN. The screen is updated. The central zone now contains an open book with the representation of the meeting.

v. User: Since the ultimate task was to write a short report on the meeting, the user can now simply browse the meeting (Fig. 3(f)) by looking at the transcript (flipping the pages of the book using next/previous), watching the video for particular pages by activating the video icon (by clicking on it or by saying “play video”), or listening to the audio by activating the audio icon (either by clicking or by saying “play audio”).

v. System: The user interacts with the book by issuing local instructions for the Open Book GDN. No control is passed to the global dialogue manager unless the user explicitly changes to another GDN or terminates the interaction.
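The routing of local and global pairs seen in steps iii and iv can be sketched as follows; the class, the topic list and the decision logic are a simplification for illustration, not the Archivus source code.

```java
import java.util.*;

// Simplified sketch of how a current GDN distinguishes local pairs (handled in
// place) from global pairs (forwarded to the global dialogue manager), as in
// steps iii and iv above. Names and values are illustrative only.
public class PairRoutingSketch {

    record Pair(String attribute, String value) {}

    static class TopicListGDN {
        private final List<String> topics =
                List.of("budget", "colours", "design", "layout",
                        "materials", "pricing", "redSofa");
        private int windowStart = 0;

        // Returns a global pair if the input closes or escapes this GDN,
        // or an empty result if the pair was handled locally.
        Optional<Pair> handle(Pair input) {
            switch (input.attribute()) {
                case "list":
                    if (input.value().equals("down")) {     // local: scroll the list
                        windowStart = Math.min(windowStart + 1, topics.size() - 1);
                        return Optional.empty();
                    }
                    int index = Integer.parseInt(input.value()) - 1;  // "number seven"
                    return Optional.of(new Pair("Topic", topics.get(index)));
                default:
                    // Not understood locally: forward to the global dialogue manager.
                    return Optional.of(input);
            }
        }
    }

    public static void main(String[] args) {
        TopicListGDN gdn = new TopicListGDN();
        System.out.println(gdn.handle(new Pair("list", "down")));   // handled locally
        System.out.println(gdn.handle(new Pair("list", "7")));      // global pair Topic:redSofa
        System.out.println(gdn.handle(new Pair("date", "30th")));   // forwarded as-is
    }
}
```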

6 Conclusions and Future Perspectives

In this paper we have described an approach to developing multimodal dialogue systems, based on the assumption that difficulties with dialogue systems can be addressed through a combination of careful multimodal interface design and strict adherence to a set of associated interaction design principles. The full system was presented, from the overall architecture and its various components through to the way in which the components interact with the user. We have also shown how this approach has been applied to a particular application, the Archivus system, and have given an example of the interaction from both the user and system perspectives. Even the simple example used shows the complexity of the system and the multitude of possibilities that need to be accounted for in system design and by the dialogue management. As a next step, the system will be deployed and tested with users in Wizard of Oz experiments. These will permit us to fine-tune the system to meet user requirements and to observe the ways in which users interact multimodally when performing the tasks envisioned for this domain.

References

1. Oviatt, S.: Taming recognition errors with a multimodal interface. Commun. ACM 43 (2000) 45–51
2. Oviatt, S.: Multimodal interfaces. In Jacko, J.A., Sears, A., eds.: The Human-Computer Interaction Handbook. Erlbaum, Cambridge (2003) 286–304
3. Bui, T.H., Rajman, M., Melichar, M.: Rapid dialogue prototyping methodology. In: Proc. of TSD 2004, Brno, Czech Republic, Springer-Verlag (2004) 579–586
4. Lisowska, A., Rajman, M., Bui, T.H.: Archivus: A system for accessing the content of recorded multimodal meetings. In: Proc. of MLMI'04, Switzerland (2004)
5. Seneff, S., Hurley, E., Lau, R., Pao, C., Schmid, P., Zue, V.: Galaxy-II: A reference architecture for conversational system development. In: Proc. ICSLP'98 (1998)
6. Martin, D.L., Cheyer, A., Moran, D.B.: The Open Agent Architecture: A framework for building distributed software systems. Applied Artificial Intelligence 13 (1999)
7. Lisowska, A.: Multimodal interface design for the multimodal meeting domain: Preliminary indications from a query analysis study. Report, IM2.MDM-11 (2003)
