A Framework for Rapid Multimodal Application Design

Pavel Cenek (1), Miroslav Melichar (2), and Martin Rajman (2)

(1) Masaryk University, Faculty of Informatics, Laboratory of Speech and Dialogue, 60200 Brno, Czech Republic
[email protected]
(2) École Polytechnique Fédérale de Lausanne (EPFL), Artificial Intelligence Laboratory (LIA), CH-1015 Lausanne, Switzerland
{miroslav.melichar, martin.rajman}@epfl.ch

Abstract. The aim of the work described in this paper is to extend the EPFL dialogue platform with multimodal capabilities. Based on our experience with the EPFL Rapid Dialogue Prototyping Methodology (RDPM), we formulate precise design principles that provide the framework necessary for using the RDPM to rapidly create an efficient multimodal interface for a given application. We also analyze the consequences of the proposed design principles for the generic GUI and the architecture required for the system.

1 Introduction

Spoken dialogue systems can be viewed as an example of an advanced application of spoken language technology. Dialogue systems represent an interface between the user and a computer-based application that allows the user to interact with the application in a more or less natural manner. Although limited spoken communication with computers is now a reality, not only in laboratories but also in commercial products, several problems still remain to be solved. One significant problem is that no truly generic operational approach for dialogue design exists yet; each application requires the development of a specific model. To address this problem, the authors of [1, 2] proposed an efficient Rapid Dialogue Prototyping Methodology (RDPM) and implemented a software platform (hereafter referred to as the EPFL dialogue platform) that allows the concrete design of dialogue systems with a very short development cycle.

Another problem that prevents spoken dialogue systems from broader use is the limited performance and reliability of current speech recognition and natural language understanding technologies. One of the research directions foreseen to overcome these limitations is the use of multimodal dialogue systems that exploit (besides speech) other interaction channels for the communication with the user. Within this perspective, the aim of the work described in this paper is to extend the EPFL dialogue platform with multimodal capabilities.

Notice that the original RDPM has already been tested in several unimodal (voice-only) dialogue systems, including the InfoVox system, an interactive vocal system for providing information about restaurants developed in the project InfoVox: Interactive Voice Servers for Advanced Computer Telephony Applications (funded by the Swiss national CTI grant 4247.1), and the Inspire system [3–5], a dialogue system for the vocal control of domestic devices within a SmartHome environment developed in the project Inspire: INfotainment management with SPeech Interaction via REmote microphones and telephone interfaces (IST-2001-32746). The multimodal features of the new version of the RDPM are currently being tested in the Archivus system [6], which aims at providing users with an intuitive way to access recorded and annotated meeting data.

The rest of this contribution is organized as follows: Sect. 2 provides a brief overview of the standard speech-only RDPM. Sect. 3 discusses the extensions needed to include multimodal capabilities: it introduces the notion of a multimodal generic dialogue node (Sect. 3.1) and presents the associated design goals and principles (Sect. 3.2), a generic GUI layout (Sect. 3.3) and some details about the proposed system architecture (Sect. 3.4). Finally, Sect. 4 provides the conclusions and outlines future work.

2 Rapid Dialogue Prototyping Methodology

Dialogue prototyping represents a significant part of the development process of spoken dialogue systems. The Rapid Dialogue Prototyping Methodology [1, 2] allows the production of dialogue models specific to a given application in a short time. In outline, the RDPM divides the design into the following steps: (1) producing a task model for the targeted application; (2) deriving an initial dialogue model from the obtained task model; (3) carrying out a series of Wizard-of-Oz experiments to iteratively improve the initial dialogue model.

The RDPM focuses on frame-based dialogue systems [7], i.e. dialogue systems following the slot-filling paradigm. Although more advanced dialogue system paradigms exist [8], they usually lack robustness and are therefore difficult to use in real-life applications with short design cycles. Frame-based dialogue systems seem to be a good compromise for practical dialogue systems, especially when they are combined with robust natural language processing techniques.

The general idea behind the RDPM is to build upon the hypothesis that a large class of applications potentially interesting for the setup of interactive user-machine interfaces can be generically modeled in the following way: the general purpose of the application is to allow the users to select, within a potentially large set of targets, the one (or the ones) that best corresponds to the needs (search criteria) that are progressively expressed by the users during their interaction with the system. Notice that this generic application model, although simple, is in fact quite powerful, as it covers important classes of interactive systems such as information retrieval systems (in which case the targets are data items to be retrieved with respect to the information needs expressed by the users) and interactive control systems, such as Smart Home control systems (in which case the targets correspond to the various actions that the system can perform on behalf of the users).

Within this framework, we further make the assumption that the available targets can be individually described by sets of specific attribute:value pairs; the goal of the interactive, dialogue-based interface is then to provide the guidance required for the users to express the search criteria (i.e. the correct attribute:value pairs) leading to the selection of the desired targets. In the rest of this contribution, the descriptive attribute:value pairs will also be referred to (depending on the context) as constraints or semantic pairs.

2.1 The Task Model

The definition of the valid constraints (e.g. the list of available attributes and attribute combinations, as well as the possible associated values) will hereafter be generically called the task model (or the data model in the specific case of information retrieval systems). The cumulated set of constraints acquired by the system at any given time during the interaction will be referred to as the current (search) frame, and the subset of targets matching the current search frame will be called the current solution space (or search space). In the simple but frequent case where all the targets can be homogeneously described by a unique set of attributes, the task model simply corresponds to the schema of a relational database, the entries of which are the individual targets. An example of such a situation is given in Fig. 1, where the entries of the database correspond to annotated transcriptions of dialogue utterances recorded during a meeting. These types of targets are used in the Archivus system.

Date               | Person           | Topic     | DialogAct | Transcription      | ...
Year | Month | Day | FirstN  | FamilyN |           |           |                    |
2004 | April | 7   | David   | P.      | sport     | request   | "Buy a new bike."  | ...
2004 | April | 30  | Susan   | A.      | furniture | question  | "What colour?"     | ...
...

Fig. 1. An example of dialogue targets
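To make the notions of current search frame and solution space more concrete, here is a minimal Java sketch (our own illustration; the class and method names are invented and are not part of the EPFL platform) in which each target from Fig. 1 is a flat set of attribute:value pairs and the solution space is obtained by filtering the targets against the cumulated constraints.

```java
import java.util.*;

// Minimal illustration of a task model seen as a flat relational view:
// each target is a set of attribute:value pairs, the current frame is the
// set of constraints acquired so far, and the solution space is the subset
// of targets matching every constraint.
public class FrameDemo {

    // A target (e.g. one annotated utterance from Fig. 1).
    record Target(Map<String, String> attributes) {}

    // Keep only the targets whose attributes satisfy every constraint in the frame.
    static List<Target> solutionSpace(List<Target> targets, Map<String, String> frame) {
        return targets.stream()
                .filter(t -> frame.entrySet().stream()
                        .allMatch(c -> c.getValue().equals(t.attributes().get(c.getKey()))))
                .toList();
    }

    public static void main(String[] args) {
        List<Target> targets = List.of(
                new Target(Map.of("year", "2004", "month", "April", "speaker", "David",
                        "topic", "sport", "dialogAct", "request")),
                new Target(Map.of("year", "2004", "month", "April", "speaker", "Susan",
                        "topic", "furniture", "dialogAct", "question")));

        // Constraints (semantic pairs) acquired so far during the dialogue.
        Map<String, String> currentFrame = Map.of("speaker", "Susan", "topic", "furniture");

        System.out.println(solutionSpace(targets, currentFrame)); // one matching target
    }
}
```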

2.2 The Dialogue Model

The dialogue model defines the types of interactions that are possible between the system and the user. In the RDPM, a dialogue model consists of two main parts: (1) the application-dependent declarative specification of so-called Generic Dialogue Nodes (GDNs) and (2) the application-independent local and global dialogue flow management strategies.

While the GDNs must be defined by the designer of a particular application, the dialogue strategies have been implemented as part of the EPFL dialogue platform and are generally not intended to be modified by application designers.

Generic Dialogue Nodes

There is a GDN associated with each of the attributes defined in the task model. The role of the GDN is to perform the interaction with the user that is required to obtain a valid value for its attribute. Different types of interaction are possible according to the nature of the values being sought:

1. Simple GDNs allow the user to directly specify a value for the associated attribute. Such GDNs are useful when the number of possible choices for the value is small. Example: selecting a day of the week or a device the user wants to operate.

2. List Processing GDNs allow the user to browse through a list of values and select one of them using the number identifying its position in the list. This type of GDN is particularly useful when an attribute has a large number of possible values or when the values are linguistically complex. Reducing the interaction vocabulary to numerals adds robustness to the speech recognition component, potentially improving its recognition rate substantially. Example: a list processing GDN for selecting from a list of movies or names.

In addition to the GDNs associated with attributes in the task model, the system also contains Internal GDNs that are invoked by the dialogue manager in specific dialogue situations, such as at the beginning of the dialogue or when the user over-specifies a request.

To realize the interaction for which it is responsible, each GDN contains two main types of components: prompts and grammars. The prompts are the messages uttered by the GDN during the interaction. The role of the grammars is to make the connection between the surface forms appearing in the natural language utterances produced by the user and their semantic content, expressed in the form of attribute:value pairs compatible with the task model. The grammars may also be used as a language model for the speech recognition engine to improve the quality of the recognition.

Dialogue Strategies

The term dialogue strategy refers here to the decision of the dialogue manager about the next step in the dialogue. The RDPM handles dialogue strategies at two levels: local and global.

The purpose of the local strategies is to handle problems related to the interaction within a particular GDN. The goal of these strategies is to guide the user towards providing some information for the attributes. When local strategies are applied, the control remains at the GDN level. The local strategies are carried out by the local dialogue manager, which is a part of the GDN.

Situations typically managed by the local strategies include: (1) requests for help; (2) no input provided; (3) re-establishing the dialogue context; (4) speech recognition failures (no match); (5) requests for prompt repetition; (6) suspending/resuming/starting over the dialogue, etc.

As soon as the user provides a value compatible with the attribute associated with the current GDN, control is handed back to the global dialogue manager, where the global strategies are encoded. The purpose of these strategies is to process the newly populated attributes and to make the dialogue progress towards its goal by selecting the next GDN to be activated. The global strategies include: (1) a confirmation strategy; (2) a strategy for dealing with incoherencies; (3) a dialogue dead-end management strategy; (4) a dialogue termination strategy; and (5) a branching strategy (selection of the next GDN). For more information on local and global dialogue strategies, see [1].
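The division of labour between GDNs and the global dialogue manager described above can be summarised in code. The following is only a hedged sketch under our own simplifying assumptions (all class and method names are invented, and of the global strategies only branching is shown; the platform's actual API may differ): each GDN applies its local behaviour to obtain a value for its attribute, and the global manager then selects the next GDN to activate.

```java
import java.util.*;

// Illustrative skeleton of the GDN / dialogue-manager split described above.
// All class and method names are hypothetical, not the platform's actual API.
public class DialogueSketch {

    interface GenericDialogueNode {
        String attribute();                            // task-model attribute this GDN fills
        Optional<String> interpret(String utterance);  // local grammar: surface form -> value
        String prompt();                               // message uttered by the GDN
    }

    // A Simple GDN: small, directly enumerable set of values (e.g. days of the week).
    record SimpleGDN(String attribute, String prompt, Set<String> values)
            implements GenericDialogueNode {
        public Optional<String> interpret(String utterance) {
            return values.contains(utterance) ? Optional.of(utterance) : Optional.empty();
        }
    }

    // Global strategy (branching only, for brevity): pick the first attribute
    // that is still missing from the current frame.
    static Optional<GenericDialogueNode> nextGDN(Map<String, String> frame,
                                                 List<GenericDialogueNode> gdns) {
        return gdns.stream().filter(g -> !frame.containsKey(g.attribute())).findFirst();
    }

    public static void main(String[] args) {
        List<GenericDialogueNode> gdns = List.of(
                new SimpleGDN("day", "Which day of the week?", Set.of("monday", "tuesday")),
                new SimpleGDN("device", "Which device should I operate?", Set.of("lamp", "tv")));

        Map<String, String> frame = new HashMap<>();
        // Simulated user answers; in the platform these come from the ASR/NLU chain.
        Iterator<String> userAnswers = List.of("monday", "lamp").iterator();

        Optional<GenericDialogueNode> current;
        while ((current = nextGDN(frame, gdns)).isPresent()) {
            GenericDialogueNode gdn = current.get();
            System.out.println("SYSTEM: " + gdn.prompt());
            // Local strategy (simplified): a real GDN would re-prompt on a no-match.
            gdn.interpret(userAnswers.next()).ifPresent(v -> frame.put(gdn.attribute(), v));
        }
        System.out.println("Acquired frame: " + frame);   // e.g. {day=monday, device=lamp}
    }
}
```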

2.3 Wizard-of-Oz experiments

Wizard-of-Oz (WOz) experiments [9] are an integral part of the RDPM. Their main role is to allow, in the early steps of the design, the acquisition of experimental data about the behaviour of users interacting with the system; functionalities that are not yet implemented are in fact simulated by a hidden human operator called the Wizard. In the experiments, the Wizard uses a Wizard's Control Interface (WCI) to fulfil this task. The interface required for a given WOz experiment is generated automatically from the task and dialogue models.

The WCI consists of a set of modules which are inserted at appropriate places in the dialogue system. Their role is to redirect the dialogue system data flows to the Wizard's control station. At the control station, the incoming data is visualized so that the Wizard can check or modify it, or even create new pieces of data if necessary. Once the Wizard is satisfied with the acquired data, it is sent back to the dialogue system, where the dialogue processing can continue. Currently, the WCI allows the simulation or supervision of the following functionalities: automatic speech recognition, natural language understanding, and (re)starting or stopping the entire system. If needed, other modules can easily be developed and integrated.
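Since the WCI modules behave as interceptors on the data flows of the dialogue system, they can be pictured as simple proxies. The sketch below is purely illustrative (interface and class names are assumptions, not the WCI's actual API): a Wizard proxy wraps a recognizer-like component, shows its output to the Wizard, and lets the Wizard correct it before it is passed on to the rest of the system.

```java
import java.util.Scanner;

// Illustrative sketch of a Wizard-of-Oz supervision module implemented as a
// proxy around an existing component (names are hypothetical, not the WCI's API).
public class WozProxyDemo {

    interface SpeechRecognizer {
        String recognize(byte[] audio);
    }

    // The Wizard proxy: forwards the call, then lets the hidden operator
    // confirm or correct the result before it reaches the rest of the system.
    static class WizardSupervisedRecognizer implements SpeechRecognizer {
        private final SpeechRecognizer delegate;
        private final Scanner wizardConsole = new Scanner(System.in);

        WizardSupervisedRecognizer(SpeechRecognizer delegate) { this.delegate = delegate; }

        public String recognize(byte[] audio) {
            String hypothesis = delegate.recognize(audio);
            System.out.println("WIZARD: ASR hypothesis = '" + hypothesis
                    + "' (press Enter to accept, or type a correction)");
            String correction = wizardConsole.nextLine().trim();
            return correction.isEmpty() ? hypothesis : correction;
        }
    }

    public static void main(String[] args) {
        // A dummy recognizer stands in for the real ASR module.
        SpeechRecognizer asr = audio -> "what did susan say about the sofa";
        SpeechRecognizer supervised = new WizardSupervisedRecognizer(asr);
        System.out.println("NLU receives: " + supervised.recognize(new byte[0]));
    }
}
```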

3 Extending the RDPM to Multimodal Applications

As already mentioned, the aim of the work described in this paper is to extend the currently unimodal (voice-only) EPFL dialogue platform with multimodal capabilities. We believe that adding multimodality to a vocal dialogue system increases user satisfaction and/or the task achievement ratio. In addition, we might also observe interesting, unexpected user behaviour due to the increased complexity of the multimodal system.

Creating a multimodal system is a complex task. One of the challenges is to cope with the problems of fusion and fission of modalities.

The term multimodal fusion is generally understood as the process of combining the inputs coming from different channels into one abstract message that is then processed by the dialogue manager (often called the interaction manager in multimodal systems). Analogously, the term multimodal fission refers to the process of communicating an abstract message (issued by the interaction manager) to the user through some combination of the available output channels.

Fission techniques are usually considered to be of a practical nature. In this perspective, user preferences have been observed and practical guidelines have been proposed [10], e.g. the fact that speech and graphical outputs need to be coordinated and that unnecessary redundancy between the two channels should be avoided (the speech should convey a short version of the main message, while the details might be displayed on the screen).

Multimodal fusion is usually more complex to cope with. Indeed, since fusion may happen at several levels in the system, there exist several different ways of understanding this term:

– At the lowest level, we can think of fusing multiple coordinated streams of information generated by the user, not necessarily consciously. For example, speech processing can be combined with lip reading (image processing) in order to improve the accuracy of speech recognition. This type of multimodal fusion is not discussed in this paper.

– Fusion can also take place at a higher level when, for example, some information that is loosely related to the ongoing communication act provides a useful context for the interpretation of the user's message. An example is the use of the user's location (e.g. in front of a lamp) to process a vocal utterance such as "Switch this on!". We plan to focus on this type of fusion after the first set of experiments has been performed, provided that we have experimental evidence that this is a sufficiently frequent phenomenon in our applicative context.

– Another type of fusion happens when each modality provides different semantic pairs that simply need to be combined together. For example, if "What did Susan say about this?" (semantic pair speaker:Susan) is said together with a click on the topic "sofa" visualized on the screen (topic:sofa), the resulting semantic pair set {speaker:Susan, topic:sofa} is simply produced. This type of modality fusion has already been implemented in our system by means of simple time-stamp related rules applied during the fusion process (a minimal sketch of such rule-based fusion is given below).

– Finally, the user may voluntarily select one of the available communication modalities (e.g. speech, mouse, or keyboard) for performing a task with the system. The different modalities can then be understood as alternative communication interfaces to the system. As the user is often allowed to use a different modality for each subtask of a compound task, this type of modality switching might also be considered as a specific kind of fusion. However, we do not consider modality switching to be a true modality fusion, since the fusion actually happens only at the discourse level (i.e. at the level of the combination of the outcomes of the single subtasks).

Notice that the classical example "Put this there" can be solved either at the level of the NLU (by disambiguating the deictic references "this" and "there" with the help of the pointing modality) or at the level of semantic pairs (by the temporal alignment of the semantic pairs {action:move, object:this, location:there} and {pointingTo:objectX, pointingTo:placeY}).
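As a concrete illustration of the time-stamp based combination of semantic pairs mentioned above, the following minimal sketch (our own; all names are invented) merges pairs coming from different modalities whenever their time stamps fall within a short window, which is also the mechanism that would align the deictic pairs of the "Put this there" example.

```java
import java.util.*;

// Minimal sketch of time-stamp based fusion of semantic pairs coming from
// different modalities (speech, pointing, ...). Names are illustrative only.
public class FusionDemo {

    record SemanticPair(String attribute, String value, String modality, long timestampMs) {}

    // Merge all pairs whose time stamps fall within the given window of the
    // earliest pair; later stages may additionally resolve deictic values
    // ("this", "there") against co-occurring pointing pairs.
    static List<SemanticPair> fuse(List<SemanticPair> pairs, long windowMs) {
        List<SemanticPair> sorted = new ArrayList<>(pairs);
        sorted.sort(Comparator.comparingLong(SemanticPair::timestampMs));
        long start = sorted.get(0).timestampMs();
        return sorted.stream()
                .filter(p -> p.timestampMs() - start <= windowMs)
                .toList();
    }

    public static void main(String[] args) {
        List<SemanticPair> input = List.of(
                new SemanticPair("speaker", "Susan", "speech", 1000),
                new SemanticPair("topic", "sofa", "pointing", 1300));
        // Both pairs fall within a 2-second window and are fused into one set:
        // {speaker:Susan, topic:sofa}
        System.out.println(fuse(input, 2000));
    }
}
```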

3.1 Multimodal GDNs

Multimodal GDNs (hereafter referred to as mGDNs) are GDNs as described in Sect. 2.2, extended with the additional elements required for multimodal interaction. These additional (or modified) elements include:

– grammars for written and spoken natural language input;
– a set of multimodal prompts to guide the user;
– information about the graphical representation (layout) of the mGDN;
– a definition of the role of each GUI element.

All of these elements must be specified in a declarative specification language in order to allow the automatic derivation of the dialogue model and of the associated multimodal dialogue-driven interface from the task model.
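Since the paper does not prescribe a concrete specification format, the following is only a hedged sketch of what the in-memory form of such a declarative mGDN specification might look like; every field name is an assumption introduced for illustration.

```java
import java.util.List;
import java.util.Map;

// Hypothetical in-memory form of a declarative mGDN specification; the actual
// specification language of the platform is not defined here, so every field
// name below is an assumption made for illustration only.
public record MGdnSpec(
        String attribute,                    // task-model attribute handled by this mGDN
        String layout,                       // graphical representation, e.g. "list", "buttons"
        List<String> prompts,                // multimodal prompts used to guide the user
        Map<String, String> spokenGrammar,   // surface form -> value (spoken/written input)
        Map<String, String> guiElementRoles  // GUI element id -> role (e.g. "nextPage")
) {
    public static MGdnSpec speakerSelection() {
        return new MGdnSpec(
                "speaker",
                "list",
                List.of("Which meeting participant are you interested in?"),
                Map.of("susan", "Susan A.", "david", "David P."),
                Map.of("item_click", "selectValue", "next_button", "nextPage"));
    }
}
```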

Fig. 2. GDN graphical representations (layouts): (a) selecting a movie from a list; (b) selecting a meeting participant

3.2 Design Goals and Principles

In order to fulfil our goal of extending the EPFL dialogue platform with multimodal capabilities, we define the following design principles:

1. Similarly to the standard RDPM, the elementary building blocks of our multimodal interface are mGDNs. Again, there are several types of mGDNs, each type encapsulating a particular kind of interaction and providing various graphical layouts (see Fig. 2). This allows the rapid building of multimodal dialogue systems out of existing building blocks.

2. The mGDNs represent the only interaction channel with the system. In other words, all inputs/outputs going to/coming from the system must be managed by some mGDN. The underlying design implication is that every active piece of the graphical user interface must be connected to some mGDN.

3. Every mGDN is fully multimodal, i.e. every mGDN systematically gives the users the possibility to communicate using all the defined modalities. This ensures that users can communicate in the way that is most comfortable for them and can switch between modalities if they find that communication using one of them does not yield the expected results.

4. At any time, only one single mGDN is in focus (i.e. operational) in the interface. This very important principle strongly contributes to the overall feasibility of our objectives by narrowing the interpretation context for the multimodal fusion, as explained hereafter. It is important to notice that, even if only one mGDN can be in focus, several mGDNs can be active, i.e. ready to process the input provided to the mGDN in focus. This functionality is needed for the support of mixed initiative.

5. Only a limited number of modalities is taken into account during system design. In our experiments, we will focus on three active input modalities, namely voice, pointing (either using a mouse or a touch screen) and text (keyboard), one passive input modality, namely emotion recognition (by face analysis), and three output modalities (graphics, sound and text).

The aim of the above design principles is to narrow the problem of modality fusion, which is usually believed to be difficult to solve in the general case. In particular, in our approach, the multimodal fusion is handled at the level of individual mGDNs, and this is possible because only one mGDN is in focus at any given time to process the user's inputs. In addition, each mGDN is aware of its role (what subtask should be solved by using it), of its graphical layout, of its speech input grammars, etc., which makes it possible to foresee the structure of the multimodal input and to fuse it correctly (a minimal code sketch of this focus-based interpretation is given below).
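The effect of the fourth principle on fusion can be sketched as follows (again with invented names): the user input is first interpreted in the narrow context of the mGDN in focus, and only if that fails is it offered to the other active mGDNs, which is what makes mixed initiative possible.

```java
import java.util.*;

// Sketch of how a single mGDN in focus narrows the interpretation context,
// while other active mGDNs still support mixed initiative. Names are illustrative.
public class FocusDemo {

    // An mGDN reduced to a keyword "grammar" for its attribute.
    record MGdn(String attribute, Set<String> keywords) {
        Optional<Map.Entry<String, String>> interpret(String input) {
            return keywords.stream().filter(input::contains).findFirst()
                    .map(k -> Map.entry(attribute, k));
        }
    }

    static Optional<Map.Entry<String, String>> interpret(
            MGdn focused, List<MGdn> active, String userInput) {
        // 1. Try the narrow context of the mGDN currently in focus.
        return focused.interpret(userInput)
                // 2. Mixed initiative: otherwise offer the input to the other active mGDNs.
                .or(() -> active.stream()
                        .filter(g -> g != focused)
                        .map(g -> g.interpret(userInput))
                        .flatMap(Optional::stream)
                        .findFirst());
    }

    public static void main(String[] args) {
        MGdn speaker = new MGdn("speaker", Set.of("susan", "david"));
        MGdn topic = new MGdn("topic", Set.of("sofa", "budget"));
        // "speaker" is in focus, but the user talks about a topic instead.
        System.out.println(interpret(speaker, List.of(speaker, topic), "show me the budget"));
        // -> Optional[topic=budget]
    }
}
```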

3.3 GUI

In general, the dialogues covered by the RDPM consist of two phases: (1) eliciting from the user the constraints that are needed to identify a small set of solutions that can then be presented to them, and (2) giving the user the possibility to browse the current solution set in order to select the right solution. With respect to this dialogue structure and the above mentioned design principles, we propose the general GUI structure depicted in Fig. 3.

The screen is divided into four main areas, each occupied by an mGDN. Area (4) serves for the visualization of the solution space and should highlight the solutions that meet the current constraints defined by the user. The user should have the possibility to issue commands that switch between various visualization modes, rearrange the solution space or allow browsing of the solution space. Area (3) is used by all the mGDNs (except the internal ones). Notice however that, in accordance with the above mentioned design principles, only the mGDN in focus is visualized there.

Fig. 3. The proposed general GUI structure (demonstrated on the multimodal system Archivus [6] for browsing a database of recorded meetings)

The possibility to explicitly switch between the various constraint selection mGDNs is the responsibility of a special mGDN called Criteria Selection, which occupies area (2). Area (1) is reserved for the History mGDN, which serves for browsing and modifying the history of the interaction (e.g. changing or deleting an acquired semantic pair). The system prompt component (5), the text input component (6) and the control buttons (7) are special parts of the user interface shared by all mGDNs.
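For illustration only, the generic layout of Fig. 3 could be assembled roughly as in the following Swing sketch; the component names, placement and proportions are our own assumptions and do not correspond to the actual Archivus implementation.

```java
import javax.swing.*;
import java.awt.*;

// Rough Swing sketch of the generic GUI layout of Fig. 3; purely illustrative,
// the real Archivus interface is not implemented this way.
public class GenericGuiSketch {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Generic multimodal GUI layout");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);

            JPanel history = titled("(1) History mGDN");
            JPanel criteria = titled("(2) Criteria Selection mGDN");
            JPanel focusArea = titled("(3) mGDN in focus");          // swapped at runtime
            JPanel solutions = titled("(4) Solution space visualization");
            JLabel prompt = new JLabel("(5) System prompt");
            JTextField textInput = new JTextField("(6) Text input");
            JPanel controls = titled("(7) Control buttons");

            JPanel left = new JPanel(new GridLayout(2, 1));
            left.add(history);
            left.add(criteria);

            JPanel bottom = new JPanel(new GridLayout(3, 1));
            bottom.add(prompt);
            bottom.add(textInput);
            bottom.add(controls);

            frame.setLayout(new BorderLayout());
            frame.add(left, BorderLayout.WEST);
            frame.add(focusArea, BorderLayout.CENTER);
            frame.add(solutions, BorderLayout.EAST);
            frame.add(bottom, BorderLayout.SOUTH);
            frame.setSize(900, 600);
            frame.setVisible(true);
        });
    }

    private static JPanel titled(String title) {
        JPanel p = new JPanel();
        p.setBorder(BorderFactory.createTitledBorder(title));
        return p;
    }
}
```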

3.4 The Proposed Architecture

A multimodal dialogue system is a considerably complex piece of software consisting of a relatively high number of modules. Each module requires different data sources for its initialization and information from different sources to produce its outputs. It is also often the case that the modules are created by various authors and were originally targeted at different kinds of applications. In addition, the number and types of modules in multimodal dialogue systems are not fixed and there is no consensus on what exactly the responsibilities of each module are. This implies that a very flexible architecture is required, one that allows for simple modification or the addition of new modules. In the ideal case, the system should represent a complete development environment that supports all phases of the development cycle. It should also come with modules ready for immediate use (e.g. modules for speech recognition, synthesis, and dialogue management) to allow researchers to focus on the functionality they are interested in.

Frameworks that might be considered as satisfying the above criteria are Galaxy-II [11] and the Open Agent Architecture [12]. However, we see it as a drawback that they are both too general and do not impose any predefined communication paradigm on the dialogue system. The distributed nature of these frameworks also makes the debugging of the targeted system more difficult. For all these reasons, we have chosen another approach to module composition. Each module is implemented as a simple Java class and the communication with other modules is realized by method calls using well-defined interfaces. The exact configuration of the system is declaratively defined in a configuration file. When the system is launched, a special module called the application builder creates instances of all modules (each of them identified by a name) and interconnects them as described in the configuration file (a minimal sketch of this wiring is given below). Since the initialization of each module is unique (as far as resources and start-up parameters are concerned), there is a configuration file associated with each module which is used by the module for its initialization.

The selected approach leads to a system that is very flexible and easily configurable, and it allows simple module development and debugging (modules are objects in the Java language with well-defined interfaces). At the same time, the possibility of distributed processing remains fully open. On the local system, any module performing real operations can be replaced by a proxy that forwards the method calls to another computer and receives the results from it. From the point of view of the rest of the system, this process remains fully transparent. The most complex case is the situation where a graphical module needs to be displayed on one or more remote machines. A possible solution to this problem is to rely on the standard VNC protocol. This choice makes the distribution extremely simple, but the price to pay is a higher network bandwidth consumption.

Since WOz experiments are an important part of the RDPM, we should be able to supervise the functionality of certain modules or even substitute them by the Wizard (human operator) using the graphical user interface that is used during the dialogue. This can easily be achieved by inserting the graphical WOz module as a proxy of the module that we want to supervise. For example, if the goal is to supervise the quality of the speech recognition, the graphical module is plugged in between the speech recognition engine and the NLU. The recognized utterance is then displayed to the Wizard, who can modify it if necessary. Similarly, the Wizard can check the semantic pairs resulting from the NLU module.

The proposed module composition is depicted in Fig. 4. The Interaction Manager controls two groups of modules: the input and output modules. The role of the Fusion Manager is to combine the semantic pairs from the different input sources (modalities). The Text Input Field module (area (6) in Fig. 3) allows the user to type in some text that is subsequently translated by the NLU into semantic pairs. The same happens with the text produced by the ASR. Note that the ASR might be disabled using the Mute Microphone GUI module and that the ASR result might be corrected via the Wizard's Recognition Supervision module. Possible values for the mGDN in focus are displayed in the GDN Pointing Zone (area (3) in Fig. 3).
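The module composition approach described above can be sketched as follows. This is a minimal illustration under our own assumptions (the configuration format, interface and class names are invented, and the real application builder is certainly richer): modules are registered under names and chained according to a declaratively defined pipeline, which is what makes it easy to swap a module for a network proxy or a WOz supervision proxy without touching the rest of the system.

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Minimal sketch of the "application builder" idea: modules are plain Java
// objects behind well-defined interfaces, instantiated by name and wired
// together according to a declarative configuration. Names are illustrative.
public class ApplicationBuilderSketch {

    // A deliberately tiny module interface: one processing step, text in, text out.
    interface Module extends UnaryOperator<String> {}

    public static void main(String[] args) {
        // Stand-in for the configuration files: module name -> implementation,
        // plus the data flow expressed as an ordered list of module names.
        Map<String, Module> registry = new HashMap<>();
        registry.put("asr", audio -> "what did susan say about the sofa");
        registry.put("nlu", text -> "{speaker:Susan, topic:sofa}");
        List<String> pipeline = List.of("asr", "nlu");

        // The application builder resolves the names and chains the modules; because
        // composition is name-based, any entry could be replaced by a proxy (remote
        // or WOz-supervised) transparently to the rest of the system.
        String data = "<audio>";
        for (String name : pipeline) {
            data = registry.get(name).apply(data);
            System.out.println(name + " -> " + data);
        }
    }
}
```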

Fig. 4. Proposed module composition and the main data flow

Mouse clicks are translated into semantic pairs by the Pointing Understanding module. The History and Criteria Selection mGDNs (areas (1) and (2) in Fig. 3) work in a similar fashion, except that each of them displays only one, fixed GDN. Semantic pairs resulting from the fusion process are supervised by the Wizard in the Semantic Pair Supervision module and are then sent to the Interaction Manager. The Interaction Manager processes the semantic pairs and selects the next GDN to be in focus (the decision can be modified by the Wizard in the GDN Selection Supervision module). The dialogue state information is then updated, the Solution Space Visualization is modified and the multimodal output is issued by the Interaction Manager. The output is sent by the Fission Manager to the System Prompt Visualization module, which displays it on the screen and sends it to the Text-to-Speech module, which gives vocal feedback to the user.

Each of the modules in the system can be sensitive to the global state of the dialogue (e.g. the GDN in focus, the list of active GDNs) through dynamic selection of its resources (e.g. using appropriate GDN-dependent grammars). The information about the dialogue state can be obtained by reading the information published by the Interaction Manager.
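To illustrate the fission step in this data flow, here is a hedged sketch (record and method names are invented) that follows the guideline recalled at the beginning of Sect. 3: the spoken prompt carries only a short version of the message, while the details are sent to the screen.

```java
import java.util.List;

// Illustrative sketch of the fission step: one abstract output from the
// Interaction Manager is split into a short spoken prompt and a more detailed
// visual prompt, following the coordination guideline discussed in Sect. 3.
// The record and method names are assumptions, not the platform's actual API.
public class FissionSketch {

    record MultimodalOutput(String message, List<String> details) {}
    record RenderedOutput(String spokenPrompt, String visualizedPrompt) {}

    static RenderedOutput fission(MultimodalOutput out) {
        // Speech: short version of the main message only.
        String spoken = out.message();
        // Screen: the same message plus the details, avoiding spoken redundancy.
        String visual = out.message() + "\n - " + String.join("\n - ", out.details());
        return new RenderedOutput(spoken, visual);
    }

    public static void main(String[] args) {
        MultimodalOutput out = new MultimodalOutput(
                "I found 1 utterance by Susan about furniture.",
                List.of("2004-04-30, Susan A., question: \"What colour?\""));
        RenderedOutput rendered = fission(out);
        System.out.println("TTS     : " + rendered.spokenPrompt());
        System.out.println("DISPLAY : " + rendered.visualizedPrompt());
    }
}
```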

4 Conclusion

The approach described in this contribution is believed to be generic enough to provide rapid prototyping techniques for a large class of multimodal dialogue systems. The main idea behind the approach is that problems that are difficult to solve in general can be reduced to solvable instances by a specific design of the multimodal interface relying on a precise set of the associated interaction principles.

Although our approach has already been partially tested in several prototypes, we are now faced with the need to build a complete multimodal system, as only the process of building and testing such a complete system makes it possible to identify the potentially problematic parts of our methodology and to improve the underlying theory. An important fraction of the work now lies in testing with real users, trying to understand what they really want, like, and do not like, and identifying how one can respond to that. Only a complete system reveals the real extent of the various problems encountered in human-machine interaction and allows us to focus on the most critical ones. Therefore we are currently building the first version of a multimodal system prototype, the Archivus system [6].

References

1. Bui, T.H., Rajman, M., Melichar, M.: Rapid dialogue prototyping methodology. In: Proc. of TSD 2004, Brno, Czech Republic, Springer-Verlag (2004) 579–586
2. Rajman, M., Bui, T.H., Rajman, A., Seydoux, F., Trutnev, A., Quarteroni, S.: Assessing the usability of a dialogue management system designed in the framework of a rapid dialogue prototyping methodology. Acta Acustica united with Acustica, the journal of the European Acoustics Association (EAA) 90 (2004) 1096–1111
3. Möller, S., Krebber, J., Raake, A., Smeele, P., Rajman, M., Melichar, M., Pallotta, V., Tsakou, G., Kladis, B., Vovos, A., Hoonhout, J., Schuchardt, D., Fakotakis, N., Ganchev, T., Potamitis, I.: INSPIRE: Evaluation of a Smart-Home System for Infotainment Management and Device Control. In: International Conference on Language Resources and Evaluation (LREC). Volume 5, Lisbon, Portugal (2004)
4. Krebber, J., Möller, S., Pegam, R., Raake, A., Melichar, M., Rajman, M.: Wizard of Oz tests for a Dialogue System for Smart Homes. In: Proceedings of the 30th German Convention on Acoustics (DAGA) together with the 7th Congrès Français d'Acoustique (CFA), Strasbourg, France (2003)
5. Boland, H., Hoonhout, J., van Schijndel, C., Krebber, J., Melichar, M., Schuchardt, D., Baesekow, H., Pegam, R., Möller, S., Rajman, M., Smeele, P.: Turn on the lights: investigating the Inspire voice controlled smart home system (2004)
6. Lisowska, A., Rajman, M., Bui, T.H.: Archivus: A system for accessing the content of recorded multimodal meetings. In: Proc. of MLMI'04, Switzerland (2004)
7. McTear, M.F.: Spoken dialogue technology: Enabling the conversational user interface. ACM Computing Surveys 34 (2002) 90–169
8. Pallotta, V.: Computational dialogue models. MDM research project deliverable, Faculty of Computer and Communication Sciences, Swiss Federal Institute of Technology, EPFL IC-ISIM LITH, IN-F Ecublens, 1015 Lausanne (CH) (2003)
9. Dahlbäck, N., Jönsson, A., Ahrenberg, L.: Wizard of Oz studies: Why and how. In: Proc. of the International Workshop on Intelligent User Interfaces, Orlando, FL (1993)
10. Bohus, D., Rudnicky, A.: LARRI: A language-based maintenance and repair assistant. In: Spoken Multimodal Human-Computer Dialogue in Mobile Environments. Volume 28 of Text, Speech and Language Technology. Springer (2005) 203–218
11. Seneff, S., Hurley, E., Lau, R., Pao, C., Schmid, P., Zue, V.: Galaxy-II: A reference architecture for conversational system development. In: Proc. ICSLP'98 (1998)
12. Martin, D.L., Cheyer, A., Moran, D.B.: The Open Agent Architecture: A framework for building distributed software systems. Applied Artificial Intelligence 13 (1999)