
A Framework for Domain-specific Multi-modal Dialog System Creation

Silke Witt, Farzad Ehsani, Demitrios Master, Eryk Warren

Fluential, 1200 Crossman Ave, Sunnyvale, CA 94089, USA

Abstract

This paper presents a framework for rapid multi-modal dialog system development for domain-specific systems, as well as a run-time engine that automates the domain-independent tasks and behaviors of a conversational multi-modal system. We present the set of modules that make up the engine and discuss some of the tools for rapid development of the domain-specific language model, corpus and interaction rules. The capabilities of the multi-modal framework are demonstrated with the help of two conversational mobile systems that have been built using this framework.

1. Introduction

The framework for multi-modal spoken dialog systems presented with this demonstration is motivated by the need for a conversational interface for mobile applications. Prior work on multi-modal spoken dialog frameworks tended to focus on speech input and output as the primary modality [1,2,4]. Examples include RavenClaw (later Olympus) [1,2], Galatea [3], and DIPPER [4]. Both Olympus and DIPPER provide a hub-based or open agent architecture that offers interfaces to speech recognizers, TTS, dialog managers as well as natural language understanding. Another framework that specifically focused on multi-modal input integration is MATCH [5]. At the same time, extensive research has also been done on multi-modal interface design in the human-computer interaction community [6]. The framework presented here is intended to bridge the areas of spoken dialog systems and multi-modal human-computer interfaces by addressing both common challenges in the spoken language understanding domain and mechanisms to truly integrate multiple input and output modalities.


2. SOFIA: A Multi-modal, Multi-threaded Framework

The standard architecture for a spoken dialog system consists of these main components: automatic speech recognition (ASR), spoken language understanding (SLU), dialog management (DM), and output generation. Compared with a spoken dialog system, a multi-modal dialog system needs to be extended to handle additional input modalities such as free-form text, touch and textbox entries. Likewise, there are also more output modalities that need to be integrated for a coherent output.

This paper presents the SOFIA (Speech Operating system For Intelligent Agents) framework. The architecture is modular with well-defined interfaces between the modules and thus allows for easy expansion of the capabilities or intelligence of a given module without impacting other components. Figure 1 illustrates its basic architecture. Note that the spoken language understanding is divided into three steps. Context-free SLU assigns semantic tags to the recognized string or text input using a combination of robust tagging and parsing. Topic identification assigns the resulting semantic tags to a topic out of a set of available topics using a classifier. In the third step, situated SLU, the winning topic and associated semantic tags are matched to interaction guides (IGs), which essentially are frames. These IGs contain the current context and dialog history in the form of filled variables and associated inference rules. Additionally, the design of the SOFIA framework has focused on robustness and rapid prototyping capabilities. The following paragraphs describe each component in more detail; a toy sketch of the three SLU steps is given after Figure 1.

Fig. 1: SOFIA engine modules.
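To make the three-step SLU pipeline concrete, the following toy Python sketch chains context-free tagging, topic identification and situated SLU on a single utterance. The tag rules, the trivial rule-based stand-in for the SVM classifier and the dictionary representation of interaction guides are illustrative assumptions, not the actual SOFIA implementation.

# Toy sketch of the three SLU stages; rules and names are illustrative only.
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TopicHypothesis:
    topic: str             # e.g. "food_order"
    confidence: float      # classifier confidence
    tags: Dict[str, str]   # semantic tags extracted from the input


# Stage 1: context-free SLU -- robust tagging with regular expressions.
TAG_RULES = {
    "FOOD": re.compile(r"\b(orange juice|coffee|sandwich)\b"),
    "QUANTITY": re.compile(r"\b(one|two|three)\b"),
}

def context_free_slu(text: str) -> Dict[str, str]:
    tags = {}
    for name, pattern in TAG_RULES.items():
        match = pattern.search(text.lower())
        if match:
            tags[name] = match.group(0)
    return tags

# Stage 2: topic identification -- a trivial rule stands in here for the
# SVM classifier used by the real engine.
def identify_topics(tags: Dict[str, str]) -> List[TopicHypothesis]:
    if "FOOD" in tags:
        return [TopicHypothesis("food_order", 0.9, tags)]
    return [TopicHypothesis("unknown", 0.1, tags)]

# Stage 3: situated SLU -- match the winning topic to an interaction guide
# (represented here as a plain dictionary of context variables).
def situated_slu(hypotheses: List[TopicHypothesis],
                 guides: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    best = max(hypotheses, key=lambda h: h.confidence)
    guide = guides.setdefault(best.topic, {})
    guide.update(best.tags)   # fill the guide's context variables
    return guide

if __name__ == "__main__":
    guides: Dict[str, Dict[str, str]] = {}
    tags = context_free_slu("I'd like one glass of orange juice")
    print(situated_slu(identify_topics(tags), guides))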

Domain-specific ASR

The SOFIA framework is recognizer agnostic; that is, all recognizer specifics are encapsulated within the domain-specific ASR module. Because a high recognition rate is crucial for user acceptance of a commercially viable system, recognition accuracy in this framework is increased through the use of domain-specific language models and pronunciation dictionaries. This in turn requires a large amount of domain-specific training data before deployment. The SOFIA platform tools enable the rapid creation of such language models and pronunciation dictionaries via crowd-sourced data collection. Following recognition, the resulting N-best list is normalized. The framework comes with a set of generic normalization rules and can additionally be configured with domain-specific normalizations. Likewise, a typed text entry is processed by a spell-checker and then normalized as well. Next, the normalized N-best list of the recognizer or the normalized text input is passed on to the context-free natural language understanding component.

Context-free Natural Language Understanding

Natural language understanding in this framework is achieved via a combination of tagging and shallow parsing. This approach ensures a basic level of error robustness, because recognition or spelling errors in words that do not carry crucial information, typically function words such as 'the' and 'a', do not affect the extracted meaning. We devised a regular expression language that allows defining a set of robust understanding rules for each topic in a given domain. The creation and testing of these regular expression rules is done in a graphical interface that allows for regression, coverage and accuracy testing.

Topic Identification

The tagged and parsed user input is then sent to the topic identification module, which consists of an SVM classifier. A topic is defined as a conceptual unit that covers the sentences and phrases related to it. For example, in a travel system 'flight search' might be one topic and 'weather inquiry' another. How broad or narrow a topic is defined depends both on the nature of the system and the preferences of the system designer. The topic identification classifier returns an N-best list of topics with associated filled tags, where each N-best result has a classifier confidence assigned to it. Training of a domain-specific classifier is done with a tagged set of input sentences. The necessary tools to automate this step are built into the SOFIA development IDE, see Section 3.

Situated Language Understanding

The output of the topic identification step is then sent to the situated language understanding module. The first step in this module is to match up all incoming topic requests to available interaction guides. Once the set of interaction guide candidates is determined, a ranking function calculates the most likely interaction guide. At all times, a queue of active interaction guides is maintained. The highest-ranking interaction guide is placed on top of the queue (or stays on top of the queue when a user input continues an ongoing conversation). This combination of ranking and an interaction queue provides a powerful mechanism for tracking multi-threaded conversations; that is, the system is capable of maintaining multiple conversation threads. A sketch of this queue mechanism is given below.
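The ranking-plus-queue mechanism for multi-threaded conversations can be pictured with the minimal Python sketch below. The representation of interaction guides as dictionaries and the injected ranking function are assumptions made for illustration, not SOFIA internals.

# Illustrative sketch of the interaction guide queue described above.
from typing import Callable, Dict, List

Guide = Dict[str, object]


class InteractionGuideQueue:
    """Keeps active interaction guides ordered so that several
    conversation threads can be tracked at once."""

    def __init__(self, rank: Callable[[Guide, dict], float]):
        self._rank = rank            # scores a guide against the current input
        self._queue: List[Guide] = []

    def select(self, candidates: List[Guide], user_input: dict) -> Guide:
        # Rank all candidate guides for this input and pick the best one.
        winner = max(candidates, key=lambda g: self._rank(g, user_input))
        # Move (or insert) the winner to the top of the queue; guides for
        # other conversation threads stay below and can be resumed later.
        if winner in self._queue:
            self._queue.remove(winner)
        self._queue.insert(0, winner)
        return winner

    @property
    def active(self) -> Guide:
        # The guide on top of the queue drives the current conversation thread.
        return self._queue[0]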


Dialog Manager

The dialog manager represents the final processing step of the SOFIA engine. It is based on the information state update principle, where the information state represents the entirety of all known variables, including variable values derived via inference. Once the winning interaction guide has been determined, the dialog manager uses the filled data tags from the user input to update the context data of that interaction guide. A custom language has been defined for the interaction guides that also allows for the use of inference rules; the dialog manager iterates through all variables associated with the IG until no more new values can be added via inference. Each IG comes with a series of action rules. Once the current context has been updated, the dialog manager evaluates all action rules of the winning IG. The first action rule that evaluates to true is executed. Figure 2 contains an IG snippet illustrating an action rule.

Fig. 2: Sample action element of an interaction guide.
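The update/act cycle of the dialog manager (applying inference rules until the information state stops changing, then executing the first action rule whose condition holds) can be sketched as follows. Representing rules as plain Python callables is an assumption for illustration; the actual interaction guides use the custom language described above.

# Sketch of the dialog manager's inference and action rule evaluation.
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
InferenceRule = Callable[[Context], Dict[str, object]]   # returns inferred values
ActionRule = Tuple[Callable[[Context], bool], Callable[[Context], None]]


def run_inference(context: Context, rules: List[InferenceRule]) -> None:
    """Apply inference rules until no variable gains a new value."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for key, value in rule(context).items():
                if context.get(key) != value:
                    context[key] = value
                    changed = True


def execute_first_matching_action(context: Context,
                                  actions: List[ActionRule]) -> None:
    """Evaluate action rules in order and execute the first one that holds."""
    for condition, action in actions:
        if condition(context):
            action(context)
            return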

All domain-independent aspects such as error handling and data processing have been built into the run-time components of the dialog manager.

Backend Integration Manager

The dialog manager utilizes a knowledge manager that encapsulates the integration with a system-specific backend in the required format, such as SQL or web-service queries.

Output Generator

The output modalities in our platform are audio playback, text display and graphical display. An XML-based interface is used to send the output instructions for an action rule to the mobile application. Putting it all together, Figure 3 provides a simplified example flow for the processing of a user input via recognition, tagging and topic identification to XML output.


Fig. 3: Sample sequence of user input, winning topic with tagged data, dialog manager XML output, mobile app interpretation.
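As an illustration of the XML-based output interface, the snippet below assembles an output message covering the three output modalities (audio playback, text display and graphical display). The element and attribute names are invented for this example; the paper does not specify SOFIA's actual schema.

# Hypothetical construction of an XML output instruction for the mobile app.
import xml.etree.ElementTree as ET


def build_output(prompt_text: str, audio_url: str, display_items: list) -> str:
    root = ET.Element("output")
    ET.SubElement(root, "audio", src=audio_url)        # audio playback
    ET.SubElement(root, "text").text = prompt_text     # text display
    display = ET.SubElement(root, "display")           # graphical display
    for item in display_items:
        ET.SubElement(display, "item").text = item
    return ET.tostring(root, encoding="unicode")


print(build_output("Here are your flights:", "prompts/flights.wav",
                   ["SFO to JFK 8:05am", "SFO to JFK 11:30am"]))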

2.2 Error Detection and Recovery

In addition to the design and integration complexity involved in the creation of multi-modal spoken dialog systems, error detection and recovery is another important area that needs to be addressed for any successful deployment. Below are some example error types and the corresponding built-in recovery mechanisms of the system (a small tagging example follows the list):

- Non-meaning-carrying word recognition (or spelling) errors: Errors in non-meaning-carrying words do not have an impact in this architecture, since robust parsing and tagging will still produce the correctly filled tags. For example, if 'of' is deleted in 'one glass of orange juice', the tag will still be FOOD('orange juice').
- Partial recognition errors: These lead to partial tagging, and in most cases the dialog manager can act upon the partial result, meaning that the user still has the satisfaction of the system responding to his or her request, even if not completely.
- Under-specificity errors: The inference rules and data element specifications in the interaction guides enable the system to resolve ambiguity, reducing the need to ask the user for further clarification.
- Spelling errors: The SOFIA system has a built-in spell-checker that can be used as is or enhanced with a domain-specific dictionary for increased accuracy.
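The following small Python example illustrates the first error class: a robust tag pattern that still extracts FOOD('orange juice') when the recognizer drops the word 'of'. The pattern is a toy stand-in for the regular expression rule language described in Section 2.

# Toy demonstration: the tag still fires when the non-meaning-carrying
# word 'of' is deleted from the recognized string.
import re

FOOD = re.compile(r"\b(?:glass(?:\s+of)?\s+)?(orange juice|coffee)\b")

for hypothesis in ["one glass of orange juice",   # correct recognition
                   "one glass orange juice"]:     # 'of' deleted
    match = FOOD.search(hypothesis)
    print(f"FOOD({match.group(1)!r})" if match else "no tag")
# Both hypotheses yield FOOD('orange juice').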

3. SOFIA Domain-specific System Development

The previous section discussed the run-time SOFIA engine modules. This section now provides a high-level overview of the development IDE that has been built to simplify the development of systems in a new domain. For any new system, these components need to be created:

- Domain-specific language model and pronunciation dictionary
- Domain-specific corpus consisting of tagging elements and rules
- Domain-specific topic classifier
- A set of interaction guides for all specific system behaviors
- Mobile application development

In order to create the first three of the components listed above, a series of tools has been built to support rapid development, as opposed to having to create all domain-independent knowledge and behavior again. For example, knowledge about dates and locations as well as error handling is built into the engine. Lastly, command-line tools and automated test-set evaluation are available to test conversational interactions with a domain-specific set of interaction guides, corpus and language model.
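A minimal sketch of what the automated test-set evaluation over a tagged corpus could look like is given below. The CSV file format and the tagger interface are assumptions for illustration rather than the actual SOFIA command-line tools.

# Sketch of regression testing a tagger against a tagged test set.
import csv
from typing import Callable, Dict


def evaluate(tagger: Callable[[str], Dict[str, str]], test_file: str) -> float:
    """Each CSV row holds: sentence, expected tag name, expected tag value."""
    correct = total = 0
    with open(test_file, newline="") as handle:
        for sentence, tag_name, tag_value in csv.reader(handle):
            total += 1
            if tagger(sentence).get(tag_name) == tag_value:
                correct += 1
    return correct / total if total else 0.0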

Conclusion

This paper describes a flexible framework that allows rapid development of robust multi-modal spoken dialog systems and provides a run-time engine for the execution of such systems. Its modular nature also allows experimenting with and expanding the capabilities of each module, which is planned for future work as results from user testing become available. In this demo, we will demonstrate some of the development tools as well as prototypes (a travel and a wellness mobile application) that have been developed using the SOFIA framework.

References

[1] D. Bohus, A. I. Rudnicky, "RavenClaw: Dialog Management Using Hierarchical Task Decomposition and an Expectation Agenda", CMU Computer Science Department, Paper 1392, 2003.
[2] D. Bohus, A. Raux, T. K. Harris, M. Eskenazi, A. I. Rudnicky, "Olympus: an open-source framework for conversational spoken language interface research", NAACL-HLT, Rochester, NY, April 2007.
[3] S. Kawamoto, H. Shimodaira, T. Nitta, T. Nishimoto, S. Nakamura, K. Itou, S. Morishima, T. Yotsukura, A. Kai, A. Lee, Y. Yamashita, T. Kobayashi, K. Tokuda, K. Hirose, N. Minematsu, A. Yamada, Y. Den, T. Utsuro, S. Sagayama, "Open-source software for developing anthropomorphic spoken dialog agent", in Proc. of PRICAI-02, International Workshop on Lifelike Animated Agents, 2002.
[4] J. Bos, E. Klein, O. Lemon, T. Oka, "DIPPER: Description and Formalisation of an Information-State Update Dialogue System Architecture", SIGdial 2003, Sapporo, Japan, 2003.
[5] M. Johnston, S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker, S. Whittaker, P. Maloor, "MATCH: An Architecture for Multimodal Dialogue Systems", Proc. ACL, pp. 376-383, Philadelphia, 2002.
[6] C. Lee, S. Jung, K. Kim, D. Lee, G. G. Lee, "Recent Approaches to Dialog Management for Spoken Dialog Systems", Journal of Computing Science and Engineering, Vol. 4, No. 1, March 2010, pp. 1-22.
[7] B. Dumas, D. Lalanne, S. Oviatt, "Multimodal Interfaces: A Survey of Principles, Models and Frameworks", Human Machine Interaction, Lecture Notes in Computer Science, Vol. 5440, 2009, pp. 3-26.
