An Architecture for Multi-Domain Spoken Dialog Systems
Nobuo Kawaguchi
Graduate School of Engineering, Nagoya University Furo-cho, Chikusa-ku, Nagoya, 464-8603, JAPAN
[email protected]
Shigeki Matsubara
Faculty of Language and Culture, Nagoya University
Katsuhiko Toyama
Graduate School of Engineering, Nagoya University / Center for Integrated Acoustic Information Research (CIAIR), Nagoya University
Yasuyoshi Inagaki
Graduate School of Engineering, Nagoya University
Abstract
Several spoken dialog systems for specific task domains have been developed so far, but there are only a few multi-domain systems that consider extensibility and scalability. This paper proposes a distributed architecture for multi-domain spoken dialog systems that satisfies both requirements. The key concept of the architecture is the distribution and integration of data fragments. The data fragments carry information about the speech input obtained from the continuous speech recognition engine. Each fragment is distributed and integrated through a hierarchy of domain managers and work modules.
1 Introduction
Spoken dialog systems have been studied to help people who cannot use their hands or pointing devices, or who are familiar with neither keyboards nor computers (Bolt, 1980; Takebayashi et al., 1992; Ando et al., 1994; Zue, 1997).

(This research has been supported in part by a Grant-in-Aid for COE Research Project (No. 11CE2005) from the Ministry of Education, Science, Sports and Culture, Japan.)

Several conversational systems for specific domains have been developed so far. Jupiter (Zue, 1997) is a telephone dialog system for weather information based on the GALAXY (Goddeau et al., 1994) architecture. TOSBURG-II (Takebayashi et al., 1992) is a multi-modal fast-food clerk system that can manage hamburger orders. Sync/Draw (Matsubara et al., 1997) is a multi-modal drawing tool that incrementally understands speech and responds quickly to inputs. Almost all of these systems can manage only a single task domain in a sequential manner. When we consider a conversational system for several task domains, the architectures of these systems are
not suitable. A multi-domain conversational system must hold the knowledge, rules, and functionality for all of its domains, and this requirement cannot be satisfied without a distributed approach. For example, a driver's support system in a car, which we call the Driver's Secretary System (DSS), should manage several task domains, such as the air conditioner, the car radio, the navigation system, and information services, at the same time. It is quite difficult to develop a uniform knowledge base that can manage all the information for these domains, and it is even more difficult to keep such a knowledge base extensible.

Our objective is to design a basic architecture for constructing conversational systems for multiple task domains. Under a well-defined and extensible architecture, a conversational system can be developed in a distributed and separate manner. This paper proposes an architecture for spoken dialog systems, designed mainly for developing the Driver's Secretary System. The key concept of the architecture is the distribution and integration of data fragments. We regard the input and output of the conversational system as streams of input and output fragments, respectively. These fragments are distributed and integrated through a hierarchy of domain managers and work modules. A work module is a simple conversational system for a specific task domain: it interprets input fragments and responds with output fragments. A domain manager is connected to several sub-domain managers and work modules, and coordinates the distribution and integration of the input and output fragments. The hierarchical structure of the system is similar to the contract net protocol (Smith, 1980) for multi-agent systems, but there are notable differences in the distribution and integration mechanisms.

In the following sections, we first explain the requirements for a multi-domain system (Section 2). Section 3 describes our D & I architecture for multi-domain conversational systems, and Section 4 presents related work.
[Figure 1: Basic design of the architecture. The speech recognition engine sends input fragments to the Master Manager, whose Distributor forwards them to the Car Radio Control and Mail Tool work modules; their output fragments are integrated and passed to the speech synthesizer.]
2 Multi-Domain System
A multi-domain spoken dialog system should have the following features.

1. Extensibility: The system should be extensible to several domains, and a new domain should be easy to add.

2. Scalability: Even if the system is extended to many domains, it must work at a reasonable speed.

3. Usability: When users want to use the system in a specific domain, they can use it as a single-domain system, so they do not need to understand the architecture of the multi-domain system.

Additionally, the system should be developed in a compositional way: a multi-domain system should be composed from several single-domain systems and multi-domain systems. Without this feature, the extensibility of the system is hard to satisfy.
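The compositional requirement can be illustrated by giving single-domain work modules and multi-domain managers one common interface, so that a manager cannot tell whether a child is a single module or a whole sub-system. The sketch below is our own minimal illustration, not part of the paper's system; the names `DialogComponent`, `WorkModule`, `DomainManager`, and `handle` are hypothetical.

```python
# Hypothetical sketch of the compositionality requirement: work modules
# and domain managers share one interface, so managers treat either kind
# of child uniformly and new domains can be plugged in later.
from abc import ABC, abstractmethod

class DialogComponent(ABC):
    """Common interface for work modules and domain managers."""

    @abstractmethod
    def handle(self, phrase: str) -> str:
        """Interpret an input phrase and return a response utterance."""

class WorkModule(DialogComponent):
    """A single-domain system: maps known phrases to responses."""

    def __init__(self, name, responses):
        self.name = name
        self.responses = responses  # phrase -> canned response

    def handle(self, phrase):
        return self.responses.get(phrase, "")

class DomainManager(DialogComponent):
    """A multi-domain system composed of sub-managers and work modules."""

    def __init__(self, children):
        self.children = list(children)

    def add(self, child):
        # Extensibility: a new domain is added without touching others.
        self.children.append(child)

    def handle(self, phrase):
        # Return the first non-empty child response (Section 3 refines
        # this selection using relevance and confidence values).
        for child in self.children:
            answer = child.handle(phrase)
            if answer:
                return answer
        return ""

radio = WorkModule("CarRadio", {"turn on the radio": "radio on"})
mail = WorkModule("MailTool", {"read my mail": "you have 2 mails"})
system = DomainManager([radio])
system.add(mail)  # extending the running system with a new domain
print(system.handle("read my mail"))
```

Because `DomainManager` itself implements the same interface, a whole multi-domain system can in turn be plugged into a larger one unchanged, which is exactly the compositionality asked for above.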
3 D & I Architecture
This section describes our novel architecture for multi-domain conversational systems. The key concept of the architecture is the distribution and integration of the input and output fragments. The underlying idea is the difficulty of understanding spoken language without considering the task domains, so we choose to distribute the whole input to each domain-specific work module. In our architecture, several domain managers and work modules are composed hierarchically.
[Figure 2: Internal design of a domain manager, containing the Distributor, the Integrator & Selector, the Dialog Controller, the Dialog Context, and the knowledge DB of sub-managers.]

The work module is a conversational system for a specific domain. The domain manager has knowledge about its sub-domain managers and sub-work modules. A simple example of a system based on the architecture is shown in Figure 1. In this figure, there is only one domain manager (the Master Manager), which controls the whole dialog. CarRadio Control and MailTool are work modules for controlling the car radio and managing e-mail, respectively.

In the following, we explain the flow of fragments. First, user input is recognized by the speech recognition engine, which outputs the input fragments as a stream. An input fragment contains information about the input speech (examples are shown in Figure 3), such as the recognized word or phrase, the recognition probability, and the relevance to the current context. The Master Manager first decides the relevance of an input fragment to each work module using domain-specific knowledge; in this simple example, the vocabularies of the work modules are enough to decide the relevance. Figure 2 describes the internal design of the manager. The Distributor then distributes the input fragments to both CarRadio Control and MailTool together with the relevance. For the input utterance "Search mail from Kazu", the top two fragments shown in Figure 3 are distributed to each module; the relevance of each fragment is calculated from the vocabulary knowledge.

The output fragments contain the utterance phrase, the confidence in the task, and the relevance. Since CarRadio Control cannot understand the words "mail", "from", and "Kazu", the module returns an output fragment with no relevance (0.0) and full confidence (1.0). MailTool can understand the fragment, so it returns some relevance and confidence.
// Input fragment to CarRadio Control
Input:   { ID: 34054, phrase: "search mail from Kazu",
           probability: 0.75, relevance: 0.2 }

// Input fragment to MailTool
Input:   { ID: 34054, phrase: "search mail from Kazu",
           probability: 0.75, relevance: 0.8 }

// Output fragment from CarRadio Control
Output:  { ID: 34054, module ID: CarRadio Control,
           utterance: (null), relevance: 0.0, confidence: 1.0 }

// Output fragment from MailTool
Output:  { ID: 34054, module ID: MailTool,
           utterance: "no mail from Mr. Kazu",
           relevance: 0.8, confidence: 0.60 }

// Control fragment to MailTool
Control: { ID: 34054, selected: true }
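One distribution-and-integration cycle over fragments like the ones above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's implementation: the relevance calculation (a crude vocabulary-overlap score), the function names, and the selection rule (maximizing relevance times confidence) are our own assumptions, so the computed relevance values differ from the hand-assigned 0.2/0.8 above.

```python
# Sketch of one D & I cycle: the Distributor scores an input fragment
# against each work module's vocabulary, the modules answer with output
# fragments, and the Integrator selects the best one and emits a
# control fragment for the selected module.

def relevance(phrase, vocabulary):
    # Fraction of the words in the phrase covered by the vocabulary
    # (an assumed scoring rule; the paper does not specify one).
    words = phrase.lower().split()
    return sum(w in vocabulary for w in words) / len(words)

def distribute(fragment, modules):
    # Attach a per-module relevance to a copy of the input fragment.
    return {
        name: dict(fragment, relevance=relevance(fragment["phrase"], vocab))
        for name, vocab in modules.items()
    }

def integrate(outputs):
    # Select the output fragment with the best relevance * confidence,
    # and build the control fragment acknowledging the selection.
    best = max(outputs, key=lambda o: o["relevance"] * o["confidence"])
    control = {"ID": best["ID"], "selected": True,
               "module ID": best["module ID"]}
    return best, control

modules = {
    "CarRadio Control": {"radio", "volume", "tune"},
    "MailTool": {"search", "mail", "from", "kazu"},
}
fragment = {"ID": 34054, "phrase": "search mail from Kazu",
            "probability": 0.75}
inputs = distribute(fragment, modules)

# Output fragments as each module might return them.
outputs = [
    {"ID": 34054, "module ID": "CarRadio Control", "utterance": None,
     "relevance": 0.0, "confidence": 1.0},
    {"ID": 34054, "module ID": "MailTool",
     "utterance": "no mail from Mr. Kazu",
     "relevance": 0.8, "confidence": 0.60},
]
best, control = integrate(outputs)
print(best["module ID"], control)
```

Note the asymmetry the example preserves: CarRadio Control answers with full confidence that the utterance is irrelevant to it, while MailTool answers with a concrete utterance, moderate confidence, and high relevance, so the Integrator picks MailTool.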
[Figure 3: Examples of input and output fragments.]

The Integrator of the Master Manager then integrates these outputs, forwards MailTool's output fragment to the speech synthesizer above, and sends a control message to MailTool about the selection (the last fragment in Figure 3). The responses of the sub-systems are integrated and selected by the Integrator, considering the confidence and the dialog context. The input fragments do not have to contain whole sentential information; they may be a single word or a part of a phrase. When word-by-word input fragments are used, the system can perform incremental interpretation (Inagaki and Matsubara, 1995; Matsubara et al., 1997). In the next subsection, we explain our architecture in a more complex configuration.

3.1 Design of the Driver's Secretary System
Figure 4 shows the global structure of the Driver's Secretary System (DSS). There are some notable features in this design. The domain managers are now hierarchically composed in several layers, but they are not restricted to forming a tree structure. For example, the Information Manager and the Reservation Manager share the work module Web Tool, and the Flight Manager and the Restaurant Manager are each owned by two domain managers. These managers should manage a dialog between two speakers: one is the driver, and the other is a clerk talking over the car phone. Managing such a simultaneous dialog is also one of the interesting problems.

There is another interesting feature. When the user wants to drive somewhere, the Drive Support Manager manages the dialog to control the navigation system. When the manager then tries to reserve a parking lot near the desired destination, the Parking Manager is notified to take control of the dialog, and by the output fragment of the Parking Manager, the Reservation Manager manages the following dialog. This kind of coordination can work among several managers using the control fragments. The design of the DSS exemplifies the extensibility of the architecture: while keeping the uniform architecture, we can extend the system to other domains by means of the stream of data fragments.
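The hand-over just described (Drive Support Manager to Parking Manager to Reservation Manager) can be sketched as a chain of managers that pass control along via control fragments. The class and field names below are hypothetical; the paper specifies the control fragment only as far as its "selected" flag.

```python
# Sketch, under assumed names, of dialog hand-over between managers:
# when a manager's task produces a follow-up task, it emits a control
# fragment delegating the subsequent dialog to a sibling manager.

class Manager:
    def __init__(self, name, follow_up=None):
        self.name = name
        self.follow_up = follow_up  # manager that continues the dialog

    def handle(self, utterance, log):
        log.append(self.name)  # record which manager took the dialog
        if self.follow_up is not None:
            # Control fragment: delegate the rest of the dialog.
            control = {"selected": True, "delegate to": self.follow_up.name}
            self.follow_up.handle(utterance, log)
            return control
        return {"selected": True}

reservation = Manager("Reservation Manager")
parking = Manager("Parking Manager", follow_up=reservation)
drive_support = Manager("Drive Support Manager", follow_up=parking)

log = []
drive_support.handle("drive to the concert hall", log)
print(log)  # order in which the managers take over the dialog
```

The point of the sketch is that no manager needs global knowledge: each one only knows the sibling it delegates to, which matches the paper's claim that coordination can emerge from control fragments alone.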
4 Related Work
GALAXY-II (Seneff et al., 1998) is an architecture for information-service conversational systems designed at MIT. It employs the HUB architecture for the integration of multiple domain servers, which can be regarded as a simple version of the blackboard architecture (Cohen et al., 1994). The blackboard architecture has a difficulty with scalability: if there are too many domain servers, it cannot work well because of network congestion. Our architecture works well thanks to the hierarchical nature of its modules. The contract net protocol (Smith, 1980) is a concept similar to our architecture, but in our architecture the Distributor does not divide the input fragments: all fragments are distributed to the sub-managers together with their relevance values (when the relevance is considerably low, the fragment is not distributed). The Integrator also does not simply combine the results.
5 Conclusion
This paper proposes an extensible and scalable architecture for multi-domain, multi-modal conversational systems. The novel features of our architecture are as follows.

1. Each work module does not have to consider the other modules and managers, so work modules can be developed separately and easily composed into a multi-domain system.

2. The domain managers that distribute and integrate the fragments only require knowledge about their own sub-managers and work modules. This feature makes the whole system compositional.

3. Because the input and output fragments have a uniform structure, managers and modules can be connected in any way. Therefore, the configuration of the system is flexible and extensible.

4. Each work module and domain manager runs concurrently, so a system based on the architecture is scalable.

We expect that the usability of each subsystem is not lost, thanks to the intelligence of the domain managers.
[Figure 4 depicts the speech recognition engine and speech synthesizer (with echo cancellation against the car audio), a sensor monitor for the steering sensor and speed meter, the Master Manager, domain managers such as In-Car Device Control, Information Manager, Drive Support Manager, Reservation Manager, Parking Manager, Flight Manager, Restaurant Manager, Internet Manager, and Phone Manager, and work modules such as Car Radio Control, Car CD Control, Air Conditioner Control, Navigation System Control, Mail Tool, and Web Tool.]
Figure 4: Canonical configuration of the Driver's Secretary System

Evaluation of the usability should be carried out experimentally as future work. We are currently implementing the Driver's Secretary System based on this architecture.
References
H. Ando, Y. Kitahara, and N. Hataoka. 1994. Evaluation of multimodal interface using spoken language and pointing gesture on interior design system. In Proc. of 4th International Conference on Spoken Language Processing, pages 567-570.

R. A. Bolt. 1980. Put-that-there: Voice and gesture at the graphics interface. ACM Computer Graphics, 14(3):262-270.

P. Cohen, A. Cheyer, M. Wang, and S. C. Baeg. 1994. An open agent architecture. In Proc. AAAI Spring Symposium, pages 1-8.

D. Goddeau, E. Brill, J. Glass, C. Pao, M. Phillips, J. Polifroni, S. Seneff, and V. Zue. 1994. Galaxy: A human language interface to on-line travel information. In Proc. of 4th International Conference on Spoken Language Processing, pages 707-710.

Y. Inagaki and S. Matsubara. 1995. Models for incremental interpretation of natural language. In Proc. of the 2nd Symposium on Natural Language Processing, pages 51-60.

S. Matsubara, H. Yamamoto, N. Kawaguchi, Y. Inagaki, and K. Toyama. 1997. An interactive multimodal drawing system based on incremental interpretation. In IJCAI-97 Workshop: Intelligent Multimodal Systems, pages 55-62.

S. Seneff, E. Hurley, R. Lau, C. Pao, P. Schmid, and V. Zue. 1998. Galaxy-II: A reference architecture for conversational system development. In Proc. of 6th International Conference on Spoken Language Processing.

R. G. Smith. 1980. The contract net protocol: High-level communication and control in a distributed problem solver. IEEE Transactions on Computers, 29(12):1104-1113.

Y. Takebayashi, H. Tsuboi, Y. Sadamoto, H. Hashimoto, and H. Shinchi. 1992. A real-time speech dialogue system using spontaneous speech understanding. In Proc. of 3rd ICSLP, pages 651-654.

V. Zue. 1997. Conversational interfaces: Advances and challenges. In Proc. EUROSPEECH'97, pages 9-18.