Construction and Experiment of a Spoken Consulting Dialogue System

Teruhisa Misu, Chiori Hori, Kiyonori Ohtake, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura
MASTAR Project, NICT, Kyoto, Japan
http://mastar.jp/index-e.html

Abstract. This paper addresses a spoken dialogue framework that helps users make decisions. Various decision criteria are involved when a user selects one alternative from a given set. When a spoken dialogue interface is adopted, users typically have little idea of the kinds of criteria the system can handle. We therefore introduce a recommendation function that proactively presents information the user is likely to be interested in. We implemented a sightseeing guidance system with this recommendation function and conducted a user experiment. We provide an initial analysis of the framework in terms of the relationship between system prompts and user behavior, and between user behavior and user knowledge.

1 Introduction

Over the years, a great number of spoken dialogue systems have been developed. Their typical task domains include airline information (ATIS & DARPA Communicator) [1] and railway information (MASK) [2]. Dialogue systems are, in most cases, used for database (DB) retrieval and transaction processing, and dialogue strategies are optimized so as to minimize the cost of information access. However, in many situations where spoken dialogue interfaces are installed, information access by the user is not a goal in itself, but a means for making a decision [3]. For example, when using a restaurant retrieval system, the user's goal may not be the extraction of price information but a decision based on the retrieved information about candidate restaurants. Only a few studies have addressed spoken dialogue systems that help users make decisions. In this paper, we present our model of a consulting dialogue system with a speech interface. We implement the model as a sightseeing guidance system for Kyoto city and report a preliminary analysis of how users engage with the system.

2 Dialogue Model for Consulting

A sightseeing guidance system of the type we are constructing can be regarded as a kind of decision support system: the user selects an alternative from a given set of alternatives based on some criteria.


Fig. 1. Hierarchy structure for sightseeing guidance dialogue: the goal (choose the best spot) is evaluated through criteria such as cherry blossoms, Japanese garden, and easy access, which connect to alternatives such as Kinkakuji temple, Ryoanji temple, and Nanzenji temple.

There have been many studies of decision support systems in the operations research field, and a typical method is the Analytic Hierarchy Process (AHP) [4]. In the AHP, the problem is modeled as a hierarchy that consists of the decision goal, the alternatives for reaching it, and the criteria for evaluating these alternatives. In the case of our sightseeing guidance system, the goal is to decide on an optimal spot that agrees with the user's preferences. The alternatives are all sightseeing spots that the system can propose and explain. As criteria, we adopt the determinants defined in our tagging scheme for the Kyoto sightseeing guidance dialogue corpus [5]. The determinants cover various factors used to plan sightseeing activities, such as "cherry blossoms", "Japanese garden", etc. An example hierarchy using these criteria is shown in Fig. 1.

With such a hierarchy, the problem of deciding on the optimal alternative reduces to estimating weights for the criteria. In the AHP, these weights are typically derived from pairwise comparisons between criteria [4]. However, this methodology cannot be applied directly to spoken dialogue systems. The system knowledge is usually not fully observable to users at the beginning of a dialogue and is revealed only through interaction with the system. In addition, spoken dialogue systems usually handle quite a few candidates and criteria, which makes pairwise comparison costly. Although several studies have dealt with decision making through a spoken dialogue interface [3], they assume that users already know all the criteria that they and the system can use for making a decision. In this work, we assume a situation where users are unaware not only of what kind of information the system can provide, but also of their own preferences, that is, the factors they should emphasize. We therefore consider a spoken dialogue system that provides users with information via system-initiative recommendations. We assume that the number of alternatives is relatively small and that all alternatives are known to the users. This is common in real-world situations, for example, when a user selects one restaurant from a list of candidates presented by a car navigation system.
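The weight-derivation step is not part of our system, but the standard AHP computation referred to above can be sketched briefly: given a reciprocal pairwise comparison matrix over the criteria, the weights are taken as the normalized principal eigenvector. The criteria and comparison values in the following sketch are illustrative assumptions, not data from this study.

```python
import numpy as np

# Hypothetical pairwise comparisons over three criteria (Saaty's 1-9 scale):
# A[i, j] states how much more important criterion i is than criterion j.
criteria = ["cherry blossoms", "Japanese garden", "easy access"]
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

# AHP takes the principal eigenvector of A, normalized to sum to 1, as the weights.
eigvals, eigvecs = np.linalg.eig(A)
principal = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)
weights = principal / principal.sum()

for name, w in zip(criteria, weights):
    print(f"{name}: {w:.3f}")
```

A full AHP application would also check the consistency ratio of the comparison matrix; we omit that here.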


3 Decision Support System with Spoken Dialogue Interface

3.1 System Overview

The dialogue system we constructed provides two functions: answering users' requests and making recommendations. With the answering function, the system can explain a sightseeing spot in terms of every determinant, unlike conventional systems that can only present a pre-set abstract of a given spot. With the recommendation function, the system proactively presents information about what it can explain, since novice users are unlikely to know the capabilities of the system (e.g., as part of a recommendation, the system suggests determinants that the user might be interested in). The system flow based on these strategies is summarized below. The system:

1. recognizes the user's utterance,
2. detects the spot and determinant in the user's utterance,
3. presents information based on this understanding, and
4. recommends information related to the current topic.

3.2 Knowledge Base

Our back-end DB covers 15 sightseeing spots as alternatives, each described in terms of 10 determinants. The number of alternatives is small compared with information retrieval systems, because this work focuses on the process of comparing and evaluating candidates that already meet an essential condition such as "famous temples around Kyoto station". We selected determinants that frequently appear in our dialogue corpus [5]; they are listed in Table 4. In reality, these determinants are related and dependent on one another, but in practice we assume that they are independent and have a parallel structure. Each spot is annotated with respect to every determinant: the evaluation is "1" when the determinant applies to the spot and "0" when it does not. The accompanying text is generated by retrieving appropriate reasons from the Web. An example of the DB is shown in Table 1.

Table 1. Example of the database (translation of Japanese)

Spot name        Determinant      Eval.  Text
Kiyomizu temple  Cherry blossoms  1      There are about 1,000 cherry trees in the temple grounds. Best of all, the vistas from the main temple are amazing.
                 Vista            1      The temple stage is built on the slope, and the views of the town from here are breathtaking.
                 Not crowded      0      This temple is very famous and popular, and is thus constantly crowded.
                 ...              ...    ...
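As an illustration only (no such code appears in the paper), an entry of the kind shown in Table 1 can be represented as a simple record. The field names and the lookup helper below are our own assumptions; the example values are taken from Table 1.

```python
from dataclasses import dataclass

@dataclass
class DBEntry:
    spot: str          # alternative (sightseeing spot)
    determinant: str   # decision criterion, e.g. "cherry blossoms"
    applies: int       # 1 if the determinant applies to the spot, 0 otherwise
    text: str          # explanation text retrieved from the Web

knowledge_base = [
    DBEntry("Kiyomizu temple", "cherry blossoms", 1,
            "There are about 1,000 cherry trees in the temple grounds."),
    DBEntry("Kiyomizu temple", "vista", 1,
            "The temple stage is built on the slope, and the views of the town are breathtaking."),
    DBEntry("Kiyomizu temple", "not crowded", 0,
            "This temple is very famous and popular, and is thus constantly crowded."),
]

def evaluation(spot: str, determinant: str) -> int:
    """Return the 0/1 evaluation of a determinant for a spot (0 if unlisted)."""
    return next((e.applies for e in knowledge_base
                 if e.spot == spot and e.determinant == determinant), 0)
```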

3.3 Speech Understanding and Response Generation

Our speech understanding process tries to detect sightseeing spots and determinants in the automatic speech recognition (ASR) results, using one module for spots and one for determinants. In order to facilitate flexible understanding, we adopted an example-based understanding method based on vector space models. That is, the ASR result is matched against a set of documents written about the target spots (all sourced from Wikipedia), and the spot with the highest matching score is used as the understanding result. The ASR result is also matched against a set of sample query sentences to detect determinants. In addition, if the ASR result contains only a spot or only a determinant, we concatenate contextual information about the determinant or spot currently in focus. The system then generates a response by selecting an appropriate entry from the DB and presents it through synthesized speech.
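The following sketch shows the vector-space matching idea in miniature; it is not the system's implementation. The real system matches Japanese ASR output against Wikipedia articles and sample query sentences, whereas the toy documents, whitespace tokenization, and raw term counts below are simplifying assumptions.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy stand-ins for the documents written about each target spot.
spot_docs = {
    "Kiyomizu temple": "temple stage slope vista view town cherry blossoms crowded",
    "Kinkakuji temple": "golden pavilion pond japanese garden world heritage",
    "Ryoanji temple": "rock garden zen stroll not crowded",
}

def detect_spot(asr_result: str) -> str:
    """Return the spot whose document is most similar to the ASR result."""
    query = Counter(asr_result.lower().split())
    return max(spot_docs, key=lambda s: cosine(query, Counter(spot_docs[s].split())))

print(detect_spot("please tell me about the view from the temple stage"))  # Kiyomizu temple
```

Determinant detection works analogously, with the sample query sentences taking the place of the spot documents.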

3.4 System-Initiative Recommendation

In information retrieval systems, users often have difficulty formulating queries, particularly when they are unsure of what information the system possesses. It is also important to make users aware of their potential preferences through the dialogue. We therefore designed a system-initiative recommendation that is appended after the system's answer. The content of the recommendation is determined by one of the following three methods:

1. Recommendation based on the currently focused spot. This method is used when the user's current focus is on a particular spot. The system selects three determinants whose evaluation for that spot is "1" and presents them to the user.
2. Recommendation based on the currently focused determinant. This method is used when the current focus is on a specific determinant. The system selects three spots whose evaluation for that determinant is "1".
3. Open prompt. The system does not make a recommendation and simply presents an open prompt. Once users have learned the domain and the system knowledge, this may be preferable, since users can become irritated by repetitive recommendations.

Although various recommendation methods could be used, such as the collaborative filtering often employed in recommender systems [6], in this work we use these simple methods: we do not estimate user preferences, so a preference-based method might present irrelevant information. An example dialogue with the system is shown in Table 2. Note that in the following experiment the system was set to choose among the applicable recommendation methods at random.
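A minimal sketch of how the three methods above could be realized, assuming the 0/1 evaluations of Section 3.2; the random choice among applicable methods mirrors the experimental setting, while the data, function names, and wording of the prompts are our own assumptions.

```python
import random

# Hypothetical excerpt of the DB: spot -> determinants whose evaluation is "1".
applies = {
    "Kiyomizu temple": {"cherry blossoms", "vista", "fall foliage"},
    "Ninnaji temple": {"cherry blossoms", "japanese garden", "event", "fall foliage"},
    "Ryoanji temple": {"japanese garden", "not crowded", "stroll", "world heritage"},
}

def recommend(focused_spot=None, focused_determinant=None) -> str:
    candidates = []
    if focused_spot in applies:
        # Method 1: up to three determinants that hold for the focused spot.
        dets = sorted(applies[focused_spot])[:3]
        candidates.append(f"I can also tell you about {', '.join(dets)} for {focused_spot}.")
    if focused_determinant is not None:
        # Method 2: up to three spots for which the focused determinant holds.
        spots = sorted(s for s, dets in applies.items() if focused_determinant in dets)[:3]
        if spots:
            candidates.append(f"Spots known for {focused_determinant} include {', '.join(spots)}.")
    # Method 3: open prompt (always available).
    candidates.append("Is there anything else you would like to know?")
    return random.choice(candidates)  # choose among applicable methods at random

print(recommend(focused_spot="Ninnaji temple"))
```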



Table 2. Example dialogue with the system (translation of Japanese)

S1: Hello, this is the Kyoto tourist information system. How may I help you?
U1: Please show me the cherry blossoms of Ninnaji.
S2-Ans.: The cherry blossoms of Ninnaji are specially referred to as "omuro no sakura". These cherry trees are quite unique because they are lower than the others.
S2-Rec.: Japanese garden, events, and fall foliage are some of the areas in which information is available on Ninnaji; do you want me to explain? (Method 1)
...
U11: Does it have a lovely landscape?
S12-Ans.: From the top of the famous gate, you can see Kyoto city spread out below.
S12-Rec.: I can name Kiyomizu temple, Fushimi Inari, and Kurama temple as famous spots that grant great vistas. (Method 2)
U12: Kiyomizu temple.
S13-Ans.: Kiyomizu temple is built on the side of the hill, and you are treated to a view of Kyoto city center.
S13-Rec.: Do you have any other thing you want to know? (Method 3)
U13: I have decided to go to Nanzenji temple.
S14: The item, Nanzenji temple, has been chosen. Are there any questions?
...

Table 3. Percentages of user utterance types for system prompts

                                  Opening (%)  Method 1 (%)  Method 2 (%)  Method 3 (%)
Acceptance                        -            71.5          30.0          -
Determinant                       15.8         7.0           25.0          38.6
Spot name                         50.0         16.8          16.7          24.8
Determinant and spot name         0            1.4           6.7           2.0
Others (commitment, OOS, etc.)    34.2         3.3           21.6          34.6

4 System Experiment and Analysis

4.1 User Experiment

We collected test data from 72 subjects who had not used our system before. Subjects were requested to use the system to choose one sightseeing spot out of the 15 alternatives, based on the information obtained from the system. No instructions about the system knowledge were given, except for example utterances such as "Could you please tell me the spots famous for XXX?" and "Tell me about XXX temple". We asked the subjects not to use their own knowledge and experience while reaching a decision. They were also requested to utter the phrase "I'll go to XXX" to signify commitment once they had reached a decision. Only one dialogue session was collected per subject, since a first session would very likely alter the user's level of knowledge. The average length of a dialogue before the user communicated his/her commitment was 16.3 turns, with a standard deviation of 7.0 turns.

4.2 Analysis of Collected Dialogue Sessions

We transcribed a total of 1,752 utterances and labeled their correct dialogue acts (spots and determinants) by hand.

Table 4. Analysis of user preference and knowledge

Determinant        Users who value it (%)  Users who uttered it (%)  Uttered it before system recom. (%)
Japanese garden    34.7                    47.2                      22.2
Not crowded        19.4                    41.7                      1.4
World heritage     48.6                    50.0                      2.7
Vista              48.6                    22.2                      1.4
Easy access        16.7                    19.4                      19.4
Fall foliage       37.5                    47.2                      18.1
Cherry flower      33.3                    51.4                      13.9
History            43.1                    31.9                      12.5
Stroll             45.8                    38.9                      1.4
Event              29.2                    36.1                      8.3

The percentage of user utterances that the system could handle was 89.0%, and the system responded correctly to 72.4% of them.

Analysis of user utterances. First, we analyzed the relationship between system prompts and user utterances. The percentages of user utterance types for each system prompt are shown in Table 3. "Acceptance" refers to the cases where the user accepts the recommendation: Method 1 is regarded as accepted when the user asks about one of the recommended determinants, and Method 2 when the user asks about one of the recommended spots. The tendency of user utterances varies with the recommendation type. Many users made out-of-system (OOS) queries that the system could not handle after the opening prompt and after the open prompt (Method 3), whereas presenting system knowledge through recommendations enabled many users to make in-domain queries.

Analysis of user preference and domain knowledge. We also analyzed the sessions in terms of the subjects' preferences and domain knowledge. Table 4 lists, for each determinant, the percentage of subjects who emphasize it when selecting sightseeing spots, based on questionnaires conducted after the dialogue session (multiple selections were allowed). Since subjects selected determinants from the list of all determinants, these selections can be considered their preferences given full knowledge of the system. At the beginning of a dialogue session, however, some of these preferences are only potential preferences, owing to the limited nature of the users' knowledge about the system. To analyze the users' knowledge, we measured the percentage of users who uttered each determinant before the system recommended it; the results are also shown in Table 4. Several determinants were seldom uttered before the system's recommendation, even though they were important to many users. For example, "World heritage" and "Stroll" were seldom uttered before the system's recommendation, despite the fact that around half of the users had emphasized them. These results show that some of the users' actual preferences remained merely potential before the system made its recommendation, or, at the very least, that users were not aware that the system was able to explain those determinants; it is thus important to make users notice their potential preferences through system-initiative recommendations.


4.3 Analysis of Users' Decisions

Finally, we analyzed the relationship between user preferences and the spot the user decided on, by counting the attributes on which the user's stated preferences and the chosen spot agree. The average number of agreements was 2.20, higher than the expected value under random selection (1.96). However, if the users had been aware of both their potential preferences and the system knowledge, and had then selected the optimal spot according to their preferences, the average number of agreements would have been 3.34. This result indicates that an improved recommendation strategy can help users make better choices.
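The agreement count used in this analysis can be read as a simple intersection between the user's preferred determinants and the chosen spot's attributes. The sketch below uses entirely hypothetical data and shows both the count for a chosen spot and the best attainable count, i.e. the quantity behind the 3.34 upper bound.

```python
# Hypothetical 0/1 annotations (spot -> determinants that apply) and one user's preferences.
spot_attrs = {
    "Nanzenji temple": {"japanese garden", "fall foliage", "stroll"},
    "Kinkakuji temple": {"world heritage", "japanese garden", "vista"},
    "Ryoanji temple": {"world heritage", "not crowded", "stroll"},
}
preferences = {"world heritage", "stroll", "not crowded"}

def agreements(spot: str) -> int:
    """Number of the user's preferred determinants that hold for the given spot."""
    return len(preferences & spot_attrs[spot])

chosen = "Nanzenji temple"
best = max(spot_attrs, key=agreements)
print(agreements(chosen))        # agreement count of the actually chosen spot
print(best, agreements(best))    # optimal spot under full knowledge of preferences
```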

5 Conclusion

In this paper, we addressed a spoken dialogue framework that helps users select an alternative from a list of alternatives. Through an experimental evaluation, we confirmed that user utterances are strongly affected by system recommendations, and we found that users could be helped to make better decisions by improving the dialogue strategy. We therefore plan to extend this framework to estimate users' preferences from their utterances. The system is also expected to handle more complex planning of natural language generation for recommendations, such as that discussed in [7]. We further plan to optimize the selection of responses and recommendations based on the user's preferences and state of knowledge.

References

1. Levin, E., Pieraccini, R., Eckert, W.: A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies. IEEE Trans. on Speech and Audio Processing 8, 11-23 (2000)
2. Lamel, L., Bennacef, S., Gauvain, J.L., Dartigues, H., Temem, J.N.: User Evaluation of the MASK Kiosk. Speech Communication 38(1) (2002)
3. Polifroni, J., Walker, M.: Intensional Summaries as Cooperative Responses in Dialogue: Automation and Evaluation. In: Proc. ACL/HLT, pp. 479-487 (2008)
4. Saaty, T.: The Analytic Hierarchy Process: Planning, Priority Setting, Resource Allocation. McGraw-Hill, New York (1980)
5. Ohtake, K., Misu, T., Hori, C., Kashioka, H., Nakamura, S.: Annotating Dialogue Acts to Construct Dialogue Systems for Consulting. In: Proc. 7th Workshop on Asian Language Resources, pp. 32-39 (2009)
6. Breese, J., Heckerman, D., Kadie, C.: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In: Proc. 14th Annual Conference on Uncertainty in Artificial Intelligence, pp. 43-52 (1998)
7. Rieser, V., Lemon, O.: Natural Language Generation as Planning Under Uncertainty for Spoken Dialogue Systems. In: Proc. 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2009)