Dialog Strategies in a tourist information spoken dialog system - Limsi

Dialog Strategies in a tourist information spoken dialog system H. Bonneau-Maynard, L. Devillers LIMSI-CNRS, BP 133 91403 Orsay cedex, FRANCE fhbm, [email protected] ABSTRACT Dialog management is a core component of spoken language systems. Its role is to guide the dialog along a productive path, making the interaction with the user as natural and efficient as possible. We describe in this paper the dialog control strategy being developed for the tourist information retrieval system, PARIS-SITI, in the context of ARC AUPELF-B2 action on spoken dialog systems evaluation. In a tourist information task, the essential goal of the dialog is to help the user in obtaining selective information from a database covering a wide range of topics. The PARIS-SITI dialog manager uses a hierarchic representation of the domain to develop its response strategy. The dialog control strategy has been evaluated by testing the efficiency of two systems : with and without automatic mechanisms for guiding the user via suggestive prompts with both novice and expererienced subjects on a set of 4 scenarios performed by each speaker with both systems. Qualitative and quantitative criteria such as user satisfaction, understanding and recognition error rate, percentage of followed prompts and scenario success are discussed showing the interest of the proposed strategy.

1. INTRODUCTION A large number of spoken dialog systems have been developed making use of different strategies for dialog management, such as system-initiative or mixed-initiative and explicit or implicit confirmation. Recent progress have been made by including domain specific information in the dialog strategy. Authors in [5] explicitly proposes a departure from the model-based approach in favor of an information based approach, assuming that the information available at any given point in the dialogue, comprising results from the database requests, determines the actions to undertaken by the dialog system. The same idea has been explored in [7] where it is assumed that improving speech acts based models with domain knowledge makes it possible to extend the dialog applicability. In this paper, we describe the dialog management and the response generation strategies being developed in a tourist information retrieval system. Tourist information is the common domain for the ARC AUPELF-B2 [1] action, the aim of which is to evaluate spoken language systems for the French language. Mixed-initiative strategy is often used in advice-giving domains such as tourist information. However users may need to learn The experiments were partially supported by the ARC AUPELF-B2 action.

what the system can understand, if the system does not explicitly prompt them using the valid vocabulary. In [3] an experiment has been carried comparing mixed-initiative and system-initiative spoken dialogs in the context of repeated use by a single user which showed that mixed-initiative never surpasses the systeme initiative strategy’s accuracy. To avoid the mixed-initiative lack of guiding, the PARIS-SITI dialog strategy, which is actually mixed-initiative in the sense that the speaker is not the obliged to follow the prompts, has been improved by adding automatic mechanisms for guiding the user via suggestive prompts. After describing the hierarchic representation of the domain and the dialog stragegy based upon it, we present an experiment evaluating the efficiency of these mechanisms with 32 subjects, each speaker using two versions of the system: with an without guiding prompts.

2. SYSTEM The tourist information system is built upon the MASK [2] and RAILTEL/ARISE [8] rail travel information dialog systems developed at LIMSI. PARIS-SITI (Syste`me d’Informations Touristiques Interactif) is a french language information retrieval system, that allows users to obtain information (such as prices, payment procedures, opening hours, address, trip, descriptions and services offered), for a variety of objects (hotels, restaurants, cinemas, department stores, museums and monuments) in Paris. In this study, we focus on restaurants and hotels located in the district of Saint-Lazare station in Paris . The database contains information on approximately a hundred different objects. The system is composed of a 2000-word speaker-independent continuous speech recognizer, and components for natural langage understanding, dialog management and response generation. The speech recognizer uses acoustic models from MASK and a bigram language model estimated on the transcriptions of 5200 utterances recorded previously. The speech recognizer transforms the input signal into the most probable sequence of words and then forwards it to the natural langage understanding component which carries out a caseframe analysis and generates a semantic frame representation. If enough information is present in the semantic frame the dialog manager generates a database query. The retrieved information is formatted into a natural langage response by the response generator from a generation frame (taking into account the dialog context) and vocal feedback is provided to the user as well as a visual display of the different objects already selected. In order to have flexibility in testing different response strategies we used the LIMSI text-to-speech synthetizer.

SPECOM’98 Saint Petersburgh, Bonneau-Devillers

115

3. DIALOG STRATEGIES The spoken language system uses a mixed-initiated dialog to handle interactions between the user and the system. Most of the time the dialog manager makes a decision and generates a response after taking into account the user’s query in the context of the dialog history. However, often the user’s response needs clarification because either too little or much information (which could be ambiguous or incorrectly recognized) has been supplied. A dialog strategy armed of specifically guiding can improve the system robustness. As in most spoken dialog systems, prompt generation is used when information required for database access is missing. We use also this mechanism to clarify inconsistencies in the dialog. Figure 1 shows an example of a dialog extract dealing with an inconsistency (a restaurant name preceded with the word hotel) due to a recognition error. U: hello, I’m in the St Lazare station and I’m looking for a hotel near the Galeries Lafayette Recognized : hello, I’m in St Lazare station and I’m looking for a hotel Mahatma the Galeries Lafayette S: You are looking for a hotel or a restaurant ? U: It’s a hotel that I’m looking for S: Here are the hotel addresses... Figure 1: The dialog manager detects an inconsistance (Mahatma, a restaurant name, has been recognized instead of “a` coˆte´ de ”, but preceded by the word hotel). The system asks the user for a clarification

In order to better control the dialog system and to keep the user within the boundaries of the domain, we are exploring response generation strategies which encourage the user to implicitly follow a given dialog plan. In doing so the dialog manager tries to predict the user’s intention while nonetheless leaving open the possibility for the user to completely change the direction of the dialog. Our generation strategy differs from other approaches in that clarification dialogs are determined by the domain which is hierarchically represented along with the generation and dialog histories.

3.1. Domain modeling The hierarchic representation of the domain is derived from the database, where each database object contains attributes. We distinguish 4 classes of attributes : -Location attributes: address, metro, arrondissement, subway trip and duration from Saint-Lazare station. The proximity of objects is given with respect to the closest metro station, which is the reference commonly used by Parisians. - Hour attributes: hour of opening and closing, days of closing. -Price attributes: category, prices and payment possibilities for hotel objects, and prices of the different menus, and payment possibilities for restaurant. - Description attributes: description, services and telephone number for hotel objects and nationality, speciality, and telephone for restaurants object. Figure 2 shows how these attributes are hierarchically represented for the restaurant object. Each level in the hierarchy can be associated with a precision degree of the user query. 4 precision levels are distinguished: object-level, class-level,

information-level, constraint-level. The generation frame integrating the user’s query contains the identification of the query as well as its level in this hierarchy. For example for the query “I like seafood and I’m looking for a restaurant” the generation frame used by the dialog manager contains the following attributes : object-level = restaurant class-level = description, location information-level = speciality, address C-speciality = seafood The dialog manager maintains a history frame which keeps track of all the types of information and their level asked for by the user since the beginning of the dialog.

3.2. Generation strategies We view the aim of a tourist information retrieval system as providing the user with information which will allow him to take a decision. From this point of view the dialog strategy should serve not only keep the user within the boundaries of the system, but also help him to discover the different possibilities of the system and the contents of the database. The dialog strategy is more open than typically needed for simple train timetable retrieval in that the user does not necessarely have an a priori idea of what the system is able to provide. The generation strategy is organized in 3 steps depending on which levels are contained in the generation frame. Note that the user may at any time of the dialog query the system for one (or more) of the 4 classes of attributes listed in previous section. At each user’s query, a list of the selected objects with their attributes is given. Examples are given in Figure 3 4. - Step 0: if the generation frame contains only the object-level with no lower-hierarchy attribute, e.g. for the utterance “Hello I’ m looking for a restaurant”, the dialog manager proposes the user the class-levels for selection. - Step 1: When the generation frame contains only one class-level, the objective of the dialog manager is to encourage the user to provide attributes of the class. Figure 3 shows a dialog illustrating this step. In this dialog, the user begins by asking for a seafood restaurant (U1). The analysis of this query gives the generation frame shown above . According to the object hierarchy the dialog manager will encourage the speaker to provide additional information in the same class : the food style (S1). The user gives a C-style (French) (U2). The systems prompts the user to ask for the last description attribute, which is the telephone number(S2) . User agrees (U3). All attributes at information or constraint level have then been resquested by the user. If the generation frame contains an informationlevel with no constraint-level (which is not the case in the previous example), the system suggest the user to be more precise (going to the constraint-level). For example, if the user asks for the categories of the listed hotels, the next prompt will suggest that he chose a hotel category while showing him the different possibilites in the database “Which category of hotel do you prefer 1, 2, 3 or 4 stars ?’’ - Step 2: when the speaker has asked for all of the information in the same class, or when the generation frame contains more than one class-level, the system leads the user to ask for information in other classes. For example, at the end of the dialog in Figure 3, the dialog manager prompts the user (S3) to ask for


116

restaurant location I-arr.

I-metro

I-add.

C-arr.

C-metro

C-add.

time I-trip

I-dur.

I-hour

C-dur.

C-hour

prices I-day

desc.

I-prices

I-pay.

I-style

I-spec.

C-prices

C-pay.

C-style

C-spec.

I-pho.

Figure 2: Hierarchical domain representation used by the dialog manager for restaurant object. The first level corresponds to the object level and the second, classes level.

U1: I like seafood and I’m looking for a restaurant S1: Here is the list of seafood restaurants, do you have any preference about the food style. U2: French S2: here is the name of a French seafood restaurant, if you want more information you can ask for its phone number U3: give me the phone number S3: Here is the phone number of a french seafood restaurant, do you want its prices ? its location ? Do you have any time constraint ? U4: I want the prices of this restaurant S4: here are the prices of a french seafood restaurant

U: please I want a hotel S: Here is the hotel list, do you have a location preference ? Do you want prices or descriptions of these hotels ? U: I’m looking for a hotel near the Galeries Lafayette P-1 (location)

Figure 3: the system suggests the user to ask for every information about description, before suggesting him to ask for information in the classes that have not yet been used in the dialog

S: Here is the trip and its duration from Saint-Lazare station for the hotels near the Galeries-Lafayette. You can ask for prices, categories, services or descriptions of these hotels

the other class-level attributes (prices, hour, and location), according to the restaurant hierarchy, and because all attributes of description class has been requested. At each step, the user has always the possibility of not following the system suggestion. In this case, the dialog manager changes its plan according to the generation frame. For example in step 1, if the user asks for a different class-level, the generation frame will contain two class-level. The dialog manager will no longer insists with possibilities within the first class. The strategy described above has been improved after observing dialogs from experienced users interacting with the system. For the location class, we observed that if the user has an idea of a location, (which can be arrondissement, metro and address), it is rarely expressed in terms of multiple constraints. Therefor the system will not prompt the system to specify more than one information in this class.

- When all usefull location constraints are given by the user, the system suggests that he ask for information in the other classes

- the user followed the system suggestion by adding a location constraint

S: Here is the list of the hotels near the Galeries Lafayette. You can have more information about subway trip and duration from the Saint-Lazare station. U: How much time does it take to go to these hotels ? P-1 (location) - the user followed the system suggestion by adding a new location constraint

U: I’ll take the Adria hotel P-0 - the user did not follow the system suggestion

S: Here is hotel Adria’s address, you can ask for prices, categories, services or a description of this hotel U: I want prices please P-1(prices) - the user followed the system suggestion and asked for prices

S: Here are hotel Adria’s prices, you can ask for its services or a description - location and prices have been visited, suggestion for the 2 remaining classes

U: The services please P-1(services) S: Here are hotel Adria’s services. Figure 4: Dialog example with system suggestions. S: system prompt, U: userquery. Prompt labelling P-0: user did not follow suggestion and P-1: user did follow suggestion. Comments are given in italics.

4. DATA COLLECTION To independently test the effects of the guiding mechanisms of the dialog systems, we used 2 versions of the dialog system. In the first version (SP) the system uses the complete dialog strategy described behind. In the second version (S) all the extra guiding prompts depending on this strategy have been removed : the remaining prompts only contain then the direct response of the user query. Experiments were carried out with 32 subjects interacting with the system: 16 experienced users (PhD students or researchers in the speech and natural language area but not necessarily familiar with this dialog system) and 16 novice users. Each speaker carried out the four scenarios listed in Figure 5 with both systems. Half of the subjects used first the system with suggestive Prompts (SP) and the other half used first the system without suggestions (S). Each scenario sets only one type

of constraint (location for scenario A, description for B, hour for C and speciality for D). The scenario were meant to serve as an idea for the user, asking him to select only one object with this constraint after asking the system for information that may help to make a choice.

5. RESULTS ANALYSIS In order to evaluate the impact of the dialog strategy we labelled 4 scenarios from each speaker (i.e the scenarios with the first system, SP or S, that he used). Each utterance of each dialog was labelled in terms of understanding success. When the understanding is complete (all semantic slots are correctly instantiated), the system response is labelled (C-1). If an error occurs, we distinguish the case of a recognition error (CR-0) or under-


117

A- find a hotel not far from the Galeries Lafayette B- you are looking for a luxi hotel C- you are looking for a restaurant open late D- you wish to eat seafood

Users Novice Experienced

1st SP= 7.8 S= 6.0 SP= 7.4 S= 5.6

2nd S= 7.7 SP= 7.2 S= 6.9 SP= 6.6

Figure 5: Prototype scenarios used in the experiment. Table 2: Experienced and novice users satisfaction - average of the global marks given at the end of the session

Recognition error rate (C-0) rate (CR-0) rate Success prompt rate Mean user turns Mean information nbr Task error rate

S E N 24.9 25.0 9.4 10.4 19.3 21.2 6.6 7.8 6.7 6.9 17.2

SP E 28.4 4.3 25.0 51.0 6.5 6.8

N 26.9 5.6 24.5 66.3 8.3 7.6 4.6

Table 1: Recognition error rate corresponds to exact string match with literal transcription. Understanding error rates, average number of utterances, mean of different information and success prompts rate (#P-1/(#P-1+#P-0)). The total number of labelled dialogs is 128, with 935 total user utterances. (E means experienced speakers,S means novice speakers)

standing component error (C-0). Out of the domain utterances were labelled (O-0). For the SP system, we also labelled as a failure (P-0), as shown in Figure 4, the utterances of the speaker which did not follow the preceding prompt and as a success (P-1) if the user asked for at least one of the suggested items. Main results are summarized in Table 1. The global understanding error rate (CR-0 + C-0) is approximately the same for both systems (about 29% for experienced speakers and 30.5% for novice speakers). When the recognition is correct the understanding rate of the SP system is better than the S system (see line C-0 in Table 1 ). In order to compare the efficiency of SP and S systems, we labelled every dialog in terms of different information obtained by the user. The results, in terms of the average number of different information items per scenario are shown in Table 1. There is no difference for experienced users with both systems which is not surprising. These users follow only half of the prompts, with a prompt success rate of 51%. The results for novice speakers are quite encouraging. Novice users have a tendency to follow the prompts (66.3%) resulting in an increase in the mean number of different constraints obtained (6.9 with S system and 7.6 with SP system). At the end of the recording session, each user was asked to evaluate both systems (by giving a satisfaction mark in the range 0 to 10) and to describe differences between them see Table 2. Satisfaction marks for all the users after using first system, showed that the SP system is prefered. Users who first interacted with the prompting system (SP), generally did not notice any difference between the systems. Novice users gave the same mark for both systems whereas experienced users prefered the SP system. All users who first interacted with the system without guiding prompts, noticed the difference between both systems and clearly prefered SP. Very encouraging remarks on the interest of the prompts were given (“the second system (SP) is more helpful to reach the goal by proposing different information, with the first system (S) I had no idea of the system limits”). The last comparison between both systems may be done in term of scenario success. A scenario was a success for the speaker

if the final object selected is one of the possible objects given the scenario constraint. These contraints authorized to select a subset of 3 hotels for the scenario A, 5 for the B, 9 restaurants for scenario C and 4 for the D. The difference between systems is significant : with the SP system, the error rate is only 4.6% for the speakers who interacted with that system first (from a total of 64 tests), compared to of 17.2% for the speakers who interacted with the S system first. This means that the prompting system help the speaker to reach his goal.

6. CONCLUSION We have presented the domain specific hierarchy and the dialog strategy developed upon it for PARIS-SITI, a tourist information retrieval spoken dialog system. We propose a mixed-initiative dialog strategy improved with suggestive prompts which encourage the user to implicitly follow a dialog plan. The strategy has been evaluated with 32 subjects interacting with two versions of the system : with and without guiding prompts. The users assessment shows that they clearly prefer to be guided by the prompts, even for experienced speakers who only follow half prompts. With about the same recognition and understanding performances, the prompting system helps the subjects to reach their goals. Furthermore, we are studying the labelled dialogs in the aim of being able to automatically adapt the dialog strategy to the given user.

REFERENCES [1] J. Mariani “The Aupelf-Uref Evaluation-Based Language Engineering Action and Related Projects”, LREC Conference, Granada, May 1998. [2] J.L. Gauvain, S. Bennacef, L. Devillers, L. Lamel, R. Rosset: “Spoken Language component of the MASK Kiosk” in K. Varghese, S. Pfleger(Eds.) “Human Comfort and security of information systems”. Springer-Verlag, 1997. [3] M. Walker, D. Hindle, J. Fromer, G. Di Fabbrizio, C. Mestel, “Evaluating competing agent strategies for a voice Email agent”, ESCA Eurospeech Rhodes, 1997. [4] M. Walker, D. Litman, C. Kamm, and A. Abella, “Paradise: a general framework for evaluating spoken dialog agents”. In Proc. of the ACL/EACL 1997. [5] M. Denecke, A. Waibel, “Dialogue strategies guiding users to their communicative goals”, ESCA Eurospeech Rhodes, 1997. [6] J. Alexandersson, N. Reithinger, “Learning dialogue structure form a corpus”, ESCA Eurospeech Rhodes, 1997. [7] N. Dahlba¨ck, A. Jonsson, “Integrating domain specific focusing in dialogue models”, ESCA Eurospeech Rhodes, 1997. [8] S. Bennacef, L. Devillers, S. Rosset, L. Lamel, “Dialog in the RAILTEL telephone-based system”, ICSLP-96 and ISSD-96, October 1996.


118