EVALUATING DIALOGUE STRATEGIES IN A SPOKEN DIALOGUE SYSTEM FOR EMAIL

Fernando Farfán, Heriberto Cuayáhuitl and Alberto Portilla
Intelligent Systems Research Group
Department of Engineering and Technology
Universidad Autónoma de Tlaxcala
Apartado Postal No. 140, Apizaco, Tlaxcala, México 90300
email: {farfan, hcuayahu, aportilla}@ingenieria.uatx.mx

ABSTRACT
This paper presents an evaluation of directed dialogue (DD) and mixed initiative (MI) strategies in a spoken language system for email. We compare the DD strategy, in which the system controls the dialogue, to the MI strategy, in which users can flexibly control the dialogue. To evaluate both strategies we used the PARADISE framework, which supports comparisons among dialogue strategies. Our experimental results show that the MI strategy surpasses the DD strategy in efficiency and user satisfaction, but needs further work to match it on the qualitative measures and task success.

KEY WORDS
Dialogue strategies, spoken dialogue systems, email by phone, speech recognition, and word spotting
1 Introduction

In Mexico, automated telephone services are few, and most of them are IVR-DTMF systems based on touch-tone interfaces. These services are gradually being replaced by Spoken Language Systems (SLS), which are more sophisticated than IVR-DTMF systems, allow a broader range of input, reduce call hold time, and generate cost savings. An SLS supports human-computer interaction based on human speech, using speech technologies such as automatic speech recognition (ASR) and speech synthesis or Text-To-Speech (TTS). An SLS can interact with the user through a dialogue strategy such as Directed Dialogue (DD) or Mixed Initiative (MI). DD guides the user toward the achievement of goals, with the system always taking the initiative in the interaction, while in MI either the system or the user can take the initiative. For example, consider the two dialogue strategies exemplified in D1 and D2, where S is the system and U is the user. The dialogues were actually conducted in Spanish; we give only their English translation here.

Directed Dialogue Strategy (D1)
S: Welcome to the email system by phone!
S: Since this is your first call, I will provide you with some
instructions.
S: Your messages are presented as a sorted list, so you can say: “first”, “next”, “previous”, or “last”. If you want to erase a message, say “delete”; otherwise, if you want to hear a message, say “read”. To end the call, say “good bye”.
S: Now you will hear your list of messages. Remember that if you don't know what to say, you can say “instructions”. Let's begin!
S: Message sent by “Armando Rocha”, subject “my new cell phone number”.
U: (silence)
S: Message sent by “Alberto Morales”, subject “Company meeting”.
U: Read
S: Please hold on while I get the message.
S: The message is:
S: “Hi, I just want to let you know that the meeting will be next Friday at 5:30. Don't be late”.
S: That was the message. Please say one of the following choices: “repeat”, “reply”, “delete” or “continue”.
U: Good bye
S: Thanks for calling the email system by phone. Good bye.

Mixed Initiative Dialogue Strategy (D2)
S: Welcome to the email system by phone!
S: Since this is your first call, I will provide you with some instructions.
S: If you want to hear a message, you can say “I would like to hear the messages of Ricardo Pérez”; if you want to reply to a message, you can say “I want to reply to the message of Ricardo Pérez”.
S: What would you like to do?
U: Hi, I would like to hear Alberto Morales's messages.
S: Please hold on while I get the messages.
S: There is just one message, and the message is:
S: “Hi, I just want to let you know that the meeting will be next Friday at 5:30. Don't be late”.
S: That was the message. What would you like to do?
U: Good bye
S: Thanks for calling the email system by phone. Good bye.
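To make the contrast concrete, the sketch below (our illustration, not the deployed system's code) shows how a directed-dialogue turn handler might map the commands prompted in D1 to actions; the message data and function names are hypothetical.

```python
# Illustrative sketch only (not the deployed system): a directed-dialogue turn
# handler restricted to the commands prompted in dialogue D1.

MESSAGES = [
    {"sender": "Armando Rocha", "subject": "my new cell phone number", "body": "..."},
    {"sender": "Alberto Morales", "subject": "Company meeting",
     "body": "Hi, I just want to let you know that the meeting will be next Friday at 5:30."},
]

VALID = {"first", "next", "previous", "last", "read", "delete", "instructions", "good bye"}

def dd_turn(command: str, index: int):
    """Handle one directed-dialogue turn; return (system prompt, new message index)."""
    command = command.strip().lower()
    if command not in VALID:
        return ("Sorry, I did not understand. You can say 'instructions'.", index)
    if command == "read":
        return ("The message is: " + MESSAGES[index]["body"], index)
    if command == "delete":
        MESSAGES.pop(index)
        return ("The message was deleted.", max(0, min(index, len(MESSAGES) - 1)))
    if command == "good bye":
        return ("Thanks for calling the email system by phone. Good bye.", index)
    if command == "instructions":
        return ("You can say: first, next, previous, last, read, delete or good bye.", index)
    # Navigation commands: move within the sorted message list.
    index = {"first": 0, "last": len(MESSAGES) - 1,
             "next": min(index + 1, len(MESSAGES) - 1),
             "previous": max(index - 1, 0)}[command]
    header = MESSAGES[index]
    return (f"Message sent by {header['sender']}, subject {header['subject']}.", index)

# Example: navigate to the second message and read it.
prompt, i = dd_turn("next", 0)
print(prompt)
print(dd_turn("read", i)[0])
```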
Figure 1. Spoken Language Systems Classification [6]. (The classification plots dialogue complexity — IVR-DTMF menus, directed dialogue, mixed initiative, free flowing, and natural dialogue systems up to a human operator — against the type of speech input: digits, discrete words, word spotting, and spontaneous continuous natural language.)
At first glance, MI may seem a better strategy than DD, but previous work suggests that the MI strategy faces several difficulties [1-3]. First, MI requires more complex grammars, which can result in high automatic speech recognition (ASR) error rates. Second, MI may require users to learn what the system can understand, because the system does not prompt them with the valid vocabulary. Our motivation in this work is therefore to evaluate both dialogue strategies (DD and MI) in order to contribute to the solution of the problems that MI confronts today. This paper presents a comparative analysis of the performance of the two strategies in a spoken language system for accessing email by phone. In this work we used PARADISE [4], the most sophisticated evaluation framework currently available for spoken language systems, and our experimental setup uses a state-of-the-art speech recognizer that can handle out-of-vocabulary words [5]. The following section describes the system design. Section 3 describes our experimental design, including details of the evaluation methodology. Section 4 presents our experimental results. Finally, in Section 5 we provide our conclusions and comment on future directions of this work.
2 System Design

Spoken language systems can be classified by dialogue complexity and by the type of speech input provided by the user; an SLS classification is shown in Figure 1. In order to evaluate both dialogue strategies, two systems were developed: a Speech-IVR system and a Natural Language System (NLS). The Speech-IVR system implements the DD strategy and uses discrete words as speech input, while the NLS implements the MI strategy with word-spotting capabilities and continuous natural language as speech input.
Figure 2. High-level call flow diagram for the DD and MI strategies. (Nodes include: Wait For Call, Welcome Prompt, Request NIP, NIP OK?, Invalid User, Instructions, New and Unread email, Header, BodyPart, Action to perform, Reply, Delete, Hang Up, and Thanks For Calling.)
Our systems provide the following capabilities:
• user authentication,
• reporting the number of new and unread emails,
• header and body-part consultation,
• deleting and replying to messages.

A high-level call flow diagram for our systems is illustrated in Figure 2. To help users interact with the system, the system gives them instructions on their first call, after user authentication. Our dialogue manager uses a state machine to implement both dialogue strategies. Most of the states include a DialogModule™ provided by the recognizer, which takes care of that part of the conversation. A DialogModule includes an initial prompt, timeout prompts, retry prompts, confirmation prompts, a help prompt, and a vocabulary or grammar. We enabled a set of global commands that can be reached at any time in the call flow: cancel, help, instructions, and call termination. The system implementing the DD strategy uses only vocabularies. The system implementing the MI strategy uses grammars, which include a set of filler models to the left and right of keywords for modeling out-of-vocabulary words. In this way we were able to recognize any phrase containing at least one keyword and at most two keywords (an action and a name), as sketched below.
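As a rough illustration of this keyword-spotting behavior (the actual system encodes it inside the recognizer's grammars with explicit filler/garbage models [5], not with string matching), the following sketch reduces an MI utterance to an action keyword and a name keyword; the vocabularies shown are hypothetical.

```python
# Rough text-level approximation of keyword spotting: anything that is not a
# known action or sender name counts as filler. The vocabularies are hypothetical.

ACTIONS = {"hear": "read", "read": "read", "reply": "reply", "delete": "delete"}
NAMES = ["alberto morales", "armando rocha", "ricardo perez"]

def spot_keywords(utterance: str):
    """Return (action, name) spotted in an MI utterance; either may be None."""
    text = utterance.lower()
    action = next((canonical for word, canonical in ACTIONS.items() if word in text), None)
    name = next((n for n in NAMES if n in text), None)
    return action, name

# "Hi, I would like to hear Alberto Morales's messages" -> ('read', 'alberto morales')
print(spot_keywords("Hi, I would like to hear Alberto Morales's messages"))
```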
The communication architecture components provide critical support for our systems; Figure 3 shows the system architecture. The following hardware and software components were employed:
• ASR: SpeechWorks Recognizer 6.5 Second Edition for Mexican Spanish (barge-in enabled)
• Speech Synthesizer: Eloquent 5.0 Mexican male voice
• Card Telephony Interface: Dialogic D21H
• Programming Language: C++
• Telephone Line Simulator: Skutch AS-26
• Mail Server: VisNetic MailServer 5.0.2
• Operating System: Windows NT 4.0
• Telephone: Telmex Regulated, BE-408 Type 1
• PC: Pentium 600 MHz, 256 MB RAM

Figure 3. System communication architecture. (Components: automatic speech recognizer, speech synthesizer, card telephony interface, PSTN, VRU, Internet/Intranet, and storage resources: database, mail server, and device access.)

3 Experimental Design

The experimental setting was similar to that described in [2]; we also applied the PARADISE evaluation framework for comparing dialogue strategies. The overall structure of objectives in PARADISE that provides the basis for estimating a performance function is shown in Figure 4, with user satisfaction at the top level. In this framework, user satisfaction is correlated with task success, and that success comes at a price in cost measures. The cost measures can be classified as either efficiency measures or qualitative measures. Efficiency measures reflect the normal course of the interaction (the user speaks, the system listens and understands what was just said, and time goes by). However, not everything works perfectly: automatic speech recognition can fail and produce retries or timeouts (qualitative measures), for example when the user does not know exactly what to say.

Figure 4. PARADISE's structure of objectives for dialogue performance [4]. (Maximize user satisfaction by maximizing task success, measured with Kappa, and minimizing cost measures: efficiency metrics such as elapsed time and user and system turns, and qualitative metrics such as retries, barge-ins, and cancels.)

To measure task success, the Kappa coefficient is calculated from a confusion matrix that summarizes how well a user or system achieves the information goals of a particular task, for a set of dialogues instantiating a set of scenarios. Kappa is defined by Equation 1 and is described in [7]. P(A) is the proportion of times that the Attribute-Value Matrix (AVM) for a dialogue agrees with the AVM for the scenario key, and P(E) is the proportion of times we would expect the AVMs for the dialogues and keys to agree by chance. When agreement is perfect (all task information items are successfully exchanged), Kappa is 1. When agreement is only at chance, Kappa is 0.

κ = (P(A) − P(E)) / (1 − P(E))    (1)

where

P(E) = Σ_{i=1}^{n} (t_i / T)²    (2)

and

P(A) = (Σ_{i=1}^{n} M(i,i)) / T    (3)

As shown in Figure 4, performance also includes a function that combines the cost measures. The PARADISE framework represents each cost measure as a function c_i that must be minimized. These cost measures and the Kappa coefficient are the basis of the equation for system performance, which is calculated as follows:

performance = α · N(κ) − Σ_{i=1}^{n} w_i · N(c_i)    (4)

where α is a weight on κ, the cost functions c_i are weighted by w_i, and N is a Z-score normalization function defined by:

N(x) = (x − x̄) / σ_x    (5)
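For concreteness, the following sketch implements Equations 1-5, assuming the confusion matrix is available as a NumPy array and that per-dialogue metric values are collected in arrays; all names and numbers are illustrative, not taken from our experiments.

```python
# Sketch of Equations 1-5. M is a confusion matrix over the AVM attribute values
# (rows: values observed in the dialogues, columns: values in the scenario keys);
# metric arrays hold one value per dialogue. All numbers below are illustrative.
import numpy as np

def kappa(M: np.ndarray) -> float:
    """Equations 1-3: Kappa computed from a confusion matrix M."""
    T = M.sum()                          # total number of task information items
    p_a = np.trace(M) / T                # Eq. 3: observed agreement (diagonal of M)
    t = M.sum(axis=0)                    # frequency of each value in the scenario keys
    p_e = np.sum((t / T) ** 2)           # Eq. 2: agreement expected by chance
    return (p_a - p_e) / (1 - p_e)       # Eq. 1

def z_normalize(x) -> np.ndarray:
    """Equation 5: Z-score normalization across dialogues."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def performance(kappas, costs, alpha, weights):
    """Equation 4: alpha * N(kappa) minus the weighted, normalized cost measures."""
    perf = alpha * z_normalize(kappas)
    for name, w in weights.items():
        perf -= w * z_normalize(costs[name])
    return perf

M = np.array([[8, 1], [0, 9]])           # e.g. the "Delete" attribute: yes / no
print(kappa(M))
print(performance(kappas=[0.9, 0.7, 0.8],
                  costs={"elapsed": [300.0, 420.0, 350.0], "retries": [2.0, 5.0, 3.0]},
                  alpha=1.0, weights={"elapsed": 0.3, "retries": 0.2}))
```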
For evaluating both dialogue strategies, 50 undergraduate students, with an average age of 21 and who receive on average four emails per day, tested our system. Their only previous experience in consulting email was through the UNIX “pine” tool or with the support of a graphic user
interface such as HotMail or the Yahoo Mail service. Each student was requested to perform three tasks comprising five goals, giving us a total of 125 goals per strategy. The tasks are listed in Table 1.

Table 1. Tasks to be performed by each user.
Task | Description | Goals
1 | Armando Rocha has prepared a meeting, so he is expecting you to call him. He sent an email with his cellular telephone number. | Get Armando's cellular telephone number.
2 | Fernando Mata Terrazas has invited you to his wedding. | Get the wedding's day and hour. Delete the message.
3 | Alberto Morales sent an email about a meeting next Friday and he wanted you to confirm your attendance. | Get the meeting's day and hour. Confirm your attendance to the meeting.

It is important to mention that our testers had no previous experience with an SLS, so our experiments were performed with novice users. Before our testers started a call, they were given the following tips for using the systems, recommended by [8]:
• If you do not know what to say, you can say “help” at any time.
• If the system did not understand you correctly and is performing a wrong task, you can say “cancel” to go back to the previous step.
• If you keep silent, the system will tell you what to say.
• You can barge in on a system prompt, so you do not have to wait until it finishes a word or phrase.
• When you finish your tasks, you can say “good bye” to end your call.

The data collection for the evaluation consisted of:
• the recording of dialogues as .ulaw audio files,
• the recording of metrics for each user session,
• a survey that gave us the users' opinions about the system.
In this way, we determined a set of objective and subjective metrics: Task Success, Barge-in, Timeout, Retry, Elapsed Time, Help, Cancel, System Turns, User Turns, and Mean Recognition Score (MRS). To measure task success, users provided us with the information required by each task. We used an Attribute-Value Matrix (AVM) to represent the information retrieved from the users' emails. An example of the AVM for task 2 is shown in Table 2.

Table 2. Attribute-Value Matrix (AVM) for task 2.
Attribute | Possible Values
Day | Any day from Monday to Sunday
Hour | 5:30, any other hour
Delete | Yes, No

In order to compute user satisfaction with subjective metrics, we performed a survey with the following questions:
• TTS quality: Was the system easy to understand?
• ASR quality: In this conversation, did the system understand what you said?
• Task ease of use: In this conversation, was it easy to find the message you were looking for?
• Interaction pace: Was the pace of interaction with the system appropriate in this conversation?
• User experience: Did you know what you could say at each point of the dialogue?
• System response: How often was the system sluggish and slow to reply to you in this conversation?
• Expected behavior: Did the system work the way you expected?
• Future use: Based on your current experience with the system, would you use it regularly as another medium for accessing email?

The user satisfaction survey was presented as a web page with the 8 questions and multiple-choice answers. The possible answers to most of the questions were “almost never”, “rarely”, “sometimes”, “often”, and “almost always”; some questions had responses such as “yes”, “no” or “maybe”. Every response was mapped to an integer from 1 to 5, with 5 representing the highest score. The survey also included a text field where users were encouraged to enter comments about our systems. The responses were summed, resulting in a User Satisfaction measure for each dialogue ranging from 8 to 40, where 8 is the worst case for all questions and 40 represents the best case.
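A minimal sketch of this scoring scheme is shown below; the exact answer-to-score mapping is our assumption for illustration.

```python
# Sketch of the survey scoring: each of the 8 answers maps to an integer from 1 to 5
# and the sum (8-40) is the user satisfaction score for the dialogue. The exact
# answer-to-score mapping is assumed here for illustration.

SCALE = {"almost never": 1, "rarely": 2, "sometimes": 3, "often": 4, "almost always": 5,
         "no": 1, "maybe": 3, "yes": 5}

def user_satisfaction(answers):
    """Sum the mapped scores of the eight survey answers (range 8 to 40)."""
    return sum(SCALE[answer.strip().lower()] for answer in answers)

print(user_satisfaction(["often", "almost always", "sometimes", "rarely",
                         "often", "sometimes", "yes", "maybe"]))  # -> 29
```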
4 Experimental Results
To compute the system performance using Equation 4, we performed a Multiple Linear Regression (MLR) over all objective metrics, which produced a set of coefficients (weights) describing the relative contribution of each factor. We used user satisfaction (US) as the dependent variable. The results of the regression showed that Elapsed Time (ET), Barge-in and Retry were the most significant factors (p < 0.028). Task success (κ) was computed using Equation 1, and ET, Barge-in and Retry were taken from the log files of our systems. Using these metrics, we computed the performance function with a second MLR to obtain α and the weights w_i, resulting in the following performance function:

performance = 4.950(κ) + 0.009(ET) + 0.595(Barge-in) − 0.325(Retry)

A summary of the metrics collected from our systems applying the PARADISE framework is listed in Table 3. The table shows that MI is more efficient than, and in fact better than, DD in the three efficiency measures (System Turns, User Turns and Elapsed Time). However, according to the qualitative measures, DD is better in Mean Recognition Score (MRS), Timeout, Retry and Barge-in. Task success in the DD strategy was also better than in the MI strategy. It is important to mention that the MRS for MI was good (0.82) despite the difficulty of the recognition tasks in this strategy; we attribute this to the use of a word-spotting technique based on explicit garbage models [5].

Table 3. Metrics performance for the DD and MI strategies.
Metric | Directed Dialogue | Mixed Initiative
Efficiency measures | |
System Turns | 48.24 | 38.96
User Turns | 21.48 | 17.56
Elapsed Time (secs) | 398.24 | 346.24
Qualitative measures | |
MRS | 0.86 | 0.82
Timeout | 0.60 | 1.24
Retry | 3.44 | 4.92
Help | 0.52 | 0.08
Cancel | 0.16 | 0.12
Barge-in | 4.32 | 2.16
Task Success (κ) | 0.83 | 0.79
User Satisfaction | 31.72 | 33.32

Table 4 shows that users prefer the MI strategy over the DD strategy. In this table we can observe some relevant positive results for the MI strategy: task ease of use, user expertise, expected behavior and future use. We find these results interesting because we have 25 different opinions for each dialogue strategy. We assume that users may find MI easy to use because they can say the action they want in just one utterance. Even when task success is lower in MI, users feel confident about what to say and what to expect from the system. It also seems that users liked MI more because they said they would use it in the future as another medium for reading email.

Table 4. User Satisfaction Survey Results.
Criteria | Directed Dialogue | Mixed Initiative
TTS Quality | 3.92 | 4.08
ASR Quality | 4.12 | 4.20
Task ease | 3.72 | 4.40
Interaction pace | 3.08 | 2.68
User expertise | 4.12 | 4.20
System response | 4.28 | 4.16
Expected behavior | 4.36 | 5.00
Future use | 4.12 | 4.60
User Satisfaction | 31.72 | 33.32

These results suggest that in the near future MI could become the dialogue strategy most used by spoken dialogue systems. We consider the key factors to be the MRS and the User Interface (UI) design: our hypothesis is that with a friendly, easy-to-use UI and a robust speech recognizer, the MI strategy can surpass the DD strategy. Finally, our experiments were performed with novice users, and we suspect that with expert users MI could reach the same task success scores as the DD strategy. Timeouts and retries would also be reduced dramatically, because users would know what to say, and as a consequence the MRS may improve.
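Returning to the regression described at the beginning of this section, the sketch below shows one way the weights of the performance function could be estimated with ordinary least squares over normalized metrics; the data layout and numbers are fictitious, and this is not the statistical package we actually used.

```python
# Fictitious example of estimating the performance-function weights: user
# satisfaction is regressed (ordinary least squares) on the Z-normalized metrics.
import numpy as np

def z(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def fit_performance_weights(us, metrics):
    """Fit US ~ sum_i coef_i * N(metric_i) + intercept; return the coefficients,
    which play the role of alpha and the w_i in the performance function."""
    names = sorted(metrics)
    X = np.column_stack([z(metrics[n]) for n in names] + [np.ones(len(us))])
    coef, *_ = np.linalg.lstsq(X, np.asarray(us, dtype=float), rcond=None)
    return dict(zip(names, coef[:-1]))   # drop the intercept

weights = fit_performance_weights(
    us=[31, 34, 28, 36, 30],
    metrics={"kappa": [0.8, 0.9, 0.7, 1.0, 0.8],
             "elapsed_time": [400, 340, 450, 320, 380],
             "retries": [3, 1, 5, 0, 4]})
print(weights)
```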
5 Conclusion
We presented experiments comparing two dialogue strategies, directed dialogue (DD) and mixed initiative (MI), in the context of a spoken dialogue system for accessing email by phone. Our results showed that the MI strategy is more efficient: the metrics System Turns, User Turns and Elapsed Time score better than under the DD strategy. Our results also showed that the DD strategy is better on qualitative metrics such as Task Success, MRS, Timeout and Retry. At this point, we conclude that there are two key components that may improve the performance of either dialogue strategy: a robust speech recognizer and a well-designed user interface. With these components, better results can be obtained on both the efficiency and the qualitative metrics for both dialogue strategies. Furthermore, our results showed that user satisfaction is higher with the MI strategy. Combining high performance on efficiency and qualitative metrics with user satisfaction makes MI a strong candidate to become the dialogue strategy most used by spoken dialogue systems. To achieve high MRS scores, it is necessary to deal robustly with out-of-vocabulary speech.
Our immediate future direction is a detailed study of out-of-vocabulary speech, which according to our results is a key component for handling the MI strategy. Other future directions include a revision of the user interfaces, considering the addition of interaction demonstrations and a careful design of timeouts and retries; this revision will be based on our experimental results, user comments and an analysis of the speech data. Finally, another important issue relates to user experience: our results are for novice users, and for expert users the preference for and the performance of MI may be even greater.
6 Acknowledgements
This research was partially supported by SpeechWorks International Inc. with equipment and software licenses. We would like to thank Ben Serridge for revising the writing of this paper.
References
[1] Danieli, M. and Gerbino, E., Metrics for Evaluating Dialogue Strategies in a Spoken Language System, In Proceedings of the AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, California, USA, 1995, 34-39.
[2] Walker, M. A., Fromer, J., Di Fabbrizio, G., Mestel, C. and Hindle, D., What Can I Say? Evaluating a Spoken Language Interface to Email, In Proceedings of CHI, California, USA, 1998, 582-589.
[3] Walker, M. A., Kamm, C. A. and Litman, D., Towards Developing General Models of Usability with PARADISE, Natural Language Engineering: Special Issue on Best Practice in Spoken Dialogue Systems, 2000.
[4] Walker, M. A., Litman, D., Kamm, C. A. and Abella, A., PARADISE: A Framework for Evaluating Spoken Dialogue Agents, In Proceedings of ACL/EACL, 1997, 271-280.
[5] Cuayáhuitl, H. and Serridge, B., Out-of-Vocabulary Word Modeling and Rejection for Spanish Keyword Spotting Systems, In Proceedings of the Mexican International Conference on Artificial Intelligence, LNAI 2313, Mérida, Mexico, 2002, 156-165.
[6] Thomas, B. S. and Joy, J., Saying What Comes Naturally, Speech Technology Magazine, March-April 2001.
[7] Carletta, J. C., Assessing the Reliability of Subjective Coding, Computational Linguistics, 22(2), 1996, 249-254.
[8] Telephone Speech Standards Committee, Universal Commands for Telephony-Based Spoken Language Systems, CHI Bulletin, 32(2), April 2000, 25-30.