Recognition and Understanding Simulation for a Spoken Dialog Corpus Acquisition*

F. Garcia, L.F. Hurtado, D. Griol, M. Castro, E. Segarra, E. Sanchis

Departament de Sistemes Informàtics i Computació (DSIC)
Universitat Politècnica de València (UPV)
Camí de Vera s/n, 46022 València, Spain
{fgarcia,lhurtado,dgriol,mcastro,esegarra,esanchis}@dsic.upv.es

Abstract. Since the design and acquisition of a new dialog corpus is a complex task, new methods to facilitate this task are necessary. In this paper, we present a methodology to make use of our previous work within the framework of dialog systems in order to acquire a dialog corpus for a new domain. The main idea is the simulation of recognition and understanding errors in the acquisition of the new dialog corpus. This simulation is based on the analysis of such errors in a previously acquired corpus and the definition of a correspondence table among the concepts and attributes of both tasks. This correspondence table is based on the similarity of semantic meaning and frequencies. Finally, the application of this methodology is illustrated in some examples.

Key words: Dialog Systems, Error Simulation, Corpus Acquisition.

1 Introduction

The study and development of spoken dialog systems is an emerging field within the framework of language and speech technologies. The scheme used for the development of these systems usually includes several generic modules that deal with multiple knowledge sources. These modules must cooperate to satisfy user requirements: they must recognize the pronounced words, understand their meaning, manage the dialog, perform error handling, access the databases, and generate an oral answer. Each module has its own characteristics and the selection of the most convenient model varies depending on certain factors (the goal of the dialog, the possibility of manually defining the behavior of the module, or the capability of automatically obtaining models from training samples).

The process of designing, implementing, and evaluating a dialog system is increasingly complex. One of the most successful approaches is the statistical approach, which probabilistically models processes that are automatically learned from corpora of real human-computer dialogs [1–4]. The main reason for using the statistical approach is that we want to estimate a dialog manager that is able to deal with variability in user behavior. These models also reduce development and maintenance costs. Systems with improved portability, more robust performance, and an easier adaptation to other domains can be developed. The success of statistical approaches, however, depends on the quality of the models and the quality of the data. Therefore, the acquisition of the corpora and the definition of the semantic representation for the labeling are processes that are key to obtaining the quality data that is needed to train satisfactory models.

* This work has been partially supported by the Spanish Government and FEDER under contract TIN2005-08660-C04-02, and by the Vicerrectorado de Innovación y Desarrollo of the Universidad Politécnica de Valencia under contract 4681.

In this paper, we present a methodology to make use of our previous work within the framework of dialog systems in order to acquire a dialog corpus for a new domain. The main idea is the simulation of recognition and understanding errors in the acquisition of the new dialog corpus. This simulation is based on the analysis of the recognition and understanding errors generated when our automatic speech recognition and understanding modules [5] are applied to a previously acquired corpus. To translate these errors to the new corpus, a correspondence table among the concepts and attributes of both tasks is defined. This correspondence is based on the similarity of the semantic meanings and frequencies.

This methodology has been applied within the framework of two Spanish projects: DIHANA [6] and EDECAN [7]. The main objective of the DIHANA project was the design and development of a dialog system for access to an information system using spontaneous speech. The domain of the project was the query to an information system about railway timetables and prices by telephone in Spanish. Within the framework of this project, we developed a mixed-initiative dialog system to access information systems using spontaneous speech [8]. The behavior of the main modules that compose the dialog system was based on statistical models that were learned from a dialog corpus that was acquired and labeled within the framework of the DIHANA project.
The main objective of the EDECAN project, which is currently underway, is to increase the robustness of a spontaneous speech dialog system through the development of technologies for the adaptation and personalization of the system to the different acoustic and application contexts in which it can be used. Within the framework of this project, we will build and evaluate a fully working prototype of a dialog system for access to an information system using spontaneous speech, as in the DIHANA project. In this case, the domain is the multilingual query to an information system for sport activity information and booking. For the development of the dialog system, we will use statistical approaches as in the DIHANA project. Therefore, the acquisition of a corpus for the new domain will be necessary, and the proposed methodology will be applied.

A new architecture has been designed for the acquisition of this new corpus. Using this architecture, two Wizards of Oz (WOZ) will take part in the acquisition: one to simulate the behavior of the recognition and understanding modules, and the other to simulate the behavior of the dialog manager.

Section 2 presents the architecture that has been defined for the acquisition of the corpus in the EDECAN project. Section 3 presents our previous work within the framework of the DIHANA project and the semantic representation defined for the EDECAN task. Section 4 presents our methodology for simulating errors and confidence values, and the application of this methodology is illustrated in some examples. Finally, Section 5 presents some conclusions and future work.

2 An architecture for the acquisition of corpora

As stated in the introduction, we are working on the construction of corpus-based spoken dialog systems for access to information systems. In our approach, the parameters of the main modules that constitute the dialog system are automatically estimated from data. Therefore, when we want to design a dialog system for a new task, we need a corpus of dialogs for this task. Following the main contributions of the literature, we made acquisitions using the WOZ technique, that is, acquisitions were made with real users and a simulated dialog system.

In the WOZ technique, a person substitutes for the machine in almost all of its functions. In other words, s/he listens to the user turn and builds the query to the information system and the system frame (a codified system answer). This is done by using a software platform (which, for example, stores the information supplied by the user in previous turns) in order to apply the dialog strategy. The system frame is then converted by the Answer Generator and the Text-To-Speech modules into the answer to the user.

In the WOZ technique, there is usually only one person who performs all the functions described above. In our experience, this is too much for a single person to do. In this work, we propose working with two WOZ: the understanding simulator and the dialog management simulator. The first one listens to the user and simulates the automatic speech recognition and understanding modules, supplying the simulated user frame. From this frame, the second WOZ performs the dialog manager simulation as described above. The architecture proposed for the acquisition of the new dialog corpus is shown in Figure 1.

Fig. 1. The proposed acquisition schema for the EDECAN corpus.

The separation of the recognition and understanding function and the dialog management function offers several advantages. The main one is that each WOZ must carry out fewer tasks than before, so each WOZ becomes more specialized and the performance of each task improves (a single person doing all the tasks must manage multiple knowledge sources simultaneously). A separate understanding simulator can better simulate the future automatic understanding module because it only knows (listens to) the user inputs (the system outputs are not listened to). A separate dialog manager simulator can also better simulate the future automatic dialog manager, because it works under the same experimental conditions as that module.
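The data flow of the two-WOZ architecture can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `TurnRecord` and `acquire_turn`, the representation of a frame as a list of (concept, attributes) pairs, and the system frame label in the usage lines are hypothetical and do not come from the acquisition platform itself. The two human operators and the error simulator are modeled as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed representation: a frame is a list of (concept, attributes) pairs.
Frame = List[Tuple[str, Dict[str, str]]]


@dataclass
class TurnRecord:
    """Everything logged for one user turn during the acquisition."""
    utterance: str
    correct_frame: Frame      # produced by the first WOZ
    simulated_frame: Frame    # correct frame after error simulation
    system_frame: str         # produced by the second WOZ


def acquire_turn(
    utterance: str,
    understanding_woz: Callable[[str], Frame],
    error_simulator: Callable[[Frame], Frame],
    dialog_woz: Callable[[Frame], str],
) -> TurnRecord:
    """Run one user turn through the two-WOZ acquisition chain."""
    correct = understanding_woz(utterance)  # first WOZ only hears the user
    simulated = error_simulator(correct)    # errors are added automatically
    system = dialog_woz(simulated)          # second WOZ only sees the noisy frame
    return TurnRecord(utterance, correct, simulated, system)


# Usage with stand-in callables (the system frame label is hypothetical).
record = acquire_turn(
    "I want to know the availability of tennis courts.",
    understanding_woz=lambda u: [("AVAILABILITY", {"Sport": "tennis"})],
    error_simulator=lambda f: f,  # identity: no errors injected
    dialog_woz=lambda f: "INFORM-AVAILABILITY",
)
print(record.simulated_frame)
```

Keeping the correct frame, the simulated frame, and the system frame together in one record reflects the fact that each acquired turn is labeled as a by-product of the acquisition.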

3 The DIHANA and the EDECAN corpora

As in many other dialog systems, the representation of the user and system turns is done in terms of dialog acts [9]. The semantic representation chosen for the DIHANA and EDECAN tasks is based on the concept of frame. Therefore, the understanding module generates one or more concepts with the corresponding attributes.

3.1 The previous task: DIHANA and the semantic representation

One of the objectives of the DIHANA project was the acquisition of a dialog corpus. The DIHANA task consists of a telephone-based information service for trains in Spanish. A set of 900 dialogs was acquired using the standard WOZ technique: 225 naive speakers each contributed four dialogs corresponding to different scenarios. The number of user turns was 6,280, and the vocabulary size was 823 different words.

In this task, we identified seven task-dependent concepts: HOUR, DEPARTURE-HOUR, ARRIVAL-HOUR, PRICE, TRAIN-TYPE, TRIP-DURATION, SERVICES and three task-independent concepts: ACCEPTANCE, REJECTION, NOT-UNDERSTOOD. The task-dependent concepts represent the concepts the user can ask for. Each concept has a set of attributes associated with it: City, Origin-City, Destination-City, Class, Train-Type, Order-Number, Price, Services, Date, Arrival-Date, Departure-Date, Hour, Departure-Hour, Arrival-Hour. This set represents the restrictions that the user can place on each concept in an utterance. A labeled DIHANA dialog is shown in Figure 2.

A set of experiments was carried out to evaluate the accuracy of our recognition and understanding modules in the DIHANA project [5]. The results obtained for the frame slot accuracy (the number of correctly understood units divided by the number of units in the reference) were: 93.90% for the correct transcriptions without using the recognition module, and 84.20% using both the recognition and understanding modules. Table 1 shows the errors for the experiments using the recognition and understanding modules in terms of substitutions, insertions and deletions.

3.2 The new task: EDECAN and the semantic representation

The new task defined in the framework of the EDECAN project is a service for the information and booking of sport activities. The service is intended to be used via telephone and via a multimodal information kiosk. We plan to perform the acquisition as explained in Section 2. A kiosk will be set up in a public hall of an education center in our university. A total of 240 dialogs will be recorded by 24 speakers following 15 types of scenarios.

In order to perform this new task, we have obtained a set of 50 person-to-person dialogs that were recorded at the telephone sport service of the University. These dialogs were analyzed and we identified the following: task-independent concepts ACCEPTANCE, REJECTION, NOT-UNDERSTOOD and task-dependent concepts AVAILABILITY, BOOKED, BOOKING, CANCELLATION. The attributes associated with the concepts are: CourtId, CourtType, Date, Hour, Sport. An example of a labeled person-to-person dialog is shown in Figure 3.

M1 Welcome to the railway information system. How can I help you?
U1 Good evening, I want to know timetables from Barcelona to Valencia on April the 24th in the morning.
   HOUR: Origin-City=Barcelona Destination-City=Valencia Date=[2007-24-04] Hour=morning
M2 There are two trains. The first train leaves at eight twenty-five and the last one leaves at ten thirty. Do you want anything else?
U2 Yes, I would like to know how much the first train costs.
   ACCEPTANCE
   PRICE: Order-Number=first
M3 The price of that train is 8.90 €. Do you want anything else?
U3 No, thank you.
   REJECTION
M4 Thanks for using the information system. Have a good journey!

Fig. 2. A labeled DIHANA dialog (English translation from the original in Spanish). M stands for “Machine turn” and U for “User turn”.

Table 1. Recognition and understanding errors in the DIHANA system.

             Substitutions   Insertions   Deletions
Concepts         3.63%          5.33%       2.33%
Attributes       4.55%          5.99%       2.02%
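As a rough illustration of the frame representation used in this section and of the frame slot accuracy measure quoted in Section 3.1, the following Python sketch flattens a frame into units and compares a hypothesis against a reference. It rests on assumptions that are ours, not the paper's: a frame is a list of (concept, attributes) pairs, a "unit" is either a concept or an attribute-value slot, and units are compared as sets, so repeated concepts in one frame would collapse into one. The function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumed representation: a frame is a list of (concept, attributes) pairs.
Frame = List[Tuple[str, Dict[str, str]]]


def frame_units(frame: Frame) -> Set[tuple]:
    """Flatten a frame into comparable units: concepts and attribute-value slots."""
    units: Set[tuple] = set()
    for concept, attributes in frame:
        units.add(("concept", concept))
        for name, value in attributes.items():
            units.add(("attribute", concept, name, value))
    return units


def slot_accuracy(reference: Frame, hypothesis: Frame) -> float:
    """Correctly understood units divided by the number of units in the reference."""
    ref_units = frame_units(reference)
    if not ref_units:
        raise ValueError("reference frame has no units")
    return len(ref_units & frame_units(hypothesis)) / len(ref_units)


# The user turn U1 of the sport service dialog, with one substituted attribute.
reference = [("AVAILABILITY", {"Sport": "tennis", "Date": "2007-26-05", "Hour": "evening"})]
hypothesis = [("AVAILABILITY", {"Sport": "tennis", "Date": "2007-26-05", "Hour": "morning"})]
print(slot_accuracy(reference, hypothesis))  # 3 of the 4 reference units match: 0.75
```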

4 Understanding simulator

The aim of the understanding simulator in our proposed acquisition architecture is to simulate the behavior of our previous DIHANA recognition and understanding modules. The first WOZ translates the user utterance into a correct user frame using the understanding editor. Then, an error simulator adds errors to the correct frame, generating the simulated frame. This process is shown in Figure 4.

The error simulator reproduces the behavior of the recognition and understanding modules developed in the DIHANA project following the error distributions shown in Table 1. To translate these errors to the new corpus, a correspondence table among the concepts and attributes has been manually defined. This correspondence is based on the similarity of the semantic meanings and frequencies. Table 2 shows the correspondences among the concepts and the attributes of the two tasks. In a typical DIHANA dialog, the user first asks for the timetable of a trip and then the price of such a trip (as in the example in Figure 2). Analogously, our analysis of the person-to-person dialogs of the sport service at the University shows that users usually ask for the availability of a sport facility and then book it (as in the example in Figure 3). Therefore, we established a correspondence between HOUR and AVAILABILITY and between PRICE and BOOKING.

The error and confidence measure simulator not only introduces errors into the user frames but also generates a confidence value for each concept and attribute in the simulated frame. This confidence value is calculated using a weighted coefficient that considers whether an error has been introduced in the simulation. Figure 5 shows the effects of applying the error and confidence measure simulator for a user turn.

S1 Welcome to the sport service. How can I help you?
U1 I want to know the availability of tennis courts on May the 26th in the evening.
   AVAILABILITY: Sport=tennis Date=[2007-26-05] Hour=evening
S2 There are three available hours on May the 26th in the evening: from four to five, from six to seven, and from seven to eight. Which do you want to book?
U2 I would like to book from six to seven.
   BOOKING: Hour=[18:00-19:00]
S3 I have just booked tennis court number 4 on May the 26th from six to seven in the evening for you. Do you want anything else?
U3 No, thank you.
   REJECTION
S4 Thank you for using the sport service. Goodbye.

Fig. 3. A labeled EDECAN person-to-person dialog (English translation from the original in Spanish). S stands for “System turn” (the person who attends the service) and U for “User turn”.

Fig. 4. Understanding Simulator for the EDECAN task.

Table 2. Correspondence among concepts and attributes of the user turns of the DIHANA and the EDECAN corpora.

CONCEPT correspondence

EDECAN                    DIHANA
(Sport Info. & Booking)   (Train Information)
-----------------------   ----------------------------------
AVAILABILITY              HOUR, ARRIVAL-HOUR, DEPARTURE-HOUR
BOOKING                   PRICE
BOOKED                    TRAIN-TYPE
CANCELLATION              TRIP-DURATION, SERVICES
ACCEPTANCE                ACCEPTANCE
REJECTION                 REJECTION
NOT-UNDERSTOOD            NOT-UNDERSTOOD

ATTRIBUTE correspondence

EDECAN                    DIHANA
(Sport Info. & Booking)   (Train Information)
-----------------------   ----------------------------------
Sport                     City, Destination-City, Origin-City
CourtType                 Train-Type, Class
CourtId                   Order-Number, Price, Services
Date                      Date, Arrival-Date, Departure-Date
Hour                      Hour, Arrival-Hour, Departure-Hour
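One way to hold the correspondence of Table 2 in a program is a pair of dictionaries keyed by the EDECAN label. The sketch below transcribes the table and adds two hypothetical helpers, `to_edecan` and `translate_confusion`, that carry a DIHANA label or a DIHANA confusion pair over to the EDECAN label set. The confusion pair in the usage line is purely illustrative and is not a measured DIHANA error; how the actual simulator applies the table is not specified here.

```python
from typing import Dict, List, Tuple

# EDECAN label -> DIHANA labels it corresponds to (transcribed from Table 2).
CONCEPT_MAP: Dict[str, List[str]] = {
    "AVAILABILITY": ["HOUR", "ARRIVAL-HOUR", "DEPARTURE-HOUR"],
    "BOOKING": ["PRICE"],
    "BOOKED": ["TRAIN-TYPE"],
    "CANCELLATION": ["TRIP-DURATION", "SERVICES"],
    "ACCEPTANCE": ["ACCEPTANCE"],
    "REJECTION": ["REJECTION"],
    "NOT-UNDERSTOOD": ["NOT-UNDERSTOOD"],
}

ATTRIBUTE_MAP: Dict[str, List[str]] = {
    "Sport": ["City", "Destination-City", "Origin-City"],
    "CourtType": ["Train-Type", "Class"],
    "CourtId": ["Order-Number", "Price", "Services"],
    "Date": ["Date", "Arrival-Date", "Departure-Date"],
    "Hour": ["Hour", "Arrival-Hour", "Departure-Hour"],
}


def to_edecan(dihana_label: str, table: Dict[str, List[str]]) -> str:
    """Return the EDECAN label that a DIHANA label is mapped to."""
    for edecan_label, dihana_labels in table.items():
        if dihana_label in dihana_labels:
            return edecan_label
    raise KeyError(dihana_label)


def translate_confusion(
    pair: Tuple[str, str], table: Dict[str, List[str]]
) -> Tuple[str, str]:
    """Map a DIHANA (reference, recognized) confusion onto EDECAN labels."""
    reference, recognized = pair
    return to_edecan(reference, table), to_edecan(recognized, table)


# Illustrative only: a HOUR concept understood as PRICE in the train task
# would correspond to AVAILABILITY understood as BOOKING in the sport task.
print(translate_confusion(("HOUR", "PRICE"), CONCEPT_MAP))
```

Keying the dictionaries by the EDECAN label keeps the one-to-many direction of the table explicit, at the cost of a linear scan when looking up a DIHANA label.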

User Utterance:   I want to know the availability of tennis courts on May the 26th in the evening.
Correct Frame:    AVAILABILITY: Sport=tennis Date=[2007-26-05] Hour=evening
Simulated Frame:  AVAILABILITY (8): Sport=tennis (7) Date=[2007-26-05] (7) Hour=morning (5) CourtType=indoors (4)

Fig. 5. An example of applying the error and confidence measure simulator for a user turn (confidence values are shown in brackets).
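The behavior illustrated in Figure 5 can be approximated with a short Python sketch. Only the three attribute error rates come from Table 1; everything else is an assumption made for illustration: the value inventory `VALUES`, the decision to draw one error type per attribute and at most one insertion per frame, and the confidence scheme (an integer from 6 to 9 for untouched slots and from 2 to 5 for slots where an error was introduced). The weighted coefficient actually used by the simulator is not given in the text, so this is not a reconstruction of it.

```python
import random
from typing import Dict, List, Tuple

# Attribute error rates taken from Table 1 (DIHANA recognition + understanding).
ATTRIBUTE_RATES = {"substitution": 0.0455, "insertion": 0.0599, "deletion": 0.0202}

# Hypothetical value inventory for some EDECAN attributes, for illustration only.
VALUES: Dict[str, List[str]] = {
    "Sport": ["tennis", "football", "basketball"],
    "Hour": ["morning", "afternoon", "evening"],
    "CourtType": ["indoors", "outdoors"],
}

Slot = Tuple[str, str, int]  # (attribute, value, confidence)


def confidence(rng: random.Random, has_error: bool) -> int:
    """Assumed scheme: slots with a simulated error draw from a lower band."""
    return rng.randint(2, 5) if has_error else rng.randint(6, 9)


def simulate_attributes(
    correct: Dict[str, str], rates: Dict[str, float], rng: random.Random
) -> List[Slot]:
    """Add deletions, substitutions and one possible insertion to a correct frame."""
    simulated: List[Slot] = []
    for name, value in correct.items():
        draw = rng.random()
        if draw < rates["deletion"]:
            continue  # deletion: the slot is dropped
        if draw < rates["deletion"] + rates["substitution"]:
            alternatives = [v for v in VALUES.get(name, []) if v != value]
            if alternatives:  # attributes without an inventory are left unchanged
                simulated.append((name, rng.choice(alternatives), confidence(rng, True)))
                continue
        simulated.append((name, value, confidence(rng, False)))
    # Insertion: at most one spurious attribute that the user never mentioned.
    missing = [n for n in VALUES if n not in correct]
    if missing and rng.random() < rates["insertion"]:
        name = rng.choice(missing)
        simulated.append((name, rng.choice(VALUES[name]), confidence(rng, True)))
    return simulated


rng = random.Random(0)  # fixed seed so that a run can be repeated
correct_frame = {"Sport": "tennis", "Date": "2007-26-05", "Hour": "evening"}
print(simulate_attributes(correct_frame, ATTRIBUTE_RATES, rng))
```

With the Table 1 rates most turns come out unchanged, which is the intended effect: the second WOZ sees roughly the proportion of erroneous slots that the DIHANA modules produced.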

5 Conclusions and Future Work

We have presented a methodology to adapt our previous work within the framework of dialog systems in order to acquire a new corpus for a different domain. This proposal is based on the use of two different WOZ and an error simulator between them. The first WOZ simulates the combined behavior of the recognition and the understanding modules. S/he listens to the user utterance and transcribes it in terms of semantic user frames. The second WOZ carries out the functions of the dialog manager. S/he receives the user frames, interacts with the information system, and generates the answer of the system in terms of system frames.

The error simulator receives the user frame generated by the first WOZ and returns a modified frame by adding simulated errors. This frame is the input for the second WOZ. The errors introduced by this module simulate the errors introduced by our recognition and understanding modules in our previous project. Moreover, the error simulator calculates a confidence value for each frame slot.

Using an error simulator allows us to achieve two objectives. On the one hand, it allows us to simulate the approximate behavior of our recognition and understanding modules before implementing them. This simulation is based on the analysis of the errors introduced by these modules in our previous project. On the other hand, it is possible to simulate a wide range of application environments by varying the parameters used for the generation of errors. For example, a noisier environment can be simulated by increasing the quantity of errors introduced by the simulator.

Since the two WOZ carry out the correct transcription of their respective modules, each dialog acquired using the proposed methodology is already labeled in terms of understanding frames and dialog management frames. This way, a later labeling phase is not necessary, unlike in the classic approach to the WOZ paradigm.

As future work, a more detailed study of the confidence measures used by the error simulator is needed. Once the acquisition is done, we will have real recognition and understanding modules for the new task, and we will be able to compare the behavior of the understanding modules of the DIHANA and EDECAN projects to verify whether the correspondence between the semantic representations of the two tasks is appropriate.
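The idea of simulating a noisier environment by increasing the quantity of errors can be illustrated with a small Python helper that scales a set of error rates by a common factor. The function name `scale_rates`, the single multiplicative `factor`, and the renormalization applied when the scaled rates would sum to more than 1 are all assumptions made for this sketch; only the concept error rates are taken from Table 1.

```python
from typing import Dict

# Concept error rates taken from Table 1.
CONCEPT_RATES = {"substitution": 0.0363, "insertion": 0.0533, "deletion": 0.0233}


def scale_rates(rates: Dict[str, float], factor: float) -> Dict[str, float]:
    """Multiply every error rate by `factor`, keeping their sum at or below 1."""
    if factor < 0:
        raise ValueError("factor must be non-negative")
    scaled = {kind: rate * factor for kind, rate in rates.items()}
    total = sum(scaled.values())
    if total > 1.0:
        # Assumed policy: keep the proportions and renormalize to a total of 1.
        scaled = {kind: rate / total for kind, rate in scaled.items()}
    return scaled


# A hypothetical environment with twice as many errors as observed in DIHANA.
noisy_rates = scale_rates(CONCEPT_RATES, 2.0)
print(noisy_rates)
```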

References

1. Potamianos, A., Narayanan, S., Riccardi, G.: Adaptive Categorical Understanding for Spoken Dialogue Systems. In: IEEE Transactions on Speech and Audio Processing. Volume 13(3). (2005) 321–329
2. Torres, F., Hurtado, L., García, F., Sanchis, E., Segarra, E.: Error handling in a stochastic dialog system through confidence measures. In: Speech Communication. Volume 45. (2005) 211–229
3. Hurtado, L.F., Griol, D., Segarra, E., Sanchis, E.: A stochastic approach for dialog management based on neural networks. In: Proc. of Interspeech’06-ICSLP, Pittsburgh (2006) 49–52
4. Williams, J., Young, S.: Partially Observable Markov Decision Processes for Spoken Dialog Systems. In: Computer Speech and Language. Volume 21(2). (2007) 393–422
5. Grau, S., Segarra, E., Sanchís, E., García, F., Hurtado, L.F.: Incorporating semantic knowledge to the language model in a speech understanding system. In: IV Jornadas en Tecnologia del Habla, Zaragoza, Spain (2006) 145–148
6. Benedí, J., Lleida, E., Varona, A., Castro, M., Galiano, I., Justo, R., López, I., Miguel, A.: Design and acquisition of a telephone spontaneous speech dialogue corpus in Spanish: DIHANA. In: Proc. of LREC’06, Genoa, Italy (2006) 1636–1639
7. Lleida, E., Segarra, E., Torres, M.I., Macías-Guarasa, J.: EDECAN: sistEma de Diálogo multidominio con adaptación al contExto aCústico y de AplicacióN. In: IV Jornadas en Tecnologia del Habla, Zaragoza, Spain (2006) 291–296
8. Griol, D., Torres, F., Hurtado, L., Grau, S., García, F., Sanchis, E., Segarra, E.: A dialog system for the DIHANA Project. In: Proc. of SPECOM’06, St. Petersburg (2006) 131–136
9. Fukada, T., Koll, D., Waibel, A., Tanigaki, K.: Probabilistic dialogue extraction for concept based multilingual translation systems. In: Proc. Int. Conf. on Spoken Language Processing. (1998) 2771–2774
