Topic verification for dialogue systems based on Bayesian hypothesis testing

D. Pérez-Piñar López and C. García Mateo

A novel approach to the task of verifying the uttered topic for simple dialogue systems is presented. Confidence measures generated by in-parallel topic-adapted automatic speech recognisers are used. Recognition performance, identification of user intention and detection of user-initiated topic change are greatly enhanced, most particularly for difficult topics, such as proper names or confirmations.

Introduction: One of the most difficult aspects of practical dialogue systems is error handling and recovery. Such systems cover a wide range of real-life applications (ticket reservation, weather information services, e-mail access, etc.), but experience shows that they fail in some cases as a result of decoding errors, user-initiated topic changes, or both. The dialogue manager must deal with these recogniser errors both to avoid user frustration and to enhance application usability. Correction techniques, however, are complex and tedious: the dialogue may flow through a series of turns that are useless from both the application and the user perspective. Avoiding these situations is therefore highly desirable and represents a better approach than searching for an ideal correction algorithm. Knowing what the user wants to do at each step of the dialogue is crucial, and a number of methods have been proposed to address this [1]. Most of these techniques, however, only operate correctly when no errors or unexpected situations arise.

The core of our proposal lies in detecting user intentions, understood as topic detection for simple dialogues. Instead of devising complex high-level techniques based on semantics or pragmatics, the low-level modules of the system are modified and tailored to deal with specific application topics. Several in-parallel topic-adapted speech recognisers replace the classic application-adapted recogniser; through the use of mixed language models, each recogniser is adapted to a specific topic. Acoustic and linguistic features are extracted from each recogniser to generate confidence measures, which are then fed into a statistical classifier that outputs the most likely topic at each dialogue step. This novel architecture for the speech recognition module equips the dialogue system with the ability to verify uttered topics and to detect topic changes, while making correction techniques unnecessary in most cases. In addition, this approach enhances speech recognition performance, most particularly for difficult topics, such as proper names or confirmations.

System architecture: The general problem of topic verification and topic-change detection can be expressed and studied using Bayesian hypothesis testing [2], where the starting point is a system awaiting a particular topic. The null hypothesis (H0) covers those cases in which the user says something corresponding to the expected topic; the alternative hypothesis (H1) covers all remaining cases, i.e. out-of-application or topic-changing utterances. This verification method is implemented using topic-adapted speech recognisers and statistical classifiers. Topic adaptation is achieved by means of topic-specific language models (LMi) and dedicated classifiers. Each classifier is fed with features from the recognisers and generates a confidence indication, a likelihood ratio (LR), of the user having stayed on the expected topic. At each dialogue step, all recognisers generate transcriptions and features, and all classifiers combine these into LRs, which are then fed into an output classifier that selects the most probable topic (LMi) and transcription (wi). This architecture is depicted in Fig. 1b, in contrast with the classic recognition module shown in Fig. 1a, in which a single recogniser uses a language model adapted to the whole application (LMApp) and generates a corresponding transcription and LR. We use the architecture of Fig. 1a as our baseline.
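To make the flow of Fig. 1b concrete, the following is a minimal sketch, not the authors' code: every topic-adapted recogniser decodes the same utterance, a per-topic classifier turns its features into an LR, and an output stage selects the winning topic. The recogniser and classifier callables are hypothetical stand-ins, and the final argmax over LRs is a simplification of the trained output classifier described in the letter.

```python
# Illustrative sketch of in-parallel topic verification (Fig. 1b).
# Assumptions: recognisers/classifiers are opaque callables; the output
# classifier is simplified to picking the hypothesis with the highest LR.

from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Features = Sequence[float]                       # e.g. acoustic likelihood, LM probability, word count
Recogniser = Callable[[bytes], Tuple[str, Features]]
Classifier = Callable[[Features], float]

@dataclass
class TopicHypothesis:
    topic: str
    transcription: str
    lr: float                                    # confidence that the user stayed on this topic

def verify_topic(audio: bytes,
                 recognisers: Dict[str, Recogniser],
                 classifiers: Dict[str, Classifier]) -> TopicHypothesis:
    hypotheses: List[TopicHypothesis] = []
    for topic, recognise in recognisers.items():
        transcription, features = recognise(audio)          # topic-adapted LM (LMi)
        hypotheses.append(
            TopicHypothesis(topic, transcription, classifiers[topic](features)))
    # Output stage: here simply the topic whose classifier gives the highest LR.
    return max(hypotheses, key=lambda h: h.lr)
```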
Experimental framework: To validate this architecture, an experimental application was built using a subset of the Spanish SpeechDAT database [3]. The application comprised four topics, selected because of their frequency of use in simple dialogue systems: dates, numbers, names and confirmations. The corpus contained 5000 recording sessions from 991 users (479 male and 512 female). Each user was assigned to exactly one of the three partitions (training, validation and test), distributed as shown in Table 1.
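A minimal sketch of one way to realise such a speaker-disjoint assignment is given below; the hash-based rule and the roughly 75/12.5/12.5 target proportions (inferred from Table 1) are assumptions for illustration, not the procedure actually used to build the corpus partitions.

```python
# Hypothetical speaker-disjoint partitioning: hash each speaker ID to a stable
# value in [0, 1) and map it to training/validation/test, so that no speaker
# contributes utterances to more than one partition.

import hashlib
from typing import Sequence, Tuple

def assign_partition(speaker_id: str,
                     weights: Sequence[Tuple[str, float]] = (
                         ("training", 0.75), ("validation", 0.125), ("test", 0.125))) -> str:
    digest = hashlib.sha1(speaker_id.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 2**32              # stable pseudo-uniform value in [0, 1)
    cumulative = 0.0
    for name, weight in weights:
        cumulative += weight
        if u < cumulative:
            return name
    return weights[-1][0]                        # guard against rounding
```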

Fig. 1 Recognition module architecture
a Application-adapted recogniser
b Topic-adapted in-parallel recognisers

Table 1: Number of utterances and orthographic transcriptions for each topic and partition

Topic           Training   Validation   Test
Dates             2248        375        375
Names             1494        249        249
Numbers           3745        625        625
Confirmations     1478        247        247

We employed a large-vocabulary continuous speech recogniser based on continuous hidden Markov models (CHMMs). The recognition engine is a two-pass recogniser: a first pass using a Viterbi algorithm that works synchronously with a beam search, and a second pass using an A* algorithm [4]. Acoustic models were generated from SpeechDAT databases. As training data we used 40 hours of speech, from which we generated 627 acoustic units (demiphones modelled as two-state HMMs). Each HMM state was modelled by a mixture of 4 to 8 Gaussian distributions over a 39-dimensional feature vector: 12 mel-frequency cepstrum coefficients (MFCCs), normalised log-energy, and their first- and second-order time derivatives.

As stated above, the core of our proposal is topic-adapted recognition. This was accomplished through topic-adapted language models, obtained by mixing a universal language model with topic-specific language models. Trigram language models were trained on SpeechDAT transcriptions using the SRILM toolkit [5] with Katz smoothing. The original topic vocabularies contained 115, 581, 99 and 66 words for dates, names, numbers and confirmations, respectively, and the application vocabulary consisted of 761 words. An additional universal language model was obtained from a newspaper corpus with a vocabulary of 20 000 words. After mixing, the vocabulary size was approximately 20 000 words for each LM. Mixture weights were fixed at 15% for the topic LM and 85% for the universal LM.

Topic verification: The verification process evaluates the alternative hypothesis. Although several approximations could be used, in our experimental framework we modelled H1 using a combination of the alternative LRs, i.e. the likelihood ratios from all topic-adapted recognisers except the one corresponding to the expected topic. Classifiers were implemented as multilayer perceptrons (MLPs), which receive features as inputs (acoustic likelihood, LM probability and transcription word count) and generate LRs as outputs. The hidden layer contains six sigmoid-type neurons. All these networks were trained using the validation partition from SpeechDAT and then optimised using a simple genetic algorithm [6], which achieved a mean error reduction of 4.2%.
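As a concrete illustration of the verification classifier just described, the sketch below (not the authors' implementation) builds a three-input MLP with six sigmoid hidden units whose output is read as an LR-style score, together with one plausible way of weighing the expected topic's LR against the alternative LRs that model H1. The random weights, the absence of input normalisation and the max-based H1 combination rule are assumptions.

```python
# Minimal MLP confidence classifier: inputs are the acoustic likelihood, the
# LM probability and the transcription word count; the hidden layer has six
# sigmoid units and the single linear output is treated as an LR-style score.
# In the letter these networks are trained on the validation partition and
# further tuned with a simple genetic algorithm.

import numpy as np

class ConfidenceMLP:
    def __init__(self, n_in: int = 3, n_hidden: int = 6, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=(1, n_hidden))
        self.b2 = np.zeros(1)

    def __call__(self, features: np.ndarray) -> float:
        h = 1.0 / (1.0 + np.exp(-(self.W1 @ features + self.b1)))  # sigmoid hidden layer
        return (self.W2 @ h + self.b2).item()                       # LR-style score

def accept_expected_topic(lr_expected: float, lr_alternatives: list,
                          threshold: float = 0.0) -> bool:
    # H1 is modelled from the competing topics' LRs; summarising them by their
    # maximum is one possible combination rule, assumed here for illustration.
    return lr_expected - max(lr_alternatives) > threshold
```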


Results: To evaluate the topic-verification architecture, two sets of experiments were conducted. First, the classic architecture was used to recognise the speech input for each topic, and its topic detection performance was evaluated and used as the baseline. Secondly, speech belonging to each topic was fed into the new recognition module with in-parallel recognisers, and the topic verification results were compared with this baseline. Each set of experiments was run with every possible topic defined as the null hypothesis (i.e. the expected topic). Moreover, to avoid artifacts deriving from the relatively low number of utterances in the database, the partitions were randomly rearranged into five distinct non-overlapping sets; results were obtained for each arrangement and then averaged. Table 2 shows the results for both the baseline system and the proposed architecture, in terms of correct topic detection (CORR), false alarms (FA) and false rejections (FR).
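For reference, the figures in Table 2 can be computed from per-utterance decisions under one plausible reading of the three quantities: every decision is either correct, a false alarm (an off-topic utterance accepted as the expected topic) or a false rejection (an on-topic utterance rejected), so the three percentages sum to 100. This reading, and the helper below, are assumptions rather than definitions given in the letter.

```python
# Hedged sketch of the evaluation counts: each test utterance is labelled with
# whether it truly matches the expected topic (H0) and with the system's
# accept/reject decision; CORR + FA + FR = 100% by construction.

from typing import Iterable, Tuple

def topic_detection_rates(decisions: Iterable[Tuple[bool, bool]]) -> dict:
    """decisions: (is_expected_topic, accepted_as_expected_topic) pairs."""
    decisions = list(decisions)
    total = len(decisions)
    fa = sum(1 for on_topic, accepted in decisions if accepted and not on_topic)
    fr = sum(1 for on_topic, accepted in decisions if on_topic and not accepted)
    corr = total - fa - fr                       # correct accepts plus correct rejections
    return {name: 100.0 * count / total
            for name, count in (("CORR", corr), ("FA", fa), ("FR", fr))}
```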

Table 2: Topic detection results for baseline system and proposed architecture

                     Baseline system               In-parallel recognisers
Topic           CORR (%)  FA (%)  FR (%)      CORR (%)  FA (%)  FR (%)
Confirmations     67.91    11.05   21.04        93.53     2.60    3.87
Dates             88.32     6.37    5.31        91.78     3.96    4.27
Names             83.54    10.18    6.28        94.35     3.07    2.59
Numbers           91.25     3.42    5.33        95.29     2.06    2.66

These results show a great improvement in topic detection. The number of false alarms and false rejections is low in all cases, and recognition accuracy is also improved through the use of topic-adapted language models. Overall system performance is improved, both in reducing recognition errors at the word level and in helping the dialogue manager to identify the topic uttered by the user. Interestingly, the results also show high correct detection rates for names and confirmations, which is especially important in practical dialogue systems, where these topics play a fundamental role. The technique can therefore be applied successfully to simple vocal services, where user interactions are usually short and concise.

Acknowledgments: This project has been partially financed by the Spanish Ministry for Education and Science under Project TIC2002-02208, and by the Autonomous Government (Xunta) of Galicia under Project PGIDT03PXIC32201PN.

© IEE 2005
30 August 2005
Electronics Letters online no: 20053119
doi: 10.1049/el:20053119

D. Pérez-Piñar López and C. García Mateo (Signal Technologies Study Group, Department of Signal Theory and Communications, University of Vigo, Campus Universitario, 36310 Vigo, Spain)

References

1 Carberry, S.: 'Toward a robust dialogue system: recognizing dialogue acts'. Proc. Pacific Association for Computational Linguistics (PACLING), Kitakyushu, Japan, 2001
2 García-Mateo, C., Reichl, W., and Ortmanns, S.: 'On combining confidence measures in HMM-based speech recognizers'. Int. Workshop on Automatic Speech Recognition and Understanding (ASRU99), Keystone, CO, USA, December 1999
3 Moreno, A.: 'SpeechDAT Spanish database for fixed telephone networks'. Corpus Design Tech. Rep., SpeechDAT Project LE2-4001, 1997
4 Dieguez-Tirado, J., García-Mateo, C., et al.: 'Adaptation strategies for the acoustic and language models in bilingual speech transcription'. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), March 2005, pp. 833–836
5 Stolcke, A.: 'SRILM – an extensible language modelling toolkit'. Proc. Int. Conf. on Spoken Language Processing, Denver, CO, USA, 2002, Vol. 2, pp. 901–904
6 Khare, V., and Yao, X.: 'Artificial speciation of neural network ensembles'. Proc. UK Workshop on Computational Intelligence (UKCI'02), Birmingham, UK, 2002, pp. 96–103