A Decoding Algorithm for Speech Input Statistical Translation

Ismael García-Varea¹, Alberto Sanchis², and Francisco Casacuberta²

¹ Dpto. de Inf., Univ. de Castilla-La Mancha, 02071 Albacete, Spain

Email: [email protected]

² Inst. Tecnológico de Inf., Univ. Politécnica de Valencia, 46071 Valencia, Spain

Abstract. In this paper, we present an algorithm for speech input statistical translation. It is a dynamic-programming based algorithm, which uses a word graph at the input as a representation of the acoustics of a given utterance. A beam-search implementation of this algorithm has been made, and experimental results with the so-called EUTRANS-I task are presented.

1 Introduction

Over the last ten years, many papers have been published on the statistical approach to text translation [1,2,3,4,5]. Similarly, some works have focused on the problem of direct speech translation [6,7,8], but using approaches other than the statistical one, such as finite-state transducers.

In statistical text translation, a perfect input sentence (with no errors) is assumed. On the other hand, when we use the output of a speech recognition system as the input of a statistical text-to-text translator, we must cope with speech recognition errors. This problem is similar to the one found in the combination of word decoding and language modeling, where the best results are achieved when both processes are integrated. In our case, we face the problem of integrating the speech recognition process and the translation process, so the question arises of how to combine these two processes in a suitable way.

In [9], following the ideas of [10], a semi-decoupled decoding algorithm was presented. Other works have dealt with this problem in mainly two different ways: 1) using the n-best recognized source sentences provided by a speech recognition system, translating them sequentially, and choosing the best translation obtained; and 2) using finite-state transducers rather than a fully stochastic approach, which provides a translation hypothesis when the recognition process ends.

Our aim is to develop a decoding algorithm for speech input from a purely statistical point of view. In [10], the Bayes decision rule for speech input statistical translation was presented. Our work here is mainly based on that paper, but uses different assumptions (in order to simplify the decision rule) for the models involved in the final formulation. Experimental results with the so-called EUTRANS-I task [11] are presented in Section 7.
∗ Work partially supported by the European Union under the IST Programme (IST-2001-32091) and by the Spanish CICYT under grant TIC 2003-08681-C02-02.

Petr Sojka, Ivan Kopeček, and Karel Pala (Eds.): TSD 2004, LNAI 3206, pp. 307–314, 2004. © Springer-Verlag Berlin Heidelberg 2004

2 Speech Input Translation: Review

In this section, we review the formulation presented in [10]. The problem of speech-input statistical translation can be stated as:

ê_1^I = arg max_{e_1^I} Pr(e_1^I | x_1^T)

where x_1^T is an input acoustic sequence. The process can be viewed as: x_1^T → f_1^J → e_1^I, where f_1^J is the input decoding of x_1^T and e_1^I is the corresponding translation of f_1^J. By applying the Bayes decision rule in the same way as when a text input sentence is provided, and assuming that the length I of the output string is known, this problem can be formulated as³:

arg max_{e_1^I} Pr(e_1^I | x_1^T) = arg max_{e_1^I} { Pr(e_1^I) · Σ_{f_1^J} Pr(f_1^J | e_1^I) · Pr(x_1^T | f_1^J) }    (1)

Here, no special modelling assumption is made, apart from the reasonable assumption that Pr(x_1^T | f_1^J, e_1^I) = Pr(x_1^T | f_1^J), i.e., the target string e_1^I does not help to predict the acoustic vectors (in the source language) if the source string f_1^J is given.

3 The Acoustic and Lexicon Models

To simplify the Bayes decision rule for speech translation, two modelling assumptions are made:

– Acoustic modelling: For each input hypothesis f_1^J, we assume, without loss of generality, that it has an associated segmentation x̄_1^J of x_1^T. The acoustic probabilities provided by the speech recognizer are denoted by p(x̄_j | f_j). Thus, for each possible f_1^J we have:

Pr(x_1^T | f_1^J) = ∏_{j=1}^{J} p(x̄_j | f_j)
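In implementations, this product of per-word acoustic probabilities is usually accumulated in the log domain to avoid numerical underflow. A minimal sketch (the function name and the probability values in the example are our own, for illustration only, not taken from the paper):

```python
import math

def acoustic_log_score(acoustic_probs):
    """Accumulate log p(x_j | f_j) over a source hypothesis f_1^J.

    acoustic_probs: list of per-word acoustic probabilities p(x_j | f_j),
    one per source word, as supplied by a speech recognizer.
    """
    return sum(math.log(p) for p in acoustic_probs)

# Hypothetical per-word acoustic probabilities for a 3-word hypothesis.
score = acoustic_log_score([0.8, 0.5, 0.9])
```

Summing logs instead of multiplying raw probabilities keeps the score numerically stable even for long hypotheses.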

– Lexicon modelling: By introducing the concept of alignments [12], a_1^J with 1 ≤ a_j ≤ I for 1 ≤ j ≤ J, we have:

Pr(f_1^J | e_1^I) = Σ_{a_1^J} Pr(f_1^J, a_1^J | e_1^I)

³ In the words of [10]: "From a strict point of view, the source words f_1^J are not of direct interest for the speech translation task. Mathematically, this is captured by introducing the possible source word strings f_1^J as hidden variables into the Bayes decision rule."

where:

Pr(f_1^J, a_1^J | e_1^I) = ∏_{j=1}^{J} Pr(f_j, a_j | f_1^{j-1}, a_1^{j-1}, e_1^I)

Using the IBM-2 alignment model, and taking into account that the source sentence f_1^J is not well formed for speech input [10], we use a more complex translation model that includes a dependence on the predecessor word. After the suitable transformations, this results in:

Pr(f_1^J | e_1^I) = ∏_{j=1}^{J} Σ_{i=0}^{I} p(i | j, I) · p(f_j | f_{j-1}, e_i)

A stochastic dictionary p(f_j | e_i) was used in [12]. In this case, a "bigram dictionary" is introduced instead.
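As a sketch of how this lexicon model could be evaluated for a fixed sentence pair, the following example treats the alignment and bigram-dictionary distributions as caller-supplied callbacks; the function and parameter names are our own, not the paper's:

```python
def lexicon_prob(f, e, p_align, p_lex):
    """IBM-2-style lexicon probability with a bigram dictionary:
    Pr(f | e) = prod_j sum_{i=0..I} p(i | j, I) * p(f_j | f_{j-1}, e_i).

    f: source sentence f_1..f_J (list of words)
    e: target sentence e_0..e_I, where e[0] stands for the empty word
    p_align(i, j, I): alignment probability p(i | j, I)
    p_lex(f_j, f_prev, e_i): bigram-dictionary probability p(f_j | f_{j-1}, e_i)
    """
    I = len(e) - 1  # e[0] is the NULL word
    prob = 1.0
    f_prev = None  # sentence-initial context
    for j, f_j in enumerate(f, start=1):
        prob *= sum(p_align(i, j, I) * p_lex(f_j, f_prev, e[i])
                    for i in range(I + 1))
        f_prev = f_j
    return prob
```

With a uniform alignment distribution and a constant lexicon probability c, the result reduces to c^J, which is a convenient sanity check.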

4 The Decoding Algorithm

Taking into account the previous assumptions, we can rewrite equation (1) as:

arg max_{e_1^I} Pr(e_1^I | x_1^T) = arg max_{e_1^I} { [ ∏_{i=1}^{I} p(e_i | e_{i-1}) ] · Σ_{f_1^J} ∏_{j=1}^{J} Σ_{i=0}^{I} p(i | j, I) · p(f_j | f_{j-1}, e_i) · p(x̄_j | f_j) }    (2)

In order to solve this maximization problem in a dynamic-programming-like way, we define the score(e_1^I, x_1^T) associated to an output string hypothesis given the acoustic sequence as the second factor of the previous equation. In general, we can define this score for a prefix e_1^k (k ≤ I) of the output string hypothesis; if a maximization is performed over all f_1^J instead of a summation, then:

score(e_1^k, x_1^T) ≅ max_{f_1^J} ∏_{j=1}^{J} Σ_{i=0}^{k} p(i | j, I) · p(f_j | f_{j-1}, e_i) · p(x̄_j | f_j)    (3)

From the previous equation, it does not seem that a solution by dynamic programming (DP) can exist. However, if we define:

s(e_k, x_1^T) = max_{f_1^J} ∏_{j=1}^{J} p(k | j, I) · p(f_j | f_{j-1}, e_k) · p(x̄_j | f_j)    (4)

then score(e_1^k, x_1^T) can be approximated by:

score(e_1^k, x_1^T) ≈ score(e_1^{k-1}, x_1^T) + s(e_k, x_1^T)    (5)

where e_k is computed as:

ê_k = arg max_{e_k ∈ E} { p(e_{k-1} | e_k) · T(e_{k-1}, k-1) · score(e_1^{k-1}, x_1^T) + s(e_k, x_1^T) }    (6)

and,

T(e_{k-1}, k-1) = ∏_{i=1}^{k-1} p(e_{i-1} | e_i)    (7)
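A hedged sketch of the word-selection step defined by equations (5)-(7): the callback names (p_bigram, s_word) are illustrative, and the conditioning order of the bigram factor follows equations (6) and (7) as printed:

```python
def extend_hypothesis(prefix_score, e_prev, T_prev, candidates, p_bigram, s_word):
    """Select the next target word according to equations (5)-(7).

    prefix_score: score(e_1^{k-1}, x_1^T) accumulated so far
    e_prev: previously chosen target word e_{k-1}
    T_prev: T(e_{k-1}, k-1), the accumulated bigram factor of eq. (7)
    candidates: target vocabulary E to maximize over
    p_bigram(e_prev, e): output language model bigram probability
    s_word(e): the word score s(e_k, x_1^T) of eq. (4)
    """
    best_e, best_val = None, float("-inf")
    for e in candidates:
        # eq. (6): combine the prefix score, the bigram factor, and s(e_k)
        val = p_bigram(e_prev, e) * T_prev * prefix_score + s_word(e)
        if val > best_val:
            best_e, best_val = e, val
    # eq. (5): the prefix score grows by the chosen word score
    new_score = prefix_score + s_word(best_e)
    # eq. (7): fold one more bigram factor into T for the next step
    new_T = T_prev * p_bigram(e_prev, best_e)
    return best_e, new_score, new_T
```

Calling this once per target position k reproduces the incremental recursion that the full trellis search performs along each hypothesis path.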

The method for computing this score in a dynamic programming style is similar to the algorithm proposed in [13] for text input.

The Decoding Algorithm for Word Graphs

In this case, the maximization for each input sentence f_1^J can be computed only over a subset of the possible f_1^J, i.e., those belonging to a word graph WG(x_1^T). Each path in WG(x_1^T) has an associated input sentence f_1^J and segmentation x̄_1^J. Thus, s(e_k, x_1^T) becomes:

s(e_k, x_1^T) = max_{f_1^J ∈ WG(x_1^T)} ∏_{j=1}^{J} p(k | j, I) · p(f_j | f_{j-1}, e_k) · p_{WG(x_1^T)}(x̄_j | f_j)    (8)

where p_{WG(x_1^T)}(x̄_j | f_j) is the probability assigned by the graph WG(x_1^T) to the input word f_j with associated segment x̄_j.

Once the search procedure for obtaining a translation from the word graph is defined, some implementation details (heuristics) need to be clarified. First of all, the length of the input sentence to be translated is not known a priori, because the input is provided in the form of a word graph. On the other hand, inferring a probability distribution for p(f_j | f_{j-1}, e_i) may not be straightforward and, in any case, the distribution obtained from training data will be very sparse. Taking these limitations into account, we made the following simplifications to the models:

– With respect to the alignment probability distribution, we remove the dependency on the input sentence length J because, as commented above, this parameter is unknown a priori. Thus, the final alignment probability distribution used was p(i | j, I).
– With respect to the 'well-formedness' of the source sentences f_1^J, we need a more complex model, p(f_j | f_{j-1}, e_k). For the sake of simplicity, we approximate this distribution as:

p(f_j | f_{j-1}, e_i) ≅ [ p(f_j | e_i) · p(f_j | f_{j-1}) ] / [ Σ_{f'_j} p(f'_j | e_i) · p(f'_j | f_{j-1}) ]

This decomposition can be seen as the product of the probability given by the input language model that governs the correct input sentences and the probability given by a stochastic dictionary, in the same way as was presented for text input in [12]. Alternatively, this distribution can be approximated by:

p(f_j | f_{j-1}, e_i) ≅ α · p(f_j | e_i) + (1 − α) · p(f_j | f_{j-1})

Under these circumstances, s(e_k, x_1^T) can be computed by comparing two graphs by dynamic programming: one corresponding to the input language model and the other corresponding to the word graph.
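The linear interpolation above is straightforward to implement. In the sketch below, the dictionary and input-bigram distributions are passed as callbacks, and the value of α is an arbitrary placeholder (the paper does not report the weight used):

```python
def bigram_dict_prob(f_j, f_prev, e_i, p_dict, p_bigram, alpha=0.5):
    """Linear interpolation approximation of p(f_j | f_{j-1}, e_i).

    p_dict(f_j, e_i): stochastic-dictionary probability p(f_j | e_i)
    p_bigram(f_j, f_prev): input language model probability p(f_j | f_{j-1})
    alpha: interpolation weight; an illustrative default, not the paper's value
    """
    return alpha * p_dict(f_j, e_i) + (1 - alpha) * p_bigram(f_j, f_prev)
```

Because the result is a convex combination of two probabilities, it stays in [0, 1] for any α in [0, 1], which makes this approximation cheaper than the normalized product form.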


The first graph is built from the input "bigram dictionary". The nodes are the pairs (f, j), where f is an input word and j is an integer in the range 0 ≤ j ≤ J. The edges have the form ((f, j−1), (f', j)), with weight defined as p(k | j, I) · p(f' | e_k) · p(f' | f).

The second graph is built from the word graph. The set of nodes is the same as in the first graph. The edges have the form ((f, j−1), (f', j)) if there is a path in WG(x_1^T) such that f = f_{j−1} and f' = f_j, with weight p(x̄_j | f').

The global search was performed in the same way as in [13]. A trellis, indexed by the target sentence position, is built. In order to compute the best path to a given state of the trellis, a comparison between the two graphs is needed for each possible hypothesis preceding the current state. The entire process was performed using a dynamic programming algorithm, together with beam-search techniques to reduce the computational cost. An overview is given in Algorithm 1.

Algorithm 1. Beam-search algorithm for speech-input statistical translation.

Algorithm DP-DECODING_WG
Require: I, p(f_j | e_k), p(i | j, I), WG(x_1^T) (p(x̄_j | f_j)), LM_in (p(f_j | f_{j-1})), LM_out (p(e_i | e_{i-1}))
Ensure: ê_1^I = arg max_{e_1^I} Pr(e_1^I | x_1^T)
  // Initialization
  Compute s(e_0, x_1^T) by comparing the WG and LM_in, using the values p(f_j | e_0) and p(0 | j, I) together with the corresponding acoustic and input language model probabilities
  // Iteration
  for all stages i in the trellis do
    1. Compute a new word e_i of the solution using equation (6), comparing the input language model and the word graph with the suitable probability distributions
    2. Obtain a new value for s(e_i, x_1^T)
    3. Update score(e_1^i, x_1^T) with the newly obtained value of s(e_i, x_1^T)
  end for
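The trellis-plus-beam structure of Algorithm 1 can be sketched as follows. The graph-comparison step that actually produces s(e_i, x_1^T) is abstracted into a word_score callback, and the beam width is an illustrative setting, not the paper's:

```python
import heapq

def beam_decode(I, vocab, word_score, beam_width=5):
    """Beam-search trellis over target positions 1..I (cf. Algorithm 1).

    word_score(e, prefix): score contribution of appending word e to a
    partial translation; here it stands in for the two-graph comparison
    step of the paper.
    Returns the best full hypothesis and its accumulated score.
    """
    beam = [((), 0.0)]  # (partial translation, accumulated score)
    for _ in range(I):  # one trellis stage per target position
        expanded = [(prefix + (e,), score + word_score(e, prefix))
                    for prefix, score in beam for e in vocab]
        # prune: keep only the beam_width best partial hypotheses
        beam = heapq.nlargest(beam_width, expanded, key=lambda h: h[1])
    return beam[0]
```

Pruning to a fixed beam width at every stage is what keeps the search tractable, at the cost of possibly discarding the globally optimal path.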

The output sentence length I can be estimated from an a priori probability distribution of input lengths.

5 Experimental Task

We selected the EUTRANS-I task [11] to test the translation algorithm proposed here. This task consists of a semi-automatically generated Spanish–English corpus, whose domain is a human-to-human communication situation at the reception desk of a hotel. From this corpus, 10,000 random sentence pairs were used for training. The input and output vocabulary sizes were 689 and 514, respectively.

A multi-speaker speech corpus for the task was also acquired. A total of 436 Spanish sentences were selected from the text corpus. They were divided into eleven sets: one common set consisting of 16 sentences, and ten sets of 42 sentences each. Each of the twenty speakers (ten


male and ten female) participating in the acquisition of this corpus pronounced the common set plus two of the other ten sets, totalling 2,000 utterances, 15,360 words and about 90,000 phones. The sampling frequency was 16 kHz. From this speech corpus, two sub-corpora were extracted:

– Training and adaptation (TravTR): 16 speakers (eight male and eight female), 268 sentences, 1,264 utterances (approx. 11,000 words or 56,000 phones).
– Speaker-independent test (TravSI): 4 speakers (two male and two female, not involved in TravTR), 84 sentences (not in TravTR), 336 utterances (approx. 3,000 words or 15,000 phones).

6 Word Graph Generation

In order to test the performance of the translation algorithm, 157 test word graphs were generated from 157 randomly selected utterances of the TravSI sub-corpus. The word graph generation was carried out with the HTK HMM Toolkit V2.1 [14], using acoustic models and an input language model.

For the acoustic models, each of the 24 context-independent Spanish phonemes was modeled by a continuous-density HMM with three emitting states and a left-to-right topology with loops in the emitting states. The emission distribution of each state was a mixture of Gaussians. The HTK HMM Toolkit V2.1 was used to estimate the parameters of these HMMs from the union of two corpora: the 1,264 utterances in the TravTR sub-corpus, and an additional set of 1,530 utterances (by 9 speakers, 4 male and 5 female) from a different, quasi-phonetically-balanced corpus. The speech material was processed every 10 ms to obtain 10 cepstral coefficients of a Mel-filter bank, plus the energy and the corresponding first and second derivatives. The final models had a total of 2,462 Gaussians.

For the input language model, a non-smoothed bigram language model was inferred using the whole text corpus (490,000 text sentences) described in the previous section. This input language model was used by the HTK software to generate the word graphs and is different from the one used by the DP-DECODING_WG algorithm in the comparison of the two graphs.

7 Experimental Results

The translation model (the stochastic dictionary p(f_j | e_k) and the alignment probability distribution p(i | j, I)), the input trigram language model (LM_in), and the output trigram language model (LM_out) used in the DP-DECODING_WG algorithm were trained on the whole 10,000-pair text corpus. The translation results are shown in Tables 1 and 2. In order to compare the performance of the system with respect to text input (the best recognized sentence and the correct sentence), results with a decoupled text-input version of DP-DECODING_WG are also reported.

The assessment criteria were the well-known word error rate (WER) and the subjective sentence error rate (SSER). To compute the SSER, each translated sentence was judged by a human examiner according to an error scale from 0.0 to 1.0. A score of 0.0 means that the translation is semantically and syntactically correct, a score of 0.5 means


that the sentence is semantically correct but syntactically wrong, and a score of 1.0 means that the sentence is semantically wrong.

Table 1. Speech-input translation results for 157 test input word graphs.

              Input WER   Translation WER/SSER (%)
              6.8         49.4/73.2

Table 2. Text-input decoupled translation results for the 157 test input sentences. The first row shows the translation results for the correct test input sentences. The second row shows the results for the best recognized input sentences, obtained from the comparison of the input language model and the corresponding word graph.

              Input WER   Translation WER/SSER (%)
Correct Input   0.0         45.4/71.9
Best Path WG    6.8         50.1/73.6

8 Concluding Remarks

A new decoding algorithm for speech input statistical translation has been presented. Even though the performance achieved is low, we consider this work a first approximation to the speech-input statistical translation problem under the statistical framework, with a real implementation of an integrated recognition/translation process. When the results for speech input and decoupled text input are compared, it is clear that no better results are obtained in the decoupled case. Therefore, the integration of the recognition and translation processes could help the final translation results. For future work, we are considering eliminating some of the simplifications, which would lead to using the corresponding smoothed models as well as more powerful models.

References

1. Alshawi, H., Xiang, F.: English-to-Mandarin speech translation with head transducers. In: Spoken Language Translation Workshop (SLT-97), Madrid, Spain (1997) 54–60.
2. Berger, A.L., Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Gillett, J.R., Lafferty, J.D., Printz, H., Ureš, L.: The Candide system for machine translation. In: Proc. ARPA Workshop on Human Language Technology, Plainsboro, NJ (1994) 157–162.
3. Tillmann, C., Vogel, S., Ney, H., Zubiaga, A., Sawaf, H.: Accelerated DP-based search for statistical translation. In: European Conf. on Speech Communication and Technology, Rhodes, Greece (1997) 2667–2670.


4. Wang, Y.Y., Waibel, A.: Decoding algorithm in statistical translation. In: Proc. 35th Annual Conf. of the Association for Computational Linguistics, Madrid, Spain (1997) 366–372.
5. Wu, D.: A polynomial-time algorithm for statistical machine translation. In: Proc. 34th Annual Conf. of the Association for Computational Linguistics (ACL '96), Santa Cruz, CA (1996) 152–158.
6. Vidal, E.: Finite-state speech-to-speech translation. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing. Volume 1, Munich, Germany (1997) 111–114.
7. Lavie, A., Levin, L., Waibel, A., Gates, D., Gavalda, M., Mayfield, L.: JANUS: Multi-lingual translation of spontaneous speech in a limited domain. In: Proc. 2nd Conf. of the Association for Machine Translation in the Americas, Montreal, Quebec (1995) 252–255.
8. Casacuberta, F., Ney, H., Och, F.J., Vidal, E., Vilar, J.M., Barrachina, S., García-Varea, I., Llorens, D., Martínez, C., Molau, S., Nevado, F., Pastor, M., Picó, D., Sanchis, A., Tillmann, C.: Some approaches to statistical and finite-state speech-to-speech translation. Computer Speech and Language 18 (2004) 25–47.
9. García-Varea, I., Sanchis, A., Casacuberta, F.: A new approach to speech-input statistical translation. In: Proc. International Conference on Pattern Recognition (ICPR 2000). Volume 3, Barcelona, Spain, IEEE (2000) 94–97.
10. Ney, H.: Speech translation: Coupling of recognition and translation. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, Phoenix, AZ (1999) 517–520.
11. Amengual, J., Benedí, J., Castaño, M., Marzal, A., Prat, F., Vidal, E., Vilar, J., Delogu, C., di Carlo, A., Ney, H., Vogel, S.: Definition of a machine translation task and generation of corpora. Technical Report D4, Instituto Tecnológico de Informática (1996). ESPRIT, EuTrans IT-LTR-OS-20268.
12. Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L.: The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19 (1993) 263–311.
13. García-Varea, I., Casacuberta, F., Ney, H.: An iterative, DP-based search algorithm for statistical machine translation. In: Proc. Int. Conf. on Spoken Language Processing (ICSLP '98), Sydney, Australia (1998) 1235–1238.
14. Young, S., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book (Version 2.1). Cambridge University Department and Entropic Research Laboratories Inc., Cambridge, UK (1997).
