An Empirical Comparison of Stack-Based Decoding Algorithms for Statistical Machine Translation
Daniel Ortiz1, Ismael García Varea1, and Francisco Casacuberta2
1 Dpto. de Inf., Univ. de Castilla-La Mancha, 02071 Albacete, Spain
[email protected]
2 Inst. Tecnológico de Inf., Univ. Politécnica de Valencia, 46071 Valencia, Spain
Abstract. Unlike other heuristic search algorithms, stack-based decoders have been proved theoretically to guarantee the avoidance of search errors in the decoding phase of a statistical machine translation (SMT) system. The disadvantage of stack-based decoders is their high computational requirements; therefore, to make the decoding problem feasible for SMT, some heuristic optimizations have to be performed. However, this yields unavoidable search errors. In this paper, we describe, study, and implement the state-of-the-art stack-based decoding algorithms for SMT, making an empirical comparison which focuses specifically on the optimization problems, computational time, and translation results. Results are also presented for two well-known tasks, the Tourist task and the Hansards task.
1 Introduction
The goal of the translation process in SMT can be formulated as follows: a source language string f = f_1^J = f_1 ... f_J is to be translated into a target language string e = e_1^I = e_1 ... e_I. Each target string is regarded as a possible translation for the source language string with maximum a-posteriori probability Pr(e|f). According to Bayes' decision rule, we have to choose the target string that maximizes the product of both the target language model Pr(e) and the string translation model Pr(f|e). Alignment models for structuring the translation model are introduced in [2]. In statistical alignment models, Pr(f, a|e), the alignment a = a_1^J is introduced as a hidden variable, and the alignment mapping is j → i = a_j from source position j to target position i = a_j. Typically, the search is performed using the so-called maximum approximation:

  ê = argmax_e { Pr(e) · Σ_a Pr(f, a|e) } ≈ argmax_e { Pr(e) · max_a Pr(f, a|e) }    (1)
Work partially supported by Spanish CICYT under grant TIC2000-1599-C02-01.
F.J. Perales et al. (Eds.): IbPRIA 2003, LNCS 2652, pp. 654–663, 2003. © Springer-Verlag Berlin Heidelberg 2003
Many works [1, 6, 3, 5] have adopted different types of stack-based algorithms to solve the global search optimization problem stated above. All of these works introduce their own optimizations in order to make the use of stack decoders feasible. Here, we pay special attention to some optimization problems which are not addressed in previous works, and we propose some possible solutions. In this paper, we show how to solve the global search optimization problem with the different types of stack-based decoding algorithms proposed so far. We then describe, study, and implement the state-of-the-art stack-based decoding algorithms for SMT, making an empirical comparison which focuses specifically on the optimization problems, computational time, and translation results.
2 Stack-Based Decoding
The stack decoding algorithm, also called the A* algorithm, was first introduced by F. Jelinek in [4]. The stack decoding algorithm generates partial solutions, called hypotheses, until a complete sentence is found; these hypotheses are stored in a stack and ordered by their score. In our case, this score is a probability value given by both the translation and the language model introduced in section 1. The decoder follows a sequence of steps for achieving an optimal hypothesis:

1. Initialize the stack with an empty hypothesis.
2. Iterate:
   (a) Pop h (the best hypothesis) off the stack.
   (b) If h is a complete sentence, output h and terminate.
   (c) Expand h.
   (d) Go to step 2a.

The search starts from a null string and obtains new hypotheses through an expansion process (step 2c) which is executed at each iteration. The expansion process consists of the application of a set of operators over the best hypothesis in the stack. Thus, the design of stack decoding algorithms involves defining a set of operators to be applied over every hypothesis, as well as the way in which they are combined in the expansion process. Both the operators and the expansion algorithm depend on the translation model used; in our case, IBM Model 3 and IBM Model 4. The operators used in our implementation for IBM Model 3 and IBM Model 4 are those defined in [1] and [3], which we describe below:

– add: adds a new target word and aligns a single source word to it.
– extend: aligns an additional source word to the last generated target word.
– addZfert: adds two new target words: the first has fertility zero, and the second is aligned to a single source word.
– addNull: aligns a source word with the target NULL = e0 word.
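The A* loop above can be sketched with a priority queue. The following is a minimal illustration, not the paper's implementation: the `expand` and `is_complete` functions are hypothetical placeholders standing in for the operator-based expansion and the completeness test.

```python
import heapq

def a_star_decode(expand, is_complete, empty_hyp):
    """Generic single-stack (A*) decoding loop.

    `expand(h)` returns successor hypotheses, `is_complete(h)` tests for a
    full sentence; hypotheses are (score, data) pairs where a higher score
    (probability) is better. heapq is a min-heap, so scores are negated.
    """
    stack = [(-empty_hyp[0], empty_hyp[1])]
    while stack:
        neg_score, hyp = heapq.heappop(stack)   # step 2a: best hypothesis
        if is_complete(hyp):
            return -neg_score, hyp              # step 2b: output and stop
        for score, new_hyp in expand((-neg_score, hyp)):  # step 2c: expand
            heapq.heappush(stack, (-score, new_hyp))
    return None

# Toy usage: grow a string one symbol at a time, each step halving the score.
expand = lambda h: [(h[0] * 0.5, h[1] + s) for s in "ab"]
is_complete = lambda hyp: len(hyp) == 2
print(a_star_decode(expand, is_complete, (1.0, "")))  # → (0.25, 'aa')
```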
Algorithm 1.1 Expansion algorithm for IBM Model 3 (input: hypothesis hip)

for all non-covered positions j in hip do
  if hip.is_opened() then
    hip' = hip; hip'.extend(); push(hip')                {opened extension}
    hip'.close(); push(hip')                             {closed extension}
  else
    f = obtain_jth_source_word(j)
    for all words e of the target vocabulary do
      if e ≠ NULL then
        hip' = hip; hip'.add(e, j); push(hip')           {add}
        hip'.close(); push(hip')                         {add + close}
        for all zero-fertility words ze of the target vocabulary do
          hip' = hip; hip'.addZfert(ze, e, j); push(hip')  {addZfert}
          hip'.close(); push(hip')                         {addZfert + close}
        end for
      else   {connect j with NULL}
        if hip.phi0 < m/2 then
          hip' = hip; hip'.addNull(j); push(hip')        {addNull}
        end if
      end if
    end for
  end if
end for
The expansion algorithm we have implemented is strongly inspired by the one given in [1] for IBM Model 3 (see Algorithm 1.1 for details). This algorithm was adapted to use IBM Model 4. Basically, there are two different stack-based algorithms depending on the number of stacks used in the decoding process (one or several). The first type is the A* algorithm, which uses a single stack; its basic operating mode was described above, and in this case the stack stores all the hypotheses ranked by their score. The second type comprises the multi-stack algorithms, where hypotheses with different subsets of aligned source words are stored in different stacks. All the search steps given for the A* algorithm can also be applied here, except step 2a, because multiple stacks are used instead of only one. In this case, another distinction can be made according to the hypothesis selection criterion:

– Multi-stack algorithms. They select the best hypothesis or the N-best hypotheses stored in the queues.
– Multi-stack algorithms with threshold. In this case, all stacks are explored, and a numeric threshold value is calculated and associated to each stack. Only those hypotheses whose scores are greater than the threshold of the stack
can be candidates for selection within the expansion process. The definition of a function to compute the threshold is needed in order to characterize the algorithm. A specific example of a thresholding function is given in [1].
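The multi-stack organization can be sketched by keying stacks on the number of covered source positions, so that only hypotheses with equally many aligned source words compete with each other. This is an illustrative sketch under simplified assumptions (the hypothesis representation and the `expand`/`is_complete` callbacks are hypothetical, not the paper's implementation):

```python
import heapq
from collections import defaultdict

def multi_stack_decode(expand, is_complete, empty_hyp):
    """Multi-stack decoding sketch: one stack per number of covered
    source positions.  Selection scans all non-empty stacks for the
    globally best hypothesis (the time this scan costs is measured in
    the experiments)."""
    stacks = defaultdict(list)                # covered count -> heap
    heapq.heappush(stacks[0], (-empty_hyp[0], 0, empty_hyp[1]))
    while any(stacks.values()):
        # select the best hypothesis over all non-empty stacks
        key = max((k for k in stacks if stacks[k]),
                  key=lambda k: -stacks[k][0][0])
        neg_score, covered, hyp = heapq.heappop(stacks[key])
        if is_complete(covered, hyp):
            return -neg_score, hyp
        for score, new_cov, new_hyp in expand(-neg_score, covered, hyp):
            heapq.heappush(stacks[new_cov], (-score, new_cov, new_hyp))
    return None

# Toy usage: each expansion covers one more source position.
expand = lambda score, cov, hyp: [(score * 0.5, cov + 1, hyp + s) for s in "xy"]
is_complete = lambda cov, hyp: cov == 2
print(multi_stack_decode(expand, is_complete, (1.0, "")))  # → (0.25, 'xx')
```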
3 Optimizations and Related Search Errors
Stack decoding has a remarkable advantage: under certain conditions, the optimality of the search process can be guaranteed. However, it also has an important disadvantage: the search process has a high computational complexity.

Let ê be the optimal reference translation of a source sentence f and let e be the translation that the decoder returns. If e = ê, we say that a translation hit has occurred. In contrast, a search error occurs when e ≠ ê and Pr(e|f) < Pr(ê|f). There is a third possibility: if e ≠ ê and Pr(e|f) > Pr(ê|f), then we say that the error is due to the translation and/or language models.

In the following, we describe the possible optimizations (proposed in [3, 5]) that can be applied in order to reduce the search space, together with the corresponding associated search errors. In all cases, the description is corroborated by empirical results. To establish a compromise between efficiency and effectiveness, we carried out some experiments (on a Pentium III machine at 600 MHz) on a test set of 100 input sentences of length 8 from the Tourist task. As in previous works, we used IBM Model 4 as the translation model and a 3-gram language model.

The following optimizations do not require further explanation, but note that their search errors can be avoided by using their maximum values, which always involves a substantial increase in computational time:

– Reduce the number of possible source word translations (W) from the size of the target vocabulary to a prioritized candidate list, as defined in [3]. Raising W from 1 to 12 reduced the search errors from 82 to 18, but increased the time per sentence (on average) from 14 secs. to 31 secs. Higher values of W do not pay off.
– Reduce the number of possible zero-fertility words (Z), that is, consider only a certain number of zero-fertility target words from a prioritized list. A similar experiment varying Z from 1 to 50 reduced the search errors from 79 to 13, but increased the time per sentence from 2.7 secs. to 98 secs.
– Stack length limitation (S). The number of partial hypotheses to be stored during the search can be huge, so in order to avoid memory overflow problems, the maximum number of hypotheses that a stack may store has to be limited. It is important to note that, for a partial hypothesis, the more source words are aligned, the worse the score. Such partial hypotheses will be discarded sooner when an A* search algorithm is used, due to the effect of the S parameter. This cannot occur with a multi-stack algorithm, because only hypotheses with the same number of covered positions compete with each other.
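The stack-length limitation can be sketched as a bounded heap that drops the worst-scoring hypothesis once S entries are exceeded; the class name and representation below are illustrative assumptions, not the paper's code:

```python
import heapq

class BoundedStack:
    """Keeps at most S hypotheses; when full, the worst-scoring one is
    discarded, which is where S-induced search errors come from."""
    def __init__(self, S):
        self.S = S
        self.heap = []            # min-heap on score: heap[0] is the worst

    def push(self, score, hyp):
        if len(self.heap) < self.S:
            heapq.heappush(self.heap, (score, hyp))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, hyp))  # drop current worst

    def pop_best(self):
        best = max(self.heap)     # O(S); a real decoder would index both ends
        self.heap.remove(best)
        heapq.heapify(self.heap)
        return best

s = BoundedStack(2)
for score in (0.1, 0.5, 0.3):
    s.push(score, "h%.1f" % score)
print(sorted(s.heap))  # the 0.1 hypothesis was discarded
```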
– Restrict the number of source positions to be aligned per expansion (A) for every still-uncovered position of the source sentence. The number of hypotheses generated in every expansion can then be reduced by setting the maximum value A. This is especially important when dealing with long sentences. In the experiments, we obtained 67% search errors for A = 1, and 18% for A = 8 (the length of the sentences). No improvements were obtained for values of A higher than 4.

However, the optimization of the reduction of addZfert complexity proposed in [3] cannot avoid search errors. This optimization is based on the fact that the addZfert operation should not be applied systematically, but only when the probability of a partial hypothesis is increased. That is, the addZfert operation can yield a better hypothesis than the add operation if it increases the language model probability more than it decreases the translation model probability. This is because addZfert adds a single contribution to the translation model probability, consisting of the fertility term of the added zero-fertility word. We have observed that this optimization can cause search errors if a trigram language model is used; this was not observed in [3]. Let us suppose that, during the expansion process, the addZfert operation is not applied because of the optimization condition. In the next iteration, the uninserted zero-fertility word might substantially increase the language model probability, thus yielding a much better partial hypothesis than the one obtained without applying the addZfert operation. An immediate solution to this problem could be to use a bigram instead of a trigram. However, this solution would degrade the translation quality, as has been shown in other works.

The solution we propose is to postpone the decision to discard the addZfert operation (and the associated hypothesis) to the next iteration, when all the history for the trigram language model is known. Table 1 shows that the addZfert optimization (as proposed in [3]) does, in effect, create search errors, which can be avoided by using our solution. The remaining 18 search errors are due to the other optimizations. As expected, a substantial decrease in computational time is achieved.

In the experimentation process we carried out, we observed a special phenomenon which was not mentioned in previous works. In a relatively high number of sentences, the optimal translation of a given source sentence had two or more consecutive zero-fertility words, or ended with a zero-fertility word. With the operators proposed in the literature, the decoders will not be
Table 1. Experiments with addZfert optimizations

                 Without opt.  Heuristic opt.  Postponed opt.
Secs. per sent.  48.5          7.6             26.4
Hits             43            38              43
Model errors     39            39              39
Search errors    18            23              18
Table 2. add2Zfert and reverseAddZfert test

                 addZfert  add2Zfert  reverseAddZfert
Secs. per sent.  15.4      26.4       33.8
Hits             32        43         45
Model errors     30        39         38
Search errors    38        18         17
able to yield the optimal hypothesis, because no sequence of operators can produce such translations. We call these errors algorithm-inherent search errors. We propose introducing two new operators in order to reduce them:

– add2Zfert: this works like the addZfert operator, but one more zero-fertility word is added. This operator allows the algorithm to produce two consecutive zero-fertility words.
– reverseAddZfert: this is similar to the addZfert operator, but the zero-fertility word is added after the word which has a fertility greater than zero. The reverseAddZfert operator can also yield hypotheses in which the last word has zero fertility.

Table 2 shows the reduction of search errors due to these two new operators; the experiment was carried out keeping the other optimization parameters at the same values. In spite of this, a total of 8 search errors still occurred, because there were more than two consecutive zero-fertility words. Thus, the definition of a new addNZfert (N > 2) operation would be needed in order to completely avoid search errors. Obviously, the time per sentence increases in all cases.

We also implemented two optimizations proposed in [5] which do not provoke any search errors and involve a substantial speed-up. The first is hypothesis recombination, which discards hypotheses that cannot be distinguished by their language model state or by their translation model state. The second is the use of admissible heuristic functions, which estimate the cost of completing a partial hypothesis. A heuristic function is called admissible if it never underestimates the probability of a completion of a partial hypothesis. Here we used T, TF, and TFL; these functions take into account the probabilities of the lexicon model, the fertility model, and the language model, respectively, in order to calculate the heuristic value.
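Hypothesis recombination can be sketched by keying each hypothesis on a state that combines its source coverage with its language-model history (the last n-1 target words for an n-gram model) and keeping only the best scorer per key. The state representation below is a simplified assumption; a real decoder would also include the translation-model state:

```python
def recombine(hypotheses, n=3):
    """Keep, for each indistinguishable state, only the best hypothesis.

    A hypothesis is (score, covered_positions, target_words); two
    hypotheses share a state if they cover the same source positions and
    end in the same n-1 target words (the trigram history for n=3).
    """
    best = {}
    for score, covered, words in hypotheses:
        key = (frozenset(covered), tuple(words[-(n - 1):]))
        if key not in best or score > best[key][0]:
            best[key] = (score, covered, words)
    return list(best.values())

hyps = [
    (0.20, {1, 2}, ["the", "green", "house"]),
    (0.10, {1, 2}, ["a", "green", "house"]),   # same state, lower score
    (0.15, {1, 3}, ["the", "green", "house"]), # different coverage: kept
]
print(len(recombine(hyps)))  # → 2
```

Recombination is safe because the discarded hypothesis could never outscore the kept one: every future extension contributes the same probability terms to both.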
The cost per sentence is significantly reduced from 26 secs. without any heuristic function to 11 secs. with the TFL heuristic. For the rest of the experiments, we adopted the following compromise values: W = 12, Z = 24, A = 5, the TFL heuristic, S(A*) = 10000, and S(Mstack) = 1000 for the Tourist task. Similar experimentation was carried out for the Hansards task, yielding the following values: W = 8, Z = 8, A = 3, and TF (due to the extremely high cost of calculating the language model probabilities required by the TFL heuristic). These
Table 3. Complexity per iteration with and without optimizations

Algorithm      Without optimizations                  With optimizations
A*             m|E|^2 · 2 log(x)                      AWZ · 2 log(S_stack)
M-stack        N·2^m + m|E|^2 · 2(g(2^m) + log(x))    N·2^m + AWZ · 2(g(2^m) + log(S_Mstack))
M-stack + thr  2^m·f() + m|E|^2 · 2(g(2^m) + log(x))  2^m·f() + AWZ · 2(g(2^m) + log(S_Mstack))
values allowed for a reasonable computation time without quantitatively degrading the translation quality.
4 Complexity per Iteration
A study of the worst-case complexity per iteration has allowed us to better understand the effects of the optimizations. See Table 3 for the complexity of the algorithms with and without optimizations (expressed in terms of m). The symbols used are:

– m: number of words of the source sentence.
– |E|: cardinality of the target vocabulary.
– f(): complexity of the function that calculates the threshold.
– g(): complexity of retrieving the appropriate stack for inserting a given hypothesis (only for multiple-stack algorithms). If a hash table is used, the mean cost of this operation can be considered constant.
– x: the number of hypotheses in the stack. Without optimizations, this value is not bounded; with optimizations, it is fixed to the value of S.

If no optimization is applied, the complexity is prohibitive, especially for multiple-stack algorithms, where we have to iterate over all the stacks in order to select the hypotheses that will be expanded later; this task introduces an exponential term in the complexity. On the contrary, if we apply the optimizations, the complexity per iteration is considerably reduced.
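To illustrate the gap, the two A* expressions from Table 3 can be evaluated for the Tourist-task settings used above (m = 8, |E| = 514, A = 5, W = 12, Z = 24, S = 10000). The numbers are only indicative, since the table gives worst-case operation counts, not constants, and the unbounded-stack size x is an arbitrary assumption here:

```python
from math import log2

m, E = 8, 514                    # source length, target vocabulary size
A, W, Z, S = 5, 12, 24, 10_000   # optimization parameters (Tourist task)
x = 10**6                        # assumed unbounded-stack size for comparison

without = m * E**2 * 2 * log2(x)    # m|E|^2 * 2 log(x)
with_opt = A * W * Z * 2 * log2(S)  # AWZ * 2 log(S_stack)
print(without / with_opt)           # roughly three orders of magnitude fewer ops
```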
5 Efficiency of Stack-Based Decoders
In order to compare the three different stack decoders introduced in section 2, a simple experiment on the Tourist task was carried out. The results are shown in Table 4. The threshold algorithm obtained the worst results. The thresholding function used here consisted of taking the best hypothesis of each stack for expansion. Such a strategy requires many more iterations than the approaches without a threshold, and further work must be done with algorithms of this kind. In contrast, the A* and multi-stack (without threshold) algorithms have similar costs, although we expected the A* algorithm to obtain better results due to its lower complexity. A more detailed study is shown in Table 5, where we processed five different test corpora of sentences with a fixed length.
Table 4. Algorithm influence using a test set of 100 sentences of length 8

                 A*    multi-stack  threshold
Secs. per sent.  28.2  26.4         557.2
Hits             43    43           43
Model errors     39    39           39
Search errors    18    18           18
Table 5. Comparison between A* and multi-stack algorithms

Sent. length                   4     6     8      10     12
A*
  Secs. per sent.              0.64  3.90  28.22  62.32  300.84
  Expansion time (secs.)       0.63  3.87  28.10  62.12  300.15
  Select hyp. time (secs.)     0     0     0      0      0
  µ-secs. per push op.         16    18    23     26     32
  Discarded hyps. due to S     0     0     169K   624K   4.8M
multi-stack
  Secs. per sent.              0.59  3.63  27.00  64.50  374.77
  Expansion time (secs.)       0.57  3.59  26.45  62.02  330.43
  Select hyp. time (secs.)     0.01  0.04  0.46   2.24   43.30
  µ-secs. per push op.         17    25    26     27     31
  Discarded hyps. due to S     228   8.9K  202K   496K   3.5M
As expected, the decoding cost increases with the length of the sentence. However, with a single-stack algorithm, no time is spent on selecting the best hypothesis for expansion, whereas multi-stack algorithms spend a significant part of their decoding time doing this for long sentences. In any case, the value of the parameter S and its effect on the number of discarded hypotheses seems to be more important than the hypothesis selection. Note that the value of S is closely related to the algorithm type and, theoretically, can be lower for multiple-stack algorithms than for the A* algorithm. Further work must be done in order to determine the minimum parameter values for each algorithm. Finally, the cost of the push operation is very similar for all the algorithms. This result is in accordance with the theoretical complexity, due to the logarithmic factor in the expression.
6 Translation Results
The experiments were carried out using two different tasks:

– The Tourist task consists of a semi-automatically generated Spanish–English corpus. The domain of the corpus is a human-to-human communication situation at the reception desk of a hotel. For training purposes, 10,000 random sentence pairs were drawn from this corpus. The input and output vocabulary sizes were 689 and 514, respectively.
Table 6. Translation quality for the Tourist and Hansards tasks for different test sentence lengths and different values of S (S = 10K and S = 100K, left and right columns respectively; "-" means equal value). In all cases, 100 test sentences were translated using the A* algorithm

Tourist task
Sent. length   6            8            10           12
WER            16.9  -      11.3  -      6.6   -      10.0  -
PER            16.6  -      11.3  -      6.5   -      9.8   -
SER            56    -      57    -      50    -      63    -
Secs×sent.     2.10  2.26   10.3  12.6   20.2  24.7   28.2  34.8

Hansards task
Sent. length   6            8            10           12
WER            51.6  -      52.9  53.2   63.8  64.8   63.9  63.8
PER            50.9  -      50    50     61.5  61.7   57.3  58.3
SER            89    -      84    84     100   100    98    97
Secs×sent.     88.3  104    141   273    374   1335   912   4569
– The French–English Hansards task consists of debates in the Canadian Parliament. This task has a very large vocabulary of about 100,000 French words and 80,000 English words. A sub-corpus of 128,000 sentences was selected for training purposes.

For both tasks, the training of the different translation models was carried out using the GIZA++ software (http://www-i6.informatik.rwth-aachen.de/ och). One hundred sentences (disjoint from the training corpus) of length up to 12 were selected for testing. A 3-gram language model was used, trained on the English counterparts of both tasks.

To evaluate the translation quality, three different error criteria were used: WER (Word Error Rate), computed as the minimum number of substitution, insertion, and deletion operations that have to be performed to convert the generated string into a reference target string; PER (Position-independent Error Rate), similar to WER but ignoring word order; and SER (Sentence Error Rate), defined as the number of times that the translation generated by the decoder is not identical to the reference sentence.

The translation quality for both the Tourist task and the Hansards task is shown in Table 6. For the Tourist task, the effects of the heuristic optimizations produced significant error rates; further work must be done to improve the training stage, including a preprocessing of the corpus. On the other hand, the Hansards task is much more complex than the Tourist task, so it has higher error rates. The results are shown for two different values of the parameter S. Note the small influence of this parameter on the translation quality, and the great reduction in processing time.
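The WER criterion described above is the standard word-level edit distance, normalized by the reference length. A minimal sketch:

```python
def wer(hyp, ref):
    """Word Error Rate: minimum number of substitutions, insertions and
    deletions needed to turn `hyp` into `ref`, divided by the reference
    length (classic dynamic-programming edit distance over words)."""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edit distance between h[:i] and r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(h)][len(r)] / len(r)

print(wer("the green house", "the red house"))  # one substitution out of 3 words
```

PER would be computed analogously but on word multisets, ignoring order.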
7 Conclusions
An empirical and theoretical study of stack-based algorithms has been presented, paying special attention to those optimization problems that were not discussed in previous works. According to the experimental results, we can conclude that:

– Stack-based decoders cannot yield optimal translations at a reasonable time cost per sentence. Even if we do not apply any optimization, the
way the operators are defined produces algorithm-inherent search errors. We propose addNZfert-like operations to deal with such errors. In our opinion, errors of this kind could also be reduced by using better-trained or better translation models.
– The main source of search errors seems to be the zero-fertility words and their related optimizations; a possible solution has also been proposed.
– The model errors that we obtained were always higher than the search errors. For a complex task like Hansards, this problem is much more important. Further work must be done on the statistical models if we want to improve the translation quality.
– Multi-stack algorithms have the negative property of spending a significant amount of time selecting the hypotheses to be expanded. In contrast, for the A* algorithm, it is not possible to reduce the S parameter as much as in the multi-stack case in order to speed up the search without loss of translation quality.

For future work, we plan to investigate in detail the specific effect of the S parameter on the different algorithms, as well as the use of different thresholding functions for multi-stack algorithms. We also plan to make a more exhaustive comparison, paying particular attention to the influence of the optimizations. We have also started to apply these algorithms to translation-assistant applications, where the prediction of short partial hypotheses is used instead of whole-sentence translations; these are yielding very promising results.
References

[1] Adam L. Berger, Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, John R. Gillett, A. S. Kehler, and R. L. Mercer. Language translation apparatus and method of using context-based translation models. United States Patent No. 5510981, April 1996.
[2] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.
[3] Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fast decoding and optimal decoding for machine translation. In Proc. of the 39th Annual Meeting of the ACL, pages 228–235, Toulouse, France, July 2001.
[4] F. Jelinek. A fast sequential decoding algorithm using a stack. IBM Journal of Research and Development, 13:675–685, 1969.
[5] Franz J. Och, Nicola Ueffing, and Hermann Ney. An efficient A* search algorithm for statistical machine translation. In Data-Driven MT Workshop, pages 55–62, Toulouse, France, July 2001.
[6] Ye-Yi Wang and Alex Waibel. Fast decoding for statistical machine translation. In Proc. of the ICSLP, pages 1357–1363, Sydney, Australia, November 1998.