Translation Table Compression under End-Tagged Dense Code

Tito Valencia, Lorena O. Cerdeira, Eva L. Iglesias, Francisco J. Rodríguez
Dept. of Computer Science, University of Vigo, Ourense, Spain
Email: [email protected]
Abstract—In recent years, the quality of Phrase-Based Statistical Machine Translation has increased dramatically, partially due to the significant growth of the available parallel corpora. In terms of space, this advantage becomes a disadvantage, because the increased size of the parallel corpora implies a very large increase in the size of the translation tables. Existing solutions reduce the size of the translation tables by limiting the length of the phrases that are incorporated into them. This reduces the space, but at the expense of increasing the risk of worse translations for long sentences. In this paper, we propose compressing the phrase-based translation tables using End-Tagged Dense Code to encode the phrases in the source and target languages. This technique allows us to reduce the size of the translation tables, and therefore it becomes possible to add longer phrases.
I. INTRODUCTION

When a meeting with people from different countries takes place, the responsibility for mutual understanding lies on human interpreters, who translate in real time what is being said. The human effort required is both high and expensive, and this is the origin of Machine Translation (MT). The idea behind MT is simple: to put a machine in the place of a human interpreter. This goal, although easy to state, is hard to accomplish, since MT faces several limitations that do not affect human interpreters. All of these limitations are related to the complex nature of natural languages. Natural languages contain words with several meanings and translations, where an accurate interpretation is needed. There are also idioms, expressions which cannot be translated directly but need to be placed within a context in order to be translated properly. Sometimes the right translation is given not by the words the speaker is saying but by how these words are being said. A human interpreter possesses a huge amount of knowledge that would be extremely difficult to provide a machine with. Taking these limitations as a fact rather than a situation that might change, a machine translation system ultimately has to produce translations without an important part of that knowledge. To overcome these restrictions, machine translation has on its side the processing capacity inherent to computer systems and statistical decision theory.
Statistical decision theory tries to provide the machine translation system with the means to make decisions with incomplete knowledge, which is the goal of Statistical Machine Translation (SMT). SMT has become a very active research topic and the field has experienced rapid progress. Research in this field, as well as its quality, has been improving due to the availability of huge amounts of data for training statistical models [1].

SMT is based on the working hypothesis that every sentence E(1) in a target language is a possible translation of a given sentence F in a source language. A human interpreter, relying on his own knowledge, would choose the better sentence when there are two possible translations of a given sentence; in SMT this choice is made according to the probability assigned to the candidate sentences, which is learned from a bilingual text corpus [4]. SMT models can be classified by how these probabilities are applied: there are word-based models and phrase-based models.

(1) In the original work on IBM Model 1, Brown et al. [2] translated from French to English; hence they use f and e to denote the source and target languages respectively, and distinguish sentences from words only by writing sentences in bold. In [3] the same notation is used, but translating from a foreign language into English, which keeps the f and e notation. In our work we keep that notation, but we change f and e to F and E to better distinguish sentences from words.

A. Word-Based Models

A word-based model applies the probabilities to words, which are taken as the translation units of the process. This type of translation system depends heavily on alignment. An alignment consists of mapping words in the source language to words in the target language, based on the translated sentence pairs of a bilingual text corpus. For every sentence pair in the corpus, each word in the source language is mapped to its equivalent word in the target-language sentence. Every time two words are aligned, the probability that each one is the best translation of the other rises; thus, as the number of times two words f and e are aligned increases, so does the probability of a good translation of
these words. These probabilities are stored in a translation table that is used as a guide when performing a translation. In spite of the advance that this approach represented for the field of SMT, experience made it clear that the model was insufficient, since a lot of local context was lost during translation.

B. Phrase-Based Models

Due to the problems of the previous model, the phrase-based model emerged. This model takes as its basic working unit not the word but a longer unit, typically a sequence of consecutive words, or phrase. It has the advantage that, instead of learning how each word is translated, it learns how whole phrases are translated, even those containing many words. The success of phrase-based translation is closely related to the quality of the phrase translation table. One way of constructing this table consists of creating a phrase-level alignment between each pair of sentences in the parallel corpus; from that alignment it is then possible to identify consistent pairs of phrases. The selected phrases may have different sizes. Short phrases show up more frequently and are very useful for translating new sentences. Long phrases are not useless either; they are quite necessary in order to capture more context, and they allow bigger chunks to be translated at once. Obviously, allowing the extraction of phrase pairs of any length means that the number of extracted phrases is very high. Nevertheless, most long phrases identified during training might never appear again, so to reduce the number of extracted phrases and keep a manageable translation table it is possible to limit the maximum length of the phrases [3]. For us, the main advantage of this approach is that if we want to translate a sentence, whatever its size, its translation might already be in the translation table, and so a correct translation is guaranteed. For that to happen, the number of stored phrases must be as large as possible, which is the reason why the translation table needs to be compressed.

In the next sections we focus on the task of compressing the phrase-based translation table. First, a brief introduction to End-Tagged Dense Code, the compression scheme we have selected, is given. Next, the compression and decompression processes applied to the phrase-based translation table are explained along with a small example.

II. END-TAGGED DENSE CODE

In recent years, new compression techniques particularly suitable for natural language texts have been developed. These techniques allow searching the compressed text directly, thus avoiding the need to decompress it before searching. To do this, they use words as the symbols to be compressed [5] (rather than characters, as in the classical methods).
In 2003, Iglesias [6] proposed a word-based compression technique that encodes the original text in such a way that shorter codewords are assigned to the most frequent symbols. The generated codewords are prefix codes, which means that no codeword is a prefix of another one. The advantage of using prefix codes is that any codeword can be decompressed without any reference to the following codewords, because the end of a codeword is always recognizable. Thus, if the string to decompress starts with 01 and 01 is the codeword assigned to an input symbol A, we are certain that it corresponds to the symbol A, and there is no need to check whether it could be the beginning of a longer codeword. Unlike the classic Huffman code [7], where codewords are composed of one or more bits, the compression scheme proposed by Iglesias assigns one or more bytes to every word, and consequently it is said to be byte oriented. In 2000, Moura et al. [8], [9] had already proposed a similar approach, the Tagged Huffman Code. In this scheme the codewords are composed of one or more bytes, keeping the first bit of every byte as a flag bit, which clearly marks the beginning of every codeword inside the compressed text. Specifically, the first byte of every codeword has its flag bit set to 1, while the remaining bytes of the codeword have their flag bit set to 0. For the 7 remaining bits this proposal uses the classic Huffman code, which ensures the prefix code property.

The proposal made by Iglesias significantly improves on the Moura et al. technique, encoding a higher number of symbols with every byte. The first bit is also used as a flag bit, but the difference is that it is the last byte of the codeword that starts with 1, while the remaining ones start with 0. This change has important consequences: the flag bit alone is now enough to ensure that the codeword is a prefix code, no matter what we do with the other 7 bits, so Huffman coding is no longer necessary. The optimal code assignment (the one minimizing the length of the output) is obtained with the following procedure [10]:

1) The words in the vocabulary are ordered by their frequency, most frequent first.
2) The codewords from 10000000 to 11111111 are assigned sequentially to the first 128 words of the vocabulary, using the 2^7 one-byte possibilities.
3) Words at positions 128+1 to 128+128^2 are encoded using two bytes, exploiting the 2^14 combinations from 00000000:10000000 to 01111111:11111111.
4) Words at positions 128+128^2+1 to 128+128^2+128^3 are encoded using three bytes, exploiting the 2^21 combinations from 00000000:00000000:10000000 to 01111111:01111111:11111111.

And so on. Basing the compression on the frequency of occurrence of the words makes this a semi-adaptive method; that is, the coding phase must be split in two steps. In the first one the vocabulary is created and sorted; in the second one the text is encoded using that vocabulary.
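As an illustration of steps 2) to 4), the following Python sketch (ours, not part of the original paper; the function name and parameters are illustrative) computes how many bytes the codeword of a word receives from its position in the frequency-sorted vocabulary.

def etdc_codeword_length(position, base=128):
    # Number of bytes in the ETDC codeword of the word at the given 1-based
    # vocabulary position: positions 1..128 take one byte, 129..128+128**2
    # take two bytes, and so on. `base` is 128 for real 8-bit bytes, or 4
    # for the simplified 3-bit "bytes" used later in the examples.
    length, block = 1, base
    while position > block:     # skip the block of words encoded with `length` bytes
        position -= block
        block *= base           # the next block holds `base` times more words
        length += 1
    return length

assert etdc_codeword_length(128) == 1 and etdc_codeword_length(129) == 2
assert etdc_codeword_length(10, base=4) == 2   # cf. the example of Algorithm 2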
The coding phase is very fast, since obtaining the codewords is extremely simple: it is only necessary to sort the vocabulary words by frequency and then sequentially assign the codewords. See [6], [10], [11] for more details.

III. APPLYING END-TAGGED DENSE CODE TO THE PHRASE-BASED TRANSLATION TABLE

The size of the phrase-based translation table is a limitation for SMT since, as mentioned before, it forces cutting down the size of the phrases that are included in the table [3]. The strength of SMT systems lies in the idea of keeping stored translations for every possible sentence that may appear; therefore, cutting down the size of the phrases may reduce the quality of the translation of long sentences. To reduce the impact of this limitation, we propose compressing the phrase-based translation table using End-Tagged Dense Code. This decreases the size of the tables, so we will be able to include longer phrases and thereby improve the translation of long sentences.

To explain the compression and decompression processes, we will use the phrase-based translation table example shown in Table I. This translation table has been generated by Moses (an open-source statistical machine translation toolkit) [12] and may be downloaded from its website (http://www.statmt.org/moses/). The first column contains the phrases in the source language (German, in this case), the second one their translations in the target language (English), and the third one the probability of that translation.

A. Compression Process

Since End-Tagged Dense Code is semi-adaptive, the compression of the translation table must be done in two steps. In the first one, the different words (vocabulary) that appear in the table are collected and sorted by decreasing frequency of occurrence. In the second one, the encoding itself is carried out. The translation tables contain words from two different languages (source and target), so we need to generate two vocabularies, one for each language. Thus, in the end we obtain the encoded table together with the two vocabularies. In our case, these are the vocabularies VG and VE, which keep the words in German and English respectively. Tables II and III show the vocabularies generated from the example in Table I. The first column contains the words and the second one their frequency of occurrence in the translation table. The second column is actually not necessary, although in the example we include it to record the frequencies. Once the vocabularies are generated, we start replacing the words with their corresponding codewords, which are generated from the position that each word occupies in its vocabulary.
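As a sketch of this first step (ours, not from the paper; the '|||'-separated row layout assumed here is the usual Moses phrase-table format), the two frequency-sorted vocabularies could be built as follows in Python:

from collections import Counter

def build_vocabularies(phrase_table_lines):
    # Build the source (e.g. German) and target (e.g. English) vocabularies,
    # sorted by decreasing frequency, from rows such as
    # "es gibt ||| there is ||| 1.0".
    src_freq, tgt_freq = Counter(), Counter()
    for line in phrase_table_lines:
        fields = [f.strip() for f in line.split("|||")]
        src_freq.update(fields[0].split())
        tgt_freq.update(fields[1].split())
    rank = lambda freq: sorted(freq, key=lambda w: (-freq[w], w))
    return rank(src_freq), rank(tgt_freq)

vg, ve = build_vocabularies(["das ist ||| this is ||| 0.8",
                             "es gibt ||| there is ||| 1.0"])

Note that ties in frequency may be ordered differently from Tables II and III; any fixed tie-breaking works as long as the coder and the decoder share the same vocabulary.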
TABLE I
PHRASE-BASED TRANSLATION TABLE EXAMPLE

Source Phrase    Target Phrase    Probability
der              the              0.3
das              the              0.4
das              it               0.1
das              this             0.1
die              the              0.3
ist              is               1.0
ist              's               1.0
das ist          it is            0.2
das ist          this is          0.8
es ist           it is            0.8
es ist           this is          0.2
ein              a                1.0
ein              an               1.0
klein            small            0.8
klein            little           0.8
kleines          small            0.2
kleines          little           0.2
haus             house            1.0
alt              old              0.8
altes            old              0.2
gibt             gives            1.0
es gibt          there is         1.0
TABLE II
GERMAN VOCABULARY SORTED BY WORD FREQUENCY

Word       Frequency
ist        6
das        6
es         3
ein        2
klein      2
kleines    2
gibt       2
der        1
die        1
haus       1
alt        1
altes      1
To perform the coding, we may use a process similar to the one used to convert values expressed in a higher base into a lower base. The idea consists of performing several consecutive divisions of the number we want to encode (in our case, the position i that the word occupies in the vocabulary) by a divisor that depends on the number of output symbols. If it were a conversion from decimal to binary the divisor would be 2, while for our coding it is 2^7 = 128 (since the eighth bit must be kept as a flag bit). The remainder obtained after every division corresponds to the value of one symbol (byte) of the final code. The algorithm produces the symbols that compose the code from the least significant one to the most significant one. The flag bit of the least significant byte (the last one in the code) is set to 1 by adding 128 to the obtained value.
TABLE III
ENGLISH VOCABULARY SORTED BY WORD FREQUENCY

Word       Frequency
is         6
the        3
it         3
this       3
small      2
little     2
old        2
's         1
a          1
an         1
house      1
gives      1
there      1
Algorithm 1 shows the pseudocode of the algorithm that encodes a word placed at position i of the corresponding vocabulary.

Algorithm 1 Pseudocode of the word coding algorithm using End-Tagged Dense Code.
1: Codify(i)
2: // Calculate the value of the last byte of the code (flag bit is 1)
3: i ← i − 1
4: output (i mod 128) + 128   // add flag
5: i ← i div 128
6: // Calculate the value of the remaining bytes of the code (flag bit is 0)
7: while i > 0 do
8:   i ← i − 1
9:   output (i mod 128)
10:  i ← i div 128
11: end while
12: EndCodify

As an example, Algorithm 2 details how to generate the codeword that corresponds to the word at position i = 10. To simplify, we use 3-bit bytes instead of 8-bit bytes, and therefore the number of symbols to consider is 2^(3−1) = 2^2 = 4 instead of the usual 2^7 = 128. The result of the execution of the algorithm is 001 101.

According to Algorithm 1, the time required to generate the codeword of the word at position i of the vocabulary is O(log i), since in each iteration of the loop i is divided by 128. The complexity of this algorithm may also be expressed as O(l), where l is the codeword length in bytes, since each iteration produces one byte of the output codeword. In our example, the codewords associated with each word can be seen in Tables IV and V.
Algorithm 2 Example of the coding of word number 10 using End-Tagged Dense Code.
1: Codify(10)
2: i ← 10 − 1 = 9
3: output (9 mod 4) + 4 = 5 ⇒ byte 101
4: i ← 9 div 4 = 2
5: i > 0 ⇒ i ← 2 − 1 = 1
6: output (1 mod 4) = 1 ⇒ byte 001
7: i ← 1 div 4 = 0
8: i ≤ 0 ⇒ EndCodify

TABLE IV
GERMAN VOCABULARY WITH THE CORRESPONDING CODEWORDS

Word       Codeword
ist        100
das        101
es         110
ein        111
klein      000 100
kleines    000 101
gibt       000 110
der        000 111
die        001 100
haus       001 101
alt        001 110
altes      001 111
TABLE V
ENGLISH VOCABULARY WITH THE CORRESPONDING CODEWORDS

Word       Codeword
is         100
the        101
it         110
this       111
small      000 100
little     000 101
old        000 110
's         000 111
a          001 100
an         001 101
house      001 110
gives      001 111
there      010 100
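For reference, a direct Python transcription of Algorithm 1 (ours; a sketch in which the codeword is returned as a list of byte values rather than written to an output stream) reproduces the examples above:

def codify(i, base=128):
    # End-Tagged Dense Code word coder (Algorithm 1): codeword of the word
    # at 1-based vocabulary position i, as a list of byte values.
    i -= 1
    codeword = [(i % base) + base]      # last byte: flag bit set (add base)
    i //= base
    while i > 0:                        # remaining bytes: flag bit 0
        i -= 1
        codeword.insert(0, i % base)
        i //= base
    return codeword

# Algorithm 2 with 3-bit "bytes" (base 4): position 10 -> bytes 001 and 101.
assert codify(10, base=4) == [1, 5]
# With 8-bit bytes, the first 128 words get the one-byte codewords 128..255.
assert codify(1) == [128] and codify(128) == [255] and codify(129) == [0, 128]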
Once the compression of the table is complete, the first two columns contain the binary codewords associated with their corresponding words. Table VI shows the compressed version of the previous translation table.

B. Decompression Process

The decompression of the phrase-based translation table may be done in a single step in which only the necessary codewords are decoded using the corresponding vocabularies; since the codewords generated by End-Tagged Dense Code are prefix codes, we can decompress random sections of the table.
TABLE VI
COMPRESSED PHRASE-BASED TRANSLATION TABLE UNDER END-TAGGED DENSE CODE EXAMPLE

Source Phrase      Target Phrase      Probability
000 111            101                0.3
101                101                0.4
101                110                0.1
101                111                0.1
001 100            101                0.3
100                100                1.0
100                000 111            1.0
101 100            110 100            0.2
101 100            111 100            0.8
110 100            110 100            0.8
110 100            111 100            0.2
111                001 100            1.0
111                001 101            1.0
000 100            000 100            0.8
000 100            000 101            0.8
000 101            000 100            0.2
000 101            000 101            0.2
001 101            001 110            1.0
001 110            000 110            0.8
001 111            000 110            0.2
000 110            001 111            1.0
110 000 110        010 100 100        1.0
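To make the construction of Table VI concrete, the following sketch (ours, reusing the codify function above and the simplified 3-bit bytes) compresses the source side of the last row of Table I:

# German vocabulary in the order of Table II (most frequent first).
GERMAN_VOCAB = ["ist", "das", "es", "ein", "klein", "kleines",
                "gibt", "der", "die", "haus", "alt", "altes"]

def compress_phrase(phrase, vocab, base=4):
    # Concatenate the codewords of the words of a phrase.
    out = []
    for word in phrase.split():
        out.extend(codify(vocab.index(word) + 1, base))
    return out

# "es gibt" -> byte values 6, 0, 6, i.e. 110 000 110, the last row of Table VI.
assert compress_phrase("es gibt", GERMAN_VOCAB) == [6, 0, 6]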
First, we decide what we want to decompress. Going back to the previous example of Table I, we might want to decompress the last row. Second, the text is parsed from left to right, separating the codewords. To do this we read the "bytes" one by one, checking whether the first bit is 0 or 1. When it is 0, we read the following "byte" and add it to the codeword; when it is 1, it is the last "byte" of the codeword and the next "byte" belongs to another codeword. For example, we obtain the codewords "110" and "000 110" in the first column and "010 100" and "100" in the second. Once the codewords are obtained, the third step is to decode them. To identify which word is associated with a codeword, a simple algorithm can be run that returns the position i occupied by that word in the vocabulary. The algorithm is similar to the one used to convert numerical values from one base to a larger base, and it is introduced with the following example. The decimal value p corresponding to the binary number 0100101 can be calculated as:

p = 1×2^0 + 0×2^1 + 1×2^2 + 0×2^3 + 0×2^4 + 1×2^5 + 0×2^6 = 37   (1)

In our case, the code is likewise analyzed byte by byte, starting with the one that occupies the least significant position, and the value of each byte is multiplied by a power of 128 that depends on the position the byte occupies in the codeword. A pseudocode implementation of the algorithm is shown in Algorithm 3.
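The two steps just described (separating codewords by their flag bit and turning each codeword back into a vocabulary position, as formalized in Algorithm 3 below) can be sketched in Python as follows; the code is ours and uses the simplified 3-bit bytes of the running example:

def split_codewords(byte_values, base=4):
    # A codeword ends at the first byte whose flag bit is set (value >= base).
    codewords, current = [], []
    for b in byte_values:
        current.append(b)
        if b >= base:
            codewords.append(current)
            current = []
    return codewords

def decode_position(codeword, base=4):
    # Recover the 1-based vocabulary position of a codeword, analyzing the
    # bytes from the least significant one, as Algorithm 3 does.
    p = r = 0
    for power, b in enumerate(reversed(codeword)):
        p += (b % base) * base ** power
        r += base ** power
    return p + r

# Source column of the last row of Table VI: 110 000 110 -> positions 3 and 7,
# i.e. "es gibt" (cf. Table VII).
assert [decode_position(cw) for cw in split_codewords([6, 0, 6])] == [3, 7]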
TABLE VII
GERMAN DECODING EXAMPLE

Codeword (binary)    Position (decimal)    Word
110                  3                     es
000 110              7                     gibt
TABLE VIII
ENGLISH DECODING EXAMPLE

Codeword (binary)    Position (decimal)    Word
010 100              13                    there
100                  1                     is
Algorithm 3 obtains the position i associated with a codeword.

Algorithm 3 Pseudocode implementation of the End-Tagged Dense Code decoding algorithm.
1: Decoding()
2: r ← 0   // accumulates 128^0 + 128^1 + ...
3: p ← 0   // integer value of the codeword
4: l ← 0   // number of analyzed bytes
5: repeat
6:   input x   // read the bytes of the codeword, from the last one backwards
7:   p ← p + (x mod 128) × 128^l
8:   r ← r + 128^l
9:   l ← l + 1
10: until the preceding byte does not exist or has its flag bit set to 1
11: // Ends when the last byte of the previous codeword is found
12: return the position i = r + p
13: EndDecoding

Algorithm 3 decodes a codeword c in O(log c) = O(l) time, where l is the length of the codeword in bytes, because, as in coding, each iteration of the loop analyzes one byte of c.

In our example, to decompress the first column (of the last row of the table) the German vocabulary is needed, and to decompress the second column the English vocabulary is needed. For each codeword, Tables VII and VIII show the correspondence between the code we want to decode, the vocabulary position obtained by applying Algorithm 3, and the word found at that position of the corresponding vocabulary. Thus, when decompressing, the codewords "110 000 110" correspond to es gibt and "010 100 100" to there is.

IV. CONCLUSION AND FUTURE WORK

This paper has presented the application of a compression technique, End-Tagged Dense Code, to phrase-based translation tables, reducing the space required by between 30 and 60 percent.
This new research direction opens up many possibilities and issues that require further research and experimentation.

ACKNOWLEDGMENT

This work has been supported by the project XUGA 08SIN009305PR, funded by the Xunta de Galicia.

REFERENCES

[1] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, vol. 30, no. 4, pp. 417-449, December 2004.
[2] P. F. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: Parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263-311, 1993.
[3] P. Koehn, Statistical Machine Translation. Cambridge University Press, 2010.
[4] A. G. Ramis, "Introducing linguistic knowledge into statistical machine translation," Ph.D. dissertation, Universitat Politècnica de Catalunya, October 2006.
[5] A. Moffat, "Word-based text compression," Software, Practice and Experience, vol. 19, no. 2, pp. 185-198, 1989.
[6] E. L. Iglesias, "Una nueva técnica de compresión de textos con soporte text retrieval y su adaptación a lenguas romances," Ph.D. dissertation, Universidade da Coruña, June 2003.
[7] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the IRE, vol. 40, no. 9, pp. 1098-1101, September 1952.
[8] E. S. de Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates, "Fast and flexible word searching on compressed text," ACM Transactions on Information Systems, vol. 18, no. 2, pp. 113-139, April 2000.
[9] N. Ziviani, E. S. de Moura, G. Navarro, and R. Baeza-Yates, "Compression: A key for next-generation text retrieval systems," IEEE Computer, vol. 33, no. 11, pp. 37-44, 2000.
[10] N. R. Brisaboa, E. L. Iglesias, G. Navarro, and J. R. Paramá, "An efficient compression code for text databases," in Proc. of the 25th European Conference on IR Research (ECIR'03), LNCS 2633, F. Sebastiani, Ed., 2003, pp. 468-481.
[11] A. Fariña, "New compression codes for text databases," Ph.D. dissertation, Universidade da Coruña, April 2005.
[12] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: Open source toolkit for statistical machine translation," in Proceedings of ACL 2007, Prague, June 2007, pp. 177-180.