Similarity-based Bilingual Word Alignment Framework for SMT

Prasert Luekhong 1,2, Taneth Ruangrajitpakorn 3, Thepchai Supnithi 3 and Rattasit Sukhahuta 2

1 College of Integrated Science and Technology, Rajamangala University of Technology Lanna, Chiang Mai, Thailand; e-mail: [email protected]
2 Computer Science Department, Faculty of Science, Chiang Mai University, Chiang Mai, Thailand; e-mail: [email protected]
3 Language and Semantic Technology Laboratory, National Electronics and Computer Technology Center, Thailand; e-mail: {taneth.rua, thepchai}@nectec.or.th

Abstract

Pooja, a similarity-based bilingual word alignment framework for SMT, is presented in this paper. It enhances the bilingual alignment of corresponding words by using a similarity score computed from a bilingual dictionary. The design of Pooja fits the SMT development process since its input and output are in standard formats. In our experiment, Pooja shows quality equivalent to GIZA in terms of BLEU score. In a human evaluation, Pooja was rated slightly better, returning a more fluent word selection.

Keywords: word alignment system, parallel mapping tool, machine translation enhancer, similarity score, bilingual dictionary

1 Introduction

The statistical machine translation (SMT) approach has been the major approach in machine translation research for decades, ahead of other approaches [1], since SMT shows potential for fluent, natural translation [2][3], has fewer development requirements [4], etc. Training any SMT system can be separated into three main parts: 1) the alignment phase, 2) the rule extraction phase and 3) the model generation phase [2]. The alignment phase attempts to align bilingual text to identify translation units. It is a crucial process that determines the translation words selected in the result [5]. In recent SMT development, the word alignment process is handled by open-source tools, i.e. GIZA [6] and the Berkeley aligner [7]. Their technique is to automatically find correspondences in a parallel text by focusing on the lexicon existing in the given bilingual text.

The abovementioned method relies on the correspondence of the given words, which are not always translations of each other, since the method relies solely on frequency. In our experiment, several mismatched corresponding words were found in the alignment table, as shown in Table 1.

Table 1. Thai-English alignment table generated from GIZA

Source Word   Target Word   Frequency
อาชญากร        The           0.0000935
อาชญากร        criminal      0.4166667
อาชญากร        disclosed     0.5
อาชญากร        criminals     0.5
เตือน           caution       0.1666667
เตือน           you           0.00014
เตือน           remind        0.5135135
เตือน           warned        0.25
เตือน           alarm         0.0652174
เตือน           alert         0.1428571
เตือน           jog           0.0333333
เตือน           siren         0.1111111
เตือน           blew          0.0454545
เตือน           memorial      0.0909091
เตือน           mind          0.0023725

From Table 1, the first column shows the Thai word; the middle column shows the English words aligned to the given Thai word; the last column states the parameter gained from frequency. Since they are aligned, the given English words will be recognised as translations of the source word in the later processes. From our observation, however, we found that several of the English words shown in Table 1 are totally irrelevant to the meaning of the given source word and are incorrectly aligned, especially the third pair, "อาชญากร → disclosed", which has a parameter as high as 0.5. This evidence shows that correspondence from a bilingual corpus alone cannot generate a trustworthy parallel alignment. Since the output of the alignment greatly affects the translation result of SMT, this research raises the question of how to improve the word alignment system so that it correctly matches corresponding words. To improve the matching, a dictionary-based translation is applied to find the appropriate corresponding words. In this work, a new alignment method using the similarity between word translations from a bilingual dictionary is proposed.

The rest of the paper is organised as follows: Section 2 introduces Pooja, the similarity-based word alignment system; Section 3 describes how to integrate Pooja into SMT; Section 4 shows the experiment setting and results; Section 5 gives a discussion on Pooja; Section 6 states the conclusion and future work.

2 Pooja: similarity-based word alignment system

Pooja is a word alignment system designed to align corresponding words in a bilingual corpus. To overcome incorrectly aligned words, a bilingual dictionary is exploited to find a reliable word translation, referred to as the ideal-translation. The ideal-translation is compared with all words in the target-language text, and the word with the highest similarity score to the ideal-translation is selected as the corresponding word in the parallel text. An overview of the Pooja process is illustrated in Figure 1.

Figure 1. An overview of the processes of Pooja, the similarity-based word alignment system

According to Figure 1, the three main processes of Pooja are 1) matrix of words generation, 2) similarity calculation and 3) matching alignment.

2.1 Matrix of words generation

The input of this process is a parallel sentence pair. To align all words, a matrix of each word pair in the two sentences is generated to indicate word positions. We use the algorithm given in Figure 2 to create the matrix of words.

#Pooja Word Alignment Algorithm
Matrix of Words Generation:
  Loop until Bilingual Sentence is Empty
    Read Source Sentence into Array of SourceWords
    Read Target Sentence into Array of TargetWords
    For i < Length of SourceWords
      For j < Length of TargetWords
        Word Matrix[i][j] = {SourceWords[i], TargetWords[j]}
      End For
    End For
  End Loop
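To make the step concrete, here is a minimal Python sketch of the matrix generation, assuming the sentences are already word-segmented and whitespace-delimited; the function and variable names are ours, not part of Pooja's released code.

    def build_word_matrix(source_sentence, target_sentence):
        """Pair every source word with every target word, keeping positions."""
        source_words = source_sentence.split()
        target_words = target_sentence.split()
        # word_matrix[i][j] holds the pair (SourceWords[i], TargetWords[j])
        return [[(s, t) for t in target_words] for s in source_words]

    matrix = build_word_matrix("เตือน ภัย", "warn of danger")
    # matrix[0][0] == ("เตือน", "warn")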

Figure 2. An algorithm to create a matrix of the words in a parallel sentence

2.2 Similarity calculation

Each word from the source language is looked up in the assigned bilingual dictionary to find its ideal-translation. The ideal-translation is then compared with each word in the target-language text according to the algorithm shown in Figure 3.

#Pooja Word Alignment Algorithm
Similarity Calculation:
  // The String Similarity formula is given in equation (4)
  For i < Length of Word Matrix[][] Row
    For j < Length of Word Matrix[][] Column
      Calculate String Similarity of Word Matrix[i][j]
      Put String Similarity into Matrix[i][j]
    End For
  End For
Matching Alignment:
  Loop until Maximum String Similarity = 0
    Search Maximum String Similarity Position
    Get Word Alignment from Maximum String Similarity Position
    Set Maximum String Similarity Position = 0
  End Loop
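The matching step can be read as a greedy search over the scored matrix. A short Python sketch of that reading follows; it assumes a one-to-one alignment (each chosen row and column is removed), which matches the restriction described in Section 2.3, and the names are ours.

    def align_greedy(word_matrix, similarity):
        """Greedily pick the highest-scoring (source, target) cells until none remain."""
        scores = [[similarity(s, t) for (s, t) in row] for row in word_matrix]
        alignments = []
        while True:
            best_score, bi, bj = 0.0, -1, -1
            for i, row in enumerate(scores):
                for j, score in enumerate(row):
                    if score > best_score:
                        best_score, bi, bj = score, i, j
            if best_score == 0.0:
                break  # no positive similarity left; remaining words stay unaligned
            src, tgt = word_matrix[bi][bj]
            alignments.append(((bi, src), (bj, tgt)))
            # Zero out the chosen row and column so each word is aligned at most once.
            scores[bi] = [0.0] * len(scores[bi])
            for i in range(len(scores)):
                scores[i][bj] = 0.0
        return alignments

The similarity argument would be the dictionary-backed string similarity defined by equation (4) below.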

Figure 3. An algorithm to iteratively calculate the similarity score and match alignments

In the assigned dictionary, one entry can contain multiple translation words, provided that they all express the same conceptual sense; such translation words share the same priority value. To handle multiple possible options, we group them into one record, which means that one source word can expand into several translation words, each applied as a candidate translation in its own row of the matrix. A similarity score is used because it increases the chance of matching inflected or misspelled words. Moreover, it raises the chance of handling words written in different forms, such as American English and British English spellings, in the dictionary.

To calculate the similarity score, equation (4) is applied. Following [8], the string similarity between two words is computed from a combination of the normalised longest common subsequence (NLCS), the maximal consecutive longest common subsequence starting at the first character (MCLCS_1), and the maximal consecutive longest common subsequence starting at any character (MCLCS_n). A source word is looked up in the dictionary, which holds a set of source entries and a set of target entries, and its translations are compared against the words of the target sentence. For two strings $r$ and $s$, the formulae are as follows:

$v_1 = \mathrm{NLCS}(r, s) = \dfrac{\mathrm{len}(\mathrm{LCS}(r, s))^2}{\mathrm{len}(r)\,\mathrm{len}(s)}$   (1)

$v_2 = \mathrm{NMCLCS}_1(r, s) = \dfrac{\mathrm{len}(\mathrm{MCLCS}_1(r, s))^2}{\mathrm{len}(r)\,\mathrm{len}(s)}$   (2)

$v_3 = \mathrm{NMCLCS}_n(r, s) = \dfrac{\mathrm{len}(\mathrm{MCLCS}_n(r, s))^2}{\mathrm{len}(r)\,\mathrm{len}(s)}$   (3)

We use the weighted sum of these individual values $v_1$, $v_2$ and $v_3$ to determine the string similarity score, where $w_1$, $w_2$, $w_3$ are weights and $w_1 + w_2 + w_3 = 1$. In the initial step, we set the weights equally, since there is no prior research indicating the significance of each value in the current state. For future work, we will set specific weights by learning the significance of each parameter. Therefore, the similarity of the two strings is:

$\mathrm{Sim}(r, s) = w_1 v_1 + w_2 v_2 + w_3 v_3$   (4)
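The measure is straightforward to implement. The following Python sketch follows the formulae above (it is our illustration of the measure from [8] as used here, not Pooja's actual code); equal weights are used, as in the initial setting.

    def _lcs_len(r, s):
        """Length of the longest common subsequence of r and s (dynamic programming)."""
        dp = [[0] * (len(s) + 1) for _ in range(len(r) + 1)]
        for i, a in enumerate(r, 1):
            for j, b in enumerate(s, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(r)][len(s)]

    def _mclcs1_len(r, s):
        """Length of the maximal consecutive common prefix (MCLCS starting at character 1)."""
        n = 0
        for a, b in zip(r, s):
            if a != b:
                break
            n += 1
        return n

    def _mclcsn_len(r, s):
        """Length of the maximal consecutive LCS starting at any character
        (i.e. the longest common substring)."""
        best = 0
        dp = [[0] * (len(s) + 1) for _ in range(len(r) + 1)]
        for i, a in enumerate(r, 1):
            for j, b in enumerate(s, 1):
                if a == b:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                    best = max(best, dp[i][j])
        return best

    def string_similarity(r, s, w1=1/3, w2=1/3, w3=1/3):
        """Weighted combination of the three normalised measures, as in equation (4)."""
        if not r or not s:
            return 0.0
        denom = len(r) * len(s)
        v1 = _lcs_len(r, s) ** 2 / denom
        v2 = _mclcs1_len(r, s) ** 2 / denom
        v3 = _mclcsn_len(r, s) ** 2 / denom
        return w1 * v1 + w2 * v2 + w3 * v3

For instance, string_similarity("criminal", "criminals") is high (about 0.89), while string_similarity("criminal", "disclosed") is very low (under 0.05), which is exactly the behaviour needed to reject alignments like the "อาชญากร → disclosed" pair in Table 1.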

2.3 Matching alignment process

From the previous process, the words in the target language are assigned similarity scores. Parallel words with the highest scores are aligned to their correspondents systematically. However, some words may not exist in the dictionary; a similarity score cannot be computed for them, and they are left unaligned. The unaligned source words are automatically aligned to the unaligned target words as a phrase template for the present translation. The unaligned phrase templates are collected and used as a source for finding corresponding parallel words; these are extracted as possible pairs to be reviewed by linguists and added to the dictionary to improve further alignment.
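A minimal sketch of this fallback, under our reading of the description (the data shapes follow the earlier sketches and the names are ours):

    def collect_unaligned_template(source_words, target_words, alignments):
        """Pair the leftover source words with the leftover target words as one
        phrase template, to be reviewed later by linguists."""
        aligned_src = {i for (i, _), _ in alignments}
        aligned_tgt = {j for _, (j, _) in alignments}
        src_rest = [w for i, w in enumerate(source_words) if i not in aligned_src]
        tgt_rest = [w for j, w in enumerate(target_words) if j not in aligned_tgt]
        if src_rest and tgt_rest:
            # Candidate entry for the dictionary, pending linguist approval.
            return (" ".join(src_rest), " ".join(tgt_rest))
        return None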

3 Pooja integration to SMT

Pooja is a parallel word alignment system designed for use in SMT. It can replace the existing word alignment in SMT development without causing conflicts or incompatibility. Figure 4 illustrates the workflow of developing HPBT (as an instance of the SMT approach) and indicates where Pooja replaces parts of the process. According to Figure 4, the usual HPBT development has the following steps [9]:

1. Collect statistics for the parallel corpus
2. Align corresponding words between source and target text
3. Generate a list of translation pairs at the word level
4. Learn word translations from the aligned words
5. Extract collocated groups of words into phrases
6. Score phrases based on frequency
7. Tune statistical parameters from language models of other sources (optional)
8. Tune statistical parameters from other generation models (optional)
9. Create the decoder configuration file

Since Pooja can manage the parallel corpus from the beginning, it can replace steps 1 to 3 (in Figure 4, a solid-line arrow represents the HPBT training process with Pooja as the alignment system, while a dashed-line arrow shows normal HPBT with GIZA), and the output of Pooja can be used from step 4 onward. In fact, since word alignment is required in any SMT system and Pooja's input and output are in standard formats, Pooja can be applied to any SMT. In the testing process, the decoder [9][10] is used as usual to generate translation results from the input source sentences.

4 Experiment setting and result

4.1 Experiment setting

To test the potential of Pooja, we applied it in training a Thai-to-English SMT system and compared it with an SMT system using GIZA [6] for word alignment. The parameters in all processes used the same settings, since we wanted to evaluate only the effect of the different alignment systems on the translation results. The training and testing resources were Thai-English parallel sentences.

Figure 4. Pooja in the HPBT process

The training corpus contained 110,190 sentences, while 592 separate sentences were randomly selected for testing the translation accuracy. The bilingual dictionary assigned to provide the ideal-translation was LEXiTRON, an electronic Thai-English dictionary [11][12] (accessed at 10:32, 10th May 2013). It contains 40,853 lexical entries for Thai-to-English translation and 83,229 lexical entries for English-to-Thai translation. Both the BLEU score [13] and human evaluation were used, since some translation results may use a different word selection from the reference translation of the target pair. For the human evaluation, three translation experts gave scores on a scale from 1 (worst) to 5 (best) for the appropriateness of word selection.

4.2 Translation result in BLEU score

BLEU is an automatic evaluation that measures the closeness between a translation result and a translation reference [13] by comparing overlapping word n-grams and turning the overlap into a measurable score. The performance of Pooja and GIZA is not significantly different in terms of BLEU score: 21.60 and 21.05, respectively. A visual and interactive BLEU score for each sentence, along with the translation results, is available online at 203.185.132.229/TH-EN-Bleu.html (powered by iBLEU [14]). However, the actual translation outputs were not very similar: only about 47.64% of the test sentences return exactly the same translation.
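Scores of this kind can be reproduced with an off-the-shelf BLEU implementation. The snippet below uses the sacrebleu Python package as one possible choice (not necessarily the tool used for the numbers above); the example sentences are made up.

    import sacrebleu  # pip install sacrebleu

    # System outputs and their reference translations, one sentence per element.
    hypotheses = ["the criminal warned the police", "he bought a new car"]
    references = ["the criminal warned the police", "he purchased a new car"]

    # corpus_bleu expects a list of reference streams (a single reference here).
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"BLEU = {bleu.score:.2f}")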

4.3 Human evaluation result

On a scale from 1 (worst) to 5 (best), three translation experts acted as evaluators and scored the translation results of both SMT systems, one using GIZA and one using Pooja in the alignment phase. The scoring criteria are the correctness of the word sense and the suitability of the selected word for the surrounding context. We divide the results into two aspects: word level and sentence level. At the word level, we report the results as a score per word. For the sentence-level evaluation, we show the mean score of each sentence and group the scores into ranges to signify the overall sentence translation quality. The word-level comparison from the human evaluation is illustrated as circle graphs in Figure 5, while Table 2 shows the sentence-level results.

5 Discussion

In this work, we present a new framework, called Pooja, to enhance the alignment process in SMT. The results shown so far indicate potential equal to that of the renowned word alignment system GIZA. However, from the result analysis, we found that the incorrect translation results stem from two main issues: the former is the issue of words unknown to the dictionary, and the latter is the issue of aligning to NULL.

The former issue is specific to the assigned bilingual dictionary. It is usual that the lexical entries of a dictionary cannot cover all the lexicons of a language. Although a process to handle unknown words is included in Pooja, the number of unknown words found was far greater than expected, especially on the Thai side, at around 12.94%. The unknown words found include 1) named entities (NE), 2) numeral expressions, 3) jargon of specific domains, and 4) missing common words. The first type is the major issue in translation involving the Thai language, since a Thai named entity [15] is often polysemous with a common word and has no explicit marker in the text. To overcome this unknown-word issue, a named-entity recognition (NER) system needs to be bundled before the alignment process; however, there is currently no report of a reliable NER system for Thai. For the second type, the current alignment cannot efficiently manage numbers in context, since numeral expressions vary from case to case. To solve this, a number parser is required to normalise sequential number units, both in digits and spelled out, so that they can easily be mapped to each other in the parallel text. Once a numeral expression is transformed into a canonical format, the alignment is processed with exact string matching. The third and fourth types can be handled by adding more lexical entries to the dictionary; the word pairs obtained from the experiment (about 500 pairs) are awaiting approval by linguists and will then be inserted into the dictionary.

Through this process, the number of lexical entries will continuously increase and the number of unknown words will keep decreasing. The latter issue is the problem of words aligning to NULL. This issue stems from a current limitation of Pooja: it presently does not allow aligning one word to multiple words. This case often occurs in parallel text with unequal numbers of words, which is a normal phenomenon in translation, particularly between cross-cultural languages and languages of different typology. The issue can be solved by implementing a method that allows one-to-many alignment. To implement this idea, an algorithm will be developed that merges words from the preceding and succeeding positions into the focused word. The merged constituent will be treated as a word and processed in the similarity score calculation like a normal word. A weight parameter will be added to prioritise nearby words over long-distance words, since the greater the distance, the less relevant the word should be. This method is also expected to alleviate incorrect word segmentation for languages without explicit word boundaries.
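As a rough illustration of this planned extension, the sketch below generates merged multi-word candidates around a target position with a distance-based weight; the window size and decay factor are our own illustrative choices, not values from the paper.

    def merged_candidates(target_words, j, max_span=2, decay=0.8):
        """Return (candidate, weight) pairs for position j, where merges that reach
        more distant neighbours receive smaller weights."""
        candidates = [(target_words[j], 1.0)]
        for span in range(1, max_span + 1):
            weight = decay ** span
            if j - span >= 0:  # merge with preceding words
                candidates.append((" ".join(target_words[j - span:j + 1]), weight))
            if j + span < len(target_words):  # merge with succeeding words
                candidates.append((" ".join(target_words[j:j + span + 1]), weight))
        return candidates

    # A merged candidate could then be scored as
    # weight * string_similarity(ideal_translation, candidate).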

Figure 5. Word-level human evaluation results comparing the performance of Pooja and GIZA

Table 2. Sentence-level human evaluation results comparing the performance of Pooja and GIZA

Sentence level    Pooja amount    Pooja percentage    GIZA amount    GIZA percentage
below 2.00        39              6.59                44             7.43
2.01-3.00         255             43.07               281            47.47
3.01-4.00         262             44.26               238            40.20
over 4.00         36              6.08                29             4.90
SUM               592             100.00              592            100.00

6 Conclusion and future work

In this paper, a new approach to parallel word alignment for statistical machine translation, called "Pooja", has been presented. It enhances the mapping of corresponding words in parallel text by using a similarity score based on a bilingual dictionary. It is designed to be a drop-in replacement in the SMT development process, since it takes input as simple as a plain parallel corpus and returns output in a standard format applicable to the subsequent SMT steps. From the experiment, Pooja shows quality equivalent to GIZA in terms of BLEU score and, in a human evaluation, Pooja obtains a better translation result in some cases. To improve the accuracy and extend its ability, we plan to develop Pooja to allow one-to-many word alignment. We also plan to attach a named-entity recogniser and a numeral expression parser to solve the major unknown-word issues. Since the accuracy of the similarity score calculation depends on the dictionary, missing lexicons will be continuously added to decrease the chance of encountering unknown words.

Acknowledgment

This work was supported by the Office of the Higher Education Commission, Thailand, through a grant under the Strategic Scholarships for Frontier Research Network program for the Ph.D. Program.

References

[1] P. Koehn, Statistical Machine Translation. Cambridge University Press, 2010.
[2] J. Olive, C. Christianson, and J. McCary (eds.), Handbook of Natural Language Processing and Machine Translation. Springer, 2011.
[3] A. Lopez, "Statistical machine translation," ACM Computing Surveys, vol. 40, no. 3, pp. 1–49, Aug. 2008.
[4] Y. Wilks, Machine Translation: Its Scope and Limits. Springer-Verlag New York Inc., 2008.
[5] C. Mermer and M. Saraçlar, "Bayesian word alignment for statistical machine translation," in Proceedings of ACL-HLT, 2011, pp. 182–187.
[6] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, Mar. 2003.
[7] B. Taskar, S. Lacoste-Julien, and D. Klein, "A discriminative matching approach to word alignment," in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 2005.
[8] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, pp. 1–25, Jul. 2008.
[9] D. Chiang, "Hierarchical phrase-based translation," Computational Linguistics, vol. 33, no. 2, pp. 201–228, Jun. 2007.
[10] P. Luekhong, R. Sukhahuta, P. Porkaew, T. Ruangrajitpakorn, and T. Supnithi, "A comparative study on applying hierarchical phrase-based and phrase-based on Thai-Chinese translation," in International Conference on Knowledge, Information and Creativity Support Systems, 2012, pp. 126–133.
[11] "LEXiTRON: Thai-English Electronic Dictionary." [Online]. Available: http://lexitron.nectec.or.th/2009_1/. [Accessed: 10-May-2013].
[12] K. Trakultaweekoon, P. Porkaew, and T. Supnithi, "LEXiTRON vocabulary suggestion system with recommendation and vote mechanism," in SNLP 2007: The Seventh International Symposium on Natural Language Processing, 2007, pp. 43–48.
[13] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[14] N. Madnani, "iBLEU: Interactively debugging and scoring statistical machine translation systems," in Proceedings of the Fifth IEEE International Conference on Semantic Computing, 2011.
[15] H. Chanlekha and A. Kawtrakul, "Thai named entity extraction by incorporating maximum entropy model with simple heuristic information," in Proceedings of IJCNLP-2004, 2004, pp. 49–55.
