
ScienceDirect Procedia Computer Science 81 (2016) 243 – 249

5th Workshop on Spoken Language Technology for Under-resourced Languages, SLTU 2016, 9-12 May 2016, Yogyakarta, Indonesia

Using Dictionary and Lemmatizer to Improve Low Resource English-Malay Statistical Machine Translation System

Yin-Lai Yeong*, Tien-Ping Tan, Siti Khaotijah Mohammad
School of Computer Science, Universiti Sains Malaysia, Penang 11800, Malaysia

Abstract

Statistical Machine Translation (SMT) is one of the most popular methods for machine translation. In this work, we carried out English-Malay SMT by acquiring an English-Malay parallel corpus in the computer science domain, while the training parallel corpus is from a general domain. Thus, there will be many out-of-vocabulary words during translation. We attempt to improve the English-Malay SMT in the computer science domain using a dictionary and an English lemmatizer. Our study shows that a combined approach using a bilingual dictionary and English lemmatization improves the BLEU score for English to Malay translation from 12.90 to 15.41.

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license. Peer-review under responsibility of the Organizing Committee of SLTU 2016.

Keywords: statistical machine translation; English-Malay; dictionary; parallel corpus; lemmatization

1. Introduction

Machine Translation (MT) is the process of translating text from a source language to a target language using software. The maturity of the technology allows it to be used by people in daily life and also in language processing technologies such as summarization. There are two main state-of-the-art approaches to machine translation: example-based machine translation (EBMT) and statistical machine translation (SMT). Both are data-driven approaches that require a parallel text corpus containing source sentences and their equivalent translations in the target language to build the translation rules. EBMT translates by analogy: given a test sentence, EBMT translates by finding similar sentences in the parallel text corpus 1.

_________
* Corresponding author. Tel.: +604-6534386; fax: +604-6563244 or +604-6573335. E-mail address: [email protected]

1877-0509 © 2016 The Authors. Published by Elsevier B.V. Open access under the CC BY-NC-ND license. Peer-review under responsibility of the Organizing Committee of SLTU 2016. doi:10.1016/j.procs.2016.04.056



On the other hand, SMT analyzes a parallel text corpus to generate statistical translation models 2. An unknown source sentence is translated into the most probable target sentence using the translation models. Over the past decades, the distinctions between EBMT and SMT have become smaller, as both approaches have borrowed features from each other to improve translation. For example, the phrase-based translation approach used in EBMT is now also applied in SMT, which was previously based only on words. There has been a lot of progress in statistical machine translation (SMT). The attractiveness of the approach is that the quality of the translation is already quite encouraging for many resource-rich languages. In addition, there are common architectures and tools available for SMT, which allow continuous improvement of the approaches. Most of the approaches used in SMT are also language independent.

SMT takes a source sentence $S = [s_1 s_2 \ldots s_n]$ in the source language and generates a target sentence $T^* = [t_1 t_2 \ldots t_n]$ in the target language, where $s_1, s_2, \ldots, s_n$ are phrases (or null) in the source language, and $t_1, t_2, \ldots, t_n$ are phrases (or null) in the target language. There are many possible target sentences that can be translated from a source sentence. The idea is to find the most probable sentence as follows:

$$T^* = \operatorname{argmax}_T \big( P(T \mid S) \big)$$


The equation is decomposed using Bayes' theorem as follows:

$$T^* = \operatorname{argmax}_T \frac{P(S \mid T) \times P(T)}{P(S)}$$

Since P(S) is constant for a given source sentence, it can be removed. The equation simplifies to:

$$T^* = \operatorname{argmax}_T \big( P(S \mid T) \times P(T) \big)$$
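As a minimal illustration of this decision rule, the sketch below scores a handful of candidate translations with toy translation-model and language-model log-probabilities (the numbers and phrases are hypothetical, not taken from the paper's models) and returns the argmax:

```python
import math

def noisy_channel_decode(source, candidates, tm_logprob, lm_logprob):
    """Pick the target sentence maximizing log P(S|T) + log P(T).

    tm_logprob(source, target) and lm_logprob(target) are assumed to be
    supplied by a translation model and a language model respectively.
    """
    return max(candidates, key=lambda t: tm_logprob(source, t) + lm_logprob(t))

# Toy models with made-up probabilities, for illustration only.
tm = {("the orange", "limau itu"): math.log(0.6),
      ("the orange", "oren itu"): math.log(0.3)}
lm = {"limau itu": math.log(0.2), "oren itu": math.log(0.05)}

best = noisy_channel_decode(
    "the orange",
    ["limau itu", "oren itu"],
    lambda s, t: tm[(s, t)],
    lambda t: lm[t])
# "limau itu" wins: log(0.6 * 0.2) > log(0.3 * 0.05)
```

A real SMT decoder such as Moses searches over exponentially many candidate segmentations and reorderings rather than a fixed candidate list, but the scoring principle is the same.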


P(T) is the probability of a target-language sentence, which is modeled by a language model. The language model for the target language can be built from a target-language text corpus. On the other hand, P(S|T) is the probability of a source sentence given the target sentence, which is modeled by a translation model. The translation model is built using a parallel text corpus. Thus, the quality of a machine translation system depends largely on the availability of large amounts of resources for building robust language and translation models. For a low-resource language, the limited amount of these resources makes building a reasonably good SMT system difficult.

We attempt to build an English-Malay SMT system. Malay is an Austronesian language spoken officially in Malaysia, Indonesia, Brunei, and Singapore. Although Malay is not an under-resourced language, obtaining a large parallel corpus is a challenge. The reason is that Malaysians learn English from primary school, and the majority of speakers can communicate well in both languages; this makes the availability of comparable text in English and Malay limited. Therefore, in our work, we explore methods to improve the translation model using under-resourced approaches.

2. Related Works

The first statistical machine translation software was CANDIDE from IBM 3. Since then, a lot of progress has been made in SMT. There are some works on Malay-English machine translation; many of them focus on EBMT 4-6. There is not much work on Malay-English SMT available. Nevertheless, there are English-Malay translation systems available online that use the SMT architecture, such as Google Translate and Bing Translate. Google used SYSTRAN for several years, but switched to a statistical translation method in October 2007 7.

2.1. Parallel Corpus Acquisition

In developing a Malay-English SMT system, two types of corpus are required: a Malay text corpus for building


the language model and a parallel Malay-English corpus for building the translation model. Acquiring a Malay text corpus is easier, as a lot of text can be obtained readily. However, acquiring parallel text is more challenging. There are works on mining parallel data from the Internet, especially from news websites 8-10 and Wikipedia 11-12. To extract parallel text from the Internet, webpages are first retrieved using a web crawler. Candidate websites can be matched based on the publication date (for news articles) 8, the size of the text, the file name, the title of the text (by adopting an initial baseline translation system), URLs 9, and other heuristics to reduce the scope of the search. A more computationally intensive algorithm can then be applied to obtain sentence alignments using dynamic programming, iterative relaxation 13, sentence length 14, and word translation probability 15. Besides using online data, there are also works that explore building parallel text from translated literature such as novels 16 and bibles 17. Translated books contain good-quality translations, and finding the aligned target sentence for a particular source sentence is also easier, since it will not deviate too much from the original position. Lim et al. 18 extracted 25 thousand parallel English-Malay sentences from Kamus Inggeris-Melayu Dewan, while Abdul Rahman et al. 19 collected 15 thousand parallel sentences in the domains of agriculture and health.

2.2. Phrase Translation Modeling

Phrase-based translation uses phrases instead of words as the unit of translation. A few researchers have worked on phrase-based machine translation using different methods 20-23. Schwenk 21 used neural networks to directly learn the translation probability of phrase pairs using continuous representations. Koehn 22 used Pharaoh as a decoder for phrase-based SMT. Wu et al. 23 used out-of-domain parallel corpora to improve the in-domain translation model through interpolation.

3. Approach

In our approach, we use the examination paper archive of the School of Computer Science, Universiti Sains Malaysia to build an English-Malay parallel corpus. The examination papers are a good source for building a parallel corpus, since the questions have been written in both Malay and English since 2000. We extracted all the questions from the softcopies of the exam papers and then segmented them into sentences. The corresponding pairs of English and Malay sentences were then aligned, while ensuring the format matched the requirements of the Moses SMT toolkit. A total of more than 23 thousand pairs of parallel sentences were extracted. The English-Malay parallel sentence corpus obtained is then used as the test corpus for the SMT toolkit. Since the training parallel corpus 6 that we have is not very large, and it is from a generic domain, there will be many out-of-vocabulary (OOV) words. We examined a few approaches that use a dictionary and a lemmatizer/stemmer to improve the machine translation task.

3.1. Inserting Translations from a Bilingual Dictionary into the Phrase Translation Table

The first approach we examine is to determine whether inserting target-source phrase translations found in the bilingual dictionary into the training parallel corpus for building the translation model will improve the BLEU score. The hypothesis is that a bilingual dictionary provides additional word or phrase translation pairs not found in the training data. There are two variations of the method. In the first method, each phrase translation example found in the bilingual dictionary is added to the existing parallel training corpus as a new parallel sentence. A translation model is then built using the additional data.
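The first method amounts to treating each dictionary entry as an extra one-line parallel "sentence" pair. A minimal sketch with hypothetical data (the real corpus and dictionary formats are not specified in detail here):

```python
def augment_with_dictionary(parallel_corpus, dictionary):
    """Append each dictionary entry as a one-line parallel sentence pair.

    parallel_corpus: list of (english, malay) sentence pairs.
    dictionary: iterable of (english_phrase, malay_phrase) entries.
    Returns a new list; the original corpus is left untouched.
    """
    augmented = list(parallel_corpus)
    seen = set(augmented)
    for en, ms in dictionary:
        if (en, ms) not in seen:        # skip pairs already present
            augmented.append((en, ms))
            seen.add((en, ms))
    return augmented

# Hypothetical one-sentence corpus and two-entry dictionary.
corpus = [("what is an array ?", "apakah tatasusunan ?")]
dictionary = [("orange", "limau"), ("orange", "limau")]
out = augment_with_dictionary(corpus, dictionary)
# out holds the original pair plus one dictionary entry (duplicate dropped)
```

The augmented corpus would then be passed to the usual Moses training pipeline in place of the original one.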
In the second method, we examined the approach of adding a small delta value to p(s'|t') in the phrase translation table if the translation phrase pair is found in the dictionary, where s' is the source phrase and t' is the target phrase in the dictionary. We also add the same delta value to p(t'|s'), since the Moses toolkit also makes use of it to calculate P(S|T). Nevertheless, according to 24, p(s'|t') is a more important feature than p(t'|s').
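The second method can be sketched as a post-processing pass over the phrase table. The field names below are simplified placeholders (a real Moses phrase table is a text file carrying four feature scores per pair), so this illustrates only the delta update, not the actual Moses file format:

```python
def boost_phrase_table(phrase_table, dictionary_pairs, delta=0.001):
    """Add a small delta to p(s|t) and p(t|s) for pairs found in the dictionary.

    phrase_table maps (source_phrase, target_phrase) to a dict with the
    simplified score names 'p_s_given_t' and 'p_t_given_s' (placeholders
    assumed for this sketch). Pairs absent from the table are ignored.
    """
    for pair in dictionary_pairs:
        if pair in phrase_table:
            phrase_table[pair]["p_s_given_t"] += delta
            phrase_table[pair]["p_t_given_s"] += delta
    return phrase_table

# Hypothetical one-entry table, boosted by one dictionary match.
table = {("orange", "limau"): {"p_s_given_t": 0.40, "p_t_given_s": 0.35}}
boost_phrase_table(table, [("orange", "limau")], delta=0.005)
```

Note that the boosted scores are no longer normalized probabilities; since Moses combines the features log-linearly, a small constant offset still shifts the ranking in favour of the dictionary pairs.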




3.2. Modeling Separate Lemma and Suffix in the Phrase Translation Table

An English lemmatizer and stemmers are used to split the words in the training and testing parallel sentences into an English lemma/stem and its suffix. The idea is that splitting an English word into lemma and suffix allows English words with the same lemma to be translated as long as an observation is available in the training data. For example, the words "orange" and "oranges" have the same translation in Malay, which is "limau". If we have the translation for the word "orange" in the training data, but not the word "oranges", this allows us to translate "oranges" to "limau", because the lemma of both words is "orange". We only apply this method if the lemma is a substring of the original word. For example, the lemma and suffix of the word "oranges" are "orange" and "-s" respectively. If the lemma is not a substring, we do not split the word: for example, the lemma of "geese" is "goose". The lemmatized training parallel corpus is then used for training as usual. For example, the sentence "what are the operations that can be performed on them ?" becomes "what are the operation -s that can be perform -ed on them ?". Besides that, we observed that many English suffixes are not translated. Thus, we also tried a different variant, where the suffixes produced are simply deleted and only the lemma is retained.

4. Experiments and Results

In our experiments, the Moses SMT toolkit was used. A Malay text corpus 25 of around 870MB was used to create the Malay language model using SRILM 26. For building the baseline translation model, we used 124,723 English-Malay parallel sentences from 6, which produced more than 1.1 million translation entries. For testing, the computer science domain parallel text we extracted was used: more than 23 thousand parallel sentences from the examination paper archive.
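The splitting rule above, including the substring check, can be sketched as follows. The small LEMMAS dictionary stands in for the output of a real lemmatizer and is purely illustrative:

```python
def split_lemma_suffix(word, lemma):
    """Split a word into [lemma, '-suffix'] only when the lemma is a prefix
    of the word (the paper's substring rule); otherwise keep the word whole."""
    if word != lemma and word.startswith(lemma):
        return [lemma, "-" + word[len(lemma):]]
    return [word]

# Hypothetical lemmatizer output, for illustration only.
LEMMAS = {"operations": "operation", "performed": "perform", "geese": "goose"}

def preprocess(sentence, keep_suffix=True):
    """Rewrite a pre-tokenized sentence into lemma/suffix form.

    With keep_suffix=False the suffix tokens are simply dropped,
    matching the paper's second variant.
    """
    tokens = []
    for word in sentence.split():
        parts = split_lemma_suffix(word, LEMMAS.get(word, word))
        tokens.extend(parts if keep_suffix else parts[:1])
    return " ".join(tokens)

sent = "what are the operations that can be performed on them ?"
with_suffix = preprocess(sent)
# "what are the operation -s that can be perform -ed on them ?"
no_suffix = preprocess(sent, keep_suffix=False)
# "what are the operation that can be perform on them ?"
```

Note that "geese" is left intact because its lemma "goose" is not a prefix of the word, exactly as the substring rule requires.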
Our baseline test gave a BLEU score of 12.90. In the subsequent test, we used a Malay-English bilingual dictionary to improve the baseline translation table. The bilingual dictionary was obtained from 6 and contains more than 190 thousand translation entries. The dictionary examples were added to the baseline parallel sentence corpus for training with Moses. This approach improved the BLEU score from 12.90 to 13.40. In the second test, we tested the second dictionary method, which adds a small delta value, in the phrase translation table, to the English-Malay phrase translation pairs found in the bilingual dictionary. We examined four different delta values: 0.001, 0.005, 0.0001 and 0.0005. The BLEU score improved from 12.90 to 13.60 and higher. Table 1 below shows the improvement for the dictionary approaches.

Table 1. Dictionary approach results

Approach                                                                      BLEU score
Add English-Malay phrase translation pairs from dictionary for training.      13.40
Add delta=0.001 to English-Malay phrase translation pairs from dictionary.
Add delta=0.005 to English-Malay phrase translation pairs from dictionary.
Add delta=0.0001 to English-Malay phrase translation pairs from dictionary.
Add delta=0.0005 to English-Malay phrase translation pairs from dictionary.
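For reference, the core ingredient of the BLEU metric used throughout these experiments is clipped (modified) n-gram precision. The sketch below computes it for unigrams only; the full metric additionally combines higher-order n-grams geometrically and applies a brevity penalty:

```python
from collections import Counter

def modified_unigram_precision(candidate, reference):
    """Clipped unigram precision: each candidate word counts at most as
    many times as it appears in the reference. This is the 1-gram building
    block of BLEU, not the complete metric."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    clipped = sum(min(count, ref[w]) for w, count in cand.items())
    return clipped / max(1, sum(cand.values()))

# Toy example: two of three candidate words match the reference.
p = modified_unigram_precision("limau itu manis", "limau itu masam")
# p == 2/3
```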


In the last experiment, we analyzed the use of different lemmatizers and stemmers to split a word into lemma/stem and suffix. We tested the Stanford parser 27, Snowball stemmer 28, Porter stemmer 29, Lancaster stemmer 30 and Lovins stemmer 31. After lemmatization/stemming, the suffixes in the sentences were either retained or removed before Moses used the data to train the translation table. We used the parallel sentences augmented with the bilingual phrase translations from the dictionary as our new baseline. Table 2 below shows the results.



Table 2. Lemmatizer/Stemmer results

Lemmatizer / Stemmer    Suffix retained (BLEU score)    Suffix removed (BLEU score)
Stanford parser         14.05                           15.41
Snowball stemmer        13.96                           15.14
Porter stemmer
Lancaster stemmer
Lovins stemmer
The result was a big surprise to us, as it shows that when English suffixes are removed, the Malay translation improves a lot compared to retaining the suffixes. The highest BLEU score increase was achieved using the Stanford parser with suffixes removed, where the BLEU score improved from 13.40 to 15.41. On the other hand, when the suffixes were retained, we only obtained a small BLEU score improvement, to 14.05. The second best result was obtained using the Snowball stemmer, where the BLEU score was 15.14 when all suffixes were deleted from the parallel sentences; when the suffixes were retained, the BLEU score only improved to 13.96. The third best result was produced using the Porter stemmer. The two other stemmers, the Lancaster stemmer and the Lovins stemmer, gave worse results than the baseline.

The comparatively worse results obtained when suffixes were retained suggest a poor alignment step. If, during alignment, null were aligned with all suffixes, we should be able to get the same result as when the suffixes are removed. In fact, certain English suffixes should be translated into a Malay word. For example, "-ing" is quite often translated to the Malay word "sedang". Thus, if the alignment were correct in the sentences with suffixes, the Malay translation obtained for English sentences should be even better than the one without suffixes. We also tried to test another idea: interpolating the phrase table created with the training data containing suffixes (BLEU score 14.05; refer to Table 2) with another phrase table created with the training data where the suffixes were removed (BLEU score 15.41; refer to Table 2). To do that, we added NULL-to-suffix alignments to the lexical table created by GIZA++ from the training data where the suffixes were removed. Moses was rerun to recreate the phrase table.
The phrase table created was then interpolated with the phrase table created from the training data where the suffixes were retained (BLEU score 14.05), using the weights (0.9, 0.1), (0.8, 0.2) and (0.7, 0.3), but the BLEU scores we obtained dropped to around 14.70 (from 15.41).

5. Conclusions and Future Works

In this paper, we investigated the use of an English-Malay dictionary and an English lemmatizer to improve the BLEU score of English-Malay translation. Both approaches were shown to improve English-Malay translation. In the dictionary approach, simply adding a small delta value to the English-Malay phrase pairs found in the translation table increased the BLEU score slightly more than adding the translation phrase pairs to the training data. On the other hand, when an English lemmatizer is used to convert English words into lemma and suffix, and the resulting parallel sentences are used for training, the Malay translation produced improves. Our study shows that the improvement is highest using the Stanford parser. Moreover, when the suffixes in the parallel sentences are deleted, the BLEU score obtained is even higher than when the suffixes are retained. This observation suggests that there is an alignment problem with the suffixes that prevents the suffix results from being better than those without suffixes. If the suffix alignments improve, we foresee an even higher improvement in the BLEU score. Thus, for future work, we will look into improving the suffix alignment to improve English-Malay translations. One idea we are examining is improving the initialization of the word alignments in GIZA++, which we hope will produce a better phrase table in Moses.
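The interpolation step described above can be sketched as a weighted combination of two phrase tables. Here each table is simplified to a single probability per phrase pair (a real Moses phrase table carries four feature scores), so this only illustrates the arithmetic; the phrases and numbers are hypothetical:

```python
def interpolate_phrase_tables(table_a, table_b, weight_a=0.9):
    """Linearly interpolate two phrase tables: p = w * p_a + (1 - w) * p_b.

    A pair missing from one table contributes probability 0 from that table,
    so pairs unique to either table survive with a down-weighted score.
    """
    weight_b = 1.0 - weight_a
    merged = {}
    for pair in set(table_a) | set(table_b):
        merged[pair] = (weight_a * table_a.get(pair, 0.0)
                        + weight_b * table_b.get(pair, 0.0))
    return merged

# Toy tables: one built from suffix-removed data, one from suffix-retained data.
no_suffix = {("orange", "limau"): 0.6}
with_suffix = {("orange", "limau"): 0.4, ("orange -s", "limau"): 0.2}
merged = interpolate_phrase_tables(no_suffix, with_suffix, weight_a=0.9)
# shared pair: 0.9*0.6 + 0.1*0.4 = 0.58; unique pair: 0.1*0.2 = 0.02
```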



Acknowledgements

This work is supported by Universiti Sains Malaysia (USM), Research University Grant (RU), project number 1001/PKOMP/811295.

References

1. H. Somers, L. McLean, and D. Jones, "Experiments in multilingual example-based generation," Proceedings of the 3rd Conference on the Cognitive Science of Natural Language Processing, Dublin, 1994.
2. P. Brown, J. Cocke, V. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin, "A statistical approach to language translation," COLING'88 (Association for Computational Linguistics) 1, 71-76, 1988.
3. A. Berger, P. Brown, S. Della Pietra, V. Della Pietra, J. Gillette, J. Lafferty, R. Mercer, H. Printz, and L. Ures, "The Candide system for machine translation," Proceedings of the Workshop on Human Language Technology, 157-162, 1994.
4. Mosleh H. Al-Adhaileh and E. K. Tang, "Example-based machine translation based on the synchronous SSTC annotation schema," MT Summit VII, 244-249, Singapore, 1999.
5. Mosleh H. Al-Adhaileh, E. K. Tang, and Y. Zaharin, "A synchronization structure of SSTC and its applications in machine translation," COLING 2002 Post-Conference Workshop on Machine Translation in Asia, Taipei, 2002.
6. H. N. Lim, H. H. Ye, C. K. Lim, and E. K. Tang, "Adapting an existing example-based machine translation (EBMT) system for new language pairs based on an optimized bilingual knowledge bank (BKB)," Conference on Translation, 399-406, Kuala Lumpur, 2007.
7. A. Chitu, "Google switches to its own translation system," 2007. Retrieved 27-01-2016.
8. D. Munteanu and D. Marcu, "Extracting parallel sub-sentential fragments from non-parallel corpora," ACL 2006, 21-88, Sydney.
9. Y. Zhang, K. Wu, J. Gao, and P. Vines, "Automatic acquisition of Chinese-English parallel corpus from the web," Advances in Information Retrieval, 420-431, 2006.
10. J. Uszkoreit, J. M. Ponte, A. C. Popat, and M. Dubiner, "Large scale parallel document mining for machine translation," COLING 2010, 1101-1109, Beijing.
11. T.-N.-D. Do, V.-B. Le, B. Bigi, L. Besacier, and E. Castelli, "Mining a comparable text corpus for a Vietnamese-French statistical machine translation system," Proceedings of the 4th EACL Workshop on Statistical Machine Translation, 165-172, Athens, 2009.
12. M. Mohammadi and N. GhasemAghaee, "Building bilingual parallel corpora based on Wikipedia," International Conference on Computer Engineering and Applications, 264-268, Bali, 2010.
13. M. Kay and M. Roscheisen, "Text-translation alignment," Technical Report, Xerox Palo Alto Research Center, 1988.
14. W. A. Gale and K. W. Church, "A program for aligning sentences in bilingual corpora," Computational Linguistics, 19, 75-102, 1993.
15. S. F. Chen, "Aligning sentences in bilingual corpora using lexical information," Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 9-16, Columbus, 1993.
16. L. Huang, "Building a Chinese-English parallel corpus of modern and contemporary Chinese novels," Style in Translation: A Corpus-Based Perspective, New Frontiers in Translation Studies, 31-41, 2015.
17. P. Resnik, M. B. Olsen, and B. Diab, "Creating a parallel corpus from the book of 2000 tongues," Computers and the Humanities, 33, 129-153, 1999.
18. L.-T. Lim and E. K. Tang, "Building an ontology-based multilingual lexicon for word sense disambiguation in machine translation," Proceedings of the PAPILLON-2004 Workshop on Multilingual Lexical Databases, Grenoble, 2004.
19. S. A. Rahman, N. Ahmad, H. A. Hashim, and A. W. Dahalan, "Real time on-line English-Malay machine translation (MT) system," Third Real-Time Technology and Applications Symposium, UPM, Malaysia, 2006.
20. R. Zens, F. J. Och, and H. Ney, "Phrase-based statistical machine translation," KI 2002, LNAI 2479, 18-32, Springer-Verlag, Berlin Heidelberg, 2002.
21. H. Schwenk, "Continuous space translation models for phrase-based statistical machine translation," Proceedings of COLING 2012: Posters, 1071-1080, Mumbai, December 2012.
22. P. Koehn, "Pharaoh: A beam search decoder for phrase-based statistical machine translation models," 6th Conference of the Association for Machine Translation in the Americas, AMTA 2004, Washington, DC, USA, September 28 - October 2, 2004.
23. H. Wu, H. Wang, and C. Zong, "Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora," Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), 993-1000, Manchester, August 2008.
24. A. Lopez and P. Resnik, "Word-based alignment, phrase-based translation: What's the link?," Proceedings of the Conference of the Association for Machine Translation in the Americas, Cambridge, 90-99, 2006.
25. T.-P. Tan, H. Li, E. K. Tang, X. Xiao, and E. S. Chng, "MASS: A Malay language LVCSR corpus resource," COCOSDA'09, Beijing, 2009.
26. A. Stolcke, "SRILM - an extensible language modeling toolkit," Proceedings of the International Conference on Spoken Language Processing, 901-904, 2002.
27. C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, "The Stanford CoreNLP natural language processing toolkit," Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55-60, 2014.
28. M. F. Porter, "Snowball: A language for stemming algorithms," 2001. Retrieved 17-03-2016.
29. M. F. Porter, "An algorithm for suffix stripping," Program, 14, 130-137, 1980.

30. C. D. Paice, "Another stemmer," ACM SIGIR Forum, 24(3), 56-61, 1990.
31. J. B. Lovins, "Development of a stemming algorithm," Mechanical Translation and Computational Linguistics, 11(1/2), 22-31, 1968.