23. Building an English-Filipino Tourism Corpus and Lexicon for an ASEAN Language Translation System
Charmaine S. Ponay, University of Santo Tomas
Charibeth K. Cheng, De La Salle University
Abstract
A parallel corpus is a valuable resource in natural language processing and computational linguistics. It is used in machine translation, lexicography, statistical language analysis, and cross-language information retrieval, among other applications. This study built a parallel corpus of Philippine tourism data and evaluated the corpus using a statistical machine translation system. The corpus is to be used by the Philippine component of the ASEAN-MT project [1]. ASEAN-MT is a network-based public translation service for ASEAN languages. It aims to translate automatically among ASEAN languages and English, with English as the pivot language.
The source texts were in English and were gathered from four websites about Philippine tourism. The Filipino translations were produced manually, sentence by sentence. Translators followed guidelines such as avoiding multiple translations and using the most common word order. In addition to the translations, the corpus is annotated with named entities, such as people's names, group names, company names, currency units, temporal entities, language names, locations, products, and artistic creations.
The parallel corpus has 21,491 sentences with 370,910 English words and 416,290 Filipino words. Sentences vary in length from one word to approximately 30 words. The corpus is categorized into Festivals and Events, Provincial Profile, Tourist Attractions, and General Information. These categories were based on the source websites and the content of the text.
A lexicon was created by manually extracting named entities from the different Philippine tourism websites. Most of the named entities in the lexicon were retrieved from lists that these websites provide. The named entities were manually translated and classified.
[1] http://aseanmt.org
The corpus was assessed based on the following factors: (1) its overall BLEU score when used in a statistical machine translation system, (2) the BLEU score of the sub-corpora per translator, (3) the BLEU score per category, and (4) the BLEU score of the sub-corpora per translator per category. We found that the number of function words, named entities and numbers, ambiguity, and alignment are the main factors that affect the quality of the machine translation. Function words add to the number of words in the translated sentence, which causes misalignment, while named entities and numbers incur a "penalty" on the BLEU score because they were not translated.
Keywords: statistical machine translation systems, corpus building, tourism, BLEU score, named entity
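The four assessment views above (overall, per translator, per category, and per translator per category) can be illustrated with a minimal sketch. It assumes the standard corpus-level BLEU definition with a single reference, n-grams up to length 4, and whitespace tokenization; the abstract does not state which BLEU implementation or settings the study used. The record fields `translator`, `category`, `hyp`, and `ref`, and the function names, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(pairs, max_n=4):
    """Corpus-level BLEU for (hypothesis_tokens, reference_tokens) pairs.

    Assumption: one reference per sentence and no smoothing, so the score
    is 0.0 whenever some n-gram order has no match at all.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in pairs:
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: a hypothesis n-gram is credited at most as
            # often as it occurs in the reference.
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: applies only when the output is not longer than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)


def bleu_by_group(records, keys=()):
    """Group records by the given fields and score each group.

    keys=() gives the overall score; ("translator",), ("category",) and
    ("translator", "category") give the other three views.
    """
    groups = defaultdict(list)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        groups[group].append((rec["hyp"].split(), rec["ref"].split()))
    return {group: corpus_bleu(pairs) for group, pairs in groups.items()}
```

In this formulation, an untranslated named entity or number fails to match the reference at every n-gram order that spans it, and extra function words lengthen the hypothesis and lower the clipped precision, which is consistent with the factors reported above.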