Improving Machine Transliteration Performance by Using Multiple Transliteration Models
Jong-Hoon Oh¹, Key-Sun Choi², and Hitoshi Isahara¹
¹ Computational Linguistics Group, National Institute of Information and Communications Technology (NICT), 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0289 Japan {rovellia, isahara}@nict.go.jp
² Computer Science Division, EECS, KAIST, 373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701 Republic of Korea [email protected]
Abstract. Machine transliteration has received significant attention as a supporting tool for machine translation and cross-language information retrieval. During the last decade, four kinds of transliteration model have been studied: the grapheme-based model, the phoneme-based model, the hybrid model, and the correspondence-based model. These models are classified in terms of the information sources for transliteration or the units to be transliterated: source graphemes, source phonemes, both source graphemes and source phonemes, and the correspondence between source graphemes and phonemes, respectively. Although each transliteration model has shown relatively good performance, a single model is limited in handling complex transliteration behaviors. To address the problem, we combined different transliteration models with a “generating transliterations followed by their validation” strategy. The strategy makes it possible to consider complex transliteration behaviors using the strengths of each model and to improve transliteration performance by validating transliterations. Our method validates transliterations with both a web-based measure and a transliteration model-based measure. Experiments showed that our method outperforms both the individual transliteration models and previous work.
1 Introduction
Machine transliteration has received significant attention as a supporting tool for machine translation (MT) [1,2] and cross-language information retrieval (CLIR) [3,4]. During the last decade, several transliteration models have been proposed: the grapheme¹-based transliteration model (GTM) [5,6,7,8], the phoneme²-based transliteration model (PTM) [1,9,10], the hybrid transliteration model (HTM) [2,11], and the correspondence-based transliteration model (CTM) [12,13,14]. These models are classified in terms of the information sources for transliteration or the units to be transliterated; GTM, PTM, HTM, and CTM make use of source graphemes, source phonemes, both source graphemes and source phonemes, and the correspondence between source graphemes and phonemes, respectively.
Although each transliteration model has shown relatively good performance, each often produces transliterations with errors. The errors are mainly caused by complex transliteration behaviors, meaning that a transliteration process dynamically uses both source graphemes and source phonemes. Sometimes only source graphemes or only source phonemes contribute to the transliteration process; sometimes both contribute. It is therefore hard to handle these behaviors with a single transliteration model, because each model concentrates on only one of them. To address this problem, we combined the different transliteration models with a “generating transliterations followed by their validation” strategy, as shown in Fig. 1. First, we generate transliteration candidates (or a list of transliterations) using GTM, PTM, HTM, and CTM. Then, we validate the candidates using two measures: a transliteration model-based measure and a web-based measure.
¹ Graphemes refer to the basic units (or the smallest contrastive units) of a written language: for example, English has 26 graphemes or letters.
² Phonemes are the simplest significant unit of sound. We used ARPAbet symbols to represent source phonemes (http://www.cs.cmu.edu/~laura/pages/arpabet.ps).
Y. Matsumoto et al. (Eds.): ICCPOL 2006, LNAI 4285, pp. 85–96, 2006. © Springer-Verlag Berlin Heidelberg 2006
Fig. 1. System architecture
This paper is organized as follows. In section 2, we review previous work based on the four transliteration models. In section 3, we describe the framework of different transliteration models, and in section 4, we describe the transliteration validation. In section 5, we describe our experiments and results. We then conclude in section 6.
2 Previous Work
2.1 Grapheme-Based Transliteration Model
The grapheme-based transliteration model (GTM) is conceptually a direct orthographical mapping model from source graphemes to target graphemes. Several different transliteration methods have been proposed within this framework. Kang & Choi [5] proposed a decision tree-based transliteration method. Decision trees, which transform each source grapheme into target graphemes, are learned and then applied directly to machine transliteration. Kang & Kim [6] and Goto et al. [7] proposed methods based on a transliteration network. The transliteration network is composed of nodes and arcs. A node represents a chunk of source graphemes and its corresponding target grapheme. An arc represents a possible link between nodes and has a weight showing its strength. Li et al. [8] used a joint source-channel model to simultaneously model both the source language and the target language contexts (bigram and trigram) for machine transliteration. Its main advantage is the use of bilingual contexts. The main drawback of GTM is that it does not consider any phonetic aspect of transliteration.
2.2 Phoneme-Based Transliteration Model
Basically, the phoneme-based transliteration model (PTM) is composed of source grapheme-to-source phoneme transformation and source phoneme-to-target grapheme transformation. Knight & Graehl [1] modeled Japanese-to-English transliteration with weighted finite state transducers (WFSTs) by combining several parameters such as romaji-to-phoneme, phoneme-to-English, English word probability models, and so on. Meng et al. [10] proposed an English-to-Chinese transliteration model. It was based on English grapheme-to-phoneme conversion, cross-lingual phonological rules and mapping rules between English and Chinese phonemes, and Chinese syllable-based and character-based language models. Jung et al. [9] modeled English-to-Korean transliteration with an extended Markov window. First, they transformed an English word into its English pronunciation by using a pronunciation dictionary. Then they segmented the English phonemes into chunks of English phonemes, each corresponding to one Korean grapheme, by using predefined handcrafted rules. Finally, they automatically transformed each chunk of English phonemes into Korean graphemes by using the extended Markov window. The main drawback of PTM is error propagation caused by its two-step procedure: errors in source grapheme-to-source phoneme transformation make it difficult to generate correct transliterations in the next step.
2.3 Hybrid Transliteration Model and Correspondence-Based Transliteration Model
There have been attempts to use both source graphemes and source phonemes in machine transliteration. Such research falls into two categories: the correspondence-based transliteration model (CTM) [12,13,14] and the hybrid transliteration model (HTM) [2,11]. The CTM makes use of the correspondence between a source grapheme and a source phoneme when it produces target language graphemes. The HTM requires the grapheme-based transliteration probability (Pr(GTM)) and the phoneme-based transliteration probability (Pr(PTM)), and combines the two probabilities through linear interpolation.
Oh & Choi [12] considered the contexts of a source grapheme and its corresponding source phoneme for English-to-Korean transliteration. Their method is based on semi-automatically constructed context-sensitive rewrite rules of the form A/X/B → y, meaning that X is rewritten as target grapheme y in the context A and B. Note that X, A, and B represent correspondences between an English grapheme and an English phoneme, such as “r : |R|” (English grapheme r corresponding to English phoneme |R|). Oh & Choi [13,14] trained a generative model representing transliteration rules by using the correspondence between source grapheme and source phoneme together with machine learning algorithms. The correspondence makes it possible to model machine transliteration in a more sophisticated manner.
Several researchers [2,11] have proposed hybrid model-based transliteration methods. They modeled GTM and PTM with WFSTs or a source-channel model and then combined GTM and PTM through linear interpolation. In their PTM, several parameters are considered, such as the source grapheme-to-source phoneme probability, source phoneme-to-target grapheme probability, target language word probability, and so on. In their GTM, the source grapheme-to-target grapheme probability is mainly considered.
3 Framework of Different Transliteration Models
Let $SW$ be a source word, $P_{SW}$ be the pronunciation of $SW$, $T_{SW}$ be a target word corresponding to $SW$, and $C_{SW}$ be a correspondence between $SW$ and $P_{SW}$. $P_{SW}$ and $T_{SW}$ can be segmented into a series of sub-strings, each of which corresponds to a source grapheme. Then, we can write $SW = s_1, \cdots, s_n = s_1^n$, $P_{SW} = p_1, \cdots, p_n = p_1^n$, $T_{SW} = t_1, \cdots, t_n = t_1^n$, and $C_{SW} = c_1, \cdots, c_n = c_1^n$, where $s_i$, $p_i$, $t_i$, and $c_i = \langle s_i, p_i \rangle$ represent the $i$th source grapheme, the source phonemes corresponding to $s_i$, the target graphemes corresponding to $s_i$ and $p_i$, and the correspondence between $s_i$ and $p_i$, respectively. With this definition, GTM, PTM, CTM, and HTM can be represented as Eqs. (1), (2), (3), and (4), respectively.

\[
Pr_g(T_{SW} \mid SW) = Pr(t_1^n \mid s_1^n) \approx \prod_{i} Pr(t_i \mid t_{i-k}^{i-1}, s_{i-k}^{i+k}) \tag{1}
\]

\[
Pr_p(T_{SW} \mid SW) = Pr(p_1^n \mid s_1^n) \times Pr(t_1^n \mid p_1^n) \approx \prod_{i} Pr(p_i \mid p_{i-k}^{i-1}, s_{i-k}^{i+k}) \times Pr(t_i \mid t_{i-k}^{i-1}, p_{i-k}^{i+k}) \tag{2}
\]

\[
Pr_c(T_{SW} \mid SW) = Pr(p_1^n \mid s_1^n) \times Pr(t_1^n \mid c_1^n) \approx \prod_{i} Pr(p_i \mid p_{i-k}^{i-1}, s_{i-k}^{i+k}) \times Pr(t_i \mid t_{i-k}^{i-1}, c_{i-k}^{i+k}) \tag{3}
\]

\[
Pr_h(T_{SW} \mid SW) = \alpha \times Pr_p(T_{SW} \mid SW) + (1 - \alpha) \times Pr_g(T_{SW} \mid SW) \tag{4}
\]
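The product in Eq. (1) and the interpolation in Eq. (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: the function names, the padding symbol `#`, the default window size `k=3`, and the `local_prob` callback (standing in for a trained estimator of $Pr(t_i \mid t_{i-k}^{i-1}, s_{i-k}^{i+k})$) are all hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

PAD = "#"  # hypothetical padding symbol for positions outside the word


def window(seq: Sequence[str], start: int, end: int) -> Tuple[str, ...]:
    """Return seq[start..end] (inclusive), padding out-of-range positions."""
    return tuple(seq[j] if 0 <= j < len(seq) else PAD
                 for j in range(start, end + 1))


def gtm_log_prob(
    source: Sequence[str],
    target: Sequence[str],
    local_prob: Callable[[str, Tuple[str, ...], Tuple[str, ...]], float],
    k: int = 3,  # assumed context size; not taken from the paper
) -> float:
    """Log of the right-hand side of Eq. (1), summed over positions i."""
    if len(source) != len(target):
        raise ValueError("source and target must be aligned one-to-one")
    total = 0.0
    for i in range(len(source)):
        t_history = window(target, i - k, i - 1)  # t_{i-k} .. t_{i-1}
        s_context = window(source, i - k, i + k)  # s_{i-k} .. s_{i+k}
        total += math.log(local_prob(target[i], t_history, s_context))
    return total


def htm_prob(pr_p: float, pr_g: float, alpha: float) -> float:
    """Eq. (4): linear interpolation of the PTM and GTM probabilities."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * pr_p + (1.0 - alpha) * pr_g
```

Working in log space avoids numerical underflow when many local probabilities are multiplied; Eqs. (2) and (3) would follow the same pattern with an additional factor for $Pr(p_i \mid p_{i-k}^{i-1}, s_{i-k}^{i+k})$.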
With the assumption that each transliteration model depends on the size of the contexts, $k$, Eqs. (1), (2), (3), and (4) can be simplified. To estimate the probabilities in Eqs. (1), (2), (3), and (4), we used the maximum entropy model, which can effectively incorporate heterogeneous information [15]. In the maximum entropy model, an event $ev$ is composed of a target event ($te$) and a history event ($he$), and it is represented by a bundle of feature functions ($f_i(he, te)$), which represent the existence of certain characteristics in the event $ev$. The feature function enables a model based on the maximum entropy model to estimate probability [15]. Therefore, designing the feature functions, which effectively support certain decisions made by the model, is important. Our basic philosophy for the feature function design for each transliteration model is that the context information collocated with the unit of interest is important. With this philosophy, we designed the feature functions with all possible combinations of $s_{i-k}^{i+k}$, $p_{i-k}^{i+k}$, $c_{i-k}^{i+k}$, and $t_{i-k}^{i-1}$. Generally, a conditional maximum entropy model is an exponential log-linear model that gives the conditional probability of event $ev = \langle te, he \rangle$, as described in Eq. (5), where $\lambda_i$ is a parameter to be estimated and $Z(he)$ is the normalizing factor [15].

\[
Pr(te \mid he) = \frac{1}{Z(he)} \exp\Big(\sum_{i} \lambda_i f_i(he, te)\Big), \qquad
Z(he) = \sum_{te} \exp\Big(\sum_{i} \lambda_i f_i(he, te)\Big) \tag{5}
\]
With Eq. (5) and feature functions, the conditional probabilities in Eqs. (1), (2), (3), and (4) can be estimated. For example, we can write $Pr(t_i \mid t_{i-k}^{i-1}, c_{i-k}^{i+k}) = Pr(te_{CTM} \mid he_{CTM})$ because we can represent the target events ($te_{CTM}$) and history events ($he_{CTM}$) of CTM as $t_i$ and tuples $\langle t_{i-k}^{i-1}, c_{i-k}^{i+k} \rangle$, respectively. In the same way, $Pr(t_i \mid t_{i-k}^{i-1}, s_{i-k}^{i+k})$, $Pr(t_i \mid t_{i-k}^{i-1}, p_{i-k}^{i+k})$, and $Pr(p_i \mid p_{i-k}^{i-1}, s_{i-k}^{i+k})$ can be represented as $Pr(te \mid he)$ with their target events and history events. We used a maximum entropy modeling tool [16] to estimate Eqs. (1), (2), (3), and (4).
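To make Eq. (5) concrete, the following Python sketch evaluates $Pr(te \mid he)$ for a given set of feature functions and weights. It is an illustration only and is not the maximum entropy modeling tool [16]; the function name `maxent_prob`, its arguments, and the toy feature are hypothetical, and the weights $\lambda_i$ are assumed to have been estimated already.

```python
import math
from typing import Callable, Hashable, Sequence

Feature = Callable[[Hashable, Hashable], float]  # f_i(he, te)


def maxent_prob(
    te: Hashable,
    he: Hashable,
    targets: Sequence[Hashable],  # all target events that Z(he) sums over
    features: Sequence[Feature],
    weights: Sequence[float],     # lambda_i, assumed already trained
) -> float:
    """Eq. (5): Pr(te | he) = exp(sum_i lambda_i f_i(he, te)) / Z(he)."""
    if te not in targets:
        raise ValueError("te must be one of the candidate target events")

    def score(candidate: Hashable) -> float:
        return sum(w * f(he, candidate) for f, w in zip(features, weights))

    scores = {c: score(c) for c in targets}
    # Subtracting the maximum score keeps exp() from overflowing;
    # the ratio in Eq. (5) is unchanged by this shift.
    shift = max(scores.values())
    z = sum(math.exp(v - shift) for v in scores.values())
    return math.exp(scores[te] - shift) / z


# Toy binary feature: fires when the target event is "a" (context ignored).
fires_on_a: Feature = lambda he, te: 1.0 if te == "a" else 0.0
p_a = maxent_prob("a", ("ctx",), ["a", "b"], [fires_on_a], [math.log(3.0)])
```

With the single weight set to $\ln 3$, the unnormalized scores of "a" and "b" are 3 and 1, so `p_a` evaluates to 0.75. For CTM, `he` would be the tuple $\langle t_{i-k}^{i-1}, c_{i-k}^{i+k} \rangle$ and `te` would be $t_i$.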
4 Transliteration Validation
We validated transliterations by using web-based validation, $S_{web}(s, tc_i)$, and transliteration model-based validation, $S_{tm}(s, tc_i)$, as in Eq. (6). Using Eq. (6), we can validate transliterations more accurately and robustly because $S_{web}(s, tc_i)$ reflects real-world usage of the transliterations in web data and $S_{tm}(s, tc_i)$ ranks the transliterations independently of the web data.

\[
S_{TV}(s, tc_i) = S_{tm}(s, tc_i) \times S_{web}(s, tc_i) \tag{6}
\]

4.1 Transliteration Model-Based Validation: $S_{tm}$
Our transliteration model-based validation, $S_{tm}$, uses the rank assigned by each transliteration model. For a given source word ($s$), each transliteration model generates transliterations ($tc_i$ in $TC$) and ranks them using the probability described in Eqs. (1), (2), (3), and (4). The underlying assumption in $S_{tm}$ is that the rank of the correct transliterations tends to be higher, on average, than that of the wrong ones. With this assumption, we represented $S_{tm}(s, tc_i)$ as Eq. (7), where $Rank_g(tc_i)$, $Rank_p(tc_i)$, $Rank_h(tc_i)$, and $Rank_c(tc_i)$ represent the rank of $tc_i$ assigned by GTM, PTM, HTM, and CTM, respectively.

\[
S_{tm}(s, tc_i) = \frac{1}{4} \times \Big( \frac{1}{Rank_g(tc_i)} + \frac{1}{Rank_p(tc_i)} + \frac{1}{Rank_h(tc_i)} + \frac{1}{Rank_c(tc_i)} \Big) \tag{7}
\]

4.2 Web-Based Validation: $S_{web}$
Korean or Japanese web pages are usually composed of rich texts in a mixture of Korean or Japanese (main language) and English (auxiliary language). Let s and t be a source language word and a target language word, respectively. We observed that s and t tend to be near each other in the text of Korean or Japanese web pages when the authors of the web pages describe s as a translation of t, or vice versa. We retrieved such web pages for transliteration validation.
There have been several web-based validation methods for translation validation [17,18] or transliteration validation [2,19]. They usually rely on the web frequency (the number of web pages) derived from “Bilingual Keyword Search (BKS)” [2,17,18] or “Monolingual Keyword Search (MKS)” [2,19]. BKS retrieves web pages by using a query composed of two keywords, s and t, while MKS retrieves web pages by using a query composed of t alone. Qu & Grefenstette [17] and Wang et al. [18] proposed BKS-based translation validation methods, such as relative web frequency and the chi-square (χ²) test. Al-Onaizan & Knight [2] used both MKS and BKS, and Grefenstette et al. [19] used only MKS for validating transliterations. However, web pages retrieved by MKS tend to show whether t is used in target language texts rather than whether t is a translation of s. BKS frequently retrieves web pages where s and t have little relation to each other because it does not consider the distance between s and t in the web pages.
To address these problems, we developed a validation method based on “Bilingual Phrasal Search (BPS)”, where a phrase composed of s and t is used as a query for a search engine. Let ‘[s t]’ or ‘[t s]’, ‘s And t’, and ‘t’, respectively, be queries for BPS, BKS, and MKS. The difference among BPS, BKS, and MKS is shown in Fig. 2. In Fig. 2, ‘[s t]’ or ‘[t s]’ retrieves web pages where ‘[s t]’ or ‘[t s]’ exists as a phrase, while ‘s And t’ retrieves web pages where s and t simply exist in the same document.
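The three query types can be illustrated with a short Python sketch. The concrete syntax is an assumption: double quotes for phrase search and the `AND` operator are common search-engine conventions, not something specified here, and the helper name `build_queries` is hypothetical. The Korean word is used only as an example pair for the English word amylase.

```python
from typing import Dict, List


def build_queries(s: str, t: str) -> Dict[str, List[str]]:
    """Build BPS, BKS, and MKS query strings for a source word s and a
    target word t. Quoting and AND syntax are assumed conventions."""
    return {
        "BPS": [f'"{s} {t}"', f'"{t} {s}"'],  # '[s t]' or '[t s]' as a phrase
        "BKS": [f"{s} AND {t}"],              # both keywords, any distance apart
        "MKS": [t],                           # target word only
    }


queries = build_queries("amylase", "아밀라아제")
```

BPS issues two phrase queries because either word order may occur on a page, which is why the web frequencies of both orders are needed.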
Therefore, the number of web pages retrieved by BPS is more reliable for validating transliterations, because s and t are usually highly correlated in the web pages retrieved by BPS. For example, the web pages retrieved by BPS in Fig. 3 usually contain the correct Korean and Japanese transliterations and their corresponding English word amylase as translation pairs in parenthetical expressions. For these reasons, BPS is more suitable for our transliteration validation.
Fig. 2. Difference among BPS, BKS, and MKS
Fig. 3. Web pages retrieved by BPS
Let $TC$ be a set of transliterations (or transliteration candidates) produced by different transliteration models, $tc_i$ be the $i$th transliteration candidate in $TC$, $s$ be the source language word resulting in $TC$, and $WF(s, tc_i)$ be the web frequency for $[s \; tc_i]$. Our web-based validation method, $S_{web}$, can be represented as Eq. (8), which is the relative web frequency derived from BPS.

\[
S_{web}(s, tc_i) = \frac{WF(s, tc_i) + WF(tc_i, s)}{\sum_{tc_k \in TC} \big( WF(s, tc_k) + WF(tc_k, s) \big)} \tag{8}
\]
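Eqs. (6), (7), and (8) combine into a simple scoring routine, sketched below in Python. The sketch rests on assumptions that are not stated in the text: every candidate is assumed to have a rank from each of the four models, `web_freq` is a hypothetical callback returning the page count for a phrase query, and $S_{web}$ is set to 0 when no candidate has any web hits. All names and the toy counts are illustrative.

```python
from typing import Callable, Dict, List, Sequence


def s_tm(ranks: Sequence[int]) -> float:
    """Eq. (7): average reciprocal rank over GTM, PTM, HTM, and CTM."""
    if len(ranks) != 4:
        raise ValueError("expected one rank from each of the four models")
    return 0.25 * sum(1.0 / r for r in ranks)


def s_web(s: str, candidates: Sequence[str],
          web_freq: Callable[[str, str], int]) -> Dict[str, float]:
    """Eq. (8): relative BPS web frequency of every candidate in TC."""
    counts = {tc: web_freq(s, tc) + web_freq(tc, s) for tc in candidates}
    total = sum(counts.values())
    # Assumption: with no web evidence at all, every score is 0.
    return {tc: (n / total if total else 0.0) for tc, n in counts.items()}


def validate(s: str, model_ranks: Dict[str, Sequence[int]],
             web_freq: Callable[[str, str], int]) -> List[str]:
    """Eq. (6): order candidates by S_TV = S_tm * S_web, best first."""
    web = s_web(s, list(model_ranks), web_freq)
    s_tv = {tc: s_tm(r) * web[tc] for tc, r in model_ranks.items()}
    return sorted(s_tv, key=s_tv.get, reverse=True)


# Toy data: ranks are (GTM, PTM, HTM, CTM); page counts are made up.
toy_counts = {("s", "a"): 5, ("a", "s"): 1, ("s", "b"): 2, ("b", "s"): 0}
toy_freq = lambda x, y: toy_counts.get((x, y), 0)
ranking = validate("s", {"a": [1, 2, 4, 4], "b": [1, 1, 1, 1]}, toy_freq)
```

In this toy case candidate "b" is ranked first by every model ($S_{tm} = 1.0$ against 0.5 for "a"), but "a" has three times the web frequency (0.75 against 0.25), so "a" obtains the higher combined score (0.375 against 0.25).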