The Application of Bayesian Alignment Techniques to Transliteration Generation and Mining

Andrew Finch∗, Keiji Yasuda∗, Takaaki Fukunishi†, Seiichi Yamamoto† and Eiichiro Sumita∗
∗Multilingual Translation Laboratory, NICT, Kyoto, Japan. Email: {andrew.finch,keiji.yasuda,eiichiro.sumita}@nict.go.jp
†Doshisha University, Kyoto, Japan. Email: {dtk0706,seyamamo}@mail.doshisha.ac.jp
Abstract—Bayesian techniques have recently been applied to many areas of natural language processing, and have proven themselves particularly useful in areas involving segmentation and alignment. This paper looks at the direct application of these techniques to the co-segmentation/alignment of grapheme sequences. We detail a novel Bayesian model for unsupervised bilingual character sequence alignment of corpora for transliteration. The technique is based on a Dirichlet process model trained using Bayesian inference through blocked Gibbs sampling, implemented using an efficient forward filtering/backward sampling dynamic programming algorithm. The Bayesian approach is able to overcome the overfitting problem inherent in maximum likelihood training. We demonstrate the effectiveness of our Bayesian segmentation by using it to build a translation model for a phrase-based statistical machine translation (SMT) system trained to perform transliteration by monotonic transduction from character sequence to character sequence. Our experiments show that the Bayesian alignment gives rise to a better performing model that is 30% smaller than the model of a baseline system trained in the usual manner using a GIZA++ word alignment process in combination with phrase extraction heuristics from the MOSES statistical machine translation system. We also compare our alignment approach to the popular m2m aligner when used to build another class of transliteration model: the joint source-channel model. Our results on a number of different language pairs show that in all cases the Bayesian aligner was again able to learn more compact models that gave rise to better transliteration performance. This paper also takes a look at the task of evaluating the usefulness of transliteration in a real-world machine translation system. Popular automatic measures of machine translation demand exact matches for the tokens produced by a transliteration system: if a phonetically equivalent token is generated, these measures consider it to be an error, even though the token contains information that could be useful to a user. As a consequence, often the best strategy is to delete unknown words rather than transliterate them. In this paper we show, using human evaluation, that for Japanese-English translation transliteration actually does help to improve machine translation quality. In the final part of the paper, we investigate the application of Bayesian alignment to the task of transliteration mining. We define a set of features based on the characteristics of the forced alignment of candidate transliteration pairs and use these features together with a support vector machine (SVM) classifier to predict whether or not a given pair is a genuine transliteration. Our approach is simple, but was able to achieve performance comparable to the state of the art in the field when tested on data from the NEWS 2010 mining shared task, indicating that features from forced alignment are very predictive in transliteration mining.
I. Introduction

Machine translation has to handle a wide variety of proper nouns to achieve high translation performance. However, it is impossible to discover all of these words from limited quantities of bilingual training data, or even to manually maintain a bilingual dictionary, because of the huge number of distinct proper nouns in the real world, with new words and expressions being created day by day. In many languages, large numbers of words are imported into the language by transcribing them as a phonetically similar sequence of characters. This process is called transliteration, and if used appropriately it can expand the coverage of a machine translation system. It is possible to couch the problem of transliteration as a problem of direct grapheme-to-grapheme transduction that proceeds in a monotone order. Recently, systems based on phrase-based statistical machine translation technology [1], [2], [3] and the joint source-channel model [4] have been actively researched and have achieved state-of-the-art performance on this task. This type of approach makes no linguistic assumptions about the data, and no intermediate phonetic representation is required, because the transduction is directly from grapheme to grapheme. The advantages of this type of approach are that the only training corpus required is a set of bilingual word-pairs, and the approach can be applied directly to a wide range of language pairs without the need to develop a set of linguistically-motivated heuristics specific to the languages involved. In this paper we focus on translation methodologies that use grapheme sequence-pairs as their basis for generation, namely transliteration using phrase-based statistical machine translation (PBSMT) techniques, and the joint source-channel approach. At the core of the PBSMT approaches is the phrase-table. In transliteration, this is a set of bilingual grapheme sequence-pairs that are concatenated to generate the transliteration. The creation of a phrase-table during a typical training procedure for a PBSMT system consists of the following steps:
1) Word alignment using GIZA++ [5]
2) Phrase-pair extraction using heuristics (for example grow-diag-final-and from the MOSES [6] toolkit)
This approach works very well in practice, but a more elegant solution would be to arrive at a set of bilingual grapheme sequence-pairs (we use this term to describe the analogue of the phrase-pair at the grapheme level) in one step, from a generative model. Unfortunately, when traditional methods that use the EM algorithm to maximize likelihood are applied to the task, they produce solutions that can grossly over-fit the data. As an extreme example, the most likely segmentation of a corpus into sequence-pairs, assuming no limits on sequence-pair size, would be the entire corpus as a single bilingual sequence-pair, holding all the probability mass. GIZA++ mitigates this problem by aligning the words in a one-to-many fashion. The single word on one side of the alignment acts as a constraint on the size of the bilingual pairs. A similar approach can be taken in transliteration, where a single character in one language is permitted to align to multiple characters of the other, but not vice versa. This approach is reasonable for English-Chinese transliteration [4], [7], where one Chinese character can be assumed to map to several English characters. In GIZA++ this one-to-many alignment is done twice: from source-to-target and also from target-to-source. A table of word-to-word alignments is then constructed from both of these alignments (typically their intersection). Additional word alignments that are not in the intersection are added based on evidence and heuristics, and finally all phrase-pairs that are consistent with the table of alignments are extracted. In [8], [9] many-to-many alignment is performed directly using maximum likelihood training, but evidence trimming heuristics that exclude part of the available training data are required to prevent the models from overfitting the data. [10] have successfully applied a similar Bayesian technique to grammar induction, and [11], [12] have developed tractable Bayesian methods for the more complex task of bilingual phrase-pair extraction for SMT, which involves reordering. [13] tackle the overfitting problem in phrasal alignment with a leave-one-out strategy that, despite being a different paradigm, shares many of the characteristics of our approach. [14] have also developed a Bayesian adaptor grammar approach to alignment for transliteration. In this paper we extend existing monolingual word segmentation models ([15], [16]) to bilingual segmentation, and provide a simple yet elegant way to directly segment a bilingual training corpus in a many-to-many fashion without overfitting, using a Bayesian model in the manner of [17]. In the first part of this paper we present experiments that investigate the effectiveness of our Bayesian alignment method as a means of building transliteration models. We then move on to show ways in which the forced alignment produced by this process can be exploited for transliteration mining. In the final section of the paper, we look at the effectiveness of using such approaches when integrated into a machine translation system. This is motivated by the fact that automatic machine translation metrics, for example the BLEU score [18],
and the NIST score [19], often have a natural bias towards short translations over longer translations that contain errors. That is to say, it is often a better strategy for a machine translation system to output nothing or a very short erroneous translation than to output a long, incorrect translation. These metrics include attempts to correct this length bias, but unfortunately they are not completely effective and some degree of bias remains. As evidence of this phenomenon, it is common practice in competitive machine translation evaluation campaigns for systems to delete untranslated unknown words from their machine translation output, rather than keep them or attempt to transliterate them (for example, [20]). For this reason it is important to study the effect of introducing transliteration into a machine translation system through a human evaluation experiment, even though such experiments are expensive in terms of human effort. We now move on to motivate the Bayesian co-segmentation scheme used in training our transliteration model.

II. Methodology

In this section we explain the basic model that underlies the Bayesian alignment model that forms the core of the systems studied in this paper. The model we use is a unigram Dirichlet process model that generates bilingual grapheme sequences in the same manner as a joint source-channel model [4].

A. Joint Source-channel Model

Let us assume we are given a bilingual corpus consisting of a source sequence $s_1^M = \langle s_1, s_2, \ldots, s_M \rangle$ and a target sequence $t_1^N = \langle t_1, t_2, \ldots, t_N \rangle$. We distinguish sequences of characters from single characters by using a boldface font with an overbar. We adopt the joint source-channel model of [4] as the underlying generative model, and we make the additional assumption that the segments are independent of each other (our approach can easily be extended to model these dependencies at the expense of some additional complexity, see [16]). Under this model, the corpus is generated through the concatenation of bilingual sequence-pairs (we will use this term repeatedly throughout this paper to refer to corresponding sequences of source and target graphemes, as defined below). A bilingual sequence-pair is a tuple $(\bar{s}, \bar{t}) = (\langle s_1, \ldots, s_I \rangle, \langle t_1, \ldots, t_J \rangle)$ consisting of a sequence of source graphemes together with a sequence of target graphemes. The corpus probability is simply the probability of all possible derivations of the corpus given the set of bilingual sequence-pairs and their probabilities:

p(s_1^M, t_1^N) = p(s_1, s_2, \ldots, s_M, t_1, t_2, \ldots, t_N) = \sum_{\gamma \in \Gamma} p(\gamma)    (1)

where $\gamma = ((\bar{s}_1, \bar{t}_1), \ldots, (\bar{s}_k, \bar{t}_k), \ldots, (\bar{s}_K, \bar{t}_K))$ is a derivation of the corpus characterized by its co-segmentation, and $\Gamma$ is the set of all derivations (co-segmentations) of the corpus.
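To make Equation 1 concrete, consider the word-pair ('andrew', アンドリュー) used later in Figure 2. One derivation in $\Gamma$ is $\gamma = ((an, アン), (d, ド), (rew, リュー))$; another is the single coarse sequence-pair $((andrew, アンドリュー))$. Equation 1 sums the probabilities of every such co-segmentation.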
The probability of a single derivation is given by the product of its component bilingual sequence-pairs:
p(\gamma) = \prod_{k=1}^{K} p((\bar{s}_k, \bar{t}_k))    (2)
The corpus for our experiments is segmented into bilingual word-pairs. We therefore constrain our model such that neither the source nor the target character sequence of any bilingual sequence-pair in the derivation of the corpus is allowed to cross a word segmentation boundary. Equation 2 can therefore be rearranged as a product over the derivations of each bilingual word-pair $w$ in the sequence of all word-pairs $W$ in the corpus:

p(\gamma) = \prod_{w \in W} \prod_{(\bar{s}_k, \bar{t}_k) \in \gamma_w} p((\bar{s}_k, \bar{t}_k))    (3)
where $\gamma_w$ is a derivation of bilingual word-pair $w$.

B. Dirichlet Process Model

Intuitively, the Dirichlet process model has two basic components: a model for generating an outcome that has already been generated at least once before, and a second model that assigns a probability to an outcome that has not yet been produced. To encourage the re-use of model parameters, the probability of generating a novel bilingual sequence-pair is considerably lower than the probability of generating a previously observed sequence-pair. The probability distribution over these bilingual sequence-pairs (including an infinite number of unseen pairs) is learned directly from unlabeled data by Bayesian inference of the hidden co-segmentation of the corpus. More formally, the underlying stochastic process for the generation of a corpus composed of bilingual phrase-pairs $\gamma$ is usually written in the following form:

G \mid \alpha, G_0 \sim DP(\alpha, G_0)
(\bar{s}_k, \bar{t}_k) \mid G \sim G    (4)
$G$ is a discrete probability distribution over all bilingual sequence-pairs according to a Dirichlet process prior with base measure $G_0$ and concentration parameter $\alpha$. The concentration parameter $\alpha > 0$ controls the variance of $G$; intuitively, the larger $\alpha$ is, the more closely $G$ will resemble $G_0$. For the base measure $G_0$ that controls the generation of novel sequence-pairs, we use the joint spelling model described in [17], which assigns exponentially smaller probabilities with increasing source/target sequence length.

1) The Generative Model: The generative model is given in Equation 5 below. The equation assigns a probability to the $k$th bilingual sequence-pair $(\bar{s}_k, \bar{t}_k)$ in a derivation of the corpus, given all of the other sequence-pairs observed so far, $(\bar{s}_{-k}, \bar{t}_{-k})$. Here $-k$ is read as: "up to but not including $k$".

p((\bar{s}_k, \bar{t}_k) \mid (\bar{s}_{-k}, \bar{t}_{-k})) = \frac{N((\bar{s}_k, \bar{t}_k)) + \alpha G_0((\bar{s}_k, \bar{t}_k))}{N + \alpha}    (5)

In this equation, $N$ is the total number of bilingual sequence-pairs generated so far, and $N((\bar{s}_k, \bar{t}_k))$ is the number of times the sequence-pair $(\bar{s}_k, \bar{t}_k)$ has occurred in the history.

C. Bayesian Inference

By repeatedly scoring bilingual sequence-pairs with the probability from Equation 5, the algorithm is able to co-segment and align source and target grapheme sequences through an iterative process of Bayesian inference using Gibbs sampling. The training procedure is based on an extension of the forward filtering backward sampling algorithm [16], which is too complex to describe in full here, but is covered in detail in [17].
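To make the model concrete, the following is a minimal Python sketch, not the implementation used in our experiments, of the predictive probability in Equation 5 together with a naive forward filtering/backward sampling step for a single word-pair. The base measure here is a placeholder uniform spelling model rather than the joint spelling model of [17]; the vocabulary sizes, maximum segment length, and the assumption that segments are non-empty on both sides are all illustrative choices.

```python
import random
from collections import defaultdict

class DirichletProcessModel:
    """Predictive probability of Equation 5, with count add/remove."""

    def __init__(self, alpha=1.0, src_vocab=27, trg_vocab=90):
        self.alpha = alpha              # concentration parameter
        self.counts = defaultdict(int)  # N((s,t)) for each observed pair
        self.total = 0                  # N: pairs generated so far
        self.src_vocab = src_vocab      # assumed grapheme inventory sizes
        self.trg_vocab = trg_vocab

    def base_measure(self, src, trg):
        # Placeholder G0: probability decays exponentially with the
        # source/target sequence lengths (uniform over graphemes).
        return self.src_vocab ** -len(src) * self.trg_vocab ** -len(trg)

    def prob(self, src, trg):
        # Equation 5: (N((s,t)) + alpha * G0((s,t))) / (N + alpha)
        return ((self.counts[(src, trg)]
                 + self.alpha * self.base_measure(src, trg))
                / (self.total + self.alpha))

    def observe(self, src, trg):
        self.counts[(src, trg)] += 1
        self.total += 1

    def forget(self, src, trg):
        # Remove a word's segments before resampling them (blocked Gibbs).
        self.counts[(src, trg)] -= 1
        self.total -= 1

def sample_derivation(dp, src, trg, max_len=7):
    """One block sample of a word-pair's derivation: forward filtering over
    the lattice of (source, target) positions, then backward sampling.
    In full blocked Gibbs training, the word's current segments would be
    dp.forget()-ten first and the newly sampled ones dp.observe()-d after;
    probabilities are held fixed within the block, the usual approximation."""
    M, N = len(src), len(trg)
    alpha = [[0.0] * (N + 1) for _ in range(M + 1)]
    alpha[0][0] = 1.0
    for i in range(M + 1):
        for j in range(N + 1):
            if (i, j) == (0, 0):
                continue
            for k in range(max(0, i - max_len), i):
                for l in range(max(0, j - max_len), j):
                    alpha[i][j] += alpha[k][l] * dp.prob(src[k:i], trg[l:j])
    segments, i, j = [], M, N
    while (i, j) != (0, 0):
        choices, weights = [], []
        for k in range(max(0, i - max_len), i):
            for l in range(max(0, j - max_len), j):
                choices.append((k, l))
                weights.append(alpha[k][l] * dp.prob(src[k:i], trg[l:j]))
        k, l = random.choices(choices, weights=weights)[0]
        segments.append((src[k:i], trg[l:j]))
        i, j = k, l
    segments.reverse()
    return segments
```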
III. Transliteration Generation

In this section we compare our Bayesian method to two state-of-the-art methods for transliteration generation. In the first experiment we compare our approach to the GIZA++ and phrase extraction heuristics technique that is frequently used to build the translation model for a phrase-based statistical machine translation system. In the second experiment, we look at the differences between our Bayesian technique and the publicly available m2m aligner within the context of a transliteration system based around a joint source-channel model.

A. Phrase-based Statistical Transliteration

For our experiments we use the phrase-based machine translation techniques introduced by [6], integrating our models within a log-linear framework [21]. Word alignment was performed using GIZA++ [5] and sequence-pair extraction using the MOSES [6] tools. The decoder used was an in-house phrase-based machine translation decoder that operates according to the same principles as the publicly available MOSES [6] SMT decoder. In these experiments 5-gram language models built with Witten-Bell smoothing were used. The system was trained in a standard manner, using a minimum error-rate training (MERT) procedure [22] with respect to the BLEU score on the held-out development data to optimize the log-linear model weights. Rama and Gali [23] evaluated several techniques for sequence-pair extraction for transliteration and found the grow-diag-final-and heuristic to be the most effective; we therefore adopt this method in the baseline system of our experiments. During the phrase-table generation process of a typical phrase-based SMT system, GIZA++ is run twice to generate alignments at the word level, from source-to-target and from target-to-source. Following this step, the grow-diag-final-and procedure is used to extract all phrases consistent with the word alignments arising from the two GIZA++ runs. When building a phrase-table from the alignment achieved at the final iteration of our Gibbs sampling procedure, we use a much simpler heuristic in the same spirit to derive a larger set of phrases consistent with the initial co-segmentation. Our experiments show that this is a necessary step that considerably improves system performance. The algorithm we use for phrasal extraction from the co-segmented corpus is as follows: within a single bilingual word-pair, agglomerate all contiguous bilingual sequence-pairs in all possible ways, but limit the size of the agglomerated source and target phrases to match the maximum phrase length parameter used to train the SMT system (this was set to 7 in our experiments). Limiting the size in this way is not strictly necessary, but we did so to keep the phrase-table generated from our Bayesian segmentation comparable to that generated by the baseline system; a sketch of the procedure is given below.
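The following is a small illustrative sketch of this agglomeration step; the function and variable names are our own, not those of the actual implementation used in the experiments.

```python
def agglomerate(seq_pairs, max_len=7):
    """Agglomerate all contiguous runs of bilingual sequence-pairs within
    one co-segmented word-pair, limiting source and target lengths to the
    SMT system's maximum phrase length.

    seq_pairs: list of (source_segment, target_segment) tuples, in order.
    """
    phrases = set()
    for i in range(len(seq_pairs)):
        src, trg = "", ""
        for j in range(i, len(seq_pairs)):
            src += seq_pairs[j][0]
            trg += seq_pairs[j][1]
            if len(src) > max_len or len(trg) > max_len:
                break
            phrases.add((src, trg))
    return phrases

# For example, a co-segmentation [("an", "アン"), ("d", "ド"), ("rew", "リュー")]
# yields ("an", "アン"), ("and", "アンド"), ("andrew", "アンドリュー"),
# ("d", "ド"), ("drew", "ドリュー") and ("rew", "リュー").
```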
1) Experimental Data: Our training data consisted of 27993 bilingual single-word pairs that were used in the NEWS2010 workshop transliteration shared task. The development data consisted of 3606 bilingual word-pairs drawn from the same sample. The evaluation data consisted of a further 1935 bilingual word-pairs not contained in the other two data sets. The corpus statistics for the three corpora are given in Table I.

TABLE I
Statistics of the English-Japanese bilingual corpora.

Corpus        Word-pairs   Characters (En / Ja)   Avg. Word Len. (En / Ja)
Training      27993        188941 / 131275        6.75 / 4.69
Development   3606         24066 / 16651          6.67 / 4.62
Evaluation    1935         11863 / 8199           6.13 / 4.24
We used the data to train a phrase-based SMT system to perform transliteration from English to Japanese. We trained our Dirichlet process model on the same parallel data set, and extracted transliteration phrase-tables from the co-segmentation of the corpus at the final iteration (iteration 30).

2) Evaluation Procedure: The results presented in this paper are given in terms of the official evaluation metrics used in the NEWS2010 transliteration generation shared task [24]. In our results, ACC refers to the top-1 accuracy score, which measures the percentage of the time the top hypothesis from the system exactly matches the reference. F-score measures the distance of the best hypothesis from the reference transliteration; the reader is referred to the workshop white-paper [24] for more details. For brevity, we only report our results in terms of ACC and F-score in this paper, but the results in terms of the other NEWS2010 metrics have the same character. A sketch of the two metrics is given below.
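As an illustration, the following is a simplified sketch of the two metrics, assuming a single reference per word (the official NEWS evaluation takes the best match over multiple references; see [24]).

```python
def lcs_len(a, b):
    # Classic dynamic program for the longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def acc_and_fscore(hypotheses, references):
    # hypotheses: top-1 system outputs; references: one reference per word.
    acc = f_total = 0.0
    for hyp, ref in zip(hypotheses, references):
        acc += (hyp == ref)
        lcs = lcs_len(hyp, ref)
        p = lcs / len(hyp) if hyp else 0.0   # character-level precision
        r = lcs / len(ref)                   # character-level recall
        f_total += 2 * p * r / (p + r) if p + r else 0.0
    n = len(references)
    return acc / n, f_total / n
```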
3) Evaluation Using Automatic Metrics: Our results on the English-to-Japanese transliteration task are summarized in Table II. It is clear from the table that using sequence-pairs taken only from the sample at the final iteration of training gave lower performance than the baseline system. The phrase-table derived in this way contained only 3372 sequence-pairs, as opposed to over 140,000 in the phrase-table extracted from the GIZA++ alignments. Moreover, these sequence-pairs were very short compared to those from the baseline system's phrase-table: approximately 3 characters in both source and target on average, compared to around 5 characters for the baseline system. When a phrase-table built from agglomerations of the same set of sequence-pairs was used, a much larger phrase-table of around 100,000 phrases resulted, with sequence-pairs that are comparable in size to those of the baseline, around 5 characters. On the transliteration task, this phrase-table gave an improvement of approximately 1% in ACC over the baseline system, from a phrase-table that was about 30% smaller in size. Moreover, since the sequence-pairs are concatenations of 3372 component sequence-pairs, this model could be stored very compactly if necessary. Further gains were obtained by interpolating the agglomerated model together with the baseline model. We believe this gain may be due to the effect of smoothing.

4) Discussion: Our experiments were designed to favor the baseline model, since the system was tuned using the MERT procedure with its own phrase-table. It is possible that our proposed system would have obtained a higher score if tuned with its own phrase-table; however, we chose not to do this, as it would have introduced into the results additional variance from the differences between the two MERT search processes. However, to verify that the agglomeration step was truly necessary, we also ran an experiment that was tuned with respect to the small unagglomerated phrase-table ('tuned on Bayesian phrase-table' in Table II) rather than the baseline. As expected, performance improved, possibly due to more optimal weights for the length models, since the phrases are shorter, but it did not improve to the level of the system that used the agglomerated phrase-table. It is interesting to note that the system's performance was improved dramatically simply by grouping the phrases into larger units. This highlights one of the advantages of the phrase-based translation approach. The agglomerated model, because of the way it was constructed, is not able to generate anything the simpler model cannot, but when larger sequence-pairs are used to build the target sequence, the characters in the phrase carry with them the implicit context of the other characters in the phrase, all of which have occurred together in the same context in the training corpus. In the model with the unagglomerated sequence-pairs, this role is performed mainly by the language model. In spite of the fact that we used a 5-gram language model, the system clearly benefited from a model that contained longer sequence-pairs as the basic translation unit. Looking at the phrase-table overlap figures in Table II, it seems that the process of agglomeration produced a phrase-table with a reasonable degree of overlap (57%) with that produced by using GIZA++ and the MOSES phrase extraction procedure.

B. Decoding Consistency

We ran an experiment to investigate the reasons for the improvements in system performance. Our hypothesis was that the Bayesian system had produced a phrase-table that led to a more consistent decoding process. This was based on the belief
TABLE II
The experimental results for the three systems together with some statistics of their phrase-tables. Here +agglomerated means the sequence-pairs were extracted by agglomeration from a single sample at the end of the training. In +integrated the phrase-tables from the baseline system and the agglomerated system were linearly interpolated with equal weights. Differences between systems were all found to be significant by paired t-testing at a level of 0.05, except for the ACC scores for the agglomerated and integrated systems.

Phrase Extraction Model                                  ACC    F-score  Phrase-table  Phrase-table  Avg. Phrase Len.
                                                                         Entries       Overlap (%)   (En / Ja)
GIZA++ and grow-diag-final-and                           0.313  0.745    143382        100           5.41 / 4.80
Bayesian Segmentation (tuned on baseline phrase-table)   0.278  0.726    3372          2             2.60 / 2.75
Bayesian Segmentation (tuned on Bayesian phrase-table)   0.283  0.732    3372          2             2.60 / 2.75
Bayesian Segmentation (+agglomerated)                    0.323  0.748    102507        57            5.54 / 4.83
Bayesian Segmentation (+integrated)                      0.329  0.752    164258        100           5.46 / 4.81
that the Dirichlet process model strongly encourages reuse of the bilingual sequence-pairs it discovers. This should result in a more compact phrase-table, and should entail that similar words in the corpus are likely to be decoded in a more homogeneous fashion. To test the hypothesis we modified the machine translation decoder to count the number of types of bilingual sequence-pair used to decode the evaluation data, and re-ran the English-Japanese transliteration experiment that showed the largest gain in performance. We found that the decoding process that used the phrase-table generated from our Bayesian model (with agglomerated sequence-pairs) used a total of 3496 unique sequence-pairs, whereas decoding using the phrase-table extracted using GIZA++ and grow-diag-final-and required a total of 3970 phrase-pairs during the decoding process, supporting our hypothesis. The 3496 sequence-pairs from the Bayesian model's phrase-table could be further analysed into 1289 component bilingual pairs that were present in the segmentation in the sample taken at the end of the training process.
Fig. 1. Results of the machine translation evaluation: translation ratio plotted against dictionary coverage for the 'baseline + dictionary' and 'baseline + dictionary + transliteration' systems.
C. Training Times

Bayesian methods are often criticized for their slow performance. The current implementation will require optimization to enable it to handle long sequences, but for transliteration data, where the sequence length is short, the process is practicable. The full training process completed on the NEWS2010 training data set in 15 minutes, averaging around 30 seconds for each iteration over the data.
IV. Joint Source-channel Model Generation

In these experiments we compare the Bayesian segmenter to a similar state-of-the-art segmentation tool that is capable of many-to-many alignments: the publicly available m2m alignment tool [25] (http://code.google.com/p/m2m-aligner/), which is trained using the EM algorithm and is based on the principles set out in [26]. The system we used for generation in these experiments was a phrase-based SMT system with its translation model augmented by a joint source-channel model trained directly from the bilingually aligned corpus. The score from this model was introduced into the log-linear model, and its weight trained together with the other weights in the MERT procedure. In this system we effectively replace the agglomeration step described in the previous section with a language model over bilingual sequence-pairs (a toy sketch of such a joint n-gram model appears at the end of this section). This has a similar effect of introducing longer-range context into the translation model. We observed in preliminary experiments that this hybrid system is able to outperform a system based only on the joint source-channel model, and also the PBSMT transliteration system described in the previous section. We believe the reason the system outperformed the earlier PBSMT system is that the joint source-channel model is a better way to model longer grapheme sequence-pairs, being able to generalize rather than being memory-based. The reason why the system outperformed the joint source-channel model alone is likely to lie in the length models: one controlling generation at the grapheme level, the other controlling generation at the grapheme sequence level. These models serve to mitigate the tendency of both the target language model and the joint source-channel model to generate short sequences. This is confirmed by inspection of the log-linear weights, which were both negative. The experiments were run in the same way for each of the aligners using the same script, the only difference being the choice of aligner used. We used data from the 2009 NEWS workshop for our experiments, and evaluated using the F-score metric used for the shared task evaluation. The aligners were run with their default settings, and with the same limits for source and target segment size. It may have been possible to obtain better performance from the aligners by adjusting specific parameters, but no attempt was made to do this. The results are shown in Table III. In all experiments, the Bayesian segmenter gave the best performance, and the
largest improvement was on language pairs that have large grapheme set sizes on the target side. The grapheme set size is shown in Table III in the 'Target Types' column. The source grapheme set sizes were very similar and small (around 27) for all experiments, as the source language was either English or, in the case of Jn-Jk, a romanized form of Japanese. Looking at the n-gram statistics in Table III, for languages with large grapheme sets the number of unigrams in the Bayesian model is less than half that used by the m2m model. Learning a compact model is one of the signature characteristics of the Bayesian model we use; adding a new parameter to the model is extremely costly, and the algorithm will therefore strongly prefer to learn a model in which the parameters are re-used. Initially we considered the hypothesis that the difference in performance between these two approaches came from differences in the sparseness of the language models. Surprisingly, however, the numbers of bi-grams and tri-grams in the joint language models are quite similar. Another explanation is that the smaller number of unigrams indicates that the segmentation is more self-consistent and therefore makes the generation task less ambiguous. This is supported by looking at the development set perplexity. On the Jn-Jk task, where the differences between the systems are the largest, we found that a joint language model trained on the Bayesian segmentation had 1-, 2-, and 3-gram perplexities of 218.3, 88.4 and 87.5 respectively, whereas the corresponding m2m model's perplexities were 321.8, 120.5 and 119.3. The number of segments used to segment the corpus was the same for both systems in this experiment. Table IV gives an example from the data of the differences in segmentation consistency. The Bayesian segmentation is strongly self-consistent: the source sequence 'ara' has been segmented identically as a single unit in all cases. The m2m system also shows self-consistency, but uses a few different strategies to segment the start of the sequence. Interestingly, the Bayesian method in this example has segmented according to the correct linguistic readings of the kanji. We investigate this further in the next section.
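As a toy illustration of a joint n-gram model over bilingual sequence-pairs of the kind compared in Table III, the following sketch uses a bigram model with simple add-one smoothing, whereas the actual systems used higher-order models with Witten-Bell smoothing; names and the smoothing choice are ours.

```python
import math
from collections import defaultdict

class JointBigramModel:
    """A joint bigram model over bilingual sequence-pair tokens: each token
    is a (source_segment, target_segment) tuple from the aligned corpus."""

    BOS = ("<s>", "<s>")

    def __init__(self):
        self.unigrams = defaultdict(int)   # counts of context tokens
        self.bigrams = defaultdict(int)    # counts of (prev, cur) pairs

    def train(self, derivations):
        for d in derivations:              # d: list of (src, trg) segments
            tokens = [self.BOS] + d
            for prev, cur in zip(tokens, tokens[1:]):
                self.unigrams[prev] += 1
                self.bigrams[(prev, cur)] += 1

    def logprob(self, derivation):
        tokens = [self.BOS] + derivation
        v = max(len(self.unigrams), 1)     # add-one smoothing denominator
        return sum(math.log((self.bigrams[(p, c)] + 1)
                            / (self.unigrams[p] + v))
                   for p, c in zip(tokens, tokens[1:]))
```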
A. Linguistic Agreement

In this experiment, we attempt to assess the ability of each segmentation scheme to discover the underlying linguistic segmentation of the data. We took a random sample of 100 word-pairs from the Japanese romaji to Japanese kanji training corpus. This language pair was chosen because it exhibited the largest difference between the two aligners, and because the 'correct' alignment/segmentation is almost always unambiguous and known. The segmentation of this sample using both systems was labeled as either 'correct' or 'incorrect' by a human judge using a Japanese name reading dictionary as a reference. We found that the Bayesian segmentation agreed with the human segmentation in 96% of the test cases, whereas the m2m system agreed in 42% of cases.
V. Human Evaluation using Speech-to-speech Translation Field Data

In this section, we report the results from experiments in machine translation carried out to evaluate the effectiveness of our transliteration method with real-world data collected from the application of mobile translation devices in the field. The test set for this evaluation was extracted from user log data of a set of speech-to-speech translation field experiments that occurred in the fiscal year 2009 [27]. The field experiments were undertaken nationwide in Japan, in five broad regions: Kanto, Kansai, Kyushu, Hokkaido and Chubu. They were undertaken as part of the Ministry of Internal Affairs and Communications initiative titled "Field Testing of Automatic Speech Translation Technology and its Contribution to Local Tourism." We sampled 100 sentences from the manually transcribed user log data of the field experiments from each of the five areas. Then we selected only those sentences that contained untranslatable named entities, leaving a total of 74 sentences. All of the test sentences contained at least one proper noun written in hiragana or kanji characters. The translation direction of the evaluation is from Japanese into English. As a baseline system we used a standard phrase-based statistical machine translation system trained on the BTEC corpus, consisting of 691,829 Japanese and English sentence pairs. To see the upper bound of the machine translation performance, we manually constructed a bilingual dictionary which consists of word categories and English translations of all proper nouns in the test set. The dictionary was used to varying degrees in the translation process [28]; the degree to which the dictionary was used was controlled through a parameter setting in the system. A single human judge was used to score the translation output, according to a five-level scale ranging from 'perfect' through 'acceptable' to 'nonsense'. We evaluated the transliteration component of the system on the transliteration task used in our field experiments. Each source word was transliterated from kanji into romaji, and the output evaluated against a reference set with up to 3 references per input sequence, using the same scoring metrics as the NEWS2010 workshop. The transliteration accuracy (ACC) on this task was 19.14%; this is lower than the NEWS2010 task scores, but this is to be expected, as our task contains both multi-word sequences and also sequences that are not transliterations of each other. Nonetheless, the F-score, which is a character-level score based on the length of the longest common subsequence, was 71.03, indicating that the performance is quite respectable at the character level. Visual inspection of the output showed that the biggest single cause of errors in the system is the assumption that all words should be transliterated. In reality, expressions may need to be translated, or partially transliterated and partially translated. One such example from the test corpus is the expression written in kanji and read "ITAMIKUUKOU". This should be correctly rendered as "Itami Airport"; however, the transliteration system produced "itamikuukou" as output
TABLE III
Transliteration quality measured in terms of F-score using alternative segmentation schemes, together with statistics relating to the number of parameters in the models derived from the segmentations.

Language  Target   F-score            m2m n-grams                Bayesian n-grams
Pair      Types    m2m     Bayesian   1-gram   2-gram   3-gram   1-gram   2-gram   3-gram
En-Ch     372      0.858   0.880      9379     44003    75513    4706     38647    72905
En-Hi     84       0.874   0.884      3114     15209    30195    1867     20218    34657
En-Ko     687      0.623   0.651      4337     11891    14112    2968     11233    14729
En-Ru     66       0.919   0.922      1638     6351     14869    1105     12607    23250
En-Ta     64       0.885   0.892      2852     14696    27869    1561     17195    30244
Jn-Jk     1514     0.669   0.767      7942     27286    38365    3532     22717    37560
TABLE IV
Example segmentations from the m2m segmenter and the Bayesian segmenter, taken from a long contiguous section of the training set where both techniques disagree on the segmentation.

m2m                       Bayesian
arad↦荒  a↦田             ara↦荒  da↦田
ar↦新   ae↦江             ara↦新  e↦江
ar↦荒   ahori↦堀          ara↦荒  hori↦堀
ar↦新   ai↦井             ara↦新  i↦井
ar↦新   ai↦居             ara↦新  i↦居
ar↦荒   ai↦井             ara↦荒  i↦井
ar↦荒   ai↦居             ara↦荒  i↦居
araj↦荒  ima↦島           ara↦荒  jima↦島
arak↦新  i↦木             ara↦新  ki↦木
arak↦荒  i↦木             ara↦荒  ki↦木
ar↦荒   akid↦木  a↦田     ara↦荒  ki↦木  da↦田
ar↦荒   ao↦尾             ara↦荒  o↦尾
ar↦荒   ao↦生             ara↦荒  o↦生
ar↦荒   aoka↦岡           ara↦荒  oka↦岡
arasa↦荒  wa↦沢           ara↦荒  sawa↦沢
ar↦荒   aseki↦関          ara↦荒  seki↦関
- a perfect transcription into kana, but incorrect nonetheless. Future research will need to address ways of identifying when to transliterate and when to translate. The problem of reordering was also an issue for our approach, although these errors were found in less than 4% of the sentences in the test corpus. An example of this type of error is "FUJISAN". The system's output in this case was "fujisan"; however, the correct outputs were "Mount Fuji" or "Mt. Fuji". In this example, both transliteration and translation are required, but in addition the order of the words is swapped. We believe that modeling this re-ordering process would give rise to improvements in system performance.
Fig. 1 shows the field test translation evaluation results. The vertical axis represents the translation ratio, which is the ratio of better-than-acceptable translation sentences to the total number of test sentences. To see the relationship between the dictionary size and the translation performance, we controlled the dictionary size. The horizontal axis shows dictionary coverage, which is the ratio of the proper nouns covered by the dictionary to the total number of occurrences of proper nouns in the evaluation set. The ◦ plots show machine translation with conventional dictionary usage [28], while the • plots arose from the translation strategy that uses the transliterated results only if there is no entry in the dictionary². It is clear from the figure that the use of transliteration within a machine translation system can lead to improvements in translation quality from a user's perspective. In Figure 1, the values at 0 on the x-axis represent the system performance without the ability to look up the correct translations for named entities in a dictionary. The translation performance of the transliteration strategy is significantly higher at this point. The differences between the two lines on the graph gradually diminish as the dictionary is used more and more for the translation of named entities, and at a value of 1 on the x-axis the two lines of course meet, as the strategies are equivalent to each other at this point. When dictionary coverage is low, we can expect transliteration to have a substantial impact on system performance; however, when we look at the line "baseline + dictionary + transliteration" at both 0 and 1 on the x-axis we can see the effect of the errors arising from the transliteration system. Adding the dictionary to the transliteration system clearly improves the overall system performance. Therefore, further work is needed to improve the quality of machine transliteration systems, and also to determine how and when to apply them.

² To incorporate transliterated results into machine translation, we used the same framework as the conventional dictionary-based technique proposed by [28]. In this method, the category of the word must be known. In our experiments, even for the transliterated words, we used manually assigned categories.
Fig. 2. A co-segmentation of the transliteration word-pair candidate 'andoriyuu' (Japanese transliteration of the English 'andrew') and 'android' in English: the Japanese character sequence アン (an) / ド (do) / リュー (riyuu) is aligned to the English character sequence an / d / roid, with model scores 0.034, 0.012 and 10e-12 for the three segments respectively. The segments 'an' (Japanese) with 'an' (English), and 'do' (Japanese) with 'd' (English), both receive high probabilities from the model, whereas the segment 'riyuu' (Japanese) with 'roid' (English) receives a very low probability, because the source and target grapheme sequences are long and this pairing has not been observed in the corpus.
TABLE V
The feature set used by the SVM to classify candidate transliteration pairs.

f1 = logprob / numsegs
f2 = |t| / |s|
f3 = (|s_bad| + |t_bad|) / (|s| + |t|)
f4 = minprob
VI. Transliteration Mining

In this section we look at the application of our Bayesian alignment method to transliteration mining: the task of deciding whether or not a pair of character sequences in different languages are transliterations of each other. An example of an aligned grapheme sequence-pair, the output of applying our Bayesian alignment process to the bilingual data, is illustrated in Figure 2. Our idea is built on the assumption that the better able our model is to generate a bilingual word-pair, the more likely it is that the word-pair is a transliteration pair that we would like to mine. We use the Dirichlet process model to co-segment and align the data, extract features from this segmentation (explained below), and use them to train a Support Vector Machine (SVM) [29] to classify candidate pairs as correct or incorrect transliteration pairs.

A. Feature Set

We used a total of 4 features in our SVM classifier; these are shown in Table V. Feature f1 is based directly on the assumption above. Feature f2 is a simple length-based heuristic which was expected to be generally useful. Feature f3 is designed to capture how much of the candidate pair cannot be modeled directly by the sequence-pairs learned by the Dirichlet process model. Feature f4 focuses on the score of the weakest part of the derivation. In Table V, logprob is the log probability of the sampled derivation of the two grapheme sequences, according to our generative model. numsegs is the number of bilingual segments used in this derivation. minprob is the log probability of the segment with the lowest probability in the derivation. |s| and |t| are the lengths (in graphemes) of the source and target words respectively. |s_bad| + |t_bad| is the total number of graphemes in both source and target that are in bad segments. Here, by bad segment we mean a bilingual segment that has not been observed in the training corpus and thus only receives a contribution from the base measure component of our Dirichlet process model (a bad segment is illustrated in Figure 2 as the rightmost segment in the sequence). A sketch of the feature computation is given below.
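The following sketch shows how these four features could be computed from a forced alignment; the segment representation and names are illustrative rather than those of our actual implementation.

```python
def mining_features(segments):
    """segments: list of (src, trg, logp, observed) for one forced-aligned
    candidate pair, where logp is the segment's log probability and
    observed is False for 'bad' segments (unseen in the training corpus,
    scored only by the base measure G0)."""
    logprob = sum(logp for _, _, logp, _ in segments)
    numsegs = len(segments)
    s_len = sum(len(s) for s, _, _, _ in segments)
    t_len = sum(len(t) for _, t, _, _ in segments)
    bad = sum(len(s) + len(t) for s, t, _, obs in segments if not obs)
    f1 = logprob / numsegs          # average segment log probability
    f2 = t_len / s_len              # length ratio |t| / |s|
    f3 = bad / (s_len + t_len)      # proportion of graphemes in bad segments
    f4 = min(logp for _, _, logp, _ in segments)   # weakest segment
    return [f1, f2, f3, f4]
```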
Fig. 3. A scatter plot of two features derived from the model scores of the training data set for the English-Russian task: the average log probability of the segments against the log probability of the least likely segment. Negative examples were selected from the shaded area.
B. Experiments

For our experiments we used data from all tracks of the NEWS 2010 Named Entity Workshop [30], [31], [32]. Our experiments were not part of the official NEWS2010 shared evaluation, but used the same data sets. The training data for this track consisted of title-pairs of interlanguage links between Wikipedia articles. These titles are noisy in the sense that they can be sequences of words, only some or even none of which may be transliterations of each other. The proportion of correct transliteration pairs to incorrect pairs in the training data was unknown. In addition, 1000 'seed' pairs of clean data were provided. The seed pairs contained only one word for each language and all were positive examples of transliteration pairs; no negative examples were included in the seed data. For evaluation, the participants were expected to mine transliteration pairs from the full training set. A set of approximately 1000 interlanguage links (each giving rise to 0, 1 or more transliteration pairs) was randomly sampled
Fig. 4. The precision and recall of our proposed method for the English-Tamil (En-Ta) and English-Hindi (En-Hi) tasks. Each panel plots recall against precision for the proposed approach, the LCSR-based approach (lcsr50 for En-Ta, lcsr40 for En-Hi) and the random-sampling approach, together with the baseline point (•).
from the training data, and not disclosed to the participants. In our experiments we used the same data and the same precision/recall/F-score evaluation metrics that were used in the official runs for the NEWS2010 workshop (refer to [31] for full details). As positive examples for training the SVM classifier we were able to use the set of seed pairs; however, no negative examples were provided for this task. [33] overcame this issue by generating their own set of negative examples in two ways. In the first, they pair the seed source sequences with random incorrect target sequences chosen from the set of all target sequences in the corpus, excluding the target sequence corresponding to the source. In the second, they pair source sequences with incorrect target sequences that are similar to the source, i.e. that have a large longest common subsequence ratio (LCSR). We used the same techniques, and in addition we propose an approach that creates a set of negative examples by exploiting the natural clustering that is induced by the features derived from our Bayesian model. In this approach, negative examples are selected where the values of the average log sequence probability and the minimum component sequence-pair probability are as high as possible, while excluding all of the seed pairs from the sample. This is illustrated in Figure 3 (a sketch of this selection appears at the end of this subsection). Figure 4 shows the results for the English-Tamil and English-Hindi tasks. The results for the other language pairs were similar in character and are omitted for brevity; they will be reported in the NEWS2011 workshop. Since the mixture of positive and negative examples in the test data is not known a priori, we provide results from our system for a range of values of the classification threshold on the output of the SVM. This gives precision/recall curves for each of the strategies for generating negative examples: our proposed approach, the approach based on LCSR, and the approach based on random sampling. For the baseline we plot a point ("•" on the graph) for the precision and recall of the top-ranked system in the NEWS2010 transliteration shared task, to represent the current state of the art. The graph on the bottom right shows similar results on a new English-Japanese task that we constructed in a similar manner to the NEWS workshop tasks. For most language pairs we found our precision/recall curve was to the right of the baseline system, indicating that our simple features based on forced alignment can give rise to performance comparable to the state of the art on this task.
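A sketch of this negative-example selection and the classifier training is given below; scikit-learn is used for the SVM, and the thresholds defining the shaded region of Figure 3 are purely illustrative values, not those used in our experiments.

```python
from sklearn.svm import SVC

def in_shaded_area(f, f1_min=-3.0, f4_min=-5.0):
    # The shaded area of Fig. 3: both the average segment log probability
    # (f1) and the minimum segment log probability (f4) are high.
    # Threshold values here are illustrative assumptions.
    return f[0] > f1_min and f[3] > f4_min

def train_miner(seed_features, candidate_features):
    """seed_features: features of the clean seed pairs (positives).
    candidate_features: features of the unlabeled Wikipedia title pairs,
    with the seed pairs already excluded."""
    negatives = [f for f in candidate_features if in_shaded_area(f)]
    X = seed_features + negatives
    y = [1] * len(seed_features) + [0] * len(negatives)
    return SVC().fit(X, y)

# Varying a threshold on clf.decision_function(test_features) traces out
# the precision/recall curves shown in Fig. 4.
```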
Our approach typically made errors on sequence-pairs that are genuine but contain novel sub-sequences of graphemes for which our model has no corresponding sequence-pair. Feature f3 in our model was designed to address this issue by balancing evidence from the lengths of the 'bad' segments in the pairs against evidence from the lengths of the 'good'; the idea being that an unobserved sequence-pair within a much larger context of observed sequence-pairs is likely to be a correct but novel alignment, rather than an incorrect alignment. Nonetheless some errors of this type remain, but the frequency of this type of error can be expected to decrease with training set size.

VII. Conclusion

In this paper we have presented an efficient implementation of a Bayesian method to co-segment and monotonically align sequences of tokens. We have compared the usefulness of the segmentation arising from this Bayesian process to two well-respected alignment schemes for the purpose of training a transliteration generation model. Our experiments have shown that our technique, when compared to both baseline methods, is able to produce more compact models with fewer parameters. Furthermore, transliteration systems trained from the Bayesian segmentation were able to outperform both of the baseline systems. This paper also presented a study of the effectiveness of using a transliteration engine within a real-world industrial machine translation application. Our experiments showed that transliteration is useful for improving the quality of machine translation for sentences containing out-of-vocabulary words. We also presented results showing how this process is affected by integrating bilingual dictionaries with varying degrees of coverage. Finally, this paper detailed a simple method of discriminating between correct and incorrect transliteration candidate pairs for transliteration mining by using features derived from the bilingual forced-alignment together with an SVM classifier. Our experiments show that despite the simplicity of our technique, it is able to reach levels of performance that are comparable to the state of the art in the field.
In the future we would like to develop more sophisticated Bayesian models, and investigate methods for identifying and dealing with different source languages. We would also like to develop the techniques in this paper into an end-to-end real-world application to mine transliteration pairs that can be used to train a transliteration engine that is integrated into an industrial machine translation system.

References

[1] A. Finch and E. Sumita, "Phrase-based machine transliteration," in Proc. 3rd International Joint Conference on NLP, vol. 1, Hyderabad, India, 2008.
[2] T. Rama and K. Gali, "Modeling machine transliteration as a phrase based statistical machine translation problem," in NEWS '09: Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration. Morristown, NJ, USA: Association for Computational Linguistics, 2009, pp. 124–127.
[3] S. Noeman, "Language independent transliteration system using phrase based SMT approach on substrings," in NEWS '09: Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration. Morristown, NJ, USA: Association for Computational Linguistics, 2009, pp. 112–115.
[4] H. Li, M. Zhang, and J. Su, "A joint source-channel model for machine transliteration," in ACL '04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. Morristown, NJ, USA: Association for Computational Linguistics, 2004, p. 159.
[5] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.
[6] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: open source toolkit for statistical machine translation," in ACL 2007: proceedings of demo and poster sessions, Prague, Czech Republic, June 2007, pp. 177–180.
[7] D. Yang, P. Dixon, Y.-C. Pan, T. Oonishi, M. Nakamura, and S. Furui, "Combining a two-step conditional random field model and a joint source channel model for machine transliteration," in NEWS '09: Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration. Morristown, NJ, USA: Association for Computational Linguistics, 2009, pp. 72–75.
[8] D. Marcu and W. Wong, "A phrase-based, joint probability model for statistical machine translation," in Proceedings of EMNLP, 2002, pp. 133–139.
[9] M. Bisani and H. Ney, "Joint-sequence models for grapheme-to-phoneme conversion," Speech Commun., vol. 50, no. 5, pp. 434–451, 2008.
[10] P. Blunsom, T. Cohn, C. Dyer, and M. Osborne, "A Gibbs sampler for phrasal synchronous grammar induction," in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL. Suntec, Singapore: Association for Computational Linguistics, August 2009, pp. 782–790.
[11] J. DeNero, A. Bouchard-Côté, and D. Klein, "Sampling alignment structure under a Bayesian translation model," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP '08. Stroudsburg, PA, USA: Association for Computational Linguistics, 2008, pp. 314–323. [Online]. Available: http://portal.acm.org/citation.cfm?id=1613715.1613758
[12] G. Neubig, T. Watanabe, E. Sumita, S. Mori, and T. Kawahara, "An unsupervised model for joint phrase alignment and extraction," in ACL, 2011, pp. 632–641.
[13] J. Wuebker, A. Mauser, and H. Ney, "Training phrase translation models with leaving-one-out," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden: Association for Computational Linguistics, July 2010, pp. 475–484.
[14] Y. Huang, M. Zhang, and C. L. Tan, "Nonparametric Bayesian machine transliteration with synchronous adaptor grammars," in ACL (Short Papers), 2011, pp. 534–539.
[15] J. Xu, J. Gao, K. Toutanova, and H. Ney, "Bayesian semi-supervised Chinese word segmentation for statistical machine translation," in COLING '08: Proceedings of the 22nd International Conference on Computational Linguistics. Morristown, NJ, USA: Association for Computational Linguistics, 2008, pp. 1017–1024.
[16] D. Mochihashi, T. Yamada, and N. Ueda, "Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling," in ACL-IJCNLP '09: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1. Morristown, NJ, USA: Association for Computational Linguistics, 2009, pp. 100–108.
[17] A. Finch and E. Sumita, "A Bayesian model of bilingual segmentation for transliteration," in Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), M. Federico, I. Lane, M. Paul, and F. Yvon, Eds., 2010, pp. 259–266.
[18] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Morristown, NJ, USA: Association for Computational Linguistics, 2001, pp. 311–318.
[19] G. Doddington, "Automatic evaluation of machine translation quality using n-gram co-occurrence statistics," in Proceedings of the HLT Conference, San Diego, California, 2002.
[20] X. Duan, D. Xiong, H. Zhang, M. Zhang, and H. Li, "I2R's machine translation system for IWSLT 2009," in Proceedings of the International Workshop on Spoken Language Translation, 2009, pp. 50–54.
[21] F. J. Och and H. Ney, "Discriminative training and maximum entropy models for statistical machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), 2002, pp. 295–302.
[22] F. J. Och, "Minimum error rate training for statistical machine translation," in Proceedings of the ACL, 2003.
[23] T. Rama and K. Gali, "Modeling machine transliteration as a phrase based statistical machine translation problem," in Proc. ACL/IJCNLP Named Entities Workshop Shared Task, 2009.
[24] H. Li, A. Kumaran, M. Zhang, and V. Pervouchine, "Whitepaper of NEWS 2010 shared task on transliteration generation," in Proc. ACL Named Entities Workshop Shared Task, 2010.
[25] S. Jiampojamarn, G. Kondrak, and T. Sherif, "Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion," in Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. Rochester, New York: Association for Computational Linguistics, April 2007, pp. 372–379. [Online]. Available: http://www.aclweb.org/anthology/N/N07/N07-1047
[26] E. S. Ristad and P. N. Yianilos, "Learning string edit distance," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 5, pp. 522–532, May 1998.
[27] H. Kawai, R. Isotani, K. Yasuda, E. Sumita, M. Utiyama, S. Matsuda, Y. Ashikari, and S. Nakamura, "An overview of a nation-wide field experiment of speech-to-speech translation in fiscal year 2009 (Japanese only)," in Proceedings of 2010 autumn meeting of the Acoustical Society of Japan, 2010, pp. 99–102.
[28] H. Okuma, H. Yamamoto, and E. Sumita, "Introducing a translation dictionary into phrase-based SMT," The IEICE Transactions on Information and Systems, vol. 91-D, no. 7, pp. 2051–2057, 2008.
[29] C. Cortes and V. Vapnik, "Support-vector networks," in Machine Learning, 1995, pp. 273–297.
[30] A. Kumaran, M. M. Khapra, and H. Li, "Whitepaper of NEWS 2010 shared task on transliteration mining," in Proceedings of the 2010 Named Entities Workshop. Uppsala, Sweden: Association for Computational Linguistics, July 2010, pp. 29–38. [Online].
Available: http://www.aclweb.org/anthology/W10-2404
[31] ——, "Report of NEWS 2010 transliteration mining shared task," in Proceedings of the 2010 Named Entities Workshop. Uppsala, Sweden: Association for Computational Linguistics, July 2010, pp. 21–28. [Online]. Available: http://www.aclweb.org/anthology/W10-2403
[32] A. Kumaran and H. Li, Eds., Proceedings of the 2010 Named Entities Workshop. Uppsala, Sweden: Association for Computational Linguistics, July 2010. [Online]. Available: http://www.aclweb.org/anthology/W10-24
[33] S. Jiampojamarn, K. Dwyer, S. Bergsma, A. Bhargava, Q. Dou, M.-Y. Kim, and G. Kondrak, "Transliteration generation and mining with limited training resources," in Proceedings of the 2010 Named Entities Workshop. Uppsala, Sweden: Association for Computational Linguistics, July 2010, pp. 39–47. [Online]. Available: http://www.aclweb.org/anthology/W10-2405