INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY 5, 199–209, 2002
© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Revealing Translators’ Knowledge: Statistical Methods in Constructing Practical Translation Lexicons for Language and Speech Processing

DAN TUFIŞ AND ANA MARIA BARBU
Romanian Academy (RACAI), 13, “13 Septembrie”, 74311 Bucharest 5, Romania
[email protected] [email protected]

Abstract. Parallel corpora encode extremely valuable linguistic knowledge about the paired languages, both in terms of vocabulary and syntax. A professional translation of a text represents a series of linguistic decisions made by the translator in order to convey as faithfully as possible the meaning of the original text and to produce a “natural” text from the perspective of a native speaker of the target language. The “naturalness” of a translation implies not only the grammaticality of the translated text, but also style and cultural or social specificity. We describe a program that exploits the knowledge embedded in parallel corpora and produces a set of translation equivalents (a translation lexicon). The program uses almost no linguistic knowledge, relying on statistical evidence and some simplifying assumptions. Our experiments were conducted on the MULTEXT-EAST multilingual parallel corpus (Orwell’s “1984”), and the evaluation of the system performance is presented in some detail in terms of precision, recall and processing time. We conclude by briefly mentioning some applications of the automatically extracted lexicons for text and speech processing.

Keywords: alignment, bitext, lemmatization, tagging, translation lexicon

1. Introduction

A bitext or a parallel text is an association between two texts (written or spoken) in different languages that represent translations of each other. By extension, a parallel text might contain several language translations of the same (source) text. A collection of parallel texts, or even a large enough parallel text, is called a parallel corpus. Parallel corpora may be further classified according to various criteria, but here we will be concerned with the most general requirement, namely, that a text in whatever language present in a parallel corpus represents the translation of a text in another language also present in that corpus. Sometimes, especially in a multilingual parallel corpus, it is useful to identify the text that served as a source for the translations of the other texts, since extra-linguistic considerations might partially obscure the translation equivalence relation (see below) between a pair of texts in two languages different from the source one.

A bitext encodes extremely valuable linguistic knowledge about the paired languages, both in terms of vocabulary and syntax. A professional translation of a text represents a series of linguistic decisions made by the translator in order to convey as faithfully as possible the meaning of the original text and to produce a “natural” text from the perspective of a native speaker of the target language. The “naturalness” of a translation implies not only the grammaticality of the translated text, but (depending on the text’s register) also style and cultural or social specificity.

The meaning distinctions and the translation equivalents provided by standard dictionaries are inherently based on a number of “milestone” senses. However, analysis of parallel corpora shows that for the majority of content words in a standard dictionary one may find real uses of those words where no “milestone” sense or translation would fit the corresponding context. Because of this, it often happens that many valid word (or expression) translations cannot be found in traditional printed bilingual dictionaries. Extracting bilingual dictionaries from corpora is a process based on the notion of translation equivalence (Gale and Church, 1991).


One of the widely accepted definitions (Melamed, 2001) of translation equivalence describes it as a (symmetric) relation that holds between two texts in different languages, such that expressions appearing in corresponding parts of the two texts are reciprocal translations. These expressions are called translation equivalents. The granularity at which translation equivalents are defined (paragraph, sentence, lexical) defines the granularity of a bitext alignment (paragraph, sentence, lexical). In this paper we are concerned with the finest granularity level, namely lexical translation equivalents.

Most modern approaches to the automatic extraction of translation equivalents rely on statistical techniques and fall roughly into two categories. The hypotheses-testing methods (e.g., Gale and Church, 1991; Smadja et al., 1996) rely on a generative device that produces a list of translation equivalence candidates (TECs). The TECs that show an association measure higher than expected under the independence assumption are taken to be translation-equivalence pairs (TEPs). The TEPs are extracted independently of one another, and therefore the process might be characterized as a local maximization (greedy) one. The estimating approach (Brown et al., 1993; Kupiec, 1993; Hiemstra, 1997) is based on building a statistical bitext model from data, the parameters of which are estimated according to a given set of assumptions. The bitext model allows for global maximization of the translation equivalence relation, considering not individual translation equivalents but sets of translation equivalents (sometimes called assignments).

The extraction process, as considered here, may be characterized as a testing approach. It first generates a list of translation equivalence candidates and then successively extracts the most likely translation-equivalence pairs. It does not require a pre-existing bilingual lexicon for the considered languages. Yet, if such a lexicon exists, it can be used to eliminate spurious candidate translation equivalence pairs and thus to speed up the process and increase its accuracy. Our algorithm relies on some pre-processing of the bitext, described in the following.


2. Segmentation; Words and Multiword Lexical Tokens

In many languages, a lexical item is generally considered to be a space-delimited string of characters, or what is usually called a word. However, it is not always necessary to interpret a space in text as a lexical item delimiter. For various reasons, in many languages and even in monolingual studies, some sequences of traditional words are considered to make up a single lexical unit. For translation purposes, treating multiword expressions as single lexical units is necessary because of the differences that might appear in the linguistic realization of commonly referred-to concepts. The recognition of multiword expressions as single lexical tokens and the splitting of single words into multiple lexical tokens (when this is the case) are generically called text segmentation, and the program that performs this task is called a segmenter or tokenizer. In the following we will refer to words and multiword expressions as lexical tokens or, simply, tokens.

For our experiments we used Philippe di Cristo’s multilingual segmenter MtSeg (http://www.lpl.univ-aix.fr/projects/multext/MtSeg/) developed for the MULTEXT project. The segmenter comes with tokenization resources for many western European languages, further enhanced in the MULTEXT-EAST project (Erjavec and Ide, 1998; Dimitrova et al., 1998; Tufiş et al., 1998) with corresponding resources for Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene. The segmenter is able to recognize dates, numbers and various fixed phrases, and to split clitics or contractions (where applicable), etc. To cope with the inherent incompleteness of the segmenter’s resources, in addition to using a collocation extractor (unaware of translation equivalence) we experimented with a complementary method that takes advantage of the word alignment process by trying to identify partially correct translation equivalents. This method is described at length in Tufiş (2001) and briefly reviewed in Section 5.2.1 on partial translations.
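To make the segmentation step concrete, here is a minimal sketch of longest-match grouping of multiword expressions into single lexical tokens. It is not MtSeg and does not use its resource format; the expression list is invented, and Python is used only for illustration (the authors' tools are written in Perl).

```python
# Illustrative longest-match grouping of multiword expressions (MWEs) into
# single lexical tokens. The MWE list below is a made-up stand-in for real
# tokenization resources such as those shipped with MtSeg.
MWES = {("in", "spite", "of"), ("as", "well", "as")}
MAX_LEN = max(len(m) for m in MWES)

def segment(words):
    """Greedily merge the longest matching multiword expression at each position."""
    tokens, i = [], 0
    while i < len(words):
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            if tuple(w.lower() for w in words[i:i + n]) in MWES:
                tokens.append("_".join(words[i:i + n]))  # one lexical token
                i += n
                break
        else:
            tokens.append(words[i])
            i += 1
    return tokens

print(segment("He failed in spite of our help".split()))
# ['He', 'failed', 'in_spite_of', 'our', 'help']
```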

2.1. Sentence Alignment

After MtSeg tokenized the two parts of the bitext, they were given as input to the sentence aligner. We used a slightly modified version of the Gale and Church CharAlign sentence aligner (Gale and Church, 1993). In the following we will refer to the alignment units as translation units (TU). In most cases, sentence alignments of all bitexts of our multilingual corpus are of the type 1:1, i.e., one sentence is translated as one sentence (see Table 1). Native speakers of the languages paired to English validated the alignments, so that most of the alignment errors were corrected.
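For the sketches given later in the paper it helps to fix a concrete picture of what pre-processing delivers to the extractor. The representation and file format below are assumptions made for illustration only (one 1:1 translation unit per line, the two halves tab-separated, each token written as lemma/POS); they do not reproduce the MULTEXT-EAST encoding.

```python
# Assumed downstream representation: a list of translation units (TUs), each
# a pair of token lists, every token being a (lemma, POS) tuple.
from typing import List, Tuple

Token = Tuple[str, str]               # (lemma, POS)
TU = Tuple[List[Token], List[Token]]  # (source tokens, target tokens)

def parse_half(half: str) -> List[Token]:
    # "lemma/POS" items separated by spaces (hypothetical format)
    return [tuple(tok.rsplit("/", 1)) for tok in half.split()]

def read_bitext(path: str) -> List[TU]:
    tus: List[TU] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            src, tgt = line.rstrip("\n").split("\t")
            tus.append((parse_half(src), parse_half(tgt)))
    return tus
```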


Table 1. 1:1 alignment units in the multilingual parallel corpus “1984”.

Language pair         1:1 alignment units    Percentage (%)
Bulgarian-English     6558                   98.42413
Czech-English         6438                   96.75383
Estonian-English      6426                   97.42268
Hungarian-English     6479                   97.16557
Romanian-English      6047                   94.04355
Slovene-English       6572                   98.38323

3. Tagging and Lemmatization

In our experiments we used a tiered-tagging approach with combined language models (Tufiş, 1999) based on TnT (Brants, 2000), a trigram HMM tagger. This approach has been shown to provide, for Romanian, an average accuracy of more than 98.5%. The tiered-tagging model is based on using two different tagsets. The first one, which is best suited for the statistical processing, is used internally, while the other one (used in a morpho-syntactic lexicon and in most cases more linguistically motivated) is used in the tagger’s output. The mapping between the two tagsets is more often than not deterministic (via a dictionary lookup), and in the rare cases where it is not, a few regular expressions may resolve the non-determinism. The idea of tiered tagging works not only for very fine-grained tagsets, but also for very low-information tagsets, such as those containing only the part of speech. In such cases the mapping from the hidden tagset to the coarse-grained tagset is strictly deterministic. In Tufiş (2000) we showed that using the coarse-grained tagset directly (14 non-punctuation tags) the best accuracy was 93%, while using the tiered-tagging and combined language model approach (92 non-punctuation tags in the hidden tagset) the accuracy was never below 99.5%.

Lemmatization is in our case a straightforward process, since the monolingual lexicons developed within MULTEXT-EAST contain, for each word, its lemma and morpho-syntactic codes. Knowing the wordform and its associated tag, lemma extraction is simply a matter of lexicon lookup for those words that are in the lexicon (for unknown words, the lemma is automatically set to the wordform itself). Erjavec and Ide (1998) provide a description of the MULTEXT-EAST lexicon encoding principles. A detailed presentation of their application to Romanian is given in Tufiş et al. (1998).

Unless specified otherwise, in the following by token type we will mean the single (lemmatized) lexical string standing for all inflected forms of the same lemma. A token type may have various occurrences (inflected or uninflected) in a given text.
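Since lemma extraction is described above as a lexicon lookup keyed by wordform and tag, a minimal sketch of that step is given below; the dictionary entries and the short tags are invented and do not reproduce the actual MULTEXT-EAST morpho-syntactic descriptors.

```python
# Lemma lookup keyed by (wordform, tag); entries and tags are illustrative only.
LEXICON = {
    ("being", "Vmpp"): "be",
    ("was",   "Vmis"): "be",
    ("books", "Ncnp"): "book",
}

def lemmatize(wordform: str, tag: str) -> str:
    # In-lexicon words: simple lookup.  Unknown words: fall back to the
    # wordform itself, as described in the text.
    return LEXICON.get((wordform.lower(), tag), wordform)

assert lemmatize("books", "Ncnp") == "book"
assert lemmatize("quux", "Ncns") == "quux"   # unknown word -> wordform
```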

4. Translation Equivalence

4.1. Underlying Assumptions

There are several underlying assumptions one can consider in keeping the computational complexity of a word alignment algorithm as low as possible. None of these assumptions is true in general, but the situations in which they are not true are rare enough so that ignoring the exceptions would not produce a significant number of errors and would not lose too many useful translations. Moreover, the assumptions we used do not prevent additional processing units from recovering some of the correct translations missed because they did not observe the assumptions. The assumptions we used in our basic algorithm are the following:

• a lexical token in one half of the TU corresponds to at most one non-empty lexical unit in the other half of the TU. This is the 1:1 mapping assumption that underlies the work of many other researchers (e.g., Kay and Röscheisen, 1993; Melamed, 2001; Ahrenberg et al., 2000; Hiemstra, 1997; Brew and McKelvie, 1996). When this hypothesis is not correct, the result is a partial translation. However, remember that a lexical token could be a multiword expression previously found and segmented as such by an adequate tokenizer;

• a lexical token, if used several times in the same TU, is used with the same meaning. This assumption is also used explicitly by Melamed (2001) and implicitly by all the previously mentioned authors;

• a lexical token in one part of a TU can be aligned to a lexical token in the other part of the TU only if the two tokens have compatible categories (part of speech). In most cases, compatibility reduces to category identity, that is, the two considered tokens have the same part of speech (POS). In the general case, it is possible to define compatibility mappings (e.g., participles or gerunds in English are quite often translated as adjectives or nouns in Romanian, and vice versa). This is essentially one very efficient way to cut down the search space and postpone dealing with irregular POS alternations; and


• although the word order is not an invariant of translation, it is not random either; when two or more candidate translation pairs are equally scored, the one containing tokens that are closer in relative position is preferred. This preference is also used in Ahrenberg et al. (2000).

Based on sentence alignment, POS tagging and lemmatization, the first step is to compute a list of translation equivalence candidates (TECL). By collecting, in each part of TU_j (the j-th translation unit), all the tokens of the same POS_k (in the order they appear in the text and removing duplicates), one builds the ordered sets L^{S_j}_{POS_k} and L^{T_j}_{POS_k}. For each POS_i, let TU^j_{POS_i} be defined as the Cartesian product L^{S_j}_{POS_i} \times L^{T_j}_{POS_i}. Then CTU^j (the candidates in the j-th TU) is defined as:

CTU^j = \bigcup_{i=1}^{\text{no. of POS}} TU^j_{POS_i}    (1)

With these notations, and considering that there are n alignment units in the whole bitext, TECL is then defined as:

TECL = \bigcup_{j=1}^{n} CTU^j    (2)
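A minimal sketch of Eqs. (1) and (2), under the (lemma, POS) token representation assumed earlier, could look as follows; it is an illustration, not the authors' Perl implementation.

```python
# Building TECL: for every TU and every POS, the candidates are the Cartesian
# product of the duplicate-free source and target lemma lists of that POS;
# TECL is the collection of all such candidates over all TUs.
from itertools import product

def tu_candidates(src_tokens, tgt_tokens):
    """CTU_j: per-POS Cartesian product of source and target lemmas."""
    by_pos_src, by_pos_tgt = {}, {}
    for lemma, pos in src_tokens:
        by_pos_src.setdefault(pos, [])
        if lemma not in by_pos_src[pos]:        # keep text order, drop duplicates
            by_pos_src[pos].append(lemma)
    for lemma, pos in tgt_tokens:
        by_pos_tgt.setdefault(pos, [])
        if lemma not in by_pos_tgt[pos]:
            by_pos_tgt[pos].append(lemma)
    cands = []
    for pos in by_pos_src:
        if pos in by_pos_tgt:
            cands.extend((s, t, pos)
                         for s, t in product(by_pos_src[pos], by_pos_tgt[pos]))
    return cands

def build_tecl(tus):
    """TECL: candidates collected from all n translation units."""
    return [cand for src, tgt in tus for cand in tu_candidates(src, tgt)]
```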

TECL contains a significant amount of noise, and many TECs are very improbable. In order to eliminate much of this noise, the most unlikely candidates are filtered out of TECL. The filtering is based on scoring the association between the tokens in a TEC. Any filtering would eliminate many wrong TECs but also some good ones. The ratio between the number of good TECs rejected and the number of wrong TECs rejected is just one criterion we used in deciding which test to use and in determining the threshold score below which any TEC will be removed from TECL. For the ranking of the TECs and their filtering we experimented with four scoring functions: MI (pointwise mutual information), the Dice coefficient, LL (log-likelihood), and χ² (chi-square). In order to define these scoring functions, let us consider the following notations:

• TEC = ⟨T_S, T_T⟩ ∈ TECL, the current translation equivalence candidate containing the token T_S and its candidate translation T_T;
• n_11 = the number of TUs which contain the current TEC ⟨T_S, T_T⟩;
• n_12 = the number of TUs in which the token T_S was paired with any other token but T_T;
• n_21 = the number of TUs in which the token T_T was paired with any other token but T_S;
• n_22 = the number of TUs in which no TECs containing either T_S or T_T appeared;
• n_1* = the number of TUs in which T_S appeared (irrespective of its associations);
• n_*1 = the number of TUs in which T_T appeared (irrespective of its associations);
• n_2* = the number of TUs in which T_S did not appear;
• n_*2 = the number of TUs in which T_T did not appear;
• n_** = the total number of TUs.

In terms of the above definitions, pointwise mutual information, the Dice coefficient, log-likelihood and chi-square are defined by Eqs. (3)–(6) respectively:

MI(T_T, T_S) = \log_2 \frac{n_{**} \, n_{11}}{n_{1*} \, n_{*1}}    (3)

DICE(T_T, T_S) = \frac{2 n_{11}}{n_{1*} + n_{*1}}    (4)

LL(T_T, T_S) = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} n_{ij} \log \frac{n_{ij} \, n_{**}}{n_{i*} \, n_{*j}}    (5)

\chi^2(T_T, T_S) = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{\left(n_{ij} - \frac{n_{i*} \, n_{*j}}{n_{**}}\right)^2 n_{**}}{n_{i*} \, n_{*j}}    (6)
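The four scores can be computed directly from the marginal counts defined above. The sketch below is illustrative and assumes non-degenerate marginals; the LL filter with the threshold of nine mentioned in the next paragraph is shown as a usage example.

```python
# Association scores (3)-(6) from TU-level counts: n11 = co-occurrences of the
# pair, n1_ = TUs containing T_S, n_1 = TUs containing T_T, n__ = total TUs.
from math import log, log2

def scores(n11, n1_, n_1, n__):
    n12, n21 = n1_ - n11, n_1 - n11
    n22 = n__ - n11 - n12 - n21
    mi = log2((n__ * n11) / (n1_ * n_1)) if n11 else float("-inf")
    dice = 2.0 * n11 / (n1_ + n_1)
    # Each cell with its row and column marginal (2x2 contingency table).
    table = [(n11, n1_, n_1), (n12, n1_, n__ - n_1),
             (n21, n__ - n1_, n_1), (n22, n__ - n1_, n__ - n_1)]
    ll = 2.0 * sum(nij * log(nij * n__ / (ni_ * n_j))
                   for nij, ni_, n_j in table if nij > 0)
    chi2 = sum((nij - ni_ * n_j / n__) ** 2 * n__ / (ni_ * n_j)
               for nij, ni_, n_j in table)
    return mi, dice, ll, chi2

mi, dice, ll, chi2 = scores(n11=25, n1_=30, n_1=28, n__=6000)
assert ll >= 9   # such a pair would survive the LL filter used in the paper
```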

After empirical tests we found that the best ratio between the number of good TECs rejected and the number of wrong TECs rejected was given by the LL test (Dunning, 1993) with the threshold value set to nine (see Section 5.2.1).

A baseline algorithm might not be very different from the filtering procedure discussed above. However, for improving the precision, the thresholds of whatever statistical test is used are set higher. Some additional restrictions, such as a minimal number of occurrences for ⟨T_S, T_T⟩ (usually three), are also used. This baseline algorithm may be enhanced in many ways (using a dictionary of already extracted TEPs to eliminate the generation of spurious TECs, stop-word lists, considering token string similarity, etc.). An algorithm with such extensions (plus a few more) is described in Gale and Church (1991). Although extremely simple, this algorithm, applied on a sample of 800 sentences from the Canadian Hansard, was reported to provide impressive precision (about 98%). However, the algorithm managed to find only the most frequent word types (4.5%), covering more than half (61%) of the word occurrences in the corpus. Its recall is modest if judged in terms of word types (cf. Melamed, 2001).


4.2. The Baseline Algorithm (BASE)

Our baseline algorithm is a simple iterative algorithm, significantly faster than the previous one, with much better recall, even when the precision is required to be as high as 98%. It can be enhanced in many ways (including those discussed above). It has some similarities to the iterative algorithm presented in Ahrenberg et al. (2000), but unlike it, our algorithm avoids computing various probabilities (or, more accurately, probability estimates) and scores (t-score). At each iteration step, the pairs that pass the selection (see below) are removed from TECL, so that this list is shortened after each step and eventually may be emptied.

Based on TECL, for each POS an S_m × T_n contingency table (TBL_k) is constructed, with S_m the number of token types in the first part of the bitext and T_n the number of token types in the other part of the bitext, as shown in Table 2. The rows of the table are indexed by the distinct source tokens, and the columns are indexed by the distinct target tokens (of the same POS). Each cell (i, j) contains the number of occurrences in TECL of ⟨TS_i, TT_j⟩. The selection condition is expressed by Eq. (7):

TP_k = \{\langle TS_i, TT_j \rangle \mid \forall p, q: (n_{ij} \ge n_{iq}) \wedge (n_{ij} \ge n_{pj})\}    (7)

This is the key idea of the BASE extraction algorithm: in order to select ⟨TS_i, TT_j⟩ as a translation equivalence pair, the number of associations of TS_i with TT_j must be higher than (or at least equal to) that with any other TT_p (p ≠ j). Similarly, the number of associations of TT_j with TS_i must be higher than (or at least equal to) that with any other TS_q (q ≠ i). All the pairs selected in TP_k are removed (the respective counts are zeroed).

Table 2. Contingency table with counts for TECs at step k.
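A minimal sketch of one BASE iteration under the selection condition of Eq. (7) is given below; for brevity the per-POS tables are collapsed into a single table of pair counts, and the four-step limit used in the validation runs appears as a default parameter.

```python
# One BASE iteration: keep a pair <TS_i, TT_j> only if its count is maximal
# both on its row and on its column; selected pairs then have their counts
# zeroed before the next iteration.
from collections import Counter

def base_iteration(counts):
    """counts: Counter mapping (source_lemma, target_lemma) -> TEC frequency."""
    row_max, col_max = {}, {}
    for (s, t), n in counts.items():
        row_max[s] = max(row_max.get(s, 0), n)
        col_max[t] = max(col_max.get(t, 0), n)
    selected = [(s, t) for (s, t), n in counts.items()
                if n > 0 and n == row_max[s] and n == col_max[t]]
    for pair in selected:            # zero the counts of the extracted pairs
        counts[pair] = 0
    return selected

def base(counts, max_steps=4):
    lexicon = []
    for _ in range(max_steps):
        step = base_iteration(counts)
        if not step:
            break
        lexicon.extend(step)
    return lexicon
```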

4.3. A Better Extraction Algorithm (BETA)

One of the main deficiencies of the BASE algorithm is that it is quite sensitive to what Melamed (2001) calls indirect associations. If ⟨TS_i, TT_j⟩ has a high association score and TT_j collocates with TT_k, it might very well happen that ⟨TS_i, TT_k⟩ also receives a high association score. Although, as observed by Melamed, indirect associations in general have lower scores than the direct (correct) associations, they can receive higher scores than many correct pairs, and this will not only generate wrong translation equivalents, but will also eliminate several correct pairs from further consideration, deteriorating the procedure’s recall. To weaken this sensitivity, the BASE algorithm had to require that the number of occurrences of a TEC be at least three, thus filtering out more than 50% of all possible TECs. Still, because of the indirect association effect, in spite of very good precision (more than 98%), approximately 50% of the correct pairs were missed among the considered pairs. The BASE algorithm has this deficiency because it looks at the association scores globally and does not check within the TUs to determine whether the tokens making up the indirect association are still there.

To diminish the influence of the indirect associations and consequently remove the occurrence threshold, we modified the BASE algorithm (Tufiş and Barbu, 2001a, b, c) so that the maximum score is considered not globally but within each of the TUs. This brings BETA closer to the competitive linking algorithm described in Melamed (2001). The competing pairs are only the TECs generated from the current TU, and the one with the best score is selected first. Based on the 1:1 mapping hypothesis, any TEC containing the tokens of the winning pair is discarded. Then the next best-scored TEC in the current TU is selected, and again the remaining pairs that include one of the two tokens of the selected pair are discarded. The multiple-step control in BASE, where each TU was scanned several times (as many times as there were iteration steps), is no longer necessary. The BETA algorithm sees each TU only once, but the TU is processed until no further TEPs can be reliably extracted or the TU is emptied. This modification improves both precision and recall in comparison with the BASE algorithm.


In accordance with the 1:1 mapping hypothesis, when two or more TEC pairs of the same TU share a token and are equally scored, the algorithm has to make a decision and choose only one of them. We used two heuristics: string similarity scoring and relative distance. The similarity measure we used, COGN(T_S, T_T), is very similar to the XXDICE score described in Brew and McKelvie (1996). The threshold for COGN(T_S, T_T) was empirically set to 0.42. This value depends on the pair of languages in the considered bitext. The motivation for the COGN heuristic is that when two similar words appear in aligned sentences, they are very likely to be mutual translations (i.e., cognates). The actual implementation of the COGN test includes a language-dependent normalization step that strips some suffixes, discards the diacritics, reduces some consonant doubling, etc. This normalization step was hand-written, but based on available lists of cognates it could be induced automatically. The second filtering condition, DIST(T_S, T_T) = |n − m|, considers the relative distance between the tokens in a pair (n and m are the indexes of T_S and T_T in the considered TU). The COGN filter is stronger than DIST, so the TEC with the highest similarity score is the preferred one. If the similarity score is irrelevant, the DIST(T_S, T_T) filter gives priority to the pairs with the smallest relative distance between their tokens.
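Putting the pieces together, here is a minimal sketch of the per-TU competitive selection with the COGN/DIST tie-breaking just described. The cognate score is an ordinary string-similarity ratio standing in for the normalized XXDICE-like measure used by the authors, and the cut-off for "reliably extractable" pairs is an assumption.

```python
# BETA selection inside one TU: candidates compete on a precomputed global
# association score (e.g. LL); ties are broken first by COGN, then by the
# smallest DIST = |n - m|; linked tokens are removed (1:1 mapping).
from difflib import SequenceMatcher

def cogn(s: str, t: str, threshold: float = 0.42) -> float:
    """String-similarity stand-in for COGN; sub-threshold scores are treated
    as irrelevant, so that DIST decides."""
    sim = SequenceMatcher(None, s, t).ratio()
    return sim if sim >= threshold else 0.0

def beta_tu(candidates, score, min_score=9.0):
    """candidates: (src, tgt, src_index, tgt_index) pairs from one TU;
    score: global association score per (src, tgt);
    min_score: assumed cut-off below which no pair is considered reliable."""
    links, pairs = [], list(candidates)
    while pairs:
        best = max(pairs, key=lambda p: (score.get((p[0], p[1]), 0.0),
                                         cogn(p[0], p[1]),
                                         -abs(p[2] - p[3])))
        if score.get((best[0], best[1]), 0.0) < min_score:
            break                      # nothing reliable left in this TU
        links.append((best[0], best[1]))
        # 1:1 mapping: drop every remaining candidate sharing a token with the link.
        pairs = [p for p in pairs if p[0] != best[0] and p[1] != best[1]]
    return links
```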

5. Experiments, Results and Evaluation

We conducted experiments on one of the few publicly available multilingual aligned corpora, namely the “1984” multilingual corpus (Dimitrova et al., 1998), containing six translations of the English original. This corpus was developed within the MULTEXT-EAST project, published on a CD-ROM (Erjavec et al., 1998) and recently improved within the CONCEDE project. TRACTOR, the TELRI Research Archive of Computational Tools and Resources (www.tractor.de), distributes this newer version. Table 3 shows, for each of the seven languages of the parallel corpus (Bulgarian-BU, Czech-CZ, English-EN, Estonian-ET, Hungarian-HU, Romanian-RO, and Slovene-SI), the following information: the number of distinct wordforms, the number of lemmas, and the number of lemmas occurring at least three times (this was the threshold used by the BASE algorithm). The evaluation protocol specified that all the translation pairs are to be judged in context, so that if a pair is found to be correct in at least one context, it should be judged as correct. The evaluation was done for both the BASE and the BETA algorithm, but on different scales. The BASE algorithm was run on all six bitexts with the English hub, and native speakers of the second language (with a good command of English) validated four of the six bilingual lexicons. The lexicons contained all parts of speech defined in the MULTEXT-EAST lexicon specifications except for interjections, particles and residuals. The BETA algorithm was run on the Romanian-English bitext, but at the time of this writing the evaluation was finalized only for the nominal translation pairs.

Table 3. The lemmatized monolingual “1984” overview.

Language                 BU      CZ      EN      ET      HU      RO      SI
No. of wordforms*        15093   17659   9192    16811   19250   14023   16402
No. of lemmas*           8225    8677    6871    8403    9729    6987    7157
No. of >2-occ lemmas*    3350    3329    2916    2729    3294    2999    3189

*The number of lemma types does not include interjections, particles and residuals.

5.1. The Evaluation of the BASE Algorithm

For validation purposes we limited the number of iteration steps to four. The extracted dictionaries contain adjectives (A), conjunctions (C), determiners (D), numerals (M), nouns (N), pronouns (P), adverbs (R), prepositions (S) and verbs (V). Table 4 shows the evaluation results for those languages for which we found voluntary native-speaker evaluators. The precision (Prec) was computed as the number of correct TEPs divided by the total number of extracted TEPs. The recall (considered for the non-English language in the bitext) was computed in two ways: the first one, Rec*, took into account only the tokens processed by the algorithm (those that appeared at least three times), and the second one, Rec, took into account all the tokens irrespective of their frequency counts. Rec* is defined as the number of source lemma types in the correct TEPs divided by the number of lemma types in the source language with at least three occurrences. Rec is defined as the number of source lemma types in the correct TEPs divided by the number of lemma types in the source language. The rationale for showing Rec* is to estimate the proportion of the considered tokens that were missed. This might be of interest when precision is of utmost importance. When the threshold of a minimum of three occurrences is considered, the algorithm provides a high Prec and a good Rec*.

Table 4. Partial evaluation of the BASE algorithm after 4 iteration steps.

Bitext (4 steps)   ET-EN          HU-EN          RO-EN          SI-EN
Entries            1911           1935           2227           1646
Prec/Rec (%)       96.18/18.79    96.89/19.27    98.38/25.21    98.66/22.69
Rec* (%)           57.86          56.92          58.75          57.92

The evaluation was fully performed for Estonian, Hungarian and Romanian and partially for Slovene (the first step was fully evaluated, while the rest were evaluated from randomly selected pairs). As one can see from the results in Table 4, the precision is higher than 98% for Romanian and Slovene, almost 97% for Hungarian and more than 96% for Estonian. Rec* ranges from 56.92% (Hungarian) to 58.75% (Romanian). The standard recall Rec varies between 18.79% and 25.21% (quite modest, since on average the BASE algorithm did not consider 60% of the lemmas).

To facilitate the comparison with the evaluation of the BETA algorithm, we ran the BASE algorithm to extract the noun translation pairs from the Romanian-English bitext. Our analysis showed that, depending on the part of speech, the extracted entries have different accuracy. The noun extraction had the second worst accuracy (the worst was the adverb), and therefore we considered that an in-depth evaluation of this case would be more informative than a global evaluation. We set no limit on the number of steps and lowered the occurrence threshold to two. The program stopped after ten steps with 1900 extracted noun translation pairs, out of which 126 were wrong (see Table 5). Compared with the four-step run, Prec decreased to 93.36%, but both Rec (39.76%) and Rec* (70.12%) improved.

Table 5. BASE evaluation on the noun dictionary extracted from the Romanian-English bitext (non-hapax).

Entries                    1900
Good entries               1774
Types in good entries      1366
Noun types in text         3435 (1948 occurring once)
Prec/Rec/Rec* (%)          93.36/39.76/70.12

We should mention that, despite the general practice of computing recall for the bilingual dictionary extraction task (be it Rec* or Rec), this is only an approximation of the real recall. The reason for this approximation is that in order to compute the real recall one should have a gold standard with all the words aligned by human evaluators. In general, such a gold standard bitext is not available, and the recall is either approximated as above or evaluated on a small sample, the result being taken to be more or less true for the whole bitext.

In the initial version of the BASE algorithm we used a chi-square test to check the selected TEPs. However, as the selection condition given by Eq. (7) is highly restrictive, the chi-square test became unnecessary and we eliminated it. This resulted in a very small decrease in recall, which was compensated by better precision and a significant improvement in running time.
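To make the evaluation measures concrete, a small sketch of the bookkeeping defined above is shown; the pair sets and type sets are whatever a validation pass produces, and nothing here is specific to the authors' tooling.

```python
# Prec, Rec and Rec* as defined in this section (assumes non-empty inputs).
def evaluate(extracted, correct, source_types, considered_types):
    """extracted/correct: sets of (src, tgt) pairs; source_types: all source
    lemma types; considered_types: those above the occurrence threshold."""
    good = extracted & correct
    prec = len(good) / len(extracted)
    good_src_types = {s for s, _ in good}
    rec = len(good_src_types & source_types) / len(source_types)
    rec_star = len(good_src_types & considered_types) / len(considered_types)
    return prec, rec, rec_star
```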

5.2. The Evaluation of the BETA Algorithm

The BETA algorithm preserves the simplicity of the BASE algorithm but significantly improves its recall (Rec) at the expense of some loss in precision (Prec). As said before, at the time of this writing the evaluation for the BETA algorithm had been done only for the Romanian-English bitext and only with respect to the dictionary of nouns (the full bilingual lexicon contains almost 8,000 entries). The figures in Table 6 summarize the results for the Romanian-English noun sub-lexicon. The results show that Rec (72.66%) almost doubled compared with the best Rec obtained by the BASE algorithm for nouns (39.76%, see Table 5). The coverage (the percentage of tokens in the text covered by the dictionary) also improved and reached a value of 93.06% (BASE had a coverage of 85.83%).

Table 6. BETA evaluation.

Noun types in text          3435
No. of entries              4023
Correct entries             3149
Types in correct entries    2496
Prec/Rec (%)                78.27/72.66

However, the price for this significant improvement was a serious deterioration of Prec (78.27% versus 93.36%). The analysis of the wrong translation pairs revealed that most of them were hapax pairs, and that they were selected because the DIST measure enabled them, so we concluded that this filter is not discriminative enough for hapaxes. On the other hand, for the non-hapax pairs the DIST condition was successful in more than 85% of the cases. Therefore, we decided that the additional DIST filtering condition be reserved for non-hapax competitors only. The results in Table 7 show that 166 erroneous TEPs were removed, but also that 144 good TEPs were lost. Prec improved (80.98% versus 78.27%) but Rec deteriorated (69.02% versus 72.66%). The coverage score for this modified version of BETA slightly decreased, to 92.36%.

Table 7. BETA evaluation, hapax TECs filtered by COGN(T_S, T_T) ≥ 0.42.

Noun types in text          3435
No. of entries              3713
Correct entries             3007
Types in correct entries    2371
Prec/Rec (%)                80.98/69.02

The BASE algorithm allows for trading off between Prec and Rec* by means of the number of iteration steps. The BETA algorithm allows for a similar trade-off between Prec and Rec by means of the COGN and DIST thresholds and, obviously, by means of an occurrence threshold. For instance, when BETA was set to ignore the hapax pairs, its Prec was 96.11% (better than the BASE precision of 93.36%), its Rec* was 96.41% (BASE with 10 iterations had a Rec* of 70.12%), and its Rec was 59.66% (BASE with 10 iterations had a Rec of 39.76%).

5.2.1. Partial Translations

As the alignment model used by the translation equivalence extraction is based on the 1:1 mapping hypothesis, it will inherently find partial translations for those cases where one or more words in one language must be translated by two or more words in the other language. Although we used a tokenizer aware of compounds in the two languages, its resources were obviously partial. In the extracted noun lexicon, the evaluators found 116 partial translations (3.86%). In this section we discuss one way to recover the correct translations for the partial ones discovered by our 1:1 mapping based extraction program.

First, from each part of the bitext a set of possible collocations was extracted by a simple method called “repeated segments” analysis. Any sequence of two or more tokens that appears more than once is retained. Additionally, the tags attached to the words occurring in a repeated segment must observe the syntactic patterns characterizing most real collocations. If all the content words (nouns, verbs, adjectives and adverbs) contained in a repeated segment have translation equivalents, then the repeated segment is discarded as not being relevant for a partial translation. Otherwise, the repeated segment is stored in the lexicon as a translation of its head noun. This simple procedure managed to recover 62 partial translations and to improve another 12 (still partial, but better).
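A minimal sketch of the "repeated segments" step just described (leaving out the POS-pattern check and the translation-equivalence test) could look as follows; the example data is invented.

```python
# Repeated-segments analysis: every token sequence of length >= 2 that occurs
# more than once in one half of the bitext is retained as a possible collocation.
from collections import Counter

def repeated_segments(sentences, max_len=4):
    """sentences: lists of lemmas from one half of the bitext."""
    counts = Counter()
    for sent in sentences:
        for n in range(2, max_len + 1):
            for i in range(len(sent) - n + 1):
                counts[tuple(sent[i:i + n])] += 1
    return [seg for seg, c in counts.items() if c > 1]

segs = repeated_segments([["big", "brother", "is", "watching"],
                          ["down", "with", "big", "brother"]])
# segs == [('big', 'brother')]
```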

The BETA extraction algorithm did not find translations for 892 Romanian noun lemmas. Of these, 47 occurred three or more times, 102 exactly twice and 743 only once. As we have shown, the initial phase builds a search space for the translation equivalence pairs. As this space is in general very large, it has to be filtered one way or another. However, besides throwing away a large number of noisy candidates, some correct pairs were lost in this process. This was responsible for about 60% of the missed correct translation pairs (the vast majority of them were hapax pairs, translating secondary meanings of quite frequent words). Working with much larger corpora should decrease the influence of this factor. We also found 20 English sentences (192 lemmas) not translated in Romanian, out of which 85 lemmas appeared in no other part of the novel.

We noticed that many erroneous translation pairs were extracted from very long TUs. The explanation is that long TUs produce a high level of noise for the way we computed the list of candidates. Because of alignment problems (errors, missing translations and long TUs) the recall was affected by about 9.5% (6% direct influence and about 3.5% indirect influence of the noise). Tagging errors (about 0.5% in the Romanian part and about 2.8% in the English part) were responsible for about 18.5% of the missed correct translations. Many missing translations are explained by the human translator’s idiosyncrasies as well as by the different nature of the language pairs considered. Being a literary translation, several words in the original were paraphrased, and some words were translated differently (by synonyms). In many cases, words in one language were translated into the other by words of a different part of speech (from the algorithm’s point of view this is identical to a tagging error). A few words were wrongly translated and some others were simply ignored. Altogether, the translator was deemed responsible for 12% of the missed pairs.

6. Implementation

The extraction programs, both BASE and BETA, are written in Perl, and we tested them under UNIX, LINUX and WINDOWS (DOS). Table 8 shows the BASE running time for each bitext in the “1984” parallel corpus, with all POS considered (Cygwin, a Unix emulator for Windows; Pentium II/233 MHz, 96 MB RAM). On the noun dictionary extraction task, the BETA algorithm was shown to be more than twice as slow (232 sec) as BASE (10 steps, 103 sec), yet still faster than most of the programs of which we are aware. The additional running time of the BETA algorithm as compared to BASE was mainly due to the COGN test.

Table 8. BASE extraction time for each of the bilingual lexicons (all POS).

Language            Extraction time (sec)
Bg-En               181
Cz-En               148
Et-En               139
Hu-En               220
Ro-En (4 steps)     183
Ro-En (28 steps)    415
Si-En               157

A quite similar approach to our BASE algorithm (also implemented in Perl) is presented in Ahrenberg et al. (2000). Their algorithm needed 55 minutes on an Ultrasparc1 workstation with 320 MB RAM to process a novel of about half the length of Orwell’s “1984”. They used a frequency threshold of three, and the best results reported are 92.5% precision and 54.6% recall (our Rec*). For a computer manual containing about 45% more tokens than our corpus, their algorithm required 4.5 hours, with the best results being 74.94% precision and 67.3% recall (Rec*). The BETA algorithm is closer to Melamed’s extractor, although our program is greedier and never returns to a visited translation unit. Melamed (2001) does not provide information on extraction times, which we suspect were higher than in our case.

7. Conclusions and Further Work

We presented two simple but very effective algorithms for extracting bilingual lexicons, based on a 1:1 mapping hypothesis. We showed that when a language-specific tokenizer is responsible for pre-processing the input to the extractor, the 1:1 mapping approach is no longer an important limitation. Incompleteness of the segmenter’s resources may be compensated for by a post-processing phase that recovers partial translations. We showed that such a recovery phase can successfully take advantage of the already extracted entries.

There are various applications of the approach presented in this paper. Besides the obvious applications in dictionary acquisition for machine translation systems, there are many other areas where such a system may be extremely useful. Below we briefly mention some of the areas where we used or plan to use the technology discussed here.

• Building a lexical ontology for Romanian (Tufiş and Cristea, 2002) to be incorporated into a multilingual ontology. This is an international project, BalkaNet, aimed at adding new languages to the EuroWordNet ontology (Vossen, 1999). Building the synsets for Romanian and the other languages in the project, and mapping them onto the Interlingual Index, relies heavily on the sense equivalence underlying our extracted dictionaries.

• Checking terminological consistency in translated technical documents. For all European Union (EU) accession countries the translation of the “acquis communautaire” (the accumulated laws and obligations of EU membership) is a constant preoccupation. One of the biggest problems in this extremely resource-consuming activity (this is human translation with hundreds of different translators) is ensuring terminological consistency. Technical terms should systematically be translated the same way; if our translation equivalents extractor were to find many translations for a given term, this could be an indication of translation inconsistency. This could easily be checked by simply following the links (created by the program) to the parallel sentences from which the suspect translations came. The same system could be used to identify translation omissions (sentences or fragments of sentences).

• Word sense discrimination is an active research area with great promise in the domains of intelligent information retrieval systems, data mining, and knowledge management systems. Some preliminary results of a sense clustering algorithm relying on our automatically extracted translation lexicons are presented in Ide et al. (2001), and they demonstrate that discriminating the senses of a given word based on its translations into the other languages of a parallel corpus is more precise than using only monolingual contextual information.

• Polyglot speech synthesis based on a bilingual or multilingual dictionary will make it possible to synthesize stretches of foreign languages within sentences of a different base language. Such a mixed-language or polyglot synthesis is necessary to synthesize, e.g., English proper names, place names, and quoted English words within a Romanian utterance. Multilinguality need not be textual only, but will take on spoken form when information services extend beyond national boundaries or across language groups. Database access by speech will need to handle multiple languages to service customers from different language groups within a country or travelers from abroad. Public service operators (emergency, police, department of transportation, telephone operators, and others) in the US, Japan and the EU frequently receive requests from foreigners unable to speak the national language. The main application areas of multilingual speech technology are:

  ◦ Spoken language identification, i.e., determining a speaker’s language automatically. This is of particular interest to public services, where callers could be routed to appropriate human translation services (Morimoto et al., 1993; Muthusamy et al., 1992);

  ◦ Multilingual speech recognition and understanding. Spoken language database access systems, for example, could operate in multiple languages and deliver text or information in the language of the input speech (Cole et al., 1995, 1996);

  ◦ Speech translation. This is still a very ambitious possibility and currently a subject of research, with good potential application in communication assistance (Waibel et al., 1991);

  ◦ Another promising application of this system could be a multilingual tutorial for language teaching through various skills (e.g., listening, comprehension, dictation, grammar exercises, translations).

As far as we are concerned, the natural language processing stage of a polyglot speech synthesis system (Traber et al., 1999) could also benefit from the results presented in this paper. We recently started joint research with a team from the “Speech Technology & Signal Processing Laboratory” of the Faculty of Electronics and Telecommunications in Bucharest, which developed a complete text-to-speech system for the Romanian language (Burileanu et al., 2000). This research activity is currently directed toward multilingual text analysis and phonetic conversion tasks.

References

Ahrenberg, L., Andersson, M., and Merkel, M. (2000). A knowledge-lite approach to word alignment. In J. Véronis (Ed.), Parallel Text Processing, Text, Speech and Language Technology Series, vol. 13. Boston: Kluwer Academic Publishers, pp. 97–116.

Brants, T. (2000). TnT—A statistical part-of-speech tagger. Proceedings of the Sixth Applied Natural Language Processing Conference. Seattle, WA: ANLP. Available at http://www.coli.uni-sb.de/~thorsten/.

Brew, C. and McKelvie, D. (1996). Word-pair extraction for lexicography. Available at http://www.ltg.ed.ac.uk/~chrisbr/papers/nemplap96.

Brown, P., Della Pietra, S.A., Della Pietra, V.J., and Mercer, R.L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Burileanu, D., Burileanu, C., and Niculiu, T. (2000). Connectionist methods applied in automatic speech synthesis. Romanian Journal of Information Science and Technology, 3(3):201–210.

Cole, R.A., Hirschman, L., Atlas, L., Beckman, M., Bierman, A., Bush, M., Cohen, J., Garcia, O., Hanson, B., Hermansky, H., Levinson, S., McKeown, K., Morgan, N., Novick, D., Ostendorf, M., Oviatt, S., Price, P., Silverman, H., Spitz, J., Waibel, A., Weinstein, C., Zahorian, S., and Zue, V. (1995). The challenge of spoken language systems: Research directions for the nineties. IEEE Transactions on Speech and Audio Processing, 3(1):1–21.

Cole, R.A., Mariani, J., Uszkoreit, H., Zaenen, A., and Zue, V. (Eds.) (1996). Survey of the State of the Art in Human Language Technology. Cambridge, UK: Cambridge University Press.

Dimitrova, L., Erjavec, T., Ide, N., Kaalep, H., Petkevic, V., and Tufiş, D. (1998). Multext-East: Parallel and comparable corpora and lexicons for six Central and East European languages. Proceedings ACL-COLING’98. Montreal: ACL, pp. 315–319.

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Erjavec, T. and Ide, N. (1998). The Multext-East corpus. Proceedings LREC’1998. Granada: ELRA, pp. 971–974.

Erjavec, T., Lawson, A., and Romary, L. (1998). East Meets West: A Compendium of Multilingual Resources. TELRI-MULTEXT EAST CD-ROM.

Gale, W.A. and Church, K.W. (1991). Identifying word correspondences in parallel texts. Fourth DARPA Workshop on Speech and Natural Language. Asilomar, CA, pp. 152–157.

Gale, W.A. and Church, K.W. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75–102.

Hiemstra, D. (1997). Deriving a bilingual lexicon for cross language information retrieval. Proceedings of the Fourth Groningen International Information Technology Conference for Students. Groningen: University of Groningen, pp. 21–26.

Ide, N., Erjavec, T., and Tufiş, D. (2001). Automatic sense tagging using parallel corpora. Proceedings of the 6th Natural Language Processing Pacific Rim Symposium. Tokyo: NLPRS Organization, pp. 83–90.

Kay, M. and Röscheisen, M. (1993). Text-translation alignment. Computational Linguistics, 19(1):121–142.

Kupiec, J. (1993). An algorithm for finding noun phrase correspondences in bilingual corpora. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics. Columbus, Ohio: ACL, pp. 17–22.

Melamed, D. (2001). Empirical Methods for Exploiting Parallel Texts. Cambridge, MA: MIT Press.

Morimoto, T., Takezawa, T., Yato, F., Sagayama, S., Tashiro, T., Nagata, M., and Kurematsu, A. (1993). ATR’s speech translation system: ASURA. Proceedings of the Third Conference on Speech Communication and Technology. Berlin, Germany, pp. 1295–1298.

Muthusamy, Y.K., Cole, R.A., and Oshika, B.T. (1992). The OGI multi-language telephone speech corpus. Proceedings of the 1992 International Conference on Spoken Language Processing. Banff, Canada: University of Alberta, pp. 895–898.

Smadja, F., McKeown, K.R., and Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1):1–38.

Traber, C., Huber, K., Jantzen, V., Nedir, K., Pfister, B., Keller, E., and Zellner, B. (1999). From multilingual to polyglot speech synthesis. Proceedings of Eurospeech’99. Budapest: Speech Technology Center, vol. 2, pp. 835–838.

Tufiş, D. (1999). Tiered tagging and combined classifiers. In F. Jelinek and E. Nöth (Eds.), Text, Speech and Dialogue, Lecture Notes in Artificial Intelligence 1692. New York: Springer-Verlag, pp. 29–33.

Tufiş, D. (2000). Using a large set of Eagles-compliant morpho-syntactic descriptors as a tagset for probabilistic tagging. Proceedings LREC’2000. Athens: ELRA, pp. 1105–1112.

Tufiş, D. (2001). Partial translations recovery in a 1:1 word-alignment approach. RACAI Research Report, Bucharest, p. 34.

Tufiş, D. and Barbu, A.M. (2001a). Automatic construction of translation lexicons. In V.V. Kluev, C.E. D’Attellis, and N.E. Mastorakis (Eds.), Advances in Automation, Multimedia and Video Systems, and Modern Computer Science, Electrical and Computer Engineering Series. WSES Press, http://www.worldses.org, pp. 156–161.

Tufiş, D. and Barbu, A.M. (2001b). Extracting multilingual lexicons from parallel corpora. Proceedings of the ACH-ALLC Conference. New York: New York University ITS Publishers, pp. 122–124.

Tufiş, D. and Barbu, A.M. (2001c). Accurate automatic extraction of translation equivalents from parallel corpora. In P. Rayson, A. Wilson, T. McEnery, A. Hardie, and S. Khoja (Eds.), Proceedings of the Corpus Linguistics 2001 Conference. Lancaster: Lancaster University, pp. 581–586.

Tufiş, D. and Cristea, D. (2002). Methodological issues in building the Romanian Wordnet and consistency checks in Balkanet. Proceedings of the Workshop on Wordnet Structures and Standardization, and How These Affect Wordnet Applications and Evaluation. Las Palmas: ELRA, pp. 35–41.

Tufiş, D., Ide, N., and Erjavec, T. (1998). Standardized specifications, development and assessment of large morpho-lexical resources for six Central and Eastern European languages. Proceedings LREC’98. Granada: ELRA, pp. 233–240.

Vossen, P. (Ed.) (1999). EuroWordNet: A Multilingual Database with Lexical Semantic Networks for European Languages. Dordrecht: Kluwer Academic Publishers.

Waibel, A., Jain, A., McNair, A., Saito, H., Hauptmann, A., and Tebelskis, J. (1991). JANUS: A speech-to-speech translation system using connectionist and symbolic processing strategies. Proceedings of the International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, pp. 793–796.