Sentence alignment in bilingual corpora based on crosslingual querying

Frédérique Bisson, Christian Fluhr
CEA/DIST, 91191 Gif-sur-Yvette Cedex, France
{bisson, fluhr}@cartier.cea.fr

Abstract

The effectiveness of translation memory for computer-aided translation depends on the results of a prior sentence alignment. This paper describes a new approach to sentence alignment, based on crosslingual querying using the technology of an existing product, SPIRIT (Syntactic and Probabilistic Indexing and Retrieval of Information in Texts). Sentence alignment and crosslingual querying based on bilingual reformulation are similar problems: both rest on a semantic proximity between two texts in different languages, and both aim to find the sentences that contain most of the information demanded by the query. However, sentence alignment additionally requires the irrelevant part of a sentence to be as short as possible. Crosslingual querying thus provides sentence alignment with candidates. The ARCADE evaluation has shown that this approach is very robust in cases of inverted sentence order and missing segments.

1. Introduction to sentence alignment

The problem of computer-aided translation has been changed by the introduction of the notion of translation memory. This approach is based on the compilation of existing translations that can be used as a source of knowledge for new translations. It can be very useful for translating technical documentation, because much of the text is unchanged or similar from one version to another. The approach is also justified by the economic importance of this kind of translation and the need to minimize the time required for it. Of course, units smaller than full sentences, such as simple sentences, idiomatic expressions, compounds or single words, can also be aligned, which can help to build bilingual dictionaries.

A translation memory is based upon sentence alignment, i.e. the recognition of sentence-to-sentence links or correspondences in translated texts. The products currently available on the market are not satisfactory, and sentence alignment or automatic alignment revision is performed by human translators in most translation departments.

Several research teams are working on sentence alignment (for an overview, see [VER2000]). Their systems generally proceed in two steps: the calculation of a score for all pairs of candidate sentences, and the search for minimal-cost alignments. The score is computed by one of three types of methods. The first uses the formal characteristics of the sentences, particularly the sentence length, calculated in words [BRO91] or in characters [GAL91]. It is assumed that two candidate sentences for alignment are of similar length, so a short sentence in the source language will probably be translated into a short sentence in the target language. The second method relies on lexical characteristics of the sentences: the system performs a partial word alignment to obtain a refined sentence alignment. This method is used by [KAY88], [CAT89] and [DEB92], who align words with the help of bilingual dictionaries. The last method is a mixed one, using a proportion of cognate words (a lexical characteristic) and length criteria (a formal characteristic), as in [SIM92] and [GAU95]. Cognate words are words that are written similarly in the source and target languages, such as «activités» in French and «activities» in English.
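As an illustration of the length-based criterion, here is a minimal sketch in the spirit of [GAL91], not of our system: a pair of candidate sentences is penalized according to how much their character lengths diverge. The constants are illustrative stand-ins for values that [GAL91] estimates from aligned data.

```python
import math

# Illustrative constants (assumptions, not the published estimates):
# C is the expected target/source character-length ratio, S2 the variance
# of the length difference per source character.
C = 1.1
S2 = 6.8

def length_cost(len_source: int, len_target: int) -> float:
    """Penalty for aligning two sentences of the given character lengths:
    small when the lengths are compatible, large when they diverge."""
    delta = (len_target - len_source * C) / math.sqrt(len_source * S2)
    return abs(delta)

print(length_cost(120, 131))  # similar lengths -> low cost
print(length_cost(120, 40))   # very different lengths -> high cost
```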

2. Crosslingual information retrieval

There are three main approaches to crosslingual querying of text databases.

The first is a statistical approach assuming that some of the documents in a database exist in both languages. In some cases this approach does not require actual document translations: it may be enough that the documents in different languages share a common topic, especially in the case of journal articles. The best-known model is Gerard Salton's vector space model [SAL89]. In this model, the database is represented as a vector space with as many dimensions as there are different words in the database. Documents and queries are vectors in this space, and document-query proximity is evaluated by computing the cosine of the angle between the two vectors. Latent semantic indexing is an improved model: by reducing the number of dimensions to a few hundred, it achieves a sort of implicit reformulation within and between the languages. The results are satisfactory, but they imply the storage of translated documents in the databases.

The second approach is based on the use of a machine translation (MT) system to translate either the database or the queries (the more general case). The drawback of this approach is that the MT system provides one translation for each word, so any incorrect translation (as often happens for polysemous words) will give unsatisfactory query results.

The third approach, which is ours, employs bilingual reformulation, which tries every possible translation of each word. When there is an answer to the query, ambiguities are resolved by using the relevant documents as a semantic filter to choose the right translation. For more details on crosslingual information retrieval, see [GRE98].
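For concreteness, here is a minimal sketch of the vector space proximity described above (a plain bag-of-words cosine, not the SPIRIT model):

```python
import math
from collections import Counter

def cosine(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# A query and a document about the same concepts score close to 1.
print(cosine("radioactive waste management", "management of radioactive waste"))
```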

3. The EMIR approach to crosslingual information retrieval

The EMIR (European Multilingual Information Retrieval) project was based on the technology of an existing product, the SPIRIT system. SPIRIT stands for "Syntactic and Probabilistic Indexing and Retrieval of Information in Texts" [FLU94]. Its linguistic processing consists of a morphosyntactic parsing that gives each word a part of speech, recognizes idiomatic expressions using a dictionary, normalizes words through lemmatization, recognizes dependency relations (especially in compounds), identifies general-language synonyms and resolves some homograph problems (especially when syntactic parsing can discriminate between the alternatives). At the end of the process, empty words (stopwords) are dropped depending on their part of speech. The statistical model then provides the user with a list of documents sorted according to their relevance. The SPIRIT model differs from the vector space model in that it assigns a weight to each database word according to its discriminating power, but does not assign a weight to each word in each document.
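A toy sketch of the normalization idea follows; the tiny dictionary and the function names are hypothetical, and the real SPIRIT parser is far richer:

```python
# Hypothetical toy lemma dictionary: surface form -> (lemma, part of speech).
LEMMAS = {
    "activities": ("activity", "noun"),
    "are": ("be", "verb"),
    "assessed": ("assess", "verb"),
    "the": ("the", "det"),
}
EMPTY_POS = {"det", "prep", "conj"}  # "empty" words dropped by part of speech

def normalize(sentence: str) -> list[str]:
    """Lemmatize each token and drop the empty words."""
    out = []
    for token in sentence.lower().split():
        lemma, pos = LEMMAS.get(token, (token, "noun"))  # unknown words kept as-is
        if pos not in EMPTY_POS:
            out.append(lemma)
    return out

print(normalize("The activities are assessed"))  # ['activity', 'be', 'assess']
```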

[Figure 1: SPIRIT components. The flowchart shows a query in natural language and the documents both passing, with the support of a linguistic dictionary, through morphological analysis (recognition of all word forms: conjugations, singular/plural, acronyms...) and syntactic analysis (part-of-speech tagging, recognition of linguistic dependencies). The representation of the query is then expanded by reformulation based on monolingual and bilingual dictionaries, while the documents go through statistical analysis into data storage. A comparison step returns the answers grouped in relevance-ranked classes.]

This point is important because the aim of the system is to find relevant information, even in documents whose main content is outside the scope of the query. The SPIRIT system can be considered a weighted Boolean system. It groups documents into classes that have a common concept intersection with the query; these classes form a partition of the database. Thus, each document is ranked in the best possible class.

Example: a query of our library catalog about "management of radioactive waste" gives the following results:
- First class: 166 documents, characterized by "management of radioactive waste".
- Second class: 62 documents, characterized by "management of waste" and "radioactive".
- Third class: 4 documents, characterized by "radioactive waste" and "management".
- Fourth class: 44 documents, characterized by "management of waste".
- Fifth class: 356 documents, characterized by "radioactive waste".
- Etc.

Reformulation is used to infer, from the original query words, other words expressing the same concepts. Reformulation can be in the same language (synonyms, hyponyms, etc.) or in a different language (bilingual dictionary) [DEB88]. The comparison tool is used for a quick evaluation of all possible intersections between query words and documents. It also computes a relevance weight for each document; for information retrieval, this weight depends only upon the query-document intersection. Each word has a weight in the database, computed using the inverse document frequency, and the weight of a query-document intersection is the sum of the weights of all its words. Other weighting functions can be used for other purposes (see §5).

The general principles of crosslingual querying based upon bilingual dictionaries are the following: all possible translations are inferred from the original query words; some of these translations are eliminated because they are not in the database; the rest are filtered by the database.
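The following sketch illustrates the weighting and class-grouping scheme just described, under the assumption that documents are reduced to sets of normalized words; it is an illustration, not the SPIRIT implementation:

```python
import math
from collections import defaultdict

def idf_weights(documents):
    """Weight of each database word by inverse document frequency."""
    df = defaultdict(int)
    for doc in documents:
        for word in doc:
            df[word] += 1
    n = len(documents)
    return {w: math.log(n / df[w]) for w in df}

def rank_in_classes(query, documents):
    """Group documents by their concept intersection with the query and
    rank the classes by the summed weight of the common words."""
    weights = idf_weights(documents)
    classes = defaultdict(list)
    for i, doc in enumerate(documents):
        inter = frozenset(query & doc)
        if inter:
            classes[inter].append(i)
    return sorted(classes.items(), key=lambda kv: -sum(weights[w] for w in kv[0]))

docs = [{"management", "radioactive", "waste"},
        {"radioactive", "waste"},
        {"management", "waste"}]
for inter, ids in rank_in_classes({"management", "radioactive", "waste"}, docs):
    print(sorted(inter), "->", ids)
```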

Translations per rule:   1      2     3     4     5     6    7    8    9    10   11  12  13  14  15  16  17  18  19  20  21  23  24
Number of rules:         18621  8621  4287  2073  1153  630  396  240  167  103  69  58  48  20  15  21  5   11  4   7   4   2   4

Total number of translations in the dictionary: 77431

Figure 2: Distribution of English-French translations

Translations per rule:   1      2     3     4     5     6    7    8    9    10   11  12  13  14  15  16  17  18  19  21  23  24
Number of rules:         14643  8109  4469  2431  1384  810  479  266  190  119  87  50  30  22  26  8   6   9   8   1   1   1

Total number of translations in the dictionary: 77506

Figure 3: Distribution of French-English translations

For example, the French-English dictionary has about 33,000 entries with a mean ambiguity rate of 2.33. About 14,600 words have only one translation, and the maximum number of translations for a word is 24. Similarly, the English-French dictionary has about 36,500 entries with a mean ambiguity rate of 2.12; about 18,600 words have only one translation, and the maximum number of translations for a word is also 24.

In fact, for any relevant document, there will be a broad intersection between the concepts expressed by the query words and those expressed in the document. In this case the translation of the query words is likely to be right, owing to the co-occurrence of translations for a large number of query words.

Example: if the query contains "free space", the possible translations of "free" in French are gratuit, large, libre, and the possible translations of "space" in French are blanc, espace, place, période. The only possible combination found in our safety regulation database is "espace libre", which is the right translation. Of course, dependency relations are also considered: while two words in a dependency relation can have more than one hundred translations (using word-for-word translation), very few of them occur in the database with the same dependency relation. SPIRIT's crosslingual architecture is described in [FLU97/2].
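A minimal sketch of this filtering follows, with a hypothetical toy dictionary (the real dictionaries hold tens of thousands of entries) and without the lemmatization and dependency handling of the real system:

```python
# Hypothetical toy bilingual dictionary for the "free space" example.
DICT_EN_FR = {
    "free": {"gratuit", "large", "libre"},
    "space": {"blanc", "espace", "place", "période"},
}

def filtered_translations(query_words, database_sentences):
    """Keep only database sentences in which every query word finds at
    least one of its possible translations (the database acts as a filter)."""
    hits = []
    for sent in database_sentences:
        words = set(sent.lower().split())
        found = {w: DICT_EN_FR.get(w, set()) & words for w in query_words}
        if all(found.values()):
            hits.append((sent, found))
    return hits

db = ["il reste un espace libre dans le conteneur",
      "entrée gratuite pendant cette période"]  # rejected: no lemmatization here
for sent, found in filtered_translations(["free", "space"], db):
    print(sent, "->", found)
```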

4. Discussion of crosslingual information retrieval results

Effectiveness, evaluated by the standard TRECEVAL method, has given encouraging results: the difference between monolingual and bilingual querying is getting smaller. A detailed study of the problems shows that system failures are mainly due to the incompleteness of the dictionaries and to inconsistencies in the linguistic knowledge used in the system. That is why we have postponed the commercial development of highly sophisticated tools, such as multistep reformulation or translation filtering through the best documents [ELK97], to focus on quality control of linguistic data (for more details see [FLU97/1]). An example of inconsistency: in the monolingual dictionary "color" and "colour" are normalized to "color", but in the bilingual dictionary the entry is "colour" and not "color". Our bilingual dictionaries presently contain around 32,000 entries each for French-English and English-French.

One of the most critical points is the identification of translations missing from the bilingual dictionary. When a word has no translation at all, the need for a dictionary update is obvious. But the required translation may be missing for a word that already has an entry in the bilingual dictionary, or for a compound word that cannot be translated word for word, and these cases are very difficult to identify. Compounds are particularly important in technical subjects. This kind of problem is the subject of current research: we are trying to extract unknown compound or single-word translations from already translated texts. As a consequence, the crosslingual querying community is very interested in methods that use already translated documents to extract sentence and word correspondences.

5. Sentence alignment based on crosslingual information retrieval

Crosslingual querying based on bilingual reformulation and sentence alignment are similar problems: in both cases, we must compute the proximity between two texts in two different languages. The main difference is that, in information retrieval, the proximity value refers to the semantic intersection between the reference text (query or text selected for dynamic hypertext) and the texts stored in the database. The more the intersection reflects the query, the more relevant the document, even if this intersection covers a very small number of words compared to the length of the whole text. In 1-to-1 sentence alignment, a proximity can be calculated to evaluate whether two sentences can be considered corresponding translations. Obviously, in such a case, the intersection should be equal to each of the sentences if the bilingual dictionaries were exhaustive. As this is not the case, the intersection can only be partial.

Given the similarity of the two problems, we have decided to use crosslingual querying techniques to find 1-to-1 sentence alignments. The proximity between two sentences is computed considering both the intersection and the portions missing from each sentence. The corpus of texts, one part of which is a translation of the other, consists of two sets of ordered sentences, one for each language. These two sets are indexed into two separate document databases using the standard SPIRIT system, the documentary unit being the sentence for each database. Our aim is to use crosslingual querying to detect the link between sentences in a source language and their translations in a target language. We have attempted to obtain high precision by finding links that have a wide margin of certainty, giving us a good reference from which to develop alignment strategies for more difficult cases.

First step: one-to-one alignment

Our algorithm is not symmetrical: one of the two languages is selected as the starting language, in our case French. We have not yet done experiments to choose the best reference language for our alignment algorithm. The best one is probably the source language of the bilingual dictionary with the best coverage, but this remains to be checked; the nature of the language (its rate of polysemy) may also have a significant effect on the alignment.

In this first step, each French sentence of a document is used as a query in a crosslingual querying of the target-language database, which consists of the English sentences extracted from the translated documents. The result is a list of classes of English sentences, ordered according to the weight of the semantic intersection between the French query sentence and the English sentences. Within each class, the sentences are ranked in descending order of character length. The sentences in the first classes (the best intersections) are candidates for alignment with the French sentence used as a query.

We confirm the alignment by using each English sentence of the first two classes for a reverse crosslingual querying: the English sentence is used as a query against the French sentence database, and the new result is an ordered list of classes of French sentences. If the French sentence used as the first query is found in the first two classes, we consider that it is the translation of the English sentence used for the querying. This check ensures that the English and French sentences are reasonably good translations of each other. Several sentences can be candidates, so it is then necessary to refer to the sentence positions to make a choice that really aligns the sentences in their context; we have used the sentence rank in the document and the proportion of intersection terms between the two sentences to confirm the alignment. This method establishes most of the alignments, i.e. the 1-1 alignments, where one French sentence is translated into one English sentence. A full description of the algorithm is given in ["Mutual benefit of sentence/word alignment and crosslingual information retrieval" in VER2000].
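The control flow of this first step can be sketched as follows; the querying engine is abstracted as a function returning relevance-ranked classes of target sentences, and the toy stand-in below is hypothetical, not SPIRIT:

```python
def one_to_one_alignments(french_db, english_db, crossquery):
    """crossquery(sentence, target_db) -> list of classes, each a list of
    target sentences, ranked by weight of the semantic intersection."""
    links = []
    for fr in french_db:
        classes = crossquery(fr, english_db)
        candidates = [s for cls in classes[:2] for s in cls]  # first two classes
        for en in candidates:
            back = crossquery(en, french_db)  # reverse crosslingual query
            if any(fr in cls for cls in back[:2]):
                links.append((fr, en))  # fr comes back: link confirmed
                break  # sentence position and intersection size break real ties
    return links

# Hypothetical stand-in engine ranking by shared characters, for demonstration.
def toy_crossquery(sentence, target_db):
    scored = sorted(target_db, key=lambda s: -len(set(sentence) & set(s)))
    return [[s] for s in scored if set(sentence) & set(s)]

print(one_to_one_alignments(["abc", "xyz"], ["cab", "zyx"], toy_crossquery))
```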

Operating chart and example:

[Operating chart: each French sentence is used for a crosslingual querying of the English sentence database, producing a list of candidate English sentences; each candidate is used for a reverse crosslingual querying of the French sentence database, producing a list of French sentences; a comparison and check step either confirms the link (OK: move on to the next sentence) or rejects it (not OK).]

First query sentence (document 1, sentence 13): "Les connaissances et qualifications sont évaluées sur titres mais également au cours d'épreuves, en général orales."

Result 1:
- First class, containing "connaissances, qualifications, évaluées, épreuves, orales": 1 document. "Knowledge and qualifications are assessed on the basis of the qualifications which the applicant submits but also by tests, generally oral ones." (1/12)
- Second class, containing "titres, épreuves": 1 document. "Although the amendment of the Preservation of Public Security Act may bring an end to the practice of detention without trial, the Community and its Member States believe that recent events have shown that much progress is still required." (1/237)

New query sentence (1/12): "Knowledge and qualifications are assessed on the basis of the qualifications which the applicant submits but also by tests, generally oral ones."

Result 2:
- First class, containing "Knowledge, qualifications, assessed, oral": 1 document. "Les connaissances et qualifications sont évaluées sur titres mais également au cours d'épreuves, en général orales." (1/13)
- Second class, containing "basis, applicant, submits, ones": 1 document. "3. Il n'appartient pas au Conseil de se prononcer sur les critères à la base du choix opéré par un État membre concernant la sélection de candidats nationaux à soumettre au jury européen." (1/916)

Since the original French sentence (1/13) comes back in the first class of the reverse query, the 1-1 link between sentences 1/13 and 1/12 is confirmed.

Second step: 1-2 and 2-1 alignments

For 1-2 and 2-1 alignments, we try to concatenate an unaligned sentence with a preceding or following sentence that has already been aligned. If the intersection increases, then the real alignment involves the concatenated sentences. A full description of the algorithm is given in [VER2000].
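A minimal sketch of this test, assuming a hypothetical function intersection_size(fr, en) that counts the words shared (through the bilingual dictionary) between a French and an English segment:

```python
def accept_concatenation(fr, aligned_en, unaligned_en, intersection_size):
    """Extend a 1-1 link to a 1-2 link if concatenating the unaligned
    English sentence enlarges the semantic intersection."""
    base = intersection_size(fr, aligned_en)
    merged = intersection_size(fr, aligned_en + " " + unaligned_en)
    return merged > base  # improvement => the real alignment is 1-2

# Toy stand-in intersection: identical words only, for demonstration.
toy_intersection = lambda fr, en: len(set(fr.split()) & set(en.split()))
print(accept_concatenation("a b c d", "a b", "c d", toy_intersection))  # True: 2 -> 4
```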

Operating chart and example:

[Operating chart: for each non-aligned sentence, the previous and the next existing alignments are retrieved; the non-aligned sentence is concatenated with each of them and the crosslingual querying is repeated, yielding two new scores; if neither score improves on the original one, the system moves on to the next sentence; if a score improves, the link is extended to the previous or next alignment.]

Non-aligned English sentence (2/901): "In addition, it encourages access to information, makes the home more energy efficient and enables older adults and disabled persons to remain at home for longer periods (Chin, 1985; Girardin, 1991; Östlund, 1992)."

Result: first class, containing "access-information, home, older, persons, 1985, Girardin, 1991, Östlund, 1992": 1 document; the intersection covers 10 words. "La domotique facilite le travail à domicile, le télédivertissement, les télécommunications, favorise l'accès à l'information tout en augmentant l'efficacité énergétique de la maison et en maintenant les personnes âgées et les personnes handicapées à domicile plus longtemps (Chin, 1985; Girardin, 1991; Östlund, 1992)." (2/905)

Concatenated query, sentences 2/901 and 2/900: "In addition, it encourages access to information, makes the home more energy efficient and enables older adults and disabled persons to remain at home for longer periods (Chin, 1985; Girardin, 1991; Östlund, 1992). Home automation facilitates working at home and provides access to entertainment and telecommunications."

Result: first class, containing "facilitates-working, access-information, Home, access, older, persons, 1985, Girardin, 1991, Östlund, 1992": 1 document; the intersection is now of 13 words. Same French sentence (2/905).

Because the number of words in the intersection has increased, the concatenation is considered to be a better link than the original 1-to-1 link.

6. Experimental results in the framework of ARCADE

ARCADE, sponsored by AUPELF-UREF (the association of universities of French-speaking countries), is a program for the evaluation of parallel text alignment systems [LAN98]. This evaluation program is organized into two campaigns, each lasting a couple of years. Six systems participated in the first campaign, which focused on sentence alignment. We took part in the first part of the second campaign (1998-1999), which followed two tracks: a sentence alignment track and a word alignment track. Twelve systems were evaluated for the first track, five for the second. The evaluation was based on two corpora of French-English bilingual texts:

- The JOC corpus consisted of records of questions and answers on European Community matters; 200,000 words per language were used for the sentence track.

- The BAF corpus, i.e. 400,000 words per language, contained different types of texts (institutional, scientific, technical and literary): INST, a debate at the Canadian Parliament; VERNE, "Le voyage de la terre à la Lune" by Jules Verne; SCIENCE, scientific papers; TECH, a technical manual with a glossary.

The evaluation metrics were precision and recall. The F-measure, a combination of precision and recall, was also taken into account.

Precision = |Ar ∩ A| / |A|

Recall = |Ar ∩ A| / |Ar|

where A contains the alignments proposed by each system and Ar stands for the alignments taken as a reference (assumed to be correct).

F-measure = 2 * (Precision * Recall) / (Precision + Recall)

The F-measure combines recall and precision in a single measure of efficiency [VAN79]. These metrics can be calculated at different levels of "granularity", such as the sentence, the word or the character. Hence the following results.
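A minimal sketch of these metrics, with alignments represented as hashable links such as (source sentence id, target sentence id) pairs:

```python
def precision_recall_f(proposed: set, reference: set):
    """ARCADE metrics: proposed = A, reference = Ar."""
    common = proposed & reference
    precision = len(common) / len(proposed) if proposed else 0.0
    recall = len(common) / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

A = {(1, 1), (2, 2), (3, 4)}   # alignments proposed by a system
Ar = {(1, 1), (2, 2), (3, 3)}  # reference alignments
print(precision_recall_f(A, Ar))  # approximately (0.667, 0.667, 0.667)
```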

Our results: the results for the ten texts of the JOC corpus were not very different. This corpus did not pose many difficulties because the English versions are very close to the French ones. We had the following data at character level, where Min, Max and Average are the minimum, maximum and average results over all the corpus texts:

          Precision   Recall   F-measure
Min       97.28       89.85    93.52
Max       99.47       97.1     98.27
Average   98.38       93.19    95.71

Table 1: JOC results at character level

The BAF corpus was more difficult because it consisted of different types of texts and posed many problems, such as sentence inversion and missing sentences. For example, a technical documentation contained a glossary that was ordered differently in the two languages, and the English version of the novel by Jules Verne was missing many segments. Hence the following data at character level:

          Precision   Recall   F-measure
Min       95.65       3.6      6.94
Max       99.73       97.18    98.24
Average   98.10       83.15    86.58

Table 2: BAF results at character level (these results do not include the glossary of the TECH sub-corpus)

Two main conclusions could be drawn from these tests:
- Our approach is very robust in dealing with sentence order inversion and missing sentences, because the first criterion used by our algorithm for sentence alignment is not the information about sentence position but the semantic content of the sentences.
- Our system maintains its rank whatever the type of text examined.

The F-measure at character level was approximately:

Corpus                           F-measure
JOC                              98 %
INST                             95 %
VERNE                            91 %
SCIENCE                          96 %
TECH (including the glossary)    93 %

Table 3: F-measure at character level

We got the smallest variation among the 12 systems tested for the different types of texts: 91% < F-measure < 98%. We have not processed more complex alignments (1-n alignments with n > 2) so far because our final goal is to extract word translations, and this does not require us to process all the possible alignments.

8. Conclusion

This initial evaluation of common sentence alignment has confirmed the effectiveness of our approach. Future improvements will focus on the alignment algorithm and on the quality of our bilingual dictionaries. The development of technologies for word alignment is most important, and future sentence alignment may greatly benefit from the growing quality of bilingual dictionaries. Hence the following question.

A translation memory is built from aligned sentences, which make it possible to find, among the data stored in the memory, the sentence closest to the one to be translated. If crosslingual querying proves to be the best tool for aligning translated sentences, is it really necessary to store translated documents in the translation memory? It might be sufficient to store many texts in the target language and to directly collect the sentence that seems closest to the sentence to be translated. This possibility is quite interesting because monolingual texts are much more frequent than bilingual ones.

9. References

Brown, P., Lai, J. & Mercer, R.L. (1991). Aligning sentences in parallel corpora. Proc. 29th Meeting of the Association for Computational Linguistics, Berkeley, California.

Catizone, R., Russell, G. & Warwick, S. (1989). Deriving translation data from bilingual texts. Proc. 1st Lexical Acquisition Workshop, Detroit.

Debili, F., Fluhr, C. & Radasoa, P. (1988). About reformulation in full-text IRS. Conference RIAO 88, MIT, Cambridge, March 1988; revised version in Information Processing and Management, Vol. 25, No. 6, 1989, pp. 647-657.

Debili, F. & Sammouda, E. (1992). Appariement des phrases de textes bilingues français-anglais et français-arabe. Proc. 15th International Conference on Computational Linguistics (COLING 92), Nantes, France.

Elkateb, F. & Fluhr, C. (1997). EMIR at the crosslingual track of TREC-6. TREC-6 Conference, 19-21 November 1997, Gaithersburg, Maryland.

Fluhr, C., Mordini, P., Moulin, A. & Stegentritt, E. (1994). EMIR Final Report. ESPRIT project 5312, DG III, Commission of the European Union, October 1994.

Fluhr, C., Schmit, D., Elkateb, F. & Gurtner, K. (1997-1). Multilingual database and crosslingual interrogation in a real Internet application. Workshop "Cross-Language Text and Speech Retrieval", AAAI 1997 Spring Symposium Series, 24-26 March 1997, Stanford University, California.

Fluhr, C., Schmit, D., Ortet, P., Elkateb, F. & Gurtner, K. (1997-2). SPIRIT-W3, a distributed crosslingual indexing and retrieval engine. INET'97, Kuala Lumpur, June 1997.

Gale, W. & Church, K.W. (1991). A program for aligning sentences in parallel corpora. Proc. 29th Meeting of the Association for Computational Linguistics, Berkeley, California.

Gaussier, E. (1995). Modèles statistiques et patrons morphosyntaxiques pour l'extraction de lexiques bilingues. Doctoral thesis in computer science, Université Paris VII.

Grefenstette, G. et al. (1998). Cross-Language Information Retrieval. Boston: Kluwer Academic Publishers.

Kay, M. & Röscheisen, M. (1988). Text-Translation Alignment. Technical Report, Xerox Palo Alto Research Center.

Langlais, Ph., Simard, M., Véronis, J., Armstrong, S., Bonhomme, P., Débili, F., Isabelle, P., Souissi, E. & Théron, P. (1998). ARCADE: A co-operative research project on bilingual text alignment. Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada, Spain, 28-30 May 1998, pp. 289-292.

Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley. 530 p.

Simard, M., Foster, G. & Isabelle, P. (1992). Using cognates to align sentences in bilingual corpora. Proc. 4th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI 92), Montreal, Canada.

Van Rijsbergen, C.J. (1979). Information Retrieval. 2nd edition, London: Butterworths.

Véronis, J. (ed.) (2000). Parallel Text Processing. Kluwer Academic Publishers, Text, Speech and Language Technology series, to be published in 2000.