A lexical approach to text alignment using Intex

Duško Vitas, Cvetana Krstev

Abstract: This paper describes work in progress on the application of Intex to text alignment. Lexical resources incorporated in Intex, together with local grammars, are used to identify those elements in a translated text that usually represent literal translations of the original. The motivation for the experiment and the basic ideas of the algorithm are illustrated.

1. The motivation for the production of aligned resources

Compilation of Serbian parallel corpora at the Faculty of Mathematics (Belgrade) began with participation in the TELRI project and the production of the CD "East meets West - A Compendium of Multilingual Resources", which contains, among other resources, Plato's Republic aligned in 17 languages and Orwell's 1984 aligned in 8 languages, in both cases including Serbian. We have continued to collect texts, mainly for a French-Serbian parallel corpus in which French is the source and Serbian the target language. The corpus consists predominantly of literary and newspaper texts.

Aligned literary texts include, among others, Voltaire's Candide, J. Verne's Le tour du monde en quatre-vingts jours, G. Flaubert's Bouvard et Pécuchet, and P. Louÿs's La Femme et le pantin. In some cases two translations into Serbian were obtained. These texts were tagged to the sentence level and then aligned using either the Vanilla (Danielsson, 1997) or the MLAlign (Bonhomme and Romary, 1995) alignment program. In both cases the output had to be checked by hand.

A number of contemporary texts translated into Serbian are being prepared for alignment as well. These are mainly texts related to philosophy, sociology, ethnology, and the sciences. However, for contemporary texts, unlike literary classics, it was usually difficult to obtain both the original and the translation in digital form.

The newspaper corpus consists mainly of the French monthly "Le Monde Diplomatique" and its translation into Serbo-Croatian.
The acquisition of texts started in March 2001, when publication of the Serbian translation began, and both source and target texts have been collected regularly since: the source texts are downloaded from the "Le Monde Diplomatique" site, while the translation is obtained directly from either the publisher or the translator. Some of these articles have been aligned; the rest are being preprocessed for alignment.

The motivation for the construction of this kind of multilingual resource is manifold:
1. It can be used as a powerful linguistic resource, for instance for language teaching.
2. In bilingual or multilingual lexicography, aligned texts can be a source of reliable data. For the production of traditional bilingual dictionaries, aligned corpora can provide evidence of translation equivalents. Also, for the construction of semantic networks, such as BalkaNet, which is being constructed using the WordNet methodology (Miller, 1990), they can be used to check semantic relations.
3. More specifically, this kind of resource can help to solve the problem of structural derivation by redefining a Serbian entry in dictionaries of the DELAS/DELAF type or by constructing appropriate finite automata. In this case French is a kind of meta-language with which one can try to encompass the differences in translation. For instance, the analysis of the aligned texts of Candide showed that four entries in Serbian DELAS, namely (Engl. baron), (Engl. baroness), (Engl. baron's), (Engl. baroness's) correspond to one entry in French DELAS (Engl. baron), (Vitas, 2002).

2. Software for text alignment

Two basic approaches are used for automatic text alignment: statistical and structural.

A well-known program using the statistical approach is Vanilla (Danielsson, 1997). This program works with texts segmented on at most two levels. These two levels are usually interpreted as paragraphs and sentences, but can actually be marked by any other tags. The algorithm is based on the presumption that two units that correspond to each other have approximately the same number of characters. Structural tags are not taken into consideration during the alignment process; as a consequence, it is not possible to align a structurally tagged text with a raw text. The program is simple to use and its source code is freely available. However, in some cases many corrections to the aligned texts have to be made by hand. A serious drawback of the algorithm is the prerequisite that both texts must have the same number of higher-level units. The program is not supported by a concordancer, but one can be developed independently. One example of aligned units from the already mentioned novel by Verne is:

*** Link: 1 - 2 ***
En tout cas, il n'était prodigue de rien, mais non avare, car partout où il manquait un appoint pour une chose noble, utile ou généreuse, il l'apportait silencieusement et même anonymement. .EOS
U svakom slucyaju nije bio rasipnik, ali ni tvrdica. Gde god je nesxto trebalo za neku plemenitu, korisnu ili velikodusxnu stvar, on je davao cxutecxi i neopazxeno. .EOS

An XML tag is used in both the source and the target text to mark sentence elements. In this example, one source sentence is aligned with two target sentences, as indicated by the Link: 1 - 2 marker.

An example of the structural approach is the MLAlign program by Laurent Romary and Patrice Bonhomme. To use this program, the logical layout of the texts has to be XML tagged. The program maps the logical structure of the source text to the logical structure of the target text. A concordancer that supports it has been developed. The problem is that not all of the available resources are XML tagged. Our experience shows that manual XML tagging is time-consuming and biased by the human tagger, while automatic tagging is error-prone.
3. The use of lexical resources for alignment

Various ideas have already been explored in order to improve the alignment process (for instance, Chen, 1993), but none of them concentrates on selecting a subset of textual units that are, as a rule, translated literally. The idea of using lexical resources in alignment arose during our first experiments with the aligned texts of Candide. This experience showed that using Intex (Silberztein, 1993) on each of the texts in turn was more useful than using the sentence-aligned texts. The reason is simple: the concordances of both the original and the translation are usually consulted in the search for translation equivalents, and the lexical resources incorporated in Intex enable a precise specification of the search requirements for both languages.

The experience with aligned texts also suggested that certain text units are more often than not translated literally. To check this presumption, a simple experiment was undertaken using texts from the May 2001 issue of "Le Monde diplomatique", whose main topic is advertising. The initial presumption proved right for certain simple text units, such as dates and currencies. These elements can be described by graphs that are similar in the source and target languages. The distribution of some other lexical units for which simple graphs can be constructed in both languages showed similar behaviour on the same text samples. Such units are, for instance, toponyms and proper names. Besides these self-evident sets of literally translated text units, another set of this kind emerged: it contains the units that Intex identifies as "unknown words", where one usually finds trademarks, various acronyms, etc.
In order to identify their occurrences in both the original and the translated text, it is sufficient to construct a graph that recognizes the units belonging to the intersection of the sets of unknown words in the original and in its translation. Those units are not translated; rather, they are transferred to the translated text. For instance:

The excerpt from the "unknown words" in the French version of "Le Monde Diplomatique":

charisme et en évocation subversive. Benetton assimilera son nom de marque à
grandes étiquettes de disques comme BMG envoient désormais des «équipes de
roupes (Axa, la Société générale, la BNP, les AGF, les géants de la vente
nchise sur la rébellion adolescente, Body Shop disposera de la compassion,
adelphie ou Chicago, ils disent «Eh, bro [frère], regarde-moi les baskets»,
e Nike, décrivait sa conversion à le bro-ing à Harlem: «On est allés à le
rme pour désigner cette pratique: le bro-ing. Cette expression vient du fait
phrase: «The Gore Prescription Plan: Bureaucrats Decide.» Puis, sur fond
eau de vérification de la publicité (BVP), organisme émanant des annonceurs,
The corresponding excerpt from the "unknown words" in the Serbian version:

yki nastrojenoj reklamnoj industriji. Benetton se poistovecxuje sa borbom
elike diskorgrafske kucxe kao sxto je BMG sxalju danas "ulicyne ekipe"
grupe („Aksa”, „Sosiete Zxeneral”, „BNP”, „AGF” i giganti prodaje putem
, Pepsi je znak mladalacyke pobune, Body Shop saosecxanja, Reebock
nastupajucxi ovim recyima: "Hi, bro /brother/! Pogledaj ove teniske!", Nike
Aron Kuper ovako je opisao svoj bro-ing sistem u Harlemu: "Otisxli,
da je za to stvorila vlastiti termin: bro-ing. Taj izraz je nastao u
a, ecyenica: The Gore Prescription Plan: Bureaucrats Decide (Gorov plan
nagona za kupovinom. U oktobru 1999. BVP, kojim dominiraju interesi
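The intersection of unknown-word sets described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two word lists are hand-picked from the excerpts above and are not actual Intex output, the function names are hypothetical, and a regular expression stands in for the Intex graph that would recognize the shared units.

```python
import re

def shared_unknown_words(source_unknown, target_unknown):
    """Unknown words that occur verbatim in both texts: candidate anchors."""
    return set(source_unknown) & set(target_unknown)

def anchor_pattern(words):
    """Regex stand-in for a graph recognizing the shared units.

    Longer words come first so that 'bro-ing' is preferred over 'bro'.
    """
    longest_first = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in longest_first)
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")

# Hypothetical unknown-word lists, hand-picked from the two excerpts.
fr_unknown = ["Benetton", "BMG", "Axa", "BNP", "AGF", "Body", "Shop",
              "bro", "bro-ing", "Nike", "BVP"]
sr_unknown = ["Benetton", "BMG", "Aksa", "BNP", "AGF", "Body", "Shop",
              "Reebock", "bro", "bro-ing", "Nike", "BVP"]

anchors = shared_unknown_words(fr_unknown, sr_unknown)
pattern = anchor_pattern(anchors)

# Transcribed names such as Axa/Aksa drop out of the intersection.
print(sorted(anchors))
print(pattern.findall("rme pour désigner cette pratique: le bro-ing."))
```

Names that the translator transcribed (Axa versus Aksa) are excluded automatically, which is consistent with the observation that only transferred units qualify as anchors of this kind.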
The density of some of these literally translated lexical units in the original text is illustrated in Figure 1. The density of lexical elements that are literally translated shows that they can be reliable 'anchors' for the corresponding segments in the original text and its translation. In principle, using the elements recognized by Intex, it is possible to pre-edit the text for aligners of the Vanilla type (using the paragraph tags, if they exist, and the {S} tags inserted by Intex for sentences), as well as for MLAlign (XML/SGML output from the FST that recognizes the literally translated lexical units).
Figure 1. The frequency of units representing proper names in the original French sample of "Le Monde diplomatique"

On the other hand, the alignment process can be seen as a generalization of the bootstrapping method proposed in (Gross, 2000), where local grammars are developed step by step in order to cover specific meanings of keywords in concordances. Here, the bootstrapping method is applied in order to incorporate as many corresponding tags in the source and target text as necessary to cover it with anchors of appropriate density. Finally, this process of lexical recognition enables XML tagging of both the original and the translation by tags of the form: textual unit.
For instance, the same tag is inserted in the French text (33 degrés Celsius) and in the Serbian text (Trideset i tri stepena Celziusa), where the attribute value represents the "canonical" value of the recognized sequences, obtained as the output of the corresponding transducer. The information obtained from such tags is twofold: the tag name represents the type of the recognized lexical unit, while the attribute value enables the comparison of tags in the source and target text. Only tags with the same name and attribute value can be potential anchors.
Figure 2. The occurrences of all forms of the proper names Buvar and Pekiše in the Serbian translation

One more example

The existence of literal translations is less obvious in literary texts. The next experiment with lexically driven alignment was undertaken using the text of Flaubert's novel Bouvard et Pécuchet. The first step was to automatically identify the stable "points", or anchors, that connect the original and the translation in a way suitable for alignment. In Flaubert's novel those "points" were, in the first place, the names of the main heroes. However, while in the French text only the form Bouvard occurred, in the Serbian text the name of the same hero appeared in several inflected forms of the noun and of its corresponding possessive adjective:

Buvar; N: Buvar+Buvara+Buvaru+Buvarom+Buvare
Buvarov; AdjPoss: Buvarov+Buvarova+Buvarovoj+Buvarovom+Buvarovog+Buvarovu+Buvarovih+Buvarovi+Buvarovim+Buvarove+Buvarovo
A similar situation occurred with the keyword Pécuchet, as one French form, Pécuchet, corresponded to several forms of the noun Pekisxe and the possessive adjective Pekisxeov in the Serbian text:

Pekisxe; N: Pekisxe+Pekisxea+Pekisxeu+Pekisxeom
Pekisxeov; AdjPoss: Pekisxeov+Pekisxeova+Pekisxeove+Pekisxeovi+Pekisxeovo+Pekisxeovu+Pekisxeovim+Pekisxeovog+Pekisxeovoj+Pekisxeovom
However, when these correspondences were taken into consideration, the frequencies of occurrence of the names Bouvard and Pécuchet were exactly the same in the original and the translated text. Their distribution in the original and the translation showed the expected similarity, as can be seen in Figures 2 and 3.
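The mapping of all inflected forms to a single canonical tag can be sketched as follows. This is a minimal illustration, not the Intex graphs actually used: a regular expression built from the forms listed above stands in for the transducer, the {ime:...} tag shape is borrowed from Appendix II, and the function name and the untagged Serbian test sentence are hypothetical reconstructions.

```python
import re

# Forms listed in the text, keyed by the canonical value used in the tags.
FORMS = {
    "buvar": (
        "Bouvard Buvar Buvara Buvaru Buvarom Buvare "
        "Buvarov Buvarova Buvarovoj Buvarovom Buvarovog Buvarovu "
        "Buvarovih Buvarovi Buvarovim Buvarove Buvarovo"
    ).split(),
    "pekisxe": (
        "Pécuchet Pekisxe Pekisxea Pekisxeu Pekisxeom "
        "Pekisxeov Pekisxeova Pekisxeove Pekisxeovi Pekisxeovo "
        "Pekisxeovu Pekisxeovim Pekisxeovog Pekisxeovoj Pekisxeovom"
    ).split(),
}

# Reverse lookup: surface form -> canonical value.
LOOKUP = {form: key for key, forms in FORMS.items() for form in forms}

# Longest forms first, so 'Buvarovom' is not cut short at 'Buvar'.
NAME = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(f) for f in sorted(LOOKUP, key=len, reverse=True))
    + r")(?!\w)"
)

def tag_names(text):
    """Replace every listed name form by its canonical {ime:...} tag."""
    return NAME.sub(lambda m: "{ime:" + LOOKUP[m.group(0)] + "}", text)

print(tag_names("L'air sérieux de Pécuchet frappa Bouvard."))
print(tag_names("Ozbiljan Pekisxeov izgled napravi utisak na Buvara."))
```

Once both texts carry the same canonical tags, counting occurrences per tag value gives directly comparable frequencies for the original and the translation.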
Figure 3. The occurrences of Pécuchet (in the front), Bouvard (in the middle), and cumulatively Bouvard+Pécuchet (in the back) in the French text.

The diagram shows that, both in the original and in its translation, the names of the main characters cover just above 1% of the whole text. Excerpts from the concordances of the original and the translated text are given in Appendix I. These concordances were produced by using the following regular expression in the ‘Locate Pattern’ option

((+< Pécuchet>)(:< verbs of announcement >))*
and then by sorting the produced lines according to ‘Text Order’ (Silberztein, 2002). This regular expression specifies the refined subset of the third subset represented in Figure 3.

4. The basic idea of the algorithm

In order to produce aligned concordances with "literally" translated equivalents as keywords, the information in the corresponding concord.ind files can be used, which gives the starting position of every such tag with respect to the beginning of the file. The aligned unit is now not a sentence but rather a text excerpt that begins with one keyword and ends with another. The further development of the program was done in two steps:

1. The user can combine the index files of the original and the translation in several different ways. For instance, one choice is to align using only one keyword, that is, pairs of identical tags such as {ime:buvar}, or to use various keywords that the user can choose from a list. The prerequisite is, of course, that the text has been indexed with the chosen keyword(s).
2. The program that produces concordances on the basis of intervals that approximately represent the equivalent sequences in both texts functions as follows. All the occurrences that correspond to a certain pattern in both the original and the translated text are identified using the Locate Pattern facility. In the course of concordance production in one language, an independent application associates with the context of the current text (original or translation) the corresponding interval from the aligned text, under the assumption that the order of literally translated units is the same in the source and target texts.

The input data for the program, as well as the program results, are given in Appendix II. In aligned concordances two consecutive tags bound the lines, with approximately 20 characters preceding the starting tag and 20 characters following the ending tag.
This expansion outside the tag boundaries seems necessary, as translation equivalents tend to occur in the vicinity of anchors, that is, of literally translated units. Along with the concordances, the program produces raw statistics: the number of characters between tags in the source and in the target text, and their ratio.

The first results seem promising, as explicit tagging of the original and the translation becomes unnecessary. Moreover, it is possible to generate aligned concordances according to various keywords, and even to align according to sentence boundaries (the Intex {S} tag) using some additional keyword that would enable the alignment of text units shorter than a sentence.

References

Chen, S. (1993): Aligning Sentences in Bilingual Corpora Using Lexical Information, Meeting of the ACL, 9-16
Danielsson, P./Ridings, D. (1997): Practical Presentation of a "Vanilla" Aligner, Presentation held at the TELRI Workshop on Alignment and Exploitation of Texts, Ljubljana, February 1-2

Gross, M. (2000): A Bootstrap Method for Constructing Local Grammars. In: Bokan, N. (ed.): Proceedings of the Symposium Contemporary Mathematics, Faculty of Mathematics, Belgrade

Vitas, D., Krstev, C. (2002): Structural derivation and meaning extraction: a comparative study on French-Serbo-Croatian parallel texts. In: G. Barnbrook, P. Danielsson, M. Mahlberg (Eds.): Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora. Birmingham, The University of Birmingham Press, (to be published)

Miller, G., Beckwith, R., Fellbaum, Ch., Gross, D., Miller, K. (1990): Introduction to WordNet: An on-line lexical database, International Journal of Lexicography, 3(4):235-244

Bonhomme, P. and L. Romary (1995): The Lingua Parallel Concordancing Project: Managing Multilingual Texts for Educational Purpose. In: Proceedings of Language Engineering, Montpellier, June 26-30, 1995.

Silberztein, M. (1993): Dictionnaires électroniques et analyse automatique de textes (le système INTEX), Paris: Masson

Silberztein, M. (2002): Intex (Manuel) (http://grelis.univfcomte.fr/intex/downloads/Notes.pdf)

Appendix I

Concordances of French text with pattern ((+< Pécuchet>)(:< verbs of announcement >))*
Moi je suis veuf
dit Bouvard et sans enfants !
:1ms:1fs} en avais l'idée ! reprit Pécuchet mais je ne osais pas vous papiers s'envoleraient !
s'écria Pécuchet qui redoutait, en plus, les e la toiture.
Bouvard lui dit :A votre place, {j',je.P g>Faites-moi la conduite
reprit Bouvard l'air extérieur vous /p> Mon oncle !
dit Bouvard, et le flambeau que il tenait l'escalier.
Pécuchet descendit les marches sans répondre /seg>
Elle !
dit Pécuchet, en désignant sa testable.
Bouvard et Pécuchet reprirent ensemble prit le pas gymnastique ; et il disait à Bouvard courant du même train à son PV:ms} est de votre faute ! reprit Bouvard.
Il il est beau, l'amusement ! reprit Pécuchet qui venait de
Très bien !
dit Bouvard on a du temps devant soi
Concordances of Serbian text with the same pattern
A ja sam udovac
recye Buvar i bez dece!
g>Na to sam pomisxlxao!
odgovori Pekisxe ali se nisam usudxivao > Hartije cxe se razleteti!
uzviknu Pekisxe, koji se josx uz to bojao o d sxkrilxaca.
Buvar mu recye: Da sam na vasxem eg>
Ispratite me
nastavi Buvar napolxu cxe vas vazduh >
Moj stric!
recye Buvar, i svecxa koju je drzxao osvetli podruglxivo.
Pekisxe se ne mogade uzdrzxati a da ne kazxe >
NXu!
recye Pekisxe, pokazujucxi na grudi. eosporno.
Buvar i Pekisxe ponovisxe zajedno: Oh, na sav glas. Najzad Buvar izjavi da ne zxeli da obnovi zakup. To je vasxa krivica
odgovori Buvar.
Padao je u eg>Da, basx lepa zabava!
odvrati Pekisxe, koji ga je cyuo.
>
Vrlo dobro
recye Buvar imamo vremena pred sobom.
Appendix II

Excerpt from French text with tags inserted for names Bouvard and Pécuchet

{S}Pour s'essuyer le front, ils retirèrent leurs coiffures, que chacun posa près de soi ;{S} et le petit homme aperçut écrit dans le chapeau de son voisin : {ime:buvar} ;{S} pendant que celui-ci distinguait aisément dans la casquette du particulier en redingote le mot : {ime:pekisxe}. {S}-- "Tiens !" dit-il "nous avons eu la même idée, celle d'inscrire notre nom dans nos couvre-chefs." {S}-- "Mon Dieu, oui ! on pourrait prendre le mien à mon bureau !" {S}-- "C'est comme moi, je suis employé." {S}Alors ils se considérèrent. {S}L'aspect aimable de {ime:buvar} charma de suite {ime:pekisxe}. {S}Ses yeux bleuâtres, toujours entreclos, souriaient dans son visage colore.{S} Un pantalon à grand-pont, qui godait par le bas sur des souliers de castor, moulait son ventre, faisait bouffer sa chemise à la ceinture ;{S} -- et ses cheveux blonds, frisés d'eux-mêmes en boucles légères, lui donnaient quelque chose d'enfantin. {S}Il poussait du bout des lèvres une espèce de sifflement continu. {S}L'air sérieux de {ime:pekisxe} frappa {ime:buvar}.
Excerpt from Serbian text with tags inserted for names Buvar and Pekisxe

{S}Da bi obrisali cyela, skidosxe svoje kape i spustisxe ih pokraj sebe;{S} onaj manji primeti, upisano u sxesxiru svog suseda: {ime:buvar};{S} drugi pak lako procyita u kacyketu cyoveka u relengotu recy: {ime:pekisxe}. {S}Gle, recye pala nam je na um ista misao da napisxemo svoja imena u nasxim kapama. {S}Eh, bozxe, naravno, mogao bi mi je ko uzeti u kancelariji! {S}Kao i meni, ja sam cyinovnik. {S}Tada se odmerisxe. {S}Ljubak {ime:buvar} izgled namah ocyara {ime:pekisxe}.
{S}Njegove plavicyaste ocyi, uvek poluzatvorene, smesxile su se na njegovom rumenom licu. {S}Pantalone na preklop, koje su se malo pri dnu sxirile nad cipelama od dabrovine, ocrtavale su mu trbuh i nabirale kosxulju u struku;{S} a njegova plava kosa, koja se prirodno meko kovrdyala, davala mu je necyeg detinjastog. {S}Krajicykom usana izvodio je neku vrstu neprekidnog zvizxduka. {S}Ozbiljan {ime:pekisxe} izgled napravi utisak na {ime:buvar}.
Aligned French/Serbian concordances for the names Bouvard and Pécuchet

Each record gives the French interval and its length in characters, the corresponding Serbian interval and its length, and then the difference and the ratio of the two lengths.

French (128): eau de son voisin : {ime:buvar} ;{S} pendant que celui-ci distinguait aisément dans la casquette du particulier en redingote le mot : {ime:pekisxe}. {S}-- "Tiens !
Serbian (90): xiru svog suseda: {ime:buvar};{S} drugi pak lako procyita u kacyketu cyoveka u relengotu recy: {ime:pekisxe}. {S}Gle, recye
Difference: 38, ratio: 1.4222

French (304): redingote le mot : {ime:pekisxe}. {S}-- "Tiens !" dit-il "nous avons eu la même idée, celle d'inscrire notre nom dans nos couvre-chefs." {S}-- "Mon Dieu, oui ! on pourrait prendre le mien à mon bureau !" {S}-- "C'est comme moi, je suis employé." {S}Alors ils se considérèrent. {S}L'aspect aimable de {ime:buvar} charma de suite {i
Serbian (258): a u relengotu recy: {ime:pekisxe}. {S}Gle, recye pala nam je na um ista misao da napisxemo svoja imena u nasxim kapama. {S}Eh, bozxe, naravno, mogao bi mi je ko uzeti u kancelariji! {S}Kao i meni, ja sam cyinovnik. {S}Tada se odmerisxe. {S}Ljubak {ime:buvar} izgled namah ocyara
Difference: 46, ratio: 1.1782

French (40): L'aspect aimable de {ime:buvar} charma de suite {ime:pekisxe}. {S}Ses yeux bl
Serbian (44): risxe. {S}Ljubak {ime:buvar} izgled namah ocyara {ime:pekisxe}. {S}Njegove plav
Difference: -4, ratio: 0.9090

French (455): ar} charma de suite {ime:pekisxe}. {S}Ses yeux bleuâtres, toujours entreclos, souriaient dans son visage colore.{S} Un pantalon à grand-pont, qui godait par le bas sur des souliers de castor, moulait son ventre, faisait bouffer sa chemise à la ceinture ;{S} -- et ses cheveux blonds, frisés d'eux-mêmes en boucles légères, lui donnaient quelque chose d'enfantin. {S}Il poussait du bout des lèvres une espèce de sifflement continu. {S}L'air sérieux de {ime:pekisxe} frappa {ime:buvar
Serbian (435): izgled namah ocyara {ime:pekisxe}. {S}Njegove plavicyaste ocyi, uvek poluzatvorene, smesxile su se na njegovom rumenom licu. {S}Pantalone na preklop, koje su se malo pri dnu sxirile nad cipelama od dabrovine, ocrtavale su mu trbuh i nabirale kosxulju u struku;{S} a njegova plava kosa, koja se prirodno meko kovrdyala, davala mu je necyeg detinjastog. {S}Krajicykom usana izvodio je neku vrstu neprekidnog zvizxduka. {S}Ozbiljan {ime:pekisxe} izgled napravi utis
Difference: 20, ratio: 1.0459
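The way such aligned concordance records could be derived from two tagged texts is sketched below in Python. This is a hedged illustration, not the authors' program: tag positions are found with a regular expression rather than read from the concord.ind files, the function names are hypothetical, and the length of an interval is assumed to run from the start of one tag to the end of the next, since the exact counting convention is not stated in the text. The i-th pair of consecutive tags in the source is matched with the i-th pair in the target, following the assumption that anchors occur in the same order in both texts.

```python
import re

TAG = re.compile(r"\{ime:\w+\}")  # tag shape taken from Appendix II

def tag_positions(text):
    """Return (tag, start, end) for every tag, in text order."""
    return [(m.group(0), m.start(), m.end()) for m in TAG.finditer(text)]

def aligned_concordance(source, target, context=20):
    """Pair intervals bounded by consecutive tags in source and target."""
    src, tgt = tag_positions(source), tag_positions(target)
    rows = []
    for i in range(min(len(src), len(tgt)) - 1):
        (s_tag, s_start, _), (s_next, _, s_end) = src[i], src[i + 1]
        (t_tag, t_start, _), (t_next, _, t_end) = tgt[i], tgt[i + 1]
        if (s_tag, s_next) != (t_tag, t_next):
            # The same-order assumption does not hold at this point.
            raise ValueError(f"anchor order differs at tag pair {i}")
        s_len, t_len = s_end - s_start, t_end - t_start
        rows.append({
            # Widen each interval by `context` characters on both sides.
            "source": source[max(0, s_start - context):s_end + context],
            "target": target[max(0, t_start - context):t_end + context],
            "source_len": s_len,
            "target_len": t_len,
            "difference": s_len - t_len,
            "ratio": round(s_len / t_len, 4),
        })
    return rows

fr = ("le chapeau de son voisin : {ime:buvar} ;{S} pendant que celui-ci "
      "distinguait aisément dans la casquette du particulier en redingote "
      "le mot : {ime:pekisxe}. {S}L'aspect aimable de {ime:buvar} charma")
sr = ("u sxesxiru svog suseda: {ime:buvar};{S} drugi pak lako procyita u "
      "kacyketu cyoveka u relengotu recy: {ime:pekisxe}. {S}Ljubak "
      "{ime:buvar} izgled namah")

for row in aligned_concordance(fr, sr):
    print(row["source_len"], row["target_len"], row["difference"], row["ratio"])
```

Raising an error on the first mismatching tag pair is a deliberately strict choice for the sketch; a practical tool would more likely skip or re-synchronize at that point.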
Dusko Vitas
Assistant professor
Faculty of Mathematics
Studentski trg 16
YU-11000 Belgrade
e-mail: [email protected]

Cvetana Krstev
Assistant professor
Faculty of Philology
Studentski trg 3
YU-11000 Belgrade
e-mail: [email protected]