A TEI Encoding of Aligned Corpora as Translation Memories

Tomaž Erjavec

Department for Intelligent Systems, Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
tomaz.erjavec@ijs.si

Abstract

Sentence-level aligned parallel corpora are a basic language resource for multilingual research. TEI proposes several ways in which to encode such corpora, but all separate the alignment information from the texts. This article proposes an alternative TEI-conformant in-place encoding of aligned texts, which is similar to that employed for translation memories. For research corpora targeted towards language engineering and terminology/translation studies, we argue that such an encoding can aid the process of corpus acquisition, processing and exploitation.

1 Introduction

Sentence-level aligned parallel corpora are a basic language resource for multilingual research. A growing number of parallel corpora are being produced, recent ones including MLCC (Armstrong et al., 1998), Crater (McEnery et al., 1997), ENPC (Johansson et al., 1996), and MULTEXT-East (Dimitrova et al., 1998). We are involved in a project to produce a Slovene-English sentence-level aligned parallel corpus of one million words. The corpus is meant as a standardised Slovene dataset for, on the one hand, language engineering research and, on the other, linguistically oriented translation and terminology studies [1].

One of the guidelines of the project was to use the Text Encoding Initiative Guidelines (Sperberg-McQueen and Burnard, 1994), so-called TEI P3, for the annotation of our corpus, as do most of the other bilingual corpus projects. Some make use of a proper instantiation of the TEI, whereas others only take TEI as the basis for producing their own definition of the structure, the Document Type Definition (DTD), either in SGML (Goldfarb, 1990), as does TEI, or, more and more often, in the more recent XML (W3C, 1998).

Recently, tools for translators, especially translation memory software, have become successful commercial products, e.g., the Translator's Workbench [2] or Déjà Vu [3]. Translation memory software stores aligned segments, usually sentences, of previous bi-texts. When presented with a new original, it compares its sentences with those stored in the translation memory and offers their translations to the translator for (edited) inclusion. The translation memories with which such tools work are produced semi-automatically via an interactive process of segmentation and alignment. They thus closely resemble classical aligned corpora.

In the project we tried to make the production, and the expected distribution and usage, of the corpus cost-effective and simple. To this end, conversion from original document formats, segmentation, alignment and hand-validation were performed by three collaborators of the project on home PCs, in two cases with Déjà Vu, which offers an interactive alignment environment, and in one with Unix scripts. The documents produced were, in turn, converted to our interpretation of TEI. Our workflow differs from the one usually associated with similar projects: the effective input for conversion into the TEI encoding (except header meta-information) were simple tab-delimited line-parallel files, without any markup. It is on such a simple source format that our encoding is based.

The article is structured as follows: Section 2 gives an overview of current recommendations of the TEI for encoding parallel corpora and discusses the related Corpus Encoding Standard, CES (Ide, 1998). Section 3 introduces the Translation Memory Exchange standardisation effort (Melby, 1998) and contrasts it with TEI. Section 4 turns to the document structure used in our corpus and how it is instantiated in TEI P3. Section 5 summarizes and argues for the availability advantages of using our proposal instead of the TEI recommendations on encoding parallel corpora.

[1] For further information on the corpus see http://nl.ijs.si/elan/.
[2] http://www.trados.com
[3] http://www.atril.com
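The lookup step of translation memory software described in this section (comparing a new sentence with the stored source segments and offering the stored translation) can be sketched in Python as follows. This is only a minimal illustration, not a description of how Translator's Workbench or Déjà Vu work internally: the similarity measure (the ratio computed by the standard `difflib` module), the threshold of 0.7 and the function name `suggest` are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def suggest(memory, sentence, threshold=0.7):
    """Return (stored_source, stored_translation, score) for the best
    fuzzy match at or above the threshold, or None if there is none.

    `memory` is a list of (source, translation) pairs; the threshold
    value is an arbitrary assumption, not taken from any real product.
    """
    best = None
    for source, target in memory:
        # Character-based similarity in [0, 1]; case is ignored.
        score = SequenceMatcher(None, source.lower(), sentence.lower()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (source, target, score)
    return best


# A one-entry memory built from the translation unit quoted in this paper.
memory = [
    ("Slovenia is a territorially indivisible state.",
     "Slovenija je ozemeljsko enotna in nedeljiva država."),
]

# A slightly different new sentence; a suggestion is printed only if its
# similarity to the stored source reaches the threshold.
print(suggest(memory, "Slovenia is an indivisible state."))
```

The sketch scans the whole memory linearly, which is adequate for illustration; a practical tool would presumably index the stored segments.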

2 Stand-off markup of alignment

The TEI P3 book discusses alignment of parallel texts in multilingual corpora in section 14.4.2 and offers four different methods of encoding. The first choice is whether the objects to align are points or intervals, and the second whether the alignment itself is encoded as cross-references in the segmentation markup or in a free-standing linkage element. The basic assumption for TEI parallel corpora is thus that the integrity of the two (or more) text documents is retained, as alignment is encoded on the meta-level.

A closely related view is held by the Corpus Encoding Standard (Ide, 1998), a proposal that has already been used for encoding several multilingual aligned corpora, e.g., the Bible (Resnik et al., 1998) and the MULTEXT-East aligned corpus (Erjavec and Ide, 1998). The CES goes one step further than the TEI recommendations: the original documents, the so-called primary data, are left completely unmodified by the alignment which, possibly with the segmentation, is held in a separate SGML alignment document. This document can contain a header, which is followed by the element listing pointers into the primary data. To illustrate with an example from the MULTEXT-East English-Slovene '1984':
Prevodna enota
Translation unit
...
English
Slovene
Slovene original - English translation
English original - Slovene translation
...

The body of a bi-text is composed of translation units, <tu>s, which, in our case, have exactly two segments. The segment elements, <seg>, are taken directly from TEI.analysis, and allow significant segment-level markup. In particular, our corpus is tokenised, with extensions planned for including part-of-speech tags and lemmas of words as attributes of the word-level elements. Below we give an example of a translation unit from the corpus:

Slovenija je ozemeljsko enotna in nedeljiva država.
Slovenia is a territorially indivisible state.
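The conversion from tab-delimited line-parallel files into translation units of this shape might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the conversion actually used in the project: the identifier scheme (`tu.1`, `tu.2`, ...), the language codes `sl` and `en`, the column order and the function name `lines_to_tus` are hypothetical, and no tokenisation markup is produced.

```python
from xml.sax.saxutils import escape, quoteattr


def lines_to_tus(lines, langs=("sl", "en"), prefix="tu"):
    """Turn tab-delimited aligned lines into <tu> elements, each holding
    exactly two <seg> children.

    The id scheme and the language codes are assumptions made for this
    example; they are not prescribed by the encoding described here.
    """
    units = []
    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {n}: expected 2 tab-separated segments, got {len(fields)}"
            )
        # Escape &, < and > so that the segment text cannot break the markup.
        segs = "".join(
            f"<seg lang={quoteattr(lang)}>{escape(text.strip())}</seg>"
            for lang, text in zip(langs, fields)
        )
        tu_id = f"{prefix}.{n}"
        units.append(f"<tu id={quoteattr(tu_id)}>{segs}</tu>")
    return units


sample = [
    "Slovenija je ozemeljsko enotna in nedeljiva država."
    "\tSlovenia is a territorially indivisible state.\n"
]
for unit in lines_to_tus(sample):
    print(unit)
```

Rejecting lines that do not contain exactly two fields is a deliberate choice in this sketch: it surfaces alignment errors in the source file instead of silently producing malformed translation units.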

As can be seen, the structure of the aligned corpus is extremely simple. This makes it suitable for direct processing with limited tools or computer expertise. The DTD for the above encoding was produced via a parametrisation of the TEI. The TEI 'Chicago Pizza Model' allows the construction of a particular TEI SGML DTD by a) choosing one base tagset, b) adding additional tagsets and c) defining local extensions. The following SGML prolog implements our DTD:

]>

The two extension files are quite short. The entity extension IGNOREs the standard definition of the TEI body element, while the DTD extension redefines it to be composed of translation units only; each translation unit contains two segments and the standard global attributes (in particular the ID id and the IDREF lang). The segment elements are defined in the TEI.analysis module. To illustrate:

teitmx.ent:
<!ENTITY % body 'IGNORE'>

teitmx.dtd:
<!ELEMENT body - - (tu+)>
<!ELEMENT tu - - (seg, seg)>
<!ATTLIST tu %a.global;>

If it is felt necessary, the above definition can easily be expanded to contain more information about the translation unit in question (terms, revision description), making it even more similar to classical translation memories. The main difference from these remains that each bi-text still retains its header.

For the above to work on an SGML-conformant system, the TEI distribution is also needed. Because many SGML tools have problems coping with the SGML complexity used in TEI, we have also made a one-file DTD describing the same markup. This DTD is used for local processing and will be included with the distribution. The DTD is produced automatically from the above parametrisation via the on-line 'Pizza Chef' service [4]. We are also considering implementing a strict version of the one-file DTD, which would allow only the elements and nestings expected in our corpus. It would thus be much more prescriptive than the TEI-derived DTD and could serve as a validation aid.
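The strict DTD itself is not given here, but the kind of check it would enforce can be illustrated with a short Python sketch. The sketch assumes that the body of a bi-text is available as well-formed XML (rather than SGML) and that translation units and segments are named `tu` and `seg` and carry an `id` attribute as described above; the function name `check_body` and the wording of the messages are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_body(xml_text):
    """Return a list of problems found in a <body> that is expected to
    contain only <tu> elements, each with exactly two <seg> children.

    This mimics a prescriptive content model (tu+ and seg, seg) in code;
    it is an illustration, not a replacement for DTD validation.
    """
    problems = []
    seen_ids = set()
    body = ET.fromstring(xml_text)
    if body.tag != "body":
        problems.append(f"root element is <{body.tag}>, expected <body>")
    if len(body) == 0:
        problems.append("body contains no <tu>")
    for pos, child in enumerate(body, start=1):
        if child.tag != "tu":
            problems.append(f"child {pos}: unexpected element <{child.tag}>")
            continue
        tags = [c.tag for c in child]
        if tags != ["seg", "seg"]:
            problems.append(f"child {pos}: expected exactly two <seg>, found {tags}")
        # ID values must be unique within a document.
        tu_id = child.get("id")
        if tu_id is not None:
            if tu_id in seen_ids:
                problems.append(f"child {pos}: duplicate id {tu_id!r}")
            seen_ids.add(tu_id)
    return problems


good = '<body><tu id="tu.1"><seg lang="sl">a</seg><seg lang="en">b</seg></tu></body>'
for problem in check_body(good):
    print(problem)
```

Returning a list of problems rather than stopping at the first one suits hand-validation, where a collaborator would want to see every faulty translation unit in a file at once.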

5 Conclusions

The paper presented an instantiation of TEI which can be used for encoding multilingual parallel corpora. While the corpus is TEI-conformant, we have not implemented the TEI/CES suggestions for encoding parallel aligned corpora, but rather chose an encoding closer to that of translation memories. In particular, the alignment is not separated from the primary data, but is in-place. The advantages are greater simplicity of incorporating translation memories into the corpus and easier usability of the corpus.

There is, unfortunately, another and maybe stronger argument for using the in-place encoding instead of the stand-off one, especially if the corpus is to be distributed. Most corpus projects report that a substantial amount of time was spent on securing appropriate permission from copyright holders of the original texts. In our experience, the greatest reluctance of publishers to sign the agreement comes from the fear that their digital copy could be used for making pirate copies of the complete texts. It does much to assuage this fear to state that the digital originals which they provide will be held in confidence and, after processing, destroyed, and that any use of the corpus will be only over its words, phrases and sentences. The encoding presented here goes a long way towards this guarantee, as the texts have no original markup left (say, paragraphs, titles, etc.) and would thus require a substantial amount of effort to recreate in their entirety. However, the aligned corpus is still usable for translation studies.

[4] http://firth.natcorp.ox.ac.uk/TEI/pizza.html

6 Acknowledgements

I would like to thank Špela Vintar and two anonymous reviewers for helpful comments and suggestions on this paper; all errors of course remain my own. The corpus whose encoding was presented here would not have been possible without the contributions of Roman Maurer, Andrej Skubic, and Špela Vintar, who acquired and aligned the majority of the source texts. The work on the corpus was in part supported by a subcontract to the EU ELAN MLIS 121 project.

References

Lars Ahrenberg, Magnus Merkel, Daniel Ridings, Anna Sågvall Hein, and Jörg Tiedemann. 1999. Automatic processing of parallel corpora: A Swedish perspective. http://numerus.ling.uu.se/corpora/plug/.

Susan Armstrong, Masja Kempen, David McKelvie, Dominic Petitpierre, Reinhardt Rapp, and Henry Thompson. 1998. Multilingual corpora for cooperation. In First International Conference on Language Resources and Evaluation, LREC'98, pages 579-980, Granada. ELRA.

Ludmila Dimitrova, Tomaž Erjavec, Nancy Ide, Heiki-Jan Kaalep, Vladimír Petkevič, and Dan Tufiş. 1998. Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages. In COLING-ACL '98, pages 315-319, Montreal, Quebec, Canada.

Tomaž Erjavec and Nancy Ide. 1998. The MULTEXT-East corpus. In First International Conference on Language Resources and Evaluation, LREC'98, pages 971-974, Granada. ELRA.

Charles F. Goldfarb. 1990. The SGML Handbook. Clarendon Press, Oxford.

Nancy Ide. 1998. Corpus Encoding Standard: SGML guidelines for encoding linguistic corpora. In First International Conference on Language Resources and Evaluation, LREC'98, pages 463-470, Granada. ELRA. http://www.cs.vassar.edu/CES/.

Stig Johansson, Jarle Ebeling, and Knut Hofland. 1996. Coding and aligning the English-Norwegian parallel corpus. In K. Aijmer, B. Altenberg, and M. Johansson, editors, Languages in Contrast, pages 87-112. Lund University Press. http://www.hit.uib.no/enpc/.

Tony McEnery, Andrew Wilson, Fernando Sánchez-León, and Amalio Nieto-Serrano. 1997. Multilingual Resources in European Languages: Contributions of the CRATER Project. Literary and Linguistic Computing, 12(4).

Alan Melby. 1998. Data exchange standards from the OSCAR and MARTIF projects. In First International Conference on Language Resources and Evaluation, LREC'98, pages 3-8, Granada. ELRA. http://www.lisa.unige.ch/tmx/.

Philip Resnik, Mari Broman Olsen, and Mona Diab. 1998. Creating a Parallel Corpus from the "Book of 2000 Tongues". In Proceedings of the Text Encoding Initiative Tenth Anniversary User Conference, Brown University, Providence, Rhode Island. http://www.stg.brown.edu/webs/tei10/.

C. M. Sperberg-McQueen and Lou Burnard, editors. 1994. Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford.

Henry Thompson and David McKelvie. 1997. Hyperlink semantics for standoff markup of read-only documents. In SGML Europe'97. http://www.ltg.ed.ac.uk/ht/sgmleu97.html.

Jörg Tiedemann. 1998. Parallel corpora in Linköping, Uppsala and Göteborg (PLUG). Work package 1, Department of Linguistics, Uppsala University. http://numerus.ling.uu.se/corpora/plug/.

W3C. 1998. Extensible markup language (XML). http://www.w3.org/XML/.