Encoding a parallel corpus: The TRIS corpus experience
Carla Parra Escartín
University of Bergen
Abstract
This paper focuses on one of the many aspects to be taken into account when developing a new corpus: its encoding. During the compilation of the corpus of Technical Regulations Information System (the TRIS corpus) several encoding issues arose. In this paper the author discusses the possibilities available with regard to encoding as well as the decisions taken and the strategies followed. The author discusses standards for character encoding and corpus markup and explains how these were integrated in the compilation of the TRIS corpus.
Keywords: corpus planning, parallel corpora compilation, corpus encoding, standardization
* Principal contact: Carla Parra Escartín, Marie Curie Early Stage Researcher, Language Models and Resources Research Group (LaMoRe), Department of Linguistic, Literary and Aesthetic Studies (LLE), University of Bergen, HF-bygget, Sydnesplassen 7, N-5007 Bergen, Norway. Tel.: +47 55 58 89 45. E-mail:
[email protected]
C. Parra Escartín
1. Introduction
This paper discusses several issues related to corpus encoding and the use of available encoding standards applicable to the compilation of corpora. To illustrate them, the compilation process of the corpus of Technical Regulations Information System (in what follows, the TRIS corpus) is used. The TRIS corpus is being compiled for a larger project which investigates the translational correspondences between German nominal compounds and their Spanish phraseological counterparts. Details about its compilation process and its main characteristics can be found in Parra Escartín (2012).
According to the Collins Cobuild online dictionary¹, encoding in computing is “the action of converting (characters and symbols) into a digital form as a series of impulses”. The Tech Terms Computer Dictionary² refers to it as “the process of converting data from one form to another” and specifies that “there are several types of encoding, including image encoding, audio and video encoding, and character encoding”. Thus, when we refer to the encoding of a corpus we may be referring to different aspects and even different kinds of encoding. My experience in compiling the TRIS parallel corpus has made me aware of this fact.
This paper aims to discuss the role that encoding plays at each stage of the corpus compilation process. The remainder of the paper is divided into sections which follow what could be considered the logical progression of a corpus compilation. At each phase, the problems and challenges faced are explained and discussed, together with the strategies adopted and the decisions taken. In the next section (Section 2), I first explain the role of encoding within the compilation of a corpus. Section 3 focuses on the importance of character encoding and its role in corpora, and Section 4 is devoted to the different types of markup that we may choose for a corpus.
2. The corpus encoding workflow
To understand the role of encoding in the compilation of a corpus, it is important to see at which stages it comes into play. If we take into account the definitions given in Section 1, the very first phase of the compilation process already implies several changes in the encoding of the files included in the corpus. In the case of the TRIS corpus, the files were automatically retrieved from the Database of the DG Enterprise and Industry Project³ of the European Commission by means of a crawler (a computer program capable of performing recursive searches)⁴. Corrupted files and files in outdated formats that are no longer available were discarded, and every remaining file was classified according to its original format. MS Word files were stored directly for later verification, while PDF files underwent a further process. PDF “text” files were automatically converted to MS Word, while PDF “scanned image” files were processed with ABBYY FineReader, an Optical Character Recognition (OCR) program, and converted to MS Word. Finally, all MS Word files were proofread and verified manually to ensure that no conversion problems had arisen. Figure 1 below illustrates the process that every crawled file underwent prior to being aligned.
1 http://www.collinsdictionary.com/dictionary/english/encoding 2 http://www.techterms.com/definition/encoding
3 http://ec.europa.eu/enterprise/tris/index_en.htm 4 For details please see Parra Escartín (2012).
Figure 1: File selection and conversion process prior to alignment
After all files were considered ready, the German and Spanish file pairs were also verified, and their formatting was checked to ensure that it matched and would not cause problems at the alignment stage. In the next phase, which is still in progress, the MS Word files are aligned using SDL Trados WinAlign, a proprietary program within the suite of the Computer Assisted Translation tool (CAT tool) SDL Trados Studio 2009⁵. WinAlign automatically converts the files to RTF (Rich Text Format), and once the alignment has been manually verified and confirmed it can be exported as a translation memory in the SDL Trados proprietary format or in the de facto standard format TMX (Translation Memory eXchange)⁶. In the case of TRIS, the translation memories corresponding to each individual file are exported in the SDL Trados proprietary format; then they are merged and converted to TMX format; and finally they are converted to TEI P5 format. Simple plain text documents with one sentence per line are also created from the TMX files. These files will subsequently be Part-of-Speech (POS) tagged. Figure 2 illustrates how the original MS Word files are transformed into different formats at the different stages of the corpus compilation.
Figure 2: Different file encoding stages during the corpus compilation process
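The step from TMX to plain text with one sentence per line can be illustrated with a short sketch. It is not the script used for the TRIS corpus; the function names, the sample document and the output file naming are assumptions made for illustration, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Fully qualified name under which ElementTree exposes the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_sentences(tmx_text):
    """Return {language code: [segment text, ...]} in translation-unit order."""
    root = ET.fromstring(tmx_text)
    sentences = {}
    for tu in root.iter("tu"):
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG)
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline markup.
            text = "".join(seg.itertext()).strip() if seg is not None else ""
            sentences.setdefault(lang, []).append(text)
    return sentences


def write_plain_text(sentences, prefix):
    """Write one UTF-8 file per language, one sentence per line (hypothetical naming)."""
    for lang, lines in sentences.items():
        with open(f"{prefix}.{lang}.txt", "w", encoding="utf-8") as out:
            out.write("\n".join(lines) + "\n")


# Invented two-sentence sample, not taken from the TRIS corpus.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="DE-AT" segtype="sentence"/>
  <body>
    <tu tuid="demo-1">
      <tuv xml:lang="DE-AT"><seg>Erster Satz.</seg></tuv>
      <tuv xml:lang="ES-ES"><seg>Primera frase.</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    print(tmx_to_sentences(SAMPLE_TMX))
```

Because both language files are written in translation-unit order, line n of the German file corresponds to line n of the Spanish file, which is what most generic alignment-aware tools expect.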
5 http://www.sdl.com/products/sdl-trados-studio/
6 The TMX format is explained in Section 4.2.
Finally, it is also worth mentioning that the corpus is to be released in different encoding formats to facilitate its reusability in other research projects. Specifically, the corpus will be released in plain text, POS-tagged text, TMX and TEI P5. This choice rests on several reasons. First of all, and as argued by Wynne (2005), it is important to avoid proprietary formats. As he points out:
If your corpus is made up of files in a format for a commercial wordprocessing program, such as Microsoft Word, then they cannot be processed by most corpus analysis tools. What is more, the format may not be supported indefinitely into the future, and there will come a time when users won’t be able to read the files any more.
Wynne (2005) goes on to argue that encoding a corpus in XML is usually a good choice, since it is not only appropriate for long-term preservation but also ensures the use of Unicode for encoding the text. The TMX and TEI P5 encoding formats are in fact XML markup formats, as we shall see in Section 4. The other two formats in which the corpus is released have been chosen to satisfy the needs of the research project in which the TRIS corpus will first be used. Generic tools often require “raw text” or plain text files to work, and thus I had to produce them for my own research. Additionally, I needed POS-tagged files to run experiments, more specifically files in the TreeTagger⁷ format. Providing these two additional formats along with the other two standard formats will enable the reuse of the corpus without requiring prior conversion.
3. Character Encoding: The minimal kind of encoding but yet a critical one
Character encoding may be considered the minimal kind of encoding. However, it is crucial, as it determines whether or not a text is displayed correctly on a user’s computer. McEnery and Xiao (2005) offer an extensive and clear overview of the importance of character encoding for corpus construction as well as of its historical evolution. As they point out, “character encoding in a corpus must be consistent if the corpus is to be searched reliably”. In fact, something that may seem as simple as character encoding is not trivial. During the compilation of the TRIS corpus several encoding problems arose when manipulating the files in the corpus. This is something that McEnery and Xiao (2005) also mention: “In many cases, however, multiple and often competing encoding systems complicate corpus building, providing a real problem”. Many efforts have been made over time to ensure readability and interoperability of character encoding across operating systems. The Unicode standard is the result of these common efforts and is nowadays widely used in cross-platform applications. It includes three encoding forms: UTF-8, UTF-16 and UTF-32 (Unicode Transformation Format 8 bits, 16 bits and 32 bits respectively). One of its main strengths is its backward compatibility with ASCII (McEnery and Xiao, 2005): the first 128 Unicode code points coincide with ASCII, so that a pure ASCII file is also a valid UTF-8 file. Sasaki (2010) explains the differences between the three encoding forms:
The most widely used encoding form is UTF-8. If the multilingual corpus contains only Latin based textual data, UTF-8 will lead to a small corpus size, since this data can be represented mostly with sequences of single bytes. If corpus size and bandwidth are no issues, UTF-32 can be used. However, especially for web based corpora, UTF-32 will slow down data access. UTF-16 is for environments which need both efficient access to characters and economical use of storage. Finally, the aspect that an XML processor must be able to process “only” UTF-8 and UTF-16, and not necessarily other encoding forms, should be taken into account when deciding about the appropriate encoding form.
7 The TreeTagger is a tool for annotating text with part-of-speech and lemma information developed at the Institute for Computational Linguistics of the University of Stuttgart. More information can be found at its website: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.
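The size trade-off described in the quotation above can be checked with a small sketch. The two sample words are merely illustrative of Latin-based German and Spanish text, and the function name is an assumption.

```python
# Illustrative words containing one non-ASCII letter each.
SAMPLES = {
    "German": "Feuerstätten",
    "Spanish": "combustión",
}


def encoded_sizes(text):
    """Return the number of bytes `text` occupies in each Unicode encoding form."""
    # The "-le" variants fix the byte order, so no byte order mark is counted.
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    for language, word in SAMPLES.items():
        print(language, len(word), encoded_sizes(word))
```

For the 12-character German word this gives 13, 24 and 48 bytes respectively: only the letter “ä” needs a second byte in UTF-8, whereas UTF-16 and UTF-32 spend two and four bytes on every character.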
From his reasoning it can be concluded that UTF-8 was the right choice for the TRIS corpus, as it only includes Latin-based textual data and there was therefore no need for an encoding form that would imply a larger size, such as UTF-16. The files of the TRIS corpus were not originally encoded in UTF-8. The translation memory files were obtained on a Windows operating system because the software used for alignment (SDL Trados WinAlign) is not available for other operating systems. However, when the files were manipulated on another operating system (Mac OS), problems arose because the files produced on Windows were encoded in ISO Latin 1, a legacy single-byte encoding that is not the default on Macintosh and other operating systems. This problem is easy to overcome by automatically converting the encoding. To ensure the future readability and reusability of the TRIS corpus, the original ISO Latin 1 (also known as ISO 8859-1) encoding produced by SDL Trados WinAlign was converted to UTF-8. This was done using the command displayed in Figure 3, which instructs the computer to convert all .txt files in the current directory from ISO-8859-1 to UTF-8. The character encoding conversion was done before the aligned files were converted from the Trados proprietary format to the standard TMX format.
for file in *.txt; do iconv -f ISO-8859-1 -t UTF-8 "$file" > "$file.utf8.txt"; done
Figure 3: Unix command to automatically convert Latin1 files to UTF-8
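The same conversion could also be scripted in Python, which may be convenient on systems without iconv. The following is a minimal sketch under stated assumptions: the function name and the output naming scheme are hypothetical, and the source encoding is assumed to be ISO-8859-1 throughout.

```python
from pathlib import Path


def convert_to_utf8(directory, source_encoding="iso-8859-1"):
    """Re-encode every .txt file in `directory` as UTF-8, next to the original."""
    converted = []
    for path in sorted(Path(directory).glob("*.txt")):
        if path.name.endswith(".utf8.txt"):
            continue  # skip files written by an earlier run
        text = path.read_bytes().decode(source_encoding)
        # Hypothetical naming: "name.txt" becomes "name.txt.utf8.txt".
        target = path.with_name(path.name + ".utf8.txt")
        target.write_bytes(text.encode("utf-8"))
        converted.append(target)
    return converted


if __name__ == "__main__":
    for written in convert_to_utf8("."):
        print(written)
```

One caveat applies to this sketch as well as to the shell command: every byte sequence is valid ISO-8859-1, so a file that is actually in another encoding is converted without any error and only shows up later as wrong characters. Spot-checking a few converted files for umlauts and accented vowels is therefore advisable.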
4. Corpus Markup
As defined in Morrison et al. (2000), markup is “a form of text added to a document to transmit information about both the physical and electronic resource”. I will not discuss here the benefits of using a common and standardized markup framework as it has already been widely discussed, reasoned and agreed upon. Instead, I will focus on the different standards that are available with regards to corpus markup. In this paper, the term “standard” is not restricted to official standards such as ISO, ETSI or OASIS standards and therefore may also be used to refer to markup formats which are regularly and widely used. This section is divided into three subsections: one in which the markup languages SGML and XML are introduced (4.1), another one in which industrial standards are discussed (4.2) and a final one with a special focus on the linguistic markup of linguistic resources (4.3).
4.1. Brief introduction to markup: SGML and XML
SGML (Standard Generalized Markup Language) and XML (EXtensible Markup Language) are structured markup languages. HTML (Hypertext Markup Language), for example, is a type of SGML used to mark up text and graphics so that the most popular web browsers can interpret them. To identify the markup in a document, both SGML and XML use named elements delimited by angled brackets (“<” and “>”). As explained in (Walsh and Muellner, 1999), “An essential characteristic of structured markup is that it explicitly distinguishes (and accordingly “marks up” within a document) the structure and semantic content of a document. It does not mark up the way in which the document will appear to the reader, in print or otherwise.” Moreover, the structure of the documents is controlled by either document type definitions (DTDs) or XML schema. A DTD is a set of declarations regarding the structure of a document, and its goal was to retain a level of compatibility with SGML for applications that might want to convert SGML DTDs into XML DTDs. It consists of a list of tag names and specifies their combination rules and it is also used to check that a particular document is appropriately structured.
While SGML was commonly used in the past, there has been a shift of markup language and nowadays it is more common to use XML. In fact, all the markup standards that will be discussed in the next subsections have either moved towards XML or were already conceived in XML.
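The idea that a DTD lists tag names and their combination rules can be illustrated with a toy checker. This is only a sketch: the element names and the rule table are invented, the check is hand-rolled rather than real DTD validation, and Python's standard xml.etree.ElementTree module by itself only tests well-formedness.

```python
import xml.etree.ElementTree as ET

# Hypothetical combination rules in the spirit of a DTD: parent -> allowed children.
ALLOWED_CHILDREN = {
    "corpus": {"text"},
    "text": {"s"},
    "s": set(),
}


def structure_errors(xml_text):
    """Return a list of problems; an empty list means the structure is accepted."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # The document is not even well-formed XML.
        return [f"not well-formed: {exc}"]
    errors = []
    if root.tag not in ALLOWED_CHILDREN:
        errors.append(f"unknown element <{root.tag}>")
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag, set())
        for child in parent:
            if child.tag not in ALLOWED_CHILDREN:
                errors.append(f"unknown element <{child.tag}>")
            elif child.tag not in allowed:
                errors.append(f"<{child.tag}> not allowed inside <{parent.tag}>")
    return errors


if __name__ == "__main__":
    print(structure_errors("<corpus><text><s>Ein Satz.</s></text></corpus>"))
    print(structure_errors("<corpus><s>Ein Satz.</s></corpus>"))
```

The sketch separates two levels that are easy to conflate: a document can be well-formed (all tags properly nested and closed) and still violate the combination rules that a DTD or schema imposes.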
4.2. The Translation Memory eXchange (TMX) and other LISA standards. Industrial Standards entering into Academia and beyond
TMX stands for Translation Memory eXchange. It is an XML format for encoding translation memories so that they can be reused and exchanged among different CAT tools without difficulty. It was developed by the Localization Industry Standards Association (LISA), and after having been widely adopted in the industrial sector it has made its way into the academic and institutional sector as well. In fact, some of the Language Technology Resources released by the European Commission are in this format. Examples of this are the DGT-Translation Memory⁸ and the ECDC-TM, the Translation Memory of the European Centre for Disease Prevention and Control⁹. Its increasing presence as an encoding format has led to the appearance of tools that extract the text from TMX files and convert it to simple UTF-8 .txt files if needed. This is the case of the extract-tmx-corpus tool¹⁰, which is currently used to prepare input files for the Statistical Machine Translation System MOSES¹¹.
LISA was sadly dissolved in March 2011, but its contributions towards standardization in the Localization Industry were of great magnitude and some of the standards it developed are still widely used. The body in charge of creating new standards was a specific committee called OSCAR (Open Standards for Container/Content Allowing Reuse), and as a result of its work five community standards were successfully published: the Translation Memory eXchange (TMX)¹², the TermBase eXchange (TBX)¹³, the Segmentation Rules eXchange (SRX)¹⁴, the Global information management Metrics eXchange Volume (GMX-V)¹⁵ and the XML Text Memory (xml:tm)¹⁶.
As can be inferred from the previous paragraph, LISA, an industrial initiative to cooperate and standardize the localization field, was a very important agent as regards standardization.
It cooperated with the relevant agents in the field to ensure the success of its proposals: the ISO TC 37 group, OASIS XLIFF and the Open Architecture for XML Authoring and Localization (OAXAL). As stated in the TBX definition (Open Standards for Container/Content allowing Reuse, 2008), TBX, for instance, is identical to ISO 30042. When LISA’s dissolution was announced, the European Telecommunications Standards Institute (ETSI) worked together with LISA on a proposal to create a new Industry Specification Group (ISG) for Localisation Industry Standards (LIS), which would ensure the maintenance of the five LISA OSCAR standards mentioned above as well as the cooperation with LISA’s cooperating partners. As stated in Guillemin and Trillaud (2012), “the ETSI is a standardization institute which produces standards from information and communications technology, including fixed, mobile, radio, converged, aeronautical, broadcast and internet technologies and is officially recognized by the European Union as an European Standards Organization. ETSI is an independent, not-for-profit association with more than 700 member companies and organizations, drawn from 62 countries across five continents worldwide, that determine its work program and participate directly in its work”. Guillemin and Trillaud (2012) offer a summarized explanation of LISA’s dissolution and what was done to ensure the continuity of the standards developed within this professional association. As of February 2013, the ETSI has officially released the TMX as ETSI ISG LIS GS Translation Memory eXchange (TMX)¹⁷ and the GMX-V as Global information management Metrics eXchange Volume (GMX-V)¹⁸. The XML Text Memory (ETSI ISG LIS GS XML Text Memory (xml:tm)) has reached the status of a stable draft¹⁹, and the TBX (ETSI ISG LIS Term-Base eXchange (TBX)) is still an early draft²⁰, as is the SRX (ETSI ISG LIS Segmentation Rules eXchange (SRX))²¹. The efforts made to ensure the continuity of the standards despite LISA’s dissolution are proof of the importance that they have acquired for industry, academia and the public sector. TMX and TBX are probably the two standards most closely related to the Natural Language Processing (NLP) field and, as exemplified above, TMX is in fact starting to be used as a standard for the release of new linguistic resources.
8 http://ipsc.jrc.ec.europa.eu/?id=197
9 http://ipsc.jrc.ec.europa.eu/?id=782
10 http://code.google.com/p/extract-tmx-corpus/
11 http://www.statmt.org/moses/
12 http://www.gala-global.org/oscarStandards/tmx/tmx14b.html
13 http://www.gala-global.org/oscarStandards/tbx/tbx_oscar.pdf
14 http://www.gala-global.org/oscarStandards/srx/srx20.html
15 http://www.gala-global.org/oscarStandards/gmx-v/gmx-v.html
16 http://www.gala-global.org/oscarStandards/xml-tm/xml-tm.html
Converting the TRIS corpus into TMX
As mentioned in Section 2, the commercial software SDL Trados WinAlign is used to align the MS Word files. One of the reasons behind this decision is that sentence alignment can be carried out from native MS Word files and no format conversion prior to alignment is required. Moreover, the decision was taken for practical reasons: WinAlign saves time at this stage of the process while producing bilingual files either in its own proprietary format or in TMX.
Figure 4: The SDL Trados WinAlign Interface
17 http://www.etsi.org/deliver/etsi_gs/LIS/001_099/002/01.04.02_60/gs_LIS002v010402p.pdf
18 http://www.etsi.org/deliver/etsi_gs/LIS/001_099/004/02.00.00_60/gs_LIS004v020000p.pdf
19 http://webapp.etsi.org/WorkProgram/Report_Schedule.asp?WKI_ID=37769
20 http://webapp.etsi.org/WorkProgram/Report_Schedule.asp?WKI_ID=37750
21 http://webapp.etsi.org/WorkProgram/Report_Schedule.asp?WKI_ID=37767
Figure 4 shows the user interface of WinAlign. As can be seen, the program proposes automatic alignments (dotted lines), and a human validator can correct those alignments, confirm them (solid line) or reject them (no line at all). The program also permits the user to join or split segments as well as edit them if needed. This is very useful as sometimes it is necessary to join several segments into one. This is the case, for example, when in the original MS Word file in German there is a list with the verb in a separate line at the end of the list while in the Spanish translation the verb occurs at the beginning of the list. German grammar requires that certain structures have the verb at the end and this cannot be done in Spanish. The editing feature of WinAlign allows the user to edit the text in the segments (e.g. to correct typos not previously detected) and join/split them accordingly so that they are paired with the appropriate sentence in the other language. Figure 5 illustrates the structure of an aligned segment produced by SDL Trados WinAlign in the .rtf format that the program uses internally. Furthermore, as mentioned earlier, WinAlign also allows the user to export the alignment as a TMX file. One drawback of Trados is that the resulting translation memories (TMs) include a lot of unnecessary formatting information that has to be cleaned before further exploitation of the corresponding files. Another drawback is that when merging several TMs into one, the program filters out all duplicates and deletes them and it does not keep track of the order in which sentences appear in the text. This is because it is a Computer Assisted Translation Tool and these details are not relevant for its intended usage.
<TrU>
<Quality>100
<CrU>ALIGN!
<CrD>26012012, 21:11
<Seg L=DE-AT>Nach außergewöhnlichen Ereignissen, wie z. B. länger anhaltenden extremen Temperaturen, Hochwasser, Erdbeben, Lawinen- oder Murenabgängen, Rutschungen, Unfällen, Feuer oder Anprall von Fahrzeugen udgl. sind die Wegweiserbrücken gezielt auf die möglichen Auswirkungen der außergewöhnlichen Umstände hin zu besichtigen.
<Seg L=ES-ES>De haberse producido algún hecho extraordinario, como por ejemplo, temperaturas extremas de muy larga duración, riadas, seísmos, aludes o argayos, corrimientos, accidentes, incendios o impactos de vehículos y similares, se deberán inspeccionar los pórticos para mensajes de carretera específicamente en cuanto a las posibles repercusiones de las circunstancias extraordinarias.
</TrU>
Figure 5: Sample of an aligned segment produced by SDL Trados WinAlign (abbreviated)
To overcome these challenges, another industrial application is used: ApSIC Xbench²². ApSIC Xbench supports several input formats (such as TMX and Trados’ proprietary .rtf format) and allows the user to merge several translation memories without removing duplicates and respecting the order in which they appear. Thus, this tool is used to merge all single files into one file per subdomain in the corpus and convert them to TMX. Even though the TMX format is not really necessary for my research project (simple plain monolingual files with one sentence per line would have been enough), I deemed it appropriate to convert the resulting translation memories into TMX as this has become a standard in our field and would ensure interoperability and reusability in the long run. The TMX files are further processed with a python script to add additional information to each sentence in the corpus.
22 http://www.apsic.com/en/products_xbench.html
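What “merging without removing duplicates and respecting the order” amounts to can be sketched in a few lines of Python. This is not how ApSIC Xbench works internally, which is not documented here; it is an illustration under the assumption that each input is a well-formed TMX document with a single body element, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def merge_tmx(tmx_documents):
    """Concatenate the <tu> elements of several TMX strings.

    The header of the first document is kept; translation units are appended
    in input order and duplicates are deliberately preserved.
    """
    merged_root = None
    merged_body = None
    for text in tmx_documents:
        root = ET.fromstring(text)
        body = root.find("body")
        if body is None:
            raise ValueError("TMX document without a <body> element")
        if merged_root is None:
            merged_root, merged_body = root, body
        else:
            merged_body.extend(body.findall("tu"))
    if merged_root is None:
        raise ValueError("no TMX documents given")
    return merged_root


if __name__ == "__main__":
    first = ('<tmx version="1.4"><header/><body>'
             '<tu tuid="a1"><tuv xml:lang="DE-AT"><seg>Eins.</seg></tuv></tu>'
             '</body></tmx>')
    second = ('<tmx version="1.4"><header/><body>'
              '<tu tuid="b1"><tuv xml:lang="DE-AT"><seg>Eins.</seg></tuv></tu>'
              '</body></tmx>')
    merged = merge_tmx([first, second])
    print(ET.tostring(merged, encoding="unicode"))
```

Keeping duplicates and document order matters for a corpus, where repeated sentences are data and sentence position carries context, even though both are irrelevant for a translation memory used inside a CAT tool.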
Figure 6 illustrates the structure of the final TMX files. As can be seen, all TMX documents are divided into a header and a body element. The structure of any TMX document is appropriately described and documented in the TMX definition released by ETSI (Localization Industry Standards (LIS) ETSI Industry Specification Group (ISG), 2013). What follows is a brief summary of the information that can be found there.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx PUBLIC "-//LISA OSCAR:1998//DTD for Translation Memory eXchange//EN" "http://www.ttt.org/oscarstandards/tmx/tmx14.dtd">
<tmx version="1.4">
<header creationtool="SDL Trados WinAlign 8.3.0.863" creationtoolversion="Edition 8 Build 863" o-tmf="SDL TM8 Format" segtype="sentence" adminlang="EN-US" srclang="DE-AT" datatype="xml" creationdate="June 2012" creationid="Carla Parra, UiB" >
</header>
<body>
<tu tuid="B00Y1999File119990211S7" creationdate="20111114T1935Z" creationid="ALIGN!">
<tuv xml:lang="DE-AT">
<seg>Das Fangsystem (beispielhaft in Abb 1 dargestellt) dient dazu, Verbrennungsgase von Feuerstätten mit niedrigen Verbrennungsgastemperaturen (mit diesbezüglichen Sicherheitseinrichtungen) ins Freie zu leiten.</seg>
</tuv>
<tuv xml:lang="ES-ES">
<seg>El sistema de chimenea (representado a título de ejemplo en la figura 1) sirve para conducir al exterior los gases de combustión procedentes de hogares con baja temperatura de los gases de combustión (con los correspondientes dispositivos de seguridad).</seg>
</tuv>
</tu>
<tu tuid="B00Y1999File119990211S8" creationdate="20111114T1935Z" creationid="ALIGN!">
<tuv xml:lang="DE-AT">
<seg>Als Feuerstätten kommen zB Brennwertgeräte jeweils mit Gas oder Heizöl extra leicht als Brennstoff in Betracht.</seg>
</tuv>
<tuv xml:lang="ES-ES">
<seg>Los hogares a considerar son por ejemplo los equipos de índice de combustión que empleen como combustible, respectivamente, gas o fuel extra-ligero.</seg>
</tuv>
</tu>
...
</body>
</tmx>
Figure 6: Sample from a TMX aligned file (abbreviated)
The header, enclosed within the <header></header> tags, contains the metadata about the document. The body, enclosed within the <body></body> tags, contains all the translation units in the translation memory. In the header there is information related to the tool with which a translation memory has been created and its version (“creationtool” and “creationtoolversion” respectively); the original translation memory format (“o-tmf”); the kind of segmentation used (“segtype”); the default language in which the administrative and informative elements are written (“adminlang”); the source language of the translations included in the translation memory (“srclang”); the type of data we have (“datatype”); the creation date of that particular translation memory (“creationdate”); and the identifier for the creator of the translation memory (“creationid”). The body of any translation memory consists of one or more translation unit elements (enclosed within <tu></tu>), which in turn include one or more translation unit variants (enclosed within <tuv></tuv>). In the TRIS corpus, each translation unit element consists of two translation unit variant elements. In addition, every translation unit is described by means of three attributes: “tuid”, “creationdate” and “creationid”. The attribute “tuid” (translation unit identifier) offers most of the information for every single sentence. For instance, the value tuid=“B00Y1999File119990211S7” in Figure 6 stands for the construction domain (B00), year 1999 (Y1999), file name 119990211 (File119990211), sentence 7 (S7). The attribute “creationdate” contains the date and time at which the translation unit was created, and “creationid” refers to the creator of the translation unit. Its value usually corresponds to the user ID of the user who created the unit. In order to specify that a translation unit comes from an alignment tool, SDL Trados WinAlign assigns itself as the creator by using the value “ALIGN!”. The translation unit variant consists of a segment element and the information corresponding to that segment for a given language. The attribute “xml:lang” refers to the language variety used in the segment that appears below. Its value must be compliant with the RFC 3066 [6]²³. Thus, in the case of TRIS “DE-AT” refers to German (Austria) and “ES-ES” to Spanish (Spain). The text between the <seg></seg> tags is the actual text, and the fact that two translation unit variants are grouped together in a translation unit indicates that one is the translation of the other.
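The way the “tuid” attribute packs several pieces of information into one string can be made explicit with a small sketch. The regular expression is an assumption derived from the single example given above (one capital letter and two digits for the domain, a four-digit year, a numeric file name and a sentence number); the names TuId, parse_tuid and build_tuid are hypothetical and do not come from the script used for the corpus.

```python
import re
from typing import NamedTuple


class TuId(NamedTuple):
    domain: str
    year: int
    file_name: str
    sentence: int


# Assumed layout: <domain>Y<year>File<file name>S<sentence number>.
TUID_PATTERN = re.compile(
    r"(?P<domain>[A-Z]\d{2})Y(?P<year>\d{4})File(?P<file_name>\d+)S(?P<sentence>\d+)"
)


def parse_tuid(tuid):
    """Split a translation unit identifier into its components."""
    match = TUID_PATTERN.fullmatch(tuid)
    if match is None:
        raise ValueError(f"unexpected tuid: {tuid!r}")
    return TuId(
        domain=match["domain"],
        year=int(match["year"]),
        file_name=match["file_name"],
        sentence=int(match["sentence"]),
    )


def build_tuid(parts):
    """Inverse of parse_tuid: assemble the identifier from its components."""
    return f"{parts.domain}Y{parts.year}File{parts.file_name}S{parts.sentence}"


if __name__ == "__main__":
    print(parse_tuid("B00Y1999File119990211S7"))
```

Encoding domain, year, file and sentence position in the identifier means that each translation unit stays traceable to its source document even after several translation memories have been merged into one file.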
4.3. Standards currently being fostered within the NLP field
Current European initiatives such as Meta-share24 are making major efforts to promote the use of standards and good practices in our field. Since the TRIS corpus is to be released through Meta-Nord, the Meta-share node to which the University of Bergen belongs, their documentation was consulted to decide which standards to use for corpus encoding. As stated in Deliverable 4.1 of the Meta-Nord project25, Metadata descriptions and other interoperability standards, suitable standards for corpus encoding would be TEI or (X)CES (Borin and Lindh, 2011, p. 15). Therefore, I decided that my corpus would use one of these two markup languages to ensure that it would be compliant with current initiatives on standardization, curation and sustainability of Language Resources and Tools (LRTs). The next two subsections (4.3.1 and 4.3.2) briefly explain each of them, while Subsection 4.3.3 compares the two standards (TEI and (X)CES) and gives the reasons for the decision taken. Finally, Subsection 4.3.4 provides details about the encoding of the TRIS corpus in TEI P5 format.
4.3.1. The Text Encoding Initiative (TEI)
The Text Encoding Initiative (TEI) is a non-profit organization whose consortium includes members from academia, research projects and individual scholars from around the world. On their website26 they offer extensive documentation about the initiative as well as guidelines and a wide range of materials. Their main goal is to collectively develop and maintain the TEI Guidelines for the encoding of texts in digital form. In order to reach a wide audience, their Guidelines are aimed at the Humanities, Social Sciences and Linguistics, and since 1994 they have been used in a vast number of projects, institutions and resources. Since their first release, the TEI Guidelines have been periodically updated, and feedback from the user community is incorporated to fulfill user needs and requirements. The latest release of the TEI Guidelines for Electronic Text Encoding and Interchange dates from late January 2013 and corresponds to version 2.3.0 of TEI P5. Moreover, although the current version is TEI P5, resources encoded in previous versions, such as the TEI P4 format, can still be used without interoperability problems thanks to the usage of the corresponding DTD. An example of a resource encoded in a prior version of the standard but still widely used nowadays is the JRC Acquis (Steinberger et al., 2006), which was released in TEI P4.
23 http://www.ietf.org/rfc/rfc3066.txt
24 http://www.meta-net.eu/meta-share
25 http://www.meta-net.eu/
26 http://www.tei-c.org/index.xml
4.3.2. The XML Corpus Encoding Standard ((X)CES)
Another effort towards standardization of corpus encoding is the one carried out by the Expert Advisory Group on Language Engineering Standards (EAGLES27). As a result of their work a first Corpus Encoding Standard (CES)28 was developed. It started as an SGML standard compliant with the specifications of the TEI Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative29. (X)CES stands for XML Corpus Encoding Standard and it is a newer version of CES encoded in XML. It is currently more frequently used than CES because XML has become the most widely used markup language. However, (X)CES is not only an XML version of CES and, as pointed out by Simões and Fernandes (2011), not all corpora which claim to be encoded in (X)CES are truly encoded in (X)CES but rather in CES encoded in XML: “… some researchers claim they are releasing their corpora in XCES format, but they are just encoding CES in XML, and XCES is more than that.”
4.3.3. TEI and (X)CES. A Comparison
TEI and XCES have become the de facto standards for corpus encoding, and most corpora are in one of the two formats or at least easily convertible to them. Several papers (Przepiórkowski and Bánski, 2011; Przepiórkowski, 2009; Bánski and Przepiórkowski, 2010; Simões and Fernandes, 2011) refer to TEI as the standard and reference for corpus encoding, and it seems reasonable to consider it for the encoding of newly compiled corpora. For the encoding of TRIS, the two standards were compared with the aim of determining which seemed best. The first drawback found in the case of XCES is its lack of documentation, which authors like Przepiórkowski (2009) and Simões and Fernandes (2011) have already pointed out. In fact, not knowing what the encoding should actually look like makes it particularly difficult to encode a corpus from scratch in this format. Przepiórkowski (2009) states this as follows: “http://www.xces.org/ refers to old CES documentation as “supporting general encoding practices for linguistic corpora and tag usage” and “largely relevant to the XCES instantiation”, although the CES documentation is hardly applicable to the second version of XCES”. In the same paper, Przepiórkowski (2009) also mentions as another reason against XCES “the potential for confusion regarding the version of the standard (in particular, for many years DTD and XML Schema specifications co-existed on XCES web pages, without clear information that they specify different representations)”. The same is pointed out in another paper: “There is a potential for confusion regarding the version of the standard. XCES was derived from TEI version P4, but it has not been updated to TEI P5 so far” (Przepiórkowski and Bánski, 2011). On the XCES website30 it is stated that “XCES is continually under development and future work will include making the XCES compliant with TEI P5”. TEI P5 was released in November 2007 and is updated every six months. The last time the XCES website was updated was June 200831. This shows how outdated XCES is and contrasts with the willingness of the TEI community to keep their proposed standard up to date32.
On the other hand, a possible drawback of TEI is its extensive documentation: the current version of the Guidelines (January 2013) comprises 1641 pages. As Przepiórkowski (2009) points out, “usually there is more than one way of representing any given annotation, so designing a coherent and constrained TEI-conformant schema for linguistic corpora is a daunting task”.
For the reasons given above, TEI P5 was the standard chosen to encode the TRIS corpus. Moreover, the active support and the willingness to resolve doubts and make clarifications on the TEI mailing list were also a clear advantage in favour of TEI. Finally, it also seemed the best option with regard to the interoperability and sustainability of a resource under development, since it is periodically reviewed and documented. XCES is not documented enough and, as mentioned in Subsection 4.3.2, the resources available in XCES are not always truly encoded in XCES but rather represent interpretations (their developers' own XML versions) of the previous CES format, or schemata based on XCES. Deliverable D.2.1 of the Let’s MT project offers a good example of this last issue. As Tiedemann and Wijnitz (2010, p. 6) explain, the alignment information of their parallel corpora will be stored “in links between sentences in external files pointing to the appropriate documents using the unique sentence IDs for identification of the aligned segments” and for this they “will use a simple XML format based on the XCES standard”33. If resource developers create new encoding formats based on XCES, they are no longer using the standard itself, and therefore their resources will encounter interoperability problems in the long run.
27 http://www.ilc.cnr.it/EAGLES/home.html
28 http://www.cs.vassar.edu/CES/
29 More information about the origins of CES can be found at their website: http://www.cs.vassar.edu/CES/.
30 http://www.xces.org/
31 The last time this was verified was February 2013.
32 The last TEI P5 release was done in January 2013 and stands for version 2.3.0 of the standard.
33 The emphasis is my own.
4.3.4. The TRIS corpus in TEI P5 format
In this subsection the encoding of the TRIS corpus in TEI P5 is briefly explained. As described in Sperberg-McQueen and Burnard (2009, p. 139), “a full TEI document combines metadata describing it, represented by a <teiHeader> element, with the document itself, represented by a <text> element”. The <teiCorpus> is a variant defined for the representation of language corpora or collections of texts. It consists of one or more complete <TEI> elements (i.e. elements consisting of a <teiHeader> and a <text> element) and additionally has its own <teiHeader> describing the whole corpus. This allows for a more general description of the corpus as a whole in the <teiHeader> element prefixed to the whole corpus, and a more detailed description of every <TEI> element comprised in the <teiCorpus> in their respective <teiHeader>. Chapter 15 of the TEI P5 Guidelines (Sperberg-McQueen and Burnard, 2009) describes how to encode a corpus. In what follows the encoding of the TRIS corpus is described to exemplify the TEI P5 structure of a teiCorpus.
First of all, it must be pointed out that while it was clear that the <teiCorpus> element should be used, it was also necessary to establish the inner structure of the TRIS corpus as a whole and determine how it would be encoded. The TRIS corpus includes files written in Germany, Austria and Spain, thus originally written in either German or Spanish and translated into the other language. Furthermore, there are two language varieties in the case of German: Austrian German and German as used in Germany. The corpus also includes texts from different domains and subdomains and is ordered by year of publication from 1999 to 201034. So far, only the texts for a particular domain (Construction) have been released for public usage35, but other domains will be included shortly. When designing the TEI structure it was decided to have a general <teiHeader> for the whole corpus and then a <TEI> element for every domain and year. This makes it relatively easy to add new files on the fly once they are ready to be added to the corpus and does not prevent the corpus from being released beforehand.
34 See Parra Escartín (2012) for detailed information about the texts in the corpus.
I. The header. As explained above, the <teiHeader> element contains information about the corpus as a whole. Every TEI-conformant text must have a header prefixed to it. TEI headers consist of four major parts, of which only the file description is mandatory:
1. A file description (<fileDesc>): “a full bibliographical description of the computer file itself, from which a user of the text could derive a proper bibliographic citation (…)” (Sperberg-McQueen and Burnard, 2009).
2. An encoding description (<encodingDesc>): relates to how the source files were manipulated prior to encoding.
3. A text profile (<profileDesc>): contains classificatory and contextual information about the text.
4. A revision history (<revisionDesc>): contains information about the changes made during the development of the text.
Thus, the TRIS corpus starts as follows:
  <teiCorpus version="5.2" xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader xml:lang="en" type="corpus">
Figure 7: Beginning of the TRIS corpus <teiCorpus> element and of the TRIS corpus header
Here, version refers to the TEI Guidelines version used (5.2) and xmlns is the namespace for the Text Encoding Initiative. Within the <teiHeader> element there are two attributes: the xml:lang attribute, which refers to the language in which the <teiHeader> is written, and type, which refers to the type of document it describes.
Figure 8, Figure 9 and Figure 10 display the information provided in the header of the TRIS corpus in TEI. Since the current release is the only one done so far in TEI, there is no <revisionDesc> element yet. As the names and values of the attributes are quite self-explanatory, no further details are given. For further information about the TEI header, see Chapter 2 of the TEI P5 Guidelines (Sperberg-McQueen and Burnard, 2009, pp. 17–53).
35 http://metashare.nb.no/repository/browse/parallel-corpus-of-documents-from-the-technical-regulations-information-system-for-german-spanish-v02/d12552021dcc11e28f61001708556d5a64b9251fd03048ecaf7fe1abdc48a2d1/
  <fileDesc>
    <titleStmt>
      <title>Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish (v0.2)</title>
      <funder>EU under FP7, Marie Curie Actions, SP3 People ITN, grant agreement 238405 (project CLARA)</funder>
      <principal>Carla Parra Escartín</principal>
      <respStmt>
        <name>Carla Parra Escartín</name>
        <resp>corpus compilation, processing, encoding and markup</resp>
      </respStmt>
    </titleStmt>
    <editionStmt>
      <edition n="V0.2">Version 0.2 which extends the previous version 0.1 and fixes some formatting errors detected there.</edition>
      <respStmt>
        <resp>Formatting errors in v0.1 corrected and corpus enlarged</resp>
        <name>Carla Parra Escartín</name>
      </respStmt>
    </editionStmt>
    <extent>
      <measureGrp>
        <measure type="files">205</measure>
        <measure type="sentences">70648</measure>
        <measure type="words_DE-AT">638907</measure>
        <measure type="words_ES-ES">923830</measure>
      </measureGrp>
    </extent>
    <publicationStmt>
      <address>
        <addName>Institutt for lingvistiske, litterære og estetiske studier</addName>
        <addrLine>Postboks 7805</addrLine>
        <addrLine>5020 Bergen (Norway)</addrLine>
      </address>
      <date>2012</date>
      <publisher>Universitetet i Bergen</publisher>
      <pubPlace>Bergen, Norway</pubPlace>
      <distributor>Universitetet i Bergen</distributor>
      <availability status="restricted">
        <licence>Under negotiation</licence>
      </availability>
    </publicationStmt>
    <sourceDesc>
      <p>98/34/EC Directive, TRIS Database, European Commission</p>
    </sourceDesc>
  </fileDesc>
Figure 8: <fileDesc> element of the TRIS corpus header
  <encodingDesc>
    <projectDesc>
      <p>
        Specialized parallel corpus Spanish-German (ES-ES, DE-AT and DE-DE) texts from the European Commission between 1997-2010.
        The texts are technical regulations in a variety of domains.
        The original files were either in MS Word or PDF format and have been converted to UTF-8 plain text.
        The corpus will be used in a project involving the study of phraseological translation correspondences between German and Spanish.
      </p>
    </projectDesc>
    <samplingDecl>
      <p>
        All texts written in either Austria, Germany or Spain with corresponding translation into German/Spanish accordingly where crawled.
        Only texts in readable format have been included in the collection.
        Texts belong to 10 different domains and are further classified in subdomains.
        The current version includes all texts from 1999 to 2010 written in Austria and for the construction domain.
        Images have been omitted and only the text in them has been preserved.
        Tables were converted to plain text.
        Formulae and mathematical expressions have been also omitted.
        All included texts have been aligned at sentence level.
      </p>
    </samplingDecl>
    <editorialDecl>
      <correction>
        <p>
          Orthotypographic errors have been corrected.
          Mismatching sentences have been omitted to ensure that every sentence has an alignment in the other language.
        </p>
      </correction>
      <segmentation>
        <p>
          <gi>s</gi> elements mark orthographic sentences and are numbered sequentially.
        </p>
      </segmentation>
    </editorialDecl>
  </encodingDesc>
Figure 9: <encodingDesc> element of the TRIS corpus header
  <profileDesc>
    <creation>
      <date when="2011-2012">September 2012</date>
      <rs type="city">Bergen, Norway</rs>
    </creation>
    <langUsage>
      <language ident="DE-AT">German (Austria)</language>
      <language ident="ES-ES">Spanish (Spain)</language>
    </langUsage>
  </profileDesc>
Figure 10: <profileDesc> element of the TRIS corpus header
II. The <teiHeader>. After the header for the whole corpus, the teiCorpus structure requires a TEI element with its own header describing that particular element of the corpus. This header “inherits” the general characteristics from the upper one in the corpus and thus only provides the specific information related to the text encoded in its <text> element. Attributes and values specified here override the ones in the upper header for this particular component of the corpus. Thus, for instance, the information about the number of files is updated for this particular element, as well as the number of sentences and the number of words per language. Figure 11 shows an example of a header for the files written in Austria in 1999 in the construction domain.
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>B00_CONSTRUCTION - AUSTRIA - 1999</title>
        <funder>EU under FP7, Marie Curie Actions, SP3 People ITN, grant agreement 238405 (project CLARA)</funder>
        <principal>Carla Parra Escartín</principal>
        <respStmt>
          <name>Carla Parra Escartín</name>
          <resp>corpus compilation, processing, encoding and markup</resp>
        </respStmt>
      </titleStmt>
      <extent>
        <measureGrp>
          <measure type="files">14</measure>
          <measure type="sentences">3899</measure>
          <measure type="words_DE-AT">33698</measure>
          <measure type="words_ES-ES">48239</measure>
        </measureGrp>
      </extent>
      <publicationStmt>
        <address>
          <addName>Institutt for lingvistiske, litterære og estetiske studier</addName>
          <addrLine>Postboks 7805</addrLine>
          <addrLine>5020 Bergen (Norway)</addrLine>
        </address>
        <date>2012</date>
        <publisher>Universitetet i Bergen</publisher>
        <pubPlace>Bergen, Norway</pubPlace>
        <distributor>Universitetet i Bergen</distributor>
        <availability status="restricted">
          <licence>Under negotiation</licence>
        </availability>
      </publicationStmt>
      <sourceDesc>
        <p>98/34/EC Directive, TRIS Database, European Commission</p>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <creation>
        <date when="2011-2012">September 2012</date>
        <rs type="city">Bergen, Norway</rs>
      </creation>
      <langUsage>
        <language ident="DE-AT">German (Austria)</language>
        <language ident="ES-ES">Spanish (Spain)</language>
      </langUsage>
    </profileDesc>
  </teiHeader>
Figure 11: <teiHeader> element of one of the TEI elements included in the TRIS corpus
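The per-component counts stored in the <extent> of such a header (files, sentences, words per language) lend themselves to being computed rather than typed by hand. The following Python sketch shows one way this could be done. It is an assumption-laden illustration: the function name build_extent and the input layout are hypothetical, words are counted by whitespace splitting (the counting method actually used for TRIS is not stated in this paper), and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_extent(files, src_lang="DE-AT", tgt_lang="ES-ES"):
    """Build an <extent> element for one corpus component.

    `files` is a list of files; each file is a list of aligned
    (source sentence, target sentence) pairs.
    """
    sentences = sum(len(pairs) for pairs in files)
    # Assumption: a "word" is a whitespace-separated token.
    src_words = sum(len(src.split()) for pairs in files for src, _ in pairs)
    tgt_words = sum(len(tgt.split()) for pairs in files for _, tgt in pairs)

    extent = ET.Element("extent")
    group = ET.SubElement(extent, "measureGrp")
    counts = [
        ("files", len(files)),
        ("sentences", sentences),
        (f"words_{src_lang}", src_words),
        (f"words_{tgt_lang}", tgt_words),
    ]
    for kind, value in counts:
        measure = ET.SubElement(group, "measure", {"type": kind})
        measure.text = str(value)
    return extent


if __name__ == "__main__":
    # Hypothetical toy data: two files with one and two aligned sentences.
    toy_files = [
        [("Eins zwei", "Uno dos tres")],
        [("Drei", "Cuatro"), ("Vier fünf", "Cinco seis siete")],
    ]
    print(ET.tostring(build_extent(toy_files), encoding="unicode"))
```

Because the same function can be called on a single domain and year or on the whole corpus, the component headers and the corpus header stay consistent with the data they describe.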
III. The <text>. The <text> element is where the actual corpus is stored. When it is created, a unique id is assigned to it to enable future referencing, extraction and usage according to user needs. This id includes information about the domain covered in the group of files, the country of origin (where the files were written) and the year in which they were written. Then, in the case of TRIS, it is further subdivided into single files grouped in a <group> element which includes all files in the corpus in the form of individual <text> elements (i.e. there are as many <text> elements as there are files in the corpus). Each individual file is also assigned a unique id which includes all the information related to the domain, the year and the name of the file in the EC database from which the files were retrieved. Since every file has been sentence aligned and is presented in two different languages, the <div> element is used to divide the text between the source language and the target language. A final link group (<linkGrp>) is included in which the sentence alignment information is given by assigning to each sentence in the source language the corresponding sentence in the target language by means of their unique ids. In order to make the alignments, each sentence is assigned a unique id in which the language and the sentence number are specified. Figure 12 displays a shortened text of the TRIS corpus encoded in TEI P5 to illustrate the usage of the different elements described above.
  <text xml:id="B00_AUSTRIA_1999">
    <group>
      <text xml:id="B00Y1999File119990211">
        <body>
          <div xml:id="B00Y1999File119990211_DE-AT" xml:lang="DE-AT" type="source">
            <p xml:id="B00Y1999File119990211S1_DE-AT">Magistrat der Stadt Wien</p>
            <p xml:id="B00Y1999File119990211S2_DE-AT">Magistratsabteilung 351200 Wien, Dresdner Straße 75</p>
            <p xml:id="B00Y1999File119990211S3_DE-AT">Verordnung des Magistrates der Stadt Wien über die bis zum … befristete Zulassung des Fangsystems „KAMINODUR GAS“.</p>
            ...
          </div>
          <div xml:id="B00Y1999File119990211_ES-ES" xml:lang="ES-ES" type="translation">
            <p xml:id="B00Y1999File119990211S1_ES-ES">Gobierno de la Ciudad de Viena</p>
            <p xml:id="B00Y1999File119990211S2_ES-ES">Sección de Gobierno 351200 Viena, Dredner Strasse 75</p>
            <p xml:id="B00Y1999File119990211S3_ES-ES">Reglamento del Gobierno de la Ciudad de Viena relativo a la homologación temporal hasta el ... del Sistema de chimeneas "KAMINODUR GAS".</p>
            …
          </div>
          <linkGrp type="alignment" domains="#B00Y1999File119990211_DE-AT #B00Y1999File119990211_ES-ES">
            <link target="#B00Y1999File119990211S1_DE-AT #B00Y1999File119990211S1_ES-ES" />
            <link target="#B00Y1999File119990211S2_DE-AT #B00Y1999File119990211S2_ES-ES" />
            <link target="#B00Y1999File119990211S3_DE-AT #B00Y1999File119990211S3_ES-ES" />
          </linkGrp>
        </body>
      </text>
      <text>
        ...
      </text>
    </group>
  </text>
Figure 12: Shortened sample of a <text> element in the TRIS corpus
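To make the id scheme and the alignment links in this structure concrete, the following Python sketch builds the <text> element of a single file from a list of aligned sentence pairs. It is only an illustration of the structure described in this subsection, not the script used to produce the TRIS corpus: the function name build_file_text and its parameters are hypothetical, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Prefix for attributes in the reserved XML namespace (xml:id, xml:lang).
XML = "{http://www.w3.org/XML/1998/namespace}"


def build_file_text(file_id, pairs, src_lang="DE-AT", tgt_lang="ES-ES"):
    """Return a <text> element for one file.

    `pairs` is a list of aligned (source sentence, target sentence) tuples.
    """
    text = ET.Element("text", {XML + "id": file_id})
    body = ET.SubElement(text, "body")

    # One <div> per language: the source text and its translation.
    divs = {}
    for lang, role in ((src_lang, "source"), (tgt_lang, "translation")):
        divs[lang] = ET.SubElement(
            body,
            "div",
            {XML + "id": f"{file_id}_{lang}", XML + "lang": lang, "type": role},
        )

    # The link group comes last and points at the two <div> elements.
    link_grp = ET.SubElement(
        body,
        "linkGrp",
        {
            "type": "alignment",
            "domains": f"#{file_id}_{src_lang} #{file_id}_{tgt_lang}",
        },
    )

    for number, (src, tgt) in enumerate(pairs, start=1):
        targets = []
        for lang, sentence in ((src_lang, src), (tgt_lang, tgt)):
            # Sentence id: file id + "S" + sentence number + "_" + language.
            sentence_id = f"{file_id}S{number}_{lang}"
            p = ET.SubElement(divs[lang], "p", {XML + "id": sentence_id})
            p.text = sentence
            targets.append("#" + sentence_id)
        ET.SubElement(link_grp, "link", {"target": " ".join(targets)})
    return text


if __name__ == "__main__":
    sample = build_file_text(
        "B00Y1999File119990211",
        [("Magistrat der Stadt Wien", "Gobierno de la Ciudad de Viena")],
    )
    print(ET.tostring(sample, encoding="unicode"))
```

Deriving every id from the file id, the sentence number and the language code means that the link targets can never drift out of sync with the sentences they point to.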
IV. Automatically converting the TMX files to TEI P5. Encoding a corpus like TRIS in TEI P5 manually would be an error-prone and tedious task. For this reason a simple Python script that automatically processes the TMX files was written. This script reads the TMX file to be encoded in TEI P5, stores in different variables the information needed for the different values to be assigned, processes the files and produces an XML file with the TEI structure explained above.
5. Conclusion
In this paper several issues regarding the encoding of a corpus have been tackled. Specifically, encoding formats (UTF-8 for character encoding, and XCES, TEI and TMX for corpus encoding) have been discussed. The TRIS corpus has been used as an example to illustrate the workflow of compiling a corpus and the different stages at which encoding plays a particular role. Due to space restrictions it has not been possible to discuss standards for Part-of-Speech (POS) tagging and their integration into the overall structure. This would be the next step, but in the case of the TRIS corpus it has been decided to release the output files of the POS tagger used (TreeTagger) separately. Thus, POS tags will not be integrated into the overall TEI P5 encoding, although this could easily be done if desired.
When the compilation of the TRIS corpus began, it was my desire as a resource developer to produce a reusable and interoperable resource. To this end it was crucial to take into consideration issues such as the encoding of the corpus discussed in this paper. Moreover, these kinds of issues should be taken into account during the corpus compilation planning phase and prior to publicly releasing the corpus, because otherwise there is a risk of creating a resource which is not useful for the community. This planning also spares other researchers interested in a newly compiled corpus from having to convert and adapt it before actually using it for their own research purposes. As a resource developer, I deemed it important to study the different available possibilities and decide which was the best option both for my needs and for releasing a resource which is valuable for the NLP community as a whole, regardless of whether the interested parties belong to academia, industry or both.
Last, but not least, I would like to highlight the importance of documenting the compilation process of any linguistic resource. This documentation is not only valuable for future reference of the resource itself, but also as guidance for future resource developers looking for solutions to the challenges they face or for the strategies followed in previously developed resources. I hope that this paper serves future corpus developers as a reference on how to encode a corpus and on the different types of encoding that they may have to take into account.
6. Acknowledgements
The research reported in this paper has received funding from the EU under FP7, Marie Curie Actions, SP3 People ITN, grant agreement 238405 (project CLARA36).
36 http://clara.uib.no
References
Bánski, P. and A. Przepiórkowski (2010). TEI P5 as a Text Encoding Standard for Multilevel Corpus Annotation. In Proceedings of Digital Humanities 2010, London.
Borin, L. and J. Lindh (2011). Deliverable D4.1: Metadata descriptions and other interoperability standards. Version 1.0, 2011-05-02. Deliverable in the METANORD project (CIP 270899).
Guillemin, P. and S. Trillaud (2012, April/May). What has become of LISA’s OSCAR standards? Multilingual, 38–41.
Localization Industry Standards (LIS) ETSI Industry Specification Group (ISG) (2013). ETSI GS LIS 002 V1.4.2 (2013-02): Localization Industry Standards (LIS); Translation Memory eXchange (TMX).
McEnery, A. and R. Xiao (2005). Developing Linguistic Corpora: a Guide to Good Practice, Chapter 4: Character encoding in corpus construction. AHDS Guides to Good Practice.
Morrison, A., M. Popham, and K. Wikander (2000). Creating and Documenting Electronic Texts: A Guide to Good Practice, Chapter 4: Markup: The key to reusability. AHDS Guides to Good Practice. Arts and Humanities Data Service.
Open Standards for Container/Content Allowing Reuse (OSCAR) (2008). Systems to manage terminology, knowledge, and content: TermBase eXchange (TBX).
Parra Escartín, C. (2012, May). Design and compilation of a specialized Spanish-German parallel corpus. In N. Calzolari, K. Choukri, T. Declerck, M. Uğur Doğan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).
Przepiórkowski, A. (2009). TEI P5 as an XML standard for treebank encoding. In M. Passarotti, A. Przepiórkowski, S. Raynaud, and F. Van Eynde (Eds.), Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories (TLT 8), Milan, Italy, pp. 149–160.
Przepiórkowski, A. and P. Bánski (2011). Which XML standards for multilevel corpus annotation? In Proceedings of the 4th conference on Human Language Technology: Challenges for computer science and linguistics, LTC’09, Berlin, Heidelberg, pp. 400–411. Springer-Verlag.
Sasaki, F. (2010, June). Linguistic Modeling of Information and Markup Languages. Contributions to Language Technology, Volume 40 of Text, Speech and Language Technology, Chapter 4: Markup Languages and Internationalization, pp. 67–80. Springer-Verlag.
Simões, A. and S. Fernandes (2011). XML schemas for parallel corpora. In XATA 2011: XML, associated technologies and applications, Vila do Conde, Portugal, pp. 59–69.
Sperberg-McQueen, M. and L. Burnard (2009, February). TEI P5: Guidelines for electronic text encoding and interchange. Technical report, The TEI Consortium.
Steinberger, R., B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, and D. Tufiş (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’2006), pp. 2142–2147.
Tiedemann, J. and P. Wijnitz (2010, August). Deliverable D.2.1: Specification of data formats. Deliverable in the Let’s MT project.
Walsh, N. and L. Muellner (1999, October). DocBook: The Definitive Guide. O’Reilly & Associates, Inc.
Wynne, M. (2005). Developing Linguistic Corpora: a Guide to Good Practice, Chapter 6: Archiving, distribution and preservation. AHDS Guides to Good Practice.