Encoding a parallel corpus: The TRIS corpus

Encoding a parallel corpus: The TRIS corpus experience

Carla Parra Escartín
University of Bergen

Abstract
This paper focuses on one of the many aspects to be taken into account when developing a new corpus: its encoding. During the compilation of the corpus of the Technical Regulations Information System (the TRIS corpus) several encoding issues arose. In this paper the author discusses the possibilities available with regard to encoding, as well as the decisions taken and the strategies followed. The author discusses standards for character encoding and corpus markup and explains how these were integrated in the compilation of the TRIS corpus.

Keywords: corpus planning, parallel corpora compilation, corpus encoding, standardization

* Principal contact:
Carla Parra Escartín
Marie Curie Early Stage Researcher
Language Models and Resources Research Group (LaMoRe)
Department of Linguistic, Literary and Aesthetic Studies (LLE)
University of Bergen, HF-bygget, Sydnesplassen 7, N-5007 Bergen, Norway
Tel.: +47 55 58 89 45
E-mail: [email protected]

 

C.  Parra  Escartín  

   

1. Introduction

This paper discusses several issues related to corpus encoding and to the use of the encoding standards that are applicable to the compilation of corpora. To illustrate them, the compilation process of the corpus of the Technical Regulations Information System (henceforth, the TRIS corpus) is used. The TRIS corpus is being compiled for a larger project that investigates the translational correspondences between German nominal compounds and their Spanish phraseological correspondences. Details about its compilation process and its main characteristics can be found in Parra Escartín (2012).

According to the Collins Cobuild online dictionary [1], encoding in computing is “the action of converting (characters and symbols) into a digital form as a series of impulses”. The Tech Terms Computer Dictionary [2] refers to it as “the process of converting data from one form to another” and specifies that “there are several types of encoding, including image encoding, audio and video encoding, and character encoding”. Thus, when we refer to the encoding of a corpus we may be referring to different aspects and even to different kinds of encoding. My experience in compiling the TRIS parallel corpus has made me aware of this fact. This paper discusses the role that encoding plays at each stage of the corpus compilation process.

The remainder of this paper is divided into sections that follow what could be considered the logical progression of a corpus compilation. At each phase, the problems and challenges faced are explained and discussed, together with the strategies adopted and the decisions taken. Section 2 explains the role of encoding within the compilation of a corpus. Section 3 focuses on the importance of character encoding and its role in corpora, and Section 4 is devoted to the different types of markup that we may choose for a corpus.

2. The corpus encoding workflow

In order to understand the role of encoding in the compilation process of a corpus, it is important to see at which stages it plays a particular role. If we take into account the definitions given in Section 1, the very first phase of the compilation process already implies several changes in the encoding of the files included in the corpus. In the case of the TRIS corpus, the files were automatically retrieved from the Database of the DG Enterprise and Industry Project [3] of the European Commission by means of a crawler (a computer program capable of performing recursive searches) [4]. After corrupted files and files in outdated, no longer available formats had been discarded, every remaining file was classified according to its original format. MS Word files were stored directly for later verification, while PDF files underwent a further process. PDF “text” files were automatically converted to MS Word, while PDF “scanned image” files were processed with ABBYY FineReader, an Optical Character Recognition (OCR) program, and converted to MS Word. Finally, all MS Word files were proofread and verified manually to ensure that no conversion problems had arisen. Figure 1 below illustrates the process that every crawled file underwent prior to being aligned.

[1] http://www.collinsdictionary.com/dictionary/english/encoding
[2] http://www.techterms.com/definition/encoding
[3] http://ec.europa.eu/enterprise/tris/index_en.htm
[4] For details please see Parra Escartín (2012).


 

  Figure  1:  File  selection  and  conversion  process  prior  to  alignment  

After all files were considered ready, the German and Spanish file pairs were also verified, and their formatting was checked to ensure that it matched and would not cause problems at the alignment stage. In the next phase, which is still in progress, the MS Word files are aligned using SDL Trados WinAlign, a proprietary program within the Computer Assisted Translation (CAT) tool suite SDL Trados Studio 2009 [5]. WinAlign automatically converts the files to RTF (Rich Text Format). Once the alignment has been manually verified and confirmed, it can be exported as a translation memory in the SDL Trados proprietary format or in the de facto standard format TMX (Translation Memory eXchange) [6]. In the case of TRIS, the translation memories corresponding to each individual file are exported in the SDL Trados proprietary format; they are then merged and converted to TMX, and finally converted to TEI P5. Plain text documents with one sentence per line are also created from the TMX files. These files will subsequently be Part-of-Speech (POS) tagged. Figure 2 illustrates how the original MS Word files are transformed into different formats at the different stages of the corpus compilation.

  Figure  2:  Different  file  encoding  stages  during  the  corpus  compilation  process  

[5] http://www.sdl.com/products/sdl-trados-studio/
[6] The TMX format is explained in Section 4.2.


Finally, it is also worth mentioning that the corpus is to be released in different encoding formats to facilitate its reuse in other research projects. Specifically, the corpus will be released in plain text, POS-tagged text, TMX and TEI P5. This choice is based on several reasons. First of all, as argued by Wynne (2005), it is important to avoid proprietary formats. As he points out:

    If your corpus is made up of files in a format for a commercial wordprocessing program, such as Microsoft Word, then they cannot be processed by most corpus analysis tools. What is more, the format may not be supported indefinitely into the future, and there will come a time when users won’t be able to read the files any more.

Wynne (2005) goes on to argue that encoding a corpus in XML is usually a good choice, since it is not only appropriate for long-term preservation but also ensures the use of Unicode for encoding the text. The TMX and TEI P5 encoding formats are in fact XML markup formats, as we shall see in Section 4. The other two formats in which the corpus is released have been chosen to satisfy the needs of the research project in which the TRIS corpus will first be used. Generic tools often require “raw text” or plain text files to work, and thus I had to produce them for my own research. Additionally, I needed POS-tagged files to run experiments, more specifically files in the TreeTagger [7] format. Providing these two additional formats along with the two standard formats will allow the corpus to be reused without prior conversion.

[7] The TreeTagger is a tool for annotating text with part-of-speech and lemma information developed at the Institute for Computational Linguistics of the University of Stuttgart. More information can be found at its website: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.

3. Character Encoding: The minimal kind of encoding, yet a critical one

Character encoding may be considered the minimal kind of encoding. However, it is crucial, as it determines whether or not a text is displayed appropriately on a user’s computer. McEnery and Xiao (2005) offer an extensive and clear overview of the importance of character encoding for corpus construction, as well as of its evolution over time. As they point out, “character encoding in a corpus must be consistent if the corpus is to be searched reliably”. In fact, something that may seem as simple as character encoding is not trivial. During the compilation of the TRIS corpus several encoding problems arose when manipulating the files in the corpus. This is something that McEnery and Xiao (2005) also mention: “In many cases, however, multiple and often competing encoding systems complicate corpus building, providing a real problem”.

Many efforts have been made over time to ensure the readability and interoperability of character encoding across operating systems. The Unicode standard is the result of these common efforts, and it is nowadays used in many cross-platform applications. It includes three encoding forms: UTF-8, UTF-16 and UTF-32 (Unicode Transformation Format 8 bits, 16 bits and 32 bits respectively). One of its main strengths is that it is 100% backward compatible with ASCII (McEnery and Xiao, 2005). Sasaki (2010) explains the differences between the three encoding forms:

    The most widely used encoding form is UTF-8. If the multilingual corpus contains only Latin based textual data, UTF-8 will lead to a small corpus size, since this data can be represented mostly with sequences of single bytes. If corpus size and bandwidth are no issues, UTF-32 can be used. However, especially for web based corpora, UTF-32 will slow down data access. UTF-16 is for environments which need both efficient access to characters and economical use of storage. Finally, the aspect that an XML processor must be able to process “only” UTF-8 and UTF-16, and not necessarily other encoding forms, should be taken into account when deciding about the appropriate encoding form.
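Sasaki's point about corpus size can be checked with a few lines of Python. This is only a sketch: the function name `encoded_sizes` and the sample word are illustrative assumptions and not part of the TRIS tooling, and the little-endian codec names are used simply so that no byte-order mark is counted.

```python
def encoded_sizes(text):
    """Return the number of bytes needed for `text` in each Unicode encoding form."""
    # The "-le" variants do not prepend a byte-order mark, so only the text is counted.
    codecs = {"UTF-8": "utf-8", "UTF-16": "utf-16-le", "UTF-32": "utf-32-le"}
    return {name: len(text.encode(codec)) for name, codec in codecs.items()}


if __name__ == "__main__":
    # A German word with one non-ASCII character (illustrative sample).
    for name, size in encoded_sizes("Umstände").items():
        print(f"{name}: {size} bytes")
```

For Latin-based text most characters take a single byte in UTF-8, whereas UTF-16 and UTF-32 always spend at least two and exactly four bytes per character respectively.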

From Sasaki's (2010) reasoning it can be concluded that UTF-8 was the right choice for the TRIS corpus: the corpus contains only Latin-based textual data, so there was no need for an encoding form that would imply a larger size, such as UTF-16.

The files of the TRIS corpus were not originally encoded in UTF-8. The translation memory files were produced on a Windows operating system because the software used for alignment (SDL Trados WinAlign) is not available for other operating systems. However, problems arose when the files were manipulated on another operating system (Mac OS), because they had been saved in a single-byte legacy encoding (ISO Latin 1) which is not the default on Mac OS and other operating systems. This problem is easy to overcome by converting the encoding automatically. To ensure the future readability and reusability of the TRIS corpus, the original ISO Latin 1 (also known as ISO 8859-1) encoding produced by SDL Trados WinAlign was converted to UTF-8. This was done using the command displayed in Figure 3, which converts all .txt files in the current directory from ISO-8859-1 to UTF-8. The character encoding conversion was carried out before the aligned files were converted from the Trados proprietary format to the standard TMX format.

    for file in *.txt; do iconv -f ISO-8859-1 -t UTF-8 $file > $file.utf8.txt; done

 

Figure 3: Unix command to automatically convert Latin1 files to UTF-8
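For readers who cannot use a Unix shell, the same conversion can be sketched in Python. This is an illustrative equivalent under stated assumptions, not the procedure used for the TRIS corpus: the function name `convert_to_utf8` and the output suffix `.utf8.txt` are hypothetical, and the files are handled as raw bytes so that nothing other than the character encoding changes.

```python
from pathlib import Path


def convert_to_utf8(directory, source_encoding="iso-8859-1", suffix=".utf8.txt"):
    """Re-encode every .txt file in `directory` as UTF-8 and return the new paths."""
    written = []
    for path in sorted(Path(directory).glob("*.txt")):
        if path.name.endswith(suffix):
            continue  # skip files produced by an earlier run
        # Decode the original bytes, then write the same text back as UTF-8.
        text = path.read_bytes().decode(source_encoding)
        target = path.with_name(path.name + suffix)
        target.write_bytes(text.encode("utf-8"))
        written.append(target)
    return written


if __name__ == "__main__":
    for new_file in convert_to_utf8("."):
        print(new_file)
```

The original files are left untouched, so the conversion can be repeated or checked against the source at any time.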

4. Corpus Markup

As defined in Morrison et al. (2000), markup is “a form of text added to a document to transmit information about both the physical and electronic resource”. I will not discuss here the benefits of using a common and standardized markup framework, as they have already been widely discussed, reasoned and agreed upon. Instead, I will focus on the different standards that are available with regard to corpus markup. In this paper, the term “standard” is not restricted to official standards such as ISO, ETSI or OASIS standards and therefore may also be used to refer to markup formats which are regularly and widely used. This section is divided into three subsections: one in which the markup languages SGML and XML are introduced (4.1), another one in which industrial standards are discussed (4.2) and a final one with a special focus on the linguistic markup of linguistic resources (4.3).

4.1. Brief introduction to markup: SGML and XML

SGML (Standard Generalized Markup Language) and XML (EXtensible Markup Language) are structured markup languages. HTML (Hypertext Markup Language), for example, is a type of SGML used to mark up text and graphics so that the most popular web browsers can interpret them. To identify the markup in a document, both SGML and XML use named elements delimited by angled brackets (“<” and “>”). As explained in Walsh and Muellner (1999), “An essential characteristic of structured markup is that it explicitly distinguishes (and accordingly “marks up” within a document) the structure and semantic content of a document. It does not mark up the way in which the document will appear to the reader, in print or otherwise.” Moreover, the structure of the documents is controlled by either document type definitions (DTDs) or XML schema. A DTD is a set of declarations regarding the structure of a document, and its goal was to retain a level of compatibility with SGML for applications that might want to convert SGML DTDs into XML DTDs. It consists of a list of tag names, specifies their combination rules, and is also used to check that a particular document is appropriately structured.


While SGML was commonly used in the past, there has been a shift towards XML, which is nowadays the more common choice. In fact, all the markup standards discussed in the next subsections have either moved to XML or were conceived in XML from the start.
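The idea of named elements delimited by angled brackets can be illustrated with Python's standard XML parser. The sketch below only checks well-formedness (every opening tag has its matching closing tag) and lists the element names it finds; it does not validate a document against a DTD, which the standard library parser does not do. The function name `element_names` and the sample markup are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def element_names(xml_text):
    """Return the element names in document order, or None if the text is not well-formed XML."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None  # e.g. an element that is opened but never closed
    return [element.tag for element in root.iter()]


if __name__ == "__main__":
    print(element_names("<text><s>Hallo</s><s>Welt</s></text>"))
    print(element_names("<text><s>Hallo</text>"))
```

The markup names the structure (a text made of sentences) and says nothing about how the sentences should be displayed.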

4.2. The Translation Memory eXchange (TMX) and other LISA standards. Industrial Standards entering into Academia and beyond

TMX stands for Translation Memory eXchange. It is an XML format for encoding translation memories which ensures that they can be reused and exchanged among different CAT tools without problems. It was developed by the Localization Industry Standards Association (LISA) and, after having been widely adopted in the industrial sector, it has made its way into the academic and institutional sectors as well. In fact, some of the Language Technology Resources released by the European Commission are in this format. Examples of this are the DGT-Translation Memory [8] and the ECDC-TM, the Translation Memory of the European Centre for Disease Prevention and Control [9]. Its increasing presence as an encoding format has led to the appearance of tools that extract TMX files and convert them to simple UTF-8 .txt files if needed. This is the case of the extract-tmx-corpus tool [10], which is currently used to prepare input files for the Statistical Machine Translation system MOSES [11].

LISA was sadly dissolved in March 2011, but its contributions towards standardization in the localization industry were of great magnitude and some of the standards it developed are still widely used. The body in charge of creating new standards was a specific committee called OSCAR (Open Standards for Container/Content Allowing Reuse). As a result of its work, five community standards were successfully published: the Translation Memory eXchange (TMX) [12], the TermBase eXchange (TBX) [13], the Segmentation Rules eXchange (SRX) [14], the Global information management Metrics eXchange Volume (GMX-V) [15] and the XML Text Memory (xml:tm) [16].

As can be inferred from the previous paragraph, LISA, an industrial initiative to cooperate in and standardize the localization field, was a very important agent as regards standardization. It cooperated with the relevant agents in the field to ensure the success of its proposals: the ISO TC 37 group, OASIS XLIFF and the Open Architecture for XML Authoring and Localization (OAXAL). As stated in the TBX definition (Open Standards for Container/Content allowing Reuse, 2008), TBX, for instance, is actually identical to ISO 30042.

When LISA's dissolution was announced, the European Telecommunications Standards Institute (ETSI) worked together with LISA on a proposal to create a new Industry Specification Group (ISG) for Localisation Industry Standards (LIS), which would ensure the maintenance of the five LISA OSCAR standards mentioned above as well as the cooperation with LISA’s cooperating partners. As stated in Guillemin and Trillaud (2012), “the ETSI is a standardization institute which produces standards from information and communications technology, including fixed, mobile, radio, converged, aeronautical, broadcast and internet technologies and is officially recognized by the European Union as an European Standards Organization. ETSI is an independent, not-for-profit association with more than 700 member companies and organizations, drawn from 62 countries across five continents worldwide, that determine its work program and participate directly in its work”. Guillemin and Trillaud (2012) offer a summarized explanation of LISA’s dissolution and of what was done to ensure the continuity of the standards developed within this professional association.

[8] http://ipsc.jrc.ec.europa.eu/?id=197
[9] http://ipsc.jrc.ec.europa.eu/?id=782
[10] http://code.google.com/p/extract-tmx-corpus/
[11] http://www.statmt.org/moses/
[12] http://www.gala-global.org/oscarStandards/tmx/tmx14b.html
[13] http://www.gala-global.org/oscarStandards/tbx/tbx_oscar.pdf
[14] http://www.gala-global.org/oscarStandards/srx/srx20.html
[15] http://www.gala-global.org/oscarStandards/gmx-v/gmx-v.html
[16] http://www.gala-global.org/oscarStandards/xml-tm/xml-tm.html

As of February 2013, the ETSI has officially released the TMX as ETSI ISG LIS GS Translation Memory eXchange (TMX) [17] and the GMX-V as Global information management Metrics eXchange Volume (GMX-V) [18]. The XML Text Memory (ETSI ISG LIS GS XML Text Memory (xml:tm)) has reached the status of a stable draft [19], and the TBX (ETSI ISG LIS Term-Base eXchange (TBX)) is still an early draft [20], as is the SRX (ETSI ISG LIS Segmentation Rules eXchange (SRX)) [21].

The efforts made to ensure the continuity of the standards despite LISA’s dissolution are proof of the importance that they have acquired for industry, academia and the public sector. TMX and TBX are probably the two standards most closely related to the Natural Language Processing (NLP) field and, as exemplified above, TMX is in fact starting to be used as a standard for the release of new linguistic resources.

Converting the TRIS corpus into TMX

As mentioned in Section 2, the commercial software SDL Trados WinAlign is used for the alignment of the MS Word files. One of the reasons behind this decision is that sentence alignment can be carried out on native MS Word files, so that no format conversion is required prior to alignment. Moreover, the decision was taken for practical reasons: WinAlign saves time at this stage of the process while producing bilingual files either in its own proprietary format or in TMX.

  Figure  4:  The  SDL  Trados  WinAlign  Interface  

[17] http://www.etsi.org/deliver/etsi_gs/LIS/001_099/002/01.04.02_60/gs_LIS002v010402p.pdf
[18] http://www.etsi.org/deliver/etsi_gs/LIS/001_099/004/02.00.00_60/gs_LIS004v020000p.pdf
[19] http://webapp.etsi.org/WorkProgram/Report_Schedule.asp?WKI_ID=37769
[20] http://webapp.etsi.org/WorkProgram/Report_Schedule.asp?WKI_ID=37750
[21] http://webapp.etsi.org/WorkProgram/Report_Schedule.asp?WKI_ID=37767


Figure 4 shows the user interface of WinAlign. As can be seen, the program proposes automatic alignments (dotted lines), and a human validator can correct those alignments, confirm them (continuous line) or reject them (no line at all). The program also allows the user to join or split segments and to edit them if needed. This is very useful, as it is sometimes necessary to join several segments into one. This is the case, for example, when the original German MS Word file contains a list with the verb on a separate line at the end of the list, while in the Spanish translation the verb occurs at the beginning of the list. German grammar requires the verb to come last in certain structures, which is not possible in Spanish.

The editing feature of WinAlign allows the user to edit the text in the segments (e.g. to correct typos not previously detected) and to join or split them so that they are paired with the appropriate sentence in the other language.

Figure 5 illustrates the structure of an aligned segment produced by SDL Trados WinAlign in the .rtf format that the program uses internally. Furthermore, as mentioned earlier, WinAlign also allows the user to export the alignment as a TMX file. One drawback of Trados is that the resulting translation memories (TMs) include a lot of unnecessary formatting information that has to be cleaned before the corresponding files can be exploited further. Another drawback is that, when several TMs are merged into one, the program filters out and deletes all duplicates and does not keep track of the order in which sentences appear in the text. This is because it is a Computer Assisted Translation tool and these details are not relevant for its intended usage.

    <TrU>
    <Quality>100
    <CrU>ALIGN!
    <CrD>26012012, 21:11
    <Seg L=DE-AT>Nach außergewöhnlichen Ereignissen, wie z. B. länger anhaltenden extremen Temperaturen, Hochwasser, Erdbeben, Lawinen- oder Murenabgängen, Rutschungen, Unfällen, Feuer oder Anprall von Fahrzeugen udgl. sind die Wegweiserbrücken gezielt auf die möglichen Auswirkungen der außergewöhnlichen Umstände hin zu besichtigen.
    <Seg L=ES-ES>De haberse producido algún hecho extraordinario, como por ejemplo, temperaturas extremas de muy larga duración, riadas, seísmos, aludes o argayos, corrimientos, accidentes, incendios o impactos de vehículos y similares, se deberán inspeccionar los pórticos para mensajes de carretera específicamente en cuanto a las posibles repercusiones de las circunstancias extraordinarias.
    </TrU>

 

Figure 5: Sample of an aligned segment produced by SDL Trados WinAlign (abbreviated)

To overcome these challenges, another industrial application is used: ApSIC Xbench22. ApSIC Xbench supports several input formats (such as TMX and Trados' proprietary .rtf format) and allows the user to merge several translation memories without removing duplicates and respecting the order in which they appear. Thus, this tool is used to merge all single files into one file per subdomain in the corpus and convert them to TMX. Even though the TMX format is not really necessary for my research project (simple plain monolingual files with one sentence per line would have been enough), I deemed it appropriate to convert the resulting translation memories into TMX, as this has become a standard in our field and would ensure interoperability and reusability in the long run. The TMX files are further processed with a Python script to add additional information to each sentence in the corpus.

22 http://www.apsic.com/en/products_xbench.html
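The two properties that matter in this merging step, keeping duplicates and keeping the original sentence order, can be illustrated with a short Python sketch. It does not reproduce how ApSIC Xbench works internally, which is not described here; it assumes that each aligned file has already been read into a list of (source, target) pairs. The function name merge_to_tmx and the header values are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def merge_to_tmx(memories, srclang="DE-AT", tgtlang="ES-ES"):
    """Concatenate aligned files into one TMX tree.

    `memories` is a list of files, each a list of (source, target) pairs.
    Nothing is filtered or sorted, so duplicates and order are preserved.
    """
    tmx = ET.Element("tmx", {"version": "1.4"})
    # Header values are placeholders for this sketch, not the TRIS values.
    ET.SubElement(tmx, "header", {
        "creationtool": "merge_sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain text",
        "adminlang": "EN-US",
        "srclang": srclang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for memory in memories:
        for source, target in memory:
            tu = ET.SubElement(body, "tu")
            for lang, text in ((srclang, source), (tgtlang, target)):
                tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
                ET.SubElement(tuv, "seg").text = text
    return tmx


file_a = [("Satz eins.", "Frase uno."), ("Satz zwei.", "Frase dos.")]
file_b = [("Satz eins.", "Frase uno.")]  # repeats a segment from file_a
merged = merge_to_tmx([file_a, file_b])
print(ET.tostring(merged, encoding="unicode"))
```

Because the function only appends, the repeated segment from the second file appears again at the end of the body instead of being dropped.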


Figure 6 illustrates the structure of the final TMX files. As can be seen, all TMX documents are divided into a header and a body element. The structure of any TMX document is appropriately described and documented in the TMX definition released by ETSI (Localization Industry Standards (LIS) ETSI Industry Specification Group (ISG), 2013). What follows is a brief summary of the information that can be found there.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx PUBLIC "-//LISA OSCAR:1998//DTD for Translation Memory eXchange//EN" "http://www.ttt.org/oscarstandards/tmx/tmx14.dtd">
<tmx version="1.4">
  <header creationtool="SDL Trados WinAlign 8.3.0.863" creationtoolversion="Edition 8 Build 863" o-tmf="SDL TM8 Format" segtype="sentence" adminlang="EN-US" srclang="DE-AT" datatype="xml" creationdate="June 2012" creationid="Carla Parra, UiB" >
  </header>
  <body>
    <tu tuid="B00Y1999File119990211S7" creationdate="20111114T1935Z" creationid="ALIGN!">
      <tuv xml:lang="DE-AT">
        <seg>Das Fangsystem (beispielhaft in Abb 1 dargestellt) dient dazu, Verbrennungsgase von Feuerstätten mit niedrigen Verbrennungsgastemperaturen (mit diesbezüglichen Sicherheitseinrichtungen) ins Freie zu leiten.</seg>
      </tuv>
      <tuv xml:lang="ES-ES">
        <seg>El sistema de chimenea (representado a título de ejemplo en la figura 1) sirve para conducir al exterior los gases de combustión procedentes de hogares con baja temperatura de los gases de combustión (con los correspondientes dispositivos de seguridad).</seg>
      </tuv>
    </tu>
    <tu tuid="B00Y1999File119990211S8" creationdate="20111114T1935Z" creationid="ALIGN!">
      <tuv xml:lang="DE-AT">
        <seg>Als Feuerstätten kommen zB Brennwertgeräte jeweils mit Gas oder Heizöl extra leicht als Brennstoff in Betracht.</seg>
      </tuv>
      <tuv xml:lang="ES-ES">
        <seg>Los hogares a considerar son por ejemplo los equipos de índice de combustión que empleen como combustible, respectivamente, gas o fuel extra-ligero.</seg>
      </tuv>
    </tu>
    ...
  </body>
</tmx>

Figure 6: Sample from a TMX aligned file (abbreviated)

The header, enclosed within the <header></header> tags, contains the metadata about the document. The body, enclosed within the <body></body> tags, contains all the translation units in the translation memory. The header records the tool with which the translation memory was created and its version (“creationtool” and “creationtoolversion” respectively); the original translation memory format (“o-tmf”); the kind of segmentation used (“segtype”); the default language in which the administrative and informative elements are written (“adminlang”); the source language of the translations included in the translation memory (“srclang”); the type of data (“datatype”); the creation date of that particular translation memory (“creationdate”); and the identifier of the creator of the translation memory (“creationid”).

The body of any translation memory consists of one or more translation unit elements (enclosed within <tu></tu>), which in turn include one or more translation unit variants (enclosed within <tuv></tuv>). In the TRIS corpus, each translation unit consists of two translation unit variants. In addition, every translation unit is described by means of three attributes: “tuid”, “creationdate” and “creationid”. The attribute “tuid” (translation unit identifier) carries most of the information about each sentence. For instance, tuid=“B00Y1999File119990211S7” in Figure 6 stands for the construction domain (B00), year 1999 (Y1999), file name 119990211 (File119990211), sentence 7 (S7). The attribute “creationdate” records the date and time at which the translation unit was created, and “creationid” refers to the creator of the translation unit. Its value usually corresponds to the user ID of the user who created the unit. In order to specify that a translation unit comes from an alignment tool, SDL Trados WinAlign assigns itself as the creator by using the value “ALIGN!”.

The translation unit variant consists of a segment element and the information corresponding to that segment for a given language. The attribute “xml:lang” refers to the language variety used in the segment it contains. Its value must be compliant with RFC 3066 [6]23. Thus, in the case of TRIS, “DE-AT” refers to German (Austria) and “ES-ES” to Spanish (Spain). The text between the <seg></seg> tags is the actual text, and the fact that two translation unit variants are grouped together in a translation unit indicates that one is the translation of the other.
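Since the “tuid” packs four pieces of information into one string, it can be taken apart again mechanically. The following Python sketch shows one way of doing so. The pattern is inferred only from the identifier discussed above, so the assumptions that the year always has four digits and that the sentence number is the final run of digits after the last “S” are mine; the names Tuid and parse_tuid are hypothetical.

```python
import re
from typing import NamedTuple

# Pattern inferred from the identifier B00Y1999File119990211S7 (an assumption).
TUID_PATTERN = re.compile(
    r"^(?P<domain>.+?)Y(?P<year>\d{4})File(?P<file>.+)S(?P<sentence>\d+)$"
)


class Tuid(NamedTuple):
    domain: str
    year: int
    file: str
    sentence: int


def parse_tuid(tuid: str) -> Tuid:
    """Split a TRIS-style translation unit identifier into its components."""
    match = TUID_PATTERN.match(tuid)
    if match is None:
        raise ValueError(f"unexpected tuid format: {tuid!r}")
    return Tuid(
        domain=match["domain"],
        year=int(match["year"]),
        file=match["file"],
        sentence=int(match["sentence"]),
    )


print(parse_tuid("B00Y1999File119990211S7"))
# Tuid(domain='B00', year=1999, file='119990211', sentence=7)
```

Raising an error on identifiers that do not fit the pattern makes malformed units visible early instead of letting them pass silently into later processing steps.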

4.3. Standards  currently  being  fostered  within  the  NLP  field  

Current European initiatives such as Meta-share24 are making major efforts to promote the use of standards and good practices in our field. Since the TRIS corpus is to be released through Meta-Nord, the Meta-share node to which the University of Bergen belongs, their documentation was consulted to decide which standards to use with regard to corpus encoding. As stated in Deliverable 4.1 of the Meta-Nord project25, Metadata descriptions and other interoperability standards, suitable standards for corpus encoding would be TEI or (X)CES (Borin and Lindh, 2011, p. 15). Therefore, I decided that my corpus would use one of these two markup languages to ensure that it would be compliant with current initiatives on standardization, curation and sustainability of Language Resources and Tools (LRTs). The next two subsections (4.3.1 and 4.3.2) briefly explain each of them, while Subsection 4.3.3 discusses which of the two standards (TEI and (X)CES) is more suitable and gives the reasons for the decision taken. Finally, Subsection 4.3.4 provides details about the encoding of the TRIS corpus in TEI P5 format.

4.3.1. The Text Encoding Initiative (TEI)

The Text Encoding Initiative (TEI) is a non-profit organization whose consortium includes members from academia, research projects and individual scholars from around the world. On their website26 they offer extensive documentation about the initiative as well as guidelines and a wide range of materials. Their main goal is to collectively develop and maintain the TEI guidelines for the encoding of texts in digital form. In order to reach a wide audience, their Guidelines are intended for use in the Humanities, Social Sciences and Linguistics, and since 1994 they have been used in a vast number of projects, institutions and resources.

Since their first release, the TEI guidelines have been periodically updated, and feedback from the user community is incorporated to fulfill user needs and requirements. The latest release of the TEI Guidelines for Electronic Text Encoding and Interchange was published in late January 2013 and corresponds to version 2.3.0 of TEI P5. Although the current version is TEI P5, resources encoded in previous versions, such as the TEI P4 format, can still be used without interoperability problems thanks to the usage of the corresponding DTD. An example of a resource encoded in a prior version of the standard but still widely used nowadays is the JRC Acquis (Steinberger et al., 2006), which was released in TEI P4.

23 http://www.ietf.org/rfc/rfc3066.txt
24 http://www.meta-net.eu/meta-share
25 http://www.meta-net.eu/
26 http://www.tei-c.org/index.xml

4.3.2. The XML Corpus Encoding Standard ((X)CES)

Another effort towards the standardization of corpus encoding is the one carried out by the Expert Advisory Group on Language Engineering Standards (EAGLES27). As a result of their work, a first Corpus Encoding Standard (CES)28 was developed. It started as an SGML standard compliant with the specifications of the TEI Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative29. (X)CES stands for XML Corpus Encoding Standard and is a newer version of CES encoded in XML. It is currently more frequently used than CES because XML has become the most widely used markup language.
However, XCES is not only an XML version of CES and, as pointed out by Simões and Fernandes (2011), not all corpora which claim to be encoded in (X)CES are truly encoded in (X)CES but rather in CES encoded in XML: “… some researchers claim they are releasing their corpora in XCES format, but they are just encoding CES in XML, and XCES is more than that.”

4.3.3. TEI and (X)CES. A Comparison

TEI and XCES have become the de facto standards for corpus encoding, and most corpora are in one of the two formats or at least easily convertible to them.

Several papers (Przepiórkowski and Bánski, 2011; Przepiórkowski, 2009; Bánski and Przepiórkowski, 2010; Simões and Fernandes, 2011) refer to TEI as the standard and reference for corpus encoding, and it seems reasonable to consider it for the encoding of newly compiled corpora. For the encoding of TRIS, a comparison between the two standards was made with the aim of determining which seemed best.

The first drawback found in the case of XCES is its lack of documentation, which authors like Przepiórkowski (2009) and Simões and Fernandes (2011) have already pointed out. In fact, not knowing what the encoding should actually look like makes it particularly difficult to encode a corpus from scratch in this format. Przepiórkowski (2009) states this as follows: “http://www.xces.org/ refers to old CES documentation as “supporting general encoding practices for linguistic corpora and tag usage” and “largely relevant to the XCES instantiation”, although the CES documentation is hardly applicable to the second version of XCES”.
In the same paper, Przepiórkowski (2009) also mentions as another reason against XCES “the potential for confusion regarding the version of the standard (in particular, for many years DTD and XML Schema specifications co-existed on XCES web pages, without clear information that they specify different representations)”. The same is pointed out in another paper: “There is a potential for confusion regarding the version of the standard. XCES was derived from TEI version P4, but it has not been updated to TEI P5 so far” (Przepiórkowski and Bánski, 2011). On the XCES website30 it is stated that “XCES is continually under development and future work will include making the XCES compliant with TEI P5”. TEI P5 was released in November 2007 and is updated every six months. The last update of the XCES website31 dates from June 2008. This shows how outdated XCES is and contrasts with the willingness of the TEI community to keep their proposed standard up to date32.

27 http://www.ilc.cnr.it/EAGLES/home.html
28 http://www.cs.vassar.edu/CES/
29 More information about the origins of CES can be found at their website: http://www.cs.vassar.edu/CES/.
30 http://www.xces.org/

On the other hand, a possible drawback of TEI is its extensive documentation: the current version of the guidelines (January 2013) comprises 1641 pages. As Przepiórkowski (2009) points out, “usually there is more than one way of representing any given annotation, so designing a coherent and constrained TEI-conformant schema for linguistic corpora is a daunting task”.

TEI P5 was the standard chosen to encode the TRIS corpus for the reasons given above. Moreover, the active support and the willingness to resolve doubts and make clarifications on the TEI mailing list were also a clear advantage of TEI. Finally, it also seemed the best option with regard to the interoperability and sustainability of a resource under development, since it is periodically reviewed and documented.

XCES is not documented enough and, as mentioned in Subsection 4.3.2, the resources available in XCES are not always truly encoded in XCES but rather represent interpretations (their authors' own XML versions) of the previous CES format, or schemata based on XCES. Deliverable D.2.1 of the Let's MT project offers a good example of this last issue. As Tiedemann and Wijnitz (2010, p.
6) explain, the alignment information of their parallel corpora will be stored “in links between sentences in external files pointing to the appropriate documents using the unique sentence IDs for identification of the aligned segments” and for this they “will use a simple XML format based on the XCES standard”33. If resource developers create new encoding formats based on XCES, they are no longer using the standard, and their resources will therefore encounter interoperability problems in the long run.

4.3.4. The TRIS corpus in TEI P5 format

In this subsection the encoding of the TRIS corpus in TEI P5 will be briefly explained. As described in Sperberg-McQueen and Burnard (2009, p. 139), “a full TEI document combines metadata describing it, represented by a <teiHeader> element, with the document itself, represented by a <text> element”. The <teiCorpus> is a variant defined for the representation of language corpora or collections of texts. It consists of one or more complete <TEI> elements (i.e. elements consisting of a <teiHeader> and a <text> element) and additionally has its own <teiHeader> describing the whole corpus. This allows for a more general description of the corpus as a whole in the <teiHeader> element prefixed to the whole corpus, and a more detailed description of every <TEI> element comprised in the <teiCorpus> in their respective <teiHeader>. Chapter 15 of the TEI P5 Guidelines (Sperberg-McQueen and Burnard, 2009) describes how to encode a corpus. In what follows the encoding of the TRIS corpus is described to exemplify the TEI P5 structure of a teiCorpus.
First of all, it must be pointed out that while it was clear that the <teiCorpus> element should be used, it was also necessary to establish the inner structure of the TRIS corpus as a whole and determine how it would be encoded. The TRIS corpus includes files written in Germany, Austria and Spain, thus originally written in either German or Spanish and translated into the other language. Furthermore, German is represented by two language varieties: that of Austria and that of Germany. The corpus also includes texts from different domains and subdomains and is ordered by year of publication34, from 1999 to 2010. So far, only the texts for a particular domain (Construction) have been released for public usage35, but other domains will be included shortly.

31 The last time this was verified was February 2013.
32 The last TEI P5 release was done in January 2013 and stands for version 2.3.0 of the standard.
33 The emphasis is my own.
34 See Parra Escartín (2012) for detailed information about the texts in the corpus.

When designing the TEI structure it was decided to have a general <teiHeader> for the whole corpus and then have a <TEI> element for every domain and year. This makes it relatively easy to add new files on the fly once they are ready to be added to the corpus and does not prevent the corpus from being released beforehand.

I. The <teiCorpus> header. As explained above, the <teiHeader> element contains information about the corpus as a whole. Every TEI-conformant text must have a header prefixed to it. TEI headers consist of four major parts, of which only the file description is mandatory:

1. A file description (<fileDesc>): “a full bibliographical description of the computer file itself, from which a user of the text could derive a proper bibliographic citation (...)” (Sperberg-McQueen and Burnard, 2009)
2. An encoding description (<encodingDesc>): relates to how the source files were manipulated prior to encoding.
3. A text profile (<profileDesc>): contains classificatory and contextual information about the text.
4. A revision history (<revisionDesc>): contains information about the changes made during the development of the text.

Thus, the TRIS corpus starts as follows:

<teiCorpus version="5.2" xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader xml:lang="en" type="corpus">

Figure 7: Beginning of the <teiCorpus> element and of the header of the TRIS corpus

Here, version refers to the TEI Guidelines version used (5.2) and xmlns is the namespace for the Text Encoding Initiative. Within the <teiHeader> element there are two attributes: the xml:lang attribute, which refers to the language in which the <teiHeader> is written, and type, which indicates the kind of document the header describes.
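The overall layout decided on above (one corpus-level header followed by one <TEI> element per group of files) can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration under my own assumptions: the function name build_corpus_skeleton is hypothetical, the headers are left empty, and the identifier pattern domain_country_year is modelled on the identifiers used in the corpus rather than taken from the actual conversion script.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)  # write TEI as the default namespace


def tei(tag):
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def build_corpus_skeleton(groups):
    """Build <teiCorpus> with one <TEI> per (domain, country, year) group."""
    corpus = ET.Element(tei("teiCorpus"), {"version": "5.2"})
    ET.SubElement(
        corpus, tei("teiHeader"), {f"{{{XML_NS}}}lang": "en", "type": "corpus"}
    )
    for domain, country, year in groups:
        member = ET.SubElement(corpus, tei("TEI"))
        ET.SubElement(member, tei("teiHeader"))  # header specific to this group
        text = ET.SubElement(member, tei("text"))
        text.set(f"{{{XML_NS}}}id", f"{domain}_{country}_{year}")
    return corpus


skeleton = build_corpus_skeleton([("B00", "AUSTRIA", 1999), ("B00", "AUSTRIA", 2000)])
print(ET.tostring(skeleton, encoding="unicode"))
```

Adding the files of a new domain or year then amounts to appending one more <TEI> element, which is what makes the incremental release of the corpus straightforward.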

Figure 8, Figure 9 and Figure 10 display the information provided in the header of the TRIS corpus in TEI. Since the current release is the only one made so far in TEI, there is no <revisionDesc> element yet. As the names and values of the attributes are quite self-explanatory, no further details are given. For further information about the TEI header, see Chapter 2 of the TEI P5 Guidelines (Sperberg-McQueen and Burnard, 2009, pp. 17-53).

35 http://metashare.nb.no/repository/browse/parallel-corpus-of-documents-from-the-technical-regulations-information-system-for-german-spanish-v02/d12552021dcc11e28f61001708556d5a64b9251fd03048ecaf7fe1abdc48a2d1/


<fileDesc>
  <titleStmt>
    <title>Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish (v0.2)</title>
    <funder>EU under FP7, Marie Curie Actions, SP3 People ITN, grant agreement 238405 (project CLARA)</funder>
    <principal>Carla Parra Escartín</principal>
    <respStmt>
      <name>Carla Parra Escartín</name>
      <resp>corpus compilation, processing, encoding and markup</resp>
    </respStmt>
  </titleStmt>
  <editionStmt>
    <edition n="V0.2">Version 0.2 which extends the previous version 0.1 and fixes some formatting errors detected there.</edition>
    <respStmt>
      <resp>Formatting errors in v0.1 corrected and corpus enlarged</resp>
      <name>Carla Parra Escartín</name>
    </respStmt>
  </editionStmt>
  <extent>
    <measureGrp>
      <measure type="files">205</measure>
      <measure type="sentences">70648</measure>
      <measure type="words_DE-AT">638907</measure>
      <measure type="words_ES-ES">923830</measure>
    </measureGrp>
  </extent>
  <publicationStmt>
    <address>
      <addName>Institutt for lingvistiske, litterære og estetiske studier</addName>
      <addrLine>Postboks 7805</addrLine>
      <addrLine>5020 Bergen (Norway)</addrLine>
    </address>
    <date>2012</date>
    <publisher>Universitetet i Bergen</publisher>
    <pubPlace>Bergen, Norway</pubPlace>
    <distributor>Universitetet i Bergen</distributor>
    <availability status="restricted">
      <licence>Under negotiation</licence>
    </availability>
  </publicationStmt>
  <sourceDesc>
    <p>98/34/EC Directive, TRIS Database, European Commission</p>
  </sourceDesc>
</fileDesc>

Figure 8: <fileDesc> element of the TRIS corpus header


 

 

<encodingDesc>
  <projectDesc>
    <p>
      Specialized parallel corpus Spanish-German (ES-ES, DE-AT and DE-DE) texts from the European Commission between 1997-2010.
      The texts are technical regulations in a variety of domains.
      The original files were either in MS Word or PDF format and have been converted to UTF-8 plain text.
      The corpus will be used in a project involving the study of phraseological translation correspondences between German and Spanish.
    </p>
  </projectDesc>
  <samplingDecl>
    <p>
      All texts written in either Austria, Germany or Spain with a corresponding translation into German/Spanish accordingly where crawled.
      Only texts in a readable format have been included in the collection.
      Texts belong to 10 different domains and are further classified in subdomains.
      The current version includes all texts from 1999 to 2010 written in Austria and for the construction domain.
      Images have been omitted and only the text in them has been preserved.
      Tables were converted to plain text.
      Formulae and mathematical expressions have been also omitted.
      All included texts have been aligned at sentence level.
    </p>
  </samplingDecl>
  <editorialDecl>
    <correction>
      <p>
        Orthotypographic errors have been corrected.
        Mismatching sentences have been omitted to ensure that every sentence has an alignment in the other language.
      </p>
    </correction>
    <segmentation>
      <p>
        <gi>s</gi> elements mark orthographic sentences and are numbered sequentially.
      </p>
    </segmentation>
  </editorialDecl>
</encodingDesc>

Figure 9: <encodingDesc> element of the TRIS corpus header

<profileDesc>
  <creation>
    <date when="2011-2012">September 2012</date>
    <rs type="city">Bergen, Norway</rs>
  </creation>
  <langUsage>
    <language ident="DE-AT">German (Austria)</language>
    <language ident="ES-ES">Spanish (Spain)</language>
  </langUsage>
</profileDesc>

Figure 10: <profileDesc> element of the TRIS corpus header

II. The <teiHeader>. After the header for the whole corpus, the teiCorpus structure requires a <TEI> element with its own header describing that particular element of the corpus. This header “inherits” the general characteristics from the upper one in the corpus and thus provides the specific information related to the text being encoded in its <text> element. Attributes and values specified here overwrite the ones in the upper header for this particular component of the corpus. Thus, for instance, the information about the number of files in the text is updated for this particular element, as well as the number of sentences and the number of words per language. Figure 11 shows an example of the header for the files written in Austria in 1999 in the construction domain.

<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>B00_CONSTRUCTION - AUSTRIA - 1999</title>
      <funder>EU under FP7, Marie Curie Actions, SP3 People ITN, grant agreement 238405 (project CLARA)</funder>
      <principal>Carla Parra Escartín</principal>
      <respStmt>
        <name>Carla Parra Escartín</name>
        <resp>corpus compilation, processing, encoding and markup</resp>
      </respStmt>
    </titleStmt>
    <extent>
      <measureGrp>
        <measure type="files">14</measure>
        <measure type="sentences">3899</measure>
        <measure type="words_DE-AT">33698</measure>
        <measure type="words_ES-ES">48239</measure>
      </measureGrp>
    </extent>
    <publicationStmt>
      <address>
        <addName>Institutt for lingvistiske, litterære og estetiske studier</addName>
        <addrLine>Postboks 7805</addrLine>
        <addrLine>5020 Bergen (Norway)</addrLine>
      </address>
      <date>2012</date>
      <publisher>Universitetet i Bergen</publisher>
      <pubPlace>Bergen, Norway</pubPlace>
      <distributor>Universitetet i Bergen</distributor>
      <availability status="restricted">
        <licence>Under negotiation</licence>
      </availability>
    </publicationStmt>
    <sourceDesc>
      <p>98/34/EC Directive, TRIS Database, European Commission</p>
    </sourceDesc>
  </fileDesc>
  <profileDesc>
    <creation>
      <date when="2011-2012">September 2012</date>
      <rs type="city">Bergen, Norway</rs>
    </creation>
    <langUsage>
      <language ident="DE-AT">German (Austria)</language>
      <language ident="ES-ES">Spanish (Spain)</language>
    </langUsage>
  </profileDesc>
</teiHeader>

Figure 11: <teiHeader> element of one of the TEI elements included in the TRIS corpus
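The “inheritance” between the two header levels can be made concrete with a small Python sketch. It assumes a reader who wants the figures that apply to one particular <TEI> element: values found in the lower header replace those of the corpus header, and anything the lower header does not mention falls back to the corpus level. The helper names measures and effective_measures are hypothetical, and for brevity the sample fragments omit the TEI namespace and everything except <extent>.

```python
import xml.etree.ElementTree as ET

# Reduced header fragments; the numbers are those of the two headers shown
# for the whole corpus and for Austria 1999 (construction domain).
CORPUS_HEADER = """<teiHeader><fileDesc><extent><measureGrp>
  <measure type="files">205</measure>
  <measure type="sentences">70648</measure>
</measureGrp></extent></fileDesc></teiHeader>"""

TEXT_HEADER = """<teiHeader><fileDesc><extent><measureGrp>
  <measure type="files">14</measure>
  <measure type="sentences">3899</measure>
</measureGrp></extent></fileDesc></teiHeader>"""


def measures(header):
    """Map each <measure type="..."> in a header to its integer value."""
    return {m.get("type"): int(m.text) for m in header.iter("measure")}


def effective_measures(corpus_header, text_header):
    """Start from the corpus-level values and let the lower header override."""
    merged = measures(corpus_header)
    merged.update(measures(text_header))
    return merged


result = effective_measures(ET.fromstring(CORPUS_HEADER), ET.fromstring(TEXT_HEADER))
print(result)
```

For real TRIS files the element names would have to be qualified with the TEI namespace, which this reduced example leaves out.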

 

III. The <text>. The <text> element is where the actual corpus is stored. When it is created, a unique id is assigned to it to enable future referencing, extraction and usage according to user needs. This id includes information about the domain covered in the group of files, the country of origin (where the files were written) and the year in which they were written. Then, in the case of TRIS, it is further subdivided into single files grouped in a <group> element, which includes all files in the corpus in the form of individual <text> elements (i.e. there are as many <text> elements as there are files in the corpus). Each individual file is also assigned a unique id which includes all the information related to the domain, the year and the name of the file in the EC database from which the files were retrieved. Since every file has been sentence aligned and is presented in two different languages, the element <div> is used to divide the text between the source language and the target language. A final link group (<linkGrp>) is included in which the sentence alignment information is given by assigning to each sentence in the source language the corresponding sentence in the target language by means of their unique ids. In order to make the alignments, each sentence is assigned a unique id in which the source language and the sentence number are specified. Figure 12 displays a shortened text of the TRIS corpus encoded in TEI P5 to illustrate the usage of the different elements described above.
    <text xml:id="B00_AUSTRIA_1999">
      <group>
        <text xml:id="B00Y1999File119990211">
          <body>
            <div xml:id="B00Y1999File119990211_DE-AT" xml:lang="DE-AT" type="source">
              <p xml:id="B00Y1999File119990211S1_DE-AT">Magistrat der Stadt Wien</p>
              <p xml:id="B00Y1999File119990211S2_DE-AT">Magistratsabteilung 351200 Wien, Dresdner Straße 75</p>
              <p xml:id="B00Y1999File119990211S3_DE-AT">Verordnung des Magistrates der Stadt Wien über die bis zum … befristete Zulassung des Fangsystems „KAMINODUR GAS“.</p>
              ...
            </div>
            <div xml:id="B00Y1999File119990211_ES-ES" xml:lang="ES-ES" type="translation">
              <p xml:id="B00Y1999File119990211S1_ES-ES">Gobierno de la Ciudad de Viena</p>
              <p xml:id="B00Y1999File119990211S2_ES-ES">Sección de Gobierno 351200 Viena, Dredner Strasse 75</p>
              <p xml:id="B00Y1999File119990211S3_ES-ES">Reglamento del Gobierno de la Ciudad de Viena relativo a la homologación temporal hasta el ... del Sistema de chimeneas "KAMINODUR GAS".</p>
              …
            </div>
            <linkGrp type="alignment" domains="#B00Y1999File119990211_DE-AT #B00Y1999File119990211_ES-ES">
              <link target="#B00Y1999File119990211S1_DE-AT #B00Y1999File119990211S1_ES-ES" />
              <link target="#B00Y1999File119990211S2_DE-AT #B00Y1999File119990211S2_ES-ES" />
              <link target="#B00Y1999File119990211S3_DE-AT #B00Y1999File119990211S3_ES-ES" />
            </linkGrp>
          </body>
        </text>
        <text> ... </text>
      </group>
    </text>

 

Figure 12: Shortened sample of a <text> element in the TRIS corpus
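The structure illustrated in Figure 12 lends itself to automatic generation from aligned sentence pairs. The following Python sketch reads translation units from a TMX document and serializes them as a <text> element with two <div> elements and a <linkGrp>. It is a minimal illustration under stated assumptions, not the script used for TRIS: the function names `read_tmx_pairs` and `build_text`, the file id `File1` and the heavily simplified TMX sample are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for xml:id and xml:lang attributes.
XML = "{http://www.w3.org/XML/1998/namespace}"

# Simplified TMX sample (a real TMX header carries mandatory attributes).
SAMPLE_TMX = """<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="DE-AT"><seg>Magistrat der Stadt Wien</seg></tuv>
  <tuv xml:lang="ES-ES"><seg>Gobierno de la Ciudad de Viena</seg></tuv>
</tu>
</body></tmx>"""


def read_tmx_pairs(tmx_source, src_lang, tgt_lang):
    """Return (source, target) sentence pairs from a TMX string."""
    root = ET.fromstring(tmx_source)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv.get(XML + "lang")] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


def build_text(file_id, src_lang, tgt_lang, pairs):
    """Build a <text> element with one <div> per language and a <linkGrp>."""
    text = ET.Element("text", {XML + "id": file_id})
    body = ET.SubElement(text, "body")
    for lang, div_type, column in ((src_lang, "source", 0), (tgt_lang, "translation", 1)):
        div = ET.SubElement(
            body, "div",
            {XML + "id": f"{file_id}_{lang}", XML + "lang": lang, "type": div_type},
        )
        for n, pair in enumerate(pairs, start=1):
            p = ET.SubElement(div, "p", {XML + "id": f"{file_id}S{n}_{lang}"})
            p.text = pair[column]
    link_grp = ET.SubElement(
        body, "linkGrp",
        {"type": "alignment", "domains": f"#{file_id}_{src_lang} #{file_id}_{tgt_lang}"},
    )
    for n in range(1, len(pairs) + 1):
        ET.SubElement(
            link_grp, "link",
            {"target": f"#{file_id}S{n}_{src_lang} #{file_id}S{n}_{tgt_lang}"},
        )
    return text


if __name__ == "__main__":
    pairs = read_tmx_pairs(SAMPLE_TMX, "DE-AT", "ES-ES")
    print(ET.tostring(build_text("File1", "DE-AT", "ES-ES", pairs), encoding="unicode"))
```

Deriving every sentence id from the file id, the sentence number and the language code keeps the <link> targets consistent with the <p> elements they point to.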

IV. Automatically converting the TMX files to TEI P5. Encoding a corpus like TRIS in TEI P5 manually would be an error-prone and tedious task. For this reason, a simple Python script that automatically processes the TMX files was written. This script reads the TMX file to be encoded in TEI P5, stores in different variables the information needed for the different values to be assigned, processes the files, and produces an XML file with the TEI structure explained above.

5. Conclusion

In this paper several issues with regard to the encoding of a corpus have been tackled. Specifically, encoding formats (UTF-8 for character encoding and XCES, TEI and TMX for corpus encoding) have been discussed. The TRIS corpus has been used as an example to illustrate the stages of compiling a corpus and the points at which encoding plays a particular role. Due to space restrictions it has not been possible to discuss standards for Part-Of-Speech (POS) tagging and their integration in the overall structure. This would be the next step, but in the case of the TRIS corpus it has been decided to release the output files of the POS tagger used (the TreeTagger POS tagger) separately. Thus, POS tags will not be integrated into the overall TEI P5 encoding, although this could easily be done if desired.

When the compilation of the TRIS corpus began, it was my desire as a resource developer to produce a reusable and interoperable resource. To this end it was crucial to take into consideration issues such as the encoding of the corpus discussed in this paper. Moreover, these kinds of issues should be taken into account during the corpus compilation planning phase and prior to publicly releasing the corpus, because otherwise there is a risk of creating a resource which is not useful for the community. Such planning also spares other researchers interested in using a newly compiled corpus from having to convert and adapt it before actually using it for their own research purposes. As a resource developer, I deemed it important to study the different available possibilities and decide which was the best option both for my needs and for releasing a resource which is valuable for the NLP community as a whole, regardless of whether the interested parties belong to academia, industry or both.

Last, but not least, I would like to highlight the importance of documenting the compilation process of any linguistic resource. This documentation is not only valuable for future reference to the resource itself, but also as guidance for future resource developers looking for solutions to the challenges they face or for the strategies followed in previously developed resources. I hope that this paper serves future corpus developers as a reference on how to encode a corpus and on the different types of encoding that they may have to take into account.

6. Acknowledgements

The research reported in this paper has received funding from the EU under FP7, Marie Curie Actions, SP3 People ITN, grant agreement 238405 (project CLARA36).

36 http://clara.uib.no


References

Bánski, P. and A. Przepiórkowski (2010). TEI P5 as a Text Encoding Standard for Multilevel Corpus Annotation. In Proceedings of Digital Humanities 2010, London.

Borin, L. and J. Lindh (2011). Deliverable D4.1: Metadata descriptions and other interoperability standards. Version 1.0, 2011-05-02. Deliverable in the METANORD project (CIP 270899).

Guillemin, P. and S. Trillaud (2012, April/May). What has become of LISA’s OSCAR standards? Multilingual, 38–41.

Localization Industry Standards (LIS) ETSI Industry Specification Group (ISG) (2013). ETSI GS LIS 002 V1.4.2 (2013-02): Localization Industry Standards (LIS); Translation Memory eXchange (TMX).

McEnery, A. and R. Xiao (2005). Developing Linguistic Corpora: a Guide to Good Practice, Chapter 4: Character encoding in corpus construction. AHDS Guides to Good Practice.

Morrison, A., M. Popham, and K. Wikander (2000). Creating and Documenting Electronic Texts: A Guide to Good Practice, Chapter 4: Markup: The key to reusability. AHDS Guides to Good Practice. Arts and Humanities Data Service.

Open Standards for Container/Content Allowing Reuse (OSCAR) (2008). Systems to manage terminology, knowledge, and content: TermBase eXchange (TBX).

Parra Escartín, C. (2012, May). Design and compilation of a specialized Spanish-German parallel corpus. In N. Calzolari, K. Choukri, T. Declerck, M. Uğur Doğan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis (Eds.), Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).

Przepiórkowski, A. (2009). TEI P5 as an XML standard for treebank encoding. In M. Passarotti, A. Przepiórkowski, S. Raynaud, and F. Van Eynde (Eds.), Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories (TLT 8), Milan, Italy, pp. 149–160.

Przepiórkowski, A. and P. Bánski (2011). Which XML standards for multilevel corpus annotation? In Proceedings of the 4th conference on Human Language Technology: Challenges for computer science and linguistics, LTC’09, Berlin, Heidelberg, pp. 400–411. Springer-Verlag.

Sasaki, F. (2010, June). Linguistic Modeling of Information and Markup Languages. Contributions to Language Technology, Volume 40 of Text, Speech and Language Technology, Chapter 4: Markup Languages and Internationalization, pp. 67–80. Springer-Verlag.

Simões, A. and S. Fernandes (2011). XML schemas for parallel corpora. In XATA 2011: XML, associated technologies and applications, Vila do Conde, Portugal, pp. 59–69.

Sperberg-McQueen, M. and L. Burnard (2009, February). TEI P5: Guidelines for electronic text encoding and interchange. Technical report, The TEI Consortium.

Steinberger, R., B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, and D. Tufiş (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’2006), pp. 2142–2147.

Tiedemann, J. and P. Wijnitz (2010, August). Deliverable D.2.1: Specification of data formats. Deliverable in the Let’s MT project.

Walsh, N. and L. Muellner (1999, October). DocBook: The Definitive Guide. O’Reilly & Associates, Inc.

Wynne, M. (2005). Developing Linguistic Corpora: a Guide to Good Practice, Chapter 6: Archiving, distribution and preservation. AHDS Guides to Good Practice.
