Building a Parallel Corpus for Too Many Languages

Alexandr Rosen [email protected]

Institute of Theoretical and Computational Linguistics
Faculty of Arts and Philosophy, Charles University, Prague, Czech Republic

Arona, 27 September 2005

1. InterCorp — a parallel corpus project for 26+ languages
2. Options for sentence alignment
3. Comparison of alignment tools
4. Conclusions

1. The project InterCorp

https://trnka.ff.cuni.cz/ucnk/intercorp/

– sorry, so far only in Czech (česky)

Participants (mostly at Charles University's Faculty of Arts and Philosophy):
• Foreign language departments
• Institute of the Czech National Corpus
• Institute of Theoretical and Computational Linguistics

Other projects involving many languages:
• OPUS – Tiedemann & Nygaard (2004) (http://logos.uio.no/opus/)
• JRC's Acquis Communautaire
• ...

Initial state:
• Some participants already have parallel corpora of various sizes and designs
• They use ParaConc as the segmentation, alignment and search tool (http://www.athel.com/para.html)

Goals:
• 25 parallel subcorpora
• Czech is always L1, the pivot language
• L2 includes Arabic, Chinese and Japanese
• Balance: mainly fiction, with a preference for Czech originals
• Size: 400K words (the absolute minimum for the less common languages)
• Alignment: by sentences, as complete and correct as possible

For:
• Comparative studies
• Teaching
• Lexicography (including term extraction)
• Extraction of translation memories
• Translators
• General public
• ...
(The order is partially significant.)

A challenge: to reconcile distributed pre-processing with the need to store, maintain and access everything in one place.
• With a large number of diverse languages, a distributed setting is required, as experts in the specific languages are necessarily involved in the pre-processing phase.
• At the same time, common procedures, tools and text formats are needed for the results to be integrated into a single corpus.

Two stages:
• Initial stage: distributed pre-processing, local use
• Final stage: single shared resource

Workflow for adding new texts:
• Text acquisition (from CNC archives, from other electronic sources, or as a hard copy)
• Scanning + OCR: paper → MS Word
• Proofreading of scanned texts
• Format conversion: MS Word → tagged text
• Segmentation (paragraph and sentence boundaries)
• Sentence alignment
• Alignment checking
• Post-alignment processing (extracting stand-off alignment info; see the sketch below)
• Making the result searchable by a corpus manager
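The post-alignment step deserves a brief illustration. In a stand-off arrangement each language stays in its own file and the alignment is stored separately as pairs of segment IDs. A minimal sketch follows; the ID scheme and the tab-separated format are invented for the example, not the project's actual conventions:

```python
import csv

# aligner output: each bead pairs Czech sentence indices with L2 indices
beads = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2, 3))]

with open("cs-en.align.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["czech_ids", "l2_ids", "type"])
    for cz, l2 in beads:
        writer.writerow([" ".join(f"cs:s{i}" for i in cz),   # hypothetical IDs
                         " ".join(f"en:s{j}" for j in l2),
                         f"{len(cz)}:{len(l2)}"])             # e.g. "2:1"
```

Keeping the links outside the texts means the same Czech file can serve every language pair, which matches the requirement below that a Czech text have one shape for all pairs.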

Important:
• Texts are catalogued and their status is updated in a central database.
• A Czech text has the same shape for all language pairs.

Tools for processing new texts:
• FineReader for OCR
• An MS Word macro for conversion to tagged text
• ParaConc for segmentation of foreign texts and for sentence alignment
• Segmenter for Czech (a toy illustration of this step follows below)
• Format conversion routines for post-alignment processing
• More sophisticated automatic aligners to reduce the error rate?
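For illustration only, a toy segmenter in the spirit of the segmentation step. The actual Czech Segmenter and ParaConc use far more careful rules (Czech abbreviations, ordinal numbers written with a period, direct speech, and so on), and the abbreviation list here is a made-up fragment:

```python
import re

ABBREV = {"Dr", "Mr", "str", "č", "tzv", "např"}   # made-up fragment

def naive_segment(paragraph):
    """Split a paragraph into sentences at ., ! or ? followed by space,
    unless the preceding word is a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]+\s+", paragraph):
        words = paragraph[start:m.start()].split()
        if words and words[-1].rstrip(".") in ABBREV:
            continue                        # abbreviation, not a boundary
        sentences.append(paragraph[start:m.end()].strip())
        start = m.end()
    rest = paragraph[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

print(naive_segment("Pan Dr. Novák přišel. Bylo pozdě!"))
# ['Pan Dr. Novák přišel.', 'Bylo pozdě!']
```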

2. Alignment

A parallel corpus is only as good as its alignment. Good results of automatic alignment can save manual work. Is there a single best all-purpose approach to sentence alignment? NO! — at least according to previous evaluations of sentence aligners:
• Langlais et al. (1998)
• Véronis & Langlais (2000)
• Singh & Husain (2005)

The choice should be based on the properties of the text pair and on the requirements put on the result:
• structural distance between the two texts (free or literal translation)
• amount of noise (omissions, differences in segmentation)
• typological distance between the two languages
• size of the texts
• acceptable error rate
• acceptable amount of manual checking

Which (mix of) tools and procedures would be best suited to a specific text type and language pair, with minimum manual checking and the goal of a near-to-perfect result? With no single text type and a diverse set of languages, there is (probably) no universal solution.

3. Comparison

Three aligners were tested, some with additional resources:

Van – Gale & Church (1993) – matches sentences by their lengths (counted in characters); the texts should be previously aligned by paragraphs; fast and language-universal (see the sketch after these descriptions). http://nl.ijs.si/telri/Vanilla/

GMA – Melamed (1997) – uses cognates (punctuation, numbers, similar words) to find word-to-word and sentence-to-sentence correspondences. http://nlp.cs.nyu.edu/GMA/

GMA+ – same as GMA, with a 106K-entry bilingual lexicon

MS – Moore (2002) – combines length-based pre-alignment with a stochastic method to derive a bilingual lexicon, which is subsequently used to align the sentences; proposes 1:1 links only. http://research.microsoft.com/research/downloads/default.aspx

MS* – same as MS, with some initial and final word segments stripped

MS+ – same as MS, with more input data (a 106K-entry bilingual dictionary and an English-Czech pre-aligned corpus of 830K/731K words)
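To make the comparison concrete, here is a minimal sketch of the length-based dynamic programming behind Van. It is not the Vanilla code: the bead-type priors and the variance constant are taken from Gale & Church (1993), the expected character-length ratio is simplified to 1, the two-tailed probability is computed with the error function, and each call is assumed to handle one pre-aligned paragraph pair.

```python
import math

# Bead-type priors, roughly the values reported by Gale & Church (1993).
PATTERNS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
            (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def length_cost(l1, l2):
    """-log probability that chunks of l1 and l2 characters are mutual
    translations, under a simplified Gale & Church length model."""
    if l1 == 0 and l2 == 0:
        return 0.0
    s2 = 6.8  # variance constant from the paper; expected ratio fixed at 1
    delta = (l2 - l1) / math.sqrt((l1 + l2) / 2 * s2)
    # two-tailed probability of a length deviation at least this large
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2))))
    return -math.log(max(p, 1e-12))

def align(src, tgt):
    """Align two lists of sentences (one paragraph pair) by dynamic
    programming; returns beads as pairs of sentence-index tuples."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for (di, dj), prior in PATTERNS.items():
                if i + di > n or j + dj > m:
                    continue
                l1 = sum(map(len, src[i:i + di]))
                l2 = sum(map(len, tgt[j:j + dj]))
                c = cost[i][j] + length_cost(l1, l2) - math.log(prior)
                if c < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = c
                    back[i + di][j + dj] = (di, dj)
    beads, i, j = [], n, m          # trace the best path back
    while i or j:
        di, dj = back[i][j]
        beads.append((tuple(range(i - di, i)), tuple(range(j - dj, j))))
        i, j = i - di, j - dj
    return beads[::-1]

print(align(["Aaa bbb ccc.", "Dd."], ["Aaa bbb ccc ddd.", "Ee ff."]))
# [((0,), (0,)), ((1,), (1,))]
```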

Texts used for testing:

AC – 46+46 documents from the English-Czech part of the Acquis Communautaire (roughly 1% of it); all noise was retained (omissions, results of different segmentation rules); segments = paragraphs
1984 – George Orwell's novel, in English and Czech (a result of the Multext-East project)
FR7 – seven French fiction/essay books together with their Czech translations
FR1 – the first of FR7, used in a test of GMA (for technical reasons)

Results were compared with a hand-corrected alignment of the full texts:

Text   Cz words   L2 words   Cz segments   L2 segments   All links   1:1 links
AC       62,010     74,986         3,025         2,699       2,685         89%
1984     99,099    121,661         6,756         6,741       6,657         97%
FR7     289,003    337,226        21,936        21,746      21,207         95%
FR1      56,114     60,169         3,602         3,553       3,524         97%

Measures used:

recall = correct links / reference links
precision = correct links / test links
F-measure = 2 × recall × precision / (recall + precision)
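A minimal sketch of how these measures can be computed, assuming links are represented as pairs of sentence-index tuples; the representation and the sample data are illustrative, not the project's actual format:

```python
def evaluate(reference, test):
    """Recall, precision and F-measure over sets of alignment links."""
    reference, test = set(reference), set(test)
    correct = reference & test    # links proposed and present in the gold standard
    recall = len(correct) / len(reference)
    precision = len(correct) / len(test)
    f = 2 * recall * precision / (recall + precision) if correct else 0.0
    return recall, precision, f

# toy example: a 1:2 link counts as wrong unless reproduced exactly
gold = {((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))}
hypo = {((0,), (0,)), ((1,), (1,)), ((2,), (3,))}
print(evaluate(gold, hypo))   # (0.667, 0.667, 0.667), rounded
```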

All links:

Text  Aligner  Reference   Test  Correct  Recall  Precision  F-measure
AC    Van           2700   2683     2225   0.824      0.829      0.827
AC    GMA+          2700   2686     2492   0.923      0.928      0.925
AC    MS            2700   2313     2218   0.821      0.959      0.885
AC    MS+           2700   2375     2308   0.855      0.972      0.910
1984  Van           6657   6633     6446   0.968      0.972      0.970
1984  GMA+          6657   6606     6287   0.944      0.952      0.948
1984  MS            6657   6167     6110   0.918      0.991      0.953
1984  MS*           6657   6370     6320   0.949      0.992      0.970
1984  MS+           6657   6441     6402   0.962      0.994      0.978
FR7   Van          21207  20868    19427   0.916      0.931      0.923
FR7   MS           21207  19512    18801   0.887      0.964      0.923
FR7   Mmd          21207  21057    16161   0.762      0.767      0.764

1:1 links only:

Text  Aligner  Reference   Test  Correct  Recall  Precision  F-measure
AC    Van           2391   2248     2156   0.902      0.959      0.930
AC    GMA+          2391   2354     2304   0.964      0.979      0.971
AC    MS            2391   2313     2218   0.928      0.959      0.943
AC    MS+           2391   2375     2308   0.965      0.972      0.969
1984  Van           6440   6438     6274   0.974      0.975      0.974
1984  GMA+          6440   6301     6287   0.976      0.998      0.987
1984  MS            6440   6167     6110   0.949      0.991      0.969
1984  MS*           6440   6370     6320   0.981      0.992      0.987
1984  MS+           6440   6441     6402   0.994      0.994      0.994
FR7   Van          20116  19220    19427   0.926      0.969      0.947
FR7   MS           20116  19512    18801   0.935      0.964      0.949
FR7   Mmd          20116  19714    15539   0.772      0.788      0.780

FR1 (GMA):

Links  Reference   Test  Correct  Recall  Precision  F-measure
n:n         3524   3513     3406   0.967      0.970      0.968
1:1         3402   3391     3327   0.978      0.981      0.980

Observations:

F-measure (all pairs):

Rank  AC    1984      FR7
1.    GMA+  MS+       Van, MS
2.    MS+   Van, MS*  Mmd
3.    MS    MS
4.    Van   GMA+

Precision (all pairs):

Rank  AC    1984  FR7
1.    MS+   MS+   MS
2.    MS    MS*   Van
3.    GMA+  MS    Mmd
4.    Van   Van
5.          GMA+

The picture is similar for 1:1 pairs, except for recall (of course...).

• On noisy texts, GMA and MS fare better than Van.
• On well-behaved texts, MS and GMA tend to show higher precision.
• Van performed surprisingly well on FR7 without paragraph boundaries (the hard region was a whole book).
• GMA performed surprisingly well on FR1 without a bilingual lexicon.
• MS can be expected to gain further points with more input data and, possibly, lemmatization.
• GMA, too, may profit from lemmatization creating more cognates.

4. Conclusions

• Lexically-based methods should help to minimize human checking.
• A lack of linguistic resources (bilingual lexica) need not be an obstacle to applying lexically-based methods.
• The higher precision of lexically-based methods, together with a restriction to 1:1 links, could let the human proofreader focus on the links where good results are less likely (a sketch of such triage follows below).
• Manual checking of alignment results can be done more efficiently with an automatic alignment method that prefers higher precision to better recall.
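As a closing illustration of the last two points, a sketch of the triage they suggest: accept a high-precision aligner's 1:1 links automatically and queue everything else, where good results are less likely, for the proofreader. The bead format is the same illustrative one used in the earlier sketches.

```python
def split_for_checking(beads):
    """Separate 1:1 links (auto-accepted) from the rest (manual check)."""
    auto, manual = [], []
    for cz, l2 in beads:
        (auto if len(cz) == 1 and len(l2) == 1 else manual).append((cz, l2))
    return auto, manual

beads = [((0,), (0,)), ((1, 2), (1,)), ((3,), ())]   # 1:1, 2:1, 1:0
auto, manual = split_for_checking(beads)
print(len(auto), "accepted,", len(manual), "for manual checking")
# 1 accepted, 2 for manual checking
```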

References

Gale, W. A. & Church, K. W. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1), 75–102.

Langlais, P., Simard, M., & Véronis, J. (1998). Methods and practical issues in evaluating alignment techniques. In Proceedings of the 17th International Conference on Computational Linguistics, pages 711–717. Association for Computational Linguistics.

Melamed, I. D. (1997). A portable algorithm for mapping bitext correspondence. In P. R. Cohen and W. Wahlster, editors, Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 305–312, Somerset, New Jersey. Association for Computational Linguistics.

Moore, R. C. (2002). Fast and accurate sentence alignment of bilingual corpora. In AMTA '02: Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users, pages 135–144, London, UK. Springer-Verlag.

Singh, A. K. & Husain, S. (2005). Comparison, selection and use of sentence alignment algorithms for new language pairs. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 99–106, Ann Arbor, Michigan. Association for Computational Linguistics.

Tiedemann, J. & Nygaard, L. (2004). The OPUS corpus – parallel & free. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal.

Véronis, J. & Langlais, P. (2000). Evaluation of parallel text alignment systems: The ARCADE project. In J. Véronis, editor, Parallel Text Processing: Alignment and Use of Translation Corpora, pages 369–388. Kluwer Academic Publishers, Dordrecht.

Typeset on 10th October 2005, at 11:22.
