Domain Adaptation in Statistical Machine Translation Using Comparable Corpora: Case Study for English-Latvian IT Localisation
MĀRCIS PINNIS, INGUNA SKADIŅA AND ANDREJS VASIĻJEVS
{marcis.pinnis;inguna.skadina;andrejs}@tilde.lv
CICLING 2013, MARCH 29, 2013
Challenges of Data Driven MT
- The lack of sufficiently large parallel corpora limits the building of reasonably good quality machine translation (MT) solutions for less-resourced languages
- For many languages only a few parallel corpora of reasonable size are available
- Statistical machine translation (SMT) systems trained on parallel corpora perform well on texts from the same domain, but are almost unusable for other domains
Introduction
Goals of the Paper
To show that, for language pairs and domains where there is not enough parallel data available:
- in-domain comparable corpora can be used to increase translation quality
- if comparable corpora are large enough and can be classified as strongly comparable, then the trained SMT systems applied in the localisation process can lead to increased human translator productivity
Comparable Corpora in Machine Translation
The comparable corpus is a relatively recent concept in MT:
- methods for using parallel corpora in MT are well studied (e.g., Koehn)
- comparable corpora in MT have not been thoroughly investigated
The latest research has shown that:
- parallel data extracted from comparable corpora improves SMT performance by reducing the number of untranslated words (Hewavitharana and Vogel, 2008)
- language pairs and domains with little parallel data can benefit from the use of comparable corpora (Munteanu and Marcu, 2005; Lu et al., 2010; Abdul-Rauf and Schwenk, 2009 and 2011)
Comparable Corpora in Machine Translation (cont.)
Most experiments have been performed with widely used language pairs, such as French-English, Arabic-English or German-English
For under-resourced languages (e.g., Latvian), the exploitation of comparable corpora for machine translation tasks is less studied (e.g., ACCURAT, TTC)
Briefly about Latvian
Latvian belongs to the Baltic language group of the Indo-European language family:
- a morphologically rich language with a rather free word order
- fewer than 2.5 million speakers worldwide
Few bi/multilingual parallel corpora exist; among them the largest are JRC-Acquis*, DGT-TM** and Opus***
* Steinberger et al., 2006; available at: http://www.statmt.org/europarl
** Steinberger et al., 2012; available at: http://langtech.jrc.it/DGT-TM.html
*** Tiedemann, 2009; available at: http://opus.lingfil.uu.se
Collecting and Processing Comparable Corpus
- Collection of Comparable Corpora
- Extraction of Semi-parallel Sentence Pairs
- Extraction of Bilingual Term Pairs
- Baseline System Training
- Domain Adaptation
- SMT System Evaluation
English-Latvian IT Domain Comparable Corpus
Artificially created to simulate a strongly comparable corpus, composed by two strategies:
1st part: comparable corpus of different versions of software manuals
- large documents split into chunks of 100 paragraphs
- aligned at document level with DictMetric*

| EN doc. | LV doc. | Aligned doc. pairs | Aligned doc. pairs (filtered) |
| 5,200   | 5,236   | 363,076            | 23,399                        |

* Su & Babych, 2012; available through the ACCURAT Toolkit (www.accurat-project.eu)
English-Latvian IT Domain Comparable Corpus (cont.)
2nd part: comparable corpus of Web-crawled documents combined with parallel software manuals

|         | Documents | Unique sentences | Tokens in unique sentences |
| English | 22,498    | 1,316,764        | 16,927,452                 |
| Latvian | 22,498    | 1,215,019        | 13,036,066                 |
Extraction of Semi-parallel Sentence Pairs
The corpus was pre-processed: broken into sentences and tokenised
In-domain semi-parallel sentence pairs were extracted with LEXACC*
Due to the different distribution of comparable data within the two corpus parts, different thresholds were applied in the extraction process

| Corpus part | Threshold | Unique sentence pairs |
| First part  | 0.6       | 9,720                 |
| Second part | 0.35      | 561,994               |
* Ştefănescu et al., 2012; available through the ACCURAT Toolkit (www.accurat-project.eu)
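The threshold step can be sketched as follows. This is a minimal illustration, not LEXACC itself: LEXACC assigns each candidate sentence pair a parallelism score, and only pairs at or above a per-corpus-part threshold are kept (0.6 for the first part, 0.35 for the second). The sentences and scores below are invented for illustration.

```python
def filter_semi_parallel(scored_pairs, threshold):
    """scored_pairs: iterable of (en_sentence, lv_sentence, score)."""
    # Keep only pairs whose parallelism score reaches the threshold.
    return [(en, lv) for en, lv, score in scored_pairs if score >= threshold]

candidates = [
    ("Click the OK button.", "Noklikšķiniet uz pogas Labi.", 0.82),
    ("Save the file.", "Fails ir bojāts.", 0.10),
]
# With the first-part threshold of 0.6, only the first pair survives.
print(filter_semi_parallel(candidates, threshold=0.6))
```

A lower threshold (0.35 for the second corpus part) trades precision for recall, which matches the much larger pair count extracted from that part.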
Extraction of Bilingual Term Pairs
Terms were monolingually tagged in the corpus using Tilde's Wrapper System for CollTerm (TWSC)*
In-domain bilingual term pairs were extracted with TerminologyAligner (TEA)*
The highest-ranked translation equivalent for each Latvian term was kept

| Corpus part | Unique monolingual terms (English) | Unique monolingual terms (Latvian) | Mapped term pairs (total) | Mapped term pairs (filtered) |
| First part  | 127,416 | 271,427   | 847   | 689   |
| Second part | 415,401 | 2,566,891 | 3,501 | 3,393 |

* Pinnis et al., 2012; available through the ACCURAT Toolkit (www.accurat-project.eu)
Building SMT Systems
Training SMT Systems
Three English-Latvian systems were trained on the LetsMT!* infrastructure:
- the baseline system
- the intermediate adapted system
- the adapted system
All systems were tuned and automatically evaluated on 1,837 and 926 unique IT-domain sentence pairs respectively
* Vasiļjevs et al., 2012; available at: www.letsmt.eu
Baseline SMT System
Trained on relatively large publicly available parallel corpora, the DGT-TM parallel corpora of two releases (2007 and 2011):
- 1,828,317 unique parallel sentence pairs
- 1,736,384 unique monolingual Latvian sentences
The parallel corpora were cleaned (duplicates and corrupt sentence pairs were removed) before training
Intermediate Adapted System
In-domain bilingual data (sentence and term pairs) extracted from the comparable corpus were added to the parallel data
The in-domain monolingual corpus was used to build a second language model

|                                  | Parallel corpus (unique pairs) | Monolingual corpus (sentences) |
| DGT-TM (2007 and 2011)           | 1,828,317 | 1,576,623 |
| Sentences from comparable corpus | 558,168   | 1,317,298 |
| Terms from comparable corpus     | 3,594     | 3,565     |
Final Adaptation
The phrase table of the intermediate adapted system was transformed into a term-aware phrase table:
- a sixth feature is added, identifying phrases that contain bilingual in-domain terminology
- all tokens are stemmed prior to comparison (in order to capture inflected variants of terms)
- the 3,594 bilingual term pairs extracted with TEA were used in the adaptation process
The SMT system was then re-tuned with MERT
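The term-aware transformation can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' exact implementation: a new binary feature marks phrase-table entries whose source and target sides both contain a known in-domain term pair, with tokens stemmed before comparison so inflected variants still match. The 4-character truncation stemmer is a crude stand-in for a real stemmer, and the phrases, term pair, and feature values are invented.

```python
def stem(token, length=4):
    # Naive truncation stemmer, a stand-in for a real morphological stemmer.
    return token.lower()[:length]

def stems(text):
    return {stem(tok) for tok in text.split()}

def add_term_feature(phrase_table, term_pairs):
    """phrase_table: list of (src_phrase, tgt_phrase, feature_list)."""
    marked = []
    for src, tgt, feats in phrase_table:
        # The entry is term-aware if some term pair's stems appear on
        # both the source and the target side of the phrase.
        has_term = any(stems(t_src) <= stems(src) and stems(t_tgt) <= stems(tgt)
                       for t_src, t_tgt in term_pairs)
        marked.append((src, tgt, feats + [1.0 if has_term else 0.0]))
    return marked

terms = [("operating system", "operētājsistēma")]  # example term pair
table = [("the operating system", "operētājsistēma", [0.2, 0.4, 0.2, 0.4, 2.718]),
         ("see you tomorrow", "uz redzēšanos", [0.1, 0.3, 0.1, 0.3, 2.718])]
print(add_term_feature(table, terms))
```

Because the stem comparison matches "operating systems" as well as "operating system", inflected term variants receive the feature too, which is the point of stemming before comparison.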
Evaluation
Evaluation (cont.)
Automatic score calculation: BLEU, NIST, TER and METEOR scores
Manual translation evaluation:
- comparative evaluation
- usability for localization: productivity and quality
Automatic Evaluation Results

| System                      | Case sensitive? | BLEU  | NIST   | TER   | METEOR |
| Baseline                    | No              | 11.41 | 4.0005 | 85.68 | 0.1711 |
| Baseline                    | Yes             | 10.97 | 3.8617 | 86.62 | 0.1203 |
| Intermediate adapted system | No              | 56.28 | 9.1805 | 43.23 | 0.3998 |
| Intermediate adapted system | Yes             | 54.81 | 8.9349 | 45.04 | 0.3499 |
| Final adapted system        | No              | 56.66 | 9.1966 | 43.08 | 0.4012 |
| Final adapted system        | Yes             | 55.20 | 8.9674 | 44.74 | 0.3514 |
Comparative Evaluation
System Comparison by Total Points
Out of 697 cases in which the sentences were evaluated:
- in 490 cases (70.30±3.39%) the output of the improved SMT system (System 2) was chosen as the better translation
- in 207 cases (29.70±3.39%) users preferred the translation of the baseline system (System 1)
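The reported figures can be checked directly: 490 of 697 judgements favoured the adapted system. The ±3.39% is consistent with a 95% normal-approximation (Wald) confidence interval for a proportion, which is assumed here since the slide does not name the method.

```python
import math

def wald_ci(successes, n, z=1.96):
    # 95% Wald confidence interval for a binomial proportion.
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, half_width

p, half = wald_ci(490, 697)
print(f"{100 * p:.2f}% ± {100 * half:.2f}%")  # → 70.30% ± 3.39%
```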
Evaluation in the Localisation Task
Localization Work
The localization process generally involves the cultural adaptation and translation of software, video games and websites, and less frequently other written translation
Translation memories (TMs) are widely used in the localization industry to increase translators' productivity and the consistency of translated material
Support from the TM is minimal on out-of-domain texts and on texts with different terminology
Key Requirements for Application of MT
Increasing the efficiency of the translation process without degrading quality is the most important goal for a localization service provider
Key requirements for the application of MT in localization:
- quality of translation
- language coverage
- domain coverage
- terminology usage
- cost of adaptation
Application of MT in Localisation
The localization industry experiences growing pressure on efficiency and performance
MT is integrated in several computer-assisted translation (CAT) products, e.g. SDL Trados, ESTeam TRANSLATOR and Kilgray memoQ
Related Work: MT in Localisation
- Microsoft (Schmidtke, 2008) used MT in the MS technical domain for the Office Online 2007 localization task for Spanish, French and German. Applying MT to all new words increased productivity by 5% to 10% on average
- Adobe (Flournoy and Duran, 2009) used rule-based MT for translation into Russian (PROMT) and SMT for Spanish and French (Language Weaver). Productivity increased between 22% and 51%
- Autodesk used the Moses SMT system (Plitt and Masselot, 2010) for translation from English to French, Italian, German and Spanish. A varying increase in productivity, from 20% to 131%, was observed
Evaluation Task
Productivity was compared in two scenarios:
- translation using translation memories (TMs) only
- translation using suggestions from TMs and the SMT system enriched with data from the comparable corpus
MT Integration into the Localization Workflow:
1. Evaluate original / assign translator and editor
2. Analyse against TMs
3. MT-translate new sentences
4. Translate using translation suggestions from TMs and MT
5. Evaluate translation quality / edit
6. Fix errors
7. Ready translation
Integration of SMT Systems in SDL Trados
Integration into SDL Trados
Translations by the SMT systems are provided for those translation segments that have no exact or close match in the translation memory
Suggestions coming from the MT are clearly marked
Localization specialists can post-edit them for a professional result
Evaluation Setup
- 30 documents were split into 2 equally sized parts to perform the two translation scenarios
- The length of each part of a document was 250 to 260 adjusted words on average, resulting in 2 packages of documents with about 7,700 words each
- Three evaluators each translated 10 documents without SMT support and 10 documents with SMT support
Evaluation of Productivity
Both experienced and novice translators were involved
Translators were well trained to use SDL Trados Studio 2009 in their translation work
Translators performed the test without interruption and without switching to other translation tasks during their 8-hour working day
The time spent on translation was reported to the nearest minute
The individual productivity of each translator was measured in words per hour:

$$\mathit{Productivity}(scenario) = \frac{\sum_{text=1}^{N} \mathit{AdjustedWords}(text, scenario)}{\sum_{text=1}^{N} \mathit{ActualTime}(text, scenario)}$$
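The productivity measure above can be sketched as code: total adjusted words divided by total actual time, summed over all texts of a scenario. The document sizes and times below are invented for illustration.

```python
def productivity(docs):
    """docs: list of (adjusted_words, actual_hours) for one scenario."""
    total_words = sum(words for words, _ in docs)
    total_hours = sum(hours for _, hours in docs)
    # Words per hour over the whole scenario, not a mean of per-text rates.
    return total_words / total_hours

tm_only = [(255, 0.55), (260, 0.50)]      # hypothetical TM-only documents
tm_plus_mt = [(252, 0.40), (258, 0.45)]   # hypothetical TM+MT documents
change = productivity(tm_plus_mt) / productivity(tm_only) - 1
print(f"productivity change: {100 * change:+.2f}%")
```

Summing words and time before dividing weights longer documents more heavily, which is why the measure is defined over sums rather than as an average of per-document rates.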
Results of Productivity Evaluation

| Translator   | Scenario | Actual productivity | Productivity increase or decrease |
| Translator 1 | TM       | 493.2               |          |
| Translator 1 | TM+MT    | 667.7               | +35.39%  |
| Translator 2 | TM       | 380.7               |          |
| Translator 2 | TM+MT    | 430.3               | +13.02%  |
| Translator 3 | TM       | 756.9               |          |
| Translator 3 | TM+MT    | 712.3               | −5.89%   |
| All          | TM       | 503.2               |          |
| All          | TM+MT    | 571.9               | +13.63%  |
Evaluation of Quality
Performed by 2 experienced editors as part of their regular QA process
The editors did not know whether MT had been applied to assist a translator
Quality was measured by filling in a Quality Assessment form in accordance with the Tilde QA methodology based on the Localization Industry Standards Association (LISA) QA model
Evaluation of Quality (cont.)
The evaluation process involves inspecting translations and classifying errors according to the following error categories:
- accuracy
- spelling and grammar
- style
- terminology
QA Evaluation Form (blank template; the "Amount of errors" and "Negative points" columns are filled in during evaluation)

| Error Category                                      | Weight | Amount of errors | Negative points |
| 1. Accuracy                                         |        |                  |                 |
| 1.1. Understanding of the source text               | 3      |                  |                 |
| 1.2. Understanding the functionality of the product | 3      |                  |                 |
| 1.3. Comprehensibility                              | 3      |                  |                 |
| 1.4. Omissions/Unnecessary additions                | 2      |                  |                 |
| 1.5. Translated/Untranslated                        | 1      |                  |                 |
| 1.6. Left-overs                                     | 1      |                  |                 |
| Total                                               |        |                  |                 |
| 2. Language quality                                 |        |                  |                 |
| 2.1. Grammar                                        | 2      |                  |                 |
| 2.2. Punctuation                                    | 1      |                  |                 |
| 2.3. Spelling                                       | 1      |                  |                 |
| Total                                               |        |                  |                 |
| 3. Style                                            |        |                  |                 |
| 3.1. Word order, word-for-word translation          | 1      |                  |                 |
| 3.2. Vocabulary and style choice                    | 1      |                  |                 |
| 3.3. Style Guide adherence                          | 2      |                  |                 |
| 3.4. Country standards                              | 1      |                  |                 |
| Total                                               |        |                  |                 |
| 4. Terminology                                      |        |                  |                 |
| 4.1. Glossary adherence                             | 2      |                  |                 |
| 4.2. Consistency                                    | 2      |                  |                 |
| Total                                               |        |                  |                 |
| Additional plus points for style (if applicable)    |        |                  |                 |
| Grand Total                                         |        |                  |                 |
| Negative points per 1000 words                      |        |                  |                 |
| Quality (resulting evaluation)                      |        |                  |                 |
Error Score
The error score was calculated by counting the errors identified by the editor and applying a weighted multiplier based on the severity of the error type. The error score is calculated per 1000 words as:

$$\mathit{ErrorScore} = \frac{1000}{n} \sum_{i} w_i e_i$$

where n is the number of words in a translated text, e_i is the number of errors of type i, and w_i is the coefficient (weight) indicating the severity of type-i errors.
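The error-score formula above can be written directly as code: weighted error counts, normalised per 1000 words. The category names and counts below are examples; the weights follow the QA form (e.g. comprehensibility 3, grammar 2, punctuation 1).

```python
def error_score(word_count, errors, weights):
    """errors and weights: dicts keyed by error category."""
    # Sum of w_i * e_i over all error categories, scaled to 1000 words.
    penalty = sum(weights[cat] * n_errors for cat, n_errors in errors.items())
    return 1000.0 * penalty / word_count

weights = {"comprehensibility": 3, "grammar": 2, "punctuation": 1}
errors = {"comprehensibility": 2, "grammar": 3, "punctuation": 4}
print(error_score(2000, errors, weights))  # (3*2 + 2*3 + 1*4) * 1000/2000 = 8.0
```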
Quality Evaluation Results

| Translator   | Scenario | Accuracy | Lang. quality | Style | Terminology | Total error score |
| Translator 1 | TM       | 6.8      | 8.0           | 6.8   | 1.6         | 23.3              |
| Translator 1 | TM+MT    | 9.9      | 14.4          | 7.8   | 4.1         | 36.3              |
| Translator 2 | TM       | 8.2      | 10.1          | 11.7  | 0.0         | 30.0              |
| Translator 2 | TM+MT    | 3.8      | 11.7          | 7.6   | 1.5         | 24.6              |
| Translator 3 | TM       | 4.6      | 9.5           | 7.3   | 0.0         | 21.4              |
| Translator 3 | TM+MT    | 3.0      | 8.3           | 6.0   | 0.8         | 18.1              |
| Average      | TM       | 6.5      | 9.3           | 8.6   | 0.5         | 24.9              |
| Average      | TM+MT    | 5.4      | 11.4          | 7.1   | 2.1         | 26.0              |
Quality Grades
A quality grade was assigned to a translation depending on the error score severity

| Error score | Resulting quality evaluation |
| 0…9         | Superior                     |
| 10…29       | Good                         |
| 30…49       | Mediocre                     |
| 50…69       | Poor                         |
| >70         | Very poor                    |
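The grade bands can be expressed as a simple lookup. How fractional scores that fall between the listed integer ranges (e.g. 9.5) are graded is an assumption here, since the slide lists integer boundaries only.

```python
def quality_grade(score):
    # Map an error score (per 1000 words) to the grade bands above.
    if score < 10:
        return "Superior"
    if score < 30:
        return "Good"
    if score < 50:
        return "Mediocre"
    if score < 70:
        return "Poor"
    return "Very poor"

print(quality_grade(24.9))  # → Good
```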
Quality Evaluation Results (cont.)

| Translator   | Scenario | Total error score | Quality grade |
| Translator 1 | TM       | 23.3              | Good          |
| Translator 1 | TM+MT    | 36.3              | Mediocre      |
| Translator 2 | TM       | 30.0              | Mediocre      |
| Translator 2 | TM+MT    | 24.6              | Good          |
| Translator 3 | TM       | 21.4              | Good          |
| Translator 3 | TM+MT    | 18.1              | Good          |
| Average      | TM       | 24.9              | Good          |
| Average      | TM+MT    | 26.0              | Good          |
Conclusion
- It is feasible to adapt SMT systems for highly inflected under-resourced languages to a particular domain with the help of comparable data
- The use of English-Latvian domain-adapted SMT suggestions (trained on comparable data) in addition to the TMs increased translation performance by 13.6% while maintaining an acceptable ("Good") translation quality
- We observed relatively high differences in translator performance changes (from −5.89% to +35.39%); therefore, for better-justified results the experiment should be carried out with more than three participants
Thank you for your attention!
THE RESEARCH LEADING TO THESE RESULTS HAS RECEIVED FUNDING FROM THE RESEARCH PROJECT "2.6. MULTILINGUAL MACHINE TRANSLATION" OF EU STRUCTURAL FUNDS, CONTRACT NR. L-KC-11-0003, SIGNED BETWEEN THE ICT COMPETENCE CENTRE AND THE INVESTMENT AND DEVELOPMENT AGENCY OF LATVIA.