An Evaluation of UNL Usability for High Quality Multilingualization and Projections for a Future UNL++ Language

Christian Boitet GETA, CLIPS, IMAG, BP53 385 rue de la Bibliothèque 38041 Grenoble cedex 9 France

Igor M. Boguslavskij IPPI PAN GSP-4 Bol'shoj Karetnyj per. 19 Moscow 101447 Russia


Jesus Cardeñosa VAI, AI Department, UPM Campus de Montegancedo 28660 Boadilla del Monte Madrid (Spain)

Abstract In a recent experiment on translating a web site into 4 languages, we have confirmed that using MT results in "translator's mode" can reduce the human work to produce good translations of complex sentences (25 w) at a rate of 25 mn/p with all-purpose commercial MT and at 20 mn/p with lab quality MT. A subexperiment has shown that using deconversions from quality-checked interlingual representations (UNL graphs) reduced the time spent down to 10 mn/p. Reducing the considerable time now needed for producing and checking UNL graphs is possible, which leads to very good usability prospects in situations involving many target languages and allowing for interactive disambiguation of source text or correction of interlingua. An analysis of improvable aspects in both interlingua design and resource building leads to a "roadmap" towards "UNL++" in the framework of the U++C consortium, including strong mutualization (collaborative volunteer work) and open-source aspects.

Introduction

The need for crosslingual communication, as well as the number of languages involved, is increasing dramatically. Europe has 21 official languages (up from 9 in 1982), soon to become 23, India has 18, etc. Commercial software is typically translated (localized) into 25 to 40 languages, and Open Source software like Mozilla into 70. Needs are increasing for the 4 main "translational situations", which are, in increasing order of difficulty of automation:
1. production of high-quality (HQ) translation by bilinguals (dissemination by bilinguals),
2. understanding text or speech in an unknown language (assimilation),
3. production of HQ translation from an unknown language (HQ assimilation by monolinguals),
4. production of HQ translation into unknown languages (HQ dissemination by monolinguals).

The first is the easiest to automate, because bilingual experts basically only need good lexical support to produce quality results. It is very important to emphasize here that HQ translation by bilingual professional translators is usually ONLY into their native language(s), and not from their native language into their near-native language. This differs from other professions such as bilingual secretaries, who are not trained translators and who are usually expected to translate in both directions. Therefore, HQ translation by professional translators is only in the direction of Dissemination (Outbound/Outward) translation.

For complex sentences with an average of 25.6 words, there are 2 steps, corresponding to the workflow of a traditional human translation cycle. The first step produces a first draft, a "proposed" HQ translation, and the second step is the editing/proofreading of that draft. The first draft is produced in about 60 mn/page¹, in "translator's mode" (the source text is first read and understood), and the second step produces a HQ final result in 20 mn/p, in "revisor's mode" (the first draft is read and corrected, and the source text is consulted only if necessary), for a total of 80 mn/page. For less complex sentences, these times go down to 45+15=60 mn/page². In general, (Allen 2006) reports that using outputs of commercial MT systems as "pre-translations" can divide the total time by 3, to 25 or 20 mn/page³.

¹ A standard page is about 250 words long in English.
² At EACL-05, Comprendium mentioned 60 mn using only dictionaries, 30 mn with translation memory, and 5 mn using a tailored MT system making the best of the parallelism between Spanish, Catalan and Galician.
³ At AMTA-2004, J. Allen reported an experiment where the pre-translation and in-translation processing steps (candidate identification for dictionary building + coding + testing the terms in translation mode, plus verification and retranslation) took 6.5 to 7 hours for 8000 words, and the final post-editing stage 30-45 minutes, or 26+4 = 30 mn/page, a division by 3.

The second task (assimilation by a monolingual) is more difficult: MT outputs which can be used quite efficiently in the previous situation can be almost incomprehensible to somebody who does not know the source language. Example: "he is a big shot and bronzed" instead of "he has a long and bronzed arm"⁴. Everybody has had the experience of getting a web page translated by a "web translator": in most cases, the general topic can at least be roughly understood, but the exact details of what is said and meant cannot. In other words, it is easier to detect a potentially interesting passage than to reach full comprehension of such a passage.

Automating the third task can be done by building "expert" MT systems specialized to the typology and domain at hand. That has been achieved by hand-crafted heuristic symbolic MT systems (such as METEO or ALT/Flash), by dynamic adaptation of symbolic systems augmented with weights, by pure or hybrid example-based MT systems (EBMT) and by statistical MT systems (SMT).

Automating the last task seems to be impossible without integrating translation into an authoring environment, thus making it possible to get the "intended meaning" of the author through interactive disambiguation, and/or by imposing a controlled language.

Next to the problem of producing translations "good enough" for the task at hand, there is the problem of producing them for many language pairs. Translating through "pivot" (interface) natural languages is possible if an excellent first translation is produced. It is not an option with unrevised MT of general texts, even with the best systems, because the intermediate text often, if not always, contains ungrammaticalities, ambiguities, and unknown source words. Building N(N-1) direct systems is also not an option in most situations⁵. A possible solution is to "compose" transfer-based MT systems through intermediate descriptors of the "pivot language", getting a "linguistic pivot" architecture. But understanding, and hence directly editing, a complex structure that fully represents a NL utterance is next to impossible for most people.

The last solution is to use a "pivot" architecture based on an abstract interlingua. Semantico-pragmatic pivots have been used successfully for restricted situations⁶, but it seems impossible to extend them to handle general language. By contrast, semantico-linguistic pivots have been used successfully, in particular in the MT systems ATLAS-II (Fujitsu), PIVOT (NEC), CICC (ODA), and KANT/CATALYST (CMU/Caterpillar).

UNL (Universal Networking Language) was introduced in 1996 by H. Uchida, the main designer of ATLAS-II, as an "open" successor of the CICC interlingua, itself a successor of the ATLAS-II proprietary pivot. A UNL (hyper)graph represents a disambiguated abstract structure of an English utterance equivalent to the utterance (in any language) to be represented, and its symbols (relations, attributes, and lexemes, called UWs) are English words, acronyms, or structured strings built from English words or acronyms. Understanding UNL graphs is quite easy for anybody having a high school level of English (GRE and GMAT exams, say), hence for the vast majority of developers in today's world. See www.undl.org for more details.

In this paper, we try to evaluate the usability of UNL in a real setting, and propose ways to improve it to a point where 10 complex pages could be obtained (in a mutualizing, i.e. work sharing, setting) after 1 day in 20 or more languages, and require less than 1 hour of human work for each version (each target language). In the first section, we describe a recent experiment on translating the Unesco B@bel web site into 4 languages, which confirmed that the use of MT "pre-translations" to get good translations can divide the human work time by 3 or more, and that the use of "deconversions" from quality-checked UNL graphs can divide it by up to 8, not counting the time needed for producing and checking UNL graphs. The second section presents an analysis of improvable aspects in both interlingua design and resource building, and the third a "roadmap" towards a "U++" framework, including strong mutualization and open-source aspects.

⁴ Real example from a "longest match" MT system on "Il a le bras long et bronzé"; "avoir le bras long" means "to be a big shot".
⁵ Although Ph. Koehn has recently built SMT systems for the 110 language pairs of the EuroParl corpus, the minimum size of the parallel corpus seems to lie between 50 and 200M words (200K to 800K pages). Adding a new language may need 10 years of human translation (at a rate of 20K pages of debates a year). Porting to many typologies requires finding large enough parallel corpora.

⁶ The very first was the TITUS system at the Institut Textile de France (Ducrot 1976). Task-oriented Speech Translation systems such as MASTOR-1 (IBM) also use that technique, and are built by statistical means.

1 Embedding a comparative task-related evaluation of UNL in a real translation task

The main goal of the experiment was to study how to automate and "mutualize" the multilingualization of web sites and other documents of Unesco and other international cultural bodies in the future. The 3 partner labs working on the project:
• translated the textual material (in English) of the B@bel UNESCO web site, equivalent to 173 standard pages (43200 words), into French, Spanish, Russian and Chinese, with an "operational" output quality;
• evaluated the gain obtained by using MT systems to produce "pre-translations" of the whole material (SYSTRAN v5.0 Premium for French, Spanish and Chinese⁷, and ETAP-3 of the IPPI lab for Russian);
• evaluated the gain obtainable by using UNL: we produced 906 "good" UNL graphs for the most complex part of the corpus (≈23200 words), a time-consuming task⁸, automatically deconverted⁹ them into French, Spanish and Russian, and post-edited sample results in various settings;
• created a web site for distributed development, and put all results on it.

1.1 Steps of the experiment

We used an SQL dump of the Unesco/B@bel data base to build our own database of "polyphrases" (sets of versions of a sentence in the source and target languages as well as in UNL), thereby segmenting B@bel "text containers" into sentences, and factorizing identical source sentences in the same polyphrase. A web site for translation and UNL-ization was implemented as an extension of an existing UNL deconversion web service. We produced the final translation into French, Russian, Spanish, and Chinese, using a simple Excel format (see Annex) to work in "translator's mode" and to measure times. In parallel, we produced UNL graphs, created missing UWs (UNL lexemes), and linked them to associated dictionary entries in English, French, Russian and Spanish. A semi-automatic analyzer and editor of UNL graphs was used to help enconversion for about half the graphs while the other half was created manually. Some time (not directly measured) was spent on unifying the UNL-ization process itself (how to "encode" various phenomena). Existing deconverters were improved to cope with this corpus, and were run on the 906 graphs. A sample of 50 of the obtained deconversions (about 5 standard pages of 250 words) was then used as pre-translations and post-edited.
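The "polyphrase" factorization described above can be pictured with a minimal Python sketch. It is not the authors' implementation: the class name `Polyphrase`, the naive regular-expression sentence splitter, and the container identifiers are all assumptions made for illustration. It only shows the idea of segmenting text containers into sentences and storing each distinct source sentence once, together with the places where it occurs.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Polyphrase:
    """One source sentence with all its versions (hypothetical structure)."""
    source: str
    versions: dict = field(default_factory=dict)      # e.g. {"fr": "...", "unl": "..."}
    occurrences: list = field(default_factory=list)   # (container id, position) pairs


def split_sentences(text):
    """Naive splitter (an assumption): cut after '.', '!' or '?' followed by spaces."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_polyphrases(containers):
    """Segment each text container and factorize identical source sentences."""
    table = {}
    for container_id, text in containers.items():
        for position, sentence in enumerate(split_sentences(text)):
            poly = table.setdefault(sentence, Polyphrase(source=sentence))
            poly.occurrences.append((container_id, position))
    return table


# Illustrative containers (not taken from the B@bel database).
containers = {
    "c1": "Multilingual Web Browser. Free download.",
    "c2": "Free download.",
}
polyphrases = build_polyphrases(containers)
polyphrases["Free download."].versions["fr"] = "Téléchargement libre."
for poly in polyphrases.values():
    print(poly.source, poly.occurrences, poly.versions)
```

With this layout, a sentence that appears in several containers is translated or UNL-ized once, and every occurrence benefits from it.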

1.2 Measured results

Here are the main results.
• Using MT outputs as pre-translations, and working in translator's mode (reading the source text, then looking at the pre-translation and making the best of it), divides the total human translation time (80 mn/p¹⁰) by more than 3 (25 mn/p) using SYSTRAN v5.0 and by 4 (20 mn/p) using ETAP-3.
• Deconversion outputs were post-edited under various ergonomic conditions, at rates ranging from 10 to 25 mn/p (with or without seeing the UNL graph).
• Producing UNL graphs manually and semi-automatically took 4.5 h/p to 3.2 h/p, including the time to add missing dictionary entries.
• Producing a new UW and linking it to Spanish took 100 s and 123 s (~2 mn).

⁷ Chinese was translated but could not be evaluated.
⁸ Even if their construction is automated, they must be revised to ensure a high quality, so that deconversions are the best possible.
⁹ We use "analysis" and "generation" if staying in the same "lexical space", "enconversion" and "deconversion" if there is a change of lexical space (a lexical transfer).

1.3 Potential actual and future gains

Results and projections (for 10 & 20 target languages) are summarized in the following table, where lines UNL-sa1 and UNL-sa2 are projections for mid-term and long-term speed-ups in graph creation.

                  Simple (12 w/s)                     Complex (25 w/s)
Text type         1st draft  Bil. Rev  UNL Rev  Tot   1st draft  Bil. Rev  UNL Rev  Tot
10 target languages
H only               45         15        -      60      60         20        -      80
H+TM                 20          5        -      25      30         10        -      40
MT-gen                0         15        -      15       0         25        -      25
MT-spec               0          5        -       5       0         15        -      15
UNL-man             120          -       10      22     240          -       10      34
UNL-sa1              20          -        8      10      30          -        8      11
UNL-sa2              10          -        5       6      15          -        5     6.5
20 target languages (UNL-man time is spread over them)
UNL-man             120          -       10      16     240          -       10      22
UNL-sa1              20          -        8       9      30          -        8     9.5
UNL-sa2              10          -        5     5.5      15          -        5     5.7

Table 1: measures and projections¹¹ (times in mn/p)
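The per-language totals of the UNL lines in Table 1 are consistent with a simple rule: the graph creation time is shared by all target languages, and the revision time is paid once per language. The Python sketch below reproduces that arithmetic. The function names are ours, and the break-even computation is only a derived illustration based on the table's figures, not a result reported by the experiment.

```python
def per_language_time(shared_mn, per_language_mn, n_targets):
    """Human time in mn/p per target language: shared cost spread over n_targets."""
    if n_targets <= 0:
        raise ValueError("n_targets must be positive")
    return shared_mn / n_targets + per_language_mn


def break_even_targets(shared_mn, per_language_mn, reference_mn):
    """Smallest number of target languages for which the pivot is strictly cheaper."""
    if per_language_mn >= reference_mn:
        raise ValueError("the per-language cost alone already exceeds the reference")
    n = 1
    while per_language_time(shared_mn, per_language_mn, n) >= reference_mn:
        n += 1
    return n


# (graph creation, revision of deconverted text) for complex text, from Table 1.
complex_rows = {"UNL-man": (240, 10), "UNL-sa1": (30, 8), "UNL-sa2": (15, 5)}

for n_targets in (10, 20):
    for name, (creation, revision) in complex_rows.items():
        total = per_language_time(creation, revision, n_targets)
        print(f"{n_targets} TL  {name}: {total:.2f} mn/p")

# Against general-purpose MT at 25 mn/p (line MT-gen, complex text).
print(break_even_targets(240, 10, 25))
```

Under these figures, manual UNL graph creation becomes cheaper per language than post-editing general-purpose MT output once more than 16 target languages share the graphs, which agrees with the 22 mn/p obtained for 20 languages.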

Hence, using MT is a viable option, immediately usable for the existing commercial "language pairs¹²", and using an architecture based on UNL or a UNL-like "pivot" will probably be more time-saving and applicable to many more target languages in the future. Even with manual production of the UNL graphs, if 20 target languages are considered, using UNL to translate general texts with complex sentences should be more efficient than using classical general-purpose MT systems… which anyway don't exist¹³ and cannot be built for hundreds of language pairs (380 now in Europe alone).

¹⁰ mn/p = minute per page, h/p = hour per page, w/s = word per sentence.
¹¹ TL = target language, H = human, TM = translation memory, MT-gen = general purpose MT, MT-spec = specialized MT, UNL-man = manual creation of UNL graphs, UNL-sa1, 2 = semi-automatic creation of same, UNL-Rev = revision of deconverted text.
¹² A "pair" is taken here to be ordered, as in mathematics: there are 2 language pairs for any set of 2 different languages.
¹³ See http://www.translatorscafe.com/cafe/MegaBBS/threadview.asp?threadid=4693&messageid=62125#62125 (12/12/06)

2 Aspects to improve in the UNL way

Improvements to be introduced concern the development process, the specification of the language, and tools.

2.1 Common development web platform

Although the experiment web site offers a very useful possibility, namely to instantly produce a drawing of any correct UNL graph, and to edit the dictionary and the translations, it is not yet sophisticated enough to be used as a "web translator workstation", nor as a "UNL graph factory". What is needed is:
• a common lexical database available on the web for day to day cooperative work;
• web-enabled editing facilities usable on the database of texts and graphs;
• a web-oriented meta-EDL¹⁴ communicating with the EDLs of developers (different tools are used for different languages).

2.2 UNL specification & UW construction

The UNL project (UNLP) was launched as an international academic cooperation project led by the IAS/UNU¹⁵, mostly funded by the Japanese ASCII company. Funding dwindled in 1998 when the Japanese "bubble" burst, and there was no opportunity to discuss and adopt necessary improvements. The most important problems concerning the specifications are:
• Absence of arguments on UWs and arcs in UNL graphs. (1) Semantic relations cannot be reliably assigned to arguments of predicates, so that ambiguities arise (e.g. "John(agt) gives a book(obj) to Dan(ben) for Mary(ben)" and "John(agt) gives a book(obj) for Dan(ben) to Mary(ben)"¹⁶). (2) The same relation may connect both an argument and an adjunct to the same predicate (e.g. borrow and dur(ation): "He borrowed money for three years" (argument) vs. "He has been borrowing money all his life" (non-argument)). Because of this, we need to have an argument label on relations in the graph.
• Unsatisfactory UW creation process: the SMTP-based "UW gate" was never usable in real time, and the absence of comments prevented common understanding and coherent development of the UW lexicon.
• Impossibility of consensus-based evolution of the specifications within the UNLP organization, something unacceptable in an academic and cooperative endeavour.
• UNL multilingual document format based on HTML: special tags such as [S], [/S], {org}, {/org}, {unl}, {/unl} and a notation for attributes were introduced before XML existed, in a quite clever way. A standard XML format that is just as simple is needed to take advantage of XML-associated tools.
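To make the last point of section 2.2 concrete, here is a minimal Python sketch of a converter from the bracket-and-brace tags to XML. It is only an illustration under explicit assumptions: it handles the bare tags [S], {org} and {unl} named in the text and ignores the attribute notation, the XML element names (`document`, `sentence`, `org`, `unl`) are hypothetical, and the three UNL relation lines in the sample are illustrative rather than taken from the corpus.

```python
import re
import xml.etree.ElementTree as ET

# Bare tags only; the attribute notation of the real format is not handled here.
SENTENCE_RE = re.compile(r"\[S\](.*?)\[/S\]", re.DOTALL)
ORG_RE = re.compile(r"\{org\}(.*?)\{/org\}", re.DOTALL)
UNL_RE = re.compile(r"\{unl\}(.*?)\{/unl\}", re.DOTALL)


def unl_document_to_xml(text):
    """Wrap each [S]...[/S] block in a <sentence> element (hypothetical schema)."""
    root = ET.Element("document")
    for number, match in enumerate(SENTENCE_RE.finditer(text), start=1):
        body = match.group(1)
        sentence = ET.SubElement(root, "sentence", {"n": str(number)})
        org = ORG_RE.search(body)
        unl = UNL_RE.search(body)
        if org:
            ET.SubElement(sentence, "org").text = org.group(1).strip()
        if unl:
            ET.SubElement(sentence, "unl").text = unl.group(1).strip()
    return root


# Illustrative input, not taken from the B@bel corpus.
sample = """[S]
{org}
John gives a book to Mary.
{/org}
{unl}
agt(give(icl>do).@entry, John)
obj(give(icl>do).@entry, book)
gol(give(icl>do).@entry, Mary)
{/unl}
[/S]"""

root = unl_document_to_xml(sample)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Once the document is in XML, generic tools (validators, XPath queries, stylesheet transformations) can be applied without any UNL-specific parser, which is the benefit the text refers to.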

2.3 Need for Open Source & more tools

The UNLP distributes some tools (UNL-html viewer, UNL verifier, Dictionary Builder, and the rule-based languages EnCo and DeCo), which is remarkable considering the small size of the Tokyo-based team. But they run only under Windows, are access-restricted and not Open Source, and bugs are rarely fixed. The following Open Source tools are needed:
• UNL-graph editors: there are some, but they are not robust enough, and not usable on the web.
• Debugged & open versions of EnCo and DeCo, the rule-based Specialized Languages for Linguistic Programming (SLLPs) distributed by the UNLP.
• XML-based tools: format converters, an XML-based viewer complementing the UNLP HTML-viewer, editors of full documents containing UNL and multilingual versions.
• New, more powerful NLP tools: DeCo and EnCo have no variables, no facility for structured programming, and cannot produce multiple solutions, although they offer backtracking. Useful tools include:
  • FST-based tools for morphology (with EnCo and DeCo, more than 15K rules had to be written for Russian);
  • Tools for multiple scored analysis: dependency, constituent or mixed;
  • Tree↔graph converters;
  • Tree-transformation tools¹⁷.
• Corpus-based learning tools: although the UNL design is simple enough to teach newcomers quite quickly and get results in new languages in a few months, scaling up is a very time-intensive task.
• To port to new languages, aligners between text and UNL trees (unfolded UNL graphs) would make it possible to "infer" deconverters and enconverters from parallel corpora containing UNL.
• To speed up development, tools should be developed to produce UWs from existing lexical resources.

¹⁴ EDL = Environment for Developing Lingware.
¹⁵ Institute of Advanced Studies of the United Nations University, Tokyo.
¹⁶ Using 2 relations, ben and gol, does not help unless they are associated in advance to some arguments in the dictionary.
¹⁷ From "learnable" Thatcher & Wright Tree-FST to systems like TELESI (Chauché, LIRMM), ROBRA (Ariane system, Grenoble), GRADE (MU-system, Kyoto) or GWS (ISS, Singapore), etc.

3

A new impetus: the U++ consortium

We now describe the "roadmap" on which the new "U++ consortium" (U++C) is working.

3.1 Goals

This non-profit, open-source oriented organization was created by some UNL language centers and external partners just before the CICLING-05 conference, in the presence of Pr. T. Della Senta, president of the UNDL foundation. Its goals are to promote the development of UNL-related standards, and to offer related open-source resources and tools contributed à la Linux and à la W3C. The U++C complements the UNDL in various ways. In particular, it will participate as such in project proposals answering EU calls, which UNDL is not considering due to its status (a Swiss foundation under Unitar/UN). Another point is that it tries to set clear, measurable performance goals for concrete applications corresponding to urgent needs, while UNDL is promoting more futuristic research, for instance on "UNL encyclopedia", "semantic computing", and "knowledge management". As the "UNL" name and logo are reserved by UNDL and UNLP, we have introduced a new name, U++C, to show at the same time the relation and the difference.

Here are the concrete goals the U++C proposes to reach in 3-4 years.
• Translation and maintenance times:

                  Simple (12 w/s)                          Complex (25 w/s)
20 TL             U++ creat.  Bil. Rev  U++ Rev  Tot/TL    U++ creat.  Bil. Rev  U++ Rev  Tot/TL
UNL-m                120         -        10      16          240         -        10      22
U++                    5         -         2       2.25        10         -         4       4.5
                                           0.5                                      1
                  UW/p        NL dict.  NL proc. Tot/TL    UW/p        NL dict.  NL proc. Tot/TL
Ling. devt             2         1         1       2.1          2         4         1       5.1
Total                  7         1         2.5     4.35        12         4         5       9.6

Table 2: time objectives of U++C

• Operational integration: integrate UNL in at least 2 of the following application types:
  • multilingual public web site (as B@bel)
  • cause-oriented document translation
  • Open Source software localization
  • specific operation of multilingual document production/assimilation.
• Delay to get contribution quality enhancement: 10 complex pages should be obtainable (in a mutualized setting) after 1 day, in 20 or more languages, and require less than 1 hour (6 mn/p) of human work for each version (each target language).

3.2 Roadmap

For lack of a better name, we call UNL++ the variant of UNL on which the U++C is working.

3.2.1 U++C lexicon

3.2.1.1. Evolution from UWs to XUWs

We propose the term "eXtended UW" or XUW for the U++C variant of UWs. Since the XUW dictionary is a collection of meanings coming from different languages, XUWs should be:
• complex and flexible enough to be able to express all specific meanings of all natural languages;
• simple enough to be understandable by people from different languages and different cultures;
• built with reference to widely used theories & resources, to get very large potential cooperation and contributions;
• accompanied by structured comments in English.

A UW¹⁸ is made of an English "headword" and a list of restrictions. The evolution towards XUW will include the following (described in more detail in a still internal document written by the 2nd author).

¹⁸ UW = Universal Word: a UW denotes 1 (ideally) or more "lexical meanings" of at least one natural language.

• Use of WordNet (WN) to get a 1st degree intuitive & open disambiguation "tag":
  • if a headword X is the first element of a synset, and Y is its most immediate hypernym in a given sense, the proposed UW is X(icl>Y). Example:
      pen(icl>writing implement)
      pen(icl>enclosure)
  • if X is not the first element of the synset, then the UW has the form X(icl>Y, equ>Z), where Y is the most immediate hypernym and Z is the first term of the synset. Example:
      pen(icl>enclosure, equ>playpen)
      pen(icl>correctional institution, equ>penitentiary)
• Indication of arguments with their semantic restrictions. Examples:
      give(icl>do, agt.@A>thing, obj.@B>thing, gol.@C>thing)
      borrow(icl>do, agt.@A>thing, obj.@B>thing, src.@C>volitional thing, dur.@D>time)
  There are abbreviation rules: the preceding XUW (meaning "borrow something for some time") is the same as borrow(icl>do, src.@C>volitional thing, dur.@D>time) because all transitive verbal XUWs of type do have the 2 default arguments (agt.@A>thing, obj.@B>thing). If there is a restriction on a default, it must be expressed: wash(icl>do, obj>cloth) for stirat' in Russian (as opposed to myt', used for dishes, hands, etc.). Note that the UNL knowledge-base (KB), if it contains the considered meaning, is useful to get the argument structure, while WN is not (until now).
• Other semantic restrictions are further indicated. Example: to land (prizemljat’sja vs. vysazhivat’sja in Russian) gives
      land(icl>do, plf>sky) //for a plane
      land(icl>do, plf>water) //for a ship
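The two WordNet-based construction rules and the abbreviation rule for default arguments can be sketched in a few lines of Python. This is a sketch under stated assumptions, not U++C code: the function names `make_uw` and `abbreviate` are ours, and the hypernym and the first term of the synset are passed in by the caller instead of being looked up in WordNet.

```python
# Default arguments of transitive verbal XUWs of type "do", as stated in the text.
DEFAULT_DO_ARGS = ("agt.@A>thing", "obj.@B>thing")


def make_uw(headword, hypernym, synset_first):
    """Build X(icl>Y) if X heads its synset, else X(icl>Y, equ>Z)."""
    if headword == synset_first:
        return f"{headword}(icl>{hypernym})"
    return f"{headword}(icl>{hypernym}, equ>{synset_first})"


def abbreviate(headword, restrictions):
    """Drop the default arguments of a 'do' XUW; other restrictions are kept."""
    if "icl>do" in restrictions:
        restrictions = [r for r in restrictions if r not in DEFAULT_DO_ARGS]
    return f"{headword}({', '.join(restrictions)})"


print(make_uw("pen", "writing implement", "pen"))
print(make_uw("pen", "enclosure", "playpen"))
print(abbreviate("borrow", [
    "icl>do", "agt.@A>thing", "obj.@B>thing",
    "src.@C>volitional thing", "dur.@D>time",
]))
```

A restricted default such as obj>cloth in wash(icl>do, obj>cloth) is not equal to either default string, so `abbreviate` keeps it, in line with the rule that a restriction on a default must be expressed.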

Special headwords • Allowing quoted headwords: even in English, some dictionary entries use accented characters illegal in UWs. • Special XUWs for mathematical expressions & relations, figures and icons, anchors, hyperlinks, references (to bibliography, footnotes…), punctuations (e.g. bulleted or numbered lists). They cannot simply be put inside double quotes, as now. In particular, formulas can behave as predicates (aplace)



Scope notation is more precise. If the XUW is not in a U++C-NL dictionary, the content of its headword is unfolded as a scope and compositional deconversion can be attempted. Example:

'(mod:01(center.@entry, radio) (mod(center.@entry, national'(icl>place)

In this case, we know that "national" modifies "radio center" and can deconvert accordingly.

3.2.1.2. Practical process

Development is planned in "batches":
• start from an available set corresponding to a real application (probably B@bel);
• then complete with the most frequent English vocabulary with all meanings & refine if necessary for some NL;
• then augment according to applications.

Transformation steps for each "batch" of XUWs are the following:
• use WN to get a 1st degree intuitive & open disambiguation "tag";
• link with languages whenever possible;
• build structured comments semi-automatically and complete them manually;
• add arguments and semantic restrictions;
• add other semantic restrictions.

3.2.2 Multilingual lexical database

It will contain the XUWs and their equivalent lemmas in all languages considered, in various progressing versions. Ideally, it should function in wiki mode. The source code of the distant deconverters and enconverters does not have to be in the database, but enough information to generate them should be there, as well as appropriate comments in the corresponding NL. The structure of this database extends that included in the gohan.imag.fr/unldeco/ web server; it is inspired by the Papillon multilingual lexical database (3-tier architecture) and is built on the same Jibiki generic platform. Logically, it is a large XML tree with a first level for XUWs (id, synonymous notations, comment, and metadata), then a level for the NL, then (under each language) the equivalent lemmas, in their different versions. Physically, we use an SQL DBMS (Database Management System) such as PostgreSQL, with the usual structured metadata and a simple representation (id, XML string [, XML binary tree]) for the content. This way, the logical structure may change while the physical structure remains the same.
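The separation between logical and physical structure can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: SQLite (from the standard library) stands in for the SQL DBMS, the table and column names are hypothetical, and so are the XML element names and the sample lemma. The point it shows is that the database only stores (id, XML string), so the XML tree can evolve without any change to the table.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_entry(xuw, comment, lemmas):
    """Serialize one XUW entry as an XML string (hypothetical element names)."""
    entry = ET.Element("xuw", {"id": xuw})
    ET.SubElement(entry, "comment").text = comment
    for lang, words in lemmas.items():
        lang_node = ET.SubElement(entry, "language", {"code": lang})
        for word in words:
            ET.SubElement(lang_node, "lemma").text = word
    return ET.tostring(entry, encoding="unicode")


# Physical structure: one row per entry, the content being an opaque XML string.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id TEXT PRIMARY KEY, content TEXT NOT NULL)")

xuw = "pen(icl>writing implement)"
conn.execute(
    "INSERT INTO entries (id, content) VALUES (?, ?)",
    (xuw, build_entry(xuw, "instrument for writing with ink", {"fr": ["stylo"]})),
)

# Logical structure: recovered by parsing the stored string.
row = conn.execute("SELECT content FROM entries WHERE id = ?", (xuw,)).fetchone()
stored = ET.fromstring(row[0])
french = [lemma.text for lemma in stored.findall("language[@code='fr']/lemma")]
print(stored.get("id"), french)
```

Adding a new level to the tree, for instance versions under each lemma, would change only `build_entry` and the queries on the parsed tree, not the table definition.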

3.2.3 Multilingual parallel corpora

Again, this will be developed as an extension of the web sites built by the U++C partners over the past few years. The current, incomplete version has been built with Enhydra so that it generates dynamic web pages corresponding to a subset of the corpus. It is already possible to edit the texts in natural languages and the textual form of the UNL graphs, but the challenge is to make it possible to interact graphically with the graphs through the web, and to make the "coedition" idea (Boitet & Tsai 2002) operational.

3.3 Spreading UNL usage

The third but not least goal of the U++C is to promote the use of UNL in various contexts. There are 2 main directions:
• Embedding UNL in various applications and scenarios:
  • Cross-Lingual Information Retrieval
  • Text and speech translation
  • Semantic web (UNL annotations)
• Extending UNL to many languages:
  • Cloning from "near" languages
  • Learning from parallel corpora.

Conclusion

Using deconversions from quality-checked interlingual (IL) representations such as UNL graphs is potentially a better approach to the production of HQ translations of general texts in dozens of languages than trying to build a quadratic number of classical binary MT systems. Even if the cost of producing HQ graphs is now high (3 to 4 h/p for complex text), it is quite low when spread over many target languages, so that the overall cost is lower and the return on investment better than with the classical approach. Going through a semantico-linguistic IL such as UNL also permits direct editing or indirect coedition of the IL, and hence, for the first time in history, sharing post-editing across the target languages. Nevertheless, IL creation should and can be sped up considerably, to 5-20 mn/p according to text complexity.

We have described improvements in the engineering context as well as in the UNL specification, and a "roadmap" to go from UNL to UNL++. In concrete terms, the goals of the recently created U++ consortium are to lower the human time spent on producing HQ translations to less than 5 mn/p (respectively 3 mn/p) for complex (resp. simple) sentences, and to less than 10 mn/p (resp. 5 mn/p) if counting the total human effort (adding work on lingware, whatever approach it is based on). Thanks to a wiki-like organization, and to incremental improvement of the most important parts of the documents, the delay to get 10 pages translated in 20 languages or more could be less than 1 day (1 h of wiki post-editing).

Acknowledgments This work has been partially funded by Unesco contract number 4500020224. The authors would also like to thank deeply Jean-Philippe Guilbaud, Étienne Blanc, Gilles Sérasset, Carolina Gallardo Pérez, and Leonid Iomdin, who contributed in an essential way to the work reported here. Thanks should also go to Jeff Allen and to the reviewers, for constructive and interesting improvements.

References

J. Allen (2006) Documents and references on post-editing MT. Web site. http://www.geocities.com/mtpost-editing/
E. Blanc (2001) From graph to tree: Processing UNL graph using an existing MT system. Proc. First UNL Open Conference - Building Global Knowledge with UNL, Suzhou, China, 18-20 Nov. 2001, UNDL (Geneva), 6 p.
I. Boguslavsky, N. Frid, L. Iomdin, L. Kreidlin, I. Sagalova & V. Sizov (2000) Creating a Universal Networking Language Module within an Advanced NLP System. Proc. COLING-2000, Saarbrücken, 31/7-3/8/2000, ACL & Morgan Kaufmann, vol. 1/2, 83-89.
C. Boitet (2002a) Advantages of the UNL language and format for web-oriented crosslingual applications. Proc. Seminar on linguistic meaning representation and their applications over the World Wide Web, Penang, 20-22/8/2002, USM, 4 p.
C. Boitet (2002b) A rationale for using UNL as an interlingua and more in various domains. Proc. LREC-02 First International Workshop on UNL, other Interlinguas, and their Applications, Las Palmas, 26-31/5/2002, ELRA/ELDA, 23-26.
C. Boitet & W.-J. Tsai (2002) Coedition to share text revision across languages. Proc. COLING-02 WS on MT, Taipeh, 1/9/2002, 8 p.
J. Coch & K. Chevreau (2001) Interactive Multilingual Generation. Proc. CICLing-2001 (Computational Linguistics and Intelligent Text Processing), Mexico, February 2001, Springer, 239-250.
J.-M. Ducrot (1982) TITUS IV. In "Information research in Europe. Proc. of the EURIM 5 conf. (Versailles)", P. J. Taylor, ed., ASLIB, London.
J. Hutchins, W. Hartman & E. Hito (2005) Compendium of Translation Software (directory of machine translation systems and computer-aided translation support tools). EAMT (on behalf of IAMT), TIM/ISSCO, Geneva, 127 p. (Earlier editions of the Compendium, which list older systems and older versions of current systems, are available as PDF files from: http://ourworld.compuserve.com/homepages/WJHutchins/compendium.htm)
G. Sérasset & C. Boitet (1999) UNL-French deconversion as transfer & generation from an interlingua with possible quality enhancement through offline human interaction. Proc. MT Summit VII, Singapore, 13-17 September 1999, Asia Pacific Ass. for MT, 220-228.
G. Sérasset & C. Boitet (2000) On UNL as the future "html of the linguistic content" & the reuse of existing NLP components in UNL-related applications with the example of a UNL-French deconverter. Proc. COLING-2000, Saarbrücken, 31/7-3/8/2000, ACL & Morgan Kaufmann, vol. 2/2, 768-774.
X. Shi & Y. Chen (2001) A UNL Deconverter for Chinese. Proc. UNL-2001, Suzhou, April 2001, IPM, 6 p.
H. Uchida (1989) ATLAS. Proc. MTS-II (MT Summit), Munich, 16-18 August 1989, 152-157.
H. Uchida (2004) The Universal Networking Language (UNL) Specifications Version 3 Edition 3. UNL Center, UNDL Foundation, December 2004, 43 p. http://www.undl.org/unlsys/unl/UNLSpecs33.pdf

Annex: examples

Excerpt of the English-French post-editing worksheet. For each segment, three text columns are shown: the English original (column "original (anglais)"), the raw MT output (column "tr-SYSTRAN-5 (en_fr)") and the final French translation (column "traduction finale (français)"). The word count of each original (column "nb_mots") is given in parentheses. Summary cells: duree_tot 36 (total time, mn), nb_pages 1,74, mn_p_page 20,7 (mn/p).

Segment 1 (3 words)
Original: Multilingual Web Browser.
SYSTRAN: Web browser multilingue.
Final: Navigateur Web multilingue.

Segment 2 (24 words)
Original: This project was carried out within Initiative B@bel in cooperation with SIL International to support efforts aimed at developing software/tools promoting multilingualism in cyberspace.
SYSTRAN: Ce projet a été mis à exécution dans l'initiative B@bel en coopération avec SIL international pour soutenir des efforts visés développant le logiciel/outils favorisant le multilinguisme dans Cyberspace.
Final: Ce projet a été mené au sein de l'initiative B@bel en coopération avec SIL international pour soutenir des efforts visant à développer les logiciels/outils favorisant le multilinguisme dans le Cyberspace.

Segment 3 (26 words)
Original: SIL International has developed Graphite engine which supports the display of complex and non-Roman scripts and is available for free download on the SIL International's website.
SYSTRAN: SIL international a développé le moteur de graphite qui soutient l'affichage des manuscrits complexes et non-Romains et est disponible pour le téléchargement libre sur le site Web international de SIL.
Final: SIL international a développé le moteur Graphite qui supporte l'affichage des scripts complexes et non romains et est disponible en téléchargement libre sur le site Web de SIL international.

Segment 4 (28 words)
Original: The project will involve the incorporation of graphite’s unique functionalities in other software applications, thereby contributing to the creation and dissemination of content in many currently lesser-used languages.
SYSTRAN: Le projet comportera l'incorporation des fonctionnalités uniques du graphite dans d'autres applications de logiciel, contribuant de ce fait à la création et à la diffusion du contenu dans beaucoup de langues actuellement peu de-utilisées.
Final: Le projet comportera l'incorporation des fonctionnalités uniques de Graphite dans d'autres applications logicielles, contribuant de ce fait à la création et à la diffusion du contenu dans beaucoup de langues actuellement moins utilisées.

Segment 5 (21 words)
Original: These products will also be freely disseminated with basic documentation facilitating the incorporation of Graphite by software developers in other products.
SYSTRAN: Ces produits également seront librement disséminés avec la documentation de base facilitant l'incorporation du graphite par des réalisateurs de logiciel dans d'autres produits.
Final: Ces produits seront également librement disséminés avec la documentation de base facilitant l'incorporation de Graphite par des réalisateurs de logiciel dans d'autres produits.

Segment 6 (3 words)
Original: Multilingual web browser.
SYSTRAN: Web browser multilingue.
Final: Navigateur Web multilingue.

Segment 7 (17 words)
Original: Web-page/site creation is one of the most common form of web publishing and information dissemination in cybersapce.
SYSTRAN: Le page Web/création d'emplacement est un de la forme la plus commune d'édition de Web et de diffusion de l'information dans le cybersapce.
Final: La création de page/sites Web est une des formes les plus communes d'édition sur le Web et de diffusion de l'information dans le cybersapce.

Segment 8 (30 words)
Original: By developing a beta version of a web browser that supports creation and viewing of web pages in Burmese the ability to create and disseminate multilingual information will be promoted.
SYSTRAN: En développant une bêta version d'un web browser qui soutient la création et la visualisation des pages Web dans le Birman la capacité de créer et diffuser l'information multilingue sera favorisé.
Final: En développant une version bêta d'un navigateur Web qui supporte la création et la visualisation des pages Web en Birman, la capacité de créer et diffuser l'information multilingue sera favorisée.

Segment 9 (10 words)
Original: The open-source Mozilla browser has been used for this development.
SYSTRAN: Le navigateur de Mozilla d'ouvrir-source a été employé pour ce développement.
Final: Le navigateur Mozilla à source ouvert a été employé pour ce développement.
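The worksheet's summary cells (duree_tot 36, nb_pages 1,74, mn_p_page 20,7) appear to be related by a simple division. The sketch below assumes that mn_p_page is duree_tot divided by nb_pages, with duree_tot in minutes and decimal commas as in the sheet; the function names are hypothetical and are not taken from the experiment's tooling.

```python
def parse_decimal_comma(cell: str) -> float:
    """Convert a worksheet cell such as '1,74' (decimal comma) to a float."""
    return float(cell.strip().replace(",", "."))


def minutes_per_page(total_minutes: float, pages: float) -> float:
    """Post-editing rate in mn/p, assuming rate = total time / number of pages."""
    if pages <= 0:
        raise ValueError("number of pages must be positive")
    return total_minutes / pages


# Cell values as they appear in the worksheet.
duree_tot = parse_decimal_comma("36")
nb_pages = parse_decimal_comma("1,74")

rate = minutes_per_page(duree_tot, nb_pages)
print(f"{rate:.1f} mn/p")  # prints "20.7 mn/p", matching the mn_p_page cell
```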

UNL graph with some corrections, for the English sentence: "It [Attavik.net] provides a content management system that allows native speakers to write, manage documents and offer online payments in the Inuit language." In linear form:

agt(provide(agt>thing,obj>thing).@entry,attavik.net(icl>entity))
obj(provide(agt>thing,obj>thing).@entry,system(icl>method).@indef)
gol(system(icl>method).@indef,management(icl>activity).@def)
mod(management(icl>activity).@def,content(icl>information))
gol(provide(agt>thing,obj>thing).@entry,:01)
and:01(write(agt>human,obj>thing).@entry,manage(icl>treat(agt>volitional thing,obj>thing)))
obj(:01,document(icl>information).@indef.@pl)
agt(:01,speaker(icl>role).@indef.@pl)
mod(speaker(icl>role).@indef.@pl,native(mod<thing))
[...],offer(icl>give(agt>thing,gol>thing,obj>thing)))
obj(offer(icl>give(agt>thing,gol>thing,obj>thing)),payment(icl>action).@indef.@pl)
mod(payment(icl>action).@indef.@pl,online(icl>place))
ins(offer(icl>give(agt>thing,gol>thing,obj>thing)),language(icl>system).@def)
mod(language(icl>system).@def,Inuit(icl>language))
agt(offer(icl>give(agt>thing,gol>thing,obj>thing)),speaker(icl>role).@indef.@pl)
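The linear form above is a list of binary relations, rel(node1,node2), optionally marked with a scope (as in and:01), where each node is a universal word followed by ".@" attributes. As a minimal sketch of how such a list could be read back into a graph, the following Python code splits each relation at its top-level comma. It is only an illustration under these assumptions, not one of the UNL tools, and all function names are hypothetical.

```python
from collections import defaultdict


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts


def parse_relation(line):
    """Parse 'rel[:scope](node1,node2)' into (rel, scope, node1, node2)."""
    line = line.strip()
    head, _, body = line.partition("(")
    if not body.endswith(")"):
        raise ValueError(f"malformed relation: {line!r}")
    rel, _, scope = head.partition(":")
    args = split_top_level(body[:-1])
    if len(args) != 2:
        raise ValueError(f"expected two arguments: {line!r}")
    return rel, scope or None, args[0], args[1]


def split_node(node):
    """Separate a node into its universal word and its '.@' attributes."""
    uw, *attrs = node.split(".@")
    return uw, attrs


def build_graph(lines):
    """Map each source universal word to its outgoing (relation, target) arcs."""
    graph = defaultdict(list)
    for line in lines:
        rel, _scope, src, dst = parse_relation(line)
        graph[split_node(src)[0]].append((rel, split_node(dst)[0]))
    return dict(graph)


# A few relations copied from the annex example.
SAMPLE = [
    "agt(provide(agt>thing,obj>thing).@entry,attavik.net(icl>entity))",
    "obj(provide(agt>thing,obj>thing).@entry,system(icl>method).@indef)",
    "gol(provide(agt>thing,obj>thing).@entry,:01)",
    "and:01(write(agt>human,obj>thing).@entry,"
    "manage(icl>treat(agt>volitional thing,obj>thing)))",
    "obj(:01,document(icl>information).@indef.@pl)",
]

graph = build_graph(SAMPLE)
for source, arcs in graph.items():
    for rel, target in arcs:
        print(f"{source} --{rel}--> {target}")
```

Counting parenthesis depth, rather than splitting on every comma, matters here because universal words such as provide(agt>thing,obj>thing) contain commas of their own.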