A Semantic-Based Approach for Multilingual Translation of Massive Documents

Sameh Alansary1,2
[email protected]
Magdy Nagi1,3
[email protected]
[email protected]
1 Bibliotheca Alexandrina, P.O. Box 138, 21526, El Shatby, Alexandria, Egypt.
2 Department of Phonetics and Linguistics, Faculty of Arts, Alexandria University, El Shatby, Alexandria, Egypt.
3 Computer and System Engineering Dept., Faculty of Engineering, Alexandria University, Alexandria, Egypt.

follows the path through semantic representations, it can demonstrate how sentences in the source language and target language are related to a common deep structure. In this method, machine translation tries to reproduce the precise contextual meaning of the author within the bare syntactic and semantic constraints of the target language. The semantic method emphasizes the content of the message rather than its effect. Because it does not want to miss any semantic nuance, it tends to be more detailed and more complex and, as a result, faces more challenges. Advocates of this method consider form and content equally important. Section 2 of this paper presents a semantic-based approach to multilingual document processing, built on an interlingua called Universal Networking Language (UNL). Section 3 discusses, in general, the extent to which the output of machine translation systems can be considered acceptable. Section 4 discusses the structure of the Arabic-UNL dictionary, through which Arabic can be processed in terms of UNL semantic graphs whether it is the source or the target language. Section 5 uses the UNL system to encode Arabic as a semantic graph that serves as an intermediate representation between the source language and other target languages. Section 6 discusses how UNL semantic graphs can be decoded to Arabic. Finally, Section 7 evaluates the Arabic translation obtained from the UNL semantic graph.
Abstract
This paper presents Universal Networking Language (UNL), an interlingua-based framework that facilitates semantic processing of natural languages by computer. UNL is an artificial language that describes the meaning of sentences in terms of the schema of semantic nets. The framework represents all sentences that have the same meaning, in any natural language, by a single semantic graph. Once this graph is built, it can be decoded into any other language. The paper treats Arabic as both a source and a target language and presents an evaluation of the Arabic decoding output at the morphological, syntactic and semantic levels. The evaluation is based on the results of translating one complete document from English to Arabic through UNL, from which multilingual translations can be obtained.
Noha Adly1,3
1 Introduction
Machine Translation (MT) has been brought to a large public by tools available on the Internet, such as Google, and by cheap programs such as Babylon. These tools produce a "gisting translation": a rough translation that "gives the gist" of the source text but is not otherwise usable. MT systems can produce translations more quickly and often more cheaply than human translators; however, in the majority of cases, the quality of MT is inferior to that of human translation (HT). An automatic semantic translation could come closer to the quality of HT. Semantic translation (a term proposed by Newmark (1981)) is carried out with reference to grammatical deep structure and aims at establishing semantic equivalence. If a translation
2 Towards Multilingualism
How enjoyable it is to communicate with communities that speak different languages; this is the essence of multilingualism. But how does this happen? How can different communities communicate although they speak different languages? This is possible if there is one universal language known among all communities. This universal language should carry the information existing in natural language, should be unambiguous, and should enable each community to express all knowledge conveyed by natural language. In fact, what we need is a formal universal language that captures the meanings present in human natural languages in a form understandable to all communities. The word is the unit of natural language that carries meaning. If words can be represented by concepts, or universal words (UWs), inside the universal language, it will be understandable to all communities. The universal language should also capture the relations between the concepts through semantic relations in order to convey the correct meaning.
difficult to decide on the right criteria for evaluating machine translation (MT) systems (King 1996). It is even more difficult to define precise measures that objectively describe what a good system consists of. Amgarten and Merz (1997) established a catalog of criteria that they considered crucial for a systematic assessment of MT systems. They divided the catalog into the following three parts: (a) technical criteria, which are important for the system’s performance but are not directly connected with the translation output; (b) linguistic criteria, which constitute the crucial points not only for evaluating the quality of the output of MT systems, but also for evaluating the lexicons and grammars implemented in them; (c) entrepreneurial criteria, which make sure the system is useful for the purpose it is bought for. Secondly, no simple answer can be expected or given when evaluating MT systems and suggesting which is the best MT system. As Church and Hovy (1993) have argued, even poor MT can be useful, or even ideal, in the right circumstances. That is why, where texts are translated to meet the requirements of official language policies, some may be intended for “information only” and may not require high-quality translation; these are called "informative texts". On the other hand, mistranslations in areas of specialization can have serious results, as in legal, medical, scientific, and technical translation. According to Hutchins (2003, 2004), translations produced by MT systems used for assimilation, interchange, and database access need not be of the best quality; i.e. poor quality is acceptable as long as the information is conveyed and the message is understood. By contrast, MT systems used for dissemination should produce output of "publishable" quality, which is the opposite of "poor" quality.
2.1 Universal Networking Language
UNL is an electronic language that enables articles written in different languages on the Internet to be rewritten in UNL format so that they can be translated into any other language. It is an interlingua-based framework that aims to facilitate semantic processing of natural languages by computer, and it follows the schema of semantic nets. As an interlingua for machine translation, it works as follows: a source language sentence is converted to the UNL form using a tool called the EnConverter; subsequently, the UNL representation is converted to the target language sentence by a tool called the DeConverter (Uchida 1996, 2003, 2005). The purpose of introducing UNL in communication networks is to achieve accurate exchange of information among different languages. UNL consists of four components: Universal Words (UWs), which form the vocabulary of the language, i.e. they can be considered the lexical items of UNL; UNL relations, which link UWs and are considered the syntax of the language; UNL attributes, which express several types of semantic information that usually modify the predication described by the net of UWs linked through the relations; and the UNL knowledge base (KB), which is the semantics of UNL.
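The components named above can be pictured as a small data model. The following Python sketch is only an illustration: the class names `UniversalWord` and `Relation` and their fields are assumptions made here, not part of any UNL tool, and the knowledge base is left out. The example UWs and the "obj" link between them are taken from the discussion of sentence (3) later in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniversalWord:
    """Hypothetical container for a UW: headword, constraint, attributes."""
    headword: str
    constraint: str = ""      # e.g. "icl>people"
    attributes: tuple = ()    # e.g. ("@past",)

    def __str__(self):
        base = self.headword
        if self.constraint:
            base += f"({self.constraint})"
        # Attributes are appended as ".@name", as in "earthquake.@def".
        return base + "".join(f".{a}" for a in self.attributes)


@dataclass(frozen=True)
class Relation:
    """Hypothetical labelled, directed link between two UWs."""
    label: str
    source: UniversalWord
    target: UniversalWord

    def __str__(self):
        return f"{self.label}({self.source}, {self.target})"


kill = UniversalWord("kill", "agt>thing,obj>living thing", ("@past",))
many = UniversalWord("many", "icl>people")
link = Relation("obj", kill, many)
print(link)
```

Keeping UWs immutable (`frozen=True`) lets the same concept object be shared by several relations, which matches the idea of a graph whose nodes are concepts.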
3 To what extent is machine translation's output acceptable?
For determining the extent to which any output of machine translation is acceptable, two problems should be made clear. Firstly, it is very
4 Building the Arabic-UNL Dictionary
In order to achieve multilingual translation through the UNL system, a UNL dictionary stores information about the kinds of Universal Words (UWs), i.e. concepts, that the language expresses and where those words can be used. In addition to UWs, it stores word headings for universal words that can express concepts, and information on the linguistic behavior of words. The Arabic UNL dictionary stores three types of linguistic information: morphological information, which is responsible for the correctness of the morphology of words; syntactic information, used to generate well-formed Arabic sentence structure; and semantic information about the semantic classification of words, which allows for correct mapping between semantic information in UNL graphs and the syntactic structure of the sentence under generation. For more information about the technical design of the Arabic UNL dictionary, cf. Alansary et al. (2006a, 2007).
two possible UWs, either graceful(icl>beautiful) or Waseem(iof>person).
(2) شارك وسيم في الحوار
'Waseem participated in the conversation'
If “graceful(icl>beautiful)” is selected, the graph in figure 2 will represent the meaning of the sentence in (2).
Figure 2. “Unlinked concepts – low graph quality”
However, the UWs identified for the words of the sentence were poorly chosen, which has led to an unlinked graph. If the concepts are well identified, the graph in figure 3 will be obtained.
5 Building UNL Graphs
The Enconverter is a language-independent parser that provides a framework for synchronous morphological, syntactic and semantic analysis. It is designed to transfer natural language into the UNL format, i.e. UNL expressions. For more information about how the Enconverter works, cf. Uchida and Zhu (2002). In order to encode any natural language sentence in terms of a UNL graph, two main stages are needed: a) extracting the UWs represented by the words of the natural language; b) linking these UWs with UNL relations. Identifying the correct UWs is a very important task for the UNL Enconverter because, unlike natural languages, UNL is lexically unambiguous. Consider the examples:
Figure 3. “Well-linked concepts – high graph quality”
Figure 3 is a high quality UNL-graph that illustrates the semantic encoding process needed for multilingual translation. Figure 4 represents the relation between the quality of semantic graphs and the expected translation quality.
Figure 4. “The influence of the graph quality on translation”
In order to build powerful Enconversion rules for Arabic, many linguistic phenomena have been dealt with, e.g. correct segmentation and identification of words, detecting discontinuous concepts, explicating implicit concepts, classifying Arabic verbs semantically, eliminating non-universal concepts, and determining the semantic function of prepositions (Alansary et al. 2007). In the subsequent discussion, the semantic encoding of the English sentence in (3) is traced until its equivalent semantic graph is obtained. The next section then decodes this graph into natural language, with Arabic as the target language. To encode the sentence in (3),
(1) a. I need to go to the bank at lunch time.
b. The flowers grow on sloping river banks.
c. A bank of switches.
UNL can solve the lexical ambiguity by discriminating between the three senses by means of UWs, because each UW has only one meaning. The word "bank" in a, b and c can therefore be represented by the UWs in d, e and f respectively.
d. bank(icl>organization)
e. bank(icl>shore)
f. bank(icl>amount)
The task of the UNL Enconverter in the first stage is to extract the correct UWs according to every sentence in (1). In the second stage, these concepts will be linked to build a semantic graph like that in figure 1.
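The first stage, choosing one UW per ambiguous word, can be illustrated with a deliberately naive sketch. The cue-word table `SENSE_CUES` and the function `select_uw` are hypothetical and are not how the EnConverter works (it relies on analysis rules and dictionary information); the sketch only shows that each occurrence of "bank" in (1) ends up mapped to exactly one unambiguous UW.

```python
import re

# Hypothetical cue words for each sense of "bank"; a real UNL dictionary
# and the Enconversion rules encode this knowledge differently.
SENSE_CUES = {
    "bank(icl>organization)": {"lunch", "money", "account"},
    "bank(icl>shore)": {"river", "flowers", "sloping"},
    "bank(icl>amount)": {"switches", "row"},
}


def select_uw(sentence):
    """Return the UW whose cue words overlap most with the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {uw: len(cues & tokens) for uw, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no supporting cue at all, refuse to guess.
    return best if scores[best] > 0 else None


for text in (
    "I need to go to the bank at lunch time.",
    "The flowers grow on sloping river banks.",
    "A bank of switches.",
):
    print(text, "->", select_uw(text))
```

Returning `None` when no cue matches reflects the point made in the text: a wrongly chosen UW is worse than an unresolved one, because it leads to an unlinked, low-quality graph.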
(3) The Kamchatka earthquake of October 6, 1737 produced a tsunami with a maximum height of inundation of around 60 meters and many were killed.
Figure 1. “UNL graph for the English sentence in (1b)”
Arabic, like any natural language, has ambiguities. The word “وسيم” could be linked to
For example the UNL graph in figure 6 can be deconverted to Arabic as in (4). For more information about how the Arabic deconverter works, cf. Alansary et al (2006b, 2007).
The Enconversion process starts by deleting unnecessary elements such as punctuation marks, particles and blanks, leaving an attribute on the following word to help in linking concepts and building a high-quality UNL graph. Other particles leave a UNL attribute that appears in the UNL expression. After the extraction of concepts has been accomplished, the following list of universal concepts is obtained from the English sentence in (3):
kill(agt>thing,obj>living thing).@past
produce(icl>cause(agt>thing,obj>thing)).@past
many(icl>people)
earthquake.@def
with(icl>how(obj>thing))
tsunami.@indef
maximum height.@indef)
inundation(icl>flood)
meter(icl>length unit).@pl
around(qua […]
[…] "kill(agt>thing,obj>living thing)", as it is marked as the entry of the semantic graph; it represents the main predicate of the sentence. Therefore, the first rule that applies tags the entry as part of the main skeleton of the sentence.
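The universal concepts listed for sentence (3) share a common surface shape: a headword, an optional constraint in parentheses, and optional attributes introduced by ".@". As a rough illustration, the sketch below splits such a string into its parts. The function name `parse_uw` and the regular expression are assumptions made for this example; they are not the official UNL syntax definition and only cover the simple forms shown in the text.

```python
import re

# Simplified pattern: headword, optional "(constraint)", optional ".@attr" run.
UW_PATTERN = re.compile(
    r"^(?P<head>[^(.]+)"
    r"(?:\((?P<constraint>.*)\))?"
    r"(?P<attrs>(?:\.@\w+)*)$"
)


def parse_uw(text):
    """Split a UW string into (headword, constraint, attributes)."""
    match = UW_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a well-formed UW string: {text!r}")
    attributes = tuple(re.findall(r"@\w+", match.group("attrs")))
    return match.group("head"), match.group("constraint") or "", attributes


print(parse_uw("produce(icl>cause(agt>thing,obj>thing)).@past"))
print(parse_uw("earthquake.@def"))
```

The constraint group is greedy so that nested parentheses, as in the UW for "produce", stay inside a single constraint string.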
Once the universal concepts have been extracted, the second stage links them by UNL relations. The relation stage includes two sub-stages. The first sub-stage constructs relations between UWs representing modifiers, while the second constructs relations between the UWs of the main skeleton of the sentence, linking all semantic arguments with the main predicate of the sentence. The graph in figure 5 is the final result.
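Whether the linking stage has produced a well-linked graph can be checked mechanically: every concept should be reachable from the entry node. The sketch below does this for a subset of the concepts of sentence (3), using only the relations that the paper names explicitly in its discussion of generation (and, obj, agt, man). The shortened node names, the direction chosen for the "and" relation, and the function `unlinked_concepts` are assumptions made for illustration.

```python
from collections import defaultdict, deque


def unlinked_concepts(entry, concepts, relations):
    """Return the concepts that cannot be reached from the entry node."""
    neighbours = defaultdict(set)
    for _label, source, target in relations:
        # Reachability ignores direction: a link connects both ends.
        neighbours[source].add(target)
        neighbours[target].add(source)
    seen = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return set(concepts) - seen


concepts = ["kill", "produce", "many", "earthquake", "with", "tsunami"]
relations = [
    ("and", "kill", "produce"),      # direction assumed
    ("obj", "kill", "many"),
    ("agt", "produce", "earthquake"),
    ("man", "produce", "with"),
    ("obj", "produce", "tsunami"),
]

print(unlinked_concepts("kill", concepts, relations))
print(unlinked_concepts("kill", concepts, relations[:2] + relations[3:]))
```

The second call drops the "agt" relation and reports "earthquake" as unlinked, which is the kind of low-quality graph the paper contrasts with a well-linked one.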
Figure 5. “UNL graph of the English sentence in (3)”.
6 Generating Natural Language
The DeConverter is a language-independent generator that provides a framework for syntactic and morphological generation as well as co-occurrence-based word selection for natural collocation. It can deconvert UNL expressions into a variety of native languages, using linguistic data such as the Word Dictionary, Grammatical Rules and Co-occurrence Dictionary of each language. The Deconverter transforms the sentence represented by a UNL expression into a Node-Net, then applies the generation rules to every node in the Node-Net and generates the word list in the target language.
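As a rough picture of how a node list can be derived from a Node-Net, the sketch below visits the entry node first, then the nodes directly linked to it (the main skeleton), and finally the arguments of those nodes (the modifiers). This two-phase traversal is an assumption made for illustration and is not the DeConverter's actual rule engine; the relations are the ones the paper reports for sentence (3), with shortened node names and an assumed direction for the "and" relation. The result is an insertion order for nodes, not the final Arabic word order, which the generation rules decide.

```python
def insertion_order(entry, relations):
    """List nodes in the order a two-phase traversal would insert them."""
    children = {}
    for _label, source, target in relations:
        children.setdefault(source, []).append(target)

    order = [entry]
    # Phase 1: main skeleton, i.e. nodes linked directly to the entry.
    skeleton = children.get(entry, [])
    order.extend(skeleton)
    # Phase 2: modifiers, i.e. arguments of already inserted nodes.
    queue = list(skeleton)
    while queue:
        node = queue.pop(0)
        for child in children.get(node, []):
            if child not in order:
                order.append(child)
                queue.append(child)
    return order


relations = [
    ("and", "kill", "produce"),      # direction assumed
    ("obj", "kill", "many"),
    ("agt", "produce", "earthquake"),
    ("man", "produce", "with"),
    ("obj", "produce", "tsunami"),
]
print(insertion_order("kill", relations))
```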
(5)
/[living thing)]/>>/
In (5) the first generated word in the Arabic sentence starts to appear after the application of the first rule in the generation process. Then the deconverter continues to insert other nodes from the graph to generate a complete sentence. Accordingly, another rule is applied to insert the node "produce(icl>cause(agt>thing,obj>thing))"
( )ثwhich is related to the main predicate of the sentence by “and” relation. Notice that in the semantic network in figure 6 the main entry is still connected to another argument, i.e. "many(icl>people)", by “obj” relation, therefore, a rule applies to insert the verb ""ث. As there is no other node linked to the entry node, the deconverter realizes that the main sentence structure has been completed as appears in (6) (6) -.* آ+ ث A new phase for inserting modifiers will start to insert the arguments related to the words that constitute the main skeleton of the sentence. The word “produce(icl>cause(agt>thing,obj>thing))” is linked with "earthquake" by an ‘agt’ “agent” relation, with "with(icl>how(obj>thing))" by ‘man’ “manner” relation and with "tsunami" by ‘obj’ “object” relation. Therefore, rules apply to insert the Arabic equivalences: ""ب, " 1" and "ال3 "زrespectively. At this point, all modifiers of the verb " "ثhave been inserted. The rules continue to insert the arguments of other nodes till the final node list in (7) is generated. (7)
Tsunami2. This document consists of 21 pages and 400 sentences. The whole document was enconverted to UNL graphs (as described in section 5), and these graphs were then deconverted to Arabic (as described in section 6). The next sub-section evaluates the quality of the Arabic translation.
7.1 Evaluation Methodology
It can be difficult to find strict criteria for evaluating Machine Translation output (Dyson and Hannah 1987). The evaluation relied on UNL specialists translating the UNL graphs manually and on comparing their translations with the output of the deconverter. The evaluation covered three levels: morphology, i.e. word structure; syntax, i.e. the well- or ill-formedness of the generated sentences, word order, case marking, prepositions and particles, and the order of modifiers; and semantics, i.e. whether or not the Arabic output still conveys the meaning expressed by the source language. Initial results of the comparison showed a morphological accuracy of 90%, a syntactic accuracy of 75% and a semantic accuracy of 85%. The following subsections evaluate examples in detail.
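The reported percentages are simple ratios of correct items to evaluated items. The sketch below shows the arithmetic; the helper names `accuracy` and `implied_count` are hypothetical. The counts it derives are only what the percentages would imply if every one of the 400 sentences of the document was scored at the syntactic and semantic levels, which is an assumption; the paper does not give raw counts, and morphological accuracy is measured over word forms, whose total is not stated.

```python
def accuracy(correct, total):
    """Percentage of evaluated items judged correct."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def implied_count(percent, total):
    """Number of correct items that a reported percentage implies."""
    return round(percent * total / 100)


SENTENCES = 400  # size of the evaluated document, as stated in the paper

# Assumption: all 400 sentences were scored at these two levels.
print(implied_count(75, SENTENCES))  # syntactic level
print(implied_count(85, SENTENCES))  # semantic level
```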
:+; 1 1737 - أآ6 )1 5 ال آ3ث ز -.* آ+ - 60 ? ع = ن ا1ار.
Once the syntactic stage is finished, the morphological stage starts by adding prefixes and suffixes. Finally, the deconverted Arabic sentence in (8) is generated from the English sentence in (3), passing through the intermediate representation in figure 6. Given the decoding rules and dictionary of another language, the same UNL graph in figure 6 can be generated into any other natural language.
(8)
7.1.1 Evaluating Word Formation
Morphological evaluation recorded the highest level of accuracy: 90% of the total word forms were correct. The Arabic morphological module has an excellent ability to deal with agreement patterns (verb-subject, noun-adjective, noun-number complement) and word formation. Only the generation of the definite article remains a problem. The generation of the definite article “ال” in the underlined words in (9) is undesired and is an example of a morphological problem.
ا1737 - أآ6 )1 5 ال آ3أث ز . A-.* آ+ و- 60 ? ع !?= ن ا1 ار:+;
The Arabic deconverter is capable of generating a variety of syntactic structures and handling multiple modifiers. The morphological component of the grammar is capable of inserting affixes and particles, generating correct word forms and achieving agreement between verb and subject, noun and adjective, and number and complement. For details, cf. Alansary et al. (2006b, 2007).
7.1.2 Evaluating Syntactic Structures
The focus of the syntactic criteria is the correctness of the word order of the generated sentences. As explained in Alansary et al. (2007), the main Arabic syntactic structure is the VSO structure. 75% of the sentences of the document are generated in the correct word order.
7 Evaluating the Generated Output
" * ال33 اC$$ يE" ا- ر اF(ّ اH ر .ت ا دةJ ا ا (by machine) (10) * ال33 اC$$ يE" ا- ر اF(ّ اH ر .ت دةJ " ا (by human)
(9)
This semantic-based approach for machine translation has been applied to one complete document from the Encyclopedia of Life Support Systems (EOLSS)1. This document is about
1 http://www.eolss.net
2 Published in EOLSS. Cf. footnote 1.
sentences of the Tsunami document are semantically well-formed, although some morphological and syntactic problems may exist. Errors of this type still do not prevent the meaning of the source text and its corresponding UNL graphs from being conveyed in the target language. The main source of semantic problems is the inability of the grammar to select the most suitable headword for a concept (UW) when the dictionary contains several headwords representing the same concept. Consider the sentence in (15) and its English source in (16).
(11) In specific cases, a landslide induced by an earthquake may amplify action of the tsunami wave.
When the automatically translated sentence in (9) is compared with the sentence in (10), which was translated by human UNL specialists, both are observed to have the same word order (the source sentence is in (11)). On the other hand, the system failed to generate the correct word order in (12), as compared with the human translation in (13). The sentence in (14) is the original English source sentence.
(12) (13)
!+ " ع اK " ا5& ءM ا ء ا= * وN$ (by machine) .* اM& ب-K " " ا-
(15) MK "! ا ا1 :! در+ P ا اE? ع ه1ار ا ا ا ه-%& ر$J اE&RA أنNPAو ."Tوا S)ُ ورة اP اVW S ا:! آ ن-$ا
ا ءN$ * اM& ً $A-+ *K1 " " ا- (by human) عK " ا5& N$ " ء اM ا= * و
(16) An elevation of such size is able to generate only a local tsunami, but the danger of volcano-created tsunamis for neighboring populated areas and for navigation should be taken into account.
(14) The wave velocity is decreased near the coastline because of shallower water and the slowing of the wave by the roughness of the bottom
[Figure 7: graph in which “slow” is linked to “wave(icl>phenomenon)” by obj and to “roughness” by agt]
Figure 7. “The semantic relation between the concepts ‘slow’, ‘wave’ and ‘roughness’ of the sentence in (14)”
The universal word “create(agt>thing,obj>thing)” is connected in the dictionary with the following possible entries:
[ ]اcreate(agt>thing,obj>thing) [ ] create(agt>thing,obj>thing) [ ]أمcreate(agt>thing,obj>thing) [] create(agt>thing,obj>thing)
The reason of the undesired underlined section in (12) is that the concept “slow” has two semantic arguments: “wave” is its ‘object’ and “roughness” is its ‘agent’ (figure 7). In decoding this network to Arabic, the structure VSO will be adopted. Therefore, “slow” ‘ءM’ will fulfill the 'V' slot, the agent “roughness” ‘" 5&’ will fulfill the 'S' slot and the object “wave” ‘" ’ will fulfill the 'O' slot. Subsequently, the undesired underlined part in (12) is generated. In order to be able to generate it as marked in the acceptable order in (13), the agent must fulfill "O" slot and the object must fulfill the "S" slot. However, the syntactic component in the grammar is unable to map the object and the agent with “S” and “O” slots respectively in its current status, otherwise many other structures will be destroyed.
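The slot assignment described above can be made concrete with a small sketch. English glosses stand in for the Arabic words, and the function `linearize_vso` with its `slot_map` parameter is a hypothetical illustration, not the grammar's actual rule format. With the default mapping (agent to the S slot, object to the O slot) the sketch reproduces the undesired order; swapping the mapping gives the order that the paper describes as acceptable for this sentence, which the current grammar cannot apply without breaking other structures.

```python
def linearize_vso(verb, arguments, slot_map=None):
    """Order a verb and its arguments as V, S, O.

    `arguments` maps UNL relation labels ("agt", "obj") to words;
    `slot_map` says which relation fills the S and O slots.
    """
    slot_map = slot_map or {"S": "agt", "O": "obj"}
    ordered = [verb]
    for slot in ("S", "O"):
        word = arguments.get(slot_map[slot])
        if word is not None:
            ordered.append(word)
    return ordered


arguments = {"agt": "roughness", "obj": "wave"}

# Default mapping: agent fills S, object fills O (the undesired order).
print(linearize_vso("slow", arguments))
# Swapped mapping: object fills S, agent fills O (the acceptable order).
print(linearize_vso("slow", arguments, {"S": "obj", "O": "agt"}))
```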
The headword “-) ”اis mistakenly selected by the generation grammar. In its current state, the grammar is not enriched with semantic constraints that help in selecting the most suitable headword in the current example (N$), which remains one of the challenges of the future.
7.1.3 Evaluating Meaning
In evaluating the meaning of the automatically translated sentences, we focused on whether the generated Arabic sentences are meaningful and correspond to both the original source sentence and the sentence translated manually by human specialists. Accordingly, sentences such as those in (9) and (12) are examples of semantically well-formed sentences. 85% of the
8 Conclusion
The UNL system can make the dream of language-independent semantic analysis a reality; therefore, it is suited for multilingual processing of massive documents. If UNL is added to network platforms, it will allow people to communicate through different natural languages, enabling them to access and share information in their native languages, since language barriers will be broken. EnConverting Arabic structures in terms of UNL graphs is possible; therefore, Arabic can be used as a source language in a multilingual UNL-based machine translation system. The Arabic language could be successfully generated from UNL graphs (Interlingua) with a high degree of accuracy. However, some problems still exist at various linguistic levels; these will be taken into consideration in updating the grammar.
References
Alansary, S., Nagi, M. and Adly, N. (2006a). Processing Arabic Content: The Encoding Component of an Interlingual System for Man-Machine Communication in Natural Language. The 6th International Conference on Language Engineering, Cairo, Egypt.
Alansary, S., Nagi, M. and Adly, N. (2006b). Generating Arabic Text: The Decoding Component of an Interlingual System for Man-Machine Communication in Natural Language. The 6th International Conference on Language Engineering, Cairo, Egypt.
Alansary, S., Nagi, M. and Adly, N. (2007). Communicating in Arabic in Cyberspace. Information and Communication Technology International Symposium (ICTIS07), Arabic Natural Language Processing Workshop, Fez, Morocco.
Amgarten, M. and Merz, D. (1997). Towards a Systematic Evaluation of Machine Translation Systems. http://citeseer.ist.psu.edu/cache/papers/cs/3373/http:zSzzSzwww.ifi.unizh.chzSzCLzSzEstland.PaperszSzAmga_Merz.pdf/towards-asystematic-evaluation.pdf
Dyson, M. and Hannah, J. (1987). Toward a Methodology for the Evaluation of Machine-Assisted Translation Systems. Machine Translation 2(3): 163-176.
Guessoum, A. and Zantout, R. (2004). Methodology for Evaluating Arabic Machine Translation Systems. Machine Translation 18(4): 299-335.
Hovy, E. (2002). Principles of Context-Based Machine Translation Evaluation. Machine Translation 17: 43-75.
Hutchins, W. J. and Somers, H. L. (1992). An Introduction to Machine Translation. London: Academic Press.
Hutchins, J. (2003). Machine translation and computer-based translation tools: what's available and how it's used. A presentation at the University of Valladolid (Spain). http://ourworld.compuserve.com/homepages/WJHutchins/Valladolid-2003.pdf
Hutchins, J. (2004). Current commercial machine translation systems and computer-based translation tools: system types and their uses. International Journal of Translation 17(1-2): 5-38.
Uchida, H. (1996). UNL: Universal Networking Language – An Electronic Language for Communication, Understanding, and Collaboration. UNU/IAS/UNL Center, Tokyo, Japan.
Uchida, H. (2003). Knowledge Description Language. Semantic Computing Workshop, Tokyo, Japan.
Uchida, H. and Zhu, M. (2005). UNL2005 for Providing Knowledge Infrastructure. Semantic Computing Workshop (SeC2005), Chiba, Japan.