Machine Translation (2005) 18:299–335 DOI 10.1007/s10590-005-2412-3

© Springer 2005

A Methodology for Evaluating Arabic Machine Translation Systems

AHMED GUESSOUM
Department of Computer Science, The University of Sharjah, PO Box 27272, Sharjah, UAE
E-mail: [email protected]

RACHED ZANTOUT
Faculty of Engineering, Department of Computer and Communications Engineering, Hariri Canadian University, Mechref, PO Box 10–Damour, Chouf 2010, Lebanon
E-mail: [email protected]

Abstract. This paper presents a methodology for evaluating Arabic Machine Translation (MT) systems. We are specifically interested in evaluating lexical coverage, grammatical coverage, semantic correctness and pronoun resolution correctness. The methodology presented is statistical and is based on earlier work on evaluating MT lexicons, in which the importance of a specific word sense to a given application domain determines how much its presence or absence in the lexicon affects the MT system's lexical quality, which in turn affects the overall system output quality. The same idea is generalized in this paper so as to apply to grammatical coverage, semantic correctness and correctness of pronoun resolution. The approach adopted in this paper has been implemented and applied to evaluating four English–Arabic commercial MT systems. The results of the evaluation of these systems are presented for the domain of the Internet and Arabization.

Key words: evaluation, lexicons, coverage, lexical semantic correctness, grammatical coverage, pronoun resolution, Arabic

1. Introduction

Natural Language Processing (NLP) systems, like any software, need to be properly evaluated. On the one hand, they need to be assessed by system developers for the technological improvements and novel research ideas put forward as solutions to the various problems faced in the area. On the other hand, they need to be assessed by potential users, whose number is continuously increasing, for quality comparison purposes. In terms of quality assessment, the evaluation of NLP systems can be divided into two main approaches: "glass-box" and "black-box" evaluations (Hutchins and Somers, 1992; Arnold et al., 1993; Nyberg et al., 1992). In black-box evaluation, the evaluator has access only to the input and output of the system under evaluation. In glass-box evaluation, the evaluator also has access to the various workings of the


system and can thus assess each subpart of the system independently of, and in association with, the others. In addition to quality assessment, an NLP system, just like any other software, needs to be evaluated for its cost, performance, stability, maintainability, and portability, among other criteria.

Like NLP systems, Machine Translation (MT) systems are most of the time assessed using a glass-box or black-box evaluation. In addition, some researchers have also pointed out the need for component-based evaluation and detailed error analyses (Arnold et al., 1993; Nyberg et al., 1992; Hedberg, 1994). Since MT systems combine lexical analyzers, morphological analyzers, parsers, semantic disambiguation modules, generators, and pragmatic analysis modules, it is important to be able to evaluate these various components individually as well as to evaluate the overall system.

The evaluation of MT systems introduces a number of additional complications. The assessment of the quality of some (system or component) output may depend on the evaluator's background, skills, or even taste. For instance, given that MT systems are bilingual, or even multilingual, a need arises for evaluators/translators who have a good grasp of the source as well as the target languages. The evaluation process will be affected by the degree of proficiency of the human evaluator in the various facets of the languages involved in the translation. Also, various MT systems may use different approaches to translation and may be used in application domains or settings that cause their performance to be judged differently. Moreover, given that evaluators often do not have access to the internal workings of the system under evaluation, and are therefore forced into black-box evaluation, their interpretations of their evaluations do not necessarily yield appropriate diagnoses of the observed errors. These arguments explain why the evaluation of MT system quality is far from a simple task. In addition, one should note that MT systems usually produce raw output whose quality necessitates post-processing by a human translator. Thus the evaluation of the performance of an MT system will also depend on the translator's acquaintance with the system. Therefore, the overall evaluation of a given MT system should also take into account the effects introduced by the (human) post-processor (and, sometimes, even preprocessor).

There has been a growing interest in the development of NLP tools for Arabic. Having had the chance to try a few Arabic MT systems, we unfortunately quickly came to the conclusion that a lot of work remains to be done. The first effort we judged worth carrying out was the evaluation of various Arabic MT systems, in order to understand the reasons for the poor quality we had (informally) noticed. The most striking reason for the weakness we have noticed in Arabic MT systems (and Arabic NLP more generally) is that, contrary to the West, the Arab


world still does not realize the strategic importance of MT (and NLP) in the globalization era (Guessoum and Zantout, 2000). This has had negative consequences for the development of Arabic MT. Moreover, evaluation of NLP and/or MT systems has been accepted in the West as an important area of study in its own right. The need to evaluate various Arabic MT systems more formally thus naturally arose. Such evaluations would allow a more precise assessment of the state of the art of Arabic MT, an identification of the areas of strength and weakness (e.g. lexicons, parsers, transfer modules), and, consequently, a better understanding of the effort needed and the areas that require it most urgently.

With this in mind, we decided to evaluate a number of Arabic MT systems. These are Al-Mutarjim Al-Arabey, Al-Wafi, and Al-Misbar from ATA Soft Tech.; Arabtrans from Arab.Net; Al-Nakel from CIMOS; and Ajeeb from Sakhr. The results of these evaluations are presented in what follows. In Section 2 we present some of the previous work on the evaluation of MT systems and explain which aspects of an MT system need to be assessed. In Section 3 we introduce the methodology we have adopted for evaluating aspects of MT systems, including lexical coverage, grammatical coverage, semantic correctness, and quality of pronoun resolution. The implementation of the various parts of the methodology and their testing on various Arabic MT systems is presented in Section 4, along with a discussion of the results and a comparison of our work with related work. We conclude in Section 5 by summarizing our contribution and giving suggestions for future work.

2. MT System Evaluation Efforts

A serious issue to be considered when dealing with an MT system is the ability to evaluate it scientifically. The aim here is to establish its usefulness and compare it with other already available systems. This evaluation should cover both computational and linguistic features of an MT system and should be informative, especially to potential users of the system, as to its capabilities and updatability. However, an important part of the evaluation is to be performed by the developers to check that the system performs as intended, that it produces acceptable translations, and, if need be, that it can be gracefully improved. It is indeed important to scrutinize any MT system, analyzing the quality of its output, classifying the errors it makes, and improving it based on such an evaluation. The evaluation can thus be concerned with the technical quality of the system; it can also be concerned with the limitations of the system, the software engineering aspects, and/or the costs and benefits (Hutchins and Somers, 1992).

Evaluation of MT systems has attracted the interest of funding agencies since the early stages of MT development. Some effort has already been devoted to the


development of evaluation criteria, metrics, and methodologies. Hutchins and Somers (1992) provide a good survey of the various kinds of evaluations, namely: quality assessment in terms of accuracy, intelligibility, and style; error analysis; and benchmark tests. The evaluation should also target an assessment of the computational limitations of the MT system as well as its costs and benefits for potential purchasers and users. A number of contributions exist, such as Dyson and Hannah (1987) and Lehrberger and Bourbeau (1988) on evaluation by users, and Nagao (1985), Melby (1988) and King and Falkedal (1990) on methodologies for MT evaluation. Vasconcellos (1988) is a collection of papers which includes a number of discussions of methods for MT system evaluation. Mellish and Dale (1998) explain how the problems of evaluating natural language generation differ from those of evaluating work in natural language understanding. Notable evaluations of MT systems are those of Systran (van Slype, 1979a; Wilks, 1991) and of Logos (Sinaiko and Klare, 1972, 1973). Major projects exist such as the DARPA project (White et al., 1994), the DiET project (Klein et al., 1998) at DFKI (Germany), and the European EAGLES project (King et al., 1996) for the development of diagnostic and evaluation tools for NLP applications.

Evaluation of MT systems has indeed been an active area and has produced an abundant literature. Among the above references is the special issue of this journal which was specifically devoted to the evaluation of MT systems (Arnold et al., 1993) and the references mentioned therein. One also finds the results of evaluating a large set of MT systems in Mason and Rinsche (1995).

In addition to the various approaches to evaluation mentioned above, we can note that there seems to be an agreement as to which aspects should be evaluated. One of these is "adequacy", i.e. the extent to which the meaning of the source text is rendered in the translated text. Another is "fluency", i.e. the extent to which the target text appeals to a native speaker in terms of well-formedness, grammatical correctness, absence of misspellings, adherence to common usage of terms, and meaningfulness within the context (White et al., 1994). Another aspect that should be evaluated is "informativeness", which assesses the extent to which the translated text conveys enough information from the source text to enable evaluators to answer various questions about the source based on the translated text alone. One can also test for "intelligibility", which is strongly related to informativeness, though directly affected by grammatical errors and mistranslated or missing words.

In van Slype (1979b), a detailed study of the methods that had been developed for evaluating MT is presented. The report subdivides the evaluation features into two main categories: macro-evaluation and micro-evaluation. Macro-level evaluation attempts to assess systems at the cognitive level (intelligibility, fidelity, coherence, usability, and acceptability), at the economic level (reading time, correction time, and translation time), at the linguistic level (reconstruction


of semantic relationships, syntactic and semantic coherence, "absolute" quality, lexical evaluation, syntactic evaluation, and analysis of errors), and at the operational level (automatic language identification and verification of the claims of the manufacturer). Micro-evaluation methods can be subdivided into five groups: the grammatical symptomatic level (analysis of grammatical errors found in the target output), the formal symptomatic level (revision and post-editing rates), the diagnostic level (analysis of the causes of errors), the forecast level (analysis of the improvability of the system), and the therapeutic level (detection of the improvements to the system following an upgrade).

Nyberg et al. (1994) introduce a methodology based on evaluation metrics for knowledge-based MT. The evaluation criteria they consider are: "completeness", which measures the ability of a system to produce an output for every input; "correctness", which measures the ability of a system to produce a correct output for every input; and "stylistics", which measures the appropriateness of the lexical, grammatical, and other choices made during the translation process. Concerning lexicon evaluation, the previous criteria were refined as follows. Lexical completeness was taken to mean that for every word or phrase in the translation domain, the system under evaluation has source- and target-language lexicon entries. Lexical correctness refers to the fact that words are correctly chosen in the target sentence to realize the intended concept. Finally, in terms of stylistics, lexical appropriateness means that each word selected for output is the most appropriate (and correct) choice for the context. Based on the completeness, correctness, and stylistics criteria, the authors then defined four evaluation measures, which give, as percentages, the Analysis Coverage, Analysis Correctness, Generation Coverage, and Generation Correctness. These four percentages are then multiplied, yielding the Translation Correctness, which measures the overall quality of the system.

Carroll and Briscoe (1998) survey parser evaluation methods. The authors divide these methods into corpus-based and non-corpus-based methods, the former being subdivided into annotated and unannotated corpus-based methods. Non-corpus-based approaches list the linguistic constructions covered by a particular parser as given by the grammar developer. Under unannotated corpus-based methods one finds the calculation of the "coverage", i.e. the percentage of sentences from a given unannotated corpus that are assigned one or more analyses by a parser/grammar; the calculation of the "parse base" of a grammar, i.e. the geometric mean of the number of analyses divided by the number of input tokens in each sentence parsed; and the calculation of the "entropy", which gives a measure of the degree to which a probabilistic language model captures regularities in the corpus by minimizing unpredictability and ambiguity. Finally, in terms of methods that use annotated corpora, one finds the following approaches. "Part-of-speech assignment accuracy" gives a measure of the accuracy with which a part-of-speech tagger assigns the correct lexical


syntactic category to a word in running text; "structural consistency" is measured as the percentage of sentences in a manually annotated test corpus which receive one or more analyses that are consistent with the correct analysis in the corpus; "best-first ranked consistency" measures the accuracy of a probabilistic parsing system by computing the percentage of highest-ranked analyses output by the parser/grammar which are identical to a manual analysis provided in an annotated test corpus (treebank). Other approaches use tree similarity measures of various types; the Grammar Evaluation Interest Group scheme, which measures the similarity of an analysis to a test corpus analysis; and a dependency structure-based scheme in which phrase-structure analyses from the parser and treebank are both automatically converted into sets of dependency relationships.

A general framework for evaluating MT systems was presented in Hovy et al. (2002). The authors described "the principles and mechanisms of an integrative effort in MT evaluation". In particular, they surveyed the various criteria and metrics used in previous MT evaluations and integrated most of them into two taxonomies, one specifying the MT system's context of use and the other developing a quality model. In terms of automated evaluation, it is worth noting the two methods BLEU (Papineni et al., 2002) and RED (Akiba et al., 2002). In the former, n-gram-based metrics are used to evaluate MT systems automatically using human translations as references. In RED, the system learns a decision tree (ranker) which is then used for ranking various MT system outputs; edit distance measures are used to perform the ranking. BLEU and RED evaluate MT system output quality; they are not concerned with individual component evaluations. Culy and Riehemann (2003) showed that using n-grams in BLEU as a metric gives a measure of document similarity more than of translation "goodness", and that BLEU should therefore be used with caution. According to Akiba et al. (2002), too, caution should be exercised when using the automatic evaluators BLEU and RED.

While, as mentioned above, the evaluation of MT and NLP tools has been accepted as a subarea in its own right, evaluation of Arabic MT tools is still very tentative and quite unsystematic (Anon, 1996; Jihad, 1996; Qendelft, 1997). These articles present brief surveys of a number of Arabic MT systems including Transphere, Arabtrans, and Al-Wafi. These attempts at evaluating Arabic MT systems did not rely on any formal evaluation methodology. Despite this, the evaluations have shown various shortcomings of the available Arabic MT systems. For instance, a brief evaluation of Al-Wafi was presented in the journal Arabuter (Anon, 1996). The author of that article gave examples of English–Arabic translations performed by Al-Wafi. There were still serious contextual reasoning problems with getting the word order or verb tenses right, choosing the right word or expression out of a number of alternatives, or translating ambiguous sentences. As a result, translation done by Al-Wafi was shown to produce texts which were at times not faithful or accurate with respect to the source text, and at


other times quite unintelligible. One example of the weakness of the state of Arabic MT software is the manual that a major PC manufacturer and retailer sent with its personal computers to one of the Saudi governmental institutions. Reading through the manual, the user would have a hard time understanding what it is trying to communicate. In fact, the output reflected many deficiencies in translating even basic computer terms. Two more recent evaluations of Arabic MT system components have been reported by the present authors (Guessoum and Zantout, 2001a,b), in which methodologies were introduced for the evaluation of the lexicon coverage and grammatical coverage of MT systems.

3. Evaluation Methodology

Our main goal in developing a new methodology for MT system evaluation is to make the process as objective as possible. In fact, we want to obtain some form of component evaluation based on a black-box view of the MT system. To achieve this goal we take a statistical approach to evaluating the MT system components, so as to evaluate the lexical coverage, the grammatical coverage, the semantic correctness, and the pronoun resolution correctness.

We start with a brief summary of the methodology used for evaluating the lexical coverage of an MT system lexicon. Later, we modify this methodology so as to deal with the evaluation of the remaining aspects. The lexicon evaluation methodology makes use of the notion of word sense and its "weight". The weight of a word sense reflects its relative (statistical) occurrence in the language, which is closely related to the "importance" of this word sense in the language (for a given application domain). We take the methodology for evaluating lexicons of Guessoum and Zantout (2001a) and generalize it to the evaluation of grammatical correctness, semantic correctness, and pronoun resolution correctness. A different approach to the evaluation of grammatical coverage, based on the concept of "unfolded grammatical coverage", was introduced in Guessoum and Zantout (2001b); it differs from the one adopted here.

3.1. EVALUATION OF THE LEXICON

In order to evaluate the lexicons of MT systems, it is necessary to know the types of errors that may occur when translating text using MT systems. In this section, we focus on the errors in translation due to shortcomings in the system lexicon. There are two main types of lexical errors that may occur when translating from one language to another using an MT system. First, errors can occur because of a word missing in the lexicon. Second, errors can occur because the correct word


sense does not exist in the lexicon even though the word may exist with a different sense. For most languages, a single word can have different senses according to the context in which the word is used. In (1), the word 'bank' appears in two sentences. In (1a), it gets translated as al-masraf المصرف while in (1b) as al-jaanib الجانب, meaning 'side'. Two completely different Arabic words are used to translate the same English word.

(1)

a. انتظرت داخل المصرف
   intadhartou daakhila -lMaSrif
   WAITED-I INSIDE THE-BANK
   'I waited inside the bank.'
b. انتظرت بالجانب الشمالي من النهر
   intadhartou bilJaanibi -Shamaaliyyi mina -Nahr
   WAITED-I BY-THE-SIDE THE-NORTHERN OF THE-RIVER
   'I waited on the northern bank of the river.'

This demonstrates that the existence of a lexical occurrence of a specific word in the lexicon is not sufficient. Indeed, the rating of the lexicon coverage should depend on the percentage of different word senses the lexicon covers rather than on the percentage of words. This means that the evaluation of lexicons should deal with word senses and not just words. To this end, we built a database of word senses and their occurrence ratios in text corpora. English word senses were represented in their root form in the database of word senses. For instance, the English word bank would have two different entries, each giving a different translation in Arabic. Each entry accumulates its own occurrence count, so that words like banking, banked, or banks (in the financial sense) would all be counted as occurrences of the same word sense.

A lexicon is said to "cover" a word sense if this word sense (not just the word) appears in the lexicon. To check whether an MT system covers a word sense, the tool presents the evaluator with the word (sense) as part of a sentence. The evaluator checks whether the translation given by the MT system is one of the acceptable translations in that context. If this is the case, we say that the lexicon covers that word sense. The "lexical coverage" of a lexicon was defined as the ratio of all the word senses covered by the lexicon to the total number of word senses in the source language of the lexicon, as represented by the database of word senses extracted from the selected corpora.

In this work we refer to "domains" in which the user translates texts. These domains are general concepts under which all the texts that the user intends to translate might fall. For example, Biology, Chemistry, and Computer Science are a few examples of different domains in which a user of an MT system might be interested


in translating texts. These domains are not necessarily mutually exclusive and some might even be complete subsets of others. The main use of domains in evaluating lexicons is the distinction between the importance of different word senses based on the domain in which the translation is being done. The English word formula is frequently used in the domains of Mathematics and Chemistry, whereas it is not as frequently used in Computing or Geography. This means that if the word formula is absent from the lexicon of an MT system which is used to translate texts in the Computing domain, it should not affect the evaluation of such a lexicon as much as its absence would when the MT system is considered for translation in the Mathematics domain. Therefore, a prerequisite to evaluating lexicons is to specify the domain in which the evaluation is to take place.

The importance of each word sense for the selected domain in the source language was determined by providing the evaluator with a database of word senses and the frequencies of their use in the domain in which the evaluation is to be made. In this database, word senses were divided into three classes depending on whether the word sense occurs frequently, normally or rarely in the domain. A detailed discussion of how the number of word classes and the occurrence ratio boundaries were selected is presented in Guessoum and Zantout (2001a).

The methodology eventually yields two measures: the coverage of a lexicon with respect to a class (i.e. the class of frequently, normally, or rarely occurring word senses) (2),

(2)   $\mathrm{Coverage}(\mathrm{Lexicon}, C_j) = \dfrac{\sum_{m_i \in C_j} [\mathrm{Occurrences}(m_i) \cdot d(m_i, \mathrm{Lexicon})]}{\sum_{m_i \in C_j} \mathrm{Occurrences}(m_i)}$

and the overall coverage (3) for the MT system lexicon,

(3)   $\mathrm{Coverage}(\mathrm{Lexicon}) = \sum_{j=1}^{N} [CW(C_j) \cdot \mathrm{Coverage}(\mathrm{Lexicon}, C_j)]$

where $CW(C_j)$ is the class weight of class $C_j$ (a number which reflects the importance of a class to a lexicon). Equation (3) thus provides a single value that can be used to assess the quality of an MT system lexicon and compare lexicons across MT systems. Further details can be found in Guessoum and Zantout (2001a).
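As an illustration only (this is not the authors' implementation), Equations (2) and (3) can be computed from a word-sense database as in the following Python sketch; the data layout and the names `sense_db`, `lexicon_covers` and `class_weights` are hypothetical.

```python
def class_coverage(sense_db, lexicon_covers, cls):
    """Equation (2): occurrence-weighted coverage of one class of word senses.

    sense_db: dict mapping word-sense id -> (class_label, occurrence_count)
    lexicon_covers: function word-sense id -> 1 if the lexicon covers it, else 0
    cls: "frequent", "normal" or "rare"
    """
    senses = [(s, occ) for s, (c, occ) in sense_db.items() if c == cls]
    total = sum(occ for _, occ in senses)
    covered = sum(occ * lexicon_covers(s) for s, occ in senses)
    return covered / total if total else 0.0


def overall_coverage(sense_db, lexicon_covers, class_weights):
    """Equation (3): class-weighted average over all classes of word senses."""
    return sum(w * class_coverage(sense_db, lexicon_covers, c)
               for c, w in class_weights.items())


# Toy example with made-up occurrence counts (class weights as reported later in Table II):
db = {"bank/finance": ("frequent", 40), "bank/river": ("rare", 3)}
covers = lambda s: 1 if s == "bank/finance" else 0
print(overall_coverage(db, covers,
                       {"frequent": 0.241, "normal": 0.427, "rare": 0.332}))
```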

3.2. EVALUATION OF GRAMMATICAL CORRECTNESS

As mentioned above, we have selected the black-box evaluation approach since we want to evaluate commercial systems and, hence, we have no access to their inner workings. Even so, it is desirable to be able to draw enough conclusions about various components of the system from this evaluation. In such a setting,


the evaluation may not be able to pinpoint the error source, but it will give an indication as to which component of the system is malfunctioning. Grammatical errors that appear in the output of an MT system are due either to an incorrect parsing of the source sentence or to an incorrect generation from the (correct) internal representation of the source sentence. Of course, a combination of these two cases is possible, but it can be reduced to two successive errors, one in the parser and one in the generator. The assumption here is that the transfer rules do not contain any mistakes; otherwise, the sources of grammatical error in a system would be the parser, the transfer rules, and the generator.

Definition. An MT system covers a grammatical structure if the system produces a (target) sentence (for this source grammatical structure) whose grammatical structure is one of the expected/correct grammatical structures for the source sentence. A partial coverage of a grammatical structure is defined as the system producing a sentence whose grammatical structure is close enough to one of the expected/correct grammatical structures for the source sentence.

Definition. The grammatical coverage of an MT system is defined as the ratio of the number of grammatical structures that this system covers to the total number of grammatical structures in the (source) language.

We should point out that, in order to have a correct assessment of the MT system's grammatical coverage, this evaluation should be done independently of erroneous behavior of any of the other system components. For instance, the grammatical coverage should be computed using sentences which do not contain any errors due to improper system lexical coverage. This remark is in fact general in that, for the evaluation of any MT system component, it is important to make sure that no errors which may point to shortcomings in other system components are present. Therefore, in this part of the evaluation only the parser gets evaluated and not the entire system.

Similar to the case of lexicon evaluation, grammatical structure evaluation can be done using one of two methods.

Method I

This method relies on assigning the same weight to all the language's grammatical structures. This approach is more suitable for users who require the MT system to cover all the language's grammatical structures with the same level of importance. The grammatical coverage of the MT system can thus be computed using Equation (4),

(4)   $\mathrm{Grammatical\_Coverage}(MT) = \dfrac{\sum_{i=1}^{N} d(G_i, MT)}{N}$

where $MT$ is the MT system to be evaluated, $G_i$ is the $i$th grammatical structure,

$d(G_i, MT)$ = 1 if $G_i$ was handled completely correctly by the MT system,
             = 0.5 if $G_i$ was handled partially correctly,
             = 0 if $G_i$ was handled incorrectly,

and $N$ is the number of grammatical structures in the language (i.e. the number of test structures).

In this evaluation we have chosen to use a three-point scale (0, 0.5, 1) for rating the output of the MT program. This was done because we believe that even if the MT system does not use the correct grammatical structure, its output might, in some cases, be understandable although not 100% grammatically correct. As such, a score of 0.5 indicates that the sentence structure is not totally correct, though reasonably close to one of the correct translation grammatical structures. This was not the case for lexical coverage, since a lexicon either contains a word sense or it does not.

Also, in this evaluation, the selection of the various grammatical structure forms was done based on criteria supplied by Arab linguists. There was thus no rigorous scientific basis for selecting grammatical structures, and we do not claim in this paper that we have covered all grammatical structures. Furthermore, our selection of grammatical structures has been based only on the source language: we have informally selected various categories of sentences taking into account the grammar of the source language. This could be developed further by finding a more systematic way of generating the various grammatical structures that need to be tested for. Moreover, further work should be carried out to establish more rigorously the importance (weight) of any grammatical structure in a language. A tool able to identify the grammatical structures in a large number of texts for each domain in question would then be implemented. Such a tool could be used to collect statistics about the occurrence of each grammatical structure in the corpus and therefore determine the importance of the grammatical structure.

Method II

In this approach, the evaluator assigns a different weight to each grammatical structure of the language. This weight reflects the statistical occurrence of each grammatical structure in the set of test sentences used in the test database.


The idea is a straightforward extension of the notion of word-sense weight, which was defined earlier for the purposes of lexicon evaluation. Developing a grammatical tool for counting the number of occurrences of each grammatical structure throughout a set of test sentences will allow the evaluator to compute the weight of each grammatical structure of the language. This can be done using Equation (5),

(5)   $W(G_i) = \dfrac{\mathrm{Occurrences}(G_i)}{\sum_{j=1}^{N} \mathrm{Occurrences}(G_j)}$

where $G_i$ and $N$ are defined as above, $\mathrm{Occurrences}(G_i)$ is the number of occurrences of the $i$th grammatical structure in the language (the test sentences in this case), and $W(G_i)$ is the weight of the $i$th grammatical structure $G_i$. The grammatical coverage of the MT system is then calculated using Equation (6).

(6)   $\mathrm{Grammatical\_Coverage}(MT) = \sum_{i=1}^{N} [W(G_i) \cdot d(G_i, MT)]$
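As a minimal, purely illustrative sketch (not the authors' tool), Equations (4)–(6) amount to a plain and a weighted average of the per-structure ratings; the variable names below are assumptions.

```python
def coverage_method_one(scores):
    """Equation (4): equal-weight grammatical coverage.

    scores[i] is the evaluator's rating d(G_i, MT) in {0, 0.5, 1}.
    """
    return sum(scores) / len(scores)


def coverage_method_two(scores, occurrences):
    """Equations (5)-(6): coverage weighted by structure frequency.

    occurrences[i] is how often structure G_i occurs in the test sentences.
    """
    total = sum(occurrences)
    weights = [occ / total for occ in occurrences]      # Equation (5)
    return sum(w * d for w, d in zip(weights, scores))  # Equation (6)


# Four hypothetical test structures:
scores = [1, 0.5, 0, 1]
occurrences = [120, 40, 10, 30]
print(coverage_method_one(scores), coverage_method_two(scores, occurrences))
```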

Using one of the above two methods, the evaluator can calculate the grammatical coverage of an MT system. It should be clear, however, that Method II is more refined than Method I. Its main advantage is that it gives more weight to the grammatical structures that are commonly used in a given language.

The above methodology for evaluating the grammatical coverage is very similar to that used for the evaluation of the lexical coverage. Whereas in the latter the basic unit was a word sense, for which we defined the concepts of occurrence ratios or weights, in the former we manipulate grammatical structures. This becomes the unit that is the basis of the evaluation; in other words, each of these units has its own occurrence ratio (weight), which signals how frequently (or rarely) this particular unit (grammatical structure) occurs in a given language. The procedures that implement the above methodology (Equations (4)–(6)) are similar to the procedures build_wordsense_db and evaluate_lexicon, which were presented in Guessoum and Zantout (2001a).

3.3. EVALUATION OF SEMANTIC CORRECTNESS

Semantics is the study of linguistic meaning, or the study of the meaning of words and sentences (Jurafsky and Martin, 2000). A major characteristic of any mature NLP or MT system is its semantic correctness. We will be concerned only with


those aspects of semantic correctness that depend on lexical correctness. In other words, we restrict semantic correctness in an MT system to a measure of how many word senses in the source sentence are equivalent to the meanings of their corresponding words in the target sentence. Thus we consider that semantic errors occur only because of an incorrect selection of the proper word sense among the different senses available in the lexicon. Consider the sentences shown in (1) earlier. A semantically erroneous MT system might produce one or both of the incorrect translations shown in (7).

(7)

a. I waited inside the bank.
   *انتظرت داخل جانب النهر
   *intadhartou daakhila jaanibi-Nahr
   WAITED-I INSIDE THE-RIVER-BANK
b. I waited on the northern bank of the river.
   *انتظرت بالمصرف الشمالي من النهر
   *intadhartou bilMaSrifi -Shamaaliyyi mina -Nahr
   WAITED-I IN-THE-(FINANCIAL)BANK THE-NORTHERN OF THE-RIVER

Because bank has two distinct translations in Arabic, as explained above, the two sentences in (7) were incorrectly translated by using the wrong Arabic sense of bank in each case. To an Arabic-speaking person reading the output in (7), neither translation would convey any meaning.

Semantic evaluation of the translation could be achieved by manually comparing the translation of the MT system to an expert human translation. This necessitates dividing both the source and target texts into fragments in order to compare each source text fragment against the corresponding one in the target text (White et al., 1994).

Our approach for evaluating an MT system (lexico-)semantically consists of submitting to the MT system under evaluation test sentences for each word sense available in the language. Testing the system is done with texts whose word senses and grammatical structures are covered by the system lexicon and parser, respectively. These two assumptions are made to make the semantic evaluation independent of the shortcomings of other modules. The test sentences are carefully chosen so that different word senses appear independently in each set of test sentences. For example, the two English sentences in (1) would appear in the test sentence database to test whether the system can differentiate between the two senses of the English word bank. As was specified earlier, the system will already have been tested as to whether its lexicon contains both senses of the word bank and whether its parser can correctly process the grammatical structures of both test sentences. (The same applies to words with different grammatical categories, such as types, which can be a verb or a plural noun.)


After submitting a test sentence to the MT system, we check whether the tested word is translated with the correct word sense. Note that an MT system's translation of a given word is compared to the senses available in the database of word senses. The evaluator assigns a score of 1 for a semantically correct translation and a score of 0 otherwise. An evaluator is allowed to sort domain word senses into different classes, as was done in the lexical coverage evaluation. In this case, a tool should be available that calculates the importance of a word sense depending on its occurrence ratio in the database for the selected domain.

The approach for evaluating the semantic correctness of a given MT system requires a number of steps:

1. Calculate the sizes of the various classes of word senses in addition to the size of the database. (See details in Guessoum and Zantout (2001a).)

2. Calculate the class weight for each class of word senses. (See details in Guessoum and Zantout (2001a).)

3. Calculate the (lexico-)semantic correctness of the MT system with respect to each class of word senses. This correctness can be computed in one of two ways, according to the evaluator's need:

   a. In the case of "local discrimination" among class elements, the lexico-semantic correctness of an MT system with respect to a class is calculated using Equation (8),

      (8)   $\mathrm{Semantic\_correctness}(MT, C_j) = \dfrac{\sum_{m_i \in C_j} [\mathrm{Occurrences}(m_i) \cdot d(m_i, MT)]}{\sum_{m_i \in C_j} \mathrm{Occurrences}(m_i)}$

      where $C_j$, $m_i$, and $\mathrm{Occurrences}(m_i)$ are as defined previously, $MT$ is the evaluated MT system, and $d(m_i, MT) = 1$ if $m_i$ was translated correctly by the MT system, 0 otherwise.

   b. In the case of no local discrimination among class elements, the semantic correctness of a class in the MT system is calculated using Equation (9),

      (9)   $\mathrm{Semantic\_correctness}(MT, C_j) = \dfrac{\sum_{m_i \in C_j} d(m_i, MT)}{\mathrm{elements}(C_j)}$

      where $MT$, $C_j$, $m_i$, and $d(m_i, MT)$ are defined as above, and $\mathrm{elements}(C_j)$ is the number of different word senses in the database of word senses (with one occurrence for each word meaning) which are in the class $C_j$.


4. Calculate the overall semantic correctness for the MT system using Equation (10).

   (10)   $\mathrm{Semantic\_Correctness}(MT) = \sum_{j=1}^{N} [CW(C_j) \cdot \mathrm{Semantic\_correctness}(MT, C_j)]$

Note that, if the evaluator decides to use only one class of word senses with no local discrimination, then the formula for calculating the overall semantic correctness percentage for the MT system reduces to Equation (11).

(11)   $\mathrm{Semantic\_Correctness}(MT) = \dfrac{\sum_{m_i \in DB} d(m_i, MT)}{\mathrm{elements}(DB)}$
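To make the two modes concrete, here is a hedged sketch (illustrative only; the data layout and names are assumptions, not the authors' code) of Equations (8)–(11), reusing the same kind of word-sense database as in the lexicon-coverage sketch above.

```python
def class_semantic_correctness(sense_db, correct, cls, local_discrimination=True):
    """Per-class semantic correctness.

    sense_db: dict word-sense id -> (class_label, occurrence_count)
    correct:  function word-sense id -> 1 if the MT system chose the right sense, else 0
    """
    senses = [(s, occ) for s, (c, occ) in sense_db.items() if c == cls]
    if local_discrimination:
        # Equation (8): weight each sense by its occurrence count.
        total = sum(occ for _, occ in senses)
        hits = sum(occ * correct(s) for s, occ in senses)
    else:
        # Equation (9): every sense in the class counts once.
        total = len(senses)
        hits = sum(correct(s) for s, _ in senses)
    return hits / total if total else 0.0


def overall_semantic_correctness(sense_db, correct, class_weights, local=True):
    # Equation (10); with a single class and no local discrimination this
    # reduces to Equation (11).
    return sum(w * class_semantic_correctness(sense_db, correct, c, local)
               for c, w in class_weights.items())
```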

3.4. EVALUATION OF PRAGMATIC CORRECTNESS: PRONOUN RESOLUTION AND CASE ENDINGS

Pragmatics studies how context influences communication. In other words, pragmatics is the study of how to do things with words, or the study of the meaning of language in context (Jurafsky and Martin, 2000). Pragmatic correctness is an important factor for any MT system that is to produce high-quality target text. However, since pragmatics is a wide and fast-growing field, a complete evaluation of pragmatic correctness turns out to be complex and beyond the scope of this research. In addition, a question can be raised about the pragmatic correctness of an MT system: can the pragmatics be conveyed correctly from a source language to a target language just by keeping the same surface meaning, modulo appropriate grammatical structures in the target language? Or are there examples where one needs to find a suitable target surface meaning, the deep meaning being the realm of reasoning tools rather than "translation tools"? For our part, we restrict our interest in the evaluation methodology to two aspects: pronoun resolution (anaphora and cataphora) and resolution of case endings. We have analyzed corpora of English–Arabic translations and have concluded that the kinds of frequent errors that occur are related to the aspects mentioned below.

Let us first point out that Arabic has the pronouns shown in Table I. These pronouns are "detached" pronouns used in the nominative form. For instance, (12a) could be translated in one of three forms (12b–d), depending on the context and on whether they gets resolved as dual, masculine, or feminine. Note that, in the nominative form, the pronoun is optional, and so it is shown in parentheses. Also, the same verb takes various endings for number and gender agreement. In fact, if the pronouns are omitted, as just explained, it is still clear from the verb which number and/or gender is meant.


Table I. Arabic pronoun system

| Person | Singular | Dual^a | Plural^b |
|---|---|---|---|
| 1 | 'I' anaa أنا | | 'we' naHnou نحن |
| 2 | 'you (masc)' anta أنتَ; 'you (fem)' anti أنتِ | 'you' antoumaa أنتما | 'you (masc)' antoum أنتم; 'you (fem)' antounna أنتنّ |
| 3 | 'he'/'it (masc)'^c houa هو; 'she'/'it (fem)' hia هي | 'they' houmaa هما | 'they (masc)' houm هم; 'they (fem)' hounna هنّ |

Notes: ^a It is incorrect in Arabic to use plural forms instead of dual, though this error is not uncommon. ^b The masculine plural form is used to refer to groups other than all-feminine ones. ^c Nouns in Arabic are either masculine or feminine.

(12)
a. They ate the oranges.
b. (هما) أكلا البرتقال
   (houmaa) akalaa alBourtouqaal
   (THEY(dual)) ATE(dual) THE-ORANGES
c. (هم) أكلوا البرتقال
   (houm) akalou alBourtouqaal
   (THEY(masc)) ATE(masc pl) THE-ORANGES
d. (هنّ) أكلن البرتقال
   (hounna) akalna alBourtouqaal
   (THEY(fem)) ATE(fem pl) THE-ORANGES

Pronouns can also be attached to verbs as suffixes in the accusative and genitive forms as exemplified in (13). In (13a), it may refer to orange, which is feminine in Arabic. In (13b), it may refer to the rule, which is also feminine in Arabic. In both examples, all the complements appear as attached pronouns, so that the resulting sentences each consist of exactly one word. (13)

a. أكلتُها
   akaltouhaa
   ATE-I-IT(fem)
   'I ate it.'
b. علّمتَنيها
   'allamtaniihaa
   TAUGHT-YOU-IT(fem)-ME
   'You taught me it.'


The use in Arabic of attached and detached pronouns can very easily make sentences incorrect if the wrong pronoun (possibly as a case ending) is used. This explains why we have singled out pronoun resolution and treated it along with case ending resolution. As a matter of fact, our analysis of corpora of English–Arabic translations has helped us lay out principles for selecting useful test sentences for pronoun resolution. Assume that in the sample source sentence (or text excerpt) a number of nouns occur along with a given pronoun. Different cases may occur, as follows.

1. If the pronoun is it, then at least two of the singular nouns in the sentence must have distinct genders for the sentence to be useful in pronoun resolution evaluation. The reason is that it should then be translated as هو (masculine) or هي (feminine), depending on the noun it refers to.

2. If the singular nouns are either all masculine or all feminine, and the pronoun is it, the sentence would not be useful unless the system tends to translate it always as masculine or always as feminine, in which case we may see the error.

3. The same applies to the pronoun they, if at least two nouns have different numbers (dual or plural), possibly combined with gender. That is, they could be هما (dual), هنّ (feminine), or هم (plural other than all feminine).

The selection of the proper case endings (suffixes) is a similar problem, even when it does not strictly deal with pronouns as explained above. In this case, the problem is to check whether a system that translates into Arabic produces the correct gender and/or number agreement(s). For example, the sentence (14a) could be incorrectly translated as (14b), which has incorrect gender agreement between the verb and the object the car, which is feminine in Arabic.

(14)

a. I liked the car.
b. أعجبني السيارة
   aajabani alSayaratou
   LIKED-IT(masc)-I THE-CAR

Another example would be (15). In the translation, the subject is dual, so drank should be شربا sharibaa (dual), not شربوا sharibou (masculine plural).

(15)

The two men walked in the street, then drank coffee.

Clearly, a wrong treatment of case endings can make the attached pronoun (case ending) refer to the wrong noun and, hence, change the meaning of the sentence. This is why we have included the problem of handling case endings in this paper under pragmatics with pronoun resolution, though the reader may argue that it belongs to the realm of syntax. Our approach to evaluating an MT system with respect to pronoun resolution is based on submitting test sentences which contain different examples of pronouns in


order to check whether the pronoun is translated (resolved) into a correct target pronoun. The evaluator assigns a score of 1 for each pronoun or case ending that is translated correctly and 0 otherwise. In the following, we use "pronoun resolution" to mean either pronoun resolution or case ending resolution. The correctness of pronoun resolution is calculated using Equation (16),

(16)   $\mathrm{Pronoun\_Resolution\_Correctness}(MT) = \dfrac{\sum_{i=1}^{N} d(PR_i, MT)}{N}$

where $MT$ is the evaluated MT system, $PR_i$ refers to the $i$th pronoun in the sample text, $d(PR_i, MT) = 1$ if the $i$th pronoun was translated correctly by the MT system, 0 otherwise, and $N$ is the total number of pronouns in the test cases.

Before we conclude this section, it is worth mentioning that our evaluation of pronoun and case ending resolution correctness may seem language-dependent, since it considers special characteristics of Arabic as the target language, namely the use of attached and detached pronouns. However, we believe that the approach is general enough in that it can be fine-tuned for the evaluation of MT systems between other pairs of languages. The evaluator would just need to find out how pronouns are treated in the source and target languages. The methodology explains how to do the evaluation and how to select test sentences that exhibit the specific phenomena that need to be tested for.

4. Evaluating Arabic MT Systems

4.1. ARABIC MT SYSTEMS SELECTED

We have chosen to test our MT system evaluation methodology on various commercial Arabic MT systems. Those we managed to find on the market and purchase are Al-Wafi and Al-Mutarjim Al-Arabey, developed by ATA Software Technology Inc.; Arabtrans by ArabNet; and Al-Nakel by CIMOS. We were not able to purchase Transphere (by Apptek) despite multiple attempts and direct contacts with the company representatives in Saudi Arabia. We have also evaluated two web-based translation systems, Ajeeb by Sakhr and Al-Misbar by ATA Software Technology.

ATA has introduced three English–Arabic MT systems: Al-Mutarjim Al-Arabey, Al-Wafi, and Al-Misbar. The first of these is a PC-based system produced by ATA Software Technology Limited (in collaboration with Al-Farahidy Technology Information, AFTI, of Muscat, Sultanate of Oman). The system initially started as a postgraduate research degree project at Brunel


University, UK. A demo version was presented at Gitex, Dubai, in 1994. In 1995, the complete product, Al-Mutarjim Al-Arabey, was introduced. ATA claims that Al-Mutarjim Al-Arabey is the first English–Arabic MT system ever to be developed on personal computers (ATA, 1997). According to ATA (1997) and Al-Jundi (1997), some good points of the system are its comprehensive dictionaries (300,000 "lines of words", as reported), a good level of "text context analysis", the introduction of different word senses, whenever available, and the correct translation of most common abbreviations.

Al-Wafi was also developed by ATA Software Technology Limited. In fact, we found out from our evaluation that it uses exactly the same MT modules as Al-Mutarjim Al-Arabey, except for a less extensive lexicon. ATA has also developed Al-Misbar, a web-based translation engine available at http://www.almisbar.com/. Having found out through the evaluation that the above three systems have the same engine, except for differences in lexicons, we present below the results of the evaluations of the three systems under the same title "ATA".

Arabtrans was developed by Arab.Net Technology Limited. It works in a Microsoft Windows environment (3.1 or later). Source texts can be entered from files, interactively, or scanned. According to its developers, Arabtrans translates texts from English to Arabic at more than 1,000 words per minute and, unlike the output of professional translators, "… the translation produced by the program requires editing for both grammatical accuracy and to check whether alternative meanings are preferable" (Arab.Net, 1996).

Sakhr (i.e. Al-Alamiyah), the Cairo-based company, introduced its Ajeeb translation product during the year 2000, and it was made available as a web-based Arabic–English translation engine at http://www.ajeeb.com during the Dubai Gitex computer exhibition in October 2000. At Gitex 2001, Sakhr launched its Arabic–English MT system.

CIMOS, the Paris-based company, developed Al-Nakel, an MT system which translates between the three languages English, Arabic, and French. Al-Nakel is available as a PC-based system with a restrictive license per PC used. For Al-Nakel, only the grammatical coverage, semantic correctness, and pronoun resolution correctness evaluations have been done. The lexicon evaluation was not possible due to a change of machine, which would have required purchasing the system anew; this was not possible.

4.2. EVALUATING THE LEXICONS

As explained in Section 3.1, the methodology for evaluating MT system lexicons relies on the existence of a database of word senses of the source language as well as statistics giving their occurrence ratios. Accordingly, a tool has been built which


– extracts all the word senses from the collected corpus and builds a database of word senses;
– incrementally updates the occurrence ratios of the various word senses that exist in the database; and
– divides the word senses into three classes (frequent, normal, and rare) based on the word senses' occurrence ratios.

More details about this tool and how the class occurrence-ratio boundaries were selected can be found in Guessoum and Zantout (2001a). The corpus was selected from the domain of the Internet and Arabization. The total number of word senses in this database is 6,308, while the total number of distinct word senses is 1,319. Using this classification of word senses, the computed sizes and weights of the classes in the constructed word-sense database are given in Table II.

Table II. Sizes and weights of classes

|              | Frequent | Normal | Rare  | Total |
|--------------|----------|--------|-------|-------|
| Class size   | 1520     | 2692   | 2096  | 6308  |
| Class weight | 0.241    | 0.427  | 0.332 |       |
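For illustration, the corpus-processing tool described at the start of this subsection could be organized roughly as in the following sketch. This is a hedged illustration only: the sense-tagging step is assumed to exist, and the class-boundary ratios used here (0.5% and 0.05%) are placeholders rather than the boundaries actually chosen in Guessoum and Zantout (2001a).

```python
from collections import Counter


def build_sense_db(tagged_corpus, hi_ratio=0.005, lo_ratio=0.0005):
    """Count word-sense occurrences and split them into frequency classes.

    tagged_corpus: iterable of word-sense ids, one per token occurrence
                   (produced by some sense-tagging step, assumed here).
    Returns a dict: word-sense id -> (class_label, occurrence_count).
    """
    counts = Counter(tagged_corpus)
    total = sum(counts.values())
    db = {}
    for sense, occ in counts.items():
        ratio = occ / total
        cls = ("frequent" if ratio >= hi_ratio
               else "normal" if ratio >= lo_ratio
               else "rare")
        db[sense] = (cls, occ)
    return db
```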

The results of the lexicon evaluations of the various Arabic MT systems are shown in Table III. The first set of rows reports the results of the evaluation for the case where no local discrimination was applied, while the second set shows the results where local discrimination was applied. Note that in Guessoum and Zantout (2001a), the MT systems Ajeeb and Al-Misbar (included here under ATA) were not evaluated.

Table III. Lexicon evaluation results

| Local discrimination | Class of word sense | ATA   | Arabtrans | Ajeeb |
|----------------------|---------------------|-------|-----------|-------|
| Without              | Frequent            | 1.000 | 1.000     | 0.667 |
|                      | Normal              | 0.953 | 0.916     | 0.953 |
|                      | Rare                | 0.936 | 0.901     | 0.936 |
|                      | Overall             | 0.956 | 0.931     | 0.901 |
| With                 | Frequent            | 1.000 | 1.000     | 0.679 |
|                      | Normal              | 0.973 | 0.922     | 0.944 |
|                      | Rare                | 0.959 | 0.920     | 0.951 |
|                      | Overall             | 0.973 | 0.940     | 0.903 |

Table III shows that the three systems (that have been evaluated for lexical coverage) have adequate lexicons. ATA has the best lexical coverage in both


cases (with and without local discrimination) and outperforms Ajeeb and Arabtrans for the coverage of all three classes of word senses. ATA and Arabtrans completely cover the frequent word senses, which is not the case for Ajeeb. Concerning the three systems developed by ATA (Al-Misbar, Al-Wafi and Al-Mutarjim Al-Arabey), we reached the conclusion that Al-Misbar, which is available on the World-Wide Web, has the best lexical coverage, followed by Al-Mutarjim Al-Arabey.

Comparing the results with and without local discrimination, we can see that all three systems have better coverage percentages when local discrimination is used. This is because all three systems cover the frequent senses well. Had one of the systems had a problem with the class of frequent word senses, this would have shown up as a lower rating when using local discrimination. The improvement in the local discrimination case is also partly due to the high coverage of normal and rare word senses. Upon analysis, we realized that most of the missed word senses in the test cases are words which have more than one sense. This points to the fact that, in the MT system lexicons evaluated, alternative senses of the same words have been overlooked, either intentionally or unintentionally, by the system lexicon developers.

One issue that should be clarified here is the apparent contradiction between the results obtained in this section (good coverage of all three lexicons) and the poor evaluations of some of these Arabic MT systems reported by Jihad (1996), Qendelft (1997), and Anon (1996). Although the lexicon is an important part of an MT system, it is clearly not the only part. As will be seen next, the shortcomings of the MT systems under evaluation are in areas such as grammatical coverage, semantic correctness, and pronoun resolution correctness. This means that even though the developers of the systems have clearly put a lot of effort into the lexicons of their respective systems (for the domain of the Internet and Arabization), the resulting overall improvement is not satisfactory. Note that what the results presented in this paper show is that this domain is well covered by the three systems. Other domains will require separate evaluations using databases derived from text corpora collected in those domains.

4.3. EVALUATING THE GRAMMATICAL COVERAGE

As mentioned in Section 3.2, when selecting sample texts for evaluating grammatical coverage one should assign test cases for each grammatical structure available in the language (or for most of them). In our case, we have tried to do this for the most common grammatical structures of the English language. These are verb phrases with various verb forms, basic tenses, progressive tenses, various forms of conjunction, noun phrase and verb phrase combinations, some prepositional phrases, auxiliary verbs, active and passive voice sentences, wh-questions and relative clauses. The sentences were selected so as to test for the


above structures. Usually a number of sentences would be used to test any of the constructs. Appendix A shows some of the grammar test cases. Following the methodology presented in Section 3.2, we obtained the results shown in Table IV.

Table IV. Evaluation of the overall grammatical coverage

| ATA    | Arabtrans | Ajeeb  | Al-Nakel |
|--------|-----------|--------|----------|
| 56.5 % | 32.0 %    | 64.0 % | 60.5 %   |

In the grammatical evaluation of the Arabic MT systems, it was assumed that all the grammatical structures of English carry the same weight. Appendix A shows a subset of the test cases used for the grammatical coverage evaluation. The final results, shown in Table IV, reveal a weakness in the grammatical coverage of all the systems evaluated. The maximum grammatical coverage is 64%, obtained by Ajeeb, a very low score which explains the frequently poor quality of the output. Table V details the results with respect to the various classes of input sentence grammatical structures. It is obvious from this table that the MT systems perform quite poorly on a number of grammatical categories. Arabtrans has almost consistently performed worse than the other three systems. Overall, the four systems score between 32% and 64%, which remains quite weak; this is emphasized by the fact that in very few cases did any system score 80% or above.

Table V. Details of the grammatical coverage for the four systems

Grammatical structure                         ATA      Arabtrans  Ajeeb    Al-Nakel
Verb forms                                    65.91    35.00      63.64    65.91
Simple tenses                                 59.20    83.00      65.52    64.94
Progressive tenses                            38.46    17.00      53.85    30.77
Various forms of conjunction                  70.00    40.00      95.00    70.00
Noun phrase and verb phrase combinations      50.00    31.00      62.50    81.25
Auxiliary verbs                               70.00    16.00      65.00    55.00
Active voice sentences                        58.51    17.00      64.89    61.17
Passive voice sentences                       25.00    25.00      50.00    50.00
Wh-questions                                  42.86    18.00      50.00    42.86
Relative clauses                              58.33    8.00       58.33    33.33
All combinations                              56.50    32.00      64.00    60.50
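The per-category and overall figures in Tables IV and V can be derived from human-scored test sentences. The sketch below is only illustrative: the paper does not give its exact aggregation formula, so here every category is unweighted, as stated in this section, sentence scores use the 1 / 0.5 / 0 scale of Appendix A, and the overall figure is simply the mean over all test sentences.

```python
def grammatical_coverage(scored_tests):
    """Per-category and overall grammatical coverage, in percent.

    `scored_tests` maps a grammatical category (e.g. "Passive voice
    sentences") to the list of scores (1, 0.5 or 0) that a human judge
    assigned to the system's outputs for that category's test sentences.
    """
    per_category = {
        category: 100 * sum(scores) / len(scores)
        for category, scores in scored_tests.items()
    }
    all_scores = [s for scores in scored_tests.values() for s in scores]
    overall = 100 * sum(all_scores) / len(all_scores)
    return per_category, overall

# Example with two small categories of scored test sentences.
per_cat, overall = grammatical_coverage({
    "Wh-questions": [1, 0.5, 0, 1],
    "Passive voice sentences": [0, 0.5],
})
print(per_cat, overall)  # {'Wh-questions': 62.5, 'Passive voice sentences': 25.0} 50.0
```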


4.4. EVALUATING LEXICAL SEMANTICS

Since it is very difficult to assign test cases for all word senses in the language, we have selected a subset of the word senses which were collected in the database for the lexical coverage phase. This subset consists of all the words that have more than one sense. We then made up a test sentence for each of these word senses. Appendix B presents some of the semantic correctness evaluation test cases and their translation results using the Arabic MT systems. The results are shown in Table VI.

Table VI. Evaluation of semantic correctness

ATA       Arabtrans   Ajeeb     Al-Nakel
83.59%    51.50%      71.09%    70.08%

Although the lexical coverage evaluation shows that the MT systems evaluated have a large number of words in their lexicons, those high scores do not carry over to semantic processing. The systems score between 51.5% (Arabtrans) and 83.59% (ATA), which means that the wrong word sense is selected in roughly one fifth to one half of the cases. Considerable improvement is needed here to ensure better-quality translations.
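The semantic correctness figure itself is a simple ratio; a minimal sketch, assuming one test sentence per ambiguous word sense scored 1 or 0 as in Appendix B:

```python
def semantic_correctness(scores):
    """Percentage of ambiguous test sentences translated with the intended sense.

    `scores` is the list of 1/0 judgements (as in Appendix B), one per test
    sentence built around a word that has several senses in the lexicon.
    """
    return 100 * sum(scores) / len(scores)

print(semantic_correctness([1, 0, 1, 1]))  # 75.0
```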

4.5. RESOLUTION OF PRONOUNS AND CASE ENDINGS

Appendices C and D show some of the pronoun and case-ending resolution test cases and the results of translating them with the selected systems. The evaluation was performed as explained in Section 3.4. The results are displayed in Table VII.

Table VII. Evaluation of pronoun and case-ending resolution correctness

Type of evaluation           ATA       Arabtrans   Ajeeb     Al-Nakel
Pronoun resolution           n/a       61.00%      n/a       61.11%
Pronoun and case ending      42.55%    n/a         51.06%    n/a

In the pronoun resolution evaluation, we tried to cover all the pronouns in the language along with the different cases in which a pronoun can occur inside a sentence. Table VII shows quite low scores for pronoun and case-ending resolution correctness. We have found that when case endings are also taken into account, the scores drop even lower, as shown in the table. We have also reached a general conclusion from the evaluation of the various MT systems: in most cases, pronouns are incorrectly resolved when whole paragraphs are considered. It seems that the span of pronoun resolution in these systems is mainly limited to the sentence in which a pronoun occurs. The developers of these systems need to put more emphasis on correct pronoun resolution across whole paragraphs, not just within the same sentence. We have also noted that Arabic MT systems tend to translate pronouns like "it" quite uniformly as هو (masculine) when it should at times be feminine, and "they" as هم (non-feminine plural) when it should at times be dual or feminine plural. Given this serious drawback, the evaluations could have produced much worse results had we insisted on test sentences that mainly highlight this issue. In other words, the more of these test sentences we used, the lower the scores the systems would have obtained.

4.6. SUMMARY OF EVALUATION RESULTS

Table VIII summarizes the final results of the evaluation of the available Arabic MT systems, which leads us to the conclusion that these systems have good lexicons for the test domain, but they need to improve their capabilities for translating all grammatical structures correctly. Moreover, these systems should focus on the semantics of the translated text in order to choose the appropriate word sense when different senses of the translated word exist in their lexicons. In addition, all the systems evaluated need to ensure that pronoun (and case-ending) resolution is paragraph- or text-based during translation rather than sentence-based, as is currently the case. As such, our recommendation is that more effort should be put into parts other than the lexicon in order to achieve noticeably better results.

Table VIII. Summary of evaluation results

Type of evaluation                                    ATA      Arabtrans   Ajeeb    Al-Nakel
Lexicon (with local discrimination)                   0.973    0.940       0.903    n/a
Lexicon (without local discrimination)                0.956    0.931       0.901    n/a
Grammatical structures                                56.50    32.00       64.00    60.50
Semantic correctness                                  83.59    51.50       71.09    70.08
Case-ending correctness                               n/a      61.00       n/a      61.11
Pronoun resolution and case-ending correctness        42.55    n/a         51.06    n/a

4.7. RELATED WORK

One finds in van Slype (1979b) a summary of a number of approaches whose aim was to calculate various statistics and ratios about different features of MT systems.

In terms of lexical evaluation, Miller and Beebe reportedly (Halliday and Briss, 1977) suggested an evaluation score consisting of the ratio of the number of words common to the human translator (HT) and MT versions to the total number of words in the HT version. Our lexicon evaluation is different in that we evaluate the quality of an MT system lexicon by systematically checking all the word senses it contains for a given language, taking into account the relative importance (weights) of those word senses in the language.

In terms of grammatical evaluation, a number of approaches exist. Miller and Beebe again established an a priori list of syntactic constructions, took the results of HT and MT, and calculated the ratio of the number of constructs common to the MT and HT versions over the total number of occurrences of the listed syntactic constructions in the HT version. Our approach takes into account the weight of each syntactic construct and whether it is translated into an appropriate grammatical construct of the target language by the MT system. Weissenborn (cited by van Slype, 1979b) suggested a syntactic evaluation based on the ratio of the number of source-language analysis grammar rules existing in the MT system to the number of grammatical rules of the source language for the type of texts to be treated. Thus, grammatical coverage in Weissenborn's work was defined only in terms of the source language, whereas we take into account the coverage by the MT system of the source language modulo the weights of its grammatical constructs, as well as the appropriateness of the target grammar construct produced by the MT system. The same comments apply to the differences between our approach to grammatical coverage evaluation and that adopted by Chaumier et al. (1977), where a finer scrutiny is made of the grammatical (sub)constructs in the source and target texts, e.g. noun phrases, adjectival and verb phrases, object complements, adverbial complements, etc.

Chaumier et al. (1977) also study pronoun resolution correctness by counting (a) the ratio of the number of pronouns, translated or not, that are recognized as pronouns and linked to the appropriate words at the appropriate place, over the number N of pronouns and pronoun phrases in the source sentence; (b) the ratio of the number of pronouns and pronoun phrases translated, whether the agreement is correct or not, to N; and (c) the ratio of the number of pronouns and pronoun phrases correctly translated with correct agreement and elision, to N. In our case, pronoun resolution correctness is similar to (c).

In terms of semantics, Andrewsky (1978) assesses quality by calculating the ratio of the number of correct semantic relationships in the MT texts to the number of wrong semantic relationships in the same texts. In our case, we have not pushed the study of semantic correctness beyond assessing whether the correct word sense is selected when more than one is available in the lexicon, which is quite different from Andrewsky's focus.
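For concreteness, the lexical and pronoun ratios summarized above can be restated as formulas; the notation is ours and is introduced only for this summary.

```latex
% Miller and Beebe's lexical score: vocabulary shared by the MT output and the
% human translation (HT), relative to the size of the HT version.
R_{\mathrm{lex}} = \frac{\lvert W_{\mathrm{MT}} \cap W_{\mathrm{HT}} \rvert}{\lvert W_{\mathrm{HT}} \rvert}

% Chaumier et al.'s pronoun ratios, with N the number of pronouns and pronoun
% phrases in the source sentence and P_a, P_b, P_c the counts of pronouns
% meeting conditions (a), (b) and (c) above; our pronoun resolution
% correctness corresponds to R_c.
R_a = \frac{P_a}{N}, \qquad R_b = \frac{P_b}{N}, \qquad R_c = \frac{P_c}{N}
```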


5. Conclusion

We have introduced in this paper a methodology for evaluating MT system components in a black-box setting. The methodology is a generalization of the MT system lexicon evaluation methodology of Guessoum and Zantout (2001a), in which a number of concepts were introduced. The whole methodology is based on the central concept of the word sense occurrence ratio (or weight). This ratio gives a statistical assessment of the importance of a word sense in a given domain. This points to another new feature, namely that the ratio may differ for the same word sense when the application domain changes. This distinction is important if we want an accurate assessment of lexicon coverage based on the existence of word senses in the targeted application domain. Based on this central notion, all the word senses of a language are divided into classes according to the range in which the ratio falls. This division into classes gives an immediate quantitative indication of whether a word sense occurs frequently in a given domain or not.

The same central idea of word sense weights (or occurrence ratios) was generalized to the evaluation of grammatical coverage, semantic correctness, and pronoun and case-ending resolution correctness. In the case of grammatical coverage, we introduced the notion of grammatical structures coupled with weights. This helps to understand which grammatical structures the MT system should be able to handle and which ones are not frequently used in the language and are thus less critical if handled incorrectly by the MT system.

The evaluation methodology has been implemented and tested on four Arabic MT systems: ATA, Arabtrans, Ajeeb, and Al-Nakel. The results have confirmed our initial informal evaluation of these systems. They have shown that, except for their good lexical coverage of the domain of the Internet and Arabization, these systems have performed rather poorly: roughly between 32% and 64% correctness on grammatical coverage; between 51% and 84% on lexical semantic correctness; and between 43% and 61% on pronoun and case-ending resolution correctness.

We believe that the methodology presented in this paper is general enough to be applicable to the evaluation of any MT system, not just Arabic MT systems. Indeed, the methodology is statistical, in that it starts by collecting statistics about the occurrence ratios of word senses and grammatical structures in any language of interest. The rest of the methodology works by testing various corpus elements (word senses, grammatical structures, etc.) against the collected statistics about the language of interest. The evaluation of pronoun and case-ending resolution correctness has taken into account characteristics of Arabic, but we have explained how the approach itself can be fine-tuned for other languages.
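As a complement to the coverage sketch shown at the end of Section 4.2, the class division mentioned above can be made concrete as follows. The cutoff values are illustrative placeholders only; the methodology merely requires that the classes correspond to ranges of the occurrence ratio, and the actual thresholds used in this study are not repeated here.

```python
def classify_senses(weights, frequent_cutoff=0.001, rare_cutoff=0.0001):
    """Partition word senses into frequent / normal / rare classes.

    `weights` maps each word sense to its occurrence ratio in the domain.
    The two cutoffs are hypothetical values chosen for illustration.
    """
    classes = {"frequent": [], "normal": [], "rare": []}
    for sense, weight in weights.items():
        if weight >= frequent_cutoff:
            classes["frequent"].append(sense)
        elif weight >= rare_cutoff:
            classes["normal"].append(sense)
        else:
            classes["rare"].append(sense)
    return classes
```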


In terms of implementation, future research should concentrate on integrating a morphological analyzer with the tool that produces statistics about word sense occurrence ratios. So far, a human operator is used to enter the root and the intended word sense for a given word during the process of building the database of word sense occurrence ratios. Another area of research is testing the same lexicons in different domains: we have seen that the MT system lexicons evaluated perform well in the Internet and Arabization domain, and it would be interesting to see how the same lexicons perform in other domains.

The grammatical coverage evaluation can be improved by implementing Method II of Section 3.2. In particular, a tool which parses any input sentence and returns the corresponding grammatical structure should be developed. This tool can then be used to collect statistics about the occurrence ratios of the various grammatical structures of the language and then to compute the grammatical coverage as explained in Method II. Semantic correctness evaluation can be improved by assessing sources of semantic errors in translation other than just errors due to incorrect word sense selection, which is what has been covered in this paper. The same holds for pragmatic issues that need to be assessed in MT evaluation beyond pronoun (and case-ending) resolution correctness, which has already been taken care of here. For instance, one could think about analyzing the contextual disambiguation quality of an MT system as well as other pragmatic phenomena.

Another future direction for this research is the evaluation of style, a considerably more complex task. We believe that two important factors could be taken as the starting point of an MT system style evaluation methodology: grammar style and word sense style evaluations. The former would assess the grammatical mapping between the source and target languages. Each grammatical structure of the source language needs to be mapped to the most appropriate grammatical structure in the target language depending on the setting (formal, poetic, colloquial, etc.). This evaluation requires parsing both the source and target texts, and then comparing the style of the grammatical structures in the target text with the style of those in the source text; this could be done using style-mapping rules between the source and target languages. Word sense style evaluation would attempt to assess the appropriateness of the word sense chosen to translate a specific word in a sentence among the other correct word sense translations. There may indeed be different word senses that are correct in a given context and setting; however, some of these may be stylistically more appropriate than others.

Acknowledgments

We would like to thank MSc student A. Al-Sikhan for having implemented the methodology introduced in this paper. We would also like to acknowledge the partial support of the Research Center of the College of Computer and Information Sciences at King Saud University (Grant RC1/418-419).


Appendix A – Grammar test cases

The scoring of the output Arabic sentences was done as follows:
– A score of 1 is given if the sentence's grammatical structure is correct, i.e. if it is one of the expected/correct grammatical structures of the translated sentence.
– A score of 0.5 is given if something is missing or incorrect, such as incorrect diacritization (تشكيل), case endings (تنوين), or pronouns (ضمائر), but the grammatical structure is roughly correct.
– A score of 0 is given if the sentence's grammatical structure is completely incorrect, i.e. unacceptable.
For each example, four outputs together with their scores are shown, always in the order ATA, Arabtrans, Ajeeb, Al-Nakel.

1. I was watching the movie. (0.5)

.ϢϠϔϟ΍ ΪϫΎη΃ ΖϨϛ Ύϧ΃

(0.5)

ϢϠϴϔϟ΍ ΐϗ΍έ΃

(1)

ϢϠϴϔϟ΍ ΪϫΎη΃ ΖϨϛ

(1)

ϢϠϴϔϟ΍ ΐϗ΍έ΃ ΖϨϛ

2. I should have been watching the movie. (0)

ϢϠϔϟ΍ ΪϫΎη΃ ϥ΃ ΐΠϳ ϥΎϛ Ύϧ΃

(0)

ϢϠϴϔϟ΍ ΖΒϗ΍έ ϥϮϜϳ ϥ΃ ΐΠϳ

(0.5) (0)

ϢϠϴϔϟ΍ ΪϫΎη΃ ϥ΃ ϲϐΒϨϳ ϥΎϛ ϢϠϴϔϟ΍ ΐϗ΍έ΃ ϥ΃ ΐΠϳ

3. I am not going. (0.5) (0) (0.5) (0)

ΏΎϫΫ Ζδϟ ΏΎϫΫ ήϴϏ Ύϧ΃ ϩΫ΃ ϻ ΐϫΫ΃ Ϣϟ

4. You did not try it. (0.5)

ϪϟϭΎΤΗ Ϣϟ Ζϧ΃

(0)

ΎϬϟϭΎΤϳ ϻ Ζϧ΃

(1)

ϪϟϭΎΤΗ Ϣϟ

(0.5)

Ώ˷ήΠΗ Ϣϟ.


5. He could not have seen the car. (0.5) (0) (0.5) (0)

ΓέΎϴδϟ΍ ϯήϳ ϥ΃ ΎϨϨϜϤϣ ϦϜϳ Ϣϟ Ϯϫ ΓέΎϴδϟ΍ ϯήϳ ϥ΃ ϦϜϤϳ ϻ Ϯϫ ΓέΎ˷ϴδ ˷ ϟ΍ ϯήϳ ϥ΃ ϦϜϤϤϟ΍ Ϧϣ ϦϜϳ Ϣϟ ΓέΎ˷ϴδϟ΍ ϯ΃ήϳˬϯήϳ ϥ΃ έΪϘϳ ϻ ϑϮγ

6. Did you see the car? (1)

ˮΓέΎϴδϟ΍ Ζϳ΃έ Ϟϫ

(0)

ˮ ΓέΎϴδϟ΍ ϯήϳ Ζϧ΃

(1)

ˮ ΓέΎ˷ϴδ ˷ ϟ΍ Ζϳ΃έ Ϟϫ

(1)

ˮΓέΎ˷ϴδϟ΍ Ζϳ΃έ Ϟϫ

7. Can I try it? (1)

ˮϪϟϭΎΣ΃ ϥ΃ ϥΎϜϣϹΎΑ Ϟ ˷ϫ

(0)

ˮ ϪϟϭΎΤϳ ϥ΃ ϦϜϤϳ΃

(1)

ˮ ϪϟϭΎΣ΃ ϥ΃ ϦϜϤϳ Ϟϫ

(1)

ˮϪϟϭΎΣ΃ ϥ΃ έΪϗ΃ Ϟϫ

8. I will have seen the house. (0)

.ϝΰϨϤϟ΍ Ζϳ΃έ Ϊϗ ΖϨϛ Ύϧ΃

(0)

ΖϴΒϟ΍ ϯέ΄γ

(1)

ΖϴΒϟ΍ Ζϳ΃έ Ϊϗ ϥϮϛ΄γ

(0.5)

ΖϴΒϟ΍ ϯέ΄γ

9. Do I have a pen? (0.5)

ˮϢϠϗ ϩΪϨϋ Ύϧ΃ Ϟϫ

(0)

ˮ ϢϠϗ ϱΪϟ

(1)

ˮ ϢϠϗ ϱ ˷ Ϊϟ Ϟϫ

(0.5)

ˮΎϤϠϗ ϱΪϨϋ Ϟϫ

10. John has to see a doctor. (0.5) (0)

ΐϴΒσ ϯήϳ ϥ΃ ΐΠϳ ϥϮΟ ΐϴΒσ ϯήϴϟ ϥϮΟ

(0.5)

έϮΘϛΩ ϯήϳ ϥ΃ ϥϮΟ ϰϠϋ ΐΠϳ

(0.5)

ΐϴΒτϟ΍ ϰϟ· ϯ΃ήϳˬϯήϳ ϥ΃ ϥϮΟ ϰϠϋ ΐΠϳ.


11. The cat had to be found. (0)

ΪΟϮΗ ϥ΃ ϡΰϟ Δ˷τϘϟ΍

(0)

ΕΪΟϭ ϥ΃ ΎϬϴϠϋ ϥΎϛ ΔτϘϟ΍ ϥ· Ϊ˴ΟϮ˵ϳ ϥ΃ ς ˷ Ϙϟ΍ ϰϠϋ ΐΠϳ ϥΎϛ

(0.5)

ΪΠϳ ϥ΃ ς ˷ Ϙϟ΍ ΐΟϭ

(0) 12. John has to be winning the race. (0)

βϨΠϟ΍ ΢Αήϳ ϥ΃ ΐΠϳ ϥϮΟ

(0)

βϨΠϟ΍ ίϮϓ ϚϠϤϳ ϥϮΟ . ϕΎΒ˷δϟΎΑ ίϮϔϳ ϥ΃ ϥϮΟ ϰϠϋ ΐΠϳ

(0.5)

ϕΎΒδϟ΍ ΏϮμϳ ϰϟ· ϥϮΟ ΪϨϋ

(0) 13. The book would have had to have been found by John. (0.5)

ϥϮΟ ϞΒϗ Ϧϣ ΪΟϭ Ϊϗ ϥΎϛ Ϫϧ΄Α ϩΪϨϋ ϥϮϜϴγ νήΘϔϤϟ΍ Ϧϣ ΏΎΘϜϟ΍

(0)

ϥϮΠΑ ΪΟϭ Ϊϗ ϥϮϜϳ ϥ΃ ϪϴϠϋ ϥΎϜγ ΏΎΘϜϟ΍ ϥ·

(0)

ϥϮΟ ϞΒϗ Ϧϣ Ϊ˴ΟϮ˵ϳ Ϊϗ ϥ΃ ϰϠϋ ΐΠϳ ϥΎϛ ϥ΃ ϦϜϤϳ ϥΎϛ ΏΎΘϜϟ΍

(0)

ϥϮΟ Δτγ΍ϮΑ ΪΟϮϳ ϰϟ· ΪϨϋ ϰϟ· ϚϠϤϳ Ϊϗ ΏΎΘϜϟ΍

14. I will hide my hat in the drawer. (0.5)

ΝέΪϟ΍ ϲϓ ϲΘό˷Βϗ ϲϔΧ΄γ Ύϧ΃

(0)

ΝέΪϟ΍ ϲϓ ϲΘόΒϗ ϲϔΘΧ΄γ

(1)

Νέ˷Ϊϟ΍ ϲϓ ϲΘό˷Βϗ ϲϔΧ΄γ

(0)

ΝέΪϟ΍ ϲϓ ϲΘό˷Βϗ ϲϔΧ΄γ

15. I hid my hat in the drawer. (1)

ΝέΪϟ΍ ϲϓ ϲΘό˷Βϗ ΖϴϔΧ΃

(0)

ΝέΪϟ΍ ϲϓ ϲΘόΒϗ ΖϴϔΘΧ·

(1)

Νέ˷Ϊϟ΍ ϲϓ ϲΘό˷Βϗ ΖϴϔΧ΃

(0.5)

ΝέΪϟ΍ ϲϓ ϲΘό˷Βϗ ΖϴϔΘΧ΍

16. I was hiding my hat in the drawer. (0.5)

ΝέΪϟ΍ ϲϓ ϲΘό˷Βϗ ϲϔΧ΃ ΖϨϛ Ύϧ΃

(0)

ΝέΪϟ΍ ϲϓ ϲΘόΒϗ ϲϔΘΧ΃

(1)

Νέ˷Ϊϟ΍ ϲϓ ϲΘό˷Βϗ ϲϔΧ΃ ΖϨϛ

(1)

ΝέΪϟ΍ ϲϓ ϲΘό˷Βϗ ϲϔΧ΃ ΖϨϛ


Appendix B – Semantic correctness evaluation: Some test cases, their translation, and scores

The scoring of the output Arabic sentences was done as follows:
– A score of 1 was given if the correct sense is selected in the translation.
– A score of 0 was given if an incorrect word sense is selected in the translation.
Again, for each example, four outputs together with their scores are shown, always in the order ATA, Arabtrans, Ajeeb, Al-Nakel.

1. Do your best. (1)

.ϙέϭΪϘϤΑ Ύϣ Ϟόϔϳ

(0)

Ϟπϓ΃

(1)

ϙΪϨϋ Ύϣ Ϟπϓ΃ ϞϤόϳ

(1)

ϞπϓϷΎϛ ΖϠόϓ

2. I did not see him. (1)

.ϩ΍έ΃ Ϣϟ Ύϧ΃

(1)

ϩϯέ΃ ϻ Ύϧ΃

(1)

ϩέ΃ Ϣϟ

(1)

ϩέ΃ Ϣϟ

3. He walked for miles. (1)

ϝΎϴϣ΃ ΔϓΎδϤϟ ϰ˷θϤΗ

(1)

ϝΎϴϣϷ ϰθϣ Ϯϫ

(0)

ϝΎϴϣϸϟ ϰθϣ

(0)

ϝΎϴϣϸϟ ϰθϣ

4. They fought for freedom. (1)

ΔϳήΤϟ΍ ϞΟ΃ Ϧϣ ΍ϮΤϓΎϛ

(0)

ΔϳήΤϟ ΍ϮϠΗΎϗ Ϣϫ

(1)

Δ˷ϳή˷ Τϟ΍ ϞΟ΃ Ϧϣ ΍ϭΪϫΎΟ

(1)

Δ˷ϳή˷ ΤϠϟ ΍ϮϠΗΎϗ

5. Smoking is bad for a cough. (1)

ϝΎόδϟ Ίϴγ ϦϴΧΪΗ

(1)

ϝΎόδϟ Ίϴγ Ϯϫ ϦϴΧΪΗ


(1)

Δ˷ΤϜϟ Ί˷ϴγ ϦϴΧΪ˷Θϟ΍

(1)

ϝΎόδϠϟ Γ˯˷ϲγ ϥΎΧ˷ΪϟΎΑ ΔΠϟΎόϣ

6. The plan had to be postponed owing to force of circumstances. (1)

ϑϭήψϟ΍ ΓϮϗ ΐΒδΑ Ϟ˷ΟΆΗ ϥ΃ ϡΰϟ Γή΋Ύτϟ΍

(0)

ϑϭήψϟ΍ Ϧϣ ήΒΠΘϠϟ ϦϳΪΗ ΖϠΟ΃ ϥ΃ ΎϬϴϠϋ ϥΎϛ ΔτΨϟ΍

(1)

ϑϭή˷ψϟ΍ Γ˷Ϯϗ ΐΒδΑ Ϟ͉ΟΆ˴ Η˵ ϥ΃ Γή΋Ύ˷τϟ΍ ϰϠϋ ΐΠϳ ϥΎϛ

(1)

ϑϭήψϟ΍ Γ˷Ϯϗ ΐΒδΑ Ϟ˷Ο΄Η ϰϟ· Γή΋Ύτϟ΍ ΪϨϋ ϥΎϛ

7. The new law came into force today. (1)

ϡϮϴϟ΍ άϴϔϨΘϟ΍ ΰ˷ϴΣ ΪϳΪΠϟ΍ ϥϮϧΎϘϟ΍ ϞΧΩ

(1)

ϡϮϴϟ΍ ΍άϫ ϝϮόϔϤϟ΍ άϓΎϧ ΢Βλ΃ ΪϳΪΠϟ΍ ϥϮϧΎϘϟ΍

(1)

ϡϮϴϟ΍ ϝϮόϔϤϟ΍ ϱέΎγ ˯ΎΟ ΪϳΪΠϟ΍ ϥϮϧΎϘϟ΍

(0)

ϡϮϴϟ΍ Γ˷ϮϘϟ΍ ϲϓ ϝϮόϔϤϟ΍ άϓΎϧ ΪϳΪΠϟ΍ ϥϮϧΎϘϟ΍ ΢Βλ΃

8. A democratic form of government. (1)

Ϧϣ ϲσ΍ήϘϤϳΩ ϞϜη ΔϣϮϜΤϟ΍

(1)

ΔϣϮϜΤϟ ϲσ΍ήϘϤϳΩ ϞϜη

(0)

ϲ ˷ σ΍ήϘϤϳΩ ΔϣϮϜΣ ϒ ˷ λ

(1)

ϲ ˷ σ΍ήϘϤϳΩ ΔϣϮϜΣ ϞϜη

9. I will fill the application form. (1)

ϢϳΪϘΘϟ΍ ΓέΎϤΘγ· ϸϣ΄γ Ύϧ΃

(1)

ϖϴΒτΘϟ΍ ΓέΎϤΘγ· ϸϣ΄γ

(1)

ΐϠ˷τϟ΍ ϸϣ΄γ

(0)

.ϝΎϤόΘγϻ΍ ϞϜη ϸϣ΄γ

10. My group was formed. (1)

ΖϠ˷Ϝη ϲΘϋϮϤΠϣ

(1)

ΖϠϜη ΖϧΎϛ ϲΘϋϮϤΠϣ

(1)

Ζ ˸ Ϡ˴Ϝ͋ η ˵ ϲΘϋΎϤΟ

(1)

ϲϘϳήϓ ϥ˷Ϯϛ

11. Show me the format for calculating that ratio. (1)

ΔΒδϨϟ΍ ϚϠΗ ΏΎδΤϠϟ Δϐϴμϟ΍ ϲϨϓ˷Ϯθϳ

(1)

ΔΒδϨϟ΍ ϥ΄Α ΐδΣϷ Δϐϴμϟ΍ νήόϣ


(0)

ΔΒδ˷Ϩϟ΍ ϚϠΗ ΏΎδΤϟ ϞϜ˷θϟ΍ ϲϟ ήϬψϳ

(1)

ΔΒδϧ ϥ ˷ ΃ ΐδΤϟ Δϐϴμϟ΍ ϲϨοήϋ΍

12. I will format this hard disk. (1)

ΐϠμϟ΍ ιήϘϟ΍ ΍άϫ Ί˷ϴϫ΄γ Ύϧ΃

(0)

ϲγΎϘϟ΍ ιήϘϟ΍ ΍άϫ Δϐϴλ Ύϧ΃

(0)

ΐϠ˷μϟ΍ ιήϘϟ΍ ΍άϫ ϖ˷δϧ΄γ

(1)

ΐϠμϟ΍ ιήϗ ϩάϫ Ί˷ϴϫ΄γ

13. On the other hand it is a useful book. (1)

Ϊϴϔϣ ΏΎΘϛ Ϯϫ ϯήΧϷ΍ ΔϴΣΎϨϟ΍ Ϧϣ

(1)

Ϊϴϔϣ ΏΎΘϛ Ϫϧ· ϯήΧ΃ ΔϴΣΎϧ Ϧϣ

(1)

Ϊϴϔϣ ΏΎΘϛ Ϯϫ ϯήΧ΃ ΔϴΣΎϧ Ϧϣ

(1)

Ϊϴϔϣ ΏΎΘϛ Ϫ˷ϧ· ϯήΧ΃ ΔϴΣΎϧ Ϧϣ

14. Keep your tools at hand. (1)

ϝϭΎϨΘϤϟ΍ ϲϓ ϚΗ΍ϭΩ΃ ϲϘΒϳ

(0)

Ϊ˴ϴϟ΍ ϲϓ ϚΗ΍ϭΩ΃ φϔΣ

(1)

ΐϳήϗ ϚΗ΍ϭΩ΄Α φϔΘΤϳ

(0)

Ϊϴϟ΍ ϲϓ ϚΗ΍ϭΩ΃ ϰϠϋ ήΑΎΛ

Appendix C – Case ending resolution: Some test cases, their translation, and scores

As above, for each example, four outputs together with their scores are shown, always in the order ATA, Arabtrans, Ajeeb, Al-Nakel.

1. He asked him and her about something. (1)

.˯ϲθϟ΍ ϝϮΣ Ύϫϭ Ϫϟ΄γ

(0)

ΩϮΟϮϤϟ΍ Ύϣ ΎϫΎΌϴηϭ Ϯϫ ϝ΄γ

(0)

Ϛηϭ ϰϠϋ Ύϫ΄ϴη ϭ Ϫϟ΄γ

(1)

Ύϣ ˯ϲη ϦϋΎϫ ϭ Ϫϟ΄γ

2. Commercially, Machine Translation could augment US efforts to increase overseas sales, it could enable companies to be more competitive in: providing documentation, and speeding products.


(0)

˯΍έϭΎϣ ΕΎόϴΒϤϟ΍ ΓΩΎϳΰϟ ΔϴϜϳήϣ΃ ΩϮϬΟ ΞϣΪΗ ϥ΃ ϦϜϤϳ ΔϴϧϭήΘϜϟ· ΔϤΟήΗ ˬϱέΎΠΗ ϞϜθΑ Δϋήδϣ ΕΎΠΘϨϣϭ ˬϖ΋ΎΛϮϟ΍ ΪϳϭΰΗ :ϲϓ ΔδϓΎϨϣ Ϧ˷ϜϤϳ ϥ΃ ϦϜϤϳ Ϯϫ ˬέΎΤΒϟ΍

(1)

ΔϴΟέΎΨϟ΍ ΕΎόϴΒϤϟ΍ ΪϳΰΘϟ ΓΪΤΘϤϟ΍ ΕΎϳϻϮϟ΍ ΩϮϬΟ ΪϳΰΗ ϥ΃ ϦϜϤϳ ΔϨϛΎϤϟ΍ ΔϤΟήΗ , ΎϳέΎΠΗ ΕΎΟϮΘϨϣ ωήδϳϭ , ϥ΃ Δτϳήη ϖϴΛϮΗ :ϲϓ ϲδϓΎϨΗ Ϧϣ Ϊϳΰϣ ΕΎϛήθϟ΍ ϦϜϤϳ ϥ΃ ϦϜϤϳ ΎϬϧ· ,

(0)

,Δ˷ϴΟέΎΨϟ΍ ΕΎόϴΒϤϟ΍ ΪϳϭΰΘϟ Δ˷ϴϜϳήϣϷ΍ Ε΍ΩϮϬΠϤϟ΍ Ω˷ϭΰΗ ϥ΃ ϦϜϤϳ Δ˷ϴϟϵ΍ ΔϤΟή˷Θϟ΍ ,Ύ̒ϳέΎΠΗ . ω΍ήγϹ΍ ΕΎΠΘϨϣ ϭ ,ϖ˷ΛϮϳ Ω΍ΪϣϹ΍ : ϲϓ ΔδϓΎϨϣ Ϧ˷ϜϤϳ ωΎτΘγ΍ Ϯϫ

(0)

ΔϋΎΒϤϟ΍ ήϳΩΎϘϤϟ΍ ΓΩΎϳΰϟ΍ ϰϟ· ΓΪΤ˷ΘϤϟ΍ ΕΎϳϻϮϟ΍ ΩϮϬΟ ΪϳΰΗ ϥ΃ Δ˷ϴϟ΁ ΔϤΟήΗ έΪϘΗ Ϊϗ ˬΎϳέΎΠΗ ΕΎΟϮΘϨϣ ωήδΗ ϭ ˬΕ΍ΪϨΘδϣ Ω˷ϭΰϳ ϱάϟ΍ : ϞΧ΍Ω ϲ ˷ δϓΎϨΗ ϝ˷ϮΨϳ ϥ΃ έΪϘϳ Ϊϗ ˬΔ˷ϴΟέΎΨϟ΍

3. Before they perform the exam, the examiners receive both oral and written instructions. (1)

ΔΑϮΘϜϤϟ΍ϭ ΔϴϬϔθϟ΍ ήϣ΍ϭϷ΍ ΎΘϠϛ ϦϴϨΤΘϤϤϟ΍ ϢϠΘδϳ ˬϥΎΤΘϣϹ΍ ϥϭ˷ΩΆϳ ϥ΃ ϞΒϗ

(1)

ΔϳϮϔθϟ΍ ϭ ΏϮΘϜϤϟ΍ ΕΎϤϴϠόΘϟ΍ Ϧϣ Ϟϛ ϥϮϤϠΘδϳ ϥϮϨΤΘϤϤϟ΍ , ϥΎΤΘϣϹ΍ ϥϭΩΆϳ Ϣϫ ϞΒϗ

(1)

Δ˷ϳϮϔ˷θϟ΍ ϭ ΔΑϮΘϜϤϟ΍ ΕΎϬϴΟϮ˷Θϟ΍ ϼϛ ϥϮ˷ϘϠΘϳ ϥϮϨΤΘϤϤϟ΍ ,ϥΎΤΘϣϻ΍ ϥϭ˷ΩΆϳ ϥ΃ ϞΒϗ

(1)

.Ύόϣ Δ˷ϳϮϔη ϭ ΔΑϮΘϜϣ ΕΎϤϴϠόΗ ϥϮϨΤΘϤϤϟ΍ ϢϠΘδϳ ˬϥΎΤΘϣϻΎΑ ϥϮϣϮϘϳ ϞΒϗ

4. The boy wrote his lesson. (1)

ϪγέΩ ΐΘϛ ΪϟϮϟ΍

(1)

ϪγέΩ ΐΘϛ ΪϟϮϟ΍

(1)

ϪγέΩ ΐΘϛ ΪϟϮϟ΍

(1)

ϪγέΩ ΪϟϮϟ΍ ΐΘϛ

Appendix D – Pronoun and case ending resolution: Some test cases, their translation, and scores

In this case, for each example only two outputs together with their scores are shown, from ATA and Ajeeb.

1. Open the window; put the flag out and then close it. (0)

.ϪϘϠϐϳ ϚϟΫ ΪόΑϭ ΝέΎΧ ϢϠόϟ΍ ϊο ˭ΓάϓΎϨϟ΍ ΢Θϓ·

(0)

ϪϘϠϏ΃ Ϣ˷ Λ ϢϠόϟ΍ Ίϔσ΃ ,ΓάϓΎ˷Ϩϟ΍ ΢Θϓ΍

2. John walked in the rooms carrying his cheques. He felt the money would allow him to furnish them properly. (0)

΢ϴΤλ ϞϜθΑ ϢϬΜϴΛ΄ΘΑ Ϫϟ ΢Ϥδϳ ϝΎϤϟ΍ β ˷ Σ΃ .ϪϛϮϜλ ϞϤΤϳ ϑήϐϟ΍ ϲϓ ϰθϣ ϥϮΟ

(0)

ΐγΎϨϣ ϞϜθΑ ϢϬΜ˷ΛΆϳ ϥ΃ Ϫϟ ΢Ϥδϴγ ϝΎϤϟ΍ ϥ ˷ ΃ ήόη . ϪΗΎϜϴη ϞϤΤΗ Ε΍ήΠΤϟ΍ ϲϓ ϥϮΟ ϰθϣ


3. We went for a picnic by the river bank; it was great. (0)

Ϣϴψϋ ϥΎϛ Ϯϫ ˭ήϬϨϟ΍ Δϔ˷ πΑ ΔϫΰϨϠϟ ΎϨΒϫΫ

(0)

Ύ˱Ϥϴψϋ ϥΎϛ Ϯϫ ,ήϬ˷Ϩϟ΍ Δ˷ϔπΑ ΔϫΰϨϟ ΎϨΒϫΫ

4. John and George came; they were happy. (0)

˯΍Ϊόγ ΍ϮϧΎϛ Ϣϫ ˭΍˯ΎΟ ΝέϮΟϭ ϥϮΟ

(0)

˯΍Ϊόγ ΍ϮϧΎϛ Ϣϫ ˭΍˯ΎΟ ΝέϮΟϭ ϥϮΟ

5. The three girls saw the lions; they were frightened. (0)

΍Ϯϓ˷ϮΧ Ϣϫ ˭ΩϮγϷ΍ ΙϼΜϟ΍ ΕΎϨΒϟ΍ Ε΃έ

(0)

΍Ϯϓ˷ϮΧ Ϣϫ ˭ΩϮγϷ΍ ΙϼΜϟ΍ ΕΎϨΒϟ΍ Ε΃έ

Notes

1. “Local discrimination” means making a distinction between the weights of the various word senses. (See details in Guessoum and Zantout (2001a).)
2. This translation was actually obtained with one of the commercial Arabic MT systems we have evaluated.
3. Accessed 11 January 2005.

References

Akiba, Y., E. Sumita, H. Nakaiwa, S. Yamamoto, and H. Okuno: 2002, ‘Experimental Comparison of MT Evaluation Methods: RED versus BLEU’, in Proceedings of the MT Summit IX, New Orleans, pp. 1–8.
Al-Jundi, F. ϱΪϨΠϟ΍ ˯΍Ϊϓ: 1997, ΔϳΰϴϠΠϧϷ΍ ΔϐϠϟ΍ ϢϬϔϟ ΔϟϭΎΤϣ :ϲΑήόϟ΍ ϢΟήΘϤϟ΍ [‘Al-Mutarjim Al-Arabey: An Attempt to Understand English’], PC Magazine (Middle East) October, 40–44.
Andrewsky, A.: 1978, ‘Le problème de l’évaluation d’une traduction automatique’ [‘The problem of evaluating a machine translation’], CEC Memorandum, February 1978.
Anon.: 1996, ϲϓ΍Ϯϟ΍ :ϲϟϵ΍ ϢΟήΘϤϟ΍ [‘The Machine Translator: Al-Wafi’], Arabuter 8.71, 27–28.
Arab.Net Technology Ltd.: 1996, Arabtrans User’s Guide, Simi Valley, California: Arab Press House.
Arnold, D.J., R.L. Humphreys, and L. Sadler (eds): 1993, ‘Special Issue on Evaluation of MT Systems’, Machine Translation VOL:1–2.
ATA: 1997, Al-Mutarjim Al-Arabey User Manual, http://www.almisbar.com/salam_trans_a.html [accessed 11 February 2005].
Carroll, J. and T. Briscoe: 1998, ‘A Survey of Parser Evaluation Methods’, in Proceedings of the Workshop on the Evaluation of Parsing Systems, University of Sussex.
Chaumier, J., M.C. Mallen, and G. van Slype: 1977, ‘Evaluation du système de traduction automatique SYSTRAN; Evaluation de la qualité de la traduction’ [‘Evaluation of the SYSTRAN machine translation system; translation quality evaluation’], CEC Report No 4, Luxembourg.
Culy, C. and S.Z. Riehemann: 2003, ‘The Limits of N-Gram Translation Evaluation Metrics’, in Proceedings of the MT Summit IX, New Orleans, pp. 133–138.


Dyson, M.C. and J. Hannah: 1987, ‘Towards a Methodology for the Evaluation of Machine-Assisted Translation Systems’, Computers and Translation 2, 163–176.
Guessoum, A. and R. Zantout: 2000, ‘Arabic Machine Translation: A Strategic Choice for the Arab World’, KSU Computer and Information Sciences Journal 12, 117–144.
Guessoum, A. and R. Zantout: 2001a, ‘A Methodology for a Semi-Automatic Evaluation of the Language Coverage of Machine Translation System Lexicons’, Machine Translation 16, 127–149.
Guessoum, A. and R. Zantout: 2001b, ‘Semi-Automatic Evaluation of the Grammatical Coverage of Machine Translation Systems’, in Proceedings of the MT Summit VIII, Santiago de Compostela, Spain, pp. 133–138.
Halliday, T.C. and E.A. Briss (eds): 1977, The Evaluation and Systems Analysis of the Systran Machine Translation System, Report RADC-TR-76-399, January 1977, Rome Air Development Center, Griffiss Air Force Base, New York.
Hedberg, S.: 1994, ‘Machine Translation Comes of Age’, AI Expert 9.10, 37.
Hovy, E., M. King, and A. Popescu-Belis: 2002, ‘An Introduction to Machine Translation Evaluation’, in Proceedings of the Workshop at the LREC 2002 Conference, Las Palmas, Spain, pp. 1–7.
Hutchins, W.J. and H.L. Somers: 1992, An Introduction to Machine Translation, London: Academic Press.
Jihad, A. Ϳ΍ΪΒϋ ΩΎϬΟ: 1996, ˮΎ˲ϴΑήϋ Δϴϟϵ΍ ΔϤΟήΘϟ΍ ήμϋ ΃ΪΑ Ϟϫ [‘Has the Arabic Machine Translation Era Started?’], Byte Middle East November, 36–48.
Jurafsky, D. and J.H. Martin: 2000, Speech and Language Processing, Upper Saddle River NJ: Prentice Hall.
Klein, J., S. Lehmann, K. Netter, and T. Wegst: 1998, ‘DiET in the Context of MT Evaluation’, in B. Schröder, W. Lenders, W. Hess, and T. Portele (eds), Computer, Linguistik und Phonetik zwischen Sprache und Sprechen [Computers, Linguistics, and Phonetics between Language and Speech], Bern: Peter Lang, pp. 107–126.
King, M. and K. Falkedal: 1990, ‘Using Test Suites in Evaluation of Machine Translation Systems’, in COLING 1990, Proceedings of the 13th International Conference on Computational Linguistics, Helsinki, Vol. 2, pp. 211–216.
King, M., B. Maegaard, J. Schultz, L. des Tombe, A. Bech, A. Neville, A. Arppe, L. Balkan, C. Brace, H. Bunt, L. Carlson, S. Douglas, M. Höge, S. Krauwer, S. Manzi, C. Mazzi, A.J. Sielemann, and R. Steenbakkers: 1996, EAGLES – Evaluation of Natural Language Processing Systems, Final Report, EAG-EWG-PR.2, October 1996.
Lehrberger, J. and L. Bourbeau: 1988, Machine Translation: Linguistic Characteristics of MT Systems and General Methodology of Evaluation, Amsterdam: John Benjamins.
Mason, J. and A. Rinsche: 1995, Ovum Evaluates: Translation Technology Products, London: OVUM Ltd.
Melby, A.K.: 1988, ‘Lexical Transfer: Between a Source Rock and a Hard Target’, in Proceedings of the 12th International Conference on Computational Linguistics (COLING), Budapest, pp. 411–419.
Mellish, C. and R. Dale: 1998, ‘Evaluation in the Context of Natural Language Generation’, Journal of Computer Speech and Language 12, 349–373.
Nagao, M.: 1985, ‘Evaluation of the Quality of Machine-Translated Sentences and the Control of Language’, Journal of the Information Processing Society of Japan 26, 1197–1202.
Nyberg, E.H., T. Mitamura, and J.G. Carbonell: 1992, ‘The KANT System: Fast, Accurate, High-Quality Translation in Practical Domains’, in Proceedings of the fifteenth [sic] International Conference on Computational Linguistics, COLING ’92, Nantes, France, pp. 1069–1073.


Nyberg, E.H., T. Mitamura, and J.G. Carbonell: 1994, ‘Evaluation Metrics for Knowledge-Based Machine Translation’, in COLING 1994, 15th International Conference on Computational Linguistics, Kyoto, pp. 95–99.
Papineni, K., S. Roukos, T. Ward, and W-J. Zhu: 2002, ‘BLEU: A Method for Automatic Evaluation of Machine Translation’, in 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, pp. 311–318.
Qendelft, G. ΖϔϟΪϨϗ ΝέϮΟ: 1997, ΔϟΎγέ Ϧϣ ϡΎόϟ΍ ϰϨόϤϟ΍ ϢϬϔϟ Ϊϴϔϣ ΔϤΟήΘϠϟ ϲϓ΍Ϯϟ΍ ΞϣΎϧήΑ ΔϳΰϴϠϜϧ΍ [‘The Translation Program Al-Wafi Is Useful for Getting a General Understanding of a Letter Written in English’], ΓΎϴΤϟ΍ (Al-Hayat), 25 October 1997.
Sinaiko, H.W. and G.R. Klare: 1972, ‘Further Experiments in Language Translation: Readability of Computer Translations’, ITL 15, 1–29.
Sinaiko, H.W. and G.R. Klare: 1973, ‘Further Experiments in Language Translation: A Second Evaluation of the Readability of Computer Translations’, ITL 19, 29–52.
van Slype, G.: 1979a, ‘Systran: Evaluation of the 1978 Version of the Systran English–French Automatic System of the Commission of the European Communities’, The Incorporated Linguist 18, 86–89.
van Slype, G.: 1979b, Critical Study of Methods for Evaluating the Quality of Machine Translation (Final Report), prepared for the Commission of the European Communities, Brussels: Bureau Marcel van Dyke.
Vasconcellos, M. (ed.): 1988, Technology as Translation Strategy, Binghampton NY: State University of New York at Binghampton (SUNY).
White, J., T. O’Connell, and F. O’Mara: 1994, ‘The ARPA MT Evaluation Methodologies: Evolution, Lessons, and Future Approaches’, in Technology Partnerships for Crossing the Language Barrier: Proceedings of the First Conference of the Association for Machine Translation in the Americas, Columbia, Maryland, pp. 193–205.
Wilks, Y.: 1991, ‘Systran: It Obviously Works, but How Much Can It Be Improved?’, Report MCCS-91-215, Computer Research Laboratory, New Mexico State University, Las Cruces.